BeautifulSoup - Extract data from a website into dictionaries in Python
The following code, with its corresponding output, extracts the data for a particular job from indeed.com. Along with the data I get a lot of junk, and I want to separate out the title, location, job description, and other important features. How can I convert this into dictionaries?
from bs4 import BeautifulSoup
import urllib2

final_site = 'http://www.indeed.com/cmp/pullskill-techonoligies/jobs/data-scientist-229a6b09c5eb6b44?q=%22data+scientist%22'
html = urllib2.urlopen(final_site).read()
soup = BeautifulSoup(html)
deep = soup.find("td", "snip")
deep.get("p", "ul")
deep.get_text(strip=True)

Output:
u'title : data scientistlocation : seattle waduration : fulltime / permanentjob responsibilities:implement advanced , predictive analytics models usingjava,r, , pythonetc.develop deep expertise company\u2019s data warehouse, systems, product , other resources.extract, collate , analyze data variety of sources provide insights customerscollaborate research team incorporate qualitative insights projects appropriateknowledge, skills , experience:exceptional problem solving skillsexperience withjava,r, , pythonadvanced data mining , predictive modeling (especially machine learning techniques) skillsmust have statistics orientation (theory , applied)3+ years of business experience in advanced analytics rolestrong python , r programming skills required. sas, matlab plusstrong sql skills looked for.analytical , decisive strategic thinker, flexible problem solver, great team player;able communicate levelsimpeccable attention detail , strong ability convert complex data insights , action planthanksnick arthurlead recruiternick(at)pullskill(dot)com201-497-1010 ext: 106salary: $120,000.00 /yearrequired experience:java , python , r , phd level education: 4 years5 days ago-save jobwindow[\'result_229a6b09c5eb6b44\'] = {"showsource": false, "source": "indeed", "loggedin": false, "showmyjobslinks": true,"undoaction": "unsave","relativejobage": "5 days ago","jobkey": "229a6b09c5eb6b44", "myindeedavailable": true, "tellafriendenabled": false, "showmoreactionslink": false, "resultnumber": 0, "jobstatechangedtosaved": false, "searchstate": "", "basicpermalink": "http://www.indeed.com", "savejobfailed": false, "removejobfailed": false, "requestpending": false, "notesenabled": true, "currentpage" : "viewjob", "sponsored" : false, "reportjobbuttonenabled": false};\xbbapply nowplease review application instructions before applying pullskill technologies.(function(d, s, id){var js, iajs = d.getelementsbytagname(s)[0], iaqs = \'vjtk=1aa24enhqagvcdj7&hl=en_us&co=us\'; if 
(d.getelementbyid(id)){return;}js = d.createelement(s); js.id = id; js.async = true; js.src = \'https://apply.indeed.com/indeedapply/static/scripts/app/bootstrap.js\'; js.setattribute(\'data-indeed-apply-qs\', iaqs); iajs.parentnode.insertbefore(js, iajs);}(document, \'script\', \'indeed-apply-js\'));recommended jobsdata scientist, energy analyticsrenew financial-oakland, carenew financial-5 days agodata scientisteprize-seattle, waeprize-7 days agodata scientistdocusign-seattle, wadocusign-12 days agoeasily applyengineer - citizen or permanent residentvoxel innovations-raleigh, ncindeed-8 days agoeasily applydata scientistunity technologies-san francisco, caunity technologies-22 days agoeasily apply'
Find the job summary element, find the b elements inside it, and split each b element's text on the " : " separator:
for elm in soup.find("span", id="job_summary").p.find_all("b"):
    label, text = elm.get_text().split(" : ")
    print(label.strip(), text.strip())
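The loop above prints each label and value, while the question asks for a dictionary. A minimal sketch of that last step follows. It is not part of the original answer: `labels_to_dict` is a hypothetical helper name, and the sample strings are taken from the output shown in the question, standing in for the values `elm.get_text()` would return. It uses `str.partition(":")` rather than `split(" : ")`, on the assumption that some b elements (for example a section heading such as "job responsibilities:") may not contain the exact " : " separator, which would make the two-name unpacking fail.

```python
def labels_to_dict(texts):
    """Build a dict from strings shaped like 'label : value'."""
    result = {}
    for raw in texts:
        # partition always returns three parts, so it never raises on odd input
        label, sep, value = raw.partition(":")
        label, value = label.strip(), value.strip()
        # Skip strings with no colon, no label, or no value (e.g. headings)
        if not sep or not label or not value:
            continue
        result[label] = value
    return result


# Sample strings standing in for elm.get_text() results (taken from the output above)
sample = [
    "title : data scientist",
    "location : seattle wa",
    "duration : fulltime / permanent",
    "job responsibilities:",
]

job = labels_to_dict(sample)
print(job)
```

With a parsed page, the same helper could be fed the text of each b element, for example `labels_to_dict(elm.get_text() for elm in summary.p.find_all("b"))`, where `summary` is the element returned by `soup.find("span", id="job_summary")`.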