Python urllib2 - wait for page to finish loading/redirecting before scraping?
I'm learning to make web scrapers and want to scrape TripAdvisor for a personal project, grabbing the HTML using urllib2. However, I'm running into a problem where, using the code below, the HTML I get back is not correct: the page seems to take a second to redirect (you can verify this by visiting the URL), and instead I get the code from the page that briefly appears first.
Is there some behavior or parameter I can set to make sure the page has completely finished loading/redirecting before getting the website content?
    import urllib2
    from bs4 import BeautifulSoup

    bostonpage = urllib2.urlopen("http://www.tripadvisor.com/hacsearch?geo=34438#02,1342106684473,rad:s0,sponsors:abest_western,style:szff_6")
    soup = BeautifulSoup(bostonpage)
    print soup.prettify()
Edit: The answer below is thorough; however, in the end what solved my problem was this: https://stackoverflow.com/a/3210737/1157283
Interesting. The problem isn't a redirect: the page modifies its content using JavaScript, and urllib2 doesn't have a JS engine, it just gets the raw data. If you disable JavaScript in your browser, you will note that it loads basically the same content that urllib2 returns.
    import urllib2
    from BeautifulSoup import BeautifulSoup

    bostonpage = urllib2.urlopen("http://www.tripadvisor.com/hacsearch?geo=34438#02,1342106684473,rad:s0,sponsors:abest_western,style:szff_6")
    soup = BeautifulSoup(bostonpage)
    # A soup object has no read() method, so write its serialized markup instead.
    open('test.html', 'w').write(soup.prettify())
Opening test.html and loading the page with JS disabled in your browser (easiest in Firefox: Content -> uncheck Enable JavaScript) gives identical result sets.
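To make the point concrete without hitting the network, here is a minimal sketch (Python 3, standard library only) of what a client without a JS engine sees. The HTML string and the names `RAW_HTML`, `VisibleText`, and `visible_text` are made up for illustration; they are not taken from TripAdvisor's actual markup.

```python
# Python 3 standard library only; no network access needed.
from html.parser import HTMLParser

# Hypothetical page: the real results are filled in by a script at load time.
RAW_HTML = """
<html><body>
<div id="results">Loading...</div>
<script>
document.getElementById("results").textContent = "Hotel A, Hotel B";
</script>
</body></html>
"""


class VisibleText(HTMLParser):
    """Collect text outside <script> tags, as a non-JS client would see it."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        text = data.strip()
        # The script source is delivered as plain data; it is never executed.
        if text and not self.in_script:
            self.chunks.append(text)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks


print(visible_text(RAW_HTML))  # ['Loading...']
```

The parser only ever sees the placeholder text, because nothing runs the script that would replace it. That is the same situation urllib2 is in.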
So what can you do? Well, first you should check whether the site offers an API, since scraping tends to be frowned upon: http://www.tripadvisor.com/help/what_type_of_tripadvisor_content_is_available
Travel/hotel APIs? It looks like they might offer one, though with some restrictions.
But if you still need to scrape it, with JS, you can use Selenium (http://seleniumhq.org/). It's mainly used for testing, but it's easy to use and has fairly good docs.
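A rough sketch of the Selenium route might look like the following. It assumes the `selenium` package and a Firefox driver are installed. The helper name `fetch_rendered_html` and its `driver_factory`, `timeout`, and `poll` parameters are hypothetical, added here only for illustration. Note that `document.readyState == "complete"` is just a heuristic: content loaded later by AJAX may need a longer or more specific wait.

```python
import time


def fetch_rendered_html(url, driver_factory=None, timeout=10.0, poll=0.5):
    """Load url in a real browser and return the DOM after scripts have run.

    driver_factory is a hypothetical hook: any callable returning an object
    with get(), execute_script(), page_source and quit().
    """
    if driver_factory is None:
        # Imported lazily so the helper can be exercised without Selenium.
        from selenium import webdriver
        driver_factory = webdriver.Firefox

    driver = driver_factory()
    try:
        driver.get(url)
        deadline = time.time() + timeout
        # Poll until the browser reports the document as fully loaded,
        # or give up once the timeout has passed.
        while time.time() < deadline:
            if driver.execute_script("return document.readyState") == "complete":
                break
            time.sleep(poll)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned string can then be handed to BeautifulSoup in place of the urllib2 response, since it is the markup as the browser sees it after the page's scripts have run.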
I also found "Scraping websites with JavaScript enabled?" and http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/
Hope that helps.
As a side note:
    >>> import urllib2
    >>> from bs4 import BeautifulSoup
    >>>
    >>> bostonpage = urllib2.urlopen("http://www.tripadvisor.com/hacsearch?geo=34438#02,1342106684473,rad:s0,sponsors:abest_western,style:szff_6")
    >>> value = bostonpage.read()
    >>> soup = BeautifulSoup(value)
    >>> open('test.html', 'w').write(value)