python urllib2 - wait for page to finish loading/redirecting before scraping?


I'm learning to make web scrapers and want to scrape TripAdvisor for a personal project, grabbing the HTML using urllib2. However, I'm running into a problem where, using the code below, the HTML I get back is not correct: the page seems to take a second to redirect (you can verify this by visiting the URL), and instead I get the code from the page that briefly appears first.

Is there a behavior or parameter I can set to make sure the page has finished loading/redirecting before getting the website content?

    import urllib2
    from bs4 import BeautifulSoup

    # Fetch the search page and parse whatever HTML the server returns
    bostonpage = urllib2.urlopen("http://www.tripadvisor.com/hacsearch?geo=34438#02,1342106684473,rad:s0,sponsors:abest_western,style:szff_6")
    soup = BeautifulSoup(bostonpage)
    print soup.prettify()

Edit: The answer below is thorough; however, in the end I solved my problem like this: https://stackoverflow.com/a/3210737/1157283

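Interesting. The problem isn't a redirect: the page modifies its content using JavaScript, and urllib2 doesn't have a JS engine, so it just gets the data the server sends. If you disable JavaScript in your browser, you will notice that it loads basically the same content as what urllib2 returns.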

    import urllib2
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import path

    bostonpage = urllib2.urlopen("http://www.tripadvisor.com/hacsearch?geo=34438#02,1342106684473,rad:s0,sponsors:abest_western,style:szff_6")
    soup = BeautifulSoup(bostonpage)
    # Save the parsed markup to compare with the browser (a soup object has no read() method)
    open('test.html', 'w').write(soup.prettify())

Opening test.html and loading the page with JS disabled in the browser (easiest in Firefox: Content -> uncheck Enable JavaScript) generate identical result sets.
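To illustrate why a plain HTTP fetch sees only the pre-JavaScript content, here is a minimal sketch using only the Python 3 standard library (not urllib2). The HTML string is a made-up stand-in for what a server might send, not TripAdvisor's actual markup, and the names RAW_HTML, TextCollector and visible_text are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends a placeholder, and the script would
# replace it with real results only when run by a browser's JS engine.
RAW_HTML = """
<html><body>
<div id="results">Loading...</div>
<script>
document.getElementById("results").textContent = "Hotel A, Hotel B";
</script>
</body></html>
"""


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> tags."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self.in_script:
            self.chunks.append(text)


def visible_text(html):
    """Return the text a parser sees; the script is never executed."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


print(visible_text(RAW_HTML))  # ['Loading...']
```

The parser only reads markup, so the placeholder text is all it finds; the hotel names exist only inside the script source, which nothing here runs.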

So what can we do? Well, first we should check whether the site offers an API, since scraping tends to be frowned upon: http://www.tripadvisor.com/help/what_type_of_tripadvisor_content_is_available

Travel/hotel APIs? It looks like they might offer some, though with restrictions.

But if you still need to scrape it, with JS, then you can use Selenium (http://seleniumhq.org/). It is mainly used for testing, but it is easy to use and has fairly good docs.
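As a rough sketch of the Selenium approach, the helper below loads a page in a browser driver and polls until an element that the page's JavaScript is expected to create shows up, then returns the rendered HTML. It is written for Python 3. The function name fetch_rendered_html, the ready_selector argument and the timeout values are assumptions for illustration. The only driver features it relies on are get(), find_elements(by, value) and page_source, which Selenium WebDriver objects provide.

```python
import time


def fetch_rendered_html(driver, url, ready_selector, timeout=10.0, poll=0.25):
    """Load url in a browser driver and return the HTML once JS has rendered.

    driver is expected to behave like a Selenium WebDriver.
    ready_selector is a CSS selector (chosen by the caller) for an element
    that only exists after the page's JavaScript has finished its work.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the locator strategy string behind By.CSS_SELECTOR
        if driver.find_elements("css selector", ready_selector):
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "no element matched %r within %.1f s" % (ready_selector, timeout)
            )
        time.sleep(poll)
```

With Selenium installed, a real driver could be created with webdriver.Firefox() (after "from selenium import webdriver"), passed to this function, and closed afterwards with driver.quit(). The returned HTML can then be handed to BeautifulSoup as before. Selenium also ships its own WebDriverWait helper, which covers the same waiting logic.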

I also found "Scraping websites with JavaScript enabled?" and http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/

Hope that helps.

As a side note:

    >>> import urllib2
    >>> from bs4 import BeautifulSoup
    >>>
    >>> bostonpage = urllib2.urlopen("http://www.tripadvisor.com/hacsearch?geo=34438#02,1342106684473,rad:s0,sponsors:abest_western,style:szff_6")
    >>> value = bostonpage.read()
    >>> soup = BeautifulSoup(value)
    >>> open('test.html', 'w').write(value)
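One likely reason for reading the response into a variable first is that the object returned by urlopen() is a stream that can only be consumed once. The sketch below (Python 3, with io.BytesIO as a stand-in for the HTTP response, so no network access is needed) shows the second read coming back empty.

```python
import io

# Stand-in for the file-like object that urlopen() returns
response = io.BytesIO(b"<html><body>hello</body></html>")

value = response.read()     # consumes the whole stream
leftover = response.read()  # nothing left to read

print(len(value), len(leftover))
```

Keeping the bytes in value means the same content can be both parsed and written to disk.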
