python 2.7 - How to get a clean result when scraping data from a website using Scrapy


I am new to Python and am trying to scrape data from the Yellow Pages. I am able to scrape, but I get a messy result.

This is the result I got:

2013-03-24 20:26:47+0800 [scrapy] INFO: Scrapy 0.14.4 started (bot: eyp)
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled item pipelines:
2013-03-24 20:26:47+0800 [eyp] INFO: Spider opened
2013-03-24 20:26:47+0800 [eyp] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080

How can I clean up the result? I want the name, address, phone number, and links only.

By the way, this is the code I was using:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from eyp.items import EypItem

class EypSpider(BaseSpider):
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//ol[@class="result"]/li')
        items = []
        for title in titles:
            item = EypItem()
            item['title'] = title.select(".//p/text()").extract()
            item['link'] = title.select(".//a/@href").extract()
            items.append(item)
        return items

Your code is a bit messy, so I will try to help:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

# Item definition: one field per value we want to keep.
class EypItem(Item):
    name = Field()
    address = Field()
    phone = Field()

class EypSpider(BaseSpider):
    name = "eyp.ph"
    allowed_domains = ["eyp.ph"]
    start_urls = ["http://www.eyp.ph/home-real-estate/search/real-estate/davao/cat/real-estate-brokers"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Each listing sits in a <div class="details"> inside an <li>.
        sites = hxs.select("//li/div[@class='details']")
        items = []
        for site in sites:
            iteme = EypItem()
            # normalize-space() trims the text and collapses runs of whitespace.
            iteme["name"] = site.select("normalize-space(p[1]/text())").extract()
            iteme["address"] = site.select("normalize-space(p[2]/text())").extract()
            iteme["phone"] = site.select("normalize-space(p[3]/text())").extract()
            items.append(iteme)
        return items
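The normalize-space() XPath function used in the selectors above is what removes the stray whitespace from each field: it strips leading and trailing whitespace and collapses internal runs into a single space. As a rough illustration only, here is a pure-Python approximation. The helper name normalize_space and the padded input string are made up for this sketch and are not part of Scrapy; the helper treats only the four XPath 1.0 whitespace characters (space, tab, carriage return, line feed) as whitespace.

```python
import re

# XPath 1.0 whitespace characters: space, tab, carriage return, line feed.
_XPATH_WS = re.compile(r"[ \t\r\n]+")

def normalize_space(text):
    """Approximate XPath normalize-space(): collapse whitespace runs, then trim."""
    return _XPATH_WS.sub(" ", text).strip(" ")

# Hypothetical raw text node, padded the way HTML source often is.
raw = "\n      (carlos a.   vargas)\n   "
print(normalize_space(raw))
```

Without normalize-space(), a plain text() selection would return the node text with all of its surrounding line breaks and indentation intact.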

You were missing the definition of the class EypItem, so I have suggested one. With the above saved as test.py, running this from the command line:

$ scrapy runspider test.py -o items.json -t json 

will give you a JSON output file named items.json. A sample of the output is:

[{"phone": ["phone: +63(907)6390603"], "name": ["(carlos a. vargas)"], "address": ["mezzanine wee eng apartment, guerrero street, davao city, davao del sur"]},
 {"phone": ["phone: +63(921)9566577"], "name": ["(rogelio g. carbiero)"], "address": ["sto. nino heights, pantinople village, davao city, davao del sur"]},
 {"phone": ["phone: +63(917)3137855"], "name": ["(florizel c. chavez)"], "address": ["12 tulip street, el rio vista village p4a, davao city, davao del sur"]},
 ..........
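In the sample above, each value is still wrapped in a one-element list, the names are enclosed in parentheses, and the phone numbers carry a "phone: " label. If you want plain strings, one option is a small post-processing step. The following is a minimal sketch, not part of the original answer: the helpers first and clean_record are hypothetical names, and the cleanup rules assume every record looks like the sample.

```python
import json

def first(values):
    """Return the first element of an extracted list, or '' if it is empty."""
    return values[0] if values else ""

def clean_record(record):
    """Flatten one scraped item into plain strings (hypothetical helper)."""
    name = first(record.get("name", [])).strip()
    # Names in the sample are wrapped in parentheses; drop one enclosing pair.
    if name.startswith("(") and name.endswith(")"):
        name = name[1:-1].strip()
    phone = first(record.get("phone", [])).strip()
    # Phone values in the sample carry a "phone:" label; remove it if present.
    if phone.lower().startswith("phone:"):
        phone = phone[len("phone:"):].strip()
    address = first(record.get("address", [])).strip()
    return {"name": name, "address": address, "phone": phone}

# One record copied from the sample output.
sample = json.loads(
    '[{"phone": ["phone: +63(907)6390603"], "name": ["(carlos a. vargas)"], '
    '"address": ["mezzanine wee eng apartment, guerrero street, '
    'davao city, davao del sur"]}]'
)

cleaned = [clean_record(r) for r in sample]
print(json.dumps(cleaned, indent=2))
```

To run this against the real file, load it with json.load on an open handle to items.json instead of the inline sample. The same cleanup could also be done inside the spider before appending each item.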
