python 2.7 - How to get a clean result when scraping data from a website using Scrapy
I am new to Python and am trying to scrape data from a yellow pages site. I am able to scrape, but the result is messy.
This is the result I got:
    2013-03-24 20:26:47+0800 [scrapy] INFO: Scrapy 0.14.4 started (bot: eyp)
    2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
    2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled item pipelines:
    2013-03-24 20:26:47+0800 [eyp] INFO: Spider opened
    2013-03-24 20:26:47+0800 [eyp] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2013-03-24 20:26:47+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2013-03-24 20:26:47+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
How can I clean up the result? I want the name, address, phone number, and links only.
By the way, this is the code I'm using:
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from eyp.items import EypItem

    class EypSpider(BaseSpider):
        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            titles = hxs.select('//ol[@class="result"]/li')
            items = []
            for title in titles:
                item = EypItem()
                item['title'] = title.select(".//p/text()").extract()
                item['link'] = title.select(".//a/@href").extract()
                items.append(item)
            return items
Your code is a bit messy, so let me try to help:
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item, Field

    # Item definition: one field per piece of data we want to keep.
    class EypItem(Item):
        name = Field()
        address = Field()
        phone = Field()

    class EypSpider(BaseSpider):
        name = "eyp.ph"
        allowed_domains = ["eyp.ph"]
        start_urls = ["http://www.eyp.ph/home-real-estate/search/real-estate/davao/cat/real-estate-brokers"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # Each listing sits in a <div class="details"> inside an <li>.
            sites = hxs.select("//li/div[@class='details']")
            items = []
            for site in sites:
                iteme = EypItem()
                # normalize-space() trims and collapses whitespace in each paragraph.
                iteme["name"] = site.select("normalize-space(p[1]/text())").extract()
                iteme["address"] = site.select("normalize-space(p[2]/text())").extract()
                iteme["phone"] = site.select("normalize-space(p[3]/text())").extract()
                items.append(iteme)
            return items
You are missing the definition of the class EypItem, so I have suggested one. Save the code above as test.py.
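The selectors in the spider wrap each paragraph in the XPath function normalize-space(), which strips leading and trailing whitespace and collapses internal runs of whitespace into a single space. As a rough illustration of that behaviour outside Scrapy, here is a minimal plain-Python sketch. The helper name `normalize_space` and the sample string are made up for this example, and Python's `str.split()` treats a slightly wider set of characters as whitespace than XPath does, so treat it as an approximation.

```python
def normalize_space(text):
    """Approximate XPath normalize-space() for a Python string."""
    # str.split() with no arguments splits on runs of whitespace and drops
    # leading/trailing whitespace; joining with one space collapses the runs.
    return " ".join(text.split())


# Hypothetical raw text node, as it might appear in the page's HTML.
raw = "\n    Mezzanine   Wee Eng Apartment,\n  Guerrero Street  "
print(normalize_space(raw))
```

This is why the scraped values come out without stray newlines or indentation, even if the page's HTML is formatted across several lines.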
Running it from the command line:
$ scrapy runspider test.py -o items.json -t json
will give a JSON output file named items.json. A sample of the output is:
    [{"phone": ["phone: +63(907)6390603"], "name": ["(carlos a. vargas)"], "address": ["mezzanine wee eng apartment, guerrero street, davao city, davao del sur"]},
     {"phone": ["phone: +63(921)9566577"], "name": ["(rogelio g. carbiero)"], "address": ["sto. nino heights, pantinople village, davao city, davao del sur"]},
     {"phone": ["phone: +63(917)3137855"], "name": ["(florizel c. chavez)"], "address": ["12 tulip street, el rio vista village p4a, davao city, davao del sur"]},
     ..........
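In the sample output every value is a one-element list (because extract() returns a list), the phone carries a "phone:" prefix, and the name is wrapped in parentheses. If you want plain strings, one option is to post-process the items after the crawl. The following is only a sketch based on the sample shown above: the function name `clean_item` is hypothetical, and it assumes every field is a list of strings.

```python
import json


def clean_item(item):
    """Flatten one-element lists and strip the decorations seen in the sample."""
    # extract() returns a list, so keep the first element (or "" if empty).
    flat = {key: (value[0] if value else "") for key, value in item.items()}

    # Names in the sample are wrapped in parentheses, e.g. "(carlos a. vargas)".
    name = flat.get("name", "").strip()
    if name.startswith("(") and name.endswith(")"):
        name = name[1:-1].strip()
    flat["name"] = name

    # Phone values in the sample start with a "phone:" label.
    phone = flat.get("phone", "")
    prefix = "phone:"
    if phone.lower().startswith(prefix):
        phone = phone[len(prefix):].strip()
    flat["phone"] = phone
    return flat


# One record copied from the sample output above.
sample = json.loads(
    '[{"phone": ["phone: +63(907)6390603"], "name": ["(carlos a. vargas)"], '
    '"address": ["mezzanine wee eng apartment, guerrero street, davao city, davao del sur"]}]'
)
cleaned = [clean_item(item) for item in sample]
print(cleaned[0]["name"])
print(cleaned[0]["phone"])
```

To run this against the real file, load it with `json.load(open("items.json"))` instead of the inline sample. The same clean-up could also live in the spider itself, but doing it afterwards keeps the spider unchanged.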