java - How to extract a link with Jsoup?
I am using Jsoup to crawl the web and collect the results. I want to perform a keyword search. For example, I crawl http://www.business-standard.com/ for the following keywords:
google hyderabad
and it should give me the link to the matching article.
I wrote the code below, but it did not give me the appropriate results.
    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class App {

        public static void main(String[] args) {
            Document doc;
            try {
                // Fetch the home page
                doc = Jsoup.connect("http://www.business-standard.com").userAgent("Mozilla").get();
                String title = doc.title();
                System.out.println("title : " + title);

                // Select every anchor whose text contains "google"
                Elements links = doc.select("a:contains(google)");
                for (Element link : links) {
                    System.out.println("\nlink : " + link.attr("href"));
                    System.out.println("text : " + link.text());
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
The results are as follows:

    title : india news, latest news headlines, bse live, nse live, stock markets live, financial news, business news & market analysis on indian economy - business standard news

    link : /photo-gallery/current-affairs/mumbai-central-turns-into-wi-fi-zone-courtesy-google-power-2574.htm
    text : mumbai central turns wi-fi zone, courtesy google power

    link : plus.google.com/+businessstandard/posts
    text : google+
I am using Jsoup 1.8.2.
Try this URL instead:
http://www.business-standard.com/search?q=<keyword>
Sample code

    // Needs: import java.net.URLEncoder;
    Document doc;
    try {
        String keyword = "google hyderabad";
        // Query the site's search page with the URL-encoded keywords
        doc = Jsoup //
                .connect("http://www.business-standard.com/search?q=" + URLEncoder.encode(keyword, "UTF-8")) //
                .userAgent("Mozilla") //
                .get();

        String title = doc.title();
        System.out.println("title : " + title);

        Elements links = doc.select("a:contains(google)");
        for (Element link : links) {
            // absUrl resolves the href against the page's base URI
            System.out.println("\nlink : " + link.absUrl("href"));
            System.out.println("text : " + link.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
Output
The link you're looking for is in second position.
    title : search

    link : http://www.business-standard.com/article/pti-stories/google-to-invest-more-in-india-set-up-new-campus-115121600841_1.html
    text : google invest more in india, set new campus in hyderabad

    link : http://www.business-standard.com/article/companies/google-to-get-7-2-acres-in-hyderabad-it-corridor-for-its-campus-115051201238_1.html
    text : google 7.2 acres in hyderabad corridor campus

    link : http://www.business-standard.com/article/technology/swine-flu-closes-google-hyderabad-office-for-2-days-109071500023_1.html
    text : swine flu closes google hyderabad office 2 days

    link : http://www.business-standard.com/article/pti-stories/facebook-posts-strong-4q-as-company-closes-gap-with-google-116012800081_1.html
    text : facebook posts strong 4q company closes gap google

    link : http://www.business-standard.com/article/pti-stories/r-day-bsf-camel-contingent-march-on-google-doodle-116012600104_1.html
    text : r-day: bsf camel contingent marches on google doodle

    link : http://www.business-standard.com/article/international/daimler-ceo-says-apple-google-making-progress-on-car-116012501298_1.html
    text : daimler ceo says apple, google making progress on car

    link : https://plus.google.com/+businessstandard/posts
    text : google+
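The selector a:contains(google) matches on a single keyword, so the output still lists articles that never mention Hyderabad. Here is a minimal sketch of narrowing the list to entries whose text contains every keyword. The class name KeywordFilter and its methods are hypothetical helpers, not part of Jsoup, and the matching is a plain case-insensitive substring test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of Jsoup): keeps only texts that mention all keywords.
final class KeywordFilter {

    // True when text contains every whitespace-separated keyword, ignoring case.
    static boolean containsAll(String text, String keywords) {
        String haystack = text.toLowerCase(Locale.ROOT);
        for (String keyword : keywords.trim().split("\\s+")) {
            if (!haystack.contains(keyword.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    // Returns the subset of linkTexts that contain all keywords, in the original order.
    static List<String> filter(List<String> linkTexts, String keywords) {
        List<String> matches = new ArrayList<>();
        for (String text : linkTexts) {
            if (containsAll(text, keywords)) {
                matches.add(text);
            }
        }
        return matches;
    }
}
```

Inside the loop over the selected links, a check such as KeywordFilter.containsAll(link.text(), keyword) before printing would skip the entries that only mention Google.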
Discussion
The sample code above fetches only the first results page. If you need to fetch more results, extract the link to the next page (#hpcontentbox div.next-colum > a) and crawl it with Jsoup.
You'll notice that this next-page link carries additional parameters:
    itemperpages : self-explanatory (default 19)
    page : search results page index (default 1 if not provided)
    company-code : ?? (can be empty)
You may try passing larger values (100 or more) for itemperpages in the URL. This may reduce crawling time.
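As a rough sketch, the search URL with these parameters could be assembled as shown here. The class SearchUrls and its build method are hypothetical names. The parameter names itemperpages and page are copied as they are spelled in this answer, and the exact spelling and behaviour on the live site are assumptions that have not been verified.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: assembles the search URL for one results page.
final class SearchUrls {

    static final String SEARCH_BASE = "http://www.business-standard.com/search";

    // keyword is URL-encoded; page and itemsPerPage are appended as query parameters.
    // The parameter names are taken from the answer text and are assumptions.
    static String build(String keyword, int page, int itemsPerPage) {
        try {
            return SEARCH_BASE
                    + "?q=" + URLEncoder.encode(keyword, "UTF-8")
                    + "&itemperpages=" + itemsPerPage
                    + "&page=" + page;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling SearchUrls.build("google hyderabad", 2, 100) yields http://www.business-standard.com/search?q=google+hyderabad&itemperpages=100&page=2, which can then be passed to Jsoup.connect(...).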
The absUrl method is used to get absolute URLs instead of relative URLs.
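To illustrate what resolving a relative link means, here is a small sketch that uses only the JDK's java.net.URI rather than Jsoup itself. It approximates the effect of absUrl("href") and is not Jsoup's actual implementation. The class name AbsoluteLinks is hypothetical.

```java
import java.net.URI;

// Hypothetical helper: resolves an href against the URL of the page it came from.
final class AbsoluteLinks {

    // Relative hrefs are resolved against pageUrl; absolute hrefs are returned unchanged.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }
}
```

With the page URL http://www.business-standard.com/search?q=google, the relative href /photo-gallery/current-affairs/mumbai-central-turns-into-wi-fi-zone-courtesy-google-power-2574.htm from the question's output resolves to a full http://www.business-standard.com/photo-gallery/... address, while an href that is already absolute is left as it is.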