java - How to extract a link with Jsoup?

March 15, 2015

I am using Jsoup to crawl the web and collect results. I want to perform a keyword search. For example, I crawl http://www.business-standard.com/ with the following keywords:

google hyderabad

and it should give me this link:

http://www.business-standard.com/article/companies/google-to-get-7-2-acres-in-hyderabad-it-corridor-for-its-campus-115051201238_1.html.

I wrote the code below, but it did not give me the appropriate results.

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class App {

        public static void main(String[] args) {
            Document doc;
            try {
                // Fetch the home page
                doc = Jsoup.connect("http://www.business-standard.com").userAgent("Mozilla").get();
                String title = doc.title();
                System.out.println("title : " + title);

                // Select every anchor whose text contains "google"
                Elements links = doc.select("a:contains(google)");
                for (Element link : links) {
                    System.out.println("\nlink : " + link.attr("href"));
                    System.out.println("text : " + link.text());
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

The results are as follows:

    title : india news, latest news headlines, bse live, nse live, stock markets live, financial news, business news & market analysis on indian economy - business standard news

    link : /photo-gallery/current-affairs/mumbai-central-turns-into-wi-fi-zone-courtesy-google-power-2574.htm
    text : mumbai central turns wi-fi zone, courtesy google power

    link : plus.google.com/+businessstandard/posts
    text : google+

Jsoup 1.8.2

Try this URL instead:

http://www.business-standard.com/search?q=<keyword>

Sample code

    // Needs java.net.URLEncoder in addition to the imports from the question.
    Document doc;
    try {
        String keyword = "google hyderabad";
        // Query the site's own search page with the URL-encoded keywords
        doc = Jsoup
                .connect("http://www.business-standard.com/search?q=" + URLEncoder.encode(keyword, "UTF-8"))
                .userAgent("Mozilla")
                .get();

        String title = doc.title();
        System.out.println("title : " + title);

        Elements links = doc.select("a:contains(google)");
        for (Element link : links) {
            // absUrl resolves relative hrefs against the page URL
            System.out.println("\nlink : " + link.absUrl("href"));
            System.out.println("text : " + link.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

Output

The link you're looking for is in second position.

    title : search

    link : http://www.business-standard.com/article/pti-stories/google-to-invest-more-in-india-set-up-new-campus-115121600841_1.html
    text : google invest more in india, set new campus in hyderabad

    link : http://www.business-standard.com/article/companies/google-to-get-7-2-acres-in-hyderabad-it-corridor-for-its-campus-115051201238_1.html
    text : google 7.2 acres in hyderabad corridor campus

    link : http://www.business-standard.com/article/technology/swine-flu-closes-google-hyderabad-office-for-2-days-109071500023_1.html
    text : swine flu closes google hyderabad office 2 days

    link : http://www.business-standard.com/article/pti-stories/facebook-posts-strong-4q-as-company-closes-gap-with-google-116012800081_1.html
    text : facebook posts strong 4q company closes gap google

    link : http://www.business-standard.com/article/pti-stories/r-day-bsf-camel-contingent-march-on-google-doodle-116012600104_1.html
    text : r-day: bsf camel contingent marches on google doodle

    link : http://www.business-standard.com/article/international/daimler-ceo-says-apple-google-making-progress-on-car-116012501298_1.html
    text : daimler ceo says apple, google making progress on car

    link : https://plus.google.com/+businessstandard/posts
    text : google+

Discussion

The sample code above fetches only the first results page. If you need to fetch more results, extract the link to the next page (#hpcontentbox div.next-colum > a) and crawl it with Jsoup.
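Following the next-page link could look roughly like the sketch below. The class name PagedSearch, the method names, and the maxPages safety limit are made up for illustration. The selector is copied as it appears in this answer and may need adjusting to the site's current markup.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class PagedSearch {

    // Selector for the "next page" link, copied from the answer (not verified against the live site).
    static final String NEXT_LINK = "#hpcontentbox div.next-colum > a";

    /** Returns the absolute URL of the next results page, or null when there is none. */
    static String nextPageUrl(Document page) {
        Element next = page.select(NEXT_LINK).first();
        if (next == null) {
            return null;
        }
        String url = next.absUrl("href");
        return url.isEmpty() ? null : url;
    }

    /** Follows "next" links starting at startUrl and visits at most maxPages pages. */
    static List<String> crawl(String startUrl, int maxPages) throws IOException {
        List<String> found = new ArrayList<>();
        String url = startUrl;
        for (int i = 0; i < maxPages && url != null; i++) {
            Document page = Jsoup.connect(url).userAgent("Mozilla").get();
            for (Element link : page.select("a:contains(google)")) {
                found.add(link.absUrl("href"));
            }
            url = nextPageUrl(page);
        }
        return found;
    }
}
```

Calling PagedSearch.crawl(searchUrl, 5) would then collect the matching links from up to five result pages. The page limit is there so that an unexpected chain of "next" links cannot keep the crawl running indefinitely.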

You'll notice that the next-page link has additional parameters compared with the URL I provided:

itemperpages : self-explanatory (default 19)
page : search results page index (default 1 if not provided)
company-code : ?? (can be empty)

You may try giving itemperpages a larger value in the URL (100 or more). This may reduce crawling time.
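As a rough sketch, the search URL with these parameters could be assembled as shown here. The class and method names are hypothetical, and the parameter names are copied as they appear in the list above, so their exact spelling and accepted values should be checked against a real next-page link.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SearchUrlBuilder {

    // Base search URL taken from the answer.
    static final String SEARCH_URL = "http://www.business-standard.com/search";

    /**
     * Builds a search URL for the given keywords.
     * "itemperpages" and "page" are assumed to be spelled as in the list above.
     */
    static String build(String keyword, int itemsPerPage, int page) {
        try {
            return SEARCH_URL
                    + "?q=" + URLEncoder.encode(keyword, "UTF-8")
                    + "&itemperpages=" + itemsPerPage
                    + "&page=" + page;
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prints http://www.business-standard.com/search?q=google+hyderabad&itemperpages=100&page=1
        System.out.println(build("google hyderabad", 100, 1));
    }
}
```

The resulting string can be passed straight to Jsoup.connect(...). Whether the site honours values as large as 100 is something to test by hand.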

The absUrl method is used in order to get absolute URLs instead of relative URLs.
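To see the difference without hitting the network, here is a small sketch that parses an HTML snippet with an explicit base URI. The snippet, the class name AbsUrlDemo, and the helper method are invented for the example. Only Jsoup.parse, attr, and absUrl are actual Jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsUrlDemo {

    /** Returns { raw href, absolute href } for the first anchor in the given HTML. */
    static String[] relativeAndAbsolute(String html, String baseUri) {
        // The base URI tells Jsoup what relative links should be resolved against.
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return new String[] { link.attr("href"), link.absUrl("href") };
    }

    public static void main(String[] args) {
        String html = "<a href=\"/photo-gallery/example.htm\">example</a>";
        String[] urls = relativeAndAbsolute(html, "http://www.business-standard.com/");
        System.out.println("attr   : " + urls[0]); // /photo-gallery/example.htm
        System.out.println("absUrl : " + urls[1]); // http://www.business-standard.com/photo-gallery/example.htm
    }
}
```

When the document is fetched with Jsoup.connect(url).get(), the base URI is set from the page URL, which is why absUrl works in the sample code without any extra setup.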
