Scraping search result links from the first page of bing.com in C#

    Question

  •  I am trying to scrape the search result links from the first page, but it throws an error. I would be very thankful for your help. My code is:

     

    class Program
    {
        static void Main(string[] args)
        {
            ArrayList a = new ArrayList();
            byte[] aRequestHTML;
            WebClient objWebClient = new WebClient();
            string url = "http://www.bing.com/search?q=hello&go=&qs=n&sk=&form=QBLH";
            aRequestHTML = objWebClient.DownloadData(url);

            UTF8Encoding utf8 = new UTF8Encoding();
            string myString = utf8.GetString(aRequestHTML);
            Regex r = new Regex("href\\s*=\\s*(?:(?:\\\"(?[^\\\"]*)\\\")|(?[^\\s]* ))");
            MatchCollection mcl = r.Matches(myString);

            foreach (Match ml in mcl)
            {
                foreach (Group g in ml.Groups)
                {
                    string b = g.Value + "";
                    a.Add(b);
                }
            }

            for (int i = 0; i < a.Count; i++)
            {
                Console.WriteLine(a[0]);
                Console.ReadLine();
            }
        }
    }

    But it throws an exception at the Regex constructor. Please help me. The exception is: parsing "href\s*=\s*(?:(?:\"(?[^\"]*)\")|(?[^\s]* ))" - Unrecognized grouping construct. (ArgumentException was unhandled)

    Thanks,

    Raufee

    Sunday, August 14, 2011 12:17 AM

All replies

  • HTML cannot reliably be parsed by a regex and dozens of valid HTML constructs will break the naïve regex proposed.

    I won't be mentioning all the additional invalid ones in common use on the web in Don't Use Regex To Parse HTML today.

    Also in Don't Use Regex To Parse HTML, we'll be linking to the Html Agility Pack, a .NET library you can use to parse HTML properly and subsequently extract link URLs reliably in just a couple of lines of code:

     

     

    HtmlWeb hw = new HtmlWeb();
    HtmlDocument doc = hw.Load(/* url */);
    foreach (HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href]"))
    {
    }

     


    Hope this helps!

     


    Sunday, August 14, 2011 12:32 AM
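
    A note on the exception itself, separate from the advice above: ".NET" regex groups that begin with "(?" must be followed by a recognized construct, and "(?[" is not one, which is what "Unrecognized grouping construct" reports. The sketch below assumes the pattern was meant to use named groups of the form (?<url>...); the group name "url", the class HrefRegexDemo, and the sample HTML are illustrative assumptions, not something stated in the question. It only shows why the constructor throws and is not a recommendation to parse HTML this way.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HrefRegexDemo
{
    // Returns false when the Regex constructor rejects the pattern.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // Pattern syntax errors surface as ArgumentException (or a subclass).
            return false;
        }
    }

    // Assumed intent of the original pattern: capture the href value in a
    // named group called "url" (a hypothetical name). Handles double-quoted
    // and unquoted values only; single-quoted values are not handled.
    public static List<string> ExtractHrefs(string html)
    {
        var pattern = new Regex(@"href\s*=\s*(?:""(?<url>[^""]*)""|(?<url>[^\s>]+))");
        var urls = new List<string>();
        foreach (Match m in pattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    static void Main()
    {
        // The pattern from the question, unchanged.
        string broken = "href\\s*=\\s*(?:(?:\\\"(?[^\\\"]*)\\\")|(?[^\\s]* ))";
        Console.WriteLine("Question's pattern accepted: " + IsValidPattern(broken));

        // Made-up sample markup for illustration.
        string html = "<a href=\"http://example.com/a\">A</a> <a href=http://example.com/b>B</a>";
        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

    Reading m.Groups["url"] instead of looping over every group also avoids adding the whole match (group 0) to the list, which the nested foreach in the question would do.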
  • You must not do this.  

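    It violates section 2 of Bing's terms of use, a.k.a. the Microsoft Service Agreement. (I've excerpted the section below; the relevant bit is the clause that prohibits using any automated process or service to access or use the service.)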

     

    2. Using the service

    When using the service, you must comply with this contract, all applicable laws and the Microsoft Anti-Spam Policy (http://go.microsoft.com/fwlink/?LinkId=117951). As applicable, you must also obey the code of conduct (http://g.live.com/0ELHP_MEREN/243). You must not use the service to harm others or the service. For example, you must not use the service to harm, threaten, or harass another person, organization, or Microsoft. You must not: damage, disable, overburden, or impair the service (or any network connected to the service); resell or redistribute the service or any part of it; use any unauthorized means to modify, reroute, or gain access to the service or attempt to carry out these activities; or use any automated process or service (such as a bot, a spider, periodic caching of information stored by Microsoft, or metasearching) to access or use the service. You may be able to access third-party websites or services via the service; you acknowledge that we are not responsible for such websites or services or content that may be available there.

    Sunday, August 14, 2011 12:39 AM
  • Hi Wyck,

    I am not doing this for any special or personal software. I am a university student, and scraping the result links from the first page of bing.com is a project assigned by my teacher.

     

    Thanks,

    Raufee

     

    Sunday, August 14, 2011 12:47 AM
  • Hi Zain_Ali,

    It is giving an error: HtmlDocument does not contain a definition for DocumentElement.

    Sunday, August 14, 2011 12:49 AM
  • Hi Zain,

    Thanks for your reply. I am now using the code below, but it is not showing the search result links (URLs). Please help me.

    class Program
    {
        static void Main(string[] args)
        {
            ArrayList a = new ArrayList();
            string url = "http://www.bing.com/search?q=hello&go=&qs=n&sk=&form=QBLH";
            HtmlWeb hw = new HtmlWeb();
            HtmlDocument doc = hw.Load(url);
            foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
            {
                a.Add(link.InnerText);
            }
            for (int i = 0; i < a.Count; i++)
            {
                Console.WriteLine(a[i]);
            }
            Console.ReadLine();
        }
    }

    Thanks,

    Raufee 

    Sunday, August 14, 2011 12:59 AM
  • Use this:

    a.Add(link.GetAttributeValue("href", ""));

    instead of:

    a.Add(link.InnerText);


    Sunday, August 14, 2011 1:09 AM
  • Hi zain,

    Thanks for your reply. It prints the href of every hyperlink and label as well as the links of the search results. I only want the search result links, which start with http://.

    Thanks,

    Raufee

    Sunday, August 14, 2011 1:19 AM
  • Use

    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//div[@class='sb_tlst']//a"))

    Instead of

    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))



    • Marked as answer by Rauf_Mughal Sunday, August 14, 2011 1:41 AM
    Sunday, August 14, 2011 1:38 AM
  • Thanks Zain_Ali,

    It's working nicely. You are a great programmer. Thanks again.

    Thanks,

    Raufee

    Sunday, August 14, 2011 1:43 AM
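
    For readers following the thread, the pieces from the replies (DocumentNode, GetAttributeValue, and the scoped XPath) can be put together roughly as follows. This is a minimal sketch, not the poster's final program: it assumes the Html Agility Pack NuGet package is referenced, it parses a made-up inline HTML sample instead of requesting bing.com, and the helper name ResultLinks.Extract is hypothetical. The class name "sb_tlst" comes from the marked answer and reflects Bing's markup at the time of the thread; it may no longer match the live page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, not part of the .NET base library

static class ResultLinks
{
    // Collects absolute http(s) hrefs from anchors inside <div class="containerClass">.
    public static List<string> Extract(HtmlDocument doc, string containerClass)
    {
        var links = new List<string>();

        // SelectNodes returns null rather than an empty collection when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(
            "//div[@class='" + containerClass + "']//a[@href]");
        if (nodes == null)
        {
            return links;
        }

        foreach (HtmlNode link in nodes)
        {
            string href = link.GetAttributeValue("href", "");
            // Keep only absolute links, as asked for in the thread.
            if (href.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
                href.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
            {
                links.Add(href);
            }
        }
        return links;
    }
}

class Program
{
    static void Main()
    {
        // Made-up sample page: one navigation link and one result link.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div class='nav'><a href='/images'>Images</a></div>" +
            "<div class='sb_tlst'><h3><a href='http://example.com/'>Example</a></h3></div>");

        foreach (string url in ResultLinks.Extract(doc, "sb_tlst"))
        {
            Console.WriteLine(url);
        }
    }
}
```

    The null check matters in practice: if the page layout changes and the container class no longer exists, the foreach loops shown earlier in the thread would throw a NullReferenceException instead of printing nothing.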