How do I use HtmlAgilityPack to retrieve only links from a website that start with http and https?

  • Question

  • I have this code:

    private List<string> getLinks(HtmlAgilityPack.HtmlDocument document)
    {
        List<string> mainLinks = new List<string>();
        // Select every <a> element that has an href attribute.
        var linkNodes = document.DocumentNode.SelectNodes("//a[@href]");
        // SelectNodes returns null when nothing matches.
        if (linkNodes != null)
        {
            foreach (HtmlNode link in linkNodes)
            {
                var href = link.Attributes["href"].Value;
                mainLinks.Add(href);
            }
        }
        return mainLinks;
    }

    Then I'm adding the links I get to a List<string> like this:

    private List<string> test(string url, int levels, DoWorkEventArgs eve)
    {
        HtmlAgilityPack.HtmlDocument doc;
        HtmlWeb hw = new HtmlWeb();
        List<string> webSites;

        try
        {
            doc = hw.Load(url);
            webSites = getLinks(doc);

    The problem is that webSites sometimes contains links like "/", "/videos" or "//gifs".

    From what I understand, those are subfolders. For example, in the link www.google.com/videos, /videos is the subfolder of www.google.com.

    What I want is for webSites to always contain only website links like:

    www.google.com

    http://www.google.com

    or https://www.google.com

    Only these kinds of links, and not subfolders/links like "/" or "/videos".

    So how can I filter or check for this in the getLinks function?


    danieli

    Thursday, September 13, 2012 3:03 PM

Answers

  • I used an if statement with Contains.

    Thanks.


    danieli

    • Marked as answer by chocolade Friday, September 14, 2012 12:23 AM
    Friday, September 14, 2012 12:23 AM
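
The answer above does not show its code, so the following is only a sketch of what an if-based filter might look like. The class `LinkFilter` and the methods `IsSiteLink` and `KeepSiteLinks` are hypothetical names, not part of the thread or of Html Agility Pack. It assumes that "website links" means hrefs beginning with `http://`, `https://` or `www.`, and it uses `StartsWith` rather than `Contains`, because `Contains("http")` would also accept a relative link such as `/redirect?to=http://example.com`.

```csharp
using System;
using System.Collections.Generic;

public static class LinkFilter
{
    // Hypothetical helper: true when the href begins with http://, https:// or www.
    public static bool IsSiteLink(string href)
    {
        if (string.IsNullOrWhiteSpace(href))
        {
            return false;
        }

        href = href.Trim();
        return href.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
            || href.StartsWith("https://", StringComparison.OrdinalIgnoreCase)
            || href.StartsWith("www.", StringComparison.OrdinalIgnoreCase);
    }

    // Keeps only the hrefs accepted by IsSiteLink, preserving their order.
    public static List<string> KeepSiteLinks(IEnumerable<string> hrefs)
    {
        var result = new List<string>();
        foreach (var href in hrefs)
        {
            if (IsSiteLink(href))
            {
                result.Add(href.Trim());
            }
        }
        return result;
    }
}
```

In the getLinks function from the question, the same check could wrap the existing Add call: `if (LinkFilter.IsSiteLink(href)) mainLinks.Add(href);`.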

All replies

  • I need to get only links that start with http://, https:// or www.

    I don't want to get all the links and then filter them; I want to get from the website only the links that start like that.

    So maybe something is wrong with the href.

    Or maybe I must get all the links and then filter them manually in the getLinks function after the foreach loop ends. I'm not sure how to do it.


    danieli

    Thursday, September 13, 2012 3:06 PM
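
One way to avoid filtering after the loop might be to put the condition into the XPath query itself, using the XPath 1.0 `starts-with` function. This is a sketch under assumptions: the class `XPathLinkFilter`, the constant `SiteLinkXPath` and the method `GetSiteLinks` are hypothetical names, and the three prefixes are taken from the post above. The page is still downloaded and parsed in full; only the node selection becomes narrower. `starts-with` is case-sensitive, so an href written as `HTTP://...` would not match.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathLinkFilter
{
    // Matches <a> elements whose href begins with one of the three prefixes.
    private const string SiteLinkXPath =
        "//a[starts-with(@href, 'http://') or starts-with(@href, 'https://') or starts-with(@href, 'www.')]";

    public static List<string> GetSiteLinks(HtmlDocument document)
    {
        var links = new List<string>();
        var nodes = document.DocumentNode.SelectNodes(SiteLinkXPath);

        // SelectNodes returns null when no node matches the expression.
        if (nodes == null)
        {
            return links;
        }

        foreach (HtmlNode node in nodes)
        {
            links.Add(node.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```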
  • Hi chocolade,

      Welcome to MSDN Forum Support.

      I am glad to hear that you have solved your problem. Thank you for sharing your solution with us and for mentioning Html Agility Pack, an open source project hosted on CodePlex.

      Sincerely,

      Jason Wang


    Jason Wang [MSFT]

    Monday, September 17, 2012 2:44 AM
  • Using a regular expression may help with your problem.
    Thursday, September 20, 2012 9:53 AM
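
The reply above does not give a pattern, so the following is only one possible reading of the regular expression suggestion. The class `RegexLinkFilter` and the pattern itself are assumptions, chosen to accept hrefs that begin with `http://`, `https://` or `www.` regardless of letter case.

```csharp
using System.Text.RegularExpressions;

public static class RegexLinkFilter
{
    // ^ anchors the match at the start of the string;
    // https? matches "http" or "https"; the alternative matches a leading "www."
    private static readonly Regex SiteLinkPattern =
        new Regex(@"^(https?://|www\.)", RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Hypothetical helper: true when the href starts with one of the accepted prefixes.
    public static bool IsSiteLink(string href)
    {
        return href != null && SiteLinkPattern.IsMatch(href);
    }
}
```

Because the pattern is anchored with `^`, links such as "/", "/videos" and "//gifs" are rejected, while "www.google.com" and "https://www.google.com" are accepted.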