Get number of episode using Regex

  • Question

  • User-513628628 posted

    Hello all,

    I have some HTML content that looks like this:

    <div>....</div>
    <a href="https://a.com/movies/monkeys-season-1?episode=1" class="btn btn-default btn-episode">1</a>
    <a href="https://a.com/movies/monkeys-season-1?episode=2" class="btn btn-default btn-episode">2</a>
    <a href="https://a.com/movies/monkeys-season-1?episode=3" class="btn btn-default btn-episode">3</a>
    <a href="https://a.com/movies/monkeys-season-1?episode=4" class="btn btn-default btn-episode">4</a>
    .....
    <a href="https://a.com/movies/monkeys-season-1?episode=20" class="btn btn-default btn-episode">20</a>
    <p>...</p>
    <a href="def.com">...</a>
    <div>...</div>

    My question:
    How can I use Regex to get only the hyperlinks that point to a numbered episode?
    I am using C#.
    Thank you so much !


    Friday, March 6, 2020 2:27 PM

Answers

  • User-821857111 posted

    If you want to parse HTML, you should use a library designed for that instead of Regex. Try AngleSharp: https://github.com/AngleSharp/AngleSharp
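
    A minimal sketch of what the AngleSharp approach could look like for the markup in the question. It assumes the AngleSharp NuGet package (0.10 or later) is installed and that every episode link carries the btn-episode class. The class name EpisodeLinksWithAngleSharp and the method GetEpisodeLinks are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser;

public static class EpisodeLinksWithAngleSharp
{
    // Returns (url, text) pairs for every <a> element that carries the btn-episode class.
    // Assumption: the class is only used on episode links, as in the question's markup.
    public static List<(string Url, string Text)> GetEpisodeLinks(string html)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        return document
            .QuerySelectorAll("a.btn-episode")
            .Select(a => (Url: a.GetAttribute("href") ?? "", Text: a.TextContent.Trim()))
            .ToList();
    }
}
```

    Selecting by CSS class keeps the query independent of the exact URL, so links such as def.com are skipped without writing any pattern.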

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Friday, March 6, 2020 4:47 PM
  • User-1330468790 posted

    Hi, pamyral_279,

    As @Mike said, you should use a library, which would shorten your code and parse the HTML more reliably.

    In case you have a specific purpose, here are two ways to get only the hyperlinks that point to a numbered episode.

    The first uses Regex, which means writing the pattern in regular expression syntax.

    The second uses HtmlAgilityPack, which supports plain XPath or XSLT for finding nodes in HTML. XPath is easy to read, as the code below shows.

    1. Using Regex

    Code:

    using System;
    using System.Text.RegularExpressions;

    public static void GetLinkByRegex()
    {
        string html = @"<div>....</div><a href=""https://a.com/movies/monkeys-season-1?episode=1"" class=""btn btn-default btn-episode"">1</a>
                        <a href=""https://a.com/movies/monkeys-season-1?episode=2"" class=""btn btn-default btn-episode"">2</a>
                        <a href=""https://a.com/movies/monkeys-season-1?episode=3"" class=""btn btn-default btn-episode"">3</a>
                        <a href=""https://a.com/movies/monkeys-season-1?episode=4"" class=""btn btn-default btn-episode"">4</a>
                        <a href=""https://a.com/movies/monkeys-season-1?episode=20"" class=""btn btn-default btn-episode"">20</a>
                        <p><input id=""text""</p>
                        <a href=""def.com"" > ...</a>
                        <div>...</div>";

        // Match only anchors whose href ends in ?episode=<digits>.
        // \d+ requires at least one digit; [^>]* skips the remaining attributes
        // without running past the end of the opening tag.
        string pattern = @"<a\s+href=""(?<url>https://a\.com/movies/monkeys-season-1\?episode=\d+)""[^>]*>(?<text>.*?)</a>";
        Regex regex = new Regex(pattern);

        MatchCollection matches = regex.Matches(html);
        foreach (Match match in matches)
        {
            Console.WriteLine("Url: " + match.Groups["url"].Value + " text: " + match.Groups["text"].Value);
        }
    }
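
    Since the thread title asks for the episode number itself, here is a small sketch that captures only the digits after episode= in each href and converts them to integers. The class EpisodeNumbers and the method FromHtml are hypothetical names, and the pattern assumes double-quoted href values that end with the episode parameter, as in the sample markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EpisodeNumbers
{
    // Captures the digits that follow "episode=" at the end of a double-quoted href value.
    private static readonly Regex EpisodeHref =
        new Regex(@"href\s*=\s*""[^""]*[?&]episode=(?<number>\d+)""", RegexOptions.IgnoreCase);

    // Returns the episode numbers in document order.
    // int.Parse will throw on digit runs too large for an int; acceptable for a sketch.
    public static List<int> FromHtml(string html)
    {
        var numbers = new List<int>();
        foreach (Match match in EpisodeHref.Matches(html))
        {
            numbers.Add(int.Parse(match.Groups["number"].Value));
        }
        return numbers;
    }
}
```

    Called on the sample HTML from the question, FromHtml would skip the def.com link because its href has no episode parameter.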

    2. Using HtmlAgilityPack

    Code:

    using System;
    using HtmlAgilityPack;

    public static void GetLinkByAgilityPack()
    {
        string html = @"<div>....</div><a href=""https://a.com/movies/monkeys-season-1?episode=1"" class=""btn btn-default btn-episode"">1</a>
                        <a href=""https://a.com/movies/monkeys-season-1?episode=2"" class=""btn btn-default btn-episode"">2</a>
                        <a href=""https://a.com/movies/monkeys-season-1?episode=3"" class=""btn btn-default btn-episode"">3</a>
                        <a href=""https://a.com/movies/monkeys-season-1?episode=4"" class=""btn btn-default btn-episode"">4</a>
                        <a href=""https://a.com/movies/monkeys-season-1?episode=20"" class=""btn btn-default btn-episode"">20</a>
                        <p><input id=""text""</p>
                        <a href=""def.com"" > ...</a>
                        <div>...</div>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select anchors whose href contains the episode URL prefix.
        var links = doc.DocumentNode.SelectNodes("//a[contains(@href,'https://a.com/movies/monkeys-season-1?episode')]");
        // Alternative: select by CSS class instead of by URL.
        //var links = doc.DocumentNode.SelectNodes("//a[contains(@class,'btn btn-default btn-episode')]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (links == null)
        {
            return;
        }

        foreach (var link in links)
        {
            Console.WriteLine("Url: " + link.GetAttributeValue("href", "") + "  Text: " + link.InnerText);
        }
    }
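
    Once the href values have been extracted with a parser, the episode number can be read from the query string rather than with another pattern. The following is a sketch under the assumption that System.Web.HttpUtility is available (it ships with .NET Core 2.0+ and .NET 5+; on .NET Framework it needs a reference to System.Web). EpisodeQuery and TryGetEpisode are hypothetical names.

```csharp
using System;
using System.Web;

public static class EpisodeQuery
{
    // Returns the episode number from the URL's query string,
    // or null if the URL is not absolute, has no episode parameter, or the value is not numeric.
    public static int? TryGetEpisode(string href)
    {
        if (!Uri.TryCreate(href, UriKind.Absolute, out Uri uri))
        {
            return null;
        }

        string value = HttpUtility.ParseQueryString(uri.Query)["episode"];
        return int.TryParse(value, out int episode) ? episode : (int?)null;
    }
}
```

    This could be applied to each link.GetAttributeValue("href", "") result inside the foreach loop; relative links such as def.com simply yield null.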

    Main method:

    static void Main(string[] args)
    {
        Console.WriteLine("Results from Regex:");
        GetLinkByRegex();
        Console.WriteLine("Results from HtmlAgilityPack:");
        GetLinkByAgilityPack();
        Console.ReadKey();
    }

    Hope this can help you.

    Best regards,

    Sean

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Monday, March 9, 2020 4:10 AM
