Web scrape a page built by JavaScript

  • Question

  • User-1668014665 posted

    My source: https://www.wsj.com/market-data/stocks/marketsdiary

    The source of the above page is almost entirely JavaScript. I need to scrape the contents after the JavaScript has rendered.

    I am using ASP.NET 2.0 Web Forms.

    I can also use ASP.NET 3.5.

    Any ideas?

    Note: WebRequest only returns the page's source as a string, which is the state before any rendering.

    Wednesday, February 19, 2020 6:37 PM

All replies

  • User-1330468790 posted

    Hi, icm63,

    Consider using Selenium's ChromeDriver (OpenQA.Selenium.Chrome), which automates a real browser, to load the page and run its JavaScript, and HtmlAgilityPack to parse the rendered HTML.

    Remember to set the "headless" argument on ChromeOptions so that Chrome runs without a GUI.

    For more detail, refer to the C# code below (written for .NET Framework 4.0, but it should work on 3.5):

    .aspx Page:

    <body>
    
        <form id="form1" runat="server">
            <div>
                <asp:Button ID="Btn1" runat="server" Text="Click" OnClick="Btn1_Click" />
            </div>
            <div id="scrapeContent" runat="server"></div>
        </form>
    </body>
    
    

    Code-behind:

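    // Needed at the top of the code-behind file:
    using System;
    using HtmlAgilityPack;
    using OpenQA.Selenium.Chrome;

    protected void Page_Load(object sender, EventArgs e)
    {
    }

    // Loads the page in headless Chrome, then copies every <table> found into the page.
    private void getData()
    {
        ChromeOptions options = new ChromeOptions();
        options.AddArguments("headless"); // run Chrome without a visible window

        ChromeDriver driver = new ChromeDriver(options);
        try
        {
            driver.Navigate().GoToUrl("https://www.wsj.com/market-data/stocks/marketsdiary");

            // PageSource is read from the live browser, not from the raw HTTP response.
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(driver.PageSource);

            // SelectNodes returns null when nothing matches, so check before looping.
            var tables = doc.DocumentNode.SelectNodes("//table");
            if (tables != null)
            {
                foreach (HtmlNode x in tables)
                {
                    scrapeContent.InnerHtml += "<table>" + x.InnerHtml + "</table>";
                }
            }
        }
        finally
        {
            driver.Quit(); // always close Chrome and the chromedriver process
        }
    }

    protected void Btn1_Click(object sender, EventArgs e)
    {
        getData();
    }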
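
    One caveat with the code above: driver.PageSource is read as soon as navigation returns, and tables built by script may not be in the DOM yet. A minimal sketch of a polling wait that could be called before reading PageSource follows. The RenderWait class and WaitUntil method are hypothetical names, not part of Selenium, and the 10-second timeout in the usage comment is an arbitrary assumption.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of Selenium): polls a condition until it
// holds or the timeout elapses.
public static class RenderWait
{
    // Returns true as soon as condition() is true, or false once the timeout
    // has passed. The condition is always checked at least once.
    public static bool WaitUntil(Func<bool> condition, TimeSpan timeout, TimeSpan pollInterval)
    {
        DateTime deadline = DateTime.UtcNow + timeout;
        while (true)
        {
            if (condition())
            {
                return true;
            }
            if (DateTime.UtcNow >= deadline)
            {
                return false;
            }
            Thread.Sleep(pollInterval);
        }
    }
}

// Possible use inside getData(), after GoToUrl and before reading PageSource
// (needs "using OpenQA.Selenium;" for By):
//
//     bool found = RenderWait.WaitUntil(
//         () => driver.FindElements(By.TagName("table")).Count > 0,
//         TimeSpan.FromSeconds(10),
//         TimeSpan.FromMilliseconds(500));
```

    Selenium's separate support package also provides a WebDriverWait class for this purpose; the loop above only avoids the extra dependency.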


    Hope this can help you!

    Best regards,

    Sean

    Thursday, February 20, 2020 12:44 PM
  • User-1668014665 posted

    Thanks

    I have some questions about installing OpenQA.Selenium.Chrome.

    Which package do I install from NuGet in Visual Studio 2015?

    See this link for the options NuGet lists: https://imgur.com/a/7yXyFUl

    And how do I import the correct one?

    Or what do I download from here: https://www.selenium.dev/downloads/

    Please give details.

    Note: I am using .NET 4.6, VB.NET, Web Forms, 64-bit Windows.

    Thanks

    Thursday, February 20, 2020 6:39 PM
  • User-1330468790 posted

    Hi, Chrisip0307,

    I recommend installing the package from NuGet, since that is the easiest way.

    Based on the picture you provided, the one to use is the first, Selenium.WebDriver (18.5M downloads, v3.141.0). It targets .NET Framework 3.5, 4.0 and 4.5 with no dependencies, and .NET Standard 2.0 with a dependency on Newtonsoft.Json 10.0.3 or later.


    Once the package is installed, you only need the statement "using OpenQA.Selenium.Chrome;" (in VB.NET, "Imports OpenQA.Selenium.Chrome") to import the namespace.
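
    As a sketch, the same install can be done from the Package Manager Console in Visual Studio (Tools > NuGet Package Manager). The HtmlAgilityPack line is only needed for the parsing code earlier in this thread. The last line is an assumption about your setup: the Selenium.WebDriver package does not include chromedriver.exe, and the separately published Selenium.WebDriver.ChromeDriver package is one way to obtain it. Its version has to match the installed Chrome, so no version is pinned here.

```powershell
# Selenium bindings, pinned to the version discussed above
Install-Package Selenium.WebDriver -Version 3.141.0

# HTML parser used in the earlier code sample
Install-Package HtmlAgilityPack

# Assumption: one way to get chromedriver.exe copied into the build output
Install-Package Selenium.WebDriver.ChromeDriver
```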

    Best regards,

    Sean

    Sunday, February 23, 2020 6:44 AM