HTML Agility Pack unable to scrape RSS feed

  • Question

  • User-1256377279 posted

    Hi Guys,

    I am using HTML Agility Pack to scrape the HTML below, but I am unable to get the tr/td data under the TBODY element (highlighted in yellow) using C#.

    for reference : https://html-agility-pack.net/

    <table id="prefectures-table" class="loading">
              <thead>
                <tr>
                  <th data-i18n="prefecture" class="prefecture">Prefecture</th>
                  <th class="trend" data-i18n="daily-trend">Daily Trend</th>
                  <th class="confirmed">
                    <span data-i18n="confirmed">Confirmed</span>
                  </th>
                  <th class="delta">
                    <div class="increment">
                      <div class="today"><span data-i18n="increment-today">(Today)</span></div>
                      <div class="yesterday"><span data-i18n="increment-yesterday">(Yesterday)</span></div>
                    </div>
                  </th>
                  <th data-i18n="recovered" class="recovered">Recovered</th>
                  <th data-i18n="deaths" class="deceased">Deaths</th>
                </tr>
              </thead>
              <tbody class="total-rows">
                <tr>
                  <td class="prefecture" data-i18n="total">Total</td>
                  <td class="trend"></td>
                  <td class="confirmed"></td>
                  <td class="delta"></td>
                  <td class="recovered"></td>
                  <td class="deceased"></td>
                </tr>
              </tbody>
              <tbody class="prefecture-rows">
    <tr class ="prefecture-row row0"> 
    <td>tokyo</td>
    </tr>
                
              </tbody>
              <tbody class="pseudo-prefecture-rows"></tbody>
             
              <tbody class="cruise-header">
                <tr>
                  <td colspan="4" data-i18n="cruise-passengers-explanation">Cruise crew and passengers are not included in totals.</td>
                </tr>
              </tbody>
              <tbody class="cruise-rows"></tbody>
              <tbody class="loading-rows">
                <tr><td colspan="4"><div class="loader"><div class="lds-dual-ring"></div></div></td></tr>
              </tbody>
            </table>

    This is my current code:

    var query = htmlDoc.DocumentNode.SelectSingleNode("//section[5]//div[1]//*[@id='prefectures-table']//tbody[2]//*[contains(@class,'prefecture-row')]");

    HtmlNodeCollection childNodes = query.ChildNodes;

    foreach (var node in childNodes)
    {
        if (node.NodeType == HtmlNodeType.Element)
        {
            Japan jp = new Japan();
            jp.Name = node.InnerText;
        }
    }

    Thanks,

    Shabbir

    Wednesday, May 27, 2020 11:01 AM

All replies

  • User-474980206 posted

    In the html it’s prefecture-rows
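
    To illustrate how the two class names interact with `contains(@class, ...)`, here is a minimal sketch. It assumes a trimmed-down copy of the table from the question; the names `ClassMatchDemo`, `Html` and `CountMatches` are made up for this example and do not come from the original code. A plain substring test matches both `prefecture-rows` (the tbody) and `prefecture-row row0` (the tr), while a whitespace-delimited token test matches only the tr.

```csharp
using System;
using HtmlAgilityPack;

public static class ClassMatchDemo
{
    // Trimmed-down copy of the table from the question (illustrative only).
    public const string Html = @"<table id='prefectures-table'>
  <tbody class='total-rows'><tr><td>Total</td></tr></tbody>
  <tbody class='prefecture-rows'>
    <tr class='prefecture-row row0'><td>tokyo</td></tr>
  </tbody>
</table>";

    // Counts the nodes an XPath expression selects in the sample HTML.
    // SelectNodes returns null (not an empty collection) when nothing matches.
    public static int CountMatches(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Substring test: 'prefecture-rows' also contains 'prefecture-row'.
        Console.WriteLine(CountMatches("//*[contains(@class,'prefecture-row')]"));

        // Token test: pad the class list with spaces and look for the exact token.
        Console.WriteLine(CountMatches(
            "//*[contains(concat(' ', normalize-space(@class), ' '), ' prefecture-row ')]"));
    }
}
```

    If only the row elements are wanted, the token form (or an explicit `//tr[...]` step) avoids picking up the enclosing tbody by accident.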

    Wednesday, May 27, 2020 2:52 PM
  • User1535942433 posted

    Hi shabbir_215,

    According to your description and code, I created a test. As far as I can tell, if you want to get the second tbody's content, you could select that tbody by its class name instead of using tbody[2].

    Besides, you need to make sure the class name in the XPath matches the one in the HTML file.
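
    As a rough sketch of that idea, the helper below selects a tbody by class name and returns the text of its cells, returning an empty list when the class name does not match anything. It assumes the HTML is already available as a string; `PrefectureRows`, `SampleHtml` and `ReadCells` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PrefectureRows
{
    // Reduced sample of the table structure (illustrative only).
    public const string SampleHtml = @"<table id='prefectures-table' class='loading'>
  <tbody class='total-rows'><tr><td class='prefecture'>Total</td></tr></tbody>
  <tbody class='prefecture-rows'>
    <tr class='prefecture-row row0'><td>tokyo</td></tr>
  </tbody>
</table>";

    // Returns the trimmed text of every td under the tbody with the given class.
    public static List<string> ReadCells(string html, string tbodyClass)
    {
        var result = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing,
        // for example when the class name is misspelled.
        HtmlNode tbody = doc.DocumentNode.SelectSingleNode(
            "//table[@id='prefectures-table']//tbody[@class='" + tbodyClass + "']");
        if (tbody == null)
        {
            return result;
        }

        // SelectNodes also returns null (not an empty collection) on no match.
        HtmlNodeCollection cells = tbody.SelectNodes(".//td");
        if (cells == null)
        {
            return result;
        }

        foreach (HtmlNode td in cells)
        {
            result.Add(HtmlEntity.DeEntitize(td.InnerText).Trim());
        }
        return result;
    }

    public static void Main()
    {
        foreach (string name in ReadCells(SampleHtml, "prefecture-rows"))
        {
            Console.WriteLine(name);
        }
    }
}
```

    Checking for null before touching `ChildNodes` or iterating turns a mismatched class name into an empty result rather than a `NullReferenceException`.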

    For more details, you could refer to the code below:

    HTML file:

    <html>  
    <head>  
    </head>  
    <body>  
        <table id="prefectures-table" class="loading">
              <thead>
                <tr>
                  <th data-i18n="prefecture" class="prefecture">Prefecture</th>
                  <th class="trend" data-i18n="daily-trend">Daily Trend</th>
                  <th class="confirmed">
                    <span data-i18n="confirmed">Confirmed</span>
                  </th>
                  <th class="delta">
                    <div class="increment">
                      <div class="today"><span data-i18n="increment-today">(Today)</span></div>
                      <div class="yesterday"><span data-i18n="increment-yesterday">(Yesterday)</span></div>
                    </div>
                  </th>
                  <th data-i18n="recovered" class="recovered">Recovered</th>
                  <th data-i18n="deaths" class="deceased">Deaths</th>
                </tr>
              </thead>
              <tbody class="total-rows">
                <tr>
                  <td class="prefecture" data-i18n="total">Total</td>
                  <td class="trend"></td>
                  <td class="confirmed"></td>
                  <td class="delta"></td>
                  <td class="recovered"></td>
                  <td class="deceased"></td>
                </tr>
              </tbody>
              <tbody class="prefecture-rows">
    <tr class ="prefecture-row row0"> 
    <td>tokyo</td>
    </tr>
                
              </tbody>
              <tbody class="pseudo-prefecture-rows"></tbody>
             
              <tbody class="cruise-header">
                <tr>
                  <td colspan="4" data-i18n="cruise-passengers-explanation">Cruise crew and passengers are not included in totals.</td>
                </tr>
              </tbody>
              <tbody class="cruise-rows"></tbody>
              <tbody class="loading-rows">
                <tr><td colspan="4"><div class="loader"><div class="lds-dual-ring"></div></div></td></tr>
              </tbody>
            </table> 
    </body>  
    </html> 

    Code:

    HtmlAgilityPack.HtmlDocument document2 = new HtmlAgilityPack.HtmlDocument();
    document2.Load(@"C:\Users\yijings\Desktop\sample.txt");

    // Select the tbody by its class name rather than by position (tbody[2]).
    var query = document2.DocumentNode.SelectSingleNode("//table[@id='prefectures-table']//tbody[contains(@class,'prefecture-rows')]");

    HtmlNodeCollection childNodes = query.ChildNodes;
    foreach (var node in childNodes)
    {
        // Skip whitespace text nodes; only element children (the tr rows) are used.
        if (node.NodeType == HtmlNodeType.Element)
        {
            Japan jp = new Japan();
            jp.Name = node.InnerText;
            label1.Text = jp.Name;
        }
    }

    Result:

    Best regards,

    Yijing Sun

    Thursday, May 28, 2020 2:27 AM