I need to extract values from a string or text file containing repeated HTML tags

  • Question

  • User-1123701243 posted

    Hi all,

    I have a text file containing HTML tags and a lot of structured data that I need to extract from it.

    The content looks something like this:

    <ol class="stream-items js-navigable-stream" id="stream-items-id">
              
          <li class="js-stream-item stream-item stream-item
    " data-item-id="1117697004497440768" id="stream-item-tweet-1117697004497440768" data-item-type="tweet" data-suggestion-json="{&quot;suggestion_details&quot;:{&quot;suggestion_type&quot;:&quot;RankedOrganicTweet&quot;,&quot;controller_data&quot;:&quot;DAABCgABAAEIRCECAAEAAA==&quot;},&quot;tweet_ids&quot;:&quot;1117697004497440768&quot;,&quot;scribe_component&quot;:&quot;tweet&quot;}">
        
    
    
      <div class="tweet js-stream-tweet js-actionable-tweet js-profile-popup-actionable dismissible-content
           original-tweet js-original-tweet
          
          
           dismissible-content
    " data-tweet-id="1117697004497440768" data-item-id="1117697004497440768" data-permalink-path="/bbcarabicalerts/status/1117697004497440768" data-conversation-id="1117697004497440768" data-tweet-nonce="1117697004497440768-06282890-73d0-4144-9fc3-a9091a9f1131" data-tweet-stat-initialized="true" data-screen-name="bbcarabicalerts" data-name="BBC Arabic - عاجل" data-user-id="49942706" data-you-follow="true" data-follows-you="false" data-you-block="false" data-reply-to-users-json="[{&quot;id_str&quot;:&quot;49942706&quot;,&quot;screen_name&quot;:&quot;bbcarabicalerts&quot;,&quot;name&quot;:&quot;BBC Arabic - \u0639\u0627\u062c\u0644&quot;,&quot;emojified_name&quot;:{&quot;text&quot;:&quot;BBC Arabic - \u0639\u0627\u062c\u0644&quot;,&quot;emojified_text_as_html&quot;:&quot;BBC Arabic - \u0639\u0627\u062c\u0644&quot;}}]" data-disclosure-type="" data-component-context="suggest_ranked_organic_tweet">
    
        <div class="context">
          
          
        </div>
    
    

    I need to extract the IDs from attributes such as:

    data-item-id="1117697004497440768"

    The result should be: 1117697004497440768

    There are many IDs and other values that I need to extract from the text file.

    Monday, April 15, 2019 8:56 AM

All replies

  • User-943250815 posted

    Try HtmlAgilityPack; you can get it from NuGet.
    Also check the Html Agility Pack website for documentation and samples: https://html-agility-pack.net/select-nodes
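
    A minimal sketch of this suggestion, assuming the HtmlAgilityPack NuGet package is installed and the markup is saved in a file. The file name tweets.html, the class name AttributeExtractor and the helper ExtractAttributeValues are made up for illustration. It collects the distinct values of one attribute from every element that carries it:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

public static class AttributeExtractor
{
    // Hypothetical helper: returns the distinct, non-empty values of
    // attributeName found on any element in the given HTML text.
    public static List<string> ExtractAttributeValues(string html, string attributeName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//*[@" + attributeName + "]");
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes
            .Select(n => n.GetAttributeValue(attributeName, ""))
            .Where(v => v.Length > 0)
            .Distinct()
            .ToList();
    }

    public static void Main()
    {
        // "tweets.html" is an assumed file name; point it at your own file.
        const string path = "tweets.html";
        if (!File.Exists(path))
        {
            Console.WriteLine("File not found: " + path);
            return;
        }

        string html = File.ReadAllText(path);
        foreach (string id in ExtractAttributeValues(html, "data-item-id"))
        {
            Console.WriteLine(id);
        }
    }
}
```

    Distinct() is used because, in the sample markup, the same data-item-id appears on both the <li> and the nested <div>. Drop it if you need every occurrence.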

    Monday, April 15, 2019 12:26 PM
  • User839733648 posted

    Hi human2x,

    As jzero suggested, you could install HtmlAgilityPack and use it as follows.

    using System;       // EventArgs
    using System.Linq;  // LINQ extension methods on the node collection
    using HtmlAgilityPack;
    
            protected void Page_Load(object sender, EventArgs e)
            {
                var html =
            @"<ol class='stream-items js-navigable-stream' id='stream-items-id'>
                    <li class='js-stream-item stream-item stream-item'
                        data-item-id='1117697004497440768' id='stream-item-tweet-1117697004497440768' data-item-type='tweet' 
                        data-suggestion-json='{&quot;suggestion_details&quot;:{&quot;suggestion_type&quot;:&quot;RankedOrganicTweet&quot;,&quot;controller_data&quot;:&quot;DAABCgABAAEIRCECAAEAAA==&quot;},&quot;tweet_ids&quot;:&quot;1117697004497440768&quot;,&quot;scribe_component&quot;:&quot;tweet&quot;}'>
                        <div class='tweet js-stream-tweet js-actionable-tweet js-profile-popup-actionable dismissible-content original-tweet js-original-tweet dismissible-content'
                            data-tweet-id='1117697004497440768' data-item-id='1117697004497440768' data-permalink-path='/bbcarabicalerts/status/1117697004497440768'
                            data-conversation-id='1117697004497440768' data-tweet-nonce='1117697004497440768-06282890-73d0-4144-9fc3-a9091a9f1131'
                            data-tweet-stat-initialized='true' data-screen-name='bbcarabicalerts' data-name='BBC Arabic - عاجل' data-user-id='49942706'
                            data-you-follow='true' data-follows-you='false' data-you-block='false'
                            data-reply-to-users-json='[{&quot;id_str&quot;:&quot;49942706&quot;,&quot;screen_name&quot;:&quot;bbcarabicalerts&quot;,&quot;name&quot;:&quot;BBC Arabic - \u0639\u0627\u062c\u0644&quot;,&quot;emojified_name&quot;:{&quot;text&quot;:&quot;BBC Arabic - \u0639\u0627\u062c\u0644&quot;,&quot;emojified_text_as_html&quot;:&quot;BBC Arabic - \u0639\u0627\u062c\u0644&quot;}}]'
                            data-disclosure-type='' data-component-context='suggest_ranked_organic_tweet'>
                            <div class='context'>
                            </div>
                        </div>
                    </li>
                </ol>";
                var htmlDoc = new HtmlDocument();
                htmlDoc.LoadHtml(html);
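                // SelectNodes returns null when nothing matches, so guard before reading the attribute.
                string item_id = htmlDoc.DocumentNode.SelectNodes("//*[@data-item-id]")?.FirstOrDefault()?.GetAttributeValue("data-item-id", "") ?? "";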
                Response.Write(item_id);
            }

    Result: 1117697004497440768
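
    Since the question also asks for other values, the same approach can be extended to read several attributes from each tweet element. This is only a sketch: the TweetInfo class and the ReadTweets method are hypothetical names, and the attribute names are taken from the sample markup in the question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical container for the values read from one tweet <div>.
public class TweetInfo
{
    public string TweetId { get; set; }
    public string ScreenName { get; set; }
    public string UserId { get; set; }
    public string PermalinkPath { get; set; }
}

public static class TweetReader
{
    public static List<TweetInfo> ReadTweets(string html)
    {
        var result = new List<TweetInfo>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Only <div> elements that carry data-tweet-id are treated as tweets.
        var tweetNodes = doc.DocumentNode.SelectNodes("//div[@data-tweet-id]");
        if (tweetNodes == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode node in tweetNodes)
        {
            result.Add(new TweetInfo
            {
                // The second argument is the default used when the attribute is missing.
                TweetId = node.GetAttributeValue("data-tweet-id", ""),
                ScreenName = node.GetAttributeValue("data-screen-name", ""),
                UserId = node.GetAttributeValue("data-user-id", ""),
                PermalinkPath = node.GetAttributeValue("data-permalink-path", "")
            });
        }

        return result;
    }

    public static void Main()
    {
        // Shortened version of the sample markup from the question.
        string html =
            "<ol><li data-item-id='1117697004497440768'>" +
            "<div class='tweet' data-tweet-id='1117697004497440768' " +
            "data-screen-name='bbcarabicalerts' data-user-id='49942706' " +
            "data-permalink-path='/bbcarabicalerts/status/1117697004497440768'>" +
            "</div></li></ol>";

        foreach (TweetInfo t in ReadTweets(html))
        {
            Console.WriteLine(t.TweetId + " " + t.ScreenName + " " + t.UserId + " " + t.PermalinkPath);
        }
    }
}
```

    Selecting the tweet <div> first and then reading its attributes keeps the values of one tweet together, which is harder to guarantee when each attribute is collected in a separate pass.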

    Best Regards,

    Jenifer

    Tuesday, April 16, 2019 2:51 AM