I need to extract values from string or text file containing repeated html tags

Question
-
User-1123701243 posted
Hi all,
I have a text file containing HTML tags and a lot of structured data that I need to extract from it.
It looks something like this:
<ol class="stream-items js-navigable-stream" id="stream-items-id"> <li class="js-stream-item stream-item stream-item " data-item-id="1117697004497440768" id="stream-item-tweet-1117697004497440768" data-item-type="tweet" data-suggestion-json="{"suggestion_details":{"suggestion_type":"RankedOrganicTweet","controller_data":"DAABCgABAAEIRCECAAEAAA=="},"tweet_ids":"1117697004497440768","scribe_component":"tweet"}"> <div class="tweet js-stream-tweet js-actionable-tweet js-profile-popup-actionable dismissible-content original-tweet js-original-tweet dismissible-content " data-tweet-id="1117697004497440768" data-item-id="1117697004497440768" data-permalink-path="/bbcarabicalerts/status/1117697004497440768" data-conversation-id="1117697004497440768" data-tweet-nonce="1117697004497440768-06282890-73d0-4144-9fc3-a9091a9f1131" data-tweet-stat-initialized="true" data-screen-name="bbcarabicalerts" data-name="BBC Arabic - عاجل" data-user-id="49942706" data-you-follow="true" data-follows-you="false" data-you-block="false" data-reply-to-users-json="[{"id_str":"49942706","screen_name":"bbcarabicalerts","name":"BBC Arabic - \u0639\u0627\u062c\u0644","emojified_name":{"text":"BBC Arabic - \u0639\u0627\u062c\u0644","emojified_text_as_html":"BBC Arabic - \u0639\u0627\u062c\u0644"}}]" data-disclosure-type="" data-component-context="suggest_ranked_organic_tweet"> <div class="context"> </div>
I need to extract the ids from attributes such as:
data-item-id="1117697004497440768"
The result should be: 1117697004497440768
There are many ids and other values that I need to extract from the text file.
Monday, April 15, 2019 8:56 AM
All replies
-
User-943250815 posted
Try Html Agility Pack; you can get it from NuGet.
Also check the Html Agility Pack website for documentation and samples: https://html-agility-pack.net/select-nodes
Monday, April 15, 2019 12:26 PM
-
User839733648 posted
Hi human2x,
As jzero suggested, you could install HtmlAgilityPack and use it as follows.
using System;
using System.Linq; // needed for First()
using HtmlAgilityPack;

protected void Page_Load(object sender, EventArgs e)
{
    // Sample markup; attribute values use single quotes so the verbatim string stays valid.
    var html = @"<ol class='stream-items js-navigable-stream' id='stream-items-id'>
        <li class='js-stream-item stream-item stream-item' data-item-id='1117697004497440768' id='stream-item-tweet-1117697004497440768' data-item-type='tweet' data-suggestion-json='{&quot;suggestion_details&quot;:{&quot;suggestion_type&quot;:&quot;RankedOrganicTweet&quot;,&quot;controller_data&quot;:&quot;DAABCgABAAEIRCECAAEAAA==&quot;},&quot;tweet_ids&quot;:&quot;1117697004497440768&quot;,&quot;scribe_component&quot;:&quot;tweet&quot;}'>
        <div class='tweet js-stream-tweet js-actionable-tweet js-profile-popup-actionable dismissible-content original-tweet js-original-tweet dismissible-content' data-tweet-id='1117697004497440768' data-item-id='1117697004497440768' data-permalink-path='/bbcarabicalerts/status/1117697004497440768' data-conversation-id='1117697004497440768' data-tweet-nonce='1117697004497440768-06282890-73d0-4144-9fc3-a9091a9f1131' data-tweet-stat-initialized='true' data-screen-name='bbcarabicalerts' data-name='BBC Arabic - عاجل' data-user-id='49942706' data-you-follow='true' data-follows-you='false' data-you-block='false' data-reply-to-users-json='[{&quot;id_str&quot;:&quot;49942706&quot;,&quot;screen_name&quot;:&quot;bbcarabicalerts&quot;,&quot;name&quot;:&quot;BBC Arabic - \u0639\u0627\u062c\u0644&quot;,&quot;emojified_name&quot;:{&quot;text&quot;:&quot;BBC Arabic - \u0639\u0627\u062c\u0644&quot;,&quot;emojified_text_as_html&quot;:&quot;BBC Arabic - \u0639\u0627\u062c\u0644&quot;}}]' data-disclosure-type='' data-component-context='suggest_ranked_organic_tweet'>
        <div class='context'> </div>
        </div>
        </li>
        </ol>";

    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    // Select every element that carries a data-item-id attribute and read the first one's value.
    // Note: SelectNodes returns null when nothing matches.
    string item_id = htmlDoc.DocumentNode
        .SelectNodes("//*[@data-item-id]")
        .First()
        .GetAttributeValue("data-item-id", "");

    Response.Write(item_id);
}
result: 1117697004497440768
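Since the file contains many ids, you may want every value rather than only the first match. Here is a rough sketch, assuming the HtmlAgilityPack NuGet package is installed. The helper name GetAttributeValues and the file name tweets.txt are placeholders of mine, not part of the library or of your project, and I have not run this against your actual file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class AttributeExtractor
{
    // Hypothetical helper: collects the distinct, non-empty values of one attribute.
    public static List<string> GetAttributeValues(string html, string attributeName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//*[@" + attributeName + "]");
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes
            .Select(node => node.GetAttributeValue(attributeName, ""))
            .Where(value => value.Length > 0)
            .Distinct() // the same id appears on both the <li> and the inner <div>
            .ToList();
    }

    public static void Main()
    {
        // "tweets.txt" is an assumed file name; replace it with the path to your file.
        string html = File.ReadAllText("tweets.txt");

        foreach (string id in GetAttributeValues(html, "data-item-id"))
        {
            Console.WriteLine(id);
        }
    }
}
```

The same helper should work for the other values you mentioned by passing a different attribute name, for example "data-screen-name" or "data-user-id".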
Best Regards,
Jenifer
Tuesday, April 16, 2019 2:51 AM