Web Data string filtering Help...

  • Question

  • User-1189274697 posted

    This is the problem I am having.

    using (WebClient client = new WebClient())
    {
    client.CachePolicy = new System.Net.Cache.RequestCachePolicy(System.Net.Cache.RequestCacheLevel.NoCacheNoStore);
    try
    {
    try
    {
    byte[] store = client.DownloadData(StateWebLocation);
    string Data = System.Text.Encoding.UTF8.GetString(store);

    -------------------------------------------------------------------------------------------------------------
    <div class="c-model-group js-model-group ">
    <div class="c-model-group__numbers">

    <span class="c-model js-model c-model--outline">
    1
    </span>

    <span class="c-model js-model c-model--outline">
    2
    </span>

    <span class="c-model js-model c-model--outline">
    3
    </span>
    </div>
    </div>
    --------------------------------------------------------------------------------------------------------------
    What I need to do is match every occurrence of the group shown between the lines above in the big Data string, even if the 1, 2 or 3 is different, and load them into one string.

    Then I could filter it with Regex.Replace(t, "[^.0-9]", "") to get the model number.

    I'm a little lost on how to grab the data the correct way. Any ideas?

    Thanks

    Thursday, December 6, 2018 9:27 PM

All replies

  • User766825346 posted

    Can you post more information?

    Thursday, December 6, 2018 9:50 PM
  • User-943250815 posted

    Try HtmlAgilityPack: https://html-agility-pack.net/

    Thursday, December 6, 2018 9:50 PM
  • User-1189274697 posted

    Have you used this?

    If so, can you tell me how to use it? I have never heard of it.

    Friday, December 7, 2018 12:49 AM
  • User-943250815 posted

    Here is a sample using your data that collects each value in a <span> and stores it in a list; at the end you have "1", "2", "3".
    You can get HtmlAgilityPack on NuGet.

    Imports System.Collections.Generic
    Imports System.Text
    Imports HtmlAgilityPack

    Public Sub TestGetValuesFromHTML()
    ' Build a small test document around the posted snippet
    Dim zHTM As New StringBuilder
    zHTM.Append("<html><body>")
    zHTM.Append("<div Class=""c-model-group js-model-group"">")
    zHTM.Append("<div Class=""c-model-group__numbers"">")
    zHTM.Append("<span Class=""c-model js-model c-model--outline"">1</span>")
    zHTM.Append("<span Class=""c-model js-model c-model--outline"">2</span>")
    zHTM.Append("<span Class=""c-model js-model c-model--outline"">3</span>")
    zHTM.Append("</div>")
    zHTM.Append("</div>")
    zHTM.Append("</body></html>")

    Dim zLstVal As New List(Of String)
    Dim zHTML As New HtmlDocument 'HtmlAgilityPack starts here
    zHTML.LoadHtml(zHTM.ToString)
    Dim zHTMLBody = zHTML.DocumentNode.SelectSingleNode("/html/body")
    ' Collect the text of every <span> below <body>
    For Each zNode In zHTMLBody.Descendants("span")
    zLstVal.Add(zNode.InnerText)
    Next
    ' Do whatever you need with list of values
    End Sub
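
    For C# callers, the same idea can be narrowed so that only the spans directly inside the "c-model-group__numbers" div are collected and every other <span> on the page is ignored. This is a minimal sketch, not a tested solution for the real page: the class name ModelNumberScraper and the method GetModelNumbers are made up for illustration, and it assumes the real page marks the group with the same class attribute as the posted snippet.

    ```csharp
    using System;
    using System.Collections.Generic;
    using HtmlAgilityPack; // NuGet package, same library as the VB sample

    public static class ModelNumberScraper
    {
        // Hypothetical helper: returns the trimmed text of each <span> that is a
        // direct child of a <div> whose class contains "c-model-group__numbers".
        public static List<string> GetModelNumbers(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var values = new List<string>();
            // SelectNodes returns null (not an empty collection) when nothing matches.
            var spans = doc.DocumentNode.SelectNodes(
                "//div[contains(@class, 'c-model-group__numbers')]/span");
            if (spans == null)
            {
                return values;
            }
            foreach (HtmlNode span in spans)
            {
                values.Add(span.InnerText.Trim());
            }
            return values;
        }
    }
    ```

    Calling ModelNumberScraper.GetModelNumbers(Data) on the downloaded string would then return one entry per matching span, in document order.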

    Friday, December 7, 2018 2:30 PM
  • User-1189274697 posted

    Thanks for getting back to me, but there is a problem: there will be tons of data in <span> elements that match.

    This is why I wanted to use regex, to match the pattern of the data.

    <div class="c-model-group js-model-group ">
    <div class="c-model-group__numbers">

    <span class="c-model js-model c-model--outline">
    1
    </span>

    <span class="c-model js-model c-model--outline">
    2
    </span>

    <span class="c-model js-model c-model--outline">
    3
    </span>

    </div>
    </div>

    I looked through the data and there are only 2 patterns like this, and I need those 2.

    Friday, December 7, 2018 5:12 PM
  • User475983607 posted

    I believe this RegEx should work, that is, if I understand the pattern you are looking for...

    ^<div class="c-model-group js-model-group ">[\s]*
    <div class="c-model-group__numbers">[\s]*
    <span class="c-model js-model c-model--outline">[\s\d]*<\/span>[\s]*
    <span class="c-model js-model c-model--outline">[\s\d]*<\/span>[\s]*
    <span class="c-model js-model c-model--outline">[\s\d]*<\/span>$

    The RegEx looks for the literal HTML and allows white space around the HTML.  It expects the span to contain digits but will also accept white space.  If you want to require digits, then change the expression to...

    <span class="c-model js-model c-model--outline">[\s]*[\d]+[\s]*<\/span>
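
    One way this could be applied from C# is sketched below. It is not the exact expression above: instead of hard-coding three spans it matches each "c-model-group__numbers" div first and then pulls the digits out of every span inside it, so groups with more or fewer numbers still work. The names ModelGroupRegex and Extract are hypothetical, and the sketch assumes the class attributes appear exactly as in the posted snippet and that the numbers div contains no nested <div>.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public static class ModelGroupRegex
    {
        // Outer pattern: one numbers div and everything up to its first </div>.
        // Singleline lets '.' match the line breaks inside the group.
        private static readonly Regex GroupPattern = new Regex(
            @"<div class=""c-model-group__numbers"">(?<body>.*?)</div>",
            RegexOptions.Singleline);

        // Inner pattern: one span that contains digits, optionally surrounded by white space.
        private static readonly Regex SpanPattern = new Regex(
            @"<span class=""c-model js-model c-model--outline"">\s*(?<num>\d+)\s*</span>");

        // Returns one comma-separated string of numbers per group found in data.
        public static List<string> Extract(string data)
        {
            var groups = new List<string>();
            foreach (Match g in GroupPattern.Matches(data))
            {
                var numbers = new List<string>();
                foreach (Match s in SpanPattern.Matches(g.Groups["body"].Value))
                {
                    numbers.Add(s.Groups["num"].Value);
                }
                groups.Add(string.Join(",", numbers));
            }
            return groups;
        }
    }
    ```

    Capturing the digits in a named group keeps each number separate, so no second Regex.Replace pass is needed to clean up the match.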

    Friday, December 7, 2018 5:48 PM
  • User-1189274697 posted

    I think this is really close to what I am trying to achieve. This is what I have come up with so far:

    pattern = @"[\s]*[\s]*[\s\d]*<\/span>[\s]*[\s\d]<\/span>[\s][\s\d]*<\/span>$ ";


    foreach (Match m in Regex.Matches(Data2, pattern, RegexOptions.Singleline))
    {
    string t = m.ToString();
    if ("," != t.Substring(0, 1))
    {
    t = Regex.Replace(t, "[^.0-9]", "");
    myList.Add(t);
    }
    }

    Is the pattern right? I'm not very good at writing patterns.

    Friday, December 7, 2018 7:06 PM
  • User475983607 posted

    I'm a bit confused as I tested the original RegEx using the HTML snippet that you provided.  The RegEx functioned as expected.

    The RegEx posted above does not match the HTML.   I assume you are trying to do something like this...

    ^[\d]+[\s]*<\/span>$

    ...where you are looking for a number, optional white space, followed by </span>.

    This site allows for testing RegEx.

    https://regex101.com/

    Friday, December 7, 2018 7:28 PM
  • User-943250815 posted

    For sure there will be multiple patterns, and there is no easy way to get them all in a single shot. The sample was just an example.

    superlurker

    Then i could filter it with Regex.Replace(t, "[^.0-9]", "") to get the model number


    Are you sure you want to remove all numbers from the "model number"?
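
    For reference, the character class [^.0-9] is negated, so the quoted call removes everything except digits and dots rather than removing the numbers. A small sketch of that behaviour follows; the class and method names are made up for illustration.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class DigitFilterDemo
    {
        // Same call as in the quoted post: strips every character that is
        // not a digit or a dot, leaving only the digits and dots behind.
        public static string KeepDigitsAndDots(string t)
        {
            return Regex.Replace(t, "[^.0-9]", "");
        }

        public static void Main()
        {
            string span = "<span class=\"c-model js-model c-model--outline\">\n1\n</span>";
            Console.WriteLine(KeepDigitsAndDots(span)); // prints "1"
        }
    }
    ```

    Note that if a whole group of three spans is passed in at once, the digits are run together (for example "123"), so the boundaries between the individual model numbers are lost.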

    superlurker

    pattern = @"[\s]*[\s]*[\s\d]*<\/span>[\s]*[\s\d]<\/span>[\s][\s\d]*<\/span>$ ";


    Such a pattern makes me think you want to search nested <span> elements.

    It sounds like you first need to classify all the patterns, then look for a way to collect or replace the data.

    Friday, December 7, 2018 8:46 PM
  • User-1189274697 posted

    Running the test on the website, the RegEx you provided works, but it's grabbing too much. I came across this:

    How to use it:

    string text = "This is an example string and my data is here";
    string data = getBetween(text, "my", "is");

    public static string getBetween(string strSource, string strStart, string strEnd)
    {
    // Locate the start marker; return an empty string if it is missing.
    int Start = strSource.IndexOf(strStart, StringComparison.Ordinal);
    if (Start < 0)
    {
    return "";
    }
    Start += strStart.Length;
    // Look for the end marker only after the start marker, so a missing
    // end marker no longer makes Substring throw.
    int End = strSource.IndexOf(strEnd, Start, StringComparison.Ordinal);
    if (End < 0)
    {
    return "";
    }
    return strSource.Substring(Start, End - Start);
    }

    What this does is find a set keyword, which is great because all of the data I'm trying to grab has different keywords in front of it.

    This comes the closest, but the only problem is that once it finds 1 match it just returns. I need it to get every match in the data string and append each one to a string, like

    list +=

    I can't figure out how to get it to loop again past the end location and return once there are no matches left.
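
    One possible way to make the search continue past each match is to remember where the previous end marker finished and start the next IndexOf call from there. The sketch below does that and returns a list instead of one concatenated string; GetAllBetween and StringScanner are hypothetical names, not part of the snippet above.

    ```csharp
    using System;
    using System.Collections.Generic;

    public static class StringScanner
    {
        // Hypothetical helper: collects every substring that sits between
        // strStart and strEnd, scanning strSource from left to right.
        public static List<string> GetAllBetween(string strSource, string strStart, string strEnd)
        {
            var results = new List<string>();
            // Empty markers would never advance the scan position, so refuse them.
            if (string.IsNullOrEmpty(strStart) || string.IsNullOrEmpty(strEnd))
            {
                return results;
            }

            int position = 0;
            while (true)
            {
                int start = strSource.IndexOf(strStart, position, StringComparison.Ordinal);
                if (start < 0) break;            // no further start marker
                start += strStart.Length;

                int end = strSource.IndexOf(strEnd, start, StringComparison.Ordinal);
                if (end < 0) break;              // start marker without an end marker

                results.Add(strSource.Substring(start, end - start));
                position = end + strEnd.Length;  // resume after this match
            }
            return results;
        }
    }
    ```

    For example, GetAllBetween("a[1]b[2]c", "[", "]") yields the two entries "1" and "2"; string.Join can turn the list into a single string if that is what is needed.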

    Friday, December 7, 2018 8:52 PM
  • User475983607 posted

    If the HTML content is XML compliant, then use the .NET XML APIs to query the document, which is a lot easier than string parsing, IMHO.
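
    A rough sketch of what that could look like with LINQ to XML is below. It only works when the markup is well-formed XML with a single root element and no default namespace, which is an assumption here; XDocument.Parse throws an XmlException on typical real-world HTML. XmlModelReader and GetModelNumbers are illustrative names.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Xml.Linq;

    public static class XmlModelReader
    {
        // Hypothetical helper: reads the text of each <span> that is a direct
        // child of <div class="c-model-group__numbers"> in well-formed markup.
        public static List<string> GetModelNumbers(string xhtml)
        {
            XDocument doc = XDocument.Parse(xhtml);
            return doc.Descendants("div")
                .Where(d => (string)d.Attribute("class") == "c-model-group__numbers")
                .SelectMany(d => d.Elements("span"))
                .Select(s => s.Value.Trim())
                .ToList();
        }
    }
    ```

    If the downloaded page turns out not to be well-formed, an HTML parser such as HtmlAgilityPack is the more forgiving option.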

    Friday, December 7, 2018 9:00 PM
  • User-1189274697 posted

    Example, please? I have never used them.

    Friday, December 7, 2018 11:24 PM
  • User-1189274697 posted

    Thanks everyone for your help.. I decided to go with a different solution to the problem.

    Monday, December 10, 2018 3:13 PM