Dead loop
-
Friday, June 15, 2012 11:11 AM
I am wondering if any expert could tell me why the following regular expression may cause a dead loop:
@"<span class=""time_rtq"">.+?>.+?>(?<object>.+?)\s\D\D\D</span>"
Hong
- Changed Type Hong (MA, USA) Monday, June 18, 2012 11:52 AM
- Changed Type Hong (MA, USA) Friday, July 06, 2012 6:19 PM Found text to reproduce the issue.
All Replies
-
Friday, June 15, 2012 1:35 PM
Probably this --> .+?>.+?>
Can I see some data that you may have?
Try something like this (of course, I am shooting in the dark until I can see some examples):
@"<span class=""time_rtq""\>[^\>]*>(?<object>[^\s]*)\s\D{3}<\/span>"
John Grove, Senior Software Engineer http://www.digitizedschematic.com/
- Proposed As Answer by Mike Feng (Microsoft Contingent Staff, Moderator) Monday, June 18, 2012 11:25 AM
- Marked As Answer by Hong (MA, USA) Monday, June 18, 2012 11:50 AM
- Proposed As Answer by OmegaMan (MVP, Moderator) Tuesday, July 10, 2012 7:17 PM
- Unproposed As Answer by Hong (MA, USA) Wednesday, July 11, 2012 8:15 PM
-
Friday, June 15, 2012 2:15 PM
Thanks, John, for trying to lend a hand.
I am sorry but I did not save the text. There is a chance that I can find one later today. It was 100% reproducible for the half a dozen tests that I did.
I should have mentioned that it hangs when there is no match.
Hong
-
Sunday, June 17, 2012 10:59 PM
John, unfortunately, I am unable to reproduce the text. It is from a Yahoo page, and it seems to change frequently. I will be sure to save the text the next time it occurs.
Hong
-
Monday, June 18, 2012 11:27 AM Moderator
Hi Hong,
Based on your current status, it seems that you cannot reproduce this scenario now.
Would you mind changing this thread to a discussion and changing it back when you can reproduce the problem?
Thank you for your understanding and support.
Best regards,
Mike Feng
MSDN Community Support | Feedback to us
Please remember to mark the replies as answers if they help and unmark them if they provide no help.
-
Monday, June 18, 2012 11:52 AM
Hi Mike,
Yes, sure. I am sorry that I did not save the text.
Hong
- Edited by Hong (MA, USA) Monday, June 18, 2012 11:53 AM
-
Monday, June 18, 2012 1:58 PM Moderator
Hi Hipswich,
It is OK.
Best regards,
Mike Feng
-
Friday, July 06, 2012 6:19 PM
John, I hope you are still watching this topic.
I have finally seen it again, and saved the text this time.
The following code enters a dead loop.
Match mDate = Regex.Match(sText, @"<span class=""time_rtq"">.+?>.+?>(?<object>.+?)\s\D\D\D</span>", RegexOptions.Singleline);
where sText exceeds the limit of 60000 characters for this forum. I put it in debug.txt for downloading.
Hong
-
Monday, July 09, 2012 1:52 AM Moderator
Hi Hong,
I tested your code this way:
private void button2_Click(object sender, EventArgs e)
{
    // Note: ReadAllLines tests each line of the file separately.
    foreach (string sText in File.ReadAllLines(@"E:\C# Projects\WinFormsApp-RegEx\TestString.txt"))
    {
        Match mDate = Regex.Match(sText, @"<span class=""time_rtq"">.+?>.+?>(?<object>.+?)\s\D\D\D</span>", RegexOptions.Singleline);
        while (mDate.Success)
        {
            Console.WriteLine(mDate.Value);
            // NextMatch returns a new Match; assign it so the loop advances.
            mDate = mDate.NextMatch();
        }
    }
    Console.WriteLine("Done");
}
And I didn't reproduce your scenario.
You mentioned "where sText exceeds the limit of 60000 characters for this forum". Do you mean that you have hard-coded the very long string in your program?
Best regards,
Mike Feng
-
Monday, July 09, 2012 2:02 AM
Mike, thanks a lot for trying to help here.
I use the following code to reproduce the problem:
#if DEBUG
StreamReader sr = new StreamReader(Environment.GetFolderPath(Environment.SpecialFolder.Desktop) + @"\debug.txt");
sText = sr.ReadToEnd();
sr.Close();
#endif
Match mDate = Regex.Match(sText, @"<span class=""time_rtq"">.+?>.+?>(?<object>.+?)\s\D\D\D</span>", RegexOptions.Singleline);
#if DEBUG
Debug.WriteLine("Parsing finished");
#endif
I have just tried it again, and the problem remains.
Hong
-
Wednesday, July 11, 2012 2:05 PM
@"<span class=""time_rtq""\>[^\>]*>(?<object>[^\s]*)\s\D{3}<\/span>"
I tried John's pattern though I am not clear how it works, but it does not catch the following:
<span class="time_rtq"> <span id="yfs_t53_aau"><span id="yfs_t53_aau">9:53AM EDT</span>
When a match like the above exists, the search finishes quickly. It hangs only when there is no match, as with the referenced file. I think it should not hang if there is no match.
Mike, have you been able to reproduce it?
Hong
-
Wednesday, July 11, 2012 2:22 PM
You could try to use the HTML Agility pack since you are parsing HTML. Perhaps a regex is not what you are after. Something like so?
HtmlDocument html = new HtmlDocument();
html.Load("example.html");
HtmlNodeCollection nodes = html.DocumentNode.SelectNodes("//span[@class = 'time_rtq']");
John Grove, Senior Software Engineer http://www.digitizedschematic.com/
- Edited by JohnGrove Wednesday, July 11, 2012 2:27 PM
-
Wednesday, July 11, 2012 2:36 PM
Thanks, John. That is an excellent tip.
The code is from a 7-year old application. I can definitely change a part of it to use the HAP.
Not knowing why Regex hangs under certain circumstances gives me discomfort.
Hong
-
Wednesday, July 11, 2012 3:06 PM
Using your data, what exactly am I supposed to "catch"? Here is what I did:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument html = new HtmlDocument();
            html.Load("example.html");
            HtmlNodeCollection nodes = html.DocumentNode.SelectNodes("//span[@class = 'time_rtq']");
            // SelectNodes returns null when nothing matches, so guard the loop.
            if (nodes != null)
            {
                foreach (HtmlNode node in nodes)
                    Console.WriteLine(node.InnerHtml);
            }
            Console.ReadLine();
        }
    }
}
Here is what I caught:
InnerHtml:
<span id="yfs_t10_bacpx"><span id="yfs_t10_bacpx">Jul 5</span></span>
OuterHtml:
<span class="time_rtq"> <span id="yfs_t10_bacpx"><span id="yfs_t10_bacpx">Jul 5</span></span></span>
John Grove, Senior Software Engineer http://www.digitizedschematic.com/
- Edited by JohnGrove Wednesday, July 11, 2012 3:07 PM
- Marked As Answer by Hong (MA, USA) Wednesday, July 11, 2012 8:14 PM
-
Wednesday, July 11, 2012 3:34 PM
Thanks a lot, John. This will take care of the function. I just need the content of the innermost span element. I have installed HAP via NuGet.
The only downside of HAP is that it cannot be used for Windows Phone apps.
Since I use Regex frequently (beyond parsing HTML pages), I hope Mike will offer some insight into the cause of the dead loop.
Hong
- Edited by Hong (MA, USA) Wednesday, July 11, 2012 3:49 PM
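Editor's sketch: since only the content of the innermost span is needed, the XPath used above can probably be narrowed so that HAP returns that element directly. This is a minimal illustration, not code from the thread. It assumes the HtmlAgilityPack NuGet package is referenced, that the page resembles the snippets quoted here, and that the innermost element is the one with no span descendants. The class name and the sample string are hypothetical.

```csharp
using System;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

class InnermostSpanDemo
{
    static void Main()
    {
        // Hypothetical sample shaped like the markup quoted in this thread.
        string page =
            "<span class=\"time_rtq\"> <span id=\"yfs_t53_aau\">" +
            "<span id=\"yfs_t53_aau\">9:53AM EDT</span></span></span>";

        var html = new HtmlDocument();
        html.LoadHtml(page);

        // Spans below the time_rtq span that contain no further span element.
        HtmlNodeCollection nodes = html.DocumentNode.SelectNodes(
            "//span[@class='time_rtq']//span[not(.//span)]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (nodes == null)
        {
            Console.WriteLine("No matching span found.");
            return;
        }

        foreach (HtmlNode node in nodes)
            Console.WriteLine(node.InnerText.Trim());
    }
}
```

If the time_rtq span ever holds its text directly, with no nested span, this query finds nothing, so a fallback to the outer span's InnerText may be needed.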
-
Wednesday, July 11, 2012 3:44 PM
Try this:
<span class="time_rtq"\>[^\>]*>[\w\d\"\s\<\>=]+(\<?\/span>)*
John Grove, Senior Software Engineer http://www.digitizedschematic.com/
-
Wednesday, July 11, 2012 7:02 PM
John, I am not clear how the pattern is supposed to work.
It returns:
<span class="time_rtq"> <span id="yfs_t53_ae"><span id="yfs_t53_ae">2
from:
<span class="time_rtq"> <span id="yfs_t53_ae"><span id="yfs_t53_ae">2:03PM EDT</span>
Hong
-
Wednesday, July 11, 2012 7:04 PM
I used that pattern against your data and caught this:
<span class="time_rtq"> <span id="yfs_t10_bacpx"><span id="yfs_t10_bacpx">Jul 5</span></span></span>
John Grove, Senior Software Engineer http://www.digitizedschematic.com/
-
Wednesday, July 11, 2012 7:05 PM
Try this pattern to catch the colon (:) as well:
<span class="time_rtq"\>[^\>]*>[\w\d\"\s\<\>=\:]+(\<?\/span>)*
John Grove, Senior Software Engineer http://www.digitizedschematic.com/
- Marked As Answer by Hong (MA, USA) Wednesday, July 11, 2012 8:15 PM
-
Wednesday, July 11, 2012 7:47 PM
It works. Thanks again, John.
Now, I think I understand how the pattern works. HAP is surely a better way. There may be one or two nested <span> elements within <span class="time_rtq">, and other variations. All I need is the content of the innermost <span>. All of this is beyond my original intention of finding the cause of the dead loop, but it is very helpful.
Hong
-
Wednesday, July 11, 2012 8:01 PM
Make sure to mark those who have helped you as the answer.
John Grove, Senior Software Engineer http://www.digitizedschematic.com/
-
Wednesday, July 11, 2012 8:18 PM
The group <object> of the following pattern seems to get the content of the innermost <span>:
@"<span class=""time_rtq"">.+?>(?<object>[^<]+?)</span>"
Hong
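Editor's sketch: the pattern above can be checked against one of the snippets quoted earlier in the thread. This is only an illustration. The class name is hypothetical, and RegexOptions.Singleline is an assumption carried over from the original call, since the post does not say which options were used with this pattern.

```csharp
using System;
using System.Text.RegularExpressions;

class InnerSpanPatternDemo
{
    static void Main()
    {
        // Snippet quoted earlier in this thread.
        string sample =
            "<span class=\"time_rtq\"> <span id=\"yfs_t53_aau\">" +
            "<span id=\"yfs_t53_aau\">9:53AM EDT</span>";

        // [^<]+? cannot cross a tag boundary, so the group can only hold
        // text that sits directly in front of a closing </span>.
        string pattern = @"<span class=""time_rtq"">.+?>(?<object>[^<]+?)</span>";

        // Assumption: Singleline, as in the original Regex.Match call.
        Match m = Regex.Match(sample, pattern, RegexOptions.Singleline);

        Console.WriteLine(m.Success ? m.Groups["object"].Value : "(no match)");
    }
}
```

For this sample the group holds 9:53AM EDT. The pattern still contains one lazy .+? that can run to the end of the input under Singleline, so it does less backtracking than the original but is not free of it.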
-
Wednesday, July 11, 2012 8:20 PM
or do this, name it
<span class="time_rtq"\>(?<mySpan>[^\>]*>[\w\d\"\s\<\>=\:]+)(\<?\/span>)*
John Grove, Senior Software Engineer http://www.digitizedschematic.com/
-
Wednesday, July 11, 2012 10:43 PM
Thanks, John.
I thought I should clarify that the reason I am interested in knowing why the original pattern hangs is that I want to prevent such a thing from happening in the future. It is OK if the match fails, but it is not OK if a piece of code hangs.
Hong
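Editor's sketch: the thread never identifies the cause, but the hang is most likely heavy backtracking rather than a true infinite loop. With RegexOptions.Singleline, each of the three lazy .+? groups can stretch to the end of the text, so when no match exists the engine has to try a very large number of combinations over a 60000-character input. That is an inference from the pattern, not something confirmed here. One way to keep such a call from hanging is a match timeout, which assumes .NET Framework 4.5 or later. The class name, the helper TryGetTime, the two-second limit, and the sample string are hypothetical choices for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexTimeoutDemo
{
    // Hypothetical helper: returns the captured text, or null when there is
    // no match or the match is abandoned after the timeout.
    static string TryGetTime(string sText)
    {
        // Original pattern from the thread, unchanged.
        const string pattern =
            @"<span class=""time_rtq"">.+?>.+?>(?<object>.+?)\s\D\D\D</span>";
        try
        {
            // The TimeSpan overload (.NET 4.5+) throws
            // RegexMatchTimeoutException instead of running indefinitely.
            Match m = Regex.Match(sText, pattern, RegexOptions.Singleline,
                                  TimeSpan.FromSeconds(2)); // arbitrary limit
            return m.Success ? m.Groups["object"].Value : null;
        }
        catch (RegexMatchTimeoutException)
        {
            return null;
        }
    }

    static void Main()
    {
        // Hypothetical short input shaped like the thread's snippets.
        string sample =
            "<span class=\"time_rtq\"> <span id=\"a\">" +
            "<span id=\"a\">9:53AM EDT</span>";

        Console.WriteLine(TryGetTime(sample) ?? "(no match or timed out)");
    }
}
```

A timeout only bounds the damage. Replacing the open-ended .+? groups with negated character classes such as [^<]+ or [^>]*, as in the patterns suggested above, limits how far each group can run in the first place.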
-
Thursday, July 12, 2012 1:04 AM
I have just noticed that HAP does support Windows Phone. There is every reason for me to use it for parsing HTML pages, though Regex is usually still needed in the final step.
Hong

