Inexplicable Regex (effectively) infinite loop (bug in Regex?)

  • Question

  • I was tracking down a problem this morning that was pegging the CPUs on several of our 12-core production servers. It came down to a single regular expression. I've seen the posts here about regular expressions with exponential running time, but this seems a little different. The string being searched is quite short (short enough that I'd expect even most exponential algorithms to finish fairly quickly), and if we tweak the expression so that the end of the string is accepted as an alternative to the closing tag of the sequence we're replacing, the same operation finishes nearly instantly. That seems like a bug to me. Shouldn't Regex ALWAYS stop looking at the end of the string?

    Here is the code:

        // Needs: using System; using System.Diagnostics; using System.Text.RegularExpressions;

        // Note that the input has an opening <script> tag but no closing </script> tag.
        string html = "<!--BEGIN QUALTRICS POLL--> <script type='text/javascript'> var q_poll_f = function(){var s=document";

        // matches /* ... */ block comments and // ... line comments inside a script
        string embeddedScriptComments = @"(\/\*.*?\*\/|\/\/.*?[\n\r])";

        string scriptPattern = String.Format(@"(?'script'<[ \n\r]*script[^>]*>(.*?{0}?)*<[ \n\r]*/script[^>]*>)", embeddedScriptComments);
        //string scriptPattern = String.Format(@"(?'script'<[ \n\r]*script[^>]*>(.*?{0}?)*(<[ \n\r]*/script[^>]*>|$))", embeddedScriptComments);

        // the pattern includes the comment and script sub-patterns
        string pattern = String.Format(@"(?s)({0})", scriptPattern);

        Regex re = new Regex(pattern, RegexOptions.IgnoreCase);

        // remove all comments and scripts from the page...
        html = re.Replace(html, "");

        Debug.WriteLine(html);

    The commented-out line is our fix, which makes it run instantly. It simply adds end-of-string ($) as an alternative to the closing </script> tag for ending the match.
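
    Separately from the pattern tweak, a match timeout would at least keep a runaway Replace from pegging a core. What follows is only a minimal sketch, assuming .NET 4.5 or later (where the Regex constructor accepts a TimeSpan and throws RegexMatchTimeoutException when the budget is exceeded). ScriptStripper and StripOrKeep are hypothetical names for illustration, not part of our code:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
static class ScriptStripper
{
    // Removes every match of `pattern` from `input`. If matching runs longer
    // than `timeout`, gives up and returns the input unchanged instead of
    // letting the regex engine keep backtracking.
    public static string StripOrKeep(string input, string pattern, TimeSpan timeout)
    {
        // Assumption: .NET 4.5+; this constructor overload takes a match timeout.
        var re = new Regex(pattern, RegexOptions.IgnoreCase, timeout);
        try
        {
            return re.Replace(input, "");
        }
        catch (RegexMatchTimeoutException)
        {
            // The time budget was exceeded; leave the text untouched.
            return input;
        }
    }
}
```

    Called as ScriptStripper.StripOrKeep(html, pattern, TimeSpan.FromSeconds(1)), this should hand the page back unmodified rather than spin. The one-second budget is an arbitrary choice, and the timeout only bounds the damage; it does not explain or remove the backtracking itself.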

    Thanks,

    -James

    • Moved by Bob Shen Wednesday, January 23, 2013 5:24 AM
    Tuesday, January 22, 2013 5:12 PM

Answers

All replies