locked
Faster regular expression? RRS feed

  • Question

  • can this regular expression be improved upon for speed? Used in my parser matching around 500 lines a second. I'm not great with regular expression, so any advice would be appreciated.

    (?<cdate>\d{2}:\d{2}:\d{2}):\s+(?:\(\s(?<id>\d+)\s,\sT=(?<cisplayer>[A-Z])#R=(?<cgroup>[A-Z])#(?<cid>\d+)\s,\sT=(?<tisplayer>[A-Z])#R=(?<tgroup>[A-Z])#(?<tid>\d+)\s,\sT=(?<cpisplayer>[A-Z])#R=(?<cpgroup>[A-Z])#(?<cpid>\d+)\s,\sT=(?<tpparty>[A-Z])#R=(?<tpgroup>[A-Z])#(?<tpid>\d+)\s,\s(?<cname>[A-Za-z@ ,']+)\s,\s(?<tname>[A-Za-z@ ,']+)\s,\s(?<v>[-]?\d+)\s,\s(?<aid>\d+)\s,\s(?<an>[A-za-z ':]+)\)).*

     

     

    Tuesday, June 7, 2011 11:53 AM

Answers

  • 1) Anchor your expression

    Since you're parsing lines, make sure you tell the engine that you're starting at the beginning of a line using ^ or \A. That way an incorrect line can easily be discarded.

    2) Prevent unneeded backtracking

    You have several parts of your expression that will cause excessive backtracking. Backtracking uses a lot of CPU. This happens when you have part of your expression that also matches other parts that come directly after it. By default the engine is greedy and will try to match a much as possible, only coming to the conclusion it can't match the rest of the line when it reaches the end of the line prematurely. It then has to track back to try out other options.

    ,\s(?<cname>[A-Za-z@ ,']+)\s,\s

    is one of these parts which will cause backtracking issues. The character group [a-z@ ,']+ contains both a space and a [,], these characters are also the separator to the next named group. So first the engine will take this group and the next group, only to conclude it can't fit the rest of the results, so it has to track back to the ',' and try the second group (tname).

    I'm guessing you only allow a [,] in a name if it is enclosed by ['], so reflect that in your expression:

    ,\s(?<cname>'[A-Za-z@ ,]+'|[A-Za-z@ ]+)\s,\s

    3) Prevent unwanted captures

    Each part of your expression that is enclosed in (...) will cause a capture to be created, this takes both cpu and memory. Prevent this from happening by using RegexOptions.ExplicitCapture or by putting ?: in each group that doesn't require capturing. so (...) becomes (?:...).

    4) Don't use named captures

    Even though they make your regex more maintainable, a named capture is much more expensive that an unnamed capture. If you remove the names, use (?:...) to mark each group that you don't need, leaving (...) for the groups that you do need. These groups are still accessible in the code, but only by their index (starting with 1)

    5) Reuse your Regex object

    If you're reusing the same expression multiple times in a row, it is advisable to use RegexOptions.Compiled and to store your regex object in a private static readonly Regex field.

    public static class CompiledExpressions
    {
         public static readonly Regex LineParser = new Regex(your expression here, RegexOptions.Compiled);
    }

    The first execution will be slower, but subsequent executions will be much faster

    6) Don't use \s if you mean [ ]

    Using a character class or character group when you only expect one kind of character is slower than just specifying exactly the character you're expecting, though the difference is very very small. So instead of using '\s,\s' just write ' , ', or '[ ],[ ]'.

    7) Remove the .* at the end if you're not using it

    You capture the remainder of the line, but you don't assign it to any group, so I'm assuming you don't need it. Just remove the .* at the end.

    6) Maybe an even better solution 

    Maybe regular expressions aren't the best solutions to read comma separated files. Check out the TextFieldParser class from Microsoft, which is a highly optimized class to parse a number of file formats natively:

    http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx


    • Proposed as answer by Jesse HouwingMVP Tuesday, June 7, 2011 1:02 PM
    • Marked as answer by Paul Zhou Monday, June 20, 2011 3:32 AM
    Tuesday, June 7, 2011 1:00 PM

All replies

  • 1) Anchor your expression

    Since you're parsing lines, make sure you tell the engine that you're starting at the beginning of a line using ^ or \A. That way an incorrect line can easily be discarded.

    2) Prevent unneeded backtracking

    You have several parts of your expression that will cause excessive backtracking. Backtracking uses a lot of CPU. This happens when you have part of your expression that also matches other parts that come directly after it. By default the engine is greedy and will try to match a much as possible, only coming to the conclusion it can't match the rest of the line when it reaches the end of the line prematurely. It then has to track back to try out other options.

    ,\s(?<cname>[A-Za-z@ ,']+)\s,\s

    is one of these parts which will cause backtracking issues. The character group [a-z@ ,']+ contains both a space and a [,], these characters are also the separator to the next named group. So first the engine will take this group and the next group, only to conclude it can't fit the rest of the results, so it has to track back to the ',' and try the second group (tname).

    I'm guessing you only allow a [,] in a name if it is enclosed by ['], so reflect that in your expression:

    ,\s(?<cname>'[A-Za-z@ ,]+'|[A-Za-z@ ]+)\s,\s

    3) Prevent unwanted captures

    Each part of your expression that is enclosed in (...) will cause a capture to be created, this takes both cpu and memory. Prevent this from happening by using RegexOptions.ExplicitCapture or by putting ?: in each group that doesn't require capturing. so (...) becomes (?:...).

    4) Don't use named captures

    Even though they make your regex more maintainable, a named capture is much more expensive that an unnamed capture. If you remove the names, use (?:...) to mark each group that you don't need, leaving (...) for the groups that you do need. These groups are still accessible in the code, but only by their index (starting with 1)

    5) Reuse your Regex object

    If you're reusing the same expression multiple times in a row, it is advisable to use RegexOptions.Compiled and to store your regex object in a private static readonly Regex field.

    public static class CompiledExpressions
    {
         public static readonly Regex LineParser = new Regex(your expression here, RegexOptions.Compiled);
    }

    The first execution will be slower, but subsequent executions will be much faster

    6) Don't use \s if you mean [ ]

    Using a character class or character group when you only expect one kind of character is slower than just specifying exactly the character you're expecting, though the difference is very very small. So instead of using '\s,\s' just write ' , ', or '[ ],[ ]'.

    7) Remove the .* at the end if you're not using it

    You capture the remainder of the line, but you don't assign it to any group, so I'm assuming you don't need it. Just remove the .* at the end.

    6) Maybe an even better solution 

    Maybe regular expressions aren't the best solutions to read comma separated files. Check out the TextFieldParser class from Microsoft, which is a highly optimized class to parse a number of file formats natively:

    http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx


    • Proposed as answer by Jesse HouwingMVP Tuesday, June 7, 2011 1:02 PM
    • Marked as answer by Paul Zhou Monday, June 20, 2011 3:32 AM
    Tuesday, June 7, 2011 1:00 PM
  • Hi Jesse,

    thumbs up (as always) ;)

    But one question: have you ever measured point 4 or do you have a reference for it?

    According to the implementation, every group is named, even if there is no name provided (index 0 has the name "0", index 1 the name "1").
    After parsing the pattern, the engine already knows the index of each group. And I'm quite sure it uses this index to store values on the stack because the class RegexNode doesn't know the name (only indexes) and the RegexInterpreter uses an int[]-array only (see also RegexRunner.Capture(int capnum, int start, int end)-member).

    So the only differences (I can think of) are at parse-time (should be very very small) and at the moment you access the values (should be ignorable).

    Is accessing a group by its name really significant slower then accessing it by its index? If so, this access is optimizable by using Regex.GetGroupNumbers / Regex.GetGroupNames (but not that convenient).

    Greetings,


    Wolfgang Kluge
    gehirnwindung.de
    Tuesday, June 7, 2011 3:01 PM
  • I did measure it, it's much slower, but I guess it is due to the fact that you can have multiple groups with the same name in different parts of your regex. Or due to the fact that checking one 0 is slower than checking AVeryLongButAlsoDescriptiveName ;).

    with data like:

    Firstname Lastname
    Lastname, Firstname 

    You can have an expression like:

    (?<FirstName>[a-z]+) (?<LastName>[a-z]+)|(?<LastName>[a-z]+), (?<FirstName>[a-z]+)

    That would result in a group named firstname and a group named lastname, and it would always contain the matched values.

    ([a-z]+) ([a-z]+)|([a-z]+), ([a-z]+)

    If you used numbered groups you'd have to check whether \1, \2, \3 or \4 matched and then ...

    If I remember correctly the difference in 10.000 iterations was 10 times slower than numbered captures only. And this was with a precompiled expression that had been invoked at least once before the test. It excludes actually reading the values, so it's extraction time only.



    Tuesday, June 7, 2011 3:12 PM
  • Just ran a new test against .NET 4, the difference is significantly smaller, but numbered expressions are still faster than named expressions:

     

    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.IO;
    using System.Threading;
    
    namespace ConsoleApplication6
    {
      class Program
      {
        public static readonly Regex Named = new Regex("(?<firstname>[a-z]+) (?<lastname>[a-z]+)|(?<lastname>[a-z]+), (?<firstname>[a-z]+)", RegexOptions.Compiled | RegexOptions.ExplicitCapture);
        public static readonly Regex Numbered = new Regex("([a-z]+) ([a-z]+)|([a-z]+), ([a-z]+)", RegexOptions.Compiled);
    
        static void Main(string[] args)
        {
          List<string> names = new List<string>(1000000);
          for (int i = 0; i < names.Capacity / 2 ; i++)
          {
            names.Add("firstname lastname");
            names.Add("lastname, firstname");
          }
    
    
          Extraction("numbered", Numbered, names);
          Extraction("named", Named, names);
    
          Console.ReadKey();
        }
    
        private static void Extraction(string label, Regex expression, List<string> names)
        {
          GC.Collect();
          GC.WaitForFullGCComplete();
    
          expression.Match(string.Empty);
          Thread.Sleep(2000);
    
          DateTime start = DateTime.UtcNow;
          foreach (string name in names)
          {
            Match m = expression.Match(name);
          }
          DateTime stop = DateTime.UtcNow;
          Console.WriteLine(label + ": " + (stop - start));
        }
      }
    }
    
    

     

    Results (no debugger, release build):

     

    numbered: 00:00:03.6016398
    named: 00:00:04.6644335 

    The funny fact is that under the debugger, named expressions are actually faster than in release, but still slower than numbered groups. I cannot explain that difference.

     

    Results (under debugger, debug build):

     

    numbered: 00:00:03.4446555
    named: 00:00:03.5756424 

    Executed on a Core i7 Quad Core with plenty of memory

     

     

    Tuesday, June 7, 2011 3:39 PM
  • Thanks for the advice, will try out your suggestions :) and will check out TextFieldParser. 
    Tuesday, June 7, 2011 6:07 PM
  • Hi Jesse,

    I have completly different results. It's a 64bit Windows 7 on AMD Phenom X4 945 at 3 GHz with 8GB RAM. I compiled for x86 (it was a mistake first, but see below for more).

    Debug
    01.2270702 numbered
    01.1390652 named

    Release
    01.1798605 numbered
    01.1068564 named

    01.1976285 numbered
    01.1094305 named

    01.2068499 numbered
    01.1319045 named

    So the named Version is always faster on my system?

    If I run "named" before "numbered", I get your results, too ("numbered" is faster than "named"). So I assume GC.WaitForFullGCComplete does not work as expected (it always returns NotApplicable - but I have no idea of the GC-API).

    The difference is less significant if I only run one of both tests (in 2 different assemblies).

    01.1881047 numbered
    01.1639911 named

    01.2166462 numbered
    01.1999642 named

    The 2 patterns aren't equal (named has 2 groups, numbered has 4). But I changed the pattern, too - I get still the same results

    I have seen a pain-mode;) If I use "Any" or "x64" as platform target, it takes:

    08.0930741 numbered
    07.3210693 named

    But I guess it's due to "Console"-output type (the same project as windows application is not that slow). Insane. I think you have this output type/platform combination, too (according to your results with a way better cpu)

    You can have two groups with the same index in numbered regex pattern, two. Just use numbers instead of names, e.g. (?<1>...). But the difference at run-time should be zero (no matter whether you use (..) or (?<1>..) or (?<name>..) ). The parser recognizes these indexes and generates a RegexTree (where the group names not even exists). The compiler generates .NET-opcodes, the interpreter runs on "internal" opcodes (int-Array). I relooked at the code (I'm writing a .net-compatible regex tester - that's why I'm used to the classes). A difference at this point is just not explainable (OK, at least I can't).

    Puzzled,


    Wolfgang Kluge
    gehirnwindung.de
    Tuesday, June 7, 2011 8:33 PM
  • Funny to see such differences,,, I adjusted the code to run both tests in parallel and to run the numbered and named expressions in a striped fashion. Running x86 on a Win64 is how I tested before, I'll make a few runs to make sure...

    I also added some code to actually extract the data back out of the expression using the named and numbered groups. using the int position to extract the data is slower on my machine than the strings, but then again I'm extracting 4 numbered groups as opposed to two named groups in the test I did before. I'll try to create a more comparable test case...

    I'll also try to run the same test under .NET 1.1 to see if I can get my old results again... I also don't remember if my old tests were using a compiled expression or an interpreted expression. The whole optimization of Regex expressions has changed a lot since 1.1 ;).

    I had a Regex tester on my radar as well, care to join efforts? contact me on my avanade email (or search my name on twitter)...

    I remember I was pretty puzzled last time as well... I'll do some testing again tomorrow, now having some off time with a good book :).

    Tuesday, June 7, 2011 8:47 PM
  •  

    Hi,

     

    I am writing to check the status of the issue on your side. Would you mind letting us know the result of the suggestions?

     

    If you have got answers, please remember to mark answer and close this thread.

    If not, any more concerns, please feel free to let us know.

     

    Have a nice day!


    Paul Zhou [MSFT]
    MSDN Community Support | Feedback to us
    Get or Request Code Sample from Microsoft
    Please remember to mark the replies as answers if they help and unmark them if they provide no help.

    Wednesday, June 15, 2011 7:38 AM