Parallel extensions and regular expressions

    Question

  • Hi all,

     

    In my current project, I want to build a database (an XML file for now) of hockey game statistics. For this, I wrote a bot that collects the game summary for every game in a chosen season. For example, for the 2003-2004 season, I collected a folder of 1230 files (30 teams * 82 games / 2). These files are about 25 KB each.

     

    I want to use regular expressions to parse the game summary files and extract the relevant information. I also want to make my code parallel using the new parallel extensions. A simple Parallel.For should do the trick, building XML game node objects in parallel and adding them to my XML database file. According to http://msdn2.microsoft.com/en-us/library/8zbs0h2f.aspx, regular expressions are interpreted and not compiled to MSIL unless the RegexOptions.Compiled option is used. That option compiles the regular expression to MSIL, which increases performance. I plan on using MSIL-compiled Regex objects. But there is a problem: the code for a compiled Regex is cached and cannot be unloaded. In my scenario, each iteration of my Parallel.For call must use its own Regex object (note that all of these are created from the same string, i.e. the same regular expression). Thus, with the naive approach, there will be 1230 cached copies of the same compiled MSIL code, which is not efficient.

     

    On the same page (http://msdn2.microsoft.com/en-us/library/8zbs0h2f.aspx) it says that a compiled Regex can be saved to an assembly. My question is: can we create multiple instances of a pre-compiled Regex out of an assembly, so that each iteration of my Parallel.For call uses its own copy of the Regex but there is only one cached MSIL implementation of it? My initial guess is yes, but I thought my question could be interesting for people using the parallel extensions with regular expressions. If not, how should my problem be handled?

     

     

    Sunday, January 27, 2008 5:14 PM

Answers

  • Since you mention that all the Regex objects are essentially the same (created from the same string), the first solution that comes to mind is to create the Regex object just before the Parallel.For() statement and refer to that instance in the loop delegate. That way the same compiled Regex is shared by all of the parallel iterations.

    Per MSDN, Regex objects are thread safe, so this shouldn't be a concern. Is there anything else that gets in the way of this approach?
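
    As a rough sketch of that shared-instance approach (the pattern, the class name and the in-memory input strings are made up for illustration, they are not the real game summary format):

```csharp
using System;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class SharedRegexDemo
{
    // One compiled instance, created once and shared by every iteration.
    // Hypothetical pattern: a three-letter team code followed by a goal count.
    private static readonly Regex ScoreLine =
        new Regex(@"(?<team>[A-Z]{3}) (?<goals>\d+)", RegexOptions.Compiled);

    // Returns, for each input string, the sum of all goal counts found in it.
    public static int[] SumGoals(string[] summaries)
    {
        var totals = new int[summaries.Length];
        Parallel.For(0, summaries.Length, i =>
        {
            // Matches() returns a fresh MatchCollection that only this
            // iteration touches; the Regex itself is only read.
            int sum = 0;
            foreach (Match m in ScoreLine.Matches(summaries[i]))
                sum += int.Parse(m.Groups["goals"].Value);
            totals[i] = sum; // each iteration writes only its own slot
        });
        return totals;
    }
}
```

    Calling SharedRegexDemo.SumGoals(new[] { "MTL 3 TOR 2" }) would go through the single shared Regex no matter how many iterations run at once.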


    Monday, January 28, 2008 4:53 AM
  • Here's the MSDN link about Regex thread safety: http://msdn2.microsoft.com/en-us/library/aa720723(VS.71).aspx

    It says that while Regex itself is thread safe, its result objects (Match and MatchCollection) aren't. So, as long as you don't store the results of one iteration somewhere and refer to them from another iteration, you should be fine with the usage pattern discussed above.

     

    Let me add two more points:

     

    1) Let's assume that instead of Regex, you were using some other class that isn't thread safe, but the rest of your requirements remained the same. In that case you could still address your concern (i.e. having to create 1,200+ instances of the same object) by using one of the overloads of Parallel.For() that provides a thread-local initializer. Here's an example:

     

                Parallel.For( 0, 1230,
                    () =>
                    {
                        // This is the once-per-worker-thread initialization delegate, which returns
                        // an object of type T (anything you like; in this case MyNonThreadsafeExpClass).
                        MyNonThreadsafeExpClass localExpObject = new MyNonThreadsafeExpClass();
                        localExpObject.Initialize( ... ); // whatever initialization you need to perform on the object
                        return localExpObject;
                    },
                    (i, state) =>
                    {
                        // This is the main iteration body: do something with i.

                        // You can also refer to your exp object here.
                        state.ThreadLocalState.ExpEvalFunc(...);

                        // ExpEvalFunc() will be called on the instance of MyNonThreadsafeExpClass that was
                        // created in the thread-local initializer of the worker now executing this iteration.
                    } );
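
    The overload above is the one from the preview release of the parallel extensions. As far as I know, in the version that later shipped with .NET Framework 4 the same idea is spelled a little differently: the local object is passed to the body as a third argument and returned from it, and there is an extra cleanup delegate. A minimal sketch under that assumption, where ScratchFormatter is a hypothetical stand-in for MyNonThreadsafeExpClass:

```csharp
using System;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for MyNonThreadsafeExpClass: it reuses an internal
// buffer, so one instance must not be used by two threads at the same time.
public sealed class ScratchFormatter
{
    private readonly StringBuilder _buffer = new StringBuilder();

    public string Format(int gameIndex)
    {
        _buffer.Clear();
        _buffer.Append("game-").Append(gameIndex);
        return _buffer.ToString();
    }
}

public static class ThreadLocalLoopDemo
{
    public static string[] Run(int count, out int instancesCreated)
    {
        var results = new string[count];
        int created = 0;

        Parallel.For(0, count,
            // localInit: runs once per worker task, not once per iteration.
            () =>
            {
                Interlocked.Increment(ref created);
                return new ScratchFormatter();
            },
            // body: receives the local object as its third argument and
            // returns it so that it is handed to the worker's next iteration.
            (i, loopState, formatter) =>
            {
                results[i] = formatter.Format(i);
                return formatter;
            },
            // localFinally: runs once per worker task after its last iteration.
            formatter => { });

        instancesCreated = created;
        return results;
    }
}
```

    The instancesCreated value is only there to let you check how many objects were really built; it should normally be far below the iteration count.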

     

    2) I would rethink the strategy of running a parallel loop "to read files and process" 1000s of files. My guess is that your program is more I/O bound than it is CPU bound. The problem here is that you'll have a number of worker threads that keep requesting reads from different locations on the disk, and therefore compete against each other for a purely sequential hardware resource (well, various caches on the system mitigate this to an extent, but that's another discussion). This limits your scalability.

     

    Unless your processing cost per file is significant, a better approach would be to read these files sequentially into memory (well, at least in batches of a few hundred perhaps), and then run a Parallel.For() loop to perform the processing on the batch (and repeat until all your input files are done).
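
    A minimal sketch of that read-then-process pattern, assuming the per-file work can be expressed as a function from file contents to a result. The class name, the process delegate and the batchSize parameter are placeholders, and the right batch size is something to measure rather than take from here:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class BatchedProcessing
{
    // Reads files one batch at a time on the calling thread, then processes
    // each in-memory batch in parallel. 'process' stands in for the real parsing.
    public static TResult[] Run<TResult>(
        IList<string> paths, int batchSize, Func<string, TResult> process)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");

        var results = new TResult[paths.Count];

        for (int start = 0; start < paths.Count; start += batchSize)
        {
            int count = Math.Min(batchSize, paths.Count - start);

            // Sequential I/O: only one read is outstanding at a time.
            var contents = new string[count];
            for (int k = 0; k < count; k++)
                contents[k] = File.ReadAllText(paths[start + k]);

            // CPU-bound work on data that is already in memory.
            int offset = start;
            Parallel.For(0, count, k =>
            {
                results[offset + k] = process(contents[k]);
            });
        }
        return results;
    }
}
```

    With 1230 files of about 25 KB each the whole input is only around 30 MB, so a single batch may well be enough in this particular case.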

     

    Please let me know if this makes sense.

    Monday, January 28, 2008 8:32 PM

All replies

  • Hi Huseyin and all,

     

    I didn't know that Regex objects were thread safe. I was under the impression that after calling the Match or Matches method you had to go back to the Regex to retrieve the results. In short, your solution will work, since each call to Matches creates a new MatchCollection object that contains all the relevant information. So each iteration has its own MatchCollection object to work with, even though they all share the same Regex object, which is essentially a "method".

     

    Thanks!

     

    Monday, January 28, 2008 2:18 PM
  • 1) I think I could make good use of the Parallel.For overload you mentioned. In my loops, I create game objects that I later write to my XML document. I could use this version of Parallel.For to reuse those game objects.

     

    2) You are correct. Reading batches of HTML pages is the way to go.

     

    My worker threads also compete for the XML document object (and maybe a log file for reporting possible file errors). Several calls to document.CreateElement are needed to create a game node. Are these thread-safe operations? According to the MSDN documentation, XmlDocument instance members are not guaranteed to be thread safe. Insertion of a node probably is not, but what about element creation?
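
    One conservative way to sidestep the question is sketched below: since instance members are not guaranteed to be thread safe, keep every XmlDocument call (CreateElement included) on a single thread, and let the parallel loop produce only plain data objects. The GameData type, its fields and the element and attribute names are hypothetical:

```csharp
using System;
using System.Threading.Tasks;
using System.Xml;

public static class XmlAssembly
{
    // Hypothetical plain data holder filled in by the parallel phase.
    public sealed class GameData
    {
        public string Home;
        public string Away;
    }

    public static XmlDocument Build(string[] summaries, Func<string, GameData> parse)
    {
        // Phase 1 (parallel): independent parsing only, no XmlDocument calls.
        var games = new GameData[summaries.Length];
        Parallel.For(0, summaries.Length, i => { games[i] = parse(summaries[i]); });

        // Phase 2 (single thread): every CreateElement/AppendChild happens here.
        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement("games");
        doc.AppendChild(root);
        foreach (GameData g in games)
        {
            XmlElement node = doc.CreateElement("game");
            node.SetAttribute("home", g.Home);
            node.SetAttribute("away", g.Away);
            root.AppendChild(node);
        }
        return doc;
    }
}
```

    This also keeps the nodes in input order, which a lock around the shared document would not guarantee.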

    Tuesday, January 29, 2008 1:27 AM
  • Huseyin is right.  But I would also recommend you use this same technique (to create a thread-local object) for Regex instances too.  The reason is subtle.  While Regex is thread-safe, it is currently known to achieve this thread safety with some techniques that will impact the scaling of your loops.  While we're working to get this remedied, a useful workaround is to use a thread-local Regex object.  FWIW.
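
    A rough sketch of a thread-local Regex, here using the ThreadLocal<T> type from .NET Framework 4 rather than the Parallel.For initializer overload discussed earlier in the thread. The pattern and the class name are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;

public static class ThreadLocalRegexDemo
{
    // Made-up pattern for illustration only.
    private const string Pattern = @"\d+";

    // Returns, for each input string, how many runs of digits it contains.
    public static int[] CountNumbers(string[] inputs)
    {
        var counts = new int[inputs.Length];

        // The factory runs the first time each thread reads .Value, so the
        // number of Regex instances is bounded by the number of worker
        // threads, not by the number of iterations.
        using (var localRegex = new ThreadLocal<Regex>(
            () => new Regex(Pattern, RegexOptions.Compiled)))
        {
            Parallel.For(0, inputs.Length, i =>
            {
                counts[i] = localRegex.Value.Matches(inputs[i]).Count;
            });
        }
        return counts;
    }
}
```

    Whether this actually scales better than one shared instance depends on the runtime version, so it is worth timing both variants.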

     

    ---joe

    Wednesday, February 13, 2008 5:19 AM