How to parse a very large log (1.7 GB) using multiple threads

  • Question

  • Hi!

    I'm trying to parse a large file and split it into other txt files, but it is taking too long. So I decided to implement it with multithreading: for example, 4 threads, each one parsing 1/4 of the original log and writing its results to the new files. But I don't have any idea how to do it. I'm a C# beginner; any help?

    Thanks!

    Monday, January 17, 2011 8:57 PM

Answers

  • Read the file with your main thread only and spawn new threads to process each part. The main thread continues reading while the other threads process the file's contents, so there is no problem of concurrent disk access.

    One thread should be sufficient for the reading itself.  Using async read and write, the time to parse and split should be insignificant.  The only ways to speed up this process are to read and write to different disks, or to do it twice: the second time will be much faster than the first due to caching.
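
    A minimal sketch of the reader-plus-workers pattern described above, assuming line-oriented input and using a BlockingCollection as the hand-off queue (the path, the worker count, the queue bound, and the empty Parse method are all placeholders, not the poster's code):

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    class LogSplitter
    {
        static void Main()
        {
            // Bounded queue: the reader blocks if the parsers fall behind.
            var lines = new BlockingCollection<string>(boundedCapacity: 10000);

            // Worker tasks parse while the main thread keeps reading.
            var workers = new Task[4];
            for (int i = 0; i < workers.Length; i++)
            {
                workers[i] = Task.Factory.StartNew(() =>
                {
                    foreach (string line in lines.GetConsumingEnumerable())
                        Parse(line);
                }, TaskCreationOptions.LongRunning);
            }

            // Only the main thread touches the disk, so reads stay sequential.
            foreach (string line in File.ReadLines(@"C:\logs\huge.log"))
                lines.Add(line);

            lines.CompleteAdding();   // tell the workers the input is done
            Task.WaitAll(workers);
        }

        // Placeholder: split/route each line to the right output file here.
        // If workers write to shared output files, synchronize that access.
        static void Parse(string line) { }
    }
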
    • Proposed as answer by Paul Zhou Monday, January 24, 2011 5:07 AM
    • Marked as answer by Paul Zhou Tuesday, January 25, 2011 1:29 AM
    Tuesday, January 18, 2011 11:24 AM

All replies

  • You can use memory-mapped files for the txt file operations:

    http://msdn.microsoft.com/en-us/library/dd997372.aspx

    And you can use the parallel programming patterns in .NET to do the multithreading:

    http://msdn.microsoft.com/en-us/library/dd537609.aspx

    This is the best way to do what you want.
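
    A rough sketch of how those two pieces could fit together (the path and the Parse method are placeholders; note that a memory-mapped view stream is page-aligned, so the reader can see trailing zero bytes past the real end of the file):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.IO.MemoryMappedFiles;
    using System.Threading.Tasks;

    class MmfParse
    {
        static void Main()
        {
            // Map the whole log into memory and read it through a stream view.
            using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\logs\huge.log", FileMode.Open))
            using (var view = mmf.CreateViewStream())
            using (var reader = new StreamReader(view))
            {
                // Parallel.ForEach pulls lines from the single reader and
                // hands them to worker threads for parsing.
                Parallel.ForEach(ReadLines(reader), line => Parse(line));
            }
        }

        // Lazily yield lines so Parallel.ForEach can consume them on demand.
        static IEnumerable<string> ReadLines(TextReader reader)
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }

        static void Parse(string line) { /* your parsing logic */ }
    }
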

    Monday, January 17, 2011 9:02 PM
  • It won't take any less time.  Actually, it will take longer, because the threads will be competing for time to read the file and the disk head will constantly be seeking to different locations on the drive.  Save time by writing the new file to a different drive.
    Tuesday, January 18, 2011 4:32 AM
  • Thanks for the help!

    Gonna try 2 disks.

    =D

    Tuesday, January 18, 2011 6:40 PM
  • Are you sure it is the read/write actions that should be optimized, and not the parsing itself?

    I am not sure which "parsing" actions you do, but if it is more than the very basics, I would first investigate how long it takes to just read the file and write it into another one, and compare that with the time your program takes. If there is a big difference, consider analysing whether any of the parsing code should be optimized instead of using multiple threads.
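
    A quick way to get that baseline, assuming you simply time a straight copy with no parsing at all (the paths are placeholders):

    using System;
    using System.Diagnostics;
    using System.IO;

    class Baseline
    {
        static void Main()
        {
            var sw = Stopwatch.StartNew();
            // Pure I/O: read the log and write it straight back out.
            using (var input = File.OpenRead(@"C:\logs\huge.log"))
            using (var output = File.Create(@"C:\logs\copy.log"))
            {
                input.CopyTo(output);
            }
            sw.Stop();
            Console.WriteLine("I/O only: {0}", sw.Elapsed);
        }
    }

    If your parser takes far longer than this copy, the bottleneck is the parsing code, not the disk.
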

    Tuesday, January 18, 2011 7:10 PM
  • How long does it take to simply read the entire file, compared to your current performance?  I mean, if you had a loop:

    string text;
    // 'file' is assumed to be a StreamReader over the log;
    // read and discard every line to measure raw read speed.
    while( (text = file.ReadLine()) != null );

    This is a baseline for simply reading the file.  Is this fast enough for you?  If not, then you have I/O problems, and the solutions are likely beyond beginner-level skills.

    Once you have a loop that can read lines, you can delegate each chunk of text to the thread pool.

    while( (text = file.ReadLine()) != null )
    {
      // Copy to a local first: the delegate would otherwise capture the
      // shared 'text' variable, which changes on every iteration.
      string line = text;
      ThreadPool.QueueUserWorkItem( delegate {
       Parse( line );
      } );
    }

    This technique will parse each line of text in parallel.  It potentially queues a lot of individual work items, so you may want to bundle them up into groups of N lines; a sketch of that follows the next example.  Also, you may need to provide some context to your Parse function so that it knows what order the lines came in.  You may need to keep track of an index, like this:

    int index = 0;
    while( (text = file.ReadLine()) != null )
    {
      // Local copies again: each delegate must capture this iteration's
      // values, not the shared 'text' and 'index' variables.
      string line = text;
      int lineIndex = index;
      ThreadPool.QueueUserWorkItem( delegate {
       Parse( line, lineIndex );
      } );
      index++;
    }
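
    One way to do that bundling, continuing the snippets above (this needs using System.Collections.Generic; the batch size and the batch-index argument passed to Parse are illustrative):

    const int BatchSize = 1000;  // tune for your workload
    var batch = new List<string>(BatchSize);
    int batchIndex = 0;
    while( (text = file.ReadLine()) != null )
    {
      batch.Add( text );
      if( batch.Count == BatchSize )
      {
        // Hand the full batch off; locals keep each work item's
        // captures independent, as in the examples above.
        List<string> work = batch;
        int workIndex = batchIndex++;
        ThreadPool.QueueUserWorkItem( delegate {
          foreach( string line in work )
            Parse( line, workIndex );
        } );
        batch = new List<string>(BatchSize);
      }
    }
    // Don't lose the final partial batch.
    if( batch.Count > 0 )
    {
      List<string> finalWork = batch;
      int finalIndex = batchIndex;
      ThreadPool.QueueUserWorkItem( delegate {
        foreach( string line in finalWork )
          Parse( line, finalIndex );
      } );
    }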
    

    You haven't told us much about the way you parse your data and what work you want to do to it.  This may be totally off base if the parsing isn't that intensive.  I mean, if you're just trying to find the newlines then that's a completely different problem.

    I'd be happy to discuss further if you provide more specific information about what you're trying to accomplish with your "parsing".

    Tuesday, January 18, 2011 7:28 PM