How to correctly use TaskFactory to split a large file

  • Question

  • Hi,

    I'm trying to load a 64GB file into a database.  My approach is to split, transform, then finally load the file into a database table.  I would like to use parallel programming to split and transform the file in different threads.  So one thread would split the file into smaller files and another thread would transform the rows in the smaller file.  I have a process working that will split the file, transform it, and load it into the database, and now I would like to integrate the TPL into it.  When I introduce a TaskFactory I'm not getting the results I need: no data is being written to the file.  Here is my pseudo code; am I using TaskFactory correctly?

    public void Main()
    {
        List<string> lines = new List<string>();
        List<Task> taskList = new List<Task>();
        TaskFactory tf = new TaskFactory(TaskCreationOptions.LongRunning, TaskContinuationOptions.ExecuteSynchronously);

        using (StreamReader sr = new StreamReader(filePath))
        {
            while (!sr.EndOfStream)
            {
                while (rowCnt % 500000 != 0 && !sr.EndOfStream)
                {
                    lines.Add(sr.ReadLine());
                    rowCnt++;
                }

                taskList.Add(tf.StartNew(() => ProcessFileBlock(lines)));
            }
        }

        Task.WaitAll(taskList.ToArray());
    }

    private void ProcessFileBlock(List<string> lines)
    {
        // transform file
        StreamWriter sr = new StreamWriter(newFile);
        sr.Write(lines);
    }





    Wednesday, May 11, 2016 12:44 AM

Answers

  • "I would have expected file creation to stop at some point because of resource constraints caused by creating, transforming, and populating the file.  For example, maybe 5 smaller files would be created and processed at one time."

    Usually the number of running tasks matches the CPU core count. However, it's not the TPL that takes care of this but the thread pool that the TPL uses by default. In your case that doesn't happen because you used LongRunning; with that option the TPL simply starts a new thread for each task instead of using the thread pool.

    You should try simply starting tasks with Task.Run and see how it goes. That said, your tasks are doing disk I/O, which may mean that using the TPL here is pointless. The row transformation each task does needs to be CPU-intensive enough; otherwise you're just writing multiple files to disk at the same time, and that's usually bad.
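    A minimal sketch of that suggestion (the file path, block size, and `process` callback are placeholders for the poster's actual transform, not code from the thread):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static class FileSplitter
{
    // Reads the file in blocks of up to blockSize lines and hands each block
    // to `process` on a thread-pool task via Task.Run. Because the thread
    // pool schedules the tasks, only a limited number (roughly the core
    // count) actually run at the same time.
    public static void SplitFile(string filePath, int blockSize, Action<List<string>> process)
    {
        var taskList = new List<Task>();

        using (var sr = new StreamReader(filePath))
        {
            while (!sr.EndOfStream)
            {
                var lines = new List<string>();   // fresh list per block
                while (lines.Count < blockSize && !sr.EndOfStream)
                    lines.Add(sr.ReadLine());

                taskList.Add(Task.Run(() => process(lines)));
            }
        }

        Task.WaitAll(taskList.ToArray());
    }
}
```

    Task.Run queues the work with the default options, so no LongRunning hint is involved and the thread pool can throttle how many blocks are processed concurrently.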

    Thursday, May 12, 2016 4:50 AM
    Moderator

All replies

  • Something seems wrong in the way you create those tasks. There should be multiple tasks, but in fact you're creating only one, and only after you've read the entire file. You probably want something like this:

    List<Task> taskList = new List<Task>();
    TaskFactory tf = new TaskFactory(TaskCreationOptions.LongRunning, TaskContinuationOptions.ExecuteSynchronously);

    using (StreamReader sr = new StreamReader(filePath))
    {
        while (!sr.EndOfStream)
        {
            List<string> lines = new List<string>();
            while (lines.Count < 500000 && !sr.EndOfStream)
            {
                lines.Add(sr.ReadLine());
            }
            taskList.Add(tf.StartNew(() => ProcessFileBlock(lines)));
        }
    }

    Task.WaitAll(taskList.ToArray());
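    The key detail above is that a new `lines` list is created inside the loop, so each task's closure captures its own block. A small standalone illustration of why per-iteration variables matter with closures (the integer values are purely for demonstration):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class CaptureDemo
{
    public static List<int> Run()
    {
        var tasks = new List<Task>();
        var results = new ConcurrentBag<int>();

        for (int i = 0; i < 3; i++)
        {
            // Fresh variable per iteration, like the new List<string> per block.
            int copy = i;
            tasks.Add(Task.Run(() => results.Add(copy)));
        }

        Task.WaitAll(tasks.ToArray());

        var sorted = new List<int>(results);
        sorted.Sort();
        return sorted;   // 0, 1, 2
    }
}
```

    Capturing the `for` loop variable `i` directly would let every task observe whatever value `i` holds when the task finally runs (often the final value); reusing one shared `lines` list across blocks fails in the same way.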
    

    Wednesday, May 11, 2016 5:50 AM
    Moderator
  • That's my mistake; I wrote the pseudo code incorrectly.  In my actual code the task creation is inside the while loop, as you show.  So for every 500,000 lines a new task is created to process those lines.
    Wednesday, May 11, 2016 6:03 PM
  • Have you determined if ProcessFileBlock is actually called (using breakpoints, for example)?

    Wednesday, May 11, 2016 6:23 PM
  • ProcessFileBlock is being called, because in this method I create the smaller file that contains the part of the file just read.  The file is created, but there is no data in it.  I'm having a hard time debugging the application asynchronously because the tasks are not showing up in either the Tasks or Threads window in Visual Studio.

    The way it's currently written, a new task is kicked off for every 500K records read through the while loop.  Running the code synchronously, 169 smaller files are created, each containing a portion of the file.  But when using the TaskFactory, the files continue to be created, reaching file 50, 100, etc.  I would have expected file creation to stop at some point because of resource constraints caused by creating, transforming, and populating the file.  For example, maybe 5 smaller files would be created and processed at one time.

    Wednesday, May 11, 2016 9:52 PM
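  • Regarding the empty files: the pseudo code's ProcessFileBlock likely has two problems. StreamWriter.Write(lines) has no overload for a List<string>, so at best it writes the list's ToString() output rather than the rows, and the writer is never disposed, so buffered output may never be flushed to disk. A sketch of a version that writes each line and disposes the writer (the `newFile` parameter and helper class name are illustrative, not from the thread):

```csharp
using System.Collections.Generic;
using System.IO;

static class BlockWriter
{
    // Writes one block of (transformed) lines to its own file. The using
    // block disposes the StreamWriter, which flushes buffered data to disk.
    public static void ProcessFileBlock(string newFile, List<string> lines)
    {
        using (var sw = new StreamWriter(newFile))
        {
            foreach (var line in lines)
            {
                // transform the row here before writing it
                sw.WriteLine(line);
            }
        }
    }
}
```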