Answered Use TPL to extract huge data from database

  • Friday, August 17, 2012 2:44 AM
     
     

    Hi,

    I have  unique requirement. My customer has a table with some 4 Billion rows. He uses DB2. One unusual had come in as per which we need to extract these data into a CSV files and archive them. This task needs to happen through windows service. We did a vanilla select and used data reader and used system.IO.File to create the CSV and write it. But that took more than 4 hours for completion and the usage of server resources did not happen optimally. I would need an approach where by I can fire some concurrent SQL statement to DB2 database and also write these to CSV file concurrently. I have the following constraints

    1) The number of parallel connection should not exceed three

    2) The output has to go as a single CSV file.

    The question how should I use TPL to do some parallel queries and merge this into a single CSV. How will determine how many threads should I spawn to database these 4 billion records. How will the combine them. I do not want to receive any memory exception.

    I'm still on .NET 4.0 and using VS 2010 only

    Thanks and regards


    Venkatesh. S|MCTS(WCF, ADO.NET 3.5)|eMail: heman_1978@hotmail.com

All Replies

  • Friday, August 17, 2012 2:56 PM
     
     Proposed

    wouldn't it be easier to use DB2 to export the data to CSV?

    take a look at site:DB2 Commands

  • Saturday, August 18, 2012 2:28 AM
     
     

    Hi John,

    For some reason it was identified that we will use a .NET4 Layer only. It was a pre-existing code that was taken for maintainence and we want to bring in parallelism to cut down the time for this extract.

    Thanks and regards

    Venkatesh


    Venkatesh. S|MCTS(WCF, ADO.NET 3.5)|eMail: heman_1978@hotmail.com

  • Wednesday, August 29, 2012 2:45 AM
     
     

    Hi,

    Is their some one looking into these forums? I'm stuck with a problem for quite some time expecting a solution. Any links/ guidance offered will be of immense help.

    Thanks and regards

    Venkatesh. S


    Venkatesh. S|MCTS(WCF, ADO.NET 3.5)|eMail: heman_1978@hotmail.com

  • Thursday, November 29, 2012 5:42 PM
    Owner
     
     Answered

    Forgetting the database for a moment, it sounds like you're just looking to do three operations concurrently and then operate on the results to store them all into the same file.  There are a variety of ways that can be done with .NET 4.  For example using Tasks directly:

    Task<string> [] tasks = new [] {
        Task.Factory.StartNew(() => /* first work */),
        Task.Factory.StartNew(() => /* second work */),
        Task.Factory.StartNew(() => /* third work */) };
    foreach(var task in tasks)
    {
        string partialResult = task.Result;
        WriteOutResultToFile(partialResult);
    }

    or using PLINQ:

    IEnumerable<Input> inputs = ...;
    var results = inputs.AsParallel().Select(input => /* process input to get result string */);
    foreach(var partialResult in results)
    {
        WriteOutResultToFile(partialResult);
    }

    With that, it's just a matter of you figuring out what each piece of work is, which I expect would be to open a connection to the database and issue a request for the relevant portion of the results you want.  If you need help with constructing such a query, I'd suggest asking in a DB2 specific forum rather than here.