Use TPL to extract huge data from database
-
Friday, August 17, 2012 2:44 AM
Hi,
I have unique requirement. My customer has a table with some 4 Billion rows. He uses DB2. One unusual had come in as per which we need to extract these data into a CSV files and archive them. This task needs to happen through windows service. We did a vanilla select and used data reader and used system.IO.File to create the CSV and write it. But that took more than 4 hours for completion and the usage of server resources did not happen optimally. I would need an approach where by I can fire some concurrent SQL statement to DB2 database and also write these to CSV file concurrently. I have the following constraints
1) The number of parallel connection should not exceed three
2) The output has to go as a single CSV file.
The question how should I use TPL to do some parallel queries and merge this into a single CSV. How will determine how many threads should I spawn to database these 4 billion records. How will the combine them. I do not want to receive any memory exception.
I'm still on .NET 4.0 and using VS 2010 only
Thanks and regards
Venkatesh. S|MCTS(WCF, ADO.NET 3.5)|eMail: heman_1978@hotmail.com
All Replies
-
Friday, August 17, 2012 2:56 PM
wouldn't it be easier to use DB2 to export the data to CSV?
take a look at site:DB2 Commands
- Proposed As Answer by Stephen Toub - MSFTMicrosoft Employee, Owner Tuesday, August 28, 2012 5:36 PM
-
Saturday, August 18, 2012 2:28 AM
Hi John,
For some reason it was identified that we will use a .NET4 Layer only. It was a pre-existing code that was taken for maintainence and we want to bring in parallelism to cut down the time for this extract.
Thanks and regards
Venkatesh
Venkatesh. S|MCTS(WCF, ADO.NET 3.5)|eMail: heman_1978@hotmail.com
-
Wednesday, August 29, 2012 2:45 AM
Hi,
Is their some one looking into these forums? I'm stuck with a problem for quite some time expecting a solution. Any links/ guidance offered will be of immense help.
Thanks and regards
Venkatesh. S
Venkatesh. S|MCTS(WCF, ADO.NET 3.5)|eMail: heman_1978@hotmail.com
-
Thursday, November 29, 2012 5:42 PMOwner
Forgetting the database for a moment, it sounds like you're just looking to do three operations concurrently and then operate on the results to store them all into the same file. There are a variety of ways that can be done with .NET 4. For example using Tasks directly:
Task<string> [] tasks = new [] {
Task.Factory.StartNew(() => /* first work */),
Task.Factory.StartNew(() => /* second work */),
Task.Factory.StartNew(() => /* third work */) };
foreach(var task in tasks)
{
string partialResult = task.Result;
WriteOutResultToFile(partialResult);
}or using PLINQ:
IEnumerable<Input> inputs = ...;
var results = inputs.AsParallel().Select(input => /* process input to get result string */);
foreach(var partialResult in results)
{
WriteOutResultToFile(partialResult);
}With that, it's just a matter of you figuring out what each piece of work is, which I expect would be to open a connection to the database and issue a request for the relevant portion of the results you want. If you need help with constructing such a query, I'd suggest asking in a DB2 specific forum rather than here.
- Marked As Answer by Stephen Toub - MSFTMicrosoft Employee, Owner Thursday, November 29, 2012 5:42 PM

