Best way to process rows in batches from a raw file?
- I have written 2000 rows to a raw file after a certain amount of processing in a data flow. Now, I need to break these rows up into chunks of 500 rows and write them to 4 different flat files (with both the detail rows and an aggregate row for each batch). What is the best sequence of flows and components to accomplish this? I am thinking of loading the data into a record set, then creating a loop container to loop through the records 500 at a time with a data flow task inside it? Does this sound like the right direction? Any suggestions on a best way??
Any input would be greatly appreciated!!
Thanks in advance!
Answers
The batch size will stay the same. The number of destinations(flat files) and their names will change based on the total number of records in the raw file. I have it working with the dataflow inside the for loop then using a script task that has 2 outputs. The first output gets the first number of rows where a counter is <=50 and the second output (any row over 50) over writes the working raw file to be used by the next iteration of the loop.
I used the technique to create the two outputs from here:
http://consultingblogs.emc.com/jamiethomson/archive/2005/09/05/2113.aspx.
There is more work I have to do to really get it to do everything I need, but I think I have the basic functionality. Unless you can suggest something that is much better.
Thank you very much for your time and expertise!- Marked As Answer byNai-dong Jin - MSFTMSFT, ModeratorFriday, November 13, 2009 2:59 AM
All Replies
- Instead of looping through, I'd do this way:-Use a DataFlow task and add a Raw File Source to read from the raw file. Attach it to a Script Component, that'd update a new column, say RowNumber, in my data flow and attach a value to it via a incrementing Counter in the script. Then, I;d do a conditional split, splitting my 2000 records to 4 different sets based on the values of my RowNumber column ( 1 to 500, 501 to 1000, 1001 to 1500, and 1501 to 2000).There can be more (elegant :P) ways of doing this, as you'll hear from people here.Cheers!M.Edit: Just to make it more clear, add the RowNumber column by using a DerivedColumn control just after the RawFileSource component. And don't forget to double click the Script component ( placed right after the Derived Column component), "check" the RowNumber column in "Input Columns" tab and making it as "ReadWrite". Here is the code I have used in my script component:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
/*
Add your code here
*
*/
Row.RowNum = iRowCounter;
iRowCounter = iRowCounter + 1;
}
- Hi M
Thanks for the suggestion! I do need to loop or be able to create a dynamic number of destinations. It won't always be 2000 rows. While I was waiting for answers, I did the following:
-use a data flow task to make a working copy of the raw file
-then use a for loop container where variable @RowsToProcess>0, incrementing variable @int_batchNum
-then a dataflow
-with working rawfile source
-to a rowSampling transform
-that outputs to the finalTxtFile[@int_batchNum].txt and
-what isn't part of the sample goes to a rowcount transform that updates @RowsToProcess and
-then uses a raw file destination transform to overwrite the working raw file.
This actual worked! However, I think I will have to change to a script instead of the row sampling because that appears to be random and my rows need to stay in a certain order.
Am I doing this efficiently or is this a mess? I am so new to SSIS that I just don't really get it. I will suggest using For Loop Container, declare batchsize variable and then you can set value on "For Loop Editor" as
InitExpression = @EndRecordNo = @RecordInOneBatch
EvalExpression = @EndRecordNo<=@TotalRec + @RecordInOneBatch
AssignExpression = @EndRecordNo = @EndRecordNo + @RecordInOneBatch
You need to initialize
RecordInOneBatch to 500
EndRecordNo to 0
TotalRec = 2000 (if that is fix otherwise rec count)
Once you do that then using ScriptComponent you can achieve your desired result.
As far as creating multiple files are concern so set PackageConfiguration for Connection and set path to Connection inside ScriptComponent.So what parameters are going to change across the executions of your package? Records per batch (i.e count of records in each destination split)? The total number of records you fetch ( I know u mentioned it above), the ordering criteria of your records?
Experssions come in handy when you need to build some dynamism in your package and Variables in the package help achieve this. Please post your exact requirements and we should be able to suggest something worthwhile.
Cheers!
M.The batch size will stay the same. The number of destinations(flat files) and their names will change based on the total number of records in the raw file. I have it working with the dataflow inside the for loop then using a script task that has 2 outputs. The first output gets the first number of rows where a counter is <=50 and the second output (any row over 50) over writes the working raw file to be used by the next iteration of the loop.
I used the technique to create the two outputs from here:
http://consultingblogs.emc.com/jamiethomson/archive/2005/09/05/2113.aspx.
There is more work I have to do to really get it to do everything I need, but I think I have the basic functionality. Unless you can suggest something that is much better.
Thank you very much for your time and expertise!- Marked As Answer byNai-dong Jin - MSFTMSFT, ModeratorFriday, November 13, 2009 2:59 AM


