Optimize Very Large Export to Flat File

  • Question

  • I have an SSIS package that exports several tables to CSV files. Some of the files become very large (> 50 GB). I'm wondering what, if anything, can be done to tweak the export to make those files as small as possible. The codepage I am using for the files is UTF-8. I have left the data types in the flat file at the defaults the SSIS designer chose. All of the character fields in the output file are string [DT_STR] with various output column widths (again, set by the SSIS designer based on a data sample). Unicode is not selected as an option on the connection properties. All of the characters in the database are Latin characters. Everything looks good to me, but the file size seems a bit excessive. I just wanted to make sure I wasn't overlooking anything obvious that could make the files smaller, such as changing the codepage, or some other trick. I suspect many SSIS users are not doing routine exports of flat files this large, but it is a regular task at my workplace. I do compress the files later; however, making the files even 20% smaller during the export itself would save a lot of disk IO. Any suggestions or tips would be appreciated.
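
    A quick way to see which columns dominate the byte count is to profile a sample of the file. A minimal Python sketch (the path "export.csv" is hypothetical, and a header row is assumed):

        import csv

        # Sample the first N rows of a pipe-delimited export and report the
        # average bytes each column contributes, to spot the columns that
        # dominate the file size. "export.csv" is a hypothetical path.
        SAMPLE_ROWS = 100_000

        with open("export.csv", newline="", encoding="utf-8") as f:
            reader = csv.reader(f, delimiter="|")
            header = next(reader)
            totals = [0] * len(header)
            rows = 0
            for row in reader:
                for i, field in enumerate(row[:len(totals)]):
                    totals[i] += len(field.encode("utf-8"))
                rows += 1
                if rows >= SAMPLE_ROWS:
                    break

        for name, total in sorted(zip(header, totals), key=lambda t: -t[1]):
            print(f"{name}: {total / rows:.1f} avg bytes/row")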

    Thanks in advance,

    Bill

    Friday, December 30, 2011 3:32 AM

Answers

  • Hi Bill_Powell,

    We can configure the Data Flow task for better performance:

    1. Configure the following properties of the Data Flow task, all of which affect performance:

    1) Specify the locations for temporary storage of buffer data (BufferTempStoragePath property) and of columns that contain binary large object (BLOB) data (BLOBTempStoragePath property).
    2) Define the default size of the buffer that the task uses, by setting the DefaultBufferSize property, and define the maximum number of rows in each buffer, by setting the DefaultBufferMaxRows property.
    3) Set the number of threads that the task can use during execution, by setting the EngineThreads property.
    4) Indicate whether the Data Flow task runs in optimized mode (RunInOptimizedMode property).

    2. Adjust the Sizing of Buffers: The data flow engine begins the task of sizing its buffers by calculating the estimated size of a single row of data. Then it multiplies the estimated size of a row by the value of DefaultBufferMaxRows to obtain a preliminary working value for the buffer size (a worked example of this arithmetic follows the list).

    3. Configure the Package for Parallel Execution: Parallel execution improves performance on computers that have multiple physical or logical processors. To support parallel execution of different tasks in the package, Integration Services uses two properties: MaxConcurrentExecutables and EngineThreads.

    4. The MaxConcurrentExecutables Property: The MaxConcurrentExecutables property is a property of the package itself. This property defines how many tasks can run simultaneously. The default value is -1, which means the number of physical or logical processors plus 2.

    5. The EngineThreads Property: The EngineThreads property is a property of each Data Flow task. This property defines how many threads the data flow engine can create and run in parallel.
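
    A worked example of the buffer-sizing arithmetic from point 2; all the numbers below are illustrative, not from any real package (Python):

        # Illustrative buffer-sizing arithmetic (all numbers are made up).
        # The engine estimates one row's size, multiplies it by
        # DefaultBufferMaxRows, and caps the result at DefaultBufferSize.
        estimated_row_bytes = 500                 # estimate for one row
        default_buffer_max_rows = 10_000          # DefaultBufferMaxRows
        default_buffer_size = 10 * 1024 * 1024    # DefaultBufferSize (10 MB)

        working_size = estimated_row_bytes * default_buffer_max_rows  # 5,000,000
        if working_size > default_buffer_size:
            # Too big: the engine lowers the row count so one buffer fits.
            rows_per_buffer = default_buffer_size // estimated_row_bytes
        else:
            rows_per_buffer = default_buffer_max_rows

        print(rows_per_buffer)  # 10000 here, since 5 MB fits in the 10 MB cap

    And if you run the package with dtexec, the properties above can be overridden at run time with /Set instead of editing the package. A sketch (the package path and the task name "Data Flow Task" are placeholders; adjust the property paths to your own names):

        import subprocess

        # Override the performance properties at run time via dtexec /Set.
        # The file path and "Data Flow Task" are hypothetical placeholders.
        cmd = [
            "dtexec", "/F", r"C:\packages\Export.dtsx",
            "/Set", r"\Package.Properties[MaxConcurrentExecutables];4",
            "/Set", r"\Package\Data Flow Task.Properties[DefaultBufferSize];104857600",
            "/Set", r"\Package\Data Flow Task.Properties[DefaultBufferMaxRows];50000",
            "/Set", r"\Package\Data Flow Task.Properties[EngineThreads];10",
        ]
        subprocess.run(cmd, check=True)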

    For more information about the topic, please see:
    Improving the Performance of the Data Flow: http://msdn.microsoft.com/en-us/library/ms141031.aspx

    Thanks,
    Eileen

    • Marked as answer by Bill_Powell Friday, April 19, 2013 5:11 PM
    Thursday, January 5, 2012 9:02 AM

All replies

  • Hello Bill,

    A question: what consumes these large files? Because, in general, it is better to figure out which columns are not subsequently used.

    You may want to try converting the data to binary (but not Base64); see http://www.codeproject.com/KB/security/TextCoDec.aspx
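
    To illustrate the size difference, a minimal Python sketch (not from the linked article):

        import struct

        # The same numeric value as delimited text vs. as a fixed-width
        # binary field: binary is less than half the size here.
        value = 1234567890
        as_text = str(value).encode("ascii")   # 10 bytes: b"1234567890"
        as_binary = struct.pack("<i", value)   # 4 bytes: 32-bit little-endian
        print(len(as_text), len(as_binary))    # prints: 10 4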


    Arthur My Blog
    Friday, December 30, 2011 3:48 AM
  • I'm not familiar with the downstream data warehouse that consumes these files. I know for sure that it is not SQL Server. I have only been given very basic specifications to follow.

    • CSV file
    • Columns delimited by a vertical bar

    So I am constrained to a text file format. I've also been told that if the downstream process cannot handle Unicode we may have to change it to ISO 8859 Western European, which I do not see as an option in SSIS. I have, however, conducted a review of required columns and have already eliminated many columns that were not in use. Not sure if there is any more optimizing I can do.
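
    For what it's worth, for mostly ASCII Latin data a codepage change buys very little, since UTF-8 already stores ASCII in one byte; only accented characters shrink. A quick Python check (the sample strings are made up):

        # Compare encoded sizes under UTF-8 vs. ISO 8859-1 (Latin-1).
        samples = ["plain latin text", "café münchen"]
        for s in samples:
            print(s, len(s.encode("utf-8")), len(s.encode("iso-8859-1")))
        # "plain latin text" -> 16 vs 16 bytes (pure ASCII: identical)
        # "café münchen"     -> 14 vs 12 bytes (only accented chars shrink)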

     

    Friday, December 30, 2011 4:12 AM
  • Hi Bill

    I know that your issue is with how big the file is, the character code, etc., but you mentioned that you are dealing with >50 GB and said, "Not sure if there is any more optimizing I can do."

    Yes, you can optimize the package further by the following (see the sketch after this list):

    1. Speed up the reading from your table (Link)

    2. Read in parallel (link)
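
    As a rough illustration of the parallel-read idea outside SSIS, a Python sketch (the DSN, table name, and key column are all hypothetical):

        import concurrent.futures

        import pyodbc  # any DB-API driver would work the same way

        CONN_STR = "DSN=warehouse"  # hypothetical connection string
        RANGES = [(0, 1_000_000), (1_000_000, 2_000_000), (2_000_000, 3_000_000)]

        def export_range(lo, hi):
            # Each worker reads its own key range and writes its own file
            # part; the parts can be concatenated afterwards.
            conn = pyodbc.connect(CONN_STR)
            cursor = conn.cursor()
            cursor.execute(
                "SELECT * FROM big_table WHERE id >= ? AND id < ?", lo, hi)
            with open(f"part_{lo}.csv", "w", encoding="utf-8") as out:
                for row in cursor:
                    out.write("|".join(str(v) for v in row) + "\n")
            conn.close()

        with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
            list(pool.map(lambda r: export_range(*r), RANGES))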


    Sincerely SH -- MCITP 2008, MCTS 2008 & 2005 -- Please kindly mark the post(s) that answered your question and/or vote for the post(s).
    Friday, December 30, 2011 3:53 PM
  • Thanks for the optimization tips. I was not aware of the parallel-read method. I don't believe it will make any difference in my case, since I have several tasks exporting at once (so I'm guessing that all CPUs are busy), but more importantly I believe that the bottleneck in this process will be writing to disk. As stated above, I am already limiting the OLE DB input to required columns only.
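
    One thing that might help with the disk IO specifically: since the files get compressed later anyway, compressing while writing means the uncompressed 50 GB never hits disk. That would have to happen outside the flat-file destination; a minimal Python sketch (the path and the row source are hypothetical):

        import csv
        import gzip

        # Stream pipe-delimited rows through gzip as they are produced, so
        # only the compressed bytes are ever written to disk.
        def rows():
            # Stand-in for whatever actually produces the data.
            yield ["id", "name"]
            yield [1, "example"]

        with gzip.open("export.csv.gz", "wt", encoding="utf-8", newline="") as f:
            writer = csv.writer(f, delimiter="|")
            for row in rows():
                writer.writerow(row)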
    Friday, December 30, 2011 6:04 PM