How To Process flat files with header and footer?

  • Question

  • Hello,

     

    I am designing an SSIS application that will upload CSV files to the database. The files may be in any one of 5 different layouts, and each will have header, footer and detail records.

     

    The file layout can be identified by both file name and header record.  Information from header, footer and detail records must be stored in the db.

     

    I am currently investigating using the Conditional Split task to identify the file type; each file type will then have a further conditional split to identify header, footer and detail records.

     

    This approach seems somewhat long-winded, so I would be grateful for feedback: is it the recommended approach, or would you have any better suggestions?

     

    Many Thanks

     

    Paul


    Tuesday, September 30, 2008 4:13 PM

Answers

  • Paul,

     

    There are many ways to skin a cat. The answer as usual is "it depends". Your approach is perfectly valid, under certain circumstances. Having multiple data flows would mean that processing could occur in parallel rather than sequentially as with the loop. Things to consider:

    • How frequently will this package run? Daily, intra-day, ad hoc?
    • Do the packages need to be processed as a set?
    • Do you need to first validate that all files are there for a particular date?
    • Should the package be able to load multiple versions of the same file (with different date stamps)?
    • Should the loads be dependent, i.e. if one file fails then the whole package fails, or should you be able to load 4 of the files and then just re-load the bad one?
    • How big are these files, and will you gain any benefit from parallelizing the execution?

    These and many other considerations will determine whether it's better to do this as:

    1. a single package with a single data flow
    2. a single package with multiple data flows
    3. multiple packages (one per file) with a single data flow

    You need to strike the right balance of performance and maintainability in your scenario.

     

    Personally, I like the idea of having a pre-processing step that splits up each file by row type into multiple files, followed by subsequent step(s) that have dedicated flat file connection managers with the expected metadata (column names and data types). This makes the subsequent loads less complex: no script task with hard-coded logic and delimiters, no derived column expressions, etc.

     

    Pre-process stage

    Have a generic flat file connection manager which will be updated in the loop

    1. Have a Foreach File loop over the files in the directory
    2. Have 3 expression-based variables that append _header, _footer and _detail to the file name, based on the "current" file in the loop
    3. Have 3 more flat file connections for the outputs (header, detail and footer) and base them on the 3 variables from the previous step, all with just a single column
    4. In the data flow, split the input file out into 3 files and write them out to disk, without stripping out double quotes or splitting columns (a standalone sketch of this step follows below)
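
    Not SSIS itself, but a minimal standalone VB.NET sketch of the splitting logic in step 4, assuming (as in Paul's sample later in this thread) that header rows carry "HDR" and trailer rows "TRL" in their second field; the path and markers here are illustrative only.

    Imports System.IO

    Module PreProcessSplit
        Sub Main()
            Dim inputPath As String = "C:\data\20080928FF.txt" ' hypothetical input file
            Dim basePath As String = Path.Combine( _
                Path.GetDirectoryName(inputPath), _
                Path.GetFileNameWithoutExtension(inputPath))

            Dim header As New StreamWriter(basePath & "_header.txt")
            Dim detail As New StreamWriter(basePath & "_detail.txt")
            Dim footer As New StreamWriter(basePath & "_footer.txt")
            Try
                For Each line As String In File.ReadAllLines(inputPath)
                    ' Route on the row-type marker; rows are written out untouched
                    ' (no quote stripping or column splitting at this stage).
                    If line.Contains("""HDR""") Then
                        header.WriteLine(line)
                    ElseIf line.Contains("""TRL""") Then
                        footer.WriteLine(line)
                    Else
                        detail.WriteLine(line)
                    End If
                Next
            Finally
                header.Close()
                detail.Close()
                footer.Close()
            End Try
        End Sub
    End Module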

    File process stage:

    You can now decide how to load the files, depending on your requirements. As all the header and footer files have a consistent layout, you could build a generic process for this (maybe a child package) that takes the file name as a parameter.

     

    You can now have a separate package per detail file, or one package with multiple connection managers (with the correct metadata)...

    Thursday, October 2, 2008 4:32 PM
    Moderator

All replies

  • The problem scenario you describe ought to be straightforward to handle in SSIS - just not using the Conditional Split.

     

    The Conditional Split component is part of the Data Flow side of SSIS - for when you're actually handling the data in column form.  At that time, it's "too late" to "reformat" your columns depending on input file type.

     

    What you want to do - thankfully because you can tell file format by file name - is make that distinction in the Control Flow.

     

    Basically, you would set up a Foreach container to loop over the files in your directory. Inside the Foreach, you would determine the "file type" - perhaps by creating a variable with a long-winded expression on it that pulls apart your file name and assigns a "file type" value - and then pass control on to one of five Data Flows via conditional connectors. (Double-click on the standard green connector, change its Evaluation Operation to Expression and Constraint, and set the expression to be "file_type_variable == <file_type>".) Each Data Flow then picks apart one "file type".
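
    As a hedged illustration of that variable, here is the kind of logic one might put in a Script Task instead of a long expression, assuming (as Paul describes later in the thread) names like 20080928FF.txt, where the two characters before the extension identify the file type; all names here are hypothetical.

    Imports System
    Imports System.IO

    Module FileTypeFromName
        ' Returns the two characters before the extension, e.g. "FF".
        Function GetFileType(ByVal filePath As String) As String
            Dim stem As String = Path.GetFileNameWithoutExtension(filePath)
            Return stem.Substring(stem.Length - 2) ' assumes the stem has at least 2 chars
        End Function

        Sub Main()
            Console.WriteLine(GetFileType("C:\in\20080928FF.txt")) ' prints FF
        End Sub
    End Module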

    Tuesday, September 30, 2008 4:26 PM
    Moderator
  • Hi Todd,

    Thanks for your help, much appreciated.

    Would a conditional split then be appropriate within the data flow to identify header, detail and footer records? - all of which are of a different format.

    Thanks

    Paul
    Wednesday, October 1, 2008 7:28 AM
  •  Paul_0101 wrote:
    Hi Todd,

    Thanks for your help, much appreciated.

    Would a conditional split then be appropriate within the data flow to identify header, detail and footer records? - all of which are of a different format.

    Thanks

    Paul

     

    If I'm not mistaken, a conditional split won't work in your scenario, because it requires that all the records have the same metadata. A possible solution would be to employ a Script transformation component to filter the records.

     

    hth

    Wednesday, October 1, 2008 7:50 AM
    Moderator
  • Hi,

    I am having trouble changing "Evaluation Operation" to "Expression and Constraint" - I cannot see this anywhere when double-clicking on the green connector?

    thanks

    Paul
    Wednesday, October 1, 2008 9:02 AM
  • I'm guessing you're double-clicking a green connector in the Data Flow tab.  The green arrows in the Data Flow represent streams of "good" data.  It's not possible to "stop" a data flow based on a variable value.  (That's not entirely true, but it's not how you'll end up thinking about it.)

     

    You want to manage precedence (which tasks get run) on the Control Flow tab - it can be very confusing at first!  So make sure you're on the Control Flow tab.  The green arrows on the Control Flow tab mean "do this thing next".  Double-clicking on those green arrows modifies the condition for which to "do this thing next".

     

    Duane is very right though - header and footer flat files are not the easiest to pick apart in SSIS - but you've already got a thread going on how to handle those...

     

    Wednesday, October 1, 2008 3:18 PM
    Moderator
  • Duane,

    This is not necessarily true, as you are dealing with a flat file. You can import the data one row at a time and then set up a custom script task to mark each row with the data flow route it should take via the conditional split.

    This approach is commonly used for complex flat file types.

     

    Cheers,
    AJ

     

    Thursday, October 2, 2008 1:33 AM
  •  Jones0878 wrote:

    Duane,

    This is not necessarily true, as you are dealing with a flat file. You can import the data one row at a time and then set up a custom script task to mark each row with the data flow route it should take via the conditional split.

    This approach is commonly used for complex flat file types.

     

    Cheers,
    AJ

     

     

    As I said in my previous post, it depends upon whether the rows all have the same metadata. Your approach requires that each row have the same metadata.

    Thursday, October 2, 2008 4:57 AM
    Moderator
  • You could try the following:

     

    Configure the flat file connection manager metadata to have just a single varchar column. In the data flow, an entire row is then read into that single string column. You can then use a conditional split (or a script transform) to distinguish between header, footer and detail rows, directing each to a separate output. You then have 2 options:

    • write the detail rows out to a file for which you have another connection manager with all the correct metadata, and then load that file in a subsequent data flow
    • use a derived column (or script component) to split up the single varchar column into multiple columns, e.g. using FINDSTRING and SUBSTRING in a derived column, or the String.Split(delimiter) VB function in a script component (a standalone sketch follows below)
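
    A minimal standalone sketch of the String.Split route, applied to one of Paul's sample detail rows; the column positions are assumptions for illustration only.

    Imports System

    Module SplitDetailRow
        Sub Main()
            Dim raw As String = """FF"",""xx"",""F1233"",""Acme Inc."",""O1234"","""",""Y"",""F9999"",""CORP"",""BLAH BLAH"""
            ' Strip the double quotes first (the REPLACE step), then split on commas.
            ' Note: this naive split breaks if a quoted field ever contains a comma.
            Dim cols() As String = raw.Replace("""", "").Split(","c)
            Console.WriteLine("field 3: " & cols(2)) ' F1233
            Console.WriteLine("field 4: " & cols(3)) ' Acme Inc.
        End Sub
    End Module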

     

    • Proposed as answer by NickDDFox Monday, September 28, 2009 8:41 PM
    Thursday, October 2, 2008 9:32 AM
    Moderator
  • Hello,

    Just to explain my problem further:

    I have the following file layout:

    "FF","HDR","xx","28/09/2008"
    "FF", "xx","F1233","Acme Inc.","O1234","","Y","F9999","CORP","BLAH BLAH"
    "FF", "xx","F1234","Acme","O1235","","Y","F9998","CORP","BLAH BLAH"
    "FF","TRL","xx","28/09/2008",2

    The above file has 2 detail records, a header and a footer. There will be 5 other file types; the header and footer will have the same format as above, but detail records will differ between files. Within any one file, all the detail records will have the same layout.

    The file is called datestampFF.txt, where FF identifies the file type and will be different for all 5 files. The file type can also be identified by the first two characters of each record.

    My current solution comprises a Foreach loop; within this I have a single data flow, and the file name is passed into it.

    The following tasks exist within the data flow:

    1. Flat file source - reads each record into one single column
    2. Derived column - removes double quotes using the expression REPLACE(Line,"\"","")
    3. Conditional split - splits the flow based on the file type taken from the first two characters of each record - SUBSTRING(Line,1,2) == "FF"
    4. 5 script component tasks, one per file type - the outputs of each task are the detail record (each column has been defined and populated in script) and a concatenation of header and footer

    Within the script task, the header is being held in a global variable; when the footer is read (i.e. EOF), the script calls a further task which concatenates the two and updates the database - this will be a global task called by all 5 file-type script component tasks.
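
    For reference, a minimal standalone sketch of that hold-the-header-until-the-footer pattern, with the database update stubbed out and the path and row-type markers assumed to match the sample file above:

    Imports System
    Imports System.IO

    Module HeaderFooterCapture
        Private headerLine As String = ""

        Sub Main()
            For Each line As String In File.ReadAllLines("C:\in\20080928FF.txt") ' hypothetical path
                If line.Contains("""HDR""") Then
                    headerLine = line                       ' hold the header for later
                ElseIf line.Contains("""TRL""") Then
                    SaveBatchInfo(headerLine & "|" & line)  ' footer seen: combine and save
                Else
                    ' Detail record: would be split into columns and loaded here.
                End If
            Next
        End Sub

        Sub SaveBatchInfo(ByVal combined As String)
            ' Stub for the database update described above.
            Console.WriteLine("would save: " & combined)
        End Sub
    End Module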

    As I seem to be going round in circles, my question is: does this sound like a recommended solution, and if not, could you suggest any improvements please?

    Just thinking out loud: as I know the file type from the file name, would having 5 control flows, one per file type, be more efficient than having the one control flow? Currently I will be checking each record for file type - and I already know what it is from the file name. I think this is what Todd was suggesting.

    Todd, could you provide a link to the other thread you mention please? This is the only thread I have on this subject.

    Thanks for your help everyone


    Paul


    Thursday, October 2, 2008 9:53 AM
  • Sorry, Paul - I could have sworn it was you. :)

     

    There are TONS of threads in here about how to process flat files with headers and footers, or multiple line types.  Search for "flat file header footer" and you'll get ten THOUSAND hits.  Here's a sampling - all of which basically boil down to "use a script component in the dataflow" or "use a script task in the control flow".  Read them until you find a wording of the solution that you understand, and have the confidence to implement.

     

    SSIS - Handling different types of record in same flat file source

    Multiple table from one source file

     

    Blog posts:

     

    Processing a Flat File with Header and Detail Rows - John Welch

    Thursday, October 2, 2008 4:10 PM
    Moderator
  • Thanks very much for your reply, Adam - very helpful; I hadn't thought of this approach.

    Everyone else, your input was much appreciated - an excellent forum.

    Paul
    Friday, October 3, 2008 7:50 AM
  • Thanks Adam, that's the most workable solution I've come across so far. I'm a bit disappointed that a toolkit as sophisticated as SSIS 2008 doesn't have a dedicated task to do this, though - it's a common scenario. Nonetheless, your solution works very well. Thank you.
    Monday, September 28, 2009 8:44 PM
  • Hi Adam,

    I got a similar problem, but I am very new to SQL. Can you please help me out with the requirement below?

    I am working on a financial project; the requirement is that an interface file should not contain any duplicated arrangements (e.g. checked via a checksum). This requires the source extraction process to include quality checks to ensure that the same arrangement is not included in interface files more than once.

    Header Record

    Field                      Type        Value
    Record Type Code           String      000
    Record Format Version No   Number      1
    Creation Date/Time Stamp   DateTime    20060601013120
    Region                     String      US

    Trailer Record

    Field                      Type        Value
    Record Type Code           String      999
    Record Format Version No   Number      1
    Creation Date/Time Stamp   DateTime    20060601013120
    Region                     String      US
    Detail Record Count        Number      35001
    Control Total              Number      -2993278934.87

     

    Every interface file will consist of the following record types:

    • Header record - the first record in the file.

    The purpose of the header is to provide identification information for the file that can be checked by the receiving system, for assurance that the correct file was received. There is only one such record per file, and it is the first record in the file.

    • Detail records - there will be zero or many detail records. They follow the header record.

    • Trailer record - the last record in the file.

    The purpose of the trailer record is to provide identification information for the file, plus control data that can be checked by the receiving system for assurance that the correct file was received, and a set of batch control totals the receiving system can use to check that the content of the file has not been altered since it was created.

     

    Control totals: the detail record count excludes the header record and trailer record.

    Control Total 1 = simple sum of "principle_amount_1" over every detail arrangement.

     

    There is only one such record per file and it is the last record in the file.
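
    A minimal standalone sketch of the reconciliation this describes - count the detail rows and sum the amount field, then compare against the trailer's control totals. The delimiter and field positions here are assumptions; adjust them to the real layout.

    Imports System
    Imports System.Globalization
    Imports System.IO

    Module ControlTotalCheck
        Sub Main()
            Dim detailCount As Integer = 0
            Dim amountSum As Decimal = 0D
            Dim trailerCount As Integer = -1
            Dim trailerTotal As Decimal = 0D

            For Each line As String In File.ReadAllLines("C:\in\interface.txt") ' hypothetical file
                Dim cols() As String = line.Split(","c)   ' delimiter assumed
                Select Case cols(0)
                    Case "000"   ' header record: identification only
                    Case "999"   ' trailer record: capture the control totals
                        trailerCount = Integer.Parse(cols(4), CultureInfo.InvariantCulture)
                        trailerTotal = Decimal.Parse(cols(5), CultureInfo.InvariantCulture)
                    Case Else    ' detail record: accumulate
                        detailCount += 1
                        ' principle_amount_1 assumed to be the second field here
                        amountSum += Decimal.Parse(cols(1), CultureInfo.InvariantCulture)
                End Select
            Next

            If detailCount = trailerCount AndAlso amountSum = trailerTotal Then
                Console.WriteLine("control totals match")
            Else
                Console.WriteLine("control totals DO NOT match - reject the file")
            End If
        End Sub
    End Module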

     

    I need some help here on how to approach this requirement. Do I need to create separate tables for the header and footer, with their value types?

     

    How do I checksum the header and footer as per the above requirement?

     

    If you know of any article or code related to the above requirements, please let me know; I will try my best to figure it out.


    Management insists that I do it in MS SSIS.
     

    Thanks in Advance,

     

    D

    Monday, November 9, 2009 12:15 PM
  • Please correct me if I'm wrong, but from the tone of your post it doesn't sound like you have very much (if any?) experience with SSIS. If so, it would be a good idea to get some, even if it's just by working through tutorials and samples.

    In terms of guidance, my post above gives you enough detail about the process to complete the task, and it should be pretty straightforward to follow if you know SSIS.

    So I'm not really sure what else I can do to help, short of a detailed step-by-step which I simply don't have the time or inclination to write. I would suggest you try for yourself and then come back with specific questions if you get stuck. I suggest starting new threads for additional questions, as you're more likely to get an answer from others on this forum (because this thread has already been marked as answered).
    Adam Tappis. (MCSD VB6 & .NET, MCDBA SQL 2000)
    Monday, November 9, 2009 2:12 PM
    Moderator