Lots of columns to check/convert - best practice?

  • Question

  • I've got about 6 formats of flat files coming through data flows and heading into a relational database, then later into a data warehouse.  In each file type, I've got about 60-70 columns that need basically two levels of validation: first straight data type conversions, then a second, finer level of checks.  Some of the data overlaps between file types; for instance, some lookup codes are maintained in more than one file.

    The data is pretty dirty, so I'm keeping everything as varchar coming into the staging area, just to get the data into the system, because the users insist on having some form of the data in the system, no matter how dirty it is.

    So then I run my two steps: first, converting data types from varchar to bit/datetime/int, etc., as applicable; then the finer level of validation, such as range checks.  My question is: with so many columns, what's the most efficient way to do all these checks in the data flow and record the errors from each check?  Do I put one column check after another on the success constraint, do the lookup/range/other check, record the error if one is encountered, and then move on to the next column?  Or would it be better to multicast the stream out into 60 flows, do all the needed checks, and then Union All the good rows back together at the end?  Anything to help save some headache, because this data is dirty and there's a lot of it.

    Wednesday, June 27, 2007 1:05 AM
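
To make the two-pass idea above concrete, here is a minimal T-SQL sketch of the first pass (type conversion with error capture) against a hypothetical all-varchar staging table. The table and column names are invented, and since this is the SQL Server 2005 era, ISDATE plus a digits-only LIKE pattern stand in for the TRY_CONVERT of later versions; ISNUMERIC is avoided on purpose, because it accepts tokens such as "$" and "1e4" that won't convert to int.

    -- Hypothetical structures; the real files would have 60-70 such columns.
    CREATE TABLE stg_Orders (
        StagingId int IDENTITY(1,1) PRIMARY KEY,
        OrderDate varchar(50),   -- everything lands in staging as varchar
        OrderQty  varchar(50)
    );
    CREATE TABLE stg_Orders_Errors (
        StagingId   int,
        ColumnName  sysname,
        RawValue    varchar(50),
        ErrorReason varchar(200)
    );
    CREATE TABLE ods_Orders (
        StagingId int,
        OrderDate datetime,
        OrderQty  int
    );

    -- Pass 1a: log every value that fails its type check, one error row per bad value.
    -- Note ISDATE(NULL) = 0, so NULLs are caught, and ISDATE honors the session's
    -- language/dateformat settings.  The digits-only pattern rejects signs and
    -- decimals; widen it if the feed legitimately contains them.
    INSERT INTO stg_Orders_Errors (StagingId, ColumnName, RawValue, ErrorReason)
    SELECT StagingId, 'OrderDate', OrderDate, 'not a valid datetime'
    FROM stg_Orders
    WHERE ISDATE(OrderDate) = 0
    UNION ALL
    SELECT StagingId, 'OrderQty', OrderQty, 'not a valid int'
    FROM stg_Orders
    WHERE OrderQty IS NULL OR LEN(OrderQty) = 0 OR OrderQty LIKE '%[^0-9]%';

    -- Pass 1b: convert only the rows where every column passed.  The CASE guards
    -- stop the optimizer from evaluating CONVERT on rows the WHERE clause
    -- would have filtered out.
    INSERT INTO ods_Orders (StagingId, OrderDate, OrderQty)
    SELECT StagingId,
           CASE WHEN ISDATE(OrderDate) = 1 THEN CONVERT(datetime, OrderDate) END,
           CASE WHEN OrderQty NOT LIKE '%[^0-9]%' AND LEN(OrderQty) > 0
                THEN CONVERT(int, OrderQty) END
    FROM stg_Orders
    WHERE ISDATE(OrderDate) = 1
      AND OrderQty NOT LIKE '%[^0-9]%'
      AND LEN(OrderQty) > 0;

The second, finer pass (range and lookup checks) would run the same shape of statement against the typed ods_Orders rows.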


All replies

  • You might want to consider a conditional split, so you can implement multiple checks. Just be aware that each row is sent down the output for the first condition it matches.
    Wednesday, June 27, 2007 1:52 AM
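
The ordering caveat above deserves emphasis: a conditional split evaluates its conditions top to bottom and routes each row to the first output whose condition is true, much like a CASE expression. A hypothetical T-SQL rendering, reusing the staging table sketched after the question:

    -- First-match semantics: a row with an empty OrderQty AND an unconvertible
    -- OrderDate lands in the 'bad qty' bucket purely because that condition is
    -- listed first; swap the two WHENs and the same row lands in 'bad date'.
    SELECT StagingId,
           CASE
               WHEN OrderQty IS NULL OR LEN(OrderQty) = 0 THEN 'bad qty'
               WHEN ISDATE(OrderDate) = 0                 THEN 'bad date'
               ELSE 'ok'   -- corresponds to the split's default output
           END AS Bucket
    FROM stg_Orders;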
  • Not sure what I'd be splitting on?  My 6 file formats will be handled in different packages or data flows, so really I'm just working with one file.  And in that file there are lots of columns, with multiple "checks" to run for each column.  Sorry if the first post was misleading.  Based on this, any other ideas?
    Wednesday, June 27, 2007 12:29 PM
  • Sorry, I should have been a bit more clear. I was suggesting implementing each "check" as a condition in the conditional split. You'd then take the output associated with each check, do the necessary processing, and use a Union All to bring it back into the same flow.
    Wednesday, June 27, 2007 1:20 PM
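
In other words, the suggested layout is a conditional split, one cleanup branch per check, then a Union All. A rough set-based analogue of the same shape, again using the hypothetical staging table from the earlier sketch, where each SELECT plays the role of one split output; each additional check becomes one more condition on the split and one more branch feeding the Union All:

    -- Branch 1: rows that pass every check flow straight through.
    SELECT StagingId,
           CASE WHEN ISDATE(OrderDate) = 1 THEN CONVERT(datetime, OrderDate) END AS OrderDate,
           CASE WHEN OrderQty NOT LIKE '%[^0-9]%' AND LEN(OrderQty) > 0
                THEN CONVERT(int, OrderQty) END                                  AS OrderQty
    FROM stg_Orders
    WHERE ISDATE(OrderDate) = 1
      AND OrderQty NOT LIKE '%[^0-9]%' AND LEN(OrderQty) > 0

    UNION ALL

    -- Branch 2: valid date but missing quantity; this branch's "processing"
    -- is to default the quantity to 0 (logging not shown).
    SELECT StagingId,
           CASE WHEN ISDATE(OrderDate) = 1 THEN CONVERT(datetime, OrderDate) END,
           0
    FROM stg_Orders
    WHERE ISDATE(OrderDate) = 1
      AND (OrderQty IS NULL OR LEN(OrderQty) = 0);

    -- Rows with an unconvertible date match neither branch; in the data flow
    -- they would be sent down an error output rather than into the Union All.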