Validation - Best Practices

  • Question

  • I wanted to get others' feedback on the common/best practices used for validating data in SSIS ETL processes. Here are some specific areas I have in mind:

    1) Validating field lengths (min/max)
    2) Validating data types
    3) Validating string formats (e.g., phone numbers, zip codes, email addresses)
    4) Validating required fields
    5) How to respond to broken validations
    6) If and how to handle multiple broken validations per value

    7) Also, I'd like to know if others are using custom components for validation. For example, we're designing a C# library to use for standard validation across web sites, C# apps, and possibly SSIS (see the rough sketch below). But I'm not sure if this is a recommended approach, or if we should stick to using native SSIS components for data validation.
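
    For illustration, something along these lines is what I mean by a shared library; the type and method names here are hypothetical:

        // Hypothetical sketch of the kind of shared validation library
        // described above; names and rules are illustrative only.
        using System.Text.RegularExpressions;

        public static class StandardValidation
        {
            // Required-field check: non-null and not just whitespace.
            public static bool IsRequired(string value)
            {
                return !string.IsNullOrEmpty(value) && value.Trim().Length > 0;
            }

            // Min/max length check.
            public static bool IsWithinLength(string value, int min, int max)
            {
                return value != null && value.Length >= min && value.Length <= max;
            }

            // Example format check (US zip code, 5 or 5+4 digits).
            public static bool MatchesUsZipCode(string value)
            {
                return value != null && Regex.IsMatch(value, @"^\d{5}(-\d{4})?$");
            }
        }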

    Thanks in advance for your feedback.
    Jerad
    Tuesday, January 6, 2009 10:27 PM

Answers

  • My experiences / suggestions:

    1) Field lengths, string formats, required fields, and the like are easy to handle in Conditional Split transforms. And it's fast. However, this has the downside of hardcoding your validation logic, so it's not very reusable.
    2) I usually use Derived Column transforms for Data Type validation, and enable error redirection to catch failures.
    5) Broken validations are usually handled by error redirection in the data flow.
    6) You can do this, but it's a little messy. You have to add some type of flag column to your data flow, and use a Derived Column transform to set it appropriately after each error check (see the sketch after this list).
    7) Yes, I have used custom components. The advantage is that the logic is reusable across multiple packages, you can catch multiple errors, etc. The downside is that there can be a performance impact.
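
    A Script Component can do that flag-column accumulation in one place. Here's a minimal sketch, assuming input columns named Phone and Email and an added ValidationErrors output column; the column names, rules, and regex pattern are illustrative, not prescriptive:

        // Sketch of an SSIS Script Component (transformation) body.
        // Input0Buffer and UserComponent are generated by the designer;
        // Phone, Email, and ValidationErrors are assumed columns.
        using System.Collections.Generic;
        using System.Text.RegularExpressions;

        public class ScriptMain : UserComponent
        {
            public override void Input0_ProcessInputRow(Input0Buffer Row)
            {
                List<string> errors = new List<string>();

                // Required-field check.
                if (Row.Phone_IsNull || Row.Phone.Trim().Length == 0)
                    errors.Add("Phone missing");
                // Format check (pattern is an example only).
                else if (!Regex.IsMatch(Row.Phone, @"^\d{3}-\d{3}-\d{4}$"))
                    errors.Add("Phone format");

                if (Row.Email_IsNull || !Row.Email.Contains("@"))
                    errors.Add("Email invalid");

                // Record every broken rule instead of stopping at the
                // first, so one row can report multiple failed validations.
                Row.ValidationErrors = string.Join("; ", errors.ToArray());
            }
        }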


    Edit: Also, if you can't easily do the validation you want in the data flow, consider moving it into SQL. I've found it better on occasion to stage the data to a table, do the validation there, and then pull it back into the data flow.


    John Welch | www.mariner-usa.com | www.agilebi.com
    • Marked as answer by Tony Tang_YJ Wednesday, January 14, 2009 7:53 AM
    Tuesday, January 6, 2009 10:44 PM

All replies

  • Also, if you can't do the validation you want in the data flow, consider staging the data to a table and doing it in SQL. In some cases, I've found that approach quick to implement and effective.
    John Welch | www.mariner-usa.com | www.agilebi.com
    Tuesday, January 6, 2009 10:49 PM
  • You can also use regular expressions inside a Script Component used as a transformation. For example:
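
    Here's a brief sketch of that idea; the column names (Email, IsValidEmail) and the pattern are assumptions for illustration:

        // Sketch of a Script Component transformation applying a regex.
        using System.Text.RegularExpressions;

        public class ScriptMain : UserComponent
        {
            // Compiled once and reused for every row in the buffer.
            private static readonly Regex EmailPattern =
                new Regex(@"^[^@\s]+@[^@\s]+\.[^@\s]+$", RegexOptions.Compiled);

            public override void Input0_ProcessInputRow(Input0Buffer Row)
            {
                // Flag the row rather than failing the component, so a
                // downstream Conditional Split can redirect invalid rows.
                Row.IsValidEmail = !Row.Email_IsNull && EmailPattern.IsMatch(Row.Email);
            }
        }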

    Visit http://www.pedrocgd.blogspot.com - If your question is answered, please mark as answered.
    Tuesday, January 6, 2009 11:16 PM