Design pattern: handling schema changes over a period of time

    Question

  • Hello,

     I’m looking for help with a design pattern to address schema changes in the input files read by U-SQL.

    I’m reading a daily snapshot of a file with X columns. Over time, the number of columns in the file can change to X+n, which causes my extract to break.

    What is the recommended pattern for reading the file set over a date range with differing schemas without breaking my EXTRACT? The same applies to the consumers of my data: what is the best design pattern so that they are not impacted by the schema changes in what I output to Data Lake Store?

    Here is what I’ve researched:

    1. Use the flexible-schema extractor – sample implementation here:

    https://blogs.msdn.microsoft.com/mrys/2016/08/15/how-to-deal-with-files-containing-rows-with-different-column-counts-in-u-sql-introducing-a-flexible-schema-extractor/

    2. Possibly use TVFs (table-valued functions) – see the sketch after this list.
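
    For context, a TVF could insulate downstream consumers by exposing a fixed logical rowset and hiding the physical file schema. A minimal sketch of the idea (the database MyDB, the function name, the column names, and the file layout are all hypothetical, not from this thread):

        // Hypothetical TVF: callers never see the physical file layout.
        // When the file schema changes, only this body is updated; the
        // returned columns stay stable for consumers.
        CREATE FUNCTION IF NOT EXISTS MyDB.dbo.GetSnapshot()
        RETURNS @result TABLE(Col1 string, Col2 string, date DateTime)
        AS
        BEGIN
            @result =
                EXTRACT Col1 string,
                        Col2 string,
                        date DateTime  // virtual column from the file-set pattern
                FROM "/data/{date:yyyy}/{date:MM}/{date:dd}/snapshot.csv"
                USING Extractors.Csv();
        END;

    Consumers would then read @rows = MyDB.dbo.GetSnapshot(); and be unaffected when the function body changes.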


    What is the best way to handle these schema changes seamlessly?


    Thanks!

    Friday, May 25, 2018 12:50 AM

All replies


  • I'm glad I'm not alone with this thought. I have run into this exact scenario in the project I'm working on. The way I have dealt with schemas changing (in my case, new columns on a source table) is with a reload folder in my U-SQL project that holds a script for each source table. When a new column comes in, I update the reload script to add the new column to the existing (archived) files in my Data Lake.

    So I do an EXTRACT of [all historical columns],

    then a SELECT *, "\N" (the null value) AS NewColumn,

    and then I overwrite the existing file. This is not ideal, but it allows me to merge the archived data with the newer files when needed. It's a pretty manual process, but I haven't found a better way to handle it either. A sketch of such a reload script is below.
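
    A minimal sketch of what such a reload script might look like, assuming hypothetical column names and paths, and CSV files in which \N marks a null (none of these details are from the thread):

        // Read the archived file with all of its historical columns.
        @old =
            EXTRACT Col1 string,
                    Col2 string  // ...and the remaining historical columns
            FROM "/archive/SourceTable/data.csv"
            USING Extractors.Csv();

        // Append the new column, filled with the file format's null marker.
        // U-SQL string literals follow C# rules, so "\N" is written @"\N".
        @widened =
            SELECT *,
                   @"\N" AS NewColumn
            FROM @old;

        // Write the widened rowset to a staging path, then replace the
        // original archived file with this output so the archive matches
        // the new schema.
        OUTPUT @widened
        TO "/archive/SourceTable/data_widened.csv"
        USING Outputters.Csv();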

    Edit: I think the ideal solution would be to filter on the virtual column, so that you could have two extracts for specific windows of time that you then merge together in a later statement. If I knew a column got added on 2018/06/05, I could have one extract for the files before that date and one for the files on or after it (with the new column). I like this better than using the flexible-schema code-behind because it keeps the structure strict and doesn't allow any schema mix-ups. A sketch of this two-window approach follows.
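
    For illustration: U-SQL does accept predicates on file-set virtual columns, and a constant predicate can prune the matched files at compile time. Assuming that pruning keeps the old-schema EXTRACT from ever touching the newer files (and with hypothetical paths and columns throughout), the two-window merge might look like this:

        // Hypothetical cutoff: the date the new column first appeared.
        DECLARE @cutoff DateTime = new DateTime(2018, 6, 5);

        // Window 1: files strictly before the cutoff (old schema).
        @oldRows =
            EXTRACT Col1 string,
                    Col2 string,
                    date DateTime
            FROM "/data/{date:yyyy}/{date:MM}/{date:dd}/snapshot.csv"
            USING Extractors.Csv();

        @oldWindow =
            SELECT Col1, Col2, (string)null AS NewColumn, date
            FROM @oldRows
            WHERE date < @cutoff;  // constant predicate => partition elimination

        // Window 2: files on or after the cutoff (new schema).
        @newRows =
            EXTRACT Col1 string,
                    Col2 string,
                    NewColumn string,
                    date DateTime
            FROM "/data/{date:yyyy}/{date:MM}/{date:dd}/snapshot.csv"
            USING Extractors.Csv();

        @newWindow =
            SELECT Col1, Col2, NewColumn, date
            FROM @newRows
            WHERE date >= @cutoff;

        // Merge the two windows into one rowset with the unified schema.
        @merged =
            SELECT * FROM @oldWindow
            UNION ALL
            SELECT * FROM @newWindow;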

    Second Edit: I just had another thought that I am going to test out. If I add a version number to the file name, I can do multiple extractions per schema/file. I currently store my schemas in a control table in SQL DB that tells ADF which columns to extract. If a schema change happens, I can create a new version of the schema in my control table and add a new version number to the output file. So in my Data Lake I would have a period of time for SourceTable_v1, SourceTable_v2, SourceTable_v3, etc. Then I could have an EXTRACT for each version of the file and merge them together later in the U-SQL script. That way I could still leverage the U-SQL virtual column to go through all folders, but it would do so for each version separately and the schemas wouldn't break. This is just a theory at this point, but I think it will work; a sketch is below.
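
    A minimal sketch of that versioned layout, with hypothetical paths and columns (version 2 adds NewColumn):

        // Each schema version lives under its own name, so each EXTRACT
        // can still sweep all dates via the {date} virtual column without
        // ever mixing schemas.
        @v1 =
            EXTRACT Col1 string,
                    Col2 string,
                    date DateTime
            FROM "/data/SourceTable_v1/{date:yyyy}/{date:MM}/{date:dd}.csv"
            USING Extractors.Csv();

        @v2 =
            EXTRACT Col1 string,
                    Col2 string,
                    NewColumn string,  // column added in schema version 2
                    date DateTime
            FROM "/data/SourceTable_v2/{date:yyyy}/{date:MM}/{date:dd}.csv"
            USING Extractors.Csv();

        // Pad the old version to the current schema and merge.
        @all =
            SELECT Col1, Col2, (string)null AS NewColumn, date FROM @v1
            UNION ALL
            SELECT Col1, Col2, NewColumn, date FROM @v2;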

    • Edited by FrankMn Friday, July 6, 2018 4:01 PM
    • Proposed as answer by FrankMn Thursday, August 23, 2018 9:09 PM
    Friday, July 6, 2018 3:32 PM