locked
How to implement SCD type 2 in Azure Sql Data warehouse RRS feed

  • Question

  • Dear All, I have to pull data from sources to data lake store using ADF and then finally to Azure sql dwh using polybase. What I understand is Azure data warehouse doesn't generate proper surrogate keys even using identity columns. Also there is no MERGE command support in azure data warehouse. How will I implement SCD type 2 . I see a lot of blogs and videos related to UPSERT but nothing related to SCD type 2 . Any kind of help would be highly appreciated.
    Thursday, April 4, 2019 6:37 AM

All replies

  • Hi Sanju,

    Please reference the following Stack Overflow discussion: Data warehouse type 2 scd Employee dimension and HR Facts (Kimball's)

    In a SCD (type 2 or type 3), you want to think in terms of 2 types of key; natural keys and pseudo keys. The natural key is the identifier which the "real world" would understand, in the example of an Employee dimension, this would probably be some kind of Employee Id. Each time you add an entry to this table, you get a new pseudo key, and I like to think of this as the "as-was" key. It represents the state of that dimensional member "as it was", when the record was added.

    Over time in a SCD, you will have many, many records per natural key, each with it's own "as-was" key. Considering the most recent entry, it's "as-was" key is also the "as-is" key, as it represents the current state.

    In a fact table, you should ALWAYS expect to find the "as-was" key. If you're going to assume the fact table will always hold the "as-is" key, or the most recent key, then it assumes you're going to go back and update historical records in your fact table simply because an attribute of the dimension changed. This is a waste of resources for started, and is actually counter-productive as one of the major benefits of a SCD is the ability to do "as-was vs as-is" analysis, and to do this you need to preserve the "as-was" state.

    Friday, April 5, 2019 1:51 AM
  • Dear Mike,

    Thank you for your response.

    However, I am not looking for an answer around what a SCD is and how to implement it. I have implemented it on multiple warehouse and complex dimensions. What I am struggling with is how to get it working on Azure sql data warehouse . Polybase is our querying tool. Azure dwh doesn't support MERGE command, Polybase CTAS is also limited to SCD type 1. I am sure many people have had such scenario on Azure warehouse and someway or other they must have implemented it. Could you guide me pls.

    Sunday, April 7, 2019 2:51 PM
  • Hi Sanjeet,

    I would look into using CTAS to load the table with a UNION ALL between your different cases and then handle your different types of rows with different SELECT statements.

    So for example:

    SELECT * FROM OldData WHERE KEY NOT IN NewData

    UNION ALL

    SELECT * FROM NewData WHERE KEY NOT IN OldData

    UNION ALL

    SELECT * FROM NewData WHERE KEY  IN OldData AND NewData<>OldData

    UNION ALL

    SELECT * FROM NewData WHERE KEY  IN OldData AND NewData=OldData


    Monday, April 8, 2019 3:44 PM
  • Thank you Daniel.

    I am little concerned over the 3rd SELECT statement where we compare oldData and NewData. Not sure how the performance will be. Also, the CTAS while copying existing tables has known issues of uniqueness (https://blog.westmonroepartners.com/azure-sql-dw-identity-column-bugs/)

    I have tried following query for SCD type 2 and it seems to be working. Not sure how it will work on larger dimensions though. Any comment on below code please?

            

       -- find out latest record --CustomerID is business key from source

      CREATE TABLE LatestMatchingrecord
      WITH (
        DISTRIBUTION=ROUND_ROBIN,
        HEAP
      ) AS
      WITH Latestrow
      AS
      (SELECT [CustomerID] AS C_ID, max([Customer Key]) AS MAxCustomerKey
      FROM    edw.[Customers] 
      group by [CustomerID]
      )
      select * from Latestrow
      ---


      -- find out updated records
      CREATE TABLE UpdatedRecordsinDelta
      WITH (
        DISTRIBUTION=ROUND_ROBIN,
        HEAP
      ) AS
      WITH updatedRows
      AS
      (SELECT   [Address],[City],[ContactName],[Country],[Customer],[CustomerID],[Fax],[Phone],[PostalCode],[CountryCode],[Latitude],[Longitude]
      FROM     [Stage].[Customer] AS stage
      WHERE EXISTS 
      (SELECT     [Address],[City],[ContactName],[Country],[Customer],[CustomerID],[Fax],[Phone],[PostalCode],[CountryCode],[Latitude],[Longitude]
      FROM    edw.[Customers] AS Main 
      where Main.[CustomerID] = stage.[CustomerID]
      AND (main.[Address] != stage.[Address] OR main.[City] != stage.[City] OR main.[ContactName] != stage.[ContactName] OR main.[Country] != stage.[Country]
      OR main.[Customer] != stage.[Customer] OR main.[Fax] != stage.[Fax] OR main.[Phone] != stage.[Phone] OR main.[PostalCode] != stage.[PostalCode]
      OR main.[CountryCode] != stage.[CountryCode] OR main.[Latitude] != stage.[Latitude] OR main.[Longitude] != [Longitude])))
      select * from updatedRows INNER JOIN LatestMatchingrecord on C_ID=[CustomerID]



      --mark older row as complete
      UPDATE edw.[Customers]
      SET [edw].[Customers].[Valid To] = @Valid_From
      FROM  UpdatedRecordsinDelta A WHERE A.MAxCustomerKey = edw.[Customers].[Customer Key]  



    --insert new data
      INSERT INTO edw.[Customers]
      ([Address],[City],[ContactName],[Country],[Customer],[CustomerID],[Fax],[Phone],[PostalCode],[CountryCode],[Latitude],[Longitude],[Valid From],[Valid To],[Lineage Key])
     -- updated row
      SELECT   [Address],[City],[ContactName],[Country],[Customer],[CustomerID],[Fax],[Phone],[PostalCode],[CountryCode],[Latitude],[Longitude],@Valid_From,NULL,@Lineage
       FROM      UpdatedRecordsinDelta AS s
       UNION ALL  
    -- New row
    SELECT     [Address],[City],[ContactName],[Country],[Customer],[CustomerID],[Fax],[Phone],[PostalCode],[CountryCode],[Latitude],[Longitude],@Valid_From,NULL,@Lineage
    FROM      [Stage].[Customer] AS st
    WHERE NOT EXISTS
    (   SELECT  [Address],[City],[ContactName],[Country],[Customer],[CustomerID],[Fax],[Phone],[PostalCode],[CountryCode],[Latitude],[Longitude]
        FROM    [edw].[Customers] ma
        WHERE   ma.[CustomerID] = st.[CustomerID])
    DROP TABLE UpdatedRecordsinDelta
    DROP TABLE LatestMatchingrecord

      END

    Saturday, April 20, 2019 5:12 PM