Delta Processing Azure Data Lake Analytics (Parallel execution issue)


  • Scenario:

    We have "employees_final" ADLA catalog/table. 

    Users from various locations would be loading employees data into "employees_final" ADLA catalog/table "in parallel" through ADF invoking U-SQL.

    When users are loading the data, it goes to staging table (employees_staging) first. If there is an update to existing record, then we will have two versions of data stored in staging table (employees_staging). Next step is, we need to have "employees_final" table with latest version of record ONLY.

    To create final table, we are JOINING staging and final tables to find the insert/update employees and combining the existing records with new records and RECREATE final table.

    NOTE: As there is no DELETE option, we are caching the existing records and appending new/update records.

    The drawback of this approach is, when users are running ADF in parallel, it will try to update SAME employees_final table and there is a chance of DATA LOSS due to TRUNCATE/RECREATE table approach.

    Do we have any better way to handle PARALLEL scenario?

    Saturday, February 24, 2018 3:50 AM

All replies

  • Hi Sathiq,

    If you're trying to avoid potential data loss due to TRUNCATE/RECREATE, your best bet may be to provide each ADF user with their own employees_staging table, and then run a single U-SQL script that scans each of these tables and only selects the most recent updates to load into the employees_final table.  This way multiple users should never be operating on the same table simultaneously, minimizing potential data loss. If this doesn't work for you, we can discuss different options as well. 

    Thursday, March 8, 2018 7:19 PM