Importing multiple blob files to Azure Data Warehouse

  • Question

  • We are considering using Azure Data Factory to import data from blob storage. Each stored blob should map to a table in the Data Warehouse. However, there might be around 150 tables to import.

    It seems like we would need to define 150 Copy Activities in ADF, which would generate a tonne of JSON. All the documentation on CopyActivity that I found online deals with just one blob-to-table import. What about 150?

    On the other hand, we could leverage PolyBase to load the data. We would need to create a stored proc with CTAS queries, which could then be executed with a StoredProcActivity. It seems like a neater approach; a rough sketch of what one table's load would look like is included below the link.

    https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-get-started-load-with-polybase/
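
    Roughly, the load for one table inside that stored proc would look like the sketch below. The external data source, file format, columns, and table names are placeholders for illustration rather than our actual schema, and a database-scoped credential for the storage account is assumed to already exist:

    -- External data source pointing at the blob storage account (created once)
    CREATE EXTERNAL DATA SOURCE AzureBlobStore
    WITH (
        TYPE = HADOOP,
        LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
        CREDENTIAL = BlobStorageCredential
    );

    -- File format shared by the delimited text blobs (created once)
    CREATE EXTERNAL FILE FORMAT CsvFormat
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (FIELD_TERMINATOR = ',', USE_TYPE_DEFAULT = TRUE)
    );

    -- Per table: an external table over the blob folder...
    CREATE EXTERNAL TABLE ext.Customer (
        CustomerId INT,
        Name       NVARCHAR(100)
    )
    WITH (
        LOCATION = '/customer/',
        DATA_SOURCE = AzureBlobStore,
        FILE_FORMAT = CsvFormat
    );

    -- ...followed by a CTAS that loads it into an internal table via PolyBase
    CREATE TABLE dbo.Customer
    WITH (DISTRIBUTION = ROUND_ROBIN)
    AS
    SELECT * FROM ext.Customer;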

    What are your thoughts?

    Tuesday, August 9, 2016 1:54 PM

All replies

  • Hi bozza, 

    You can use "Copy Wizard",  which is a Web UX to config a copy from Blob to SQL Data Warehouse. It also has the property to use polybase populating data into SQL DW. ADF internally will leverage polybase. In case the data format does not meet polybase restriction, ADF can help resolve that by employee a staged blob as intermittent storage and take care of the clean up as well. There's no need to write the StoredProcActivity and trigger polybase script yourself. ADF copy will do that for you. Please check the links below. 

    https://azure.microsoft.com/en-in/documentation/articles/data-factory-copy-wizard/ 

    https://azure.microsoft.com/en-us/documentation/articles/data-factory-copy-activity-performance/

    In your case, the 150 blobs correspond to 150 tables in SQL DW, right? If so, 150 copy activities are required, but you can do it like this: configure one copy activity using the UX tool mentioned above, which will generate the JSON for it. Then you can copy-paste that JSON and modify the blob name in it, along the lines of the sketch below. 
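
    For illustration, the copy activity JSON has roughly this shape (the activity and dataset names below are made-up placeholders); per table, essentially only the activity name and the input/output dataset references change:

    {
        "name": "CopyCustomerBlobToSqlDw",
        "type": "Copy",
        "inputs": [ { "name": "CustomerBlobDataset" } ],
        "outputs": [ { "name": "CustomerSqlDwTable" } ],
        "typeProperties": {
            "source": { "type": "BlobSource" },
            "sink": {
                "type": "SqlDWSink",
                "allowPolyBase": true,
                "polyBaseSettings": {
                    "rejectType": "percentage",
                    "rejectValue": 10.0,
                    "rejectSampleValue": 100,
                    "useTypeDefault": true
                }
            }
        },
        "scheduler": { "frequency": "Day", "interval": 1 }
    }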


    panchao

    Wednesday, August 10, 2016 3:48 AM
  • Thanks for the response! I saw the documentation about copy activities with PolyBase. So, performance-wise it should be the same as the CTAS queries.

    My primary concern is the sheer volume of generated JSON and the effort to maintain it. We would have to define 300 datasets (one blob and one table for each of the 150 imports) and 150 activities, each with its own scheduler, etc. In addition, pricing works per activity, so the cost of every run would be multiplied by 150.

    On the other hand, the SQL script seems more intuitive from the point of view of maintenance, pricing, etc.

    Wednesday, August 10, 2016 4:08 AM