General Azure ETL questions

  • Question

  • Hello,

    I currently have a number of Windows Scheduled Tasks that execute Python code to parse and load text files to a database. With our company moving more to Azure, I'm now looking to migrate some of these ETL processes to Azure services. It's been a steep learning curve thus far trying to understand the many, many pieces and services, and I have only a very basic understanding of ADF, Workflows and Databricks. I keep getting lost on the specific implementation scenarios I need, and some of the help is a tad 'light' on details for anything other than very basic scenarios.

    For example, I currently have a web app that creates a folder with a date suffix and uploads multiple files to Azure Blob Storage. I would like the blob-creation event to trigger some Python code that cleans up the uploaded txt files by deleting specific lines by line number, and then loads the data into a SQL database.
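    To illustrate the kind of cleanup I mean, here is roughly what the Python step does today (a minimal sketch using the azure-storage-blob SDK; the connection string, container, blob name, and line numbers below are placeholders):

        from azure.storage.blob import BlobServiceClient

        LINES_TO_DROP = {0, 1, 2}  # zero-based line numbers to delete (placeholder)

        service = BlobServiceClient.from_connection_string("<connection-string>")
        blob = service.get_blob_client(container="uploads-20190507", blob="data.txt")

        # Download the file, drop the unwanted lines, and overwrite the blob in place.
        text = blob.download_blob().readall().decode("utf-8")
        cleaned = "\n".join(
            line for i, line in enumerate(text.splitlines())
            if i not in LINES_TO_DROP
        )
        blob.upload_blob(cleaned.encode("utf-8"), overwrite=True)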

    I'm not sure if I should be building a Data Flow to handle all of this, a simple pipeline with a Python Databricks notebook, or whether this should be done with Event Grid and subscriptions.

    I know there are probably many different ways to accomplish the above with any or all of the options I mentioned, but any suggestions or comments on accomplishing something like this would be greatly appreciated.

    Warren M

    Tuesday, May 7, 2019 6:18 PM

All replies

  • Hello Warren,

    This is a very broad scenario. The scenario you have mentioned can be handled with a Copy activity, but since we do not know what kind of cleaning needs to be done on the blob before it is ingested into the SQL DB, we are not sure how well this will work for you; the Copy activity provides only limited cleaning options. You can also use the Databricks activity from a pipeline. For the trigger, an event-based trigger looks like the right choice.
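    If you go the Event Grid route, one common pattern is an Azure Function with a blob trigger that fires on blob creation. A minimal Python sketch might look like the following (the binding itself is configured in function.json, and the cleanup/load logic is only indicated):

        import logging
        import azure.functions as func

        def main(myblob: func.InputStream):
            # Fires whenever a new blob lands in the container bound in function.json.
            logging.info("Processing blob: %s (%s bytes)", myblob.name, myblob.length)
            text = myblob.read().decode("utf-8")
            # ...clean the text and load it into the SQL database here...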

    We also do not have any idea of the data size. Azure Databricks may turn out to be a more expensive solution than ADF v2, but it is feature-rich, as you can use Python or Scala.


    Thanks Himanshu

    Wednesday, May 8, 2019 8:25 PM
  • Thanks for the response Himanshu.

    One scenario is:

    • 4 txt files uploaded once a month to Azure Blob Storage
    • Process the text files to: remove a few header rows, remove a column and unpivot the data (currently with Pandas; see the sketch after this list)
    • Unpivoted, the total data loaded might be in the 50k-60k rows range
    • Load to a SQL Server DB
    • Execute a SQL Server stored procedure to do the final cleanup and data load
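
    For reference, the Pandas step currently looks roughly like this (a simplified sketch; the file, column, and table names are placeholders):

        import pandas as pd
        from sqlalchemy import create_engine

        # Skip the header rows and drop the unwanted column (names are placeholders).
        df = pd.read_csv("monthly_extract.txt", sep="\t", skiprows=3)
        df = df.drop(columns=["UnwantedColumn"])

        # Unpivot: keep the identifier column, melt the rest into rows.
        long_df = df.melt(id_vars=["Site"], var_name="Month", value_name="Value")

        # Load the ~50k-60k rows into a staging table in SQL Server.
        engine = create_engine(
            "mssql+pyodbc://user:pass@myserver/mydb?driver=ODBC+Driver+17+for+SQL+Server"
        )
        long_df.to_sql("StagingTable", engine, if_exists="append", index=False)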

    I am making some progress using Databricks, but I just don't know enough about the services yet to know if this is the correct path.
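
    What I have so far in Databricks is along these lines (a PySpark sketch; the storage account, path, and connection details are placeholders, and the storage key is assumed to be set in the cluster configuration):

        # Read the uploaded text files from Blob Storage (placeholder path).
        df = (spark.read
              .option("header", "true")
              .csv("wasbs://uploads@mystorageacct.blob.core.windows.net/2019-05-07/"))

        # Write the result to SQL Server over JDBC (placeholder connection details).
        (df.write
           .format("jdbc")
           .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
           .option("dbtable", "dbo.StagingTable")
           .option("user", "etl_user")
           .option("password", "<password>")
           .mode("append")
           .save())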

    Thanks again.


    Warren M

    Wednesday, May 8, 2019 10:44 PM
  • Let me reply on this point by point:

    • 4 txt files uploaded once a month to Azure Blob Storage. [Himanshu]: Since you are currently using web apps, you may be consuming an API and writing the data to the blob; ADF has a Web activity that may be used for this.
    • Process the text files to: remove a few header rows, remove a column and unpivot the data (currently with Pandas). [Himanshu]: Removing headers and columns is fine with the ADF Copy activity; I am unsure about the unpivot part.
    • Unpivoted, the total data loaded might be in the 50k-60k rows range.
    • Load to a SQL Server DB. [Himanshu]: ADF can do this.
    • Execute a SQL Server stored procedure to do the final cleanup and data load. [Himanshu]: ADF can do this.
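
    For the stored procedure step specifically, ADF has a Stored Procedure activity. If you end up running it from Python instead (for example from Databricks), a minimal pyodbc sketch would be (the connection details and procedure name are placeholders):

        import pyodbc

        conn = pyodbc.connect(
            "DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=myserver.database.windows.net;DATABASE=mydb;"
            "UID=etl_user;PWD=<password>"
        )
        cursor = conn.cursor()
        cursor.execute("EXEC dbo.usp_FinalCleanupAndLoad")  # placeholder proc name
        conn.commit()
        conn.close()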


    Thanks Himanshu

    Thursday, May 9, 2019 4:07 PM