How to transfer data from MSSQL to Datastore in a Pipeline?

    Question

  • I am trying to build a pipeline for my Machine Learning service that will make use of data stored in an MSSQL server. I know that I can connect to the database directly from a script, which I could then add to the pipeline using a 'PythonScriptStep', but I would really like to use the 'DataTransferStep', if that is possible.

    If I use the 'PythonScriptStep', I would have to pass the login credentials for my MSSQL server along with the script, e.g. included in the script or passed as arguments (locally I am using environment variables, but I have not yet found a way to use them in the pipeline). Both of these approaches make me unsure about the security when executing the script. Should I be worried that the credentials will leak, so that my database could be hacked? Which way is most secure? Passing them as arguments seems the more convenient option if I want to change the credentials later, but at the same time less secure?
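
    For reference, this is roughly how the script connects today (a minimal sketch; I am assuming pyodbc here, and the server, database, and table names are placeholders):

        import os
        import pyodbc

        # Credentials come from environment variables locally; these are the values
        # I am unsure how to pass safely once the script runs inside the pipeline.
        server = os.environ["MSSQL_SERVER"]
        database = os.environ["MSSQL_DATABASE"]
        username = os.environ["MSSQL_USERNAME"]
        password = os.environ["MSSQL_PASSWORD"]

        connection = pyodbc.connect(
            "DRIVER={ODBC Driver 17 for SQL Server};"
            f"SERVER={server};DATABASE={database};UID={username};PWD={password}"
        )

        # Placeholder query just to illustrate reading the data.
        rows = connection.execute("SELECT TOP 10 * FROM my_table").fetchall()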

    This is where 'DataTransferStep' comes into the picture. As far as I can understand, this is specifically developed to handle data transfer in a secure way. So if I want to transfer data from an MSSQL server, this seems like the obvious choice. However, the example notebooks don't cover SQL connections, and the documentation indicates that it might only be possible to connect to an Azure SQL server anyway. Does anyone know more on this matter?



    • Edited by ArvidBaa Wednesday, December 5, 2018 1:12 PM
    Wednesday, December 5, 2018 1:11 PM

Answers

  • DataTransferStep supports copying data from/to Azure SQL Database, in addition to Azure Blob Storage and Azure Data Lake Storage. We'll be updating the notebooks shortly to demonstrate data transfer from Azure SQL Database.

    In your case, it seems you're trying to copy data from an on-premises MSSQL server, which is not currently supported by DataTransferStep. You can use Azure Data Factory to copy data from the on-premises MSSQL server to Azure Blob storage and then use that directly in your pipeline.

    Documentation for using Azure Data Factory to copy data from on-premises MSSQL server to Azure Blob storage:
    [https://docs.microsoft.com/en-us/azure/data-factory/tutorial-hybrid-copy-powershell]
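
    For the Azure SQL Database case, the step looks roughly like this (a minimal sketch with placeholder names; it assumes an Azure SQL datastore and an Azure Data Factory compute target are already registered in the workspace):

        from azureml.core import Workspace, Datastore
        from azureml.core.compute import DataFactoryCompute
        from azureml.data.data_reference import DataReference
        from azureml.data.sql_data_reference import SqlDataReference
        from azureml.pipeline.core import Pipeline
        from azureml.pipeline.steps import DataTransferStep

        ws = Workspace.from_config()

        # Datastores registered in the workspace (placeholder names).
        sql_datastore = Datastore.get(ws, "my_sql_datastore")
        blob_datastore = Datastore.get(ws, "workspaceblobstore")

        # Source: a table in the Azure SQL Database datastore (placeholder table name).
        source = SqlDataReference(
            datastore=sql_datastore,
            data_reference_name="sql_source",
            sql_table="my_table",
        )

        # Destination: a folder in blob storage.
        destination = DataReference(
            datastore=blob_datastore,
            data_reference_name="blob_destination",
            path_on_datastore="transferred_data",
        )

        # The DataTransferStep runs on an Azure Data Factory compute target (placeholder name).
        data_factory = DataFactoryCompute(ws, "adf-compute")

        transfer_step = DataTransferStep(
            name="transfer_sql_to_blob",
            source_data_reference=source,
            destination_data_reference=destination,
            compute_target=data_factory,
        )

        pipeline = Pipeline(workspace=ws, steps=[transfer_step])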

    (I'm an engineer on the Azure ML team)
    • Marked as answer by ArvidBaa Friday, December 7, 2018 2:29 PM
    Thursday, December 6, 2018 6:02 PM

All replies

  • Thank you for the answer. I will read the link you refer to and see whether it is feasible to transfer the data to blob storage, or whether it is better to connect to the database using a Python script that I could add to the pipeline as a 'PythonScriptStep'. The latter is a working solution; my only worry is that the login credentials could be exposed.
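
    For context, this is roughly how I would wire up the script approach today (a minimal sketch; the compute target, script, and argument names are placeholders):

        from azureml.core import Workspace
        from azureml.pipeline.core import Pipeline
        from azureml.pipeline.steps import PythonScriptStep

        ws = Workspace.from_config()

        # The credentials are passed as arguments to the step; this is the part that
        # worries me, since the values end up visible in the step definition.
        fetch_step = PythonScriptStep(
            name="fetch_from_mssql",
            script_name="fetch_data.py",      # placeholder script that connects to the MSSQL server
            arguments=["--server", "my-server",
                       "--username", "sql_user",
                       "--password", "sql_password"],
            compute_target="cpu-cluster",     # placeholder compute target name
            source_directory="./scripts",
        )

        pipeline = Pipeline(workspace=ws, steps=[fetch_step])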

    Do you know if there are any plans to integrate MSSQL server into the 'DataTransferStep' functionality, and if so, what the timeline could be?

    Friday, December 7, 2018 2:36 PM