Azure Data Lake - Best Practice Patterns?

  • Question

  • I'm new to Azure and to Azure Data Lake Store & Analytics, but I have been using SQL Server & BI tools since MS SQL Server 7.  I am looking for advice on the best architecture or implementation pattern for consuming customer data into a cloud data solution on Azure.


    We receive multiple files (typically CSV and XLSX, but could be other formats like XML, PDF, or DOC) from clients.  We will need to 1) allow clients to easily and reliably send us files daily; 2) ingest these files into a relational data staging area; and 3) import these files into a central data warehouse and data marts used for analytics and reporting.

    Proposed Implementation

    The implementation I thought would work best is to use Azure Data Lake Store to accept files from clients.  Since there will be multiple formats and layouts that cannot be predicted ahead of time (and our business model does not impose or require rigid standard formats for clients to conform to), storing files in their native format will allow us to on-board new clients as quickly as possible.  We will have data analysts dedicated to mapping new file layouts initially (new clients will most likely have consistent layouts/file types once they are on-boarded).  This mapping would be done using some combination of Azure Data Factory, SSIS, and/or U-SQL to get data from the Data Lake Store into the SQL Server staging database.  From there we would use Azure Data Factory or SSIS to populate the data warehouse and dimensional data marts.
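    As a sketch of the per-client mapping step described above (the client name, column names, and staging schema below are all hypothetical, chosen only for illustration), the analyst-maintained layout for each on-boarded client could be as simple as a dictionary that renames a client's native CSV columns to the staging schema:

```python
import csv
import io

# Hypothetical per-client column mappings, maintained by the data
# analysts who on-board each client; all names are illustrative only.
CLIENT_LAYOUTS = {
    "client_a": {"Acct No": "account_id", "Bal": "balance"},
}

def map_to_staging_rows(client: str, csv_text: str) -> list:
    """Rename a client's native CSV columns to the staging schema,
    dropping any columns the layout does not map."""
    layout = CLIENT_LAYOUTS[client]
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{layout[k]: v for k, v in row.items() if k in layout}
            for row in reader]

rows = map_to_staging_rows("client_a", "Acct No,Bal\n1001,250.00\n")
print(rows)  # [{'account_id': '1001', 'balance': '250.00'}]
```

    In practice this mapping table would live in a database so the downstream SSIS/Data Factory jobs stay generic while only the per-client layouts change.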

    My questions to the experts are these:

    1. Does my implementation strategy make sense, or are there better patterns or practices I should be considering?
    2. If we pursue the implementation I am suggesting then:
    • What technologies should we use to implement the client upload solution?  I am having trouble visualizing the mix of technologies that would best support this.  For example, should we build a custom .NET web application for clients to upload directly to our Data Lake; use sFTP to have them drop files at a location within our firewall and then use PowerShell or .NET to upload them; or is there some other, better solution?
    • Will Azure security have any direct impact on the implementation we choose?  For example are there specific solutions that will be more difficult to implement due to security requirements?
    • We expect most of our clients will want to automate the file-sending process on their end as well.  What would be the best implementation approach to make this happen?  Is that FTP/sFTP, or are there better technologies available for B-to-B file exchange out there?

    I am just looking for thoughts and ideas, not a detailed implementation plan or anything of that sort.  I feel like I am in the ballpark but want confirmation that I am heading in the right direction.

    Tuesday, February 2, 2016 2:17 AM

All replies

  • Hi AzureData_Newbie,

    Regarding question #1, the strategy does make sense.

    Regarding question #2, could you please clarify what type of clients you have? Are they all Windows machines?

    If your clients are Windows clients, you could leverage the .NET SDK sample (https://azure.microsoft.com/en-us/documentation/articles/data-lake-store-get-started-net-sdk)  to create an app to load data into Azure Data Lake.

    There are also other mechanisms possible, but they will depend on the nature of your clients.

    Sachin Sheth
    Program Manager
    Azure Data Lake

    Tuesday, February 2, 2016 2:38 AM
  • Our clients are financial institutions, banks, brokers, etc. I would say almost all would be Windows clients.  Are you recommending a Windows desktop application that the clients would have to install on their PCs?  I doubt their IT departments would allow a 3rd-party application to be installed within their company.

    The solution would have to be something that can be automated on both their end and ours.  Traditionally, FTP was used in these scenarios: either our company had permission to connect to the client's remote FTP server to pull down a file, or they had rights to our FTP server to push a file.  FTP connections and file transfers can be automated with SSIS or scripting.  I don't know whether sFTP is still the best technology choice for B2B file transfers or whether there is a more modern solution for today's cloud computing environments.

    Tuesday, February 2, 2016 4:12 PM
  • May I ask a question about your data formats? Which ones do you want to process with U-SQL, and which ones do you just consider blobs?

    Note that U-SQL currently has extractors for CSV, TSV, XML, and JSON, but not PDF or XLSX. You would have to write your own custom extractor at the moment. And given that many of the Office XML formats are compressed archives of XML documents, it may be non-trivial to do so (and not as scalable as just asking people to save their documents as CSV/TSV from within Excel).
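    To make that point concrete: using only the standard library, the sketch below builds a minimal xlsx-shaped ZIP archive in memory and reads a cell value back out by unzipping and parsing the sheet XML, which is roughly the work a custom extractor would have to do (a real workbook has many more parts, plus shared-string tables, so this is only an illustration):

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# A .xlsx file is a ZIP archive of XML parts. Build a minimal,
# illustrative archive in memory with a single one-cell worksheet.
sheet_xml = (
    '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    '<sheetData>'
    '<row r="1"><c r="A1" t="inlineStr"><is><t>hello</t></is></c></row>'
    '</sheetData></worksheet>'
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/worksheets/sheet1.xml", sheet_xml)

# Reading it back means unzipping the archive and walking the XML tree.
with zipfile.ZipFile(buf) as zf:
    root = ET.fromstring(zf.read("xl/worksheets/sheet1.xml"))

ns = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"
values = [t.text for t in root.iter(ns + "t")]
print(values)  # ['hello']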

    Michael Rys

    Wednesday, February 3, 2016 8:32 PM
  • The majority of files will be either CSV or Excel (xlsx, xls), so handling those types will be our top priority. If we were given a PDF, we would go back to the client to see if we can get a different format.

    I mentioned U-SQL as an option if we choose to use the Data Lake Store, but I am not as familiar with it as I am with SSIS, so most likely we would use SSIS to do our importing, staging, and integration.

    The scenario I think would work for the majority of clients is to provide an sFTP site on our end to which they could transfer files either manually or programmatically; we would then have a process that runs periodically and pulls anything it finds into the Data Lake Store. (Question: can the Data Lake Store be used as the direct landing zone for FTP?)

    Our business process would work something like this:

    1. A new client (Client A) wants to send us their data so we can incorporate it into our reporting model.  They have Excel workbooks and CSV files with the data.  We send them the instructions to transfer the files via sFTP.  The files are transferred and moved into our Data Lake Store.  A data analyst on our end opens the files and inspects them.  SSIS packages are created for that client's data format, and SSIS pulls the data from the lake into the SQL Server staging database.  From there the process becomes more standardized, as the staging area is set up to feed our data warehouse and data marts.
    2. An existing client (Client B) sends us data, but they have an IT department that can pull directly from their source systems and produce an extract that is automatically sent to our sFTP server.  From there an SSIS job picks up their files, moves them into the Data Lake Store, and then loads our staging database, and so on...
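    The periodic "pull anything it finds" sweep in both scenarios could look something like the sketch below. Local directories stand in for the sFTP drop and the Data Lake Store, and the raw/&lt;client&gt;/&lt;date&gt; folder layout is only an assumption, not a recommendation:

```python
import shutil
from datetime import date
from pathlib import Path

def sweep_landing_zone(landing: Path, lake: Path) -> list:
    """Move every file found at landing/<client>/ into a
    client- and date-partitioned raw area under the lake root."""
    moved = []
    for client_dir in sorted(p for p in landing.iterdir() if p.is_dir()):
        for f in sorted(p for p in client_dir.iterdir() if p.is_file()):
            dest = lake / "raw" / client_dir.name / date.today().isoformat() / f.name
            dest.parent.mkdir(parents=True, exist_ok=True)
            # In production this step would upload to ADLS rather than
            # move to a local folder.
            shutil.move(str(f), str(dest))
            moved.append(dest)
    return moved
```

    A scheduled job (SQL Agent, SSIS, or any scheduler) would run this sweep and hand the newly landed paths to the per-client SSIS packages.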

    We wanted to incorporate the Data Lake into this architecture, rather than just a file server, because as we gain more clients we will be able to easily accept data in a variety of formats, and we can perform some analysis and reporting more readily than with files sitting on a file share.  And presumably, since our data warehouse and data marts will also be in the cloud, there should be better ways to leverage the Data Lake using other cloud services.

    Wednesday, February 3, 2016 9:21 PM
  • Thanks for the additional information.

    I think U-SQL's strength in this scenario is that it lets you easily write SQL-based transforms of the information inside your data files. And you can use it to transform the file formats themselves (although you would have to write more of the file-format readers and writers yourself until the community and we ourselves catch up).

    I think it is a good idea to start with the Data Lake storage as the place to gather and collect the data. Over time you can then develop more services to analyze, augment and transform the data.

    I will defer to Sachin to answer the FTP question. I would also suggest reaching out to the SSIS team regarding their ability to read from WebHDFS services such as ADLS.

    Michael Rys

    • Proposed as answer by Michael Amadi Friday, February 19, 2016 12:16 PM
    Thursday, February 11, 2016 10:43 PM
  • Hi AzureData_Newbie,

    Azure Data Lake Store (ADLS) unfortunately does not provide a built-in FTP server. It only supports a WebHDFS-compatible interface. So, for now, you will need to set up a worker role as an FTP server and then push the data to ADLS.

    Regarding SSIS support for ADLS, that is currently being worked on. In the meantime, you can use Azure Data Factory to copy data from ADLS to other stores. Hoping this will meet your requirements.


    • Proposed as answer by Michael Amadi Friday, February 19, 2016 12:16 PM
    Thursday, February 18, 2016 10:31 PM
  • Your implementation strategy makes sense. I can also say that some of our customers who have a similar use case are already following this strategy.

    A few suggestions:
    1. Keep control of migrating files (a custom app) from FTP/sFTP to ADLS at your end. Client-installable apps are always a hassle to maintain: security, maintenance, support, and compatibility are a nightmare.
    2. For those customers who are not willing to host the files on FTP/sFTP, give them a custom app to push the files directly to ADLS.
    3. You can make use of the Data Factory FTP connector to push the files directly from FTP to ADLS.
    4. Use the SSIS ADLS connector to populate your staging DB. The SSIS ADLS connector is in beta and is buggy right now (we built a connector ourselves using the SDK).


    Monday, March 6, 2017 2:39 PM
  • Hi,

    How about utilizing the capabilities of Logic Apps? Have FTP/SFTP as the trigger (when a file is added or modified) and the Azure Data Lake connector as the action to upload the file to Azure Data Lake Store.

    This would be more real-time than scheduled jobs.

    The Azure Data Lake connector is currently in preview, but that does not stop us from trying it out; please note that there is a limit of 50 MB per insert.



    Monday, March 6, 2017 5:10 PM