Best way to provide data to applications from Data Lake

    Question

  • The data lake I have created takes in daily incremental files and stores them in a RAW directory.  They are then moved into a Staging directory, which gets picked up for processing, and eventually the data lands in a Production directory that contains a full set of each table's data.  

    My question is: given that the Production files contain the latest version of all the data (500 GB in some files), what is the best way to give other applications access to the data in these files? It's easy to source a cube/tabular model, and somewhat easy to source Power BI, but if another department within our organization has an application that wants to pull certain data (filtered to their needs), what would be the best way for them to connect? In SQL Server, for instance, they could connect in many different ways.  But in Azure Data Lake, the data isn't queryable the way it is in SQL Server.  How do we provide data to downstream applications outside of Power BI/SSAS/etc.?

    Friday, January 18, 2019 3:08 PM

All replies

  • Not sure I understand the issue. Azure Data Lake allows for the use of U-SQL, so it *is* queryable. 

    If you have an application with specific data needs, then you could write an extract using U-SQL, and dump that extract to a place where the applications could make use of the data. 

    I suppose the missing piece here is the requirements of the applications. If they need real-time data, then this extract method may not be the best fit. But you should be able to fish out what you need from the data lake, supply that to a data mart or warehouse, and let the other applications connect there.
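    A minimal sketch of such an extract in U-SQL might look like the following. The paths, schema, and region filter are all hypothetical, just to illustrate the shape of the job:

    ```sql
    // Read the full Production file (hypothetical path and schema).
    @orders =
        EXTRACT OrderId   int,
                Region    string,
                Amount    decimal,
                OrderDate DateTime
        FROM "/Production/Orders.csv"
        USING Extractors.Csv(skipFirstNRows: 1);

    // Keep only the rows the consuming application asked for.
    @filtered =
        SELECT OrderId, Amount, OrderDate
        FROM @orders
        WHERE Region == "EMEA";

    // Write the extract where the application can pick it up.
    OUTPUT @filtered
    TO "/Extracts/FinanceApp/Orders_EMEA.csv"
    USING Outputters.Csv(outputHeader: true);
    ```

    The consuming application then reads the small extract file (or a table it was loaded into) instead of querying the large Production files directly.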

    HTH

    Friday, January 18, 2019 3:54 PM
  • You are getting to the heart of the question: on-demand/live data requests.  That is what I mean by queryable by applications.  I am already leveraging U-SQL to process the data lake files from my RAW directory and create the production files, but that is a static process that runs on a schedule. 

    It seems like the only way to make this work is not to run on-demand queries against the Data Lake Production files, but to actually process the data for the downstream application (with U-SQL) and store it in another area of Production for the application to pick up daily and load into its own database.  So end users wouldn't read data live from the Data Lake, but rather from their own application database after it was extracted.
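    That daily hand-off could be a scheduled U-SQL job that cuts a per-application slice into its own pickup folder. Again, the paths, columns, and the date-window logic below are made up for illustration; in practice the cutoff would typically be injected as a parameter by the scheduler (e.g. Data Factory):

    ```sql
    // Date window for the daily run (illustrative; normally a job parameter).
    DECLARE @cutoff DateTime = DateTime.UtcNow.AddDays(-1);

    @rows =
        EXTRACT CustomerId int,
                Region     string,
                Updated    DateTime,
                Balance    decimal
        FROM "/Production/Customers.csv"
        USING Extractors.Csv(skipFirstNRows: 1);

    // Only the slice the downstream application needs.
    @slice =
        SELECT CustomerId, Balance, Updated
        FROM @rows
        WHERE Region == "EMEA" AND Updated >= @cutoff;

    // Drop the slice into the application's pickup area.
    OUTPUT @slice
    TO "/Extracts/AppPickup/Customers_latest.csv"
    USING Outputters.Csv(outputHeader: true);
    ```

    The downstream application loads that file into its own database on its own schedule, so end users never query the lake directly.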

    Friday, January 18, 2019 5:07 PM
  • FrankMn, I'm beginning to face that same challenge.  Data Lake Analytics is not a quick or dynamic computing environment to which I would want to point Power BI or other reporting clients.  The job-submission paradigm is not really conducive to that, not like a relational database engine such as SQL Server or Azure SQL DB/DW.  Or even Cosmos DB?  ...maybe.

    In our data environment we use Blob storage for inbound files, then process them with ADLA/U-SQL jobs.  We enrich, enhance, or otherwise transform datasets this way, producing outbound files for consumption by another application or system.  I think of ADLA as the Operational Data Store (ODS) staging the data.  I'm now starting to explore standing up an Azure SQL DB and/or SQL DW to better understand each type's capabilities for querying external sources. 

    But ultimately I agree, ADLA is not suited to query directly for reporting, too much latency and unpredictability in my opinion.


    Bill Blakey

    Thursday, January 24, 2019 4:15 PM