Query performance on ADLS Gen2

  • Question

  • I'm trying to migrate our "old school" database (mostly time series) to an Azure Data Lake.

    So I took a random table (10 years of data, 200M records, 20 GB), copied the data into a single CSV file, and also split the same data into ~4,000 daily files (in monthly folders).

    On top of those two sets of files, I created two external tables... and I'm getting pretty much the same performance for both of them. (?!?)

    No matter what I'm querying, whether I'm looking for data from a single day (and thus a single small file) or summing over the whole dataset, it takes about 3 minutes, whether it's reading the one big file or the 4,000 daily files. It's as if the whole dataset had to be loaded into memory before doing anything ?!?

    So is there a setting somewhere that I could change to avoid loading all the data when it's not required? It could literally make my queries 1000x faster.

    As far as I understand, indexes are not possible on external tables, and creating a materialized view would defeat the purpose of using a lake.

    Full disclosure: I'm new to Azure Data Lake Storage; I'm trying to see if it's the right technology for our problem.

    Monday, April 13, 2020 2:41 PM
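The daily-file layout described above (monthly folders, one CSV per day) can be produced with a short script. This is a minimal stdlib sketch, not the poster's actual process; the column name `timestamp` and the `YYYY-MM/YYYY-MM-DD.csv` naming scheme are assumptions for illustration:

```python
import csv
from pathlib import Path

def split_csv_by_day(src_path, out_dir, date_col="timestamp"):
    """Split a time-series CSV into one file per day, grouped into
    monthly folders: out_dir/YYYY-MM/YYYY-MM-DD.csv.

    Assumes the date column holds ISO-formatted dates ("YYYY-MM-DD...");
    the column name and folder scheme are illustrative assumptions.
    """
    out_dir = Path(out_dir)
    writers = {}  # day -> (file handle, DictWriter)
    try:
        with open(src_path, newline="") as src:
            reader = csv.DictReader(src)
            for row in reader:
                day = row[date_col][:10]           # "YYYY-MM-DD"
                if day not in writers:
                    month_dir = out_dir / day[:7]  # monthly folder "YYYY-MM"
                    month_dir.mkdir(parents=True, exist_ok=True)
                    f = open(month_dir / f"{day}.csv", "w", newline="")
                    w = csv.DictWriter(f, fieldnames=reader.fieldnames)
                    w.writeheader()
                    writers[day] = (f, w)
                writers[day][1].writerow(row)
    finally:
        for f, _ in writers.values():
            f.close()
```

Keeping one writer open per day avoids reopening files on every row, which matters when streaming 200M records through a single pass.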

All replies

  • Hello Ben ,

    We are reaching out to the team internally for suggestions here.

    Thanks, Himanshu

    Wednesday, April 15, 2020 11:24 PM
  • Hi Ben,

    That is how Azure Synapse (SQL Data Warehouse) handles external tables: it loads all the data into tempdb and then processes it. You can speed up that load by increasing the resource class and DWU level. In the future (it's on their announced roadmap), processing will be able to be pushed down onto the files themselves (as long as they are in Parquet format). The 4,000 daily files are the way to go, as they will speed up the load if you have a high enough resource class/DWU level.

    Assuming you just want to process files in Azure Data Lake Storage Gen2 and don't need the database features of Azure Synapse, I would suggest converting the files to Parquet and then looking into Azure Databricks to run queries on them.

    Thursday, April 16, 2020 8:06 AM
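The suggestion above hinges on the query engine reading only the files a query actually needs (partition pruning / filter pushdown), instead of scanning everything into tempdb. A minimal stdlib sketch of that idea, assuming the daily-file layout from the question (monthly folders of `YYYY-MM-DD.csv`; the layout and the `value` column name are hypothetical):

```python
import csv
from pathlib import Path

def sum_for_day(root, day, value_col="value"):
    """Sum one column for a single day by opening only that day's file,
    rather than scanning the whole dataset -- the manual equivalent of
    the partition pruning an engine with filter pushdown performs.

    Assumes files live at root/YYYY-MM/YYYY-MM-DD.csv; layout and
    column name are assumptions, not from the thread.
    """
    path = Path(root) / day[:7] / f"{day}.csv"
    with open(path, newline="") as f:
        return sum(float(row[value_col]) for row in csv.DictReader(f))
```

With 4,000 daily files, a single-day query touches roughly 1/4000 of the data, which is the ~1000x speedup the original poster was hoping for, once the engine prunes files instead of loading them all.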