Slow Loading of Single Compressed CSV with PolyBase

  • Question

  • Asking on behalf of customer:

    I have a single CSV file, compressed, everything in the same region.  The file resides on Standard Blob Storage.


    The CTAS is taking a very long time at 1000 DWU (see below) and I am trying to load 40 MM records.


    Should we see DWU usage at 1K during the CTAS operation?  Is the usage below expected?  What performance should we expect?  Do we have any benchmarks?


    Wednesday, January 25, 2017 5:21 PM

All replies

  • The slow performance is due to the single large compressed file.

    For uncompressed CSVs, PolyBase can split the file into multiple 512MB sections and load them in parallel with multiple threads per distribution. In the uncompressed CSV case, you would expect scaling the Data Warehouse SLO to increase loading performance.

    Since the single CSV is compressed in this case, file splits are not possible. As a result, a single thread reads the data, which results in slower performance. This also explains the lower-than-expected DWU usage.
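    To make the difference concrete, here is a rough sketch of the split arithmetic described above. It is only a simplified model of the explanation in this reply, not PolyBase's actual scheduler: the function name `readable_splits` is hypothetical, and real parallelism also depends on the service level and the number of readers available.

```python
import math

# 512 MB split size, as stated in the reply above.
SPLIT_BYTES = 512 * 1024 * 1024


def readable_splits(file_bytes, compressed):
    """Estimate how many independently readable sections one file yields.

    Simplified model: an uncompressed CSV is cut into 512 MB sections,
    while a compressed file cannot be split and is read as a single unit.
    """
    if file_bytes <= 0:
        return 0
    if compressed:
        return 1
    return math.ceil(file_bytes / SPLIT_BYTES)


ten_gib = 10 * 1024 ** 3
print(readable_splits(ten_gib, compressed=False))  # 20 sections
print(readable_splits(ten_gib, compressed=True))   # 1 section
```

    Under this model a single compressed file always gives one unit of work, no matter how far the DWU setting is scaled up.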

    So how to fix it:

    There are a couple of options, depending on how optimized you need this to be and the level of data prep you are willing to do.

    1) Don't compress the file: This is the easiest to do. It will perform much better than the current setup, but slower than the other alternatives.

    2) Split the single file into multiple uncompressed files: This will reduce file split overhead and increase performance, but you will need to split your files and remove compression.

    3) Split the single file into N compressed files under 512MB: This is the optimal solution because it leverages the parallel loading capabilities of PolyBase while also leveraging data compression.
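    As an illustration of option 3, here is a minimal Python sketch that splits one CSV into several gzip-compressed parts. The function names, the `part_NNNNN.csv.gz` naming, and the `rows_per_part` parameter are all hypothetical choices for this example. It assumes the source file is UTF-8, has no header row, and that a "\n" row terminator is acceptable to your external file format; adjust these to match your data.

```python
import csv
import gzip
import os

# Size ceiling per compressed part, taken from option 3 above.
MAX_PART_BYTES = 512 * 1024 * 1024


def split_csv_to_gzip(src_path, out_dir, rows_per_part):
    """Split src_path into gzip parts of at most rows_per_part rows each.

    Rows are parsed with the csv module so that quoted fields containing
    newlines are never cut in half. Returns the list of part paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    out = None
    writer = None
    rows_in_part = 0
    with open(src_path, newline="", encoding="utf-8") as src:
        for row in csv.reader(src):
            # Start a new part on the first row or when the current one is full.
            if out is None or rows_in_part >= rows_per_part:
                if out is not None:
                    out.close()
                path = os.path.join(
                    out_dir, f"part_{len(part_paths):05d}.csv.gz"
                )
                out = gzip.open(path, "wt", newline="", encoding="utf-8")
                # Assumption: "\n" row terminator; change if yours differs.
                writer = csv.writer(out, lineterminator="\n")
                part_paths.append(path)
                rows_in_part = 0
            writer.writerow(row)
            rows_in_part += 1
    if out is not None:
        out.close()
    return part_paths


def oversized_parts(part_paths, limit=MAX_PART_BYTES):
    """Return the parts whose compressed size on disk exceeds the limit."""
    return [p for p in part_paths if os.path.getsize(p) > limit]
```

    A call such as split_csv_to_gzip("input.csv", "parts", rows_per_part=1_000_000) (placeholder values) writes the parts, and oversized_parts(...) reports any that still exceed 512MB so you can rerun with a smaller row count. Splitting by row count rather than by byte offset keeps every record intact; the trade-off is that part sizes are only approximately equal.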

    At DWU 6000, we have a benchmark of 1250 MB/s using PolyBase and ADF.

    Hope that helps,


    Wednesday, January 25, 2017 5:33 PM