CsEnumDirectory failed with error 0x83090AF1 - Large number of files?

    Question

  • I am getting this exception when running a U-SQL job over a Data Lake Store folder with a large number of files (100,000+). The Data Lake explorer in both the portal and Visual Studio also fails when trying to list the files in the folder.

    I'm guessing this is a timeout issue with enumerating a large folder. Is this a known issue that anyone else is having? Working with smaller sets (< 30,000 files) appears to work.
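
    For reference, the script refers to the files with a file-set (wildcard) expression along these lines (a simplified sketch only; the actual paths, schema, and extractor are different):

        // Illustrative only: a file-set expression over the folder. The compiler
        // has to enumerate every file in the folder before the job can run.
        @rows =
            EXTRACT value    string,
                    fileName string    // virtual column bound from the file name in the path
            FROM "/mydata/bigfolder/{fileName}"
            USING Extractors.Text();

        OUTPUT @rows
        TO "/output/sample.csv"
        USING Outputters.Csv();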

    Tuesday, January 12, 2016 11:43 AM

Answers

  • Hi Neil

    Thanks for the additional information. Here are some suggestions and comments:

    1. What compression do you use? If the file is gzipped with the file extension .gz, we should do the decompression automatically.
    2. Using file sets can currently take a long time to compile, and compilation may time out at a few thousand files. We are working on improving the scalability at the moment, but I cannot give you any estimates yet on how many files you will be able to process with one file set expression.
    3. Unless you can benefit from "partition elimination" (see the sketch after this list), it is normally more efficient to have a few large files rather than many small files. However, if your files are not easily row-splittable (currently requiring CR and/or LF as the row delimiter) and you need to run the custom extractor on the whole file, then it may be better to keep the file size below a file extent size (250 MB).
    4. We do seem to have a bug in the store's file enumeration logic that is being investigated right now.
    5. Regarding the downloading issue, could you please open another thread here and I will loop the tool team in.
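
    To illustrate point 3, here is a minimal sketch of a file set that can benefit from partition elimination. The folder layout, schema, and date range are assumptions for illustration only:

        // Assumed layout: /events/<year>/<month>/<day>/*.csv
        // The date parts of the path are bound to the virtual column "date".
        @rows =
            EXTRACT user   string,
                    amount decimal,
                    date   DateTime
            FROM "/events/{date:yyyy}/{date:MM}/{date:dd}/{*}.csv"
            USING Extractors.Csv();

        // A constant predicate on the virtual column lets the compiler eliminate
        // partitions (files) that cannot match, so only those files are read.
        @recent =
            SELECT user, SUM(amount) AS total
            FROM @rows
            WHERE date >= DateTime.Parse("2016-01-01")
            GROUP BY user;

        OUTPUT @recent
        TO "/output/totals.csv"
        USING Outputters.Csv();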

    Thanks

    Michael


    Michael Rys

    Friday, January 15, 2016 8:33 AM
    Moderator

All replies

  • Hi,

    I believe this is a known issue, but I am going to confirm with the engineering team regarding the specifics you've described.  

    As for your scenario, can you please describe the specifics of your use case?  Why do you have many files in a single directory?

    Cheers,

    Ricardo

    Wednesday, January 13, 2016 8:49 PM
  • Hi Neil

    Can you provide more details on how your script refers to the files? Is this with the PowerShell commands? The SDK? Inside a U-SQL script?

    There is a bug in the Visual Studio tool around listing a large number of files that we are currently investigating. Thanks for reporting this.


    Michael Rys

    Wednesday, January 13, 2016 10:57 PM
    Moderator
  • Hi Ricardo,

    Thanks for the reply.

    We have a lot of small files (< 10 KB) of compressed, binary-serialised data (using protobuf) generated by a business process. It's still evolving, but this will be many millions of files in the future.

    I have written a custom extractor to decompress and deserialise the data, which has worked when a smaller number of files is used in the folder.

    I have tried splitting the files out into sub-folders in Data Lake Store and using a wildcard match to pull in all the files across the sub-folders, which appears to get around the CsEnumDirectory error. However, the U-SQL script now takes too long to run to be practical.
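
    The wildcard pattern looks roughly like this (illustrative names only; the real extractor class and paths differ):

        // "ProtobufExtractor" is a placeholder for the custom extractor in the
        // script's code-behind; "subFolder" is a virtual column bound from the path.
        @events =
            EXTRACT id        string,
                    payload   byte[],
                    subFolder string
            FROM "/incoming/{subFolder}/{*}.bin"
            USING new MyCode.ProtobufExtractor();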

    In general, is it better to use a small number of large files rather than a large number of small files when using Data Lake? Or should there really be no difference?

    To get around the long-running script issue, I'm extracting a summary of the data into CSV from the business process, which should be more Data Lake friendly. However, that means effectively schematising the data in advance and losing the flexibility of adjusting the extractor as and when we run U-SQL scripts.

    - Neil



    Thursday, January 14, 2016 9:59 AM
  • Hi Michael,

    I'm submitting the job using the Visual Studio tools. The U-SQL script refers to the files in the Data Lake Store folder via a wildcard. I've also re-submitted the job via the portal with the same error.

    I noticed another minor issue with the VS tool: the Data Lake explorer doesn't always download all the files when you select 'download all'.

    In general it's a great service, and I'm very happy to give feedback!

    - Neil

    Thursday, January 14, 2016 10:06 AM