Millions of JSON Files - Best Way to Read?


  • I've read that Data Lake is a limitless store and that the Analytics engine is strong enough to take on any load, but I just ran a job that timed out after 6000 JSON files...

    After some googling, I found that there is a mode currently in preview that raises this limit and improves performance... okay.

    So what if I'm trying to process a million files or more? Am I stuck with processing in one vertex?

    I am also reading that a folder structure that divides data into separate year, month, and day folders is recommended. Is this so I can segment my processes manually? This seems like a very old way of dealing with a very common problem. Where is this powerhouse I was promised? Am I missing something?

    Friday, November 10, 2017 10:37 PM

All replies

  • Hi, sorry for the late reply.

    As you noticed, you should be using the fast file set preview feature. It should scale to tens or hundreds of thousands of files by now, and we aim for 1 million by summer.

    I am not sure what you mean by processing in one vertex. We will scale out the processing of the files over many vertices. Actually, right now we may be processing them with too many vertices (a feature planned for release soon will do a better job of combining small files into a single vertex).
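
    To illustrate the idea of combining small files so that each vertex gets a reasonable amount of work, here is a minimal Python sketch. It is not how ADLA assigns files to vertices; the function name `group_files`, the file names, and the 100,000-byte target are all hypothetical and chosen only for the example.

```python
def group_files(sizes, target_bytes):
    """Greedily pack (name, size) pairs into batches of at most target_bytes.

    A single file larger than target_bytes still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in sizes:
        # Start a new batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical input: ten small files of 40,000 bytes each.
files = [(f"part-{i:04d}.json", 40_000) for i in range(10)]
batches = group_files(files, target_bytes=100_000)

for i, batch in enumerate(batches):
    print(f"batch {i}: {batch}")
```

    Each batch here stands in for the work handed to one vertex: fewer, larger batches mean less per-vertex startup overhead than one vertex per tiny file.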

    If you would like to parallelize the processing of a single JSON document, then you are unfortunately out of luck. That is not because of ADL but because JSON is a hierarchical data format and cannot, in general, be split for parallel processing. Every vertex other than the first would lose the hierarchical context of the data.
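
    A small Python sketch of why a single JSON document does not split cleanly: cutting it at a byte offset leaves two fragments, neither of which parses on its own, because the opening and closing brackets end up in different pieces. The document contents and the helper name `parses` are made up for the illustration.

```python
import json

# A small hierarchical document (hypothetical data).
doc = json.dumps(
    {"orders": [{"id": 1, "items": ["a", "b"]}, {"id": 2, "items": ["c"]}]}
)


def parses(text):
    """Return True if text is a complete, valid JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Split the document in the middle, as a naive byte-range splitter would.
mid = len(doc) // 2
first_half, second_half = doc[:mid], doc[mid:]

whole_ok = parses(doc)
halves_ok = (parses(first_half), parses(second_half))

print("whole document parses:", whole_ok)
print("halves parse:", halves_ok)
```

    The second fragment is the one a later vertex would see: it starts somewhere inside the nested structure with no way of knowing which object or array it belongs to.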

    The folder structure recommendation is just best practice to organize file data in a manageable and partitioned way. That gives you the ability to write queries that give you "partition" pruning (e.g., only query files for a specific date range), thus reducing cost and improving performance.
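
    As a rough sketch of what partition pruning buys you, the Python below lists only the day folders that fall inside a date range, assuming a `root/yyyy/mm/dd` layout. The root path `/input/events` and the function name `partition_paths` are hypothetical; in a real job the engine would do this elimination for you based on the query predicate.

```python
from datetime import date, timedelta


def partition_paths(root, start, end):
    """Return one folder path per day in [start, end], inclusive."""
    days = (end - start).days + 1
    return [
        f"{root}/{d.year:04d}/{d.month:02d}/{d.day:02d}"
        for d in (start + timedelta(days=i) for i in range(days))
    ]


# Only four day folders are touched, however many years of data exist.
paths = partition_paths("/input/events", date(2017, 12, 30), date(2018, 1, 2))
for p in paths:
    print(p)
```

    Everything outside those folders is never opened, which is where the cost and performance savings come from.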

    Feel free to reach out with more detailed questions. Now that the holidays are over, we should be able to reply more quickly.

    Michael Rys

    Thursday, February 8, 2018 12:24 AM