Files with a JSON File per Line Non-Parallelizable

    Question

  • It seems that files that contain a JSON file per line, extracted using the text extractor and manipulated using the JSON functions from GitHub, are not parallelizable. I came to this conclusion using the usage modeler in Visual Studio. Is there a reason why this is the case?

    • Edited by Pfav Thursday, July 28, 2016 6:35 PM
    Thursday, July 28, 2016 2:46 PM

Answers

  • Files are getting "parallelized" in two layers. Each file is internally represented in 250 MB extents, and each vertex can work on up to 4 extents in parallel. Since 4 × 250 MB is about 1 GB, a file feels non-parallelized until it grows past that size; once you step over the 1 GB limit, you get vertex-level parallelism as well.

    This is currently based solely on the file's extent count, with extents defaulting to 250 MB. I think you can create files with smaller extents if you append data to a file in a way that creates new extents, but I need to check what the current rules are (they may have changed since I last looked).
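
    For intuition, here is a back-of-the-envelope sketch of that arithmetic in C# (the 250 MB extent size and the 4-extents-per-vertex limit come from the answer above; the program itself is just illustrative):

        // Estimate extent and vertex counts for a given file size,
        // assuming 250 MB extents and up to 4 extents per vertex.
        using System;

        class ParallelismEstimate
        {
            static void Main()
            {
                const long extentBytes = 250L * 1024 * 1024; // default extent size
                const int extentsPerVertex = 4;              // extents one vertex reads in parallel

                long fileBytes = 2L * 1024 * 1024 * 1024;    // e.g. a 2 GB input file
                long extents = (fileBytes + extentBytes - 1) / extentBytes;           // -> 9
                long vertices = (extents + extentsPerVertex - 1) / extentsPerVertex;  // -> 3

                // Up to 4 x 250 MB (about 1 GB) fits in a single vertex, which is
                // why vertex-level parallelism only shows up above roughly 1 GB.
                Console.WriteLine($"{extents} extents -> {vertices} vertices");
            }
        }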


    Michael Rys

    Thursday, July 28, 2016 10:36 PM
    Moderator

All replies

  • Pfav, can you clarify what you mean by a JSON file per line? Does this mean you have one JSON node per line?

    As you rightly figured out, the JsonExtractor does not parse a file in parallel. If you look at the code for the JsonExtractor on our GitHub site, you will see that it has the AtomicFileProcessing flag set to true, which ensures the data in the file is not processed in parallel:

    [SqlUserDefinedExtractor(AtomicFileProcessing=true)]

    The reason for this is that extractors for file formats such as JSON, XML, or images need to see the full file to understand its structural semantics (a JSON node, for instance, can span multiple lines), so the file is processed as an atomic unit at the cost of parallelization.
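
    For illustration, here is a minimal sketch of such an atomic extractor for input with one JSON object per line (the class name and output column are made up, and it leans on Newtonsoft.Json the way the sample library does; this is not the actual GitHub implementation). Setting AtomicFileProcessing to false instead would let the runtime hand each extractor instance only a slice of the file, which is only safe if the extractor also handles records that straddle slice boundaries:

        using System.Collections.Generic;
        using System.IO;
        using Microsoft.Analytics.Interfaces;
        using Newtonsoft.Json.Linq;

        // Atomic: the whole file goes to a single extractor instance,
        // so multi-line JSON nodes are never cut in half.
        [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
        public class JsonLinesExtractor : IExtractor
        {
            public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
            {
                using (var reader = new StreamReader(input.BaseStream))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        // Each line is assumed to hold one complete JSON object.
                        var json = JObject.Parse(line);
                        output.Set("jsonString", json.ToString());
                        yield return output.AsReadOnly();
                    }
                }
            }
        }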

    Thursday, July 28, 2016 4:57 PM
  • I modified my question to be more accurate. By JSON file per line I mean a full JSON object per line. Basically, each line could be a legal JSON file by itself. An example of this is shown in the second JSON example on the GitHub page.
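
    For example, such a file might look like this (an illustrative sample, not the one from the GitHub page); every line parses as a complete JSON document on its own:

        { "id": 1, "name": "alpha" }
        { "id": 2, "name": "beta" }
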
    Thursday, July 28, 2016 6:37 PM
  • However, it seems that I was wrong about them not being parallelizable. The files seem to be parallelizable, but not until the file size reaches about 2 GB. Does parallelizability rely solely on the size of the input files?
    • Edited by Pfav Thursday, July 28, 2016 6:41 PM
    Thursday, July 28, 2016 6:41 PM