An interesting anomaly while reading a file with a custom structure.

    Question

  • Whenever I try to read a .fastq file (format below) in parallel, an exception is thrown that suggests the file is being split incorrectly. However, when I reformat the file so that all of the data for one sequence is stored in a single row, with columns separated by the '|' character, I can read the data in parallel using a DLA script. Next, I can recreate the .fastq file from the formatted schema using a DLA script. The recreated .fastq file looks the same as the one before formatting, yet when it is created by a job from the formatted structure, I can read it in parallel with my custom extractor without any problems. What happens there?

    Extractors with AtomicFileProcessing set to true work fine on a single node.

    Fastq file format:
    @description_string
    dna_sequence_string
    +second_description_string
    quality_score_string

    Formatted fastq file:
    @description_string|dna_sequence_string|+second_description_string|quality_score_string| 
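
    For reference, an extractor for this four-line record format typically follows this pattern (a simplified sketch against the Microsoft.Analytics.Interfaces API; the class and column names are illustrative, not the exact code from the question):

    using Microsoft.Analytics.Interfaces;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    // Fastq records span four lines, so no safe row split exists:
    // the whole file must be processed on one node.
    [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
    public class FastqExtractor : IExtractor
    {
        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            using (var reader = new StreamReader(input.BaseStream, Encoding.UTF8))
            {
                string description;
                while ((description = reader.ReadLine()) != null)
                {
                    // A record must start with the '@' description line; a split
                    // landing mid-record surfaces as this kind of exception.
                    if (!description.StartsWith("@"))
                        throw new InvalidDataException("Unexpected line: " + description);
                    output.Set<string>("description", description);
                    output.Set<string>("sequence", reader.ReadLine());
                    output.Set<string>("plus", reader.ReadLine());
                    output.Set<string>("quality", reader.ReadLine());
                    yield return output.AsReadOnly();
                }
            }
        }
    }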

    Tuesday, May 17, 2016 11:19 AM

All replies

  • Can you please add the exception you got and also a snippet of the script you used for processing?
    Tuesday, May 17, 2016 12:37 PM
  • Hi! The extractor throws my own exception when a line from the stream differs from what is expected. The fastq file is large, so DLA tries to split it between nodes. If I set AtomicFileProcessing to true for my custom extractor, or if I run my code locally, everything works fine. I previously asked how I should extract data from files like this, and the answer was to read the file on a single node (AtomicFileProcessing set to true). What interests me most is that my extractor works in parallel (AtomicFileProcessing set to false) only on the files recreated by the DLA job, even though it should throw the exception. Maybe DLA saves some metadata about the file, so it knows how to split the recreated one?


    Tuesday, May 17, 2016 2:13 PM
  • How did you upload the original file?

    Files larger than 250MB need to be uploaded with a tool such as the PowerShell upload command, which aligns the record boundaries with the internal file extent boundaries. If these boundaries are misaligned, you will get "interesting" errors when you process the files in parallel.

    The OUTPUT statement will do the right alignment and thus such files will work.

    Note that we are working on a solution for this issue, but it is a fairly involved fix and is taking some time to make sure it works in all cases and does not regress in other areas.
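
    For example, a round trip like the following realigns the data on write (a sketch; the paths and the extractor name are placeholders):

    @fastq =
        EXTRACT description string,
                sequence string,
                plus string,
                quality string
        FROM "/input/reads.fastq"
        USING new MyNamespace.FastqExtractor(); // read atomically on one node

    OUTPUT @fastq
        TO "/output/reads_formatted.txt"
        USING Outputters.Text(delimiter: '|'); // written extent-aligned, safe to read in parallel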


    Michael Rys

    Wednesday, June 15, 2016 10:48 PM
    Moderator
  • Hi! I thought that, for the moment, a file with a custom structure always has to be extracted with AtomicFileProcessing set to true. You suggested it here.
    Friday, June 17, 2016 1:17 PM
  • If your custom structure follows a row format where the rows are separated by a known delimiter (CSV, or a format where you take XML or JSON documents separated by, for example, CR and not containing an unescaped CR), then it can be split and parallelized, even though it is a custom structure. AtomicFileProcessing is set to true when your format's parsing or generation cannot be parallelized based on a row split.

    Since I am not familiar with your fastq format, I don't know whether you need AtomicFileProcessing set to true.
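
    As an illustration, the '|'-separated one-record-per-line layout described above is row-oriented, so an extractor along these lines can run with AtomicFileProcessing set to false (a sketch; names are illustrative):

    using Microsoft.Analytics.Interfaces;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    // Assumes the file's rows are aligned with extent boundaries
    // (an aligned upload, or a file written by an OUTPUT statement).
    [SqlUserDefinedExtractor(AtomicFileProcessing = false)]
    public class FormattedFastqExtractor : IExtractor
    {
        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            using (var reader = new StreamReader(input.BaseStream, Encoding.UTF8))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // One complete record per line; a row-aligned split hands
                    // each vertex whole records.
                    var parts = line.Split('|');
                    output.Set<string>("description", parts[0]);
                    output.Set<string>("sequence", parts[1]);
                    output.Set<string>("plus", parts[2]);
                    output.Set<string>("quality", parts[3]);
                    yield return output.AsReadOnly();
                }
            }
        }
    }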


    Michael Rys

    Friday, June 17, 2016 9:09 PM
    Moderator
  • Does uploading data with the Visual Studio tools work correctly at the moment (for an uncompressed file uploaded as row-structured), or is PowerShell the only good solution?

    Saturday, June 18, 2016 9:47 AM
  • Both should work well. PowerShell can parallelize your upload.

    Michael Rys

    Monday, June 20, 2016 7:51 PM
    Moderator