Job hangs when I try to read a large compressed file.

    Question

  • Hi! Could someone help me? When I try to read a large compressed file, the job hangs. A large file (almost 4 GB) is processed on only one node (one vertex). I tried using both the built-in and a custom extractor, but it doesn't work. Below are my script and extractor code, and also an image showing the state of the process after 15 minutes (although I waited longer for the previous script).

    Script:

    REFERENCE ASSEMBLY [NGSQualityControl.Helper];
    REFERENCE ASSEMBLY [NGSQualityControl.Domain];
    
    @result =
        EXTRACT name_r1 string,
                name_r2 string,
                sequence_r1 string,
                sequence_r2 string,
                optionalName_r1 string,
                optionalName_r2 string,
                qualityScore_r1 string,
                qualityScore_r2 string
        FROM @"adl://mgrdatalakestore.azuredatalakestore.net/Gzip/testowyFastq.fastq.gz"
        //USING NGSQualityControl.Domain.Factories.ExtractorsFactory.GetFormattedPairedEndFastqExtractor();
        USING Extractors.Text(delimiter : '|', quoting : false);
    
    @result_1 =
        SELECT name_r1,
               sequence_r1,
               optionalName_r1,
               qualityScore_r1
        FROM @result;
    
    OUTPUT @result_1
    TO @"adl://mgrdatalakestore.azuredatalakestore.net/Gzip/Output/testowyFastq2.fastq"
    USING NGSQualityControl.Domain.Factories.OutputtersFactory.GetFastqOutputter();
    
    @result_2 =
        SELECT name_r2,
               sequence_r2,
               optionalName_r2,
               qualityScore_r2
        FROM @result;
    
    OUTPUT @result_2
    TO @"adl://mgrdatalakestore.azuredatalakestore.net/Gzip/Output/testowyFastq3.fastq"
    USING NGSQualityControl.Domain.Factories.OutputtersFactory.GetFastqOutputter();


    Extractor code:

    [SqlUserDefinedExtractor(AtomicFileProcessing = false)]
    public class FormattedPairedEndExtractor : IExtractor
    {
        private readonly Int32 _columnCount;
        private readonly Boolean _computeQualityControl;

        public FormattedPairedEndExtractor(Boolean computeQualityControl = false)
        {
            _computeQualityControl = computeQualityControl;
            // Two additional columns hold the parsed quality-control values.
            _columnCount = computeQualityControl ? 10 : 8;
        }

        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            if (output.Schema.Count != _columnCount)
            {
                throw new Exception("Incorrect number of columns");
            }

            using (StreamReader sr = new StreamReader(input.BaseStream))
            {
                string currentLine;

                // Each input line is one '|'-delimited paired-end record.
                while ((currentLine = sr.ReadLine()) != null)
                {
                    String[] substrings = currentLine.Split('|');
                    output.Set<object>(output.Schema[0].Name, substrings[0]);
                    output.Set<object>(output.Schema[1].Name, substrings[1]);
                    output.Set<object>(output.Schema[2].Name, substrings[2]);
                    output.Set<object>(output.Schema[3].Name, substrings[3]);
                    output.Set<object>(output.Schema[4].Name, substrings[4]);
                    output.Set<object>(output.Schema[5].Name, substrings[5]);
                    output.Set<object>(output.Schema[6].Name, substrings[6]);
                    output.Set<object>(output.Schema[7].Name, substrings[7]);
                    if (_computeQualityControl)
                    {
                        output.Set<object>(output.Schema[8].Name, Int32.Parse(substrings[8]));
                        output.Set<object>(output.Schema[9].Name, Int32.Parse(substrings[9]));
                    }

                    yield return output.AsReadOnly();
                }
            }
        }
    }


    Job graph:

    (screenshot not preserved; it showed the job state after 15 minutes, with the extract running on a single vertex)
    Job links:

    https://mgrdatalakeanalytics.azuredatalakeanalytics.net/Jobs/cd822a03-6b64-4544-ad34-f4871d2fb893?api-version=2015-10-01-preview

    https://mgrdatalakeanalytics.azuredatalakeanalytics.net/Jobs/19da0c38-034e-4adf-9e07-97d26cbc91e0?api-version=2015-10-01-preview

    Could you tell me what is happening? This is very important to me.

    Tuesday, April 26, 2016 6:19 PM

Answers

  • Thanks for reporting this and my apologies for the late reply (The forums had an issue that did not let me reply).

    You discovered a problem with our GZip implementation, which didn't handle multi-member gzip files correctly.

    We are now working on a fix. Until the fix is deployed, please operate on the decompressed version of the file.
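
    As an illustration of that workaround, here is a minimal sketch of the same extraction run against a decompressed copy of the file; the decompressed path below is an assumption, so adjust it to wherever the uncompressed data is stored:

    // Minimal sketch of the suggested workaround: extract from the decompressed
    // file instead of the .gz archive. The path below is an assumed location.
    @result =
        EXTRACT name_r1 string,
                name_r2 string,
                sequence_r1 string,
                sequence_r2 string,
                optionalName_r1 string,
                optionalName_r2 string,
                qualityScore_r1 string,
                qualityScore_r2 string
        FROM @"adl://mgrdatalakestore.azuredatalakestore.net/Gzip/testowyFastq.fastq"
        USING Extractors.Text(delimiter : '|', quoting : false);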


    Michael Rys

    Friday, April 29, 2016 10:40 PM
    Moderator

All replies

  • Hi! Have you fixed this bug? I have noticed that decompression works correctly now.
    Monday, May 2, 2016 12:10 PM
  • In the job that succeeded, did you use a single gzip file (compressed from one source file using gzip or another tool), or multiple gzip files concatenated into one as the input? The failed case was using a concatenated gzip file built from multiple gzip files. Actually, if you have multiple gzip files, extraction will be faster if you specify them as a stream set instead of concatenating them into one file, because a stream set allows parallel extraction, while a concatenated file can only be extracted sequentially.
    Wednesday, May 4, 2016 6:07 AM
  • How do I specify them as a stream?
    Wednesday, May 4, 2016 8:00 AM
  • You mean a stream set? Sorry, my mistake. The correct formal name is file set: https://msdn.microsoft.com/en-us/library/azure/mt621294.aspx

    And some samples here:  https://github.com/Azure/usql
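
    For illustration, here is a minimal U-SQL sketch of extracting from a file set instead of a single concatenated file; the wildcard path and the FileName virtual column below are assumed names, not taken from the original job:

    // Minimal sketch: a file set pattern matches many gzip files at once,
    // so each matched file can be extracted in parallel on its own vertex.
    // The path pattern and the {FileName} virtual column are assumed names.
    @result =
        EXTRACT name_r1 string,
                name_r2 string,
                sequence_r1 string,
                sequence_r2 string,
                optionalName_r1 string,
                optionalName_r2 string,
                qualityScore_r1 string,
                qualityScore_r2 string,
                FileName string // virtual column filled from the {FileName} pattern
        FROM @"adl://mgrdatalakestore.azuredatalakestore.net/Gzip/parts/{FileName}.fastq.gz"
        USING Extractors.Text(delimiter : '|', quoting : false);

    Each .fastq.gz file matched by the pattern then becomes its own extraction unit, rather than one sequential stream.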

    Thursday, May 5, 2016 5:16 AM