Vertex failure triggered quick job abort - Exception thrown during data extraction

    Question

  • I am running a Data Lake Analytics job, and I am getting an error during extraction.
    In my scripts I use the built-in Text extractor and also my own extractor. I am trying to get data from a file containing two columns separated by a space character. When I run my scripts locally everything works fine, but not when I run the scripts using my DLA account. I only have the problem when I try to get data from files with many thousands of rows (but only 36 MB of data); for smaller files everything works correctly. I noticed that the exception is thrown when the total number of vertices for the extraction node is larger than one. I ran into this problem earlier while working with other "big" files (.csv, .tsv) and extractors. Could someone tell me what is happening?

    Error message:
    >Vertex failure triggered quick job abort. Vertex failed: SV1_Extract[0][0] with error: Vertex user code error.
    Vertex failed with a fail-fast error

    Script code:

        @result =
        EXTRACT s_date string,
                s_time string
        FROM @"/Samples/napis.txt"
        //USING USQLApplicationTest.ExtractorsFactory.getExtractor();
        USING Extractors.Text(delimiter:' ');

        OUTPUT @result
        TO @"/Out/Napis.log"
        USING Outputters.Csv();

    Code behind:

        [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
        public class MyExtractor : IExtractor
        {
            public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
            {
                using (StreamReader sr = new StreamReader(input.BaseStream))
                {
                    string line;
                    // Read and display lines from the file until the end of 
                    // the file is reached.
                    while ((line = sr.ReadLine()) != null)
                    {
                        string[] words = line.Split(' ');
                        int i = 0;
                        foreach (var c in output.Schema)
                        {
                            output.Set<object>(c.Name, words[i]);
                            i++;
                        }

                        yield return output.AsReadOnly();
                    }
                }
            }
        }

        public static class ExtractorsFactory
        {
            public static IExtractor getExtractor()
            {
                return new MyExtractor();
            }
        }

    Part of sample file:

        ...
        str1 str2
        str1 str2
        str1 str2
        str1 str2
        str1 str2
        ...

    In the job resources I found this jobError message:

    "Unexpected number of columns in input stream."&#45;"description":"Unexpected number of columns in input record at line 1.\nExpected 2 columns&#45; processed 1 columns out of 1."&#45;"resolution":"Check the input for errors or use \"silent\" switch to ignore over(under)-sized rows in the input.\nConsider that ignoring \"invalid\" rows may influence job results.

    But I checked the file again and I don't see an incorrect number of columns. Is it possible that the error is caused by an incorrect file split and distribution? I read that big files can be extracted in parallel.
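
    For reference, the "silent" switch mentioned in the resolution text is a parameter of the built-in extractors; a minimal variation of the script above might look like the sketch below (note that silent:true only skips mismatched rows, it does not explain them):

        @result =
        EXTRACT s_date string,
                s_time string
        FROM @"/Samples/napis.txt"
        // silent:true ignores rows whose column count does not match the EXTRACT schema
        USING Extractors.Text(delimiter:' ', silent:true);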

    Sorry for my poor English.

    Tuesday, March 22, 2016 10:42 PM

Answers

  • Hi Miesko

    How do you upload the file?

    We currently have an issue with large files where a row is not aligned with the file extent boundary if you upload the file with the "wrong" tool. If you upload it as a row-oriented file through Visual Studio or via the PowerShell command, it should get aligned (if the row delimiter is CR or LF). If you did not use the "right" upload tool, the built-in extractor will show the behavior you report, because it currently assumes that record boundaries are aligned to the extents that we split the file into for parallel processing. We are working on a general fix.

    If you see similar error messages with your custom extractor that uses AtomicFileProcessing=true, which should be immune to the split, please send me your job link so I can file an incident and have the engineering team review your case.


    Michael Rys

    Wednesday, March 23, 2016 1:10 AM
    Moderator

All replies

  • Hi! Thank you for your answer :)

    I uploaded my files as binary files using the Visual Studio tools. Now that the files are uploaded as row-oriented files, everything works correctly. But could you tell me what the number of vertices in the job graph means? Are those pieces of work that can be performed in parallel? When my job starts, the number of vertices for a "big" file (400 MB) equals two, but when the job ends, the number of vertices changes to one. What happens?

    My custom extractor also throws an exception for a large binary file. The job error message says that the "Index was outside the bounds of the array", which probably means there is also a problem with an incorrect number of columns. How can I get and send you the necessary data about the job error?

    Wednesday, March 23, 2016 12:09 PM
  • The number of vertices in the job graph indicates the number of work units. So if an SV node (SV stands for super-vertex) shows 10 vertices, then the work for that super-vertex is split into 10 units that can be processed in parallel.

    Currently, an extractor that can run in parallel (AtomicFileProcessing=false) splits files into extents of about 250MB and then runs in parallel, so 400MB would be two vertices. However, depending on the actual processing (for example, if you set AtomicFileProcessing to true, or we end up doing some internal optimizations), the number of nodes may be reduced during execution.
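
    To illustrate the difference, here is a minimal sketch reusing the extractor pattern posted above (the class name is just for illustration). With AtomicFileProcessing = false, each vertex is handed roughly one 250MB extent, so the extractor body only ever sees the bytes of its own extent, which is why record boundaries have to line up with extent boundaries:

        using System.Collections.Generic;
        using System.IO;
        using Microsoft.Analytics.Interfaces;

        // Sketch: with AtomicFileProcessing = false the extractor may be invoked once per
        // ~250MB extent (one vertex each); with true, the whole file goes to a single vertex.
        [SqlUserDefinedExtractor(AtomicFileProcessing = false)]
        public class ParallelSpaceExtractor : IExtractor
        {
            public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
            {
                // input.BaseStream covers only this vertex's extent, so rows must not
                // straddle an extent boundary for this mode to be safe.
                using (var sr = new StreamReader(input.BaseStream))
                {
                    string line;
                    while ((line = sr.ReadLine()) != null)
                    {
                        var words = line.Split(' ');
                        int i = 0;
                        foreach (var c in output.Schema)
                        {
                            output.Set<object>(c.Name, words[i++]);
                        }
                        yield return output.AsReadOnly();
                    }
                }
            }
        }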

    You can get the job link from the job view in Visual Studio by clicking the link there.


    Michael Rys

    Wednesday, March 23, 2016 8:51 PM
    Moderator
  • Thanks :) I put the job links for my custom extractor below, each with a short description:

    Failed job for a .txt file uploaded as a binary file. File size under 400 MB, AtomicFileProcessing set to true:

    https://mgrdatalakeanalytics.azuredatalakeanalytics.net/Jobs/a4aeef87-6a5a-4ba2-b93a-9220b5e9dee7?api-version=2015-10-01-preview

    Successfully completed job for the same file as above, uploaded as a row-oriented file, with AtomicFileProcessing set to false:

    https://mgrdatalakeanalytics.azuredatalakeanalytics.net/Jobs/3f139249-61ea-4163-8926-a82f2782b7f2?api-version=2015-10-01-preview

    Successfully completed job for the same file, uploaded as a row-oriented file, with AtomicFileProcessing set to true:

    https://mgrdatalakeanalytics.azuredatalakeanalytics.net/Jobs/f86bb2c0-9f36-492f-91a4-08c089a2f881?api-version=2015-10-01-preview

    Is this enough, or should I send you something more?

    Could you also advise me how I should extract data from big files where the data for a single row spans many lines of the file in a custom structure (for example, a file structured like a GenBank record: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html)? Is it possible to split such a file, or should I always set AtomicFileProcessing to true in my extractors for this type of file?

    Thursday, March 24, 2016 11:50 AM
  • Thanks. I will ask the team to look into the first failure, but I expect it has the same cause that requires the row-oriented file upload (even though the file is not being split).

    The GenBank file format looks like a very long header with a lot of order-dependent rows, followed by a fixed row sequence of the genetic sequence that uses a carriage return or line feed as a row separator.

    As such, once we fix the record boundary alignment issue, you should be able to write a custom extractor with AtomicFileProcessing set to false, as long as the header and the first row fit into 250MB (the first extent). There should be enough information available in the UDO model for you to know whether you are operating on the first extent of the file or on a later one.

    If you can identify the header rows in your code, you can probably even write an extractor that does parallel processing now.

    If you, however, want to pivot data from the header into the rows below (basically merging rows across an extent boundary), then you need to use AtomicFileProcessing=true.
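
    As a rough sketch of that atomic approach (assuming, purely for illustration, records terminated by a line containing only "//" as in the GenBank sample, and an EXTRACT schema with a single string column named "record"):

        using System.Collections.Generic;
        using System.IO;
        using System.Text;
        using Microsoft.Analytics.Interfaces;

        // Illustrative sketch: with AtomicFileProcessing = true the whole file is processed
        // by one vertex, so a record may safely span many lines (and extent boundaries).
        [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
        public class MultiLineRecordExtractor : IExtractor
        {
            public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
            {
                using (var sr = new StreamReader(input.BaseStream))
                {
                    var recordText = new StringBuilder();
                    string line;
                    while ((line = sr.ReadLine()) != null)
                    {
                        if (line.Trim() == "//")               // assumed record terminator
                        {
                            // "record" is an assumed column name for this sketch
                            output.Set<string>("record", recordText.ToString());
                            yield return output.AsReadOnly();
                            recordText.Clear();
                        }
                        else
                        {
                            recordText.AppendLine(line);
                        }
                    }
                }
            }
        }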


    Michael Rys

    Saturday, March 26, 2016 12:56 AM
    Moderator
  • Do you know anything new about my issue? I am now trying to create a custom extractor like the one here:

    https://social.msdn.microsoft.com/Forums/azure/en-US/6d2452dc-9b6b-40c1-bd68-02fcd06d833f/custom-parallel-extractor-usql?forum=AzureDataLake

    But the problem is the same: when I use AtomicFileProcessing set to true, the DLA service still tries to divide the file and throws an exception. Locally my extractor works correctly.

    Maybe there is another way to force the file to be processed on a single node without splitting?

    Friday, April 1, 2016 9:13 PM
  • Michael - Is this still an issue? Is it triggered by wide files (for example, 250 columns)? At what file size is it triggered? Also, can you tell us which tool in VS specifically we can use to work around this? I've tried Server Explorer and Cloud Explorer.

    Thanks!

    Monday, September 25, 2017 5:12 PM