Data Lake Store file upload problem.

    Question

  • Hi!

    A few months ago I uploaded some data to Data Lake Store - large row-oriented and FASTQ files. My custom extractors read data from those files correctly, but yesterday I uploaded the same data (using VS 2013 and the portal) to another Data Lake Store, and now my extractors throw exceptions during extraction. I don't know why they stop working for the same data. Later I tried uploading my files again to my old Data Lake Store, and the problem is the same (everything still works for the same files that were uploaded a few months ago). Did you change something in Data Lake Store or Analytics?

    Friday, October 14, 2016 7:07 PM

All replies

  • When we fixed the so-called "boundary-alignment" issue, some custom extractors that did not follow the right API calls started to fail. Most of them generated lines by calling ReadLine on the base stream instead of using the recommended form:

    foreach (Stream current in input.Split(this._row_delim)) 
    { 
      using (StreamReader streamReader = new StreamReader(current, this._encoding)) 
      { 
        // extract the line information into outputrow
        yield return outputrow.AsReadOnly(); 
      } 
    } 
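
    For contrast, here is a rough sketch of the pattern that breaks - illustrative only, not anyone's actual code. It wraps a StreamReader around input.BaseStream and calls ReadLine, so once a row can straddle an extent boundary the extractor may see partial rows at the edges of its input split:

    // Problematic pattern (illustrative sketch): reading lines straight off the base stream.
    // Rows that cross the split boundary can arrive truncated, and parsing them then fails.
    using (StreamReader reader = new StreamReader(input.BaseStream, this._encoding))
    {
      string line;
      while ((line = reader.ReadLine()) != null)
      {
        // extract the line information into outputrow
        yield return outputrow.AsReadOnly();
      }
    }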
    

    From your description, this may be the cause of your issues too. If you feel you are following the right pattern in your extractor, please send me your code and I will take a look.


    Michael Rys

    Friday, October 14, 2016 7:39 PM
    Moderator
  •     [SqlUserDefinedExtractor(AtomicFileProcessing = false)]
        public class FormattedPairedEndExtractor : IExtractor
        {
            private readonly Int32 _columnCount;
    
            public FormattedPairedEndExtractor()
            {
                _columnCount = 8;
            }
    
            public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
            {
                if (output.Schema.Count == _columnCount)
                {
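                    // Split the input into rows on '\n'; each row is then split into 8 '|'-separated fields mapped to the output columns.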
                    foreach (Stream stream in input.Split(Encoding.Default.GetBytes("\n")))
                    {
                        using(StreamReader streamReader = new StreamReader(stream, Encoding.Default))
                        {
                            String[] substrings = streamReader.ReadToEnd().Split(new string[] { "|" }, StringSplitOptions.None);
                            output.Set<object>(output.Schema[0].Name, substrings[0]);
                            output.Set<object>(output.Schema[1].Name, substrings[1]);
                            output.Set<object>(output.Schema[2].Name, substrings[2]);
                            output.Set<object>(output.Schema[3].Name, substrings[3]);
                            output.Set<object>(output.Schema[4].Name, substrings[4]);
                            output.Set<object>(output.Schema[5].Name, substrings[5]);
                            output.Set<object>(output.Schema[6].Name, substrings[6]);
                            output.Set<object>(output.Schema[7].Name, substrings[7]);
                            yield return output.AsReadOnly();
                        }
                    }
                }
                else
                {
                    throw new Exception("Incorrect number of columns");
                }
            }
        }
    I wrote to you after the DLA release two months ago. After the DLA update I also had problems reading data that was already in my storage. I implemented my new extractors according to your solution and they started working. But now I have new problems with newly uploaded files.
    Friday, October 14, 2016 9:30 PM
  • Do you have a repro (doc and script) you could send me to take a look at? Or send me a job link?

    Thanks!


    Michael Rys

    Wednesday, October 19, 2016 11:13 PM
    Moderator