none
Get First Row in a large file

    Question

  • I have the following query:

    DECLARE @files = "/PUMS/Population/2015/{filename}c.csv";
    DECLARE @outFile = "/PUMS/Profile.csv";
    
    @fileDDL = EXTRACT
      content string
      , filename string
    FROM @files
    USING Extractors.Text(delimiter: '\0');
    
    @raw = SELECT content,   filename,
    ROW_NUMBER() OVER () AS RowNum
    FROM @fileDDL;
    
    @OutColumns = SELECT  filename
    , c.column
    , r.RowNum
    ,ROW_NUMBER() OVER (
        PARTITION BY   filename) AS RowNumCol
     FROM @raw AS r
     CROSS APPLY EXPLODE(new SQL.ARRAY<string>(r.content.Split (','))) AS c (column)
      WHERE r.RowNum == 1;
    
      OUTPUT @OutColumns
      TO @outFile
      USING Outputters.Csv(outputHeader:true);

    What I am trying to do is to get the first row and to output the column names that are in that first row.  The problem is that the query returns some random row from the file.  My theory is that since this file is 3 GB long, the file was partitioned into multiple vertices and what is row_number 1 is fluid.

    Is there another way to get the very first row in a large file?


    Russel Loski, MCSE Data Platform/Business Intelligence Twitter: @sqlmovers; blog: www.sqlmovers.com

    Wednesday, June 28, 2017 5:45 PM

All replies