none
While processing CSV file using U SQL, input extractor should exclude headers and output of the u SQL should add headers to the output file.

    Question

  • Hello,

    I have multiple CSV files and need to do some operations like MIN, MAX, AVG etc.. I have the input file which contains headers and data.

    While processing data, I want to exclude the headers. But while saving the result to the output file, I need headers to be added to the new output CSV file.  How can I do that?

    Thanks,


    Thursday, February 25, 2016 1:13 PM

Answers

  • No easy way to do this.  There is supposed to be ignore 1st row headers like there is in Hive,(Extractors.Text(delimiter: ',', skipFirstNRows: 1) but for now you'll need a where clause to rip them out, bring all columns in as strings and then convert in the select statement..

    @TwitterData =
        EXTRACT processtimestamp string,
                tweet string,
                tweetdate string,
                author string,
                favourite string,
                retweet string,
                tweetid string,
                hashtags string,
                language string,
                sensitive string,
                tweetlength string,
                usermentions string,
                urls string,
                filename string
        FROM "/TweetData/{filename:*}.csv"
        USING Extractors.Csv(silent : true);

    @HashTags =
        SELECT @TwitterData.hashtags.ToLower() AS hashtags,
               COUNT(Convert.ToInt16(@TwitterData.retweet)) AS retweets,
               COUNT(Convert.ToInt16(@TwitterData.favourite))AS favourites,
               (COUNT(Convert.ToInt16(@TwitterData.retweet))*3) + COUNT(Convert.ToInt16(@TwitterData.favourite)) AS Impact
        FROM @TwitterData

        WHERE @TwitterData.hashtags != ""
              AND @TwitterData.processtimestamp != "Column_0"
              AND @TwitterData.processtimestamp != "Timestamp" /*Timestamp is the name of column1 in the csv */
        GROUP BY @TwitterData.hashtags.ToLower() ;
    OUTPUT @HashTags
    TO "/output/hashtagdata.csv"
    USING Outputters.Csv()

    TO output reverse the process

    select everything as strings and union the select from the data set with a row for your headers ..

    @myoutput =

    select "heading1" as col1, "heading2" as col2

    UNION

    select col1, col2 from myquery;

    OUTPUT @myoutput
    TO "/output/mycsv.csv"
    USING Outputters.Csv();

    Hope this helps and do feedback on what you would like to see changed in U-SQL

    please mark this as answer if it helps @deepfat

            

    Monday, February 29, 2016 8:56 AM
  • We're working on addressing those two issues (both ignoring headers on read, and being able to write column headers on output).  Vote early and often on our feedback site:

    https://feedback.azure.com/forums/327234-data-lake/suggestions/11500386-skip-header-in-extractor

    https://feedback.azure.com/forums/327234-data-lake/suggestions/10635558-generate-heading-rows-using-the-built-in-adla-outp


    Program Manager -- hadoop -- http://blogs.msdn.com/mwinkle

    Wednesday, March 2, 2016 5:13 PM

All replies

  • No easy way to do this.  There is supposed to be ignore 1st row headers like there is in Hive,(Extractors.Text(delimiter: ',', skipFirstNRows: 1) but for now you'll need a where clause to rip them out, bring all columns in as strings and then convert in the select statement..

    @TwitterData =
        EXTRACT processtimestamp string,
                tweet string,
                tweetdate string,
                author string,
                favourite string,
                retweet string,
                tweetid string,
                hashtags string,
                language string,
                sensitive string,
                tweetlength string,
                usermentions string,
                urls string,
                filename string
        FROM "/TweetData/{filename:*}.csv"
        USING Extractors.Csv(silent : true);

    @HashTags =
        SELECT @TwitterData.hashtags.ToLower() AS hashtags,
               COUNT(Convert.ToInt16(@TwitterData.retweet)) AS retweets,
               COUNT(Convert.ToInt16(@TwitterData.favourite))AS favourites,
               (COUNT(Convert.ToInt16(@TwitterData.retweet))*3) + COUNT(Convert.ToInt16(@TwitterData.favourite)) AS Impact
        FROM @TwitterData

        WHERE @TwitterData.hashtags != ""
              AND @TwitterData.processtimestamp != "Column_0"
              AND @TwitterData.processtimestamp != "Timestamp" /*Timestamp is the name of column1 in the csv */
        GROUP BY @TwitterData.hashtags.ToLower() ;
    OUTPUT @HashTags
    TO "/output/hashtagdata.csv"
    USING Outputters.Csv()

    TO output reverse the process

    select everything as strings and union the select from the data set with a row for your headers ..

    @myoutput =

    select "heading1" as col1, "heading2" as col2

    UNION

    select col1, col2 from myquery;

    OUTPUT @myoutput
    TO "/output/mycsv.csv"
    USING Outputters.Csv();

    Hope this helps and do feedback on what you would like to see changed in U-SQL

    please mark this as answer if it helps @deepfat

            

    Monday, February 29, 2016 8:56 AM
  • We're working on addressing those two issues (both ignoring headers on read, and being able to write column headers on output).  Vote early and often on our feedback site:

    https://feedback.azure.com/forums/327234-data-lake/suggestions/11500386-skip-header-in-extractor

    https://feedback.azure.com/forums/327234-data-lake/suggestions/10635558-generate-heading-rows-using-the-built-in-adla-outp


    Program Manager -- hadoop -- http://blogs.msdn.com/mwinkle

    Wednesday, March 2, 2016 5:13 PM