none
U-SQL Query in DataLakeAnalytic Job - Multiple Input Files RRS feed

  • Question

  • Hi All,

    I have a question, I am learning U-SQL and trying to extract data from multiple files. I tried various ways but every time it just copies data from 1 file and not from all 4:

    Explicitly Speifying File Names

     @results = EXTRACT postcode string,
      oacode string,
      total int,
      percentage int
      FROM "/oacodes/Postcode_Estimates_2_A_F.csv","/oacodes/Postcode_Estimates_2_G_L.csv","/oacodes/Postcode_Estimates_2_M_R.csv","/oacodes/Postcode_Estimates_2_S_Z.csv"
      USING Extractors.Csv();
     
      @oacoderesults = SELECT oacode, COUNT(*) AS total
      FROM @results
      GROUP BY oacode;

      OUTPUT @oacoderesults TO "/output/oacode_totals.csv"
      ORDER BY total DESC
      USING Outputters.Csv();

    Using a WILD Card Naming

     

     @results = EXTRACT postcode string,
      oacode string,
      total int,
      percentage int,
      filename string
      FROM "/oacodes/Postcode_Estimates_2_{filename}.csv"
      USING Extractors.Csv();
     
      @oacoderesults = SELECT oacode, COUNT(*) AS total
      FROM @results
      GROUP BY oacode;

      OUTPUT @oacoderesults TO "/output/oacode_totals.csv"
      ORDER BY total DESC
      USING Outputters.Csv();

    Using a WILD Card Naming

     

     @results = EXTRACT postcode string,
      oacode string,
      total int,
      percentage int,
      filename string
      FROM "/oacodes/{filename}.csv"
      USING Extractors.Csv();
     
      @oacoderesults = SELECT oacode, COUNT(*) AS total
      FROM @results
      GROUP BY oacode;

      OUTPUT @oacoderesults TO "/output/oacode_totals.csv"
      ORDER BY total DESC
      USING Outputters.Csv();

    ----------------------------------------------------------------------------------------- 

    @results = EXTRACT postcode string,
      oacode string,
      total int,
      percentage int,
      filename string
      FROM "/oacodes/{*}.csv"
      USING Extractors.Csv();
     
      @oacoderesults = SELECT oacode, COUNT(*) AS total
      FROM @results
      GROUP BY oacode;

      OUTPUT @oacoderesults TO "/output/oacode_totals.csv"
      ORDER BY total DESC
      USING Outputters.Csv();

    Everytime getting below output where is only copies 1 file

    Not sure what I am doing wrong as Microsoft U-SQL site also says that its achievable:

    https://msdn.microsoft.com/en-us/azure/data-lake-analytics/u-sql/input-files-u-sql#listfilepath

    Note : I am using DATA LAKE STORAGE GEN1 for your information.


    Regards Gursethi Blog: http://gursethi.blogspot.com/ ++++ Please mark "Propose As Answer" if my answer helped ++++



    • Edited by GURSETHI Wednesday, September 12, 2018 6:11 AM more info
    Tuesday, September 11, 2018 10:55 PM

Answers

  • I was able to fetch inputs from all the files and output it. I did a simple script of fetching from all inputs and outputting it into a single file. It works, Below is my script.

    @results = EXTRACT postcode string,
    total int
    FROM "/input/Postcode_Estimates_{*}.csv"
    USING Extractors.Csv(skipFirstNRows:1);

    OUTPUT @results
    TO "/output/Postcode_Estimates_output.csv"
    ORDER BY postcode
    USING Outputters.Csv();

    Input File names: 

    Postcode_Estimates_2_A_F.csv, Postcode_Estimates_2_G_L.csv, Postcode_Estimates_2_M_R.csv, Postcode_Estimates_2_S_Z.csv 

    You can double check if the input is coming from 4 files or not in the job graph tab:

    If you look at the input in the job graph, it should say 4 files

    Wednesday, September 12, 2018 5:59 PM
    Moderator

All replies

  • I was able to fetch inputs from all the files and output it. I did a simple script of fetching from all inputs and outputting it into a single file. It works, Below is my script.

    @results = EXTRACT postcode string,
    total int
    FROM "/input/Postcode_Estimates_{*}.csv"
    USING Extractors.Csv(skipFirstNRows:1);

    OUTPUT @results
    TO "/output/Postcode_Estimates_output.csv"
    ORDER BY postcode
    USING Outputters.Csv();

    Input File names: 

    Postcode_Estimates_2_A_F.csv, Postcode_Estimates_2_G_L.csv, Postcode_Estimates_2_M_R.csv, Postcode_Estimates_2_S_Z.csv 

    You can double check if the input is coming from 4 files or not in the job graph tab:

    If you look at the input in the job graph, it should say 4 files

    Wednesday, September 12, 2018 5:59 PM
    Moderator
  • Thanks VairavanS,

    I guess it was my mistake (A learning). I was also getting similar execution plan and it was mentioning same thing i,e, 4 filed and 1st file 1.65 MB. I was in an assumption that I will see 4 different input's as a first step in execution plan (seperate for each file) and all flow to one SV1 Extract task.

    My bad, thanks for tnlightening.


    Regards Gursethi Blog: http://gursethi.blogspot.com/ ++++ Please mark "Propose As Answer" if my answer helped ++++

    Wednesday, September 12, 2018 6:45 PM