Read data from multiple files and folder using u-sql

    Question

  • Hello,

    I have about 100 GB of data spread across roughly 197 files in a folder and subfolder structure. I want to read the data from these files and combine the analyzed data into one output file.

    My question is: how can I read data from these files and process it, the way we would in C# by looping over the folder and subfolder structure and picking up each file? Currently I only know how to pick up one file by hardcoding its URL and extracting data from it, but I want to do this in a loop so that the same extract works on each file.


    Tuesday, March 1, 2016 6:23 AM

Answers

  • @Manthan, thanks for using the service.

    In this case, there is a wildcard syntax that you can use (the feature is called "file sets"). Basically, in the path you pass to the extractor, you can use {*}.csv rather than foo.csv, and you will read all of the matching files.
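
    A minimal sketch of this, assuming a hypothetical /input folder, a made-up three-column CSV schema, and a hypothetical output path (none of these come from the thread):

    ```usql
    // Read every .csv file in /input through one EXTRACT statement.
    // The folder, column names, and column types are assumptions.
    @rows =
        EXTRACT id int,
                name string,
                amount decimal
        FROM "/input/{*}.csv"
        USING Extractors.Csv();

    // Write the combined rows from all matched files to a single output file.
    OUTPUT @rows
    TO "/output/combined.csv"
    USING Outputters.Csv();
    ```

    No explicit loop is needed: the file set pattern takes the place of the C#-style iteration over files, and every matched file is read with the same schema.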

    The file set feature is richer than that: you can also create virtual columns, which can be used to prune out files that the query does not need (e.g., /{year:*}/{month:*}/{day:*}/{customer:*}.csv).
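
    A sketch of how such virtual columns might be used, assuming a hypothetical /input root, a made-up two-column data schema, and that the virtual columns are declared as string columns in the EXTRACT schema:

    ```usql
    // year, month, day, and customer are virtual columns filled in from the
    // file path, not from the file contents. The path root and the data
    // columns (id, amount) are assumptions for illustration.
    @rows =
        EXTRACT id int,
                amount decimal,
                year string,
                month string,
                day string,
                customer string
        FROM "/input/{year:*}/{month:*}/{day:*}/{customer:*}.csv"
        USING Extractors.Csv();

    // Filtering on virtual columns lets the query skip files whose paths
    // do not match, instead of reading every file.
    @filtered =
        SELECT customer,
               id,
               amount
        FROM @rows
        WHERE year == "2016" AND month == "03";

    OUTPUT @filtered
    TO "/output/march2016.csv"
    USING Outputters.Csv();
    ```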

    You can find more details in the docs here: https://msdn.microsoft.com/en-us/library/azure/mt621320.aspx


    Program Manager -- hadoop -- http://blogs.msdn.com/mwinkle

    Tuesday, March 1, 2016 6:44 PM

All replies

  • @Matt,

    I have a bunch of *.txt files in my input directory, and I'm trying to access them as you suggested, with:

    @rs =
        EXTRACT s_type string,
                s_date DateTime,
                s_time string,
                s_domain string,
                s_id int,
                s_message string
        FROM @"/Samples/logs/{*}.txt"
        USING Extractors.Tsv();

    When running the script, I'm getting an error:

    The path contains invalid file char,

    path:C:\LocalRunRoot\DataRoot\Samples\logs\{*}.txt, invalidChar:'*'

    What am I doing wrong?

    Wednesday, March 2, 2016 3:49 PM
  • It appears that our local runtime does not yet support file sets. Support is going to be added; I'm following up on which release of the tools will include it.

    Program Manager -- hadoop -- http://blogs.msdn.com/mwinkle

    Wednesday, March 2, 2016 5:15 PM
  • Hi Matt,

    Thank you for your reply. This will help us read data from the files. On top of this, I have one more question: while reading files using {*}, can we extract the name of each file through a keyword or some other means?

    Also, can we apply {*} to the subfolders under the root? My data folders are not in YYYY/MM/dd format, so I would like to know whether I can get names from a structure like foo/doo, etc.

    Thanks, Manthan Upadhyay

    Thursday, March 3, 2016 8:51 AM
  • Hi Matt, thanks for the hint! It's working indeed when not running locally.
    Thursday, March 3, 2016 9:26 AM
  • Hey Matt,

    I found a way to extract the filename. :-) I used {filename:*}.csv in the path, extracted filename as a column, and wrote it to the output file.
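
    A minimal sketch of that approach, assuming a hypothetical /input folder, made-up data columns, and a hypothetical output path:

    ```usql
    // filename is a virtual column bound to the part of each file name
    // matched by {filename:*}. The folder and the data columns
    // (id, amount) are assumptions for illustration.
    @rows =
        EXTRACT id int,
                amount decimal,
                filename string
        FROM "/input/{filename:*}.csv"
        USING Extractors.Csv();

    // Each output row carries the name of the file it was read from.
    OUTPUT @rows
    TO "/output/with_filename.csv"
    USING Outputters.Csv();
    ```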


    Thanks, Manthan Upadhyay

    Thursday, March 3, 2016 10:58 AM
  • :-) Good, the virtual columns feature is pretty cool (imo), and I'm glad you've figured it out. We should have a tooling update at the end of March that will include support for file sets when running locally. I'm sorry for any trouble there.


    Program Manager -- hadoop -- http://blogs.msdn.com/mwinkle

    Thursday, March 3, 2016 3:27 PM