none
Using a combiner?

    Question

  • Hello,

    I am not sure if this is the proper forum to ask this question but I hope that you can point me in the right direction at least. I have about a few million lines (let's say 1M) in a comma delimited text file (file_A) and I have a few thousand keywords & phrases (1K) (fileB) that I would like to see if they exists for each line's particular columns. In other words, the final outcome would be a 1M X 1K matrix with values 0/1 depending on if a keyword exists in that particular record.

    I was thinking of using a cross join so I would have a row in the resultset for each line combination from fileA and fileB that I can easily check whether the existence of the phrase in each line. But I can't really pivot the resultset to show the lines from fileA as rows and the result of search for each of the keywords in fileB as columns.

    I was thinking of using a combiner that would take fileA and fileB as right and left rowsets and process with the help of RegularExpressions library and return the combined row as output.

    Apart from not having a COMBINE example, I am also not sure if this is the right approach to take to make this a performing script. Could you either give me a few pointers or point me some tutorials that might help me? Many thanks!


    • Edited by Fero Us Thursday, November 26, 2015 8:50 PM
    Thursday, November 26, 2015 8:50 PM

Answers

  • In order to use a file inside your assembly you need to do one of two things.

    Either you register it with your assembly using CREATE ASSEMBLY's ADDITIONAL_FILES option (this requires the file to be in a location in the ADLS account). Note that this is supported by VisualStudio's Register Assembly feature (right click menu on assembly project) and will do the upload of the file for you, but the code behind feature does not support this. Then your file should be visible in the code in the current working directory.

    Alternatively, you upload the file to an ADLS location, and in your script, you use DEPLOY RESOURCE pathname to make the file available in your code in the current working directory.

    If you still have problems feel free to post some of your script or code fragments. If you don't want to publicly share it, send an email to usql at Microsoft email domain.


    Michael Rys

    • Marked as answer by Fero Us Thursday, December 3, 2015 5:15 AM
    Wednesday, December 2, 2015 6:23 PM
    Moderator

All replies

  • Hi Fero

    You probably could do the cross join with a combiner, although the PRODUCES clause needs to know the output schema at compilation time, so generating the columns for each keyword may not be the right way to go.

    But before diving deeper into using a combiner, I would like to better understand your ultimate goal. Note that generating a several 1000 column wide and million row long matrix seems like an unwieldy beast to use. What would you do with that matrix?

    Maybe generating an inverted index using a keyword and an array of matching records may be more appropriate?

    Also, it sounds as if the keyword file may be acting more as a reference file. Depending on the size of that file, you could also load the file as part of the assembly that contains the udo (probably an applier) which does create the output you are looking for.

    As to performance: in general, using a UDO such as a processor or combiner will hide logic and semantics from the query optimizer and thus is often less efficient than expressing it in U-SQL with joins or predicates and user-defined functions. I would suggest to do some perf tests if you feel that you need the UDO model for its expressive power.

    Let me know if you have more questions.

    Michael

    PS: We will add more documentation on UDOs over the next couple of weeks. 


    Michael Rys

    Monday, November 30, 2015 7:47 AM
    Moderator
  • Thanks a lot Michael. A smaller subset of the rows will make up of a feature matrix to train a model for various classifiers. For the live scenario, we will need a subset of the columns to make classification predictions. Using a sparse matrix might be an option to discover.

    I will look into your applier suggestion as the reference file is quite a light-weight one size-wise.  I'll share my findings once I test it out. Cheers.


    • Edited by Fero Us Monday, November 30, 2015 8:07 AM
    Monday, November 30, 2015 8:06 AM
  • Hello again Michael. It appears that I am running into some basic problems with respect to the file io re: your reference file suggestion in UDO. It probably helps to know the environment that in which I am working. 

    a.Passing the absolute path obviously does not help for the Cloud scenario as I do not have the advance knowledge of the environment.

    b.Passing the relative path does not help either as when done, the process fails with "d:\.... not found" even if I somehow managed to put the file in the mentioned folder, how would I know if and when the script is run in parallel, the file will be available to the worker processes?

    c.Passing a webhdfs:// path does not help for the codebehind as it fails with unsupported path.

    d.Passing a wasb:// path and leveraging Azure.Storage library also fails with "The given path's format is not supported"

    Given above findings, I am assuming that you were referring to reading the file directly from a blob from within code behind. Did I misunderstand your suggestion? If not, I'll continue trying to make it work. If so, what is the right way to go about this?

    Wednesday, December 2, 2015 8:18 AM
  • In order to use a file inside your assembly you need to do one of two things.

    Either you register it with your assembly using CREATE ASSEMBLY's ADDITIONAL_FILES option (this requires the file to be in a location in the ADLS account). Note that this is supported by VisualStudio's Register Assembly feature (right click menu on assembly project) and will do the upload of the file for you, but the code behind feature does not support this. Then your file should be visible in the code in the current working directory.

    Alternatively, you upload the file to an ADLS location, and in your script, you use DEPLOY RESOURCE pathname to make the file available in your code in the current working directory.

    If you still have problems feel free to post some of your script or code fragments. If you don't want to publicly share it, send an email to usql at Microsoft email domain.


    Michael Rys

    • Marked as answer by Fero Us Thursday, December 3, 2015 5:15 AM
    Wednesday, December 2, 2015 6:23 PM
    Moderator
  • Thanks Michael. It worked like a charm. Is there a document that you can point me to explain differences between appliers vs. processors as they both seem to be row-based operations?
    Thursday, December 3, 2015 5:22 AM
  • The main differences are:

    1. Processors are:

    • 1 row to 0 or 1 row.
    • Invoked with a PROCESS expression.

    2. Appliers are:

    • 1 row to 0 to n rows.
    • Invoked with a CROSS/OUTER APPLY expression.

    The reference documentation kind of calls it out, but it is a bit buried right now :).


    Michael Rys

    Thursday, December 3, 2015 9:36 PM
    Moderator