How to transform a rowset in parallel

    Question

  • Hi! I am looking for a way to transform a rowset in parallel using my own code. I have a rowset with string columns, and I want to check every row and, depending on a number of conditions, transform the strings in those columns. For example:

    • If the string in one column is too long, remove the row from the rowset.
    • If the string in one column is too short or empty, remove the row from the rowset.
    • One column stores an encoded value: decode this value, then trim the string in a second column depending on it.
    • Remove the first or last N characters from a string.

    Is it a good idea to create my own implementation of the IProcessor interface (like here) with a list of commands (conditions) as a constructor argument? Or are there better solutions to this problem? Execution time is very important.


    Monday, April 18, 2016 6:52 PM

Answers

  • A custom processor transforms 1 row to either 0 or 1 row, so you can theoretically use it to filter out rows. However, unless you need to write a generic complex rowset transformer, it is probably better to write the filter as a user-defined function instead and then use it in the SELECT's WHERE clause.

    A processor will probably not block parallelization, although a badly written reducer or combiner may. An extractor or outputter may also need to limit parallelism in order to process a file as a unit.

    What a processor can block is "predicate push-downs" or column pruning to earlier phases in the processing, since the UDO code is a black box to the optimizer.


    Michael Rys

    Friday, April 22, 2016 5:29 PM
    Moderator

All replies

  • You could write a processor, but that could prevent the U-SQL optimizer from pushing predicates and column pruning through the processor.

    Most of your operations and conditions can be expressed fairly easily, either with C# inline in the U-SQL expression or with user-defined functions.

    For example:

    @result =
        SELECT string_col.Trim(...) AS new_string_col
        FROM @rowset
        WHERE mynamespace.myclass.correct_size(string_col);
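
    As a rough illustration, the user-defined function called in the WHERE clause could be an ordinary C# static method along the following lines. This is only a sketch under assumptions: the length bounds MinLength and MaxLength and the helper remove_ends are hypothetical and not part of the original example, and the class would still have to be compiled into an assembly or code-behind file that the script references.

```csharp
using System;

namespace mynamespace
{
    public static class myclass
    {
        // Hypothetical bounds; pick values that match your own data.
        private const int MinLength = 1;
        private const int MaxLength = 100;

        // Returns true when the string is non-null and its length lies
        // within [MinLength, MaxLength]. Used as a row filter in WHERE.
        public static bool correct_size(string value)
        {
            return value != null
                && value.Length >= MinLength
                && value.Length <= MaxLength;
        }

        // Hypothetical helper: drops the first `first` and the last `last`
        // characters. Returns an empty string if nothing would remain.
        public static string remove_ends(string value, int first, int last)
        {
            if (value == null)
            {
                return null;
            }
            if (first < 0 || last < 0)
            {
                throw new ArgumentException("first and last must be non-negative");
            }
            if (first + last >= value.Length)
            {
                return string.Empty;
            }
            return value.Substring(first, value.Length - first - last);
        }
    }
}
```

    Because both methods are plain scalar functions that work on one value at a time, they can be called from SELECT and WHERE expressions without writing a user-defined operator.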


    Michael Rys

    Monday, April 18, 2016 7:26 PM
    Moderator
  • Thank you for your answer. Is it possible that a custom processor may block parallelization? Here I see that each row is processed separately (in the Process method), but on this site you wrote that UDO operators may sometimes block parallelization. How can I avoid this situation when using a custom UDO? Is it possible to remove a row from a rowset using a custom processor?
    Monday, April 18, 2016 11:42 PM