How does an ORDER BY clause affect the parallelization of data saving by a custom Outputter?

    Question

  • Hi! How does an ORDER BY clause affect the parallelization of data saving by a custom Outputter? I have written my own parallel outputter, but when I use an ORDER BY clause in my OUTPUT statement, the data is saved to disk on only a single node (vertex). Is this normal, expected behavior?
    Thursday, May 19, 2016 8:44 PM

Answers

  • Unless your ORDER BY can be supported by the existing partitioning of the data flow, it will force a single vertex, because the ordering needs to see all of the data. For example, if you order by a and b, and the preceding part of the job graph can guarantee that the data is partitioned by a, then each partition can be ordered on b before the partitions are combined into the final ordering.

    Michael Rys

    Wednesday, June 15, 2016 11:17 PM
    Moderator
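
The partitioning argument in the answer can be illustrated with a small conceptual sketch. It is plain Python, not U-SQL, and it does not model how the U-SQL runtime actually schedules vertices. The function names (`partition_by_a`, `sort_partition`, `combine`) and the sample rows are hypothetical, chosen only for illustration. The sketch assumes rows of the form `(a, b)` that are already partitioned on `a`, sorts each partition independently, and then merges the sorted partitions into the final ordering.

```python
import heapq
from collections import defaultdict


def partition_by_a(rows, num_partitions):
    """Split rows into partitions keyed on column a.

    Hypothetical stand-in for an upstream stage that guarantees
    partitioning by a: every row with the same a lands in the same partition.
    """
    parts = defaultdict(list)
    for a, b in rows:
        parts[hash(a) % num_partitions].append((a, b))
    return list(parts.values())


def sort_partition(part):
    """Order one partition by (a, b); each partition can be sorted independently."""
    return sorted(part)


def combine(sorted_parts):
    """Merge the already-sorted partitions into one globally ordered sequence."""
    return list(heapq.merge(*sorted_parts))


# Illustrative sample data, not taken from the original thread.
rows = [(2, "y"), (1, "b"), (3, "k"), (1, "a"), (2, "x"), (3, "j")]

parts = partition_by_a(rows, 3)
locally_sorted = [sort_partition(p) for p in parts]
result = combine(locally_sorted)

# The partition-then-merge result matches a sort over all of the data.
print(result == sorted(rows))
```

Without the partitioning guarantee on `a`, no partition could be sorted in isolation in a way that the final step could rely on, which corresponds to the case where the ordering has to see all of the data at once.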