Parallelism: Using PLINQ

  • Question

  • All,

I have taken an interest in parallel processing (Framework 4.0), and my first go at it is with PLINQ, the parallel equivalent of LINQ to Objects. I can imagine Reed saying “It only took you a half decade to get there!” … yea, yea, ok. ;-)

Ultimately I have only one question, although I’ll get into related specifics as I go. My question is this:

     

When should I use PLINQ as opposed to staying with standard LINQ to Objects?

     

    I’ll now present some tests before I continue.

    *****

I took something I put together a few weeks ago to use as a testing ground for this. It’s a good one to use – sort of – because it allows me to easily select the same sort of data in varying collection sizes. It’s nothing more than a means by which I can grab data from various XML files on my website (which I’ve shown here ad nauseam).

    I say “sort of” because of what’s shown in the MSDN documentation from the link above:

     

    However, parallelism can introduce its own complexities, and not all query operations run faster in PLINQ. In fact, parallelization actually slows down certain queries. Therefore, you should understand how issues such as ordering affect parallel queries.

     

It then links to this MSDN document, which I think also plays into all of this.

    So even though it’s not a perfect fit (because I’m ordering the result), my multi-count collections are what I used to test with.
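As an aside on the ordering point: here is a tiny sketch (illustrative only, not from my test code) of how ordering shows up in PLINQ. A parallel query with no Order By is free to return results in whatever order the partitions complete; AsOrdered restores source order, at some cost:

```vb
' Illustrative only: without AsOrdered (and without an Order By),
' PLINQ may yield these in any order as partitions complete.
Dim evens = From n In Enumerable.Range(0, 100).AsParallel().AsOrdered() _
            Where n Mod 2 = 0 _
            Select n
' With AsOrdered, evens yields 0, 2, 4, ... 98 in source order.
```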

    The following shows the code used to test the various collection sizes:

Dim sw As New Stopwatch
sw.Start()

'Dim qry As System.Linq.IOrderedEnumerable(Of String) = _
'    From pd As PersonData In pdList _
'    Where pd.State.ToLower.Replace(" "c, "") = _
'          stateName.ToLower.Replace(" "c, "") _
'    Select pd.City Distinct _
'    Order By City

Dim qry As System.Linq.OrderedParallelQuery(Of String) = _
    From pd As PersonData In pdList.AsParallel _
    Where pd.State.ToLower.Replace(" "c, "") = _
          stateName.ToLower.Replace(" "c, "") _
    Select pd.City Distinct _
    Order By City

If qry.Count > 0 Then
    retVal = qry.ToArray
    sw.Stop()

    Dim sb As New System.Text.StringBuilder
    sb.AppendLine("Source Collection Count: " & _
                  pdList.Count.ToString("n0"))
    sb.AppendLine("Result Collection Count: " & _
                  qry.Count.ToString("n0"))
    sb.AppendFormat("Elapsed Time: {0:f1} Milliseconds", _
                    sw.Elapsed.TotalMilliseconds)
    MessageBox.Show(sb.ToString, "PLINQ Query")
End If
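One caveat about my harness, worth noting: LINQ and PLINQ queries are deferred, so qry.Count and qry.ToArray each execute the whole query again, which pads the measured time. A sketch (using the same qry and variables as above) that materializes the results once, so the stopwatch times a single execution:

```vb
' Sketch: materialize the query exactly once; Count and ToArray on a
' deferred query would each re-run it and inflate the timing.
Dim sw As New Stopwatch
sw.Start()
Dim cities As String() = qry.ToArray()
sw.Stop()

If cities.Length > 0 Then
    retVal = cities
    Dim sb As New System.Text.StringBuilder
    sb.AppendLine("Source Collection Count: " & pdList.Count.ToString("n0"))
    sb.AppendLine("Result Collection Count: " & cities.Length.ToString("n0"))
    sb.AppendFormat("Elapsed Time: {0:f1} Milliseconds", sw.Elapsed.TotalMilliseconds)
    MessageBox.Show(sb.ToString, "PLINQ Query")
End If
```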



You can see that one version uses standard LINQ to Objects; by commenting one out and uncommenting the other, I can switch between LINQ and PLINQ. I tested four (4) collection sizes:

     

    • 100
    • 50,000
    • 250,000
    • 1,000,000

     

    The results are shown below:

[Screenshots of the timing results for each collection size were shown here.]

    If nothing else, the results are interesting.

    In the first one – with just a hundred in the source collection – PLINQ is a clear loser as it added tremendously to the elapsed time, but in the others which followed that test, PLINQ is an obvious winner, cutting down the time by as much as a third.

I could (but didn’t) also test several collection sizes between 100 and 50,000, though I’m not sure whether that would be useful.

So to embellish my question then: How can I know when to use it? In my example, the user can just as easily choose a small set as a large one.
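One option I could try (a sketch only; ParallelThreshold is a made-up number that would have to come from measurements like the ones above) is to branch on the source count at runtime instead of committing to one form:

```vb
Const ParallelThreshold As Integer = 10000 ' hypothetical cutoff; tune by testing

Dim cities As String()
If pdList.Count >= ParallelThreshold Then
    cities = (From pd As PersonData In pdList.AsParallel _
              Where pd.State.ToLower.Replace(" "c, "") = _
                    stateName.ToLower.Replace(" "c, "") _
              Select pd.City Distinct _
              Order By City).ToArray()
Else
    cities = (From pd As PersonData In pdList _
              Where pd.State.ToLower.Replace(" "c, "") = _
                    stateName.ToLower.Replace(" "c, "") _
              Select pd.City Distinct _
              Order By City).ToArray()
End If
```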

    Continuing, and based on that second MSDN article, it seems that the result count also plays into this but I have no way to predict that. To what extent does the result count participate here?

So then, to sum up, how can I know which is the best way? I realize it’s not a one-size-fits-all answer and I’m not looking for that, but certainly your inputs are all welcome and appreciated.

    Thanks :)


    Still lost in code, just at a little higher level.

    :-)

    Sunday, April 19, 2015 5:42 PM

Answers

  • So then, to sum up, how can I know which is the best way? I realize it’s not a one-size-fits-all answer and I’m not looking for that, but certainly your inputs are all welcome and appreciated.

    It also depends on how the query fits into the larger context.  If the queries are infrequent, then the absolute wasted time from using PLINQ on small queries is so small it doesn't matter, so you may as well use PLINQ.   If you are doing lots of these queries then the overhead of PLINQ might become relevant if they are all small.

    Monday, April 20, 2015 12:59 AM
  • I think your question contains its own answer.  :)

You pretty much have to do just as you've done and test the target scenario both ways, then pick the winner; or more accurately, determine the threshold at which you switch from executing the non-parallelized version of the routine to the parallelized one.

You can also try to handle it somewhat dynamically... something like starting a traditional loop and monitoring its execution time.  If the time hits a certain threshold and there's still a lot of the collection remaining, exit the loop, process the remainder of the collection in parallel, then combine the results.
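A rough sketch of that adaptive idea (the 50 ms budget and 1,000-item remainder cutoff are illustrative placeholders, not tested values):

```vb
Dim matches As New List(Of String)
Dim clock = Stopwatch.StartNew()
Dim i As Integer = 0

' Sequential pass with a time budget; keep going past the budget
' only if the remainder is too small to be worth parallelizing.
While i < pdList.Count AndAlso _
      (clock.ElapsedMilliseconds <= 50 OrElse pdList.Count - i < 1000)
    Dim pd = pdList(i)
    If pd.State.ToLower.Replace(" "c, "") = stateName.ToLower.Replace(" "c, "") Then
        matches.Add(pd.City)
    End If
    i += 1
End While

' Hand any large remainder to PLINQ and combine the results.
If i < pdList.Count Then
    matches.AddRange(pdList.Skip(i).AsParallel() _
        .Where(Function(pd) pd.State.ToLower.Replace(" "c, "") = _
                            stateName.ToLower.Replace(" "c, "")) _
        .Select(Function(pd) pd.City))
End If

Dim cities = matches.Distinct().OrderBy(Function(c) c).ToArray()
```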

    It's definitely a case-by-case situation.  Unless you know for certain that there are going to be a lot of objects and that the processing of each will definitely be CPU-bound, you pretty much have to experiment to determine if parallelization will provide a benefit.


    Reed Kimble - "When you do things right, people won't be sure you've done anything at all"

    Monday, April 20, 2015 12:37 AM
    Moderator

All replies

  • Thanks to both of you.

I can't think of a way to quantify what constitutes "large enough for PLINQ," and I also tend to think it would have to be determined by experimentation.

    At any rate, it's interesting, that's for sure. :)


    Still lost in code, just at a little higher level.

    :-)

    Monday, April 20, 2015 1:09 PM