Parallel Extensions disappointingly slow - what am I doing wrong?

I've been doing some initial tests with the parallel extensions, and so far they've been disappointingly slow. I'm hoping that it's just me doing stuff wrong!
For my test, I'm summing all the values of an int array with one million elements. It's basic stuff!
I run the tests 200 times in an inner loop, and repeat the entire test 3 times to try to eliminate JIT delays.
I run the release build on a Windows 7 64-bit PC with a dual-core processor. I've also run it on a 32-bit XP PC with a quad-core processor, with similar results, scaled as you'd expect.
The results on my dual core processor (for the third run of the outermost test loop):
Normal loop (sum=1500000) took 00:00:00.5960881
Linq (sum=1500000) took 00:00:02.2253108
Plinq (sum=1500000) took 00:00:01.1123125
Parallel For (sum=1500000) took 00:00:01.6646867
Delegated For (sum=1500000) took 00:00:00.1983380
As you'd expect, the Linq results are laughably slow for this kind of thing. So is the parallel Linq, but at least it has scaled up with the processors.
I was really hoping that the "Parallel For" would be a lot faster, but even Plinq beats it - and Plinq itself is five times slower than my hand-rolled "delegated for" loop. I find this very disappointing: the Parallel For is slower than even a normal loop.
So now I'm in the position of hoping that I'm doing something wrong - otherwise, these parallel extensions are looking to be pretty much useless for us. (We are doing quite a lot of numerical calculations on large arrays; this is to do with analysing ECGs in a medical application.)
Here's the code: How can I improve it?
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

namespace ThreadingTest
{
    class Program
    {
        static Stopwatch sw = new Stopwatch();

        static void Main(string[] args)
        {
            const int iters = 200;
            int[] array = new int[1000000];
            for (int i = 0; i < array.Length; ++i)
            {
                array[i] = i & 3;
            }
            int sum = 0;
            sw.Start();
            for (int outer = 0; outer < 3; ++outer)
            {
                for (int inner = 0; inner < iters; ++inner)
                {
                    sum = 0;
                    for (int i = 0; i < array.Length; ++i)
                    {
                        sum += array[i];
                    }
                }
                Results("Normal loop (sum=" + sum + ")");

                for (int inner = 0; inner < iters; ++inner)
                {
                    sum = array.Sum();
                }
                Results("Linq (sum=" + sum + ")");

                IParallelEnumerable<int> parray = array.AsParallel<int>();
                for (int inner = 0; inner < iters; ++inner)
                {
                    sum = parray.Sum();
                }
                Results("Plinq (sum=" + sum + ")");

                // This is the thing I was hoping was faster.
                // I'm very unsure that what I'm doing here is correct, although
                // it returns the right answer! Just very slowly.
                for (int inner = 0; inner < iters; ++inner)
                {
                    sum = 0;
                    Parallel.For(0, array.Length, () => 0, (i, state) =>
                    {
                        state.ThreadLocalState += array[i];
                    }, partialSum => Interlocked.Add(ref sum, partialSum));
                }
                Results("Parallel For (sum=" + sum + ")");

                // Here comes my hand-coded approach.
                int numProcessors = System.Environment.ProcessorCount;
                Func<int, int, int[], int> thread = AddElements;
                IAsyncResult[] results = new IAsyncResult[numProcessors];
                for (int inner = 0; inner < iters; ++inner)
                {
                    int elementsPerProcessor = array.Length / numProcessors;
                    sum = 0;
                    for (int i = 0; i < numProcessors; ++i)
                    {
                        int end;
                        int start = i * elementsPerProcessor;
                        if (i != (numProcessors - 1))
                        {
                            end = start + elementsPerProcessor;
                        }
                        else // Last thread - go right up to the last element.
                        {
                            end = array.Length;
                        }
                        results[i] = thread.BeginInvoke(start, end, array, null, null);
                    }
                    for (int i = 0; i < numProcessors; ++i)
                    {
                        sum += thread.EndInvoke(results[i]);
                    }
                }
                Results("Delegated For (sum=" + sum + ")");
                Console.WriteLine();
            }
        }

        private static void Results(string text)
        {
            sw.Stop();
            Console.WriteLine(text + " took " + sw.Elapsed);
            sw.Reset();
            sw.Start();
        }

        static int AddElements(int inclusiveStart, int exclusiveEnd, int[] array)
        {
            int result = 0;
            for (int i = inclusiveStart; i < exclusiveEnd; ++i)
            {
                result += array[i];
            }
            return result;
        }
    }
}
Tuesday, May 12, 2009 3:39 PM
Answers

Hi Matthew,
Regarding Parallel.For, the bad performance you experienced is due to two issues. First, each iteration requires a delegate invocation. Usually this is insignificant, but when the loop body is small it becomes more noticeable. Second, Parallel.For's partitioning scheme is designed for fine-grained load-balancing of larger, more complex workloads. Each worker thread grabs a relatively small number of iterations to process at a time, and this requires synchronization that is, again, more significant when the loop body is small and constant. Both of these issues cause Parallel.For to perform badly in exactly the scenario where parallelism could otherwise yield benefit: a large number of iterations with a small loop body. Your example (one million add-store operations) fits this category exactly.
We have recently done something about this. The solution (as you've realized in your hand-coded approach) is to use range partitioning and have each worker thread take a much larger chunk of the iteration range. In Parallel Extensions we already have a Partitioner<T> feature that is intended for custom partitioning, so that is where our solution lies. The code would look something like this:
int[] array = ...;
int sum = 0;
Parallel.ForEach(Partitioner.Create(0, array.Length),
    () => 0,
    (Tuple<int, int> subRange, ParallelLoopState loopState, int threadLocalState) =>
    {
        for (int i = subRange.Item1; i < subRange.Item2; i++)
        {
            threadLocalState += array[i];
        }
        return threadLocalState;
    },
    finalThreadLocalState => Interlocked.Add(ref sum, finalThreadLocalState));
Basically, the Partitioner.Create call returns a Partitioner that takes a range and returns Tuple<int, int>s that represent subranges. Then, the Parallel.For call is replaced with a Parallel.ForEach call that iterates over that Partitioner. Finally, inside the loop body, a normal for loop is used to iterate over the range indicated by the Tuple<int, int>.
Compared to the Parallel.For approach, delegate invocations and synchronization overheads have been dramatically reduced. On a 4-core machine, I'm seeing near-linear speedup with the above solution, while Parallel.For is about twice as slow as sequential.
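For the heavier square-root-of-sine workload benchmarked later in the thread, the same range-partitioning pattern can be sketched against the released .NET 4 API (names here are the .NET 4 ones, not the CTP's). Note that Interlocked.Add has no double overload, so a lock combines the per-thread partial sums:

```csharp
using System;
using System.Collections.Concurrent; // Partitioner
using System.Threading.Tasks;

class RangeSumSketch
{
    static void Main()
    {
        double[] array = new double[1000000];
        for (int i = 0; i < array.Length; ++i) { array[i] = i & 3; }

        double sum = 0;
        object sumLock = new object();

        Parallel.ForEach(
            Partitioner.Create(0, array.Length),  // yields Tuple<int,int> sub-ranges
            () => 0.0,                            // per-thread initial partial sum
            (range, loopState, partial) =>
            {
                // Tight sequential loop over the sub-range: one delegate
                // invocation per chunk instead of one per element.
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    partial += Math.Sqrt(Math.Sin(array[i]));
                }
                return partial;
            },
            partial => { lock (sumLock) { sum += partial; } });

        Console.WriteLine(sum);
    }
}
```

The chunked inner loop is what amortizes the per-iteration delegate and synchronization costs discussed above.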
I'm sorry that you cannot use this feature yet, but any feedback you have regarding how we designed it would be appreciated!
Thanks,
Danny
Edited by Danny Shih, Wednesday, May 13, 2009 12:59 AM (clarification)
Proposed as answer by Danny Shih, Wednesday, May 13, 2009 6:33 PM
Marked as answer by Stephen Toub - MSFT (Microsoft employee, Moderator), Thursday, May 14, 2009 5:16 AM
Wednesday, May 13, 2009 12:53 AM
All replies

My experience has shown that some of these "very basic" benchmarks don't really show the benefits of using the Parallel Extensions.
You have overhead in the parallelization. If you're doing an operation where the total operation on a million elements is taking a fraction of a second, the overhead from parallelizing is worse than the speed benefits received in many cases.
However, if you're doing an operation where processing the million elements takes seconds (or minutes, or even hours, like some of my work), the overhead becomes a very small percentage, and the parallel loops really shine. It's when you have longer-running single operations that the parallel looping becomes much more attractive.
Try this on a case where you're actually doing significant numerical processing inside of your loop, and you'll see a huge difference. In my case, I did a similar benchmark, but using one of our heavy numerical routines, and was seeing a 3.8x speedup (on my quad core system), which is amazingly good scalability.
Reed Copsey, Jr. - http://reedcopsey.com
Tuesday, May 12, 2009 3:46 PM Moderator
My example code is indicative of exactly the kinds of computations that we will be performing - relatively simple mathematical calculations between elements of arrays, used for filtering data and so on. So you're confirming my suspicion that the Parallel library will be useless for us! It is NOT SUITED TO NUMERICAL-TYPE CALCULATIONS - that's what you're saying, yes?
One thing I don't understand: you say I have overhead in the parallelization. Yet my sample code has a hand-rolled parallelization that runs five times faster than Plinq and - most weirdly - more than twice as fast as the basic unparallelized for loop. I can write code that doesn't suffer from this parallelization overhead, but you're saying that the Parallel library has a lot of overhead?
However, to try to get a better picture, I've changed the code so that the array is an array of doubles, and I'm summing the square root of the sine of each value in the array. I can do that easily for the AddElements() method - but how can I do that in the Linq or Plinq? I can't work out the syntax at the moment... ;)
Here's what AddElements() looks like now:
static double AddElements(int inclusiveStart, int exclusiveEnd, double[] array)
{
double result = 0;
for (int i = inclusiveStart; i < exclusiveEnd; ++i)
{
result += Math.Sqrt(Math.Sin(array[i]));
}
return result;
}
That increases the time of the handrolled loop to 7 seconds.
But how do I do that in Plinq?
Tuesday, May 12, 2009 3:59 PM 
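For what it's worth, the spelling being asked for here is simply a selector: both LINQ to Objects and the released PLINQ expose a Sum overload that takes a per-element function (the CTP surface may name things differently), and because addition is associative, PLINQ can safely combine per-partition totals:

```csharp
using System;
using System.Linq;

class PlinqSumSketch
{
    static void Main()
    {
        double[] array = new double[1000000];
        for (int i = 0; i < array.Length; ++i) { array[i] = i & 3; }

        // Sequential LINQ: map each element, then sum.
        double seq = array.Sum(x => Math.Sqrt(Math.Sin(x)));

        // PLINQ: same shape; partial sums from each partition are combined
        // by addition, which is safe because + is associative and commutative.
        double par = array.AsParallel().Sum(x => Math.Sqrt(Math.Sin(x)));

        Console.WriteLine(seq + " " + par);
    }
}
```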
Well, I've sussed out how to do it in Linq and Plinq now. The results are getting better - Plinq is catching up a bit. But the mathematical function is probably slower than a lot of our calculations, so it's still looking like it will be significantly faster if we hand-roll the code rather than using Plinq.
Here's the results with the sine/square root calculation:
Normal loop (sum=561111.930635547) took 00:00:13.8989863
Linq (sum=561111.930635547) took 00:00:16.6015420
Plinq (sum=280505.446970453) took 00:00:10.6839926
Delegated For (sum=561111.930635169) took 00:00:07.0097653
And my modified code (note I'm using random numbers in the array now):
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

namespace ThreadingTest
{
    class Program
    {
        static Stopwatch sw = new Stopwatch();

        static void Main(string[] args)
        {
            Console.WriteLine("Press RETURN to start test");
            Console.ReadLine();
            Test();
            Console.WriteLine("\nPress RETURN");
            Console.ReadLine();
        }

        static void Test()
        {
            const int iters = 200;
            double[] array = new double[1000003];
            Random rng = new Random();
            for (int i = 0; i < array.Length; ++i)
            {
                array[i] = rng.Next() & 3;
            }
            double sum = 0;
            sw.Start();
            for (int outer = 0; outer < 3; ++outer)
            {
                for (int inner = 0; inner < iters; ++inner)
                {
                    sum = 0;
                    for (int i = 0, n = array.Length; i < n; ++i)
                    {
                        sum += Math.Sqrt(Math.Sin(array[i]));
                    }
                }
                Results("Normal loop (sum=" + sum + ")");

                for (int inner = 0; inner < iters; ++inner)
                {
                    sum = array.Aggregate((total, current) => total + Math.Sqrt(Math.Sin(current)));
                }
                Results("Linq (sum=" + sum + ")");

                IParallelEnumerable<double> parray = array.AsParallel<double>();
                for (int inner = 0; inner < iters; ++inner)
                {
                    sum = parray.Aggregate((total, current) => total + Math.Sqrt(Math.Sin(current)));
                }
                Results("Plinq (sum=" + sum + ")");

                // I have no idea how to compute this using Parallel.For() now,
                // so I've commented this out...
                //for (int inner = 0; inner < iters; ++inner)
                //{
                //    sum = 0;
                //    Parallel.For(0, array.Length, () => 0.0, (i, state) =>
                //    {
                //        state.ThreadLocalState += array[i];
                //    }, partialSum => Interlocked.Add(ref sum, partialSum));
                //}
                //Results("Parallel For (sum=" + sum + ")");

                // Here comes my hand-coded approach.
                int numProcessors = System.Environment.ProcessorCount;
                AddElementsDelegate thread = AddElements;
                IAsyncResult[] results = new IAsyncResult[numProcessors];
                for (int inner = 0; inner < iters; ++inner)
                {
                    int elementsPerProcessor = array.Length / numProcessors;
                    sum = 0;
                    for (int i = 0; i < numProcessors; ++i)
                    {
                        int end;
                        int start = i * elementsPerProcessor;
                        if (i != (numProcessors - 1))
                        {
                            end = start + elementsPerProcessor;
                        }
                        else // Last thread - go right up to the last element.
                        {
                            end = array.Length;
                        }
                        results[i] = thread.BeginInvoke(start, end, array, null, null);
                    }
                    for (int i = 0; i < numProcessors; ++i)
                    {
                        sum += thread.EndInvoke(results[i]);
                    }
                }
                Results("Delegated For (sum=" + sum + ")");
                Console.WriteLine();
            }
        }

        delegate double AddElementsDelegate(int inclusiveStart, int exclusiveEnd, double[] array);

        private static void Results(string text)
        {
            sw.Stop();
            Console.WriteLine(text + " took " + sw.Elapsed);
            sw.Reset();
            sw.Start();
        }

        static double AddElements(int inclusiveStart, int exclusiveEnd, double[] array)
        {
            double result = 0;
            for (int i = inclusiveStart; i < exclusiveEnd; ++i)
            {
                result += Math.Sqrt(Math.Sin(array[i]));
            }
            return result;
        }
    }
}
I'll try it with real code some time, but I'm not impressed so far...
Tuesday, May 12, 2009 4:23 PM
Oh dear - I just realised that the Plinq result is wrong. Why is that? :(
Clearly I've got something wrong... I thought I could just take the Linq code and change it to use IParallelEnumerable()...
Tuesday, May 12, 2009 4:25 PM
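The bad total is characteristic of the two-argument Aggregate under PLINQ: each partition is folded with the supplied function, and the per-partition results are then combined with that same function, so one partition's running sum gets passed through Math.Sqrt(Math.Sin(...)) as if it were an element. The released .NET 4 PLINQ has a seeded overload with a separate combiner that avoids this (sketched below; the CTP surface may differ):

```csharp
using System;
using System.Linq;

class PlinqAggregateSketch
{
    static void Main()
    {
        double[] array = new double[1000000];
        for (int i = 0; i < array.Length; ++i) { array[i] = i & 3; }

        double sum = array.AsParallel().Aggregate(
            0.0,                                                       // seed for each partition
            (total, current) => total + Math.Sqrt(Math.Sin(current)),  // fold within a partition
            (left, right) => left + right,                             // combine partition totals
            total => total);                                           // final projection

        Console.WriteLine(sum);
    }
}
```

Spelling out the combiner separately is what makes the reduction safe to run over partitions in any order.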
Ah, thanks for the info! So it seems like this stuff will be useful for numerical processing at some point. :)
Wednesday, May 13, 2009 8:43 AM

Hi Danny! Thanks for the answer. I hope you'll include this scenario explanation and the Partitioner recommendations in the documentation.
Thursday, May 21, 2009 8:59 PM

Hi Dario,
Thanks for the suggestion. In the meantime, you can utilize the docs at http://msdn.microsoft.com/en-us/library/dd560853(VS.100).aspx.
Friday, May 22, 2009 6:13 PM Moderator