PPL overhead even when call is never made?
-
Wednesday, March 7, 2012 00:47
Hi All. I was creating a demo recently to compare performance of serial code to parallel code. So I basically had this:
    while (stepwise algorithm is not yet done)
    {
        if (sequential execution)
        {
            for(...) dowork();
        }
        else // parallel execution
        {
            parallel_for( ... );
        }
    }//while

I would run the program and input a value that runs the program in sequential mode. Then I would run again, enter a different value, and now run in parallel mode. Reported times gave me some speedup, great.
So later I was playing around, and commented out the parallel_for, like this:
    while (stepwise algorithm is not yet done)
    {
        if (sequential execution)
        {
            for(...) dowork();
        }
        else // parallel execution
        {
            // parallel_for( ... );
        }
    }//while

Now, I run the program, input the value for sequential execution, and it RUNS MUCH FASTER (2x faster). In other words, the mere *presence* of the call to parallel_for slowed down the sequential run, even though parallel_for was never called in that run. Is there some hidden initialization going on when parallel_for is present, even when it's not explicitly called?
Or what am I missing? Thanks!
- joe
All Replies
-
Wednesday, March 7, 2012 14:52
Without looking at code (C++ and generated assembly) I am only guessing here...
Short answer: the lambda for your parallel loop captured variables by reference. This spooks the optimizer, which can no longer keep those variables in registers, prove that nothing aliases them, and so on.
Yes, the compiler doesn't "see" that the if/else takes one branch or the other.
It has nothing to do with PPL; you will run into the same situation with any other function or code that takes a reference or pointer to a variable visible in the sequential part. Even std::swap will spoil your day!
The common issue with optimizers is that they don't fully understand your code; they pattern match and "play it safe" when they detect code that they can't prove is safe to optimize.
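A minimal sketch of the pattern described above, in standard C++ only. The names run_range and scale_all are hypothetical, and run_range is just a sequential stand-in for a library call such as parallel_for. The point is structural: data and scale are captured by reference in the else branch, so their addresses escape to an opaque callee even though the sequential branch never executes that code. Whether a particular compiler actually generates slower code for the sequential loop depends on the compiler version and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for parallel_for: calls body(i) for i in [first, last).
// Taking a std::function keeps the callee opaque, much like a library call.
void run_range(int first, int last, const std::function<void(int)>& body)
{
    for (int i = first; i < last; ++i) body(i);
}

// Scales every element and returns the sum of the scaled values.
// Both branches compute the same result; only one runs per call.
double scale_all(std::vector<double>& data, double scale, bool parallel)
{
    assert(!data.empty());
    double sum = 0.0;
    if (!parallel) {
        // Sequential branch: same variables as the lambda below.
        for (std::size_t i = 0; i < data.size(); ++i) {
            data[i] *= scale;
            sum += data[i];
        }
    } else {
        // By-reference capture: the addresses of data and scale are handed
        // to run_range, which the optimizer may have to treat conservatively.
        run_range(0, static_cast<int>(data.size()), [&](int i) {
            data[static_cast<std::size_t>(i)] *= scale;
        });
        for (double v : data) sum += v;
    }
    return sum;
}
```

Calling scale_all(v, 2.0, false) never runs the lambda, yet the function body the optimizer sees still contains it.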
-
Wednesday, March 7, 2012 17:47
Ahh, interesting, I should have thought of that. I'll replace the lambda with a function and function pointer, and see what impact that has... I'll report back what I find!
-
Tuesday, March 13, 2012 05:11 Owner
Hi Joe,
This doesn't sound right at all. Do let us know if you can share any excerpts of code. Also, have you double-checked that the data and conditions were the same when you measured?
To answer your question, there is no PPL related initialization just with the mere presence of its code. All init happens lazily at execution time.
The speed differences are completely unexpected and we cannot reproduce this in our testing. We have also not seen anyone report something like this before. If dimkaz is right, we would be very interested in fixing the optimization.
Thanks!
Rahul V. Patil
- Edited by Rahul V. Patil (Moderator), Tuesday, March 13, 2012 05:15
-
Thursday, March 15, 2012 20:37
Hi Rahul. I double-checked the results, in both VS2010 and VS2011. I generate a large matrix of 2500x2500 elements, and use Gaussian Elimination to solve the system. Same set of random numbers in every test case. The VS projects are in my dropbox: VS2010 and VS2011. Build and run either version, and you'll be prompted for
# of equations>> 2500
go parallel? >> 0
The 0 means run sequentially. On my laptop this takes about 5 seconds in VS2010 and 8 seconds in VS2011 (my guess is that different project settings make it slower). Anyway, the code as provided has the parallel_for PPL call commented out. Now open "matrix.cpp" and search for "PPL": you'll see the parallel_for. Uncomment it and run again. Notice that the parallel_for is inside an if-then-else, so when you run sequentially, this code is never executed. Run again with 2500 and 0: you should get the same results, but the time will double in VS2010 (from 5 to 10 seconds) and increase by 2 seconds in VS2011 (from 8 to 10 seconds).
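For anyone repeating this kind of comparison, here is a rough sketch of a repeatable measurement setup: a fixed seed so every run sees the same numbers, and a monotonic clock around the work. The helper names make_inputs and time_seconds are hypothetical and are not taken from the linked projects; the seed and value range are arbitrary choices.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: n pseudo-random values from a fixed seed, so that
// the sequential and parallel runs operate on identical input.
std::vector<double> make_inputs(std::size_t n, unsigned seed)
{
    assert(n > 0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(1.0, 2.0);
    std::vector<double> values(n);
    for (double& x : values) x = dist(gen);
    return values;
}

// Hypothetical helper: runs work() once and returns elapsed wall time in
// seconds, using steady_clock so clock adjustments do not skew the result.
template <typename Work>
double time_seconds(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A single timed run is noisy; repeating the measurement a few times and comparing the fastest runs usually gives a steadier picture.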
I haven't had time yet to dig into the asm code, that's the next step. I'll let you know what I discover after I investigate further. Cheers!
- joe
-
Thursday, March 15, 2012 22:46 Owner
Thanks Joe!
I'll take a look and get back to you.
Rahul V. Patil
-
Friday, March 16, 2012 07:14
Hi Joe,
Thanks for providing your sample. As expected, this has to do with the optimizer getting defensive when it sees pointers.
Here is a test for you:
Uncomment the PPL call, then introduce a new temporary inside the sequential code and use it, as in:
    auto MC = M;
    for (int otherR=r+1; otherR < rows; otherR++)
    {
        double pivotFactor = MC[otherR][c] / MC[r][c]; // pivot factor:
        for (int k=c; k < cols; k++)
            MC[otherR][k] = MC[otherR][k] - (MC[r][k] * pivotFactor);
    }

As you can see, I didn't change anything; I just introduced a copied local.
In my case the timing no longer changes.
BTW, the speed of the code can be improved in general, but I guess that wasn't your point.
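To put the workaround in context, here is a self-contained sketch of one elimination step with both branches. Several things are assumptions: the thread does not show the type of M, so it is taken to be an array of row pointers (double**); eliminate_step and run_rows are hypothetical names; and run_rows is a sequential stand-in for parallel_for so the sketch builds without PPL. The sequential branch works through the local copy MC, which the lambda never captures.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for parallel_for: calls body(i) for i in [first, last).
void run_rows(int first, int last, const std::function<void(int)>& body)
{
    for (int i = first; i < last; ++i) body(i);
}

// One elimination step for the pivot at (r, c).
// Assumption: M is an array of row pointers; the real type is not shown.
void eliminate_step(double** M, int rows, int cols, int r, int c, bool parallel)
{
    assert(M != nullptr && r >= 0 && r < rows && c >= 0 && c < cols);
    if (!parallel) {
        // Local copy of the pointer only (not the matrix data). The lambda
        // below never refers to MC, so its address does not escape.
        double** MC = M;
        for (int otherR = r + 1; otherR < rows; otherR++) {
            double pivotFactor = MC[otherR][c] / MC[r][c];
            for (int k = c; k < cols; k++)
                MC[otherR][k] = MC[otherR][k] - (MC[r][k] * pivotFactor);
        }
    } else {
        // By-reference capture of M, rows, cols, r and c.
        run_rows(r + 1, rows, [&](int otherR) {
            double pivotFactor = M[otherR][c] / M[r][c];
            for (int k = c; k < cols; k++)
                M[otherR][k] = M[otherR][k] - (M[r][k] * pivotFactor);
        });
    }
}
```

Both branches produce the same matrix; any timing difference between using M and MC in the sequential branch is specific to the compiler and its settings.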
- Marked as Answer by DanielMoth (Microsoft Employee, Owner), Monday, April 30, 2012 04:59
-
Saturday, March 17, 2012 06:33
Thanks for tracking that down, I hope the sample helps tweak the optimizer. I'll add that to my toolkit of things to mention to folks when they are performance testing --- playing with closure variables, adding locals, etc.
Fun, fun, fun! :-)

