PPL overhead even when call is never made?

    Question

  • Hi All.  I was creating a demo recently to compare performance of serial code to parallel code.  So I basically had this:

      while (stepwise algorithm is not yet done)
      {
          if (sequential execution)
          {
                for(...) dowork();
          }
          else  // parallel execution
          {
                parallel_for( ... );
          }
      }//while

    I would run the program and input a value that runs the program in sequential mode.  Then I would run again, enter a different value, and now run in parallel mode.  Reported times gave me some speedup, great. 

    So later I was playing around, and commented out the parallel_for, like this:

     while (stepwise algorithm is not yet done)
      {
          if (sequential execution)
          {
                for(...) dowork();
          }
          else  // parallel execution
          {
                // parallel_for( ... );
          }
      }//while

    Now I run the program, input the value for sequential execution, and it RUNS MUCH FASTER (2x faster).  In other words, the mere *presence* of the call to parallel_for slowed down the sequential run, even though parallel_for is never called in that mode.  Is there some hidden initialization going on when parallel_for is present, even when it's not explicitly called?

    Or what am I missing?  Thanks!

      - joe

    March 7, 2012 0:47

Answer

  • Hi Joe,

    Thanks for providing your sample. As expected, this has to do with the optimizer getting defensive when it sees pointers.

    Here is a test for you:

    Uncomment the PPL call, then introduce a new temporary inside the sequential code and use it, as in:

      auto MC = M;
      for (int otherR = r+1; otherR < rows; otherR++)
      {
          double pivotFactor = MC[otherR][c] / MC[r][c];  // pivot factor
          for (int k = c; k < cols; k++)
              MC[otherR][k] = MC[otherR][k] - (MC[r][k] * pivotFactor);
      }

    As you can see, I didn't change the algorithm; I just introduced a copied local.

    In my case the timing no longer changes.

    BTW, the speed of the code can be improved in general, but I guess that wasn't your point.

    March 16, 2012 7:14
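
    For readers without the original project, the experiment in this answer can be sketched as a self-contained function. This is only an illustration under assumptions: for_each_index is a hypothetical stand-in for parallel_for (it runs serially so the sketch builds without PPL), eliminate_below and its signature are invented here, and M is assumed to be an array of row pointers. Whether the local copy changes the timing depends on the compiler and its settings.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for parallel_for: runs body(i) for i in [first, last).
// It is serial on purpose; only the by-reference capture matters here.
void for_each_index(int first, int last, const std::function<void(int)>& body) {
    for (int i = first; i < last; ++i) body(i);
}

// One elimination step: clears column c in every row below pivot row r.
void eliminate_below(double** M, int rows, int cols, int r, int c, bool parallel) {
    assert(M[r][c] != 0.0);  // sketch only: no pivoting or zero handling
    if (!parallel) {
        double** MC = M;  // copied local; the lambda below never captures it
        for (int otherR = r + 1; otherR < rows; otherR++) {
            double pivotFactor = MC[otherR][c] / MC[r][c];
            for (int k = c; k < cols; k++)
                MC[otherR][k] = MC[otherR][k] - (MC[r][k] * pivotFactor);
        }
    } else {
        // Captures M, cols, r and c by reference, as the original lambda
        // presumably did.
        for_each_index(r + 1, rows, [&](int otherR) {
            double pivotFactor = M[otherR][c] / M[r][c];
            for (int k = c; k < cols; k++)
                M[otherR][k] = M[otherR][k] - (M[r][k] * pivotFactor);
        });
    }
}
```

    Both branches compute the same result; the only difference is that the sequential loop reads the matrix through MC rather than through the captured M.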

All replies

  • Without looking at the code (C++ and generated assembly), I am only guessing here...

    Short answer: the lambda for your parallel loop captures variables by reference. This spooks the optimizer, which can then no longer enregister those variables, prove the absence of aliasing, and so on.

    Yes, the compiler doesn't "see" that the if/else executes only one branch or the other.
    It has nothing to do with PPL; you will run into the same situation with any other function or code that takes a reference or pointer to a variable visible in the sequential part. Even std::swap will spoil your day!

    The common issue with optimizers is that they don't fully understand your code; they pattern-match and "play it safe" when they encounter code they can't prove is safe to optimize.

    March 7, 2012 14:52
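
    The point that this is not specific to PPL can be illustrated with a plain pointer instead of a lambda. The sketch below is an assumption-laden illustration, not code from the thread: touch is a hypothetical function that would normally live in another translation unit, and the two sum functions are invented names. Whether a given compiler actually generates slower code for the first version depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper standing in for any function that receives a pointer.
// In real code it would typically be defined in another translation unit,
// where the optimizer cannot inspect it.
void touch(double* p) { *p += 0.0; }

// The hot loop accumulates into `total`, whose address is taken in the other
// branch. The compiler may have to treat `total` as living in memory.
double sum_address_taken(const double* data, std::size_t n, bool take_other_branch) {
    double total = 0.0;
    if (!take_other_branch) {
        for (std::size_t i = 0; i < n; ++i) total += data[i];
    } else {
        touch(&total);  // never runs in sequential mode, but &total escapes here
    }
    return total;
}

// Same result, but the hot loop uses a private accumulator whose address is
// never taken, so it is an easy candidate for a register.
double sum_private_local(const double* data, std::size_t n, bool take_other_branch) {
    double total = 0.0;
    if (!take_other_branch) {
        double acc = 0.0;
        for (std::size_t i = 0; i < n; ++i) acc += data[i];
        total = acc;
    } else {
        touch(&total);
    }
    return total;
}
```

    Comparing the generated assembly of the two loops is the reliable way to check whether the escape matters for a particular build.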
  • Ahh, interesting, I should have thought of that.  I'll replace the lambda with a function and function pointer, and see what impact that has...  I'll report back what I find!

    March 7, 2012 17:47
  • Hi Joe,

    This doesn't sound right at all. Do let us know if you can share any excerpts of the code. Also, have you double-checked that the data and conditions were the same when you measured?

    To answer your question: no PPL-related initialization is triggered by the mere presence of its code. All initialization happens lazily at execution time.

    The speed differences are completely unexpected, and we cannot reproduce them in our testing. We have also not seen anyone report something like this before. If dimkaz is right, we would be very interested in fixing the optimization.

    Thanks!


    Rahul V. Patil


    March 13, 2012 5:11
    Owner
  • Hi Rahul.  I double-checked the results, in both VS2010 and VS2011.  I generate a large matrix of 2500x2500 elements, and use Gaussian Elimination to solve the system.  Same set of random numbers in every test case.  The VS projects are in my dropbox:  VS2010 and VS2011.  Build and run either version, and you'll be prompted for

      # of equations>>  2500

      go parallel?  >> 0

    The 0 means run sequentially.  On my laptop this takes about 5 seconds in VS2010 and 8 seconds in VS2011 (my guess is that different project settings are causing it to run slower).  Anyway, the code as provided has the parallel_for PPL call commented out.  Now open "matrix.cpp" and search for "PPL": you'll see the parallel_for.  Uncomment it and run again.  Notice that the parallel_for is inside an if-then-else, so when you run sequentially, this code is never executed.  Run again with 2500 and 0: you should get the same results, but the time will double in VS2010 (from 5 to 10 seconds) and increase by 2 seconds in VS2011 (from 8 to 10 seconds).

    I haven't had time yet to dig into the asm code, that's the next step.  I'll let you know what I discover after I investigate further.  Cheers!

      - joe 

    March 15, 2012 20:37
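
    The measurement setup described in this post (the same random inputs in every run, wall-clock timing of one mode at a time) can be sketched roughly as follows. The helper names make_system and time_seconds, the value range, and the use of a fixed seed are assumptions for illustration; they are not taken from Joe's project.

```cpp
#include <chrono>
#include <random>
#include <vector>

// Hypothetical helper: builds an n x (n+1) augmented matrix from a fixed
// seed, so every test run works on the same numbers.
std::vector<std::vector<double>> make_system(int n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(1.0, 2.0);
    std::vector<std::vector<double>> m(n, std::vector<double>(n + 1));
    for (auto& row : m)
        for (auto& x : row) x = dist(gen);
    return m;
}

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in seconds, using a monotonic clock.
template <typename F>
double time_seconds(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

    A caller would build the system once per run with the same seed, pass the solver as a lambda to time_seconds, and compare the reported times between builds with and without the parallel_for call.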
  • Thanks Joe!

    I'll take a look and get back to you.


    Rahul V. Patil

    March 15, 2012 22:46
    Owner
  • Thanks for tracking that down, I hope the sample helps tweak the optimizer.  I'll add that to my toolkit of things to mention to folks when they are performance testing --- playing with closure variables, adding locals, etc. 

    Fun, fun, fun! :-)

    March 17, 2012 6:33