very slow when using vs2010 express PPL to parallel my chess engine

Answered very slow when using vs2010 express PPL to parallel my chess engine

  • samedi 26 novembre 2011 08:34
     
     

    I am trying to parallel my chess engine using vs2010 express PPL. I am already using sample::parallel_for_fixed like below:-

    structured_task_group taskgp;
       //task_group taskgp;
       taskgp.run_and_wait([&] {
       critical_section cs;
       //reader_writer_lock rwlock;
       samples::parallel_for_fixed(0, 4, [&] (int j) //->int
       //parallel_for(0, 4, [&] (int j) //->int            
       //     for (int i=1; i<size; i++)
            {
                        Board* spboard;          
                        Board spboardref;
                        spboardref = board;
                        spboard = &spboardref;
                        for (int i=j+1; i<size; i+=4) 
                       {

                          //parallel search moves           

                       }

              } //end i (inner loop)
    tend_loopi:
            ;
            } // end for j (outer loop)
            ); //end parallel_for j
            }); //end task 

    The result is very slow, even slower than serial version.

    My question is : Is the overhead of PPL so larger than the native Windows 32 thread programming? I am using Windows XP SP3 and VS2010 C++ express SP1

Toutes les réponses

  • mardi 6 décembre 2011 18:42
    Propriétaire
     
     

    Hello,

    It really depends on the workload.

    How long does the serial version take to complete?

    Did you try with threads (or openmp parallel for) - if so, what kind of speeds up do you see when using those? 

    I see a critical section in your code - can you elaborate where you use the CS and see if that is an issue?

    IF you can share a version of code that I can compile and test, that would be very helpful.

     

    Regards


    Rahul V. Patil
  • mardi 6 décembre 2011 19:14
     
     
    Hello, What algorithm are you using to parallelize your search? If your serial search is using alpha beta pruning, the first move you search will take the longest because you will not have reduced your alpha/beta bounds yet. If you search the first N moves in parallel, all of them will take the longer amount of time and the parallel search will have more overhead than serial search with no optimization - hence slower. Check out this link for lots of information on making chess engines parallel: http://chessprogramming.wikispaces.com/Parallel+Search -Patrick
  • jeudi 8 décembre 2011 08:50
     
     

    Hi Patrick, thanks for your answer. I am aware of the so called YBWC (Young Brother Wait Criteria). I use alpha beta pruning and only parallel subsequent moves after searching the first move.  My suspect of the overhead of PPL is as follows. Normally, in Win32 native thread programming for paralleling chess engine, a scheduler and thread pool is created once and then subsequent moves searching are assigned to threads running in parallel. These threads are REUSEd during the whole program running. Now, in PPL implementation, I have no control on the REUSE of the threads.

    May I ask some questions on the underlining PPL scheduler and thread pool management?

    1. Is the scheduler created once per program, or each time a PARALLEL_FOR is used?

    2. Are the threads in the thread pool REUSED in multiple PARALLEL_FOR ?

    ---------------------

    Below is some coding on the beta cutoff :-

            spboard->unmakemove();
               
            if (taskgp.is_canceling())
            {
                printf("info nPV canceling, j=%d, i=%d\n", j, i);
               fflush(stdout);              
               //goto end_loopi;
               break;
            }
           
                        // 12. Alpha-Beta pruning       
            if (val > best)
            {           
                //best = val;
                InterlockedExchange(&best, val);
                if (val > alpha)
                {
                    mvBest.move = tempmove.table.move;
                    nHashFlag = HASH_PV;
                    if (val >= beta)
                    {                                       
                      taskgp.cancel(); // beta-cutoff remaining tasks 
                      //betacut = i; 
                      InterlockedExchange(&betacut, i);                 
                      break;
                    //return val;
                    }              
                }           
            }  
        } //end i (inner loop)
           
            } // end for j (outer loop)
            ); //end parallel_for j     
            }); //end task

    My question is :  are the taskgp cancel() and is_canceling() function overhead very large?

     

  • vendredi 9 décembre 2011 19:20
    Propriétaire
     
     

    Hi Edward,

    >>1. Is the scheduler created once per program, or each time a PARALLEL_FOR is used?

    Yes. Created Once.

    >>2. Are the threads in the thread pool REUSED in multiple PARALLEL_FOR ?

    Yes. Largely reused. You can verify this by logging thread ids (or using the Concurrency Visualizer in Visual Studio Ultimate).

     

    >>3. My question is : are the taskgp cancel() and is_canceling() function overhead very large?

    It shouldnt be. 

    Based on your code above: I can speculate two things:

    One: the number of iterations. In your earlier post, you have parallel for from 0 to 4. So you have only 4 iterations in the loop?  If you have less than 200 iterations, you can significantly reduce the overhead by pumping in the tasks to task group directly (we'll have a parallel for version of this in the next release - and i can share a version that works with Visual Studio 2010 for you as well).

    Second: The reaction time to cancelation. The default parallel for (and the fixed version), may not be reacting to the cancelation request quickly enough.

    Regards,

     

     

     


    Rahul V. Patil
  • dimanche 11 décembre 2011 16:33
     
     

    Hi Rahul,

    Thanks for your answer.

    Some more questions for discussion:-

    1.   In my code, because the function Alphabeta is recursive,  is it true that many structure_task_group taskgp will be created as Alphabeta is recursively called:-

    int Engine::AlphaBeta(Board &board, int beta,int depth,....)

    {

    structured_task_group taskgp;

    taskgp.run_and_wait([&] {

    samples::parallel_for_fixed(0, NCORE, [&] (int j) {

    ....

    val = -AlphaBeta(*spboard, 1-beta, newdepth, NULL_YES);

    ); //end parallel_for j
    }); //end task

    Can I use a global structured_task_group taskgp outside of Alphabeta to minimize overhead?

    2. Your observation is correct. I made the number of iteration to 4 (or number of cores) in order to maximum the work load for each iteration. You are also correct on the  reaction time to cancellation: it is slow.  I now checked for if (betacut >= 0) break;  instead of if (taskgp.is_canceling()).  As for the actual cancellation by taskgp.cancel(); // beta-cutoff remaining tasks , I don't know if this is executed immediately, or wait for the scheduler to be interrupted.

    3. Since Alphabeta is called recursively, is it true that many parallel tasks will be spawned down the search tree? Is there a way to know how many threads are waiting in the pool?

    In conclusion, there seems to be no appropriate pattern for paralleling a recursive function such as Alphabeta.

     

     

  • dimanche 18 décembre 2011 10:51
     
     Traitée
    After many trials, I finally abandon the use of parallel_for for my chess engine. Instead, at last I got a parallel version using parallel_invoke and it ran faster than the serial version on my Core i5 laptop. 
  • mardi 3 janvier 2012 20:51
    Propriétaire
     
     

    Hi Edward,

    Sorry for the late reply. I was out of office in the month of December. 

    Answers to more questions you had:

    >>1. "is it true that many structure_task_group taskgp will be created as Alphabeta is recursively called:-"

    Yes.

    >>"Can I use a global structured_task_group taskgp outside of Alphabeta to minimize overhead?"

    We would recommend using task_group instead of structured_task_group. However, the cancellation will then affect the global tree, as opposed to the subtree in the algorithm you have above. If that behavior is ok, then using global should work more efficiently.

    >>2. " I made the number of iteration to 4 (or number of cores) "

    It sounds like a more appropriate pattern for you would indeed be the parallel_invoke - as you've correctly concluded.

    >>"Cancelation is slow"

    In the case you outline above, this is specifically because you have only 4 iterations - all of which will likely execute immediately. Remember that cancelation will only affect work that is scheduled but not yet started. So if all 4 iterations start of immediately, then you have no opportunity to cancel - except by checking explicitly in your code.

    >>3. "is it true that many parallel tasks will be spawned down the search tree? ".

    Yes - I assume that is what you intended. Think of parallel_* as opportunities to identify tasks that can run in parallel. The scheduler underneath will do work of optimally chunking and mapping the tasks to underlying threads.

    >> Is there a way to know how many threads are waiting in the pool?

    Your operating assumptions should be that by default, this is equal to the number of cores + 1 main thread. Behind the scenes, the runtime may spawn additional threads to deal with blocking calls. There is no predictable way of knowing the exact number of threads, but the number of "actively running threads" will be as mentioned above.

     

    As you correctly concluded, parallel_invoke would be the right pattern for the algortithm above.

     

    Cheers and happy new year!


    Rahul V. Patil