Non busy waiting for completion of parallel_for_each

  • Question

  • Hi

I am a Master's student at the FHNW in Windisch (Switzerland) and am using C++ AMP for parallelization on the GPU. I am currently developing a heterogeneous image pipeline which uses all CPU cores and GPU engines simultaneously. For this to work, all tasks of the pipeline must be executable asynchronously.

    My question is: how do I correctly wait for the completion of a parallel_for_each which runs on the GPU? The goal is that the waiting thread sleeps after the call to the wait method and only wakes up once execution has finished.

    My first approach was to use the accelerator_view and call the accelerator_view::wait() method. This waits until execution on the GPU ends, but it seems to do so with busy waiting.

    After many hours of searching and testing other approaches, I found a way to do non-busy waiting: myAccView.create_marker().to_task().wait();

    Attached are two screenshots from the Concurrency Visualizer. Pay attention to the idle time of the CPU in the second version, versus the full utilisation of the CPU in the first version. Why is this?
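    For readers without the screenshots, the difference between the two waiting styles can be sketched in standard C++ (std::future stands in for the AMP completion object here; this is an analogy of the concept, not AMP code):

    ```cpp
    #include <atomic>
    #include <future>

    // Busy waiting: the caller spins on a flag, keeping one CPU core at 100%.
    // This corresponds to the full CPU utilisation in the first screenshot.
    int busy_wait(const std::atomic<bool>& done, const int& result) {
        while (!done.load()) { /* spin: burns CPU time */ }
        return result;
    }

    // Cooperative waiting: future::wait() blocks the thread in the kernel, so
    // the core is free for other threads -- the idle time in the second screenshot.
    int cooperative_wait(std::future<int>& f) {
        f.wait();    // the thread sleeps until the result is ready
        return f.get();
    }
    ```

    In AMP terms, the spinning version plays the role of accelerator_view::wait() on Windows 7, and the future-based version plays the role of create_marker().to_task().wait().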


    Thursday, November 7, 2013 2:08 PM

Answers

  • Hi Lang,

    The difference you are seeing is because accelerator_view::wait() does busy waiting, whereas to_task().wait() does cooperative waiting. The completion_future::to_task() function returns a concurrency::task<void> object, and waiting on a concurrency::task<> is cooperative. Cooperative waiting leaves the CPU available for other threads to use.
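    Conceptually, cooperative waiting is what a condition variable gives you: the waiting thread is descheduled by the OS instead of polling. A minimal sketch of the concept in standard C++ (this illustrates the mechanism, not AMP's internals):

    ```cpp
    #include <condition_variable>
    #include <mutex>
    #include <thread>

    struct Completion {
        std::mutex m;
        std::condition_variable cv;
        bool done = false;

        // Called by the producing side (think: driver callback) when work finishes.
        void signal() {
            { std::lock_guard<std::mutex> lk(m); done = true; }
            cv.notify_all();
        }

        // Called by the host thread: it blocks in the kernel, no spinning,
        // so the core is free for other threads until signal() runs.
        void wait() {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [this] { return done; });
        }
    };
    ```

    Waiting on a concurrency::task<> behaves like Completion::wait() here, while a busy wait would poll `done` in a loop instead.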

    Also, starting with Windows 8 (and higher versions), accelerator_view::wait() does cooperative waiting as well, so on Windows 8 and higher you should see similar behavior for the two cases above.

    Please feel free to ask if you have any further questions.

    • Marked as answer by Lang Christian Thursday, November 21, 2013 6:32 AM
    Thursday, November 21, 2013 12:24 AM

All replies

  • Hi Rahman

    Thank you for your answer; you confirmed my assumptions. My system is Windows 7.

    I have done a lot of testing since posting the question and have observed another strange phenomenon. When my system is fully loaded, the return of the wait() method (the one of the completion_future) takes much longer than the execution of the task on the GPU. In my case I do a matrix multiplication (64x64) on the GPU and repeat it until 50 milliseconds have elapsed. I do the same thing on the CPU until 100 ms have elapsed. A test with only one CPU core loaded is shown in the next graph:

    One task is shown on the CPU, requiring approximately 100 ms. The other task runs on the GPU and takes approx. 50 ms. You can see that the GPU task repeats the matrix multiplication several times; the load of the GPU is shown too.

    In the second graph the GPU task needs as much time as the CPU task but repeats the matrix multiplication only twice. The execution on the GPU does not take any longer than in the first example.

    Is this behavior a consequence of the system being too loaded to communicate with the GPU? And how can I improve it?

    Below you can see the relevant code of the GPU task (I know that GetTickCount() is not very precise, but that should not be the problem here):

    const int start = static_cast<int>(GetTickCount());
    do {
        mSeries->write_flag(_TwithInt("GPU starting: ", input));
        parallel_for_each(av_c.extent, [&](index<2> idx) restrict(amp) {
            // do mat mult
        });
        mSeries->write_flag(_TwithInt("GPU started: ", input));
        // Blocks (cooperatively) until all work queued on the accelerator_view is done.
        mGpuAccView.create_marker().to_task().wait();
        mSeries->write_flag(_TwithInt("GPU finished: ", input));
    } while (static_cast<int>(GetTickCount()) - start < waitTimeMs);
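    As an aside, the timing loop above can be made more precise and portable with std::chrono::steady_clock instead of GetTickCount(). A sketch (repeat_for and the work parameter are illustrative names, not part of the original code):

    ```cpp
    #include <chrono>

    // Run `work` repeatedly until at least waitTimeMs milliseconds have elapsed.
    // steady_clock is monotonic and usually much finer-grained than GetTickCount().
    template <typename Work>
    int repeat_for(int waitTimeMs, Work work) {
        using clock = std::chrono::steady_clock;
        const auto deadline = clock::now() + std::chrono::milliseconds(waitTimeMs);
        int iterations = 0;
        do {
            work();        // e.g. the parallel_for_each plus the marker wait
            ++iterations;
        } while (clock::now() < deadline);
        return iterations;
    }
    ```

    In the thread's code, `work` would be the parallel_for_each call followed by create_marker().to_task().wait().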

    Thursday, November 21, 2013 7:19 AM
  • Hi Lang,

    Much of the below is guesswork given the limited amount of information I have about your software and hardware stack, so please correct me if I'm wrong.

    I believe your CPU has 4 logical cores and is thus capable of running 4 threads simultaneously. In the second graph in the above post, there are 5 threads running, 4 of which are labelled "CPU" and, as I understand it, saturate all the cores with computation, leaving the 5th thread, the GPU work submitter, preempted (waiting) for an extended amount of time. You can verify that by unhiding the thread activity lanes in Concurrency Visualizer: green segments mean active work being performed in the given thread, while yellow segments are preemption, i.e. waiting.

    Since your GPU task fully synchronizes CPU and GPU execution with the "wait" call, this causes starvation on the GPU.

    There are a couple of approaches you can try to improve the situation, depending on the architecture of the rest of your application:

    • submitting larger work items to the GPU would lower the relative overhead caused by the synchronization and the submission cost itself
    • synchronizing CPU and GPU at a lower frequency (e.g. after every couple of p_f_e executions, instead of after every one)
    • lowering the load on the other CPU threads in order to give the GPU submitter thread a chance to run
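    The second suggestion, synchronizing less often, amounts to issuing several kernels before one wait. A sketch in standard C++ (std::async stands in for parallel_for_each, the single wait per batch for create_marker().to_task().wait(), and batchSize is a hypothetical tuning knob):

    ```cpp
    #include <algorithm>
    #include <future>
    #include <vector>

    // Synchronize once per batch instead of once per task: the submitting
    // thread blocks far less often, so it is preempted less under CPU load.
    int run_batched(int totalTasks, int batchSize) {
        int completed = 0;
        for (int i = 0; i < totalTasks; i += batchSize) {
            std::vector<std::future<void>> batch;
            const int n = std::min(batchSize, totalTasks - i);
            for (int j = 0; j < n; ++j)
                batch.push_back(std::async(std::launch::async, [] { /* kernel */ }));
            for (auto& f : batch) f.wait();   // one synchronization point per batch
            completed += n;
        }
        return completed;
    }
    ```

    In AMP this works because kernels submitted to the same accelerator_view queue up in order, so a single marker wait after several parallel_for_each calls covers all of them.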

    Another important thing to note is that your work submission frequency is on the fringe of the Windows scheduler's granularity. The default thread quantum in client editions of Windows 7 is approximately 30 ms, and it is much longer in server editions. This may introduce some noise into your experiments.

    Saturday, November 30, 2013 10:04 PM
    Moderator