C++ PPL & AMP

  • Question

  • When I run the code below, my CPU goes crazy:

    #include "stdafx.h"
    #include <amp.h>
    
    using namespace concurrency;
    
    int _tmain(int argc, _TCHAR* argv[])
    {
    	parallel_for(0, 100000000, [](int p)
    	{
    		float result[4];
    		array_view<float, 1> amp_result(4, result);
    		amp_result.discard_data();
    
    		parallel_for_each(amp_result.extent, [=](index<1> i) restrict(amp)
    		{
    			amp_result[i] = p;
    		});
    
    		amp_result.synchronize();
    	});
    
    	return 0;
    }
    After running for a short time (about half a minute), Task Manager shows:

    Image Name: ppl_amp.exe
    CPU: 46
    Memory (Private Working Set): 67,704 K
    Threads: 1273

    and the thread count keeps growing.

    What is wrong?


    Tuesday, August 27, 2013 1:20 PM


All replies

  • Hi cger,

    There is an error in the way PPL and C++ AMP are combined here.

    By nature, all GPU technologies are efficient only with very large amounts of data (a kernel over just 4 elements is far too small). In addition, the body of the parallel_for loop runs for a long time relative to its huge number of iterations, so the Concurrency Runtime keeps launching new threads as needed. The code below works correctly:

    void example()
    {
        parallel_for(0, 10, [](int p)
        {
            // Heap-allocate: a 10,000,000-element local array would
            // overflow the default 1 MB thread stack.
            std::vector<int> values(10000000);
            array_view<int, 1> av(10000000, values);
            av.discard_data();
    
            parallel_for_each(av.extent, [=](index<1> i) restrict(amp)
            {
                float result[4];
                int idx = i[0] % 4;
                result[idx] = idx;
                av[i] = idx;
            });
    
            av.synchronize();
        });
    }

    Bruno


    Boucard Bruno - http://blogs.msdn.com/b/devpara/

    Wednesday, August 28, 2013 2:07 PM
  • Hi cger,

    Bruno's suggestion correctly demonstrates that larger datasets per GPU kernel invocation give much better performance. Many small kernel invocations on the GPU incur additional data-copy overhead that can end up dominating the overall execution time, while also resulting in very low GPU hardware utilization (% of execution units used).

    With respect to your question about the CPU thread growth, note that av.synchronize() is a blocking call, and blocking calls can cause the Concurrency Runtime to create new threads. This behavior is described on the Best Practices in the Parallel Patterns Library page on MSDN.
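
    For illustration, one way to bound that growth is to cap the scheduler's concurrency before queuing work. This is a minimal sketch, assuming only the documented SchedulerPolicy / CurrentScheduler API; the function name and the limits shown are hypothetical, not from this thread:

    #include <ppl.h>
    #include <concrt.h>
    
    using namespace concurrency;
    
    void capped_example()   // hypothetical name, illustration only
    {
        // Cap the default scheduler so blocking calls such as
        // array_view::synchronize() are less likely to trigger
        // unbounded thread creation.
        SchedulerPolicy policy;
        policy.SetConcurrencyLimits(1, 4);   // MinConcurrency = 1, MaxConcurrency = 4
        CurrentScheduler::Create(policy);
    
        parallel_for(0, 1000, [](int)
        {
            // ... launch the kernel and synchronize as before ...
        });
    
        CurrentScheduler::Detach();
    }

    Note that this only eases the pressure; the scheduler may still create extra threads to compensate for blocked ones, so batching the GPU work as Bruno showed remains the real fix.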

    --Daniel

    Thursday, August 29, 2013 6:06 PM
  • I agree with this remark, but we have to call av.synchronize() for data correctness (if we need to consume the av). With a small CPU loop, there is no issue. To be honest, my real version was without PPL:

    void example2()
    {
        // Heap-allocate: a 100,000,000-element local array (~400 MB)
        // would overflow the stack.
        std::vector<int> values(100000000);
        array_view<int, 1> av(100000000, values);
        av.discard_data();
    
        parallel_for_each(av.extent, [=](index<1> i) restrict(amp)
        {
            float result[4];
            int idx = i[0] % 4;
            result[idx] = idx;
            av[i] = idx;
        });
    
        av.synchronize();
    }

    Bruno

     


    Boucard Bruno - http://blogs.msdn.com/b/devpara/

    Thursday, August 29, 2013 9:02 PM
  • I don't believe that explicitly calling synchronize is required in this context. Yes, you are queuing work from multiple threads, but through the same accelerator_view and for a unique host-side container. I think the AMP runtime should handle this implicitly just fine, and do the sync whenever you access the data somewhere other than the accelerator on which the parallel_for_each was run.
    Thursday, August 29, 2013 9:16 PM
  • Hi Alex,

    All GPU technologies behave the same way with respect to kernel execution: it is asynchronous.

    If the original data comes from parameters, for instance, there is no need to call synchronize(), because the array_view destructor synchronizes the data automatically, as in this example:

    void example3(int *values, int length)
    {
        array_view<int, 1> av(length, values);
        av.discard_data();
    
        parallel_for_each(av.extent, [=](index<1> i) restrict(amp)
        {
            int result[4];
            int idx = i[0] % 4;
            result[idx] = idx;
            av[i] = idx;
        });
    }   // av is destroyed here; its destructor synchronizes back to values

    For the caller, values will be updated, but the synchronization cost is paid inside the array_view destructor. There is no magic; we still have to synchronize to be sure we get clean data.

    Bruno


    Boucard Bruno - http://blogs.msdn.com/b/devpara/

    Friday, August 30, 2013 4:42 PM
  • I am not sure I follow what you are trying to convey (sorry!). I would also say this is less an issue of asynchrony than of non-overlapping memory spaces: as you say, you call synchronize to get some datum that lives in another memory space (the GPU's), and even if execution were synchronous the story would be the same. The runtime implicitly synchronizes when the array_view is accessed from a different accelerator. If you mean that working on the data through the array_view and then touching the underlying container somewhere else through its native interface, not through the array_view, requires an explicit synchronize, then yes, but I am still missing how that applies in this context. Could you detail a bit more?
    Saturday, August 31, 2013 2:01 PM
  • I agree, Alex: in this context there is no need to call synchronize() manually. The array_view type handles the synchronization implicitly when we access any item of the native buffer through it. But the original problem was not about synchronization usage; the original snippet was calling synchronize(), and to avoid a radical change I kept that call in my answer even though it is useless in this context.
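
    To make the two cases concrete, here is a minimal sketch (example4 is a hypothetical illustration, not from the original posts): reading through the array_view synchronizes implicitly, while reading the raw container directly still requires an explicit synchronize() first.

    #include <amp.h>
    #include <vector>
    
    using namespace concurrency;
    
    void example4()   // hypothetical, illustration only
    {
        std::vector<int> values(16);
        array_view<int, 1> av(16, values);
        av.discard_data();
    
        parallel_for_each(av.extent, [=](index<1> i) restrict(amp)
        {
            av[i] = i[0];
        });
    
        // Implicit: CPU-side access through the array_view brings the
        // modified data back automatically.
        int through_view = av[0];
    
        // Explicit: the raw container bypasses the array_view's tracking,
        // so synchronize() is needed before reading it directly.
        av.synchronize();
        int through_vector = values[15];
    
        (void)through_view;
        (void)through_vector;
    }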


    Boucard Bruno - http://blogs.msdn.com/b/devpara/

    Friday, September 6, 2013 12:37 PM