Iterating on the GPU with C++ AMP

  • Question

  • Hi, I'm going to code a GPU-based version of Dijkstra's search, and since I have just started learning C++ AMP my knowledge is quite limited.

    I guess there is a simple answer, but here goes:

    //Create GPU array views of the data
    array_view<const size_t, 1> V(n_nodes, v);      //All vertices (nodes)
    array_view<const size_t, 2> E(n_nodes, nn, e);  //All edges
    array_view<const size_t, 2> W(n_nodes, nn, w);  //Weights

    array_view<int, 1> M(n_nodes, m);    //Mask array
    array_view<float, 1> C(n_nodes, c);  //Cost array
    array_view<float, 1> U(n_nodes, u);  //Updating cost array

    // Run the following code on the GPU
    // ---------------------------------
    //while M not empty do (or while there are still some M[idx] == 1)
    // {
        parallel_for_each(V.extent, [V, E, W, M, C, U] (index<1> idx) restrict(amp)
        {
            if (M[idx] == 1)
            {
                //Update cost in C and U
                //Update mask array M, setting several more M[newIdx] = 1;
                //Think of it as a wavefront propagating in all directions through the graph
            }
        });

        //Wait for all threads to finish updating M.
        //I do not want to copy back to CPU between iterations!
    // }
    // ------------- (end GPU code)

    As you can see from the code above, what I want to accomplish is to iterate on the GPU, performing some sort of while loop (see the comments in the code). In each iteration I want to perform a parallel execution over all vertices (V) in my graph, wait until all threads have finished updating the mask array (M), and then perform the next iteration.

    So my question is: how can I perform the iterations on the GPU? I don't want to copy the mask array (M) between the GPU and the CPU in each iteration; I just want to use the updated mask array on the GPU in the next iteration. How can I perform something like the while loop indicated in the code comments above?

    Monday, July 9, 2012 12:50 PM


All replies

  • I have been working some more with the code, and I now have a working version where the while loop runs on the CPU. What I want is of course a version where the while loop (or a similar construct) runs entirely on the GPU, avoiding the synchronization of the "stop criteria" between iterations.

    Here is a "pseudo code" explaining what I want to achieve. I want the execution on the GPU to break when a stop criteria is hit.  

    while (stopCriteria == false) //Some sort of gpu_while loop
    {
        //This parallel loop will be called iteratively until stopCriteria hits
        parallel_for_each(data.extent, [=] (index<1> idx) restrict(amp)
        {
            //Add lots of code here ...
            // <....>
            if (someCriteria)
            {
                stopCriteria = true;
                break; //Break further execution
            }
        });

        //I do not want to synchronize stopCriteria or data between iterations;
        //the same data structure resides on the GPU, iteratively updated until completion.
    }

    Please let me know if this is possible or not. It could be that I have misunderstood what's possible on the GPU...

    In my CPU-based while loop I'm now using an array_view around a single integer (initially set to 0, then set to 1 when the stop criteria hits), and between iterations I call array_view::synchronize() to check whether the stop criteria was hit. This takes 0.35 ms in my code, which is way too much since the actual iteration takes around 0.27 ms on average. It adds up when you do around 30-40 iterations!

     

    Monday, July 9, 2012 4:48 PM
  • Hi mm_h

    Without sharing your actual C++ AMP code, we cannot really comment on “if” and where exactly you are going wrong.

    To answer your generic question:

    • If you don’t want the data to come back to the CPU, then don’t use array_view::synchronize and don’t try to access the data. The data remains on the accelerator regardless.
    • You do not need to wait for a parallel_for_each computation to complete; simply start the next invocation (e.g. in your loop), and if the data is not ready for it, it will block from executing – the runtime takes care of that for you, so you don't need to.

    In other words, your pseudo code should work, once you substitute your comments with actual code.
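
    For illustration only, here is a minimal sketch of that pattern (the loop count and the kernel body are placeholders, not your actual algorithm):

    #include <amp.h>
    using namespace concurrency;

    void iterate_on_gpu(array_view<int, 1> M)
    {
        for (int iter = 0; iter < 40; ++iter)  // placeholder iteration count
        {
            parallel_for_each(M.extent, [=] (index<1> idx) restrict(amp)
            {
                if (M[idx] == 1)
                {
                    // ...update costs and set further M[newIdx] = 1 here...
                    M[idx] = 0;  // placeholder: retire this vertex
                }
            });
            // No M.synchronize() here, so nothing is copied back to the CPU;
            // the next launch simply queues behind this one.
        }
        // The data comes back to the host only when you access it, e.g. via M.synchronize().
    }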

    You also mentioned something about timings, so to make sure you are timing things correctly, please read our two blog posts on that:
    http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/28/how-to-measure-the-performance-of-c-amp-algorithms.aspx
    http://blogs.msdn.com/b/nativeconcurrency/archive/2012/04/25/data-warm-up-when-measuring-performance-with-c-amp.aspx

    Cheers
    Daniel


    http://www.danielmoth.com/Blog/

    • Marked as answer by h7o_ Tuesday, July 10, 2012 11:29 PM
    Monday, July 9, 2012 6:06 PM
  • Thank you, Daniel, for the clarifications. I didn't include the whole code because it is quite big now and I thought it would only clutter the actual thing I was asking about; that's why I just added simpler pseudo code.

    The only thing I don't understand is how I can tell the continuous C++ AMP loop to stop without doing the s_v.synchronize() in the code below. Each synchronize takes around ~0.3 ms (those are the latest numbers after 30-40 iterations, long after the initial warm-up and such; I'm measuring with the Timer class from your blog post about performance). So my question is: is there some sort of "break;" command that can be executed in the GPU code?

    Or actually, a break from the parallel_for_each is not enough. I also need some sort of while loop residing on the GPU, because I want to break out of both the parallel_for_each AND the while loop when the stop criteria is reached. I hope you see what I'm trying to achieve...

    Are the s_v.synchronize() call and the CPU while loop in my code the right way to do it? Or is there a faster variant where some sort of while loop can reside on the GPU?

    int stop = 0;
    array_view<int, 1> s_v(1, &stop);
    int numiter = 0;
    int nodeFrom = 1;   //just a sample number for illustration
    int nodeTo = 45302; //just a sample number for illustration

    while (stop == 0)
    {
        parallel_for_each(M.extent, [=] (index<1> idx) restrict(amp)
        {
            // My code performed per iteration goes here...
            // updating nodeCurrent to something for each iteration
            // < .... >
            //
            if (nodeCurrent == nodeTo) //Eventually, after let's say ~30 iterations, the stop criteria is reached
            {
                s_v[0] = 1; //All iterations are now finished, setting the stop criteria to 1

                //Actually what I really want to do here is something similar to a normal break in CPU code.
                //break;
            }
        });

        //Synchronize the stop criteria
        s_v.synchronize();
        stop = s_v[0];

        numiter++; //count the total iterations
    }

    Here is a code sample I found in the PPL help documentation, breaking out early from a CPU loop. This sample does not do exactly what I want, though, because I want the while loop on top of this parallel loop. Is there something similar in C++ AMP, or some other way to achieve my goal?

    vector<double> results = ...
    int workLoad = ...
    task_group tg;
    size_t fillTo = results.size() - 5;
    fill(results.begin(), results.end(), -1.0);
    
    task_group_status status = tg.run_and_wait([&]
      {
         parallel_for(0u, results.size(), [&](size_t i)
         {
           if (i > fillTo)
             tg.cancel();
           else
             results[i] = DoWork(i, workLoad);
         });
       });

    Thank you very much for all the help!

    • Edited by h7o_ Monday, July 9, 2012 7:15 PM
    Monday, July 9, 2012 7:09 PM
  • Hi mm_h,

    C++ AMP is essentially a data-parallel model that does not support all of the richness of a CPU-based parallelism model such as PPL. In particular, there are really no collective coordination mechanisms that apply to all threads in a computation: you just start them and wait for them to finish. That means, in general, that if you have an iterative algorithm with a global convergence test, the test has to be performed on the CPU, with multiple kernel launches forming the iteration.

    In the tiled version of C++ AMP, however, there is support for coordinating the threads within a tile. If your algorithm can be split so that multiple steps are performed within a tile before the global convergence test, that will often be faster. We use that approach in our reduction sample and the bitonic sort sample.
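
    For illustration only, here is a minimal sketch of that tile-level coordination (the tile size, the flag, and the number of local steps are placeholders, not the actual sample code):

    #include <amp.h>
    using namespace concurrency;

    static const int TILE_SIZE = 256;  // placeholder; the extent is assumed divisible by it

    void local_steps(array_view<int, 1> mask)
    {
        parallel_for_each(mask.extent.tile<TILE_SIZE>(),
            [=] (tiled_index<TILE_SIZE> tidx) restrict(amp)
        {
            tile_static int tile_active;          // flag shared by the threads of one tile
            for (int step = 0; step < 8; ++step)  // placeholder number of local steps
            {
                if (tidx.local[0] == 0)
                    tile_active = 0;
                tidx.barrier.wait();              // every thread sees the cleared flag

                // Placeholder work: a thread that changes something marks the tile active.
                if (mask[tidx.global] == 1)
                {
                    mask[tidx.global] = 0;
                    atomic_fetch_or(&tile_active, 1);
                }

                tidx.barrier.wait();              // all tile-local updates are now visible
                if (tile_active == 0)
                    break;                        // the whole tile stops early, together
            }
        });
        // The global convergence test still runs on the CPU between kernel launches.
    }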

    Monday, July 9, 2012 9:03 PM
  • David, thank you for your suggestions. I guess it's part of my learning process, trying to understand the conceptual differences between CPU and GPU programming, and what's possible and what's not. I appreciate the answers and help from both of you.
    Tuesday, July 10, 2012 11:29 PM