parallel_for_each queuing problem

  • Question

  • Hello,

    I have a project that calculates data in a 2D array. The program executes row by row, because the calculation of each row depends on the preceding row's result: basically, for each element of a row, the result of the previous row must be known.

    The way I have it set up, I have something like this:

    for(int i = 0; i < arrayHeight; i++)
    {
       parallel_for_each(extent<1>(i), [=](index<1> idx) restrict(amp)
       {
          int x = idx[0];
          int y = i;

          workingArray[x][y] = workingArray[x][y-1];
       });
    }

    I have a parallel_for_each kernel which processes a row of data, with one thread per index of the row. This works fine, but requires the program to wait for the parallel_for_each to finish before the GPU can start processing the next row.  

    Each parallel_for_each kernel is 1 element larger than the previous one, basically creating a wedge of data being processed as it expands out to the arrayHeight.

    So the first parallel_for_each executes 1 thread, the second parallel_for_each executes 2 threads, and so on.

    Is there a more efficient way to perform this operation? I imagine that multiple parallel_for_each calls would just queue up and be executed serially, without taking advantage of the available parallelism?



    Thanks in advance,

    AranC






    Thursday, May 23, 2013 4:28 AM

All replies

  • Well, I'm still figuring this stuff out, but until someone who knows what they're talking about comes along, there are two things that spring to mind:

    1) The CPU won't wait. I think it'll just slam through the entire loop and will only wait on the GPU once you actually try to access the array_view elements through [] or call synchronize() on the workingArray object (which is presumably an array_view); see the sketch after the code below.

    2) If you still want to increase the work for each iteration, why not simply place the parallel_for_each on the outside? I don't know if I understood this right, but something like this might work:

    parallel_for_each(extent<1>(arrayHeight), [=](index<1> idx) restrict(amp)
    {
       for(unsigned i = 0; i < idx[0]; ++i)
       {
          int x = i;
          int y = idx[0];

          workingArray[x][y] = workingArray[x][y-1];
       }
    });
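
    To illustrate point 1, here's a rough self-contained sketch (the sizes, data and the "+ 1.0f" calculation are just made up for illustration, and workingArray is assumed to be an array_view<float, 2>): the loop queues all the row kernels without blocking, and the CPU only stalls at synchronize().

    #include <amp.h>
    #include <iostream>
    #include <vector>
    using namespace concurrency;

    int main()
    {
       const int arrayWidth  = 8;   // elements per row (made-up size)
       const int arrayHeight = 8;   // number of rows   (made-up size)

       std::vector<float> data(arrayWidth * arrayHeight, 1.0f);
       array_view<float, 2> workingArray(arrayHeight, arrayWidth, data);

       // Each parallel_for_each returns as soon as the kernel is queued;
       // the CPU does not wait here for the GPU to finish the row.
       for (int row = 1; row < arrayHeight; ++row)
       {
          parallel_for_each(extent<1>(arrayWidth), [=](index<1> idx) restrict(amp)
          {
             int col = idx[0];
             workingArray(row, col) = workingArray(row - 1, col) + 1.0f;
          });
       }

       // This is where the CPU actually blocks: synchronize() waits for the
       // queued work to finish and copies the data back to the host vector.
       workingArray.synchronize();

       std::cout << "last element = " << data[arrayWidth * arrayHeight - 1] << std::endl;
       return 0;
    }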

    Friday, May 24, 2013 4:17 AM
  • Hey thanks for the reply!

    That solution sounds good, but the one problem is that thread 256, for example, may execute at the same time as thread 32, or before it. If this happens, the calculation of workingArray[0][256] will read workingArray[0][255] before earlier elements such as workingArray[0][32] have been processed. This means there would be a dependency issue.

    In other words, there needs to be a way to block thread 256, for example, until the whole previous row has been calculated.


    Friday, May 24, 2013 7:09 AM
  • What happens if we add "gpu_acc.wait()" to the code?

    accelerator_view gpu_acc = accelerator().default_view;

    for(int i = 0; i < arrayHeight; i++)
    {
       parallel_for_each(gpu_acc, extent<1>(i), [=](index<1> idx) restrict(amp)
       {
          int x = idx[0];
          int y = i;

          workingArray[x][y] = workingArray[x][y-1];
       });

       gpu_acc.wait();  // <----- Adding this extra line
    }
    

    By adding wait(), will the CPU be forced to wait for the GPU to finish before moving on to the next i iteration of the loop?
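
    For reference, here is a fuller, compilable version of what I mean (the data setup, sizes and the "+ 1.0f" calculation are made up just so it builds; workingArray is assumed to be an array_view<float, 2>, and my understanding of wait() may well be wrong):

    #include <amp.h>
    #include <iostream>
    #include <vector>
    using namespace concurrency;

    int main()
    {
       const int arrayWidth  = 8;   // elements per row (made-up size)
       const int arrayHeight = 8;   // number of rows   (made-up size)

       std::vector<float> data(arrayWidth * arrayHeight, 1.0f);
       array_view<float, 2> workingArray(arrayHeight, arrayWidth, data);

       accelerator_view gpu_acc = accelerator().default_view;

       for (int row = 1; row < arrayHeight; ++row)
       {
          // Queue this row's kernel on gpu_acc.
          parallel_for_each(gpu_acc, extent<1>(arrayWidth), [=](index<1> idx) restrict(amp)
          {
             int col = idx[0];
             workingArray(row, col) = workingArray(row - 1, col) + 1.0f;
          });

          // My understanding: wait() blocks the CPU until everything submitted
          // to gpu_acc so far has finished, so the next row isn't queued early.
          gpu_acc.wait();
       }

       workingArray.synchronize();   // copy the results back to the host vector
       std::cout << "last element = " << data[arrayWidth * arrayHeight - 1] << std::endl;
       return 0;
    }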

    Monday, September 22, 2014 11:34 PM