Answered by:
A nudge with twodimensional matrix reduction across all tiles
Question

Greetings (again)!
This is related to my previous question, with which Dragon89 was already rather helpful. Basically what I have a code that runs correctly on a CPU. I transformed it for GPU, but it isn't that efficient, it has plenty to do how I do some matrix operations and especially reducing the results. This is a bit lengthy as I provide a lot of code, but please, bear with me here as this excercise contains separately many things I'm trying to wrap my head around. :)
I have some doubts with the following issues:
 Should I have reduction scratchpads (tile_static arrays) per tile or just one used acrossa all tiles? The context in the following code snippet. This is my main problem, in the following snippet I have a data race I haven't figured out how to handle efficiently. It's related to the fact the processing is happening in 2D.
 How could I utilize tile_static memory and threads more efficiently when processing a 2D matrix (see code that follows). This is somewhat secondary.
struct best_pair { int best1; int best2; best_pair() best1(0), best2(0) {}; }; inline void max_change(best_pair& a, best_pair const& b) restrict(amp) { a = (a.best1 > b.best1) ? a : b; } //This function processes a huge matrix and should return a pair of numbers as a result of the processing. //To acquire the numbers require a reduction step, //which is a bit of a problem for me to do efficiently now, see comments (also for the purpose of the arguments)... best_pair amp_exercise(std::vector<int> matrix, std::vector<int> processing_indices, int matrix_dimension) { concurrency::array_view<int, 1> processing_indices_view(matrix_dimension, processing_indices); static const int TS = 32; static const int BLOCKS = 1024 / TS; //Here the idea would be to collect either the best value per tile or just one across all tiles. std::array<best_pair, TS> out_pairs; concurrency::array_view<best_pair, 1> out(out_pairs.size(), out_pairs); //The matrix is in row major order. Note that it doesn't need to be read out. const concurrency::array_view<const int, 2> matrix_view(matrix_dimension, matrix); auto domain = concurrency::extent<2>(BLOCKS * TS, BLOCKS * TS).tile<TS, TS>(); parallel_for_each(domain.pad(), [=](concurrency::tiled_index<TS, TS> t_idx) restrict(amp) { best_pair best = best_pair(99999, 9999); //Initialize this array. I'm not sure if it'd be better //this were either per tile or across all tiles. That is, should //the size be the size of all tiles or the size of //matrix dimension given as a parameter? //In any event, //Dragon89 provided some ideas for this, so the following //initialization routine is just a "place holder"... tile_static best_pair best_pairs[TS]; if(t_idx.local[0] == 0 && t_idx.local[1] == 0) { for(int k = 0; k < TS; ++k) { best_pairs[k] = best; } } t_idx.barrier.wait(); //There probably is a more efficient method //to process through the matrix. //Note that //the "uppertriangular" processing is actually //only used to generate indices together with //processing_indices_view, which is used also //outside of this kernel. That is, the access pattern //to the matrix itself isn't uppertriangular. for(int i = 0; i < matrix_dimension  1; ++i) { for(int j = i + 1; j < matrix_dimension  1; ++j) { int a = processing_indices_view[i];int b = processing_indices_view[i + 1]; int c = processing_indices_view[j]; int d = processing_indices_view[j + 1]; int a1 = matrix_view[a][b]; int a2 = matrix_view[a][c]; //Etc. calculations with the combinations of the matrix //entries obtained from the above code. //The result stored and processed as follows //for illustration. //Now, here lies the main rub. This will //generate a data race, for which I haven't found an //efficient method to handle. This is linked on the //matter how should I gather the "best result" of all of //these operations? This probably would need //some kind of a indexing with thread id, but how to //ensure the index would be within tile_static bounds? int result = a1 + a2; if(result > best.best1) { best.best1 = a1; best.best2 = a2; best_pairs[t_idx.local[0]] = best; } } } t_idx.barrier.wait(); //Apart from the data race, in previous if clause, //this looks like works all right if the //scratchpad is made big enough. Though, still, //should the reduction be per tile or across //all tiles so that the result from this //parallel_for_each is only one best_pair //or an arrayful of them? //Reduction tiled_3 from http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/08/parallelreductionusingcamp.aspx. for(int s = TS / 2; s > 0; s /= 2) { if(t_idx.local[0] < s) { max_change(best_pairs[t_idx.local[0]], best_pairs[t_idx.local[0] + s]); } t_idx.barrier.wait(); } //Store the tile result in the global memory. if(t_idx.local[0] == 0) { auto i = t_idx.tile[0]; out_pairs[i] = best_pairs[0]; } }); out_pairs.synchronize(); auto best_pair = *std::max_element(out_pairs.begin(), out_pairs.end(), [](best_pair const& a, best_pair const& b) { return(a.best1 > b.best1); }); return best_pair; };
 Edited by Veikko Eeva Tuesday, August 28, 2012 4:20 PM
Tuesday, August 28, 2012 1:00 PM
Answers

The CPU code is like this
for(int i = 0; i < 100; ++i) { int max = 0; int best_i; int best_j; do { int local_max(0); int local_best_i(0); int local_best_j(0); //These for loops are what I'm trying to execute on the GPU. for(int i = 0; i < processing_indices.size()  1; i++) { for(int j = i + 1; j < processing_indices.size()  1; j++) { int a = processing_indices[i]; int b = processing_indices[i + 1]; etc. [...] int temp_max = matrix[a][b] + matrix... //These local values I'm having a hard //time collect. //Currently I try to put them to //an array, write out from the GPU //and then search through all of them //in CPU with the std::max_element //call. if(temp_max < local_best) { local_best = temp_max; local_best_i = i; local_best_j = j; } } } max = local_max; best_i = local_best_i; best_j = local_best_j;
I have modified the GPU code, which works as far as I have understood, but it uses global memory and I'm still having hard time to write the maximum element, the result of the processing, out of the GPU. The GPU code is like follows now (I didn't replicate it in its entirety as I'd need to clean it before, it's pretty much in the middle of flight currently, but I'm not on a computer doing the code right now)
//The code before parallel_for_each is the same as before, but out is now defined as follows. Note that the related CPU code handles also the indices, which I intend to add when I get this working (that is, this is code in flight...) std::array<int, 4096> out_structs; concurrency::array_view<int, 1> out(out_structs.size(), out_structs); //The matrix is in row major order. Note that it doesn't need to be read out. const concurrency::array_view<const int, 2> matrix_view(matrix_dimension, matrix); auto domain = concurrency::extent<2>(BLOCKS * TS, BLOCKS * TS).tile<TS, TS>(); parallel_for_each(domain.pad(), [=](concurrency::tiled_index<TS, TS> t_idx) restrict(amp) { int row = t_idx.local[0]; int col = t_idx.local[1]; for(int i = 0; i < matrix_dimension  1; i += TS) { tile_static int locA[TS][TS]; locA[row][col] = matrix_view(t_idx.global[0], col + i); t_idx.barrier.wait(); int upper_trig_col = col + i; if(row < matrix_dimension  1 && col < matrix_dimension  1 && (upper_trig_col > row)) { int x1 = row; int x2 = row + 1; int x3 = upper_trig_col; int x4 = upper_trig_col + 1; int a = processing_indices_view[x1]; int b = processing_indices_view[x2]; int c = processing_indices_view[x3]; int d = processing_indices_view[x4]; int temp_max = locA[a][c] + locA[b][d]... //Here's a data race... out[i + (row * dimension + upper_trig_row)] = temp_max; } } } t_idx.barrier.wait();
I apologize a bit for the messy code, but I'm currently rather pressed with time and it looks my browser is having a bit of problems with the forum software. If you feel like being as helpful as you are and you really want to understand what I'm trying to do, I can send you the code too. Though then this would start to feel less recreational passtime and more like serious trying. :)
 Edited by Veikko Eeva Wednesday, August 29, 2012 12:56 PM
 Marked as answer by Veikko Eeva Friday, September 14, 2012 7:03 AM
Wednesday, August 29, 2012 12:48 PM
All replies

It would be helpful if you explained what the code is supposed to do, possibly with the working CPU implementation.Tuesday, August 28, 2012 1:07 PM

Disregard this, your code seems to do more than I saw just from a quick glance, though I still don't see what you are trying to do.

To me it looks like you want the minimum element of the input array.
I don't think tiling is useful here as you are not reusing/sharing any data between threads.
I would suggest that you split the data set into smaller blocks, i.e. extent/block_size, and then have each gpu thread get the minimum element of it's block.
After calling this kernel you would get a new smaller data set with elements, then you would either re run the kernel on the new data set or get the final result on the cpu (depending on the data set size).
array_view<int> reduce_gpu(const array_view<int>& input) { const int block_size = 256;
array<int> out(input.extent/block_size); parallel_for_each(out.extent, [=](const index<int>& idx) restrict(amp) { int min_value = MAX_INT; for(int n = 0; n < block_size; ++n) min_value = min(min_value, input[idx * block_size + n); out[idx] = min_value; }); return out; } int main() { array_view<int> data = /* ... */; static const int size_threshold = 4096; while(data.size() > size_threshold) data = reduce_gpu(data); int min_value = *std::min_element(data.begin(), data.end()); }
I think Dynamic Parallelism would be very useful here (as you would avoid the while(data.size() > size_threshold part), however it is only supported on certain GPUs and using CUDA. Though I hope C++ AMP will support it in the future.
 Edited by Dragon89 Tuesday, August 28, 2012 1:25 PM
Tuesday, August 28, 2012 1:16 PM 
Hi!
No, I don't want a minimum/maximum element of the input array (I've named the function wrongly, I edit the code). The input array is actually a matrix, which just happens to be in rowmajor order (see that I used array_view with two dimensions to introduce a different view to processing). What I actually want to do is to process some entries in the matrix and keep a note on the maximum/minimum result, of which I then return from the kernel.
My CPU code is serial, but on the GPU there will be multiple threads  and even multiple tiles. Now the problem I'm chiefly encountered is that the various threads going over the matrix introduce a data race whilst writing to the variable holding the maximum/minimum element. Moreover, the different tiles will introduce multiple maximum elements, from which I need to choose the overall smallest/greates. For that purpose I have the std::max_element call after the kernel to postprocess the reduction results.
What I wonder then is, how could I process the matrix efficiently within the two loops (and I need to use the processing_indices_view) and still be able to return the one and only minimum/maximum element from amp_exercise. It would be great if I only needed to use a single variable, as currently, per tile to store the minimum/maximum element, but perhaps storing the results to a temporary tile_static array of the tile size (size is indicated by variable TS in my code) would be good too, though the storage requirement would naturally be higher than in case of a single variable.
I've come across this very problem before too, but I haven't been able to find a good solution as of yet. I'm also rather new into GPU computing and C++ AMP in particular (I'm learning this basically during nights, so takes a bit of time to introduce myself to all the things).
The problem here is that when I try to introduce tiles, I get multiple maximum elements from the various tiles and I have thus far been unable to efficiently solve the problem of getting the minimum/maximum.
Sudet ulvovat  karavaani kulkee
 Edited by Veikko Eeva Tuesday, August 28, 2012 8:29 PM
Tuesday, August 28, 2012 4:19 PM 
Actually, the reduction step won't work correctly in 2D setting. There will be a writeafterwrite hazard in writing to out_pairs. It looks the reduction would work if I made the matrix_view a single dimensional and access it with a different kind of a rowmajor indexing. Though I'm after how to accomplish this in 2D...
Sudet ulvovat  karavaani kulkee
Tuesday, August 28, 2012 8:33 PM 
Sorry, but I still don't understand what you are trying to do. Are you trying to get the min/max element from the input array or not, your explanation seems a bit contradicting?
Please provide the CPU code as that might bring some clarity.
 Edited by Dragon89 Wednesday, August 29, 2012 8:17 AM
Wednesday, August 29, 2012 8:17 AM 
The CPU code is like this
for(int i = 0; i < 100; ++i) { int max = 0; int best_i; int best_j; do { int local_max(0); int local_best_i(0); int local_best_j(0); //These for loops are what I'm trying to execute on the GPU. for(int i = 0; i < processing_indices.size()  1; i++) { for(int j = i + 1; j < processing_indices.size()  1; j++) { int a = processing_indices[i]; int b = processing_indices[i + 1]; etc. [...] int temp_max = matrix[a][b] + matrix... //These local values I'm having a hard //time collect. //Currently I try to put them to //an array, write out from the GPU //and then search through all of them //in CPU with the std::max_element //call. if(temp_max < local_best) { local_best = temp_max; local_best_i = i; local_best_j = j; } } } max = local_max; best_i = local_best_i; best_j = local_best_j;
I have modified the GPU code, which works as far as I have understood, but it uses global memory and I'm still having hard time to write the maximum element, the result of the processing, out of the GPU. The GPU code is like follows now (I didn't replicate it in its entirety as I'd need to clean it before, it's pretty much in the middle of flight currently, but I'm not on a computer doing the code right now)
//The code before parallel_for_each is the same as before, but out is now defined as follows. Note that the related CPU code handles also the indices, which I intend to add when I get this working (that is, this is code in flight...) std::array<int, 4096> out_structs; concurrency::array_view<int, 1> out(out_structs.size(), out_structs); //The matrix is in row major order. Note that it doesn't need to be read out. const concurrency::array_view<const int, 2> matrix_view(matrix_dimension, matrix); auto domain = concurrency::extent<2>(BLOCKS * TS, BLOCKS * TS).tile<TS, TS>(); parallel_for_each(domain.pad(), [=](concurrency::tiled_index<TS, TS> t_idx) restrict(amp) { int row = t_idx.local[0]; int col = t_idx.local[1]; for(int i = 0; i < matrix_dimension  1; i += TS) { tile_static int locA[TS][TS]; locA[row][col] = matrix_view(t_idx.global[0], col + i); t_idx.barrier.wait(); int upper_trig_col = col + i; if(row < matrix_dimension  1 && col < matrix_dimension  1 && (upper_trig_col > row)) { int x1 = row; int x2 = row + 1; int x3 = upper_trig_col; int x4 = upper_trig_col + 1; int a = processing_indices_view[x1]; int b = processing_indices_view[x2]; int c = processing_indices_view[x3]; int d = processing_indices_view[x4]; int temp_max = locA[a][c] + locA[b][d]... //Here's a data race... out[i + (row * dimension + upper_trig_row)] = temp_max; } } } t_idx.barrier.wait();
I apologize a bit for the messy code, but I'm currently rather pressed with time and it looks my browser is having a bit of problems with the forum software. If you feel like being as helpful as you are and you really want to understand what I'm trying to do, I can send you the code too. Though then this would start to feel less recreational passtime and more like serious trying. :)
 Edited by Veikko Eeva Wednesday, August 29, 2012 12:56 PM
 Marked as answer by Veikko Eeva Friday, September 14, 2012 7:03 AM
Wednesday, August 29, 2012 12:48 PM 
Hi Veksi,
Cursorily this looks like a classic reduction problem and looking at our earlier C++ AMP sample on parallel reduction would help with your implementation. "reduction_cascade" is specifically the one I would suggest you to look at.
Amit
Amit K Agarwal
Wednesday, September 12, 2012 6:37 PMModerator 
Hi!
I went another route and use onedimensional rowmajor array and use actually the reduction number three there. It was something I could understand and implement easily as an extra step after other calculations without leaving the GPU.
But I'll take a new look at the cascaded reduction, nevertheless.
Sudet ulvovat  karavaani kulkee
Friday, September 14, 2012 7:03 AM