Answered by:
Help with tiling

I've just converted one of my C++ AMP algorithms to use tiling, and I got a nice 80% speed up, great!
However, I'm unsure if I'm doing it correctly (most efficiently), since my code looks quite different from the various examples I have seen, and would therefore appreciate some feedback.
Basically my scenario is that I have an one dimensional array filled with consecutive 8x8 dct matrices laid out in linear order on which I need to peform idct (basically matrix multiplication). So matrix1 is array[0..63] and matrix 2 is array[64..127] etc..
My current code looks as follows. Note that TS = 64 = 8 *8, is the only tile size that works for this implementation (unlike others I have seen where the algorithm works regardless).
array_view<float> coeffs = /*...*/; float idct[8][8] = /*...*/; static const int TS = 64; concurrency::parallel_for_each(coeffs.extent.tile<TS>(), [=](const concurrency::tiled_index<TS>& t_idx) restrict(amp) { using namespace concurrency; auto local_idx = t_idx.global % 64; auto row = (t_idx.local % 8)[0]; auto col = (t_idx.local / 8)[0]; tile_static float local_coeffs[8][8]; local_coeffs[row][col] = coeffs[t_idx.global]; t_idx.barrier.wait(); float tmp = 0.0; for (int i = 0; i < 8; ++i) { float tmp2 = 0.0; tmp2 += local_coeffs[0][i] * idct[row][0]; tmp2 += local_coeffs[1][i] * idct[row][1]; tmp2 += local_coeffs[2][i] * idct[row][2]; tmp2 += local_coeffs[3][i] * idct[row][3]; tmp2 += local_coeffs[4][i] * idct[row][4]; tmp2 += local_coeffs[5][i] * idct[row][5]; tmp2 += local_coeffs[6][i] * idct[row][6]; tmp2 += local_coeffs[7][i] * idct[row][7]; tmp += idct[col][i] * tmp2; } dest[t_idx.global] = tmp; });
 Edited by Dragon89 Wednesday, August 22, 2012 5:35 PM
Question
Answers

Hi Dragon89,
I would suggest the following towards improving the performance of this implementation.
 Increase tile size: This computation processes a 8X8 matrix and a choice of 8X8 tile makes the implementation simple. However, GPUs rely on availability of a large number of threads to hide latency of global memory accesses or other high latency arithmetic operations. Current GPU hardware typically has a limit on the number of tiles that can be scheduled for concurrent execution on a compute unit (a SIMD engine essentially) and hence small tile sizes may result in inadequate occupancy which in turn may translate into suboptimal performance. As you mentioned, most of the samples you may have seen work on larger problem sizes which require chunks of data to be iteratively brought into tile_static memory and hence are able to use larger tile sizes in a natural way.
I would suggest experimenting with larger tile sizes (256 would be good number to begin). For example you can use a tile size of 256 and consider it to be 4 teams of 64 threads each processing a different 8X8 matrix.
 Secondly, I noticed that much of the computation performed by each thread is also performed by all other threads in the same “row” (tmp2 +=local_coeffs[0][i]*idct[row][0];).
It will be interesting to get rid of the redundant computations and have each thread in a row to perform this computation for a different “i” and store the result in tile_static memory to be subsequently shared by other threads in that row.
Hope this helps
Amit
Amit K Agarwal
 Proposed as answer by Zhu, Weirong Thursday, August 23, 2012 8:45 PM
 Marked as answer by Dragon89 Thursday, August 23, 2012 9:04 PM
All replies

Hi Dragon89,
I would suggest the following towards improving the performance of this implementation.
 Increase tile size: This computation processes a 8X8 matrix and a choice of 8X8 tile makes the implementation simple. However, GPUs rely on availability of a large number of threads to hide latency of global memory accesses or other high latency arithmetic operations. Current GPU hardware typically has a limit on the number of tiles that can be scheduled for concurrent execution on a compute unit (a SIMD engine essentially) and hence small tile sizes may result in inadequate occupancy which in turn may translate into suboptimal performance. As you mentioned, most of the samples you may have seen work on larger problem sizes which require chunks of data to be iteratively brought into tile_static memory and hence are able to use larger tile sizes in a natural way.
I would suggest experimenting with larger tile sizes (256 would be good number to begin). For example you can use a tile size of 256 and consider it to be 4 teams of 64 threads each processing a different 8X8 matrix.
 Secondly, I noticed that much of the computation performed by each thread is also performed by all other threads in the same “row” (tmp2 +=local_coeffs[0][i]*idct[row][0];).
It will be interesting to get rid of the redundant computations and have each thread in a row to perform this computation for a different “i” and store the result in tile_static memory to be subsequently shared by other threads in that row.
Hope this helps
Amit
Amit K Agarwal
 Proposed as answer by Zhu, Weirong Thursday, August 23, 2012 8:45 PM
 Marked as answer by Dragon89 Thursday, August 23, 2012 9:04 PM


An interesting thing I found, explicitly using mad is faster than letting the kernel compiler optimize it, how come?
i.e. instead of:
row_sum += dct[y][0][z] * idct[x][0][z]; row_sum += dct[y][1][z] * idct[x][1][z]; row_sum += dct[y][2][z] * idct[x][2][z]; row_sum += dct[y][3][z] * idct[x][3][z];
the following is faster:
row_sum = mad(dct[y][0][z], idct[x][0][z], row_sum); row_sum = mad(dct[y][1][z], idct[x][1][z], row_sum); row_sum = mad(dct[y][2][z], idct[x][2][z], row_sum); row_sum = mad(dct[y][3][z], idct[x][3][z], row_sum);
 Edited by Dragon89 Saturday, September 01, 2012 8:23 PM

This is related to IEEE strictness of the floating point behavior in the generated code. This can be controlled by the /fp compiler switch. By default the C++ AMP compiler uses /fp:precise and hence cannot use the faster “mad” hardware instruction (which is not guaranteed to be IEEE strict) and uses a mul + add instead. Compiling with /fp:fast, will result in the “mad” instruction being used. If the big /fp:fast hammer (for your entire code) is undesirable, you can explicitly use the mad intrinsic to accelerate specific multiply add operations in your kernel.
Amit
Amit K Agarwal