Help with tiling

    Question

  • I've just converted one of my C++ AMP algorithms to use tiling, and I got a nice 80% speed up, great!

    However, I'm unsure if I'm doing it correctly (most efficiently), since my code looks quite different from the various examples I have seen, and would therefore appreciate some feedback.

    Basically, my scenario is that I have a one-dimensional array filled with consecutive 8x8 DCT matrices laid out in linear order, on which I need to perform an IDCT (basically a matrix multiplication). So matrix 1 is array[0..63], matrix 2 is array[64..127], etc.

    My current code looks as follows. Note that TS = 64 = 8 * 8 is the only tile size that works for this implementation (unlike other examples I have seen, where the algorithm works regardless of tile size).

    array_view<float> coeffs = /*...*/;   // input: packed 8x8 coefficient blocks
    array_view<float> dest = /*...*/;     // output: same layout as coeffs
    float idct[8][8] = /*...*/;
    
    static const int TS = 64;
    
    concurrency::parallel_for_each(coeffs.extent.tile<TS>(), [=](const concurrency::tiled_index<TS>& t_idx) restrict(amp)
    {
        using namespace concurrency;
    
        auto local_idx = t_idx.global % 64;
    
        // Position of this thread within its 8x8 block
        auto row = (t_idx.local % 8)[0];
        auto col = (t_idx.local / 8)[0];
    
        // Stage the whole 8x8 block in tile_static memory
        tile_static float local_coeffs[8][8];
    
        local_coeffs[row][col] = coeffs[t_idx.global];
    
        t_idx.barrier.wait();
    
        float tmp = 0.0f;
        for (int i = 0; i < 8; ++i)
        {
            float tmp2 = 0.0f;
    
            tmp2 += local_coeffs[0][i] * idct[row][0];
            tmp2 += local_coeffs[1][i] * idct[row][1];
            tmp2 += local_coeffs[2][i] * idct[row][2];
            tmp2 += local_coeffs[3][i] * idct[row][3];
            tmp2 += local_coeffs[4][i] * idct[row][4];
            tmp2 += local_coeffs[5][i] * idct[row][5];
            tmp2 += local_coeffs[6][i] * idct[row][6];
            tmp2 += local_coeffs[7][i] * idct[row][7];
    
            tmp += idct[col][i] * tmp2;
        }
        dest[t_idx.global] = tmp;
    });

    • Edited by Dragon89 Wednesday, August 22, 2012 5:35 PM
    Wednesday, August 22, 2012 4:04 PM

Answers

  • Hi Dragon89,

    I would suggest the following to improve the performance of this implementation.

    • Increase tile size: This computation processes an 8x8 matrix, and a choice of an 8x8 tile makes the implementation simple. However, GPUs rely on the availability of a large number of threads to hide the latency of global memory accesses and other high-latency arithmetic operations. Current GPU hardware typically limits the number of tiles that can be scheduled for concurrent execution on a compute unit (essentially a SIMD engine), so small tile sizes may result in inadequate occupancy, which in turn may translate into sub-optimal performance. As you mentioned, most of the samples you have seen work on larger problem sizes that require chunks of data to be brought into tile_static memory iteratively, and hence can use larger tile sizes in a natural way.

      I would suggest experimenting with larger tile sizes (256 would be a good number to begin with). For example, you can use a tile size of 256 and treat it as 4 teams of 64 threads, each processing a different 8x8 matrix.

    • Secondly, I noticed that much of the computation performed by each thread is also performed by all the other threads in the same "row" (tmp2 += local_coeffs[0][i] * idct[row][0];).

    It would be worthwhile to get rid of the redundant computation and have each thread in a row perform it for a different "i", storing the result in tile_static memory to be subsequently shared by the other threads in that row (see the sketch after these two suggestions, which combines them).
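
    To make the two suggestions concrete, here is a rough sketch of what the kernel could look like with both changes applied. It reuses the coeffs, dest, and idct names from your code and assumes coeffs.extent[0] is a multiple of 256 (pad the input otherwise); treat it as an illustration rather than tested code.

    static const int TS = 256; // 4 teams of 64 threads per tile
    
    concurrency::parallel_for_each(coeffs.extent.tile<TS>(), [=](const concurrency::tiled_index<TS>& t_idx) restrict(amp)
    {
        const int local  = t_idx.local[0];  // 0..255
        const int block  = local / 64;      // which of the 4 matrices handled by this tile
        const int within = local % 64;      // position inside that matrix
        const int row    = within % 8;
        const int col    = within / 8;
    
        tile_static float local_coeffs[4][8][8];
        tile_static float partial[4][8][8]; // partial[block][row][i], shared along a row
    
        local_coeffs[block][row][col] = coeffs[t_idx.global];
        t_idx.barrier.wait();
    
        // Each thread computes one inner sum (for i == col) instead of all eight,
        // removing the redundancy described above.
        float p = 0.0f;
        for (int k = 0; k < 8; ++k)
            p += local_coeffs[block][k][col] * idct[row][k];
        partial[block][row][col] = p;
        t_idx.barrier.wait();
    
        float tmp = 0.0f;
        for (int i = 0; i < 8; ++i)
            tmp += idct[col][i] * partial[block][row][i];
        dest[t_idx.global] = tmp;
    });

    Each tile now stages four 8x8 blocks, and each inner sum is computed once per (row, i) pair rather than eight times.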

    Hope this helps

    -Amit


    Amit K Agarwal

    • Proposed as answer by Zhu, Weirong Thursday, August 23, 2012 8:45 PM
    • Marked as answer by Dragon89 Thursday, August 23, 2012 9:04 PM
    Thursday, August 23, 2012 2:11 AM

All replies

  • Removing the redundant calculation gave me a 50% speedup, and increasing the tile size to 256 gave me another ~40%.

    • Edited by Dragon89 Saturday, September 01, 2012 8:23 PM
    Friday, August 31, 2012 6:58 AM
  • An interesting thing I found: explicitly using mad is faster than letting the kernel compiler optimize it. How come?

    I.e., instead of:

    row_sum += dct[y][0][z] * idct[x][0][z];  
    row_sum += dct[y][1][z] * idct[x][1][z];  
    row_sum += dct[y][2][z] * idct[x][2][z];  
    row_sum += dct[y][3][z] * idct[x][3][z];  

    the following is faster:

    row_sum = mad(dct[y][0][z], idct[x][0][z], row_sum);  
    row_sum = mad(dct[y][1][z], idct[x][1][z], row_sum);  
    row_sum = mad(dct[y][2][z], idct[x][2][z], row_sum);  
    row_sum = mad(dct[y][3][z], idct[x][3][z], row_sum);  


    • Edited by Dragon89 Saturday, September 01, 2012 8:23 PM
    Friday, August 31, 2012 7:35 AM
  • This is related to the IEEE strictness of the floating-point behavior in the generated code, which can be controlled by the /fp compiler switch. By default the C++ AMP compiler uses /fp:precise and hence cannot use the faster "mad" hardware instruction (which is not guaranteed to be IEEE strict); it emits a mul + add instead. Compiling with /fp:fast will result in the "mad" instruction being used. If the big /fp:fast hammer (for your entire code) is undesirable, you can explicitly use the mad intrinsic to accelerate specific multiply-add operations in your kernel.
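
    In case it is useful, here is a minimal fragment of what the explicit form looks like inside a restrict(amp) kernel; the intrinsic lives in the concurrency::direct3d namespace, and a, b, and acc below are just placeholder names:

    // Inside a restrict(amp) lambda; a, b, and acc are placeholder floats.
    float acc = 0.0f;
    acc = concurrency::direct3d::mad(a, b, acc); // acc = a * b + acc via the hardware mad instruction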

    -Amit


    Amit K Agarwal

    Friday, September 14, 2012 6:43 PM