constant memory vs shared memory

  • Question

  • I've noticed some strange behavior that I was hoping to get some clarity on. It seems that shared memory (i.e. tile_static) is faster than constant memory (i.e. capture by value).

    e.g. The following is faster:

        tile_static float dct[8][8];
        tile_static float local_idct[8][8];

        dct[y][x] = coeffs[t_idx.global];
        local_idct[y][x] = idct[y][x];

        t_idx.barrier.wait_with_tile_static_memory_fence();

        tile_static float row_sums[8][8];

        precise_float row_sum = 0.0;
        row_sum += dct[y][0] * local_idct[x][0];
        row_sum += dct[y][1] * local_idct[x][1];
        row_sum += dct[y][2] * local_idct[x][2];
        row_sum += dct[y][3] * local_idct[x][3];
        row_sum += dct[y][4] * local_idct[x][4];
        row_sum += dct[y][5] * local_idct[x][5];
        row_sum += dct[y][6] * local_idct[x][6];
        row_sum += dct[y][7] * local_idct[x][7];

        row_sums[x][y] = row_sum;

    than:

        tile_static float dct[8][8];

        dct[y][x] = coeffs[t_idx.global];

        t_idx.barrier.wait_with_tile_static_memory_fence();

        tile_static float row_sums[8][8];

        precise_float row_sum = 0.0;
        row_sum += dct[y][0] * idct[x][0];
        row_sum += dct[y][1] * idct[x][1];
        row_sum += dct[y][2] * idct[x][2];
        row_sum += dct[y][3] * idct[x][3];
        row_sum += dct[y][4] * idct[x][4];
        row_sum += dct[y][5] * idct[x][5];
        row_sum += dct[y][6] * idct[x][6];
        row_sum += dct[y][7] * idct[x][7];

        row_sums[x][y] = row_sum;

    I am a bit curious about this: how come constant memory is slower than shared memory? I thought constant memory was cached.

    Tuesday, August 28, 2012 10:07 AM

Answers

  • While the actual performance characteristics vary across different GPU hardware, you are right that constant memory is typically backed by an L1 cache and offers very low access latency, comparable to tile_static memory. However, there is an important difference between these two types of memory. The constant cache is typically optimal for broadcast access patterns; i.e., if, for a given constant memory access, all threads in a team of consecutive threads (warp or wavefront) access the same constant memory location, they can be served in one go. However, if different threads in a warp/wavefront access different constant memory locations, the accesses are typically serialized.

    For example, assuming a warp size of 32, the const_data access inside the loop below can be serviced for all threads in the warp/wavefront in one go, because the constant memory location addressed by all threads is the same.

    float value = 0.0f;
    for (int i = 0; i < 10; i++)
    {
        value += const_data[i];
    }

    On the other hand, in the following code (assuming a 2D tile of 8 x 32), each thread accesses a different constant memory location for the access inside the loop. Consequently, the access is typically serialized in hardware and its performance is equivalent to 32 accesses (each thread in the warp/wavefront is served one at a time).

    float value = 0.0f;
    for (int i = 0; i < 10; i++)
    {
        value += const_data[i + tile_local_idx[1]];
    }

    tile_static memory also has very low access latency and at the same time is typically divided into multiple banks (consecutive 4-byte words go to consecutive banks), and hence is capable of serving accesses for all threads in a warp/wavefront in one go, as long as the locations accessed by the threads belong to different banks. If multiple threads in a warp/wavefront access the same word in a tile_static memory bank, they can be served together (broadcast), but if different threads access different words within the same tile_static memory bank, they are typically serialized. The blog post on C++ AMP tile_static memory describes this in greater detail.

    Hence, for the second const_data example above, putting const_data in tile_static memory would be significantly faster.
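
    To make the banking behavior concrete, here is a minimal sketch (not from the original post) that contrasts a conflict-free and a conflicting tile_static access pattern. It assumes 32 banks of 4-byte words and that the threads of a warp/wavefront are grouped along the least significant tile dimension; both are common but hardware specific, and the function and array names are made up for illustration.

    #include <amp.h>
    using namespace concurrency;

    // Hypothetical demo; assumes data_view's extent is a multiple of 32 x 32.
    void bank_demo(array_view<float, 2> data_view)
    {
        parallel_for_each(data_view.extent.tile<32, 32>(),
            [=](tiled_index<32, 32> t_idx) restrict(amp)
        {
            tile_static float tile_data[32][32];

            const int row = t_idx.local[0];
            const int col = t_idx.local[1];   // varies fastest across a warp/wavefront

            tile_data[row][col] = data_view[t_idx.global];
            t_idx.barrier.wait_with_tile_static_memory_fence();

            // Conflict-free: consecutive threads read consecutive 4-byte words,
            // which fall on consecutive banks.
            float fast = tile_data[row][col];

            // Conflicting: consecutive threads read words that are 32 apart, which
            // all fall on the same bank, so the reads are serialized.
            float slow = tile_data[col][row];

            data_view[t_idx.global] = fast + slow;
        });
    }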

    In your example, the access “idct[x][0]” ends up addressing a different location (due to “x”) for each thread in a warp/wavefront, and hence is slower when served from constant memory compared to tile_static memory. In fact, even for tile_static memory, the accesses end up being serialized to some extent due to bank conflicts (described in the tile_static memory blog post I mentioned earlier), but the serialization is not as bad as it is for constant memory.
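
    As a rough back-of-the-envelope mapping for your 8 x 8 arrays (again assuming 32 banks of 4-byte words, and that “x” is the least significant local index as in your snippets; this is only an illustration):

    tile_static float local_idct[8][8];   // each row is 8 consecutive 4-byte words

    // local_idct[x][0]: different values of "x" select different rows, i.e. word
    // offsets 0, 8, 16, 24, 32, 40, 48, 56, which map to banks
    // 0, 8, 16, 24, 0, 8, 16, 24 -- two distinct words per bank, so the reads
    // are partially (2-way) serialized; threads that share the same "x" still
    // read the same word and are broadcast.
    float conflicting = local_idct[x][0];

    // local_idct[0][x]: different values of "x" select consecutive words of one
    // row, i.e. word offsets 0..7 on banks 0..7; no two threads touch different
    // words in the same bank, so there is no conflict.
    float conflict_free = local_idct[0][x];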

    -Amit


    Amit K Agarwal

    • Marked as answer by Dragon89 Tuesday, August 28, 2012 7:48 PM
    Tuesday, August 28, 2012 7:10 PM
    Moderator

All replies

  • Hi, Dragon89!

    I can't offer any direct advice other than that some factors outside of your code snippets could influence the matter, e.g. how the code is JITted or how the data is captured into the parallel_for_each. In case you haven't read it already, there's a post on the Parallel Programming in Native Code blog called Using Constant Memory in C++ AMP, which goes into the details.


    Sudet ulvovat -- karavaani kulkee (the wolves howl -- the caravan moves on)

    • Edited by Veikko Eeva Tuesday, August 28, 2012 3:53 PM
    Tuesday, August 28, 2012 3:50 PM
  • Thanks for the explanation!

     "the accesses end up being serialized to some extent due to bank conflicts"

    Yes, I saw that, which is why I'm accessing it as idct[x][0] instead of idct[0][x] in order to avoid bank conflicts between different threads.


    Tuesday, August 28, 2012 7:41 PM
  • If “x” is a function of the thread’s local index’s least significant dimension value, then successive threads in a warp will have different values of “x” and hence access different rows of the tile_static memory, which is likely to result in bank conflicts. The following posts describe this in greater detail:

    -Amit


    Amit K Agarwal

    Friday, September 14, 2012 6:52 PM
    Moderator