Tile size vs. number of threads

  • Question

  • I can understand there being a limit on the tile size, but the tile size (from what I'm reading) is also going to be the number of threads executed that will use that tile cache.

    So if my tile size is 16x16, my local storage (which is faster than global memory) is going to be 256 elements.

    That's fine, but then only 256 threads will be executed that use this tile cache.

    Am I reading this wrong?

    Is "tile size" also equal to "number of threads"?

    Why is this done?

    I can see there being a limit on the tile size (which is 1024), but why is it tied to the number of threads that will access that tile cache?

    Monday, April 8, 2013 11:30 PM
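
A minimal C++ AMP sketch of the relationship being asked about: the tile dimensions given to the tiled extent are exactly the per-tile thread count, so a 16x16 tile is executed by 256 threads. The function name and the 1024x1024 array size below are illustrative assumptions, not from the thread.

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    // Hypothetical helper: doubles every element of a 1024x1024 float grid.
    void scale_grid(std::vector<float>& data)
    {
        array_view<float, 2> av(1024, 1024, data);
        parallel_for_each(
            av.extent.tile<16, 16>(),                  // 16 x 16 = 256 threads per tile
            [=](tiled_index<16, 16> idx) restrict(amp)
            {
                // idx.local addresses this thread within its 256-thread tile;
                // idx.global addresses it within the full 1024x1024 extent.
                av[idx.global] *= 2.0f;
            });
        av.synchronize();
    }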

Answers

  • No, there's no direct relationship. The belief that the shape/size of the software unit of scheduling (tiles in C++ AMP, workgroups in OpenCL, thread groups in DirectCompute) has to match the size of tile_static storage probably stems from how many GPU programming tutorials are laid out, and is unfortunate IMHO. tile_static merely denotes that you're placing something in a software-managed cache that is visible to all parts of a software unit of scheduling, and it is the programmer's task to manage interactions with it. As long as the sum of the sizes of the things you're putting in tile_static memory is under 32 KB, you're golden and free to shape them in whichever way you like :)
    • Marked as answer by MartinDOrtiz Tuesday, April 9, 2013 2:37 AM
    Tuesday, April 9, 2013 2:13 AM
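
A sketch of the point made above, with illustrative numbers: the tile is 16x16 (256 threads), but the tile_static allocation is shaped nothing like the tile, and only its size (kept under the 32 KB budget) matters. The function name is hypothetical and the view is assumed to have tile-divisible dimensions.

    #include <amp.h>
    using namespace concurrency;

    // Assumes a float view whose extent is divisible by 16 in both dimensions;
    // the kernel passes values between neighbouring threads through the
    // software-managed cache.
    void neighbour_pass(array_view<float, 2> av)
    {
        parallel_for_each(
            av.extent.tile<16, 16>(),
            [=](tiled_index<16, 16> idx) restrict(amp)
            {
                // 2048 floats = 8 KB of tile_static storage, well under 32 KB,
                // and deliberately not shaped like the 16x16 tile.
                tile_static float cache[2048];
                int flat = idx.local[0] * 16 + idx.local[1];   // flat id within the tile, 0..255
                cache[flat] = av[idx.global];
                idx.barrier.wait();                            // programmer-managed synchronization
                av[idx.global] = cache[(flat + 1) % 256];      // read a neighbouring thread's value
            });
    }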

All replies

  • OK... I re-read the tile documentation... I think I'm confusing tiles with tile_static memory.

    Does the tile_static memory have to be related to the tile size in any way?

    My tile size could be M x N = the total threads for that group within the context of the entire extent you're computing over (OK, that makes sense).

    OK... I was confusing the two... but I'm still not sure...

    Does the tile_static memory have to be related to the tile size (i.e., the number of threads)?

    I'm guessing not, but the example I'm looking at happens to use the same size (M x N) for its tile_static memory; it's probably just a coincidence.

    Monday, April 8, 2013 11:43 PM
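
The tutorial pattern described in the post above is probably the classic tiled matrix multiply, sketched below under the assumption of matrices whose sides are multiples of the tile side (the function name is illustrative). The tile_static arrays are sized to match the tile dimensions purely for convenience, so that each of the M x N threads loads exactly one element per step; it is a convention, not a requirement of tile_static.

    #include <amp.h>
    using namespace concurrency;

    static const int TS = 16;   // tile side: TS x TS = 256 threads per tile

    void tiled_matmul(array_view<const float, 2> a,   // M x K
                      array_view<const float, 2> b,   // K x N
                      array_view<float, 2> c)         // M x N, dimensions multiples of TS
    {
        parallel_for_each(
            c.extent.tile<TS, TS>(),
            [=](tiled_index<TS, TS> idx) restrict(amp)
            {
                int row = idx.local[0], col = idx.local[1];
                float sum = 0.0f;
                for (int k = 0; k < a.extent[1]; k += TS)
                {
                    // One element loaded per thread -- hence the TS x TS shape.
                    tile_static float ta[TS][TS];
                    tile_static float tb[TS][TS];
                    ta[row][col] = a(idx.global[0], k + col);
                    tb[row][col] = b(k + row, idx.global[1]);
                    idx.barrier.wait();
                    for (int j = 0; j < TS; ++j)
                        sum += ta[row][j] * tb[j][col];
                    idx.barrier.wait();
                }
                c[idx.global] = sum;
            });
    }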
  • The tutorial I was reading used the same size for the tile as for the tile_static memory, so that threw me off and confused me.

    A related question: why the restriction on the tile size? The total tile size cannot be larger than 1024 threads. Why not just allow all the threads (however many there are) to share this common area of local storage? It seems arbitrary to allow a maximum of only 1024 threads to share a particular segment of local memory (tile_static memory).

    Tuesday, April 9, 2013 4:03 AM
  • Umm, because a particular 32 KB block of tile_static memory is physically tied to a particular compute unit, and there is a mapping from the set of software units of scheduling (tiles) onto compute units which makes it impossible for a tile X that is mapped to CU0 to access the tile_static memory of any other CU {1, ..., N}. This is a property of the execution model / the underlying hardware.

    Now, if the question is why no more than 1024 threads in a tile, note that this is a common limitation for AMP, DirectCompute (unsurprising, since the former layers itself on top of the latter), OpenCL and CUDA. This strongly hints that it is a hardware-imposed limit (although it is possible that modern hardware no longer suffers from it). The software unit of scheduling is mapped onto the hardware as 1 up to M hardware units of scheduling (the famous wavefronts, warps, whatever). E.g. for a 1024-thread tile and an ATI GPU with 64-wide wavefronts, that translates into 16 wavefronts that a single CU needs to track - I'm not sure that was possible with earlier DX11 hardware.

    Tuesday, April 9, 2013 6:47 AM
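
A back-of-the-envelope check on the mapping described above, assuming the 64-wide wavefronts of the AMD/ATI hardware mentioned (NVIDIA warps are 32 wide):

    #include <cstdio>

    int main()
    {
        const int tile_threads    = 1024;  // maximum threads per tile in C++ AMP / DirectCompute
        const int wavefront_width = 64;    // assumed hardware SIMD width (AMD-style wavefront)
        std::printf("A %d-thread tile needs %d wavefronts on one compute unit\n",
                    tile_threads, tile_threads / wavefront_width);   // 1024 / 64 = 16
        return 0;
    }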