What tile size to use when extent is not a multiple of 32 or 64?

  • Thursday, August 16, 2012 2:48 AM
     
     

    Hi,

    I came across a post at http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/26/warp-or-wavefront-of-gpu-threads.aspx. That post recommends that the tile size be a multiple of 32 or 64. But what if I have a 100 x 100 matrix? It cannot be divided evenly into tiles whose size is a multiple of 32 or 64. What tile size would you recommend in this case, and why?

    I used the "Dump statistics to Output Window" button in the GPU Threads window to see what tile size C++ AMP chooses for the simple model. It's 16 x 16 x 1. So my second question is: how does a 100 x 100 extent fit into this tile size?

    Thanks/Allen

All Replies

  • Thursday, August 16, 2012 6:04 PM
     
     Answered

    Hi Allen,

    For the simple model, the C++ AMP compiler/runtime handles the tiling under the hood. Enough tiles are launched to cover the number of threads requested (via the "extent" passed to parallel_for_each). Some threads might go unused, but this is transparent to the user.
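
    As a rough illustration of "enough tiles to cover the extent", the sketch below does the arithmetic in plain C++. It assumes the 16 x 16 tile size reported in the question and a simple round-up of the tile count. The struct Coverage and the function cover_square_extent are hypothetical names, and this is not the runtime's actual code.

```cpp
#include <cassert>

// Plain host-side arithmetic; this is NOT the C++ AMP runtime's actual code.
struct Coverage {
    int tiles_per_dim;     // tiles needed along one dimension
    int launched_threads;  // threads launched across both dimensions
    int unused_threads;    // launched threads that fall outside the extent
};

// Hypothetical helper: covers a square extent_dim x extent_dim extent with
// square tile_dim x tile_dim tiles, rounding the tile count up.
inline Coverage cover_square_extent(int extent_dim, int tile_dim) {
    assert(extent_dim > 0 && tile_dim > 0);
    int tiles = (extent_dim + tile_dim - 1) / tile_dim;  // ceiling division
    int covered_dim = tiles * tile_dim;                  // threads per dimension
    int launched = covered_dim * covered_dim;
    return Coverage{tiles, launched, launched - extent_dim * extent_dim};
}
```

    Under these assumptions, cover_square_extent(100, 16) gives 7 tiles per dimension (49 tiles in total), 12,544 launched threads, and 2,544 threads with nothing to do.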

    If you want to control the tiling explicitly, you can choose the tiled parallel_for_each. As you found, it's recommended to use a tile whose size is a multiple of the warp/wavefront size. Otherwise, some warps/wavefronts will not be fully utilized, and you lose performance. If your original data size cannot be divided evenly by the tile size, we have a blog post that discusses this issue: basically, you can pad or truncate the extent (if you cannot make sure your data is divisible). I think you will see that this is a trade-off that the application developer needs to make. Are you OK with under-utilized warps/wavefronts? (Note that you will have such a warp/wavefront within every tile, so I personally think this should not be preferred.) Are you OK with some threads doing nothing (pad), or some threads doing extra work (truncate)? What's your typical problem size (since the size may decide which approach is more effective)? Such a decision should be driven by experiments and performance measurements. Please read these blog posts.
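
    To make the pad/truncate trade-off concrete, here is a minimal sketch of the arithmetic for one dimension of the extent. It is plain C++ and does not call the C++ AMP API. The helper names padded_dim, truncated_dim, and leftover_dim are hypothetical, and the tile dimension of 16 in the note afterwards is only an example value.

```cpp
#include <cassert>

// Hypothetical helpers showing the arithmetic behind padding and truncating
// one dimension of an extent; they do not call the C++ AMP API.

// Pad: round the extent up to the next multiple of the tile dimension.
// The extra threads must check bounds and do nothing.
inline int padded_dim(int extent_dim, int tile_dim) {
    assert(extent_dim > 0 && tile_dim > 0);
    return ((extent_dim + tile_dim - 1) / tile_dim) * tile_dim;
}

// Truncate: round the extent down to a multiple of the tile dimension.
// The leftover elements must be processed some other way.
inline int truncated_dim(int extent_dim, int tile_dim) {
    assert(extent_dim > 0 && tile_dim > 0);
    return (extent_dim / tile_dim) * tile_dim;
}

// Elements per dimension left uncovered after truncation.
inline int leftover_dim(int extent_dim, int tile_dim) {
    return extent_dim - truncated_dim(extent_dim, tile_dim);
}
```

    With an extent dimension of 100 and a tile dimension of 16, padding yields 112 (12 idle threads per row), while truncating yields 96 and leaves 4 elements per row for the remaining threads or a separate pass to handle.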

    Thanks,

    Weirong

  • Friday, August 17, 2012 1:50 AM
     
     

    Hi Weirong,

    The post you recommended says that the extent must be divisible by the tile size. Since I have a 100 x 100 matrix, a tile size of 10 x 10 meets that requirement. So my question is: what will happen if I choose 10 x 10 as the tile size?

    Thanks/Allen

  • Friday, August 17, 2012 5:15 AM
     
     

    Hi Yonglun,

    If your extent is 100 x 100 and you choose a 10 x 10 tile size, you get 10 x 10 tiles, each with 10 x 10 threads. Correctness-wise, there is no problem. Performance-wise, you need to consider the trade-offs mentioned in my previous response. Also, I assume 100 x 100 is just a number you came up with to ask the question and does not represent the real problem size. Otherwise, 100 x 100 could be too small to benefit from shipping the computation to the GPU.
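
    The performance cost of a 10 x 10 tile can be estimated with a small sketch. It assumes each tile is scheduled as whole warps/wavefronts of 32 or 64 lanes, as the blog post linked in the question describes. The function warp_utilization is a hypothetical name, and real hardware scheduling may differ.

```cpp
#include <cassert>

// Hypothetical helper: fraction of hardware lanes doing useful work when a
// tile of tile_threads threads is split into warps of warp_size lanes.
// Assumes each tile occupies a whole number of warps/wavefronts.
inline double warp_utilization(int tile_threads, int warp_size) {
    assert(tile_threads > 0 && warp_size > 0);
    int warps = (tile_threads + warp_size - 1) / warp_size;  // ceiling division
    return static_cast<double>(tile_threads) / (warps * warp_size);
}
```

    Under that assumption, a 10 x 10 tile (100 threads) occupies 128 lanes with either a warp size of 32 or a wavefront size of 64, so about 78% of the lanes do useful work in every tile. A 16 x 16 tile (256 threads) fills its warps completely.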

    Thanks,

    Weirong


  • Friday, August 17, 2012 9:51 AM
     
     
    Thanks Weirong. And yes, 100 x 100 is just what I came up with to ask how C++ AMP works under the hood when managing tiling.