C++ AMP: (Performance issue) why is every barrier-free program compiled to HLSL with workgroup size == 256, regardless of the values passed to parallel_for_each?
Friday, January 06, 2012 7:41 PM
All DX Compute Shaders produced by C++ AMP / VS11 contain a workgroup declaration of [256, 1, 1].
As a result, every shader's prolog contains a few instructions that fetch the current thread index and compare it at run time against the group limits. The real thread group size is passed to the kernel at run time, which is inefficient from a performance point of view.
This seems to be a simplification in the beta version of the tool. Do Microsoft developers plan to fix this performance issue?
Tuesday, January 10, 2012 12:57 AM
First of all, as you may know, the current release of C++ AMP is built on top of the DirectX platform. Unlike CUDA/OpenCL, in HLSL (the shader language of DirectX) the shape of a thread group has to be specified statically, via the [numthreads(x, y, z)] annotation on the entry function. As a result, C++ AMP inherits this static-compilation requirement. (If we went with a dynamic/runtime compilation strategy it would be different, but dynamic compilation has a high cost and other problems, so we decided not to go that way.) This specific static-compilation requirement might be relaxed in the future, but for now these parameters have to be statically known. Please bear this in mind.
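To make the constraint concrete, here is a hypothetical sketch of the kind of shader the simple (non-tiled) model ends up emitting. The buffer names and constant-buffer layout are illustrative assumptions, not the actual generated code; the point is that the group shape is a compile-time constant while the grid extent arrives at run time, forcing the guard in the prolog:

```
// Group shape must be a compile-time constant in HLSL.
cbuffer GridInfo : register(b0)
{
    uint gridExtent;   // total logical thread count, known only at run time
};

RWStructuredBuffer<float> data : register(u0);

[numthreads(256, 1, 1)]   // the static choice the question is about
void main(uint3 dtid : SV_DispatchThreadID)
{
    // Prolog: surplus threads in the last group do no work.
    if (dtid.x >= gridExtent)
        return;
    data[dtid.x] *= 2.0f;
}
```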
For the tiled version, the user controls and specifies the tile shape (which maps directly to the HLSL group). Note that the tile shape is specified via template parameters and is thus statically known to the compiler.
For the simple version (which is probably what you are talking about here), the user does not work with a tiled model and does not specify anything about the tile; it is up to the C++ AMP compiler to make a choice under the covers. Note that the compiler does not know the "grid" information (the first parameter to parallel_for_each), which is only known at run time, so it has no way to choose the tile shape according to the grid. It must statically choose one tile shape for all cases. As you have seen, [256, 1, 1] was chosen. However, a program should not rely on this in any way: the compiler may make a different (still static) choice due to other factors.

This does introduce the prolog you observed, but to support a variety of grids (even with rank > 3) we could not avoid it. From our experiments, for memory-bound computations the cost of the prolog does not really matter. There can be pathological cases, however; for example, the user requests a 1-D grid of 257 threads at run time. If the compiler chose 256 at compilation time, we have to launch 2 tiles (512 threads), thus wasting 255 threads. In most real-world cases, though, the user launches millions of threads, and the possible waste in the last tile does not really matter either.
Again, if the user wants control over the tile (group) shape, the tiled parallel_for_each should be used.
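For reference, a minimal sketch of the tiled overload looks like the following. It requires Visual Studio's C++ AMP headers (`<amp.h>`), and the kernel body here is illustrative; the key point is that the tile extent is a template parameter and therefore statically known, so no run-time prolog is needed when the extent is an exact multiple of the tile size:

```cpp
#include <amp.h>
using namespace concurrency;

void scale(array_view<float, 1> data)
{
    // tile<64>() fixes the group shape at compile time via the
    // template parameter; 64 here is an arbitrary example choice.
    parallel_for_each(data.extent.tile<64>(),
        [=](tiled_index<64> idx) restrict(amp)
        {
            data[idx.global] *= 2.0f;
        });
}
```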
Tuesday, January 10, 2012 10:44 PM
Thank you for the explanation.
I cannot agree with you:
"the possible waste in the last tile does not really matter either."
Please try to imagine millions of GPUs wasting time and energy... You just aren't helping to save planet Earth ;-) .
I had always assumed that Microsoft developers do their best, but here is an example where you do not care as long as the waste is not too big ;-) . In my opinion you should try to avoid the prolog...
Wednesday, January 11, 2012 7:53 AM (Owner)
I think Weirong has answered your question.
We have two models: the tiled model and the simple model. We aspire to make the simple model smarter under the covers in future releases, but we are comfortable that, for users who want to take matters into their own hands, the tiled model is there to give them more control.
Thank you for your feedback on how to make the simple model better.