C++ AMP (performance issue): why does every structured resource in HLSL always have stride == 4, regardless of the array_view element's size?
Friday, January 06, 2012 7:26 PM
Structured UAVs, structured TGSM, etc. are all defined/created with the same stride of 4 bytes, regardless of the size of the data record (structure) in the C++ AMP program.
This has a big impact on performance, because the number of memory operations becomes 2 (or 3, or 4) times greater than when writing the equivalent DirectCompute HLSL shader in the 'standard' way. DX11 imposes a restriction here: a single read/write operation on a structured resource cannot access more bytes than the resource's stride.
It seems to be a simplification in a beta version of the tool. Are Microsoft's developers going to fix this?
- Edited by OlafOlafT Saturday, January 07, 2012 7:43 AM title improved
Tuesday, January 10, 2012 12:21 AM
Regarding this question: it has already been raised and discussed in this thread: http://social.msdn.microsoft.com/Forums/en-US/parallelcppnative/thread/8011ef24-f4c8-495c-aaeb-bef7e19ca54e
You're welcome to join and continue the discussion there. Meanwhile, if you can provide some code repro/benchmark to demonstrate the big performance impact you mentioned above, it would be very helpful for the discussion.
Also, as mentioned in this blog post, in the coming Beta release the underlying buffer type is going to be ByteAddressBuffer/RWByteAddressBuffer: basically a RAW buffer with a stride of 4 (even though it's called ByteAddressBuffer). So it does not really change the question you asked.
Tuesday, January 10, 2012 10:26 PM
Thank you for the explanation.
I've read the thread; I do not agree with this statement:
'These are the types of tests that we have conducted before making the decision to rely on StructuredBuffer<uint> since, somewhat surprisingly, we were not seeing the performance degradation that you are projecting.'
I have run some samples from your blog, and I'm sure this assumption will not always hold:
'Modern GPUs do have a layer of transparent caching, which allow servicing the access requests for the y,z,w components from the cache, after the respective cache lines have been loaded into the cache as a result of asking for the x components.'
I would guess a GPU driver should handle this case and optimize it according to its GPU's characteristics, but the question is why C++ AMP was not made flexible enough. Why did you assume that a C++ AMP developer who knows DX11 and how GPUs work does not want to write a program with data structures organized in the best (in his/her opinion) way? Maybe he/she wants to experiment, or to research how a given GPU behaves and which version of his/her program is fastest? By implementing this:
'Actually, I meant what I wrote, we map all array<T>, regardless of what T is, to StructuredBuffer<uint>.'
Microsoft just said 'we won't allow pimping your ride' ;-)
'the underlying buffer type is going to be ByteAddressBuffer/RWByteAddressBuffer, basically, a RAW buffer'
means that the performance issue will be fixed in the Beta release.
So finally, I have only one question: what is the current schedule for VS11 releases?
- Edited by OlafOlafT Tuesday, January 10, 2012 11:22 PM
Wednesday, January 11, 2012 7:58 AM (Owner)
Thank you for your interest in C++ AMP.
As per the other thread where this conversation took place, we invite you to submit a repro that clearly demonstrates there is a performance penalty due to our implementation decisions. Other than the theoretical, we have not seen any concrete examples.
When the Visual Studio 11 Beta release is available, we will announce it on our blog, so please stay tuned there: http://blogs.msdn.com/b/nativeconcurrency/rss.aspx
- Marked As Answer by DanielMothMicrosoft Employee, Owner Friday, April 06, 2012 6:58 AM