C++ AMP features
-
Sunday, September 18, 2011 5:26 AM
I've been perusing the C++ AMP docs on msdn. I have a few questions on achieving feature parity with the existing APIs:
1) How do you specify texture cache backing for a device resource? It would naturally fit in one of the many array or array_view ctor overloads, although I don't see it. This is really critical, especially for any app with random access to global memory. In D3D11 you'd just use a Buffer to get texture sampling backing, or a StructuredBuffer for unbuffered reads. This was a decent setup.
2) How do you query for warp/wavefront size. Even more importantly, is shared memory considered volatile (in the CL/CUDA sense) by default, so you can perform intra-warp communication without explicit synchronization? Can you add the volatile qualifier? Not having this will devastate parallel scan performance.
3) How do you load a 2D or 3D array with swizzle. This would be like cuArrayCreate versus cuMemAlloc in CUDA, or iirc a ID3D11Device::CreateBuffer without D3D11_BIND_RENDER_TARGET, to allow for tiled textures. Obviously this is needed for spatial coherency when leveraging the texture cache for 2D textures.
4) How do you retrieve basic device info like the number of SMs/CUs, amount of shared memory (per SM, not available per block, which I will accept as 32k per the D3D11 requirements), max threads per SM, max blocks per SM, etc. None of these are defined by D3D11, so the programmer should be able to query to optimize for occupancy.
5) Where are samplers provided? The best thing about D3D11 Compute was its exposing full sampler capabilities. CUDA and CL currently only support linear samplers in 3D. There are also methods needed for setting texture sampler normalization (i.e. reference within the unit square, or reference in pixel space) and to read integers as integers rather than promoting to normalized floats.
6) Are there intrinsics to access instructions that aren't in the baseline D3D spec (prmt, ballot, etc - bfi/bfe and some others are in the D3D IL even though they aren't available as intrinsics.. it would also be good to make these available for cases when the compiler fails to generate them). There are going to be more intra-warp "horizontal" instructions in upcoming devices, and even the few already available are critical in some situations.
7) Does the API support page-locked memory (a la cuMemHostAlloc)? I believe some of this is supported in D3D11, eg D3D11_USAGE_STAGING uses write-combining.
Thanks, sean
All Replies
-
Tuesday, September 20, 2011 12:46 AM
Thank you for your detailed question, keep them coming!
With respect to #1, #3, #5: the preview release of C++ AMP doesn’t offer any support for textures, samplers, (non-structured) buffers and swizzling formats. We are listening to customer feedback and will try to address some of it in the beta timeframe. If you can provide more detailed feedback on your scenario, that would be helpful. But it should be kept in mind that C++ AMP doesn’t strive, at least not in this release, to supplant Direct3D as a graphics development platform. Initially, we are going after the “pure compute” subset of functionality, with provisions for graphics-related interop where it makes the most sense for graphics-heavy workloads (and again, in that context, would love to hear about specific scenarios you might be interested in).
Also with respect to #1, arrays and array_views are backed by an HLSL StructuredBuffer<uint>. There is no flexibility offered with respect to the underlying type, as we need to generate code to a specific underlying HLSL type. Had we wanted to offer more flexibility to target other types of HLSL storage, that would have needed to be captured as template parameters of array and array_view. It is likely that if and when we’ll add support for Textures and Buffers, it will come in the form of C++ classes separate from concurrency::array and concurrency::array_view.
However, note that Direct3D doesn’t tell the hardware vendor how to use their caching resources. So conceivably they could (and some would) utilize caching resources to cache a StructuredBuffer as much as they would a Buffer or a Texture. Also note that in DX11, sampling functionality is only provided on Texture, not on Buffer.
With respect to #4, C++ AMP is built on top of Direct3D which, as you correctly said, doesn’t expose a rich set of hardware configuration and topology properties that the developer can query. Part of it is due to the fact that Direct3D devices could have radically different architectures, so even agreeing on a common terminology could be challenging. Even something apparently as simple as “dedicated memory” and “hardware threads” is open to widely differing interpretations, and hence Direct3D refrained from defining those controversial hardware properties. It is very unlikely we’ll offer anything in this space which isn’t already offered by Direct3D. We recommend a “performance sampling” approach instead: run your workload with smaller inputs, or during deployment of your application, or during start-up for a long-running application, to try and gauge the performance you’re getting for your specific workload, or adapt to the performance you are observing as you go. Such adaptive schemes will likely yield the most realistic results, and they have the benefit of avoiding the codifying of hardware-dependent heuristics into your code, which may get invalidated sooner than you’ll have the chance to update them.
With respect to coding to a specific warp/wavefront width (#2), that too isn’t part of the Direct3D model, and therefore you cannot query the system for this value. However, if you knew what hardware you were running on, you could program, non-portably of course, to take advantage of this hardware characteristic. Unlike CUDA or OpenCL which, as you note, by default treat group shared memory as volatile, Direct3D and C++ AMP do not treat any piece of memory as volatile by default. So while CUDA implicitly places a fence between any two instructions and lets warps execute them in a lock-step fashion, C++ AMP will require you to put explicit fences between statements that you want to ensure are executed as spelled out literally in your code. Under many conditions, the graphics driver will be able to elide such fences from the code, as they are mostly just an indication that memory accesses shouldn’t be reordered, combined or eliminated around the barrier, so you won’t pay an extra cost for putting them in the code.
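The “performance sampling” idea can be sketched in plain C++. This is only an illustration under assumptions: `pick_fastest` and `time_seconds` are hypothetical helper names, not part of C++ AMP, and the candidate values (for example tile sizes) are whatever your application wants to choose between.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical helper (not a C++ AMP API): returns the candidate whose
// measured cost is lowest. `measure` is any callable that runs the workload
// with the given candidate setting and returns a cost, for example the
// elapsed seconds on a reduced input.
template <typename Candidate, typename Measure>
Candidate pick_fastest(const std::vector<Candidate>& candidates, Measure measure) {
    assert(!candidates.empty());
    Candidate best = candidates.front();
    double best_cost = std::numeric_limits<double>::infinity();
    for (const Candidate& c : candidates) {
        const double cost = measure(c);
        if (cost < best_cost) {
            best_cost = cost;
            best = c;
        }
    }
    return best;
}

// Hypothetical timing wrapper: wall-clock seconds taken by one run of `work`.
inline double time_seconds(const std::function<void()>& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

At start-up one might call `pick_fastest(tile_sizes, [&](int t) { return time_seconds([&] { run_small_input(t); }); })`, where `run_small_input` stands for your own kernel launch on a reduced data set, and then reuse the winner for the real run.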
#6: C++ AMP currently doesn’t provide any direct access to Direct3D byte-codes which isn’t already available through HLSL. This is another area where we’d like to get prioritized feedback on what you’d like to have available to you.
#7: The preview release of C++ AMP does offer support for staging arrays, i.e., creating underlying StructuredBuffer resources with the D3D11_USAGE_STAGING flag: check out the array constructor which takes two accelerator_view arguments. If you pass the “CPU accelerator” as the first argument and an accelerator corresponding to a GPU as the second argument, you will receive an array that is optimized for repeated copying between the two, and is accessible on the CPU while it isn’t consumed by a parallel_for_each() call. However, note that in general, Direct3D doesn’t guarantee that these resources will be pinned and directly mapped to the hardware. Whether that happens or not is an implementation detail of the Windows graphics kernel and of the graphics driver.
Thanks,
--Yossi
Yossi Levanoni, Principal Development Lead, Parallel Computing Platform, Microsoft
- Proposed As Answer by DanielMoth (Microsoft Employee, Owner) Wednesday, September 21, 2011 3:07 AM
- Marked As Answer by DanielMoth (Microsoft Employee, Owner) Friday, April 06, 2012 7:08 AM
-
Wednesday, September 21, 2011 8:39 PM
I noticed you said the arrays are backed by StructuredBuffer<uint>. Surely you mean StructuredBuffer<T> where T is uint, uint2, or uint4? De-interleaving all arrays into uints would make for pretty poor global load performance, especially when there is no cache.
sean
-
Wednesday, September 21, 2011 8:55 PM
Actually, I meant what I wrote, we map all array<T>, regardless of what T is, to StructuredBuffer<uint>.
Initially we thought that this would be a tough tradeoff to make. On the one hand, this implies the hardware will not be able to apply struct-of-arrays transformation (interleaving), as you write. On the other hand, we want to support a C++ dense array notion, including low-level pointer access to individual fields within T, and guarantees of contiguous layout, which point towards a more regular and “primitive” mapping of memory.
What made the decision to go with StructuredBuffer<uint> easier, was the observation that current hardware doesn’t, actually, take advantage of the ability to interleave structs in a structured buffer. (I’m not saying they never apply the transformation, or that they have adopted this as a policy going forward, only that the testing that we did, revealed that this was an opportunity that the drivers chose not to take advantage of.)
As I mentioned in my previous response, it is likely that in order to expose the benefits of struct-of-arrays transformations you’d have to use separate classes. If and when we introduce such classes, we’d ask you to use types specific for that purpose, such as texture<T> and/or buffer<T>. array<T>, on the other hand, provides lower-level guarantees of layout.
Yossi Levanoni, Principal Architect, Parallel Computing Platform, Microsoft
- Marked As Answer by DanielMoth (Microsoft Employee, Owner) Friday, April 06, 2012 7:09 AM
-
Wednesday, September 21, 2011 9:46 PM
Correct me if I'm misunderstanding - array<uint4> will be internally deinterleaved into four distinct arrays (or at least distinct intervals within the same array) of StructuredBuffer<uint>?
Also, the hardware does take advantage of AoS with structures up to 16 bytes. On Fermi, as you know, when code makes 4-byte load/store requests, a memory transaction is issued for every segment addressed by the warp. For 8-byte requests, it's a transaction for every segment addressed by the half-warp, and for 16-byte requests, a transaction for each segment addressed by the quarter-warp. ATI works identically, just substitute "half-wavefront" for warp. Structures larger than 16 bytes should of course not be used, because then there is no best-case scenario for addressing them.
If AMP stores xyzw structs as StructuredBuffer<uint>, worst-case random access will result in 128 transactions per warp, to retrieve the entire struct (32 transactions per component). With StructuredBuffer<uint4>, the same access pattern results in at most 32 transactions (8 transactions per quarter-warp). Since there is no texture cache, how can random access to larger structs possibly be serviced?
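The worst-case counts above can be reproduced with a few lines of C++. This is a simplified model built only from the description in this post (32-thread warp, requests of 4, 8 or 16 bytes served per warp, half-warp or quarter-warp, and every thread hitting a different segment); `worst_case_transactions` is a hypothetical name, not a vendor API.

```cpp
#include <cassert>

// Simplified model: worst-case memory transactions per warp when every
// thread touches a different segment. A struct of `struct_bytes` is read
// with loads of `request_bytes` each (4, 8 or 16).
inline int worst_case_transactions(int struct_bytes, int request_bytes,
                                   int warp_size = 32) {
    assert(request_bytes == 4 || request_bytes == 8 || request_bytes == 16);
    assert(struct_bytes % request_bytes == 0);
    // 4-byte requests are served per warp, 8-byte per half-warp,
    // 16-byte per quarter-warp.
    const int groups = request_bytes / 4;
    const int threads_per_group = warp_size / groups;
    // Worst case: one transaction per thread in every group.
    const int per_request = groups * threads_per_group;
    // Number of separate load instructions needed to fetch the whole struct.
    const int requests = struct_bytes / request_bytes;
    return per_request * requests;
}
```

Under this model, `worst_case_transactions(16, 4)` gives the 128 transactions quoted for StructuredBuffer<uint>, and `worst_case_transactions(16, 16)` gives the 32 quoted for StructuredBuffer<uint4>.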
The other thing that springs to mind is the bound UAV limit:
#define D3D11_PS_CS_UAV_REGISTER_COUNT ( 8 )
I take it that AMP kernels can write to more than 8 32-bit components. Is the runtime doing its own memory management on top of D3D to pack multiple array<>s into structured buffers, and generating the additional indexing when accessing the packed arrays in [RW]StructuredBuffers? What kind of additional memory overhead does this add?
sean
-
Wednesday, September 21, 2011 10:25 PM
No, we do not de-interleave automatically, the memory layout will be identical to that of a C array of int4. Again, array and array_view are designed with compatibility and interoperation with C++ CPU code in mind, up to the point that you could share memory between the CPU and GPU (when that becomes an option).
The hardware does take advantage of coalesced accesses and thus using structs as the value types of arrays has the potential for the deficiencies that you describe.
My point was that on the hardware platforms we have tested, accessing the y component of a StructuredBuffer<int4> at position i wasn’t any more efficient than accessing a StructuredBuffer<uint> at position 4*i plus the offset of y (which is 4 bytes, i.e. one uint element). In other words, the hardware drivers do not seem to take advantage of the opportunity to transparently de-interleave, at least when it comes to StructuredBuffers. If you have concrete and reproducible evidence to the contrary, I would love to see it (really love to see it).
Our guideline to developers would be to use texture<T> and/or buffer<T> if and when they become available, or apply the struct-of-arrays transformation manually. Which brings me to your last question. We currently don’t do any of that automatically. With respect to what one could do manually: first, we have increased the number of UAVs from 8 to 64, so you could de-interleave into separate arrays. I don’t think this imposes significant memory consumption overheads, and on the other hand, it gives the system the ability to swap specific components in and out as they are used (i.e., in a more granular fashion). Second, you could do such packing into a single array, as you describe, manually. I don’t think that this will result in any additional memory overhead, but it would mean more index math, possibly more constants to pass along, and more register pressure in the kernel, so it’s not for free.
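To make the layout claim concrete, here is a small host-side C++ sketch. It assumes a plain `int4` struct of four 32-bit integers as a stand-in for the element type (a hypothetical definition for illustration, not the C++ AMP or HLSL type) and reads the y component through a flat 32-bit view of the same memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a 16-byte element type with four 32-bit fields.
struct int4 {
    std::int32_t x, y, z, w;
};
static_assert(sizeof(int4) == 16, "int4 is expected to be densely packed");
static_assert(offsetof(int4, y) == 4, "y is expected at byte offset 4");

// Reads component y of element i through a flat view of 32-bit words,
// which is how a uint-typed structured buffer would see the same bytes.
inline std::uint32_t load_y_flat(const std::vector<int4>& data, std::size_t i) {
    assert(i < data.size());
    // Word index: 4 words per element, plus the offset of y in words (1).
    const std::size_t word = 4 * i + offsetof(int4, y) / sizeof(std::uint32_t);
    const unsigned char* bytes =
        reinterpret_cast<const unsigned char*>(data.data());
    std::uint32_t value = 0;
    std::memcpy(&value, bytes + word * sizeof(std::uint32_t), sizeof value);
    return value;
}
```

The flat word index for y of element i is 4*i + 1, which is the same address as byte offset 16*i + 4.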
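Both manual options can be sketched on the host side in plain C++. The names `float4`, `Float4SoA`, `deinterleave`, `pack_planar` and `packed_index` are hypothetical and chosen only for this illustration; in a real program each resulting vector would back its own array or array_view.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical interleaved element type (array-of-structs layout).
struct float4 {
    float x, y, z, w;
};

// Option 1: one separate array per component (struct-of-arrays).
struct Float4SoA {
    std::vector<float> x, y, z, w;
};

inline Float4SoA deinterleave(const std::vector<float4>& aos) {
    Float4SoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    soa.w.reserve(aos.size());
    for (const float4& v : aos) {
        soa.x.push_back(v.x);
        soa.y.push_back(v.y);
        soa.z.push_back(v.z);
        soa.w.push_back(v.w);
    }
    return soa;
}

// Option 2: a single packed array in which component c of all n elements
// occupies the range [c * n, (c + 1) * n). This is the extra index math
// (and the extra constant n) the kernel would have to carry.
inline std::size_t packed_index(std::size_t component, std::size_t i,
                                std::size_t n) {
    assert(component < 4 && i < n);
    return component * n + i;
}

inline std::vector<float> pack_planar(const std::vector<float4>& aos) {
    const std::size_t n = aos.size();
    std::vector<float> packed(4 * n);
    for (std::size_t i = 0; i < n; ++i) {
        packed[packed_index(0, i, n)] = aos[i].x;
        packed[packed_index(1, i, n)] = aos[i].y;
        packed[packed_index(2, i, n)] = aos[i].z;
        packed[packed_index(3, i, n)] = aos[i].w;
    }
    return packed;
}
```

Neither option changes the total number of floats stored; the second trades the extra bound arrays of the first for per-access index arithmetic.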
Yossi Levanoni, Principal Architect, Parallel Computing Platform, Microsoft
- Edited by Yossi Levanoni Wednesday, September 21, 2011 10:28 PM
- Marked As Answer by DanielMoth (Microsoft Employee, Owner) Friday, April 06, 2012 7:09 AM
-
Thursday, September 22, 2011 12:24 AM
Thanks for your help. I understand much better now. This is my last question :)
Let's say you have a float4 array and want to run a kernel to return the lengths of each vector. It's obviously memory bottlenecked. The data gets copied to the device without de-interleaving, and is accessed through a StructuredBuffer<uint> (basically a pointer typecast).
So if you have something like:
array_view<const float4, 1> source(vectors);
array_view<float, 1> dest(lengths);

in the parallel_for_each:

int tid = idx[0];
float4 v = source[tid];
float len = sqrt(v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w);
dest[tid] = len;

Does the compute shader use 32-bit loads on each component successively? If this is the case, each component fetch would require four load transactions, as each component for a single warp is spread over 4 segments. Do this for each component and you have 16 load transactions, where only 4 are required... The entire operation would require 17 transactions per warp (1 for the store) compared to the 5 that would be required if StructuredBuffer<uint4> was used. The thing would run 3.4x slower than it should. Maybe the vendor's D3D driver can optimize this out, but it's not clear how the compiler would reliably do that from the IL I have in my head.
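The 17-versus-5 count can be checked with a small segment-counting model in C++. It assumes 128-byte memory segments, a 32-thread warp, a contiguous float4 source array starting at a segment boundary, and no cache; those are assumptions of this sketch, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts the distinct memory segments touched when `thread_count`
// consecutive threads, starting at `first_thread`, each read
// `access_bytes` at byte address thread * stride_bytes + offset_bytes.
inline std::size_t segments_touched(std::size_t first_thread,
                                    std::size_t thread_count,
                                    std::size_t stride_bytes,
                                    std::size_t offset_bytes,
                                    std::size_t access_bytes,
                                    std::size_t segment_bytes = 128) {
    assert(access_bytes > 0 && segment_bytes > 0);
    std::set<std::size_t> segments;
    for (std::size_t t = first_thread; t < first_thread + thread_count; ++t) {
        const std::size_t begin = t * stride_bytes + offset_bytes;
        const std::size_t end = begin + access_bytes - 1;
        for (std::size_t s = begin / segment_bytes; s <= end / segment_bytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}

// Four 4-byte loads per thread (x, y, z, w), each served for the whole
// warp, plus one 4-byte store per thread into a dense float array.
inline std::size_t transactions_component_loads() {
    std::size_t total = 0;
    for (std::size_t c = 0; c < 4; ++c) {
        total += segments_touched(0, 32, 16, 4 * c, 4);
    }
    return total + segments_touched(0, 32, 4, 0, 4);
}

// One 16-byte load per thread, served per quarter-warp (8 threads),
// plus the same 4-byte store.
inline std::size_t transactions_vector_loads() {
    std::size_t total = 0;
    for (std::size_t q = 0; q < 4; ++q) {
        total += segments_touched(8 * q, 8, 16, 0, 16);
    }
    return total + segments_touched(0, 32, 4, 0, 4);
}
```

Under these assumptions the two functions return 17 and 5, a ratio of 3.4. A transparent cache between the loads would change the picture, which this model deliberately ignores.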
I may be missing something obvious.. Of course I agree that manually doing pointer arithmetic on uint* will result in identical memory performance to grabbing individual components within a StructuredBuffer<uint4>. But isn't the uint4 buffer actually more flexible, in that ld_structured_indexable (and store_structured) can use the dest mask and the struct member offset in the 3rd param (l(0), l(4) etc) to support grabbing individual structure members as well as entire structures? In the code I pasted, a StructuredBuffer<uint4> would grab v in a single instruction - 16 bytes per lane per cycle in a quarter-warp, and you'd achieve peak bandwidth.
I'm still under the impression that AMP only performs efficiently when the user limits himself to array<float> or array<int>, even when he'd want to grab entire 8- or 16-byte structs.
thanks,
sean
-
Friday, September 23, 2011 12:11 AM
These are the types of tests that we have conducted before making the decision to rely on StructuredBuffer<uint> since, somewhat surprisingly, we were not seeing the performance degradation that you are projecting. We think that the reasons for that are:
1) Modern GPUs do have a layer of transparent caching, which allow servicing the access requests for the y,z,w components from the cache, after the respective cache lines have been loaded into the cache as a result of asking for the x components.
2) We also believe there is some level of forward-prefetching going on.
At any rate, I’ll try this particular test (in HLSL) and will make an effort to report back to the forum on what I find.
Thank you for the thoughtful questions and comments.
Yossi Levanoni, Principal Architect, Parallel Computing Platform, Microsoft
- Proposed As Answer by DanielMoth (Microsoft Employee, Owner) Saturday, September 24, 2011 12:44 AM