Wednesday, March 21, 2012 8:40 PM
I've been experimenting with C++ AMP and OpenCL on an NVIDIA GTX 460 and have noticed that memory transfer speeds for C++ AMP seem a lot lower than in my OpenCL test case when transferring 4 MB of data:
C++ AMP: host-to-device ~3 GB/sec, device-to-host is ~1 GB/sec
OpenCL: host-to-device ~5.5 GB/sec, device-to-host is ~6 GB/sec
I'm using the Concurrency Visualizer for the C++ AMP test and NVIDIA's Visual Profiler for the OpenCL test. If I use unpinned memory in the OpenCL test, the transfer rate slows to about 4 GB/sec in both directions.
NVIDIA's Visual Profiler actually shows the transfer rate, while I have to calculate the transfer rate in the Concurrency Visualizer. Is there overhead that isn't being shown in the Concurrency Visualizer, or are the rates actually different?
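The manual calculation mentioned above amounts to dividing the bytes transferred by the elapsed time read off the timeline. A minimal sketch, assuming 1 GB = 10^9 bytes (the convention used by either profiler is not stated in this thread); `transfer_rate_gb_per_sec` is a hypothetical helper name, not part of either tool:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: converts a byte count and an elapsed time in seconds
// into a transfer rate in GB/sec, assuming 1 GB = 1e9 bytes.
double transfer_rate_gb_per_sec(std::size_t bytes, double seconds)
{
    assert(seconds > 0.0);  // a zero or negative duration has no meaningful rate
    return static_cast<double>(bytes) / seconds / 1e9;
}
```

As an illustration only (not a measured figure), a 4,000,000-byte copy that takes 1 ms works out to 4 GB/sec under this convention. If the tool reporting the rate counts 1 GB as 2^30 bytes instead, the same copy would read about 7% lower.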
Wednesday, March 21, 2012 10:59 PM
Thanks! We have done some work on improving copy performance and have seen great improvement. It was not included in the Beta, but it will be available in the final product. Meanwhile, you can use a C++ AMP staging array, which should deliver better copy performance; please see http://blogs.msdn.com/b/nativeconcurrency/archive/2011/11/10/staging-arrays-in-c-amp.aspx.
For the measurement, another thing to pay attention to is the OS page-fault/zeroing-out cost. With C++ AMP, we pay it lazily at first touch. Others may pay it eagerly at creation time. So it would be better to warm up the data before measuring the copy performance, to exclude the zeroing-out cost.
    std::vector<int> vec(size);
    array<int> arr(size);
    // use it for a copy or parallel_for_each, so it's touched

    // measure the copy perf
    start_timer();
    copy(vec.begin(), vec.end(), arr);
    stop_timer();
This way, you focus on the copy perf without the zeroing-out cost. The same strategy should probably be used for the OpenCL measurement. Between start_timer and stop_timer, you can also consider repeating the copy function multiple times and then taking an average.
This blog post gives more detail on C++ AMP time measurement: http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/28/how-to-measure-the-performance-of-c-amp-algorithms.aspx. You can use the accelerator_view::wait() method to eliminate any asynchrony happening in the implementation.
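The measurement advice above (warm up first, repeat the copy, synchronize before reading the clock) can be sketched in portable C++ as follows. This is only an illustration under stated assumptions: `average_copy_seconds` is a hypothetical helper, and `copy_fn` and `sync_fn` are placeholders that the caller supplies. In a C++ AMP test, `copy_fn` would presumably wrap the copy(...) call and `sync_fn` would call accelerator_view::wait().

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: returns the average wall-clock time in seconds of one
// call to copy_fn, measured over `iterations` back-to-back calls.
template <typename CopyFn, typename SyncFn>
double average_copy_seconds(CopyFn copy_fn, SyncFn sync_fn, int iterations)
{
    assert(iterations > 0);

    // Warm-up: the first copy pays any lazy page-fault/zeroing-out cost,
    // so it is run once outside the timed region.
    copy_fn();
    sync_fn();  // make sure the warm-up has finished before the timer starts

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        copy_fn();
    }
    sync_fn();  // wait for outstanding work before stopping the timer
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Dividing the number of bytes copied per call by the returned value gives the transfer rate. The same harness could wrap the OpenCL transfer so that both measurements exclude first-touch costs in the same way.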
Thursday, March 22, 2012 3:44 PM
Thank you for the reply. Great blog posts.
Using a staging array I was able to get almost 5 GB/sec. Were there performance improvements made to staging arrays?
Thursday, March 22, 2012 5:28 PM
Thanks for trying out the staging array. The improvement I talked about applies to the common copy between a host container (non-staging array) and a device array. Meanwhile, we will try to work with IHV partners on improvements to staging array copies.
- Edited by Zhu, Weirong Thursday, March 22, 2012 5:30 PM