I don't see any data copy performance improvements since Beta

    Question

  • Hi,

    I have a hobby project where I'm trying to implement neural-network-based computations using GPGPU programming. I have an OpenCL layer that performs really well, but I also want to implement a C++ AMP based code path. I have to say, I like your framework: it is elegant, simple to use, and simple to debug.

    When implementing neural computations there is a lot of data to process and a lot of results to evaluate, and the evaluated results have to be visualized often, so that part cannot be done on the GPU. The computations are quite simple and highly parallelizable; my most advanced algorithm currently running on the GPU is real-time recurrent backpropagation, which can be done by simple microkernels working together.

    The bottleneck is data copy performance. There is a lot of live training data that has to be transferred to GPU memory, and the results have to be transferred back for training-result visualization and for the metaoptimization algorithms used for feature vector selection.

    I have to say that C++ AMP's data copy performance is 40% lower than OpenCL's when I do the exact same thing on both platforms (as far as I can tell).

    Yes, I've been using staging arrays, and I did a warmup run first.

    I created a little project to try out various GPGPU programming concepts. You can download it from SkyDrive:

    http://sdrv.ms/Lhzcl7

    The current build measures the data copy performance of the two platforms, as you can see. I'm using direct vector copying, but the performance is the same when using a staging array's array_view.

    The result is: C++ AMP is much slower than OpenCL.

    My system is:

    Windows 8
    Radeon 6870 1G
    AMD Phenom II X4
    4 GB DDR3

    P.S.: Yes, I know that using directives in headers are evil, but c'mon, this is a little test project. ;)








    Monday, June 11, 2012 8:32 AM

Answers

  • Hi unbornchikken

    In the RC we have indeed made performance improvements to both kernel execution and to copying data without explicitly using staging arrays. These optimizations do not apply to every single piece of C++ AMP code, only to certain scenarios. For your scenario, the use of staging arrays is recommended for optimal performance. While you may not observe notable differences for small problem sizes, they can make a significant difference for larger ones.

    Having said that, you are right that for your specific scenario, with such small data sizes, the copy performance is not as good as you would like even with staging arrays, when compared with OpenCL. The reason is that your size of 400 is too small to amortize the current DirectX kernel dispatch overhead (which is the main performance degradation factor here). I understand that increasing the data size is not an option for you, but for reference, if your size were 40000 you’d see comparable performance (and C++ AMP would be even faster for larger data sizes still).
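    The amortization effect described above can be illustrated with a toy back-of-the-envelope model (a hedged sketch: the fixed-overhead and bandwidth constants are invented placeholders, not measurements of DirectX, C++ AMP, or any driver):

```cpp
#include <cassert>

// Toy model of a host<->GPU copy: total time is a fixed setup/dispatch
// overhead plus the bytes divided by the link bandwidth. The constants
// are illustrative placeholders, not measured values.
double effective_bandwidth(double bytes,
                           double overhead_us = 20.0,   // fixed per-copy cost (us)
                           double bandwidth_gbps = 6.0) // PCIe-ish peak (GB/s)
{
    double transfer_us = bytes / (bandwidth_gbps * 1e3); // GB/s -> bytes/us
    return bytes / (overhead_us + transfer_us);          // bytes per microsecond
}

// effective_bandwidth(400)    -> ~20 bytes/us   (the fixed overhead dominates)
// effective_bandwidth(40000)  -> ~1500 bytes/us
// effective_bandwidth(400000) -> ~4600 bytes/us (approaching the 6000 peak)
```

    Whatever the real constants are, the shape is the same: below some size the fixed cost dominates and the copy rate collapses, which is why moving many 400-element transfers into one large one pays off.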

    If you’d like us to offer input on how you could re-express your design to involve fewer kernel launches over larger data, please share more of your design so we can look into it, or even contact me offline. Other than that, this is one case we cannot improve further – sorry.

    Aside: for scenarios like this, where you repeatedly access the array_view's data on the CPU host side, you can obtain a CPU pointer to the data through the array_view::data() function and then access the data through that raw pointer inside your loop for increased efficiency.

           // Validate results through a raw CPU pointer instead of array_view::operator[]
           float *pOutPtr = outputView.data();
           for (unsigned i = 0; i < copySize; i++)
           {
              if (pOutPtr[i] != testValue) throw logic_error("Output value is not what was expected.");
           }

    This is a performance tip, but not one that will make a difference for this specific scenario with such a small data size. If your data size were 400000 and you used this technique, you’d see C++ AMP perform twice as fast as the OpenCL variant.

    Cheers
    Daniel


    http://www.danielmoth.com/Blog/

    • Proposed as answer by Zhu, Weirong Saturday, June 16, 2012 7:07 PM
    • Marked as answer by unbornchikken Monday, June 18, 2012 8:23 AM
    Friday, June 15, 2012 10:17 PM
    Owner

All replies

  • Thanks for using C++ AMP and sharing your feedback.

    I looked at the code you shared and have a couple of observations which I believe are largely responsible for the performance behavior you are experiencing:

    a) The code currently copies from the std::vector to a staging array, followed by a copy from the staging array to a device array. This is not the optimal use of staging arrays. Staging arrays are meant to be used as the host container itself, replacing the std::vector, to elide the extra copy from the vector to the staging array. When copying from a std::vector to a device array, the runtime already does what your code attempts to do (with some added performance optimizations), and hence you would not observe much difference between what your code currently does and directly copying from the vector to the device array/array_view. I would encourage you to read our blog post on staging arrays if you haven't already had a chance to do so.

    b) The amount of data copied in each iteration is very small - 400 bytes. Is this representative of the real-world data sizes for your problem? Copying such small amounts of data is inefficient, as the ratio of cycles spent setting up and scheduling the transfer to cycles spent transferring the bits across the PCIe bus is quite high. I would recommend testing with larger problem sizes, or devising alternate algorithms/techniques that batch the processing of such small data sets on the GPU.
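    One generic way to do such batching (a sketch in plain C++; the Batch type and its layout are invented for illustration and are not part of C++ AMP or OpenCL) is to pack the many small per-iteration vectors into one contiguous host buffer so that a single bulk copy replaces many tiny ones:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pack many small, equally sized chunks into one contiguous buffer so a
// single host<->device transfer can replace many tiny ones. The type and
// names here are illustrative only; C++ AMP imposes no such scheme.
struct Batch
{
    std::size_t chunkSize;      // elements per chunk
    std::vector<float> storage; // chunkCount * chunkSize elements, contiguous

    Batch(std::size_t chunkCount, std::size_t size)
        : chunkSize(size), storage(chunkCount * size) {}

    // Writable view of the i-th chunk inside the shared buffer.
    float* chunk(std::size_t i) { return &storage[i * chunkSize]; }

    // A real implementation would issue the one bulk transfer here, e.g.
    // concurrency::copy(storage.begin(), storage.end(), deviceArray).
};
```

    The trade-off is latency: results for an individual chunk are only available once the whole batch round-trips, so this fits only if the visualization/metaoptimization step can tolerate consuming results a batch at a time.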

    Our copy benchmarks show C++ AMP copy performance to be equivalent to OpenCL for large data sizes. We would love to hear about your findings after applying the changes suggested above.

    Also, here are a couple of blog posts on C++ AMP performance measurement that you may find relevant:

    http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/28/how-to-measure-the-performance-of-c-amp-algorithms.aspx

    http://blogs.msdn.com/b/nativeconcurrency/archive/2012/04/25/data-warm-up-when-measuring-performance-with-c-amp.aspx

    - Amit


    Amit K Agarwal


    Tuesday, June 12, 2012 4:52 PM
    Owner
  • Hi,

    a) As I said: "I'm using direct vector copying, but the performance is the same when using a staging array's array_view." I've uploaded another version to the SkyDrive folder (gpuconcept2.7z) that uses a staging array via array_view. Same performance, exactly. It seems it doesn't matter which method I use: copying or array_view.

    b) I'm implementing neural network computations, and I have to transfer each iteration's result from the GPU to the host because it is needed for visualization and for the metaoptimization algorithms. It is often only a few bytes. But - as I said - I have an OpenCL implementation that works perfectly even with tiny data block transfers.

    If you look at my provided code carefully, you can see that the OpenCL and C++ AMP implementations do the same thing. They copy the same memory block sizes, in the same order, to the same GPU, but the OpenCL version is nearly 40% faster.

    Maybe I still don't get it, but I've done everything you advised and everything in the linked articles, and there is still no performance gain.


    Tuesday, June 12, 2012 6:33 PM
  • Clear. Thanks. 
    Monday, June 18, 2012 8:27 AM
  • C++ AMP looks amazing on the surface, but its performance on smaller data sets worries me, and to a lesser extent so does its current lock-in to DirectX. Is it feasible to use a combination of OpenCL's copy mechanism and C++ AMP's parallelization to get the benefits of both? (After all, the data essentially ends up in the same place.)
    Sunday, October 21, 2012 1:17 PM
  • Thanks for the feedback. We are actively working on improving the performance of C++ AMP for smaller problem sizes.

    Regarding your question about the feasibility of combining OpenCL copy with C++ AMP: yes, it is possible. OpenCL 1.2 introduced an optional extension, "cl_khr_d3d11_sharing", which provides APIs that enable sharing of resources between OpenCL and Direct3D 11. You should be able to obtain OpenCL handles to the Direct3D 11 resources underlying C++ AMP resources (the Direct3D 11 resources corresponding to AMP objects can be obtained through the C++ AMP Direct3D interop APIs) and use OpenCL APIs to manipulate the contents of those resources. We haven’t tried this ourselves, but would love to hear about your experience if you try this approach. A couple of caveats to look out for:

    1. The extension is optional, and your code would have to account for its unavailability at runtime.
    2. C++ AMP performs some optimizations for large data transfers, which in our experience cause C++ AMP to outperform OpenCL's copy performance for large data sets. You may want to account for this when choosing a mixed OpenCL + C++ AMP approach.
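    If it helps, such an interop path might follow the sequence outlined below (pseudocode only – untested by us; the variable names are invented, and the error handling and fallback logic are merely sketched):

```
// Hedged pseudocode outline - not compiled or tested.

// 1. Get the D3D11 buffer behind a C++ AMP array via the AMP interop API:
//      IUnknown* unk = concurrency::direct3d::get_buffer(ampArray);
//    then QueryInterface it for ID3D11Buffer.

// 2. At startup, query the OpenCL platform/device extension string for
//    "cl_khr_d3d11_sharing"; if it is absent, fall back to plain C++ AMP copies.

// 3. Wrap the D3D11 buffer in an OpenCL memory object:
//      cl_mem buf = clCreateFromD3D11BufferKHR(ctx, CL_MEM_READ_WRITE, d3dBuffer, &err);

// 4. Bracket every OpenCL access with
//      clEnqueueAcquireD3D11ObjectsKHR(...) / clEnqueueReleaseD3D11ObjectsKHR(...),
//    using clEnqueueWriteBuffer / clEnqueueReadBuffer in between as usual.
```

    Note that, as with any OpenCL extension, the KHR entry points have to be loaded at runtime via clGetExtensionFunctionAddressForPlatform.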

    -Amit


    Amit K Agarwal


    Thursday, October 25, 2012 9:02 PM
    Owner