C++ AMP Performance Jump Based on Transfer Size

  • I would like to start a discussion on the following:

    I have observed that memory transfer size significantly affects performance. For example, when transferring floating-point values, there is a definite performance jump when transferring 1048545 or more floating-point values. Below this threshold I observe a throughput of about 3 gigabytes per second; at or above it I observe a throughput of about 10 gigabytes per second.

    The specific hardware involved in this particular observation is a Dell Precision T3600 with 16 GB of RAM and an Intel Xeon E5-2665 0 @ 2.40GHz, running Windows 8, with an NVIDIA GTX Titan.

    Here is the C++ AMP source code fragment that is involved in the memory transfer:

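    for( auto myloop = 0; myloop < 1000 / streams; myloop++ )
    {
        // Phase 1: for each stream, wait for its previous download (if any)
        // to finish, count it, then start the next CPU-to-GPU upload.
        for (auto stream = 0; stream < streams; stream++)
        {
            if( downloadFutures[stream].valid() )
            {
                downloadFutures[stream].wait();
                loops++;
            }
            uploadFutures[stream] = copy_async(*cpuMemIns[stream], *gpuMemIns[stream]);
        }

        // Phase 2: for each stream, wait for its upload to finish, then
        // start the matching GPU-to-CPU download.
        for (auto stream = 0; stream < streams; stream++)
        {
            if( uploadFutures[stream].valid() )
            {
                uploadFutures[stream].wait();
            }
            downloadFutures[stream] = copy_async(*gpuMemOuts[stream], *cpuMemOuts[stream]);
        }
    }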
    

    The complete source code (which also contains a similar implementation using CUDA streams) is located here:

    https://www.assembla.com/code/cudafyperformance/subversion/nodes

    At the risk of being accused of self-promotion, here is a link to my blog article describing my observations of CUDA vs. C++ AMP:

    http://w8isms.blogspot.com/2013/06/smoked-cuda-cheese-macaroni.html

    Finally, here is a graphic of the C++ AMP performance threshold in the broader context of a larger range of transfer sizes, different numbers of concurrently scheduled transfer operations, and various CUDA results that do not exhibit this jump.

    - John Michael Hauck

    Tuesday, July 23, 2013 1:02 PM

All replies

  • There is a bit of a discussion going on about this at CodeProject.

    John Michael Hauck

    Wednesday, July 24, 2013 2:49 PM
  • Hi John,

    I am sorry we were not able to look into the issue you are reporting sooner. Can we please ask you at this point to submit a bug through Microsoft Connect adding the source code as an attachment, so we can access it with a clear legal status?

    Thank you and sorry for the extra trouble.

    Friday, August 16, 2013 5:46 PM
    Moderator
  • Łukasz,

    I submitted this as a bug as you asked (though I am not 100% sure I would call this a bug).

    I am concerned that it is a bit complicated to reproduce without the right hardware. In any case, I am available to answer questions from anyone wishing to reproduce this issue.

    - John


    John Michael Hauck

    Friday, August 16, 2013 7:16 PM