AMP Performance in 64 bits

  • Question

  • Hi,

    After getting a new machine, I tested my AMP code a bit more, and I get some strange results when comparing the 32-bit and 64-bit binaries.

    I tested on a freshly installed Windows 7 Pro 64-bit SP1. I use only FastMath and float in the AMP code.

    When compiling in 32 bits (x86 in VS):

    • Nvidia Geforce GTX 660 Ti: 85ms
    • AMD Radeon HD 6450: 140ms
    • Intel HD 4000 (in CPU): 200ms

    When compiling in 64 bits (AnyCPU in VS for the .Net overlay):

    • Nvidia Geforce GTX 660 Ti: 115ms
    • AMD Radeon HD 6450: 305ms
    • Intel HD 4000 (in CPU): 150ms (but the shader compilation is very slow, more than 3 seconds)

    I don't understand why the discrete cards are so much slower in 64 bits while the on-board GPU gets faster.

    Is that normal, or is something wrong in my C++ AMP dll or my C++/CLI wrapper dll?

    With this performance, I don't see a big advantage in using a high-end graphics card like the GeForce in 64 bits if the onboard (free) one is just 30% slower.

    Tuesday, November 6, 2012 4:03 PM

Answers

  • Hi PYB_42,

    Your time measurement code needs to take into account the asynchronous nature of GPU execution.

    That means adding an av.wait() call just before starting the timer -- to exclude from the timing any work previously scheduled on the accelerator_view; and another av.wait() call just before stopping the timer -- to force the work to complete before sampling the time.

    Please refer to our blog post "How to measure the performance of C++ AMP algorithms?" for a broader explanation.

    • Proposed as answer by Amit K Agarwal (Moderator) Tuesday, November 13, 2012 6:27 PM
    • Marked as answer by PYB_42 Wednesday, November 14, 2012 7:37 AM
    Friday, November 9, 2012 5:01 PM
    Moderator

All replies

  • Such a difference between 32 and 64 bit performance is unusual. Are these timing numbers just for the kernel execution part? If not, can you share the timing numbers for the kernel execution itself?

    If this difference is in the time it takes for just executing the kernel, it will require further investigation. If it is possible for you to share the code, I can take a look.

    -Amit


    Amit K Agarwal

    Tuesday, November 6, 2012 8:49 PM
    Moderator
  • Hi,

    I really don't understand what is happening: today I get more consistent times after adding two HighResolutionTimer instances to my code to see more detail. I didn't make any other code change, and now, even without the timers, the times are similar between 32 and 64 bits.

    Now I just don't know whether yesterday's strange times came from the build of the AMP dll or were related to the machine state (it had just been in sleep/hibernate overnight).

    When compiling in 32 bits (x86 in VS):


        Nvidia Geforce GTX 660 Ti:
        ComputeFieldCorrelation parallel_for_each took: 3.36078 ms
        ComputeFieldCorrelation parallel_for_each took: 0.11259 ms
        ComputeFieldCorrelation full took: 83.8765 ms
        .Net: GetPositionalXFormMatrix executed in 00:00:00.0840084  

     
        AMD Radeon HD 6450:
        ComputeFieldCorrelation parallel_for_each took: 4.30768 ms
        ComputeFieldCorrelation parallel_for_each took: 0.384857 ms
        ComputeFieldCorrelation full took: 150.197 ms
        .Net: GetPositionalXFormMatrix executed in 00:00:00.1500150

        Intel HD 4000 (in CPU):
        ComputeFieldCorrelation parallel_for_each took: 22.9073 ms
        ComputeFieldCorrelation parallel_for_each took: 0.608225 ms
        ComputeFieldCorrelation full took: 208.314 ms
        .Net: GetPositionalXFormMatrix executed in 00:00:00.2080208    

    When compiling in 64 bits (AnyCPU in VS for the .Net overlay):

        Nvidia Geforce GTX 660 Ti:
        ComputeFieldCorrelation parallel_for_each took: 3.1848 ms
        ComputeFieldCorrelation parallel_for_each took: 0.247818 ms
        ComputeFieldCorrelation full took: 97.4078 ms
        .Net: GetPositionalXFormMatrix executed in 00:00:00.1010101

        AMD Radeon HD 6450:
        ComputeFieldCorrelation parallel_for_each took: 3.54068 ms
        ComputeFieldCorrelation parallel_for_each took: 0.544837 ms
        ComputeFieldCorrelation full took: 144.696 ms
        .Net: GetPositionalXFormMatrix executed in 00:00:00.1480148

        Intel HD 4000 (in CPU):
        ComputeFieldCorrelation parallel_for_each took: 23.1829 ms
        ComputeFieldCorrelation parallel_for_each took: 0.575625 ms
        ComputeFieldCorrelation full took: 228.684 ms
        .Net: GetPositionalXFormMatrix executed in 00:00:00.2320232

    Wednesday, November 7, 2012 10:45 AM
  • What mechanism are you using to time the kernels?

    -L

    Thursday, November 8, 2012 5:33 PM
  • Hi,

    In C++ I use the high-resolution timer:

    http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/28/high-resolution-timer-for-c.aspx

    HighResolutionTimer timer;
    timer.Start();
    parallel_for_each(av, e, [=] (index<2> idx)  restrict(amp)
    {
    ....
    });
    timer.Stop();
    std::wcout << "ComputeFieldCorrelation parallel_for_each took: " << timer.Elapsed() << " ms" << std::endl;  

             


    Friday, November 9, 2012 7:20 AM
  • Sorry, I forgot about the av.wait(). Now the times make more sense:

    When compiling in 32 bits (x86 in VS):


        Nvidia Geforce GTX 660 Ti:
        ComputeFieldCorrelation parallel_for_each took: 41.3414 ms
        ComputeFieldCorrelation parallel_for_each took: 23.4753 ms
        ComputeFieldCorrelation full took: 88.0982 ms
        .Net: GetPositionalXFormMatrix executed in 00:00:00.0880088

     
        AMD Radeon HD 6450:
        ComputeFieldCorrelation parallel_for_each took: 66.4128 ms
        ComputeFieldCorrelation parallel_for_each took: 56.808 ms
        ComputeFieldCorrelation full took: 132.294 ms
        .Net: GetPositionalXFormMatrix executed in 00:00:00.1320000

        Intel HD 4000 (in CPU):
        ComputeFieldCorrelation parallel_for_each took: 103.576 ms
        ComputeFieldCorrelation parallel_for_each took: 82.0039 ms
        ComputeFieldCorrelation full took: 202.002 ms
        .Net: GetPositionalXFormMatrix executed in 00:00:00.2030000   

    When compiling in 64 bits (AnyCPU in VS for the .Net overlay):

        Nvidia Geforce GTX 660 Ti:
        ComputeFieldCorrelation parallel_for_each took: 40.718 ms
        ComputeFieldCorrelation parallel_for_each took: 22.1792 ms
        ComputeFieldCorrelation full took: 92.2948 ms
        .Net: GetPositionalXFormMatrix executed in 00:00:00.0940000

        AMD Radeon HD 6450:
        ComputeFieldCorrelation parallel_for_each took: 66.2085 ms
        ComputeFieldCorrelation parallel_for_each took: 60.0722 ms
        ComputeFieldCorrelation full took: 140.169 ms
        .Net: GetPositionalXFormMatrix executed in 00:00:00.1420000

        Intel HD 4000 (in CPU):
        ComputeFieldCorrelation parallel_for_each took: 107.017 ms
        ComputeFieldCorrelation parallel_for_each took: 80.1714 ms
        ComputeFieldCorrelation full took: 209.74 ms
        .Net: GetPositionalXFormMatrix executed in 00:00:00.2120000

    Tuesday, November 13, 2012 5:06 PM