Fastest way to copy array in C++ AMP

  • Question

  • Hello everyone,

    I want to copy an array of bytes to another one.
    It seems that std::copy or concurrency::copy is faster than copying within C++ AMP. Is that right?

    Here is the code

std::copy(buffer, buffer + Length, *Pixels);

    or

    concurrency::copy(arrayBuffer, arrayPixels);  

And here is my C++ AMP code:

    //Use AMP
    array_view<unsigned int> arrayBuffer((Length + 3) / 4, reinterpret_cast<unsigned int*>(buffer));

    array_view<unsigned int> arrayPixels((Length + 3) / 4, reinterpret_cast<unsigned int*>(*Pixels));
    arrayPixels.discard_data();

    parallel_for_each(arrayPixels.extent, [arrayBuffer, arrayPixels] (index<1> idx) restrict(amp)
    {
        auto pixel = Amp_ReadByte(arrayBuffer, idx);
        Amp_WriteByte(arrayPixels, idx, pixel);
    });

    arrayPixels.synchronize();

    Any help is really appreciated.

    Best Regards. Pooya Eimandar.

    Tuesday, April 2, 2013 4:55 PM

All replies

  • I think sending data over the bus to the GPU is what is going to kill your parallel copy speed.

    You copy data over to the GPU (major cost).

    Then you do a parallel copy.

    For small data sizes, your CPU will outperform the GPU.

    It will probably outperform it for larger data sizes as well, I'm guessing.

    Try a sample size of 10 million and see what results you get. (Again, I'm going to guess that the CPU will still outperform the GPU in this case.)

    You are essentially doing 3 copies:

    1) copying data to the GPU

    2) the parallel copy

    3) copying data back out of the GPU

    If you are not doing any heavy math, in this kind of situation the cost of #1 and #3 will outweigh any speed you gain from #2, and thus your CPU will outperform the GPU.

Also, I would use memcpy for a raw byte copy.


    Tuesday, April 2, 2013 10:20 PM
Thank you, Martin.
    I tested them all; it seems the CPU outperforms the GPU.
    I thought C++ AMP was the fastest option, so I assumed my code might be wrong.

    Best Regards. Pooya Eimandar.

    Wednesday, April 3, 2013 8:26 AM
  • Err, it's unclear why you would expect this particular pattern (an anti-pattern, really) to be faster. Basically you're doing 3 spurious copies - it would be quite hard for it not to be slower.
    Wednesday, April 3, 2013 2:56 PM
  • Hi Alex.

    I just read each element of the array as a byte, then store it in the corresponding element of the new array. buffer and Pixels are arrays of bytes.

    I tested all the methods separately.
    For more info, here are the bodies of the methods used in my AMP code:

    #pragma once

    using namespace Concurrency;

    // Read the byte at index idx from the packed array arr.
    template <typename T>
    unsigned int Amp_ReadByte(T& arr, int idx) restrict(amp)
    {
        return (arr[idx >> 2] & (0xFF << ((idx & 0x3) << 3))) >> ((idx & 0x3) << 3);
    }

    // Write value val to the byte at index idx in the packed array arr.
    template <typename T>
    void Amp_WriteByte(T& arr, int idx, unsigned int val) restrict(amp)
    {
        atomic_fetch_xor(&arr[idx >> 2], arr[idx >> 2] & (0xFF << ((idx & 0x3) << 3)));
        atomic_fetch_xor(&arr[idx >> 2], (val & 0xFF) << ((idx & 0x3) << 3));
    }

    template <typename T>
    unsigned int Amp_ReadByte(T& arr, index<1> idx) restrict(amp)
    {
        return Amp_ReadByte(arr, idx[0]);
    }

    template <typename T>
    void Amp_WriteByte(T& arr, index<1> idx, unsigned int val) restrict(amp)
    {
        Amp_WriteByte(arr, idx[0], val);
    }


    Best Regards. Pooya Eimandar.

    Wednesday, April 3, 2013 3:08 PM
  • Sure, but that's not a particularly useful comparison to make since:

    • currently the workflow would be something like this (for the AMP case): 1 copy (across PCI-E, which is pretty slow / high-latency) from main RAM to GPU RAM, 1 copy in GPU RAM, 1 copy (back across PCI-E) from GPU RAM to main RAM - you're doing no computation on the data, just moving it across memory domains through rather narrow pathways; it's expected for std::copy or concurrency::copy to do better, since they do just one copy, within the same memory space (which happens to be fast, low-latency main RAM);
    • furthermore, you need to manually unpack bytes, since there's no byte type in C++ AMP, whereas std::copy can (and probably does) optimise this particular case by calling memcpy - this is probably a secondary consideration though.

    I guess what I'm not getting (and perhaps you could elaborate on, if you don't mind) is why would you be interested in shuffling data all the way to the GPU just to do a copy that you merely read back into main RAM?

    Wednesday, April 3, 2013 5:05 PM
  • Thanks in advance for the reply.

    The thing is, I wanted to get the back buffer from the GPU and copy it to the CPU. Because the back buffer is of type Texture2D, I decided to use AMP to read the back buffer and copy it to an array of bytes.
    But it seems that most CPUs implement hardware instructions designed specifically for moving memory, and therefore std::copy() is the faster way.
    Also, in AMP I used a simple conversion from bytes to int and vice versa; maybe that is another reason.


    Best Regards. Pooya Eimandar.

    Wednesday, April 3, 2013 5:22 PM
  • As mentioned, you're basically doing 3 copies, so the CPU copy will always outperform the GPU parallel copy.

    If you used a for loop on the CPU side to do the copy, it would still outperform the GPU.

    It's when you are performing some expensive calculation that the GPU is going to outperform the CPU. (I have found that with trivial, non-"expensive" calculations the CPU will still beat the GPU.)

    For example, in my testing the CPU will beat the GPU for small data sets (fewer than 10,000 elements).

    Then the GPU will start to match CPU speeds for larger data sets (more than 10,000).

    Then the GPU will start to exceed CPU speeds (more than 100,000-1,000,000).



    Wednesday, April 3, 2013 6:43 PM