Is there any way to narrow down the performance difference between CUDA and C++ AMP?

  • Question

  • Hi, I'm migrating a CUDA marching cubes algorithm to C++ AMP.

    I'm running into performance problems, so converting the code by direct 1:1 matching does not seem appropriate.

    For one marching cubes run on a 128*128*128 volume, CUDA takes 2 ms while the AMP version takes 340 ms.

    I've attached one phase of the source code, classifyVoxels, in both the AMP and CUDA versions.

    The time spent in this phase is 20 ms for AMP and close to 0 ms for CUDA.

    Of course, I believe it can be optimized, but I don't know how.

    I suspect the cause of the slowdown is accessing the array_view.

    CUDA uses cudaBindTexture to access the volume data.

    In C++ AMP I couldn't find a similar facility, so I just used an array_view.

    Through testing, I found that accessing d_volume (the array_view over the volume data) causes the slowdown.

    This is the AMP code.

    inline unsigned int sampleVolumeIndex(const uint3& p, const uint3& gridSize) restrict(amp)

    {
    return (p.z*(gridSize.x+1)*(gridSize.y+1)) + (p.y*(gridSize.x+1)) + p.x;
    }

    // launch_classifyVoxel

    parallel_for_each(d_voxelVerts.extent.tile<1, 1, TILE_SIZE>(),
        [=] (tiled_index<1, 1, TILE_SIZE> t_idx) restrict(amp)
    {
        unsigned int i = (t_idx.global[1] * d_voxelVerts.extent[2]) + t_idx.global[2];
        uint3 gridPos = calcGridPos(i, gridsizeShift, gridsizeMask);

        // read field values at neighbouring grid vertices
        float field[8];
        field[0] = d_volume[sampleVolumeIndex(gridPos, gridsize)];
        field[1] = d_volume[sampleVolumeIndex(gridPos + uint3(1, 0, 0), gridsize)];
        field[2] = d_volume[sampleVolumeIndex(gridPos + uint3(1, 1, 0), gridsize)];
        field[3] = d_volume[sampleVolumeIndex(gridPos + uint3(0, 1, 0), gridsize)];
        field[4] = d_volume[sampleVolumeIndex(gridPos + uint3(0, 0, 1), gridsize)];
        field[5] = d_volume[sampleVolumeIndex(gridPos + uint3(1, 0, 1), gridsize)];
        field[6] = d_volume[sampleVolumeIndex(gridPos + uint3(1, 1, 1), gridsize)];
        field[7] = d_volume[sampleVolumeIndex(gridPos + uint3(0, 1, 1), gridsize)];

        // calculate flag indicating if each vertex is inside or outside isosurface
        unsigned int cubeindex;
        cubeindex  = (unsigned int)(field[0] < ref_iso_value);
        cubeindex += (unsigned int)(field[1] < ref_iso_value) * 2;
        cubeindex += (unsigned int)(field[2] < ref_iso_value) * 4;
        cubeindex += (unsigned int)(field[3] < ref_iso_value) * 8;
        cubeindex += (unsigned int)(field[4] < ref_iso_value) * 16;
        cubeindex += (unsigned int)(field[5] < ref_iso_value) * 32;
        cubeindex += (unsigned int)(field[6] < ref_iso_value) * 64;
        cubeindex += (unsigned int)(field[7] < ref_iso_value) * 128;

        // read number of vertices from the vertex-count table
        unsigned int numVerts = d_numVertsTable[cubeindex];

        if (i < numVoxels) {
            d_voxelVerts[t_idx.global] = numVerts;
            d_voxelOccupied[t_idx.global] = (numVerts > 0);
        }
    });
    }

    This is CUDA code.

    // sample volume data set at a point
    __device__
    float sampleVolume(uchar *data, uint3 p, uint3 gridSize)
    {
        p.x = min(p.x, gridSize.x - 1);
        p.y = min(p.y, gridSize.y - 1);
        p.z = min(p.z, gridSize.z - 1);
    
    uint i = (p.z*gridSize.x*gridSize.y) + (p.y*gridSize.x) + p.x;
        //return (float) data[i] / 255.0f;
        return tex1Dfetch(volumeTex, i);
    }
    
    
    __global__ void
    classifyVoxel(uint* voxelVerts, uint *voxelOccupied, uchar *volume, uint3 gridSize, uint3 gridSizeShift, uint3 gridSizeMask, uint numVoxels, float3 voxelSize, float isoValue)
    {
        uint blockId = __mul24(blockIdx.y, gridDim.x) + blockIdx.x;
        uint i = __mul24(blockId, blockDim.x) + threadIdx.x;
    
        uint3 gridPos = calcGridPos(i, gridSizeShift, gridSizeMask);
    
        // read field values at neighbouring grid vertices
        float field[8];
        field[0] = sampleVolume(volume, gridPos, gridSize);
        field[1] = sampleVolume(volume, gridPos + make_uint3(1, 0, 0), gridSize);
        field[2] = sampleVolume(volume, gridPos + make_uint3(1, 1, 0), gridSize);
        field[3] = sampleVolume(volume, gridPos + make_uint3(0, 1, 0), gridSize);
        field[4] = sampleVolume(volume, gridPos + make_uint3(0, 0, 1), gridSize);
        field[5] = sampleVolume(volume, gridPos + make_uint3(1, 0, 1), gridSize);
        field[6] = sampleVolume(volume, gridPos + make_uint3(1, 1, 1), gridSize);
        field[7] = sampleVolume(volume, gridPos + make_uint3(0, 1, 1), gridSize);
    
        // calculate flag indicating if each vertex is inside or outside isosurface
        uint cubeindex;
        cubeindex =  uint(field[0] < isoValue); 
        cubeindex += uint(field[1] < isoValue)*2; 
        cubeindex += uint(field[2] < isoValue)*4; 
        cubeindex += uint(field[3] < isoValue)*8; 
        cubeindex += uint(field[4] < isoValue)*16; 
        cubeindex += uint(field[5] < isoValue)*32; 
        cubeindex += uint(field[6] < isoValue)*64; 
        cubeindex += uint(field[7] < isoValue)*128;
    
        // read number of vertices from texture
        uint numVerts = tex1Dfetch(numVertsTex, cubeindex);
    
        if (i < numVoxels) {
            voxelVerts[i] = numVerts;
            voxelOccupied[i] = (numVerts > 0);
        }
    }

    Monday, September 24, 2012 9:20 AM

Answers

  • Hi Manny,

    Thanks for providing the complete code.

    I've taken a quick look into it, focusing on the generateTriangles function, which is clearly the hot spot. On my machine it initially took ~600 ms to execute.

    In that function you're creating two array_views, d_vertexList and d_cubeIndex, on top of memory allocated dynamically on the heap. Then the content of these views is discarded with a discard_data() call, so the bits are not copied to the GPU (that's correct), where the views are filled by the first kernel. Next, another kernel is invoked that uses the information stored in them to compute the result in the pos and norm array_views passed to the function by reference.

    What is implicit, but happens shortly thereafter, is the sequence of destructor calls for the local objects in that function's scope. This includes the destructors for these local array_views and, as they're the last remaining references to that memory, causes their synchronization (i.e. a copy back to CPU memory). This costs ~400 ms on my machine and can be avoided by calling discard_data() on d_vertexList and d_cubeIndex again after the second parallel_for_each (so the data is not only not copied in, but also not copied out). You can read more about this behavior in the "concurrency::array_view - Implicit synchronization on destruction" blog post.

    Another issue is the memory allocation and initialization performed before creating the array_views. The cost of such an operation is non-trivial; on my machine it's up to 160 ms. Since an array_view cannot be created without a "home location", I would suggest refactoring this part of the function to use either array_views that are created once and reused throughout program execution, or arrays instead, which do not have to be initialized with any data, nor would you have to worry about discarding it. A good introduction to both arrays and array_views is the "array and array_view from amp.h" blog post. I suspect this pattern might be the result of a mechanical translation of cudaMalloc, but the behavior is very different. If so, please refer to our guide for CUDA programmers.

    After factoring out overheads caused by the aforementioned problems, the execution time of generateTriangles drops to 30 ms on my machine, and the total time is similarly reduced.

    Having said that, there is still a difference versus the CUDA result you're reporting; I will try to investigate this part further.

    One final note I'd like to make is about the timing code you're using. Please remember that GPU execution is asynchronous to CPU execution and may be performed lazily. Therefore sampling the timer on the CPU is futile unless you synchronize the two units. The easiest way to do so is to call accelerator_view.wait() before taking each time sample. In your current version of the code it happens to work without it, but you'll observe inconsistent or surprising results after incorporating the suggested fixes. Please refer to the performance measurement posts I mentioned before.

     

    <UPDATE>

    As I promised, I've done some additional investigation of the performance disparity. But let's start with numbers!

    On my machine (Nvidia GTX 580, Intel Core i7 930), the performance of the original CUDA sample is as follows:

    1. classifyVoxel: 0.468036 ms
    2. scan: 0.447243 ms
    3. readback: 0.0649341 ms
    4. compactVoxels: 0.132787 ms
    5. scan: 1.46685 ms
    6. readback: 0.0809852 ms
    7. generateTriangles: ---

    The only change I've made to that code is the addition of timing. The same advice about taking CPU samples holds as for C++ AMP: you have to synchronize the two units; in the case of CUDA you may use cudaDeviceSynchronize().

    The C++ AMP version with changes that I've made takes:

    1. classifyVoxel: 0.656637 ms
    2. scan: 1.86594 ms
    3. readback: 0.526039 ms
    4. compactVoxels: 0.374283 ms
    5. scan: 1.87433 ms
    6. readback: 0.614321 ms
    7. generateTriangles: ---

    I get more precise results because I've been using the high-resolution counter (code available on our blog). The inferior resolution of chrono::high_resolution_clock (which you've been using) is a known bug.

    To dig into the results: the scan being ~4x slower in C++ AMP may be attributed to the CUDA version using the implementation from the Thrust library, which is known to be highly tuned, while the C++ AMP version is an ad-hoc implementation. This is a gap we're planning to close with the C++ AMP Algorithms Library.

    The reason I haven't given the result for generateTriangles in C++ AMP is two-fold:

      • The input passed to that function is different: the problem size (activeVoxels) in C++ AMP is ~80x larger.
      • The C++ AMP kernel is broken into two parallel_for_each invocations without any good reason, while the CUDA version uses only one dispatch.

    Since it would not be an apples-to-apples comparison, it is fairer to omit the result.

    Let's get back to the source. The only change I had to make to your code to get the results above was to the code that reads the scan results back to the CPU. Originally it was:

    inline void getValue(uint& Value, array_view<uint, 3> Array /*,...*/) restrict(cpu)
    {
        // ...
        Value = Array[Index];
    }

    Here the array_view contains a result computed on the GPU.
    Accessing a single element through the array_view's subscript operator has a subtle but performance-critical effect: the content of the whole array_view has to be fetched back to CPU memory (this is due to our coherence protocol). Not only that, but since the array_view is not const (i.e. not an array_view<const uint, 3>; also see this blog post), the data is marked dirty and is subsequently copied back to the GPU on the next parallel_for_each invocation.

    The solution is to change that one line of code to:

    copy(Array.section(Index, extent<3>(1, 1, 1)), &Value);

    This statement causes one and only one element to be copied, and the data residing on the GPU is kept in a "clean" state.

    I hope that my explanation will help you.

    Thanks once again for the code, it's very helpful for us to understand how people are using our APIs. In case you think anything of that should be changed, please voice your feedback in the forum thread "What do you want in the next version of C++ AMP? – we are listening".


    Tuesday, September 25, 2012 8:13 PM
    Moderator

All replies

  • Hi Manny,

    I've tried to fill in the missing parts of the source code to run your snippet on my machine. For the 128x128x128 grid you mention, I'm observing timing results close to 1 ms.

    I take for granted that you're following our advice for measuring performance (How to measure the performance of C++ AMP algorithms? and Data warm up when measuring performance with C++ AMP).

    I'd like to experiment more with the scenario you're describing. Can you provide us with a self-contained repro (i.e. code that I can just copy and paste to run on my system) where you're observing such disparity?

    Monday, September 24, 2012 5:54 PM
    Moderator
  • Thanks for your comment.

    I'll look into your advice.

    Above all, I've just made a demo project that runs the marching cubes algorithm in AMP.

    There is a CUDA version of this among the 'NVIDIA GPU Computing SDK 4.2' sample codes.

    The timing results of my AMP project:

    classifyVoxels : 10(ms)

    exclusive scan1 : 30(ms)

    compactVoxels : 7(ms)

    exclusive scan2 : 2(ms)

    generateTriangles : 213(ms)

    Total Time : 262(ms)

    and CUDA version

    Total Time : 2(ms)

    Thanks.

    Tuesday, September 25, 2012 2:36 AM