C++ AMP: Tiling for performance only?

    Question

  • Hi there,

    I'm converting an algorithm to C++ AMP, and the stage I'm at is moving the for loops into a parallel_for_each. It's a nested loop that advances in steps of 4 per iteration:

    for(int j = 0; j < height; j += 4, data += width * 4 * 4)
    {
        for(int i = 0; i < width; i += 4)
        {
    Using the typical parallel_for_each implementation doesn't work here since I can't manipulate the index explicitly to achieve this. I looked around and started using tiling. Using a 4x4 structure seems the most appropriate way here but since tiling is used as a performance enhancer (if used correctly), I wonder if there are better ways to achieve such a for loop using AMP?

    Since it's an image-based algorithm, the C++ AMP graphics library (DX Interop API) offers a good alternative for capturing texture data. Nevertheless, there seems to be no other way than combining this with tiling to skip ahead in the loop.

    With kind regards,

    Kinetomatics



    Monday, June 04, 2012 9:11 AM

Answers

  • Hi Kinetomatics,

    Unless I have misunderstood you, I think the following code should work for you with the simple model:

    extent<2> ext((height + 3) / 4, (width + 3) / 4);
    parallel_for_each(ext, [=](index<2> idx) restrict(amp)
    {
        int j = idx[0] * 4;
        int data = idx[0] * width * 4 * 4;
        int i = idx[1] * 4;
        index<2> sparse_idx(j, i);
        // ...
    });
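
    To see why the compact extent covers the same blocks as the original strided loops, here is a minimal CPU-only sketch in standard C++ (it does not use amp.h). The function names strided_origins and compact_origins are hypothetical and exist only for this illustration; idx0 and idx1 stand in for idx[0] and idx[1] inside the lambda.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Block origins (j, i) visited by the original nested loops with step 4.
std::vector<std::pair<int, int>> strided_origins(int height, int width) {
    std::vector<std::pair<int, int>> out;
    for (int j = 0; j < height; j += 4)
        for (int i = 0; i < width; i += 4)
            out.emplace_back(j, i);
    return out;
}

// Block origins produced by iterating a compact extent of
// ((height + 3) / 4) x ((width + 3) / 4) and scaling each index by 4,
// which is what the parallel_for_each lambda does per thread.
std::vector<std::pair<int, int>> compact_origins(int height, int width) {
    assert(height >= 0 && width >= 0);
    const int rows = (height + 3) / 4;  // rounds up to cover a partial last block
    const int cols = (width + 3) / 4;
    std::vector<std::pair<int, int>> out;
    for (int idx0 = 0; idx0 < rows; ++idx0)
        for (int idx1 = 0; idx1 < cols; ++idx1)
            out.emplace_back(idx0 * 4, idx1 * 4);
    return out;
}
```

    Both functions return the same list of origins for any non-negative height and width. Note that when height or width is not a multiple of 4, the last block in that dimension extends past the image edge, so the kernel body would need a bounds check there, just as the original loop body would.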


    Monday, June 04, 2012 4:33 PM
    Moderator

All replies

  • Hi Kinetomatics,

    I'd think a 4 x 4 tile is not a good idea for performance.

    A tile consists of one or more hardware scheduling units (often referred to as warps or wavefronts). On modern GPUs, a warp/wavefront is normally 32 or 64 threads wide. A 4 x 4 tile has only 16 threads, so each tile occupies a single, partially filled warp/wavefront, which under-utilizes the compute resources. The GPU also schedules multiple tiles to be resident on a streaming multiprocessor, as long as resources (registers, tile_static memory) allow and the number of tiles does not exceed the maximum number of resident tiles per multiprocessor. It then relies on the computation of the warps from these resident tiles to hide the latency of global memory accesses. The maximum number of resident tiles per multiprocessor is usually 8 or 16. Assuming it's 16, 4 x 4 tiles give only 16 resident warps (each already under-utilized) per multiprocessor. That may not be enough to hide the memory latency, and so may not fully utilize the computation power of the GPU.
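
    The arithmetic behind this argument can be written down as a rough sketch in standard C++. The Occupancy struct and estimate function are hypothetical names for this illustration, and the model is deliberately simplified: it assumes the resident-tile cap is the only limit and ignores register, tile_static, and per-multiprocessor warp limits, which vary by hardware.

```cpp
#include <cassert>

// Simplified per-multiprocessor occupancy figures (illustrative model only).
struct Occupancy {
    int warps_per_tile;       // scheduling units needed to hold one tile
    int resident_warps;       // warps resident when the tile cap is reached
    int active_threads;       // threads doing useful work in those warps
    double lane_utilization;  // active_threads / (resident_warps * warp_size)
};

// tile_threads: threads in one tile (16 for a 4 x 4 tile)
// warp_size: 32 or 64 on the GPUs discussed above
// max_resident_tiles: usually 8 or 16 according to the post above
Occupancy estimate(int tile_threads, int warp_size, int max_resident_tiles) {
    assert(tile_threads > 0 && warp_size > 0 && max_resident_tiles > 0);
    Occupancy o;
    // A tile smaller than a warp still occupies one whole warp.
    o.warps_per_tile = (tile_threads + warp_size - 1) / warp_size;
    o.resident_warps = o.warps_per_tile * max_resident_tiles;
    o.active_threads = tile_threads * max_resident_tiles;
    o.lane_utilization = static_cast<double>(o.active_threads) /
                         (static_cast<double>(o.resident_warps) * warp_size);
    return o;
}
```

    Under these assumptions, estimate(16, 32, 16) gives 16 resident warps that are each half full, and estimate(16, 64, 16) gives 16 wavefronts that are each a quarter full, which matches the "only 16 warps, each already under-utilized" case described above.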

    I'd suggest that you start with what Lukasz suggests, and then look at the memory access pattern in the kernel to decide on the next optimization step.

    Thanks,

    Weirong


    Monday, June 04, 2012 5:15 PM
  • Hi Lukasz,

    This should work; I haven't tested it yet due to other conversion issues I'm dealing with at the moment. After seeing your proposal, it seems I was searching too far.

    Thanks,

    Kinetomatics


    Monday, June 04, 2012 8:31 PM
  • Hi Weirong,

    This is the reason why I asked for a second opinion. Your explanation gave me a better view of tiles. After more research, it seems the limit on threads per tile is 1024, which would have been a limiting factor for my image compression algorithm.

    With kind regards,

    Kinetomatics


    Monday, June 04, 2012 8:41 PM