C++ AMP: grids with tiles in which the tile size is a computed value.

Respondida C++ AMP: grids with tiles in which the tile size is a computed value.

  • sábado, 29 de octubre de 2011 9:14
     
      Tiene código

    Hi,

    In Daniel Moth's examples in his blog, and in the online reference for C++ AMP Concurrency, there are examples for creating grids with const int template parameters:

     

    extent<1> e(20);   // 20 units in a single dimension with indices from 0-19
    grid<1> g(e);      // same as extent
    tiled_grid<4> tg = g.tile<4>(); // OK
    const int four = 4;
    tiled_grid<four> tg2 = g.tile<four>(); // OK

     

    The problem is that I want to create a tile of a size that is computed, which is possible in CUDA. This isn't as easy as one might think.  The problem is that the template parameters for tiled_grid<> and grid method tile<>() must be a constant integer.

    So, the question is: how can I create tiled_grid's (and tiled_index<>) with a runtime-computed parameter?

    I played around with with some examples and I seem to have stumbled upon something that works--sometimes.  Here is an example which show how it works, and when it does not.

     

    	{
    		int blockCount = 512;
    		int threadCount = 512;
    		int * mdata = (int*)malloc(blockCount * threadCount * sizeof(int));
    		for (int i = 0; i < blockCount * threadCount; ++i)
    			mdata[i] = -1;
    
    		std::vector<accelerator> accelerators = get_accelerators();
    		std::vector<accelerator>::iterator it;
    		for (it = accelerators.begin(); it != accelerators.end(); ++it)
    		{
    			if ((*it).get_has_display() == false)
    				continue;
    			break;
    		}
    		
    
    		// Execute a grid of 512*512, with a tile containing 512 threads (constant).
    		{
    			array_view<int, 1> a(blockCount * threadCount, mdata);
    			extent<1> e(blockCount * threadCount);
    			grid<1> g(e);
    			parallel_for_each(*it, g.tile<512>(),
    				[=](tiled_index<512> idx) restrict(direct3d)
    			{
    				int ig = idx.global[0];
    				if (ig < 0)
    					return;
    				else if (ig >= blockCount * threadCount)
    					return;
    				a[ig] = 0;
    			});
    		}
    
    		// Execute a grid of 10, with a tile containing 2 threads (computed).
    		// WORKS
    		{
    			int threads = 2;
    			array_view<int, 1> a(blockCount * threadCount, mdata);
    			extent<1> e(10);
    			grid<1> g(e);
    			tiled_grid<1> tg = g.tile<1>();
    			tg.get_tile_extent() = extent<1>(threads);
    			parallel_for_each(*it, tg,
    				[=](tiled_index<1> idx) restrict(direct3d)
    			{
    				int ig = idx.global[0];
    				if (ig < 0)
    					return;
    				else if (ig >= blockCount)
    					return;
    				a[ig] = 1;
    			});
    		}
    
    		// Execute a grid of 512*512, with a tile containing 2 threads (computed).
    		// FAILS TO RUN
    		{
    			int threads = 2;
    			array_view<int, 1> a(blockCount * threadCount, mdata);
    			extent<1> e(blockCount * threadCount);
    			grid<1> g(e);
    			tiled_grid<1> tg = g.tile<1>();
    			tg.get_tile_extent() = extent<1>(threads);
    			parallel_for_each(*it, tg,
    				[=](tiled_index<1> idx) restrict(direct3d)
    			{
    				int ig = idx.global[0];
    				if (ig < 0)
    					return;
    				else if (ig >= blockCount * threadCount)
    					return;
    				a[ig] = 2;
    			});
    		}
    
    	}
    

     

    In the first parallel_for_each, the code works fine.  This code is a control.  Using constants in the template parameters, it sets the tile size to 512.  Upon completion, the entire data buffer is set to zero.

    In the second parallel_for_each, a 1D grid is created of size 10.  The tile size for the grid by default seems to be the size of the grid.  This code change the extent of the tile to be 2.  In other words, there are now 5 tiles in the grid.  To my surprise, this code works fine. The first ten integers are correctly set after executing the parallel_for_each and exiting the block containing array_view a.

    However, this method does not seem to work always.  In the third parallel_for_each, the grid is scaled back to the full size (512*512), and a tile size of 2 is set. This code gets an exception:

     

    runtime_exception: Concurrency::parallel_for_each: unsupported compute domain for specified accelerator.

    invalid_compute_domain: Concurrency::parallel_for_each: unsupported compute domain for specified accelerator.

    The code actually doesn't work for any tile size because the grid is too large.  That sort of doesn't make sense because the first parallel_for_each works fine.

    This is for an NVIDIA GTX 470 accelerator.

    How can I get this code to work? Can someone give an example how to execute a parallel_for_loop with a tile in which the size of the tile is a computed value?

     

    Ken



    • Editado Ken Domino sábado, 29 de octubre de 2011 9:15
    • Editado Ken Domino sábado, 29 de octubre de 2011 15:03
    •  

Todas las respuestas

  • sábado, 29 de octubre de 2011 14:10
     
     

    Actually, I now see that get_tile_extend() is declared "Concurrency::extent<rank> get_tile_extent() const restrict(cpu, direct3d);" So, that shouldn't work (and it should probably return a const extent<rank> to avoid what I did). If there is no way to construct a tiled_grid that has the tile extent size parameterized with a runtime-computed value, that will definitely make porting CUDA and OpenCL programs to C++ AMP much, much harder. The number of threads per block is almost always a computed value. If I'm wrong, please let me know.

    Ken



    • Editado Ken Domino sábado, 29 de octubre de 2011 14:13 span in wrong place
    • Editado Ken Domino sábado, 29 de octubre de 2011 15:04
    •  
  • lunes, 31 de octubre de 2011 23:20
     
     

    what are you trying to achieve?

     

     


    Windows MVP 2010-11, XP, Vista, 7. Expanding into Windows Server 2008 R2, SQL Server, SharePoint, Cloud, Virtualization etc. etc.

    Hardcore Games, Legendary is the only Way to Play

    Developer | Windows IT | Chess | Economics | Vegan Advocate | PC Reviews

  • martes, 01 de noviembre de 2011 5:35
     
      Tiene código

    what are you trying to achieve?

    I am trying to do the equivalent in C++ AMP as what can be done in CUDA:

    int grid = a computed value at runtime;
    int threads = a computed value at runtime;
    kernel<<<grid, threads>>>();
    

    In CUDA (and OpenCL), the number of threads per block can be a value computed at runtime. In C++ AMP, you cannot do that.  The number of threads per block must be an integer constant defined at compile time because it is a parameter to a template.  This restriction has several consequences.

    NVIDIA provides many examples in its GPU Computing SDK where the number of threads per block is a value that is set at runtime, not compile time (e.g., bilateralFilter).  Therefore, porting any one of these examples may not be particularly easy.

    Sometimes the number of threads per block needs to be a GPU-specific value for performance.  In CUDA, the number of threads per block could be easily changed by changing the executive configuration for the kernel call.  In C++ AMP, code must be instantiated with all sizes that may be of interest, and selected by conditional statements.

    While none of this is really horrible, it just seems that the API is inconsistent.  "grid<1> g(extent<1>(5))" creates a one-dimensional grid of 5 elements.  But "grid.tile<5>()" creates a tiled grid with 5 threads per tile.  The parameter to the template is a dimension in one case, and an extent of the tile in the other case.  Why doesn't something like "grid.tile<1>(extent<1>(threads_per_tile))" exist, where "1" consistently refers to the number of dimensions, and tiled_grid and grid constructors both input an extent?

    Ken

  • martes, 01 de noviembre de 2011 12:00
     
     Respuesta propuesta

    AMP is not complete but it should be better when the new version of VS is released

     


    Windows MVP 2010-11, XP, Vista, 7. Expanding into Windows Server 2008 R2, SQL Server, SharePoint, Cloud, Virtualization etc. etc.

    Hardcore Games, Legendary is the only Way to Play

    Developer | Windows IT | Chess | Economics | Vegan Advocate | PC Reviews

    • Propuesto como respuesta Vegan FanaticMVP miércoles, 02 de noviembre de 2011 21:09
    •  
  • miércoles, 02 de noviembre de 2011 17:48
    Propietario
     
     

    Hi Ken

    Sorry, that is a limitation we are aware of, but it is an underlying DirectCompute limitation that we could not work around in this release.

    It looks like you've already figured out the workaround that you'll have to employ with the first release of C++ AMP: dynamically select one of the compile-time fixed options (potentialy using templates to reduce code bloat). Let us know if you need help with that by sharing your code...

    We are actively brainstorming ways of getting around this in a future release (they are all very intrusive), so any more data you can share about your scenario would be helpful to us - thanks for the feedback.

    Cheers

    Daniel


    http://www.danielmoth.com/Blog/
  • miércoles, 02 de noviembre de 2011 18:17
     
     

    Hello Ken,

     

    First you are right that get_tile_extent() should return a const value. Had that been the case, you would have discovered the problem earlier. So thank you for uncovering that!

     

    With respect to the question on dynamic tile size: we do not support dynamic tile sizes at this point. DirectX doesn’t support them, and we also haven’t seen them being pervasively used. In some cases, such as generating code for a SIMD machine (x86/SSE/AVX) without the knowledge of tile size is extremely difficult. I suspect that in many cases while you do get the illusion of a dynamic tile/block/group size what really happens is that code is generated for a few fixed sizes, and then an appropriate version is selected from amongst this fixed set of options, at runtime. That, you could simulate relatively easily using template programming with integer template parameters, potentially with better control over what versions to generate in relation to the particular problem you are trying to solve.

     

    Having said that, we are looking for scenarios where having this feature is really important, so that we can decide whether to pursue it in the future or not (like everybody else, we have to prioritize…). Do you have a particular example in mind where you think this would have made a big difference in performance or developer productivity? If you can point me to an OpenCL/CUDA sample you have a particular interest in, I could try to think and test how well it is expressed and performs with our static tile dimensions restriction.

     

    Obviously, just the fact that the model is different in this respect, makes porting a little harder. That is true and we realize that. On the other hand, we do want to optimize for hardware portability and a minimal programming model design.

     

    Thanks…

     


    Yossi Levanoni, Principal Architect Parallel Computing Platform, Microsoft
  • sábado, 05 de noviembre de 2011 2:14
     
      Tiene código

    Here is an scenario where there a tile size that can be set at runtime would be handy.

    This is a naive implementation of +-reduce.  The function reduce_serialA is a serial implementation of reduce_parallel, where code is split at explicit synchronization points (see Guo, Z., E. Zhang, et al. (2011). Correctly Treating Synchronizations in Compiling Fine-Grained SPMD-Threaded Programs for CPU. for a discussion of serialization of CUDA code). The program only runs the serial code because the parallel cannot compile with the guards "if (i >= n) return; if (i < 0) return;", and those have to be in the code in some form.

    #include <stdlib.h>
    #include <stdio.h>
    #include <iostream>
    #include <amp.h>
    #include <sys/timeb.h>
    #include <time.h>
    #include <stack>
    #include <list>
    #include <assert.h>
    
    using namespace concurrency;
    
    static const unsigned int THREADCOUNT = 512;
    
    unsigned int log2(unsigned int v)
    {
        unsigned int x = 0;
        while ((v = (v >> 1)) != 0)
        {
            x++;
        }
        return x;
    }
    
    unsigned int pow2(int e) restrict(direct3d, cpu)
    {
        unsigned int x = 1;
        for (int i = 0; i < e; ++i)
            x *= 2;
        return x;
    }
    
    
    // From http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
    int flog2(int v)
    {
        int x = 0;
        while ((v = (v >> 1)) != 0)
        {
            x++;
        }
        return (int)x;
    }
    
    
    void reduce_serialA(int * fun, int n)
    {
        int K = THREADCOUNT;
    
        // Round up n to power of 2.
        int rounded_n = 1;
        for (int i = 0; rounded_n < n; ++i)
            rounded_n <<= 1;
    
        assert(rounded_n % K == 0);
    
        int * g_data = (int*)malloc(rounded_n * sizeof(int));
        memcpy(g_data, fun, rounded_n * sizeof(int));
    
        int block_size = K;
        int increment = flog2(block_size);
        int levels = flog2(rounded_n);
    
        int level = 0;
        for (level = 0; level < levels; level += increment)
        {
            int gstep = pow2(level);
            int blocks = (rounded_n / gstep) / block_size;
            if (blocks == 0)
                blocks = 1;
            assert(blocks != 0);
            assert(blocks == 1 || blocks % 2 == 0);
            int threads = blocks * THREADCOUNT;
    
            for (int bid = 0; bid < blocks; ++bid)
            {
                int s_data[THREADCOUNT];
    
                for (int tid = 0; tid < THREADCOUNT; ++tid)
                {
                    // Each thread loads one element from global to shared mem
                    unsigned int i = (bid * THREADCOUNT + tid + 1) * gstep - 1;
                    if (i >= n)
                        break;
                    if (i < 0)
                        break;
                    s_data[tid] = g_data[i];
                }
                // syncthreads
                for (unsigned int s = 2; s <= THREADCOUNT; s <<= 1)
                {
                    for (int tid = 0; tid < THREADCOUNT; ++tid)
                    {
                        unsigned int i = (bid * THREADCOUNT + tid + 1) * gstep - 1;
                        if (i >= n)
                            break;
                        if (i < 0)
                            break;
    
                        int tidp1 = tid + 1;
    
                        // Do reduction in shared memory.
                        if ((tidp1 % s) == 0)
                            s_data[tid] += s_data[tid - s/2];
                        // syncthreads
                    }
                }
                // syncthreads
                for (int tid = 0; tid < THREADCOUNT; ++tid)
                {
                    unsigned int i = (bid * THREADCOUNT + tid + 1) * gstep - 1;
                    if (i >= n)
                        break;
                    if (i < 0)
                        break;
                    g_data[i] = s_data[tid];
                }
            }
        }
        for (int i = 0; i < n; ++i)
            std::cout << "i " << i << " " << g_data[i] << "\n";
    }
    
    void reduce_parallel(accelerator acc, int * fun, int n)
    {
        int K = THREADCOUNT;
    
        // Round up n to power of 2.
        int rounded_n = 1;
        for (int i = 0; rounded_n < n; ++i)
            rounded_n <<= 1;
    
        assert(rounded_n % K == 0);
    
        array_view<int, 1> g_data(rounded_n, fun);
    
        int block_size = K;
        int increment = flog2(block_size);
        int levels = flog2(rounded_n);
    
        int level = 0;
        for (level = 0; level < levels; level += increment)
        {
            int gstep = pow2(level);
            int blocks = (rounded_n / gstep) / block_size;
            if (blocks == 0)
                blocks = 1;
            assert(blocks != 0);
            assert(blocks == 1 || blocks % 2 == 0);
            int threads = blocks * THREADCOUNT;
    
            extent<1> e(threads);
            grid<1> g(e);
            parallel_for_each(acc, g.tile<THREADCOUNT>(),
                [=](tiled_index<THREADCOUNT> idx) restrict(direct3d)
            {
                tile_static int s_data[THREADCOUNT];
    
                // Each thread loads one element from global to shared mem
                int bid = idx.tile[0];
                int tid = idx.local[0];
    
                unsigned int i = (bid * THREADCOUNT + tid + 1) * gstep - 1;
    
                //if (i >= n)
                //  return;
                //if (i < 0)
                //  return;
    
                s_data[tid] = g_data[i];
                idx.barrier.wait();
                int tidp1 = tid + 1;
    
                for (unsigned int s = 2; s <= THREADCOUNT; s <<= 1)
                {
                    // Do reduction in shared memory.
                    if ((tidp1 % s) == 0)
                        s_data[tid] += s_data[tid - s/2];
                    idx.barrier.wait();
                }
                idx.barrier.wait();
                g_data[i] = s_data[tid];
            });
        }
        for (int i = 0; i < n; ++i)
            std::cout << "i " << i << " " << g_data[i] << "\n";
    }
    
    
    
    int main()
    {
        int size = 1024 * 32;
        int * A = (int*)malloc(size * sizeof(int));
        for (int i = 0; i < size; ++i) A[i] = 1;
        reduce_serialA(A, size);
        return 0;
    }
    


    In this example, the program sorts an array of 32K integers, all 1's.  The sum is computed and stored in the last element in the array.  A tile is declared to be 512 elements.  The number of blocks is the length of the array divided by the tile size.  For each tile, the sum is computed in-place for 512 elements of the array.  Threads of a block load 512 elements into shared memory, then selective threads sum elements "one level at a time".  Since the input in this example is 32K, the sum is computed for 64 different blocks. (Note: this is NOT a very good implementation because there is a large number of warps with branch divergence.)

    At the next iteration of "for (level = 0; level < levels; level += increment)", the sum is now computed for 64 elements spread over 64 blocks.  The problem is that the number of elements to sum is 64, but the block size is 512 because it is fixed at compile time.  So, there are going to be more threads than the problem size.

    Here is exactly where the tile size set at runtime would be useful.  It would wonderful to create a parallel_for_each with a grid of 64 elements, and a tile size of 64 (i.e., one block):

            // threads = 64 by computation.
            parallel_for_each(acc, g.tile<1>(threads),
                [=](tiled_index<1> idx) restrict(direct3d)

    Unfortunately, that's not possible.  Instead, the tile size must be a constant, and guard statements must--somehow--be inserted.

     

                if (i >= n)
                  return;
                if (i < 0)
                  return;
    

     

    If you look at the parallel code, you will see that the guards are commented out.  That's because it actually is illegal in C++ AMP.  Uncommenting those statements yields the error:  

    error C3561: tile barrier operation found in varying flow control when compiling the call graph for the entry function for a direct3d target at ...

    Although it is not exactly kosher in CUDA either, the CUDA compiler accepts the statements, and the code seems to work fine the many times I have run it!  I suspect that it works because CUDA relaxes the condition for threads that have "exited".  I haven't quite figured out how to write the guards so that the synchronization is not conditional, but I'll figure it out.

    Ken

     

     


    • Editado Ken Domino sábado, 05 de noviembre de 2011 11:23 Fixed English
    •  
  • martes, 08 de noviembre de 2011 17:07
     
     Respondida Tiene código

    Thank you Ken for the example. One caveat: I’m not sure I totally understood all the index math. Still, I think that the problem here perhaps has something very common about it, which is actually the need to do something special when reaching the bottom of a reduction. i.e., when you get to data that is smaller than 512 elements, you have to do something special, and it doesn’t really matter how efficient that thing is, but on the other hand, it’s important that it doesn’t get in the way of the kernel that needs to do deal with the big data…

     

    So if you end up generating code that can handle any kernel size, which is what you’ll get with dynamic tile sizes, you’ll be sacrificing the code quality of the important case (kernel size == 512) for the non-important case (kernel size < 512).

     

    But still, if I’m following the logic of the sample correctly, if you are willing to have a dynamic overflow test (i > n) in the kernel than you only need one kernel, and it could deal with the situation where the overall data size is smaller than 512, the same way that it would deal with data that isn’t divisible by 512. I think in this case you’ll just have to change this line:

     

    s_data[tid] = g_data[i];

     

    into

     

    s_data[tid] = i < n ? g_data[i] ? 0;

     

    It may still be more efficient to have two kernels, one that never has to do deal with boundary conditions, and one that cleans up the leftover, even if you can only apply the more efficient kernel to the first one or two levels.

     

    You could use templates to express those specializations: wrap the parallel_for_each call in a template function like template <int tile_size, bool evenly_divided> void kernel(…);

     

    Then you can switch based on the dynamic number of threads you have

     

                    if (threads > 512 and threads%512==0) kernel<512,true>(…);

                    else if (threads > 512 /* or some other threshold number */) kernel<512,false>(…);

                    else  kernel<64,false>(…);

     

    In the kernel you’ll have something like

     
    s_data[tid] = (evenly_divided || i < n) ? g_data[i] ? 0;

     

    This will result in the code which is instantiated with evenly_divided==true not having a dynamic branch. One thing to watch out for (in addition to the added complexity of the code), is the increase in compilation time due to the proliferation of instantiated kernls. You win some, you lose some…

     

     


    Yossi Levanoni, Principal Architect Parallel Computing Platform, Microsoft
  • lunes, 21 de noviembre de 2011 21:38
     
     

    Ken

    After reading your post closely I noticed you would like a parallel for loop.

    At the moment OpenMP is the only thing I am aware of that supports parallel for

    That would use each core on the CPU only, never noticed quite the same for other packages.

     

    OpenMP is CPU oriented and might be helpful in combination with your other code.

    Hope this a helpful suggestion

     


    Windows MVP 2010-11, XP, Vista, 7. Expanding into Windows Server 2008 R2, SQL Server, SharePoint, Cloud, Virtualization etc. etc.

    Hardcore Games, Legendary is the only Way to Play

    Developer | Windows IT | Chess | Economics | Vegan Advocate | PC Reviews

  • lunes, 21 de noviembre de 2011 22:43
     
     Respondida

    Ken

    After reading your post closely I noticed you would like a parallel for loop.

    At the moment OpenMP is the only thing I am aware of that supports parallel for

    That would use each core on the CPU only, never noticed quite the same for other packages.

     

    OpenMP is CPU oriented and might be helpful in combination with your other code.

    Hope this a helpful suggestion

     


    Windows MVP 2010-11, XP, Vista, 7. Expanding into Windows Server 2008 R2, SQL Server, SharePoint, Cloud, Virtualization etc. etc.

    Hardcore Games, Legendary is the only Way to Play

    Developer | Windows IT | Chess | Economics | Vegan Advocate | PC Reviews

    Yes, OpenMP directives tell the compiler to parallelize for-loops, and could be used to compute the tail-end of an algorithm like +-reduce on the CPU.  That might be useful.  But, the issue is one of aesthetics: code must inserted to handle problems smaller than the size of a tile.  There are two places one could do this: (1) guard statements in the kernel; or (2) after the parallel_for_loop, to solve the remainder of the problem that is smaller than a tile size on the CPU (serially, or using OpenMP).  In either case, the solution is now cluttered with special-case code in what could be very nice, clean code.  It's just an annoying limitation of C++ AMP at the moment, and all the more frustrating because CUDA and OpenCL allow the size of the tile to be specified at runtime.  Oh well.... --Ken

    • Marcado como respuesta Ken Domino lunes, 21 de noviembre de 2011 22:43
    •