How to create independent local memory in GPU for each thread?

  • Question

  • Suppose I have the following code:

    accelerator_view gpu_acc = accelerator().default_view;
    extent<1> ext(50);
    array<float, 1> in(ext, gpu_acc);
    array<float, 1> out(ext, gpu_acc);
    parallel_for_each(gpu_acc, ext, [=, &in, &out](index<1> idx) restrict(amp)
    {
        create_data(in);
        out[idx] = in[idx];
    });

    In my previous mini-projects, the input data was computed in CPU memory and then transferred to a local GPU memory. Therefore each thread had the same input data from the CPU.

    This time I want each GPU thread to compute its own input data. That means that each GPU thread should have its own local memory. So how do I do that?

    If I have 50 threads and I want each thread to have its own local memory for

    array<float, 1> in and array<float, 1> out, as shown above,

    how can this be done? Is the above code the correct way to do it?

    Is the above code enough to tell the compiler that each thread should have its own local input and output memory in the GPU?

    Do I need additional code? If so, please show me how with a code example. Thank you.

    • Edited by LaParma Friday, March 8, 2013 7:22 AM
    Friday, March 8, 2013 7:16 AM

Answers

  • As far as I know, in DirectX11 shader model 5.0, each shader can have up to 4K local registers (each has 4 32-bit components), thus 64KB in total. Currently C++ AMP is built upon DX, so it has similar limits.

    Note that this is an upper bound. Each piece of hardware has its own resource limits, which affect scheduling capability. In general, the more resources (e.g. local registers) each thread needs, the fewer threads the hardware can schedule simultaneously, and thus the less parallelism.

    In addition, local arrays that require dynamic indexing (e.g. in[i]) become indexable temps in shader bytecode, which hardware vendors handle in their own ways; usually they are spilled to global memory. So in general, you should avoid dynamically indexed local arrays for performance reasons.

    In short, I would encourage you to carefully redesign your algorithm to avoid the code pattern you showed above if you want to achieve good performance.
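    To make that concrete, here is a minimal sketch of the kind of redesign meant above: instead of staging per-thread input in a local array, each thread derives its value directly from its index and keeps it in a register. The make_input helper is hypothetical, standing in for whatever per-thread work create_data was meant to do.

    ```cpp
    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    // Hypothetical per-thread generator: derives this thread's input
    // from its index alone, so no thread-local array is needed.
    float make_input(index<1> idx) restrict(amp)
    {
        return idx[0] * 0.5f;
    }

    std::vector<float> run()
    {
        extent<1> ext(50);
        array<float, 1> out(ext);

        parallel_for_each(ext, [&out](index<1> idx) restrict(amp)
        {
            float in = make_input(idx); // lives in a register, not an array
            out[idx] = in * in;         // e.g. square it
        });

        std::vector<float> result(50);
        copy(out, result.begin());      // bring results back to the host
        return result;
    }
    ```

    Because each thread touches only scalar locals, nothing is dynamically indexed and nothing needs to spill to global memory.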

    Friday, March 15, 2013 6:24 AM
    Moderator

All replies

  • Hi LaParma,

    Looks like you want an isolated memory area for each thread. If so, use registers. It's as simple as declaring local variables inside the kernel, like:

    parallel_for_each(gpu_acc, ext, [=, &in, &out](index<1> idx) restrict(amp)
    {
        create_data(in);
        float local_var = in[idx]; // <-- local memory
        out[idx] = in[idx];
    });

    You need to be careful not to declare too many local variables, which would exceed the limit of register storage available to each thread. This limit may differ between GPUs; refer to the specific GPU's spec for how much you can use.

    Basically, you can just access the global memory represented by the array object. Each thread finds its own location in global memory via its index. If you need to reuse that data inside a thread multiple times, then using a register, represented by a local variable, will reduce the access time.

    Friday, March 8, 2013 8:15 AM
    If the memory each thread needs is bigger than the local register limit, another way is to create a 2D array: one dimension holds the per-thread memory, and the other dimension is distributed among the threads. For example, if your algorithm requires M threads in parallel and each thread needs to work on N integers, you can create array<int, 2> in(M, N) and pass extent<1>(M) to p_f_e. Make sure the memory accesses are coalesced. Also, if the memory each thread requires exceeds the local register limit, that usually indicates you haven't fully exploited the available parallelism. It's better to re-examine your algorithm to see whether you can design it differently to fit the GPU execution model better.
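    A sketch of this layout, under the assumption that the per-thread work is a simple row reduction (sum_rows and the sizes are illustrative, not from the original post). For coalescing, the thread index is placed in the last, fastest-varying dimension, i.e. the array is declared (N, M) and each thread walks its own column:

    ```cpp
    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    // M threads, each working on N integers of its own.
    std::vector<int> sum_rows(int M, int N, const std::vector<int>& host)
    {
        // host holds N * M ints, thread index in the last dimension
        // so adjacent threads read adjacent words (coalesced).
        array<int, 2> in(N, M, host.begin());
        array<int, 1> out(M);

        parallel_for_each(extent<1>(M), [&in, &out, N](index<1> idx) restrict(amp)
        {
            int sum = 0;
            for (int j = 0; j < N; ++j)
                sum += in(j, idx[0]);   // each thread walks its own column
            out[idx] = sum;
        });

        std::vector<int> result(M);
        copy(out, result.begin());
        return result;
    }
    ```

    Each thread's N integers live in global GPU memory, but no thread ever touches another thread's column, so the data is logically private per thread.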

    Friday, March 8, 2013 3:27 PM
    Moderator
  • Hi Li

    float local_var = in[idx];  where local_var is just one local register of size float, the same way we do it in C/C++

    However what I am thinking of is something like this.

    parallel_for_each(gpu_acc, ext, [=](index<1> idx) restrict(amp)
    {
        float in[100];   // <-- is this allowed in Visual Studio 2012?
        float out[100];  // <-- is this allowed in Visual Studio 2012?

        create_data(in);

        /* data is copied from in[] to out[] */
        copy_data(in, out);
    });
    

    Is this allowed in C++ AMP?

    Honestly, I don't know how large a local register limit each thread is allowed. How can I find this out?

    I have an Nvidia GeForce GTX 560 Ti card in my computer. My GeForce 560 Ti has 8 Streaming Multiprocessors (SMs). Each SM contains 48 CUDA cores (so 8 * 48 = 384 cores in total), and each core can run 32 threads. The dedicated memory is 1,278,400 KB, or about 1.21 GB.

    Friday, March 8, 2013 3:59 PM
  • Hi Lingli

    So basically this code means that only two GPU memory arrays of size 50 are created, and all 50 threads have to share these two arrays?

    extent<1> ext(50);
    array<float, 1> in(ext, gpu_acc);
    array<float, 1> out(ext, gpu_acc);

    and NOT each thread having its own two local GPU arrays of size 50.

    I want to confirm this.

    Friday, March 8, 2013 4:09 PM
  • array<T,N> represents global memory on the accelerator that all threads share. So your code above creates 2 data containers in global GPU memory of size 50 each, and they are visible and shared by all threads you launch via p_f_e.
    Friday, March 8, 2013 5:23 PM
    Moderator
  • Hi

    Thank you for the confirmation. There is also another question I posted right before the second question, if you can see it. It's about finding how large a local register limit each thread is allowed. Any help on this would be great.

    Friday, March 8, 2013 6:49 PM
    This is architecture dependent, so your only way of seeing how many registers are (theoretically) available per lane is to work back from public specifications in terms of register file size, ALU count, etc., or to use the IHVs' programming manuals, which tend to list this. Note that you can't always easily reason about what gets stuffed into registers, because the compilers tend to work their own shuffling fu. As for your pattern of declaring thread-local arrays: that is valid in C++ AMP, but it is an anti-pattern. You can't really index registers and, by extension, you can't really (mostly) do dynamically indexed in-register arrays. For something like GCN (the 79xx Radeons from ATI, for example), you can sort of do it within a particular set of constraints, but not in very interesting ways (the indices have to be known at compile time, for example).
    Friday, March 8, 2013 7:35 PM
  • Thank you Alex for your answer.

    Lingli,
     
    1) When you say that array<T,N> represents global memory on the accelerator that all threads share, by global memory did you mean L2 Cache?

    In Fermi architecture, the L2 cache has 768 KB which is enough for my mini-project.


    2) My GeForce GTX 560 Ti is based on Fermi architecture, where each Streaming Multiprocessor (SM) has 64 KB of on-chip memory that can be configured as 48 KB of Shared memory with 16 KB of L1 cache or as 16 KB of Shared memory with 48 KB of L1 cache (per SM).

    Based on what I would like to do, I don't have a need for the default 48 KB of shared memory. I would like to reduce it to 16 KB of shared memory and instead increase the L1 cache. My code relies more on L1 cache than on shared memory.

    So how can I manually configure the L1 cache from the default 16 KB to 48 KB in Visual Studio 2012? Shared memory and L1 cache are configurable, as shown in CUDA programming.


    • Edited by LaParma Friday, March 8, 2013 8:57 PM
    Friday, March 8, 2013 8:44 PM
    LDS configurability is not available outside of CUDA, mainly because all other APIs/programming interfaces enforce a 32 KB lower bound on local memory size. So, in brief, you can't do that in C++ AMP, OCL, DX Compute, or GL Compute, AFAIK. Also, note that local memory (tile_static, shared memory, LDS, __local, etc.) != registers; it's lower down the memory hierarchy pyramid. You can have dynamically indexed arrays here, and if you want to do AoS (please don't), you could also probably have an array of arrays so that each lane gets its own discrete array (basically fusing Lingli's idea with what you were trying to suggest as the goal). Please don't do this, though; AoS is pretty evil.
    Friday, March 8, 2013 8:50 PM
  • What does abbrev. LDS mean?

    "LDS configurability is not available outside of CUDA because all other APIs/programming interfaces enforce a 32KB lower bound on local memory size."

    Are you saying that I cannot configure the memory for L1 cache and shared memory in Visual Studio 2012 at all?

    If the API interfaces enforce a 32 KB lower bound, then I could still reduce from 48 KB to 32 KB, right? Some projects require more shared memory, while other projects might require more L1 cache. Being able to configure this memory is extremely important.

    I am not going to use AoS-style code. I am going to use a 2D array<T,N> instead. Now, when Lingli said that array<T,N> represents global memory on the accelerator that all threads share, did she mean the L2 cache or DRAM?

    • Edited by LaParma Friday, March 8, 2013 9:34 PM
    Friday, March 8, 2013 9:13 PM
  • LDS is Local Data Store, that's the name AMD uses to describe the hardware unit where tile_static memory is located.

    Yes, the shared memory size is not configurable in C++ AMP. Using the common 32 KB denominator ensures that the same program will execute correctly on all DirectX 11 capable GPUs. Having said that, the GPU driver has the freedom to detect the size of the tile_static memory used in the kernel (it has to be declared statically) and adjust the cache size accordingly (however, I am not aware of any IHV taking this optimization opportunity).

    I'm not sure if I would agree with your judgment that it is extremely important. Could you elaborate on that?

    By saying that array<T,N> is located in global memory, we mean the lowest level of the memory hierarchy - DRAM. Of course accesses to such memory will follow the usual caching behavior, so would utilize L2 and L1 caches where appropriate.
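    For completeness, a minimal sketch of how statically declared tile_static memory looks in C++ AMP; the 64-element tile size and the reverse-within-tile kernel are arbitrary illustrative choices, not from this thread:

    ```cpp
    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    // Reverses each 64-element tile of a 256-element array through a
    // tile_static buffer shared by the 64 threads of that tile.
    std::vector<float> tile_reverse(const std::vector<float>& host)
    {
        extent<1> ext(256);
        array<float, 1> in(ext, host.begin());
        array<float, 1> out(ext);

        parallel_for_each(ext.tile<64>(), [&in, &out](tiled_index<64> tidx) restrict(amp)
        {
            tile_static float cache[64];            // size declared statically, as required
            cache[tidx.local[0]] = in[tidx.global];
            tidx.barrier.wait();                    // whole tile has filled the cache

            out[tidx.global] = cache[63 - tidx.local[0]];
        });

        std::vector<float> result(256);
        copy(out, result.begin());
        return result;
    }
    ```

    The tile_static buffer lives in the on-chip shared memory discussed above, one copy per tile, not per thread; the barrier is what makes it safe for each thread to read a slot another thread wrote.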

    Monday, March 11, 2013 6:42 PM
    Moderator