AMP GPU memory use

    Question

  • Hello,

    I have the following question. My code has the following structure:

    array_view<float, 1> a_gpu(n, my_data);

    for (int i = 0; i < ni; ++i) {
        parallel_for_each(a_gpu.extent, [=](index<1> idx) restrict(amp) {
            // body of lambda, which includes
            a_gpu[idx] += (some expression);
        });

        a_gpu.synchronize();
    }

    As you see, I have a p.f.e inside a loop. My p.f.e updates the array a_gpu. I would like to copy the data my_data to the GPU only once, then run the p.f.e. ni times, and then copy the data from a_gpu back to CPU memory. However, I could not make it work. I tried both array_view and array (of course I included &a_gpu between []), but every time the code breaks. Therefore, I had to insert the command a_gpu.synchronize(), which is executed after each p.f.e. Unfortunately, this slows the code down, since this command causes the data to be copied back to the CPU for each i. Is there any way to copy data to the GPU, update the data over the course of several p.f.e-s, and then copy the result back?

    Any help would be greatly appreciated!
    Thanks.


    Saturday, April 14, 2012 23:12

All replies

  • Hi Sasha-K,

    From the description of your scenario, you should call a_gpu.synchronize() once, outside the for loop, not after every p_f_e.

    You mentioned that "every time the code breaks. Therefore, I had to insert the command a_gpu.synchronize()". Could you be more specific about what breaks and how?

    If you can share a small repro example, we can take a look and help you figure out the issue. Please also share your GPU model, your driver version, and your OS version. If you compile your program from the command line, please share the exact command line you were using.

    Thanks,

    Weirong

    Sunday, April 15, 2012 0:45
  • Hello Weirong,

    Many thanks for your response. It is hard for me to tell exactly what happens. When I move a_gpu.synchronize() outside the for loop and run the code, the for loop runs for several iterations in a somewhat erratic fashion: faster, then slower, faster, etc., then the screen goes black. When the screen comes back again, I get a message "Display driver stopped working and has recovered ...", and the program is aborted. I have a GeForce GTX 590 (driver version 8.17.12.8562) and Windows Vista. I use Visual Studio 11 to compile the code, but then just click on the executable to run it. With a_gpu.synchronize() outside the for loop, the code aborts regardless of whether I run it from inside Visual Studio or by clicking on the executable. Another piece of information: if I compile the code in Debug mode, then the code does not abort. However, this does not help me, since it runs much slower when compiled in Debug mode. The problem occurs when the code is compiled in Release mode.

    Also, what is a "repro example"?

    Best,

    Sasha K.

    Sunday, April 15, 2012 2:52
  • Hi Sasha K,

    From "Display driver stopped working and has recovered ...", it seems you experienced a TDR. Please read "Handling TDRs in C++ AMP" for information on detecting and recovering from a TDR. I would also recommend reading "Understanding accelerator_view queuing_mode in C++ AMP", especially the "Choosing between immediate and deferred queuing_mode" section, which explains how you could use "queuing_mode_immediate" to help deal with the TDR issue for long-running commands. Please let us know if you have more questions.

    FYI, a "repro example" is a simplified program that can reproduce the problems you encountered. :)

    Cheers,

    Lingli

    Sunday, April 15, 2012 5:30
    Owner
  • Hi Lingli,

    Thank you. I read the materials that you referred me to. Yes, it does look like I have a TDR issue. Hopefully, if I set “queuing_mode_immediate”, the problem will be resolved. Unfortunately, I did not see any code snippets/samples showing how to actually set this queuing mode. Here is the structure of my code again for your convenience:

    array_view<float, 1> a_gpu(n, my_data);

    for (int i = 0; i < ni; ++i) {
        parallel_for_each(a_gpu.extent, [=](index<1> idx) restrict(amp) {
            // body of lambda, which includes
            a_gpu[idx] += (some expression);
        });

        a_gpu.synchronize();
    }

    What command(s) and at what place should I insert in order to set “queuing_mode_immediate”? I am sorry if this is a simple question. I am not a computer expert, but I do need parallel computing for my project.

    Best regards,

    Sasha K

    Sunday, April 15, 2012 12:39
  • Hi Sasha-K,

    To use queuing_mode_immediate, you first need to create an accelerator_view:

    accelerator_view myAv = accelerator().create_view(queuing_mode_immediate);

    Then, when you call parallel_for_each, pass this accelerator_view as the first argument (instead of relying on the default):

    parallel_for_each(myAv, a_gpu.extent, [=](index<1> idx) restrict(amp) {
        // body of lambda, which includes
        // a_gpu[idx] += (some expression)
    });

    Please try this out. 
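Putting the two snippets above together with the earlier advice to synchronize once after the loop, a complete version might look like the following sketch. The function name `accumulate`, the `increment` parameter (standing in for "some expression"), and the `HAVE_CPP_AMP` macro are illustrative names, not part of the original code. The CPU fallback is only there so the sketch still compiles where `<amp.h>` is unavailable; it is assumed to produce the same result as the GPU path.

```cpp
#include <vector>

// C++ AMP ships only with Visual C++; detect it so the sketch builds elsewhere too.
#if defined(_MSC_VER) && __has_include(<amp.h>)
#include <amp.h>
#define HAVE_CPP_AMP 1
#endif

// Adds `increment` to every element of `data`, `ni` times.
// (hypothetical helper; `increment` stands in for "some expression")
void accumulate(std::vector<float>& data, int ni, float increment) {
#ifdef HAVE_CPP_AMP
    using namespace concurrency;

    // Immediate queuing mode: each kernel is submitted to the device right away.
    accelerator_view myAv = accelerator().create_view(queuing_mode_immediate);

    // Wrap the host data once; the runtime copies it to the GPU when first needed.
    array_view<float, 1> a_gpu(static_cast<int>(data.size()), data.data());

    for (int i = 0; i < ni; ++i) {
        parallel_for_each(myAv, a_gpu.extent, [=](index<1> idx) restrict(amp) {
            a_gpu[idx] += increment;
        });
    }

    // Copy the result back to host memory once, after all kernels were launched.
    a_gpu.synchronize();
#else
    // CPU fallback with the same arithmetic, used when C++ AMP is not available.
    for (int i = 0; i < ni; ++i) {
        for (float& x : data) {
            x += increment;
        }
    }
#endif
}
```

Calling `accumulate(v, 4, 0.5f)` on a vector holding 1, 2, 3 should leave it holding 3, 4, 5, whichever path is compiled.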

    However, you also mentioned that the code works fine in the Debug configuration without using a_gpu.synchronize(), so another possibility is that the code generated under the Release configuration hits a driver bug. If queuing_mode_immediate does not help, please try the following:

    1. Update your driver from http://www.nvidia.com/Download/index.aspx. There is a newer NVIDIA driver than the one you currently have. Then try your program again.

    2. If the new driver does not help, please run your program on the reference device:

        // using automatic queuing mode
        accelerator_view myAv = accelerator(accelerator::direct3d_ref).default_view;

        // or, if you want to try queuing_mode_immediate
        accelerator_view myAv = accelerator(accelerator::direct3d_ref).create_view(queuing_mode_immediate);

       Then use "myAv" as the first parameter to parallel_for_each. Note that the reference device is very, very slow, so you may want to reduce the size of a_gpu to a small number. See if the code hangs on the reference device. If so, it's likely a C++ AMP bug. If possible, please send a repro to us. We'd appreciate it.

    3. If the code works fine on the reference device, that might indicate a driver bug. If you happen to have a machine with a DirectX 11 AMD graphics card, please try your code on that card (with the latest driver). Again, if it's possible for you to send a repro to us, we can help try it out.

    4. Please note that Windows Vista is not a supported OS for C++ AMP. If possible, please try Windows 7/Windows Server 2008 R2, or Windows 8.

    Thanks,

    Weirong


    Sunday, April 15, 2012 18:34
  • Hi Weirong,

    Your first suggestion worked very well, so I did not need to do the additional steps 1-4. Many thanks!! The code is running much faster now. Sorry for the confusion, I have Windows 7, not Vista. I had Vista until very recently.

    A related question. I am doing pretty hefty calculations inside the p_f_e. At first, I created a bunch of arrays (9, to be exact) using array_view and passed them on to the kernel, but the system did not like it. It said that the maximum number is 8. I managed to rewrite my code to use only eight arrays. Do you know if 8 is the absolute maximum, or is there a setting that can change this number? In principle, one can work around this limitation by artificially combining several arrays into one, but that is not good programming practice and makes the code unreadable.

    Best regards,
    Sasha-K

    Sunday, April 15, 2012 20:44
  • Hi Sasha-K,

    To clarify: was your problem solved by using an accelerator_view with queuing_mode_immediate, or by upgrading to the latest driver?

    As for your question, we are going to publish a blog post about it, but let me give you some information here first. C++ AMP is built on top of DirectX, and an array_view is realized using a DirectX buffer. When you launch a parallel_for_each, each buffer needs to be bound to either a DX UnorderedAccessView (which allows read-write access) or a DX ShaderResourceView (which allows read-only access). Let's refer to these views as UAV and SRV respectively. The number of UAVs and SRVs allowed is determined by the DX11 Feature Level. For DX11 Feature Level 11.0, the number of UAVs allowed is 8 and the number of SRVs allowed is 128. For Feature Level 11.1, the UAV limit is increased to 64. Feature Level 11.1 requires a graphics card that supports 11.1 and a display driver that implements at least WDDM 1.2. You need to check with the hardware vendor whether your card supports 11.1 and whether the driver implements WDDM 1.2.

    Now, your card/driver seems to allow 11.0, so you hit the limit of 8. Before you explore combining multiple arrays into one, please consider whether each array_view you pass to the parallel_for_each actually needs to be written. If it is actually read-only, then instead of creating array_view<float, 1> you should create array_view<const float, 1>. Such an array_view is read-only and can therefore be bound to an SRV, which has a much higher limit (128, as mentioned above).
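As a rough way to reason about these limits, the sketch below counts how many UAV and SRV bindings a kernel would need and compares them with the numbers quoted above. It is a simplified model only, not how the C++ AMP runtime performs the check, and it ignores any slots the runtime might use internally. The names `SlotLimits`, `kFeatureLevel11_0`, `kFeatureLevel11_1`, and `fits` are hypothetical, and the SRV limit of 128 at Feature Level 11.1 is an assumption, since the reply only states the new UAV limit.

```cpp
#include <cstddef>
#include <vector>

// Binding limits per feature level, using the numbers quoted in this thread.
struct SlotLimits {
    std::size_t max_uav;  // writable buffers (UnorderedAccessView)
    std::size_t max_srv;  // read-only buffers (ShaderResourceView)
};

constexpr SlotLimits kFeatureLevel11_0{8, 128};
// Assumption: the SRV limit stays at 128; the thread only gives UAV = 64 for 11.1.
constexpr SlotLimits kFeatureLevel11_1{64, 128};

// One entry per array_view captured by the kernel:
// true if it is declared with a const element type (read-only), false otherwise.
bool fits(const std::vector<bool>& is_read_only, SlotLimits limits) {
    std::size_t uav = 0;
    std::size_t srv = 0;
    for (bool read_only : is_read_only) {
        if (read_only) {
            ++srv;  // array_view<const T, N> can be bound as an SRV
        } else {
            ++uav;  // array_view<T, N> needs a UAV
        }
    }
    return uav <= limits.max_uav && srv <= limits.max_srv;
}
```

Under this model, nine writable array_views do not fit at Feature Level 11.0, while two writable and seven read-only ones do.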

    Also, if the data is indeed read-only, you should use array_view<const float, 1> regardless of the UAV limit. It has several benefits: (1) it indicates to the runtime that there is no need to copy the data back after the p_f_e (since it is not written in the p_f_e); (2) the compiler may generate more efficient code knowing the data is read-only; and (3) it can also help the DirectX scheduler, since two commands that only read from a buffer have no real dependence.
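To illustrate the read-only declaration, here is a minimal sketch of a kernel with two inputs and one output. The function `scaled_add`, its parameters, and the `HAVE_CPP_AMP` macro are made-up names for this example; the CPU fallback exists only so the sketch compiles where `<amp.h>` is not available, and all three vectors are assumed to have the same length.

```cpp
#include <vector>

// C++ AMP ships only with Visual C++; detect it so the sketch builds elsewhere too.
#if defined(_MSC_VER) && __has_include(<amp.h>)
#include <amp.h>
#define HAVE_CPP_AMP 1
#endif

// Computes out[i] += scale[i] * in[i]; only `out` is written by the kernel.
// (hypothetical helper; assumes in, scale and out have equal sizes)
void scaled_add(const std::vector<float>& in,
                const std::vector<float>& scale,
                std::vector<float>& out) {
    const int n = static_cast<int>(out.size());
#ifdef HAVE_CPP_AMP
    using namespace concurrency;

    // Read-only inputs: const element type, so no write access is requested
    // and nothing needs to be copied back for them.
    array_view<const float, 1> in_gpu(n, in.data());
    array_view<const float, 1> scale_gpu(n, scale.data());

    // The only buffer the kernel writes to.
    array_view<float, 1> out_gpu(n, out.data());

    parallel_for_each(out_gpu.extent, [=](index<1> idx) restrict(amp) {
        out_gpu[idx] += scale_gpu[idx] * in_gpu[idx];
    });

    // Copy the single writable buffer back to host memory.
    out_gpu.synchronize();
#else
    // CPU fallback with the same arithmetic.
    for (int i = 0; i < n; ++i) {
        out[i] += scale[i] * in[i];
    }
#endif
}
```

Only `out_gpu` counts against the writable-buffer limit here; the two inputs are declared read-only.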

    Finally, I'd like to encourage you to always use the latest driver provided by the hardware vendor.

    Thanks,

    Weirong


    Sunday, April 15, 2012 21:17
  • Hi Weirong,

    Thank you very much for providing such valuable information. Yes, most of my arrays are indeed read-only, so making them constant will indeed allow me to pass a lot more arrays to the kernel. I saw that somebody is already writing a book about C++ AMP. I hope that this information (and maybe a lot more useful advice) is included in there.

    Thanks again for taking the time to answer my questions,

    Sasha-K

    Monday, April 16, 2012 0:32
  • Hi Sasha

    Was your problem solved by using a different queuing_mode or by updating the driver?

    Cheers

    Daniel


    http://www.danielmoth.com/Blog/

    Monday, April 16, 2012 3:47
    Owner
  • Hi Daniel,

    Yes. My problem was solved by using a different queuing_mode, so I did not even have to update the driver.

    Thank you,

    Sasha

    Tuesday, April 17, 2012 1:17
  • Hi Sasha

    Thanks. I am a bit puzzled by the queuing_mode being the solution, because you said this worked fine under the Debug configuration. The TDR that the queuing_mode apparently helped you with should have occurred in Debug mode too (even more so).

    Would you mind sharing a complete code listing that demonstrates everything working under Debug and failing in Release without the queuing_mode change? I would really appreciate that. You can narrow the code down by removing parts that are unnecessary for the repro, but it should still compile and run for us without having to add code.

    Also, can you try running under REF without the queuing_mode change to confirm that the TDR does not occur in that scenario: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/11/direct3d-ref-accelerator-in-c-amp.aspx

    Cheers
    Daniel


    http://www.danielmoth.com/Blog/

    Tuesday, April 17, 2012 4:35
    Owner