What do you want in the next version of C++ AMP? – we are listening

    General discussion

  • Visual Studio 2012 includes the first release of the C++ AMP technology, and hopefully by now you have had a chance to learn about it and, even better, to try your hand at it. We would like you to know that this first release is just a beginning, and our team is actively planning new features and improvements for the next version of C++ AMP.

    If you have a feature request or suggestion for the next version of C++ AMP, we would love to hear about it. Details of which scenarios your suggestion would enable, or how the suggested feature would simplify your use of C++ AMP, would be very useful and are encouraged. Several of you have already been sharing feedback with us on our MSDN forum – it has been duly noted and we are sincerely thankful.

    While we cannot guarantee that all feature suggestions and requests will be fulfilled, we promise to sincerely listen to your feedback and suggestions and include them in our planning process for the next version of C++ AMP. So if there is a feature or piece of functionality you wish C++ AMP had, please let us know by responding below on this thread.

    Looking forward to your responses.

    C++ AMP Team


    Amit K Agarwal


    Wednesday, September 19, 2012 9:41 PM

All replies

  • I am using C++ AMP in my WinRT app for some image manipulation algorithms (image filters). I would like to port this app to Windows Phone 8 and I would LOVE it if C++ AMP supported it.

    I know that WP8 hardware (feature level 9_3) will not provide GPU processing, but I understand that the algorithms would automatically fall back to something with performance similar to PPL. That would be totally great for my use case because I could use the same code, optimized for each platform.

    Thanks.

    Thursday, September 20, 2012 4:34 PM
  • Drive C++ AMP into other platforms (MAC OS, Linux).

    That's the best you can do.

    Thursday, September 20, 2012 9:28 PM
  • 1. Profiler!

    The biggest thing missing currently is a good profiler and tooling for C++ AMP code. CUDA has a big advantage here with "Parallel Nsight".

    2. optimized cross accelerator view copies when possible

    If you want to display data using another Direct3D device you currently have to perform a very slow device->host->host->host->device copy (that is 3x host!).

    3. nested kernels (i.e. "dynamic parallelism")

    This can allow adaptive sampling for some algorithms, e.g. ray-tracing.

    4. invocation interop with C++ AMP views

    Currently, what you can do with the interop support is very limited, since you often cannot do anything useful with the objects you receive without synchronizing with the AMP Direct3D device thread.

    5. Create unorm without performing clamping for values that are known to be within bounds.

    6. Fix copy_async continuation design problem

    copy_async(..).then() is currently quite inefficient since it depends on the CPU load, i.e. you can get an unnecessary delay between the copy finishing and the continuation being invoked, so the GPU can stand idle.

    7. Inline DirectCompute?

    8. OpenGL Interop?




    • Edited by Dragon89 Tuesday, October 02, 2012 11:09 PM
    Friday, September 21, 2012 7:53 AM
  • I forgot to add: we want libraries which utilize AMP.

    Something like Intel's MKL / IPP.

    That would make this great!!!

    Sunday, September 23, 2012 8:36 AM
  • I agree with Royi. Until I can write portable AMP code, my usage of AMP will be very limited.
    Monday, September 24, 2012 2:15 PM
  • These features are likely both dependent on an update to DirectCompute before C++ AMP can support them, but I'd like to see these hardware features exposed:

    • Nested Kernels / Dynamic Parallelism
    • Support for shared virtual memory (sharing pointers between host and device)

    There are also some areas for improved D3D interop that would be interesting but may not be possible without D3D updates:

    • Allow C++ AMP kernels to interop with ID3D11Predicate objects (e.g. an AMP kernel can do culling / visibility computation and set predicates which control rendering later in the GPU command queue)
    • Finer control over sequencing / dependencies between C++ AMP kernels and D3D commands 
    Monday, September 24, 2012 11:06 PM
  • What about entire AMP classes, so you can have whole objects on the accelerator? That would make it easier to convert existing large code bases.
    Friday, September 28, 2012 12:26 PM
  • Similarly for ease of porting existing codebases, like our financial library, support for virtual functions.
    Saturday, September 29, 2012 1:12 PM
  • Agree with most of the above.

     For me it's anything that makes it easier to port existing C++ code, so support for virtual functions would be great.

    The easier porting gets, the more folks can pitch AMP to their managers as the best solution for accelerating code.

    Monday, October 01, 2012 8:27 AM
  • Hi Royi A,

    Several C++ AMP library projects have been initiated by our team on CodePlex, including the C++ AMP BLAS library. We would love to hear your feedback on these library projects, or suggestions for additional C++ AMP libraries.

    - Amit


    Amit K Agarwal

    Tuesday, October 02, 2012 7:41 PM
  • Hi John,

    C++ AMP already supports user-defined types (classes and structs) in parallel_for_each kernels, with some restrictions (pointer members, virtual functions etc.) - the full set of restrictions is described in the C++ AMP open specification. The C++ AMP random number generator library is one example that uses user-defined types in kernels. Please let us know if you find the support for user-defined types in its current form (in VS 2012) limiting for your usage scenario.

    Thanks for the feedback - keep it coming.

    -Amit


    Amit K Agarwal

    Tuesday, October 02, 2012 7:43 PM
  • Great feedback Dragon89 - thanks.

    Some clarifying questions.

    Regarding "4. invocation interop with C++ AMP views". As you may be aware, synchronization between C++ AMP commands and calls to the underlying DirectX objects (obtained through C++ AMP interop methods) is necessary due to the DirectX threading model. However, the current implementation does not allow C++ AMP users to synchronize DirectX calls with any asynchronous command submission done by the C++ AMP runtime. We plan to fix this in the next version by providing means to lock/unlock the concurrency::accelerator_view. Does this address your concern?

    Regarding "7. Inline DirectCompute?". I assume you are referring to the ability to inline HLSL code in C++ AMP kernels. It would be helpful if you could share the compute scenario where you think this would be useful. We would ideally like to close any functional gaps w.r.t. HLSL in C++ AMP, so that users can express their computations in C++ AMP without having to learn and mix in HLSL code.


    Amit K Agarwal

    Tuesday, October 02, 2012 8:38 PM
  • Thanks for the feedback mattnewport.

    Can you elaborate further on "Finer control over sequencing / dependencies between C++ AMP kernels and D3D commands" to help me understand the suggestion better? As you may be aware, synchronization between C++ AMP commands and calls to the underlying DirectX objects (obtained through C++ AMP interop methods) is necessary due to the DirectX threading model. However, the current implementation does not allow C++ AMP users to synchronize DirectX calls with any asynchronous command submission done by the C++ AMP runtime. We plan to fix this in the next version by providing means to lock/unlock the concurrency::accelerator_view. Does this address your concern?


    Amit K Agarwal

    Tuesday, October 02, 2012 8:39 PM
  • The ability to lock / unlock the concurrency::accelerator_view sounds like it will be useful for some scenarios but ideally I'd like to see even finer grained control and synchronization. Getting that kind of support might be something for further in the future (it might well require a new version of D3D) but the kind of things I'd like to be able to do are:

    • Schedule an AMP task as a continuation of some D3D event on the GPU. For example, create a D3D event query and insert it after rendering a scene, then create an AMP task that kicks off when that query completes to perform post processing on the render target. Ideally this could be optimized by the driver / GPU to just ensure that the GPU runs the AMP task after the GPU has finished writing to the render target without having to involve the CPU in handling the event.
    • Schedule a D3D draw as dependent on an AMP task. For example, schedule the final draw of the post processed scene render target to the back buffer to happen after the AMP post processing task has completed by inserting a new event Query type in the D3D command buffer that is signaled by the AMP task completing, or schedule an AMP task to update the vertex data for a particle system and schedule a D3D draw of the particles to only run once the AMP task has completed.
    • Allow an AMP task on the GPU to set the value of a D3D predicate to control whether a draw will be executed or not. For example, an AMP task could do culling / visibility determination for objects in a scene and set predicates allowing the GPU to skip objects that are not visible without requiring CPU involvement / synchronization.
    Tuesday, October 02, 2012 10:33 PM
  •  "We plan to fix this in the next version by providing means to lock/unlock the concurrency::accelerator_view. Does this address your concern?"

    Sounds like an inefficient solution. Can't you provide something like the following to be able to keep the asynchronous nature of the accelerator_view?

    class accelerator_view
    {
    public:
        /* ... */

        template<typename F>
        auto execute_async(F&& func) -> task<decltype(func())>
        {
            /* execute "func" synchronized with the C++ AMP accelerator_view thread. */
        }

        template<typename F>
        auto execute(F&& func) -> decltype(func())
        {
            /* ... */
        }
    };
    • Edited by Dragon89 Tuesday, October 02, 2012 11:07 PM
    Tuesday, October 02, 2012 11:06 PM
  • I cannot port my CUDA and OpenCL code to C++ AMP because types like char and long long are not supported.

    http://www.codeproject.com/Articles/380399/Permutations-with-CUDA-and-OpenCL

    Please support char, short, long long and their unsigned counterparts in the next version.

    Thursday, October 11, 2012 7:24 AM
  • C++ AMP kernel disassembly (HLSL)
    Thursday, October 11, 2012 1:00 PM
  • The GPU debugger in VS2012 can be used to see the C++ AMP kernel disassembly, both for optimized and non-optimized codegen. One just needs to ensure that the application links against the debug version of the C++ AMP runtime (by using the /MDd or /MTd compiler switches).

    Following are the steps to see the disassembled optimized bytecode for C++ AMP kernels in the GPU debugger.

      • Build in Release mode
      • In “Project Properties >> C/C++ >> Code Generation >> Runtime Library” select “Multi-threaded Debug DLL” option (to link against the debug C++ AMP runtime, so the debugger can work).
      • Rebuild.
      • Hit F11 to step into the first line of the kernel, and open the disassembly window to see the HLSL bytecodes for the C++ AMP shader.

    Thanks for the feedback - keep it coming.


    Amit K Agarwal

    Friday, October 12, 2012 8:25 AM
  • A short vector math library would be very useful.

    -L

    Tuesday, October 16, 2012 9:36 PM
  • Hi LKeen,

    A C++ AMP short vector library is available in VS2012. We would love to hear from you regarding any new feature requests or suggestions for the library.

    -Amit


    Amit K Agarwal

    Wednesday, October 17, 2012 6:40 AM
  • Hi Amit, after reading this post:

    http://social.msdn.microsoft.com/Forums/is/parallelcppnative/thread/337ef772-81e8-41bf-83ad-72f322867c8b

    it seems as though the short vector library is present only for those who don't want to define their own vector types and doesn't necessarily vectorize under the hood. Is there a vectorized instruction I can use in AMP to efficiently compute the dot product between two float_4 vectors?

    -L

    Wednesday, October 17, 2012 3:47 PM
  • Hi LKeene,

    C++ AMP targets GPUs and is compiled to D3D11 Compute Shader bytecode. If the compiler isn't already doing vectorization when it compiles to bytecode then the GPU driver is likely to do it when compiling to hardware instructions if it makes sense for the target GPU architecture.

    Most D3D11 GPUs are scalar architectures anyway and so won't see any performance benefit from vectorized short vector code.



    • Edited by mattnewport Wednesday, October 17, 2012 4:55 PM
    Wednesday, October 17, 2012 4:54 PM
  • That is accurate.


    Amit K Agarwal

    Wednesday, October 17, 2012 7:39 PM
  • Hello Amit,

    A small additional 2c - perhaps it would be possible to add (sort of) dynamic allocation for tile_static memory? Whilst using template meta-programming to achieve the same result is possible, TMP is not necessarily the most understandable thing ever (so this causes issues when code gets flung at somebody else). There are certain cases where this can be quite useful: consider that you are building a list of indices of some sort, and its length is unknown at compile time.

    For the second cent, perhaps (I mentioned this to Daniel a (relatively) long time ago) you could also look at the Increment/Decrement Counter mechanism in DX CS? For cases when one needs simple counting (think something like a semaphore, for example), it would be easier and more intuitive to leverage the above DX mechanism as opposed to having some token array_view-wrapped variable that one does atomics on. It's also on the fast path for both IHVs (it definitely is for ATI's hardware prior to GCN and, bar any major mischief, GCN included), versus naked global atomics, which may not be. Cheers.

    Alex Voicu

    Thursday, October 18, 2012 9:48 PM
  • A complex number library
    Sunday, October 21, 2012 8:08 PM
  • Dear Amit,
    C++ AMP is an excellent start, but it is currently limited in its scope. In its current form it embeds clues in the code as to where operations may be accelerated using parallel co-processing resources, but only in the specific case where that hardware is a graphics card, and probably an nVidia single-sourced graphics card at that.

    Having done the hard work of embedding in applications the meta-data regarding where the code can be accelerated, it seems a shame not to allow the operating system to then make use of that to accelerate it using whatever resources are available to it, not just the graphics card.
    For example, in Control Panel I should be able to select the machine's default acceleration technique. I should be able to select from any class of acceleration hardware for which an “acceleration driver” has been installed.

    This could be the main CPU, it could be one or more GPUs, it could be 3rd party acceleration hardware (e.g. FPGA or DSP based cards), or it could be a local cluster of 'slave' PCs coordinated from this PC.

    This system would make C++ AMP:
    1) Far more powerful and general purpose
    2) It would create a new computing paradigm
    3) It would make C++ AMP programming main-stream, standard and ubiquitous.
    4) It would open-up a whole new market for acceleration hardware, because manufacturers would know that all they had to do was supply a C++ AMP “acceleration driver” with their accelerator card and the customer's C++ AMP applications would run on it without needing to be re-designed or re-coded.

    Regards,
    Nicholas Lee

    Wednesday, November 21, 2012 9:16 AM
  • Hey Nicholas,

    "In its current form it embeds clues in the code as to where operations may be accelerated using parallel co-processing resources, but only in the specific case where that hardware is a graphics card, and probably an nVidia single-sourced graphics card at that."

    C++ AMP can take advantage of any device that supports D3D11 compute shaders. That includes NVIDIA graphics cards but also AMD graphics cards, and Intel and AMD processors with D3D11-level on-board processor graphics (e.g. AMD's Fusion parts or Intel's Ivy Bridge CPUs). On Windows 8 you can also use a WARP device to accelerate C++ AMP code on CPU SIMD units.

    "For example, in Control Panel I should be able to select the machine's default acceleration technique."

    C++ AMP offers more flexible control over where code should be accelerated through its concept of an accelerator. This allows you to take advantage of multiple GPUs in a system (including a mix of discrete GPUs, processor graphics and WARP running on x86/x64 SIMD units). You can enumerate available accelerators in your code and choose what code should run on what accelerator. If you use the default accelerator you will pick up the default D3D11 accelerator set for the system.

    "This could be the main CPU, it could be one or more GPUs, it could be 3rd party acceleration hardware (e.g. FPGA or DSP based cards), or it could be a local cluster of 'slave' PCs coordinated from this PC."

    With Intel's Shevlin Park project, the first non-Microsoft implementation of the open C++ AMP standard, there is a proof of concept of C++ AMP code compiling to OpenCL, which opens up the range of devices that can be targeted to anything with OpenCL support, including some FPGA and DSP hardware that does not support D3D11.

    C++ AMP still has a way to go before it is really main-stream, standard and ubiquitous but it is already further along than you seem to think.

    Matt. 


    • Edited by mattnewport Wednesday, November 21, 2012 4:37 PM Fixing spelling
    Wednesday, November 21, 2012 4:36 PM
  • (1) swizzling operations using the same component of short vector types.

        ex.

        float_3 a, b, c;

        a =  b + c.xxx;

        instead of
             a.x = b.x + c.x;
             a.y = b.y + c.x;
             a.z = b.z + c.x;

    (2) math functions for short vector types.
     
        ex.

        float_3 a, b, c, d;

        d = mad( a, b, c );

        instead of
             d.x = mad( a.x, b.x, c.x );
             d.y = mad( a.y, b.y, c.y );
             d.z = mad( a.z, b.z, c.z );

        a = sin( b );

        instead of
            a.x = sin( b.x );
            a.y = sin( b.y );
            a.z = sin( b.z );

    (3) constant short vector types.

        ex.

        float_3 a, b;

        a =  b + (float_3)(1.0f, 2.0f, 3.0f);

        instead of
            a.x = b.x + 1.0f;
            a.y = b.y + 2.0f;
            a.z = b.z + 3.0f;

     Don't you need them?

    Sunday, December 02, 2012 5:11 AM
  • Thanks for the feedback Ruru.

    A simple way to achieve "1" is: a = b + float_3(c.x);

    "3" can be achieved as: a = b + float_3(1.0f, 2.0f, 3.0f);

    Do these address your need? We will take "2" into consideration.
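    The two idioms above can be sketched in plain C++. The `float3` type below is a hypothetical stand-in for concurrency::graphics::float_3 (which provides equivalent constructors), shown only to illustrate what the broadcast and component-wise constructors do:

```cpp
// Plain C++ stand-in for float_3; illustrative only, not the library type.
struct float3 {
    float x, y, z;
    explicit float3(float s) : x(s), y(s), z(s) {}            // broadcast, like float_3(c.x)
    float3(float x_, float y_, float z_) : x(x_), y(y_), z(z_) {}
    float3 operator+(const float3& o) const { return {x + o.x, y + o.y, z + o.z}; }
};
```

    With this stand-in, `b + float3(c.x)` adds c.x to every component of b, and `b + float3(1.0f, 2.0f, 3.0f)` performs the component-wise add with a constant vector.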

    -Amit


    Amit K Agarwal

    Monday, December 03, 2012 7:24 AM
  • Dear Mr. Amit.

    Thank you for your kind explanation.

    I am looking forward to the consideration "2".

    By the way, if I write code such as "a = b + float_3(c.x)", does the C++ AMP compiler generate vector operation code (in the machine code) or scalar operation code?

    ie.

         a.xyz = b.xyz + c.xxx  (vector operation, summation at one time)

    or

         a.x = b.x + c.x
         a.y = b.y + c.x        (scalar operation, summation repeated 3 times)
         a.z = b.z + c.x
     
    ???

     

    Monday, December 03, 2012 1:36 PM
  • Yes, the compiler will generate vector operations in the bytecode if it finds that profitable, taking several other factors (such as register allocation) into account.

    In fact, even the following is highly likely to be implemented as vector operations (like the previous form: a.xyz = b.xyz + c.xxx):

         a.x = b.x + c.x
         a.y = b.y + c.x
         a.z = b.z + c.x

    -Amit


    Amit K Agarwal

    Monday, December 03, 2012 8:52 PM
  • Oh! Great!

    The C++ AMP compiler's optimization is amazing.

    Thanks for telling me.

    Tuesday, December 04, 2012 11:44 AM
  • Some things I've run into, mostly 'GPU Only' debug related :

    - direct3d_printf runs great when debugging with 'GPU Only'.  When debugging without 'GPU Only', it fails silently, which is fine.  But switch to Release and it crashes.  Could this fail silently for Release as well?

    - I am using interop to share data between C++ AMP and HLSL, using concurrency::direct3d::create_accelerator_view to attach to the Direct3D device.  When I want to debug C++ AMP, I switch to 'GPU Only' but debugging fails with this accelerator view.  I work around this by querying the accelerator, if it's emulated it runs with the default view.  But this then breaks the interop.

    - While debugging with 'GPU Only', all C++ printfs are disabled; no text appears in the Output > Debug window. This makes it difficult to track progress outside of the AMP kernels. I have substituted parallel_for_each calls with a single line calling direct3d_printf to see some progress:

        parallel_for_each(acceleratorView, extentOne, [=](index<1>) restrict(amp) { direct3d_printf("Making progress in C++\n"); });

    - When debugging with 'GPU Only', a kernel will sometimes crash when F10 stepping out of it.  F5 Continue seems more stable to exit a kernel, continuing to the next kernel with a breakpoint.

    - Ability to disable optimizations in a block of code.  I have an if-statement optimized away in Release build.

    Thanks!

    Enjoying working with C++ AMP, thanks for your work.

    Windows 8 Enterprise, Version 6.2.9200 Build 9200
    Microsoft Visual Studio Professional 2012, Version 11.0.51106.01 Update 1



    • Edited by greenkalx Thursday, December 13, 2012 8:17 PM
    Tuesday, December 11, 2012 9:39 PM
  • 0. PGO/anything-else-that-could-work-here for automatic (ATLAS-style) tuning of tiled memory parameters (size and no. of tiles) and general accelerator's (be it GPU or CPU) memory (hierarchy) access patterns optimization using the actual run-time code-path data.

    1. Kernel Fusion: http://dl.acm.org/citation.cfm?id=1953383.1953410

    2. Increased compatibility with STL, e.g., additional STL-style member-types as discussed here.



    • Edited by MattPD Wednesday, December 12, 2012 5:18 PM
    Wednesday, December 12, 2012 5:10 PM
  • like it

    Thanks

    Wednesday, December 26, 2012 10:42 AM
  • 64 bit support on Windows 7. I believe this is a WDDM 1.1 -> WDDM 1.2 issue, with WDDM 1.2 only being supported by Windows 8. At the least, MS should clearly state whether something like a patched WDDM 1.1.1 will ever be released for Windows 7, rather than the usual "the answer to this is unknown" response. Lack of 64 bit support on Windows 7 will probably make AMP a non-starter in this space for many development shops for a few years, as we have to support the OS that our clients use.
    Thursday, December 27, 2012 4:56 PM
  • More atomic operations, especially atomic_fetch_add for 32 bit float.

    I'm aware that it is hardware dependent, but it has been supported by CUDA since version 2.0, so there is hardware that supports more atomic operations than these.

    Thursday, December 27, 2012 10:32 PM
  • Hi Hystaspes,

    C++ AMP already supports pointers and pointer casting, albeit with some restrictions (refer to section 2.4.1.3 of the C++ AMP open spec for details).

    Please let us know if you are referring to specific capabilities w.r.t. pointers that are currently unsupported.

    -Amit


    Amit K Agarwal

    Wednesday, January 02, 2013 10:41 PM
  • Hi Amit,

    Thanks for your reply. I was confused because in the C++ AMP book on page 58 it is specifically mentioned that pointer casting cannot be done in restricted functions:

    " The actual code in your amp-compatible function is not running on a CPU and therefore can't do certain things that you might be used to doing:

    * recursion

    * pointer casting

    * use of virtual functions ..."

    You are correct, I just tested a simple up-casting and it works, although at the moment in my renderer project I don't have any use for it without virtual functions.

    Friday, January 04, 2013 12:15 AM
  • Hi Hystaspes,

    Due to reasons of hardware portability, C++ AMP can only expose features that are available across different hardware.

    Fortunately, atomic_fetch_add for 32 bit float can be implemented using atomic_compare_exchange for unsigned int. Following is an implementation for your reference:

    float atomic_fetch_add(float *_Dest, float _Value) restrict(amp)
    {
        float oldVal = *_Dest;
        float newVal;
        do {
            newVal = oldVal + _Value;
        } while (!atomic_compare_exchange(
            reinterpret_cast<unsigned int *>(_Dest),
            reinterpret_cast<unsigned int *>(&oldVal),
            *reinterpret_cast<unsigned int *>(&newVal)));

        return newVal;
    }

    We will consider adding this in the next version of the product.
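    For readers without an accelerator at hand, the same compare-exchange idea can be exercised on the CPU with std::atomic. This is only an illustrative stand-in (the function name is made up; it is not a C++ AMP API), emulating a float fetch-add on top of a 32-bit integer compare-exchange:

```cpp
#include <atomic>
#include <cstring>

// CPU-side sketch of the same CAS loop: reinterpret the float's bits as an
// unsigned int and retry until the compare-exchange succeeds.
inline float cpu_atomic_fetch_add(std::atomic<unsigned int>* dest, float value)
{
    unsigned int oldBits = dest->load();
    unsigned int newBits;
    float oldVal, newVal;
    do {
        std::memcpy(&oldVal, &oldBits, sizeof(float));  // bits -> float
        newVal = oldVal + value;
        std::memcpy(&newBits, &newVal, sizeof(float));  // float -> bits
        // On failure, compare_exchange_weak refreshes oldBits with the
        // currently stored value, so the next iteration recomputes the sum.
    } while (!dest->compare_exchange_weak(oldBits, newBits));
    return newVal;
}
```

    As in the restrict(amp) version above, a lost race simply retries with the freshly observed value.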

    Thanks for the feedback,

    -Amit


    Amit K Agarwal

    Friday, January 04, 2013 2:28 AM
  • Yes, this is an error in the book. We will work with the author to have this corrected.

    Thanks for bringing this to our attention.

    -Amit


    Amit K Agarwal

    Friday, January 04, 2013 2:29 AM
  • I just realized everything I requested for future C++ AMP versions is already mentioned in section 13.2 (Projected Evolution of amp-Restricted Code) of the specification.

    So I will delete my comment regarding support for new data types, virtual function, virtual base class, recursion, etc. to make this post shorter and easier to read.


    • Edited by Hystaspes Sunday, January 06, 2013 4:28 AM
    Sunday, January 06, 2013 4:12 AM
  • Hi, all. I think the most important thing is to improve the performance of copying data from the accelerator view to the CPU. We are running a project that processes a large live image (320x1440) 60 times per second, and the data copy is the main bottleneck. We need only 5 ms per loop when using CUDA, but now we need 18 ms.
    Wednesday, January 09, 2013 4:56 AM
  • Thanks for the feedback @dz.john.luo.

    Can you please help us understand the scenario better for us to be able to replicate the performance behaviour at our end?

    - Are you using a texture to store the image data on the GPU?

    - What is the total size in bytes of the data that is transferred to the CPU from the accelerator_view? i.e. what is the byte size of each of the 320x1440 elements in the image?

    - Does your workload only involve copying data from the accelerator_view to the CPU or also from the CPU to the accelerator_view?

    - Additional information about your environment will be very helpful. Windows 7 or Windows 8? 32 bit or 64 bit? Which graphics card are you using?

    -Amit


    Amit K Agarwal

    Wednesday, January 09, 2013 8:31 AM
  • Thanks so much for the quick response, @Amit K Agarwal.

    - Are you using a texture to store the image data on the GPU?

    No, we just use a one-dimensional array.

    - What is the total size in bytes of the data that is transferred to the CPU from the accelerator_view? i.e. what is the byte size of each of the 320x1440 elements in the image?

    Only one byte per pixel; it is a mono image.

    - Does your workload only involve copying data from the accelerator_view to the CPU or also from the CPU to the accelerator_view?

    The workload includes:

    while (!terminate)
    {
        fetch data from camera;
        copy data to GPU;            // via a staging array
        process each pixel (Gaussian blur etc.);
        copy result data to CPU;     // via a staging array
        process the result data on the CPU;
    }

    - Additional information about your environment will be very helpful. Windows 7 or Windows 8? 32 bit or 64 bit? Which graphics card are you using?

    I tested the program again after posting this issue. The actual results are complicated:

    1. We run this program on Windows 8, 32 or 64 bit.

    2. The card is an NVIDIA GeForce GTX 550 with the latest driver.

    3. I disabled the "copying result data to CPU" code in the loop, and the time used did not go down.

    4. For the first 10 seconds or less, the time used in GPU processing is about 6 ms, and then it increases to 18 ms. (This is so strange! I don't know why.)

    5. We run almost the same code with CUDA; it needs only 5 ms of GPU processing within this loop.

    So, do I need to provide more information?

    Again, much appreciated.


    Thursday, January 10, 2013 9:18 AM
  • @Amit K Agarwal.

    - additional information:

      I restarted the system, and the time is 6 ms for GPU processing.

    The previously posted test was on a system that had been running for about 3 days or longer without a restart (only shutdown).

    Thursday, January 10, 2013 9:35 AM
  • I cannot port my CUDA and OpenCL code to C++ AMP due to types like char and long long are not supported.

    http://www.codeproject.com/Articles/380399/Permutations-with-CUDA-and-OpenCL

    Please support char, short, long long and their unsigned counterparts in the next version.


    Me too...
    Thursday, January 10, 2013 4:14 PM
  • Thanks for the info.

    The results are indeed strange. I tried this on both NVIDIA and AMD cards and was unable to reproduce the results that you observed. I let the program run for several minutes and got consistent performance across different loop iterations.

    If you continue getting inconsistent results, I would suggest contacting the hardware vendor seeking an explanation.


    Amit K Agarwal

    Friday, January 18, 2013 4:15 PM
  • Staging textures, the same way as staging arrays.
    Saturday, January 19, 2013 9:50 PM
  • I would like easier ways to avoid (not just recover from) TDRs.

    A rand function.

    Sunday, February 03, 2013 7:57 PM
  • Thanks for the feedback Dan.

    Regarding disabling TDRs, please refer to our earlier blog post on the topic.

    A C++ AMP random number generator library is available on CodePlex. We would love to have your feedback on the library. 

    -Amit


    Amit K Agarwal

    Wednesday, February 06, 2013 8:35 PM
  • Me too - need char types
    Tuesday, February 12, 2013 3:38 PM
  • I'd like to make a special request for "array_view" optimizations on devices that share memory resources. As I understand it, C++ AMP does not properly optimize for unified CPU/GPU memory in chips like Ivy Bridge/AMD APUs and the upcoming Haswell, and instead performs redundant memory copies. The proper optimization for fusion architectures is critical not only from a performance perspective, but also in terms of power consumption. I think it's very important that this runtime optimization make its way into v2 of AMP to enable new high-power applications on tablets like the Surface Pro.

    -LKeene

    Thursday, February 14, 2013 7:01 PM
  • Thanks for the feedback LKeene.

    I completely agree with your comments about the importance of this feature, both from a power and a performance perspective. Optimizing array_views for shared-memory architectures is very high on our list of priorities and we are actively working on enabling it. Stay tuned for updates regarding this on our blog.

    -Amit


    Amit K Agarwal


    Wednesday, February 20, 2013 4:45 PM
  • What do I want in next AMP?

    1) support for 64bit integers (long long)

    2) support for char and char*

    While I do not know why AMP misses such basic things (I guess they were not needed in DirectX graphics), they are very useful in many GPGPU algorithms, and even if the hardware does not support them directly, the compiler should be able to optimize them much better than we can (there is no access to carry flags and such in C++).

    Friday, March 29, 2013 7:53 PM
  • In my opinion the performance of data transfers (or avoiding unnecessary transfers altogether) is at least as important as the computation performance itself.

    You wrote: "In the future releases, we may extend the interpretation of staging array for other purposes. For example, we may allow direct access of the stagingArray (which is physically located on acclView1) from computation executing on acclView2 (i.e. zero-copy)." (http://blogs.msdn.com/b/nativeconcurrency/archive/2011/11/10/staging-arrays-in-c-amp.aspx)

    This would be a great feature! Staging textures would be also great.

    Also, interop functions for OpenGL like those provided for DX would be appreciated as well. And - if possible - debugging of p_f_e (parallel_for_each) even when it is using textures created by DX.

    Sunday, March 31, 2013 1:32 PM
  • I'd like to see Perlin noise in 2 and 3 dimensions because that's something that gets used every day.

    Friday, April 12, 2013 10:29 PM
  • Hey Dan, can you provide more details about your scenario? Are you asking for API support to generate Perlin noise? Please do email us more details (bobyg AT Microsoft Dot com) about your scenario when you get a chance.
    Friday, April 12, 2013 11:11 PM
  • The application is 3D rendering and ray tracing, and of course Perlin noise is very fundamental there. I see one of the APIs supports Perlin noise in 1 dimension, but not in 2 or 3 dimensions.

    Perlin noise is fundamental to the creation of most procedural texture types. An optimized Perlin noise, improved Perlin, or perhaps simplex noise implementation would be very helpful as part of the API, so that developers don't have to code their own.

    Sunday, April 14, 2013 7:26 PM
  • Hi,

    What about expanding the range of supported devices to Intel graphics, Intel Phi, and then maybe even ASICs and FPGAs?

    When will we see decent CPU support that is not emulated, with SSE, AVX etc.? In general, it would be lovely to see the whole "C++ AMP roadmap" to understand where things are going to be in the next 3-5 years.

    Saturday, April 20, 2013 6:29 PM
  • As above, plus a constant memory area on the GPU for multiple executions.
    Friday, May 24, 2013 9:44 AM
  • I think it would be great if we had access to simultaneous kernel dispatch, much like CUDA 5. Such an important feature, I think.
    Thursday, May 30, 2013 5:41 AM
  • It would be nice to have a LAPACK library.
    Monday, June 24, 2013 9:07 PM
  • --->   C# AMP   ----->  Version: MANAGED CODE  

    Thursday, June 27, 2013 2:54 PM