Friday, July 20, 2012 10:23 AM
Hi all,
Newbie to C++-AMP here...
I am currently working on a largish bunch of OpenCL prototype code that should go into product in 6-10 months. The code is executed on GPU or CPU depending on customer hardware (workstation -> usually GPU, virtualized server -> CPU).
Since we are a Windows-only company I would like to port the code to C++-AMP if possible (VS11 switch getting started anyway, C++ integration is wonderful and I hope for better driver compatibility/stability than OpenCL).
From the recent post http://social.msdn.microsoft.com/Forums/en-GB/parallelcppnative/thread/71716f1a-c1e1-40bf-ac2a-4bc5cc718b05 and the blog post http://www.danielmoth.com/Blog/Running-C-AMP-Kernels-On-The-CPU.aspx I gather that we cannot expect CPU acceleration using C++-AMP under Windows 7, at least not in the near future? Is this official by now?
The lack of CPU fallback for Windows 7 is problematic since I expect many of our customers will stay on Windows 7 for at least 2-3 years. Does anyone have a hunch on when we will see a DirectCompute-capable WARP on Windows 7 (if ever)? (This question may be better directed to a WARP forum if anyone would like to point me there.)
My options right now seem to be either to stay with OpenCL for a year or two and hope that the future task of porting everything written in the meantime will not be beyond all hope, or to go with C++-AMP right away and write a custom wrapper that enables CPU fallback to PPL (for a fairly limited subset of C++-AMP, of course). It seems likely that other people have been working in this direction...
If I cannot keep the code execution path more or less identical on GPU and CPU I will stay with OpenCL, so I would like to execute the same lambdas using C++-AMP and PPL. The obstacles are obviously how memory is represented (array(_view) vs. vector) and the capture clause (I need to be able to capture by reference when using PPL). Is it possible (supported?) to use cpu_accelerator for memory management only and use memory allocated with this accelerator when executing the lambda with PPL? I think that would make for a very thin wrapper... :-) If so, is memory management using cpu_accelerator significantly slower than normal?
I would be grateful for any related suggestions on how to best build a wrapper that enables a CPU-fallback to PPL.
Friday, July 20, 2012 6:06 PM Owner
You are correct on almost all counts.
As an aside, it is a matter of opinion how fast the Windows ecosystem out there will move to Windows 8, and looking at past history is not going to help much this time. The reason is that this time there is a unique and new factor that is aimed at accelerating that adoption dramatically. I am referring to the upgrade path to Windows 8 from Windows XP, Windows Vista, and Windows 7 costing only $39.99.
Having said that, I agree that if you are not writing a Metro-style app, you should have a CPU fallback solution to maximize your reach, provided the feature you are implementing is a core feature of your app rather than a light-up feature, and provided it can indeed benefit from CPU acceleration alone.
You are correct that our implicit CPU fallback is WARP. You don’t have to change a line of code, and it will kick in when no hardware with a DirectX 11 driver is detected on the system. You are also correct that WARP is only a Windows 8 solution.
You are also right that rolling your own PPL solution is the recommended and easiest way of implementing your own CPU fallback solution for non-Windows 8 targets. We know many folks that have gone that route.
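To make the fallback decision concrete, here is a minimal sketch of how an app might choose between the C++ AMP path and its own PPL path at startup. The names `ComputePath`, `choose_path`, `has_hardware_accelerator` and the `USE_CPP_AMP` build macro are hypothetical and only for illustration; `accelerator::get_all()` and `is_emulated` come from amp.h. Treating every non-emulated accelerator as usable hardware is an assumption that you may need to refine for your workload.

```cpp
// Hypothetical build switch: define USE_CPP_AMP when compiling with a
// Visual C++ toolset that ships amp.h. Elsewhere the probe reports "no GPU".
#if defined(_MSC_VER) && defined(USE_CPP_AMP)
#include <amp.h>
#endif

// Which implementation the application should run (illustrative names).
enum class ComputePath { AmpHardware, PplFallback };

// Pure policy: use C++ AMP only when real hardware was found.
inline ComputePath choose_path(bool hardware_accelerator_present) {
    return hardware_accelerator_present ? ComputePath::AmpHardware
                                        : ComputePath::PplFallback;
}

// Probe: returns true if at least one accelerator is not emulated,
// that is, it is neither a software device such as REF or WARP nor
// the CPU accelerator.
inline bool has_hardware_accelerator() {
#if defined(_MSC_VER) && defined(USE_CPP_AMP)
    for (const concurrency::accelerator& acc :
         concurrency::accelerator::get_all()) {
        if (!acc.is_emulated) {
            return true;
        }
    }
    return false;
#else
    return false;  // No C++ AMP in this build: always take the CPU path.
#endif
}

// Convenience wrapper combining the probe and the policy.
inline ComputePath select_compute_path() {
    return choose_path(has_hardware_accelerator());
}
```

With something like this, the app would call `select_compute_path()` once and dispatch to either its C++ AMP entry point or its PPL entry point. Keeping the policy (`choose_path`) separate from the probe makes it easy to force the CPU path for testing.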
As you implement your own CPU algorithm, please start new forum threads with specific questions for issues that you encounter and we’ll help you out.
In short: I recommend starting with PPL’s parallel_for which is the equivalent of C++ AMP parallel_for_each. For the first two arguments pass it 0 and N, which is the equivalent of extent<1>(N). In the lambda you receive an int, which is the equivalent of index<1>. Instead of capturing in the lambda the array_view that wraps your CPU container, capture the CPU container directly. Within the lambda, use the CPU container instead of the array_view. For snippets of code that are truly identical, factor them out into their own restrict(cpu,amp) functions so you can call them from either of the two entry point lambdas (from the CPU and AMP paths).
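As a rough illustration of the recipe above, here is a minimal sketch using a SAXPY-style computation. The function names (`saxpy_element`, `saxpy_cpu`, `saxpy_amp`) and the `USE_CPP_AMP` and `KERNEL_RESTRICT` macros are made up for this example; the serial loop in the non-Visual C++ branch is only there so the sketch compiles on other toolchains. It assumes `x` and `y` have the same length and, for the AMP path, that they are non-empty.

```cpp
#include <vector>

#if defined(_MSC_VER)
#include <ppl.h>  // concurrency::parallel_for
#endif

// Hypothetical build switch: define USE_CPP_AMP to compile the AMP path.
#if defined(_MSC_VER) && defined(USE_CPP_AMP)
#include <amp.h>
#define KERNEL_RESTRICT restrict(cpu, amp)
#else
#define KERNEL_RESTRICT
#endif

// Shared per-element code, callable from both entry point lambdas.
inline float saxpy_element(float a, float x, float y) KERNEL_RESTRICT {
    return a * x + y;
}

// CPU path: parallel_for over [0, n), capturing the containers directly.
inline void saxpy_cpu(float a, const std::vector<float>& x,
                      const std::vector<float>& y, std::vector<float>& out) {
    const int n = static_cast<int>(x.size());
    out.resize(x.size());
#if defined(_MSC_VER)
    concurrency::parallel_for(0, n, [&](int i) {
        out[i] = saxpy_element(a, x[i], y[i]);
    });
#else
    // Serial stand-in so the sketch builds without PPL.
    for (int i = 0; i < n; ++i) {
        out[i] = saxpy_element(a, x[i], y[i]);
    }
#endif
}

#if defined(_MSC_VER) && defined(USE_CPP_AMP)
// AMP path: same element function, but data is wrapped in array_views
// and the lambda captures them by value.
inline void saxpy_amp(float a, const std::vector<float>& x,
                      const std::vector<float>& y, std::vector<float>& out) {
    const int n = static_cast<int>(x.size());
    out.resize(x.size());
    concurrency::array_view<const float, 1> xv(n, x);
    concurrency::array_view<const float, 1> yv(n, y);
    concurrency::array_view<float, 1> ov(n, out);
    ov.discard_data();  // Output only: skip copying old contents over.
    concurrency::parallel_for_each(
        ov.extent, [=](concurrency::index<1> idx) restrict(amp) {
            ov[idx] = saxpy_element(a, xv[idx], yv[idx]);
        });
    ov.synchronize();  // Copy results back into 'out'.
}
#endif
```

Calling `saxpy_cpu(2.0f, x, y, out)` and `saxpy_amp(2.0f, x, y, out)` should fill `out` with the same values, since both go through `saxpy_element`. How much more can be shared than this depends on how far the two paths diverge in your real code.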
Again, how complex and diverged your CPU fallback solution is going to be depends on your own workload and your own C++ AMP implementation. So give it a shot and please start new forum threads with specific questions for issues that you encounter and we’ll help you out.
PS: You also mentioned in your question something about cpu_accelerator. Sorry, that is a total red herring for this discussion; there is nothing to assist you there for this scenario. Please read our blog post on that to understand its only utility for this release: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/10/cpu-accelerator-in-c-amp.aspx
PS 2: If you are only working with single-dimensional data and want your C++ AMP path to look closer to the PPL path, you can use this small utility function: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/01/24/simplifying-single-dimensional-c-amp-code.aspx
Monday, July 23, 2012 8:41 AM
Thanks for your very informative reply, it answers most of my questions (marked as answer).
A few comments:
Windows 8 upgrade: Price is not the reason for the long upgrade cycle. I work in a very safety-conscious industry that is reluctant to upgrade any software simply because it will always involve new validation etc. on their part. The customers that recently moved to Windows 7 will not move again anytime soon. ...and at the same time they are desperate for the performance improvements yielded by GPUs. :-) (It is sometimes a fine line to handle these kinds of cross-purposes).
cpu_accelerator: Not sure I agree that it is a red herring, although of course you know more about the internal workings here than I do. Continuing this discussion in a new thread.