C++ AMP, general 1D custom CPU fallback to PPL
-
23 July 2012 9:25
Hi all,
(moving specific question from originating discussion "C++ AMP, custom CPU fallback - PPL?")
Is it possible to use PPL (for executing lambdas) together with cpu_accelerator (for memory management) to achieve a general 1D custom CPU fallback in the way illustrated below?
Modified Hello World:
    #include <iostream>
    #include <amp.h>
    #include <ppl.h>   // concurrency::parallel_for (PPL)

    using namespace concurrency;
    using std::wcout;
    using std::endl;

    // Runs a 1D kernel through PPL when the default accelerator is the CPU,
    // and through C++ AMP otherwise.
    template< typename Kernel_type >
    void pfe( const Concurrency::extent< 1 >& e, const Kernel_type& kernel )
    {
        if ( accelerator().device_path == accelerator::cpu_accelerator )
        {
            wcout << "PFE: Using CPU" << endl;
            auto pplKernel = [&kernel]( int i )
            {
                index< 1 > idx( i );
                kernel( idx );
            };
            parallel_for( 0, int( e.size() ), pplKernel ); // TODO: Avoid PPL scheduling overhead
        }
        else
        {
            wcout << "PFE: Using GPU" << endl;
            parallel_for_each( e, kernel );
        }
    }

    int main()
    {
        bool useCPU = true; // Toggle to use CPU/GPU
        if ( useCPU )
        {
            accelerator::set_default( accelerator::cpu_accelerator );
        }

        int v[11] = { 'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c' };
        array_view< int > av( 11, v );

        pfe( av.extent, [ = ]( index< 1 > idx ) restrict ( amp, cpu )
        {
            av[idx] += 1;
        } );

        wcout << "default_accelerator description = " << accelerator().description << endl;
        wcout << "default_accelerator device_path = " << accelerator().device_path << endl;
        for ( unsigned int i = 0; i < av.extent.size(); i++ )
        {
            wcout << static_cast< char >( av( i ) );
        }
        wcout << endl;
    }

Whether this is a useful and working fallback depends on positive answers to the following questions:
- Is the memory management done by cpu_accelerator (on, e.g., arrays) much slower than native STL?
  Comment: It should be fast enough to enable non-debug use in the case of staging arrays: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/10/cpu-accelerator-in-c-amp.aspx
- Is there a significant performance penalty to access memory on the "normal" heap through array(_view) using the index class (as opposed to vector::operator[]) when using PPL?
  Comment: Most of the kernels I am interested in run an internal for-loop, so I can always side-step this by extracting the data pointer and int index using array_view::data() and index::operator[] before the loop.
- Is there a performance penalty associated with wrapping a lambda in another lambda? It should just be inlined, right?
- Is the usage of cpu_accelerator in the code above officially supported? Will it work also for more complex constructs, e.g., using array? If not, any workarounds?
I am grateful for any advice on how to implement a custom CPU fallback while keeping code size/duplication to a minimum.
If the above does not work I believe the next leanest fallback will be to wrap handles to vector and array in my own custom container (to facilitate memory management in the host code) and use independent lambdas for AMP and PPL (Daniel has already noted "For snippets of code that are truly identical, factor them out into their own restrict(cpu,amp) functions so you can call them from either of the two entry point lambdas (from the CPU and AMP paths).").
Cheers,
T
(Edit: Minor changes to code sample)
- Edited by TwoPointSevenOh 23 July 2012 9:59
All replies
-
23 July 2012 17:08 Owner
Hi T
To answer your questions
1. The cpu_accelerator memory allocation and destruction is just a thin wrapper over the CRT heap “new” and “delete” and has the same performance characteristics.
2. There is a small performance penalty with using the array_view subscript operator on the CPU, because internally it checks to see if it needs to synchronize the data. So for repeated access on the CPU, we recommend accessing through the data() function.
3. A lambda calling another lambda will be inlined on the GPU side (we inline everything in this release), and I guess inlining could happen for the CPU side but I personally don’t know that for sure. As always the best thing to do here is measure.
4. The cpu_accelerator will work the same for concurrency::array, either by passing an accelerator_view of the cpu_accelerator explicitly to the array constructor, or by setting it as a default globally, like in your example.
Note that with your restrict(cpu,amp) approach on the lambda, you are restricting your code from the ability to use tile_static memory and hence to benefit from tiling, which is the #1 optimization technique for most C++ AMP (memory bound) kernels.
As a general comment and personal opinion: I would favor having two separate entry points (lambdas) and getting any reuse further down by calling truly common restrict(cpu,amp) functions. This will allow you to optimize for each hardware target separately. In other words, I would trade off a bit of elegance and commonality in the code, for keeping all the options open for better runtime performance on each divergent code path… Once all performance tuning is done for each hardware target on the two paths, you could revisit to see if some refactoring can merge back some of the code divergence… But that is just an opinion, not based on experience since I haven’t tried to do what you are trying to do. Let us know how it goes...
Cheers
Daniel
http://www.danielmoth.com/Blog/
- Marked as answer by TwoPointSevenOh 24 July 2012 7:47
-
24 July 2012 7:52
Hi Daniel,
Thanks for a great reply! I will attempt to port my OpenCL code right away and get back to you if I run into problems.
As you mention, I will have to treat the use of local memory (tiling) as a special case. This is OK; I only use it in a few places.
Cheers,
T
-
15 August 2012 6:34
Hi Daniel,
Minor follow-up to your reply for item 2 above (just to be sure): Are the auto-synchronization features of array_view protected in such a way that using array_view in a lambda executed in parallel by PPL is safe?
Cheers
T