Multi-threading AMP

  • Question

  • This is not exactly a question; maybe we can just start discussing it and find a good pattern together.

    I tried to multi-thread kernel submission. It failed horribly: the implementation quickly blows up with corrupt reference-counters, or maybe accesses to objects already destroyed by reference-counting. There is not enough information available for me to build a mental model of what AMP actually is in terms of threading.

    The documentation says that accelerator_views are thread-safe, but not concurrency-safe, while the accelerator itself is. From reading I assume an accelerator_view represents something in the spirit of a DX11DeviceContext. But it's not clear to me whether an accelerator_view per thread (and not deferred) would allow the concurrent submission of parallel_for_each calls to a single accelerator. I'm also not sure if wrapping AMP code [with array_view construction and all] into a parallel_for lambda is even sane.

    I did a very thorough job in my testing: each [concurrency-runtime] thread [spawned from a parallel_for] basically has its own view(s) and its own arrays, and doesn't intersect in any way with the other threads. Still, I can't get it not to blow up.

    Here is the schema of how I tested it.

    ...
    // one resource per thread (pb == thread-id)
    static vector<accelerator_view> accs;
    static vector<array<double, 1> *> arrs;
    ...
    // run a kernel on the view assigned to the thread
    double executeAMP(int pb) {
      // the accelerator_views ended up in there with push_back(),
      // which prevents the default constructor from being used (it doesn't exist)
      accelerator_view &acc = (Repository::accs[pb]);
      array<double, 1> arr = *(Repository::arrs[pb]);
      ...
      extent<2> ee(3, 9);
      tiled_extent<3, 9> te(ee);
      ...
      parallel_for_each(acc, te, [=, &arr](tiled_index<3, 9> th) restrict(amp) {
        ...
      });
      ...
      return something_read_back_via_synchronize_buffer;
    }
    ...
    // submit X threads concurrently
    parallel_for(0, p, 1, [&](int pb) {
      pb %= concurrent;
      robin[pb].lock();
      reslt[pb] = executeAMP(pb);
      robin[pb].unlock();
    });
    ...

    What would be a working and safe pattern for getting concurrent AMP threads to execute? Or is there simply no way to use multiple [immediate] views, and does it need to be done with a deferred approach?

    Maybe we'll get a few interesting ideas, perhaps from the DX11 side, as I'm sure a bit of experience has accumulated with that situation (multiple immediate contexts).

    Tuesday, July 10, 2012 5:16 PM


All replies

  • Hi Ethatron,

    There is nothing special about multiple CPU threads accessing C++ AMP APIs, such as parallel_for_each, array_view objects, or any other parts of the APIs that result in GPU commands under the covers. The API is thread safe, and you can call those from multiple CPU threads, modulo races that your own code may introduce.

    I do not understand what “blow up” means: if it is an exception, which exception is it? Please always use exception handling so you can catch the specific exception and report it:
    http://blogs.msdn.com/b/nativeconcurrency/archive/2012/02/09/runtime-exception-of-c-amp.aspx
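
    For example, a minimal sketch of such wrapping (the helper name, sizes and kernel here are placeholders, not your code); concurrency::runtime_exception is the base class of the C++ AMP runtime exceptions:

    #include <amp.h>
    #include <iostream>
    using namespace concurrency;

    void run_kernel(accelerator_view &acc)     // hypothetical helper
    {
        try {
            extent<1> e(1024);
            array<double, 1> data(e, acc);     // resource bound to this view
            parallel_for_each(acc, e, [&data](index<1> i) restrict(amp) {
                data[i] = i[0] * 2.0;          // placeholder kernel
            });
            acc.wait();                        // surface errors from queued work here
        }
        catch (const runtime_exception &ex) {  // base of the C++ AMP runtime exceptions
            std::cerr << "AMP error " << ex.get_error_code()
                      << ": " << ex.what() << std::endl;
        }
    }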

    Just looking at your code, there are 3 observations:

    1. You are not using exception handling so it is hard to tell where things are going wrong.
    2. The line of code with the array assignment {array<double, 1> arr = *(Repository::arrs[pb]);} would result in a deep copy, so you should use &arr = (see the snippet after this list).
    3. Your parallel_for_each usage specifies a specific accelerator_view, but from the code I can’t tell if you are using that same accelerator_view when you created the array objects. They have to be on the same accelerator_view or you would receive a corresponding exception when calling the function.
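
    To illustrate the second observation, binding by reference avoids the deep copy (sketch, using the names from the posted schema):

    array<double, 1> &arr = *(Repository::arrs[pb]);   // reference, no deep copy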

    Cheers
    Daniel


    http://www.danielmoth.com/Blog/

    Wednesday, July 11, 2012 3:49 PM
  • There is nothing special about multiple CPU threads accessing C++ AMP APIs, such as parallel_for_each, array_view objects, or any other parts of the APIs that result in GPU commands under the covers. The API is thread safe, and you can call those from multiple CPU threads, modulo races that your own code may introduce.

    I use a parallel_for to spawn several jobs, each of which wants to execute its own parallel_for_each concurrently. The documentation doesn't state whether that is possible, or whether concurrent "kernels" are even possible. I don't want to synchronize the threads with a mutex, because that serializes the parallel_for_each calls – in that case I don't even need to try it; I can just eliminate the parallel_for and run serially.
    Using asynchrony and a future doesn't help performance at all in my real case.

    The question is whether lock-free and concurrent use of the compute device is possible when all resources in use are independent (no shared views, arrays, textures etc., just the same "accelerator").

    I do not understand what “blow up” means: if it is an exception, which exception is it?

    kernel32.dll!_InterlockedIncrement@4() + 0x9 bytes
    xyz.exe!Concurrency::details::_Get_accelerator_view_impl_ptr(const Concurrency::accelerator_view & _Accl_view={...}) line 1240 + 0x10 bytes
    ...
    Unhandled exception at 0x777d1389 (kernel32.dll) in xyz.exe: 0xC0000005: access violation writing to location 0x00000037.

    This happens while executing the parallel_for_each, in the 3rd concurrent thread. I assume the object has been deleted because its reference-counter somehow reached 0. I can't see how that would be possible in the pattern above, but then I don't know the internals of AMP that well.

    Please always use exception handling so you can catch the specific exception and report it:
    http://blogs.msdn.com/b/nativeconcurrency/archive/2012/02/09/runtime-exception-of-c-amp.aspx


    Just looking at your code, there are 3 observations:

    1. You are not using exception handling so it is hard to tell where things are going wrong.
    2. The line of code with array assignment {array<double, 1> arr = *(Repository::arrs[pb]);} would result in a deep copy, so you should use &arr =

    That's just a typo; the code is only an illustration of the multi-threading pattern I tried, and by no means compilable. :^)

    • Your parallel_for_each usage specifies a specific accelerator_view, but from the code I can’t tell if you are using that same accelerator_view when you created the array objects. They have to be on the same accelerator_view or you would receive a corresponding exception when calling the function.

    They are distinct; I create a separate accelerator_view per thread via "create_view()" and put them in the vector.
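
    Concretely, the setup looks roughly like this (sketch; queuing_mode_immediate is what I mean by "immediate" views):

    accelerator gpu;   // default accelerator
    for (int t = 0; t < concurrent; ++t)
        Repository::accs.push_back(gpu.create_view(queuing_mode_immediate));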

    More than learning that the code in use contains an error, I'm interested in knowing whether concurrency is possible at all. If not, I can stop – it's then very clear I'll never get the code to work. If yes, I want to know which programming pattern achieves concurrency. The best would be a sketch of the pattern; I don't really need working code, but it'd be nice too.

    If the tested pattern should work in principle, I would invest the time to find out what's wrong.

    The pattern for AMP and DX11 should be roughly the same. If on DX11 we can instantiate multiple devices, one for each thread, and are able to produce multiple independent command-streams running concurrently on the GPU, we could use an "accelerator" in the case of AMP to do the same. If on DX11 we need to use deferred contexts to reduce the impact of serial submission of command-streams to the GPU, how does this work in AMP? Is there a guarantee that calling only the parallel_for_each produces a valid, coherent and complete command-stream, or does the parallel_for_each rely on initialization code being executed previously (setup of array, array_view etc.)? This is an important question, because say we have this execution order:

    init_some_amp_resources();
    parallel_for_each();
    parallel_for_each();
    parallel_for_each();
    parallel_for_each();
    sync_on_some_amp_resources();

    Is it even possible to have multiple threads work in serial lock-step (not concurrently)?

    thread 1: init_some_thread_local_amp_resources();
    thread 2: init_some_thread_local_amp_resources();
    thread 1: parallel_for_each(with_thread_local_accview);
    thread 2: parallel_for_each(with_thread_local_accview);
    thread 1: parallel_for_each(with_thread_local_accview);
    thread 2: parallel_for_each(with_thread_local_accview);
    thread 1: parallel_for_each(with_thread_local_accview);
    thread 2: parallel_for_each(with_thread_local_accview);
    thread 1: parallel_for_each(with_thread_local_accview);
    thread 2: parallel_for_each(with_thread_local_accview);
    thread 1: sync_on_some_thread_local_amp_resources();
    thread 2: sync_on_some_thread_local_amp_resources();
    

    If we need to lock(), where?

    // variant 1
    lock();
    init_some_thread_local_amp_resources();
    parallel_for_each(with_thread_local_accview);
    parallel_for_each(with_thread_local_accview);
    parallel_for_each(with_thread_local_accview);
    parallel_for_each(with_thread_local_accview);
    sync_on_some_thread_local_amp_resources();
    unlock();
    
    // variant 2
    init_some_thread_local_amp_resources();
    lock();
    parallel_for_each(with_thread_local_accview);
    parallel_for_each(with_thread_local_accview);
    parallel_for_each(with_thread_local_accview);
    parallel_for_each(with_thread_local_accview);
    unlock();
    sync_on_some_thread_local_amp_resources();
    
    // variant 3
    init_some_thread_local_amp_resources();
    lock();
    parallel_for_each(with_thread_local_accview);
    unlock();
    lock();
    parallel_for_each(with_thread_local_accview);
    unlock();
    lock();
    parallel_for_each(with_thread_local_accview);
    unlock();
    lock();
    parallel_for_each(with_thread_local_accview);
    unlock();
    sync_on_some_thread_local_amp_resources();

    It's complex; I just hope we can discuss it.
    Wednesday, July 11, 2012 6:16 PM
  • Hi Ethatron,

    Concurrent parallel_for_each invocations from multiple threads are allowed. Multiple threads can concurrently invoke parallel_for_each on the same accelerator_view or different accelerator_views without requiring any synchronization between the threads, as long as the parallel_for_each invocations use different resources and thus do not introduce any data races. In short: yes, an accelerator_view is thread-safe. Whether the kernels actually execute concurrently depends on the available hardware resources – concurrent parallel_for_each invocations on different accelerator_views of different accelerators can and will execute concurrently.

    I would have to look at your code to tell the exact reason for the access violation that you are encountering – I would be happy to do so if you can share/post your actual code. Fundamentally the pattern that you are using is supported.

    An “accelerator” in C++ AMP is analogous to an IDXGIAdapter, which represents a physical GPU, and an accelerator_view is analogous to a Direct3D device. DirectX requires programmers to synchronize concurrent submission of commands from multiple CPU threads to the same Direct3D device context, but this is not required in C++ AMP – the C++ AMP runtime internally manages any DirectX synchronization requirements. Multiple accelerator_views on the same accelerator allow producing multiple independent command-streams – whether they actually execute concurrently on the GPU depends on whether the hardware supports such concurrency.

    C++ AMP does not provide an equivalent of deferred contexts in DirectX, which are basically a mechanism for batching commands together before actually submitting them to the GPU for execution – they help reduce the cost, in application code, of synchronizing each command submission to the device. Multiple threads can batch their respective commands in their own deferred contexts, which can then be submitted together to the device, incurring the synchronization cost only at the point of the batch's submission instead of at each command submission. This is useful when a large number of tiny draw primitives need to be submitted for execution to a device and the synchronization cost for each such draw primitive can add up to be significant. We think this is not so much of a problem for compute, where each command by itself performs sizable work and dwarfs any synchronization costs. If you have any compute scenarios that benefit from deferred contexts, we would love to hear about them.

    Finally, parallel_for_each does not rely on the initialization of the AMP resources beyond what is required by your algorithm. So if 2 threads invoking independent parallel_for_each calls are using different sets of resources, the initialization of the resources and the parallel_for_each invocations from the multiple threads can interleave arbitrarily without any problems. No locks are needed.
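
    For illustration, a minimal sketch of that lock-free pattern – per-thread views and arrays and a placeholder kernel (illustrative only, not your code):

    #include <amp.h>
    #include <ppl.h>
    #include <vector>
    using namespace concurrency;

    int main()
    {
        accelerator gpu;                                 // default accelerator
        const int threads = 4, n = 4096;
        std::vector<double> results(threads);

        parallel_for(0, threads, [&](int t) {
            accelerator_view view = gpu.create_view();   // per-thread command stream
            array<double, 1> data(n, view);              // resource bound to this view

            parallel_for_each(view, data.extent, [&data](index<1> i) restrict(amp) {
                data[i] = i[0] * 0.5;                    // placeholder kernel
            });

            std::vector<double> host(n);
            copy(data, host.begin());                    // blocking read-back
            results[t] = host[0];                        // no locks anywhere
        });
    }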

    Regards,

    Amit


    Amit K Agarwal

    Thursday, July 12, 2012 2:14 AM
  • Thanks a lot for the really clear answer! Awesome insights.

    So the creation of the accelerator_view for each concurrent thread is superfluous. That makes the code a bit easier to check.

    I'm not sure deferring would help; it'd just have been the straw I'd have grasped at if concurrent use of accelerator_views hadn't been possible. So a deferred context in DX11 and the deferred queuing mode in AMP aren't related at all then, under the hood.
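
    For reference, my understanding is that AMP's queuing mode only controls when queued commands are flushed to the device, not cross-thread batching – a sketch, assuming the default accelerator:

    accelerator_view batched = accelerator().create_view(queuing_mode_automatic);
    // ... enqueue parallel_for_each calls on 'batched' ...
    batched.flush();   // submit queued commands without blocking
    batched.wait();    // block until they have completed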

    I think I have two possibilities for concurrency: I can create a pipeline, or I can go for more massive tiles. Say I have 4 datasets to compute on; they are entirely independent computationally (and don't share outputs), though they query the same read-only dataset.

    Now I could raise the tile's third dimension to 4 (currently the tile is 2-dimensional) and demultiplex the used resources by that third dimension, effectively creating/simulating array[_view]-arrays, a bit like texture-arrays in DX11. This possibly hides more compute latency and allows more CUs to be used. But it doesn't contribute to solving the problem of long GPU-download latencies when I need the results back. And the algorithm is not constant-time, so I'd have to manage a diverging third-dimension case when a dataset terminates prematurely (4->3->2->1). That's bad for code complexity; I'd want to prevent the inevitable explosion of templates. Or I keep executing dead datasets, in which case the longest-running dataset defines the shortest possible execution time.
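
    Roughly what I mean by demultiplexing over the tile's extra dimension (all shapes here are made up; the placeholder kernel just tags which dataset a tile serves):

    #include <amp.h>
    using namespace concurrency;

    int main()
    {
        accelerator_view view = accelerator().create_view();
        const int tilesY = 8, tilesX = 8;                // assumed tile grid
        array<double, 3> out(4, 3 * tilesY, 9 * tilesX, view);

        // 4 independent datasets stacked in dimension 0, tile shape <1, 3, 9>
        parallel_for_each(view, out.extent.tile<1, 3, 9>(),
            [&out](tiled_index<1, 3, 9> t) restrict(amp) {
                int set = t.tile[0];                     // 0..3: this tile's dataset
                out[t.global] = set;                     // placeholder kernel
            });
    }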

    If I create a pipeline, it's not so much the concurrent execution of the kernels that helps, but the hope that I can hide the GPU-download latency. If the device allows concurrent download and/or upload alongside execution of the next kernel (in a concurrent thread), it'd be possible to fully hide the download/upload impact. The download impact is so severe (~1:15 kernel:download) that speculative execution of a number of kernels in quick succession is faster than checking each kernel's results after every execution. The pipeline would be "a) upload unique data - b) run kernel - c) check scalar results/summary and loop to b) - d) request download of a larger collection of results"; with 4 threads I could hide a), c) and d). If the hardware allowed concurrent kernels in addition, it'd be the golden ticket, as I could reach 100% saturation of the GPU and the maximum possible parallel execution on the CPU.
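
    A sketch of the overlap I'm hoping for, using copy_async so batch k's download can run while batch k+1's kernel executes (names, sizes and the kernel are made up):

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    int main()
    {
        accelerator gpu;
        const int batches = 4, n = 1 << 20;
        std::vector<std::vector<double>> host(batches, std::vector<double>(n));
        std::vector<accelerator_view> views;
        std::vector<array<double, 1>> dev;
        dev.reserve(batches);
        for (int b = 0; b < batches; ++b) {
            views.push_back(gpu.create_view());          // independent command stream
            dev.emplace_back(n, views[b]);
        }

        completion_future pending;                       // download of the previous batch
        for (int b = 0; b < batches; ++b) {
            array<double, 1> &data = dev[b];
            parallel_for_each(views[b], data.extent, [&data](index<1> i) restrict(amp) {
                data[i] = i[0] * 0.5;                    // placeholder kernel
            });
            if (b > 0)
                pending.get();                           // batch b-1's results are home
            pending = copy_async(dev[b], host[b].begin());  // overlaps the next kernel
        }
        pending.get();                                   // wait for the last download
    }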

    Thanks a lot for the information, and the discussion.

    Thursday, July 12, 2012 8:42 PM