C++ AMP: multiple concurrent parallel_for_each on the same accelerator
-
February 17, 2012, 2:21 PM
Hi,
parallel_for_each is asynchronous, and there is a command queue on the accelerator_view. What happens if multiple parallel_for_each calls are invoked from the same thread (assuming the kernels work on different concurrency::array objects)? Will they run in parallel on the same concurrency::accelerator? Or will they be serialized (and in that case, what happens if parallel_for_each is invoked from multiple threads)?
Best regards, Arnaud.
All replies
-
February 18, 2012, 12:52 AM (Owner)
Hi Arnaud
First, not directly addressing your question, but related to scheduling so you may find it interesting, is this post:
http://blogs.msdn.com/b/nativeconcurrency/archive/2011/11/23/understanding-accelerator-view-queuing-mode-in-c-amp.aspx
Second, an older response also doesn’t directly answer your question, but has some related information:
http://social.msdn.microsoft.com/Forums/en-US/parallelcppnative/thread/19488de7-84cf-4ff2-b7fb-410b9c1f56d8
To more directly answer your question, regardless of how many CPU threads you use to schedule kernels (parallel_for_each computations) to a single accelerator, only one of them will execute at a time on the accelerator. That is taken care of by Windows and the underlying DirectX layer.
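As a rough illustration of that behavior, here is a toy CPU-only model in standard C++. It is not the C++ AMP API and says nothing about how Windows or DirectX implement scheduling; `ToyAccelerator`, `submit` and `max_overlap` are hypothetical names, and the mutex is simply an assumed stand-in for "one kernel at a time". Several CPU threads submit kernels that touch different buffers, and the model records how many kernels were ever running at once.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Toy stand-in for a single accelerator (NOT the C++ AMP API): it runs at
// most one "kernel" at a time, however many CPU threads submit work.
class ToyAccelerator {
public:
    // Runs the kernel while holding the device lock, so kernels submitted
    // from different threads execute one after another.
    void submit(const std::function<void()>& kernel) {
        std::lock_guard<std::mutex> device_lock(device_);
        ++running_;
        if (running_ > max_running_) max_running_ = running_;
        kernel();
        --running_;
    }

    // Highest number of kernels ever observed running simultaneously.
    int max_concurrent_kernels() const { return max_running_; }

private:
    std::mutex device_;
    int running_ = 0;
    int max_running_ = 0;
};

// Submits `kernels_per_thread` kernels from each of `thread_count` CPU
// threads, each thread working on its own buffer, and returns the peak
// number of kernels that overlapped on the toy accelerator.
int max_overlap(int thread_count, int kernels_per_thread) {
    ToyAccelerator accelerator;
    std::vector<std::vector<int>> data(thread_count, std::vector<int>(1024, 1));
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&accelerator, &data, t, kernels_per_thread] {
            for (int k = 0; k < kernels_per_thread; ++k) {
                accelerator.submit([&data, t] {
                    for (int& x : data[t]) x *= 2;  // the "kernel" body
                });
            }
        });
    }
    for (auto& th : threads) th.join();
    return accelerator.max_concurrent_kernels();
}
```

In this model, calling `max_overlap` with any positive thread and kernel counts reports a peak of one, which mirrors the answer above: adding CPU threads changes who submits the work, not how many kernels execute at a time.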
After you’ve consumed the resources above, if you have a follow-up question, please feel free to let us know, ideally with the scenario you are trying to satisfy.
Cheers
Daniel
http://www.danielmoth.com/Blog/
- Marked as answer by Arnaud Faucher, February 18, 2012, 3:14 AM
-
February 18, 2012, 3:14 AM
Hi Daniel,
The algorithm we are working on can be decomposed into a workflow made of parallel branches. For code clarity and maintainability, we are using the non-tiled version of parallel_for_each (p-f-e) to program the kernels in these parallel branches; this implies that the tile size is determined at compile time.
Because the GPU we are using (GTX 580) has quite a lot of processing power, I thought that multiple non-tiled p-f-e kernels would be able to run simultaneously on the same accelerator. I understand this is not the case, even in Windows 8 (where there is an optional suspend/resume mode of operation for lengthy tasks).
To summarize, we have 2 options:
1. build a machine with multiple cheaper GPUs and continue using non-tiled computation domains (but beware of I/O contention);
2. feed our big GPU with adequately tiled computation domains (512, 768 or even 1024 threads wide) in order not to waste the available power.
I'll naturally opt for the second option, because I'm not sure about the parallelism of I/O, and because multi-GPU machines are not the norm nowadays anyway. But this also implies that we'll have to 'sense' the right tile sizes at build time (it depends on the GPU, doesn't it?) and use 'switch' cases and templates, at the expense of readability and maintainability... But this is AMP v1, and it's already awesome.
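The 'switch' cases and templates idea can be sketched as follows. This is a CPU-only stand-in in standard C++, not C++ AMP code: `scale_tiled` and `scale_with_tile_size` are hypothetical names, and the inner loops only imitate how a tiled kernel walks its tiles. The point it tries to show is the dispatch pattern, where the tile size is a template parameter (as it is for a C++ AMP tiled extent) and a runtime value selects one of a few pre-instantiated versions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-only stand-in for a tiled kernel. TileSize is a template parameter,
// mirroring the fact that tile dimensions are fixed at compile time.
template <int TileSize>
void scale_tiled(std::vector<float>& data, float factor) {
    static_assert(TileSize > 0, "tile size must be positive");
    // Assumption of this sketch: the element count is a whole number of tiles.
    assert(data.size() % TileSize == 0);
    for (std::size_t tile_origin = 0; tile_origin < data.size();
         tile_origin += TileSize) {
        // Each iteration of the outer loop plays the role of one tile.
        for (int local = 0; local < TileSize; ++local) {
            data[tile_origin + local] *= factor;
        }
    }
}

// Bridges a tile size chosen at run time to one of the instantiated
// templates. Returns false when the requested size has no instantiation.
bool scale_with_tile_size(std::vector<float>& data, float factor,
                          int tile_size) {
    switch (tile_size) {
        case 512:  scale_tiled<512>(data, factor);  return true;
        case 768:  scale_tiled<768>(data, factor);  return true;
        case 1024: scale_tiled<1024>(data, factor); return true;
        default:   return false;
    }
}
```

A caller would pick `tile_size` once, for example from a configuration value, and pass it to `scale_with_tile_size`. The cost is the one described above: every supported size needs its own case, so the list has to be maintained by hand.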
Quite a lot of interesting challenges for the next versions of AMP, operating systems and hardware!
Thanks, Arnaud.
-
February 18, 2012, 5:06 AM (Owner)
Hi Arnaud
Exactly, there is no concurrent execution of parallel_for_each computations on a single piece of hardware.
Please note that the difference between using the simple model and using the tiled model has nothing to do with the power of your GPUs. I say this because your description seems to suggest that there is some correlation between them e.g. that using the simple model is best for low-end hardware whereas the tiled model is for higher end hardware - that is not the case at all. Those are orthogonal considerations. The simple model will use all the threads that you schedule in the parallel_for_each so there will be no wastage of available compute power.
For more on the tiled model, please read these two posts (and the links they point to):
http://www.danielmoth.com/Blog/Scheduling-Thread-Tiles-With-C-AMP.aspx
http://www.danielmoth.com/Blog/tilestatic-Tilebarrier-And-Tiled-Matrix-Multiplication-With-C-AMP.aspx
You’ll see that the reason you use tiling is to take advantage of tile_static memory, which you can't with the simple model. If you think you can implement your algorithm to do that, you should do it regardless of the GPU hardware you are using. Equally if you are happy with the performance gains of using the simple model, again you’ll be proportionally happy regardless of the GPU hardware you are using.
Cheers
Daniel
http://www.danielmoth.com/Blog/

