AMP reduction sample strange performance - sequential faster?
-
Wednesday, April 11, 2012 00:39
So I'm running the parallel reduction sample (http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/08/parallel-reduction-using-c-amp.aspx) and I'm timing functions using this: http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/28/high-resolution-timer-for-c.aspx
Weirdly enough, the sequential reduction is apparently faster:
Using device : NVIDIA GeForce GTS 450
Running kernels...
SUCCESS: reduction1! in 137.910769ms
SUCCESS: reduction2! in 99.061875ms
SUCCESS: reduction3! in 66.364982ms
SUCCESS: reduction4! in 58.343183ms
SUCCESS: reduction5! in 74.262493ms
SUCCESS: reduction6! in 55.159395ms
SUCCESS: reduction7! in 49.434881ms
SUCCESS: sequential_reduction! in 16.771351ms
All replies
-
Wednesday, April 11, 2012 02:12 Owner
Hi lsr
The output you shared tells me that you haven’t grabbed the updated reduction code from the blog post, and your question tells me that you haven’t read the updated text either. Please revisit the blog text and ZIP file.
To answer your question here:
- The reduction algorithm does not show speedups on a GPU compared with a CPU implementation *when* the data copy time is included.
- If you exclude the copy time, and measure only the kernel execution, you will notice the speedup.
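To make the two measurements above concrete, here is a minimal sketch in standard C++ of timing the copy and the reduction as separate phases. It is a CPU-only stand-in and not the blog's code: the `device` vector, `time_ms`, `Timings`, and `measure` are hypothetical names, and the vector assignment merely plays the role of the host-to-GPU transfer. On a real accelerator you would also have to wait for the kernel to finish before stopping the clock.

```cpp
#include <cassert>
#include <chrono>
#include <numeric>
#include <utility>
#include <vector>

// Runs f once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    std::forward<F>(f)();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical result record: the two phases are reported separately.
struct Timings {
    double copy_ms;    // time spent moving the data
    double kernel_ms;  // time spent in the reduction itself
    float  result;     // the reduced value
};

Timings measure(const std::vector<float>& host) {
    assert(!host.empty());
    std::vector<float> device;  // stand-in for device memory (assumption)
    float sum = 0.0f;
    Timings t{};
    // Phase 1: the "transfer". Timed on its own.
    t.copy_ms = time_ms([&] { device = host; });
    // Phase 2: the "kernel". Timed without the transfer.
    t.kernel_ms = time_ms([&] {
        sum = std::accumulate(device.begin(), device.end(), 0.0f);
    });
    t.result = sum;
    return t;
}
```

Comparing `copy_ms + kernel_ms` against `kernel_ms` alone shows how much of an end-to-end figure is due to data movement rather than to the reduction.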
I can sense follow up questions, so I’ll proactively address them:
- Reduction is a popular algorithm, which is why we shared it on our blog. As per the notice at the top of that blog post, we are using it to demonstrate techniques, not to demonstrate a speed-up over the CPU for this algorithm. In fact, our simple model is almost as fast as the maxed-out tiled one, so you don’t need to bother with all the optimizations.
- Typically reductions are performed on data that has *already been transferred* to the GPU for other manipulations, and at the end you want to perform a reduction on the output of the previous processing. So the data is already there. This is a common pattern with GPU solutions: copy the data once, run a bunch of kernels without copying back to the CPU, and at the end copy the results back. That is why measuring kernel execution alone is interesting for some algorithms.
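The copy-once pattern described above can be sketched roughly as follows. This is a CPU-only model in standard C++, not C++ AMP code: `DeviceBuffer`, the three "kernels", and `pipeline` are hypothetical names, and the `uploads` counter exists only to make visible that the data crosses the host/device boundary a single time.

```cpp
#include <numeric>
#include <vector>

// Hypothetical stand-in for a GPU buffer. The data lives in a std::vector
// and each host-to-"device" copy is counted.
struct DeviceBuffer {
    std::vector<float> data;
    int uploads = 0;

    void upload(const std::vector<float>& host) {
        data = host;
        ++uploads;
    }
};

// "Kernels" that work in place on data that is already resident.
void scale_kernel(DeviceBuffer& buf, float factor) {
    for (float& x : buf.data) x *= factor;
}

void square_kernel(DeviceBuffer& buf) {
    for (float& x : buf.data) x *= x;
}

// Final reduction: only a single value leaves the "device".
float reduce_kernel(const DeviceBuffer& buf) {
    return std::accumulate(buf.data.begin(), buf.data.end(), 0.0f);
}

float pipeline(DeviceBuffer& buf, const std::vector<float>& host) {
    buf.upload(host);           // one copy in
    scale_kernel(buf, 2.0f);    // no transfers between kernels
    square_kernel(buf);
    return reduce_kernel(buf);  // one scalar comes back
}
```

In this model the reduction is just the last step of a longer chain, so the transfer cost is paid once for the whole chain rather than once per kernel.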
HTH, and apologies that we updated the blog post after you had read it and downloaded the old code.
Cheers
Daniel
http://www.danielmoth.com/Blog/
- Proposed as answer by Łukasz Mendakiewicz (Microsoft Employee) Wednesday, April 11, 2012 02:29
- Marked as answer by lsr Wednesday, April 11, 2012 19:11
-
Wednesday, April 11, 2012 20:30
thanks Daniel, that was very helpful

