AMP reduction sample strange performance - sequential faster?

Answers

  • Hi lsr

    Looking at the output you shared tells me that you haven't grabbed the updated reduction code from the blog post, and looking at your question tells me that you haven't read the updated text of the post. Please revisit the blog text and the ZIP file.

    To answer your question here:

    1. The reduction algorithm does not show speedups on a GPU compared with a CPU implementation *when* the data copy time is included.
    2. If you exclude the copy time and measure only the kernel execution, you will notice the speedup (see the timing sketch after this list).
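
    A minimal C++ AMP timing sketch of that distinction (illustrative only, not the code from the blog post or the ZIP; the kernel is a trivial placeholder, and the point is simply where the timer starts and stops):

    ```cpp
    #include <amp.h>
    #include <chrono>
    #include <iostream>
    #include <vector>

    using namespace concurrency;

    int main()
    {
        std::vector<float> source(16 * 1024 * 1024, 1.0f);
        accelerator_view av = accelerator().default_view;

        // Copy time: constructing the array transfers the data to the accelerator.
        auto t0 = std::chrono::high_resolution_clock::now();
        array<float, 1> data(static_cast<int>(source.size()), source.begin(), av);
        av.wait();
        auto t1 = std::chrono::high_resolution_clock::now();

        // Kernel time only: the data is already resident on the accelerator.
        parallel_for_each(data.extent, [&data](index<1> idx) restrict(amp)
        {
            data[idx] *= 2.0f;   // stand-in for the real reduction kernel
        });
        av.wait();               // ensure the kernel has actually finished
        auto t2 = std::chrono::high_resolution_clock::now();

        std::cout << "copy:   " << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms\n"
                  << "kernel: " << std::chrono::duration<double, std::milli>(t2 - t1).count() << " ms\n";
    }
    ```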

    I can sense follow-up questions, so I'll address them proactively:

    1. Reduction is a popular algorithm, which is why we shared it on our blog. As per the notice at the top of that blog post, we are using it to demonstrate techniques, not to demonstrate a speed-up over the CPU for this algorithm. In fact, our simple implementation is almost as fast as the maxed-out tiled one, so you don't need to bother with all the optimizations.
    2. Typically, reductions are performed on data that has *already been transferred* to the GPU for other manipulations, and at the end you want to reduce the output of that earlier processing, so the data is already there. This is a common pattern with GPU solutions: copy the data once, run a series of kernels without copying back to the CPU, and only at the end copy the results back. That is why measuring kernel execution on its own is interesting for some algorithms (a sketch of this pattern follows the list).
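
    A rough sketch of that pattern (again illustrative, with placeholder kernels rather than the blog's reduction code): copy once, chain kernels on the accelerator, and read back only at the end.

    ```cpp
    #include <amp.h>
    #include <numeric>
    #include <vector>

    using namespace concurrency;

    float process_and_reduce(const std::vector<float>& host_data)
    {
        // 1. Copy the input to the accelerator once.
        array<float, 1> data(static_cast<int>(host_data.size()), host_data.begin());

        // 2. Run a sequence of kernels; intermediate results stay on the accelerator.
        parallel_for_each(data.extent, [&data](index<1> idx) restrict(amp)
        {
            data[idx] = data[idx] * data[idx];   // placeholder for "other manipulations"
        });
        parallel_for_each(data.extent, [&data](index<1> idx) restrict(amp)
        {
            data[idx] += 1.0f;                   // another placeholder stage
        });

        // 3. Copy back only at the end. The final sum is done on the host here just to
        //    keep the sketch short; the blog's kernels perform the reduction on the GPU
        //    and copy back only the final value.
        std::vector<float> result(host_data.size());
        copy(data, result.begin());
        return std::accumulate(result.begin(), result.end(), 0.0f);
    }
    ```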

    HTH, and apologies that we updated the blog post after you had read it and downloaded the old code.

    Cheers
    Daniel


    http://www.danielmoth.com/Blog/

    Wednesday, April 11, 2012 2:12 AM
    Owner

All Replies

  • Thanks Daniel, that was very helpful.
    Wednesday, April 11, 2012 8:30 PM