Gaussian sample: Removing condition from the kernel increases (sic) computation time 4x

    Question

  • Hi, 

    I've modified the Gaussian blur sample from here http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/14/gaussian-blur-using-c-amp.aspx to measure the impact of a conditional statement in the kernel on computation time, and the result is that without the if in the kernel it runs 4x slower in the release configuration. Details below.

    The following changes were made to the code:

    1. Passed an array_view section so the computation runs on a smaller sub-domain, avoiding the need for an if statement in the kernel.

    2. Added wait() after parallel_for_each.

    3. Printed the time it takes to compute, i.e. to run parallel_for_each and wait().

    static_assert(BLUR_MTX_DIM % 2 == 1, "need odd size");
    #define BLUR_OFFSET 2 // BLUR_MTX_DIM / 2
    ...
    typedef std::chrono::high_resolution_clock::time_point time_point;
    typedef std::chrono::high_resolution_clock::duration duration;
    typedef std::chrono::high_resolution_clock hrclock;

    void gaussian_blur::execute(Concurrency::accelerator_view& accview)
    {
        array<float, 2> a_data(size, size, data.begin(), accview);
        array<float, 2> a_amp_result(size, size, accview);

        index<2> offset(BLUR_OFFSET, BLUR_OFFSET);
        extent<2> extent(size - 2*BLUR_OFFSET, size - 2*BLUR_OFFSET);
        array_view<const float, 2> a_data_view(a_data);
        array_view<float, 2> a_amp_result_view = a_amp_result.section(offset, extent);
        a_amp_result_view.discard_data();

        time_point start = hrclock::now();
        gaussian_blur_simple_amp_kernel(a_data_view, a_amp_result_view, accview);
        accview.wait();
        duration elapsed = hrclock::now() - start;
        std::cout << "Compute time: "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count() << "ms\n";
        //a_amp_result_view.synchronize();
        copy(a_amp_result, amp_result.begin());
    }
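    (As a side note: since a section's array_view is indexed relative to the section's origin, the kernel below has to add BLUR_OFFSET back when reading from the full input view. Here is a minimal sketch of that index mapping, using a toy 8x8 matrix and the default accelerator rather than the actual sample code:)

    // Toy illustration (not part of the sample): writing through a section view
    // and reading the matching elements from the full parent view.
    #include <amp.h>
    #include <vector>
    #include <iostream>
    using namespace concurrency;

    int main()
    {
        const int size = 8, off = 2;                    // stand-ins for the sample's size / BLUR_OFFSET
        std::vector<float> src(size * size, 1.0f), dst(size * size, 0.0f);
        array_view<const float, 2> in(size, size, src); // full input view
        array_view<float, 2> full_out(size, size, dst);

        // Interior section: idx == (0,0) in 'out' corresponds to (off,off) in 'full_out'.
        array_view<float, 2> out = full_out.section(index<2>(off, off),
                                                    extent<2>(size - 2 * off, size - 2 * off));
        out.discard_data();

        parallel_for_each(out.extent, [=](index<2> idx) restrict(amp)
        {
            // Shift by the section offset to read the same element from the full input view.
            out[idx] = in(idx[0] + off, idx[1] + off);
        });
        out.synchronize();

        std::cout << dst[off * size + off] << "\n";     // the element behind out[{0,0}]
        return 0;
    }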

    Then I ran two cases: first with the original kernel (TEST1 uncommented), and second with TEST1 commented out (since the compute domain and output matrix are smaller than the input, the if is not necessary):

    void gaussian_blur::gaussian_blur_simple_amp_kernel(const array_view<const float,2> &input, array_view<float,2> &output, Concurrency::accelerator_view& accview)
    {
        static_assert(BLUR_MTX_SIZE == (BLUR_MTX_DIM*BLUR_MTX_DIM), "Sample assumes filter matrix to be a square matrix");
        int size = input.extent[1];

        parallel_for_each(accview, output.extent, [=] (index<2> idx) restrict(amp)
        {
            float value = 0.0f;
            float total = 0.0f;
            const float gaussian_blur_matrix[BLUR_MTX_DIM][BLUR_MTX_DIM] = { BLUR_MTX_VALUES };
            for (int i = 0; i < BLUR_MTX_DIM; i++)
            {
                for (int j = 0; j < BLUR_MTX_DIM; j++)
                {
                    int x = BLUR_OFFSET + idx[1] + i - (BLUR_MTX_DIM / 2);
                    int y = BLUR_OFFSET + idx[0] + j - (BLUR_MTX_DIM / 2);

    // TEST1        if (x > -1000)
    // TEST2        if ((x >= 0) & (y >= 0) & (x < size) & (y < size))
                    {
                        float coef = gaussian_blur_matrix[i][j];
                        total += coef;
                        value += coef * input(y, x);
                    }
                }
            }
            output[idx] = value / total;
        });
    }

    Then I compiled in debug and release configurations, with and without the "if" in the kernel commented out.

    The compute times are as follows:

    If on, debug = 857 ms
    If off, debug = 835 ms (slightly faster; ok)

    If on, release = 292 ms (release is faster than debug; ok)
    If off, release = 717 ms (Why?)

    In fact, it does not matter what expression is in the if statement; even an always-true expression such as x > -10000 brings performance back.

    What could be the reason for this behaviour?

    Thanks,

    Alex.

    P.S. 

    I'm running this on NVIDIA NVS 4200M with matrix size = 5000.

    The code is here: http://dl.dropbox.com/u/1496653/AMP/gaussian_blur_views_conditions.zip

    Update1:

    On an ATI FirePro V3800 the behavior is as expected:

    Using device : ATI FirePro V3800 (FireGL)
    Applying Gaussian filter using non-tiled version of kernel
    Compute time if: 670ms
    Compute time noif: 343ms
    Comparing results done. Verification Pass

    Update2:

    Updated code to run both scenarios sequentially

    • Edited by Saspus01, May 2, 2012 22:06: Update URL for the source code
    May 2, 2012 21:43

Answers

  • Thanks for reporting this. I downloaded the code, built it with VS11 Beta, and tried it on my NVIDIA GTX 580; here is what I got (for Release/x64):

       Using device : NVIDIA GeForce GTX 580
       Applying Gaussian filter using non-tiled version of kernel
       Compute time if: 135ms
       Compute time noif: 40ms
       Comparing results done. Verification Pass

    However, I noticed some issues with timing. (For timing a C++ AMP application, please read http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/28/how-to-measure-the-performance-of-c-amp-algorithms.aspx, and http://blogs.msdn.com/b/nativeconcurrency/archive/2012/04/25/data-warm-up-when-measuring-performance-with-c-amp.aspx)

    So in the code I downloaded, there are a few issues I want to bring to your attention:
    • Both gaussian_blur_simple_amp_kernel and gaussian_blur_simple_amp_kernel_fast are only invoked once, so your timing includes the JIT time that compiles the bytecode into the hardware's machine code.
    • It would be a good idea to add an accview.wait() before launching the kernel. This ensures that all outstanding activities on the accview have completed, for example any outstanding copy operation. Note that even for a synchronous copy, the implementation still has the freedom to use asynchrony underneath, as long as it can ensure the copy is as-if synchronous.

    So I modified your code a little bit, as follows:

    gaussian_blur_simple_amp_kernel(a_data_view, a_amp_result_view, accview); // warmup kernel
    gaussian_blur_simple_amp_kernel_fast(a_data_view, a_amp_result_view, accview); //warmup kernel
    accview.wait(); // all previous commands are done
    
    time_point start = hrclock::now();
    gaussian_blur_simple_amp_kernel(a_data_view, a_amp_result_view, accview);
    accview.wait();
    duration elapsed = hrclock::now() - start;
    std::cout << "Compute time if: " << std::chrono::duration_cast<std::chrono::milliseconds> (elapsed).count() << "ms\n";
    
    start = hrclock::now();
    gaussian_blur_simple_amp_kernel_fast(a_data_view, a_amp_result_view, accview);
    accview.wait();
    elapsed = hrclock::now() - start;
    std::cout << "Compute time noif: " << std::chrono::duration_cast<std::chrono::milliseconds> (elapsed).count() << "ms\n";

    Then I re-ran the test and got:

        Using device : NVIDIA GeForce GTX 580
        Applying Gaussian filter using non-tiled version of kernel
        Compute time if: 17ms
        Compute time noif: 33ms
        Comparing results done. Verification Pass

    So basically, this confirms what you reported on the GTX 580. I also ran the code on an ATI HD5870, which behaves as expected: "noif" is faster than "if".

    This looks like a driver issue. We will do more investigation on our side and will talk to the hardware vendor about it.

    Again, thanks for reporting the issue. Please keep it coming.

    Regards,

    Weirong


    May 3, 2012 16:51