Gaussian sample: Removing condition from the kernel increases (sic) computation time 4x
-
May 2, 2012 21:43
Hi,
I've modified the Gaussian blur sample from http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/14/gaussian-blur-using-c-amp.aspx to measure the impact of a conditional statement in the kernel on computation time. The result was that without the if in the kernel, it runs 4x slower in the release configuration. Details below.
I made the following changes to the code:
1. Pass an array_view section so that the kernel computes on a smaller sub-domain, which avoids the need for the if statement in the kernel.
2. Added wait() after parallel_for_each.
3. Print the time it takes to compute, i.e. to run parallel_for_each and wait().
static_assert(BLUR_MTX_DIM % 2 == 1, "need odd size");
#define BLUR_OFFSET 2 // BLUR_MTX_DIM / 2
...
typedef std::chrono::high_resolution_clock::time_point time_point;
typedef std::chrono::high_resolution_clock::duration duration;
typedef std::chrono::high_resolution_clock hrclock;

void gaussian_blur::execute( Concurrency::accelerator_view& accview)
{
    array<float, 2> a_data(size, size, data.begin(), accview);
    array<float, 2> a_amp_result(size, size, accview);

    index<2> offset(BLUR_OFFSET, BLUR_OFFSET);
    extent<2> extent(size - 2*BLUR_OFFSET, size - 2*BLUR_OFFSET);
    array_view<const float, 2> a_data_view(a_data);
    array_view<float, 2> a_amp_result_view = a_amp_result.section(offset, extent);
    a_amp_result_view.discard_data();

    time_point start = hrclock::now();
    gaussian_blur_simple_amp_kernel(a_data_view, a_amp_result_view, accview);
    accview.wait();
    duration elapsed = hrclock::now() - start;
    std::cout << "Compute time: " << std::chrono::duration_cast<std::chrono::milliseconds> (elapsed).count() << "ms\n";

    //a_amp_result_view.synchronize();
    copy(a_amp_result, amp_result.begin());
}
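For readers who want to see why computing on the inner section makes the bounds check redundant, here is a minimal CPU-only sketch in standard C++ (not C++ AMP, and not part of the downloadable sample). The names blur_interior, kDim and kOffset are hypothetical stand-ins for the kernel, BLUR_MTX_DIM and BLUR_OFFSET, and the weights are passed in as a flat array purely for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the sample's BLUR_MTX_DIM and BLUR_OFFSET.
constexpr int kDim = 5;
constexpr int kOffset = kDim / 2;

// Blurs only the interior of a size x size row-major image.
// The result has (size - 2*kOffset)^2 elements; result element (r, c)
// corresponds to input element (r + kOffset, c + kOffset).
std::vector<float> blur_interior(const std::vector<float>& input, int size,
                                 const std::array<float, kDim * kDim>& weights) {
    assert(size > 2 * kOffset);
    assert(input.size() == static_cast<std::size_t>(size) * size);

    const int inner = size - 2 * kOffset;
    std::vector<float> output(static_cast<std::size_t>(inner) * inner);

    for (int r = 0; r < inner; ++r) {
        for (int c = 0; c < inner; ++c) {
            float value = 0.0f;
            float total = 0.0f;
            for (int i = 0; i < kDim; ++i) {
                for (int j = 0; j < kDim; ++j) {
                    // Same index arithmetic as the kernel. Because
                    // r, c <= inner - 1 and i, j <= 2*kOffset, both x and y
                    // stay within [0, size - 1], so no bounds test is needed.
                    const int x = kOffset + c + i - kOffset;
                    const int y = kOffset + r + j - kOffset;
                    const float coef = weights[static_cast<std::size_t>(i) * kDim + j];
                    total += coef;
                    value += coef * input[static_cast<std::size_t>(y) * size + x];
                }
            }
            output[static_cast<std::size_t>(r) * inner + c] = value / total;
        }
    }
    return output;
}
```

With a uniform weight matrix and a constant input image, every output element should equal the input constant, which makes for a quick sanity check of the index arithmetic.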
Then I ran two cases: first with the original kernel (TEST1 uncommented), and second with TEST1 commented out (the if is not necessary, since the compute domain and the output matrix are smaller than the input):
void gaussian_blur::gaussian_blur_simple_amp_kernel(const array_view<const float,2> &input, array_view<float,2> &output, Concurrency::accelerator_view& accview)
{
    static_assert(BLUR_MTX_SIZE == (BLUR_MTX_DIM*BLUR_MTX_DIM), "Sample assumes filter matrix to be a square matrix");
    int size = input.extent[1];
    parallel_for_each(accview, output.extent, [=] (index<2> idx) restrict(amp)
    {
        float value = 0.0f;
        float total = 0.0f;
        const float gaussian_blur_matrix[BLUR_MTX_DIM][BLUR_MTX_DIM] = { BLUR_MTX_VALUES };
        for(int i=0; i<BLUR_MTX_DIM; i++)
        {
            for(int j=0; j<BLUR_MTX_DIM; j++)
            {
                int x = BLUR_OFFSET + idx[1] + i - (BLUR_MTX_DIM / 2);
                int y = BLUR_OFFSET + idx[0] + j - (BLUR_MTX_DIM / 2);
                // TEST1 if (x > -1000)
                // TEST2 if ((x >= 0) & (y >= 0) & (x < size) & (y < size))
                {
                    float coef = gaussian_blur_matrix[i][j];
                    total += coef;
                    value += coef * input(y, x);
                }
            }
        }
        output[idx] = value / total;
    });
}
Then I compiled in debug and release configurations, with and without the "if" in the kernel commented out.
The times it takes to compute are as follows:
If on, debug = 857 ms
If off, debug = 835 ms (slightly faster; ok)
If on, release = 292 ms (release is faster than debug; ok)
If off, release = 717 ms (Why?)
In fact, it does not matter what expression is in the if statement: even an always-true expression such as x > -10000 brings the performance back.
What could be the reason for this behaviour?
Thanks,
Alex.
P.S.
I'm running this on NVIDIA NVS 4200M with matrix size = 5000.
The code is here: http://dl.dropbox.com/u/1496653/AMP/gaussian_blur_views_conditions.zip
Update1:
On an ATI FirePro V3800 the behavior is as expected:
Using device : ATI FirePro V3800 (FireGL)
Applying Gaussian filter using non-tiled version of kernel
Compute time if: 670ms
Compute time noif: 343ms
Comparing results done. Verification Pass
Update2:
Updated the code to run both scenarios sequentially.
- Edited by Saspus01, May 2, 2012 22:06: Update URL for the source code
All replies
-
May 3, 2012 16:51
Thanks for reporting this. I downloaded the code, built it with VS11 Beta, and tried it on my NVIDIA GTX 580. Here is what I got (for Release/x64):
Using device : NVIDIA GeForce GTX 580
Applying Gaussian filter using non-tiled version of kernel
Compute time if: 135ms
Compute time noif: 40ms
Comparing results done. Verification Pass
However, I noticed some issues with the timing. (For timing a C++ AMP application, please read http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/28/how-to-measure-the-performance-of-c-amp-algorithms.aspx and http://blogs.msdn.com/b/nativeconcurrency/archive/2012/04/25/data-warm-up-when-measuring-performance-with-c-amp.aspx)
In the code I downloaded, there are a few issues I want to bring to your attention:
• Both gaussian_blur_simple_amp_kernel and gaussian_blur_simple_amp_kernel_fast are invoked only once, so your timing includes the JIT time needed to compile the bytecode into the hardware's machine code.
• It would be a good idea to add an accview.wait() before launching the kernel. This ensures that all outstanding activity on the accview, for example any outstanding copy operation, has completed. Note that even for a synchronous copy, the implementation is still free to use asynchrony under the hood, as long as it can ensure the copy behaves as if it were synchronous.
So I modified your code a little bit, as follows:
gaussian_blur_simple_amp_kernel(a_data_view, a_amp_result_view, accview);      // warmup kernel
gaussian_blur_simple_amp_kernel_fast(a_data_view, a_amp_result_view, accview); // warmup kernel
accview.wait(); // all previous commands are done

time_point start = hrclock::now();
gaussian_blur_simple_amp_kernel(a_data_view, a_amp_result_view, accview);
accview.wait();
duration elapsed = hrclock::now() - start;
std::cout << "Compute time if: " << std::chrono::duration_cast<std::chrono::milliseconds> (elapsed).count() << "ms\n";

start = hrclock::now();
gaussian_blur_simple_amp_kernel_fast(a_data_view, a_amp_result_view, accview);
accview.wait();
elapsed = hrclock::now() - start;
std::cout << "Compute time noif: " << std::chrono::duration_cast<std::chrono::milliseconds> (elapsed).count() << "ms\n";
Then I re-ran the test and got:
Using device : NVIDIA GeForce GTX 580
Applying Gaussian filter using non-tiled version of kernel
Compute time if: 17ms
Compute time noif: 33ms
Comparing results done. Verification Pass
So basically, the GTX 580 confirms what you reported. I also ran the code on an ATI HD5870, which behaves as expected: "noif" is faster than "if".
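The warm-up-then-measure pattern used for these numbers can be captured in a small generic helper. The following is only a sketch in standard C++: time_after_warmup is a hypothetical name, the work callable stands in for a kernel launch, and the sync callable stands in for accview.wait(). It uses std::chrono::steady_clock rather than the high_resolution_clock used in the thread, which is a choice made here and not something taken from the sample.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` untimed `warmups` times (so one-time costs
// such as JIT compilation are paid up front), calls `sync` to drain any
// outstanding activity, then times a single run of `work` followed by `sync`.
// Returns the elapsed time of the timed run in milliseconds.
template <typename Work, typename Sync>
double time_after_warmup(Work&& work, Sync&& sync, int warmups = 1) {
    assert(warmups >= 0);

    for (int i = 0; i < warmups; ++i) {
        work();  // untimed warm-up launch
    }
    sync();  // everything issued so far is done before the clock starts

    const auto start = std::chrono::steady_clock::now();
    work();
    sync();  // wait for the timed launch to finish
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

In the sample this might be called with a lambda that invokes gaussian_blur_simple_amp_kernel and another that calls accview.wait(); those bindings are an assumption about how one would wire it up, not code from the download.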
This looks like a driver issue. We will do more investigation on our side and will talk to the hardware vendor on it.
Again, thanks for reporting the issue. Please keep it coming.
Regards,
Weirong
- Edited by Zhu, Weirong, May 3, 2012 16:52
- Marked as answer by Saspus01, May 3, 2012 17:19

