How to deal with a huge amount of RGB data using C++ AMP

    Question

  • Hi

    I am now trying to use C++ AMP to speed up the image processing of a DirectShow filter. I read the tiling introductions and the examples but I think they are not quite suitable for what I am doing.

    The images/video I want to filter come from a 1080*1920, 60 fps HD source, which means 2073600 pixels per frame. What I want to do is change the RGB values of each pixel one by one and write them back to the pixel. For example:

    RNew = 0.5 * ROrigin + 20;

    GNew = 0.5 * GOrigin + 30;

    BNew = 0.5 * BOrigin + 10;

    This is not similar to the sum function in the tiling examples, and it doesn't involve computation with other pixels, so I am not sure how I should do the tiling. Here are the two ways I thought of:

    (1) I put the RGB values in an int vector[2073600*3], so it is laid out like R G B R G B...R G B. I tried to tile them into 3*(256 pixels)=768 values per tile, but I am not sure what I should do next. The values going to the R, G and B channels are different, and I only have one index idx to process them in parallel... If I use a for loop, it seems that it is no longer parallel processing...

    (2) If I make each tile only one pixel, that is, its R, G and B values, I know how to write the code. In that case, does the GPU work on one tile (one pixel) with three threads and then move on to the next tile? Or does it work on several tiles at the same time? If it works the first way, it also does not seem to be parallel processing.

    The amount of data I need to process is quite large. I hope to process the pixels in parallel to speed up the program, but at the moment it is slowing the program down, so I must have failed to process them in parallel. I would appreciate it if someone could help me with this problem. Thanks in advance.

    kfzaer


    • Edited by kfzaer Thursday, October 31, 2013 12:07 PM
    Thursday, October 31, 2013 12:07 PM

Answers

  • Hi kfzaer,

    The terms 'pfe' and 'TDR' that Boby used in the previous reply refer to 'parallel_for_each' and 'Timeout Detection and Recovery'. We use these short forms informally within the C++ AMP team, and Boby probably used them here by mistake. You can read about timeout detection and recovery in the context of C++ AMP on our team blog.

    It would be worth giving AMP a try to see how much performance gain you get. You can take two approaches.

    • You can use array/array_view to store your pixels. If your GPU supports CPU/GPU shared memory, less time will be spent transferring data between the GPU and the CPU.
    • You can use texture/texture_view to store your pixels. Textures exploit locality of reference and hence provide better performance for large data than array/array_view. However, texture/texture_view currently does not support CPU/GPU shared memory. Also, 3-component textures have many limitations and cannot be used in your scenario. You will need to use a 4-component texture, which then involves copying extra memory between the CPU and the GPU. Finally, you cannot both read from and write to a texture in a single parallel_for_each invocation.
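To make the cost of the 4-component requirement concrete, here is a minimal sketch in plain C++ (not AMP code) of padding packed RGB data to RGBA before upload and stripping it again afterwards. The helper names `pad_rgb_to_rgba` and `strip_rgba_to_rgb` and the use of `float` components are assumptions made for illustration; they are not part of the C++ AMP API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: expand tightly packed R G B R G B ... values into
// R G B A R G B A ..., filling the unused fourth component with `pad`.
std::vector<float> pad_rgb_to_rgba(const std::vector<float>& rgb, float pad = 0.0f)
{
    assert(rgb.size() % 3 == 0);  // input must hold whole pixels
    const std::size_t pixels = rgb.size() / 3;
    std::vector<float> rgba(pixels * 4);
    for (std::size_t p = 0; p < pixels; ++p) {
        rgba[4 * p + 0] = rgb[3 * p + 0];
        rgba[4 * p + 1] = rgb[3 * p + 1];
        rgba[4 * p + 2] = rgb[3 * p + 2];
        rgba[4 * p + 3] = pad;
    }
    return rgba;
}

// Hypothetical helper: drop the fourth component again after processing.
std::vector<float> strip_rgba_to_rgb(const std::vector<float>& rgba)
{
    assert(rgba.size() % 4 == 0);  // input must hold whole RGBA pixels
    const std::size_t pixels = rgba.size() / 4;
    std::vector<float> rgb(pixels * 3);
    for (std::size_t p = 0; p < pixels; ++p) {
        rgb[3 * p + 0] = rgba[4 * p + 0];
        rgb[3 * p + 1] = rgba[4 * p + 1];
        rgb[3 * p + 2] = rgba[4 * p + 2];
    }
    return rgb;
}
```

For a 2073600-pixel frame this grows the buffer from 2073600*3 to 2073600*4 values, so one third more data has to be copied in each direction.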

    You might want to try the second approach if the first approach does not give the desired speedup. Although textures have many limitations, for larger data sizes the benefit of locality of reference may outweigh them. Below is sample code showing the first approach.

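    #include <amp.h>
    #include <amp_graphics.h>
    #include <vector>
    
    using namespace concurrency;
    using namespace concurrency::graphics;
    
    void compute_rgb(std::vector<float_3>& rgb_pixels)
    {
    	accelerator def_acc(accelerator::default_accelerator);
    
    	// Create accelerator_view with TDR disabled (Supported on Windows 8 and later)
    	accelerator_view def_acc_v = concurrency::direct3d::create_accelerator_view(def_acc, true);
    
    	// Wrap the host vector; the extent is an int, so cast the size explicitly.
    	array_view<float_3, 1> rgb_arr_v(static_cast<int>(rgb_pixels.size()), rgb_pixels);
    	
    	// One thread per pixel. The kernel lambda must be marked restrict(amp).
    	parallel_for_each(def_acc_v, rgb_arr_v.extent, [=](index<1> idx) restrict(amp) {
    		float_3 rgb = rgb_arr_v[idx];
    
    		rgb.r = 0.5f * rgb.r + 20;
    		rgb.g = 0.5f * rgb.g + 30;
    		rgb.b = 0.5f * rgb.b + 10;
    
    		rgb_arr_v[idx] = rgb;
    	});
    
    	// Copy the results back into rgb_pixels.
    	rgb_arr_v.synchronize();
    
    	// Use your computed data here.
    }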
    
    void compute_rgb_block(std::vector<float_3>& rgb_pixels, int block_size)
    {
    	// If the OS does not support disabling the TDR, we can divide the 
    	// computation into blocks and issue multiple parallel_for_each.
    	
    	array_view<float_3, 1> rgb_arr_v(static_cast<int>(rgb_pixels.size()), rgb_pixels);
    
    	// Assuming data size is divisible by block_size.
    	int num_blocks = static_cast<int>(rgb_pixels.size()) / block_size;
    	
    	for (int block = 0; block < num_blocks; block++)
    	{
    		// View onto the elements [block * block_size, (block + 1) * block_size).
    		array_view<float_3, 1> block_arr_v = rgb_arr_v.section(block * block_size, block_size);
    
    		// The kernel lambda must be marked restrict(amp).
    		parallel_for_each(block_arr_v.extent, [=](index<1> idx) restrict(amp) {
    			float_3 rgb = block_arr_v[idx];
    
    			rgb.r = 0.5f * rgb.r + 20;
    			rgb.g = 0.5f * rgb.g + 30;
    			rgb.b = 0.5f * rgb.b + 10;
    
    			block_arr_v[idx] = rgb;
    		});
    	}
    
    	// Copy the results of all blocks back into rgb_pixels.
    	rgb_arr_v.synchronize();
    
    	// Use your computed data here.
    }
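compute_rgb_block assumes that the data size is divisible by block_size. If that does not hold, one possible way to handle it is to compute an (offset, count) pair per block and let the last block be shorter. The sketch below is plain C++ rather than AMP code, and `Block` and `make_blocks` are hypothetical names introduced only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical description of one block: where it starts and how many
// elements it covers.
struct Block {
    int offset;
    int count;
};

// Split `total` elements into consecutive blocks of at most `block_size`
// elements. The last block is shorter when `total` is not a multiple of
// `block_size`.
std::vector<Block> make_blocks(int total, int block_size)
{
    assert(total >= 0 && block_size > 0);
    std::vector<Block> blocks;
    for (int offset = 0; offset < total; offset += block_size) {
        blocks.push_back({offset, std::min(block_size, total - offset)});
    }
    return blocks;
}
```

Each resulting pair could then take the place of the `block * block_size, block_size` arguments passed to `section` in the loop of compute_rgb_block.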

    Please feel free to post if you have further queries.

    • Marked as answer by kfzaer Wednesday, December 04, 2013 6:38 PM
    Thursday, November 21, 2013 3:43 AM

All replies

  • Hi Kfzaer,

    The tiling technique is suited to cases where you reuse the same data several times (as in the matrix product algorithm). If, as in your case, you do not reuse the same data, then you do not need to implement tiling.
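As an illustration of why no data is reused here, the following sketch (plain C++, not AMP code) applies the formulas from the question to an interleaved R G B R G B ... buffer. The channel of element i is simply i % 3, and each new value depends only on the old value at the same index. The names `transform_element` and `transform_interleaved` are hypothetical, and the `float` element type is an assumption; the ordinary loop stands in for what a parallel launch would do with one thread per element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work for a single element. The 0.5 gain is shared by all channels; the
// offsets are the ones from the question: R + 20, G + 30, B + 10.
inline float transform_element(float value, std::size_t i)
{
    const float offsets[3] = {20.0f, 30.0f, 10.0f};
    return 0.5f * value + offsets[i % 3];  // i % 3: 0 = R, 1 = G, 2 = B
}

// Stand-in for the parallel launch: every iteration reads and writes only
// element i, so iterations are independent and can run in any order.
void transform_interleaved(std::vector<float>& rgb)
{
    assert(rgb.size() % 3 == 0);  // buffer must hold whole pixels
    for (std::size_t i = 0; i < rgb.size(); ++i) {
        rgb[i] = transform_element(rgb[i], i);
    }
}
```

Because no element is read by more than one iteration, there is nothing for a tile to cache and share between threads.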

    Bruno


    Boucard Bruno - http://blogs.msdn.com/b/devpara/

    Thursday, October 31, 2013 4:09 PM
  • Hi Bruno,

    Thanks for your answer.

    If I don't use tiling, how can I make use of the multi-threaded processing of the GPU? Do I need to split the whole frame into several blocks?

    Could you give me some advice about it please?

    kfzaer

    Thursday, October 31, 2013 4:38 PM
  • You could split the frame into several blocks and have each block handled by its own pfe. You could also try processing the whole frame with a single pfe, but since your data is large, there is a chance of hitting a TDR (note that on Win8 you can disable TDR).
    Monday, November 04, 2013 10:00 PM
  • It doesn't look like you're doing enough work on each color channel/pixel to warrant using the GPU and paying the memory copy overhead. I would recommend doing this on the CPU with multiple threads, each thread making use of SSE/AVX instructions.

    Come to think of it, VS 2013 might just auto-vectorize your code for you.
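A minimal sketch of what the CPU-side loop could look like, assuming 8-bit interleaved R G B data (the thread does not state the actual pixel format, and `transform_frame_cpu` is a hypothetical name). It is written as a simple contiguous loop without branches, which is the kind of code auto-vectorizers generally target; whether a particular compiler actually vectorizes it is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 8-bit interleaved R G B R G B ... data.
// value / 2 is the integer form of 0.5 * value (it rounds down). The largest
// possible result is 255 / 2 + 30 = 157, so no clamping is needed.
void transform_frame_cpu(std::vector<std::uint8_t>& rgb)
{
    assert(rgb.size() % 3 == 0);  // buffer must hold whole pixels
    std::uint8_t* p = rgb.data();
    const std::size_t pixels = rgb.size() / 3;
    for (std::size_t i = 0; i < pixels; ++i) {
        p[3 * i + 0] = static_cast<std::uint8_t>(p[3 * i + 0] / 2 + 20);  // R
        p[3 * i + 1] = static_cast<std::uint8_t>(p[3 * i + 1] / 2 + 30);  // G
        p[3 * i + 2] = static_cast<std::uint8_t>(p[3 * i + 2] / 2 + 10);  // B
    }
}
```

To use several threads, the pixel range could be divided into contiguous chunks with one chunk per thread, since no pixel depends on any other.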

    -L

    Tuesday, November 05, 2013 11:00 PM
  • Hi Boby

    Thank you for your answer!

    I have been thinking of the same method, splitting the frame into several blocks and processing them at the same time, but I have no idea how to implement it. Is it a good idea to use C++ AMP for this? Or would you suggest another method?

    Also, please excuse my lack of programming knowledge, but what are pfe and TDR? Searching for them leads me to many other meanings, so I can't work out what they mean here. If you could briefly explain them to me, I would be very grateful.

    kfzaer

     

    Tuesday, November 19, 2013 10:50 AM