Low Latency Decoding and Small Array Size

    Question

  • I'm working on a video decoder that needs to decode each frame as fast as possible, with the lowest possible latency, but I'm having trouble achieving this with C++ AMP. To reduce latency as much as possible I send data to the GPU for processing as soon as it becomes available, i.e. I send DCT blocks to the GPU as soon as they are decoded.

    My code looks something like this:

    concurrency::parallel_for(0, nb_macro_block_rows, 1, [&](int row)
    {
        for(int col = 0; col < nb_macro_block_columns; ++col)
        {
            for(int i = 0; i < nb_dct_per_macro_block; ++i)
            {
                // Staging array: CPU-accessible memory associated with the default GPU accelerator.
                array<int> stage_dct_coeffs(64,
                    accelerator(accelerator::cpu_accelerator).default_view,
                    accelerator(accelerator::default_accelerator).default_view);
                decode_dct_coeffs(stage_dct_coeffs, bit_reader.rows[row]);

                // std::bind moves the staging array into the task to avoid copying it.
                result[row].push_back(task<array<int>>(std::bind(
                    [](const array<int>& stage_dct_coeffs) -> array<int>
                    {
                        /* C++ AMP IDCT Code */
                    }, std::move(stage_dct_coeffs))));
            }
        }
    });

    I am a bit worried about whether it is a good idea to work with such small arrays.

    • Allocating such small 64-element staging arrays is VERY slow. Any advice regarding that? Should I pre-allocate and manage a pool of arrays myself? Would using a different accelerator_view for each parallel_for calling context improve allocation performance?
    • What are the performance characteristics of small blocks with regard to host <-> device transfers and GPU processing? And if there is a difference, what is a good minimum block size to use?


    Friday, August 10, 2012 7:39 PM

Answers

  • Hi Dragon89,

    Allocating arrays, issuing data transfers to/from the accelerator, and launching compute kernels each have some overhead; with such small array sizes, those overhead costs are likely to overwhelm the overall performance.

    The overhead associated with allocations is typically very small (the actual numbers are IHV driver dependent), but every time a newly allocated array is first accessed, the OS may have to zero out the underlying memory's contents for security reasons, as described in this post about data warmup. So yes, reusing array objects from an application-managed pool may be helpful in your scenario. The gains from using different accelerator_views in each parallel_for context, on the other hand, would be fairly small. The overhead of launching a data transfer or kernel operation is relatively higher (compared to allocation costs) unless you are able to batch operations, though batching is undesirable in your case since it would defeat your attempts at pipelining.
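
    To make the pooling idea concrete, here is a minimal sketch of an application-managed pool of staging arrays; the staging_array_pool class and all of its names are illustrative, not part of C++ AMP, and it assumes the pool is sized so that acquire() never runs dry:

    #include <amp.h>
    #include <memory>
    #include <mutex>
    #include <vector>

    using namespace concurrency;

    // Hypothetical pool of pre-allocated 64-element staging arrays.
    // The arrays are allocated (and zeroed by the OS) once up front,
    // then handed out and returned instead of being re-created per block.
    class staging_array_pool
    {
    public:
        staging_array_pool(int count,
                           const accelerator_view& cpu_av,
                           const accelerator_view& device_av)
        {
            for(int i = 0; i < count; ++i)
                m_free.push_back(std::unique_ptr<array<int>>(
                    new array<int>(64, cpu_av, device_av)));
        }

        // Assumes the pool never runs dry; a production version
        // would block or grow instead.
        std::unique_ptr<array<int>> acquire()
        {
            std::lock_guard<std::mutex> lock(m_lock);
            std::unique_ptr<array<int>> p = std::move(m_free.back());
            m_free.pop_back();
            return p;
        }

        void release(std::unique_ptr<array<int>> p)
        {
            std::lock_guard<std::mutex> lock(m_lock);
            m_free.push_back(std::move(p));
        }

    private:
        std::mutex m_lock;
        std::vector<std::unique_ptr<array<int>>> m_free;
    };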

    As for the right block size: the bigger the better, for the obvious reason that the larger the block, the smaller the overhead relative to the actual work. You are trying to pipeline work between the CPU and the GPU, so it will require some tuning to determine the right block size. Mid- to high-end GPUs typically require several thousand active threads to be saturated, so I would suggest starting with a block size of at least 5000-10000 and then experimenting with different sizes. Breaking up the work in each parallel_for context into 5 or so blocks should be fairly good for pipelining benefits, but if this causes each block to fall below a size of 5-10K, I would choose larger block sizes over a higher number of blocks.
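
    As a rough illustration of processing larger blocks per kernel launch, the sketch below batches many 64-element DCT blocks into one staging array; all names and the batch size are illustrative, and the kernel body is a placeholder for the real IDCT:

    #include <amp.h>

    using namespace concurrency;

    void idct_batch_sketch()
    {
        accelerator_view cpuAv = accelerator(accelerator::cpu_accelerator).default_view;
        accelerator_view av = accelerator().default_view;

        // 128 blocks x 64 coefficients = 8192 threads, within the
        // 5-10K range suggested above; tune experimentally.
        const int blocks_per_batch = 128;

        // One large staging array instead of many tiny 64-element ones.
        array<int> stage(blocks_per_batch * 64, cpuAv, av);

        // ... CPU side: decode blocks_per_batch DCT blocks into 'stage' ...

        // Staging arrays are meant for transfers, so copy to a device
        // array before running the kernel.
        array<int> gpu_input(blocks_per_batch * 64, av);
        copy(stage, gpu_input);

        array<int> result(blocks_per_batch * 64, av);
        parallel_for_each(result.extent, [&gpu_input, &result](index<1> idx) restrict(amp)
        {
            result[idx] = gpu_input[idx]; // placeholder for the real per-coefficient IDCT work
        });
    }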

    - Amit



    Amit K Agarwal


    Friday, August 10, 2012 10:19 PM
    Owner

All replies

  • Thanks for the answer.

    Could you maybe explain a bit more about "The gains from using different accelerator_views in each parallel_for context would be fairly small."? 

    Would something like this be faster?

    concurrency::parallel_for(0, nb_macro_block_rows, 1, [&](int row)
    {
        // One dedicated CPU accelerator_view per parallel_for context
        // (accelerator_view cannot be constructed directly from an accelerator).
        auto cpu_acc_view = concurrency::accelerator(concurrency::accelerator::cpu_accelerator).create_view();

        // Use cpu_acc_view to allocate staging arrays.
    });

    Would managing my own pool be faster? Or does pooling already occur behind the scenes?

    Saturday, August 11, 2012 9:42 AM
  • Yes, maintaining your own pool would be faster, since it will help avoid the cost of the OS zeroing the memory each time. The C++ AMP runtime does not pool user-created staging arrays.
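
    For example, building on the hypothetical staging_array_pool sketched in the marked answer above (pool is an instance of that class; decode_dct_coeffs and bit_reader are the names from your question), the inner decode loop might acquire and return arrays instead of constructing them each time:

    // Acquire a pre-allocated (already zeroed) staging array from the pool.
    auto stage_dct_coeffs = pool.acquire();
    decode_dct_coeffs(*stage_dct_coeffs, bit_reader.rows[row]);

    // ... copy to the device and run the IDCT kernel ...

    // Return the array for reuse once its contents are no longer needed.
    pool.release(std::move(stage_dct_coeffs));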

    By using different accelerator_views in each parallel_for context, I meant different device accelerator_views:

    accelerator_view cpuAv = accelerator(accelerator::cpu_accelerator).default_view;
    accelerator_view av = accelerator().create_view();
    array<int> stage_dct_coeffs(N, cpuAv, av);

    But I would expect the gains from using different accelerator_views in this scenario to be fairly insignificant. Using different accelerator_views helps minimize contention for the default accelerator_view, but since the C++ AMP runtime uses very lightweight locks for accelerator_view thread-safety, avoiding that contention buys you very little.

    -Amit


    Amit K Agarwal

    Monday, August 13, 2012 5:18 PM
    Owner