Copy between textures always goes through host?

  • Question

  • Looking at the implementation for graphics::texture I notice that the copy is always performed by downloading src data to the host and then uploading to dst on the device.

    This seems rather inefficient. I understand (from a previous question) that optimizing this when src and dst are on different accelerator_views can be somewhat complicated, but I was hoping that copying when src and dst are on the same view could be made a lot faster?

    i.e. instead of:

        std::vector<unsigned char> _Host_buffer(_Size);
        _Copy_async_impl(_Src, reinterpret_cast<void *>(_Host_buffer.data()), _Size)._Get();
        _Copy_async_impl(reinterpret_cast<void *>(_Host_buffer.data()), _Size, _Dest)._Get();

    do something like:

    	if(_Src.accelerator_view != _Dest.accelerator_view)
    	{
    		std::vector<unsigned char> _Host_buffer(_Size);
    		_Copy_async_impl(_Src, reinterpret_cast<void *>(_Host_buffer.data()), _Size)._Get(); // Should we really wait here? Isn't a .then(/*...*/) better?
    		_Copy_async_impl(reinterpret_cast<void *>(_Host_buffer.data()), _Size, _Dest)._Get();
    	}
    	else
    	{
    
    	}

    In my case I wanted to create a helper method:

    void fast_copy_helper(/*...*/)
    {
    	if(src.accelerator_view != dst.accelerator_view)
    		graphics::copy(src, dst);
    	else
    	{
    		// Attach takes ownership of the pointer returned by the interop call.
    		CComPtr<IUnknown> d3d_src_unknown;
    		d3d_src_unknown.Attach(get_texture(src));
    		CComQIPtr<ID3D11Resource> d3d_src = d3d_src_unknown;
    		
    		CComPtr<IUnknown> d3d_dst_unknown;
    		d3d_dst_unknown.Attach(get_texture(dst));
    		CComQIPtr<ID3D11Resource> d3d_dst = d3d_dst_unknown;
    		
    		CComPtr<IUnknown> d3d_device_unknown;
    		d3d_device_unknown.Attach(concurrency::direct3d::get_device(src.accelerator_view));
    		CComQIPtr<ID3D11Device> d3d_device = d3d_device_unknown;
    
    		// Assuming get_device returns an AddRef'ed pointer (as the Attach calls above
    		// already assume), hold the second one in a CComPtr so the assert does not leak it.
    		CComPtr<IUnknown> d3d_dst_device_unknown;
    		d3d_dst_device_unknown.Attach(concurrency::direct3d::get_device(dst.accelerator_view));
    		_ASSERTE(d3d_device_unknown == d3d_dst_device_unknown);
    
    		// Schedule CopyResource call on accelerator_view?
    	}
    }

    I got stuck here, as I haven't yet figured out how to schedule Direct3D 11 interop (ImmediateContext) calls onto the accelerator_view thread.



    • Edited by Dragon89 Friday, July 20, 2012 2:20 PM
    Friday, July 20, 2012 11:11 AM

All replies

  • Actually I think the current implementation is worse than I first thought as it seems to go:

    source texture -> staging texture -> vector<unsigned char> -> staging texture -> dest texture (2 DMA transfers + 2 cpu copies!)

    It would be nice if it at least was reduced to:

    texture -> staging texture -> staging texture -> texture
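
    To make the difference between the two paths concrete, here is a minimal sketch that models each hop as a buffer copy and counts DMA transfers and CPU copies. The names `bytes`, `copy_stats`, `copy_via_host_vector` and `copy_staging_to_staging` are hypothetical and only mirror the arrows above; this is not how the C++ AMP runtime is implemented.

```cpp
#include <cassert>
#include <vector>

using bytes = std::vector<unsigned char>;

// Illustrative bookkeeping only: how many hops of each kind a path takes.
struct copy_stats {
    int dma_transfers = 0;
    int cpu_copies = 0;
};

// Path described above: texture -> staging -> vector -> staging -> texture.
bytes copy_via_host_vector(const bytes& src_texture, copy_stats& stats) {
    bytes src_staging = src_texture; ++stats.dma_transfers; // device to staging
    bytes host_vector = src_staging; ++stats.cpu_copies;    // staging to vector
    bytes dst_staging = host_vector; ++stats.cpu_copies;    // vector to staging
    bytes dst_texture = dst_staging; ++stats.dma_transfers; // staging to device
    return dst_texture;
}

// Proposed path: texture -> staging -> staging -> texture.
bytes copy_staging_to_staging(const bytes& src_texture, copy_stats& stats) {
    bytes src_staging = src_texture; ++stats.dma_transfers; // device to staging
    bytes dst_staging = src_staging; ++stats.cpu_copies;    // staging to staging
    bytes dst_texture = dst_staging; ++stats.dma_transfers; // staging to device
    return dst_texture;
}
```

    Both paths keep the two DMA transfers; dropping the intermediate vector only removes one of the two CPU-side copies.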







    • Edited by Dragon89 Friday, July 20, 2012 3:32 PM
    Friday, July 20, 2012 2:19 PM
  • Hi Dragon89,

    Thanks for reporting the issue. This was a known performance problem, and we have fixed the roundtrip to host when the copy is between textures on the same accelerator_view. You can expect a direct texture->texture copy without staging buffers or roundtrip to host. The update should be available for Visual Studio 2012 RTM. In the meanwhile, you can work around this by scheduling a copy on the ImmediateContext of the device. Refer:

    As for your second question, about waiting on the copy_async call: you are right that for an async version of copy it would be beneficial to use a continuation. However, in v1 we only support synchronous texture->texture copy, so a wait is required in this case.


    Pooja Nagpal


    Friday, July 20, 2012 6:57 PM
  • Great!

     In the meanwhile, you can work around this by scheduling a copy on the ImmediateContext of the device. 

    How would I do that? The ImmediateContext is not thread safe as far as I know?
    • Edited by Dragon89 Friday, July 20, 2012 7:15 PM
    Friday, July 20, 2012 7:15 PM
  • You are right. The ImmediateContext is not thread safe. You will need to ensure that no other thread in your app is submitting commands to this ImmediateContext concurrently. This includes threads using the device through C++ AMP or DirectX.
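
    One way an application could serialize its own threads around a non-thread-safe context is to hand the context out only under a lock. The sketch below uses standard C++ only; `serialized_context` and `fake_immediate_context` are hypothetical names, and the fake type merely stands in for an `ID3D11DeviceContext`. It coordinates only callers that go through the wrapper, so it does not by itself coordinate with commands the C++ AMP runtime submits internally.

```cpp
#include <cassert>
#include <mutex>
#include <utility>

// Hypothetical wrapper: owns a non-thread-safe context and exposes it only
// while holding a lock, so at most one caller touches it at a time.
template <typename Context>
class serialized_context {
public:
    explicit serialized_context(Context context) : context_(std::move(context)) {}

    // Runs fn(context) under the lock and returns whatever fn returns.
    template <typename Fn>
    auto with_lock(Fn&& fn) -> decltype(std::forward<Fn>(fn)(std::declval<Context&>())) {
        std::lock_guard<std::mutex> guard(mutex_);
        return std::forward<Fn>(fn)(context_);
    }

private:
    std::mutex mutex_;
    Context context_;
};

// Hypothetical stand-in for ID3D11DeviceContext that just counts submitted copies.
struct fake_immediate_context {
    int copies_submitted = 0;
    void copy_resource() { ++copies_submitted; }
};
```

    In a real application the wrapped type would presumably be a smart pointer to the device's immediate context, and every thread that issues Direct3D commands would have to go through `with_lock`.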

    Pooja Nagpal

    Friday, July 20, 2012 7:25 PM
  • You are right. The ImmediateContext is not thread safe. You will need to ensure that no other thread in your app is submitting commands to this ImmediateContext concurrently. This includes threads using the device through C++ AMP or DirectX.

    Pooja Nagpal

    How can I do that, though? If I use the default accelerator and default view, there are a lot of things going on. I assume that the accelerator_view has some form of synchronization for the work it does; is there any way to use the accelerator_view to perform a synchronized or scheduled call to the ImmediateContext? Without such a feature, Direct3D interop with C++ AMP is somewhat limited.


    • Edited by Dragon89 Friday, July 20, 2012 7:46 PM
    Friday, July 20, 2012 7:45 PM
  • Hi Dragon89,

     You can expect a direct texture->texture copy without staging buffers or roundtrip to host


    Pooja Nagpal

    Will you also remove one of the copies on the cpu for the different accelerator view case? i.e. copy between the staging textures directly instead of an intermediate std::vector?
    Friday, July 20, 2012 7:47 PM
  • accelerator_view internally synchronizes for multi-threaded access. However, we do not expose this functionality in our API. As long as you use the interop accelerator_view and D3D device/ImmediateContext from the same thread (or not concurrently from multiple threads), your app should work fine. Can you tell us more about your scenario and how multithreaded device access is used in your app?

    For your second question, the update to texture copy applies only when copying between textures on the same accelerator view. We will not be able to fix the intermediate vector when copying between different accelerator_views in this release.


    Pooja Nagpal

    Friday, July 20, 2012 9:13 PM
  • accelerator_view internally synchronizes for multi-threaded access. However, we do not expose this functionality in our API. As long as you use the interop accelerator_view and D3D device/ImmediateContext from the same thread (or not concurrently from multiple threads), your app should work fine.

    I'm using the default accelerator_view (i.e. Concurrency::details::_Select_default_accelerator().default_view) everywhere in my application, which runs a lot of AMP work in parallel; I believe this is a very common case. Without exposed synchronization functionality, what you can do with D3D interop is very limited in this scenario. Also, if AMP makes asynchronous calls to the device, there is no way for me to synchronize with it, even if I put a mutex around every section of my application that makes AMP calls.

    For your second question, the update to texture copy applies only when copying between textures on the same accelerator view. We will not be able to fix the intermediate vector when copying between different accelerator_views in this release.

    Ok, that is unfortunate, luckily I can fix that myself in the meantime.



    • Edited by Dragon89 Friday, July 20, 2012 10:13 PM
    Friday, July 20, 2012 9:52 PM
  • Allowing synchronized access to the ImmediateContext using the accelerator_view would definitely help with your scenario. Thanks very much for the feedback. We will consider this as a feature request for future releases.

    Pooja Nagpal

    Saturday, July 21, 2012 2:54 AM
  • One more quick question:

    std::vector<unsigned char> _Host_buffer(_Size);
    _Copy_async_impl(_Src, reinterpret_cast<void *>(_Host_buffer.data()), _Size)._Get();
    _Copy_async_impl(reinterpret_cast<void *>(_Host_buffer.data()), _Size, _Dest)._Get(); // Why do we need to wait here, i.e. _Get()? It's an upload to the accelerator, which should be implicitly and asynchronously synced when the resource is used?


    • Edited by Dragon89 Saturday, July 21, 2012 9:41 AM
    Saturday, July 21, 2012 9:41 AM
  • In the current implementation, the async copy-in from host to device copies from the host to the staging buffer synchronously, and the get() for an asynchronous copy-in returns immediately without waiting for the upload to complete. So it's effectively a no-op here. However, please also see the discussion here: http://social.msdn.microsoft.com/Forums/en-US/parallelcppnative/thread/a7ded160-2859-4504-9ea2-3f5be713062c

    Thanks,

    Weirong

    Saturday, July 21, 2012 5:49 PM
  • Allowing synchronized access to the ImmediateContext using the accelerator_view would definitely help with your scenario. Thanks very much for the feedback. We will consider this as a feature request for future releases.

    Pooja Nagpal

    Maybe one way to solve this is to explicitly create a Direct3D device and use it instead of the "default_accelerator". The question is, how do I explicitly create a Direct3D device that matches the one that is created by default?
    Monday, August 13, 2012 9:37 AM
  • Hi Dragon89,

    Do you mean create a Direct3d device and set that as the default accelerator? This can be done by

    1. Create a C++ AMP accelerator_view using a Direct3d device of your choice (http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/29/interoperability-between-direct-3d-and-c-amp.aspx)
    2. Set the underlying accelerator as default (http://blogs.msdn.com/b/nativeconcurrency/archive/2012/02/02/default-accelerator-in-c-amp.aspx)

    I am not sure how this would solve the problem of synchronizing access to the Immediate context. You would still need some mechanism to co-ordinate with the C++ AMP runtime for submitting commands. Could you give us some more details on how this would help your scenario?

    Pooja Nagpal


    Monday, August 13, 2012 10:49 PM
  • Thanks for the answer, however my question was regarding how I can create a Direct3d device which has the exact same/similar parameters as the one created by C++ AMP.

    I'm pretty much in non-standard land with this solution: I'm creating a wrapper over the explicitly created device that matches the Direct3D device interface and handles the coordination, and I then set the default accelerator to this wrapper.


    • Edited by Dragon89 Tuesday, August 14, 2012 5:49 AM
    Tuesday, August 14, 2012 5:48 AM
  • Hi Dragon89,

    I assume you want to create a D3D device with the same parameters as the default accelerator. Since the default can vary from system to system, you can create a matching device by:

    1. Query for the underlying device of accelerator(accelerator::default_accelerator).default_view (using the get_device interop API)
    2. You can then query for required properties using interfaces of the Direct3D device returned
      a. ID3D11Device ( Creation Flags and Feature level )
      b. IDXGIDevice ( Adapter )
    3. You can infer the DRIVER_TYPE based on the device_path of the C++ AMP accelerator.
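
    Step 3 could be sketched roughly as follows. This assumes the device_path strings match the C++ AMP constants accelerator::direct3d_warp (L"direct3d\\warp"), accelerator::direct3d_ref (L"direct3d\\ref") and accelerator::cpu_accelerator (L"cpu"), and treats any other path as a hardware device, which is an assumption. `driver_kind` and `infer_driver_kind` are hypothetical names; the enum is a local stand-in for the D3D_DRIVER_TYPE values so that the snippet stays self-contained.

```cpp
#include <cassert>
#include <string>

// Hypothetical local mirror of the D3D_DRIVER_TYPE values of interest.
enum class driver_kind { hardware, warp, reference, not_direct3d };

// Maps a C++ AMP accelerator device_path to the kind of driver behind it.
driver_kind infer_driver_kind(const std::wstring& device_path) {
    if (device_path == L"direct3d\\warp") return driver_kind::warp;
    if (device_path == L"direct3d\\ref") return driver_kind::reference;
    if (device_path == L"cpu") return driver_kind::not_direct3d;
    // Assumption: every remaining path identifies a hardware adapter.
    return driver_kind::hardware;
}
```

    In real code the argument would come from the accelerator's device_path property, and the result would be translated to the corresponding D3D_DRIVER_TYPE value before creating the device.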

    Thanks,


    Pooja Nagpal

    Tuesday, August 14, 2012 7:28 PM