AMP: copy array to texture?

  • Question

  • How can I copy a concurrency::array to a concurrency::graphics::texture without first mapping the array to host memory? Since the array is backed by a Direct3D buffer, such a copy should be quite fast.

    I haven't found any concurrency::copy overload that does this.

    • Edited by Dragon89 Sunday, July 15, 2012 12:18 PM
    Saturday, July 14, 2012 12:03 PM


All replies

  • There is no direct copy function between texture and array. However, if the texture and the array are created on the same accelerator_view, you could consider launching a parallel_for_each for the copying task.



    Monday, July 16, 2012 12:44 AM
  • Question: is there a reason why no such overload exists? May we expect it in the future?
    Monday, July 16, 2012 10:56 AM
  • Thanks. This feature request will be considered and evaluated for future releases. Meanwhile, if you could share your scenario where this is needed, it would be helpful.



    Monday, July 16, 2012 5:29 PM
  • Currently, if I need a processing operation that runs on the CPU, I first have to write its result to a host memory location and then issue the copy to the texture.

    If I could copy from an array to a texture, I could write the CPU calculation directly into a staging buffer and then issue a DMA copy to the texture, avoiding an unnecessary copy.

    currently:                          data  - [host] -> processing - [host] -> data - [host] -> texture - [dma] ->  texture

    with array->texture copy: data - [host] -> processing - [host] -> staging array - [dma] ->  texture

    • Proposed as answer by Zhu, Weirong Wednesday, July 18, 2012 4:10 PM
    • Unproposed as answer by Zhu, Weirong Wednesday, July 18, 2012 4:10 PM
    Wednesday, July 18, 2012 4:04 PM
  • I think what you want is a staging texture. (Note that a staging texture is different from a staging array, which is backed by a linear buffer. A staging texture has row pitches, etc.) In this release, staging textures are not exposed. This is something we will evaluate and consider for future releases. Again, thanks for the feedback.
    Wednesday, July 18, 2012 4:15 PM
  • This is also particularly important when you have images/frames where rowbytes != width * stridebytes and you need to do a line by line copy to the texture.
    Wednesday, July 18, 2012 4:21 PM
  • I'm having some problems getting this to work, here is what I'm trying to do:

    array<int, 2> staging_array(extent<2>(frame->height/sizeof(int), frame->row_bytes/sizeof(int)), cpu_acc.default_view);
    memcpy(, frame->data, frame->size);
    array<int, 2> device_array(extent<2>(frame->height/sizeof(int), frame->row_bytes/sizeof(int)));

    copy(staging_array, device_array);
    texture<unorm_4, 2> target_texture(extent<2>(frame->height, frame->width), 8U); // Note: frame->width != frame->row_bytes
    parallel_for_each(target_texture.extent, [&](index<2> idx) restrict (amp)
    {
        target_texture.set(idx, device_array[idx]); // cannot convert from int to unorm_4
    });

    How can I write an int (containing 32 bit rgba value) to a unorm4?

    • Edited by Dragon89 Wednesday, July 18, 2012 5:19 PM
    Wednesday, July 18, 2012 5:03 PM
  • Hi, there are multiple issues in your code above.

    First of all, you didn't create a staging array; you created an array on the cpu_accelerator. To create a staging array, you need to provide an associated accelerator view, as shown in this post. Also, here you are trying to use the constructor array<int, 2>(int, int, accelerator_view, accelerator_view). However, "frame->height/sizeof(int)" has an unsigned type (sizeof yields size_t), which can lead the compiler to choose the wrong overload and produce a compilation error. It is better to cast so that "int" is used, e.g. static_cast<int>(frame->height/sizeof(int)).

    Second, when creating an array_view from an array, there is no need to specify extents. The constructor looks like: array_view(array<type, rank>& src); It was a typo in the example in that blog post, I will correct it. Thanks!  (For the array_view you created, since you will only read from it inside parallel_for_each, consider making it array_view<const int, 2>)

    Third, when you create a texture with unorm_4, you have to specify the bits_per_scalar_element. Otherwise, there will be a compiler error (static_assertion).

    Fourth, when supplied to a parallel_for_each, a texture needs to be captured by reference, not by value.

    Fifth, the way you write to the texture is incorrect. The subscript operator does not return a reference; it returns a const value. In this case, you are trying to assign to a temporary of type "const unorm_4", which the compiler rejects. Also, as mentioned in the post, you cannot write directly to a texture whose texels have more than one component; you need to use a writeonly_texture_view. Note that, unlike a texture, a writeonly_texture_view needs to be captured by value.

    Sixth, there is no implicit conversion from int to unorm_4 (since we want the user to be aware of the possible data loss). You need to be explicit, e.g. unorm_4(staging_array_view[idx]), which converts int -> float -> unorm_4; note that clamping will happen.



    Wednesday, July 18, 2012 5:58 PM
  • Thank you for an excellent answer, one last bit I'm stuck with:

    		auto target = writeonly_texture_view<unorm_4, 2>(;
    		parallel_for_each(target.accelerator_view, target.extent, [=](concurrency::index<2> idx) restrict (amp)
    			target.set(idx, unorm_4(staging_array_view[idx[0] * row_bytes/sizeof(int) + idx[1]])); // cannot convert from 'Concurrency::array_view<_Value_type,_Rank>' to 'Concurrency::graphics::unorm_4'


    Also, I'm getting a warning (C4244) about possible loss in int to float conversion in the int -> unorm_4 conversion.

    • Edited by Dragon89 Wednesday, July 18, 2012 6:26 PM
    Wednesday, July 18, 2012 6:17 PM
  • Note that your staging_array_view has rank 2. In this statement:

       staging_array_view[idx[0] * row_bytes/sizeof(int) + idx[1]];

    What you did is

      staging_array_view[int], not staging_array_view[index<2>],

    As a result, it triggers the projection, which returns an array_view<int, 1>, thus the compilation error you see.

    I think you can just do  "staging_array_view[idx]" to achieve what you want, though I may not understand your goal.



    Wednesday, July 18, 2012 6:33 PM
  • My goal is to copy a frame/image in host memory, which has the following layout:

      width * stride_bytes  + padding_bytes   = row_bytes
    |-----------------------|---------------| height

    To a texture without the extra padding:

      width * stride_bytes
    |----------------------| height

    Also, can I ignore the warning about int to float conversion inside the amp lambda?

    • Edited by Dragon89 Wednesday, July 18, 2012 6:52 PM
    Wednesday, July 18, 2012 6:47 PM
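
The layout described in this post can be handled on the host with a row-by-row copy. The following is only a minimal sketch in plain C++ (no C++ AMP calls); the function name strip_row_padding and its parameters are hypothetical and simply mirror the width, stride_bytes, row_bytes and height terms used above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copies an image whose rows are row_bytes apart into a
// tightly packed buffer whose rows are width * stride_bytes apart.
std::vector<std::uint8_t> strip_row_padding(const std::uint8_t* src,
                                            std::size_t width,
                                            std::size_t height,
                                            std::size_t stride_bytes,
                                            std::size_t row_bytes)
{
    const std::size_t packed_row = width * stride_bytes;
    // The padded row must be at least as long as the pixel data it holds.
    assert(row_bytes >= packed_row);

    std::vector<std::uint8_t> dst(packed_row * height);
    if (packed_row == 0) {
        return dst; // nothing to copy
    }
    for (std::size_t y = 0; y < height; ++y) {
        // Source rows include the padding; destination rows are contiguous.
        std::memcpy(dst.data() + y * packed_row, src + y * row_bytes, packed_row);
    }
    return dst;
}
```

The packed buffer could then be handed to a single copy into the texture. A staging texture with its own row pitch, as discussed earlier in the thread, would make this intermediate buffer unnecessary.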
  • Ignoring the padding, this is what I've got now:

    	array<int, 2> staging_array(extent<2>(frame->height/sizeof(int), frame->linesize[0]/sizeof(int)), cpu_acc.default_view, target.accelerator_view);
    	// ...
            writeonly_texture_view<unorm_4, 2> target(/*...*/);
    	// Works
    	copy(, frame->size, target);
    	// Doesn't work, all black when rendered into a direct3d window.
    	//array_view<const int, 2> source(staging_array);
    	//parallel_for_each(target.accelerator_view, target.extent, [=](concurrency::index<2> idx) restrict (amp)
    	//	target.set(idx, unorm_4(source[idx]));

    • Edited by Dragon89 Wednesday, July 18, 2012 7:24 PM
    Wednesday, July 18, 2012 7:09 PM
  • What the "copy" does and what the "parallel_for_each" does are not equivalent. Please read our blog entries related to textures, see if you can tell the difference.



    Wednesday, July 18, 2012 9:13 PM
  • "What the "copy" does and what the "parallel_for_each" does are not equivalent. Please read our blog entries related to textures, see if you can tell the difference."



    I've read them, and I still have no idea. The explicit unorm_4 cast doesn't really make sense to me, considering the implementation of unorm_4.

    I will try d3d interop with DeferredContext and Map to see if I can achieve what I want that way. I'll get back if I have any success with that.

    Wednesday, July 18, 2012 9:27 PM
  • Assume the staging_array contains only one integer, 0x01010101, and the texture "target" contains only one unorm_4 texel.

    The "copy" does a raw data copy. Each scalar of the texel gets 0x01, which the texture interprets as a normalized fixed-point value.

    What your parallel_for_each does:

       target.set(idx, unorm_4(source[idx]));


       int tmp1 = source[idx];  // (1)

       unorm_4 tmp2 = unorm_4(tmp1); // (2)

       target.set(idx, tmp2);  // (3)

    (1) Load the 32-bit integer, so tmp1 = 0x01010101

    (2) Static-cast the integer value to a float value, then clamp it into the range [0, 1.0f]; call the result "v" (here v = 1.0f, since 0x01010101 is far greater than 1). Each scalar component of tmp2 is then initialized to "v".

    (3) Store back to target; the texture unit translates "v" into a fixed-point representation in the texture storage.

    Hope you can see the difference now.



    Wednesday, July 18, 2012 9:44 PM
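
The two paths in this reply can be modelled on the CPU. This is a hedged sketch in plain C++, not C++ AMP code: decode_unorm8, encode_unorm8 and the two channel_after_* functions are hypothetical names, and it assumes 8 bits per scalar element with round-to-nearest on store.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical model of reading an 8-bit UNORM channel: stored byte -> [0, 1].
float decode_unorm8(std::uint8_t stored) { return stored / 255.0f; }

// Hypothetical model of storing to an 8-bit UNORM channel: clamp to [0, 1],
// then round to the closest of the 256 representable levels.
std::uint8_t encode_unorm8(float value)
{
    const float clamped = std::min(1.0f, std::max(0.0f, value));
    return static_cast<std::uint8_t>(std::lround(clamped * 255.0f));
}

// Path 1, modelled on copy(): the source bytes land in the texel unchanged.
std::uint8_t channel_after_raw_copy(std::uint32_t packed, int channel)
{
    assert(channel >= 0 && channel < 4);
    return static_cast<std::uint8_t>((packed >> (8 * channel)) & 0xFFu);
}

// Path 2, modelled on set(idx, unorm_4(int)): the whole integer becomes one
// float, is clamped, and that single value is stored in every channel.
std::uint8_t channel_after_value_conversion(std::uint32_t packed)
{
    return encode_unorm8(static_cast<float>(packed));
}
```

For the input 0x01010101, the raw-copy path leaves 0x01 in each channel (read back as 1/255), while the value-conversion path stores 0xFF in each channel (read back as 1.0). That would be consistent with the two approaches rendering differently.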
  • This is probably not the best of solutions (using things that shouldn't be used), but it does exactly what I wanted to achieve. I hope this might be possible in future versions of the API:

    		_Texture_ptr_ target_tex_ptr = _Get_texture(target);
    		_Texture_ptr_ target_staging_tex_ptr = target_tex_ptr->_Create_stage_texture(
    			target.accelerator_view, accelerator(accelerator::cpu_accelerator).default_view, 
    			1, target_tex_ptr->_Get_format(), true);
    		target_staging_tex_ptr->_Map_stage_buffer(_Write_access, true);
    		auto frame2            = std::shared_ptr<AVFrame> (avcodec_alloc_frame(), av_free);	
    		frame2->linesize[0] = static_cast<int>(target_staging_tex_ptr->_Get_row_pitch());
    		frame2->data[0]     = reinterpret_cast<uint8_t*>(target_staging_tex_ptr->_Get_host_ptr());
    		sws_scale(this->sws.get(), frame->data, frame->linesize, 0, frame->height, frame2->data, frame2->linesize);	

    • Edited by Dragon89 Friday, July 20, 2012 9:44 AM
    Wednesday, July 18, 2012 9:49 PM
  • Thanks again for the explanation.

    "The "copy" does a raw data copy.   Each scalar of the texel gets 0x01. "0x01" represents a fix-point floating point value."

    As far as I understand, this is true inside a "parallel_for_each"; on the CPU, however, "0x01" represents an 8-bit unsigned integer (i.e. DXGI_FORMAT_R8G8B8A8_UNORM), which is then converted into a fixed-point value.

    "What your paralell_for_each does:"

    I still think they should be doing the same thing. In the copy case, 8-bit integer scalars are copied into the texture and then implicitly converted to fixed point, while in the parallel_for_each case, 8-bit integer scalars are copied into an integer array, which is then converted to fixed point in the unorm4(/*..*/) statement.

    What am I missing?

    unsigned char v[4] = {255, 0, 255, 0};
    texture<unorm4, 2> target(1, 1, 8U);
    // copy
    copy(&v[0], sizeof(v), target); // copies to DXGI_FORMAT_R8G8B8A8_UNORM (A four-component, 32-bit unsigned-normalized-integer format that supports 8 bits per channel including alpha.)
    parallel_for_each(target.extent, [&](concurrency::index<2> idx) restrict (amp)
    {
        auto tmp = target[idx]; // == unorm4(1.0, 0.0, 1.0, 0.0)
    });
    // parallel_for_each
    array<int, 2> ar(target.extent);
    memcpy(, v, sizeof(v));
    array_view<const int, 2> av(ar);
    parallel_for_each(target.extent, [=](concurrency::index<2> idx) restrict (amp)
    {
        auto tmp1 = av[idx]; // == 0xFF00FF00
        auto tmp2 = unorm_4(tmp1); // == unorm4(1.0, 0.0, 1.0, 0.0)
    });

    • Edited by Dragon89 Thursday, July 19, 2012 12:22 PM
    Thursday, July 19, 2012 12:21 PM
  • Hi Dragon89, thanks for your question. Your understanding of copy is accurate. For the parallel_for_each, the conversion you expect from 8-bit integer to fixed point is taken care of by the texture hardware and not by C++ AMP.

    To C++ AMP, a unorm is a 32-bit single-precision floating point number whose value is clamped to [0.0f, 1.0f], but it’s still a 32-bit floating point value. C++ AMP does not treat it as an 8-bit or 16-bit fixed point value. This reduction of precision happens only when storing into a texture. (On the other hand, when you load from the texture, what you get back is a 32-bit floating point value, not an 8/16-bit fixed point value.)

    This means that inside your kernel you don’t need to use integers; you could directly use single-precision floating point values and they will be stored as the closest fixed point value. In fact, there is no unorm constructor that takes an integer. If your kernel uses integers for calculation, create a texture<int4, 2>. If your kernel uses unorms for calculation, create a texture<unorm4, 2>. And as you have pointed out, copying the same 8-bit integer data into both textures will work fine. You will be able to read the same value interpreted as either integers or unorms, depending on your kernel.

    I will be posting a blog with more information on norms and unorms in textures next week. I hope that will make the behavior more clear.

    Pooja Nagpal

    Friday, July 20, 2012 1:56 AM
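
If a kernel does start from a packed 32-bit RGBA integer, one way to get per-channel values in [0.0f, 1.0f] is to split the integer into bytes first. The sketch below is plain C++ rather than kernel code; unpack_rgba8 is a hypothetical helper, and it assumes that channel 0 sits in the lowest byte of the integer.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: splits a packed 32-bit pixel into four normalized
// floats in [0, 1]. Assumption: channel 0 occupies the lowest byte.
std::array<float, 4> unpack_rgba8(std::uint32_t packed)
{
    std::array<float, 4> rgba{};
    for (int c = 0; c < 4; ++c) {
        const std::uint32_t byte = (packed >> (8 * c)) & 0xFFu;
        rgba[c] = static_cast<float>(byte) / 255.0f;
        // Each channel is a byte divided by 255, so it stays within [0, 1].
        assert(rgba[c] >= 0.0f && rgba[c] <= 1.0f);
    }
    return rgba;
}
```

Inside a kernel, the same shifts and divisions could feed four floats into a unorm_4 value (assuming the four-float constructor is available), so that each channel is stored as the closest fixed point value instead of one clamped integer being replicated across all channels.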
  • Here is the blog post with information on how norms and unorms work in C++ AMP textures:

    Please feel free to ask questions at the blog post or here in our MSDN concurrency forum.

    Pooja Nagpal

    Wednesday, July 25, 2012 8:22 PM