array_view of rank 2 slow multithreaded access on CPU


  • I'm trying to run the same code both via AMP and on a multithreaded CPU, for validation purposes and to provide comparison timing.  Various low-level functions are needed by both implementations, so they are marked restrict(amp, cpu).  These need to take a 2D array, so I'm using array_view<uint32_t, 2> on both AMP and CPU, and values are looked up as table[r][c].  However, the multithreaded CPU implementation seems to run at only 50% CPU due to heavy locking inside array_view::operator[](int) (stack trace below).

    I've tried templating the function so that it takes a vector<vector<uint32_t>> instead for the CPU, but that doesn't work: because the function is marked restrict(amp, cpu), the vector instantiation fails to compile for AMP (even though I won't be calling it).  Do you have a better solution than duplicating the function?



    vector<uint32_t> v_table;
    array_view<const uint32_t, 2> table(v_table.size(), v_table);
    parallel_for(0, n, [](int i) {
        //...
        test(table);
        //...
    });

    void test(const array_view<const uint32_t, 2>& table) restrict(amp, cpu)
    {
        uint32_t u = table[tileIdx][Y];
        //...
    }

    Stack trace for block:

     	msvcr110d.dll!Concurrency::details::ThreadProxy::SuspendExecution() Line 112	C++
     	msvcr110d.dll!Concurrency::details::FreeVirtualProcessorRoot::ResetOnIdle(Concurrency::SwitchingProxyState switchState) Line 121	C++
     	msvcr110d.dll!Concurrency::details::FreeThreadProxy::SwitchOut(Concurrency::SwitchingProxyState switchState) Line 133	C++
     	msvcr110d.dll!Concurrency::details::InternalContextBase::SwitchTo(Concurrency::details::InternalContextBase * pNextContext, Concurrency::details::InternalContextBase::ReasonForSwitch reason) Line 973	C++
     	msvcr110d.dll!Concurrency::details::InternalContextBase::Block() Line 217	C++
     	msvcr110d.dll!Concurrency::Context::Block() Line 63	C++
     	msvcr110d.dll!Concurrency::details::LockQueueNode::Block(unsigned int currentTicketState) Line 684	C++
     	msvcr110d.dll!Concurrency::critical_section::_Acquire_lock(void * _PLockingNode, bool _FHasExternalNode) Line 1127	C++
     	msvcr110d.dll!Concurrency::critical_section::scoped_lock::scoped_lock(Concurrency::critical_section & _Critical_section) Line 1201	C++
     	vcamp110d.dll!Concurrency::details::_Ubiquitous_buffer::_Get_view_shape(Concurrency::details::_Buffer_descriptor * _Key) Line 1388	C++
     	AMP-MC.exe!Concurrency::details::_Get_buffer_view_shape(const Concurrency::details::_Buffer_descriptor & _Descriptor) Line 3065	C++
     	AMP-MC.exe!Concurrency::details::_Array_view_base<2,1>::_Create_projection_buffer_shape(const Concurrency::details::_Buffer_descriptor & _Descriptor, unsigned int _Dim, int _Dim_offset) Line 1986	C++
     	AMP-MC.exe!Concurrency::details::_Array_view_base<2,1>::_Project0(int _I, Concurrency::details::_Array_view_base<1,1> & _Projected_view) Line 1854	C++
     	AMP-MC.exe!Concurrency::array_view<unsigned int const ,2>::_Project0(int _I, Concurrency::array_view<unsigned int const ,1> & _Projected_view) Line 3351	C++
     	AMP-MC.exe!Concurrency::details::_Array_view_projection_helper<unsigned int const ,2>::_Project0(const Concurrency::array_view<unsigned int const ,2> * _Arr_view, int _I) Line 36	C++
    >	AMP-MC.exe!Concurrency::array_view<unsigned int const ,2>::operator[](int _I) Line 3085	C++

    • Edited by tspitz Friday, June 29, 2012 2:40 PM
    Friday, June 29, 2012 2:39 PM



All replies

  • BTW: my full sample code is at: (latest) under AMP-MC subdir/solution.
    Friday, June 29, 2012 5:14 PM
  • Hi tspitz,

    The lock acquisition that you are observing happens only on accessing the array_view for the first time or when there are some unsynchronized modifications to the array_view on another accelerator_view. Subsequent accesses of the array_view do not acquire the lock if the array_view was not modified on another accelerator_view in between. So lock contention should not be an issue here. When I compiled and ran your code, I saw CPU activity go up to 90% when executing the multicore version of your algorithm.

    However, if you still want to completely eliminate even the first lock acquisition from inside the parallel_for, you can create the array_views outside the parallel_for, call synchronize on them and capture the array_views inside your parallel_for.


    Amit K Agarwal

    • Proposed as answer by Zhu, Weirong Monday, July 02, 2012 4:14 PM
    • Unproposed as answer by tspitz Monday, July 02, 2012 4:50 PM
    Friday, June 29, 2012 9:09 PM
  • My lock problem was caused by accessing the 2D array_view using table[row][col].  Switching to table(row, col) fixes it.

    • Marked as answer by tspitz Monday, July 02, 2012 4:50 PM
    Monday, July 02, 2012 12:44 PM
  • That makes sense. I missed this when looking at the code. Using table[] on an array_view of rank > 1 projects the array_view on the most significant dimension (dimension 0), and the projected array_view is recorded in the runtime's internal data structures, which accounts for the lock acquisitions that you were seeing. The table(row, col) or table(idx) forms are recommended when accessing individual elements of array_views.

    Amit K Agarwal

    Tuesday, July 03, 2012 6:06 PM
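
The difference between the two access forms discussed in this thread can be pictured with a small toy model. This is only a sketch in portable C++, not the C++ AMP runtime: ToyView2D, ToyRowView, sum_projected and sum_direct are hypothetical names invented for illustration, and the mutex-protected counter is an assumed stand-in for the runtime's internal bookkeeping. It shows why view[row][col] pays for an intermediate view (plus a lock) on every element access, while view(row, col) only computes an offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical rank-1 view returned by projecting one row of ToyView2D.
struct ToyRowView {
    const std::uint32_t* row_start;
    std::uint32_t operator[](int col) const { return row_start[col]; }
};

// Hypothetical rank-2 view over row-major storage; NOT the C++ AMP class.
class ToyView2D {
public:
    ToyView2D(int rows, int cols, const std::vector<std::uint32_t>& storage)
        : rows_(rows), cols_(cols), data_(storage.data()) {
        assert(storage.size() == static_cast<std::size_t>(rows) * cols);
    }

    // view[row]: builds an intermediate row view and records it under a lock
    // (an assumed model of the bookkeeping described in the thread).
    ToyRowView operator[](int row) const {
        assert(row >= 0 && row < rows_);
        std::lock_guard<std::mutex> guard(registry_lock_);
        ++projections_recorded_;
        return ToyRowView{data_ + static_cast<std::size_t>(row) * cols_};
    }

    // view(row, col): computes the offset directly; no projection, no lock.
    std::uint32_t operator()(int row, int col) const {
        assert(row >= 0 && row < rows_ && col >= 0 && col < cols_);
        return data_[static_cast<std::size_t>(row) * cols_ + col];
    }

    // Number of row projections created so far.
    long projections_recorded() const {
        std::lock_guard<std::mutex> guard(registry_lock_);
        return projections_recorded_;
    }

private:
    int rows_;
    int cols_;
    const std::uint32_t* data_;
    mutable std::mutex registry_lock_;
    mutable long projections_recorded_ = 0;
};

// Sums every element through the projecting form view[row][col].
std::uint64_t sum_projected(const ToyView2D& view, int rows, int cols) {
    std::uint64_t total = 0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            total += view[r][c];
    return total;
}

// Sums every element through the direct form view(row, col).
std::uint64_t sum_direct(const ToyView2D& view, int rows, int cols) {
    std::uint64_t total = 0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            total += view(r, c);
    return total;
}
```

In this model both functions return the same sum, but only sum_projected touches the lock, once per element. When many threads share one view, that lock is where they would queue up, which matches the shape of the stack trace in the question.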