C++ AMP: copying the data only once to the device

    Question

  • I need to make sure that I copy the data only once to the device and then use it many times. To test this I created a small test program. In one instance I use an array_view like this:

    ..................

    concurrency::array_view<const float, 1> a(size, data);

    for (...) // run this 10 times
    {
        parallel_for(
        {
            ...
        })
    }

    ///////////

    The other implementation is based on array:

    concurrency::array<const float, 1> a(size, data.begin(), data.end());

    parallel_for ....

    What I found is interesting: with the first implementation, the first pass through the loop takes, say, X ms, but the subsequent runs take around X / 3 ms.

    The second implementation is constant at a value Y, where Y is approximately equal to X.

    It seems that in the first implementation the array_view copies the data to the device the first time and then reuses it, while the second implementation seems to copy the data every time.

    The question is: what is the best way to 'cache' the data on the device?

    Thanks,

    G.

    Tuesday, December 13, 2011 03:03


All replies

  • Hi G

    By parallel_for, I presume you mean parallel_for_each (abbreviated as p_f_e), right?

    In general, data will stay on the accelerator (and not be copied back) if you don't touch it (through array or array_view) between p_f_e invocations.

    Can you please post your full repro code (including how you measure the perf differences) so we can evaluate your reported results? That would help us offer more concrete advice.

    Cheers

    Daniel


    http://www.danielmoth.com/Blog/
    Tuesday, December 13, 2011 05:41
    Owner
  • Hi Daniel,

    Here is a more complete example.

    // suppose you have these structures

     

     

    struct TDataIn
    {
      float dataIn_[1024];
    };
    
    struct TDataOut
    {
      float value_;
    };
    
    // declare a vector like this:
    int vSize = 1000000;
    std::vector<TDataOut> h_vec_out(vSize); // this is the out vector
    
    // data in
    std::vector<TDataIn> h_vec_in(vSize);
    // load the vector with data; this is done just once outside of the for loop

    // variant using array_view
    concurrency::array_view<const TDataIn, 1> a(vSize, h_vec_in);
    concurrency::array_view<TDataOut, 1> b(vSize, h_vec_out);

    for ( int i = 0; i < nIterations_; i++ )
    {
        // create an extra processing buffer here. It will be used once for each
        // iteration of the for loop.

        // start the timer
        concurrency::parallel_for_each(b.grid, [=](concurrency::index<1> idx) mutable restrict(direct3d)
        {
            // do processing here
        } );
        b.synchronize();
        // stop the timer, record the data (milliseconds)
    }

    In this case, the first time through the loop the time is, say, X; the next iterations are around X / 3. This makes sense because the input data is not changed. Now, if instead of array_view I use just array, the time taken going through the loop is always X for all 10 iterations.


     Thanks,

    G.

     

     

    Tuesday, December 13, 2011 19:10
  • Hello G,

    The behavior of the example using array_views is as expected. The problem appears to be in the version of your code that uses Concurrency::array - I suspect copies of arrays are being unintentionally created. Can you please also share the actual code that uses arrays and is exhibiting anomalous performance behavior?

    Wednesday, December 14, 2011 00:40
    Owner
  • Hello G,

    The behavior of the example using array_views is as expected. The problem appears to be in the version of your code that uses Concurrency::array - I suspect copies of arrays are being unintentionally created. Can you please also share the actual code that uses arrays and is exhibiting anomalous performance behavior?

    Hello Amit,

    Yes, I think copies of the array are created every time, because the time taken is suspiciously close to the time taken the first time through the loop using array_view.

    The changes are as follows:

    replace

    concurrency::array_view<const TDataIn, 1> a(vSize, h_vec_in);

    with

    concurrency::array<TDataIn, 1> a(vSize, h_vec_in.begin(), h_vec_in.end());

    and modify the parallel_for_each like this:

    concurrency::parallel_for_each(b.grid, [=, &a](concurrency::index<1> idx) mutable restrict(direct3d)

    Please note that the 'a' array is passed by reference. This works and the results returned are OK, but it seems that array 'a' is somehow copied to the device every time. Also note that the definition of the array no longer uses the 'const' qualifier. If I try to use it, the compilation fails because there is no matching overloaded function, etc...

    I am compiling in 64 bit.

    No other changes are done.
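    Putting those changes together, the array-based variant of the loop looks roughly like this (a simplified sketch of what I described above, not my actual proprietary code):

    // variant using array: the input data is copied to the device while the
    // array constructor runs, before the loop starts
    concurrency::array<TDataIn, 1> a(vSize, h_vec_in.begin(), h_vec_in.end());
    concurrency::array_view<TDataOut, 1> b(vSize, h_vec_out);

    for ( int i = 0; i < nIterations_; i++ )
    {
        // start the timer
        concurrency::parallel_for_each(b.grid, [=, &a](concurrency::index<1> idx) mutable restrict(direct3d)
        {
            // do processing here, reading from 'a' and writing to b[idx]
        } );
        b.synchronize();
        // stop the timer, record the data (milliseconds)
    }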

    Thanks,

    G.

    Wednesday, December 14, 2011 14:04
  • Hi G,

    The code above looks fine and I cannot see a reason for any extra copying of the array data. I also do not expect the array being non-const to cause the kind of performance difference that you are observing. However, to be sure, can you help eliminate this as the cause of the difference by checking whether the performance of the array_view version regresses when the input array_view "a" is made non-const?

    To be able to further investigate this mysterious perf difference between the array_view and array versions of your code, I would need the complete code you are running for me to experiment at my end. Also, it would help if you can let me know the actual GPU card that you are using and whether you are running on Windows 7 or the Windows 8 Developer preview version.

    Finally, I noticed that your input array "a" is approximately of size 4 GB - I was curious about which card you are using since there are few that have such large amounts of video RAM :)

     

    - Amit

    Wednesday, December 14, 2011 16:43
    Owner
    Wow!!! I just cut the 'const' out and the time is almost identical to the time when I use array instead of array_view!!

    So this is it! Somehow the compiler does something and really copies the data ONLY ONCE when the element type is const. Otherwise, it copies every time.

    Here is the data without the const qualifier:

    ......................

    0 concurrency::parallel_for_each(ms): 484
    1 concurrency::parallel_for_each(ms): 171
    2 concurrency::parallel_for_each(ms): 172
    3 concurrency::parallel_for_each(ms): 171
    4 concurrency::parallel_for_each(ms): 171
    5 concurrency::parallel_for_each(ms): 172
    6 concurrency::parallel_for_each(ms): 172
    7 concurrency::parallel_for_each(ms): 172
    8 concurrency::parallel_for_each(ms): 171
    9 concurrency::parallel_for_each(ms): 172
    GPU Average time: 202

    ....................

    and here is with const

    ...............

    0 concurrency::parallel_for_each(ms): 359
    1 concurrency::parallel_for_each(ms): 62
    2 concurrency::parallel_for_each(ms): 63
    3 concurrency::parallel_for_each(ms): 62
    4 concurrency::parallel_for_each(ms): 47
    5 concurrency::parallel_for_each(ms): 47
    6 concurrency::parallel_for_each(ms): 47
    7 concurrency::parallel_for_each(ms): 62
    8 concurrency::parallel_for_each(ms): 62
    9 concurrency::parallel_for_each(ms): 63

    .............

    Yes, I use a 1-million-element array, but the data is not as big as I showed here. It is smaller and in fact fits comfortably in 1 GB of memory.

    I use a laptop with a NVidia Quadro 3000M (Dell Precision 6600). The graphics card has 2 GB of memory. The laptop has 16 GB.

    I am running under W7 64 bit and the app is compiled as 64 bit. I did not try 32 bit but for this app, 32 bit cannot be used.

    Finally, when compiling in 64 bit I have some warnings like this:

    ..........

    1>C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\amp.h(2094): warning C4267: 'initializing' : conversion from 'size_t' to 'unsigned int', possible loss of data
    1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\amp.h(2093) : while compiling class template member function 'Concurrency::_View_shape_ptr Concurrency::details::_Array_view_base<_Rank,_Element_size>::_Get_buffer_view_shape(void) const'
    1>          with
    1>          [
    1>              _Rank=1,
    1>              _Element_size=1
    1>          ]
    1>          C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\amp.h(2744) : see reference to class template instantiation 'Concurrency::details::_Array_view_base<_Rank,_Element_size>' being compiled
    1>          with
    1>          [
    1>              _Rank=1,
    1>              _Element_size=1
    1>          ]

    .........

    The first warning is in amp.h, here:

        _View_shape_ptr _Get_buffer_view_shape() const __CPU_ONLY
        {
            unsigned int bufElemSize = _M_buffer_descriptor._Get_buffer_ptr()->_Get_master_buffer()->_Get_elem_size();
    The variable is declared unsigned int, but the value assigned to it is a size_t, which is 64-bit when compiling for x64.

    thanks,

    G

     

    Wednesday, December 14, 2011 17:58
  • Hi G,

    Thanks for sharing the data.

    The problem here does NOT seem to be extra copying, since the non-const version takes longer than the const version even in the first iteration. There is a consistent delta of ~120 ms between the const and non-const versions across all iterations, and something else (maybe the execution of the kernel on the GPU) seems to be slower in the non-const version compared to the version using the const array_view.

    I would need to look at the complete code (including the code inside the p_f_e) and analyze it at my end to identify the root of this behavior. Can you please share your complete code if possible or at least a minimal version that I can use to repro the issue?

     

    Regards,

    Amit


    Amit K Agarwal
    Friday, December 16, 2011 20:48
    Owner
  • Hi Amit,

    I will try to see how much from my code I can share with you... I will be out for a few days and back by the end of next week.

    Thanks,

    G.

     

    Friday, December 16, 2011 22:42
  • Hi Amit,

    Unfortunately, I will not be able to share with you the full code because it shows structures and algorithms which are proprietary. I can try to see if I can create another sample not related to my proprietary algorithms and still replicate the issue.

    Question: does the execution of an instruction like this:

    concurrency::array<TDataIn, 1> a(vSize, h_vec_in.begin(), h_vec_in.end());

    guarantee that after it executes, the data in 'a' is actually on the device? How will this code fail if there is not enough memory on the device?

     

    Thanks,

    G.

     

    Thursday, December 29, 2011 17:41
  • Hi G,

    Yes, constructing an array with constructor versions that initialize the array contents (like the one you specified above) is a synchronous operation, and any subsequent operations that attempt to access the array are guaranteed to see the data in the array.

    In the event of insufficient memory, the C++ AMP runtime will throw a "Concurrency::out_of_memory" exception. Note that Windows virtualizes GPU memory, and you may be able to successfully allocate more memory than is physically available on the device. The array is bound to physical memory on the device when accessed in a device operation such as a copy or parallel_for_each. So you should also be prepared to handle errors from operations that access the data (array/array_view), besides the construction of these data containers.
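    As a rough sketch (the host vectors here are hypothetical, and I'm assuming Concurrency::runtime_exception as the general exception type for other access failures), the error handling could look something like this:

    #include <amp.h>
    #include <vector>

    void load_and_use(const std::vector<float>& hostData, const std::vector<float>& newHostData)
    {
        try
        {
            // Synchronous: the contents of hostData are copied to the accelerator
            // before this constructor returns (or an exception is thrown).
            Concurrency::array<float, 1> a((int)hostData.size(), hostData.begin(), hostData.end());

            // ... run parallel_for_each kernels that capture 'a' by reference ...

            // A later explicit copy can also fail if the array cannot be made
            // resident on the device, so guard accesses to the data as well.
            Concurrency::copy(newHostData.begin(), newHostData.end(), a);
        }
        catch (const Concurrency::out_of_memory&)
        {
            // not enough memory to back the array
        }
        catch (const Concurrency::runtime_exception&)
        {
            // other failures from operations that access the array
        }
    }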

    Regards,

    Amit

     


    Amit K Agarwal
    Thursday, December 29, 2011 19:23
    Owner
  • Hi G,

    ....Note that Windows virtualizes GPU memory and you may be able to successfully allocate more memory than physically available on the device. The array is bound to physical memory on the device when accessed in a device operation such as a copy or parallel_for_each. So you should also be prepared for handling errors from operations that access the data (array/array_view) besides the construction of these data containers.

    Regards,

    Amit

     


    Amit K Agarwal

    Hi Amit,

    Can you please give me some more details about this? Or point me to a link documenting it?

    For example, you say that Windows may virtualize GPU memory. Does that mean that I can, for example, load a concurrency::array from a buffer bigger than the physical memory and not get an exception at the point of creating the GPU array? In that case the buffer would actually not be moved to the GPU, but the data would be moved as needed when it is worked on inside a p_f_e? So when I create a concurrency::array, the data may or may not be copied to the GPU; the only guarantee is that when it is used inside a p_f_e, it will first be copied as needed. True? If so, is there a way I can make sure, prior to actually using the data, that it IS copied to the GPU?

    The crux of the issue I need to solve is this: I have a buffer of data that I need to move just once (or not so often) to the device. I need to process this data many times, and ONLY when it changes (under my control) do I need to copy it again to the GPU.

    Thanks

    G

     

     

    Thursday, December 29, 2011 21:05
  • You can find information about WDDM's video memory management on MSDN, here.

    To answer your specific question, normally, there is no “automatic” data movement to/from the GPU when using the “concurrency::array” type and the data stays resident in GPU memory. So if you create an array with its contents initialized using an iterator (or explicitly copy data to it) the array is actually resident in GPU memory after that operation. Any subsequent uses of that array in a p_f_e will not result in any automatic data transfers. Any copying of data to the array thereafter would have to be explicit.

    // Allocates memory on the default accelerator_view and copies

    // the contents of hostVector to the array. The data stays resident

    // on the GPU after this constructor returns.

    Concurrency::array<int> acceleratorData(size, hostVector.begin(), hostVector.end());

     

    // On executing this parallel_for_each the array is NOT automatically copied over to the GPU

    // since its already resident on the GPU

    Concurrency::parallel_for_each(computeDomain, [&acceleratorData](index<1> idx) restrict(direct3d) {

        ...

        ...

    });

     

    // On executing this parallel_for_each the array is NOT automatically copied over to the GPU

    // since its already resident on the GPU

    Concurrency::parallel_for_each(computeDomain, [&acceleratorData](index<1> idx) restrict(direct3d) {

        ...

        ...

    });

     

     

    // To copy new content into the array you need to explicitly copy new data into the array

    Concurrency::copy(newHostData.begin(), newHostData.end(), acceleratorData);

     

     

    Note that sometimes WDDM may automatically move data to/from the GPU memory if the GPU memory is overcommitted. Consider the following hypothetical situation.

     

    Suppose that your GPU has 1 GB of local video memory. Now if you create an array of size 800 MB and copy data to it, the array will be resident in GPU memory. If you then create another array of size 800 MB and copy data to it, the old and the new array obviously cannot both stay in GPU memory simultaneously. Hence WDDM will copy the old array (maybe just a part of it) to CPU memory, so that there is enough GPU memory for holding the contents of the new array. Now if you access the old array in a p_f_e, the old array needs to be brought back into GPU memory, and hence any parts of it which may have earlier been moved to CPU memory (to accommodate other more recently used data) will be brought back into GPU memory by WDDM for the p_f_e to start execution.

     

    Regards,

    Amit


    Amit K Agarwal
    Tuesday, January 3, 2012 02:40
    Owner
  • Hi Amit,

    Thanks for this detailed explanation. What you say here is actually what I expected regarding moving data to the GPU when the constructor of the array class finishes. Here are some results from my tests, with comments:

    Run GPU array...
    Copy Host->Device 416000000 bytes, concurrency::array(ms): 191
    0 concurrency::parallel_for_each(ms): 391 records: 899644
    1 concurrency::parallel_for_each(ms): 179 records: 899644
    2 concurrency::parallel_for_each(ms): 175 records: 899644
    3 concurrency::parallel_for_each(ms): 172 records: 899644
    4 concurrency::parallel_for_each(ms): 173 records: 899644
    5 concurrency::parallel_for_each(ms): 172 records: 899644
    6 concurrency::parallel_for_each(ms): 172 records: 899644
    7 concurrency::parallel_for_each(ms): 172 records: 899644
    8 concurrency::parallel_for_each(ms): 172 records: 899644
    9 concurrency::parallel_for_each(ms): 172 records: 899644
    
    Run GPU array_view...
    Copy Host->Device 416000000 bytes, concurrency::array_view(ms): 0
    0 concurrency::parallel_for_each(ms): 390 records: 899644
    1 concurrency::parallel_for_each(ms): 62 records: 899644
    2 concurrency::parallel_for_each(ms): 59 records: 899644
    3 concurrency::parallel_for_each(ms): 56 records: 899644
    4 concurrency::parallel_for_each(ms): 56 records: 899644
    5 concurrency::parallel_for_each(ms): 56 records: 899644
    6 concurrency::parallel_for_each(ms): 56 records: 899644
    7 concurrency::parallel_for_each(ms): 56 records: 899644
    8 concurrency::parallel_for_each(ms): 56 records: 899644
    9 concurrency::parallel_for_each(ms): 56 records: 899644

    In both runs I use the same input data and I call the p_f_e in a loop 10 times. In the first run I use an array class while in the second I use an array_view. In both cases the input data is around 416 MB which fits on the GPU.

    In the first run it takes 191 ms to copy the data from the memory to the GPU. Since array_view does not copy the data during construction, in the second case it takes no time. A few observations:

    1. In both cases the first run is longer than the others, and I still have to figure out why. I still think there is some extra copying going on, maybe something in my code, but whatever it is, it is not simple to see (at least I cannot see it)!

    2. In both cases the results are correct. They filter out the same number of records.

    3. Using array_view is faster than using array, and that is because of the const used (see the declarations in one of these reports). If I try to do this, it fails to compile:

    concurrency::array<const TDataIn, 1> a(vSize, h_vec_in.begin(), h_vec_in.end());

    Do you know why the const qualifier is not allowed with the array class? If I were able to use the const qualifier with the array class, most likely the optimizations would give the same speed as using array_view with the const qualifier.

    4. If I increase the amount of data to > 1 GB, something interesting happens. The test with the array class works OK, but the test with array_view no longer works. It does not give me any errors, but it is clear from the results that it does NOT do the correct processing. Since I am on W7, I have no way of stepping through the code to see what is going on. This is strange... Please note that I have 2 GB of GPU memory, so even when I use a little over 1 GB of data, there should be enough space to copy it. Something is really strange with the array_view class...

    OK, so this was a long message, but I want to take this step by step. The first thing to solve is the speed difference between array and array_view. Am I supposed to be able to use the const qualifier with array? Or is it something specific to array_view?

    Thanks,

    G.

     

     

    Tuesday, January 3, 2012 18:48
  • Hi G,

    Yes, you can specify constness of data even when using arrays by capturing a const reference to the array in the p_f_e.

    For example:

    const array<int, 2> mA(M, W, matrixA.begin(), matrixA.end());

     

    // Capturing this array in a lambda results in the array being captured as read-only

    parallel_for_each(extent, [&mA](index<2> index) restrict(direct3d) {

        ...

    });

     

    As for points 1 & 4 in your list of issues/observations above, I would have to look at the code to be able to tell the root causes of the anomalies.

    Regards,

    Amit


    Amit K Agarwal
    Wednesday, January 4, 2012 03:49
    Owner
  • Hi G,

    Yes, you can specify constness of data even when using arrays by capturing a const reference to the array in the p_f_e.

    For example:

    const array<int, 2> mA(M, W, matrixA.begin(), matrixA.end());

     

    // Capturing this array in a lambda results in the array being captured as read-only

    parallel_for_each(extent, [&mA](index<2> index) restrict(direct3d) {

        ...

    });

     

    As for points 1 & 4 in your list of issues/observations above, I would have to look at the code to be able to tell the root causes of the anomalies.

    Regards,

    Amit


    Amit K Agarwal

    Hi Amit,

    Well, this worked OK! So using const somehow speeds up the process significantly. However, since the code is the same, I am curious what kind of optimizations the compiler does that are not present when const is not used?! Maybe the compiler does some data coalescing when using const? That would be amazing...

    So with const, I get the same speed whether I use array or array_view. However, now with const and array I get problem #4!?

    Is there a way to see the generated kernel? There has to be something that makes using const go faster but, at the same time, makes the computations come out wrong when the data size goes up (if they are actually done at all?).

    Thanks,

    G.

    Wednesday, January 4, 2012 19:03
  • Hi G,

    The speedup in kernel execution time you get from const greatly depends on the graphics card you are using and also on the data access patterns in your algorithm. An example of an optimization that hardware vendors can perform on read-only data is to back it with non-coherent caches (since there are no writes, there are no coherence issues to worry about), and if your algorithm's data access pattern results in significant reuse from the cache you may see a boost in performance.

    But I would reiterate that any such gains are very hardware specific and may even change across generations of cards from the same vendor. Also, if a hardware vendor has a coherent cache on their GPU chip, the difference between read-only and read-write would probably not be significant.

    If you do see the potential for data sharing between threads, I would encourage you to explore the tile_static memory feature of C++ AMP, which is a portable way of improving your algorithm's performance by leveraging data reuse.
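    As a rough illustration of the idea (a hypothetical kernel, written with the later restrict(amp)/tiled_index syntax rather than the restrict(direct3d) syntax used elsewhere in this thread), staging data in tile_static memory and then reusing it from every thread in the tile looks roughly like this:

    #include <amp.h>
    using namespace concurrency;

    // Scales each element by the average of its tile. Assumes the extent is a
    // multiple of the tile size (256).
    void scale_by_tile_average(array_view<float, 1> data)
    {
        parallel_for_each(data.extent.tile<256>(),
            [=](tiled_index<256> tidx) restrict(amp)
        {
            // Each thread stages one element into tile_static memory once...
            tile_static float local[256];
            local[tidx.local[0]] = data[tidx.global[0]];
            tidx.barrier.wait();   // make the staged data visible to the whole tile

            // ...then every thread reuses all 256 staged values from fast
            // tile_static memory instead of re-reading global memory.
            float sum = 0.0f;
            for (int i = 0; i < 256; i++)
                sum += local[i];

            data[tidx.global[0]] = data[tidx.global[0]] * 256.0f / sum;
        });
    }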

    As for the correctness issue, I am sorry for not being able to help without looking at the actual code. I would suggest you to try the Visual Studio C++ AMP debugging for getting to the root of the problem. Currently there is no mechanism to view the generated kernels - we have taken note of your feedback (please keep it coming).

     

    Regards,

    Amit


    Amit K Agarwal
    Wednesday, January 11, 2012 06:06
    Owner
    1. In both cases the first run is longer than the others, and I still have to figure out why. I still think there is some extra copying going on, maybe something in my code, but whatever it is, it is not simple to see (at least I cannot see it)!

    I've always just assumed that the first p_f_e call has to copy/initialise the kernel code on the GPU somehow, but once it's on there it can simply be invoked again. I have a set of about 60 kernels (designed for arbitrary composition, rather than a write-once algorithm); when I run my test suite twice in the same process the second run is always significantly faster.

    Sunday, February 5, 2012 21:11
  • .........

    I've always just assumed that the first p_f_e call has to copy/initialise the kernel code on the GPU somehow, but once it's on there it can simply be invoked again. I have a set of about 60 kernels (designed for arbitrary composition, rather than a write-once algorithm); when I run my test suite twice in the same process the second run is always significantly faster.

    .........

    Well, the first p_f_e taking longer is OK when the structure is passed as an array_view. However, when I use an array, the data should already have been copied to the device by the time the array construction finishes. So why does the first run of the kernel take longer?

    BTW, I ran the same process (same data, same data size) using CUDA and it behaves as expected. I mean, I copy the data to the device and then every run of the kernel takes a consistently similar time; the first run does not take twice as long.

    G.

     

    Sunday, February 5, 2012 22:03
  • I'm referring to the actual code, rather than the data. I don't know the details for certain, but there appears to be some sort of lazy compilation/copying of kernels to the GPU. My testing all uses data that's randomly generated on the GPU and clearly shows the same perf. increase over the first three iterations as your figures show.
    Sunday, February 5, 2012 22:09
    OK, understood. In this case, it should not be very difficult for MS to figure out exactly what is going on.

    G.

    Sunday, February 5, 2012 23:38
  • Hi GT227

    Amit has already requested the code you are using or a small repro, so we can comment accordingly rather than in the abstract.

    If you are not comfortable sharing code here, you can email us directly. We even have paperwork that we can get in place to protect IP on both sides, if you are concerned about that - we've used this approach with customers' kernels already and it works well.

    So please post a small repro with your question, or, if you can't and it only repros in your larger code base, get in touch off-list to see if we can work something else out.

    Cheers

    Daniel


    http://www.danielmoth.com/Blog/
    Monday, February 6, 2012 00:38
    Owner
  • I can't help with the #4 point above, but I noticed the #1 when posting about the sort algorithms: http://social.msdn.microsoft.com/Forums/en-US/parallelcppnative/thread/90090ae1-44a9-470a-8be9-f40b4392c3d1#e73e9639-3a54-4788-9f47-c373a0a62876

    The code linked there (and here) consistently shows a considerably slower time for the first run of both the C++ AMP sorts.

    The Concurrency Visualizer trace on my machine (some worker threads are hidden) shows gaps in the GPU trace for the first iterations, which are what make me suspect lazy copying of code from disk to the GPU.

    Monday, February 6, 2012 01:08
  • Zooba

    Yes you are right, and in fact our generic guidance on measuring performance covers this:

    http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/28/how-to-measure-the-performance-of-c-amp-algorithms.aspx
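    In short, the pattern that guidance describes looks roughly like this (a sketch, not code from the post itself, using the later restrict(amp) syntax and std::chrono): do an untimed warm-up run to absorb one-time costs such as lazily preparing the kernel, and synchronize before stopping the timer on each measured iteration.

    #include <amp.h>
    #include <chrono>
    #include <iostream>

    void time_kernel(concurrency::array_view<float, 1> data, int iterations)
    {
        auto run_once = [&]()
        {
            concurrency::parallel_for_each(data.extent,
                [=](concurrency::index<1> idx) restrict(amp)
            {
                data[idx] = data[idx] * 2.0f;   // stand-in for the real kernel
            });
            data.synchronize();   // ensure the work (and copy back) has finished
        };

        run_once();   // warm-up iteration, deliberately not timed

        for (int i = 0; i < iterations; i++)
        {
            auto start = std::chrono::high_resolution_clock::now();
            run_once();
            auto stop = std::chrono::high_resolution_clock::now();
            std::cout << i << ": "
                      << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
                      << " ms\n";
        }
    }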

    Cheers

    Daniel


    http://www.danielmoth.com/Blog/
    Monday, February 6, 2012 04:03
    Owner
  • Ah, I hadn't seen that. It certainly fulfils the "official answer" criteria for point #1 above.

    Good luck with #4.

    Monday, February 6, 2012 04:19
  • ........

    Ah, I hadn't seen that. It certainly fulfils the "official answer" criteria for point #1 above.

    Good luck with #4.

    ............

    OK, this makes sense. And in my case, taking longer the first time is not a big deal, because the code is supposed to run hundreds of thousands of times a day.

    The reason I noticed this issue is that I was running parallel tests using CUDA and C++ AMP. In CUDA, because of the more granular control, the runs are uniform in time, including the first one.

    As for #4, I will wait for the next release of VS 2011. If I get the same result, I will have to send some code to MS.

    Thanks

    G.

     

     

    Monday, February 6, 2012 15:29
    G, may I ask you to please contact me off-list? Sorry, I don't have your email to reach out to you. You can contact me via my blog (in the signature below).
    http://www.danielmoth.com/Blog/
    Monday, February 6, 2012 17:58
    Owner