C++ AMP matrix multiplication sample throws exception

  • Question

  • Hi,

    as previously described here, I am experiencing a problem with the AMP matrix multiplication code on my GeForce GT430 (in release build).
    A minimal code sample is:

    #include <amp.h>
    #include <vector>
    using namespace concurrency;
    using namespace std;
      
    
    void MatrixMultiplySimple(std::vector<float>& vC, 
             const std::vector<float>& vA, 
             const std::vector<float>& vB, const int M, const int N, const int W)
    {
      array_view<const float,2> a(M, W, vA);
      array_view<const float,2> b(W, N, vB);
      array_view<float,2> c(M, N, vC); c.discard_data();
      concurrency::parallel_for_each(c.extent, 
      [=](concurrency::index<2> idx) restrict(amp) {
        int row = idx[0]; int col = idx[1];
        float sum = 0.0f;
        for(int i = 0; i < W; i++)
          sum += a(row, i) * b(i, col);
        c[idx] = sum;
      });
    }
    
    int main()
    {
    
      vector<float> A(100, 1.0f);
      vector<float> B(100, 1.0f);
      vector<float> C(100);
      MatrixMultiplySimple(C, A, B, 10, 10, 10);
    }

    There are two things that will make this run: one is to reduce all the matrix dimensions (M, N and W) to 8 or smaller. The other is to use a hard-coded loop bound, i.e. replacing i < W with i < 10 in the above example.

    I think this points to the same underlying problem that Zooba described in his recent post:
    http://social.msdn.microsoft.com/Forums/en-US/parallelcppnative/thread/4b684bcb-366b-4abe-a678-b0f86bc719c0

    I am running on 32-bit Windows 7, with the NVidia 295.73 driver.

    Thomas

    Friday, March 2, 2012 7:30 AM

Answers

  • Hi Thomas

    Thanks for confirming it works on REF. A colleague also just confirmed they are seeing the same behavior and also that it works on an HD5870.

    So this is an NVIDIA driver bug.

    If you have a way of reporting this to NVIDIA please do, we'll also do the same...

    Thank you for reporting this, please keep them coming.

    Cheers

    Daniel


    http://www.danielmoth.com/Blog/

    • Proposed as answer by Zhu, Weirong Friday, March 2, 2012 4:36 PM
    • Marked as answer by Thomas Trenner Friday, March 2, 2012 4:44 PM
    Friday, March 2, 2012 9:44 AM

All replies

  • Hi Thomas

    Thank you for the complete repro (I recognize that code :))

    Can you try using the direct3d_ref accelerator please? The easiest way to switch to REF is by setting it as default:

    http://blogs.msdn.com/b/nativeconcurrency/archive/2012/02/02/default-accelerator-in-c-amp.aspx

    Also do you have an AMD device to try this on?

    BTW, what happens if you change the three 10s to be 1024 and hence change the vectors to be of 1024*1024 size? I ask because those are the numbers I typically use and haven't run into issues...

    Cheers

    Daniel


    http://www.danielmoth.com/Blog/

    Friday, March 2, 2012 7:55 AM
  • Hi Daniel,

    yes, it's nice and simple code. I first encountered the problem working on something different, which ran ok in the DP but threw an exception in the Beta, so I decided to start with some simple examples to see if I got the same or similar errors. Adding arrays was too easy, but the matrix multiplication did reproduce the driver crash.

    Back to your questions/suggestions:

    1) Using direct3d_ref as the accelerator, the code works fine.
    2) Sorry, I currently do not have a DirectX 11 AMD device available, so I cannot test this.
    3) Using 1024 (or 1000, to make it a non-2^n number) works fine. I then decided to try a few more small numbers, and the behaviour is erratic but consistent: 9 and 10 fail, 11 and 12 work, 13 and 14 fail, 15 and 16 work, 17 and 18 fail, and so on. The same is true for numbers around 1024.

    Thanks,
    Thomas

    Friday, March 2, 2012 8:23 AM
  • Hi Thomas (and others hitting this bug)

    Given that many folks are running into this, we have a temporary wonky workaround until NVIDIA posts a driver with the fix.

    Change the loop inside the lambda of the parallel_for_each to be as follows:
        for(int i = 0; i < W; i+=2) {
          sum += a(row, i) * b(i, col);
          if ((i + 1) < W) // guards against reading past the end when W is odd; not needed for even W
            sum += a(row, i + 1) * b(i + 1, col);     
        }

    Cheers
    Daniel

    So the full code is:

    void MatrixMultiplySimple(std::vector<float>& vC,
             const std::vector<float>& vA,
             const std::vector<float>& vB, const int M, const int N, const int W)
    {
      array_view<const float,2> a(M, W, vA);
      array_view<const float,2> b(W, N, vB);
      array_view<float,2> c(M, N, vC); c.discard_data();

      concurrency::parallel_for_each(c.extent,
      [=](concurrency::index<2> idx) restrict(amp) {
        int row = idx[0]; int col = idx[1];
        float sum = 0.0f;
        for(int i = 0; i < W; i+=2) {
          sum += a(row, i) * b(i, col);
          if ((i + 1) < W) {
            sum += a(row, i + 1) * b(i + 1, col);
          }
        }
        c[idx] = sum;
      });
      c.synchronize();
    }


    http://www.danielmoth.com/Blog/

    Saturday, June 2, 2012 4:35 AM
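
For anyone who wants to check that the unrolled loop in the workaround gives the same result as the original loop for both even and odd W, here is a minimal CPU-only sketch in plain C++. It does not use C++ AMP and says nothing about the driver issue itself. It only mirrors the two inner loops, assuming the same row-major layout as the array_views above. The function names MultiplySimpleCpu and MultiplyUnrolledCpu are hypothetical and not part of the original sample.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: the original inner loop, one element per iteration.
// vA is M x W and vB is W x N, both row-major; returns the M x N product.
std::vector<float> MultiplySimpleCpu(const std::vector<float>& vA,
                                     const std::vector<float>& vB,
                                     int M, int N, int W)
{
    assert(vA.size() == static_cast<std::size_t>(M) * W);
    assert(vB.size() == static_cast<std::size_t>(W) * N);
    std::vector<float> vC(static_cast<std::size_t>(M) * N);
    for (int row = 0; row < M; ++row) {
        for (int col = 0; col < N; ++col) {
            float sum = 0.0f;
            for (int i = 0; i < W; ++i)
                sum += vA[row * W + i] * vB[i * N + col];
            vC[row * N + col] = sum;
        }
    }
    return vC;
}

// Hypothetical CPU reference: the workaround's inner loop, two elements per iteration.
std::vector<float> MultiplyUnrolledCpu(const std::vector<float>& vA,
                                       const std::vector<float>& vB,
                                       int M, int N, int W)
{
    assert(vA.size() == static_cast<std::size_t>(M) * W);
    assert(vB.size() == static_cast<std::size_t>(W) * N);
    std::vector<float> vC(static_cast<std::size_t>(M) * N);
    for (int row = 0; row < M; ++row) {
        for (int col = 0; col < N; ++col) {
            float sum = 0.0f;
            for (int i = 0; i < W; i += 2) {
                sum += vA[row * W + i] * vB[i * N + col];
                // When W is odd, the last iteration has i + 1 == W, so skip it.
                if ((i + 1) < W)
                    sum += vA[row * W + (i + 1)] * vB[(i + 1) * N + col];
            }
            vC[row * N + col] = sum;
        }
    }
    return vC;
}
```

Both functions add the products in the same order, so comparing their outputs element by element for an even W (e.g. 10) and an odd W (e.g. 9) should be enough to confirm that the guard on i + 1 is only exercised in the odd case.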