C++ AMP matrix multiplication sample throws exception

Question

• Hi,

As previously described here, I am experiencing a problem with the AMP matrix multiplication code on my GeForce GT 430 (in a release build).
A minimal code sample is:

#include <amp.h>
#include <vector>
using namespace concurrency;
using namespace std;

void MatrixMultiplySimple(std::vector<float>& vC,
    const std::vector<float>& vA,
    const std::vector<float>& vB, const int M, const int N, const int W)
{
    // Read-only views over the inputs: a is M x W, b is W x N.
    array_view<const float,2> a(M, W, vA);
    array_view<const float,2> b(W, N, vB);
    // Writable view over the output vector (M x N).
    array_view<float,2> c(M, N, vC);
    concurrency::parallel_for_each(c.extent,
        [=](concurrency::index<2> idx) restrict(amp) {
            int row = idx[0]; int col = idx[1];
            float sum = 0.0f;
            for(int i = 0; i < W; i++)
                sum += a(row, i) * b(i, col);
            c[idx] = sum;
        });
    // Copy the result back into vC.
    c.synchronize();
}

int main()
{
    // 10 x 10 matrices filled with ones.
    vector<float> A(100, 1.0f);
    vector<float> B(100, 1.0f);
    vector<float> C(100);
    MatrixMultiplySimple(C, A, B, 10, 10, 10);
    return 0;
}

There are two things that will make this compile. One is to reduce all the matrix dimensions (M, N and W) to 8 or smaller. The other is to use a hard-coded loop bound, i.e. replacing i < W with i < 10 in the example above.

I think this points to the same underlying problem that Zooba described in his recent post:

I am running on 32-bit Windows 7, with the NVidia 295.73 driver.

Thomas

March 2, 2012 7:30

Answers

• Hi Thomas

Thanks for confirming it works on REF. A colleague also just confirmed they are seeing the same behavior and also that it works on an HD5870.

So this is an NVIDIA driver bug.

If you have a way of reporting this to NVIDIA please do, we'll also do the same...

Thank you for reporting this, please keep them coming.

Cheers

Daniel

http://www.danielmoth.com/Blog/

March 2, 2012 9:44

All replies

• Hi Thomas

Thank you for the complete repro (I recognize that code :))

Can you try using the direct3d_ref accelerator please? The easiest way to switch to REF is by setting it as default:

http://blogs.msdn.com/b/nativeconcurrency/archive/2012/02/02/default-accelerator-in-c-amp.aspx

Also do you have an AMD device to try this on?

BTW, what happens if you change the three 10s to be 1024 and hence change the vectors to be of 1024*1024 size? I ask because those are the numbers I typically use and haven't run into issues...

Cheers

Daniel

http://www.danielmoth.com/Blog/

March 2, 2012 7:55
• Hi Daniel,

yes, it's nice and simple code. I first encountered the problem working on something different, which ran ok in the DP but threw an exception in the Beta, so I decided to start with some simple examples to see if I got the same or similar errors. Adding arrays was too easy, but the matrix multiplication did reproduce the driver crash.

1) Using direct3d_ref as the accelerator, the code works fine.
2) Sorry, I currently do not have a DirectX 11 AMD device available, so I cannot test this.
3) Using 1024 (or 1000, to make it a non-2^n number) works fine. I then decided to try a few more small numbers, and the behaviour is erratic, but somehow consistent: 9 and 10 fail, 11 and 12 work, 13 and 14 fail, 15 and 16 work, 17 and 18 fail, and so on. The same is true for numbers around 1024.
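
To map out which sizes fail, it can help to compare the accelerator output against a CPU reference for each size. Below is a minimal sketch in plain C++ (no AMP involved). MatrixMultiplyReference and MatchesReference are hypothetical helper names, not part of the C++ AMP library, and the tolerance is an arbitrary choice.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference: C (M x N) = A (M x W) * B (W x N), all stored row-major.
std::vector<float> MatrixMultiplyReference(const std::vector<float>& vA,
                                           const std::vector<float>& vB,
                                           int M, int N, int W)
{
    assert(vA.size() == static_cast<std::size_t>(M) * W);
    assert(vB.size() == static_cast<std::size_t>(W) * N);
    std::vector<float> vC(static_cast<std::size_t>(M) * N, 0.0f);
    for (int row = 0; row < M; row++) {
        for (int col = 0; col < N; col++) {
            float sum = 0.0f;
            for (int i = 0; i < W; i++)
                sum += vA[row * W + i] * vB[i * N + col];
            vC[row * N + col] = sum;
        }
    }
    return vC;
}

// True if both vectors have the same size and every element of `actual`
// is within `tol` of the corresponding element of `expected`.
bool MatchesReference(const std::vector<float>& actual,
                      const std::vector<float>& expected,
                      float tol = 1e-4f)
{
    if (actual.size() != expected.size()) return false;
    for (std::size_t k = 0; k < actual.size(); k++)
        if (std::fabs(actual[k] - expected[k]) > tol) return false;
    return true;
}
```

With all-ones n x n inputs, as in the repro, every element of the result should equal n, so looping n over 9..18, running the AMP version and then calling MatchesReference on its output would show the fail/work pattern directly.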

Thanks,
Thomas

March 2, 2012 8:23
• Hi Thomas (and others hitting this bug)

Given that many folks are running into this, we have a temporary wonky workaround, until NVIDIA posts a driver with the fix.

Change the loop inside the lambda of the parallel_for_each to be as follows:
for(int i = 0; i < W; i+=2) {
    sum += a(row, i) * b(i, col);
    if ((i + 1) < W) // guards the last iteration when W is odd; for even W this check is not needed
        sum += a(row, i + 1) * b(i + 1, col);
}
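
A quick way to check that the rewritten loop is equivalent to the original one is to run both forms on the CPU. The sketch below is plain C++ with no AMP involved; dot_simple and dot_unrolled are made-up names for illustration, and the code only shows that the guarded two-step loop accumulates the same terms, in the same order, for odd and even W.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One element per iteration, as in the original kernel body.
float dot_simple(const std::vector<float>& x, const std::vector<float>& y)
{
    assert(x.size() == y.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); i++)
        sum += x[i] * y[i];
    return sum;
}

// Two elements per iteration, mirroring the workaround loop.
float dot_unrolled(const std::vector<float>& x, const std::vector<float>& y)
{
    assert(x.size() == y.size());
    const int W = static_cast<int>(x.size());
    float sum = 0.0f;
    for (int i = 0; i < W; i += 2) {
        sum += x[i] * y[i];
        if ((i + 1) < W)  // false only on the last iteration when W is odd
            sum += x[i + 1] * y[i + 1];
    }
    return sum;
}
```

Because both functions add the products in the same left-to-right order, they should agree for any length, including the odd-W case that the guard handles.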

Cheers
Daniel

So the full code is:

void MatrixMultiplySimple(std::vector<float>& vC,
    const std::vector<float>& vA,
    const std::vector<float>& vB, const int M, const int N, const int W)
{
    array_view<const float,2> a(M, W, vA);
    array_view<const float,2> b(W, N, vB);
    // Writable view over the output vector (M x N).
    array_view<float,2> c(M, N, vC);

    concurrency::parallel_for_each(c.extent,
        [=](concurrency::index<2> idx) restrict(amp) {
            int row = idx[0]; int col = idx[1];
            float sum = 0.0f;
            // Workaround: process two elements per iteration.
            for(int i = 0; i < W; i+=2) {
                sum += a(row, i) * b(i, col);
                if ((i + 1) < W) { // guards the last iteration when W is odd
                    sum += a(row, i + 1) * b(i + 1, col);
                }
            }
            c[idx] = sum;
        });
    c.synchronize();
}

http://www.danielmoth.com/Blog/

June 2, 2012 4:35