C++ AMP matrix multiplication sample throws exception
-
Friday, March 02, 2012 7:30 AM
Hi,
as previously described here, I am experiencing a problem with the AMP matrix multiplication code on my Geforce GT430 (in release build).
A minimal code sample is
#include <amp.h>
#include <vector>

using namespace concurrency;
using namespace std;

void MatrixMultiplySimple(std::vector<float>& vC, const std::vector<float>& vA, const std::vector<float>& vB, const int M, const int N, const int W)
{
    array_view<const float,2> a(M, W, vA);
    array_view<const float,2> b(W, N, vB);
    array_view<float,2> c(M, N, vC);
    c.discard_data();
    concurrency::parallel_for_each(c.extent, [=](concurrency::index<2> idx) restrict(amp)
    {
        int row = idx[0];
        int col = idx[1];
        float sum = 0.0f;
        for(int i = 0; i < W; i++)
            sum += a(row, i) * b(i, col);
        c[idx] = sum;
    });
}

void main()
{
    vector<float> A(100, 1.0f);
    vector<float> B(100, 1.0f);
    vector<float> C(100);
    MatrixMultiplySimple(C, A, B, 10, 10, 10);
}
There are two things that will make this work: one is to reduce all the matrix dimensions (M, N and W) to 8 or smaller. The other is to use a hard-coded loop bound, i.e. replacing i < W with i < 10 in the example above.
I think this points to the same underlying problem that Zooba described in his recent post:
http://social.msdn.microsoft.com/Forums/en-US/parallelcppnative/thread/4b684bcb-366b-4abe-a678-b0f86bc719c0
I am running on 32-bit Windows 7, with the NVIDIA 295.73 driver.
Thomas
All Replies
-
Friday, March 02, 2012 7:55 AM Owner
Hi Thomas
Thank you for the complete repro (I recognize that code :))
Can you try using the direct3d_ref accelerator please? The easiest way to switch to REF is by setting it as default:
http://blogs.msdn.com/b/nativeconcurrency/archive/2012/02/02/default-accelerator-in-c-amp.aspx
Also do you have an AMD device to try this on?
BTW, what happens if you change the three 10s to be 1024 and hence change the vectors to be of 1024*1024 size? I ask because those are the numbers I typically use and haven't run into issues...
Cheers
Daniel
http://www.danielmoth.com/Blog/
-
Friday, March 02, 2012 8:23 AM
Hi Daniel,
yes, it's nice and simple code. I first encountered the problem working on something different, which ran ok in the DP but threw an exception in the Beta, so I decided to start with some simple examples to see if I got the same or similar errors. Adding arrays was too easy, but the matrix multiplication did reproduce the driver crash.
Back to your questions/suggestions:
1) Using direct3d_ref as the accelerator, the code works fine.
2) Sorry, I currently do not have a DirectX 11 AMD device available, so I cannot test this.
3) Using 1024 (or 1000, to make it a non-2^n number) works fine. I then decided to try a few more small numbers, and the behaviour is erratic, but somehow consistent: 9 and 10 fail, 11 and 12 work, 13 and 14 fail, 15 and 16 work, 17 and 18 fail, and so on. The same is true for numbers around 1024.
Thanks,
Thomas
-
Friday, March 02, 2012 9:44 AM Owner
Hi Thomas
Thanks for confirming it works on REF. A colleague also just confirmed they are seeing the same behavior and also that it works on an HD5870.
So this is an NVIDIA driver bug.
If you have a way of reporting this to NVIDIA please do, we'll also do the same...
Thank you for reporting this, please keep them coming.
Cheers
Daniel
http://www.danielmoth.com/Blog/
- Proposed As Answer by Zhu, Weirong Friday, March 02, 2012 4:36 PM
- Marked As Answer by Thomas Trenner Friday, March 02, 2012 4:44 PM
-
Saturday, June 02, 2012 4:35 AM Owner
Hi Thomas (and others hitting this bug)
Given that many folks are running into this, we have a temporary wonky workaround, until NVIDIA posts a driver with the fix.
Change the loop inside the lambda of the parallel_for_each to be as follows:
for(int i = 0; i < W; i+=2) {
    sum += a(row, i) * b(i, col);
    if ((i + 1) < W) // protects for odd W; if W is always even, this check is not needed
        sum += a(row, i + 1) * b(i + 1, col);
}
Cheers
Daniel
So the full code is:
void MatrixMultiplySimple(std::vector<float>& vC, const std::vector<float>& vA, const std::vector<float>& vB, const int M, const int N, const int W)
{
    array_view<const float,2> a(M, W, vA);
    array_view<const float,2> b(W, N, vB);
    array_view<float,2> c(M, N, vC);
    c.discard_data();
    concurrency::parallel_for_each(c.extent, [=](concurrency::index<2> idx) restrict(amp)
    {
        int row = idx[0];
        int col = idx[1];
        float sum = 0.0f;
        for(int i = 0; i < W; i+=2)
        {
            sum += a(row, i) * b(i, col);
            if ((i + 1) < W)
            {
                sum += a(row, i + 1) * b(i + 1, col);
            }
        }
        c[idx] = sum;
    });
    c.synchronize();
}
http://www.danielmoth.com/Blog/

