Answered C++ AMP: "Failed to query D3D marker event status."

  • July 15, 2012 14:40
     
      Has code

    Hi Folks,

    I have another problem with a TDR event.  This code computes the LU decomposition, using a naive implementation of Doolittle's algorithm, of a 40x40 tridiagonal matrix.  The code works fine in Debug mode on an NVIDIA GTX 470, but fails in Release mode on the third parallel-for-each, on the first iteration of the outer loop (i = 0). The template is instantiated with float.

    template <typename _type>
    void LU_DecompositionC(accelerator_view & acc, Matrix<_type> & AC, Matrix<_type> & L, Matrix<_type> & U)
    {
        std::cout << "Starting parallel C ... ";
        Matrix<_type> A(AC);
    
        int N = A.rows;
    
        array_view<_type, 2> a(N, N, A.Data());
        array_view<_type, 2> l(N, N, L.Data());
        array_view<_type, 2> u(N, N, U.Data());
        mycache2(acc, a);
        l.discard_data();
        u.discard_data();
    
        Counter counter;
        counter.Start();
    
        for (int i = 0; i < A.rows; ++i)
        {
            //cout << "i = " << i << "\n";
            //cout.flush();
            parallel_for_each(acc, 1, [=](int j) restrict(amp)
            {
                l(i,i) = 1;
            });
            acc.wait();
            parallel_for_each(acc, A.cols, [=](int j) restrict(amp)
            {
                _type sum = 0;
                for (int k = 0; k < i; ++k)
                {
                    sum += l(i,k) * u(k,j);
                }
                u(i, j) = a(i, j) - sum;
            });
            acc.wait();
            //cout << "A.rows - (i+1) " << (A.rows - (i+1)) << "\n";
            //cout.flush();
            if (A.rows - i - 1 <= 0)
                continue;
            parallel_for_each(acc, A.rows - i - 1, [=](int j) restrict(amp)
            {
                int jj = j + (i + 1);
                _type sum = 0;
                for (int k = 0; k < i; ++k)
                {
                    sum += l(jj,k) * u(k,i);
                }
                l(jj,i) = (a(jj, i) - sum) / u(i,i);
            });
            acc.wait();
        }
        std::cout << counter.Stop() << " ms.\n";
    }
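
    A plain CPU version of the same recurrence can serve as a reference when checking whether a rewritten kernel still produces the right factors. The following is only a minimal sketch, not the code used above: `LUResult`, `lu_doolittle_cpu`, and `max_residual` are hypothetical names, and the matrices are assumed to be row-major `std::vector<float>` buffers rather than the `Matrix<_type>` class.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the two factors (row-major, n x n each).
struct LUResult {
    std::vector<float> l;  // unit lower-triangular factor
    std::vector<float> u;  // upper-triangular factor
};

// Naive Doolittle LU without pivoting. Each outer iteration mirrors the
// three GPU kernels: set l(i,i), fill row i of U, fill column i of L.
inline LUResult lu_doolittle_cpu(const std::vector<float>& a, int n)
{
    assert(n > 0 && a.size() == static_cast<std::size_t>(n) * n);
    LUResult r{std::vector<float>(a.size(), 0.0f),
               std::vector<float>(a.size(), 0.0f)};
    auto at = [n](int row, int col) {
        return static_cast<std::size_t>(row) * n + col;
    };
    for (int i = 0; i < n; ++i) {
        r.l[at(i, i)] = 1.0f;
        // Row i of U, computed for every column j as in the second kernel.
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < i; ++k)
                sum += r.l[at(i, k)] * r.u[at(k, j)];
            r.u[at(i, j)] = a[at(i, j)] - sum;
        }
        // Column i of L below the diagonal, as in the third kernel.
        for (int jj = i + 1; jj < n; ++jj) {
            float sum = 0.0f;
            for (int k = 0; k < i; ++k)
                sum += r.l[at(jj, k)] * r.u[at(k, i)];
            r.l[at(jj, i)] = (a[at(jj, i)] - sum) / r.u[at(i, i)];
        }
    }
    return r;
}

// Largest absolute entry of L*U - A; small values indicate a valid factorization.
inline float max_residual(const std::vector<float>& a, const LUResult& r, int n)
{
    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float prod = 0.0f;
            for (int k = 0; k < n; ++k)
                prod += r.l[static_cast<std::size_t>(i) * n + k] *
                        r.u[static_cast<std::size_t>(k) * n + j];
            float diff = std::fabs(prod - a[static_cast<std::size_t>(i) * n + j]);
            if (diff > worst) worst = diff;
        }
    }
    return worst;
}
```

    Comparing `max_residual` for the GPU factors against the CPU ones (with a tolerance suited to float) would show whether a rewritten kernel changed the result or only the code the driver compiles.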
    

    Any ideas why, and a work around?

    Ken Domino

All replies

  • July 16, 2012 01:34
     
     Answer

    Hi Ken,

    Does it work on REF? If it works on REF and fails only on the NVIDIA card, it might be the same driver bug we discussed here: http://social.msdn.microsoft.com/Forums/en-US/parallelcppnative/thread/28c4933d-df19-4aa2-93b8-9f9cc4e85a7a. We have worked closely with NVIDIA on this one, and hopefully we will see a fix soon. Assuming this is indeed the same issue, a workaround was also discussed in that thread. I also found that changing the for loop to a do/while loop (assuming i >= 0) may help. Please give it a try, and let us know if it does not work for your case.

    Thanks,

    Weirong

    • Marked as answer by Ken Domino, July 16, 2012 14:57
  • July 16, 2012 14:57
     
      Has code

    Hi Weirong,

    Yes, it seems to be an NVIDIA driver bug. Ugh. "Software Adapter" works fine, as well as an AMD GPU.  But, it doesn't seem to have anything to do with the size of the problem, as I get a TDR even with very small matrices (e.g., 4 x 4 floats).  The problem occurs with the for-loop:

                for (int k = 0; k < i; ++k)
                    sum += l(i,k) * u(k,j);
    

    In fact, the bug appears with either of the for-loops that appear in the 2nd and 3rd parallel-for-each's.  Rewriting the for-loop using the usual do-while construct does not fix the TDR:

                int k = 0;
                do {
                    if (k >= i)
                        break;
                    sum += l(i,k) * u(k,j);
                    k += 1;
                } while (true);
    

    However, it is fixed using the do-while in reverse:

                int k = i - 1;
                do {
                    if (k < 0)
                        break;
                    sum += l(i,k) * u(k,j);
                    k -= 1;
                } while (true);
    

    It is fixed using two for-loops:

                for (int k2 = 0; k2 < i; k2 += 10)
                    for (int k = k2; k < k2+10 && k < i; ++k)
                        sum += l(i,k) * u(k,j);
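
    The forward, reverse, and blocked loops above all accumulate the same terms; only the loop structure handed to the driver's compiler differs. A minimal CPU-only sketch of that equivalence follows. The function names are hypothetical, and plain vectors `x` and `y` stand in for row i of `l` and column j of `u`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Forward loop, as in the original kernels.
inline float sum_forward(const std::vector<float>& x,
                         const std::vector<float>& y, int i)
{
    assert(i >= 0 && static_cast<std::size_t>(i) <= x.size() &&
           x.size() == y.size());
    float sum = 0.0f;
    for (int k = 0; k < i; ++k)
        sum += x[k] * y[k];
    return sum;
}

// Reverse do-while with a top-of-body exit test: same terms, visited
// from k = i - 1 down to 0.
inline float sum_reverse(const std::vector<float>& x,
                         const std::vector<float>& y, int i)
{
    assert(i >= 0 && static_cast<std::size_t>(i) <= x.size() &&
           x.size() == y.size());
    float sum = 0.0f;
    int k = i - 1;
    do {
        if (k < 0)
            break;
        sum += x[k] * y[k];
        k -= 1;
    } while (true);
    return sum;
}

// Two nested for-loops in blocks of 10: same terms, same order as forward.
inline float sum_blocked(const std::vector<float>& x,
                         const std::vector<float>& y, int i)
{
    assert(i >= 0 && static_cast<std::size_t>(i) <= x.size() &&
           x.size() == y.size());
    float sum = 0.0f;
    for (int k2 = 0; k2 < i; k2 += 10)
        for (int k = k2; k < k2 + 10 && k < i; ++k)
            sum += x[k] * y[k];
    return sum;
}
```

    The blocked version adds the terms in exactly the forward order, so it should match bit for bit. The reverse version adds them in the opposite order, so with general float data its result may differ from the forward one in the last bits.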
    
    Thanks Weirong!

    Ken

  • July 16, 2012 17:34
     
     

    Thanks Ken,

    Thanks for trying it out. We believe it's an NVIDIA driver JIT bug.

    Just out of curiosity, have you tried writing the loop as follows (assuming i > 0):

                int k = 0;
                do {
                    sum += l(i,k) * u(k,j);
                    k += 1;
                } while (k >= i);

    This way the loop exits from the bottom rather than the top. Please see whether this also works around the problem. Also, does your reverse-loop workaround work when written as a "for" loop?
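
    A bottom-tested loop visits the same terms as `for (int k = 0; k < i; ++k)` only if it keeps going while `k < i` and is skipped entirely when `i` is 0. A minimal CPU-only sketch of that form is below. The function names are hypothetical, and vectors `x` and `y` stand in for the `l` and `u` accesses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Top-tested reference: the loop as written in the original kernels.
inline float sum_for_loop(const std::vector<float>& x,
                          const std::vector<float>& y, int i)
{
    assert(i >= 0 && static_cast<std::size_t>(i) <= x.size() &&
           x.size() == y.size());
    float sum = 0.0f;
    for (int k = 0; k < i; ++k)
        sum += x[k] * y[k];
    return sum;
}

// Bottom-tested variant: the body runs before the test, so the i == 0
// case needs an explicit guard to avoid reading element 0.
inline float sum_bottom_tested(const std::vector<float>& x,
                               const std::vector<float>& y, int i)
{
    assert(i >= 0 && static_cast<std::size_t>(i) <= x.size() &&
           x.size() == y.size());
    float sum = 0.0f;
    if (i > 0) {
        int k = 0;
        do {
            sum += x[k] * y[k];
            k += 1;
        } while (k < i);  // continue while terms remain
    }
    return sum;
}
```

    Both functions add the terms in the same order, so they should return identical values for every `i`, including `i == 0`.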

    Thanks,

    Weirong

  • July 16, 2012 18:33
     
     

    Well, the "do { sum += l(i,k) * u(k,j); k += 1; } while(k>=i);" substitution doesn't bypass the NVIDIA driver bug.  But, the reverse-loop "for (int k = i - 1; k >= 0; k--) sum += l(i,k) * u(k,j);" substitution does work.

    Thanks for the help.

    Ken

  • July 16, 2012 18:54
     
     
    Thanks for the experiments, Ken!