Parallel Computing Developer Center > Microsoft Visual Studio 2010 Beta 2 Forums > Parallel Computing in C++ and Native Code > problem parallelizing linear system resolution with PPL (Visual C++ 2010 beta)
Ask a questionAsk a question
 

Questionproblem parallelizing linear system resolution with PPL (Visual C++ 2010 beta)

  • Wednesday, October 28, 2009 4:50 PMmanuel-- Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     

    Hello,
        I am trying to parallelize using PPL (Visual Studio 2010 beta) the resolution of a linear system with Gauss elimination method.
    Below is the code I wrote to do the testing. One parallel_for is inserted in the triangulation phase.

        A priori, everything works fine when I run it on an Intel dual core. The task manager shows that the CPU usage
    is 50% in sequential mode and 100% in parallel mode.

        There is just a "small" problem: it actually takes 50.6 seconds in sequential mode and 51.6 seconds in parallel mode...
    Can anybody tell me if I missed a point by any chance? Is it a problem of granularity (I also tried to cut
    the loop on j in two independent half loops to increase the granularity, to no avail...) ?

        Notice that I already used PPL to parallelize a matrix multiplication in a very similar way
    (with parallel_for, of course) and the result was then perfect.

        Many thanks in advance.
        Manuel

     

    #include <stdio.h>
    #include <time.h>
    #include <functional>
    #include <ppl.h>
    using namespace Concurrency;

    #define  N  3000

    #define EL(A , i , j)  (*((A) + (i) * N + (j)))

    int solve_linear_system (double *V , double *A , double *B)
    {
        int i;
        double tt = (double )clock() / CLOCKS_PER_SEC;
        for (i = 0 ; i < N ; i++)
        {
            if (fabs(EL(A , i , i)) < 1e-20)  return 0;

    ////**** PARALLEL TRIANGULATION
            parallel_for (i + 1 , N , [ A , B , i ] (int j)
            {  double r = EL(A , j , i) / EL(A , i , i);
                for (int k = i ; k < N ; k++)  EL(A , j , k) -= r * EL(A , i , k);
                B[j] -= r * B[i];  });
    ////****
    ////**** SEQUENTIAL TRIANGULATION
    //        for (int j = i + 1 ; j < N ; j++)
    //        {  double r = EL(A , j , i) / EL(A , i , i);
    //            for (int k = i ; k < N ; k++)  EL(A , j , k) -= r * EL(A , i , k);
    //            B[j] -= r * B[i];  }
    ////****
        }
        printf("\n    triangulation done in %.3f s\n" , (double )clock() / CLOCKS_PER_SEC - tt);

        if (fabs(EL(A , N - 1 , N - 1)) < 1e-20)  return 0;
        V[N - 1] = B[N - 1] / EL(A , N - 1 , N - 1);
        for (i = N - 2 ; i >= 0 ; i--)
        {  double s = 0;    for (int j = i + 1; j < N ; j++)  s += EL(A , i , j) * V[j];
            V[i] = (B[i] - s) / EL(A , i , i);  }
        return 1;
    }

    void main()
    {
        double *A = (double *)malloc(N * N * sizeof(double));
        double *B = (double *)malloc(N * sizeof(double));
        double *V = (double *)malloc(N * sizeof(double));

        for (int i = 0 ; i < N ; i++)  for (int j = 0 ; j < N ; j++)
        {  EL(A , i , j) = (i == j ? 1 : 0);    B[i] = 10;  }

        solve_linear_system(V , A , B);
    }

     

    • Moved byWesley YaoMSFTFriday, October 30, 2009 3:29 AM (From:Visual C++ Express Edition)
    •  

All Replies

  • Wednesday, October 28, 2009 7:05 PMSheng Jiang 蒋晟MVPUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    I suggest you post to the Parallel Computing in C++ and Native Code forum. This forum does not cover specific code libraries.

    The following is signature, not part of post
    Please mark the post answered your question as the answer, and mark other helpful posts as helpful.
    Visual C++ MVP
  • Sunday, November 01, 2009 7:05 AMrickmolloyMSFT, OwnerUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    I'll look at this again tomorrow, but I am seeing speedups on my quad core, they just aren't 4x.  I'm getting a serial time of approx. 40.5s and parallel time of 28s.
    Rick Molloy Parallel Computing Platform : http://blogs.msdn.com/nativeconcurrency http://parallelroads.com/blog
  • Monday, November 02, 2009 9:16 AMmanuel-- Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    Thanks Rick. It comforts me in the idea that it's a problem of implementation, i.e. both the sluggish dual core I'm using and the current implementation of PPL. By the way, I was told that OpenMP gives better results, I might try it out.
    I am very interested by your runs on your quadcore. Maybe you could try the following version of the code.
    It allows for much higher granularity by cutting the loop on j in 2 (or 4 if you put N_CORE at 4 for your quadcore)
    and by processing the two (or four) blocks in parallel.
    Note: only the triangulation need be parallelized. The resolution of the triangular system (end of the program) is fast enough in sequential.


    #include <stdio.h>

    #include <time.h>

    #include <functional>

    #include <ppl.h>

    using namespace Concurrency;

     

    #define  N  3000

    #define N_CORE  2

     

    #define EL(A , i , j)  (*((A) + (i) * N + (j)))

     

    int solve_linear_system (double *V , double *A , double *B)

    {

        int i;

        double tt = (double )clock() / CLOCKS_PER_SEC;

        for (i = 0 ; i < N ; i++)

        {

            if (fabs(EL(A , i , i)) < 1e-20)  return 0;

     

    ////**** PARALLEL TRIANGULATION WITH HIGHER GRANULARITY

            if (N - (i + 1) < N_CORE)    /* sequential if less remaining rows than cores */

                for (int j = i + 1 ; j < N ; j++)

                {  double r = EL(A , j , i) / EL(A , i , i);    for (int k = i ; k < N ; k++)  EL(A , j , k) -= r * EL(A , i , k);

                    B[j] -= r * B[i];  }

            else

                parallel_for (0 , N_CORE , [ A , B , i ] (int i_core)

                {  int jC0 = (i + 1) + i_core * (N - (i + 1)) / N_CORE , jC1 = (i + 1) + (i_core + 1) * (N - (i + 1)) / N_CORE;

                    for (int j = jC0 ; j < jC1 ; j++)

                    {  double r = EL(A , j , i) / EL(A , i , i);    for (int k = i ; k < N ; k++)  EL(A , j , k) -= r * EL(A , i , k);

                        B[j] -= r * B[i];  }  });

    ////****

    ////**** PARALLEL TRIANGULATION

    //        parallel_for (i + 1 , N , [ A , B , i ] (int j)

    //        {  double r = EL(A , j , i) / EL(A , i , i);    for (int k = i ; k < N ; k++)  EL(A , j , k) -= r * EL(A , i , k);

    //            B[j] -= r * B[i];  });

    ////****

    ////**** SEQUENTIAL TRIANGULATION

    //        for (int j = i + 1 ; j < N ; j++)

    //        {  double r = EL(A , j , i) / EL(A , i , i);    for (int k = i ; k < N ; k++)  EL(A , j , k) -= r * EL(A , i , k);

    //            B[j] -= r * B[i];  }

    ////****

        }

        printf("\n    triangulation done in %.3f s\n" , (double )clock() / CLOCKS_PER_SEC - tt);

        if (fabs(EL(A , N - 1 , N - 1)) < 1e-20)  return 0;

        V[N - 1] = B[N - 1] / EL(A , N - 1 , N - 1);

        for (i = N - 2 ; i >= 0 ; i--)

        {  double s = 0;    for (int j = i + 1; j < N ; j++)  s += EL(A , i , j) * V[j];

            V[i] = (B[i] - s) / EL(A , i , i);  }

        return 1;

    }

     

    void main()

    {

        double *A = (double *)malloc(N * N * sizeof(double));

        double *B = (double *)malloc(N * sizeof(double));

        double *V = (double *)malloc(N * sizeof(double));

     

        for (int i = 0 ; i < N ; i++)  for (int j = 0 ; j < N ; j++)  {  EL(A , i , j) = (i == j ? 1 : 0);    B[i] = 10;  }

     

        solve_linear_system(V , A , B);

    }

     

     

  • Monday, November 02, 2009 9:18 AMmanuel-- Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    I forgot to precise that this code with higher granularity runs in 51.5 seconds on my dual core, still exactly
    the same as the sequential version.
  • Saturday, November 07, 2009 5:56 AMrickmolloyMSFT, OwnerUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    I just wanted to touch base on this. This loops appears to be very small execution wise and I'm not seeing any performance improvements when I use OpenMP.  Are you seeing perf improvements with OpenMP?  I'm also assuming you can't parallelize the outer loop (I'm sorry to say I didn't read your code very  closely) because that will also show better speedups.

    I will say that you should consider unrolling the loop a bit. Perhaps I've got a bug here, but the following does really well, particularly when you turn on SSE instructions in the compiler options.

     

     

    for (int j = i + 1 ; j < N ; j++)
    {
       double r = EL(A , j , i) / EL(A , i , i);
       for (int k = i ; k < N/4; k++)
       {
          EL(A , j , k*
    4) -= r * EL(A , i , k*4);
          EL(A , j , k*
    4+1) -= r * EL(A , i , k*4+1);
          EL(A , j , k*
    4+2) -= r * EL(A , i , k*4+2);
          EL(A , j , k*
    4+3) -= r * EL(A , i , k*4+3);
       }

       B[j] -= r * B[i];
    }


    Rick Molloy Parallel Computing Platform : http://blogs.msdn.com/nativeconcurrency http://parallelroads.com/blog
  • Tuesday, November 10, 2009 9:32 AMmanuel-- Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     

    It is true that the big issue about the triangulation of Gauss elimination method is the outer loop on i that must remain sequential (as far as I know). Then there are two imbricated loops on j and k inside that make something running
    in O(N2), good stuff for parallelization. Actually, I also tested with N=6000 and I obtained a speed-up of 1.27 on a dual core (instead of 1.0 for N=3000), so there's definitely something stirring when the linear system
    gets "big enough". Unfortunately, it may still require a huge N to get enough granularity and have a speed-up approach 2.0.

    So, I have come to believe that general tools such as parallel_for may not be adapted for such an inner loop (as I heard, it recreates threads each time, i.e. at each passage in the outer loop on i). Instead of learning OpenMP, I think I will "hardcode" the parallelization directly with threads and mutexes (_beginthread, WaitForSingleObject, ...). That way, I should be able to create a pool of N_CORE threads only once, outside of the outer loop on i, and use these threads to parallelize the inner loop on j (by cutting it in N_CORE parts). Hopefully, there will be far less overhead.

    By the way, since I will then create one thread for each core, will it be possible to precise in the program
    that the first thread must run on core 0, the second thread on core 1, ... ?
    Since the tasks are all of the same size,  that would prevent any possible problem of load balancing between cores.
    Is there a way of forcing a given thread to run on a given core ?

    Thank you so much to remind me about SSE. Hopefully, that will make me run up to twice as fast (since I work on doubles).

    Well, that makes of lot of work in front of me. Wish me good luck...   ;-)