Ask a questionAsk a question
 

AnswerPLINQ - bug in Aggregate method

  • Tuesday, June 16, 2009 2:37 AMJoe AlbahariMVPUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    I've nearly finished writing about PLINQ in my upcoming book and am very impressed with the technology! I'm getting good performance gains in a variety of situations.

    I'm running into a problem, though, which I believe is a bug in the API. The following intermittently gives incorrect results:

      var numbers = Enumerable.Range (1, 100).Select (n => (double) n);
      double answer = numbers.AsParallel().Aggregate ((total, n) => total + Math.Sqrt (n));

    It would seem that PLINQ is using stale values of total (perhaps due to a naive lock-free parallelization algorithm?) This would imply that the following query is also thread-unsafe:

      Aggregate ((total, n) => total + n)     // This should be valid - it's not side-effecting and it doesn't mutate state.

    I appreciate that there's another overload of Aggregate that fixes this problem via a local accumulator (while also improving efficiency). If it's the case that we must use that overload to get correct results, then the above query should do one of two things:

    • Throw a NotSupportedException, suggesting that you use the local-accumulator overload instead
    • Implement the operator sequentially (like PLINQ does for other operators that can't be efficiently parallelized)

    Intermittently giving incorrect results is not the answer. Many people will not realize that PLINQ defines special new overloads for Aggregate and will merrily use the standard ones unaware that they don't work reliably. And unit tests may fail to pick up the problem - because of the intermittent nature of this problem.

    Regards

    Joe


    Write LINQ queries interactively - www.linqpad.net

Answers

  • Tuesday, June 16, 2009 7:48 AMIgor Ostrovsky - MSFT Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     Answer

    Joe,

    This is a surprisingly tricky case. Your code subtly breaks the assumptions of LINQ's Aggregate overload, but things just happen to work out in LINQ to Objects, because sqrt(1) = 1.

    For example, if you reverse the sequence before aggregating over it, the first element will be 99, and LINQ to Objects will also give a wrong answer.

    Why is this happening? The reason is that this particular Aggregate overload uses the first element of the sequence as the initial accumulator value. In general, that doesn't work for the sum of square roots aggregation, but it just so happens that 1 or 0 will work.

    You can fix both LINQ and PLINQ cases by explicitly stating the initial accumulator value. With accumulator value specified, PLINQ will run the aggregation sequentially, unless you also provide a function to combine different accumulators.

    The fixed PLINQ query will look like this:

    double result = Enumerable.Range(1, 100) .AsParallel().Aggregate(0.0, (acc, x) => acc + Math.Sqrt((double)x));

    Does that clarify things?

    Igor Ostrovsky - http://igoro.com

  • Tuesday, June 16, 2009 6:34 PMIgor Ostrovsky - MSFT Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     Answer
    Miha,

    Note that these are semantics of LINQ to Objects, and PLINQ has to follow them.

    Sometimes, 0 is a good initial accumulator value, and sometimes it isn't. For example, if you are doing a multiplication, if you initialize the accumulator to 0, the result of the multiplication will always be 0 (arr.Aggregate((x,y)=>x*y).

    I suspect that this is why LINQ to Objects takes the first element of the sequence as the accumulator, rather than the default value.

    Igor Ostrovsky - http://igoro.com
  • Tuesday, June 16, 2009 7:10 PMIgor Ostrovsky - MSFT Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     AnswerHas Code
    (1) The scenarios where you can use that overload in PLINQ are generally the same as the ones where you can use it in LINQ to Objects.

    Any time the Func<T,T,T> is associative and commutative, you can use the Aggregate overload that doesn't accept a seed. For example, say that you have an IEnumerable<int[]>, where all integer arrays are the same length. You can do a vector sum over those arrays:

    IEnumerable<int[]> vectors = ...;
    int[] vectorSum = vectors.AsParallel()
        .Aggregate((v1,v2) =>
        {
            int[] v = new int[v1.Length];
            for(int i=0; i<v1.Length; i++) v[i] = v1[i] + v2[i];
            return v;
        });
    
    Now, I am not aware of real-world cases where you have to use the overload without a seed (even though you can probably construct an artificial example). So, this is a convenience overload.

    (2) LINQ's seed-less Aggregate overload assumes that the first element of the sequence can be used as an accumulator value. PLINQ only extends this slightly by assuming that any element of the sequence can be used as an accumulator value. For robust code, the two assumptions should typically be equivalent (since otherwise inserting Reverse or OrderBy into the query before the Aggregate operator will break it).

    Igor Ostrovsky - http://igoro.com
  • Wednesday, June 17, 2009 6:46 AMIgor Ostrovsky - MSFT Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     Answer
    PLINQ's special overloads should have the same performance as the unseeded aggregation.

    In the unseeded variant, PLINQ will use the one user function to combine an accumulator with an element, and also to combine accumulators together.

    Igor Ostrovsky - http://igoro.com

All Replies

  • Tuesday, June 16, 2009 7:40 AMMiha MarkicMVPUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    Just wanted to confirm the behaviour you described - sometimes the result is ok, sometimes not. Very dangerous.
    I believe that it is a bug as well and I'd be surprised if the behaviour is by design.
    Miha Markic [MVP C#] http://blog.rthand.com
  • Tuesday, June 16, 2009 7:48 AMIgor Ostrovsky - MSFT Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     Answer

    Joe,

    This is a surprisingly tricky case. Your code subtly breaks the assumptions of LINQ's Aggregate overload, but things just happen to work out in LINQ to Objects, because sqrt(1) = 1.

    For example, if you reverse the sequence before aggregating over it, the first element will be 99, and LINQ to Objects will also give a wrong answer.

    Why is this happening? The reason is that this particular Aggregate overload uses the first element of the sequence as the initial accumulator value. In general, that doesn't work for the sum of square roots aggregation, but it just so happens that 1 or 0 will work.

    You can fix both LINQ and PLINQ cases by explicitly stating the initial accumulator value. With accumulator value specified, PLINQ will run the aggregation sequentially, unless you also provide a function to combine different accumulators.

    The fixed PLINQ query will look like this:

    double result = Enumerable.Range(1, 100) .AsParallel().Aggregate(0.0, (acc, x) => acc + Math.Sqrt((double)x));

    Does that clarify things?

    Igor Ostrovsky - http://igoro.com

  • Tuesday, June 16, 2009 9:59 AMMiha MarkicMVPUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    Hi Igor,

    Huh?
    Why don't you take the total type's default value instead (=0)? Taking the first value doesn't make sense to me. Perhaps I am missing something.

    However, there is another, original problem of this thread I believe: ParallelEnumerable.Aggregate (the original overload) doesn't work (even if one takes into the account first value) correclty almost as if has concurrency issues. The result is a random number sometimes the correct one.

    Miha Markic [MVP C#] http://blog.rthand.com
  • Tuesday, June 16, 2009 10:05 AMJoe AlbahariMVPUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    So:

    (1) is there any valid scenario in which you could use Aggregate with PLINQ and not specify a seed?

    (2) you say that this code "subtly breaks the assumptions of LINQ's Aggregate overload". What are the assumptions of LINQ's Aggregate overload? (I don't see anything stated in the documentation).

    Joe


    Write LINQ queries interactively - www.linqpad.net
  • Tuesday, June 16, 2009 10:37 AMJoe AlbahariMVPUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    "Why don't you take the total type's default value instead (=0)? Taking the first value doesn't make sense to me"

    Good point. The type's default value seems a lot more intuitive. But I guess they have to follow the semantics of LINQ's standard query operators - and it's too late to change that now.


    Write LINQ queries interactively - www.linqpad.net
  • Tuesday, June 16, 2009 6:34 PMIgor Ostrovsky - MSFT Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     Answer
    Miha,

    Note that these are semantics of LINQ to Objects, and PLINQ has to follow them.

    Sometimes, 0 is a good initial accumulator value, and sometimes it isn't. For example, if you are doing a multiplication, if you initialize the accumulator to 0, the result of the multiplication will always be 0 (arr.Aggregate((x,y)=>x*y).

    I suspect that this is why LINQ to Objects takes the first element of the sequence as the accumulator, rather than the default value.

    Igor Ostrovsky - http://igoro.com
  • Tuesday, June 16, 2009 7:10 PMIgor Ostrovsky - MSFT Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     AnswerHas Code
    (1) The scenarios where you can use that overload in PLINQ are generally the same as the ones where you can use it in LINQ to Objects.

    Any time the Func<T,T,T> is associative and commutative, you can use the Aggregate overload that doesn't accept a seed. For example, say that you have an IEnumerable<int[]>, where all integer arrays are the same length. You can do a vector sum over those arrays:

    IEnumerable<int[]> vectors = ...;
    int[] vectorSum = vectors.AsParallel()
        .Aggregate((v1,v2) =>
        {
            int[] v = new int[v1.Length];
            for(int i=0; i<v1.Length; i++) v[i] = v1[i] + v2[i];
            return v;
        });
    
    Now, I am not aware of real-world cases where you have to use the overload without a seed (even though you can probably construct an artificial example). So, this is a convenience overload.

    (2) LINQ's seed-less Aggregate overload assumes that the first element of the sequence can be used as an accumulator value. PLINQ only extends this slightly by assuming that any element of the sequence can be used as an accumulator value. For robust code, the two assumptions should typically be equivalent (since otherwise inserting Reverse or OrderBy into the query before the Aggregate operator will break it).

    Igor Ostrovsky - http://igoro.com
  • Wednesday, June 17, 2009 1:51 AMJoe AlbahariMVPUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    Thanks, Igor.
    Write LINQ queries interactively - www.linqpad.net
  • Wednesday, June 17, 2009 4:23 AMJoe AlbahariMVPUsers MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     
    One further question: assuming an aggregation function is associative and commutative, and the aggregation can be expressed as unseeded, will use of PLINQ's unique Aggregate overloads improve performance?

    Joe

    Write LINQ queries interactively - www.linqpad.net
  • Wednesday, June 17, 2009 6:46 AMIgor Ostrovsky - MSFT Users MedalsUsers MedalsUsers MedalsUsers MedalsUsers Medals
     Answer
    PLINQ's special overloads should have the same performance as the unseeded aggregation.

    In the unseeded variant, PLINQ will use the one user function to combine an accumulator with an element, and also to combine accumulators together.

    Igor Ostrovsky - http://igoro.com