PLINQ to objects: Requirements on the enumerator

    Question


    Hi,

      I am using PLINQ to objects, and I get different answers from PLINQ and plain old LINQ. I'm pretty sure this is because my iterators violate some basic requirements:

     

      My iterator moves through memory, unpacking a small chunk of data on each iteration. Each iteration overwrites the previous chunk, which I suspect causes the problem.

     

      Before I explore redesigning this, I thought it would be good to know what constraints PLINQ places on the iterator it runs:

     

      - The object returned by the enumerator must not be altered while it is alive. By the way, this means creating a new object for each iteration. Will that be worth it, or will the overhead be prohibitive? I guess I can't tell until I try it out.

     

      - What are the thread-safety requirements for the iterator? I've implemented it as a yield-return loop. I assume that only one thread at a time will be in that yield-return loop; in short, I don't need to do anything to make the iterator itself thread-safe, but the object it returns must not change over its lifetime.

     

      Are there other requirements? Any help appreciated!

     

    Cheers,

    Gordon.

    Thursday, February 28, 2008 12:19 AM

Answers

  • The design of IEnumerable<T> is inherently not thread-safe in that two separate calls are required in order to access the next element from the list (MoveNext and Current).  Thus, we internally lock when accessing the provided IEnumerable<T>, such that all of the threads PLINQ uses will play nicely with each other trying to access the enumerable. 
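
    To illustrate why the two calls have to be taken under one lock, here is a minimal sketch. LockedSource and TryTakeNext are hypothetical names made up for this example; this is not PLINQ's actual internal code, only the general idea of pairing MoveNext and Current in a single critical section.

```csharp
using System.Collections.Generic;

// Hypothetical helper, NOT PLINQ's internals: it only illustrates why
// MoveNext and Current must be paired under one lock when several
// threads pull from the same enumerator.
public sealed class LockedSource<T>
{
    private readonly IEnumerator<T> _enumerator;
    private readonly object _gate = new object();

    public LockedSource(IEnumerable<T> source)
    {
        _enumerator = source.GetEnumerator();
    }

    // Advances and reads in one critical section, so no other thread can
    // call MoveNext between this thread's MoveNext and its read of Current.
    public bool TryTakeNext(out T item)
    {
        lock (_gate)
        {
            if (_enumerator.MoveNext())
            {
                item = _enumerator.Current;
                return true;
            }
            item = default(T);
            return false;
        }
    }
}
```

    Note that the lock only protects the act of fetching an element. It does nothing for the element itself once a worker thread holds a reference to it.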

     

    It's fine to implement your enumerable with C# iterators (e.g. yield return); that's certainly much easier than implementing IEnumerator<T>/IEnumerator by hand. However, as you correctly point out, PLINQ does assume that each result it gets back from Current can be operated on independently; otherwise, there would be little opportunity for parallelism.  In this case, if you're handing back the same object for each element but writing over it with new data each time, multiple PLINQ threads will grab the data from the enumerator while your enumerator changes that data out from under them, and the results will not be what you want.
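
    As a rough sketch of the difference, assuming a made-up Chunk type and a double[] standing in for the packed data (neither comes from the original code), the two iterator styles might look like this:

```csharp
using System.Collections.Generic;

// Hypothetical record type standing in for the unpacked chunk.
public sealed class Chunk
{
    public double Value;
}

public static class ChunkReader
{
    // Unsafe for PLINQ: every element is the same instance, and it is
    // overwritten each time the iterator advances.
    public static IEnumerable<Chunk> ReadShared(double[] raw)
    {
        var shared = new Chunk();
        foreach (double d in raw)
        {
            shared.Value = d;
            yield return shared;
        }
    }

    // Safe for PLINQ: each element is a fresh instance that the iterator
    // never touches again after handing it out.
    public static IEnumerable<Chunk> ReadFresh(double[] raw)
    {
        foreach (double d in raw)
        {
            yield return new Chunk { Value = d };
        }
    }
}
```

    The problem is visible even without threads: buffering ReadShared(raw) into a list yields a list in which every entry is the same object holding the last value. With ReadFresh, a query such as ReadFresh(raw).AsParallel().Sum(c => c.Value) sees one independent object per element.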

     

    Whether or not returning a new object from each iteration will be prohibitively expensive really depends on what kind of data you're creating and returning and how much work is being done while processing each of the elements.  Good luck.

    Thursday, February 28, 2008 3:34 PM
    Moderator

All replies

  • Thanks, that confirms what I thought.

     

    I have two choices: try to make several versions of the reader, and assign each reader to a single "thread". This is a bit tricky not just because I need to know when PLINQ is done with the object but also because of the way the reader I'm working with operates. It would also increase memory usage quite dramatically.

     

    The other option is to make a copy of the data. When the data is just floats or doubles, that will be easy. But the content can be flattened objects, and I'm not sure how to copy those. One thing I can do is clone the live object, but this doesn't always work due to the way the library I'm using is built. The alternative is to copy the flattened data to a "safe" buffer and use that.
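
    A minimal sketch of the "safe" buffer idea, assuming the library can be wrapped in a delegate that unpacks the next record into a scratch array. SafeBufferReader, ReadCopies and fillNext are hypothetical names for illustration, not part of the library:

```csharp
using System;
using System.Collections.Generic;

public static class SafeBufferReader
{
    // 'fillNext' is a hypothetical stand-in for the library call that unpacks
    // the next record into 'scratch' and returns its length in bytes,
    // or -1 when there is no more data.
    public static IEnumerable<byte[]> ReadCopies(byte[] scratch, Func<byte[], int> fillNext)
    {
        int length;
        while ((length = fillNext(scratch)) >= 0)
        {
            // Copy the flattened bytes out before the next call overwrites them.
            var copy = new byte[length];
            Buffer.BlockCopy(scratch, 0, copy, 0, length);
            yield return copy;
        }
    }
}
```

    This costs one allocation and one block copy per record, which is the overhead the timing tests would need to measure.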

     

    I'm using a list processing engine optimized for sequential access to huge amounts (terabytes) of structured data arranged in columns (similar to a relational database, where some columns can be arrays of things). The library is designed to copy the data only once, when it is read in, for optimal speed; hence the "overwriting" problem.

     

    Thanks for your help. I'll do some timing tests to see how much overhead creating a new object per element in the parallel version adds compared with the incorrect approach I'm using now.

     

    -Gordon.

    Thursday, February 28, 2008 7:29 PM