Why do we need partitioning at all when using IList<T> types?!

    Question

  • Hi

    After many hours of studying and thinking about partitioning and the different kinds of static partitioning, such as range and striped partitioning, I eventually arrived at a very basic and fundamental question: why do we need partitioning at all for IList<T> types?!

    Since we can access items directly by their index and there's no need to synchronize the threads to access the collection, why should we partition the collection at all? We can simply define a shared index between the threads and increment it using the Interlocked class. For example, Parallel.For() could be implemented this way:

    void MyParallelFor(int fromInclusive, int toExclusive, Action<int> body)
    {
        int numProcs = Environment.ProcessorCount;
        int index = fromInclusive - 1;
        Task[] tasks = new Task[numProcs];
        for (int i = 0; i < numProcs; i++)
        {
            tasks[i] = Task.Factory.StartNew(() =>
            {
                int currentIndex;
                while ((currentIndex = Interlocked.Increment(ref index)) < toExclusive)
                    body(currentIndex);
            });
        }
        Task.WaitAll(tasks);
    }

    We can also implement Parallel.ForEach() in terms of MyParallelFor() for IList<T> sources:

    void MyParallelForEach<T>(IList<T> source, Action<T> body)
    {
    	MyParallelFor(0, source.Count, (i) =>
    	{
    		body(source[i]);
    	});
    }

    This implementation satisfies load balancing as well. I tested both of these methods against Parallel.For() and Parallel.ForEach(). (Correction: I originally claimed this technique was at least 50% faster, but that was due to a bug in my benchmark program. The technique is not faster; it is as fast as Parallel.For() for CPU-bound workloads, in fact they are identical.)

    Am I wrong?





    • Edited by Mansoor Omrani Sunday, February 26, 2012 6:38 AM correction due to a bug in benchmark program
    Saturday, February 25, 2012 1:59 PM

Answers

  • Just for other people -

    Partitioning helps load balance the workload.  By partitioning an implementation of IList<T>, you allow a single Task to work on many elements of the list.  This is especially important if you have small loop bodies, as the overhead of processing each element individually becomes relatively more expensive.  Sometimes it's even worth handling this yourself. For details, see the "How to: Speed Up Small Loop Bodies" article on MSDN: http://msdn.microsoft.com/en-us/library/dd560853.aspx

    Also, realize that while most implementations of IList<T> are safe for concurrent reads, there is no guarantee of thread safety in the IList<T> interface itself.  As such, you shouldn't assume an IList<T> implementation is thread safe without examining the concrete implementation being used.
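For readers who want the concrete pattern the article describes: handing Parallel.ForEach a range partitioner lets each task run a plain sequential loop over its chunk, so the per-element delegate overhead is amortized. A minimal sketch (the ForEachChunked name is my own; Partitioner.Create and Parallel.ForEach are the real APIs from System.Collections.Concurrent and System.Threading.Tasks):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class PartitionedLoop
{
    // One delegate invocation per *chunk* instead of one per element,
    // so the scheduling overhead is spread across the whole chunk.
    public static void ForEachChunked<T>(IList<T> source, Action<T> body)
    {
        // Partitioner.Create(0, count) yields Tuple<int, int> index ranges.
        Parallel.ForEach(Partitioner.Create(0, source.Count), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                body(source[i]);
        });
    }
}
```

Each task receives whole index ranges, so the body delegate still runs once per element, but the partitioning and scheduling machinery runs only once per range.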


    Reed Copsey, Jr. - http://reedcopsey.com
    If a post answers your question, please click "Mark As Answer" on that post and "Mark as Helpful".

    • Marked as answer by Mansoor Omrani Saturday, February 25, 2012 9:39 PM
    Saturday, February 25, 2012 8:40 PM
    Moderator
    re: "If the operation is CPU-bound, why do we need partitioning at all for arrays and IList<T>?"

    I don't completely understand the question.  What you're doing in MyParallelFor is partitioning: partitioning is just the act of taking a set of data and distributing it to the various workers that will process it.  You're partitioning the data dynamically, one element at a time.

    Maybe what you're asking is why Parallel.For does a more complicated form of partitioning than this one-element-at-a-time scheme? If you haven't already, you might want to read http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=19222, and in particular the section on partitioning.  Your approach of using an interlocked operation per element is great if there's lots of work to be done for each element; in that case, you get the benefits of the best possible load balancing, and the extra synchronization cost is dwarfed by the work being done for each element.  But as the amount of work per element gets smaller and smaller, the relative overhead of the per-element interlocked operation starts to become noticeable.
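To make that trade-off concrete, here is a chunked variant of the MyParallelFor from the question: each worker claims a batch of indices with a single Interlocked.Add instead of one Interlocked.Increment per element. This is only a sketch of the idea (the fixed chunkSize parameter is illustrative), not how Parallel.For is actually implemented:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class ChunkedFor
{
    // Like MyParallelFor, but each worker claims chunkSize indices per
    // interlocked operation instead of one, so the synchronization cost
    // is paid once per chunk rather than once per element.
    public static void For(int fromInclusive, int toExclusive, int chunkSize, Action<int> body)
    {
        int next = fromInclusive;
        var tasks = new Task[Environment.ProcessorCount];
        for (int i = 0; i < tasks.Length; i++)
        {
            tasks[i] = Task.Factory.StartNew(() =>
            {
                int start;
                // Interlocked.Add returns the updated value; the chunk this
                // worker now owns starts chunkSize before that.
                while ((start = Interlocked.Add(ref next, chunkSize) - chunkSize) < toExclusive)
                {
                    int end = Math.Min(start + chunkSize, toExclusive);
                    for (int j = start; j < end; j++)
                        body(j);
                }
            });
        }
        Task.WaitAll(tasks);
    }
}
```

With chunkSize = 1 this degenerates to the per-element scheme from the question; with chunkSize = 64 the interlocked cost is paid once per 64 elements, at the price of coarser load balancing.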

    Tuesday, February 28, 2012 7:43 PM
    Owner
  • Any kind of synchronization between threads has cost.  As the work gets smaller, the ratio of the overhead to the work becomes greater and more noticeable.  Moreover, synchronization such as with interlocked operations is typically more expensive when there's contention, and the smaller the work, the more likely there is to be contention.
    • Marked as answer by Mansoor Omrani Wednesday, February 29, 2012 4:18 PM
    Wednesday, February 29, 2012 4:01 PM
    Owner

All replies

    Please don't answer! I got it. I was mistaken! (How foolish of me!)

    Saturday, February 25, 2012 2:59 PM
  • I understand, but I'm still doubtful. For example, what's the difference between the implementation I wrote and the dynamic partitioner implemented on this page, which is a chunk partitioner with a chunk size of 1:

    http://msdn.microsoft.com/en-us/library/dd997416.aspx

    They both return a single item at a time.

    Saturday, February 25, 2012 9:46 PM
  • That just happens to be that example.

    The default dynamic partitioning used with IEnumerable<T>, for example, does partitioning where the chunks grow in size.  Doing partitions of 1 element at a time is typically not the best approach (though there are exceptions), as it adds extra overhead for scheduling.
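The growing-chunk idea can be sketched on top of the same shared-counter loop: start with single-element claims for good load balancing, then grow the claim size so the synchronization cost is amortized. The doubling schedule and the cap of 64 below are illustrative assumptions of mine, not the TPL's actual policy:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class GrowingChunkFor
{
    // Each worker starts by claiming single indices (best load balancing)
    // and doubles its claim size up to maxChunk (best amortization).
    public static void For(int fromInclusive, int toExclusive, Action<int> body, int maxChunk = 64)
    {
        int next = fromInclusive;
        var tasks = new Task[Environment.ProcessorCount];
        for (int i = 0; i < tasks.Length; i++)
        {
            tasks[i] = Task.Factory.StartNew(() =>
            {
                int chunk = 1;
                int start;
                while ((start = Interlocked.Add(ref next, chunk) - chunk) < toExclusive)
                {
                    int end = Math.Min(start + chunk, toExclusive);
                    for (int j = start; j < end; j++)
                        body(j);
                    if (chunk < maxChunk) chunk *= 2;  // grow the next claim
                }
            });
        }
        Task.WaitAll(tasks);
    }
}
```

Early on, workers grab tiny chunks, so no worker gets stuck with a large unlucky slice; late in the loop, each interlocked operation covers many elements.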


    Reed Copsey, Jr. - http://reedcopsey.com
    If a post answers your question, please click "Mark As Answer" on that post and "Mark as Helpful".

    Sunday, February 26, 2012 7:19 PM
    Moderator
  • What scheduling? Do you mean thread synchronization?
    Sunday, February 26, 2012 8:12 PM
  • What scheduling? Do you mean thread synchronization?

    The Parallel class uses partitioning to push each "chunk" of work into a specific Task (which, in turn, is mapped to a thread).  The act of actually assigning the loop body (or bodies) to a Task is done via a TaskScheduler (which you can override in ParallelOptions: http://msdn.microsoft.com/en-us/library/system.threading.tasks.paralleloptions.aspx).  The more individual Tasks that must be created, the more work the TaskScheduler has to do.

    By partitioning your work into larger "chunks", you reduce the number of individual Tasks that must be created.  This, in turn, reduces the overhead of the entire operation.
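One way to see this effect directly is the Partitioner.Create(fromInclusive, toExclusive, rangeSize) overload, which lets you pick the chunk size explicitly; fewer, larger chunks mean fewer loop bodies for the scheduler to dispatch. The CountChunks helper below is hypothetical, written just to count the chunks:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class ChunkSizeDemo
{
    // Counts how many range "chunks" (and hence how many scheduled loop
    // bodies) a given range size produces for the same total iterations.
    static int CountChunks(int total, int rangeSize)
    {
        int chunks = 0;
        Parallel.ForEach(Partitioner.Create(0, total, rangeSize), range =>
        {
            Interlocked.Increment(ref chunks);
            // ...the real work for [range.Item1, range.Item2) would go here
        });
        return chunks;
    }

    static void Main()
    {
        Console.WriteLine(CountChunks(10000, 1));    // 10000 chunks
        Console.WriteLine(CountChunks(10000, 500));  // 20 chunks
    }
}
```

Same 10,000 iterations either way, but the scheduler dispatches 500 times less work in the second case.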

     


    Reed Copsey, Jr. - http://reedcopsey.com
    If a post answers your question, please click "Mark As Answer" on that post and "Mark as Helpful".

    Monday, February 27, 2012 5:00 PM
    Moderator
  • Reed

    I urge you to test the MyParallelFor() method I wrote against Parallel.For(). You'll see that for CPU-bound workloads the method runs almost exactly as fast as Parallel.For(). The reason I withdrew my assertion was that we cannot guarantee the workload is always CPU-bound. For I/O-bound jobs, MyParallelFor() is worse than Parallel.For() (by almost 100%), and that is natural: it only creates a limited number of threads, and because the job is I/O-bound, its threads spend most of their time waiting and doing nothing useful. But that sly Parallel.For() is smart enough to increase the number of its threads in such cases, so it finishes the operation much faster. However, after I modified MyParallelFor() slightly to increase the number of threads, I observed that this time it ran a little faster than Parallel.For().

    Again, I insist: if the workload is CPU-bound, there seems to be little difference between MyParallelFor() and Parallel.For(). You can check it yourself. Now I ask the same question, this time with a big "if":

    If the operation is CPU-bound, why do we need partitioning at all for arrays and IList<T>?

    Kind Regards

    Monday, February 27, 2012 6:01 PM
  • Is it because of the memory barriers involved in Interlocked operations that, when there is only a tiny amount of work per element, the overhead of the Interlocked starts to become noticeable?

    Wednesday, February 29, 2012 9:48 AM