Performance of CTP and TPL Dataflow compared to CCR
-
Friday, November 05, 2010 11:10 PM
Has anyone compared the performance of the Async CTP and/or the TPL Dataflow model with the CCR model? Is it similar, faster, or slower?
- Moved by Stephen Toub - MSFT (Microsoft Employee, Owner), Wednesday, January 26, 2011 3:50 PM. New forum for TPL Dataflow (From: Visual Studio Async CTP)
Answers
-
Saturday, November 06, 2010 12:16 AM (Owner)
Hi MonkeyBall-
Our goal in releasing TPL Dataflow as part of the Async CTP was primarily to get feedback on the approach and APIs, rather than on performance. We still have a good deal of performance work left to do, and there are still a fair number of places in the codebase that we can optimize, both in terms of speed and memory allocation. That said, to directly answer your question, we do run comparisons between the two in our testing; after all, we've worked closely with the Robotics team to ensure that TPL Dataflow can address, for .NET Framework customers, the same scenarios the CCR has addressed in the past. There are currently workloads where TPL Dataflow performance exceeds that of the CCR, and there are certain workloads where the inverse is true. Part of this is due to where we are in the development cycle, and part of it is due to the scenarios we've optimized the design of TPL Dataflow for.
As we release subsequent CTPs more geared towards improving performance, we'll certainly be looking for feedback along those lines. For now, both for System.Threading.Tasks.Dataflow.dll and for the language support for asynchrony, this release was really about directional and functional feedback rather than perf. That said, if you have particular scenarios that you want to see really scream in terms of performance, please do let us know about them so that we can factor them into our performance plans and goals.
Thanks for your interest.
- Proposed As Answer by Stephen Toub - MSFT (Microsoft Employee, Owner), Saturday, November 06, 2010 12:16 AM
- Marked As Answer by MonkeyBall Saturday, November 06, 2010 4:39 PM
All Replies
-
Tuesday, February 01, 2011 6:51 PM
Hi Stephen,
This library could be something very interesting for us, but as you mentioned, it is not currently as performant as it should be. We are currently using the CCR extensively in our main product to run multiple video pipelines in parallel, each composed mainly of different agents (stream reception, reassembling, reordering, decoding and rendering, along with other special agents specific to our product). This lets us get very good performance, spread very effectively across different cores. I would be very interested in testing this framework once its performance is deemed as good as what the CCR can offer.
We can have up to 64 different pipelines running concurrently, each composed of the agents I mentioned before. This is our main scenario, and it is where the performance has to be found. Is this the kind of scenario TPL Dataflow would be good for? You mention that in some scenarios TPL Dataflow is better than the CCR; I'd be interested to know whether that would be the case for our scenario before considering making a prototype to test it on our side. As you can guess, it is quite a complex design, and before making time to run tests with TPL Dataflow, we need some hints on the performance we could achieve with it.
Thanks,
Luc Ferron
-
Friday, February 04, 2011 3:58 PM
Many moons ago, I was way into the CCR and did a community version called PCR on CodePlex. Anyway, my version used linked lists for the internal queues instead of the normal .NET Queue class, which expands/shrinks a List (which is a lot of overhead if done often). It was something like 50% faster or more, IIRC, even with the extra linked-list node object allocations, which was a surprise to me. Sometime after that, George changed the CCR Port to also use a linked list, IIRC. So I wonder whether TPL Dataflow is using linked lists or Queue/List<T> for its internals (e.g. ConcurrentQueue<T>), or whether any testing has been done on that yet. I have not looked. Maybe there is still some low-hanging fruit to be had.
- Edited by WilliamStacey (Microsoft Community Contributor), Friday, February 04, 2011 4:27 PM: correction
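For anyone who wants to check the linked-list versus Queue comparison on their own workload, a rough timing sketch might look like the following. The class and method names (`QueueTiming`, `TimeQueue`, `TimeLinkedList`) are made up for illustration, and `LinkedList<int>` stands in for whatever list the PCR or CCR code actually used, which this thread does not show. It is a quick measurement under those assumptions, not a rigorous benchmark, and no particular result is implied.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper for illustration; not taken from the CCR, PCR, or TPL Dataflow sources.
public static class QueueTiming
{
    // Repeatedly fills and drains a Queue<int>, returning a checksum and the elapsed time.
    public static (long Checksum, TimeSpan Elapsed) TimeQueue(int items, int rounds)
    {
        var queue = new Queue<int>();
        long checksum = 0;
        var stopwatch = Stopwatch.StartNew();
        for (int r = 0; r < rounds; r++)
        {
            for (int i = 0; i < items; i++) queue.Enqueue(i);
            while (queue.Count > 0) checksum += queue.Dequeue();
        }
        stopwatch.Stop();
        return (checksum, stopwatch.Elapsed);
    }

    // Same workload against LinkedList<int>, which allocates one node per item.
    public static (long Checksum, TimeSpan Elapsed) TimeLinkedList(int items, int rounds)
    {
        var list = new LinkedList<int>();
        long checksum = 0;
        var stopwatch = Stopwatch.StartNew();
        for (int r = 0; r < rounds; r++)
        {
            for (int i = 0; i < items; i++) list.AddLast(i);
            while (list.Count > 0)
            {
                checksum += list.First.Value;
                list.RemoveFirst();
            }
        }
        stopwatch.Stop();
        return (checksum, stopwatch.Elapsed);
    }
}
```

Calling both methods with the same `items` and `rounds` and comparing the `Elapsed` values gives a first impression only; the outcome will depend on the runtime version, item type, queue depth, and garbage collector settings, so a warm-up run and several repetitions are advisable before drawing conclusions.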
-
Friday, February 04, 2011 5:22 PM (Owner)
There is definitely perf work we're still doing (including some that'll influence the interfaces a bit), and I'm hopeful that the next preview release will show some nice speedups over what's already there (some of our internal benchmarks have already improved by > 30% since the latest preview release).
Luc, to your question, your scenario of pipelines maps very well onto what TPL Dataflow is designed for, where you have a lot of data flowing from block to block to block; I'd suggest creating your prototype and seeing how it fares, knowing that it will very likely improve in future releases.
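To make the block-to-block shape concrete, here is a minimal sketch of several independent pipelines built from dataflow blocks. It assumes the API as it later shipped in the System.Threading.Tasks.Dataflow library (`TransformBlock`, `ActionBlock`, `LinkTo` with `DataflowLinkOptions`), which may differ from the CTP discussed in this thread. The stage names and bodies (`reassemble`, `decode`, `render`) are placeholders loosely modelled on the agents Luc describes, not real video code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical sketch; stage logic is a stand-in for real reassembly/decoding/rendering.
public static class PipelineSketch
{
    // Builds one pipeline: reassemble -> decode -> render.
    public static (ITargetBlock<byte[]> Head, Task Completion) Build(int id, ConcurrentBag<string> rendered)
    {
        // Completing the head block flows down the chain once each stage drains.
        var link = new DataflowLinkOptions { PropagateCompletion = true };

        var reassemble = new TransformBlock<byte[], byte[]>(packet => packet);
        var decode = new TransformBlock<byte[], string>(
            frame => "pipeline " + id + ": frame of " + frame.Length + " bytes");
        var render = new ActionBlock<string>(text => rendered.Add(text));

        reassemble.LinkTo(decode, link);
        decode.LinkTo(render, link);
        return (reassemble, render.Completion);
    }

    // Starts pipelineCount pipelines, posts packets to each, and waits for all to finish.
    public static async Task<int> RunAsync(int pipelineCount, int packetsPerPipeline)
    {
        var rendered = new ConcurrentBag<string>();
        var completions = new Task[pipelineCount];
        for (int i = 0; i < pipelineCount; i++)
        {
            var (head, completion) = Build(i, rendered);
            completions[i] = completion;
            for (int p = 0; p < packetsPerPipeline; p++)
                head.Post(new byte[16]);
            head.Complete();
        }
        await Task.WhenAll(completions);
        return rendered.Count;
    }
}
```

Something like `await PipelineSketch.RunAsync(64, 1000)` would exercise the 64-pipeline case. Each block here processes one message at a time by default, so ordering within a pipeline is preserved while separate pipelines and stages can run on different cores; whether that matches the CCR's throughput for a given workload is exactly what a prototype would need to measure.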
William, to your question, depending on the block, we use various kinds of internal data structures. For example, TransformBlock currently uses a ConcurrentQueue<T> for its target-side storage and a Queue<T> protected by some external synchronization for its source-side storage. ConcurrentQueue<T> uses a hybrid scheme in which it keeps linked lists of arrays; this amortizes the cost of allocations over multiple elements while still retaining many of the benefits one finds with both an array approach (e.g. locality of data, fewer allocations) and a linked-list approach (e.g. lock freedom, not requiring a large contiguous region of memory).
- Proposed As Answer by Stephen Toub - MSFT (Microsoft Employee, Owner), Friday, February 04, 2011 5:22 PM
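The "linked list of arrays" idea can be sketched as a simple segmented queue. This is a single-threaded illustration only: it is not thread-safe, it is not the actual ConcurrentQueue<T> implementation, and the type name `SegmentedQueue` and the segment size of 32 are assumptions chosen for the example.

```csharp
using System;

// Illustrative, single-threaded sketch of a queue stored as a linked list of fixed-size arrays.
public sealed class SegmentedQueue<T>
{
    private const int SegmentSize = 32; // hypothetical capacity per segment

    private sealed class Segment
    {
        public readonly T[] Items = new T[SegmentSize];
        public int Head;     // index of the next item to dequeue
        public int Tail;     // index of the next free slot
        public Segment Next; // following segment, or null if this is the last one
    }

    private Segment _head;
    private Segment _tail;

    public SegmentedQueue()
    {
        _head = _tail = new Segment();
    }

    public int Count { get; private set; }

    public void Enqueue(T item)
    {
        if (_tail.Tail == SegmentSize)
        {
            // One allocation per SegmentSize items rather than one node per item.
            var next = new Segment();
            _tail.Next = next;
            _tail = next;
        }
        _tail.Items[_tail.Tail++] = item;
        Count++;
    }

    public bool TryDequeue(out T item)
    {
        // Drop a fully drained segment so it can be garbage collected.
        if (_head.Head == SegmentSize && _head.Next != null)
            _head = _head.Next;

        if (_head.Head == _head.Tail)
        {
            item = default(T);
            return false;
        }

        item = _head.Items[_head.Head];
        _head.Items[_head.Head++] = default(T); // release the reference
        Count--;
        return true;
    }
}
```

Items within a segment sit next to each other in memory, and the queue never needs one large contiguous buffer or a copy-on-resize step, which is the trade-off described above. The lock-free coordination between producers and consumers that a concurrent version needs is deliberately left out of this sketch.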
-
Monday, February 07, 2011 5:39 PM
Thank you Stephen,
I'll try to make some time to create a small prototype to compare results. I'll keep you informed.

