Answered Performance of CTP and TPL Dataflow compared to CCR

Answers

  • Saturday, November 06, 2010 12:16 AM
    Owner
     
     Answered

    Hi MonkeyBall-

    Our goal for releasing TPL Dataflow as part of the Async CTP was primarily to get feedback on the approach and APIs, rather than performance.  We still have a good deal of performance work left to do, and there's still a fair number of places in the codebase that we can optimize, both in terms of speed and memory allocation.  That said, to directly answer your question, we do run comparisons between them in our testing; after all, we've worked closely with the Robotics team to ensure that TPL Dataflow can serve to address for .NET Framework customers the same scenarios as the CCR has in the past.  There are currently workloads where TPL Dataflow performance will exceed that of the CCR, and there are certain workloads where the inverse is true.  Part of this is due to where we are in the development cycle, and part of it is due to the scenarios we've optimized the design of TPL Dataflow for.  As we release subsequent CTPs more geared towards improving performance, we'll certainly be looking for feedback along those lines... for now, both for the System.Threading.Tasks.Dataflow.dll and for the language support for asynchrony, this release was really about directional and functional feedback rather than perf.  That said, if you have particular scenarios that you want to see really scream in terms of performance, please do let us know about them, so that we can factor those into our performance plans and goals.

    Thanks for your interest.

All Replies

  • Tuesday, February 01, 2011 6:51 PM
     
     

    Hi Stephen, 

    This library could be very interesting for us, but as you mentioned, it is not currently as performant as it should be. We currently use the CCR extensively in our main product to run multiple video pipelines in parallel, each composed mainly of different agents (stream reception, reassembling, reordering, decoding and rendering, along with other special agents specific to our product). This lets us spread the work across cores very effectively and get very good performance.  I would be very interested in testing this framework once its performance is deemed as good as what the CCR can offer.

    We can have up to 64 different pipelines running concurrently, each composed of the agents I mentioned before. This is our main scenario, and it is where the performance has to be found.  Is this the kind of scenario TPL Dataflow would be good for? You mention that in some scenarios TPL Dataflow is better than the CCR; I'd be interested to know whether that would be the case for our scenario before considering a prototype to test it on our side.  As you can guess, it is quite a complex design, and before we spend time testing TPL Dataflow we need some hints about the performance we could achieve with it.

    Thanks,

    Luc Ferron

  • Friday, February 04, 2011 3:58 PM
     
     
    Many moons ago, I was way into the CCR and did a community version called PCR on CodePlex. Anyway, my version used linked lists for the internal queues instead of the normal .NET Queue class, which resizes a backing array (a lot of overhead if done often). It was something like 50% faster or more IIRC, even with the extra linked-list node allocations, which was a surprise to me. Sometime after that, George changed the CCR Port to also use a linked list, IIRC.  So I wonder whether TPL Dataflow uses linked lists or Queue<T>/List<T> for its internals (i.e. ConcurrentQueue<T>), or whether any testing has been done on that yet. I have not looked. Maybe there is still some low-hanging fruit to be had.
  • Friday, February 04, 2011 5:22 PM
    Owner
     
     Proposed Answer

    There is definitely perf work we're still doing (including some that'll influence the interfaces a bit), and I'm hopeful that the next preview release will show some nice speedups over what's already there (some of our internal benchmarks have already improved by > 30% since the latest preview release). 

    Luc, to your question, your scenario of pipelines maps very well into what TPL Dataflow is designed for, where you have a lot of data flowing from block to block to block; I'd suggest creating your prototype and seeing how it fares, knowing that it will very likely improve in future releases. 
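
    For readers who want a starting point for such a prototype, here is a minimal sketch of one pipeline built from linked blocks. It assumes the API of the currently shipping System.Threading.Tasks.Dataflow package (for example DataflowLinkOptions), which may differ from the CTP discussed in this thread. The class name, the stage names, and the string-based "frames" are hypothetical stand-ins for the real agents, not anything from the original posts.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical sketch: one pipeline of three stages, data flowing block to block.
public static class VideoPipelineSketch
{
    public static async Task<List<string>> RunAsync(int pipelineId, IEnumerable<int> packets)
    {
        var rendered = new List<string>();

        // Propagate completion so finishing the first block drains the whole chain.
        var link = new DataflowLinkOptions { PropagateCompletion = true };

        // Stand-ins for the reassembling, decoding and rendering agents.
        var reassemble = new TransformBlock<int, string>(p => $"frame-{p}");
        var decode = new TransformBlock<string, string>(f => f + "-decoded");

        // ActionBlock processes one message at a time by default,
        // so appending to the list here does not need a lock.
        var render = new ActionBlock<string>(f => rendered.Add($"[{pipelineId}] {f}"));

        reassemble.LinkTo(decode, link);
        decode.LinkTo(render, link);

        foreach (var packet in packets)
        {
            reassemble.Post(packet);
        }

        reassemble.Complete();      // no more input for this pipeline
        await render.Completion;    // wait until the last stage has drained
        return rendered;
    }
}
```

    Running many pipelines concurrently could then be a matter of starting one RunAsync call per pipeline and awaiting them together with Task.WhenAll; whether that performs acceptably for 64 video pipelines is exactly what a prototype would need to measure.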

    William, to your question, depending on the block, we use various kinds of internal data structures... for example, TransformBlock currently uses a ConcurrentQueue<T> for its target-side storage and a Queue<T> protected by some external synchronization for its source-side storage.  ConcurrentQueue<T> uses a hybrid scheme, where it has linked lists of arrays; this amortizes the cost of allocations over multiple elements while still retaining many of the benefits one finds with both an array approach (e.g. locality of data, fewer allocations) and a linked list approach (e.g. lock freedom, not requiring a large contiguous region of memory).
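
    To make the "linked list of arrays" idea concrete, here is a deliberately simplified sketch. It is single-threaded, not lock-free, and is not how ConcurrentQueue<T> is actually implemented; the class name SegmentedQueue and the segment size of 32 are assumptions chosen purely for illustration. It only shows how one allocation can be amortized over many elements while the queue still grows without copying into a larger contiguous array.

```csharp
// Illustration only: NOT thread-safe and NOT the framework's implementation.
public sealed class SegmentedQueue<T>
{
    private const int SegmentSize = 32; // assumed size, for illustration

    private sealed class Segment
    {
        public readonly T[] Items = new T[SegmentSize];
        public int Head;      // next index to dequeue from
        public int Tail;      // next index to enqueue into
        public Segment Next;  // link to the following segment, if any
    }

    private Segment _head;
    private Segment _tail;

    public SegmentedQueue()
    {
        _head = _tail = new Segment();
    }

    public int Count { get; private set; }

    public void Enqueue(T item)
    {
        if (_tail.Tail == SegmentSize)
        {
            // One allocation per SegmentSize items, rather than one node per item.
            var next = new Segment();
            _tail.Next = next;
            _tail = next;
        }

        _tail.Items[_tail.Tail++] = item;
        Count++;
    }

    public bool TryDequeue(out T item)
    {
        // A fully drained segment is unlinked and left for the garbage collector.
        if (_head.Head == _head.Tail && _head.Next != null)
        {
            _head = _head.Next;
        }

        if (_head.Head == _head.Tail)
        {
            item = default(T);
            return false;
        }

        item = _head.Items[_head.Head];
        _head.Items[_head.Head++] = default(T); // release the reference
        Count--;
        return true;
    }
}
```

    Items within a segment sit next to each other in memory, which is where the locality benefit comes from, and growth never requires copying existing items into a larger array.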

  • Monday, February 07, 2011 5:39 PM
     
     

    Thank you Stephen,

    I'll try to make some time to create a small prototype to compare results. I'll keep you informed.