Low latency applications

  • Question

  • How appropriate is TDF (TPL Dataflow) for low-latency applications?

    1. Basing it on TPL seems to rule out any low-latency use? Spinning up a Task via the TaskScheduler is a long operation.

    2. Is anyone considering porting LMAX's open-sourced Disruptor framework (Java) into the .NET/TDF framework? It is a VERY low-latency framework with a lot of overlap, in particular taking the message hand-off from one thread to another down to nanosecond measurements.

    From what I can see, TPL (and therefore TDF) is just too slow for these apps (e.g. signal processing, financial markets, etc.).

    Regards, Fil.
    Monday, July 4, 2011 3:23 AM

All replies

  • Hi Fil,

    We are currently porting the Disruptor to .NET (http://code.google.com/p/disruptor-net/) and we plan to look at possible integration with TPL Dataflow, and probably Rx as well.

    We should be able to start some spikes with TPL Dataflow next week.

    Olivier

    Tuesday, July 5, 2011 2:58 PM
  • Fil, just to clarify, TPL Dataflow doesn't spin up a Task per message, at least not by default.  Let's consider an ActionBlock, for example.  If you create an ActionBlock with the default parameters, when you send it a message, that message gets put into a queue.  Additionally, it checks to see whether a processing Task is currently up and running; if one already is, nothing more happens.  If one isn't, it spins one up.  Thus, as long as there's work to be processed, the same Task will be used to process all of the messages, and that Task will only go away if it finds there's no more work to be done.  So, if there's a steady-state/constant stream of work to be processed, you'll only get one Task for processing all of those messages.  The nice thing about this approach is that you can avoid busy waiting and blocking, only taking up threads and CPU resources when there's work to be done. (If you do want a Task per message, you can get that by setting MaxMessagesPerTask to 1; the default is -1, meaning that a Task should process as many messages as it can.)
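
    For illustration, here's what opting in to a Task per message might look like, using option names from the released System.Threading.Tasks.Dataflow API (the CTP surface may differ slightly):

    using System;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;

    class Demo
    {
        static void Main()
        {
            // Default options: a single Task drains the queue while messages keep arriving.
            var batched = new ActionBlock<int>(i => Console.WriteLine("batched " + i));

            // One Task per message: trades throughput for scheduler fairness.
            var perMessage = new ActionBlock<int>(
                i => Console.WriteLine("per-message " + i),
                new ExecutionDataflowBlockOptions { MaxMessagesPerTask = 1 });

            for (int n = 0; n < 10; n++) { batched.Post(n); perMessage.Post(n); }

            batched.Complete(); perMessage.Complete();
            Task.WaitAll(batched.Completion, perMessage.Completion);
        }
    }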

    As an aside, in the current CTP, the internal queue is just a ConcurrentQueue<T>, but for single-producer/single-consumer usage (the default for ActionBlock<T>), we can do much better with regard to memory allocation, synchronization, and the like.  We already have such an improved queue in our current implementation, and I think you'll notice a significant speedup as a result when these bits are released (no ETA to state publicly on when that will be).

    Tuesday, July 5, 2011 4:55 PM
  • Thanks for the great reply, Stephen.

    The core issue here is the spin-up time for the first Task. I take your point that this is not always necessary, but the fact that it can take ~10ms from A to B means TPL cannot be used for low-latency apps. I am wondering whether the Disruptor (do check it out if you have time) could hook in underneath and effectively replace the TaskScheduler.

    Now, low-latency programming is not a requirement for everyone, but it would be nice to be able to do it with Dataflow. There are compromises to be made: threads must be kept alert and available to ensure short response times, which is potentially wasteful.

    One thing to note is that you rarely have queues of any significance in low-latency programming (LLP); you really want units of work to flow through fast, as any queuing only adds to your response times. So anything other than one message per Task would be a bad thing as far as LLP goes.

    Personally I think the Disruptor has great techniques that could be integrated into TPL to really ramp up the performance. I'm really interested in what you think of it, Stephen.

    Regards, Fil.
    Tuesday, July 5, 2011 11:11 PM
  • Hi Fil-

    There are techniques employed by Disruptor that we've actively chosen not to utilize in TPL Dataflow, for the very reason that our dataflow solution has to be generally applicable, and as such cognizant of other work happening in the process and on the machine, of power utilization, and the like.  For example, the wait strategies employed by Disruptor involve either blocking threads or spinning on threads, and neither of those is really acceptable for the dataflow blocks in TPL Dataflow, as in many scenarios it would result in significant numbers of threads getting blocked, lots of wasted CPU utilization that starves other work (including but not limited to other dataflow blocks), etc.  We do use some spinning in limited cases (though only in builds more recent than the CTPs we've released thus far); for example, when a Task finds there's no more work to be done, it may now spin a little bit just in case another message arrives imminently.
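
    Purely as an illustration of that bounded-spin idea (this is not TPL Dataflow's actual internal code, and hasWork is a hypothetical stand-in for the block's queue check):

    using System;
    using System.Threading;

    static class SpinSketch
    {
        // Spin briefly in case another message arrives imminently,
        // without blocking or yielding the thread from inside the loop.
        public static bool WaitBrieflyForWork(Func<bool> hasWork)
        {
            var spinner = new SpinWait();
            while (!spinner.NextSpinWillYield)  // stop before SpinOnce would yield
            {
                if (hasWork()) return true;
                spinner.SpinOnce();
            }
            return hasWork();  // give up; let the processing Task retire
        }
    }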

    Note that there are some techniques employed in Disruptor similar to those employed in Dataflow.  For example, we have a specialized queue data structure used by ActionBlock, TransformBlock, and TransformManyBlock (this queue isn't in the CTPs as of yet; they just use ConcurrentQueue<T>).  That data structure involves very few allocations (it often reaches a steady state where there aren't any at all), is organized to minimize false sharing between producers and consumers, and entirely avoids not only locks but interlocked operations as well.  We saw significant throughput improvements as a result of applying this queue.  Hopefully you will, too, when we have another build to release.

    Note that if you wanted to, it should be relatively easy for you to create an ITargetBlock<T> facade around Disruptor, which would then allow you to use it as part of dataflow networks, to connect it to other blocks, to utilize all of the dataflow extension methods, to convert to observers, etc.  As you suggested, you could also hook in at the TaskScheduler level, implementing what would effectively be a thread pool that uses Disruptor as the producer/consumer hand-off queue of tasks between the queuers and the processing threads.
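
    For anyone who wants to try that, here's a minimal sketch of such a facade against the public dataflow interfaces (IDisruptorQueue<T> is a hypothetical stand-in for whatever publish API the Disruptor port exposes, not a real type):

    using System;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;

    // Hypothetical abstraction over the Disruptor's ring buffer.
    public interface IDisruptorQueue<T>
    {
        bool TryPublish(T item);  // hand the item to the ring buffer
        void Shutdown();          // stop accepting new items
    }

    public sealed class DisruptorTargetBlock<T> : ITargetBlock<T>
    {
        private readonly IDisruptorQueue<T> _queue;
        private readonly TaskCompletionSource<bool> _completion =
            new TaskCompletionSource<bool>();

        public DisruptorTargetBlock(IDisruptorQueue<T> queue) { _queue = queue; }

        public DataflowMessageStatus OfferMessage(
            DataflowMessageHeader messageHeader, T messageValue,
            ISourceBlock<T> source, bool consumeToAccept)
        {
            if (!messageHeader.IsValid) throw new ArgumentException("messageHeader");

            // Two-phase offers require claiming the message from the source first.
            if (consumeToAccept)
            {
                bool consumed;
                messageValue = source.ConsumeMessage(messageHeader, this, out consumed);
                if (!consumed) return DataflowMessageStatus.NotAvailable;
            }

            return _queue.TryPublish(messageValue)
                ? DataflowMessageStatus.Accepted
                : DataflowMessageStatus.Declined;
        }

        public Task Completion { get { return _completion.Task; } }

        public void Complete()
        {
            _queue.Shutdown();
            _completion.TrySetResult(true);
        }

        public void Fault(Exception exception)
        {
            _queue.Shutdown();
            _completion.TrySetException(exception);
        }
    }

    Once wrapped this way, the block can be the target of LinkTo from any ISourceBlock<T> and used with the standard Post/SendAsync extension methods.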

    Regarding the time it takes to spin up a Task, where are you getting your 10ms number from?  That's much, much, much higher than any timings I've seen.  For example, here's a simple test that spins up a task, which spins up another task, which spins up another task, and so on indefinitely, repeatedly timing how long it takes to process a set number of tasks:

    using System;
    using System.Diagnostics;
    using System.Threading.Tasks;

    class Program
    {
        static void Main()
        {
            int counter = 0;
            var sw = new Stopwatch();

            // Each task schedules the next, so the chain continuously measures
            // per-task overhead; print the elapsed time every 1,000,000 tasks.
            Action body = null;
            body = () =>
            {
                if (++counter % 1000000 == 0)
                {
                    Console.WriteLine(sw.ElapsedMilliseconds);
                    sw.Restart();
                }
                Task.Factory.StartNew(body);
            };
            sw.Start();
            Task.Factory.StartNew(body);
            Console.ReadLine();
        }
    }

    With .NET 4 and Windows 7, on my wimpy 1.6GHz laptop (so very unofficial numbers ;), this consistently outputs that it takes ~1.5 seconds to process 1,000,000 tasks; in other words, ~1.5 microseconds per task, orders of magnitude below what you're quoting.  And as mentioned, even this cost is only paid when we do spin up a task, which we end up reusing by default as long as there's at least one message to be processed in the queue (and, as mentioned, the tasks may spin for a little bit even if there aren't messages in the queue).  Further, for .NET vNext we've worked hard to decrease these costs even further.

    Thanks.


    Wednesday, July 6, 2011 4:01 PM
  • Thanks for the detailed reply, Stephen.

    Yes, you are right: Disruptor is a lifestyle choice; you sacrifice everything else to achieve low latency. What I hear you saying is that this is a specialized application. Fair enough.

    The 10ms is the worst-case figure, since in low-latency/real-time work you are always judged by the worst case, not the average. This high figure shows up on the first execution, and on subsequent executions after significant pauses in event flow.

    Note your figures here indicate 1.5 microseconds, or 1,500 nanoseconds.


    Regards, Fil.
    Wednesday, July 6, 2011 10:17 PM
  • Thanks, Fil. 

    Regarding nanoseconds, yeah, I'd meant microseconds, and that was just a typo when writing out my reply quickly; I've fixed it above.  Thanks for pointing out the error.

    Regarding the worst-case figure, can you send along a repro of the case where it takes 10ms for a task to run after significant pauses?  It'd be helpful to see the exact case you're concerned about.

    Thursday, July 7, 2011 8:25 AM
  • Hi Stephen,

    I perfectly understand your point of view and why you chose not to use those techniques.

    Some financial applications have very specific requirements: low latency and, maybe even more importantly, predictable latency. To be more explicit, latency requirements can vary from single-digit milliseconds (socket to socket) down to double- or even single-digit microseconds, under heavy load (up to 100,000s of messages per second to de-serialize, process, and forward).

    This is not easy to achieve with a managed runtime:
     - the GC adds a lot of jitter, and you have no choice: zero allocation on fast-path code (object pooling required, all long-lived objects pre-allocated at startup; see the sketch after this list),
     - any source of contention leads to non-predictable latency,
     - lack of control due to several levels of abstraction (hard to predict memory layout, hardware instructions, etc.),
     - lack of tools to profile and diagnose issues at this level of latency.
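
    As an illustration of the zero-allocation point, here's a minimal sketch of the pooling pattern (a real pool at this latency class would pre-size a ring buffer rather than use ConcurrentBag<T>, which can itself allocate internally):

    using System.Collections.Concurrent;

    // Pre-allocate message objects at startup; rent and return them
    // instead of new-ing on the fast path.
    public sealed class MessagePool<T> where T : class, new()
    {
        private readonly ConcurrentBag<T> _items = new ConcurrentBag<T>();

        public MessagePool(int size)
        {
            for (int i = 0; i < size; i++) _items.Add(new T());  // all allocation up front
        }

        public T Rent()
        {
            T item;
            return _items.TryTake(out item) ? item : new T();  // fallback allocates, off the fast path
        }

        public void Return(T item)
        {
            _items.Add(item);
        }
    }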

    Have a look at this video; it gives a very good overview of the problem and how they tackle it: www.infoq.com/presentations/LMAX (How to Do 100K TPS at Less than 1ms Latency).
    Or maybe their technical paper: http://disruptor.googlecode.com/files/Disruptor-1.0.pdf

    I understand TPL Dataflow is still under development, but I would be interested to know whether you are designing it with this type of requirement in mind or whether it is off your radar ;-)

    Olivier

    Thursday, July 7, 2011 8:58 AM
  • Hi Stephen, here's one I prepared earlier. It focuses on the worst performance in each second (call me a pessimist :).

    using System;
    using System.Diagnostics;
    using System.Threading.Tasks;

    class Program
    {
      static void Main()
      {
        var last = 0;
        var max = 0.0;  // worst task start-up latency observed this second
        while (true)
        {
          var sw = Stopwatch.StartNew();
          TaskEx.Run(() =>  // TaskEx is from the Async CTP; Task.Run in .NET 4.5+
          {
            sw.Stop();
            max = Math.Max(max, sw.Elapsed.TotalMilliseconds);
          }).Wait();

          // once per second, report and reset the worst case
          var seconds = DateTime.Now.Second;
          if (seconds != last)
          {
            last = seconds;
            Trace.WriteLine(string.Format("{0}", max));
            max = 0.0;
          }
        }
      }
    }
    
    Sample output, release build on my laptop (max per second, in milliseconds):
    5.339
    0.010
    0.041
    0.003
    0.452
    4.239
    0.003
    0.010
    5.530
    0.009
    0.009
    0.008
    0.020
    0.009

    Regards, Fil.
    Thursday, July 7, 2011 12:08 PM
  • In real-time systems you generally put a brake on the GC until you have a known downtime. But let's face it: we're only talking pseudo real-time (predictable) systems here.
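
    For example, one way to apply that brake in .NET 4 (GCLatencyMode.LowLatency discourages blocking gen-2 collections while it is in effect):

    using System;
    using System.Runtime;

    static void RunCriticalBurst(Action burst)
    {
        // Suppress full blocking collections during the critical window,
        // then restore the mode and collect during the known downtime.
        GCLatencyMode previous = GCSettings.LatencyMode;
        GCSettings.LatencyMode = GCLatencyMode.LowLatency;
        try
        {
            burst();  // latency-critical work with the GC brake on
        }
        finally
        {
            GCSettings.LatencyMode = previous;
            GC.Collect();  // catch up while nothing latency-sensitive is running
        }
    }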

    With a 'real' real-time system you are not using Windows, Linux, or any operating system for that matter; you're down to FPGAs and other fancy hardware.

    Regards, Fil.
    Thursday, July 7, 2011 12:10 PM