none
BUG: Locking threads from threadpool leading to deadlock on latest release

    Question

  • OK, this was another lost morning :)

    Migrating Rx from 2.0.20823 to 2.0.21030 has killed my software dead.  Now, when running, the ThreadPool quickly fills up with deadlocked threads all with similar callstacks:

    >	mscorlib.dll!System.Threading.SpinWait.SpinOnce() + 0x6a bytes	
     	mscorlib.dll!System.Threading.SpinWait.SpinUntil(System.Func<bool> condition, int millisecondsTimeout) + 0x62 bytes	
     	mscorlib.dll!System.Threading.SpinWait.SpinUntil(System.Func<bool> condition) + 0x8 bytes	
     	System.Reactive.PlatformServices.dll!System.Reactive.Concurrency.ConcurrencyAbstractionLayerImpl.Timer.Tick(object state) + 0x6f bytes	
     	mscorlib.dll!System.Threading.TimerQueueTimer.CallCallbackInContext(object state) + 0x2c bytes	
     	mscorlib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state, bool preserveSyncCtx) + 0xa7 bytes	
     	mscorlib.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state, bool preserveSyncCtx) + 0x16 bytes	
     	mscorlib.dll!System.Threading.TimerQueueTimer.CallCallback() + 0x71 bytes	
     	mscorlib.dll!System.Threading.TimerQueueTimer.Fire() + 0xc0 bytes	
     	mscorlib.dll!System.Threading.TimerQueue.FireQueuedTimerCompletion(object state) + 0x25 bytes	
     	mscorlib.dll!System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem() + 0x34 bytes	
     	mscorlib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch() + 0x149 bytes	
     	mscorlib.dll!System.Threading._ThreadPoolWaitCallback.PerformWaitCallback() + 0x5 bytes	
     	[Native to Managed Transition]	
     	[Appdomain Transition]	
     	[Native to Managed Transition]	

    Here is a callstack where we first see deadlock threads getting added (this took me ages to get an example), for brevity I've excluded the bit before and after, and focussed on the the calls through Rx.

     	System.Reactive.Linq.dll!System.Reactive.Linq.QueryLanguage.Create<object[]>.AnonymousMethod__1e2(System.IObserver<object[]> observer) + 0x91 bytes	
     	System.Reactive.Core.dll!System.Reactive.AnonymousObservable<object[]>.SubscribeCore(System.IObserver<object[]> observer) + 0x33 bytes	
     	System.Reactive.Core.dll!System.Reactive.ObservableBase<object[]>.Subscribe(System.IObserver<object[]> observer) + 0x162 bytes	
     	System.Reactive.Core.dll!System.ObservableExtensions.SubscribeSafe<object[]>(System.IObservable<object[]> source, System.IObserver<object[]> observer) + 0x11c bytes	
     	System.Reactive.Linq.dll!System.Reactive.Linq.Observαble.Select<object[],WebApplications.Traveller.Entities.Security.ISession>.Run(System.IObserver<WebApplications.Traveller.Entities.Security.ISession> observer, System.IDisposable cancel, System.Action<System.IDisposable> setSink) + 0xe6 bytes	
     	System.Reactive.Core.dll!System.Reactive.Producer<WebApplications.Traveller.Entities.Security.ISession>.SubscribeRaw(System.IObserver<WebApplications.Traveller.Entities.Security.ISession> observer, bool enableSafeguard) + 0x294 bytes	
     	System.Reactive.Core.dll!System.ObservableExtensions.SubscribeSafe<WebApplications.Traveller.Entities.Security.ISession>(System.IObservable<WebApplications.Traveller.Entities.Security.ISession> source, System.IObserver<WebApplications.Traveller.Entities.Security.ISession> observer) + 0x186 bytes	
     	System.Reactive.Linq.dll!System.Reactive.Linq.Observαble.FirstAsync<WebApplications.Traveller.Entities.Security.ISession>.Run(System.IObserver<WebApplications.Traveller.Entities.Security.ISession> observer, System.IDisposable cancel, System.Action<System.IDisposable> setSink) + 0x178 bytes	
     	System.Reactive.Core.dll!System.Reactive.Producer<WebApplications.Traveller.Entities.Security.ISession>.SubscribeRaw(System.IObserver<WebApplications.Traveller.Entities.Security.ISession> observer, bool enableSafeguard) + 0x294 bytes	
     	System.Reactive.Core.dll!System.ObservableExtensions.SubscribeSafe<WebApplications.Traveller.Entities.Security.ISession>(System.IObservable<WebApplications.Traveller.Entities.Security.ISession> source, System.IObserver<WebApplications.Traveller.Entities.Security.ISession> observer) + 0x186 bytes	
     	System.Reactive.Linq.dll!System.Reactive.Linq.Observαble.Catch<WebApplications.Traveller.Entities.Security.ISession,System.Exception>._.Run() + 0xc9 bytes	
     	System.Reactive.Linq.dll!System.Reactive.Linq.Observαble.Catch<WebApplications.Traveller.Entities.Security.ISession,System.Exception>.Run(System.IObserver<WebApplications.Traveller.Entities.Security.ISession> observer, System.IDisposable cancel, System.Action<System.IDisposable> setSink) + 0x97 bytes	
     	System.Reactive.Core.dll!System.Reactive.Producer<WebApplications.Traveller.Entities.Security.ISession>.SubscribeRaw(System.IObserver<WebApplications.Traveller.Entities.Security.ISession> observer, bool enableSafeguard) + 0x294 bytes	
     	System.Reactive.Core.dll!System.ObservableExtensions.SubscribeSafe<WebApplications.Traveller.Entities.Security.ISession>(System.IObservable<WebApplications.Traveller.Entities.Security.ISession> source, System.IObserver<WebApplications.Traveller.Entities.Security.ISession> observer) + 0x186 bytes	
     	System.Reactive.Linq.dll!System.Reactive.Linq.Observαble.Using<WebApplications.Traveller.Entities.Security.ISession,WebApplications.Utilities.Logging.Performance.PerformanceTimer.Timer>._.Run() + 0x21e bytes	
     	System.Reactive.Linq.dll!System.Reactive.Linq.Observαble.Using<WebApplications.Traveller.Entities.Security.ISession,WebApplications.Utilities.Logging.Performance.PerformanceTimer.Timer>.Run(System.IObserver<WebApplications.Traveller.Entities.Security.ISession> observer, System.IDisposable cancel, System.Action<System.IDisposable> setSink) + 0x97 bytes	
     	System.Reactive.Core.dll!System.Reactive.Producer<WebApplications.Traveller.Entities.Security.ISession>.Run(System.Reactive.Concurrency.IScheduler _, System.Reactive.Producer<WebApplications.Traveller.Entities.Security.ISession>.State x) + 0xf8 bytes	
     	System.Reactive.Core.dll!System.Reactive.Concurrency.ScheduledItem<System.TimeSpan,System.Reactive.Producer<WebApplications.Traveller.Entities.Security.ISession>.State>.InvokeCore() + 0x50 bytes	
     	System.Reactive.Core.dll!System.Reactive.Concurrency.ScheduledItem<System.TimeSpan>.Invoke() + 0x49 bytes	
     	System.Reactive.Core.dll!System.Reactive.Concurrency.CurrentThreadScheduler.Trampoline.Run(System.Reactive.Concurrency.SchedulerQueue<System.TimeSpan> queue) + 0x10e bytes	
     	System.Reactive.Core.dll!System.Reactive.Concurrency.CurrentThreadScheduler.Schedule<System.Reactive.Producer<WebApplications.Traveller.Entities.Security.ISession>.State>(System.Reactive.Producer<WebApplications.Traveller.Entities.Security.ISession>.State state, System.TimeSpan dueTime, System.Func<System.Reactive.Concurrency.IScheduler,System.Reactive.Producer<WebApplications.Traveller.Entities.Security.ISession>.State,System.IDisposable> action) + 0x175 bytes	
     	System.Reactive.Core.dll!System.Reactive.Concurrency.LocalScheduler.Schedule<System.Reactive.Producer<WebApplications.Traveller.Entities.Security.ISession>.State>(System.Reactive.Producer<WebApplications.Traveller.Entities.Security.ISession>.State state, System.Func<System.Reactive.Concurrency.IScheduler,System.Reactive.Producer<WebApplications.Traveller.Entities.Security.ISession>.State,System.IDisposable> action) + 0xc0 bytes	
     	System.Reactive.Core.dll!System.Reactive.Producer<WebApplications.Traveller.Entities.Security.ISession>.SubscribeRaw(System.IObserver<WebApplications.Traveller.Entities.Security.ISession> observer, bool enableSafeguard) + 0x1bb bytes	
     	System.Reactive.Core.dll!System.Reactive.Producer<WebApplications.Traveller.Entities.Security.ISession>.Subscribe(System.IObserver<WebApplications.Traveller.Entities.Security.ISession> observer) + 0x6a bytes	
     	System.Reactive.Linq.dll!System.Reactive.Threading.Tasks.TaskObservableExtensions.ToTask<WebApplications.Traveller.Entities.Security.ISession>(System.IObservable<WebApplications.Traveller.Entities.Security.ISession> observable, System.Threading.CancellationToken cancellationToken, object state) + 0x301 bytes	
    

    Again, without access to release notes it's hard for me to guess what change you've made has led to this breaking change, however, I have narrowed down the breaking change to the Rx update (the only difference between broken and not broken is this update).

    Hopefully the stack traces help.


    thargy

    Craig Dean (MCPD)
    Chief Executive
    www.webappuk.com

    Please consider marking this as the answer if you have been genuinely helped!

    Friday, November 02, 2012 12:23 PM

Answers

  • First of all, thanks a bunch for taking the time to report this issue and to write up this lengthy discussion. We really do appreciate this. Our apologies for not having used the voting feature yet; this is corrected now :-).

    Anyway, back to the issue. The story goes back a long time, to the pre v1.0 days of Rx running on a wide variety of platforms, including .NET 3.5 and Windows Phone 7's version of .NET based on the Compact Framework. Back then, one of our uses of the System.Threading.Timer class was flawed, because the timer didn't get rooted properly, causing it to be GC'ed prematurely. While this was a well-known issue, it slipped through the cracks in a pre-release of v1.0 and we corrected this issue by adding rooting logic based on a Hashtable in a static field. Obviously, this brought its own slew of complexity in that the access to this collection has to be thread-safe and the rooted timers have to get removed from the collection eventually, in a reliable fashion to prevent possible memory leaks.

    During the early days of the Rx v2.0 work, we analyzed various performance traces for different uses of Rx in a variety of our own assets. For cases where a lot of ad-hoc temporal queries were hosted in the same process (think of a service monitoring product, not to be revealed by name right now), the lock for this collection started to show up. As such, we broke it up into separate locks for separate timer queues in Rx v2.0 RTM: each CAL had its own; the ThreadPoolScheduler (which is also used by TaskPoolScheduler for time-based operations) also had its own; and there was further distinction between rooting collections for relative timer versus absolute timer uses. This improved the situation a bit (especially due to the separation for absolute versus relative, making different use cases not have to wait for the same lock), which was welcome for monitoring scenarios where we want to have as little impact as possible (otherwise you end up monitoring yourself).

    However, somewhere during the v1.x days, the reason to root the timers became obsolete in a rather unexpected way... I don't recall the exact moment, given that the IScheduler interface underwent a lot of changes back then, and the disposal needs changed with it. Anyway, here's the essence. In order to allow for stopping time-based operations, we've always handed out an IDisposable from the scheduler layer, essentially wrapping the Timer object. This by itself is not enough to prevent the timer from getting collected of course: after all, no-one has to keep IDisposable objects in Rx alive to prevent things from stopping (which would be the - arguably odd - Timer behavior all over again). However, as this IDisposable also had code in it to remove the timer - under the lock, as discussed before - from the rooting collection (to prevent leaks again - remember the root collection is in a static field), it got reused within the Tick handler to achieve the same unrooting effect upon expiration of the timer. Now we have an interesting object graph. The Timer holds on to the Tick handler delegate, which was written as a lambda closing over outer scope variables, one of which was the disposable (to unroot in the callback in a finally block), which by itself held on to the Timer object in order to remove it from the rooting collection. To run the timer, the delegate gets added to a timer queue in the BCL, and now we're effectively keeping the Timer object alive by the mere fact that its Tick handler has a reference to it! So, did we need the rooting after all? Well, if we remove it, the effect of rooting the Timer through its Tick handler's reference-to-self goes away, putting us back at square one.

    In fact, the question whether to root or not didn't come up again in the v1.x days: things worked just great with the rooting trick. The insight of the self-rooting behavior only came later during the v2.0 performance investigation work. In fact, through our sync-ups with the async/await feature team, we saw the introduction of the Task.Delay method which also uses a Timer and required similar rooting tricks. However, at that very time we learned about the Timer behavior (at least in .NET 4.0 and beyond) with regards to self-rooting. It turns out that one of the constructor overloads captures the constructed Timer object ("this" deep inside the Timer code) as the state parameter, which ends up alongside the Tick callback in a queue, effectively rooting the timer. Other overloads with custom state don't have self-rooting behavior, causing potential premature GC. Recalling the hot locks (not to mention the cost of the rooting collection hash table, linear in the number of running timers, which in monitoring scenarios can be rather large), it seemed that leveraging this self-rooting behavior was something to investigate further. We didn't get to this until after the initial v2.0 release, also awaiting confirmation of Timer behavior from the team that was just wrapping up its 4.5 release (the beauty of sim-shipping is that everyone is busy at the same time!).

    So, in order to make the self-rooting behavior work, we had to make sure the Timer held onto itself either through the "this-as-state-behavior" constructor or the Tick handler delegate's instance having a reference to the Timer. Having a state parameter passed to scheduler and CAL methods, the former approach seems less appealing given that Timer already has a slot to share state. Luckily, we need a way to dispose the timer when the upper layers of Rx request a cancellation, so we need to store it in a field anyway. And the field would be on the object that also has the Tick handler method, so the second approach works. (Notice all of this was only done for .NET 4.0 and beyond where this self-rooting behavior was confirmed and also recorded as something that could never be changed in the future due to the breaking change behavior it'd have, causing timers to be collected prematurely. For earlier platforms, where the concerns of this stress test don't hold, e.g. WP7 devices, we kept things as-is.) An experiment showing this behavior is shown below:

    var t = default(Timer);
    t = new Timer(_ =>
    {
        t = null; // Comment out this line to see the timer stop
        Console.WriteLine("Hello!");
    }, null, 5000, Timeout.Infinite);

    Trigger a GC when this timer is active to see whether it eventually fires or not. Depending on the commented line, behavior will be different. You can use the SOS !gcroot command to reveal how the timer is rooted. Below is an example output on a .NET 4.5 system (when the line isn't commented out):

    !gcroot 02492440
     HandleTable:
         005a13fc (pinned handle)
         -> 03491010 System.Object[]
         -> 024924dc System.Threading.TimerQueue
         -> 02492450 System.Threading.TimerQueueTimer
         -> 02492420 System.Threading.TimerCallback
         -> 02492414 Program+<>c__DisplayClass1
         -> 02492440 System.Threading.Timer

    Either way, we now have a wrapper class that contains a Timer field which can be used to stop the timer early and has the useful side-effect of keeping the timer rooted through the callback delegate. Now there's one more requirement, namely to make this wrapper clean up after itself when the timer expires. After all, the instance may be held onto from the layers above to allow for cancellation, but the layer above won't remove the instance from whatever collections (e.g. a CompositeDisposable for an N-ary operator that uses time-based sequences) it has upon the timer elapsing. As such, the holders can be long-lived and we don't want to hold on to potentially large pieces of state. One such thing is the supplied action delegate itself of course, which may hold on to a closure. Notice that closures in C# can cause space leaks because each anonymous method in the same enclosing method will share the same display class, even if different anonymous methods have distinct sets of variables they capture - letting go of the delegate if it's no longer needed is a good thing. To see this, compile the following code and analyze it in ILDASM (what happens to the closure if "a" holding on to 100MB is no longer reachable, but "d" still is?):

    var b = new byte[1024 * 1024 * 100];
    var c = "";
    var a = new Action(() =>
    {
        Console.WriteLine(b.Length);
    });
    var d = new Action(() =>
    {
        Console.WriteLine(c);
    });

    In the same go, we can also drop the timer field as it's no longer needed; IIRC, the current structure of the Timer in the BCL is to contain an intermediate holder object, which holds on to a timer queue object, which is a linked list entry with a bunch of state alongside of it. This can add up if a lot of expired timers are artificially kept alive at a higher level.

    So, the natural thing is to null out the timer field and immunize the action delegate if it's no longer needed (in case of cancellation, if the Tick fires, we don't want it to null-ref, so we assign a shared "Nop" stub instance that's used in various places). However, if the Timer fires during construction and before assignment (the constructing thread may be swapped out before the stfld instruction), the Tick handler would clean up the timer field that's not yet set, so ultimately the timer will still end up in the field when the .ctor thread continues running. Now its lifetime is tied to the holder instance which may be long-lived for various classes of standing queries (as discussed before, the IDisposable holder may be held onto by a disposable subscription object graph). At this point, we can argue whether this case is rare enough to warrant a "don't care" approach. And yes, likely it is. But no, we made the other call upon rewriting this code and introduced the spinning logic. The one oversight is what you found, namely that the Dispose method can be called during the Tick handler's execution, causing it to face an everlasting null reference for the field it so badly wants to clean up...

    As for the thread abort case, when the thread running the holder object's constructor is running up to the point of creating a new Timer instance and then gets aborted, the assignment of the Timer instance to the field will never take place. However, the Tick handler delegate already has captured the "this" reference, hence it can eventually run on top of the holder instance whose timer field never got assigned, causing it to spin indefinitely. The following code fragment shows how the failed-to-construct instance is usable by the timer, nonetheless:

    class Program
    {
        static void Main()
        {
            try
            {
                new Bar(true);
            }
            catch { }
            Console.ReadLine();
        }
    }
    class Bar
    {
        private Timer timer;
        public Bar(bool b)
        {
            var t = new Timer(Tick, null, 1000, Timeout.Infinite);
            if (b)
                throw new Exception();
            timer = t;
        }
        private void Tick(object _)
        {
            Console.WriteLine(timer == null);
        }
    }

    You're absolutely right that the use of Change seems more appropriate here in order to ensure prior field assignment, rather than doing the funny dance with the finally handler to ensure assignment in this pathological situation (notice we don't do anything special for the case where the Timer constructor throws; in such a case, the Tick handler should not be called, so there's no need to ensure field assignment). We'll double-check the Change behavior to ensure the desired rooting behavior is maintained (I can't see right now how it would not be, but these things have been proven tricky in the past) and may revisit this code in the future to simplify it, as deemed appropriate. Task.Delay ended up using an internal hook to achieve the rooting effect but doesn't use Change. We mirrored the Rx code after it, combined with self-rooting through a field, because we can't access the internal hook from outside the BCL. Then the thread abort situation was brought up and for some reason Change wasn't (re)considered. (I'll check whether I'm missing something here when I get back to the office later this week.)

    Once more, we want to extend our apologies for the inconvenience this has caused and appreciate your ongoing participation here.

    • Marked as answer by thargy Monday, November 05, 2012 11:32 AM
    Monday, November 05, 2012 10:15 AM
    Owner
  • Firstly, apologies to anyone following along, I'm not posting code because we're a Gold partner and I don't want any Microsoft lawyers knocking at the door.  To my knowledge they haven't chosen to OS this project, though I strongly believe this poor sibling of TPL would strongly benefit from a bit more oversight by the community at large.  Any insight I may be eluding to in this thread is through legitimate fair use of the provided software for the purpose of debugging production code.

    Unfortunately, I disagree.  Empty try blocks are uncommon, they do not appear commonly, they may not be entirely absent from the framework (they probably should be), but they are thankfully uncommon, because they're almost always a bad idea.  Like any good developer I know that the finally block will still run during Abort (assuming the CLR is running with the default configuration, there are a number of cases when finallys don't run), however great care should be used with that bit of knowledge.

    Let's take an example:

    Timer timer;
    try
    {
        Thread.CurrentThread.Abort();
    }
    finally
    {
        timer = new Timer(state => state.Dump(), "Do a tick", TimeSpan.FromSeconds(1), TimeSpan.FromMilliseconds(-1.0));
    }

    Here we're trying to ensure that if the Timer is created it will always be assigned to 'timer' even though the thread is aborting.  (I've put an Abort in the try to simulate an inconvenient abort, in truth you only care if if happens between construction and assignment).

    If this were part of the constructor of an IDisposable object you wouldn't want to lose the timer reference as you would be unable to dispose it and it would become orphaned...  

    Now the first question is to ask, is "what would happen if the assignment failed and the timer was orphaned - what are you trying to prevent?"

    Well if you subsequently tried to dispose your constructed object, it wouldn't be able to kill the timer as it wouldn't have a reference to the timer, so keeping a reference to the timer is critical.

    Of course a better question is, "if the thread aborts during the constructor what will try to dispose the class?"

    The thread was aborted in the constructor, therefore the constructor assignment in the callee would never run so no one would have reference to the your object to dispose it.

    Is this a memory leak?  Well the failed constructor means that your object is fine, however the timer still exists, and it is now orphaned, but this is managed code so this isn't a leak, it will get garbage collected (if memory pressure requires), also it's guaranteed to be first generation (only just created) so it's highly likely to get GC'd.  It may be GC'd quickly, in which case it will be disposed and timer will never fire, it may hang around for a long time and the timer may fire.  This is unpleasant, and probably what you were trying to prevent, but the finally hack doesn't help because it produces exactly the same result, in that the containing object may now have a reference to the Timer but there will be no reference to the containing object, so the Timer is still at the whim of the GC.

    However, all of this is moot.  If you were genuinely worried about this, the BCL team allows you to construct the timer first and then call Change (MSDN link included for your reference).  In this scenario you construct, assign and then change, problem solved, no dirty hacks and no empty try blocks.  The timer cannot fire until after it has been assigned.  The corresponding clean up of the knot tying that exists later in the class to try and deal with disposal when the timer is not assigned is also then easily resolved.

    To be fair I'm writing this without access to VS from memory, so maybe I'm missing something, I can't for example measure the performance difference between try...finally single constructor and constructor/change without finally.  Logically both have an extra stack transition, I doubt the difference is significant, however performance is an engineering problem so you'd have to measure it.

    I agree there is a lot more complexity and subtlety in the class (a sure fire sign someone has missed the point).

    But regardless of who's right or wrong (I'd really like to know the exact scenario you thought the finally was curing), this is still a really bad bug in that it will quickly deadlock an application, as such you should really pull the NuGet, or remove the SpinLock.  The bug you've introduced is far worse than the one you thought you were trying to solve.


    thargy

    Craig Dean (MCPD)
    Chief Executive
    www.webappuk.com

    Please consider marking this as the answer if you have been genuinely helped!


    • Edited by thargy Sunday, November 04, 2012 9:52 AM Removed rant
    • Marked as answer by thargy Sunday, November 04, 2012 9:54 AM
    Sunday, November 04, 2012 9:50 AM

All replies

  • OK,

    Found your bug, which would be a lot easier if you guys went Open Source (just saying).

    It's in ConcurrencyAbstractionLayerImpl.Timer, which is the new Timer abstraction.

    You have enhanced the Tick() method so that it will wait until the timer has been assigned (i.e. no longer null) before calling Dispose() - which will then free the assigned timer.

    Of course if the Dispose() occurs before the Tick() then the assignment check will never succeed and the SpinWait.SpinUntil() will continue ad infinitum (as witnessed).

    I'm not posting the source as it's not open source but there's some weird stuff going on.

    For example the Timer constructor set the _timer in a finally block but there is nothing in the try... this may be a reflection artifact, otherwise it would be worse than pointless.

    The only way the timer can be nulled is if it is Disposed(), so the spin lock is completely pointless in the Tick.

    The dispose method can be made thread-safe using:

    var timer = Interlocked.Exchange(ref _timer, null);
    if (timer == null) return;
    ... dispose timer

    Anyway hope this helps you fix it more quickly.


    thargy

    Craig Dean (MCPD)
    Chief Executive
    www.webappuk.com

    Please consider marking this as the answer if you have been genuinely helped!


    • Edited by thargy Friday, November 02, 2012 1:28 PM Less snarky
    Friday, November 02, 2012 1:11 PM
  • We'll have a look. There's more complexity and subtlety than one may expect at first; a timer may fire during its construction, before it gets assigned to the field. IIRC, the spinning was introduced to ensure the timer has been assigned and to avoid orphaned timers, but the Dispose path seems to conflict with that as you point out. Btw, empty try blocks are not uncommon; this pattern leverages a CLR property that ensures a finally block cannot be interrupted when a thread is aborted, in this case to make sure the assignment happens, in order to make sure the spinning stops eventually.
    Sunday, November 04, 2012 12:14 AM
    Owner
  • Firstly, apologies to anyone following along, I'm not posting code because we're a Gold partner and I don't want any Microsoft lawyers knocking at the door.  To my knowledge they haven't chosen to OS this project, though I strongly believe this poor sibling of TPL would strongly benefit from a bit more oversight by the community at large.  Any insight I may be eluding to in this thread is through legitimate fair use of the provided software for the purpose of debugging production code.

    Unfortunately, I disagree.  Empty try blocks are uncommon, they do not appear commonly, they may not be entirely absent from the framework (they probably should be), but they are thankfully uncommon, because they're almost always a bad idea.  Like any good developer I know that the finally block will still run during Abort (assuming the CLR is running with the default configuration, there are a number of cases when finallys don't run), however great care should be used with that bit of knowledge.

    Let's take an example:

    Timer timer;
    try
    {
        Thread.CurrentThread.Abort();
    }
    finally
    {
        timer = new Timer(state => state.Dump(), "Do a tick", TimeSpan.FromSeconds(1), TimeSpan.FromMilliseconds(-1.0));
    }

    Here we're trying to ensure that if the Timer is created it will always be assigned to 'timer' even though the thread is aborting.  (I've put an Abort in the try to simulate an inconvenient abort, in truth you only care if if happens between construction and assignment).

    If this were part of the constructor of an IDisposable object you wouldn't want to lose the timer reference as you would be unable to dispose it and it would become orphaned...  

    Now the first question is to ask, is "what would happen if the assignment failed and the timer was orphaned - what are you trying to prevent?"

    Well if you subsequently tried to dispose your constructed object, it wouldn't be able to kill the timer as it wouldn't have a reference to the timer, so keeping a reference to the timer is critical.

    Of course a better question is, "if the thread aborts during the constructor what will try to dispose the class?"

    The thread was aborted in the constructor, therefore the constructor assignment in the callee would never run so no one would have reference to the your object to dispose it.

    Is this a memory leak?  Well the failed constructor means that your object is fine, however the timer still exists, and it is now orphaned, but this is managed code so this isn't a leak, it will get garbage collected (if memory pressure requires), also it's guaranteed to be first generation (only just created) so it's highly likely to get GC'd.  It may be GC'd quickly, in which case it will be disposed and timer will never fire, it may hang around for a long time and the timer may fire.  This is unpleasant, and probably what you were trying to prevent, but the finally hack doesn't help because it produces exactly the same result, in that the containing object may now have a reference to the Timer but there will be no reference to the containing object, so the Timer is still at the whim of the GC.

    However, all of this is moot.  If you were genuinely worried about this, the BCL team allows you to construct the timer first and then call Change (MSDN link included for your reference).  In this scenario you construct, assign and then change, problem solved, no dirty hacks and no empty try blocks.  The timer cannot fire until after it has been assigned.  The corresponding clean up of the knot tying that exists later in the class to try and deal with disposal when the timer is not assigned is also then easily resolved.

    To be fair I'm writing this without access to VS from memory, so maybe I'm missing something, I can't for example measure the performance difference between try...finally single constructor and constructor/change without finally.  Logically both have an extra stack transition, I doubt the difference is significant, however performance is an engineering problem so you'd have to measure it.

    I agree there is a lot more complexity and subtlety in the class (a sure fire sign someone has missed the point).

    But regardless of who's right or wrong (I'd really like to know the exact scenario you thought the finally was curing), this is still a really bad bug in that it will quickly deadlock an application, as such you should really pull the NuGet, or remove the SpinLock.  The bug you've introduced is far worse than the one you thought you were trying to solve.


    thargy

    Craig Dean (MCPD)
    Chief Executive
    www.webappuk.com

    Please consider marking this as the answer if you have been genuinely helped!


    • Edited by thargy Sunday, November 04, 2012 9:52 AM Removed rant
    • Marked as answer by thargy Sunday, November 04, 2012 9:54 AM
    Sunday, November 04, 2012 9:50 AM
  • I've taken out my rants in the hope I might at least get an upvote for the amount of time invested :)

    thargy

    Craig Dean (MCPD)
    Chief Executive
    www.webappuk.com

    Please consider marking this as the answer if you have been genuinely helped!

    Sunday, November 04, 2012 9:53 AM
  • We published an update to address the problem for now and will reevaluate and revisit parts of the implementation for future updates; our apologies for the inconvenience this has caused to early adopters of last week's update. The main source of complexity is the rooting of the timers to prevent them from getting GC'ed prematurely. Various constructor overloads and Change interactions exhibit different behavior here, and this has subtly changed over the years. (Most recently in v4.5 where Task.Delay uses an internal hook to achieve this effect.)

    Monday, November 05, 2012 3:00 AM
    Owner
  • I'm actually really unsure how to respond to this so I'm just going to be honest.

    Do you realise the implications of your statement?  If we're to believe what you're saying, then there is no way to prevent the GC prematurely collecting a timer prior to it's assignment and firing...  That constructor+assign+change don't work as advertised.  That the only valid approach is to use an internal call or to write a mass of ugly spaghetti to prevent this doomsday scenario.  Bare in mind you are far from being the only developers that rely on .NET everyday to develop asynchronous applications, and as an internal MS team your statement is beyond terrifying.

    Yes timer is a PITA, always has been.  And yes it's implementation has changed (hopefully for the better), as have huge areas of the BCL in relation to concurrency (largely as the MS teams have got better at it).  However, you're statement seems to confirm what I already alluded to, namely that you are trying to ensure the timer is assigned to the container class field prior to it firing, so that the timer is not GC'd until your container class is disposed of correctly.  To now learn that, according to the Rx Team, assigning the object prior to a call to change (rather than using the constructor that sets the timer) is unreliable is frankly terrifying, I can see no scenario in which a constructor assignment fails and yet the containing constructor does not also fail.  Maybe there's a 'unintended consequence' in not constructing and triggering the timer in one step - but you've not actually come out and said that.  That would at least be more palatable.

    Having looked at your 'patched' implementation which just adds yet more spaghetti (please don't misread that as unintelligible, I can see what you've done and understand how it patches the problem) I am left clinging to the hope that in fact you're trying to 'blind us with bull'.  Although you wouldn't be the first MS team trying to hide some deep dark secret bug in the BCL (and timer would be a classic), I prefer the "you cocked up and you're too proud to admit it" conclusion, based on Occam's razor.  Should your current solution be considered 'best practice' then we may as well all give up and go home.

    Unfortunately, your community isn't daft enough, or disinterested enough to just go "oh that's OK, so long as they know what they're doing", as we rely on propagating best practice, especially when it comes to concurrency algorithms.  Either you're onto something, in which case we need to understand so we don't repeat the problems you're solving, or you've made a mistake, there isn't a middle ground.  A little bit more openness and honesty will earn you far more respect than trying to distract us into panicky reviews of our own timer usage in relation to an undisclosed, and unseen, side effect.

    I firmly believe that Rx is a necessary tool in the concurrency toolbox, and ultimately should enjoy the same status as TPL and async/await.  It solves a very different problem space, very well.  However, you could really do with a little less silo mentality and a lit more community engagement (for example I would have been much nicer if you'd given me an up vote... the little things count).


    thargy

    Craig Dean (MCPD)
    Chief Executive
    www.webappuk.com

    Please consider marking this as the answer if you have been genuinely helped!

    Monday, November 05, 2012 6:08 AM
  • Although I hate the idea of all those 'unnecessary' spinlocks I can confirm that this is no longer exhibiting the symptoms from v.2.0.21103.

    As for 'early adopter'  I think most NuGet users only consider themselves 'early adopter's when downloading pre-release NuGets (seehttp://docs.nuget.org/docs/reference/versioning), not released NuGets.  I'll happy adopt pre-release packages to test as an 'early adopter' if you push them out as such, however I won't stick them in my production code.


    thargy

    Craig Dean (MCPD)
    Chief Executive
    www.webappuk.com

    Please consider marking this as the answer if you have been genuinely helped!

    Monday, November 05, 2012 6:17 AM
  • First of all, thanks a bunch for taking the time to report this issue and to write up this lengthy discussion. We really do appreciate this. Our apologies for not having used the voting feature yet; this is corrected now :-).

    Anyway, back to the issue. The story goes back a long time, to the pre v1.0 days of Rx running on a wide variety of platforms, including .NET 3.5 and Windows Phone 7's version of .NET based on the Compact Framework. Back then, one of our uses of the System.Threading.Timer class was flawed, because the timer didn't get rooted properly, causing it to be GC'ed prematurely. While this was a well-known issue, it slipped through the cracks in a pre-release of v1.0 and we corrected this issue by adding rooting logic based on a Hashtable in a static field. Obviously, this brought its own slew of complexity in that the access to this collection has to be thread-safe and the rooted timers have to get removed from the collection eventually, in a reliable fashion to prevent possible memory leaks.

    During the early days of the Rx v2.0 work, we analyzed various performance traces for different uses of Rx in a variety of our own assets. For cases where a lot of ad-hoc temporal queries were hosted in the same process (think of a service monitoring product, not to be revealed by name right now), the lock for this collection started to show up. As such, we broke it up into separate locks for separate timer queues in Rx v2.0 RTM: each CAL had its own; the ThreadPoolScheduler (which is also used by TaskPoolScheduler for time-based operations) also had its own; and there was further distinction between rooting collections for relative timer versus absolute timer uses. This improved the situation a bit (especially due to the separation for absolute versus relative, making different use cases not have to wait for the same lock), which was welcome for monitoring scenarios where we want to have as little impact as possible (otherwise you end up monitoring yourself).

    However, somewhere during the v1.x days, the reason to root the timers became obsolete in a rather unexpected way... I don't recall the exact moment, given that the IScheduler interface underwent a lot of changes back then, and the disposal needs changed with it. Anyway, here's the essence. In order to allow for stopping time-based operations, we've always handed out an IDisposable from the scheduler layer, essentially wrapping the Timer object. This by itself is not enough to prevent the timer from getting collected of course: after all, no-one has to keep IDisposable objects in Rx alive to prevent things from stopping (which would be the - arguably odd - Timer behavior all over again). However, as this IDisposable also had code in it to remove the timer - under the lock, as discussed before - from the rooting collection (to prevent leaks again - remember the root collection is in a static field), it got reused within the Tick handler to achieve the same unrooting effect upon expiration of the timer. Now we have an interesting object graph. The Timer holds on to the Tick handler delegate, which was written as a lambda closing over outer scope variables, one of which was the disposable (to unroot in the callback in a finally block), which by itself held on to the Timer object in order to remove it from the rooting collection. To run the timer, the delegate gets added to a timer queue in the BCL, and now we're effectively keeping the Timer object alive by the mere fact that its Tick handler has a reference to it! So, did we need the rooting after all? Well, if we remove it, the effect of rooting the Timer through its Tick handler's reference-to-self goes away, putting us back at square one.

    In fact, the question whether to root or not didn't come up again in the v1.x days: things worked just great with the rooting trick. The insight of the self-rooting behavior only came later during the v2.0 performance investigation work. In fact, through our sync-ups with the async/await feature team, we saw the introduction of the Task.Delay method which also uses a Timer and required similar rooting tricks. However, at that very time we learned about the Timer behavior (at least in .NET 4.0 and beyond) with regards to self-rooting. It turns out that one of the constructor overloads captures the constructed Timer object ("this" deep inside the Timer code) as the state parameter, which ends up alongside the Tick callback in a queue, effectively rooting the timer. Other overloads with custom state don't have self-rooting behavior, causing potential premature GC. Recalling the hot locks (not to mention the cost of the rooting collection hash table, linear in the number of running timers, which in monitoring scenarios can be rather large), it seemed that leveraging this self-rooting behavior was something to investigate further. We didn't get to this until after the initial v2.0 release, also awaiting confirmation of Timer behavior from the team that was just wrapping up its 4.5 release (the beauty of sim-shipping is that everyone is busy at the same time!).

    So, in order to make the self-rooting behavior work, we had to make sure the Timer held onto itself either through the "this-as-state-behavior" constructor or the Tick handler delegate's instance having a reference to the Timer. Having a state parameter passed to scheduler and CAL methods, the former approach seems less appealing given that Timer already has a slot to share state. Luckily, we need a way to dispose the timer when the upper layers of Rx request a cancellation, so we need to store it in a field anyway. And the field would be on the object that also has the Tick handler method, so the second approach works. (Notice all of this was only done for .NET 4.0 and beyond where this self-rooting behavior was confirmed and also recorded as something that could never be changed in the future due to the breaking change behavior it'd have, causing timers to be collected prematurely. For earlier platforms, where the concerns of this stress test don't hold, e.g. WP7 devices, we kept things as-is.) An experiment showing this behavior is shown below:

    var t = default(Timer);
    t = new Timer(_ =>
    {
        t = null; // Comment out this line to see the timer stop
        Console.WriteLine("Hello!");
    }, null, 5000, Timeout.Infinite);

    Trigger a GC when this timer is active to see whether it eventually fires or not. Depending on the commented line, behavior will be different. You can use the SOS !gcroot command to reveal how the timer is rooted. Below is an example output on a .NET 4.5 system (when the line isn't commented out):

    !gcroot 02492440
     HandleTable:
         005a13fc (pinned handle)
         -> 03491010 System.Object[]
         -> 024924dc System.Threading.TimerQueue
         -> 02492450 System.Threading.TimerQueueTimer
         -> 02492420 System.Threading.TimerCallback
         -> 02492414 Program+<>c__DisplayClass1
         -> 02492440 System.Threading.Timer

    Either way, we now have a wrapper class that contains a Timer field which can be used to stop the timer early and has the useful side-effect of keeping the timer rooted through the callback delegate. Now there's one more requirement, namely to make this wrapper clean up after itself when the timer expires. After all, the instance may be held onto from the layers above to allow for cancellation, but the layer above won't remove the instance from whatever collections (e.g. a CompositeDisposable for an N-ary operator that uses time-based sequences) it has upon the timer elapsing. As such, the holders can be long-lived and we don't want to hold on to potentially large pieces of state. One such thing is the supplied action delegate itself of course, which may hold on to a closure. Notice that closures in C# can cause space leaks because each anonymous method in the same enclosing method will share the same display class, even if different anonymous methods have distinct sets of variables they capture - letting go of the delegate if it's no longer needed is a good thing. To see this, compile the following code and analyze it in ILDASM (what happens to the closure if "a" holding on to 100MB is no longer reachable, but "d" still is?):

    var b = new byte[1024 * 1024 * 100];
    var c = "";
    var a = new Action(() =>
    {
        Console.WriteLine(b.Length);
    });
    var d = new Action(() =>
    {
        Console.WriteLine(c);
    });

    In the same go, we can also drop the timer field as it's no longer needed; IIRC, the current structure of the Timer in the BCL is to contain an intermediate holder object, which holds on to a timer queue object, which is a linked list entry with a bunch of state alongside of it. This can add up if a lot of expired timers are artificially kept alive at a higher level.

    So, the natural thing is to null out the timer field and immunize the action delegate if it's no longer needed (in case of cancellation, if the Tick fires, we don't want it to null-ref, so we assign a shared "Nop" stub instance that's used in various places). However, if the Timer fires during construction and before assignment (the constructing thread may be swapped out before the stfld instruction), the Tick handler would clean up the timer field that's not yet set, so ultimately the timer will still end up in the field when the .ctor thread continues running. Now its lifetime is tied to the holder instance which may be long-lived for various classes of standing queries (as discussed before, the IDisposable holder may be held onto by a disposable subscription object graph). At this point, we can argue whether this case is rare enough to warrant a "don't care" approach. And yes, likely it is. But no, we made the other call upon rewriting this code and introduced the spinning logic. The one oversight is what you found, namely that the Dispose method can be called during the Tick handler's execution, causing it to face an everlasting null reference for the field it so badly wants to clean up...

    As for the thread abort case, when the thread running the holder object's constructor is running up to the point of creating a new Timer instance and then gets aborted, the assignment of the Timer instance to the field will never take place. However, the Tick handler delegate already has captured the "this" reference, hence it can eventually run on top of the holder instance whose timer field never got assigned, causing it to spin indefinitely. The following code fragment shows how the failed-to-construct instance is usable by the timer, nonetheless:

    class Program
    {
        static void Main()
        {
            try
            {
                new Bar(true);
            }
            catch { }
            Console.ReadLine();
        }
    }
    class Bar
    {
        private Timer timer;
        public Bar(bool b)
        {
            var t = new Timer(Tick, null, 1000, Timeout.Infinite);
            if (b)
                throw new Exception();
            timer = t;
        }
        private void Tick(object _)
        {
            Console.WriteLine(timer == null);
        }
    }

    You're absolutely right that the use of Change seems more appropriate here in order to ensure prior field assignment, rather than doing the funny dance with the finally handler to ensure assignment in this pathological situation (notice we don't do anything special for the case where the Timer constructor throws; in such a case, the Tick handler should not be called, so there's no need to ensure field assignment). We'll double-check the Change behavior to ensure the desired rooting behavior is maintained (I can't see right now how it would not be, but these things have been proven tricky in the past) and may revisit this code in the future to simplify it, as deemed appropriate. Task.Delay ended up using an internal hook to achieve the rooting effect but doesn't use Change. We mirrored the Rx code after it, combined with self-rooting through a field, because we can't access the internal hook from outside the BCL. Then the thread abort situation was brought up and for some reason Change wasn't (re)considered. (I'll check whether I'm missing something here when I get back to the office later this week.)

    Once more, we want to extend our apologies for the inconvenience this has caused and appreciate your ongoing participation here.

    • Marked as answer by thargy Monday, November 05, 2012 11:32 AM
    Monday, November 05, 2012 10:15 AM
    Owner
  • I'm on my phone at the moment, so I have to be brief. Thank you so much for the detailed explanation. I can't tell you how reassuring it is. We're up against a really tight deadline and the team's faith in Rx has been badly shaken in the last two weeks. First we had the group by bug, then we had some weird edge cases where the logical call context was not being propagated correctly, which I think was actually in SqlClient, but regardless it led to a mad dash API rewrite to pass context explicitly. When the last NuGet borked our code right before release we had to back to our customer and say we were going to miss our code drop. It was in this context that my decision to use Rx was being viewed and so apologies if I get stressy. Your explanation makes perfect sense to me, it's a scenario I've faced myself in library code development, but in general I've been happy to only gaurantee the tick whilst the reference is held. It's been a while since I reviewed the Timer code and your explanation shows I should go and have another look at the changes and the self rooting that now occurs. I noticed the shared stub. I still think the ultimate solution is worryingly inelegant (and I'm speaking as the guy who normally does the performance tuning so am generally sympathetic to inelegant!). When I get through our latest code rush I might spend some time chewing it over myself to improve my own levels of understanding. I gave you a bit of a slamming earlier so it's only fair to conclude by thanking you for such a thorough, careful explanation and say many thanks.

    thargy

    Craig Dean (MCPD)
    Chief Executive
    www.webappuk.com

    Please consider marking this as the answer if you have been genuinely helped!

    Monday, November 05, 2012 11:31 AM
  • No worries - I'm sure we've all been exposed to stressy situations, and we totally understand the inconvenience that hiccups in dependencies can cause. In case you're ever visiting Seattle, reach out to us: we owe you a coffee :-). We'll revisit the elegancy of the solution for a service release and went down the path of the least risk for the hot fix produced this weekend.

    Monday, November 05, 2012 7:56 PM
    Owner