none
Strange Performance Problem with Generic Delegates

    Question

  • 1    public class ActionFactory<T> 
    2    { 
    3        private readonly Action<T> on; 
    4        public ActionFactory(Action<T> onMsg) 
    5        { 
    6            this.on = onMsg; 
    7        } 
    8 
    9        public Action Create(T msg) 
    10        { 
    11            return () => on(msg); 
    12        } 
    13        public static Action Create(T msg, Action<T> target) 
    14        { 
    15            return ()=> target(msg); 
    16        } 
    17    } 
    18 
    19    [TestFixture] 
    20    public class PerfBug 
    21    { 
    22        [Test] 
    23        public void PerfTestWithStringInline() 
    24        { 
    25            Action<string> onMsg = x => { if (x == "end") Console.WriteLine(x); }; 
    26            Stopwatch watch = Stopwatch.StartNew(); 
    27            for (int i = 0; i < 5000000; i++) 
    28            { 
    29                Action act = () => onMsg("s"); 
    30                act(); 
    31            } 
    32            Action end = () => onMsg("end"); 
    33            end(); 
    34            watch.Stop(); 
    35            Console.WriteLine("Elapsed: " + watch.ElapsedMilliseconds); 
    36        } 
    37 
    38        [Test] 
    39        public void PerfTestWithString() 
    40        { 
    41            Action<string> onMsg = x => { if (x == "end") Console.WriteLine(x); }; 
    42            ActionFactory<string> fact = new ActionFactory<string>(onMsg); 
    43            Stopwatch watch = Stopwatch.StartNew(); 
    44            for (int i = 0; i < 5000000; i++) 
    45            { 
    46                Action act = fact.Create("s"); 
    47                act(); 
    48            } 
    49            fact.Create("end")(); 
    50            watch.Stop(); 
    51            Console.WriteLine("Elapsed: " + watch.ElapsedMilliseconds); 
    52        } 
    53
    The code demonstrates a performance problem I've encountered. The first test case completes in only 100 ms on my 2ghz machine. The second example only introduces the action factory, but the test time jumps to over 15 seconds. 

    Anyone have an idea why the time increases so dramatically? 


    Monday, January 19, 2009 8:47 PM

Answers

  • Want to see something even weirder? Here are the results of several test runs using 2 million iterations each of those two methods:

    Debug Configuration, With Debugging (F5):                249/6404
    Debug Configuration, Without Debugging (Ctrl+F5):   224/5882
    Release Configuration, With Debugging (F5):              249/6736
    Release Configuration, Without Debugging (Ctrl+F5): 5102/5401

    Based on those numbers, I now agree that this is a bug, although what exactly the bug is I cannot say. The time to run the first case should not increase when optimizations are turned on. There is a known class of bugs in the JIT which only manifest when optimizations are on and a debugger is not attached (as in the last case). Its possible that there are actually two related optimization bugs here; one, in the generic case, which happens all of the time, and one, in the non-generic case, which only happens in optimizations without the debugger. It is also possible that the final times are actually the right ones, and the first case is actually being executed incorrectly without optimizations; although that seems less likely, since the most common source of bugs in the JIT comes from optimization.

    I would update the Connect bug with this information, so that Microsoft does not dismiss this as an allocation problem the way I did. The repro case needs to make it clear that the only difference between the two cases under test is that one is generic and one is not.


    Moderator | MCTS .NET 2.0 Web Applications | My Blog: http://www.commongenius.com
    Thursday, January 22, 2009 5:19 PM
    Moderator

All replies

  • Hmmm... pretty interesting. I've never seen non-generic Action class.... Where did you get it from?
    Vitaliy Liptchinsky http://dotnetframeworkplanet.blogspot.com/
    Monday, January 19, 2009 9:11 PM
  • Yep, sorry. That's new one in .NET 3.5.
    The first thought would be that c# compiler generates additional type checks (which are pretty heavy but 15 secs is too much anyway).
    Unfortunately I left piece of code I was playing with at home and can't provide you with more info... I'll get back in about 10 hrs :)

    Vitaliy Liptchinsky http://dotnetframeworkplanet.blogspot.com/
    Tuesday, January 20, 2009 9:22 AM
  • I've checked IL generated by the c# compiler. I am quite sure that the problem exists in ActionFactory<T>.Create method. At the end of my post I have provided IL of this method as reference. The very strange thing to me is line number IL_000f (in bold). c__DisplayClass1 here is class (not a struct!) generated by C# compiler which keeps reference to on delegate. stdfld opcode sets it's field with current instance of ActionFactory class. But in the meantime I can't find any calls to the constructor of c__DisplayClass1! That's really weird. I would like to turn to Microsoft guys for any explanations. I believe there is an issue with c# compiler.
    And one more thing: I've tried to use generic Action class without factory and it worked with no problems - everything is ok with this class.

    .method public hidebysig instance class [System.Core]System.Action
            Create(!T msg) cil managed
    {
      // Code size       38 (0x26)
      .maxstack  3
      .locals init ([0] class ConsoleApplication1.ActionFactory`1/'<>c__DisplayClass1'<!T> 'CS$<>8__locals2',
               [1] class [System.Core]System.Action CS$1$0000)
      IL_0000:  newobj     instance void class ConsoleApplication1.ActionFactory`1/'<>c__DisplayClass1'<!T>::.ctor()
      IL_0005:  stloc.0
      IL_0006:  ldloc.0
      IL_0007:  ldarg.1
      IL_0008:  stfld      !0 class ConsoleApplication1.ActionFactory`1/'<>c__DisplayClass1'<!T>::msg
      IL_000d:  ldloc.0
      IL_000e:  ldarg.0
      IL_000f:  stfld      class ConsoleApplication1.ActionFactory`1<!0> class ConsoleApplication1.ActionFactory`1/'<>c__DisplayClass1'<!T>::'<>4__this'
      IL_0014:  nop
      IL_0015:  ldloc.0
      IL_0016:  ldftn      instance void class ConsoleApplication1.ActionFactory`1/'<>c__DisplayClass1'<!T>::'<Create>b__0'()
      IL_001c:  newobj     instance void [System.Core]System.Action::.ctor(object,
                                                                           native int)
      IL_0021:  stloc.1
      IL_0022:  br.s       IL_0024
      IL_0024:  ldloc.1
      IL_0025:  ret
    } // end of method ActionFactory`1::Create



    Vitaliy Liptchinsky http://dotnetframeworkplanet.blogspot.com/
    Tuesday, January 20, 2009 10:01 PM
  • Thanks Vitaliy for taking a look at the code and verifying the behavior.

    I created a support issue with Microsoft. 


    Hopefully, they will provide some insight and a solution.

    Thanks,

    Mike

    Tuesday, January 20, 2009 10:40 PM
  • The technical term is "closure".  The C# compiler generates a helper class to capture the execution state of the local variable.  Anonymous methods generate them too.  This is all very much by design.
    Hans Passant.
    Wednesday, January 21, 2009 1:35 AM
    Moderator
  • Okay. I'm very familiar with the "technical" term, but why does the performance suck? I am willing to accept a small degradation, but it should be no where near 15 SECONDS for 5 million iterations. 

    Thanks,

    Mike
    Wednesday, January 21, 2009 4:48 AM
  • Probably because the second case introduced an extra method call that was not inlined, in a loop that is otherwise trivial.
    Wednesday, January 21, 2009 8:46 AM
  • The problem is one of allocations. The inline version caches both the delegate itself and the anonymous instance which holds the closure data. The other version, however, cannot cache either the delegate or the anonymous closure instance, since the delegate is created in a method call. This means that you end up allocating 10 million instances during the test: 5 million delegates, and 5 million anonymous instances. This accounts for the extreme time difference.

    There are several ways to see this happening:

    1) Load the assembly into Reflector, and in View->Options->Dissambler, change the Optimization setting to .NET 1.0. This will show you that the PerfWithStringInline method allocates one delegate and one closure instance, whereas PerfWithString calls ActionFactory.Create, which allocates a delegate and a closure instance each time it is called.

    2) Use the CLR Profiler to profile the application, and then open the "Histogram by Age". This will show you all of the allocations, and you will see that they are all done after the first run is complete.

    3) Add an instance Stopwatch to the ActionFactory, then start it when Create is called and stop it right before returning. Then compare the time spent in Create to the total time of the test. You will see that most of the time is spend in the Create method, and all it does is create an Action delegate, create the closure instance, and assign the parameter value to the closure variable.

    The "solution" is this prefabricated example would be to move the delegate creation outside the loop, since it is not dependent on the state of the loop. The actual solution in production code might be more involved, depending on the situation.

    Moderator | MCTS .NET 2.0 Web Applications | My Blog: http://www.commongenius.com
    Wednesday, January 21, 2009 9:41 PM
    Moderator
  • 1         public static Action CreateString(string msg, Action<string> target)  
    2         {  
    3             return () => target(msg);  
    4         }  
    5  
    6         public static Action CreateGeneric<T>(T msg, Action<T> target)  
    7         {  
    8             return () => target(msg);  
    9         }  
    10  
    11  
     
    Well, if I switch to using 2 methods - one generic and one typed I still get a huge disparity in performance. The typed method takes 2 seconds while the generic method takes 17 seconds. I disassembled the code for the methods and it is identical except one is generic.

    The disparity is so large that it appears to me at least that the CLR simply isn't optimizing the generic method properly.

    Also, if I run the test with a value type for the generic call, the performance is drastically better. The performance suffers when an object type is specified.
    Thursday, January 22, 2009 1:38 AM
  • Want to see something even weirder? Here are the results of several test runs using 2 million iterations each of those two methods:

    Debug Configuration, With Debugging (F5):                249/6404
    Debug Configuration, Without Debugging (Ctrl+F5):   224/5882
    Release Configuration, With Debugging (F5):              249/6736
    Release Configuration, Without Debugging (Ctrl+F5): 5102/5401

    Based on those numbers, I now agree that this is a bug, although what exactly the bug is I cannot say. The time to run the first case should not increase when optimizations are turned on. There is a known class of bugs in the JIT which only manifest when optimizations are on and a debugger is not attached (as in the last case). Its possible that there are actually two related optimization bugs here; one, in the generic case, which happens all of the time, and one, in the non-generic case, which only happens in optimizations without the debugger. It is also possible that the final times are actually the right ones, and the first case is actually being executed incorrectly without optimizations; although that seems less likely, since the most common source of bugs in the JIT comes from optimization.

    I would update the Connect bug with this information, so that Microsoft does not dismiss this as an allocation problem the way I did. The repro case needs to make it clear that the only difference between the two cases under test is that one is generic and one is not.


    Moderator | MCTS .NET 2.0 Web Applications | My Blog: http://www.commongenius.com
    Thursday, January 22, 2009 5:19 PM
    Moderator