How does the keyword fixed affect the compiled code?

    Question

  • var digit = new int[8];
    int* p = stackalloc int[8];
    fixed (int* dig = digit)
    {
        for (var i = 1; i <= max; i++)
        {
            do { ... }
        }
    }

    I was evolving the section of code above and noticed a huge difference in the running speed of the code (375 ms vs. 261 ms).  The change has been narrowed down to the variable "digit" above.  Digit is no longer used in the main loop, so I removed its declaration and the fixed statement and recompiled.  Somehow the version with less code (only the code that is actually used) runs slower than the code with the unused variable still in place. So I added the two lines back in and took them out a few times; the performance changes every time.

    Next I opened the generated IL from the two optimized release EXEs.  There are 17 more lines of IL in the faster code (the one with the unused declaration).  After comparing the IL, I was able to determine that the code below the "for" statement is in fact identical; the 17 extra lines of IL all occur before the loop.

    My question is: how does the "fixed" statement affect the way the CLR runs the identical code that follows it?

    I will include the code if needed, but I really don't see how it can be involved...


    -QuickC


    using System;

    namespace CS2010BCD
    {
        class CountOnes
        {
            const int MAXNUM = 99999999;

            static void Main(string[] args)
            {
                var cnt = 0;
                Console.WriteLine("Counting the ones...");
                var sw = System.Diagnostics.Stopwatch.StartNew();
                cnt = BCD(MAXNUM);
                sw.Stop();
                Console.WriteLine("...From 1 to {0:N0} I counted {1:N0} 1's in {2} msec.",
                    MAXNUM, cnt, sw.ElapsedMilliseconds);
                Console.ReadKey();
            }

            unsafe static int BCD(int max)
            {
                var n = 0;
                var ones = 0;
                var count = 0;
                int* p = stackalloc int[8];
                //var hole = new int[1];
                //fixed (int* arse = hole)
                {
                    for (var i = 1; i <= max; i++)
                    {
                        do
                        {
                            if (p[n] >= 2 && p[n] <= 8) { p[n]++; n = 0; break; }
                            else if (p[n] == 9) { p[n] = 0; n++; break; }
                            else if (p[n] == 1) { p[n] = 2; ones--; n = 0; break; }
                            else { p[n] = 1; ones++; n = 0; }
                        } while (n > 0);
                        count += ones;
                    }
                }
                return count;
            }
        }
    }

    Compile as release, optimized, unsafe, 32-bit; then compare against the same settings with the two commented lines made active. The build with those lines active will be faster, even though they have no direct effect on the code produced below them... There must be a compiler reason for the variation.

    Note: this was on VS vNext, but it is the same on 2008 and 2010.


    • Edited by QuickC Sunday, February 26, 2012 8:11 PM fixed comments
    Sunday, February 26, 2012 2:34 AM

Answers

  • QuickC,
    Louis,

    i3, .NET 3.5, XP - 253 ms with the var hole and fixed lines, 275 ms without them (compiled on VS2008)
    i3, .NET 4.0, XP - 258 ms with the var hole and fixed lines, 338 ms without them (compiled on VS2010)
    i5, .NET 3.5, W7 - 196 ms with the var hole and fixed lines, 216 ms without them (compiled on VS2008)
    i5, .NET 4.0, W7 - 269 ms with the var hole and fixed lines, 275 ms without them (compiled on VS2010)

    (also tested on AMD multicore processors with similar small differences)

    As you can see, I'm not getting the big variations mentioned by QuickC. On the other hand: Allocations are really cheap in .NET. I can't imagine how allocating an array of length 1 (and subsequent pinning) could possibly take 22-100ms. My impression is that we're getting into micro-optimization, and - as I said before - I'm not at all sure as to what exactly we are measuring here.

    Marcel


    • Edited by Marcel RomaMVP Monday, February 27, 2012 6:12 PM
    • Marked as answer by QuickC Monday, February 27, 2012 11:10 PM
    Monday, February 27, 2012 6:11 PM

All replies

  • For this you need to know how the GC works. Whenever your application requires more memory than is available, a garbage collection has to run. During garbage collection, the memory occupied by dead objects is freed. This causes fragmentation, and the garbage collector starts moving objects in the heap to compact the memory. This means the address of an object can change after a garbage collection. That works fine for managed objects. But if you have passed the address of an object to unmanaged code and the address of that object changes after a GC, the unmanaged code might read from or write to the wrong memory location, and your application may behave incorrectly.

    So by using a fixed statement, you instruct the CLR not to move the object while compacting memory during a GC. Obviously, the CLR has to do extra work to track the pinned object, which is why you see the 17 extra lines before the for loop. This extra work definitely adds overhead.
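
    To make the pinning point concrete, here is a minimal sketch; the native library "NativeLib.dll" and its "NativeSum" export are hypothetical, purely for illustration. Without the fixed block, the GC could relocate the array while the native code still holds its old address.

        using System;
        using System.Runtime.InteropServices;

        class PinningExample
        {
            // Hypothetical native export; any C function taking (int*, int) would do.
            [DllImport("NativeLib.dll")]
            unsafe static extern int NativeSum(int* values, int length);

            unsafe static int Sum(int[] data)
            {
                // fixed pins "data" for the duration of the block, so its address
                // cannot change even if a GC compacts the heap in the meantime.
                fixed (int* p = data)
                {
                    return NativeSum(p, data.Length);
                }
            }
        }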

    I hope you understand this.


    Please mark this post as answer if it solved your problem. Happy Programming!

    Sunday, February 26, 2012 6:03 AM
  • If you are looking for performance, there is no need to use the fixed keyword at all. As your code already suggests (stackalloc int[8]), you can simply allocate a block of memory on the stack (instead of on the heap, as with new int[8]) and address it using unsafe pointers. Because stack allocations are automatically freed when the method returns, memory pinning and garbage collection are not relevant here.
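
    For comparison, here is a minimal sketch of the two approaches (method names are just illustrative, and the project must be compiled with /unsafe):

        unsafe class StackVsHeap
        {
            // Stack buffer: stack memory is never moved by the GC,
            // so no fixed statement is needed to take a pointer to it.
            static long SumStack()
            {
                int* p = stackalloc int[8];
                long sum = 0;
                for (int i = 0; i < 8; i++) { p[i] = i; sum += p[i]; }
                return sum;
            }

            // Heap array: must be pinned with fixed before taking a raw pointer,
            // because the GC may otherwise relocate it during compaction.
            static long SumHeap()
            {
                var digits = new int[8];
                fixed (int* p = digits)
                {
                    long sum = 0;
                    for (int i = 0; i < 8; i++) { p[i] = i; sum += p[i]; }
                    return sum;
                }
            }
        }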
    Sunday, February 26, 2012 7:54 AM
  • Adavesh,

    I was aware of the GC issues, and there are none in this code; the GC runs only every few minutes.  The version with the extra 17 lines of IL runs FASTER than the one without.  So I have to conclude that the "fixed" statement in some way affects bounds checking or something similar in the libraries.  I would like to know why it has an effect, so I can apply the advantage on demand.

    Note: I have the same code in VB, F#, C#, VC++, C++/CLI, Java, JS, GCC, and TinyC, each tweaked in many different ways to find the fastest code for each language and compiler.  The C# code with the unused fixed statement is the fastest, squeaking by native C++; 2nd was C++/CLI, 3rd VB, 4th F#, 5th Java, 6th JS (compiled from the command line), 7th TinyC, and last GCC.  They must put very little effort into the quality of the GCC-generated code.


    -QuickC

    Sunday, February 26, 2012 4:24 PM
  • Marcel,

    That is why the code was being removed.  It does not explain why removing it slows things down, though.


    -QuickC

    Sunday, February 26, 2012 4:26 PM
  • QuickC,

    You didn't get the point, I'm afraid. Allocating a block of memory with stackalloc and then using unsafe pointers *without* the fixed keyword is always faster. My point was: there simply is no need for fixed when you're dealing with stack allocations only.

    Marcel

    Sunday, February 26, 2012 4:33 PM
  • Thanks for the feedback Marcel,

    Note that the code with the "fixed" keyword is the much faster code, though?!


    -QuickC

    Sunday, February 26, 2012 4:55 PM
  • This seems incredible to me. You must be experiencing some measurement issue or something.
    Please post some compilable demo code here, so we can have a look too.
    Sunday, February 26, 2012 5:05 PM
  • Marcel,

    I added the full code to the original post, then noticed that it might not have notified you.

    If you note something different, please post your findings.


    -QuickC

    Sunday, February 26, 2012 5:31 PM
  • QuickC,

    Thank you for providing the code. Here's what I did:

    1. I uncommented the line int* p = stackalloc int[8]; and ran your test code five times (release mode, no code optimization).
    Results:

     992ms
    1001ms
     992ms
     990ms
     993ms

    2. I commented that line again, uncommented the line fixed (int* dig = digit), changed p[n] to dig[n], and ran the code again five times (release mode, no code optimization).
    Results:

    1222ms
    1236ms
    1222ms
    1223ms
    1210ms

    The results so far showed what I was arguing in my previous posting: the stackalloc variant is faster. Given the number of iterations, however, I think the difference can be placed in the realm of micro-optimization. I don't know if it is very relevant in your particular scenario.

    After enabling code optimization the stackalloc-variant ran in avg. 272ms whereas the fixed-variant ran in avg. 326ms.

    Marcel


    • Edited by Marcel RomaMVP Sunday, February 26, 2012 6:54 PM Added optimization numbers
    Sunday, February 26, 2012 6:42 PM
  • Marcel,

    Let's make sure we are talking about the same thing: the code I tested had both comments in place, as above, or both removed; no other changes are required. It's a dead variable that I was going to remove, but I noted the before and after, and boom, it had an impact.

    Sounds like you're running on an i7 chip to get such close numbers.


    -QuickC

    Sunday, February 26, 2012 7:00 PM
  • QuickC,

    If both comments were in place - as you just described - the code would not compile (CS0103), because the p[n] identifier would simply not exist in the given context.

    If, by contrast, I were to remove both comments, the code would compile, but now the stackalloc variant would be used (!). The fixed keyword would just add some overhead, but since you never use the dig pointer, the variants (with/without the fixed keyword) have about the same performance in non-optimized mode (without fixed it runs slightly faster).

    It's only when optimization is enabled that some effect is visible (257 ms with fixed, 271 ms without), but frankly... I'm not at all sure what we're measuring here.

    Marcel

    Sunday, February 26, 2012 7:37 PM
  • Marcel,

    You're right! The comments should have been on:

    //var digit = new int[8];
    //fixed (int* dig = digit)

    I completely forgot I reversed the variables for kicks and then commented out the wrong lines after pasting.

    You have been a great help.  It's now up to someone who wrote the implementation of "fixed" to comment on how it affects the CLR in such a way that it impacts the code.  If all my loops will run faster when I place them inside an unused "fixed" block (see the sketch below), that would be good to know, as would what I give up to get the better performance.
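
    For the record, the pattern in question looks roughly like this (a sketch only: the loop body is a stand-in, not the real digit counter, and whether the unused fixed block actually helps is exactly what this thread is trying to pin down):

        // "hole" is allocated and pinned but never read or written.
        unsafe static long SumWithUnusedFixed(int max)
        {
            var hole = new int[1];            // dead variable
            long sum = 0;
            fixed (int* unused = hole)        // pins "hole" around the loop
            {
                for (var i = 1; i <= max; i++)
                    sum += i & 1;             // stand-in for the real loop body
            }
            return sum;
        }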


    -QuickC

    Sunday, February 26, 2012 7:55 PM
  • QuickC,


    The observable effects are very small in my opinion. If you consider that they manifest themselves over a 100,000,000-iteration loop, the gains per iteration (should they really exist) are quasi-infinitesimal.

    Marcel

    Sunday, February 26, 2012 8:11 PM
  • Marcel,

    My view is that the change is huge: from 271 ms with the "fixed" in place to 370 ms without it.  In my comparison with other languages it takes the code from faster than C++ to slower than VB, both of which showed their best times with a completely inverted implementation based on a structure.


    -QuickC

    Sunday, February 26, 2012 8:19 PM
  • QuickC,

    If 99 ms (and in my measurements only 14 ms) are so precious to you on a 100,000,000-item iteration, then C#/.NET is definitely not the language/platform you want to use. And please remember that even those small gains come only with code optimization enabled. It may well be some sort of short-circuit in the code optimization, rather than the mere fact that we are pinning a managed array in memory (one that never gets used), that produces the effect. Pinning objects in memory also has negative effects, because it favors heap fragmentation and leads to more memory pressure in the long run. I certainly understand your point of view, but squeezing out some milliseconds in such an unorthodox way may come at a high cost.
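
    To illustrate what pinning costs, here is a sketch using GCHandle, which makes the pin explicit (fixed does the same thing implicitly for the duration of its block): a pinned object cannot be moved by the compacting GC, so long-lived pins are what tend to fragment the heap.

        using System;
        using System.Runtime.InteropServices;

        class PinningCost
        {
            static void Main()
            {
                var buffer = new byte[1024];

                // Pin the buffer explicitly; the GC must now compact around it.
                GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
                try
                {
                    IntPtr address = handle.AddrOfPinnedObject();
                    Console.WriteLine("Pinned at 0x{0:X}", address.ToInt64());
                }
                finally
                {
                    handle.Free();   // release the pin as soon as possible
                }
            }
        }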

    Marcel

    Sunday, February 26, 2012 8:31 PM
  • Marcel,

    Did you run the revised code, or the version I mis-pasted?  I get 270 versus 370. At the 450,000,000 loops and increments per second that I have measured elsewhere, 100 ms is either a few million instructions or one hit per loop in level-two cache, for an unknown reason.

    I suspect that the loop code, which is entirely static when run, will never cause a GC in a hundred runs.

    I also assume the faster version lives entirely in level-one cache, which is nearly as fast as the CPU registers.  As for C#/.NET: I also code in native C/C++, and for this code, without bit smashing, I have not found a variation that runs faster in NATIVE code than the C#.  With a CPU that has at least four cores, there is no measurable overhead on the main code path; the CLR uses the other cores to feed optimized instructions only.

    If no one from the compiler team comments by tomorrow night, I'll mark this thread answered by you.


    -QuickC

    Sunday, February 26, 2012 9:19 PM
  • QuickC,

    Yes, I ran the revised code [funny naming, though]. After 10 runs each I now get an average of 275.8 ms vs. 253.2 ms @ 100,000,000 loops. That makes for a difference of 22.6 ms.

    While this delta is far from the 100ms you got, it still remains an intriguing fact.

    Marcel

    Monday, February 27, 2012 10:10 AM
  • Marcel,

    Very interesting, what versions of .NET and VS?

    .NET version 4.5.40805 (the directory, however, is 4.0.30319), Win7
    VS version 11.0.40825.2
    i7: 271 ms with the var hole and fixed lines, 371 ms without them

    .NET version 4.5.65530 (the directory, however, is 4.0.30319), Win8
    VS version 11.0.40825.2
    Laptop: 615 ms with the var hole and fixed lines, 750 ms without them


    -QuickC

    Monday, February 27, 2012 2:42 PM
  • I get an average of 323ms with the lines commented vs 325ms with the lines uncommented. With optimized code, the unused variable is removed and there is no pinning. The difference comes from the allocation of the 'hole' array.

    Removing optimization, I get 930ms with the lines commented, 1050ms with the lines uncommented.

    Monday, February 27, 2012 4:05 PM
  • Louis and Marcel,

    Any comments on your versions of .NET and VS?


    -QuickC

    Monday, February 27, 2012 5:47 PM
  • QuickC,
    Louis,

    i3, .NET 3.5, XP - 253 ms with the var hole and fixed lines, 275 ms without them (compiled on VS2008)
    i3, .NET 4.0, XP - 258 ms with the var hole and fixed lines, 338 ms without them (compiled on VS2010)
    i5, .NET 3.5, W7 - 196 ms with the var hole and fixed lines, 216 ms without them (compiled on VS2008)
    i5, .NET 4.0, W7 - 269 ms with the var hole and fixed lines, 275 ms without them (compiled on VS2010)

    (also tested on AMD multicore processors with similar small differences)

    As you can see, I'm not getting the big variations mentioned by QuickC. On the other hand: Allocations are really cheap in .NET. I can't imagine how allocating an array of length 1 (and subsequent pinning) could possibly take 22-100ms. My impression is that we're getting into micro-optimization, and - as I said before - I'm not at all sure as to what exactly we are measuring here.

    Marcel


    • Edited by Marcel RomaMVP Monday, February 27, 2012 6:12 PM
    • Marked as answer by QuickC Monday, February 27, 2012 11:10 PM
    Monday, February 27, 2012 6:11 PM
  • Two quick comments

    1) If you suspect code generation differences, you can confirm or deny that absolutely by simply looking at the generated code.  See http://blogs.msdn.com/b/vancem/archive/2006/02/20/how-to-use-visual-studio-to-investigate-code-generation-questions-in-managed-code.aspx for details on how to do this.

    2) The alignment of loops can make a non-trivial difference for microbenchmarks.  This could explain some of the variability.  If that were true, other changes at the head of the method would have similar effects (making the perf change); a rough sketch of that experiment follows below.  This is also worth trying.
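
    A possible way to test the alignment idea (a sketch only; the dummy work and the loop body are stand-ins for the real benchmark):

        using System;
        using System.Diagnostics;

        class AlignmentProbe
        {
            // Keep the measured loop identical and only vary unrelated code at
            // the head of the method, then compare the timings of the two modes.
            static long TimeLoop(bool dummyAtMethodHead)
            {
                if (dummyAtMethodHead)
                {
                    // Unrelated work before the loop; it may shift where the JIT
                    // places the loop body and therefore its alignment in memory.
                    var filler = new int[1];
                    GC.KeepAlive(filler);
                }

                long sum = 0;
                var sw = Stopwatch.StartNew();
                for (var i = 1; i <= 100000000; i++)
                    sum += i & 1;                 // stand-in for the real loop body
                sw.Stop();
                Console.WriteLine("{0} ms (sum = {1})", sw.ElapsedMilliseconds, sum);
                return sw.ElapsedMilliseconds;
            }

            static void Main()
            {
                TimeLoop(false);
                TimeLoop(true);
            }
        }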

    Monday, February 27, 2012 9:59 PM
  • Vance,

    Thanks for the link; it should lead to a good guess!


    -QuickC

    Monday, February 27, 2012 11:11 PM
  • There it is,

    The extra variable causes a subtle change: one more register is used to hold the unused variable, causing swapping through EDX instead of leaving the value in EDI as in the faster case.  Marcel must have a newer CPU than the i7-920; my i7-920 seems to really choke by comparison.

    So the code is Different!!!

    Thanks, guys.  I have a deeper toolbox now.


    -QuickC

    Monday, February 27, 2012 11:28 PM