.NET Array Assignment and Reference Optimization

  • Question

  • Hello,

    Does anyone know if .NET is able to automatically optimize the following code condition?


    public class Customer
    {
       public int age = 27;
       public string name = "Bob Jones";
       public string address = "XXXX N Chicago";
       public List<string> personality = new List<string>();
    }

    public Customer[] CustomersArray = new Customer[5000];

    // Some code that inefficiently assigns each customer a value

    for (int i = 0; i < CustomersArray.Length; i++)
    {
      CustomersArray[i] = new Customer();

      CustomersArray[i].age = random.Next(1, 100);  // random: a Random instance
      CustomersArray[i].name = GetRandomName();
      CustomersArray[i].address = GetRandomAddress();
      CustomersArray[i].personality = GetRandomPersonality();
    }

    // Wouldn't the code below run much faster, since it does not have to re-fetch the array element before each assignment?  Will the loop above be optimized into this?

    for (int i = 0; i < CustomersArray.Length; i++)
    {
      Customer TempCustomer = new Customer();

      TempCustomer.age = random.Next(1, 100);
      TempCustomer.name = GetRandomName();
      TempCustomer.address = GetRandomAddress();
      TempCustomer.personality = GetRandomPersonality();

      CustomersArray[i] = TempCustomer;
    }

    // I looked at the disassembly for both the debug and release versions of the first loop, and it does not appear that any optimization is done.  Could the runtime be performing the optimization?

    Thanks in advance!

    Ervin


    Monday, December 8, 2008 8:55 PM

Answers

  • The C# compiler will not optimize this, but the JIT definitely might.  You'd have to use the CLR debugger tool to examine the assembly code to verify.

    The CPU will definitely recognize the repeated array de-referencing and optimize accordingly.  It could easily hide any inefficiencies this way.

    In any case, 5000 items is too small to notice any difference.  You have to get into the millions before these sorts of details start to show up.

    -Ryan
    • Marked as answer by Swerrrvin Tuesday, December 9, 2008 1:58 PM
    Tuesday, December 9, 2008 12:14 AM
  • I don't see how the runtime could possibly optimize this code.  Unless CustomerArray is a local variable, any such optimization would be thread-unsafe because another thread might put a different Customer in any given CustomerArray[i] slot between any two property assignments during the loop.

    You should always use the TempCustomer variant -- it's not just faster, it's also easier to debug the individual property assignments, and if some future version adds multithreaded access to the array there's only one point of contention (the final CustomerArray assignment).

    • Marked as answer by Swerrrvin Tuesday, December 9, 2008 1:58 PM
    Tuesday, December 9, 2008 9:19 AM

All replies

  • Thanks guys.  Both answers apply to my app.  In my real-time multi-threaded app, I actually have a buffer of about 50000 "customer" objects where data is added synchronously by one thread and retrieved out-of-order and asynchronously by a bunch of other threads.  I generally process about 50-100 customers a second, 24 hours a day. I was looking for ways to eke out extra performance and reduce latency when I discovered the poor code.

    Ervin
    Tuesday, December 9, 2008 2:06 PM
  • [edit: corrections and details in a later post. see below]
     And the answer is...


    The JIT compiler does optimize the "inefficient" assignment case. Only one bounds check is performed; and the assignments are made with one instruction per assignment, thereafter.

    Whether the second case runs efficiently is a bit more complicated. The problem is the "new Customer()" call: eventually those allocations will trigger a garbage collection, incurring quite a bit of extra execution time. In real-world apps, except under disaster conditions, garbage collections occur at idle time and are effectively free, and the allocation calls in the CLR are wickedly efficient (the vast majority of allocations require only incrementing a pointer to the next free memory location).  However, if this code is truly performance critical -- executed millions of times -- then chances are the allocations will end up triggering collections inside the hot path.

    The "inefficient" case is actually faster in release, non-debug code.

    As to why the JIT can optimize this code... unless there's a "volatile" keyword involved somewhere, the compiler is free to make optimizations that are not thread-safe. The compiler does have to make some assumptions to guarantee memory integrity -- namely that a garbage collection will not occur at arbitrary points in the code (pre-empts for garbage collection can only occur at "safe points" in the code, so this is true), and that the target Customer object will not move between the time that a pointer to it is developed and the time the actual assignments are made (a straightforward consequence of the fact that pre-empts for garbage collection occur only at those safe points).

    Curiously, and somewhat counter-intuitively, the JIT compiler won't make the optimization if Customer is a struct, rather than a class. In that case, release non-debug code develops the interior pointer from scratch, and performs bounds checking (actually the most expensive part because it requires a compare/branch) each time, for each assignment. For structs, the single assignment code is overwhelmingly faster. It's not immediately clear to me why the JIT can't make the assumptions in that case. Perhaps because an interior pointer to an array is a far more fragile thing than a pointer to an object. This is bad news for math-intensive code (for example performing math on an array of complex numbers).

    <silently cursing, because I do have a large body of performance critical code that does perform vector processing of structs>
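    The single-assignment workaround for struct arrays described above can be sketched as follows. This is a minimal illustration, not code from the thread; `Complex` is a hypothetical struct standing in for any math-heavy value type:

    ```csharp
    // Minimal sketch: for struct arrays, build the value in a local
    // and write it back once, instead of assigning field-by-field
    // through the array (which re-derives the interior pointer and
    // may repeat the bounds check for every assignment).
    struct Complex
    {
        public double Re;
        public double Im;
    }

    class Program
    {
        static void Main()
        {
            var data = new Complex[1000];

            for (int i = 0; i < data.Length; i++)
            {
                Complex c;      // local copy, held in registers/stack
                c.Re = i;
                c.Im = -i;
                data[i] = c;    // single bounds check + block copy
            }

            System.Console.WriteLine(data[10].Re);  // prints 10
        }
    }
    ```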
    Tuesday, December 9, 2008 3:24 PM
  • I'm afraid I must disagree with your claims, Robin.

    1. >>Only one bounds check is performed<<

    There are no bounds checks at all, in either case, because the JIT compiler special-cases loops from zero to Array.Length - 1.  Note that array sizes (but not array contents!) are immutable in .NET, so this optimization causes no threading issues.  The inefficiency here is not about bounds checking, but about re-fetching the current array element (i.e. the Customer reference stored in the array) from shared memory for every property assignment.  I don't see how that can be avoided; see below.

    2. >>The problem is with the "new Customer()" call. Ultimately, this will generate a garbage collect, incurring quite a bit of extra execution time.<<

    How so?  You are allocating exactly one "new Customer()" in both cases -- have you overlooked the first line in the first example's loop?  There are no extra objects to be garbage-collected in the second case.  You make just one object, then you assign the reference to the current array location.

    3.  >>unless there's a "volatile" keyword involved somewhere, the compiler is free to make optimizations that are not thread-safe.<<

    That would be true if the assignment in question were made to a memory location that could in fact be designated as volatile.  This, however, is not the case.  If you marked the array as volatile then only the array reference itself would be protected from thread-unsafe optimizations (like omitting bounds checks), but not any *elements* within the array.  To my knowledge there is no way to protect those, unless you write your own array wrapper.  Therefore, the runtime cannot make any assumptions here and must play it safe.

    4. >>the JIT compiler won't make the optimization if Customer is a struct<<

    I seem to recall that this was an older runtime defect that has been fixed in a recent version... wasn't there some discrepancy between the 32-bit and 64-bit runtime, too?
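    Regarding point 3: one way to get per-element protection without language support is the array-wrapper idea mentioned above. A hypothetical sketch (the type and member names are mine, not from the thread) built on Interlocked, whose operations imply full memory barriers:

    ```csharp
    using System.Threading;

    // Hypothetical "array wrapper" from point 3: volatile cannot be
    // applied to array elements, but Interlocked operations give
    // fenced, atomic access to an individual element.
    class AtomicRefArray<T> where T : class
    {
        private readonly T[] items;

        public AtomicRefArray(int length) { items = new T[length]; }

        public T Read(int index)
        {
            // CompareExchange with identical value/comparand is a
            // no-op write, i.e. a fenced read of the current element.
            return Interlocked.CompareExchange(ref items[index], null, null);
        }

        public void Write(int index, T value)
        {
            Interlocked.Exchange(ref items[index], value);
        }
    }
    ```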

    Wednesday, December 10, 2008 8:26 AM
  • Thanks again for the insight.

    I understand.  The compiler should not (and cannot) optimize the first case into the second, because it cannot assume that each re-fetch of the array element will return the same reference.  In a multi-threaded app, another thread can replace an array element between fetches, which would affect every subsequent fetch before all the assignments complete.  In the second case, you don't have to worry about the array element changing before all the property assignments are complete -- only about changes that happen before the new customer is assigned back into the array...

    I've noticed that in the first case the disassembly for each property fetch-and-assign was surprisingly long and complicated.  Hence my decision to optimize.  The assembly is far simpler in the second case.

    First case:

    CustomerArray[i].age = CurrentCustomer.Age

    00000bae  mov         eax,dword ptr [ebp-3Ch]
    00000bb1  mov         eax,dword ptr [eax+40h]
    00000bb4  mov         edx,dword ptr [ebp-3Ch]
    00000bb7  mov         edx,dword ptr [edx+000000B8h]
    00000bbd  cmp         edx,dword ptr [eax+4]
    00000bc0  jb          00000BC7
    00000bc2  call        751B9913
    00000bc7  mov         eax,dword ptr [eax+edx*4+0Ch]
    00000bcb  mov         edx,dword ptr [ebp-3Ch]
    00000bce  mov         edx,dword ptr [edx+44h]
    00000bd1  fld         qword ptr [edx+0Ch]
    00000bd4  fstp        qword ptr [eax+0Ch]

    Second case:

    // Assumption: TempCustomer = CustomerArray[i]

    TempCustomer.age = CurrentCustomer.Age

    00000b9f  mov         eax,dword ptr [ebp-54h]
    00000ba2  mov         edx,dword ptr [ebp-3Ch]
    00000ba5  mov         edx,dword ptr [edx+44h]
    00000ba8  fld         qword ptr [edx+0Ch]
    00000bab  fstp        qword ptr [eax+0Ch]

    I'm going to try using a profiler to see how these changes affect the performance of the code after JIT optimization.

    Fortunately, I realized early the importance of avoiding triggering garbage collections.  In my actual code, I create a circular buffer of customers and initialize it with new customers when the applications first starts.  Later when new customer data arrives, I just overwrite the properties of each customer as new data arrives.
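    The pre-allocation scheme described above might look roughly like this. This is only a sketch with hypothetical names, not the actual application code:

    ```csharp
    // Rough sketch of a pre-allocated circular buffer: every slot is
    // allocated once at startup, and incoming data overwrites the
    // oldest slot's fields, avoiding per-message allocations (and
    // therefore GC pressure) on the hot path.
    class Customer
    {
        public int Age;
        public string Name;
    }

    class CustomerRing
    {
        private readonly Customer[] slots;
        private int next;

        public CustomerRing(int capacity)
        {
            slots = new Customer[capacity];
            for (int i = 0; i < slots.Length; i++)
                slots[i] = new Customer();   // allocate once, up front
        }

        // Reuse the oldest slot in place instead of allocating.
        public Customer Add(int age, string name)
        {
            Customer c = slots[next];
            c.Age = age;
            c.Name = name;
            next = (next + 1) % slots.Length;
            return c;
        }
    }
    ```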

    Ervin
    Wednesday, December 10, 2008 5:54 PM
  • Yep. Sorry. I didn't get it quite right. I mistook a string assignment call for a range check. Probably too much information, but I was curious, and the results were interesting.

    I simplified code of the loop so that the code generated by assigns would be clearer.

    The actual code under test is:

               for (int i = 0; i < c; ++i)  
                {  
                    testArray[i].name = "Name";  
                    testArray[i].value1 = i;  
                    testArray[i].value2 = i;  
                    testArray[i].value3 = i;  
                } 


    The generated code (release optimization, run without debugger, attach later -- which allows full JIT optimization). testArray element is a class:

     
    ;  ... i < c ...  
    007F0184  cmp         ecx,ebx    
    007F0186  jae         007F01B5   
     
    ;  testArray[i].name = "Name";  
    007F0188  mov         esi,dword ptr [edi+ecx*4+0Ch]  ; testArray[i]  
    007F018C  mov         eax,dword ptr ds:[2D72030h]   
    007F0192  lea         edx,[esi+4]   
    007F0195  call        71772D90                    ; copy the string.   
     
    ; testArray[i].value1 = i;  
    ; testArray[i].value2 = i;  
    ; testArray[i].value3 = i;  
     
    007F019A  mov         esi,dword ptr [edi+ecx*4+0Ch]  ; single testArray[i]  
    007F019E  mov         dword ptr [esi+8],ecx      ; value1 = i;  
    007F01A1  mov         dword ptr [esi+0Ch],ecx    ; value2 = i;  
    007F01A4  mov         dword ptr [esi+10h],ecx    ; value3 = i;  
     
    ; ++i  
    007F01A7  add         ecx,1   
    ; i < c 
    007F01AA  cmp         ecx,dword ptr [ebp-10h]   
    007F01AD  jl          007F0184  


    Note how the last three testArray[i] dereferences are folded into a single dereference. Apparently, testArray[i] must be dereferenced again after the string assignment call, because the return from the string assignment call may have incurred a compacting garbage collect. The last three references assume NON-volatile semantics for testArray[i]. (volatile semantics would have required testArray[i] to be dereferenced for each assign).

    You are correct about the range check -- it's optimized out.

    Compare this to what happens if the testArray element type is converted to struct:


    ; testArray[i].name = "Name";  
    007C0198  mov         eax,dword ptr ds:[02C51EE8h]  ; testArray  
    007C019D  cmp         esi,dword ptr [eax+4]         ; range check  
    007C01A0  jae         007C0203   
     
    007C01A2  mov         edx,esi   
    007C01A4  shl         edx,4   
    007C01A7  lea         edx,[eax+edx+8]    ; ea of testArray[i].Name  
    007C01AB  mov         eax,dword ptr ds:[2C52030h]   
    007C01B1  call        71772528             ; string assign  
     
    ; testArray[i].value1 = i;  
    007C01B6  mov         eax,dword ptr ds:[02C51EE8h]  ; testArray  
     
    007C01BB  cmp         esi,dword ptr [eax+4]         ; range check again!  
    007C01BE  jae         007C0203   
     
    007C01C0  mov         edx,esi   
    007C01C2  shl         edx,4   
    007C01C5  lea         eax,[eax+edx+8]  ; ea of testArray[i].value1;  
    007C01C9  mov         dword ptr [eax+4],esi  ; assign;  
     
    ; testArray[i].value2 = i;  
    007C01CC  mov         eax,dword ptr ds:[02C51EE8h]  ; and again!  
    007C01D1  cmp         esi,dword ptr [eax+4]   
    007C01D4  jae         007C0203   
    007C01D6  mov         edx,esi   
    007C01D8  shl         edx,4   
    007C01DB  lea         eax,[eax+edx+8]   
    007C01DF  mov         dword ptr [eax+8],esi   
     
    ;  testArray[i].value3 = i;  
    007C01E2  mov         eax,dword ptr ds:[02C51EE8h]   
    007C01E7  cmp         esi,dword ptr [eax+4]   
    007C01EA  jae         007C0203   
    007C01EC  mov         edx,esi   
    007C01EE  shl         edx,4   
    007C01F1  lea         eax,[eax+edx+8]   
    007C01F5  mov         dword ptr [eax+0Ch],esi   
     
    ; ++i  
    007C01F8  add         esi,1   
    ; i < c 
    007C01FB  cmp         esi,edi   
    007C01FD  jl          007C0198   
     

    In this case, there's no folding of array references, bounds checks are performed every time. Running 3.5 SP1. In short, struct assignment is a bit of a disaster.


    Compare this against the case where a struct is created on the stack and copied in one assignment into the array.

              int c = testArray.Length;  
                for (int i = 0; i < c; ++i)  
                {  
                    TestObject o = new TestObject();  
                    o.name = "Name";  
                    o.value1 = i;  
                    o.value2 = i;  
                    o.value3 = i;  
     
                    testArray[i] = o;  
                } 

    The generated code is pretty interesting.


     
    ; "Name" -> edi temp reg.  
    015D019F  mov         edi,dword ptr ds:[2B22030h]   
     
    ; o.value1 = i;  (spuriously "enregistered" into a stack frame temporary that's never used)  
    015D01A5  mov         dword ptr [ebp-14h],ecx ; on stack frame, but not used later!  
     
    ; o.value2 = i;  (in ebx temp reg)  
    015D01A8  mov         ebx,ecx   
     
    ; o.value3 = i;   (in stack frame)  
    015D01AA  mov         dword ptr [ebp-18h],ecx   
     
    ; testArray[i] = o;  
    015D01AD  mov         eax,dword ptr ds:[02B21EE8h]  ; testArray  
    015D01B2  cmp         ecx,dword ptr [eax+4]  ; Range check.  
    015D01B5  jae         015D01E5   
     
    015D01B7  mov         edx,ecx   
    015D01B9  shl         edx,4   
    015D01BC  lea         esi,[eax+edx+8]  ; ea of testArray[i]  
    015D01C0  lea         edx,[esi]   
    015D01C2  call        71772608  ; string assign.  
     
    015D01C7  mov         eax,ebx  ; use ebx, even though we have ebp-14 stack frame, and i in ecx !  
    015D01C9  mov         dword ptr [esi+4],eax ; value1 assign  
    015D01CC  mov         dword ptr [esi+8],ebx  ; value2 assign  
    015D01CF  mov         eax,dword ptr [ebp-18h]  ;   o.value3  
    015D01D2  mov         dword ptr [esi+0Ch],eax ; value3  
     
    ; end of for loop   
    015D01D5  add         ecx,1   
    015D01D8  cmp         ecx,dword ptr [ebp-10h]   
    015D01DB  jl          015D019F   
     

    No range check. The working struct is partially enregistered, and then the memberwise assignment is inlined. Curiously, the code saves a value of i to a stack temporary that never actually gets used.  Definitely preferable to assigning member by member into the array. In fairness, the enregistration oddities are probably due to misoptimization when the same value is assigned three times; in a more normal case, that wouldn't be in play.


    Thursday, December 11, 2008 11:03 AM
  • Thanks for testing this, Robin.  I'm rather shocked to see that array elements are in fact not dereferenced on each assignment.  That means if you ever anticipate thread contention over the modification of array elements you'd have to use either VolatileRead on each access or some other explicit synchronization.  Seems like a fairly severe drawback for an optimization that's easily done manually.

    Of course I'm even more shocked at the terrible code generation for arrays of structs.  I searched for the struct optimization issues that I recalled but they were unrelated to this one.  I don't think I've ever heard of this particular issue before -- has it been published anywhere yet?

    I noticed that you copied Array.Length to a separate local variable called c in all your examples.  Could you try re-running all your tests without this variable, comparing explicitly to testArray.Length instead?  In your second and third example, the optimizer may not "see" the opportunity for getting rid of range checks if the upper bound is hidden in a variable.
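    For reference, the two loop shapes under comparison can be written side by side. A minimal sketch (my own example, not the thread's test code); per the discussion above, whether the hoisted form also qualifies for range-check elimination may vary by runtime version:

    ```csharp
    // Two ways to bound the loop. The JIT's range-check elimination
    // is keyed to the first pattern, where the bound is written
    // directly as arr.Length and arr is not reassigned in the loop.
    static class Loops
    {
        public static long SumDirect(int[] arr)
        {
            long sum = 0;
            for (int i = 0; i < arr.Length; i++)   // per-iteration check eliminated
                sum += arr[i];
            return sum;
        }

        public static long SumHoisted(int[] arr)
        {
            long sum = 0;
            int c = arr.Length;                    // bound hidden in a local
            for (int i = 0; i < c; i++)            // check may or may not remain
                sum += arr[i];
            return sum;
        }
    }
    ```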

    Thursday, December 11, 2008 11:49 AM
  • Very cool stuff.  I'm definitely quite impressed with the power of the JIT to optimize de-referencing.  It seems that when the compiler and CLR optimize code, they assume the application is single-threaded and optimize accordingly.  I already went in and changed all my code to the second case, mainly because I could not afford the risk of leaving the optimization to the compiler -- not to mention the code is a heck of a lot cleaner.  Now I can't help but wonder how my new JIT-optimized code compares to the old JIT-optimized code.  Yikes...

    So the trick to see JIT'd code is to run the compiled release executable, load up the project, and then attach to the already running executable.  I'll go back and take a look at both case using this method.  Also, seems like a pretty good time to sit down and get cozy with a good profiler too.

    FYI, I'll also be investigating how de-referencing of more complex collections (dictionaries, sorted lists, etc.) is handled and optimized.  Can't help but wonder if .NET uses any tricks to avoid having to call GetHashCode multiple times on an object that hasn't changed since it was previously looked up -- for example, when using strings as keys.  Perhaps it uses mechanisms similar to those behind its strong string-comparison performance.

    Thanks again.

    Ervin
    Thursday, December 11, 2008 11:28 PM
  • Chris Nahr said:

    Of course I'm even more shocked at the terrible code generation for arrays of structs.  I searched for the struct optimization issues that I recalled but they were unrelated to this one.  I don't think I've ever heard of this particular issue before -- has it been published anywhere yet?

    I noticed that you copied Array.Length to a separate local variable called c in all your examples.  Could you try re-running all your tests without this variable, comparing explicitly to testArray.Length instead?  In your second and third example, the optimizer may not "see" the opportunity for getting rid of range checks if the upper bound is hidden in a variable.



    I haven't seen a whole lot of public analysis of JIT code generation. If there is a published bug against this, my analysis would be an independent discovery of the problem.

    With respect to "i < testArray.Length" vs. "i < c", I had actually wondered about that. Some of the tests were run both ways in the original analysis, and code generation within the for loop was identical in both cases. But I reran the "testArray[i].value1 = i;"/testArray-elements-are-structs case one more time just to be certain. Replacing "i < c" with "i < testArray.Length" does not make a difference in code generation: bounds checks occur on every access, and the array de-reference still occurs for each assignment.

    Details of the analysis: .net 3.5 SP1, running on 32-bit Vista, processor is an AMD quad-core, release build.

    Here's the code that was used. Code was captured by running the compiled app and then attaching to the running process. Debug\Break... consistently leaves the instruction pointer of the Main() thread somewhere in the for loop.
    using System;  
    using System.Collections.Generic;  
    using System.Linq;  
    using System.Text;  
     
    namespace OptimizationTest  
    {  
        public struct TestObject  
        {  
            public string name;  
            public int value1;  
            public int value2;  
            public int value3;  
        }  
     
        class Program  
        {  
     
            static TestObject[] testArray = new TestObject[10000];  
     
     
     
            static void SlowAssignment()  
            {  
                for (int i = 0; i < testArray.Length; ++i)  
                {  
                    testArray[i].name = "Name";  
                    testArray[i].value1 = i;  
                    testArray[i].value2 = i;  
                    testArray[i].value3 = i;  
     
                }  
            }  
     
            static void Time(Action action)  
            {  
                for (int i = 0; i < 1000000; ++i)  
                {  
                    action();  
                }  
            }  
     
            static void Init()  
            {  
                for (int i = 0; i < testArray.Length; ++i)  
                {  
                    testArray[i] = new TestObject();  
                }  
            }  
            static void Main(string[] args)  
            {  
                Init();  
     
                Time(SlowAssignment);  
            }  
        }  
    }  
     
    Friday, December 12, 2008 1:55 AM
  • Okay, now it gets very interesting.  I just reran your struct test on my own system, which is also Visual Studio 2008 with .NET 3.5 SP1 but running on 64-bit Vista... and two superfluous range checks are optimized away!

    00000000  push        rbx    
    00000001  push        rsi    
    00000002  push        rdi    
    00000003  sub         rsp,20h   
    00000007  mov         rax,127D2DC8h   
    00000011  mov         rax,qword ptr [rax]   
    00000014  mov         rcx,qword ptr [rax+8]   
    00000018  test        ecx,ecx   
    0000001a  jle         0000000000000077   
    0000001c  xor         ebx,ebx   
    0000001e  xchg        ax,ax   
    00000020  mov         rsi,127D2DC8h   
    0000002a  mov         rcx,qword ptr [rsi]   
    0000002d  movsxd      rdi,ebx   
    00000030  mov         rax,qword ptr [rcx+8]   
    00000034  cmp         rdi,rax   
    00000037  jae         0000000000000080   
    00000039  lea         rax,[rdi+rdi*2]   
    0000003d  mov         rdx,127D3050h   
    00000047  mov         rdx,qword ptr [rdx]   
    0000004a  lea         rcx,[rcx+rax*8+10h]   
    0000004f  call        FFFFFFFFF1C1F460   
    00000054  mov         rcx,qword ptr [rsi]   
    00000057  mov         rdx,qword ptr [rcx+8]   
    0000005b  cmp         rdi,rdx   
    0000005e  jae         0000000000000080   
    00000060  lea         rax,[rdi+rdi*2]   
    00000064  mov         dword ptr [rcx+rax*8+18h],ebx   
    00000068  mov         dword ptr [rcx+rax*8+1Ch],ebx   
    0000006c  mov         dword ptr [rcx+rax*8+20h],ebx   
    00000070  add         ebx,1   
    00000073  cmp         ebx,edx   
    00000075  jl          0000000000000020   
    00000077  add         rsp,20h   
    0000007b  pop         rdi    
    0000007c  pop         rsi    
    0000007d  pop         rbx    
    0000007e  rep ret            
    00000080  call        FFFFFFFFF1FBF7F0   
    00000085  nop               

    This is the whole SlowAssignment method according to my stack trace, and unless I'm mistaken the three mov instructions at addresses 64, 68, and 6c represent the assignments to the value1...value3 components of the structure.  There does appear to be an additional superfluous range check after the previous Name assignment, strangely enough.

    I also ran the same test in 32-bit mode and can confirm that in this case, there are four range checks -- one before each assignment.  Then I ran the test with a test object rather than a test structure, and once again got the same result as you in 32-bit mode -- just a single range check.

    But then I ran the object test in 64-bit mode... and got the following code:

    00000000  push        rbx    
    00000001  push        rsi    
    00000002  push        rdi    
    00000003  sub         rsp,20h   
    00000007  mov         rax,12702DC8h   
    00000011  mov         rax,qword ptr [rax]   
    00000014  mov         rcx,qword ptr [rax+8]   
    00000018  test        ecx,ecx   
    0000001a  jle         0000000000000075   
    0000001c  xor         ebx,ebx   
    0000001e  xchg        ax,ax   
    00000020  mov         rsi,12702DC8h   
    0000002a  mov         rcx,qword ptr [rsi]   
    0000002d  movsxd      rdi,ebx   
    00000030  mov         rax,qword ptr [rcx+8]   
    00000034  cmp         rdi,rax   
    00000037  jae         0000000000000080   
    00000039  mov         rcx,qword ptr [rcx+rdi*8+18h]   
    0000003e  mov         rdx,12703050h   
    00000048  mov         rdx,qword ptr [rdx]   
    0000004b  add         rcx,8   
    0000004f  call        FFFFFFFFF264F420   
    00000054  mov         rax,qword ptr [rsi]   
    00000057  mov         rcx,qword ptr [rax+8]   
    0000005b  cmp         rdi,rcx   
    0000005e  jae         0000000000000080   
    00000060  mov         rax,qword ptr [rax+rdi*8+18h]   
    00000065  mov         dword ptr [rax+10h],ebx   
    00000068  mov         dword ptr [rax+14h],ebx   
    0000006b  mov         dword ptr [rax+18h],ebx   
    0000006e  add         ebx,1   
    00000071  cmp         ebx,ecx   
    00000073  jl          0000000000000020   
    00000075  add         rsp,20h   
    00000079  pop         rdi    
    0000007a  pop         rsi    
    0000007b  pop         rbx    
    0000007c  rep ret            
    0000007e  xchg        ax,ax   
    00000080  call        FFFFFFFFF29EF7B0   
    00000085  nop               

    As you can see the object case is almost identical to the structure case on a 64-bit system -- there are two range checks, one before the string assignment and one before the three integer assignments.

    So there's a big inefficiency in the 32-bit code generator for arrays of structs, and a small inefficiency in the 64-bit code generator for... assigning different property types?

    This is all very curious.  I've sent a link to this thread to Microsoft's CLR Code Generation blog (http://blogs.msdn.com/clrcodegeneration/), perhaps the experts will chime in on this issue.
    Friday, December 12, 2008 9:55 AM