Strange Visual Studio 2008 Optimization Regression

  • Question

  •  

    Hi all,

     

    I’m hoping that some .NET C++/CLI MSIL guru can help explain this very strange phenomenon that I’m experiencing in Visual Studio 2008.   I’m getting significantly slower code for a very critical function in a scientific application we are developing.   This function runs much slower only when compiled with optimization.  

     

    Here are some times I get with various build/run configurations:

     

    AMD Opteron 256

    Generation of 1,000,000,000 random values

     

    Debug Code; Inside Debugger: 6 Seconds 5.8170 Milliseconds

    Optimized Code; Inside Debugger: 5 Seconds 953.1670 Milliseconds

    Optimized Code; Outside Debugger: 11 Seconds 31.2150 Milliseconds 

     

    As you can see, the optimized code produces the best performance, but only when running inside the debugger (when JIT optimizations are disabled)!  Outside the debugger it is quite slow, even slower than the non-optimized code.

     

    I’ve included snapshots of the code in question, the IL (both optimized and non-optimized), and the ASM produced in all three cases.   If someone can shed some light on this problem, it would be greatly appreciated.   It seems to me, from looking at the assembly code, that the JIT compiler is adding some type of exception handler or security code, but I’m not sure, nor do I understand why it would be doing this only for optimized code running outside the debugger.

     

    I’ve simplified the code and reduced some of the namespaces for clarity.  

     

    Thanks much,

    James C. Papp

     

    *** C++/CLI Code (Simplified) ***

    Code Snippet

    extern "C"
    {
      [SuppressUnmanagedCodeSecurity]
      void __declspec(nothrow) __stdcall StreamingMersenneTwisterRefresh(unsigned __int32 * const pulStates);
    }

     

    [SuppressUnmanagedCodeSecurity]
    public ref class StreamingMersenneTwisterRandom sealed : public IRandom
    {
      ...

      public:
        virtual unsigned __int32 NextUInt()
        {
          if (m_paulIndex != m_paulEndSentinel)
          {
            return *m_paulIndex++; 
          }

          StreamingMersenneTwisterRefresh(m_paulStateTable);

          m_paulIndex -= 624 - 1;

          return *m_paulStateTable;
        }
    };
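    For readers without the full project, the buffer-and-refill pattern above can be sketched in plain ISO C++.  This is illustrative only: std::mt19937 stands in for the unmanaged StreamingMersenneTwisterRefresh routine, and all names are made up for the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Plain-C++ sketch of the buffered-generator pattern from the post.
// std::mt19937 plays the role of the hand-coded asm refresh routine.
class BufferedTwister {
    static const int kTableSize = 624;       // MT19937 state size
    std::uint32_t m_table[kTableSize];
    std::uint32_t* m_index;                  // next value to hand out
    std::uint32_t* m_endSentinel;            // one past the last value
    std::mt19937 m_refill;                   // stand-in for the refresh routine
public:
    BufferedTwister()
        : m_index(m_table + kTableSize),     // start "empty": force a refill
          m_endSentinel(m_table + kTableSize) {}

    std::uint32_t NextUInt() {
        if (m_index != m_endSentinel)
            return *m_index++;               // fast path: one load, one add

        for (int i = 0; i < kTableSize; ++i) // slow path: refresh the table
            m_table[i] = m_refill();

        m_index -= kTableSize - 1;           // index now points at m_table + 1
        return *m_table;                     // hand out m_table[0] directly
    }
};
```

    Because the refill simply drains std::mt19937 in order, the sketch reproduces that engine's output stream exactly, which makes the fast-path/slow-path split easy to sanity-check.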

     

         

    *** DEBUG IL ***

    Code Snippet
    .method public hidebysig newslot virtual final instance uint32 NextUInt() cil managed
    {
        .maxstack 4
        .locals (
            [0] uint32 num)
        L_0000: ldarg.0
        L_0001: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
        L_0006: ldarg.0
        L_0007: ldfld uint32* modopt([mscorlib]CompilerServices.IsConst) modopt([mscorlib]CompilerServices.IsConst) StreamingMersenneTwisterRandom::m_paulEndSentinel
        L_000c: beq.s L_0026
        L_000e: ldarg.0
        L_000f: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
        L_0014: ldind.i4
        L_0015: ldarg.0
        L_0016: dup
        L_0017: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
        L_001c: ldc.i4.4
        L_001d: add
        L_001e: stfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
        L_0023: stloc.0
        L_0024: br.s L_0053
        L_0026: ldarg.0
        L_0027: dup
        L_0028: ldfld uint32* modopt([mscorlib]CompilerServices.IsConst) modopt([mscorlib]CompilerServices.IsConst) StreamingMersenneTwisterRandom::m_paulStateTable
        L_002d: stfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
        L_0032: ldarg.0
        L_0033: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
        L_0038: call void modopt([mscorlib]CompilerServices.CallConvStdcall) ::StreamingMersenneTwisterRefresh(uint32* modopt([mscorlib]CompilerServices.IsConst) modopt([mscorlib]CompilerServices.IsConst))
        L_003d: ldarg.0
        L_003e: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
        L_0043: ldind.i4
        L_0044: ldarg.0
        L_0045: dup
        L_0046: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
        L_004b: ldc.i4.4
        L_004c: add
        L_004d: stfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
        L_0052: stloc.0
        L_0053: ldloc.0
        L_0054: ret
    }

     


    *** OPTIMIZED IL ***

    Code Snippet

    .method public hidebysig newslot virtual final instance uint32 NextUInt() cil managed
    {
        .maxstack 4
        .locals (
            [0] uint32* numPtr)
        L_0000: ldarg.0
        L_0001: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
        L_0006: stloc.0
        L_0007: ldloc.0
        L_0008: ldarg.0
        L_0009: ldfld uint32* modopt([mscorlib]CompilerServices.IsConst) modopt([mscorlib]CompilerServices.IsConst) StreamingMersenneTwisterRandom::m_paulEndSentinel
        L_000e: beq.s L_001c
        L_0010: ldloc.0
        L_0011: ldind.i4
        L_0012: ldarg.0
        L_0013: ldloc.0
        L_0014: ldc.i4.4
        L_0015: add
        L_0016: stfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
        L_001b: ret
        L_001c: ldarg.0
        L_001d: ldfld uint32* modopt([mscorlib]CompilerServices.IsConst) modopt([mscorlib]CompilerServices.IsConst) StreamingMersenneTwisterRandom::m_paulStateTable
        L_0022: call void modopt([mscorlib]CompilerServices.CallConvStdcall) ::StreamingMersenneTwisterRefresh(uint32* modopt([mscorlib]CompilerServices.IsConst) modopt([mscorlib]CompilerServices.IsConst))
        L_0027: ldarg.0
        L_0028: dup
        L_0029: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
        L_002e: ldc.i4 0x9bc
        L_0033: sub
        L_0034: stfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
        L_0039: ldarg.0
        L_003a: ldfld uint32* modopt([mscorlib]CompilerServices.IsConst) modopt([mscorlib]CompilerServices.IsConst) StreamingMersenneTwisterRandom::m_paulStateTable
        L_003f: ldind.i4
        L_0040: ret
    }

     

    *** ASM - DEBUG CODE - INSIDE DEBUGGER ***

    Code Snippet

       382:         virtual unsigned __int32 NextUInt()
       383:         {
       384:           if (m_paulIndex != m_paulEndSentinel)
    00000000 57               push        edi 
    00000001 56               push        esi 
    00000002 53               push        ebx 
    00000003 8B F1            mov         esi,ecx
    00000005 83 3D 08 6C AE 05 00 cmp         dword ptr ds:[05AE6C08h],0
    0000000c 74 05            je          00000013
    0000000e E8 B4 24 CD 73   call        73CD24C7
    00000013 33 FF            xor         edi,edi
    00000015 8B 46 04         mov         eax,dword ptr [esi+4]
    00000018 3B 46 10         cmp         eax,dword ptr [esi+10h]
    0000001b 74 0E            je          0000002B
       385:           {
       386:             return *m_paulIndex++;
    0000001d 8B 46 04         mov         eax,dword ptr [esi+4]
    00000020 8B 18            mov         ebx,dword ptr [eax]
    00000022 83 46 04 04      add         dword ptr [esi+4],4
    00000026 8B FB            mov         edi,ebx
    00000028 90               nop             
    00000029 EB 16            jmp         00000041
       387:           }
       388:          
       389:           StreamingMersenneTwisterRefresh(m_paulStateTable);
    0000002b 8B 4E 0C         mov         ecx,dword ptr [esi+0Ch]
    0000002e E8 01 D0 CF FF   call        FFCFD034
       390:
       391:           m_paulIndex -= 624 - 1;
    00000033 81 46 04 44 F6 FF FF add         dword ptr [esi+4],0FFFFF644h
       392:
       393:           return *m_paulStateTable;
    0000003a 8B 46 0C         mov         eax,dword ptr [esi+0Ch]
    0000003d 8B 00            mov         eax,dword ptr [eax]
    0000003f 8B F8            mov         edi,eax
       394:         }
    00000041 8B C7            mov         eax,edi
    00000043 5B               pop         ebx 
    00000044 5E               pop         esi 
    00000045 5F               pop         edi 
    00000046 C3               ret           

     

    *** ASM - OPTIMIZE CODE - INSIDE DEBUGGER ***

    Code Snippet

       408:         virtual unsigned __int32 NextUInt()
       409:         {
       410:           if (m_paulIndex != m_paulEndSentinel)
    00000000 57               push        edi 
    00000001 56               push        esi 
    00000002 53               push        ebx 
    00000003 8B F1            mov         esi,ecx
    00000005 83 3D 60 2F AE 05 00 cmp         dword ptr ds:[05AE2F60h],0
    0000000c 74 05            je          00000013
    0000000e E8 9C 50 F7 73   call        73F750AF
    00000013 33 DB            xor         ebx,ebx
    00000015 8B 46 04         mov         eax,dword ptr [esi+4]
    00000018 8B D8            mov         ebx,eax
    0000001a 3B 5E 10         cmp         ebx,dword ptr [esi+10h]
    0000001d 74 0E            je          0000002D
       411:           {
       412:             return *m_paulIndex++;
    0000001f 8B 3B            mov         edi,dword ptr [ebx]
    00000021 8D 43 04         lea         eax,[ebx+4]
    00000024 89 46 04         mov         dword ptr [esi+4],eax
    00000027 8B C7            mov         eax,edi
    00000029 5B               pop         ebx 
    0000002a 5E               pop         esi 
    0000002b 5F               pop         edi 
    0000002c C3               ret             
       413:           }
       414:          
       415:           StreamingMersenneTwisterRefresh(m_paulStateTable);
    0000002d 8B 4E 0C         mov         ecx,dword ptr [esi+0Ch]
    00000030 E8 5F F3 DD FF   call        FFDDF394
       416:
       417:           m_paulIndex -= 624 - 1;
    00000035 81 46 04 44 F6 FF FF add         dword ptr [esi+4],0FFFFF644h
       418:
       419:           return *m_paulStateTable;
    0000003c 8B 46 0C         mov         eax,dword ptr [esi+0Ch]
    0000003f 8B 00            mov         eax,dword ptr [eax]
    00000041 5B               pop         ebx 
    00000042 5E               pop         esi 
    00000043 5F               pop         edi 
    00000044 C3               ret        

     

    *** ASM - OPTIMIZE CODE - OUTSIDE DEBUGGER ***

    Code Snippet
       408:         virtual unsigned __int32 NextUInt()
       409:         {
       410:           if (m_paulIndex != m_paulEndSentinel)
    00000000 55               push        ebp 
    00000001 8B EC            mov         ebp,esp
    00000003 57               push        edi 
    00000004 56               push        esi 
    00000005 53               push        ebx 
    00000006 83 EC 30         sub         esp,30h
    00000009 64 8B 35 38 0E 00 00 mov         esi,dword ptr fs:[00000E38h]
    00000010 C7 45 C8 40 39 E7 79 mov         dword ptr [ebp-38h],79E73940h
    00000017 C7 45 C4 40 29 81 2E mov         dword ptr [ebp-3Ch],2E812940h
    0000001e 8B 7E 0C         mov         edi,dword ptr [esi+0Ch]
    00000021 89 7D CC         mov         dword ptr [ebp-34h],edi
    00000024 89 6D E8         mov         dword ptr [ebp-18h],ebp
    00000027 8D 7D C8         lea         edi,[ebp-38h]
    0000002a C7 45 D4 00 00 00 00 mov         dword ptr [ebp-2Ch],0
    00000031 89 7E 0C         mov         dword ptr [esi+0Ch],edi
    00000034 89 4D F0         mov         dword ptr [ebp-10h],ecx
    00000037 8B 45 F0         mov         eax,dword ptr [ebp-10h]
    0000003a 8B 48 04         mov         ecx,dword ptr [eax+4]
    0000003d 3B 48 10         cmp         ecx,dword ptr [eax+10h]
    00000040 74 0F            je          00000051
       411:           {
       412:             return *m_paulIndex++;
    00000042 8B 11            mov         edx,dword ptr [ecx]
    00000044 83 C1 04         add         ecx,4
    00000047 8B 45 F0         mov         eax,dword ptr [ebp-10h]
    0000004a 89 48 04         mov         dword ptr [eax+4],ecx
    0000004d 8B C2            mov         eax,edx
    0000004f EB 45            jmp         00000096
       413:           }
       414:          
       415:           StreamingMersenneTwisterRefresh(m_paulStateTable);
    00000051 8B 45 F0         mov         eax,dword ptr [ebp-10h]
    00000054 FF 70 0C         push        dword ptr [eax+0Ch]
    00000057 C7 45 D0 88 58 8C 02 mov         dword ptr [ebp-30h],28C5888h
    0000005e 89 65 D4         mov         dword ptr [ebp-2Ch],esp
    00000061 68 93 F2 8B 02   push        28BF293h
    00000066 8F 45 D8         pop         dword ptr [ebp-28h]
    00000069 C6 46 08 00      mov         byte ptr [esi+8],0
    0000006d FF 15 E8 82 8C 02 call        dword ptr ds:[028C82E8h]
    00000073 C6 46 08 01      mov         byte ptr [esi+8],1
    00000077 83 3D E0 D2 3A 7A 00 cmp         dword ptr ds:[7A3AD2E0h],0
    0000007e 74 07            je          00000087
    00000080 8B CE            mov         ecx,esi
    00000082 E8 62 DA 64 77   call        7764DAE9
       416:
       417:           m_paulIndex -= 624 - 1;
    00000087 8B 45 F0         mov         eax,dword ptr [ebp-10h]
    0000008a 81 40 04 44 F6 FF FF add         dword ptr [eax+4],0FFFFF644h
       418:
       419:           return *m_paulStateTable;
    00000091 8B 50 0C         mov         edx,dword ptr [eax+0Ch]
    00000094 8B 02            mov         eax,dword ptr [edx]
    00000096 8B 7D CC         mov         edi,dword ptr [ebp-34h]
    00000099 89 7E 0C         mov         dword ptr [esi+0Ch],edi
    0000009c 8D 65 F4         lea         esp,[ebp-0Ch]
    0000009f 5B               pop         ebx 
    000000a0 5E               pop         esi 
    000000a1 5F               pop         edi 
    000000a2 5D               pop         ebp 
    000000a3 C3               ret             

     

     

    Tuesday, April 8, 2008 7:14 PM

Answers

  • Hi ildjarn,

    Yea, I thought of that too, but the MethodImplAttribute does not make any difference here; I've tried.  The PInvoke inlining really isn’t inlining the call itself, just the code that supports calling the unmanaged function and, I guess, the managed-to-unmanaged switch.  There is a CORJIT_FLG_PROF_NO_PINVOKE_INLINE enumeration that seems to control the feature, but it is buried in the bowels of the CLR source with no obvious way to control or set it.

     

    I was able to work around the problem by adding extra parameters to the unmanaged function, in particular a "bool", which increased the complexity of the marshaling enough to prevent this sort of optimization by the JIT compiler (at least this is my assumption).  It runs fast again, even with optimizations enabled.  This also explains why some of our functions exhibit this performance issue and others don’t, as the heuristic seems to be related to the unmanaged marshaling.
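    The workaround amounts to nothing more than a signature change like the following.  This is a hypothetical stub: the real routine is hand-written assembly, and the real C++/CLI declaration also carries __stdcall and [SuppressUnmanagedCodeSecurity].

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stub illustrating the workaround: the extra bool widens the
// marshaling enough that the JIT declines to inline the P/Invoke setup.
extern "C" void StreamingMersenneTwisterRefresh(std::uint32_t* const pulStates,
                                                bool bUnused /* added param */)
{
    // Stub body only, so the sketch links; the real refresh regenerates
    // all 624 state words of the Mersenne Twister table.
    for (int i = 0; i < 624; ++i)
        pulStates[i] = bUnused ? 0u : static_cast<std::uint32_t>(i);
}
```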

     

    So from my end, I’ve solved the problem; though, I would really like some explanation of what PInvoke inlining is trying to accomplish (is it just potential performance gains?) and why it sometimes makes things slower...  Is this a bug?  It steals a register, so that may be the reason; I’m definitely curious.  I might post this on Microsoft Connect or burn through a support call and see if I can get one of the JIT compiler developers to explain things better.

     

    James.


    Thursday, April 10, 2008 9:41 PM

All replies

  • I have two initial thoughts:

     

    First, check your project settings to make sure optimizations are actually enabled in your release build. VC2008 has a bug where projects created from CLR application-type templates do not have the proper optimization settings for release mode by default. The settings in question are:

    1. C/C++ -> Optimization -> Optimization (should be 'Maximize Speed (/O2)')

    2. Linker -> Optimization -> References (should be 'Eliminate Unreferenced Data (/OPT:REF)')

    3. Linker -> Optimization -> Enable COMDAT Folding (should be 'Remove Redundant COMDATs (/OPT:ICF)')

     

    Second, in ildasm, check the code that calls NextUInt in your test case, and see if NextUInt is being inlined into the callsite. If it is, that may actually be causing adverse performance in this case -- it could be inlining when running the release code outside the debugger, and not inlining when running inside the debugger. If this is the case, try adding the standard __declspec(noinline) to NextUInt and see if that improves matters.
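    For reference, the noinline hint suggested above can be applied portably like this (the MSVC spelling is __declspec(noinline), as mentioned; the type and names below are otherwise illustrative):

```cpp
#include <cassert>

// __declspec(noinline) is the MSVC spelling; GCC/Clang spell the same hint
// __attribute__((noinline)). A small macro covers both toolchains.
#if defined(_MSC_VER)
#  define NOINLINE __declspec(noinline)
#else
#  define NOINLINE __attribute__((noinline))
#endif

struct Counter {
    unsigned value = 0;
    // The hint keeps this body out of callers, so callsite codegen can be
    // compared with and without inlining.
    NOINLINE unsigned NextUInt() { return ++value; }
};
```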
    Tuesday, April 8, 2008 7:41 PM
  •  

    Hi ildjarn,

     

    Yes, I’ve checked and verified that the optimizations are enabled.  You can see evidence of this in the two versions of IL I’ve posted above.   The problem is not the optimizer of the C++ compiler, but something to do with the JIT compiler optimizations.   

     

    Though..., it is certainly possible that the C++ optimizer is setting the JIT optimizer up for failure, or at least for a situation it does not typically see or is not coded to deal with; I’m not sure how much tuning Microsoft has done in this area.

     

    If you look at the timing numbers you can see that the optimized code performs the fastest, but only when executed under the debugger (when JIT optimizations are suppressed).  

     

    Also, in all cases the method NextUInt() will never be inlined and that is to be expected as it is called through an interface, so I do not believe this is the reason for the strange results in our performance testing.

     

    If you look at the optimized assembly language generated outside the debugger, it has a bunch of extra instructions at the beginning of the function that take up time, but I do not know what they are for, as they are not associated with the source…   It seems to be accessing segment FS, but why only when JIT optimizations are enabled?  There are no exception handlers, thread-specific storage, or static constructors in the class or method.

     

    James.

     

    Tuesday, April 8, 2008 8:57 PM
  • The FS segment is typically used for TLS and SEH frames, and occasionally for highly accessed thunk tables (e.g. import tables). The FS segment access could be an SEH frame since you're not compiling with /clr:safe, though I'm not sure why it would be omitted when you run with the debugger attached... Also, even though you don't have any exception handling directly in your code, the JITter no doubt emits an SEH frame to catch access violations when your code contains direct pointer arithmetic.

     

    You mentioned you're running on an Opteron, so I find the assembly being emitted curious... Are you running a binary linked with /MACHINE:X86 on a 64-bit OS? In this case, the FS segment access could be a WoW64 thunk table. Have you tried running the binary on a 32-bit OS to see if the JITter emits more optimal code, or creating an x64 build of your application?

    Tuesday, April 8, 2008 10:33 PM
  • Hi ildjarn,

     

    Yes, to answer your question, I’m running on a dual-processor machine with Windows XP Professional x64 installed.  The application is targeted for x86.

     

    I think you might be on to something with “…the JITter no doubt emits an SEH frame to catch access violations when your code contains direct pointer arithmetic.”   If you look at the two versions of IL, you can see that the non-optimized one pushes a uint32 on the stack while the optimized one (by the C++ backend) pushes a pointer to a uint32.  Could this be causing the JIT optimizer to assume it needs an SEH frame?

     

    Unfortunately, it creates more questions than answers.  Both the debug and optimized versions use pointers, so you would think that pointer arithmetic would cause the JIT compiler to generate this code regardless, if it really needed it.   And is there really any reason for the SEH frame in the first place?  There are no objects in the method or on the stack that need to be cleaned up.   And why is adding an SEH frame considered an “optimization”?  I mean, the JIT compiles the code when under the debugger and does not see fit to include the frame there, when optimizations are suppressed.  Finally, why would the C++ backend do this type of transformation on the IL if it could lead the JIT optimizer to produce slower code down the road?

     

    As to the Wow64 thunking, I must admit I did not think of this, but I’m not sure what it would be thunking to?  The method makes no calls to the operating system; it only calls the hand-coded assembly method which is also x86-code linked into the same .NET assembly.  The assembly generated in the optimized version does not show any new call instructions either.

     

    The really, really strange part is that I have an almost identical NextUInt() method with the same IL (a few more instructions for setting a Boolean state flag), but for a different random generator (the assembly routine it calls out to for the refresh/rounds is of course different), and we do not see this problem!  Both classes are located in the same .NET assembly.   It was this that led me to this investigation in the first place; the newer algorithm should definitely be faster, but we were seeing it fail our performance regression tests only when optimizations were turned on.

     

    I’ll do some more tests now that I have a little more information and a few more things to experiment with and report back.  

     

    Another possibility I’ve thought of is that the JIT compiler is trying to align the stack or something, but it seems way too much code for that.   The assembly code generated certainly smells of an SEH frame or an exception frame of some kind; though, it would be great if any assembly language experts could chime in and definitively identify the JIT-generated code in question.

     

    Thanks much,

    James.

     

    Tuesday, April 8, 2008 11:38 PM
  •  James C. Papp wrote:

    If you look at the two versions of IL you can see that the non-optimized one pushes a uint32 on the stack while the optimized one (by the C++ backend) pushes a pointer to a uint32.  Could this be causing the JIT optimizer to assume it needs an SEH frame?

     

    I think so. My understanding is, if the IL emitted isn't verifiable (which certainly includes any code involving pointer arithmetic), the JITter must emit an exception stack frame to handle any OS or hardware exceptions so they can be wrapped in a managed exception (such as a System::NullReferenceException or a System::ComponentModel::Win32Exception) at runtime.

     

    I think a good first step would be to see if the assembly JITted on a 32-bit OS is any different, to validate or invalidate the idea that WoW64 might be involved. However, I agree with you that it's more likely to be an SEH issue than a WoW64 issue.

     

    You may want to experiment with making your ref class a simple thin wrapper around a native class (i.e., one compiled without /clr, or at least inside of a #pragma unmanaged) that does the actual work; the managed-to-unmanaged transition may end up being faster than having pointer arithmetic inside your managed code.
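    The split being suggested looks roughly like this.  Both halves are written in plain ISO C++ so the shape is visible; in the real project NativeTwister would sit behind #pragma unmanaged and Facade would be a ref class holding a NativeTwister*.  All names and the placeholder fill are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Native half: all pointer arithmetic lives here. In the real project this
// class would be compiled as unmanaged code, so the JITter never sees it.
class NativeTwister {
    std::uint32_t m_table[624];
    std::uint32_t* m_index;
public:
    NativeTwister() : m_index(m_table) {
        for (int i = 0; i < 624; ++i)                 // placeholder fill
            m_table[i] = static_cast<std::uint32_t>(i);
    }
    std::uint32_t Next() {
        if (m_index == m_table + 624)                 // wrap instead of refresh
            m_index = m_table;
        return *m_index++;
    }
};

// Managed half: in C++/CLI this would be `public ref class Facade sealed`,
// containing no pointer arithmetic of its own, just one forwarding call,
// i.e. one managed-to-unmanaged transition per value.
class Facade {
    NativeTwister* m_native;
public:
    Facade() : m_native(new NativeTwister) {}
    ~Facade() { delete m_native; }
    std::uint32_t NextUInt() { return m_native->Next(); }
};
```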

     

    EDIT:

     James C. Papp wrote:

    Unfortunately, it creates more questions then answers.  Both the debug and optimized versions use pointers so you would think that using pointer arithmetic would cause the JIT compiler to generate this code regardless, if it really needed it.

     

    It's possible that the JITter doesn't emit SEH handlers when a debugger is attached, since the debugger is going to trap the exception anyway. This would explain why the FS segment access is in your release code outside of a debugger but not inside a debugger. This is all speculation on my part, of course.

    Wednesday, April 9, 2008 12:15 AM
  • Well I’ve done some more tests…

     

    I’ve completely ruled out the C++ compiler’s optimizations on the IL; the IL it produces is better optimized, so it is doing exactly what it is supposed to do.

     

    I’ve also ruled out the differences between the two IL dumps, the optimized and non-optimized versions.  I can get the JIT compiler to create the slower code from either set of IL by simply enabling JIT optimization.

     

    For some reason, when the JIT optimizations are allowed, the code comes out slower.   The question is why…?

     

    I’m not convinced that this is simply because of pointer operations.  As I’ve said earlier, I have other similar code which does not demonstrate this behavior or performance hit when optimized.  Though, more importantly, if this code is generated to support mapping of exceptions for null/access violations, why is it not generated when JIT-level optimizations are disabled?  Would you not get different behavior?

     

    Also, I’ve eliminated the debugger completely from the picture.   The issue is purely whether or not JIT-level optimizations are enabled at the time the method is jitted. 

     

    Take a look at the following lines in the assembly (this is the start of the method):

     

    00000000 55               push        ebp 

    00000001 8B EC            mov         ebp,esp

    00000003 57               push        edi 

    00000004 56               push        esi 

    00000005 53               push        ebx 

    00000006 83 EC 30         sub         esp,30h

     

    This looks like a typical ebp frame setup, but where the heck is the JIT compiler coming up with 30h to offset the locals on the stack?   Something is just not adding up.

     

    And what the heck are the following lines doing?  79E73940h smells of a code address of some type...  Ugh!

     

    00000009 64 8B 35 38 0E 00 00 mov         esi,dword ptr fs:[00000E38h]
    00000010 C7 45 C8 40 39 E7 79 mov         dword ptr [ebp-38h],79E73940h
    00000017 C7 45 C4 40 29 81 2E mov         dword ptr [ebp-3Ch],2E812940h

     

    I wonder if there is a way to turn off JIT optimization at the function level?  Then I could at least work around this problem...

     

    It would be great if someone from Microsoft who worked on the JIT compiler could chime in.

     

    Thanks much and for your suggestions,

    James.

     

    Wednesday, April 9, 2008 12:55 AM
  • Hi all,

     

    Okay, I finally figured out what all the extra code added to our functions is for and why it only appears when JIT optimizations are enabled.   My initial guess was that it had something to do with exception/stack frames, but I now realize it is code to support what looks like some type of PInvoke inlining.

     

    I’ve never heard of this feature before!  The missing piece was this method called JIT_RareDisableHelper (thanks to the symbol server), which was referenced by the generated assembly language.

     

    There is not much in the way of documentation, but a few quick searches turned up descriptions of this stuff here:

     

    http://discuss.develop.com/archives/wa.exe?A2=ind0201D&L=DOTNET&P=63742

    http://www.koders.com/cpp/fid11A72AA0BEFC6FC3C6A0B7421F7F275BBAC10B8F.aspx

     

    So, does anyone know more about PInvoke inlining and/or fast unmanaged calls?  When is it used, and what heuristics determine its use?  And why would PInvoke inlining be slower than not inlining?  Obviously, it does not inline the actual unmanaged call, only the associated overhead of setting the call up...

     

    And lastly, how does one override or suppress the JIT compiler’s decision to use PInvoke inlining when it does not improve performance?

     

    Thanks much,

    James C. Papp

    Thursday, April 10, 2008 7:02 PM
  • To prevent the JIT compiler from inlining your method, I believe you can apply the System::Runtime::CompilerServices::MethodImplAttribute attribute to your method with the MethodImplOptions::NoInlining flag specified.

     

    Some rules on inline qualification are posted here: http://blogs.msdn.com/ricom/archive/2004/01/14/58703.aspx. The blog post is a few years old, but most of it probably still holds true.

     

    EDIT: Also, this link is useful, and is actually the one I meant to post the first time: http://blogs.msdn.com/ericgu/archive/2004/01/29/64717.aspx

    Thursday, April 10, 2008 8:16 PM