Strange Visual Studio 2008 Optimization Regression

Question
Hi all,
I’m hoping that some .NET C++/CLI MSIL guru can help explain this very strange phenomenon that I’m experiencing in Visual Studio 2008. I’m getting significantly slower code for a very critical function in a scientific application we are developing. This function runs much slower only when compiled with optimization.
Here are some times I get with various build/run configurations:
AMD Opteron 256
Generation of 1,000,000,000 random values
Debug Code; Inside Debugger: 6 Seconds 5.8170 Milliseconds
Optimized Code; Inside Debugger: 5 Seconds 953.1670 Milliseconds
Optimized Code; Outside Debugger: 11 Seconds 31.2150 Milliseconds
As you can see, the optimized code produces the best performance, but only when running inside the debugger (when JIT optimizations are disabled)! Outside the debugger it is quite slow, even slower than the non-optimized code.
I’ve included snapshots of the code in question, the IL (both optimized and non-optimized), and the ASM produced in all three cases. If someone can shed some light on this problem, it would be greatly appreciated. It seems to me from looking at the assembly code that the JIT compiler is adding some type of exception handler or security code, but I’m not sure, nor, for that matter, do I understand why it would be doing this only for optimized code running outside the debugger.
I’ve simplified the code and reduced some of the namespaces for clarity.
Thanks much,
James C. Papp
*** C++/CLI Code (Simplified) ***
extern "C"
{
    [SuppressUnmanagedCodeSecurity]
    void __declspec(nothrow) __stdcall StreamingMersenneTwisterRefresh(unsigned __int32 * const pulStates);
}

[SuppressUnmanagedCodeSecurity]
public ref class StreamingMersenneTwisterRandom sealed : public IRandom
{
    ...

public:
    virtual unsigned __int32 NextUInt()
    {
        if (m_paulIndex != m_paulEndSentinel)
        {
            return *m_paulIndex++;
        }

        StreamingMersenneTwisterRefresh(m_paulStateTable);
        m_paulIndex -= 624 - 1;
        return *m_paulStateTable;
    }
};

*** DEBUG IL ***
.method public hidebysig newslot virtual final instance uint32 NextUInt() cil managed
{
.maxstack 4
.locals (
[0] uint32 num)
L_0000: ldarg.0
L_0001: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
L_0006: ldarg.0
L_0007: ldfld uint32* modopt([mscorlib]CompilerServices.IsConst) modopt([mscorlib]CompilerServices.IsConst) StreamingMersenneTwisterRandom::m_paulEndSentinel
L_000c: beq.s L_0026
L_000e: ldarg.0
L_000f: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
L_0014: ldind.i4
L_0015: ldarg.0
L_0016: dup
L_0017: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
L_001c: ldc.i4.4
L_001d: add
L_001e: stfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
L_0023: stloc.0
L_0024: br.s L_0053
L_0026: ldarg.0
L_0027: dup
L_0028: ldfld uint32* modopt([mscorlib]CompilerServices.IsConst) modopt([mscorlib]CompilerServices.IsConst) StreamingMersenneTwisterRandom::m_paulStateTable
L_002d: stfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
L_0032: ldarg.0
L_0033: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
L_0038: call void modopt([mscorlib]CompilerServices.CallConvStdcall) ::StreamingMersenneTwisterRefresh(uint32* modopt([mscorlib]CompilerServices.IsConst) modopt([mscorlib]CompilerServices.IsConst))
L_003d: ldarg.0
L_003e: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
L_0043: ldind.i4
L_0044: ldarg.0
L_0045: dup
L_0046: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
L_004b: ldc.i4.4
L_004c: add
L_004d: stfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
L_0052: stloc.0
L_0053: ldloc.0
L_0054: ret
}
*** OPTIMIZED IL ***
.method public hidebysig newslot virtual final instance uint32 NextUInt() cil managed
{
.maxstack 4
.locals (
[0] uint32* numPtr)
L_0000: ldarg.0
L_0001: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
L_0006: stloc.0
L_0007: ldloc.0
L_0008: ldarg.0
L_0009: ldfld uint32* modopt([mscorlib]CompilerServices.IsConst) modopt([mscorlib]CompilerServices.IsConst) StreamingMersenneTwisterRandom::m_paulEndSentinel
L_000e: beq.s L_001c
L_0010: ldloc.0
L_0011: ldind.i4
L_0012: ldarg.0
L_0013: ldloc.0
L_0014: ldc.i4.4
L_0015: add
L_0016: stfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
L_001b: ret
L_001c: ldarg.0
L_001d: ldfld uint32* modopt([mscorlib]CompilerServices.IsConst) modopt([mscorlib]CompilerServices.IsConst) StreamingMersenneTwisterRandom::m_paulStateTable
L_0022: call void modopt([mscorlib]CompilerServices.CallConvStdcall) ::StreamingMersenneTwisterRefresh(uint32* modopt([mscorlib]CompilerServices.IsConst) modopt([mscorlib]CompilerServices.IsConst))
L_0027: ldarg.0
L_0028: dup
L_0029: ldfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
L_002e: ldc.i4 0x9bc
L_0033: sub
L_0034: stfld uint32* StreamingMersenneTwisterRandom::m_paulIndex
L_0039: ldarg.0
L_003a: ldfld uint32* modopt([mscorlib]CompilerServices.IsConst) modopt([mscorlib]CompilerServices.IsConst) StreamingMersenneTwisterRandom::m_paulStateTable
L_003f: ldind.i4
L_0040: ret
}

*** ASM - DEBUG CODE - INSIDE DEBUGGER ***
382: virtual unsigned __int32 NextUInt()
383: {
384: if (m_paulIndex != m_paulEndSentinel)
00000000 57 push edi
00000001 56 push esi
00000002 53 push ebx
00000003 8B F1 mov esi,ecx
00000005 83 3D 08 6C AE 05 00 cmp dword ptr ds:[05AE6C08h],0
0000000c 74 05 je 00000013
0000000e E8 B4 24 CD 73 call 73CD24C7
00000013 33 FF xor edi,edi
00000015 8B 46 04 mov eax,dword ptr [esi+4]
00000018 3B 46 10 cmp eax,dword ptr [esi+10h]
0000001b 74 0E je 0000002B
385: {
386: return *m_paulIndex++;
0000001d 8B 46 04 mov eax,dword ptr [esi+4]
00000020 8B 18 mov ebx,dword ptr [eax]
00000022 83 46 04 04 add dword ptr [esi+4],4
00000026 8B FB mov edi,ebx
00000028 90 nop
00000029 EB 16 jmp 00000041
387: }
388:
389: StreamingMersenneTwisterRefresh(m_paulStateTable);
0000002b 8B 4E 0C mov ecx,dword ptr [esi+0Ch]
0000002e E8 01 D0 CF FF call FFCFD034
390:
391: m_paulIndex -= 624 - 1;
00000033 81 46 04 44 F6 FF FF add dword ptr [esi+4],0FFFFF644h
392:
393: return *m_paulStateTable;
0000003a 8B 46 0C mov eax,dword ptr [esi+0Ch]
0000003d 8B 00 mov eax,dword ptr [eax]
0000003f 8B F8 mov edi,eax
394: }
00000041 8B C7 mov eax,edi
00000043 5B pop ebx
00000044 5E pop esi
00000045 5F pop edi
00000046 C3 ret

*** ASM - OPTIMIZED CODE - INSIDE DEBUGGER ***
408: virtual unsigned __int32 NextUInt()
409: {
410: if (m_paulIndex != m_paulEndSentinel)
00000000 57 push edi
00000001 56 push esi
00000002 53 push ebx
00000003 8B F1 mov esi,ecx
00000005 83 3D 60 2F AE 05 00 cmp dword ptr ds:[05AE2F60h],0
0000000c 74 05 je 00000013
0000000e E8 9C 50 F7 73 call 73F750AF
00000013 33 DB xor ebx,ebx
00000015 8B 46 04 mov eax,dword ptr [esi+4]
00000018 8B D8 mov ebx,eax
0000001a 3B 5E 10 cmp ebx,dword ptr [esi+10h]
0000001d 74 0E je 0000002D
411: {
412: return *m_paulIndex++;
0000001f 8B 3B mov edi,dword ptr [ebx]
00000021 8D 43 04 lea eax,[ebx+4]
00000024 89 46 04 mov dword ptr [esi+4],eax
00000027 8B C7 mov eax,edi
00000029 5B pop ebx
0000002a 5E pop esi
0000002b 5F pop edi
0000002c C3 ret
413: }
414:
415: StreamingMersenneTwisterRefresh(m_paulStateTable);
0000002d 8B 4E 0C mov ecx,dword ptr [esi+0Ch]
00000030 E8 5F F3 DD FF call FFDDF394
416:
417: m_paulIndex -= 624 - 1;
00000035 81 46 04 44 F6 FF FF add dword ptr [esi+4],0FFFFF644h
418:
419: return *m_paulStateTable;
0000003c 8B 46 0C mov eax,dword ptr [esi+0Ch]
0000003f 8B 00 mov eax,dword ptr [eax]
00000041 5B pop ebx
00000042 5E pop esi
00000043 5F pop edi
00000044 C3 ret

*** ASM - OPTIMIZED CODE - OUTSIDE DEBUGGER ***
408: virtual unsigned __int32 NextUInt()
409: {
410: if (m_paulIndex != m_paulEndSentinel)
00000000 55 push ebp
00000001 8B EC mov ebp,esp
00000003 57 push edi
00000004 56 push esi
00000005 53 push ebx
00000006 83 EC 30 sub esp,30h
00000009 64 8B 35 38 0E 00 00 mov esi,dword ptr fs:[00000E38h]
00000010 C7 45 C8 40 39 E7 79 mov dword ptr [ebp-38h],79E73940h
00000017 C7 45 C4 40 29 81 2E mov dword ptr [ebp-3Ch],2E812940h
0000001e 8B 7E 0C mov edi,dword ptr [esi+0Ch]
00000021 89 7D CC mov dword ptr [ebp-34h],edi
00000024 89 6D E8 mov dword ptr [ebp-18h],ebp
00000027 8D 7D C8 lea edi,[ebp-38h]
0000002a C7 45 D4 00 00 00 00 mov dword ptr [ebp-2Ch],0
00000031 89 7E 0C mov dword ptr [esi+0Ch],edi
00000034 89 4D F0 mov dword ptr [ebp-10h],ecx
00000037 8B 45 F0 mov eax,dword ptr [ebp-10h]
0000003a 8B 48 04 mov ecx,dword ptr [eax+4]
0000003d 3B 48 10 cmp ecx,dword ptr [eax+10h]
00000040 74 0F je 00000051
411: {
412: return *m_paulIndex++;
00000042 8B 11 mov edx,dword ptr [ecx]
00000044 83 C1 04 add ecx,4
00000047 8B 45 F0 mov eax,dword ptr [ebp-10h]
0000004a 89 48 04 mov dword ptr [eax+4],ecx
0000004d 8B C2 mov eax,edx
0000004f EB 45 jmp 00000096
413: }
414:
415: StreamingMersenneTwisterRefresh(m_paulStateTable);
00000051 8B 45 F0 mov eax,dword ptr [ebp-10h]
00000054 FF 70 0C push dword ptr [eax+0Ch]
00000057 C7 45 D0 88 58 8C 02 mov dword ptr [ebp-30h],28C5888h
0000005e 89 65 D4 mov dword ptr [ebp-2Ch],esp
00000061 68 93 F2 8B 02 push 28BF293h
00000066 8F 45 D8 pop dword ptr [ebp-28h]
00000069 C6 46 08 00 mov byte ptr [esi+8],0
0000006d FF 15 E8 82 8C 02 call dword ptr ds:[028C82E8h]
00000073 C6 46 08 01 mov byte ptr [esi+8],1
00000077 83 3D E0 D2 3A 7A 00 cmp dword ptr ds:[7A3AD2E0h],0
0000007e 74 07 je 00000087
00000080 8B CE mov ecx,esi
00000082 E8 62 DA 64 77 call 7764DAE9
416:
417: m_paulIndex -= 624 - 1;
00000087 8B 45 F0 mov eax,dword ptr [ebp-10h]
0000008a 81 40 04 44 F6 FF FF add dword ptr [eax+4],0FFFFF644h
418:
419: return *m_paulStateTable;
00000091 8B 50 0C mov edx,dword ptr [eax+0Ch]
00000094 8B 02 mov eax,dword ptr [edx]
00000096 8B 7D CC mov edi,dword ptr [ebp-34h]
00000099 89 7E 0C mov dword ptr [esi+0Ch],edi
0000009c 8D 65 F4 lea esp,[ebp-0Ch]
0000009f 5B pop ebx
000000a0 5E pop esi
000000a1 5F pop edi
000000a2 5D pop ebp
000000a3 C3 ret

Tuesday, April 8, 2008 7:14 PM
All replies
I have two initial thoughts:
First, check your project settings to make sure optimizations are actually enabled in your release build. VC2008 has a bug where projects created from CLR application-type templates do not have the proper optimization settings for release mode by default. The settings in question are:
1. C/C++ -> Optimization -> Optimization (should be 'Maximize Speed (/O2)')
2. Linker -> Optimization -> References (should be 'Eliminate Unreferenced Data (/OPT:REF)')
3. Linker -> Optimization -> Enable COMDAT Folding (should be 'Remove Redundant COMDATs (/OPT:ICF)')
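For reference, these map to the following command-line switches (illustrative invocation; the IDE normally passes these from the project settings):

```shell
# Compile with full speed optimization under /clr, then link with
# unreferenced-data elimination and COMDAT folding enabled.
cl /O2 /clr yourfile.cpp /link /OPT:REF /OPT:ICF
```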
Tuesday, April 8, 2008 7:41 PM
Hi ildjarn,
Yes, I’ve checked and verified that the optimizations are enabled. You can see evidence of this in the two versions of IL I’ve posted above. The problem is not the optimizer of the C++ compiler, but something to do with the JIT compiler optimizations.
Though, it is certainly possible that the C++ optimizer is setting up the JIT optimizer for failure, or at least for a situation it does not typically see or is not coded to deal with; I’m not sure how much tuning Microsoft has done in this area.
If you look at the timing numbers you can see that the optimized code performs the fastest, but only when executed under the debugger (when JIT optimizations are suppressed).
Also, in all cases the method NextUInt() will never be inlined, and that is to be expected, as it is called through an interface; so I do not believe this is the reason for the strange results in our performance testing.
If you look at the optimized assembly language generated outside the debugger, it has a bunch of extra instructions at the beginning of the function that are taking up time, but I do not know what they are for, as they are not associated with the source… It seems to be accessing segment FS, but why only when JIT optimizations are enabled? There are no exception handlers, thread-specific storage, or static constructors in the class or method.
James.
Tuesday, April 8, 2008 8:57 PM
The FS segment is typically used for TLS and SEH frames, and occasionally for highly accessed thunk tables (e.g. import tables). The FS segment access could be an SEH frame, since you're not compiling with /clr:safe, though I'm not sure why it would be omitted when you run with the debugger attached... Also, even though you don't have any exception handling directly in your code, the JITter no doubt emits an SEH frame to catch access violations when your code contains direct pointer arithmetic.
You mentioned you're running on an Opteron, so I find the assembly being emitted curious... Are you running a binary linked with /MACHINE:X86 on a 64-bit OS? In this case, the FS segment access could be a WoW64 thunk table. Have you tried running the binary on a 32-bit OS to see if the JITter emits more optimal code, or creating an x64 build of your application?
Tuesday, April 8, 2008 10:33 PM
Hi ildjarn,
Yes, to answer your question, I’m running on a dual-processor machine with Windows XP Professional x64 installed. The application is targeted for x86.
I think you might be on to something with “…the JITter no doubt emits an SEH frame to catch access violations when your code contains direct pointer arithmetic.” If you look at the two versions of IL, you can see that the non-optimized one pushes a uint32 on the stack while the optimized one (by the C++ backend) pushes a pointer to a uint32. Could this be causing the JIT optimizer to assume it needs an SEH frame?
Unfortunately, it creates more questions than answers. Both the debug and optimized versions use pointers, so you would think that pointer arithmetic would cause the JIT compiler to generate this code regardless, if it really needed it. And is there really any reason for the SEH frame in the first place? There are no objects in the method or on the stack that need to be cleaned up. And why is adding an SEH frame considered an “optimization”? I mean, the JIT compiles the code when under the debugger and does not see fit to include it there, when optimizations are suppressed. Finally, why would the C++ backend do this type of transformation on the IL if it could produce slower code from the JIT optimizer down the road?
As to the WoW64 thunking, I must admit I did not think of this, but I’m not sure what it would be thunking to. The method makes no calls to the operating system; it only calls the hand-coded assembly method, which is also x86 code linked into the same .NET assembly. The assembly generated in the optimized version does not show any new call instructions either.
The really, really strange part is that I have an almost identical NextUInt() method with the same IL (a few more instructions for setting a Boolean state flag), but for a different random generator (the assembly routine it calls out to for the refresh/rounds is of course different), and we do not see this problem! Both classes are located in the same .NET assembly. It was this that led me to this investigation in the first place; the newer algorithm should definitely be faster, but we were seeing it fail our performance regression tests only when optimizations were turned on.
I’ll do some more tests now that I have a little more information and a few more things to experiment with and report back.
Another possibility I’ve thought of is that the JIT compiler is trying to align the stack or something, but it seems way too much code for this. The assembly code generated certainly smells of an SEH frame or an exception frame of some kind; though, it would be great if any assembly language experts could chime in and definitively identify the JIT-generated code in question.
Thanks much,
James.
Tuesday, April 8, 2008 11:38 PM
James C. Papp wrote: If you look at the two versions of IL, you can see that the non-optimized one pushes a uint32 on the stack while the optimized one (by the C++ backend) pushes a pointer to a uint32. Could this be causing the JIT optimizer to assume it needs an SEH frame?
I think so. My understanding is, if the IL emitted isn't verifiable (certainly including any code involving pointer arithmetic), the JITter must emit an exception stack frame to handle any OS or hardware exceptions so they can be wrapped in a managed exception (such as a System::NullReferenceException or a System::ComponentModel::Win32Exception) at runtime.
I think a good first step would be to see if the assembly JITted on a 32-bit OS is any different, to validate or invalidate the idea that WoW64 might be involved. However, I agree with you that it's more likely to be an SEH issue than a WoW64 issue.
You may want to experiment with making your ref class a simple thin wrapper around a native class (i.e., one compiled without /clr, or at least inside of a #pragma unmanaged) that does the actual work; the managed-to-unmanaged transition may end up being faster than having pointer arithmetic inside your managed code.
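A sketch of that split, under the assumption that the native class is compiled without /clr (or inside #pragma unmanaged). The class name is illustrative, the refill step is a placeholder rather than the real MT19937 rounds, and the wrapper shown in comments is illustrative C++/CLI:

```cpp
#include <cassert>
#include <cstdint>

// Native side: all pointer arithmetic lives here, so none of it ever
// reaches the JIT as unverifiable IL.
class NativeTwister {
    static const int kStates = 624;
    uint32_t states_[kStates];
    uint32_t* index_;
    uint32_t* const end_;  // one-past-the-end sentinel

    void Refresh() {
        // Placeholder refill standing in for the real Mersenne Twister rounds.
        for (int i = 0; i < kStates; ++i)
            states_[i] = states_[i] * 1664525u + 1013904223u;
    }

public:
    NativeTwister() : index_(states_ + kStates), end_(states_ + kStates) {
        for (int i = 0; i < kStates; ++i)
            states_[i] = static_cast<uint32_t>(i);
    }

    // Same fast-path/slow-path shape as the managed NextUInt() above:
    // hand out buffered values until the sentinel, then refill, rewind
    // the cursor past element 0, and return element 0.
    uint32_t NextUInt() {
        if (index_ != end_)
            return *index_++;
        Refresh();
        index_ -= kStates - 1;  // cursor now points at states_[1]
        return states_[0];
    }
};

// Managed side (C++/CLI, sketched in comments): a thin forwarder whose
// body contains no pointer arithmetic of its own.
//
//   public ref class StreamingMersenneTwisterRandom sealed : public IRandom {
//       NativeTwister* m_pImpl;
//   public:
//       virtual unsigned __int32 NextUInt() { return m_pImpl->NextUInt(); }
//   };
```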
EDIT:
James C. Papp wrote: Unfortunately, it creates more questions than answers. Both the debug and optimized versions use pointers, so you would think that pointer arithmetic would cause the JIT compiler to generate this code regardless, if it really needed it.
It's possible that the JITter doesn't emit SEH handlers when a debugger is attached, since the debugger is going to trap the exception anyway. This would explain why the FS segment access is in your release code outside of a debugger but not inside a debugger. This is all speculation on my part, of course.
Wednesday, April 9, 2008 12:15 AM
Well I’ve done some more tests…
I’ve completely ruled out the C++ compiler optimizations done to the IL; the IL it produces is better optimized, so it is doing exactly what it is supposed to do.
I’ve also ruled out that it is the differences in the two IL dumps, the optimized and non-optimized versions. I can get the JIT compiler to create the slower code on either set of IL by simply enabling JIT optimization.
For some reason, when JIT optimizations are allowed, it makes the code slower. The question is why…?
I’m not convinced that this is simply because of pointer operations. As I’ve said earlier, I have other similar code which does not demonstrate this behavior or performance hit when optimized. Though, more importantly, if this code is generated to support mapping of exceptions for null/access violations, why is it not generated when JIT-level optimizations are disabled? Would you not get different behavior?
Also, I’ve eliminated the debugger completely from the picture. The issue is purely whether or not JIT-level optimizations are enabled at the time the method is jitted.
Take a look at the following lines in the assembly (this is the start of the method):
00000000 55 push ebp
00000001 8B EC mov ebp,esp
00000003 57 push edi
00000004 56 push esi
00000005 53 push ebx
00000006 83 EC 30 sub esp,30h
This looks like a typical ebp frame setup, but where the heck is the JIT compiler coming up with 30h to offset the locals on the stack? Something is just not adding up.
And what the heck are the following lines doing? 79E73940h smells of a code address of some type... Ugh!
00000009 64 8B 35 38 0E 00 00 mov esi,dword ptr fs:[00000E38h]
00000010 C7 45 C8 40 39 E7 79 mov dword ptr [ebp-38h],79E73940h
00000017 C7 45 C4 40 29 81 2E mov dword ptr [ebp-3Ch],2E812940h

I wonder if there is a way to turn off JIT optimization at the function level? Then I can at least work around this problem...
It would be great if someone from Microsoft who worked on the JIT compiler could chime in.
Thanks much and for your suggestions,
James.
Wednesday, April 9, 2008 12:55 AM
Hi all,
Okay, I finally figured out what all the extra code added to our functions is for, and why it only happens when JIT optimizations are enabled. My initial guess was that it had something to do with exception/stack frames, but I now realize it is code to support what looks like some type of PInvoke inlining.
I’ve never heard of this feature before! The missing piece was this method called JIT_RareDisableHelper (thanks to the symbol server), which was referenced by the generated assembly language.
There is not much in the way of documentation, but a few descriptions of this stuff turned up in some quick searches:
http://discuss.develop.com/archives/wa.exe?A2=ind0201D&L=DOTNET&P=63742
http://www.koders.com/cpp/fid11A72AA0BEFC6FC3C6A0B7421F7F275BBAC10B8F.aspx
So, does anyone know more about PInvoke inlining and/or fast unmanaged calls? When is it used, and what heuristics determine its use? Or why would PInvoke inlining be slower than not inlining it? Obviously, it does not inline the actual unmanaged call, but rather the associated overhead of setting the call up...
And lastly, how to override or suppress the JIT compiler’s decision to use PInvoke inlining when it does not improve performance?
Thanks much,
James C. Papp
Thursday, April 10, 2008 7:02 PM
To prevent the JIT compiler from inlining your method, I believe you can apply the System::Runtime::CompilerServices::MethodImplAttribute attribute to your method with the MethodImplOptions::NoInlining flag specified.
Some rules on inline qualification are posted here: http://blogs.msdn.com/ricom/archive/2004/01/14/58703.aspx. The blog post is a few years old, but most of it probably still holds true.
EDIT: Also, this link is useful, and is actually the one I meant to post the first time: http://blogs.msdn.com/ericgu/archive/2004/01/29/64717.aspx
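Concretely, applying the attribute would look something like this (illustrative C++/CLI fragment; MethodImplOptions::NoOptimization may also be worth trying if the goal is to suppress JIT optimization for just this method, though I have not verified how this era of the JIT honors it per-method):

```cpp
using namespace System::Runtime::CompilerServices;

public ref class StreamingMersenneTwisterRandom sealed : public IRandom
{
public:
    [MethodImpl(MethodImplOptions::NoInlining)]        // forbid inlining of this method
    // [MethodImpl(MethodImplOptions::NoOptimization)] // or: ask the JIT to skip optimization
    virtual unsigned __int32 NextUInt()
    {
        // ...
    }
};
```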
Thursday, April 10, 2008 8:16 PM
Hi ildjarn,
Yea, I thought of that too, but the MethodImplAttribute does not make any difference here; I've tried. The PInvoke inlining really isn’t inlining the call itself, just the stuff that supports calling the unmanaged function and, I guess, the managed-to-unmanaged switch. There is a CORJIT_FLG_PROF_NO_PINVOKE_INLINE enumeration that seems to control the feature, but it is buried in the bowels of the CLR source with no obvious way to control or set it.
I was able to work around the problem by adding extra parameters to the unmanaged function, particularly a "bool", which increased the complexity of the marshaling enough to prevent this sort of optimization by the JIT compiler (at least this is my assumption). It runs fast again, even with optimizations enabled. This also explains why some of our functions exhibit this performance issue and others don’t, as the heuristic seems to be related to the unmanaged marshaling.
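For anyone hitting the same issue, the workaround amounts to padding the unmanaged signature; a sketch (the dummy parameter name is illustrative, and the idea that the bool defeats the inlining heuristic via marshaling complexity is my assumption, not documented behavior):

```cpp
// Before: simple signature -- the JIT chose to inline the P/Invoke frame setup.
// void __declspec(nothrow) __stdcall StreamingMersenneTwisterRefresh(unsigned __int32 * const pulStates);

// After: a dummy bool makes the marshaling complex enough that the JIT
// backs off from inlining the unmanaged-call setup (observed behavior).
extern "C"
{
    [SuppressUnmanagedCodeSecurity]
    void __declspec(nothrow) __stdcall StreamingMersenneTwisterRefresh(
        unsigned __int32 * const pulStates,
        bool bDummy);  // hypothetical padding parameter, ignored by the callee
}
```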
So from my end, I’ve solved the problem; though, I would really like some explanation of what PInvoke inlining is trying to accomplish (is it just potential performance gains?) and why it sometimes makes things slower... Is this a bug? It steals a register, so this may be the reason; I’m definitely curious. I might post this on Microsoft Connect or burn through a support call and see if I can get one of the JIT compiler developers to explain things better.
James.
Thursday, April 10, 2008 9:41 PM