I want to maximize speed, up to 4 times faster (xna-math)

  • February 3, 2012 08:18
     
     
    #include "stdafx.h"
    #include <xmmintrin.h>

    #define XMVECTOR __m128

    _forceinline XMVECTOR XMVectorAdd(XMVECTOR V1, XMVECTOR V2)
    {
        return _mm_add_ps(V1, V2);
    }

    int main()
    {
        static XMVECTOR V1 = {1.0f, 1.0f, 1.0f, 1.0f};
        static XMVECTOR V2 = {2.0f, 2.0f, 2.0f, 2.0f};
       
        XMVECTOR V3 = XMVectorAdd(V1, V2);
          
        return 0;
    }

    Compiled without optimization, the disassembly looks like this:

    int main()
    {
    01131370  push        ebx 
    01131371  mov         ebx,esp 
    01131373  sub         esp,8 
    01131376  and         esp,0FFFFFFF0h 
    01131379  add         esp,4 
    0113137C  push        ebp 
    0113137D  mov         ebp,dword ptr [ebx+4] 
    01131380  mov         dword ptr [esp+4],ebp 
    01131384  mov         ebp,esp 
    01131386  sub         esp,108h 
    0113138C  push        esi 
    0113138D  push        edi 
    0113138E  lea         edi,[ebp-108h] 
    01131394  mov         ecx,42h 
    01131399  mov         eax,0CCCCCCCCh 
    0113139E  rep stos    dword ptr es:[edi]  <<<<<< This rep stos fills the stack frame with 0CCCCCCCCh and slows
                                                     things down, not only in main() but in every function.
                                                     It can be disabled with the optimization discussed below.

        static XMVECTOR V1 = {1.0f, 1.0f, 1.0f, 1.0f};
        static XMVECTOR V2 = {2.0f, 2.0f, 2.0f, 2.0f};
       
        XMVECTOR V3 = XMVectorAdd(V1, V2);
    011313A0  movaps      xmm1,xmmword ptr [V2 (1138010h)] 
    011313A7  movaps      xmm0,xmmword ptr [V1 (1138000h)] 

    011313AE  call        XMVectorAdd (113110Eh)  <<<<<< Why is there a 'call' here instead of an inline expansion?
    011313B3  movaps      xmmword ptr [ebp-100h],xmm0 
    011313BA  movaps      xmm0,xmmword ptr [ebp-100h] 
    011313C1  movaps      xmmword ptr [V3],xmm0
     
          
        return 0;
    011313C5  xor         eax,eax 
    }
    011313C7  pop         edi 
    011313C8  pop         esi 
    011313C9  mov         esp,ebp 
    011313CB  pop         ebp 
    011313CC  mov         esp,ebx 
    011313CE  pop         ebx 
    011313CF  ret 


    I want to maximize speed with the /O2 optimization in VC++ 2010.
    In Project Properties > C/C++ > Optimization I select Maximize Speed (/O2), and then the following error message appears:
    "Command line error D8016 : '/ZI' and '/O2' command-line options are incompatible".
    The question is: "Which other options have to be changed so that /O2 works?"

    I have already spent a lot of time on this, but the problem is still unresolved. Is there a developer at Microsoft who understands the problem? I really like 'xnamath.h'; who created it?

    The aims are:
     - Without the /O2 optimization, the functions in xnamath.h are not expanded inline, and using the SSE2 intrinsics directly is 2 times faster than using xna-math.
     - With the /O2 optimization, the functions are expanded inline; using the SSE2 intrinsics is then 4 times faster than using only standard C++, and xna-math is as fast as the SSE2 intrinsics.
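    To illustrate where a speedup of this kind can come from, here is a minimal sketch that compares four scalar additions with a single _mm_add_ps on four packed floats. The function names (add4_scalar, add4_sse, check_add4) are hypothetical and not part of xnamath.h, and the sketch makes no timing claim; it only shows that both versions compute the same result.

```cpp
#include <cassert>      // assert, used by the self-check below
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, ...

// Scalar version: four separate float additions.
void add4_scalar(const float* a, const float* b, float* out)
{
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}

// SSE version: one _mm_add_ps adds all four lanes at once.
void add4_sse(const float* a, const float* b, float* out)
{
    __m128 va = _mm_loadu_ps(a);  // unaligned load of 4 floats
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

// Hypothetical self-check: both versions must agree lane by lane.
void check_add4()
{
    const float a[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    const float b[4] = {2.0f, 2.0f, 2.0f, 2.0f};
    float s[4];
    float v[4];
    add4_scalar(a, b, s);
    add4_sse(a, b, v);
    for (int i = 0; i < 4; ++i)
        assert(s[i] == v[i]);
}
```

    Calling check_add4() from main() is enough to try it. Whether the packed version is actually faster in a given build depends on the compiler settings, which is the point of the question above.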

    The default option is not optimized

All replies

  • February 8, 2012 15:04
     
     Answer
    I finally solved the problem myself:

    In Project Properties > C/C++, do the following:
        1) Optimization > Maximize Speed (/O2)
        2) Debug Information Format > Program Database (/Zi)
        3) Code Generation > Basic Runtime Checks > Default

    I'm new and I apologize for asking such an easy question; the solution is obvious: just change (/ZI) to (/Zi), ...
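    As far as I understand the MSVC documentation, __forceinline is only a strong request: when inline expansion is disabled (/Ob0, the default for Debug builds), the compiler still emits a real call, which would explain the 'call XMVectorAdd' in the unoptimized listing. Below is a minimal sketch of the same wrapper that also builds with other compilers. FORCE_INLINE, Vec4, Vec4Add and add_example are hypothetical names for illustration and are not part of xnamath.h.

```cpp
#include <cassert>      // only needed for the assert in the usage note
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, ...

// Hypothetical portability macro (an assumption, not from xnamath.h).
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#else
#  define FORCE_INLINE inline __attribute__((always_inline))
#endif

typedef __m128 Vec4;  // stand-in for XMVECTOR in this sketch

// Thin wrapper around the intrinsic; in an optimized build it is
// expected to be expanded inline at each call site.
FORCE_INLINE Vec4 Vec4Add(Vec4 v1, Vec4 v2)
{
    return _mm_add_ps(v1, v2);
}

// Adds {1,1,1,1} and {2,2,2,2} and returns lane 0 of the result.
float add_example()
{
    Vec4 v1 = _mm_set1_ps(1.0f);
    Vec4 v2 = _mm_set1_ps(2.0f);
    Vec4 v3 = Vec4Add(v1, v2);
    return _mm_cvtss_f32(v3);
}
```

    A quick check is assert(add_example() == 3.0f); in main(). To see whether the wrapper was really expanded inline, look at the disassembly of the optimized build for a remaining 'call'.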

    * Proof that the problem is solved:
    The drawback of optimizing for maximum speed is that debugging becomes difficult, because the disassembled program is very compact and the debug information becomes complex.
        Examples: breakpoints that are skipped and steps that jump over large parts of the code.

    Manda.