VFP in Windows Embedded Compact 7

  • Question

  • Hi, All,

    I have just built an OS image for Windows Embedded Compact 7 on a TI OMAP 3530,
    and I would like to ask about the detailed steps for enabling VFP.

    From MSDN, I know I need to initialize the VFP feature by calling VfpOemInit in OEMInit.
    On the application side, MSDN also says to add the compiler option "/QRfpe-" to enable VFP.
    Are those the right steps to enable VFP in WinCE7?

    I followed the two steps above, and the compiled assembly code now appears to use the VFP instruction set.
    However, the performance looks strange to me.
    I tested with the following code.

        for (int i=0; i<50000; i++)
        {
            test = tan(cos(3.4578236482) + sin(83.9374658) / asin(0.123)) * log(36.123);
            test = log(fabs(test)) * pow(fabs(test), 0.578);
        }

    In WinCE7 with VFP enabled (/QRarch7 /QRfpe-), the calculation time is around 0.92 secs.
    In WinCE7 with VFP disabled (/QRarch5), the calculation time is around 0.38 secs.

    I also compared this result with VFP enabled in WinCE6 (using vfp2fpcrt.dll provided by ARM), where the calculation time is around 0.17 secs.

    Could you help me understand why the VFP-enabled build in CE7 is slower than the VFP-disabled one?
    And why is it slower than CE6?
    Or is there a setting I'm missing?

    Thanks for your help,


    Monday, June 13, 2011 3:14 PM

All replies

  • What is the second parameter, dwFPSID, passed to VfpOemInit? It should be VFP_AUTO_DETECT_FPSID; otherwise FPCRT may use SW emulation for the transcendental functions.

    Monday, June 13, 2011 6:35 PM
  • Thanks KMOS, I passed VFP_AUTO_DETECT_FPSID.
    What else could explain that result?

    • Edited by ddrichard Monday, June 13, 2011 9:49 PM
    Monday, June 13, 2011 8:01 PM
  • You may want to look at the listing file (.COD, generated by setting WINCECOD=1) to compare and analyze the opcodes the compiler generates with and without the "/QRfpe-" option.
    Also, you can check g_dwFPSID in platform\common\src\ARM\COMMON\vfp\vfpSupport.cpp to double-check the processor's FPSID value.

    Monday, June 13, 2011 8:49 PM
  • With the "/QRfpe-" option enabled, I can see that the opcodes use the VFP instruction set.


    ; 156  :         test = tan(cos(3.4578236482) + sin(83.9374658) / asin(0.123)) * log(36.123);

      00024    e5983000     ldr         r3,[r8]
      00028    e30d05d5     mov         r0,#0xD5D5
      0002c    e30a199f     mov         r1,#0xA99F
      00030    e34701e2     movt        r0,#0x71E2
      00034    e344100b     movt        r1,#0x400B
      00038    e12fff33     blx         r3
      0003c    e5993000     ldr         r3,[r9]
      00040    ec410b19     vmov        d9,r0,r1
      00044    e3000795     mov         r0,#0x795
      00048    e30f1bff     mov         r1,#0xFBFF
      0004c    e347008e     movt        r0,#0x708E
      00050    e3441054     movt        r1,#0x4054
      00054    e12fff33     blx         r3
      00058    e59a3000     ldr         r3,[r10]
      0005c    ec410b18     vmov        d8,r0,r1
      00060    e30702b0     mov         r0,#0x72B0
      00064    e3071ced     mov         r1,#0x7CED
      00068    e3490168     movt        r0,#0x9168
      0006c    e3431fbf     movt        r1,#0x3FBF
      00070    e12fff33     blx         r3
      00074    e59b3000     ldr         r3,[r11]
      00078    ec410b10     vmov        d0,r0,r1
      0007c    ee880b00     vdiv.f64    d0,d8,d0
      00080    ee300b09     vadd.f64    d0,d0,d9
      00084    ec510b10     vmov        r0,r1,d0
      00088    e12fff33     blx         r3
      0008c    e5943000     ldr         r3,[r4]
      00090    ec410b18     vmov        d8,r0,r1
      00094    e30b0439     mov         r0,#0xB439
      00098    e3001fbe     mov         r1,#0xFBE
      0009c    e34706c8     movt        r0,#0x76C8
      000a0    e3441042     movt        r1,#0x4042
      000a4    e12fff33     blx         r3
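
The mov/movt pairs in the listing above build each double constant in two 32-bit registers before the call. The following is a minimal sketch of that encoding, not code from the thread: the helper names `mov_movt` and `double_from_regs` are hypothetical, and it assumes r0 carries the low word and r1 the high word of an IEEE 754 double, which is what the register values in the listing suggest.

```c
#include <stdint.h>
#include <string.h>

/* Combine a MOV immediate (low 16 bits) and a MOVT immediate (high 16 bits)
   into the 32-bit register value the pair produces. */
static uint32_t mov_movt(uint16_t low, uint16_t high)
{
    return ((uint32_t)high << 16) | low;
}

/* Reinterpret two 32-bit registers as one IEEE 754 double.
   Assumption of this sketch: r0 = low word, r1 = high word. */
static double double_from_regs(uint32_t r0, uint32_t r1)
{
    uint64_t bits = ((uint64_t)r1 << 32) | r0;
    double value;
    memcpy(&value, &bits, sizeof value);  /* bit copy, no conversion */
    return value;
}
```

Feeding it the first pair from the listing, `double_from_regs(mov_movt(0xD5D5, 0x71E2), mov_movt(0xA99F, 0x400B))`, should give back the argument of cos from the source line, 3.4578236482, up to rounding.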

    With the "/QRfpe-" option disabled, the opcodes don't use the VFP instruction set.


    ; 156  :         test = tan(cos(3.4578236482) + sin(83.9374658) / asin(0.123)) * log(36.123);

      00004    e59f31c0     ldr         r3,|$LN35@FloatTest| ; =|__imp_cos|
      00008    e59f01b8     ldr         r0,|$LN34@FloatTest| ; =0x71e2d5d5
      0000c    e59f11b0     ldr         r1,|$LN33@FloatTest| ; =0x400ba99f
      00010    e5933000     ldr         r3,[r3]
      00014    e12fff33     blx         r3
      00018    e59f31a0     ldr         r3,|$LN32@FloatTest| ; =|__imp_sin|
      0001c    e1a08001     mov         r8,r1
      00020    e59f1194     ldr         r1,|$LN31@FloatTest| ; =0x4054fbff
      00024    e5933000     ldr         r3,[r3]
      00028    e1a07000     mov         r7,r0
      0002c    e59f0184     ldr         r0,|$LN30@FloatTest| ; =0x708e0795
      00030    e12fff33     blx         r3
      00034    e59f3178     ldr         r3,|$LN29@FloatTest| ; =|__imp_asin|
      00038    e1a06001     mov         r6,r1
      0003c    e59f116c     ldr         r1,|$LN28@FloatTest| ; =0x3fbf7ced
      00040    e5933000     ldr         r3,[r3]
      00044    e1a05000     mov         r5,r0
      00048    e59f015c     ldr         r0,|$LN27@FloatTest| ; =0x916872b0
      0004c    e12fff33     blx         r3
      00050    e59fe150     ldr         lr,|$LN26@FloatTest| ; =|__imp___divd|
      00054    e1a03001     mov         r3,r1
      00058    e1a02000     mov         r2,r0
      0005c    e59e4000     ldr         r4,[lr]
      00060    e1a00005     mov         r0,r5
      00064    e1a01006     mov         r1,r6
      00068    e12fff34     blx         r4
      0006c    e59fe130     ldr         lr,|$LN25@FloatTest| ; =|__imp___addd|
      00070    e1a02007     mov         r2,r7
      00074    e1a03008     mov         r3,r8
      00078    e59e4000     ldr         r4,[lr]
      0007c    e12fff34     blx         r4
      00080    e59f3118     ldr         r3,|$LN24@FloatTest| ; =|__imp_tan|
      00084    e5933000     ldr         r3,[r3]
      00088    e12fff33     blx         r3
      0008c    e59f30f0     ldr         r3,|$LN17@FloatTest| ; =|__imp_log|
      00090    e1a06001     mov         r6,r1
      00094    e59f1100     ldr         r1,|$LN23@FloatTest| ; =0x40420fbe
      00098    e5933000     ldr         r3,[r3]
      0009c    e1a05000     mov         r5,r0
      000a0    e59f00f0     ldr         r0,|$LN22@FloatTest| ; =0x76c8b439
      000a4    e12fff33     blx         r3
      000a8    e59f30d8     ldr         r3,|$LN18@FloatTest| ; =|__imp___muld|
      000ac    e1a02005     mov         r2,r5
      000b0    e5934000     ldr         r4,[r3]
      000b4    e1a03006     mov         r3,r6
      000b8    e12fff34     blx         r4
      000bc    e3a03cc3     mov         r3,#0xC300
      000c0    e59f40bc     ldr         r4,|$LN17@FloatTest| ; =|__imp_log|
      000c4    e59f50b4     ldr         r5,|$LN16@FloatTest| ; =|__imp_pow|
      000c8    e383b050     orr         r11,r3,#0x50
      000cc    e1a09001     mov         r9,r1
      000d0    e1a0a000     mov         r10,r0
      000d4         |$LL10@FloatTest|
      000d4    e3a0047f     mov         r0,#0x7F000000

    However, I noticed one strange thing.
    Inside each math function call (e.g. cos => blx r3), the code checks register R12 to determine whether to use the VFP version of the math function (e.g. cos_vfp).
    Whether the "/QRfpe-" option is enabled or disabled, the VFP version of the math function is called.
    That confuses me.

    By the way, when you said to double-check g_dwFPSID in vfpSupport.cpp,
    did you mean to make sure that I'm not using VFP_FULL_EMULATION_FPSID?

    Monday, June 13, 2011 10:14 PM
  • That is because the QRfpe- option applies only to the compiler, and the entry points of the transcendental functions (tan, cos, etc.) are defined in FPCRT.
    Take a look at VFP_TRANS_ENTRY in private\winceos\coreos\core\fpcrt\fpcrt_vfp.s and you can see that R12 points to g_haveVfpHardware, whose value is determined by IsProcessorFeaturePresent(PF_ARM_VFP_HARDWARE) in fpcrt.cpp.
    IsProcessorFeaturePresent eventually flows through pOemGlobal->pfnIsVFPFeaturePresent (IsVfpFeaturePresent) in platform\common\src\ARM\COMMON\vfp\vfpSupport.cpp.
    g_dwFPSID is initialized by VfpReadFpsid in VfpOemInitEx and holds the raw FPSID from the hardware.
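
The dispatch described in this reply, where a single entry point consults a hardware flag and the caller's compiler options play no part, can be pictured with a small sketch. Everything here is hypothetical and is not the FPCRT source: `have_vfp_hardware` stands in for g_haveVfpHardware, the two path functions stand in for the VFP and software routines, and the polynomial is only placeholder math.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the flag the thread calls g_haveVfpHardware. */
static bool have_vfp_hardware = false;

/* Call counters, present only so the chosen path can be observed. */
static int hw_calls = 0;
static int sw_calls = 0;

/* Placeholder math: a short Taylor polynomial, not a real cos(). */
static double cos_poly(double x)
{
    double x2 = x * x;
    return 1.0 - x2 / 2.0 + x2 * x2 / 24.0;
}

static double cos_hw_path(double x) { hw_calls++; return cos_poly(x); }
static double cos_sw_path(double x) { sw_calls++; return cos_poly(x); }

/* Set once at startup from a feature query (assumption of this sketch). */
void set_vfp_present(bool present) { have_vfp_hardware = present; }

/* Single entry point: the caller never picks the path, the flag does. */
double dispatch_cos(double x)
{
    return have_vfp_hardware ? cos_hw_path(x) : cos_sw_path(x);
}
```

Because the choice is made inside the library at run time, a caller built with or without "/QRfpe-" would land on the same path, which matches what was observed in the disassembly.
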
    Tuesday, June 14, 2011 5:41 PM
  • Thanks KMOS for the detailed answer.

    So whatever the QRfpe setting in the compiler,
    once the hardware has VFP, the transcendental functions (tan, cos, etc.) will automatically use it.

    By the way, I found out why the performance differs with and without the QRfpe option in the following program.

    1    for (int i=0; i<50000; i++)
    2    {
    3        test = tan(cos(3.4578236482) + sin(83.9374658) / asin(0.123)) * log(36.123);
    4        test = log(fabs(test)) * pow(fabs(test), 0.578);
    5    }

    With option "/QRarch5", line 3 is executed only once and the result is kept in registers for the rest of the loop.
    However, with option "/QRarch7 /QRfpe-", line 3 is recalculated on every iteration and the result is not kept in a register.
    That's why it is slower. Do you know why the compiler does not optimize the source code with option "/QRarch7 /QRfpe-"?
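
The difference between the two builds amounts to whether a loop-invariant expression is hoisted out of the loop. The sketch below shows the two shapes side by side. It is illustrative only: `invariant_part` and `dependent_part` are hypothetical stand-ins for lines 3 and 4 of the benchmark, with trivial arithmetic instead of the real math calls, and a call counter makes the difference visible.

```c
/* Counts how often the loop-invariant expression is evaluated. */
static unsigned long invariant_calls = 0;

/* Hypothetical stand-in for line 3: depends on nothing inside the loop. */
static double invariant_part(void)
{
    invariant_calls++;
    return 1.5 * 2.25 + 0.125;
}

/* Hypothetical stand-in for line 4: depends only on the value from line 3. */
static double dependent_part(double t)
{
    return t * 0.5 + 1.0;
}

/* Shape of the slow build: the invariant is recomputed on every pass. */
double run_unhoisted(int iterations)
{
    double test = 0.0;
    for (int i = 0; i < iterations; i++) {
        test = invariant_part();
        test = dependent_part(test);
    }
    return test;
}

/* Shape of the fast build: the invariant is computed once and reused. */
double run_hoisted(int iterations)
{
    double test = 0.0;
    const double cached = invariant_part();
    for (int i = 0; i < iterations; i++) {
        test = dependent_part(cached);
    }
    return test;
}
```

Both functions return the same value, but for 50000 iterations the first evaluates the invariant 50000 times and the second once. Hoisting by hand like this is one way to take the optimizer out of the comparison.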

    Furthermore, do you have any details on "vfp2fpcrt.dll"?
    I still can't understand why the execution time is faster with that DLL in CE6.

    Tuesday, June 14, 2011 8:19 PM
  • Since the transcendental functions always call into FPCRT, they make the comparison more complicated.
    You may want to create test code that uses only non-transcendental operations, such as floating-point add, multiply, divide, etc.
    In general, with the "QRfpe-" option the compiler emits VFP instructions (e.g. vdiv.f64, vadd.f64, etc.) inline, so the performance should be better than without "QRfpe-", where the code always calls out to the CRT (e.g. __imp___divd, __imp___addd, etc.).
    But it looks like the opposite in your case. One possible explanation is that the processor doesn't support VFP instructions (they could be disabled by the bootloader or system initialization code), so every VFP instruction the processor encounters generates an Undefined Instruction exception. Because the ARM VFP OAL library (platform\common\src\ARM\COMMON\vfp\vfpSupport.cpp) also provides SW emulation to handle unsupported instructions (the VfpHandleException function), the VFP instructions would then actually be executed by SW emulation.
    You could double-check whether VfpExecuteInstruction gets executed constantly during the test program.

    The "QRarch7" option tells the compiler to emit ARMv7 opcodes (e.g. the MOVT instruction).
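
A test restricted to add, multiply, divide and subtract, as suggested in this reply, might look like the sketch below. The function names and constants are hypothetical. It times with standard C `clock()` for portability; on the target a platform tick counter would probably be used instead, which is an assumption of this sketch.

```c
#include <time.h>

/* Arithmetic-only kernel: add, multiply, divide and subtract on doubles.
   The volatile accumulator keeps the compiler from folding the loop away,
   and the divisor is a parameter so the division stays a real division. */
double arithmetic_kernel(int iterations, double seed, double divisor)
{
    volatile double acc = seed;
    for (int i = 1; i <= iterations; i++) {
        acc = (acc + (double)i) * 0.5;
        acc = acc / divisor - 0.75;
    }
    return acc;
}

/* CPU-time measurement of the kernel, in seconds, using clock(). */
double time_kernel_seconds(int iterations)
{
    clock_t start = clock();
    (void)arithmetic_kernel(iterations, 1.0, 1.25);
    return (double)(clock() - start) / (double)CLOCKS_PER_SEC;
}
```

Building this once with "/QRfpe-" and once without should isolate the cost of inline VFP instructions versus CRT helper calls, since no transcendental function is involved.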

    Tuesday, June 14, 2011 8:47 PM
  • Hi, KMOS,

    The hardware FPSID is 410330c1, which indicates HaveVfpHardware and the PF_ARM_VFP_V3 subarchitecture.
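
For reference, the FPSID value can be split into its fields as in the sketch below. The bit layout is the one I recall from the ARM architecture documentation ([31:24] implementer, [23] software-only flag, [22:16] subarchitecture, [15:8] part number, [7:4] variant, [3:0] revision) and should be checked against the manual. The struct and function names are hypothetical and are not part of the OAL code.

```c
#include <stdint.h>

/* Decoded FPSID fields (names are hypothetical, for illustration). */
struct fpsid_fields {
    unsigned implementer;      /* bits [31:24], 0x41 is ASCII 'A' for ARM */
    unsigned software_only;    /* bit  [23], 0 means a hardware VFP */
    unsigned subarchitecture;  /* bits [22:16] */
    unsigned part_number;      /* bits [15:8] */
    unsigned variant;          /* bits [7:4] */
    unsigned revision;         /* bits [3:0] */
};

/* Split a raw FPSID word into its fields by shifting and masking. */
struct fpsid_fields decode_fpsid(uint32_t fpsid)
{
    struct fpsid_fields f;
    f.implementer     = (fpsid >> 24) & 0xFFu;
    f.software_only   = (fpsid >> 23) & 0x1u;
    f.subarchitecture = (fpsid >> 16) & 0x7Fu;
    f.part_number     = (fpsid >> 8)  & 0xFFu;
    f.variant         = (fpsid >> 4)  & 0xFu;
    f.revision        = fpsid & 0xFu;
    return f;
}
```

For 0x410330c1 this yields implementer 0x41, software-only flag 0 and subarchitecture 0x03, which is consistent with the hardware VFP and PF_ARM_VFP_V3 reading given in the post.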

    For VfpHandleException, I added some debug messages to check whether the processor encounters the Undefined Instruction exception.
    But I didn't see any debug messages coming from the VfpHandleException function.

    In my case, I think the root cause of the performance issue is the compiler optimization issue mentioned in my previous post.
    If I change my benchmark program to the following one, the performance issue goes away.

        for (int i=0; i<50000; i++)
        {
            test = cos(tan(test) + sin(test) / atan((i+test)/(2*i))) * log(fabs(test));
            test = log(fabs(test)) * pow(fabs(test), test/(i+1));
        }

    With "QRfpe-" it runs a little bit faster (~0.02) than without "QRfpe-".



    Wednesday, June 15, 2011 10:17 AM
  • Richard,

    Did you ever find out the reason(s) behind your performance issue? I am chasing a similar problem.

    Tuesday, March 6, 2012 7:54 PM