Significant performance drop when locale is set and runtime library is using DLL (/MD)

  • Question

  • We noticed that the performance of MBCS string functions such as _tcscat (_mbscat), _tstof (atof), and _tcsstr (_mbsstr) drops significantly when we set the LC_CTYPE locale to the system default and use the multi-threaded DLL runtime library (/MD). With /MD the code is 50% slower, while with the static multi-threaded runtime library (/MT) the same code is only 15-20% slower. We understand that string functions can get slower once a locale is set, but why does /MD cause such a significant hit? Isn't the implementation the same whether the runtime library is a LIB or a DLL?

    Is there anything we can do to improve the performance? We are switching our projects from /MT to /MD, and since we need to support users' native languages, such a performance drop is hard to accept.

    The demo code is on github if required. 

     
    • Edited by johnzhuca Wednesday, November 27, 2019 8:17 PM
    Wednesday, November 27, 2019 8:13 PM

All replies

  • I'd be curious to see the code.  Unless the routines are accessing a common buffer and therefore doing locking, there shouldn't be any performance difference between /MT and /MD.

    Tim Roberts | Driver MVP Emeritus | Providenza & Boekelheide, Inc.

    Wednesday, November 27, 2019 11:31 PM
  • Hello,

    Thank you for posting here.

    >>The demo code is on github if required. 

    Do you have any code you can show us? I will test it on my side. That would help us find the root cause.

    Best Regards,

    Suarez Zhou


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com.

    Thursday, November 28, 2019 6:15 AM
  • Sorry, the forum does not allow me to post a link. The code is under my account johnzhuacl, and the repo's name is mtmd-test. The code is very simple, as shown below; just compile it with /MT and /MD, targeting the x86 platform. On my machine:

    /MD result:

    elapsed = 20.382277, 123456759020589.203125
    elapsed when locale set = 30.583172, 123456759020589.203125

    and /MT result:

    elapsed = 23.085637, 123456759020589.203125
    elapsed when locale set = 27.368535, 123456759020589.203125

    #include <iostream>
    #include <string>
    #include <tchar.h>
    #include <time.h>
    #include <windows.h>
    #include <mbctype.h>
    
    void setLocaleToSystemDefault()
    {
      int currentCodePage = GetACP(); //note test is running on code page 1252 
      char codePage[10];
      codePage[0] = '.';
      _itoa(currentCodePage, &codePage[1], 10);
      setlocale(LC_CTYPE, codePage);
      _setmbcp(_MB_CP_LOCALE);
    }
    
    inline long long PerformanceCounter() noexcept
    {
      LARGE_INTEGER li;
      ::QueryPerformanceCounter(&li);
      return li.QuadPart;
    }
    /***
      The calculation is only there to ensure the optimizer won't skip the string function calls.
    ***/
    double doTest() {
      TCHAR testData[100] = _T("12345678");
      long long total = 0;
      double dtotal = 0.0;
      for (__int64 i = 0; i < 10000000; i++)
      {
    
        TCHAR* index =  _tcsstr(testData, _T("4"));
        int in = (TCHAR*)index - testData;
        total += in;
        TCHAR* v = _tcscat(testData, _T(".90"));
        double dv = _tstof(testData);
        dtotal += dv;
        _tcscpy(testData, _T("12345678"));
      }
      return dtotal - total;
    }
    int main()
    {
      double total = 0.0;
      long long  startTime = PerformanceCounter();
      total = doTest();
      long long endTime = PerformanceCounter();
      __int64 diff = endTime - startTime;
      printf("elapsed = %f, %f\n", diff/1000000.0, total);
      setLocaleToSystemDefault();
    
      total = doTest();
      long long endTime2 = PerformanceCounter();
      diff = endTime2 - endTime;
      printf("elapsed when locale set = %f, %f\n", diff / 1000000.0, total);
      return 0;
    }



    • Edited by johnzhuca Friday, November 29, 2019 1:03 AM
    Thursday, November 28, 2019 10:33 PM
  • Hello,

    >>Sorry it does not allow me to post link

    Please access this link to verify the account, and then you can post link after verification.

    https://social.technet.microsoft.com/Forums/en-US/dc4002e4-e3de-4b1e-9a97-3702387886cc/verify-account-42?forum=reportabug

    >>The code is very simple as below, just compile it with /MT and /MD, target x86 platform.

    We tested it and found no difference. This was compiled in debug (x86) mode. We're sorry, but we can't reproduce your issue. Could you provide more details?

    Best Regards,

    Suarez Zhou


    MSDN Community Support


    Friday, November 29, 2019 7:50 AM
  • There are a couple of issues here.

    The biggest problem is that your second reported time is including the time it took to execute setLocaleToSystemDefault.  You should grab another time hack immediately before the second call to "doTest".

    The second issue is that your units are confusing.  You're running the loop 10 million times, but dividing the elapsed time by 1 million.  QPC does not return microseconds.  To find the units, you have to call QueryPerformanceFrequency.  Without doing that, you can't really tell whether your timings are meaningful.  Also, QPC measures elapsed wall-clock time, so it counts time when your thread might not be running.  If you're on a single CPU machine, you'll be counting time while other processes do their thing.

    Timing is tricky.  You want to run this a few times and make sure the results are consistent.
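
    A minimal sketch of the two fixes described above, under stated assumptions: the helper names ticksToSeconds and secondsFor are hypothetical, and std::chrono::steady_clock is swapped in for QueryPerformanceCounter so that the sketch stays portable. With QPC itself, the ticks-per-second value passed to ticksToSeconds would come from QueryPerformanceFrequency.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Convert a raw tick delta to seconds using the counter frequency
// (ticks per second). Hypothetical helper, not part of any Windows API.
inline double ticksToSeconds(std::int64_t ticks, std::int64_t ticksPerSecond)
{
    assert(ticksPerSecond > 0);
    return static_cast<double>(ticks) / static_cast<double>(ticksPerSecond);
}

// Time a single call of `work`. The start stamp is taken immediately before
// the call, so any setup done earlier (for example setting the locale) is
// not included in the measured interval.
template <typename Work>
double secondsFor(Work&& work)
{
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    work();
    const Clock::time_point stop = Clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

    In the test program, this would mean calling setLocaleToSystemDefault() first and only then timing the second run, for example with secondsFor([] { doTest(); }).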


    Tim Roberts | Driver MVP Emeritus | Providenza & Boekelheide, Inc.

    Friday, November 29, 2019 7:50 AM
  • One thing to point out: QueryPerformanceCounter outputs ticks. While modern hardware may have a frequency that rounds nicely, you can't divide by anything other than the output of QueryPerformanceFrequency to get usable results from QueryPerformanceCounter.

    So after fixing how you time the code, including what Tim Roberts mentioned about you timing the second test incorrectly, I ran those tests on my system to see what results I got.

    /MD
    elapsed = 1.356663s, 123456759020589.203125
    elapsed when locale set = 1.867729s, 123456759020589.203125

    /MT
    elapsed = 1.460518s, 123456759020589.203125
    elapsed when locale set = 1.744204s, 123456759020589.203125

    The thing to remember is that this time was obtained over 10,000,000 iterations. This means that the average for each iteration of the /MD example was 0.1867729 microseconds and the average of each iteration of the /MT example was 0.1744204 microseconds. If you really think that a difference in the average of about 12 nanoseconds is really significant then report this as a bug.
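
    The per-iteration arithmetic above can be reproduced with a small helper. This is only a sketch; the function name nanosecondsPerIteration is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Average cost of one loop iteration in nanoseconds, given the total elapsed
// time in seconds and the number of iterations that were timed.
inline double nanosecondsPerIteration(double totalSeconds, std::int64_t iterations)
{
    assert(iterations > 0);
    return totalSeconds * 1e9 / static_cast<double>(iterations);
}
```

    With the locale-set figures quoted above, 1.867729 s (/MD) and 1.744204 s (/MT) over 10,000,000 iterations give roughly 186.77 ns and 174.42 ns per iteration, a difference of about 12.35 ns.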

    However, please remember that the implementation of the static library and dynamic library versions of the UCRT may not be identical. Since the linker is also able to see and manipulate everything in the UCRT static library then it may also be able to do a tiny amount of optimisation that the linker would be unable to do on the dynamic version of the UCRT.


    This is a signature. Any samples given are not meant to have error checking or show best practices. They are meant to just illustrate a point. I may also give inefficient code or introduce some problems to discourage copy/paste coding. This is because the major point of my posts is to aid in the learning process.

    Friday, November 29, 2019 2:39 PM
  • Thanks, I am still waiting for the account to be verified. Just to confirm, the character set is MBCS, not Unicode. When the character set is Unicode, there is no difference between /MD and /MT, with or without the locale set; I think in both cases they may call the same routines to handle Unicode.
    Friday, November 29, 2019 7:33 PM
  • Darran, thanks for trying it out. I was trying to show that when the locale is set, an application compiled with the /MD option sees a significant performance drop (in your case ~40%), while with the /MT option the drop is only 20%. In my environment it is 50% vs 20%, so the 30% difference is hard to understand and accept.
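
    The percentages being compared here are relative slowdowns, which can be computed as in the sketch below. The function name slowdownPercent is hypothetical; because it works on a ratio, the result does not depend on the time unit.

```cpp
#include <cassert>

// Relative slowdown of `after` compared with `before`, as a percentage.
// For example, going from 2.0 to 3.0 is a 50% slowdown.
inline double slowdownPercent(double before, double after)
{
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}
```

    Applied to the figures posted earlier in the thread, Darran's numbers give about 37.7% for /MD (1.356663 to 1.867729) and about 19.4% for /MT (1.460518 to 1.744204), while the original numbers give about 50.0% for /MD (20.382277 to 30.583172) and about 18.6% for /MT (23.085637 to 27.368535).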
    Friday, November 29, 2019 7:49 PM
  • Since I know that the difference is around 12 nanoseconds, I think it is fairly easy to explain. My system has DDR4 RAM, and the tightest timing based on the latency is 12.5ns. All it takes is one extra cold memory access per iteration, and that is an extra 12.5ns added to the iteration's average.

    There are other problems that could get in the way too: strstr/wcsstr is in vcruntime.dll but the other functions used are in ucrtbase.dll. This means that the DLL version has to call between DLLs, which could be placed several MiB apart in the address space. For the static version, the linker is able to place the objects right next to each other in the address space.

    This means that the static version has a higher probability of all the code being in the same cache line but the DLL version guarantees that not all of the code is going to be in the same cache line. It is because of this kind of characteristic of processors and memory access that worrying over nanoseconds in execution time is pointless.



    Friday, November 29, 2019 9:53 PM
  • I am not sure that calling across DLLs is slower than static linking. Without setting the locale, or if I am using the Unicode character set, they run at the same speed, and sometimes /MD is even faster. I assume that once the DLL is loaded, the function pointer is known, so there is no extra function call. However, I am not trying to compare the performance of /MD and /MT; I am just curious why, once the locale is set, performance drops 40% to 50% with /MD. Since the performance only drops 20% with /MT, I would like to see if there is any way to improve performance to at least be on par with the /MT option.

    Monday, December 2, 2019 6:19 PM
  • If you are not sure about this then it seems that you need to read up on how static linking works compared to dynamic linking and how caches work on a system.

    You see, it isn't about it being cross DLL or whatever; it is about how often the code that is executed and the data that is accessed are already in the L1 cache.

    The code is placed in the same area in the static version, so the processor won't have as many cache misses. But the dynamic version is going to have to call between code that is more spread out. This has a higher likelihood of missing not just the L1 cache but the L2 cache too.

    There are other things here too. The static library isn't built with LTCG information but we don't know how the UCRT DLL was built. If it was built with LTCG and even PGO then the linker could optimize the DLL better and place functions better. This means that the dynamic version could be optimized better and that is why the no locale version can perform better. But as soon as the dynamic version starts using the locale data then the locality of the static version starts winning out since the dynamic version has to jump all over the place.

    So there are plenty of reasons why the statically linked version has that weird performance characteristic. It just requires understanding that the processor may not make much sense.



    Monday, December 2, 2019 10:38 PM
  • Look, you guys are trying to assign significant meaning to numbers that are below the measurement error here.  Run this app multiple times, and you'll see that the variation from run to run is higher than the deltas you're trying to explain.

    C:\Dev\sw>cl /EHsc /MT /Ox x.cpp
    Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x64
    Copyright (C) Microsoft Corporation.  All rights reserved.
    
    x.cpp
    Microsoft (R) Incremental Linker Version 14.00.24215.1
    Copyright (C) Microsoft Corporation.  All rights reserved.
    
    /out:x.exe
    x.obj
    
    C:\Dev\sw>x
    elapsed = 17.288650, 123456759020589.203125
    elapsed when locale set = 22.237561, 123456759020589.203125
    
    C:\Dev\sw>x
    elapsed = 18.155147, 123456759020589.203125
    elapsed when locale set = 18.946379, 123456759020589.203125
    
    C:\Dev\sw>x
    elapsed = 17.816495, 123456759020589.203125
    elapsed when locale set = 20.260885, 123456759020589.203125
    
    C:\Dev\sw>cl /EHsc /MD /Ox x.cpp
    Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x64
    Copyright (C) Microsoft Corporation.  All rights reserved.
    
    x.cpp
    Microsoft (R) Incremental Linker Version 14.00.24215.1
    Copyright (C) Microsoft Corporation.  All rights reserved.
    
    /out:x.exe
    x.obj
    
    C:\Dev\sw>x
    elapsed = 18.085366, 123456759020589.203125
    elapsed when locale set = 22.969098, 123456759020589.203125
    
    C:\Dev\sw>x
    elapsed = 18.353831, 123456759020589.203125
    elapsed when locale set = 19.933346, 123456759020589.203125
    
    C:\Dev\sw>x
    elapsed = 17.223524, 123456759020589.203125
    elapsed when locale set = 22.391077, 123456759020589.203125

    Code benchmarking is an art.  You HAVE to understand your environment.  Darran, I respect you, but you can't seriously try to explain timing differences at a nanosecond level when you're doing testing on a running system.  That's nonsense.  If you were using the cycle counter and measuring one run at a time, then maybe you could do that, but these tests each run for several seconds.  You don't own the CPU for that entire time.  You're going to have thousands of interrupts and thousands of context switches in that time, and ALL of the code running in those interruptions is going to be included in the QPC times.  As you can see, I'm getting 15% differences from run to run, and that's totally typical.
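
    One way to put a number on the run-to-run variation described above is the relative spread of repeated timings. The sketch below is an illustration only; the function name relativeSpread and the choice of (max - min) / min as the measure are assumptions, not something taken from the posts.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Run-to-run spread of repeated timings of the same build, expressed as
// (max - min) / min. A difference between two builds that is smaller than
// this spread cannot be told apart from noise.
inline double relativeSpread(const std::vector<double>& samples)
{
    assert(!samples.empty());
    const auto mm = std::minmax_element(samples.begin(), samples.end());
    const double lowest = *mm.first;
    const double highest = *mm.second;
    assert(lowest > 0.0);
    return (highest - lowest) / lowest;
}
```

    For the three locale-set runs shown above, this gives about 17.4% for /MT (18.946379 to 22.237561) and about 15.2% for /MD (19.933346 to 22.969098).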

    Tim Roberts | Driver MVP Emeritus | Providenza & Boekelheide, Inc.

    Wednesday, December 4, 2019 8:33 PM
  • I apologise if I made you think that I was actually assigning significant meaning. I didn't really explain much because this was only ever about trying to get the OP to realise that such a small difference in absolute time was a non-issue.

    Since I have an 8 core/16 thread processor, I could reduce CPU contention a bit and get more consistent results by closing a lot of unnecessary foreground and background processes. I was also not just taking a single result, but instead working it out over a lot of runs and averaging it.

    Why do you think that, besides the one time that I referenced an absolute 12.5ns to show that a cold memory access could easily add that time on (which was inaccurate, mind you, but that didn't matter), I either referenced average times or didn't reference times at all?

    While I didn't show it, the confidence with the numbers that I had was based upon doing the minimum required work to gain confidence in these numbers. I had actually separated out both parts of the code so that the code with the locale set would run independently of the code without the locale set and I then ran 100 iterations of all 4 versions to work out the minimum value, maximum value, average and standard deviation. I had also done this working using the times calculated by QPC/QPF in nanoseconds so there was no error introduced from converting it to seconds using the floating point division at that magnitude. Since I didn't keep those I ran some of the tests again.

    As an example of the values that I was working with:

    /MD
    No Locale

    Min 1309689
    Max 1343528
    Avg 1327270.3

    Standard Deviation 11646.76838

    With Locale

    Min 1797067
    Max 1838917
    Avg 1822467.7

    Standard Deviation 33951.89895

    /MT

    No Locale

    Min 1397075
    Max 1437463
    Avg 1415398.7

    Standard Deviation 16057.68281

    With Locale

    Min 1675783
    Max 1729937
    Avg 1710276.3

    Standard Deviation 19709.34446

    All times are in nanoseconds.
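
    A summary like the one above (minimum, maximum, average and standard deviation over repeated runs) can be computed along these lines. This is a sketch, not the code that produced these numbers: the names RunStats and summarize are hypothetical, and it assumes the population standard deviation (dividing by N), since the post does not say which variant was used.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Summary of repeated timing runs, all in the same unit as the samples.
struct RunStats
{
    double min;
    double max;
    double mean;
    double stddev; // population standard deviation (assumption: divide by N)
};

inline RunStats summarize(const std::vector<double>& samples)
{
    assert(!samples.empty());
    const double count = static_cast<double>(samples.size());

    RunStats stats{};
    const auto mm = std::minmax_element(samples.begin(), samples.end());
    stats.min = *mm.first;
    stats.max = *mm.second;

    double sum = 0.0;
    for (double value : samples)
        sum += value;
    stats.mean = sum / count;

    double squaredDeviations = 0.0;
    for (double value : samples)
        squaredDeviations += (value - stats.mean) * (value - stats.mean);
    stats.stddev = std::sqrt(squaredDeviations / count);

    return stats;
}
```

    Each build and locale combination would get its own vector of per-run times, so the four summaries can be compared side by side.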

    The thing to note here is that the performance characteristic showed that /MD with locale was consistently slower than /MT with locale. The original data had only two outliers for the /MT version, one ended up being around 1.8 seconds, which wasn't slower than the /MD average but it also had one which was around 1.9 seconds.

    The /MD version with the locale set never had a time go below 1.79 seconds but had two runs also go above 1.9 seconds, one even managed to get to 1.95 seconds.

    So I was always working with the average performance characteristic values but it is something that I should have made clear. For that I apologise.

    And yes, the values that I posted in one of my replies are a little slower than my average. They were the values that I got the first time I ran the application after fixing it to give more accurate times for the with-locale portion, and the values were similar enough to my full tests that I just used them as is.




    • Edited by Darran Rowe Thursday, December 5, 2019 1:47 AM
    Thursday, December 5, 2019 1:42 AM