Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8?

    Question

  • Before I go into my huge rant: is there a way to prevent the compiler from reordering statements that works for VC8 as well as older versions of VC? I want to be very clear that I am looking strictly for compiler barriers and not memory barriers for the time being, since generating the appropriate memory barriers is easy enough to do in __asm blocks.

    Anyway, as the title suggests, in VC8 Standard I am fairly sure I am seeing output that is incorrect compared to the functionality described on MSDN. First, the MSDN documentation on _ReadBarrier and similar intrinsics actually contradicts itself! Here http://msdn2.microsoft.com/en-us/library/z055s48f(VS.80).aspx it is claimed that _ReadBarrier is a memory barrier, whereas here http://msdn2.microsoft.com/en-us/library/ms684208.aspx (in the Remarks section of that page) it is described as a compiler barrier and nothing more.

    As well, the code snippet from the second page linked, which supposedly shows a full memory barrier for x86, apparently does nothing of the sort. An xchg does not guarantee any fence whatsoever, at least as far as I can see from the x86 instruction listings; it is strictly an unordered atomic operation. To portably and reliably create any type of fence, I'm fairly certain you would need to use sfence, lfence, or mfence, which are SSE and SSE2 instructions.

    After looking at the output generated by the supposed interlocked-operation intrinsics such as InterlockedCompareExchange, which claim full-fence semantics, it seems that the appropriate fences are not in place, even when SSE and SSE2 generation is enabled! Looking through the interlocked intrinsics I have found only contradictions in the documentation and, unless I am going insane, improper implementation. Similarly, the load-acquire and store-release semantics specified for volatile are not reflected in the instructions generated.
For reference, I am using VC8 standard in debug on Windows XP, testing with SSE and SSE2 generation on and off.
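    To the opening question, a compiler-only barrier can be selected per compiler version along these lines. This is a hedged sketch: _ReadWriteBarrier is the documented VC8 intrinsic, the empty __asm block is a commonly used fallback on older VC (any __asm block is generally reported to inhibit compiler reordering across it, though I have not seen that formally guaranteed), and the GCC form is shown only for contrast. The macro and function names are made up.

```cpp
// Compiler-only barrier (no hardware fence) selected per compiler.
#if defined(_MSC_VER) && _MSC_VER >= 1400
  #include <intrin.h>
  #define COMPILER_BARRIER() _ReadWriteBarrier()   // VC8 intrinsic
#elif defined(_MSC_VER)
  #define COMPILER_BARRIER() __asm { }             // older VC: empty asm block
#else
  #define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")  // GCC idiom
#endif

int payload = 0;
int flag    = 0;

// Publish 'v': the barrier keeps the compiler from sinking the payload
// store below the flag store. It does NOT constrain the CPU -- that
// distinction is the whole point of this thread.
void publish(int v)
{
    payload = v;
    COMPILER_BARRIER();
    flag = 1;
}
```

    None of these emit a single machine instruction; they only pin the compiler's ordering, which is exactly the separation asked for above.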
    Sunday, June 24, 2007 2:31 AM

Answers

  •  Rivorus wrote:
    An xchg does not guarantee any fence whatsoever, at least as far as I can see from the x86 instruction listings

     

    According to the Intel 64 / IA-32 Software Developer's Manual, XCHG instructions with a memory operand imply automatic locking.

     

    As for the fence operations, I can confirm that no explicit ones are provided for volatile objects. That being said, the documentation doesn't appear to make that guarantee (and neither does volatile on any other C++ platforms I know of). The only guarantee made is for the volatile objects not to take part in any compiler re-orderings.

     

    One part I find ambiguous in the MSDN documentation is the following statement: "Although the processor will not reorder un-cacheable memory accesses, un-cacheable variables must be volatile to guarantee that the compiler will not change memory order." This seems to suggest that volatile objects are allocated in ranges marked as uncacheable in the processor's MTRRs (memory type range registers, applicable to P4, Xeon, and P6). According to the Intel docs, marking a range UC will cause the processor to enforce strong ordering on accesses to that segment. What's unclear to me about this strong ordering is whether it affects only the uncacheable memory segments themselves (that is, reads and writes to UC segments cannot be reordered). I doubt, however, that it also enforces serialization of reads/writes to segments *not* marked as UC which occur (instruction-wise) in between those to the UC segments.

     

    The bottom line, as I see it, is that you must manually use either the fence instructions; implicitly or explicitly LOCK-prefixed instructions; CPUID; or API-level locking primitives. Failure to do so may, as you note, allow CPU reorderings to break your code on SMP architectures.
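    For reference, the first two alternatives in that list can be reached without raw __asm blocks. A sketch, hedged: the variable and function names are illustrative, the SSE/SSE2 fence intrinsics are the same on VC8 and GCC, and the GCC builtin in the second function is there only so the sketch compiles outside of VC.

```cpp
#if defined(_MSC_VER)
  #include <intrin.h>
#endif
#include <emmintrin.h>   // _mm_lfence / _mm_sfence / _mm_mfence (SSE/SSE2)

long shared_value = 0;

// Option 1: an explicit full fence after a plain store (SSE2 mfence).
void store_with_mfence(long v)
{
    shared_value = v;
    _mm_mfence();        // orders all earlier loads and stores
}

// Option 2: a LOCK-prefixed read-modify-write, which is itself a full
// barrier on x86. _InterlockedExchange is the VC8 intrinsic; the GCC
// builtin emits the equivalent locked xchg.
long exchange_locked(long v)
{
#if defined(_MSC_VER)
    return _InterlockedExchange(&shared_value, v);
#else
    return __atomic_exchange_n(&shared_value, v, __ATOMIC_SEQ_CST);
#endif
}
```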

     

    A side note: I noticed that the Orcas compiler generates volatile operations in release builds in a manner similar to VC8's debug builds: a single pre-increment of a volatile int will load the value into a register, increment the register, then write it back. Release builds in VC8 will skip the register step and increment the value at the memory location in a single instruction.

    Monday, June 25, 2007 8:04 AM
    Moderator
  • We should probably spend some more time to document this better.

     

    Let's talk specifically about x86/x64 (Itanium has different behavior). volatile and the _*Barrier intrinsics prevent compiler reordering, but are not hardware fences. volatile does so with acquire/release semantics, and the barriers with different semantics depending on which one is used. So why do the *fence instructions exist? They exist for weakly-ordered memory accesses (remember these fences weren't introduced until SSE and SSE2). So in cases where you are using weakly-ordered instructions, you should make sure that you use these fences. In the absence of weakly-ordered instructions, the strong-ordering guarantees of the processor make the constraints on compiler reordering (through volatile and the _*Barrier intrinsics) sufficient.
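    To make the weakly-ordered case concrete: nontemporal (streaming) stores are exactly the accesses that escape x86's normal strong ordering, so they do need an explicit sfence before a flag can be raised. A sketch with the SSE2 intrinsics; the buffer layout and names are made up for illustration.

```cpp
#include <emmintrin.h>   // _mm_stream_si32 (SSE2), _mm_sfence (SSE)

int buffer[4];
volatile int buffer_ready = 0;

// Nontemporal stores are weakly ordered: without the sfence, the flag
// store below could become globally visible before the data stores.
void fill_buffer_nontemporal()
{
    for (int i = 0; i < 4; ++i)
        _mm_stream_si32(&buffer[i], i * 10);
    _mm_sfence();        // drain the weakly-ordered stores first
    buffer_ready = 1;
}
```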

     

    Thanks,

    Thursday, June 28, 2007 5:52 PM
    Moderator

All replies

  • Have you actually witnessed any dangerous reorderings in your compiled code? And what platform are you targeting?
    Sunday, June 24, 2007 12:23 PM
    Moderator
  • Unfortunately, since these would be CPU reorderings and not reorderings of the instructions themselves, it's impossible to force such reorderings to occur or to reproduce the behavior on a second run of the same test. I would likely only witness improper results by personally running rigorous tests on various multicore processors on Windows, and even then I'd have to be lucky enough (unlucky enough?) to have the result of an operation issued earlier appear after the result of an operation issued later. I do not currently have that luxury, and I'd rather not wait until users get odd results on Windows, considering that from the instructions generated it looks very clear to me that there are absolutely no ordering guarantees in place, despite what the documentation claims. While they may rarely be apparent when running code, these reorderings are perfectly fine for the CPU to perform and, unless I am horribly mistaken, are exactly why instructions such as sfence, lfence, and mfence exist in the first place!

    Microsoft seems to somewhat understand the concepts of ordering, since they acknowledge that barriers can be expensive and so supposedly provide versions with only read barriers (acquire semantics) and write barriers (release semantics) in addition to their supposedly fully ordered versions. I don't have access to a Vista machine, so I can't check the instructions generated for the Acquire and Release versions (Vista-only according to the documentation, which makes no sense to me since these low-level intrinsics are just a few lines of inlined x86 instructions), but I am really curious how they would be implemented if Microsoft believes that a lack of any fences already implies fully ordered semantics for atomic instructions. If that actually were the case, it wouldn't be possible to implement strictly acquire and release versions.

    Does the implementation of the Acquire and Release forms of the intrinsics correctly insert lfence and sfence instructions, respectively? The acquire and release semantics supposedly present in VC8 for volatile variable access do not produce any of these fences, so I wouldn't be surprised if the same went for the interlocked intrinsics. And if the Acquire and Release forms of the intrinsics on Vista actually do insert fences, isn't it rather odd that the acquire and release forms would contain more ordering instructions than the supposedly fully ordered versions? In actuality, it looks to me that, at least in VC8 on XP in debug (with no SSE generation, with SSE generation, and with SSE2 generation), the fully ordered forms of these intrinsics are actually totally unordered. For the RMW instructions, proper acquire semantics would imply an lfence after the atomic operation, proper release semantics would imply an sfence before the operation, and strict ordering would imply both an sfence before and an lfence after the instruction.

    To make sure I am not going crazy, I looked online and came across this page http://ein.designreactor.com:8080/amd_devCentral/articlex.jsp?id=89#top which predates the release of VS 2005. The important line to take note of is: "... in Visual Studio 2005, volatile will also introduce code to prevent out-of-order execution by the processor (a memory fence or memory barrier). In other environments, you will need to use memory fence instructions (sfence, lfence, and mfence) to prevent hardware re-ordering." As stated there, CPU barriers are required for proper ordering guarantees. I find it especially frustrating that this page is so old, references VS 2005 providing such functionality before VS 2005 was even released, and even explains exactly how to implement it, including the exact instructions required, and yet I see no fences in the code generated by VC8.

    What's worse, MSDN still claims proper output, but I see nothing of the sort. What does Microsoft think acquire and release semantics mean? Are they interpreting these as compiler-only reordering requirements? If so, they have a horrible misunderstanding of atomic operations and ordering semantics. It seems to me that the proper behavior would be a compiler error on any attempt to use ordering semantics without at least SSE generation enabled; with only SSE generation enabled, release semantics allowed (using sfence prior to "write" operations) but acquire and fully ordered semantics rejected; and with SSE2 generation enabled, acquire and fully ordered semantics allowed as well (using lfence after "read" operations for acquire semantics, and sfence and lfence around RMW operations for fully ordered semantics). The same would have to be reflected in the behavior of volatile qualification as well.
    Sunday, June 24, 2007 5:04 PM
  • Now that's really confusing. AFAIK there are no ordering guarantees for normal loads (i.e. not only for inherently weakly ordered instructions) from WB memory on x86. So exactly how can a volatile memory access have acquire and release semantics without a memory barrier?

     

    -hg

    Monday, July 02, 2007 11:00 AM
  •  Holger Grund wrote:

    Now that's really confusing. AFAIK there are no ordering guarantees for normal loads (i.e. not only for inherently weakly ordered instructions) from WB memory on x86. So exactly how can a volatile memory access have acquire and release semantics without a memory barrier?

     

    My understanding (or assumption) is that it's controlled through the memory type range registers and / or page attribute tables, by marking the volatile ranges as uncachable, and thus forcing the CPU to use strict ordering. I haven't seen this confirmed by anyone, though.

    Monday, July 02, 2007 11:41 AM
    Moderator
  • I don't think Windows ever allocates UC pages for standard heap memory. There are several reasons why this is impractical.

     

    My admittedly somewhat outdated information about IA-32 suggests that all CPUs will always agree on a given CPU's order of stores (unless the instructions are inherently weakly ordered, e.g. nontemporal stores). The same is not true for loads, however. Therefore I claim that a simple load from memory (which is what a volatile read boils down to) does not have acquire semantics.

     

    -hg

    Monday, July 02, 2007 3:08 PM
  • That's pretty much what the Intel and AMD manuals say -- reads can go ahead of writes; out-of-order reads are allowed, as are speculative reads.
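    This "reads can go ahead of writes" rule is exactly what the classic store-buffer litmus test exercises: each thread stores one flag and then loads the other, and without a full fence between the store and the load, both threads may read 0. A sketch in C++11 atomics, which did not exist when this thread was written; it is offered only to make the reordering observable and is not something VC8 provided.

```cpp
#include <atomic>
#include <thread>

// Store-buffer litmus test: counts iterations where both threads
// read 0, the outcome forbidden under a fully ordered execution.
int run_store_buffer(int iters, std::memory_order mo)
{
    int both_zero = 0;
    for (int i = 0; i < iters; ++i) {
        std::atomic<int> x(0), y(0);
        int r1 = -1, r2 = -1;
        std::thread t1([&] { x.store(1, mo); r1 = y.load(mo); });
        std::thread t2([&] { y.store(1, mo); r2 = x.load(mo); });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0)
            ++both_zero;
    }
    return both_zero;
}
```

    On a multicore x86, the relaxed version may report a nonzero count; with std::memory_order_seq_cst (which compilers implement with an mfence or a locked instruction) the count is guaranteed to be zero.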

    Monday, July 02, 2007 8:43 PM
    Moderator
  •  Holger Grund wrote:

    I don't think Windows ever allocates UC pages for standard heap memory. There are several reasons why this is impractical.

     

    I suppose we need Kang Su to make another statement.

    Monday, July 02, 2007 8:45 PM
    Moderator
  • This goes for x86 only, as I have only tested on x86.

     

    As for volatile variables, they are not placed in a UC area; they are treated just like normal variables as far as placement goes.

    For compiler reordering, volatiles behave similarly to a read or write barrier.

    For example:

    Code Snippet
    int testloc1;
    volatile int testloc2;
    //
    testloc1 = x;
    ++ testloc2;
    testloc1 = y;

     will generate assembly in which the first write instruction could have been omitted, but the compiler kept it because it was placed before the volatile write.

    From http://msdn2.microsoft.com/en-us/library/bb310595.aspx

    Visual C++ 2005 goes beyond standard C++ to define multi-threading-friendly semantics for volatile variable access. Starting with Visual C++ 2005, reads from volatile variables are defined to have read-acquire semantics, and writes to volatile variables are defined to have write-release semantics. This means that the compiler will not rearrange any reads and writes past them, and on Windows it will ensure that the CPU does not do so either.

     

    As far as my understanding of read-acquire and write-release goes, any write after a write-release could be moved before the write-release. The code above could also be compiled as:

    Code Snippet
    mov DWORD PTR ?testloc1__3HA, eax  ; testloc1 = y
    mov edx, 1
    add DWORD PTR ?testloc2__3HC, edx  ; ++ testloc2

    And another strange part is the mention of "on Windows". There is no way that Windows can affect anything about reordering.

    These reorderings, memory barriers, and volatile are making me very uncomfortable.

    Getting complicated and strange.

     

    This is my first use of the MSDN forum and I have made some mistakes. Some mailto links are present in this post; just ignore them. Sorry.

    Sunday, July 08, 2007 12:37 PM
  •  Mag pie wrote:

    And another strange part is the mention of "on Windows". There is no way that Windows can affect anything about reordering.

    These reorderings, memory barriers, and volatile are making me very uncomfortable.

    Getting complicated and strange.

     

    I was informed by Kang Su Gatlin that he had asked a chip vendor to address this thread. There will probably be some new input here this coming week.

    Sunday, July 08, 2007 12:44 PM
    Moderator
  •  Mag pie wrote:

     

    Code Snippet
    int testloc1;
    volatile int testloc2;
    //
    testloc1 = x;
    ++ testloc2;
    testloc1 = y;

     

     

    If I'm interpreting the Intel / AMD docs correctly, and the instruction ordering isn't otherwise strengthened, any x86 or x64 CPU would be allowed to reorder the execution of the above code to (roughly) read

    Code Snippet

    testloc1 = x;

    testloc1 = y;

    ++testloc2;

     

    Sunday, July 08, 2007 12:52 PM
    Moderator
  • The compiler is allowed to optimize away the "testloc1 = x;" statement, as it is a dead store.

    My thought is that a volatile write is acting as the "_WriteBarrier" intrinsic.

     

    Code Snippet
    testloc1 = x;
    testloc2 = x;
    testloc1 = x;
    testloc2 = y;
    ++ ++ testloc1;
    ++ ++ testloc2;

     

    compiled in VC8 ( /Ox /Oi /GL /D "WIN32" /D "NDEBUG" /FD /EHsc /MD /Fo"Asm\\" /c /Wp64 /Zi /TP /errorReport:prompt ), the output asm file reads like this:

    Code Snippet

    mov DWORD PTR ?testloc1__3HA, eax ; testloc1 = x : this should be optimized away
    mov DWORD PTR ?testloc1__3HA, eax ; testloc1 = x : compiler reordered as write-release allows
    mov DWORD PTR ?testloc2__3HC, eax ; testloc2 = x
    add eax, 2
    mov DWORD PTR ?testloc1__3HA, eax ; ++ ++ testloc1 : compiler reordered, moved up
    mov eax, 1
    mov DWORD PTR ?testloc2__3HC, ecx ; testloc2 = y
    add DWORD PTR ?testloc2__3HC, eax ; ++ testloc2
    add DWORD PTR ?testloc2__3HC, eax ; ++ testloc2

     

     I have very little idea how the CPU will reorder this code at run time, but it seems to me that the compiler is doing some reordering, treating a volatile write as a _WriteBarrier. MSDN and the Intel manuals state that write reordering is very unlikely on x86 unless there is some alignment issue (please correct me if I'm wrong), so the CPU will write in the order of the assembly code.

     

    Reads can be the real problem, since although the compiler will ensure that a volatile read or _ReadBarrier acts with read-acquire semantics, the CPU will not. <- I'm not so sure about this.

    Sunday, July 08, 2007 2:13 PM
  •  Mag pie wrote:

    The compiler is allowed to optimize away the "testloc1 = x;" statement, as it is a dead store.

     

    Well, my point still stands if you make both testlocs volatile.

     

     Mag pie wrote:

    My thought is that a volatile write is acting as the "_WriteBarrier" intrinsic.

     

    As did the original poster.

     

     Mag pie wrote:

     I have very little idea how the CPU will reorder this code at run time, but it seems to me that the compiler is doing some reordering, treating a volatile write as a _WriteBarrier. MSDN and the Intel manuals state that write reordering is very unlikely on x86 unless there is some alignment issue (please correct me if I'm wrong), so the CPU will write in the order of the assembly code.

    Write reordering is illegal. Reads are allowed to move ahead of writes (provided they aren't targeting the same address), and speculative reads are also allowed.

     

    I think we're repeating ourselves, though, so I suggest we let this thread stay dead until the chipset people have spoken.

    Sunday, July 08, 2007 2:38 PM
    Moderator
  • If I may chime in here: I am simply interested in the following case, which seems to be what the original poster was dancing around. If two threads are running on separate cores, and thread A runs:

      1. *foo = 7;
      2. <write barrier>

    ...then thread B goes:

      1. <read barrier>
      2. x = *foo;

    ...all I want is an absolute guarantee that x == 7 in thread B, no matter what the address of foo is. I just want to avoid all stale cache data. The question is, what do I put in <write barrier> and <read barrier>?
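    In later terms than this thread had available, the guarantee being asked for is a release store paired with an acquire load on a flag. A sketch using C++11 std::atomic (anachronistic for VC8; it shows the semantics requested, not what volatile gave you in 2005). Note that the barriers attach to a flag: a bare barrier with no flag cannot rule out the reader simply running before the writer.

```cpp
#include <atomic>
#include <thread>

int foo = 0;                       // the payload ("*foo = 7" above)
std::atomic<int> ready(0);         // the flag the barriers protect

void thread_a()                    // writer
{
    foo = 7;
    ready.store(1, std::memory_order_release);   // "<write barrier>" + flag
}

int thread_b()                     // reader
{
    while (ready.load(std::memory_order_acquire) == 0)
        ;                          // spin: "<read barrier>" + flag check
    return foo;                    // guaranteed to observe 7
}
```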

     

    On the Xbox 360 platform, which has three separate cores, you can use the MemoryBarrier() macro. This macro is supposed to act as both a read and write barrier, and it's implemented as lwsync (a PowerPC instruction), which makes perfect sense.

     

    On Win32, the MemoryBarrier macro is implemented as an xchg instruction on a temporary stack variable. So I have trouble believing that it's going to act as a memory barrier on foo in a multi-core environment, such as a Core 2 Duo. I combed through the IA-32 manuals and saw nothing reassuring there. I see only two possibilities:

    1. The xchg instruction secretly acts like lwsync; that is, it syncs the whole local L1 with the shared L2, but this is not documented (which I doubt). Or...
    2. By default, a single Win32 process is simply never scheduled on multiple cores simultaneously, and never will be in any future Windows OS. You basically run on a single virtual core, perhaps with HyperThreading. In this case the xchg has the desired effect. (This seems more likely.)

    It would help to have confirmation of one of these two possibilities. I didn't bother opening a new thread for my question because, if my second theory (about always running on a single core) is accurate, it would clear up some confusion for the original poster.
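    For what it's worth, the xchg at the core of Win32's MemoryBarrier() can be sketched directly; the function name here is made up. On x86, XCHG with a memory operand asserts LOCK implicitly, so it is a full fence regardless of which address it touches. That is documented under the LOCK-prefix rules rather than as any secret lwsync-like cache behavior.

```cpp
#if defined(_MSC_VER)
  #include <intrin.h>
#endif

// Sketch of the xchg behind Win32's MemoryBarrier(): XCHG with a
// memory operand is implicitly LOCKed, and locked operations are
// full fences on x86, no matter which address they touch.
inline long xchg_full_fence(long* p, long v)
{
#if defined(_MSC_VER)
    return _InterlockedExchange(p, v);          // compiles to a locked xchg
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("xchg %0, %1"
                         : "+r"(v), "+m"(*p)
                         :
                         : "memory");           // also a compiler barrier
    return v;                                   // v now holds the old *p
#else
    return __atomic_exchange_n(p, v, __ATOMIC_SEQ_CST);  // generic fallback
#endif
}
```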

     

    Probably the only way to open your application up to simultaneous execution on multiple cores is to call special APIs like GetLogicalProcessorInformation (which is Vista-only) and SetThreadAffinityMask. And if you do that, it's probably your own responsibility to use sfence/lfence/mfence appropriately throughout your code, because MemoryBarrier() won't help you. It would be helpful to have this theory confirmed as well. Thanks.

     

    Friday, July 20, 2007 8:20 PM
  • Take a look at the IA-32 instruction set reference http://www.intel.com/design/processor/manuals/253667.pdf

    It explicitly says that "XCHG reg, mem" is a locked operation even when the LOCK prefix is not specified.

    Then take a look at section 7.2 of http://www.intel.com/design/processor/manuals/253668.pdf, which describes the IA-32 memory model.

     

    Thanks,

    Eugene

    Saturday, July 21, 2007 3:32 AM
  • As I wasn't able to get an independent response to the forums in a timely manner, I wrote a more detailed response and posted it on my blog. See:

    http://blogs.msdn.com/kangsu/archive/2007/07/16/volatile-acquire-release-memory-fences-and-vc2005.aspx

     

    It is worth noting that volatile does NOT give you a global total ordering (or at least does not guarantee that you'll get one).

     

    Also see page 206 onward of http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf. This document was recently updated with clarified memory model semantics. Reading old processor documentation about memory models isn't worth it (IMHO).

     

    Thanks,

    Monday, July 23, 2007 2:08 AM
    Moderator
  • OK, so I now have a lot more faith that XCHG acts as a memory fence even in multiprocessor environments. Thanks for the links. I guess I just find the documentation a bit muddy.

     

    Monday, July 23, 2007 3:11 PM
  •  Kang Su Gatlin wrote:

    As I wasn't able to get a an independent response to the forums in a timely manner, I wrote a more detailed response and placed it on my blog.  See:

    http://blogs.msdn.com/kangsu/archive/2007/07/16/volatile-acquire-release-memory-fences-and-vc2005.aspx

     

     

    FWIW, I'd agree for the trivial case of an aligned int if there were a guarantee that loads can't pass loads. However, everything I've read so far suggests that this assertion doesn't hold for IA-32.

     

     Kang Su Gatlin wrote:

    It is worth noting that volatile does NOT give you a global total ordering (or at least it does not guarantee that you'll get it). 

     

    Also see page 206+ of http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf.  This is recently updated and clarified memory model semantics.  Reading old processor documentation about memory models isn't worth it (IMHO).

     

    Of course, the AMD64 docs aren't really normative for IA-32, and there are recently updated Intel documents which are pretty clear about the memory model. I don't see how one could derive from them that loads can't pass loads.

     

    -hg

    Tuesday, July 24, 2007 2:13 PM
  • The statements in my blog do in fact apply to Intel processors as well, and I say this with a high degree of certainty. Of course this is not our product, and it may be the case that the only party that can make you comfortable with such a statement is the manufacturer of the product you're interested in.

     

     

    Thanks!

    Tuesday, July 24, 2007 5:31 PM
    Moderator
  • I'd probably be more confident if everything I've heard so far didn't suggest otherwise.

    -hg

    Tuesday, July 24, 2007 6:26 PM
  • If you're not confident, then programming to a strictly weaker memory model will not affect the correctness of your program, but it may result in worse performance than necessary.

     

    Thanks,

    Tuesday, July 24, 2007 10:31 PM
    Moderator
  •  

    Hate to bring a dead thread back to life, but Intel has now published their updated memory model:

    http://developer.intel.com/products/processor/manuals/318147.pdf

     

    As you will see, this document agrees with the statements I've made previously.

     

    Thanks!

    Saturday, September 22, 2007 2:40 PM
    Moderator