none
Use of __m128i as two 64 bits integers

    Question

  • Hi,

    I'm trying to write an x64 implementation - targeted at AMD64 - of the lock-free queue designed by messrs Maged M. Michael and Michael L. Scott (go to http://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html) and referenced in http://www.devx.com/amd/Article/33290/1763.

    The technique involves wrapping up pointers to the 'next' node in the singly-linked list as a pointer and a version.

    As pushes and pops occur, the algorithm constantly checks the version of either the tail or head node (when pushing/popping respectively) whilst it attempts to make modifications.  If the versions change during the push or the pop (meaning another thread has just completed a change), the thread restarts the push/pop operation.

    To achieve this lock-free algorithm, you need to exploit the CAS operation, or more precisely the DCAS operation.

    In x86, it is relatively easy since a pointer and the version will be 32 bits - 64 bit interlocked and SSE intrinsics are easier to find - albeit you can't use any __m64 variants.

    The DCAS equivalent for me, where my versioned pointer is defined as:

    //force alignment to 16-byte boundary (otherwise debug builds tend to crash)
    __declspec (align(16)) struct node_ptr
    {
        node_ptr *node;
        unsigned __int64 version;
    }

    is to use the cmpxch16b instruction, through the intrinsic _InterlockedCompare64Exchange128.  That in itself is not a problem.

    The problem lies in a few lines of pseudo-code in the aforementioned queue algorithm where node_ptr structures are simply copied via 'a=b' syntax.  Of course, even properly aligned 64bit writes are atomic on x86.  However, the only atomic 128 bit writes on x64 are those involving the __m128i type, via movdqa, or in VC2005 the intrinsics _mm_store_si128 or _mm_load_si128 (plus the unaligned versions of course).

    I have experimented (although perhaps not enough) with using the SSE methods with that data.   I've managed to get the copying of one structure to another down to two operations:

    //given two node_ptrs test1 and test2:
    _mm_store_si128((__m128i *)&test2,*((__m128i *)&test1));

    //compiles to:
    movdqa xmm0, XMMWORD PTR test1$[rsp]
    movdqa XMMWORD PTR test2$[rsp], xmm0

    But this is clearly not an atomic operation - making it impossible to implement the lock-free.  Only if I use __m128i will such a copy operation become atomic.

    I'm getting to the point. ;)

    So, my thinking was to use the __m128i type instead of my structure, and I naively felt sure that I'd be able to find other intrinsics that would let me atomically read and write the two 64 bit halves of the __m128i wouldn't I... of course I would...

    Except I can only find one intrinsic, _mm_move_epi64 (movq assembly instruction), and that only lets me load the lower 64 bits of the pair.

    How can I address both halves in separate operations?

    I did also wonder about the _mm_maskmoveu_si128 intrinsic - but will that be atomic?

    Any help greatly appreciated.

    Thursday, February 08, 2007 9:58 PM

Answers

  •  

    I can't say I'm an expert on this type of programming, but I did write some of the intrinsics docs for VC 2005, so maybe I can help.

    You can load two 64-bit integers into a 128 bit integer on x64 architecture using _mm_set_epi64x.  There are variations that let you load the upper or lower 64-bits only _mm_setl_epi64x or both parts the same (_mm_set1_epi64x).   These were introduced in Visual C++ 2005.  You should try them and check the assembly code and processor specs for the generated instructions to see if it's atomic, as the docs don't say so specifically.

    I don't think the __mm_maskmoveu_si128 will be atomic. The Intel specs for maskmovdqu (which is what that intrinsic generates) say that you may need a memory fence with it, so I don't think it is atomic. 

    There is the _mm_sfence intrinsic, which forces stores to be committed, so you might try using that if you must use a non-atomic instruction.

     

    Friday, February 09, 2007 1:32 AM
  • Hi Gordon,

    That's great info, thanks, I shall try and implement using that method.  Even if it doesn't address my problem, I think you've still answered the question - so I shall leave it as such.  Cheers ;)

    It would be great if I could implement this algorithm in 'proper' 64-bit code, but I fear it might not be possible - because I realised that I need a cmpxchg16b (_InterlockedCompare64Exchange128) instruction that compares all 128bits of the destination with the comperand and transfers all 128bits of the source if they are consistent.

    Here's the thing, from my understanding, that's what it's supposed to do.  I boned up on both the Intel and AMD documentation for the cmpxhg16b instruction, and here's a couple of excerpts:

    Intel: "Compares the 64-bit value in EDX:EAX (or 128-bit value in RDX:RAX if operand size is 128 bits) with the operand destination operand). If the values are equal, the 64-bit value in ECX:EBX (or 128-bit value in RCX:RBX) is stored in the destination operand."

    AMD: "Compares the value in the rDX:rAX registers with a 64-bit or 128-bit value in the specified memory
    location. If the values are equal, the instruction copies the value in the rCX:rBX registers to the
    memory location and sets the zero flag (ZF) of the rFLAGS register to 1."

    Am I being stupid - don't both of those paragraphs suggest that there is a 128bit comparison taking place, and a 128bit copy being performed?  Well, _InterlockedCompare64Exchange128 disagrees!

    Finally...

    Also, after I wrote the post, I thought about what was causing my difficulties here, and realised that if I could remove the reliance on the 64-bit pointer, I'd be laughing as we say in the UK.

    So, if I turned my node_ptr into:

    struct node_ptr
    {
       unsigned int nodeid;
       unsigned int version;
    };

    That's 64 bits, allowing me to use _InterlockedCompareExchange64.  All I then have to do is use a buffer of node objects (at the moment it'll have to be a fixed size because I don't want to go into lock-free memory allocation yet) and dish them out/grab them back as pushes and pops occur.

    To be honest, I'm ashamed I hadn't thought of it before...

    Friday, February 09, 2007 9:32 AM

All replies

  •  

    I can't say I'm an expert on this type of programming, but I did write some of the intrinsics docs for VC 2005, so maybe I can help.

    You can load two 64-bit integers into a 128 bit integer on x64 architecture using _mm_set_epi64x.  There are variations that let you load the upper or lower 64-bits only _mm_setl_epi64x or both parts the same (_mm_set1_epi64x).   These were introduced in Visual C++ 2005.  You should try them and check the assembly code and processor specs for the generated instructions to see if it's atomic, as the docs don't say so specifically.

    I don't think the __mm_maskmoveu_si128 will be atomic. The Intel specs for maskmovdqu (which is what that intrinsic generates) say that you may need a memory fence with it, so I don't think it is atomic. 

    There is the _mm_sfence intrinsic, which forces stores to be committed, so you might try using that if you must use a non-atomic instruction.

     

    Friday, February 09, 2007 1:32 AM
  • Hi Gordon,

    That's great info, thanks, I shall try and implement using that method.  Even if it doesn't address my problem, I think you've still answered the question - so I shall leave it as such.  Cheers ;)

    It would be great if I could implement this algorithm in 'proper' 64-bit code, but I fear it might not be possible - because I realised that I need a cmpxchg16b (_InterlockedCompare64Exchange128) instruction that compares all 128bits of the destination with the comperand and transfers all 128bits of the source if they are consistent.

    Here's the thing, from my understanding, that's what it's supposed to do.  I boned up on both the Intel and AMD documentation for the cmpxhg16b instruction, and here's a couple of excerpts:

    Intel: "Compares the 64-bit value in EDX:EAX (or 128-bit value in RDX:RAX if operand size is 128 bits) with the operand destination operand). If the values are equal, the 64-bit value in ECX:EBX (or 128-bit value in RCX:RBX) is stored in the destination operand."

    AMD: "Compares the value in the rDX:rAX registers with a 64-bit or 128-bit value in the specified memory
    location. If the values are equal, the instruction copies the value in the rCX:rBX registers to the
    memory location and sets the zero flag (ZF) of the rFLAGS register to 1."

    Am I being stupid - don't both of those paragraphs suggest that there is a 128bit comparison taking place, and a 128bit copy being performed?  Well, _InterlockedCompare64Exchange128 disagrees!

    Finally...

    Also, after I wrote the post, I thought about what was causing my difficulties here, and realised that if I could remove the reliance on the 64-bit pointer, I'd be laughing as we say in the UK.

    So, if I turned my node_ptr into:

    struct node_ptr
    {
       unsigned int nodeid;
       unsigned int version;
    };

    That's 64 bits, allowing me to use _InterlockedCompareExchange64.  All I then have to do is use a buffer of node objects (at the moment it'll have to be a fixed size because I don't want to go into lock-free memory allocation yet) and dish them out/grab them back as pushes and pops occur.

    To be honest, I'm ashamed I hadn't thought of it before...

    Friday, February 09, 2007 9:32 AM
  • I'm not sure about _InterlockedCompare64Exchange128.  It's possible that the docs are not correct, although I'd be a little bit surprised as no one has noticed so far.  I'll investigate and let you know...
    Friday, February 09, 2007 10:24 PM
  • Hi Gordon,

    Thanks - will be interesting to see what you find out!

    Cheers!

    Saturday, February 10, 2007 12:11 AM
  • _InterlockedCompare64Exchange128 intrinsic is for Itanium
    Sunday, February 11, 2007 10:53 PM
  • Hi there,

    I agree from the documentation that _InterlockedCompare64Exchange128 it appears that it is.  However, the documentation is talking about the versions with acquire/release semantics which, because the x64 architecture doesn't use that methodology, are emulated to work the same as the Itanium architecture.

    It's not exclusively for Itanium, as it wraps around the assembly instruction cmpxchg16b - which is documented as an operation in both Intel and AMD architectures.

    I have compiled code and examined the assembly output on my AMD x64 and can confirm for definite that cmpxchg16b is generated by _IC64E128 and compiles and runs with no problems at all.

    Monday, February 12, 2007 9:02 AM
  • Hi,

    Firstly - I apologise, the _IC64E128 is only valid on IPF, you're absolutely right.  I've discovered that cmpxchg16b has not been implemented at all in the VC++ compiler (even though it's been a feature of x64 since it was created and devised) at all, and there are no plans to implement it...

    Here's a post from the MS VC++ dev team on Connect on the issue called _InterlockedCompareExchange128

    "Thank you for suggestion. New feature work cannot be done in SP1 release to VS2005. Please re-open this suggestion against Orcas version of VS2005. It will be re-considered as a new feature for the Orcas."

    Unfortunately, this was found to be a lie (unless plausible deniability can be claimed), because SP1 saw the inclusion of 26 new intrinsics, 7 of which were for x64 (as per the list at: http://msdn2.microsoft.com/en-us/library/aa983353(VS.80).aspx).  All that effort, and yet a fundamental operation such as cmpxchg16b - one of those which would appear on a list of must-have features - slips through.

    Looks like we'll have to wait another year before we can programme properly.  Unless inline assembler support is brought in again.

    Overall I'm normally very satisfied with MS and the dev tools we have - but frankly this disgusts me.  It basically means that, when using VS2005, the x64 platform is less performant and flexible than it's ancient older brother x86.

    It's not acceptable.

    Wednesday, February 14, 2007 5:42 PM
  • I'd just like to add for the record that the vc9 compiler now does support ice128 -

    so this is no longer an issue!

    Saturday, February 02, 2008 10:57 PM