global_memory_fence vs. tile_barrier::wait_with_global_memory_fence?

    Question

  • Hi,

    I've been using the tile_barrier::wait_with_global_memory_fence function to synchronize the threads within a tile. The post "tile_barrier in C++ AMP" mentions global_memory_fence but doesn't provide further details.

    So what are the differences between the two, and when should each be used? Thanks in advance.

    Thursday, February 14, 2013 3:11 AM

All replies

  • A fence does not guarantee synchronized execution (in other words, it does not guarantee that all lanes / threads in a tile have reached the same PC and that all memory ops have committed). It does guarantee that no reordering of memory accesses occurs around the fence (in other words, your compiler will never hoist a mem op that comes after the fence ahead of it or vice versa, whereas in the absence of the fence it might). A fence is less expensive than a barrier, performance-wise.
    Thursday, February 14, 2013 6:45 AM
  • In addition to what Alex mentioned, please also note that (as stated in the blog post), 

    a memory fence ensures that memory accesses are visible to other threads in the thread tile, and are executed according to program order

    So the memory fence has no effect among threads in different tiles. In general, unless you are very familiar with the C++ AMP memory model and have to write low-level synchronization code, or you need to prevent a certain optimization from happening, you probably should not use a fence. Section 8.1.2 of the open spec has more details.

    • Edited by Zhu, Weirong Thursday, February 14, 2013 8:52 PM
    Thursday, February 14, 2013 8:34 PM
  • Hi Weirong,

    Let's say we're doing the following operations in a kernel:

    • write data to tile_static memory
    • call tile_static_memory_fence function
    • read data from tile_static memory

    Much like loading/caching data in the Matrix Multiply example. Since the memory fence ensures that reads come after writes for all threads in a tile (correct me if I'm wrong), the data should be ready when we use it. Then why do we need to use the tile_barrier::wait_with_tile_static_memory_fence function instead? Or can we use just the tile_static_memory_fence function here?

    Monday, February 18, 2013 6:54 AM
  • If a particular thread reads the same location it has written to, then yes, the data is visible in the expected order. If it reads another location, one written by another thread in the same wavefront (warp, the hardware unit of scheduling), then yes, it will be visible, since in this case lock-step execution ensures that all threads have done the write before hitting the fence. However, if it reads a location that is to be written by a thread in the same tile but not in the same hardware unit of scheduling, anything goes, because the reading thread may have advanced beyond its fence before the other thread has even reached the write (this is opaque to the programmer, by the way, and depends on how wavefronts are scheduled by the hardware). Once again, a fence does not ensure synchronization in a general sense; it only ensures that, within the scope of visibility of a particular thread, the ordering between memory ops around the fence is maintained (with the interesting side effects I mentioned when one has more information about the hardware itself).
    Monday, February 18, 2013 4:54 PM
  • Thank you very much, Alex. Now everything's clear to me.

    Tuesday, February 19, 2013 2:12 AM
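
For readers who want to experiment with the write / synchronize / read pattern discussed above, here is a minimal sketch. It is a CPU analogy in standard C++ (std::thread), not C++ AMP code: the vector named cache only plays the role of tile_static memory, and SimpleBarrier and reverse_tile are hypothetical names invented for this illustration. The wait() call stands in for what tile_barrier::wait_with_tile_static_memory_fence provides, namely that every thread has finished its write before any thread reads a slot written by another thread. A fence alone would order each thread's own accesses but would not make a thread wait for its neighbours.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical reusable barrier built from C++11 primitives.
// It is an illustration only and is not part of C++ AMP.
class SimpleBarrier {
public:
    explicit SimpleBarrier(int count)
        : threshold_(count), waiting_(0), generation_(0) {}

    // Blocks until `count` threads have called wait().
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        const int gen = generation_;
        if (++waiting_ == threshold_) {
            // Last thread to arrive releases everyone.
            ++generation_;
            waiting_ = 0;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return gen != generation_; });
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int threshold_, waiting_, generation_;
};

// Each thread writes its own slot, waits at the barrier, then reads the
// slot written by a different thread (the mirrored index).
std::vector<int> reverse_tile(const std::vector<int>& input) {
    const int n = static_cast<int>(input.size());
    std::vector<int> cache(n);   // plays the role of tile_static memory
    std::vector<int> output(n);
    SimpleBarrier barrier(n);
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        threads.emplace_back([&, i] {
            cache[i] = input[i];           // 1. write own slot
            barrier.wait();                // 2. wait until all writes are done
            output[i] = cache[n - 1 - i];  // 3. read another thread's slot
        });
    }
    for (auto& t : threads) t.join();
    return output;
}
```

Calling reverse_tile on a small vector returns the elements in reverse order. If step 2 were removed, a thread could reach step 3 before the thread that owns the mirrored slot had written it, which is the situation described in the reply above for threads that do not share a hardware scheduling unit.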