C++ AMP: short vector types optimization

    Question

  • Hi,

    The published C++ AMP specs do not talk about functions like dot, cross, distance, and other HLSL functions operating on short vector types (float_3, etc.), which forces us to expand them by hand, for example:

    dot = a.x*b.x + a.y*b.y + a.z*b.z;

    Is the compiler smart enough to vectorize such code and choose the right functions?

    Best regards, Arnaud.

    February 19, 2012, 15:56

Answers

  • Hi Arnaud

    The C++ AMP SVT implementation is not mapped to the HLSL SVT implementation. We offer SVT so that developers who are familiar with them do not have to redefine them, and to make it easier to work with our texture support. Though not mapped to HLSL SVT, the compiler still tries to vectorize the code as much as it can.

    To answer your specific question:

    1. For cross, there is no specific instruction at the bytecode level.
    2. For dot/distance, if the element type is not float, there is no corresponding shader instruction.
    3. For dot/distance, if the element type is float, the compiler may or may not vectorize it (using the shader dot-product instruction). Even if the instruction ends up being used under the covers, whether it has a performance benefit ultimately depends on the driver/hardware.

    In short, use whatever your scenario calls for, and if you believe there are potential performance benefits left on the table, let us know with a specific example so we can look into it…

    Cheers
    Daniel


    http://www.danielmoth.com/Blog/

    February 21, 2012, 22:03
  • Hi Arnaud,

    Thanks very much for sharing your kernel. As Daniel explained, in this release the C++ AMP SVT implementation is not mapped to the HLSL SVT implementation. The HLSL compiler is used to vectorize the code as much as it can. In your example, all three variants lead to similar code. As it is now,

        float x = b.x - a.x;
        float y = b.y - a.y;
        float z = b.z - a.z;

    are vectorized using a single iadd instruction, while

        float dist2 = x*x + y*y + z*z;

    is only partially vectorized.

    For the coming beta release, "pass by value" can increase the chance of vectorization. We plan to improve this area, so it probably will not matter in the final release.

    The driver JIT may do some target-dependent optimization, including further vectorization, if it is beneficial for the underlying hardware.

    Vectorization of per-thread code is one of the areas we will investigate. So if you have any empirical performance results that demonstrate the benefit, please share them with us.

    Thanks, Weirong

    February 22, 2012, 3:52

All replies

  • Hi Daniel,

    Here is the function (using an in-house float_4 type, similar to the one published in the beta specs):

    #define MARGIN 1.000F
    
    inline bool contact(float_4& a, float_4 b) restrict(amp)
    {
        /*float_4 r = b - a;
        float dist2 = r.x*r.x + r.y*r.y + r.z*r.z;*/
    
        float x = b.x - a.x;
        float y = b.y - a.y;
        float z = b.z - a.z;
        float dist2 = x*x + y*y + z*z;
    
        /*b.x -= a.x;
        b.y -= a.y;
        b.z -= a.z;
        float dist2 = b.x*b.x + b.y*b.y + b.z*b.z;*/
    
        float radiussum = (a.w + b.w)*MARGIN;
        
        return dist2 <= radiussum*radiussum;
    }

    As a short background, it is used inside a kernel that determines contacts between a set of spheres by trying a carefully chosen subset of all possible combinations and returning a vector containing pairs of sphere ids in contact (array<uint_2>). It may seem obvious, but the bool contact(a, b) function determines whether two spheres a and b, represented as float_4 with center (.x, .y, .z) and radius .w, are in contact. In this function, there are three ways (two commented out) to compute the square of the distance between the centers, all of which give similar performance.

    Is this code vectorized in some way? Does it depend on how the function arguments are passed (by value, by reference, by const reference)?

    Another question: do you know whether the driver could end up vectorizing non-optimized HLSL bytecode?

    All the best, Arnaud.


    February 21, 2012, 23:42