Answered A way to reduce build time of AMP code?

  • Saturday, June 23, 2012 12:29 AM
     
     

    I have two non-trivial projects implemented in AMP now and both take a really long time to compile. The first takes about 30 minutes, and the second has started to take up to 3 hours (both use up to 1.8 GB of memory). I tried to observe more carefully when and why this occurs in the second project, and it appears to start the moment I use tile_static memory. I suspect it's the race-condition check (DX compiler) or some other code analysis that goes crazy over my implementation, which isn't enormous, maybe just 100 lines.

    In addition, the compiler takes 100% of all 4 of my cores, to such a degree that the OS becomes unresponsive. My current approach is to jump to the task manager quickly after I start a build and set the priority to lowest.

    So my question is whether I can reduce the build time somehow. For compute shaders I can turn off optimizations (for example, the DC shader I wrote for my first project to keep the build times in check doesn't benefit from anything above O0), and I can also turn off the correctness verification. I wish I could do something like that for AMP.

    Oh, and the produced (debug) executable is 80 MB vs. 600 kB for pure C++ ConcRT vs. 69 kB for just C++. Is that normal?

    Thanks for any suggestion to solve this.

All Replies

  • Sunday, June 24, 2012 6:20 AM
    Owner
     
     

    Hi Ethatron,

    The compile times that you mentioned - are they for the debug or release configuration builds? Also, can you try turning off debug info generation and see if and how much difference that makes?

    If you can share your code, we can take a look to better determine what's going on and whether there is some way to reduce the compilation times for your projects.

    - Amit


    Amit K Agarwal

  • Sunday, June 24, 2012 7:53 AM
     
     

    It's debug builds. I'll try the debug info suggestion.

    How may I pass the code to you without it being public?

  • Monday, June 25, 2012 6:08 PM
    Owner
     
     

    Hey Ethatron

    It may be tricky for us to look at code that you do not consider being in the public domain. Even after you assure us that there is no license of any form attached to the code...

    Regardless, please start an email conversation with me, and we can close the loop on this thread once we’ve resolved it… first name dot last name at youknowwhere dot com.

    Cheers
    Daniel


    http://www.danielmoth.com/Blog/

  • Tuesday, June 26, 2012 8:16 PM
     

    It's under MIT, but it's for the competition. :^D

    I actually found the culprit:

        // kvalue: very small
        // klimit: very big (up to the thread limit of 1024)
        for (int k = kvalue; k < klimit; k++) {
            // ...
            int i = result_of_some_calculation_with_uavs;

            /* no indexed access to vector */
            switch (i % SIMDsize) {
              case 0: memory(i / SIMDsize).x = x; break;
              case 1: memory(i / SIMDsize).y = y; break;
              case 2: memory(i / SIMDsize).z = z; break;
              case 3: memory(i / SIMDsize).w = w; break;
            }

            // ...
        }
    

    I had the impression the code generator would optimize the simulated short-vector index access into an offset access. But it didn't. Instead, the code generator not only used a switch, it became so confused that it forced the unrolling of the entire loop.

    I got the hint because I wrote a DC shader (again) and looked at the messages and the assembly. The DC shader actually failed to compile; it said it couldn't unroll the loop. So the lesson for me is: write a DC shader, see what you are doing wrong (from the point of view of fxc), fix it, then translate it to AMP code.

    As a consequence I rewrote the whole short-vector-library to be in array-form ("value_type _M_xyzw[4];") and added []-operators, dropped the switches, then the problem disappeared, 30min -> 3min.
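
    A minimal sketch of the array-form idea described above, written in plain standard C++ rather than C++ AMP so it can be tried anywhere. The names ArrayVec4, kSimdSize, store_scalar and load_scalar are hypothetical and are not taken from amp_short_vector.h; only the storage layout and the [] operator mirror the description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 4-component short vector that stores its
// components in an array instead of four named members (x, y, z, w).
struct ArrayVec4 {
    float c[4];

    // Dynamic component access; the index is checked in debug builds only.
    float& operator[](std::size_t i) {
        assert(i < 4);
        return c[i];
    }
    const float& operator[](std::size_t i) const {
        assert(i < 4);
        return c[i];
    }
};

constexpr std::size_t kSimdSize = 4;  // components per vector

// Write one scalar into packed storage: element i lives in vector
// i / kSimdSize, component i % kSimdSize. No switch is needed.
inline void store_scalar(std::vector<ArrayVec4>& memory, std::size_t i, float value) {
    memory[i / kSimdSize][i % kSimdSize] = value;
}

// Read one scalar back using the same index arithmetic.
inline float load_scalar(const std::vector<ArrayVec4>& memory, std::size_t i) {
    return memory[i / kSimdSize][i % kSimdSize];
}
```

    With this layout, the four-way switch in the loop collapses to a single indexed store such as store_scalar(memory, i, value).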

    I think it's essential to at least be able to see the assembly of AMP shaders (does it do what you expect?), and to be able to set the optimization level (none of the code I wrote ever benefited from >O0). Unroll hints/preferences would be an extra. The optimization level especially is crucial for initial development: I was down to correcting 5 lines of code per day because I had to wait 5x (5 data types) 30min per build for each line corrected. The algorithm was stable, I was merely tinkering with more optimal code, and it was difficult to maintain motivation (I used the time reading AMP docs) ...

    Anyway, I hope you understand my notes as constructive "criticism". I appreciate the AMP initiative wholeheartedly; it's the nearest thing to what I always wanted (C++ "native" GPGPU; libSh came the nearest, and a lot of time has passed since), if not the perfect match. I'm sure it'll evolve just fine.

    You're going to see the code pretty soon now. :^)

  • Wednesday, June 27, 2012 9:03 PM
    Owner
     
     Answered
    Hi Ethatron
     
    Thank you for the feedback; it is noted for consideration for a future release. I wanted to acknowledge your 3 points (to make sure I didn’t miss anything) and finally to offer a tip for the other one.
     
    In this release, as you have found, C++ AMP short vector types do not support [] operator, thus dynamic indexing and out-of-bound access are not possible. Can you help us understand your scenario and the need for using dynamic indexing within a short-vector?
     
    Also thanks for the feedback on hints for unrolling etc, we have heard this before and are examining what would be a C++ way to expose that… If you have ideas to offer, please share…
     
    Regarding the optimization compiler flags, we map both O1 and O2 to the HLSL compiler's O2 flag. We do not expose the other 3, since, like you said, there is little difference between them, and we also didn't want to introduce new compiler flags. If you have further thoughts and requests here, please share.
     
    As for seeing the HLSL bytecode, the best we can offer at this point is the Disassembly window in Visual Studio 2012. For more on GPU debugging please read our blog post. For your scenario, remember to build in Release mode, and also in “Project Properties >> C/C++ >> Code Generation >> Runtime Library” select “Multi-threaded Debug DLL” option (to link against the debug runtime, so the debugger can work). Rebuild. Hit F11 to step into the first line of the kernel, and open the disassembly window to see the HLSL bytecodes for the C++ AMP shader. Let me know if this works for you.
     
    Thanks again for the feedback, keep it coming, and good luck with the contest.
     
    Cheers
    Daniel

    http://www.danielmoth.com/Blog/

    • Proposed As Answer by Zhu, Weirong Wednesday, June 27, 2012 11:20 PM
    • Marked As Answer by Ethatron Friday, June 29, 2012 6:01 PM
  • Thursday, June 28, 2012 7:16 AM
     

    Yes, the three points are right.
    I also verified that with omitting the debug info the build time is A-okay, I made a release-build and it just hurried through. Doesn't help for the debugging though. :)

    Regarding the short-vectors:
    I know HLSL doesn't support dynamic indexed access to its vector types, just static. But when you vectorize a number of elements that is not a multiple of the SIMD size, you end up with a tail in the last SIMD vector. For a trivial, embarrassingly parallel and data-independent loop it doesn't matter, but something in the spirit of a horizontal sum becomes difficult. You may handle that with 0.0 padding, but there are always cases where you can't. For example, in my case I had to pull a single value out of a column of a matrix, without being able to use all four values at that moment. So the elegant pattern is:

    value = memory(col / SIMDsize)[col % SIMDsize];

    A four-component "vector" and a four-component array are identical in their functional principle. Whether you have a swizzle instruction to pull out a specific component ("swizzle r0.xyzw, r1.x, r2.xyzw") or a base offset from a memory address ("read r0.xyzw, mem[r1.x]") doesn't matter much; both should be possible/available.

    The array form is just much more convenient, and easier for the compiler to handle as well, I suppose. Seeing how stressful the switch emulating the "swizzle" instruction above is for the compiler and the code, I think the array form may prevent some bad practice and unnecessary tinkering with the disadvantages of the xyzw-vector form.

    I really only took the whole amp_short_vector.h and put all the percolated "value_type"s in array-form, nothing else. Well, allowing SVT-usage too etc.

    Regarding the unrolling:
    I think it's just an extra; what may be more helpful is a message saying that it was enforced by the compiler. I'm sure the ecosystem of code translation is quite complex, and I see it may result in a meaningless message (for example, when there is no for() but the compiler generated one). I'm not sure how to solve this. I think the more helpful hints, the better. I'm always super-verbose in my debug+printf runs of first code. :^D It's important in that first moment when you tackle a problem to see whether what you expect happens or not, and why.

    Regarding the flags:
    So a debug build is /Od? Weird ... otherwise I have only experienced that kind of slowdown when I enable CodeAnalysis ... maybe it really is the verification pass of fxc. Is that off for release builds? I can't imagine that, though.
    Anyway, I don't think the flags are necessary if they mirror the regular O-flags; it just has to be explicitly documented somewhere, so I can set, say, /Os (C++) and that gives me /O0 (fxc). Please, if possible in some way, add /O0 (fxc); sometimes (often?) O0 vs. O1 can be a 1:5 difference in time without any improvement in the code (at all).

    Regarding the assembly (edited):
    Wouldn't it be possible to give the HLSL-disassembly in a comment-block (so it's still valid for msasm) after the bytecode data-section, when you request the regular disassembly? I frequently look that one up for "wrong" translations of code-to-machine. And it would be a logical place to look for/dump it to.

    Yeah, I can handle all the debugger/profiler features. :^) It isn't so hard, and I just started being (de)pressed because of loooong GPU->CPU download times; that's where the creativity comes in, I hope. And there are some issues that I think may be driver bugs or bytecode-to-GPU-code translator bugs. Well, in the not-so-far future you may tell me if I was too careless in my coding and it's in effect my error, or if it really is something fishy.

    Okay, so far so good.
    Salú


    • Edited by Ethatron Thursday, June 28, 2012 3:08 PM regarding assembly
  • Friday, June 29, 2012 6:00 PM
     

    I actually found a branch-free pattern for the switch above:

    vec4 copy = memory(i / SIMDsize);
    
    /* no indexed access to vector */
    copy = vec4(
      (i % SIMDsize) == 0 ? val : copy.x,
      (i % SIMDsize) == 1 ? val : copy.y,
      (i % SIMDsize) == 2 ? val : copy.z,
      (i % SIMDsize) == 3 ? val : copy.w
    );
    
    memory(i / SIMDsize) = copy;
    
    It'll read 4x as much data as necessary (the whole vector), and thus will not operate at int granularity like an array (both LDS and memory have int granularity, LDS even short granularity on AMD hardware). But it's better than the switch, I guess.
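
    For comparison, a self-contained plain C++ sketch (not C++ AMP) that puts the switch form and the select form side by side, so it is easy to check that both write the same component. NamedVec4, write_lane_switch and write_lane_select are hypothetical names introduced only for this illustration. Whether the ternaries really end up branch-free depends on the compiler and target, so that part is an assumption here.

```cpp
#include <cassert>

// Hypothetical 4-component vector with named members only (no operator[]).
struct NamedVec4 {
    float x, y, z, w;
};

// Switch-based write of one component, as in the original loop.
inline NamedVec4 write_lane_switch(NamedVec4 v, int lane, float val) {
    assert(lane >= 0 && lane < 4);
    switch (lane) {
        case 0: v.x = val; break;
        case 1: v.y = val; break;
        case 2: v.z = val; break;
        case 3: v.w = val; break;
    }
    return v;
}

// Select-based write: every component is rebuilt, and only the matching
// lane takes the new value. The whole vector is read and written back.
inline NamedVec4 write_lane_select(const NamedVec4& v, int lane, float val) {
    assert(lane >= 0 && lane < 4);
    return NamedVec4{
        lane == 0 ? val : v.x,
        lane == 1 ? val : v.y,
        lane == 2 ? val : v.z,
        lane == 3 ? val : v.w
    };
}
```

    In the loop above, lane would be i % SIMDsize and the result would be stored back to memory(i / SIMDsize).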
  • Friday, June 29, 2012 8:43 PM
     
     

    Hi Ethatron,

    Thanks for all your feedback. It is all noted and will be taken into consideration for future releases.

    For your use of short vectors, I assume you use something like an array<float4, 1> or array_view<float4, 1>. When you need to deal with the tail and need dynamic indexing, you could consider using reinterpret_as to reinterpret the storage as an array_view<float, 1>, where you can use dynamic indexing to access the individual "float" elements. For the rest of the code, where you want to use short vectors, you use the original array_view<float4, 1>. Note that you should do the reinterpret_as inside your kernel; if you do it outside the parallel_for_each and capture both array_views, you can run into aliased invocation.
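
    The "two views over one buffer" idea can be sketched on the CPU in plain C++. This is only an analogy under stated assumptions: it does not use reinterpret_as or any C++ AMP type, and Float4View, kSimdSize and sum_first are hypothetical names made up for the illustration. The grouped view plays the role of the float4 view and the underlying std::vector<float> plays the role of the flat float view.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kSimdSize = 4;  // scalars per group

// Hypothetical helper: views flat float storage in groups of four.
class Float4View {
public:
    explicit Float4View(std::vector<float>& storage) : storage_(storage) {
        assert(storage.size() % kSimdSize == 0);
    }

    // Number of complete four-element groups.
    std::size_t size() const { return storage_.size() / kSimdSize; }

    // Access one lane of one group; maps to flat index group * 4 + lane.
    float& at(std::size_t group, std::size_t lane) {
        assert(group < size() && lane < kSimdSize);
        return storage_[group * kSimdSize + lane];
    }

private:
    std::vector<float>& storage_;
};

// Sum the first `count` scalars through the flat view, so a partial last
// group (the tail) needs no padding and no per-lane special case.
inline float sum_first(const std::vector<float>& storage, std::size_t count) {
    assert(count <= storage.size());
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        total += storage[i];
    }
    return total;
}
```

    Grouped code goes through the Float4View, while tail handling indexes the flat storage directly; both see the same elements.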

    Thanks,

    Weirong