Loop unroll optimization

    Question

  • Hi,

    I have been comparing code optimization across different systems and compilers. To my surprise, it looks like VC 2005 does not unroll loops. Looking only at loop optimization, I found that, for instance, Sun Java (which does unroll loops) sometimes performs 2-3 times better than similar C++ code.

    So my question is: is there a reason for not unrolling loops, or is the option perhaps hidden somewhere?

    P.S.

    I do not have the Java JIT-generated asm to paste here, but the loop was unrolled by a factor of 10, similar to what the Intel compiler does.

    C++ code snippet:

    int z = 0;
    int* pz = new int[1];
    ...
    ...
    for (int i = 0; i < 0x7FFFFFFF; i++) {
        z += i;
        pz[idx] = z;
    }
    ...
    ...

    VC 2005 Beta 2 generated asm code:

    00401050 xor eax,eax
    00401052 add esi,eax
    00401054 add eax,1
    00401057 cmp eax,7FFFFFFFh
    0040105C jl Test1::main+52h (401052h)

    Intel 8.1 generated asm code:

    004010FA xor edx,edx
    004010FC add ebp,edx
    004010FE lea ecx,[ebp+edx+1]
    00401102 lea ebp,[ecx+edx+2]
    00401106 lea ecx,[ebp+edx+3]
    0040110A lea ebp,[ecx+edx+4]
    0040110E add edx,5
    00401111 cmp edx,7FFFFFFAh




    Friday, May 27, 2005 4:43 PM
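
    As a rough illustration of what the Intel listing above appears to do (five additions per compare-and-branch), here is a hand-unrolled sketch of the loop from the question. The function names `sum_rolled` and `sum_unrolled5` are made up for this example, the bound is a parameter rather than the constant 0x7FFFFFFF, and unsigned arithmetic is used so that wraparound is well defined. It is not a claim about the exact code either compiler emits.

```cpp
#include <cstdint>

// Straightforward loop: sums 0 .. n-1, one compare-and-branch per addition.
std::uint32_t sum_rolled(std::uint32_t n) {
    std::uint32_t z = 0;
    for (std::uint32_t i = 0; i < n; ++i) {
        z += i;
    }
    return z;
}

// Hand-unrolled by five (hypothetical name): one compare-and-branch
// per five additions, plus a short loop for the leftover iterations.
std::uint32_t sum_unrolled5(std::uint32_t n) {
    std::uint32_t z = 0;
    std::uint32_t i = 0;
    // Main body: runs only while at least five iterations remain,
    // so i never passes n and n - i cannot wrap around.
    for (; n - i >= 5; i += 5) {
        z += i;
        z += i + 1;
        z += i + 2;
        z += i + 3;
        z += i + 4;
    }
    // Remainder: at most four iterations.
    for (; i < n; ++i) {
        z += i;
    }
    return z;
}
```

    Both functions return the same value for any `n`; for example, `sum_rolled(1000)` and `sum_unrolled5(1000)` both give 499500.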

Answers

  • As Nikola states, loop unrolling is heuristic-based. In this case there are no floating-point operations or memory writes, but as you state, unrolling this loop would be beneficial.

    As I'm sure you know, the heuristics walk a fine line between code bloat and missed optimizations, and they are frankly never perfect (for every tweak we make to the heuristic, some apps get faster and some regress). We spend a lot of time tuning the loop unroller on real apps (versus some people who focus on benchmarks). While we won't hit all of the cases (and sometimes the ones we miss look braindead simple), we are hitting most of the core ones. Additionally, we are getting better with each release as we learn from apps and customers such as yourself.

    Thanks,

    Kang Su Gatlin
    Visual C++ Program Manager

    Tuesday, May 31, 2005 6:11 PM
    Moderator

All replies

  • Hi,

    The VC++ compiler does unroll loops when doing so benefits the overall performance of the application. As you can see, this loop does basically nothing, so the performance win from unrolling it in this specific case is insignificant. It is a different matter when the loop body contains a non-trivial computation that can be pipelined once unrolled. Do you have real-world code that, when built with the Intel compiler, is slower than when built with VC++ because of inefficient loop unrolling?

    Thanks,
    Nikola
    Visual C++ Team
    Sunday, May 29, 2005 5:11 AM
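
    One common reading of "a computation that can be pipelined once unrolled" is a loop whose unrolled copies do not depend on each other. The sketch below assumes that reading: it sums an array with a single accumulator and then with four independent accumulators. The function names are hypothetical, and whether any particular compiler performs this transformation on its own is not something this thread states.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One accumulator: every addition has to wait for the previous one.
std::uint32_t sum_single(const std::vector<std::uint32_t>& v) {
    std::uint32_t acc = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        acc += v[i];
    }
    return acc;
}

// Unrolled by four with independent accumulators (hypothetical name):
// the four additions in the body do not depend on each other.
std::uint32_t sum_four_accumulators(const std::vector<std::uint32_t>& v) {
    std::uint32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    // Combine the partial sums, then handle up to three leftover elements.
    std::uint32_t acc = a0 + a1 + a2 + a3;
    for (; i < n; ++i) {
        acc += v[i];
    }
    return acc;
}
```

    With unsigned integers the two functions always agree, because addition modulo 2^32 can be reordered freely. With floating-point data the reordering could change the rounded result.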
  • Hi Nikola,
    Yes, I see that VC does unroll loops in some cases; I was able to observe it in a real-life app.
    Then my question is: what is the threshold for this? How does the compiler decide whether a loop should be unrolled, apart from the obvious requirement that the loop body has to fit into the CPU instruction cache? Going back to my first example: yes, I agree with you that the loop is stupid, just an arithmetic progression, but I do not agree that the effect in this case is insignificant. The application consists of only this loop, and unrolling gives a 300% improvement.

    I would go further: this loop should be removed completely, since it is invariant, but that is a different discussion. As for whether I have seen code compiled by Intel that runs slower than the same code compiled by VC due to inefficient loop unrolling, the answer is no, I haven’t.

    Best Regards.

    Monday, May 30, 2005 9:53 PM
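
    To illustrate the remark that the loop is "just an arithmetic progression" and could in principle be removed: the sum 0 + 1 + ... + (n-1) equals n(n-1)/2, so the result can be computed without iterating. The sketch below is only an illustration of that identity, with hypothetical function names and unsigned arithmetic so that wraparound is well defined. It makes no claim that any of the compilers discussed here performs this replacement.

```cpp
#include <cstdint>

// Reference loop: sums 0 .. n-1 with 32-bit wraparound.
std::uint32_t sum_loop(std::uint32_t n) {
    std::uint32_t z = 0;
    for (std::uint32_t i = 0; i < n; ++i) {
        z += i;
    }
    return z;
}

// Closed form (hypothetical name): n * (n - 1) / 2, reduced modulo 2^32.
// The product is formed in 64 bits, where it cannot overflow for any
// 32-bit n, and it is always even, so the division is exact.
constexpr std::uint32_t sum_closed_form(std::uint32_t n) {
    return n == 0
        ? 0u
        : static_cast<std::uint32_t>(
              static_cast<std::uint64_t>(n) *
              (static_cast<std::uint64_t>(n) - 1) / 2);
}

// 0 + 1 + 2 + 3 + 4 == 10, checked at compile time.
static_assert(sum_closed_form(5) == 10u, "closed form must match the sum");
```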
  • Maksim,

    I have seen real-world cases where a compiler generates code that is locally efficient, and probably the best a compiler can safely generate, yet overall application performance is worse than with less greedy optimizations. One widely used large server application, compiled by such an aggressive compiler, ran 50% slower than when compiled by VC.

    The loop unroller is a good example. We could easily unroll every loop in your program; that would speed up tiny benchmarks but slow down every real-world program, as much less of it would fit into the CPU cache. The same is true for inlining, replacing multiplication by a constant with shifts and additions, etc.

    VC tries to balance execution speed and code size. You can change the balance with the /O1 or /O2 flags, but even when optimizing for speed VC tries to conserve code size as well.

    In the past (due to questions similar to yours) we considered adding a special flag (something like /Otiny_benchmark), but then decided against it. We don't want to spend resources tuning the compiler for benchmarks.

    Thanks,
    Eugene
    Tuesday, May 31, 2005 6:35 PM
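
    For readers unfamiliar with the "multiplication by a constant replaced with shifts and additions" transformation mentioned above, here is a minimal sketch for the constant 10, using 10x = 8x + 2x. The function names are hypothetical. The point is only that the second form trades one multiply for several simpler operations, which is the kind of speed versus code size trade-off the reply describes.

```cpp
#include <cstdint>

// Plain multiplication by a constant.
std::uint32_t times10_multiply(std::uint32_t x) {
    return x * 10u;
}

// Same result using shifts and an addition: (x << 3) is 8x, (x << 1) is 2x.
// Unsigned arithmetic wraps modulo 2^32 in both versions, so they agree
// for every input.
std::uint32_t times10_shift_add(std::uint32_t x) {
    return (x << 3) + (x << 1);
}
```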