Loop unrolling is a compiler optimization applied to certain kinds of loops to reduce the frequency of branches and loop maintenance instructions. It is easily applied to sequential array processing loops where the number of iterations is known prior to execution of the loop.
This web presentation first looks at how loop unrolling works. Then it examines loop unrolling applied to three example loops:
All of these examples occur in various types of programs. The row operation loop is often a major component of scientific computations.
The general idea of loop unrolling is to replicate the code inside a loop body a number of times. The number of copies is called the loop unrolling factor. The number of iterations is divided by the loop unrolling factor.
Loop unrolling is most effective when computations involving the loop control variable can be simulated by the compiler. For example, when a loop is stepping sequentially through an array, increments to a register that points to array entries can be simulated by the compiler with changes in the displacements in load and store instructions.
Loop unrolling can reduce the number of loop maintenance instruction executions by the loop unrolling factor. In effect, the computations are done by the compiler rather than being done during program execution.
The loop unrolling factor does not have to exactly divide the number of iterations of the original loop. If the number of iterations is known at compile time then the compiler can add extra iterations after the unrolled loop. Otherwise it can just add a copy of the original loop.
The following C code will compute the sum of the entries in a 100-entry vector A.
double arraySum = 0; for (int i = 0; i < 100; i++) { arraySum += A[i]; }
The code below omits the loop initializations:
A
.
loop3: l.d $f10, 0($5) ; $f10 ← A[i] add.d $f8, $f8, $f10 ; $f8 ← $f8 + A[i] addi $5, $5, 8 ; increment pointer for A[i] addi $7, $7, -1 ; decrement loop count test: bgtz $7, loop3 ; Continue if loop count > 0
The addi
instructions in this code are loop maintenance:
advancing addresses and counting iterations.
This is the same code with loop unrolling with a loop unrolling
factor of 4.
Notice that the displacements in the loads are incremented by the
compiler in the second, third, and fourth copies of the original loop
body instructions.
The immediate values in the addi
instructions have been
multiplied by 4 so that the effect of one iteration in the unrolled loop
is the same as 4 iterations of the original loop.
Notice that branches and loop maintenance instructions have been reduced by a factor of 4.
loop3: l.d $f10, 0($5) ; iteration with displacement 0 add.d $f8, $f8, $f10 l.d $f10, 8($5) ; iteration with displacement 8 add.d $f8, $f8, $f10 l.d $f10, 16($5) ; iteration with displacement 16 add.d $f8, $f8, $f10 l.d $f10, 24($5) ; iteration with displacement 24 add.d $f8, $f8, $f10 addi $5, $5, 32 addi $7, $7, -4 test: bgtz $7, loop3 ; Continue loop if $7 > 0
The following C code will compute the dot product of two 100-entry vectors A and B.
double dotProduct = 0; for (int i = 0; i < 100; i++) { dotProduct += A[i]*B[k]; }
The code below omits the loop initializations:
A
.
B
.
loop3: l.d $f10, 0($5) ; $f10 ← A[i] l.d $f12, 0($6) ; $f12 ← B[i] mul.d $f10, $f10, $f12 ; $f10 ← A[i]*B[i] add.d $f8, $f8, $f10 ; $f8 ← $f8 + A[i]*B[ki] addi $5, $5, 8 ; increment pointer for A[i] addi $6, $6, 8 ; increment pointer for B[i] addi $7, $7, -1 ; decrement loop count test: bgtz $7, loop3 ; Continue if loop count > 0
The addi
instructions in this code are loop maintenance:
advancing addresses and counting iterations.
This is the same code with loop unrolling with a loop unrolling factor of 4.
loop3: l.d $f10, 0($5) ; iteration with displacement 0 l.d $f12, 0($6) mul.d $f10, $f10, $f12 add.d $f8, $f8, $f10 l.d $f10, 8($5) ; iteration with displacement 8 l.d $f12, 8($6) mul.d $f10, $f10, $f12 add.d $f8, $f8, $f10 l.d $f10, 16($5) ; iteration with displacement 16 l.d $f12, 16($6) mul.d $f10, $f10, $f12 add.d $f8, $f8, $f10 l.d $f10, 24($5) ; iteration with displacement 24 l.d $f12, 24($6) mul.d $f10, $f10, $f12 add.d $f8, $f8, $f10 addi $5, $5, 32 addi $6, $6, 32 addi $7, $7, -4 test: bgtz $7, loop3 ; Continue loop if $7 > 0
The following C code will compute the dot product of two 100-entry vectors A and B.
double dotProduct = 0; for (int i = 0; i < 100; i++) { dotProduct += A[i]*B[k]; }
The code below omits the loop initializations:
A
.
B
.
C
.
loop3: l.d $f10, 0($5) ; $f10 ← A[i] mul.d $f10, $f10, $f0 ; $f10 ← A[i]*C l.d $f12, 0($6) ; $f12 ← B[i] add.d $f12, $f12, $f10 ; $f12 ← $f12 + A[i]*C s.d $f12, 0($6) ; B[i] ← B[i] + A[i]*C addi $5, $5, 8 ; increment pointer for A[i] addi $6, $6, 8 ; increment pointer for B[i] addi $7, $7, -1 ; decrement loop count test: bgtz $7, loop3 ; Continue if loop count > 0
This is the same code with loop unrolling with a loop unrolling factor of 4.
loop3: l.d $f10, 0($5) ; iteration with displacement 0 mul.d $f10, $f10, $f12 l.d $f12, 0($6) add.d $f12, $f12, $f10 s.d $f12, 0($6) l.d $f10, 8($5) ; iteration with displacement 8 mul.d $f10, $f10, $f12 l.d $f12, 8($6) add.d $f12, $f12, $f10 s.d $f12, 8($6) l.d $f10, 16($5) ; iteration with displacement 16 mul.d $f10, $f10, $f12 l.d $f12, 16($6) add.d $f12, $f12, $f10 s.d $f12, 16($6) l.d $f10, 24($5) ; iteration with displacement 24 mul.d $f10, $f10, $f12 l.d $f12, 24($6) add.d $f12, $f12, $f10 s.d $f12, 24($6) addi $5, $5, 32 addi $6, $6, 32 addi $7, $7, -4 test: bgtz $7, loop3 ; Continue loop if $7 > 0