3.Why study compiler optimizations?
Moore’s Law
Chip density doubles every 18 months
Reflected in CPU performance doubling every 18 months
Proebsting’s Law
Compilers double CPU performance every 18 years
That is roughly 4% improvement per year from optimizations (1.04^18 ≈ 2)!
4.Why study compiler optimizations?
Corollary
1 year of code optimization research = 1 month of hardware improvements
No need for compiler research… Just wait a few months!
5.Free Lunch is over
Moore’s Law
Chip density doubles every 18 months
PAST: Reflected in CPU performance doubling every 18 months
CURRENT: Density doubling reflected in more cores on chip!
Corollary
Cores will become simpler
Just wait a few months… Your code might get slower!
Many optimizations now being done by hand! (autotuning)
6.Optimizations: The Big Picture
What are our goals?
Simple Goal: Make execution time as small as possible
Which leads to:
Achieve execution of many (all, in the best case) instructions in parallel
Find independent instructions
7.Dependences
We will concentrate on data dependences
Simple example of data dependence:
S1 PI = 3.14
S2 R = 5.0
S3 AREA = PI * R ** 2
Statement S3 cannot be moved before either S1 or S2 without compromising correct results
(Dependence graph: S1 → S3, S2 → S3)
8.Dependences
Formally:
Data dependence from S1 to S2 (S2 depends on S1) if:
1. Both statements access the same memory location and at least one of them stores to it, and
2. There is a feasible run-time execution path from S1 to S2
9.Load Store Classification
Dependences classified in terms of load-store order:
1. True dependence (RAW hazard)
2. Antidependence (WAR hazard)
3. Output dependence (WAW hazard)
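A minimal C sketch of the three classes on a single scalar x (variable names are illustrative, not from the slides):

#include <stdio.h>
int main(void) {
    int x, y = 1, z;
    x = y + 1;   /* S1: writes x                                      */
    z = x * 2;   /* S2: reads x  -> true dependence  S1 -> S2 (RAW)   */
    x = 7;       /* S3: writes x -> antidependence   S2 -> S3 (WAR)   */
                 /*                 output dependence S1 -> S3 (WAW)  */
    printf("%d %d\n", x, z);
    return 0;
}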
10.Dependence in Loops
Let us look at two different loops:
DO I = 1, N
S1   A(I+1) = A(I) + B(I)
ENDDO
DO I = 1, N
S1   A(I+2) = A(I) + B(I)
ENDDO
In both cases, statement S1 depends on itself across iterations (a loop-carried dependence); the dependence distance is 1 in the first loop and 2 in the second
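A C rendering of the two loops (array sizes and values are illustrative), with the carried dependences noted in comments:

#include <stdio.h>
#define N 8
int main(void) {
    double a[N + 2] = {0}, b[N] = {0};
    /* Distance-1 carried dependence: iteration i writes a[i+1],
       which iteration i+1 reads, so iterations must run in order. */
    for (int i = 0; i < N; i++)
        a[i + 1] = a[i] + b[i];
    /* Distance-2 carried dependence: iteration i writes a[i+2],
       read two iterations later; even and odd iterations are independent. */
    for (int i = 0; i < N; i++)
        a[i + 2] = a[i] + b[i];
    printf("%f\n", a[N + 1]);
    return 0;
}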
11.Transformations
We call a transformation safe if the transformed program has the same "meaning" as the original program
But, what is the "meaning" of a program?
For our purposes:
Two programs are equivalent if, on the same inputs:
They produce the same outputs in the same order
12.Loop Transformations
Compilers have always focused on loops
Higher execution counts
Repeated, related operations
Much of real work takes place in loops
13.Several effects to attack
Overhead
Decrease control-structure cost per iteration
Locality
Spatial locality: use of co-resident data
Temporal locality: reuse of same data
Parallelism
Execute independent iterations of loop in parallel
14.Eliminating Overhead
Loop unrolling (the oldest trick in the book)
To reduce overhead, replicate the loop body
Sources of Improvement
Less overhead per useful operation
Longer basic blocks for local optimization
do i = 1 to 100 by 1
  a(i) = a(i) + b(i)
end

becomes
(unroll by 4)

do i = 1 to 100 by 4
  a(i)   = a(i)   + b(i)
  a(i+1) = a(i+1) + b(i+1)
  a(i+2) = a(i+2) + b(i+2)
  a(i+3) = a(i+3) + b(i+3)
end
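A runnable C version of the same unroll-by-4 transformation (array contents are placeholders); since the trip count of 100 is a multiple of 4, no cleanup loop is needed:

#include <stdio.h>
int main(void) {
    double a[100], b[100];
    for (int i = 0; i < 100; i++) { a[i] = i; b[i] = 1.0; }

    /* Unrolled by 4: one branch and induction-variable update
       per four useful additions, and a longer basic block. */
    for (int i = 0; i < 100; i += 4) {
        a[i]     = a[i]     + b[i];
        a[i + 1] = a[i + 1] + b[i + 1];
        a[i + 2] = a[i + 2] + b[i + 2];
        a[i + 3] = a[i + 3] + b[i + 3];
    }
    printf("%f\n", a[99]);
    return 0;
}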
15.Loop Fusion
Two loops over the same iteration space → one loop
Safe if it does not change the values used or defined by any statement in either loop (i.e., does not violate dependences)
do i = 1 to n
  c(i) = a(i) + b(i)
end
do j = 1 to n
  d(j) = a(j) * e(j)
end

becomes
(fuse)

do i = 1 to n
  c(i) = a(i) + b(i)
  d(i) = a(i) * e(i)
end
Before fusion: for big arrays, a(j) may no longer be in the cache when the second loop reads it
After fusion: a(i) will be found in the cache
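A C sketch of the same fusion (array names follow the slide; the size N and the initial values are illustrative):

#include <stdio.h>
#define N 1000
int main(void) {
    double a[N], b[N], c[N], d[N], e[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 1.0; e[i] = 2.0; }

    /* Fused loop: a[i] is loaded once per iteration and reused
       while it is still in the cache (and likely in a register). */
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
        d[i] = a[i] * e[i];
    }
    printf("%f %f\n", c[N - 1], d[N - 1]);
    return 0;
}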
16.Loop Fusion Advantages
Enhance temporal locality
Reduce control overhead
Longer blocks for local optimization & scheduling
Can convert inter-loop reuse to intra-loop reuse
17.Loop Fusion of Parallel Loops
Fusing parallel loops is legal if all dependences between them are loop independent
i.e., the source and target of each flow dependence map to the same iteration of the fused loop
Each iteration of the fused loop can then still execute in parallel
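An illustrative C sketch (not from the slides): the first pair of loops fuses legally because the dependence on a[i] is loop independent, while the second pair must not be fused, since after fusion iteration i would read b[i+1] before the iteration that writes it:

#include <stdio.h>
#define N 100
int main(void) {
    double a[N + 1] = {0}, b[N + 1] = {0}, c[N] = {0}, d[N] = {0};

    /* Legal to fuse: c[i] uses a[i] produced in the same iteration. */
    for (int i = 0; i < N; i++) a[i] = i * 1.0;
    for (int i = 0; i < N; i++) c[i] = a[i] + 1.0;

    /* Not legal to fuse: d[i] reads b[i+1], which the first loop
       writes only in iteration i+1; fusion would read the old value. */
    for (int i = 0; i < N; i++) b[i] = i * 2.0;
    for (int i = 0; i < N; i++) d[i] = b[i + 1] + 1.0;

    printf("%f %f\n", c[N - 1], d[N - 1]);
    return 0;
}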
18.Loop distribution (fission)
Single loop with independent statements → multiple loops
Starts by constructing a statement-level dependence graph
Safe to perform distribution if:
No dependence cycle is split across loops: statements forming a cycle in the dependence graph are put in the same loop
The resulting loops are ordered to respect the remaining dependences
19.Loop distribution (fission)
The loop below has the following dependence graph:
(1) for I = 1 to N do
(2)   A[I] = A[I] + B[I-1]
(3)   B[I] = C[I-1]*X + C
(4)   C[I] = 1/B[I]
(5)   D[I] = sqrt(C[I])
(6) endfor
Dependence graph: (3)→(4) and (4)→(5) are loop-independent true dependences; (3)→(2) and (4)→(3) are loop-carried true dependences. Statements (3) and (4) form a cycle, so they must stay in the same loop.
20.Loop distribution (fission)
(1) for I = 1 to N do
(2)   A[I] = A[I] + B[I-1]
(3)   B[I] = C[I-1]*X + C
(4)   C[I] = 1/B[I]
(5)   D[I] = sqrt(C[I])
(6) endfor

becomes
(fission)

for I = 1 to N do
  B[I] = C[I-1]*X + C
  C[I] = 1/B[I]
endfor
for I = 1 to N do
  A[I] = A[I] + B[I-1]
endfor
for I = 1 to N do
  D[I] = sqrt(C[I])
endfor

The loop computing B and C must come first so that the loop-carried dependence into statement (2) (A[I] reads B[I-1]) is preserved.
21.Loop Fission Advantages
Enables other transformations
E.g., Vectorization
Resulting loops have smaller cache footprints
More reuse hits in the cache
22.Loop Tiling (blocking)
Want to exploit temporal locality in loop nest.
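A common concrete illustration (an assumption here, not taken from the slides) is tiling matrix multiply so that B-by-B blocks of the arrays are reused while they are still resident in the cache:

#include <stdio.h>
#define N 256
#define B 32            /* tile size: tune to the cache; N is a multiple of B */
static double a[N][N], b[N][N], c[N][N];
int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = 1.0; b[i][j] = 2.0; c[i][j] = 0.0; }

    /* Tiled matrix multiply: the three outer loops walk over B-by-B
       tiles; the three inner loops reuse each tile many times while
       it is resident in the cache. */
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        for (int k = kk; k < kk + B; k++)
                            c[i][j] += a[i][k] * b[k][j];

    printf("%f\n", c[N - 1][N - 1]);
    return 0;
}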
27.Loop Tiling Effects
Reduces volume of data between reuses
Works on one “tile” at a time (tile size is B by B)
Choice of tile size is crucial
28.Scalar Replacement
Register allocators typically never keep subscripted references such as a(i) in a register
We can trick the allocator by rewriting the references
The plan
Locate patterns of consistent reuse
Make loads and stores use temporary scalar variable
Replace references with temporary’s name
29.Scalar Replacement
do i = 1 to n
  do j = 1 to n
    a(i) = a(i) + b(j)
  end
end

becomes
(scalar replacement)

do i = 1 to n
  t = a(i)
  do j = 1 to n
    t = t + b(j)
  end
  a(i) = t
end
Almost any register allocator can get t into a register
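A C rendering of the transformed loop nest (size n fixed at 100 for illustration); per outer iteration, a(i) is now loaded once and stored once instead of n loads and n stores:

#include <stdio.h>
#define N 100
int main(void) {
    double a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 0.0; b[i] = i * 1.0; }

    /* Scalar replacement: the repeated reference a[i] is held in
       the scalar t for the whole inner loop. */
    for (int i = 0; i < N; i++) {
        double t = a[i];
        for (int j = 0; j < N; j++)
            t = t + b[j];
        a[i] = t;
    }
    printf("%f\n", a[N - 1]);
    return 0;
}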
30.Scalar Replacement Effects
Decreases number of loads and stores
Keeps reused values in names that can be allocated to registers
In essence, this exposes the reuse of a(i) to subsequent passes