2.What is Computer Architecture?
The design of computer systems
Goal: improve “performance”:
Run programs faster
Execution time: e.g. less waiting time, more simultaneous tasks
Throughput: e.g. higher framerate, faster downloads
Use less power, last longer on battery power
Generate less or more uniformally-distributed heat
Handle more secure encryption standards
Software defined networking at higher line speeds
More scalable
Less expensive
Chapter 1 — Computer Abstractions and Technology — 2
3.Computer Architecture
Instruction Set Architecture (“architecture”)
The native programming language of a processor
Assembly language
Machine language
Openly published to users, licensed for chip makers
Microarchitecture
The internal organization of a processor
Executes programs
Trade secret
CSCE 212 3
4.Chapter 1 — Computer Abstractions and Technology — 4
Levels of Program Code
High-level language
Level of abstraction closer to problem domain
Provides for productivity and portability
Assembly language
Textual representation of instructions
Hardware representation
Binary digits (bits)
Encoded instructions and data
Instruction Set Architecture (ISA):
Assembly code / machine code “language”
Microarchitecture:
ISA implementation
5.CSCE 212 5
Abstraction
Abstration used to manage complexity of design
Hide details that are not important
Devices
(Transistors)
Logic gates
Program code
Machine Instructions
Datapaths
6.Why Study Computer Architecture
Compiling “machine agnostic” code:
Generally achieve ~1-20% of peak theoretical performance
Performance tuned code must be explicitly written for the underlying architecture
Especially for embedded and special purpose processors
Understanding computer architecture allows for customization:
Multicore
More efficient use of registers, instructions
Device drivers must directly interface with peripherals
Uses CPU-specific, bit-level features to communicate
E.g. memory-mapped I/O, interrupts, DMA, double buffering, bit fields in status/control registers, memory management, virtual memory
Chapter 1 — Computer Abstractions and Technology — 6
7.Domains and Levels of Modeling
high level of abstraction
Functional
Structural
Geometric
low level of abstraction
Chapter 1 — Computer Abstractions and Technology — 7
8.Functional Abstraction
MAR <= PC, memory_read <= 1
PC <= PC + 1
wait until ready = 1
IR <= memory_data
memory_read <= 0
For i=0 to 10
C = C + A[i]
Chapter 1 — Computer Abstractions and Technology — 8
10.Semiconductors
Silicon is a group IV element
Forms covalent bonds with four neighbor atoms (3D cubic crystal lattice)
Si is a poor conductor, but conduction characteristics may be altered
Add impurities/dopants replaces silicon atom in lattice
Adds two different types of charge carriers
Spacing = .543 nm
Chapter 1 — Computer Abstractions and Technology — 10
12.Feature Size
Shrink minimum feature size…
Smaller L decreases carrier time and increases current
Therefore, W may also be reduced for fixed current
Cg, Cs, and Cd are reduced
Transistor switches faster (~linear relationship)
Chapter 1 — Computer Abstractions and Technology — 12
19.8” Wafer
8 inch (200 mm) wafer containing Pentium 4 processors
165 dies, die area = 250 mm2, 55 million transistors, .18mm
Chapter 1 — Computer Abstractions and Technology — 19
20.Chapter 1 — Computer Abstractions and Technology — 20
Intel Core i7 Wafer
300mm wafer, 280 chips, 32nm technology
Each chip is 20.7 x 10.5 mm
21.Chapter 1 — Computer Abstractions and Technology — 21
Speedup / Relative Performance
Define Performance = 1/Execution Time
“X is n time faster than Y”
Example: time taken to run a program
10s on A, 15s on B
Execution TimeB / Execution TimeA= 15s / 10s = 1.5
So A is 1.5 times faster than B
22.Chapter 1 — Computer Abstractions and Technology — 22
Integrated Circuit Cost
Nonlinear relation to area and defect rate
Wafer cost and area are fixed
Defect rate determined by manufacturing process
Die area determined by architecture and circuit design
23.Example
Assume C defects per area and a die area of D. Calculate the improvement in yield if the number of defects is reduced by 1.5.
CSCE 212 23
𝑦𝑖𝑒𝑙𝑑 𝑛𝑒𝑤 𝑦𝑖𝑒𝑙𝑑 𝑜𝑟𝑖𝑔 = 1 1+ 𝐶 1.5 𝐷 2 2 1 1+𝐶 𝐷 2 2 =
1+𝐶 𝐷 2 2 1+ 𝐶 1.5 𝐷 2 2
= 1+𝐶𝐷+ 𝐶 2 𝐷 2 4 1+ 𝐶𝐷 1.5 + 𝐶 2 𝐷 2 9
24.Example
Assume a 20 cm diameter wafer has a cost of 15, contains 100 dies, and has 0.031 defects/cm2.
If the number of dies per wafer is increased by 10% and the defects per area unit increases by 15%, find the die area and yield.
die area20cm = wafer area / dies per wafer = pi*10^2 / (100*1.1) = 2.86 cm2
yield20cm = 1/(1+(0.03*1.15* 2.86/2))^2 = 0.9082
Assume a fabrication process improves the yield from 0.92 to 0.95. Find the defects per area unit for each version of the technology given a die area of 200 mm2.
defects per area0.92 = (1-y^.5)/(y^.5*die_area/2) =
(1-0.92^.5)/(0.92^.5*2/2) = 0.043 defects/cm2
defects per area0.95 = (1-y^.5)/(y^.5*die_area/2) =
(1-0.95^.5)/(0.95^.5*2/2) = 0.026 defects/cm2
CSCE 212 24
25.Chapter 1 — Computer Abstractions and Technology — 25
Response Time and Throughput
Response time
How long it takes to do a task
Throughput
Total work done per unit time
e.g., tasks/transactions/… per hour
How are response time and throughput affected by
Replacing the processor with a faster version?
Adding more processors?
We’ll focus on response time for now…
26.Chapter 1 — Computer Abstractions and Technology — 26
Measuring Execution Time
Elapsed time
Total response time, including all aspects
Processing, I/O, OS overhead, idle time
Determines system performance
CPU time
Time spent processing a given job
Discounts I/O time, other jobs’ shares
Comprises user CPU time and system CPU time
Different programs are affected differently by CPU and system performance
27.Chapter 1 — Computer Abstractions and Technology — 27
CPU Clocking
Operation of digital hardware governed by a constant-rate clock
Clock (cycles)
Data transferand computation
Update state
Clock period
Clock period: duration of a clock cycle
e.g., 250ps = 0.25ns = 250×10–12 s
Clock frequency (rate): cycles per second
e.g., 4.0GHz = 4000MHz = 4.0×109 Hz
28.Chapter 1 — Computer Abstractions and Technology — 28
CPU Time
Performance improved by
Reducing number of clock cycles
Increasing clock rate
Hardware designer must often trade off clock rate against cycle count
29.Chapter 1 — Computer Abstractions and Technology — 29
CPU Time Example
Computer A: 2GHz clock, 10s CPU time
Designing Computer B
Aim for 6s CPU time
Can do faster clock, but causes 1.2 × clock cycles
How fast must Computer B clock be?
30.Chapter 1 — Computer Abstractions and Technology — 30
Instruction Count and CPI
Instruction Count for a program
Determined by program, ISA and compiler
Average cycles per instruction
Determined by CPU hardware
If different instructions have different CPI
Average CPI affected by instruction mix
31.Chapter 1 — Computer Abstractions and Technology — 31
CPI Example
Computer A: Cycle Time = 250ps, CPI = 2.0
Computer B: Cycle Time = 500ps, CPI = 1.2
Same ISA
Which is faster, and by how much?
A is faster…
…by this much
32.Chapter 1 — Computer Abstractions and Technology — 32
CPI in More Detail
If different instruction classes take different numbers of cycles
Weighted average CPI
Relative frequency
33.Chapter 1 — Computer Abstractions and Technology — 33
CPI Example
Alternative compiled code sequences using instructions in classes A, B, C
Sequence 1: IC = 5
Clock Cycles= 2×1 + 1×2 + 2×3= 10
Avg. CPI = 10/5 = 2.0
Sequence 2: IC = 6
Clock Cycles= 4×1 + 1×2 + 1×3= 9
Avg. CPI = 9/6 = 1.5
35.Example
Suppose one machine, A, executes a program with an average CPI of 2.1
Suppose another machine, B (with the same instruction set and an enhanced compiler), executes the same program with 25% less instructions and with a CPI of 1.8 at 800MHz
In order for the two machines to have the same performance, what does the clock rate of the first machine (machine A) need to be?
CSCE 212 35
𝐼 𝐴 𝐶𝑃𝐼 𝐴 𝑅 𝐴 = 𝐼 𝐵 𝐶𝑃𝐼 𝐵 𝑅 𝐵
𝐼 𝐴 ∙2.1 𝑅 𝐴 = 0.75𝐼 𝐴 ∙1.8 800∙ 10 6
36.Example
Suppose a program has the following instruction classes, CPIs, and mixtures:
Instruction type CPI ratio
A 1.4 55%
B 2.4 15%
C 2 30%
Your engineers give you the following options:
Option A: Reduce the CPI of instruction type A to 1.1
Option B: Reduce the CPI of instruction type B to 1.2
Which option would you choose and why?
CSCE 212 36
𝐶𝑃𝐼 𝐴 =.55 1.1 +.15 2.4 +.30 2 =1.565
𝐶𝑃𝐼 𝐵 =.55 1.4 +.15 1.2 +.30 2 =1.550
37.Chapter 1 — Computer Abstractions and Technology — 37
Pitfall: MIPS as a Performance Metric
MIPS: Millions of Instructions Per Second
Doesn’t account for
Differences in ISAs between computers
Differences in complexity between instructions
CPI varies between programs on a given CPU
38.Example
Consider two different implementations, M1 and M2, of the same instruction set. There are three classes of instructions (A, B, and C) in the instruction set. M1 has a clock rate of 800 MHz and M2 has a clock rate of 2 GHz. The average number of cycles for each instruction class and their frequencies (for a typical program) are as follows:
Instruction class Machine M1 CPI Frequency Machine M2 CPI Frequency
A 1 50% 2 60%
B 2 20% 3 30%
C 4 30% 4 10%
Calculate the average CPI for each machine, M1, and M2.
Calculate the average MIPS ratings for each machine, M1 and M2.
Hint: MIPS = (clock rate / CPI) / 106.
CSCE 212 38
M1 => .5(1) + .2(2) + .3(4) = 2.1
M2 => .6(2) + .3(3) + .1(4) = 2.5
M1 => 800 * 2.1 = 1680
M2 => 2000 * 2.5 = 5000
39.Example (Con’t)
How many less instructions would M1 need to execut to match the speed of M2?
M1 => 800 * 2.1 = 1680
M2 => 2000 * 2.5 = 5000
CSCE 212 39
5000/1680
40.Chapter 1 — Computer Abstractions and Technology — 40
Power Trends
In CMOS IC technology
×1000
×30
5V → 1V
41.Chapter 1 — Computer Abstractions and Technology — 41
Reducing Power
Suppose a new CPU has
85% of capacitive load of old CPU
15% voltage and 15% frequency reduction
The power wall
We can’t reduce voltage further
We can’t remove more heat
How else can we improve performance?
42.Example
Assume:
Processor Clock Voltage Dynamic P Static P
Pentium 4 3.6 GHz 1.25 V 90 W 10 W
Core i5 3.4 GHz 0.9 V 40 W 30 W
Calculate capacitive load of each processor.
If total power is to be reduced by 10%, how much should the voltage be reduced?
CSCE 212 42
43.Example
For given processor, assume we reduce the voltage by 10% and increase the frequency by 5%. What is the improvement to dynamic power consumption?
CSCE 212 43
𝑝𝑜𝑤𝑒𝑟 𝑜𝑟𝑖𝑔 𝑝𝑜𝑤𝑒𝑟 𝑛𝑒𝑤 = 𝐶 𝑉 2 𝐹 𝐶 .9𝑉 2 (1.05)𝐹 = 𝑉 2 .9𝑉 2 (1.05
= 𝑉 2 .81 𝑉 2 (1.05 = 1 .81(1.05 =1.18
44.Chapter 1 — Computer Abstractions and Technology — 44
Fallacy: Low Power at Idle
i7 power benchmark
At 100% load: 258W
At 50% load: 170W (66%)
At 10% load: 121W (47%)
Google data center
Mostly operates at 10% – 50% load
At 100% load less than 1% of the time
Consider designing processors to make power proportional to load
45.Chapter 1 — Computer Abstractions and Technology — 45
SPEC Power Benchmark
Power consumption of server at different workload levels
Performance: ssj_ops/sec
ssj_ops = server side Java operations per second
Power: Watts (Joules/sec)
46.Chapter 1 — Computer Abstractions and Technology — 46
SPECpower_ssj2008 for Xeon X5650
47.Chapter 1 — Computer Abstractions and Technology — 47
SPEC CPU Benchmark
Programs used to measure performance
Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC)
Develops benchmarks for CPU, I/O, Web, …
SPEC CPU2006
Elapsed time to execute a selection of programs
Negligible I/O, so focuses on CPU performance
Normalize relative to reference machine
Summarize as geometric mean of performance ratios
CINT2006 (integer) and CFP2006 (floating-point)
48.Chapter 1 — Computer Abstractions and Technology — 48
CINT2006 for Intel Core i7 920
49.Chapter 1 — Computer Abstractions and Technology — 49
Pitfall: Amdahl’s Law
Improving an aspect of a computer and expecting a proportional improvement in overall performance
Can’t be done!
Example: multiply accounts for 80s/100s
How much improvement in multiply performance to get 5× overall?
Corollary: make the common case fast
50.Example
Use Amdahl’s Law to compute the new execution time for an architecture that previously required 25 seconds to execute a program, where 15% of the time was spent executing load/store instructions, if the time required for a load/store operation is reduced by 40% (amount of improvement for load/stores = 1/.60 = 1.67).
CSCE 212 50
51.Example
Suppose you have a machine which executes a program consisting of 50% multiply instructions, 20% divide instructions, and the remaining 30% are other instructions. Management wants the machine to run 4 times faster. You can make the divide run at most 3 times faster and the multiply run at most 8 times faster. Can you meet management’s goal by making only one improvement, and which one?
CSCE 212 51
52.Chapter 1 — Computer Abstractions and Technology — 52
Multiprocessors
Multicore microprocessors
More than one processor per chip
Requires explicitly parallel programming
Compare with instruction level parallelism
Hardware executes multiple instructions at once
Hidden from the programmer
Hard to do
Programming for performance
Load balancing
Optimizing communication and synchronization
53.Example
Assume the following instruction classes and corresponding CPIs and dynamic execution counts:
When run on > 1 processors, the number of executed arithmetic instructions is divided by 0.7p (where p = number of processors) but the number of other instructions executed remains the same.
To what factor would the CPI of load/store instructions need to be reduced (sped up) in order for a single processor to match the performance of four processors each having the original CPI?
CSCE 212 53
𝐴𝑋+ 𝐵𝑌 𝑓 +𝐶𝑍= 𝐴𝑋 0.7∙4 +𝐵𝑌+𝐶𝑍
54.Example
Assume the following instruction classes and corresponding CPIs and dynamic execution counts:
As compared to a single processor, under what condition is it possible to achieve an overall speedup of 6 by using multiple processors? Express this as an inequality.
CSCE 212 54
1 6 ≥ 𝐴 𝐴+𝐵+𝐶 .7𝑝 + 𝐵+𝐶 𝐴+𝐵+𝐶
55.Chapter 1 — Computer Abstractions and Technology — 55
Concluding Remarks
Cost/performance is improving
Due to underlying technology development
Hierarchical layers of abstraction
In both hardware and software
Instruction set architecture
The hardware/software interface
Execution time: the best performance measure
Power is a limiting factor
Use parallelism to improve performance