1.Parallelism, Multicore, and Synchronization
Hakim Weatherspoon
CS 3410
Computer Science
Cornell University
[Weatherspoon, Bala, Bracy, McKee, and Sirer, Roth, Martin]
2.Announcements
P4-Buffer Overflow is due today
Due Tuesday, April 16th
C practice assignment
Due Friday, April 19th
P5-Cache project
Due Friday, April 26th
Prelim2
Thursday, May 2nd, 7:30pm
3.xkcd/619
3
4.Big Picture: Multicore and Parallelism
5.Big Picture: Multicore and Parallelism
Why do I need four computing cores on my phone?!
6.Big Picture: Multicore and Parallelism
Why do I need eight computing cores on my phone?!
7.Big Picture: Multicore and Parallelism
Why do I need sixteen computing cores on my phone?!
8.Pitfall: Amdahl's Law
Execution time after improvement = (affected execution time / amount of improvement) + execution time unaffected
T_improved = T_affected / improvement_factor + T_unaffected
9.Pitfall: Amdahl's Law
Improving an aspect of a computer and expecting a proportional improvement in overall performance
Example: multiply accounts for 80s out of 100s
Multiply can be parallelized
How much improvement do we need in the multiply performance to get 5× overall improvement?
(a) 2x (b) 10x (c) 100x (d) 1000x (e) not possible
T_improved = T_affected / improvement_factor + T_unaffected
10.Pitfall: Amdahl's Law
Improving an aspect of a computer and expecting a proportional improvement in overall performance
Example: multiply accounts for 80s out of 100s
Multiply can be parallelized
How much improvement do we need in the multiply performance to get 5× overall improvement?
T_improved = T_affected / improvement_factor + T_unaffected
To get 5× overall: T_improved = 100s / 5 = 20s, so 20 = 80/n + 20, i.e. 80/n = 0
Can't be done! The 20s that cannot be parallelized already limits the overall speedup to at most 100/20 = 5×, and no finite improvement factor n reaches it.
11.Workload: sum of 10 scalars, and 10 × 10 matrix sum
Speed up from 10 to 100 processors?
Single processor: Time = (10 + 100) × tadd
10 processors
Time = 100/10 × tadd + 10 × tadd = 20 × tadd
Speedup = 110/20 = 5.5
100 processors
Time = 100/100 × tadd + 10 × tadd = 11 × tadd
Speedup = 110/11 = 10
Assumes load can be balanced across processors
Scaling Example
11
12.What if matrix size is 100 × 100?
Single processor: Time = (10 + 10000) × tadd
10 processors
Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
Speedup = 10010/1010 = 9.9
100 processors
Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
Speedup = 10010/110 = 91
Assuming load balanced
Scaling Example
12
13.Takeaway
13
Unfortunately, we cannot obtain unlimited scaling (speedup) by adding unlimited parallel resources: eventual performance is dominated by the component that must be executed sequentially.
Amdahl's Law is a caution about this diminishing return
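As a quick sanity check on the numbers above, here is a minimal C sketch of Amdahl's Law (the function and names are ours, not from the lecture); it shows the multiply example saturating well below 5×:

#include <stdio.h>

/* Amdahl's Law: T_new = T_affected / s + T_unaffected,
 * overall speedup = T_old / T_new. */
static double amdahl_speedup(double t_affected, double t_unaffected, double s) {
    double t_old = t_affected + t_unaffected;
    double t_new = t_affected / s + t_unaffected;
    return t_old / t_new;
}

int main(void) {
    /* Multiply example from the slides: 80s affected, 20s unaffected. */
    for (double s = 2.0; s <= 1000000.0; s *= 10.0)
        printf("multiply improved %9.0fx -> overall %.3fx\n",
               s, amdahl_speedup(80.0, 20.0, s));
    return 0;   /* output approaches, but never reaches, 5x */
}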
14.Performance Improvement 101
14
seconds/program = (instructions/program) × (cycles/instruction) × (seconds/cycle)
2 Classic Goals of Architects:
⬇ Clock period (⬆ Clock frequency)
⬇ Cycles per Instruction (⬆ IPC)
15.Darling of performance improvement for decades
Why is this no longer the strategy?
Hitting Limits:
Pipeline depth
Clock frequency
Moore’s Law & Technology Scaling
Power
Clock frequencies have stalled
15
16.Exploiting Intra-instruction parallelism:
Pipelining (decode A while fetching B)
Exploiting Instruction Level Parallelism (ILP):
Multiple issue pipeline (2-wide, 4-wide, etc.)
Statically detected by compiler (VLIW)
Dynamically detected by HW
Dynamically Scheduled (OoO)
Improving IPC via ILP
16
17.Pipelining: execute multiple instructions in parallel
Q: How to get more instruction level parallelism?
A: Deeper pipeline
E.g. 250 MHz 1-stage; 500 MHz 2-stage; 1 GHz 4-stage; 4 GHz 16-stage
Pipeline depth limited by…
max clock speed (less work per stage → shorter clock cycle)
min unit of work
dependencies, hazards / forwarding logic
Instruction-Level Parallelism (ILP)
17
18.Pipelining: execute multiple instructions in parallel
Q: How to get more instruction level parallelism?
A: Multiple issue pipeline
Start multiple instructions per clock cycle in duplicate stages
Instruction-Level Parallelism (ILP)
18
(Duplicated issue slots: one ALU/Br pipe and one LW/SW pipe)
19.Static multiple issue
aka Very Long Instruction Word
Decisions made by compiler
Dynamic multiple issue
Decisions made on the fly
Cost: More execute hardware
Reading/writing register files: more ports
Multiple issue pipeline
19
20.a.k.a. Very Long Instruction Word (VLIW)
Compiler groups instructions to be issued together
Packages them into “issue slots”
How does HW detect and resolve hazards?
It doesn’t. Compiler must avoid hazards
Example: Static Dual-Issue 32-bit RISC-V
Instructions come in pairs (64-bit aligned)
One ALU/branch instruction (or nop)
One load/store instruction (or nop)
Static Multiple Issue
20
21.Two-issue packets
One ALU/branch instruction
One load/store instruction
64-bit aligned
ALU/branch, then load/store
Pad an unused instruction with nop
RISC-V with Static Dual Issue
21
22.Scheduling Example
22
Loop: lw   t0, s1, 0      # t0 = array element
      add  t0, t0, s2     # add scalar in s2
      sw   t0, s1, 0      # store result
      addi s1, s1, -4     # decrement pointer
      bne  s1, zero, Loop # branch if s1 != 0
Schedule this for dual-issue RISC-V
Clicker Question: What is the IPC of this machine?
(A) 0.8 (B) 1.0 (C) 1.25 (D) 1.5 (E) 2.0
23.Scheduling Example
23
Loop: lw   t0, s1, 0      # t0 = array element
      add  t0, t0, s2     # add scalar in s2
      sw   t0, s1, 0      # store result
      addi s1, s1, -4     # decrement pointer
      bne  s1, zero, Loop # branch if s1 != 0
Schedule this for dual-issue RISC-V
5 instructions / 4 cycles = IPC of 1.25
4 cycles / 5 instructions = CPI of 0.8
24.Goal: larger instruction windows (to play with)
Predication
Loop unrolling
Function in-lining
Basic block modifications (superblocks, etc.)
Roadblocks
Memory dependences (aliasing)
Control dependences
Techniques and Limits of Static Scheduling
24
25.Reorder instructions
To fill the issue slot with useful work
Complicated: exceptions may occur
Speculation
25
26.
Move instructions to fill in nops
Need to track hazards and dependencies
Loop unrolling
Optimizations to make it work
26
29. lw t0, s1, 0 # load A
addi t0, t0, +1 # increment A
sw t0, s1, 0 # store A
lw t0, s2, 0 # load B
addi t0, t0, +1 # increment B
sw t0, s2, 0 # store B
ALU/branch slot Load/store slot cycle
nop lw t0, s1, 0 1
nop nop 2
addi t0, t0, +1 nop 3
nop sw t0, s1, 0 4
nop lw t0, s2, 0 5
nop nop 6
addi t0, t0, +1 nop 7
nop sw t0, s2, 0 8
Limits of Static Scheduling
29
Compiler scheduling for dual-issue RISC-V…
30. lw t0, s1, 0 # load A
addi t0, t0, +1 # increment A
sw t0, s1, 0 # store A
lw t1, s2, 0 # load B
addi t1, t1, +1 # increment B
sw t1, s2, 0 # store B
ALU/branch slot Load/store slot cycle
nop lw t0, s1, 0 1
nop nop 2
addi t0, t0, +1 nop 3
nop sw t0, s1, 0 4
nop lw t1, s2, 0 5
nop nop 6
addi t1, t1, +1 nop 7
nop sw t1, s2, 0 8
Limits of Static Scheduling
30
Compiler scheduling for dual-issue RISC-V…
31. lw t0, s1, 0 # load A
addi t0, t0, +1 # increment A
sw t0, s1, 0 # store A
lw t1, s2, 0 # load B
addi t1, t1, +1 # increment B
sw t1, s2, 0 # store B
ALU/branch slot Load/store slot cycle
nop lw t0, s1, 0 1
nop lw t1, s2, 0 2
addi t0, t0, +1 nop 3
addi t1, t1, +1 sw t0, s1, 0 4
nop sw t1, s2, 0 5
Limits of Static Scheduling
31
Compiler scheduling for dual-issue RISC-V…
Problem: What if s1 and s2 are equal (aliasing)? Won't work
32.Exploiting Intra-instruction parallelism:
Pipelining (decode A while fetching B)
Exploiting Instruction Level Parallelism (ILP):
Multiple issue pipeline (2-wide, 4-wide, etc.)
Statically detected by compiler (VLIW)
Dynamically detected by HW
Dynamically Scheduled (OoO)
Improving IPC via ILP
32
33.aka SuperScalar Processor (c.f. Intel)
CPU chooses multiple instructions to issue each cycle
Compiler can help, by reordering instructions….
… but CPU resolves hazards
Even better: Speculation/Out-of-order Execution
Execute instructions as early as possible
Aggressive register renaming (indirection to the rescue!)
Guess results of branches, loads, etc.
Roll back if guesses were wrong
Don’t commit results until all previous insns committed
Dynamic Multiple Issue
33
34.Dynamic Multiple Issue
34
35.It was awesome, but then it stopped improving
Limiting factors?
Program dependencies
Memory dependence detection must be conservative
e.g. Pointer Aliasing: A[0] += 1; B[0] *= 2;
Hard to expose parallelism
Still limited by the fetch stream of the static program
Structural limits
Memory delays and limited bandwidth
Hard to keep pipelines full, especially with branches
Effectiveness of OoO Superscalar
35
36.Power Efficiency
36
Q: Does multiple issue / ILP cost much?
A: Yes.
Dynamic issue and speculation requires power
Those simpler cores did something very right.
37.Moore's Law
37
[Chart: transistor counts per chip over time, from the 4004, 8008, 8080, 8088, 286, 386, 486, and Pentium up through the P4, K8, K10, Atom, Itanium 2, and dual-core Itanium 2]
38.Moore’s law
A law about transistors
Smaller means more transistors per die
And smaller means faster too
But: Power consumption growing too…
Why Multicore?
38
39.Power Limits
39
[Chart: power density across process generations (180nm to 32nm, Xeon marked), compared against a hot plate, a nuclear reactor, a rocket nozzle, and the surface of the sun]
40.Power = capacitance × voltage² × frequency
In practice: Power ~ voltage³
Reducing voltage helps (a lot)
... so does reducing clock speed
Better cooling helps
The power wall
We can’t reduce voltage further
We can’t remove more heat
Power Wall
40
41.Why Multicore?
41
Performance and power relative to a baseline single core:
Single-Core:                      1.0x performance, 1.0x power
Single-Core, overclocked +20%:    1.2x performance, 1.7x power
Single-Core, underclocked -20%:   0.8x performance, 0.51x power
Dual-Core, underclocked -20%:     1.6x performance, 1.02x power
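These figures are consistent with the power relation on the previous slide, assuming voltage scales with frequency (so power goes roughly as the cube of frequency) and that the dual-core workload parallelizes perfectly; a rough back-of-the-envelope check:

\[ P \propto f^3:\qquad 1.2^3 \approx 1.7,\qquad 0.8^3 \approx 0.51,\qquad 2 \times 0.51 \approx 1.02,\qquad 2 \times 0.8 = 1.6 \]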
42.Power Efficiency
42
Q: Does multiple issue / ILP cost much?
A: Yes.
Dynamic issue and speculation requires power
Those simpler cores did something very right.
43.AMD Barcelona Quad-Core: 4 processor cores
Inside the Processor
43
44.Inside the Processor
44
Intel Nehalem Hex-Core
4-wide pipeline
45.Exploiting Thread-Level parallelism
Hardware multithreading to improve utilization:
Multiplexing multiple threads on single CPU
Sacrifices latency for throughput
Single thread cannot fully utilize CPU? Try more!
Three types:
Coarse-grain (has preferred thread)
Fine-grain (round robin between threads)
Simultaneous (hyperthreading)
Improving IPC via ILP and TLP
45
46.Process: multiple threads, code, data and OS state
Threads: share code, data, files, not regs or stack
What is a thread?
46
47.Standard Multithreading Picture
47
Time evolution of issue slots
Color = thread, white = no instruction
(Issue slots of a 4-wide superscalar over time)
CGMT: switch to thread B on thread A's L2 miss
FGMT: switch threads every cycle
SMT: insns from multiple threads coexist in the same cycle
48. Multi-Core vs. Multi-Issue vs. HT
[Table comparing multi-core, multi-issue, and hyperthreaded designs by number of programs, number of pipelines, and pipeline width]
Hyperthreads
HT = MultiIssue + extra PCs and registers – dependency logic
HT = MultiCore – redundant functional units + hazard avoidance
Hyperthreads (Intel)
Illusion of multiple cores on a single core
Easy to keep HT pipelines full + share functional units
Hyperthreading
48
49.Example: All of the above
49
8 die (aka 8 sockets)
4 core per socket
2 HT per core
Note: a socket is a processor, and each processor may have multiple processing cores, so this is an example of a multiprocessor, multicore, hyperthreaded system.
50.Q: So let's just all use multicore from now on!
A: Software must be written as parallel program
Multicore difficulties
Partitioning work
Coordination & synchronization
Communications overhead
How do you write parallel programs?
... without knowing exact underlying architecture?
Parallel Programming
50
51.Partition work so all cores have something to do
Work Partitioning
51
52.Load Balancing
Need to partition so all cores are actually working
Load Balancing
52
53.If tasks have a serial part and a parallel part…
Example:
step 1: divide input data into n pieces
step 2: do work on each piece
step 3: combine all results
Recall: Amdahl’s Law
As number of cores increases …
time to execute parallel part? → goes to zero
time to execute serial part? → remains the same
Serial part eventually dominates
Amdahl's Law
53
54.Amdahl’s Law
54
55.Necessity, not luxury
Power wall
Not easy to get performance out of
Many solutions
Pipelining
Multi-issue
Hyperthreading
Multicore
Parallelism is a necessity
55
56.Q: So let's just all use multicore from now on!
A: Software must be written as parallel program
Multicore difficulties
Partitioning work
Coordination & synchronization
Communications overhead
How do you write parallel programs?
... without knowing exact underlying architecture?
Parallel Programming
56
[Figure: HW and SW layers, with the parallel programming gap between them labeled "Your career…"]
57.How do I take advantage of parallelism?
How do I write (correct) parallel programs?
What primitives do I need to implement correct parallel programs?
Big Picture: Parallelism and Synchronization
57
58.Cache Coherency
Processors cache shared data → they see different (incoherent) values for the same memory location
Synchronizing parallel programs
Atomic Instructions
HW support for synchronization
How to write parallel programs
Threads and processes
Critical sections, race conditions, and mutexes
Parallelism & Synchronization
58
Cache Coherency Problem: What happens when two or more processors cache shared data?
Parallelism and Synchronization
59
Cache Coherency Problem: What happens when two or more processors cache shared data?
i.e. the view of memory held by two different processors is through their individual caches.
As a result, processors can see different (incoherent) values for the same memory location.
Parallelism and Synchronization
60
61.Parallelism and Synchronization
61
62.Each processor core has its own L1 cache
Parallelism and Synchronization
62
63.Each processor core has its own L1 cache
Parallelism and Synchronization
63
64.Each processor core has its own L1 cache
Parallelism and Synchronization
64
[Diagram: Core0–Core3, each with its own cache, connected by an interconnect to shared memory and I/O]
65.Shared Memory Multiprocessor (SMP)
Typical (today): 2 – 4 processor dies, 2 – 8 cores each
HW provides single physical address space for all processors
Shared Memory Multiprocessors
65
66.Shared Memory Multiprocessor (SMP)
Typical (today): 2 – 4 processor dies, 2 – 8 cores each
HW provides single physical address space for all processors
Shared Memory Multiprocessors
66
[Diagram: Core0 … CoreN, each with its own cache, connected by an interconnect to shared memory and I/O]
67.Thread A (on Core0):
for(int i = 0; i < 5; i++) {
  x = x + 1;
}
Thread B (on Core1):
for(int j = 0; j < 5; j++) {
  x = x + 1;
}
What will the value of x be after both loops finish?
Cache Coherency Problem
67
68.Thread A (on Core0):
for(int i = 0; i < 5; i++) {
  x = x + 1;
}
Thread B (on Core1):
for(int j = 0; j < 5; j++) {
  x = x + 1;
}
What will the value of x be after both loops finish?
(x starts as 0)
6
8
10
Could be any of the above
Couldn’t be any of the above
Cache Coherency Problem
68
Clicker Question
70.Thread A (on Core0):
for(int i = 0; i < 5; i++) {
  LW t0, addr(x)
  ADDIU t0, t0, 1
  SW t0, addr(x)
}
Thread B (on Core1):
for(int j = 0; j < 5; j++) {
  LW t0, addr(x)
  ADDIU t0, t0, 1
  SW t0, addr(x)
}
Cache Coherency Problem, WB $
70
Core0: t0=0, then t0=1, then x=1 in its cache; Core1: t0=0, then t0=1, then x=1 in its cache → Problem!
[Diagram: shared memory still holds x = 0 while Core0's and Core1's caches each hold x = 1]
71.Not just a problem for Write-Back Caches
71
Executing on a write-thru cache
72.Coherence
What values can be returned by a read
Need a globally uniform (consistent) view of a single memory location
Solution: Cache Coherence Protocols
Consistency
When a written value will be returned by a read
Need a globally uniform (consistent) view of all memory locations relative to each other
Solution: Memory Consistency Models
Two issues
72
73.Informal: Reads return most recently written value
Formal: For concurrent processes P1 and P2
P writes X before P reads X (with no intervening writes) ⇒ read returns written value
(preserve program order)
P1 writes X before P2 reads X ⇒ read returns written value
(coherent memory view, can't read old value forever)
P1 writes X and P2 writes X ⇒ all processors see writes in the same order
all see the same final value for X
Aka write serialization
(else PA can see P2's write before P1's and PB can see the opposite; their final understanding of state is wrong)
Coherence Defined
73
74.Operations performed by caches in multiprocessors to ensure coherence
Migration of data to local caches
Reduces bandwidth for shared memory
Replication of read-shared data
Reduces contention for access
Snooping protocols
Each cache monitors bus reads/writes
Cache Coherence Protocols
74
75.Snooping for Hardware Cache Coherence
All caches monitor bus and all other caches
Bus read: respond if you have dirty data
Bus write: update/invalidate your copy of data
Snooping
75
[Diagram: Core0 … CoreN, each with a cache and a snoop unit monitoring the interconnect to shared memory and I/O]
76.Cache gets exclusive access to a block when it is to be written
Broadcasts an invalidate message on the bus
Subsequent read in another cache misses
Owning cache supplies updated value
Invalidating Snooping Protocols
76
77.Write-back policies for bandwidth
Write-invalidate coherence policy
First invalidate all other copies of data
Then write it in cache line
Anybody else can read it
Permits one writer, multiple readers
In reality: many coherence protocols
Snooping doesn’t scale
Directory-based protocols
Caches and memory record sharing status of blocks in a directory
Writing
77
78.Hardware Cache Coherence
78
Coherence
all copies have same data at all times
Coherence controller:
Examines bus traffic (addresses and data)
Executes coherence protocol
What to do with local copy when you see different things happening on bus
Three processor-initiated events
Ld: load
St: store
WB: write-back
Two remote-initiated events
LdMiss: read miss from another processor
StMiss: write miss from another processor
[Diagram: CPU with D$ data and D$ tags; a coherence controller (CC) sits between the cache and the bus]
79.VI Coherence Protocol
79
VI (valid-invalid) protocol:
Two states (per block in cache)
V (valid): have block
I (invalid): don’t have block
Can implement with valid bit
Protocol diagram (left)
If you load/store a block: transition to V
If anyone else wants to read/write block:
Give it up: transition to I state
Write-back if your own copy is dirty
[State diagram: I →(Load, Store)→ V; V →(LdMiss, StMiss, WB)→ I; Load/Store loop on V; LdMiss/StMiss loop on I]
80.VI Protocol (Write-Back Cache)
80
lw by Thread B generates an “other load miss” event (LdMiss)
Thread A responds by sending its dirty copy, transitioning to I
Trace (CPU0 | Mem | CPU1):
initially:        -   | 0 | -
A: lw t0, r3, 0 → V:0 | 0 | -
A: sw t0, r3, 0 → V:1 | 0 | -
B: lw t0, r3, 0 → I:  | 1 | V:1   (A writes back its dirty copy and invalidates)
B: sw t0, r3, 0 → I:  | 1 | V:2
Thread A
lw t0, r3, 0
ADDIU t0, t0, 1
sw t0, r3, 0
Thread B
lw t0, r3, 0
ADDIU t0, t0, 1
sw t0, r3, 0
81.VI Coherence Question
81
[VI state diagram as on slide 79]
Clicker Question:
Core A loads x into a register
Core B wants to load x into a register
What happens?
they can both have a copy of X in their cache
A keeps the copy
B steals the copy from A, and this is an efficient thing to do
B steals the copy from A, and this is a sad shame
B waits until A kicks X out of its cache, then it can complete the load
82.VI → MSI
82
[State diagram: I →(Store)→ M, I →(Load)→ S; S →(Store)→ M, S →(StMiss)→ I; M →(LdMiss)→ S, M →(StMiss, WB)→ I; Load/Store loop on M, Load/LdMiss loop on S, LdMiss/StMiss loop on I]
VI protocol is inefficient
Only one cached copy allowed in entire system
Multiple copies can’t exist even if read-only
Not a problem in example
Big problem in reality
MSI (modified-shared-invalid)
Fixes problem: splits “V” state into two states
M (modified): local dirty copy
S (shared): local clean copy
Allows either
Multiple read-only copies (S-state) --OR--
Single read/write copy (M-state)
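To make the state machine concrete, here is a small C sketch of per-block MSI transitions (our own illustration, not the lecture's code); the state and event names mirror the bullets above, and real protocols add details such as bus arbitration and data responses that are omitted here.

/* Sketch of per-block MSI transitions. */
typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
typedef enum { LOAD, STORE, LD_MISS, ST_MISS, WB } msi_event_t;

/* Returns the next state; *do_writeback is set when the dirty copy must be
 * flushed (or forwarded) as part of the transition. */
msi_state_t msi_next(msi_state_t cur, msi_event_t ev, int *do_writeback) {
    *do_writeback = 0;
    switch (cur) {
    case MSI_I:
        if (ev == LOAD)  return MSI_S;   /* read miss: fetch a clean copy   */
        if (ev == STORE) return MSI_M;   /* write miss: fetch and own copy  */
        return MSI_I;                    /* remote events: nothing to do    */
    case MSI_S:
        if (ev == STORE)   return MSI_M; /* upgrade: invalidate other copies */
        if (ev == ST_MISS) return MSI_I; /* another core wants to write      */
        return MSI_S;                    /* local loads and remote reads ok  */
    case MSI_M:
        if (ev == LD_MISS) { *do_writeback = 1; return MSI_S; }
        if (ev == ST_MISS) { *do_writeback = 1; return MSI_I; }
        if (ev == WB)      { *do_writeback = 1; return MSI_I; }
        return MSI_M;                    /* local loads/stores stay in M     */
    }
    return cur;
}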
83.MSI Protocol (Write-Back Cache)
83
lw by Thread B generates an “other load miss” event (LdMiss)
Thread A responds by sending its dirty copy, transitioning to S
sw by Thread B generates a “other store miss” event (StMiss)
Thread A responds by transitioning to I
Thread A
lw t0, r3, 0
ADDIU t0, t0, 1
sw t0, r3, 0
Thread B
lw t0, r3, 0
ADDIU t0, t0, 1
sw t0, r3, 0
Trace (CPU0 | Mem | CPU1):
initially:        -   | 0 | -
A: lw t0, r3, 0 → S:0 | 0 | -
A: sw t0, r3, 0 → M:1 | 0 | -
B: lw t0, r3, 0 → S:1 | 1 | S:1   (A writes back its dirty copy and drops to S)
B: sw t0, r3, 0 → I:  | 1 | M:2
84.Coherence introduces two new kinds of cache misses
Upgrade miss
On stores to read-only blocks
Delay to acquire write permission to read-only block
Coherence miss
Miss to a block evicted by another processor’s requests
Making the cache larger…
Doesn’t reduce these types of misses
As cache grows large, these sorts of misses dominate
False sharing
Two or more processors sharing parts of the same block
But not the same bytes within that block (no actual sharing)
Creates pathological “ping-pong” behavior
Careful data placement may help, but is difficult
Cache Coherence and Cache Misses
84
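One common mitigation for false sharing (not on the slide itself) is to pad and align per-thread data so that independent counters never land in the same cache line; a minimal sketch, assuming 64-byte lines and the GCC/Clang aligned attribute:

#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines (typical, not universal) */

/* Each thread gets its own counter on its own cache line, so updates by
 * different cores no longer ping-pong the same block between caches. */
struct padded_counter {
    uint64_t count;
    char pad[CACHE_LINE - sizeof(uint64_t)];
} __attribute__((aligned(CACHE_LINE)));   /* C11 alignas(64) would also work */

struct padded_counter per_thread_hits[8]; /* one entry per thread */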
85.In reality: many coherence protocols
Snooping: VI, MSI, MESI, MOESI, …
But Snooping doesn’t scale
Directory-based protocols
Caches & memory record blocks’ sharing status in directory
Nothing is free → directory protocols are slower!
Cache Coherency:
requires that reads return most recently written value
Is a hard problem!
More Cache Coherence
85
86.Informally, Cache Coherency requires that reads return most recently written value
Cache coherence is a hard problem
Snooping protocols are one approach
Takeaway: Summary of cache coherence
86
87.Is cache coherency sufficient?
i.e. Is cache coherency (what values are read) sufficient to maintain consistency (when a written value will be returned by a read)? Both coherency and consistency are required for correct shared memory programs.
Next Goal: Synchronization
87
88.Are We Done Yet?
88
What just happened???
Is Cache Coherency Protocol Broken??
Thread A
lw t0, r3, 0
ADDIU t0, t0, 1
sw t0, r3, 0
Thread B
lw t0, r3, 0
ADDIU t0, t0, 1
sw t0, r3, 0
Trace (CPU0 | Mem | CPU1), interleaved:
initially: memory holds 0
A: lw → CPU0 S:0
B: lw → CPU0 S:0, CPU1 S:0
A: sw → CPU0 M:1, CPU1 I:
B: sw → CPU0 I:,  CPU1 M:1
Both threads read 0 and both store 1, so one increment is lost: x ends up 1, not 2.
89.Thread A (on Core0):
for(int i = 0; i < 5; i++) {
  LW t0, addr(x)
  ADDIU t0, t0, 1
  SW t0, addr(x)
}
Thread B (on Core1):
for(int j = 0; j < 5; j++) {
  LW t0, addr(x)
  ADDIU t0, t0, 1
  SW t0, addr(x)
}
Is Cache Coherency Sufficient?
89
Very expensive and difficult to maintain consistency
90.The Previous example shows us that
Caches can be incoherent even if there is a coherence protocol.
Cache coherence protocols are not rich enough to support multi-threaded programs
Coherent caches are not enough to guarantee expected program behavior.
Multithreading is just a really bad idea.
All of the above
Clicker Question
90
92.Need it to exploit multiple processing units
…to parallelize for multicore
…to write servers that handle many clients
Problem: hard even for experienced programmers
Behavior can depend on subtle timing differences
Bugs may be impossible to reproduce
Needed: synchronization of threads
Programming with Threads
92
93.Within a thread: execution is sequential
Between threads?
No ordering or timing guarantees
Might even run on different cores at the same time
Problem: hard to program, hard to reason about
Behavior can depend on subtle timing differences
Bugs may be impossible to reproduce
Cache coherency is not sufficient…
Need explicit synchronization to make sense of concurrency!
Programming with Threads
93
94.Concurrency poses challenges for:
Correctness
Threads accessing shared memory should not interfere with each other
Liveness
Threads should not get stuck, should make forward progress
Efficiency
Program should make good use of available computing resources (e.g., processors).
Fairness
Resources apportioned fairly between threads
Programming with Threads
94
95.
Apache web server
void main() {
  setup();
  while (c = accept_connection()) {
    req = read_request(c);
    hits[req]++;
    send_response(c, req);
  }
  cleanup();
}
Example: Multi-Threaded Program
95
96.Each client request handled by a separate thread (in parallel)
Some shared state: hit counter, ...
(look familiar?)
Timing-dependent failure → race condition
hard to reproduce → hard to debug
Example: web server
96
Thread 52
read hits
addiu
write hits
Thread 205
read hits
addiu
write hits
97.Possible result: lost update!
Timing-dependent failure → race condition
Very hard to reproduce → difficult to debug
Two threads, one counter
97
[Timeline: T1 does LW (reads 0); T2 does LW (also reads 0); then each does ADDIU/SW storing hits = 0 + 1. hits goes from 0 to 1, not 2.]
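This lost-update interleaving is easy to reproduce with pthreads; a minimal sketch (deliberately unsynchronized, so the printed total is usually well below 2,000,000):

#include <pthread.h>
#include <stdio.h>

volatile long hits = 0;            /* shared, unsynchronized on purpose */

void *worker(void *arg) {
    for (long i = 0; i < 1000000; i++)
        hits = hits + 1;           /* load, add, store: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("hits = %ld (expected 2000000)\n", hits);  /* usually much less */
    return 0;
}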
98.Timing-dependent error involving access to shared state
Race conditions depend on how threads are scheduled
i.e. who wins “races” to update state
Challenges of Race Conditions
Races are intermittent, may occur rarely
Timing dependent = small changes can hide bug
Program is correct only if all possible schedules are safe
Number of possible schedules is huge
Imagine adversary who switches contexts at worst possible time
Race conditions
98
99.What if we can designate parts of the execution as critical sections
Rule: only one thread can be “inside” a critical section
Critical Sections
99
Thread 52
CSEnter()
read hits
addi
write hits
CSExit()
Thread 205
CSEnter()
read hits
addi
write hits
CSExit()
100.To eliminate races: use critical sections that only one thread can be in
Contending threads must wait to enter
Critical Sections
100
[Timeline: T1 calls CSEnter(), runs the critical section, then CSExit(); T2's CSEnter() waits until T1's CSExit() before T2 can run its critical section]
101.Implement CSEnter and CSExit; i.e., a critical section
Only one thread can hold the lock at a time
“I have the lock”
Mutual Exclusion Lock (mutex)
lock(m): wait till it becomes free, then lock it
unlock(m): unlock it
Mutual Exclusion Lock (Mutex)
101
safe_increment() {
pthread_mutex_lock(&m);
hits = hits + 1;
pthread_mutex_unlock(&m);
}
102.Only one thread can hold a given mutex at a time
Acquire (lock) mutex on entry to critical section
Or block if another thread already holds it
Release (unlock) mutex on exit
Allow one waiting thread (if any) to acquire & proceed
Mutexes
102
[Timeline: after pthread_mutex_init(&m), T1 does pthread_mutex_lock(&m), hits = hits+1, pthread_mutex_unlock(&m); T2's pthread_mutex_lock(&m) waits until T1 unlocks, then T2 increments and unlocks]
103.How to implement mutex locks?
What are the hardware primitives?
Then, use these mutex locks to implement critical sections, and use critical sections to write parallel safe programs
Next Goal
103
104.Atomic read & write memory operation
Between read & write: no writes to that address
Many atomic hardware primitives
test and set (x86)
atomic increment (x86)
bus lock prefix (x86)
compare and exchange (x86, ARM deprecated)
load reserved / store conditional (pair of insns) (RISC-V, ARM, PowerPC, DEC Alpha, …)
Hardware Support for Synchronization
104
105.Load Reserved: LR.W rd, (rs1)
“I want the value at address X. Also, start monitoring any writes to this address.”
Store Conditional: SC.W rd, rs2, (rs1)
“If no one has changed the value at address X since the LR.W, perform this store and tell me it worked.”
Data at location has not changed since the LR?
SUCCESS:
Performs the store (writes the value in rs2 to the address in rs1)
Returns 0 in rd
Data at location has changed since the LR?
FAILURE:
Does not perform the store
Returns 1 in rd
Synchronization in RISC-V
105
106.Load Reserved: LR.W rd, (rs1)
Store Conditional: SC.W rd, rs2, (rs1)
Succeeds if location not changed since the LR
Returns 0 in rd
Fails if location is changed
Returns 1 in rd
Any time a processor intervenes and modifies the value in memory between the LR and SC instruction, the SC returns 1 in rd, causing the code to try again.
i.e. use this value 1 in rd to try again.
Synchronization in RISC-V
106
107.Load Reserved: LR.W rd, (rs1)
Store Conditional: SC.W rd, rs2, (rs1)
Succeeds if location not changed since the LR
Returns 0 in rd
Fails if location is changed
Returns 1 in rd
Example: atomic incrementor
i++
LW t0, s0, 0
ADDIU t0, t0, 1
SW t0, s0, 0
Synchronization in RISC-V
107
atomic(i++):
try: LR.W  t0, (s0)
     ADDIU t0, t0, 1
     SC.W  t0, t0, (s0)
     BNEZ  t0, try
Value in memory changed between LR and SC? SC returns nonzero in t0 → retry
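At the C level the same idea is usually expressed with C11 atomics rather than hand-written assembly; a minimal sketch (on RISC-V a compiler typically lowers this to an LR.W/SC.W retry loop, or to an AMOADD.W instruction):

#include <stdatomic.h>

atomic_int i;                       /* the shared counter */

void atomic_increment(void) {
    /* One atomic read-modify-write; no lost updates between threads. */
    atomic_fetch_add(&i, 1);
}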
108.Load Reserved: LR.W rd, (rs1)
Store Conditional: SC.W rd, rs2, (rs1)
Atomic Increment in Action
108
109.Load Reserved: LR.W rd, (rs1)
Store Conditional: SC.W rd, rs2, (rs1)
Atomic Increment in Action
109
Success!
Failure!
110.Load Reserved / Store Conditional
m = 0; // m=0 means lock is free; otherwise, if m=1, then lock locked
mutex_lock(int *m) {
while(test_and_set(m)){}
}
int test_and_set(int *m) {
old = *m;
*m = 1;
return old;
}
Mutex from LR and SC
110
(test_and_set is made atomic using LR.W / SC.W)
111.Load Reserved / Store Conditional
m = 0; // m=0 means lock is free; otherwise, if m=1, then lock locked
mutex_lock(int *m) {
while(test_and_set(m)){}
}
int test_and_set(int *m) {
LI t0, 1
LR.W t1, (a0)
SC.W t0, t0, (a0)
MV a0, t1
}
Mutex from LR and SC
111
112.Load Reserved / Store Conditional
m = 0; // m=0 means lock is free; otherwise, if m=1, then lock locked
mutex_lock(int *m) {
while(test_and_set(m)){}
}
int test_and_set(int *m) {
try:
LI t0, 1
LR.W t1, (a0)
SC.W t0, t0, (a0)
BNEZ t0, try
MV a0, t1
}
Mutex from LR and SC
112
113.Load Reserved / Store Conditional
m = 0; // m=0 means lock is free; otherwise, if m=1, then lock locked
mutex_lock(int *m) {
test_and_set:
LI t0, 1
LR.W t1, (a0)
BNEZ t1, test_and_set
SC.W t0, t0, (a0)
BNEZ t0, test_and_set
}
mutex_unlock(int *m) {
*m = 0;
}
Mutex from LR and SC
113
114.Load Reserved / Store Conditional
m = 0; // m=0 means lock is free; otherwise, if m=1, then lock locked
mutex_lock(int *m) {
test_and_set:
LI t0, 1
LR.W t1, (a0)
BNEZ t1, test_and_set
SC.W t0, t0, (a0)
BNEZ t0, test_and_set
}
mutex_unlock(int *m) {
SW zero, a0, 0
}
Mutex from LR and SC
114
This is called a
Spin lock
Aka spin waiting
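For comparison, here is a rough C11 equivalent of this spin lock, using atomic_flag (a test-and-set primitive) instead of raw LR.W/SC.W; this is a sketch of the same idea, not the lecture's code:

#include <stdatomic.h>

atomic_flag m = ATOMIC_FLAG_INIT;   /* clear = lock is free */

void mutex_lock_spin(void) {
    /* test_and_set returns the previous value: spin until it was clear. */
    while (atomic_flag_test_and_set(&m)) { /* spin (busy wait) */ }
}

void mutex_unlock_spin(void) {
    atomic_flag_clear(&m);          /* mark the lock free again */
}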
115.Mutex from LR and SC
115
[Execution traces (slides 115–117): two cores race inside mutex_lock. One core's SC.W succeeds, so it grabs the mutex lock and runs inside the critical section; the other core fails to get the mutex lock and tries again.]
121.Now we can write parallel and correct programs
121
Thread A:
for(int i = 0; i < 5; i++) {
  mutex_lock(m);
  x = x + 1;
  mutex_unlock(m);
}
Thread B:
for(int j = 0; j < 5; j++) {
  mutex_lock(m);
  x = x + 1;
  mutex_unlock(m);
}
122.Other atomic hardware primitives
- test and set (x86)
- atomic increment (x86)
- bus lock prefix (x86)
- compare and exchange (x86, ARM deprecated)
- load reserved / store conditional (RISC-V, ARM, PowerPC, DEC Alpha, …)
Alternative Atomic Instructions
122
123.Synchronization techniques
clever code
must work despite adversarial scheduler/interrupts
used by: hackers
also: noobs
disable interrupts
used by: exception handler, scheduler, device drivers, …
disable preemption
dangerous for user code, but okay for some kernel code
mutual exclusion locks (mutex)
general purpose, except for some interrupt-related cases
Synchronization
123
124.Need parallel abstractions, especially for multicore
Writing correct programs is hard
Need to prevent data races
Need critical sections to prevent data races
Mutex, mutual exclusion, implements critical section
Mutex often implemented using a lock abstraction
Hardware provides synchronization primitives such as LR and SC (load reserved and store conditional) instructions to efficiently implement locks
Summary
124
125.Next Goal
125
How do we use synchronization primitives to build concurrency-safe data structures?
126.Access to shared data must be synchronized
Goal: enforce datastructure invariants
Producer/Consumer Example (1)
126
// invariant: // data is in A[h … t-1]
char A[100];
int h = 0, t = 0;
// producer: add to list tail
void put(char c) {
A[t] = c;
t = (t+1)%n;
}
[Ring buffer diagram: elements 1, 2, 3 stored between head and tail]
127.Access to shared data must be synchronized
Goal: enforce datastructure invariants
Producer/Consumer Example (1)
127
// invariant: // data is in A[h … t-1]
char A[100];
int h = 0, t = 0;
// producer: add to list tail
void put(char c) {
A[t] = c;
t = (t+1)%n;
}
// consumer: take from list head
char get() {
while (h == t) { };
char c = A[h];
h = (h+1)%n;
return c;
}
128.Access to shared data must be synchronized
Goal: enforce datastructure invariants
Producer/Consumer Example (1)
128
// invariant: // data is in A[h … t-1]
char A[100];
int h = 0, t = 0;
// producer: add to list tail
void put(char c) {
A[t] = c;
t = (t+1)%n;
}
// consumer: take from list head
char get() {
while (h == t) { };
char c = A[h];
h = (h+1)%n;
return c;
}
What is wrong with the code?
Will lose update to t and/or h
Invariant is not upheld
Will produce if full
Will consume if empty
All of the above
129.Access to shared data must be synchronized
Goal: enforce datastructure invariants
Producer/Consumer Example (1)
129
// invariant: // data is in A[h … t-1]
char A[100];
int h = 0, t = 0;
// producer: add to list tail
void put(char c) {
A[t] = c;
t = (t+1)%n;
}
// consumer: take from list head
char get() {
while (h == t) { };
char c = A[h];
h = (h+1)%n;
return c;
}
What is wrong with the code?
Could miss an update to t or h
Breaks invariant: only produce if not full and only consume if not empty
Need to synchronize access to shared data
130.Access to shared data must be synchronized
Goal: enforce datastructure invariants
Producer/Consumer Example (1)
130
// invariant: // data is in A[h … t-1]
char A[100];
int h = 0, t = 0;
// producer: add to list tail
void put(char c) {
A[t] = c;
t = (t+1)%n;
}
// consumer: take from list head
char get() {
while (h == t) { };
char c = A[h];
h = (h+1)%n;
return c;
}
acquire-lock()
release-lock()
acquire-lock()
release-lock()
Rule of thumb: all access and updates that can affect invariant become critical sections
131.Protecting an invariant Example (2)
131
// invariant: (protected by mutex m)// data is in A[h … t-1]
pthread_mutex_t *m = malloc(sizeof(*m)); pthread_mutex_init(m, NULL);
char A[100];
int h = 0, t = 0;
// producer: add to list tail
void put(char c) {
pthread_mutex_lock(m);
A[t] = c;
t = (t+1)%n;
pthread_mutex_unlock(m);
}
// consumer: take from list head
char get() {
pthread_mutex_lock(m);
while(h == t) {}
char c = A[h];
h = (h+1)%n;
pthread_mutex_unlock(m);
return c;
}
Does this fix work?
Rule of thumb: all access and updates that can affect invariant become critical sections
132.Protecting an invariant Example (2)
132
// invariant: (protected by mutex m)// data is in A[h … t-1]
pthread_mutex_t *m = malloc(sizeof(*m)); pthread_mutex_init(m, NULL);
char A[100];
int h = 0, t = 0;
// producer: add to list tail
void put(char c) {
pthread_mutex_lock(m);
A[t] = c;
t = (t+1)%n;
pthread_mutex_unlock(m);
}
// consumer: take from list head
char get() {
pthread_mutex_lock(m);
while(h == t) {}
char c = A[h];
h = (h+1)%n;
pthread_mutex_unlock(m);
return c;
}
Does this fix work?
BUG: Can’t wait while holding lock
Rule of thumb: all access and updates that can affect invariant become critical sections
133.Insufficient locking can cause races
Skimping on mutexes? Just say no!
Poorly designed locking can cause deadlock
know why you are using mutexes!
acquire locks in a consistent order to avoid cycles
use lock/unlock like braces (match them lexically)
lock(&m); …; unlock(&m)
watch out for return, goto, and function calls!
watch out for exception/error conditions!
Guidelines for successful mutexing
133
P1: lock(m1); lock(m2);
P2: lock(m2); lock(m1);
→ circular wait (deadlock)
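A minimal pthreads sketch of the consistent-ordering rule above: both code paths take m1 before m2, so the circular wait cannot arise (the function name is ours):

#include <pthread.h>

pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

/* Every thread acquires locks in the same global order: m1, then m2. */
void update_shared_state(void) {
    pthread_mutex_lock(&m1);
    pthread_mutex_lock(&m2);
    /* ... touch state protected by both locks ... */
    pthread_mutex_unlock(&m2);   /* release in reverse order, matched like braces */
    pthread_mutex_unlock(&m1);
}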
134.Writers must check for full buffer & Readers must check for empty buffer
ideal: don’t busy wait… go to sleep instead
Beyond mutexes Example (3)
134
char get() {
acquire(L);
char c = A[h];
h = (h+1)%n;
release(L);
return c;
}
while(empty) {}
[Ring buffer diagram: last == head ⇒ empty]
135.char get() {
acquire(L);
char c = A[h];
h = (h+1)%n;
release(L);
return c;
}
Writers must check for full buffer & Readers must check for empty buffer
ideal: don’t busy wait… go to sleep instead
Beyond mutexes Example (3)
135
char get() {
acquire(L);
while (h == t) { };
char c = A[h];
h = (h+1)%n;
release(L);
return c;
}
char get() {
while (h == t) { };
acquire(L);
char c = A[h];
h = (h+1)%n;
release(L);
return c;
}
Cannot check the condition while holding the lock, BUT the empty condition may no longer hold in the critical section.
Dilemma: Have to check while holding the lock,
136.Writers must check for full buffer & Readers must check for empty buffer
ideal: don’t busy wait… go to sleep instead
Beyond mutexes Example (3)
136
char get() {
acquire(L);
while (h == t) { };
char c = A[h];
h = (h+1)%n;
release(L);
return c;
}
Dilemma: Have to check while holding the lock, but cannot wait while holding the lock
137.Writers must check for full buffer & Readers must check for empty buffer
ideal: don’t busy wait… go to sleep instead
Beyond mutexes Example (4)
137
char get() {
do {
acquire(L);
empty = (h == t);
if (!empty) {
c = A[h];
h = (h+1)%n;
}
release(L);
} while (empty);
return c;
}
Does this work?
Yes
No
138.Writers must check for full buffer & Readers must check for empty buffer
ideal: don’t busy wait… go to sleep instead
Beyond mutexes Example (4)
138
char get() {
do {
acquire(L);
empty = (h == t);
if (!empty) {
c = A[h];
h = (h+1)%n;
}
release(L);
} while (empty);
return c;
}
It works.
But, it is wasteful
Due to the spinning
139.Language-level Synchronization
139
140.Use [Hoare] a condition variable to wait for a condition to become true (without holding lock!)
wait(m, c) :
atomically release m and sleep, waiting for condition c
wake up holding m sometime after c was signaled
signal(c) : wake up one thread waiting on c
broadcast(c) : wake up all threads waiting on c
POSIX (e.g., Linux): pthread_cond_wait, pthread_cond_signal, pthread_cond_broadcast
Condition variables
140
141.wait(m, c) : release m, sleep until c, wake up holding m
signal(c) : wake up one thread waiting on c
Using a condition variable Example (5)
141
char get() {
lock(m);
while (t == h)
wait(m, not_empty);
char c = A[h];
h = (h+1) % n;
unlock(m);
signal(not_full);
return c;
}
cond_t *not_full = ...;
cond_t *not_empty = ...;
mutex_t *m = ...;
void put(char c) {
lock(m);
while ((t+1) % n == h)
wait(m, not_full);
A[t] = c;
t = (t+1) % n;
unlock(m);
signal(not_empty);
}
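For reference, the same ring buffer written against the actual POSIX API; a sketch assuming a 100-entry buffer, with error checking omitted:

#include <pthread.h>

#define N 100
static char A[N];
static int h = 0, t = 0;            /* invariant: data is in A[h .. t-1] */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

void put(char c) {
    pthread_mutex_lock(&m);
    while ((t + 1) % N == h)         /* full: sleep, releasing m while waiting */
        pthread_cond_wait(&not_full, &m);
    A[t] = c;
    t = (t + 1) % N;
    pthread_mutex_unlock(&m);
    pthread_cond_signal(&not_empty);
}

char get(void) {
    pthread_mutex_lock(&m);
    while (h == t)                   /* empty: sleep, releasing m while waiting */
        pthread_cond_wait(&not_empty, &m);
    char c = A[h];
    h = (h + 1) % N;
    pthread_mutex_unlock(&m);
    pthread_cond_signal(&not_full);
    return c;
}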
142.A Monitor is a concurrency-safe datastructure, with…
one mutex
some condition variables
some operations
All operations on monitor acquire/release mutex
one thread in the monitor at a time
Ring buffer was a monitor
Java, C#, etc., have built-in support for monitors
Monitors
142
143.Java objects can be monitors
“synchronized” keyword locks/releases the mutex
Has one (!) builtin condition variable
o.wait() = wait(o, o)
o.notify() = signal(o)
o.notifyAll() = broadcast(o)
Java wait() can be called even when mutex is not held. Mutex not held when awoken by signal(). Useful?
Java concurrency
143
144.Lots of synchronization variations…Reader/writer locks
Any number of threads can hold a read lock
Only one thread can hold the writer lock
Semaphores
N threads can hold lock at the same time
Monitors
Concurrency-safe data structure with 1 mutex
All operations on monitor acquire/release mutex
One thread in the monitor at a time
Message-passing, sockets, queues, ring buffers, …
transfer data and synchronize
Language-level Synchronization
144
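POSIX also exposes reader/writer locks directly; a minimal sketch of the read-side/write-side split (the function names are ours):

#include <pthread.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static int shared_value;

int reader(void) {                   /* any number of readers may hold rw at once */
    pthread_rwlock_rdlock(&rw);
    int v = shared_value;
    pthread_rwlock_unlock(&rw);
    return v;
}

void writer(int v) {                 /* one writer at a time, excluding all readers */
    pthread_rwlock_wrlock(&rw);
    shared_value = v;
    pthread_rwlock_unlock(&rw);
}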
145.Summary
145
Hardware Primitives: test-and-set, LR.W/SC.W, barrier, ...
… used to build …
Synchronization primitives: mutex, semaphore, ...
… used to build …
Language Constructs: monitors, signals, ...