Computer Architecture and Parallel Computing 并行结构与计算. Lecture 5 SuperScalar and Multithreading. Peng Liu



1 Computer Architecture and Parallel Computing 并行结构与计算
Lecture 5: SuperScalar and Multithreading
Peng Liu, College of Info. Sci. & Elec. Eng., Zhejiang University, liupeng@zju.edu.cn

2 Last time in Lecture 04
- Register renaming removes WAR and WAW hazards.
- In-order fetch/decode, out-of-order execute, and in-order commit give high performance and precise exceptions.
- Dynamic branch predictors can be quite accurate (>95%) and avoid most control hazards.
- Branch History Tables (BHTs) predict only the direction (later in the pipeline). They need just a few bits per entry (2 bits gives hysteresis), but the instruction bits must be decoded to determine whether this is a branch and what the target address is.
- Branch Target Buffers (BTBs) predict direction and target earlier in the pipeline, but need bigger entries.
- A Return Address Stack predicts subroutine returns.
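The hysteresis that 2 bits per BHT entry provide can be illustrated with a small sketch. The class name `TwoBitCounter`, the state encoding (0 and 1 predict not-taken, 2 and 3 predict taken), and the loop pattern are assumptions made for this illustration, not details from the slides.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def mispredictions(outcomes, state=3):
    """Count wrong predictions for a sequence of branch outcomes."""
    counter = TwoBitCounter(state)
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop branch taken 9 times and then not taken, executed 3 times over.
loop_branch = ([True] * 9 + [False]) * 3
print(mispredictions(loop_branch))  # 3: one miss per loop exit
```

A single not-taken outcome only moves the counter from 3 to 2, so the branch is still predicted taken when the loop is re-entered.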

3 Branch Mispredict Recovery
- In-order execution machines: assume no instruction issued after the branch can write back before the branch resolves, so kill all instructions in the pipeline behind the mispredicted branch.
- Out-of-order execution? Multiple instructions following the branch in program order can complete before the branch resolves.

4 In-Order Commit for Precise Exceptions
[Figure: pipeline of Fetch, Decode, Reorder Buffer, Execute, and Commit. Fetch and decode are in-order, execute is out-of-order, commit is in-order. When an exception is detected at commit, kill signals go to the earlier stages and the handler PC is injected at fetch.]
- Instructions are fetched and decoded into the instruction reorder buffer in order.
- Execution is out-of-order (and so is completion).
- Commit (write-back to architectural state, i.e., regfile and memory) is in-order.
- Temporary storage is needed in the ROB to hold results before commit.

5 Branch Misprediction in Pipeline
[Figure: PC, Fetch, Decode, Reorder Buffer, Commit pipeline, with Execute and Complete below the ROB. Branch prediction feeds fetch; branch resolution sends kill signals to the earlier stages and injects the correct PC.]
- Can have multiple unresolved branches in the ROB.
- Can resolve branches out-of-order by killing all the instructions in the ROB that follow a mispredicted branch.
- Must also kill instructions in flight in the execution pipelines.

6 Recovering ROB/Renaming Table
[Figure: rename table (each entry holds a tag t and a valid bit v) with a stack of rename snapshots, register file, reorder buffer (fields: Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data; tags t1 to tn), load unit, functional units, store unit, and a commit path carrying <t, result>. ROB pointers: Ptr2 is next to commit, Ptr1 is next available; on rollback, next available is moved back.]
Take a snapshot of the register rename table at each predicted branch; recover the earlier snapshot if the branch is mispredicted.
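A minimal sketch of the snapshot scheme, assuming branches are identified by integer IDs that increase in program order. The class `RenameTable` and its method names are hypothetical and chosen only for this illustration.

```python
class RenameTable:
    """Rename table with one saved copy per unresolved predicted branch."""

    def __init__(self, mapping):
        self.map = dict(mapping)
        self.snapshots = {}  # branch id -> copy of the map taken at that branch

    def rename_dest(self, arch_reg, tag):
        self.map[arch_reg] = tag

    def predict_branch(self, branch_id):
        # Snapshot the table as it stands when the branch is renamed.
        self.snapshots[branch_id] = dict(self.map)

    def resolve(self, branch_id, mispredicted):
        saved = self.snapshots.pop(branch_id)
        if mispredicted:
            self.map = saved
            # Snapshots taken by younger branches belong to killed instructions.
            self.snapshots = {b: s for b, s in self.snapshots.items()
                              if b < branch_id}


table = RenameTable({"r1": "t1"})
table.predict_branch(0)
table.rename_dest("r1", "t2")
table.predict_branch(1)
table.rename_dest("r1", "t3")
table.resolve(0, mispredicted=True)
print(table.map)  # {'r1': 't1'}
```

Copying the whole table per branch is the simplest policy to show; it bounds the number of unresolved branches by the number of snapshot slots available.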

7 Data-in-ROB Design (HP PA8000, Intel Pentium Pro, Core2 Duo & Nehalem)
[Figure: register file, reorder buffer (fields: Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data; tags t1 to tn), load unit, functional units, store unit, and a commit path carrying <t, result>.]
- The register file holds only committed state.
- On dispatch into the ROB, ready sources can be in the regfile or in a ROB dest field (copied into src1/src2 if ready before dispatch).
- On completion, write to the dest field and broadcast to the src fields.
- On issue, read from the ROB src fields.

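8 Data Movement in Data-in-ROB Design
[Figure: architectural register file, reorder buffer, and functional units, with labelled transfers between them.]
- Architectural register file to ROB: read operands during decode, write sources into the ROB after decode.
- ROB to architectural register file: read results from the ROB at commit, write results to the register file at commit.
- ROB to functional units: read operands at issue.
- Functional units to ROB: write results at completion.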

9 Unified Physical Register File (MIPS R10K, Alpha 21264, Intel Pentium 4 & Sandy Bridge)
- Rename all architectural registers into a single physical register file during decode; no register values are read.
- Functional units read and write from a single unified register file holding committed and temporary registers in execute.
- Commit only updates the mapping of architectural register to physical register; there is no data movement.
[Figure: decode-stage register mapping and committed register mapping both point into the unified physical register file; functional units read operands at issue and write results at completion.]

10 Pipeline Design with Physical Regfile
[Figure: PC, Fetch, Decode & Rename, Reorder Buffer, Commit. The front end and commit are in-order; execute is out-of-order and contains the physical register file, branch unit, ALU, MEM, store buffer, and D$. Branch prediction feeds fetch; branch resolution sends kill signals to the earlier stages.]

11 Lifetime of Physical Registers
- The physical regfile holds committed and speculative values.
- Physical registers are decoupled from ROB entries (no data in ROB).

  Original code        After renaming
  ld  r1, (r3)         ld  P1, (Px)
  add r3, r1, #4       add P2, P1, #4
  sub r6, r7, r9       sub P3, Py, Pz
  add r3, r3, r6       add P4, P2, P3
  ld  r6, (r1)         ld  P5, (P1)
  add r6, r6, r3       add P6, P5, P4
  st  r6, (r1)         st  P6, (P1)
  ld  r6, (r11)        ld  P7, (Pw)

When can we reuse a physical register? When the next write of the same architectural register commits.

12-19 Physical Register Management (worked example, one step per slide)

Code sequence being renamed:
  ld  r1, 0(r3)
  add r3, r1, #4
  sub r6, r7, r6
  add r3, r3, r6
  ld  r6, 0(r1)

Slide 12, initial state: the rename table maps R1 to P8, R3 to P7, R6 to P5, and R7 to P6, so physical registers P5 to P8 hold <R6>, <R7>, <R3>, and <R1>. The free list is P0, P1, P3, P2, P4. Each ROB entry has the fields use, ex, op, p1, PR1, p2, PR2, Rd, LPRd, PRd. (LPRd requires a third read port on the rename table for each instruction.)

Slides 13 to 17, renaming: each instruction takes the next register from the free list as PRd and records the previous mapping of its destination as LPRd.

  Slide  Instruction       PR1  PR2  Rd  LPRd  PRd  Rename table after
  13     ld  r1, 0(r3)     P7   -    r1  P8    P0   R1 -> P0
  14     add r3, r1, #4    P0   -    r3  P7    P1   R3 -> P1
  15     sub r6, r7, r6    P6   P5   r6  P5    P3   R6 -> P3
  16     add r3, r3, r6    P1   P3   r3  P1    P2   R3 -> P2
  17     ld  r6, 0(r1)     P0   -    r6  P3    P4   R6 -> P4

Slide 18, execute and commit: the first ld executes and writes <R1> into P0. When it commits, its LPRd (P8) is returned to the free list.

Slide 19, execute and commit: the first add executes and writes <R3> into P1. When it commits, its LPRd (P7) is returned to the free list.
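The free-list bookkeeping in this walkthrough can be sketched in a few lines. The class `Renamer` and its dictionary-based ROB entries are assumptions for illustration; the initial rename table, free-list order, and instruction sequence are taken from the slides.

```python
from collections import deque


class Renamer:
    """Rename table + free list + ROB, tracking LPRd so registers can be freed."""

    def __init__(self, rename_table, free_list):
        self.table = dict(rename_table)
        self.free = deque(free_list)
        self.rob = deque()  # entries in program order, oldest first

    def rename(self, op, rd, srcs):
        psrcs = [self.table[s] for s in srcs]  # read sources before updating dest
        lprd = self.table[rd]                  # previous mapping of the destination
        prd = self.free.popleft()              # allocate a new physical register
        self.table[rd] = prd
        entry = {"op": op, "srcs": psrcs, "rd": rd, "lprd": lprd, "prd": prd}
        self.rob.append(entry)
        return entry

    def commit(self):
        # The oldest instruction commits; the register it superseded is now dead.
        entry = self.rob.popleft()
        self.free.append(entry["lprd"])
        return entry


r = Renamer({"r1": "P8", "r3": "P7", "r6": "P5", "r7": "P6"},
            ["P0", "P1", "P3", "P2", "P4"])
r.rename("ld", "r1", ["r3"])
r.rename("add", "r3", ["r1"])
r.rename("sub", "r6", ["r7", "r6"])
r.rename("add", "r3", ["r3", "r6"])
r.rename("ld", "r6", ["r1"])
r.commit()  # first ld commits, frees P8
r.commit()  # first add commits, frees P7
print(r.table, list(r.free))
```

Freeing LPRd only at commit is what makes the reuse rule safe: once the newer writer of the same architectural register has committed, no older instruction can still need the old value.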

20 Break

21 Separate Pending Instruction Window from ROB
- The instruction window holds instructions that have been decoded and renamed but not issued into execution. It has register tags and presence bits, and a pointer to the ROB entry.
- The reorder buffer is used to hold exception information for commit.
[Figure: instruction window with fields use, ex, op, p1, PR1, p2, PR2, PRd, ROB#; reorder buffer with fields Done?, Rd, LPRd, PC, Except?, and pointers Ptr2 (next to commit) and Ptr1 (next available).]
The ROB is usually several times larger than the instruction window. Why?

22 Reorder Buffer Holds Active Instructions (Decoded but not Committed)
[Figure: the instruction sequence below, ordered from older to newer instructions, shown at two points in time (both labelled "Cycle t" in the extracted text), with Commit, Execute, and Fetch pointers marking how far each stage has progressed.]
  ld  r1, (r3)
  add r3, r1, r2
  sub r6, r7, r9
  add r3, r3, r6
  ld  r6, (r1)
  add r6, r6, r3
  st  r6, (r1)
  ld  r6, (r1)

23 Superscalar Register Renaming
- During decode, instructions are allocated a new physical destination register.
- Source operands are renamed to the physical register with the newest value.
- The execution unit only sees physical register numbers.
[Figure: Inst 1 and Inst 2, each with fields Op, Dest, Src1, Src2. The source fields drive the rename table read addresses, the dest fields drive the write ports (update mapping), and new destinations come from the register free list. Outputs per instruction: Op, PDest, PSrc1, PSrc2.]
Does this work?

24 Superscalar Register Renaming
[Figure: same structure as the previous slide, with comparators (=?) between Inst 1's Dest and Inst 2's sources that select Inst 1's PDest in place of the rename table output.]
- Must check for RAW hazards between instructions issuing in the same cycle. This can be done in parallel with the rename lookup.
- MIPS R10K renames 4 serially-RAW-dependent insts/cycle.
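A sketch of renaming a group of instructions in one cycle, with the intra-group RAW check applied as an override on top of the table lookup. The function `rename_group` and the tuple layout `(dest, src1, src2)` are assumptions for illustration; hardware performs the comparisons in parallel rather than in loops.

```python
def rename_group(table, free_list, group):
    """Rename instructions given as (dest, src1, src2) architectural names.

    Returns a list of (pdest, psrc1, psrc2) and updates table and free_list.
    """
    old = dict(table)  # every lookup sees the table from before this cycle
    pdests = [free_list.pop(0) for _ in group]
    renamed = []
    for i, (dest, *srcs) in enumerate(group):
        psrcs = []
        for s in srcs:
            p = old[s]
            for j in range(i):       # the "=?" comparators against earlier dests
                if group[j][0] == s:
                    p = pdests[j]    # the latest earlier writer in the group wins
            psrcs.append(p)
        renamed.append((pdests[i], *psrcs))
    for (dest, *_), p in zip(group, pdests):
        table[dest] = p              # a later writer overwrites an earlier one
    return renamed


table = {"r1": "P1", "r2": "P2", "r3": "P3"}
free = ["P10", "P11"]
# Inst 2 reads r1, which Inst 1 writes in the same cycle.
print(rename_group(table, free, [("r1", "r2", "r3"), ("r3", "r1", "r2")]))
# [('P10', 'P2', 'P3'), ('P11', 'P10', 'P2')]
```

Without the override, Inst 2 would read the stale mapping P1 for r1, which is the failure the "Does this work?" question points at.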

25 Memory Dependencies
  st r1, (r2)
  ld r3, (r4)
When can we execute the load?

26 In-Order Memory Queue
- Execute all loads and stores in program order => a load or store cannot leave the ROB for execution until all previous loads and stores have completed execution.
- Can still execute loads and stores speculatively, and out-of-order with respect to other instructions.
- Need a structure to handle memory ordering.

27 Conservative O-o-O Load Execution
  st r1, (r2)
  ld r3, (r4)
- Split execution of the store instruction into two phases: address calculation and data write.
- Can execute the load before the store if the addresses are known and r4 != r2.
- Each load address is compared with the addresses of all previous uncommitted stores (can use a partial conservative check, e.g., the bottom 12 bits of the address).
- Don't execute the load if any previous store address is not known (MIPS R10K, 16-entry address queue).
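The conservative rule can be written as a small predicate. The function name `load_may_issue` and the use of `None` for a store address that has not been computed yet are assumptions for illustration.

```python
def load_may_issue(load_addr, older_store_addrs, check_bits=12):
    """Return True if the load may execute ahead of all older uncommitted stores.

    older_store_addrs holds one entry per older uncommitted store: an integer
    address, or None if that store's address is not yet known.
    """
    mask = (1 << check_bits) - 1
    for addr in older_store_addrs:
        if addr is None:
            return False  # unknown store address: must wait
        if (addr & mask) == (load_addr & mask):
            return False  # possible match on the low bits: must wait
    return True


print(load_may_issue(0x1008, [0x2000, 0x3004]))  # True
print(load_may_issue(0x1008, [0x2000, None]))    # False
print(load_may_issue(0x1008, [0x5008]))          # False (low 12 bits match)
```

The last call shows why the partial check is conservative: 0x1008 and 0x5008 are different addresses, but comparing only the bottom 12 bits treats them as a possible conflict, which costs performance and never correctness.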

28 Address Speculation
  st r1, (r2)
  ld r3, (r4)
- Guess that r4 != r2.
- Execute the load before the store address is known.
- Need to hold all completed but uncommitted load/store addresses in program order.
- If we subsequently find r4 == r2, squash the load and all following instructions => large penalty for inaccurate address speculation.

29 Memory Dependence Prediction (Alpha 21264)
  st r1, (r2)
  ld r3, (r4)
- Guess that r4 != r2 and execute the load before the store.
- If we later find r4 == r2, squash the load and all following instructions, but mark the load instruction as store-wait.
- Subsequent executions of the same load instruction will wait for all previous stores to complete.
- Periodically clear the store-wait bits.
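A minimal sketch of the store-wait idea, assuming the bits are kept as a set of load PCs. The class `StoreWaitPredictor` and its method names are hypothetical, and a set is a simplification of a fixed-size hardware table.

```python
class StoreWaitPredictor:
    """Tracks which load instructions must wait for all older stores."""

    def __init__(self):
        self.store_wait = set()  # PCs of loads marked store-wait

    def may_speculate(self, load_pc):
        # Unmarked loads are allowed to execute ahead of older stores.
        return load_pc not in self.store_wait

    def on_violation(self, load_pc):
        # The load ran early and an older store turned out to alias it.
        self.store_wait.add(load_pc)

    def periodic_clear(self):
        # Forget old decisions so that stale marks do not serialize loads forever.
        self.store_wait.clear()


pred = StoreWaitPredictor()
print(pred.may_speculate(0x400))  # True: first execution speculates
pred.on_violation(0x400)
print(pred.may_speculate(0x400))  # False: later executions wait
pred.periodic_clear()
print(pred.may_speculate(0x400))  # True again after the clear
```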

30 Speculative Loads / Stores
Just like register updates, stores should not modify memory until after the instruction is committed.
- A speculative store buffer is a structure introduced to hold speculative store data.

31 Speculative Store Buffer
[Figure: speculative store buffer whose entries each hold a valid bit (V), a speculative bit (S), a tag, and data, next to the L1 data cache with its tags and data. The load address is checked against both; a store commit path moves data from the buffer into the cache; load data comes from either.]
- On store execute: mark the entry valid and speculative, and save the data and tag of the instruction.
- On store commit: clear the speculative bit and eventually move the data to the cache.
- On store abort: clear the valid bit.

32 Speculative Store Buffer
[Figure: same speculative store buffer and L1 data cache as on the previous slide.]
- If the data is in both the store buffer and the cache, which should we use? The speculative store buffer.
- If the same address is in the store buffer twice, which should we use? The youngest store older than the load.
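The execute, commit, abort, and lookup rules from these two slides can be sketched as follows. The class `SpeculativeStoreBuffer`, the use of a program-order sequence number `seq` per instruction, and a dictionary standing in for the L1 data cache are all assumptions for illustration.

```python
class SpeculativeStoreBuffer:
    def __init__(self):
        self.entries = []

    def execute_store(self, seq, addr, data):
        # On store execute: mark the entry valid and speculative, save tag and data.
        self.entries.append({"seq": seq, "addr": addr, "data": data,
                             "valid": True, "spec": True})

    def _find(self, seq):
        return next(e for e in self.entries if e["seq"] == seq)

    def commit_store(self, seq):
        self._find(seq)["spec"] = False   # on commit: clear the speculative bit

    def abort_store(self, seq):
        self._find(seq)["valid"] = False  # on abort: clear the valid bit

    def drain(self, cache):
        # Eventually move committed data to the cache, oldest store first.
        for e in sorted(self.entries, key=lambda e: e["seq"]):
            if e["valid"] and not e["spec"]:
                cache[e["addr"]] = e["data"]
        self.entries = [e for e in self.entries if e["valid"] and e["spec"]]

    def load(self, seq, addr, cache):
        # Prefer the store buffer; among matches take the youngest older store.
        older = [e for e in self.entries
                 if e["valid"] and e["addr"] == addr and e["seq"] < seq]
        if older:
            return max(older, key=lambda e: e["seq"])["data"]
        return cache.get(addr)


cache = {0x100: 1}
sb = SpeculativeStoreBuffer()
sb.execute_store(1, 0x100, 10)
sb.execute_store(3, 0x100, 30)
sb.execute_store(5, 0x100, 50)
print(sb.load(4, 0x100, cache))  # 30: youngest store older than the load
```

A load with sequence number 4 ignores the store with sequence number 5 because that store is younger than the load in program order.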

33 Datapath: Branch Prediction and Speculative Execution
[Figure: PC, Fetch, Decode & Rename, Reorder Buffer, Commit, with the register file, branch unit, ALU, MEM, store buffer, and D$ in the execute stage. Branch prediction feeds fetch; branch resolution sends kill signals to the earlier stages.]

34 Multithreading
- It is difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control.
- Many workloads can make use of thread-level parallelism (TLP):
  - TLP from multiprogramming (run independent sequential jobs)
  - TLP from multithreaded applications (run one job faster using parallel threads)
- Multithreading uses TLP to improve utilization of a single processor.

35 Pipeline Hazards
  LW   r1, 0(r2)
  LW   r5, 12(r1)
  ADDI r5, r5, #12
  SW   12(r1), r5
[Figure: pipeline diagram over cycles t0 to t14. Each instruction passes through F, D, X, M, W; each dependent instruction repeats its decode (D) stage, and the one after it repeats its fetch (F) stage, while waiting for the previous instruction's result.]
- Each instruction may depend on the one before it.
- What is usually done to cope with this? Interlocks (slow) or bypassing (needs hardware, doesn't help all hazards).

36 Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
-- One way is to interleave execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

                        t0 t1 t2 t3 t4 t5 t6 t7 t8
  T1: LW   r1, 0(r2)    F  D  X  M  W
  T2: ADD  r7, r1, r4      F  D  X  M  W
  T3: XORI r5, r4, #12        F  D  X  M  W
  T4: SW   0(r7), r5             F  D  X  M  W
  T1: LW   r5, 12(r1)               F  D  X  M  W

The prior instruction in a thread always completes writeback before the next instruction in the same thread reads the register file.
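The claim can be checked with a small timing model. The sketch below assumes one instruction is fetched per cycle in strict round-robin order, that the register file is read in the D stage, and that a read must happen in a later cycle than the writeback it depends on; the function names are hypothetical.

```python
STAGES = ["F", "D", "X", "M", "W"]


def schedule(num_threads, num_instrs):
    """Return (thread, {stage: cycle}) for a fixed round-robin interleave."""
    out = []
    for i in range(num_instrs):
        thread = i % num_threads
        out.append((thread, {s: i + k for k, s in enumerate(STAGES)}))
    return out


def hazard_free(num_threads, num_instrs=20):
    """True if every register read (D) follows the same thread's last writeback (W)."""
    last_writeback = {}
    for thread, cycles in schedule(num_threads, num_instrs):
        if thread in last_writeback and cycles["D"] <= last_writeback[thread]:
            return False
        last_writeback[thread] = cycles["W"]
    return True


print(hazard_free(4))  # True: W at cycle t+4, next D of that thread at t+5
print(hazard_free(3))  # False under the assumptions above
```

With four threads, consecutive instructions of one thread are four cycles apart, which is just enough for the earlier one to reach W before the later one reaches D.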

37 CDC 6600 Peripheral Processors (Cray, 1964)
- First multithreaded hardware
- 10 virtual I/O processors
- Fixed interleave on a simple pipeline
- Pipeline has a 100 ns cycle time
- Each virtual processor executes one instruction every 1000 ns
- Accumulator-based instruction set to reduce processor state

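38 Simple Multithreaded Pipeline
[Figure: four PCs with a +1 incrementer, I$, IR, four copies of the general-purpose register file, ALU inputs X and Y, and D$. A 2-bit thread select chooses which PC and which register file are used.]
- Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage.
- Appears to software (including the OS) as multiple, albeit slower, CPUs.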

39 Multithreading Costs
- Each thread requires its own user state:
  - PC
  - GPRs
- Also, it needs its own system state:
  - virtual memory page table base register
  - exception handling registers
- Other overheads:
  - Additional cache/TLB conflicts from competing threads (or add larger cache/TLB capacity)
  - More OS overhead to schedule more threads (where do all these threads come from?)

40 Thread Scheduling Policies
- Fixed interleave (CDC 6600 PPUs, 1964)
  - Each of N threads executes one instruction every N cycles
  - If a thread is not ready to go in its slot, insert a pipeline bubble
- Software-controlled interleave (TI ASC PPUs, 1971)
  - OS allocates S pipeline slots amongst N threads
  - Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
- Hardware-controlled thread scheduling (HEP, 1982)
  - Hardware keeps track of which threads are ready to go
  - Picks the next thread to execute based on a hardware priority scheme

41 Denelcor HEP (Burton Smith, 1982)
- First commercial machine to use hardware threading in the main CPU
- 120 threads per processor
- 10 MHz clock rate
- Up to 8 processors
- Precursor to Tera MTA (Multithreaded Architecture)


42 Tera MTA (1990-)
- Up to 256 processors
- Up to 128 active threads per processor
- Processors and memory modules populate a sparse 3D torus interconnection fabric
- Flat, shared main memory
- No data cache
- Sustains one main memory access per cycle per processor
- GaAs logic in prototype, 260 MHz
- Second version in CMOS, MTA-2, 50 W/processor
- New version, XMT, fits into an AMD Opteron socket and runs at 500 MHz

43 MTA Pipeline
[Figure: instruction fetch feeds an issue pool; the pipeline has M, A, and C units followed by W stages; a write pool, memory pool, and retry pool connect through the interconnection network to the memory pipeline.]
- Every cycle, one VLIW instruction from one active thread is launched into the pipeline.
- The instruction pipeline is 21 cycles long.
- Memory operations incur ~150 cycles of latency.
Assuming a single thread issues one instruction every 21 cycles and the clock rate is 260 MHz, what is single-thread performance? The effective single-thread issue rate is 260/21 = 12.4 MIPS.
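The arithmetic on this slide, written out as a short calculation. The function name is hypothetical; the inputs (260 MHz, 21-cycle pipeline) come from the slide, and the result counts one instruction per issue slot.

```python
def single_thread_mips(clock_mhz, cycles_between_issues):
    """Instructions per microsecond for a thread that issues once per N cycles."""
    return clock_mhz / cycles_between_issues


rate = single_thread_mips(260, 21)
print(round(rate, 1))  # 12.4

# With one thread issuing every 21 cycles, 21 ready threads are needed
# before the pipeline can launch an instruction on every cycle.
print(round(21 * rate))  # 260
```

The same reasoning applies to the CDC 6600 peripheral processors: 10 threads sharing a pipeline with a 100 ns cycle each get one instruction every 1000 ns.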

44 Coarse-Grain Multithreading
- Tera MTA was designed for supercomputing applications with large data sets and low locality:
  - No data cache
  - Many parallel threads needed to hide large memory latency
- Other applications are more cache friendly:
  - Few pipeline bubbles if the cache mostly has hits
  - Just add a few threads to hide occasional cache miss latencies
  - Swap threads on cache misses

45 MIT Alewife (1990)
- Modified SPARC chips: register windows hold different thread contexts
- Up to four threads per node
- Thread switch on local cache miss

46 IBM PowerPC RS64-IV (2000)
- Commercial coarse-grain multithreading CPU
- Based on PowerPC with a quad-issue in-order five-stage pipeline
- Each physical CPU supports two virtual CPUs
- On an L2 cache miss, the pipeline is flushed and execution switches to the second thread:
  - the short pipeline minimizes the flush penalty (4 cycles), small compared to memory access latency
  - flushing the pipeline simplifies exception handling

47 Oracle/Sun Niagara processors
- Target is datacenters running web servers and databases, with many concurrent requests
- Provide multiple simple cores, each with multiple hardware threads: reduced energy/operation, though much lower single-thread performance
- Niagara-1 [2004], 8 cores, 4 threads/core
- Niagara-2 [2007], 8 cores, 8 threads/core
- Niagara-3 [2009], 16 cores, 8 threads/core

48 Oracle/Sun Niagara-3, Rainbow Falls

49 Simultaneous Multithreading (SMT) for OoO Superscalars
- Techniques presented so far have all been "vertical" multithreading, where each pipeline stage works on one thread at a time.
- SMT uses the fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on the same clock cycle. This gives better utilization of machine resources.

50 For most apps, most execution units lie idle in an OoO superscalar
[Figure: execution-unit utilization for an 8-way superscalar. From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism", ISCA.]

51 Superscalar Machine Efficiency
[Figure: grid of instruction issue slots, issue width across and time down.]
- Completely idle cycle (vertical waste)
- Partially filled cycle, i.e., IPC < 4 (horizontal waste)
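The two kinds of waste can be counted directly from a trace of how many instructions issued in each cycle. The function `issue_waste` and the sample trace are assumptions for illustration; the issue width of 4 follows the slide.

```python
def issue_waste(issued_per_cycle, issue_width):
    """Return (vertical_waste, horizontal_waste, total_slots) in issue slots."""
    # Vertical waste: every slot of a cycle in which nothing issued.
    vertical = sum(issue_width for n in issued_per_cycle if n == 0)
    # Horizontal waste: unused slots in cycles that issued at least one instruction.
    horizontal = sum(issue_width - n for n in issued_per_cycle if n > 0)
    total = issue_width * len(issued_per_cycle)
    return vertical, horizontal, total


trace = [4, 0, 2, 1, 0, 3]  # hypothetical instructions issued per cycle
print(issue_waste(trace, issue_width=4))  # (8, 6, 24)
```

In this trace 10 of the 24 slots do useful work; cycle-by-cycle interleaving of threads targets the 8 vertical slots, while the 6 horizontal slots need instructions from more than one thread in the same cycle.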

52 Vertical Multithreading
[Figure: grid of instruction issue slots, issue width across and time down, with a second thread interleaved cycle-by-cycle. Partially filled cycles, i.e., IPC < 4, remain (horizontal waste).]
What is the effect of cycle-by-cycle interleaving? It removes vertical waste, but leaves some horizontal waste.

53 Chip Multiprocessing (CMP)
[Figure: issue slots over time, with the issue width split between processors.]
What is the effect of splitting into multiple processors? It reduces horizontal waste, leaves some vertical waste, and puts an upper limit on the peak throughput of each thread.

54 Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]
[Figure: issue slots over time, issue width across.]
Interleave multiple threads to multiple issue slots with no restrictions.

55 O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]
- Add multiple contexts and fetch engines, and allow instructions fetched from different threads to issue simultaneously.
- Utilize the wide out-of-order superscalar processor issue queue to find instructions to issue from multiple threads.
- The OOO instruction window already has most of the circuitry required to schedule from multiple threads.
- Any single thread can utilize the whole machine.

56 IBM Power 4
- Single-threaded predecessor to Power 5.
- 8 execution units in the out-of-order engine; each may issue an instruction each cycle.

57 Power 4 vs. Power 5
[Figure: Power 4 and Power 5 pipelines side by side. Power 5 has 2 fetch (PC) paths, 2 initial decodes, and 2 commits (architected register sets).]

58 Power 5 data flow
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.

59 Changes in Power 5 to support SMT
- Increased associativity of the L1 instruction cache and the instruction address translation buffers
- Added per-thread load and store queues
- Increased size of the L2 (1.92 vs MB) and L3 caches
- Added separate instruction prefetch and buffering per thread
- Increased the number of virtual registers from 152 to 240
- Increased the size of several issue queues
- The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

60 Pentium-4 Hyperthreading (2002)
- First commercial SMT design (2-way SMT); Hyperthreading == SMT
- Logical processors share nearly all resources of the physical processor: caches, execution units, branch predictors
- Die area overhead of hyperthreading ~5%
- When one logical processor is stalled, the other can make progress
  - No logical processor can use all entries in the queues when two threads are active
- A processor running only one active software thread runs at approximately the same speed with or without hyperthreading
- Hyperthreading was dropped on the OoO P6-based follow-ons to Pentium-4 (Pentium-M, Core Duo, Core 2 Duo), until revived with Nehalem generation machines
- Intel Atom (in-order x86 core) has two-way vertical multithreading

61 Initial Performance of SMT
- Pentium 4 Extreme SMT yields 1.01 speedup for the SPECint_rate benchmark and 1.07 for SPECfp_rate
  - Pentium 4 is dual-threaded SMT
  - SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
- Running on Pentium 4, each of 26 SPEC benchmarks paired with every other (26^2 runs): speed-ups from 0.90 to 1.58; average was 1.20
- Power 5, 8-processor server: 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
- Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
  - Most gained some
  - Fl.Pt. apps had the most cache conflicts and the least gains

62 SMT adaptation to parallelism type
[Figure: two grids of issue slots, issue width across and time down.]
- For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads.
- For regions with low thread-level parallelism (TLP), the entire machine width is available for instruction-level parallelism (ILP).

63 Icount Choosing Policy
Fetch from the thread with the least instructions in flight.
Why does this enhance throughput?
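The selection rule itself is a one-line minimum. The function `icount_pick`, the dictionary of per-thread counts, and the tie-break toward the lower thread ID are assumptions for illustration.

```python
def icount_pick(in_flight):
    """Pick the thread to fetch from: fewest instructions in flight.

    in_flight maps thread id -> number of instructions currently between
    fetch and issue. Ties go to the lower thread id (an arbitrary choice).
    """
    return min(in_flight, key=lambda t: (in_flight[t], t))


print(icount_pick({0: 12, 1: 3, 2: 7}))  # 1
```

A thread whose instructions are piling up in the queues is likely stalled, so steering fetch away from it keeps the shared queues from filling with work that cannot issue.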

64 Summary: Multithreaded Categories
[Figure: issue slots over time (processor cycles) for five organizations: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot.]

65 Acknowledgements
These slides contain material developed and copyright by: UCB, John Kubiatowicz (UCB), David Patterson (UCB).
UCB material derived from courses CS252 and CS152.
