Advanced Superscalar Architectures

1 Advanced Superscalar Architectures
Krste Asanovic
Laboratory for Computer Science, Massachusetts Institute of Technology

2 Physical Register Renaming (single physical register file: MIPS R10K, Alpha 21264, Pentium-4)
During decode, instructions are allocated a new physical destination register.
Source operands are renamed to the physical register holding the newest value.
The execution unit only sees physical register numbers.

    Original code            Renamed code
    ld  r1, (r3)       ->    ld  P1, (Px)
    add r3, r1, #4     ->    add P2, P1, #4
    sub r6, r7, r9     ->    sub P3, Py, Pz
    add r3, r3, r6     ->    add P4, P2, P3
    ld  r6, (r1)       ->    ld  P5, (P1)
    add r6, r6, r3     ->    add P6, P5, P4
    st  r6, (r1)       ->    st  P6, (P1)
    ld  r6, (r11)      ->    ld  P7, (Pw)
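The renaming rule on this slide can be sketched in a few lines of Python. This is an illustrative model only: the `rename` function, the tuple encoding of instructions, and the sequential P1, P2, ... allocation order are assumptions of the sketch, not something the slides specify.

```python
from itertools import count

def rename(program, initial_map):
    """Rename architectural registers to fresh physical registers.

    program:     list of (op, dest, srcs); dest is None for stores.
    initial_map: architectural -> physical names live before the sequence
                 (hypothetical names Px, Py, ... as on the slide).
    """
    table = dict(initial_map)
    fresh = (f"P{n}" for n in count(1))   # assumed allocation order P1, P2, ...
    renamed = []
    for op, dest, srcs in program:
        # Sources are looked up before the destination is remapped, so an
        # instruction like "add r3, r3, r6" reads the old mapping of r3.
        psrcs = [table.get(s, s) for s in srcs]   # immediates pass through
        pdest = None
        if dest is not None:
            pdest = next(fresh)
            table[dest] = pdest
        renamed.append((op, pdest, psrcs))
    return renamed

program = [
    ("ld",  "r1", ["r3"]),
    ("add", "r3", ["r1", "#4"]),
    ("sub", "r6", ["r7", "r9"]),
    ("add", "r3", ["r3", "r6"]),
    ("ld",  "r6", ["r1"]),
    ("add", "r6", ["r6", "r3"]),
    ("st",  None, ["r6", "r1"]),
    ("ld",  "r6", ["r11"]),
]
initial = {"r3": "Px", "r7": "Py", "r9": "Pz", "r11": "Pw"}
renamed = rename(program, initial)
```

Every write gets its own physical register, so the three writes to r6 no longer conflict and only true (RAW) dependencies remain.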

3 Physical Register File
[Figure: a rename table maps architectural registers (r1, r2) to physical tags (ti, tj), with snapshots kept for mispredict recovery. The register file (t1, t2, ..., tn) feeds the load unit, the functional units (FU), and the store unit, which write back <t, result> pairs. ROB not shown.]
One register file holds both committed and speculative values (no data in the ROB).
During decode, the instruction result is allocated a new physical register, and source registers are translated to physical registers through the rename table.
An instruction reads data from the register file at the start of execute (not in decode).
Write-back updates the register busy bits on instructions in the ROB (associative search).
Snapshots of the rename table are taken at every branch to recover from mispredicts.
On an exception, renaming is undone in reverse order of issue (MIPS R10000).
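Snapshot-based recovery can be illustrated with a small model. The class and method names below are hypothetical, branch tags are arbitrary labels, and the sketch leaves out restoring the free list, which real hardware must also do.

```python
class RenameTable:
    """Toy rename table with one snapshot per in-flight branch (a sketch)."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)
        self.snapshots = {}               # branch tag -> saved mapping

    def rename_dest(self, arch_reg, phys_reg):
        self.mapping[arch_reg] = phys_reg

    def take_snapshot(self, branch_tag):
        # Taken when a branch is renamed, before any younger instruction.
        self.snapshots[branch_tag] = dict(self.mapping)

    def resolve(self, branch_tag, mispredicted):
        tags = list(self.snapshots)       # dicts preserve insertion order
        idx = tags.index(branch_tag)
        saved = self.snapshots.pop(branch_tag)
        if mispredicted:
            # Restore the mapping seen by the branch and discard snapshots
            # of younger branches, which were on the wrong path.
            self.mapping = saved
            for younger in tags[idx + 1:]:
                del self.snapshots[younger]

rt = RenameTable({"r1": "P1", "r3": "P2"})
rt.take_snapshot("b0")
rt.rename_dest("r3", "P4")      # speculative rename after branch b0
rt.take_snapshot("b1")
rt.rename_dest("r1", "P5")      # speculative rename after branch b1
rt.resolve("b0", mispredicted=True)
```

After the mispredict on b0, the table is back to the mapping that was live when b0 was renamed, in one step rather than by walking the ROB.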

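4 Speculative and Out-of-Order Execution
[Figure: pipeline PC, Fetch, Decode & Rename, Reorder Buffer, Commit. Fetch through rename and commit are in-order; the execute stage (Branch Unit, ALU, MEM with Store Buffer and D$), reading the Physical Reg. File, is out-of-order. Branch Prediction steers the PC; Branch Resolution sends kill signals back to the earlier stages and updates the predictors.]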

5 Lifetime of Physical Registers
The physical register file holds committed and speculative values.
Physical registers are decoupled from ROB entries (no data in the ROB).

    Original code            Renamed code
    ld  r1, (r3)       ->    ld  P1, (Px)
    add r3, r1, #4     ->    add P2, P1, #4
    sub r6, r7, r9     ->    sub P3, Py, Pz
    add r3, r3, r6     ->    add P4, P2, P3
    ld  r6, (r1)       ->    ld  P5, (P1)
    add r6, r6, r3     ->    add P6, P5, P4
    st  r6, (r1)       ->    st  P6, (P1)
    ld  r6, (r11)      ->    ld  P7, (Pw)

When can we reuse a physical register?
When the next write of the same architectural register commits.

6-13 Physical Register Management (animation, one frame per slide)

Instruction sequence:

    ld  r1, 0(r3)
    add r3, r1, #4
    sub r6, r7, r6
    add r3, r3, r6
    ld  r6, 0(r1)

[Figure: rename table (R0 to R7), physical register file, free list, and ROB.]
Initial state: the rename table maps R3 to P7, R6 to P5, and R7 to P6; the physical register file holds <R6> in P5, <R7> in P6, <R3> in P7, and <R1>; unallocated registers sit on the free list.
ROB entry fields: use, ex, op, p1, PR1, p2, PR2, Rd, LPRd, PRd. PRd is the newly allocated physical destination; LPRd is the physical register previously mapped to Rd.
(LPRd requires a third read port on the rename table for each instruction.)

Frames:
    Slide 6:  initial state, ROB empty.
    Slide 7:  ld r1, 0(r3) enters the ROB (PR1 = P7, Rd = r1).
    Slide 8:  add r3, r1, #4 enters the ROB (Rd = r3, LPRd = P7).
    Slide 9:  sub r6, r7, r6 enters the ROB (PR1 = P6, PR2 = P5, Rd = r6, LPRd = P5).
    Slide 10: add r3, r3, r6 enters the ROB (Rd = r3).
    Slide 11: ld r6, 0(r1) enters the ROB (Rd = r6).
    Slide 12: the first ld executes and commits; the new <R1> value is in the register file.
    Slide 13: the first add executes and commits; the new <R3> value is in the register file, and P7, the previous mapping of r3, returns to the free list.
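The bookkeeping in these frames (allocate PRd from the free list at rename, remember LPRd, free LPRd at commit) can be sketched as follows. The `Renamer` class is hypothetical, and the register names P8 and P0 to P4 used for the initial R1 mapping and the free list are assumptions of this sketch; only P5, P6, and P7 are taken from the slides.

```python
from collections import deque

class Renamer:
    """Toy model of rename table + free list + ROB (illustrative only)."""

    def __init__(self, rename_table, free_regs):
        self.table = dict(rename_table)
        self.free = deque(free_regs)
        self.rob = deque()                # entries in program order

    def dispatch(self, op, rd, srcs):
        psrcs = [self.table[s] for s in srcs]   # two source lookups
        lprd = self.table[rd]             # third read: previous mapping of rd
        prd = self.free.popleft()         # new physical destination
        self.table[rd] = prd
        self.rob.append({"op": op, "srcs": psrcs, "rd": rd,
                         "lprd": lprd, "prd": prd})

    def commit_oldest(self):
        # Once this write commits, no older instruction can still need the
        # previous value of rd, so the old physical register is recycled.
        entry = self.rob.popleft()
        self.free.append(entry["lprd"])
        return entry

r = Renamer({"r1": "P8", "r3": "P7", "r6": "P5", "r7": "P6"},
            ["P0", "P1", "P2", "P3", "P4"])
r.dispatch("ld",  "r1", ["r3"])
r.dispatch("add", "r3", ["r1"])           # the #4 immediate needs no lookup
r.dispatch("sub", "r6", ["r7", "r6"])
r.dispatch("add", "r3", ["r3", "r6"])
r.dispatch("ld",  "r6", ["r1"])
sub_entry = dict(r.rob[2])
r.commit_oldest()                         # ld commits
r.commit_oldest()                         # add commits, P7 is freed
```

Freeing LPRd rather than PRd is the key point: the committed value of r3 now lives in the add's PRd, and P7 only becomes reusable because a newer write of r3 has committed.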

14 Reorder Buffer Holds Active Instruction Window
[Figure: the instruction window at cycle t and at cycle t + 1, ordered from older instructions (top) to newer instructions (bottom), with markers for the commit, execute, and fetch positions in each cycle.]

    ld  r1, (r3)
    add r3, r1, r2
    sub r6, r7, r9
    add r3, r3, r6
    ld  r6, (r1)
    add r6, r6, r3
    st  r6, (r1)
    ld  r6, (r1)

15 Superscalar Register Renaming
During decode, instructions are allocated a new physical destination register.
Source operands are renamed to the physical register holding the newest value.
The execution unit only sees physical register numbers.
[Figure: Inst 1 and Inst 2 each supply Op, Dest, Src1, Src2. Their register fields drive the read addresses of the rename table; the register free list supplies new destinations through the update-mapping write ports; the rename table read data yields Op, PDest, PSrc1, PSrc2 for each instruction.]
Does this work?

16 Superscalar Register Renaming
[Figure: same datapath as the previous slide, with comparators (=?) between the destination of Inst 1 and the sources of Inst 2 selecting between the rename table read data and the newly allocated register from the free list.]
Must check for RAW hazards between instructions issuing in the same cycle.
This can be done in parallel with the rename lookup.
(MIPS R10K renames 4 serially-RAW-dependent insts/cycle)
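A rough software model of the same-cycle RAW check follows. The function name, the three-operand tuple format, and the list-based free list are assumptions made for illustration; hardware does the table reads and comparisons in parallel rather than in loops.

```python
def rename_group(table, free_list, group):
    """Rename a group of instructions decoded in the same cycle.

    group: list of (dest, src1, src2) architectural register names.
    Returns a list of (pdest, psrc1, psrc2).
    """
    # All rename-table reads see the table as it was at the start of the
    # cycle, exactly as if the lookups happened in parallel.
    looked_up = [(table[s1], table[s2]) for _, s1, s2 in group]
    new_dests = [free_list.pop(0) for _ in group]

    out = []
    for i, (_, s1, s2) in enumerate(group):
        p1, p2 = looked_up[i]
        # Comparators (=?): if an earlier instruction in the group writes
        # one of our sources, use its new register instead of the stale
        # table value. Later matches override, so the youngest writer wins.
        for j in range(i):
            if group[j][0] == s1:
                p1 = new_dests[j]
            if group[j][0] == s2:
                p2 = new_dests[j]
        out.append((new_dests[i], p1, p2))

    # Table update: for repeated destinations the last writer wins.
    for (dest, _, _), pdest in zip(group, new_dests):
        table[dest] = pdest
    return out

table = {"r1": "P1", "r2": "P2", "r3": "P3"}
free_list = ["P4", "P5"]
# The second instruction reads r1, which the first one writes.
result = rename_group(table, free_list, [("r1", "r2", "r3"),
                                         ("r3", "r1", "r2")])
```

Without the comparators the second instruction would read the stale mapping P1 for r1, which answers the "Does this work?" question on the previous slide.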

17 Memory Dependencies

    st r1, (r2)
    ld r3, (r4)

When can we execute the load?

18 Speculative Loads / Stores
Just like register updates, stores should not modify memory until after the instruction is committed:
    a store buffer entry must carry a speculation bit and the tag of the corresponding store instruction
    if the instruction is committed, the speculation bit of the corresponding store buffer entry is cleared, and the store is written to the cache
    if the instruction is killed, the corresponding store buffer entry is freed
Loads work normally, but older store buffer entries need to be searched before accessing memory or the cache.

19 Load Path
[Figure: the load address is compared against the tags of every speculative store buffer entry (each entry holds V, S, Tag, and Data) and against the L1 data cache tags; stores leave the buffer for the cache along the store commit path.]
A hit in the speculative store buffer has priority over a hit in the data cache.
A hit to a newer store has priority over hits to older stores in the speculative store buffer.
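The store buffer rules can be modelled in a few lines. This is a sketch under simplifying assumptions: the `StoreBuffer` class is hypothetical, the cache is a plain dict, addresses match exactly, and every buffered store is assumed to be older than the load being serviced.

```python
class StoreBuffer:
    """Toy speculative store buffer (illustrative only)."""

    def __init__(self):
        self.entries = []                 # oldest first

    def add_store(self, tag, addr, data):
        self.entries.append({"tag": tag, "addr": addr,
                             "data": data, "spec": True})

    def commit(self, tag, cache):
        # Clear the speculation bit, then drain committed stores from the
        # head of the buffer into the cache in program order.
        for e in self.entries:
            if e["tag"] == tag:
                e["spec"] = False
        while self.entries and not self.entries[0]["spec"]:
            e = self.entries.pop(0)
            cache[e["addr"]] = e["data"]

    def kill(self, tag):
        # A killed store never reaches memory; its entry is simply freed.
        self.entries = [e for e in self.entries if e["tag"] != tag]

    def load(self, addr, cache):
        # Search newest to oldest: a buffer hit beats the cache, and a
        # newer store beats an older store to the same address.
        for e in reversed(self.entries):
            if e["addr"] == addr:
                return e["data"]
        return cache.get(addr)

cache = {0x10: 1}
sb = StoreBuffer()
sb.add_store(tag=1, addr=0x10, data=5)
sb.add_store(tag=2, addr=0x10, data=7)
newest = sb.load(0x10, cache)             # forwarded from store 2
sb.kill(2)
after_kill = sb.load(0x10, cache)         # forwarded from store 1
sb.commit(1, cache)
```

Until store 1 commits, the cache still holds the old value, so a mispredict can discard the buffered stores without any memory state to undo.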

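20 Datapath: Branch Prediction and Speculative Execution
[Figure: pipeline PC, Fetch, Decode & Rename, Reorder Buffer, Commit, with the execute stage (Branch Unit, ALU, MEM with Store Buffer and D$) reading the Reg. File. Branch Prediction steers the PC; Branch Resolution sends kill signals back to the earlier stages and updates the predictors.]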

21 In-Order Memory Queue
Execute all loads and stores in program order
=> A load or store cannot leave the ROB for execution until all previous loads and stores have completed execution.
Can still execute loads and stores speculatively, and out-of-order with respect to other instructions.
Stores are held in the store buffer until commit.

22 Conservative Out-of-Order Load Execution

    st r1, (r2)
    ld r3, (r4)

Split execution of the store instruction into two phases: address calculation and data write.
Can execute the load before the store if the addresses are known and r4 != r2.
Each load address is compared with the addresses of all previous uncommitted stores (can use a partial conservative check, i.e., the bottom 12 bits of the address).
Don't execute the load if any previous store address is not known.
(MIPS R10K, 16-entry address queue)
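The issue condition for a load under this scheme can be written as a small predicate. The function name and the use of `None` for a store whose address has not been computed yet are conventions of this sketch.

```python
LOW_BITS_MASK = 0xFFF   # partial check on the bottom 12 bits of the address

def load_may_issue(load_addr, older_store_addrs):
    """Return True if the load can safely execute before older stores.

    older_store_addrs: addresses of all previous uncommitted stores,
    with None standing for an address that is not yet known.
    """
    for st_addr in older_store_addrs:
        if st_addr is None:
            return False        # unknown address: must wait
        if (st_addr & LOW_BITS_MASK) == (load_addr & LOW_BITS_MASK):
            return False        # possible alias: must wait
    return True

independent = load_may_issue(0x1234, [0x5678])
unknown = load_may_issue(0x1234, [0x5678, None])
false_alias = load_may_issue(0x1234, [0x9234])   # differs only above bit 11
```

The partial compare is conservative: it may delay a load that does not really alias (as in the last call), but it never lets a truly dependent load go early.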

23 Address Speculation

    st r1, (r2)
    ld r3, (r4)

Guess that r4 != r2.
Execute the load before the store address is known.
Need to hold all completed but uncommitted load/store addresses in program order.
If we subsequently find r4 == r2, squash the load and all following instructions
=> Large penalty for inaccurate address speculation.

24 Memory Dependence Prediction (Alpha 21264)

    st r1, (r2)
    ld r3, (r4)

Guess that r4 != r2 and execute the load before the store.
If we later find r4 == r2, squash the load and all following instructions, but mark the load instruction as store-wait.
Subsequent executions of the same load instruction will wait for all previous stores to complete.
Periodically clear the store-wait bits.
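The store-wait policy amounts to a one-bit predictor per load. In the sketch below a Python set of load PCs stands in for the hardware bit table; the class and method names are hypothetical, and how often the bits are cleared is left to the caller.

```python
class StoreWaitPredictor:
    """One store-wait bit per load PC (a set stands in for the bit table)."""

    def __init__(self):
        self.store_wait = set()

    def may_bypass_stores(self, load_pc):
        # Unmarked loads speculate that they do not depend on older stores.
        return load_pc not in self.store_wait

    def report_violation(self, load_pc):
        # Called when a speculated load turned out to alias an older store
        # and had to be squashed.
        self.store_wait.add(load_pc)

    def periodic_clear(self):
        # Clearing lets loads whose dependence has gone away speculate again.
        self.store_wait.clear()

pred = StoreWaitPredictor()
first_try = pred.may_bypass_stores(0x400)    # speculate
pred.report_violation(0x400)                 # r4 == r2 was found later
second_try = pred.may_bypass_stores(0x400)   # now waits for older stores
pred.periodic_clear()
after_clear = pred.may_bypass_stores(0x400)
```

Compared with pure address speculation, the squash penalty is paid once per offending load rather than on every execution.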

25 Improving Instruction Fetch
Performance of speculative out-of-order machines is often limited by instruction fetch bandwidth:
    speculative execution can fetch 2-3x more instructions than are committed
    mispredict penalties are dominated by the time to refill the instruction window
    taken branches are particularly troublesome

26 Increasing Taken Branch Bandwidth (Alpha I-Cache)
[Figure: fast fetch path. The PC reads a cache line holding 4 cached instructions plus line-predict and way-predict fields; the tags of way 0 and way 1 are compared (=?) to produce hit/miss/way. PC generation, branch prediction, instruction decode, and validity checks sit outside the fast fetch loop.]
Fold 2-way tags and BTB into the predicted next block.
Take tag checks, instruction decode, and branch prediction out of the loop.
Raw RAM speed on the critical loop (1 cycle at ~1 GHz).
A 2-bit hysteresis counter per block prevents overtraining.

27 Tournament Branch Predictor (Alpha 21264)
[Figure: the PC indexes a local history table (1,024 x 10b), whose output indexes the local prediction table (1,024 x 3b). The global history (12b) indexes the global prediction table (4,096 x 2b) and the choice prediction table (4,096 x 2b), which selects the final prediction.]
The choice predictor learns whether it is best to use local or global branch history in predicting the next branch.
Global history is speculatively updated but restored on a mispredict.
Claim % success on a range of applications.
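A simplified software model of the tournament scheme is sketched below, using the table sizes from the slide. Several details are assumptions of the sketch rather than the 21264 design: indexing the local history table with `pc % 1024`, the initial counter values, training the chooser only when the two predictors disagree, and updating histories at resolution time instead of speculatively.

```python
def _sat(value, up, maximum):
    """Saturating counter step."""
    return min(value + 1, maximum) if up else max(value - 1, 0)

class TournamentPredictor:
    def __init__(self):
        self.local_hist = [0] * 1024    # 10-bit history per entry
        self.local_ctr = [3] * 1024     # 3-bit counters, taken if >= 4
        self.global_ctr = [1] * 4096    # 2-bit counters, taken if >= 2
        self.choice = [1] * 4096        # 2-bit counters, >= 2 means global
        self.ghist = 0                  # 12-bit global history

    def _components(self, pc):
        lh = self.local_hist[pc % 1024]
        local = self.local_ctr[lh] >= 4
        glob = self.global_ctr[self.ghist] >= 2
        return lh, local, glob

    def predict(self, pc):
        _, local, glob = self._components(pc)
        return glob if self.choice[self.ghist] >= 2 else local

    def update(self, pc, taken):
        lh, local, glob = self._components(pc)
        g = self.ghist
        if local != glob:
            # Move the chooser toward whichever predictor was right.
            self.choice[g] = _sat(self.choice[g], glob == taken, 3)
        self.local_ctr[lh] = _sat(self.local_ctr[lh], taken, 7)
        self.global_ctr[g] = _sat(self.global_ctr[g], taken, 3)
        self.local_hist[pc % 1024] = ((lh << 1) | int(taken)) & 0x3FF
        self.ghist = ((g << 1) | int(taken)) & 0xFFF

bp = TournamentPredictor()
cold = bp.predict(0x40)                 # untrained prediction
for _ in range(40):
    bp.update(0x40, True)               # train on an always-taken branch
warm = bp.predict(0x40)
```

The two component predictors capture different behaviour (per-branch patterns versus correlation across branches), and the chooser picks between them per global-history context.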

28 Taken Branch Limit
Integer codes have a taken branch every 6-9 instructions.
To avoid a fetch bottleneck, must execute multiple taken branches per cycle when increasing performance.
This implies:
    predicting multiple branches per cycle
    fetching multiple non-contiguous blocks per cycle

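29 Branch Address Cache (Yeh, Marr, Patt)
[Figure: each entry holds Entry PC, Valid, predicted target #1, len, and predicted target #2. The PC (k bits) is compared (=) with the entry PC to produce match, valid, target #1, len #1, and target #2.]
Extend the BTB to return multiple branch predictions per cycle.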

30 Fetching Multiple Basic Blocks
Requires either
    a multiported cache: expensive
    interleaving: bank conflicts will occur
Merging multiple blocks to feed to the decoders adds latency, increasing the mispredict penalty and reducing branch throughput.

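31 Trace Cache
Key idea: pack multiple non-contiguous basic blocks into one contiguous trace cache line.
[Figure: basic blocks, each ending in a branch (BR), laid out non-contiguously versus packed back to back in a single trace cache line.]
A single fetch brings in multiple basic blocks.
The trace cache is indexed by start address and the next n branch predictions.
Used in the Intel Pentium-4 processor to hold decoded uops.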
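The indexing scheme can be illustrated with a dictionary keyed by the start address and the predicted branch outcomes. The `TraceCache` class, its unbounded capacity, and the string instruction names are all hypothetical simplifications.

```python
class TraceCache:
    """Toy trace cache keyed by (start PC, predicted branch outcomes)."""

    def __init__(self):
        self.lines = {}

    def fill(self, start_pc, outcomes, instructions):
        # Record the dynamic instruction sequence that followed start_pc
        # when the next branches resolved as given in `outcomes`.
        self.lines[(start_pc, tuple(outcomes))] = list(instructions)

    def fetch(self, start_pc, predictions):
        # Hit only if both the start address and the predicted path match.
        return self.lines.get((start_pc, tuple(predictions)))

tc = TraceCache()
# Hypothetical trace: block A (branch taken), block C (branch not taken),
# then block D, stored contiguously in one line.
tc.fill(0x100, [True, False], ["A1", "A2", "brA", "C1", "brC", "D1"])
hit = tc.fetch(0x100, [True, False])
miss = tc.fetch(0x100, [False, False])
```

A hit delivers instructions from several basic blocks in one access, which is what removes the one-taken-branch-per-cycle fetch limit; on a miss the front end falls back to the ordinary instruction cache.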

32 MIPS R10000 (1995)
0.35µm CMOS, 4 metal layers
Four instructions per cycle
Out-of-order execution
Register renaming
Speculative execution past 4 branches
On-chip 32KB/32KB split I/D cache, 2-way set-associative
Off-chip L2 cache
Non-blocking caches
Compared with a simple 5-stage pipeline (R5K series):
    ~1.6x performance on SPECint95
    ~5x CPU logic area
    ~10x design effort


6.823, L16--1 Advanced Superscalar Architectures, Asanovic, Laboratory for Computer Science, M.I.T. http://www.csg.lcs.mit.edu/6.823