ENGN1640: Design of Computing Systems Topic 05: Pipeline Processor Design

1 ENGN1640: Design of Computing Systems Topic 05: Pipeline Processor Design. Professor Sherief Reda, http://scale.engin.brown. Electrical Sciences and Computer Engineering, School of Engineering, Brown University, Spring 2016 [material from Patterson & Hennessy and Harris]

2 Pipelining analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: speedup = 16/7 ≈ 2.3. Non-stop: ideal speedup = 4n/(n + 3) ≈ 4 = number of stages.
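The laundry numbers can be checked with a few lines of arithmetic. This is a minimal sketch that assumes four equal-length stages and measures time in units of one stage; the function name `laundry_speedup` is made up for illustration.

```python
def laundry_speedup(loads: int, stages: int = 4) -> float:
    """Speedup of pipelined over sequential execution, in stage-time units."""
    sequential = loads * stages        # every load runs all stages on its own
    pipelined = stages + (loads - 1)   # first load fills the pipe, then one load finishes per step
    return sequential / pipelined

print(round(laundry_speedup(4), 2))     # four loads: 16/7
print(round(laundry_speedup(1000), 2))  # many loads: approaches the number of stages
```

With four loads the ratio is 16/7; as the number of loads grows, the ratio 4n/(n + 3) approaches 4 but never reaches it.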

3 Pipelined ARM processor. Temporal parallelism: divide the single-cycle processor into 5 stages (Fetch, Decode, Execute, Memory, Writeback) and add pipeline registers between stages.

4 Single-cycle vs. pipelined [Figure: timing diagram, time in ps, of three instructions on the single-cycle processor and on the pipelined processor; each instruction passes through Fetch Instruction, Dec Read Reg, Execute ALU, Memory Read/Write, and Wr Reg.]

5 Pipeline datapath abstraction [Figure: pipeline diagram, time in cycles, for the sequence LDR R2, [R, #4]; ADD R3, R9, R; SUB R4, R, R5; AND R5, R2, R3; STR R6, [R, #2]; ORR R7, R, #42. Each instruction passes through instruction memory, register file read, ALU, data memory (DM), and register file write.]

6 Single-cycle & pipelined datapath [Figure: the single-cycle datapath and its pipelined version with Fetch, Decode, Execute, Memory, and Writeback stages.] WA3 must arrive at the same time as Result. The register file is written on the falling edge of the clock.

7 Optimized pipeline datapath [Figure: pipelined datapath before and after the optimization.] Remove the PCPlus8 adder by using PCPlus4F after the PC has been updated to PC+4. Assumes register file writing happens (e.g., in the first half of the clock cycle) before reading.

8 Tracing LDR in its journey: 1st cycle, LDR fetch [Figure: pipelined datapath.]

9 Tracing LDR in its journey: 2nd cycle, LDR decode [Figure: pipelined datapath.]

10 Tracing LDR in its journey: 3rd cycle, LDR execute [Figure: pipelined datapath.]

11 Tracing LDR in its journey: 4th cycle, LDR memory [Figure: pipelined datapath.]

12 Tracing LDR in its journey: 5th cycle, LDR writeback [Figure: pipelined datapath.]

13 Pipeline performance. Assume the stage time is 100 ps for register read or write and 200 ps for other stages. Compare the pipelined datapath with the single-cycle datapath.

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
LDR       200 ps       100 ps         200 ps  200 ps         100 ps          800 ps
STR       200 ps       100 ps         200 ps  200 ps                         700 ps
data op   200 ps       100 ps         200 ps                 100 ps          600 ps
Branch    200 ps       100 ps         200 ps                                 500 ps

14 Single-cycle versus pipeline performance [Figure: three consecutive LDR instructions executed on the single-cycle processor (T_c = 800 ps) and on the pipelined processor (T_c = 200 ps).]
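The comparison can be reproduced numerically. The sketch below assumes stage delays of 200 ps and 100 ps (illustrative values, labeled as assumptions in the code) and a five-stage pipeline whose clock period is set by the slowest stage; the function names are made up for illustration.

```python
# Stage delays in picoseconds. These values are illustrative assumptions.
STAGE_PS = {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200, "reg_write": 100}

def single_cycle_total(n_instr: int) -> int:
    """Each instruction takes one long cycle that covers every stage."""
    return n_instr * sum(STAGE_PS.values())

def pipelined_total(n_instr: int) -> int:
    """The clock period equals the slowest stage; the pipeline needs
    (stages - 1) cycles to fill, then completes one instruction per cycle."""
    t_c = max(STAGE_PS.values())
    return (len(STAGE_PS) + n_instr - 1) * t_c

for n in (3, 1_000_000):
    ratio = single_cycle_total(n) / pipelined_total(n)
    print(n, single_cycle_total(n), pipelined_total(n), round(ratio, 2))
```

Under these assumptions three loads take 2400 ps single-cycle and 1400 ps pipelined. For long instruction streams the ratio approaches 800/200 = 4 rather than 5, because the stages are not balanced.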

15 Pipeline speedup. If all stages are balanced (i.e., all take the same time): time between instructions (pipelined) = time between instructions (nonpipelined) / number of stages. Ideal speedup (n instructions and s stages) = ns/(n + s - 1). If the stages are not balanced, speedup is less. Speedup is due to increased throughput; latency (time for each instruction) does not decrease. Branches will also reduce the speedup. Added pipeline registers reduce the speedup.
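For reference, the ideal speedup can be derived in three lines. This is a sketch under the balanced-stage assumption, where t denotes the (assumed equal) time per stage; the symbol t is introduced here and does not appear on the slide.

```latex
\begin{align}
T_{\text{nonpipelined}} &= n \, s \, t \\
T_{\text{pipelined}}    &= (s + n - 1)\, t \\
\text{Speedup} &= \frac{T_{\text{nonpipelined}}}{T_{\text{pipelined}}}
                = \frac{n s}{s + n - 1}
                \;\xrightarrow{\; n \to \infty \;}\; s
\end{align}
```

With s = 4 this reduces to 4n/(n + 3), the expression used in the laundry analogy.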

16 Reminder of single-cycle control [Figure: single-cycle datapath with its Control Unit. Inputs: Cond, Op, Funct, Rd, ALUFlags. Outputs: PCSrc, MemtoReg, MemWrite, ALUControl, ALUSrc, ImmSrc, RegWrite, RegSrc.]

17 Modifications to pipeline control. Control signals are derived from the instruction, the same as in the single-cycle implementation. Control is delayed to the proper pipeline stage.

18 Pipelined datapath + control [Figure: pipelined datapath in which the control signals travel through the pipeline registers with stage suffixes D, E, M, and W, e.g., RegWriteD, RegWriteE, RegWriteM, RegWriteW.] Same control unit as the single-cycle processor. Control is delayed to the proper pipeline stage.

19 Pipelining hazards. Situations that prevent starting the next instruction in the next cycle. 1. Structural hazard: a required resource is busy. 2. Data hazard: need to wait for a previous instruction to complete its data read/write. 3. Control hazard: deciding on the control action depends on a previous instruction.

20 1. Structural hazards. Conflict for use of a resource. In an ARM pipeline with a single memory, a load/store requires data access, so instruction fetch would have to stall for that cycle, which would cause a pipeline bubble. Hence, pipelined datapaths require separate instruction/data memories, or separate instruction/data caches.

21 2. Data hazards: compute-use [Figure: pipeline diagram, time in cycles, for the sequence ADD R, R4, R5; AND R8, R, R3; ORR R9, R6, R; SUB R, R, R7.]

22 Data hazard: load-use. LDR R, [R4, #4]; AND R8, R, R3; ORR R9, R6, R; SUB R, R, R7
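A read-after-write dependence of this kind can be detected mechanically. The sketch below is an illustration, not the course's tool: the `Instr` tuple, the function `raw_hazards`, and the register numbers in the example program are all hypothetical. It assumes a five-stage pipeline without forwarding whose register file is written in the first half of a cycle and read in the second, so only consumers one or two instructions behind the producer are affected.

```python
from typing import NamedTuple, Optional, Tuple, List

class Instr(NamedTuple):
    op: str
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read

def raw_hazards(program: List[Instr], window: int = 2):
    """Return (producer_index, consumer_index, register) for every
    read-after-write dependence closer than the pipeline can tolerate."""
    hazards = []
    for i, producer in enumerate(program):
        if producer.dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            consumer = program[j]
            if producer.dest in consumer.srcs:
                hazards.append((i, j, producer.dest))
            if consumer.dest == producer.dest:
                break  # register overwritten; later readers depend on the newer value
    return hazards

# Hypothetical register numbers chosen for illustration.
program = [
    Instr("ADD", "R1", ("R4", "R5")),
    Instr("AND", "R8", ("R1", "R3")),
    Instr("ORR", "R9", ("R6", "R1")),
    Instr("SUB", "R10", ("R1", "R7")),
]
print(raw_hazards(program))
```

In this example the AND and ORR are flagged, while the SUB, three instructions behind the ADD, reads the register file in the same cycle the ADD writes it and is safe under the stated assumption.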

23 Handling data hazards. A. Compile-time techniques. B. Forward data at run time. C. Stall the processor at run time.

24 A. Data hazard elimination using compile-time techniques (nop). Insert enough nops until the result is ready (wastes cycles). [Figure: pipeline diagram, time in cycles, for the sequence ADD R, R4, R5; NOP; NOP; AND R8, R, R3; ORR R9, R6, R; SUB R, R, R7.]
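The nop-insertion idea can be sketched as a small pass over the instruction list. This is an illustrative sketch, not a real assembler pass: `insert_nops`, the tuple layout, and the register numbers are hypothetical, and it assumes that a consumer must sit at least three instruction slots after its producer (five stages, no forwarding, register file written before it is read within a cycle).

```python
NOP = ("NOP", None, ())

def insert_nops(program, min_distance=3):
    """Pad with NOPs so every reader is at least min_distance slots
    behind the most recent writer of each register it reads."""
    out = []
    last_write = {}  # register -> index in `out` of its most recent writer
    for op, dest, srcs in program:
        needed = 0
        for reg in srcs:
            if reg in last_write:
                gap = len(out) - last_write[reg]
                needed = max(needed, min_distance - gap)
        out.extend([NOP] * needed)
        if dest is not None:
            last_write[dest] = len(out)  # index this instruction is about to occupy
        out.append((op, dest, srcs))
    return out

# Hypothetical register numbers chosen for illustration.
program = [
    ("ADD", "R1", ("R4", "R5")),
    ("AND", "R8", ("R1", "R3")),
    ("ORR", "R9", ("R6", "R1")),
    ("SUB", "R10", ("R1", "R7")),
]
for instr in insert_nops(program):
    print(instr[0])
```

For this sequence two NOPs are placed between the ADD and the AND, matching the slide's diagram; the later readers are already far enough away.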

25 A. Data hazard elimination using compile-time techniques (code rescheduling). Reorder code to avoid use of a load result in the next instruction. The compiler must be aware of the pipeline structure. Original: ADD R, R4, R5; ADD R8, R, R3; AND R9, R6, R; SUB R, R2, R7. Rescheduled: ADD R, R4, R5; SUB R, R2, R7; NOP; ADD R8, R, R3; AND R9, R6, R. Rescheduling saved one cycle!

26 B. Data hazard elimination using data forwarding/bypassing at run time. Don't wait for the result to be stored in a register; forward the result as soon as it is available. Requires extra connections in the datapath. [Figure: pipeline diagram, time in cycles, for the sequence ADD R, R4, R5; AND R8, R, R3; ORR R9, R6, R; SUB R, R, R7.] Check if a register read in the Execute stage matches the register written in the Memory or Writeback stage. If so, forward the result.

27 Circuitry for forwarding [Figure: pipelined datapath with a Hazard Unit that takes Match, RegWriteM, and RegWriteW as inputs and drives ForwardAE and ForwardBE.]

28 To forward or not to forward! Execute stage register matches Memory stage register? Match_1E_M = (RA1E == WA3M); Match_2E_M = (RA2E == WA3M). Execute stage register matches Writeback stage register? Match_1E_W = (RA1E == WA3W); Match_2E_W = (RA2E == WA3W). If it matches, forward the result: if (Match_1E_M · RegWriteM) ForwardAE selects the Memory-stage result (ALUOutM); else if (Match_1E_W · RegWriteW) ForwardAE selects the Writeback-stage result (ResultW); else ForwardAE selects the register file output (no forwarding).
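The forwarding decision for one ALU operand can be modeled in a few lines. This is a minimal sketch: the function name and the string labels returned in place of the multiplexer select encoding are assumptions made for readability, while the signal names follow the slides.

```python
def forward_select(ra_e: int, wa3_m: int, reg_write_m: bool,
                   wa3_w: int, reg_write_w: bool) -> str:
    """Choose the source of one Execute-stage operand.

    ra_e   : register number read by the instruction in Execute
    wa3_m  : destination register of the instruction in Memory
    wa3_w  : destination register of the instruction in Writeback
    The Memory stage is checked first because it holds the most recent value.
    """
    if ra_e == wa3_m and reg_write_m:
        return "ALUOutM"      # forward from the Memory stage
    if ra_e == wa3_w and reg_write_w:
        return "ResultW"      # forward from the Writeback stage
    return "RegFile"          # no hazard: use the value read in Decode

# Both older instructions write register 1: the newer (Memory-stage) value wins.
print(forward_select(1, wa3_m=1, reg_write_m=True, wa3_w=1, reg_write_w=True))
# A register match alone is not enough if the older instruction does not write.
print(forward_select(1, wa3_m=1, reg_write_m=False, wa3_w=2, reg_write_w=True))
```

Checking the Memory stage before the Writeback stage is what resolves the case where two earlier instructions write the same register: the operand must come from the later of the two.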

29 Double data hazard. Consider the sequence: ADD R,R,R2; ADD R,R,R3; ADD R,R,R4. Both hazards occur, and we want to use the most recent result. Revise the MEM hazard condition to give priority to EX results: only forward from MEM if the EX hazard condition isn't true.

30 Pipelining hazards. 1. Structural hazards. 2. Data hazards: A. Compile-time techniques; B. Forward data at run time; C. Stall the processor at run time. 3. Control hazards.

31 Stalling [Figure: pipeline diagram, time in cycles, for the sequence LDR R, [R4, #4]; AND R8, R, R3; ORR R9, R6, R; SUB R, R, R7. The dependence between the LDR and the AND is marked "Trouble!".]

32 Forwarding is not going to eliminate all hazards [Figure: the same LDR, AND, ORR, SUB sequence with a stall: the AND repeats its Decode stage and the ORR repeats its Fetch stage for one cycle.]

33 Fix C. Data hazard elimination by stalling. One clock cycle is wasted, but it is necessary for correctness. [Figure: pipeline diagram marking where the stall is inserted.]

34 Stalling HW [Figure: pipelined datapath with Hazard Unit. In addition to ForwardAE and ForwardBE, the unit takes MemtoRegE as input and drives StallF and StallD (EN inputs of pipeline registers) and FlushD and FlushE (CLR inputs of pipeline registers).]

35 To stall or not to stall! Is either source register in the Decode stage the same as the one being written in the Execute stage? Match_12D_E = (RA1D == WA3E) + (RA2D == WA3E). Is an LDR in the Execute stage AND Match_12D_E? ldrstall = Match_12D_E · MemtoRegE. StallF = StallD = FlushE = ldrstall.
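The load-use stall condition translates directly into Boolean code. This is a sketch for illustration; the function names and the dictionary used to return the control signals are assumptions, while the signal names follow the slide.

```python
def ldr_stall(ra1_d: int, ra2_d: int, wa3_e: int, mem_to_reg_e: bool) -> bool:
    """True when the instruction in Decode reads the register that a load
    currently in Execute is going to write."""
    match_d_e = (ra1_d == wa3_e) or (ra2_d == wa3_e)
    return bool(match_d_e and mem_to_reg_e)

def stall_controls(ra1_d: int, ra2_d: int, wa3_e: int, mem_to_reg_e: bool) -> dict:
    """Hold Fetch and Decode for one cycle and clear the Execute register,
    which inserts a bubble behind the load."""
    stall = ldr_stall(ra1_d, ra2_d, wa3_e, mem_to_reg_e)
    return {"StallF": stall, "StallD": stall, "FlushE": stall}

# A load into register 1 is in Execute; Decode holds an instruction reading registers 1 and 3.
print(stall_controls(ra1_d=1, ra2_d=3, wa3_e=1, mem_to_reg_e=True))
# Same register match, but the Execute instruction is not a load: forwarding suffices.
print(stall_controls(ra1_d=1, ra2_d=3, wa3_e=1, mem_to_reg_e=False))
```

MemtoRegE is used as the "this is a load" indicator because only loads take their result from data memory, which is why an ALU instruction in Execute with the same destination does not trigger a stall.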

36 Data hazard summary. The compiler can arrange code to avoid hazards and stalls; this requires knowledge of the pipeline structure. Forwarding can sometimes avoid stalls at the expense of extra hardware complexity. Stalls reduce performance by increasing the average cycles per instruction (CPI), but are sometimes absolutely necessary to get correct results.

37 3. Control hazards. B: the branch is not determined until the Writeback stage of the pipeline. Instructions after the branch are fetched before the branch occurs. These 4 instructions must be flushed if the branch happens. Writes to PC (R15) are similar.

38 Control hazards [Figure: pipeline diagram, time in cycles. B 3C is followed by AND R8, R, R3; ORR R9, R6, R; SUB R, R, R7; SUB R, R, R8, which are marked "Flush these instructions"; execution then continues with ADD R2, R3, R4.] Branch misprediction penalty: number of instructions flushed when the branch is taken (4). May be reduced by determining the branch target address (BTA) earlier.

39 Early branch resolution. Determine BTA in the Execute stage: branch misprediction penalty = 2 cycles. Hardware changes: add a branch multiplexer before the PC register to select BTA from ALUResultE; add a BranchTakenE select signal for this multiplexer (only asserted if the branch condition is satisfied); PCSrcW is now only asserted for writes to PC.

40 Pipelined processor with early BTA [Figure: pipelined datapath with Hazard Unit (inputs Match, RegWriteM, RegWriteW, MemtoRegE; outputs ForwardAE, ForwardBE, StallF, StallD, FlushD, FlushE) and the BranchTakenE signal from the Cond Unit.]

41 Control hazards with early BTA [Figure: pipeline diagram, time in cycles. B 3C is followed by AND R8, R, R3 and ORR R9, R6, R, which are marked "Flush these instructions"; execution then continues with ADD R2, R3, R4.]

42 Control stalling logic. PCWrPendingF = 1 if there is a write to PC in Decode, Execute, or Memory: PCWrPendingF = PCSrcD + PCSrcE + PCSrcM. Stall Fetch if PCWrPendingF: StallF = ldrstalld + PCWrPendingF. Flush Decode if PCWrPendingF OR PC is written in Writeback OR branch is taken: FlushD = PCWrPendingF + PCSrcW + BranchTakenE. Flush Execute if branch is taken: FlushE = ldrstalld + BranchTakenE. Stall Decode if ldrstalld (as before): StallD = ldrstalld.
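The equations above can be written as a small combinational function, which makes them easy to check case by case. This is an illustrative sketch: the function name `hazard_unit` and the dictionary return value are assumptions, and "+" in the slide is read as logical OR.

```python
def hazard_unit(pc_src_d: bool, pc_src_e: bool, pc_src_m: bool, pc_src_w: bool,
                branch_taken_e: bool, ldr_stall_d: bool) -> dict:
    """Stall and flush controls for load-use stalls, PC writes, and taken branches."""
    # A write to the PC is in flight somewhere between Decode and Memory.
    pc_wr_pending_f = pc_src_d or pc_src_e or pc_src_m
    return {
        "StallF": bool(ldr_stall_d or pc_wr_pending_f),
        "StallD": bool(ldr_stall_d),
        "FlushD": bool(pc_wr_pending_f or pc_src_w or branch_taken_e),
        "FlushE": bool(ldr_stall_d or branch_taken_e),
    }

# A taken branch resolved in Execute: the two younger instructions are squashed.
print(hazard_unit(False, False, False, False, branch_taken_e=True, ldr_stall_d=False))
# A PC write still in Decode: hold Fetch and keep Decode empty until it completes.
print(hazard_unit(True, False, False, False, branch_taken_e=False, ldr_stall_d=False))
```

In the first call only FlushD and FlushE are asserted; in the second, StallF and FlushD are asserted while Decode and Execute are left alone.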

43 ARM pipelined processor with Hazard Unit [Figure: complete pipelined datapath with Control Unit, Cond Unit, and Hazard Unit (inputs Match, RegWriteM, RegWriteW, MemtoRegE; outputs ForwardAE, ForwardBE, StallF, StallD, FlushD, FlushE).]

44 Branch prediction. Ideal pipelined processor: CPI = 1. Branch misprediction increases CPI. Static branch prediction: always not taken; always taken; or check the direction of the branch (if backward, predict taken; else, predict not taken). Dynamic branch prediction: make a dynamic prediction based on the history of branches. In all cases, the branch must be executed to see if the prediction was correct; if not, flush instructions and resume from the correct direction!
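The effect of stalls and mispredictions on CPI can be estimated with a simple weighted sum. The sketch below is illustrative: the function name and every number in the example mix are assumptions, not measurements from the course. It assumes a one-cycle load-use stall and a two-cycle misprediction penalty.

```python
def average_cpi(load_frac: float, load_use_frac: float,
                branch_frac: float, mispredict_frac: float,
                load_stall: int = 1, branch_penalty: int = 2) -> float:
    """Ideal CPI of 1 plus the average stall cycles per instruction.

    load_frac       : fraction of instructions that are loads
    load_use_frac   : fraction of loads immediately followed by a dependent instruction
    branch_frac     : fraction of instructions that are branches
    mispredict_frac : fraction of branches that are mispredicted
    """
    load_stalls = load_frac * load_use_frac * load_stall
    branch_stalls = branch_frac * mispredict_frac * branch_penalty
    return 1.0 + load_stalls + branch_stalls

# Hypothetical instruction mix, for illustration only.
print(round(average_cpi(0.25, 0.40, 0.15, 0.50), 2))
# A better predictor lowers only the branch term.
print(round(average_cpi(0.25, 0.40, 0.15, 0.10), 2))
```

With this made-up mix, cutting the misprediction rate from 50% to 10% lowers the estimated CPI from 1.25 to 1.13.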

45 Eliminating the 2-cycle stall for a taken-prediction policy with a branch target buffer. Even with a predictor, the target address still needs to be calculated: 2-cycle penalty for a taken branch. Branch target buffer: a cache of target addresses, indexed by PC when the instruction is fetched. If it hits and the instruction is a branch predicted taken, the target can be fetched immediately: no 2-cycle penalty.

46 Branch target buffer [Figure: the fetch PC is compared (=) against the stored branch PCs; a MUX in front of the pipeline register selects the next PC.] No: it is not a branch, so next PC = PC+4. Yes: it is a branch, so next PC = branch target address.
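The lookup described above behaves like a small associative table keyed by the fetch address. The following sketch uses a Python dictionary for that table; the class and method names are hypothetical, and a real buffer would be a finite hardware structure indexed by low-order PC bits with a tag comparison, and would also consult the taken/not-taken prediction.

```python
class BranchTargetBuffer:
    """Maps the address of a previously seen taken branch to its target."""

    def __init__(self) -> None:
        self.entries = {}  # branch PC -> branch target address

    def next_pc(self, pc: int) -> int:
        """Hit: fetch from the stored target. Miss: fall through to PC + 4."""
        target = self.entries.get(pc)
        return target if target is not None else pc + 4

    def update(self, pc: int, target: int) -> None:
        """Record the target once the branch has been resolved as taken."""
        self.entries[pc] = target

btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x20)))   # first encounter: miss, sequential fetch
btb.update(0x20, 0x64)          # branch at 0x20 resolved: taken to 0x64 (made-up addresses)
print(hex(btb.next_pc(0x20)))   # next encounter: hit, fetch the target directly
```

The first lookup misses and yields 0x24; after the update the same lookup yields 0x64, which is how the buffer removes the penalty on later executions of the same branch.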

47 2. Dynamic branch prediction [floorplan of Pentium processor]. In deeper and superscalar pipelines, the branch penalty is more significant. Use dynamic prediction: a branch prediction buffer (aka branch history table), indexed by recent branch instruction addresses, stores the outcome (taken/not taken). To execute a branch: check the table and expect the same outcome; start fetching from the fall-through or the target; if wrong, flush the pipeline and flip the prediction.

48 1-bit branch predictor

MOV R, #            ; R = sum
MOV R, #            ; R = i
FOR
CMP R, #            ; for (i=; i<; i=i+)
BGE DONE
ADD R, R, R         ; sum = sum + i
ADD R, R, #
B FOR
DONE

Remembers whether the branch was taken the last time and does the same thing. Mispredicts the first and last branch of the loop. Prediction bits are added to BTB entries.
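The behavior of a last-outcome predictor on a loop-exit branch can be simulated directly. This is a sketch under stated assumptions: the loop is assumed to run 10 iterations, the branch modeled is the exit test at the top of the loop (not taken while looping, taken once at the end), and the predictor is assumed to start out predicting taken, as it would after a previous run of the same loop. The function name is hypothetical.

```python
def one_bit_mispredictions(outcomes, last_taken=True):
    """Predict that each branch does what it did last time; count the misses."""
    misses = 0
    for taken in outcomes:
        if last_taken != taken:
            misses += 1
        last_taken = taken  # remember only the most recent outcome
    return misses

iterations = 10                              # assumed loop trip count
exit_branch = [False] * iterations + [True]  # falls through 10 times, then exits
print(one_bit_mispredictions(exit_branch))
```

Under these assumptions the predictor misses exactly twice, on the first and the last execution of the branch, regardless of the trip count.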

49 Problem with 1-bit predictor. Inner loop branches are mispredicted twice! outer: ... inner: ... beq ..., ..., inner ... beq ..., ..., outer. Mispredict as taken on the last iteration of the inner loop, then mispredict as not taken on the first iteration of the inner loop next time around.

50 2-bit predictor. Only change prediction on two successive mispredictions.
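One common way to realize this rule is a 2-bit saturating counter, sketched below next to a 1-bit predictor for comparison. The counter encoding (0 to 3, predict taken when the counter is 2 or 3), the function names, and the loop sizes are assumptions chosen for illustration; the modeled branch is the backward branch that closes an inner loop nested inside an outer loop.

```python
def one_bit_mispredictions(outcomes, last_taken=True):
    """Baseline: always predict the previous outcome."""
    misses = 0
    for taken in outcomes:
        misses += last_taken != taken
        last_taken = taken
    return misses

def two_bit_mispredictions(outcomes, state=3):
    """Saturating counter: 0-1 predict not taken, 2-3 predict taken.
    Starting from a strong state (0 or 3), two misses in a row are
    needed before the prediction flips."""
    misses = 0
    for taken in outcomes:
        misses += (state >= 2) != taken
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

inner_iterations, outer_iterations = 4, 3  # assumed loop sizes
# The inner-loop branch is taken on every iteration except the last one.
outcomes = ([True] * (inner_iterations - 1) + [False]) * outer_iterations
print(one_bit_mispredictions(outcomes), two_bit_mispredictions(outcomes))
```

For this pattern the 1-bit predictor misses 5 times (once on the first loop exit, then twice per later pass), while the 2-bit counter misses only 3 times, once per loop exit.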

51 Summary. Pipelining for speedup: ideal speedup = number of stages, but actual speedup depends on delay balance between stages (clock frequency), delays introduced by pipeline registers, and number of stalls (CPI). Hazards (structural, data, and control) can increase CPI. Hazards can be eliminated or mitigated using code reorganization, stalling, flushing, and forwarding/bypassing. Branch prediction, branch prediction buffers, and branch target buffers can reduce stalls arising from control hazards.
