Anne Bracy CS 3410 Computer Science Cornell University. [K. Bala, A. Bracy, S. McKee, E. Sirer, H. Weatherspoon]



2 A single-cycle processor
[Figure: single-cycle datapath with Prog. Mem, PC, +4, Reg. File, control, ALU, and Data Mem, spanning the Fetch, Decode, Execute, Memory, and WB phases. This diagram is not 100% spatial.]

3 [Figure: the single-cycle datapath from slide 2.]
Which instruction is more likely to determine the speed of the clock?
A. Jump Register
B. Add
C. Store
D. Load
E. Either Load or Store

4 Basic CPU execution loop
1. Instruction Fetch
2. Instruction Decode
3. Execution (ALU)
4. Memory Access
5. Register Writeback

5 Single-cycle:
insn0: F, D, X, M, W, then insn1: F, D, X, M, W
Pipelined:
insn0: F  D  X  M  W
insn1:    F  D  X  M  W

6 5-stage Pipeline: Implementation, Working Example. Hazards: Structural, Data Hazards, Control Hazards.

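7 [Figure: five-stage pipelined datapath. Stages: Instruction Fetch, Instruction Decode, Execute, Memory, Write-Back, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. Control (ctrl) travels with each instruction through the pipeline registers; jump/branch targets are computed in Execute and feed the new pc back to fetch.]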

8 Cycle   1    2    3    4    5    6    7    8    9
  add     IF   ID   EX   MEM  WB
  nand         IF   ID   EX   MEM  WB
  lw                IF   ID   EX   MEM  WB
  add                    IF   ID   EX   MEM  WB
  sw                          IF   ID   EX   MEM  WB
Latency: 5 cycles. Throughput: 1 insn/cycle. CPI = 1.
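The staggered timing on this slide can be sketched in a few lines of Python. This is only an illustration of an ideal pipeline with no stalls; the function names `pipeline_schedule` and `total_cycles` are made up for the example and do not come from the slides.

```python
# Sketch of an ideal 5-stage pipeline with no hazards or stalls.
# Function names are illustrative, not from the course material.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(instructions, stages=STAGES):
    """Return one {cycle: stage} dict per instruction."""
    schedule = []
    for i, _ in enumerate(instructions):
        # Instruction i enters the first stage in cycle i + 1
        # and advances one stage per cycle.
        schedule.append({i + 1 + s: name for s, name in enumerate(stages)})
    return schedule

def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed to drain n instructions through the pipeline."""
    return n_instructions + n_stages - 1

program = ["add", "nand", "lw", "add", "sw"]
sched = pipeline_schedule(program)
for name, row in zip(program, sched):
    cells = [row.get(c, "") for c in range(1, total_cycles(len(program)) + 1)]
    print(f"{name:<5}" + "".join(f"{cell:<4}" for cell in cells))
```

Each instruction still takes 5 cycles end to end (latency), but once the pipeline is full one instruction completes every cycle, which is where the CPI of 1 comes from.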

9 Break datapath into multiple cycles (here 5)
  Parallel execution increases throughput
  Balanced pipeline very important
  Slowest stage determines clock rate
  Imbalance kills performance
Add pipeline registers (flip-flops) for isolation
  Each stage begins by reading values from latch
  Each stage ends by writing values to latch
Resolve hazards

10 Stage: functionality performed; values of interest latched
Fetch: Use PC to index Program Memory, increment PC. Latches: instruction bits (to be decoded), PC + 4 (to compute branch targets).
Decode: Decode instruction, generate control signals, read register file. Latches: control information, Rd index, immediates, offsets, register values (Ra, Rb), PC+4 (to compute branch targets).
Execute: Perform ALU operation; compute targets (PC+4+offset, etc.) in case this is a branch; decide if branch taken. Latches: control information, Rd index, etc.; result of ALU operation; value in case this is a store instruction.
Memory: Perform load/store if needed; address is ALU result. Latches: control information, Rd index, etc.; result of load, pass result from execute.
Writeback: Select value, write to register file.

11 Fetch
[Figure: datapath from slide 2 with the Fetch stage highlighted.]
Fetch 32-bit instruction from memory
Increment PC: PC = PC + 4

12 [Figure: Instruction Fetch stage. The PC addresses instruction memory, which reads out the instruction word; inst and PC+4 are latched in IF/ID. pc-sel chooses the next PC among PC+4, pc-reg, pc-rel, and pc-abs, which come back from the rest of the pipeline.]
pc-reg (PC registers: JR)
pc-rel (PC-relative: BEQ, BNE)
pc-abs (PC absolute: J and JAL)

13 Decode
[Figure: datapath from slide 2 with the Decode stage highlighted.]
Gather data from the instruction
Read opcode; determine instruction type, field lengths
Read in data from register file (0, 1, or 2 reads for jump, addi, or add, respectively)

14 [Figure: Instruction Decode stage, fed by Stage 1 (Instruction Fetch) through IF/ID. The instruction is decoded into ctrl, the register file is read at Ra and Rb to produce A and B, and the immediate is extended; these, dest, and PC+4 are latched in ID/EX. The register file's write port (WE, Rd, D) is driven by the result coming back from the rest of the pipeline.]

15 Execute
[Figure: datapath from slide 2 with the Execute stage highlighted.]
Useful work done here (+, -, *, /), shift, logic operation, comparison (slt)
Load/Store? lw $t2, 32($t3) → Compute address

16 [Figure: Execute stage, fed by Stage 2 (Instruction Decode) through ID/EX. The ALU operates on A, B, and imm; an adder combines PC+4 and imm into the pc-rel target; branch?, pc-sel, pc-reg, and pc-abs feed back to fetch. The ALU result D, B, ctrl, and target are latched in EX/MEM.]

17 Memory
[Figure: datapath from slide 2 with the Memory stage highlighted; Data Mem has addr, Data, and R/W ports.]
Used by load and store instructions only
Other instructions will skip this stage

18 [Figure: Memory stage, fed by Stage 3 (Execute) through EX/MEM. D supplies addr and B supplies d in to data memory; d out becomes M. M, D, and ctrl are latched in MEM/WB.]

19 Writeback
[Figure: datapath from slide 2 with the WB stage highlighted.]
Write to register file
  For arithmetic ops, logic, shift, etc., load. What about stores?
Update PC
  For branches, jumps

20 [Figure: Writeback stage, fed by Stage 4 (Memory) through MEM/WB. ctrl selects between M and D as the result, which is sent with dest to the register file.]

21 [Figure: the complete pipelined datapath, showing the fields held in each pipeline register: IF/ID (inst, PC+4), ID/EX (OP, A, B, imm, Rt, Rd, PC+4), EX/MEM (OP, Rd, B, D, target), MEM/WB (OP, Rd, M, D).]

22 Consider a non-pipelined processor with clock period C (e.g., 50 ns). If you divide the processor into N stages (e.g., 5), your new clock period will be:
A. C
B. N
C. less than C/N
D. C/N
E. greater than C/N
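A rough numeric sketch of this question, using the slide's C = 50 ns and N = 5. The 1 ns pipeline-register overhead and the unbalanced stage delays below are invented for illustration; the slides do not give these figures. The model assumes the clock period is set by the slowest stage plus the latch overhead.

```python
# Hypothetical numbers: only C = 50 ns and N = 5 come from the slide.
def pipelined_clock_period(stage_delays_ns, latch_overhead_ns):
    """Clock period = slowest stage + pipeline register overhead."""
    return max(stage_delays_ns) + latch_overhead_ns

C = 50.0          # non-pipelined clock period (ns), from the slide
N = 5             # number of stages, from the slide
ideal = C / N     # 10 ns, reachable only with perfect balance and free latches

# Perfectly balanced stages, assumed 1 ns latch overhead.
balanced = pipelined_clock_period([10.0] * N, latch_overhead_ns=1.0)

# Same 50 ns of logic split unevenly (assumed split), same overhead.
unbalanced = pipelined_clock_period([8.0, 8.0, 10.0, 16.0, 8.0],
                                    latch_overhead_ns=1.0)

print(ideal, balanced, unbalanced)
```

Under these assumptions both pipelined periods exceed C/N, and the unbalanced split is limited by its 16 ns stage, which matches the earlier point that the slowest stage determines the clock rate.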

23 Instructions same length
  32 bits, easy to fetch and then decode
3 types of instruction formats
  Easy to route bits between stages
  Can read a register source before even knowing what the instruction is
Memory access through lw and sw only
  Access memory after ALU

24 5-stage Pipeline: Implementation, Working Example. Hazards: Structural, Data Hazards, Control Hazards.

25 Working example program:
add   r3 ← r1, r2
nand  r6 ← r4, r5
lw    r4 ← 20(r2)
add   r5 ← r2, r5
sw    r7 → 12(r3)
Assume 8-register machine

26-35 [Figures: the pipelined datapath (PC, +4 adder, register file R0 to R7, sign extend, muxes, ALU, data memory, and the IF/ID, ID/EX, EX/MEM, MEM/WB registers) shown cycle by cycle as the example program flows through it. The register and bus values drawn in the figures are not recoverable here; the instruction occupying each stage is summarized below.]

Slide  Time  Fetch        Decode       Execute      Memory       Writeback
26     0     (empty)      (empty)      (empty)      (empty)      (empty)
27     1     add          nop          nop          nop          nop
28     2     nand         add          nop          nop          nop
29     3     lw 4 20(2)   nand         add          nop          nop
30     4     add          lw 4 20(2)   nand         add          nop
31     5     sw 7 12(3)   add          lw 4 20(2)   nand         add
32     6     nop          sw 7 12(3)   add          lw 4 20(2)   nand
33     7     nop          nop          sw 7 12(3)   add          lw 4 20(2)
34     8     nop          nop          nop          sw 7 12(3)   add
35     9     nop          nop          nop          nop          sw 7 12(3)

From time 6 onward there are no more instructions to fetch.
Slides thanks to Sally McKee

36 Pipelining is great because:
A. You can fetch and decode the same instruction at the same time.
B. You can fetch two instructions at the same time.
C. You can fetch one instruction while decoding another.
D. Instructions only need to visit the pipeline stages that they require.
E. C and D

37 5-stage Pipeline: Implementation, Working Example. Hazards: Structural, Data Hazards, Control Hazards.

38 Correctness problems associated w/processor design
1. Structural hazards
   Same resource needed for different purposes at the same time (Possible: ALU, Register File, Memory)
2. Data hazards
   Instruction output needed before it's available
3. Control hazards
   Next instruction PC unknown at time of Fetch

39 [Figure: pipelined datapath with add r6, r5, r4 in decode, reading the register file, while add r3, r2, r1 is in writeback, writing it; nops occupy the stages in between.]
add r3, r2, r1   IF  ID  Ex  M   W
nop                  IF  ID  Ex  M   W
nop                      IF  ID  Ex  M   W
add r6, r5, r4               IF  ID  Ex  M   W
Problem: Need to read from and write to Register File at the same time
Solution: negate RF clock: write first half, read second half

40 Dependence: relationship between two insns
  Data: two insns use same storage location
  Control: 1 insn affects whether another executes at all
  Not a bad thing, programs would be boring otherwise
  Enforced by making older insn go before younger one
    Happens naturally in single-/multi-cycle designs
    But not in a pipeline
Hazard: dependence & possibility of wrong insn order
  Effects of wrong insn order cannot be externally visible
  Hazards are a bad thing: most solutions either complicate the hardware or reduce performance

41 Data Hazards
  register file (RF) reads occur in stage 2 (ID)
  RF writes occur in stage 5 (WB)
  RF written in first half, read in second half of cycle
Processor is built exactly as we've seen up until this slide.
x10: add r3 ← r1, r2
x14: sub r5 ← r3, r4
1. Is there a dependence?
2. Is there a hazard?
A) Yes
B) No
C) Cannot tell with the information given.

42 Which of the following statements is true?
A. Whether there is a data dependence between two instructions depends on the machine the program is running on.
B. Whether there is a data hazard between two instructions depends on the machine the program is running on.
C. Both A & B
D. Neither A nor B

43 Clock cycle     1    2    3    4    5    6    7    8    9
add r3, r1, r2    IF   ID   X    MEM  WB
sub r5, r3, r4         IF   ID   X    MEM  WB
lw r6, 4(r3)                IF   ID   X    MEM  WB
or r5, r3, r5                    IF   ID   X    MEM  WB
sw r6, 12(r3)                         IF   ID   X    MEM  WB

44 [Figure: the diagram from slide 43 with dependence arrows drawn from add's writeback of r3 to the later reads of r3.] Backwards arrows require time travel.

45-46 [Figure: the diagram from slide 43, repeated.]

47 Detecting Data Hazards
[Figure: pipelined datapath with sub r5,r3,r4 in decode and add r3, r1, r2 in execute. Comparators check IF/ID.Ra against 0 and against the Rd fields held in ID/EX and EX/MEM.]
Problem = (IF/ID.Ra != 0 && (IF/ID.Ra == ID/EX.Rd || IF/ID.Ra == EX/M.Rd))
repeat for Rb
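The detection condition on this slide translates almost directly into code. The sketch below is a software model, not hardware: the function name `needs_stall` and the convention that a nop or bubble carries Rd = 0 are assumptions made for the example.

```python
# Software model of the slide's hazard check (no forwarding).
# Assumption: a nop/bubble in a pipeline register has Rd == 0.
def needs_stall(if_id_ra, if_id_rb, id_ex_rd, ex_mem_rd):
    """True if the instruction in decode reads a register that an
    instruction in execute or memory has not yet written back."""
    def conflicts(src):
        # Register 0 is hardwired to zero, so it never causes a hazard.
        return src != 0 and (src == id_ex_rd or src == ex_mem_rd)
    # The slide states the check for Ra and says "repeat for Rb".
    return conflicts(if_id_ra) or conflicts(if_id_rb)

# sub r5,r3,r4 in decode while add r3,r1,r2 is in execute:
print(needs_stall(if_id_ra=3, if_id_rb=4, id_ex_rd=3, ex_mem_rd=0))
```

An instruction in writeback is not checked, because the register file is written in the first half of the cycle and read in the second half.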

48 1. Do Nothing
   Change the ISA to match implementation
   "Hey compiler: don't create code w/data hazards!"
   (We can do better than this)
2. Stall
   Pause current and subsequent instructions till safe
3. Forward/bypass
   Forward data value to where it is needed
   (Only works if value actually exists already)

49 How to stall an instruction in ID stage
  prevent IF/ID pipeline register update
    stalls the ID stage instruction
  convert ID stage insn into nop for later stages
    innocuous "bubble" passes through pipeline
  prevent PC update
    stalls the next (IF stage) instruction

50 [Figure: pipelined datapath running add r3, r1, r2; sub r5, r3, r5; or r6, r3, r4; add r6, r3, r8, with a detect-hazard unit in decode. If hazard: WE=0 on the PC and IF/ID, and MemWr=0, RegWr=0 are written into ID/EX.]

51 [Figure: first stall cycle. sub r5,r3,r5 is held in decode and or r6,r3,r4 is held in fetch (WE=0, /stall); a nop (MemWr=0, RegWr=0) follows add r3,r1,r2 down the pipeline.]
NOP = If(IF/ID.rA != 0 && (IF/ID.rA == ID/Ex.Rd || IF/ID.rA == Ex/M.Rd))
STALL CONDITION MET

52 [Figure: second stall cycle. sub r5,r3,r5 and or r6,r3,r4 are still held (WE=0, /stall); a second nop (MemWr=0, RegWr=0) is injected while add r3,r1,r2 moves on.]
NOP = If(IF/ID.rA != 0 && (IF/ID.rA == ID/Ex.Rd || IF/ID.rA == Ex/M.Rd))
STALL CONDITION MET

53 [Figure: add r3,r1,r2 has moved past the stages being compared; WE=1 again.]
NOP = If(IF/ID.rA != 0 && (IF/ID.rA == ID/Ex.Rd || IF/ID.rA == Ex/M.Rd))
NO STALL CONDITION MET: sub allowed to leave decode stage

54 [Figure: empty timing diagram for add r3, r1, r2; sub r5, r3, r5; or r6, r3, r4; add r6, r3, r8, filled in on the next slide.]

55 Clock cycle     1    2    3    4    5    6    7    8
add r3, r1, r2    IF   ID   Ex   M    W
sub r5, r3, r5         IF   ID*  ID*  ID   Ex   M    W
or r6, r3, r4               IF*  IF*  IF   ID   Ex   M
add r6, r3, r8                             IF   ID   Ex
2 stall cycles (marked *). r3 = 10 before add writes back; r3 = 20 afterwards.
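The two stall cycles can be reproduced with a small timing model. This is a sketch under the slide's assumptions (no forwarding; the register file is written in the first half of a cycle and read in the second half, so a consumer may decode in the same cycle its producer is in writeback). The function name `decode_cycles` and the tuple encoding of instructions are invented for the example.

```python
# Timing sketch: stall-only pipeline, no forwarding.
# Each instruction is (dest, src1, src2) as register numbers.
def decode_cycles(program):
    """Return the cycle in which each instruction finally leaves ID."""
    ready = {}        # register -> cycle of its producer's writeback
    id_cycles = []
    prev_id = 1       # so the first instruction decodes in cycle 2
    for dest, *srcs in program:
        cycle = prev_id + 1                 # in-order: one ID per cycle
        for src in srcs:
            if src != 0 and src in ready:   # r0 never causes a hazard
                cycle = max(cycle, ready[src])
        id_cycles.append(cycle)
        if dest != 0:
            ready[dest] = cycle + 3         # ID -> Ex -> M -> W
        prev_id = cycle
    return id_cycles

program = [
    (3, 1, 2),   # add r3, r1, r2
    (5, 3, 5),   # sub r5, r3, r5
    (6, 3, 4),   # or  r6, r3, r4
    (6, 3, 8),   # add r6, r3, r8
]
cycles = decode_cycles(program)
stalls = [c - (p + 1) for p, c in zip([1] + cycles, cycles)]
print(cycles, stalls)
```

Only sub pays the stall; by the time or and add reach decode, r3 has already been written.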

56 1. Do Nothing
   Change the ISA to match implementation
   "Compiler: don't create code with data hazards!"
   (Nice try, we can do better than this)
2. Stall
   Pause current and subsequent instructions till safe
3. Forward/bypass
   Forward data value to where it is needed
   (Only works if value actually exists already)

57 [Figure: pipelined datapath with sub r5, r3, r1 in execute and add r3, r1, r2 in memory; a bypass wire runs from Ex/Mem back to the ALU input.]
add r3, r1, r2   IF  ID  Ex  M   W
sub r5, r3, r1       IF  ID  Ex  M   W
Problem: EX needs ALU result that is in MEM stage
Solution: add a bypass from EX/MEM.D to start of EX

58 [Figure: the Ex/Mem bypass from slide 57, with sub r5, r3, r1 in execute and add r3, r1, r2 in memory.]
Detection Logic in Ex Stage:
forward = (Ex/M.WE && Ex/M.Rd != 0 && ID/Ex.Ra == Ex/M.Rd)
(same for Rb)

59 [Figure: pipelined datapath with or r6, r3, r4 in execute, sub r5, r3, r1 in memory, and add r3, r1, r2 in writeback; a bypass wire runs from Mem/WB back to the ALU input.]
add r3, r1, r2   IF  ID  Ex  M   W
sub r5, r3, r1       IF  ID  Ex  M   W
or r6, r3, r4            IF  ID  Ex  M   W
Problem: EX needs value being written by WB
Solution: Add bypass from WB final value to start of EX

60 [Figure: the Mem/WB bypass from slide 59.]
Detection Logic:
forward = (M/WB.WE && M/WB.Rd != 0 && ID/Ex.Ra == M/WB.Rd && not (Ex/M.WE && Ex/M.Rd != 0 && ID/Ex.Ra == Ex/M.Rd))
(same for Rb)
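The two detection rules above amount to a priority choice: take the value from Ex/Mem if it matches, otherwise from Mem/WB, otherwise use the value read from the register file. A minimal Python model follows; the function name `forward_select`, the dict layout of the pipeline registers, and the string labels are all illustrative choices, not part of the slides.

```python
# Software model of the forwarding decision for one ALU operand.
# ex_mem and mem_wb are dicts with "we" (write enable) and "rd" fields;
# this representation is an assumption made for the example.
def forward_select(id_ex_src, ex_mem, mem_wb):
    """Return where the Ex stage should take this operand from."""
    def matches(latch):
        return latch["we"] and latch["rd"] != 0 and latch["rd"] == id_ex_src
    if matches(ex_mem):
        return "EX/MEM"    # most recent producer wins
    if matches(mem_wb):
        return "MEM/WB"    # only if Ex/Mem did not match
    return "REG"           # no hazard: use the register file value

# sub r5,r3,r1 in execute, add r3,r1,r2 one stage ahead in memory:
print(forward_select(3, {"we": True, "rd": 3}, {"we": False, "rd": 0}))
```

Checking Ex/Mem first is what the `not (...)` term in the Mem/WB rule expresses: if both stages hold a pending write to the same register, the younger one in Ex/Mem is the correct value.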

61 [Figure: pipelined datapath with a detect-hazard unit in decode and a forward unit in execute; the forward unit compares Ra and Rb against the Rd and WE fields held in Ex/Mem and Mem/WB.]
Two types of forwarding/bypass
  Forwarding from Ex/Mem registers to Ex stage (M → Ex)
  Forwarding from Mem/WB register to Ex stage (W → Ex)

62 [Figure: empty timing diagram for add r3, r1, r2; sub r5, r3, r4; lw r6, 4(r3); or r5, r3, r5; sw r6, 12(r3), filled in on the next slide.]

63 Clock cycle     1    2    3    4    5    6    7    8    9
add r3, r1, r2    IF   ID   Ex   M    W
sub r5, r3, r4         IF   ID   Ex   M    W
lw r6, 4(r3)                IF   ID   Ex   M    W
or r5, r3, r5                    IF   ID   Ex   M    W
sw r6, 12(r3)                         IF   ID   Ex   M    W

64 [Figure: pipelined datapath with or r6, r3, r4 in decode and lw r4, 20(r8) in execute.]
Data dependency after a load instruction:
  Value not available until after the M stage
  → next instruction cannot proceed if dependent
THE KILLER HAZARD

65 [Figure: pipelined datapath with or r6,r4,r1 in decode and lw r4, 20(r8) in execute; timing diagram for lw r4, 20(r8) followed by or r6, r3, r4, not yet filled in.]

66 [Figure: as slide 65.]
lw r4, 20(r8)   IF  ID  Ex
or r6, r3, r4       IF  ID

67 [Figure: pipelined datapath with or r6,r4,r1 held in decode, a NOP in execute, and lw r4, 20(r8) in memory.]
lw r4, 20(r8)   IF  ID   Ex  M   W
or r6, r3, r4       IF   ID* ID  Ex  M   W
Stall: one cycle (marked *)

68 [Figure: pipelined datapath and timing diagram for the same lw / NOP / or sequence as slide 67.]

69 [Figure: pipelined datapath with a detect-hazard unit in decode (inputs Rd, Ra, Rb, MC) and a forward unit in execute (inputs Rd, WE, MC from Ex/Mem and Mem/WB).]
Stall = If(ID/Ex.MemRead && IF/ID.Ra == ID/Ex.Rd)
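With forwarding in place, the only data hazard that still needs a stall is a load followed immediately by a dependent instruction. The sketch below models the stall condition from this slide; the Rb comparison and the register-0 exclusion are assumptions added by analogy with the earlier detection logic (the slide shows only the Ra term), and `load_use_stall` is a made-up name.

```python
# Software model of the load-use stall check.
# Assumptions beyond the slide: Rb is checked like Ra, and a
# destination of register 0 never triggers a stall.
def load_use_stall(id_ex_mem_read, id_ex_rd, if_id_ra, if_id_rb):
    """True if the instruction in decode needs the result of a load
    that is currently in execute (value not available until after M)."""
    return (id_ex_mem_read
            and id_ex_rd != 0
            and id_ex_rd in (if_id_ra, if_id_rb))

# lw r4, 20(r8) in execute, or r6, r3, r4 in decode:
print(load_use_stall(True, 4, 3, 4))
```

After the single stall cycle the loaded value sits in Mem/WB and can be forwarded to the Ex stage like any other result.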

70 [Figure: the datapath from slide 69 with a bypass drawn directly from the data memory output to the ALU input.]
Most frequent 3410 non-solution to load-use hazards
Why is this "solution" so so so so so so awful?

71 Forwarding values directly from Memory to the Execute stage without storing them in a register first:
A. Does not remove the need to stall.
B. Adds one too many possible inputs to the ALU.
C. Will cause the pipeline register to have the wrong value.
D. Halves the frequency of the processor.
E. Both A & D

72 Two MIPS Solutions:
MIPS 2000/3000: delay slot
  ISA says results of loads are not available until one cycle later
  Assembler inserts nop, or reorders to fill delay slot
MIPS 4000 onwards: stall
  But really, programmer/compiler reorders to avoid stalling in the load delay slot

73 5-stage Pipeline: Implementation, Working Example. Hazards: Structural, Data Hazards, Control Hazards.

74 for (i = 0; i < max; i++) { n += 2; }
i = 7;
n--;

r1: i, r2: n, r3: max
Simplification: assume max > 0

x10        addi r1, r0, 0     # i = 0
x14 Loop:  addi r2, r2, 2     # n += 2
x18        addi r1, r1, 1     # i++
x1c        blt  r1, r3, Loop  # i < max?
x20        addi r1, r0, 7     # i = 7
x24        subi r2, r2, 1     # n--

75 Control Hazards
  instructions are fetched in stage 1 (IF)
  branch and jump decisions occur in stage 3 (EX)
  → next PC not known until 2 cycles after branch/jump

x1c  blt  r1, r3, Loop
x20  addi r1, r0, 7
x24  subi r2, r2, 1

Branch not taken? No problem!
Branch taken? Just fetched 2 addi's → Zap & Flush

76 [Figure: pipelined datapath; branch calc and decide branch happen in execute and produce New PC = 14.]
1C  blt r1,r3,L       IF  ID  Ex   M    W
20  addi r1,r0,7          IF  ID   NOP  NOP  NOP
24  subi r2,r2,1              IF   NOP  NOP  NOP  NOP
14  L: addi r2,r2,2                IF   ID   Ex   M    W
If branch taken → Zap: prevent PC update, clear IF/ID latch, branch continues

77 [Figure and timing diagram as on slide 76.]
For every taken branch? OUCH!!!

78 Back of the envelope calculation
Branch: 20%, load: 20%, store: 10%, other: 50%
Say, 75% of branches are taken
CPI = 1 + 20% * 75% * 2 = 1 + 0.20 * 0.75 * 2 = 1.3
Branches cause 30% slowdown
Even worse with deeper pipelines
How do we reduce slowdown?
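The same back-of-the-envelope estimate as a small function, so the penalty can be varied. The function name `cpi_with_branch_penalty` is made up for this sketch, and the 4-cycle penalty for a deeper pipeline is a hypothetical value, not a figure from the slides.

```python
# Back-of-the-envelope CPI model: every taken branch costs a fixed
# number of zapped cycles; all other instructions take 1 cycle.
def cpi_with_branch_penalty(branch_fraction, taken_fraction,
                            penalty_cycles, base_cpi=1.0):
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles

# Slide's numbers: 20% branches, 75% taken, 2 zapped instructions.
resolve_in_ex = cpi_with_branch_penalty(0.20, 0.75, 2)

# Hypothetical deeper pipeline where the branch resolves 4 stages in.
deeper = cpi_with_branch_penalty(0.20, 0.75, 4)

print(round(resolve_in_ex, 2), round(deeper, 2))
```

The model ignores load-use stalls and any other hazards; it isolates the cost of taken branches only.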

79 1. Delay Slot
   MIPS ISA: 1 insn after ctrl insn always executed
   Whether branch taken or not
   Your MIPS assembly should do this
2. Resolve Branch at Decode
   Move branch calc from EX to ID
   Alternative: just zap 2nd instruction when branch taken
3. Branch Prediction
   Not in 3410, but every processor worth anything does this

80 for (i = 0; i < max; i++) { n += 2; }
i = 7;
n--;

Assume: i → r1, n → r2, max → r3

x10        addi r1, r0, 0     # i = 0
x14 Loop:  addi r2, r2, 2     # n += 2
x18        addi r1, r1, 1     # i++
x1c        blt  r1, r3, Loop  # i < max?
x20        nop
x24        addi r1, r0, 7     # i = 7
x28        subi r2, r2, 1     # n--

81 [Figure: pipelined datapath; branch calc and decide branch in execute produce the new PC.]
1C  blt r1, r3, Loop    F  D  X
20  nop                    F  D
24  addi r1, r0, 7            F    Zap!

82 A delay slot complicates the design of a processor.
A. True
B. False
C. Cannot tell from the information given
D. I don't know
E. I think E is an awesome answer.

83 [Figure: pipelined datapath with branch calc and decide moved into decode.]
1C  blt r1, r3, Loop      F  D  X
20  nop                      F  D
14  Loop: addi r2,r2,2          F
No Zapping!

84 Back of the envelope calculation
Branch: 20%, load: 20%, store: 10%, other: 50%
Say, 75% of branches are taken
What is the CPI with decode?
CPI = 1 + 20% * 75% * 1 = 1 + 0.20 * 0.75 * 1 = 1.15
30% slowdown → 15% slowdown

85 Resolving branches at decode could slow down the clock frequency of the processor.
A. True
B. False
C. Cannot tell from the information given
D. I don't know
E. I think E is an awesome answer.

86 Because MIPS has a delay slot, the instruction after any control instruction must always be a nop.
A. True
B. False
C. Cannot tell from the information given
D. I don't know
E. I think E is an awesome answer.

87 x10        addi r1, r0, 0     # i = 0
x14 Loop:  addi r2, r2, 2     # n += 2
x18        addi r1, r1, 1     # i++
x1c        blt  r1, r3, Loop  # i < max?
x20        nop

Compiler transforms code:

x10        addi r1, r0, 0     # i = 0
x14 Loop:  addi r1, r1, 1     # i++
x18        blt  r1, r3, Loop  # i < max?
x1c        addi r2, r2, 2     # n += 2 (fills the delay slot)

88 [Figure: pipelined datapath with branch calc and decide in decode.]
1C  blt r1, r3, Loop      F  D  X
20  addi r2,r2,2             F  D
14  Loop: addi r1,r1,1          F
No Nop or Zapping!

89 Most processors support Speculative Execution
  Guess direction of the branch
  Allow instructions to move through pipeline
  Zap them later if guess turns out to be wrong
A must for long pipelines

90 Parameters
  Branch: 20%, load: 20%, store: 10%, other: 50%
  75% of branches are taken
Dynamic branch prediction
  Branches predicted with 95% accuracy
What is the CPI?
CPI = 1 + 20% * 5% * 2 = 1 + 0.20 * 0.05 * 2 = 1.02
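With prediction, the penalty is paid only on mispredicted branches rather than on every taken branch. The sketch below sweeps the prediction accuracy to show how quickly the branch cost shrinks; the function name `cpi_with_prediction` and the accuracies other than 95% are chosen for illustration only.

```python
# CPI model with branch prediction: only mispredictions pay the penalty.
def cpi_with_prediction(branch_fraction, accuracy,
                        mispredict_penalty, base_cpi=1.0):
    mispredict_rate = 1.0 - accuracy
    return base_cpi + branch_fraction * mispredict_rate * mispredict_penalty

# 20% branches and a 2-cycle penalty, as on the slide; the accuracy
# values other than 0.95 are illustrative.
for accuracy in (0.50, 0.90, 0.95, 0.99):
    cpi = cpi_with_prediction(0.20, accuracy, 2)
    print(f"accuracy {accuracy:.0%}: CPI = {cpi:.3f}")
```

In this model the fraction of branches that are taken no longer appears: what matters is how often the guess is wrong, not which way the branch goes.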

91 Data hazards occur when an operand (register) depends on the result of a previous instruction that may not be computed yet. Pipelined processors need to detect data hazards.
Stalling, preventing a dependent instruction from advancing, is one way to resolve data hazards.
Stalling introduces NOPs ("bubbles") into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing the IF/ID registers from changing, and (3) preventing writes to memory and the register file. NOPs significantly decrease performance.
Forwarding bypasses some pipeline stages, forwarding a result to a dependent instruction's operand (register). Better performance than stalling.

92 Control hazards occur because the PC following a control instruction is not known until the control instruction is executed. If the branch is taken → need to zap instructions. 1 cycle performance penalty.
Delay slots can potentially recover performance lost to control hazards. The instruction in the delay slot will always be executed. Requires software (compiler) to make use of the delay slot. Put a nop in the delay slot if no useful instruction can be placed there.
We can reduce the cost of a control hazard by moving branch decision and calculation from the Ex stage to the ID stage. With a delay slot, this removes the need to flush instructions on taken branches.
