Improving Performance: Pipelining

[Figure: pipeline stages IF, ID, EXE, MEM, WB, drawn against the memory and general-register resources they use]

IF   Instruction Fetch (includes PC increment)
ID   Instruction Decode + fetching values from general purpose registers
EXE  EXEcute arithmetic/logic operations or address computation
MEM  MEMory access or branch completion
WB   Write Back results to general purpose registers (a.k.a. Commit)

Inf3 Computer Architecture - 2013-2014

Phases of Instruction Execution
- Instruction Fetch
  - InstructionRegister = Mem(INST, PC)
- Decoding
  - Generate datapath control signals
  - Determine register operands
- Operand Assembly
  - Trivial for some ISAs, not for others
  - E.g. select between literal or register operand; operand pre-scaling
  - Sometimes considered to be part of the Decode phase
- Function Evaluation or Address Calculation
  - Add, subtract, shift, logical, etc.
  - Address calculation is simply unsigned addition
- Memory Access (if required)
  - Load: Data = Mem(DATA, MemAddress, Size)
  - Store: MemWrite(DATA, MemAddress, WriteData, Size)
- Completion
  - Update processor state modified by this instruction
  - Interrupts or exceptions may prevent state update from taking place

Instruction fetch
- Read from Instruction Cache at address given by PC
- Increment PC, i.e. PC = PC + sizeof(instruction)

[Figure: PC drives the Address input of the instruction memory, which returns the instruction on Data; an adder computes PC + 4]

MIPS R-type instruction format (revision)

                    6 bits    5 bits   5 bits   5 bits   5 bits   6 bits
                    opcode    reg rs   reg rt   reg rd   shamt    funct

  add $1, $2, $3    special   $2       $3       $1       -        add
  sll $4, $5, 16    special   -        $5       $4       16       sll

reg rd is the destination register for the R-type format.
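The field layout above can be checked with a few shifts and masks. The following is a minimal Python sketch, not part of the original slides: the helper names encode_r_type and decode_r_type are hypothetical, and the numeric values used for "special" (0), add (0x20) and sll (0x00) are the standard MIPS32 encodings, which the slide itself does not list.

```python
def encode_r_type(opcode, rs, rt, rd, shamt, funct):
    """Pack the six R-type fields into one 32-bit instruction word."""
    return ((opcode & 0x3F) << 26) | ((rs & 0x1F) << 21) | ((rt & 0x1F) << 16) \
        | ((rd & 0x1F) << 11) | ((shamt & 0x1F) << 6) | (funct & 0x3F)


def decode_r_type(word):
    """Split a 32-bit R-type instruction word into its six fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # inst[31:26]
        "rs":     (word >> 21) & 0x1F,  # inst[25:21]
        "rt":     (word >> 16) & 0x1F,  # inst[20:16]
        "rd":     (word >> 11) & 0x1F,  # inst[15:11], destination register
        "shamt":  (word >> 6) & 0x1F,   # inst[10:6]
        "funct":  word & 0x3F,          # inst[5:0]
    }


# Assumed MIPS32 values: opcode "special" = 0, funct add = 0x20, funct sll = 0x00.
ADD_WORD = encode_r_type(0, rs=2, rt=3, rd=1, shamt=0, funct=0x20)   # add $1, $2, $3
SLL_WORD = encode_r_type(0, rs=0, rt=5, rd=4, shamt=16, funct=0x00)  # sll $4, $5, 16

if __name__ == "__main__":
    print(hex(ADD_WORD), decode_r_type(ADD_WORD))
    print(hex(SLL_WORD), decode_r_type(SLL_WORD))
```

Decoding is pure bit selection, which is why the register-address fields can be wired straight from the instruction word to the register file.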

MIPS I-type instruction format (revision)

                        6 bits    5 bits   5 bits   16 bits
                        opcode    reg rs   reg rt   immediate value/addr

  lw $1, offset($2)     lw        $2       $1       address offset
  beq $4, $5, .L001     beq       $4       $5       (.L001 - (PC + 4)) >> 2
  addi $1, $2, -10      addi      $2       $1       0xfff6

reg rt is the destination register for a Load.

Reading Registers
- Use source register fields to address the register file and read two registers
- Select the destination register address, according to the format

[Figure: datapath with PC, +4 adder, instruction memory and register file. inst[25:21] and inst[20:16] drive Addr 0 and Addr 1 of the register file, which outputs Data 0 and Data 1; RegDst selects the Write Addr, with inst[15:11] as one of the candidates]

Extracting the literal operand
- Sign-extend the 16-bit literal field, for those instructions that have a literal

Verilog: lit = { {16{inst[15]}}, inst[15:0] }

[Figure: the register-read datapath with a Sign extend block added, fed by inst[15:0]]
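The same sign extension can be sketched in Python to check values by hand, for instance the addi literal 0xfff6. The function names here are hypothetical, chosen for illustration only.

```python
def sign_extend16_bits(value):
    """Return the 32-bit pattern obtained by replicating bit 15 into bits 31:16.

    Mirrors the Verilog concatenation { {16{inst[15]}}, inst[15:0] }.
    """
    value &= 0xFFFF
    return (0xFFFF0000 | value) if value & 0x8000 else value


def sign_extend16(value):
    """Return the signed integer represented by a 16-bit two's-complement literal."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


if __name__ == "__main__":
    print(hex(sign_extend16_bits(0xFFF6)), sign_extend16(0xFFF6))
```

The first function gives the bit pattern the hardware produces; the second gives the number that pattern stands for.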

Performing the Arithmetic
- Perform arithmetic or logical operation on Data 0 and either Data 1 or the sign-extended literal

[Figure: the datapath with an ALU added, taking Data 0 and either Data 1 or the sign-extended literal as inputs]

Inside the ALU
- Adder, Logic Unit, and Barrel Shifter are separate combinational logic blocks

[Figure: ALU internals. Logic unit on A and B, controlled by AndOp, XorOp, OrOp; adder on A and B with Cin and Cout, controlled by SubtractOp; barrel shifter with shift amount B[4:0], controlled by LeftOp and SignedOp; a mux, using ShiftOp, produces Result; an ==0 test on Result produces Zero]
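A behavioural sketch of such an ALU in Python may help when checking results by hand. It is not the slide's circuit: the operation names are hypothetical, a 32-bit word is assumed, and it assumes the barrel shifter shifts operand A by the amount in B[4:0], which the extracted figure does not make explicit.

```python
MASK32 = 0xFFFFFFFF


def alu(a, b, op):
    """Return (result, zero) for a 32-bit ALU operation.

    op is one of: add, sub, and, or, xor, sll, srl, sra (hypothetical names).
    """
    a &= MASK32
    b &= MASK32
    shift = b & 0x1F  # assumption: shift amount comes from B[4:0]
    if op == "add":
        result = a + b
    elif op == "sub":
        # Subtraction reuses the adder: A + (not B) + carry-in of 1.
        result = a + (~b & MASK32) + 1
    elif op == "and":
        result = a & b
    elif op == "or":
        result = a | b
    elif op == "xor":
        result = a ^ b
    elif op == "sll":
        result = a << shift
    elif op == "srl":
        result = a >> shift
    elif op == "sra":
        signed_a = a - (1 << 32) if a & 0x80000000 else a
        result = signed_a >> shift  # Python's >> on negative ints is arithmetic
    else:
        raise ValueError("unknown ALU operation: %s" % op)
    result &= MASK32
    return result, result == 0


if __name__ == "__main__":
    print(alu(5, 5, "sub"))
    print(tuple(hex(x) if i == 0 else x for i, x in enumerate(alu(0x80000000, 4, "sra"))))
```

The zero output is what a branch-if-equal needs: subtract the two operands and test whether the result is zero.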

Computing Branch Displacements
- Compute sum of PC and scaled, sign-extended literal displacement
- Cannot share the ALU: it might be needed for comparisons during branch operations

[Figure: the datapath with a second adder that sums PC + 4 and the sign-extended literal shifted left by 2 (<< 2); PCsrc selects the next PC]
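The branch-target arithmetic can be sketched in Python as follows. The function names are hypothetical; the sketch assumes 4-byte instructions and that the displacement is added to PC + 4, as the "4 Add ... << 2 Add" path in the figure suggests.

```python
def sign_extend16(value):
    """Signed value of a 16-bit two's-complement literal."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, inst):
    """Target address of a taken branch: (PC + 4) + (sign-extended literal << 2)."""
    literal = sign_extend16(inst & 0xFFFF)
    return (pc + 4 + (literal << 2)) & 0xFFFFFFFF


def branch_literal(pc, target):
    """16-bit literal an assembler would store for a branch at pc to target."""
    return ((target - (pc + 4)) >> 2) & 0xFFFF


if __name__ == "__main__":
    pc = 0x1000
    lit = branch_literal(pc, 0x0FF8)          # backward branch
    print(hex(lit), hex(branch_target(pc, lit)))
```

Scaling by 4 lets a 16-bit literal reach 2^15 instructions in either direction, because the two low address bits of an aligned instruction are always zero.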

Accessing Memory: Loads & Stores
- Load and Store instructions use the ALU result as the effective address
- Store instructions use Data 1 as the store data

[Figure: the datapath with a Data Memory added (Address, Write data, Read data), with MemRd, MemWr and LoadReg control signals]

Decoding Instructions
- Control signals driven by combinational logic, based on instruction opcode

[Figure: the complete datapath. Decode logic takes inst[31:26] and drives LoadReg, MemWr, MemRd, PCsrc, ALUop, ALUsrc and RegDst; an ALU decode block takes inst[5:0]; the ALU provides a zero output]

Pipelined Instruction Execution

[Figure: the phases of instruction execution (Fetch, Decode, Execute, Memory, Write) plotted as action against time in clock cycles; each successive instruction is fetched one clock later, so the phases of consecutive instructions overlap]

CPU Pipeline Structure

[Figure: the datapath divided into stages IF, DEC, EX, MEM, WB. Pipeline registers between stages carry the EX, MEM and WB control signals from the decode logic, together with PC+4 and the branch target (bpc); the ALU zero output feeds the branch decision]

Implementation Issues: Pipeline balance
- Each pipeline stage is a combinational logic network
  - Registered inputs and outputs
  - Longest circuit delay through all stages determines clock period
- Ideally, all delays through every pipeline stage are identical
- In practice this is hard to achieve

[Figure: pipeline stage logic between D/Q registers, clocked by clk1 and clk2 derived from a clock tree]
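The effect of imbalance can be illustrated with a small Python sketch. The stage delays below are made-up numbers for illustration, not measurements from any real design, and the function names are hypothetical.

```python
def clock_period(stage_delays, register_overhead=0.0):
    """Clock period is set by the slowest stage, plus any register overhead."""
    return max(stage_delays) + register_overhead


def balance_limited_speedup(stage_delays, register_overhead=0.0):
    """Speedup over an unpipelined design whose cycle is the sum of all stage delays."""
    return sum(stage_delays) / clock_period(stage_delays, register_overhead)


if __name__ == "__main__":
    unbalanced = [2.0, 1.5, 2.5, 2.0, 1.0]   # hypothetical delays in ns
    balanced = [1.8, 1.8, 1.8, 1.8, 1.8]     # same total, evenly split
    print(clock_period(unbalanced), balance_limited_speedup(unbalanced))
    print(clock_period(balanced), balance_limited_speedup(balanced))
```

With the same total logic delay, the balanced split reaches the full five-fold improvement in clock rate, while the unbalanced one is held back by its 2.5 ns stage.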

Representing a sequence of instructions
- Space-time diagram of pipeline
- Think of each instruction as a time-shifted pipeline

[Figure: space-time diagram showing Instructions 1 to 5 against clock cycles c1 to c10]

Information flow constraints
- Information from one instruction to any successor must always move from left to right

[Figure: space-time diagram showing Instructions 1 to 5 against clock cycles c1 to c10]

Another way to represent pipeline timing
- A similar, and slightly simpler, way to represent pipeline timing:
  - Clock cycles progress left to right
  - Instructions progress top to bottom
  - Time at which each instruction is present in each pipeline stage is shown by labelling the appropriate cell with the pipeline stage name
- This form is used in H&P, and throughout the remainder of these notes.

  Instruction \ cycle   1    2    3    4    5    6    7    8    9
  instruction 1         IF   DEC  EX   MEM  WB
  instruction 2              IF   DEC  EX   MEM  WB
  instruction 3                   IF   DEC  EX   MEM  WB
  instruction 4                        IF   DEC  EX   MEM  WB
  instruction 5                             IF   DEC  EX   MEM  WB
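A timing table of this form is easy to generate. The Python sketch below (function names are hypothetical) builds one row per instruction for an ideal pipeline with no stalls, where instruction i enters IF in cycle i.

```python
STAGES = ["IF", "DEC", "EX", "MEM", "WB"]


def timing_table(n_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c+1."""
    n_cycles = n_instructions + len(stages) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * n_cycles
        for offset, name in enumerate(stages):
            row[i + offset] = name  # instruction i is one cycle behind instruction i-1
        rows.append(row)
    return rows


def render(rows, width=5):
    """Format the rows as a plain-text table with a cycle-number header."""
    header = "Instruction \\ cycle".ljust(22)
    header += "".join(str(c + 1).ljust(width) for c in range(len(rows[0])))
    lines = [header]
    for i, row in enumerate(rows):
        label = ("instruction %d" % (i + 1)).ljust(22)
        lines.append(label + "".join(cell.ljust(width) for cell in row))
    return "\n".join(line.rstrip() for line in lines)


if __name__ == "__main__":
    print(render(timing_table(5)))
```

Five instructions finish in 5 + 5 - 1 = 9 cycles, which is why the table has nine columns.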

Pipeline Hazards
- Hazards are pipeline events that restrict the pipeline flow
- They occur in circumstances where two or more activities cannot proceed in parallel
- There are three types of hazard:
  - Structural Hazards
    - Arise from resource conflicts, when a set of actions have to be performed sequentially because there is not sufficient resource to operate in parallel
  - Data Hazards
    - Occur when one instruction depends on the result of a previous instruction, and that result is not yet available. These hazards are exposed by the overlapped execution of instructions in a pipeline
  - Control Hazards
    - These arise from the pipelining of branch instructions, and other activities that change the PC.

Structural Hazards
- Multi-cycle operations
- Memory or register file port restrictions

Example structural hazard caused by having only one memory port:

  Instruction \ cycle   1    2    3    4    5    6    7    8    9    10
  lw $1, ($2)           IF   DEC  EX   MEM  WB
  instruction 2              IF   DEC  EX   MEM  WB
  instruction 3                   IF   DEC  EX   MEM  WB
  instruction 4                        IF   DEC  EX   MEM  WB
  instruction 5                             IF   DEC  EX   MEM  WB

Effect is to STALL instruction 4, delaying its entry to IF by one cycle:

  Instruction \ cycle   1    2    3    4    5    6    7    8    9    10
  lw $1, ($2)           IF   DEC  EX   MEM  WB
  instruction 2              IF   DEC  EX   MEM  WB
  instruction 3                   IF   DEC  EX   MEM  WB
  instruction 4                             IF   DEC  EX   MEM  WB
  instruction 5                                  IF   DEC  EX   MEM  WB

Data Hazards
- Overlapped execution of instructions means information may be required before it is available.

[Figure: space-time diagram over cycles c1 to c10 for the sequence below, in which every instruction after the ADD reads R1]

  ADD R1, R2, R3
  SUB R4, R1, R5
  AND R6, R1, R7
  OR  R8, R1, R9
  XOR R10, R1, R11

Data hazards lead to pipeline stalls
- SUB instruction must wait until R1 has been written to register file
- All subsequent instructions are similarly delayed

[Figure: space-time diagram over cycles c1 to c10 for the sequence below, with the SUB marked STALL]

  ADD R1, R2, R3
  SUB R4, R1, R5     (STALL)
  AND R6, R1, R7
  OR  R8, R1, R9
  XOR R10, R1, R11
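The stall count for a sequence like this can be estimated with a small model. The Python sketch below is an illustration under stated assumptions, not the slide's exact timing: it assumes a five-stage in-order pipeline with no forwarding, registers read in DEC and written in WB, and (by default) a register file that can be written and then read within the same cycle. The function name and program encoding are hypothetical.

```python
def stall_cycles_no_forwarding(program, same_cycle_write_read=True):
    """Count stall cycles for a list of (dest, sources) instructions.

    Assumed pipeline: IF, DEC, EX, MEM, WB; operands are read in DEC and
    results are written in WB, three cycles after DEC.
    """
    ready = {}      # register -> earliest cycle in which a DEC read sees its new value
    prev_dec = 1    # so that the first instruction is in DEC in cycle 2
    stalls = 0
    for dest, sources in program:
        earliest_dec = prev_dec + 1  # in-order: one cycle behind the predecessor
        needed = max([ready.get(reg, 0) for reg in sources], default=0)
        dec = max(earliest_dec, needed)
        stalls += dec - earliest_dec
        if dest is not None:
            write_back = dec + 3
            ready[dest] = write_back if same_cycle_write_read else write_back + 1
        prev_dec = dec
    return stalls


PROGRAM = [
    ("R1", ["R2", "R3"]),     # ADD R1, R2, R3
    ("R4", ["R1", "R5"]),     # SUB R4, R1, R5
    ("R6", ["R1", "R7"]),     # AND R6, R1, R7
    ("R8", ["R1", "R9"]),     # OR  R8, R1, R9
    ("R10", ["R1", "R11"]),   # XOR R10, R1, R11
]

if __name__ == "__main__":
    print(stall_cycles_no_forwarding(PROGRAM))
    print(stall_cycles_no_forwarding(PROGRAM, same_cycle_write_read=False))
```

Only the SUB stalls in this model; the later instructions are delayed by the same amount but incur no additional stall, because R1 has been written by the time they reach DEC.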

Minimising data hazards by data-forwarding
- Key idea is to bypass the register file and forward information, as soon as it becomes available within the pipeline, to the place it is needed.

[Figure: space-time diagram over cycles c1 to c10 for the sequence below, showing forwarding of R1 to the following instructions]

  ADD R1, R2, R3
  SUB R4, R1, R5
  AND R6, R1, R7
  OR  R8, R1, R9
  XOR R10, R1, R11

CPU pipeline showing forwarding paths

[Figure: the pipelined datapath (stages IF, DEC, EX, MEM, WB) with a Dependency checks block and forwarding paths added alongside the register file]

Data hazards requiring a stall
- Hazards involving the use of a Load result usually require a stall, even if forwarding is implemented

[Figure: space-time diagram over cycles c1 to c10 for the sequence below; the SUB is stalled for a cycle before proceeding through Reg, ALU, Mem, Reg]

  LW  R1, (R2)
  SUB R4, R1, R5     (STALL)
  AND R6, R1, R7
  OR  R8, R1, R9
  XOR R10, R1, R11

Code scheduling to avoid stalls (before)
- Hazards involving the use of a Load may be avoided by reordering the code

[Figure: space-time diagram over cycles c1 to c10 for the sequence below; the two instructions that use a just-loaded register are each stalled]

  LW  R1, 2(R2)
  LW  R3, 4(R1)      (STALL)
  ADD R4, R4, R3     (STALL)
  ADD R1, R1, 4
  SUB R9, R9, 1

Code scheduling to avoid stalls (after)
- SUB is entirely independent of other instructions: place it after the 1st load
- ADD to R1 can be placed after LW to R3 to hide the load delay on R3

[Figure: space-time diagram over cycles c1 to c10 for the reordered sequence below]

  LW  R1, 2(R2)
  SUB R9, R9, 1
  LW  R3, 4(R1)
  ADD R1, R1, 4
  ADD R4, R4, R3
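The benefit of the reordering can be checked with a simple load-use model. The Python sketch below is an illustration under stated assumptions: full forwarding, an ALU result usable by the very next instruction, and a load result usable only after one intervening cycle. The function name and program encoding are hypothetical.

```python
def load_use_stalls(program):
    """Count stall cycles for a list of (op, dest, sources) with full forwarding.

    Assumption: a value produced by an ALU instruction can be forwarded to the
    EX stage of the next instruction; a value produced by LW becomes available
    one cycle later (after MEM), so an immediate consumer stalls for one cycle.
    """
    available = {}  # register -> first cycle in which an EX stage can consume it
    prev_ex = 0
    stalls = 0
    for op, dest, sources in program:
        earliest_ex = prev_ex + 1
        needed = max([available.get(reg, 0) for reg in sources], default=0)
        ex = max(earliest_ex, needed)
        stalls += ex - earliest_ex
        if dest is not None:
            available[dest] = ex + (2 if op == "LW" else 1)
        prev_ex = ex
    return stalls


BEFORE = [
    ("LW", "R1", ["R2"]),         # LW  R1, 2(R2)
    ("LW", "R3", ["R1"]),         # LW  R3, 4(R1)
    ("ADD", "R4", ["R4", "R3"]),  # ADD R4, R4, R3
    ("ADD", "R1", ["R1"]),        # ADD R1, R1, 4
    ("SUB", "R9", ["R9"]),        # SUB R9, R9, 1
]

AFTER = [
    ("LW", "R1", ["R2"]),         # LW  R1, 2(R2)
    ("SUB", "R9", ["R9"]),        # SUB R9, R9, 1
    ("LW", "R3", ["R1"]),         # LW  R3, 4(R1)
    ("ADD", "R1", ["R1"]),        # ADD R1, R1, 4
    ("ADD", "R4", ["R4", "R3"]),  # ADD R4, R4, R3
]

if __name__ == "__main__":
    print(load_use_stalls(BEFORE), load_use_stalls(AFTER))
```

Under this model the original order incurs two stall cycles and the reordered code none, while both orders compute the same values because no true dependence is reversed.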

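General Performance Impact of Hazards

Speedup from pipelining:

\[ S = \frac{\mathrm{CPI}_{\text{unpipelined}}}{\mathrm{CPI}_{\text{pipelined}}} \times \frac{\text{clock}_{\text{unpipelined}}}{\text{clock}_{\text{pipelined}}} \]

\[ \mathrm{CPI}_{\text{pipelined}} = \text{ideal CPI} + \text{stall cycles per instruction} = 1 + \text{stall cycles per instruction} \]

\[ \mathrm{CPI}_{\text{unpipelined}} \approx \text{pipeline depth}, \qquad \frac{\text{clock}_{\text{unpipelined}}}{\text{clock}_{\text{pipelined}}} \approx 1 \]

\[ S = \frac{\text{pipeline depth}}{1 + \text{stall cycles per instruction}} \]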
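The final formula is easy to evaluate for a few cases. The Python sketch below uses a hypothetical function name and made-up stall rates, chosen only to show how quickly stalls erode the ideal speedup.

```python
def pipeline_speedup(depth, stalls_per_instruction):
    """S = pipeline depth / (1 + stall cycles per instruction)."""
    return depth / (1.0 + stalls_per_instruction)


if __name__ == "__main__":
    for stalls in (0.0, 0.25, 0.5, 1.0):   # illustrative stall rates
        print(stalls, pipeline_speedup(5, stalls))
```

With a depth of 5, a quarter of a stall cycle per instruction already reduces the speedup from 5 to 4, and one stall cycle per instruction halves it.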

Impact of Empty Load-delay Slots on CPI

[Figure: H&P Fig. A.48. Stacked bar chart of CPI (axis 0 to 3) per benchmark (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor), broken down into Base CPI, Load stalls, Branch stalls, FP result stalls and FP structural stalls]

- Bottom line: CPI increase of 0.01 to 0.27 cycles

Control Hazards
- When a branch is executed, PC is not affected until the branch instruction reaches the MEM stage.
- By this time 3 instructions have been fetched from the fall-through path.
- Kill instructions in EX, DEC and IF as they move forwards

[Figure: space-time diagram over cycles c1 to c10 for the sequence below]

         BEQZ R1, label
         SUB  R4, R2, R5
         AND  R6, R2, R7
         OR   R8, R2, R9
         :
  label: XOR  R10, R1, R11

Effect of branch penalty on CPI
- In this example pipeline the cost of each branch is:
  - 1 cycle, if the branch is not taken
  - 4 cycles, if the branch is taken
- If an equal number of branches are taken and not taken, and if 20% of all instructions are branches (a reasonable assumption), then
  - CPI = 0.8 + 0.2*2.5 = 1.3
  - This is a significant reduction in performance
- If the pipeline was deeper, with 2 stages for ALU and 2 stages for Decode, then:
  - Cost of taken branch would be 6 cycles
  - CPI = 0.8 + 0.2*3.5 = 1.5
- Deeper pipelines have greater branch penalties, and potentially higher CPI
  - Pentium 4 (Prescott) had 31 pipeline stages (this was too deep)
- Several important techniques have been developed to reduce branch penalties
  - Early branch outcome
  - Delayed branches
  - Branch prediction (static and dynamic)
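The two CPI figures above follow from a weighted average, sketched here in Python. The function name and parameter names are hypothetical; non-branch instructions are assumed to cost exactly one cycle, as in the slide's arithmetic.

```python
def cpi_with_branches(branch_fraction, taken_fraction, not_taken_cost, taken_cost):
    """Average CPI when non-branch instructions take 1 cycle each."""
    average_branch_cost = (taken_fraction * taken_cost
                           + (1.0 - taken_fraction) * not_taken_cost)
    return (1.0 - branch_fraction) * 1.0 + branch_fraction * average_branch_cost


if __name__ == "__main__":
    # 20% branches, half of them taken
    print(round(cpi_with_branches(0.2, 0.5, 1, 4), 3))   # five-stage example
    print(round(cpi_with_branches(0.2, 0.5, 1, 6), 3))   # deeper pipeline
```

Changing taken_cost from 4 to 6 cycles is the only difference between the two cases, which isolates the effect of pipeline depth on the branch penalty.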

Early branch outcome calculation - BEQZ, BNEZ

[Figure: the pipelined datapath (stages IF, DEC, EX, MEM, WB) with an "RD0 == 0?" test added on the register read data, so that the outcome of BEQZ and BNEZ is known before the branch reaches the ALU]

Delayed branch execution
- Always execute the instruction immediately after the branch, regardless of branch outcome.

Before: instruction after the branch gets killed if the branch is taken

         SUB  R4, R2, R5
         BEQZ R1, label
         OR   R8, R2, R9
         :
  label: XOR  R10, R1, R11

After: by moving the SUB instruction into the branch delay slot, and executing it unconditionally, the 1-cycle penalty is eliminated

         BEQZ R1, label
         SUB  R4, R2, R5      (branch delay slot)
  label: XOR  R10, R1, R11

[Figure: space-time diagrams over cycles c1 to c10 for both sequences]

Impact of Branch Hazards on CPI

[Figure: H&P Fig. A.48. Stacked bar chart of CPI (axis 0 to 3) per benchmark (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor), broken down into Base CPI, Load stalls, Branch stalls, FP result stalls and FP structural stalls]

- Bottom line: CPI increase of 0.06 to 0.62 cycles