Improving Performance: Pipelining!

[Figure: five-stage pipeline IF, ID, EXE, MEM, WB drawn between memory, the general registers and memory]

  IF   Instruction Fetch (includes PC increment)
  ID   Instruction Decode + fetching values from general-purpose registers
  EXE  EXEcute arithmetic/logic operations or address computation
  MEM  MEMory access or branch completion
  WB   Write Back results to general-purpose registers (a.k.a. Commit)

Inf3 Computer Architecture - 2013-2014

Phases of Instruction Execution

- Instruction Fetch
  - InstructionRegister = Mem(INST, PC)
- Decoding
  - Generate datapath control signals
  - Determine register operands
- Operand Assembly
  - Trivial for some ISAs, not for others
  - E.g. select between literal or register operand; operand pre-scaling
  - Sometimes considered to be part of the Decode phase
- Function Evaluation or Address Calculation
  - Add, subtract, shift, logical, etc.
  - Address calculation is simply unsigned addition
- Memory Access (if required)
  - Load: Data = Mem(DATA, MemAddress, Size)
  - Store: MemWrite(DATA, MemAddress, WriteData, Size)
- Completion
  - Update processor state modified by this instruction
  - Interrupts or exceptions may prevent state update from taking place

Instruction fetch

- Fetch from the Instruction Cache at the address given by PC
- Increment PC, i.e. PC = PC + sizeof(instruction)

[Figure: PC drives the instruction memory Address port; an adder computes PC + 4]

MIPS R-type instruction format (revision)

  Field            opcode    rs       rt       rd       shamt    funct
  Width            6 bits    5 bits   5 bits   5 bits   5 bits   6 bits
  add $1, $2, $3   special   $2       $3       $1       -        add
  sll $4, $5, 16   special   -        $5       $4       16       sll

The rd field is the destination register for the R-type format.

MIPS I-type instruction format (revision)

  Field                opcode   rs       rt       immediate value/addr
  Width                6 bits   5 bits   5 bits   16 bits
  lw $1, offset($2)    lw       $2       $1       address offset
  beq $4, $5, .L001    beq      $4       $5       (.L001 - (PC + 4)) >> 2
  addi $1, $2, -10     addi     $2       $1       0xfff6

The rt field is the destination register for a Load.
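The two field layouts above can be checked with a few shifts and masks. The following is a minimal Python sketch, not part of the lecture: the function names are illustrative, and the numeric codes used (funct 0x20 for add, opcode 0x08 for addi) are the standard MIPS encodings rather than values given on the slides.

```python
# Sketch: pack and unpack the MIPS R-type and I-type fields.
# Function names are hypothetical; they do not come from the slides.

def encode_r(rs, rt, rd, shamt, funct, opcode=0):
    """Pack an R-type word: opcode|rs|rt|rd|shamt|funct (6/5/5/5/5/6 bits)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def encode_i(opcode, rs, rt, imm):
    """Pack an I-type word: opcode|rs|rt|imm (6/5/5/16 bits)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def decode(word):
    """Split a 32-bit word into every field; the format decides which ones apply."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
        "imm": word & 0xFFFF,
    }

ADD_FUNCT = 0x20    # funct code for add (opcode 0, "special")
ADDI_OPCODE = 0x08  # opcode for addi

add_word = encode_r(rs=2, rt=3, rd=1, shamt=0, funct=ADD_FUNCT)  # add $1, $2, $3
addi_word = encode_i(ADDI_OPCODE, rs=2, rt=1, imm=-10)           # addi $1, $2, -10
print(hex(add_word), hex(addi_word))
```

Decoding `addi_word` gives back an immediate of 0xfff6, matching the table entry for `addi $1, $2, -10`.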

Reading Registers

- Use source register fields to address the register file and read two registers
- Select the destination register address, according to the format

[Figure: inst[25:21] and inst[20:16] drive register file ports Addr 0 and Addr 1 (outputs Data 0 and Data 1); RegDst selects the Write Addr, with inst[15:11] as the R-type choice]

Extracting the literal operand

- Sign-extend the 16-bit literal field, for those instructions that have a literal

Verilog: lit = { {16{inst[15]}}, inst[15:0] }

[Figure: datapath as before, with a Sign extend unit added on inst[15:0]]
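As a cross-check on the Verilog expression, here is a small Python sketch of the same operation. The function names are illustrative assumptions; the first returns the signed value, the second returns the 32-bit pattern that the replication `{16{inst[15]}}` produces.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits as a two's-complement signed integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def sign_extend16_bits(imm16):
    """Return the 32-bit pattern: bit 15 copied into the upper 16 bits."""
    imm16 &= 0xFFFF
    upper = 0xFFFF if imm16 & 0x8000 else 0x0000
    return (upper << 16) | imm16

print(sign_extend16(0xFFF6))            # literal field of addi $1, $2, -10
print(hex(sign_extend16_bits(0xFFF6)))
```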

Performing the Arithmetic

- Perform arithmetic or logical operation on Data 0 and either Data 1 or the sign-extended literal

[Figure: datapath as before, with an ALU taking Data 0 and either Data 1 or the sign-extended literal]

Inside the ALU

- Adder, Logic Unit, and Barrel Shifter are separate combinational logic blocks

[Figure: ALU internals. Logic unit (AndOp, XorOp, OrOp); adder with inputs A, B, Cin and output Cout (SubtractOp); barrel shifter with shift amount B[4:0] (LeftOp, SignedOp, ShiftOp); a mux selects Result, and an ==0 comparator produces Zero]

Computing Branch Displacements

- Compute sum of PC and scaled, sign-extended literal displacement
- Can't share ALU, it might be needed for comparisons during branch operations

[Figure: datapath as before, with a << 2 shifter and a second adder producing the branch target; PCsrc selects the next PC]
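The displacement arithmetic can be sketched in Python as follows. This assumes the standard MIPS convention that the displacement is relative to PC + 4 (the address of the instruction after the branch); the function names are illustrative and not taken from the slides.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits as a two's-complement signed integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def branch_target(pc, imm16):
    """Target = (PC + 4) + (sign-extended literal << 2), wrapped to 32 bits."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF

def branch_literal(pc, target):
    """Inverse: the 16-bit literal an assembler would place in the beq word."""
    return ((target - (pc + 4)) >> 2) & 0xFFFF

# Forward and backward branches from a branch instruction at 0x1000.
print(hex(branch_target(0x1000, branch_literal(0x1000, 0x1010))))
print(hex(branch_target(0x1000, branch_literal(0x1000, 0x0FF0))))
```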

Accessing Memory: Loads & Stores

- Load and Store instructions use the ALU result as the effective address
- Store instructions use Data 1 as the store data

[Figure: datapath as before, with a Data Memory (ports Address, Write data, Read data) controlled by MemRd and MemWr; LoadReg selects the value written back to the register file]

Decoding Instructions

- Control signals driven by combinational logic, based on instruction opcode

[Figure: Decode logic on inst[31:26] generates LoadReg, MemWr, MemRd, PCsrc, ALUop, ALUsrc and RegDst; a separate ALU decode block uses inst[5:0]; the ALU zero output is shown]

Pipelined Instruction Execution

[Figure: the phases of instruction execution (Fetch, Decode, Execute, Memory, Write) overlapped in time, with a new instruction entering Fetch on each clock]

CPU Pipeline Structure

[Figure: pipelined datapath with stages IF, DEC, EX, MEM, WB separated by pipeline registers. Decode logic produces EX, MEM and WB control, which travels with the instruction; PC+4 and the branch PC (bpc) are carried forward; the ALU zero output feeds the branch decision]

Implementation Issues: Pipeline balance

- Each pipeline stage is a combinational logic network
  - Registered inputs and outputs
  - Longest circuit delay through all stages determines clock period
- Ideally, all delays through every pipeline stage are identical
- In practice this is hard to achieve

[Figure: Pipeline Stage Logic between banks of D flip-flops, clocked by clk1 and clk2 from a clock tree]
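A short Python sketch of why balance matters: the clock period is set by the slowest stage, so an unbalanced pipeline gains less than its depth suggests. The stage delays below are hypothetical numbers chosen for illustration, and register overhead is ignored.

```python
def clock_period(stage_delays):
    """The longest stage delay determines the clock period."""
    return max(stage_delays.values())

def balance_speedup(stage_delays):
    """Ratio of total (unpipelined) delay to the pipelined clock period."""
    return sum(stage_delays.values()) / clock_period(stage_delays)

# Hypothetical delays in picoseconds, not measurements from the lecture.
unbalanced = {"IF": 200, "DEC": 100, "EX": 200, "MEM": 200, "WB": 100}
balanced = {"IF": 160, "DEC": 160, "EX": 160, "MEM": 160, "WB": 160}

print(clock_period(unbalanced), balance_speedup(unbalanced))
print(clock_period(balanced), balance_speedup(balanced))
```

Both pipelines have the same total logic delay (800 ps), but only the balanced one reaches a ratio equal to the five-stage depth.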

Representing a sequence of instructions

- Space-time diagram of pipeline
- Think of each instruction as a time-shifted pipeline

[Figure: space-time diagram over cycles c1 to c10 for Instruction 1 to Instruction 5]

Information flow constraints

- Information from one instruction to any successor must always move from left to right

[Figure: space-time diagram over cycles c1 to c10 for Instruction 1 to Instruction 5]

Another way to represent pipeline timing

- A similar, and slightly simpler, way to represent pipeline timing:
  - Clock cycles progress left to right
  - Instructions progress top to bottom
  - Time at which each instruction is present in each pipeline stage is shown by labelling the appropriate cell with the pipeline stage name
- This form is used in H&P, and throughout the remainder of these notes.

Instruction \ cycle   1    2    3    4    5    6    7    8    9
instruction 1         IF   DEC  EX   MEM  WB
instruction 2              IF   DEC  EX   MEM  WB
instruction 3                   IF   DEC  EX   MEM  WB
instruction 4                        IF   DEC  EX   MEM  WB
instruction 5                             IF   DEC  EX   MEM  WB
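The tabular form is easy to generate programmatically. The sketch below is an illustration only (the function names are assumptions): it builds the ideal, stall-free table in which instruction i enters IF in cycle i.

```python
STAGES = ["IF", "DEC", "EX", "MEM", "WB"]

def timing_table(n_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c + 1."""
    n_cycles = n_instructions + len(stages) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * n_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # each instruction starts one cycle after the previous one
        rows.append(row)
    return rows

def render(rows):
    """Format the rows as a fixed-width text table."""
    header = "Instruction \\ cycle".ljust(22) + "".join(
        str(c + 1).ljust(5) for c in range(len(rows[0])))
    lines = [header.rstrip()]
    for i, row in enumerate(rows):
        cells = "".join(cell.ljust(5) for cell in row)
        lines.append((f"instruction {i + 1}".ljust(22) + cells).rstrip())
    return "\n".join(lines)

print(render(timing_table(5)))
```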

Pipeline Hazards

- Hazards are pipeline events that restrict the pipeline flow
- They occur in circumstances where two or more activities cannot proceed in parallel
- There are three types of hazard:
  - Structural Hazards
    - Arise from resource conflicts, when a set of actions has to be performed sequentially because there are not sufficient resources to operate in parallel
  - Data Hazards
    - Occur when one instruction depends on the result of a previous instruction, and that result is not yet available. These hazards are exposed by the overlapped execution of instructions in a pipeline
  - Control Hazards
    - These arise from the pipelining of branch instructions, and other activities that change the PC.

Structural Hazards

- Multi-cycle operations
- Memory or register file port restrictions

Example structural hazard caused by having only one memory port (in cycle 4 the lw is in MEM while instruction 4 is in IF):

Instruction \ cycle   1    2    3    4    5    6    7    8    9    10
lw $1, ($2)           IF   DEC  EX   MEM  WB
instruction 2              IF   DEC  EX   MEM  WB
instruction 3                   IF   DEC  EX   MEM  WB
instruction 4                        IF   DEC  EX   MEM  WB
instruction 5                             IF   DEC  EX   MEM  WB

Effect is to STALL instruction 4, delaying its entry to IF by one cycle:

Instruction \ cycle   1    2    3    4    5    6    7    8    9    10
lw $1, ($2)           IF   DEC  EX   MEM  WB
instruction 2              IF   DEC  EX   MEM  WB
instruction 3                   IF   DEC  EX   MEM  WB
instruction 4                             IF   DEC  EX   MEM  WB
instruction 5                                  IF   DEC  EX   MEM  WB

Data Hazards

- Overlapped execution of instructions means information may be required before it is available.

[Figure: space-time diagram over cycles c1 to c10 for the sequence below, in which every later instruction reads R1]

    ADD R1, R2, R3
    SUB R4, R1, R5
    AND R6, R1, R7
    OR  R8, R1, R9
    XOR R10, R1, R11

Data hazards lead to pipeline stalls

- SUB instruction must wait until R1 has been written to register file
- All subsequent instructions are similarly delayed

[Figure: space-time diagram over cycles c1 to c10 for the sequence below, with SUB held in a STALL]

    ADD R1, R2, R3
    SUB R4, R1, R5     (STALL)
    AND R6, R1, R7
    OR  R8, R1, R9
    XOR R10, R1, R11

Minimising data hazards by data-forwarding

- Key idea is to bypass the register file and forward information, as soon as it becomes available within the pipeline, to the place it is needed.

[Figure: space-time diagram over cycles c1 to c10 for the sequence below, with forwarding paths from ADD to its dependants]

    ADD R1, R2, R3
    SUB R4, R1, R5
    AND R6, R1, R7
    OR  R8, R1, R9
    XOR R10, R1, R11

CPU pipeline showing forwarding paths

[Figure: pipelined datapath (IF, DEC, EX, MEM, WB) as before, with a Dependency checks block in the decode stage and forwarding paths from later stages back to the ALU inputs]

Data hazards requiring a stall

- Hazards involving the use of a Load result usually require a stall, even if forwarding is implemented

[Figure: space-time diagram over cycles c1 to c10 for the sequence below; SUB stalls for one cycle while the load reaches MEM]

    LW  R1, (R2)
    SUB R4, R1, R5     (STALL)
    AND R6, R1, R7
    OR  R8, R1, R9
    XOR R10, R1, R11

Code scheduling to avoid stalls (before)

- Hazards involving the use of a Load may be avoided by reordering the code

[Figure: space-time diagram over cycles c1 to c10 for the sequence below]

    LW  R1, 2(R2)
    LW  R3, 4(R1)      (STALL: waits for R1)
    ADD R4, R4, R3     (STALL: waits for R3)
    ADD R1, R1, 4
    SUB R9, R9, 1

Code scheduling to avoid stalls (after)

- SUB is entirely independent of the other instructions: place it after the 1st load
- ADD to R1 can be placed after LW to R3 to hide the load delay on R3

[Figure: space-time diagram over cycles c1 to c10 for the sequence below, with no stalls]

    LW  R1, 2(R2)
    SUB R9, R9, 1
    LW  R3, 4(R1)
    ADD R1, R1, 4
    ADD R4, R4, R3
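The stall counts in the data-hazard and scheduling examples can be reproduced with a very small timing model. The Python sketch below is illustrative, not the lecture's own method. It assumes an in-order five-stage pipeline in which, with forwarding, an ALU result can feed the next instruction's EX stage and a loaded value is usable one cycle later; without forwarding it assumes a consumer's DEC stage can share a cycle with the producer's WB (register file written before it is read). The tuple layout and function name are hypothetical.

```python
def count_stalls(program, forwarding=True):
    """program: list of (dest, sources, is_load) tuples in program order.

    Returns the number of stall cycles relative to one instruction per cycle.
    """
    last_writer = {}  # register -> (EX cycle of producer, producer is a load)
    prev_ex = 0
    for dest, sources, is_load in program:
        ex = prev_ex + 1                    # in-order issue, at most one per cycle
        for reg in sources:
            if reg in last_writer:
                p_ex, p_is_load = last_writer[reg]
                if not forwarding:
                    gap = 3                 # consumer DEC no earlier than producer WB
                elif p_is_load:
                    gap = 2                 # loaded value available only after MEM
                else:
                    gap = 1                 # ALU result forwarded from EX to EX
                ex = max(ex, p_ex + gap)
        if dest is not None:
            last_writer[dest] = (ex, is_load)
        prev_ex = ex
    return prev_ex - len(program)

before = [
    ("R1", ["R2"], True),         # LW  R1, 2(R2)
    ("R3", ["R1"], True),         # LW  R3, 4(R1)
    ("R4", ["R4", "R3"], False),  # ADD R4, R4, R3
    ("R1", ["R1"], False),        # ADD R1, R1, 4
    ("R9", ["R9"], False),        # SUB R9, R9, 1
]
after = [before[0], before[4], before[1], before[3], before[2]]

print(count_stalls(before), count_stalls(after))
```

Under these assumptions the original order incurs two stall cycles and the rescheduled order none, in line with the two diagrams.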

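General Performance Impact of Hazards

Speedup from pipelining:

$$S = \frac{\mathrm{CPI}_{\text{unpipelined}}}{\mathrm{CPI}_{\text{pipelined}}} \times \frac{\text{clock}_{\text{unpipelined}}}{\text{clock}_{\text{pipelined}}}$$

$$\mathrm{CPI}_{\text{pipelined}} = \text{ideal CPI} + \text{stall cycles per instruction} = 1 + \text{stall cycles per instruction}$$

$$\mathrm{CPI}_{\text{unpipelined}} \approx \text{pipeline depth}, \qquad \frac{\text{clock}_{\text{unpipelined}}}{\text{clock}_{\text{pipelined}}} \approx 1$$

$$S = \frac{\text{pipeline depth}}{1 + \text{stall cycles per instruction}}$$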
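To get a feel for the final formula, the sketch below evaluates it for a five-stage pipeline. The function name and the stall rates are illustrative assumptions, not figures from the lecture.

```python
def pipeline_speedup(depth, stalls_per_instruction):
    """S = pipeline depth / (1 + stall cycles per instruction)."""
    return depth / (1.0 + stalls_per_instruction)

# Hypothetical stall rates for a 5-stage pipeline.
for stalls in (0.0, 0.25, 1.0):
    print(stalls, pipeline_speedup(5, stalls))
```

With no stalls the speedup equals the pipeline depth; one stall cycle per instruction halves it.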

Impact of Empty Load-delay Slots on CPI

[Figure: H&P Fig. A.48. Stacked bar chart of CPI (0 to 3) per benchmark (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor), split into Base CPI, Load stalls, Branch stalls, FP result stalls and FP structural stalls]

- Bottom line: CPI increase of 0.01 to 0.27 cycles

Control Hazards

- When a branch is executed, PC is not affected until the branch instruction reaches the MEM stage.
- By this time 3 instructions have been fetched from the fall-through path.

[Figure: space-time diagram over cycles c1 to c10 for the sequence below]

           BEQZ R1, label
           SUB  R4, R2, R5
           AND  R6, R2, R7
           OR   R8, R2, R9
           :
    label: XOR  R10, R1, R11

Kill instructions in EX, DEC and IF as they move forwards.

Effect of branch penalty on CPI

- In this example pipeline the cost of each branch is:
  - 1 cycle, if the branch is not taken
  - 4 cycles, if the branch is taken
- If an equal number of branches are taken and not taken, the average branch costs 0.5*1 + 0.5*4 = 2.5 cycles. If 20% of all instructions are branches (a reasonable assumption), then
  - CPI = 0.8 + 0.2*2.5 = 1.3
  - This is a significant reduction in performance
- If the pipeline were deeper, with 2 stages for ALU and 2 stages for Decode, then:
  - Cost of taken branch would be 6 cycles
  - CPI = 0.8 + 0.2*3.5 = 1.5
- Deeper pipelines have greater branch penalties, and potentially higher CPI
  - Pentium 4 (Prescott) had 31 pipeline stages (this was too deep)
- Several important techniques have been developed to reduce branch penalties
  - Early branch outcome
  - Delayed branches
  - Branch prediction (static and dynamic)
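The CPI arithmetic above generalises to any branch mix. The following Python sketch (the function name and parameter names are illustrative) recomputes both figures from the stated assumptions: 20% branches, half of them taken, and every non-branch instruction costing one cycle.

```python
def cpi_with_branches(branch_fraction, taken_fraction, not_taken_cost, taken_cost):
    """Average CPI when non-branch instructions cost exactly one cycle."""
    avg_branch_cost = (taken_fraction * taken_cost
                       + (1.0 - taken_fraction) * not_taken_cost)
    return (1.0 - branch_fraction) * 1.0 + branch_fraction * avg_branch_cost

five_stage = cpi_with_branches(0.2, 0.5, not_taken_cost=1, taken_cost=4)
deeper = cpi_with_branches(0.2, 0.5, not_taken_cost=1, taken_cost=6)
print(round(five_stage, 2), round(deeper, 2))
```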

Early branch outcome calculation - BEQZ, BNEZ

[Figure: pipelined datapath (IF, DEC, EX, MEM, WB) as before, with the branch target adder and an "RD0 == 0?" comparison on register read data 0 moved into the decode stage]

Delayed branch execution

- Always execute the instruction immediately after the branch, regardless of branch outcome.

Before: instruction after the branch gets killed if the branch is taken

           SUB  R4, R2, R5
           BEQZ R1, label
           OR   R8, R2, R9
           :
    label: XOR  R10, R1, R11

After: by moving the SUB instruction into the branch delay slot, and executing it unconditionally, the 1-cycle penalty is eliminated

           BEQZ R1, label
           SUB  R4, R2, R5     (branch delay slot)
    label: XOR  R10, R1, R11

Impact of Branch Hazards on CPI

[Figure: H&P Fig. A.48. Stacked bar chart of CPI (0 to 3) per benchmark (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor), split into Base CPI, Load stalls, Branch stalls, FP result stalls and FP structural stalls]

- Bottom line: CPI increase of 0.06 to 0.62 cycles