Out-of-order Pipeline. Register Read. OOO execution (2-wide)


Out-of-order Pipeline

[Figure: pipeline stages. Fetch, Decode, Rename, Dispatch, buffer of instructions, Issue, Reg-read, Execute, Writeback, Commit. Labels: in-order front end, out-of-order execution.]

Register Read

When do instructions read the register file?

Option #1: after select, right before execute
  (Not done at decode)
  Read physical register (renamed)
  Or get value via bypassing (based on physical register name)
  This is Pentium 4, MIPS R10K, Alpha 21264 style
  Physical register file may be large
    Multi-cycle read

Option #2: as part of issue, keep values in Issue Queue
  Pentium Pro, Core, Core i7

OOO execution (2-wide)

[Figure: 2-wide out-of-order issue example stepped over several slides. Each issue queue entry carries ready (RDY) bits for its physical source registers.]
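The 2-wide issue step in the OOO execution slides can be sketched in a few lines. This is a minimal illustration, not the slides' own code: the `IQEntry` fields, the oldest-first selection policy, the single-cycle latency, and the opcode and register names in the example are all assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass
class IQEntry:
    age: int       # smaller = older (assumed selection priority)
    op: str        # hypothetical opcode name, for printing only
    srcs: tuple    # physical source registers
    dst: str       # physical destination register

def select(queue, ready, width=2):
    """Pick up to `width` of the oldest entries whose sources are all ready."""
    candidates = [e for e in queue if all(s in ready for s in e.srcs)]
    return sorted(candidates, key=lambda e: e.age)[:width]

def issue_cycle(queue, ready, width=2):
    """One cycle: select, remove from the queue, then wake up dependents."""
    chosen = select(queue, ready, width)
    for e in chosen:
        queue.remove(e)
        ready.add(e.dst)   # wakeup; single-cycle ops assumed
    return [e.op for e in chosen]

queue = [
    IQEntry(0, "xor", ("p1", "p2"), "p6"),
    IQEntry(1, "add", ("p6", "p4"), "p7"),
    IQEntry(2, "sub", ("p5", "p2"), "p8"),
    IQEntry(3, "addi", ("p8",), "p9"),
]
ready = {"p1", "p2", "p4", "p5"}
print(issue_cycle(queue, ready))   # the two independent instructions
print(issue_cycle(queue, ready))   # their dependents, one cycle later
```

Because selection happens before wakeup within a cycle, a dependent instruction issues no earlier than the cycle after its producer. A multi-cycle producer would defer the `ready.add` by its latency.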

OOO execution (2-wide)

[Figure: final steps of the 2-wide issue example.]
Note similarity to in-order

Multi-cycle operations

Multi-cycle ops (load, fp, multiply, etc.)
  Wakeup deferred a few cycles
  Structural hazard?
Cache misses?
  Speculative wake-up (assume hit)
  Cancel exec of dependents
  Re-issue later
  Details: complicated, not important

Re-order Buffer (ROB)

All instructions in order
Two purposes
  Misprediction recovery
  In-order commit
    Maintain appearance of in-order execution
    Freeing of physical registers

RENAMING REVISITED

Renaming revisited

Overwritten register
  Freed at commit
  Restore in map table on recovery
    Branch mis-prediction recovery
  Also must be read at rename

[Figure: renaming example stepped over several slides, showing the original insns, their renamed forms, and the register map table.]
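A small sketch of renaming that also records the overwritten register, as the slide requires. The instruction sequence, opcode names, and register numbers below are hypothetical stand-ins, and the tuple layout is an assumption made for illustration.

```python
def rename(insns, map_table, free_list):
    """Rename (op, srcs, dst) tuples that use logical register names.

    Returns (op, physical srcs, new physical dst, overwritten physical reg).
    """
    renamed = []
    for op, srcs, dst in insns:
        psrcs = [map_table[s] for s in srcs]   # read sources first
        overwritten = map_table[dst]           # old mapping, read at rename
        new = free_list.pop(0)                 # allocate a free physical reg
        map_table[dst] = new
        renamed.append((op, psrcs, new, overwritten))
    return renamed

# Hypothetical starting state: r1..r5 map to p1..p5, p6..p9 are free.
map_table = {f"r{i}": f"p{i}" for i in range(1, 6)}
free_list = ["p6", "p7", "p8", "p9"]
insns = [
    ("xor", ["r1", "r2"], "r3"),
    ("add", ["r3", "r4"], "r4"),
    ("sub", ["r5", "r2"], "r3"),
    ("addi", ["r3"], "r1"),
]
for row in rename(insns, map_table, free_list):
    print(row)
```

The overwritten register is kept with each instruction so that commit can free it and recovery can put it back in the map table.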

[Figure: renaming example, continued.]

ROB

ROB entry holds all info for recover/commit
  Logical register names
  Physical register names
  Instruction types
Dispatch: insert at tail
  Full? Stall
Commit: remove from head
  Not completed? Stall

Recovery

Completely remove wrong path instructions
  Flush from IQ
  Remove from ROB
  Restore map table to before misprediction
  Free destination registers

[Figure: recovery example stepped over several slides. A mispredicted bnz to loop is followed by wrong-path instructions. They are removed one at a time, youngest first, restoring the map table and freeing their destination registers, until only the bnz remains.]
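The recovery steps listed above can be sketched as a walk back from the ROB tail. This is one possible illustration under stated assumptions: the dictionary-based ROB entries, the `recover` function name, and all register numbers are hypothetical.

```python
def recover(rob, iq, map_table, free_list, branch_index):
    """Squash every instruction younger than rob[branch_index]."""
    while len(rob) - 1 > branch_index:
        entry = rob.pop()                                   # youngest first
        if entry in iq:
            iq.remove(entry)                                # flush from IQ
        map_table[entry["logical"]] = entry["overwritten"]  # restore mapping
        free_list.append(entry["new"])                      # free destination
    return rob

# Hypothetical state after renaming four wrong-path instructions behind a branch.
branch = {"op": "bnz"}
wrong_path = [
    {"op": "xor", "logical": "r3", "new": "p6", "overwritten": "p3"},
    {"op": "add", "logical": "r4", "new": "p7", "overwritten": "p4"},
    {"op": "sub", "logical": "r3", "new": "p8", "overwritten": "p6"},
    {"op": "addi", "logical": "r1", "new": "p9", "overwritten": "p1"},
]
rob = [branch] + wrong_path
iq = list(wrong_path[2:])            # assume the last two have not issued yet
map_table = {"r1": "p9", "r2": "p2", "r3": "p8", "r4": "p7", "r5": "p5"}
free_list = ["p10"]

recover(rob, iq, map_table, free_list, branch_index=0)
print(map_table)    # back to the mapping in place at the branch
print(free_list)
```

Undoing the entries youngest-first matters when two squashed instructions write the same logical register: the older one's overwritten register is restored last, so it is the one that ends up in the map table.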

What about stores

Stores: Write D$, not registers
  Can we rename memory?
  Recover in the cache?
No (at least not easily)
  Cache writes unrecoverable
  Stores: only when certain
    Commit

Commit

At commit: instruction becomes architected state
  In order
  Only when instructions are finished
  Free overwritten register (why?)

Freeing over-written register

Before the instruction: r3 maps to p3
After the instruction: r3 maps to its newly allocated physical register
  Insns older than it read p3
  Insns younger than it read the new register (until next r3-writing instruction)
At commit of the instruction, no older instructions exist
  No one else needs p3: free it!

Commit Example

[Figure: commit example stepped over several slides, showing the remaining instructions, the map table for r1 through r5, and the free list growing as overwritten registers are released.]
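In-order commit, including the two side effects discussed above (stores write the D$ only at commit, and the overwritten register is freed), might look like the following sketch. The entry layout, the `commit` function, the 2-wide limit as a parameter, and the example values are assumptions for illustration only.

```python
from collections import deque

def commit(rob, free_list, dcache, width=2):
    """Retire up to `width` finished instructions from the ROB head, in order."""
    retired = []
    while rob and len(retired) < width and rob[0]["done"]:
        entry = rob.popleft()
        if entry.get("store") is not None:
            addr, value = entry["store"]
            dcache[addr] = value                     # store becomes visible now
        if entry.get("overwritten") is not None:
            free_list.append(entry["overwritten"])   # nobody can read it anymore
        retired.append(entry["op"])
    return retired

# Hypothetical ROB contents: opcode names and registers are illustrative.
rob = deque([
    {"op": "xor", "done": True, "overwritten": "p3"},
    {"op": "add", "done": False, "overwritten": "p4"},
    {"op": "st", "done": True, "store": (0x40, 7)},
])
free_list, dcache = [], {}

print(commit(rob, free_list, dcache))   # only the head retires; add is unfinished
rob[0]["done"] = True                   # add completes
print(commit(rob, free_list, dcache))   # add and st retire together (2-wide)
print(free_list, dcache)
```

The finished store waits behind the unfinished add, which is what keeps cache writes from ever needing to be undone.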

[Figure: commit example, continued.]

Out-of-order pipeline diagrams

Standard style: large and cumbersome
Change layout slightly
  Columns = stages (dispatch, issue, etc.)
  Rows = instructions
  Content of boxes = cycles
For our purposes: issue/exec = 1 cycle
  Ignore preg read latency, etc.
  Load-use, mul, div, and FP longer

[Figure: pipeline stages. Fetch, Decode, Rename, Dispatch, buffer of instructions, Issue, Reg-read, Execute, Writeback, Commit.]

Example: four instructions, the first and last of which are loads
  2-wide
  Infinite ROB, IQ, Pregs
  Loads: 3 cycles

Cycle 1: Dispatch the 1st ld and the 2nd instruction

Cycle 2: Dispatch the 3rd instruction and the 2nd ld
  1st ld issues; also note its WB cycle while you do this
  (Note: don't issue if WB ports full)
Cycle 3: the 2nd and 3rd instructions are not ready
  2nd load is ready: issue it
Cycle 4: nothing
Cycle 5: the 2nd instruction can issue
Cycle 6: 1st load can commit (oldest instruction & finished)
  the 3rd instruction can issue
Cycle 7: the 2nd instruction can commit (oldest instruction & finished)
Cycle 8: the 3rd instruction and the 2nd ld can commit (2-wide: can do both at once)

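Completed diagram (cycle in which each instruction passes each stage):

Insn              Disp  Issue  WB  Commit
1st ld              1     2     5    6
2nd instruction     1     5     6    7
3rd instruction     2     6     7    8
2nd ld              2     3     6    8

[Figure: pipeline stages. Fetch, Decode, Rename, Dispatch, buffer of instructions, Issue, Reg-read, Execute, Writeback, Commit.]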
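The cycle numbers in the walk-through can be reproduced with a short calculation. The sketch below makes several assumptions that the slides do not spell out: the 2nd instruction reads the 1st load's result, the 3rd instruction reads the 2nd's, non-load operations take 1 cycle, an instruction commits no earlier than the cycle after writeback, and issue ports are never the bottleneck. The `Insn` class and `diagram` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Insn:
    name: str
    producers: tuple = ()   # indices of older instructions whose results are read
    latency: int = 1        # issue-to-writeback cycles (loads: 3)

def diagram(insns, width=2):
    """Return (name, dispatch, issue, writeback, commit) cycles per instruction."""
    rows, per_cycle, last_commit = [], {}, 0
    for i, insn in enumerate(insns):
        disp = i // width + 1                          # in-order, `width` per cycle
        operands = max((rows[p][3] for p in insn.producers), default=0)
        issue = max(disp + 1, operands)                # wait for producers' WB
        wb = issue + insn.latency
        commit = max(wb + 1, last_commit)              # in order, after WB
        while per_cycle.get(commit, 0) >= width:       # at most `width` per cycle
            commit += 1
        per_cycle[commit] = per_cycle.get(commit, 0) + 1
        last_commit = commit
        rows.append((insn.name, disp, issue, wb, commit))
    return rows

program = [
    Insn("1st ld", latency=3),
    Insn("2nd insn", producers=(0,)),
    Insn("3rd insn", producers=(1,)),
    Insn("2nd ld", latency=3),
]
for row in diagram(program):
    print(row)
```

The 2nd ld issues in cycle 3, ahead of the two older instructions that are waiting on the first load, but it still commits last because commit is in order.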