COSC 6385 Computer Architecture - Tomasulo's Algorithm


Fall 2008

Analyzing a short code-sequence

    DIV.D F0, F2, F4
    ADD.D F6, F0, F8
    S.D   F6, 0(R1)
    SUB.D F8, F10, F14
    MUL.D F6, F10, F8

Analyzing a short code-sequence: 3 true data dependencies

    DIV.D F0, F2, F4
    ADD.D F6, F0, F8
    S.D   F6, 0(R1)
    SUB.D F8, F10, F14
    MUL.D F6, F10, F8

- DIV.D -> ADD.D through F0
- ADD.D -> S.D through F6
- SUB.D -> MUL.D through F8

Analyzing a short code-sequence: anti-dependencies (WAR hazards)

- ADD.D reads F8, which the later SUB.D overwrites
- S.D reads F6, which the later MUL.D overwrites

Analyzing a short code-sequence: output dependency (WAW hazard)

- ADD.D and MUL.D both write F6

Analyzing a short code-sequence: register renaming

    DIV.D F0, F2, F4
    ADD.D S, F0, F8
    S.D   S, 0(R1)
    SUB.D T, F10, F14
    MUL.D F6, F10, T

- Renaming some registers removes the WAR and WAW hazards
- Any subsequent use of F8 must be replaced by T
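The dependency analysis above can be checked mechanically. The following is a minimal sketch, not part of the original slides: it compares every ordered pair of instructions and reports RAW, WAR and WAW relations on register names. The tuple encoding and the names `PROGRAM`, `RENAMED` and `find_dependencies` are assumptions made for this illustration, and the check is deliberately naive (it ignores intervening writes, which does not matter for this five-instruction sequence).

```python
from itertools import combinations

# (opcode, destination register or None, source registers)
# S.D writes memory, so it has no destination register; R1 is its address base.
PROGRAM = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]

# Same sequence after renaming with the temporaries S and T.
RENAMED = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "S",  ("F0", "F8")),
    ("S.D",   None, ("S", "R1")),
    ("SUB.D", "T",  ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "T")),
]

def find_dependencies(program):
    """Return (kind, earlier_index, later_index, register) for each pair i < j."""
    found = []
    for (i, (_, dst_i, src_i)), (j, (_, dst_j, src_j)) in combinations(enumerate(program), 2):
        if dst_i is not None and dst_i in src_j:
            found.append(("RAW", i, j, dst_i))   # later reads what earlier writes
        if dst_j is not None and dst_j in src_i:
            found.append(("WAR", i, j, dst_j))   # later overwrites what earlier reads
        if dst_i is not None and dst_i == dst_j:
            found.append(("WAW", i, j, dst_i))   # both write the same register
    return found

for kind, i, j, reg in find_dependencies(PROGRAM):
    print(kind, PROGRAM[i][0], "->", PROGRAM[j][0], "on", reg)
```

Running the same function on `RENAMED` leaves only the three true dependencies, which is the point of the renaming slide.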

Tomasulo's Algorithm

- Register renaming is provided by reservation stations
  - Buffer the operands of instructions waiting to issue
  - Fetch an operand as soon as it is available
  - Eliminates the need to get an operand from a register
  - Pending instructions designate the reservation station that will provide their input
  - For overlapping successive writes to a register: only the last one actually updates the register

Tomasulo's Algorithm

- Typically more reservation stations than registers
- Hazard detection is distributed (instead of centralized as in the scoreboard)
- Results are passed directly to the functional units from the reservation stations where they are buffered, over a common data bus (CDB), rather than through the registers
- Each reservation station holds the opcode of the pending instruction and either the operand values or the names of the reservation stations that will provide them
- Load and store buffers hold data and addresses for memory access

[Figure: block diagram. Instruction queue (fed from the instruction unit), FP registers, address unit for load-store operations, load buffers, store buffers (data and address), reservation stations for FP operations, FP adders, FP multipliers, memory unit, common data bus]

Tomasulo's Algorithm

- Load/store buffers:
  - Hold the components of the effective address
  - Hold the destination memory address (= effective address)
  - Hold the value

Tomasulo's Algorithm

- Only three steps per instruction; each step can take an arbitrary number of clock cycles
- Issue: get the next instruction from the FIFO instruction queue
  - Search for a matching empty reservation station
  - If found: issue the instruction with its operand values
  - If not found: structural hazard -> instruction stalls
  - If operands are not in the registers: keep track of the functional units producing them

Tomasulo's Algorithm

- Execute:
  - If operands are not yet available: monitor the common data bus
  - When all operands are available: execute
- Write result:
  - Write the data on the CDB and from there into the registers and into any reservation stations waiting for it
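The issue and write-result steps can be sketched in a few lines. This is an illustrative model under stated assumptions, not the implementation from the lecture: the class and method names (`Station`, `Tomasulo`, `issue`, `write_result`) are hypothetical, only FP arithmetic stations are modeled (no load/store buffers, no timing), and the register values are made up. The fields `vj`/`vk` hold operand values, `qj`/`qk` hold the name of the station that will produce a missing operand, and `reg_status` records which station will write each register.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Station:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # producing station while a value is pending
    qk: Optional[str] = None

class Tomasulo:
    def __init__(self, station_names, regs):
        self.stations = {n: Station(n) for n in station_names}
        self.regs = dict(regs)     # register file values
        self.reg_status = {}       # register -> station that will write it

    def _read(self, reg):
        """Return (value, tag); tag is set while the register is still being produced."""
        tag = self.reg_status.get(reg)
        return (None, tag) if tag else (self.regs[reg], None)

    def issue(self, op, dst, src1, src2, kind):
        """Issue into a free station of the given kind; None means structural stall."""
        free = next((s for s in self.stations.values()
                     if s.name.startswith(kind) and not s.busy), None)
        if free is None:
            return None
        free.busy, free.op = True, op
        free.vj, free.qj = self._read(src1)
        free.vk, free.qk = self._read(src2)
        self.reg_status[dst] = free.name   # rename: dst now means "result of this station"
        return free.name

    def ready(self, name):
        s = self.stations[name]
        return s.busy and s.qj is None and s.qk is None

    def write_result(self, name, value):
        """Broadcast (name, value) on the CDB, update waiters and registers, free the station."""
        for s in self.stations.values():
            if s.qj == name:
                s.vj, s.qj = value, None
            if s.qk == name:
                s.vk, s.qk = value, None
        for reg, tag in list(self.reg_status.items()):
            if tag == name:                # only the latest writer still owns the register
                self.regs[reg] = value
                del self.reg_status[reg]
        self.stations[name] = Station(name)

# Hypothetical register contents, chosen only for illustration.
m = Tomasulo(["Add1", "Add2", "Add3", "Mult1", "Mult2"],
             {"F0": 0.0, "F2": 2.0, "F4": 4.0, "F6": 6.0, "F8": 8.0})
mult = m.issue("MUL.D", "F0", "F2", "F4", kind="Mult")   # goes to Mult1
add = m.issue("ADD.D", "F6", "F0", "F8", kind="Add")     # goes to Add1, waits on Mult1
waiting_on = m.stations[add].qj
m.write_result(mult, 2.0 * 4.0)                          # CDB broadcast wakes up Add1
```

Because a later instruction that writes the same register simply overwrites the entry in `reg_status`, an older in-flight result never reaches the register file, which is how the WAW case is handled in this sketch.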

Data fields for reservation stations

- Op: operation to perform on source operands S1 and S2
- Qj, Qk: reservation stations producing the operands
- Vj, Vk: value of each operand
- A: holds information for the memory address calculation (immediate field, then effective address)
- Busy: indicates occupied functional units/reservation stations
- Qi (register status): number of the reservation station that will produce the data to be stored in this register

The same example as for scoreboarding

    L.D   F6, 34(R2)
    L.D   F2, 45(R3)
    MUL.D F0, F2, F4
    SUB.D F8, F6, F2
    ADD.D F6, F8, F2

The cycle-by-cycle slides also track a divide that is issued between SUB.D and ADD.D and reads F0 and F6 (DIV.D F10, F0, F6 in the Hennessy and Patterson version of this example).

Following slides are based on a lecture by Jelena Mirkovic, University of Delaware
http://www.cis.udel.edu/~sunshine/courses/f04/cis662/class12.pdf

Assumptions:
- ADD and SUB take 2 clock cycles
- MULT takes 10 clock cycles
- DIV takes 40 clock cycles
- 2 Load/Store, 3 ADD and 2 Mult functional units/reservation stations

Time=1: Issue first load

    Load1: Busy=Yes, Op=Load, Vj=Regs[R2], A=34
    Register status: F6=Load1

Time=2: First load calculates its address. Second load issued

    Load1: Busy=Yes, Op=Load, A=Regs[R2]+34
    Load2: Busy=Yes, Op=Load, Vj=Regs[R3], A=45
    Register status: F2=Load2, F6=Load1

Time=3: First load reads from memory. Second load calculates its address. MUL.D issued

    Load1: Busy=Yes, Op=Load, A=Regs[R2]+34
    Load2: Busy=Yes, Op=Load, A=Regs[R3]+45
    Mult1: Busy=Yes, Op=Mult, Vk=Regs[F4], Qj=Load2
    Register status: F0=Mult1, F2=Load2, F6=Load1

Time=4: First load writes result. Second load reads from memory. MUL.D stalled, SUB.D issued

    Load2: Busy=Yes, Op=Load, A=Regs[R3]+45
    Add1:  Busy=Yes, Op=Sub, Vj=Mem[34+Regs[R2]], Qk=Load2
    Mult1: Busy=Yes, Op=Mult, Vk=Regs[F4], Qj=Load2
    Register status: F0=Mult1, F2=Load2, F8=Add1

Time=5: Second load writes result. MUL.D stalled, SUB.D stalled, DIV.D issued

    Add1:  Busy=Yes, Op=Sub, Vj=Mem[34+Regs[R2]], Vk=Mem[45+Regs[R3]]
    Mult1: Busy=Yes, Op=Mult, Vj=Mem[45+Regs[R3]], Vk=Regs[F4]
    Mult2: Busy=Yes, Op=Div, Vk=Mem[34+Regs[R2]], Qj=Mult1
    Register status: F0=Mult1, F8=Add1, F10=Mult2

Time=6: MUL.D executes (1/10), SUB.D executes (1/2), DIV.D stalled, ADD.D issued

    Add1:  Busy=Yes, Op=Sub, Vj=Mem[34+Regs[R2]], Vk=Mem[45+Regs[R3]]
    Add2:  Busy=Yes, Op=Add, Vk=Mem[45+Regs[R3]], Qj=Add1
    Mult1: Busy=Yes, Op=Mult, Vj=Mem[45+Regs[R3]], Vk=Regs[F4]
    Mult2: Busy=Yes, Op=Div, Vk=Mem[34+Regs[R2]], Qj=Mult1
    Register status: F0=Mult1, F6=Add2, F8=Add1, F10=Mult2

Time=7: MUL.D executes (2/10), SUB.D executes (2/2), DIV.D stalled, ADD.D stalled

    Add1:  Busy=Yes, Op=Sub, Vj=Mem[34+Regs[R2]], Vk=Mem[45+Regs[R3]]
    Add2:  Busy=Yes, Op=Add, Vk=Mem[45+Regs[R3]], Qj=Add1
    Mult1: Busy=Yes, Op=Mult, Vj=Mem[45+Regs[R3]], Vk=Regs[F4]
    Mult2: Busy=Yes, Op=Div, Vk=Mem[34+Regs[R2]], Qj=Mult1
    Register status: F0=Mult1, F6=Add2, F8=Add1, F10=Mult2

Time=8: MUL.D executes (3/10), SUB.D writes result, DIV.D stalled, ADD.D stalled

    Add2:  Busy=Yes, Op=Add, Vj=Mem[34+Regs[R2]]-Mem[45+Regs[R3]], Vk=Mem[45+Regs[R3]]
    Mult1: Busy=Yes, Op=Mult, Vj=Mem[45+Regs[R3]], Vk=Regs[F4]
    Mult2: Busy=Yes, Op=Div, Vk=Mem[34+Regs[R2]], Qj=Mult1
    Register status: F0=Mult1, F6=Add2, F10=Mult2

Time=9: MUL.D executes (4/10), DIV.D stalled, ADD.D executes (1/2)

    Add2:  Busy=Yes, Op=Add, Vj=Mem[34+Regs[R2]]-Mem[45+Regs[R3]], Vk=Mem[45+Regs[R3]]
    Mult1: Busy=Yes, Op=Mult, Vj=Mem[45+Regs[R3]], Vk=Regs[F4]
    Mult2: Busy=Yes, Op=Div, Vk=Mem[34+Regs[R2]], Qj=Mult1
    Register status: F0=Mult1, F6=Add2, F10=Mult2

Time=10: MUL.D executes (5/10), DIV.D stalled, ADD.D executes (2/2)

    Add2:  Busy=Yes, Op=Add, Vj=Mem[34+Regs[R2]]-Mem[45+Regs[R3]], Vk=Mem[45+Regs[R3]]
    Mult1: Busy=Yes, Op=Mult, Vj=Mem[45+Regs[R3]], Vk=Regs[F4]
    Mult2: Busy=Yes, Op=Div, Vk=Mem[34+Regs[R2]], Qj=Mult1
    Register status: F0=Mult1, F6=Add2, F10=Mult2

Time=11: MUL.D executes (6/10), DIV.D stalled, ADD.D writes result

    Mult1: Busy=Yes, Op=Mult, Vj=Mem[45+Regs[R3]], Vk=Regs[F4]
    Mult2: Busy=Yes, Op=Div, Vk=Mem[34+Regs[R2]], Qj=Mult1
    Register status: F0=Mult1, F10=Mult2

Time=16: MUL.D writes result, DIV.D stalled

    Mult2: Busy=Yes, Op=Div, Vj=Mem[45+Regs[R3]]*Regs[F4], Vk=Mem[34+Regs[R2]]
    Register status: F10=Mult2

Time=17: DIV.D executes (1/40)

    Mult2: Busy=Yes, Op=Div, Vj=Mem[45+Regs[R3]]*Regs[F4], Vk=Mem[34+Regs[R2]]
    Register status: F10=Mult2

Time=57: DIV.D writes result

    All reservation stations are free
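The cycle numbers in this walkthrough follow a simple pattern, sketched here as a small helper. The rule is inferred from the slides rather than stated on them: an instruction that is already issued starts executing in the cycle after its last operand is written on the CDB, executes for its full latency, and writes its result in the following cycle. The names `LATENCY` and `write_cycle` are made up for this illustration.

```python
# Execution latencies assumed on the slides (in clock cycles).
LATENCY = {"ADD": 2, "SUB": 2, "MULT": 10, "DIV": 40}

def write_cycle(operands_ready_cycle, op):
    """Cycle in which `op` writes its result, given the cycle of its last operand broadcast."""
    first_exec = operands_ready_cycle + 1          # wakes up one cycle after the broadcast
    last_exec = first_exec + LATENCY[op] - 1       # occupies the unit for LATENCY cycles
    return last_exec + 1                           # result goes on the CDB the cycle after

second_load_write = 5                              # from the Time=5 slide
sub_write = write_cycle(second_load_write, "SUB")  # 8
mult_write = write_cycle(second_load_write, "MULT")  # 16
add_write = write_cycle(sub_write, "ADD")          # 11, ADD waits for SUB
div_write = write_cycle(mult_write, "DIV")         # 57, DIV waits for MULT
```

The helper ignores issue order and contention for functional units, which is sufficient here because every instruction has been issued before its operands arrive.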

Some remarks

- To preserve exception behavior, no instruction is allowed to initiate execution until all branches preceding the instruction have completed
- Loads and stores can be executed in a different order if they access different addresses
  - Not easy to verify, since 100(R3) can point to the same effective address as 0(R5)!
  - -> A load must wait for any uncompleted stores to the same effective memory address
  - -> A store must wait until there are no unexecuted loads/stores to the same memory address

Some remarks (II)

- Effective memory address calculation has to be executed in order
- For a load operation:
  - Calculate the effective memory address
  - Check for conflicts with all active (= pending) store buffers
  - If conflict: load stalls (bypassing memory and taking the data from the store buffer directly to the load buffer is often done)
  - Else: execute load
- For a store operation:
  - Similarly check for conflicts with both active load and store buffers
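The load and store checks described above can be sketched as follows. This is a simplified illustration, not the hardware mechanism itself: the names `PendingStore`, `resolve_load` and `store_may_execute` are hypothetical, addresses are plain integers, and the register contents for R3 and R5 are made-up values chosen so that 100(R3) and 0(R5) alias.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PendingStore:
    address: int               # effective address, computed in program order
    value: Optional[float]     # None until the value to store has arrived

def effective_address(offset, base_value):
    return offset + base_value

def resolve_load(address, pending_stores):
    """Decide how a load proceeds given the earlier, still pending stores (oldest first)."""
    for store in reversed(pending_stores):     # the youngest matching store wins
        if store.address == address:
            if store.value is None:
                return ("stall", None)         # conflict, data not yet available
            return ("forward", store.value)    # bypass memory via the store buffer
    return ("memory", None)                    # no conflict: read from memory

def store_may_execute(address, pending_load_addresses, pending_stores):
    """A store waits while any earlier load or store to the same address is pending."""
    return (address not in pending_load_addresses
            and all(s.address != address for s in pending_stores))

regs = {"R3": 0, "R5": 100}                        # hypothetical values: the two addresses alias
store_addr = effective_address(100, regs["R3"])    # 100(R3)
load_addr = effective_address(0, regs["R5"])       # 0(R5)
pending = [PendingStore(store_addr, None)]
first = resolve_load(load_addr, pending)           # store data missing: load stalls
pending[0].value = 3.5
second = resolve_load(load_addr, pending)          # data forwarded from the store buffer
third = resolve_load(load_addr + 8, pending)       # different address: goes to memory
```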

A loop based example

    Loop: LD    F0, 0(R1)
          MULTD F4, F0, F2
          SD    F4, 0(R1)
          SUBI  R1, R1, #8
          BNEZ  R1, Loop

- This time assume multiply takes 4 clocks
- Assume the 1st load takes 8 clocks total (1 effective address + 7 memory access; L1 cache miss), the 2nd load takes 1 clock (hit)
- To be clear, will show clocks for SUBI, BNEZ
- Reality: integer instructions ahead of floating-point instructions
- Show 2 iterations

Slide based on a lecture by David A. Patterson, University of California, Berkeley
http://www.cs.berkeley.edu/~pattrsn/252s01

The slides track L.D F0, 0(R1); MUL.D F4, F0, F2; S.D F4, 0(R1) for the 1st and 2nd iteration.

Time=1: Issue first load

    Instruction status (issue/exec complete/write result): 1st L.D 1/-/-
    Load1: Busy=Yes, Op=Load, Vj=Regs[R1], A=0
    Register status: F0=Load1

Time=2: first load effective address calc., issue mult

    Instruction status (issue/exec complete/write result): 1st L.D 1/-/-, 1st MUL.D 2/-/-
    Load1: Busy=Yes, Op=Load, A=Regs[R1]+0
    Mult1: Busy=Yes, Op=Mult, Vk=Regs[F2], Qj=Load1
    Register status: F0=Load1, F4=Mult1

Time=3: first load mem. access (1/7), mult stalled, issue store

    Instruction status (issue/exec complete/write result): 1st L.D 1/-/-, 1st MUL.D 2/-/-, 1st S.D 3/-/-
    Load1:  Busy=Yes, Op=Load, A=Regs[R1]+0
    Store1: Busy=Yes, Op=Store, Vj=Regs[R1], Qk=Mult1, A=0
    Mult1:  Busy=Yes, Op=Mult, Vk=Regs[F2], Qj=Load1
    Register status: F0=Load1, F4=Mult1

Time=4: first load exec (2/7), mult stall, store eff. addr., calc SUBI (not shown)

    Instruction status (issue/exec complete/write result): 1st L.D 1/-/-, 1st MUL.D 2/-/-, 1st S.D 3/-/-
    Load1:  Busy=Yes, Op=Load, A=Regs[R1]+0
    Store1: Busy=Yes, Op=Store, Qk=Mult1, A=Regs[R1]+0
    Mult1:  Busy=Yes, Op=Mult, Vk=Regs[F2], Qj=Load1
    Register status: F0=Load1, F4=Mult1

Time=5: first load exec (3/7), mult stall, store stall, BNEZ (not shown)

    Instruction status (issue/exec complete/write result): 1st L.D 1/-/-, 1st MUL.D 2/-/-, 1st S.D 3/-/-
    Load1:  Busy=Yes, Op=Load, A=Regs[R1]+0
    Store1: Busy=Yes, Op=Store, Qk=Mult1, A=Regs[R1]+0
    Mult1:  Busy=Yes, Op=Mult, Vk=Regs[F2], Qj=Load1
    Register status: F0=Load1, F4=Mult1

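Time=6: first load exec (4/7), mult stall, store stall, issue load

    Instruction status (issue/exec complete/write result): 1st L.D 1/-/-, 1st MUL.D 2/-/-, 1st S.D 3/-/-, 2nd L.D 6/-/-
    Load1:  Busy=Yes, Op=Load, A=Regs[R1]+0
    Load2:  Busy=Yes, Op=Load, Vj=Regs[R1], A=0
    Store1: Busy=Yes, Op=Store, Qk=Mult1, A=Regs[R1]+0
    Mult1:  Busy=Yes, Op=Mult, Vk=Regs[F2], Qj=Load1
    Register status: F0=Load2, F4=Mult1

Time=7: first load exec (5/7), mult stall, store stall, load2 eff. addr., issue mult2

    Instruction status (issue/exec complete/write result): 1st L.D 1/-/-, 1st MUL.D 2/-/-, 1st S.D 3/-/-, 2nd L.D 6/-/-, 2nd MUL.D 7/-/-
    Load1:  Busy=Yes, Op=Load, A=Regs[R1]+0
    Load2:  Busy=Yes, Op=Load, A=Regs[R1]+0
    Store1: Busy=Yes, Op=Store, Qk=Mult1, A=Regs[R1]+0
    Mult1:  Busy=Yes, Op=Mult, Vk=Regs[F2], Qj=Load1
    Mult2:  Busy=Yes, Op=Mult, Vk=Regs[F2], Qj=Load2
    Register status: F0=Load2, F4=Mult2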

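Time=8: first load exec (6/7), mult, store, mult2 stall, load2 exec, issue store2

    Instruction status (issue/exec complete/write result): 1st L.D 1/-/-, 1st MUL.D 2/-/-, 1st S.D 3/-/-, 2nd L.D 6/-/-, 2nd MUL.D 7/-/-, 2nd S.D 8/-/-
    Load1:  Busy=Yes, Op=Load, A=Regs[R1]+0
    Load2:  Busy=Yes, Op=Load, A=Regs[R1]+0
    Store1: Busy=Yes, Op=Store, Qk=Mult1, A=Regs[R1]+0
    Store2: Busy=Yes, Op=Store, Vj=Regs[R1], Qk=Mult2, A=0
    Mult1:  Busy=Yes, Op=Mult, Vk=Regs[F2], Qj=Load1
    Mult2:  Busy=Yes, Op=Mult, Vk=Regs[F2], Qj=Load2
    Register status: F0=Load2, F4=Mult2

Time=9: first load exec (7/7), mult, store, mult2 stall, load2 exec, store2 eff. addr.

    Instruction status (issue/exec complete/write result): 1st L.D 1/9/-, 1st MUL.D 2/-/-, 1st S.D 3/-/-, 2nd L.D 6/-/-, 2nd MUL.D 7/-/-, 2nd S.D 8/-/-
    Load1:  Busy=Yes, Op=Load, A=Regs[R1]+0
    Load2:  Busy=Yes, Op=Load, A=Regs[R1]+0
    Store1: Busy=Yes, Op=Store, Qk=Mult1, A=Regs[R1]+0
    Store2: Busy=Yes, Op=Store, Qk=Mult2, A=Regs[R1]+0
    Mult1:  Busy=Yes, Op=Mult, Vk=Regs[F2], Qj=Load1
    Mult2:  Busy=Yes, Op=Mult, Vk=Regs[F2], Qj=Load2
    Register status: F0=Load2, F4=Mult2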

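Time=10: first load writes result, mult, store, mult2 stall, load2 finishes, store2 stall

    Instruction status (issue/exec complete/write result): 1st L.D 1/9/10, 1st MUL.D 2/-/-, 1st S.D 3/-/-, 2nd L.D 6/10/-, 2nd MUL.D 7/-/-, 2nd S.D 8/-/-
    Load2:  Busy=Yes, Op=Load, A=Regs[R1]+0
    Store1: Busy=Yes, Op=Store, Qk=Mult1, A=Regs[R1]+0
    Store2: Busy=Yes, Op=Store, Qk=Mult2, A=Regs[R1]+0
    Mult1:  Busy=Yes, Op=Mult, Vj=Mem[], Vk=Regs[F2]
    Mult2:  Busy=Yes, Op=Mult, Vk=Regs[F2], Qj=Load2
    Register status: F0=Load2, F4=Mult2

Time=11: load2 writes result, Mult1 (1/4), mult2, store1, store2 stalled

    Instruction status (issue/exec complete/write result): 1st L.D 1/9/10, 1st MUL.D 2/-/-, 1st S.D 3/-/-, 2nd L.D 6/10/11, 2nd MUL.D 7/-/-, 2nd S.D 8/-/-
    Store1: Busy=Yes, Op=Store, Qk=Mult1, A=Regs[R1]+0
    Store2: Busy=Yes, Op=Store, Qk=Mult2, A=Regs[R1]+0
    Mult1:  Busy=Yes, Op=Mult, Vj=Mem[], Vk=Regs[F2]
    Mult2:  Busy=Yes, Op=Mult, Vj=Mem[], Vk=Regs[F2]
    Register status: F4=Mult2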

Time=14: Mult1 (4/4), Mult2 (3/4), store1, store2 stalled

    Instruction status (issue/exec complete/write result): 1st L.D 1/9/10, 1st MUL.D 2/14/-, 1st S.D 3/-/-, 2nd L.D 6/10/11, 2nd MUL.D 7/-/-, 2nd S.D 8/-/-
    Store1: Busy=Yes, Op=Store, Qk=Mult1, A=Regs[R1]+0
    Store2: Busy=Yes, Op=Store, Qk=Mult2, A=Regs[R1]+0
    Mult1:  Busy=Yes, Op=Mult, Vj=Mem[], Vk=Regs[F2]
    Mult2:  Busy=Yes, Op=Mult, Vj=Mem[], Vk=Regs[F2]
    Register status: F4=Mult2

Time=15: Mult1 writes result, Mult2 (4/4), store1 exec, store2 stalled

    Instruction status (issue/exec complete/write result): 1st L.D 1/9/10, 1st MUL.D 2/14/15, 1st S.D 3/-/-, 2nd L.D 6/10/11, 2nd MUL.D 7/15/-, 2nd S.D 8/-/-
    Store1: Busy=Yes, Op=Store, Qk=Mult1, A=Regs[R1]+0
    Store2: Busy=Yes, Op=Store, Qk=Mult2, A=Regs[R1]+0
    Mult2:  Busy=Yes, Op=Mult, Vj=Mem[], Vk=Regs[F2]
    Register status: F4=Mult2

Time=16: store1, store2 exec

    Instruction status (issue/exec complete/write result): 1st L.D 1/9/10, 1st MUL.D 2/14/15, 1st S.D 3/-/-, 2nd L.D 6/10/11, 2nd MUL.D 7/15/16, 2nd S.D 8/-/-
    Store1: Busy=Yes, Op=Store, Qk=Mult1, A=Regs[R1]+0
    Store2: Busy=Yes, Op=Store, Qk=Mult2, A=Regs[R1]+0

Tomasulo's Algorithm

Please note:
- F0 never sees data from the first load
- Register file completely detached from computation
- First and second iteration overlap completely
- Assuming two Mult units, we could not have issued a third mult operation for the next iteration of the loop -> no third store instruction could be issued
- In-order issue, out-of-order execution, out-of-order completion

Slide based on a lecture by David A. Patterson, University of California, Berkeley
http://www.cs.berkeley.edu/~pattrsn/252s01

Why can Tomasulo overlap iterations of loops?

- Register renaming
  - Multiple iterations use different physical destinations for registers (dynamic loop unrolling)
- Reservation stations
  - Permit instruction issue to advance past integer control flow operations
  - Also buffer old values of registers, totally avoiding the WAR stall that we saw in the scoreboard
- Other perspective: Tomasulo builds the data flow dependency graph on the fly

Slide based on a lecture by David A. Patterson, University of California, Berkeley
http://www.cs.berkeley.edu/~pattrsn/252s01
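The "dynamic loop unrolling" effect can be illustrated with a small renaming sketch. It is an illustration under simplifying assumptions, not the lecture's code: stations are handed out in order and never reused, the address register R1 and the integer instructions are left out, and the names `BODY` and `unroll` are invented for this example. Each source operand is replaced by the station that will produce it, if one is in flight, so the result is the data flow graph built at issue time.

```python
from collections import defaultdict

# Floating-point part of the loop body: (opcode, destination, sources, station kind).
BODY = [
    ("L.D",   "F0", [],           "Load"),
    ("MUL.D", "F4", ["F0", "F2"], "Mult"),
    ("S.D",   None, ["F4"],       "Store"),
]

def unroll(body, iterations):
    """Issue `iterations` copies of the body; return (station, opcode, operand sources)."""
    counters = defaultdict(int)   # next station number per kind (assumed unbounded)
    producer = {}                 # register -> station that will produce its value
    trace = []
    for _ in range(iterations):
        for op, dst, srcs, kind in body:
            counters[kind] += 1
            station = f"{kind}{counters[kind]}"
            # A source is a station tag if its value is in flight, else the register itself.
            operands = [producer.get(reg, reg) for reg in srcs]
            trace.append((station, op, operands))
            if dst is not None:
                producer[dst] = station   # later readers of dst now wait on this station
    return trace

for station, op, operands in unroll(BODY, 2):
    print(station, op, operands)
```

In the output the second MUL.D depends on Load2 rather than Load1, and the second S.D on Mult2, so the two iterations share no register names and can overlap, as in the loop walkthrough.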