CS 6354: Tomasulo
21 September 2016

To read more
  This day's paper: Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units"
  Supplementary readings:
    Hennessy and Patterson, Computer Architecture: A Quantitative Approach, section 3.4-5
    Shen and Lipasti, Modern Processor Design, section 5.2

Intel Skylake
  [Figure. Image: Intel Optimization Reference Manual]

Scheduling
  How can we reorder instructions?
  Without changing the answer

Recall: Data hazards
  Instructions had wrong data because they weren't executed one at a time.
  Example: reading an old value of a register.

Recall: Read-after-Write

  cycle  r1 ← r2 + r3           r4 ← r1 - r5
  1      IF
  2      ID: read r2, r3        IF
  3      EX: temp1 ← r2 + r3    ID: read r1, r5
  4      MEM                    EX: temp2 ← r1 - r5
  5      WB: r1 ← temp1         MEM
  6                             WB: r4 ← temp2

  The second instruction reads r1 in cycle 3, but the first does not write r1 until cycle 5.

Write-after-Write
  r1 ← r2 + r3 ; (1)
  r1 ← r6 + r7 ; (2)
  r4 ← r2 + r1 ; (3)

  time  (1) r1 ← r2 + r3   (2) r1 ← r6 + r7   (3) r4 ← r2 + r1
  1                        read r6, r7
  2     read r2, r3        compute
  3     compute            write r1
  4     write r1
  5
  6                                           read r1, r2
  7                                           compute

  Value read by (3): the r1 written by (1) at time 4.
  Desired value: the r1 written by (2) at time 3.

Write-after-Read
  r1 ← r2 + r3 ; (1)
  r3 ← r4 + r5 ; (2)

  time  (1) r1 ← r2 + r3   (2) r3 ← r4 + r5
  1                        read r4, r5
  2                        compute
  3                        write r3
  4     read r2, r3
  5     compute
  6     write r1

  (1) reads r3 at time 4, after (2) has already overwritten it at time 3.
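The three hazard cases recalled above can be checked mechanically by comparing the registers two instructions read and write. The following is a minimal Python sketch, not anything from the lecture: the `Instr` tuple and the `hazards` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    dest: str     # register written
    srcs: tuple   # registers read

def hazards(first: Instr, second: Instr) -> set:
    """Dependences that constrain reordering `second` relative to `first`,
    where `first` is earlier in program order."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")   # second reads what first writes
    if first.dest == second.dest:
        found.add("WAW")   # both write the same register
    if second.dest in first.srcs:
        found.add("WAR")   # second overwrites what first reads
    return found

# The three instruction pairs used on the slides
raw = hazards(Instr("r1", ("r2", "r3")), Instr("r4", ("r1", "r5")))
waw = hazards(Instr("r1", ("r2", "r3")), Instr("r1", ("r6", "r7")))
war = hazards(Instr("r1", ("r2", "r3")), Instr("r3", ("r4", "r5")))
print(raw, waw, war)
```

Only the RAW case reflects a real flow of data; the other two come from reusing a register name.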

Types of Data Hazards
  Read-after-Write (RAW), also called: true dependence
  Write-after-Write (WAW), also called: output dependence
  Write-after-Read (WAR), also called: anti-dependence

a problem with names
  write-after-write:
    r1 ← r2 + r3 ; (1)
    r1 ← r6 + r7 ; (2)
    r4 ← r2 + r1 ; (3)
  write-after-read:
    r1 ← r2 + r3 ; (1)
    r3 ← r4 + r5 ; (2)
  no problem if we used a different name for each write

register renaming
  original code:
    r1 ← r2 + r3
    r7 ← r1 + r3
    r1 ← r6 + r7
    r4 ← r2 + r1
    r2 ← r4 + r5
  with renaming:
    new1 ← r2 + r3     ; (1)
    new2 ← new1 + r3   ; (2)
    new3 ← r6 + new2   ; (3)
    new4 ← r2 + new3   ; (4)
    new5 ← new4 + r5   ; (5)

scheduling with renaming
  different architectural (external) and internal register names
  new internal name on each write

  new name  old name  from  up to
  new1      r1        (1)   (2)
  new2      r7        (2)
  new3      r1        (3)
  new4      r4        (4)
  new5      r2        (5)
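The rule "new internal name on each write" is short enough to write out. Below is a minimal Python sketch under stated assumptions: the `rename` function and the tuple encoding of instructions are hypothetical, and the `new1`, `new2`, ... naming simply follows the slide. In this sketch a source register always resolves to the most recent internal name for it, so instruction (3) reads `new2` for r7 and instruction (5) reads `new4` for r4.

```python
def rename(code):
    """code: list of (dest, src1, src2) using external register names.
    Returns the renamed code and the final external-to-internal mapping."""
    mapping = {}   # external name -> current internal name
    renamed = []
    for i, (dest, a, b) in enumerate(code, start=1):
        # Sources use the newest name; registers never written keep their own name.
        a2, b2 = mapping.get(a, a), mapping.get(b, b)
        new = f"new{i}"          # fresh internal name for every write
        mapping[dest] = new
        renamed.append((new, a2, b2))
    return renamed, mapping

# The original code from the slide
code = [
    ("r1", "r2", "r3"),
    ("r7", "r1", "r3"),
    ("r1", "r6", "r7"),
    ("r4", "r2", "r1"),
    ("r2", "r4", "r5"),
]
renamed, mapping = rename(code)
for dest, a, b in renamed:
    print(f"{dest} <- {a} + {b}")
```

After renaming, no internal name is written twice, so only read-after-write dependences remain.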

register renaming state
  original code:
    r1 ← r2 + r3
    r7 ← r1 + r3
    r1 ← r6 + r7
    r4 ← r2 + r1
    r2 ← r4 + r5
  with renaming:
    x09 ← x02 + x03
    x10 ← x09 + x03
    x11 ← x06 + x10
    x12 ← x02 + x11
    x13 ← x12 + x05

  external name   internal name
  r1              x01, then x09, then x11
  r2              x02, then x13
  r3              x03
  r4              x04, then x12
  r5              x05
  r6              x06
  r7              x07, then x10
  r8              x08

Diversion: SSA
  compiler technique: static single-assignment (SSA) form
  rewrite code as code with immutable variables only
  makes optimization easier
  if you know it, this will seem familiar

scheduling with renaming

  #    (renamed) instruction   run on   done?
  (1)  x05 ← Mem[x03]          Load
  (2)  x06 ← x01 + x02         Add1
  (3)  x07 ← x01 × x02         Mult
  (4)  x08 ← x05 × x04         Mult
  (5)  x09 ← x05 + x04         Add2
  (6)  x10 ← x07 + x06         Add1

  [Table: ready? bit for each internal name x01 through x10]

handling variable times

  time  Add1        Add2        Mult        Load
  0     (2) start               (3) start   (1) start
  1     (2)                     (3)         (1)
  2     (2) done                (3)         (1)
  3                             (3)         (1)
  4                             (3) done    (1)
  5     (6) start                           (1) done
  6     (6)         (5) start   (4) start
  7     (6) done    (5)         (4)
  8                 (5) done    (4)
  9                             (4)
  10                            (4) done

  Might have a second adder, but x05 is not ready.

scheduling is reactive
  Load took longer? Doesn't matter.
  Don't try to start things until ready.
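The timeline above follows from one rule: an instruction starts in the first cycle in which its unit is free and all of its source values are available. Here is a minimal Python sketch of that rule. The `schedule` function is hypothetical, the latencies (3 cycles for an add, 5 for a multiply, 6 for the load) are read off the slide's timeline, and it is assumed that a result finished in cycle t can be used from cycle t+1.

```python
LATENCY = {"Add1": 3, "Add2": 3, "Mult": 5, "Load": 6}  # cycles, read off the timeline

# (number, destination, sources, unit), as in the renamed instruction table
PROGRAM = [
    (1, "x05", ["x03"], "Load"),
    (2, "x06", ["x01", "x02"], "Add1"),
    (3, "x07", ["x01", "x02"], "Mult"),
    (4, "x08", ["x05", "x04"], "Mult"),
    (5, "x09", ["x05", "x04"], "Add2"),
    (6, "x10", ["x07", "x06"], "Add1"),
]

def schedule(program, latency, initially_ready):
    """Return {instruction number: (start cycle, done cycle)}.
    Assumes every source is either initially ready or produced by the program."""
    ready_at = {r: 0 for r in initially_ready}  # register -> first cycle it is usable
    unit_free_at = {u: 0 for u in latency}
    waiting = list(program)
    times = {}
    cycle = 0
    while waiting:
        for instr in list(waiting):             # program order breaks ties
            num, dest, srcs, unit = instr
            operands_ready = all(ready_at.get(s, float("inf")) <= cycle for s in srcs)
            if operands_ready and unit_free_at[unit] <= cycle:
                done = cycle + latency[unit] - 1
                times[num] = (cycle, done)
                ready_at[dest] = done + 1       # usable the cycle after it finishes
                unit_free_at[unit] = done + 1
                waiting.remove(instr)
        cycle += 1
    return times

times = schedule(PROGRAM, LATENCY, ["x01", "x02", "x03", "x04"])
for num in sorted(times):
    print(num, times[num])
```

Changing the load latency in `LATENCY` shifts (4) and (5) without any other change to the code, which is the sense in which the scheduling is reactive.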

Running out of register names?
  recycle names that no pending operation and no external name still refers to
  still out of names? don't issue more instructions

reservation stations vs registers
  Tomasulo paper doesn't seem to have extra registers
  But has reservation stations with tags
  these are extra registers and their names

pieces in Tomasulo
  ready bits
  internal/external name mapping
  dispatching transmits register values
  extra registers

scheduling with reservation buffers

  #    (renamed) instruction   run on   done?
  (1)  x05 ← Mem[x03]          Load
  (2)  x06 ← x01 + x02         Add1
  (3)  x07 ← x01 × x02         Mult
  (4)  x08 ← x05 × x04         Mult
  (5)  x09 ← x05 + x04         Add2
  (6)  x10 ← x07 + x06         Add1

                   Add1            Add2   Mult            Load
  source 1 tag     x01, then x07   x05    x01, then x05   x03
  source 1 ready?  no              no     no
  source 2 tag     x02, then x06   x04    x02, then x04
  source 2 ready?
  sink tag         x06, then x10   x09    x07, then x08   x05
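The recycling rule for internal names can be sketched as a small free list. This is only an illustration under stated assumptions: the `RenameTable` class and its methods are hypothetical, and "no operations" is taken to mean that no unfinished instruction reads or writes the name.

```python
class RenameTable:
    def __init__(self, external, n_internal):
        names = [f"x{i:02d}" for i in range(1, n_internal + 1)]
        self.mapping = dict(zip(external, names))  # external -> current internal name
        self.free = names[len(external):]          # internal names not yet in use
        self.pending = {n: 0 for n in names}       # unfinished operations touching each name

    def issue(self, dest, srcs):
        """Rename one instruction; return None when out of names (do not issue)."""
        if not self.free:
            return None
        ins = [self.mapping[s] for s in srcs]
        old, new = self.mapping[dest], self.free.pop(0)
        self.mapping[dest] = new
        for n in ins + [new]:
            self.pending[n] += 1
        self._recycle_if_unused(old)
        return new, ins

    def complete(self, new, ins):
        """Called when the instruction has finished reading and writing."""
        for n in ins + [new]:
            self.pending[n] -= 1
            self._recycle_if_unused(n)

    def _recycle_if_unused(self, name):
        in_use = name in self.mapping.values() or self.pending[name] > 0
        if not in_use and name not in self.free:
            self.free.append(name)

# Two external registers, three internal names: only one spare name
table = RenameTable(["r1", "r2"], 3)
first = table.issue("r1", ["r1", "r2"])   # r1 <- r1 + r2
stalled = table.issue("r2", ["r1"])       # no free name left
table.complete(*first)                    # x01 is now unreferenced and recycled
second = table.issue("r2", ["r1"])
print(first, stalled, second, table.free)
```

A real design would also have to handle speculation and precise exceptions before reusing a name; the sketch leaves that out.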

common data bus
  results are broadcast here
  tag = internal register name
  reservation stations listen for operands
  register file listens for register values

issuing instructions
  assign tags for operands
  instruction will execute when operands are ready
  handles variable length operations (e.g. loads)
  keeps register file from being a bottleneck
  fancy buses: multiple value+tags per clock cycle

integrating with reorder buffer / integrating with reorder buffer (2)
  reorder buffer: just another thing listening on the bus
  [Figure: Hennessy & Patterson, Figure 3.11]
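The broadcast-and-listen behaviour of the common data bus can be sketched in a few lines of Python. All class names here (`Station`, `RegisterFile`, `CommonDataBus`) are hypothetical, the operand values 5 and 12 are made up, and the sketch models one value and tag per broadcast rather than any particular hardware.

```python
from dataclasses import dataclass, field

@dataclass
class Station:
    name: str
    sink: str        # tag the result will be broadcast under
    src_tags: list   # tags of the two operands
    values: list = field(default_factory=lambda: [None, None])

    def listen(self, tag, value):
        # Capture the value if it is an operand this station is waiting for.
        for i, t in enumerate(self.src_tags):
            if t == tag and self.values[i] is None:
                self.values[i] = value

    def ready(self):
        return all(v is not None for v in self.values)

class RegisterFile:
    def __init__(self):
        self.values = {}

    def listen(self, tag, value):
        self.values[tag] = value

class CommonDataBus:
    def __init__(self):
        self.listeners = []

    def broadcast(self, tag, value):
        for listener in self.listeners:
            listener.listen(tag, value)

bus = CommonDataBus()
regs = RegisterFile()
add1 = Station(name="Add1", sink="x10", src_tags=["x07", "x06"])  # instruction (6)
bus.listeners += [regs, add1]

bus.broadcast("x06", 5)       # result of (2); the value is made up
waiting = add1.ready()        # still missing x07
bus.broadcast("x07", 12)      # result of (3); the value is made up
result = add1.values[0] + add1.values[1]
bus.broadcast(add1.sink, result)
print(waiting, regs.values)
```

A reorder buffer would fit the same pattern: it is one more object with a `listen` method appended to `bus.listeners`.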

multiple entries in reservation stations
  instead of dispatching one instruction, issue a list
  reservation station starts whichever one gets operands first

variations on reservation stations
  Intel P6: shared reservation station for all types of operations
  MIPS R10000 (next Monday's paper): read from shared register file (with renaming)

Intel P6 execution unit datapaths
  [Figure. Image: Shen and Lipasti, Figure 7.14]

summary
  register renaming to avoid data hazards
  otherwise even write-after-write, write-after-read a problem
  shared bus to communicate results
  register file, reservation buffers listen on bus
  can dispatch to buffer before value ready
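With several entries per reservation station, the unit picks an entry whose operands have all arrived rather than the oldest one. A minimal Python sketch of that selection follows; `pick_ready` is a hypothetical helper, and the set of ready tags is an invented state chosen to show a later entry overtaking an earlier one.

```python
def pick_ready(entries, ready_tags):
    """entries: list of (sink_tag, source_tags) in issue order.
    Return the first entry whose operands are all ready, else None."""
    for entry in entries:
        _sink, srcs = entry
        if all(tag in ready_tags for tag in srcs):
            return entry
    return None

# The two multiplies from the earlier example, both waiting in the Mult station
mult_entries = [
    ("x07", ("x01", "x02")),   # (3)
    ("x08", ("x05", "x04")),   # (4)
]

# Hypothetical state: x04 and x05 have arrived, x01 and x02 have not
chosen = pick_ready(mult_entries, {"x04", "x05"})
print(chosen)
```

Ties are broken here by issue order, which is one possible policy and not something the slides specify.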