Lecture 14: Instruction Level Parallelism

Last time
- Pipelining in the real world

Today
- Control hazards
- Other pipelines
- Take QUIZ 10 over P&H 4.10-15, before 11:59pm today
- Homework 5 due Thursday March 11, 2010
- Instruction level parallelism
- Multi-issue (superscalar) and out-of-order execution

Where Are We?

DEC Alpha 21064 (introduced 1992)
- Pipelined in-order processor
- Simple branch prediction
- Instruction/data caches (on chip)

DEC Alpha 21264 (introduced 1998)
- Out-of-order instruction execution
- Superscalar
- Sophisticated branch prediction

Dynamic Multiple Issue (Superscalar)
- No instruction reordering; choose 0, 1, ..., N instructions to issue each cycle
- [Block diagram: Instruction Memory -> Instruction Buffer -> Hazard Detect -> Register File]

MIPS with Static Dual Issue
- Two-issue packets
  - One ALU/branch instruction
  - One load/store instruction
  - 64-bit aligned: ALU/branch, then load/store
  - Pad an unused instruction with nop

    Address   Instruction type   Pipeline stages
    n         ALU/branch         IF ID EX MEM WB
    n + 4     Load/store         IF ID EX MEM WB
    n + 8     ALU/branch            IF ID EX MEM WB
    n + 12    Load/store            IF ID EX MEM WB
    n + 16    ALU/branch               IF ID EX MEM WB
    n + 20    Load/store               IF ID EX MEM WB
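
To make the packet-formation rule concrete, here is a minimal sketch (hypothetical code, not from the lecture) that pairs one ALU/branch instruction with one load/store per packet and pads an empty slot with nop; it deliberately ignores data hazards between the two slots, which the next slide takes up:

    # Minimal sketch of static dual-issue packet formation (hypothetical example):
    # slot 0 holds an ALU/branch op, slot 1 a load/store, empty slots become nop.
    ALU_BRANCH = {"add", "addi", "slt", "beq", "bne"}
    LOAD_STORE = {"lw", "sw"}

    def form_packets(instrs):
        packets, i = [], 0
        while i < len(instrs):
            op = instrs[i].split()[0]
            nxt = instrs[i + 1].split()[0] if i + 1 < len(instrs) else None
            if op in ALU_BRANCH and nxt in LOAD_STORE:
                packets.append((instrs[i], instrs[i + 1]))   # full packet
                i += 2
            elif op in ALU_BRANCH:
                packets.append((instrs[i], "nop"))           # no load/store available
                i += 1
            else:
                packets.append(("nop", instrs[i]))           # load/store without an ALU op
                i += 1
        return packets

    print(form_packets(["add $t0, $s0, $s1", "lw $s2, 0($t1)", "sw $s2, 4($t1)"]))
    # [('add $t0, $s0, $s1', 'lw $s2, 0($t1)'), ('nop', 'sw $s2, 4($t1)')]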

Hazards in Dual-Issue MIPS
- More instructions executing in parallel
- EX data hazard
  - Forwarding avoided stalls with single-issue
  - Now can't use the ALU result in a load/store in the same packet:
        add  $t0, $s0, $s1
        load $s2, 0($t0)
  - Split into two packets, effectively a stall
- Load-use hazard
  - Still one cycle use latency, but now two instructions
- More aggressive scheduling required

What Hardware Do We Need?
- Wider fetch: more i-cache bandwidth
- Multiported register file
- More ALUs
- Restrictions on issue of loads/stores, because N ports to the data cache slow it down too much

Multiple Issue (Details)
- Dependencies and structural hazards checked at run time
  - Can run existing binaries
  - Recompile for performance, not correctness
  - Example: Pentium
- More complex issue logic
  - Swizzle the next N instructions into position
  - Check dependencies and resource needs
  - Issue M <= N instructions that can execute in parallel
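
As a rough illustration of that issue check (a hypothetical sketch, not the Pentium's actual logic), the following scans the next N decoded instructions in program order and issues the longest prefix that has no register dependences or functional-unit conflicts inside the group:

    # Hypothetical sketch of in-order superscalar issue logic: issue the longest
    # prefix of the next n instructions that is free of intra-group dependences
    # and structural (functional-unit) conflicts.
    def select_issue_group(window, n, free_units):
        issued, writes, reads = [], set(), set()
        units = dict(free_units)                       # functional units left this cycle
        for instr in window[:n]:
            raw = any(s in writes for s in instr["srcs"])           # RAW within the group
            waw = instr["dst"] is not None and instr["dst"] in writes
            war = instr["dst"] is not None and instr["dst"] in reads
            structural = units.get(instr["unit"], 0) == 0           # no unit left
            if raw or waw or war or structural:
                break                                  # in-order issue: stop at first conflict
            issued.append(instr)
            if instr["dst"] is not None:
                writes.add(instr["dst"])
            reads.update(instr["srcs"])
            units[instr["unit"]] -= 1
        return issued

    window = [
        {"op": "add", "dst": "r3", "srcs": ["r1", "r2"], "unit": "alu"},
        {"op": "lw",  "dst": "r4", "srcs": ["r8"],       "unit": "mem"},
        {"op": "sub", "dst": "r5", "srcs": ["r3", "r4"], "unit": "alu"},  # needs r3, r4
    ]
    print([i["op"] for i in select_issue_group(window, 3, {"alu": 1, "mem": 1})])  # ['add', 'lw']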

Example Multiple Issue
- Issue rules: at most 1 load/store, at most 1 floating-point op per cycle
- Latency: load = 1, int = 1, float-mult = 2, float-add = 1 cycle

    LOOP: LD    F0, 0(R1)      // a[i]                            cycle 1
          LD    F2, 0(R2)      // b[i]                                  2
          MULTD F8, F0, F2     // a[i] * b[i]                           4  (stall)
          ADDD  F12, F8, F16   // + c                                   5
          SD    F12, 0(R3)     // d[i]                                  6
          ADDI  R1, R1, 4   ;  ADDI R2, R2, 4                           7
          ADDI  R3, R3, 4   ;  ADDI R4, R4, 1   // increment i          8
          SLT   R5, R4, R6     // i < n-1                               9
          BNEQ  R5, R0, LOOP                                           10

Old CPI = 12/11 = 1.09    New CPI = 10/11 = 0.91

Rescheduled for Multiple Issue
- Issue rules: at most 1 LD/ST, at most 1 floating-point op per cycle
- Latency: LD = 1, int = 1, F* = 2, F+ = 1

    LOOP: LD    F0, 0(R1)      // a[i]          ADDI R1, R1, 4                   cycle 1
          LD    F2, 0(R2)      // b[i]          ADDI R2, R2, 4                         2
          MULTD F8, F0, F2     // a[i] * b[i]   ADDI R4, R4, 1   // increment i        4
          ADDD  F12, F8, F16   // + c           SLT  R5, R4, R6  // i < n-1            5
          SD    F12, 0(R3)     // d[i]          ADDI R3, R3, 4                         6
          BNEQ  R5, R0, LOOP                                                           7

Old CPI = 0.91    New CPI = 7/11 = 0.64

Given a two-way issue processor, what's the best possible CPI? IPC?

The Problem with Static Scheduling
- In-order execution: an unexpectedly long latency blocks ready instructions from executing
- Binaries need to be rescheduled for each new implementation
- The small number of named registers becomes a bottleneck

    LW  R1, C     // miss: 50 cycles
    LW  R2, D
    MUL R3, R1, R2
    SW  R3, C
    LW  R4, B     // ready
    ADD R5, R4, R9
    SW  R5, A
    LW  R6, F
    LW  R7, G
    ADD R8, R6, R7
    SW  R8, E

Dynamic Scheduling
- Determine the execution order of instructions at run time
- Schedule with knowledge of run-time variable latency (e.g., cache misses)
- Compatibility advantages
  - Avoid the need to recompile old binaries
  - Avoid the bottleneck of small named register sets (but still need to deal with spills)
- Significant hardware complexity

Example
[Figure: pipeline timing diagrams; top = without dynamic scheduling, bottom = with dynamic scheduling]
- 10-cycle data memory (cache) miss
- 3-cycle MUL latency
- 2-cycle ADD latency
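
Since the original timing figure is not reproduced here, the following sketch (my own illustration, not the lecture's figure) makes the same point using the code from "The Problem with Static Scheduling" above, but with this slide's smaller latencies: 10-cycle miss, 3-cycle MUL, 2-cycle ADD, and the assumptions that cache hits and stores take 1 cycle and issue width is unlimited.

    # Compare a simple in-order, stall-on-use schedule against an idealized
    # dataflow (out-of-order) schedule; latencies and code as described above.
    prog = [  # (dst, srcs, latency)
        ("R1", [],           10),  # LW  R1, C   -- cache miss
        ("R2", [],            1),  # LW  R2, D
        ("R3", ["R1", "R2"],  3),  # MUL R3, R1, R2
        (None, ["R3"],        1),  # SW  R3, C
        ("R4", [],            1),  # LW  R4, B
        ("R5", ["R4", "R9"],  2),  # ADD R5, R4, R9  (R9 already valid)
        (None, ["R5"],        1),  # SW  R5, A
        ("R6", [],            1),  # LW  R6, F
        ("R7", [],            1),  # LW  R7, G
        ("R8", ["R6", "R7"],  2),  # ADD R8, R6, R7
        (None, ["R8"],        1),  # SW  R8, E
    ]

    def schedule(prog, in_order):
        """Return the cycle at which the last instruction finishes."""
        ready_at = {}                 # register -> cycle its value becomes available
        prev_issue = last_done = 0
        for dst, srcs, lat in prog:
            ready = max([ready_at.get(s, 0) for s in srcs], default=0)
            issue = max(ready, prev_issue + 1) if in_order else max(ready, 1)
            if dst is not None:
                ready_at[dst] = issue + lat
            prev_issue = issue
            last_done = max(last_done, issue + lat)
        return last_done

    print("in-order: ", schedule(prog, True))    # the miss stalls everything behind it
    print("dataflow: ", schedule(prog, False))   # independent work overlaps the miss

Under these assumptions the in-order schedule finishes at cycle 24, the dataflow schedule at cycle 15: the LW/ADD/SW groups that do not depend on R1 execute underneath the miss instead of queueing behind the MUL and SW.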

Dynamic Scheduling Basic Concept

[Diagram: the PC walks a sequential instruction stream; fetched instructions enter a window where they wait on operands & resources; issue logic dispatches ready instructions to the execution resources and register file; finished instructions wait to commit]

Sequential instruction stream (fetch order):
    LW R1,A       LW R2,B       ADD R3,R1,R2   SW R3,C
    LW R4,8(A)    LW R5,8(B)    ADD R6,R4,R5   SW R6,8(C)
    LW R7,16(A)   LW R8,16(B)   ADD R9,R7,R8   SW R9,16(C)
    LW R10,24(A)  LW R11,24(B)

Window of instructions waiting on operands & resources:
    ADD R3,R1,R2   SW R3,C
    ADD R6,R4,R5   SW R6,8(C)
    LW R7,16(A)    LW R8,16(B)
    ADD R9,R7,R8   SW R9,16(C)

Instructions waiting to commit:
    LW R4,8(A)     LW R5,8(B)

Implementation I - Register Scoreboard

[Register file with one valid bit per register: R0=1  R1=0  R2=1  R3=0  R4=0  R5=0  R6=0  R7=0]

- Tracks register writes: busy = pending write
- Detects hazards for the scheduler
  - ADD R3,R1,R2: wait until R1 is valid; mark R3 valid when the add completes
  - SUB R4,R0,R3: wait for R3's valid bit (= 0 if a write is pending)
- What about:
    LD  R3, 0(R7)
    ADD R4, R3, R5
    LD  R3, 4(R7)
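
A minimal sketch of the scoreboard idea (hypothetical class, not the lecture's hardware): one busy bit per architectural register, checked before issue and cleared when the pending write completes. It also shows why the LD/ADD/LD sequence above is awkward: the second LD finds R3 already busy, a write-after-write conflict on the single name R3.

    # Hypothetical scoreboard sketch: busy[r] == True means a write to register r
    # is pending, so instructions that read r (RAW) or write r (WAW) must wait.
    class Scoreboard:
        def __init__(self, nregs=32):
            self.busy = [False] * nregs

        def can_issue(self, dst, srcs):
            return not self.busy[dst] and not any(self.busy[s] for s in srcs)

        def issue(self, dst, srcs):
            assert self.can_issue(dst, srcs)
            self.busy[dst] = True            # pending write to dst

        def complete(self, dst):
            self.busy[dst] = False           # write has reached the register file

    sb = Scoreboard()
    sb.issue(dst=3, srcs=[7])                 # LD  R3, 0(R7)
    print(sb.can_issue(dst=4, srcs=[3, 5]))   # ADD R4, R3, R5 -> False (waits on R3)
    print(sb.can_issue(dst=3, srcs=[7]))      # LD  R3, 4(R7)  -> False (WAW on R3)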

Implementing A Simple Instruction Window

Window entries, often called reservation stations (fields: op, dst reg, src1 reg + rdy bit, src2 reg + rdy bit, result reg; reg = name, value):

    Slot  Op   dst  src1 (reg, rdy)  src2 (reg, rdy)
    3     ADD  R3   R1, 0            R2, 1
    5     SW   -    R3, 0            R8, 1
    2     ADD  R6   R4, 0            R5, 0
    4     SW   -    R6, 0            R8, 1
    1     LW   R7   R9, 1            -,  1

Code in the window:
    ADD R3,R1,R2
    SW  R3,0(R8)
    ADD R6,R4,R5
    SW  R6,8(R8)
    LW  R7,16(R9)

Register file valid bits: R0=1  R1=0  R2=1  R3=0  R4=0  R5=0  R6=0  R7=0

Result sequence: R4, R7, R5, R1, R6, R3

Instruction Window Policies
- Add an instruction to the window only when its destination register is not busy
  - Mark the destination register busy
  - Check the status of the source registers and set the ready bits
- When each result is generated
  - Compare the destination register field to all waiting instructions' source register fields; update ready bits
  - Mark the destination register not busy
- Issue an instruction when
  - An execution resource is available
  - All source operands are ready
- Result
  - Issues instructions out of order, as soon as source registers are available
  - Allows only one operation in the window per destination register
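
Putting those policies together, here is a small sketch (hypothetical, ignoring structural limits on execution resources): each entry records (register, ready) pairs for its sources, each result is broadcast to the waiting entries, and a second pending write to the same destination register is refused.

    # Hypothetical sketch of the window policies above.
    class InstructionWindow:
        def __init__(self):
            self.busy = {}        # destination register -> write pending?
            self.entries = []     # instructions waiting to issue

        def add(self, op, dst, srcs):
            if dst is not None and self.busy.get(dst, False):
                return False                              # policy: dest must not be busy
            if dst is not None:
                self.busy[dst] = True                     # mark destination busy
            srcs = [(r, not self.busy.get(r, False)) for r in srcs]   # set ready bits
            self.entries.append({"op": op, "dst": dst, "srcs": srcs})
            return True

        def result(self, reg):
            """A result for `reg` was produced: update ready bits, free the name."""
            for e in self.entries:
                e["srcs"] = [(r, rdy or r == reg) for r, rdy in e["srcs"]]
            self.busy[reg] = False

        def issue(self):
            """Issue, out of order, every entry whose operands are all ready."""
            ready = [e for e in self.entries if all(rdy for _, rdy in e["srcs"])]
            self.entries = [e for e in self.entries if not any(e is r for r in ready)]
            return ready

    w = InstructionWindow()
    w.add("LW",  "R1", [])               # load into R1; R1 is now busy
    w.add("ADD", "R3", ["R1", "R2"])     # must wait: R1 not ready
    print([e["op"] for e in w.issue()])  # ['LW']
    w.result("R1")                       # the load's result is broadcast
    print([e["op"] for e in w.issue()])  # ['ADD']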

Register Renaming (1)
- What about this sequence?
    LW  R1, 0(R4)
    ADD R2, R1, R3
    LW  R1, 4(R4)
    ADD R5, R1, R3
- Can't add instruction 3 to the window since R1 is already busy
- Need two R1s!

Register Renaming (2)
- Add a tag field to each register - translates from a virtual to a physical register name

Rename table (virtual register: bit, physical register):
    R1: 0, P5
    R2: 0, P2
    R3: 1, P1
    R4: 1, P7
    R5: 1, P6

Physical registers (bit, value):
    P1: 0, A    P2: 1, 5    P3: 1, C    P4: 1, 0
    P5: 0, E    P6: 1, F    P7: 1, 3    P8: 0, 2

In window:
    LW  R1, 0(R4)
    ADD R2, R1, R3

Next instruction:
    LW  R1, 4(R4)

Register Renaming (3)

Window slots (renamed instructions):
    S1: LW   dst P5   src1 P7 (rdy 1)   src2 -  (rdy 1)
    S2: ADD  dst P2   src1 P5 (rdy 0)   src2 P1 (rdy 1)
    S3: LW   dst P4   src1 P7 (rdy 1)   src2 -  (rdy 1)
    S4: ADD  dst P6   src1 P4 (rdy 0)   src2 P1 (rdy 1)

Rename table, before -> after:
    R1: 0, P5  ->  R1: 0, P4
    R2: 0, P2  ->  R2: 0, P2
    R3: 1, P1  ->  R3: 1, P1
    R4: 1, P7  ->  R4: 1, P7
    R5: 1, P6  ->  R5: 0, P6

Code being renamed:
    LW  R1, 0(R4)
    ADD R2, R1, R3
    LW  R1, 4(R4)
    ADD R5, R1, R3

- When a result is generated:
  - Compare the tag of the result to the not-ready source fields
  - Grab the data if the tags match
- Add an instruction to the window even if its dest register is busy
- When adding an instruction to the window:
  - Read the data of non-busy source registers and retain it
  - Read the tags of busy source registers and retain them
  - Write the tag of the destination register with the slot number
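
For contrast, a compact sketch of renaming with an explicit free list of physical registers (a simplification relative to the slide's tag-per-slot scheme, with hypothetical names): every destination gets a fresh physical register, so the two writes to R1 no longer collide.

    # Hypothetical free-list renaming sketch: sources read the current mapping,
    # then the destination is remapped to a fresh physical register.
    class Renamer:
        def __init__(self, n_arch=8, n_phys=16):
            self.map = {f"R{i}": f"P{i}" for i in range(n_arch)}   # current mapping
            self.free = [f"P{i}" for i in range(n_arch, n_phys)]   # free physical regs

        def rename(self, dst, srcs):
            phys_srcs = [self.map[s] for s in srcs]   # sources use the existing mapping
            self.map[dst] = self.free.pop(0)          # destination gets a fresh register
            return self.map[dst], phys_srcs

    r = Renamer()
    for dst, srcs in [("R1", ["R4"]), ("R2", ["R1", "R3"]),
                      ("R1", ["R4"]), ("R5", ["R1", "R3"])]:
        print(dst, "->", r.rename(dst, srcs))
    # The two loads now write different physical registers (P8 and P10), so the
    # first ADD reads the first load's value and the second ADD reads the second's.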

Power Efficiency
- The complexity of dynamic scheduling and speculation requires power
- Multiple simpler cores may be better

    Microprocessor   Year  Clock Rate  Pipeline Stages  Issue Width  Out-of-order/Speculation  Cores  Power
    i486             1989    25MHz            5              1                No                 1      5W
    Pentium          1993    66MHz            5              2                No                 1     10W
    Pentium Pro      1997   200MHz           10              3                Yes                1     29W
    P4 Willamette    2001  2000MHz           22              3                Yes                1     75W
    P4 Prescott      2004  3600MHz           31              3                Yes                1    103W
    Core             2006  2930MHz           14              4                Yes                2     75W
    UltraSparc III   2003  1950MHz           14              4                No                 1     90W
    UltraSparc T1    2005  1200MHz            6              1                No                 8     70W

Summary
- Pipelining is simple, but a correct high-performance implementation is complex
  - Dynamic multiple issue
  - Static multiple issue (VLIW)
  - Out-of-order execution: dependencies, renaming, etc.

Next Time
- Caches (new topic!)
- Homework 5 due Thursday March 11, 2010
- Read: P&H 5.1-5