Lecture 20: Parallelism, from ILP to Multicores. James C. Hoe, Department of ECE, Carnegie Mellon University


18-447 Lecture 20: Parallelism, from ILP to Multicores
James C. Hoe, Department of ECE, Carnegie Mellon University
18-447 S18 L20, James C. Hoe, CMU/ECE/CALCM, 2018

Housekeeping
- Your goal today: transition from sequential to parallel; enjoy (you will not be tested on this)
- Notices: Midterm 2 on Monday; pick up practice midterm solutions. HW4 past due; HW5 out next Wed. Handout #14: HW4 solutions
- Readings (advanced, optional): "The MIPS R10000 Superscalar Microprocessor," Yeager; Synthesis Lectures on Computer Architecture: "Processor Microarchitecture: An Implementation Perspective," 2010

Parallelism Defined
- T_1 (work, measured in time): time to do the work with 1 PE
- T_inf (critical path): time to do the work with infinitely many PEs; T_inf is bounded by dataflow dependence
- Average parallelism: P_avg = T_1 / T_inf
- For a system with p PEs: T_p >= max{ T_1/p, T_inf }
- When P_avg >> p: T_p ~ T_1/p, a.k.a. linear speedup
- Example: x = a + b; y = b * 2; z = (x - y) * (x + y)
  (dataflow graph: a and b feed "+" and "*2" to produce x and y; x and y feed "-" and "+"; their results multiply to give z)
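To make the definitions concrete, here is a minimal sketch that computes T_1, T_inf, and P_avg for the slide's example dataflow, assuming every operation takes 1 time unit (that unit-latency assumption, and the temporary node names d, s, are mine, not the slide's):

```python
# Dataflow DAG for: x = a + b; y = b * 2; z = (x - y) * (x + y)
# Each node maps to the nodes it depends on (inputs a, b are free).
deps = {
    "x": [],            # x = a + b
    "y": [],            # y = b * 2
    "d": ["x", "y"],    # d = x - y   (hypothetical temp name)
    "s": ["x", "y"],    # s = x + y   (hypothetical temp name)
    "z": ["d", "s"],    # z = d * s
}

def depth(node, memo={}):
    # longest dependence chain ending at this node, in unit-latency ops
    if node not in memo:
        memo[node] = 1 + max((depth(p) for p in deps[node]), default=0)
    return memo[node]

T1   = len(deps)                    # 5 operations of work
Tinf = max(depth(n) for n in deps)  # critical path of 3 operations
print(T1, Tinf, T1 / Tinf)          # T1=5, T_inf=3, P_avg ~ 1.67
```

With two PEs (p = 2), the bound T_p >= max{5/2, 3} = 3 says the critical path, not the work, limits speedup.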

ILP: Instruction Level Parallelism
- Average ILP = T_1 / T_inf = (no. of instructions) / (no. of cycles required)
- code1: ILP = 1, i.e., must execute serially
    r1 <- r2 + 1
    r3 <- r1 / 17
    r4 <- r0 - r3
- code2: ILP = 3, i.e., can all execute at the same time
    r1 <- r2 + 1
    r3 <- r9 / 17
    r4 <- r0 - r10
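The average-ILP formula can be checked mechanically by scheduling each instruction one cycle after its latest producer; the (dest, [sources]) tuple encoding below is my own illustration, not the lecture's notation:

```python
# Average ILP of a basic block = (no. of instructions) / (no. of cycles),
# where each instruction is ready 1 cycle after its last RAW producer.
def avg_ilp(insts):
    ready = {}   # dest register -> cycle its value becomes available
    last = 0
    for dest, srcs in insts:
        c = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = c
        last = max(last, c)
    return len(insts) / last

# The slide's two examples (r0/r2/r9/r10 are live-in, ready at cycle 0):
code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]
print(avg_ilp(code1), avg_ilp(code2))   # 1.0 3.0
```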

Exploiting ILP for Performance
- Scalar in-order pipeline with forwarding
- operation latency (OL) = 1 base cycle
- peak IPC = 1
- required ILP >= 1 to avoid stalls
  (figure: an instruction stream flowing through the pipeline over base cycles 0 through 10)

Superpipelined Execution
- OL = M minor cycles, the same total latency as 1 base cycle; 1 base cycle = M minor cycles
- peak IPC = 1 per minor cycle
- required ILP >= M
  (figure: an instruction stream with a new IF starting every minor cycle, M per base cycle, over base cycles 0 through 10)
- Achieving full performance requires always finding M independent instructions in a row

Superscalar (In-order) Execution
- OL = 1 base cycle
- peak IPC = N
- required ILP >= N
  (figure: N instructions entering the pipeline together on every base cycle, over base cycles 0 through 10)
- Achieving full performance requires finding N independent instructions on every cycle

Limitations of the In-order Pipeline
- Achieved IPC of in-order pipelines degrades rapidly as N x M approaches the available ILP
- Despite high peak IPC potential, the pipeline is never full due to frequent dependency stalls!!
  (figure: an instruction stream riddled with bubbles from dependency stalls)

Out-of-Order Execution
- ILP is scope dependent
    r1  <- r2 + 1        r11 <- r12 + 1
    r3  <- r1 / 17       r13 <- r19 / 17
    r4  <- r0 - r3       r14 <- r0 - r20
  (within the narrow scope of the first three instructions, ILP = 1; widening the scope to all six exposes ILP = 2)
- Accessing ILP = 2 requires (1) a larger scheduling window and (2) out-of-order execution

Dataflow Execution Ordering
- Maintain a buffer of many pending instructions, a.k.a. reservation stations (RSs)
  - wait for a functional unit to be free
  - wait for register RAW hazards to resolve (i.e., for the required input operands to be produced)
- Issue instructions for execution out of order
  - select instructions in the RS whose operands are available
  - give preference to older instructions (a heuristic)
- A completing instruction frees pending, RAW-dependent instructions to execute

Tomasulo's Algorithm [IBM 360/91, 1967]
- Dispatch an instruction to an RS slot after decode
  - for each source, the instruction receives from the RF either the operand value or a placeholder RS tag
  - mark the RF destination with the RS tag of the current instruction's RS slot
- An instruction in an RS can issue when all of its operand values are ready
- A completing instruction, in addition to updating its RF destination, broadcasts its RS tag and value to all RS slots
  - any RS slot holding a matching RS tag placeholder picks up the value
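The dispatch/broadcast protocol can be sketched in a few lines; everything here (the rf/rs dictionaries, the "tN" tag format, the step function) is an illustrative model of the idea, not the 360/91's actual structures:

```python
# Tomasulo-style tag/value broadcast, minimal sketch.
rf = {"r0": ("val", 0), "r2": ("val", 7)}   # RF entry: ("val", v) or ("tag", t)
rs = {}   # RS-slot tag -> {"op": fn, "dest": reg, "src": operand slots}

def dispatch(tag, op, dest, srcs):
    # at decode: read each source from the RF (a value if ready, else a tag),
    # then mark the RF dest with this instruction's own RS tag
    rs[tag] = {"op": op, "dest": dest, "src": [rf[s] for s in srcs]}
    rf[dest] = ("tag", tag)

def broadcast(tag, value):
    # completion: wake every RS slot holding a matching tag placeholder,
    # and update the RF dest if it still names this tag
    for slot in rs.values():
        slot["src"] = [("val", value) if s == ("tag", tag) else s
                       for s in slot["src"]]
    dest = rs.pop(tag)["dest"]
    if rf.get(dest) == ("tag", tag):
        rf[dest] = ("val", value)

def step():
    # issue one wave: every slot whose operands are all values
    ready = [t for t, s in rs.items()
             if all(kind == "val" for kind, _ in s["src"])]
    for t in ready:
        op, src = rs[t]["op"], rs[t]["src"]
        broadcast(t, op(*[v for _, v in src]))

# r1 <- r2 + 1 ; r3 <- r1 / 17  (the flavor of the lecture's code1)
dispatch("t0", lambda a: a + 1, "r1", ["r2"])
dispatch("t1", lambda a: a // 17, "r3", ["r1"])  # r1 not ready: holds tag t0
step()   # t0 issues; its broadcast wakes t1
step()   # t1 issues with the forwarded value
print(rf["r1"], rf["r3"])   # ('val', 8) ('val', 0)
```

Note that t1 never re-reads the RF: it picks up r1's value directly off the broadcast, which is exactly how the tag match replaces a RAW stall.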

Instruction Reorder Buffer (ROB)
- Program-order bookkeeping (circular buffer)
  - instructions enter and leave in program order
  - tracks 10s to 100s of in-flight instructions in different stages of execution
- Dynamic juggling of state and dependency
  - the oldest finished instruction commits its architectural state updates on exit
  - all ROB entries are considered speculative due to the potential for exceptions and mispredictions
  (figure: a circular buffer ordered from oldest to youngest; on a mispredict, the entries younger than the branch become the discarded "youngest")
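The in-order-enter / in-order-leave bookkeeping above can be sketched as a small class; the size, field names, and flush interface are illustrative assumptions, not a real ROB design:

```python
# ROB bookkeeping sketch: allocate in order, commit in order, flush the young.
from collections import deque

class ROB:
    def __init__(self, size=8):
        self.size = size
        self.buf = deque()                 # oldest entry at the left

    def allocate(self, inst):              # enter in program order
        assert len(self.buf) < self.size, "ROB full: stall dispatch"
        self.buf.append({"inst": inst, "done": False})

    def mark_done(self, inst):             # execution may finish out of order
        next(e for e in self.buf if e["inst"] == inst)["done"] = True

    def commit(self):                      # leave in program order
        committed = []
        while self.buf and self.buf[0]["done"]:
            committed.append(self.buf.popleft()["inst"])
        return committed

    def flush_after(self, inst):           # mispredict: discard younger entries
        while self.buf and self.buf[-1]["inst"] != inst:
            self.buf.pop()

rob = ROB()
for i in ["i0", "i1", "i2"]:
    rob.allocate(i)
rob.mark_done("i1")         # i1 finished out of order ...
print(rob.commit())         # [] : nothing commits, i0 is still pending
rob.mark_done("i0")
print(rob.commit())         # ['i0', 'i1'] : commit strictly in program order
```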

In-order vs. Speculative State
- In-order state: the cumulative architectural effects of all instructions committed in order so far; it can never be undone!!
- Speculative state, as viewed by a given instruction in the ROB: the in-order state plus the effects of older instructions in the ROB (the effects of some older instructions may still be pending)
- Speculative state effects must be reversible
  - remember both the in-order and the speculative values for an RF register (a register may have multiple speculative values)
  - a store instruction updates memory only at commit time
- Discard younger speculative state to rewind execution to the oldest remaining instruction in the ROB

Removing False Dependencies
- With out-of-order execution come WAW and WAR hazards
- Anti- and output dependencies are false dependencies: they are on register names rather than on data
    r3 <- r1 op r2
    r5 <- r3 op r4
    r3 <- r6 op r7
  (the third instruction has a WAR hazard against the second and a WAW hazard against the first)
- With an infinite number of registers, anti- and output dependencies are avoidable by using a new register for each new value

Register Renaming: Example

    Original            Renamed
    r1 <- r2 / r3       r1 <- r2 / r3
    r4 <- r1 * r5       r4 <- r1 * r5
    r1 <- r3 + r6       r8 <- r3 + r6
    r3 <- r1 - r5       r9 <- r8 - r5

On-the-fly HW Register Renaming
  (figure: an ISA register name, e.g. r12, indexes a rename table that yields a physical register, e.g. t56, drawn from t0 ... t63)
- Maintain a mapping from ISA register names to physical registers
- When decoding an instruction that updates r_x:
  - allocate an unused physical register t_y to hold the instruction's result
  - set a new mapping from r_x to t_y
  - younger instructions using r_x as an input find t_y
- De-allocate a physical register for reuse when it is never needed again
    r1 <- r2 / r3
    r4 <- r1 * r5
    r1 <- r3 + r6
  ^^^^^ when is this exactly?
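The decode-time mapping update can be sketched directly; the free-list contents, the initial identity mapping, and the "tN" names are assumptions echoing the slide's figure, not a particular machine:

```python
# On-the-fly renaming sketch: sources read the CURRENT mapping, then the
# destination gets a fresh physical register from the free list.
free = ["t8", "t9", "t10", "t11"]               # unused physical registers
rename = {f"r{i}": f"t{i}" for i in range(8)}   # initial ISA -> physical map

def rename_inst(dest, srcs):
    phys_srcs = [rename[s] for s in srcs]   # read mapping before updating it
    old = rename[dest]                      # previous mapping of the dest
    rename[dest] = free.pop(0)              # fresh register for the new value
    # 'old' can be freed once this instruction commits (no older reader left)
    return (rename[dest], phys_srcs, old)

print(rename_inst("r1", ["r2", "r3"]))   # ('t8', ['t2', 't3'], 't1')
print(rename_inst("r4", ["r1", "r5"]))   # ('t9', ['t8', 't5'], 't4')
print(rename_inst("r1", ["r3", "r6"]))   # ('t10', ['t3', 't6'], 't8')
```

Note how the second definition of r1 lands in t10 while the multiply still reads t8, which is precisely how the WAW and WAR hazards of the previous slide disappear.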

Control Speculation
- Modern CPUs can have over 100 instructions in the out-of-order execution scope
  - if ~14% of the average instruction mix is control flow, the average distance between control-flow instructions is only about 1/0.14, i.e., roughly 7 instructions
  - instruction fetch must make multiple levels of branch predictions (condition and target) to fetch far ahead of execution and commit
- Large OOO is more about cache misses than ILP!!!
  - keep working around long cache-miss stalls
  - get started on future cache misses as early as possible (to overlap/hide the latency of cache misses)
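The back-of-envelope arithmetic behind the slide's question, with the 14% mix and 100-instruction window taken from the slide and the rest simple division:

```python
# With a 14% control-flow mix, how far apart are branches, and how many
# unresolved branch predictions does a 100-instruction OOO window imply?
branch_frac = 0.14
window = 100

dist = 1 / branch_frac                       # ~7.1 instructions between branches
in_flight_branches = window * branch_frac    # ~14 predictions pending at once
print(round(dist, 1), round(in_flight_branches))   # 7.1 14
```

Fourteen nested predictions is why a single bad predictor table entry can poison a long stretch of speculative fetch.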

Speculative Out-of-Order Execution
- A mispredicted branch, after resolution, must be rewound and restarted
- Much trickier than in the 5-stage pipeline ...
  - the rewind can be to an intermediate speculative state
  - a rewound branch could itself still be speculative and be discarded by another rewind!
  - a rewind must reestablish both architectural state (register values) and microarchitectural state (e.g., the rename table)
  - rewind/restart must be fast (mispredictions are not infrequent)
- Exception rewind is much easier. Why?

Superscalarized BP: 2-way Example
  (figure: the fetch PC splits into a tag, a BTB index, and a cache-block offset; the tag table, Branch History Table (BHT), and Branch Target Buffer (BTB) are accessed in parallel; on a hit, muxes driven by "first?" and "taken?" select the predicted PC from among PC+4, PC+8, and the BTB target, accounting for whether the branch is the last instruction in the cache block)

Trace Caching
  (figure: a control-flow graph of basic blocks A through G in which one path accounts for 90% of dynamic execution; the compiler's static layout A B C D E F G spans i-cache line boundaries, while the hardware captures the hot dynamic sequence A B C D F G contiguously within trace-cache line boundaries)

Prototypical Superscalar OOO Datapath
  (figure: wide instruction fetch + predict, then wide instruction decode, then rename, allocating into the ROB; renamed integer instructions wait in an RS backed by the integer physical registers and issue to ALU1, ALU2, and LD/ST; renamed FP instructions wait in an RS backed by the FP physical registers and issue to FPU1 and FPU2)
- Read [Yeager 1996, IEEE Micro] if you are interested

At the 2005 Peak of Superscalar OOO (Microprocessor Report, December 2004)

                      Alpha 21364  AMD Opteron  Intel Xeon  IBM Power5  MIPS R14000  Intel Itanium2
  clock (GHz)         1.30         2.4          3.6         1.9         0.6          1.6
  issue rate          4            3 (x86)      3 (rop)     8           4            8
  pipeline int/fp     7/9          9/11         22/24       12/17       6            8
  inst in flight      80           72 (rop)     126 (rop)   200         48           inorder
  rename reg          48+41        36+36        128         48/40       32/32        328
  transistors (10^6)  135          106          125         276         7.2          592
  power (W)           155          86           103         120         16           130
  SPECint 2000        904          1,566        1,521       1,398       483          1,590
  SPECfp 2000         1,279        1,591        1,504       2,576       499          2,712

At peak minus 5 years (Microprocessor Report, December 2000)

                      Alpha 21264  AMD Athlon  Intel P4   MIPS R12000  IBM Power3  HP PA8600  SUN Ultra3
  clock (MHz)         833          1200        1500       400          450         552        900
  issue rate          4            3 (x86)     3 (rop)    4            4           4          4
  pipeline int/fp     7/9          9/11        22/24      6            7/8         7/9        14/15
  inst in flight      80           72 (rop)    126 (rop)  48           32          56         inorder
  rename reg          48+41        36+36       128        32+32        16+24       56         inorder
  transistors (10^6)  15.4         37          42         7.2          23          130        29
  power (W)           75           76          55         25           36          60         65
  SPECint 2000        518          524                    320          286         417        438
  SPECfp 2000         590          304         549        319          356         400        427

Performance (In)efficiency
- To hit the expected performance target:
  - pushed frequency harder by deepening pipelines
  - used the 2x transistors per generation to build more complicated microarchitectures (i.e., caches, BP, superscalar, out-of-order) so that fast/deep pipelines don't stall
- The consequence of performance inefficiency is running into the limit of economical cooling [ITRS]: by 2005, Intel's P4 Tejas was headed for ~150 W [Borkar, IEEE Micro, July 1999]

Efficiency of Parallel Processing
  (figure: technology-normalized power vs. technology-normalized performance (op/sec) for Intel processors from the 486 to the Pentium 4, following a Power ∝ Perf^1.75 trend; better to replace 1 big core by 2, or N, smaller, more efficient ones)
- [Energy per Instruction Trends in Intel Microprocessors, Grochowski et al., 2006]

Moore's Law Era -> Multicore Era: growing transistor count and aggregate performance; flattened power and sequential performance; lowering frequency

At peak plus 1 year (Microprocessor Report, Aug 2006)

                      AMD 285   Intel 965  Intel 5160  Intel Itanium2  IBM P5+  MIPS R16000  SUN Ultra4
  cores/threads       2x1       2x2        2x1         2x2             2x2      1x1          2x1
  clock (GHz)         2.6       3.73       3.0         1.6             2.3      0.7          1.8
  issue rate          3 (x86)   3 (rop)    4 (rop)     6               8        4            4
  pipeline depth      11        31         14          8               17       6            14
  inst in flight      72 (rop)  126 (rop)  96 (rop)    inorder         200      48           inorder
  on-chip $ (MB)      2x1       2x2        4           2x13            1.9      0.064        2
  transistors (10^6)  233       376        291         1700            276      7.2          295
  power (W)           95        130        80          104             100      17           90
  SPECint 2000/core   1942      1870       1556 *      1474            1820     560          1300
  SPECfp 2000/core    2260      2232       1694 +      3017            3369     580          1800

  * 3086 / + 2884 according to www.spec.org

At peak plus 3 years (Microprocessor Report, Oct 2008)

                      AMD Opteron  Intel Xeon  Intel         IBM P5    IBM P6     Fujitsu    SUN T2
                      8360SE       X7460       Itanium 9050                       SPARC64 VII
  cores/threads       4x1          6x1         2x2           2x2       2x2        4x2        8x8
  clock (GHz)         2.5          2.67        1.60          2.2       5          2.52       1.8
  issue rate          3 (x86)      4 (rop)     6             5         7          4          2
  pipeline depth      12/17        14          8             15        13         15         8/12
  out of order        72 (rop)     96 (rop)    inorder       200       limited    64         inorder
  on-chip $ (MB)      2+2          9+16        1+12          1.92      8          6          4
  transistors (10^6)  463          1900        1720          276       790        600        503
  power max (W)       105          130         104           100       >100       135        95
  SPECint 2006
   per core/total     14.4/170     22/274      14.5/1534     10.5/197  15.8/1837  20.1/1822  /142
  SPECfp 2006
   per core/total     18.5/156     22/142      17.3/1671     12.9/229  10.5/2088  25.0/1861  /111

On to Mainstream Parallelism in Multicores and Manycores
  (figure: multiple cores, each with a private $, sharing a fat interconnect, a big L2, and a bigger L3)
- Remember: we got here because we need to compute faster while using less energy per operation