Warped-Compression: Enabling Power Efficient GPUs through Register Compression

Warped-Compression: Enabling Power Efficient GPUs through Register Compression
Sangpil Lee, Keunsoo Kim, Won Woo Ro (Yonsei University*)
Gunjae Koo, Hyeran Jeon, Murali Annavaram (USC)
(*Work done while visiting USC)

Short Summary
Target: register file on GPUs
Problem: energy consumption of the register file
Solution: data compression on the register file
Results: register file energy consumption reduced by 25%

Motivation: Register Power Consumption
GPUs need large register files to maximize TLP
The register file contributes a significant portion of the total GPU chip power
Register file size has been growing
  Tesla (G80/G92):  512 KB
  Tesla (GT200):    1920 KB
  Fermi (GF110):    2048 KB
  Kepler (GK110):   3840 KB
  Maxwell (GM200):  6144 KB
[Figure: estimated GeForce GTX480 (Fermi) component power consumption*]
*Leng et al., GPUWattch: Enabling Energy Optimizations in GPGPUs

Motivation: GPU Register Characteristics
Warp: a bundle of 32 threads
Operands of a warp: a bundle of 32 thread registers
  This bundle of registers is treated as a single instruction operand in GPUs
Warp instruction: add.u32 %r0, %r1, %r6;   (dst = %r0, src1 = %r1, src2 = %r6)
  Thread   T0  T1  T2  T3  ...  T28  T29  T30  T31
  dst      r0  r0  r0  r0  ...  r0   r0   r0   r0
  src1     r1  r1  r1  r1  ...  r1   r1   r1   r1
  src2     r6  r6  r6  r6  ...  r6   r6   r6   r6
Each warp operand: 32-bit registers x 32 (128 bytes)

Baseline Register File
Multi-banked register file*
  4 KB per bank, 32 banks
  128-bit wide, single read/write port
  Provides 4 thread operands per bank
  Access 8 banks to collect a warp operand
[Figure: bank arbiter, 4 KB banks (128-bit wide) Bank 0 to Bank 7, operand collector buffer (32-bit x 32)]
*Gebhart et al., Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors

Register File Access Energy
Accessing warp operand registers activates multiple banks
  Bank access energy + wire energy
  4 KB SRAM bank access energy (1): 7 pJ
  128-bit wire energy (2): 9.6 pJ/mm
  Access energy per warp operand: (7 + 9.6) * 8 = 132.8 pJ (8 banks, 1 mm of wire per bank as drawn in the figure)
Register file access is power hungry!
How can we reduce register file access energy?
[Figure: bank arbiter, Bank 0 to Bank 7, operand collector buffer (32-bit x 32), 1 mm wire]
(1) CACTI (1.0 V, 45 nm)
(2) Gebhart et al., Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors (1.0 V, 40 nm)
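The per-operand energy figure on this slide can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are hypothetical, the 1 mm wire length is taken from the figure label, and compressor/decompressor energy is deliberately left out.

```python
# Values quoted on the slide; the names are illustrative, not from the authors.
SRAM_ACCESS_PJ = 7.0      # 4 KB SRAM bank access energy (CACTI, 1.0 V, 45 nm)
WIRE_PJ_PER_MM = 9.6      # 128-bit wire energy per mm
WIRE_LENGTH_MM = 1.0      # assumption: 1 mm of wire per bank, as labelled in the figure
BANKS_PER_WARP_OPERAND = 8


def warp_operand_access_energy_pj(banks=BANKS_PER_WARP_OPERAND):
    """Energy in pJ to access one warp operand spread over `banks` banks."""
    per_bank = SRAM_ACCESS_PJ + WIRE_PJ_PER_MM * WIRE_LENGTH_MM
    return per_bank * banks


full = warp_operand_access_energy_pj()    # all 8 banks active
partial = warp_operand_access_energy_pj(3)  # e.g. an operand that fits in 3 banks
print(f"8 banks: {full:.1f} pJ, 3 banks: {partial:.1f} pJ")
```

Because the cost scales linearly with the number of banks touched, any scheme that stores an operand in fewer banks reduces the access energy proportionally under this simple model.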

Opportunity: Similarity of Register Values
Value similarity is frequently observed in a warp operand
Constant value: all thread registers in a warp have the same value
  Thread   T0   T1   T2   T3   ...  T28  T29  T30  T31
  src      1    1    1    1    ...  1    1    1    1
Index values: all thread registers have incremental values
  Thread   T0   T1   T2   T3   ...  T28  T29  T30  T31
  src      0    1    2    3    ...  28   29   30   31
Low dynamic range: values of all thread registers are bounded in a limited range
  Thread   T0   T1   T2   T3   ...  T28  T29  T30  T31
  src      127  156  156  157  ...  172  173  8    2
  Dynamic range: 46 (min = 127, max = 173)
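The three patterns above can be told apart with a simple check over the values of one warp operand. This is a minimal sketch, not the authors' method: the function name and the returned labels are hypothetical, and no threshold for "low" dynamic range is assumed, so the range is simply reported.

```python
def classify_warp_operand(values):
    """Return (kind, dynamic_range) for the per-thread values of one operand.

    kind is "constant" if all values are equal, "index" if each value is the
    previous one plus 1, and "range" otherwise. dynamic_range is max - min.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return ("constant", 0)
    if all(b - a == 1 for a, b in zip(values, values[1:])):
        return ("index", hi - lo)
    return ("range", hi - lo)


print(classify_warp_operand([1] * 32))          # constant value
print(classify_warp_operand(list(range(32))))   # index values
print(classify_warp_operand([127, 156, 156, 157, 172, 173]))  # bounded range
```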

Source of Value Similarity: pathfinder*
  Index values: thread index (0 ~ 1023)
  Constant values: thread block index (0 ~ 65535)
  Low dynamic range: application input data (0 ~ 9)

    __global__ void pathfinder_kernel(int iteration, ...) {
        ...
        int tx = threadIdx.x;   // thread index
        int bx = blockIdx.x;    // thread block index
        int small_block_cols = BLOCKSIZE - iteration*HALO*2;
        int blkx = small_block_cols*bx - border;
        int xidx = blkx + tx;
        ...
        for (int i = 0; i < iteration; i++) {
            computed = false;
            if (IN_RANGE(tx, i+1, BLOCKSIZE-i-2) && isvalid) {
                computed = true;
                int left = prev[w];
                int up = prev[tx];
                int right = prev[e];
                int shortest = MIN(left, up);
                shortest = MIN(shortest, right);
                int index = cols*(startstep+i) + xidx;
                result[tx] = shortest + wall[index];   // application input data
            }
            ...
        }
        ...
    }

*from Rodinia Benchmark Suite

Arithmetic Distance Distribution
How much is this opportunity?
On average, 70% of thread registers are not random
  Zero: neighboring registers have the same value
  128 bin: neighboring registers differ by at most 128
  32K bin: neighboring registers differ by at most 2^15
[Figure: arithmetic distance distribution (Zero, 128 bin, 32K bin, Random)]
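The binning described on this slide can be sketched as follows. This is an illustration under stated assumptions: "neighboring registers" is taken to mean adjacent thread registers within one warp operand, the distance is the absolute difference, and the function names are hypothetical.

```python
from collections import Counter


def distance_bin(a, b):
    """Bin the arithmetic distance between two neighboring thread registers."""
    d = abs(a - b)
    if d == 0:
        return "zero"
    if d <= 128:
        return "128"
    if d <= 2 ** 15:
        return "32K"
    return "random"


def distance_histogram(values):
    """Count the bins over all adjacent pairs of a warp operand's values."""
    return Counter(distance_bin(a, b) for a, b in zip(values, values[1:]))


# A short made-up operand that hits every bin exactly once.
hist = distance_histogram([10, 10, 11, 1000, 100000])
print(dict(hist))
```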

Exploiting Value Similarity for Register Compression

Register Compression
Only 50% of the register file and wires are active
[Figure: writeback data (32-bit x 32) passes through the compressor; with 50% compression the bank arbiter stores the compressed data in Bank 0 to Bank 3 and leaves Bank 4 to Bank 7 untouched; on a read the data is decompressed into a warp operand (32-bit x 32)]

But Is It Practical?
Energy consumption
  Compression and decompression consume extra energy
Register file access latency
  Compression and decompression increase register file access latency
Requirements for register compression
  Low energy compression
  Low latency compression
  High compression ratio

Low Latency/Energy Compression
Base-Delta-Immediate (BΔI) compression
  Optimized for zero and similar value compression
  Uses a base and deltas to represent the original values
Original data: warp operand (32 thread registers)
  100,000,000  100,000,001  100,000,002  ...  100,000,031
  4 bytes each, 128 bytes in total
BΔI compressed data representation (Base4, Delta1)
  Base value: 100,000,000 (4 bytes)
  Delta values: 1, 2, ..., 31 (1 byte each, 31 deltas)
  35 bytes in total
Register file: 3 banks used, 5 banks unused
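The example on this slide can be worked through in code. The sketch below assumes a 4-byte base taken from the first thread's register, signed deltas of 0, 1, or 2 bytes, and 16-byte (128-bit) banks; any metadata needed to record the chosen encoding is ignored. The function names are hypothetical and this is not the authors' implementation.

```python
import math

BASE_BYTES = 4    # 4-byte base
BANK_BYTES = 16   # one 128-bit bank entry


def to_signed32(x):
    """Interpret the low 32 bits of x as a signed 32-bit integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def fits_signed(value, nbytes):
    """True if value is representable as a signed integer of nbytes bytes."""
    if nbytes == 0:
        return value == 0
    limit = 1 << (8 * nbytes - 1)
    return -limit <= value < limit


def bdi_compress(values):
    """Try 0-, 1-, then 2-byte deltas; return a dict, or None if incompressible."""
    base = values[0]  # assumption: the first thread's value is the base
    deltas = [to_signed32(v - base) for v in values[1:]]
    for delta_bytes in (0, 1, 2):
        if all(fits_signed(d, delta_bytes) for d in deltas):
            size = BASE_BYTES + delta_bytes * len(deltas)
            return {"base": base, "delta_bytes": delta_bytes,
                    "deltas": deltas, "size_bytes": size}
    return None


def banks_needed(size_bytes):
    """Number of 16-byte banks touched by size_bytes of data."""
    return math.ceil(size_bytes / BANK_BYTES)


operand = [100_000_000 + i for i in range(32)]   # the slide's example
c = bdi_compress(operand)
print(c["delta_bytes"], c["size_bytes"], banks_needed(c["size_bytes"]))
```

For the slide's operand this selects 1-byte deltas, giving 4 + 31 = 35 bytes, which occupies 3 of the 8 banks an uncompressed 128-byte operand would need.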

BΔI Compression Parameters
BΔI can use various base and delta sizes
  Base: 2, 4, 8 bytes / Delta: 0, 1, 2 bytes
  Various base and delta sizes can improve the compression ratio
  But they also increase the complexity of compression/decompression
Use a single base size and various delta sizes
  Most registers can be compressed by using a 4-byte base (Base4)
  Various delta sizes improve the compression ratio
  We use a 4-byte base and 0/1/2-byte deltas
[Figure: BDI base/delta type ratio (Not Compressed, Base8/Delta4, Base8/Delta2, Base8/Delta1, Base8/Delta0, Base4/Delta2, Base4/Delta1, Base4/Delta0)]
[Figure: compression ratio (Base4/Delta0 only, Base4/Delta1 only, Base4/Delta2 only, Base4/Delta0,1,2)]

Warped-Compression Architecture
Compressor
  Inserted in front of the register file banks
Decompressor
  Inserted in front of the operand collectors
Bank arbiter
  Tracks which register is compressed
  Tracks what compression parameters are used
[Figure: warp scheduler, issue, bank arbiter with compression range indicator vector, compressor unit array, register banks 0 to 31, interconnect, decompressor unit array, operand collectors, SIMD EXE units]

Dealing with Branch Divergence
Branch divergence
  Destination registers in a warp are partially updated using the active mask
  If the destination registers are compressed, they cannot be updated using the active mask
[Figure: a branch on (threadid % 2) splits the warp; add r0, r1, r6 runs on the true path and sub r0, r1, r6 on the false path, each under its own active mask (T0 to T7 shown), so each instruction would update only part of the compressed destination register r0]

Simplifying Branch Divergence Handling
Compression ratio in the divergent region is low
  Thread registers in a diverged warp can have different values according to their execution path
Simple solution: disable compression in the divergent region
But what if a destination register is already compressed?
  Use dummy MOV instructions
[Figure: compression ratio in the non-divergent region, the divergent region, and overall; N/A for benchmarks without a divergent region]

Handling Branch Divergence (1)
Turn off register compression
  The compression unit is disabled when the active mask contains any zero values
Decompress the destination operand register
  The bank arbiter injects a dummy MOV instruction into the execution pipeline when a destination register is compressed
  This dummy MOV instruction has the same src/dest register
Steps (example: add r0, r1, r6)
  1. Register access request to read input operands (r1, r6)
  2. Divergence check
  3. Destination register (r0) check
  4. If the destination register is compressed, suspend the original request and inject a dummy MOV instruction (mov r0, r0)
  5. Read and decompress
[Figure: warp scheduler, bank arbiter, register file (destination register is compressed), compressor, decompressor, operand collector, SIMD EXE units]

Handling Branch Divergence (2)
Update the register file
  The dummy MOV instruction writes the uncompressed register value
  At this point, the destination register in the register file is uncompressed
Resume the suspended request
  The bank arbiter processes the suspended access request to the destination register as a conventional register access
Steps (continued)
  6. Write back the uncompressed destination register value
  7. Bank arbiter grants the register write for the uncompressed register value
  8. Bank arbiter restarts the suspended register access request
[Figure: bank arbiter, register file (destination register changes from compressed to uncompressed), compressor (disabled), decompressor, operand collector, SIMD EXE units]
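The divergence handling spread over these two slides can be summarized as a small state machine kept by the bank arbiter. The sketch below is a toy model under assumptions: the class name, the action strings, and the `compressible` flag are hypothetical, and pipeline timing is not modelled.

```python
class BankArbiterModel:
    """Toy model of the bank arbiter's per-register compression tracking."""

    def __init__(self):
        self.is_compressed = {}  # register name -> stored compressed?

    def writeback(self, dst, active_mask, compressible=True):
        """Return the ordered list of actions for writing register `dst`."""
        actions = []
        if all(active_mask):
            # Non-divergent warp: the compressor is enabled.
            if compressible:
                actions.append(f"compress and write {dst}")
                self.is_compressed[dst] = True
            else:
                actions.append(f"write {dst} uncompressed")
                self.is_compressed[dst] = False
            return actions
        # Divergent warp: the compressor is disabled.
        if self.is_compressed.get(dst, False):
            # The destination must be decompressed before a masked update.
            actions.append(f"suspend write to {dst}")
            actions.append(f"inject mov {dst}, {dst}")
            actions.append(f"write {dst} uncompressed")  # done by the dummy MOV
            self.is_compressed[dst] = False
            actions.append(f"resume write to {dst}")
        actions.append(f"masked write {dst} uncompressed")
        return actions


arbiter = BankArbiterModel()
print(arbiter.writeback("r0", [1] * 32))      # full mask: compressed write
print(arbiter.writeback("r0", [0, 1] * 16))   # divergent: dummy MOV first
print(arbiter.writeback("r0", [1, 0] * 16))   # already uncompressed: plain masked write
```

Note that the dummy MOV is needed only on the first divergent write to a compressed register; later masked writes find the register already uncompressed.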

Register File Energy Saving
Average register file energy consumption: reduced by 25%
  Dynamic energy consumption: reduced by register compression
  Leakage energy consumption: reduced by bank-level power-gating of unused banks
  Extra energy consumption of the compressor/decompressor: insignificant
[Figure: register file energy broken down into RF Leakage, RF Dynamic, Compressor, and Decompressor, with average]

Impact on Performance
Performance degradation: negligible
  2-cycle compression + 1-cycle decompression latency = 0.1% performance loss
  Dummy MOV instructions account for less than 2% of the total instruction count
[Figure: execution time, Baseline vs. Warped-Compression]

Conclusion
Register files are power hungry
  But register file data exhibits strong value similarity
Use BΔI compression to exploit value similarity and compress register data
Compression is effective
  Reduces the size of a warp operand to 60%
Compression is energy efficient
  Saves 25% of total register file energy consumption
Compression has negligible performance impact
  0.1% degradation

Backup Slides

Evaluation Environment
Simulation parameters
  Parameter                                Value
  Clock Frequency                          1.4 GHz
  SMs / GPU                                15
  Warp Schedulers / SM                     2
  Warp Scheduling Policy                   GTO
  SIMT Lane Width                          32
  Max # of Warps / SM                      48
  Max # of Threads / SM                    1536
  Register File Size                       128 KB
  Max Registers / SM                       32,768
  # of Register Banks                      32
  Bit Width / Bank                         128 bit
  # of Entries / Bank                      256
  # of Compressors                         2
  # of Decompressors                       4
  Compression Latency                      2 cycles
  Decompression Latency                    1 cycle
  Bank Wakeup Latency                      10 cycles
Energy parameters
  Parameter                                Value
  Operating Voltage                        1.0 V
  Wire Capacitance (45 nm)                 300 fF/mm
  Wire Energy (128 bit)                    9.6 pJ/mm
  Access Energy / Bank (45 nm)             7 pJ
  Leakage Power / Bank (45 nm)             5.8 mW
  Compression Unit Energy / Activation     23 pJ
  Compression Unit Leakage Power           0.12 mW
  Decompression Unit Energy / Activation   21 pJ
  Decompression Unit Leakage Power         0.08 mW
Benchmarks: GPGPU-Sim, Rodinia benchmark suite, Parboil benchmark suite
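The register file parameters in the table are mutually consistent, which a short calculation confirms. The variable names below are illustrative, and the 4-byte register size is taken from the deck's statement that GPU registers are 32 bits wide.

```python
NUM_BANKS = 32
BANK_WIDTH_BITS = 128
ENTRIES_PER_BANK = 256
MAX_REGISTERS_PER_SM = 32768
REGISTER_BYTES = 4          # 32-bit thread registers
WARP_SIZE = 32

bank_entry_bytes = BANK_WIDTH_BITS // 8                  # 16 bytes per entry
bank_bytes = ENTRIES_PER_BANK * bank_entry_bytes         # capacity of one bank
register_file_bytes = NUM_BANKS * bank_bytes             # capacity of the whole file
warp_operand_bytes = WARP_SIZE * REGISTER_BYTES          # one uncompressed operand
banks_per_operand = warp_operand_bytes // bank_entry_bytes

print(bank_bytes // 1024, "KB per bank")
print(register_file_bytes // 1024, "KB register file")
print(banks_per_operand, "banks per uncompressed warp operand")
```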

Compression & Decompression Unit
Simplifying BΔI
  GPU register: 32 bit
  Only use a 4-byte base and 0/1/2-byte deltas for compressing register values
  Only need 32-bit adders/subtractors and bit comparators
[Figure, compressor: the 128-byte original data feeds an array of 32-bit subtractors that subtract the 4-byte base from each register; sign extension comparators check whether every delta fits; if compressible, the base and deltas are packed and sent to the compressed data output, otherwise the original data is output]
[Figure, decompressor: an array of 32-bit adders adds the 4-byte base to each delta to rebuild the 128-byte original data]
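The datapath on this slide (subtractors, sign extension comparators, adders) can be mimicked lane by lane in software. The following is a sketch under assumptions: the first register is used as the base, arithmetic wraps at 32 bits as a hardware subtractor or adder would, and all function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sub32(a, b):
    """32-bit subtractor: (a - b) modulo 2^32."""
    return (a - b) & MASK32


def add32(a, b):
    """32-bit adder: (a + b) modulo 2^32."""
    return (a + b) & MASK32


def sign_extend(value, nbytes):
    """Sign-extend the low nbytes of value to a 32-bit pattern."""
    if nbytes == 0:
        return 0
    bits = 8 * nbytes
    low = value & ((1 << bits) - 1)
    if low & (1 << (bits - 1)):
        low |= MASK32 & ~((1 << bits) - 1)
    return low


def delta_fits(delta32, nbytes):
    """Sign extension comparator: truncation to nbytes loses no information."""
    return sign_extend(delta32, nbytes) == delta32


def compress_lanes(values, nbytes):
    """Return (base, packed_deltas), or None if some delta needs more than nbytes."""
    base = values[0]  # assumption: the first register is the base
    deltas = [sub32(v, base) for v in values[1:]]
    if not all(delta_fits(d, nbytes) for d in deltas):
        return None
    mask = (1 << (8 * nbytes)) - 1
    return base, [d & mask for d in deltas]


def decompress_lanes(base, packed, nbytes):
    """Rebuild the original values with one 32-bit add per lane."""
    return [base] + [add32(base, sign_extend(p, nbytes)) for p in packed]


values = [1000, 999, 1001, 1127, 872]          # deltas: -1, +1, +127, -128
base, packed = compress_lanes(values, 1)
assert decompress_lanes(base, packed, 1) == values
```

Each lane's check and each lane's add are independent, which is why the hardware can use parallel arrays of subtractors, comparators, and adders.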

Arithmetic Distance Distribution
How much is this opportunity?
On average, 79% of thread registers are not random
  Zero: neighboring registers have the same value
  128 bin: neighboring registers differ by at most 128
  32K bin: neighboring registers differ by at most 2^15
[Figure: arithmetic distance distribution (Zero, 128 bin, 32K bin, Random) for non-divergent and divergent regions, N/A where a benchmark has no divergent region; benchmarks: LIB, AES, BFS, CP, LPS, STO, backp, hots, path, srad, dwt2d, cutcp, mriq, sad, sgemm, spmv, stencil, Avg]

Register Compression
Compressed register data reduces the number of register file accesses
Only 50% of the register file and wires are active
[Figure: with 50% compressed data the bank arbiter reads Bank 0 to Bank 3 only; Bank 4 to Bank 7 do not need to be accessed; the data is decompressed into a warp operand (32-bit x 32)]

BΔI Compression Parameters
BΔI can use various base and delta sizes
  Base: 2, 4, 8 bytes / Delta: 0, 1, 2 bytes
  Various base and delta sizes can improve the compression ratio
  But they increase the complexity of compression/decompression
Use a fixed base size and various delta sizes
  Most registers can be compressed by using a 4-byte base (Base4)
  GPU register granularity is 32 bits, so 2-byte or 8-byte bases are not needed
  Various delta sizes improve the compression ratio
  We use a 4-byte base and 0/1/2-byte deltas
[Figure: BDI base/delta type ratio (Not Compressed, Base8/Delta4, Base8/Delta2, Base8/Delta1, Base8/Delta0, Base4/Delta2, Base4/Delta1, Base4/Delta0)]
[Figure: compression ratio (Base4/Delta0 only, Base4/Delta1 only, Base4/Delta2 only, Base4/Delta0,1,2); one bar exceeds the axis and is labelled 5.6]

Handling Branch Divergence
Compression ratio in the divergent region is low
Solution: disable compression and decompress the register before access
  A dummy MOV instruction (which has the same source and destination) is used to decompress registers when the destination register is compressed
[Figure: compression ratio in the non-divergent region, the divergent region, and overall; N/A for benchmarks without a divergent region]
[Flow chart: Writeback; Active mask has 0?; Destination register is compressed?; Disable compressor; Inject dummy MOV; Decompress destination register; Target register writeback suspended; Resume register write; Complete writeback]

Register File Energy Saving
Average register file energy consumption: reduced by 25%
  Dynamic energy consumption: reduced by register compression
  Leakage energy consumption: reduced by bank-level power-gating of unused banks
  Extra energy consumption of the compressor/decompressor: insignificant
[Figure: register file energy broken down into RF Leakage, RF Dynamic, Compressor, and Decompressor; benchmarks: LIB, AES, BFS, CP, LPS, STO, back, hot, path, srad, dwt2d, cutcp, mriq, sad, sgemm, spmv, stencil, AVG]