Warped-Compression: Enabling Power Efficient GPUs through Register Compression

Sangpil Lee, Keunsoo Kim, Won Woo Ro (Yonsei University*)
Gunjae Koo, Hyeran Jeon, Murali Annavaram (USC)
(*Work done while visiting USC)

Short Summary
- Target: register file on GPUs
- Problem: energy consumption of the register file
- Solution: data compression on the register file
- Results: register file energy consumption reduced by 25%

Motivation: Register Power Consumption
- GPUs need large register files to maximize thread-level parallelism (TLP)
- The register file contributes a significant portion of the total GPU chip power
- Register file size has been growing:
  - Tesla (G80/G92): 512 KB
  - Tesla (GT200): 1920 KB
  - Fermi (GF110): 2048 KB
  - Kepler (GK110): 3840 KB
  - Maxwell (GM200): 6144 KB
[Figure: estimated GeForce GTX480 (Fermi) component power consumption*]
*Leng et al., GPUWattch: Enabling Energy Optimizations in GPGPUs

Motivation: GPU Register Characteristics
- Warp: a bundle of 32 threads
- Operand of a warp: a bundle of 32 thread registers
  - GPUs treat this bundle of registers as a single instruction operand
[Figure: warp instruction add.u32 %r0, %r1, %r6 (dst, src1, src2); each of threads T0 to T31 holds its own r0, r1, and r6, so one warp operand is 32 registers x 32 bits (128 bytes)]

Baseline Register File
- Multi-banked register file*
  - 4 KB per bank, 32 banks
  - 128 bits wide, single read/write port
  - Each bank provides 4 thread operands
- Collecting one warp operand requires accessing 8 banks
[Figure: bank arbiter, banks 0 to 7 (4 KB each, 128 bits wide), operand collector buffer (32 bits x 32)]
*Gebhart et al., Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors

Register File Access Energy
- Accessing warp operand registers activates multiple banks
  - Cost per bank: bank access energy + wire energy
  - 4 KB SRAM bank access energy (1): 7 pJ
  - 128-bit wire energy (2): 9.6 pJ/mm, over a 1 mm wire
- Access energy per warp operand: (7 + 9.6) * 8 = 132.8 pJ
- Register file access is power hungry!
- How can we reduce register file access energy?
(1) CACTI (1.0 V, 45 nm)
(2) Gebhart et al., Energy-efficient Mechanisms for Managing Thread Context in Throughput Processors (1.0 V, 40 nm)
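The per-operand figure above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the constants are the values quoted on the slide, while the function name and the idea of passing in the number of active banks are assumptions made for this example.

```python
# Figures quoted on the slide (CACTI bank access energy, Gebhart et al. wire energy).
BANK_ACCESS_PJ = 7.0        # 4 KB SRAM bank access energy
WIRE_PJ_PER_MM = 9.6        # 128-bit wire energy per mm
WIRE_LENGTH_MM = 1.0        # wire length shown on the slide
BANKS_PER_WARP_OPERAND = 8  # 128 bytes / 16 bytes per bank


def operand_access_energy_pj(banks_active: int) -> float:
    """Energy to access one warp operand when `banks_active` banks are touched."""
    per_bank = BANK_ACCESS_PJ + WIRE_PJ_PER_MM * WIRE_LENGTH_MM
    return per_bank * banks_active


uncompressed = operand_access_energy_pj(BANKS_PER_WARP_OPERAND)
print(f"uncompressed warp operand: {uncompressed:.1f} pJ")
```

Calling the same function with fewer banks shows why touching fewer banks per operand is attractive: the cost scales linearly with the number of banks activated.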

Opportunity: Similarity of Register Values
- Value similarity is frequently observed within a warp operand
- Constant value: all thread registers in a warp hold the same value
    T0   T1   T2   T3   ...  T28  T29  T30  T31
    1    1    1    1    ...  1    1    1    1
- Index values: all thread registers hold incremental values
    T0   T1   T2   T3   ...  T28  T29  T30  T31
    0    1    2    3    ...  28   29   30   31
- Low dynamic range: the values of all thread registers are bounded within a limited range
    T0   T1   T2   T3   ...  T28  T29  T30  T31
    127  156  156  157  ...  172  173  8    2
    Dynamic range: 46 (min = 127, max = 173)
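As a rough illustration of the three patterns, the sketch below labels a warp operand by checking them in order. The function name, the step of exactly 1 for index values, and the 128 threshold for "low dynamic range" are assumptions chosen for this example, not definitions taken from the slides.

```python
from typing import Sequence


def classify_warp_operand(values: Sequence[int], max_range: int = 128) -> str:
    """Label a warp operand with one of the similarity patterns (illustrative only)."""
    if all(v == values[0] for v in values):
        return "constant"
    # Index pattern: each thread holds the previous thread's value plus one.
    if all(b - a == 1 for a, b in zip(values, values[1:])):
        return "index"
    # Low dynamic range: the spread between min and max is small.
    if max(values) - min(values) <= max_range:
        return "low dynamic range"
    return "random"


print(classify_warp_operand([1] * 32))
print(classify_warp_operand(list(range(32))))
print(classify_warp_operand([127, 156, 156, 157, 172, 173]))
```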

Source of Value Similarity: pathfinder*

    __global__ void pathfinder_kernel(int iteration, ...)
    {
        ...
        int tx = threadIdx.x;   // thread index (0 ~ 1023)
        int bx = blockIdx.x;    // thread block index (0 ~ 65535)
        int small_block_cols = BLOCKSIZE - iteration*HALO*2;
        int blkx = small_block_cols*bx - border;
        int xidx = blkx + tx;
        ...
        for (int i = 0; i < iteration; i++) {
            computed = false;
            if (IN_RANGE(tx, i+1, BLOCKSIZE-i-2) && isvalid) {
                computed = true;
                int left = prev[w];
                int up = prev[tx];
                int right = prev[e];
                int shortest = MIN(left, up);
                shortest = MIN(shortest, right);
                int index = cols*(startstep+i) + xidx;
                result[tx] = shortest + wall[index];   // application input data (0 ~ 9)
            }
            ...
        }
        ...
    }

The slide highlights three kinds of values in this kernel: index values, constant values, and low dynamic range values.
*from Rodinia Benchmark Suite

Arithmetic Distance Distribution
- How large is this opportunity?
- On average, 70% of thread registers are not random
  - Zero: neighboring registers have the same value
  - 128 bin: neighboring registers differ by at most 128
  - 32K bin: neighboring registers differ by at most 2^15
[Figure: arithmetic distance distribution; bars split into Zero, 128 bin, 32K bin, and Random; y-axis from 0 to 1]
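The binning described above can be sketched as follows. This is a minimal illustration that assumes the distance is measured between each pair of adjacent thread registers within one warp operand; the function names are hypothetical, and the slides do not spell out the exact counting method.

```python
from collections import Counter
from typing import Dict, Sequence

BINS = ("zero", "128", "32K", "random")


def distance_bin(a: int, b: int) -> str:
    """Bin the arithmetic distance between two neighboring thread registers."""
    d = abs(a - b)
    if d == 0:
        return "zero"
    if d <= 128:
        return "128"
    if d <= 2 ** 15:
        return "32K"
    return "random"


def distance_distribution(values: Sequence[int]) -> Dict[str, float]:
    """Fraction of neighboring pairs in each bin (needs at least two values)."""
    pairs = list(zip(values, values[1:]))
    counts = Counter(distance_bin(a, b) for a, b in pairs)
    return {name: counts[name] / len(pairs) for name in BINS}


print(distance_distribution(list(range(32))))
```

With thread-index values, every neighboring pair differs by 1, so all pairs land in the 128 bin.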

Exploiting Value Similarity for Register Compression

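Register Compression
- Writeback data (32 bits x 32) passes through the compressor before it reaches the register banks
- Example with data compressed to 50%: banks 0 to 3 hold the compressed data; banks 4 to 7 are not accessed
- Only 50% of the register file and wires are active
- On a read, the decompression unit restores the warp operand (32 bits x 32)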

But Is It Practical?
- Energy consumption
  - Compression and decompression consume extra energy
- Register file access latency
  - Compression and decompression increase register file access latency
- Requirements for register compression
  - Low-energy compression
  - Low-latency compression
  - High compression ratio

Low Latency/Energy Compression
- Base-Delta-Immediate (BΔI) compression
  - Optimized for compressing zero and similar values
  - Uses a base and deltas to represent the original values
- Original data: warp operand (32 thread registers)
    100,000,000  100,000,001  100,000,002  ...  100,000,031
    4 bytes each, 128 bytes in total
- BΔI compressed representation (Base4, Delta1)
    Base value: 100,000,000 (4 bytes)
    Delta values: 1, 2, ..., 31 (1 byte each, 31 bytes)
    35 bytes in total
- Register file: 3 banks used, 5 banks unused
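The example above can be worked through in code. The sketch below is a simplified model, not the hardware design: it assumes the first thread's register serves as the base, treats deltas as signed integers, and ignores 32-bit wraparound. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

BASE_BYTES = 4  # one 32-bit register value is kept as the base


def compress(values: List[int], delta_bytes: int) -> Optional[Tuple[int, List[int]]]:
    """Return (base, deltas) if every delta fits in `delta_bytes`, else None."""
    base = values[0]  # assumption: the first thread's value is the base
    deltas = [v - base for v in values[1:]]
    if delta_bytes == 0:
        lo = hi = 0  # only identical values compress with no delta storage
    else:
        hi = (1 << (8 * delta_bytes - 1)) - 1
        lo = -(1 << (8 * delta_bytes - 1))
    if all(lo <= d <= hi for d in deltas):
        return base, deltas
    return None


def decompress(base: int, deltas: List[int]) -> List[int]:
    """Rebuild the original warp operand by adding each delta to the base."""
    return [base] + [base + d for d in deltas]


def compressed_size(num_values: int, delta_bytes: int) -> int:
    """Bytes needed for one base plus one delta per remaining thread."""
    return BASE_BYTES + (num_values - 1) * delta_bytes


operand = [100_000_000 + i for i in range(32)]
base, deltas = compress(operand, delta_bytes=1)
print(base, deltas[:3], compressed_size(len(operand), 1))
```

For the slide's operand this yields a 4-byte base plus 31 one-byte deltas, which matches the 35-byte total shown above.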

BΔI Compression Parameters
- BΔI can use various base and delta sizes
  - Base: 2, 4, or 8 bytes / Delta: 0, 1, or 2 bytes
  - Various base and delta sizes can improve the compression ratio
  - But they also increase the complexity of compression and decompression
- Use a single base size with various delta sizes
  - Most registers can be compressed using a 4-byte base (Base4)
  - Various delta sizes improve the compression ratio
  - We use a 4-byte base with 0/1/2-byte deltas
[Figure: BΔI base/delta type ratio (AVG); categories: Not Compressed, Base8/Delta4, Base8/Delta2, Base8/Delta1, Base8/Delta0, Base4/Delta2, Base4/Delta1, Base4/Delta0]
[Figure: compression ratio (AVG) for Base4/Delta0 only, Base4/Delta1 only, Base4/Delta2 only, and Base4/Delta0,1,2]
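To illustrate why offering several delta sizes helps, the sketch below picks the smallest of the 0/1/2-byte deltas that fits a warp operand and reports how many 16-byte banks the result would occupy. It assumes the first thread's value is the base, that deltas are signed, and that compressed data is packed contiguously across banks; the function names are hypothetical.

```python
import math
from typing import Optional, Sequence

BASE_BYTES = 4           # 4-byte base
BANK_BYTES = 16          # each bank is 128 bits wide
DELTA_SIZES = (0, 1, 2)  # delta widths considered, smallest first


def fits(delta: int, nbytes: int) -> bool:
    """True if a signed delta can be stored in `nbytes` bytes."""
    if nbytes == 0:
        return delta == 0
    half = 1 << (8 * nbytes - 1)
    return -half <= delta < half


def smallest_delta_bytes(values: Sequence[int]) -> Optional[int]:
    """Smallest delta width that covers every thread, or None if incompressible."""
    base = values[0]
    for n in DELTA_SIZES:
        if all(fits(v - base, n) for v in values[1:]):
            return n
    return None


def banks_needed(values: Sequence[int]) -> int:
    """Number of 16-byte banks occupied after choosing the best delta width."""
    n = smallest_delta_bytes(values)
    raw_bytes = 4 * len(values)
    size = raw_bytes if n is None else BASE_BYTES + (len(values) - 1) * n
    return math.ceil(size / BANK_BYTES)


for operand in ([7] * 32, list(range(32)), [0, 1000] + [0] * 30):
    print(smallest_delta_bytes(operand), banks_needed(operand))
```

A constant operand needs a single bank, an index-like operand needs three, and an operand that needs 2-byte deltas still fits in five of the eight banks.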

Warped-Compression Architecture
- Compressor: inserted in front of the register file banks
- Decompressor: inserted in front of the operand collectors
- Bank arbiter: tracks
  - which registers are compressed
  - which compression parameters are used
[Figure: warp scheduler, issue stage, bank arbiter with compression range indicator vector, compressor unit array, register banks 0 to 31, interconnect, decompressor unit array, operand collectors, SIMD execution units]
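One way to picture the bank arbiter's bookkeeping is a small table that records, per register, which compression parameters were used on the last write. The sketch below is purely illustrative: the 2-bit encoding, the class and method names, and the bank counts (derived from 16-byte banks and a 4-byte base) are assumptions, not details given on the slides.

```python
from enum import IntEnum
from typing import Dict, Tuple


class Range(IntEnum):
    """Hypothetical 2-bit encoding of a register's compression state."""
    DELTA0 = 0        # base only: 4 bytes
    DELTA1 = 1        # 4 + 31 * 1 = 35 bytes
    DELTA2 = 2        # 4 + 31 * 2 = 66 bytes
    UNCOMPRESSED = 3  # 128 bytes


# Banks that must be activated for each state, assuming 16-byte banks.
BANKS_TO_ACCESS = {
    Range.DELTA0: 1,
    Range.DELTA1: 3,
    Range.DELTA2: 5,
    Range.UNCOMPRESSED: 8,
}


class BankArbiter:
    """Tracks the compression state of each (warp, register) pair."""

    def __init__(self) -> None:
        self._indicator: Dict[Tuple[int, int], Range] = {}

    def record_write(self, warp: int, reg: int, state: Range) -> None:
        self._indicator[(warp, reg)] = state

    def banks_for_read(self, warp: int, reg: int) -> int:
        # Registers never written through the compressor count as uncompressed.
        state = self._indicator.get((warp, reg), Range.UNCOMPRESSED)
        return BANKS_TO_ACCESS[state]


arbiter = BankArbiter()
arbiter.record_write(warp=0, reg=5, state=Range.DELTA1)
print(arbiter.banks_for_read(0, 5), arbiter.banks_for_read(0, 6))
```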

Dealing with Branch Divergence
- Branch divergence
  - Destination registers in a warp are partially updated using the active mask
  - If the destination register is compressed, it cannot be updated using the active mask
[Figure: a warp diverges on if (threadid % 2); one path executes add r0, r1, r6 and the other executes sub r0, r1, r6, with complementary active masks over T0 to T7 (0 1 0 1 0 1 0 1 and 1 0 1 0 1 0 1 0); each path writes only its active threads' results into r0, which is a compressed destination register]

Simplifying Branch Divergence Handling
- Compression ratio in divergent regions is low
  - Thread registers in a diverged warp can hold different values depending on their execution path
- Simple solution: disable compression in divergent regions
- But what if a destination register is already compressed?
  - Use dummy MOV instructions
[Figure: compression ratio per benchmark for the non-divergent region, the divergent region, and overall; the divergent-region value is N/A for several benchmarks]

Handling Branch Divergence (1)
- Turn off register compression
  - The compression unit is disabled when the active mask contains any zero
- Decompress the destination operand register
  - The bank arbiter injects a dummy MOV instruction into the execution pipeline when the destination register is compressed
  - The dummy MOV instruction uses the same register as source and destination
- Example: add r0, r1, r6
  1. Register access request to read the input operands (r1, r6)
  2. Divergence check
  3. Destination register (r0) check
  4. If the destination register is compressed, suspend the original request and inject a dummy MOV instruction (mov r0, r0)
  5. Read and decompress

Handling Branch Divergence (2)
- Update the register file
  - The dummy MOV instruction writes the uncompressed register value
  - At this point, the destination register in the register file is uncompressed
- Resume the suspended request
  - The bank arbiter processes the suspended access request to the destination register as a conventional register access
- Example, continued (compressor disabled)
  6. Write back the uncompressed destination register value
  7. The bank arbiter grants the register write for the uncompressed register value
  8. The bank arbiter restarts the suspended register access request
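The divergence handling steps can be mimicked in a small software model. The sketch below is a simplification under stated assumptions: a register is either stored as base plus deltas or as plain values, an 8-thread warp is used for brevity, and re-compression of full (non-divergent) writes is left out. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Register:
    """One warp register: either compressed (base + deltas) or plain values."""
    compressed: bool
    base: int = 0
    deltas: Tuple[int, ...] = ()
    values: Tuple[int, ...] = ()  # meaningful only when not compressed

    def read(self) -> List[int]:
        if self.compressed:
            return [self.base] + [self.base + d for d in self.deltas]
        return list(self.values)


def masked_write(reg: Register, results: List[int],
                 active_mask: List[bool], log: List[str]) -> None:
    """Write `results` for active threads only, decompressing first if needed."""
    diverged = not all(active_mask)
    if diverged and reg.compressed:
        # Dummy MOV (mov r, r): read, decompress, and write back uncompressed.
        log.append("dummy MOV")
        reg.values = tuple(reg.read())
        reg.compressed = False
    old = reg.read()
    merged = [new if active else prev
              for new, prev, active in zip(results, old, active_mask)]
    # This sketch always stores the result uncompressed.
    reg.values = tuple(merged)
    reg.compressed = False


r0 = Register(compressed=True, base=1, deltas=(0,) * 7)
log: List[str] = []
masked_write(r0, [9] * 8, [False, True] * 4, log)
print(log, r0.read())
```

The first divergent write to the compressed register triggers one dummy MOV; a later divergent write to the same register finds it already uncompressed and proceeds as a conventional masked write.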

Register File Energy Saving
- Average register file energy consumption: reduced by 25%
  - Dynamic energy consumption: reduced by register compression
  - Leakage energy consumption: reduced by bank-level power gating of unused banks
- Extra energy consumption of the compressor/decompressor: insignificant
[Figure: register file energy (AVG), broken down into RF leakage, RF dynamic, compressor, and decompressor; y-axis from 0 to 1]

Impact on Performance
- Performance degradation: negligible
  - 2-cycle compression + 1-cycle decompression latency = 0.1% performance loss
  - Dummy MOV instructions account for less than 2% of the total instruction count
[Figure: execution time, Baseline vs. Warped-Compression]

Conclusion
- Register files are power hungry
- But register file data exhibits strong value similarity
- Use BΔI compression to exploit value similarity and compress register data
- Compression is effective
  - Reduces the size of a warp operand to 60%
- Compression is energy efficient
  - Saves 25% of total register file energy consumption
- Compression has negligible performance impact
  - 0.1% degradation

Backup Slides

Evaluation Environment

Simulation parameters
  Parameter                    Value
  Clock Frequency              1.4 GHz
  SMs / GPU                    15
  Warp Schedulers / SM         2
  Warp Scheduling Policy       GTO
  SIMT Lane Width              32
  Max # of Warps / SM          48
  Max # of Threads / SM        1536
  Register File Size           128 KB
  Max Registers / SM           32,768
  # of Register Banks          32
  Bit Width / Bank             128 bits
  # of Entries / Bank          256
  # of Compressors             2
  # of Decompressors           4
  Compression Latency          2 cycles
  Decompression Latency        1 cycle
  Bank Wakeup Latency          10 cycles

Energy parameters
  Parameter                                 Value
  Operating Voltage                         1.0 V
  Wire Capacitance (45 nm)                  300 fF/mm
  Wire Energy (128-bit)                     9.6 pJ/mm
  Access Energy / Bank (45 nm)              7 pJ
  Leakage Power / Bank (45 nm)              5.8 mW
  Compression Unit Energy / Activation      23 pJ
  Compression Unit Leakage Power            0.12 mW
  Decompression Unit Energy / Activation    21 pJ
  Decompression Unit Leakage Power          0.08 mW

Benchmarks: GPGPU-Sim, Rodinia benchmark suite, Parboil benchmark suite

Compression & Decompression Unit
- Simplifying BΔI
  - GPU registers are 32 bits wide
  - Only a 4-byte base and 0/1/2-byte deltas are used for compressing register values
  - Only 32-bit adders/subtractors and bit comparators are needed
[Figure: compressor: the 128-byte original data feeds an array of 32-bit subtractors that produce deltas Δ0 to Δn-1 relative to the 4-byte base; sign-extension comparators decide "Compressible?"; if yes, the data is packed and output as compressed data; if no, the original data is output]
[Figure: decompressor: an array of 32-bit adders combines the 4-byte base with deltas Δ0 to Δ30 to rebuild the 128-byte original data]

Arithmetic Distance Distribution
- How large is this opportunity?
- On average, 79% of thread registers are not random
  - Zero: neighboring registers have the same value
  - 128 bin: neighboring registers differ by at most 128
  - 32K bin: neighboring registers differ by at most 2^15
[Figure: arithmetic distance distribution per benchmark (LIB, AES, BFS, CP, LPS, STO, backp, hots, path, srad, dwt2d, cutcp, mriq, sad, sgemm, spmv, stencil, Avg); bars split into Zero, 128 bin, 32K bin, and Random; non-divergent bars are labeled Nondiv and some entries are N/A; y-axis from 0 to 1]

Register Compression
- Compressed register data reduces the number of register file accesses
- Example with data compressed to 50%: banks 0 to 3 hold the compressed data; banks 4 to 7 do not need to be accessed
- Only 50% of the register file and wires are active
- The decompression unit restores the warp operand (32 bits x 32)

BΔI Compression Parameters
- BΔI can use various base and delta sizes
  - Base: 2, 4, or 8 bytes / Delta: 0, 1, or 2 bytes
  - Various base and delta sizes can improve the compression ratio
  - But they increase the complexity of compression and decompression
- Use a fixed base size with various delta sizes
  - Most registers can be compressed using a 4-byte base (Base4)
  - GPU register granularity is 32 bits, so 2-byte and 8-byte bases are not needed
  - Various delta sizes improve the compression ratio
  - We use a 4-byte base with 0/1/2-byte deltas
[Figure: BΔI base/delta type ratio (AVG); categories: Not Compressed, Base8/Delta4, Base8/Delta2, Base8/Delta1, Base8/Delta0, Base4/Delta2, Base4/Delta1, Base4/Delta0]
[Figure: compression ratio (AVG) for Base4/Delta0 only, Base4/Delta1 only, Base4/Delta2 only, and Base4/Delta0,1,2; y-axis from 0 to 3, with one off-scale bar labeled 5.6]

Handling Branch Divergence
- Compression ratio in divergent regions is low
- Solution: disable compression and decompress the register before access
  - A dummy MOV instruction (with the same source and destination) decompresses the register when the destination register is compressed
- Writeback flow
  - If the active mask has a 0: disable the compressor
  - If the destination register is compressed: suspend the target register writeback, inject a dummy MOV, and decompress the destination register
  - Resume the register write and complete the writeback
[Figure: compression ratio per benchmark for the non-divergent region, the divergent region, and overall; the divergent-region value is N/A for several benchmarks]

Register File Energy Saving
- Average register file energy consumption: reduced by 25%
  - Dynamic energy consumption: reduced by register compression
  - Leakage energy consumption: reduced by bank-level power gating of unused banks
- Extra energy consumption of the compressor/decompressor: insignificant
[Figure: register file energy per benchmark (LIB, AES, BFS, CP, LPS, STO, back, hot, path, srad, dwt2d, cutcp, mriq, sad, sgemm, spmv, stencil, AVG), broken down into RF leakage, RF dynamic, compressor, and decompressor; y-axis from 0 to 1]