CMPEN 411 VLSI Digital Circuits Spring Lecture 20: Multiplier Design


CMPEN 411 VLSI Digital Circuits, Spring 2011
Lecture 20: Multiplier Design
[Adapted from Rabaey's Digital Integrated Circuits, Second Edition, 2003, J. Rabaey, A. Chandrakasan, B. Nikolic]
Sp11 CMPEN 411 L20 S.1

Review: Basic Building Blocks
- Datapath
  - Execution units: adder, multiplier, divider, shifter, etc.
  - Register file and pipeline registers
  - Multiplexers, decoders
- Control
  - Finite state machines (PLA, ROM, random logic)
- Interconnect
  - Switches, arbiters, buses
- Memory
  - Caches (SRAMs), TLBs, DRAMs, buffers

The Binary Multiplication

              1 0 1 0 1 0     Multiplicand
        x         1 0 1 1     Multiplier
        -----------------
              1 0 1 0 1 0
            1 0 1 0 1 0       Partial products
          0 0 0 0 0 0
    +   1 0 1 0 1 0
        -----------------
        1 1 1 0 0 1 1 1 0     Result

Multiply Operation
- Multiplication is just a lot of additions
- An N-bit multiplicand and an N-bit multiplier give an N x N partial product array
  - The partial product array can be formed in parallel
- The double-precision product is 2N bits wide
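The idea that a multiply is a set of shifted additions can be sketched in a few lines of Python. This is an illustrative model only, not a circuit description; the function names `partial_products` and `multiply_by_rows` are made up for this sketch, and `n` is taken to be the number of multiplier bits.

```python
def partial_products(x: int, y: int, n: int) -> list[int]:
    """Rows of the partial product array for unsigned x times an n-bit y.

    Row i is x AND-ed with bit i of y, then shifted left by i positions.
    Every row depends only on x and one bit of y, so all rows could be
    formed in parallel in hardware.
    """
    return [(x if (y >> i) & 1 else 0) << i for i in range(n)]


def multiply_by_rows(x: int, y: int, n: int) -> int:
    """Add up the rows to obtain the double-precision product."""
    return sum(partial_products(x, y, n))


# The slide example: 101010 (42) times 1011 (11).
rows = partial_products(0b101010, 0b1011, 4)
product = multiply_by_rows(0b101010, 0b1011, 4)
print([bin(r) for r in rows], bin(product))
```

The rows printed for the example correspond to the four partial products on the binary multiplication slide, each already shifted into its column position.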

Multiplication Approaches
- Right shift and add
  - Partial product array rows are accumulated from top to bottom on an N-bit adder
  - After each addition, right shift (by one bit) the accumulated partial product to align it with the next row to add
  - Time for N bits: $T_{serial\_mult} = O(N \cdot T_{adder}) = O(N^2)$ for an RCA
- Making it faster
  - Use a faster adder
  - Use higher radix (e.g., base 4) multiplication: $O(N/2 \cdot T_{adder})$
    - Use multiplier recoding to simplify multiple formation (Booth)
  - Form the partial product array in parallel and add it in parallel
- Making it smaller (i.e., slower)
  - Use a serial-parallel multiplier
  - Use an array multiplier
    - Very regular structure with only short wires to nearest-neighbor cells, hence a very simple and efficient layout in VLSI
    - Can be easily and efficiently pipelined
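The right-shift-and-add scheme can be modeled as below. This is a behavioral sketch under assumed conventions (unsigned operands, an accumulator register `acc` paired with a register `low` that initially holds the multiplier); these names are illustrative and do not come from the slides.

```python
def shift_add_multiply(x: int, y: int, n: int) -> int:
    """Unsigned n-bit multiply using one n-bit add per multiplier bit."""
    mask = (1 << n) - 1
    x &= mask
    acc = 0          # upper half of the running product
    low = y & mask   # multiplier bits, gradually replaced by product bits
    for _ in range(n):
        if low & 1:
            acc += x  # n-bit addition; acc may briefly need n + 1 bits
        # Right shift the register pair {acc, low} by one bit so the
        # next partial product row lines up with the accumulator.
        low = (low >> 1) | ((acc & 1) << (n - 1))
        acc >>= 1
    return (acc << n) | low


print(shift_add_multiply(0b101010, 0b001011, 6))
```

The loop runs N times and each pass costs one adder delay, which is where the O(N x T_adder) estimate comes from.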

Serial-parallel multiplier structure

The Array Multiplier
[Figure: 4x4 array multiplier. Inputs X3..X0 and Y3..Y0 form the partial-product bits; three rows of adder cells (HA FA FA HA, then FA FA FA HA twice) produce the product bits Z7..Z0.]

The MxN Array Multiplier: Critical Path
[Figure: the adder array (rows HA FA FA HA, FA FA FA HA, FA FA FA HA) with Critical Path 1 and Critical Path 2 marked; the two paths share the last row (Critical Path 1 & 2).]
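For reference, the textbook these slides are adapted from (Rabaey et al.) estimates the worst-case delay along these paths roughly as follows. The symbols $t_{carry}$, $t_{sum}$ and $t_{and}$ (the carry delay and sum delay of an adder cell and the delay of the AND gate that forms a partial-product bit) do not appear on the slide and are introduced here as assumptions.

```latex
t_{mult} \approx \left[(M-1) + (N-2)\right] t_{carry} + (N-1)\, t_{sum} + t_{and}
```

Both the carry and the sum delays appear in the estimate, so an adder cell for this array would ideally have balanced carry and sum delays rather than a fast carry alone.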

Carry-Save Multiplier
[Figure: 4x4 carry-save multiplier: an array of HA and FA cells whose carries are passed down to the next row instead of sideways, followed by a vector-merging adder.]
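The carry-save idea can be sketched behaviorally: each partial product is folded into a pair of sum and carry vectors with no carry propagation along a row, and a single final addition plays the role of the vector-merging adder. The function name `carry_save_multiply` is hypothetical, and the bitwise operations model a whole row of full adders at once rather than the exact cell placement in the figure.

```python
def carry_save_multiply(x: int, y: int, n: int) -> int:
    """Unsigned multiply that keeps the running total in carry-save form."""
    s, c = 0, 0  # sum vector and carry vector; s + c is the running total
    for i in range(n):
        pp = (x if (y >> i) & 1 else 0) << i  # partial product row i
        # One row of full adders working bit by bit in parallel:
        # the sum bit is the XOR of the three inputs, the carry bit is
        # their majority and has the weight of the next bit position.
        s, c = s ^ c ^ pp, ((s & c) | (s & pp) | (c & pp)) << 1
    return s + c  # vector-merging adder: the only carry-propagate add


print(carry_save_multiply(9, 7, 4))
```

Because no row waits for a carry to ripple sideways, only the final merge needs a fast carry-propagate adder.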

Multiplier Floorplan
[Figure: floorplan of the 4x4 multiplier. X3..X0 run across the top and Y0..Y3 down the side; each row of cells passes carry (C) and sum (S) outputs to the next row; outputs Z0..Z7. Legend: HA multiplier cell, FA multiplier cell, vector-merging cell.]
X and Y signals are broadcast through the complete array.

Booth multiplier
- Encoding scheme to reduce the number of stages in multiplication.
- Performs two bits of multiplication at once, so it requires half the stages.
- Each stage is slightly more complex than in a simple multiplier, but an adder/subtracter is almost as small and fast as an adder.

Booth encoding
Two's-complement form of the multiplier:
  $y = -2^{n} y_{n} + 2^{n-1} y_{n-1} + 2^{n-2} y_{n-2} + \dots$ (first bit is the sign bit)
  (example: y = 18 = 010010, y = -18 = 101110)
Rewrite using $2^{a} = 2^{a+1} - 2^{a}$:
  $y = 2^{n}(y_{n-1} - y_{n}) + 2^{n-1}(y_{n-2} - y_{n-1}) + 2^{n-2}(y_{n-3} - y_{n-2}) + \dots$
Consider the first two terms: by looking at three bits of y, we can determine whether to add or subtract x or 2x (or add nothing) to the partial product.

Booth actions
  $y = 2^{n}(y_{n-1} - y_{n}) + 2^{n-1}(y_{n-2} - y_{n-1}) + 2^{n-2}(y_{n-3} - y_{n-2}) + \dots$
Consider the first two terms: by looking at three bits of y, we can determine whether to add x, 2x to the partial product.

  y_i  y_i-1  y_i-2   increment
   0     0      0        0
   0     0      1        x
   0     1      0        x
   0     1      1        2x
   1     0      0       -2x
   1     0      1       -x
   1     1      0       -x
   1     1      1        0

Booth example
x = 1001 (9_10), y = 0111 (7_10). P0 = 00000000
Bit groups of y (with y_-1 = 0): y1 y0 y-1 = 110 and y3 y2 y1 = 011
Step 1: y1 y0 y-1 = 110, so subtract x: P1 = P0 - (1001) = 11110111
Step 2: x is shifted left by 2 bits to become 100100; y3 y2 y1 = 011, so add 2x: P2 = P1 + (10 * 100100) = 11110111 + 01001000 = 00111111 (63_10), dropping the carry out of the 8-bit sum
An array multiplier needs N additions; a Booth multiplier needs only N/2 additions
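The recoding table and the example above can be checked with a short Python sketch. It assumes an even bit count `n`, an implicit bit y_-1 = 0 below the least significant bit, and digit k carrying weight 4^k; the function names are illustrative and not taken from the slides.

```python
# Increment selected by the three bits (y_i, y_i-1, y_i-2), as in the table.
BOOTH_TABLE = {
    0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
    0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0,
}


def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Recode an n-bit two's-complement y into n/2 digits in {-2..2}."""
    assert n % 2 == 0, "this sketch assumes an even number of bits"
    bits = y & ((1 << n) - 1)
    digits = []
    prev = 0  # y_-1, the implicit zero below the LSB
    for k in range(n // 2):
        pair = (bits >> (2 * k)) & 0b11
        digits.append(BOOTH_TABLE[(pair << 1) | prev])
        prev = pair >> 1  # top bit of this pair overlaps the next group
    return digits


def booth_multiply(x: int, y: int, n: int) -> int:
    """Sum the n/2 recoded partial products, digit k having weight 4**k."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, n)))


print(booth_radix4_digits(0b0111, 4), booth_multiply(9, 7, 4))
```

For y = 0111 the digits are -1 and +2, matching the two steps of the example (subtract x, then add 2x shifted left by two bits).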

Review: A 64-bit Adder/Subtractor
- Ripple Carry Adder (RCA) built out of 64 FAs
- Subtraction: complement all subtrahend bits (XOR gates) and set the low-order carry-in
- RCA
  - advantage: simple logic, so small (low cost)
  - disadvantage: slow (O(N) for N bits) and lots of glitching (so lots of energy consumption)
[Figure: chain of 64 1-bit FAs with inputs A0..A63 and B0..B63, an add/subt control, C0 = Cin, carries C1..C63, sums S0..S63, and C64 = Cout.]
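A bit-level model of the ripple-carry adder/subtractor is sketched below, assuming the add/subt control both drives the XOR gates on the B inputs and sets the low-order carry-in, as the slide describes. The names `full_adder` and `ripple_add_sub` are illustrative.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """1-bit full adder: returns (sum, carry_out)."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))


def ripple_add_sub(a: int, b: int, subtract: bool, n: int = 64) -> tuple[int, int]:
    """n-bit ripple-carry add (a + b) or subtract (a - b).

    Returns (n-bit result, carry out of the last full adder).
    """
    sub = 1 if subtract else 0
    carry = sub  # low-order carry-in is set when subtracting
    result = 0
    for i in range(n):
        ai = (a >> i) & 1
        bi = ((b >> i) & 1) ^ sub  # XOR gate complements B when subtracting
        s, carry = full_adder(ai, bi, carry)  # carry ripples to the next bit
        result |= s << i
    return result, carry


print(ripple_add_sub(5, 3, False, 8), ripple_add_sub(5, 3, True, 8))
```

The loop makes the O(N) delay visible: bit i cannot finish until the carry from bit i-1 has arrived.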

Booth structure

Wallace-Tree Multiplier
[Figure: dot diagrams over bit positions 6 to 0 showing (a) the partial products, (b) the first stage, (c) the second stage, and (d) the final adder; the legend marks which dots are grouped into an FA and which into an HA.]

Wallace-Tree Multiplier
[Figure: the 16 partial products x_i y_j (i, j = 0..3) pass through a first stage (two HAs), a second stage (four FAs), and a final adder that produces z7..z0.]
Full adder = (3,2) compressor
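The column-compression idea can be sketched as follows. This is a simplified reduction that uses only (3,2) compressors and applies them greedily in every column, so it does not reproduce the exact HA/FA placement or stage count of the figure; `compress_columns` and `tree_multiply` are hypothetical names.

```python
def compress_columns(cols: list[list[int]]) -> list[list[int]]:
    """One reduction stage: replace each group of three bits in a column
    by a sum bit (same column) and a carry bit (next column)."""
    nxt: list[list[int]] = [[] for _ in range(len(cols) + 1)]
    for k, col in enumerate(cols):
        i = 0
        while len(col) - i >= 3:
            a, b, c = col[i:i + 3]
            nxt[k].append(a ^ b ^ c)                      # FA sum
            nxt[k + 1].append((a & b) | (a & c) | (b & c))  # FA carry
            i += 3
        nxt[k].extend(col[i:])  # leftover one or two bits pass through
    return nxt


def tree_multiply(x: int, y: int, n: int) -> tuple[int, int]:
    """Return (product, number of compression stages) for n-bit unsigned inputs."""
    cols: list[list[int]] = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((x >> i) & 1) & ((y >> j) & 1))
    stages = 0
    while any(len(col) > 2 for col in cols):
        cols = compress_columns(cols)
        stages += 1
    # At most two bits remain per column: this sum models the final adder.
    product = sum(bit << k for k, col in enumerate(cols) for bit in col)
    return product, stages


print(tree_multiply(9, 7, 4))
```

Each stage shrinks the tallest column by roughly a third, which is the source of the logarithmic depth of tree multipliers.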

Making it Faster: Tree Multiplier Structure
[Figure: multiple-forming circuits (muxes selecting 0 or D), driven by D (multiplicand) and Q (multiplier), generate the partial product array; the array feeds a reduction tree and then a fast carry-propagate adder (CPA) that produces P (product).]
Delay: mux + reduction tree (log N) + CPA (log N), plus interconnect

(4,2) Counter
- Built out of two (3,2) counters (just FAs!)
  - all of the inputs (4 external plus one internal) have the same weight (i.e., are in the same bit position)
  - the internal carry output is fed to the next higher weight position (marked in the figure)
[Figure: two stacked (3,2) counters forming one (4,2) counter.]
Note: two carry outs, one internal and one external

Tiling (4,2) Counters
[Figure: three neighboring (4,2) counters, each built from two (3,2) counters.]
- Reduces columns four high to columns only two high
- Tiles with neighboring (4,2) counters
- Internal carry in is at the same level (i.e., bit position weight) as the internal carry out
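A small Python model can illustrate both the (4,2) counter and its tiling. It assumes the first (3,2) counter takes three of the external inputs and the second takes its sum, the fourth input, and the internal carry-in; the function names and the input ordering are assumptions of this sketch, not details confirmed by the slides.

```python
from itertools import product


def fa(a: int, b: int, c: int) -> tuple[int, int]:
    """(3,2) counter (full adder): returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def counter_4_2(x1: int, x2: int, x3: int, x4: int, cin: int) -> tuple[int, int, int]:
    """(4,2) counter from two FAs: returns (sum, carry, internal carry out).

    The internal carry out depends only on x1..x3, never on cin, so a row
    of tiled counters does not ripple.
    """
    s1, cout_internal = fa(x1, x2, x3)
    s, carry = fa(s1, x4, cin)
    return s, carry, cout_internal


def reduce_four_to_two(a: int, b: int, c: int, d: int, width: int) -> tuple[int, int]:
    """Tile (4,2) counters across columns: four rows in, two rows out.

    Inputs must fit in `width` bits; one extra column absorbs the last
    internal carry.
    """
    sum_row = carry_row = 0
    cin = 0
    for i in range(width + 1):
        bits = [(v >> i) & 1 for v in (a, b, c, d)]
        s, carry, cout = counter_4_2(*bits, cin)
        sum_row |= s << i
        carry_row |= carry << (i + 1)  # external carry has the next weight
        cin = cout                     # internal carry feeds the neighbor
    return sum_row, carry_row


# Exhaustive check of the counter: inputs sum to s + 2 * (both carries).
for x1, x2, x3, x4, cin in product((0, 1), repeat=5):
    s, carry, cout = counter_4_2(x1, x2, x3, x4, cin)
    assert x1 + x2 + x3 + x4 + cin == s + 2 * (carry + cout)

print(reduce_four_to_two(13, 7, 9, 15, 4))
```

The two output rows still have to be added by a carry-propagate adder, which is the CPA in the later slides.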

4x4 Partial Product Array Reduction
Fast 4x4 multiplication using (4,2) counters
[Figure: multiplicand and multiplier form the partial product array; five (4,2) counters reduce it to the reduced pp array (to CPA); a 5-bit CPA then produces the 8-bit double-precision product.]
How would you lay it out?

8x8 Partial Product Array Reduction
Wallace tree multiplier
[Figure: multiplicand and multiplier form the partial product array; two rows of nine (4,2) counters produce the reduced partial product array; one row of thirteen (4,2) counters reduces it again, feeding a 13-bit fast CPA.]

An 8x8 Multiplier Layout
How should it be laid out?
[Figure: multiplicand and multiplier inputs; a row of nine (4,2) counters, a second row of nine (4,2) counters, a row of thirteen (4,2) counters, and a 13-bit CPA.]

Why Not Recode?
- Multiplier recoding (modified Booth's, canonical, ...)
  - recode the multiplier to allow base 4 multiplication with simple multiple formation
  - with recoding, the base 4 multiplier digit set is -2, -1, 0, 1, 2
- Thus, with recoding the initial partial product array is only N/2 high
- But the first level of (4,2) counters also reduces the partial product array to N/2 high
[Figure: partial product arrays annotated with heights N and N/2 and width 2N.]
- Which is better depends on the logic delay (recoding wins) and interconnect complexity (counters win big)

Hitachi 54x54b Multiplier
A 4.4 ns CMOS 54x54 multiplier using pass-transistor multiplexers

Hitachi Multiplier: Booth encoder and PPG

Hitachi multiplier: 4-2 compressor

What is the state of the art? ISSCC 2003

Multipliers Summary
- Optimization goals different vs. binary adder
- Once again: identify the critical path
- Other possible techniques
  - Logarithmic versus linear (Wallace tree multiplier)
  - Data encoding (Booth)
  - Pipelining
FIRST GLIMPSE AT SYSTEM LEVEL OPTIMIZATION

Next Lecture and Reminders
- Next lecture: shifters, decoders, and multiplexers
  - Reading assignment: Rabaey, et al., 11.5-11.6