CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 20: Multiplier Design

CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 20: Multiplier Design [Adapted from Rabaey's Digital Integrated Circuits, Second Edition, © 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp11 CMPEN 411 L20 S.1

Review: Basic Building Blocks
- Datapath
  - Execution units: adder, multiplier, divider, shifter, etc.
  - Register file and pipeline registers
  - Multiplexers, decoders
- Control
  - Finite state machines (PLA, ROM, random logic)
- Interconnect
  - Switches, arbiters, buses
- Memory
  - Caches (SRAMs), TLBs, DRAMs, buffers

The Binary Multiplication

            1 0 1 0 1 0     Multiplicand
          x     1 0 1 1     Multiplier
          -------------
            1 0 1 0 1 0
          1 0 1 0 1 0       Partial products
        0 0 0 0 0 0
    + 1 0 1 0 1 0
      -----------------
      1 1 1 0 0 1 1 1 0     Result
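The worked example above can be reproduced with a short sketch. The function name `partial_products` is hypothetical and only illustrates the idea: one shifted copy of the multiplicand (or zero) per multiplier bit, summed at the end.

```python
def partial_products(multiplicand, multiplier):
    """Return one partial product per multiplier bit, least significant bit first."""
    rows = []
    bit = 0
    while multiplier >> bit:
        if (multiplier >> bit) & 1:
            rows.append(multiplicand << bit)  # shifted copy of the multiplicand
        else:
            rows.append(0)                    # multiplier bit is 0: all-zero row
        bit += 1
    return rows


rows = partial_products(0b101010, 0b1011)
product = sum(rows)
for row in rows:
    print(f"{row:09b}")
print(f"{product:09b}")
```

Each row is already shifted to its bit position, so a plain sum gives the result.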

Multiply Operation
- Multiplication is just a lot of additions
- An N-bit multiplicand and an N-bit multiplier give N partial products
- The partial product array can be formed in parallel
- The double precision product is 2N bits wide

Multiplication Approaches
- Right shift and add
  - Partial product array rows are accumulated from top to bottom on an N-bit adder
    - After each addition, right shift (by one bit) the accumulated partial product to align it with the next row to add
  - Time for N bits: $T_{serial\_mult} = O(N \cdot T_{adder}) = O(N^2)$ for an RCA
- Making it faster
  - Use a faster adder
  - Use higher radix (e.g., base 4) multiplication: $O(N/2 \cdot T_{adder})$
    - Use multiplier recoding to simplify multiple formation (Booth)
  - Form the partial product array in parallel and add it in parallel
- Making it smaller (i.e., slower)
  - Use a serial-parallel multiplier
  - Use an array multiplier
    - Very regular structure with only short wires to nearest neighbor cells, so the VLSI layout is simple and efficient
    - Can be easily and efficiently pipelined
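A minimal sketch of the right-shift-and-add scheme, assuming unsigned N-bit operands. The name `shift_add_multiply` and the split into `acc` (upper half) and `low` (lower half) are illustrative choices, not taken from the slides.

```python
def shift_add_multiply(x, y, n):
    """Multiply two unsigned n-bit integers with n add-then-right-shift steps."""
    acc = 0  # upper half: accumulated partial product (n bits plus a carry bit)
    low = 0  # lower half: product bits that have already been shifted out
    for i in range(n):
        if (y >> i) & 1:
            acc += x  # one pass through the N-bit adder
        # Right shift the accumulated value by one bit; its LSB becomes
        # the next bit of the lower half of the product.
        low = (low >> 1) | ((acc & 1) << (n - 1))
        acc >>= 1
    return (acc << n) | low


print(shift_add_multiply(0b101010, 0b001011, 6))
```

The loop runs N times and each iteration uses the adder at most once, which is where the $O(N \cdot T_{adder})$ estimate comes from.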

Serial-parallel multiplier structure

The Array Multiplier
[Figure: 4x4 array multiplier. Inputs X3..X0 are combined with Y0..Y3 row by row; each row of HA and FA cells adds one partial product; outputs are Z7..Z0.]
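A bit-level sketch of an array multiplier, assuming unsigned operands and modeling every cell as a full adder (a half adder is a full adder with one input tied to 0). The names `full_adder` and `array_multiply` are hypothetical, and the cell arrangement is a simplified model rather than the exact wiring of the figure.

```python
def full_adder(a, b, cin):
    """1-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def array_multiply(x, y, n):
    """Multiply unsigned n-bit x and y, one ripple-carry adder row per multiplier bit."""
    xb = [(x >> j) & 1 for j in range(n)]
    yb = [(y >> i) & 1 for i in range(n)]
    # Row 0 needs no adders: it is just the AND terms x_j & y_0.
    acc = [xb[j] & yb[0] for j in range(n)] + [0]
    z = []
    for i in range(1, n):
        z.append(acc[0])  # the lowest accumulated bit is final
        carry = 0
        row = []
        for j in range(n):
            # Add partial product bit x_j & y_i to the accumulated bit above it.
            s, carry = full_adder(acc[j + 1], xb[j] & yb[i], carry)
            row.append(s)
        acc = row + [carry]
    z.extend(acc)
    return sum(bit << k for k, bit in enumerate(z))


print(array_multiply(9, 7, 4))
```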

The MxN Array Multiplier: Critical Path
[Figure: array of HA and FA cells with two marked paths, Critical Path 1 and Critical Path 2, which share the final row (Critical Path 1 & 2).]

Carry-Save Multiplier
[Figure: carry-save multiplier built from rows of HA and FA cells, followed by a vector merging adder.]
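A word-level sketch of the carry-save idea, assuming unsigned operands: each row keeps separate sum and carry vectors instead of propagating carries, and only the final vector merging adder does a full carry-propagate addition. The names `carry_save_add` and `carry_save_multiply` are hypothetical, and this models the arithmetic rather than the cell layout.

```python
def carry_save_add(a, b, c):
    """Add three numbers bitwise without carry propagation.

    Returns (sum_vector, carry_vector) with a + b + c == sum_vector + carry_vector.
    """
    sum_vector = a ^ b ^ c
    carry_vector = ((a & b) | (a & c) | (b & c)) << 1  # carries move one position up
    return sum_vector, carry_vector


def carry_save_multiply(x, y, n):
    """Multiply unsigned n-bit x and y, accumulating partial products in carry-save form."""
    s, c = 0, 0
    for i in range(n):
        pp = (x << i) if (y >> i) & 1 else 0
        s, c = carry_save_add(s, c, pp)
    return s + c  # vector merging adder: the only carry-propagate addition


print(carry_save_multiply(9, 7, 4))
```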

Multiplier Floorplan
[Figure: floorplan with inputs X3..X0 across the top and Y0..Y3 down the side; each cell produces a carry (C) and sum (S); outputs Z0..Z7. Legend: HA multiplier cell, FA multiplier cell, vector merging cell.]
- X and Y signals are broadcast through the complete array.

Booth multiplier
- Encoding scheme to reduce the number of stages in multiplication.
- Performs two bits of multiplication at once, so it requires half the stages.
- Each stage is slightly more complex than in a simple multiplier, but an adder/subtracter is almost as small and fast as an adder.

Booth encoding
- Two's-complement form of the multiplier (the first bit, $y_n$, is the sign bit):
  $y = -2^n y_n + 2^{n-1} y_{n-1} + 2^{n-2} y_{n-2} + \dots$
  - Example: $y = 18 = 010010$, $y = -18 = 101110$
- Rewrite using $2^a = 2^{a+1} - 2^a$:
  $y = 2^n (y_{n-1} - y_n) + 2^{n-1} (y_{n-2} - y_{n-1}) + 2^{n-2} (y_{n-3} - y_{n-2}) + \dots$
- Consider the first two terms: by looking at three bits of y, we can determine whether to add 0, ±x, or ±2x to the partial product.
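The grouping of the first two terms can be spelled out as a short worked step. It uses only the symbols of the rewritten sum and factors out the common power of two.

```latex
\begin{align*}
2^{n}(y_{n-1} - y_{n}) + 2^{n-1}(y_{n-2} - y_{n-1})
  &= 2^{n-1}\bigl(2y_{n-1} - 2y_{n} + y_{n-2} - y_{n-1}\bigr) \\
  &= 2^{n-1}\bigl(-2y_{n} + y_{n-1} + y_{n-2}\bigr)
\end{align*}
```

The bracketed factor depends on three bits of y and takes values in {-2, -1, 0, 1, 2}, so each pair of terms contributes 0, ±x, or ±2x at the corresponding weight.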

Booth actions
$y = 2^n (y_{n-1} - y_n) + 2^{n-1} (y_{n-2} - y_{n-1}) + 2^{n-2} (y_{n-3} - y_{n-2}) + \dots$
Consider the first two terms: by looking at three bits of y, we can determine whether to add 0, ±x, or ±2x to the partial product.

    y_i   y_{i-1}   y_{i-2}   increment
     0       0         0         0
     0       0         1         x
     0       1         0         x
     0       1         1         2x
     1       0         0        -2x
     1       0         1        -x
     1       1         0        -x
     1       1         1         0
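The table can be turned into a small sketch of radix-4 recoding. Two assumptions differ from the slide's notation: `n` here is the total word width, so the sign bit is bit `n - 1`, and `n` must be even; the bit below the least significant bit is taken as 0. The names `BOOTH_DIGIT`, `booth_digits`, and `booth_multiply` are hypothetical.

```python
# (y_i, y_{i-1}, y_{i-2}) -> multiple of x to add, as in the table above
BOOTH_DIGIT = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_digits(y, n):
    """Radix-4 Booth digits (least significant first) of the n-bit two's-complement y."""
    def bit(k):
        return 0 if k < 0 else (y >> k) & 1  # the bit below the LSB is 0

    return [BOOTH_DIGIT[(bit(i + 1), bit(i), bit(i - 1))] for i in range(0, n, 2)]


def booth_multiply(x, y, n):
    """Multiply x by the n-bit two's-complement y using n/2 add/subtract steps."""
    product = 0
    for k, digit in enumerate(booth_digits(y, n)):
        product += digit * x * 4 ** k  # digit * x is one of 0, +-x, +-2x
    return product


print(booth_digits(7, 4))       # digits for y = 0111
print(booth_multiply(9, 7, 4))  # 9 * 7
```

Only N/2 digits are produced for an N-bit multiplier, which is why the number of additions is halved.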

Booth example
- $x = 1001$ ($9_{10}$), $y = 0111$ ($7_{10}$), $P_0 = 00000000$
- Bit groups of y, with a 0 appended below $y_0$: $y_3 y_2 y_1 = 011$ and $y_1 y_0 y_{-1} = 11(0)$
- $y_1 y_0 y_{-1} = 110$ selects -x: $P_1 = P_0 - (1001) = 11110111$
- Shift x left by two bits to get 100100
- $y_3 y_2 y_1 = 011$ selects +2x: $P_2 = P_1 + (10 \times 100100) = 11110111 + 01001000 = 00111111$ ($63_{10}$), discarding the carry out of the 8-bit sum
- An array multiplier needs N additions; a Booth multiplier needs only N/2 additions

Review: A 64-bit Adder/Subtractor
- Ripple Carry Adder (RCA) built out of 64 FAs
- Subtraction: complement all subtrahend bits (xor gates) and set the low order carry-in
- RCA
  - advantage: simple logic, so small (low cost)
  - disadvantage: slow (O(N) for N bits) and lots of glitching (so lots of energy consumption)
[Figure: chain of 64 1-bit FAs with inputs A0..A63 and B0..B63, an add/subt control on the B inputs, carries from C0 = Cin to C64 = Cout, and sums S0..S63.]
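A minimal sketch of the ripple-carry adder/subtractor described above. The function names and the `subtract` flag are illustrative; the flag plays the role of the add/subt control, feeding both the xor gates on B and the low order carry-in.

```python
def full_adder(a, b, cin):
    """1-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def add_sub(a, b, subtract, n=64):
    """Return (result, carry_out) of a + b or a - b on an n-bit ripple-carry adder."""
    ctrl = 1 if subtract else 0
    carry = ctrl  # the low order carry-in is set to 1 for subtraction
    result = 0
    for i in range(n):  # the carry ripples through all n stages: O(N) delay
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ ctrl  # xor gate complements B when subtracting
        s, carry = full_adder(a_i, b_i, carry)
        result |= s << i
    return result, carry


print(add_sub(5, 3, False, 8))
print(add_sub(5, 3, True, 8))
```

Results are n-bit two's-complement patterns, so a negative difference appears as its value modulo 2^n.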

Booth structure

Wallace-Tree Multiplier
[Figure: dot diagrams over bit positions 6..0 showing (a) the partial products, (b) the first stage, (c) the second stage, and (d) the final adder, with FA and HA groupings marked.]

Wallace-Tree Multiplier
[Figure: 4x4 Wallace tree. The partial products $x_i y_j$ feed a first stage of two HAs, a second stage of four FAs, and a final adder that produces $z_7 \dots z_0$.]
- Full adder = (3,2) compressor
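A sketch of column-wise reduction with (3,2) compressors, assuming unsigned operands. For simplicity it uses only full adders and passes leftover bits through unchanged, so the number of stages it reports can differ from the figure, which also uses half adders. The names `wallace_multiply` and `full_adder` are hypothetical.

```python
def full_adder(a, b, cin):
    """(3,2) compressor: three bits of equal weight in, (sum, carry_out) out."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def wallace_multiply(x, y, n):
    """Multiply unsigned n-bit x and y; returns (product, number_of_reduction_stages)."""
    # cols[k] holds every bit of weight 2**k.
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((x >> i) & 1) & ((y >> j) & 1))
    stages = 0
    while max(len(col) for col in cols) > 2:
        nxt = [[] for _ in range(len(cols) + 1)]
        for k, col in enumerate(cols):
            grouped = len(col) - len(col) % 3
            for t in range(0, grouped, 3):
                s, cout = full_adder(col[t], col[t + 1], col[t + 2])
                nxt[k].append(s)         # sum stays in the same column
                nxt[k + 1].append(cout)  # carry moves to the next column
            nxt[k].extend(col[grouped:])  # leftover bits pass through
        cols = nxt
        stages += 1
    # At most two bits per column remain; adding them models the final adder.
    product = sum(bit << k for k, col in enumerate(cols) for bit in col)
    return product, stages


print(wallace_multiply(9, 7, 4)[0])
```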

Making it Faster: Tree Multiplier Structure
[Figure: tree multiplier. Multiple forming circuits (muxes selecting 0 or D) take D (multiplicand) and Q (multiplier) and feed the partial product array reduction tree, followed by a fast carry propagate adder (CPA) that produces P (product). Labels: interconnect.]
- Delay: mux + reduction tree (log N) + CPA (log N)

(4,2) Counter
- Built out of two (3,2) counters (just FAs!)
  - all of the inputs (4 external plus one internal) have the same weight (i.e., are in the same bit position)
  - the internal carry output is fed to the next higher weight position (marked in the figure)
- Note: two carry outs, one internal and one external
[Figure: two stacked (3,2) counters forming one (4,2) counter.]
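A minimal sketch of a (4,2) counter built from two full adders, following the description above. The names `counter_4_2` and `full_adder` and the order of the returned values are illustrative choices.

```python
from itertools import product


def full_adder(a, b, cin):
    """(3,2) counter: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def counter_4_2(x1, x2, x3, x4, cin):
    """(4,2) counter: returns (sum, external_carry, internal_carry_out).

    All five inputs have the same weight; both carries have twice that weight.
    """
    partial, internal_cout = full_adder(x1, x2, x3)  # first (3,2) counter
    s, carry = full_adder(partial, x4, cin)          # second (3,2) counter
    return s, carry, internal_cout


# Exhaustive check of the counting identity over all 32 input combinations.
for bits in product((0, 1), repeat=5):
    s, carry, internal_cout = counter_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + internal_cout)
```

The internal carry out depends only on the first three inputs, not on the internal carry in, so tiling these cells does not create a ripple chain.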

Tiling (4,2) Counters
- Reduces columns four high to columns only two high
- Tiles with neighboring (4,2) counters
- Internal carry in is at the same level (i.e., bit position weight) as the internal carry out
[Figure: three (4,2) counters side by side, each built from two (3,2) counters.]
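A sketch of tiling (4,2) counters across bit positions to reduce four rows to two, assuming unsigned rows that fit in `width` bits. The names `reduce_4_to_2`, `counter_4_2`, and `full_adder` are hypothetical. The loop visits columns in order only to hand each internal carry to its neighbor; in hardware all columns operate in parallel.

```python
def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def counter_4_2(x1, x2, x3, x4, cin):
    """Returns (sum, external_carry, internal_carry_out)."""
    partial, internal_cout = full_adder(x1, x2, x3)
    s, carry = full_adder(partial, x4, cin)
    return s, carry, internal_cout


def reduce_4_to_2(a, b, c, d, width):
    """Reduce four rows to (sum_row, carry_row) with a + b + c + d == sum_row + carry_row."""
    sum_row = 0
    carry_row = 0
    cin = 0  # the lowest counter has no neighbor below it
    for k in range(width):
        bits = [(row >> k) & 1 for row in (a, b, c, d)]
        s, carry, cout = counter_4_2(bits[0], bits[1], bits[2], bits[3], cin)
        sum_row |= s << k
        carry_row |= carry << (k + 1)  # external carry has twice the weight
        cin = cout                     # internal carry goes to the neighbor
    sum_row |= cin << width  # last internal carry lands one position past the top
    return sum_row, carry_row


s_row, c_row = reduce_4_to_2(9, 14, 7, 12, 4)
print(s_row + c_row)
```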

4x4 Partial Product Array Reduction
- Fast 4x4 multiplication using (4,2) counters
- How would you lay it out?
[Figure: the multiplicand and multiplier form the partial product array; five (4,2) counters reduce it to the reduced pp array, which goes to a 5-bit CPA that produces the 8-bit double precision product.]

8x8 Partial Product Array Reduction
- Wallace tree multiplier
[Figure: the multiplicand and multiplier form the partial product array; two rows of nine (4,2) counters produce the reduced partial product array; one row of thirteen (4,2) counters feeds a 13-bit fast CPA.]
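A rough sketch of how many reduction levels a tree needs before the final CPA, counting rows only and ignoring column shapes. The function `reduction_levels` and its `inputs` parameter (3 for (3,2) counters, 4 for (4,2) counters) are hypothetical, and a partly filled group is assumed to use a counter with unused inputs tied to 0.

```python
def reduction_levels(rows, inputs):
    """Levels of (inputs,2) counters needed to reduce `rows` rows to two rows."""
    levels = 0
    while rows > 2:
        groups, rest = divmod(rows, inputs)
        # Each full group of `inputs` rows becomes 2 rows; a leftover group
        # of more than 2 rows also goes through a counter and becomes 2 rows.
        rows = 2 * groups + min(rest, 2)
        levels += 1
    return levels


print(reduction_levels(8, 4))  # (4,2) counters on 8 partial product rows
print(reduction_levels(8, 3))  # (3,2) counters on 8 partial product rows
```

For eight rows this gives two levels of (4,2) counters, consistent with the two counter levels in the 8x8 reduction above.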

An 8x8 Multiplier Layout
- How should it be laid out?
[Figure: layout blocks for the multiplicand, the multiplier, two groups of nine (4,2) counters, thirteen (4,2) counters, and a 13-bit CPA.]

Why Not Recode?
- Multiplier recoding (modified Booth's, canonical, ...)
  - recode the multiplier to allow base 4 multiplication with simple multiple formation
  - with recoding, the base 4 multiplier digit set is -2, -1, 0, 1, 2
- Thus, with recoding the initial partial product array is only N/2 high
- But the first level of (4,2) counters also reduces the partial product array to N/2 high
- Which is better depends on the logic delay (recoding wins) and interconnect complexity (counters win big)
[Figure: partial product arrays with dimensions labeled N, N/2, and 2N.]

Hitachi 54X54b Multiplier
- A 4.4 ns CMOS 54X54 multiplier using pass-transistor multiplexer

Hitachi Multiplier: Booth encoder and PPG

Hitachi multiplier: 4-2 compressor

What is the state of the art? ISSCC 2003

Multipliers Summary
- Optimization goals are different vs. the binary adder
- Once again: identify the critical path
- Other possible techniques
  - Logarithmic versus linear (Wallace tree multiplier)
  - Data encoding (Booth)
  - Pipelining
- FIRST GLIMPSE AT SYSTEM LEVEL OPTIMIZATION

Next Lecture and Reminders
- Next lecture: shifters, decoders, and multiplexers
  - Reading assignment: Rabaey, et al, 11.5-11.6