Lecture 20: Parallelism ILP to Multicores. James C. Hoe Department of ECE Carnegie Mellon University

1 Lecture 20: Parallelism ILP to Multicores James C. Hoe Department of ECE Carnegie Mellon University S18 L20 S1, James C. Hoe, CMU/ECE/CALCM, 2018

2 Housekeeping
- Your goal today
  - transition from sequential to parallel
  - enjoy (you will not be tested on this)
- Notices
  - Midterm 2 on Monday; pick up practice midterm solutions
  - HW4 past due; HW5 out next Wed
  - Handout #14: HW4 solutions
- Readings (advanced, optional)
  - MIPS R10K Superscalar Microprocessor, Yeager
  - Synthesis Lectures: Processor Microarchitecture: An Implementation Perspective, 2010

3 Parallelism Defined
- $T_1$ (work measured in time): time to do work with 1 PE
- $T_\infty$ (critical path): time to do work with infinite PEs
  - $T_\infty$ bounded by dataflow dependence
- Average parallelism: $P_{avg} = T_1 / T_\infty$
- For a system with $p$ PEs: $T_p \ge \max\{T_1/p,\ T_\infty\}$
- When $P_{avg} \gg p$: $T_p \approx T_1/p$, aka linear speedup
Example: x = a + b; y = b * 2; z = (x - y) * (x + y)
[Figure: dataflow graph of the example, with inputs a and b feeding the +, *2, -, + and * operations]
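A minimal sketch of these definitions applied to the example x = a + b; y = b * 2; z = (x - y) * (x + y). It assumes every operation takes one unit of time; the node names `diff` and `sum` and the helper `lower_bound` are made up for illustration.

```python
from functools import lru_cache

# Dataflow graph of the example. Each key is one operation and its value
# lists the operations it depends on. Inputs a and b are not operations.
# Assumption: every operation takes exactly one unit of time.
deps = {
    "x": [],               # x = a + b
    "y": [],               # y = b * 2
    "diff": ["x", "y"],    # x - y
    "sum": ["x", "y"],     # x + y
    "z": ["diff", "sum"],  # z = diff * sum
}

@lru_cache(maxsize=None)
def depth(op):
    """Length of the longest dependence chain that ends at op."""
    return 1 + max((depth(d) for d in deps[op]), default=0)

t1 = len(deps)                          # T_1: total work
t_inf = max(depth(op) for op in deps)   # T_inf: critical path
p_avg = t1 / t_inf                      # average parallelism

def lower_bound(p):
    """Lower bound on T_p for p PEs: max{T_1/p, T_inf}."""
    return max(t1 / p, t_inf)

for p in (1, 2, 4):
    print(p, lower_bound(p))
```

Under the unit-latency assumption the graph has 5 operations on a critical path of 3, so adding PEs beyond 2 cannot lower the bound.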

4 ILP: Instruction Level Parallelism
- Average ILP = $T_1 / T_\infty$ = no. of instructions / no. of cycles required
- code1: ILP = 1, i.e., must execute serially
- code2: ILP = 3, i.e., can execute at the same time
code1:
  r1 <- r2 + 1
  r3 <- r1 / 17
  r4 <- r0 - r3
code2:
  r1 <- r2 + 1
  r3 <- r9 / 17
  r4 <- r0 - r…
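The ILP figures for code1 and code2 can be checked with a short sketch. It assumes unit operation latency, tracks only true (RAW) register dependences, and assumes unlimited issue width. The last source register of code2 is cut off in the text above, so `r10` below is a hypothetical stand-in for any register not written by the earlier instructions.

```python
def ilp(program):
    """Average ILP = number of instructions / critical path length.

    program is a list of (dest, [source registers]) in program order.
    Assumptions: unit latency, RAW dependences only, unlimited issue width.
    """
    ready = {}      # register -> cycle in which its latest value is produced
    finish = []     # finish cycle of each instruction
    for dest, sources in program:
        start = max((ready.get(s, 0) for s in sources), default=0)
        ready[dest] = start + 1
        finish.append(start + 1)
    return len(program) / max(finish)

code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
# "r10" is an assumed register name, not taken from the slide.
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]

print(ilp(code1), ilp(code2))
```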

5 Exploiting ILP for Performance
Scalar in-order pipeline with forwarding
- operation latency (OL) = 1 base cycle
- peak IPC = 1
- required ILP ≥ 1 to avoid stall
[Figure: pipeline diagram, instruction stream vs. base cycles]

6 Superpipelined Execution
- OL = M minor cycles; same as 1 base cycle
- peak IPC = 1 per minor cycle
- required ILP ≥ M
[Figure: pipeline diagram, instruction stream vs. time; 1 base cycle = M minor cycles]
Achieving full performance requires always finding M independent instructions in a row

7 Superscalar (In-order) Execution
- OL = 1 base cycle
- peak IPC = N
- required ILP ≥ N
[Figure: pipeline diagram, instruction stream vs. base cycles]
Achieving full performance requires finding N independent instructions on every cycle

8 Limitations of In-order Pipeline
- Achieved IPC of in-order pipelines degrades rapidly as N×M approaches ILP
- Despite high peak IPC potential, the pipeline is never full due to frequent dependency stalls!!
[Figure: pipeline diagram of the instruction stream]

9 Out-of-Order Execution
ILP is scope dependent
  r1 <- r2 + 1
  r3 <- r1 / 17
  r4 <- r0 - r3
  r11 <- r…
  r13 <- r19 / 17
  r14 <- r0 - r20
- scope of the first three instructions: ILP = 1
- scope of all six instructions: ILP = 2
Accessing ILP = 2 requires (1) a larger scheduling window and (2) out-of-order execution

10 Dataflow Execution Ordering
- Maintain a buffer of many pending instructions, a.k.a. reservation stations (RSs)
  - wait for functional unit to be free
  - wait for register RAW hazards to resolve (i.e., required input operands to be produced)
- Issue instructions for execution out of order
  - select instructions in RS whose operands are available
  - give preference to older instructions (heuristic)
- A completing instruction frees pending, RAW-dependent instructions to execute

11 Tomasulo's Algorithm [IBM 360/91, 1967]
- Dispatch an instruction to an RS slot after decode
  - decode receives from RF either the operand value or a placeholder RS tag
  - mark RF dest with the RS tag of the current inst's RS slot
- An inst in RS can issue when all operand values are ready
- A completing instruction, in addition to updating RF dest, broadcasts its RS tag and value to all RS slots
- An RS slot holding a matching RS tag placeholder picks up the value
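The dispatch, issue and broadcast steps can be sketched as a toy model. This is an illustration under stated assumptions, not the IBM 360/91 design: the class `Machine`, its method names, the use of integer division for "/", and the rule that the RF is written only when the completing slot is still the latest producer of that register are all choices made for this sketch. Functional units and timing are not modeled.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "//": operator.floordiv}

class Machine:
    def __init__(self, regs):
        self.rf = dict(regs)   # register file: name -> value
        self.rf_tag = {}       # name -> tag of the RS slot that will produce it
        self.rs = {}           # tag -> {"dest", "op", "operands"}
        self.next_tag = 0

    def dispatch(self, dest, src1, op, src2):
        operands = []
        for src in (src1, src2):
            if isinstance(src, int):       # immediate constant
                operands.append(("val", src))
            elif src in self.rf_tag:       # value pending: take the placeholder tag
                operands.append(("tag", self.rf_tag[src]))
            else:                          # value available in the RF
                operands.append(("val", self.rf[src]))
        tag = self.next_tag
        self.next_tag += 1
        self.rs[tag] = {"dest": dest, "op": op, "operands": operands}
        self.rf_tag[dest] = tag            # mark RF dest with this slot's tag
        return tag

    def ready_tags(self):
        """Tags of slots whose operands are all values, oldest first."""
        return [t for t, s in sorted(self.rs.items())
                if all(kind == "val" for kind, _ in s["operands"])]

    def complete(self, tag):
        slot = self.rs.pop(tag)
        (_, a), (_, b) = slot["operands"]
        value = OPS[slot["op"]](a, b)
        # Assumption of this sketch: write the RF only if no younger
        # instruction has since claimed the same destination register.
        if self.rf_tag.get(slot["dest"]) == tag:
            self.rf[slot["dest"]] = value
            del self.rf_tag[slot["dest"]]
        # Broadcast (tag, value): waiting slots replace matching placeholders.
        for other in self.rs.values():
            other["operands"] = [
                ("val", value) if (kind, x) == ("tag", tag) else (kind, x)
                for kind, x in other["operands"]
            ]
        return value

m = Machine({"r0": 100, "r2": 33})
m.dispatch("r1", "r2", "+", 1)      # r1 <- r2 + 1
m.dispatch("r3", "r1", "//", 17)    # r3 <- r1 / 17 (integer division assumed)
m.dispatch("r4", "r0", "-", "r3")   # r4 <- r0 - r3
initially_ready = m.ready_tags()
order = []
while m.rs:
    tag = m.ready_tags()[0]         # prefer the oldest ready instruction
    m.complete(tag)
    order.append(tag)
print(order, m.rf)
```

Because the three instructions form one dependence chain, each completion wakes up exactly one waiting slot.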

12 Instruction Reorder Buffer (ROB)
- Program-order bookkeeping (circular buffer)
  - instructions enter and leave in program order
  - tracks 10s to 100s of in-flight instructions in different stages of execution
- Dynamic juggling of state and dependency
  - oldest finished instruction commits architectural state updates on exit
  - all ROB entries considered speculative due to potential for exceptions and mispredictions
[Figure: ROB as a circular buffer with oldest, youngest and mispredict markers]

13 In-order vs Speculative State
- In-order state:
  - cumulative architectural effects of all instructions committed in order so far
  - can never be undone!!
- Speculative state, as viewed by a given inst in ROB
  - in-order state + effects of older insts in ROB
  - effects of some older insts may be pending
- Speculative state effects must be reversible
  - remember both in-order and speculative values for an RF register (may have multiple speculative values)
  - store inst updates memory only at commit time
- Discard younger speculative state to rewind execution to oldest remaining inst in ROB
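A minimal sketch of in-order commit and rewind, assuming a software queue in place of the hardware circular buffer. The class `ROB`, its method names and the entry fields are hypothetical; stores, exceptions and the rename table are left out.

```python
from collections import deque

class ROB:
    def __init__(self):
        self.entries = deque()   # oldest on the left, youngest on the right
        self.arch_regs = {}      # in-order (committed) register state
        self.next_seq = 0

    def allocate(self, dest):
        """Enter an instruction in program order."""
        entry = {"seq": self.next_seq, "dest": dest, "value": None, "done": False}
        self.next_seq += 1
        self.entries.append(entry)
        return entry

    def finish(self, entry, value):
        """Record a result; execution may finish out of order."""
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        """Retire finished instructions from the oldest end only."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            self.arch_regs[e["dest"]] = e["value"]   # now part of in-order state
            committed.append(e["seq"])
        return committed

    def rewind(self, seq):
        """Discard all entries younger than seq, e.g. after a mispredict."""
        while self.entries and self.entries[-1]["seq"] > seq:
            self.entries.pop()

rob = ROB()
a, b, c = rob.allocate("r1"), rob.allocate("r2"), rob.allocate("r3")
rob.finish(c, 30)
rob.finish(b, 20)
first = rob.commit()     # nothing retires: the oldest entry is not finished
rob.finish(a, 10)
second = rob.commit()    # all three retire, in program order
d, e = rob.allocate("r4"), rob.allocate("r5")
rob.finish(e, 50)
rob.rewind(d["seq"])     # e is discarded; its result never reaches arch_regs
print(first, second, rob.arch_regs)
```

The discarded result for r5 shows why speculative effects stay in the ROB until commit: only entries that leave from the oldest end touch the in-order state.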

14 Removing False Dependencies
- With out-of-order execution come WAW and WAR hazards
- Anti and output dependencies are false dependencies on register names rather than data
  r3 <- r1 op r2
  r5 <- r3 op r4
  r3 <- r6 op r7
- With an infinite number of registers, anti and output dependencies are avoidable by using a new register for each new value

15 Register Renaming: Example
Original:
  r1 <- r2 / r3
  r4 <- r1 * r5
  r1 <- r3 + r6
  r3 <- r1 - r5
Renamed:
  r1 <- r2 / r3
  r4 <- r1 * r5
  r8 <- r3 + r6
  r9 <- r8 - r…

16 On-the-fly HW Register Renaming
[Figure: ISA name (e.g., r12) -> rename table -> physical register (e.g., t56) among physical registers t0 ... t63]
- Maintain mapping from ISA reg. names to physical registers
- When decoding an instruction that updates r_x:
  - allocate unused physical register t_y to hold inst result
  - set new mapping from r_x to t_y
  - younger instructions using r_x as input find t_y
- De-allocate a physical register for reuse when it is never needed again? (when is this exactly?)
  r1 <- r2 / r3
  r4 <- r1 * r5
  r1 <- r3 + r…
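The allocate-and-remap steps can be sketched as follows. The sketch assumes 32 ISA registers initially mapped to t0 through t31 with t32 through t63 free; the function `rename` and the tuple layout are made up for illustration, and the fourth instruction is an assumed example added to show a WAR case. De-allocation is deliberately not modeled, since the slide leaves its timing as an open question.

```python
def rename(program, num_isa_regs=32, num_phys_regs=64):
    """Rename (dest, src1, op, src2) instructions onto physical registers."""
    # Assumption: ISA register r<i> initially lives in physical register t<i>.
    table = {f"r{i}": f"t{i}" for i in range(num_isa_regs)}
    free = [f"t{i}" for i in range(num_isa_regs, num_phys_regs)]
    renamed = []
    for dest, src1, op, src2 in program:
        # Look up sources BEFORE remapping dest, so an instruction such as
        # r1 <- r1 + r2 still reads the old r1.
        p1, p2 = table[src1], table[src2]
        pd = free.pop(0)       # allocate an unused physical register
        table[dest] = pd       # younger readers of dest now find pd
        renamed.append((pd, p1, op, p2))
    return renamed

program = [
    ("r1", "r2", "/", "r3"),
    ("r4", "r1", "*", "r5"),
    ("r1", "r3", "+", "r6"),   # WAW on r1 with the first instruction
    ("r3", "r1", "-", "r5"),   # assumed instruction: WAR on r3
]
renamed = rename(program)
for inst in renamed:
    print(inst)
```

After renaming, every destination is a distinct physical register, so only the true (RAW) dependences remain.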

17 Control Speculation
- Modern CPUs can have over 100 instructions in out-of-order execution scope
  - if 14% of avg. instruction mix is control flow, what is the average distance between control-flow instructions?
  - instruction fetch must make multiple levels of branch predictions (condition and target) to fetch far ahead of execution and commit
- Large OOO is more about cache misses than ILP!!!
  - keep working around long cache miss stalls
  - get started on future cache misses as early as possible (to overlap/hide latency of cache misses)
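The question about average distance is a one-line calculation. The sketch below uses the 14% figure from the slide and takes a window of exactly 100 instructions as a round-number assumption.

```python
control_fraction = 0.14   # share of control-flow instructions (from the slide)
window = 100              # assumed number of in-flight instructions

avg_distance = 1 / control_fraction                 # about 7.1 instructions
branches_in_window = window * control_fraction      # about 14 control-flow instructions

print(round(avg_distance, 1), round(branches_in_window))
```

Roughly every seventh instruction is a control-flow instruction, so filling a 100-instruction window means fetching past about 14 of them, each needing a prediction.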

18 Speculative Out-of-order Execution
- A mispredicted branch after resolution must be rewound and restarted
- Much trickier than 5-stage pipeline...
  - can rewind to an intermediate speculative state
  - a rewound branch could still be speculative and itself be discarded by another rewind!
  - rewind must reestablish both architectural state (register value) and microarchitecture state (e.g., rename table)
  - rewind/restart must be fast (not infrequent)
- Exception rewind is much easier, why?

19 Superscalarized BP: 2-way example
[Figure: PC split into tag, BTB index and cache block offset; Tag Table, Branch History Table (BHT) and Branch Target Buffer (BTB); signals "hit", "first?", "taken?" and "last inst in cache block?" select predPC]

20 Trace Caching
[Figure: basic blocks A to G with static (90% / 10%) and dynamic branch behavior; the compiler lays out blocks A B C D E F G across i-cache line boundaries, while hardware packs the dynamic sequence A B C D F G within trace cache line boundaries]

21 Prototypical Superscalar OOO Datapath
[Figure: wide inst fetch + predict; wide inst decode; rename; ROB; RS (Int insts) with physical registers (Integer); RS (FP insts) with physical registers (FP); functional units ALU1, ALU2, LD/ST, FPU1, FPU…]
Read [Yeager 1996, IEEE Micro] if you are interested

22 At the 2005 Peak of Superscalar OOO (Microprocessor Report, December 2004)
Alpha AMD Opteron Intel Xeon IBM Power5 MIPS R14000 Intel Itanium2
clock (GHz)
issue rate 4 3 (x86) 3 (rop)
pipeline int/fp 7/9 9/11 22/24 12/
inst in flight 80 72(rop) 126 rop inorder
rename reg /40 32/
transistor (10^6)
power (W)
SPECint ,566 1,521 1, ,590
SPECfp ,591 1,504 2, ,

23 At peak minus 5 years (Microprocessor Report, December 2000)
Alpha AMD Athlon Intel P4 MIPS R12000 IBM Power3 HP PA8600 SUN Ultra
clock (MHz)
issue rate 4 3 (x86) 3 (rop)
pipeline int/fp 7/9 9/11 22/24 6 7/8 7/9 14//15
inst in flight 80 72(rop) 126 rop inorder
rename reg inorder
transistor (10^6)
power (W)
SPECint
SPECfp

24 Performance (In)efficiency
- To hit expected performance target
  - push frequency harder by deepening pipelines
  - used the 2x transistors to build more complicated microarchitectures so fast/deep pipelines don't stall (i.e., caches, BP, superscalar, out-of-order)
- The consequence of performance inefficiency is
[Figure: power trend chart from Borkar, IEEE Micro, July 1999, annotated with "limit of economical cooling [ITRS]" and "2005, Intel P4 Tejas 150W"]

25 Efficiency of Parallel Processing
[Figure: technology-normalized power (Watt) vs. technology-normalized performance (op/sec), with the Pentium 4 marked on the Power vs. Perf curve; annotation: "Better to replace 1 of this by 2 of these; or N of these". Source: Energy per Instruction Trends in Intel Microprocessors, Grochowski et al., 2006]

26 Moore's Law Era
Multicore Era: growing transistor count & aggr. perf; flattened power & seq. perf; lowering freq

27 At peak plus 1 year (Microprocessor Report, Aug 2006)
issue rate pipeline depth inst in flight on chip$ (MB) transistor (10^6) AMD 285 2x1 Intel 965 2x2 3 (x86) 4 (rop) 3 (rop) (rop) 2x1 Intel (rop) 2x power (W) SPECint 2000 per core 1942 (1556 *) 1870 SPECfp 2000 per core 2260 (1694 +) 2232 Intel Itanium2 clock (GHz) inorder 2x IBM P5+ 2x MIPS R x SUN Ultra4 cores/threads 2x2 2x2 2x1 96(rop) inorder * 3086/ according to

28 At peak plus 3 years (Microprocessor Report, Oct 2008)
cores/threads AMD Opteron 8360SE 4x1 Intel Xeon X7460 6x1 Intel Itanium x2 IBM P5 2x2 IBM P6 2x2 Fujitsu SPARC 7 4x2 SUN T2 8x8
clock (GHz)
issue rate 3 (x86) 4 (rop)
pipeline depth 12/ /12
out of order 72(rop) 96(rop) inorder 200 limited 64 inorder
on chip$ (MB)
transistor (10^6)
power max (W) 105
SPECint 2006 per core/total 14.4/170
SPECfp /156 per core/total /274 22/ / / / /229 > / / / / /142 /

29 On to Mainstream Parallelism in Multicores and Manycores
[Figure: several cores, each with its own cache ($), connected by a fat interconnect to a big L2 and a bigger L…]
Remember, we got here because we need to compute faster while using less energy per operation
