Computer Architecture and Parallel Computing 并行结构与计算. Lecture 5 SuperScalar and Multithreading. Peng Liu


1 Computer Architecture and Parallel Computing 并行结构与计算 Lecture 5 SuperScalar and Multithreading Peng Liu College of Info. Sci. & Elec. Eng. Zhejiang University liueng@zju.edu.cn

2 Last time in Lecture 04: Register renaming removes WAR and WAW hazards. In-order fetch/decode, out-of-order execute, in-order commit gives high performance and precise exceptions. Dynamic branch predictors can be quite accurate (>95%) and avoid most control hazards. Branch History Tables (BHTs) just predict direction (later in pipeline): they need only a few bits per entry (2 bits gives hysteresis), but need to decode instruction bits to determine whether this is a branch and what the target address is. Branch Target Buffers (BTBs) predict direction and target earlier in pipeline, but need bigger entries. Return Address Stack predicts subroutine returns.

3 Branch Mispredict Recovery: In-order execution machines: assume no instruction issued after branch can write back before branch resolves; kill all instructions in pipeline behind mispredicted branch. Out-of-order execution? Multiple instructions following branch in program order can complete before branch resolves.

4 In-Order Commit for Precise Exceptions [Pipeline diagram: Fetch and Decode (in-order), Execute (out-of-order), Reorder Buffer, Commit (in-order); Exception? -> Kill signals, Inject handler PC]. Instructions are fetched and decoded into the instruction reorder buffer in order. Execution is out of order (=> out-of-order completion). Commit (write-back to architectural state, i.e., regfile & memory) is in order. Temporary storage is needed in the ROB to hold results before commit.

5 Branch Misprediction in Pipeline [Pipeline diagram: PC, Fetch, Decode, Reorder Buffer, Commit; Execute, Complete; Branch Prediction; Branch Resolution -> Kill signals, Inject correct PC]. Can have multiple unresolved branches in ROB. Can resolve branches out of order by killing all the instructions in ROB that follow a mispredicted branch. Must also kill instructions in flight in execution pipelines.
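
The kill step described on this slide can be sketched as follows. This is a minimal illustration, not the mechanism of any particular processor: the `RobEntry` class, its `seq` field (a program-order sequence number), and the `squash_after` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass

@dataclass
class RobEntry:
    seq: int                 # program-order sequence number (hypothetical field)
    op: str
    is_branch: bool = False

def squash_after(rob, branch_seq):
    """Split the ROB into entries kept and entries killed.

    Everything younger than the mispredicted branch is killed; the branch
    itself and all older entries (including older unresolved branches) stay.
    """
    kept = [e for e in rob if e.seq <= branch_seq]
    killed = [e for e in rob if e.seq > branch_seq]
    return kept, killed

rob = [RobEntry(0, "ld"), RobEntry(1, "beq", True), RobEntry(2, "add"),
       RobEntry(3, "bne", True), RobEntry(4, "sub")]

# The branch at seq 1 resolves as mispredicted while the branch at seq 3 is
# still unresolved; the younger branch is simply killed with the rest.
kept, killed = squash_after(rob, branch_seq=1)
print([e.op for e in kept])
print([e.op for e in killed])
```

In a real machine the killed entries would also have to be removed from the execution pipelines, as the slide notes; the sketch only covers the ROB.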

6 Recovering ROB/Renaming Table [Diagram: Rename Table with Rename Snapshots; Register File; Reorder Buffer (fields: Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data; entries t1, t2, ..., tn; Ptr2 = next to commit, Ptr1 = next available, rolled back on mispredict); Load Unit, FUs, Store Unit; Commit <t, result>]. Take snapshot of register rename table at each predicted branch; recover earlier snapshot if branch mispredicted.

7 Data-in-ROB Design (HP PA8000, Intel Pentium Pro, Core2 Duo & Nehalem) [Diagram: Register File holds only committed state; Reorder Buffer (fields: Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data; entries t1, t2, ..., tn); Load Unit, FUs, Store Unit; Commit <t, result>]. On dispatch into ROB, ready sources can be in regfile or in ROB dest (copied into src1/src2 if ready before dispatch). On completion, write to dest field and broadcast to src fields. On issue, read from ROB src fields.

8 Data Movement in Data-in-ROB Design [Diagram: Architectural Register File, Reorder Buffer, Functional Units]. Between Architectural Register File and Reorder Buffer: read operands during decode, write sources after decode, write results at commit, read results at commit. Between Reorder Buffer and Functional Units: read operands at issue, write results at completion.

9 Unified Physical Register File (MIPS R10K, Alpha 21264, Intel Pentium 4 & Sandy Bridge): Rename all architectural registers into a single physical register file during decode; no register values read. Functional units read and write from single unified register file holding committed and temporary registers in execute. Commit only updates mapping of architectural register to physical register; no data movement. [Diagram: Decode Stage Register Mapping; Unified Physical Register File; Committed Register Mapping; Functional Units; read operands at issue, write results at completion]

10 Pipeline Design with Physical Regfile [Pipeline diagram: PC, Fetch, Decode & Rename (in-order); Reorder Buffer; Execute (out-of-order): Physical Reg. File, Branch Unit, ALU, MEM, Store Buffer, D$; Commit (in-order); Branch Prediction; Branch Resolution with kill signals]

11 Lifetime of Physical Registers: Physical regfile holds committed and speculative values. Physical registers decoupled from ROB entries (no data in ROB). Original code: ld r1, (r3); add r3, r1, #4; sub r6, r7, r9; add r3, r3, r6; ld r6, (r1); add r6, r6, r3; st r6, (r1); ld r6, (r11). After rename: ld P1, (Px); add P2, P1, #4; sub P3, Py, Pz; add P4, P2, P3; ld P5, (P1); add P6, P5, P4; st P6, (P1); ld P7, (Pw). When can we reuse a physical register? When next write of same architectural register commits.

12 Physical Register Management [Diagram: Rename Table (R0-R7), Physical Regs (P0-P8, ..., Pn), Free List, ROB (fields: use, ex, op, p1, PR1, p2, PR2, Rd, LPRd, PRd)]. Initial state: Rename Table R1->P8, R3->P7, R6->P5, R7->P6; Physical Regs P5=<R6>, P6=<R7>, P7=<R3>, P8=<R1>; Free List P0, P1, P3, P2, P4. Code: ld r1, 0(r3); add r3, r1, #4; sub r6, r7, r6; add r3, r3, r6; ld r6, 0(r1). (LPRd requires third read port on Rename Table for each instruction.)

13 Physical Register Management: ld r1, 0(r3) is renamed. ROB entry: ld, PR1=P7, Rd=r1, LPRd=P8, PRd=P0. Rename Table: R1->P0.

14 Physical Register Management: add r3, r1, #4 is renamed. ROB entry: add, PR1=P0, Rd=r3, LPRd=P7, PRd=P1. Rename Table: R3->P1.

15 Physical Register Management: sub r6, r7, r6 is renamed. ROB entry: sub, PR1=P6, PR2=P5, Rd=r6, LPRd=P5, PRd=P3. Rename Table: R6->P3.

16 Physical Register Management: add r3, r3, r6 is renamed. ROB entry: add, PR1=P1, PR2=P3, Rd=r3, LPRd=P1, PRd=P2. Rename Table: R3->P2.

17 Physical Register Management: ld r6, 0(r1) is renamed. ROB entry: ld, PR1=P0, Rd=r6, LPRd=P3, PRd=P4. Rename Table: R6->P4.

18 Physical Register Management (Execute & Commit): ld r1, 0(r3) executes and commits. P0 now holds <R1>; its LPRd, P8, is returned to the Free List.

19 Physical Register Management (Execute & Commit): add r3, r1, #4 executes and commits. P1 now holds <R3>; its LPRd, P7, is returned to the Free List.
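
The rename-and-free walk-through on the preceding slides can be reproduced with a short sketch. The initial rename table (R1->P8, R3->P7, R6->P5, R7->P6) and the free-list order (P0, P1, P3, P2, P4) are read off the slide diagram; the function names `rename` and `commit_oldest` and the dictionary layout are assumptions made for illustration, not part of the original slides.

```python
from collections import deque

def rename(program, rename_table, free_list):
    """Rename instructions in program order.

    Each instruction is (op, dest_arch_reg, [source_arch_regs]).
    Sources are looked up before the destination mapping is updated.
    """
    table = dict(rename_table)
    free = deque(free_list)
    rob = []
    for op, rd, srcs in program:
        psrcs = [table[s] for s in srcs]   # physical source registers
        lprd = table[rd]                   # last physical reg mapped to rd
        prd = free.popleft()               # newly allocated physical reg
        table[rd] = prd
        rob.append({"op": op, "srcs": psrcs, "Rd": rd,
                    "LPRd": lprd, "PRd": prd})
    return rob, table, free

def commit_oldest(rob, free):
    """Commit the oldest ROB entry and recycle its LPRd."""
    entry = rob.pop(0)
    free.append(entry["LPRd"])
    return entry

program = [
    ("ld",  "r1", ["r3"]),         # ld  r1, 0(r3)
    ("add", "r3", ["r1"]),         # add r3, r1, #4
    ("sub", "r6", ["r7", "r6"]),   # sub r6, r7, r6
    ("add", "r3", ["r3", "r6"]),   # add r3, r3, r6
    ("ld",  "r6", ["r1"]),         # ld  r6, 0(r1)
]
initial_table = {"r1": "P8", "r3": "P7", "r6": "P5", "r7": "P6"}

rob, table, free = rename(program, initial_table, ["P0", "P1", "P3", "P2", "P4"])
renamed = [(e["op"], e["srcs"], e["LPRd"], e["PRd"]) for e in rob]
for row in renamed:
    print(row)

commit_oldest(rob, free)   # the first ld commits, so P8 becomes free again
print(list(free))
```

Freeing LPRd at commit matches the rule stated earlier in the deck: a physical register can be reused once the next write of the same architectural register commits.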

20 Break

21 Separate Pending Instruction Window from ROB: The instruction window holds instructions that have been decoded and renamed but not issued into execution. It has register tags and presence bits, and a pointer to the ROB entry. [Instruction window fields: use, ex, op, p1, PR1, p2, PR2, PRd, ROB#]. Reorder buffer is used to hold exception information for commit. [ROB fields: Done?, Rd, LPRd, PC, Except?; Ptr2 = next to commit, Ptr1 = next available]. ROB is usually several times larger than instruction window. Why?

22 Reorder Buffer Holds Active Instructions (Decoded but not Committed) [Diagram: the instruction sequence below, ordered from older to newer instructions, shown in two snapshots starting at Cycle t, with Commit, Execute, and Fetch pointers]: ld r1, (r3); add r3, r1, r2; sub r6, r7, r9; add r3, r3, r6; ld r6, (r1); add r6, r6, r3; st r6, (r1); ld r6, (r1)

23 Superscalar Register Renaming: During decode, instructions are allocated a new physical destination register. Source operands are renamed to the physical register with the newest value. Execution unit only sees physical register numbers. [Diagram: Inst 1 and Inst 2 (Op, Dest, Src1, Src2) drive the Rename Table read addresses and write ports (Update Mapping) and the Register Free List; Read Data gives Op, PDest, PSrc1, PSrc2 for each instruction]. Does this work?

24 Superscalar Register Renaming [Diagram: Inst 1 and Inst 2 (Op, Dest, Src1, Src2) drive the Rename Table read addresses and write ports (Update Mapping) and the Register Free List; =? comparators sit between the two instructions; outputs are Op, PDest, PSrc1, PSrc2 for each instruction]. Must check for RAW hazards between instructions issuing in same cycle. Can be done in parallel with rename lookup. MIPS R10K renames 4 serially-RAW-dependent insts/cycle.
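
The intra-group RAW check can be sketched as below, assuming the "=?" comparators compare each instruction's sources against the destinations of earlier instructions in the same rename group. The function name `rename_group` and the tuple layout are hypothetical; real hardware does the lookup and the comparison in parallel, while this sketch does them one after the other for clarity.

```python
def rename_group(group, table, free):
    """Rename a group of instructions decoded in the same cycle.

    group: list of (op, dest_arch_reg, [source_arch_regs]).
    """
    table = dict(table)
    free = list(free)
    # All table lookups see the mapping as it was at the start of the cycle.
    looked_up = [[table[s] for s in srcs] for _, _, srcs in group]
    new_dests = [free.pop(0) for _ in group]
    renamed = []
    for i, (op, rd, srcs) in enumerate(group):
        psrcs = list(looked_up[i])
        # Comparator step: if an earlier instruction in the group writes one
        # of our sources, use its new physical register. Scanning j upward
        # means the newest earlier writer wins.
        for j in range(i):
            for k, s in enumerate(srcs):
                if s == group[j][1]:
                    psrcs[k] = new_dests[j]
        renamed.append((op, new_dests[i], psrcs))
    # Update the mapping in program order so the last writer wins.
    for (_, rd, _), pd in zip(group, new_dests):
        table[rd] = pd
    return renamed, table

group = [("add", "r1", ["r2", "r3"]),   # writes r1
         ("sub", "r4", ["r1", "r2"])]   # reads r1 in the same cycle
start = {"r1": "P1", "r2": "P2", "r3": "P3", "r4": "P4"}
renamed, new_table = rename_group(group, start, ["P10", "P11"])
print(renamed)
```

Without the comparator step, `sub` would read the stale mapping P1 for r1 instead of the P10 just allocated to `add`.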

25 Memory Dependencies: st r1, (r2); ld r3, (r4). When can we execute the load?

26 In-Order Memory Queue: Execute all loads and stores in program order => load and store cannot leave ROB for execution until all previous loads and stores have completed execution. Can still execute loads and stores speculatively, and out of order with respect to other instructions. Need a structure to handle memory ordering.

27 Conservative O-o-O Load Execution: st r1, (r2); ld r3, (r4). Split execution of store instruction into two phases: address calculation and data write. Can execute load before store if addresses are known and r4 != r2. Each load address is compared with addresses of all previous uncommitted stores (can use partial conservative check, e.g., bottom 12 bits of address). Don't execute load if any previous store address is not known. (MIPS R10K, 16-entry address queue.)
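
A minimal sketch of the conservative check, assuming the partial comparison uses the bottom 12 address bits mentioned on the slide. The function name `load_may_issue` and the convention that `None` stands for a store address not yet computed are assumptions for illustration.

```python
LOW_BITS_MASK = 0xFFF   # bottom 12 bits, per the partial check on the slide

def load_may_issue(load_addr, older_store_addrs):
    """Decide whether a load may execute ahead of older uncommitted stores.

    older_store_addrs holds one entry per older uncommitted store;
    None means that store's address has not been calculated yet.
    """
    for addr in older_store_addrs:
        if addr is None:
            return False        # unknown store address: must wait
        if (addr & LOW_BITS_MASK) == (load_addr & LOW_BITS_MASK):
            return False        # possible match: conservatively wait
    return True

print(load_may_issue(0x1008, [0x2000, 0x3004]))   # low bits all differ
print(load_may_issue(0x1008, [None]))             # address unknown
print(load_may_issue(0x1008, [0x5008]))           # low bits match
```

The last call shows why the partial check is conservative: 0x5008 and 0x1008 are different addresses, but they share their bottom 12 bits, so the load is held back anyway.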

28 Address Speculation: st r1, (r2); ld r3, (r4). Guess that r4 != r2. Execute load before store address known. Need to hold all completed but uncommitted load/store addresses in program order. If subsequently find r4 == r2, squash load and all following instructions => large penalty for inaccurate address speculation.

29 Memory Dependence Prediction (Alpha 21264): st r1, (r2); ld r3, (r4). Guess that r4 != r2 and execute load before store. If later find r4 == r2, squash load and all following instructions, but mark load instruction as store-wait. Subsequent executions of the same load instruction will wait for all previous stores to complete. Periodically clear store-wait bits.
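
The store-wait idea can be sketched as a small predictor keyed by load PC. This is a simplified illustration, not the actual Alpha 21264 structure: the class name, method names, and the use of an unbounded Python set (rather than a fixed-size table of bits) are all assumptions.

```python
class StoreWaitPredictor:
    """Tracks which load instructions have been marked store-wait."""

    def __init__(self):
        self.store_wait = set()          # PCs of loads marked store-wait

    def may_speculate(self, load_pc):
        """True if the load may execute before older stores resolve."""
        return load_pc not in self.store_wait

    def on_violation(self, load_pc):
        """Called when a speculated load turned out to alias an older store."""
        self.store_wait.add(load_pc)

    def periodic_clear(self):
        """Periodically forget all marks so stale predictions do not persist."""
        self.store_wait.clear()

pred = StoreWaitPredictor()
first_try = pred.may_speculate(0x400)    # no history yet: speculate
pred.on_violation(0x400)                 # r4 == r2 was discovered, load squashed
second_try = pred.may_speculate(0x400)   # now waits for older stores
pred.periodic_clear()
after_clear = pred.may_speculate(0x400)  # allowed to speculate again
print(first_try, second_try, after_clear)
```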

30 Speculative Loads / Stores: Just like register updates, stores should not modify the memory until after the instruction is committed. A speculative store buffer is a structure introduced to hold speculative store data.

31 Speculative Store Buffer [Diagram: store buffer entries, each with V (valid) and S (speculative) bits, Tag, and Data; Load Address is checked against the Tags; Store Commit Path to L1 Data Cache; Load Data]. On store execute: mark entry valid and speculative, and save data and tag of instruction. On store commit: clear speculative bit and eventually move data to cache. On store abort: clear valid bit.

32 Speculative Store Buffer [Diagram: store buffer entries with V and S bits, Tag, and Data; Load Address; Store Commit Path; L1 Data Cache; Load Data]. If data in both store buffer and cache, which should we use? Speculative store buffer. If same address in store buffer twice, which should we use? Youngest store older than load.
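
The two lookup rules on this slide can be sketched as follows. The `seq` field (a program-order sequence number), the dictionary layout of a store-buffer entry, and the function name `load_value` are assumptions for illustration; the slides do not specify how age is tracked.

```python
def load_value(load_seq, addr, store_buffer, cache):
    """Return the data a load should see.

    Prefer the store buffer over the cache, and among matching stores pick
    the youngest one that is still older than the load.
    """
    candidates = [s for s in store_buffer
                  if s["valid"] and s["addr"] == addr and s["seq"] < load_seq]
    if candidates:
        return max(candidates, key=lambda s: s["seq"])["data"]
    return cache.get(addr)

store_buffer = [
    {"seq": 1, "addr": 0x10, "data": 11, "valid": True},
    {"seq": 3, "addr": 0x10, "data": 33, "valid": True},
    {"seq": 7, "addr": 0x10, "data": 77, "valid": True},   # younger than the load
]
cache = {0x10: 0, 0x20: 5}

forwarded = load_value(5, 0x10, store_buffer, cache)    # takes the seq-3 store
from_cache = load_value(5, 0x20, store_buffer, cache)   # no match in the buffer
print(forwarded, from_cache)
```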

33 Datapath: Branch Prediction and Speculative Execution [Pipeline diagram: PC, Fetch, Decode & Rename, Reorder Buffer, Commit; Execute: Reg. File, Branch Unit, ALU, MEM, Store Buffer, D$; Branch Prediction; Branch Resolution with kill signals]

34 Multithreading: Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control. Many workloads can make use of thread-level parallelism (TLP): TLP from multiprogramming (run independent sequential jobs); TLP from multithreaded applications (run one job faster using parallel threads). Multithreading uses TLP to improve utilization of a single processor.

35 Pipeline Hazards [Pipeline timing diagram, t0-t14: LW r1, 0(r2); LW r5, 12(r1); ADDI r5, r5, #12; SW 12(r1), r5; each instruction flows F D X M W, with repeated D and F stall cycles for the dependent instructions]. Each instruction may depend on the previous one. What is usually done to cope with this? Interlocks (slow) or bypassing (needs hardware, doesn't help all hazards).

36 Multithreading: How can we guarantee no dependencies between instructions in a pipeline? One way is to interleave execution of instructions from different program threads on same pipeline. Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe [Pipeline timing diagram, t0-t9: T1: LW r1, 0(r2); T2: ADD r7, r1, r4; T3: XORI r5, r4, #12; T4: SW 0(r7), r5; T1: LW r5, 12(r1); each instruction flows F D X M W, one cycle behind the previous one]. Prior instruction in a thread always completes writeback before next instruction in same thread reads register file.
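
The timing claim on this slide can be checked with a few lines of arithmetic. The sketch assumes the five stages F, D, X, M, W each take one cycle, that registers are read in D and written in W, and that a value written in W is visible only to reads in later cycles; the helper names are hypothetical.

```python
STAGES = ["F", "D", "X", "M", "W"]

def stage_cycle(fetch_cycle, stage):
    """Cycle in which an instruction fetched at fetch_cycle is in `stage`."""
    return fetch_cycle + STAGES.index(stage)

def safe_without_bypass(num_threads):
    """With fixed round-robin interleave, consecutive instructions of one
    thread are fetched num_threads cycles apart. The pair is safe if the
    first one's writeback happens strictly before the second one's
    register read."""
    first_writeback = stage_cycle(0, "W")
    next_reg_read = stage_cycle(num_threads, "D")
    return first_writeback < next_reg_read

print(safe_without_bypass(4))   # writeback in cycle 4, next read in cycle 5
print(safe_without_bypass(3))   # writeback and next read both in cycle 4
```

Under these assumptions four threads is the smallest count that removes the need for interlocks or bypassing on this pipeline, which matches the T1-T4 example.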

37 CDC 6600 Peripheral Processors (Cray, 1964): First multithreaded hardware. 10 virtual I/O processors. Fixed interleave on simple pipeline. Pipeline has 100ns cycle time. Each virtual processor executes one instruction every 1000ns. Accumulator-based instruction set to reduce processor state.

38 Simple Multithreaded Pipeline [Diagram: four PCs and four GPR files chosen by a 2-bit thread select; I$, IR, X, Y, D$, +1]. Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage. Appears to software (including OS) as multiple, albeit slower, CPUs.

39 Multithreading Costs: Each thread requires its own user state: PC, GPRs. Also needs its own system state: virtual memory page table base register, exception handling registers. Other overheads: additional cache/TLB conflicts from competing threads (or add larger cache/TLB capacity); more OS overhead to schedule more threads (where do all these threads come from?).

40 Thread Scheduling Policies: Fixed interleave (CDC 6600 PPUs, 1964): each of N threads executes one instruction every N cycles; if thread not ready to go in its slot, insert pipeline bubble. Software-controlled interleave (TI ASC PPUs, 1971): OS allocates S pipeline slots amongst N threads; hardware performs fixed interleave over S slots, executing whichever thread is in that slot. Hardware-controlled thread scheduling (HEP, 1982): hardware keeps track of which threads are ready to go; picks next thread to execute based on hardware priority scheme.
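
Two of these policies can be contrasted in a small sketch. The `ready` matrix (whether thread t could issue in cycle c) is made-up input, and "lowest-numbered ready thread first" is only a stand-in for the unspecified hardware priority scheme; both are assumptions for illustration.

```python
def fixed_interleave(ready, cycles):
    """Thread c % N owns cycle c; if it is not ready, the slot is a bubble (None)."""
    n = len(ready)
    schedule = []
    for c in range(cycles):
        t = c % n
        schedule.append(t if ready[t][c] else None)
    return schedule

def hardware_controlled(ready, cycles):
    """Each cycle, issue from the lowest-numbered ready thread (hypothetical priority)."""
    schedule = []
    for c in range(cycles):
        choice = next((t for t in range(len(ready)) if ready[t][c]), None)
        schedule.append(choice)
    return schedule

# Two threads over four cycles; thread 1 is stalled during cycles 0 and 1.
ready = [[True, True, True, True],
         [False, False, True, True]]

fixed = fixed_interleave(ready, 4)
dynamic = hardware_controlled(ready, 4)
print(fixed)
print(dynamic)
```

Fixed interleave wastes the slot that belongs to the stalled thread, while the ready-tracking scheduler fills every cycle. The simple priority rule used here also starves thread 1, which is one reason a real priority scheme would be more elaborate.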

41 Denelcor HEP (Burton Smith, 1982): First commercial machine to use hardware threading in main CPU. 120 threads per processor. 10 MHz clock rate. Up to 8 processors. Precursor to Tera MTA (Multithreaded Architecture).

42 Tera MTA (1990-): Up to 256 processors. Up to 128 active threads per processor. Processors and memory modules populate a sparse 3D torus interconnection fabric. Flat, shared main memory. No data cache. Sustains one main memory access per cycle per processor. GaAs logic in prototype, 260MHz. Second version CMOS, MTA-2, 50W/processor. New version, XMT, fits into AMD Opteron socket, runs at 500MHz.

43 MTA Pipeline [Diagram: Inst Fetch, Issue Pool, M/A/C units, W stages, Write Pool, Memory Pool, Retry Pool, Interconnection Network, Memory pipeline]. Every cycle, one VLIW instruction from one active thread is launched into pipeline. Instruction pipeline is 21 cycles long. Memory operations incur ~150 cycles of latency. Assuming a single thread issues one instruction every 21 cycles, and clock rate is 260 MHz, what is single-thread performance? Effective single-thread issue rate is 260/21 = 12.4 MIPS.
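
The single-thread figure is just the clock rate divided by the issue interval. A short worked calculation, using only the numbers given on the slide (the variable names are introduced here):

```python
clock_mhz = 260        # prototype clock rate from the slide
issue_interval = 21    # one instruction per thread every 21 cycles

# A 260 MHz clock gives 260 million cycles per second; a thread that issues
# once every 21 cycles therefore issues 260/21 million instructions per second.
single_thread_mips = clock_mhz / issue_interval
print(round(single_thread_mips, 1))
```

By the same reasoning, keeping the pipeline issuing every cycle under this assumption would take on the order of 21 active threads, each contributing one instruction per 21-cycle window.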

44 Coarse-Grain Multithreading: Tera MTA designed for supercomputing applications with large data sets and low locality: no data cache; many parallel threads needed to hide large memory latency. Other applications are more cache friendly: few pipeline bubbles if cache mostly has hits; just add a few threads to hide occasional cache miss latencies; swap threads on cache misses.

45 MIT Alewife (1990): Modified SPARC chips: register windows hold different thread contexts. Up to four threads per node. Thread switch on local cache miss.

46 IBM PowerPC RS64-IV (2000): Commercial coarse-grain multithreading CPU. Based on PowerPC with quad-issue in-order five-stage pipeline. Each physical CPU supports two virtual CPUs. On L2 cache miss, pipeline is flushed and execution switches to second thread: short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency; flush pipeline to simplify exception handling.

47 Oracle/Sun Niagara processors: Target is datacenters running web servers and databases, with many concurrent requests. Provide multiple simple cores each with multiple hardware threads, reduced energy/operation though much lower single thread performance. Niagara-1 [2004], 8 cores, 4 threads/core. Niagara-2 [2007], 8 cores, 8 threads/core. Niagara-3 [2009], 16 cores, 8 threads/core.

48 Oracle/Sun Niagara-3, Rainbow Falls

49 Simultaneous Multithreading (SMT) for OoO Superscalars: Techniques presented so far have all been vertical multithreading, where each pipeline stage works on one thread at a time. SMT uses fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on same clock cycle. Gives better utilization of machine resources.

50 For most apps, most execution units lie idle in an OoO superscalar [Chart: for an 8-way superscalar]. From: Tullsen, Eggers, and Levy, Simultaneous Multithreading: Maximizing On-chip Parallelism, ISCA

51 Superscalar Machine Efficiency [Diagram: instruction issue slots, issue width vs. time]. Completely idle cycle (vertical waste). Partially filled cycle, i.e., IPC < 4 (horizontal waste).
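
The two kinds of waste can be counted from a per-cycle issue trace. The trace below is made-up data for a 4-wide machine, and the function name `issue_waste` is hypothetical; the definitions follow the slide (a completely idle cycle is vertical waste, unused slots in a partially filled cycle are horizontal waste).

```python
def issue_waste(issued_per_cycle, issue_width):
    """Return (vertical_waste, horizontal_waste, total_slots) in issue slots."""
    # A cycle with nothing issued wastes the full issue width.
    vertical = sum(issue_width for n in issued_per_cycle if n == 0)
    # A partially filled cycle wastes only its unused slots.
    horizontal = sum(issue_width - n for n in issued_per_cycle
                     if 0 < n < issue_width)
    total_slots = issue_width * len(issued_per_cycle)
    return vertical, horizontal, total_slots

trace = [2, 0, 4, 1, 0, 3]   # instructions issued in each of six cycles
vertical, horizontal, total = issue_waste(trace, issue_width=4)
used = sum(trace)
print(vertical, horizontal, used, total)
```

Here 8 slots are lost to idle cycles and 6 to partially filled cycles, leaving 10 of 24 slots used.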

52 Vertical Multithreading [Diagram: issue width vs. time, second thread interleaved cycle-by-cycle]. Partially filled cycle, i.e., IPC < 4 (horizontal waste). What is the effect of cycle-by-cycle interleaving? Removes vertical waste, but leaves some horizontal waste.

53 Chip Multiprocessing (CMP) [Diagram: issue width vs. time]. What is the effect of splitting into multiple processors? Reduces horizontal waste, leaves some vertical waste, and puts upper limit on peak throughput of each thread.

54 Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995] [Diagram: issue width vs. time]. Interleave multiple threads to multiple issue slots with no restrictions.

55 O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]: Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously. Utilize wide out-of-order superscalar processor issue queue to find instructions to issue from multiple threads. OOO instruction window already has most of the circuitry required to schedule from multiple threads. Any single thread can utilize whole machine.

56 IBM Power 4: Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.

57 Power 4 vs. Power 5 [Pipeline diagrams]. Power 5: 2 commits (architected register sets); 2 fetch (PC), 2 initial decodes.

58 Power 5 data flow... Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.

59 Changes in Power 5 to support SMT: Increased associativity of L1 instruction cache and the instruction address translation buffers. Added per-thread load and store queues. Increased size of the L2 (1.92 vs MB) and L3 caches. Added separate instruction prefetch and buffering per thread. Increased the number of virtual registers from 152 to 240. Increased the size of several issue queues. The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support.

60 Pentium-4 Hyperthreading (2002): First commercial SMT design (2-way SMT). Hyperthreading == SMT. Logical processors share nearly all resources of the physical processor: caches, execution units, branch predictors. Die area overhead of hyperthreading ~5%. When one logical processor is stalled, the other can make progress. No logical processor can use all entries in queues when two threads are active. Processor running only one active software thread runs at approximately same speed with or without hyperthreading. Hyperthreading dropped on OoO P6-based follow-ons to Pentium 4 (Pentium-M, Core Duo, Core 2 Duo), until revived with Nehalem generation machines. Intel Atom (in-order x86 core) has two-way vertical multithreading.

61 Initial Performance of SMT: Pentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate. Pentium 4 is dual-threaded SMT. SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark. Running on Pentium 4, each of 26 SPEC benchmarks paired with every other (26^2 runs): speed-ups from 0.90 to 1.58; average was 1.20. Power 5, 8-processor server: 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate. Power 5 running 2 copies of each app: speedup between 0.89 and 1.41. Most gained some. Fl.Pt. apps had most cache conflicts and least gains.

62 SMT adaptation to parallelism type [Diagrams: issue width vs. time]. For regions with high thread-level parallelism (TLP), entire machine width is shared by all threads. For regions with low thread-level parallelism (TLP), entire machine width is available for instruction-level parallelism (ILP).

63 Icount Choosing Policy: Fetch from thread with the least instructions in flight. Why does this enhance throughput?
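
The selection rule can be sketched in a couple of lines. The per-thread in-flight counts are made-up input, the function name `icount_pick` is hypothetical, and breaking ties toward the lowest thread id is an arbitrary choice made here.

```python
def icount_pick(in_flight):
    """Pick the thread to fetch from next.

    in_flight maps thread id -> number of instructions currently in flight.
    Ties go to the lowest thread id (an arbitrary choice for this sketch).
    """
    return min(sorted(in_flight), key=lambda t: in_flight[t])

counts = {0: 12, 1: 5, 2: 9}
chosen = icount_pick(counts)
print(chosen)

# After fetching for the chosen thread its count rises, so a thread that
# keeps piling up unissued instructions gradually loses fetch priority.
counts[chosen] += 8
print(icount_pick(counts))
```

One plausible reading of the slide's question is that a thread with few instructions in flight is draining quickly, so favoring it keeps the shared queues from filling up with instructions from a stalled thread.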

64 Summary: Multithreaded Categories [Diagram: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; legend: Thread 1 through Thread 5, idle slot]

65 Acknowledgements: These slides contain material developed and copyright by: UCB; John Kubiatowicz (UCB); David Patterson (UCB). UCB material derived from courses CS252 and CS152.


More information

CMU Introduction to Computer Architecture, Spring 2013 HW 3 Solutions: Microprogramming Wrap-up and Pipelining

CMU Introduction to Computer Architecture, Spring 2013 HW 3 Solutions: Microprogramming Wrap-up and Pipelining CMU 18-447 Introduction to Computer Architecture, Spring 2013 HW 3 Solutions: Microprogramming Wrap-up and Pipelining Instructor: Prof. Onur Mutlu TAs: Justin Meza, Yoongu Kim, Jason Lin 1 Adding the REP

More information

Storage and Memory Hierarchy CS165

Storage and Memory Hierarchy CS165 Storage and Memory Hierarchy CS165 What is the memory hierarchy? L1

More information

Pipelined MIPS Datapath with Control Signals

Pipelined MIPS Datapath with Control Signals uction ess uction Rs [:26] (Opcode[5:]) [5:] ranch luor. Decoder Pipelined MIPS path with Signals luor Raddr at Five instruction sequence to be processed by pipeline: op [:26] rs [25:2] rt [2:6] rd [5:]

More information

CS 152 Computer Architecture and Engineering

CS 152 Computer Architecture and Engineering CS 152 Computer Architecture and Engineering Lecture 23 Synchronization 2006-11-16 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Udam Saini and Jue Sun www-inst.eecs.berkeley.edu/~cs152/ 1 Last Time:

More information

Fixing the Hyperdrive: Maximizing Rendering Performance on NVIDIA GPUs

Fixing the Hyperdrive: Maximizing Rendering Performance on NVIDIA GPUs Fixing the Hyperdrive: Maximizing Rendering Performance on NVIDIA GPUs Louis Bavoil, Principal Engineer Booth #223 - South Hall www.nvidia.com/gdc Full-Screen Pixel Shader SM TEX L2 DRAM CROP SM = Streaming

More information

Optimality of Tomasulo s Algorithm Luna, Dong Gang, Zhao

Optimality of Tomasulo s Algorithm Luna, Dong Gang, Zhao Optimality of Tomasulo s Algorithm Luna, Dong Gang, Zhao Feb 28th, 2002 Our Questions about Tomasulo Questions about Tomasulo s Algorithm Is it optimal (can always produce the wisest instruction execution

More information

6.823 Computer System Architecture Prerequisite Self-Assessment Test Assigned Feb. 6, 2019 Due Feb 11, 2019

6.823 Computer System Architecture Prerequisite Self-Assessment Test Assigned Feb. 6, 2019 Due Feb 11, 2019 6.823 Computer System Architecture Prerequisite Self-Assessment Test Assigned Feb. 6, 2019 Due Feb 11, 2019 http://csg.csail.mit.edu/6.823/ This self-assessment test is intended to help you determine your

More information

Chapter 3: Computer Organization Fundamentals. Oregon State University School of Electrical Engineering and Computer Science.

Chapter 3: Computer Organization Fundamentals. Oregon State University School of Electrical Engineering and Computer Science. Chapter 3: Computer Organization Fundamentals Prof. Ben Lee Oregon State University School of Electrical Engineering and Computer Science Chapter Goals Understand the organization of a computer system

More information

In-Place Associative Computing:

In-Place Associative Computing: In-Place Associative Computing: A New Concept in Processor Design 1 Page Abstract 3 What s Wrong with Existing Processors? 3 Introducing the Associative Processing Unit 5 The APU Edge 5 Overview of APU

More information

Energy Efficient Content-Addressable Memory

Energy Efficient Content-Addressable Memory Energy Efficient Content-Addressable Memory Advanced Seminar Computer Engineering Institute of Computer Engineering Heidelberg University Fabian Finkeldey 26.01.2016 Fabian Finkeldey, Energy Efficient

More information

Practical Resource Management in Power-Constrained, High Performance Computing

Practical Resource Management in Power-Constrained, High Performance Computing Practical Resource Management in Power-Constrained, High Performance Computing Tapasya Patki*, David Lowenthal, Anjana Sasidharan, Matthias Maiterth, Barry Rountree, Martin Schulz, Bronis R. de Supinski

More information

ARC-H: Adaptive replacement cache management for heterogeneous storage devices

ARC-H: Adaptive replacement cache management for heterogeneous storage devices Journal of Systems Architecture 58 (2012) ARC-H: Adaptive replacement cache management for heterogeneous storage devices Young-Jin Kim, Division of Electrical and Computer Engineering, Ajou University,

More information

Topics on Compilers. Introduction to CGRA

Topics on Compilers. Introduction to CGRA 4541.775 Topics on Compilers Introduction to CGRA Spring 2011 Reconfigurable Architectures reconfigurable hardware (reconfigware) implement specific hardware structures dynamically and on demand high performance

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 02

More information

Pipeline Hazards. See P&H Chapter 4.7. Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University

Pipeline Hazards. See P&H Chapter 4.7. Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University Pipeline Hazards See P&H Chapter 4.7 Hakim Weatherspoon CS 341, Spring 213 Computer Science Cornell niversity Goals for Today Data Hazards Revisit Pipelined Processors Data dependencies Problem, detection,

More information

CACHE LINE AWARE OPTIMIZATIONS FOR CCNUMA SYSTEMS

CACHE LINE AWARE OPTIMIZATIONS FOR CCNUMA SYSTEMS CACHE LINE AWARE OPTIMIZATIONS FOR CCNUMA SYSTEMS 24th ACM International Symposium on High-Performance Parallel and Distributed Computing HPDC 15, Portland, 2015 Sabela Ramos (sramos@udc.es) GAC, Universidade

More information

Enhancing Energy Efficiency of Database Applications Using SSDs

Enhancing Energy Efficiency of Database Applications Using SSDs Seminar Energy-Efficient Databases 29.06.2011 Enhancing Energy Efficiency of Database Applications Using SSDs Felix Martin Schuhknecht Motivation vs. Energy-Efficiency Seminar 29.06.2011 Felix Martin Schuhknecht

More information

Lecture Secure, Trusted and Trustworthy Computing Trusted Execution Environments Intel SGX

Lecture Secure, Trusted and Trustworthy Computing Trusted Execution Environments Intel SGX 1 Lecture Secure, and Trustworthy Computing Execution Environments Intel Prof. Dr.-Ing. Ahmad-Reza Sadeghi System Security Lab Technische Universität Darmstadt (CASED) Germany Winter Term 2015/2016 Intel

More information

Pipeline Hazards. See P&H Chapter 4.7. Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University

Pipeline Hazards. See P&H Chapter 4.7. Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University Pipeline Hazards See P&H Chapter 4.7 Hakim Weatherspoon CS 341, Spring 213 Computer Science Cornell niversity Goals for Today Data Hazards Revisit Pipelined Processors Data dependencies Problem, detection,

More information

CMPEN 411 VLSI Digital Circuits Spring Lecture 20: Multiplier Design

CMPEN 411 VLSI Digital Circuits Spring Lecture 20: Multiplier Design CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 20: Multiplier Design [Adapted from Rabaey s Digital Integrated Circuits, Second Edition, 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp11 CMPEN 411

More information

Dell EMC SCv ,000 Mailbox Exchange 2016 Resiliency Storage Solution using 10K drives

Dell EMC SCv ,000 Mailbox Exchange 2016 Resiliency Storage Solution using 10K drives Dell EMC SCv3020 14,000 Mailbox Exchange 2016 Resiliency Storage Solution using 10K drives Microsoft ESRP 4.0 Abstract This document describes the Dell EMC SCv3020 storage solution for Microsoft Exchange

More information

Near-Optimal Precharging in High-Performance Nanoscale CMOS Caches

Near-Optimal Precharging in High-Performance Nanoscale CMOS Caches Near-Optimal Precharging in High-Performance Nanoscale CMOS Caches Se-Hyun Yang and Babak Falsafi Computer Architecture Laboratory (CALCM) Carnegie Mellon University {sehyun, babak}@cmu.edu http://www.ece.cmu.edu/~powertap

More information

Direct-Mapped Cache Terminology. Caching Terminology. TIO Dan s great cache mnemonic. UCB CS61C : Machine Structures

Direct-Mapped Cache Terminology. Caching Terminology. TIO Dan s great cache mnemonic. UCB CS61C : Machine Structures Lecturer SOE Dan Garcia inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 31 Caches II 2008-04-12 HP has begun testing research prototypes of a novel non-volatile memory element, the

More information

M2 Instruction Set Architecture

M2 Instruction Set Architecture M2 Instruction Set Architecture Module Outline Addressing modes. Instruction classes. MIPS-I ISA. High level languages, Assembly languages and object code. Translating and starting a program. Subroutine

More information

128Mb Synchronous DRAM. Features High Performance: Description. REV 1.0 May, 2001 NT5SV32M4CT NT5SV16M8CT NT5SV8M16CT

128Mb Synchronous DRAM. Features High Performance: Description. REV 1.0 May, 2001 NT5SV32M4CT NT5SV16M8CT NT5SV8M16CT Features High Performance: f Clock Frequency -7K 3 CL=2-75B, CL=3-8B, CL=2 Single Pulsed RAS Interface Fully Synchronous to Positive Clock Edge Four Banks controlled by BS0/BS1 (Bank Select) Units 133

More information

WHITE PAPER. Informatica PowerCenter 8 on HP Integrity Servers: Doubling Performance with Linear Scalability for 64-bit Enterprise Data Integration

WHITE PAPER. Informatica PowerCenter 8 on HP Integrity Servers: Doubling Performance with Linear Scalability for 64-bit Enterprise Data Integration WHITE PAPER Informatica PowerCenter 8 on HP Integrity Servers: Doubling Performance with Linear Scalability for 64-bit Enterprise Data Integration This document contains Confi dential, Proprietary and

More information

SDRAM AS4SD8M Mb: 8 Meg x 16 SDRAM Synchronous DRAM Memory. PIN ASSIGNMENT (Top View)

SDRAM AS4SD8M Mb: 8 Meg x 16 SDRAM Synchronous DRAM Memory. PIN ASSIGNMENT (Top View) 128 Mb: 8 Meg x 16 SDRAM Synchronous DRAM Memory FEATURES Full Military temp (-55 C to 125 C) processing available Configuration: 8 Meg x 16 (2 Meg x 16 x 4 banks) Fully synchronous; all signals registered

More information

Finite Element Based, FPGA-Implemented Electric Machine Model for Hardware-in-the-Loop (HIL) Simulation

Finite Element Based, FPGA-Implemented Electric Machine Model for Hardware-in-the-Loop (HIL) Simulation Finite Element Based, FPGA-Implemented Electric Machine Model for Hardware-in-the-Loop (HIL) Simulation Leveraging Simulation for Hybrid and Electric Powertrain Design in the Automotive, Presentation Agenda

More information

CMPEN 411 VLSI Digital Circuits Spring Lecture 24: Peripheral Memory Circuits

CMPEN 411 VLSI Digital Circuits Spring Lecture 24: Peripheral Memory Circuits CMPEN 411 VLSI Digital Circuits Spring 2012 Lecture 24: Peripheral Memory Circuits [Adapted from Rabaey s Digital Integrated Circuits, Second Edition, 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp12

More information

CS250 VLSI Systems Design

CS250 VLSI Systems Design CS250 VLSI Systems Design Lecture 4: Physical Realities: Beneath the Digital Abstraction, Part 1: Timing Spring 2016 John Wawrzynek with Chris Yarp (GSI) Lecture 04, Timing CS250, UC Berkeley Sp16 What

More information

General Processor Information

General Processor Information General Processor Information Copyright 1994-2000 Tom Burd Last Modified: April 11, 2000 (DISCLAIMER: SPEC performance numbers are the highest rated for a given processor version. Actual performance depends

More information

UTILIZING WAVE ROTOR TECHNOLOGY TO ENHANCE THE TURBO COMPRESSION IN POWER AND REFRIGERATION CYCLES

UTILIZING WAVE ROTOR TECHNOLOGY TO ENHANCE THE TURBO COMPRESSION IN POWER AND REFRIGERATION CYCLES Proceedings of IMECE 3 3 ASME International Mechanical Engineering Congress & Exosition Washington, D.C., November -, 3 IMECE3- UTILIZING WAVE ROTOR TECHNOLOGY TO ENHANCE THE TURBO COMPRESSION IN POWER

More information

Welcome to the waitless world. CBU for IBM i. Steve Finnes

Welcome to the waitless world. CBU for IBM i. Steve Finnes CBU for IBM i Steve Finnes finnes@us.ibm.com CBU for IBM i Offering for IBM i HA/DR environments Consolidation environments (AIX, i and Linux) for HA/DR operations Offering Supports Optional permanent

More information

SYNCHRONOUS DRAM. 128Mb: x32 SDRAM. MT48LC4M32B2-1 Meg x 32 x 4 banks

SYNCHRONOUS DRAM. 128Mb: x32 SDRAM. MT48LC4M32B2-1 Meg x 32 x 4 banks SYNCHRONOUS DRAM 128Mb: x32 MT48LC4M32B2-1 Meg x 32 x 4 banks For the latest data sheet, please refer to the Micron Web site: www.micron.com/sdramds FEATURES PC100 functionality Fully synchronous; all

More information

General Processor Information

General Processor Information General Information Copyright 1994-2000 Tom Burd Last Modified: January 10, 2001 (DISCLAIMER: SPEC performance numbers are the highest rated for a given processor version. Actual performance depends on

More information

Test Infrastructure Design for Core-Based System-on-Chip Under Cycle-Accurate Thermal Constraints

Test Infrastructure Design for Core-Based System-on-Chip Under Cycle-Accurate Thermal Constraints Test Infrastructure Design for Core-Based System-on-Chip Under Cycle-Accurate Thermal Constraints Thomas Edison Yu, Tomokazu Yoneda, Krishnendu Chakrabarty and Hideo Fujiwara Nara Institute of Science

More information

128Mb DDR SDRAM. Features. Description. REV 1.1 Oct, 2006

128Mb DDR SDRAM. Features. Description. REV 1.1 Oct, 2006 Features Double data rate architecture: two data transfers per clock cycle Bidirectional data strobe () is transmitted and received with data, to be used in capturing data at the receiver is edge-aligned

More information

CSCI 510: Computer Architecture Written Assignment 2 Solutions

CSCI 510: Computer Architecture Written Assignment 2 Solutions CSCI 510: Computer Architecture Written Assignment 2 Solutions The following code does compution over two vectors. Consider different execution scenarios and provide the average number of cycles per iterion

More information

Helping Moore s Law: Architectural Techniques to Address Parameter Variation

Helping Moore s Law: Architectural Techniques to Address Parameter Variation Helping Moore s Law: Architectural Techniques to Address Parameter Variation Computer Science Department University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu/~teodores Technology scaling

More information

Green Server Design: Beyond Operational Energy to Sustainability

Green Server Design: Beyond Operational Energy to Sustainability Green Server Design: Beyond Operational Energy to Sustainability Justin Meza Carnegie Mellon University Jichuan Chang, Partha Ranganathan, Cullen Bash, Amip Shah Hewlett-Packard Laboratories 1 Overview

More information

High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)

High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP) High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP) 1 T H E A C M I E E E I N T E R N A T I O N A L S Y M P O S I U M O N C O M P U T E R A R C H I T E C T U R E ( I S C A

More information

Chapter 10 And, Finally... The Stack

Chapter 10 And, Finally... The Stack Chapter 10 And, Finally... The Stack Stacks: An Abstract Data Type A LIFO (last-in first-out) storage structure. The first thing you put in is the last thing you take out. The last thing you put in is

More information

Setup of a multi-os platform based on the Xen hypervisor. An industral case study. Paolo Burgio

Setup of a multi-os platform based on the Xen hypervisor. An industral case study. Paolo Burgio Setup of a multi-os platform based on the Xen hypervisor An industral case study Paolo Burgio paolo.burgio@unimore.it Roberto Cavicchioli Ignacio Sanudo Olmedo Marco Solieri Who are we? High-Performance

More information

ABB June 19, Slide 1

ABB June 19, Slide 1 Dr Simon Round, Head of Technology Management, MATLAB Conference 2015, Bern Switzerland, 9 June 2015 A Decade of Efficiency Gains Leveraging modern development methods and the rising computational performance-price

More information

UC Berkeley CS61C : Machine Structures

UC Berkeley CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c UC Berkeley CS61C : Machine Structures Lecture 20 Synchronous Digital Systems Blu-ray vs HD-DVD war over? As you know, there are two different, competing formats for the next

More information

Performance Analysis with Vampir

Performance Analysis with Vampir Performance Analysis with Vampir Bert Wesarg Technische Universität Dresden Outline Part I: Welcome to the Vampir Tool Suite Mission Event trace visualization Vampir & VampirServer The Vampir displays

More information

Sinfonia: a new paradigm for building scalable distributed systems

Sinfonia: a new paradigm for building scalable distributed systems CS848 Paper Presentation Sinfonia: a new paradigm for building scalable distributed systems Aguilera, Merchant, Shah, Veitch, Karamanolis SOSP 2007 Presented by Somayyeh Zangooei David R. Cheriton School

More information

- - DQ0 NC DQ1 DQ0 DQ2 - NC DQ1 DQ3 NC - NC

- - DQ0 NC DQ1 DQ0 DQ2 - NC DQ1 DQ3 NC - NC SYNCHRONOUS DRAM 64Mb: x4, x8, x16 MT48LC16M4A2 4 Meg x 4 x 4 banks MT48LC8M8A2 2 Meg x 8 x 4 banks MT48LC4M16A2 1 Meg x 16 x 4 banks For the latest data sheet, please refer to the Micron Web site: www.micron.com/mti/msp/html/datasheet.html

More information

Reseller Update. Update no: 279

Reseller Update. Update no: 279 Reseller Update Update no: 279 Date: 13 th September 2000 ----------------------------------------------------------------------------------------------------------------- INDeX Call Centre Modules Update

More information

Copyright 2012 EMC Corporation. All rights reserved.

Copyright 2012 EMC Corporation. All rights reserved. 1 Transforming Storage: An EMC Overview Symmetrix storage systems Boštjan Zadnik Technology Consultant Bostjan.Zadnik@emc.com 2 Data Sources Are Expanding Source: 2011 IDC Digital Universe Study 3 Applications

More information

EEC 216 Lecture #10: Power Sources. Rajeevan Amirtharajah University of California, Davis

EEC 216 Lecture #10: Power Sources. Rajeevan Amirtharajah University of California, Davis EEC 216 Lecture #10: Power Sources Rajeevan Amirtharajah University of California, Davis Announcements Outline Review: Adiabatic Charging and Energy Recovery Lecture 9: Dynamic Energy Recovery Logic Lecture

More information

IS42S32200C1. 512K Bits x 32 Bits x 4 Banks (64-MBIT) SYNCHRONOUS DYNAMIC RAM

IS42S32200C1. 512K Bits x 32 Bits x 4 Banks (64-MBIT) SYNCHRONOUS DYNAMIC RAM 512K Bits x 32 Bits x 4 Banks (64-MBIT) SYNCHRONOUS DYNAMIC RAM JANUARY 2007 FEATURES Clock frequency: 183, 166, 143 MHz Fully synchronous; all signals referenced to a positive clock edge Internal bank

More information

Performance Characteristics of Hybrid MPI/OpenMP Implementations of NAS Parallel Benchmarks SP and BT on Large-Scale Multicore Supercomputers

Performance Characteristics of Hybrid MPI/OpenMP Implementations of NAS Parallel Benchmarks SP and BT on Large-Scale Multicore Supercomputers Performance Characteristics of Hybrid MPI/OpenMP Implementations of NAS Parallel Benchmarks SP and BT on Large-Scale Multicore Supercomputers Xingfu Wu Department of Computer Science and Engineering Institute

More information

Design and Experimental Study on Digital Speed Control System of a Diesel Generator

Design and Experimental Study on Digital Speed Control System of a Diesel Generator Research Journal of Applied Sciences, Engineering and Technology 6(14): 2584-2588, 2013 ISSN: 2040-7459; e-issn: 2040-7467 Maxwell Scientific Organization, 2013 Submitted: December 28, 2012 Accepted: February

More information

Real-Time Hardware-In-The- Loop Simulator Testbed Toolkit. Samuel Fix Space Department JHU/APL

Real-Time Hardware-In-The- Loop Simulator Testbed Toolkit. Samuel Fix Space Department JHU/APL Real-Time Hardware-In-The- Loop Simulator Testbed Toolkit Samuel Fix Space Department JHU/APL Agenda Introduction To Testbeds Testbed Toolkit History Testbed Toolkit Functionality Testbed Toolkit Future

More information

Koen De Schepper, Inton Tsang. Olga Bondarenko. Bob Briscoe. July, 2015

Koen De Schepper, Inton Tsang. Olga Bondarenko. Bob Briscoe. July, 2015 DualQ Couled AQM draft: htt://www.bobbriscoe.net/rojects/latency/draft-briscoe-aqm-dualq-couled-00.txt aer: htt://www.bobbriscoe.net/rojects/latency/dctth_rerint.df Koen De Scheer, Inton Tsang. Olga Bondarenko.

More information

FLEXIBILITY FOR THE HIGH-END DATA CENTER. Copyright 2013 EMC Corporation. All rights reserved.

FLEXIBILITY FOR THE HIGH-END DATA CENTER. Copyright 2013 EMC Corporation. All rights reserved. FLEXIBILITY FOR THE HIGH-END DATA CENTER 1 The World s Most Trusted Storage Platform More Than 20 Years Running the World s Most Critical Applications 1988 1990 1994 2000 2003 2005 2009 2011 2012 New Symmetrix

More information

Warped-Compression: Enabling Power Efficient GPUs through Register Compression

Warped-Compression: Enabling Power Efficient GPUs through Register Compression WarpedCompression: Enabling Power Efficient GPUs through Register Compression Sangpil Lee, Keunsoo Kim, Won Woo Ro (Yonsei University*) Gunjae Koo, Hyeran Jeon, Murali Annavaram (USC) (*Work done while

More information

A48P4616B. 16M X 16 Bit DDR DRAM. Document Title 16M X 16 Bit DDR DRAM. Revision History. AMIC Technology, Corp. Rev. No. History Issue Date Remark

A48P4616B. 16M X 16 Bit DDR DRAM. Document Title 16M X 16 Bit DDR DRAM. Revision History. AMIC Technology, Corp. Rev. No. History Issue Date Remark 16M X 16 Bit DDR DRAM Document Title 16M X 16 Bit DDR DRAM Revision History Rev. No. History Issue Date Remark 1.0 Initial issue January 9, 2014 Final (January, 2014, Version 1.0) AMIC Technology, Corp.

More information

IS42S32200L IS45S32200L

IS42S32200L IS45S32200L IS42S32200L IS45S32200L 512K Bits x 32 Bits x 4 Banks (64-MBIT) SYNCHRONOUS DYNAMIC RAM OCTOBER 2012 FEATURES Clock frequency: 200, 166, 143, 133 MHz Fully synchronous; all signals referenced to a positive

More information

How Much Power Does your Server Consume? Estimating Wall Socket Power Using RAPL Measurements

How Much Power Does your Server Consume? Estimating Wall Socket Power Using RAPL Measurements How Much Power Does your Server Consume? Estimating Wall Socket Power Using RAPL Measurements Kashif Nizam Khan Zhonghong Ou, Mikael Hirki, Jukka K. Nurminen, Tapio Niemi 1 Motivation The Large Hadron

More information

HYB25D256400/800AT 256-MBit Double Data Rata SDRAM

HYB25D256400/800AT 256-MBit Double Data Rata SDRAM 256-MBit Double Data Rata SDRAM Features CAS Latency and Frequency Maximum Operating Frequency (MHz) CAS Latency DDR266A -7 DDR200-8 2 133 100 2.5 143 125 Double data rate architecture: two data transfers

More information

The next revolution in simulation. Dr. Jan Leuridan Executive Vice-President, CTO LMS International

The next revolution in simulation. Dr. Jan Leuridan Executive Vice-President, CTO LMS International The next revolution in simulation Dr. Jan Leuridan Executive Vice-President, CTO LMS International The industry is facing faster and broader change (IBM CEO Survey 2008) Sustainability Radical new product

More information

Drowsy Caches Simple Techniques for Reducing Leakage Power Krisztián Flautner Nam Sung Kim Steve Martin David Blaauw Trevor Mudge

Drowsy Caches Simple Techniques for Reducing Leakage Power Krisztián Flautner Nam Sung Kim Steve Martin David Blaauw Trevor Mudge Drowsy Caches Simple Techniques for Reducing Leakage Power Krisztián Flautner Nam Sung Kim Steve Martin David Blaauw Trevor Mudge krisztian.flautner@arm.com kimns@eecs.umich.edu stevenmm@eecs.umich.edu

More information

Scheduling. Purpose of scheduling. Scheduling. Scheduling. Concurrent & Distributed Systems Purpose of scheduling.

Scheduling. Purpose of scheduling. Scheduling. Scheduling. Concurrent & Distributed Systems Purpose of scheduling. 427 Concurrent & Distributed Systems 2017 6 Uwe R. Zimmer - The Australian National University 429 Motivation and definition of terms Purpose of scheduling 2017 Uwe R. Zimmer, The Australian National University

More information

Adaptive Resource and Job Management for limited power consumption

Adaptive Resource and Job Management for limited power consumption Adaptive Resource and Job Management for limited power consumption 02/07/14 Bull, 2012 Yiannis Georgiou David Glesser Matthieu Hautreux Denis Trystram 1 Introduction High Performance Computing Target:

More information

An Analytical Study of GPU Computation for Solving QAPs by Parallel Evolutionary Computation with Independent Run

An Analytical Study of GPU Computation for Solving QAPs by Parallel Evolutionary Computation with Independent Run An Analyical Sudy of GPU Comuaion for Solving QAPs by Parallel Evoluionary Comuaion wih Indeenden Run Shigeyoshi Tsusui Hannan Univ., JAPAN Noriyuki Fujimoo Osaka Prefecure Univ., JAPAN 1 Ouline of This

More information