Announcements. Programming assignment #2 due Monday 9/24. Talk: Architectural Acceleration of Real Time Physics Glenn Reinman, UCLA CS


1 Pipelining II
Fall 2007, Prof. Thomas Wenisch
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin.
Wenisch. Portions: Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar.

2 Announcements
  HW #2 due Wednesday 9/26
  Programming assignment #2 due Monday 9/24
  Talk: Architectural Acceleration of Real Time Physics, Glenn Reinman, UCLA CS. Mon., Sep :5 3:5, CSE 3725
  Office hour on Wed. 9/19 moved to Tue. 9/18, 2 3

3 Readings
  For Wednesday: H & P Chapter A.7-A.8, 2.

4 Outline: Understanding the Execution Core
  1. 370's 5-stage pipeline (review)
  2. Implementing pipeline interlocks (review)
  3. Scoreboard scheduling (CDC 6600)
  4. Tomasulo's OoO scheduling algorithm (IBM 360)
  5. Precise interrupts with a Reorder Buffer (P6)
  6. Modern OoO (MIPS R10K, Netburst)

5 Handling Data Hazards
  Avoidance (static): make sure there are no hazards in the code.
  Detect and Stall (dynamic): stall until earlier instructions finish.
  Detect and Forward (dynamic): get the correct value from elsewhere in the pipeline.
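The detect-and-stall option can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function name `must_stall` and the representation of in-flight instructions as a list of destination registers are assumptions made for the example.

```python
from typing import Iterable, Optional


def must_stall(decode_srcs: Iterable[int],
               in_flight_dests: Iterable[Optional[int]]) -> bool:
    """Return True if the instruction in decode reads a register that an
    older, unfinished instruction will still write (a RAW hazard).

    in_flight_dests holds the destination register of each older
    instruction still in the pipeline, or None if it writes no register.
    """
    srcs = set(decode_srcs)
    return any(d is not None and d in srcs for d in in_flight_dests)


# nand reads r3 and r4 while the preceding add (dest r3) is still in flight.
print(must_stall([3, 4], [3]))
# No older instruction writes r3 or r4, so decode can proceed.
print(must_stall([3, 4], [None, 7]))
```

With forwarding, the same comparison is still made, but a match selects a bypass path instead of holding the instruction in decode.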

6 Handling Data Hazards: Detect & Forward
  Detection: same as detect and stall, but each possible hazard requires different forwarding paths.
  Forward: add data paths for all possible sources; add a mux in front of the ALU to select the source.
  Bypassing logic is often a critical path in wide-issue machines: the number of paths grows quadratically with machine width.
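The mux selection and the quadratic growth in bypass paths can be illustrated with a small Python sketch. Everything named here (`Latch`, `select_operand`, `forwarding_paths`) is hypothetical, and the path count uses a deliberately simple model: every ALU operand input may receive a value from every result held in a later pipeline stage. A load sitting in EX/Mem has no data yet and would still need a stall; that case is not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Latch:
    """Hypothetical pipeline latch: destination register and result of an
    older instruction. dest is None if it writes no register (noop, sw)."""
    dest: Optional[int]
    value: int = 0


def select_operand(src_reg: int, regfile_value: int,
                   ex_mem: Latch, mem_wb: Latch) -> Tuple[int, str]:
    """Mux in front of the ALU: prefer the youngest in-flight producer."""
    if ex_mem.dest is not None and ex_mem.dest == src_reg:
        return ex_mem.value, "EX/Mem"
    if mem_wb.dest is not None and mem_wb.dest == src_reg:
        return mem_wb.value, "Mem/WB"
    return regfile_value, "regfile"


def forwarding_paths(width: int, sources_per_inst: int = 2,
                     producer_stages: int = 2) -> int:
    """Count bypass paths under the simple model described above:
    (operand inputs) x (in-flight results), both proportional to width."""
    return (width * sources_per_inst) * (width * producer_stages)


# r3 is being produced by the instruction in EX/Mem, so the stale
# register-file value is bypassed.
print(select_operand(3, 10, Latch(3, 77), Latch(3, 55)))
for w in (1, 2, 4):
    print(w, forwarding_paths(w))
```

Doubling the width doubles both factors, which is where the quadratic growth comes from in this model.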

7 Code sequence:
  add   // r3 = r1 + r2
  nand  // r5 = r3 NAND r4
  add   // r7 = r3 + r6
  lw    // r6 = MEM[r3+10]
  sw    // MEM[r6+12] = r2
[Pipeline diagram rows: add, nand, add, lw, sw]

8 First half of cycle 3
[Figure: pipelined datapath (PC, instruction memory, register file, ALU, data memory; pipeline registers IF/ID, ID/EX, EX/Mem, Mem/WB) with forwarding paths. nand is in IF/ID, add is in ID/EX. Hazard detected on register 3.]

9 End of cycle 3
[Figure: the second add enters IF/ID; nand moves to ID/EX with hazard flag H1; the first add moves to EX/Mem.]

10 First half of cycle 4
[Figure: the second add is in decode and a new hazard is detected; nand is in ID/EX (H1), the first add in EX/Mem.]

11 End of cycle 4
[Figure: lw enters IF/ID; the second add is in ID/EX (H2); nand is in EX/Mem (H1); the first add is in Mem/WB.]

12 First half of cycle 5
[Figure: lw is in decode; no hazard is flagged.]

13 End of cycle 5
[Figure: sw enters IF/ID; lw is in ID/EX; the second add is in EX/Mem (H2); nand is in Mem/WB (H1).]

14 First half of cycle 6
[Figure: sw is in decode; hazard detected on register 6, the destination of the lw ahead of it; the enable signals are shown.]

15 End of cycle 6
[Figure: sw stays in IF/ID; a noop is in ID/EX; lw is in EX/Mem; the second add is in Mem/WB (H2).]

16 First half of cycle 7
[Figure: sw is still in decode; hazard on register 6.]

17 End of cycle 7
[Figure: sw moves to ID/EX with hazard flag H3; the noop is in EX/Mem; lw is in Mem/WB.]

18 First half of cycle 8
[Figure: sw is in ID/EX (H3); the noop is in EX/Mem; lw is in Mem/WB.]

19 End of cycle 8
[Figure: sw is in EX/Mem (H3); the noop is in Mem/WB.]

20 Load Delay Slot (MIPS R2000)
[Timing diagram, t0 to t5: instructions i, j, k each pass through stages F D E M W, one cycle apart.]
  h: Rk <- --
  i: Rk <- MEM[-]
  j: -- <- Rk
  k: -- <- Rk
The effect of a delayed load is not visible to the instructions in its delay slots. Which Rk do we really mean?

21 Control Hazards
  Code: beq, sub
[Timing diagram, t0 to t5: beq and sub each shown passing through F D E M W; the instruction fetched behind the branch before it resolves is squashed.]

22 Handling Control Hazards
  Avoidance (static): no branches? Convert branches to predication, so that a control dependence becomes a data dependence.
  Detect and stall (dynamic): stop fetch until the branch resolves.
  Speculate and squash (dynamic): keep going past the branch; throw away instructions if wrong.

23 Avoidance: if-conversion
  Source:
    if (a == b) { x++; y = n / d; }
  With a branch:
    sub t <- a, b
    jnz t, PC+2
    add x <- x, #1
    div y <- n, d
  With predicated instructions:
    sub t <- a, b
    add(t) x <- x, #1
    div(t) y <- n, d
  With conditional moves:
    sub t <- a, b
    add t2 <- x, #1
    div t3 <- n, d
    cmov(t) x <- t2
    cmov(t) y <- t3
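The equivalence between the branching form and the conditional-move form can be checked with a short Python sketch. It is an illustration under stated assumptions: the function names are made up, integer division stands in for div, and the predicate is taken to hold when t is zero (a == b). Note that the cmov form computes n / d unconditionally, so the sketch assumes d is nonzero; real if-conversion has to deal with such speculative exceptions.

```python
def branch_version(a, b, x, y, n, d):
    """Straightforward form: the updates run only when a == b."""
    if a == b:
        x = x + 1
        y = n // d
    return x, y


def cmov_version(a, b, x, y, n, d):
    """If-converted form: compute both results, then select."""
    t = a - b                 # sub t <- a, b
    t2 = x + 1                # add t2 <- x, #1 (always executed)
    t3 = n // d               # div t3 <- n, d  (always executed; assumes d != 0)
    x = t2 if t == 0 else x   # cmov(t) x <- t2
    y = t3 if t == 0 else y   # cmov(t) y <- t3
    return x, y


for a, b in [(1, 1), (1, 2)]:
    assert branch_version(a, b, 5, 0, 9, 2) == cmov_version(a, b, 5, 0, 9, 2)
print("branch and cmov forms agree on the sample inputs")
```

The control dependence on the branch has become a data dependence on t, which is the point of the transformation.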

24 Handling Control Hazards: Detect & Stall
  Detection: in decode, check whether the opcode is a branch or jump.
  Stall: hold the next instruction in Fetch; pass a noop to Decode.

25 Problems with Detect & Stall
  CPI increases on every branch.
  Are these stalls necessary? Not always! The branch is only taken half the time.
  Assume the branch is NOT taken: keep fetching, treat the branch as a noop.
  If wrong, make sure the bad instructions don't complete.
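A rough back-of-the-envelope comparison makes the CPI argument concrete. The sketch below is only a model under stated assumptions: the 20% branch frequency and the 3-cycle penalty are hypothetical numbers chosen for illustration, while the 50% taken rate is the figure quoted on the slide. The function names are likewise made up.

```python
def cpi_detect_and_stall(branch_frac, stall_cycles, base_cpi=1.0):
    """Every branch pays the full stall penalty."""
    return base_cpi + branch_frac * stall_cycles


def cpi_assume_not_taken(branch_frac, taken_frac, squash_cycles, base_cpi=1.0):
    """Only taken branches pay: the wrongly fetched instructions are squashed."""
    return base_cpi + branch_frac * taken_frac * squash_cycles


BRANCH_FRAC = 0.20   # assumption: one instruction in five is a branch
PENALTY = 3          # assumption: cycles lost per stalled or squashed branch
TAKEN_FRAC = 0.50    # from the slide: branch is taken half the time

stall = cpi_detect_and_stall(BRANCH_FRAC, PENALTY)
spec = cpi_assume_not_taken(BRANCH_FRAC, TAKEN_FRAC, PENALTY)
print(f"detect and stall:  CPI ~ {stall:.2f}")
print(f"assume not taken:  CPI ~ {spec:.2f}")
```

Under these numbers, assuming not-taken halves the branch overhead because only the taken half of the branches costs anything.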

26 Handling Control Hazards: Speculate & Squash
  Speculate: assume the branch is not taken.
  Squash: overwrite the opcodes in Fetch, Decode, and Execute with noop; pass the target to Fetch.

27 [Figure: pipelined datapath (PC, instruction memory, control, register file, sign extend, equality check, ALU, data memory; pipeline registers IF/ID, ID/EX, EX/Mem, Mem/WB) for the sequence beq, sub, add, nand. Noops overwrite the instructions in the pipeline registers behind the beq.]

28 Problems with Speculate & Squash
  Always assumes the branch is not taken.
  Can we do better? Yes. Predict branch direction and target!
  Why is this possible? Program behavior repeats.
  More on branch prediction to come...

29 Branch Delay Slot (MIPS, SPARC)
[Timing diagram, t0 to t5. Without a delay slot: branch passes through F D E M W; next is fetched, then squashed; target passes through F D E M W. With a delay slot: branch, delay, and target each pass through F D E M W.]
The instruction in the delay slot executes even on a taken branch.
  i: beq 1, 2, tgt
  j: add 3, 4, 5    What can we put here?

30 Pipeline Hazard Checklist
  Memory Data Dependences
    Output Dependence (WAW)
    Anti Dependence (WAR)
    True Data Dependence (RAW)
  Register Data Dependences
    Output Dependence (WAW)
    Anti Dependence (WAR)
    True Data Dependence (RAW)
  Control Dependences
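The three data-dependence classes in the checklist can be told apart by comparing the read and write sets of two instructions in program order. The following Python sketch is an illustration only; `Inst` and `dependences` are hypothetical names, and the pairwise check ignores intervening writes that might kill a dependence. Memory locations can be handled the same way by using addresses as names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inst:
    name: str
    writes: frozenset   # registers (or memory locations) written
    reads: frozenset    # registers (or memory locations) read


def dependences(first: Inst, second: Inst) -> set:
    """Classify dependences of `second` on the earlier instruction `first`."""
    found = set()
    if first.writes & second.reads:
        found.add("RAW")   # true dependence: read after write
    if first.reads & second.writes:
        found.add("WAR")   # anti dependence: write after read
    if first.writes & second.writes:
        found.add("WAW")   # output dependence: write after write
    return found


# Pairs taken from the example sequence on slide 7.
add1 = Inst("add", frozenset({"r3"}), frozenset({"r1", "r2"}))
nand = Inst("nand", frozenset({"r5"}), frozenset({"r3", "r4"}))
add2 = Inst("add", frozenset({"r7"}), frozenset({"r3", "r6"}))
lw = Inst("lw", frozenset({"r6"}), frozenset({"r3"}))

print(dependences(add1, nand))  # nand reads r3 written by add
print(dependences(add2, lw))    # lw overwrites r6 read by add
```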

31 Instruction Dependences and Pipeline Hazards (H & P A.7-A.8, 2.)
  Sequential Code Semantics
    i1: xxxx
    i2: xxxx
    i3: xxxx
  The implied sequential precedences are overspecifications: they are sufficient but not necessary to ensure program correctness.
  A true dependence between two instructions may only involve one substep of each instruction.

32 Limitations of Scalar Pipelines
  Upper Bound on Scalar Pipeline Throughput: limited by IPC = 1 (the "Flynn Bottleneck").
  Inefficient Unification into a Single Pipeline: long latency for each instruction.
  Performance Lost Due to Rigid In-order Pipeline: unnecessary stalls.

33 Architectures for Instruction-Level Parallelism

34 Superscalar Machine

35 What is the real problem?
  CPI of in-order pipelines degrades very sharply if the machine parallelism is increased beyond a certain point, i.e., when NxM approaches the average distance between dependent instructions.
  Forwarding is no longer effective.
  The pipeline may never be full due to frequent dependency stalls!

36 ILP: Instruction-Level Parallelism
  ILP is a measure of the amount of inter-dependencies between instructions.
  Average ILP = no. of instructions / no. of cycles required
  code1: ILP = 1, i.e. must execute serially
    r1 <- r2 + 1
    r3 <- r1 / 17
    r4 <- r0 - r3
  code2: ILP = 3, i.e. can execute at the same time
    r1 <- r2 + 1
    r3 <- r9 / 17
    r4 <- r0 - r10
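The average-ILP figures for the two code fragments can be reproduced with a small dataflow calculation. This is a sketch under idealized assumptions that are not stated on the slide: every instruction takes one cycle, resources are unlimited, and only true (RAW) register dependences constrain the schedule. The function name `average_ilp` and the tuple encoding of instructions are made up for the example.

```python
def average_ilp(program):
    """program: list of (dest, [sources]) in program order.
    Each instruction issues one cycle after its latest producer."""
    if not program:
        return 0.0
    ready = {}      # register -> cycle in which its newest value is produced
    last_cycle = 0
    for dest, srcs in program:
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = cycle
        last_cycle = max(last_cycle, cycle)
    return len(program) / last_cycle


code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]

print(average_ilp(code1))  # chain of three dependent instructions
print(average_ilp(code2))  # three independent instructions
```

In code1 each instruction waits for the one before it, so three instructions need three cycles; in code2 nothing depends on anything else, so all three fit in one cycle.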

37 Purported Limits on ILP
  Weiss and Smith [1984]        1.58
  Sohi and Vajapeyam [1987]     1.81
  Tjaden and Flynn [1970]       --
  Tjaden and Flynn [1973]       1.96
  Uht [1986]                    2.00
  Smith et al. [1989]           2.00
  Jouppi and Wall [1988]        2.40
  Johnson [1991]                2.50
  Acosta et al. [1986]          2.79
  Wedig [1982]                  3.00
  Butler et al. [1991]          5.8
  Melvin and Patt [1991]        6
  Wall [1991]                   7
  Kuck et al. [1972]            8
  Riseman and Foster [1972]     51
  Nicolau and Fisher [1984]     90

38 Scope of ILP Analysis
  First three instructions alone: ILP = 1
    r1 <- r2 + 1
    r3 <- r1 / 17
    r4 <- r0 - r3
  All six instructions together: ILP = 2
    r11 <- r12 + 1
    r13 <- r19 / 17
    r14 <- r0 - r20
  Out-of-order execution permits more ILP to be exploited.

39 How Large Must the Window Be?

40 Outline: Understanding the Execution Core
  1. 370's 5-stage pipeline (review)
  2. Implementing pipeline interlocks (review)
  3. Scoreboard scheduling (CDC 6600)
  4. Tomasulo's OoO scheduling algorithm (IBM 360)
  5. Precise interrupts with a Reorder Buffer (P6)
  6. Modern OoO (MIPS R10K, Netburst)
