DAT105: Computer Architecture Study Period 2, 2009 Exercise 2 Chapter 2: Instruction-Level Parallelism and Its Exploitation


Mafijul Islam, Department of Computer Science and Engineering. November 12, 2009.

Goals: To understand
- the notion of instruction-level parallelism (ILP)
- the notion of dependences and hazards, as well as their impact on exploiting ILP
- the techniques of exploiting ILP

Case Studies/Assignments:
- Case Study 1: 2.1, 2.2, 2.3, 2.5
- Assignment 2 of the Exam on

Exam: Assignment 2(A)

The following MIPS program operates on an array with 64-bit elements. Register R1 points to the beginning of the array and register R2 points to the end. The array always contains 1 elements.

          ANDI  R3, R3, 0      # R3 = 0
    LOOP: LD    R4, 0(R1)
          DMUL  R5, R4, R4
          DADD  R5, R3, R5
          SD    R5, 0(R1)      # new[i] = old[i-1] + old[i]*old[i]
          DADDI R3, R4, 0
          DADDI R1, R1, 8
          BNE   R1, R2, LOOP

Find at least one example of each type of dependence.

Exam: Assignment 2(A)

Finding various categories of dependences in the program above:

- Data dependence: DMUL reads R4, which is written by the preceding LD.
- Name dependence: Both DMUL and DADD write to R5, but the second R5 can be renamed.
- Control dependence: The LOOP body depends on BNE for all iterations except the first one.

How to resolve dependences?
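
The remark that the second R5 can be renamed can be illustrated with a small sketch. It is not the mechanism of any particular processor: the tuple encoding of the loop and the temporary names T0, T1, ... are assumptions made for this example only. Every destination gets a fresh name, and later reads are redirected to the newest name, so no two instructions write the same register any more.

```python
# Hypothetical encoding of the loop body: (mnemonic, destination or None, source registers).
LOOP_BODY = [
    ("LD",    "R4", ["R1"]),
    ("DMUL",  "R5", ["R4", "R4"]),
    ("DADD",  "R5", ["R3", "R5"]),
    ("SD",    None, ["R5", "R1"]),
    ("DADDI", "R3", ["R4"]),
    ("DADDI", "R1", ["R1"]),
    ("BNE",   None, ["R1", "R2"]),
]


def rename(program):
    """Give every destination a fresh temporary name (T0, T1, ...)."""
    newest = {}      # architectural register -> most recent temporary name
    renamed = []
    counter = 0
    for op, dst, srcs in program:
        # Sources read the newest name of each register (or the original one
        # if the register has not been written inside this code fragment).
        new_srcs = [newest.get(reg, reg) for reg in srcs]
        new_dst = None
        if dst is not None:
            new_dst = f"T{counter}"
            counter += 1
            newest[dst] = new_dst
        renamed.append((op, new_dst, new_srcs))
    return renamed


renamed_body = rename(LOOP_BODY)
for op, dst, srcs in renamed_body:
    print(op, dst, srcs)
```

After renaming, DMUL writes T1 and DADD writes T2, so the output (name) dependence on R5 disappears while the true dependence from DMUL to DADD is kept through T1.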

Exam: Assignment 2(B)

Finding various categories of hazards in the same program:

- Dependences are properties of programs.
- Hazards are properties of the pipeline organization.

RAW: DMUL and LD on register R4
WAW: DMUL and DADD on register R5
WAR: SD and DADDI on register R1
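
The three examples above can be checked mechanically. The following is a minimal sketch, assuming a hand-written encoding of the program (the tuple format and the function name `find_dependences` are made up for this example). It scans one pass through the code in program order and ignores dependences carried from one loop iteration to the next.

```python
# Hypothetical encoding: (mnemonic, destination or None, source registers).
PROGRAM = [
    ("ANDI",  "R3", ["R3"]),        # index 0
    ("LD",    "R4", ["R1"]),        # index 1
    ("DMUL",  "R5", ["R4", "R4"]),  # index 2
    ("DADD",  "R5", ["R3", "R5"]),  # index 3
    ("SD",    None, ["R5", "R1"]),  # index 4
    ("DADDI", "R3", ["R4"]),        # index 5
    ("DADDI", "R1", ["R1"]),        # index 6
    ("BNE",   None, ["R1", "R2"]),  # index 7
]


def find_dependences(program):
    """Return (kind, earlier_index, later_index, register) tuples."""
    deps = []
    for i, (_, dst, srcs) in enumerate(program):
        if dst is not None:
            # Follow the value written by instruction i until it is overwritten.
            for j in range(i + 1, len(program)):
                _, dst_j, srcs_j = program[j]
                if dst in srcs_j:
                    deps.append(("RAW", i, j, dst))
                if dst_j == dst:
                    deps.append(("WAW", i, j, dst))
                    break
        # For each register read by instruction i, find the next writer.
        for reg in dict.fromkeys(srcs):
            for j in range(i + 1, len(program)):
                if program[j][1] == reg:
                    deps.append(("WAR", i, j, reg))
                    break
    return deps


deps = find_dependences(PROGRAM)
for kind, i, j, reg in deps:
    print(f"{kind}: {PROGRAM[i][0]} (#{i}) -> {PROGRAM[j][0]} (#{j}) on {reg}")
```

Whether a reported pair actually becomes a hazard still depends on the pipeline organization, which this sketch does not model.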

Case Study 1: 2.1

Assumptions:
- No new instruction execution can be initiated until the previous instruction has completed.
- Ignore front-end fetch and decode.
- Execution does not stall for lack of the next instruction, but only 1 instruction/cycle can be issued.
- The branch is taken and there is a 1-cycle branch delay slot.

Code sequence:

    Loop: LD    F2, 0(Rx)
    I0:   MULTD F2, F0, F2
    I1:   DIVD  F8, F2, F0
    I2:   LD    F4, 0(Ry)
    I3:   ADDD  F4, F0, F4

Latencies beyond single cycle:

    Memory LD          +3
    Memory SD          +1
    Integer ADD, SUB   +0
    Branches           +1
    ADDD               +2
    MULTD              +4
    DIVD               +10

Find the baseline performance (in cycles, per loop iteration).

Case Study 1: 2.1

Find the baseline performance (in cycles, per loop iteration):

Cycles per loop iteration: 37

Can we improve the performance?
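
The baseline figure can be reproduced with a short calculation: when nothing overlaps, every instruction costs one issue cycle plus its extra latency. The sketch below makes one assumption that should be stated loudly: the slides list only Loop through I3, so the remaining instructions (I4 to I9) are assumed here to be ADDD, SD, ADDI, ADDI, SUB and BNZ, following the textbook loop that the case study refers to. The later slides mention I5 (SD), I6 (ADDI) and I9 (BNZ), which is consistent with that assumption.

```python
# Extra latency beyond the single issue cycle, as given in the latency table.
EXTRA_LATENCY = {
    "LD": 3, "SD": 1, "ADDI": 0, "SUB": 0,
    "BNZ": 1, "ADDD": 2, "MULTD": 4, "DIVD": 10,
}

# Loop..I3 are taken from the slide; I4..I9 are ASSUMED (see the text above).
LOOP_OPS = [
    "LD", "MULTD", "DIVD", "LD", "ADDD",       # Loop, I0, I1, I2, I3
    "ADDD", "SD", "ADDI", "ADDI", "SUB", "BNZ"  # I4..I9 (assumed)
]


def baseline_cycles(ops, extra=EXTRA_LATENCY):
    """Fully serialized execution: 1 issue cycle plus extra latency per instruction."""
    return sum(1 + extra[op] for op in ops)


print(baseline_cycles(LOOP_OPS))
```

Under these assumptions the sum is 4 + 5 + 11 + 4 + 3 + 3 + 2 + 1 + 1 + 1 + 2 = 37, which agrees with the cycle count on the slide.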

Case Study 1: 2.2

Pipeline stalled only on true data dependences, for the same code sequence and latencies as in 2.1.

Number of stalls removed:

    1   I2 (LD) issued
    3   stalls of I2 (LD)
    1   I3 (ADDD) issued
    2   stalls of I3 (overlap with I1)
    1   I5 (SD) issued
    1   stall of I4 (overlaps with I5)
    1   I6 (ADDI) issued

    TOTAL: 10 stalls removed

Cycles per loop iteration: 27

Can we improve the performance?
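
The 27-cycle result can be cross-checked with a small in-order issue model. This is a sketch under stated assumptions, not the course's reference solution: an instruction issues one cycle after its predecessor at the earliest and waits only until its source registers are ready, and a result becomes readable 1 + extra latency cycles after issue. As before, instructions I4 to I9 and their operands are assumed from the textbook loop, because the slides list only Loop through I3.

```python
EXTRA_LATENCY = {
    "LD": 3, "SD": 1, "ADDI": 0, "SUB": 0,
    "BNZ": 1, "ADDD": 2, "MULTD": 4, "DIVD": 10,
}

# (label, opcode, destination or None, sources). I4..I9 are ASSUMED.
LOOP = [
    ("Loop", "LD",    "F2",  ["Rx"]),
    ("I0",   "MULTD", "F2",  ["F0", "F2"]),
    ("I1",   "DIVD",  "F8",  ["F2", "F0"]),
    ("I2",   "LD",    "F4",  ["Ry"]),
    ("I3",   "ADDD",  "F4",  ["F0", "F4"]),
    ("I4",   "ADDD",  "F10", ["F8", "F2"]),
    ("I5",   "SD",    None,  ["F4", "Ry"]),
    ("I6",   "ADDI",  "Rx",  ["Rx"]),
    ("I7",   "ADDI",  "Ry",  ["Ry"]),
    ("I8",   "SUB",   "R20", ["R4", "Rx"]),
    ("I9",   "BNZ",   None,  ["R20"]),
]


def simulate_in_order(loop, extra=EXTRA_LATENCY):
    """Single-issue, in-order model that stalls only on true (RAW) dependences."""
    ready = {}          # register -> first cycle in which its new value can be read
    cycle = 0
    issue = {}
    finish = 0
    for label, op, dst, srcs in loop:
        # Earliest issue: next cycle, or later if a source is not ready yet.
        cycle = max([cycle + 1] + [ready.get(reg, 1) for reg in srcs])
        issue[label] = cycle
        if dst is not None:
            ready[dst] = cycle + 1 + extra[op]
        finish = max(finish, cycle + extra[op])
    return issue, finish


issue_cycles, total_cycles = simulate_in_order(LOOP)
baseline = sum(1 + EXTRA_LATENCY[op] for _, op, _, _ in LOOP)
print(issue_cycles)
print("cycles per iteration:", total_cycles)
print("stalls removed vs. baseline:", baseline - total_cycles)
```

In this model I2 and I3 issue while the DIVD (I1) is still executing, which is where most of the removed stalls come from.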

Case Study 1: 2.3

Multiple-issue design:

    Execution Pipeline 1           Execution Pipeline 2
    Loop: LD    F2, 0(Rx)          nop
    <3 stalls: LD>                 nops
    I0:   MULTD F2, F0, F2         nop
    <4 stalls: MULTD(I0)>          nops
    I1:   DIVD  F8, F2, F0         I2: LD F4, 0(Ry)
    <3 stalls: LD(I2)>             nops
    I3:   ADDD  F4, F0, F4         nop
    <6 stalls: DIVD(I1)>           nops
    <1 stall: BNZ(I9)>             nop

Cycles per loop iteration: 24

Can we improve the performance?

Case Study 1: 2.5

Out-of-order issue and execution:

    Cycle #   Execution Pipeline 1        Execution Pipeline 2
    1         Loop: LD    F2, 0(Rx)       I2: LD   F4, 0(Ry)
              <3 stalls: LD>              <3 stalls: LD(I2)>
    5         I0:   MULTD F2, F0, F2      I3: ADDD F4, F0, F4
              <4 stalls: MULTD(I0)>       <2 stalls: ADDD(I3)>
    10        I1:   DIVD  F8, F2, F0      <1 stall: SD(I5)>
              <8 stalls: DIVD(I1)>
    21
              <1 stall: BNZ(I9)>

Cycles per loop iteration: 22

Can we improve the performance?

Case Study 1: Summary

    Task   Processor Model/Technique                                  Performance (in cycles)
    2.1    single-issue, respect all dependences, no execution        37
           until the previous instruction execution is completed
    2.2    single-issue, respect only true data dependences           27
    2.3    multiple-issue, in-order issue                             24
    2.5    multiple-issue, out-of-order issue                         22
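
To compare the four designs, the cycle counts reported on the slides for tasks 2.1, 2.2, 2.3 and 2.5 can be turned into speedups over the baseline. The snippet below is plain arithmetic on those four numbers; the dictionary and function names are chosen for this example.

```python
# Cycles per loop iteration as reported on the slides.
CYCLES = {"2.1": 37, "2.2": 27, "2.3": 24, "2.5": 22}


def speedups(cycles, baseline_task="2.1"):
    """Speedup of each design relative to the baseline task."""
    base = cycles[baseline_task]
    return {task: base / count for task, count in cycles.items()}


for task, factor in speedups(CYCLES).items():
    print(f"Task {task}: {CYCLES[task]} cycles, speedup {factor:.2f}x")
```

Relative to the fully serialized baseline, stalling only on true data dependences gives about 1.37x, in-order multiple issue about 1.54x, and out-of-order issue about 1.68x.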
