CIS 662: Sample midterm w solutions


1. (40 points) A processor has the following stages in its pipeline: IF ID ALU1 MEM1 MEM2 ALU2 WB. The ALU1 stage is used for effective address calculation for loads, stores and branches. The ALU2 stage is used for all other calculations and for branch resolution. The only instructions that can access memory are load and store. The only supported addressing mode is displacement addressing. Because we have a slow memory unit, access to memory is pipelined through two stages, MEM1 and MEM2.

a) (10 points) Find all dependencies in the following code segment and list them by category (data dependence, output dependence, antidependence or control dependence).

LD R1, 50(R2)
ADD R3, R1, R4
LD R5, 100(R3)
MUL R6, R5, R7
STORE R6, 50(R2)
ADD R1, R1, #100
SUB R2, R2, #8

Let's number the instructions:

1. LD R1, 50(R2)
2. ADD R3, R1, R4
3. LD R5, 100(R3)
4. MUL R6, R5, R7
5. STORE R6, 50(R2)
6. ADD R1, R1, #100
7. SUB R2, R2, #8

Data dependencies:
Instruction 2 depends on instruction 1 for value R1
Instruction 6 depends on instruction 1 for value R1
Instruction 3 depends on instruction 2 for value R3
Instruction 4 depends on instruction 3 for value R5
Instruction 5 depends on instruction 4 for value R6

Antidependencies:
Instruction 6 is antidependent on instruction 2 for access to R1
Instruction 7 is antidependent on instruction 1 for access to R2
Instruction 7 is antidependent on instruction 5 for access to R2

Output dependencies:
Instruction 6 has output dependence with instruction 1 for access to R1
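The register dependences listed above (data, anti and output) can be cross-checked mechanically. The following is a minimal Python sketch, not part of the original solution: the tuple encoding of each instruction (text, destination register, source registers) and the function name `find_dependences` are assumptions made for illustration.

```python
# Each instruction: (text, destination register or None, source registers).
# STORE writes memory, not a register, so its destination is None.
PROGRAM = [
    ("LD R1, 50(R2)",    "R1", ["R2"]),
    ("ADD R3, R1, R4",   "R3", ["R1", "R4"]),
    ("LD R5, 100(R3)",   "R5", ["R3"]),
    ("MUL R6, R5, R7",   "R6", ["R5", "R7"]),
    ("STORE R6, 50(R2)", None, ["R6", "R2"]),
    ("ADD R1, R1, #100", "R1", ["R1"]),
    ("SUB R2, R2, #8",   "R2", ["R2"]),
]


def find_dependences(program):
    """Return (raw, war, waw) sets of (later, earlier, register), 1-based."""
    raw, war, waw = set(), set(), set()
    last_writer = {}  # register -> number of the most recent writer
    readers = {}      # register -> readers since the most recent write
    for i, (_, dest, srcs) in enumerate(program, start=1):
        # Data (RAW) dependence: a source was written by an earlier instruction.
        for reg in srcs:
            if reg in last_writer:
                raw.add((i, last_writer[reg], reg))
            readers.setdefault(reg, []).append(i)
        if dest is not None:
            # Antidependence (WAR): an earlier instruction still reads dest.
            for r in readers.get(dest, []):
                if r != i:
                    war.add((i, r, dest))
            # Output dependence (WAW): an earlier instruction also wrote dest.
            if dest in last_writer:
                waw.add((i, last_writer[dest], dest))
            last_writer[dest] = i
            readers[dest] = []
    return raw, war, waw


raw, war, waw = find_dependences(PROGRAM)
print("data:  ", sorted(raw))
print("anti:  ", sorted(war))
print("output:", sorted(waw))
```

Each triple reads "instruction `later` depends on instruction `earlier` through `register`", so `(6, 2, 'R1')` in the anti set corresponds to "instruction 6 is antidependent on instruction 2 for access to R1".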

Control dependencies: None

b) (10 points) Assume that there is no forwarding. How many cycles does it take to execute the above code segment? Indicate the total number of stall cycles.

(A1 = ALU1, M1 = MEM1, M2 = MEM2, A2 = ALU2, s = stall, . = not yet fetched)

Cycle             1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
LD R1, 50(R2)     IF ID A1 M1 M2 A2 WB
ADD R3, R1, R4    .  IF s  s  s  s  ID A1 M1 M2 A2 WB
LD R5, 100(R3)    .  .  .  .  .  .  IF s  s  s  s  ID A1 M1 M2 A2 WB
MUL R6, R5, R7    .  .  .  .  .  .  .  .  .  .  .  IF s  s  s  s  ID A1 M1 M2 A2 WB
STORE R6, 50(R2)  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  IF s  s  s  s  ID A1 M1 M2 A2 WB
ADD R1, R1, #100  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  IF ID A1 M1 M2 A2 WB
SUB R2, R2, #8    .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  IF ID A1 M1 M2 A2 WB

LD R1, 50(R2): IF at time 1, WB at time 7
ADD R3, R1, R4: IF at time 2, stalls 3-6, ID at time 7, WB at time 12
LD R5, 100(R3): IF at time 7, stalls 8-11, ID at time 12, WB at time 17
MUL R6, R5, R7: IF at time 12, stalls 13-16, ID at time 17, WB at time 22
STORE R6, 50(R2): IF at time 17, stalls 18-21, ID at time 22, WB at time 27
ADD R1, R1, #100: IF at time 22, WB at time 28
SUB R2, R2, #8: IF at time 23, WB at time 29

It takes 29 cycles. We stall for 16 cycles.

c) (10 points) Now apply forwarding to reduce the number of stalls wherever possible. Indicate the source and destination stages for forwarding. How many cycles does it take now to execute the above code segment, and how many stalls do we have?

Stars mark the stage where a result becomes ready (trailing star) and the stage where it is needed (leading star).

(A1 = ALU1, M1 = MEM1, M2 = MEM2, A2 = ALU2, s = stall, . = not yet fetched)

Cycle             1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18
LD R1, 50(R2)     IF   ID   A1   M1   M2*  A2   WB
ADD R3, R1, R4    .    IF   ID   A1   M1   M2   *A2* WB
LD R5, 100(R3)    .    .    IF   ID   s    s    s    *A1  M1   M2*  A2   WB
MUL R6, R5, R7    .    .    .    IF   s    s    s    ID   A1   M1   M2   *A2* WB
STORE R6, 50(R2)  .    .    .    .    .    .    .    IF   ID   A1   s    s    *M1  M2   A2   WB
ADD R1, R1, #100  .    .    .    .    .    .    .    .    IF   ID   s    s    A1   M1   M2   A2   WB
SUB R2, R2, #8    .    .    .    .    .    .    .    .    .    IF   s    s    ID   A1   M1   M2   A2   WB

LD R1, 50(R2): R1 available at the end of MEM2 at time 5
ADD R3, R1, R4: R1 needed at the beginning of ALU2 at time 7
We can forward from MEM2 to MEM2 or from ALU2 to ALU2.

ADD R3, R1, R4: R3 available at the end of ALU2 at time 7
LD R5, 100(R3): R3 needed at the beginning of ALU1 at time 5
We need 3 stalls and then we forward from ALU2 to ALU1.

LD R5, 100(R3): R5 available at the end of MEM2 at time 10
MUL R6, R5, R7: R5 needed at the beginning of ALU2 at time 12
We can forward from MEM2 to MEM2 or from ALU2 to ALU2.

MUL R6, R5, R7: R6 available at the end of ALU2 at time 12
STORE R6, 50(R2): R6 needed at the beginning of MEM1 at time 11
We need 2 stalls and then we forward from ALU2 to MEM1.

It takes 18 cycles. We have 5 stalls.

d) (10 points) Can you rearrange the code, just by shuffling instructions and adjusting displacements, so that it takes fewer cycles? How many cycles does it take now to execute the code segment, and how many stalls are left?

We can move the last two instructions before the STORE to eliminate two stall cycles. Then we adjust the offset in the STORE (R2 has already been decremented by 8, so 50(R2) becomes 58(R2)).

(A1 = ALU1, M1 = MEM1, M2 = MEM2, A2 = ALU2, s = stall, . = not yet fetched; a trailing star marks a result becoming ready, a leading star marks where it is needed)

Cycle             1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16
LD R1, 50(R2)     IF   ID   A1   M1   M2*  A2   WB
ADD R3, R1, R4    .    IF   ID   A1   M1   M2   *A2* WB
LD R5, 100(R3)    .    .    IF   ID   s    s    s    *A1  M1   M2*  A2   WB
MUL R6, R5, R7    .    .    .    IF   s    s    s    ID   A1   M1   M2   *A2* WB
ADD R1, R1, #100  .    .    .    .    .    .    .    IF   ID   A1   M1   M2   A2   WB
SUB R2, R2, #8    .    .    .    .    .    .    .    .    IF   ID   A1   M1   M2   A2   WB
STORE R6, 58(R2)  .    .    .    .    .    .    .    .    .    IF   ID   A1   *M1  M2   A2   WB

It takes 16 cycles. We have 3 stalls left.

2. (33 points) For the processor from question 1:

a) (4 points) How large is the branch penalty?

Since branches are resolved in the ALU2 stage, the IF of the next instruction can start only after that stage, in cycle 7 if the branch is fetched in cycle 1. Ideally it would start right after the branch's IF stage, in cycle 2. So the penalty is 5 cycles.

b) (4 points) Assume that we can introduce an optimization so that branches are resolved in the ALU1 stage (along with effective address calculation). How large is the branch penalty now?

2 cycles

c) (10 points) The optimization we introduced can be used in 75% of branches. Branches represent 30% of all instructions in our usual workload. Ideal CPI is 1. What is the average CPI?

CPI = 1 + 0.3*branch_penalty = 1 + 0.3*(0.75*2 + 0.25*5) = 1 + 0.3*2.75 = 1.825
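The weighted-penalty calculation in part c) can be checked with a few lines of Python. This is only a sketch of the arithmetic: the helper name `average_cpi` and the list-of-pairs encoding are assumptions made for illustration, while the fractions and penalties come from the question.

```python
def average_cpi(ideal_cpi, branch_fraction, cases):
    """Average CPI given (fraction_of_branches, penalty_cycles) pairs."""
    avg_penalty = sum(fraction * penalty for fraction, penalty in cases)
    return ideal_cpi + branch_fraction * avg_penalty


# 75% of branches resolve in ALU1 (2-cycle penalty),
# the remaining 25% resolve in ALU2 (5-cycle penalty).
cpi = average_cpi(1.0, 0.30, [(0.75, 2), (0.25, 5)])
print(round(cpi, 3))  # 1.825
```

The average branch penalty is 0.75*2 + 0.25*5 = 2.75 cycles, and only 30% of instructions pay it, which adds 0.825 to the ideal CPI of 1.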

d) (15 points) Assume that we are choosing between the flush pipeline, predict taken and predict not taken strategies for handling branches. On average, 80% of branches are conditional branches and 60% of conditional branches are taken. What is the average CPI for each branch handling approach, and which approach is the best?

CPI = 1 + 0.3*branch_penalty = 1 + 0.3*(%conditional * (%taken * penalty_taken + %not_taken * penalty_not_taken) + %jumps * penalty_jumps)

For jumps, the penalty is 2 cycles for each approach.
For predict not taken, the penalty is 0 cycles for not taken branches and 5 cycles for taken branches.
For predict taken, the penalty is 2 cycles for taken branches and 5 cycles for not taken branches.
For flush pipeline, the penalty is 5 cycles.

CPI(flush pipeline) = 1 + 0.3*(0.8*(0.6*5 + 0.4*5) + 0.2*2) = 2.32
CPI(predict taken) = 1 + 0.3*(0.8*(0.6*2 + 0.4*5) + 0.2*2) = 1.888
CPI(predict not taken) = 1 + 0.3*(0.8*(0.6*5 + 0.4*0) + 0.2*2) = 1.84 (this is the best)

3. (20 points) Assume the following MIPS code:

      DADD R1, R0, R0
Loop: BNEZ R1, If2
      DADDI R1, R0, #2
If2:  DADDI R1, R0, #-1
      J Loop
Done:

a) (10 points) Use a one-bit predictor to predict outcomes of this branch. How many misses do you have? Assume that the initial prediction is not-taken.

First time: prediction NT, outcome NT, new prediction NT, R1 becomes 1
Second time: prediction NT, outcome T, new prediction T, R1 becomes 0 (miss)
Third time: prediction T, outcome NT, new prediction NT, R1 becomes 1 (miss)
Fourth time: prediction NT, outcome T, new prediction T, R1 becomes 0 (miss)

Every time but the first one we have a miss.
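The one-bit trace above can be replayed in a short Python sketch. It takes the alternating outcome sequence (NT, T, NT, T, ...) from the trace as given rather than deriving it from the assembly; the function name `one_bit_misses` and the choice of eight iterations are assumptions made for illustration.

```python
def one_bit_misses(outcomes, initial_prediction=False):
    """Count mispredictions of a one-bit predictor (True means taken).

    A one-bit predictor always predicts the most recent outcome.
    """
    prediction = initial_prediction
    misses = 0
    for taken in outcomes:
        if taken != prediction:
            misses += 1
        prediction = taken  # remember the last outcome
    return misses


# Alternating outcomes as in the trace: NT, T, NT, T, ...
outcomes = [False, True] * 4
print(one_bit_misses(outcomes))  # 7
```

With eight executions of the branch the predictor misses seven times: every execution except the first.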

b) (10 points) Use a two-bit predictor to predict outcomes of this branch. How many misses do you have? Assume that the initial prediction is 00.

First time: prediction 00, outcome NT, new prediction 00, R1 becomes 1
Second time: prediction 00, outcome T, new prediction 01, R1 becomes 0 (miss)
Third time: prediction 01, outcome NT, new prediction 00, R1 becomes 1
Fourth time: prediction 00, outcome T, new prediction 01, R1 becomes 0 (miss)

Every second time we have a miss.
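The two-bit trace can be replayed the same way. The sketch below assumes the common two-bit saturating counter, where states 00 and 01 predict not-taken and states 10 and 11 predict taken; this matches the state transitions in the trace (00 to 01 on a taken branch, 01 back to 00 on a not-taken one), but the exam does not name the scheme explicitly. The function name `two_bit_misses` is likewise an assumption.

```python
def two_bit_misses(outcomes, state=0):
    """Count mispredictions of a 2-bit saturating counter (True means taken).

    States 0 and 1 predict not-taken; states 2 and 3 predict taken.
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move toward 3 on taken, toward 0 on not-taken, saturating at both ends.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


# Same alternating outcomes as in the trace: NT, T, NT, T, ...
print(two_bit_misses([False, True] * 4))  # 4
```

For this alternating pattern the counter never leaves states 00 and 01, so it keeps predicting not-taken and misses only on the taken executions: four misses in eight, against seven for the one-bit predictor.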