Storage and Memory Hierarchy CS165


What is the memory hierarchy?

L1 <1ns | L2 ~3ns | L3 ~10ns | Main memory (DRAM) ~100ns | Flash ~100μs | HDD / Shingled HDD ~2ms. Toward the top: faster, smaller, more expensive. Toward the bottom: bigger, cheaper, slower.

Why have such a hierarchy?

Which one grows faster, processor speed or memory speed? As the gap between them grows, we need a deeper memory hierarchy.

L1 <1ns | L2 ~3ns | L3 ~10ns (block size, i.e., cacheline: 64B) | Main memory ~100ns (page size: ~4KB) | Flash ~100μs | HDD / Shingled HDD ~2ms. Faster, smaller, more expensive at the top; bigger, cheaper, slower at the bottom.

IO cost: scanning a relation to select 10%, with a 5-page buffer. The 20-page relation is loaded from the HDD into the buffer 5 pages at a time, and after each load the pages are sent for consumption. The IO count grows 5, 10, 15, 20: a full scan costs 20 IOs, one per page, no matter how few tuples qualify.

What if we had an oracle (index)?

IO cost: using an index to select 10%, with a 5-page buffer. First load the index from the HDD (IO#: 1), then load only the useful pages (here 2 of them). Total: 3 IOs instead of 20.

What if useful data is in all pages?

Scan or index? With useful data on all 20 pages, the index must still touch every data page: 20 IOs with the scan vs. 21 with the index (the index page plus all 20 data pages). The scan wins.


Cache Hierarchy. What is a core? What is a socket? [Diagram: L1, L2, L3]

Cache Hierarchy. Shared cache: the L3 (or LLC: Last-Level Cache). With multiple sockets, the L3 is physically distributed across them, one per socket; the L2 is physically distributed across the cores, one in every core of every socket. Each core has its own private L1 & L2 cache. All levels need to be kept coherent. [Diagram: cores 0-3, each with a private L1 and L2, sharing one L3.]

Non-Uniform Memory Access (NUMA). Core 0 reads fastest when the data is in its L1. If the data does not fit, the access goes to L2, and then to L3. Can we control where data is placed? We would like to avoid going to L2 and L3 altogether, but at the very least we want to avoid going to a remote L2 and L3. And remember: this is only one socket; we have multiple of those! [Diagram: one socket with cores 0-3, private L1/L2 per core, shared L3.]

Non-Uniform Memory Access (NUMA). [Diagram: two sockets, each with cores 0-3, private L1/L2 per core, and a shared L3.] A lookup walks down the hierarchy: a hit in the local L1 is the best case; on an L1 miss we may hit in the L2; on an L2 miss, in the socket's L3; and on an LLC miss the access crosses to the other socket, a NUMA access, the slowest case.

Why knowing the cache hierarchy matters

    // Create arrays of 1KB up to 2GB and run a fixed, large number of strided updates on each
    size_t arraysize;
    for (arraysize = 1024/sizeof(int); arraysize <= 2UL*1024*1024*1024/sizeof(int); arraysize *= 2)
    {
        size_t steps = 64 * 1024 * 1024;                      // arbitrary, fixed number of steps
        int* array = (int*) malloc(sizeof(int) * arraysize);  // allocate the array
        size_t lengthmod = arraysize - 1;                     // arraysize is a power of two

        // Time this loop for every arraysize
        size_t i;
        for (i = 0; i < steps; i++) {
            array[(i * 16) & lengthmod]++;  // (x & lengthmod) is equal to (x % arraysize)
        }
        free(array);
    }

[Plot: time per operation vs. array size, with jumps at 256KB (this machine has 256KB of L2 per core) and at 16MB (16MB of L3 per socket), and a further NUMA effect beyond that.]

Storage Hierarchy. Why not stay in memory? Cost and volatility. What was missing from the memory hierarchy? Durability and capacity.

Storage Hierarchy Flash HDD Shingled Disks Tape


Disks. Secondary durable storage that supports both random and sequential access. Data is organized in pages/blocks (across tracks). Multiple tracks create an (imaginary) cylinder. Disk access time: seek latency + rotational delay + transfer time = (0.5-2ms) + (0.5-3ms) + (<0.1ms/4KB). Sequential access >> random access (~10x). Goal: avoid random access.

Seek time + rotational delay + transfer time. Seek time: the head moves to the right track; short seeks are dominated by settle time (up to seek distances D on the order of hundreds of tracks). Rotational delay: the platter rotates until the right sector is under the head. What is the min/max/avg rotational delay for a 10000RPM disk? Transfer time: <0.1ms per page, i.e., more than 100MB/s.

Flash. Secondary durable storage that supports both random and sequential access. Data is organized in pages (similar to disks), which are further grouped into erase blocks. Main advantage over disks: random reads are now much more efficient. BUT: slow random writes! Goal: avoid random writes.

The internals of flash

Flash access time depends on: device organization (internal parallelism), software efficiency (the driver), and the bandwidth of the flash packages. The Flash Translation Layer (FTL) is a complex device driver (firmware) that tunes performance and device lifetime.

Flash vs HDD.
HDD: large, cheap capacity; inefficient random reads.
Flash: small, expensive capacity; very efficient random reads; read/write asymmetry.

Storage Hierarchy Flash HDD Shingled Disks Tape

Tapes. Data sizes grow exponentially! Cheaper capacity comes from increasing density (bits/in²) and from simpler devices. Tapes: a magnetic medium that allows only sequential access (yes, like an old-school tape).

Increasing disk density. It becomes very difficult to differentiate between tracks, and settle time grows. Writing a track affects neighboring tracks. Responses: create different readers/writers (the write head is wider than the read head), and interleave the written tracks so they overlap (shingled disks).

Summary. Memory/storage hierarchy. Access granularity (pages, blocks, cachelines). Memory wall: a deeper and deeper hierarchy. Next week: algorithm design with a good understanding of the hierarchy: external sorting and cache-conscious algorithms.