OStrich: Fair Scheduler for Burst Submissions of Parallel Jobs. Krzysztof Rzadca Institute of Informatics, University of Warsaw, Poland

OStrich: Fair Scheduler for Burst Submissions of Parallel Jobs. Krzysztof Rzadca, Institute of Informatics, University of Warsaw, Poland. Joint work with Filip Skalski (U Warsaw / Google). Based on work with Vinicius Pinheiro (Grenoble) and Denis Trystram (Grenoble). Photo: http://www.flickr.com/photos/bobjagendorf/345683620/

KEY MESSAGE: A FAIR, MULTIUSER ONLINE SCHEDULING ALGORITHM. An online problem with multiple users sharing a supercomputer. The workload is composed of campaigns (~job arrays): the jobs are independent to execute, and the owner wants to finish all of them as soon as possible. OStrich is an algorithm with a guarantee on the worst-case slowdown (stretch) for each user (OStrich ~ per-user Stretch). The slowdown depends on the total number of users, not on the total system load. OStrich is implemented as a SLURM scheduler used in a production cluster.

MODEL: A TYPICAL SUPERCOMPUTING CENTER. [Figure: jobs placed on m processors (M1 to M6) along a time axis. Each job has an owner (e.g. the red user), a processing time (known when the job appears), and a submission time (not known in advance).]

WHY CAMPAIGNS? Modern applications submit many related computing jobs: Map/Reduce, parameter sweeps, workflows. SLURM makes such submissions easier with job arrays (the maximum job array size was increased to 1M, so it's useful). But cluster schedulers treat such jobs as independent.

WHY A WORST-CASE BOUND FOR EACH USER? Many policies are based on First-Come-First-Served: new jobs are put at the end of the queue. Thus, users with large workloads slow down everyone else. Partial solutions that are hard to manage: limits on the number of jobs in the queue, karma points, priority queues, etc. Fair-share.

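A CAMPAIGN: A BAG OF INDEPENDENT TASKS. [Figure: timeline for user 1 with campaign 1 and campaign 2. For campaign $i$ of user $u$ the figure marks the submission $t_i^{(u)}$, the start $\sigma_i^{(u)}$ and the completion $C_i^{(u)}$. $\Delta_i^{(u)}$ marks the user's goal for the campaign. The think time $tt_2^{(1)}$ precedes the submission of the next campaign: the next campaign is not ready right after $C_1^{(1)}$.]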

PRINCIPLE OF THE ALGORITHM: PARETO-OPTIMALITY. [Figure: two schedules of the same two campaigns on machines M1 to M6 over time t. A fair-share schedule gives completion times (20, 20); a Pareto-optimal schedule gives completion times (10, 20).]

PRINCIPLE OF THE ALGORITHM: OPTIMIZE SLOWDOWN (BUT NO STARVATION). [Figure: two schedules of the same two campaigns on machines M1 to M6 over time t. A FCFS schedule gives completion times (30, 20) and slowdowns (3, 1); a slowdown-optimal schedule gives completion times (10, 30) and slowdowns (1, 3/2).]
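The slowdown figures in this example can be checked with a few lines of Python. This is only a sketch under stated assumptions: both campaigns are released at t=0, slowdown is taken as flow time divided by the time the campaign would need alone on the whole machine, and those "alone" times (10 and 20) are inferred from the slide's numbers rather than stated on it. The names `ALONE`, `SCHEDULES` and `slowdowns` are hypothetical.

```python
def slowdown(completion, release, alone_time):
    """Flow time divided by the time the campaign needs alone on the machine."""
    return (completion - release) / alone_time


# Assumed: time each campaign would take with all processors to itself.
ALONE = {"user1": 10, "user2": 20}

# Completion times of the two schedules in the example.
SCHEDULES = {
    "fcfs": {"user1": 30, "user2": 20},
    "slowdown_optimal": {"user1": 10, "user2": 30},
}


def slowdowns(schedule, release=0):
    """Per-user slowdown for one schedule (all campaigns released together)."""
    return {user: slowdown(c, release, ALONE[user]) for user, c in schedule.items()}


if __name__ == "__main__":
    for name, schedule in SCHEDULES.items():
        per_user = slowdowns(schedule)
        print(name, per_user, "worst:", max(per_user.values()))
```

Under these assumptions the worst slowdown drops from 3 under FCFS to 3/2 in the slowdown-optimal schedule, which is the point of the slide.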

OSTRICH ALGORITHM: A VIRTUAL FAIR-SHARE SCHEDULE DEFINES PRIORITIES FOR CHOOSING JOBS. [Figure: a virtual schedule and the real schedule on machines M1 to M6; two campaigns released at t=0.] In the virtual schedule, OStrich assigns equal shares to each user. In the real schedule, the green user is scheduled first, as it finishes first in the virtual schedule.
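A minimal sketch of how a virtual fair-share schedule can yield a priority order, assuming all campaigns are released at the same time and that the virtual schedule is a fluid one in which the k users with unfinished work each receive m/k processors. The function names and the example workloads are hypothetical; the actual plugin may compute the virtual schedule differently.

```python
def virtual_completion_times(work_by_user, m):
    """Virtual completion time of each user's campaign under equal shares.

    work_by_user maps a user to the campaign's total work (processors x time).
    All campaigns are assumed to be released at t=0 on m processors.
    """
    ordered = sorted(work_by_user.items(), key=lambda item: item[1])
    now = 0.0
    done_per_user = 0.0  # work already executed for every still-active user
    completions = {}
    for index, (user, work) in enumerate(ordered):
        active_users = len(ordered) - index
        rate = m / active_users  # processors given to each active user
        now += (work - done_per_user) / rate
        done_per_user = work
        completions[user] = now
    return completions


def priority_order(work_by_user, m):
    """Users ordered by virtual completion time, earliest first."""
    completions = virtual_completion_times(work_by_user, m)
    return sorted(completions, key=completions.get)


if __name__ == "__main__":
    campaigns = {"green": 60, "red": 120}  # hypothetical workloads
    print(virtual_completion_times(campaigns, m=6))
    print(priority_order(campaigns, m=6))
```

In this example the green campaign completes first in the virtual schedule, so its jobs would be chosen first in the real one.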

OSTRICH ALGORITHM: NEW SUBMISSIONS PREEMPT CURRENTLY EXECUTING CAMPAIGNS. [Figure: virtual and real schedules on machines M1 to M6; a new campaign arrives at t=2.] The red user has priority.

OSTRICH ALGORITHM: NEXT CAMPAIGN DEFERRED UNTIL PREV CAMPAIGN VIRTUAL COMPLETION. [Figure, shown in two steps: virtual and real schedules on machines M1 to M6; a campaign is submitted at t=5.] The red campaign is deferred in the virtual schedule until the previous campaign of the same user completes there.
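The deferral rule can be written as a one-line computation. This is a sketch, not the plugin's code: `virtual_start` is a hypothetical helper, and the numbers in the usage example are illustrative rather than read from the figure.

```python
def virtual_start(submission_time, prev_virtual_completion=None):
    """Earliest time a campaign may start in the virtual schedule.

    A user's next campaign is deferred until the same user's previous
    campaign has completed in the virtual schedule. The first campaign
    of a user (no predecessor) starts at its submission time.
    """
    if prev_virtual_completion is None:
        return submission_time
    return max(submission_time, prev_virtual_completion)


if __name__ == "__main__":
    # Submitted at t=5 while the previous campaign ends virtually at t=8:
    print(virtual_start(5, prev_virtual_completion=8))
    # Submitted after the previous campaign has already ended virtually:
    print(virtual_start(9, prev_virtual_completion=8))
```

The effect is that a user cannot gain a larger virtual share by splitting one workload into several overlapping campaigns.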

SOME PROOFS? http://www.supercoloring.com/

AN UPPER BOUND ON THE CAMPAIGN'S COMPLETION TIME. [Figure, built up over three steps: the virtual schedule V and the real schedule R for user $u$ on a common time axis. Marked in V: $\tilde{\sigma}_i^{(u)}$ and $\tilde{C}_i^{(u)}$. Marked in R: $C_{i-1}^{(u)}$, $t_i^{(u)}$, $S$, $\sigma_{i,q}^{(u)}$, $C_{i,q}^{(u)}$ and $J_i^{(u)}$.] The bound has three parts: (1) the wait until the previous campaign completes in the virtual schedule; (2) an upper bound on the surface that can preempt while the campaign is executing in the virtual schedule; (3) standard upper bounds for the current campaign executing on all resources.

EACH CAMPAIGN'S SLOWDOWN IS BOUNDED. Campaign slowdown: flow time weighted by the surface. OStrich guarantee: the slowdown is bounded in terms of k, the number of active users. We treat $p_{max}$ as constant (and small compared to the campaign's surface).

IMPLEMENTATION IN SLURM

FROM THEORY TO SLURM. Fixed reservations: treated as idle time. Partitions: treated as (perhaps overlapping) sets of processors. Users' estimates are imprecise: simple estimates can be used (not yet implemented!); in simulations we use the average from the 2 last completed jobs. Campaign from a stream of jobs: we group jobs based on the delay from the first submission. [Figure: 3 jobs submitted within the threshold form a single campaign; a job submitted after the threshold starts a new campaign.]
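Both heuristics on this slide are simple enough to sketch. The code below is an illustration under assumptions, not the plugin's implementation: the function names, the per-user list of submission times, the threshold value and the `default` fallback for a user with no completed jobs are all hypothetical.

```python
def group_into_campaigns(submit_times, threshold):
    """Group one user's job submission times into campaigns.

    A job joins the current campaign if it is submitted within `threshold`
    of the campaign's FIRST submission; otherwise it starts a new campaign.
    """
    campaigns = []
    for t in sorted(submit_times):
        if campaigns and t - campaigns[-1][0] <= threshold:
            campaigns[-1].append(t)
        else:
            campaigns.append([t])
    return campaigns


def estimate_runtime(completed_runtimes, default):
    """Average runtime of the user's 2 most recently completed jobs.

    Falls back to `default` (an assumption) when nothing has completed yet.
    """
    recent = completed_runtimes[-2:]
    if not recent:
        return default
    return sum(recent) / len(recent)


if __name__ == "__main__":
    # Three jobs within the threshold, then one that starts a new campaign.
    print(group_into_campaigns([0, 5, 9, 20], threshold=10))
    print(estimate_runtime([100, 200, 400], default=3600))
```

Measuring the delay from the first submission, rather than from the previous job, keeps a slow trickle of jobs from extending a single campaign indefinitely.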

A SEMI-ACTIVE SCHEDULER. OStrich is notified about a newly submitted job and assigns priority 0 to this job. Every 1-10 seconds, OStrich recalculates the virtual schedule (new jobs, completed jobs, changed jobs). OStrich assigns decreasing priorities to jobs by campaign order. [Figure: jobs on machines M1 to M6 labeled with decreasing priority values.] The main SLURM daemon uses the priorities to order jobs for FCFS/backfill.
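The priority assignment step might look roughly like the following. This is a hedged sketch in plain Python that does not call any SLURM API: `assign_priorities`, the `top` value and the job identifiers are hypothetical, and the campaigns are assumed to arrive already sorted by their order in the virtual schedule.

```python
def assign_priorities(campaigns_in_virtual_order, top=999):
    """Give every job a priority that decreases with its campaign's rank.

    campaigns_in_virtual_order is a list of campaigns (each a list of job
    ids), earliest virtual completion first. Jobs of an earlier campaign
    always receive higher priorities than jobs of a later one.
    """
    priorities = {}
    next_priority = top
    for campaign in campaigns_in_virtual_order:
        for job_id in campaign:
            priorities[job_id] = next_priority
            next_priority -= 1
    return priorities


if __name__ == "__main__":
    ordered = [["a1", "a2"], ["b1"]]  # hypothetical job ids
    print(assign_priorities(ordered))
```

Because the result is only a number per job, the existing FCFS/backfill logic can consume it unchanged, which is what makes the scheduler semi-active rather than a full replacement.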

https://www.flickr.com/photos/rivenimagery/835997629/ EXPERIMENTS (still work in progress)

http://www.flickr.com/photos/steveharris/24578034/ OSTRICH IS FAST! 50K+ JOBS SCHEDULED IN 0.04 SECONDS. We emulated a cluster head node on a normal PC.

IN PRODUCTION: 25K+ JOBS SCHEDULED SINCE JULY 2014, NO MAJOR PROBLEMS. Running on a cluster with 262 nodes, 5056 cores, heterogeneous architecture (ICM: Warsaw Supercomputing Center; site report tomorrow at 4:05).

HOW GOOD IS THE ALGORITHM FROM THE USERS' PERSPECTIVE? Tests on a simulator using recorded logs from Dror Feitelson's archive.

OSTRICH IS MORE EFFICIENT THAN FAIRSHARE (FOR SOME LOGS!). For ~95% of campaigns, slowdown ≤ 5. [Figure: two panels, one with perfect estimates and one with estimated runtimes (average of the 2 last jobs).] Log from ANL Thunder BlueGene/P, 60k cores, 0.9x time compression.

THE MORE CAMPAIGN-LIKE THE LOG, THE LARGER THE DIFFERENCE. ~10% more jobs with stretch ≤ 5 for perfect runtime estimates; ~10% more jobs with stretch ≤ 5 for standard runtime estimates. Log from ANL Thunder BlueGene/P, 60k cores, 0.8x time compression, jobs submitted during 30 minutes grouped and submitted together.

FOR SOME LOGS, OSTRICH IS WORSE THAN FAIRSHARE. Log from LLNL Thunder, 4k cores, 0.95x time compression, 30-minute job groups.

http://www.flickr.com/photos/gravitywave/9460440/ CONCLUSIONS

CONCLUSIONS. OStrich guarantees that the slowdown of each campaign (burst submission) is proportional to the number of users in the system. OStrich maintains a virtual, fair-share schedule. We have a SLURM scheduling plugin and a simulator available for download: github.com/filipjs/. With the simulator you're able to test the performance on your workload before running in production. OStrich can use the existing configuration (shares) from the multifactor plugin.

ACKNOWLEDGEMENTS. Work inspired by a problem suggested by Jarosław Żola (SUNY Buffalo). The algorithm was developed with Vinicius Gama Pinheiro (U. Grenoble) and Denis Trystram (U. Grenoble). Joseph Emeras contributed to the experimental evaluation of an earlier version of the algorithm. Marcin Stolarek and other brave sysadmins from ICM (Warsaw Supercomputing Center) agreed to manage their machines with our scheduler! Work supported by Polish National Science Center, grant UMO-2012/07/D/ST6/02440.

http://www.flickr.com/photos/kapkaupunki/3055670/ Thanks and... embrace the OStrich! Krzysztof Rzadca, krzadca@mimuw.edu.pl mimuw.edu.pl/~krzadca/ostrich/