Use of the ERD for administrative monitoring of Theta: Re-implementing xthwerrlog, sedc and related Cray utilities in Go
alexk@anl.gov, ALCF
Argonne Leadership Computing Facility

Who we are
The Argonne Leadership Computing Facility (ALCF) is a national scientific user facility that provides supercomputing resources and expertise to the scientific and engineering community to accelerate the pace of discovery and innovation in a broad range of disciplines.
We currently run Theta, a 24-rack XC40 with Knights Landing CPUs.

What are we talking about?
- The ERD (Event Router Daemon) is the backbone of the XC40
- Most forms of command and control, as well as log data, flow through the ERD
- This includes console logs, hardware error data, environmental data, and system state

Why?
- Hardware error data is stored in binary logs, which makes parsing difficult and CPU-intensive
- The software stack is proprietary and closed-source
- All other logs are unstructured text
- I just don't like unstructured data

What we did: Deluge
- Extensible and loosely coupled:
  - Little overhead to support new databases and input streams
  - Three major backend libraries: events, hwerrcore, sedc
- Highly parallel: makes heavy use of channels and goroutines
- Scalable: can use as many cores and as much memory as it is given
- Configurable: CLI options to tweak memory and core usage
- Written in Go: a modern, statically typed, garbage-collected, memory-safe systems language
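The "highly parallel" point can be illustrated with a minimal fan-out/fan-in sketch. This is not Deluge's code: the rawEvent and parsedEvent types and the parseAll function are hypothetical stand-ins, and the "parsing" is a trivial placeholder. It only shows how channels and goroutines let the worker count scale with the cores available.

```go
package main

import (
	"fmt"
	"sync"
)

// rawEvent and parsedEvent are hypothetical stand-ins, not Deluge's real types.
type rawEvent struct {
	ID      int
	Payload []byte
}

type parsedEvent struct {
	ID   int
	Size int
}

// parseAll fans raw events out to `workers` goroutines and collects the results.
// Result order is not guaranteed.
func parseAll(in []rawEvent, workers int) []parsedEvent {
	raw := make(chan rawEvent)
	out := make(chan parsedEvent)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ev := range raw {
				// Placeholder "parse": record the payload size.
				out <- parsedEvent{ID: ev.ID, Size: len(ev.Payload)}
			}
		}()
	}

	// Feed the workers, then close the input so they can exit.
	go func() {
		for _, ev := range in {
			raw <- ev
		}
		close(raw)
	}()

	// Close the output once every worker has finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	var results []parsedEvent
	for p := range out {
		results = append(results, p)
	}
	return results
}

func main() {
	events := []rawEvent{{1, []byte("ab")}, {2, []byte("cde")}, {3, nil}}
	fmt.Println(len(parseAll(events, 4)))
}
```

In a real tool the worker count would typically come from a CLI option, which matches the "configurable" bullet above.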

Library: events
- The only component that talks to the ERD
- Reads raw binary data from the network and returns a stream of Go structs to the consumer

Library: hwerrcore
- Contains the data structures and logic used to parse Cray RAS events
- Has knowledge of all Aries errors and KNL machine-check errors
- Deluge data structures mimic those used by Cray
[Code listing from the slide not preserved in this transcript]

Library: hwerrcore
- Bulk of the code turns MMR data into JSON with human-readable error data
[Figure: example input and output; not preserved in this transcript]
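Since the slide's input and output examples are missing, here is a minimal sketch of the general idea: split a raw register value into named bit fields and marshal the result as JSON. The field names, shifts and widths in exampleLayout are made up and do not describe any real Aries or KNL register.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// field describes one bit field inside a 64-bit register.
type field struct {
	Name  string
	Shift uint
	Width uint
}

// exampleLayout is a HYPOTHETICAL layout, not a real Aries or KNL register.
var exampleLayout = []field{
	{"valid", 0, 1},
	{"error_code", 1, 7},
	{"address", 8, 40},
}

// decode extracts every field of layout from the raw register value.
func decode(raw uint64, layout []field) map[string]uint64 {
	out := make(map[string]uint64, len(layout))
	for _, f := range layout {
		mask := uint64(1)<<f.Width - 1
		out[f.Name] = (raw >> f.Shift) & mask
	}
	return out
}

func main() {
	// address=0xABCD, error_code=0x15, valid=1 packed into one value.
	raw := uint64(0xABCD<<8 | 0x15<<1 | 1)
	b, err := json.Marshal(decode(raw, exampleLayout))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```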

MapDef
- We have a problem: the logic that parses memory-mapped registers (MMR) is hardcoded into the Cray Hardware Supervisory System (HSS) libraries
- We need to reproduce it reliably in order to parse hardware RAS events
- Cray's libraries output human-readable strings; we want structured data
Solution: create a harness that uses Cray's own libraries to re-generate Go code for every hardware error. Why not?

MapDef
Step 1: Call Cray's parser function in a loop, skipping non-existent error codes and codes with no MMR data.
[Figure: example of the data returned; not preserved in this transcript]

MapDef
Step 2: Take the string output and use regex and code generation libraries to turn it into Go maps.

Library: sedccore
- Reads data from the ec_sedc_data channel
- This was the least documented and hardest component to implement
- We didn't have to implement SEDCv1

Library: sedccore
- SEDC scan IDs are generated based on a dump of the PMDB:
    COPY pmdb.sedc_scanid_info TO '/tmp/scanid_dump.csv' DELIMITER ',' CSV HEADER;

Usage at ALCF: hardware error data
- Elasticsearch (ES) is used to store all hardware error data
- ES is a good fit for hwerr data
- Nested JSON makes data analysis easy compared to string parsing

Usage at ALCF: hardware error data
[Figure not preserved in this transcript]

Usage at ALCF: SEDC data
[Figure not preserved in this transcript]

Usage at ALCF: BER data
- BER (bit error rate) data is generated from hardware error data
- Used to track the health of Aries links and, in some cases, to predict failure

Future work
- Data science and machine learning to find trends
- More correlation with job data
- Ingest more data from the ERD:
  - History of admin commands
  - ERFS metadata
  - Some ALPS data
  - Track system state

Tack! (Thank you!)
- Questions?