MAXQ HRL in Soar. Mitchell Keith Bloch. University of Michigan. May 17, 2010

Motivation

1. Replicate the results described in [Dietterich, 1998].
2. Determine how to bring the cooling techniques employed by a special-purpose, one-off technique (MAXQ) to a general-purpose architecture (Soar).
3. Demonstrate the advantages of MAXQ HRL over flat RL.
4. Demonstrate the value of the MAXQ HRL cooling techniques.

Outline

1. Background
2. Modifications to Soar
3. Agent Construction
4. Methodology and Results
5. Discussion
6. Nuggets and Coal

Background: The Taxicab Domain, Basic Information

1. Initial conditions
   a. 5x5 grid world
   b. 4 sources/destinations
   c. A refueling station
   d. Impassable walls
   e. [5,12] units of fuel, capped at 14
2. Goals
   a. Pick up the passenger
   b. Deliver the passenger to the destination
   c. Avoid running out of fuel
   d. The goals are always achievable
3. Rewards
   a. -1 for a legal action
   b. -10 for an illegal action
   c. -20 for running out of fuel
   d. +20 for delivering the passenger

Background: Reinforcement Learning

1. The reinforcement learning problem
   a. An agent
   b. An environment and a reward signal
2. Q-learning, a temporal-difference (TD) method
3. TD methods involve a value function
   a. Expected future reward
   b. One value per action per state, in the limit
4. Should converge on the optimal policy
   a. Learn the value function
   b. Stop exploring
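
To make the TD idea concrete, here is a minimal tabular Q-learning update in Python. This is an illustrative sketch, not the talk's Soar implementation; ALPHA and GAMMA are assumed values.

from collections import defaultdict

# Tabular Q-learning, the TD method named above (a minimal sketch; the
# talk's agents are written as Soar rules, not Python).
Q = defaultdict(float)       # Q[(state, action)] -> expected future reward
ALPHA, GAMMA = 0.1, 0.9      # learning rate and discount factor (assumed)

def td_update(state, action, reward, next_state, actions):
    """One TD step: move Q(s, a) toward reward + GAMMA * max_a' Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_error = reward + GAMMA * best_next - Q[(state, action)]
    Q[(state, action)] += ALPHA * td_error
    return abs(td_error)     # the absolute Bellman error, used later for cooling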

Background: MAXQ Hierarchical Reinforcement Learning

[Figure: Dietterich's MAXQ hierarchy. MaxRoot selects among QGet, QPut, and QRefuel, leading to MaxGet, MaxPut, and MaxRefuel; those select QPickup, QPutdown, QFillup, or QNavigateForGet(t), QNavigateForPut(t), QNavigateForRefuel(t); MaxNavigate(t) selects among QNorth(t), QSouth(t), QEast(t), and QWest(t); the primitive actions are Pickup, Putdown, Fillup, North, South, East, and West.]

1. Formulated by [Dietterich, 1998].
2. Max nodes represent goals.
   a. Each goal is an RL problem.
   b. Each has its own cooling strategy.
3. A Max node cools on success if the absolute Bellman error per step is low.
   a. Assumes success.
   b. Assumes a deterministic environment.
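
As a rough picture of that structure, the hierarchy can be sketched as data, with each Max node owning its children and its own temperature. This is a hypothetical representation, not Dietterich's or Soar's.

class MaxNode:
    """A subgoal in the MAXQ hierarchy; each one is its own RL problem
    with its own cooling strategy (hypothetical sketch)."""
    def __init__(self, name, children):
        self.name = name
        self.children = children   # child Max nodes or primitive action names
        self.temperature = 50.0    # per-node temperature (assumed initial value)

navigate = MaxNode("MaxNavigate", ["North", "South", "East", "West"])
max_get = MaxNode("MaxGet", ["Pickup", navigate])
max_put = MaxNode("MaxPut", ["Putdown", navigate])
max_refuel = MaxNode("MaxRefuel", ["Fillup", navigate])
max_root = MaxNode("MaxRoot", [max_get, max_put, max_refuel])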

Background: Soar-RL

sp {reinforce*putdown*151
   (state <s> ^operator <o> +)
   (<o> ^name putdown
        ^passenger true
        ^x 0
        ^y 0
        ^destination yellow)
-->
   (<s> ^operator <o> = 20.)
}

Figure: Abstract view of a putdown proposal.

1. Proposal rules are assigned Q values.
2. Boltzmann indifferent-selection decides between proposals.
3. Q values are modified when rewards are received.

Background: Boltzmann Indifferent-Selection

P(O_i \mid s) = \frac{e^{Q(s, O_i)/\tau}}{\sum_{j=1}^{n} e^{Q(s, O_j)/\tau}}

Figure: Boltzmann indifferent-selection prefers actions with higher Q values.

1. Start with a high temperature.
   a. Choose almost randomly.
2. End with a low temperature.
   a. Choose the best almost exclusively.
3. Interpolate in between.
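
Soar implements this selection internally; a minimal Python equivalent (illustrative only) is:

import math
import random

def boltzmann_select(q_values, tau):
    """Pick an index with probability proportional to exp(Q_i / tau)."""
    # Subtract the max before exponentiating for numerical stability;
    # the resulting distribution is unchanged.
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]

# High tau: nearly uniform.  Low tau: nearly greedy.
print(boltzmann_select([20.0, 10.0, -10.0], tau=100.0))  # almost random
print(boltzmann_select([20.0, 10.0, -10.0], tau=0.1))    # almost always 0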

Modifications to Soar: Architectural Modifications

The cooling schedule for HRL proposed and implemented by [Dietterich, 1998]:

1. Support per-goal cooling schedules.
2. Slow cooling:
   a. Require a low average absolute Bellman error per step.
   b. Require success.
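
My reading of that schedule, as a hedged Python sketch reusing the MaxNode sketch above (the rate and thresholds are assumptions, not the values used in Soar):

def maybe_cool(node, succeeded, total_abs_bellman_error, steps,
               rate=0.97, error_threshold=0.1, min_tau=0.05):
    """Cool one Max node only after a successful episode whose average
    absolute Bellman error per step is low (thresholds are assumptions)."""
    if not succeeded:
        return                                   # require success
    if total_abs_bellman_error / max(steps, 1) > error_threshold:
        return                                   # require low error per step
    node.temperature = max(node.temperature * rate, min_tau)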

Agent Construction: Basic Details

1. The agent knows:
   a. The taxi's position
   b. The current type of cell
   c. The fuel available
   d. Where the passenger is
   e. Where the passenger wants to go
2. Seven choices of action from any state.
3. The environment provides rewards.
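
For concreteness, the observable state and the seven primitive actions can be sketched as follows. This is a hypothetical encoding, not the agent's actual Soar working-memory layout.

from dataclasses import dataclass

ACTIONS = ["North", "South", "East", "West", "Pickup", "Putdown", "Fillup"]

@dataclass
class TaxiState:
    x: int                 # taxi position on the 5x5 grid
    y: int
    cell_type: str         # e.g. plain, source/destination, refueling station
    fuel: int              # remaining fuel, capped at 14
    passenger_at: str      # where the passenger is (a source, or the taxi)
    destination: str       # where the passenger wants to go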

Agent Construction: Flat RL Agent

1. Actions unrestricted.
2. Pickup and Putdown coded coarsely.

Agent Construction: MAXQ HRL Agent Basics

[Figure: Dietterich's MAXQ hierarchy, as before.]

1. Max nodes represent plans.
2. Q values represent knowledge of how to implement them.
3. Much coarser coding than the flat agent.

Agent Construction: MAXQ HRL Agent Reward Assignment

[Figure: Dietterich's MAXQ hierarchy, as before.]

1. The ±20 rewards are passed to MaxRoot.
2. The -10 reward is passed to MaxGet, MaxPut, and MaxRefuel.
3. The -1 reward is passed to all layers of the hierarchy.
4. An internal reward of 10 is generated for MaxGet.
5. An internal reward of 10 is generated for MaxRefuel.
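
A sketch of how these rewards might be routed down the hierarchy. The helper and its event names are hypothetical, not the agent's actual Soar rules, and the internal rewards' sign is assumed positive.

def assign_rewards(event, active_nodes):
    """Route the slide's rewards to the hierarchy levels named above.
    `active_nodes` is the set of currently active Max node names
    (illustrative plumbing; names are assumptions)."""
    rewards = {node: -1.0 for node in active_nodes}    # -1 to every layer
    if event == "delivered":
        rewards["MaxRoot"] += 20.0
    elif event == "out_of_fuel":
        rewards["MaxRoot"] -= 20.0
    elif event == "illegal_action":
        for node in ("MaxGet", "MaxPut", "MaxRefuel"):
            if node in rewards:
                rewards[node] -= 10.0
    elif event == "picked_up":                         # internal reward
        rewards["MaxGet"] = rewards.get("MaxGet", 0.0) + 10.0
    elif event == "filled_up":                         # internal reward
        rewards["MaxRefuel"] = rewards.get("MaxRefuel", 0.0) + 10.0
    return rewards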

Methodology and Results: Motivation and Overview

1. Wanted to replicate the result from [Dietterich, 1998] that MAXQ hierarchical reinforcement learning is superior to flat reinforcement learning in a task as difficult as the taxicab domain.
2. Wanted to show that the cooling schedule of MAXQ offers an advantage over HRL without MAXQ.
3. In both the finite task and the infinite task, the non-MAXQ HRL agent was changed in the following ways:
   a. Only one temperature for the whole agent.
   b. The absolute Bellman error per step is ignored.
   c. Failure is ignored for the purposes of disabling learning and cooling.

Methodology and Results: Plot Information

[Figure: Agent performance in the taxicab domain with infinite fuel. Reward per step (moving average over 200 episodes) vs. episode number, through 3,000 episodes, for Optimal (60 runs averaged), Flat RL (30 runs averaged), and MAXQ HRL (30 runs averaged), with mean optimal performance marked.]

1. Plots are averaged over 30 sets of episodes.
2. Afterward, they are smoothed using a moving average with a window of 200 episodes.
3. Error bars indicate minima and maxima.
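
The smoothing described here is a plain trailing moving average; a minimal version (illustrative, not the actual plotting code):

def moving_average(series, window=200):
    """Smooth a per-episode reward series with a trailing moving average."""
    out, total = [], 0.0
    for i, value in enumerate(series):
        total += value
        if i >= window:
            total -= series[i - window]      # drop the value leaving the window
        out.append(total / min(i + 1, window))
    return out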

Methodology and Results: Flat vs. MAXQ HRL, Infinite Task

[Figure: Agent performance in the taxicab domain with infinite fuel, as described above.]

1. The HRL agent was untuned, and used the same parameters as the agent for the finite-fuel task.
2. With exploration disabled after 3,000 episodes:
   a. The optimal reward possible over 5,000 episodes was 1.09 reward per step.
   b. The flat agent averaged 1.00 reward per step.
   c. The hierarchical agent averaged 1.09 reward per step, and matched the optimal for all 5,000 episodes in all 30 runs.

Methodology and Results: Infinite Task, Effect of MAXQ

[Figure: Modified agent performance in the taxicab domain with infinite fuel. Reward per step (moving average over 200 episodes) vs. episode number, through 3,000 episodes, for Optimal (60 runs averaged), Flat RL (30 runs averaged), MAXQ HRL (30 runs averaged), and HRL without MAXQ (30 runs averaged).]

1. Results from the same HRL agent, with all cooling rates reduced to 0.97, are plotted against the previous flat-agent results.
2. This new choice of cooling rate for all Max nodes was untuned.
3. The hierarchical agent still averaged 1.09 reward per step, but matched the optimal for all 5,000 episodes in only 28 runs this time.
4. Without Dietterich's cooling techniques, learning slowed significantly, but the agent averaged 1.10 reward per step and matched the optimal for all 5,000 episodes in all 30 runs.

Methodology and Results: Flat vs. MAXQ HRL, Finite Task

[Figure: Agent performance in the taxicab domain with finite fuel. Reward per step (moving average over 200 episodes, ranging from -8 to 2) vs. episode number, through 50,000 episodes, for Optimal (30 runs averaged), Flat RL (30 runs averaged), MAXQ HRL (30 runs averaged), and Dietterich's 1998 flat RL and MAXQ HRL runs, with mean optimal performance marked.]

1. With exploration disabled after 50,000 episodes:
   a. The optimal reward possible over 5,000 episodes was 0.93 reward per step.
   b. The flat agent averaged 0.83 reward per step, and the hierarchical agent averaged 0.86 reward per step.
2. My hierarchical Soar agent learns more slowly than that of [Dietterich, 1998], although both manage to achieve a virtually optimal policy by the end of 50,000 episodes.

Methodology and Results: Finite Task, Effect of MAXQ

[Figure: Agent performance in the taxicab domain with finite fuel, through 50,000 episodes, for Optimal (60 runs averaged), MAXQ HRL (30 runs averaged), and HRL without MAXQ (30 runs averaged).]

1. Once Dietterich's cooling techniques are disabled, learning actually speeds up a bit.
2. However, this agent averaged only 0.75 reward per step, significantly less than the 0.86 received when using these techniques.

Discussion: Future Work - Cooling Strategies

1. An important lesson to take from [Dietterich, 1998] is the importance of cooling on success.
2. Explore more finely grained cooling strategies.
3. Possible features of a successful strategy (see the sketch below):
   a. Pass back a success signal with the rewards.
   b. Keep track of a moving average of the success rate.
   c. Map this average to a temperature.
   d. Use the maximum temperature of the available options.
4. Possible goals:
   a. Better fit the temperature to learning.
   b. Make coarse coding more useful.
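
One way the proposed strategy might look in code, purely as speculation matching the feature list above; the constants and function names are assumptions, and nothing like this is implemented in Soar:

def temperature_from_success(success_history, tau_max=50.0, tau_min=0.05,
                             window=100):
    """Map a moving average of recent subgoal successes to a temperature:
    a low success rate stays hot (explore); a high rate cools (exploit)."""
    recent = success_history[-window:]
    rate = sum(recent) / len(recent) if recent else 0.0
    return tau_max + (tau_min - tau_max) * rate   # linear interpolation

def option_temperature(option_histories):
    """Per feature (d): use the maximum temperature of the available options."""
    return max(temperature_from_success(h) for h in option_histories)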

Nuggets and Coal: Mineral Resources

Nuggets
1. Implementing the cooling strategies employed by [Dietterich, 1998] in Soar was straightforward.
2. The cooling strategies of MAXQ HRL have been integrated into Soar.
3. The value of MAXQ HRL over flat RL has been verified.
4. The MAXQ cooling strategies have been shown to be of value.

Coal
1. Need to be able to evaluate success.
2. It is unclear that the problem formulation is identical to that of [Dietterich, 1998].
3. Unable to reproduce Dietterich's level of success with the flat RL agent.
4. No public release of the architectural modifications yet.

References

Thomas G. Dietterich. The MAXQ method for hierarchical reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 118-126. Morgan Kaufmann, 1998.