Reinforcement Learning of Adaptive Longitudinal Control for Dynamic Collaborative Driving

Int. J. Vehicle Information and Communication Systems, Vol. x, No. x, 2008

Luke Ng
Department of Mechanical and Mechatronics Engineering, University of Waterloo, 200 University Ave W., Waterloo, Ontario, N2L 3G1, Canada
Fax: (519) 888-4333
Email: l4ng@engmail.uwaterloo.ca

Christopher M. Clark
Computer Science Department, California Polytechnic State University, San Luis Obispo, CA, 93407, USA
Fax: (805) 756-2956
Email: cmclark@calpoly.edu

Jan P. Huissoon
Department of Mechanical and Mechatronics Engineering, University of Waterloo, 200 University Ave W., Waterloo, Ontario, N2L 3G1, Canada
Fax: (519) 888-4333
Email: jph@uwaterloo.ca

Abstract: Dynamic collaborative driving involves the motion coordination of multiple vehicles using shared information from vehicles instrumented to perceive their surroundings in order to improve road usage and safety. A basic requirement of any vehicle participating in dynamic collaborative driving is longitudinal control. Without this capability, higher-level coordination is not possible. Each vehicle involved is a composite nonlinear system powered by an internal combustion engine, equipped with automatic transmission, rolling on rubber tires with a hydraulic braking system. This paper focuses on the problem of longitudinal motion control. A longitudinal vehicle model is introduced which serves as the control system design platform. A longitudinal adaptive control system which uses Monte Carlo Reinforcement Learning is introduced. The results of the reinforcement learning phase and the performance of the adaptive control system for a single automobile, as well as the performance in a multi-vehicle platoon, are presented.

Keywords: autonomous robotics; mobile robots; motion control; adaptive cruise control; collaborative driving; vehicle dynamics; vehicle simulation; artificial intelligence; machine learning; reinforcement learning; adaptive control.
Reference to this paper should be made as follows: Ng, L., Clark, C.M., Huissoon, J.P. (2008) Reinforcement learning of adaptive longitudinal control for dynamic collaborative driving, Int. J. Vehicle Information and Communication Systems, Vol. X, No. Y, pp.

Copyright 2008 Inderscience Enterprises Ltd.

L. Ng, C.M. Clark, J. P. Huissoon

Biographical notes: Luke Ng is a doctoral candidate at the Lab for Autonomous and Intelligent Robotics, Dept. of Mechanical and Mechatronics Engineering, University of Waterloo. Chris M. Clark is an Assistant Professor at the Computer Science Dept., California Polytechnic State University, San Luis Obispo, CA. Jan P. Huissoon is a Professor at the Dept. of Mechanical and Mechatronics Engineering, University of Waterloo. He is currently the Department Deputy Chair and is a Professional Engineer.

1. Introduction

In major cities throughout the world, urban expansion is leading to an increase in vehicle traffic flow. The adverse effects of increased vehicle traffic flow include traffic congestion, driving stress, increased vehicle collisions, pollution, and logistical delays. Once traffic flow surpasses the capacity of the road system, the road system ceases to be a viable transportation option. One solution is to build more roads; another is to build better vehicles: vehicles that can negotiate traffic and coordinate with other similar-thinking vehicles to optimize their speeds so as to arrive at their destination safely and efficiently. This is the concept behind Dynamic Collaborative Driving, an automated driving approach where multiple vehicles dynamically form groups and networks, sharing information in order to build a dynamic representation of the road to coordinate travel.

Ultimately, our research goal is to create a decentralized control system capable of performing dynamic collaborative driving which is scalable to a large number of vehicles and can be implemented on any vehicle and in any environment. However, before we can deal with the issue of coordination, basic control of the vehicle must be achieved. Therefore, the focus of this paper is the basic problem of longitudinal motion control, sometimes referred to as adaptive cruise control.
Research in automated driving in the United States during the 1990s was conducted under the PATH project (Partners for Advanced Transit and Highways). PATH introduced the concept of platooning (Varaiya 1993; Shladover et al 1993; Hedrick et al 1994), where vehicles in groups of 1-25 cars travel in tight vehicle-string formations. The most basic level of control in platooning is longitudinal control, also referred to as autonomous intelligent cruise control (AICC). Ioannou and Chien (1993) describe an AICC system for automatic vehicle following, which is a stand-alone longitudinal control system using a linear vehicle following model. Raza and Ioannou (1997) implemented the AICC system on a real vehicle and evaluated it during Demo 97 to verify the performance obtained under simulation.

Studies in the mid 1990s at UC Berkeley (Maciuca and Hedrick 1995; Swaroop & Hedrick 1994) focused on using sliding mode control to address the nonlinearities of longitudinal vehicle dynamics. The studies addressed vehicle dynamics simulation, string stability of linear formations and nonlinear control. Rajamani et al (2000) applied sliding surface control to longitudinal control during Demo '97. At Demo 2000, Kato et al (2002) demonstrated an adaptive proportional control law for longitudinal control. Recently, Zhang and Ioannou (2005) proposed an adaptive control approach to vehicle following with variable time headways, using a simplified first-order

linear vehicle model. The control system guarantees closed-loop system stability, and regulates the speed and separation errors towards zero when the lead vehicle is at a constant speed. Khatir and Davision (2006) revisited linear control approaches by proposing a linear PID controller for longitudinal and lateral control assuming a simplified 6th-order linear model, where the vehicle is modelled as a bicycle to further simplify vehicle dynamics.

Due to the high costs associated with procuring large numbers of vehicles and the safety issues involved, full-scale vehicle studies can only be conducted through large-scale research projects in association with governments and automobile manufacturers, such as Demo 97 (Thorpe et al 1997; Tan et al 1998; Rajamani et al 2000) and, in Japan, Demo 2000 (Tsugawa et al 2000; Kato et al 2002). In Canada, smaller projects have used mobile robots to model cars (Michaud et al 2006); however, the cost and complexity associated with these mobile robot studies can also be quite high. In addition, the vehicle dynamics of a mobile robot platform are significantly different from those of full-sized automobiles, thereby limiting the applicability of those results.

Alternatively, simulation studies can be developed faster, are more flexible and cost effective, have better repeatability, and can explore situations not easily achieved in reality. In 1989, the National Highway Traffic Safety Administration (NHTSA) began researching the use and construction of a new state-of-the-art driving simulator, the National Advanced Driving Simulator (NADS) (Haug 1990). Since then, NADS has been used as a substitute for actual vehicle testing. The NHTSA's Vehicle Research and Test Center (VRTC) provides vehicle data for several vehicles such as the 1997 Jeep Cherokee (Salaani and Heydinger 2000), which can be used to validate simulations.
With the adoption of high-fidelity simulation on modern computers, simulation has become the dominant method of study in this field. Therefore, our methodology involves first creating an accurate vehicle model to be used in both the design and the validation of the control system. The use of a computer model can be considered a Computer Aided Engineering (CAE) approach to control design, allowing the designer to assess the performance of the control system and predict its limits. We proceed with a description of the vehicle dynamics model, followed by an explanation of the longitudinal control system's design; the results of the learning process and the evaluation of the system's performance are then presented.

2. Vehicle Dynamics Modelling

The basis of our simulation has its roots in the late 1980s to the late 1990s. A significant amount of research was conducted at the Vehicle Dynamics Laboratory at the University of California at Berkeley by Hedrick under the PATH project. His group developed a complex numerical automobile model used to design and evaluate the performance of various controllers under certain conditions (McMahon and Hedrick 1989; Peng and Tomizuka 1991; Pham et al 1994). Later work by Pham and Hedrick (Pham et al 1997) used this model to evaluate the performance of an optimal controller for combined lateral and longitudinal control. The vehicle model adopts many of the models used by Hedrick's group for key subsystems such as the engine, transmission, suspension and tires. However, in order to

have a simulation which can be subjected to reinforcement learning, these separate models have been integrated and modified to provide system performance throughout the entire operating range. For example, an automatic transmission system is added to allow gear shifting so that the entire speed range can be experienced. Figure 1 illustrates how each subsystem model is interconnected into a coherent model of an automobile. The following is a partial description of each of the major subsystem models and shows where the nonlinearities of the overall vehicle model originate.

2.1. Engine Model

McMahon and Hedrick (1989) describe in detail a mathematical model of a 3.4L Ford V6 internal combustion engine. The control input to the engine is the throttle angle, which is supplied by the throttle actuator model. The throttle actuator model is simply a first-order system with a time constant of 0.5 ms. The output of the engine model is the engine's crankshaft speed. In addition, a feedback term from the transmission model is required in the form of the torque of the transmission pump, which is connected to the engine's crankshaft. The engine model is made up of several differential equations, one for each part of the combustion process. The differential equation for the gas mixture in the intake manifold is given by

Tm RTm P m. e vol Pm m ai m egri T 8873 KPa/s (1) V m

where $P_m$ and $T_m$ are the manifold pressure and temperature. The manifold volume $V_m$ is considered fixed at 3.4 L or 0.0034 m³. The mass rate of air entering the intake manifold is given by the relationship

$$\dot{m}_{ai} = MAX \cdot TC \cdot PRI \;\; \text{kg/s} \tag{2}$$

where MAX = 0.335 kg/s is the engine-specific maximum flow rate, TC is the normalized throttle characteristic and is a function of throttle angle, and PRI is the normalized pressure influence ratio and is a 5th-order polynomial function of the pressure ratio $PR = P_m/P_{atm}$.
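As an illustration of the two simplest pieces of the engine model above, the following minimal Python sketch steps a first-order actuator lag and evaluates the product form of Equation (2). The function names, the exact zero-order-hold discretization, and all numerical arguments are assumptions made for this example; the source states only that the actuator is a first-order system and that the air mass flow is the product of MAX, TC and PRI. The TC and PRI values are passed in as plain numbers because their lookup and polynomial are not given here.

```python
import math


def first_order_lag_step(y, u, tau, dt):
    """Advance the lag tau*dy/dt + y = u by one step of length dt.

    Uses the exact discretization for an input held constant over the step
    (an assumption of this sketch, not a detail taken from the paper).
    """
    alpha = math.exp(-dt / tau)
    return alpha * y + (1.0 - alpha) * u


def mass_air_flow(max_flow, tc, pri):
    """Equation (2): intake air mass flow as the product MAX * TC * PRI."""
    return max_flow * tc * pri


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the functions.
    throttle = 0.0
    for _ in range(5):
        throttle = first_order_lag_step(throttle, u=1.0, tau=0.5, dt=0.1)
    print(round(throttle, 4))
    print(mass_air_flow(max_flow=2.0, tc=0.5, pri=0.25))
```

The same lag step could be reused for the brake actuator described in Section 2.3, since both are described as first-order systems that differ only in their time constants.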
The differential equation for the gas mixture in the exhaust manifold is expressed as

Pm Vm T m P m m ao m ai m egri m kg/s (4) egro RT T P

where the exhaust gas recirculation out of the exhaust manifold is described by the second-order differential equation

5 m 9.5 1 ( m m ) kg/s (5) ergo

and the volumetric efficiency term $\eta_{vol}$ of the engine is expressed as a surface (Figure 2) that depends on the mass flow rate of air into the intake manifold and the rotational speed of the engine $\omega_e$. The mass flow rate of the exhaust gas into the exhaust manifold, $\dot{m}_{egri} = EGRI(P_m/P_e)$ kg/s, is provided by a lookup which depends on the ratio of manifold to exhaust pressure $P_m/P_e$. The pressure of the air in the exhaust manifold can be determined by the relationship

3 P 12 P 1.7 1 ( t ).12 KPa (6) e m

where the engine speed is a function of time $\omega_e(t')$ with $t' = t - \Delta t_{it}$, the time delayed by the intake-to-torque production delay that results from the cyclical nature of the engine. The

time delay is $\Delta t_{it} = 5.48/\omega_e$ s. The indicated engine torque produced is modelled in the continuous time domain for simplicity as

$$T_i = \frac{c_t \, \dot{m}_{ao}(t - \Delta t_{it}) \, AFI(t - \Delta t_{it}) \, SI(SA(t - \Delta t_{it}))}{\omega_e(t - \Delta t_{it})} \;\; \text{N m} \tag{7}$$

The constant $c_t$ = 1175584 N m s/kg is the maximum torque capability of the engine, and the function AFI(t) is the air/fuel influence function. The function $SI(SA) = 1.0 - 3.8 \times 10^{-4} \, (SA(t))^2$ is the normalized spark influence function, which is a function of the spark advance SA(t) from MBT, the minimum spark advance for best torque. The engine speed can be determined using the following torque balance differential equation

$$J_e \dot{\omega}_e = T_i - T_f - T_p \;\; \text{N m} \tag{8}$$

where $J_e$ = 0.263 kg m² is the effective inertia of the engine and torque converter, $T_i$ is the indicated torque produced by the engine, $T_f = T_{fric}(\dot{m}_{ao}, \omega_e)$ is the engine friction torque function, and $T_p$ is the torque converter pump torque, which is modelled in the transmission subsystem model.

2.2. Transmission Model

McMahon and Hedrick (1989) describe the model of an automatic transmission subsystem (Figure 3). The transmission system connects the engine to the driveshaft, where the motion from the engine is transmitted through a gear-train to the driveshaft. The engine is connected directly to the pump of the torque converter. The rotational motion of the fluid transmits the torque from the pump to the turbine. The turbine's output shaft is connected to the gear-train, which is connected to the driveshaft. Since the gear-train is only connected to the engine through the transmission fluid, it is possible to change gears without disrupting the motion of the engine. The gear selection of the gear-train is managed by the valve body, which senses hydraulic pressure and actuates servo pistons to select the proper gear ratios to optimize engine performance.
The behaviour of the valve body can be modelled as a discrete logic schedule which depends on both vehicle speed and throttle position.

There are two phases of operation for the torque converter: the high torque phase (10 and 11), experienced when changing gears, and the fluid coupling phase (13). The torque equations depend on the speed ratio of the turbine and pump, $\omega_t/\omega_p$; the high torque phase satisfies the relationship $\omega_t/\omega_p < 0.9$. The torque of the pump $T_p$ and of the turbine $T_t$ are expressed using the following equations

3 2 3 2 T.4325 1 2.21 1 (9) t p 3 p p t, e T (1) 3 2 3 3 2 5.7656 1 p, e.317 1 p, e t 5.4323 1 t

where $\omega_{p,e}$ and $\omega_{t,e}$ satisfy the first-order lag expressions

p, e p p, e p t, e t t, e t rad/s (11)

The fluid coupling phase exists when $\omega_t/\omega_p \geq 0.9$; therefore the torques for the turbine and pump are expressed as

T.7644 3 1 2 32.84 3 1 25.2441 3 1 2 (12) t T p 6 p p t t

Since the engine is connected directly to the pump, $\omega_e = \omega_p$; thus the angular speed of the turbine $\omega_t$ can be determined with the first-order differential torque balance equation

$$J_{tg} \dot{\omega}_t = T_t - R_g R_d T_s \;\; \text{N m} \tag{13}$$

where $J_{tg}$ = 0.7 kg m² is the rotational inertia, $R_g$ is the gear ratio depending on which gear is used (i.e. 0.4167, 0.6817, 1, 1.4993) and $R_d$ = 1 is the drive gear ratio. The shaft torque $T_s$ can be determined from the following first-order differential equation

$$\dot{T}_s = K_s (R_g R_d \, \omega_t - \omega_{wf}) \;\; \text{N m/s} \tag{14}$$

where $K_s$ = 6742 N m/rad is the shaft stiffness and $\omega_{wf}$ is the angular speed of the front wheel.

2.3. Braking System Model

McMahon and Hedrick (1989) describe a simplified model to determine the braking forces to apply to each wheel. Although the brake torques are largely dependent on the hydraulic system that makes up the braking system of the vehicle (Maciuca and Hedrick 1995), a first-order lag expression provides a sufficient simplified approximation of the system. A normalized brake command $cmd_{brake}$ in the interval [0, 1] is assumed to be provided by the control system and is passed through the brake actuator model. This is simply a first-order system with a time constant of 0.75 ms. The first-order lag function $lag_{brake}$, which approximates the braking system, is modelled with a time constant of 0.72 s. The equations for the front and rear braking torques, $T_{bf}$ and $T_{br}$, are

$$T_{bf} = lag_{brake}\big(actuator_{brake}(cmd_{brake})\big) \, h_f \, F_{f\,max}, \qquad T_{br} = lag_{brake}\big(actuator_{brake}(cmd_{brake})\big) \, h_r \, F_{r\,max} \tag{15}$$

where $h_f$ = 0.31 m and $h_r$ = 0.315 m are the heights from the ground to the front and rear axles respectively. The maximum brake forces $F_{f\,max}$ and $F_{r\,max}$ occur during wheel lock (slip = 1) and can be determined using the following equations

F m g l h ( ) F m g l h ( ) (16) max roll r max

where $\mu$ is the coefficient of friction between the road and the tire as specified in the tire subsystem model, $m$ = 1573 kg is the mass of the automobile, $g$ = 9.807 m/s² is the gravitational acceleration, $l_f$ = 1.34 m and $l_r$ = 1.491 m are the longitudinal distances from the center of gravity to the front and rear axles respectively, and $\mu_{roll}$ = 0.498 is the coefficient of rolling resistance for the left and right tires combined.

2.4. Drive-train Model

Pham et al (1997) describe the model of the drive-train subsystem for a front wheel drive automobile. A torque balance about each wheel yields the first-order differential equations for the front and the rear wheels

$$J_{wi} \dot{\omega}_{wi} = \tfrac{1}{2} T_s - \tfrac{3}{10} T_b - r_w F_{xi}, \quad i = 1, 2$$

$$J_{wi} \dot{\omega}_{wi} = -\tfrac{2}{10} T_b - r_w F_{xi}, \quad i = 3, 4 \tag{17}$$

where $T_s$ is the shaft torque calculated previously in the transmission subsystem model, $T_b$ is the total braking torque available, and $F_x$ is the longitudinal force of each tire, which is calculated by the tire model. An even distribution of the shaft torque is assumed, with half of the shaft torque going to each of the front wheels. The total available brake torque is assumed to be distributed 60% to the front and 40% to the rear wheels.

2.5. Suspension Model

Pham et al (1997) describe a simple one-dimensional quarter car model of an automotive suspension system with shock absorber and hardening spring (Peng 1992). Neglecting the small coupling terms, the suspension forces can be completely determined by the

local motion of each wheel (Tseng 1991). Let $e_i$ be the deflection at the $i$-th suspension joint.

s s e 1 z z h5 l e 2 z z h5 l 2 2 sr sr e 3 z z h5 lr e 4 z z h5 l (18) r 2 2

The terms in (18) are the nominal height and the current height of the vehicle, the pitch angle and the roll angle; $h_5$ = 0.1 m is the longitudinal distance from the center of gravity to the pitch center, $l_f$ = 1.34 m and $l_r$ = 1.491 m are the longitudinal distances from the center of gravity to the front and rear axles respectively, and $s_f$ = 1.45 m and $s_r$ = 1.45 m are the front and rear axle widths respectively. The spring force $F_s$ and damping force $F_d$ on each wheel are calculated using the following equations

$$F_{si} = C_1 (e_i + C_2 e_i^3) \;\; \text{N}, \qquad F_{di} = D_1 \dot{e}_i \;\; \text{N} \tag{19}$$

where $C_1$ = 4 N/m and $C_2$ = 4 m⁻⁴ are coefficients of the third-order polynomial fit for the suspension spring and $D_1$ = 1 N s/m is the damping constant. A vertical force balance is used to determine the normal force $F_N$ exerted on each wheel,

l FN 1 m g F i s F N (2) 2 i di l l

where $m$ = 1573 kg is the mass of the automobile and $g$ = 9.807 m/s² is the gravitational acceleration.

2.6. Tire Model

Pham et al (1997) describe a simplified tire model, referred to as the Bakker-Pacejka model, adopted from the work of Peng (1992). This model calculates the traction force resulting from the road-tire interaction based on empirical curve-fitting with experimental data for a Yokohama P205/60R14 87H tire (Peng 1992) (see Figure 4). In this model, tire pressure, tire camber angle, and the road and tire physical parameters are fixed, but the forces generated at the tire are functions of the slip ratio and the tire normal force $F_N$. The slip ratio is computed for either traction or braking using the following equations

$$\text{Traction:} \;\; slip = \frac{r_w \omega_w - V_x}{r_w \omega_w}, \qquad \text{Braking:} \;\; slip = \frac{r_w \omega_w - V_x}{V_x} \tag{21}$$

where $\omega_w$ is the rotational speed of each wheel determined in the drive-train subsystem model and the radius of the tire is $r_w$ = 0.34 m.
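The slip ratio computation of Equation (21) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function name, the rule used to choose between the traction and braking forms (comparing wheel surface speed with vehicle speed), and the small `eps` guard against division by zero are choices made for this example and are not taken from the paper.

```python
def slip_ratio(omega_w, v_x, r_w, eps=1e-6):
    """Slip ratio of one wheel, following the two cases of Equation (21).

    omega_w: wheel rotational speed (rad/s)
    v_x:     longitudinal vehicle speed (m/s)
    r_w:     tire radius (m)
    eps:     hypothetical guard that avoids dividing by zero at standstill
    """
    wheel_speed = r_w * omega_w
    if wheel_speed >= v_x:
        # Traction: the wheel surface moves faster than the vehicle.
        denominator = max(wheel_speed, eps)
    else:
        # Braking: the vehicle moves faster than the wheel surface.
        denominator = max(v_x, eps)
    return (wheel_speed - v_x) / denominator


if __name__ == "__main__":
    print(slip_ratio(omega_w=10.0, v_x=2.0, r_w=0.3))  # traction case
    print(slip_ratio(omega_w=0.0, v_x=20.0, r_w=0.3))  # locked wheel
```

With this sign convention a locked wheel gives a slip of -1; the wheel-lock condition quoted in Section 2.3 (slip = 1) corresponds to its magnitude.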
According to Bakker et al (1987), road-tire interaction under non-ideal conditions can be extrapolated from the ideal curve by multiplying the ideal tire forces by the coefficient of friction $\mu$. Typically, $\mu$ = 0.8 for average freeway operation, $\mu$ = 0.6 for wet road conditions, and $\mu$ = 0.2 for icy road conditions.

2.7. Vehicle Model Response

To demonstrate the performance and the validity of the vehicle dynamics simulation, the model is subjected to input signals for either the throttle or the brake command and the velocity response is charted. Although the simulation is a composite of various subsystems that does not correlate to a standard vehicle (i.e. Ford V6 3.0L engine,

Yokohama 15" radial tires, Toyota Camry chassis dimensions), a comparison with vehicle response data supplied by the NHTSA's Vehicle Research and Test Center (VRTC) for a 1997 Jeep Cherokee (Salaani and Heydinger 2000) is presented to illustrate analogous behaviour between an actual vehicle and the simulation. It is not precision that is being compared but rather accuracy in terms of behaviour.

Figure 5 shows the actual vehicle's velocity response to a throttle step input in 1st gear and the brake step response. In the throttle step, there is a smooth increase in acceleration which saturates at the top speed for the specific gear. In the brake step, from 0-6 s the throttle is released; this is known as power-off and results in a linear decrease in speed due to engine braking. At 6 s the brake is pressed and a linear decrease with a much steeper slope is seen.

A commercial mechanical simulation called Adams Car (MSC Software Corp.) is used as an intermediate validation tool by providing data for throttle and brake inputs not provided by Salaani and Heydinger (2000). The vehicle modelled in Adams Car is a high performance sports car; therefore a comparison with our simulation will show the same behaviour, but the responses of our simulation will be slower. Both simulations are subject to the same input signals and the results are presented in Figures 6 to 8.

Figure 6 shows the simulation velocity response to a throttle step input. Notice that the vehicle speed range is much larger since the simulation incorporates an automatic transmission system. The effects of the automatic transmission shifting can be seen as slight discontinuities in the response. Despite the differences, the step responses of both the vehicle (Figure 5) and the simulation follow the same behaviour. Figures 7 and 8 show the simulation velocity responses to brake step and throttle power-off inputs. The simulation responses match the vehicle responses (Figure 5) in terms of behaviour.

3. Controller Design

The outputs of the longitudinal control problem are i) the throttle angle, which controls the fuel/air mixture for the combustion process within the engine, and ii) the brake pedal position, which applies a braking torque to each wheel. In Figure 9, the vehicle model's velocity responses to a 100% throttle step input and a 50% throttle step input are charted. The responses can be characterized as second-order over-damped with a slight delay. By comparing the 50% response multiplied by a factor of two with the 100% response, we see that the vehicle model's response with respect to the throttle is clearly nonlinear.

In Figure 10 the vehicle model's velocity responses to a 100% brake step input and a 50% brake step input are shown. The responses show that during braking, Coulomb friction dominates the system. It is clear that the vehicle response to a 50% brake step input is not half of the response to the 100% signal, indicating that the modelled braking system is nonlinear.

Figure 8 shows the vehicle model's velocity response when the throttle is disengaged, which can be considered a step input from 1 to 0. The throttle power-off response resembles the brake system's response, although it is more gradual. It demonstrates Coulomb friction as well and can be considered a nonlinear response.

To address each of these nonlinear responses, different control systems are required depending on the operating conditions. Our approach is to divide the control space into regions within which the behaviour of the plant approximates linearity. A patch-work of linear controllers would then be able to address the entire operating envelope. These

linear controllers would all have the same form, but their gains would differ depending on the operating conditions. This common linear controller, along with its collection of gains, is considered a form of adaptive control referred to as gain scheduling (Astrom and Wittenmark 1994). The difference in our implementation of gain scheduling is that the tedious task of determining each gain is accomplished using a machine learning algorithm called Monte Carlo ES reinforcement learning.

3.1 Reinforcement Learning

Reinforcement learning (RL) is a machine learning approach where a software agent senses the environment through its states s and responds to it through its actions a under the control of a policy, $a = \pi(s)$. This policy is improved iteratively through the agent's experiences with the environment by a reinforcement learning algorithm, which in this paper is Monte Carlo ES (Figure 11). The environment provides the agent with numerical feedback called a reward for the current state, r = R(s). The environment also supplies the next state, based on the current state and the action taken, through the transition function. In this study, the transition function is provided by the vehicle model.

The control problem is formulated in a mathematical framework known as a finite Markov Decision Process (MDP) (Bellman 1957) by defining the states s, the actions a, the policy $\pi$, the reward function R(s) and the transition function over (s, a). The key feature of an MDP is that, to be considered Markov, its current state must be independent of previous states. This ensures that for each visit to a state, the software agent is given a path-independent reward. Subsequent actions will result in new states, giving rise to different rewards. The challenge of reinforcement learning is to determine the actions which result in the maximum reward for every possible state; this state-to-action mapping is called the optimal policy $\pi^*$ for the controller.
For the current state, actions that result in more favourable future states lead to higher rewards. The favourability of a certain action given the current state is known as the Q-value. As an agent experiences its environment, it updates the Q-value for each state-action pair (s, a) it visits according to its reinforcement learning algorithm. As it repeatedly visits every (s, a), it updates the policy so that the highest valued (s, a) will dominate. The optimal policy is reached when every state-action pair results in the highest reward possible; that is, when the Q-value function has been maximized.

The convergence of this maximization process requires that all states and actions be visited infinitely often in order for the estimates of the Q-value to reach their actual values. To ensure this convergence criterion, policies leading to $\pi^*$ are $\epsilon$-soft, meaning that there is a probability $\epsilon$ that a random action, or exploration, is selected. Therefore, all actions and states will be reached as $t \to \infty$. This process of policy improvement is referred to as a reinforcement learning algorithm. Specifically, Monte Carlo reinforcement learning algorithms improve the policy using the averaged sample returns experienced by the agent at the end of each episode (Sutton and Barto 1998).

The key to the process of improvement is the reward function, which expresses the desirability of being in the current state. It is the method of communicating to the agent the task to be performed. The challenge for the designer is to come up with a reward function that captures the essence of the task so that learning can be achieved.
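The policy improvement loop described above can be illustrated with a short Python sketch. It is not the Monte Carlo ES algorithm of Figure 11, which is not reproduced in this text; it is a simplified stand-in that averages sample returns per state-action pair and selects actions with an $\epsilon$-greedy rule, which is one kind of $\epsilon$-soft policy. The class name, method names and the toy return function in the usage lines are hypothetical.

```python
import random
from collections import defaultdict


class MonteCarloLearner:
    """Averages episode returns per (state, action) and acts epsilon-greedily."""

    def __init__(self, actions, epsilon=0.25, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.q = defaultdict(float)     # Q-value estimate per (state, action)
        self.visits = defaultdict(int)  # number of returns averaged so far

    def greedy_action(self, state):
        """Return the action with the highest current Q-value estimate."""
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def select_action(self, state):
        """With probability epsilon explore at random, otherwise exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return self.greedy_action(state)

    def update(self, state, action, episode_return):
        """Incremental mean of the sample returns seen for (state, action)."""
        key = (state, action)
        self.visits[key] += 1
        self.q[key] += (episode_return - self.q[key]) / self.visits[key]


if __name__ == "__main__":
    # Toy problem: one state, seven candidate actions, best action is 3.
    learner = MonteCarloLearner(actions=range(7))
    for _ in range(200):
        a = learner.select_action("s")
        learner.update("s", a, episode_return=-abs(a - 3))
    print(learner.greedy_action("s"))
```

Because each update happens once per episode, this structure matches the episodic use of Monte Carlo methods described above, where the quality of a choice is only known after the episode ends.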

3.2 Longitudinal Control

Simply stated, longitudinal control of a vehicle is the ability to follow another vehicle in traffic without colliding with it. That is, the controller must maintain a relative speed of zero with the vehicle ahead while maintaining a fixed distance behind the forward vehicle; this fixed distance will be referred to as $x_i$. During the process of control, the vehicle's relative speed, $V_{rel} = Vx_{i-1} - Vx_i$, and range, $X_{rel} = x_{i-1} - x_i$, to the vehicle ahead will provide feedback to the control system. Figure 12 shows how multiple vehicles are linked to provide longitudinal control for multiple vehicles.

Figure 13 shows the design of the control system. Two parallel control systems are used, one for throttle control and one for brake control. The throttle and brake controllers are each a combination of a digital Proportional-Derivative (PD) controller for $V_{rel}$ and a digital Proportional-Integral (PI) controller for $X_{rel}$. The difference equation which provides the throttle/brake command $m_n$ is shown below

$$m_n = m_{n-1} + k_{pv}(v_n - v_{n-1}) + \frac{k_{dv}}{T}(v_n - 2v_{n-1} + v_{n-2}) + k_{px}(x_n - x_{n-1}) + k_{ix} T x_n \tag{22}$$

where n is the current iteration of the control cycle, v is $V_{rel}$, x is $X_{rel}$, and T is the period of the control cycle. Moreover, $k_{pv}$, $k_{dv}$, $k_{px}$, and $k_{ix}$ are gains that are functions of the MDP state variables $s_1$, $s_2$, and $s_3$ as described in Table 1. This allows simultaneous regulation of the relative speed and the range, while reducing the steady state range error through the integral control of the range.

The results of both the throttle and brake controllers are fed into a logic element controlled by the gain $K_{coast}$, which decides whether throttle control or brake control is to be used. In this paper, $K_{coast}$ is set to 0.25; that is, throttle values less than 0.25 utilize the braking system rather than coasting.
The logic for this element is shown below

i ( throttle ) else i else cmd cmd throttle ( throttle throttle throttle, cmd, cmd cmdthrottle, cmdbrake K coast brake ) brake (23)

For a given operating point, there are eight parameters or gains which must be provided in a lookup table or schedule. By formulating the control problem as an MDP, the gain schedule can be learned using reinforcement learning. The episode is defined as starting at the onset of a change in $Vx_{i-1}$ and ending when $Vx_i = Vx_{i-1}$ or when $Vx_{i-1}$ has been changed. This follows the logic that when a new velocity is required, a set of gains should be selected from the gain schedule and applied for the duration of that command. The goodness of a set of gains can therefore only be assessed once the command is complete; thus the MDP is episodic in nature and the Monte Carlo ES reinforcement learning algorithm described in Figure 11 is used to learn the gain schedule.

The choice of states follows from the nonlinear nature of the throttle plant. At different initial speeds the throttle responds differently. Therefore, the controller gains will differ from a given initial speed to a final speed. In addition, the distance required to achieve this acceleration/deceleration, which is reflected in the change in vehicle spacing, is also an independent variable for the gain schedule. These three parameters are used as states (Table 1). The actions are the eight values which represent the gains used in the digital control system (Table 2).
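The controller of Section 3.2 can be sketched in Python as follows. The sketch implements one incremental update of the difference equation (22), read here as a velocity-form PD term on the relative speed plus a PI term on the range. The class names, the zero initial history and the gain values in the usage lines are assumptions made for illustration; command saturation and the $K_{coast}$ selection logic of (23) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Gains:
    k_pv: float  # proportional gain on relative speed
    k_dv: float  # derivative gain on relative speed
    k_px: float  # proportional gain on range
    k_ix: float  # integral gain on range


class IncrementalController:
    """Velocity-form PD (relative speed) plus PI (range) controller."""

    def __init__(self, gains, period):
        self.gains = gains
        self.period = period  # control cycle period T in seconds
        self.m_prev = 0.0     # previous command m_{n-1}
        self.v_prev = 0.0     # v_{n-1}
        self.v_prev2 = 0.0    # v_{n-2}
        self.x_prev = 0.0     # x_{n-1}

    def step(self, v, x):
        """Return the command m_n for the current relative speed v and range x."""
        g, T = self.gains, self.period
        m = (self.m_prev
             + g.k_pv * (v - self.v_prev)
             + (g.k_dv / T) * (v - 2.0 * self.v_prev + self.v_prev2)
             + g.k_px * (x - self.x_prev)
             + g.k_ix * T * x)
        # Shift the stored history for the next control cycle.
        self.m_prev = m
        self.v_prev2, self.v_prev = self.v_prev, v
        self.x_prev = x
        return m


if __name__ == "__main__":
    # Hypothetical gains and period, chosen only to exercise the update.
    controller = IncrementalController(Gains(2.0, 0.1, 1.0, 0.05), period=0.1)
    for v_rel, x_rel in [(1.0, 5.0), (0.8, 4.9), (0.5, 4.7)]:
        print(round(controller.step(v_rel, x_rel), 4))
```

In a gain-scheduled setting, the `Gains` instance would be swapped at the start of each episode for the set selected from the learned schedule.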

Table 1 States of the longitudinal MDP

| State | Description | Digitization Sets |
|---|---|---|
| S1 | Vx: initial vehicle speed | {5, 10, 15, 20, 25, 30, 35, 40} m/s |
| S2 | Vx_{i-1}: target vehicle speed | {5, 10, 15, 20, 25, 30, 35, 40} m/s |
| S3 | x_f - x_0: change in vehicle spacing | {-10, -9, -8, …, 8, 9, 10} m |

Table 2 Actions of the longitudinal MDP

| Action | Description | Digitization Sets |
|---|---|---|
| A1 | K_p: Throttle Proportional Gain (x) | {0.1, 0.2, 0.3, …, 9.9}, n_s = 100 |
| A2 | K_i: Throttle Integral Gain (x) | {0.01, 0.02, 0.03, …, 0.99}, n_s = 100 |
| A3 | K_d: Throttle Proportional Gain (V) | {0.1, 0.2, 0.3, …, 9.9}, n_s = 100 |
| A4 | K_i2: Throttle Derivative Gain (V) | {0.01, 0.02, 0.03, …, 0.99}, n_s = 100 |
| A5 | K_p: Brake Proportional Gain (x) | {0.1, 0.2, 0.3, …, 9.9}, n_s = 100 |
| A6 | K_i: Brake Integral Gain (x) | {0.01, 0.02, 0.03, …, 0.99}, n_s = 100 |
| A7 | K_d: Brake Proportional Gain (V) | {0.1, 0.2, 0.3, …, 9.9}, n_s = 100 |
| A8 | K_i2: Brake Derivative Gain (V) | {0.01, 0.02, 0.03, …, 0.99}, n_s = 100 |

The reward function, which reflects the specification of the control problem, is a discrete function of the feedback variables, the current normalized relative range and normalized relative speed of the vehicle, and is expressed below.

R R V ) R ( X ) (24) Total V ( rel X rel 1 i X rel.1 R X ( X rel 1) R V ( Vrel 1) 1 i Vrel. 1 1 i X rel

For a given episode, the solution which maximizes the reward, or minimizes $X_{rel}$ and $V_{rel}$ without colliding with the vehicle ahead ($X_{rel} < 0$), will be favoured. These favoured solutions will be explored to determine the optimal solution.

4. Reinforcement Learning Experiments

The RL experiments obtain an optimal policy $\pi^*$ for the longitudinal control of the vehicle. An experiment consists of 3 episodes, with $\epsilon$ = 0.25 for the $\epsilon$-soft greedy policy, for a particular combination of the three states. For each episode, the agent must follow another vehicle placed ahead of it which is travelling at a constant speed. Once the leading vehicle has reached the end of the test track, the episode is complete.
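To make the state digitization of Table 1 concrete, the following Python sketch snaps continuous measurements to the nearest level of each digitization set and forms the three-component state used to index the gain schedule. It assumes the sets read as 5 to 40 m/s in steps of 5 m/s and -10 to 10 m in steps of 1 m; that reading, the nearest-level rule and every name in the code are assumptions made for this example.

```python
# Digitization sets as read from Table 1 (assumed values, see the lead-in).
SPEED_LEVELS = [5, 10, 15, 20, 25, 30, 35, 40]  # m/s
SPACING_LEVELS = list(range(-10, 11))           # m


def digitize(value, levels):
    """Return the level closest to value (the first one wins on a tie)."""
    return min(levels, key=lambda level: abs(level - value))


def to_state(initial_speed, target_speed, spacing_change):
    """Build the (S1, S2, S3) state tuple from continuous measurements."""
    return (digitize(initial_speed, SPEED_LEVELS),
            digitize(target_speed, SPEED_LEVELS),
            digitize(spacing_change, SPACING_LEVELS))


if __name__ == "__main__":
    print(to_state(12.4, 31.0, -3.6))
    # Number of distinct state combinations under the assumed sets.
    print(len(SPEED_LEVELS) ** 2 * len(SPACING_LEVELS))
```

Under these assumed sets each of the eight gains is stored once per state combination, which is why the learning experiment is repeated for every combination of the three states.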
The distance of the test track depends on the speed $v$ of the lead vehicle according to the following equation.

[Equation (25): $x_{max}$ as a function of the lead-vehicle speed $v$, in metres] (25)

During each step of an episode, a reward is generated by (24); this reward is accumulated over the course of the episode to measure the controller's tracking performance with a particular set of actions. Since it is possible to collide with the vehicle ahead during an episode, it is beneficial for the reward to be averaged so that it reflects how far the lead vehicle made it during the course of the episode. Therefore, the average reward over the course of the entire episode is given by the following equation.

[Equation (26): $R_{avg}$ in terms of the sum of the step rewards $R_i$ up to the final step, $x_{max}$ and $x_{final}$] (26)

Figure 14 shows the average reward as the agent progresses through the learning cycle for a particular state combination. The learning performance is similar for all combinations. One can observe a steady increase in the average reward, which eventually reaches a plateau. The learned optimal policy is a collection of eight four-dimensional discrete hyperspaces, one for each gain of the longitudinal controller; that is, four for the throttle controller and four for the brake controller.

$$\pi^* = \left\{ \begin{bmatrix} k_{pv}(v^{i}, v^{i-1}, \Delta x) \\ k_{dv}(v^{i}, v^{i-1}, \Delta x) \\ k_{px}(v^{i}, v^{i-1}, \Delta x) \\ k_{ix}(v^{i}, v^{i-1}, \Delta x) \end{bmatrix}_{Throttle}, \; \begin{bmatrix} k_{pv}(v^{i}, v^{i-1}, \Delta x) \\ k_{dv}(v^{i}, v^{i-1}, \Delta x) \\ k_{px}(v^{i}, v^{i-1}, \Delta x) \\ k_{ix}(v^{i}, v^{i-1}, \Delta x) \end{bmatrix}_{Brake} \right\} \tag{27}$$

5. Controller Performance Experiments

These experiments demonstrate the tracking performance of the optimal policy at various operating points. Three control situations are shown which form the basis of platoon maneuvers that allow vehicles to enter or exit formations.

The first of these, shown in Figure 15, is speed control. The vehicle must reach a final speed of 20 m/s while maintaining a separation distance to the vehicle ahead of 2 m. At an initial speed of 10 m/s, the vehicle immediately decelerates to create room to accelerate to the higher speed. At an initial speed of 30 m/s, the vehicle shows a negative range, which means the vehicle cannot maintain its separation distance as it slows down.

The second control situation, shown in Figure 16, is referred to as negative range control. The vehicle must move from an inter-vehicle space of 15 m to 5 m while maintaining a speed of 10, 20, and 30 m/s. The closing of the gap is accomplished within 5 m, with minimal velocity fluctuation and no overshoot.

Figure 17 shows the third control situation, positive range control.
The vehicle must move from an inter-vehicle space of 5 m to 15 m while maintaining a speed of 10, 20, and 30 m/s. In opening the gap, the vehicle's velocity fluctuates during the manoeuvre, with some overshoot in range. These experiments represent the basis for platoon maneuvers which allow vehicles to enter the newly opened space or to close the formation when a vehicle has left.

6. Multi-Vehicle Performance Experiments

These experiments show the operation of the control system within a five-car formation, or platoon. Five control situations have been chosen to demonstrate the range tracking performance of the optimal policy for each of the four following vehicles.

Figure 18 shows the results of a five-car formation moving at a constant speed of 20 m/s. In the first experiment, the inter-vehicle spacing is set to 5 m between each car. At time t = 0 s, Car 2 is instructed to open the space in front to 15 m. The results show Car 2 overshooting the 15 m to roughly 22 m; within 35 s the car has reached a steady-state separation of 15 m, and the following cars reach the steady state by 70 s. In the second experiment, the inter-vehicle spacing is set to 15 m between each car. At time t = 0 s, Car 2 is instructed to close the space in front to 5 m. The results show Car 2 reaching 5 m in 1 s without overshoot; the following cars reach 5 m in 5 s.

Figure 19 shows the results of a five-car formation trying to maintain constant spacing while accelerating or decelerating from 20 m/s. In the first experiment, the inter-vehicle spacing is set to 2 m between each car. At time t = 0 s, Car 1 is to accelerate to 30 m/s. The results show close tracking of the velocity, with some oscillation present. The second experiment shows the deceleration of the vehicle to 10 m/s with an inter-vehicle spacing of 15 m. The tracking of the velocity is excellent.

Figure 20 shows an emergency stop situation with a 15 m inter-vehicle spacing. The velocity tracking is excellent, with a very steep deceleration. All vehicles stop without colliding with the vehicle ahead.

7. Conclusions

In this paper the nonlinear nature of the vehicle dynamics is shown. The nonlinearities present in the engine model, the transmission model, and the tire model result in a complex nonlinear model. From this, we conclude that linearization of the longitudinal model may not be suitable for the entire operating range of the vehicle. The linear controllers resulting from using a simplified linear model of the vehicle dynamics in the design process may only be adequate for a particular operating point. The use of a more accurate nonlinear vehicle dynamics model in the design process should result in better nonlinear control systems for longitudinal control. In this paper, an adaptive control system using gain scheduling is introduced whereby the gains are learned using reinforcement learning.
Even with a simple reward function, it is possible for Monte Carlo reinforcement learning to converge upon an optimal policy within 3 episodes for a particular operating regime; therefore, the MDP properly describes the task to be learned. When the learned optimal policies are combined to provide an adaptive control surface for a gain schedule, nonlinear control is achieved throughout the operating range. The performance of the controller at specific operating points shows accurate tracking of both velocity and position in most cases. When the adaptive controller is deployed in a multi-vehicle convoy or platoon, the tracking performance is less smooth. As the second car attempts to track the leader, slight oscillations result. This oscillation is passed to the following cars, but as we move farther back in the formation, the oscillations decrease, implying stability. The performance of the adaptive controller in a multi-vehicle convoy or platoon shows promise and forms the basis of higher-level platoon maneuvers.

Acknowledgement

This research is funded by the Auto21 Network of Centres of Excellence, an automotive research and development program focusing on issues relating to the automobile in the 21st century. AUTO21 is a member of the Networks of Centres of Excellence of Canada program. Web site: http://www.auto21.ca

References

Varaiya, P. (1993) Smart cars on smart roads: problems of control. IEEE Transactions on Automatic Control, Vol. 32, March.

Shladover, S.E., Desoer, C.A., Hedrick, J.K., Tomizuka, M., Walrand, J., Zhang, W.B., McMahon, D.H., Peng, H., Sheikholeslam, S. and McKeown, N. (1991) Automatic vehicle control developments in the PATH program. IEEE Transactions on Vehicular Technology, Vol. 40, No. 1, pp. 114-130.

Hedrick, J.K., Tomizuka, M. and Varaiya, P. (1994) Control issues in automated highway systems. IEEE Control Systems Magazine, Vol. 14, No. 6, Dec., pp. 21-32.

Ioannou, P.A. and Chien, C.C. (1993) Autonomous Intelligent Cruise Control. IEEE Transactions on Vehicular Technology, Vol. 42, No. 4, Nov., pp. 657-672.

Raza, H. and Ioannou, P. (1997) Vehicle following control design for automated highway systems. Proceedings of 1997 IEEE 47th Vehicular Technology Conference, Phoenix, AZ, USA, Vol. 2, pp. 904-908.

Maciuca, D.B. and Hedrick, J.K. (1995) Advanced Nonlinear Brake System Control for Vehicle Platooning. Proceedings of the Third European Control Conference (ECC 1995), Rome, Italy.

Swaroop, D. and Hedrick, J.K. (1994) Direct Adaptive Longitudinal Control for Vehicle Platoons. IEEE Conference on Decision and Control, December.

Rajamani, R., Tan, H.S., Law, B.K. and Zhang, W.B. (2000) Demonstration of integrated longitudinal and lateral control for the operation of automated vehicles in platoons. IEEE Transactions on Control Systems Technology, Vol. 8, No. 4, July, pp. 695-708.

Kato, S., Tsugawa, S., Tokuda, K., Matsui, T. and Fujii, H. (2002) Cooperative Driving of Automated Vehicles with Inter-vehicle Communications. IEEE Transactions on Intelligent Transportation Systems, Vol. 3, No. 3, pp. 155-161.

Zhang, J. and Ioannou, P.A. (2005) Adaptive Vehicle Following Control System with Variable Time Headways. Proceedings of the 44th IEEE Conference on Decision and Control and 2005 European Control Conference (CDC-ECC '05), pp. 3880-3885.

Khatir, M.E. and Davison, E.J. (2006) A Decentralized Lateral-Longitudinal Controller for a Platoon of Vehicles Operating on a Plane. Proceedings of the 2006 American Control Conference, June 14-16, Minneapolis, Minnesota, USA.

Thorpe, C., Jochem, T. and Pomerleau, D. (1997) The 1997 automated highway free agent demonstration. Proceedings of the IEEE Conference on Intelligent Transportation Systems (ITSC 97), Boston, MA, USA, pp. 495-501.

Tan, H.S., Rajamani, R. and Zhang, W.B. (1998) Demonstration of an automated highway platoon system. Proceedings of the 1998 American Control Conference, Philadelphia, PA, USA, Vol. 3, pp. 1823-1827.

Tsugawa, S., Kato, S., Matsui, T., Naganawa, H. and Fujii, H. (2000) An architecture for cooperative driving of automated vehicles. Proceedings of 2000 IEEE Intelligent Transportation Systems, Dearborn, MI, USA, pp. 422-427.

Michaud, F., Lepage, P., Frenette, P., Létourneau, D. and Gaubert, N. (2006) Coordinated maneuvering of automated vehicles in platoon. IEEE Transactions on Intelligent Transportation Systems, Special Issue on Cooperative Intelligent Vehicles, 7(4):437-447.

Haug, E.J. (1990) Feasibility Study and Conceptual Design of a National Advanced Driving Simulator. NHTSA Contract DTNH22-89-7352, Report No. DOT-HS-87-597, March.

Salaani, M.K., Grygier, P.A. and Heydinger, G.J. (2001) Model Validation of the 1998 Chevrolet Malibu for the National Advanced Driving Simulator. March 2001, SAE Paper 2001-01-0141.

McMahon, D.H. and Hedrick, J.K. (1989) Longitudinal model development for automated roadway vehicles. Institute of Transportation Studies, California Partners for Advanced Transit and Highways (PATH), University of California, Berkeley, USA.

Peng, H. and Tomizuka, M. (1991) Optimal Preview Control for Vehicle Lateral Guidance. Research Report UCB-ITS-PRR-91-16, Institute of Transportation Studies, California Partners for Advanced Transit and Highways (PATH), University of California, Berkeley, USA.

Pham, H., Hedrick, K. and Tomizuka, M. (1994) Combined lateral and longitudinal control of vehicles for IVHS. Proceedings of the 1994 American Control Conference, Vol. 2, pp. 1205-1206.

Pham, H., Tomizuka, M. and Hedrick, K. (1997) Integrated Maneuvering Control for Automated Highway Systems Based on a Magnetic Reference Sensing System. Research Report UCB-ITS-PRR-97-28, Institute of Transportation Studies, California Partners for Advanced Transit and Highways (PATH), University of California, Berkeley, USA.

Peng, H., Zhang, W.B., Arai, A., Lin, Y., Hessburg, T., Devlin, P., Tomizuka, M. and Shladover, S. (1992) Experimental Automatic Lateral Control System for an Automobile. Research Report UCB-ITS-PRR-92-11, Institute of Transportation Studies, California Partners for Advanced Transit and Highways (PATH), University of California, Berkeley, USA.

Astrom, K.J. and Wittenmark, B. (1994) Adaptive Control. Addison-Wesley.

Bellman, R.E. (1957) A Markov decision process. Journal of Mathematical Mech., Vol. 6, pp. 679-684.

Sutton, R.S. and Barto, A.G. (1998) Reinforcement Learning: An Introduction. A Bradford Book, The MIT Press, Cambridge, MA, USA.

Figures:

Figure 1 Schematic of the vehicle model

Figure 2 Engine volumetric efficiency surface

Figure 3 Schematic of transmission system (McMahon and Hedrick 1989)

Figure 4 Longitudinal force-slip for Yokohama P205/60R14 87H (ideal = 1.0)

Figure 5 1997 Jeep Cherokee throttle step and brake step response (Salaani and Heydinger 2000)

Figure 6 Adams Car and Simulation throttle step response

Figure 7 Adams Car and Simulation brake step response

Figure 8 Adams Car and Simulation power-off response

Figure 9 Vehicle model velocity responses to throttle step inputs

Figure 10 Vehicle model velocity responses to brake step inputs

    Initialize, for all s ∈ S, a ∈ A(s):
        Q(s, a) ← arbitrary
        π(s) ← arbitrary
        Returns(s, a) ← empty list
    Repeat forever:
        (a) Generate an episode using exploring starts
        (b) For each pair (s, a) appearing in the episode:
                R ← return following the first occurrence of (s, a)
                Append R to Returns(s, a)
                Q(s, a) ← average(Returns(s, a))
        (c) For each s in the episode:
                π(s) ← arg max_a Q(s, a)

Figure 11 Monte Carlo ES algorithm

Figure 12 Overview of longitudinal control system

Figure 13 Block diagram of longitudinal controller

Figure 14 Performance of Reinforcement Learning experiments

Figure 15 Speed control experiment

Figure 16 Negative range control experiment

Figure 17 Positive range control experiment

Figure 18 Multi-vehicle range control experiment: Open and Close

Figure 19 Multi-vehicle velocity control experiment: Acceleration and Deceleration

Figure 20 Multi-vehicle velocity control experiment (Emergency Stop)