Adaptive Critic Designs-Based Autonomous Unmanned Vehicles Navigation: Application to Robotic Farm Vehicles

RL paradigms


Introduction
Unmanned vehicles such as Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) are mechanical devices capable of moving in some environment with a certain degree of autonomy. These vehicles use an IMU (Inertial Measurement Unit), high-precision GPS RTK (Global Positioning System, Real-Time Kinematics), encoders, a compass, and tilt sensors to position themselves and follow waypoints. A picture of a vehicle with these characteristics is shown in Figure 1. Their use is becoming more frequent for both intensive and extensive agriculture, in the precision agriculture context. For example, in the USA or Argentina, with millions of arable hectares, it is essential to have autonomous farm machines for handling and managing the growth, quality, and yield of the crops. The environment where these vehicles are used can be classified as:

• Structured or partially structured, when it is well known and the motion can be planned in advance. In general, this is the case of navigation and guidance of mobile robots.

• Unstructured, when there are uncertainties which imply some on-line planning of the motion; this is the case of navigation and guidance of robotic aerial vehicles.

In general, the objective of controlling autonomous vehicles implies solving the problems of sensing, path planning, and kinematic and dynamic control. The autonomy of a vehicle is related to determining its own position and velocity without external aids. Autonomy is very important to certain military vehicles and to civil vehicles operating in areas of inadequate radio-navigation coverage. Regarding trajectory planning, there are many approaches (Aicardi et al., 1995). Many works have been published on the control of autonomous vehicles, mainly on UGVs or mobile robots. Some of them propose stable control algorithms based on Lyapunov theory (Singh & Fuller, 2001). Others have focused on optimization-based planning and control (Kuwata et al., 2005), (Patiño et al., 2008).
In this paper we propose the use of Adaptive Critic Designs (ACDs) to autonomously design an optimal path planning and control strategy for robotic unmanned vehicles, in particular for a mobile robot, following previous work (Liu & Patiño, 1999a), (Liu & Patiño, 1999b). We consider a mobile robot with two actuated wheels, and the autonomous control system is designed for the kinematic and dynamic model. The kinematic mobile robot model for the so-called kinematic wheels, under the nonholonomic constraint of pure rolling and non-slipping, is given by

$$\dot{q} = S(q)\,v \tag{1}$$

where $q(t), \dot{q}(t) \in \Re^3$ are defined as $q = [x, y, \theta]^T$ and $\dot{q} = [\dot{x}, \dot{y}, \dot{\theta}]^T$; $x(t)$, $y(t)$ and $\theta(t)$ denote the linear position and orientation, respectively, of the center of mass of the mobile vehicle; $\dot{x}(t)$, $\dot{y}(t)$ denote the Cartesian components of the linear velocity of the vehicle; $\dot{\theta}(t)$ denotes the angular velocity of the mobile robot; the matrix $S(q) \in \Re^{3 \times 2}$ is defined as

$$S(q) = \begin{bmatrix} \cos(\theta) & 0 \\ \sin(\theta) & 0 \\ 0 & 1 \end{bmatrix} \tag{3}$$

and the velocity vector is $v = [v_l, w]^T$, with $v_l \in \Re$ denoting the constant straight-line velocity and $w(t) \in \Re$ the angular velocity of the mobile robot. The model also considers the dynamics of the car-driving device, which contains a dc motor, a dc amplifier, and a gear transmission system. The state of the mobile vehicle is given by (cf. Figure 1) the coordinates of the robot $(x, y)$, the orientation of the vehicle, $\theta$, and the actual turning rate of the robot, $\dot{\theta}$. The control signal is the desired turning rate of the mobile vehicle, $w_R$.
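The kinematic model above can be illustrated with a short simulation. The following Python sketch integrates $\dot{q} = S(q)\,v$ with a forward-Euler step; the step size `dt`, the function names, and the Euler discretization are assumptions made for illustration and are not taken from the original work.

```python
import math

def unicycle_step(q, v_l, w, dt):
    """One forward-Euler step of q_dot = S(q) v, with v = [v_l, w].

    q is the pose (x, y, theta); dt is a hypothetical integration step.
    """
    x, y, theta = q
    # Rows of S(q) v: [cos(theta) * v_l, sin(theta) * v_l, w]
    return (x + dt * v_l * math.cos(theta),
            y + dt * v_l * math.sin(theta),
            theta + dt * w)

def simulate(q0, v_l, turn_rates, dt):
    """Apply a sequence of turning rates at constant straight-line speed v_l."""
    q = q0
    path = [q]
    for w in turn_rates:
        q = unicycle_step(q, v_l, w, dt)
        path.append(q)
    return path

# Example: ten steps with zero turning rate give straight-line motion along x.
path = simulate((0.0, 0.0, 0.0), v_l=1.0, turn_rates=[0.0] * 10, dt=0.1)
```

Because each step moves the vehicle only along its current heading, the displacement satisfies the pure-rolling, non-slipping constraint by construction.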

Control problem formulation
As previously defined, the reference trajectory is generated via a reference vehicle which moves according to the following dynamic trajectory,

$$\dot{q}_R = S(q_R)\,v_R \tag{5}$$

where $S(\cdot)$ was defined in (3), $q_R(t)$ is the reference position and orientation, and $v_R(t)$ is the reference velocity. With regard to (5), it is assumed that the signal $v_R(t)$ is constructed to produce the desired motion, and that $v_R(t)$, $\dot{v}_R(t)$, $q_R(t)$, and $\dot{q}_R(t)$ are bounded for all time. In general, vehicle motion control can be classified into: i) Positioning without prescribed orientation: in this case a final destination point is specified; ii) Positioning with prescribed orientation: in this case a destination point has to be reached with a desired orientation; and iii) Path following: here, the path is defined through a sequence of waypoints. In the first experiment, the control objective is limited to the first case; that is, given a reference point $(x_R, y_R)$ located in the workspace, and considering the vehicle dynamical model, it is desired to obtain autonomously a sequence of optimal control actions (values of the turning rate) such that the vehicle reaches the target point as fast as possible (cf. Figure 2) and with minimum energy consumption. Since the mobile robot's speed, $v_l$, is taken as constant, minimum-time control is equivalent to shortest-path control. The design of the control system will be based on adaptive critic designs, in particular HDP (Werbos, 1992), (Bellman, 1957). The next section presents the background material needed for the present work.

Introduction to dynamic programming
Suppose that one is given a discrete-time nonlinear (time-varying) system,

$$x(k+1) = F[x(k), u(k), k] \tag{7}$$

where $x \in \Re^n$ represents the (complete) state vector of the system and $u \in \Re^m$ denotes the control action. Suppose that it is desired to minimize for (7) a performance index (or cost),

$$J[x(i), i] = \sum_{k=i}^{\infty} \gamma^{k-i}\, U[x(k), u(k), k] \tag{8}$$

where $U$ is called the utility function or local cost function, and $\gamma$ is the discount factor with $0 \le \gamma \le 1$. Note that $J$ is dependent on the initial time $i$ and the state $x(i)$, and it is referred to as the cost-to-go of the state $x(i)$. The objective is to choose the control sequence $u(k)$, $k = i, i+1, \ldots$, so that the $J$ function (the cost) in (8) is minimized. The cost in this case accumulates indefinitely; these kinds of problems are referred to as infinite horizon problems in Dynamic Programming. On the other hand, in finite horizon problems, the cost accumulates over a finite number of steps.

Dynamic programming is based on Bellman's principle of optimality (Lewis & Syrnos, 1995), (Prokhorov & Wunsch, 1997), which establishes that an optimal (control) policy has the property that, no matter what the previous decisions (i.e., controls) have been, the remaining decisions must constitute an optimal policy with regard to the state resulting from those previous decisions. Suppose that we have computed the optimal cost $J^*[x(k+1), k+1]$ from time $k+1$ to the terminal time for all possible states $x(k+1)$, and that we have also found the optimal control sequences from time $k+1$ on. The optimal cost results when the optimal control sequence $u^*(k+1), u^*(k+2), \ldots$ is applied to the system with initial state $x(k+1)$. Note that the optimal control sequence depends on $x(k+1)$. If we apply an arbitrary control $u(k)$ at time $k$ and then use the known optimal control sequence from $k+1$ on, the resulting cost will be

$$U[x(k), u(k), k] + \gamma J^*[x(k+1), k+1]$$

where $x(k)$ is the state at time $k$ and $x(k+1)$ is determined by (7). According to Bellman, the optimal cost from time $k$ on is equal to

$$J^*[x(k), k] = \min_{u(k)} \left( U[x(k), u(k), k] + \gamma J^*[x(k+1), k+1] \right) \tag{9}$$

The optimal control $u^*(k)$ at time $k$ is the $u(k)$ that achieves the minimum. Equation (9) is the principle of optimality for discrete-time systems. Its importance lies in the fact that it allows us to optimize over only one control vector at a time by working backward in time. Dynamic programming is a very useful tool in solving optimization and optimal control problems. In particular, it can easily be applied to nonlinear systems with constraints on the control and state variables, and to arbitrary performance indexes.
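The backward recursion implied by the principle of optimality can be sketched on a small finite-horizon problem. In the Python example below, the integer state grid, the three admissible controls, the system $x(k+1) = x(k) + u(k)$, the utility $U = x^2 + u^2$, and the zero terminal cost are all hypothetical choices made for illustration; only the recursion itself follows the text.

```python
def backward_dp(states, controls, F, U, horizon, gamma=1.0):
    """Finite-horizon dynamic programming by backward recursion:
    J*[x, k] = min_u ( U(x, u, k) + gamma * J*[F(x, u, k), k + 1] ).
    """
    # Terminal cost J*[x, horizon] = 0 for every state (an assumption).
    J = {x: 0.0 for x in states}
    policy = []
    for k in reversed(range(horizon)):
        J_k, pi_k = {}, {}
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls:
                x_next = F(x, u, k)
                if x_next not in J:
                    continue  # control would leave the state grid
                cost = U(x, u, k) + gamma * J[x_next]
                if cost < best_cost:
                    best_u, best_cost = u, cost
            J_k[x], pi_k[x] = best_cost, best_u
        J = J_k
        policy.insert(0, pi_k)  # policy[k][x] is the optimal control at time k
    return J, policy

# Hypothetical toy problem: drive an integer state toward the origin.
states = range(-3, 4)
controls = (-1, 0, 1)
F = lambda x, u, k: x + u
U = lambda x, u, k: x * x + u * u
J0, policy = backward_dp(states, controls, F, U, horizon=5)
```

Each pass of the outer loop optimizes over a single control at a single time step, which is the computational saving the principle of optimality provides; the cost is that the table `J` must cover every state, which motivates approximation for larger problems.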

Adaptive critic designs
In the computations in (9), whenever one knows the function $J$ and the model $F$ in (7), it is a simple problem in function minimization to pick the actions $u^*(k)$ which minimize $J$. However, due to the backward numerical process required, it is too computationally expensive to determine the exact $J$ function for most real problems, even when the scale of the problem is considered small. Therefore, approximation methods are needed in practice when performing dynamic programming (Werbos, 1992), (Bellman, 1957), (Balakrishnan & Biega, 1995). Instead of solving for the value of the $J$ function for every possible state, one can use a function approximation structure such as a neural network to approximate the $J$ function. There are three basic methods proposed in the literature for approximating dynamic programming. They are collectively called Adaptive Critic Designs, and include Heuristic Dynamic Programming (HDP), Dual Heuristic Programming (DHP), and Globalized Dual Heuristic Programming (GDHP) (Bellman, 1957), (Werbos, 1990), (Balakrishnan & Biega, 1995). A typical adaptive critic design consists of three modules: Critic, Model, and Action. The present work considers the case where each module is a neural network; the designs in this case are referred to as neural network-based adaptive critic designs. The following introduces HDP.

In HDP (Werbos, 1990), (Werbos, 1992), (Lewis & Syrnos, 1995), (Balakrishnan & Biega, 1995), the critic network output estimates the $J$ function in equation (8). This is done by minimizing the following error measure over time,

$$\|E_1\| = \sum_k E_1(k) = \frac{1}{2} \sum_k \left[ J(k) - U(k) - \gamma J(k+1) \right]^2 \tag{10}$$

where $J(k) = J[x(k), k, W_C]$ and $W_C$ represents the parameters of the critic network. The function $U$ is chosen as a utility function which indicates the performance of the overall system (see examples in (Balakrishnan & Biega, 1995), (Werbos, 1990)). It is usually a function of $x(k)$, $u(k)$, and $k$, i.e., $U(k) = U[x(k), u(k), k]$. When $E_1(k) = 0$ for all $k$, (10) implies that

$$J(k) = U(k) + \gamma J(k+1) = \sum_{i=k}^{\infty} \gamma^{i-k}\, U(i) \tag{11}$$

which is exactly the same as in dynamic programming [cf. (8)]. In Eq. (11), it is assumed that $J(k) < \infty$, which can usually be guaranteed by choosing the discount factor $\gamma$ such that $0 < \gamma < 1$.

The training samples for the critic network are obtained over a trajectory starting from $x(0) = x_0$ at $k = 0$. The trajectory can be either over a fixed number of time steps [e.g., 300 consecutive points] or from $k = 0$ until the final state is reached. The training process is repeated until no more weight updates are needed. The weight update, during the $p$th training iteration, is given by

$$W_{C,i}^{(p+1)} = W_{C,i}^{(p)} - \eta_1 \frac{\partial E_1(k)}{\partial W_{C,i}^{(p)}} = W_{C,i}^{(p)} - \eta_1 \left[ J(k) - U(k) - \gamma J(k+1) \right] \frac{\partial J(k)}{\partial W_{C,i}^{(p)}} \tag{12}$$

where $\eta_1 > 0$ is the learning rate and $W_{C,i}$ is the $i$th component of $W_C$. Note that the gradient method is used in (12) and that the $p$th iteration corresponds to a certain time instant $k$ [hence the use of $E_1(k)$ in (12)]. The weight update can also be performed in batch mode, e.g., after the completion of each trajectory. The model network in an adaptive critic design predicts $x(k+1)$ given $x(k)$ and $u(k)$; it is needed for the computation of $J(k+1)$ in (12) for the weight update. The model network learns the mapping given in equation (7); it is trained previously off-line (Werbos, 1992), (Bellman, 1957), (Balakrishnan & Biega, 1995), or trained in parallel with the critic and action networks. Here, $J(k+1)$ is calculated using $W_C^{(p-1)}$ and its dependence on $W_C^{(p)}$ is not considered, according to (Liu & Patiño, 1999a).

After the critic network's training is finished, the action network's training starts with the objective of minimizing $J(k+1)$. The action network generates an action signal $u(k) = u[x(k), k, W_A]$; its training follows a procedure similar to the one for the critic network. The training process is repeated until no more weight updates are needed, while keeping the critic network's weights fixed. During the $p$th training iteration, the weight update is given by

$$W_{A,i}^{(p+1)} = W_{A,i}^{(p)} - \alpha_1 \frac{\partial J(k+1)}{\partial W_{A,i}^{(p)}} \tag{13}$$

where $\alpha_1 > 0$. Again, the model network is required for the computation of $\partial x_i(k+1) / \partial u_j(k)$, which appears when the gradient in (13) is expanded by the chain rule. It can be seen in (13) that information is propagated backward through the critic network to the model network and then to the action network, as if the three networks formed one large feedforward network. After the action network's training cycle is completed, one may check its performance, and then either stop or continue the training procedure by entering the critic network's training cycle again if the performance is not yet acceptable.
It is emphasized that in the methods described above, knowledge of desired target values for the function $J$ and the action signal $u(k)$ is not required in the neural network training. In conventional applications of neural networks for function approximation, knowledge of the desired target values of the function to be approximated is required. It should also be emphasized that the nature of the present methodology is to iteratively build a link between present actions and future consequences via an estimate of the cost function $J$.
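The critic training idea can be sketched with a deliberately small example. The Python code below replaces the critic neural network with a single-weight critic $J(x) = w\,x^2$ and uses a hypothetical scalar closed-loop system $x(k+1) = a\,x(k)$ with utility $U = x^2$; these choices, and the values of `a`, `gamma`, and `eta`, are assumptions for illustration only. The update rule is a gradient step on the squared error between $J(k)$ and $U(k) + \gamma J(k+1)$, with $J(k+1)$ treated as a constant target.

```python
def critic(w, x):
    """Single-weight critic J(x) = w * x^2 (stand-in for the critic network)."""
    return w * x * x

def hdp_critic_update(w, x, x_next, utility, gamma, eta):
    """One gradient step on E1(k) = 0.5 * [J(k) - U(k) - gamma * J(k+1)]^2.

    J(k+1) is evaluated with the current weight and treated as a constant,
    so only dJ(k)/dw enters the gradient.
    """
    j_k = critic(w, x)
    j_next = critic(w, x_next)
    error = j_k - utility - gamma * j_next
    grad_j_k = x * x  # dJ(k)/dw for this critic
    return w - eta * error * grad_j_k

# Hypothetical closed-loop system and training settings.
a, gamma, eta = 0.5, 0.8, 0.5
w = 0.0
for _ in range(200):          # training trajectories
    x = 1.0                   # each trajectory starts from the same state
    for _ in range(20):       # fixed number of time steps per trajectory
        x_next = a * x
        w = hdp_critic_update(w, x, x_next, utility=x * x,
                              gamma=gamma, eta=eta)
        x = x_next
```

For this toy system the exact cost-to-go is $x^2 / (1 - \gamma a^2)$, so the learned weight can be compared against $1 / (1 - \gamma a^2)$ without ever supplying that value as a training target.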

Main results
A simulation study has been carried out using the mobile vehicle model presented in the Introduction, with a fixed set of parameters for this vehicle model. The three networks (critic, action, and model) are all implemented using multilayer feedforward neural networks. Each neural network has six inputs, $(x_R, y_R, x, y, \theta, w)$, where $x_R$ and $y_R$ denote the desired target gate. The critic network outputs $J$, the action network outputs $w_R$, and the model network is trained according to equations (1) and (5). The training samples for the critic network are obtained over trajectories starting from $x(0) = 0.5$ at $k = 0$, the initial position of the vehicle, with a reference point located at position $(8\ \mathrm{m}, 3.5\ \mathrm{m})$. The discount factor is chosen as $\gamma = 0.8$, and the utility function is chosen in terms of $\tilde{x} = x - x_R$ and $\tilde{y} = y - y_R$, the position errors with respect to the target point $(x_R, y_R)$, weighted by positive constants $q > 0$ and $r > 0$. As described previously, the training takes place in two stages: the training of the model network, and then the training of the critic and action networks. The objective for the training of the critic network is to match $J(k)$ with $U(k) + \gamma J(k+1)$. The objective for the training of the action network is to minimize $J(k+1)$. The procedures for the training of the critic and action networks are similar, and they are repeated iteratively. Figure 3 shows the result for the mobile vehicle when reaching the reference point after 10 trials (learning cycles), and Figure 4 shows it passing through one gate from two different initial conditions. Figure 5 shows the result for the mobile vehicle passing through two gates.
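The quantity the critic is trained to match can be computed directly along a recorded trajectory. The Python sketch below accumulates the discounted cost backward in time with $\gamma = 0.8$; the utility values are hypothetical, and the cost beyond the end of the recorded trajectory is assumed to be zero.

```python
def discounted_costs(utilities, gamma):
    """Return J(k) = U(k) + gamma * J(k+1) for every step of a trajectory.

    The cost after the last recorded step is taken as zero (an assumption).
    """
    costs = [0.0] * (len(utilities) + 1)
    for k in reversed(range(len(utilities))):
        costs[k] = utilities[k] + gamma * costs[k + 1]
    return costs[:-1]

gamma = 0.8
utilities = [4.0, 2.0, 1.0, 0.5]   # hypothetical U(k) along one trajectory
J = discounted_costs(utilities, gamma)
```

A critic whose outputs satisfy the same one-step relation at every sample reproduces these values, which is why training on the one-step relation is sufficient and no explicit target for $J$ is needed.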
A second simulation study was performed using the kinematic model of both the robot and the virtual robot that generates the reference trajectory. In this case the mathematical model of the system is defined as in Equations (1), (2) and (3), under the nonholonomic restriction $\dot{x}\sin\theta - \dot{y}\cos\theta = 0$. Here both the linear and angular velocities are variable, and the mobile robot follows a reference trajectory given by equations (14) and (15). Combining equations (1), (14) and (15) yields the tracking error model. With this change of coordinates the tracking problem is turned into a regulation one.
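A change of coordinates of this kind can be sketched as follows. The Python function below uses the error transformation commonly attributed to Kanayama, which expresses the pose error in the robot's own frame; treating this as the transformation used in the experiment is an assumption, and the function and variable names are hypothetical.

```python
import math

def tracking_error(q, q_ref):
    """Pose error of the reference vehicle expressed in the robot frame.

    q and q_ref are (x, y, theta) tuples. The rotation by -theta is the
    commonly used Kanayama-style transformation (assumed here).
    """
    x, y, theta = q
    x_r, y_r, theta_r = q_ref
    dx, dy = x_r - x, y_r - y
    e1 = math.cos(theta) * dx + math.sin(theta) * dy    # along-track error
    e2 = -math.sin(theta) * dx + math.cos(theta) * dy   # cross-track error
    # Heading error wrapped to (-pi, pi]
    e3 = math.atan2(math.sin(theta_r - theta), math.cos(theta_r - theta))
    return e1, e2, e3
```

When the robot pose coincides with the reference pose, all three error components are zero, so tracking the moving reference amounts to regulating this error vector to the origin.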
Figure 7 shows the overall block diagram of the control system that generates the control action in this experiment.

Fig. 7. Block diagram of the control system, with blocks Robot, Error Model, Reference Trajectory, K, and Kanayama Transformation.

The virtual robot describes a circular trajectory, where $(x_c, y_c) = (0, 0)$ is the center of the trajectory, $R = 0.5$ is the radius, $c = 0.2$, and $t$ is the time. A specific utility function is used for this experiment. The experiment begins with a stable but untuned $K$ matrix, giving the results shown in Figures 8 and 9.
After 11 iterations of the training algorithm, the control system guides the robot to track the reference trajectory, as can be seen in Figures 10 and 11.

Conclusions
A solution to the problem of autonomously generating an optimal control action sequence for mobile robot control, based on the Adaptive Critic Designs approach, has been presented. The proposed controller learns to guide the robot to a final point autonomously. It has been shown that with this technique we can obtain near-optimal control actions which require no external training data and give an optimal control law for the entire range of operation. This work is extensible to UAVs, assuming that the vehicle is flying at a constant altitude, so that the mission is restricted to planar motion around a target point and the kinematic equation of motion is similar to that of a UGV; see (Patiño et al., 2008). Future research will be oriented toward reaching a final point with a prescribed orientation, with application to UAVs. In addition, the problem of obstacle avoidance will be addressed. Other structures such as DHP and GDHP, and different local cost functions, will also be investigated, with the aim of formally consolidating a systematic design principle. From a theoretical point of view, efforts will be placed on the robustness issues of optimal control systems using adaptive critic designs.

Fig. 1. Prototype of a UGV equipped with a number of sensors. This prototype belongs to the Instituto de Automática of the Universidad Nacional de San Juan.
reference time-varying velocity.
$J(k) = J[x(k), k, W_C]$ and $W_C$ represents the parameters of the critic network. The function $U$ is chosen as a utility function which indicates the performance of the overall system (see examples in

Figure 6 shows all the variables presented in the previous equations.

Fig. 8. Reference trajectory and initial performance of the robot.