Model-based reinforcement learning with model error and its application

This paper proposes a reinforcement learning (RL) algorithm called Model Error based Forward Planning Reinforcement Learning (ME-FPRL). In this algorithm, an agent controls the amount of learning by using the model error. This study applies ME-FPRL to the pursuit of a target by a robot camera. The results of this application show that ME-FPRL is more efficient than usual RL and model-based RL.


INTRODUCTION
Many systems working with humans, e.g., humanoid robots [1], environmental intelligence, and portable information assistants [2], have been studied in recent years. These systems need an autonomous learning function in order to adapt to various users' requests and to changes of the environment. One approach to autonomous learning is Reinforcement Learning (RL) [3], and a study on a shooting robot is an example of its early applications to practical problems [4]. That study shows that RL can be used for autonomous systems. RL, however, cannot be applied easily to every practical problem, since RL needs much time for learning, so other methods are necessary for practical applications. One such method is Model-based RL [4], which has an inner model of the environment and improves the learning efficiency by using that model. That is, if there is knowledge of a task, the agent uses it for learning. However, even model-based RL has a problem with the error of its inner model, and there are few studies on that issue.
This study gives careful consideration to the effects of the model error and proposes a new approach to solve the problem above, Model Error based Forward Planning Reinforcement Learning (ME-FPRL) [5]. ME-FPRL controls the amount of learning based on the errors of the inner models and can keep the learning efficiency high. In this paper, ME-FPRL is applied to the pursuit of a target by a robot camera. The results of this application show that ME-FPRL learns more efficiently than usual RL and Model-based RL.
This paper is organized as follows. Section 2 explains RL and Model-based RL, and discusses the effect of the model error on them. In Section 3, ME-FPRL is proposed and its algorithm is presented. In Section 4, the algorithm of ME-FPRL is extended by the linear method in order to apply ME-FPRL to a real problem, i.e., the pursuing target task, and ME-FPRL is evaluated. Finally, Section 5 summarizes this study and shows remaining problems and possible future studies.
† Now, Hitachi Ltd. Systems Development Laboratory

2. RL AND MODEL-BASED RL
2.1 Framework of RL and Model-based RL
Fig. 1 shows the framework of RL. Using RL based on Monte Carlo methods and Dynamic Programming, the agent acquires the appropriate actions [3].

Fig. 1 Framework of RL
In the RL method, the agent takes an action on the environment and then the environment changes. The environment gives a reward to the agent according to the action. The agent observes the state of the environment, and then takes a new action on the environment in order to get more reward. The agent learns the appropriate behavior through the interaction with the environment and finally acquires the behavior that gives the maximum reward.
Fig. 2 shows the framework of Model-based RL. In the Model-based RL method, the agent has an inner model of the environment and improves learning efficiency using that model. Model-based RL, however, has problems concerning the error of the inner model. The agent learns the behavior through two kinds of interaction: with the real environment and with the inner model. Learning by interaction with the environment is called direct learning, and learning by interaction with the inner model is called indirect learning. If the agent's model is correct, that is, the same as the environment, the agent can get good performance by indirect learning, because the model generates good experiences. On the other hand, if the inner model has some errors, i.e., some differences from the real environment, the agent cannot learn the behavior correctly and the learning efficiency becomes low. Thus Model-based RL has a problem with the error of the inner model.
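To make the distinction between direct and indirect learning concrete, the following is a minimal tabular Dyna-style sketch of model-based RL; it is not the paper's algorithm, and the environment interface, constants, and deterministic model are all assumptions:

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=100, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q sketch: direct learning from real transitions,
    indirect learning from a learned (deterministic) inner model."""
    Q = defaultdict(float)   # Q[(s, a)] -> action value
    model = {}               # model[(s, a)] -> (reward, next_state)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            # direct learning (interaction with the real environment)
            best = max(Q[(s2, a_)] for a_ in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
            model[(s, a)] = (r, s2)
            # indirect learning (interaction with the inner model)
            for _ in range(planning_steps):
                ps, pa = random.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                pbest = max(Q[(ps2, a_)] for a_ in env.actions)
                Q[(ps, pa)] += alpha * (pr + gamma * pbest - Q[(ps, pa)])
            s = s2
    return Q
```

Here the inner model is exact by construction, so indirect learning helps; if `model` disagreed with the environment, the planning loop would propagate wrong values, which is exactly the model-error problem discussed above.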

3. ME-FPRL
ME-FPRL is proposed in order to solve the problem of the model error in Model-based RL [5]. Fig. 3 shows the algorithm of ME-FPRL.

Fig.3 Algorithm of ME-FPRL
This algorithm is based on Model-based RL with trajectories [3]. It controls the amount of indirect learning by estimating the current model error. Therefore, learning is robust against the error [5].
In this algorithm, s stands for the current state, a for the currently selected action, and Q for the action-value function.
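The paper does not reproduce the full update rule in the text, but one plausible sketch of "controlling the amount of indirect learning by the model error" is to scale the number of planning steps by an estimated model accuracy. The variable `ModelAcy` appears later in the paper; the scaling rule, `max_steps`, and the moving-average accuracy estimate below are assumptions:

```python
def planning_steps(model_acy, max_steps=20):
    """Scale the amount of indirect learning by estimated model accuracy.
    model_acy in [0, 1]: 1 = model matches the environment, 0 = useless model.
    Illustrative sketch, not the paper's exact rule."""
    return int(round(max_steps * max(0.0, min(1.0, model_acy))))

def update_model_accuracy(model_acy, predicted_s, actual_s, beta=0.1):
    """Exponential moving average of prediction hits (hypothetical form):
    each real transition nudges the estimate toward 1 on a correct
    prediction and toward 0 on a miss."""
    hit = 1.0 if predicted_s == actual_s else 0.0
    return (1.0 - beta) * model_acy + beta * hit
```

With this kind of rule, a freshly changed environment drives `ModelAcy` down, indirect learning is suppressed, and learning falls back toward direct RL until the model recovers.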

4. Continuous State Spaces
In the real world, some tasks such as the control of robots are very complex and the state space is very large. Therefore, an agent needs much time to learn the behavior. This problem is called the "curse of dimensionality". To cope with this, there is a method that converts the original state space into a feature space small enough to solve the task. This conversion method is called function approximation. One of the methods of function approximation is the linear method. In the linear method, the action-value function Q is expressed as Eq. (1):

Q(s, a) = Σ_i θ_i φ_i(s, a)   (1)

where θ is a parameter vector and φ_i(s, a) is the i-th feature of the state-action pair. Q is linear in θ, so this method is called linear. The action-value function is changed by the general gradient-descent update [3]. If the feature vector is given by a linear approximation method, the action-value function converges to a local optimum.
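The linear form of Eq. (1) and the gradient-descent update can be sketched as follows; the TD(0) form and the value of alpha are standard choices, not values taken from the paper:

```python
import numpy as np

def q_value(theta, phi):
    """Linear action-value of Eq. (1): Q(s, a) = sum_i theta_i * phi_i(s, a)."""
    return float(np.dot(theta, phi))

def semi_gradient_update(theta, phi, reward, gamma, q_next, alpha=0.1):
    """One semi-gradient TD(0) step; for a linear Q, grad_theta Q = phi,
    so the update is theta <- theta + alpha * delta * phi."""
    delta = reward + gamma * q_next - np.dot(theta, phi)  # TD error
    return theta + alpha * delta * phi
```

Because the gradient of a linear Q with respect to θ is just the feature vector, the update touches only the features active in the current state-action pair.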
In this study, ME-FPRL is extended by the linear method in order to apply the algorithm to practical tasks. Fig. 4 shows the algorithm of ME-FPRL with function approximation. δ is a temporary variable for the TD error [3].

Pursuing Target Task
In the control of a robot system, the pursuing target task is a basic problem, and many control methods have been developed for it. However, a control policy adjusted to one robot cannot be applied to another robot because of the different dynamics among robots. That is, the control policy must be adjusted many times. In order to solve this problem, it is important to make up for the differences in dynamics by learning [6]. In this paper, ME-FPRL is applied to the pursuing target task. The four-legged robot shown in Fig. 5 (AIBO ERS-7M3) [7] learns the control policy to pursue the target shown in Fig. 6 using its CCD camera, which moves in the pan/tilt directions. The target moves like a pendulum.

Action and Reward
The agent chooses one action out of five: "Turn to top", "Turn to bottom", "Turn to right", "Turn to left", and "Stop". Each action except "Stop" turns the head by 5 degrees.
The agent gets rewards as follows: +10 points for catching the target near the center of the camera, -10 points for catching the target away from the center of the camera, and -20 points for missing the target.
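This reward scheme can be stated directly as a function; the paper does not specify the "near the center" threshold, so `center_tol_deg` below is an assumed parameter:

```python
def pursuit_reward(target_offset_deg, lost, center_tol_deg=5.0):
    """Reward scheme from the text: +10 near the center, -10 in view but
    off-center, -20 when the target is lost. center_tol_deg is an
    assumed threshold, not given in the paper."""
    if lost:
        return -20
    return 10 if abs(target_offset_deg) <= center_tol_deg else -10
```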

Learning Agents
In order to confirm the validity of ME-FPRL, the learning efficiencies of the control policies obtained with RL, Model-based RL, and ME-FPRL are compared with each other.
The agent's action is selected by Gibbs sampling [3]. Eq. (2) shows the Gibbs distribution, where τ is a temperature parameter.
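A minimal sketch of Gibbs (softmax) action selection, with `tau` playing the role of τ in Eq. (2); the numerical-stability shift by the maximum is an implementation detail, not from the paper:

```python
import math
import random

def gibbs_action(q_values, tau=1.0, rng=random):
    """Softmax (Gibbs) action selection: P(a) is proportional to
    exp(Q(s, a) / tau). Small tau -> nearly greedy; large tau -> nearly
    uniform exploration."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    probs = [w / total for w in weights]
    r = rng.random()
    acc = 0.0
    for a, p in enumerate(probs):
        acc += p
        if r <= acc:
            return a
    return len(q_values) - 1
```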
The agents approximate the state space by CMAC [3], which is a linear approximation method. In CMAC, the state space is expressed by a number of tilings, each consisting of tiles. θ and ModelAcy are initialized to 0.
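A simple one-dimensional tile-coding sketch in the spirit of CMAC; the numbers of tilings and tiles below are illustrative, not the paper's settings:

```python
def tile_features(x, n_tilings=4, n_tiles=8, lo=0.0, hi=1.0):
    """1-D tile coding (CMAC-style): each tiling is a shifted partition of
    [lo, hi], and the feature vector has exactly one active (binary) tile
    per tiling, so generalization comes from overlapping tilings."""
    phi = [0.0] * (n_tilings * n_tiles)
    width = (hi - lo) / n_tiles
    for t in range(n_tilings):
        offset = t * width / n_tilings          # shift each tiling slightly
        idx = int((x - lo + offset) / width)
        idx = min(max(idx, 0), n_tiles - 1)     # clip to the valid range
        phi[t * n_tiles + idx] = 1.0
    return phi
```

The resulting binary vector plugs directly into the linear form of Eq. (1): nearby inputs share most active tiles, distant inputs share none.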

Inner Model
The agents with the model-based methods have an inner model. In this task, the dynamics of the target is modeled by multi-layered neural networks [8].
Fig. 7 shows the inner model of the target. This model has two neural networks, switched by the direction of the target's movement; that is, one network deals with rotation to the right and the other with rotation to the left. These networks are trained by error back-propagation [8].
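A sketch of such a switched two-network forward model, trained by back-propagation; the layer sizes, learning rate, and interface below are assumptions for illustration only:

```python
import numpy as np

class TargetModel:
    """Two small MLPs switched by movement direction (0 = right, 1 = left),
    trained by back-propagation on squared error. Layer sizes are
    hypothetical, not the paper's."""
    def __init__(self, n_in=7, n_hidden=2, n_out=4, seed=0):
        rng = np.random.default_rng(seed)
        # one (W1, W2) weight pair per direction
        self.nets = [(rng.normal(0, 0.1, (n_hidden, n_in)),
                      rng.normal(0, 0.1, (n_out, n_hidden)))
                     for _ in range(2)]

    def forward(self, x, direction):
        """Predict the next target state with the network for `direction`."""
        W1, W2 = self.nets[direction]
        h = np.tanh(W1 @ x)
        return W2 @ h, h

    def train_step(self, x, target, direction, lr=0.1):
        """One back-propagation step; returns the current mean squared error."""
        W1, W2 = self.nets[direction]
        y, h = self.forward(x, direction)
        err = y - target
        gW2 = np.outer(err, h)            # output-layer gradient
        gh = W2.T @ err * (1 - h ** 2)    # backprop through tanh
        gW1 = np.outer(gh, x)             # hidden-layer gradient
        self.nets[direction] = (W1 - lr * gW1, W2 - lr * gW2)
        return float(np.mean(err ** 2))
```

Switching between two small networks, rather than training one network on both regimes, avoids the two movement directions interfering with each other's weights.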

Experiment and Its Result
In this experiment, an episode is defined as follows: one episode finishes when the robot keeps the target in the center of the camera for 40 steps, or when the robot loses the target completely. The robot performs the task 100 times. In the first 50 episodes, the target moves only to the right side; in the latter 50 episodes, it moves to both the right and left sides. That is, the movement of the target changes at the 50th episode.
Fig. 8 shows the experimental results with RL, Fig. 9 shows those with ME-FPRL, and Fig. 10 shows those with Model-based RL. It is found that ME-FPRL is faster than RL in the early stage and is more robust than RL.

CONCLUSIONS
In this paper, the effective learning algorithm ME-FPRL is proposed and applied to the pursuit of a target. The application results show that the proposed ME-FPRL is more efficient than RL or Model-based RL. From these results, ME-FPRL is found to be applicable to practical tasks.
Our future work is to construct a more efficient learning system by using advice and communication from humans or other agents.

Fig. 2
Fig.2 Framework of Model-based RL

Fig. 4
Fig.4 Algorithm of ME-FPRL with the function approximation


Fig. 7
Fig.7 Inner model of the target

Fig. 8
Fig.8 Experimental Results of RL