Improving Search Efficiency in the Action Space of an Instance-Based Reinforcement Learning Technique for Multi-Robot Systems

We have developed a new reinforcement learning technique called Bayesian-discrimination-function-based reinforcement learning (BRL). BRL is unique, in that it not only learns in the predefined state and action spaces, but also simultaneously changes their segmentation. BRL has proven to be more effective than other standard RL algorithms in dealing with multi-robot system (MRS) problems, where the learning environment is naturally dynamic. This paper introduces an extended form of BRL that improves its learning efficiency. Instead of generating a random action when a robot encounters an unknown situation, the extended BRL generates an action calculated by a linear interpolation among the rules with high similarity to the current sensory input. In both physical experiments and computer simulations, the extended BRL showed higher search efficiency than the standard BRL.


Introduction
Recent years have witnessed growing interest in multi-robot system (MRS) research. To date, numerous research projects have been undertaken in various forms, such as robot soccer (Stone & Sutton, 2001), all-terrain operation (Mondada et al., 2003), box-pushing problems (Gerkey & Mataric, 2002), and many others. We can point out at least three advantages of an MRS over traditional single-robot systems (Stone & Veloso, 2000). The first is parallel processing, performed by autonomous and asynchronous robots in the system. The second is robustness, realized by redundancy: the system has more robots than required. The third is scalability, in the sense that a robot can easily be added to or removed from the system.
From the viewpoint of complex adaptive systems, it is important to coordinate cooperative behavior to solve a given task, because a task is simply given to a robot group without sufficiently detailed specifications of how to solve it. The most popular approach to realizing coordination is to provide strategies for effective cooperation in advance, in the form of behavior rules, roles, or communication protocols. However, it is practically impossible to give hand-crafted behavior rules for all possible situations that a robot will encounter. This means that the performance is context-sensitive. One approach to this problem is to give each robot the ability to acquire cooperative behavior through experience, by autonomous role development and assignment, so that the MRS has the potential for system-level robustness. We consider that a key factor is how to give an on-line autonomous specialization mechanism to an MRS.
This study introduces an approach that uses reinforcement learning (RL) to achieve autonomous specialization. To date, RL has not often been applied to an MRS, for the following two reasons. The first is that the results of RL are quite sensitive to the segmentation of the state space and the action space. When segmentation is inappropriate, RL often fails. Even if RL obtains a successful result, the achieved behavior might not be sufficiently robust. The second is that RL theory is constructed on the assumption of a static environment (Sutton & Barto, 1998). Therefore, RL in a simple form can yield good results only when the environment is sufficiently static or stable for a robot to be able to assume that it is static. We must therefore apply RL carefully to an MRS, so that learning robots cope with the dynamics in their environment resulting from the moves of other robots that are learning simultaneously.
To overcome these problems, we apply a novel RL algorithm that has a mechanism for segmenting continuous state space and continuous action space autonomously and simultaneously. We call this Bayesian-discrimination-function-based Reinforcement Learning (BRL). In addition, to support the stabilization of the dynamics in the learning problem for the RL, complementary information, i.e., the prediction of the other robots' postures at the next time step, is provided to the BRL by a learning neural network. The remainder of this chapter is organized as follows: The target MRS is introduced in the second section. The third and fourth sections explain our design concept and our reinforcement learning controller details. The fourth section proposes an extended BRL for improving the robustness. The fifth section shows results of our experiments. Conclusions are given in the final section.

Task: cooperative carrying problem
Our target problem is a simple MRS composed of three autonomous robots, as shown in Fig. 1. This problem is called the cooperative carrying problem (CCP), and requires the MRS to carry a triangular board from the start to the goal. Each robot is connected to a different corner of the load so that it can rotate freely. A potentiometer measures the angle θ between the load and the robot's direction. A robot can perceive the potentiometer measurements of the other robots, as well as its own. All three robots have the same specifications. Each robot has two distance sensors d and three light sensors l. The greater the value of d or l, the nearer the obstacle or the light source. Each robot has two motors for rotating two omnidirectional wheels. A wheel provides powered drive in the direction it is pointing and, at the same time, passive coasting in the orthogonal direction. The difficulties in this task can be summarized as follows:

• The robots have to cooperate with each other to move around.
• They begin with no predefined behavior rule sets or roles.
• They have no explicit communication functions.
• They cannot perceive the other robots through the distance sensors because the sensors do not have sufficient range.
• Each robot can perceive the goal (the location of the light source) only when the light is within the range of its light sensors.
• Passive coasting of the omnidirectional wheels brings a dynamic and uncertain state transition.

Reinforcement learning approach to CCP
3.1 Reinforcement learning in continuous space:

BRL: Overview
Our approach, called BRL, updates the classification only when such an update is required. A set of production rules is defined using the Bayesian discrimination method, which is a well-known method of pattern classification (Duda & Hart, 1972). This method assigns an input x to the cluster $C_i$ that has the largest posterior probability, $\max_i \Pr(C_i|x)$. Here, $\Pr(C_i|x)$ is the probability, calculated by Bayes' formula, that the cluster $C_i$ holds the observed input x. Therefore, using this technique, a robot can select the rule most similar to the current sensory input. The learning procedure is as follows:
1. A robot perceives the current input data x.
2. The robot selects the most similar rule from the rule set by using the Bayesian discrimination method. If a rule is selected, the robot executes the corresponding action a. Otherwise, it executes a random action.
3. The robot is transferred to the next state and receives a reward r.
4. The utilities of all rules are updated according to r. Rules whose utilities are below a certain threshold are removed.
5. If the robot executed a random action, it produces a new rule as the combination of the current input data and the executed action. This new rule is stored in the rule set.
6. If the robot receives no penalty, the parameters of all the rules are updated by the interval estimation technique. Otherwise, the robot updates only the parameters of the selected rule.
7. Go to (1).

Rule Representation
BRL operates on a set of rules R. A rule $rl \in R$ is defined as $rl := \langle v, u, a, f, \Sigma, \Phi \rangle$. In this expression, the state vector associated with rl is $v = \{v_1, \ldots, v_{n_d}\}^T$, where $n_d$ is the number of inputs. The utility of rl is represented as u. The action vector is $a = \{a_1, \ldots, a_{n_a}\}^T$, where $n_a$ is the number of actuators. The prior probability is denoted as f. The covariance matrix is $\Sigma = \mathrm{diag}\{\sigma_1, \ldots, \sigma_{n_d}\}$. The sample set associated with rl is $\Phi = \{\phi_1, \ldots, \phi_{n_s}\}^T$, where $n_s$ is the number of samples.
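As an illustration of this rule tuple, the following is a minimal sketch of one possible in-memory representation. The class name `Rule`, the field names, and the choice to store only the diagonal of Σ are assumptions made for this example, not part of the original controller.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    """One BRL rule rl = <v, u, a, f, Sigma, Phi> (field names are illustrative)."""

    v: List[float]       # state vector, length n_d
    u: float             # utility
    a: List[float]       # action vector, length n_a
    f: float             # prior probability
    sigma: List[float]   # diagonal entries of the covariance matrix Sigma
    phi: List[List[float]] = field(default_factory=list)  # sample set Phi

    def __post_init__(self) -> None:
        # Sigma is diagonal, so it needs exactly one entry per input dimension.
        if len(self.sigma) != len(self.v):
            raise ValueError("sigma must have one entry per state dimension")


# Hypothetical values, chosen only to show the shape of a rule.
rule = Rule(v=[0.2, 0.8], u=1.0, a=[0.5, -0.1], f=0.1, sigma=[0.05, 0.05])
rule.phi.append([0.2, 0.8])
```

Storing Σ as a list of diagonal entries keeps later density evaluations linear in $n_d$, which seems a natural fit for a diagonal covariance.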

Action Selection
A rule in R is selected to minimize the risk of misclassification of the current input. The risk of misclassification for each cluster is derived from the posterior probability $\Pr(C_i|x)$, which is calculated by Bayes' theorem:
$$\Pr(C_i|x) = \frac{\Pr(C_i)\,\Pr(x|C_i)}{\Pr(x)} \tag{1}$$
To find the minimal risk, it is sufficient to compare the numerators, because the factor $1/\Pr(x)$ is common to all clusters. The probability density function of the i-th rule's cluster is represented as follows:
$$\Pr(x|C_i) = \frac{1}{(2\pi)^{n_d/2}\,|\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(x - v_i)^T \Sigma_i^{-1} (x - v_i)\right\} \tag{2}$$
The estimated value of $g_i$, the risk of misclassification of the input data x into the other clusters, is calculated as follows:
$$g_i = -\log\left(f_i \cdot \Pr(x|C_i)\right) = \frac{1}{2}(x - v_i)^T \Sigma_i^{-1} (x - v_i) - \log\left\{\frac{1}{(2\pi)^{n_d/2}\,|\Sigma_i|^{1/2}}\right\} - \log f_i \tag{3}$$
After calculating $g_i$ for all the rules, the winner rule, $rl_w$, is selected as the one with the minimal value of $g_i$. As mentioned in the learning procedure, the action in $rl_w$ is performed if $g_w$ is lower than a threshold $g_{th}$. Otherwise, a random action is performed.
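The selection step can be sketched as follows. This is a minimal sketch under stated assumptions: each cluster is modelled as a Gaussian centred on v with diagonal covariance Σ, the risk is taken as the negative logarithm of the prior-weighted density, and random actions are drawn uniformly from [-1, 1]. The function names, the tuple layout of a rule, and the action range are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

# Each rule is (v, sigma_diag, f, a); this tuple layout is illustrative only.
RuleT = Tuple[Sequence[float], Sequence[float], float, Sequence[float]]


def risk(x: Sequence[float], v: Sequence[float],
         sigma_diag: Sequence[float], f: float) -> float:
    """g = -log(f * N(x; v, diag(sigma_diag))), assuming a Gaussian cluster."""
    n_d = len(x)
    maha = sum((xi - vi) ** 2 / s for xi, vi, s in zip(x, v, sigma_diag))
    # Log of the Gaussian normalisation constant for a diagonal covariance.
    log_norm = -0.5 * (n_d * math.log(2.0 * math.pi)
                       + sum(math.log(s) for s in sigma_diag))
    return 0.5 * maha - log_norm - math.log(f)


def select_action(x: Sequence[float], rules: Sequence[RuleT], g_th: float,
                  n_a: int, rng: random.Random) -> Tuple[List[float], Optional[int]]:
    """Return (action, winner index); the index is None for a random action."""
    risks = [risk(x, v, s, f) for v, s, f, _ in rules]
    if risks:
        w = min(range(len(risks)), key=risks.__getitem__)
        if risks[w] < g_th:
            return list(rules[w][3]), w
    # No rule is similar enough: fall back to a random action (assumed range).
    return [rng.uniform(-1.0, 1.0) for _ in range(n_a)], None


rules = [([0.0, 0.0], [0.05, 0.05], 0.5, [1.0, 0.0]),
         ([1.0, 1.0], [0.05, 0.05], 0.5, [0.0, 1.0])]
action, winner = select_action([0.1, 0.0], rules, g_th=5.0, n_a=2,
                               rng=random.Random(0))
```

In this usage, the input lies close to the first rule's state vector, so that rule wins and its stored action is returned; an input far from every rule would instead yield a random action and `None` as the winner.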

Temporal Credit Assignment
The respective utilities of the rules are updated using the following four strategies after the action is performed.
1. Direct payoff distribution: The direct payoff P is given to the winner rule. Two types of payoff are obtainable: reward (P > 0) and punishment (P < 0). The payoff is spread back along the sequence of rules that triggered actions, with the discount rate γ.
2. "Bucket brigade"-like strategy: The current winner rule, $rl_w$, hands over part of its utility, Δu, to the previous winner only when Δu is positive.
3. Taxation: A firing rule reduces its utility as $u_w \leftarrow (1 - c_f)\,u_w$.
4. Evaporation: All rules reduce their utilities at the evaporation rate η < 1 when the robot reaches the goal: $u_i \leftarrow \eta u_i$.
A rule whose utility is smaller than the threshold $u_{min}$ is removed from the rule set R.
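Three of these strategies can be sketched in a few lines. The sketch below assumes that the payoff is added to the utility and that each earlier rule in the firing sequence receives the payoff multiplied by one more factor of γ; both are assumptions, as are the function names and the dictionary layout. The bucket-brigade transfer is omitted because the size of Δu is not specified here.

```python
from typing import Dict, List


def distribute_payoff(utilities: Dict[int, float], fired: List[int],
                      payoff: float, gamma: float) -> None:
    """Give the payoff to the latest winner and discounted shares to earlier ones."""
    share = payoff
    for rule_id in reversed(fired):
        if rule_id in utilities:  # skip rules that were already removed
            utilities[rule_id] += share
        share *= gamma


def tax(utilities: Dict[int, float], winner: int, c_f: float) -> None:
    """Taxation: u_w <- (1 - c_f) * u_w for the firing rule."""
    utilities[winner] *= 1.0 - c_f


def evaporate_and_prune(utilities: Dict[int, float], eta: float,
                        u_min: float) -> None:
    """Evaporation for all rules, then removal of rules below u_min."""
    for rule_id in list(utilities):
        utilities[rule_id] *= eta
        if utilities[rule_id] < u_min:
            del utilities[rule_id]


# Hypothetical utilities and constants, for illustration only.
utilities = {0: 1.0, 1: 1.0, 2: 0.01}
distribute_payoff(utilities, fired=[0, 1], payoff=1.0, gamma=0.5)
tax(utilities, winner=1, c_f=0.1)
evaporate_and_prune(utilities, eta=0.9, u_min=0.05)
```

After these calls, the rule that never fired and started with a very small utility has dropped below `u_min` and is removed, while the two fired rules keep utilities that reflect their position in the firing sequence.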

Updating Rule Set
The update phase is performed except when the action by $rl_w$ results in punishment. If a random action is taken (i.e., $g_w > g_{th}$), a new rule that is composed of the current sensory input, $v_c$, and the executed action, $a_c$, is added to R. The parameters of the new rule are initialized using the constants $\sigma_0$, $u_0$ and $f_0$ and the unit matrix I.
When the action in $rl_w$ is performed (i.e., $g_w \le g_{th}$), all of its parameters are updated as follows. First, the sample set $\Phi_w$ is updated by adding the current sensory input x to it. Then, the sample mean $\bar{x} = \{\bar{x}_1, \ldots, \bar{x}_{n_s}\}^T$ and the sample variance $s^2 = \{s_1^2, \ldots, s_{n_s}^2\}^T$ are estimated from the updated set $\Phi_w$. The confidence intervals for $\bar{x}$ and $s^2$ are also updated. Subsequently, BRL determines whether any component of v and Σ is out of the range of the confidence intervals. If any component is outside that range, the corresponding updates are conducted with the constant coefficients α and β. For all other rules, the prior probabilities $f_i$ are also updated.
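The creation of a new rule and the re-estimation of the sample statistics can be sketched as follows. This is only a sketch under assumptions: the new rule is taken to start with $v = v_c$, $a = a_c$, $\Sigma = \sigma_0 I$, $u = u_0$, $f = f_0$ and a sample set holding the current input, and the unbiased sample variance is used. The function names and the dictionary layout are hypothetical, and the confidence-interval test and the α, β updates are not reproduced.

```python
from statistics import mean, variance
from typing import Any, Dict, List, Sequence, Tuple


def make_rule(v_c: Sequence[float], a_c: Sequence[float], sigma_0: float,
              u_0: float, f_0: float) -> Dict[str, Any]:
    """Build a new rule from the current input and the executed action."""
    return {
        "v": list(v_c),
        "a": list(a_c),
        "sigma": [sigma_0] * len(v_c),  # diagonal of sigma_0 * I (assumed form)
        "u": u_0,
        "f": f_0,
        "phi": [list(v_c)],             # sample set seeded with the input (assumed)
    }


def sample_stats(phi: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension sample mean and unbiased sample variance of the sample set."""
    columns = list(zip(*phi))
    means = [mean(col) for col in columns]
    # A single sample has no spread, so its variance is reported as zero.
    variances = [variance(col) if len(col) > 1 else 0.0 for col in columns]
    return means, variances


# Hypothetical constants, for illustration only.
new_rule = make_rule(v_c=[0.1, 0.2], a_c=[1.0, 0.0], sigma_0=0.05, u_0=1.0, f_0=0.1)
new_rule["phi"].append([0.3, 0.4])
x_bar, s2 = sample_stats(new_rule["phi"])
```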

Related Work
To date, numerous reports on RL approaches applied to an MRS have been published. For instance, Tan (Tan, 1993), who examined the effects of sharing information, reported that shared information is beneficial if it can be used efficiently. Asada et al. (Asada et al., 1999) and Ikenoue et al. (Ikenoue et al., 2002) proposed a vision-based RL method for acquiring cooperative behavior in a soccer-like game that includes two mobile robots: a shooter and a passer. To stabilize the learning process, Asada et al. introduced a method of global scheduling by limiting the number of learning agents to one and allowing the remaining agents to execute fixed policies that were acquired in the previous learning stage. Ikenoue et al. proposed a method of asynchronous policy renewal with one policy and one action value function. Elfwing et al. (Elfwing et al., 2004) added macro actions. Macro actions force an agent to execute the same primitive action for more than one time step, thereby stabilizing learning and making action selection more predictable for other agents. Several studies have specifically addressed the internal model of other learning agents (Littman, 1994; Hu & Wellman, 1998; Nagayuki et al., 2000). In those models, agents learn through estimating others' actions, Q values, or policies.
To the best of our knowledge, no RL approaches have displayed autonomous specialization. Therefore, robots need well-designed states, actions, strategies, or roles for acquiring cooperative behavior. Achieving all of these goals simultaneously is a practical impossibility.

Our Approach
In this study, we adopt a mechanism for predicting the near-future state based on time-series sensory information. As related work, a memory-based method (Moore & Atkeson, 1993) and a decision-tree method (Suzuki et al., 1999) have been proposed for dealing with non-Markovian characteristics in an MRS environment. However, the state space is expanded according to the length of the time-series information; in the worst case, it is expanded indefinitely. We consider that the state space expansion should be kept as small as possible.
Our research group has demonstrated that predicting only the nearest future state is sufficient for stabilizing the dynamics in an RL space (Kawakami et al., 1999). In this study, although a continuous learning space is assumed, an identical approach is examined with a feed-forward neural network for predicting the average of the other robots' postures at the next time step. As shown in Fig. 2, BRL uses the output of the neural network as a sensory information input.

Basic concept
Several RL approaches provide learning in continuous action spaces. An actor-critic algorithm built with function approximators has a continuous learning space and modifies actions adaptively (Doya, 2000; Peters & Schaal, 2008). This algorithm modifies policies based on the TD error at every time step. The REINFORCE algorithm theoretically also needs an immediate reward (Williams, 1992). These approaches are not useful for tasks such as the navigation problem shown in Sec. 2, because the robot gets a reward only when it reaches the goal. BRL, however, proves to be robust against a delayed reward.
In the standard BRL, a robot performs a random search in its action space, and these random actions can produce unstable behavior. Therefore, reducing the chance of random actions may accelerate behavior acquisition and provide more robust behavior. Instead of performing a random action, BRL needs a function that determines an action based on acquired knowledge.

BRL with an adaptive action generator
To improve the search efficiency in an action space, in this paper we introduce an extended BRL by modifying Step (2) of the learning procedure in Sec. 3. In this extension, instead of a random action, the robot performs a knowledge-based action when it encounters a new environment. To do this, we set a new threshold, $P'_{th}$ (< $P_{th}$), which corresponds to a second risk threshold $g'_{th}$ (> $g_{th}$), and provide three cases for rule selection in Step (2) as follows:
• $g_w < g_{th}$: The robot selects the rule with $g_w$ and executes its corresponding action $a_w$.
• $g_{th} \le g_w < g'_{th}$: The robot executes an action whose parameters are determined by Eq. (9) from $rl_w$ and the other rules with misclassification risks within this range, where $n_r$ is the number of referred rules and N(0, σ) is zero-centred Gaussian noise with variance σ. This action is regarded as an interpolation of previously acquired knowledge.
• $g'_{th} \le g_w$: The robot generates a random action.
In this rule selection, the first and third cases are the same as in the standard BRL.
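The three cases can be sketched as a single selection function. This is a hedged sketch, not the authors' interpolation formula: it assumes that the referred rules are weighted by their utilities (falling back to equal weights), that the noise is added independently to each action component, and that random actions are uniform in [-1, 1]. The function name, argument layout, and mode labels are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def extended_action(risks: Sequence[float], actions: Sequence[Sequence[float]],
                    utilities: Sequence[float], g_th: float, g_th2: float,
                    sigma: float, n_a: int,
                    rng: random.Random) -> Tuple[List[float], str]:
    """Pick an action by the three-case scheme; g_th2 plays the role of g'_th."""
    if risks:
        w = min(range(len(risks)), key=risks.__getitem__)
        g_w = risks[w]
        if g_w < g_th:
            return list(actions[w]), "rule"
        if g_w < g_th2:
            # Refer to every rule whose risk lies in [g_th, g'_th), including rl_w.
            referred = [i for i, g in enumerate(risks) if g_th <= g < g_th2]
            total = sum(utilities[i] for i in referred)
            if total > 0.0:
                weights = [utilities[i] / total for i in referred]  # assumed weighting
            else:
                weights = [1.0 / len(referred)] * len(referred)
            std = math.sqrt(sigma)  # sigma is a variance; gauss() expects a std dev
            action = [
                sum(wt * actions[i][k] for wt, i in zip(weights, referred))
                + rng.gauss(0.0, std)
                for k in range(n_a)
            ]
            return action, "interpolated"
    return [rng.uniform(-1.0, 1.0) for _ in range(n_a)], "random"


# Hypothetical risks, actions and utilities; sigma = 0 makes the result deterministic.
a, mode = extended_action(risks=[2.0, 2.5, 9.0],
                          actions=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
                          utilities=[3.0, 1.0, 1.0],
                          g_th=1.0, g_th2=3.0, sigma=0.0, n_a=2,
                          rng=random.Random(0))
```

In this usage, the first two rules fall inside the intermediate band, so the returned action is a blend of their stored actions, dominated by the rule with the higher utility.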

Settings
Figs. 3 and 4 show the general views of the experimental environments for simulation and physical experiments, respectively. In the simulation runs, the field is a square surrounded by a wall. The physical robots are situated in a 3.6-meter-long and 2.4-meter-wide pathway. The task for the MRS is to move from the start to the goal (light source). All robots get a positive reward when one of them reaches the goal ($l_0 > thr_{goal} \lor l_1 > thr_{goal} \lor l_2 > thr_{goal}$). A robot gets a negative reward when it collides with a wall ($d_0 > thr_d \lor d_1 > thr_d$). We represent a unit of time as a step. A step is a sequence that allows the three robots to get their own input information, make decisions by themselves, and execute their actions independently. When the MRS reaches the goal, or when it cannot reach the goal within 200 steps in simulations and 100 steps in physical experiments, it is put back to the start. This time span is called an episode. The settings of the robot controller are as follows.
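The reward conditions above can be sketched as a small function. The reward magnitudes (+1 and -1), the summing of both terms when they coincide, and all names are assumptions made for illustration; only the threshold conditions come from the description above.

```python
from typing import List, Sequence, Tuple


def step_rewards(light: Sequence[Sequence[float]], dist: Sequence[Sequence[float]],
                 thr_goal: float, thr_d: float,
                 r_goal: float = 1.0, r_wall: float = -1.0) -> Tuple[List[float], bool]:
    """Return per-robot rewards and whether the goal was reached in this step.

    light[i] holds the three light sensor values of robot i,
    dist[i] holds its two distance sensor values.
    """
    # The goal is reached when any light sensor of any robot exceeds thr_goal.
    goal = any(max(l) > thr_goal for l in light)
    rewards = []
    for d in dist:
        r = r_goal if goal else 0.0   # the goal reward is shared by all robots
        if max(d) > thr_d:            # the collision penalty is per robot
            r += r_wall
        rewards.append(r)
    return rewards, goal


# Hypothetical sensor readings and thresholds, for illustration only.
rewards, reached = step_rewards(
    light=[[0.1, 0.2, 0.9], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    dist=[[0.1, 0.1], [0.95, 0.1], [0.0, 0.0]],
    thr_goal=0.8, thr_d=0.9,
)
```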

Prediction Mechanism (NN)
The attached prediction mechanism is a three-layered feed-forward neural network trained by backpropagation. The input of the i-th robot is a short history of sensory information, where $\psi_i^t = (\theta_j^t + \theta_k^t)/2$ ($i \ne j \ne k$). The output is a prediction of the posture of the other robots at the next time step, $O_i = \{\cos\psi_i^{t+1}, \sin\psi_i^{t+1}\}$. The hidden layer has eight nodes.
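The training target of this network can be computed directly from the potentiometer angles. The following is a minimal sketch for the three-robot case, assuming angles in radians; the function name and argument layout are hypothetical, and the network itself is not reproduced.

```python
import math
from typing import Sequence, Tuple


def posture_target(thetas: Sequence[float], i: int) -> Tuple[float, float]:
    """Return (cos psi_i, sin psi_i), where psi_i averages the other two angles."""
    if len(thetas) != 3:
        raise ValueError("this sketch covers the three-robot case only")
    j, k = [idx for idx in range(3) if idx != i]
    psi = (thetas[j] + thetas[k]) / 2.0
    return math.cos(psi), math.sin(psi)


# Hypothetical angles at time t+1, used as the target for robot 0.
target = posture_target([0.0, math.pi / 2, math.pi / 2], i=0)
```

Encoding the angle as a cosine and sine pair, as the output definition does, avoids the discontinuity of a raw angle at the wrap-around point.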

Behavior Learning Mechanism (BRL)
The input is , where $m_i^{rud}$ and $m_i^{th}$ are the motor commands for the rudder and the throttle, respectively. σ in Eq. (9) is 0.05. For the standard BRL, $P_{th}$ = {0.012, 0.01}. For the extended BRL, $P_{th}$ = 0.012 and $P'_{th}$ = 0.01. The other parameters are shown in Table 1. These values are the same as the recommended values in our journal paper (Yasuda et al., 2005).

Results: simulations
Fig. 5 shows the averages and the deviations of the number of steps that the MRS takes by the end of each episode. In the early stages, the MRS requires a lot of trial and error and takes many steps to finish the episode. After such a trial-and-error process, the behavior of the MRS becomes more stable and it takes fewer steps. An MRS with the standard BRL stably achieves the task within a nearly constant number of steps after the 250th episode, whereas the extended BRL accomplishes this in 200 episodes. This means that, in terms of learning speed, the extended BRL outperforms the standard one. Over the 50 independent runs, the MRS achieved different globally stable behaviors, as shown in Fig. 6. However, we found a common point: the robots always achieved cooperative behavior by developing team play organized by a leader, a sub-leader and a follower. This implies that acquiring cooperative behavior always involved autonomous specialization. The extended BRL displayed higher adaptability, and yielded autonomous specialization faster than the standard BRL.

Discussion
There is no significant difference in the learning performance of the BRLs for a three-robot CCP; therefore, we tested four- and five-robot CCP performance for more dynamic and complicated problems. The four robots use a square load, and the five robots have a pentagonal load. In these CCPs, ψ is the average of the angles between two neighbouring robots and the load. The other controller settings are the same as those for the three-robot CCP. Figs. 7 and 8 show the averages and the deviations of the number of steps an MRS takes by the end of each episode. As the number of robots increases, the extended BRL provides increasingly better results than the standard BRL, although it requires more episodes before obtaining stable behavior, as shown in Figs. 9 and 10. The extended BRL has a function for coordinating behavior as well as reducing the number of random actions that can result in unstable behavior. These results show that the extended BRL has a higher learning ability and is less dependent on the number of robots in the MRS. This implies that the extended BRL might have more scalability, which is one of the advantages of an MRS over single-robot systems.
Although more refined parameters might provide better performance, parameter tuning is outside the scope of this study, because BRL is designed for acquiring reasonable behavior as quickly as possible, rather than optimal behavior. In other words, the focal point of our MRS controller is not optimality but versatility. In fact, we obtained similar experimental results with an arm-type MRS similar to that in (Svinin et al., 2000) using the same parameter settings. During this process, robots often collide with a wall and become immovable (Fig. 13). Then, some robots reach the goal and develop appropriate input-output mappings (Fig. 14). Observing the acquired behavior and investigating rule parameters, we found that the robots developed cooperative behavior based on autonomous specialization.

Conclusions
We investigated the RL approach for the behavior acquisition of an autonomous MRS. Our proposed RL technique, BRL, has a mechanism for autonomous segmentation of the continuous learning space, and proved effective for an MRS through the emergence of autonomous specialization. For accelerated learning, we proposed an extension of BRL with a function to generate interpolated actions based on previously acquired rules. Results of the simulations and physical experiments showed that the MRS with the extended BRL learned behavior faster than that with the standard BRL.

Fig. 6. Typical behavior in the early stage and acquired stable behavior (three robots)

Fig. 9. Typical behavior in the early stage and acquired stable behavior (four robots)

Fig. 13. An example of behavior in the early stage (extended BRL)