How to Recommend Preferable Solutions of a User in Interactive Reinforcement Learning?

We propose a new method of recommending a user's preferable solutions in interactive reinforcement learning. Interactive reinforcement learning differs from standard reinforcement learning in that a human gives the reward function to the learner interactively. As a result, the reward function may not be fixed for the learner if an end-user changes his or her mind or preference. However, most previous reinforcement learning methods assume that the reward function is fixed and that the optimal solution is unique, so they are of little use in interactive reinforcement learning with such an end-user. To solve this, the learner must estimate the user's preference and consider its changes. This paper proposes a new method for matching an end-user's preferred solution with the learner's recommended solution. Experiments are performed with twenty subjects to evaluate the effectiveness of our method. The experimental results show that a large number of subjects preferred the every-visit-optimal solutions to the optimal solution, while a small number of subjects preferred every-visit-non-optimal solutions. We discuss the reason why the end-users' preferences are divided into two groups.


Introduction
In the field of robot learning (Kaplan et al., 2002), an interactive reinforcement learning method, in which the reward function denoting the goal is given interactively, has been used to establish communication between a human and the pet robot AIBO. The main feature of this method is the interactive setup of the reward function, which was a fixed, built-in function in previous reinforcement learning methods. Thus the user can refine the reinforcement learner's behavior sequences incrementally. Shaping (Konidaris & Barto, 2006; Ng et al., 1999) is the theoretical framework of such interactive reinforcement learning methods. Shaping accelerates the learning of complex behavior sequences by guiding learning toward the main goal through additional shaping reward functions that act as subgoals. Previous shaping methods (Marthi, 2007; Ng et al., 1999) make three assumptions about reward functions:
1. The main goal is given or known to the designer.
2. Subgoals are shaping rewards generated by a potential function toward the main goal (Marthi, 2007).
3. Shaping rewards are policy invariant, i.e. they do not affect the optimal policy of the main goal (Ng et al., 1999).
However, these assumptions will not hold in interactive reinforcement learning with an end-user. The main reason is that it is not easy to maintain them while the end-user gives rewards to the reinforcement learning agent: the reward function may not be fixed for the learner if the end-user changes his or her mind or preference. Most previous reinforcement learning methods assume that the reward function is fixed and that the optimal solution is unique, so they are of little use in interactive reinforcement learning with an end-user. To solve this, the learner must estimate the user's preference and consider its changes. This paper proposes a new method for matching an end-user's preferred solution with the learner's recommended solution. Our method consists of three ideas. First, we assume every-visit-optimality as the optimality criterion of preference for most end-users. Including this, section 2 gives an overview of interactive reinforcement learning in our research. Second, to cover the end-user's preference changes after the reward function is given by the end-user, interactive LC-learning prepares various policies (Satoh & Yamaguchi, 2006) by generating variations of the reward function under every-visit-optimality; this is described in section 3. Third, we propose a coarse to fine recommendation strategy for guiding the end-user's current preference among the various policies in section 4. To examine these ideas, we perform an experiment with twenty subjects to evaluate the effectiveness of our method. As the experimental results, first, a majority of subjects preferred the every-visit plan (visiting all goals) to the optimal plan. Second, the majority of them preferred shorter plans, and the minority preferred longer plans. We discuss the reason why the end-users' preferences are divided into two groups. These results are described in section 5. In section 6, the search ability of interactive LC-learning in a stochastic domain is evaluated. Section 7 describes relations between our proposed solutions and current research issues on recommendation systems. Finally, section 8 presents our conclusions and future work.

Interactive reinforcement learning
This section describes the characteristics of interactive reinforcement learning in our research and gives an overview of our system.

Interactive reinforcement learning with human
Table 1 shows the characteristics of interactive reinforcement learning. In reinforcement learning, an optimal solution is determined by the reward function and the optimality criterion. In standard reinforcement learning, the optimal solution is fixed since both the reward function and the optimality criterion are fixed. In interactive reinforcement learning, on the other hand, the optimal solution may change according to the interactively given reward function. Furthermore, in interactive reinforcement learning with a human, various optimal solutions will occur since the optimality criterion depends on the human's preference. The objective of this research is therefore to recommend preferable solutions to each user. The main problem is how to guide and estimate the user's preference. Our solution consists of two ideas: one is to prepare various solutions through every-visit-optimality (Satoh & Yamaguchi, 2006), and the other is the coarse to fine recommendation strategy (Yamaguchi & Nishimura, 2008).

Overview of the plan recommendation system
Fig. 1 shows an overview of the plan recommendation system. When a user inputs several goals to visit as his or her preferred goals, they are converted into a set of rewards in the plan recommendation block and passed as input to the interactive LC-learning (Satoh & Yamaguchi, 2006) block. After various policies are prepared, each policy is output as a round plan recommended to the user. The user comes to focus on his or her preference criteria through the interactive recommendation process, which finishes after the user decides on the preferred plan. The next section describes the interactive LC-learning block, which is an extended model-based reinforcement learning method. As shown in Fig. 2, our learning agent consists of three blocks: a model identification block, an optimality criterion block, and a policy search block. The details of these blocks are described in the following sections. The novelty of our method lies in the optimality criterion, every-visit-optimality, and in the policy search method, which collects various policies.

Model identification
In the model identification block, the state transition probabilities P(s'|s,a) and the reward function R(s,a) are estimated incrementally by observing a sequence of (s,a,r) tuples, where s is an observed state, a is an executed action, and r is an acquired reward. The estimated model is generally assumed to be a Markov Decision Process (MDP) (Puterman, 2006), defined by four elements: the set of states, the set of actions, the state transition probabilities P(s'|s,a), and the reward function R(s,a).
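As a rough sketch of this estimation (not the authors' implementation; the class and method names below are ours), the following Python code maintains empirical estimates of P(s'|s,a) and R(s,a) from observed (s, a, r, s') steps:

```python
from collections import defaultdict

class ModelIdentifier:
    """Incrementally estimates P(s'|s,a) and R(s,a) from observed steps."""

    def __init__(self):
        self.next_state_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.reward_sums = defaultdict(float)                           # (s, a) -> sum of observed rewards
        self.visit_counts = defaultdict(int)                            # (s, a) -> number of executions

    def observe(self, s, a, r, s_next):
        """Update the counts with one observed step (s, a, r, s')."""
        self.next_state_counts[(s, a)][s_next] += 1
        self.reward_sums[(s, a)] += r
        self.visit_counts[(s, a)] += 1

    def P(self, s_next, s, a):
        """Estimated state transition probability P(s'|s,a)."""
        n = self.visit_counts[(s, a)]
        return self.next_state_counts[(s, a)][s_next] / n if n else 0.0

    def R(self, s, a):
        """Estimated expected reward R(s,a)."""
        n = self.visit_counts[(s, a)]
        return self.reward_sums[(s, a)] / n if n else 0.0
```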

Optimality criterion
The optimality criterion block defines the optimality of the learning policy. In this research, a policy that maximizes the average reward is defined as an optimal policy. Eq. (1) shows the definition of the average reward.
ρ^π(s) = lim_{N→∞} (1/N) Σ_{t=1}^{N} E[r_t(π, s)]     (1)

where N is the number of steps, E[r_t(π, s)] is the expected value of the reward that the agent acquires at step t when its policy is π and its initial state is s, and E[·] denotes the expected value. To simplify, we use the gain-optimality criterion of LC-Learning (Konda et al., 2002a), in which the average reward can be calculated from the expected length of a reward acquisition cycle and the expected sum of the rewards in the cycle. We then introduce every-visit-optimality as a new learning criterion based on the average reward. An every-visit-optimal policy is the policy with the largest average reward among those that visit every reward in the reward function. For example, if the reward function has two rewards, the every-visit-optimal policy is the one with the largest average reward that visits both rewards. Fig. 3 shows an example of an every-visit-optimal policy with two rewards.
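The cycle-based calculation of the gain can be illustrated with a minimal sketch, assuming a deterministic reward acquisition cycle; the helper names are hypothetical:

```python
def cycle_average_reward(rewards_along_cycle):
    """Average reward of a deterministic reward acquisition cycle:
    the sum of rewards acquired along the cycle divided by its length."""
    return sum(rewards_along_cycle) / len(rewards_along_cycle)

def visits_every_reward(acquired_rewards, reward_set):
    """An every-visit candidate acquires every reward in the reward set."""
    return reward_set <= set(acquired_rewards)

# A 6-step cycle that acquires both rewards (values 1.0 and 2.0):
print(cycle_average_reward([0.0, 1.0, 0.0, 0.0, 2.0, 0.0]))  # -> 0.5
print(visits_every_reward({"Rw1", "Rw2"}, {"Rw1", "Rw2"}))   # -> True
```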

Policy search
The policy search block searches for every-visit-optimal policies on the identified model according to the optimality of policies. Each policy is converted to a round plan by extracting a cycle. The details of this block are described in the next section.

Preparing various round plans
This section describes the definition of various round plans and the method for searching various policies.

Illustrated example
To begin with, we show an illustrated example. Fig. 4 shows an overview of preparing various round plans with two rewards. When an MDP has two rewards as shown in Fig. 4 (a), 2^2 − 1 = 3 kinds of every-visit-optimal policies are prepared (Fig. 4 (b)). Each policy is converted to a round plan by extracting its reward acquisition cycle (Fig. 4 (c)), since each policy consists of a reward acquisition cycle and some transit paths.

Definition of various round plans by every-visit-optimality
Various round plans are defined by the following steps:
1. Enumerate all subsets of the reward function.
2. Search for an every-visit-optimal policy for each subset of the reward function.
3. Collect all every-visit-optimal policies and convert them into round plans.
Fig. 5 illustrates the process of searching for various round plans. When the reward function is identified as {Rw1, Rw2}, the subsets enumerated in step 1 are {Rw1}, {Rw2}, and {Rw1, Rw2}. Then an every-visit-optimal policy is decided for each subset of the reward function in step 2. Finally, these every-visit-optimal policies are collected as the various round plans. The number of plans in the various round plans is 2^r − 1, where r is the number of rewards in the model.
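A minimal sketch of these three steps follows; `search_every_visit_optimal` stands in for the policy search of section 3.3 and is assumed, not defined here:

```python
from itertools import combinations

def reward_subsets(rewards):
    """Step 1: enumerate all non-empty subsets of the reward function (2^r - 1 subsets)."""
    rewards = list(rewards)
    for k in range(1, len(rewards) + 1):
        for subset in combinations(rewards, k):
            yield frozenset(subset)

def prepare_round_plans(rewards, search_every_visit_optimal):
    """Steps 2-3: search an every-visit-optimal policy for each subset and
    collect the results as the various round plans (one plan per subset)."""
    plans = {}
    for subset in reward_subsets(rewards):
        policy = search_every_visit_optimal(subset)   # assumed policy search routine (section 3.3)
        if policy is not None:
            plans[subset] = policy
    return plans

# With reward function {Rw1, Rw2}, step 1 yields {Rw1}, {Rw2}, {Rw1, Rw2}: 2^2 - 1 = 3 subsets.
print(list(reward_subsets(["Rw1", "Rw2"])))
```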

Searching various policies
This section describes our method for searching various policies by interactive LC-Learning (Satoh & Yamaguchi, 2006). LC-Learning (Konda et al., 2002a; Konda et al., 2002b) is one of the average-reward model-based reinforcement learning methods (Mahadevan, 1996). The features of LC-Learning are the following:
1. Breadth-first search of an optimal policy started from each reward rule.
2. Calculation of the average reward from the reward acquisition cycle of each policy.
(1) Search reward acquisition policies
In this step, reward acquisition policies are searched for by converting the MDP into tree structures whose root rules are the reward acquisition rules. We show an illustrated example. Fig. 7 shows an MDP model with two rewards r1 and r2. It is converted into two tree structures, shown in Fig. 8: first a tree rooted at reward r1 (Fig. 8 (a)) is generated, then a tree rooted at reward r2 (Fig. 8 (b)). In a tree structure, a policy is a path from the root node to a state that is the same as the root node's state. Within a path, an expanded state that is the same as a previous node is pruned, since it would only form a local cycle; in Fig. 8, nodes D and B are pruned. Fig. 9 shows all reward acquisition policies of the MDP in Fig. 7. In a stochastic environment, several rules branch stochastically. In such a case, a path from the parent node of a stochastic rule to a state that has already been extracted becomes part of the policy containing the stochastic rule; policy 12 in Fig. 9 is an example of this.
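The following sketch illustrates the deterministic case of this tree search (stochastic branching is omitted); the model representation and function names are our own assumptions, not the paper's implementation:

```python
def reward_acquisition_cycles(transitions, rewarded_rules):
    """For each rewarded rule (s, a), taken as the root of a tree, enumerate the
    simple cycles that start with that rule and return to s.  A state that already
    appears on the current path is pruned, since expanding it would only form a
    local cycle.  `transitions` is a deterministic model: {state: {action: next_state}}."""
    cycles = []
    for (root_state, root_action) in rewarded_rules:
        first_next = transitions[root_state][root_action]
        stack = [(first_next, [(root_state, root_action)])]
        while stack:
            state, path = stack.pop()
            if state == root_state:                      # the path closes into a cycle
                cycles.append(path)
                continue
            if state in (s for s, _ in path):            # local cycle: prune this expansion
                continue
            for action, next_state in transitions[state].items():
                stack.append((next_state, path + [(state, action)]))
    return cycles

# Tiny deterministic example: A -a-> B -b-> C -c-> A, where rule (A, "a") is rewarded.
model = {"A": {"a": "B"}, "B": {"b": "C"}, "C": {"c": "A"}}
print(reward_acquisition_cycles(model, {("A", "a")}))
```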
(2) Calculate average reward
In this step, the average reward of each policy is calculated using the occurring probability of each state of the policy. The occurring probability of a state is the expected number of times the state is visited while the agent travels from the initial state back to the initial state. Eq. (2) shows the definition of the occurring probability of state s_j when the initial state is s_i. The occurring probability of each state is calculated approximately by value iteration using Eq. (2),
where a_k is the action executed at state s_k. The average reward of each policy is then calculated by Eq. (3) using the occurring probabilities obtained from Eq. (2).
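Since Eqs. (2) and (3) are not reproduced here, the following sketch only illustrates the computation as described in the text: occurring probabilities are propagated iteratively along the fixed policy, and the average reward is the expected sum of rewards per cycle divided by the expected cycle length. All names and the model interface are our assumptions:

```python
def occurring_probabilities(policy, P, initial_state, states, n_iter=100):
    """Approximate the occurring probability of each state: the expected number of
    times it is visited while the agent travels from the initial state back to the
    initial state under the fixed policy.  Probability mass is propagated step by
    step, which corresponds to the iterative calculation described in the text.
    `policy` maps state -> action and P(s, a) returns {s': probability}."""
    occ = {s: 0.0 for s in states}
    occ[initial_state] = 1.0                 # each cycle starts at the initial state exactly once
    frontier = {initial_state: 1.0}
    for _ in range(n_iter):
        next_frontier = {}
        for s, mass in frontier.items():
            for s_next, p in P(s, policy[s]).items():
                if s_next == initial_state:  # returning to the initial state closes the cycle
                    continue
                next_frontier[s_next] = next_frontier.get(s_next, 0.0) + mass * p
                occ[s_next] += mass * p
        frontier = next_frontier
        if not frontier:                     # all probability mass has returned to the initial state
            break
    return occ

def average_reward(policy, P, R, initial_state, states):
    """Average reward of a policy: the expected sum of rewards per cycle divided by
    the expected cycle length, both weighted by the occurring probabilities."""
    occ = occurring_probabilities(policy, P, initial_state, states)
    expected_reward = sum(mass * R(s, policy[s]) for s, mass in occ.items() if mass > 0)
    expected_length = sum(occ.values())
    return expected_reward / expected_length
```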
(3'-1) Classify policies by reward subset
In this step, all policies found in step 1 are classified by the set of rewards they acquire.
(3'-2) Decide every-visit-optimal policies
In this step, an every-visit-optimal policy is decided for each group classified in step (3'-1). Each every-visit-optimal policy is the policy with the maximum average reward in its group.
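A compact sketch of steps (3'-1) and (3'-2), assuming helper functions that return a policy's acquired reward set and its average reward:

```python
def select_every_visit_optimal(policies, acquired_rewards_of, average_reward_of):
    """Step (3'-1): classify all policies by the set of rewards their cycle acquires.
    Step (3'-2): in each group, keep the policy with the maximum average reward
    as the every-visit-optimal policy for that reward subset."""
    groups = {}
    for policy in policies:
        key = frozenset(acquired_rewards_of(policy))
        groups.setdefault(key, []).append(policy)
    return {subset: max(group, key=average_reward_of) for subset, group in groups.items()}
```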

Plan recommendation
This section describes the plan recommendation system and the coarse to fine recommendation strategy (Yamaguchi & Nishimura, 2008). In this section, a goal is a reward to be acquired, and a plan means a cycle that acquires at least one reward in a policy.

Grouping various plans by the visited goals
After the various round plans are prepared as described in section 3.3, they are merged into groups by the number of acquired rewards.

Coarse to fine recommendation strategy
After grouping the various plans by the number of visited goals, they are presented to the user sequentially so that the user can select the most preferable plan. We call the way this order is decided the recommendation strategy. In this paper, we propose the coarse to fine recommendation strategy, which consists of two steps: a coarse recommendation step and a fine recommendation step.
(1) Coarse recommendation step
For the user, the aim of this step is to select a preferable group. To support the user's decision, the system recommends a representative plan in each selected group to the user.
Fig. 11 shows a coarse recommendation sequence in which a user changes his or her preferable group to Group1, Group2, and Group3 sequentially. When the user selects a group, the system presents the representative plan of the group as the recommended plan.
(2) Fine recommendation step
For the user, the aim of this step is to decide on the most preferable plan within the group selected in the previous step. To support the user's decision, the system recommends plans from the selected group to the user. Fig. 12 shows a fine recommendation sequence after the user selects Group2 as his or her preferable group. In each group, plans are ordered according to their length.
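The grouping and the two recommendation steps can be sketched as follows; the plan representation and the choice of the shortest plan as the representative are assumptions made for illustration:

```python
def group_plans_by_goal_count(plans):
    """Group the round plans by the number of goals (rewards) they visit."""
    groups = {}
    for plan in plans:
        groups.setdefault(len(plan["goals"]), []).append(plan)
    return groups

def coarse_recommendation(groups, selected_goal_count):
    """Coarse step: recommend one representative plan (here, the shortest one)
    from the group the user currently selects."""
    return min(groups[selected_goal_count], key=lambda plan: plan["length"])

def fine_recommendation(groups, selected_goal_count):
    """Fine step: present all plans of the selected group ordered by length, so
    the user can decide on the most preferable one."""
    return sorted(groups[selected_goal_count], key=lambda plan: plan["length"])

# Hypothetical round plans: each visits some goals and has a total length.
plans = [
    {"goals": {"Rw1"}, "length": 3},
    {"goals": {"Rw1", "Rw2"}, "length": 5},
    {"goals": {"Rw1", "Rw2"}, "length": 7},
    {"goals": {"Rw1", "Rw2", "Rw3"}, "length": 9},
]
groups = group_plans_by_goal_count(plans)
print(coarse_recommendation(groups, 2))   # the representative (shortest) two-goal plan
print(fine_recommendation(groups, 2))     # all two-goal plans, shortest first
```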

Experiment
We perform the experiment with twenty subjects, aged 19 to 21 years, to evaluate the effectiveness of our method.

The round-trip plan task
Fig. 13 shows the round-trip plan recommendation task in Hokkaido. For a subject, this task is executed by the following steps:
1. The subject selects four cities to visit, and various round-trip plans are recommended.
2. The subject decides on the most preferred round-trip plan among them.
The task for a subject is thus to decide on the most preferred round-trip plan after selecting four cities to visit among eighteen cities. The task for the system is to estimate the preferable round-trip plans for each user and to recommend them sequentially.

Experimental results
Fig. 14 shows the most preferred plan of each of the twenty subjects. The horizontal axis is the number of visited cities (goals), and the vertical axis is the number of subjects. The summary of the experimental result is as follows. First, the majority of subjects preferred the every-visit plan (visiting all four cities) to the optimal plan. Second, the majority preferred shorter plans and the minority preferred longer plans. We now focus on these two points. The first point is the effectiveness of the every-visit criterion. After selecting four cities, 15 subjects (three-quarters) preferred every-visit plans that visit all four selected cities. In contrast, only 5 subjects preferred optimal plans with shorter length, even though these plans do not visit all four cities. This suggests that the every-visit criterion is preferable to the standard optimality criterion for human users. The second point is that the users' preferences are divided into two groups: shorter plans or longer plans. We look more closely at the preferences for every-visit plans among the 15 subjects. Among them, 10 subjects (two-thirds) preferred shorter (every-visit-optimal) plans, and 5 subjects (one-third) preferred longer (every-visit-non-optimal) plans. All 20 subjects show a similar tendency. Table 2 summarizes the experimental result: a majority of subjects preferred shorter plans, which are either optimal or every-visit-optimal, while a minority of subjects preferred longer plans, which are every-visit-non-optimal. The reason why the end-users' preferences are divided into two groups is discussed in the next section.
Table 2. Summary of the most preferred plans

Discussions
(1) Why are the end-users' preferences divided?
We discuss the reason why the end-users' preferences are divided into two groups. Fig. 15 shows one of the every-visit-optimal plans that the majority of subjects preferred. According to the results of the questionnaire survey, the majority of subjects who selected an every-visit-optimal plan had little knowledge of Hokkaido (or no experience of visiting Hokkaido).
In contrast, the minority of subjects selected every-visit-non-optimal plans, which contain additional cities to visit suggested by the plan recommendation. Fig. 16 shows one of the every-visit-non-optimal plans that the minority of subjects preferred. According to the results of the questionnaire survey, the majority of subjects who selected an every-visit-non-optimal plan had considerable knowledge of, or interest in, Hokkaido. This suggests that the preference of a user depends on the degree of the user's background knowledge of the task. In other words, whether the recommendation changes an end-user's preference depends on whether he or she has background knowledge of the task. Note that in our current plan recommendation system, no background knowledge on the recommended round-trip plans except Fig. 13 is presented to the subjects. If information about the recommended plans were provided, we expect that the preference changes of these two kinds of subjects would differ.
(2) The search ability of interactive LC-learning
The computing time of the round-trip plan task in Fig. 13, including graphical output, is no more than one second per user input with interactive LC-learning, since the task is a deterministic MDP model. We therefore summarize the search ability of LC-Learning in a stochastic case (Satoh & Yamaguchi, 2006). We compare two kinds of search abilities of LC-Learning with those of Modified-PIA (Puterman, 2006). First, the search cost of LC-Learning increases linearly when the number of rewards increases linearly, whereas the search cost of Modified-PIA increases nonlinearly. Besides, Modified-PIA collects no every-visit-optimal policy when the number of rewards is more than three. These results suggest that our method is better suited than previous reinforcement learning methods to interactive reinforcement learning in which many rewards are added incrementally. The comparative experiments are detailed in section 6.
(3) Every-visit-optimality in a non-deterministic environment
In a stochastic environment, every-visit-optimality is defined as p-every-visit-optimality, in which each reward is visited stochastically with probability not less than p (0 < p ≤ 1). It can be calculated from the occurring probability of each rewarded rule described in section 3.3 (2). Note that 1-every-visit-optimality means that each reward is visited deterministically even in a stochastic environment.
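A minimal sketch of this p-every-visit check, assuming the visit probability of each reward along the policy's cycle is already available:

```python
def is_p_every_visit(reward_visit_probabilities, p):
    """A policy is p-every-visit if every reward in the reward function is visited
    with probability not less than p (0 < p <= 1).  The argument maps each reward
    to its visit probability along the policy's reward acquisition cycle."""
    assert 0.0 < p <= 1.0
    return all(prob >= p for prob in reward_visit_probabilities.values())

# Rw1 is always visited, Rw2 only 80% of the time along the cycle:
print(is_p_every_visit({"Rw1": 1.0, "Rw2": 0.8}, p=0.7))  # -> True
print(is_p_every_visit({"Rw1": 1.0, "Rw2": 0.8}, p=0.9))  # -> False
```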

Evaluating the search ability of interactive LC-learning
To evaluate the effectiveness of interactive LC-learning in a stochastic domain, comparative experiments with the preprocessed Modified-PIA are performed as the number of rewards increases. We compare the following two kinds of search abilities.
1. The search cost for every-visit-optimal policies
2. The number of collected every-visit-optimal policies

Preprocess for Modified-PIA
Modified-PIA (Puterman, 2006) is one of the model-based reinforcement learning methods based on PIA, modified for the average reward. However, Modified-PIA searches for a single optimal policy, so it is not fair to directly compare its search cost with that of LC-Learning, which searches for various policies. To enable Modified-PIA to search for various policies, the following preprocess is added (Fig. 17):
1. Enumerate the models that contain each subset of the reward set of the original model.
2. Search for an optimal policy for each subset of the reward function using Modified-PIA.
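A sketch of this preprocess is shown below; `modified_pia` stands for an existing optimal-policy search routine and the model representation is an assumption:

```python
from itertools import combinations

def restrict_rewards(model, reward_subset):
    """Copy the model, keeping only the rewarded rules in the given subset."""
    restricted = dict(model)
    restricted["rewards"] = {rule: value for rule, value in model["rewards"].items()
                             if rule in reward_subset}
    return restricted

def preprocessed_modified_pia(model, rewarded_rules, modified_pia):
    """Preprocess for Modified-PIA: (1) enumerate the models containing each
    non-empty subset of the original reward set, then (2) run Modified-PIA on
    each restricted model to obtain one optimal policy per subset."""
    policies = {}
    for k in range(1, len(rewarded_rules) + 1):
        for subset in combinations(rewarded_rules, k):
            restricted = restrict_rewards(model, set(subset))
            policies[frozenset(subset)] = modified_pia(restricted)   # assumed search routine
    return policies
```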

Experimental setup
We use one hundred MDP models with randomly set state transition probabilities and reward functions as the experimental stochastic environment, in which the number of rewards is varied from 1 to 10, the number of states is 10, and the number of actions is 4.
As the measure of the search cost, we use the iteration count for calculating the occurring probabilities of states for LC-Learning, and the iteration count for calculating the value function for Modified-PIA.

The search cost for every-visit-optimal policies
To begin with, the search cost for every-visit-optimal policies is evaluated. Fig. 18 shows the comparative search cost as the number of rewards increases. The result indicates that the search cost of LC-Learning grows linearly while that of Modified-PIA grows non-linearly as the number of rewards increases.
We then discuss the theoretical search cost. In Modified-PIA, MDP models that contain each subset of the reward set of the original MDP are constructed, and an optimal policy is searched for in each of them. So the original Modified-PIA is performed 2^r − 1 times, where r is the number of rewards. After one reward is added, the incremental search cost is

(2^{r+1} − 1) − (2^{r} − 1) = 2^{r}     (4)

Eq. (4) means that the search cost of Modified-PIA increases nonlinearly when the number of rewards increases. In contrast, in LC-Learning, the number of tree structures increases linearly when the number of rewards increases, so the search cost of LC-Learning is considered to increase linearly with the number of rewards.
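As a small numerical illustration of Eq. (4) (the per-reward increment of one tree for LC-Learning counts new tree structures, not its exact iteration count):

```python
# Incremental search cost after adding one reward, as in Eq. (4): Modified-PIA must
# handle (2^(r+1) - 1) - (2^r - 1) = 2^r additional reward subsets, whereas
# LC-Learning only adds one more tree structure (one more root reward rule).
for r in range(1, 11):
    modified_pia_increment = (2 ** (r + 1) - 1) - (2 ** r - 1)   # equals 2^r
    lc_learning_increment = 1
    print(f"r={r:2d}: Modified-PIA +{modified_pia_increment:5d}, LC-Learning +{lc_learning_increment}")
```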

The number of collected every-visit-optimal policies
To evaluate the effectiveness of interactive LC-learning, another search ability is compared with the preprocessed Modified-PIA. Note that the experimental setup is the same as that described in section 6.2. Fig. 19 shows the number of collected every-visit-optimal policies. While LC-learning collects all every-visit-optimal policies, the preprocessed Modified-PIA collects fewer of them. Then, analyzing the case of six rewards in detail, Fig. 20 shows the rate of collected every-visit-optimal policies, that is, the number collected by the preprocessed Modified-PIA as a percentage of the number collected by LC-learning. It shows that the preprocessed Modified-PIA collects no every-visit-optimal policy when the number of rewards is more than three.
We now discuss the reason why the number of every-visit-optimal policies collected by the preprocessed Modified-PIA is smaller than that collected by LC-learning. Since the preprocessed Modified-PIA is based on standard optimality, it searches for an optimal policy in each MDP with a subset of the reward set of the original model, as shown in Fig. 17. This means that the preprocessed Modified-PIA finds an every-visit-optimal policy only when it coincides with the optimal policy of that MDP model. As the number of rewards increases, the rate of every-visit-optimal policies that coincide with the optimal policy decreases. In other words, the distinction between the two criteria becomes larger as the number of rewards increases.
Since most previous reinforcement learning methods, including Modified-PIA, are based on the standard optimality criterion, they learn only an optimal policy. Therefore, under the every-visit-optimality criterion, our method is better suited than previous reinforcement learning methods to interactive reinforcement learning in which many rewards are added incrementally.

Related works on recommender systems
This section describes relations between our proposed solutions and current research issues on recommendation systems. The main feature of our recommendation system is interactive and adaptable recommendation for human users by interactive reinforcement learning. First, we describe two major problems of traditional recommenders. Second, interactive recommendation systems called conversational recommenders are summarized. Finally, adaptive recommenders with learning ability are described.

Major problems on traditional recommenders
The main objective of recommender systems is to provide people with recommendations of items they will appreciate, based on their past preferences. The major approach is collaborative filtering, whether user-based or item-based (Sarwar et al., 2001), as used by Amazon.com. The common feature is that similarity is computed between users or items based on their past preferences. However, there are two major issues. The first is the similar recommendations problem (Ziegler et al., 2005), in which many recommendations seem to be "similar" with respect to content because of the lack of novelty, serendipity (Murakami et al., 2007) and diversity of recommendations. The second is the preference change problem (Yamaguchi et al., 2009), that is, the inability to capture the user's preference change during the recommendation; it often occurs when the user is a beginner or a light user. For the first issue, there are two kinds of previous solutions. One is topic diversification (Ziegler et al., 2005), which is designed to balance and diversify personalized recommendation lists to cover the user's full range of interests in specific topics. The other is visualizing the feature space (Hijikata et al., 2006) so that the user can edit his or her profile and search for different items in it. However, these solutions do not directly consider a user's preference change. To address this, this paper models a user's preference change in a two-axis space with coarse and fine axes.

Interactive recommendation systems
Traditional recommenders are simple and non-interactive since they only decide which product to recommend to the user, so it is hard for them to support the recommendation of more complex products such as travel products (Mahmood et al., 2009). Therefore, conversational recommender systems (Bridge et al., 2006) have been proposed to support more natural and interactive processes. Typical interactive recommendation follows one of two strategies (Mahmood et al., 2008): 1. ask the user in detail about her preferences, or 2. propose a set of products to the user and exploit the user feedback to refine future recommendations. A major limitation of this approach is that there can be a large number of conversational but rigid strategies for a given recommendation task (Mahmood et al., 2008).

Adaptive recommenders with learning ability
There are several adaptive recommenders using reinforcement learning. Most of them observe a user's behavior, such as the products the user viewed or selected, and then learn the user's decision processes or preferences. To improve on the rigid strategies of conversational recommenders, learning personalized interaction strategies for conversational recommender systems has been proposed (Mahmood & Ricci, 2008; Mahmood & Ricci, 2009; Mahmood et al., 2009). The major difference of our approach is that it provides adaptable recommendation for human users through a passive recommendation strategy called coarse to fine recommendation. Adaptable recommendation means that during our recommendation, the user can switch between the two steps (coarse step or fine step) as he or she likes before deciding on the most preferable plan.

Conclusions
In this paper, we proposed a new method of interactive LC-learning for recommending a user's preferable solutions.
1. Every-visit-optimality was assumed as the optimality criterion of preference for most end-users.
2. To cover the end-user's preference changes after the reward function is given by the end-user, interactive LC-learning prepared various policies by generating variations of the reward function under every-visit-optimality.
3. For guiding the end-user's current preference among the various policies, the coarse to fine recommendation strategy was proposed.
As the experimental results, first, the majority of subjects preferred the every-visit plan (visiting all goals) to the optimal plan. Second, the majority preferred shorter plans and the minority preferred longer plans. We discussed the reason why the end-users' preferences are divided into two groups. The search ability of interactive LC-learning in a stochastic domain was then evaluated. Future work is to assist the user in deciding on the most preferred plan by making his or her own potential preferences known to the user. To realize this idea, we are evaluating passive recommendation that visualizes the coarse to fine recommendation space and the history of the recommendation within it (Yamaguchi et al., 2009).

Fig. 7. An example of an MDP model

Fig. 9. All reward acquisition policies of the MDP in Fig. 7

Fig. 10 shows the grouping of various plans by the number of visited goals. When three goals are input by a user, they are converted into three kinds of rewards, Rw1, Rw2, and Rw3. Group1 in Fig. 10 then holds the plans acquiring only one reward among Rw1, Rw2, and Rw3, Group2 holds the plans acquiring two of these rewards, and Group3 holds the plans acquiring all of Rw1, Rw2, and Rw3.

Fig. 12. Fine recommendation in the selected group

Fig. 14. The result of the most preferred plans

Fig. 18. Search cost when the number of rewards increases

Table 1. Characteristics of interactive reinforcement learning