Humanoid with Interaction Ability Using Vision and Speech Information

Intelligent robots offer a new way to use computers in daily life. We first implemented a humanoid robot for computerized university guidance and then added capabilities for natural interaction. This paper describes the hardware and software of this interactive humanoid. HRP-2, seated across a table from a user, can detect the user's gaze direction, head pose and gestures using a stereo camera system attached to its head. In addition, the system can recognize the user's spoken questions as well as non-stationary noises such as coughing and sneezing using microphones. By using this information efficiently, HRP-2 can answer questions with a synthesized voice and gestures and carry out tasks such as passing objects on the table.


I. INTRODUCTION
Recently, much research has addressed harmonized human-robot interaction in real environments [1] [2]. Speech recognition is useful for human-robot communication, and many robots have such interfaces [3] [4]. Interfaces using non-verbal information such as facial expression and gaze are also regarded as important for interaction. We have developed a reception guidance humanoid robot, "ASKA", which can interact with humans using verbal information and non-verbal information such as gaze direction, head pose and lip motion [5].

Ryuichi Nisimura
Faculty of Systems Engineering, Wakayama University
930 Sakaedani, Wakayama-shi, Wakayama, Japan
nisimura@sys.wakayama-u.ac.jp

In this paper, we introduce a humanoid robot system for research on human-robot communication. Fig. 1 shows an overview of our humanoid robot HRP-2. The humanoid robot system with interaction ability was developed at NAIST (Nara Institute of Science and Technology) through a collaboration between the Robotics Laboratory and the Speech Laboratory. It is used as a research platform for developing an intelligent real-world interface using various information technologies studied at our institute.
The following functions are implemented for human-robot interaction:
1) Speech recognition
2) Speech synthesis
3) Facial information measurement
4) Portrait drawing
5) Gesture recognition
The dialogue system using large vocabulary continuous speech recognition and the eye contact system using facial information measurement are the unique features of this robot.
The rest of this paper is organized as follows. The design concepts are described in Section II. The hardware and software configurations are described separately in Section III. Section IV describes the voice interface implemented in this system. The interaction modules using visual information are explained in Section V. Section VI reports a demonstration in a real environment. Finally, we summarize our research and mention future work in Section VII.

II. DESIGN CONCEPT
Our robot system was designed around two concepts: (1) a research platform for various information technologies, and (2) the achievement of human-robot interaction.

A. Research Platform
The main goal in the early stage of development was to build a research platform for various information technologies using a robot. The software architecture was designed based on this concept. Fig. 4 shows a simple configuration in which each module exchanges its own status and sensory information with the server. Each module runs independently and can be started and stopped at any time. This modularity enables rapid development and easy maintenance of the modules.

B. Human-robot Interaction
The information used in face-to-face communication falls into two major categories: "verbal" and "non-verbal" information. Although the former is the primary information in communication, the latter, such as facial direction, gaze and gesture, has recently been emphasized as a means of natural human-robot interaction. In this research we focus on face direction and gesture information, and try to achieve more natural interaction by combining them with speech information.

III. SYSTEM CONFIGURATION
In this section we describe how the software and hardware of our system are constructed. A typical interaction scenario is then described.

A. Hardware
The system is composed of a humanoid body, stereo cameras, hand-held microphones, a speaker and PCs, as shown in Fig. 2. HRP-2 (KAWADA Industries, Inc.) is used as the humanoid body. A stereo camera system with four IEEE1394 cameras (Flea, Point Grey Research Inc.), eight tiny microphones and an 8ch A/D board (TD-BD-8CSUSB, Tokyo Electron Device Ltd.) are installed in the head of HRP-2. The eight built-in microphones in the head are connected to the on-board vision PC via the A/D board, so that 8ch speech signals can be captured simultaneously. Additionally, a hand-held microphone can be connected to the external PC for speech recognition. Switching between these two microphone systems is done in software. The hand-held microphone enables interaction in places where the background noise is so loud that recognition using the built-in microphones fails. Two external PCs are used in addition to the PC built into the robot. One is used for speech recognition and speech synthesis, and the other is used as the terminal PC of the robot.
A special chair, shown in Fig. 3(A), was built so that HRP-2 can sit during the experiment. Regrettably, HRP-2 cannot seat itself because it has to be bolted to the chair for stability, as shown in Fig. 3(B).

B. Software
... Measurement Modules, and they are connected to the vision sub-server. The speech recognition module has an independent interface called adintool to record, split, send and receive speech data. These interfaces make it possible to select the speech inputs without affecting the other modules.
These modules run on distributed PCs and communicate with a server program through sockets over TCP/IP, as shown in Fig. 4. This is a simple implementation of a blackboard system [6]. The server collects all information (sensory information and execution status) from all client modules. Each client module can access the server to obtain any information it needs to decide what action to take. Each module runs independently and can be started and stopped at any time. This modularity enables rapid development and easy maintenance of the modules.
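The blackboard idea can be illustrated with a minimal in-process sketch. It is not the system's actual server: the paper's modules talk to the server over TCP/IP sockets, which are omitted here, and the class name `Blackboard`, its methods and the keys are hypothetical names chosen for illustration.

```python
import threading
import time


class Blackboard:
    """In-process stand-in for the central server (hypothetical API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # (module, key) -> (value, timestamp)

    def post(self, module, key, value):
        """A client module publishes its status or a sensor reading."""
        with self._lock:
            self._entries[(module, key)] = (value, time.time())

    def read(self, module, key, default=None):
        """Any client may read what any other module has published."""
        with self._lock:
            entry = self._entries.get((module, key))
        return default if entry is None else entry[0]


board = Blackboard()
board.post("face", "facing_robot", True)
board.post("speech", "status", "listening")

# A module that has not started yet simply has no entries on the board.
gesture_direction = board.read("gesture", "pointing_direction", default=None)
```

Because readers only see the latest posted values, a module can be started or stopped without the others having to be aware of it, which is the modularity property described above.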

C. Interaction Scenario
HRP-2, seated across a table from a user, can detect the user's face and gaze directions and recognize the question asked by the user. A typical interaction between a user and the humanoid proceeds as follows:
1. The humanoid detects the direction of the user's face.
2. When the user's face is detected to be turned toward the humanoid, the user is regarded as intending to talk to it. The user can talk to the humanoid with gestures.
3. The humanoid recognizes the question and responds with voice and gesture, or carries out the requested task.
Fig. 3. (A) The purpose-built chair (B) HRP-2 fixed on the chair
The speech dialogue system of the humanoid can answer the following kinds of questions:
- Office and laboratory locations
- Extension telephone numbers of staff members
- Locations of university facilities
- Today's weather report, news and current time
- Greetings
In addition to these questions, commands such as "passing objects" or "drawing a portrait" can be recognized and carried out. The training corpus used for speech recognition is described in the following section.
The gesture motions used in responses are defined beforehand using dedicated software, "Motion Creator" [7]. These motions are linked to the corresponding sentences manually.

IV. VOICE INTERFACE USING SPEECH AND NOISE RECOGNITION
The voice interface of our system contains two sound recognition methods running in parallel to allow flexible interaction with users. We implemented a spoken dialogue routine based on continuous speech recognition with a large vocabulary dictionary in order to accept a wide variety of user utterances. We also introduced a non-stationary noise recognition program based on likelihood measurements using Gaussian Mixture Models (GMMs). It provides not only a mechanism for rejecting environmental noise, but also novel human-robot interaction schemes that discern unintended user sounds such as laughter and coughing. This section explains the speech recognition and the noise recognition.

A. Speech Recognition
Continuous speech recognition has achieved remarkable performance. However, sufficient accuracy in recognizing natural spontaneous utterances has not yet been attained. To obtain higher accuracy, we needed to prepare statistical models suited to the task beforehand.
Our speech recognition engine "Julius" [8] requires a language model and an acoustic model as statistical knowledge.
The composition of each model is described below.
For the acoustic model, we use a speaker-independent PTM [9] triphone HMM (Hidden Markov Model). This model represents the appearance probability of each phoneme while taking into account context-dependent co-articulation, where the context consists of the current phoneme and its left and right neighbors.
An acoustic model for the HRP-2 was trained from the following data using HTK (Hidden Markov Model Toolkit) [10]:
- Dialog: Natural user utterances recorded during use of an actual dialogue system (24,809 utterances).
- JNAS: Reading-style speech by speakers, extracted from the JNAS (Japanese Newspaper Article Sentences) [11] database (40,086 utterances).
The Dialog data are actual human-machine dialogue data extracted from utterance logs collected in a long-term field test of our spoken dialogue system, the "Takemaru-kun System" [12], which has been deployed in a public city office since November 2002 and operated every business day. We had obtained over 300,000 recorded inputs as of February 2005. Using these actual conversation data efficiently improves the recognition accuracy for natural utterances. The resulting model can also be expected to perform better on children's voices, because the Dialog data contain many utterances by children. See [13] for details.
The training data for the acoustic model included the JNAS data because a large amount of speech data is needed to build the model.
We adopted a word trigram model as the language model, which is one of the major statistical methods for modeling the appearance probability of a word sequence [14]. There are two well-known approaches to task description in continuous speech recognition: (1) a finite state network grammar, and (2) a word trigram language model. A finite state network grammar is usually adopted for small, restricted tasks. By using a statistical method instead of a network grammar, even some out-of-domain utterances are recognized correctly. Utterances with varied expression styles can also be recognized more flexibly than with network-grammar-based recognition.
To train the model, a training corpus consisting of the following texts was prepared:
- Dialog: Transcribed utterances collected in the field test of the Takemaru-kun system (15,433 sentences).
- Web: Texts extracted from web pages (826,278 sentences).
- Chat: Texts extracted from Internet Relay Chat (IRC) logs (2,720,134 sentences).
- TV: Texts of request utterances for operating a television through a spoken dialogue interface (4,256 sentences).
We produced a vocabulary dictionary of 41,443 words, each appearing 20 or more times in the corpus. Then, the language model tools provided by the IPA Japanese free dictation program project [15] were used to build a baseline model.
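The frequency cutoff used to build the vocabulary can be sketched as follows. This is an illustration only: it assumes the corpus has already been segmented into words (a real Japanese corpus would first need morphological analysis), and the function name `build_vocabulary` and the toy corpus are hypothetical.

```python
from collections import Counter


def build_vocabulary(sentences, min_count=20):
    """Return the set of words that occur at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


# Toy corpus: four words occur 20 times each, one filler word only 19 times.
corpus = [["where", "is", "the", "library"]] * 20 + [["um"]] * 19
vocabulary = build_vocabulary(corpus, min_count=20)
```

With the threshold of 20 stated above, the filler word falls just below the cutoff and is excluded from the dictionary.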
Finally, the model was adapted using a task-dependent network grammar. We wrote a finite state network grammar for the HRP-2 task, which included 350 words. Adaptation was performed by strengthening the trigram probabilities in the baseline model on the basis of the word-pair constraints in the written grammar. This method allows in-task utterances to be recognized more accurately while retaining the statistical model's tolerance of unexpected utterances.
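The paper does not give the exact adaptation formula, so the following is only one possible reading of "strengthening the trigram probabilities": trigrams whose last two words form a pair licensed by the grammar are multiplied by a boost factor, and each context is then rescaled to keep its original probability mass. The function name `adapt_trigrams`, the boost value and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def adapt_trigrams(trigram_probs, grammar_pairs, boost=4.0):
    """Boost trigrams that end in a grammar-licensed word pair, then renormalize.

    trigram_probs: {(w1, w2, w3): P(w3 | w1, w2)} from the baseline model
    grammar_pairs: set of (w2, w3) pairs permitted by the network grammar
    boost: hypothetical scaling factor greater than 1
    """
    before = defaultdict(float)  # probability mass per context before boosting
    after = defaultdict(float)   # probability mass per context after boosting
    scaled = {}
    for (w1, w2, w3), p in trigram_probs.items():
        weight = boost if (w2, w3) in grammar_pairs else 1.0
        scaled[(w1, w2, w3)] = p * weight
        before[(w1, w2)] += p
        after[(w1, w2)] += p * weight
    # Rescale so that every context (w1, w2) keeps its original total mass.
    return {
        key: value * before[key[:2]] / after[key[:2]]
        for key, value in scaled.items()
    }


baseline = {
    ("where", "is", "the"): 0.5,
    ("where", "is", "a"): 0.5,
}
adapted = adapt_trigrams(baseline, grammar_pairs={("is", "the")})
```

In this toy case the in-grammar continuation rises from 0.5 to 0.8 while the other continuation keeps a non-zero probability of 0.2, which mirrors the stated goal of favoring in-task utterances without rejecting unexpected ones.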

B. Noise Recognition
We introduced noise recognition programs to the HRP-2 to realize a novel form of human-robot interaction that handles unintended sound inputs such as coughing, laughing, and other impulsive noises. Although a general dialogue system discards noises as needless inputs [16], the proposed system can continue the dialogue with humans while recognizing the noise category.
We investigated sound verification, which determines whether an input sound was intended, by comparing acoustic likelihoods given by GMMs. GMMs have proven to be powerful for text-independent speech verification. Although conventional speech verification studies have focused only on environmental noises, our previous studies found that GMMs can also discriminate more utterance-like inputs [16].
Table I shows the training conditions of the GMMs. The training data were recorded through the microphone used for spoken dialogue with the HRP-2. When laughter or coughing is recognized, a response corresponding to the recognition result is returned to the user. To distinguish voice from non-voice, adult and child voices were included in the training data. If the input is identified as voice, the system executes the normal spoken dialogue routine. "Beating by hand" and "Beating by soft hammer" denote impulsive noises produced when a user beats the head of HRP-2 by hand or with a soft hammer. The system will use the identification of beatings to deal with mischief from users when the robot is installed in a home.
GMMs for eight classes, each with 64 Gaussian mixture components, were trained from the training data of each class. As the acoustic parameter, we adopted mel frequency cepstral coefficients (MFCC), a major parameter for analyzing human voices in speech recognition. The class whose GMM gives the highest acoustic likelihood for the parameters of the input sound is chosen as the output.
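The maximum-likelihood decision rule can be sketched in a few lines. The sketch assumes diagonal-covariance GMMs and uses toy 2-D feature vectors with two mixture components per class in place of real MFCC frames and 64 components; the class names, parameters and function names are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, gmm):
    """Sum over frames of log sum_k w_k N(x | mean_k, var_k)."""
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gaussian(x, m, v) for w, m, v in gmm]
        peak = max(terms)  # log-sum-exp trick for numerical stability
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total


def classify(frames, models):
    """Return the class whose GMM gives the highest likelihood for the input."""
    return max(models, key=lambda name: gmm_log_likelihood(frames, models[name]))


# Toy models: each GMM is a list of (weight, mean, variance) tuples.
models = {
    "voice": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "cough": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 4.0], [1.0, 1.0])],
}
frames = [[5.2, 4.9], [5.8, 4.3], [5.5, 4.6]]
label = classify(frames, models)
```

Summing per-frame log-likelihoods treats the frames as independent, which is the usual simplification when a GMM is used for this kind of sound classification.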

C. Dialogue Strategy
The spoken dialogue strategy of our system was designed on a simple principle. Candidate responses to users' questions are prepared beforehand. A suitable response is selected from the candidates by a keyword or key-phrase matching mechanism. We defined keywords for each candidate. After the user's voice is recorded, the number of keywords that match the recognized text is counted for every prepared candidate. The system chooses the candidate with the most matches as the response. In this procedure, the N-best output is used as the speech recognition result to compensate for recognition errors.
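A minimal sketch of this selection step is given below. How the system combines the N-best hypotheses is not detailed in the paper, so the sketch simply counts keyword hits over all hypotheses; returning `None` when nothing matches is also an assumption, as are the function name and the example keywords.

```python
def choose_response(n_best, candidates):
    """Pick the candidate whose keywords occur most often across the N-best list.

    n_best: list of recognition hypotheses, each a list of words
    candidates: {response_id: set of keywords}
    Returns the best response_id, or None if no keyword matched at all.
    """
    scores = {}
    for response_id, keywords in candidates.items():
        scores[response_id] = sum(
            1 for hypothesis in n_best for word in hypothesis if word in keywords
        )
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


candidates = {
    "library_location": {"library", "where"},
    "weather": {"weather", "today"},
}
# The second hypothesis contains a recognition error ("wear" for "where").
n_best = [
    ["where", "is", "the", "library"],
    ["wear", "is", "the", "library"],
]
response = choose_response(n_best, candidates)
```

Even though one hypothesis misrecognizes a keyword, the remaining matches across the N-best list still select the intended response, which is the error-compensation effect described above.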

V. INTERACTION SYSTEM USING VISUAL INFORMATION
Our system obtains gray-scale images from the stereo camera system installed in the head, and the following three vision-based functions were implemented: facial information measurement, pointing gesture recognition and portrait drawing. These functions are described in the following sections.

A. Facial Information Measurement
Face and gaze information provides important cues to the intentions and interests of a human. An earlier study showed that humans tend to direct their attention to an object at the time of utterance [17].
The facial module is based on a facial measurement system [18] and sends measured parameters such as the pose and position of the head and the gaze direction to the server over the network. Fig. 5 shows how the facial information is measured. In this figure, rectangles indicate the feature areas of the face used for tracking, and the two lines indicate gaze directions.
The main purpose of this measurement is to detect the valid speech period. Speech input is recognized only after the user turns his or her face to HRP-2. Therefore, the robot does not recognize utterances directed to other people. We regard this function as a simple implementation of "eye contact."
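The gating of speech input by face direction can be sketched as a small event filter. The angular threshold, the event format and the function names are hypothetical; the paper does not state how "facing the robot" is decided from the measured head pose.

```python
def facing_robot(yaw_deg, pitch_deg, max_angle_deg=15.0):
    """Treat the user as facing the robot when the head is nearly frontal."""
    return abs(yaw_deg) <= max_angle_deg and abs(pitch_deg) <= max_angle_deg


def accept_utterances(events, max_angle_deg=15.0):
    """Keep only speech that arrives while the user faces the robot.

    events: time-ordered list of ("face", yaw_deg, pitch_deg) or ("speech", text)
    """
    facing = False
    accepted = []
    for event in events:
        if event[0] == "face":
            facing = facing_robot(event[1], event[2], max_angle_deg)
        elif event[0] == "speech" and facing:
            accepted.append(event[1])
    return accepted


events = [
    ("face", 40.0, 0.0),                  # user looks at someone else
    ("speech", "see you later"),          # ignored by the robot
    ("face", 5.0, -3.0),                  # user turns toward the robot
    ("speech", "where is the library"),   # passed on to speech recognition
]
accepted = accept_utterances(events)
```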

B. Pointing Gesture Recognition
Secondly, we describe the gesture recognition module, which recognizes simple pointing gestures. Gestures such as head motion help make communication clearer. Furthermore, gestures that point in a direction are important in guidance tasks. If the robot can recognize only speech, natural communication is difficult because demonstrative pronouns are often used in such situations. The pointing gesture recognition module uses depth information generated by correlation based on SAD (Sum of Absolute Differences). Fig. 6 is an example of the disparity map. The process of recognizing the pointing gesture is as follows:
1. The disparity map is generated after correcting lens distortion.
2. The pointing direction is detected on the assumption that the part of the user closest to the robot in the disparity map is the user's hand.
Recognizing the pointing gesture enables HRP-2 to respond even if the user asks a question containing a demonstrative pronoun. For example, HRP-2 can choose the proper newspaper and pass it to the user when asked "Pass me that newspaper" with a pointing gesture.
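SAD-based matching can be illustrated on a single rectified scanline. This is a toy sketch, not the module's implementation: it assumes lens distortion has already been corrected, works on one row of pixel intensities, and uses hypothetical function names and window sizes. A larger disparity corresponds to a point closer to the cameras.

```python
def sad(left, right, x, d, half):
    """Sum of absolute differences between windows centred at x and x - d."""
    return sum(
        abs(left[x + k] - right[x - d + k]) for k in range(-half, half + 1)
    )


def disparity_row(left, right, max_disp, half=1):
    """Per-pixel disparity for one scanline (0 where it cannot be computed)."""
    width = len(left)
    row = [0] * width
    for x in range(half + max_disp, width - half):
        # Choose the shift with the smallest SAD cost (first one on ties).
        row[x] = min(
            range(max_disp + 1), key=lambda d: sad(left, right, x, d, half)
        )
    return row


# A bright blob (the "hand") appears 3 pixels further right in the left image.
right_row = [0, 0, 0, 0, 9, 7, 0, 0, 0, 0, 0, 0]
left_row = [0, 0, 0, 0, 0, 0, 0, 9, 7, 0, 0, 0]
row = disparity_row(left_row, right_row, max_disp=4)
nearest_disparity = max(row)  # the closest part of the scene
```

Flat, textureless regions match equally well at several shifts, so a practical system needs some way to reject such ambiguous pixels before taking the closest point as the hand.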

C. Portrait Drawing
The third module using vision information, the portrait drawing module, is described here. As its name suggests, this module provides the functionality of drawing a portrait. The function was implemented to show in the demonstration that HRP-2 can perform skillful motion tasks in addition to communicating with a user. From a technical viewpoint, portrait drawing requires segmenting the face region from the background. The procedure for drawing a portrait of a user is as follows:
1. When the facial information described above indicates that the user has turned his or her face to HRP-2, a still image of the user is captured.
2. A Canny edge image, a depth mask image and an ellipsoid mask image are generated.
3. The face region is extracted from the edge image using the above mask data.
4. The face image is converted to a sequence of points by the chain method.
5. The sequence of points is sorted and thinned.
6. HRP-2 draws a portrait using the generated data.
Fig. 8. Process of portrait drawing. (E) Line segment data (F) Resulting image
When the user requests a portrait, HRP-2 asks the user to pass the pen and to turn his or her face toward the robot. It then captures an image as shown in Fig. 8(A). An edge image (Fig. 8(B)) is generated from a suitable face image using Canny's algorithm. A depth mask image (Fig. 8(C)) and an ellipsoid mask image (Fig. 8(D)) are generated from two kinds of data: the stereo image pair and the measured face position. Fig. 8(E) shows the facial part extracted from the whole edge image using the masks. HRP-2 draws the portrait on an actual white-board using the generated sequence data. Inverse kinematics for eight degrees of freedom is solved under the condition that the pen is kept vertical. After the sequence of hand positions is determined, the hand moves by interpolating between these points. Fig. 8(F) shows the resulting image drawn on the white-board. When HRP-2 actually draws a portrait, it uses a felt-tip pen with a holder that was designed to help it grasp the pen and to absorb position errors of the hand with a built-in spring (Fig. 7).
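Two steps of the procedure, masking the edge image and thinning the point sequence, can be sketched as follows. The paper does not describe how thinning is done, so the minimum-spacing rule used here is an assumption, and the function names and toy binary images are hypothetical.

```python
def extract_face_edges(edge, depth_mask, ellipse_mask):
    """Keep edge pixels that lie inside both masks (all inputs are 0/1 grids)."""
    return [
        [e & d & m for e, d, m in zip(edge_row, depth_row, mask_row)]
        for edge_row, depth_row, mask_row in zip(edge, depth_mask, ellipse_mask)
    ]


def thin_points(points, min_spacing):
    """Drop points that are closer than min_spacing to the last kept point."""
    kept = []
    for x, y in points:
        if not kept:
            kept.append((x, y))
            continue
        dx, dy = x - kept[-1][0], y - kept[-1][1]
        if dx * dx + dy * dy >= min_spacing * min_spacing:
            kept.append((x, y))
    return kept


edge = [[1, 1, 1], [1, 1, 1]]
depth_mask = [[0, 1, 1], [0, 1, 1]]    # 1 where the pixel is near enough
ellipse_mask = [[1, 1, 0], [1, 1, 1]]  # 1 inside the ellipse around the face
face_edges = extract_face_edges(edge, depth_mask, ellipse_mask)

stroke = thin_points([(0, 0), (1, 0), (2, 0), (3, 0), (5, 0)], min_spacing=2)
```

Thinning reduces the number of hand positions that must be reached, at the cost of a coarser line.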

VI. EXPERIMENT
We verified the usefulness of our system in a real environment through a demonstration at the Prototype Robot Exhibition at Aichi World EXPO 2005, as shown in Fig. 9. The exhibition area was so noisy, with a full audience and other demonstrations held simultaneously, that the hand-held microphone was used in the demonstration. The non-stationary noise recognition system was used to recognize users' coughing, upon which the robot started talking about colds in the demo scenario. When connected to the Internet, HRP-2 can answer questions on weather information, news headlines and so on.
Fig. 9. Demonstration in the Aichi World EXPO 2005
The uncontrollable lighting conditions were also a critical factor for image processing. However, since our method does not rely on skin color detection, which is known to be sensitive to lighting conditions, face measurement and gesture recognition were sufficiently robust in this environment. HRP-2 was also able to draw portraits by extracting the user's face from a cluttered background. Our demonstration ran successfully for two weeks without problems.

VII. CONCLUSION
The HRP-2 is a speech-oriented humanoid robot system that realizes natural multi-modal interaction between human and robot. The system has vision and speech dialogue systems for communicating with visitors. A voice interface with two components was implemented on the HRP-2 to realize flexible interaction with users. One is the spoken dialogue routine based on continuous speech recognition with a large vocabulary dictionary, and the other is a non-stationary noise recognition system. We also implemented the face measurement function so that the humanoid can realize "eye contact" with the user. In addition, the pointing gesture recognition function was implemented based on depth map generation. By integrating speech and gesture information, HRP-2 can recognize questions that include a demonstrative pronoun. The feasibility of the system was demonstrated at EXPO 2005.
The demonstration and experiment have gradually clarified some issues and demands. Future work in vision and speech is outlined below. Since the current system did not make full use of the microphone array, there is room for improvement in this regard. For example, Blind Source Separation (BSS) using multiple microphones would enable simultaneous dialogue with multiple users. Strengthening noise robustness and improving the dialogue system will also be necessary. Increasing the number of recognizable gestures is also an important issue for more natural interaction.