Facial Expression Recognition Using 3D Facial Feature Distances

In this paper, we propose a novel approach for facial expression analysis and recognition. The proposed approach relies on distance vectors retrieved from the 3D distribution of facial feature points to classify universal facial expressions. A neural network architecture is employed as a classifier to recognize the facial expressions from a distance vector obtained from 3D facial feature locations. The facial expressions anger, sadness, surprise, joy, disgust, fear and neutral are recognized with an average recognition rate of 91.3%. The highest recognition rate, 98.3%, is reached for surprise.


Introduction
The face plays an important role in human communication. Facial expressions and gestures carry nonverbal information that contributes to human communication. Recognizing facial expressions from facial images can facilitate a number of applications in the field of human-computer interaction. Over the last two decades, developments and prospects in multimedia signal processing have drawn many computer vision researchers to the problems of facial expression recognition. The pioneering studies of Ekman in the late 1970s gave evidence for the classification of the basic facial expressions. According to these studies, the basic facial expressions are those representing happiness, sadness, anger, fear, surprise, disgust and neutral. The Facial Action Coding System (FACS) was developed by Ekman and Friesen to code facial expressions, describing the movements on the face by action units. This work inspired many researchers to analyze facial expressions in 2D by means of image and video processing: by tracking facial features and measuring the amount of facial movement, they attempt to classify different facial expressions. Recent work on facial expression analysis and recognition has used these seven basic expressions as the basis for the introduced systems. Almost all of the methods developed use the 2D distribution of facial features as input to a classification system, and the outcome is one of the facial expression classes. They differ mainly in the facial features selected and the classifiers used to distinguish among the different facial expressions. Information extracted from 3D face models is rarely used in facial expression recognition. This chapter considers techniques that use information extracted from 3D space to analyze facial images for the recognition of facial expressions.
The first part of the chapter introduces the methods of extracting information from 3D models for facial expression recognition. The 3D distributions of the facial feature points and the estimation of characteristic distances used to represent the facial expressions are explained using a rich collection of illustrations, including graphs, charts and face images. The second part of the chapter introduces 3D distance-vector based facial expression recognition. The architecture of the system is explained with block diagrams and flowcharts. Finally, 3D distance-vector based facial expression recognition is compared with the conventional methods available in the literature.

Information extracted from 3D models for facial expression recognition
Conventional methods for analyzing expressions of facial images use limited information such as gray levels of pixels and positions of feature points in a face [Donato et al., 1999], [Fasel & Luttin, 2003], [Pantic & Rothkrantz, 2004]. Their results depend on the information used. If the information cannot be precisely extracted from the facial images, then we may obtain unexpected results. In order to increase the reliability of the results of facial expression recognition, the selection of the relevant feature points is important. In this section we are primarily concerned with gathering the relevant data from the facial animation sequences for expression recognition. The section is organised as follows. Section 2.1 describes the primary facial expressions, section 2.2 shows the muscle actions involved in the primary facial expressions, and section 2.3 presents the optimization of the facial feature points.

Primary facial expressions
In the past, facial expression analysis was essentially a research topic for psychologists. However, recent progress in image processing and pattern recognition has motivated significant research activity on automatic facial expression recognition [Braathen et al., 2002]. The basic facial expressions typically recognized by psychologists, shown in Figure 1, are neutral, anger, sadness, surprise, happiness, disgust and fear [P. Ekman & W. Friesen, 1976]. The expressions are textually defined in Table 1.
Fig.1. Basic facial expressions: 1-Neutral, 2-Anger, 3-Sadness, 4-Surprise, 5-Happiness, 6-Disgust, 7-Fear.

Expression | Textual Description
Neutral | All face muscles are relaxed. Eyelids are tangent to the iris. The mouth is closed and lips are in contact.
Anger | The inner eyebrows are pulled downward and together. The eyes are wide open. The lips are pressed against each other or opened to expose the teeth.
Sadness | The inner eyebrows are bent upward. The eyes are slightly closed. The mouth is relaxed.
Surprise | The eyebrows are raised. The upper eyelids are wide open, the lower relaxed. The jaw is opened.
Happiness | The eyebrows are relaxed. The mouth is open and the mouth corners pulled back toward the ears.
Disgust | The eyebrows and eyelids are relaxed. The upper lip is raised and curled, often asymmetrically.

Muscle actions involved in the primary facial expressions
The Facial Definition Parameter set (FDP) and the Facial Animation Parameter set (FAP) were designed in the MPEG-4 framework to allow the definition of a facial shape and texture, as well as animation of faces reproducing expressions, emotions and speech pronunciation. The FAPs [Pandzic & Forchheimer, 2002] are based on the study of minimal facial actions and are closely related to muscle activation, in the sense that they represent a complete set of atomic facial actions; therefore they allow the representation of even the most detailed natural facial expressions, even those that cannot be categorized as particular ones. All the parameters involving translational movement are expressed in terms of the Facial Animation Parameter Units (FAPU). These units are defined with respect to specific distances in a neutral pose in order to allow interpretation of the FAPs on any facial model in a consistent way. As a result, description schemes that utilize FAPs produce reasonable results in terms of expression and speech related postures.
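To illustrate why units defined on the neutral pose make parameters portable across face models, the following minimal sketch expresses a mouth-corner displacement in such a unit. It is an assumption-laden illustration, not the MPEG-4 reference procedure: the landmark names (`left_eye`, `mouth_left`, and so on) and the helper functions are hypothetical, and the division by 1024 follows the scaling commonly quoted for FAPUs rather than anything stated in this chapter.

```python
import math

def fapu_from_neutral(neutral_points):
    """Derive two translational units from a neutral face.

    Assumption: each unit is a neutral-face distance divided by 1024,
    the scaling commonly quoted for MPEG-4 FAPUs.
    """
    eye_separation = math.dist(neutral_points["left_eye"], neutral_points["right_eye"])
    mouth_width = math.dist(neutral_points["mouth_left"], neutral_points["mouth_right"])
    return {"ES": eye_separation / 1024.0, "MW": mouth_width / 1024.0}

def to_fapu(displacement, unit):
    """Express a raw displacement as a multiple of the given unit."""
    return displacement / unit

def scaled(points, factor):
    """Return a uniformly scaled copy of a face model."""
    return {name: tuple(factor * v for v in p) for name, p in points.items()}

# Hypothetical neutral-face landmarks (arbitrary model units).
small_face = {
    "left_eye": (-30.0, 40.0, 0.0),
    "right_eye": (30.0, 40.0, 0.0),
    "mouth_left": (-25.0, -40.0, 0.0),
    "mouth_right": (25.0, -40.0, 0.0),
}
large_face = scaled(small_face, 2.0)

# The same relative mouth-corner motion on both faces:
# 5 model units on the small face, 10 on the large one.
stretch_small = to_fapu(5.0, fapu_from_neutral(small_face)["MW"])
stretch_large = to_fapu(10.0, fapu_from_neutral(large_face)["MW"])
```

Both faces yield the same parameter value even though the raw displacements differ, which is the consistency property the paragraph above attributes to FAPUs.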

Expression | Muscle Actions
Table 2. Muscle actions involved in the six basic expressions [Karpouzis et al., 2000].

In general, facial expressions and emotions can be described as a set of measurements (FDPs and derived features) and transformations (FAPs) that can be considered atomic with respect to the MPEG-4 standard. In this way, one can describe the anatomy of a human face, as well as any animation parameters, through the change in the positions of the facial feature points, thus eliminating the need to explicitly specify the topology of the underlying geometry. These facial feature points can then be mapped to automatically detected measurements and indications of motion in a video sequence and thus help analyse or reconstruct the emotion or expression recognized by the system. MPEG-4 specifies 84 feature points on the neutral face. The main purpose of these feature points is to provide spatial references to key positions on a human face. These 84 points were chosen to best reflect the facial anatomy and movement mechanics of a human face. The location of these feature points has to be known for any MPEG-4 compliant face model. The feature points on the model should be located according to the points illustrated in Figure 2. After a series of analyses on faces we have concluded that mainly 15 FAPs are affected by these expressions [Soyel et al., 2005]. These facial features move due to the contraction and expansion of facial muscles whenever a facial expression changes. Table 2 illustrates the description of the basic expressions using the MPEG-4 FAP terminology. Although muscle actions [P. Ekman & W. Friesen, 1978] are of high importance with respect to facial animation, one is unable to track them analytically without resorting to explicit electromagnetic sensors. However, a subset of them can be deduced from their visual results, that is, the deformation of the facial tissue and the movement of some facial surface points.
This reasoning resembles the way that humans visually perceive emotions, by noticing specific features in the most expressive areas of the face, the regions around the eyes and the mouth. The seven basic expressions, as well as intermediate ones, employ facial deformations strongly related to the movement of some prominent facial points that can be automatically detected. These points can be mapped to a subset of the MPEG-4 feature point set. Note that MPEG-4 defines the neutral expression as the state in which all face muscles are relaxed.

Relevant facial feature points
In order to reduce the amount of time required to perform the experiments, a small set of 11 feature points was selected. Care was taken to select facial feature points from the whole set defined by the MPEG-4 standard. The MPEG-4 standard divides feature points into a number of groups, listed in Table 3, corresponding to the particular region of the face to which they belong. A few points from nearly all the groups were taken. Nine points were selected from the left side of the face (repeating the selection on the right side is not needed due to symmetry). The feature points were selected so that they have varying predicted extraction difficulty. The selected feature points are shown in Figure 3.

Information extracted from 3D Space
Using the distribution of the 11 facial feature points from the 3D facial model, we extract six characteristic distances, listed in Table 4, that serve as input to the neural network classifier used for recognizing the different facial expressions.
Table 4. Six characteristic distances.

Basic architecture of facial expression recognition system
Facial expression recognition includes both measurement of facial motion and recognition of expression. The general approach to Automatic Facial Expression Analysis (AFEA) systems, which is shown in Figure 4, can be divided into three steps:

• Face acquisition.
• Facial feature extraction and representation.
• Facial expression recognition.

Face acquisition is the first step of the facial expression recognition system; it finds a face region in the input frame images. After determining the face location, various facial feature extraction approaches can be used. There are two main approaches: geometric feature-based methods and appearance-based methods. The first utilizes the shape and the location of face components such as the mouth, nose and eyes, which are represented by a feature vector extracted from these facial components. In appearance-based methods, image filters, such as Gabor wavelets, are applied to either the whole face or specific regions in a face image to extract a feature vector. Depending on the facial feature extraction method, the effects of in-plane head rotation and different scales of the faces can be eliminated, either by face normalization before the feature extraction or by feature representation before the step of expression recognition. The last stage of the facial expression analysis system is facial expression recognition using different classification approaches. Facial expression recognition usually results in classes according to either the Facial Action Coding System (FACS) or the seven basic facial expressions.
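The remark about removing in-plane rotation and scale through the feature representation can be checked numerically. The sketch below is only an illustration under stated assumptions: the landmark names and coordinates are hypothetical, and the ratio of two inter-point distances stands in for a geometric feature. It applies an in-plane rotation and a uniform scaling to a face and shows that the ratio is unchanged while the raw distance is not.

```python
import math

def transform(point, angle_rad, factor):
    """Rotate a 3D point about the z axis (in-plane rotation), then scale it uniformly."""
    x, y, z = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (factor * (c * x - s * y), factor * (s * x + c * y), factor * z)

def distance_ratio(points, pair, reference_pair):
    """Distance between one pair of landmarks divided by a reference distance."""
    a, b = pair
    ref_a, ref_b = reference_pair
    return math.dist(points[a], points[b]) / math.dist(points[ref_a], points[ref_b])

# Hypothetical landmark coordinates (arbitrary units).
face = {
    "upper_lip": (0.0, -30.0, 5.0),
    "lower_lip": (0.0, -42.0, 3.0),
    "left_eye": (-30.0, 40.0, 0.0),
    "right_eye": (30.0, 40.0, 0.0),
}
moved = {name: transform(p, math.radians(25.0), 1.7) for name, p in face.items()}

raw_before = math.dist(face["upper_lip"], face["lower_lip"])
raw_after = math.dist(moved["upper_lip"], moved["lower_lip"])
ratio_before = distance_ratio(face, ("upper_lip", "lower_lip"), ("left_eye", "right_eye"))
ratio_after = distance_ratio(moved, ("upper_lip", "lower_lip"), ("left_eye", "right_eye"))
```

Here the raw mouth distance grows with the scale factor, whereas the ratio agrees before and after the transformation up to floating-point error.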

Distance No | Distance Name | Distance Description
D1 | Eye Opening | Distance between the right corner of the right eye and the left corner of the right eye.
D2 | Eyebrow Height | Distance between the centre of upper inner-right eyelid and the uppermost point of the right eyebrow.
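A characteristic distance of this kind is simply the Euclidean distance between two 3D feature points. The following sketch shows one way such a vector could be assembled and then normalized by a reference distance. The landmark names, the pairings and the choice of reference pair are hypothetical placeholders, since only part of the distance table is available here; the code is not the authors' implementation.

```python
import math

def characteristic_distances(points, pairs):
    """Return the Euclidean distance for each (point_a, point_b) name pair."""
    return [math.dist(points[a], points[b]) for a, b in pairs]

def normalize_by_last(distances):
    """Divide every distance except the last by the last one (the reference)."""
    reference = distances[-1]
    if reference == 0:
        raise ValueError("reference distance must be non-zero")
    return [d / reference for d in distances[:-1]]

# Hypothetical 3D feature points (arbitrary units) and pairings.
points = {
    "eye_corner_a": (0.0, 0.0, 0.0),
    "eye_corner_b": (3.0, 4.0, 0.0),
    "eyelid_centre": (0.0, 0.0, 0.0),
    "eyebrow_top": (0.0, 0.0, 2.0),
    "ref_a": (0.0, 0.0, 0.0),
    "ref_b": (10.0, 0.0, 0.0),
}
pairs = [
    ("eye_corner_a", "eye_corner_b"),   # stands in for an eye distance
    ("eyelid_centre", "eyebrow_top"),   # stands in for an eyebrow distance
    ("ref_a", "ref_b"),                 # stands in for the normalizing distance
]

raw = characteristic_distances(points, pairs)
normalized = normalize_by_last(raw)
```

With the full set of pairings, the same two functions would produce six raw distances and five normalized values.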

Classification of the facial expressions
By using all of the information introduced in the previous section, we achieve 3D facial expression recognition in the following phases. First, we extract the characteristic distance vectors as defined in Table 4. Then, we classify a given distance vector with a previously trained neural network. The sixth distance, D6, is used to normalize the first five distances. The neural network architecture consists of a multilayer perceptron with input, hidden and output layers that is trained using the backpropagation algorithm. The input layer receives a vector of six distances and the output layer represents the 7 possible facial expressions mentioned in the preceding sections.
Backpropagation was created by generalizing the Widrow-Hoff learning rule to multiple-layer networks and nonlinear differentiable transfer functions. Input vectors and the corresponding target vectors are used to train a network until it can approximate a function, associate input vectors with specific output vectors, or classify the input vectors. Networks with biases, a sigmoid layer, and a linear output layer are capable of approximating any function with a finite number of discontinuities. Standard backpropagation is a gradient descent algorithm, as is the Widrow-Hoff learning rule, in which the network weights are moved along the negative of the gradient of the performance function. The term backpropagation refers to the manner in which the gradient is computed for nonlinear multilayer networks. There are a number of variations on the basic algorithm that are based on other standard optimization techniques, such as conjugate gradient and Newton methods. Properly trained backpropagation networks tend to give reasonable answers when presented with inputs that they have never seen. Typically, a new input leads to an output similar to the correct output for input vectors used in training that are similar to the new input being presented.
This generalization property makes it possible to train a network on a representative set of input/target pairs and get good results without training the network on all possible input/output pairs [Rumelhart et al., 1986]. We used the BU-3DFE database in our experiments to train and test our model. The database we used contains 7 facial expressions for 60 different people. We arbitrarily divided the 60 subjects into two subsets: one with 54 subjects for training and the other with 6 subjects for testing. During the recognition experiments, a distance vector is derived for every 3D model. Consecutive distance vectors, as well as the underlying class sequences, are assumed to be statistically independent. Each vector is eventually assigned to the class with the highest likelihood score.
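As a rough illustration of the classifier described in this section, the sketch below trains a small perceptron with one sigmoid hidden layer and a linear output layer by gradient descent on the squared error, and assigns each input to the output unit with the highest score. Everything specific in it is an assumption made for the example: the hidden layer size, learning rate, number of epochs and the synthetic six-dimensional data are invented, and the code is not the network used in the experiments.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def one_hot(label, n_classes):
    return [1.0 if k == label else 0.0 for k in range(n_classes)]

class TinyMLP:
    """One sigmoid hidden layer and a linear output layer."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = [sum(w * hj for w, hj in zip(row, h)) + b
             for row, b in zip(self.w2, self.b2)]
        return h, y

    def train_step(self, x, target, lr):
        """One gradient descent step on E = 0.5 * sum((y - target)^2)."""
        h, y = self.forward(x)
        err = [yk - tk for yk, tk in zip(y, target)]
        # Propagate the output error back through w2 and the sigmoid derivative.
        delta_h = [hj * (1.0 - hj) * sum(err[k] * self.w2[k][j] for k in range(len(err)))
                   for j, hj in enumerate(h)]
        for k, e in enumerate(err):
            for j, hj in enumerate(h):
                self.w2[k][j] -= lr * e * hj
            self.b2[k] -= lr * e
        for j, d in enumerate(delta_h):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * d * xi
            self.b1[j] -= lr * d

    def predict(self, x):
        """Index of the output unit with the highest score."""
        _, y = self.forward(x)
        return max(range(len(y)), key=y.__getitem__)

def mean_loss(net, samples, n_classes):
    total = 0.0
    for x, label in samples:
        _, y = net.forward(x)
        total += 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, one_hot(label, n_classes)))
    return total / len(samples)

# Synthetic data: 7 classes, each a noisy cluster around a random 6-D prototype.
rng = random.Random(1)
prototypes = [[rng.random() for _ in range(6)] for _ in range(7)]
samples = [([v + rng.uniform(-0.05, 0.05) for v in proto], label)
           for label, proto in enumerate(prototypes) for _ in range(5)]

net = TinyMLP(n_in=6, n_hidden=8, n_out=7)
loss_before = mean_loss(net, samples, 7)
for _ in range(300):
    for x, label in samples:
        net.train_step(x, one_hot(label, 7), lr=0.05)
loss_after = mean_loss(net, samples, 7)
predictions = [net.predict(x) for x, _ in samples]
```

The hidden-layer deltas are computed before the output weights are updated, so both layers use gradients taken at the same point.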

Training and testing the data
Neural networks are composed of simple elements operating in parallel. These elements are inspired by biological nervous systems. As in nature, the network function is determined largely by the connections between elements. We can train a neural network to perform a particular function by adjusting the values of the connections (weights) between elements. Commonly neural networks are adjusted, or trained, so that a particular input leads to a specific target output. Such a situation is shown in Figure 5. The network is adjusted, based on a comparison of the output and the target, until the network output matches the target. Typically many such input/target pairs are used, in this supervised learning, to train a network.

Fig.5. Basic Neural Network Structure
Batch training of a network proceeds by making weight and bias changes based on an entire set of input vectors. Incremental training changes the weights and biases of a network as needed after presentation of each individual input vector. Incremental training is sometimes referred to as "on line" or "adaptive" training. Once the network weights and biases have been initialized, the network is ready for training. The network can be trained for function approximation, pattern association, or pattern classification. The training process requires a set of examples of proper network behaviour, namely network inputs and target outputs. During training, the weights and biases of the network are iteratively adjusted to minimize the average squared error between the network outputs and the target outputs. We have tested our neural network setup on the BU-3DFE database, which contains posed emotional facial expression images with seven fundamental emotional states: Anger, Disgust, Fear, Happiness, Sadness, Surprise and Neutral. In our experiment, we used the data captured from 60 subjects for each expression. The test is based on the seven fundamental expressions. The 3D distribution of the 84 feature vertices was provided for each facial model. A detailed description of the database construction, post-processing, and organization can be found in .
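The difference between batch and incremental training can be seen on the simplest possible case, a single linear unit trained with the Widrow-Hoff rule. The sketch below is illustrative only: the data (points on the line t = 2x + 1), the learning rate and the epoch count are assumptions chosen for the example. Batch training averages the gradient over all samples before each update, while incremental training updates after every sample.

```python
def train_batch(samples, lr, epochs):
    """Update w and b once per epoch using the gradient averaged over all samples."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, t in samples:
            err = (w * x + b) - t
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(samples)
        b -= lr * grad_b / len(samples)
    return w, b

def train_incremental(samples, lr, epochs):
    """Update w and b after every individual sample (on-line training)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, t in samples:
            err = (w * x + b) - t
            w -= lr * err * x
            b -= lr * err
    return w, b

def mean_squared_error(samples, w, b):
    return sum(((w * x + b) - t) ** 2 for x, t in samples) / len(samples)

# Noise-free targets on the line t = 2x + 1.
samples = [(x, 2.0 * x + 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]

w_batch, b_batch = train_batch(samples, lr=0.1, epochs=5000)
w_inc, b_inc = train_incremental(samples, lr=0.1, epochs=5000)
mse_batch = mean_squared_error(samples, w_batch, b_batch)
mse_inc = mean_squared_error(samples, w_inc, b_inc)
```

On this noise-free problem both schedules approach the same weights; they differ in how often the weights change and in how many samples inform each change.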

System performance
Our facial expression analysis experiments are carried out in a person-independent manner, which is thought to be more challenging than a person-dependent approach. We arbitrarily divided the 60 subjects into two subsets: one with 54 subjects for training and the other with 6 subjects for testing. The experiments ensure that any subject used for testing does not appear in the training set, because the random partition is based on the subjects rather than the individual expressions. The tests are executed 10 times with different partitions to achieve a stable generalized recognition rate. The entire process ensures that every subject is tested at least once for each classifier. For each round of the test, all the classifiers are reset and retrained from the initial state. We show the results for all the neural network classifiers in Table 5. Note that most of the expressions are detected with high accuracy and that the confusion is larger for the Neutral and Anger classes. One reason why Anger is detected with an accuracy of only 85% is that this emotion is confused with Sadness and Neutral much more often than with the other emotions. We compared the proposed 3D Distance Vectors based Facial Expression Recognition method (3D-DVFER) with the 2D appearance feature based Gabor-wavelet (GW) approach [Lyons et al., 1999]. The Gabor-wavelet approach performs poorly, with an average recognition rate of around 80%. Compared with the performance shown in Table 5, the 3D-DVFER method is superior to the 2D appearance feature based methods when classifying the seven prototypic facial expressions. When we compare the results of the proposed system with previously reported results obtained on the same 3D database with an LDA classifier, we can see that our method outperforms the recognition rates in Table 6 for all of the facial expressions except the Happy case. Both systems give the same performance for the "Happy" facial expression.
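The person-independent protocol described above can be sketched as follows. This is one possible reading, stated as an assumption: the 60 subjects are shuffled and cut into 10 disjoint test groups of 6, so that every subject is tested once and never appears in the matching training set. The function names and the small two-class example labels are hypothetical; the confusion matrix helper counts true classes in rows and predicted classes in columns.

```python
import random

def subject_folds(subject_ids, n_folds, seed=0):
    """Shuffle subjects and split them into n_folds disjoint test groups."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]

def train_test_partitions(subject_ids, n_folds, seed=0):
    """Yield (train_subjects, test_subjects) pairs, split by subject."""
    folds = subject_folds(subject_ids, n_folds, seed)
    for test in folds:
        held_out = set(test)
        train = [s for s in subject_ids if s not in held_out]
        yield train, test

def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def recognition_rates(matrix):
    """Per-class rate: diagonal count divided by the row total."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

subjects = list(range(1, 61))
partitions = list(train_test_partitions(subjects, n_folds=10))

# Tiny hand-made example of scoring one round.
classes = ["Anger", "Sadness"]
true_labels = ["Anger", "Anger", "Sadness", "Sadness"]
predicted_labels = ["Anger", "Sadness", "Sadness", "Sadness"]
matrix = confusion_matrix(true_labels, predicted_labels, classes)
rates = recognition_rates(matrix)
```

Averaging the per-round confusion matrices over the ten partitions would then give an average confusion matrix of the kind reported in the tables.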

Input/Output Neutral Happy
Note that the LDA-based classifier does not consider the Neutral case as an expression, which gives that approach an advantage. The average recognition rate of the proposed system is 91.3%, whereas the average performance of the LDA-based method on the same 3D database stays at 83.6%.
Table 6. Average confusion matrix of the LDA based classifier.

Conclusion
In this chapter we have shown that a probabilistic neural network classifier can be used for the 3D analysis of facial expressions without relying on all of the 84 facial features or on an error-prone face pose normalization stage. Face deformation, as well as facial muscle contraction and expansion, are important indicators of facial expression; by using only 11 facial feature points and the symmetry of the human face, we are able to extract enough information from a face image. Our results show that 3D distance-vector based recognition outperforms similar systems that use 2D and 3D facial feature analysis. The average facial expression recognition rate of the proposed system reaches 91.3%. The quantitative results suggest that the proposed approach is a promising direction for higher-rate expression analysis.