Utterance independent bimodal emotion recognition in spontaneous communication

In spontaneous face-to-face communication, emotional expressions are often mixed with the facial movements produced by speech (the utterance), which makes emotion recognition difficult. This article introduces methods for reducing the influence of the utterance on the visual parameters used in audio-visual emotion recognition. The audio and visual channels are first combined under a Multistream Hidden Markov Model (MHMM). The utterance reduction is then performed by computing the residual between the real visual parameters and the predicted utterance-related visual parameters. To obtain this prediction, the article introduces a Fused Hidden Markov Model Inversion method, which is trained on a neutrally expressed audio-visual corpus. To reduce the computational complexity, the inversion model is further simplified to a Gaussian Mixture Model (GMM) mapping. Compared with traditional bimodal emotion recognition methods (e.g., SVM, CART, Boosting), the utterance reduction method gives better emotion recognition results. The experiments also show the effectiveness of our emotion recognition system when used in a live environment.

Spontaneous facial expression is the natural mode of real human-to-human communication (e.g., [35]). The studies reported in [36-39] explicitly investigated the difference between spontaneous and deliberate facial behavior. In spontaneous communication, facial expressions often combine emotions with the articulation of the utterance [40]. This combination can confuse emotion recognition methods. For instance, the facial expression of the phoneme "i" might be recognized as a smile. Some efforts have recently been reported on the analysis of spontaneous facial expression data (e.g., [19,20,36-39,41-49]). For instance, Pantic and Rothkrantz [1] and Fasel and Luttin [10] suggested that the facial block near the lips should not be used for emotion recognition. In Zeng et al.'s study [50,51], smoothed facial features are calculated by averaging facial features over consecutive frames to reduce the influence of the utterance on facial expression, based on the assumption that the influence of the utterance on facial features is temporary, whereas the influence of affect is relatively more persistent. However, these simply averaged facial features may give misleading cues about the facial expression, especially when the utterance is very short or when paralinguistic features are included. Moreover, most existing work still simply combines the audio-visual parameters for emotion recognition with feature-level or decision-level fusion models (e.g., [19,20,45-47]), and some studies focus only on extracting Action Units (AUs) from facial expressions rather than on emotion recognition (e.g., [36-39,41-44,48,49]).
In this article, we introduce a new utterance-independent method for bimodal emotion recognition in spontaneous communication. First, a Multistream Hidden Markov Model (MHMM) is used to combine the audio-visual features for emotion recognition. While integrated emotion theory is still under debate, we focus here on six basic emotion states: "happiness," "surprise," "fear," "anger," "sadness," and "neutral." To perform the utterance reduction, the input audio features are divided into two types: content-related features and prosody features. An audio-visual mapping from the content-related features to the facial expression is then created. The utterance-reduced visual parameters are finally obtained by subtracting the output of the audio-visual mapping from the real facial expressions. We introduce a Fused Hidden Markov Model (HMM) Inversion model to solve the mapping problem. To reduce the computational complexity, the inversion model is further simplified to a Gaussian Mixture Model (GMM) mapping. Furthermore, inspired by the idea of using averaged facial shapes [51], we also use dynamically smoothed facial parameters, which are tracked as a search in point distribution models (PDMs) [52], to obtain better visual parameters.
The article reports detailed experiments and discusses our methods in comparison with utterance-dependent methods. The results show that the utterance-independent methods improve emotion recognition, especially for easily confused emotions. Finally, a real-time bimodal emotion recognition system has been built for the live communication environment in our lab. The efficiency of the system is also discussed.
The contributions of this article are summarized as follows.
(a) Our approach takes advantage of time series analysis for emotion recognition by combining audio and visual features with a Multistream Hidden Markov Model (MHMM).
(b) We propose an utterance-independent method that enhances the visual expression parameters for emotion recognition in spontaneous communication through a hybrid of the MHMM and the fused HMM inversion.
(c) To reduce the computational complexity, we also propose an alternative utterance reduction model, which is based on a GMM and is a simplification of the inversion model.
(d) We carried out a detailed performance analysis of the utterance reduction methods and extended them to a live environment, which offers greater application potential as well as higher recognition accuracy.
This article is organized as follows. In "Bimodal fusion with Multistream Hidden Markov Model," we use the MHMM for audio-visual data fusion in emotion recognition. Unlike the traditional coupled or fused HMM, the discrete coupling parameters are extended to continuous observations in our study. Section "Utterance reduction with inversion method" introduces the fused HMM inversion model for utterance reduction. A two-layer clustering method in the visual configuration is further introduced to smooth the visual representations. The simplification of the Fused HMM Inversion to a GMM mapping is also described in this section, followed by the experiments and discussion of our study, where the audio-visual parameters and training data are described and the different utterance reduction models are discussed. We further compare our study, via extensive experiments, with typical emotion recognition methods: the SVM-based method [13], the CART-based method [11], the Boosting method [53], and the rule-based decision fusion method [11], as well as methods that use only unichannel features. Finally, we conclude our study and discuss future work.

Bimodal fusion with Multistream Hidden Markov Model
The most popular bimodal fusion methods are based on feature-level fusion [13] or decision-level fusion [11]. The former classifies the bimodal feature vectors combined from the audio and visual channels into different emotions directly [13], while the latter makes rule-based decisions after separate acoustic and visual classifications [11]. However, the audio and facial expressions are synchronous over successive time frames. A previous study [24] has shown that time series analysis methods can improve the robustness of such data processing. Thus, we apply the Multistream Hidden Markov Model (MHMM) (e.g., [4,51]) to emotion recognition in our study. In the MHMM framework, the composite facial features from video and the acoustic features from audio are treated as two streams and modeled by two component HMMs. We use a weighted summation to fuse the results from these component HMMs.
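The weighted summation of the two component HMMs can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that each component HMM has already produced one log-likelihood per emotion, and the function names and the `audio_weight` parameter are hypothetical.

```python
def fuse_stream_scores(audio_loglik, visual_loglik, audio_weight=0.5):
    """Weighted sum of per-emotion log-likelihoods from two component HMMs.

    audio_loglik / visual_loglik: dicts mapping emotion label -> log-likelihood.
    audio_weight: assumed stream weight in [0, 1]; the visual stream gets the rest.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    fused = {}
    for emotion in audio_loglik:
        fused[emotion] = (audio_weight * audio_loglik[emotion]
                          + visual_weight * visual_loglik[emotion])
    return fused


def recognize(audio_loglik, visual_loglik, audio_weight=0.5):
    """Return the emotion label with the highest fused score."""
    fused = fuse_stream_scores(audio_loglik, visual_loglik, audio_weight)
    return max(fused, key=fused.get)


# Hypothetical scores for two emotions, for illustration only.
audio_scores = {"happiness": -120.0, "anger": -110.0}
visual_scores = {"happiness": -80.0, "anger": -95.0}
print(recognize(audio_scores, visual_scores, audio_weight=0.5))
```

Changing the stream weight changes which channel dominates the decision, which is why the weight is usually tuned on held-out data.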
Among MHMMs, the Fused HMM has proven to be a good model for obtaining the probability of fused audio-visual training pairs (e.g., [4,51]). Given the observed audio and visual parameters, $O_a$ and $O_v$, and their corresponding HMMs, the Fused HMM was proposed to construct a structure linking the component HMMs together by giving an optimal estimate of the joint probability. Taking advantage of the fact that the data from a single sensor can be individually modeled by an HMM, and according to the maximum entropy principle and the maximum mutual information (MMI) criterion, the fusion model yields the following two structures, as shown by [4]:

$$p^{(1)}(O_a, O_v) = p(O_a)\, p(O_v \mid \hat{U}_a) \qquad (1)$$

$$p^{(2)}(O_a, O_v) = p(O_v)\, p(O_a \mid \hat{U}_v) \qquad (2)$$

where the most probable hidden state sequences $\hat{U}_a$ and $\hat{U}_v$ are estimated by the Viterbi algorithm. In general, the training process of the Fused HMM includes the following three main steps: (a) two individual HMMs, the visual component HMM and the audio component HMM in our study, are trained independently by the EM algorithm; (b) the best hidden state sequences of the HMMs are found using the Viterbi algorithm; (c) the coupling parameters are determined [4]. When a model is trained on each emotional corpus, the conditional probability can be used as the score for emotion recognition.
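Step (b) of the training process relies on Viterbi decoding. The following is a minimal log-domain sketch, assuming the start, transition, and per-frame emission log-probabilities of a component HMM are already available; the function name and argument layout are illustrative and not taken from [4].

```python
import math


def viterbi(log_start, log_trans, log_emit):
    """Most probable hidden state path of an HMM (log domain).

    log_start[i]    : log P(state i at t = 0)
    log_trans[i][j] : log P(state j at t + 1 | state i at t)
    log_emit[t][i]  : log P(observation at t | state i)
    """
    n_states = len(log_start)
    score = [log_start[i] + log_emit[0][i] for i in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor of state j at time t.
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j]
                             + log_emit[t][j])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


NEG_INF = float("-inf")
# Toy two-state left-to-right model with three frames (values are illustrative).
start = [0.0, NEG_INF]
trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
emit = [[math.log(0.9), math.log(0.1)],
        [math.log(0.2), math.log(0.8)],
        [math.log(0.1), math.log(0.9)]]
print(viterbi(start, trans, emit))
```

The decoded state sequence is what the coupling parameters in step (c) are conditioned on.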
In (1), $\hat{U}_a$ needs to be reliably estimated, while in (2) $\hat{U}_v$ has to be exactly determined. Previous studies (e.g., [4]) have shown that the first structure generates more stable results in bimodal emotion recognition because the hidden states of the speech HMM can be estimated more reliably. The coupling parameter in (1) represents the conditional probability distribution of the visual observation in the visual component HMM, given the states of the audio component HMM. To use (1) and (2) for audio-visual mapping, we extend the discrete coupling parameters in [4] to continuous observations as follows:

$$p(O_v \mid \hat{U}_a = j) = \sum_{k=1}^{K} C_{jk}\, N(O_v \mid \mu_{jk}, \Sigma_{jk}) \qquad (3)$$

where $O_v$ denotes the visual features modeled in the visual component HMM, and the Gaussian mixture is the distribution of the visual observation in audio state $j$. $N(O_v \mid \mu_{jk}, \Sigma_{jk})$ is the Gaussian density component related to audio state $j$, $\mu_{jk}$ and $\Sigma_{jk}$ are the $k$th mean vector and $k$th covariance matrix, $C_{jk}$ is the mixture weight, and $K$ is the number of Gaussian functions in the GMM.
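The continuous coupling parameter described above, a Gaussian mixture over the visual observation for a given audio state, can be evaluated as in the sketch below. It assumes diagonal covariance matrices and strictly positive mixture weights to keep the example short; both are simplifications introduced here, and the function names are hypothetical.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumed here)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def coupling_logprob(o_v, weights, means, variances):
    """log sum_k C_jk N(o_v | mu_jk, Sigma_jk) for one audio state j.

    weights   : mixture weights C_jk (positive, summing to 1)
    means     : list of mean vectors mu_jk
    variances : list of diagonal variance vectors (diagonal of Sigma_jk)
    """
    terms = [math.log(c) + diag_gaussian_logpdf(o_v, m, v)
             for c, m, v in zip(weights, means, variances)]
    peak = max(terms)
    # Log-sum-exp for numerical stability.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# One visual frame scored against a two-component mixture (toy numbers).
print(coupling_logprob([0.2, -0.1],
                       weights=[0.6, 0.4],
                       means=[[0.0, 0.0], [1.0, 1.0]],
                       variances=[[1.0, 1.0], [0.5, 0.5]]))
```

Working in the log domain avoids underflow when the per-frame densities are multiplied along a sequence.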

Utterance reduction with inversion method
Utterance reduction in the visual parameters aims to find the relationship between the visual parameters and the content-related audio parameters. It is achieved by computing the residual between the real visual parameters and the output of an audio-visual mapping that is trained on a neutrally expressed audio-visual corpus (see Figure 1).
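The residual computation itself is simple once the mapping output is available. The sketch below assumes that the real and predicted visual parameters are frame-aligned lists of equal-length feature vectors; the function name is illustrative.

```python
def utterance_residual(real_visual, predicted_neutral_visual):
    """Frame-wise residual: observed visual features minus the
    speech-driven (neutral) prediction from the audio-visual mapping."""
    if len(real_visual) != len(predicted_neutral_visual):
        raise ValueError("sequences must have the same number of frames")
    residual = []
    for real_frame, neutral_frame in zip(real_visual, predicted_neutral_visual):
        if len(real_frame) != len(neutral_frame):
            raise ValueError("frames must have the same dimensionality")
        residual.append([r - n for r, n in zip(real_frame, neutral_frame)])
    return residual


# Two frames with two visual features each (toy numbers).
print(utterance_residual([[1.0, 2.0], [3.0, 4.0]],
                         [[0.5, 2.0], [1.0, 1.0]]))
```

The residual sequence, rather than the raw visual features, is then passed to the emotion classifier.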

Fused HMM inversion model
To find the most probable visual parameters corresponding to the content-related speech parameters within the multistream HMM framework, we need to find the best-aligned HMM states between the two component HMMs. The HMM inversion algorithm was proposed in [4] and applied to robust speech recognition. Choi et al. [54] then used HMM inversion for dynamic audio-visual mapping, and its usefulness has been demonstrated in [55]. However, it can only handle a single HMM. In our study, we extend this work by introducing a Baum-Welch HMM inversion method for multistream HMMs.
As shown by Choi and Hwang [54], Xie and Liu [56], and Moon and Hwang [57], the optimal visual counterpart $\hat{O}_v$ can be formulated as the optimization of the following objective function, given an audio input:

$$\hat{O}_v = \arg\max_{O_v} \log P(O_a, O_v \mid \lambda_{av})$$

where $O_a$ denotes the audio features and $\lambda_{av}$ denotes the parameters of the fused HMM model. The optimum can be found by iteratively maximizing the auxiliary function $Q(\lambda_{av}, \lambda_{av}; O_a, O_v, \bar{O}_v)$ based on the Baum-Welch method, where $O_v$ and $\bar{O}_v$ denote the old and new visual vector sequences, respectively. In this study, the fused model can be presented as where for constants 1 ≥ 2 ≥ 0 with 1 + 2 = 1, 1 > 2. Both HMMs affect the synthesis result, but with different reliability. This is a straightforward extension of the presentation in [58].
The objective function can be expressed as where m av is the vector m av = m av The auxiliary function can be derived as where o v t is then used for the visual residual computing for utterance reduction.
In our study, we classify all visual parameters into several visual clusters (see section "Two-layer clustering in visual configuration") and choose a four-state left-to-right HMM for each cluster. Each visual cluster represents a deformation of the face shape. Because the audio and visual representations are time-synchronized in the neutral audio-visual corpus, the sequences of each clustered visual feature also have their own corresponding audio frames. Then, for each cluster sequence, we also train a three-state left-to-right HMM on the audio data. The best hidden state sequences of the audio component HMMs are found using the Viterbi algorithm, and a Gaussian Mixture Model (GMM) is fitted to the visual frame data of each estimated hidden state.
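The last step above, fitting a visual model to the frames aligned with each audio state, can be sketched as follows. For brevity the sketch estimates a single diagonal Gaussian per state, that is, a GMM with one component; this is a simplification introduced here, and a full implementation would run EM with several components per state. The function name and output layout are hypothetical.

```python
from collections import defaultdict


def per_state_visual_stats(state_path, visual_frames):
    """Group visual frames by the aligned audio state and estimate a
    single diagonal Gaussian per state (a one-component stand-in for
    the per-state GMM described in the text)."""
    grouped = defaultdict(list)
    for state, frame in zip(state_path, visual_frames):
        grouped[state].append(frame)
    stats = {}
    for state, frames in grouped.items():
        n = len(frames)
        columns = list(zip(*frames))  # one tuple per feature dimension
        mean = [sum(col) / n for col in columns]
        var = [sum((x - m) ** 2 for x in col) / n
               for col, m in zip(columns, mean)]
        stats[state] = {"count": n, "mean": mean, "var": var}
    return stats


# Viterbi path of the audio HMM and the synchronized visual frames (toy data).
print(per_state_visual_stats([0, 0, 1], [[1.0], [3.0], [5.0]]))
```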

Two-Layer clustering in visual configuration
If we do not control the number of clusters, we obtain a very large number of audio-visual candidates compared with phoneme-based units. To reduce the computational complexity, we therefore use a two-layer framework that divides the corpus into a series of subsets by considering both the visual and audio configurations. The framework is built as follows. In the first layer, we classify all audio-visual subsequences into 40 clusters, according to the size of the phoneme set. Each cluster center represents a repertoire of facial specifications. Each cluster is then divided into sub-clusters by the k-means method. These sub-clusters constitute the second layer. We can then train additional Fused HMMs for the sub-clusters below the representative Fused HMM of each cluster.
In the audio-visual mapping, we use the Fused HMMs of the first layer to select the best cluster. Then all Fused HMMs of the second layer within the selected cluster are checked to find the best sub-cluster according to the concatenation (smoothing) cost between two visual frames. The target visual output is obtained from these selected sub-clusters. The visual output is smoother when the whole set of subsets is used, as shown in Figure 2a, compared with Figure 2b.
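The second-layer selection can be illustrated with a small sketch. It assumes that the first-layer cluster has already been chosen for each step, that each sub-cluster is summarized by a representative visual frame, and that the concatenation cost is the squared Euclidean distance to the previously selected frame; the greedy search and all names are assumptions made for illustration.

```python
def concatenation_cost(previous_frame, candidate_frame):
    """Assumed smoothing cost: squared Euclidean distance between frames."""
    return sum((p - c) ** 2 for p, c in zip(previous_frame, candidate_frame))


def pick_subcluster(previous_frame, subcluster_centers):
    """Index of the sub-cluster whose center joins most smoothly."""
    costs = [concatenation_cost(previous_frame, center)
             for center in subcluster_centers]
    return min(range(len(costs)), key=costs.__getitem__)


def smooth_path(initial_frame, steps):
    """Greedy selection of one sub-cluster per step.

    steps: for each step, the sub-cluster centers of the cluster that the
    first layer selected for that step.
    """
    path = []
    previous = initial_frame
    for centers in steps:
        index = pick_subcluster(previous, centers)
        path.append(index)
        previous = centers[index]
    return path


# Two steps with two candidate sub-clusters each (toy numbers).
steps = [[[1.0, 1.0], [0.1, 0.0]],
         [[0.2, 0.1], [2.0, 2.0]]]
print(smooth_path([0.0, 0.0], steps))
```

A dynamic-programming search over all steps would find the globally smoothest path; the greedy version is shown only because it is shorter.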

Simplifying fused HMM inversion to GMM mapping
If we replace the audio HMM states with audio observations in (3), the function (3) reduces to a GMM that models the audio-visual observations jointly and directly, where $O_a$ is the audio observation, $M$ is the total number of mixture components, $\mu_k^v$ and $\mu_k^a$ are the $k$th mean vectors of the visual and audio observations, and $\Sigma_k^{av}$ is the $k$th covariance matrix of the audio and visual observations.
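A joint audio-visual GMM of this kind can be turned into an audio-to-visual regressor by conditioning on the audio part. The sketch below uses scalar audio and visual features so that no matrix inversion is needed; this restriction, the dictionary layout, and the function names are assumptions made for readability, and the vector case replaces the division by a matrix inverse.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_audio_to_visual(o_a, components):
    """Conditional expectation E[o_v | o_a] under a joint GMM.

    components: list of dicts with keys
      weight : mixture weight
      mu_a   : audio mean,  mu_v : visual mean
      var_a  : audio variance
      cov_va : covariance between visual and audio features
    """
    likelihoods = [c["weight"] * gaussian_pdf(o_a, c["mu_a"], c["var_a"])
                   for c in components]
    total = sum(likelihoods)
    if total == 0.0:
        raise ValueError("audio observation has zero likelihood under the GMM")
    estimate = 0.0
    for likelihood, c in zip(likelihoods, components):
        posterior = likelihood / total  # probability of component k given o_a
        conditional_mean = c["mu_v"] + c["cov_va"] / c["var_a"] * (o_a - c["mu_a"])
        estimate += posterior * conditional_mean
    return estimate


# Single-component toy model: the mapping is a straight line.
toy = [{"weight": 1.0, "mu_a": 0.0, "mu_v": 1.0, "var_a": 2.0, "cov_va": 1.0}]
print(gmm_audio_to_visual(2.0, toy))
```

Because each frame is mapped independently, this regressor carries no temporal context, in contrast to the HMM inversion.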
The GMM conversion reduces the computational complexity compared with the inversion method. However, it weakens the time series analysis in the audio-visual processing, because it simply replaces the HMM states with real audio observations. After the GMMs are trained by the EM method, the optimal estimate of the neutral facial deformation ($\bar{O}_v$) given the content-related speech parameters ($O_a$) can be obtained from the transform function of the conditional expectation,

$$\bar{O}_v = \sum_{k} p_k(O_a)\left[\mu_k^{v} + \Sigma_k^{av}\left(\Sigma_k^{a}\right)^{-1}\left(O_a - \mu_k^{a}\right)\right]$$

where $\Sigma_k^{a}$ is the covariance matrix in the audio vector space and $p_k(O_a)$ is the probability that the given audio observation belongs to the $k$th mixture component (Figure 3).

Experiments and discussion
In this article, we do not argue which parameters are best for recognition; we focus only on the modality fusion method and the utterance reduction method. We therefore choose geometric features based on 20 salient facial points (see Figure 4), including six brow corners and mid-points ($p_1$, $p_2$, $p_3$, $p_4$, $p_5$, and $p_6$), eight eye corners and mid-points ($p_7$, $p_8$, $p_9$, $p_{10}$, $p_{11}$, $p_{12}$, $p_{13}$, and $p_{14}$), two nostrils ($p_{15}$ and $p_{16}$), and four mouth corners and mid-points ($p_{17}$, $p_{18}$, $p_{19}$, and $p_{20}$), to represent the facial shape. This representation is a trade-off between the capacity to model the expressive structure of the face and the efficiency of feature extraction.
To better describe facial expression information, we divide the facial shape into two regions, the upper and lower regions. In the upper region, we have where θ, g, a, and b are angles defined in Figure 4. We also define two directions X and Y, which are collinear with the vectors $p_3p_4$ and $p_{18}p_{20}$, respectively. $|p_3p_9|$ and $|p_4p_{13}|$ are the lengths of the vectors $p_3p_9$ and $p_4p_{13}$. Inspired by the idea of using averaged facial shapes [51], we also use dynamically smoothed facial parameters, which are tracked as a search in point distribution models (PDMs) [52].
In PDMs, each facial shape $X$ is approximately represented by a linear combination of basic variations as

$$X \approx \bar{X} + P h$$

where $\bar{X}$ is the mean facial shape, $P = [p_1, p_2, \ldots, p_{2N}]$ is the matrix whose columns $p_n$, $n \in \{1, 2, \ldots, 2N\}$, denote all facial variation directions, and $h = [b_1, b_2, \ldots, b_{2N}]^T$ is the PDM representation of the facial shape, in which $b_n$ indicates how much variation is exhibited in each direction.
The method for calculating $P$ and $h$ has been reported in [52]. Under the assumption that the utterance produces a kind of random variation in the facial expression, the mean facial shape $\bar{X}$ can be considered as the smoothed facial expression for utterance reduction. However, to obtain dynamic facial features over time, we segment the whole facial utterance into several short periods. For each period, we compute a mean facial shape of the PDM, and we concatenate these mean facial shapes for emotion recognition (Figure 5).
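The period-wise smoothing can be sketched as below. The sketch assumes fixed-length periods and averages the landmark coordinate vectors directly instead of fitting a full PDM for each period; both choices, as well as the function names, are simplifications made for illustration.

```python
def mean_shape(frames):
    """Coordinate-wise mean of a list of shape vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def segment_means(frames, period_length):
    """Split a shape sequence into consecutive periods and return the
    mean shape of each period, in temporal order."""
    if period_length <= 0:
        raise ValueError("period_length must be positive")
    means = []
    for start in range(0, len(frames), period_length):
        means.append(mean_shape(frames[start:start + period_length]))
    return means


# Three frames of a two-coordinate shape, periods of two frames (toy data).
print(segment_means([[0.0, 0.0], [2.0, 4.0], [1.0, 1.0]], period_length=2))
```

The period length controls the trade-off between smoothing out articulation and keeping the dynamics of the expression.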

Audio parameters
From our previous research on audio parameters for emotional speech classification [22] and speech-driven talking heads [68], we have found that prosody parameters, including F0, speed, and energy, have good "resolving power" for emotional expression, while some spectral parameters, including MFCCs [69], have a strong influence on the utterance-related expression of the face. To simplify the study, we use only the MFCCs to reduce the utterance-related expression in the face.

Training database
Most current spontaneous emotion recognition systems use datasets collected in human-human conversation scenarios (e.g., [20,60,70-73]). In our study, the training database was collected from 30 subjects (15 males and 15 females) at the National Laboratory of Pattern Recognition (NLPR). In each session, one subject was asked to sit in a noise-reduced environment and talk to us for about 2 h with exaggerated expressions during the conversation, like drama actors or actresses. The subjects were simply asked to display facial expressions and speak in a natural way. After all speakers had been recorded, the data were labeled segment by segment by three annotators as "happiness," "sadness," "anger," "fear," "surprise," or "neutral." We selected 400 sentences for each emotion (about 1.8 h of data) for the experiments. The SPTK toolkit [69] and the AAM method [74] were used to extract the audio and visual features. For each emotion state, 90% of the data are used for training and the rest for testing. The Fused HMM inversion and the GMM are trained on the whole training set of neutral videos (Figure 6).
To make the study comparable with others, we also use the Belfast Naturalistic Database for testing. Some samples of the Belfast corpus are shown in Figure 7.
From the emotion recognition results based on our three utterance reduction methods, it is clear that the utterance reduction methods improve emotion recognition compared with recognition without an utterance reduction model.
We find that the HMM-inversion-based method is better than the GMM-based method. Using the HMM states or the centers of the visual clusters as the outputs of the visual parameters, the HMM inversion can simulate the detailed facial deformation during speech. In our previous study, we even used it in a speech-driven facial animation system [68]. In our study, it gives better utterance reduction results than the GMM-based method, which may produce over-smoothed visual parameter outputs. The results for "neutral" and "fear" with the GMM-based method are even worse than those without any utterance reduction method. These results confirm the report in [75], which demonstrated the over-smoothing problem of GMMs in conversion tasks.
Figure 5 Framework of utterance reduction by PDMs.
Figure 6 Samples selected from NLPR Emotional Database.
The results obtained using only the PDM model with the MHMM are not as good as those of the two other utterance reduction methods in the article. However, they are slightly better than those without utterance reduction. Because the facial expression caused by the utterance cannot really be considered a random visual variation, the average face shape based on the PDM oversimplifies the problem. In particular, if the same phonemes are repeated frequently within a short period, the PDM mean face shape still contains utterance information, which may easily be confused with some emotion states. This confusion occurs mostly between the phoneme "a" and "surprise," or between "i" and "angry." Thus, the improvement obtained by using only the PDM is small.
In the experiments, we also ran an interesting test that combines the PDM with the HMM inversion and the GMM. We first use the HMM inversion to reduce the influence of the utterance after the visual tracking. The PDM method is then used to further smooth the facial deformation. This is helpful because random variation always remains after we calculate the residual between the real input visual parameters and the outputs of the audio-visual conversion models. The results in Figure 8 confirm our proposal: the recognition accuracies are improved and the emotion confusions are decreased.
Further tests were carried out on the Belfast Naturalistic Database. Because of the different styles of emotional presentation, only four emotion states, "happiness," "sadness," "anger," and "surprise," are selected from the Belfast database for the experiment. The results are shown in Figure 9.
The results show that the conclusions drawn from the NLPR emotional corpus also hold for the Belfast Naturalistic Database. However, most of the emotion recognition rates are lower than those on the NLPR corpus. The major reason is that the NLPR corpus is a kind of posed corpus. The speakers were asked to sit in front of the camera and were not allowed to perform complicated actions, e.g., looking around or nodding. The speech was also recorded in a noise-reduced environment. The Belfast corpus is more spontaneous and contains more actions, so the emotion recognition results on the NLPR corpus are higher than those on the Belfast corpus. The difficulty of tracking facial expressions might be another reason for the lower emotion recognition rates on the Belfast database.

Comparisons with uni-modal methods
To compare with methods using uni-modal parameters, we performed experiments in which parameters extracted from a single audio or visual channel were fed into an HMM emotion recognition approach. The test results on the NLPR database are shown in Figure 10.
Compared with the two methods that use the uni-modal parameters individually, the results confirm that the complementarity of the two channels in the bimodal method improves the performance of emotion recognition.

Comparison with other bimodal fusion methods
For further comparison with other studies, we reimplemented four typical methods: the SVM method [13], the CART method [11], the Boosting method [53], and the rule-based decision fusion method [11]. The results on the NLPR emotional database are shown in Figure 11. Although the SVM and Boosting methods are good classifiers, their results are slightly poorer than those of our MHMM + Inv method (see Figure 11). This clearly demonstrates the importance of including time series information in bimodal emotion recognition. By integrating the utterance reduction from audio to visual parameters in a reasonable way, a more efficient emotion recognition system can be developed. Figure 11 also shows that the rule-based decision fusion algorithm is the worst of the emotion recognition methods tested on our data. As different emotions may be expressed in different ways, a fixed modality-specific dominance determined by rules for all people or emotions is not sufficient.

Emotion recognition in paralinguistic expression
It is also interesting to examine the emotion recognition results when the subjects speak only one or two emotion-related paralinguistic words, e.g., "[A]", "[x ɤ]", "[ən]", etc. Among them, "[A]" might be used to express "surprise," "[x ɤ]" is typical of a "happy" mood, and "[ən]" could be related to "angry." However, the expressions vary among subjects. The expression of paralinguistic words poses a hard problem. Do our utterance-independent models also work in these cases?
We selected 121 emotional sentences containing these paralinguistic words from the NLPR database for testing, and the results are shown in Figures 12 and 13. There are also situations in which the emotions are influenced by modal words, but these problems are beyond the scope of this article.
Figures 12 and 13 show that, unfortunately, the MHMM + Inv and MHMM + GMM methods do not give results as good as we expected. This indicates that emotions can sometimes hardly be separated from the speech content in paralinguistic expressions. However, we find that the hybrid method of MHMM and PDM gives the best results among all methods. In general, the fused model, which integrates the time sequences, still works better than other fusion methods even for paralinguistic expressions, and the facial shapes smoothed with the PDM method consistently improve the recognition accuracy.

Tests of time delay
To use the methods in real applications, we measured the time delay of the major models and list the results in Table 1.
The two indicators are the average emotion recognition rate over the whole database (mean accuracy) and the average running time per image (in milliseconds). The table shows that the MHMM + Inv method achieves the best average emotion recognition rate; the time consumption of this method is also compared with that of the other methods.
Figure 10 Comparison of the emotion recognition results based only on audio or visual parameters (panels: based on audio parameters; based on visual parameters).

Conclusion and future study
This article presented a framework using the MHMM for bimodal emotion recognition. Six different emotions are classified by integrating the audio and visual input channels in communication. Within this framework, the article introduces an utterance reduction method, based on the Fused HMM inversion model, to improve the quality of the visual parameters in emotion recognition. To reduce the computational complexity, the inversion model can be further simplified to a GMM. The PDM is also introduced to smooth the visual tracking results. We carried out several experiments to evaluate our methods. The final results show that the hybrid method consisting of MHMM, HMM inversion, and PDM works best in most cases, except for some emotions expressed by paralinguistic words. For paralinguistic expressions, the method combining MHMM and PDM works best. Compared with previous bimodal emotion recognition methods, e.g., SVM, CART, Boosting, and rule-based decision fusion, our methods give better emotion recognition results.
As the current research still focuses on six basic emotions, more databases with spontaneous expressions will be recorded in the future. Fused emotions, e.g., "painful," will be added. Some datasets will be collected directly from TV. Additionally, we will pay more attention to the classification of paralinguistic information in spontaneous conversation.