Open Access

Utterance independent bimodal emotion recognition in spontaneous communication

  • Jianhua Tao1 (corresponding author),
  • Shifeng Pan1,
  • Minghao Yang1,
  • Ya Li1,
  • Kaihui Mu1 and
  • Jianfeng Che1
EURASIP Journal on Advances in Signal Processing 2011, 2011:4

Received: 2 August 2010

Accepted: 13 May 2011

Published: 13 May 2011


In spontaneous face-to-face communication, emotional expressions are often mixed with the facial movements of speech, which makes emotion recognition difficult. This article introduces methods for reducing the influence of the utterance on the visual parameters in audio-visual emotion recognition. The audio and visual channels are first combined under a Multistream Hidden Markov Model (MHMM). The utterance reduction is then performed by computing the residual between the real visual parameters and the utterance-related visual parameters predicted from the audio. To obtain this prediction, the article introduces a Fused Hidden Markov Model Inversion method trained on a neutrally expressed audio-visual corpus. To reduce the computational complexity, the inversion model is further simplified to a Gaussian Mixture Model (GMM) mapping. Compared with traditional bimodal emotion recognition methods (e.g., SVM, CART, Boosting), the utterance reduction method gives better emotion recognition results. The experiments also show the effectiveness of the emotion recognition system when it is used in a live environment.


Bimodal emotion recognition; Utterance independent; Multistream Hidden Markov Model; Fused Hidden Markov Model Inversion


The last two decades have seen significant effort devoted to developing methods for automatic human emotion recognition (e.g., [1-15]), an attractive research topic because of its great potential in human-computer interactions (HCIs), virtual reality, etc. Although there are a few tentative efforts to detect non-basic affective states, including fatigue (e.g., [16]) and mental states such as agreeing, concentrated, disagreeing, interested, thinking, confused, and frustration (e.g., [17-20]), most existing efforts focus on a few basic emotions because of their universal properties, their marked reference representation in our affective lives, and the availability of relevant training and test material (e.g., [1, 2, 7, 21]). Many classical machine learning and pattern recognition algorithms have been used to infer emotion states. Most of them use only a single channel (e.g., [2, 8, 10, 21-28]), for instance, facial expression (e.g., [2, 8, 10]) or speech (e.g., [21-28]). As reported in [29], vocal intonations and facial expressions together determine the listener's affective state in up to 93% of cases. Recently, increased attention has been paid to analyzing multimodal information in emotion recognition (e.g., [1, 7, 9-13, 30-34]). However, most of these studies still use deliberate and often exaggerated facial displays (e.g., [2, 5]).

Spontaneous facial expression is the natural mode of real human-to-human communication (e.g., [35]). The studies reported in [36-39] explicitly investigated the difference between spontaneous and deliberate facial behavior. In spontaneous communication, facial expressions sometimes combine emotions with the articulation of the spoken utterance [40], which can confuse emotion recognition methods. For instance, the facial expression of the phoneme "i" might be recognized as a smile. Some efforts have recently been reported on the analysis of spontaneous facial expression data (e.g., [19, 20, 36-39, 41-49]). For instance, Pantic and Rothkrantz [1] and Fasel and Luttin [10] suggested that the facial block near the lips should not be used for emotion recognition. In Zeng et al.'s study [50, 51], smoothed facial features are calculated by averaging facial features over consecutive frames to reduce the influence of the utterance on facial expression, based on the assumption that the influence of the utterance on facial features is temporary, whereas the influence of affect is relatively more persistent. However, such simply averaged facial features may give misleading cues about the facial expression, especially when the utterance is very short or contains paralinguistic features. Moreover, most existing work still simply combines the audio-visual parameters for emotion recognition through feature-level or decision-level fusion (e.g., [19, 20, 45-47]), and some studies focus only on extracting Action Units (AUs) from the facial expression rather than on emotion recognition (e.g., [36-39, 41-44, 48, 49]).

In this article, we introduce a new utterance-independent method for bimodal emotion recognition in spontaneous communication. First, a Multi-stream Hidden Markov Model (MHMM) is used to combine the audio-visual features for emotion recognition. While integrated emotion theory is still under debate, we focus here on six basic emotions: "happiness," "surprise," "fear," "anger," "sadness," and "neutral." To perform the utterance reduction, the input audio features are classified into two types: content-related features and prosody features. An audio-visual mapping from the content-related features to the facial expression is then created. The utterance-reduced visual parameters are finally obtained by subtracting the output of the audio-visual mapping from the real facial expressions. We introduce a Fused Hidden Markov Model (HMM) Inversion model to solve the mapping problem. To reduce the computational complexity, the inversion model is further simplified to a Gaussian Mixture Model (GMM) mapping. Furthermore, inspired by the idea of averaging facial shapes [51], we also use dynamically smoothed facial parameters, which are tracked as a search in point distribution models (PDMs) [52], to obtain better visual parameters.

The article presents detailed experiments and a discussion of our methods in comparison with utterance-dependent methods. The results show that the utterance-independent methods improve emotion recognition, especially for easily confused emotions. Finally, a real-time bimodal emotion recognition system for live communication has been built in our lab, and the efficiency of the system is also discussed.

The contributions of this article are summarized as follows.

  (a) Our approach takes advantage of time series analysis for emotion recognition by combining audio and visual features with a Multi-stream Hidden Markov Model (MHMM).

  (b) We propose an utterance-independent method that enhances the visual expression parameters for emotion recognition in spontaneous communication through a hybrid of the MHMM and the fused HMM inversion.

  (c) To reduce the computational complexity, we also propose an alternative utterance reduction model that is based on a GMM and is simplified from the inversion model.

  (d) We analyze the performance of the utterance reduction methods in detail and extend them to a live environment, which offers greater application potential as well as higher recognition accuracy.


This article is organized as follows. In "Bimodal fusion with Multistream Hidden Markov Model," we use the Multi-stream Hidden Markov Model (MHMM) for audio-visual data fusion in emotion recognition. Unlike the traditional coupled or fused HMM, the discrete coupling parameters are extended to continuous observations in our study. Section "Utterance reduction with inversion method" introduces the fused HMM inversion model for utterance reduction. A two-layer clustering method in the visual configuration is further introduced to smooth the visual representations. The simplification of the Fused HMM Inversion to a GMM mapping is also described in this section. Section "Experiments and discussion" describes the audio-visual parameters and the training data, and discusses the different utterance reduction models. We further compare our study with typical emotion recognition methods, namely the SVM-based method [13], the CART-based method [11], the Boosting method [53], and the rule-based decision fusion method [11], as well as with methods that use only uni-channel features, through extensive experiments. Finally, we conclude our study and discuss future work.

Bimodal fusion with Multistream Hidden Markov Model

The most popular bimodal fusion methods are based on feature-level fusion [13] or decision-level fusion [11]. The former classifies bimodal feature vectors, combined from the audio and visual channels, directly into different emotions [13], while the latter makes decisions based on rules after separate acoustic and visual classification [11]. However, the audio and facial expressions are synchronous over successive times, and a previous study [24] has shown that time series analysis methods can improve the robustness of such data processing. We therefore apply the Multi-stream Hidden Markov Model (MHMM) (e.g., [4, 51]) to emotion recognition in our study. In the MHMM framework, the composite facial features from video and the acoustic features from audio are treated as two streams and modeled by two component HMMs. We use a weighted summation to fuse the results from these component HMMs.
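The weighted summation of the two stream results can be illustrated with a minimal sketch. It assumes that each component HMM returns one log-likelihood per emotion model; the function name `fuse_stream_scores`, the stream weight `audio_weight`, and the score values are hypothetical and are not taken from the article.

```python
EMOTIONS = ["happiness", "surprise", "fear", "anger", "sadness", "neutral"]

def fuse_stream_scores(audio_loglik, visual_loglik, audio_weight=0.5):
    """Weighted sum of per-emotion log-likelihoods from the two component HMMs."""
    if len(audio_loglik) != len(visual_loglik):
        raise ValueError("both streams must score the same set of emotion models")
    visual_weight = 1.0 - audio_weight
    return [audio_weight * a + visual_weight * v
            for a, v in zip(audio_loglik, visual_loglik)]

# Hypothetical scores: one log-likelihood per emotion model and stream.
audio_scores = [-120.0, -118.0, -130.0, -125.0, -140.0, -119.0]
visual_scores = [-80.0, -95.0, -99.0, -97.0, -101.0, -90.0]

fused = fuse_stream_scores(audio_scores, visual_scores, audio_weight=0.6)
# The recognized emotion is the model with the highest fused score.
predicted = EMOTIONS[max(range(len(fused)), key=fused.__getitem__)]
print(predicted)
```

In practice the stream weight would be tuned on held-out data; the value 0.6 here is only a placeholder.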

Among the MHMMs, the Fused HMM has proven to be a good model for obtaining the joint probability of fused audio-visual training pairs (e.g., [4, 51]). Given the observed audio and visual parameters, $O^a$ and $O^v$, and their corresponding HMMs, the Fused HMM constructs a structure that links the component HMMs together by giving an optimal estimate of the joint probability. Taking advantage of the fact that the data from a single sensor can be individually modeled by an HMM, and according to the maximum entropy principle and the maximum mutual information (MMI) criterion, the fusion model yields the following two structures, as shown by [4]:

$$\hat{p}^{(1)}(O^a, O^v) = p(O^a)\, p(O^v \mid \hat{U}^a) \qquad (1)$$

$$\hat{p}^{(2)}(O^a, O^v) = p(O^v)\, p(O^a \mid \hat{U}^v) \qquad (2)$$

where the most probable hidden state sequences $\hat{U}^a$ and $\hat{U}^v$ are estimated by the Viterbi algorithm. Training the Fused HMM generally involves three main steps: (a) two individual HMMs, here a visual component HMM and an audio component HMM, are trained independently by the EM algorithm; (b) the best hidden state sequences of the HMMs are found using the Viterbi algorithm; (c) the coupling parameters are determined [4]. When a model is trained on the corpus of each emotion, the conditional probability can be used as the probability for emotion recognition.
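Step (b) of the training procedure relies on Viterbi decoding. The following is a minimal sketch of log-domain Viterbi decoding for a single component HMM, written with NumPy; the function name `viterbi` and the toy two-state model are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit):
    """Most likely state path of an HMM.

    log_start: (S,) log initial probabilities
    log_trans: (S, S) log transition probabilities, row = from, column = to
    log_emit:  (T, S) log-likelihood of each frame under each state
    """
    T, S = log_emit.shape
    delta = np.empty((T, S))            # best log score ending in each state
    back = np.zeros((T, S), dtype=int)  # best predecessor of each state
    delta[0] = log_start + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # scores[i, j]: i -> j
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(S)] + log_emit[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):      # backtrack from the last frame
        path[t] = back[t + 1, path[t + 1]]
    return path

# Toy two-state model (hypothetical numbers).
log_start = np.log([0.5, 0.5])
log_trans = np.log([[0.9, 0.1], [0.1, 0.9]])
log_emit = np.log([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
state_path = viterbi(log_start, log_trans, log_emit)
print(state_path)
```

The decoded state path is what step (c) would condition on when estimating the coupling parameters.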

In (1), the hidden state sequence of the audio HMM must be reliably estimated, while in (2) the hidden state sequence of the visual HMM has to be exactly determined. Previous studies (e.g., [4]) have shown that the first structure generates more stable results in bimodal emotion recognition because the hidden states of the speech HMM can be estimated more reliably. The coupling parameter in (1) represents the conditional probability distribution of the visual observation in the visual component HMM, given the states of the audio component HMM. To use (1) and (2) for audio-visual mapping, we extend the discrete coupling parameters in [4] to continuous observations as follows:

$$p(O^v \mid \hat{u}^a = j) = \sum_{k=1}^{K} C_{jk}\, N(O^v \mid \mu_{jk}, \Sigma_{jk}) \qquad (3)$$

where $O^v$ is the visual feature modeled in the visual component HMM, and this Gaussian mixture describes the visual observation in audio state $j$. $N(O^v \mid \mu_{jk}, \Sigma_{jk})$ is the Gaussian density component related to audio state $j$, $\mu_{jk}$ and $\Sigma_{jk}$ are the $k$th mean vector and $k$th covariance matrix, $C_{jk}$ is the mixture weight, and $K$ is the number of Gaussian functions in the GMM.
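As an illustration of the continuous coupling parameter, the sketch below evaluates the log of a Gaussian mixture density for one visual frame and one audio state. It assumes full covariance matrices; the helper names `log_gaussian` and `coupling_log_density` and the toy parameters are hypothetical.

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Log density of a multivariate Gaussian N(x | mean, cov)."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def coupling_log_density(o_v, weights, means, covs):
    """log sum_k C_jk N(o_v | mu_jk, Sigma_jk) for one audio state j."""
    comps = np.array([np.log(w) + log_gaussian(o_v, m, c)
                      for w, m, c in zip(weights, means, covs)])
    top = comps.max()  # log-sum-exp for numerical stability
    return top + np.log(np.exp(comps - top).sum())

# Toy check: two identical unit Gaussians with equal weights (hypothetical).
value = coupling_log_density(np.zeros(1), [0.5, 0.5],
                             [np.zeros(1), np.zeros(1)],
                             [np.eye(1), np.eye(1)])
print(value)
```

Working in the log domain avoids underflow when the visual feature vector has many dimensions.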

Utterance reduction with inversion method

Utterance reduction in the visual parameters amounts to finding the relationship between the visual parameters and the content-related audio parameters. It can be achieved by computing the residual between the real visual parameters and the output of an audio-visual mapping trained on the neutrally expressed audio-visual corpus (see Figure 1).
Figure 1

Framework of utterance reduction using audio-visual mapping.

Fused HMM inversion model

To find the most probable visual parameters corresponding to the content-related speech parameters within the multi-stream HMM framework, we need to find the best-aligned HMM states between the two component HMMs. The HMM inversion algorithm was proposed in [4] and applied to robust speech recognition. Choi et al. [54] then used HMM inversion for dynamic audio-visual mapping, whose usefulness has been demonstrated in [55]. However, it can only handle a single HMM. In our study, we extend this work by introducing a Baum-Welch HMM inversion method for multi-stream HMMs.

As shown by Choi and Hwang [54], Xie and Liu [56], and Moon and Hwang [57], finding the optimal visual counterpart of a given audio input can be formulated as the optimization of the objective function L(Ov) = log P(Oa, Ov|λav), where Oa denotes the audio features and λav the parameters of the fused HMM. The optimum can be found by iteratively maximizing the auxiliary function based on the Baum-Welch method,

where Ov and its re-estimate denote the old and new visual vector sequences, respectively.

In this study, the fused model can be presented as

where κ1 and κ2 are constants with κ1 ≥ κ2 ≥ 0 and κ1 + κ2 = 1. Both component HMMs therefore affect the synthesis result, but with different reliability. This is a straightforward extension of the formulation in [58].

The objective function can be expressed as

where mav is the vector , and mv is the vector that indicates the mixture component for each state at each time.

The auxiliary function can be derived as
By setting the derivative of the auxiliary function with respect to the new visual vector to zero, we can find the re-estimated visual vector.

The re-estimated visual vector is then used to compute the visual residual for utterance reduction.

In our study, we classify all visual parameters into several visual clusters (see section "Two-layer clustering in visual configuration") and choose a four-state left-to-right HMM for each cluster. Each visual cluster represents a deformation of the face shape. Because the audio and visual representations are time-synchronized in the neutral audio-visual corpus, the sequence of each clustered visual feature has its own corresponding audio frames. For each cluster sequence, we therefore also train a three-state left-to-right HMM on the audio data. The best hidden state sequences of the audio component HMMs are found using the Viterbi algorithm, and a Gaussian Mixture Model (GMM) is fitted to the visual frames of each estimated hidden state.
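The last step, fitting a visual model to the frames aligned with each audio state, can be sketched as follows. For brevity, the sketch fits a single Gaussian per state (the K = 1 special case of a GMM) rather than a full mixture; the function name `fit_state_gaussians`, the regularization term `reg`, and the toy data are assumptions made for illustration.

```python
import numpy as np

def fit_state_gaussians(state_path, visual_frames, n_states, reg=1e-6):
    """Fit one Gaussian to the visual frames aligned with each audio state.

    state_path:    (T,) decoded audio state index per frame
    visual_frames: (T, D) synchronized visual feature vectors
    Returns {state: (mean, covariance)} for every state that was visited.
    """
    state_path = np.asarray(state_path)
    visual_frames = np.asarray(visual_frames, dtype=float)
    dim = visual_frames.shape[1]
    params = {}
    for j in range(n_states):
        frames = visual_frames[state_path == j]
        if len(frames) == 0:
            continue  # state never visited in this sequence
        mean = frames.mean(axis=0)
        centered = frames - mean
        # Small diagonal term keeps the covariance invertible (assumption).
        cov = centered.T @ centered / len(frames) + reg * np.eye(dim)
        params[j] = (mean, cov)
    return params

# Toy alignment: a three-state audio path and 1-D visual frames (hypothetical).
path = [0, 0, 1, 1, 2]
frames = [[0.0], [2.0], [10.0], [12.0], [5.0]]
state_params = fit_state_gaussians(path, frames, n_states=3)
print(state_params[0][0])
```

A full mixture per state could be fitted with the EM algorithm in the same per-state loop.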

Two-Layer clustering in visual configuration

If we do not control the number of clusters, we end up with a very large number of audio-visual candidates compared with phoneme-based units. To reduce the computational complexity, we therefore use a two-layer framework that divides the corpus into a series of subsets by considering both the visual and the audio configurations. The two-layer framework is built as follows.

In the first layer, we classify all audio-visual subsequences into 40 clusters, matching the size of the phoneme set. Each cluster center represents the repertoire of facial specification. Each cluster is then divided into sub-clusters by the k-means method, and these sub-clusters constitute the second layer. We can then train additional Fused HMMs for the sub-clusters below the representative Fused HMM of each cluster.

In the audio-visual mapping, we use the Fused HMMs of the first layer to select the best cluster. All Fused HMMs of the second layer within the selected cluster are then checked to find the best sub-cluster according to the concatenation (smoothing) cost between two visual frames. The target visual output is obtained from the selected sub-clusters. The visual output is smoother when the whole set of sub-clusters is used, as shown in Figure 2a, compared with Figure 2b.
Figure 2

The visual output by using different layers, (a) only using the first layer Fused HMMs. (b) using the second layer Fused HMMs.
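The two-layer grouping can be sketched in a few lines. The sketch assumes that the first layer is given by a phoneme identifier per sample and that the second layer is obtained with a plain k-means on the visual features of each group; the names `kmeans` and `two_layer_clusters`, the number of sub-clusters, and the toy data are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, labels

def two_layer_clusters(features, phoneme_ids, n_sub=2):
    """First layer: one cluster per phoneme id. Second layer: k-means inside it."""
    layers = {}
    for pid in np.unique(phoneme_ids):
        group = features[phoneme_ids == pid]
        centers, labels = kmeans(group, min(n_sub, len(group)))
        layers[int(pid)] = {
            "center": group.mean(axis=0),  # representative of the first layer
            "sub_centers": centers,        # second-layer sub-clusters
            "sub_labels": labels,
        }
    return layers

# Toy data: two phoneme groups, each containing two well-separated blobs.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0],
                  [5.0, 5.0], [5.1, 5.0], [-5.0, -5.0], [-5.1, -5.0]])
pids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
layers = two_layer_clusters(feats, pids, n_sub=2)
print(sorted(layers))
```

In the article, a Fused HMM is trained for each cluster and sub-cluster; the sketch only covers the grouping itself.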

Simplifying fused HMM inversion to GMM mapping

If we replace the audio HMM states in (3) with the audio observations, (3) reduces to a GMM that models the audio-visual observations jointly,

where Oa is the audio observation within a total number M, μ k v and μ k a are the k th mean vectors of the visual and audio observations, and Σ k av is the k th covariance matrix of the joint audio and visual observations.

The GMM conversion reduces the computational complexity compared with the inversion method; however, it weakens the time-series analysis in the audio-visual processing because it simply replaces the HMM states with the real audio observations. After the GMMs are trained by the EM method, the optimal estimate of the neutral facial deformation given the content-related speech parameters (Oa) can be obtained from the conditional expectation,
where the covariance matrix is that of the audio vector space and p k (Oa) is the probability that the given audio observation belongs to the k th mixture component (Figure 3).
Figure 3

Framework of utterance reduction by using GMM.
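The GMM-based mapping and the subsequent residual can be sketched as below. The sketch assumes the textbook conditional-expectation form of joint-GMM regression, in which each component contributes its visual mean plus a linear correction from the audio observation, weighted by the posterior of that component given the audio. The function name, the argument layout, and the toy parameters are hypothetical and may differ from the exact formulation used in the article.

```python
import numpy as np

def gmm_map_audio_to_visual(o_a, weights, mu_a, mu_v, cov_aa, cov_va):
    """Conditional expectation E[O_v | O_a] under a joint audio-visual GMM.

    weights: K mixture weights
    mu_a, mu_v: per-component audio and visual mean vectors
    cov_aa: per-component audio covariance matrices
    cov_va: per-component visual-audio cross-covariance matrices
    """
    n_comp = len(weights)
    log_post = np.empty(n_comp)
    for k in range(n_comp):
        diff = o_a - mu_a[k]
        _, logdet = np.linalg.slogdet(cov_aa[k])
        maha = diff @ np.linalg.solve(cov_aa[k], diff)
        # The (2*pi) constant is identical for all components and cancels.
        log_post[k] = np.log(weights[k]) - 0.5 * (logdet + maha)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()  # posterior of each component given the audio frame
    estimate = np.zeros_like(mu_v[0], dtype=float)
    for k in range(n_comp):
        correction = cov_va[k] @ np.linalg.solve(cov_aa[k], o_a - mu_a[k])
        estimate += post[k] * (mu_v[k] + correction)
    return estimate

# Toy single-component example (hypothetical numbers).
neutral_estimate = gmm_map_audio_to_visual(
    np.array([4.0]), [1.0],
    mu_a=[np.array([0.0])], mu_v=[np.array([1.0])],
    cov_aa=[np.array([[2.0]])], cov_va=[np.array([[1.0]])])

# Utterance reduction: residual between observed and predicted neutral visuals.
observed_visual = np.array([3.5])
residual = observed_visual - neutral_estimate
print(residual)
```

The residual is the utterance-reduced visual parameter that would be passed on to the emotion classifier.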


Experiments and discussion

Visual parameters

The usually extracted facial features are either geometric features such as the shapes of the facial components (eyes, mouth, etc.) (e.g., [52]) and the location of facial salient points (corners of the eyes, mouth, etc.) (e.g., [59]) or appearance features (e.g., [43, 44, 60]), representing the facial texture using Fisher's linear discriminant analysis (FDA) [61], principal component analysis (PCA) (e.g., [62, 63]), independent component analysis (ICA) [64], and Gabor wavelets [65], Haar features [66], spatial ratio face template [67], or manifold subspace [41].

In this article, we do not argue which parameters are best for recognition, but focus on the modality fusion method and the utterance reduction method. We therefore choose geometric features based on 20 salient facial points (see Figure 4), including six brow corners and mid-points (p1, p2, p3, p4, p5, and p6), eight eye corners and mid-points (p7, p8, p9, p10, p11, p12, p13, and p14), two nostrils (p15 and p16), and four mouth corners and mid-points (p17, p18, p19, and p20), to represent the facial shape. This representation is a tradeoff between the modeling capacity of the facial expressive structure and the efficiency of feature extraction.
Figure 4

Face shape represented by some salient facial feature points.

To better describe facial expression information, we divide the facial shape into two regions, the upper and lower regions. In the upper region, we have , , , where θ, γ, α, and β are angles defined in Figure 4. We also define two directions X and Y, which are collinear with the vectors and , respectively. and are the distances of vectors and .

In the lower region, we have , , where and are the distances of vectors and .

Inspired by the idea of averaging facial shapes [51], we also use dynamically smoothed facial parameters, which are tracked as a search in point distribution models (PDMs) [52].

In PDMs, each facial shape is approximately represented by a linear combination of basic variations as

$$x \approx \bar{x} + P\eta$$

where $\bar{x}$ is the mean facial shape and $P$ is a matrix

$$P = [p_1, p_2, \ldots, p_{2N}]$$

whose columns $p_n$, $n \in \{1, 2, \ldots, 2N\}$, denote all facial variation directions. $\eta$ is the PDM representation of the facial shape,

$$\eta = (b_1, b_2, \ldots, b_{2N})^T$$

where $b_n$ indicates how much variation is exhibited for each direction.

The method for calculating P and η has been reported in [52]. Under the assumption that the utterance produces a kind of random variation in the facial expression, the mean facial shape can be considered as the smoothed facial expression for utterance reduction. However, to capture the dynamic facial features over time, we segment the whole facial sequence of an utterance into several short periods. For each period, we compute the mean facial shape of the PDM and concatenate these mean facial shapes for emotion recognition (Figure 5).
Figure 5

Framework of utterance reduction by PDMs.
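A minimal sketch of the PDM-based smoothing is given below. It assumes that the variation directions are obtained by principal component analysis of the training shapes (computed here with an SVD) and that each period is a fixed number of consecutive frames; the function names, the period length, and the toy shapes are hypothetical.

```python
import numpy as np

def fit_pdm(shapes, n_modes):
    """Fit a point distribution model to shapes of size (M, 2N).

    Returns the mean shape and a matrix whose columns are the first
    n_modes variation directions.
    """
    mean_shape = shapes.mean(axis=0)
    centered = shapes - mean_shape
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_shape, vt[:n_modes].T

def pdm_coefficients(shape, mean_shape, directions):
    """Project one shape onto the variation directions."""
    return directions.T @ (shape - mean_shape)

def segment_mean_shapes(shapes, period):
    """Mean facial shape of every consecutive block of `period` frames."""
    return np.array([shapes[i:i + period].mean(axis=0)
                     for i in range(0, len(shapes), period)])

# Toy sequence: 6 frames of a 2-point shape (4 coordinates), hypothetical.
shapes = np.arange(24, dtype=float).reshape(6, 4)
mean_shape, directions = fit_pdm(shapes, n_modes=1)
eta = pdm_coefficients(shapes[0], mean_shape, directions)
reconstruction = mean_shape + directions @ eta
smoothed = segment_mean_shapes(shapes, period=3)
print(smoothed)
```

The rows of `smoothed` play the role of the concatenated per-period mean shapes used as visual input for recognition.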

Audio parameters

Based on our previous research on audio parameters for emotional speech classification [22] and speech-driven talking heads [68], we have found that prosody parameters, including F0, speed, energy, etc., have good "resolving power" for emotion expression, while some spectral parameters, including MFCCs [69], are strongly related to the utterance expression in the face. To simplify the study, we use only the MFCCs to reduce the utterance expression in the face.

Training database

Most current spontaneous emotion recognition systems use datasets collected in human-human conversation scenarios (e.g., [20, 60, 70-73]). In our study, the training database was collected from 30 subjects (15 males and 15 females) at the National Laboratory of Pattern Recognition (NLPR). In each session, one subject was asked to sit in a noise-reduced environment and talk to us for about 2 h with exaggerated expression during the conversation, like a drama actor or actress. Subjects were simply asked to display facial expressions and speak in a natural way. After all speakers had been recorded, the data were labeled piece by piece by three annotators with "happiness," "sadness," "anger," "fear," "surprise," and "neutral." We selected 400 sentences for each emotion (about 1.8 h of data) for training. The SPTK toolkit [69] and the AAM method [74] were used to extract the audio and visual features. For each emotion state, 90% of the data are used for training and the rest for testing. The Fused HMM inversion and GMM training are based on the whole training set of neutral videos (Figure 6).
Figure 6

Samples selected from NLPR Emotional Database.
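The per-emotion 90/10 partition can be sketched as follows. The article does not state how the sentences were assigned to the two sets, so the random shuffle, the seed, and the function name `split_per_emotion` are assumptions made for illustration.

```python
import random

def split_per_emotion(samples_by_emotion, train_fraction=0.9, seed=0):
    """Split the samples of each emotion into training and test subsets."""
    rng = random.Random(seed)
    train, test = {}, {}
    for emotion, samples in samples_by_emotion.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)  # random assignment is an assumption
        cut = int(round(train_fraction * len(shuffled)))
        train[emotion] = shuffled[:cut]
        test[emotion] = shuffled[cut:]
    return train, test

# Hypothetical sentence identifiers: 400 per emotion, as in the corpus.
emotions = ["happiness", "sadness", "anger", "fear", "surprise", "neutral"]
corpus = {e: [f"{e}_{i:03d}" for i in range(400)] for e in emotions}
train_set, test_set = split_per_emotion(corpus)
print(len(train_set["happiness"]), len(test_set["happiness"]))
```

Splitting within each emotion keeps the class proportions identical in the training and test sets.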

To make the study comparable with others, we also use the Belfast Naturalistic Database for testing. Some samples from the Belfast corpus are shown in Figure 7.
Figure 7

Spontaneous data selected from Belfast Naturalistic Database.

The results of emotion recognition based on our three utterance reduction methods

Figure 8 shows the emotion recognition results on NLPR's emotional corpus for the Fused HMM (MHMM) and for the utterance reduction methods that combine it with Fused HMM Inversion (MHMM + Inv), GMM (MHMM + GMM), and PDM (MHMM + PDM). The results show that the utterance reduction methods improve emotion recognition compared with the model without utterance reduction.
Figure 8

The comparison among MHMM, MHMM + Inv and MHMM + GMM methods based on NLPR Emotional Database.

The HMM-inversion-based method performs better than the GMM-based method. By using the HMM states or the centers of the visual clusters as the visual parameter outputs, the HMM inversion can simulate the detailed facial deformation during speech. In our previous study, we even used it in a speech-driven facial animation system [68]. Here, it gives better utterance reduction results than the GMM-based method, which may produce over-smoothed visual parameter outputs. The results for "neutral" and "fear" with the GMM-based method are even worse than those without utterance reduction. These results are consistent with [75], which demonstrated the over-smoothing problem of GMMs in conversion tasks.

The results obtained with only the PDM model and the MHMM are not as good as those of the two other utterance reduction methods in the article, although they are slightly better than those without utterance reduction. Because the facial expression of the utterance cannot really be considered a random visual variation, the average face shape based on the PDM oversimplifies the problem. In particular, if the same phonemes are repeated frequently within a short period, the PDM mean face shape still contains utterance information, which may easily be confused with some emotion states. This confusion occurs most often between the phoneme "a" and "surprise," or between "i" and "angry." Thus, the improvement from using only the PDM is small.

In the experiments, we also ran an interesting test by combining the PDM with HMM inversion and GMM. We first use HMM inversion to reduce the influence of the utterance after visual tracking. The PDM method is then used to further smooth the facial deformation. This is helpful because random variation always remains after we calculate the residual between the real input visual parameters and the outputs of the audio-visual conversion models. The results in Figure 8 confirm this: the recognition accuracies improve and the emotion confusions decrease.

Further tests were made on the Belfast Naturalistic Database. Because of the different emotional presentation styles, only four emotion states, "happiness," "sadness," "anger," and "surprise," are selected from the Belfast database for the experiment. The results are shown in Figure 9.
Figure 9

The comparison among MHMM, MHMM + Inv and MHMM + GMM methods based on Belfast Naturalistic Database.

The results show that the conclusions drawn from NLPR's emotional corpus also hold for the Belfast Naturalistic Database; however, most of the emotion recognition rates are lower than those on NLPR's corpus. The major reason is that NLPR's corpus is a posed corpus. The speakers were asked to sit in front of the camera and were not allowed to perform complicated actions, e.g., looking around or nodding, and the speech was recorded in a noise-reduced environment. The Belfast corpus, by contrast, is more spontaneous and contains more actions. The difficulty of facial expression tracking might be another reason for the lower emotion recognition rates on the Belfast database.

Comparisons with uni-modal methods

To compare with methods using uni-modal parameters, we performed experiments in which the parameters extracted from a single audio or visual channel were fed into an HMM-based emotion recognition approach. The test results on the NLPR database are shown in Figure 10.
Figure 10

The comparison among the emotion recognition results only based on audio or visual parameters.

Compared to the two methods using the uni-modal parameters individually, the results confirm that the compensation between the two channels in the bimodal method improves the performance of emotion recognition.

Comparison with other bimodal fusion methods

For further comparison with other studies, we reproduced four typical methods: the SVM method [13], the CART method [11], the Boosting method [53], and the rule-based decision fusion method [11]. The results on the NLPR emotional database are shown in Figure 11.
Figure 11

The comparison among MHMM + Inv, SVM, CART, Boosting, rule-based decision fusion methods.

Although the SVM and Boosting methods are good classifiers, their results are slightly poorer than those of our MHMM + Inv method (see Figure 11). This clearly demonstrates the importance of including time series in bimodal emotion recognition. By integrating the utterance reduction from audio to visual parameters in a reasonable way, a more effective emotion recognition system can be developed.

Figure 11 shows that the rule-based decision fusion algorithm is the worst of the emotion recognition methods tested on our data. Because different emotions may be expressed in different ways, a fixed modality-specific dominance determined by rules for all people or emotions is not sufficient.

Emotion recognition in paralinguistic expression

It is also interesting to examine the emotion recognition results when the subjects speak only one or two emotion-related paralinguistic words, e.g., "[A]", "[x ɤ]", "[ən]", etc. Among them, "[A]" might be used to express "surprise," "[x ɤ]" typically conveys a "happy" mood, and "[ən]" could be related to "angry." However, the expressions vary among subjects, which makes paralinguistic words a hard problem. Do our utterance-independent models also work in these cases?

We selected 121 emotional sentences containing these paralinguistic words from the NLPR database for testing; the results are shown in Figures 12 and 13. There are also situations in which the emotions are influenced by modal words, but these problems are beyond the scope of this article.
Figure 12

Comparison of emotion recognition for paralinguistic words among the MHMM + Inv, MHMM + GMM, MHMM + PDM, and MHMM + Inv + PDM methods.

Figure 13

Comparison of emotion recognition for paralinguistic words among the SVM, CART, Boosting, and rule-based decision fusion methods.

Unfortunately, Figures 12 and 13 show that the MHMM + Inv and MHMM + GMM methods do not give results as good as we expected. This indicates that emotions can sometimes hardly be separated from the speech content in paralinguistic expressions. However, the hybrid of MHMM and PDM gives the best results among all methods. In general, the fused model, which integrates the time sequences, still works better than the other fusion methods even for paralinguistic expressions, and the facial shapes smoothed with the PDM method consistently improve the recognition accuracy.

Tests of time delay

To use the methods in real applications, we measured the time delay of the major models and list the results in Table 1. The two indicators are the average emotion recognition rate over the whole database (mean accuracy) and the average running time per image (in milliseconds). The MHMM + Inv method achieves the best average emotion recognition rate, while its time consumption is also compared with that of the other methods.
Table 1

Time analysis of our systems compared with the SVM-based system




MHMM + Inv


MHMM + Inv + PDM

Mean accuracy






Time (ms)






Conclusion and future study

This article presented a framework that uses an MHMM for bimodal emotion recognition. Six emotions are classified by integrating the audio and visual input channels in communication. Within this framework, the article introduces an utterance reduction method, based on the Fused HMM inversion model, to improve the quality of the visual parameters used for emotion recognition. To reduce the computational complexity, the inversion model can be further simplified to a GMM. The PDM is also introduced to smooth the visual tracking results.

We carried out several experiments to evaluate our methods. The final results show that the hybrid method consisting of MHMM, HMM inversion, and PDM works best in most cases, except for some emotions expressed by paralinguistic words. For paralinguistic expressions, the method combining MHMM and PDM works best. Compared with previous bimodal emotion recognition methods, e.g., SVM, CART, Boosting, and rule-based decision fusion, our methods give better emotion recognition results.

As the current research still focuses on the six basic emotions, more databases with spontaneous expressions will be recorded in the future. Fused emotions, e.g., "painful," will be added, and some datasets will be collected directly from TV. Additionally, we will pay more attention to classifying more paralinguistic information in spontaneous conversation.



FDA: Fisher's linear discriminant analysis

GMM: Gaussian Mixture Model

HCIs: human-computer interactions

ICA: independent component analysis

MHMM: Multistream Hidden Markov Model

PDMs: point distribution models

PCA: principal component analysis.



This study was supported by the National Natural Science Foundation of China (grants 60575032, 60873160, and 90820303) and the 863 Program (Grant 2009AA01Z320).

Authors’ Affiliations

National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences


  1. Pantic M, Rothkrantz LJM: Toward an affect-sensitive multimodal human-computer interaction. Proc IEEE 2003, 91(9):1370-1390. 10.1109/JPROC.2003.817122
  2. Pantic M, Rothkrantz LJM: Automatic analysis of facial expressions: the state of the art. IEEE Trans PAMI 2000, 22(12):1424-1445. 10.1109/34.895976
  3. Zeng Z, Tu J, Liu M, Huang TS, Pianfetti B, Roth D, Levinson S: Audio visual affect recognition. IEEE Trans Multimedia 2007, 9(2):424-428.
  4. Zeng Z, Tu J, Pianfetti P, Liu M, Zhang T, Zhang Z, Huang TS, Levinson S: Audio visual affect recognition through multi stream fused HMM for HCI. Proceedings of the International Conference on Computer Vision and Pattern Recognition 2005, 967-972.
  5. Zeng Z, Pantic M, Roisman GI, Huang TS: A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans PAMI 2009, 31(1):39-58.
  6. Cowie R, Douglas-Cowie E: Emotion recognition in human-computer interaction. IEEE Signal Process Mag 2001, 33-80.
  7. Chen LS, Huang TS, Miyasato T, Nakatsu R: Multimodal human emotion/expression recognition. Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition 1998, 366-371.
  8. Fasel B, Luttin J: Automatic facial expression analysis: a survey. Pattern Recog 2003, 36(1):259-275. 10.1016/S0031-3203(02)00052-3
  9. Song M, Bu J, Chen C, Li N: Audio-visual based emotion recognition: a new approach. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2004, 1020-1025.
  10. Busso C, Deng Z, Yildirim S, et al.: Analysis of emotion recognition using facial expressions, speech and multimodal information. Proceedings of the 6th International Conference on Multimedia Interfaces 2004, 205-211.
  11. Silva D, Miyasato T, Nakatsu R: Facial emotion recognition using multi-modal information. Proceedings of the International Conference on Information and Communications Security 1997, 397-401.
  12. Liyanage C, Silva D, Pei CN: Bimodal emotion recognition. Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition 2000, 332-335.
  13. Chen CY, Huang YK, Cook P: Visual/acoustic emotion recognition. Proceedings of the International Conference on Multimedia and Expo 2005, 1468-1471.
  14. Balomenos T, Raouzaiou A, Ioannou S, Drosopoulos A, Karpouzis K, Kollias S: Emotion analysis in man-machine interaction systems. LNCS 2005, 3361: 318-328.
  15. Jaimes A, Sebe N: Multimodal human computer interaction: a survey. Proceedings of the Workshop on Human Computer Interaction in conjunction with ICCV 2005.
  16. Ji Q, Lan P, Looney C: A probabilistic framework for modeling and real-time monitoring human fatigue. IEEE Trans SMC A 2006, 36(5):862-875.
  17. Ashraf AB, Lucey S, Cohn JF, Chen T, Ambadar Z, Prkachin KM, Solomon PE: The painful face: pain expression recognition using active appearance models. Proceedings of the International Conference on Multimodal Interfaces 2007, 9-14.
  18. Kapoor A, Burleson W, Picard RW: Automatic prediction of frustration. Proc Int J Hum Comput Stud 2007, 65(8):724-736. 10.1016/j.ijhcs.2007.02.003
  19. Karpouzis K, Caridakis G, Kessous L, Amir N, Raouzaiou A, Malatesta L, Kollias S: Modeling naturalistic affective states via facial, vocal, and bodily expression recognition, Artificial Intelligence for Human Computing, Lecture notes in artificial intelligence. Volume 4451. Springer, Berlin; 2007:91-112.
  20. Littlewort GC, Bartlett MS, Lee K: Faces of pain: automated measurement of spontaneous facial expressions of genuine and posed pain. Proceedings of the International Conference on Multimodal Interfaces 2007, 15-21.
  21. Dellaert F, Polzin T, Waibel A: Recognizing emotion in speech. In Proceedings of the International Conference on Spoken Language Processing. Philadelphia, PA; 1996:1970-1973.
  22. Tao JH, Kang YG: Features importance analysis for emotion speech classification. Proceedings of the 1st International Conference on Affective Computing and Intelligence Interaction 2005, 449-457.
  23. Lee CM, Yildirim S, Bulut M, Kazemzadeh A, Busso C, Deng ZG, Lee S, Narayanan S: Emotion recognition based on phoneme classes. Proceedings of the International Conference on Spoken Language Processing 2004, 889-892.
  24. Schuller B, Rigoll G, Lang M: Hidden Markov model based speech emotion recognition. Proc ICASSP 2003, 2: 1-4.
  25. Roy D, Pentland A: Automatic spoken affect classification and analysis. Proceedings of the 2nd International Conference on Automatic Face and Gesture Recognition 1996, 363-367.
  26. Campbell N: Perception of affect in speech - towards an automatic processing of paralinguistic information in spoken conversation. In Proceedings of the International Conference on Spoken Language Processing. Jeju; 2004:881-884.
  27. Gobl C, Chasaide AN: The role of voice quality in communicating emotion, mood and attitude. Speech Commun 2003, 40: 189-212. 10.1016/S0167-6393(02)00082-1
  28. Tato R, Santos R, Kompe R, Pardo JM: Emotional space improves emotion recognition. In Proceedings of the International Conference on Spoken Language Processing. Denver, CO; 2002:2029-2032.
  29. Mehrabian A: Communication without words. Psychol Today 1968, 2(4):53-56.
  30. Chen LS: Joint processing of audio-visual information for the recognition of emotional expressions in human-computer interaction, PhD thesis, UIUC. 2000.
  31. Chen L, Huang TS, Miyasato T, Nakatsu R: Multimodal human emotion/expression recognition. Proceedings of the International Conference on Automatic Face and Gesture Recognition 1998, 396-401.
  32. Stein BE, Meredith MA: The Merging of the Senses. MIT Press, Cambridge, MA; 1993.
  33. Summerfield Q: Some preliminaries to a comprehensive account of audio-visual speech perception. In Hearing by Eye. Edited by: B Dodd, R Campbell. Lawrence Erlbaum Associates, Hillsdale, NJ; 1987:3-51.
  34. Robert-Ribes J, Schwartz JL, Escudier P: A comparison of models for fusion of the auditory and visual sensors in speech perception. Artif Intell Rev 1995, 9(4-5):323-346. 10.1007/BF00849043
  35. Caridakis G, Malatesta L, Kessous L, Amir N, Raouzaiou A, Karpouzis K: Modeling naturalistic affective states via facial and vocal expression recognition. Proceedings of the International Conference on Multimodal Interfaces 2006, 146-154.
  36. Viola P, Jones M: Robust real-time face detection. Int J Comput Vision 2004, 57(2):137-154.
  37. Bartlett MS, Littlewort G, Fasel I, Movellan JR: Real time face detection and facial expression recognition: development and application to human computer interaction. Proceedings of the CVPR Workshop on Computer Vision and Pattern Recognition for Human-Computer Interaction 2003, 53.
  38. Michel P, Kaliouby RE: Real time facial expression recognition in video using support vector machines. Proceedings of the International Conference on Multimodal Interfaces 2003.
  39. Wang Y, Zhou H, Wu B, Huang C: Real time facial expression recognition with AdaBoost. Proceedings of the International Conference on Pattern Recognition 2004, 926-929.
  40. Huang TS, Chen L, Tao H: Bimodal emotion recognition by man and machine. In Proceedings of ATR Workshop on Virtual Communication Environments. Japan; 1998.
  41. Tian Y, Kanade T, Cohn JF: Recognizing action units for facial expression analysis. IEEE Trans PAMI 2001, 23(2):97-115. 10.1109/34.908962
  42. Essa IA, Pentland AP: Facial expression recognition using a dynamic model and motion energy. Proceedings of the 5th International Conference on Computer Vision 1995, 360-367.
  43. Bartlett MS, Littlewort G, Braathen B, Sejnowski TJ, Movellan JR: A prototype for automatic recognition of spontaneous facial actions. Adv Neural Inf Process Syst 2003, 15: 1271-1278.
  44. Bartlett MS, Littlewort G, Frank M, Lainscsek C, Fasel I, Movellan J: Fully automatic facial action recognition in spontaneous behavior. Proceedings of the International Conference on Automatic Face and Gesture Recognition 2006, 223-230.
  45. Devillers L, Vasilescu I: Real-life emotions detection with lexical and paralinguistic cues on human-human call center dialogs. Proceedings of the International Conference on Spoken Language Processing 2006, 801-804.
  46. Devillers L, Vidrascu L, Lamel L: Challenges in real-life emotion annotation and machine learning based detection. Neural Netw 2005, 18: 407-422. 10.1016/j.neunet.2005.03.007
  47. Hoch S, Althoff F, McGlaun G, Rigoll G: Bimodal fusion of emotional data in an automotive environment. Proceedings of the ICASSP 2005, 1085-1088.
  48. Pantic M, Patras I: Dynamics of facial expression: recognition of facial actions and their temporal segments from face profile image sequences. IEEE Trans SMC B 2006, 36(2):433-449.
  49. Pantic M, Rothkrantz LJM: Facial action recognition for facial expression analysis from static face images. IEEE Trans SMC B 2004, 34(3):1449-1461.
  50. Zeng Z, Tu J, Liu M, Zhang T, Rizzolo N, Zhang Z, Huang TS, Roth D, Levinson S: Bimodal HCI-related affect recognition. Proceedings of the International Conference on Multimodal Interfaces 2004, 137-143.
  51. Zeng Z, Tu J, Liu M, Huang TS: Multi-stream confidence analysis for audio-visual affect recognition. Proceedings of the International Conference on Affective Computing and Intelligent Interaction 2005.
  52. Huang CL, Huang YM: Facial expression recognition using model-based feature extraction and action parameters classification. J Visual Commun Image Represent 1997, 8(3):278-290. 10.1006/jvci.1997.0359
  53. Schapire RE, Singer Y: Improved boosting algorithms using confidence-rated prediction. Mach Learn 1999, 37: 297-336. 10.1023/A:1007614523901
  54. Choi K, Hwang JN: Baum-Welch HMM inversion for reliable audio-to-visual conversion. Proceedings of the IEEE International Workshop Multimedia Signal Processing 1999, 175-180.
  55. Fu SL, Gutierrez-Osuna R, Esposito A, Kakumanu PK, Garcia ON: Audio/visual mapping with cross-modal hidden Markov models. IEEE Trans Multimedia 2005, 7(2):243-252.
  56. Xie L, Liu ZQ: Speech animation using coupled hidden Markov models. Proceedings of the 18th International Conference on Pattern Recognition 2006, 1128-1131.
  57. Moon SY, Hwang JN: Robust speech recognition based on joint model and feature space optimization of hidden Markov model. IEEE Trans Neural Netw 1997, 8(2):194-204. 10.1109/72.557656
  58. Pan H, Levinson S, Huang TS, Liang ZP: A fused hidden Markov model with application to bimodal speech processing. IEEE Trans Signal Process 2004, 52(3):573-581. 10.1109/TSP.2003.822353
  59. Black MJ, Yacoob Y: Recognizing facial expressions in image sequences using local parameterized models of image motion. Proc Int J Comput Vision 1997, 25(1):23-48. 10.1023/A:1007977618277
  60. Bartlett MS, Littlewort G, Frank M, Lainscsek C, Fasel I, Movellan J: Recognizing facial expression: machine learning and application to spontaneous behavior. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition 2005, 568-573.
  61. Edwards GJ, Cootes TF, Taylor CJ: Face recognition using active appearance models. Proceedings of the European Conference on Computer Vision 1998, 2: 681-695.
  62. Andrew A, Calder J, Burton M: A principal component analysis of facial expression. Vision Res 2001, 41: 179-208.
  63. Tipping M, Bishop C: Probabilistic principal component analysis, Technical Report NCRG/97/010, Neural Computing Research Group. Aston University, Birmingham, UK; 1997.
  64. The Birmingham Cognition and Affect Project[]
  65. Lyons MJ, Akamatsu S, Kamachi M, Gyoba J: Coding facial expressions with Gabor Wavelets. Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition 1998, 200-205.
  66. Whitehill J, Omlin CW: Haar features for FACS AU recognition. Proceedings of the International Conference on Automatic Face and Gesture Recognition 2006, 217-222.
  67. Anderson K, McOwan PW: A real-time automated system for recognition of human facial expressions. IEEE Trans SMC B 2006, 36(1):96-105.
  68. Tao JH, Xin L, Yin PR: Realistic visual speech synthesis based on hybrid concatenation method. IEEE Trans ASLP 2009, 17(3):469-477.
  69. []
  70. Cohn JF, Reed LI, Ambadar Z, Xiao J, Moriyama T: Automatic analysis and recognition of brow actions and head motion in spontaneous facial behavior. Proceedings of the International Conference on Systems, Man & Cybernetics 2004, 610-616.
  71. Cohn JF, Schmidt KL: The timing of facial motion in posed and spontaneous smiles. Int J Wavelets Multiresolution Inf Process 2004, 2: 1-12. 10.1142/S0219691304000317
  72. Cowie R, Douglas-Cowie E, Cox C: Beyond emotion archetypes: databases for emotion modeling using neural networks. Neural Netw 2005, 18: 371-388. 10.1016/j.neunet.2005.03.002
  73. Douglas-Cowie E, Campbell N, Cowie R, Roach P: Emotional speech: towards a new generation of database. Speech Commun 2003, 40: 33-60. 10.1016/S0167-6393(02)00070-5
  74. Lucey S, Ashraf AB, Cohn JF: Investigating spontaneous facial action recognition through AAM representations of the face, in Face Recognition. Edited by: K Delac, M Grgic. I-Tech Education and Publishing, Vienna, Austria; 2007:275-286.
  75. Kang YG, Shuang ZW, Tao JH: A hybrid GMM and codebook mapping method for spectral conversion. Proceedings of the 1st International Conference on Affective Computing and Intelligent Interaction 2005, 303-310.


© Tao et al; licensee Springer. 2011

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.