- Open Access
Multi-pose lipreading and audio-visual speech recognition
© Estellers and Thiran; licensee Springer. 2012
- Received: 3 October 2011
- Accepted: 29 February 2012
- Published: 29 February 2012
In this article, we study the adaptation of visual and audio-visual speech recognition systems to non-ideal visual conditions. We focus on overcoming the effects of a changing pose of the speaker, a problem encountered in natural situations where the speaker moves freely and does not keep a frontal pose with relation to the camera. To handle these situations, we introduce a pose normalization block in a standard system and generate virtual frontal views from non-frontal images. The proposed method is inspired by pose-invariant face recognition and relies on linear regression to find an approximate mapping between images from different poses. We integrate the proposed pose normalization block at different stages of the speech recognition system and quantify the loss of performance related to pose changes and pose normalization techniques. In audio-visual experiments we also analyze the integration of the audio and visual streams. We show that an audio-visual system should account for non-frontal poses and normalization techniques in terms of the weight assigned to the visual stream in the classifier.
- Linear Discriminant Analysis
- Discrete Cosine Transform
- Speech Recognition
- Discrete Cosine Transform Coefficient
- Visual Stream
The performance of automatic speech recognition (ASR) systems degrades heavily in the presence of noise, compromising their use in real world scenarios. In these circumstances, ASR systems can benefit from the use of other sources of information complementary to the audio signal and yet related to speech. Visual speech constitutes such a source of information. Mimicking human lipreading, visual ASR systems are designed to recognize speech from images and videos of the speaker's mouth. This fact gives rise to audio-visual automatic speech recognition (AV-ASR), combining the audio and visual modalities of speech to improve the performance of audio-only ASR, especially in presence of noise [1, 2]. In these situations, we cannot trust the corrupted audio signal and must rely on the visual modality of speech to guide recognition. The major challenges that AV-ASR has to face are, therefore, the definition of reliable visual features for speech recognition and the integration of the audio and visual cues when taking decisions about the speech classes.
A general framework for AV-ASR [1, 3] has been developed during the last years, but for a practical deployment the systems still lack robustness against non-ideal working conditions. Research has particularly neglected the variability of the visual modality subject to real scenarios, i.e., non-uniform lighting and non-frontal poses caused by natural movements of the speaker. The first studies on AV-ASR with realistic conditions [4, 5] applied directly the systems developed for ideal visual conditions, obtaining poor lipreading performance and failing to exploit the visual modality in the multi-modal systems. These studies pointed out the necessity of new visual feature extraction methods robust to illumination and pose changes. In particular, the topic of pose-invariant AV-ASR is central for the future deployment of this technology in genuine AV-ASR applications, e.g., smart-rooms or in-car vehicle systems. In these scenarios the audio modality is degraded by noise and the inclusion of visual cues can improve recognition. However, in natural situations the speaker moves freely, a frontal view to the camera is rarely kept and pose-invariant AV-ASR is necessary. It can be considered, then, as the first step in the adaptation of laboratory AV-ASR systems to the conditions expected in real applications.
In lipreading systems, the variations of the mouth's appearance caused by different poses are more significant than those caused by different speech classes and, therefore, recognition degrades dramatically when non-frontal poses are matched against frontal visual models. It is necessary to develop an effective framework for pose invariant lipreading. In particular, we are interested in pose-invariant methods which can easily be incorporated in the AV-ASR systems developed so far for ideal frontal conditions and reduce the train/test mismatch derived from pose changes. Techniques to adapt ASR systems to working conditions have already been developed for the audio modality (Cepstral mean subtraction  and RASTA processing ), but equivalent methods are necessary for the visual modality. In fact, the same problem exists in face recognition and several methods proposed for pose-invariant face recognition [8–11] can be applied to the lipreading problem. Motivated by these studies and the potential of AV-ASR in human-computer interfaces , we propose to introduce a pose normalization step in a system designed for frontal views, i.e., we generate virtual frontal views from the non-frontal images and rely on the existing frontal visual models to recognize speech. The pose normalization block has also an effect on the audio-visual fusion strategy, where the weight associated to the visual stream in the speech classifier should reflect its reliability. We can expect the virtual frontal features generated by pose normalization to be less reliable than the features extracted directly from frontal images. Therefore, the weight assigned to the visual stream on the audio-visual classifier should also account for the pose normalization.
Previous work on this topic is limited to Lucey et al. [13, 14], who projected the final visual speech features of complete profile images to a frontal viewpoint with a linear transform. However, the authors do not justify the use of a linear transform between the visual speech features of different poses, are limited to the extreme cases of completely frontal and profile views, and their audio-visual experiments are not conclusive. Compared to these studies, we introduce other projection techniques applied in face recognition to the lipreading task and discuss and justify their use in the different feature spaces involved in the lipreading system: the images themselves, a smooth and compact representation of the images in the frequency domain, or the final features used in the classifier. We also analyze the effects of pose normalization on the audio-visual fusion strategy in terms of the weight associated to the visual stream. Lucey et al. propose an audio-visual system based on the concatenation of audio and visual features in a single stream, which is later processed in the speech classifier, neglecting the multi-modal nature of speech and the possibility of assigning different weights to the audio and visual streams. The main contributions of this study, partially presented in an earlier publication, are the adaptation of pose-invariant methods used in face recognition to the lipreading system, the study of linear regression for pose normalization in different feature spaces, and the study of its effects on the weight associated to the visual stream in the classifier. Our experiments are the first comprehensive experimental validation of pose normalization in visual and audio-visual speech recognition, analyzing the adaptation of laboratory AV-ASR systems to the conditions expected in real applications.
The article is organized as follows. First, we review the structure of an AV-ASR system and explain how pose invariance is introduced. We then present the techniques adopted in face recognition to obtain a multi-pose system, adapt some of them to the lipreading problem and study the different feature spaces where the pose normalization can take place. Finally, we report experimental results for visual and audio-visual ASR systems and present the conclusions of our study.
2.1 Visual front-end
The first task of the visual front-end is to identify and extract a normalized region of interest (ROI), which is usually a rectangle centered on the mouth of the speaker [1, 21, 22]. The normalization of the ROI requires a robust method to detect the face and extract centered, aligned, and scaled images of the mouth for each sequence to make recognition invariant to small movements of the speaker. This preprocessing step is not part of the lipreading system and it is usually included in the face detection block because the position of the mouth, its size and alignment are determined in relation to other face features (the eyes, the tip of the nose). However, an accurate extraction of the mouth ROI is critical in lipreading systems and gave rise to the term front-end effect to refer to the effects of the ROI extraction on the performance of the speech recognition system. In that sense, the use of markers or special lipstick on the speaker avoids the use of more complicated mouth tracking techniques to alleviate the front-end effect.
Two main types of features are used for visual speech recognition: appearance-based features extracted directly from the pixels of the ROI [1, 21, 22] and shape-based features extracted from the contour of the speaker's lips. Several studies [24, 25] report that appearance-based features outperform shape-based ones and are, therefore, the features commonly chosen in lipreading and AV-ASR systems. In this approach, the pixels of the ROI themselves are used as features and, consequently, the ROI needs to be located with very good precision and the front-end effect carefully considered. The dimensionality of the obtained feature vector is too large to allow an accurate statistical modeling in the classifiers and dimensionality reduction techniques are necessary. The most popular of these techniques are image compression transforms, such as principal component analysis [21, 22] or the discrete cosine transform (DCT). They reduce the size of the images by eliminating redundancy, but there is no guarantee that they are appropriate for the classification task. Linear discriminant analysis (LDA) is a transform capturing relevant information for classification and is thus commonly used in AV-ASR. Other supervised transforms based on ideas from information theory have also been proposed for AV-ASR [29–32], but LDA is widely used because it is simple (linear), gives good results and can easily incorporate dynamic information. Dynamic features measure the visual motion during speech and are more robust to skin color or illumination conditions than the original features. This motion can be represented either by delta images or by transforms measuring the inter-frame change of the features, e.g., inter-frame LDA.
2.2 Audio-visual integration and classification
Audio-visual integration can be grouped into two categories: feature and decision fusion [1, 3]. In the first case, the audio and visual features are combined by projecting them onto an audio-visual feature space, where traditional single-stream classifiers are used [33–36]. Decision fusion, in turn, processes the streams separately and, at a certain level, combines the outputs of each single-modality classifier. Decision fusion allows more flexibility for modality integration and is the technique usually adopted in AV-ASR systems [1, 3] because it allows the contribution of each modality to the classification task to be weighted.
where $x_A$, $x_V$ are the audio and visual features and $q$ the class variable. This weighting scheme is naturally introduced in the HMM classifiers by means of multi-stream HMMs. In multi-stream HMMs, independent statistical models like Gaussian mixtures are used to compute the likelihood of each stream independently, and these likelihoods are then combined according to the integration technique. In early integration the streams are assumed to be state synchronous and the likelihoods are combined at state level as indicated by Equation (1). Late integration, in turn, combines the likelihoods at utterance level, while in intermediate integration the combination takes place at intermediate points of the utterance. The weighting scheme, nonetheless, remains the same and early integration is generally adopted. A common restriction is that the weights $\lambda_A$, $\lambda_V$ sum up to one, which ensures that the relation between the emission likelihoods and transition probabilities is kept the same as in single-stream HMMs.
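The state-level weighting described above can be sketched as follows. This is a minimal illustration, assuming the usual multi-stream form in which the combined log-likelihood of a state is the weighted sum of the per-stream log-likelihoods, with $\lambda_A + \lambda_V = 1$. The function names and the toy scores are hypothetical and not taken from the article.

```python
def combined_log_likelihood(log_p_audio, log_p_video, lambda_audio):
    """Weighted combination of per-stream log-likelihoods for one HMM state.

    Assumed form (not reproduced from the article):
    log p(x_A, x_V | q) = lambda_A * log p(x_A | q) + lambda_V * log p(x_V | q)
    with lambda_A + lambda_V = 1.
    """
    if not 0.0 <= lambda_audio <= 1.0:
        raise ValueError("lambda_audio must lie in [0, 1]")
    lambda_video = 1.0 - lambda_audio
    return lambda_audio * log_p_audio + lambda_video * log_p_video


def best_state(audio_scores, video_scores, lambda_audio):
    """Return the index of the state with the highest combined score."""
    combined = [
        combined_log_likelihood(a, v, lambda_audio)
        for a, v in zip(audio_scores, video_scores)
    ]
    return max(range(len(combined)), key=combined.__getitem__)
```

With `lambda_audio = 1.0` the decision depends only on the audio scores, and with `lambda_audio = 0.0` only on the visual ones; intermediate values trade one stream against the other.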
2.3 Our lipreading system
Our speech recognition system is similar to the state-of-the-art system presented in [1, 3], which we take as a model and extend with the pose normalization. In the following, we describe our system, giving more details for the blocks which play a role in the pose normalization task.
In order to minimize the front-end effect, we work with sequences where the speaker wears blue lipstick and we can accurately track the mouth by color information. Our work focuses on the adaptation of the visual features for pose normalization and the use of lipstick sequences allows us to decouple the performance of the face tracker (optimized for frontal poses and whose performance depends on the head pose and illumination) from the extraction of accurate visual features, which is critical in the case of appearance-based features. In our sequences the mouth ROI is detected in the hue domain and normalized mouths of 64 × 64 pixels are extracted for the definition of the visual features.
In the second block of our system, state-of-the-art audio and visual features are extracted. In terms of audio features, we adopt Mel Frequency Cepstral Coefficients (MFCC), together with their first and second time derivatives, with their means removed by cepstral mean subtraction. For the visual counterpart, we choose appearance-based features and the following sequence of dimensionality reduction transforms. From the original ROI images $x_I$ (frontal) and $y_I$ (lateral), we extract a compact low-dimensional representation of the image space by retaining only the first 140 DCT coefficients in zig-zag order in $x_F$, $y_F$. To normalize the features for different speakers and sequences, we remove their mean value over the sequence, a technique equivalent to the cepstral mean subtraction applied to the audio features. Finally, LDA transforms are applied to further reduce the dimensionality of the features and adapt them to the subsequent HMM classifier. First, intra-frame LDA reduces the dimensionality of the features to 40 while retaining information about the speech classes of interest (phonemes). Afterwards, inter-frame LDA incorporates dynamic information by concatenating 5 intra-frame LDA vectors over adjacent frames and projecting them via LDA to the final features $x_L$, $y_L$, which have dimension 39 and will be modeled by the HMMs.
The classifiers used are single- and weighted multi-stream HMMs. In the case of AV-ASR, the use of weighted multi-stream HMMs incorporates the audio-visual integration into the classification task, which is done at state level with the weights leading to the best performance on an evaluation dataset.
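The first stage of the visual feature extraction (2-D DCT, zig-zag selection of the lowest 140 coefficients, and mean removal over the sequence) can be sketched as below. This is an illustrative sketch rather than the authors' implementation: it assumes NumPy, an orthonormal DCT-II, and the JPEG-style zig-zag scan; the function names are hypothetical and the LDA stages are not included.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] = np.sqrt(1.0 / n)
    return d

def zigzag_indices(n):
    """(row, col) pairs of an n x n block in zig-zag (low-to-high frequency) order."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )

def dct_features(rois, n_coeffs=140):
    """Map a sequence of square ROI images to mean-normalized DCT features.

    rois: array of shape (T, N, N), one mouth image per frame.
    Returns an array of shape (T, n_coeffs).
    """
    rois = np.asarray(rois, dtype=float)
    n = rois.shape[1]
    d = dct_matrix(n)
    coeffs = d @ rois @ d.T                          # 2-D DCT of every frame
    order = np.array(zigzag_indices(n)[:n_coeffs])   # lowest-frequency positions
    feats = coeffs[:, order[:, 0], order[:, 1]]      # keep those coefficients
    return feats - feats.mean(axis=0)                # remove the per-sequence mean
```

For 64 × 64 mouth images, `dct_features(rois)` returns one 140-dimensional vector per frame, which would then feed the intra-frame LDA.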
In this section, we present the techniques adopted in face recognition to obtain a multi-pose system, justify the choice of linear regression (LR) as the technique best suited to our AV-ASR system and study the different feature spaces where the pose normalization can take place.
3.1 From face recognition to lipreading
The techniques proposed for pose-invariant face recognition can be classified into viewpoint transform and coefficient-based techniques . Coefficient based techniques estimate the face under all viewpoints given a single view, either by defining pose-invariant features known as "face lightfields"  or estimating the parameters of 3-D face models . In the viewpoint transform approach the face recognition system is designed and optimized for the dominant view (frontal) and a preprocessing step transforms the input images corresponding to undesired poses to the desired view . The same two strategies can be applied to the lipreading task. We adopt the viewpoint-transform approach because lipreading predominantly takes place with frontal views and coefficient-based techniques would suffer from over-generalization , i.e., only a small fraction of the time the system would benefit from the definition of pose-invariant features, while most of the time it would be outperformed by a system optimized for frontal views.
In the viewpoint transform approach there are two strategies to generate virtual frontal views from non-frontal poses: 3-D models [9, 10] and learning-based methods [46, 47]. In the first case, a 3-D morphable model of the face must be built from 2-D images before virtual views from any viewpoint can be generated with graphic rendering techniques. It is computationally expensive and time consuming to match the input 2-D image with the 3-D model and, therefore, that technique is not aimed to the real-world applications of AV-ASR. To overcome that issue, learning-based approaches learn how to estimate virtual views directly in the 2-D domain, either via a 2-D face model or from the images themselves. Working directly with the images, a simple and yet effective way to project the images from lateral to frontal views is based on linear regression [8, 11]. Several reasons justify the use of the images themselves instead of introducing a mouth model to estimate the virtual views of the mouth. First, most lipreading systems use directly images of the mouth as visual features and do not require mouth or lip models, which we do not want to introduce only for the pose normalization . Second, the visual features extracted from the images themselves are more informative than features based on lip-modeling, as they include additional information about other speech articulators such as teeth, tongue, and jaws also useful in human speech perception . Besides, appearance based features directly obtained from the image pixels are generic and can be applied to mouths of any viewpoint compared to lip models which have to be developed for any possible view. Finally, these pose normalization techniques involve transforms that can be quickly computed and allow real-time implementations required in most AV-ASR applications.
3.2 Linear regression in multi-pose face recognition
Given training matrices $X$ and $Y$ whose columns are paired frontal and lateral samples, the LR matrix $W$ is found by minimizing the cost function $Q(W) = \|X - WY\|_F^2 + \beta \|W\|_F^2$, which measures the mean square error on the training dataset and might include a Tikhonov regularization term (weighted by the parameter $\beta$) that introduces additional smoothness properties and leads to a ridge regression. The well-known solution to the LR is given by $W = XY^T (YY^T + \beta I)^{-1}$, with $I$ the identity matrix.
Linear regression is theoretically justified when images of the same object but from different poses are subject to the same illumination. In the case of face recognition, Chai et al. show that if the face images are well aligned, there exists an approximate linear mapping $x_I = W_I y_I$ between images of one person captured under variable poses $x_I$ and $y_I$, which is consistent across different people. Unfortunately, in real-world systems face images are only coarsely aligned, occlusions derived from the 3-D nature of faces affect the different views, and the linear assumption no longer holds. To address this, the authors propose the use of a piecewise linear function to approximate the non-linear mapping existing between images from different poses. The main idea of the proposed method lies in the following intuitive observation: partitioning the whole face into multiple patches reduces the face misalignment and variability between different persons, and the transformation associated to pose changes can be better approximated with a linear map for the patches than for the whole image. That technique is called local LR (LLR), in opposition to the previous implementation of LR, which considered the images as a whole and is therefore designated as global LR (GLR).
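The closed-form solution stated above can be computed directly. The following is a minimal sketch, assuming NumPy and that the columns of `x_frontal` and `y_lateral` are paired training samples; the function name is hypothetical.

```python
import numpy as np

def fit_global_lr(x_frontal, y_lateral, beta=0.0):
    """Closed-form ridge solution W = X Y^T (Y Y^T + beta I)^-1.

    x_frontal: (d_x, M) matrix, one frontal training sample per column.
    y_lateral: (d_y, M) matrix with the paired lateral samples.
    """
    x = np.asarray(x_frontal, dtype=float)
    y = np.asarray(y_lateral, dtype=float)
    gram = y @ y.T + beta * np.eye(y.shape[0])
    # Solve W @ gram = X Y^T instead of forming the inverse explicitly.
    return np.linalg.solve(gram.T, (x @ y.T).T).T
```

A virtual frontal vector is then obtained as `W @ y` for a lateral vector `y`. Solving the linear system rather than inverting the matrix is a numerical choice of this sketch; with few training samples and `beta = 0` the system can be singular, which is one practical reason for the regularization term.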
3.3 Linear regression and lipreading
In our study, the LR techniques are applied considering $X$ and $Y$ to be either directly the images from frontal and lateral views $X_I$, $Y_I$ or the visual features extracted from them at different stages of the feature extraction process. A first set of features $X_F$, $Y_F$ is designed to smooth the images and obtain a more compact and low-dimensional representation in the frequency domain. Afterwards, those features are transformed and their dimensionality again reduced in order to contain only information relevant for speech classification, leading to the vectors $X_L$, $Y_L$ used in the subsequent speech classifier.
Consequently, if all DCT coefficients are selected and $S = I$, the DCT coefficients obtained from $W_I$ by projecting images $y_I$ and the $W_F$ projected coefficients from $y_F$ coincide. The linear relationship, however, no longer holds when we consider only a reduced set of DCT coefficients $x_F = SDx_I$, and the transform $W_F$ found with the LR method should be considered an approximation of the non-linear mapping existing between any pair of reduced DCT coefficients $x_F$ and $y_F$. In that case, selecting the DCT features corresponding to lower frequencies to compute the transform $W_F$ corresponds to smoothing the images prior to the projection and estimating a linear transform that forces the projected virtual image to be smooth by having only low-frequency components. Moreover, the lower dimensionality of $X_F$, $Y_F$ compared to $X_I$, $Y_I$ improves the accuracy of the LR matrix estimation: by the curse of dimensionality, the number of samples necessary to estimate a vector parameter grows exponentially with its dimensionality. In that sense, the effect of the regularization parameter $\beta$ is more important in the estimation of $W_I$ than of $W_F$, as imposing smoothness reduces the number of required samples.
It is important to note that the proposed LLR technique on the DCT features gives a different meaning to the patches. If we choose the patches to be adjacent blocks of the DCT coefficients, we are considering different transforms for different frequency components of the image. Consequently, we use an equal partition of the selected DCT coefficients to define the frontal and associated lateral patches in the LLR transform. In that case, LLR approximates the existing non-linear mapping between frequency features $X_F$ and $Y_F$ by distinct linear functions between the different frequency bands of the images.
it is easy to prove that if $v$ is an eigenvector of $R_y$ with eigenvalue $\lambda_v$, then $W^{-1}v$ is an eigenvector of $R_x$ with the same eigenvalue and, consequently, there is also a linear mapping between the LDA projections associated to the frontal and lateral views. Two extra considerations have to be taken into account for the projection of the $X_L$ and $Y_L$ features. First, $X_L$ and $Y_L$ are obtained by applying LDA to the reduced DCT features $X_F$ and $Y_F$, which means that the projection by $W_L$ is only a linear approximation of the real mapping between the LDA features, in the same way that $W_F$ linearly approximates the relation between $X_F$ and $Y_F$. Second, two stages of LDA are necessary to obtain $X_L$ and $Y_L$ from $X_F$ and $Y_F$: first intra-frame LDA on the DCT features and then an inter-frame LDA on concatenated adjacent vectors extracted from the intra-frame LDA. In the intra-frame LDA, $x = x_F$, $y = y_F$, and $W = W_F$ in Equation (5), from which we obtain LDA projected vectors $x_l$ and $y_l$, related by an approximate linear mapping $W_l$. In the inter-frame LDA, each $x$ and $y$ corresponds to the concatenation of 5 time-adjacent vectors $x_l$ and $y_l$, and thus the approximate linear mapping $W$ is given by a block matrix whose diagonal entries correspond to 5 block matrices $W_l$. As a consequence, if the linear approximation $X_F = W_F Y_F$ holds, then it is also a valid assumption for the projection of the speech features by $X_L = W_L Y_L$.
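The statement above can be checked numerically on toy data. The sketch below assumes NumPy, an orthonormal DCT matrix `D` acting on vectorized images, and a randomly drawn `W_I`; all values are hypothetical. With every coefficient kept, the exact image-domain model $x_I = W_I y_I$ carries over to the DCT domain through $D W_I D^T$. The truncated map `S @ W_F @ S.T` is only one illustrative choice of reduced transform (the article estimates $W_F$ by LR instead), and it no longer reproduces the reduced frontal coefficients exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16                                   # toy "image" size (vectorized)

# Orthonormal DCT-II matrix D acting on vectorized images.
k = np.arange(n)[:, None]
i = np.arange(n)[None, :]
D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
D[0, :] = np.sqrt(1.0 / n)

W_I = rng.standard_normal((n, n))        # hypothetical image-domain mapping
y_I = rng.standard_normal(n)             # lateral image
x_I = W_I @ y_I                          # exact linear model in the image domain

# With all coefficients kept (S = I), the DCT-domain mapping is D W_I D^T.
W_F = D @ W_I @ D.T
full_error = np.linalg.norm(D @ x_I - W_F @ (D @ y_I))

# Keeping only the first m coefficients (S selects rows) loses exactness,
# because the discarded lateral coefficients also contribute to x_F.
m = 4
S = np.eye(n)[:m]
W_F_reduced = S @ W_F @ S.T
reduced_error = np.linalg.norm(S @ D @ x_I - W_F_reduced @ (S @ D @ y_I))
```

Here `full_error` is zero up to rounding, while `reduced_error` is not, which is why the reduced-coefficient transform is best read as an approximation.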
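A possible reading of LLR on DCT features is to fit one regression matrix per block of adjacent coefficients. The sketch below assumes NumPy, paired feature matrices of equal dimension with coefficients ordered from low to high frequency, and a (nearly) equal partition via `np.array_split`. The function names are hypothetical and the authors' exact patch definition may differ.

```python
import numpy as np

def fit_local_lr(x_feats, y_feats, n_patches, beta=0.0):
    """Fit one ridge-regression matrix per block of adjacent coefficients.

    x_feats, y_feats: (d, M) matrices of paired frontal / lateral features.
    Returns a list of (index_array, W_block) pairs.
    """
    blocks = np.array_split(np.arange(x_feats.shape[0]), n_patches)
    model = []
    for idx in blocks:
        x, y = x_feats[idx], y_feats[idx]
        gram = y @ y.T + beta * np.eye(len(idx))
        # W = X Y^T gram^-1, computed through a linear solve (gram is symmetric).
        model.append((idx, np.linalg.solve(gram, y @ x.T).T))
    return model

def apply_local_lr(model, y):
    """Project lateral features (a vector or the columns of y) block by block."""
    out = np.empty_like(y, dtype=float)
    for idx, w in model:
        out[idx] = w @ y[idx]
    return out
```

Each block only sees its own frequency band, so the overall map is piecewise linear across bands while staying cheap to estimate.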
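The block-matrix structure for the concatenated inter-frame vectors can be illustrated on toy data. This sketch assumes NumPy and a randomly drawn per-frame mapping `W_l`; the dimensions follow the text (40-dimensional intra-frame LDA vectors, 5 concatenated frames), everything else is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_frames = 40, 5                      # intra-frame LDA size, frames concatenated

W_l = rng.standard_normal((dim, dim))      # hypothetical per-frame mapping x_l = W_l y_l
y_frames = rng.standard_normal((n_frames, dim))
x_frames = y_frames @ W_l.T                # apply W_l to every frame

# Block-diagonal matrix with five copies of W_l on its diagonal.
W_concat = np.kron(np.eye(n_frames), W_l)

# Concatenating the frames and applying W_concat gives the concatenated x_l vectors.
x_concat = W_concat @ y_frames.reshape(-1)
```

Applying `W_concat` to the concatenated lateral vector is equivalent to applying `W_l` frame by frame, so a linear relation before concatenation remains linear afterwards.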
Consideration should be given to the fact that applying the pose normalization on the original images, or even to the low-frequency DCT coefficients, is independent of the features we posteriorly use for speech recognition and could be adopted with other appearance or contour-based visual speech features. The use of the LDA features, however, is specific to the speech recognition system and involves an additional training of LDA projections for the different poses. In that sense, applying the LR techniques to the original images provides a more general strategy for the multi-pose problem, while the LDA features might be able to exploit their specificity for the speech recognition task.
3.4 Projective transforms on the images
We present two sets of experiments: one on lipreading, studying the adaptation of the visual stream to multi-pose conditions, and another on AV-ASR, analyzing the effects of the pose normalization on the audio-visual integration strategy. In lipreading experiments we first quantify the loss of performance associated to non-frontal poses, we then justify quantitatively the necessity of a pose normalization step and finally analyze the performance of the proposed pose normalization strategies. In audio-visual experiments, we first study whether the visual stream can still be exploited for speech recognition after the pose normalization has taken place, something that previous studies [4, 5] on AV-ASR in realistic working conditions failed to do. In AV-ASR we are also interested in the influence of the pose normalization on the final performance and, especially, on the optimal value of the weight associated to the visual stream.
The technical details of the experimental set-up are the following. The task considered is connected speech recognition under different speaker poses relative to the camera. Training and testing are done with the multi-speaker paradigm (all speakers are in both the train and test sets, but with different sequences) with three-fold cross-validation, and the results are given in terms of word accuracy. The same multi-speaker cross-validation is used to estimate the LR transforms for the different poses and features. The parameters of the feature extraction blocks and classifiers are chosen based on experiments with an evaluation dataset to optimize speech recognition. To fairly analyze the performance associated to frontal and lateral views, the same kind of classifiers are trained for each possible pose: frontal and lateral at 30°, 60° and 90° of head rotation. The HTK toolkit is used to implement three-state phoneme HMMs with a mixture of three Gaussians per state. For the multi-stream HMMs, the same number of states and Gaussians as in the single-stream case is used. The parameters of the model are initialized with the values estimated for independent audio and visual HMMs and subsequently re-estimated jointly. The audio and visual weights are considered fixed parameters of the system, restricted to sum up to one and optimized for speech recognition on the evaluation dataset.
For our experiments, we required speech recordings with constrained non-ideal visual conditions, namely, fixed known poses and natural lighting. To that purpose we recorded our own database, which is publicly available at our webpage. It consists of recordings of 20 native French speakers with simultaneous different views, one always frontal to the speaker and the other with different lateral poses.
To comply with the natural conditions, the corpus was recorded with natural lighting conditions, resulting in shadows on some images under the nose and mouth of the subjects. The videos were recorded with two high-definition cameras CANON VIXIA HG20, providing 1920 × 1080 pixels resolution at 25 frames per second, and included the head and shoulders of the speaker.
In terms of audio set-up, two different microphones were used for the recordings: an external microphone close to the speaker's mouth, without occluding its view, and the built-in microphone of the second camera. That set-up provided two conditions for the audio signal: a clean audio signal obtained with an external microphone tailored for human voice and a noisy signal recorded with a lower quality microphone some meters away from the speaker. Audio was recorded with a sample rate of 48000 Hz and 256 kbps for both microphones and used to synchronize the videos, as it offered better time resolution than pairing of the video frames (which offers only 40 milliseconds of frame resolution). For the two audio signals we computed the correlation of their normalized MFCC features within each manually segmented word, obtained an estimate of the delay for each word and averaged over the whole sequence. The same delay was considered for the video signals, after correcting for the difference in distance between the two microphones and the speaker.
The word labeling of the sequences was done manually with millisecond precision and phone labels were subsequently obtained by forced alignment of the clean audio signals with the known transcriptions.
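The synchronization step described above can be sketched as a lag search over frame-level features. This is an assumed reading of the procedure, not the authors' code: it uses NumPy, takes already normalized feature tracks (for example MFCCs) for each word, and picks the lag with the highest average frame-wise correlation. The function names and the `max_lag` parameter are hypothetical.

```python
import numpy as np

def estimate_delay(feats_a, feats_b, max_lag):
    """Lag (in frames) that maximizes the correlation between two feature tracks.

    feats_a, feats_b: arrays of shape (T, D) with normalized features.
    A positive result means feats_b lags behind feats_a.
    """
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = feats_a[: len(feats_a) - lag], feats_b[lag:]
        else:
            a, b = feats_a[-lag:], feats_b[: len(feats_b) + lag]
        n = min(len(a), len(b))
        if n == 0:
            continue
        score = np.sum(a[:n] * b[:n]) / n      # average frame-wise correlation
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def sequence_delay(word_segments_a, word_segments_b, max_lag=10):
    """Average the per-word delay estimates over a whole sequence."""
    lags = [
        estimate_delay(a, b, max_lag)
        for a, b in zip(word_segments_a, word_segments_b)
    ]
    return float(np.mean(lags))
```

The returned lag is in feature frames; converting it to seconds depends on the frame shift used for the MFCC analysis, which the text does not specify.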
4.2 Visual speech recognition
4.3 Audio-visual speech recognition
This set of experiments studies how pose changes and normalization affect AV-ASR systems. Since the visual stream is most useful when the audio signal is corrupted, we report audio-visual experiments with a noisy audio signal and compare them to an audio-only ASR system. In an audio-visual system, the weight assigned to the visual stream controls to what extent the classifier's decision is based on the visual features; therefore, differences between visual streams are more evident when the weight assigned to the video is high. The extreme cases correspond to a completely corrupted visual stream, where $\lambda_A = 1$, $\lambda_V = 0$ and the different pose normalization techniques obtain the same performance, and to a corrupted audio signal with weights $\lambda_A = 0$, $\lambda_V = 1$ and the lipreading performance already observed. Consequently, the differences in performance of the pose normalization methods are more acute with 0 dB than 7 dB of audio SNR and almost imperceptible with clean audio data. To create the noisy conditions, we artificially added babble noise extracted from the NOISEX database to the clean audio signal at 7 dB and 0 dB of SNR and tested our pose normalization techniques in these conditions. The HMM audio parameters were trained in clean conditions, but tested with the corrupted audio stream. The visual counterpart corresponds to the previous lipreading system, with the best GLR or LLR technique for each feature space.
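Mixing noise into a clean signal at a target SNR can be sketched as follows. This is a generic illustration, assuming NumPy and SNR defined as the ratio of average signal power to average noise power over the whole utterance; the function name is hypothetical, and the article does not state how the noise segments were selected or scaled.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the mixture has the requested SNR in dB."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(clean):
        # Tile the noise so it covers the whole utterance.
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose the gain so that p_clean / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```

Calling the function with `snr_db=7.0` and `snr_db=0.0` would produce the two noisy conditions mentioned in the text from the same clean recording.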
4.4 Statistical significance of the results
In our experiments, we compare different views of the speaker and pose normalization strategies learned and tested on the same data and the results, therefore, reflect differences between the views and pose normalization strategies rather than differences in the test datasets. In this case, the statistical significance of the results cannot be evaluated by means of confidence intervals associated to the performance of each method independently, but requires the comparison of the different methods on a one-to-one basis for the same sentences, speakers and train/test datasets. In this study, we use the "probability of error reduction" $p_e$ to assess the differences in performance of the proposed weighting schemes. We refer the reader to the original article for a detailed description of $p_e$, give only an intuitive definition and use it to assess whether one method significantly outperforms another. Intuitively, the probability of error reduction $p_e$ between two systems A and B measures the number of independent testing samples that favor system A over B while leaving the rest of the samples unchanged.
To assess whether the differences in performance between pose normalization applied in different feature spaces are statistically significant, we compute $p_e$ with respect to a lateral system in the lipreading experiment. For the image and DCT feature spaces, performance degrades in every single test case for all the possible lateral views ($p_e = 1$). In the case of the LDA feature space (with the GLR technique), performance degrades in 70% of the cases for 30° of head rotation and in 80% for the rest of the lateral views. We conclude that LR pose normalization is more successful in the LDA space, while the DCT and image spaces perform poorly. At the same time, even though the final accuracy of the lateral system is close to that of the projected LDA features, there is a significant loss of performance due to the pose normalization.
For the audio-visual experiments, we compare each of the systems to an audio-only recognizer. Only the pose normalization in the LDA space is able to exploit the visual stream with 7 dB of SNR, with performance improving in 98%, 95%, and 89% of the sequences at 30°, 60°, and 90° in comparison to an audio-only system. This percentage falls below 16% and 13% for the DCT or image space, indicating that pose normalization in these feature spaces fails to exploit the visual modality in an AV-ASR system. In a noisier environment with 0 dB of SNR, the projection on the LDA space is always beneficial, while the DCT and image spaces only do better than an audio-only system in 80% of the cases. This analysis confirms that pose normalization is only really successful in the LDA feature space in both visual and audio-visual ASR systems.
In this article, we presented a lipreading system able to recognize speech from different views of the speaker. Inspired by pose-invariant face recognition studies, we introduced a pose normalization block in a standard system and generated virtual frontal views from non-frontal images. In particular, we used linear regression to project the features associated with different poses at different stages of the lipreading system: the images themselves, a low-dimensional and compact representation of the images in the frequency domain, or the final LDA features used for classification. Our experiments show that the pose normalization is more successful when applied directly to the LDA features used in the classifier, while the projection of more general features like the images or their low-frequency representation fails because of misalignments in the training data and errors in the estimation of the transforms.
In terms of AV-ASR, we studied the effects of pose normalization on the fusion strategy of the audio and visual modalities. We evaluated the effects of pose normalization on the weight associated with the visual stream and analyzed for which of the proposed techniques the audio-visual system is able to exploit its visual modality. We show that only the projection of the LDA features used in the classifier is really able to normalize the visual stream to a virtual frontal pose and enhance the performance of the audio system. As expected, there is a direct relation between the optimal weight associated with the pose-normalized visual stream and its performance in lipreading experiments. Consequently, we can simply study the effects of pose normalization in the visual domain and transfer the improvements to the audio-visual task by adapting the weight associated with the visual stream.
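The regression-based projection summarized above can be sketched as a single global linear map learned from paired lateral and frontal feature vectors. The sketch assumes mean-normalized features (no bias term) and adds a small ridge term for numerical stability; the ridge term and the names `fit_pose_regression`, `to_frontal` and `solve_linear_system` are illustrative assumptions, not the implementation used in this article.

```python
def solve_linear_system(a, b):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; try a larger ridge term")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_pose_regression(lateral, frontal, ridge=1e-6):
    """Least-squares map from lateral to frontal features.

    Minimizes sum_i ||f_i - W l_i||^2 + ridge * ||W||^2 over paired samples and
    returns W transposed (d_in rows, d_out columns).
    """
    d_in, d_out = len(lateral[0]), len(frontal[0])
    # A = sum_i l_i l_i^T + ridge * I
    a = [[sum(l[r] * l[c] for l in lateral) + (ridge if r == c else 0.0)
          for c in range(d_in)] for r in range(d_in)]
    # B = sum_i l_i f_i^T
    b = [[sum(l[r] * f[k] for l, f in zip(lateral, frontal))
          for k in range(d_out)] for r in range(d_in)]
    return solve_linear_system(a, b)


def to_frontal(w_t, lateral_vec):
    """Project one lateral feature vector to a virtual frontal one."""
    d_out = len(w_t[0])
    return [sum(w_t[j][k] * lateral_vec[j] for j in range(len(lateral_vec)))
            for k in range(d_out)]
```

The same fitting routine applies whether the paired vectors are raw pixels, DCT coefficients or LDA features; only the dimensionality of the inputs changes.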
a. For simple LDA we can interpret the patches as directions in the original space maximizing the projected ratio R, so that if we sort the eigenvectors of the LDA projection according to their eigenvalues, we could interpret the patches as linear subspaces decreasingly maximizing the projected ratio. However, as we include intra- and inter-frame LDA in the W_L transform, no interpretation is possible for the patch definition in the x_L, y_L space.
This study was supported by the Swiss SNF grant number 200021-130152.
- Potamianos G, Neti C, Luettin J, Matthews I: Audio-visual automatic speech recognition: an overview. In Issues in audio-visual speech processing. Edited by: Bailly G, Vatikiotis-Bateson E, Perrier P. MIT Press, Cambridge; 2004.
- Dupont S, Luettin J: Audio-visual speech modeling for continuous speech recognition. IEEE Trans Multimedia 2000, 2: 141-151. 10.1109/6046.865479
- Potamianos G, Neti C, Gravier G, Garg A, Senior A: Recent advances in the automatic recognition of audio-visual speech. Proc IEEE 2003, 91(9):1306-1326. 10.1109/JPROC.2003.817150
- Potamianos G, Neti C: Audio-visual speech recognition in challenging environments. Eighth European Conference on Speech Communication and Technology, EUROSPEECH-2003 2003, 1293-1296.
- Livescu K, Cetin O, Hasegawa-Johnson M, King S, Bartels C, Borges N, Kantor A, Lal P, Yung L, Bezman A, Dawson-Haggerty S, Woods B, Frankel J, Magami-Doss M, Saenko K: Articulatory feature-based methods for acoustic and audio-visual speech recognition. Final Workshop Report, Center for Language and Speech Processing, Johns Hopkins University 2006, 4: 621-624.
- Furui S: Cepstral analysis technique for automatic speaker verification. IEEE Trans Acoustics Speech Signal Process 1981, 29(2):254-272.
- Hermansky H, Morgan N: RASTA processing of speech. IEEE Trans Speech Audio Process 1994, 2(4):578-589. 10.1109/89.326616
- Blanz V, Grother P, Phillips P, Vetter T: Face recognition based on frontal views generated from non-frontal images. IEEE Proc Comput Vision Pattern Recogn 2005, 2: 454-461.
- Blanz V, Vetter T: Face recognition based on fitting a 3D morphable model. IEEE Trans Pattern Anal Mach Intell 2003, 25(9):1063-1074. 10.1109/TPAMI.2003.1227983
- Wai Lee M, Ranganath S: Pose-invariant face recognition using a 3D deformable model. Pattern Recogn 2003, 36(8):1835-1846. 10.1016/S0031-3203(03)00008-6
- Chai X, Shan S, Chen X, Gao W: Locally linear regression for pose-invariant face recognition. IEEE Trans Image Process 2007, 16(7):1716-1725.
- Zhang X, Broun CC, Mersereau RM, Clements MA: Automatic speechreading with applications to human-computer interfaces. EURASIP J Appl Signal Process 2002, 2002: 1228-1247. 10.1155/S1110865702206137
- Lucey P, Potamianos G, Sridharan S: A unified approach to multi-pose audio-visual ASR. In 8th Annual Conference of the International Speech Communication Association, Interspeech. Antwerp, Belgium; 2007:650-653.
- Lucey P, Potamianos G, Sridharan S: An Extended Pose-Invariant Lipreading System. In International Workshop on Auditory-Visual Speech Processing. Edited by: Vroomen, Jean, Swerts, Marc, Krahmer, Emiel. Hilvarenbeek; 2007.
- Estellers V, Thiran JP: Multipose Audio-Visual Speech Recognition. In 19th European Signal Processing Conference EUSIPCO. Volume 2011. Barcelona; 2011:1065-1069.
- Hermansky H: Perceptual Linear Predictive (PLP) Analysis of Speech. J Acoustical Soc Am 1990, 87(4):1738-1752. 10.1121/1.399423
- Mermelstein P: Distance measures for speech recognition, psychological and instrumental. Pattern Recognition and Artificial Intelligence 1976, 116: 374-388.
- Davis SB, Mermelstein P: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans Acoustics Speech Signal Process 1980, 28(4):357-366. 10.1109/TASSP.1980.1163420
- Cetingul H, Yemez Y, Erzin E, Tekalp A: Discriminative analysis of lip motion features for speaker identification and speech-reading. IEEE Trans Image Process 2006, 15(10):2879-2891.
- Rabiner L: A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 1989, 77(2):257-286. 10.1109/5.18626
- Bregler C, Konig Y: Eigenlips for robust speech recognition. 1994 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1994. ICASSP-94 1994, 2: II/669-II/672.
- Tomlinson M, Russell M, Brooke N: Integrating audio and visual information to provide highly robust speech recognition. Conference Proceedings., 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996. ICASSP-96 1996, 2: 821-824.
- Kaynak M, Zhi Q, Cheok A, Sengupta K, Jian Z, Chung KC: Analysis of lip geometric features for audio-visual speech recognition. IEEE Trans Syst Man Cybernetics, Part A: Syst Humans 2004, 34(4):564-570. 10.1109/TSMCA.2004.826274
- Potamianos G, Graf HP, Cosatto E: An image transform approach for HMM based automatic lipreading. IEEE International Conference on Image Processing, Chicago 1998, Il: 173-177.
- Scanlon P, Ellis D, Reilly R: Using mutual information to design class specific phone recognizers. In Eighth European Conference on Speech Communication and Technology, EUROSPEECH-2003. 2003:857-860.
- Patterson EK, Gurbuz S, Tufekci Z, Gowdy JN: Moving-talker, speaker-independent feature study, and baseline results using the WAVE Multimodal speech corpus. EURASIP J Appl Signal Process 2002, 2002(11):1189-1201. 10.1155/S1110865702206101
- Sonka M, Hlavac V, Boyle R: Image Processing, Analysis and Machine Vision. International Thomson 1999, 35(1):102-104.
- Lachenbruch P, Goldstein M: Discriminant analysis. Biometrics 1979, 35: 69-85. 10.2307/2529937
- Battiti R: Using mutual information for selecting features in supervised neural net learning. IEEE Trans Neural Netw 1994, 5(4):537-550. 10.1109/72.298224
- Fleuret F: Fast binary feature selection with conditional mutual information. J Mach Learn Res 2004, 5: 1531-1555.
- Peng H, Long F, Ding C: Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 2005, 27(8):1226-1238.
- Gurban M, Thiran JP: Information theoretic feature extraction for audiovisual speech recognition. IEEE Trans Signal Process 2009, 57(12):4765-4776.
- Adjoudani A, Benoit C: Speechreading by Humans and Machines: Models, Systems and Applications. Springer, Berlin; 1996:461-471. 150 NATO ASI Series F
- Chen T: Audiovisual speech processing. IEEE Signal Process Mag 2001, 18(1):9-21. 10.1109/79.911195
- Neti C, Potamianos G, Luettin J, Matthews I, Glotin H, Vergyri D: Large-vocabulary audio-visual speech recognition: a summary of the Johns Hopkins Summer 2000 Workshop. In Proc Works Signal Processing. Cannes; 2001:619-624.
- Potamianos G, Luettin J, Neti C: Hierarchical discriminant features for audio-visual LVCSR. International Conference on Acoustics Speech and Signal Processing ICASSP 2001, 1: 165-168.
- Movellan J, Chadderdon G: Channel separability in the audio-visual integration of speech: A Bayesian approach. NATO ASI Series F Comput Syst Sci 1996, 150: 473-488.
- Massaro D, Stork D: Speech recognition and sensory integration. Am Sci 1998, 86(3):236-244.
- Kittler J, Hatef M, Duin R, Matas J: On combining classifiers. IEEE Trans Pattern Anal Mach Intell 1998, 20(3):226-239. 10.1109/34.667881
- Kirchhoff K, Bilmes J: Dynamic classifier combination in hybrid speech recognition systems using utterance-level confidence values. International Conference on Acoustics, Speech, and Signal Processing 1999, 2: 693-696.
- Rabiner L, Juang B: An introduction to Hidden Markov models. IEEE ASSP Mag 1986, 3: 4-16.
- Bishop C: Neural Networks for Pattern Recognition. Oxford University Press, Oxford; 1995.
- Rabiner L, Juang BH: Fundamentals of Speech Recognition. Signal processing, Prentice Hall, NJ; 1993.
- Gross R, Matthews I, Baker S: Appearance-based face recognition and light-fields. IEEE Trans Pattern Anal Mach Intell 2004, 26(4):449-465. 10.1109/TPAMI.2004.1265861
- Blanz V, Vetter T: Face recognition based on fitting a 3D morphable model. IEEE Trans Pattern Anal Mach Intell 2003, 25(9):1063-1074. 10.1109/TPAMI.2003.1227983
- Vetter T: Synthesis of novel views from a single face image. Int J Comput Vision 1998, 28(2):103-116. 10.1023/A:1008058932445
- Beymer D: Face recognition under varying pose. IEEE Proc Comput Vision Pattern Recogn 1994, 1: 756-761.
- Summerfield Q: Hearing by Eye: The Psychology of Lip-Reading. Lawrence Erlbaum Associates, Hillsdale, NJ; 1987.
- Bishop CM, Nasrabadi NM: Pattern Recognition and Machine Learning. Springer, New York; 2006.
- Bellman R: Adaptive Control Processes: a guided tour. Volume 1. Princeton University Press, Princeton; 1961:2.
- Young S, Evermann G, Kershaw D, Moore G, Odell J, Ollason D, Valtchev V, Woodland P: The HTK book. Volume 2. Cambridge University Engineering Department; 1997.
- Jordan T, Thomas S: Effects of horizontal viewing angle on visual and audiovisual speech recognition. J Exp Psychol 2001, 27(6):1386-1403.
- Varga A, Steeneken H, Tomlinson M, Jones D: The NOISEX-92 study on the Effect of Additive Noise on Automatic Speech Recognition. Tech. Rep., DRA Speech Research Unit, Malvern, England; 1992.
- Bisani M, Ney H: Bootstrap estimates for confidence intervals in ASR performance evaluation. International Conference on Acoustics, Speech, and Signal Processing 2004, 1: 409-412.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.