A Support Vector Machine-Based Dynamic Network for Visual Speech Recognition Applications

Visual speech recognition is an emerging research field. In this paper, we examine the suitability of support vector machines for visual speech recognition. Each word is modeled as a temporal sequence of visemes corresponding to the different phones realized. One support vector machine is trained to recognize each viseme, and its output is converted to a posterior probability through a sigmoidal mapping. To model the temporal character of speech, the support vector machines are integrated as nodes into a Viterbi lattice. We test the performance of the proposed approach on a small visual speech recognition task, namely the recognition of the first four digits in English. The word recognition rate obtained is at the level of the previous best reported rates.


I. Introduction
Audio-visual speech recognition is an emerging research field that requires multi-modal signal processing. The motivation for using visual information in speech recognition lies in the fact that human speech production is bimodal by nature. In particular, human speech is produced by the vibration of the vocal cords and depends on the configuration of the articulatory organs, such as the nasal cavity, the tongue, the teeth, the velum, and the lips. A speaker produces speech using these articulatory organs together with the muscles that generate facial expressions. Because some of the articulators, such as the tongue, the teeth, and the lips, are visible, there is an inherent relationship between the acoustic and the visible speech. As a consequence, speech can be partially recognized from the visible articulators involved in its production, in particular from the image region comprising the mouth [1,2,3].
Undoubtedly, the most useful information for speech recognition is carried by the acoustic signal. When the acoustic speech is clean, performing visual speech recognition and integrating the recognition results from both modalities brings little improvement, because the recognition rate from the acoustic information alone is already very high, if not perfect. However, when the acoustic speech is degraded by noise, adding the visual information to the acoustic one improves the recognition rate significantly. Under noisy conditions, the use of both modalities for speech recognition has been shown to be equivalent to a gain of 12 dB in the signal-to-noise ratio of the acoustic signal [1]. For large vocabulary speech recognition tasks, the visual signal can also provide a performance gain when it is integrated with the acoustic signal, even in the case of clean acoustic speech [4].
Visual speech recognition refers to the task of recognizing spoken words based only on the visual examination of the speaker's face. This task is also referred to as lipreading, since the most important visible part of the face examined for information extraction during speech is the mouth area. Different shapes of the mouth (i.e., different mouth openings, different positions of the teeth and tongue) realized during speech cause the production of different sounds. One can establish a correspondence between the mouth shape and the phone produced, even if this correspondence is not one-to-one but one-to-many, due to the involvement of invisible articulatory organs in speech production. For small vocabulary word recognition tasks, we can perform good quality speech recognition using the visual information conveyed by the mouth shape only.
Several methods have been reported in the literature for visual speech recognition. The adopted methods vary widely with respect to: 1) the feature types, 2) the classifier used, and 3) the class definition. For example, Bregler and Omohundro used time delayed neural networks (TDNN) for visual classification and the outer lip contour coordinates as visual features [5]. Luettin and Thacker used active shape models to represent the different mouth shapes and gray level distribution profiles (GLDPs) around the outer and/or inner lip contours as feature vectors, and finally built whole-word Hidden Markov Model (HMM) classifiers for visual speech recognition [6]. Movellan also employed HMMs to build the visual word models, but he used directly the gray levels of the mouth images as features, after simple preprocessing to exploit the vertical symmetry of the mouth [7]. In recent works, Movellan et al. have reported very good results when partially observable stochastic differential equation (SDE) models are integrated in a network as visual speech classifiers instead of HMMs [8], and Gray et al. have presented a comparative study of a series of different features based on Principal Component Analysis (PCA) and Independent Component Analysis (ICA) in an HMM-based visual speech recognizer [9].
Despite the variety of existing strategies for visual speech recognition, there is still ongoing research in this area attempting to: 1) find the most suitable features and classification techniques to discriminate effectively between the different mouth shapes, while keeping in the same class the mouth shapes produced by different individuals that correspond to one phone; 2) require minimal processing of the mouth image, to allow for a real-time implementation of the mouth shape classifier; 3) facilitate the easy integration of audio and video speech recognition modules [1].
In this paper, we contribute to the first two of the aforementioned aspects of visual speech recognition by examining the suitability of support vector machines (SVMs) for visual speech recognition tasks. The idea is based on the fact that SVMs have proved to be powerful classifiers in various pattern recognition applications, such as face detection, face verification/recognition, etc. [10,11,12,13,14,15]. Very good results in audio speech recognition using SVMs were recently reported in [16]. To the best of the authors' knowledge, no attempts at applying SVMs to visual speech recognition have been reported so far; the use of SVMs as visual speech classifiers is thus a novel idea.
One of the reasons that partially explains why SVMs have not been exploited in automatic speech recognition so far is that they are inherently static classifiers, while speech is a dynamic process, where the temporal information is essential for recognition. A solution to this problem was presented in [16], where a combination of HMMs with SVMs is proposed. In this paper a similar strategy is adopted: we shall use Viterbi lattices to dynamically create visual word models.
The approaches for building word models can be classified into those where whole-word models are developed [16,7,6] and those where viseme-oriented word models are derived [17,18,19]. In this paper, we adopt the latter approach, because it is more suitable for an SVM implementation and offers the advantage of an easy generalization to large vocabulary word recognition tasks without a significant increase in storage requirements. It also keeps the dictionary of basic visual models needed for word modeling within a reasonable limit.
The word recognition rate obtained is at the level of the best previously reported rates in the literature, although we do not attempt to learn the state transition probabilities. When very simple features (i.e., pixels) are used, our word recognition rate is superior to the ones reported in the literature. Accordingly, SVMs are a promising alternative for visual speech recognition, and this observation encourages further research in that direction. It is well known that the Morton-Massaro law (MML) holds when humans integrate audio and visual speech [20]. Experiments have demonstrated that the MML also holds for audiovisual speech recognition systems. That is, the audio and visual speech signals may be treated as if they were conditionally independent without significant loss of information about speech categories [20]. This observation supports the independent treatment of audio and visual speech and allows an easy integration of the visual speech recognition module with the acoustic speech recognition module.
The paper is organized as follows. In Section II a short overview of SVM classifiers is given. We review the concepts of visemes and phonemes in Section III. We discuss the proposed SVM-based approach to visual speech recognition in Section IV. Experimental results obtained when the proposed system is applied to a small vocabulary visual speech recognition task (i.e., the visual recognition of the first four digits in English) are described in Section V and compared to other results published in the literature. Finally, in Section VI, our conclusions are drawn and future research directions are identified.

II. Overview on SVMs and Their Applications in Pattern Recognition
SVMs constitute a principled technique to train classifiers that stems from statistical learning theory [21,22]. Their root is the optimal hyperplane algorithm. They minimize a bound on the empirical error and the complexity of the classifier at the same time. Accordingly, they are capable of learning in sparse, high-dimensional spaces with relatively few training examples. Let {x_i, y_i}, i = 1, 2, ..., N, denote N training examples, where x_i comprises an M-dimensional pattern and y_i is its class label. Without any loss of generality, we shall confine ourselves to the two-class pattern recognition problem, that is, y_i ∈ {−1, +1}. We agree that y_i = +1 is assigned to positive examples, whereas y_i = −1 is assigned to counterexamples.
The data to be classified by the SVM might or might not be linearly separable in their original domain. If they are separable, then a simple linear SVM can be used for their classification. However, the power of SVMs is demonstrated better in the nonseparable case, when the data cannot be separated by a hyperplane in their original domain. In the latter case, we can project the data into a higher dimensional Hilbert space and attempt to linearly separate them there using kernel functions. Let Φ denote a nonlinear map Φ : R^M → H, where H is a higher-dimensional Hilbert space. SVMs construct the optimal separating hyperplane in H. Therefore, their decision boundary is of the form

f(x) = sign( Σ_{i=1}^{N} α_i y_i K(x_i, x) + b ),     (1)

where K(z_1, z_2) is a kernel function that defines the dot product between Φ(z_1) and Φ(z_2) in H, and α_i are the nonnegative Lagrange multipliers associated with the quadratic optimization problem that aims to maximize the distance between the two classes measured in H, subject to the constraints

y_i (w^T Φ(x_i) + b) ≥ 1,  i = 1, 2, ..., N,     (2)

where w and b are the parameters of the optimal separating hyperplane in H. That is, w is the normal vector to the hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w|| denotes the Euclidean norm of the vector w. The use of kernel functions eliminates the need for an explicit definition of the nonlinear mapping Φ, because the data appear in the training algorithm of the SVM only as dot products of their mappings. Frequently used kernel functions are the polynomial kernel, K(x_i, x_j) = (m x_i^T x_j + n)^q, and the Radial Basis Function (RBF) kernel, K(x_i, x_j) = exp{−γ ||x_i − x_j||^2}. In the following, we will omit the sign function from the decision boundary (1), which simply makes the optimal separating hyperplane an indicator function.
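As an illustration of the ideas above, the following sketch trains a two-class SVM with a polynomial kernel of the form (m x_i^T x_j + n)^q. It uses scikit-learn on synthetic data and is not the implementation used in the paper; in scikit-learn's parameterization, gamma plays the role of m and coef0 the role of n.

```python
# Sketch (not the authors' implementation): a two-class SVM with a
# polynomial kernel, as in Section II. Patterns and labels are synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic M-dimensional patterns x_i with labels y_i in {-1, +1}.
X_pos = rng.normal(loc=+1.0, scale=0.5, size=(40, 8))
X_neg = rng.normal(loc=-1.0, scale=0.5, size=(40, 8))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 40 + [-1] * 40)

# Polynomial kernel K(x_i, x_j) = (gamma * x_i^T x_j + coef0)^degree,
# matching the kernel form given in the text with q = 3.
svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0)
svm.fit(X, y)

# decision_function returns the real-valued margin
# f(x) = sum_i alpha_i y_i K(x_i, x) + b, before the sign function.
f = svm.decision_function(X[:2])
print(f.shape)
```

Dropping the sign function, as the text notes, corresponds to reading `decision_function` instead of `predict`.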
To enable the use of SVM classifiers in visual speech recognition, where we model speech as a temporal sequence of symbols corresponding to the different phones produced, we shall employ the SVMs as nodes in a Viterbi lattice. But the nodes of such a Viterbi lattice should generate the posterior probabilities for the corresponding symbols to be emitted [23], and standard SVMs do not provide such probabilities as output. Several solutions have been proposed in the literature to map the SVM output to probabilities: the cosine decomposition proposed by Vapnik [21], the probabilistic approximation obtained by applying the evidence framework to SVMs [24], and the sigmoidal approximation by Platt [25]. Here we adopt the solution proposed by Platt [25], since it is a simple solution that was already used in a similar application of SVMs to audio speech recognition [16].
The solution proposed by Platt shows that, given a trained SVM, we can convert its output to a probability by training the parameters a_1 and a_2 of a sigmoidal mapping function, and that this produces a good mapping from SVM margins to probabilities. In general, the class-conditional densities on either side of the SVM hyperplane are exponential. So, Bayes' rule [26] on two exponentials suggests the use of the following parametric form of a sigmoidal function:

P(y = 1 | f(x)) = 1 / (1 + exp(a_1 f(x) + a_2)),     (3)

where:
- y is the label for x,
- f(x) = Σ_{i=1}^{N} α_i y_i K(x_i, x) + b is the real-valued output of the SVM classifier for the feature vector x to be classified, and
- a_1, a_2 are the parameters of the sigmoidal mapping to be derived for the currently trained SVM under consideration, with a_1 < 0.

P(y = −1 | f(x)) could be defined similarly. However, since each SVM represents only one data category (i.e., the positive examples), we are interested only in the probability given by (3). The latter equation gives directly the posterior probability to be used in a Viterbi lattice. The parameters a_1 and a_2 are derived from a training set (f(x_i), y_i) using maximum likelihood estimation. In the adopted approach, we use the training set of the SVM, (x_i, y_i), i = 1, 2, ..., N, to estimate the parameters of the sigmoidal function. The estimation starts with the definition of a new training set, (f(x_i), t_i), i = 1, 2, ..., N, where t_i are the target probabilities. The target probabilities are defined as follows.
• When a positive example (i.e., y_i = +1) is observed at a value f(x_i), we assume that this example is probably in the class represented by the SVM, but there is still a small finite probability ε_+ of getting the opposite label at the same f(x_i) for some out-of-sample data. Thus, t_i = t_+ = 1 − ε_+.
• When a negative example (i.e., y_i = −1) is observed at a value f(x_i), we assume that this example is probably not in the class represented by the SVM, but there is still a small finite probability ε_− of getting the opposite label at the same f(x_i) for some out-of-sample data. Thus, t_i = t_− = ε_−.
Let us denote by N_+ the number of positive examples in the training set (x_i, y_i), i = 1, 2, ..., N, and by N_− the number of negative examples. Following Platt [25], we set ε_+ = 1/(N_+ + 2) and ε_− = 1/(N_− + 2). The parameters a_1 and a_2 are found by minimizing the negative log likelihood of the training data, which is the cross-entropy error function

E = − Σ_{i=1}^{N} [ t_i log p_i + (1 − t_i) log(1 − p_i) ],     (4)

where

p_i = 1 / (1 + exp(a_1 f(x_i) + a_2)).     (5)

In Eqs. (4) and (5), p_i, i = 1, 2, ..., N, is the value of the sigmoidal mapping for the training example x_i, where f(x_i) is the real-valued output of the SVM for this example. Due to the negative sign of a_1, p_i tends to 1 if x_i is a positive example (i.e., f(x_i) > 0) and to 0 if x_i is a negative example (i.e., f(x_i) < 0).
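The fitting procedure above can be sketched as follows; this is an illustrative NumPy/SciPy reimplementation run on synthetic SVM margins, not the authors' code.

```python
# Sketch of Platt's sigmoid fitting: given SVM outputs f(x_i) and labels y_i,
# estimate a_1, a_2 of P(y=1|f) = 1/(1 + exp(a_1*f + a_2)) by minimizing the
# cross-entropy error of Eq. (4). Data here are synthetic.
import numpy as np
from scipy.optimize import minimize

def fit_platt_sigmoid(f, y):
    """f: real-valued SVM outputs; y: labels in {-1, +1}."""
    n_pos = np.sum(y == +1)
    n_neg = np.sum(y == -1)
    # Target probabilities t_+ = 1 - 1/(N_+ + 2), t_- = 1/(N_- + 2).
    t = np.where(y == +1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

    def neg_log_likelihood(params):
        a1, a2 = params
        p = 1.0 / (1.0 + np.exp(a1 * f + a2))
        eps = 1e-12  # guard against log(0)
        return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

    res = minimize(neg_log_likelihood, x0=np.array([-1.0, 0.0]),
                   method="Nelder-Mead")
    return res.x  # (a_1, a_2), with a_1 expected to be negative

# Synthetic margins: positives cluster around +2, negatives around -2.
rng = np.random.default_rng(0)
f = np.concatenate([rng.normal(2.0, 0.7, 50), rng.normal(-2.0, 0.7, 50)])
y = np.array([+1] * 50 + [-1] * 50)
a1, a2 = fit_platt_sigmoid(f, y)
p_pos = 1.0 / (1.0 + np.exp(a1 * 3.0 + a2))   # large positive margin
p_neg = 1.0 / (1.0 + np.exp(a1 * -3.0 + a2))  # large negative margin
print(a1, p_pos, p_neg)
```

Because a_1 comes out negative, large positive margins map to posteriors near 1 and large negative margins to posteriors near 0, as the text requires.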

III. Phonemes and Visemes

A. Phonetic Word Description
The basic units of acoustic speech are the phones. Roughly speaking, a phone is an acoustic realization of a phoneme, a theoretical unit for describing how speech conveys linguistic meaning. The acoustic realization of a phoneme depends on the speaker's characteristics, the word context, etc. The variations in the pronunciation of the same phoneme are called allophones. In the technical literature, a clear distinction between phones and phonemes is seldom made.
In this paper, we are dealing with speech recognition in English, so we shall focus on this particular case. The number of phones in the English language varies in the literature [27,28]. Usually, there are about 10-15 vowels or vowel-like phones and 20-25 consonants. The most commonly used computer-based phonetic alphabet for American English is ARPABET, which consists of 48 phones [2]. To convert the orthographic transcription of a word in English to its phonetic transcription, one can use the publicly available CMU pronunciation dictionary [30]. The CMU pronunciation dictionary uses a subset of the ARPABET consisting of 39 phones. For example, the CMU phonetic transcription of the word "one" is "W-AH-N".

B. The Concept of Viseme
Similarly to the acoustic domain, we can define the basic unit of speech in the visual domain: the viseme. In general, in the visual domain we observe the image region of the speaker's face that contains the mouth. Therefore, the concept of viseme is usually defined in relation to the mouth shape and the mouth movements. An example where the concept of viseme is related to the mouth dynamics is the viseme OW, which represents the movement of the mouth from a position close to O to a position close to W [2]. In such a case, to represent a viseme we would need to use a video sequence, a fact that would complicate the processing of the visual speech to some extent. Fortunately, however, most of the visemes can be approximately represented by stationary mouth images. Two examples of visemes defined in relation to the mouth shape during the production of the corresponding phones are given in Figure 1.

C. Phoneme to Viseme Mappings
To be able to perform visual speech recognition, ideally we would like to define for each phoneme its corresponding viseme. In this way, each word could be unambiguously described, according to its pronunciation, in the visual domain. Unfortunately, invisible articulatory organs are also involved in speech production, which renders the mapping of phonemes to visemes many-to-one. Thus, there are phonemes that cannot be distinguished in the visual domain. For example, the phonemes /P/, /B/, and /M/ are all produced with a closed mouth and are visually indistinguishable, so they are represented by the same viseme. We also have to consider the dual aspect, corresponding to the concept of allophones in the acoustic domain: the same viseme can have different realizations, represented by different mouth shapes, due to speaker variability and context.
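A many-to-one phoneme-to-viseme mapping can be represented as a simple lookup table. The grouping below is purely illustrative (only the /P/-/B/-/M/ group is taken from the text; the other entries and all viseme names are hypothetical), not the mapping used in the experiments:

```python
# Toy many-to-one phoneme-to-viseme mapping. Viseme names are made up for
# illustration; the actual mapping used in the paper is given in Table III.
PHONEME_TO_VISEME = {
    # Bilabials /P/, /B/, /M/ are produced with a closed mouth and are
    # visually indistinguishable, so they map to a single viseme.
    "P": "closed_lips", "B": "closed_lips", "M": "closed_lips",
    "F": "lip_teeth", "V": "lip_teeth",
    "W": "rounded", "UW": "rounded",
}

def word_to_visemes(phonemes):
    """Translate a phonetic transcription into its visemic transcription."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]

print(word_to_visemes(["P", "B", "W"]))
```

Because the mapping is many-to-one, distinct phonetic transcriptions can collapse to the same visemic transcription, which is exactly why some words are harder to distinguish visually than acoustically.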
Unlike for phonemes, in the case of visemes there are no viseme tables commonly accepted by all researchers [1], although several attempts in this direction have been undertaken. For example, it is commonly agreed that the visemes of the English consonants can be grouped into 9 distinct groups, as in Table I [1]. To obtain the viseme groupings, the confusions in stimulus-response matrices measured on an experimental basis are analyzed. In such experiments, subjects are asked to visually identify syllables in a given context, such as vowel-consonant-vowel (V-C-V) words. Then the stimulus-response matrices are tabulated and the visemes are identified as those clusters of phonemes in which at least 75% of all responses occur. This strategy leads to a systematic, application-independent viseme definition. In other approaches, clustering techniques [18] and self-organizing maps [17] were employed to group visually similar phonemes based on geometric features. Similar techniques could be applied to raw images of mouth regions as well. However, in this paper we do not resort to such strategies, because our main goal is the evaluation of the proposed visual speech recognition method. Thus, we define only those visemes that are strictly needed to represent the visual realization of the small vocabulary used in our application and manually classify the training images into a number of predefined visemes, as explained in Section V.

IV. The Proposed Approach to Visual Speech Recognition
Depending on the approach used to model the spoken words in the visual domain, we can classify the existing visual speech recognition systems into systems using word-oriented models and those using viseme-oriented models [4]. In this paper, we develop viseme-oriented models. Viseme-based lipreading was also investigated in [18,17]. Each visual word model can afterwards be represented as a temporal sequence of visemes. Thus, the structure of the visual word modeling and recognition system can be regarded as a two-level structure:
1. At the first level, we build the viseme classes, one class of mouth images for each viseme defined. This implies the formulation of the mouth shape recognition problem as a pattern recognition problem. The patterns to be recognized are the mouth shapes, symbolically represented as visemes. In our approach, the classification of mouth shapes into viseme classes is formulated as a two-class (binary) pattern recognition problem, and there is one SVM dedicated to each viseme class.
2. At the second level, we build the abstract visual word models, described as temporal sequences of visemes. The visual word models are implemented by means of Viterbi lattices, where each node generates the emission probability of a certain viseme at one particular time instant.
One can notice that the aforementioned two-level approach is very similar to some techniques employed for acoustic speech recognition [16], thus justifying our expectation that the proposed method will ensure an easy integration of the visual speech recognition subsystem with a similar acoustic speech recognition subsystem.
In this section, we will focus on the first level of the proposed algorithm for visual speech modeling and recognition. The second level involves the development of the visual symbolic sequential word models using the Viterbi lattices. The latter level is discussed only in principle.

A. Formulation of Visual Speech Recognition as a Pattern Recognition Problem
The problem of discriminating between different mouth shapes during speech production can be viewed as a pattern recognition problem. In this case, the set of patterns is a set of feature vectors {x_i}, i = 1, 2, ..., P, each of them describing some mouth shape. The feature vector x_i is a representation of the mouth image: x_i can represent the mouth image at low level (i.e., the gray levels from a rectangular image region containing the mouth), or it can comprise geometric parameters (e.g., mouth width, height, perimeter) or the coefficients of a linear transformation of the mouth image. All the feature vectors in the set have the same number of components, M.
Let us denote the pattern classes by C_j, j = 1, 2, ..., Q, where Q is the total number of classes. Each class C_j is a group of patterns that represent mouth shapes corresponding to one viseme.
A network of Q parallel SVMs is designed, where the jth SVM is trained to classify test patterns into class C_j or its complement C_j^C. We must slightly deviate from the notation introduced in Section II, because a test pattern could be assigned to more than one class. It is convenient to represent the class label of a test pattern x_k by a (Q × 1) vector y_k whose jth element, y_kj, admits the value 1 if x_k ∈ C_j and −1 otherwise. More than one element of y_k may have the value 1, if f_j(x_k) > 0 for several j, where f_j(x_k) is the decision function of the jth SVM. To derive an unambiguous classification, we will use SVMs with probabilistic outputs; that is, the output of the jth SVM classifier will be the posterior probability for the test pattern x_k to belong to the class C_j, P(y_j = 1 | f_j(x_k)), given by (3). This pattern recognition problem can be applied to visual speech recognition in the following way:
• Each unknown pattern represents the image of the speaker's face at a certain time instant.
• Each class label represents one viseme.
Accordingly, we shall identify the probability that a given viseme is produced at any time instant in the spoken sequence. This gives the solution required at the first level of the proposed visual speech recognition system, to be passed to the second level. The network of Q parallel SVMs is shown in Figure 2.
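The first level can thus be sketched as a bank of Q classifiers evaluated on every frame. In this illustrative fragment the SVM decision functions f_j are replaced by stand-in linear projections, and the sigmoid parameters are fixed by hand rather than fitted:

```python
# Sketch of the first level: Q one-vs-rest classifiers, each mapped to a
# posterior through its own sigmoid, yield a Q-dimensional probability
# vector per frame. Decision functions are stand-ins, not trained SVMs.
import numpy as np

Q, M = 5, 256  # Q viseme classes, M-dimensional mouth-image features

rng = np.random.default_rng(1)
W = rng.normal(size=(Q, M))  # stand-in for the Q SVM decision functions
a1 = np.full(Q, -1.0)        # sigmoid slopes (a_1 < 0 by construction)
a2 = np.zeros(Q)

def frame_posteriors(x):
    """Posterior P(y_j = 1 | f_j(x)) for each of the Q viseme classes."""
    f = W @ x  # f_j(x), one margin per class
    return 1.0 / (1.0 + np.exp(a1 * f + a2))

x = rng.normal(size=M)  # one mouth-image feature vector
p = frame_posteriors(x)
print(p.shape)
```

Note that, since each SVM models only its own class against the rest, the Q posteriors need not sum to one; disambiguation is left to the Viterbi lattice of the second level.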

B. The Basic Structure of the SVM Network for Visual Speech Recognition
The phonetic transcription represents each word as a left-to-right sequence of phonemes. Moreover, the visemic model corresponding to the phonetic model of a word can easily be derived using a phoneme-to-viseme mapping. However, the aforementioned representation shows only which visemes are present in the pronunciation of the word, not the duration of each viseme. Let T_i, i = 1, 2, ..., S, denote the duration of the ith viseme in a word model of S visemes. Let T be the duration of the video sequence that results from the pronunciation of this word.
In order to align the video sequence of duration T with the symbolic visemic model of S visemes, we can create a temporal Viterbi lattice [23] containing as many states as there are frames in the video sequence, that is, T. Such a Viterbi lattice, corresponding to the pronunciation of the word "one", is depicted in Figure 3. For this example, the visemes present in the word pronunciation have been denoted with the same symbols as the underlying phones. Let D be the total number of visemic models defined for the words in the vocabulary. Each visemic model, w_d, d = 1, 2, ..., D, has its own Viterbi lattice. Each node in the lattice of Figure 3 is responsible for the generation of one observation that belongs to a certain class at each time instant. Let l_k ∈ {1, 2, ..., Q} be the class label to which the observation o_k generated at time instant k belongs. Let us denote the emission probability of that observation by b_{l_k}(o_k). Each solid line between any two nodes in the lattice represents a transition probability between two states. Let us denote by a_{l_k, l_{k+1}} the transition probability from the node corresponding to the class l_k at time instant k to the node corresponding to the class l_{k+1} at time instant k + 1. The class labels l_k and l_{k+1} may or may not be different.
Given a video sequence of T frames for a word and a Viterbi lattice for each visemic word model, w_d, d = 1, 2, ..., D, we can compute the probability that the visemic word model w_d is realized following the jth path in the Viterbi lattice as

p_d^(j) = b_{l_1}(o_1) ∏_{k=1}^{T−1} a_{l_k, l_{k+1}} b_{l_{k+1}}(o_{k+1}).     (6)

The probability that the visemic word model w_d is realized can then be computed as

p_d = max_{j=1,...,L} p_d^(j),     (7)

where L is the number of all possible paths in the lattice. Among the words that can be realized following any possible path in any of the D Viterbi lattices, the word described by the model whose probability p_d, d = 1, 2, ..., D, is maximum (i.e., the most probable word) is finally recognized.
In the visual speech recognition approach discussed in this paper, the emission probability b_{l_k}(o_k) is given by the corresponding SVM, SVM_{l_k}. To a first approximation, we assume equal transition probabilities a_{l_k, l_{k+1}} between any two states. Accordingly, it is sufficient to take into account only the probabilities b_{l_k}(o_k), k = 1, 2, ..., T, in the computation of the path probabilities, which yields the simplified equation

p_d^(j) = ∏_{k=1}^{T} b_{l_k}(o_k).     (8)

Of course, learning the probabilities a_{l_k, l_{k+1}} from word models would yield a more refined modeling. This could be a topic of future work.
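The resulting scoring procedure can be sketched as a log-domain dynamic program over a left-to-right lattice with equal (hence omitted) transition probabilities. The emission probabilities below are synthetic stand-ins for the SVM outputs, and the left-to-right topology (stay or advance) is an assumption about the lattice structure:

```python
# Sketch of the second level: score a visemic word model on a T-frame
# sequence using only the emission probabilities b_{l_k}(o_k), per the
# equal-transition simplification. Computation is done with log probabilities.
import numpy as np

def score_word_model(log_emissions, model_states):
    """
    log_emissions: (T, Q) array of log b_j(o_k) for frame k and viseme j.
    model_states:  left-to-right sequence of viseme indices for this word.
    Returns the log probability of the best path through the lattice.
    """
    T = log_emissions.shape[0]
    S = len(model_states)
    # dp[s] = best log score ending in model state s at the current frame.
    dp = np.full(S, -np.inf)
    dp[0] = log_emissions[0, model_states[0]]
    for k in range(1, T):
        new_dp = np.full(S, -np.inf)
        for s in range(S):
            # Left-to-right topology: stay in state s or advance from s-1.
            best_prev = dp[s] if s == 0 else max(dp[s], dp[s - 1])
            new_dp[s] = best_prev + log_emissions[k, model_states[s]]
        dp = new_dp
    return dp[-1]  # the path must end in the last viseme of the model

rng = np.random.default_rng(2)
T, Q = 12, 5
log_b = np.log(rng.uniform(0.05, 1.0, size=(T, Q)))  # synthetic emissions
score = score_word_model(log_b, model_states=[0, 3, 1])
print(score)
```

Recognition would then compute one such score per visemic word model and pick the model with the maximum score, as in the text.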

V. Experimental results
To evaluate the recognition performance of the proposed SVM-based visual speech recognizer, we chose the task of recognizing the first four digits in English. Toward this end, we used the small audiovisual database Tulips1 [7], frequently used in similar visual speech recognition experiments. While the number of words is small, this database is challenging due to the differences in illumination conditions, ethnicity, and gender of the subjects. We must also mention that, despite the small number of words pronounced in the Tulips1 database compared to vocabularies for real-world applications, the portion of the English phonemes covered by these four words is fairly large: 10 out of the 48 appearing in the ARPABET table, i.e., approximately 20%. Since we use viseme-oriented models, and the visemes are essentially representations of phonemes in the visual domain, we consider the results described in this section significant.
Solving the proposed task requires first the design of a particular visual speech recognizer according to the strategy presented in Section IV. The design involves the following steps:
1. define the phoneme-to-viseme mapping;
2. build the SVM network;
3. train the SVMs for viseme classification;
4. generate and implement the word models as Viterbi lattices.
Then we use the trained visual speech recognizer to assess its recognition performance on test video sequences.

A. Experimental Protocol
We start the design of the visual speech recognizer with the definition of the viseme classes for the first four digits in English. We first obtain the phonetic transcriptions of the first four digits, using the CMU pronunciation dictionary [30]:
"one" −→ "W-AH-N"
"two" −→ "T-UW"
"three" −→ "TH-R-IY"
"four" −→ "F-AO-R".
We then try to define the viseme classes so that
• a viseme class includes as few phonemes as possible;
• we have as few different visual realizations of the same viseme as possible.
The definition of the viseme classes was done based on visual examination of the video part of the Tulips1 database. The clustering of the different mouth images into viseme classes was done manually, based on the visual similarity of these images. By this procedure we obtained the viseme classes described in Table II and the phoneme-to-viseme mapping given in Table III.

TABLE II
Viseme classes defined for the four words of the Tulips1 database [7].

(Table columns: viseme group index, symbolic notation, viseme description.)

We have to define and train one SVM for each viseme. To employ SVMs, one should define the features used to represent each mouth image and select the kernel function. Since the recognition and generalization performance of each SVM is strongly influenced by the selection of the kernel function and the kernel parameters, we devoted much attention to these issues. We trained each SVM using the linear, the polynomial, and the RBF kernel functions. In the case of the polynomial kernel, the degree of the polynomial q was varied between 2 and 6. For each trained SVM, we compared the predicted error, precision, and recall on the training set, as computed by SVMLight [31], for the different kernels and kernel parameters, and we finally selected the simplest kernel yielding the best values for these estimates. That kernel was the polynomial kernel of degree q = 3. The RBF kernel gave the same performance estimates as the polynomial kernel of degree q = 3 on the training set, but at the cost of a larger number of support vectors. A simple choice of feature vector, such as the collection of the gray levels from a rectangular region of fixed size containing the mouth, scanned row by row, has proved suitable whenever SVMs have been used for visual classification tasks [15]. More specifically, we used two types of features to conduct the visual speech recognition experiments:
• The first type comprised the gray levels of a rectangular region of interest around the mouth, downsampled to the size 16 × 16. Each mouth image is represented by a feature vector of length 256.
• The second type represented each mouth image frame at time T_f by a vector of double size (i.e., 512) that comprised the gray levels of the rectangular region of interest around the mouth, downsampled to the size 16 × 16 as previously, and the temporal derivatives of the gray levels, normalized to the range [0, L_max − 1], where L_max is the maximum gray level value in the mouth image. The temporal derivatives are simply the pixel-by-pixel gray level differences between the frames T_f and T_f − 1. These differences are the so-called delta features.
Some preprocessing of the mouth images was needed before training and testing the visual speech recognition system. It concerns the normalization of the mouth in scale, rotation, and position inside the image. Such preprocessing is needed because the mouth has a different scale, position in the image, and orientation toward the horizontal axis from utterance to utterance, depending on the subject and his or her position in front of the camera. To compensate for these variations, we applied the normalization procedure of mouth images with respect to scale, translation, and rotation described in [6].
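The construction of the second feature type can be sketched as follows. The downsampling method (block averaging) and the exact normalization of the differences are assumptions, since the paper does not specify them:

```python
# Sketch of the 512-dimensional feature of Section V: 256 downsampled gray
# levels concatenated with 256 delta features (frame differences rescaled
# to [0, L_max - 1]). Block-average downsampling is an assumption.
import numpy as np

def downsample_16x16(img):
    """Block-average a (H, W) gray-level image down to 16 x 16."""
    h, w = img.shape
    cropped = img[: h - h % 16, : w - w % 16]
    return cropped.reshape(16, cropped.shape[0] // 16,
                           16, cropped.shape[1] // 16).mean(axis=(1, 3))

def feature_vector(frame_t, frame_prev, l_max=256):
    """512-dim feature: 256 gray levels + 256 rescaled temporal derivatives."""
    g_t = downsample_16x16(frame_t).ravel()
    g_p = downsample_16x16(frame_prev).ravel()
    delta = g_t - g_p  # pixel-by-pixel difference between T_f and T_f - 1
    span = delta.max() - delta.min()
    if span > 0:
        delta = (delta - delta.min()) / span * (l_max - 1)
    return np.concatenate([g_t, delta])

rng = np.random.default_rng(3)
f_prev = rng.integers(0, 256, size=(64, 64)).astype(float)
f_curr = rng.integers(0, 256, size=(64, 64)).astype(float)
x = feature_vector(f_curr, f_prev)
print(x.shape)
```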
The visual speech recognizer was tested for speaker-independent recognition using the leave-one-out testing strategy for the 12 subjects in the Tulips1 database. This implies training the visual speech recognizer 12 times, each time using only 11 subjects for training and leaving the 12th out for testing. In each case, we first trained the SVMs, and then the sigmoidal mappings for converting the SVM outputs to probabilities. The training set for each SVM in each system configuration was defined manually. Only the video sequences from the so-called Set 1 of the Tulips1 database were used for training. The labeling of all the frames from Set 1 (a total of 48 video sequences) was done manually, by visual examination of each frame. All frames were labeled according to Table III, except the transition frames between two visemes; initially, the same viseme class was labeled separately for each subject. Finally, we compared the similarity of the frames corresponding to the same viseme across different subjects and decided whether the classes could be merged. The disadvantage of this approach is the large amount of time needed for labeling, which would not be needed if HMMs were used for segmentation. A compromise solution could be an automatic phoneme-level segmentation of the audio sequence, reused on the aligned video sequence.
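The leave-one-out protocol above can be sketched as a simple loop over held-out subjects. The function `trainAndTest` is a hypothetical stand-in for the full train/decode pipeline, which is far more involved in practice:

```cpp
#include <cstddef>
#include <vector>

// Leave-one-out over the 12 Tulips1 subjects (protocol sketch only):
// train on 11 subjects, test on the held-out one, and average the word
// recognition rate over all folds. trainAndTest is a hypothetical
// stand-in returning the WRR for one fold.
double leaveOneOutWRR(const std::vector<int>& subjects,
                      double (*trainAndTest)(const std::vector<int>&, int)) {
    double sum = 0.0;
    for (std::size_t i = 0; i < subjects.size(); ++i) {
        std::vector<int> train;  // the 11 training subjects of this fold
        for (std::size_t j = 0; j < subjects.size(); ++j)
            if (j != i) train.push_back(subjects[j]);
        sum += trainAndTest(train, subjects[i]);
    }
    return sum / subjects.size();  // average WRR over the folds
}
```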
The configuration of the Viterbi lattice depends on the length of the test sequence through its number of frames T_tst (as illustrated in Fig. 3), and it was generated automatically at runtime for each test sequence. The number of Viterbi lattices can be determined in advance, because it is equal to the total number of visemic word models. Thus, taking into account the phonetic descriptions of the four words of the vocabulary and the phoneme-to-viseme mappings in Table III, we have 3 visemic word models for the word "one", 3 models for "two", 4 models for "three", and 6 models for "four". The multiple visemic models per word are due to the variability in speakers' pronunciation.
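Scoring a test sequence against one visemic word model amounts to a Viterbi pass over the lattice. The following is a minimal sketch, not the paper's implementation: it assumes a left-to-right topology in which a state can either repeat or advance to the next viseme, and, as in the paper, it takes the transition probabilities as uniform (untrained), so only the SVM posteriors contribute to the score.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Minimal Viterbi scoring of one visemic word model (sketch; function
// name and layout are ours). post[t][s] is the posterior probability
// produced at frame t by the SVM of the s-th viseme in the model.
// Topology: from state s one may stay in s or advance to s + 1.
double viterbiScore(const std::vector<std::vector<double>>& post) {
    const std::size_t T = post.size(), S = post[0].size();
    const double NEG_INF = -std::numeric_limits<double>::infinity();
    std::vector<double> delta(S, NEG_INF);
    delta[0] = std::log(post[0][0]);  // decoding starts in the first viseme
    for (std::size_t t = 1; t < T; ++t) {
        std::vector<double> next(S, NEG_INF);
        for (std::size_t s = 0; s < S; ++s) {
            double best = delta[s];                                 // stay
            if (s > 0 && delta[s - 1] > best) best = delta[s - 1];  // advance
            if (best > NEG_INF) next[s] = best + std::log(post[t][s]);
        }
        delta.swap(next);
    }
    return delta[S - 1];  // decoding must end in the last viseme
}
```

The recognized word is the one whose best model attains the highest score over all lattices.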
In each of the 12 leave-one-out tests, the test sequences are the video sequences corresponding to the pronunciation of the four words, with two pronunciations available per word and speaker. This leads to a subtotal of 8 test sequences per leave-one-out test, and a total of 12 × 8 = 96 test sequences per system configuration.
The complete visual speech recognizer was implemented in C++. We used the publicly available SVMLight toolkit modules for the training of the SVMs [31]. We implemented in C++ the module for learning the sigmoidal mappings of the SVM outputs to probabilities and the module for generating the Viterbi lattice models based on SVMs with probabilistic outputs. All these modules were integrated into the visual speech recognition system, whose architecture is structured into two modules: the training module and the test module.
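The sigmoidal mapping of a raw SVM decision value to a posterior probability mentioned above is commonly written in the Platt form; a minimal sketch (the parameter values below are illustrative only, since A and B are fitted to training data):

```cpp
#include <cmath>

// Sigmoidal mapping of a raw SVM decision value f(x) to a posterior
// probability P(class | x) = 1 / (1 + exp(A * f(x) + B)). A and B are
// learned per SVM from held-out training examples; function name is ours.
double svmPosterior(double decisionValue, double A, double B) {
    return 1.0 / (1.0 + std::exp(A * decisionValue + B));
}
```

With A < 0 and B = 0, a decision value of 0 (the separating hyperplane) maps to probability 0.5, and larger positive margins map to probabilities approaching 1.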
Two visual speech recognizers were implemented, trained, and tested with the aforementioned strategy. They differ in the type of features used: the first system (without delta features) did not include temporal derivatives in the feature vector, while the second (with delta features) also included the temporal derivatives between consecutive frames.

B. Performance Evaluation
In this section, we present the experimental results obtained with the proposed system both without and with delta features. Moreover, we compare these results to others reported in the literature for the same experiment on the Tulips1 database. The word recognition rates (WRR) have been averaged over the 96 tests obtained by applying the leave-one-out principle. Five figures of merit are provided:
1. The WRR per subject obtained by the proposed method when delta features are used, compared to that of Luettin and Thacker [6] (Table IV).
2. The overall WRR for all subjects and pronunciations, with and without delta features, compared to that obtained by Luettin and Thacker [6], Movellan [7], Gray et al. [9], and Movellan et al. [8] (Table V).
3. The confusion matrix between the words actually presented to the classifier and the words recognized, shown in Table VI and compared, in percentages, to the average human confusion matrix [7] (Table VII).
4. The accuracy of the viseme segmentations resulting from the Viterbi lattices.
5. The 95% confidence intervals for the WRRs of the several systems included in the comparisons (Table VIII), which provide an estimate of the performance of the systems for a much larger number of subjects.
We would like to note that human subjects untrained in lip reading achieved, under similar experimental conditions, a WRR of 89.93%, whereas hearing-impaired subjects had an average performance of 95.49% [7]. From the examination of Table V, it can be seen that our WRR is equal to the best rate reported in [6] and just 1.1% below the recently reported rates in [9,8]. However, the features used in the proposed method are simpler than those used with HMMs to obtain the same or higher WRRs. For the shape + intensity models of [6], the gray levels must be sampled in the exact subregion of the mouth image containing the lips, around the inner and outer lip contours, and must exclude the skin areas. Accordingly, the method reported in [6] requires tracking the lip contour in each frame, which increases the processing time of visual speech recognition. For the method reported in [9], a large amount of local processing is needed, through a bank of linear shift-invariant filters whose response filters are ICA or PCA kernels of very small size (12 × 12 pixels). The obtained WRR is higher than those reported in [7], where similar features are used, namely the gray levels of the region of interest (ROI) comprising the mouth, after some simple preprocessing steps. The preprocessing in [7] was vertical symmetry enforcement of the mouth image by averaging, followed by low-pass filtering, subsampling, and thresholding.
Another measure of performance is given by comparing the confusion matrix of the proposed system with the average human confusion matrix provided in [7]. The accuracy of the viseme segmentation resulting from the best Viterbi lattices was computed, using as reference the manual segmentation of frames into the viseme classes (Table III), as the percentage of correctly classified frames. We obtained an accuracy of 89.33%, which is just 1.27% lower than the WRR.
The results obtained demonstrate that the SVM-based dynamic network is a very promising alternative to the existing methods for visual speech recognition. An improvement of the WRR is expected when the training of the transition probabilities is implemented and the trained transition probabilities are incorporated into the Viterbi decoding lattices.
To assess the statistical significance of the observed rates, we model the ensemble {test patterns, recognition algorithm} as a source of binary events, 1 for correct recognition and 0 for an error, with probability p of drawing a 1 and (1 − p) of drawing a 0. These events can be described by Bernoulli trials. Let us denote by p̂ the estimate of p. The exact confidence interval of p is the segment between the two roots of the quadratic equation [32]:
(K + z_u²) p² − (2K p̂ + z_u²) p + K p̂² = 0,
where z_u is the u-percentile of the standard Gaussian distribution with zero mean and unit variance, and K = 96 is the total number of tests conducted. We computed the 95% confidence intervals for the WRR of the proposed approach and also for the WRRs reported in the literature [6,7,9,8], as summarized in Table VIII.
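Solving the quadratic for its two roots gives the interval directly (this is the Wilson score interval). A minimal sketch, with function name ours and z = 1.96 for the 95% level:

```cpp
#include <cmath>
#include <utility>

// 95% confidence interval for a Bernoulli success probability: the two
// roots of (K + z^2) p^2 - (2 K pHat + z^2) p + K pHat^2 = 0, where pHat
// is the observed rate, K the number of trials, and z the Gaussian
// percentile (1.96 for a 95% interval). Function name is ours.
std::pair<double, double> wilsonInterval(double pHat, int K, double z = 1.96) {
    const double z2 = z * z;
    const double a = K + z2;                    // quadratic coefficient of p^2
    const double b = -(2.0 * K * pHat + z2);    // coefficient of p
    const double c = K * pHat * pHat;           // constant term
    const double disc = std::sqrt(b * b - 4.0 * a * c);
    return {(-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)};
}
```

For example, an observed rate of 0.90 over K = 96 trials yields an interval of roughly [0.82, 0.95], illustrating how wide the uncertainty remains with only 96 tests.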

C. Estimation of the SVM structure complexity
The complexity of the SVM structure can be estimated by the number of SVMs needed for the classification of each word, as a function of the number of frames T in the current word pronunciation. For the experiments reported here, if we take into account the total number of visemic word models, that is 16, and the number of possible states as a function of the frame index, we get: 6 SVMs for the classification of the first frame, 7 for the second frame, 8 for the before-last frame, 6 for the last frame, and 9 SVMs for each of the remaining frames. This leads to a total of 9 × T − 9 SVMs. As we can see, the number of SVM outputs to be estimated at each time instant is not large, so recognition could be done in real time, since the number of frames per word is generally small (on the order of 10). Of course, when scaling the system to an LVCSR application, a significantly larger number of context-dependent viseme SVMs will be required, affecting both training and recognition complexity.
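The count above can be verified arithmetically: 6 + 7 + 6 + 8 = 27 for the four special frames, plus 9(T − 4) for the middle frames, which sums to 9T − 9. A one-line check (helper name ours):

```cpp
// Number of SVM evaluations for a word pronunciation of T frames
// (T >= 4), using the per-frame counts quoted in the text: 6 (first)
// + 7 (second) + 9 per middle frame + 8 (before-last) + 6 (last).
// This simplifies to 9 * T - 9. Function name is ours.
int svmCount(int T) {
    return 6 + 7 + 9 * (T - 4) + 8 + 6;
}
```

For a typical 10-frame pronunciation this gives 81 SVM evaluations in total, i.e., at most 9 per frame.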

VI. Conclusions
In this paper, we proposed a new method for a visual speech recognition task. We employed SVM classifiers and integrated them into a Viterbi decoding lattice: each SVM output was converted to a posterior probability, and the SVMs with probabilistic outputs were then integrated into Viterbi lattices as nodes. We tested the proposed method on a small visual speech recognition task, namely the recognition of the first four digits in English. The features used were the simplest possible, that is, the raw gray level values of the mouth image and their temporal derivatives. Under these circumstances, we obtained a word recognition rate that competes with those of the state-of-the-art methods. Accordingly, SVMs are found to be promising classifiers for visual speech recognition tasks. The existing relationship between the phonetic and visemic models can also lead to an easy integration of the visual speech recognizer with its audio counterpart. In our future research, we will try to improve the performance of the visual speech recognizer by training the state transition probabilities of the Viterbi decoding lattice. Another topic of interest would be the integration of this type of visual recognizer with an SVM-based audio recognizer, to perform audio-visual speech recognition.

Fig. 1. From left to right: mouth shape during the realization of phone /O/; mouth shape during the realization of phone /F/, by subject Anthony in the Tulips1 database [7].

Fig. 2. Illustration of the parallel network of binary classifiers for viseme recognition.

Fig. 3. A temporal Viterbi lattice for the pronunciation of the word "one" in a video sequence of 5 frames.
Once the labeling was done, only the unambiguous positive viseme examples and the unambiguous negative viseme examples were included in the training sets. The feature vectors used in the training sets of all SVMs were the same; only their labeling as positive or negative examples differs from one SVM to another. This leads to unbalanced training sets, in the sense that the negative examples frequently outnumber the positive ones.
TABLE III
Speaker-independent mapping of phonemes to visemes, obtained by average linkage hierarchical clustering.

TABLE IV
The WRR for each subject in Tulips1, using (a) the SVM dynamic network with delta features; (b) AAM for inner and outer lip contours and HMM with delta features [6].

TABLE V
The overall WRR of the SVM dynamic network compared to that of other techniques.

TABLE VI
Confusion matrix for visual word recognition by the dynamic network of SVMs with delta features.
TABLE VIII
The 95% confidence intervals for the WRR of the proposed system compared to those of other techniques.