EURASIP Journal on Applied Signal Processing 2005:9, 1382–1399
© 2005 Hindawi Publishing Corporation

A Two-Channel Training Algorithm for Hidden Markov Model and Its Application to Lip Reading

The hidden Markov model (HMM) has been a popular mathematical approach for sequence classification tasks such as speech recognition since the 1980s. In this paper, a novel two-channel training strategy is proposed for discriminative training of HMMs. For the proposed training strategy, a novel separable-distance function that measures the difference between a pair of training samples is adopted as the criterion function. The symbol emission matrix of an HMM is split into two channels: a static channel to maintain the validity of the HMM and a dynamic channel that is modified to maximize the separable distance. The parameters of the two-channel HMM are estimated by iterative application of expectation-maximization (EM) operations. As an example of the application of the novel approach, a hierarchical speaker-dependent visual speech recognition system is trained using the two-channel HMMs. Results of experiments on identifying a group of confusable visemes indicate that the proposed approach is able to increase the recognition accuracy by an average of 20% compared with the conventional HMMs that are trained with the Baum-Welch estimation.


INTRODUCTION
The focus of most automatic speech recognition techniques is on the spoken sounds alone. If the speaking environment is noise free and the recognition engine is well configured, a high recognition rate is attainable for most speakers. However, in real-world environments such as offices, bus stations, shops, and factories, the captured speech may be heavily corrupted by background noise and cross-speaker noise. When such a signal is presented to a sound-based speech recognition system, the recognition accuracy may drop dramatically. One solution to enhance the speech recognition accuracy under noisy conditions is to jointly process information from multiple modalities of speech. Automatic lip reading is one mode in which the visual aspect of speech is considered for speech recognition.
Among the various techniques for visual speech recognition, hidden Markov model (HMM) holds the greatest promise due to its capabilities in modeling and analyzing temporal processes as reported in [9,18,19,20,21,22,23,24,25,26,27,28,29]. Most of the HMM-based visual speech processing systems reported take an individual word as the basic recognition unit and an HMM is trained to model it. Such an approach works well with limited vocabulary such as digit set [15,30], a small number of AVletters [31], and isolated words or nonsense words [32], but it is difficult to extend the methods to large-vocabulary recognition task as a great number of word models has to be trained. One solution to this problem is to build subword models such as phoneme models. Any word that is presented to the recognition system is broken down into subwords. In this way, even if a word is not included in training the system, the system can still make a good guess on its identity.
The smallest visibly distinguishable unit of visual speech is commonly referred to as viseme [33]. Like phonemes that are the basic building blocks of sound of a language, visemes are the basic constituents for the visual representation of words. The time variation of mouth shape in speech is small compared with the corresponding variation of acoustic waveform. Some previous experiments indicate that the traditional HMM classifiers, which are trained with the Baum-Welch algorithm, are sometimes incompetent to separate mouth shapes with small difference [34]. Such small difference has prompted some researchers to regard the relationship between phonemes and visemes as many-to-one mapping. For example, although phonemes /b/, /m/, /p/ are acoustically distinguishable, the sequence of mouth shape for the three sounds are not readily distinguishable, hence the three phonemes are grouped into one viseme category. An early viseme grouping was suggested by Binnie et al. [35]. The MPEG-4 multimedia standard adopted the same viseme grouping strategy for face animation, in which fourteen viseme groups are included [36]. However, different groupings are adopted by different researchers to fulfill specific requirements [37,38].
Motivated by the need to find an approach to differentiate visemes that are only slightly different, we propose a novel approach to improve the discriminative power of the HMM classifiers. The approach aims at amplifying the separable distance between a pair of training samples. A two-channel HMM is developed: one channel, called the static channel, is kept fixed to maintain the validity of the probabilistic framework, and the other channel, called the dynamic channel, is modified to amplify the difference between the training pair.
A hierarchical classifier is also proposed based on the two-channel training strategy. At the top level, broad identification is performed and fine identification is subsequently carried out within the broad category identified. Experimental results indicate that the proposed classifier outperforms the traditional ML HMM classifier in identifying the mouth shapes.
Although the proposed method is developed for the recognition of visemes, it can also be applied to any sequence classification problem. As such, the theoretical background and the training strategy of the two-channel discriminative training method are introduced first in Sections 2, 3, and 4. This is followed by discussion of the general properties and extensions of the training strategy in Sections 5 and 6, respectively. Details of the application of the method to viseme recognition and the experimental results obtained are given in Section 7. The concluding remarks are presented in Section 8.

REVIEW OF HIDDEN MARKOV MODEL
Hidden Markov model is also referred to as hidden Markov process (HMP) as the latter emphasizes the stochastic process rather than the model itself. HMP was first introduced by Baum and Petrie [39] in 1966. The basic theories/properties of HMP were introduced in full generality in a series of papers by Baum and his colleagues [40,41,42,43], which include the convergence of the entropy function of an HMP, the computation of the conditional probability, and the local convergence of the maximal likelihood (ML) parameter estimation of HMM. Application of HMM to speech processing took place in the mid-1970s. A phonetic speech recognition system that adopts HMM-based classifier was first developed in IBM [44,45]. Applications of HMM for speech processing were further explored by Rabiner and Juang [46,47].
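The beauty of HMM is that it is able to reveal the underlying process of signal generation even though the properties of the signal source remain largely unknown. Assume that $O^M = \{O_1, O_2, \ldots, O_M\}$ is the discrete set of observed symbols and $S^N = \{S_1, S_2, \ldots, S_N\}$ is the set of states; an $N$-state-$M$-symbol discrete HMM $\theta(\pi, A, B)$ consists of the following three components.
(1) The probability array of the initial state: $\pi = [\pi_i] = [P(s_1 = S_i)]$, $i = 1, 2, \ldots, N$, where $s_t$ is the state at time $t$.
(2) The state-transition matrix: $A = [a_{ij}] = [P(s_{t+1} = S_j \mid s_t = S_i)]$, $i, j = 1, 2, \ldots, N$.
(3) The symbol emission matrix: $B = [b_{ij}] = [P(o_t = O_j \mid s_t = S_i)]$, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, M$, where $o_t$ is the $t$th observed symbol in the observation sequence.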
In a $K$-class identification problem, assume that $x^T = (x_1, x_2, \ldots, x_T)$ is a sample of a particular class, say class $d_i$. The probability of occurrence of the sample $x^T$ given the HMM $\theta(\pi, A, B)$, denoted by $P(x^T \mid \theta)$, is computed using either the forward or the backward process, and the optimal hidden-state chain is revealed using Viterbi matching [46]. Training of the HMM is the process of determining the parameter set $\theta(\pi, A, B)$ that optimizes a certain criterion function such as $P(x^T \mid \theta)$ or the mutual information [46,48]. For training of the HMM, the Baum-Welch training algorithm is popularly adopted. The Baum-Welch algorithm is an ML estimation; thus the HMM so obtained, $\theta_{ML}$, is one that maximizes the probability $P(x^T \mid \theta)$. Mathematically,
$$\theta_{ML} = \arg\max_{\theta} P(x^T \mid \theta). \tag{1}$$
The Baum-Welch training can be realized at a relatively high speed as the expectation-maximization (EM) estimation is adopted in the training process.
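As an illustration of the forward process mentioned above, the following is a minimal sketch in plain Python. The function name `forward_probability` and the list-of-lists layout of $\pi$, $A$, and $B$ are assumptions made for this example, and no scaling is applied, so it is only suitable for short sequences.

```python
from typing import Sequence


def forward_probability(pi: Sequence[float],
                        A: Sequence[Sequence[float]],
                        B: Sequence[Sequence[float]],
                        obs: Sequence[int]) -> float:
    """Return P(obs | theta) for a discrete HMM theta = (pi, A, B).

    obs holds symbol indices (0-based) into the columns of B.
    """
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: P(obs | theta) = sum_i alpha_T(i)
    return sum(alpha)
```

For long sequences a scaled or log-domain variant would normally be preferred to avoid numerical underflow.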
However, the parameters of the HMM are solely determined by the correct samples while the relationship between the correct samples and incorrect ones is not taken into consideration. The method, in its original form, is thus not developed for fine recognition. If another sample $y^T$ of class $d_j$ ($j \neq i$) is similar to $x^T$, the scored probability $P(y^T \mid \theta)$ may be close to $P(x^T \mid \theta)$, and $\theta_{ML}$ may not be able to distinguish $x^T$ and $y^T$. One solution to this problem is to adopt a training strategy that maximizes the mutual information $I_M(\theta, x^T)$ defined as
$$I_M(\theta, x^T) = \log P(x^T \mid \theta) - \log \sum_{\theta'} P(x^T \mid \theta')\, P(\theta'). \tag{2}$$
This method is referred to as maximum mutual information (MMI) estimation [48]. It increases the a posteriori probability of the model corresponding to the training data, and thus the overall discriminative power of the HMM obtained is guaranteed. However, analytical solutions to (2) are difficult to realize and implementation of MMI estimation is tedious. A computationally less intensive approach is desirable.

PRINCIPLES OF TWO-CHANNEL HMM
To improve the discriminative ability of HMM and at the same time, to facilitate the process of parameter tuning, the following two-channel training method is proposed, where the HMM is specially tailored to amplify the difference between two similar samples.
The block diagram of the two-channel HMM is given in Figure 1. It consists of a static-channel HMM and a dynamic-channel HMM. For the static channel, a normal HMM derived from a parameter-smoothed ML approach is used. A new HMM for the dynamic channel is to be derived. Details of the derivation of the dynamic-channel HMM are described in the following paragraphs.
Assume that in a two-class identification problem, $\{x^T : d_1\}$ and $\{y^T : d_2\}$ are a pair of training samples, where $x^T = (x^T_1, x^T_2, \ldots, x^T_T)$ and $y^T = (y^T_1, y^T_2, \ldots, y^T_T)$ are observation sequences of length $T$ and $d_1$ and $d_2$ are the class labels. The observed symbols in $x^T$ and $y^T$ are from the symbol set $O^M$. $P(x^T \mid \theta)$ and $P(y^T \mid \theta)$ are the scored probabilities for $x^T$ and $y^T$ given HMM $\theta$, respectively. The pair of training samples $x^T$ and $y^T$ must be of the same length so that their probabilities $P(x^T \mid \theta)$ and $P(y^T \mid \theta)$ can be suitably compared. Such a comparison is meaningless if the samples are of different lengths; the shorter sequence may give a larger probability than the longer one even if it is not the true sample of $\theta$.
Define a new function $I(x^T, y^T, \theta)$, called the separable-distance function, as follows:
$$I(x^T, y^T, \theta) = \log P(x^T \mid \theta) - \log P(y^T \mid \theta). \tag{3}$$
A large value of $I(x^T, y^T, \theta)$ would mean that $x^T$ and $y^T$ are more distinct and separable. The strategy then is to determine the HMM $\theta_{MSD}$ (MSD for maximum separable distance) that maximizes $I(x^T, y^T, \theta)$. Mathematically,
$$\theta_{MSD} = \arg\max_{\theta} I(x^T, y^T, \theta). \tag{4}$$
For the proposed training strategy, the parameter set for the static-channel HMM is determined in the normal way such as the ML approach. For the dynamic-channel HMM, to maintain synchronization of the duration and transition of states, the same set of values for $\pi$ and $A$ as derived for the static-channel HMM is used; only the parameters of matrix $B$ are adjusted.
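The separable distance can be evaluated directly from two forward passes. The sketch below assumes that the distance is the difference of the two log-likelihoods, $\log P(x^T \mid \theta) - \log P(y^T \mid \theta)$; the helper names `likelihood` and `separable_distance` are hypothetical, and the code assumes both likelihoods are nonzero (which parameter smoothing is meant to ensure).

```python
import math
from typing import Sequence


def likelihood(pi, A, B, obs: Sequence[int]) -> float:
    """Unscaled forward algorithm: P(obs | theta) for a discrete HMM."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)


def separable_distance(pi, A, B, x: Sequence[int], y: Sequence[int]) -> float:
    """I(x, y, theta) = log P(x | theta) - log P(y | theta) for equal-length x, y."""
    if len(x) != len(y):
        raise ValueError("the training pair must have the same length")
    return math.log(likelihood(pi, A, B, x)) - math.log(likelihood(pi, A, B, y))
```

A positive value means the model scores `x` higher than `y`; swapping the arguments flips the sign.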
As a first step towards the maximization of the separable-distance function $I(x^T, y^T, \theta)$, an auxiliary function $F(x^T, y^T, \theta, \lambda)$ involving $I(x^T, y^T, \theta)$ and the parameters of $B$ is defined as
$$F(x^T, y^T, \theta, \lambda) = I(x^T, y^T, \theta) + \sum_{i=1}^{N} \lambda_i \Bigl(1 - \sum_{j=1}^{M} b_{ij}\Bigr), \tag{5}$$
where $\lambda_i$ is the Lagrange multiplier for the $i$th state and $\sum_{j=1}^{M} b_{ij} = 1$ ($i = 1, 2, \ldots, N$). By maximizing $F(x^T, y^T, \theta, \lambda)$, $I(x^T, y^T, \theta)$ is also maximized. Differentiating $F(x^T, y^T, \theta, \lambda)$ with respect to $b_{ij}$ and setting the result to 0, we have
$$\frac{\partial \log P(x^T \mid \theta)}{\partial b_{ij}} - \frac{\partial \log P(y^T \mid \theta)}{\partial b_{ij}} = \lambda_i. \tag{6}$$
Since $\lambda_i$ is positive, the optimum value obtained for $I(x^T, y^T, \theta)$ is a maximum as solutions for $b_{ij}$ must be positive. In (6), $\log P(x^T \mid \theta)$ and $\log P(y^T \mid \theta)$ may be computed by summing up all the probabilities over time $T$ (equation (7)). Note that the state-transition coefficients $a_{ij}$ do not appear explicitly in (7); they are included in the term $P(s^T_\tau = S_i)$.
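The two partial derivatives in (6) may be evaluated separately as follows:
$$\frac{\partial \log P(x^T \mid \theta)}{\partial b_{ij}} = \frac{1}{b_{ij}} \sum_{\tau:\, x^T_\tau = O_j} P(s^T_\tau = S_i \mid \theta, x^T), \qquad \frac{\partial \log P(y^T \mid \theta)}{\partial b_{ij}} = \frac{1}{b_{ij}} \sum_{\tau:\, y^T_\tau = O_j} P(s^T_\tau = S_i \mid \theta, y^T). \tag{8}$$
By defining
$$E(S_i, O_j \mid \theta, x^T) = \sum_{\tau:\, x^T_\tau = O_j} P(s^T_\tau = S_i \mid \theta, x^T), \qquad D_{ij}(x^T, y^T, \theta) = E(S_i, O_j \mid \theta, x^T) - E(S_i, O_j \mid \theta, y^T), \tag{9}$$
equation (6) can be written as
$$\frac{D_{ij}(x^T, y^T, \theta)}{b_{ij}} = \lambda_i. \tag{10}$$
By making use of the fact that $\sum_{j=1}^{M} b_{ij} = 1$, it can be shown that
$$b_{ij} = \frac{D_{ij}(x^T, y^T, \theta)}{\sum_{j=1}^{M} D_{ij}(x^T, y^T, \theta)}. \tag{11}$$
The set $\{b_{ij}\}$ ($i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, M$) so obtained gives the maximum value of $I(x^T, y^T, \theta)$.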
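An algorithm for the computation of these values may be developed by using the standard expectation-maximization (EM) technique. By considering $x^T$ and $y^T$ as the observed data and the state sequence $s^T = (s^T_1, s^T_2, \ldots, s^T_T)$ as the hidden or unobserved data, the estimation of $E_\theta(I) = E[I(x^T, y^T, s^T \mid \bar\theta) \mid x^T, y^T, \theta]$ from the incomplete data $x^T$ and $y^T$ is given by (12) [49], where $\theta$ and $\bar\theta$ are the HMM before training and the HMM after training, respectively, and $S$ denotes all the state combinations with length $T$. The purpose of the E-step of the EM estimation is to calculate $E_\theta(I)$. By using the auxiliary function $Q_x(\bar\theta, \theta)$ proposed in [48] and defined as follows:
$$Q_x(\bar\theta, \theta) = \sum_{s^T \in S} P(x^T, s^T \mid \theta) \log P(x^T, s^T \mid \bar\theta),$$
equation (12) can be written in terms of $Q_x(\bar\theta, \theta)$ and $Q_y(\bar\theta, \theta)$ (equation (14)). $Q_x(\bar\theta, \theta)$ and $Q_y(\bar\theta, \theta)$ may be further analyzed by breaking up the probability $P(x^T, s^T \mid \bar\theta)$ as follows:
$$P(x^T, s^T \mid \bar\theta) = \bar\pi_{s_0} \prod_{\tau=1}^{T} \bar a_{s_{\tau-1} s_\tau}\, \bar b_{s_\tau}(x_\tau),$$
where $\bar\pi$, $\bar a$, and $\bar b$ are the parameters of $\bar\theta$. Here, we assume that the initial distribution starts at $\tau = 0$ instead of $\tau = 1$ for notational convenience. The $Q$ function then becomes
$$Q_x(\bar\theta, \theta) = \sum_{s^T \in S} \log \bar\pi_{s_0}\, P(x^T, s^T \mid \theta) + \sum_{s^T \in S} \Bigl(\sum_{\tau=1}^{T} \log \bar a_{s_{\tau-1} s_\tau}\Bigr) P(x^T, s^T \mid \theta) + \sum_{s^T \in S} \Bigl(\sum_{\tau=1}^{T} \log \bar b_{s_\tau}(x_\tau)\Bigr) P(x^T, s^T \mid \theta). \tag{16}$$
The parameters to be optimized are now separated into three independent terms.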
From (14) and (16), $E_\theta(I)$ can also be divided into the following three terms:
$$E_\theta(I) = E_\theta(\bar\pi, I) + E_\theta(\bar a, I) + E_\theta(\bar b, I), \tag{17}$$
where $E_\theta(\bar\pi, I)$ and $E_\theta(\bar a, I)$ are associated with the hidden-state sequence $s^T$. It is assumed that $x^T$ and $y^T$ are drawn independently and emitted from the same state sequence $s^T$; hence both $E_\theta(\bar\pi, I)$ and $E_\theta(\bar a, I)$ become 0. $E_\theta(\bar b, I)$, on the other hand, is related to the symbols that appear in $x^T$ and $y^T$ and contributes to $E_\theta(I)$. Enumerating all the state combinations and arranging the resulting sum according to the order of appearance of the symbols $O_j$ within $x^T$ and $y^T$ gives (20) or, equivalently, (21). In the M-step of the EM estimation, $\bar b_{ij}$ is adjusted to maximize $E_\theta(\bar b, I)$ or $E_\theta(I)$. Since $\sum_{j=1}^{M} \bar b_{ij} = 1$ and (21) has the form $K \sum_{j=1}^{M} w_j \log v_j$, which attains a global maximum at the point $v_j = w_j / \sum_{j=1}^{M} w_j$ ($j = 1, 2, \ldots, M$), the reestimated value $\bar b_{ij}$ of $\bar\theta$ that leads to the maximum $E_\theta(I)$ is given by
$$\bar b_{ij} = \frac{D_{ij}(x^T, y^T, \theta)}{\sum_{j=1}^{M} D_{ij}(x^T, y^T, \theta)}, \tag{22}$$
where $D_{ij}(x^T, y^T, \theta)$ is the difference between the expected numbers of emissions of symbol $O_j$ from state $S_i$ for $x^T$ and for $y^T$ under $\theta$. This equation, compared with (11), enables the reestimation of the symbol emission coefficients $\bar b_{ij}$ from expectations of the existing HMM. The above derivations strictly observe the standard optimization strategy [49], where the expectation of the value of the separable-distance function, $E_\theta(I)$, is computed in the E-step and the coefficients $\bar b_{ij}$ are adjusted to maximize $E_\theta(I)$ in the M-step. The convergence of the method is therefore guaranteed. However, $\bar b_{ij}$ may not be estimated by applying (22) alone; other considerations have to be taken into account, such as the case when $D_{ij}(x^T, y^T, \theta)$ is less than or equal to 0. Further discussion on the determination of the values of $\bar b_{ij}$ is given in the subsequent sections.
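The M-step relies on the fact that $\sum_j w_j \log v_j$, subject to $\sum_j v_j = 1$ and with positive weights, is maximized at $v_j = w_j / \sum_j w_j$. The sketch below illustrates this normalization numerically; the function names are hypothetical, and the weights stand in for positive values of $D_{ij}$ within one state (non-positive values need the special treatment discussed later in the text).

```python
import math
from typing import List, Sequence


def weighted_log_objective(w: Sequence[float], v: Sequence[float]) -> float:
    """Evaluate sum_j w_j * log(v_j)."""
    return sum(wj * math.log(vj) for wj, vj in zip(w, v))


def normalize_weights(w: Sequence[float]) -> List[float]:
    """Return v_j = w_j / sum_j w_j, the maximizer under sum_j v_j = 1.

    Assumes every weight is strictly positive.
    """
    if any(wj <= 0 for wj in w):
        raise ValueError("weights must be strictly positive")
    total = sum(w)
    return [wj / total for wj in w]
```

Any other probability vector, for example the uniform one, gives a lower value of the objective for the same weights.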
To modify the parameters according to (22) and simultaneously ensure the validity of the model, a two-channel structure as depicted in Figure 2 is proposed. The elements $b_{ij}$ of matrix $B$ of the two-channel HMM are decomposed into two parts as
$$b_{ij} = b^s_{ij} + b^d_{ij}, \tag{23}$$
with $b^s_{ij}$ for the static channel and $b^d_{ij}$ for the dynamic channel. The dynamic-channel coefficients $b^d_{ij}$ are the key source of the discriminative power. The static-channel coefficients $b^s_{ij}$ are computed from the parameter-smoothed ML HMM and then weighted. As long as $b_{ij}$ computed from (22) is greater than $b^s_{ij}$, $b^d_{ij}$ is determined as the difference between $b_{ij}$ and $b^s_{ij}$ according to (23); otherwise $b^d_{ij}$ is set to 0.
To avoid the occurrence of zero or negative probability, $b^s_{ij}$ ($\forall i = 1, 2, \ldots, N$, $\forall j = 1, 2, \ldots, M$) should be kept greater than 0 in the training procedure and, at the same time,
$$\sum_{j=1}^{M} \bigl(b^s_{ij} + b^d_{ij}\bigr) = 1, \quad i = 1, 2, \ldots, N.$$
In addition, the relative weightage of the static channel and the dynamic channel may be controlled by the credibility weighing factor $\omega_i$ ($i = 1, 2, \ldots, N$) (different states may have different values). If the weightage of the dynamic channel is set to be $\omega_i$ by scaling of the coefficients,
$$\sum_{j=1}^{M} b^d_{ij} = \omega_i,$$
then the weightage of the static channel has to be set as follows:
$$\sum_{j=1}^{M} b^s_{ij} = 1 - \omega_i.$$

Parameter initialization
The parameter-smoothed ML HMM of $x^T$, $\theta^x_{ML}$, which is trained using the Baum-Welch estimation, is referred to as the base HMM. The static-channel HMM is derived from the base HMM after applying the scaling factor. Parameter smoothing is carried out for $\theta^x_{ML}$ to prevent the occurrence of zero probability. It is a simple adjustment in which $b_{ij}$ is set to some minimum value, for example, $\varepsilon = 10^{-3}$, if the estimated conditional probability $b_{ij} = 0$ [46]. As a result, even if symbol $O_j$ never appears in the training set, there is still a nonzero probability of its occurrence in $\theta^x_{ML}$. Parameter smoothing is a posttraining adjustment that decreases the error rate because the training set, which is usually limited in size, may not cover erratic samples.
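Parameter smoothing of the emission matrix can be sketched as follows. Flooring zero entries at $\varepsilon$ follows the description above; renormalizing each row afterwards so that it still sums to one is an assumption of this example, as is the function name `smooth_emissions`.

```python
from typing import List, Sequence


def smooth_emissions(B: Sequence[Sequence[float]], eps: float = 1e-3) -> List[List[float]]:
    """Replace zero emission probabilities by eps, then renormalize each row."""
    smoothed = []
    for row in B:
        floored = [eps if b == 0 else b for b in row]
        total = sum(floored)
        # Renormalization is an assumption; it keeps each row a valid distribution.
        smoothed.append([b / total for b in floored])
    return smoothed
```

After smoothing, a symbol that never occurred in training still receives a small nonzero emission probability in every state.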
Before carrying out discriminative training, ω i (credibility weighing factor of the ith state), b s i j (static-channel coefficients), and b d i j (dynamic-channel coefficients) are initialized.
The static-channel coefficients $b^s_{ij}$ are given by
$$b^s_{ij} = (1 - \omega_i)\, b_{ij},$$
where $b_{ij}$ on the right-hand side is the symbol emission probability of $\theta^x_{ML}$. As for the dynamic-channel coefficients $b^d_{ij}$, a random or uniform initial distribution usually works well. In the experiments conducted in this paper, uniform values equal to $\omega_i / M$ are assigned to the $b^d_{ij}$ as initial values. The selection of $\omega_i$ is flexible and largely problem-dependent. A large value of $\omega_i$ means that a large weightage is assigned to the dynamic channel and the discriminative power is enhanced. However, as we adjust $b^d_{ij}$ in the direction of increasing $I(x^T, y^T, \theta)$, the probability of the correct observation $P(x^T \mid \theta)$ will normally decrease. This situation is undesirable because the two-channel HMM obtained is unlikely to generate even the correct samples.
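The initialization of the two channels can be sketched as below, assuming that the static channel is the base emission matrix scaled by $1 - \omega_i$ and that the dynamic channel starts from the uniform value $\omega_i / M$. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def init_two_channel(B_ml: Sequence[Sequence[float]],
                     omega: Sequence[float]) -> Tuple[Matrix, Matrix]:
    """Return (static, dynamic) emission channels for each state.

    static[i][j]  = (1 - omega[i]) * B_ml[i][j]   (assumed scaling of the base HMM)
    dynamic[i][j] = omega[i] / M                  (uniform initial values)
    """
    M = len(B_ml[0])
    static = [[(1.0 - w) * b for b in row] for row, w in zip(B_ml, omega)]
    dynamic = [[w / M] * M for w in omega]
    return static, dynamic


def combine_channels(static: Matrix, dynamic: Matrix) -> Matrix:
    """Sum the two channels to obtain the emission matrix of the two-channel HMM."""
    return [[s + d for s, d in zip(srow, drow)]
            for srow, drow in zip(static, dynamic)]
```

With this construction each row of the combined matrix sums to one whenever the rows of the base matrix do, so the model remains a valid HMM before any dynamic-channel update.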
A guideline for the determination of the value of $\omega_i$ is as follows. If the training pairs are very similar to each other such that $P(x^T \mid \theta^x_{ML}) \approx P(y^T \mid \theta^x_{ML})$, $\omega_i$ should be set to a large value to guarantee good discrimination; on the other hand, if $P(x^T \mid \theta^x_{ML}) \gg P(y^T \mid \theta^x_{ML})$, $\omega_i$ should be set to a small value to make $P(x^T \mid \theta)$ reasonably large. In addition, different values will be used for different states because they contribute differently to the scored probabilities. However, the values of $\omega_i$ for the different states should not differ greatly.
Based on the above considerations, the following procedures are taken to determine ω i . Given the base HMM θ x ML and the training pair x T and y T , the optimal state chains are searched using the Viterbi algorithm. If θ x ML is a left-right model and the expected (optimal) duration of the ith state is then written as follows: . This probability may be computed as follows: P dur (x T , S i | θ x ML ) may also be computed using the forward variables α x [46]. However, if θ x ML is not a left-right model but an ergodic model, the expected duration of a state will consist of a number of separated time slices, for example, k slices such as t i1 to t i1 + τ i1 , t i2 to t i2 + τ i2 , and t ik to t ik + τ ik . P dur (x T , S i | θ x ML ) is then computed by multiplying them together as shown: The value of ω i is derived by comparing the corre- , this indicates that the coefficients of the ith state of the base model are good enough for discrimination, ω i should be set to a small value to preserve the original ML configurations.
, this indicates that state S i is not able to distinguish between x T and y T , thus ω i must be set to a value large enough to ensure HMM. In practice, ω i can be manually selected according to the conditions mentioned above (which is preferred), or they can be computed using the following expression: where . C (C > 0) and D are constants that jointly control the smoothness of ω i with respect to v. Since C > 0 and v > 0, ω i < 1, by using suitable values of C and D, a set of credibility factors ω i are computed for the states of the target HMM. For example, if the range of v is 10 −3 ∼ 10 5 , a typical setting is C = 1.0 and D = 0.1.
Once the values of ω i (i = 1, 2, . . . , N) are determined, they will not be changed in the training process.

Partition of the observation symbol set
Let θ denote the HMM with the above initial configurations. The coefficients of the dynamic channel are adjusted according to the following procedures. First, [46], the following two probabilities are computed: ξ y τ (i, j) and γ y τ (i) are obtained in the same manner. By counting the state, we have It is shown in (22) that to maximize I(x T , y T , θ), b i j should be set proportional to D i j (x T , y T , θ). However, for certain symbols, for example, O p , the expectation D ip (x T , y T , θ) may be less than 0. Since the symbol emission coefficients cannot take negative values, these symbols have to be specially treated.  where η is the threshold with a typical value of 1. η will be set to a larger value if it is required that the set V will contain fewer dominant symbols.

Modification to the dynamic channel
For each state, the symbol set is partitioned according to the procedures described in Section 4.2. As an example, consider the ith state. For symbols in the set U, the symbol emission should be set as small as possible.
is computed according to (34), which is derived from (22): where

Termination
Optimization is done through iteratively calling the training epoch described in Sections 4.2 and 4.3. After each epoch, the separable-distance I(x T , y T , θ) of the HMM θ obtained, is calculated and compared with that obtained in the last epoch. If I(x T , y T , θ) does not change more than a predefined value, training is terminated and the target two-channel HMM is established.
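The termination rule can be sketched as a generic loop. The callables `run_epoch` and `distance`, the tolerance `tol`, and the cap `max_epochs` are hypothetical names introduced for this illustration; `run_epoch` stands for one pass of the procedures of Sections 4.2 and 4.3, and `distance` for the evaluation of the separable distance.

```python
from typing import Callable, Tuple, TypeVar

Model = TypeVar("Model")


def train_until_converged(theta: Model,
                          run_epoch: Callable[[Model], Model],
                          distance: Callable[[Model], float],
                          tol: float = 1e-4,
                          max_epochs: int = 100) -> Tuple[Model, int]:
    """Repeat training epochs until the separable distance changes by at most tol.

    Returns the final model and the number of epochs that were run.
    """
    previous = distance(theta)
    for epoch in range(1, max_epochs + 1):
        theta = run_epoch(theta)
        current = distance(theta)
        if abs(current - previous) <= tol:
            return theta, epoch
        previous = current
    return theta, max_epochs
```

The cap on the number of epochs is a safeguard added in this sketch and is not part of the description above.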

State alignment
One of the requirements for the proposed training strategy is that the state durations of the training pair, say $x^T$ and $y^T$, are comparable. This is a requirement for (22). If the state durations, for example, $E(S_i \mid \theta, x^T)$ and $E(S_i \mid \theta, y^T)$, differ too much, $D_{ij}(x^T, y^T, \theta)$ will become meaningless. For example, if $E(S_i \mid \theta, x^T) \ll E(S_i \mid \theta, y^T)$, then even if the symbol $O_j$ takes a much greater portion in $E(S_i \mid \theta, x^T)$ than in $E(S_i \mid \theta, y^T)$, the computed $D_{ij}(x^T, y^T, \theta)$ may still be less than 0. The outcome is that $b_{ij}$ is always set to $b^s_{ij}$ rather than adjusted to increase $I(x^T, y^T, \theta)$. Fortunately, if the corresponding state durations of the training pair are very different, the normal ML HMMs are usually adequate to distinguish the states.
The following state-duration validation procedure is added to make the training strategy complete. After each training epoch, $E(S_i \mid \theta, x^T)$ and $E(S_i \mid \theta, y^T)$ are computed and compared with each other. Using the forward variables $\alpha^x_\tau(i)$ and backward variables $\beta^x_\tau(i)$, the state duration of $x^T$ is obtained as follows:
$$E(S_i \mid \theta, x^T) = \sum_{\tau=1}^{T} \frac{\alpha^x_\tau(i)\, \beta^x_\tau(i)}{P(x^T \mid \theta)},$$
and $E(S_i \mid \theta, y^T)$ is computed in the same way. If $E(S_i \mid \theta, x^T) \approx E(S_i \mid \theta, y^T)$ (they need not be exactly equal), for example, $1.2\,E(S_i \mid \theta, y^T) > E(S_i \mid \theta, x^T) > 0.8\,E(S_i \mid \theta, y^T)$, training continues; otherwise, training stops even if $I(x^T, y^T, \theta)$ keeps on increasing.
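The state-duration check can be sketched as follows. It assumes that the state posteriors $\gamma_\tau(i) = P(s_\tau = S_i \mid x^T, \theta)$ have already been computed from the forward and backward variables and that the expected duration of a state is their sum over time; the function names and the default bounds (taken from the 0.8 and 1.2 example above) are choices made for this illustration.

```python
from typing import List, Sequence


def expected_state_durations(gamma: Sequence[Sequence[float]]) -> List[float]:
    """gamma[t][i] is the posterior of state i at time t; sum it over time."""
    n_states = len(gamma[0])
    return [sum(frame[i] for frame in gamma) for i in range(n_states)]


def durations_comparable(dur_x: Sequence[float],
                         dur_y: Sequence[float],
                         low: float = 0.8,
                         high: float = 1.2) -> bool:
    """True if low * E_y < E_x < high * E_y holds for every state."""
    return all(low * ey < ex < high * ey for ex, ey in zip(dur_x, dur_y))
```

Training would continue only while `durations_comparable` returns `True` for the current model.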
If the I(x T , y T , θ) of the final HMM θ does not meet certain discriminative requirement, for example, I(x T , y T , θ) is less than a desired value, a new base HMM or a smaller ω i should be used instead.

Speed of convergence
As discussed in Section 3, the convergence of the parameter-estimation strategy proposed in (22) is guaranteed according to the EM optimization principles. In the implementation of discriminative training, only some of the symbol emission coefficients in the dynamic channel are modified according to (22) while the others remain unchanged. However, the convergence is still assured because, firstly, the surface of $I(x^T, y^T, \theta)$ with respect to $b_{ij}$ is continuous and, secondly, adjusting the dynamic-channel elements according to the two-channel training strategy leads to increased $E_\theta(I)$. A conceptual illustration is given in Figure 4 on how $b_{ij}$ is modified when the symbol set is divided into subsets $V$ and $U$. For ease of explanation, we assume that the symbol set contains only three symbols $O_1$, $O_2$, and $O_3$ with $O_1, O_2 \in V$ and $O_3 \in U$ for state $S_i$. Let $\theta_t$ denote the HMM trained at the $t$th round and let $\theta_{t+1}$ denote the HMM obtained at the $(t+1)$th round. The surface of the separable distance ($I$ surface) is denoted as $I' = I(x^T, y^T, \theta_{t+1})$ for $\theta_{t+1}$ and $I = I(x^T, y^T, \theta_t)$ for $\theta_t$. Clearly $I' > I$. The $I$ surface is mapped to the $b_{i1}$-$b_{i2}$ plane (Figure 4a) and the $b_{i1}$-$b_{i3}$ plane (Figure 4b). In the training phase, $b_{i1}$ and $b_{i2}$ are modified along the line $b^d_{i1} + b^d_{i2} = \omega_i$ to reach a better estimation $\theta_{t+1}$, which is shown in Figure 4a. In the $b_{i1}$-$b_{i3}$ plane, $b_{i3}$ is held fixed, so the adjustment follows the direction $d$ as shown in Figure 4b. The direction of parameter adjustment given by (22) is denoted by $d'$. In the two-channel approach, since only $b_{i1}$ and $b_{i2}$ are modified according to (22) while $b_{i3}$ remains unchanged, $d$ may lead to a lower speed of convergence than $d'$ does.

Improvement to the discriminative power
The improvement to the discriminative power is estimated as follows. Assume that θ is the two-channel HMM obtained. The lower bound of the probability P(y T | θ) is given by where ω max = max(ω 1 , ω 2 , . . . , ω N ). Because the base HMM is the parameter-smoothed ML HMM of x T , it is natural to assume that P(x T | θ x ML ) ≥ P(x T | θ). The upper bound of the separable distance is given by the following expression: In practice, the gain of I(x T , y T , θ) is much smaller than the theoretical upper bound. It depends on the resemblance between x T and y T , and the setting of ω i .

Training samples with different lengths
Up to this point, the training sequences are assumed to be of equal length. This is necessary as we cannot properly compare the probability scores of two sequences of different lengths. To extend the training strategy to sequences of different lengths, linear adjustment is first carried out as follows. Given the training pair x Tx of length T x and y Ty of length T y , the objective function (10) is modified as follows: Parameter estimation is then carried out as follows: The expectations of different states of y Ty are normalized using the scale factor T x /T y . This approach is easy to implement; however, it does not consider the nonlinear variance of signal such as local stretch or squash. If the training sequences demonstrate obvious nonlinear variance, some nonlinear processing such as sequence truncation or symbol prune may be carried out to adjust the training sequences to the same length [50].
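The linear adjustment for sequences of different lengths can be sketched as below. The code assumes that the expected counts of $y^{T_y}$ are simply multiplied by $T_x / T_y$ before the difference with the counts of $x^{T_x}$ is taken; the exact modified objective is not reproduced here, and the function name and the matrix layout (`E[i][j]` for state $S_i$ and symbol $O_j$) are hypothetical.

```python
from typing import List, Sequence


def length_normalized_difference(E_x: Sequence[Sequence[float]],
                                 E_y: Sequence[Sequence[float]],
                                 T_x: int,
                                 T_y: int) -> List[List[float]]:
    """Return E_x[i][j] - (T_x / T_y) * E_y[i][j] for every state i and symbol j."""
    scale = T_x / T_y
    return [[ex - scale * ey for ex, ey in zip(row_x, row_y)]
            for row_x, row_y in zip(E_x, E_y)]
```

When `y` is simply a time-stretched copy of `x`, the scaled counts cancel and the difference is zero, which is the behavior the linear adjustment aims for.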

Multiple training samples
In order to obtain a reliable model, multiple observations must be used to train the HMM. The extension of the proposed method to include multiple training samples may be carried out as follows. Consider two labeled sets X = {x (1) , x (2) , . . . , x (k) : d 1 } and Y = {y (1) , y (2) , . . . , y (l) : d 2 } of samples, where X has k number of samples and Y has l number of samples. The separable-distance function that takes care of all these samples is given by For simplicity, if we assume that the observation sequences in X and Y have the same length T, then (10) may be rewritten as The probability coefficients are then estimated using the following:

APPLICATION TO LIP READING
The proposed two-channel HMM method is applied to speaker-dependent lip reading for modeling and recognizing the basic visual speech elements of the English language.
For the experiments reported in this paper, the visemes are treated as having a one-to-one mapping with the phonemes in order to test the discriminative power of the proposed method. As there are 48 phonemes in the English language [47], 48 visemes are considered. The block diagram of the viseme recognition system is given in Figure 5. The lip movement is captured with a video camera and the sequence of images is processed to extract the essential features relevant to the lip movement. For each frame of image, a feature vector is extracted. The sequence of feature vectors thus represents the movement of lips during viseme production. This vector sequence is then presented as input to the proposed classifier. A hierarchical structure is adopted such that for a system with K visemes to be recognized, R (usually R < K) ML HMM classifiers are employed for preliminary recognition. The output of the preliminary recognition is a coarse identity, which may include L (usually 1 < L < K) viseme classes. Fine recognition is then performed using a bank of two-channel HMMs. The most probable viseme is then chosen as the identity of the input. Details of the various steps involved are given in the following sections.

Data acquisition
For our experiments, a professional English speaker is engaged. The speaker is asked to articulate each of the 18 phonemes in Table 1 one hundred times. The 18 visemes are chosen as some of them bear close similarity to others. The lip movements of the speaker are captured at 50 frames per second. Each pronunciation starts from a closed mouth and ends with a closed mouth. This type of sample is referred to as a text-independent viseme sample, which is different from the type of samples extracted from various contexts, for example, from different words. The video clips that show the productions of text-independent visemes are normalized such that all the visemes have a uniform duration of 0.5 second, or equivalently 25 frames.

Feature extraction
Each frame of the video clip reveals the lip area of the speaker during articulation (Figure 6a). To eliminate the effect caused by changes in the brightness, the RGB (red, green, blue) factors of the image are converted into HSV (hue, saturation, value) factors. The RGB to HSV conversion algorithm proposed in [51,52] is adopted in our experiments. As illustrated in the histograms of distribution of the hue component shown in Figure 7, the hue factors of the lip region and the remaining lip-excluded image occupy different regions of the histogram. A threshold may be manually selected to segment the lip region from the entire image as shown in Figure 6b. This threshold usually corresponds to a local minimum point (valley) in the histogram as shown in Figure 7a. Note that for different speakers and lighting conditions, the threshold may be different.
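Hue-based segmentation can be sketched with the standard-library `colorsys` module. This is a stand-in for the conversion algorithm of [51,52], not a reproduction of it; the function name, the 0 to 255 pixel format, and the convention that pixels with hue below the threshold are labeled as lip are all assumptions of this example.

```python
import colorsys
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]


def segment_by_hue(pixels: Sequence[Sequence[Pixel]], threshold: float) -> List[List[bool]]:
    """Return a mask that is True where the pixel hue (in [0, 1)) is below threshold."""
    mask = []
    for row in pixels:
        mask_row = []
        for r, g, b in row:
            # colorsys expects channel values in [0, 1]
            hue, _sat, _val = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            mask_row.append(hue < threshold)
        mask.append(mask_row)
    return mask
```

Because hue does not depend on brightness, a bright red pixel and a dark red pixel fall on the same side of the threshold, which is the motivation for the conversion.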
The boundaries of the lips are tracked using a geometric template with dynamic contours to fit an elastic object [53,54,55]. As the contours of the lips are simple, the requirement on the selection of the dynamic contours that build the template is thus not stringent. Results of lip tracking experiments show that Bezier curves can well fit the shape of the lip [34]. In our experiments, the parameterized template consists of ten Bezier curves with eight of them characterizing the lip contours and two of them describing the tongue when it is visible (Figure 6c). The template is controlled by points marked as small circles in Figure 6c. Lip tracking is carried out by fitting the template to minimize a certain energy function. The energy function comprises the  following four terms: where R 1 , R 2 , R 3 , C 1 , and C 2 are areas and contours as illustrated in Figure 6c. H(x) is a function of the hue of a given pixel; H + (x) is the hue function of the closest right-hand side pixel and H − (x) is that of the closest left-hand side pixel. Γ t+1 and Γ t are the matched templates at time t + 1 and t. Γ t+1 − Γ t indicates the Euclidean distance between the two templates (further details may be found in [55]). The overall energy of the template E is the linear combination of the components defined as Similarly, the energy terms for the tongue template include and the overall energy is Initially, the dynamic contours are configured to provide a crude match to the lips. This can be done via comparing the enclosed region of the template and the segmented lip region as depicted in Figure 6b. Following that, the template is matched to the image sequence by adopting different values of the parameters {c i } (i = 1, 2, . . . , 7) in a number of searching epochs (a detailed discussion is given in [53,54,55]). The matched template is pictured in Figure 6d. It can be seen that the matched template is symmetric and smooth, and is therefore easy to process.
Eleven geometric parameters as shown in Figure 6d are extracted to form a feature vector from the matched template. These features indicate the thickness of various parts of the lips, the positions of some key points, and the curvatures of the bows. They are chosen as they uniquely determine the shape of the lips and they best characterize the movement of the lips.
Principal components analysis (PCA) is carried out to reduce the dimension of the feature vectors from eleven to seven. The resulting feature vectors are clustered into groups using the K-means algorithm. In the experiments conducted, 128 clusters are created for the vector database. The means of the 128 clusters form the symbol set $O^{128} = (O_1, O_2, \ldots, O_{128})$ of the HMM. They are used to encode the vector sequences presented to the system.
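Encoding a vector sequence with the cluster means amounts to vector quantization. The sketch below illustrates the idea under the assumption that each feature vector is mapped to the index of the nearest cluster mean in Euclidean distance. The three 2-D cluster means are hypothetical stand-ins for the 128 seven-dimensional means, and symbol indices are 0-based.

```python
def nearest_symbol(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    best_index, best_distance = 0, float("inf")
    for index, center in enumerate(codebook):
        distance = sum((v - c) ** 2 for v, c in zip(vector, center))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index

def encode_sequence(vectors, codebook):
    """Map a sequence of feature vectors to a sequence of symbol indices."""
    return [nearest_symbol(vector, codebook) for vector in vectors]

# Hypothetical toy codebook standing in for the K-means cluster means.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
symbols = encode_sequence([(0.1, 0.2), (4.0, 4.5), (0.9, 1.2)], codebook)
```

The resulting integer sequence is the discrete observation sequence that a discrete-symbol HMM would consume.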

Configuration of the viseme model
Investigation of lip dynamics reveals that the movement of the lips can be partitioned into three phases during the production of a text-independent viseme. The initial phase begins with a closed mouth and ends with the start of sound production. The intermediate phase is the articulation phase, which is the period when sound is produced. The third phase is the end phase, when the mouth returns to the relaxed state. Figure 8 illustrates the change of the lips in the three phases and the corresponding acoustic waveform when the phoneme /u/ is uttered.
To associate the HMM with the physical process of viseme production, the three-state left-right HMM structure shown in Figure 9 is adopted. Using this structure, the state-transition matrix $A$ has the form
$$A = \begin{bmatrix} a_{1,1} & a_{1,2} & 0 & 0 \\ 0 & a_{2,2} & a_{2,3} & 0 \\ 0 & 0 & a_{3,3} & a_{3,4} \\ 0 & 0 & 0 & 1 \end{bmatrix},$$
where the 4th state is a null state that indicates the end of viseme production. The initial values of the coefficients in matrices $A$ and $B$ are set according to the statistics of the three phases. Given a viseme sample, the approximate initial phase, articulation phase, and end phase are segmented from the image sequence and the acoustic signal (an illustration is given in Figure 8), and the duration of each phase is counted. The coefficients $a_{i,i}$ and $a_{i,i+1}$ are initialized with these durations. For example, if the duration of state $S_i$ is $T_i$, the initial value of $a_{i,i}$ is set to $T_i/(T_i + 1)$ and the initial value of $a_{i,i+1}$ is set to $1/(T_i + 1)$, as these values maximize $a_{i,i}^{T_i} a_{i,i+1}$ subject to $a_{i,i} + a_{i,i+1} = 1$. Matrix $B$ is initialized in a similar manner. If symbol $O_j$ appears $T(O_j)$ times in state $S_i$, the initial value of $b_{ij}$ is set to $T(O_j)/T_i$. With this arrangement, the states of the HMM are aligned with the three phases of viseme production and hence are referred to as the initial state, articulation state, and end state.
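The duration-based initialization can be illustrated with a short sketch. It assumes a single segmented sample, positive phase durations, and 0-based symbol indices. The function names and the example durations (4, 15, and 6 frames, summing to 25) are hypothetical.

```python
def init_transitions(durations):
    """Left-right transition matrix with a final absorbing null state.

    For a state of duration T, the self-transition is T / (T + 1) and the
    forward transition is 1 / (T + 1).
    """
    n = len(durations)
    A = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i, T in enumerate(durations):
        A[i][i] = T / (T + 1)
        A[i][i + 1] = 1 / (T + 1)
    A[n][n] = 1.0  # null state: end of viseme production
    return A

def init_emissions(state_symbols, num_symbols):
    """b[i][j] = (count of symbol j in state i) / (duration of state i)."""
    B = []
    for symbols in state_symbols:
        counts = [0] * num_symbols
        for symbol in symbols:
            counts[symbol] += 1
        B.append([count / len(symbols) for count in counts])
    return B

# Hypothetical durations (in frames) of the initial, articulation, and end phases.
A = init_transitions([4, 15, 6])
# Hypothetical symbols observed in one state, over a 3-symbol alphabet.
B = init_emissions([[0, 0, 1, 2]], num_symbols=3)
```

Setting the derivative of $a^{T}(1 - a)$ with respect to $a$ to zero gives $T(1 - a) = a$, that is, $a = T/(T + 1)$, which is the self-transition value used above.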
For each of the 18 visemes in Table 1, an HMM with the above configuration is trained using the Baum-Welch estimation. After implementing parameter smoothing, the parameter-smoothed ML HMM is ready for the subsequent two-channel discriminative training.

Viseme classifier
The block diagram of the proposed hierarchical viseme classifier is given in Figure 10.
A viseme model $\theta_i$ is able to separate visemes $d_i$ and $d_j$ if the following condition applies: For an input viseme $z^T$ to be identified, the probabilities $P(z^T \mid \theta_{\mathrm{Mac}1}), P(z^T \mid \theta_{\mathrm{Mac}2}), \ldots, P(z^T \mid \theta_{\mathrm{Mac}R})$ are computed and compared with one another. The macro identity of $z^T$ is determined by the HMM that gives the largest probability.
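Coarse identification by the largest model probability can be sketched as follows. The sketch assumes a generic discrete HMM given as `(pi, A, B)` and scores a symbol sequence with the standard scaled forward algorithm. It does not model the null end state of the viseme HMMs, and the two toy one-state models are hypothetical.

```python
import math

def log_likelihood(obs, pi, A, B):
    """log P(obs | HMM) computed with the scaled forward algorithm."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot generate this sequence
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_p

def macro_identity(obs, models):
    """Name of the model that gives the largest probability for `obs`."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))

# Hypothetical one-state models over a two-symbol alphabet.
models = {
    "Mac1": ([1.0], [[1.0]], [[0.9, 0.1]]),
    "Mac2": ([1.0], [[1.0]], [[0.1, 0.9]]),
}
winner = macro_identity([0, 0, 1], models)
```

Comparing log probabilities instead of raw probabilities leaves the argmax unchanged and avoids numerical underflow on longer sequences.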
A macro class may consist of several similar visemes. Fine recognition within a macro class is carried out at the second layer. Assume that Macro Class $i$ comprises $L$ visemes: $V_1, V_2, \ldots, V_L$. A number of two-channel HMMs are trained with the proposed discriminative training strategy. For $V_1$, $L - 1$ HMMs, $\theta_{1\wedge 2}, \theta_{1\wedge 3}, \ldots, \theta_{1\wedge L}$, are trained to separate the samples of $V_1$ from those of $V_2, V_3, \ldots, V_L$, respectively. Taking $\theta_{1\wedge 2}$ as an example, the parameter-smoothed ML HMM of $V_1$, $\theta_1^{ML}$, is adopted as the base HMM. The samples of $V_1$ are used as the correct samples ($x^T$ in (3)) and the samples of $V_2$ are used as the incorrect samples ($y^T$ in (3)) while training $\theta_{1\wedge 2}$. In total, there are $L(L - 1)$ two-channel HMMs in Macro Class $i$.
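The count of $L(L - 1)$ models follows from enumerating every ordered (target, rival) pair within a macro class. A minimal sketch, with hypothetical viseme labels:

```python
from itertools import permutations

def pairwise_models(visemes):
    """All ordered (target, rival) pairs; one two-channel HMM per pair."""
    return list(permutations(visemes, 2))

# Hypothetical macro class with five visemes.
pairs = pairwise_models(["V1", "V2", "V3", "V4", "V5"])
models_for_v1 = [pair for pair in pairs if pair[0] == "V1"]
```

For five visemes this gives 20 models, four of which have V1 as the target. Applied to all 18 visemes without a coarse stage, the same enumeration would give 18 × 17 = 306 ordered pairs.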
For an input viseme $z^T$ to be identified, the following hypothesis is made: where $K$ is the positive constant as defined in (47). For the 25-frame sequence input to the system, $K$ is chosen to be equal to 2. $H_{i\wedge j} = i$ indicates a vote for $V_i$. The decision about the identity of $z^T$ is made by a majority vote of all the two-channel HMMs. The viseme class that has the maximum number of votes is chosen as the identity of $z^T$, denoted by $\mathrm{ID}(z^T)$. Mathematically, If two viseme classes, say $V_i$ and $V_j$, receive the same number of votes, the decision about the identity of $z^T$ is made by comparing $P(z^T \mid \theta_{i\wedge j})$ and $P(z^T \mid \theta_{j\wedge i})$. Mathematically, The decision is based on pairwise comparisons of the hypotheses. The proposed hierarchical structure greatly reduces the computational load and increases the accuracy of recognition because pairwise comparisons are carried out within each macro class, which comprises far fewer candidate classes than the entire set. If coarse identification is not performed, the number of classes increases and the number of pairwise comparisons, which grows as $L(L - 1)$ for $L$ classes, goes up rapidly.
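The voting rule can be sketched as follows, under several assumptions that are not taken from the text: each pairwise hypothesis is supplied as either a class label (a vote) or `None` (no vote), a two-way tie is resolved in favour of the class whose pairwise model scores the higher log probability, and ties among more than two classes are not handled. All names and values are hypothetical.

```python
from collections import Counter

def identify(hypotheses, log_prob):
    """Majority vote over pairwise hypotheses with a two-way tie-break.

    hypotheses: {(i, j): voted class or None} for each pairwise model i^j.
    log_prob:   {(i, j): log P(z | theta_{i^j})}, consulted only on a tie.
    """
    tally = Counter(vote for vote in hypotheses.values() if vote is not None)
    if not tally:
        return None
    top = max(tally.values())
    tied = sorted(label for label, count in tally.items() if count == top)
    if len(tied) == 1:
        return tied[0]
    if len(tied) == 2:
        i, j = tied
        # Assumption: the class whose pairwise model scores higher wins.
        return i if log_prob[(i, j)] >= log_prob[(j, i)] else j
    raise ValueError("ties among more than two classes are not covered here")

# Hypothetical votes within a macro class of three visemes.
clear_case = identify(
    {(1, 2): 1, (1, 3): 1, (2, 1): None, (2, 3): 2, (3, 1): None, (3, 2): None},
    log_prob={},
)
tied_case = identify(
    {(1, 2): 1, (2, 1): 2, (1, 3): None, (3, 1): None, (2, 3): None, (3, 2): None},
    log_prob={(1, 2): -40.0, (2, 1): -35.0},
)
```

Keeping the hypotheses as an input, rather than computing them inside the function, leaves the threshold test with the constant K outside the sketch.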
The two-channel HMMs act as the boundary functions for the viseme they represent. Each of them serves to separate the correct samples from the samples of another viseme. A conceptual illustration is given in Figure 11, where the macro class comprises five visemes $V_1, V_2, \ldots, V_5$. The models $\theta_{1\wedge 2}, \theta_{1\wedge 3}, \ldots, \theta_{1\wedge 5}$ build the decision boundaries for $V_1$ to delimit it from the similar visemes.
The proposed two-channel HMM is specially tailored to the target viseme and its "surroundings". As a result, it is more accurate than the traditional modeling method that uses a single ML HMM.
Figure 11: Viseme boundaries formed by the two-channel HMMs.

Performance of the system
Experiments are carried out to assess the performance of the proposed system. For the experiments conducted in this paper, 100 samples are drawn for each viseme with 50 for training and the remaining 50 for testing. By computing and comparing the probabilities scored by different viseme models using (49) and (50), the 18 visemes are clustered into 6 macro classes as illustrated in Table 2.
The results of fine recognition of some confusable visemes are listed in Table 3. Each row in Table 3 shows two similar visemes that belong to the same macro class. The first viseme label (in boldface) is the target viseme and is denoted by $x$. The second viseme is the incorrect viseme and is denoted by $y$. $\theta_{ML}$ denotes the parameter-smoothed ML HMMs that are trained with the samples of $x$. With $\theta_{ML}$ being the base HMM, two two-channel HMMs, $\theta_1$ and $\theta_2$, are trained with the samples of $x$ being the target training samples and the samples of $y$ being the incorrect training samples. Different sets of the credibility factors ($\omega_1$, $\omega_2$, and $\omega_3$ for the three states) are used for $\theta_1$ and $\theta_2$. $P$ is the average log probability scored for the testing samples and is computed as $P = \frac{1}{l}\sum_{i=1}^{l} \log P(x_i \mid \theta)$, where $x_i$ is the $i$th testing sample of viseme $x$ and $l$ is the number of the testing samples. $I$ is the average separable distance. The value of $I$ gives an indication of the discriminative power: the larger the value of $I$, the higher the discriminative power.
Configuration of the two-channel HMMs in Table 3: * For $\theta_1$, $\omega_1$, $\omega_2$, and $\omega_3$ are set according to (30), with $C = 1.0$ and $D = 0.1$. ** For $\theta_2$, $\omega_1$, $\omega_2$, and $\omega_3$ are manually selected.
For all settings of ($\omega_1$, $\omega_2$, $\omega_3$), the two-channel HMMs give a much larger separable distance than the ML HMMs. This shows that the two-channel viseme classifiers attain better discrimination than the ML HMM classifiers. In addition, different levels of discrimination can be attained by adjusting the credibility factors. However, the two-channel HMM gives a smaller average probability for the target samples than the normal ML HMM. This indicates that the two-channel HMMs perform well at discriminating confusable visemes but are not good at modeling the visemes.
The change of I(x, y, θ) with respect to the training epochs in the two-channel training is depicted in Figure 12. For the three-state left-right HMMs and the training samples of length 25 adopted in the experiment, the separable distance becomes stable after ten to twenty epochs. Such speed of convergence shows that the two-channel training is not computationally intensive for viseme recognition. It is also observed that I(x, y, θ) may drop during the first few training epochs. This phenomenon can be attributed to the fact that some symbols in subset V are transferred to U while training the dynamic-channel coefficients, as explained in Section 4.3. Figure 12d illustrates the situation of early termination. The training process stops even though I(x, y, θ) still shows a tendency to increase. As explained in Section 5.1, if the state durations of the target training samples and incorrect training samples differ greatly, that is, the state alignment condition is violated, the two-channel training should terminate immediately.
The performance of the proposed hierarchical system is compared with that of the traditional recognition system where ML HMMs (parameter-smoothed) are used as the viseme classifiers. The ML HMMs and the two-channel HMMs involved are trained with the same set of training samples. The credibility factors of the two-channel HMMs are set according to (30), with C = 0.1 and D = 0.1. The decision about the identity of an input testing sample is made according to (47), (49), (50), and (51), where K = 2. The false rejection error rates (FRRs), or Type-II errors, of the two types of viseme classifiers are computed for the 50 testing samples of each of the 18 visemes. Note that as some of the 18 visemes can be accurately identified by the ML HMMs with FRRs of less than 10% [34], the improvement resulting from the two-channel training approach is not prominent for these visemes. In Table 4, only the FRRs of 12 confusable visemes are listed.
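As a minimal sketch of how a per-viseme FRR might be computed, the code below counts a false rejection whenever a testing sample of the target viseme is assigned any other identity. That definition is an assumption, and the labels and counts are hypothetical.

```python
def false_rejection_rate(target, decided_identities):
    """Fraction of the target's testing samples that were not identified as it."""
    rejected = sum(1 for identity in decided_identities if identity != target)
    return rejected / len(decided_identities)

# Hypothetical outcome: 50 testing samples of viseme "V1", 7 of them misidentified.
decisions = ["V1"] * 43 + ["V2"] * 5 + ["V3"] * 2
frr = false_rejection_rate("V1", decisions)
```

With 7 of 50 samples rejected, the sketch returns an FRR of 0.14, that is, 14%.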
Compared with the conventional ML HMM classifier, the classification error of the proposed hierarchical viseme classifier is reduced by about 20%. Thus the two-channel training algorithm is able to increase the discriminative ability of HMM significantly for identifying visemes.

CONCLUSION
In this paper, a novel two-channel training strategy for the hidden Markov model is proposed. A separable-distance function, which measures the difference between a pair of training samples, is applied as the objective function. To maximize the separable distance and maintain the validity of the probabilistic framework of the HMM at the same time, a two-channel HMM structure is used. Parameters in one channel, named the dynamic channel, are optimized in a series of expectation-maximization (EM) estimations where feasible, while parameters in the other channel, the static channel, are kept fixed. The HMM trained in this way amplifies the difference between the training samples. This strategy is especially suited to increasing the discriminative ability of the HMM over confusable observations.
The proposed training strategy is applied to viseme recognition. A hierarchical system is developed with the normal ML HMM classifier implementing coarse recognition and the two-channel HMM carrying out fine recognition. To extend the classification from binary-class to multiple-class, a decision rule based on majority vote is adopted. Experimental results show that the classification error of the proposed viseme classifier is on average 20% less than that of the popular ML HMM classifier, while only 10 to 20 training epochs are required in the training process. The two-channel training strategy thus provides significant improvement over the traditional Baum-Welch estimation in fine recognition. However, the proposed method requires state alignment among the training samples; in other words, the samples should be of sufficient similarity such that the durations of the corresponding states are comparable.
Figure 12: Change of I(x, y, θ) during the training process. Panel (d): early termination.
Although the two-channel HMM is illustrated for viseme classification in this paper, the method is applicable to any sequence classification problem where the sequences to be recognized are of comparable length. Such applications include speech recognition, speaker identification, and handwriting recognition.