EURASIP Journal on Applied Signal Processing 2005:13, 2091–2100
© 2005 Hindawi Publishing Corporation

Spatio-Temporal Graphical-Model-Based Multiple Facial Feature Tracking

It is challenging to track multiple facial features simultaneously when rich expressions appear on a face. We propose a two-step solution. In the first step, several independent condensation-style particle filters are used to track each facial feature in the temporal domain. Particle filters are very effective for visual tracking problems; however, multiple independent trackers ignore the spatial constraints and the natural relationships among facial features. In the second step, we use Bayesian inference, namely belief propagation, to infer each facial feature's contour in the spatial domain, where the relationships among the contours of facial features are learned beforehand from a large facial expression database. The experimental results show that our algorithm can robustly track multiple facial features simultaneously, even when there are large interframe motions with expression changes.


INTRODUCTION
Multiple facial feature tracking is very important in the computer vision field: it needs to be carried out before video-based facial expression analysis and expression cloning. It is also very challenging, because facial features exhibit plentiful nonrigid motions in addition to the rigid motion of the face. Nonrigid facial feature motions are usually very rapid, and the facial features themselves often form dense clutter. A traditional Kalman filter alone is inadequate because it assumes a Gaussian density and works relatively poorly in clutter; clutter causes the density of a facial feature's contour to be multimodal and therefore non-Gaussian. Isard and Blake [1] first proposed a face tracker based on particle filters (Condensation), which is more effective in clutter than a comparable Kalman filter.
Although particle filters are often very effective for visual tracking problems, they are specialized to temporal problems whose corresponding graphs are simple Markov chains (see Figure 1). There is often structure within each time instant that is ignored by particle filters. For example, in multiple facial feature tracking, the expressions of each facial feature (such as eyes, brows, lips) are closely related; therefore a more complex graph should be formulated.
The contribution of this paper is to extend particle filters to track multiple facial features simultaneously. The straightforward approach of tracking each facial feature with one independent particle filter is questionable, because the influences and interactions among facial features are not taken into account.
In this paper, we propose a spatio-temporal graphical model for multiple facial feature tracking (see Figure 2). Here the graphical model is not a 2D or a 3D facial mesh model. In the spatial domain, the model is shown in Figure 3, where x_i is a hidden random variable and y_i is a noisy local observation. Nonparametric belief propagation is used to infer the facial features' interrelationships in a part-based face model, allowing the positions and states of features lost in clutter to be recovered. Facial structure is also taken into account, because facial features have spatial position constraints [2]. In the temporal domain, every facial feature forms a Markov chain (see Figure 1).
After briefly reviewing related work in Section 2, we introduce the details of our algorithm in Sections 3 and 4. Experimental results are presented in Section 5. Conclusions are given in Section 6.

RELATED WORK
After the pioneering work of Isard and Blake [1], who creatively used particle filters for visual tracking, many researchers have adopted particle filters to track faces or facial features [2, 3, 4, 5, 6, 7, 8]. Rui and Chen [3] used the unscented particle filter (UPF) [9] for visual tracking. Zeng and Ma [4] proposed an active particle filtering approach. Vermaak et al. [5] selectively adapted the observation model to obtain better tracking results. Pérez et al. [6] combined the color-based CamShift/MeanShift algorithm with particle filters. Loy et al. [2] utilized multiple cues to track the target. All of the above methods used particle filters only to track the whole face or head, not the facial features. De la Torre et al. [7] used particle filters to track eyes or lips while switching between different shape/texture models; however, they did not track both simultaneously. Wang et al. [8] integrated a learned intrinsic object structure into a particle-filter-style tracker; however, only one facial feature (the mouth) was tracked. Therefore the idea of this paper is new: we use particle filters to track multiple facial features rather than one facial feature.

Figure 1: The Markov chain assumption of particle filters. The empty circle x_i represents the hidden state (contour) at time i, and the filled-in one y_i denotes the local observation.
Isard [10] and Sudderth et al. [11] have independently developed algorithms for performing belief propagation with the aid of particle sets. Their methods motivated us to use a graphical model in multiple facial feature tracking. However, they only show their algorithms' effectiveness on 2D graphical models, which live in the spatial domain. As far as multiple facial feature tracking is concerned, the corresponding graphical model is a 3D, spatio-temporal one. This 3D graphical model belongs to a specific type: it is directed-cum-undirected. In this paper, we seek the relationships between the particle filter in the 1D temporal domain and nonparametric belief propagation in the 2D spatial domain.

MULTIPLE FACIAL FEATURE TRACKING BY PARTICLE FILTER: THE FIRST STEP
We adopt the condensation algorithm to track each facial feature. After Isard and Blake [1] first proposed an implementation of particle filters, many other researchers proposed enhanced particle filter algorithms, for example, ICondensation [21], the UPF [9], and the Rao-Blackwellised particle filter [22]; however, they still cannot solve the "curse of dimensionality" problem, and the workable dimensionality is generally below 10. On the other hand, we decompose the tracking problem across facial features in the spatial domain, so the choice among particle filter variants has no key effect on our algorithm. For simplicity, we choose the basic condensation algorithm because it satisfies our requirements.
Six facial features are tracked in this paper: the two eyebrows, the two eyes, the nose, and the mouth. Taking an eye as an example, we track the eyelid contour. The contour at time t is modeled as a B-spline with state x_t; the state history is X_t = {x_1, x_2, ..., x_t}, and the observation history of the eye is Y_t = {y_1, y_2, ..., y_t}. We need to infer the marginal conditional density p(x_t | Y_t). Isard and Blake [23] have proved that

    p(x_t | Y_t) = c_t p(y_t | x_t) p(x_t | Y_{t-1}),   (1)

where c_t is a normalization constant, and

    p(x_t | Y_{t-1}) = ∫ p(x_t | x_{t-1}) p(x_{t-1} | Y_{t-1}) dx_{t-1}.   (2)

In (1), p(x_t | Y_{t-1}) is the effective prior model, and p(y_t | x_t) is the observation model. In (2), p(x_t | x_{t-1}) is the dynamic model.
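To make the recursion in (1) and (2) concrete, here is a minimal condensation-style step for a one-dimensional state, assuming a Gaussian random-walk dynamic model and a Gaussian observation model; both are illustrative stand-ins for the learned models used in the paper.

```python
import numpy as np

def condensation_step(samples, weights, y_t, dyn_std=0.05, obs_std=0.1, rng=None):
    """One time step of the condensation recursion (1)-(2).

    samples, weights : particle set approximating p(x_{t-1} | Y_{t-1})
    y_t              : current observation
    dyn_std, obs_std : stand-ins for the learned dynamic/observation models
    """
    rng = np.random.default_rng() if rng is None else rng
    M = len(samples)
    # resample according to the old weights
    picked = rng.choice(M, size=M, p=weights)
    # predict through the dynamic model p(x_t | x_{t-1}), cf. eq. (2)
    predicted = samples[picked] + rng.normal(0.0, dyn_std, size=M)
    # reweight by the observation model p(y_t | x_t), cf. eq. (1)
    new_weights = np.exp(-0.5 * ((y_t - predicted) / obs_std) ** 2)
    new_weights /= new_weights.sum()
    return predicted, new_weights

# Usage: track a scalar "contour parameter" drifting toward 1.0
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 0.2, 1000)
weights = np.ones(1000) / 1000
for y in [0.2, 0.4, 0.6, 0.8, 1.0]:
    samples, weights = condensation_step(samples, weights, y, rng=rng)
estimate = float(np.sum(weights * samples))  # E[x_t | Y_t]
```

Because the dynamic model limits how far the state can move per step, the posterior mean lags somewhat behind the latest observation, which is the expected filtering behavior.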

Why several particle filters?
A single particle filter is not suitable for tracking multiple facial features simultaneously. The reason is as follows: the total dimensionality, summed over each feature's dimensionality, is too high (dozens of dimensions), and the tracking efficiency of a particle filter decreases exponentially as the dimensionality increases linearly. Usually, it is extremely difficult to get good results from particle filters in spaces of dimensionality much greater than about 10 [24]. Even if dimensionality can be reduced by principal component analysis (PCA) [25] or other nonlinear methods [8, 26], the total dimensionality of multiple facial features remains large, and if we reduce the dimensionality too much, valuable state information may be lost.
A human face contains multiple facial features, and it can be decomposed into several parts, such as eyebrows, eyes, nose, and mouth, which form a graphical model in the spatial domain. In this paper, we track each facial feature with its own particle filter; therefore the computational complexity becomes linear rather than exponential in the size of the graph.

Particle filter itself is not enough
When there are rapid motions in one facial feature (e.g., the mouth) due to changes of facial expression (see Figure 4), the corresponding particle filter may fail to track the facial feature's contour. It is difficult to avoid this failure if we only use multiple independent particle filters to track each facial feature. In this paper, we track several facial features simultaneously using several correlated particle filters. When an emotion is presented on the face, different facial features interact physically in a natural way. For example, when we smile while blinking the left eye, the left corner of the mouth moves up; when we are surprised and the mouth opens wide, the eyebrows also move up.
Instead of constructing heuristic rules for these relationships, we learn the relationships among facial features from training data beforehand. During the process of tracking, if we detect that some facial feature tracker's results are poor, we can infer their positions and states from other facial features by Bayesian inference. In this paper, belief propagation is used to carry out Bayesian learning and inference.

Loopy belief propagation
At every time instant, the facial features are contained in an undirected graphical model G_f (see Figure 3). Let V denote the set of nodes (facial features). Nodes are connected by edges E that describe the relationships between facial features. The neighborhood of a node i is NB(i) = {j | (i, j) ∈ E}. Let x_i denote the hidden variable (contour of a facial feature), and let y_i denote the observed variable (facial feature image). Let X = {x_1, ..., x_N} and Y = {y_1, ..., y_N}, where N is the number of nodes in the graphical model G_f. The joint probability density function factorizes as

    p(X, Y) = (1/C) ∏_{(i,j)∈E} ψ_ij(x_i, x_j) ∏_{i∈V} φ_i(x_i, y_i),   (3)

where C is a normalization constant, ψ_ij(x_i, x_j) is a correlation function between x_i and its neighbor variable x_j, and φ_i(x_i, y_i) is an observation function that denotes the evidence for x_i [27]. From Figure 3, we can see that this is a Markov network with loops. Pearl [28] warned that belief propagation might not converge in this kind of graphical model. However, experimental [29] and theoretical results [30, 31, 32, 33] motivate us to apply the belief propagation rules in a Markov network with loops; Murphy et al. called this loopy belief propagation [31].
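The factorization and the loopy message-passing rules can be illustrated on a tiny discrete Markov network with a loop. The potentials below are toy values, not the learned Gaussian-mixture functions of the paper: ψ encourages neighboring nodes to agree, and only node 0 has informative evidence.

```python
import numpy as np

# Toy 3-node loop with K = 2 states per node (illustrative values only)
K = 2
edges = [(0, 1), (1, 2), (2, 0)]
psi = np.array([[0.9, 0.1],
                [0.1, 0.9]])                       # psi_ij(x_i, x_j), symmetric
phi = np.array([[0.8, 0.2],                        # node 0: evidence favors state 0
                [0.5, 0.5],                        # nodes 1, 2: uninformative
                [0.5, 0.5]])

nbrs = {i: [] for i in range(3)}
for i, j in edges:
    nbrs[i].append(j)
    nbrs[j].append(i)

# messages m[(j, i)]: message from node j to node i, initialized uniform
m = {(j, i): np.ones(K) / K for i in nbrs for j in nbrs[i]}

for _ in range(20):  # iterate the update rule until (empirically) converged
    new = {}
    for (j, i) in m:
        prod_in = phi[j].copy()
        for k in nbrs[j]:
            if k != i:                   # all incoming messages except from i
                prod_in *= m[(k, j)]
        msg = psi @ prod_in              # sum over x_j of psi(x_i, x_j) * ...
        new[(j, i)] = msg / msg.sum()
    m = new

# belief at each node: local evidence times all incoming messages
beliefs = []
for i in range(3):
    b = phi[i].copy()
    for j in nbrs[i]:
        b *= m[(j, i)]
    beliefs.append(b / b.sum())
```

After convergence, the evidence at node 0 has propagated around the loop, so all three beliefs favor state 0, which is the qualitative behavior the tracker relies on when one feature's observation is weak.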
In belief propagation, we need to calculate the conditional marginal distribution p(x_i | {y_i}) for the nodes whose particle-filter tracking results have low confidence. The intuitive meaning is that we can infer facial feature i's position (contour) and state (e.g., expression) from all facial features' observations {y_i}.
Figure 5: Message passing in a directed-cum-undirected graphical model.

Belief propagation in spatio-temporal graphical model
In this paper, the graphical model is a combination of a directed graph (Markov chain) and an undirected graph (Markov network). For Bayesian inference, the key operation is belief propagation, or message passing. The messages of the directed graph pass along the time axis. In Figure 5, the message passing from x_{t-1}^i to x_t^i is

    M(x_{t-1}^i → x_t^i) = p(x_t^i | Y_{t-1}^i)   (4)
                         = ∫ p(x_t^i | x_{t-1}^i) b(x_{t-1}^i) dx_{t-1}^i,   (5)

and it is what we have to calculate. Here b(x_{t-1}^i) denotes the tracking result (belief) for facial feature i obtained from the graphical model at the previous time instant. The belief at node (i, t) is proportional to the product of the local evidence φ_i(x_t^i, y_t^i) at that node and all the messages coming into it [34]. There are two kinds of messages: one comes from the immediately preceding node x_{t-1}^i temporally, and the other comes from the neighbors of node (i, t) spatially. Therefore, we have

    b(x_t^i) = K φ_i(x_t^i, y_t^i) M(x_{t-1}^i → x_t^i) ∏_{j∈NB(i,t)} m_ji(x_t^i).   (6)

In (6), K is a normalization constant and NB(i, t) denotes the nodes neighboring node (i, t). As defined in (4) and (5), the message from the previous time is M(x_{t-1}^i → x_t^i). Furthermore, the message from a spatial neighbor is determined self-consistently by the following message update rule:

    m_ji(x_t^i) = ∫ ψ_ji(x_t^j, x_t^i) φ_j(x_t^j, y_t^j) M(x_{t-1}^j → x_t^j) ∏_{k∈NB(j,t)\(i,t)} m_kj(x_t^j) dx_t^j.   (7)

On the right-hand side of (7), we take the product of the messages going into node (j, t) except for the one coming from node (i, t). Note that the message M(x_{t-1}^j → x_t^j) from the previous time instant is also taken into account.
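On a discretized one-dimensional state space, the belief fusion of (6) reduces to a pointwise product of the observation function, the temporal message, and the spatial messages. The three Gaussian-shaped functions below are illustrative placeholders for φ_i, M, and one message m_ji.

```python
import numpy as np

# Discretized sketch of equation (6): the belief at node (i, t) fuses the
# temporal message M (the particle filter's effective prior) with the
# spatial messages from neighboring facial features.
grid = np.linspace(-1.0, 1.0, 201)           # discretized 1D state space

def gauss(x, mu, std):
    return np.exp(-0.5 * ((x - mu) / std) ** 2)

phi_i = gauss(grid, 0.4, 0.30)               # observation function phi_i(x_t^i, y_t^i)
M_temporal = gauss(grid, 0.1, 0.20)          # M(x_{t-1}^i -> x_t^i), eqs. (4)-(5)
m_spatial = gauss(grid, 0.3, 0.25)           # one spatial message m_ji(x_t^i)

belief = phi_i * M_temporal * m_spatial      # eq. (6), up to the constant K
belief /= belief.sum() * (grid[1] - grid[0]) # normalize to a density

mode = float(grid[np.argmax(belief)])        # fused estimate for feature i
```

The fused mode lies between the temporal prediction and the spatial evidence, weighted by their precisions, which is how a neighbor can pull a drifting feature back toward a plausible position.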
Based on a factorization described in [27], we use the observation function φ_i(x_t^i, y_t^i) = p(y_t^i | x_t^i), and it can be seen that φ_i(x_t^i, y_t^i) is equal to the observation model in [23]. We also use the correlation function ψ_ji(x_t^j, x_t^i) = p(x_t^j, x_t^i)/p(x_t^i), and fit this probability with mixtures of Gaussians [35]. The message M(x_{t-1}^i → x_t^i) passing from x_{t-1}^i to x_t^i can be viewed as the effective prior: a prediction taken from the marginal probability b(x_{t-1}^i) of the previous time step, onto which one step of the dynamical model is superimposed.
From (6) and (7), we can see that φ always appears together with M. By the analysis above, their product is

    φ_i(x_t^i, y_t^i) M(x_{t-1}^i → x_t^i) = p(y_t^i | x_t^i) p(x_t^i | Y_{t-1}^i) ∝ p(x_t^i | Y_t^i)   (8)

(using (1)).
Equation (8) means that the product is effectively the posterior probability of x_t^i conditioned on y_t^i and Y_{t-1}^i, and this shares the same idea as the condensation algorithm. This property is important because it allows us to first run the particle filter to track each facial feature in one time step; the output of the particle filter then fits naturally into a loopy belief propagation process (see (6) and (7)).
Wu et al. [36] proposed a mean-field Monte Carlo algorithm for visual tracking of articulated human body, which is similar to ours in using the dynamic Markov network.

Particle propagation in spatio-temporal graphical model
Since in our spatio-temporal graphical model messages do not need to be passed backward in the temporal domain, the choice of importance function can be omitted. In conventional particle filter algorithms, the probability distribution of possible interpretations is represented by a randomly sampled set, called a "particle set." In particle set form, our algorithm is again the combination of a particle filter and loopy belief propagation, as indicated by (8).
Each sample is an (s_t^i(m), π_t^i(m)) pair, in which s_t^i(m) is a value of x_t^i and π_t^i(m) is the corresponding sampling probability; m ∈ {1, ..., M}, and M is the total number of samples for one facial feature.
Step 1. First, a particular s_{t-1}^i(m) is drawn randomly from the belief b(x_{t-1}^i) by choosing it with probability π_{t-1}^i(m) from the set of M samples at time t − 1.
Step 2. Draw s_t^{pf,i}(m) randomly from p(x_t^i | x_{t-1}^i = s_{t-1}^i(m)), one time step of the dynamic model, where the superscript pf denotes the particle filter.
Step 3. A value s_t^{pf,i}(m) chosen in Step 2 is a fair sample from p(x_t^i | Y_{t-1}^i); weighting it by the observation model, we therefore obtain the particle set form of

    LL(x_t^i) = φ_i(x_t^i, y_t^i) M(x_{t-1}^i → x_t^i),   (9)

which can be viewed as a likelihood function in belief propagation. Actually, LL(x_t^i) is the tracking result of the particle filter for one facial feature, since LL(x_t^i) ∝ p(x_t^i | Y_t^i) (using (8)).
Using a sampling method similar to the conventional particle filter (as described in Steps 1, 2, and 3), we obtain a nonparametric approximation (s_t^{pf,i}(m), π_t^{pf,i}(m)) to LL(x_t^i). We can further use a bandwidth selection method to construct a kernel density estimate of LL(x_t^i) from (s_t^{pf,i}(m), π_t^{pf,i}(m)); therefore we can evaluate it in nonparametric belief propagation.
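A sketch of that kernel density construction, using a Gaussian kernel and Silverman's rule of thumb as the bandwidth selection method (the paper leaves the bandwidth selector unspecified, so this is one common choice):

```python
import numpy as np

def kde_from_particles(samples, weights, bandwidth=None):
    """Build an evaluable Gaussian kernel density estimate of LL(x_t^i)
    from a weighted 1D particle set (s(m), pi(m))."""
    samples = np.asarray(samples, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    if bandwidth is None:
        # Silverman's rule of thumb on the weighted sample spread
        mu = np.sum(weights * samples)
        std = np.sqrt(np.sum(weights * (samples - mu) ** 2))
        bandwidth = 1.06 * std * len(samples) ** (-1 / 5)

    def density(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        diffs = (x[:, None] - samples[None, :]) / bandwidth
        kern = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2 * np.pi))
        return kern @ weights          # weighted sum of Gaussian kernels

    return density

# Usage: turn an equally weighted particle set into an evaluable density
rng = np.random.default_rng(1)
s = rng.normal(0.5, 0.1, 500)          # particle positions
w = np.ones(500)                       # equal weights
LL = kde_from_particles(s, w)
```

The returned closure can be evaluated at arbitrary states, which is exactly what the message update (7) needs from LL(x_t^j).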
Step 4. Let p_ji^msg(x_t^j) denote the foundation of the message m_ji(x_t^i), as follows:

    p_ji^msg(x_t^j) = K_j ∏_{k∈NB(j,t)\(i,t)} m_kj(x_t^j),   (10)

where K_j is a constant which makes p_ji^msg(x_t^j) a probability density.
Step 5. Draw M samples

    s_t^{msg,j}(m) ~ p_ji^msg(x_t^j).   (11)

In order to approximate the integral of (7), we compute a weight for each sample:

    π_t^{msg,j}(m) = LL(s_t^{msg,j}(m)),   (12)

where p_ji^msg is defined in (10), and the incoming messages m_kj are obtained from the message update in the last iteration.
LL(x_t^j) is the result of the temporal filter for each facial feature, and we use it to calculate the sample weights of the message m_ji(x_t^i) in (7) for nonparametric belief propagation.
In (12), although LL(x_t^j) is in particle set form, it can still be evaluated.
Step 6. Form the message m_ji(x_t^i) from the weighted samples of Step 5 and iterate the message passing. The iterations are initialized with all messages set to constant values.
Step 7. Generally, after several iterations of message passing, the belief distribution has converged, and we obtain the marginal estimate of b(x_t^i) in (6) to get the final results. Given the input messages m_ji(x_t^i) from the spatial neighbors NB(i, t): (1) draw M samples s_t^i(m) from LL(x_t^i); (2) compute the weight for each sample s_t^i(m):

    π_t^i(m) = ∏_{j∈NB(i,t)} m_ji(s_t^i(m)).   (13)

We thus use the particle filter result LL(x_t^i) in this step as well as in Step 5; therefore our algorithm combines temporal particle filters with spatial belief propagation. To sample from a product of Gaussian mixtures in Step 7, we use a method similar to [10]. Sampling from the product can be decomposed into two steps: randomly select one of the product density's components, and then draw a sample from the corresponding Gaussian.
The algorithm in this paper is summarized in the above steps.
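The two-step sampling scheme used above (select a mixture component in proportion to its weight, then sample that component's Gaussian) can be sketched as follows; the mixture parameters are arbitrary illustrative values, and in the full algorithm the components would come from a product of Gaussian-mixture messages.

```python
import numpy as np

def sample_mixture(weights, means, stds, n, rng=None):
    """Draw n samples from a 1D Gaussian mixture by first picking a
    component in proportion to its weight, then sampling that Gaussian,
    mirroring the two-step scheme used in Steps 5 and 7."""
    rng = np.random.default_rng() if rng is None else rng
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    comps = rng.choice(len(weights), size=n, p=weights)  # step 1: pick component
    return rng.normal(means[comps], stds[comps])          # step 2: sample Gaussian

# Usage: a two-component mixture (illustrative parameters)
rng = np.random.default_rng(2)
x = sample_mixture([0.7, 0.3], [0.0, 5.0], [1.0, 0.5], 10000, rng)
```

With enough samples, the empirical mean approaches the mixture mean 0.7·0 + 0.3·5 = 1.5, and about 30% of the samples fall near the second component.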

Learning the correlation function
In the training database, we manually mark the facial features of each face; therefore we obtain the ground-truth position of each contour x_i. First we reduce the dimensionality of facial feature i's contour x_i by PCA. Then, from the training data, we fit mixtures of Gaussians to p(x_t^i) and to the joint probabilities p(x_t^j, x_t^i) for neighboring facial features i and j. We then evaluate ψ_ji(x_t^j, x_t^i) = p(x_t^j, x_t^i)/p(x_t^i); therefore the correlation function ψ_ji(x_t^j, x_t^i) is obtained.
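A simplified sketch of this learning step, with single Gaussians standing in for the paper's mixtures of Gaussians: the ratio p(x_j, x_i)/p(x_i) then reduces to the conditional of a bivariate Gaussian. The synthetic correlated "PCA coefficient" pairs are purely illustrative.

```python
import numpy as np

def gauss_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def learn_psi(xi_train, xj_train):
    """Learn psi_ji(x_j, x_i) = p(x_j, x_i) / p(x_i) from marked training
    contours; here a single bivariate Gaussian replaces the mixture fit."""
    mu = np.array([xj_train.mean(), xi_train.mean()])
    cov = np.cov(np.vstack([xj_train, xi_train]))      # rows: x_j, x_i
    a = cov[0, 1] / cov[1, 1]                          # regression slope
    cond_var = cov[0, 0] - cov[0, 1] ** 2 / cov[1, 1]  # conditional variance

    def psi(xj, xi):
        cond_mu = mu[0] + a * (xi - mu[1])             # E[x_j | x_i]
        return gauss_pdf(xj, cond_mu, cond_var)

    return psi

# Synthetic correlated coefficient pairs (e.g. mouth vs. eyebrow)
rng = np.random.default_rng(3)
xi = rng.normal(0.0, 1.0, 2000)
xj = 0.8 * xi + rng.normal(0.0, 0.3, 2000)
psi = learn_psi(xi, xj)
```

Given xi = 1.0, the learned ψ peaks near xj ≈ 0.8, reflecting the correlation in the training data.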

Optimizing Bayesian inference for Markov network
Considering that Bayesian inference using belief propagation costs substantial time, we only initiate it when the particle filter's tracking result is poor.
For the particle filter of one facial feature, the tracking result at time t can be described by the moments [1]:

    E[f(x_t) | Y_t] = Σ_{m=1}^{M} π_t(m) f(s_t(m)).

As in [1], a mean position using f(x_t) = x_t can be utilized for graphical display. Moreover, letting f(x_t) = x_t x_t^T, we obtain the second moment E(x_t x_t^T | Y_t), and hence the variance σ_t = E(x_t x_t^T | Y_t) − E(x_t | Y_t)E(x_t | Y_t)^T of the tracking result. We use the variance σ_t as a metric of tracking quality.
For each facial feature, we have σ_t^i, i = 1, ..., N, where N is the number of facial features (in this paper, N = 6). Facial features with larger variances are judged to have worse tracking results than the others, and belief propagation is carried out to infer more plausible positions of their contours. Based on experimental results, if σ_t^i > 0.5 · Area(x^i), we consider the tracking result for facial feature i to be bad, where Area(x^i) denotes the number of pixels occupied by facial feature i in the video stream. In our implementation, Area(x^i) is obtained by computing the bounding box of facial feature i.
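The variance-based quality test can be sketched as follows. Here the scalar spread is taken as the trace of the weighted sample covariance, one plausible reading of σ_t as a scalar, and the 0.5 · Area threshold follows the rule above.

```python
import numpy as np

def tracking_quality(samples, weights):
    """Weighted moments of a particle set: the mean for display, and the
    trace of the covariance as a scalar tracking-quality metric."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    samples = np.atleast_2d(np.asarray(samples, dtype=float))
    mean = weights @ samples                       # E[x_t | Y_t]
    centered = samples - mean
    cov = (weights[:, None] * centered).T @ centered
    return mean, float(np.trace(cov))

def needs_inference(spread, area):
    """Initiate belief propagation when the spread exceeds half the
    feature's bounding-box area (the 0.5 * Area(x^i) rule)."""
    return spread > 0.5 * area

# Usage: a confident particle set vs. a scattered one (2D toy states)
rng = np.random.default_rng(5)
w = np.ones(300)
tight = rng.normal(0.0, 0.5, (300, 2))
loose = rng.normal(0.0, 5.0, (300, 2))
mean_t, spread_t = tracking_quality(tight, w)
mean_l, spread_l = tracking_quality(loose, w)
```

Only the scattered set trips the threshold, so belief propagation is invoked just for the features whose particle filters have effectively lost the contour.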

EXPERIMENTAL RESULTS
We have developed a prototype system on the Windows platform using Visual C++ to implement the algorithm in this paper. There are 6 contour models for the facial features: eyebrows, eyes, nose, and mouth. Each contour is a quadratic B-spline curve; the contours of the nose and eyebrows are open curves, and the others are closed curves. As shown in Figure 6, there are 6, 9, 12, and 12 control points for each eyebrow, each eye, the nose, and the mouth, respectively. The total number of control points is 54; therefore the dimensionality is 108.
We choose the Cohn-Kanade facial expression database [37] as the training set, because it contains plenty of frontal faces with rich facial expressions. This database is stored as 30 fps grayscale image sequences. To learn the relationships among facial features, we selected 496 frontal face frames belonging to 98 different persons and used an interactive program to mark each facial feature's contour. PCA is used to reduce the dimensionality of each facial feature's contour. After that, the dimensionality of each eyebrow, each eye, the nose, and the mouth is 4, 7, 9, and 9, respectively; therefore the total dimensionality after reduction is 40, accounting for 99% of the total variance.
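A minimal sketch of the per-feature PCA step, choosing the smallest number of components that retains 99% of the variance; the synthetic contour data (a low-dimensional structure embedded in a higher-dimensional control-point vector) are illustrative.

```python
import numpy as np

def pca_reduce(contours, var_keep=0.99):
    """Reduce contour dimensionality by PCA, keeping enough components to
    explain var_keep of the total variance (applied per facial feature)."""
    X = np.asarray(contours, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD-based PCA: squared singular values give component variances
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(ratio, var_keep) + 1)
    coeffs = Xc @ Vt[:k].T          # low-dimensional PCA coefficients
    return coeffs, Vt[:k], mean, k

# Usage: 400 synthetic "contours" with 3 latent degrees of freedom
# embedded in a 24-dimensional control-point vector plus small noise
rng = np.random.default_rng(4)
latent = rng.normal(0.0, 1.0, (400, 3))
A = rng.normal(0.0, 1.0, (3, 24))
X = latent @ A + rng.normal(0.0, 0.01, (400, 24))
coeffs, basis, mean, k = pca_reduce(X)
```

On this data the 99% criterion recovers the 3 latent dimensions, and the low-dimensional coefficients reconstruct the contours up to the noise level.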
After constructing the PCA bases, we compute the corresponding PCA coefficients for each individual in the training set. For each of facial feature's contour pairs connected by edges in Figure 3, we determine kernel-based nonparametric density estimates for each node itself p(x i t ) and their joint probabilities p(x j t , x i t ). Figure 7 shows several marginalizations of p(x j t , x i t ), each of which relates a single pair of PCA coefficients (e.g., the first mouth and second left eye contour's coefficients). We can see that simple Gaussian approximations would lose most of this data set's meaningful structure.
Using a method similar to [23], we also trained the dynamic model for each facial feature. For the observation model, a set of independent measurement lines perpendicular to the hypothesized contour is used to measure the likelihood of detected edge points.
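A much-simplified sketch of such an observation model: sample the image along one measurement line perpendicular to the contour, locate the strongest intensity edge, and convert its offset into a likelihood. The Gaussian penalty and its parameters are assumptions for illustration, not the paper's exact model.

```python
import numpy as np

def edge_likelihood(image, point, normal, half_len=10, sigma=2.0):
    """Score a hypothesized contour point: sample the image along a
    measurement line perpendicular to the contour, find the strongest
    intensity edge, and return exp(-d^2 / (2 sigma^2)) for its offset d."""
    normal = np.asarray(normal, dtype=float)
    normal /= np.linalg.norm(normal)
    offsets = np.arange(-half_len, half_len + 1)
    rows = np.clip(np.round(point[0] + offsets * normal[0]).astype(int),
                   0, image.shape[0] - 1)
    cols = np.clip(np.round(point[1] + offsets * normal[1]).astype(int),
                   0, image.shape[1] - 1)
    profile = image[rows, cols].astype(float)
    grad = np.abs(np.diff(profile))       # edge strength along the line
    d = offsets[np.argmax(grad)]          # offset of the strongest edge
    return float(np.exp(-0.5 * (d / sigma) ** 2))

# Synthetic image with a horizontal edge at row 20
img = np.zeros((40, 40))
img[20:, :] = 255.0
like_good = edge_likelihood(img, (20.0, 15.0), (1.0, 0.0))  # on the edge
like_bad = edge_likelihood(img, (25.0, 15.0), (1.0, 0.0))   # 5 px off
```

A hypothesis sitting on the edge scores near 1, while one several pixels away is penalized, which is the behavior the particle weights in (1) rely on.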
Using a single condensation tracker with multiple contours to track multiple facial features is infeasible because the dimensionality is much higher than 10. We therefore compare our results with those of multiple independent condensation-style trackers. We tested our algorithm on image sequences from the Cohn-Kanade database and on videos (640 × 480, 30 fps) that we captured with a digital video camera. The test image sequences are not included in the training database.
As stated in Section 4.2, our algorithm can be viewed as a conventional condensation tracker plus contour adjustment by belief propagation. The experimental results (see Figures 8, 9, and 10) show that our tracker is more robust than multiple independent condensation-style trackers (MICT); for example, dark circles and teeth distract the MICT tracker, so it fails to track the corresponding features. In Figure 10, we also compare our algorithm's results with those of the active appearance model (AAM). AAM [15] is based on face alignment, and we use the same training set as for MICT to train the active appearance model. From Figure 10, we can see that AAM fails to track the mouth in the case of occlusion. Our algorithm is more accurate than AAM, though it is slightly slower. Our algorithm is also more robust than AAM: even if the particle filter fails to track the mouth, the mouth's location and state can be inferred from the spatial domain by belief propagation. For AAM, it is difficult to incorporate negative samples (e.g., occlusion) into the training set; therefore AAM performs badly when facial feature occlusion happens.
For all the testing image sequences, our algorithm obtains better results than those of MICT and AAM. For the experimental results shown in Figures 8, 9, and 10, the image sequences have 116, 368, and 900 frames, respectively.
Our tracker runs at about 3 Hz, the MICT tracker at about 4 Hz, and the AAM tracker at about 3.5 Hz on a Pentium 4 1.8 GHz computer with 256 MB RAM.

CONCLUSIONS
In this paper, we extend the particle filter from the relatively simple Markov chain to a directed-cum-undirected graphical model applied to the multiple facial feature tracking problem. Spatial structure information and the relationships among nodes at each time instant are effectively considered by Bayesian learning and inference in the loopy belief propagation framework.
The advantages of our algorithm are as follows. Compared with particle filters, we extend conventional particle filters to track multiple facial features simultaneously by exploring the spatial coherence in each time step, and the complexity of tracking is linear rather than exponential in the number of facial features. Compared with AAM, our algorithm is more robust.
The tracking results in this paper can be used as motion capture data. We plan to use these data to drive a 3D face model and generate facial animations; the ultimate purpose of multiple facial feature tracking is facial animation.
Currently, the tracking results are 2D control points of B-splines in each time instant. In the future, we will use these results as video-based motion capture data. Using performance-driven facial animation techniques, we can obtain 3D facial animation of the tracked human face from 2D mocap data. Finally we will retarget animation from human faces to other virtual avatars.
While our current results are promising, details of our implementation could be improved. The spatial and temporal relationships among facial features could also be learned by recently proposed spatio-temporal manifold learning algorithms. Currently, the testing image sequences are captured by a fixed camera; for greater applicability, the method should be extended to allow a moving camera. Our tracking algorithm will fail if the face is too small in the video stream, and MICT and AAM fail in that case too; face video super-resolution techniques may alleviate this problem.