EURASIP Journal on Applied Signal Processing 2004:4, 510–521 c ○ 2004 Hindawi Publishing Corporation A Probabilistic Model for Face Transformation with Application to Person Identification

A novel approach for content-based image retrieval and its specialization to face recognition are described. While most face recognition techniques aim at modeling faces, our goal is to model the transformation between face images of the same person. As a global face transformation may be too complex to be modeled directly, it is approximated by a collection of local transformations with a constraint that imposes consistency between neighboring transformations. Local transformations and neighborhood constraints are embedded within a probabilistic framework using two-dimensional hidden Markov models (2D HMMs). We further introduce a new efficient technique, called turbo-HMM (T-HMM) for approximating intractable 2D HMMs. Experimental results on a face identification task show that our novel approach compares favorably to the popular eigenfaces and fisherfaces algorithms.


INTRODUCTION
Pattern classification is concerned with the general problem of inferring classes or "categories" from observations [1]. The success of a pattern classification system is largely dependent on the quality of its stochastic model, which generally models the generation of observations, to capture the intraclass variability.
Face recognition is a challenging pattern classification problem [2,3] as face images of the same person are subject to variations in facial expression, pose, illumination conditions, presence or absence of eyeglasses and facial hair, and so forth. Most face recognition algorithms attempt to build for each person P a face model ᏹ p (the stochastic source of the system) which is designed to describe as accurately as possible his/her intraface variability. This paper introduces a novel approach for contentbased image retrieval, which is applied to face identification and whose stochastic model focuses on the relation between observations of the same class rather than the generation process. Here we attempt to model a transformation be-tween face images of the same person. If Ᏺ T and Ᏺ Q are, respectively, template and query images and if ᏹ is the probabilistic transformation model, then our goal is to estimate P(Ᏺ T |Ᏺ Q , ᏹ). An important assumption made here is that the intraclass variability is the same for all classes and thus, ᏹ can be shared by all individuals. As the global face transformation may be too complex to be modeled directly, the basic idea is to split it into a set of local transformations and to ensure neighborhood consistency of these local transformations. Local transformations and neighboring constraints are embedded within a probabilistic framework using twodimensional hidden Markov models (2D HMMs). A similar approach for general content-based image retrieval appeared first in [4] and preliminary results were presented on a database of binary images.
The remainder of this paper is organized as follows. Our probabilistic model of face transformation based on 2D HMMs will be detailed in Section 2. In Section 3, we introduce turbo-HMMs (T-HMMs), a set of interdependent horizontal and vertical 1D HMMs that are exploited to approximate the computationally intractable 2D HMMs. T-HMMs are one of the main contributions of this paper and one of the keys of the success of our approach as we derive efficient formulas to compute P(Ᏺ T |Ᏺ Q , ᏹ) and to train automatically all the parameters of the face transformation model ᏹ.
In Section 4, we conceptually compare our novel algorithm to two different face recognition approaches that are particularly relevant: modeling faces with HMMs [5,6] and elastic graph matching (EGM) [7]. In Section 5, we give experimental results for a face identification task on the FERET face database [8] showing that the proposed approach can significantly outperform two popular face recognition algorithms, namely eigenfaces and fisherfaces. Finally, we outline future work.

MODELING FACE TRANSFORMATION
In this section, we model the transformation between two face images of the same person using a probabilistic framework based on local mapping and neighborhood consistency.

Framework
Our assumption is that a global transformation between two face images of the same person may be too complex to be modeled directly and that it should be approximated with a set of local transformations. They should be as simple as possible for an efficient implementation but such that the composition of all local transformations, that is, the global transformation, should be rich enough to model a wide range of transformations between faces of the same person. However, if we allow any combination of local transformations, the model could be over flexible and capable of patching together very different faces. This naturally leads to the second component of our framework: a neighborhood coherence constraint. The purpose of the neighborhood constraint is to provide context information and to impose consistency requirements on the combination of local transformations. It must be emphasized that such neighborhood consistency rules produce dependence in the local transformation selection for all image regions and the optimal solution must therefore involve a global decision. To combine the local transformation and consistency costs, we propose to embed the system within a probabilistic framework using 2D HMMs.
At any location on the face, the system is considered to be in one of a finite set of states. Assuming that the 2D HMM is first-order Markovian, the probability of the system to enter a particular state at a given position, that is, the transition probability, depends on the state of the system at the adjacent positions in both horizontal and vertical directions. At each position, an observation is emitted by the state according to an emission-probability distribution. In our framework, local transformations can be viewed as the states of the 2D HMM and emission probabilities model the local mapping cost. These transformations are "hidden" and information on them can only be extracted through the observations. Transition probabilities relate states of neighboring regions and implement the consistency rules. In the following, Query Template we specify the local transformations and neighborhood constraints.

Local transformations
A local transformation maps a region in a template image Ᏺ T to a cell in a query image Ᏺ Q . In the simplest setting, regions are obtained by tiling Ᏺ T into possibly overlapping blocks. However, one could envision a more complex tiling scheme where regions may be irregular cells, for example, the outcome of a segmentation algorithm. There are two possible types of transformations: geometric and feature transformations. Translation, rotation, and scaling are examples of simple geometric transformations and may be useful to model local deformations of the face. In the simple case where features are the pixel values, gray level shift or scale would be examples of simple feature transformations and could be used to compensate for illumination variations. The difference between geometric and feature transformations is not as clearcut as it may first seem and is dependent on the domain of the feature vectors. For instance, while a scaling was previously classified as geometric transformation, it could also be interpreted as a feature transformation in the Fourier domain. In the remainder of this paper, the only geometric transformation we used was the translation (if blocks are small enough, one can approximate a slight global affine transformation with a set of local translations). Hence, cells of Ᏺ Q are blocks of the same size as the blocks of Ᏺ T . As we chose Gabor features (cf. Section 5.2) which are robust to small variations in illumination, we did not implement any feature transformation.
We now explicate the emission probability which models the cost of a local transformation. An observation o i, j is extracted from each block (i, j) of Ᏺ T (cf. Figure 1). Let q i, j be the state associated with block (i, j). The probability that at position (i, j), the system emits observation o i, j knowing that it is in state q i, j = τ, where τ = (τ x , τ y ) is a translation vector, and knowing λ, the set of parameters of the HMM, is b τ (o i, j ) = P(o i, j |q i, j = τ, λ). Let z i, j = (x i, j , y i, j ) denote the coordinates of block (i, j) (i.e., the coordinates of its upper left pixel) in Ᏺ T . Let z τ i, j be the coordinates of the matching block in Ᏺ Q : z τ i, j = z i, j + τ. The emission probability b τ (o i, j ) represents the cost of matching these two blocks.
The emission-probability b τ (o i, j ) is modeled with a mixture of Gaussians (linear combinations of Gaussians have the ability to approximate arbitrarily shaped densities): where {b τ,k (o i, j )} are the component densities and {w τ,k i, j } are the mixture weights and must satisfy the constraint: ∀(i, j) and ∀τ, k w τ,k i, j = 1. Each component density is an Nvariate Gaussian function of the form where µ τ,k i, j and Σ τ,k i, j are, respectively, the mean and covariance matrix of the Gaussian, N is the size of the feature vectors, and | · | is the determinant operator. This HMM is nonstationary as Gaussian parameters depend on the position (i, j).
The choice of notation P(Ᏺ T |Ᏺ Q , ᏹ) suggests that we should separate Gaussian parameters into face-dependent (FD) parameters, that is, parameters that depend on a particular query image, and face-independent transformation (FIT) parameters, that is, the parameters of ᏹ that are shared by all individuals. The benefits of such a separation are discussed in Section 4.1. Let m τ i, j be the feature vector extracted from the matching block in Ᏺ Q . We use a bipartite model which separates the mean into additive FD and FIT parts: where m τ i, j is the FD part of the mean and δ τ,k i, j is an FIT offset. Intuitively, b τ,k (o i, j ) should be approximately centered and maximum near m τ i, j . The parameters we need to estimate are the FIT parameters, that is, {w}, {δ}, and {Σ}.

Neighborhood consistency
The neighborhood consistency of the transformation is ensured via the transition probabilities of the 2D HMM. If we assume that the 2D HMM is first-order Markovian in a 2D sense, the transition probabilities are of the form P(q i, j |q i, j−1 , q i−1, j , λ). However, we show in Section 3 that a 2D HMM can be approximated by a turbo-HMM (T-HMM): a set of horizontal and vertical 1D HMMs that "communicate" through an iterative process. The transition probabilities of the corresponding horizontal and vertical 1D HMMs are given by where a Ᏼ i, j and a ᐂ i, j model, respectively, the horizontal and vertical elastic properties of the face at position (i, j) and are part Query Template of the face transformation model ᏹ. Figure 2 represents the neighborhood consistency between adjacent vertical blocks.
As we want to be insensitive to global translations of face images, we choose a Ᏼ and a ᐂ to be of the form where δτ = τ − τ . We can apply further constraints on the transition probabilities to reduce the number of free parameters in our system. We can assume, for instance, separable transition probabilities: We can also assume parametric transition probabilities. If Ᏺ T and Ᏺ Q have the same scale and orientation, then a Ᏼ i, j and a ᐂ i, j should have two properties: they should preserve both local distance, that is, τ and τ should have the same norm, and ordering, that is, τ and τ should have the same direction. A horizontal separable parametric transition probability that satisfies the two previous constraints is where c is a normalization factor such that δτx a Ᏼx i, j (δτ x ) = 1 and δτy a Ᏼy i, j (δτ y ) = 1. Similar formulas can be derived for vertical transition probabilities.
In this section, we specified and derived emission and transition probabilities but have not introduced another traditional HMM parameter: the initial occupancy probability distribution. We assume in the remainder that the initial occupancy probability is uniform to ensure invariance to global translations of face images. In the next section, we derive efficient formulas to compute P(Ᏺ T |Ᏺ Q , ᏹ) and to train automatically all the parameters of the face transformation model ᏹ, that is, {w}, {δ}, {Σ}, and transition probabilities {a Ᏼ } and {a ᐂ }.

TURBO-HMMs
While the HMM has been extensively applied to onedimensional problems, the complexity of its extension to two dimensions grows exponentially with the data size and is intractable in most cases of interest. Many approaches to solve the 2D problem consist of approximating the 2D HMM with one or many 1D HMMs. Perhaps the simplest approach is to trace a 1D scan that takes into account as much of the neighborhood relationship of the data as possible, for example, the Hilbert-Peano scan [9]. Another approach is the so-called pseudo 2D HMM [10] which assumes that there exists a set of "super" states which are Markovian and which subsume a set of simple Markovian states. Finally, the path-constrained variable state Viterbi algorithm [11] considers sequences of states on a row (or a column, a diagonal, etc.) as states of a 1D HMM. However, this 1D HMM has such a huge number of states that the direct application of the Viterbi algorithm is often unpractical. Hence the central idea is to consider only the N sequences with the largest posterior probabilities.
We recently introduced a novel approach that transforms a 2D HMM into a turbo-HMM (T-HMM): a set of horizontal and vertical 1D HMMs that "communicate" through an iterative process. Similar approaches have been proposed in the image processing community, mainly in the context of image restoration [12] or page layout analysis [13]. The term "turbo" was also used in [13] in reference to the now celebrated turbo error-correcting codes. However, in [13], the layout of the document is preformulated with two orthogonal grammars and the problem is clearly separated into horizontal and vertical components in distinction with the more challenging case of general 2D HMMs.
While [14] focused on decoding, that is, searching the most likely state sequence, in this section, we provide efficient formulas to (1) compute the likelihood of a set of observations given the model parameters and (2) train the model parameters.

The modified forward-backward
We assume in the following that the reader is familiar with 1D HMMs (see, e.g., [15]) . . , J} be the set of all observations. For convenience, we also introduce the notations o Ᏼ i and o ᐂ j for the ith row and jth column of observations, respectively. Similarly, Q = {q i, j , i = 1, . . . , I, j = 1, . . . , J} denotes the set of all states, while q Ᏼ i and q ᐂ j denote the ith row and jth column of states. Finally, let λ be the set of all HMM parameters and let λ Ᏼ i and λ ᐂ j be the respective rows and columns of parameters. The goal of this section is to compute P(O|λ) using the quantities introduced in Table 1. It was shown in [14] that the joint likelihood of O and Q, given λ, can be approximated by where each term P(o ᐂ j , q ᐂ j |λ ᐂ j ) corresponds to a 1D verti- is, in effect, a horizontal prior for column j. We assume that the quantity is known, that is, it was obtained during the previous horizontal step.
If we sum over all possible paths, we obtain the following marginal: We introduce the compact notation {P ᐂ j } can be computed using a modified version of the forward-backward algorithm which we describe next after introducing one last notation: The forward α variables (i) Initialization: (ii) Recursion: (iii) Termination: The backward β variables (i) Initialization: (ii) Recursion: Occupancy probability γ Similar formulas can be derived for the horizontal pass. It is worthwhile to note that our reestimation equations are similar to the ones derived for the page layout problem in [13] based on the graphical model formalism. Also, we can see that the interaction between horizontal and vertical processing, which is based on the occupancy probability γ, is not as simple as the one used in [12].
Next, we consider the steps of the algorithm. We first initialize γ's uniformly (i.e., assuming no prior information). Then, the modified forward-backward algorithm is applied successively and iteratively on the rows and columns. Whether the iterative process is initialized with row or column operation may theoretically impact the performance. However, this choice had a very limited impact in our experiments and we always started with a horizontal pass. This algorithm is clearly linear in the size of the data and can be further accelerated with a parallel implementation, simply by running the modified forward-backward for each row or column on a different processor.
One should be aware that we do not end up with one score but with one horizontal score P(O|λ Ᏼ ) and one vertical score P(O|λ ᐂ ). Combining these two scores is a classical problem of decision fusion. As experiments showed that these scores were generally very close, we simply averaged them to obtain a global score. Although this simple heuristic may not be optimal, it provided good results.

The modified Baum-Welch algorithm
We now estimate the parameters of the T-HMM. Generally, the maximum likelihood (ML) reestimation formulas can be derived directly by maximizing Baum's auxiliary function [16] Here, the problem is that we obtain two equations that may be incompatible in the case where γ Ᏼ 's and γ ᐂ 's do not converge. So a simple combination rule is to maximize To train the system, we provide a set of pairs of pictures. Each pair contains a template and a query image that belong to the same person. We now provide formulas for reestimating the Gaussian parameters and transition probabilities. Index p in the sums of the following formulas is for the pth pair of pictures. Although each quantity o i, j , m τ i, j , γ i, j , and ξ i, j should be indexed with p in the following equations, we omitted this index on purpose to simplify notations.
Let k)) be the probability of being in state q i, j = τ at position (i, j) during the horizontal (resp., vertical) pass with the kth mixture component accounting for o i, j : We also introduce We assume diagonal covariance matrices and general transition matrices. The reestimation formulas are as follows (the update for a single dimension is shown for δ and σ): A formula similar to (26) can be derived for vertical transition probabilities.

RELATED WORK
The goal of this section is not to provide a full review of the literature on face recognition (the interested reader can refer, for instance, to [2,3]) but to compare the proposed approach to two different algorithms from a conceptual point of view. The first one consists in modeling faces with HMMs [5,6]. The interesting point is that, although we use the same mathematical framework (HMMs), the philosophy is different as [5,6] model a face while our algorithm models a transformation between faces. The second algorithm, elastic graph matching (EGM) [7], is particularly relevant to this paper as its philosophy, based on local similarity and neighborhood consistency, is similar to the philosophy of the proposed algorithm.

Modeling faces with HMMs
Modeling faces with HMMs was pioneered in [5] and later improved in [6]. While early work involved a simple topbottom 1D HMM, a model based on pseudo 2D HMMs (P2D HMMs) [10] proved to be more successful. The assumption of P2D HMMs is that there exists a set of "super" states which are Markovian and which themselves contain a set of simple Markovian states. In the following, we do not compare approaches in terms of their mathematical frameworks, that is, we do not compare P2D HMMs to T-HMMs, but in terms of the philosophies of both methods. While our HMM models a face transformation, HMMs in [5,6] model faces. In our framework, the parameters of the HMM can be clearly separated into FD parameters (the features extracted from Ᏺ Q ) and FIT parameters (δ's, Σ's, w's, and transition probabilities a Ᏼ 's and a ᐂ ) as seen in Section 2.2. These transformation parameters are shared by all persons as we assume that they have similar facial properties. The intraclass variability, due, for instance, to different facial expressions, can therefore be estimated reliably by pooling the data of all training individuals. Of course, if one had large amounts of enrollment material for each person, one could envision to train one set of face transformation parameters per individual but the amount of enrollment data is generally scarce.
One major drawback of the approach in [5,6] is that the separation of parameters cannot be done as easily and, generally, these HMMs confound all sources of variability. For instance, each HMM face has to model variations due to facial expressions. Therefore, to train the mixture of Gaussians that would correspond to the mouth, one should provide for each person an example image with the mouth in various states, open, smiling, and so forth, and it is conceivable that in each HMM face, a fair number of Gaussians models the various facial expressions. Hence, one has to train a large number of Gaussians using large amounts of training data from the same individual to get a good performance.
One drawback of our method is that we do not have a probabilistic model of the face. m τ i, j is directly extracted from a face image and is not the result of a training process. Nevertheless, as we efficiently separate parameters, only a small number of template images should be required to train m τ i, j 's.

Elastic graph matching
EGM stems from the neural network community. Its basic principle is to match two face graphs in an elastic manner [7,17]. The quality of a match is evaluated with a cost function Ꮿ: where Ꮿ v is the cost of the local matchings, Ꮿ e is the cost of local distortions, and ρ is a rigidity parameter which controls the balance between Ꮿ v and Ꮿ e . The matching is generally a two-step procedure: the two faces are first mapped in a rigidly manner and then elastic matching is performed through iterative random perturbations of the nodes of the graph. Both optimization steps correspond to a simulated annealing (SA) at zero temperature [7]. Wiskott et al. [18] elaborated on the idea with the elastic bunch graph matching (EBGM) which can be used for face recognition and also face labeling. Both algorithms were later improved, especially to incorporate local coefficients that weight the different parts of the face according to their discriminatory power using for instance fisher's linear discriminant (FLD) [19] or support vector machines (SVM) [20].
It is clear that the philosophies of EGM and of the proposed framework are distinct but bear obvious similarities. In our approach, the joint log-likelihood of observations and states log P(O, Q|λ) can be separated into log P(O, Q|λ) = log P(O|Q, λ) + log P(Q|λ).
The first term, which depends on emission probabilities, corresponds to the local matchings cost Ꮿ v and the second term, which depends on transition probabilities, corresponds to the local distortions cost Ꮿ e . Moreover, in the simple case where we use one Gaussian mixture, for the whole face, with a single Gaussian in the mixture (Σ τ,k i, j = Σ) and where there is, for the whole face, one unique transition probability which is separable and parametric (cf. Section 2.3), the formula for the joint log-likelihood log P(O, Q|λ) would be almost identical to Ꮿ in [7]. The main advantages of our novel approach are in: (1) the use of the well-developed HMM framework and (2) the use of a shared deformable model of the face.
First, as shown in Section 3.1, one can use a modified version of the forward-backward algorithm to compute the likelihood of the observations knowing the set of parameters. In EGM, the quality of the matching is generally assessed using a best match which, in the HMM framework, is equivalent to the Viterbi algorithm, whose aim is to find the best path in a trellis. Our score, which takes into account all paths, should be more robust.
Another advantage is the existence of simple formulas to train automatically all the parameters of the system (cf. Section 3.2). This is not the case with EGM as the parameter ρ is generally set manually. Duc et al. [19] showed experimentally that ρ only has a small impact on the final performance. However, as different parts of the face have different elastic properties, it would be natural to use different elastic coefficients for each part of the face. Hence, ρ may have a limited influence either because Ꮿ e is noninformative, which is implicitly suggested by [20], for instance, where Ꮿ v is discarded, or because the elastic properties of the face are poorly modeled with one unique parameter ρ. Using multiple elasticity coefficients is only possible if these coefficients can be trained automatically. To the best of our knowledge, it has never been investigated in the EGM framework and it is evaluated in Section 5.
Finally, while different methods have been proposed to weight the different parts of the face according to their discriminatory power [19,20], they all suggest to train one set of parameters per person. To train these parameters, one should have a reasonable amount of enrollment data. The interpretation of "reasonable" is application dependent but at least two images should be provided by each person at enrollment time. In our case, as the model of face transformation is shared, its parameters can be trained offline and do not need to be reestimated each time a new user is enrolled. Thus, we are able to weight the different parts of the face even when one unique image is available at enrollment time.

EXPERIMENTS
In this section, we assess the performance of our novel algorithm on a face identification task and compare it to two popular algorithms: eigenfaces and fisherfaces.

The database
All the following experiments were carried out on a subset of the FERET face database [8]. We used 1,000 individuals: 500 for training the system and 500 for testing the performance. We use two images (one target and one query image) per training and test individual. This means that test individuals are enrolled with one unique image. Target faces are FA images extracted from the gallery and query images are extracted from the FB probe. FA and FB images are frontal views of the face that exhibit large variabilities in terms of facial expressions. Images are preprocessed to extract normalized facial regions. For this purpose, we used the coordinates of the eyes and the tip of the nose provided with each image. First, each image was rotated so that both eyes were on the same line. Then a square box, twice the size of the interocular distance, was centered around the nose. Finally the corresponding region was cropped and resized to 128 × 128 pixels. See Figure 5 for an example of normalized face image.

Gabor features
We used Gabor features that have been successfully applied to face recognition [7,18,19,21] and facial analysis [22]. Gabor wavelets are defined by the following equation: where (i) exp(ik µ,µ z) is a plane wave, k µ,ν , the center frequency of the filter, is of the form k µ,ν = k ν exp(iφ µ ), and µ and ν define, respectively, the orientation and scale of k µ,ν . Let k max be the maximum frequency and let f be the spacing factor. Then k ν = k max / f ν . If M be the number of orientations, φ µ = πµ/M; (ii) exp(− k µ,ν 2 z 2 /2σ 2 ) is a Gaussian envelope which restricts the plane wave and σ determines the ratio of window width to wavelength. We should underline that, in our experiments, the plane wave is also restricted by the size of the blocks (cf. Section 2.2); (iii) exp(−σ 2 /2) is a term that makes the filter DC free; (iv) k µ,ν 2 /σ 2 compensates for the frequency-dependent decrease of the power spectrum in natural images.
Each kernel ψ µ,ν exhibits properties of spatial frequency, spatial locality, and orientation selectivity. Gabor responses are obtained through the convolution of the face image and the Gabor wavelet and we use the modulus of these responses as feature vectors.
After preliminary experiments, the block size was fixed to 32 × 32 pixels and we chose the following set of parameters for the Gabor wavelets: five scales, eight orientations, σ = 2π, k max = π/4, and f = √ 2. Finally, for each image, we normalized the feature coefficients to zero mean and unit variance which performed a divisive contrast normalization [22].

The baseline: eigenfaces and fisherfaces
For comparison purpose, we implemented the eigenfaces and fisherfaces algorithms. We should note that both methods are examples of techniques where one attempts to build a model of the face.
Eigenfaces are based on the notion of dimensionality reduction. Kirby and Sirovich [23] first outlined that the dimensionality of the face space, that is, the space of variation between images of human faces, is much smaller than the dimensionality of a single face considered as an arbitrary 2D image. As a useful approximation, one may consider an individual face image to be a linear combination of a small number of face components or eigenfaces derived from a set of reference face images. One calculates the covariance or correlation matrix between these reference images and then applies principal component analysis (PCA) [24] to find the eigenvectors of the matrix: the eigenfaces. To find the best match for an image of a person's face in a set of stored facial images, one may calculate the distances between the vector representing the new face and each of the vectors representing the stored faces, and then choose the stored image yielding the smallest distance [25].
While PCA is optimal with respect to data compression [23], in general it is suboptimal for a recognition task. For such a task, a dimension-reduction technique such as FLD should be preferred to PCA. The idea of FLD is to select a subspace that maximizes the ratio of the interclass variability and the intraclass variability. However, the straightforward application of this principle is often impossible due to the high dimensionality of the feature space. A method called fisherfaces was developed to overcome this issue [26]. First, one applies PCA to reduce the dimension of the feature space and then performs the standard FLD. A major similarity between our novel approach and fisherfaces is the fact that both algorithms assume that the intraclass variability is the same for all classes. The difference is in the way to deal with this variability; while fisherfaces try to cancel the intraface variability, we attempt to model it. For a fair comparison, we did not apply directly eigenfaces and fisherfaces on the gray-level images but on the Gabor features as done, for instance, in [21]. A feature vector was extracted every four pixels in the horizontal and vertical directions (which means that there is a 28-pixels block overlap) and the concatenation of all these vectors formed the Gabor representation of the face. In [21], various metrics were tested to compute the distance between points in an eigenface or a fisherface spaces: the L 1 , L 2 (Euclidean), Mahalanobis, and cosine distances. We chose the Mahalanobis metric which consistently outperformed all other distances. The performance was plotted on Figure 3 as a function of the number of eigenfaces and fisherfaces.
The best eigenfaces and fisherfaces identification rates are, respectively, 80% with the maximum possible number of eigenfaces and 93.2% with 300 fisherfaces. Fisherfaces were not guaranteed to perform so well due to the very limited number of elements per class in the training set (only two faces per person). However, in our experiments, they managed to generalize on novel test data.

Performance of the novel algorithm
Before showing experimental results of the proposed approach, we describe in detail the experimental setup. To reduce the computational load, and for a fair comparison with eigenfaces and fisherfaces, the precision of a translation vector τ was limited to 4 pixels in both horizontal and vertical di-rections and a feature vector m was extracted every 4 pixels of the query image. For each template image, a feature vector o was extracted every 16-pixels in both horizontal and vertical directions (which means that there is a 16-pixels block overlap) and it resulted in 7 × 7 = 49 observations per template image. We tried smaller step sizes for template images but this resulted in marginal improvements of the performance at the expense of a higher computational load.
We implemented traditional optimizations to speed up the algorithm at training and test time.
(i) Windowing: if we assume that Ᏺ T and Ᏺ Q are approximately aligned, then for each block in Ᏺ T , one can limit the search for possible matching blocks in Ᏺ Q in a neighborhood (or window) of this block by setting While T x and T y should ideally be input dependent, based, for instance, on some a priori knowledge on the distortion between Ᏺ T and Ᏺ Q , for simplicity, these parameters were constant in our system. After preliminary experiments, T x and T y were set to 8 pixels which limited the number of matching blocks, that is, of possible active states, to 5 × 5 = 25 at each position. (ii) Transition pruning: to limit the number of possible output transition probabilities at each state, we discard unlikely transitions, that is, unreasonable deformations of the face. For the horizontal transition probabilities, we impose The same constraint can be applied to vertical transition probabilities. Similarly to the windowing parameters, while the ∆'s should be input dependent, they were constant in our system. After preliminary experiments, ∆'s were set to 8 pixels which limited the number of horizontal or vertical transition probabilities going out of a state to 5 × 5 = 25. (iii) Beam search: the idea is to prune unlikely paths during the forward-backward algorithm [27]. During the forward pass, at each position (i, j), all α values that fall more than the beam width below the maximum α value at that position are ignored, that is, set to zero. Then, during the backward pass, β values are computed only if their associated α value is greater than zero. The beam size was set to 100.
The training and decoding algorithms based on T-HMMs are efficient as, once Gabor features are extracted, our nonoptimized code compares two face images in less than 15 milliseconds on a 2 GHz Pentium 4 with 512M RAM. We assume that Σ τ,k i, j = Σ k i, j , δ τ,k i, j = δ k i, j , and w τ,k i, j = w k i, j to reduce the number of the parameters to estimate. To train single Gaussian mixtures, we first align approximately Ᏺ T and Ᏺ Q and we match each block in Ᏺ T with the corresponding block in Ᏺ Q . As for the transition probabilities, they are initialized uniformly. Then Σ's and a i, j 's are reestimated. To train multiple Gaussians per mixture, we used an iterative splitting/retraining strategy inspired by the vector quantization algorithm [27,28] Figure 4: Performance of the proposed algorithm.
We measured the impact of using multiple Gaussian mixtures to weight the different parts of the face and using multiple horizontal and vertical transitions matrices to model the elastic properties of the various parts of the face. In both cases, we used face symmetry to reduce the number of the parameters to estimate. Hence, we tried one mixture for the whole face (Σ k i, j = Σ k , δ k i, j = δ k , and w k i, j = w k ) and one mixture for each position (using face symmetry, it resulted in 4 × 7 = 28 mixtures). We tried one horizontal and one vertical transition matrices for the whole face and one horizontal and one vertical transition matrices at each position (using face symmetry, it resulted in 3×7 = 21 horizontal and 4 × 6 = 24 vertical transition matrices). This made four test configurations. The performance was drawn on Figure 4 as a function of the number of Gaussians per mixture.
While applying weights to different parts of the face provides a significant increase of the performance, modeling the various elasticity properties of the face had a limited impact and resulted in marginal improvements. The best performance is 96.0% identification rate. We performed a Mc-Nemar's test of significance to determine whether the difference in performance between fisherfaces and the proposed approach is statistically significant [29]. Let K be the number of faces on which only one algorithm made an error (K = 26) and let M be the number of faces on which the proposed algorithm was correct while fisherfaces made an error (M = 6). The probability that the difference in performance between these algorithms would arise by chance is P = 2 K m=M K m (1/2) K = 0.009, which means we are 99% confident that this difference is significant.
It is also interesting to compare our novel approach to EGM. As stated in Section 4.2, we think that the main advantages of our novel approach are (1) in the use of the welldeveloped T-HMM framework which provides efficient for-mulas to compute P(Ᏺ T |Ᏺ Q , ᏹ) and to estimate all the parameters of M and (2) in the use of a shared deformable model of the face. Therefore, we will compare the benefits of these two improvements independently. Firstly, we can replace the T-HMM scoring with the SA scoring which is mostly used in the EGM framework. The iterative elastic matching step is generally stopped after a predefined number of iterations N have failed to increase the score. We fixed this figure N so that the amount of computation required by the SA scoring would be similar to the amount of computation required by the T-HMM scoring. We get approximately a 2.0% absolute increase of the performance for our best system with 16 Gaussians per mixture when we use the T-HMM scoring rather than the SA scoring which indicates that the former scoring procedure is more robust. Secondly, if we did not assume a shared transformation model, as we only have one image per person at enrollment time, we would not be able to train one set of parameters per person as is usually done in the EGM framework. Thus, in this case, an upper bound for the performance of EGM is the performance of our system in the simple case where we use one Gaussian mixture for the whole face, with a single Gaussian in the mixture, and where there is, for the whole face, one unique transition probability which is separable and parametric (cf. Section 4.2). The identification rate of such a system is approximately 84.0%, far below the performance of our best system with 16 Gaussians per mixture (cf. Figure 4).

Analysis
Finally, we visualize which parts of the face are the least variable, and thus, considered by our system the most reliable for face recognition (cf. Figure 5a), and which parts are the most elastic (cf. Figures 5b and 5c). The analysis was done on the system with 28 mixtures, 21 horizontal transition probabilities, and 24 vertical transition probabilities. In the case where there is only 1 GpM, log |Σ −1 i, j | is a simple measure of local variability: the greater is this value, the fewer variability a face exhibits around position (i, j). It is interesting to note that the upper part of the face exhibits less variability than the lower part and thus, has a higher contribution during identification, which is consistent with other findings [2]. To visualize the elasticity information, we represented the horizontal, respectively, vertical, parametric transition probabilities as vectors (σ Ᏼx i, j , σ Ᏼy i, j ), respectively, (σ ᐂx i, j , σ ᐂy i, j ).

FUTURE WORK
A first improvement was suggested in Section 4.1. In our current implementation, we compute the distance between a template image and a query image using a face transformation model. In the case where we have multiple template images for person P, we should combine them into a single face model ᏹ p (this would require a new formula for the face dependent part of the mean m τ i, j ). Hence we should model a transformation between a face model ᏹ p and a query image Ᏺ Q . If λ is the set of parameters of the transformation model, we should then estimate P(ᏹ p |Ᏺ Q , λ). A second possible improvement would be to use a discriminative criterion rather than an ML criterion to train the parameters of the face transformation model. If we assume that our HMM models perfectly the face transformation between faces of the same person and if we have infinite training data, then ML estimation can be shown to be optimal. However, as the underlying transformation is not a true HMM and as training data is limited, other training objective functions should be considered. During ML training, pairs of face images corresponding to the same individual were presented to our system and model parameters were adjusted to increase the likelihood of the template images, knowing the query images and the model parameters without taking into account the probability of other possible faces. In contrast to ML estimation, discriminative approaches such as minimum classification error (MCE) [30,31] or maximum mutual information estimation (MMIE) [32,33] would consider competing faces to reduce the probability of misclassification.
Although we have only presented face identification results, we should consider the extension of this work to face verification. While the first idea would be simply to threshold the score (P(Ᏺ Q |ᏹ p , λ) > θ), this approach is known to lack robustness when there is a mismatch between training and test conditions [34]. Generally, a likelihood normalization of the following form has to be performed: where ᏹp is an antiface model for individual P and P(Ᏺ Q |ᏹp, λ) is the likelihood that Ᏺ Q belongs to an impostor. Two types of antimodels are generally used: background model set (BMS), where the set of background model for each client is selected from a pool of impostor models, and universal background model (UBM), where a unique background model is trained using all the impostor data [34,35]. While the latter approach usually outperforms the first one, both score normalization methods should be tested on our novel approach. While we showed that our system could model with great accuracy facial expressions with local geometric transformations, it is clear that geometric transformations cannot grab certain types of variability such as illumination variations which are known to greatly affect the performance of a face recognition system. In our system, small variations in illumination are compensated by Gabor features and the feature normalization step (cf. Section 5.2). However Gabor features, even combined with feature normalization, cannot fully compensate for large variations in illumination due, for instance, to the location of the light source. Hence, the idea would be to use feature transformations as suggested in Section 2.2. Our model of face transformation would thus not only compensate for variations due to facial expressions but also for changes in illumination conditions. Finally, although our novel approach was tested on a face recognition task, we would like to outline that it was designed for the more general problem of content-based image retrieval and it has the potential to be extended to other biometrics such as fingerprint recognition.

SUMMARY
We presented a general novel approach for content based image retrieval and successfully specialized it to face recognition. In our framework, the stochastic source of the pattern classification system, which is a 2D HMM, does not directly model faces but a transformation between faces of the same person. We also introduced a new framework for approximating the computationally intractable 2D HMMs using turbo-HMMs (T-HMMs). T-HMMs are another major contribution of this paper and one of the keys of the success of our approach. We compared conceptually the proposed approach to two different face recognition algorithms. We presented experimental results showing that our novel algorithm significantly outperforms two popular face recognition algorithms: eigenfaces and fisherfaces. Also, a preliminary comparison of our probabilistic model of face transformation with the EGM approach showed great promise. However, to draw more general conclusions on the relative performance of approaches which model a face (such as eigenfaces and fisherfaces) and approaches which model the relation between face images (such as EGM and our novel approach), we would not only have to carry out more experiments but also to consider other algorithms for both classes of pattern classification methods.