Face Recognition Using Classification-Based Linear Projections

Subspace methods have been successfully applied to face recognition tasks. In this study we propose a face recognition algorithm based on a linear subspace projection. The subspace is found using a variant of the neighbourhood component analysis (NCA) algorithm, a recently introduced supervised dimensionality reduction method. Unlike previously suggested supervised subspace methods, the algorithm explicitly utilizes the classification performance criterion to obtain the optimal linear projection. In addition to its feature extraction capabilities, the algorithm also finds the optimal distance metric that should be used for face-image retrieval in the transformed space. The proposed face-recognition technique significantly outperforms traditional subspace-based approaches, particularly in very low-dimensional representations. The method's performance is demonstrated across a range of standard face databases.


INTRODUCTION
In recent years, automatic face recognition has become one of the most active research fields in computer vision and a large number of different recognition algorithms have been developed. Face recognition algorithms can be categorized into feature-based, holistic-based and hybrid-matching algorithms. In feature-based methods, local features such as the eyes, nose, and mouth are first extracted and their locations and local descriptions are fed into the recognition system (e.g., [1,2]). Hybrid-matching methods use a combination of global and local features for face recognition (e.g., [3,4]). From another perspective, face recognition algorithms can be categorized into 2D, 3D and multimodal algorithms [5]. A comprehensive survey of face-recognition algorithms is given by Zhao et al. [6].
The most successful approaches, however, seem to be those appearance-based methods that operate directly on the face images. An image is considered as a high-dimensional vector, that is, a point in a high-dimensional vector space, and the set of all faces is assumed to form a low-dimensional manifold. Following this paradigm, face image matching can be viewed as a two-step process of subspace projection followed by classification in the low-dimensional space (see [7] for a recent survey on face recognition in subspaces). In a simple yet successful approach, face recognition is implemented as a linear subspace projection followed by a nearest-neighbour classifier. In particular, the eigenfaces method, which is based on principal component analysis (PCA) [8], and the Fisherfaces method, which is based on the Fisher linear discriminant analysis (LDA) [9], have been applied to face recognition with impressive results. PCA-based algorithms select a subspace with maximum variation and they are optimal for object reconstruction. While PCA minimizes sample covariance (second-order dependency), independent component analysis (ICA) minimizes higher-order dependencies as well. ICA selects a linear projection that maximizes the degree of statistical independence of output variables based on various contrast functions (see [10] for an application of ICA to face recognition). It was experimentally found that face recognition algorithms based on ICA do not offer much improvement over PCA [7].
When a substantial variability in illumination and expression is present, similarity in the transformed space is not necessarily determined by the face identity. Both PCA and ICA construct the reduced face space without using the available face identity information. LDA-based algorithms take the class structure into account and focus on the most discriminant feature extraction. The performance of LDA, however, is often degraded by the fact that its separability criterion is not directly related to the classification accuracy in the transformed space. Instead, the LDA optimization is based on the assumption that the intraclass distributions are all Gaussian with a common variance. In other words, LDA assumes, aside from the linearity of the image subspace, a linear separation between classes in the low-dimensional space. (There are many generalizations of the LDA optimization principle but they all impose parametric models on the within-class distributions.) The kernel trick can be utilized to form classification algorithms that are based on nonlinear subspaces (e.g., kernel-PCA and kernel-LDA [11]). The basic methodology is to (implicitly) apply a nonlinear mapping on the input images and then apply linear methods on the resulting feature space. Although kernel methods such as SVM achieve state-of-the-art results, in the case of kernel-PCA and kernel-LDA the performance improvement in face recognition tasks over linear methods was not found to be significant.
The LDA approach is based on two assumptions of linearity. It assumes that the face subspace is linear and that there is a linear separation between classes. Kernel methods are based on relaxation of the first assumption. In this paper, we take a different approach. While keeping the linear subspace assumption, we assume no parametric model for the class distributions or the boundaries between them. Specifically, we apply a recently proposed linear subspace method, the neighbourhood component analysis (NCA) [12], to the task of face recognition. The NCA algorithm explicitly utilizes the classification performance criterion to obtain the optimal linear projection. In the original NCA paper [12], the method was applied to standard databases from the UCI repository. In this study, we systematically analyze the benefits of utilizing several variants of the NCA method for face recognition tasks. Unlike other classification problems, these tasks are generally characterized by a small sample size on the one hand and a large sample dimensionality on the other. We show experimentally that the NCA approach yields a significant improvement in face-recognition tasks compared to currently used subspace methods.
There is yet another major advantage to the linear subspace method presented here. Because the optimization criterion of current subspace methods is not explicitly related to the classification target, an additional learning procedure is needed to find a suitable distance function in the transformed subspace [13][14][15] (e.g., the best results for ICA are obtained using the cosine distance [10]). In the proposed method, the distance measure that should be used in the transformed subspace is explicitly stated in the optimization cost function. The optimal transformation is selected such that using the Euclidean distance in the transformed space yields optimal classification results.
We start by presenting several variants of the NCA algorithm in Section 2. Comparative face-recognition experiments on several standard face databases are presented in Section 3, and Section 4 contains concluding remarks.

LEARNING A LINEAR PROJECTION
In this section, we review the NCA algorithm [12] and focus on a variant that was found to be suitable for face-recognition tasks, which often have problems of small sample size and high-dimensional samples. We begin with a labelled dataset consisting of $n$ real-valued input vectors $x_1, \ldots, x_n$ in $\mathbb{R}^D$ and corresponding class labels $c_1, \ldots, c_n$. In the case of face recognition, the vectors are the face images and the labels are the face identities. We want to find a low-dimensional linear transformation $A : \mathbb{R}^D \to \mathbb{R}^d$ that maximizes the performance of nearest-neighbour classification in the reduced space. Ideally, we would like to optimize performance on future test data, but as we do not know the true data distribution we instead attempt to optimize leave-one-out (LOO) performance on the training data.
Given a finite set of linear transformations to choose from, we can easily select the best one, namely the one that minimizes the number of classification errors. The nearest-neighbour classification error, however, is a highly discontinuous function of the transformation $A$, given that an infinitesimal change in $A$ may change the neighbour graph and thus affect LOO classification performance by a finite amount. Hence, we cannot use this optimization criterion in our case, where there is a continuously parameterized family of linear transformations which must be searched. Instead, we adopt a more well-behaved measure of nearest-neighbour performance, by introducing a differentiable cost function based on stochastic ("soft") neighbour assignments in the transformed space. In particular, each point $i$ selects another point $j$ as its neighbour with some probability $p_{ij}$, and inherits its class label from the point it selects. We define the $p_{ij}$ using a softmax over Euclidean distances in the transformed space:
$$p_{ij} = \frac{\exp\left(-\|Ax_i - Ax_j\|^2\right)}{\sum_{k \neq i} \exp\left(-\|Ax_i - Ax_k\|^2\right)}, \qquad p_{ii} = 0. \qquad (1)$$
Note that the norm of the matrix $A$ controls the softness of the neighbour assignments. Replacing $A$ with $\alpha A$, it can easily be shown that as $\alpha$ tends to infinity, the probabilistic assignment is reduced to deterministic nearest-neighbour assignment in the same transformed space. Denote the set of points in the same class as $i$ by $C_i = \{j \mid c_i = c_j\}$. Under the stochastic selection rule (1), we can compute the probability $p_i$ that a point $i$ will be correctly classified:
$$p_i = \sum_{j \in C_i} p_{ij}. \qquad (2)$$
The objective function we maximize is the following:
$$C(A) = \sum_i \log p_i = \sum_i \log\Big(\sum_{j \in C_i} p_{ij}\Big). \qquad (3)$$
Method:
(i) set an initial value for $A$ (e.g., using the LDA or RCA methods).
(ii) apply a conjugate-gradient optimization to find the maximum of $C(A)$.
(iii) store $A$ and the projected training set $\{Ax_i, c_i\}$.
Testing: predict the label of a query face image $x$.
(i) find the nearest neighbour $i = \arg\min_j \|Ax - Ax_j\|$ and set the label of $x$ to be $c_i$.
Figure 1: The proposed face recognition method based on a linear subspace-projection learning algorithm.
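The soft neighbour assignment and the log-probability objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy data, the distance shift used for numerical stability, and the small floor inside the logarithm are all assumptions made for the example.

```python
import math

def project(A, x):
    """Apply the d x D matrix A (a list of rows) to the vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def soft_neighbours(A, X):
    """Return P with P[i][j] = p_ij (softmax over distances); P[i][i] is 0."""
    Y = [project(A, x) for x in X]
    n = len(Y)
    P = []
    for i in range(n):
        d = [sq_dist(Y[i], Y[j]) for j in range(n)]
        d_min = min(d[j] for j in range(n) if j != i)
        # Subtracting d_min leaves the softmax unchanged and avoids 0/0.
        w = [0.0 if j == i else math.exp(d_min - d[j]) for j in range(n)]
        z = sum(w)
        P.append([w_j / z for w_j in w])
    return P

def objective(A, X, labels):
    """C(A) = sum_i log p_i, where p_i sums p_ij over same-class points j."""
    total = 0.0
    for i, row in enumerate(soft_neighbours(A, X)):
        p_i = sum(p for j, p in enumerate(row) if labels[j] == labels[i])
        total += math.log(max(p_i, 1e-300))  # floor is an illustrative guard
    return total

# Hypothetical toy data: two classes separated along the first coordinate.
X = [[0.0, 0.0], [0.2, 1.0], [3.0, 0.0], [3.2, 1.0]]
labels = [0, 0, 1, 1]
A_good = [[1.0, 0.0]]  # keeps the discriminative coordinate
A_bad = [[0.0, 1.0]]   # keeps the uninformative coordinate
print(objective(A_good, X, labels), objective(A_bad, X, labels))
```

On this toy set the projection onto the first coordinate scores higher than the projection onto the second, and multiplying `A_good` by a large constant drives each row of `P` towards a hard nearest-neighbour assignment, in line with the remark on the role of the norm of $A$.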
The cost function depends on $A$ only through $A^\top A$. Hence, for every orthogonal matrix $R \in \mathbb{R}^{d \times d}$, the product $RA$ is a solution that is completely equivalent to $A$. To keep the representation parsimonious, we can use the Cholesky decomposition representation, forcing the entries of $A$ below the main diagonal to be zero and the entries on the diagonal to be nonnegative. This makes the representation of $A$ unique.
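The invariance to orthogonal transformations can be checked numerically. The sketch below uses an arbitrary rotation angle and made-up matrices (all values are assumptions for illustration) and compares squared distances under $A$ and under $RA$.

```python
import math

def matmul(R, A):
    """Matrix product of R (d x d) and A (d x D), both lists of rows."""
    return [[sum(R[r][k] * A[k][c] for k in range(len(A)))
             for c in range(len(A[0]))] for r in range(len(R))]

def project(A, x):
    """Apply the matrix A to the vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

theta = 0.7  # arbitrary rotation angle
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
A = [[1.0, 0.5, -0.2],
     [0.0, 2.0, 0.3]]
RA = matmul(R, A)

x_i, x_j = [1.0, 2.0, 3.0], [-1.0, 0.5, 2.0]
d_A = sq_dist(project(A, x_i), project(A, x_j))
d_RA = sq_dist(project(RA, x_i), project(RA, x_j))
print(d_A, d_RA)  # the two values agree up to rounding
```

Since every $p_{ij}$ is a function of these squared distances only, the objective takes the same value at $A$ and at $RA$.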
Differentiating $C$ with respect to the transformation matrix $A$ yields a gradient rule which we can use for learning. Observing that
$$\frac{\partial p_{ij}}{\partial A} = -2\, p_{ij} \Big( \Delta_{ij} - \sum_k p_{ik} \Delta_{ik} \Big), \qquad (4)$$
where $\Delta_{ij} = A (x_i - x_j)(x_i - x_j)^\top$, it can be verified that
$$\frac{\partial C}{\partial A} = 2 \sum_i \Big( \sum_k p_{ik} \Delta_{ik} - \frac{\sum_{j \in C_i} p_{ij} \Delta_{ij}}{p_i} \Big). \qquad (5)$$
Expression (5) can be viewed as the difference between the overall variability and the intraclass variability defined by the probabilistic model (1) induced from $A$. The learning algorithm therefore maximizes the objective (3) using a gradient-based optimizer such as delta-bar-delta or conjugate gradients. Since the cost function is not concave, some care must be taken to avoid poor local maxima during training. We have experimentally observed that the linear transformation obtained by the Fisherfaces (LDA) method can serve as a good starting point for the conjugate gradient algorithm. The linear-transformation learning algorithm is summarized in Figure 1.
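A gradient rule of this kind is easy to get wrong by a sign or a factor of two, so it is worth checking against finite differences. The sketch below assumes the objective $\sum_i \log p_i$ with a softmax over squared distances, writes the analytic gradient as $2\sum_i\sum_k p_{ik}\,(1 - [c_k = c_i]/p_i)\,\Delta_{ik}$ with $\Delta_{ik} = A(x_i - x_k)(x_i - x_k)^\top$, and compares it with central differences. The data, names and step size are illustrative assumptions, and every class is assumed to contain at least two points.

```python
import math

def project(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def soft_neighbours(A, X):
    """P[i][j] = p_ij: softmax over negative squared distances, p_ii = 0."""
    Y = [project(A, x) for x in X]
    P = []
    for i, y_i in enumerate(Y):
        w = [0.0 if j == i else
             math.exp(-sum((a - b) ** 2 for a, b in zip(y_i, y_j)))
             for j, y_j in enumerate(Y)]
        z = sum(w)
        P.append([w_j / z for w_j in w])
    return P

def objective(A, X, labels):
    """C(A) = sum_i log p_i."""
    P = soft_neighbours(A, X)
    return sum(math.log(sum(p for j, p in enumerate(row)
                            if labels[j] == labels[i]))
               for i, row in enumerate(P))

def gradient(A, X, labels):
    """Analytic gradient of C with respect to A (a d x D list of rows)."""
    d, D, n = len(A), len(A[0]), len(X)
    P = soft_neighbours(A, X)
    G = [[0.0] * D for _ in range(d)]
    for i in range(n):
        p_i = sum(P[i][j] for j in range(n) if labels[j] == labels[i])
        for k in range(n):
            same = 1.0 if labels[k] == labels[i] else 0.0
            coeff = 2.0 * P[i][k] * (1.0 - same / p_i)
            diff = [a - b for a, b in zip(X[i], X[k])]
            a_diff = project(A, diff)  # Delta_ik = (A diff) diff^T
            for r in range(d):
                for c in range(D):
                    G[r][c] += coeff * a_diff[r] * diff[c]
    return G

def numeric_gradient(A, X, labels, eps=1e-6):
    """Central finite differences, one matrix entry at a time."""
    G = [[0.0] * len(A[0]) for _ in A]
    for r in range(len(A)):
        for c in range(len(A[0])):
            up = [row[:] for row in A]
            down = [row[:] for row in A]
            up[r][c] += eps
            down[r][c] -= eps
            G[r][c] = (objective(up, X, labels)
                       - objective(down, X, labels)) / (2 * eps)
    return G

# Hypothetical toy data: four points in R^3, two classes, projection to R^2.
X = [[0.0, 0.0, 1.0], [0.5, 1.0, 0.0], [1.5, 0.2, 0.3], [1.0, 1.2, 0.8]]
labels = [0, 0, 1, 1]
A = [[0.6, -0.2, 0.1], [0.3, 0.5, -0.4]]
G_exact = gradient(A, X, labels)
G_num = numeric_gradient(A, X, labels)
max_gap = max(abs(G_exact[r][c] - G_num[r][c])
              for r in range(2) for c in range(3))
print(max_gap)
```

A small `max_gap` indicates that the analytic expression and the numerical estimate agree; the same analytic gradient could then be handed to any gradient-based optimizer.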
In face recognition tasks we often observe the problem of small sample size, where the number of images in the training set (denoted by $n$) is significantly smaller than the dimensionality of the samples ($D$). When utilizing the NCA, the small sample size can cause another degeneracy. Assume that $n \cdot d < D$, where $d$ is the dimensionality of the transformed space. In that case, we can easily find a transformation $A$ that sends all the face images with the same label $l$ to the same (prespecified) point $y_l \in \mathbb{R}^d$. We need to solve the linear system
$$A x_i = y_l, \qquad i = 1, \ldots, n, \qquad (6)$$
such that $l$ is the label of $x_i$. Since $nd < D$, there are more variables than equations and solutions exist (except for degenerate cases) and can be easily found. Suppose $A$ solves the linear system (6); then, multiplying all the points $y_l$ by a large constant $\lambda$, we can obtain a solution $\lambda A$ such that $p_{ij} = 0$ whenever the labels of $x_i$ and $x_j$ are different. Thus, we can find a transformation that yields a perfect (error-free) classification of the entire training set. To prevent this degeneracy, which can reduce the generalization capabilities of the learning algorithm, we can penalize large-norm transformations $A$ by adding a regularization term $-\lambda \|A\|^2$ to the cost function we are maximizing, where $\lambda$ is a prespecified positive constant that can be set in a cross-validation step. The derivative of the regularized cost function is
$$\frac{\partial C}{\partial A} - 2 \lambda A.$$
Other objective functions based on classification performance can also be considered [12]. For example, we can search for a linear transformation that maximizes the expected number of points that are correctly classified. In other words, we can maximize the cost function $\sum_i p_i$. In Section 3, we provide face-recognition results for the two variants of the cost function (for other variants of NCA see [16,17]).
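The effect of the norm penalty can be illustrated on a toy problem in which the classes are already separated by the projection. Without a penalty, scaling the projection up keeps increasing the objective towards its supremum of zero; with the penalty, the larger scaling is no longer preferred. The sketch below restricts itself to a one-dimensional projection for brevity; the data, the scaling factors and the value of the regularization constant (`lam`) are assumptions chosen for the example.

```python
import math

def soft_objective(A_row, X, labels):
    """C(A) = sum_i log p_i for a 1 x D projection given as a single row."""
    y = [sum(a * v for a, v in zip(A_row, x)) for x in X]
    n = len(y)
    total = 0.0
    for i in range(n):
        d = [(y[i] - y[j]) ** 2 for j in range(n)]
        d_min = min(d[j] for j in range(n) if j != i)
        # Shifting by d_min keeps the softmax unchanged but numerically safe.
        w = [0.0 if j == i else math.exp(d_min - d[j]) for j in range(n)]
        p_i = sum(w[j] for j in range(n) if labels[j] == labels[i]) / sum(w)
        total += math.log(p_i)
    return total

def regularized_objective(A_row, X, labels, lam):
    """C(A) - lam * ||A||^2, using the squared Frobenius norm as the penalty."""
    return soft_objective(A_row, X, labels) - lam * sum(a * a for a in A_row)

# Hypothetical toy data: two classes separated along the first coordinate.
X = [[0.0, 0.0], [0.2, 1.0], [3.0, 0.0], [3.2, 1.0]]
labels = [0, 0, 1, 1]
lam = 0.1

c_small = soft_objective([1.0, 0.0], X, labels)
c_large = soft_objective([3.0, 0.0], X, labels)
r_small = regularized_objective([1.0, 0.0], X, labels, lam)
r_large = regularized_objective([3.0, 0.0], X, labels, lam)
print(c_small, c_large, r_small, r_large)
```

Here the unregularized objective prefers the larger projection, while the regularized one prefers the smaller. In practice the regularization constant would be chosen by cross validation, as noted in the text.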

EXPERIMENTAL RESULTS
To evaluate the performance of the proposed method we have conducted comparative recognition experiments on several standard face databases. It is beneficial to use different kinds of databases because some properties of classification methods, for example, their generalization abilities, change depending on the number of classes under consideration and
The goal of our experiments is to assess the relative performance of NCA as a (supervised) method in a face-recognition task. The face-recognition methods we compared are Eigenfaces (PCA) [8], Gaussian RBF Kernel-PCA [11], and Fisherfaces (LDA) [9]. All these subspace projection methods are followed by a whitening step in the transformed space, which is equivalent to utilizing the Mahalanobis distance in the transformed space. Another recently suggested distance metric to be used in the transformed space is the relevant component analysis (RCA) method [13], where only the within-class variability is used for whitening. It was shown in [13] that utilizing the RCA distance metric can enhance the performance of LDA. We also show recognition results for LDA followed by RCA. We have implemented two variants of the NCA. The first (denoted by NCA1) is based on the cost function $\sum_i \log(p_i)$ and the second variant (denoted by NCA2) is based on the cost function $\sum_i p_i$, where $p_i$ is the probability of correct classification.
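The stated equivalence between whitening and the Mahalanobis distance follows in one line. The notation below is assumed for illustration: $y$ and $y'$ are two projected images, $\Sigma$ is the (invertible) covariance matrix used for whitening, and $W = \Sigma^{-1/2}$ is its symmetric inverse square root. For RCA, $\Sigma$ would be the within-class covariance instead of the total covariance.

```latex
% Assumed notation: y, y' are projected images, \Sigma is the covariance
% used for whitening, W = \Sigma^{-1/2} is its symmetric inverse square root.
\begin{align*}
\| W y - W y' \|^2
  &= (y - y')^\top W^\top W \, (y - y') \\
  &= (y - y')^\top \Sigma^{-1/2} \, \Sigma^{-1/2} (y - y') \\
  &= (y - y')^\top \Sigma^{-1} (y - y').
\end{align*}
```

The Euclidean distance after whitening is therefore the Mahalanobis distance with respect to $\Sigma$ in the original transformed space.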
The recognition task in the following experiments is to classify face images with respect to the identity of the person. We consider the retrieval paradigm, reminiscent of a nearest-neighbour classifier, in which a query image leads to the retrieval of its nearest neighbour in the training data set. The distance measure we used in the transformed space (after whitening) is the Euclidean distance. Note that when using the NCA to obtain a subspace, there is no need for a whitening process, as the distance learning is combined with the linear-subspace search. An example of a face-recognition retrieval query from the Yale database is presented in Figure 2, where the five nearest-neighbour retrieval results, based on LDA and NCA1, are shown.
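The retrieval paradigm described above amounts to projecting the training images once and ranking them by Euclidean distance to the projected query. The following is a minimal sketch under assumed names (`build_gallery`, `retrieve`, `classify`) and made-up data; it is not taken from the paper.

```python
def project(A, x):
    """Apply the d x D matrix A (a list of rows) to the vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def build_gallery(A, train_X, train_labels):
    """Project the training images once and keep them with their identities."""
    return [(project(A, x), c) for x, c in zip(train_X, train_labels)]

def retrieve(A, gallery, query, k=5):
    """Return (index, label) pairs of the k gallery items nearest to the query."""
    q = project(A, query)
    order = sorted(range(len(gallery)), key=lambda i: sq_dist(q, gallery[i][0]))
    return [(i, gallery[i][1]) for i in order[:k]]

def classify(A, gallery, query):
    """Nearest-neighbour classification: the label of the top retrieval result."""
    return retrieve(A, gallery, query, k=1)[0][1]

# Hypothetical toy data standing in for vectorized face images.
A = [[1.0, 0.0]]
train_X = [[0.0, 0.0], [0.2, 1.0], [3.0, 0.0], [3.2, 1.0]]
train_labels = ["s1", "s1", "s2", "s2"]
gallery = build_gallery(A, train_X, train_labels)

top2 = retrieve(A, gallery, [2.9, 0.5], k=2)
predicted = classify(A, gallery, [2.9, 0.5])
print(top2, predicted)
```

Because only the projected gallery is stored, the cost of a query depends on the reduced dimension $d$ and the gallery size, not on which method produced $A$.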
The recognition results are presented in Figure 3. It can be seen that both variants of the NCA algorithm significantly outperform previously suggested subspace methods across all the databases that were used. The competitive advantage of the NCA method is even more significant in the case of projection into a very low-dimensional space (e.g., when $d = 5$ or $d = 10$). Aside from improved performance, this fact can yield a better recognition system in terms of computational complexity and memory size. Following the results of Bar-Hillel et al. [13], we have found that in some cases using the RCA distance metric can improve the performance of recognition systems based on LDA. The RCA is useful in cases where there are many face-image examples from each subject and we can obtain a good estimate of the within-class variability. In such cases (e.g., Yale and Weizmann), using RCA and NCA we obtained similar classification results when the dimensionality of the reduced representation was relatively high. In very low dimensions the NCA was found to be significantly better. In the case of the FERET database, the RCA has no advantage over LDA with Mahalanobis distance (FERET-10) and it can even be worse (FERET-4).
To further exemplify the significantly improved performance gained from the NCA in very low dimensions, we show an example of linear projection into the 2-dimensional plane. The database used comprised the first five subjects of the Weizmann face database. For each subject there are 66 face images, half of which are used to find the subspace and the other half for testing. Figure 4 shows the low-dimensional representation obtained from LDA and NCA.
The LDA transformation was also used as a starting point for the iterative conjugate-gradient algorithm that was applied to find the optimal NCA transformation. The nearest-neighbour recognition results (percentage of correct classification) for the database presented in Figure 4 are 47 for PCA, 67 for LDA, and 93 for NCA. The NCA was found to be better than all the other methods discussed in this paper in terms of performance. In real-world applications the training is done once and the running time of the test phase is what matters. The computational complexity of recognizing a single face is the same for all the methods. They are all based on a nearest-neighbour classifier in the projected space. The only difference between the methods is the linear transformation selected at the training phase, which has no implications for the complexity.

CONCLUSION
We have presented a linear subspace algorithm implicitly combined with a distance-learning method in the transformed subspace for face-recognition tasks. We have shown that this method performs well across a range of standard face databases and a range of projection dimensions. It consistently outperforms existing subspace methods for face recognition, particularly in the case of very low dimensions. A view has emerged in recent years that linear subspace methods may be too limited for difficult classification tasks. A popular nonlinear alternative is based on kernelizing linear methods (e.g., kernel PCA and kernel LDA). The face manifold is definitely nonlinear. However, we have shown in this study that a linear subspace can be a good approximation of this nonlinear manifold. We have shown that the space of linear transformations is still large enough to contain good classifiers. When using an appropriate target function, linear subspace methods can yield excellent face recognition results. It should be noted that the proposed method can be easily "kernelized." Instead of applying the projection $A x_i$ to $x_i \in \mathbb{R}^D$ directly, we can first map the input into a Hilbert space $F$ using a function $\phi$ and then use the projection $A\phi(x_i)$.
This paper is focused on batch learning of the projection transformation. There are also online algorithms for learning a linear projection (e.g., [22]). A future research direction is developing an online version of the NCA algorithm that can be incrementally updated. As a final remark, we note that this paper evaluates a face-recognition approach by the performance achieved on a collection of datasets, each divided into training and test sets. Under this protocol, the results are reported as an overall rate, without considering whether all identities are recognized equally well or whether recognition varies among them; this may not be sufficient in real situations, that is, unrestricted imagery or video [23,24].

Maximizing the objective $C(A) = \sum_i \log p_i$ corresponds to maximizing the probability of obtaining a perfect (error-free) classification of the entire training set. Maximizing $C(A)$ is also equivalent to minimizing the Kullback-Leibler divergence between the true class distribution (having probability one on the true class) and the stochastic class distribution induced by $p_{ij}$ via $A$. Note that since $\|Ax_i - Ax_j\|^2 = (x_i - x_j)^\top A^\top A \,(x_i - x_j)$, the cost function depends on $A$ only through $A^\top A$.
Training: Input: a set of $n$ labelled face images $\{x_i \in \mathbb{R}^D, c_i\}$ and the reduced dimension $d$. Output: a linear projection $A_{d \times D} : \mathbb{R}^D \to \mathbb{R}^d$ that maximizes the objective $C(A)$.

Figure 2: Example of a retrieval query from the Yale database. (a) The query image. Five nearest-neighbour retrieval results obtained using (b) the NCA-based transformation and (c) the Fisherfaces (LDA) method.

Figure 3: Face-recognition performance of several subspace methods, as a function of the representation dimensionality, on standard face databases. Standard errors of the means are shown on the curves. The databases are (a) Yale, (b) Weizmann, (c) a subset of FERET consisting of persons with more than 4 images, and (d) a subset of FERET consisting of persons with more than 10 images.
The NCA training time is longer since the optimization is done iteratively. The training running time based on 200 training images of size 27 × 32, 40 classes, and reduced dimensionality 5 (using a Pentium(R) 4 CPU at 3.2 GHz with 1 GB of RAM) was 0.5 seconds for PCA, 5 seconds for LDA and RCA, and 68 seconds for NCA.

Figure 4: A two-dimensional linear representation of the first 5 subjects from the Weizmann face database. Images of the same person have the same color. The top and bottom rows show the results for NCA and LDA, respectively.