Integrating Illumination, Motion, and Shape Models for Robust Face Recognition in Video

The use of video sequences for face recognition has been relatively less studied compared to image-based approaches. In this paper, we present an analysis-by-synthesis framework for face recognition from video sequences that is robust to large changes in facial pose and lighting conditions. This requires tracking the video sequence, as well as recognition algorithms that are able to integrate information over the entire video; we address both these problems. Our method is based on a recently obtained theoretical result that can integrate the effects of motion, lighting, and shape in generating an image using a perspective camera. This result can be used to estimate the pose and structure of the face and the illumination conditions for each frame in a video sequence in the presence of multiple point and extended light sources. We propose a new inverse compositional estimation approach for this purpose. We then synthesize images using the face model estimated from the training data corresponding to the conditions in the probe sequences. Similarity between the synthesized and the probe images is computed using suitable distance measurements. The method can handle situations where the pose and lighting conditions in the training and testing data are completely disjoint. We show detailed performance analysis results and recognition scores on a large video dataset.


INTRODUCTION
It is believed by many that video-based face recognition systems hold promise in certain applications where motion can be used as a cue for face segmentation and tracking, and the presence of more data can increase recognition performance [1]. However, these systems have their own challenges. They require tracking the video sequence, as well as recognition algorithms that are able to integrate information over the entire video.
In this paper, we present a novel analysis-by-synthesis framework for pose and illumination invariant, video-based face recognition that is based on (i) learning joint illumination and motion models from video, (ii) synthesizing novel views based on the learned parameters, and (iii) designing measurements that can compare two time sequences while being robust to outliers. We can handle a variety of lighting conditions, including the presence of multiple point and extended light sources, which is natural in outdoor environments (where face recognition performance is still relatively poor [1][2][3]). We can also handle gradual and sudden changes of lighting patterns over time. The pose and illumination conditions in the gallery and probe can be completely disjoint. We show experimentally that our method achieves high identification rates under extreme changes of pose and illumination.

Previous work
The proposed approach touches upon aspects of face recognition, tracking, and illumination modeling. We place our work in the context of only the most relevant ones.
A broad review of face recognition is available in [1]. Recently, there have been a number of algorithms for pose and/or illumination invariant face recognition, many of which are based on the fact that the image of an object under varying illumination lies in a lower-dimensional linear subspace. In [4], the authors proposed a 3D spherical harmonic basis morphable model (SHBMM) to implement a face recognition system given one single image under arbitrary unknown lighting. Another 3D face morphable model (3DMM)-based face recognition algorithm was proposed in [5], but it uses the Phong illumination model, whose parameters can be more difficult to estimate in the presence of multiple and extended light sources. The authors in [6] proposed to use Eigen light-fields and Fisher light-fields to do pose invariant face recognition. The authors in [7] introduced a probabilistic version of Fisher light-fields to handle the differences of face images due to within-individual variability. Another method of learning statistical dependency between image patches was proposed for pose invariant face recognition in [8]. Correlation filters, which analyze the image frequencies, have been proposed for illumination invariant face recognition from still images in [9]. A novel method for multilinear independent component analysis was proposed in [10] for pose and illumination invariant face recognition.
All of the above methods deal with recognition in a single image or across discrete poses and do not consider continuous video sequences. Video-based face recognition requires integrating the tracking and recognition modules and exploiting the spatiotemporal coherence in the data. The authors in [11] deal with the issue of video-based face recognition, but concentrate mostly on pose variations. Similarly, [12] used adaptive hidden Markov models for pose-varying video-based face recognition. The authors of [13] proposed to use a 3D model of the entire head for exploiting features like hairline and handled large pose variations in head tracking and video-based face recognition. However, the application domain is consumer video and requires recognition across a few individuals only. The authors in [14] proposed to perform face recognition by computing the Kullback-Leibler divergence between testing image sets and a learned manifold density. Another work in [15] learns manifolds of face variations for face recognition in video. A method for video-based face verification using correlation filters was proposed in [16], but the poses in the gallery and probe have to be similar.
Except for [13] (which is not aimed at face recognition on large datasets), all of these are 2D approaches, in contrast to our 3D model-based method. The advantage of using 3D models in face recognition has been highlighted in [17], but their focus is on acquiring 3D models directly from the sensors. The main reason for our use of 3D models is invariance to large pose changes and more accurate representation of lighting compared to 2D approaches. We do not need to learn models of appearance under different pose and illumination conditions. This makes our recognition strategy independent of training data needed to learn such models, and allows the gallery and probe conditions to be completely disjoint.
There are numerous methods for tracking objects in video in the presence of illumination changes [18][19][20][21][22]. However, most of them compensate for the illumination conditions of each frame in the video (as opposed to recovering the illumination conditions). In [23,24], the authors independently derived a low order (9D) spherical harmonics-based linear representation to accurately approximate the reflectance images produced by a Lambertian object with attached shadows. In [24,25], the authors discussed the advantage of this 3D model-based illumination representation compared to some image-based representations. Their methods work only for a single image of an object that is fixed relative to the camera, and do not account for changes in appearance due to motion. We proposed a framework in [26,27] for integrating the spherical harmonics-based illumination model with the motion of the objects, leading to a bilinear model of lighting and motion parameters. In this paper, we show how the theory can be used for video-based face recognition.

Overview of the approach
The underlying concept of this paper is a method for learning joint illumination and motion models of objects from video. We assume that a 3D model of each face in the gallery is available. For our experiments, the 3D model is estimated from images, but any 3D modeling algorithm, including directly acquiring the model through range sensors, can be used for this purpose. Given a probe sequence, we track the face automatically in the video sequence under arbitrary pose and illumination conditions using the bilinear model of illumination and motion that we developed in [27]. This is achieved by a new inverse compositional estimation approach leading to real-time performance [28]. The illumination invariant model-based tracking algorithm allows us not only to estimate the 3D motion, but also to recover the illumination conditions as a function of time. The learned illumination parameters are used to synthesize video sequences for each gallery under the motion and illumination conditions in the probe. The distance between the probe and synthesized sequences is then computed for each frame. Different distance measurements are explored for this purpose. Finally, the gallery individual whose synthesized sequence is at minimum distance from the probe sequence is declared to be the identity of the person.
Experimental evaluation is carried out on a database of 57 people that we collected for this purpose. We compare our approach against other image-based and video-based face recognition methods. One of the challenges in video-based face recognition is the lack of a good dataset, unlike in image-based approaches [1]. The dataset in [11] is small and consists mostly of pose variations. The dataset described in [29] has large pose variations under constant illumination, and illumination changes in (mostly) fixed frontal/profile poses (these are essentially for gait analysis). The XM2VTS dataset (http://www.ee.surrey.ac.uk/CVSSP/xm2vtsdb/) does not have any illumination variations, the handling of which is one of the main contributions of our work. An ideal dataset for us would be similar to the CMU PIE dataset [9], but with video sequences instead of discrete poses. This is the reason why we collected our own data, which has large, simultaneous pose, illumination, and expression variations. It is similar to the PIE dataset, though the illumination change is random and uses pre-existing and natural indoor and outdoor lighting.

Contributions
The following are the main contributions of the paper.
(i) We propose an analysis-by-synthesis framework for video-based face recognition that can work with large pose and illumination changes that are normal in natural imagery.
(ii) We propose a novel inverse compositional (IC) approach for estimating 3D pose and lighting conditions in the video sequence. Unlike existing methods [30], our warping function involves a 2D → 3D → 2D transformation. Our method allows us to estimate the motion and lighting in real time.
(iii) We propose different metrics to obtain the identity of the individual in a probe sequence by integrating over the entire video and compare their merits and demerits.
(iv) Our overall strategy does not require learning an appearance variation model, unlike many existing methods [10][11][12][14][15][16]. Thus, the proposed strategy is not dependent on the quality of the learned appearance model and can handle situations where the pose and illumination conditions in the probe are completely independent of the gallery and training data.
(v) We perform a thorough evaluation of our method against well-known image-based approaches like Kernel PCA + LDA [31] and 3D model-based approaches like 3DMM [4,5].

Bilinear model of the motion and illumination
In this section, we briefly review the main results in [27] to lay the background and notation for this paper. It was proved that if the motion of the object (defined as the translation of the object centroid $\Delta T \in \mathbb{R}^3$ and the rotation $\Delta\Omega \in \mathbb{R}^3$ about the centroid in the camera frame) from time $t_1$ to new time instance $t_2 = t_1 + \delta t$ is small, then up to a first order approximation, the reflectance image $I(x, y)$ at $t_2$ can be expressed as

$$I(x, y, t_2) = \sum_{i=1}^{N_l} l_i\, b_i^{t_2}(u), \quad \text{where} \quad b_i^{t_2}(u) = b_i^{t_1}(u) + \left[A(u, n)\,\Delta T + B(u, n)\,\Delta\Omega\right]_i. \qquad (1)$$

In the above equations, $u$ represents the image point projected from the 3D surface with surface normal $n$ (see Figure 1), $b_i^{t_1}(u)$ are the original basis images before motion, and $[\cdot]_i$ denotes the $i$th component. $A$ and $B$ contain the structure and camera intrinsic parameters, and are functions of $u$ and the 3D surface normal $n$. For each pixel $u$, both $A$ and $B$ are $N_l \times 3$ matrices, where $N_l \approx 9$ for Lambertian objects with attached shadows. Please refer to [26] for the derivation of (1) and explicit expressions for $A$ and $B$. From (1), we see that the new image spans a bilinear space of six motion and approximately nine illumination variables (for Lambertian objects with attached shadows). The basic result is valid for general illumination conditions, but requires consideration of higher order spherical harmonics.

[Figure 1: image plane, camera/world reference frame, and illumination.]
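We can express the result in (1) succinctly using tensor notation as

$$I = \left(\mathcal{B} + \mathcal{C} \times_2 \begin{bmatrix} \Delta T \\ \Delta\Omega \end{bmatrix}\right) \times_1 l, \qquad (2)$$

where $\times_n$ is called the mode-$n$ product [32] and $l \in \mathbb{R}^{N_l}$ is the vector of $l_i$ components. The mode-$n$ product of a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_n \times \cdots \times I_N}$ with a vector $v \in \mathbb{R}^{1 \times I_n}$, denoted by $\mathcal{A} \times_n v$, is the $I_1 \times \cdots \times 1 \times \cdots \times I_N$ tensor

$$(\mathcal{A} \times_n v)_{i_1 \cdots i_{n-1}\, 1\, i_{n+1} \cdots i_N} = \sum_{i_n} a_{i_1 \cdots i_{n-1} i_n i_{n+1} \cdots i_N}\, v_{i_n}. \qquad (3)$$

For each pixel $(p, q)$ in the image, $\mathcal{C}_{klpq} = [A \;\; B]$ is of size $N_l \times 6$. Thus, for an image of size $M \times N$, $\mathcal{C}$ is $N_l \times 6 \times M \times N$, $\mathcal{B}$ is an $N_l \times 1 \times M \times N$ tensor comprising the basis images $b_i^{t_1}(u)$, and $I$ is a $1 \times 1 \times M \times N$ tensor representing the image.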
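The bilinear rendering $(\mathcal{B} + \mathcal{C} \times_2 m) \times_1 l$ can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `render_bilinear`, the array layout (singleton tensor dimension dropped), and the random test tensors are all assumptions made for the example.

```python
import numpy as np

def render_bilinear(B, C, l, m):
    """Render I = (B + C x_2 m) x_1 l with tensors stored as NumPy arrays.

    B: (Nl, H, W)     basis images at the reference pose (assumed layout)
    C: (Nl, 6, H, W)  per-pixel [A B] blocks, one Nl x 6 matrix per pixel
    l: (Nl,)          illumination coefficients
    m: (6,)           motion vector [dT, dOmega]
    """
    # Mode-2 product with m: contract the six-dimensional motion axis of C.
    basis_after_motion = B + np.einsum("kqhw,q->khw", C, m)
    # Mode-1 product with l: contract the illumination axis.
    return np.einsum("khw,k->hw", basis_after_motion, l)

# Random stand-ins for the tensors; real ones come from a 3D face model.
rng = np.random.default_rng(0)
Nl, H, W = 9, 4, 5
B = rng.normal(size=(Nl, H, W))
C = rng.normal(size=(Nl, 6, H, W))
l = rng.normal(size=Nl)
m = 0.01 * rng.normal(size=6)
image = render_bilinear(B, C, l, m)
```

The output is linear in `l` for fixed `m` and affine in `m` for fixed `l`, which is the bilinear structure that the estimation procedure relies on.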

Pose and illumination estimation
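Equation (2) provides us an expression relating the reflectance image $I$ with the illumination coefficients $l$ and motion variables $\Delta T$, $\Delta\Omega$. Letting $m = [\Delta T^{\top} \;\; \Delta\Omega^{\top}]^{\top}$, we have a method for estimating 3D motion and illumination as

$$(\hat{l}, \hat{m}) = \arg\min_{l, m} \left\| I_{t_2} - \left(\mathcal{B}_{t_1} + \mathcal{C}_{t_1} \times_2 m\right) \times_1 l \right\|^2, \qquad (4)$$

where $\hat{x}$ denotes an estimate of $x$. Since the motion between consecutive frames is small, but illumination can change suddenly, we add a regularization term of the form $\alpha \|m\|^2$ to the above cost function. Since the image $I_{t_2}$ lies approximately in a bilinear space of illumination and motion variables with the bases $\mathcal{B}_{t_1}$ and $\mathcal{C}_{t_1}$ computed at a pose close to that of $I_{t_2}$ (ignoring the regularization term for now), such a minimization problem can be solved by alternately estimating the motion and illumination parameters with the bases $\mathcal{B}_{t_1}$ and $\mathcal{C}_{t_1}$ at the pose of the previous iteration. This process guarantees convergence to a local minimum. Assuming that we have tracked the sequence up to some frame for which we can estimate the motion (hence, pose) and illumination, we calculate the basis images, $b_i^{t_1}$, at the current pose and write them in tensor form $\mathcal{B}_{t_1}$. Similarly, we can also obtain $\mathcal{C}_{t_1}$ at that pose. Unfolding $\mathcal{B}_{t_1}$ and the image $I_{t_2}$ along the first dimension [32], which is the illumination dimension, the image can be represented as

$$I_{t_2(1)}^{\top} = \mathcal{B}_{t_1(1)}^{\top}\, l. \qquad (5)$$

(Assume an $N$th-order tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. The matrix unfolding $A_{(n)} \in \mathbb{R}^{I_n \times (I_{n+1} I_{n+2} \cdots I_N I_1 I_2 \cdots I_{n-1})}$ contains the element $a_{i_1 i_2 \cdots i_N}$ at the position with row number $i_n$ and column number equal to $(i_{n+1} - 1) I_{n+2} I_{n+3} \cdots I_N I_1 I_2 \cdots I_{n-1} + (i_{n+2} - 1) I_{n+3} I_{n+4} \cdots I_N I_1 I_2 \cdots I_{n-1} + \cdots + (i_N - 1) I_1 I_2 \cdots I_{n-1} + (i_1 - 1) I_2 I_3 \cdots I_{n-1} + \cdots + i_{n-1}$.)

This is a least squares problem, and the illumination $l$ can be estimated as

$$\hat{l} = \left(\mathcal{B}_{t_1(1)} \mathcal{B}_{t_1(1)}^{\top}\right)^{-1} \mathcal{B}_{t_1(1)}\, I_{t_2(1)}^{\top}. \qquad (6)$$

Keeping the illumination coefficients fixed, the bilinear space in (2) becomes a linear subspace, that is,

$$I_{t_2} = \mathcal{B}_{t_1} \times_1 l + \left(\mathcal{C}_{t_1} \times_1 l\right) \times_2 m, \qquad (7)$$

and the motion $m$ can be estimated as

$$\hat{m} = \left(\mathcal{G}_{t_1(2)} \mathcal{G}_{t_1(2)}^{\top} + \alpha \mathbf{I}\right)^{-1} \mathcal{G}_{t_1(2)} \left(I_{t_2} - \mathcal{B}_{t_1} \times_1 l\right)_{(2)}^{\top}, \quad \text{with} \quad \mathcal{G}_{t_1} = \mathcal{C}_{t_1} \times_1 l, \qquad (8)$$

where $\mathbf{I}$ is an identity matrix of dimension $6 \times 6$.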
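The alternating scheme described above (least squares for the illumination with the motion fixed, then regularized least squares for the motion with the illumination fixed) can be sketched as below. This is a hedged illustration on synthetic data: the function names, the array layout, and the choice to keep the bases fixed rather than re-render them at each updated pose are simplifying assumptions, not the authors' procedure.

```python
import numpy as np

def estimate_illumination(basis, image):
    """Least squares fit of image ~ sum_k l_k * basis[k]; basis is (Nl, H, W)."""
    unfolded = basis.reshape(basis.shape[0], -1)              # (Nl, H*W)
    l, *_ = np.linalg.lstsq(unfolded.T, image.ravel(), rcond=None)
    return l

def estimate_motion(B, C, l, image, alpha):
    """Regularized least squares for the motion m with illumination l held fixed."""
    # G = C x_1 l, unfolded so that each row is one motion component.
    G = np.einsum("kqhw,k->qhw", C, l).reshape(C.shape[1], -1)  # (6, H*W)
    residual = (image - np.einsum("khw,k->hw", B, l)).ravel()
    return np.linalg.solve(G @ G.T + alpha * np.eye(G.shape[0]), G @ residual)

def alternate(B, C, image, alpha=0.0, n_iter=30):
    """Alternate between the illumination and motion updates."""
    m = np.zeros(C.shape[1])
    for _ in range(n_iter):
        basis_at_m = B + np.einsum("kqhw,q->khw", C, m)  # bases for the current m
        l = estimate_illumination(basis_at_m, image)
        m = estimate_motion(B, C, l, image, alpha)
    return l, m

# Synthetic, noise-free check with random stand-in tensors.
rng = np.random.default_rng(1)
B = rng.normal(size=(9, 16, 16))
C = rng.normal(size=(9, 6, 16, 16))
l_true = rng.normal(size=9)
m_true = 0.05 * rng.normal(size=6)
image = np.einsum("khw,k->hw", B + np.einsum("kqhw,q->khw", C, m_true), l_true)
l_hat, m_hat = alternate(B, C, image)
```

With `alpha > 0` the motion estimate is shrunk toward zero, which reflects the assumption of small inter-frame motion; the example uses `alpha = 0` only so that the synthetic parameters can be recovered exactly.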

Inverse compositional (IC) pose and illumination estimation
The iteration involving alternate minimization over motion and illumination in the above approach is essentially a gradient descent method. In each iteration, as pose is updated, the gradients (i.e., the tensors $\mathcal{B}$ and $\mathcal{C}$) need to be recomputed, which is computationally expensive. The inverse compositional algorithm [30] works by moving these computational steps out of the iterative updating process. Consider an input frame $I_{t_2}(u)$ at time instance $t_2$ with image coordinate $u$. We introduce a warp operator $W_p: \mathbb{R}^2 \to \mathbb{R}^2$ such that, if the pose of $I_{t_2}(u)$ is $p$, the pose of $I_{t_2}(W_p(u, m))$ is $p + m$ (see Figure 2). Basically, $W_p$ represents the displacement in the image plane due to a pose transformation of the 3D model. Denote the pose transformed image . Using this warp operator and ignoring the regularization term, we can restate the cost function (4) in the inverse compositional framework as This cost function can be minimized over $m$ by iteratively solving for increments $\Delta m$ in In each iteration, $m$ is updated such that Thus $\{W_p\}$ is a group.) Using the additivity of pose transformation for small $\Delta m$, For the inverse compositional algorithm to be provably equivalent to the Lucas-Kanade algorithm up to a first order approximation of $\Delta m$, the set of warps $\{W_{p_{t_1}}\}$ must form a group, that is, every warp $W_{p_{t_1}}$ must be invertible. If the change of pose is small enough, the visibility for most of the pixels will remain the same; thus $W_{p_{t_1}}$ can be considered approximately invertible. However, if the pose change becomes too big, some portion of the object will become invisible after the pose transformation, and $W_{p_{t_1}}$ will no longer be invertible. A detailed proof of convergence is available in [28].
We select a set of poses $\{p_j\}$ and compute $\mathcal{B}$ and $\mathcal{C}$ at these poses. We call these poses cardinal poses. All frames that are close to a particular pose $p_j$ will use the $\mathcal{B}$ and $\mathcal{C}$ at that pose, and the warp $W_{p_{t_1}}$ should be performed to normalize the pose to $p_j$. The pictorial representation of the inverse compositional tracking scheme is shown in Figure 3. While most of the existing inverse compositional methods move the expensive update steps out of the iterations for two-frame matching, we go even further and perform these expensive computations only once every few frames. This is by virtue of the fact that we estimate 3D motion.
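Selecting the cardinal pose for a frame amounts to a nearest-neighbor lookup, sketched below. The (pan, tilt) parameterization in degrees, the 20-degree grid, and the function name are hypothetical choices made for illustration; the text above does not specify how the cardinal poses are spaced or compared.

```python
import math

def closest_cardinal_pose(pose, cardinal_poses):
    """Return the index j of the cardinal pose nearest to `pose`.

    Poses are (pan, tilt) tuples in degrees; this parameterization and the
    Euclidean metric on angles are assumptions for this sketch.
    """
    return min(range(len(cardinal_poses)),
               key=lambda j: math.dist(pose, cardinal_poses[j]))

# Hypothetical grid of cardinal poses, 20 degrees apart in pan and tilt.
cardinal_poses = [(pan, tilt)
                  for pan in (-40, -20, 0, 20, 40)
                  for tilt in (-20, 0, 20)]

# The bases computed at cardinal_poses[j] would be reused for this frame.
j = closest_cardinal_pose((17.0, -4.0), cardinal_poses)
```

Because consecutive frames usually map to the same cardinal pose, the expensive bases only need to be looked up again when `j` changes.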

The IC pose and illumination estimation algorithm
Consider a sequence of image frames $I_t$, $t = 0, \ldots, N - 1$. In keeping with standard notation used in tracking, we assume $\delta t = 1$, and consider two frames at $t$ and $t - 1$.
Assume that we know the pose and illumination estimates for frame t − 1, that is, p t−1 and l t−1 .
Step 1. For the new input frame $I_t$, find the closest cardinal pose $p_j$ to the pose estimate at $t - 1$, that is, $p_{t-1}$. Set $m_t$ to be 0.
Step 2. Apply the pose transformation operator $W_{p_{t-1}}$ to $I_t$ to get the pose normalized version of the frame.
Step 3. Estimate $l_t$ from the pose normalized image by least squares, using the basis images at $p_j$.
Step 4. With the estimated $l_t$ from Step 3, estimate the motion increment $\Delta m$ by regularized least squares with the illumination held fixed. Update $m_t$ with $m_t \leftarrow m_t + \Delta m$.
Step 5. Repeat Steps 2, 3, and 4 for that input frame until the difference error $\varepsilon$ between the pose normalized image and the rendered image $(\mathcal{B}_{p_j \mid t-1} + \mathcal{C}_{p_j \mid t-1} \times_2 m_t) \times_1 l_t$ falls below an acceptable threshold. This gives $\hat{l}_t$ and $\hat{m}_t$ of (4).

FACE RECOGNITION FROM VIDEO
We now explain the face recognition algorithm and analyze the importance of different measurements for integrating the recognition performance over a video sequence. In our method, the gallery is represented by a textured 3D model of the face. The model can be built from a single image [33], a video sequence [34], or obtained directly from 3D sensors [17]. In our experiments, the face model will be estimated from a gallery video sequence for each individual. Face texture is obtained by normalizing the illumination of the first frame in the gallery sequence to an ambient condition, and mapping it onto the 3D model. Given a probe sequence, we will estimate the motion and illumination conditions using the algorithms described in Section 2.2. Note that the tracking does not require a person-specific 3D model; a generic face model is usually sufficient. Given the motion and illumination estimates, we will then render images from the 3D models in the gallery. The rendered images can then be compared with the images in the probe sequence. For this purpose, we will design robust measurements for comparing these two sequences. A feature of these measurements will be their ability to integrate the identity over all the frames, ignoring some frames that may have the wrong identity.
Let $I_i$, $i = 0, \ldots, N - 1$ be the $i$th frame from the probe sequence. Let $S_{i,j}$, $i = 0, \ldots, N - 1$ be the frames of the synthesized sequence for individual $j$, where $j = 1, \ldots, M$ and $M$ is the total number of individuals in the gallery. Note that the number of frames in the two sequences to be compared will always be the same in our method. By design, each corresponding frame in the two sequences will be under the same pose and illumination conditions, dictated by the accuracy of the estimates of these parameters from the probe sequences. Let $d_{ij}$ be the Euclidean distance between the $i$th frames $I_i$ and $S_{i,j}$. We now compare three distance measures that can be used for obtaining the identity of the probe sequence:

$$\text{(1)} \quad \mathrm{ID} = \arg\min_j \min_i d_{ij}, \qquad (14)$$

$$\text{(2)} \quad \mathrm{ID} = \arg\min_j \max_i d_{ij}, \qquad (15)$$

$$\text{(3)} \quad \mathrm{ID} = \arg\min_j \frac{1}{N} \sum_i d_{ij}. \qquad (16)$$

The first alternative computes the distance between the frames in the probe sequence and each synthesized sequence that are the most similar and chooses the identity as the individual with the smallest distance. The second distance measure can be interpreted as minimizing the maximum separation between the frames in the probe sequence and synthesized sequences. Both of these measures suffer from a lack of robustness, which can be critical for their performance since the correctness of the frames in the synthesized sequences depends upon the accuracy of the illumination and motion parameter estimates. To improve robustness, we replace the max by the $f$th percentile and the min (in the inner distance computation of measure 1) by the $(1 - f)$th percentile.
In our experiments, we choose $f$ to be 0.8. The third option (16) chooses the identity as the minimum mean distance between the frames in the probe sequence and each synthesized sequence. Under the assumptions of Gaussian noise and uncorrelatedness between frames, this can be interpreted as choosing the identity with the maximum a posteriori probability given the probe sequence.
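The maximum a posteriori reading of the minimum mean distance can be made explicit with a short derivation. The noise model below (independent, isotropic Gaussian noise with a common variance $\sigma^2$ and a uniform prior over identities) is an assumption introduced for this sketch, and the correspondence is exact when the per-frame distance is taken as the squared Euclidean distance.

```latex
% Assumed model: I_i = S_{i,j} + n_i, with n_i ~ N(0, sigma^2 I) independent across frames,
% and a uniform prior P(j) = 1/M over the gallery identities.
\begin{align*}
\log p(I_0, \ldots, I_{N-1} \mid j)
  &= \sum_{i=0}^{N-1} \log p(I_i \mid j)
   = -\frac{1}{2\sigma^2} \sum_{i=0}^{N-1} \left\| I_i - S_{i,j} \right\|^2 + \text{const}, \\
\arg\max_j P(j \mid I_0, \ldots, I_{N-1})
  &= \arg\max_j \log p(I_0, \ldots, I_{N-1} \mid j)
   = \arg\min_j \frac{1}{N} \sum_{i=0}^{N-1} \left\| I_i - S_{i,j} \right\|^2 .
\end{align*}
```

The factorization in the first line is where the uncorrelatedness assumption enters; if the frame errors are correlated, the sum no longer equals the joint log-likelihood.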
As the images in the synthesized sequences are pose and illumination normalized to the ones in the probe sequence, $d_{ij}$ can be computed directly using the Euclidean distance. Other distance measurements, like [14,35], can be considered in situations where the pose and illumination estimates may not be reliable or in the presence of occlusion and clutter. We will look into such issues in our future work.
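The three measures and their percentile variants can be sketched in a few lines of Python. The function names, the nearest-rank percentile rule, and the toy distance table are assumptions for illustration, and identities are indexed from 0 here rather than from 1.

```python
def percentile(values, f):
    """Nearest-rank f-th percentile (0 <= f <= 1) of a list of numbers."""
    ordered = sorted(values)
    return ordered[round(f * (len(ordered) - 1))]

def identify(d, measure, f=None):
    """Pick an identity from d, where d[i][j] is the distance between probe
    frame i and the synthesized frame i of gallery identity j.

    measure 1: min over frames, or the (1 - f)-th percentile if f is given
    measure 2: max over frames, or the f-th percentile if f is given
    measure 3: mean over frames
    """
    n_frames, n_ids = len(d), len(d[0])
    scores = []
    for j in range(n_ids):
        column = [d[i][j] for i in range(n_frames)]
        if measure == 1:
            score = min(column) if f is None else percentile(column, 1 - f)
        elif measure == 2:
            score = max(column) if f is None else percentile(column, f)
        elif measure == 3:
            score = sum(column) / n_frames
        else:
            raise ValueError("measure must be 1, 2, or 3")
        scores.append(score)
    return min(range(n_ids), key=scores.__getitem__)

# Toy table: identity 0 is correct, but its last frame has a large error.
d = [[1, 4], [1, 4], [1, 4], [2, 4], [6, 4]]
ids = [identify(d, 1), identify(d, 2), identify(d, 3), identify(d, 2, f=0.8)]
```

In this toy table the plain max of measure 2 is misled by the single bad frame, while the 0.8 percentile ignores it, which is the motivation for the percentile variant.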

Video-based face recognition algorithm
Using the above notation, let $I_i$, $i = 0, \ldots, N - 1$ be $N$ frames from the probe sequence. Let $G_1, \ldots, G_M$ be the 3D models with texture for each of the $M$ galleries.
Step 1. Register a 3D generic face model to the first frame of the probe sequence. This is achieved using the method in [36]. Estimate the illumination and motion model parameters for each frame of the probe sequence using the method described in Section 2.4.
Step 2. Using the estimated illumination and motion parameters, synthesize, for each gallery, a video sequence using the generative model of (1). Denote these as $S_{i,j}$, $i = 0, \ldots, N - 1$ and $j = 1, \ldots, M$.
Step 3. Compute d i j as above.
Step 4. Obtain the identity using a suitable distance measure as in (14), (15), or (16).
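Steps 3 and 4 can be sketched as follows, assuming the synthesized sequences from Step 2 are already available as arrays. The array shapes, the function names, and the use of the mean distance in the last step are choices made for this illustration; registration, tracking, and synthesis are not shown.

```python
import numpy as np

def frame_distances(probe, synthesized):
    """Euclidean distance between each probe frame and its synthesized counterpart.

    probe:       (N, H, W)     probe frames
    synthesized: (M, N, H, W)  one synthesized sequence per gallery identity
    returns:     (N, M)        d[i, j]
    """
    diff = synthesized - probe[None]                 # broadcast over identities
    return np.sqrt((diff ** 2).sum(axis=(2, 3))).T

def recognize(probe, synthesized):
    """Return the 0-based identity with the minimum mean frame distance."""
    d = frame_distances(probe, synthesized)
    return int(np.argmin(d.mean(axis=0)))

# Toy data: identity 2 is synthesized as a slightly noisy copy of the probe.
rng = np.random.default_rng(0)
probe = rng.normal(size=(5, 8, 8))
synthesized = rng.normal(size=(3, 5, 8, 8))
synthesized[2] = probe + 0.01 * rng.normal(size=probe.shape)
identity = recognize(probe, synthesized)
```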

Accuracy of tracking and illumination estimation
We will first show some results on the accuracy of tracking and illumination estimation with known ground truth. This is because of the critical importance of this step in our proposed recognition scheme. We use the 3DMM [33] to generate a face. The generated face model is rotated along the vertical axis at some specific angular velocity, and the illumination is changing both in direction (from the right-bottom corner to the left-top corner) and in brightness (from dark to bright to dark). In Figure 4, the images show the back projection of some feature points on the 3D model onto the input frames using the estimated motion under three different illumination conditions. Figure 5(a) shows the comparison between the estimated motion (in blue) and the ground truth (in red). The maximum error in pose estimates is 2.53° and the average error is 0.67°. Figure 5(b) shows the norm of the error between the ground truth illumination coefficients and the estimated ones, normalized with the ground truth. The maximum error is 4.93% and the average is 4.1%.
The results on tracking and synthesis on two of the probe sequences in our database (described next) are shown in Figure 6. The inverse compositional tracking algorithm can track about 20 frames per second on a standard PC using a MATLAB implementation. Real-time tracking could be achieved through better software and hardware optimization.
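The two error measures quoted above can be computed along the following lines. The exact normalization is an assumption here (norm of the coefficient error divided by the norm of the ground truth), and the numbers in the example are made up purely to exercise the functions.

```python
import math

def relative_error(estimate, truth):
    """Norm of the error divided by the norm of the ground truth (assumed form)."""
    num = math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)))
    den = math.sqrt(sum(t ** 2 for t in truth))
    return num / den

def pose_error_stats(estimated_angles, true_angles):
    """Maximum and average absolute pose error over a sequence, in degrees."""
    errors = [abs(e - t) for e, t in zip(estimated_angles, true_angles)]
    return max(errors), sum(errors) / len(errors)

# Made-up values, not results from the experiments.
err = relative_error([3.0, 4.1], [3.0, 4.0])
max_err, avg_err = pose_error_stats([0.0, 10.5, 19.0], [0.0, 10.0, 20.0])
```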

Face database and experimental setup
Our database consists of videos of 57 people. Each person was asked to move his/her head as they wished (mostly rotate their head from left to right, and then from down to up), and the illumination was changed randomly. The illumination consisted of ceiling lights, lights from the back of the head, and sunlight from a window on the left side of the face. Random combinations of these were turned on and off, and the window was controlled using dark blinds. There was no control over how the subject moved his/her head or over facial expression. Sample frames of these video sequences are shown in Figure 7. The images are scale normalized and centered. Some of the subjects had expression changes also, for example, the last row of Figure 7. The average size of the face was about 70 × 70 with the minimum size being 50 × 50. Videos are captured with uniform background. We recorded 2 to 3 sessions of video sequences for each individual. All the video sessions are recorded within one week. The first session is used as the gallery for constructing the 3D textured model of the head, while the remaining are used for testing. We used a simplified version of the method in [34] for this purpose. We would like to emphasize that any other 3D modeling algorithm would also have worked. Texture is obtained by normalizing the illumination of the first frame in each gallery sequence to an ambient illumination condition and mapping onto the 3D model.
As can be seen from Figure 7, the pose and illumination vary randomly in the video. For each subject, we designed three experiments by choosing different probe sequences.

Experiment A
A video was used as the probe sequence with the average pose of the face in the video being about 15° from frontal.

Experiment B
A video was used as the probe sequence with the average pose of the face in the video being about 30° from frontal.

Experiment C
A video was used as the probe sequence with the average pose of the face in the video being about 45° from frontal.
Each probe sequence has about 20 frames around the average pose. The variation of pose in each sequence was less than 15°, so as to keep pose in the experiments disjoint. The probe sequences are about 5 seconds each. This is because we wanted to separate the probes based on pose of the head (every 15 degrees) and it does not take the subject more than 5 seconds to move 15 degrees.

Recognition results
We plot the cumulative match characteristic (CMC) [1,2] for experiments A, B, and C with measurement 1 (14), measurement 2 (15), and measurement 3 (16) in Figure 8. In experiment A, where pose is 15° away from frontal, all the videos with large and arbitrary variations of illumination are recognized correctly. In experiment B, we achieve about 95% recognition rate, while for experiment C it is 93% using the distance measure (14). Irrespective of the illumination changes, the recognition rate decreases consistently as the pose moves further from frontal (which is the gallery), a trend that has been reported by other authors [4,5]. Note that the pose and illumination conditions in the probe and gallery sets can be completely disjoint. Measurement 1 in (14) gives the best result. This is consistent with our expectation, as (14) is not affected by the few frames in which the motion and illumination estimation error is relatively high. The recognition result is affected mostly by registration error, which increases with nonfrontal pose (i.e., A→B→C). On the other hand, measurement 2 in (15) is mostly affected by the errors in the motion and illumination estimation and registration, and thus the recognition rate in Figure 8(b) is lower than that of Figure 8(a). Ideally, measurement 3 should give the best recognition rate as this is the MAP estimation. However, the assumptions of Gaussianity and uncorrelatedness may not be valid. This affects the recognition rate for measurement 3, causing it to perform worse than measurement 1 (14) but better than measurement 2 (15). We also found that small errors in 3D shape estimation have negligible impact on the motion and illumination estimates and the overall recognition result.
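A CMC curve reports, for each rank r, the fraction of probes whose true identity appears among the r best-matching gallery entries. The sketch below computes it from a table of probe-to-gallery scores; the function name and the toy scores are illustrative assumptions, with each score standing for one of the sequence-level distances (14) to (16).

```python
def cmc_curve(scores, true_ids):
    """Cumulative match characteristic from probe-to-gallery distances.

    scores[p][j]: distance from probe p to gallery identity j (smaller is better)
    true_ids[p]:  0-based index of the correct identity for probe p
    returns:      list where entry r - 1 is the fraction of probes whose
                  correct identity is ranked within the top r
    """
    n_probes, n_ids = len(scores), len(scores[0])
    hits = [0] * n_ids
    for row, true_id in zip(scores, true_ids):
        ranking = sorted(range(n_ids), key=row.__getitem__)
        rank = ranking.index(true_id)          # 0 means best match
        for r in range(rank, n_ids):
            hits[r] += 1
    return [h / n_probes for h in hits]

# Toy scores for three probes against three gallery identities.
scores = [[0.1, 0.5, 0.9],
          [0.6, 0.2, 0.7],
          [0.3, 0.2, 0.8]]
cmc = cmc_curve(scores, true_ids=[0, 1, 0])
```

The first entry of the returned list is the rank 1 recognition rate, which is the figure quoted for each experiment.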

Effect of registration and tracking errors
There are two major error sources: registration and motion/illumination estimation. The error in registration may affect the motion and illumination estimation accuracy in subsequent frames, while robust motion and illumination estimation may regain tracking after some time, if the registration errors are small.
In Figures 9(a), 9(b), and 9(c), we show the plots of error curves under three different cases. Figure 9(a) is the ideal case, in which the registration is accurate and the error in motion and illumination estimation is consistently small through the whole sequence. The distance $d_{ik}$ from the probe sequence $I_i$ with the true identity $k$ to the synthesized sequence with the correct model $S_{i,k}$ will always be smaller than $d_{ij}$, $j = 1, \ldots, k - 1, k + 1, \ldots, M$. In this case, all the measurements 1, 2, and 3 in (14), (15), or (16) will work. In the case shown in Figure 9(b), the registration is correct but the error in the motion and illumination estimation accumulates. Finally, the drift error causes $d_{ik}$, the distance from the probe sequence to the synthesized sequence with the correct model (shown in bold red), to be higher than some other distance $d_{ij}$, $j \neq k$ (shown in green). In this case, measurement 2 in (15) will be wrong but measurements 1 and 3 in (14) or (16) still work. In Figure 9(c), the registration is not accurate (the error $d_{ik}$ at the first frame is significantly higher than in (a) and (b)), but the motion and illumination estimation is able to regain tracking after a number of frames where the error decreases. In this case, both measurements 1 and 2 in (14) and (15) will not work, as it is not any individual frame that reveals the true identity, but the behavior of the error over the collection of all frames. Measurement 3 in (16) computes the overall distance by taking every frame into consideration, thus it works in such cases. This shows the importance of using different distance measurements based on the application scenario. Also, the effect of obtaining the identity by integrating over time is seen.
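The three cases can be reproduced with toy error curves. The numbers below are invented solely to mimic the qualitative shapes described above (they are not read off Figure 9), with one curve for the correct model and one for a competing identity.

```python
def mean(values):
    return sum(values) / len(values)

def correct_model_wins(d_true, d_other, reduce):
    """True if the reduced distance to the correct model is the smaller one."""
    return reduce(d_true) < reduce(d_other)

# (distances to the correct model, distances to a competing identity)
cases = {
    "a: accurate registration and tracking": ([1, 1, 1, 1, 1], [4, 5, 4, 5, 4]),
    "b: drift accumulates over time":        ([1, 1, 1, 2, 6], [4, 4, 4, 4, 4]),
    "c: poor registration, later recovery":  ([9, 5, 4, 4, 4], [3, 8, 8, 8, 8]),
}

# For each case: does measurement 1 (min), 2 (max), 3 (mean) pick the correct model?
results = {
    name: tuple(correct_model_wins(t, o, r) for r in (min, max, mean))
    for name, (t, o) in cases.items()
}
```

With these numbers, all three measurements succeed in case (a), measurement 2 fails in case (b), and only measurement 3 succeeds in case (c), matching the qualitative discussion.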

Comparison with other approaches
The area of video-based face recognition is less standardized than image-based approaches. There is no standard dataset on which both image and video-based methods have been tried, thus we do the comparison on our own dataset. This dataset can be used for such comparison by other researchers in the future.

Comparison with 3DMM-based approaches
3DMM has achieved a significant impact in the face biometrics area, and obtained impressive results in pose and illumination varying face recognition. It is similar to our proposed approach in the sense that both methods are 3D approaches, estimate the pose and illumination, and do synthesis for recognition. However, the 3DMM method [5] uses the Phong illumination model, thus it cannot model extended light sources (like the sky) accurately. To overcome this, Samaras and Zhang [4] proposed the 3D spherical harmonics basis morphable model (SHBMM) that integrates the spherical harmonics illumination representation into the 3DMM. Also, 3DMM and SHBMM methods have been applied to single images only. Although it is possible to repeatedly apply the 3DMM or SHBMM approach to each frame in the video sequence, it is inefficient. Registration of the 3D model to each frame will be needed, which requires a lot of computation and manual work. None of the existing 3DMM approaches integrate tracking and recognition. Our proposed method, which integrates 3D motion into SHBMM, is a unified approach for modeling lighting and motion in a face video sequence.
Using our dataset, we now compare our proposed approach against the SHBMM method of [4], which was shown to give better results than the 3DMM in [5]. We will also compare our results with the published results of the SHBMM method [4] in the later part of this section.
Recall that we designed three new experiments, D, E, and F, by taking random single images from A, B, and C, respectively. In Figure 10, we plot the CMC curves achieved by integrating the spherical harmonics illumination model with the 3DMM (which is essentially the idea in SHBMM [4]) on our data. For this comparison, we randomly chose images from the probe sequences of experiments A, B, and C and computed the recognition performance over multiple such random sets. Thus the experiments D, E, and F average the image-based performance over different conditions. By analyzing the plots in Figure 10, we see that the recognition performance with the video-based approach is consistently higher than the image-based one, both in rank 1 performance as well as the area under the CMC curve. This trend is magnified as the average facial pose becomes more nonfrontal. Also, we expect that registration errors, in general, will affect image-based methods more than video-based methods (since robust tracking may be able to overcome some of the registration errors, as shown in Section 4.4).
It is interesting to compare these results against the results in [4] for image-based recognition. The size of the databases in both cases is close (though ours is slightly smaller). Our recognition rate with a video sequence at an average facial pose of 15 degrees (with a range of 15 degrees about the average) is 100%, while the average recognition rate for approximately 20 degrees (called side view) in [4] is 92.4%. For experiments B and C, [4] does not have comparable cases and goes directly to profile pose (90 degrees), which we do not have. Our recognition rate at 45 degrees average pose is 93%. In [4], the quoted rates are 92% at 20 degrees and 55% at 90 degrees. Thus, our video-based recognition results trend significantly higher than those of image-based approaches that deal with both pose and illumination variations.
We would like to emphasize that the above paragraph compares recognition rates on two different datasets. While this may not seem completely fair, we are constrained by the lack of a standard dataset on which to compare image- and video-based methods. We have shown a comparison on our dataset using our implementation in Figure 10. The objective of the above paragraph is just to point out some trends with respect to published results on other datasets that do not have video; these should not be taken as very definitive statements.

Comparison with 2D approaches
In addition to comparing with 3DMM-based methods, we also compare against traditional 2D methods. We chose Kernel PCA (KPCA) [31] based approaches, as they have performed quite well in many applications. We downloaded the Kernel PCA code from http://asi.insa-rouen.fr/arakotom/toolbox/index.html and implemented Kernel PCA with LDA in MATLAB. In the training phase, we applied KPCA using the polynomial kernel and decreased the dimension of the training samples to 56. Multiclass LDA was then used to separate different people. For each individual, we used as the training set the same images that we used for constructing the 3D shape in our proposed 3D approach. With this KPCA/LDA approach, we tested the recognition performance using single frames and whole video sequences.
When we have a single frame as the probe, we use k-Nearest Neighbor for recognition. In the case of a video sequence, we compute the distance from every frame in the probe sequence to the centroid of the training samples in each class, sum these distances over time, and then rank the distance of the sequence to each class. In Figure 11, we show the recognition results of the described 2D approach using single frames and video sequences at about 15 degrees (comparable to experiments A and D), 30 degrees (comparable to experiments B and E), and 45 degrees (comparable to experiments C and F). For comparison, we also show the results of our approach with video sequences in experiments A, B, and C. Note that the testing frames and sequences are the same as those used in experiments A/B/C and D/E/F. Since 2D approaches cannot model pose and illumination variation well, their recognition results are much worse than those of 3D approaches under arbitrary pose and illumination variation. However, we can still see the advantage of integrating the video sequences in Figure 11.
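The sequence-level decision rule described above (per-frame distance to each class centroid, summed over time, then ranked) can be sketched as follows. This is only an illustration under stated assumptions: the Euclidean distance, the function names `class_centroids` and `rank_sequence`, and the 2D toy feature vectors are hypothetical stand-ins for the KPCA/LDA features used in the experiments.

```python
import math


def class_centroids(train):
    """Map each class label to the mean of its training feature vectors."""
    centroids = {}
    for label, vectors in train.items():
        dim = len(vectors[0])
        centroids[label] = [
            sum(v[d] for v in vectors) / len(vectors) for d in range(dim)
        ]
    return centroids


def rank_sequence(frames, centroids):
    """Sum per-frame distances to each centroid and rank classes, best first.

    Euclidean distance is an assumption made for this sketch.
    """
    totals = {
        label: sum(math.dist(frame, c) for frame in frames)
        for label, c in centroids.items()
    }
    ranking = sorted(totals, key=totals.get)
    return ranking, totals


# Toy data: two people, two training samples each, in a 2D feature space.
train = {
    "p1": [[0.0, 0.0], [2.0, 0.0]],
    "p2": [[4.0, 4.0], [6.0, 4.0]],
}
centroids = class_centroids(train)

# Probe sequence of "p1" whose last frame is an outlier lying nearer to "p2".
frames = [[1.0, 1.0], [2.0, 0.0], [5.0, 3.0]]
ranking, totals = rank_sequence(frames, centroids)
```

Classified on its own, the outlier frame would be assigned to the wrong person, but the summed distance over the sequence still ranks the correct identity first. This is the kind of benefit of integrating over the video that the 2D results suggest.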

Comparison with 2D illumination methods
The major disadvantage of the 2D illumination methods is that they cannot handle local illumination conditions (lighting coming from some specific direction such that only part of the object is illuminated). In Figure 12, we compare the spherical harmonics illumination model and the local histogram equalization method in removing local illumination effects. Of the three images in Figure 12(a), the top one is the original frame, with illumination coming from the left side of the face. The left image in the second row is local histogram equalized, and the right one is resynthesized with the spherical harmonics illumination model under some predefined ambient illumination. In the local histogram equalized image, although the right side of the face is enhanced compared with the original, the illumination direction can still be clearly perceived. In the image synthesized with the spherical harmonics illumination model, however, the directional effect is almost completely removed, and no illumination direction information is retained. In Figure 12(b), we plot the error curves of the probe sequence (an image of which is shown in Figure 12(a)) with the local histogram equalization method, while in Figure 12(c) we show the error curves with the method we proposed. It is clear that 3D illumination methods can achieve better results under local illumination conditions.
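To illustrate why resynthesis under ambient illumination removes directional cues, the following sketch evaluates the first nine real spherical harmonics at two surface normals. It is a simplified illustration and not our tracking or synthesis implementation. The lighting coefficients, the normals standing in for the two cheeks, and the function names are hypothetical, and the Lambertian kernel attenuation factors are assumed to be folded into the lighting coefficients.

```python
import math


def sh_basis(n):
    """First nine real spherical harmonics at unit normal n = (x, y, z)."""
    x, y, z = n
    return [
        0.282095,                        # constant (ambient) term
        0.488603 * y,
        0.488603 * z,
        0.488603 * x,
        1.092548 * x * y,
        1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ]


def shade(albedo, n, light):
    """Intensity = albedo * sum_i light[i] * Y_i(n).

    Assumption: Lambertian attenuation factors are folded into `light`.
    """
    return albedo * sum(l * b for l, b in zip(light, sh_basis(n)))


s = math.sqrt(0.5)
left_cheek = (-s, 0.0, s)   # hypothetical normal facing -x and the camera
right_cheek = (s, 0.0, s)   # hypothetical normal facing +x and the camera

# Hypothetical coefficients: light arriving mostly from the -x side.
side_light = [1.0, 0.0, 0.3, -0.8, 0.0, 0.0, 0.0, 0.0, 0.0]
# Ambient light: only the constant term is nonzero.
ambient = [1.0] + [0.0] * 8

lit_left = shade(1.0, left_cheek, side_light)
lit_right = shade(1.0, right_cheek, side_light)
amb_left = shade(1.0, left_cheek, ambient)
amb_right = shade(1.0, right_cheek, ambient)
```

Under the side light the two normals receive different intensities, whereas under the ambient coefficients the intensity depends only on albedo. A purely 2D contrast adjustment has no access to the normals and therefore cannot separate the two effects in this way.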

CONCLUSIONS
In this paper, we have proposed an analysis-by-synthesis method for video-based face recognition that relies upon a novel theoretical framework for integrating illumination, motion, and shape models to describe the appearance of a video sequence. We started with a brief exposition of this theoretical result, followed by methods for learning the model parameters. Then, we described our recognition algorithm, which relies on synthesis of video sequences under the conditions of the probe. We collected a face video database consisting of 57 people with large and arbitrary variation in pose and illumination and demonstrated the effectiveness of the method on this new database. A detailed analysis of performance was also carried out. Future work on video-based face recognition will require experimentation on large datasets, design of suitable metrics, and tight integration of the tracking and recognition phases.

Figure 1: Pictorial representation showing the motion of the object and its projection (reproduced from [26]).

Figure 2: Illustration of the warping function W. A point v in the image plane is projected onto the surface of the 3D object model. After the pose transformation with Δp, the point on the surface is back-projected onto the image plane at a new point u. The warping function maps from v ∈ R^2 to u ∈ R^2. The red ellipses show the common part in both frames upon which the warping function W is defined.

Figure 4: The back projection of the feature points on the generated 3D face model using the estimated 3D motion onto some input frames.

Figure 5: (a) 3D estimates (blue) and ground truth (red) of pose against frames. (b) The normalized error of the illumination estimates versus frame numbers.

Figure 6: Original images, tracking and synthesis results are shown in three successive rows for two of the probe sequences.

Figure 7: Sample frames from the video sequences collected for our database (best viewed on a monitor).
degrees when continuously rotating the head. To show the benefit of video-based methods over image-based approaches, we designed three new experiments, D, E, and F, by taking random single images from A, B, and C, respectively.
EURASIP Journal on Advances in Signal Processing

Figure 8: CMC curve for video-based face recognition experiments A to C; (a) with distance measurement 1 in (14), (b) with distance measurement 2 in (15), and (c) with distance measurement 3 in (16).

Figures 8(a), 8(b), and 8(c) show the recognition rates with the measurements in (14), (15), and (16), respectively. Measurement 1 in (14) gives the best result. This is consistent with our expectation, as (14) is not affected by the few frames in which the motion and illumination estimation error is relatively high. The recognition result is affected mostly by registration error, which increases with nonfrontal pose (i.e., A→B→C). On the other hand, measurement 2 in (15) is mostly affected by the errors in the motion and illumination estimation and registration, and thus the recognition rate in Figure 8(b) is lower than that of Figure 8(a). Ideally, measurement 3 should give the best recognition rate, as this is the MAP estimation. However, the assumptions of Gaussianity and uncorrelatedness may not be valid. This affects the recognition rate for measurement 3, causing it to perform worse than measurement 1 in (14) but better than measurement 2 in (15). We also found that small errors in 3D shape estimation have negligible impact on the motion and illumination estimates and the overall recognition result.

Figure 9: The plots of error curves under three different cases: (a) both registration and motion/illumination estimation are correct, (b) registration is correct but motion/illumination estimation has drift error, and (c) registration is inaccurate, but robust motion/illumination estimation can regain tracking after a number of frames.The black, bold curve shows the distance of the probe sequence with the synthesized sequence of the correct identity, while both the gray bold and dotted curves show the distance with the synthesized sequences using the incorrect identity.

Figure 10: Comparison between the CMC curves for the video-based face experiments A to C with distance measurement 1 against the SHBMM method of [4].

Figure 11: Comparison between the CMC curves for the video-based face experiments A to C with distance measurement 1 in (14) against KPCA+LDA-based 2D approaches.

Figure 12: Comparison of local illumination effects between the spherical harmonics illumination model and the local histogram equalization method. (a) Top: original image; bottom left: local histogram equalized image; bottom right: synthesis with the spherical harmonics illumination model in a predefined ambient illumination. (b) Plots of the error curves using the local histogram equalization. (c) Plots of the error curves using the proposed method. The bold curve is for the face with the correct identity.
Figure 3: Pictorial representation of the inverse compositional tracking scheme. Starting with I t , we first warp it to I t as in Step 2 below. This allows computation of the bases of the joint pose and illumination manifold at the cardinal pose p j . Then, we search along the illumination dimension of this manifold to get the illumination estimate that best describes I t . This is Step 3. Then, in Step 4, I t is projected onto the tangent plane of the manifold, where the motion estimate is obtained.
{p j } with interval of 20 degrees in pan and tilt angles, and precompute the basis B and C at