Matrix-Variate Probabilistic Model for Canonical Correlation Analysis

Motivated by the fact that data samples in computer vision are matrices, in this paper we propose a matrix-variate probabilistic model for canonical correlation analysis (CCA). Unlike probabilistic CCA, which converts image samples into vectors, our method uses the original image matrices for data representation. We show that maximum likelihood parameter estimation in the model leads to the two-dimensional canonical correlation directions. The model provides a better understanding of two-dimensional Canonical Correlation Analysis (2DCCA) and a basis for extending the method into more complex probabilistic models. In addition, we show that two-dimensional Linear Discriminant Analysis (2DLDA) can be obtained as a special case of 2DCCA.


Introduction
Recently, probabilistic interpretations of statistical dimension reduction algorithms have been proposed by several authors. Tipping and Bishop derived a latent variable model for principal component analysis (PPCA) and showed how the principal subspace of a set of data vectors can be obtained within a maximum likelihood framework [1]. Lawrence proposed another probabilistic model for Principal Component Analysis (PCA); he integrated out the weights and optimized the positions of the latent variables in the q-dimensional latent space [2]. Roweis presented an expectation-maximization (EM) algorithm for PCA. The algorithm allows a few eigenvectors and eigenvalues to be extracted from large collections of high-dimensional data [3]. Unlike PCA, which works with a single random vector and maximizes the variance in the projected space, Canonical Correlation Analysis (CCA) works with a pair of random vectors (or, in general, with a set of m random vectors) and maximizes the correlation between sets of projections. In [4], a latent variable model for CCA was proposed by Bach and Jordan. Other probabilistic models are also known [5][6][7][8]. In general, probabilistic models have many advantages, including the following: (i) the potential to extend the scope of the methods to mixture models [9]; (ii) extensions that handle missing data values [1]; (iii) automatic model selection by combining the likelihood with a prior [10]; (iv) extensions of the model to supervised or semi-supervised cases [5].
One major drawback of the aforementioned methods is that they only work for data vectors, while in computer vision research, samples are often multidimensional arrays such as matrices or tensors. Hence, in a preprocessing step, the image matrices must be converted into long vectors. This results in losing the spatial structure of the image and, consequently, in huge covariance matrices, high computational cost, and the small sample size problem. Recently, some statistical methods have been proposed that operate directly on the image matrices without the image-to-vector conversion procedure. These methods make use of the spatial information in the image structure and reduce the computational cost to a great extent. Generalized Low Rank Approximations of Matrices (GLRAM) [11], Two-Dimensional Canonical Correlation Analysis (2DCCA) [12,13], and Two-Dimensional Linear Discriminant Analysis (2DLDA) [14] are some well-known matrix-based algorithms built on this idea. Other researchers have applied multilinear algebra and extended this concept to higher-order tensor data [15][16][17][18][19][20].
Because of the success of matrix-based methods, some researchers have recently developed probabilistic models for matrix and tensor extensions of PCA [21][22][23][24]. However, they do not establish a maximum likelihood relationship between their models and the corresponding PCA.
Building on probabilistic principal component analysis [1] and probabilistic canonical correlation analysis [4], in this paper we propose a matrix-variate factor analysis model whose maximum likelihood solution extracts the canonical correlation directions of two random matrices. In addition, we show that 2DCCA can be converted to 2DLDA by considering a special kind of random matrices. This means that 2DLDA can be interpreted as 2DCCA between appropriately defined random matrices.
The remaining part of the paper is organized as follows: in Section 2, we review CCA and its probabilistic interpretation. Two-dimensional CCA is described in Section 3, and Section 4 introduces our probabilistic model and the derivation of canonical directions using maximum likelihood estimation. The relationship between 2DCCA and 2DLDA is discussed in Section 5. Finally, conclusions are presented in Section 6.

Probabilistic CCA
Classical CCA considers two random vectors $t_1 \in \mathbb{R}^{m_1}$ and $t_2 \in \mathbb{R}^{m_2}$ and seeks directions $u_1$ and $u_2$ that maximize the correlation between the projections $u_1^T t_1$ and $u_2^T t_2$. The solutions to this problem can be obtained as $u_1 = \hat\Sigma_{11}^{-1/2} q_1$ and $u_2 = \hat\Sigma_{22}^{-1/2} q_2$, where $q_1$ and $q_2$ contain the left and right singular vectors of $\hat\Sigma_{11}^{-1/2} \hat\Sigma_{12} \hat\Sigma_{22}^{-1/2}$, and $\hat\Sigma_{ij}$ denotes the sample covariance matrix of the random vectors $t_i$ and $t_j$. A latent variable model for CCA has been proposed in [4], whose graphical model is depicted in Figure 1. The model is defined as
$$x \sim \mathcal{N}(0, I_d), \qquad t_1 \mid x \sim \mathcal{N}(U_1 x, \Psi_1), \qquad t_2 \mid x \sim \mathcal{N}(U_2 x, \Psi_2),$$
where we assume that the data are centered. The negative log-likelihood of the data is equal to
$$\mathcal{L} = \frac{N}{2}\left[(m_1 + m_2)\ln 2\pi + \ln|\Sigma| + \operatorname{tr}\!\big(\Sigma^{-1}\tilde\Sigma\big)\right],$$
where $N$ is the number of samples, $\Sigma = U U^T + \Psi$ with $U = [U_1^T, U_2^T]^T$ and $\Psi = \operatorname{blockdiag}(\Psi_1, \Psi_2)$, and $\tilde\Sigma$ denotes the sample covariance matrix of the joint vector $(t_1^T, t_2^T)^T$. The maximum likelihood solution is given by
$$\hat U_1 = \hat\Sigma_{11}^{1/2} Q_{1d} M_1, \qquad \hat U_2 = \hat\Sigma_{22}^{1/2} Q_{2d} M_2, \qquad \hat\Psi_i = \hat\Sigma_{ii} - \hat U_i \hat U_i^T,$$
where $M_1$ and $M_2$ are arbitrary matrices with $M_1 M_2^T = P_d$, the columns of $Q_{1d}$ and $Q_{2d}$ are the first $d$ left and right singular vectors of the matrix $\hat\Sigma_{11}^{-1/2} \hat\Sigma_{12} \hat\Sigma_{22}^{-1/2}$, and $P_d$ is the diagonal matrix containing its first $d$ singular values (the canonical correlations).
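The closed-form solution above can be sketched numerically. The following Python snippet is an illustrative sketch (function and variable names are our own, not from the paper): it computes the classical canonical directions via the whitened SVD and the maximum likelihood loadings of the latent variable model under the particular choice $M_1 = M_2 = P_d^{1/2}$.

```python
import numpy as np

def inv_sqrt(S):
    # Inverse matrix square root of a symmetric positive definite matrix.
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

def cca_ml(T1, T2, d):
    """CCA directions and ML parameters of the latent variable model.
    T1, T2 hold centered samples in their columns."""
    N = T1.shape[1]
    S11, S22, S12 = T1 @ T1.T / N, T2 @ T2.T / N, T1 @ T2.T / N
    W1, W2 = inv_sqrt(S11), inv_sqrt(S22)
    # SVD of the whitened cross-covariance gives the canonical structure.
    Q1, p, Q2t = np.linalg.svd(W1 @ S12 @ W2)
    U1, U2 = W1 @ Q1[:, :d], W2 @ Q2t.T[:, :d]       # canonical directions
    # ML loadings with the particular choice M1 = M2 = P_d^{1/2}.
    Pd_half = np.diag(np.sqrt(p[:d]))
    L1 = np.linalg.inv(W1) @ Q1[:, :d] @ Pd_half     # Sigma11^{1/2} Q_1d M_1
    L2 = np.linalg.inv(W2) @ Q2t.T[:, :d] @ Pd_half
    Psi1, Psi2 = S11 - L1 @ L1.T, S22 - L2 @ L2.T    # ML noise covariances
    return U1, U2, p[:d], L1, L2, Psi1, Psi2
```

The returned directions satisfy $U_1^T \hat\Sigma_{11} U_1 = I$, and the singular values $p$ are the sample canonical correlations.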

2-Dimensional CCA
The main difference between classical CCA and 2DCCA lies in the way the data are represented. Unlike classical CCA, which uses a vectorized representation, 2DCCA works with the data in matrix representation. Therefore, 2DCCA preserves some implicit structural information among elements of the original images. It also overcomes the singularity problem of the scatter matrices resulting from the high dimensionality of the vectors [12,13].
2DCCA considers two random matrices $T_1 \in \mathbb{R}^{m_1 \times n_1}$ and $T_2 \in \mathbb{R}^{m_2 \times n_2}$ and seeks left transforms $L_1 \in \mathbb{R}^{m_1 \times m}$, $L_2 \in \mathbb{R}^{m_2 \times m}$ and right transforms $R_1 \in \mathbb{R}^{n_1 \times n}$, $R_2 \in \mathbb{R}^{n_2 \times n}$ such that the following criterion is maximized:
$$\arg\max_{L_1, L_2, R_1, R_2} \operatorname{cov}\!\big(L_1^T T_1 R_1,\; L_2^T T_2 R_2\big) \quad \text{s.t.} \quad \operatorname{var}\!\big(L_1^T T_1 R_1\big) = \operatorname{var}\!\big(L_2^T T_2 R_2\big) = 1. \tag{1}$$
There is no closed-form solution for maximizing over all projection matrices simultaneously. Hence, 2DCCA adopts an iterative algorithm for finding locally optimal projections.
At first, the left transforms $L_1, L_2$ are assumed known and the following covariance matrices are defined:
$$\hat\Sigma^l_{ij} = \frac{1}{N}\sum_{k=1}^{N} T^l_{i,k}\,(T^l_{j,k})^T, \qquad i, j \in \{1, 2\},$$
where $T^l_1 = T_1^T L_1$ and $T^l_2 = T_2^T L_2$ are the left-projected sample matrices. Then, formula (1) becomes a classical CCA problem over the columns of the left-projected matrices, and the optimal right projections can be obtained as
$$R_1 = (\hat\Sigma^l_{11})^{-1/2} Q^l_1, \qquad R_2 = (\hat\Sigma^l_{22})^{-1/2} Q^l_2,$$
where $Q^l_1$ and $Q^l_2$ contain the first $n$ left and right singular vectors of $(\hat\Sigma^l_{11})^{-1/2} \hat\Sigma^l_{12} (\hat\Sigma^l_{22})^{-1/2}$. Alternatively, with the right transforms $R_1, R_2$ fixed, we can rewrite (1) in terms of the right-projected sample matrices $T^r_1 = T_1 R_1$ and $T^r_2 = T_2 R_2$ and the covariance matrices $\hat\Sigma^r_{ij} = \frac{1}{N}\sum_{k=1}^{N} T^r_{i,k}\,(T^r_{j,k})^T$. The optimal left projections can then be obtained as
$$L_1 = (\hat\Sigma^r_{11})^{-1/2} Q^r_1, \qquad L_2 = (\hat\Sigma^r_{22})^{-1/2} Q^r_2,$$
where $Q^r_1$ and $Q^r_2$ contain the first $m$ left and right singular vectors of $(\hat\Sigma^r_{11})^{-1/2} \hat\Sigma^r_{12} (\hat\Sigma^r_{22})^{-1/2}$. The left projections ($L_1$ and $L_2$) and the right projections ($R_1$ and $R_2$) are determined by iteratively solving these two problems until convergence.
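The alternating procedure above can be sketched in a few lines of Python. This is an illustrative sketch under our own naming conventions; each column of every projected sample matrix is treated as one vector-valued sample when the covariance matrices are formed.

```python
import numpy as np

def inv_sqrt(S, eps=1e-10):
    # Inverse matrix square root with a small eigenvalue floor.
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ Q.T

def cca_pair(A_list, B_list, k):
    # CCA directions between two sets of projected sample matrices,
    # treating every column of every matrix as one sample.
    A, B = np.hstack(A_list), np.hstack(B_list)
    S11 = A @ A.T / A.shape[1]
    S22 = B @ B.T / B.shape[1]
    S12 = A @ B.T / A.shape[1]
    W1, W2 = inv_sqrt(S11), inv_sqrt(S22)
    Q1, _, Q2t = np.linalg.svd(W1 @ S12 @ W2)
    return W1 @ Q1[:, :k], W2 @ Q2t.T[:, :k]

def two_d_cca(T1s, T2s, m, n, iters=10):
    """Iterative 2DCCA sketch. T1s, T2s: lists of centered sample matrices;
    m, n: numbers of left and right projection directions."""
    L1 = np.eye(T1s[0].shape[0])[:, :m]   # crude initialization of left transforms
    L2 = np.eye(T2s[0].shape[0])[:, :m]
    for _ in range(iters):
        # Fix left transforms, solve for right ones (left-projected data T_i^T L_i).
        R1, R2 = cca_pair([T.T @ L1 for T in T1s], [T.T @ L2 for T in T2s], n)
        # Fix right transforms, solve for left ones (right-projected data T_i R_i).
        L1, L2 = cca_pair([T @ R1 for T in T1s], [T @ R2 for T in T2s], m)
    return L1, L2, R1, R2
```

Each half-step is itself a closed-form CCA problem, so the objective is non-decreasing over iterations, but only a local optimum is guaranteed.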

Matrix-Variate Probabilistic Model for CCA
In this section, we propose an extension of the probabilistic canonical correlation model to deal with 2D data. One limitation of the probabilistic canonical correlation model is that samples are represented by vectors, while in computer vision research, data (images) are often matrices, and this structural information can be used to improve the conventional model. We show that estimating the parameters of the proposed model leads to the two-dimensional canonical correlation analysis directions.
We relate the random matrices $T_1 \in \mathbb{R}^{m_1 \times n_1}$ and $T_2 \in \mathbb{R}^{m_2 \times n_2}$ to the latent matrix $X \in \mathbb{R}^{m \times n}$ as follows:
$$T_1 = U_1 X V_1^T + \Xi_1, \qquad T_2 = U_2 X V_2^T + \Xi_2,$$
where $U_1 \in \mathbb{R}^{m_1 \times m}$, $U_2 \in \mathbb{R}^{m_2 \times m}$, $V_1 \in \mathbb{R}^{n_1 \times n}$, and $V_2 \in \mathbb{R}^{n_2 \times n}$ are the factor loading matrices. $\Xi_1$ and $\Xi_2$ are the noise sources, and every entry of them follows $\mathcal{N}(0, \psi_1)$ and $\mathcal{N}(0, \psi_2)$, respectively. Let $\Theta = \{U_1, U_2, V_1, V_2, \psi_1, \psi_2\}$ be the parameters of the model. The observations $T_1$ and $T_2$ are conditionally independent given the value of the latent matrix $X$; so, we have
$$P(T_1, T_2 \mid X) = P(T_1 \mid X)\,P(T_2 \mid X).$$
The marginal distribution of the observed variables is then given by integrating out the latent variable:
$$P(T_1, T_2) = \int P(T_1 \mid X)\,P(T_2 \mid X)\,P(X)\,dX.$$
Maximum likelihood is one method for setting the values of these parameters, which involves maximizing the log probability of the observed data set given the parameters, that is,
$$\mathcal{L}(\Theta) = \sum_{n=1}^{N} \ln P(T_{1;n}, T_{2;n} \mid \Theta),$$
where $D_i = \{T_{i;n}\}_{n=1}^{N}$, $i \in \{1, 2\}$, consists of $N$ data matrices. One difficulty here is that all the projection matrices $\{U_i, V_i\}_{i=1}^{2}$ should be obtained simultaneously, and there is no closed-form solution for this. Therefore, two probabilistic models are proposed so as to obtain each projection direction separately through an alternating optimization procedure.
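To make the generative model concrete, the following sketch (with our own illustrative sizes and noise levels) draws a pair of observed matrices from the model: a shared latent matrix $X$ is sampled once and both views are generated from it, which is exactly the conditional independence stated above.

```python
import numpy as np

rng = np.random.default_rng(1)
m1, n1, m2, n2, m, n = 5, 4, 6, 3, 2, 2      # illustrative dimensions
U1, V1 = rng.standard_normal((m1, m)), rng.standard_normal((n1, n))
U2, V2 = rng.standard_normal((m2, m)), rng.standard_normal((n2, n))
psi1, psi2 = 0.1, 0.2                         # per-entry noise variances

def sample_pair():
    # One latent matrix X ~ N(0, I) couples the two observed views;
    # given X, the entries of T1 and T2 are independent Gaussians.
    X = rng.standard_normal((m, n))
    T1 = U1 @ X @ V1.T + np.sqrt(psi1) * rng.standard_normal((m1, n1))
    T2 = U2 @ X @ V2.T + np.sqrt(psi2) * rng.standard_normal((m2, n2))
    return T1, T2
```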
We assume that the value of $U_i$, $i = 1, 2$, is known and proceed to project the observations over these matrices. The left probabilistic model is defined as
$$T^l_i = V_i X^l + \Xi^l_i, \qquad i = 1, 2,$$
where $T^l_i = T_i^T U_i$, $X^l = X^T$, and $\Xi^l_i$ is the noise in this model. We define the left probabilistic function $P(T^l_1, T^l_2)$ as the marginal distribution over the latent variables, that is,
$$P(T^l_1, T^l_2) = \int P(T^l_1 \mid X^l)\,P(T^l_2 \mid X^l)\,P(X^l)\,dX^l,$$
where the projected observations $T^l_1$ and $T^l_2$ are conditionally independent given the value of the latent matrix $X^l$. One issue here is that the probabilistic distributions are defined over vectors, but in this case the observed data are matrices. Let $t^l_{i,j} \in \mathbb{R}^{n_i}$ be the $j$th column of the projected matrix $T^l_i \in \mathbb{R}^{n_i \times m}$; then the probabilistic function $P(T^l_i)$ is defined as $P(T^l_i) = \prod_{j=1}^{m} P(t^l_{i,j})$. The $x$-conditional probability distribution over the $t^l_{i,j}$ space is given by
$$t^l_{i,j} \mid x^l_j \sim \mathcal{N}\big(V_i x^l_j,\; \psi^l_i I\big),$$
where $x^l_j \in \mathbb{R}^{n}$ is the $j$th column vector of $X^l$ and the marginal distribution of $x^l_j$ is $\mathcal{N}(0, I)$. Therefore, the marginal distribution of the observed data $t^l_{i,j}$ is readily obtained by integrating out the latent variables, giving
$$t^l_{i,j} \sim \mathcal{N}\big(0,\; V_i V_i^T + \psi^l_i I\big).$$
Let $\tau^l_j = [(t^l_{1,j})^T, (t^l_{2,j})^T]^T$, $V = [V_1^T, V_2^T]^T$, and $\Psi^l = \operatorname{blockdiag}(\psi^l_1 I, \psi^l_2 I)$. Then $P(\tau^l_j) = \mathcal{N}(0, \Sigma^l)$, where $\Sigma^l = V V^T + \Psi^l$. It can be shown that the negative log-likelihood of the left-projected data is equal to
$$\mathcal{L}^l = \frac{Nm}{2}\left[(n_1 + n_2)\ln 2\pi + \ln|\Sigma^l| + \operatorname{tr}\!\big((\Sigma^l)^{-1}\tilde\Sigma^l\big)\right],$$
where $\tilde\Sigma^l$ is the sample covariance matrix of the left-projected data $\tau^l_j$, and $|A|$ denotes the determinant of matrix $A$. For the log-likelihood not to become infinite, we assume $\Sigma^l \succ 0$. Figure 2(a) depicts the left probabilistic graphical model.
In this stage, we maximize $\mathcal{L}^l$ by differentiating with respect to $V$, $\psi^l_1$, and $\psi^l_2$, where the solution is straightforward. As shown in [4], the solutions can be obtained as
$$\hat V_1 = \hat\Sigma^l_{11} R_1 M_1 = (\hat\Sigma^l_{11})^{1/2} Q^l_1 M_1, \qquad \hat V_2 = \hat\Sigma^l_{22} R_2 M_2 = (\hat\Sigma^l_{22})^{1/2} Q^l_2 M_2, \qquad M_1 M_2^T = P^l,$$
where $Q^l_1$ and $Q^l_2$ are composed of the first $n$ left and right singular vectors of the matrix $(\hat\Sigma^l_{11})^{-1/2} \hat\Sigma^l_{12} (\hat\Sigma^l_{22})^{-1/2}$, with the corresponding singular values on the diagonal of the matrix $P^l \in \mathbb{R}^{n \times n}$, and the matrices $R_1$ and $R_2$ are composed of the first $n$ canonical directions. Note that the size of the matrix $(\hat\Sigma^l_{11})^{-1/2} \hat\Sigma^l_{12} (\hat\Sigma^l_{22})^{-1/2}$ is $n_1 \times n_2$, which is much smaller than the size of the matrix $\hat\Sigma_{11}^{-1/2} \hat\Sigma_{12} \hat\Sigma_{22}^{-1/2}$ in the vector-based model. After computing $V_1$ and $V_2$, the observations are projected onto these matrices. The right probabilistic model is defined as
$$T^r_i = U_i X^r + \Xi^r_i, \qquad i = 1, 2,$$
where $T^r_i = T_i V_i$, $X^r = X$ is the latent matrix, and $\Xi^r_i$ represents the noise source in this model. Similar to the left probabilistic model, we define $t^r_{i,j} \in \mathbb{R}^{m_i}$ and $x^r_j \in \mathbb{R}^{m}$ as the $j$th column vectors of $T^r_i$ and $X^r$, respectively, where the marginal distribution of $x^r_j$ is $\mathcal{N}(0, I)$.
Let $\tau^r_j = [(t^r_{1,j})^T, (t^r_{2,j})^T]^T$, $U = [U_1^T, U_2^T]^T$, $\Psi^r = \operatorname{blockdiag}(\psi^r_1 I, \psi^r_2 I)$, and $\Sigma^r = U U^T + \Psi^r$. Therefore, $P(\tau^r_j) = \mathcal{N}(0, \Sigma^r)$. The negative log-likelihood of the right-projected data is equal to
$$\mathcal{L}^r = \frac{Nn}{2}\left[(m_1 + m_2)\ln 2\pi + \ln|\Sigma^r| + \operatorname{tr}\!\big((\Sigma^r)^{-1}\tilde\Sigma^r\big)\right],$$
where $\tilde\Sigma^r$ is the sample covariance matrix of the right-projected data samples $\tau^r_j$, and we assume $\Sigma^r \succ 0$. The solution to this optimization can be obtained as
$$\hat U_1 = \hat\Sigma^r_{11} L_1 M_1 = (\hat\Sigma^r_{11})^{1/2} Q^r_1 M_1, \qquad \hat U_2 = \hat\Sigma^r_{22} L_2 M_2 = (\hat\Sigma^r_{22})^{1/2} Q^r_2 M_2, \qquad M_1 M_2^T = P^r,$$
where in this case $L_1$ and $L_2$ contain the first $m$ canonical directions, $Q^r_1$ and $Q^r_2$ are composed of the first $m$ left and right singular vectors of $(\hat\Sigma^r_{11})^{-1/2} \hat\Sigma^r_{12} (\hat\Sigma^r_{22})^{-1/2}$, and $P^r \in \mathbb{R}^{m \times m}$ contains the corresponding singular values on its diagonal. The right graphical probabilistic model is shown in Figure 2(b). It can be seen that the left and right canonical directions of 2DCCA are obtained by maximizing the likelihood functions. The posterior expectations can be obtained as follows:
$$E[x^l_j \mid \tau^l_j] = V^T (\Sigma^l)^{-1} \tau^l_j, \qquad E[x^r_j \mid \tau^r_j] = U^T (\Sigma^r)^{-1} \tau^r_j. \tag{29}$$
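The posterior expectation in (29) amounts to a single linear solve. A minimal sketch for the left model (with our own function name, and isotropic per-view noise as in the model above):

```python
import numpy as np

def left_posterior_mean(V1, V2, psi1, psi2, tau):
    # E[x^l_j | tau^l_j] = V^T (Sigma^l)^{-1} tau^l_j, with
    # Sigma^l = V V^T + Psi^l, V = [V1; V2], Psi^l block-isotropic.
    n1, n2 = V1.shape[0], V2.shape[0]
    V = np.vstack([V1, V2])
    Psi = np.diag(np.concatenate([np.full(n1, psi1), np.full(n2, psi2)]))
    Sigma = V @ V.T + Psi
    return V.T @ np.linalg.solve(Sigma, tau)
```

As the noise variances shrink, the posterior mean approaches the true latent coordinates whenever $V$ has full column rank.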

Relationship of 2DCCA and 2DLDA
In this section, we show that 2DCCA and 2DLDA [14] are closely related. 2DLDA uses the original sample matrices for constructing the between-class and within-class covariance matrices. It adopts an iterative algorithm where, in each iteration, one projection direction is assumed known and the other projection is obtained by solving a generalized eigenvalue problem. Let $X^l_j \in \mathbb{R}^{r \times c}$, $j = 1, \ldots, N$, be the $N$ image samples projected onto the left projection matrix $L$. These samples are clustered into $C$ classes with class labels $y_j \in \{1, \ldots, C\}$, where the $i$th class has $n_i$ data samples. Define $\bar X^l_i \in \mathbb{R}^{r \times c}$ as the mean of the $i$th class, $\bar X^l$ as the global mean, and $\pi$ as a vector whose $i$th element is $\pi_i = n_i / N$.

Lemma 1. In 2DLDA, the between-class scatter matrix can be written as $SB^l = M P M^T$, where $M = [\bar X^l_1 - \bar X^l, \ldots, \bar X^l_C - \bar X^l]$ and $P = \operatorname{diag}(\pi) \otimes I$; here $\otimes$ is the Kronecker product, $\operatorname{diag}(\pi)$ is a diagonal matrix with the $\pi_i$'s on its diagonal, and $I$ is the identity matrix.
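Lemma 1 is easy to verify numerically. The sketch below (synthetic data, our own variable names) checks the factorization $SB^l = MPM^T$ with $P = \operatorname{diag}(\pi) \otimes I$ against the direct definition of the between-class scatter:

```python
import numpy as np

rng = np.random.default_rng(2)
r, c, C, n_per = 4, 3, 3, 10                 # sizes of the synthetic example
X = [rng.standard_normal((r, c)) for _ in range(C * n_per)]
y = np.repeat(np.arange(C), n_per)           # class labels
pi = np.full(C, n_per / (C * n_per))         # class priors pi_i = n_i / N

# Class means and global mean of the (projected) samples.
means = [np.mean([X[j] for j in range(len(X)) if y[j] == i], axis=0)
         for i in range(C)]
grand = sum(pi[i] * means[i] for i in range(C))

# Between-class scatter, directly and via the factorization of Lemma 1.
SB = sum(pi[i] * (means[i] - grand) @ (means[i] - grand).T for i in range(C))
M = np.hstack([means[i] - grand for i in range(C)])
P = np.kron(np.diag(pi), np.eye(c))
assert np.allclose(SB, M @ P @ M.T)
```

Each $r \times r$ term $\pi_i (\bar X^l_i - \bar X^l)(\bar X^l_i - \bar X^l)^T$ corresponds to one $c \times c$ block of $P$ scaling one block column of $M$.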
The proof of the lemma is given in Appendix A. Consider two sets of matrix data, $\{T_{1,j} = X^l_j,\ j = 1, \ldots, N\}$ and $\{T_{2,j} = [Q_1^T, \ldots, Q_C^T]^T \in \mathbb{R}^{Cr \times r},\ j = 1, \ldots, N\}$, which are realizations of the random matrices $T_1$ and $T_2$, respectively, where $Q_i = I_r$ if $y_j = i$ and $Q_i = 0_r$ otherwise. For example, for the image matrix $X^l_1$ with class label $y_1 = 2$, $T_{1,1}$ and $T_{2,1}$ are defined as
$$T_{1,1} = X^l_1, \qquad T_{2,1} = [0_r, I_r, 0_r, \ldots, 0_r]^T,$$
where $0_r$ is an $r \times r$ square matrix of all zeros. The following lemma shows the relationship between the two methods.
Lemma 2. 2DCCA finds the optimal correlation directions of $(T_1, T_2)$ by solving the generalized eigenvalue problem $SB^l u = \big(\lambda/(1-\lambda)\big) SW^l u$, where $SB^l$ and $SW^l$ are the between-class and within-class covariance matrices, respectively.
The proof is given in Appendix B. As we know, the right projection vector of 2DLDA is computed from the generalized eigenvalue problem $SB^l w = \lambda SW^l w$, while in Lemma 2 we proved that the canonical correlation direction of 2DCCA for $(T_1, T_2)$ is obtained by solving the generalized eigenvalue problem $SB^l u = \big(\lambda/(1-\lambda)\big) SW^l u$. These two problems share the same eigenvectors, which shows the relationship between the two methods. Therefore, the proposed probabilistic model can also be used for modeling the 2DLDA technique.

Conclusion
Conventional probabilistic models only work for vector data, while the data samples in computer vision applications are matrices. In this paper, we presented a probabilistic interpretation of matrix-based canonical correlation analysis. We introduced a model and showed that the two-dimensional canonical correlation directions can be obtained using maximum likelihood parameter estimation. This model can be applied to extend matrix-based CCA. We also showed that matrix-based Linear Discriminant Analysis can be obtained by an appropriate choice of the input random matrices of CCA.

Appendices
A. Proof of Lemma 1

By definition, the between-class scatter matrix of the left-projected samples is $SB^l = \sum_{i=1}^{C} \pi_i (\bar X^l_i - \bar X^l)(\bar X^l_i - \bar X^l)^T$. By substituting $\bar X^l = \sum_{i=1}^{C} \pi_i \bar X^l_i$ and $\sum_{i=1}^{C} \pi_i = 1$ into this expression, the factorization $SB^l = M P M^T$, with $M = [\bar X^l_1 - \bar X^l, \ldots, \bar X^l_C - \bar X^l]$ and $P = \operatorname{diag}(\pi) \otimes I$, follows directly.

B. Proof of Lemma 2

Writing the 2DCCA criterion for the pair $(T_1, T_2)$ defined in Section 5 and simplifying the resulting covariance matrices yields a problem that is equivalent to solving the generalized eigenvalue problem $SB^l u = \lambda (SB^l + SW^l) u$, which is equal to $SB^l u = \big(\lambda/(1-\lambda)\big) SW^l u$.

Figure 2: Probabilistic graphical model for 2DCCA: (a) left model and (b) right model.