2.1 Minimization of low-rank matrices using nuclear norm minimization
In the area of engineering and applied science such as machine learning and computer vision, a wide range of problems can be or have been represented under low-rank minimization framework, since the low-rank formulation seems to be able to capture the low-order structure of the underlying problems.
In many practical problems, one would like to guess the missing entries of an n1×n2 matrix from a sampling Ω of its entries. This problem is known as the matrix completion problem. It comes up in a great number of applications including those of collaborating filtering. The collaborating filtering is the task of automatic predicting of the entries in an unknown data matrix. A popular example is the movie recommendation case where the task is to make automatic predictions about the interests of a user by collecting taste information from its formal interests or by collecting them from other users.
In mathematical terms, this problem is posed as follows:
A data matrix \(X\in \mathbb {R}^{n_{1} \times n_{2}}\) is the matrix to be known as much as possible. The only information available about it is a sampling set of entries Mij,(i,j)∈Ω, where Ω is a subset of the complete set of entries {1,..,n1}×{1,..,n2}.
Very few factors contribute to an individual’s tastes. Therefore, the problem of matrix completion is an optimization problem of a low-rank r matrix from a sample of its entries. The matrix rank satisfies r≤ min(n1,n2). Such a matrix is represented by counting n1×n2 numbers but has only r×(n1×n2−r) degrees of freedom. When the matrix rank is small and its dimension is large, then the data matrix carries much less information than its ambient dimension suggests. In the case of collaborative prediction movie recommendation system, users—rows of the matrix—are given the opportunity to rate items—columns of the data matrix. However, they usually rate very few ones so there are very few scattered observed entries of this data matrix. In this case, the users-ratings matrix is approximately low-rank, because as mentioned, it is commonly believed that only very few factors contribute to an individual’s tastes or preferences. These preferences are stored in a user profile [1]. In the same analogy, matrix completion can be used to restore images with missing data. From limited information, we aim to recover the image, i.e., infer the many missing pixels.
2.2 Problem statement
Given Ω⊂[n1]×[n2] a set of elements of an unknown rank-r matrix, \(X\in \mathbb {R}^{n_{1} \times n_{2}} \). The values of elements Mij,(i,j)∈Ω are known. The task is to recover incomplete matrix X. Formally, the low-rank matrix completion problem is given by:
$$ \left\{\begin{array}{ll} \text{minimize} & \text{rank}(X) \\ \text{subject to} & P_{\Omega}(X)=P_{\Omega}(M) \end{array}\right. $$
(1)
where \(P_{\Omega } : \mathbb {R}^{n_{1}\times n_{2}} \longrightarrow \mathbb {R}^{n_{1}\times n_{2}} \) is the orthogonal projection onto the subspace of matrices that vanish outside of Ω, (i,j)∈Ω if and only if Mij is observed. PΩ(X) is defined by:
$$ P_{\Omega}(X)=\left\{ \begin{array}{ll} X_{ij} \quad &\text{if}~~(i,j)\in \Omega \\ 0 \quad &\text{otherwise } \end{array}\right. $$
(2)
The data known in M is given by PΩ(M). The matrix X is recovered then from PΩ(X) if it is the unique matrix of rank less or equal to r and consistent with the data.
In a practical point of view, the rank minimization problem is an NP-hard problem. Algorithms are not capable to resolve it in time once the matrices have an important dimension. They require time doubly exponential in the dimension of the matrix to find the exact solution. Authors in [3] proposed the nuclear norm minimization method. Replacing the rank of a matrix by its nuclear norm can be justified as a convex relaxation (the nuclear norm \(\Vert X \Vert _{*} = \sum _{i} \sigma _{i}(X)\) is the largest convex lower bound of rank(X) on the ball {X/∥X∥∗=σ(X)≤1} [3]). Consequently, the problem (2) is then replaced by the following:
$$ \left\{ \begin{array}{ll} \text{minimize}\ v & \Vert X\Vert_{*} \\ \text{subject to} & P_{\Omega}(X)=P_{\Omega}(M) \end{array}\right. $$
(3)
where the nuclear norm ∥X∥∗ is defined as the sum of its singular values: \(\Vert X\Vert _{*} = \sum _{i} \sigma _{i}(X)\).
Since the nuclear norm ball {X:∥X∥∗≤1} is the convex hull of the set of rank-one matrices with spectral norm bounded by one, authors in [4] interpret that under suitable conditions, the rank minimization program (2) and the convex program (3) are formally equivalent in the sense that they have exactly the same unique solution.
Matrix completion problem is not as ill posed as thought. It is possible to resolve it by convex programming. The rank function counts the number of nonvanishing singular values when the nuclear norm sums their amplitude. The nuclear norm is a convex function. It can be optimized efficiently via semidefinite programming.
The following theorem is demonstrated by authors in [4].
Theorem 1
Let M be an n1×n2 matrix of a rank r sampled from the random orthogonal model, and put n= max(n1,n2). Suppose we observe m entries of M with locations sampled uniformly at random. Then, they are numerical constants C and c such that if :
$$ m \geq Cn^{5/4}r \log n $$
(4)
The minimizer to the problem (3) is unique and equal to M with probability at least 1−cn−3; that is to say, the semidefinite program (3) recovers all the entries of M with no error. In addition, if r≤n1/5, then the recovery is exact with probability at least 1−cn−3 provided that:
$$ m \geq Cn^{(6/5)}r \log n $$
(5)
Under the hypothesis of Theorem 1, there is a unique low-rank matrix, which is consistent with the observed entries. This matrix can be recovered by the convex optimization (3). For most problems, the nuclear norm relaxation is formally equivalent to the combinatorial hard rank minimization problem.
If the coherence is low, few samples are required to recover M. As an example, matrices with incoherent column and row space matrices with random orthogonal model or those with small components of the singular vectors of M.
Conventional semidefinite programming solvers such as SDPT3 [5] and SeDeMi [6] solve the problem (3). However, such solvers are usually based on interior-point methods and cannot deal with large matrices. They can only solve problems of size at most hundreds by hundreds on a moderate computer. These solvers are problematic when the size of the matrix is large. They need to solve huge systems of linear equations to compute the Newton direction. To be precise, SDTP handles only square matrices with the size less than 100. Another alternative is to think of using iterative solvers such as the method of conjugate gradients to solve the Newton system. However, it is still problematic as well since it is well known that the condition number of the Newton system increases rapidly as one gets closer to the solution. Furthermore, none of these general-purpose solvers use the fact that the solution may have low rank.
Therefore, the first-order methods are used to complete large low-rank matrices by solving (3).
In the special matrix completion setting presented in (3), PΩ(X) is the orthogonal projector onto the span of matrices vanishing outside of Ω. Therefore, the (i,j)th component of PΩ(X) is equal to Xij if (i,j)∈Ω and 0 otherwise. \(X\in \mathbb {R}^{n_{1} \times n_{2} }\) is then the optimization variable. Fix τ>0 and a sequence δk of scalar step sizes. Starting with \({Y}_{0} = 0 (\in \mathbb {R}^{n_{1} \times n_{2} })\), the algorithm defines until a stopping criterion is reached:
$$ \left\{ \begin{array}{l} X_{k}=\text{shrink}(Y_{k-1},\tau) \\ Y_{k}=Y_{k+1} + \delta_{k} P_{\Omega}(M-X_{k}) \end{array}\right. $$
(6)
shrink(x,λ) is a nonlinear function that applies a soft-thresholding rule at level λ to the singular values of the input matrix. The key property here is that for large values of τ, the sequence Xk converges to a solution which very nearly minimizes (3). Hence, at each step, one only needs to compute at most one singular value decomposition and perform a few elementary matrix additions.
2.3 The singular value thresholding algorithm
The most popular approaches to matrix completion in literature are the thresholding methods that can be divided into two groups: one-step thresholding methods and iterative thresholding methods. Despite the strong theoretical guaranties which have been obtained for one-step thresholding procedures, they show poor behavior in practice and only work under the uniform sampling distribution which is not realistic in many practical situations [7]. On the other hand, iterative thresholding methods are well adapted for general nonuniform distribution as well as they show practical performances as in [4]. Authors in [8] proposed a first-order singular value thresholding algorithm SVT which is a key subroutine in many numerical schemes for solving nuclear norm minimization. The conventional approach for SVT is to find the singular value decomposition SVD of the matrix, then to shrink its singular values.
The singular value decomposition step
The singular value shrinkage operator is the key building block of the SVT algorithm. Consider the singular value decomposition SVD of a matrix \(X \in \mathbb {R}^{n_{1} \times n_{2} }\) of rank r
$$ X = U \Sigma V^{*} \;\;\;\; \text{Where} \;\;\;\; \Sigma = diag(\{\sigma_{i}\}_{1\leq i \leq \tau }) $$
(7)
where U and V are respectively n1×r and n2×r matrices with orthonormal columns, and the singular values σi are positive. For each τ≥0, the soft-thresholding operator Dτ is defined as follows:
$$ {\begin{aligned} D_{\tau}(X) \,=\, {UD}_{\tau}\!(\Sigma)\!V^{*} \,=\, U. \left(\!\! \begin{array}{ccccc} (\sigma_{1}- \tau)_{+} & \hfill & \hfill & \hfill & \hfill \\ \hfill & \hfill & \ddots &\hfill & \hfill \\ \hfill & \hfill & \hfill & \hfill & (\sigma_{r} - \tau)_{+} \\ \end{array} \!\!\right).V^{*} \end{aligned}} $$
(8)
where t+ is the positive part of t defined by:
$$ (\sigma_{i} - \tau)_{+} = \max(0,\sigma_{i} - \tau)= \left\{\begin{array}{ll} \sigma_{i} - \tau & \text{if} \;\;\;\;\;\; \sigma_{i} - \tau >0 \\ 0 & \text{otherwise} \end{array}\right. $$
(9)
In other words, in Dτ(X), the singular vectors of X are kept and the singular values are shrinked by the soft-thresholding.
Even though the SVD may not be unique, it is easy to see that the singular value shrinkage operators are well defined. In some sense, this shrinkage operator is a straightforward extension of the soft-thresholding rule for scalars and vectors. In particular, note that if many of the singular values of X are below the threshold τ, the rank of Dτ may be considerably lower than that of X, just like the soft-thresholding rule applied to vectors leads to sparser outputs whenever some entries of the input are below threshold.
The singular value thresholding operator is the proximal operator associated with the nuclear norm. The proximal operator has its origins in convex optimization theory, and it has been widely used for non-smooth convex optimization problems, such as the l1-norm minimization problems arising from compressed sensing [9] and related areas. It is well known that the proximal operator of the l1-norm is the soft-thresholding operator, and soft-thresholding-based algorithms are proposed to solve l1-norm minimization problems [10].
Shrinkage iteration step
The singular value thresholding SVT algorithm approximates the minimization (3) by:
$$ \left\{ \begin{array}{l} \min_{X} \tau \Vert X \Vert_{*} + \frac{1}{2} \Vert X \Vert_{F}^{2} \\ \text{subject to} \;\;\; X_{ij} = M_{ij} \end{array}\right. $$
(10)
with a large parameter τ. ∥.∥F denotes the matrix Frobenius norm or the square root of the summation of squares of all entries. Then, it applies a gradient ascent algorithm to its dual problem. The iteration is:
$$ \left\{ \begin{array}{l} X_{k}=D_{\tau}(Y_{k-1},\tau) \\ Y_{k}=Y_{k+1} + \delta_{k} P_{\Omega}(M-X_{k}) \end{array}\right. $$
(11)
where Dτ is the SVT operator defined as:
$$ D_{\tau} = \arg \min \frac{1}{2} \Vert Y - X \Vert_{F} + \tau \Vert X \Vert_{*} \;\;, X\in\mathbb{R}^{n_{1}\times n_{2}} $$
(12)
The iteration is called the SVT algorithm, and it was shown in [11] to be an efficient algorithm for huge low-rank matrix completion. Two crucial properties make the SVT algorithm suitable for matrix completion.
-
Low-rank property: The matrices Xk turn out to have low rank, and hence, the algorithm has minimum storage requirement since it only needs to keep principal factors in memory.
-
Sparsity: For each k≥0, Yk vanishes outside of Ω and is, therefore, sparse, a fact, which can be used to evaluate the shrink function rapidly.
The SVT algorithm
The initial step of the SVT algorithm is to start with the following:
-
Y0=0;
-
Choosing a large τ to make sure that the solution of (11) is close enough to the solution of (3).
-
Defining k0 as the integer that obeys to: \( \frac {\tau }{\delta \Vert P_{\Omega }(M) \Vert } \in (k_{0}-1,k_{0}) \)
-
Since Y0=0, Xk=0, Yk=kδPΩ(M),k=1,...,k0
The stopping criteria of the SVT algorithm is motivated by the first-order optimality conditions for the minimization of the problem (10). The solution \(X_{\tau }^{*}\) to (11) must verify:
$$ \left\{ \begin{array}{l} X = D_{\tau}(Y) \\ P_{\Omega}(X-M)=0 \end{array}\right. $$
(13)
where Y is a matrix vanishing outside of Ωc. Therefore, to make sure that Xk is close to \(X_{\tau }^{*}\), it is sufficient to check how close (Xk,Yk−1) is obeying (13). By definition, the first equation in (13) is always true. Therefore, it is natural to stop (12) when the error in the second equation is below a specified tolerance:
$$ \frac{\Vert P_{\Omega}(X-M) \Vert_{F}}{\Vert P_{\Omega}(M) \Vert_{F}} \leq \epsilon $$
(14)
The matrix completion problem can be viewed as a special case of the matrix recovery matrix, where one has to recover the missing entries of a matrix, given limited number of known entries.
2.4 Literature review
Other works dressed other algorithms in the attempt to minimize the nuclear norm of low-rank sparse matrix. Authors in [12] presented the fixed-point continuation (FPC) algorithm. It combines the fixed-point continuation [13] with Bregman iteration [14]. The iteration is as follows:
$$ \left\{ \begin{array}{ll} \text{Iterate on } {i} \text{ to get }X_{k} \left\{\begin{array}{ll} X_{i} \,=\, D_{\tau}(Y_{i-1}) \\ Y_{i}\,=\, X_{i-1} \,+\, \delta_{i}P_{\Omega}(M+Z_{k-1}\,-\,X_{i}) \end{array}\right.\\ Z_{k}=Z_{k-1} + P_{\Omega}(M*X_{k}) \end{array}\right. $$
(15)
In fact, the FPC algorithm is a gradient ascent algorithm applied to an augmented Lagrangian of (3). The augmented Lagrangian multiplier (ALM) method in [15] reformulates the problem into:
$$ \left\{ \begin{array}{l} \min_{X} \Vert X \Vert_{*} \\ \text{subject to} \;\; X+E= P_{\Omega}(M), \;\; P_{\Omega}(E)=0, \end{array}\right. $$
(16)
where E is an auxiliary variable. The corresponding (partial) ALM function is:
$$ \begin{aligned} \Gamma(X,E,Y,\mu)&= \Vert X \Vert_{*} + <Y,P_{\Omega}(M)-X-E>\\ &\quad+ \frac{\mu}{2} \Vert P_{\Omega}(M)\!\,-\,\!X\!\,-\,\!E \Vert_{F}^{2} \; \text{with} \; P_{\Omega}(E)\,=\,0 \end{aligned} $$
(17)
An inexact gradient ascent is applied to the ALM and leads to the following algorithm:
$$ \left\{ \begin{array}{l} X_{k} = D_{\mu^{-1}_{k}}\left(P_{\Omega}(M) - E_{k} + \mu^{-1}_{k}Y_{k-1}\right) \\ E_{k} = P_{\Omega^{c}}\left(X_{k}\right)\\ Y_{k} = Y_{k-1} + \mu_{k} P_{\Omega}\left(M-X_{k}\right) \end{array}\right. $$
(18)
For all these algorithms, the SVT operator is the key to make them converge to low-rank matrices.
Just like the FPC and SVT algorithms, the proximal gradient (PG) [16] algorithm for matrix completion needs to compute the SVD at each iteration. It is as simple as the cited algorithms.
There are two main advantages of the SVT algorithm over the FPC and the PG algorithms when the former is applied to solve the problem of matrix completion.
First, in some cases, we dispose a sequence of low-rank iterates; in contrast, so many iterates at the initial phase of the FPC or PG algorithms may not have low rank even though the optimal solution itself has low rank. We observed this behavior when we applied them to solve the problem of matrix completion.
Second, the intermediate matrices generated during the resolution of our problem are sparse due to the sparcity of Ω, the set of observation. This makes the SVT algorithm computationally more attractive. Indeed, the generated matrices by FPC and PG algorithms may not be sparse and specially for the last one.
The first-order methods presented above are the basis of a number of recent works that minimize the nuclear norm of a matrix to recover an image with missing data.
In [17], authors proposed a two-step proximal gradient algorithm to solve nuclear norm regularized least squares for the purpose of recovering low-rank data matrix from sampling of its entries. Each iteration generated by the proposed algorithm is a combination of the latest three points, namely, the previous point, the current iterate, and its proximal gradient point. This algorithm preserves the computational simplicity of classical proximal gradient algorithm [16] where a singular value decomposition in proximal operator is involved. Global convergence is followed directly in the literature.
Authors in [18, 19] adopted the SVT algorithm to achieve the completed matrix but by using the power method [20] instead of using PROPACK [21] for computing the singular value decomposition of large and sparse matrix. They showed that accelerating Soft-Impute is indeed possible while still preserving the “sparse plus low rank” structure. To further reduce the iteration time complexity, instead of computing SVT exactly using PROPACK, they proposed an approximate SVT scheme based on the power method. Though the SVT obtained in each iteration is only approximate, they demonstrated that convergence can still be as fast as performing exact SVT. Hence, the resultant algorithm has low iteration complexity and fast convergence rate. Our objective is to increase the accuracy and the precision of image completion results by adopting unsupervised learning process that takes into account the characteristics of image pixels.
2.5 Nuclear norm minimization-based collaborating filtering for image reconstruction
In the problem of collaborating filtering based on nuclear norm minimization, the goal is to predict entries of an unknown matrix based on a subset of its observed entries. For example in a collaborative prediction movie recommendation system, where the rows of the matrix represent users and columns represent movies, the task is to predict ratings that users gave to movies based on their preferences. The prediction of users’ preferences over movies—they have not yet seen—are then based on patterns in the partially observed rating matrix. The setting can be formalized as a matrix completion problem completing entries in a partially observed data matrix.
In the same analogy for image completion problem, the collaborating filtering setting aims to predict the pixels missing in the image based on the partially observed entries, i.e., pixels in the image. The proposed approach then is based on two main steps:
Clustering defines the optimal partitioning of a given set of N data points into K subgroups. The points belonging to the same group are as similar as much as possible. However, data points from two different groups share the maximum difference.
The first step of our approach is to perform a data filtering. The learning process starts by applying a principal component analysis (PCA) in the attempt to reduce the number of variables and make the information less redundant. As a result, our data are centered. To detect the pixels’ clusters, the process adopts a bi-clustering step founded on prototype-based clustering by using the K-means algorithm on the principal component scores, that is, the representation of the data matrix in the principal component space and its correlation matrix.
The second process takes place to predict the missing pixels using the clusters, which performs a new framework for predicting the missing pixels. The clustering phase regroups automatically the pixels of an image into different homogeneous regions. These homogeneous regions usually contain similar objects or part of them. As a result, interesting performance will be achieved in the prediction step.
For a given point in the image, we identify clusters in which the selected pixel row index, respectively the column index, belongs. The predicted value is the result of singular value thresholding (SVT) algorithm applied on the matrix containing values of pixels existing in the intersection between the two clusters found in step 1. The adopted algorithm takes as parameters three mandatory elements:
The set of locations corresponding to the observed entries Ω might be defined in three forms:
-
The first one as a sparse matrix where only the elements different of 0 are to take into account.
-
The second one as a linear vector that contains the position of the observed elements.
-
The third one where Ω is specified as indices (i,j) with \((i,j)\in \mathbb {N}\).
The application of the proposed algorithm in image completion procures in some cases certain results that are out of range. In this case, we propose to use a median filtering on the predicted pixels. The median filter is often used as a typical pre-processing step to improve the result of later process in signal processing (for example, edge detection on an image). The idea is to use it as a final process to replace each entry (here, entries are the predicted pixels) with the median of neighboring entries, which performs a good result in image reconstruction as shown in the experimental results.
The result of our proposed approach is a completed data matrix that contains all the pixels’ values. The goal of the proposed approach is to predict the missing pixels in the image matrix. Our learning process detects the partitions of pixels’ indices where the predicting process exploits the clusters found to predict the missing value. It works on the assumption that pixels in the same cluster share almost the same characteristics in the image.