Tensor Recovery from Noisy and Multi-Level Quantized Measurements

Higher-order tensors can represent scores in a rating system, frames in a video, and images of the same subject. In practice, the measurements are often highly quantized due to the sampling strategies or the quality of devices. Existing works on tensor recovery have focused on data losses and random noises. Only a few works consider tensor recovery from quantized measurements but are restricted to binary measurements. This paper, for the first time, addresses the problem of tensor recovery from multi-level quantized measurements. Leveraging the low-rank property of the tensor, this paper proposes a nonconvex optimization problem for tensor recovery. We provide a theoretical upper bound of the recovery error, which diminishes to zero when the sizes of dimensions increase to infinity. Our error bound significantly improves over the existing results in one-bit tensor recovery and quantized matrix recovery. A tensor-based alternating proximal gradient descent algorithm with a convergence guarantee is proposed to solve the nonconvex problem. Our recovery method can handle data losses and do not need the information of the quantization rule. The method is validated on synthetic data, image datasets, and music recommender datasets.


I. INTRODUCTION
Many practical datasets are highly noisy and quantized, and recovering the actual values from quantized measurements finds applications in different domains. For example, users' preferences in rating systems are represented by a few scores (or even two scores in one bit [12]), which do not provide accurate characterization of preferences. Due to sensor issues or communication restrictions, images and videos in some applications may have very low resolution [33]. Quantization is applied to enhance the data privacy in power systems and sensor networks [10], [15], [27]. It is important to develop computationally efficient and reliable methods to recover the actual data from low-resolution measurements. [24] estimates the data from one-bit measurements by linearizing the nonlinear quantizer. [19] leverages the deep learning tool to recover the data. These approaches either require accurate parameter estimation or have high computational costs. [26], [30], [37] recover data from a small number of quantized measurements, but the methods only apply to sparse signals. Low-rank matrices can characterize the intrinsic data correlations in user ratings, images, and videos [5], [39], and the low-rank property has been exploited to recover the data from quantized measurements by solving a nonconvex R. Wang  constrained maximum likelihood estimation problem [4], [8], [12], [15]. For an n × n rank-r matrix (r n), the best achievable recovery error from quantized measurements is O( r 3 n ) 1 [4], [15]. The recovery error diminishes to zero when the data size increases.
Practical datasets may contain additional correlations that cannot be captured by low-rank matrices. For instance, if an image or a frame of a video is vectorized, the spatial correlation is no longer preserved [14]. In recommendation systems, users ratings against objects vary under different contexts [3], and a matrix presentation is not sufficient to characterize the structure. That motivates the usage of lowrank tensors, where a higher-order tensor contains data arrays with at least three dimensions. Tensors can represent threedimensional objects in generic object recognition [29], engagements on advertisements over time for behavior analysis [7], gene expressions in the development process [23], etc. Moreover, tensor techniques are widely used in deep learning [11], [25].
Low-rank tensors with quantization noise exist in hyperspectral data [1], [22], rating systems [17], and the knowledge predicates [40]. Existing works on low-rank tensor recovery mainly consider random noise or sparse noise [9], [28], [38], while only a few works [1], [17], [22] consider tensor recovery from one-bit measurements. [1] unfolds the tensors to matrices and applies matrix recovery techniques. Ref. [17] recovers the tensor by solving a convex optimization problem and shows its recovery error is O(( r 3K−3 K n K−1 ) 1/4 ), where K is the number of tensor dimensions, and n is the size per dimension. [22] focuses on the case when a significant percentage of measurements are lost. This paper for the first time studies low-rank tensor recovery from multi-level quantized measurements, while the existing work [1], [17], [22] only consider one-bit measurements. We formulate the tensor recovery problem as a nonconvex optimization problem and proves that the recovery error with full observations under the known quantization rule is at most , which decays to zero much faster than any existing results. Moreover, we develop a computationally efficient algorithm to solve the nonconvex data recovery problem and prove that even with partial data losses, our algorithm converges to a critical point from any initialization with at least sublinear convergence rate. Lastly, all the existing work on recovery from quantized measurements assumes that the quantization rule is known to the recovery method except one low-rank matrix recovery work [4]. We empirically extend our method to recover the tensor from quantized measurements without directly knowing the quantization rule and demonstrate encouraging numerical results. This paper is organized as follows. The problem formulation is introduced in Section II. Section III discusses our approach and its recovery error. An efficient algorithm with the convergence guarantee is proposed in Section IV. Section V records the numerical results. Section VI concludes the paper. All the lemmas and proofs can be found in Appendix.

A. Notation and preliminaries
We use boldface capital letters to denote matrices (twodimensional tensors), e.g., A. Higher order tensors (three or higher dimensions) are denoted by capital calligraphic letters, e.g., X . X ∈ R n1×n2×···×n K represents a Kdimensional tensor with the size of the i-th dimension equaling to is the mode-k matricization of X , which is formed by unfolding X along its k-th dimension. Let A p to represent the Khatri-Rao product [20] of A k ∈ R n k ×r , A p ∈ R np×r . We have The Frobenius norm of the tensor X is defined as

II. PROBLEM FORMULATION
Let X * ∈ R n1×n2×···×n K denote the actual data that are represented by a K-dimensional tensor. Let · ∞ denote the entry-wise infinity norm. We assume that the maximum value of X * is bounded by a positive constant α, i.e., X * ∞ ≤ α. We further assume rank(X * ) ≤ r.
Each entry of X * is mapped to one of a few possible values with certain probabilities through the quantization process. To model this probabilistic mapping, let N ∈ R n1×n2×···×n K denote a noise tensor with i.i.d. entries drawn from a known cumulative distribution function Φ(x). Given the quantization boundaries ω * 0 < ω * 1 < · · · < ω * W , the noisy data X * i1,i2,...,i K + N i1,i2,...,i K (i j ∈ [n j ], j ∈ [K]) can be quantized to W values based on the following rule, where Q is an operator that maps a real value to one of W values. We choose ω * 0 = −∞ and ω * W = ∞. Y i1,i2,...,i K is the (i 1 , i 2 , . . . , i K )-th entry of the quantized measurements Y ∈ [W ] n1×n2×···×n K . When W = 2, Y reduces to the one-bit case [17]. In general, Y is a log 2 W -bit tensor. The quantization process of a three-dimensional tensor is visualized in Fig. 1. and The probability description (3) follows from the same formula as those in [4], [15], except that the entries are from a higher order tensor. We assume Φ(x) is monotonously increasing. The monotonously increasing property holds for the cumulative distribution functions of many distributions. Examples include: (1) Probit model with Φ(x) = Φ norm (x/σ), where Φ norm is the cumulative distribution function of a standard Gaussian distribution; (2) Logistic model with Φ(x) = Φ log (x/σ) = 1 1+e −x/σ . We also consider the general setup that there exists missing data in the measurements, i.e., only measurements with indices belonging to the observation set Ω are available, while all the other measurements are lost. The question we will address in this paper is as follows. Given the partial observations Y Ω and the noise distribution Φ, how can we estimate the original tensor X * ?
We remark that this problem formulation can be applied in different domains. In the user voting systems, data can be represented as {users × scoring objects × contexts} [3], which is a three-dimensional tensor. The scores from the reviewers are highly quantized [12]. By solving the quantized tensor recovery problem, one can obtain the actual preferences of the reviewers. In video processing, the measurements can be represented as {rows of a frame × columns of a frame × different frames}. The measurements can be highly quantized due to the sensing process and the objective is to recover the data [2], [36]. A similar idea also applies to low-quality image recovery [14], [33]. Images from the same subject can be represented by {rows of an image × columns of an image × different images}.

MEASUREMENTS
We propose to estimate tensor X * , boundaries ω * 1 , ω * 2 , · · · , ω * W −1 using a constrained maximum likelihood approach. The negative log-likelihood function is given by where 1 [B] is an indicator function that takes value '1' if the event B is true and value '0' otherwise. |Ω| denotes the cardinality of Ω. (4) is a convex function when f l is a logconcave function. When ω * l 's are unknown, we estimate X * , where Most existing work on quantized data recovery consider the special case that the quantization boundaries are known [12], [15], [17] only except for [4]. In this case, (5) can be simplified tô We remark that both (5)-(6) and (7)-(8) are nonconvex problems since S f ω and S f are nonconvex sets.
[17] studies the case with known bin boundaries in (7)- (8). It focuses on the special case that W = 2 and relaxes the lowrank constraint in S f with a convex M-norm constraint. [4], [15] consider minimizing a negative log-likelihood function subject to a low-rank constraint, which is similar to (7), but are restricted to quantized matrix recovery.
Similar to [4], [12], we first define two constants γ α and L α for analysis in the case boundaries are all known constants. For simplicity, we denote f l (x, ω * l−1 , ω * l ) by f l (x).
whereḟ l andf l are the first and second order derivatives of f l .
We also remark that L α and γ α are bounded by some fixed constants when both α and f l are given. Taking the logistic model as an example [4], [15], we have where L α and γ α depend on σ and W . It is also easy to check that γ α , L α > 0 from (10). We next state our main result that characterizes the recovery error when there are no data losses and the quantization boundaries are known, i.e., the accuracy of the solution to (7) when Ω is the full observation set.
l 's are given, and Ω contains all the indices. X * ∈ S f , and f l (x) is strictly log-concave in x, ∀l ∈ [W ]. Then, with probability at least 1−δ, δ ∈ [0, 1], any global minimizerX of (7) satisfies where Theorem 1 establishes the upper bound of the recovery error when the measurements are noisy and quantized. L α , δ, γ α are all constants. Specifically, when n 1 , n 2 , . . . , n K are all in the order of n, The right-hand-side of (13) diminishes to zero when n increases to infinity. Note that the Frobenius norm of X * is in the same order of √ n 1 n 2 ...n K . By dividing the actual error by √ n 1 n 2 ...n K , we have that the left-hand side of (13) is in the same order of the relative error X − X * F / X * , which is a commonly used normalized error measure. Therefore, the relative recovery error is sufficiently close to zero when the size of the tensor is large enough.
Note that the recovery error depends on W implicitly because W affects L α and γ α . It might seem counter-intuitive that the recovery error is not a monotone function of W . That is because we consider all the possible selections of bin boundaries for a given W when computing L α and γ α . A larger W does not necessarily lead to more information in the quantized measurements. For example, if two bin boundaries are very close to each other, almost no data would be mapped to this bin, and the effective number of quantization levels is less than W (think of the extreme case when ω 1 = ω W −1 ). This is why W does not appear directly in the recovery bound. Of course in practice, in most cases, larger W (more bits) will provide us more information about the real data, and thus increase the performance. a) Recovery enhancement over the existing work on onebit tensor recovery.: Tensor recovery from one-bit measurements has been studied in [17]. [17] relaxes the nonconvex low-rank constraint with a convex M-norm constraint, and the resulting recovery method has an error bound of O(( r 3K−3 K n K−1 ) 1/4 ). In contrast, our recovery error bound decays to zero faster than the approach in [17] for any K ≥ 2. For example, the recovery error bound in (13) is O( r n ) when K = 3, while the bound is O(( r 3/2 n 1/2 )) in [17]. b) Recovery enhancement over quantized matrix recovery.: Tensor X can be unfolded to its mode-k matricization X (k) along the k-th dimension. When the size of each dimension is Θ(n), the sizes of the two dimensions of X (k) are Θ(n) and Θ(n K−1 ), respectively. Thus, we can compare the recovery error in (13) with the results obtained by applying quantized matrix recovery methods on X (k) . [4], [12], [15] provide the theoretical analyses of matrix recovery from quantized measurements. The recovery error are in the order of [4] and [12], respectively, wherer is the rank of the matrix, andN is the smallest size in the two dimensions. Here in X (k) ,r is smaller or equal to r, andN is Θ(n). The best existing bound of quantized matrix recovery is O( r N ) [15], which means an error bound of O( r n ) if we unfold the tensor to X (k) . Note that the error order in (13) has a power of K − 1 in its denominator. For example, the recovery error is O( r n ) by (13) when K = 3. Sincer, r n, the recovery error of (13) decays to zero faster than O( r n ) by [15] for the matricization case when K ≥ 3.
c) Reduction to the matrix case.: When reduced to the matrix case, i.e., K = 2, (13) shows that the quantized matrix recovery has an error bound of O( r √ n ), which is close to the smallest error bound O( r n ) [15]. The difference √ r comes from a technical nuclear norm relaxation in the proof and can be ignored when r is a constant.

IV. ALTERNATING PROXIMAL GRADIENT DESCENT BASED ON TENSORS
In this section, we develop a fast algorithm named Tensorbased Alternating Proximal Gradient Descent (TAPGD) to solve the nonconvex problem (5) with the convergence guarantee. Since in the objective, where λ is a positive constant. The equality constraint holds when λ goes to infinity. Note that X = A 1 •A 2 •· · ·•A K is in the form of CANDECOMP/PARAFAC (CP) decomposition [18]. Unlike matrix decomposition and the other major tensor decomposition method (Tucker decomposition [32]), CP decomposition has a very weak requirement for the uniqueness of tensor factors. A sufficient condition for CP decomposition to be unique is that the summation of independent column numbers in A k , k = 1, 2, · · · , K is larger or equal to 2r + K − 1, which often holds true. In contrast, Tucker decomposition is generally not unique, and is usually computationally expensive to update its core tensor.
We revise S f ω to add constraints that quantization boundaries shall not be too close to avoid trivial solutions in practice. The resulting feasible set is where κ l , ∀l ∈ {2, 3, · · · , W − 1} are some positive numbers that can be chosen using hyperparameter tuning or simply set as small positive constants, and κ 1 = κ W = 0. α low , α upper are two constants that provide the lower and upper bound of the boundaries, which could be chosen as −α and α, or estimates computed in different applications. The revised problem of (5) is shown as follows is transformed by the constraints on ω l in S ω . Let Then we solve (15) using proximal gradient method [6]. The main steps of the proximal gradient method include updating , ω l , l ∈ [W − 1] by using the gradient descent method on H, and projecting the result to S ω . Since for ∀k ∈ the partial gradients of H with respect to A k and X can be calculated by where Otherwise, for any (i 1 , i 2 , · · · , i K ) / ∈ Ω ∇ X F Ω (X , ω 1 , ω 2 , · · · , ω W −1 ) i1,i2,...,i K = 0.
The algorithm is initialized by first estimating ω * l 's according to the applications or simply setting ω 0 l = 2αl W − α if no information is available, and then setting are obtained through the decomposition of X 0 . The details of TAPGD is shown in Algorithm 1. Note that when the quantization boundaries ω * l 's are known, TAPGD can be revised easily by removing steps 14 -20 from Algorithm 1.
To improve the recovery performance, one can multiple λ by a small constant larger than one in each iteration. This provides a better numerical result than fixing λ in all iterations. The complexity of TAPGD in each iteration is O(Krn 1 n 2 . . . n K ). The convergence of TAPGD is summarized in Theorem 2.
to a critical point of (15) from any initial point, and the convergence rate is at least O(t θ−1 2θ−1 ), for some θ ∈ ( 1 2 , 1). Theorem 2 indicates a sublinear convergence of TAPGD. One way to satisfy the requirement of bounded sequence is to scale the factorized variables so that A 1 F = A 2 F = · · · = A K F after each iteration. We find TAPGD performs well numerically without the additional steps.

V. EXPERIMENTS
We conduct simulations on synthetic data, image data, and data from an in-car music recommender system [3] in this section. The recovery performance is measured by X * −X 2 F / X * 2 F , whereX is our estimation of X * . K = 3 in both tests on synthetic data and real data. We set T = 200. All the results are averaged over 100 runs. The simulations are run in MATLAB on a 3.4GHz Intel Core i7 computer.

A. Synthetic data
A rank-r, three-dimensional tensor is generated as follows. We first generate A 1 ∈ R n1×r with entries sampled independently from a uniform distribution in [−0.5, 0.5], A 2 ∈ R n2×r , and A 3 ∈ R n3×r with each entry sampled independently from a uniform distribution in [0, 1]. Then we obtain the tensor by calculating A 1 • A 2 • A 3 and scaling all the values to [−1, 1]. The entries of N are i.i.d. generated from the Gaussian distribution with mean 0 and the standard deviation of 0.25. We choose W = 2 (one-bit) and 4 (2-bit) in our experiments. Fig. 2 compares TAPGD with M-norm constrained one-bit tensor recovery (MNC-1bit-TR) method [17] and the quantized matrix recovery method [15]. We remark that MNC-1bit-TR can only deal with one-bit measurements and requires solving a convex optimization problem. We vary one of the rank, dimension, noise level while fixing other parameters. n1 = n2 = n3 = 120 when we only vary rank and noise level. Fig. 2 demonstrates that the relative recovery error increases when rank increases or dimension decreases. The results are consistent with the theoretical analysis in Section III. The results also show that TAPGD has the best performance among all these methods. Moreover, performance improves when the number of bits increases. When the noise level increases, the relative recovery error first decreases, and then increases. The reason behind this is that the noise is considered as part of the quantization process, and plays the role of adding measurement uncertainty. The problem without noise (measurement uncertainty) is ill-posed in the sense that there may be an infinite number of solutions. Large error under low noise corresponds to the case that the observations are nearly noise-free (especially for lower bits). The same behavior exists in the 2-bit curve when the noise level is smaller than 0.08. Large error under high noise comes from the mask of the high randomness. From Fig. 3, a low noise level does not necessarily mean low recovery error, and vice versa.   Fig. 4 (a) compares TAPGD with MNC-1bit-TR, the quantized matrix recovery method, and a nonconvex low-rank tensor recovery method named tensor completion by parallel matrix factorization (TMac) [35]. Note that MNC-1bit-TR models the quantization process like our approach, while TMac does not model quantization and treats the data as general noisy measurements. It shows that the relative recovery error decreases when the percentage of the observation increases, and TAPGD obtains the best performance among all the methods. In Fig. 5, we show a box-plot-diagram of relative recovery error with 100 runs obtained by TAPGD. All the setups are the same as the scenario W = 3 in Fig. 4 (a). The tops and bottoms of each "box" are the 25th and 75th percentiles of the samples respectively. The maximum standard deviation happens when the observation rate is 0.3, which equals to 8.79 × 10 −4 . The relative standard deviation, which is defined as the ratio of the standard deviation to the mean, reaches its maximum value 0.028 when the observation rate is 0.6. Fig. 6 compares the time cost of TAPGD and MNC-1bit-TR [17] when the number of facial images changes. TAPGD is three magnitudes faster than MNC-1bit-TR. Fig. 7 visualizes the quantized and recovered images by TAPGD.

C. In-car music recommender dataset
Many recommender systems' ratings from users are highly quantized (such as like or dislike) with many missing entries (e.g., users don't give rating for all subjects), while the underlying systems may want to recover real-valued user ratings. Our method can be used to recover the true underlying real-valued user preferences, thus improving the quality of recommendations. We apply our method to an in-car music recommender dataset [3]. The recommender dataset contains 139 songs with 4012 ratings from 42 users. This dataset has 26 contexts that including relaxed driving, country side, happy, sleepy, etc. The same user may rate different scores to the same song under different contexts. 2751 ratings have the corresponding context information while the rest 1261 ratings do not have context information. We only use the ratings with context information. An example of three ratings is shown in Table I. We construct the resulting tensor M as  ratings are quantized to 0, 1, 2, 3, 4, 5. All the locations without ratings are set to be zero. The observation rate is 1.81% in the tensor. We then randomly set 0.362% of the data (20% of the observed data) to be zero and let Ω predict denote the set of the indices. We predict data with indices belong to Ω predict using the rest 1.448% of the data (80% of the observed data), which are observed. In this test, we define the relative recovery error as whereM is our estimation of the ground truth, andM maps the values inM to their nearest quantized values. The reason for the occurrence of 5 at denominator is that the maximum difference betweenM and M is 5. The error increases when the difference increases. Ref. [17] also studies on the same dataset and first maps the multi-level quantized values to binary values. It then deletes some binary values and evaluates the recovery error. The smallest recovery error is 0.23 by their method. We remark that the multi-level prediction is harder than binary prediction in this application, since the binary case is to choose one out of two numbers, while the multi-level case is to choose one out of W > 2 numbers.
Here we estimate the rank r and the noise level σ, since we do not know the actual rank and noise. In Algorithm 1, we choose the estimated rank r from the set {5, 10, 15, 20, 25}, and choose the estimated standard variation σ from the set {0.001, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25}. The recovery results are shown in Fig. 8. The relative recovery error reaches its smallest value when r = 5 and σ = 0.05, and the smallest relative recovery error is 0.22. This paper recovers a low-rank tensor from quantized measurements. A constrained maximum log-likelihood problem is proposed to estimate the ground truth tensor. The recovery error is proved to be at most O( ) when boundaries are known. This error bound is significantly smaller than the errors bounds of the existing methods for one-bit tensor recovery and low-rank matrix recovery from quantized measurements. We also propose an efficient method Tensor-Based Alternating Proximal Gradient Descent (TAPGD) to solve the nonconvex optimization problem. TAPGD is guaranteed to converge to a critical point from any initial point. TAPGD handles missing data and does not require information of the quantization rule. The method is evaluated on synthetic data, the Extend Yale Face Dataset B, and the in-car music recommender dataset. Future works include data recovery when partial measurements contain significant errors.
Proof: The proof is completed by combining Lemma 1 and Theorem 1 in [31].
where r k , k ∈ [K] is the k-rank of the tensor X , which is defined as the column rank of X (k) . The tensor nuclear norm X * is defined as The details of the property (34) can be viewed in Theorem 9.4 of [13]. Note that r k ≤ r, ∀k ∈ [K], since X (k) = A k (A K · · · A k+1 A k−1 . . . A 1 ) T . Therefore, where the last inequality holds because · * ≤ √ r · F for any matrix. We then have n k ) log(4K/3) + log(2/δ)) X − X * F (37) holds with probability at least 1 − δ. Then, holds with probability at least 1 − δ. The second inequality comes from the fact | A, B | ≤ A B * for two tensors A and B. We then have the desired result. Lemma 3 provides a lower bound on the second-order term of the second-order Taylor expansion, which is also related to X − X * F . Lemma 3. Let θ = vec(X ), θ * = vec(X * ), and X , X * ∈ S f . Then for anyθ = θ * + η(θ − θ * ) and any η ∈ [0, 1], we have θ − θ * , (∇ 2 θθ F (θ))(θ − θ * ) ≥ γ α X − X * 2 F . (38) Proof: Lemma 3 is an extension of Lemma 7 in [4]. Using (31), it follows that Then we have where the first inequality comes from the fact that B. Proof of Theorem 1 Proof: The first bound follows from the fact that X , X * ∈ S f . We have Letθ = vec(X ) and F(θ) = F (X ). By the second-order Taylor's theorem, we have whereθ = θ * + η(θ − θ * ) for some η ∈ [0, 1], with the corresponding tensorX = X * + η(X − X * ).
Using the results of Lemma 2 and Lemma 3, we can obtain that (41) holds with probability at least 1 − δ. Note thatX is the global optimal of the optimization problem. Thus, F (X ) ≤ F (X * ). We then have n k ) log(4K/3) + log(2/δ)) X − X * F (42) holds with probability at least 1 − δ. Thus, holds with the same probability 1 − δ, where (44) Combining (39) and (43), we have and this complete the proof.