
Tensor recovery from noisy and multi-level quantized measurements

Abstract

Higher-order tensors can represent scores in a rating system, frames in a video, and images of the same subject. In practice, the measurements are often highly quantized due to the sampling strategies or the quality of devices. Existing works on tensor recovery have focused on data losses and random noises. Only a few works consider tensor recovery from quantized measurements, and they are restricted to binary measurements. This paper, for the first time, addresses the problem of tensor recovery from multi-level quantized measurements by leveraging the low CANDECOMP/PARAFAC (CP) rank property. We study the recovery of both general low-rank tensors and tensors that have tensor singular value decomposition (TSVD) by solving nonconvex optimization problems. We provide the theoretical upper bounds of the recovery error, which diminish to zero when the sizes of the dimensions increase to infinity. We further characterize the fundamental limit of any recovery algorithm and show that our recovery error is nearly order-wise optimal. A tensor-based alternating proximal gradient descent algorithm with a convergence guarantee and a TSVD-based projected gradient descent algorithm are proposed to solve the nonconvex problems. Our recovery methods can also handle data losses and do not necessarily require knowledge of the quantization rule. The methods are validated on synthetic data, image datasets, and a music recommender dataset.

1 Introduction

Many practical datasets are highly noisy and quantized, and recovering the actual values from quantized measurements finds applications in different domains. For example, users’ preferences in rating systems are represented by a few scores (or even two scores in 1 bit [1]), which do not provide accurate characterizations of preferences. Due to sensor issues or communication restrictions, images and videos in some applications may have very low resolution [2]. Quantization is applied to enhance the data privacy in power systems and sensor networks [3–5]. It is important to develop computationally efficient and reliable methods to recover the actual data from low-resolution measurements.

Li et al. [6] estimate the data from 1-bit measurements by linearizing the nonlinear quantizer. Khobahi et al. [7] leverage the deep learning tool to recover the data. These approaches either require accurate parameter estimation or have high computational costs. Some other works recover data from a small number of quantized measurements, but the methods only apply to sparse signals [8–10]. Low-rank matrices can characterize the intrinsic data correlations in user ratings, images, and videos [11, 12], and the low-rank property has been exploited to recover the data from quantized measurements by solving a nonconvex constrained maximum likelihood estimation problem [1, 4, 13, 14]. For an n×n rank-r matrix (r≪n), the best achievable recovery error from quantized measurements is \(O\left (\sqrt {\frac {r^{3}}{n}}\right)\) [4, 13]. The recovery error diminishes to zero when the data size increases.

Practical datasets may contain additional correlations that cannot be captured by low-rank matrices. For instance, if every frame of a video is vectorized so that the video is represented by a matrix, the spatial correlation is not directly characterized by low-rank matrices [15]. In recommendation systems, users’ ratings of objects vary under different contexts [16], and assuming the rating matrix is low-rank does not fully characterize the dependence of ratings on the contexts. This motivates the use of low-rank tensors, where higher-order tensors are data arrays with at least three dimensions.

Tensors can represent three-dimensional objects in generic object recognition [17], engagements on advertisements over time for behavior analysis [18], gene expressions in the development process [19], etc. Moreover, tensor techniques are widely used in deep learning [20, 21]. Unlike matrices, there are different rank definitions for higher-order tensors, such as CANDECOMP/PARAFAC (CP) rank [22, 23] and Tucker rank [24]. Since there exist correlations in practical datasets such as images and user ratings, the resulting tensor data is often low-rank. The low-rank property has been exploited in problems like low-rank tensor completion [25–30] and low-rank tensor recovery [31–35]. Leveraging the low-rank property, a convex relaxation of the Tucker rank can be applied to robust tensor recovery [31, 32] and tensor completion [25, 26]. CP rank-based decompositions have also proved effective in these tasks [27, 28]. Besides the low-Tucker-rank and low-CP-rank models, other tensor rank forms are used in the literature. To address the unbalanced matricization scheme of the Tucker rank, the tensor train rank has been proposed for tensor recovery and completion [29, 33]. Some works also leverage another rank form called the tubal multi-rank and its convex surrogate, the tensor nuclear norm, as tools for tensor-related tasks [34, 35]. This paper concerns only recovery under the CP rank. It is more challenging to analyze tensors than matrices because some matrix properties do not extend to higher-order tensors. For example, in general, the best low-rank approximation to a tensor does not always exist [36, 37]. A couple of works have focused on special tensors that have the tensor singular value decomposition (TSVD) (a.k.a. completely orthogonal tensors) [38, 39], which is a direct generalization of the matrix singular value decomposition. The significance of tensors with TSVD is that many matrix properties are preserved. For example, the best orthogonal low-rank tensor approximation always exists [38, 40], and the CP rank of a tensor having TSVD equals the number of its singular values. We will refer to tensors having TSVD as SVD-tensors.

Low-rank tensors with quantization noise exist in hyper-spectral data [41, 42], rating systems [43], and knowledge predicates [44]. Existing works on low-rank tensor recovery mainly consider random noise or sparse noise [45–47], while only a few works [41–43] consider tensor recovery from 1-bit measurements, i.e., all measurements are binary. Aidini et al. [41] introduce a 1-bit tensor completion method that first unfolds the tensor measurements to matrices along all dimensions and then applies matrix recovery techniques to each matrix. The final estimation is a real-valued tensor folded from the weighted sum of the recovered matrices. Ghadermarzy et al. [43] use a tensor M-norm constraint to replace the exact low-rank constraint and then recover the tensor by solving the convex optimization problem. The recovery error is guaranteed to be \(O((\frac {r^{3K-3}K}{n^{K-1}})^{1/4})\), where K is the number of tensor dimensions, and n is the size per dimension. Li et al. [42] focus on three-dimensional tensors and the scenario where a significant percentage of measurements are lost. The recovery is based on minimizing the convex surrogate of the tubal multi-rank.

This paper, for the first time, studies the problem of low-rank tensor recovery from multi-level quantized measurements, while the existing works [41–43] only consider 1-bit measurements. This paper is also the first to study the recovery of SVD-tensors from quantized measurements. We formulate the tensor recovery problems as constrained nonconvex optimization problems. When there is no missing data and the quantization rule is known, we prove that the recovery error of a K-dimensional tensor with CP rank r is at most \(O(\sqrt {\frac {r^{K-1}K\log K}{n^{K-1}}})\), where n is the length of each tensor dimension. The error bound decays to zero much faster than any existing result. Moreover, we prove that if the tensor is a SVD-tensor, then the recovery error is reduced to at most \(O(\sqrt {\frac {rK\log K}{n^{K-1}}})\). We also prove that the recovery error of low-CP-rank tensors by any algorithm cannot be smaller than the order of \(\sqrt {\frac {r}{n^{K-1}}}\). Our method is thus close to optimal for small r. We further develop computationally efficient algorithms to solve the nonconvex problems for recovering low-CP-rank tensors and tensors with TSVD. We prove that even with partial data losses, our proposed low-CP-rank tensor recovery algorithm converges to a critical point of the nonconvex problem from any initialization with at least a sublinear convergence rate. Lastly, all the existing works on tensor recovery from quantized measurements, with the exception of one low-rank matrix recovery work [13], assume that the quantization rule is known to the recovery method. We empirically extend our methods to recover the tensor from quantized measurements when the quantization rule is unknown and demonstrate encouraging numerical results.

This paper is organized as follows. The problem formulation is introduced in Section 2. Section 3 discusses our approach and its recovery error. An efficient algorithm with a convergence guarantee is proposed in Section 4.1. The recovery algorithm for SVD-tensors is proposed in Section 4.2. Section 5 presents the numerical results. Section 6 concludes the paper. All the lemmas and proofs can be found in the Appendices (see Appendices 1, 2, 3, 4, 5, and 6).

1.1 Notation and preliminaries

We use boldface capital letters to denote matrices (two-dimensional tensors), e.g., A. Higher-order tensors (three or more dimensions) are denoted by capital calligraphic letters, e.g., \(\mathcal {X}\). \(\mathcal {X} \in \mathbb {R}^{n_{1} \times n_{2} \times \dots \times n_{K}}\) represents a K-dimensional tensor with the size of the i-th dimension equal to \(n_{i}\), \(\forall i \in [K]\), where \([K] = \{1,2,\dots,K\}\). \(\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}\) denotes the \((i_{1},i_{2},\dots,i_{K})\)-th entry of \(\mathcal {X}\). \(\mathbf {X}_{(k)} \in \mathbb {R}^{n_{k} \times (n_{1} \dots n_{k-1} n_{k+1} \dots n_{K})}\) is the mode-k matricization of \(\mathcal {X}\), which is formed by unfolding \(\mathcal {X}\) along its k-th dimension. The Frobenius norm of the tensor \(\mathcal {X}\) is defined as \(\|\mathcal {X}\|_{F} = \sqrt {\sum _{i_{1} = 1}^{n_{1}}\sum _{i_{2} = 1}^{n_{2}}\dots \sum _{i_{K} = 1}^{n_{K}}\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}^{2}}\).

Let \(a_{i} \in \mathbb {R}^{n_{i}}, \forall i \in [K]\) be K vectors. Then, \(\mathcal {A} = a_{1} \circ \dots \circ a_{K}\) is a K-dimensional tensor with \(\mathcal {A}_{i_{1},i_{2},\dots,i_{K}} = {a_{1}}_{i_{1}} {a_{2}}_{i_{2}} \dots {a_{K}}_{i_{K}}\). Here, \(\circ\) is called the outer product. The CP rank of \(\mathcal {X}\) [22, 23] is defined as

$$ \begin{aligned} &\text{rank}(\mathcal{X}) = \min\{R: \mathcal{X} = \\&\sum_{i=1}^{R} \mathbf{A_{1}}_{i} \circ \mathbf{A_{2}}_{i} \circ \dots \circ \mathbf{A_{K}}_{i}, \mathbf{A_{k}} \in \mathbb{R}^{n_{k} \times R}, k \in [K]\}, \end{aligned} $$
(1)

where \(\mathbf {A_{k}}_{i}\) is the i-th column of \(\mathbf {A_{k}}\). \(\mathbf {A_{1}} \circ \mathbf {A_{2}} \circ \dots \circ \mathbf {A_{K}}\) is equivalent to \(\sum _{i=1}^{R} \mathbf {A_{1}}_{i} \circ \mathbf {A_{2}}_{i} \circ \dots \circ \mathbf {A_{K}}_{i}\). Note that the CP rank can differ over different fields, e.g., the real and complex fields. We remark that the results in this paper can be easily generalized to different fields. We use \(\mathbf {A_{k}} \odot \mathbf {A_{p}}\) to represent the Khatri-Rao product [48] of \(\mathbf {A_{k}} \in \mathbb {R}^{n_{k} \times r}, \mathbf {A_{p}} \in \mathbb {R}^{n_{p} \times r}\). We have \(\mathbf {A_{k}} \odot \mathbf {A_{p}} = [\mathbf {A_{k}}_{1} \bigotimes \mathbf {A_{p}}_{1}, \mathbf {A_{k}}_{2} \bigotimes \mathbf {A_{p}}_{2}, \dots, \mathbf {A_{k}}_{r} \bigotimes \mathbf {A_{p}}_{r}]\), where \(\mathbf {A_{k}}_{i} \bigotimes \mathbf {A_{p}}_{i} = [(\mathbf {A_{k}}_{i})_{1}\mathbf {A_{p}}_{i}^{T},(\mathbf {A_{k}}_{i})_{2}\mathbf {A_{p}}_{i}^{T}, \dots,(\mathbf {A_{k}}_{i})_{n_{k}}\mathbf {A_{p}}_{i}^{T}]^{T} \in \mathbb {R}^{n_{k}n_{p} \times 1},\forall i \in [r]\).
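For concreteness (this example is ours, not part of the original text), the Khatri-Rao product can be formed column by column as the Kronecker product of corresponding columns; a minimal Python/NumPy sketch with arbitrary matrix sizes:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker (Khatri-Rao) product of A (n_k x r) and B (n_p x r)."""
    assert A.shape[1] == B.shape[1], "both factors need the same number of columns"
    # each column of the result is kron(A[:, i], B[:, i]), of length n_k * n_p
    return np.vstack([np.kron(A[:, i], B[:, i]) for i in range(A.shape[1])]).T

# small example with arbitrary sizes
A_k = np.random.randn(4, 3)   # n_k = 4, r = 3
A_p = np.random.randn(5, 3)   # n_p = 5, r = 3
print(khatri_rao(A_k, A_p).shape)  # (20, 3), i.e., (n_k * n_p, r)
```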

We define the set of tensors that have tensor singular value decomposition (TSVD) as follows

$$ \begin{aligned} &\mathcal{S}_{\text{tsvd}}:=\{\mathcal{X}|\mathcal{X} = \sum_{i=1}^{R} \zeta_{i} \mathbf{V_{1}}_{i} \circ \mathbf{V_{2}}_{i} \circ \dots \circ \mathbf{V_{K}}_{i}, \\& \mathbf{V_{k}} \in \mathbb{R}^{n_{k} \times R}, \zeta_{1} \geq \zeta_{2} \geq \cdots \geq \zeta_{R} > 0, \\&\langle \mathbf{V_{k}}_{i},\mathbf{V_{k}}_{j} \rangle = \boldsymbol{1}_{[i=j]}, 1 \le i,j \le R, \forall R\} \end{aligned} $$
(2)

where \(\boldsymbol{1}_{[B]}\) is an indicator function that takes value “1” if the event B is true and value “0” otherwise. \(\langle \cdot,\cdot \rangle\) denotes the inner product operation. Definition (2) generalizes the matrix SVD. One can see that TSVD is a special case of the decomposition form in (1), and Lemma 3.3 in [38] implies that the tensor in (2) has CP rank R. We remark that not all tensors have a TSVD. We refer readers to [38] for more details.

Throughout this paper, when discussing tensor ranks and low-rank tensors, we refer to the CP rank if not otherwise specified. Again, we will refer to tensors in \(\mathcal {S}_{\text {tsvd}}\) as SVD-tensors.

2 Our proposed framework of tensor recovery from noisy and multi-level quantized measurements

Let \(\mathcal {X}^{*} \in \mathbb {R}^{n_{1} \times n_{2} \times \dots \times n_{K}}\) denote the actual data that are represented by a K-dimensional tensor. Let \(\|\cdot\|_{\infty}\) denote the entry-wise infinity norm. We assume that the maximum value of \(\mathcal {X}^{*}\) is bounded by a positive constant α, i.e., \(\|\mathcal {X}^{*}\|_{\infty } \le \alpha \). We further assume that \(\mathcal {X}^{*}\) is a low-rank tensor, i.e., \(\text {rank}(\mathcal {X}^{*})\le r\).

Each entry of \(\mathcal {X}^{*}\) is mapped to one of a few possible values with certain probabilities through the quantization process. To model this probabilistic mapping, let \(\mathcal {N} \in \mathbb {R}^{n_{1} \times n_{2} \times \dots \times n_{K}}\) denote a noise tensor with i.i.d. entries drawn from a known cumulative distribution function Φ(x). Given the quantization boundaries \(\omega _{0}^{*} < \omega _{1}^{*}< \dots < \omega _{W}^{*}\), the noisy data \(\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}^{*} + \mathcal {N}_{i_{1},i_{2},\dots,i_{K}}\) (\(i_{j} \in [n_{j}], \forall j \in [K]\)) can be quantized to W values based on the following rule,

$$ \begin{aligned} &\mathcal{Y}_{i_{1},i_{2},\dots,i_{K}} = Q\left(\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{*} + \mathcal{N}_{i_{1},i_{2},\dots,i_{K}}\right) = l \ \ \\& \text{if} \ \omega_{l-1}^{*}<\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{*} + \mathcal{N}_{i_{1},i_{2},\dots,i_{K}}\le\omega_{l}^{*}, \ l \in [W], \end{aligned} $$
(3)

where Q is an operator that maps a real value to one of W values. We choose \(\omega _{0}^{*} = -\infty \) and \(\omega _{W}^{*} = \infty \). \(\mathcal {Y}_{i_{1},i_{2},\dots,i_{K}}\) is the \((i_{1},i_{2},\dots,i_{K})\)-th entry of the quantized measurements \(\mathcal {Y} \in [W]^{n_{1} \times n_{2} \times \dots \times n_{K}}\). When \(W=2\), \(\mathcal {Y}\) reduces to the 1-bit case [43]. In general, \(\mathcal {Y}\) is a \(\log_{2} W\)-bit tensor. Figure 1 provides a visualization of the quantization process when K=3 and \(\mathcal {Y}\) is a \(\log_{2} W\)-bit tensor. The actual tensor \(\mathcal {X}^{*}\) is mapped to the quantized tensor \(\mathcal {Y}\) by first adding a noise tensor \(\mathcal {N}\) and then applying the quantization operator Q.
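To make the measurement model concrete, the following Python sketch (illustrative only, not from the paper) simulates (3) for a small tensor, assuming additive logistic noise (one of the two noise models discussed below) and hypothetical boundaries with W=4:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical sizes and boundaries (W = 4 quantization levels)
n1, n2, n3 = 6, 5, 4
omega = np.array([-np.inf, -0.4, 0.0, 0.4, np.inf])   # omega_0* < ... < omega_W*
sigma = 0.25

X_star = rng.uniform(-1.0, 1.0, size=(n1, n2, n3))           # stand-in for the true tensor
N = rng.logistic(loc=0.0, scale=sigma, size=X_star.shape)    # noise with CDF Phi(x) = Phi_log(x / sigma)

# rule (3): Y = l  if  omega_{l-1}* < X* + N <= omega_l*,  l = 1, ..., W
Y = np.digitize(X_star + N, omega[1:-1], right=True) + 1
print(np.unique(Y))   # quantized levels, a subset of {1, 2, 3, 4}
```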

Fig. 1 Quantization model (K=3)

The probability that \(\mathcal {Y}_{i_{1},i_{2},\dots,i_{K}} = l\) given \(\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}^{*}\) and \(\omega _{l-1}^{*}, \omega _{l}^{*}\) is expressed by \(f_{l}\left (\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}^{*}, \omega _{l-1}^{*}, \omega _{l}^{*}\right)\), where

$$ \begin{aligned} &f_{l}\left(\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{*}, \omega_{l-1}^{*}, \omega_{l}^{*}\right)\\&=P\left(\mathcal{Y}_{i_{1},i_{2},\dots,i_{K}} = l|\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{*}, \omega_{l-1}^{*}, \omega_{l}^{*}\right) \\&= \Phi\left(\omega_{l}^{*}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{*}\right)-\Phi\left(\omega_{l-1}^{*}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{*}\right), \end{aligned} $$
(4)

and \(\sum _{l=1}^{W}f_{l}\left (\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}^{*}, \omega _{l-1}^{*}, \omega _{l}^{*}\right)= \Phi \left (\infty -\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}^{*}\right)-\Phi \left (-\infty -\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}^{*}\right) = 1\). The probability description (4) follows the same formula as those in [4, 13], except that the entries are from a higher-order tensor. Two common choices for Φ(x) are as follows: (1) the probit model with \(\Phi (x)=\Phi _{\text {norm}}(x/\sigma)\), where \(\Phi _{\text {norm}}\) is the cumulative distribution function of the standard Gaussian distribution, and (2) the logistic model with \(\Phi (x)=\Phi _{\text {log}}(x/\sigma)=\frac {1}{1+e^{-x/\sigma }}\).
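As a quick sanity check of (4) (an illustrative sketch with hypothetical boundaries, not code from the paper), the probabilities f_l under the logistic model can be computed as follows and sum to one:

```python
import numpy as np

def phi_log(x, sigma):
    """Logistic CDF Phi_log(x / sigma) used in the logistic noise model."""
    return 1.0 / (1.0 + np.exp(-x / sigma))

def f_l(x, omega, sigma):
    """Probabilities f_l(x, omega_{l-1}, omega_l) for l = 1, ..., W, for a scalar x."""
    upper = phi_log(omega[1:] - x, sigma)    # Phi(omega_l - x)
    lower = phi_log(omega[:-1] - x, sigma)   # Phi(omega_{l-1} - x)
    return upper - lower

omega = np.array([-np.inf, -0.4, 0.0, 0.4, np.inf])  # hypothetical W = 4 boundaries
probs = f_l(0.3, omega, sigma=0.25)
print(probs, probs.sum())   # the W probabilities sum to 1
```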

We also consider the general setup in which there is missing data in the measurements, i.e., only measurements with indices belonging to the observation set Ω are available, while all the other measurements are lost. The question we will address in this paper is as follows. Given the partial observations \(\mathcal {Y}_{\Omega }\) and the noise distribution Φ, how can we estimate the original tensor \(\mathcal {X}^{*}\)? We will discuss the case when \(\mathcal {X}^{*}\) is a general tensor and the special case when \(\mathcal {X}^{*}\) is a SVD-tensor.

We remark that this problem formulation can be applied in different domains. In user voting systems, data can be represented as {users×scoring objects×contexts} [16], which is a three-dimensional tensor. The scores from the reviewers are highly quantized [1]. By solving the quantized tensor recovery problem, one can obtain the actual preferences of the reviewers. In video processing, the measurements can be represented as {rows of a frame×columns of a frame×different frames}. The measurements can be highly quantized due to the sensing process, and the objective is to recover the data [49, 50]. A similar idea also applies to low-quality image recovery [2, 15]. Images from the same subject can be represented by {rows of an image×columns of an image×different images}.

3 Results: theoretical

We propose to estimate the tensor \(\mathcal {X}^{*}\) and the boundaries \(\omega _{1}^{*}, \omega _{2}^{*}, \cdots, \omega _{W-1}^{*}\) using a constrained maximum likelihood approach. The negative log-likelihood function is given by

$$ \begin{aligned} &F_{\Omega}(\mathcal{X},\omega_{1}, \omega_{2},\cdots, \omega_{W-1}) = -\frac{n_{1}n_{2}\cdots n_{K}}{|\Omega|}\sum_{(i_{1},\cdots,i_{K}) \in \Omega}\\&\sum_{l=1}^{W} \boldsymbol{1}_{[\mathcal{Y}_{i_{1},i_{2},\cdots,i_{K}}=l]}\log(f_{l}(\mathcal{X}_{i_{1},i_{2},...,i_{K}},\omega_{l-1}, \omega_{l})), \end{aligned} $$
(5)

where |Ω| denotes the cardinality of Ω. Equation (5) is a convex function when fl is a log-concave function. When \(\omega _{l}^{*}\)’s are unknown, we estimate \(\mathcal {X}^{*},\omega _{l}^{*}\)’s by \(\hat {\mathcal {X}},\hat {\omega }_{l}\)’s, where

$$ \begin{aligned} &(\hat{\mathcal{X}},\hat{\omega}_{1},\hat{\omega}_{2},\cdots,\hat{\omega}_{W-1})\\& = {\arg\min}_{\mathcal{X}, \omega_{l}, \forall l \in [W-1]} F_{\Omega}(\mathcal{X},\omega_{1}, \omega_{2},\cdots, \omega_{W-1}) \: \:\: \: \\&\mathrm{s.t.} \mathcal{X},\omega_{1}, \omega_{2},\cdots, \omega_{W-1} \in \mathcal{S}_{f\omega}, \end{aligned} $$
(6)

and

$$ \begin{aligned} \mathcal{S}_{f\omega}:=&\{\mathcal{X} \in \mathbb{R}^{n_{1}\times n_{2} \times... n_{K}}, \omega_{l}, \forall l \in [W-1]: \\& \|\mathcal{X}\|_{\infty}\le\alpha, \text{rank}(\mathcal{X}) \le r,\\& \omega_{0} < \omega_{1} < \omega_{2} < \cdots < \omega_{W-1} < \omega_{W}\}. \end{aligned} $$
(7)
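For concreteness, a minimal Python sketch of evaluating the negative log-likelihood (5) on the observed entries is given below; it assumes the logistic model, and the function and variable names are ours, not the paper’s. Problem (6) then minimizes this quantity over the feasible set \(\mathcal {S}_{f\omega }\).

```python
import numpy as np

def neg_log_likelihood(X, Y, mask, omega, sigma):
    """F_Omega in (5). X: current tensor estimate; Y: quantized tensor with levels 1..W;
    mask: boolean tensor marking the observation set Omega; omega: array (omega_0, ..., omega_W)."""
    phi = lambda t: 1.0 / (1.0 + np.exp(-t / sigma))   # logistic model
    upper = omega[Y]        # omega_l for each entry (l = Y)
    lower = omega[Y - 1]    # omega_{l-1}
    f = phi(upper - X) - phi(lower - X)                # f_l(X, omega_{l-1}, omega_l)
    scale = X.size / max(mask.sum(), 1)                # n_1 n_2 ... n_K / |Omega|
    return -scale * np.sum(np.log(f[mask]))
```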

Most existing works on quantized data recovery, with the exception of [13], consider the special case that the quantization boundaries are known [1, 4, 43]. In this case, (6) can be simplified to

$$ \hat{\mathcal{X}} = {\arg\min}_{\mathcal{X}} F_{\Omega}(\mathcal{X}, \omega_{1}^{*},\omega_{2}^{*},...,\omega^{*}_{W-1}) \: \:\: \: \mathrm{s.t.} \mathcal{X} \in \mathcal{S}_{f}, $$
(8)

where

$$ \mathcal{S}_{f}:=\{\mathcal{X}: \|\mathcal{X}\|_{\infty}\le\alpha, \text{rank}(\mathcal{X}) \le r\}. $$
(9)

If we assume that \(\mathcal {X}^{*} \in \mathcal {S}_{\text {tsvd}}\), then the optimization problem (6) changes to

$$ \begin{aligned} &(\hat{\mathcal{X}},\hat{\omega}_{1},\hat{\omega}_{2},\cdots,\hat{\omega}_{W-1})\\& = {\arg\min}_{\mathcal{X}, \omega_{l}, \forall l \in [W-1]} F_{\Omega}(\mathcal{X},\omega_{1}, \omega_{2},\cdots, \omega_{W-1}) \: \:\: \: \\&\mathrm{s.t.} \mathcal{X},\omega_{1}, \omega_{2},\cdots, \omega_{W-1} \in \mathcal{S}_{f\omega s}, \end{aligned} $$
(10)

where

$$ \begin{aligned} \mathcal{S}_{f\omega s}:=&\{\mathcal{X} \in \mathbb{R}^{n_{1}\times n_{2} \times... n_{K}}, \omega_{l}, \forall l \in [W-1]: \\& \|\mathcal{X}\|_{\infty}\le\alpha, \text{rank}(\mathcal{X}) \le r, \mathcal{X} \in \mathcal{S}_{\text{tsvd}}\\& \omega_{0} < \omega_{1} < \omega_{2} < \cdots < \omega_{W-1} < \omega_{W}\}, \end{aligned} $$
(11)

and problem (8) changes to

$$ \hat{\mathcal{X}} = {\arg\min}_{\mathcal{X}} F_{\Omega}\left(\mathcal{X}, \omega_{1}^{*},\omega_{2}^{*},...,\omega^{*}_{W-1}\right) \: \:\: \: \mathrm{s.t.} \mathcal{X} \in \mathcal{S}_{fs}, $$
(12)

where

$$ \begin{aligned} &\mathcal{S}_{fs}:=\{\mathcal{X}: \|\mathcal{X}\|_{\infty}\le\alpha, \text{rank}(\mathcal{X}) \le r, \mathcal{X} \in \mathcal{S}_{\text{tsvd}}\}. \end{aligned} $$
(13)

We remark that all (6), (8), (10), and (12) are nonconvex problems since \(\mathcal {S}_{f\omega },\mathcal {S}_{f},\mathcal {S}_{f\omega s},\mathcal {S}_{fs}\) are nonconvex sets. Note that when the ground-truth tensor is not in \(\mathcal {S}_{\text {tsvd}}\), the solutions of (10) and (12) can be viewed as a low-rank approximation of the tensor.

Ghadermarzy et al. [43] study the case with known bin boundaries in (8)–(9). Their work focuses on the special case that W=2 and relaxes the low-rank constraint in \(\mathcal {S}_{f}\) with a convex M-norm constraint. Bhaskar et al. [13] and Gao et al. [4] consider minimizing a negative log-likelihood function subject to a low-rank constraint, which is similar to (8), but they are restricted to quantized matrix recovery. None of these works addresses the problem of low-rank tensor recovery from multi-level quantized measurements, nor the recovery of SVD-tensors. We first analyze the recovery performance of our models. We defer the algorithms to Section 4.

3.1 Tensor recovery guarantee

Similar to the works on quantized matrix recovery [1, 13], we first define two constants γα and Lα for the analysis of the case in which the boundaries are all known constants. For simplicity, we denote \(f_{l}(x, \omega _{l-1}^{*},\omega _{l}^{*})\) by fl(x).

$$ \begin{aligned} &\gamma_{\alpha} = \min_{l\in[W]}\inf_{|x|\le2\alpha}\left\{\frac{\dot{f}_{l}^{2}(x)}{f_{l}^{2}(x)}-\frac{\ddot{f}_{l}(x)}{f_{l}(x)}\right\},\\& L_{\alpha} = \max_{l\in[W]}\sup_{|x|\le2\alpha}\left\{\frac{|\dot{f}_{l}(x)|}{f_{l}(x)}\right\}, \end{aligned} $$
(14)

where \(\dot {f}_{l}\) and \(\ddot {f}_{l}\) are the first- and second-order derivatives of fl. Note that \(\dot {f}_{l}^{2} - \ddot {f}_{l}f_{l} \geq 0\) if fl is log-concave, and \(\dot {f}_{l}^{2} - \ddot {f}_{l}f_{l} > 0\) if fl is strictly log-concave. One can check that fl is strictly log-concave if Φ is log-concave, which holds true for noise following Gaussian and logistic distributions. Thus, γα>0 in our setup. We also remark that Lα and γα are bounded by some fixed constants when both α and fl are given. Taking the logistic model as an example [4, 13], we have

$$ \begin{aligned} &\gamma_{\alpha} = \min_{l\in[W]}\inf_{|x|\le2\alpha}\frac{1}{\sigma^{2}}[\Phi_{\text{log}}(\frac{\omega_{l}-x}{\sigma})(1-\Phi_{\text{log}}(\frac{\omega_{l}-x}{\sigma}))\\&+\Phi_{\text{log}}(\frac{\omega_{l-1}-x}{\sigma})(1-\Phi_{\text{log}}(\frac{\omega_{l-1}-x}{\sigma}))] \\& L_{\alpha} = \\& 1/\left[2\sigma\min_{l\in[W]}\inf_{|x|\le2\alpha} \left\{ \Phi_{\text{log}}\left(\frac{\omega_{l}-x}{\sigma}\right)-\Phi_{\text{log}}\left(\frac{\omega_{l-1}-x}{\sigma}\right)\right\}\right] \end{aligned} $$
(15)

where Lα and γα depend on σ and W. It is also easy to check that γα,Lα>0 from (15).
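As an illustration only (not part of the paper), γα and Lα in (15) can be evaluated numerically (approximately, on a grid over |x|≤2α) for a given σ, α, and set of boundaries:

```python
import numpy as np

def gamma_L_logistic(omega, sigma, alpha, grid=2001):
    """Numerically approximate gamma_alpha and L_alpha in (15) for the logistic model."""
    phi = lambda t: 1.0 / (1.0 + np.exp(-t / sigma))
    x = np.linspace(-2 * alpha, 2 * alpha, grid)
    gammas, Ls = [], []
    for l in range(1, len(omega)):                      # l = 1, ..., W
        up, lo = phi(omega[l] - x), phi(omega[l - 1] - x)
        gammas.append(np.min((up * (1 - up) + lo * (1 - lo)) / sigma**2))
        Ls.append(np.min(up - lo))                      # inf of the bin probability over |x| <= 2*alpha
    return min(gammas), 1.0 / (2 * sigma * min(Ls))

omega = np.array([-np.inf, -0.4, 0.0, 0.4, np.inf])     # hypothetical boundaries
print(gamma_L_logistic(omega, sigma=0.25, alpha=1.0))
```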

We next state our main results that characterize the recovery error when there are no data losses and the quantization boundaries are known, i.e., the accuracy of the solutions to (8) and (12) when Ω is the full observation set.

Theorem 1

Suppose \(\omega _{l}^{*}\)’s are given, and Ω contains all the indices. \(\mathcal {X}^{*} \in \mathcal {S}_{f}\), and fl(x) is strictly log-concave in x, \(\forall l \in [W]\). Then, with probability at least 1−δ, \(\delta \in [0,1]\), any global minimizer \(\hat {\mathcal {X}}\) of (8) satisfies

$$ \|\hat{\mathcal{X}}-\mathcal{X}^{*}\|_{F}/\sqrt{n_{1}n_{2}...n_{K}} \le \min (2\alpha,U_{\alpha}) $$
(16)

where

$$ U_{\alpha} = \sqrt{\frac{64r^{K-1}L_{\alpha}^{2}\left(\left(\sum_{k=1}^{K}n_{k}\right)\log(4K/3)+\log(2/\delta)\right)}{n_{1}n_{2}...n_{K} \cdot \gamma_{\alpha}^{2}}}, $$
(17)

Theorem 2

Under the same assumptions on \(\omega _{l}^{*}\)’s, Ω, and fl(x) as in Theorem 1, for \(\mathcal {X}^{*} \in \mathcal {S}_{fs}\), any global minimizer \(\hat {\mathcal {X}}\) of (12) satisfies

$$ \|\hat{\mathcal{X}}-\mathcal{X}^{*}\|_{F}/\sqrt{n_{1}n_{2}...n_{K}} \le \min (2\alpha,U_{\alpha}') $$
(18)

with probability at least 1−δ, where

$$ U_{\alpha}' = \sqrt{\frac{64rL_{\alpha}^{2}\left(\left(\sum_{k=1}^{K}n_{k}\right)\log(4K/3)+\log(2/\delta)\right)}{n_{1}n_{2}...n_{K} \cdot \gamma_{\alpha}^{2}}}, $$
(19)

Theorems 1 and 2 establish the upper bounds of the recovery error when the measurements are noisy and quantized. Lα,δ,γα are all constants. Specifically, when \(n_{1},n_{2},\dots, n_{K}\) are all in the order of n, the recovery error of (16) and (18) can be represented as

$$ \|\hat{\mathcal{X}}-\mathcal{X}^{*}\|_{F}/\sqrt{n_{1}n_{2}...n_{K}} \le O\left(\sqrt{\frac{r^{K-1}K\log K}{n^{K-1}}}\right), $$
(20)

and

$$ \|\hat{\mathcal{X}}-\mathcal{X}^{*}\|_{F}/\sqrt{n_{1}n_{2}...n_{K}} \le O\left(\sqrt{\frac{rK\log K}{n^{K-1}}}\right). $$
(21)

The right-hand sides of (20) and (21) diminish to zero when n increases to infinity. Comparing (20) and (21), the provided recovery error bound is further reduced if the tensor is a SVD-tensor. Note that the Frobenius norm of \(\mathcal {X}^{*}\) is of the same order as \(\sqrt {n_{1}n_{2}...n_{K}}\). By dividing the actual error by \(\sqrt {n_{1}n_{2}...n_{K}}\), the left-hand sides of (20) and (21) are of the same order as the relative error \(\|\hat {\mathcal {X}}-\mathcal {X}^{*}\|_{F}/\|\mathcal {X}^{*}\|_{F}\), which is a commonly used normalized error measure. Therefore, the relative recovery error is sufficiently close to zero when the size of the tensor is large enough. We want to emphasize that Theorem 1 and Theorem 2 are based on the global minimizers of (8) and (12), respectively. In general, the global optimum of a nonconvex problem is hard to achieve.
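To get a concrete sense of the scale of the bound, Uα in (17) can be evaluated directly; the sketch below is ours, with hypothetical values of Lα and γα (which in practice follow from (14)):

```python
import numpy as np

def error_bound(n, r, L_alpha, gamma_alpha, delta=0.1):
    """Evaluate U_alpha in (17) for a tensor with dimension sizes n = (n_1, ..., n_K)."""
    n = np.asarray(n, dtype=float)
    K = len(n)
    num = 64 * r**(K - 1) * L_alpha**2 * (n.sum() * np.log(4 * K / 3) + np.log(2 / delta))
    return np.sqrt(num / (np.prod(n) * gamma_alpha**2))

# the bound shrinks as the dimension sizes grow (hypothetical constants)
for n in (60, 120, 240):
    print(n, error_bound((n, n, n), r=5, L_alpha=4.0, gamma_alpha=0.5))
```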

Note that the recovery error depends on W implicitly because W affects Lα and γα. It might seem counterintuitive that the recovery error is not a monotone function of W. That is because we consider all the possible selections of bin boundaries for a given W when computing Lα and γα. A larger W does not necessarily lead to more information in the quantized measurements. For example, if two bin boundaries are very close to each other, almost no data would be mapped to this bin, and the effective number of quantization levels is less than W (think of the extreme case when \(\omega_{1}=\omega_{W-1}\)). This is why W does not appear directly in the recovery bound. Of course, in practice, a larger W (more bits) will in most cases provide more information about the real data and thus improve the performance.

3.1.1 Recovery enhancement over the existing work on 1-bit tensor recovery

Recovering low-CP-rank tensors from 1-bit measurements has been studied by Ghadermarzy et al. [43], who relax the nonconvex low-rank constraint with a convex M-norm constraint; the resulting recovery method has an error bound of \(O\left (\left (\frac {r^{3K-3}K}{n^{K-1}}\right)^{1/4}\right)\). In contrast, our recovery error bound in (20) for general low-CP-rank tensors decays to zero faster than that of the 1-bit tensor recovery [43] for any K≥2, and the bound for SVD-tensors is even smaller. For example, the recovery error bounds in (20) and (21) are \(O(\frac {r}{n})\) and \(O\left (\frac {\sqrt {r}}{n}\right)\) when K=3, while the bound is \(O\left (\frac {r^{3/2}}{n^{1/2}}\right)\) by Ghadermarzy et al. [43].

3.1.2 Reduction to the matrix case

When reduced to the matrix case, i.e., K=2, both (20) and (21) show that the quantized matrix recovery has an error bound of \(O\left (\sqrt {\frac {r}{n}}\right)\), which is the same as the smallest error bound in the matrix case [4].

3.1.3 Recovery enhancement over quantized matrix recovery

Here, we compare the recovery errors in (20) and (21) with the results obtained by applying quantized matrix recovery methods to the mode-k matricization X(k) along the k-th dimension of \(\mathcal {X}\). When the size of each dimension is Θ(n), the sizes of the two dimensions of X(k) are Θ(n) and \(\Theta(n^{K-1})\), respectively. Let \(\bar {r}\) be the rank of this matrix; note that \(\bar {r}\) is smaller than or equal to r.

Existing works provide theoretical analyses of matrix recovery from quantized measurements [1, 4, 13]. The recovery errors of applying these methods to X(k) are in the order of \(O\left (\sqrt {\frac {\bar {r}^{3}}{n}}\right)\) and \(O((\frac {\bar {r}}{n})^{1/4})\) by Bhaskar et al. [13] and Davenport et al. [1], respectively. The best existing bound is \(O\left (\sqrt {\frac {\bar {r}}{n}}\right)\) by Gao et al. [4]. Note that the error order in our results has a power of K−1 in its denominator. For example, the recovery error is \(O(\frac {r}{n})\) by (20) and \(O(\frac {\sqrt {r}}{n})\) by (21) when K=3. r is often assumed to be a constant, i.e., O(1), as in the work of Ghadermarzy et al. [43]. Then, \(\bar {r}\) is also O(1). It is easy to see that when K≥3, the recovery errors of both (20) and (21) decay to zero faster than the best existing bound of \(O(\sqrt {\frac {\bar {r}}{n}})\) by Gao et al. [4] for the mode-k matricization case. In Table 1, we compare our results to the state-of-the-art results of the existing 1-bit tensor recovery [43] and quantized matrix recovery [4].

Table 1 Comparison of our method to state-of-the-art quantized matrix recovery method and 1-bit tensor recovery method

3.2 Fundamental limitation of the recovery

We next provide a fundamental error limit of any recovery method in recovering low-rank (CP rank) tensors, even when the observed measurements are unquantized. We consider noise drawn from a zero-mean Gaussian distribution with variance \(\sigma^{2}\). Let \(n_{\max}\) denote \(\max(n_{1},n_{2},\dots,n_{K})\), and assume \(rn_{\max}>64\).

Theorem 3

Let \(\mathcal {N} \in \mathbb {R}^{n_{1} \times n_{2} \cdots \times n_{K}}\) contain i.i.d. entries from a zero-mean Gaussian distribution with variance \(\sigma^{2}\). For any \(\mathcal {X} \in \mathcal {S}_{f}\), consider any algorithm that takes \(\mathcal {Y}=\mathcal {X}+\mathcal {N}\) as the input and returns an estimate \(\hat {\mathcal {X}}\). Then, there always exists \(\mathcal {X} \in \mathcal {S}_{f}\) such that, with probability at least \(\frac {3}{4}\),

$$ \frac{\|\hat{\mathcal{X}}-\mathcal{X}\|_{F}}{\sqrt{n_{1}n_{2} \cdots n_{K}}} \geq \min\left (\frac{\alpha}{4}, C_{1}\sigma\sqrt{\frac{rn_{\text{max}}-64}{n_{1}n_{2} \cdots n_{K}}}\right) $$
(22)

holds for a fixed constant \(C_{1} < \sqrt {\frac {1}{512}}\).

Theorem 3 establishes the lower bound of the recovery error. When \(n_{1},n_{2},\dots, n_{K}\) are all in the order of n, the recovery error of (22) can be represented as

$$ \|\hat{\mathcal{X}}-\mathcal{X}^{*}\|_{F}/\sqrt{n_{1}n_{2}...n_{K}} \geq \Theta\left(\sqrt{\frac{r}{n^{K-1}}}\right), $$
(23)

That means the recovery error from unquantized measurements by any algorithm is at least \(\Theta \left (\sqrt {\frac {r}{n^{K-1}}}\right)\). Comparing the error bounds in (20) and (23), one can see that the error bound (20) is almost order-wise optimal when \(r \le O(n)\).

4 Algorithms: tensor recovery from quantized measurements

We propose two efficient algorithms to solve the nonconvex problems (6) and (10), respectively. Both algorithms transform the rank constraint into a penalty function in the objective and update all the variables alternately. Since problem (10) has extra orthonormality constraints on the tensor decomposition components, we apply different updating strategies to these decomposition variables.

4.1 Alternating proximal gradient descent based on tensors

We develop a fast algorithm named tensor-based alternating proximal gradient descent (TAPGD) to solve the nonconvex problem (6) with a convergence guarantee.

Since \(\text {rank}(\mathcal {X})\le r\), there exist \(\mathbf {A_{k}}\in \mathbb {R}^{n_{k} \times r}, \forall k \in [K]\), such that \(\mathcal {X} = \mathbf {A_{1}} \circ \mathbf {A_{2}} \circ \dots \circ \mathbf {A_{K}}\). Then, we change the rank constraint into a penalty function \(\frac {\lambda }{2}\|\mathcal {X} - \mathbf {A_{1}} \circ \mathbf {A_{2}} \circ \dots \circ \mathbf {A_{K}}\|_{F}^{2}\) in the objective, where λ is a positive constant. The equality constraint holds when λ goes to infinity. Note that \(\mathcal {X} = \mathbf {A_{1}} \circ \mathbf {A_{2}} \circ \dots \circ \mathbf {A_{K}}\) is in the form of the CANDECOMP/PARAFAC (CP) decomposition [51, 52]. Unlike matrix decomposition and the other major tensor decomposition method (Tucker decomposition [53]), CP decomposition requires only a weak condition for the uniqueness of its tensor factors. A sufficient condition for the CP decomposition to be unique is that the sum of the Kruskal ranks (k-ranks) of \(\mathbf{A_{k}}\), \(k = 1,2,\dots,K\), is at least \(2r+K-1\) [23], which often holds true. In contrast, Tucker decomposition is generally not unique, and updating its core tensor is usually computationally expensive.

We revise \(\mathcal {S}_{f\omega }\) by adding constraints that the quantization boundaries shall not be too close to each other, to avoid trivial solutions in practice. The resulting feasible set is

$$ \begin{aligned} &\mathcal{S}_{\omega} = \{ \mathcal{X}, \omega_{1}, \omega_{2},\cdots,\omega_{W-1}: \alpha_{\text{low}} \le \omega_{1} \le \omega_{2} - \kappa_{2},\\& \omega_{l-1} + \kappa_{l} \le \omega_{l} \le \omega_{l+1} - \kappa_{l+1},~ \forall l \in \{2,3,...,W-2\}, \\& \omega_{W-2} + \kappa_{W-1} \le \omega_{W-1} \le \alpha_{\text{upper}},\|\mathcal{X}\|_{\infty}\le\alpha \}, \end{aligned} $$
(24)

where \(\kappa_{l}\), \(\forall l \in \{2,3,\dots,W-1\}\), are some positive numbers that can be chosen using hyperparameter tuning or simply set as small positive constants, and \(\kappa_{1}=\kappa_{W}=0\). \(\alpha_{\text{low}}\) and \(\alpha_{\text{upper}}\) are two constants that provide lower and upper bounds on the boundaries; they could be chosen as −α and α, or as estimates computed for the specific application. The revised version of problem (6) is as follows

$$ \begin{aligned} &(\hat{\mathcal{X}},\hat{\omega}_{1},\hat{\omega}_{2},\cdots,\hat{\omega}_{W-1}) =\\& {\arg\min}_{\mathcal{X}, \mathbf{A_{k}}, k \in [K], \omega_{l}, l\in [W-1]} F_{\Omega}(\mathcal{X},\omega_{1},\omega_{2},\cdots,\omega_{W-1}) \\&+ \frac{\lambda}{2}\|\mathcal{X} - \mathbf{A_{1}} \circ \mathbf{A_{2}} \circ \dots \circ \mathbf{A_{K}}\|_{F}^{2} + \\&\Psi_{1}(\mathcal{X})+\sum_{l=1}^{W-1}\Psi_{2}(\omega_{l}) \end{aligned} $$
(25)

where

$$ \begin{aligned} &\Psi_{1}(\mathcal{X})= \left \{ \begin{array}{rcl} \infty & \text{if}~ \|\mathcal{X}\|_{\infty} > \alpha \\& \\ 0 & \text{otherwise} \end{array} \right. \\&\Psi_{2}(\omega_{l})= \left \{ \begin{array}{rcl} \infty & \text{if}~ \omega_{l} > \min(\omega_{l+1} - \kappa_{l+1}, \alpha_{\text{upper}}) \\& \text{or}~~ \omega_{l} < \max(\omega_{l-1} + \kappa_{l}, \alpha_{\text{low}}) \\ 0 & \text{otherwise} \end{array} \right. \end{aligned} $$
(26)

\(\Psi _{1}(\mathcal {X})\) encodes the constraint \(\|\mathcal {X}\|_{\infty }\le \alpha \). \(\Psi _{2}(\omega _{l})\) encodes the constraints on ωl in \(\mathcal {S}_{\omega }\). Let

$$ \begin{aligned} H = &F_{\Omega}(\mathcal{X},\omega_{1}, \omega_{2},\cdots, \omega_{W-1}) \\&+ \frac{\lambda}{2}\|\mathcal{X} - \mathbf{A_{1}} \circ \mathbf{A_{2}} \circ \dots \circ \mathbf{A_{K}}\|_{F}^{2}. \end{aligned} $$
(27)

Then, we solve (25) using the proximal gradient method [54]. The main steps of the proximal gradient method are to update \(\mathcal {X}, \mathbf {A_{k}}, k \in [K], \omega _{l}, l \in [W-1]\) by gradient descent on H and to project the result onto \(\mathcal {S}_{\omega }\). Since, for all \(k \in [K]\),

$$ \begin{aligned} &\|\mathcal{X} - \mathbf{A_{1}} \circ \mathbf{A_{2}} \circ \dots \circ \mathbf{A_{K}}\|_{F} =\\& \|\mathbf{X}_{(k)} - \mathbf{A_{k}}(\mathbf{A_{K}}\odot \dots \odot \mathbf{A_{k+1}}\odot \mathbf{A_{k-1}}\odot \dots \odot \mathbf{A_{1}})^{T}\|_{F}, \end{aligned} $$
(28)

the partial gradients of H with respect to Ak and \(\mathcal {X}\) can be calculated by

$$ \begin{aligned} \nabla_{\mathbf{A_{k}}} H = (\mathbf{A_{k}}(\mathbf{B_{k}})^{T} - \mathbf{X}_{(k)}) \mathbf{B_{k}}, \forall k \in [K], \end{aligned} $$
(29)
$$ \begin{aligned} \nabla_{\mathcal{X}} H =& \nabla_{\mathcal{X}} F_{\Omega}(\mathcal{X},\omega_{1}, \omega_{2},\cdots, \omega_{W-1}) \\&+ \lambda (\mathcal{X} - \mathbf{A_{1}} \circ \mathbf{A_{2}} \circ...\circ \mathbf{A_{K}}), \end{aligned} $$
(30)

where \(\mathbf{B_{k}}=\mathbf{A_{K}}\odot \dots \odot \mathbf{A_{k+1}}\odot \mathbf{A_{k-1}}\odot \dots \odot \mathbf{A_{1}}\). For any \((i_{1},i_{2},\dots,i_{K}) \in \Omega\),

$$ \begin{aligned} &\nabla_{\mathcal{X}} F_{\Omega}(\mathcal{X},\omega_{1}, \omega_{2},\cdots, \omega_{W-1})_{i_{1},i_{2},\dots,i_{K}} \\&= \frac{\dot{\Phi}(\omega_{l}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}})-\dot{\Phi}(\omega_{l-1}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}})}{\Phi(\omega_{l}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}})-\Phi(\omega_{l-1}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}})}. \end{aligned} $$
(31)

Otherwise, for any \((i_{1},i_{2},\dots,i_{K}) \notin \Omega\),

$$ \begin{aligned} \nabla_{\mathcal{X}} F_{\Omega}(\mathcal{X},\omega_{1}, \omega_{2},\cdots, \omega_{W-1})_{i_{1},i_{2},\dots,i_{K}} = 0. \end{aligned} $$
(32)

Strictly speaking, the result should be multiplied by a \(\frac {n_{1}n_{2}\cdots n_{K}}{|\Omega |}\) factor. We ignore this factor since it cancels with the step size in our algorithm. The partial derivative of H with respect to ωl is as follows

$$ \begin{aligned} &\nabla_{\mathbf{\omega_{l}}} H = \left(\sum_{(i_{1},i_{2},\cdots,i_{K}) \in \Omega}\right.\\& \left.\frac{\boldsymbol{1}_{[\mathcal{Y}_{i_{1},i_{2},...,i_{K}}=l+1]}\dot{\Phi}(\omega_{l}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}})}{\Phi(\omega_{l+1}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}})-\Phi(\omega_{l}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}})} \right)\\&-\left(\sum_{(i_{1},i_{2},\cdots,i_{K}) \in \Omega}\right.\\&\left.\frac{\boldsymbol{1}_{[\mathcal{Y}_{i_{1},i_{2},...,i_{K}}=l]}\dot{\Phi}(\omega_{l}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}})}{\Phi(\omega_{l}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}})-\Phi(\omega_{l-1}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}})}\right), \end{aligned} $$
(33)

The step sizes of the gradient descent are selected as

$$ \begin{aligned} &\tau_{\mathbf{A_{k}}} = \frac{1}{\|(\mathbf{B_{k}})^{T}\mathbf{B_{k}}\|},\forall k \in [K],\\& \tau_{\mathcal{X}} = \frac{1}{\frac{1}{\sigma^{2}\beta^{2}}+\lambda},\\&\tau_{\omega_{l}} = \frac{\sigma^{2}\beta^{2}}{\sqrt{G_{l}}+\sqrt{G_{l+1}}},\forall l \in [W-1], \end{aligned} $$
(34)

where \(\|(\mathbf {B_{k}})^{T}\mathbf {B_{k}}\|,\frac {1}{\sigma^{2} \beta^{2} }+\lambda\), and \(\frac {\sqrt {G_{l}}+\sqrt {G_{l+1}}}{\sigma ^{2}\beta ^{2}}\) are Lipschitz constants of \(\nabla _{\mathbf {A_{k}}} H,\nabla _{\mathcal {X}} H\), and \(\nabla _{\omega _{l}}H\), respectively. Gl and Gl+1 are the numbers of entries in \(\mathcal {Y}_{\Omega }\) that equal l and l+1, respectively. Here, β is a small positive value that satisfies \(\Phi (\omega _{l}-\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}) \geq \Phi (\omega _{l-1}-\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}) + \beta \). This holds true since \(\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}, \omega _{l}, \omega _{l-1}\) are all bounded, ωl is larger than ωl−1, and Φ is a monotonically increasing function. After updating \(\mathcal {X}\), the algorithm sets \(\mathcal {X}_{i_{1},i_{2},...,i_{K}}\) to α if \(\mathcal {X}_{i_{1},i_{2},...,i_{K}} > \alpha \), and sets \(\mathcal {X}_{i_{1},i_{2},...,i_{K}}\) to −α if \(\mathcal {X}_{i_{1},i_{2},...,i_{K}} < -\alpha \). After updating ωl, the algorithm sets ωl= min(ωl+1−κl+1,αupper) if ωl> min(ωl+1−κl+1,αupper), and sets ωl= max(ωl−1+κl,αlow) if ωl< max(ωl−1+κl,αlow).

The algorithm is initialized by first estimating \(\omega _{l}^{*}\)’s according to the applications or simply setting \(\omega _{l}^{0} = \frac {2\alpha l}{W}-\alpha \) if no information is available, and then setting

$$ \mathcal{X}_{i_{1},i_{2},...,i_{K}}^{0} = \left \{ \begin{array}{rcl} \frac{\omega_{l}^{0} + \omega_{l-1}^{0}}{2}, & \text{if}~ 1<\mathcal{Y}_{i_{1},i_{2},...,i_{K}}=l<W. \\ \frac{\alpha + \omega_{W-1}^{0}}{2}, & \text{if}~ \mathcal{Y}_{i_{1},i_{2},...,i_{K}}=W. \\ \frac{-\alpha + \omega_{1}^{0}}{2}, & \text{if}~ \mathcal{Y}_{i_{1},i_{2},...,i_{K}}=1. \\ 0, & (i_{1},i_{2},\cdots,i_{K}) \not\in \Omega. \end{array} \right. $$
(35)

\(\mathbf {A_{k}}^{0} \in \mathbb {R}^{n_{k} \times r},\forall k \in [K]\) are obtained through the decomposition of \(\mathcal {X}^{0}\). The details of TAPGD are shown in Algorithm 1. Note that when the quantization boundaries \(\omega _{l}^{*}\)’s are known, TAPGD can be revised easily by removing steps 14–20 from Algorithm 1.
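A minimal Python sketch of the initialization (35) is given below (illustrative only; it assumes the initial interior boundaries lie in [−α, α], as with the default choice \(\omega _{l}^{0} = \frac {2\alpha l}{W}-\alpha \)):

```python
import numpy as np

def initialize_X(Y, mask, omega0, alpha):
    """Initialization (35): bin midpoints for observed entries, zero for missing entries.
    Y holds levels 1..W, mask marks Omega, omega0 has length W+1 with -inf and +inf at the ends."""
    bounds = np.clip(omega0, -alpha, alpha)        # replace -inf, +inf by -alpha, alpha
    midpoints = (bounds[:-1] + bounds[1:]) / 2.0   # midpoint of each of the W bins
    return np.where(mask, midpoints[np.clip(Y, 1, len(midpoints)) - 1], 0.0)
```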

To improve the recovery performance, one can multiply λ by a constant slightly larger than one in each iteration. This provides better numerical results than fixing λ across all iterations. The complexity of TAPGD in each iteration is \(O(Krn_{1}n_{2}\dots n_{K})\). The convergence of TAPGD is summarized in Theorem 4.
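To make the update steps concrete, the sketch below performs one TAPGD-style iteration with known boundaries under the logistic model. It is an illustrative re-implementation in Python, not the authors’ MATLAB code: the boundary update (33) and its projection are omitted, β is replaced by a hand-picked stand-in constant, and the mode-k unfolding and Khatri-Rao ordering follow NumPy’s row-major convention (consistent with each other, though written differently from the paper’s \(\mathbf{B_k}\)).

```python
import numpy as np

def khatri_rao(mats):
    """Column-wise Kronecker product of matrices that share the same number of columns."""
    cols = [mats[0][:, i] for i in range(mats[0].shape[1])]
    for M in mats[1:]:
        cols = [np.kron(c, M[:, i]) for i, c in enumerate(cols)]
    return np.stack(cols, axis=1)

def unfold(X, k):
    """Mode-k matricization (remaining modes flattened in row-major order)."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def cp_full(factors, shape):
    """Rebuild A_1 o A_2 o ... o A_K from its factor matrices."""
    return (factors[0] @ khatri_rao(factors[1:]).T).reshape(shape)

def tapgd_step(X, factors, Y, mask, omega, sigma, alpha, lam, beta=1e-3):
    """One TAPGD-style iteration with known boundaries: gradient steps on each A_k and on X,
    then projection of X onto the infinity-norm ball of radius alpha.
    Y holds quantized levels 1..W, mask marks the observation set Omega,
    omega is the array (omega_0, ..., omega_W) with infinite end points."""
    phi  = lambda t: 1.0 / (1.0 + np.exp(-t / sigma))     # logistic CDF
    dphi = lambda t: phi(t) * (1.0 - phi(t)) / sigma      # its derivative

    K, shape = len(factors), X.shape
    for k in range(K):
        # gradient step on A_k as in (29), with step size 1 / ||B_k^T B_k|| from (34)
        B = khatri_rao([factors[m] for m in range(K) if m != k])
        G = (factors[k] @ B.T - unfold(X, k)) @ B
        factors[k] = factors[k] - G / np.linalg.norm(B.T @ B, 2)

    # gradient of F w.r.t. X on observed entries, as in (31)-(32), plus the penalty term in (30)
    up, lo = omega[Y] - X, omega[Y - 1] - X
    grad_F = np.where(mask, (dphi(up) - dphi(lo)) / (phi(up) - phi(lo)), 0.0)
    grad_X = grad_F + lam * (X - cp_full(factors, shape))
    X = X - grad_X / (1.0 / (sigma**2 * beta**2) + lam)   # step size tau_X from (34)
    return np.clip(X, -alpha, alpha), factors             # projection onto ||X||_inf <= alpha
```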

Theorem 4

Assume that the sequence {Akt} generated by Algorithm 1 is bounded. Then, TAPGD globally converges to a critical point of (25) from any initial point, and the convergence rate is at least \(O\left (t^{\frac {\theta - 1}{2\theta - 1}}\right)\), for some \(\theta \in \left (\frac {1}{2},1\right)\).

Theorem 4 indicates a sublinear convergence rate for TAPGD. One way to satisfy the bounded-sequence requirement is to scale the factorized variables so that \(\|\mathbf{A_{1}}\|_{F}=\|\mathbf{A_{2}}\|_{F}=\cdots=\|\mathbf{A_{K}}\|_{F}\) after each iteration. We find that TAPGD performs well numerically without this additional step.

4.2 TSVD-based alternating projected gradient descent

We develop an algorithm named TSVD-based alternating projected gradient descent (TSVD-APGD) to solve the nonconvex problem (10). Let \(H'=F_{\Omega }(\mathcal {X},\omega _{1},\omega _{2},\cdots,\omega _{W-1}) + \frac {\lambda }{2}\|\mathcal {X} - \sum _{i=1}^{r} \zeta _{i} \mathbf {V_{1}}_{i} \circ \mathbf {V_{2}}_{i} \circ \dots \circ \mathbf {V_{K}}_{i}\|_{F}^{2}\). Similar to (25), we relax the problem of (10) to

$$ \begin{aligned} &(\hat{\mathcal{X}},\hat{\omega}_{1},\hat{\omega}_{2},\cdots,\hat{\omega}_{W-1}) =\\& {\arg\min}_{\mathcal{X}, \mathbf{V_{k}}, k \in [K], \omega_{l}, l\in [W-1]} H' + \Psi_{1}(\mathcal{X})+\sum_{l=1}^{W-1}\Psi_{2}(\omega_{l}) \\& \mathrm{s.t.} \langle \mathbf{V_{k}}_{i},\mathbf{V_{k}}_{j} \rangle = \boldsymbol{1}_{[i=j]}, 1 \le i,j \le r \end{aligned} $$
(36)

The updates of \(\mathcal {X}\) and ωl are the same as in TAPGD, while the updates of the decomposition components ζi and Vk are different. In Algorithm 2, we borrow the idea from the work of Li et al. [39] to update \(\zeta_{i}, \mathbf{V_{k}}, \forall i \in [r], \forall k \in [K]\), in steps 2–8. QR in step 4 of Algorithm 2 denotes the QR decomposition [55], which returns an orthonormal matrix and an upper triangular matrix; we use the orthonormal matrix to update Vk.
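As a small illustration of the orthonormalization in step 4 (this is our sketch, not the full Algorithm 2), a factor \(\mathbf{V_{k}}\) can be re-orthonormalized after a gradient step with a reduced QR decomposition, assuming it has at least as many rows as columns:

```python
import numpy as np

def project_orthonormal(V):
    """Orthonormalize the columns of V so that <V_i, V_j> = 1_[i=j], via reduced QR."""
    Q, _ = np.linalg.qr(V)      # Q has the same shape as V with orthonormal columns
    return Q

V = np.random.randn(8, 3)                            # hypothetical factor, n_k = 8, r = 3
V_orth = project_orthonormal(V)
print(np.allclose(V_orth.T @ V_orth, np.eye(3)))     # True
```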

Similar to Algorithm 1, when the quantization boundaries are known, TSVD-APGD can be revised easily by removing step 11. We cannot yet prove the convergence of Algorithm 2 and leave it to future work. Nevertheless, Algorithm 2 demonstrates reliable numerical performance, as shown in Section 5.

5 Results: numerical experiments

We conduct simulations on synthetic data, image data, and data from an in-car music recommender system [16] in this section. The recovery performance is measured by \(\|\mathcal {X}^{*}-\tilde {\mathcal {X}}\|_{F}^{2}/\|\mathcal {X}^{*}\|_{F}^{2}\), where \(\tilde {\mathcal {X}}\) is our estimate of \(\mathcal {X}^{*}\). K=3 in the tests on both synthetic and real data. We set T=200. All the results are averaged over 100 runs. The simulations are run in MATLAB on a 3.4-GHz Intel Core i7 computer.

5.1 Synthetic data

A rank-r, three-dimensional tensor is generated as follows. We first generate \(\mathbf {A_{1}} \in \mathbb {R}^{n_{1} \times r}\) with entries sampled independently from a uniform distribution on \([-0.5,0.5]\), and \(\mathbf {A_{2}} \in \mathbb {R}^{n_{2} \times r}\), \(\mathbf {A_{3}} \in \mathbb {R}^{n_{3} \times r}\) with entries sampled independently from a uniform distribution on [0,1]. Then, we obtain the tensor by calculating \(\mathbf{A_{1}} \circ \mathbf{A_{2}} \circ \mathbf{A_{3}}\) and scaling all the values to [−1,1]. A rank-r, three-dimensional SVD-tensor is generated as follows. \(\mathbf{V_{1}}, \mathbf{V_{2}}, \mathbf{V_{3}}\) are first obtained by transforming \(\mathbf{A_{1}}, \mathbf{A_{2}}, \mathbf{A_{3}}\) into orthonormal matrices. \(\zeta_{i}, i \in [r]\), are generated from the right half of a standard normal distribution. Then, we obtain the SVD-tensor by (2) and scaling all the values to [−1,1]. The entries of \(\mathcal {N}\) are i.i.d. generated from the Gaussian distribution with mean 0 and standard deviation σ=0.25. We choose W=2 (1-bit) and W=4 (2-bit) in our experiments. When W=2, \(\omega _{0}^{*} = -\infty,\omega _{1}^{*} = 0,\omega _{2}^{*} = \infty \). When W=4, \(\omega _{0}^{*} = -\infty,\omega _{1}^{*} = -0.4,\omega _{2}^{*} = 0,\omega _{3}^{*} = 0.4,\omega _{4}^{*} = \infty \).
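For reproducibility, a Python sketch of the general-tensor generation and quantization described above is given below; it is an illustrative re-implementation (the experiments in the paper use MATLAB), and the scaling step is one simple way to map the values into [−1,1]:

```python
import numpy as np

def make_low_rank_tensor(n1, n2, n3, r, seed=0):
    """Generate the rank-r three-dimensional test tensor described above, scaled into [-1, 1]."""
    rng = np.random.default_rng(seed)
    A1 = rng.uniform(-0.5, 0.5, size=(n1, r))
    A2 = rng.uniform(0.0, 1.0, size=(n2, r))
    A3 = rng.uniform(0.0, 1.0, size=(n3, r))
    X = np.einsum('ir,jr,kr->ijk', A1, A2, A3)   # sum of r rank-1 terms A1_i o A2_i o A3_i
    return X / np.abs(X).max()                   # scale the values into [-1, 1]

X_star = make_low_rank_tensor(120, 120, 120, r=5)
noise = np.random.default_rng(1).normal(0.0, 0.25, X_star.shape)        # Gaussian, sigma = 0.25
Y = np.digitize(X_star + noise, [-0.4, 0.0, 0.4], right=True) + 1       # the W = 4 boundaries above
```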

Figure 2 compares TAPGD with the M-norm constrained 1-bit tensor recovery (MNC-1bit-TR) method [43] and the quantized matrix recovery method [4]. We remark that MNC-1bit-TR can only deal with 1-bit measurements and requires solving a convex optimization problem. The tolerance value is set to 0.01 for the matrix recovery method. In the MNC-1bit-TR method, we use the theoretical upper bound \(r^{\frac {3}{2}}\alpha \) (here α=1) to bound the maximum row norm of the low-rank factors. We vary one of the rank, the dimension, and the noise level while fixing the other parameters. n1=n2=n3=120 when we only vary the rank and the noise level. Figure 2 demonstrates that the relative recovery error increases when the rank increases or the dimension decreases. The results also show that TAPGD has the best performance among all these methods. Moreover, the performance improves when the number of bits increases. Figure 3 shows the comparisons of different algorithms on recovering SVD-tensors. TSVD-APGD outperforms the existing methods when the tensor is a SVD-tensor. We notice that the recovery errors of TAPGD and TSVD-APGD are very close.

Fig. 2 Recovery performance on general tensors. a Relative recovery error when the rank changes. b Relative recovery error when the dimension changes

Fig. 3 Recovery performance on SVD-tensors. a Relative recovery error when the rank changes. b Relative recovery error when the dimension changes

As shown in Fig. 4, when the noise level (the standard deviation σ) increases, the relative recovery error first decreases and then increases. The reason is that the noise is considered part of the quantization process and plays the role of adding measurement uncertainty. The problem without noise (measurement uncertainty) is ill-posed because all the values in one bin are deterministically mapped to the same quantized value.

Fig. 4 Relative recovery error when the noise level changes

5.2 Image data

We test our method on the Extended Yale Face Database B [56, 57]. The dataset contains 192×168 pixel face images of 38 different people. Each person has 64 images with different poses and various illumination conditions. We pick two subjects to form a 192×168×128 three-dimensional tensor. All entries are scaled to [0,1]. We add \(\mathcal {N}\) with i.i.d. entries generated from the Gaussian distribution with mean 0 and standard deviation 0.3. When W=2, \(\omega _{0}^{*} = -\infty,\omega _{1}^{*} = 0.4,\omega _{2}^{*} = \infty \). When W=3, \(\omega _{0}^{*} = -\infty,\omega _{1}^{*} = 0.2,\omega _{2}^{*} = 0.4,\omega _{3}^{*} = \infty \). Figure 5a compares TAPGD with MNC-1bit-TR, the quantized matrix recovery method, and a nonconvex low-rank tensor recovery method named nonconvex regularized tensor (NORT) [30]. Note that MNC-1bit-TR models the quantization process like our approach, while NORT does not model quantization and treats the data as general noisy measurements. We find that our method works well over a wide range of r, and the reported results use r=50. The tolerance rate is set to 0.001 for the matrix recovery method. In the NORT method, we set the hyperparameters as λ=0.1 and θ=5 (these parameters have different meanings from the parameters in our work) and the tolerance rate to 0.0001. In the MNC-1bit-TR method, we use \(r^{\frac {3}{2}}\alpha \) (here α=1) to bound the maximum row norm of the low-rank factors. The figure shows that the relative recovery error decreases when the percentage of observations increases, and TAPGD obtains the best performance among all the methods. Figure 5b compares the recovery error when the bin boundaries are known and unknown to the recovery algorithm. When the boundaries are unknown, the initial point for ω1 is uniformly chosen from [0.1,0.6] when W=2, and the initial points for ω1 and ω2 are uniformly chosen from [0.1,0.3] and [0.2,0.6], respectively, when W=3. αupper and αlow are set to 0.6 and 0.1, respectively. κl is set to 0.1 for all \(l \in [W-1]\).

Fig. 5 a Relative recovery error when the observation rate changes. b Relative recovery error with unknown boundaries

In Fig. 6, we show a boxplot of the relative recovery error over 100 runs obtained by TAPGD. All the setups are the same as the scenario W=3 in Fig. 5a. The bottoms and tops of each “box” are the 25th and 75th percentiles of the samples, respectively. The maximum standard deviation occurs when the observation rate is 0.3 and equals 8.79×10−4. The relative standard deviation, which is defined as the ratio of the standard deviation to the mean, reaches its maximum value of 0.028 when the observation rate is 0.6.

Fig. 6 Relative recovery error when the observation rate changes (W=3, TAPGD)

Figure 7 compares the time cost of TAPGD and MNC-1bit-TR [43] when the number of facial images changes. TAPGD is three orders of magnitude faster than MNC-1bit-TR. Figure 8 visualizes the quantized and recovered images by TAPGD.

Fig. 7 Time cost of TAPGD and MNC-1bit-TR

Fig. 8 a1, b1 Original images. a2, b2 Quantized images (W=2). a3, b3 Recovered images (W=2). a4, b4 Quantized images (W=3). a5, b5 Recovered images (W=3)

5.3 In-car music recommender dataset

Many recommender systems’ ratings from users are highly quantized (such as like or dislike) with many missing entries (e.g., users do not rate all items), while the underlying systems may want to recover real-valued user ratings. Following the same motivations and assumptions as in the quantized matrix [1, 13] and 1-bit tensor [43] works, the quantized measurements are caused by system limitations, and the actual ratings of users lie in the real-valued domain [1, 13, 43]. Moreover, users’ actual ratings are affected by a few factors and thus satisfy the low-rank property [58]. Our method can be used to recover the true underlying real-valued user preferences, thus improving the quality of recommendations. We apply our method to an in-car music recommender dataset [16]. The dataset contains 139 songs with 4012 ratings from 42 users. It has 26 contexts, which include relaxed driving, countryside, happy, and sleepy. The same user may give different scores to the same song under different contexts. A total of 2751 ratings have the corresponding context information, while the remaining 1261 ratings do not. We only use the ratings with context information. An example of three ratings is shown in Table 2.

Table 2 Example of the in-car music recommender dataset [16]

We construct the resulting tensor \(\mathcal {M}\) as {users×songs×contexts}, which is a 42×139×26 tensor. The ratings are quantized to 0,1,2,3,4,5 (we change the ratings to 1,2,3,4,5,6 to distinguish them from missing values). All the locations without ratings are set to zero. We then randomly set 0.362% of the data (20% of the observed data) to zero and let Ωpredict denote the corresponding set of indices. We predict the data with indices belonging to Ωpredict using the remaining 1.448% of the data (80% of the observed data), which is referred to as the training data. In this test, we define the relative recovery error as

$$ \frac{1}{|\Omega_{\text{predict}}|}\sum_{(i_{1},i_{2},i_{3}) \in \Omega_{\text{predict}}}\frac{|\mathcal{M}_{i_{1},i_{2},i_{3}}-\bar{\mathcal{M}}_{i_{1},i_{2},i_{3}}|}{5}, $$
(37)

where \(\tilde {\mathcal {M}}\) is our estimate of the ground truth, and \(\bar {\mathcal {M}}\) maps the values in \(\tilde {\mathcal {M}}\) to their nearest quantized values. The 5 in the denominator is the maximum possible difference between \(\bar {\mathcal {M}}\) and \(\mathcal {M}\); the error increases when the difference increases. Ref. [43] also studies the same dataset, first mapping the multi-level quantized values to binary values, then deleting some binary values and evaluating the recovery error. The smallest recovery error achieved by their method is 0.23. We remark that the multi-level prediction is harder than binary prediction in this application, since the binary case is to choose one out of two numbers, while the multi-level case is to choose one out of W>2 numbers. Here, we estimate the rank r and the noise level σ, since we do not know the actual rank and noise. In Algorithm 1, we choose the estimated rank r from the set {5,10,15,20,25} and the estimated standard deviation σ from the set {0.001,0.01,0.05,0.1,0.15,0.2,0.25}. The recovery results are shown in Fig. 9. The relative recovery error reaches its smallest value, 0.22, when r=5 and σ=0.05. Figure 10 shows the comparison of TAPGD to NORT [30] when the percentage of the training data changes. Note that when the percentage equals one, we use 80% of the observed data. The relative recovery error obtained by NORT is defined in the same way as for TAPGD. We set r=5 and σ=0.1 for TAPGD. In the NORT method, we set the hyperparameters as λ=0.1 and θ=5, and the tolerance rate to 0.0001. The relative recovery error using NORT is about twice as large as that using TAPGD, suggesting that a performance improvement can be achieved when the users’ actual ratings are treated as real-valued.

Fig. 9 Relative recovery error when the estimated noise level and rank change

Fig. 10 Comparison of TAPGD to the tensor recovery method NORT [30]: relative recovery error when the observation rate of the observed data changes

6 Conclusion and discussion

This paper studies the recovery of a low-rank tensor from quantized measurements. A constrained maximum likelihood problem is proposed to estimate the ground-truth tensor. The recovery error is proved to be at most \(O(\sqrt {\frac {r^{K-1}K\log (K)}{n^{K-1}}})\) when the boundaries are known. The recovery error decreases to \(O(\sqrt {\frac {rK\log (K)}{n^{K-1}}})\) when the tensor is a SVD-tensor. When reduced to the special cases of 1-bit tensor recovery and low-rank matrix recovery from quantized measurements, our error bounds are significantly smaller than or match those of the existing methods. We also provide the fundamental limit of the recovery error by any recovery method and show that our method is nearly order-wise optimal. We propose two algorithms, TAPGD and TSVD-APGD, to solve the nonconvex optimization problems. We prove that TAPGD converges to a critical point from any initial point. Both algorithms can handle missing data and do not require knowledge of the quantization rule. Future work includes data recovery when partial measurements contain significant errors and developing algorithms with global optimality guarantees.

7 Appendix 1. Supporting lemmas used in the proof of Theorems 1 and 2

Let \(\langle \mathcal {A}, \mathcal {B} \rangle \) denote the inner product of \(\mathcal {A} \in \mathbb {R}^{n_{1} \times... \times n_{K}}\) and \(\mathcal {B} \in \mathbb {R}^{n_{1} \times... \times n_{K}}\), i.e., the sum of the products of their entries. Then, the spectral norm of a tensor \(\mathcal {X} \in \mathbb {R}^{n_{1} \times... \times n_{K}}\) is defined as

$$ \begin{aligned} \|\mathcal{X}\| &= \sup\{\langle \mathcal{X}, u_{1} \circ u_{2} \circ \cdots \circ u_{K} \rangle: \|u_{k}\|_{2}=1, u_{k} \in \mathbb{R}^{n_{k}}, \forall k \in [K]\} \\ &= \sup_{\|u_{k}\|_{2}=1,\, u_{k} \in \mathbb{R}^{n_{k}},\, \forall k \in [K]} \mathcal{X}(u_{1},u_{2},\dots,u_{K}) \\ &= \sup_{\|u_{k}\|_{2}=1,\, u_{k} \in \mathbb{R}^{n_{k}},\, \forall k \in [K]} \ \sum_{i_{1},i_{2},\dots,i_{K}} \mathcal{X}_{i_{1},i_{2},\dots,i_{K}}\, u_{1i_{1}} u_{2i_{2}} \cdots u_{Ki_{K}}, \end{aligned} $$
(38)

where \(u_{1} \circ u_{2} \circ \cdots \circ u_{K} \in \mathbb {R}^{n_{1} \times... \times n_{K}}\) denotes the outer product of the vectors \(u_{1},\dots,u_{K}\).
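As a numerical illustration of (38), the sketch below estimates the spectral norm of a third-order tensor by alternating maximization over the unit vectors \(u_{1}, u_{2}, u_{3}\) (a higher-order power iteration). This is only a heuristic sketch: it returns a lower bound on the supremum in (38), which is typically tight in practice but is not guaranteed to be the global optimum; function and variable names are illustrative.

```python
import numpy as np

def spectral_norm_3d(X, n_iter=200, seed=0):
    """Estimate ||X|| = sup <X, u1 o u2 o u3> for a 3-way tensor X
    by alternating (higher-order) power iterations."""
    rng = np.random.default_rng(seed)
    n1, n2, n3 = X.shape
    u1 = rng.standard_normal(n1); u1 /= np.linalg.norm(u1)
    u2 = rng.standard_normal(n2); u2 /= np.linalg.norm(u2)
    u3 = rng.standard_normal(n3); u3 /= np.linalg.norm(u3)
    for _ in range(n_iter):
        # With two vectors fixed, the maximizing third vector is the
        # normalized contraction of X with the fixed pair.
        u1 = np.einsum('ijk,j,k->i', X, u2, u3); u1 /= np.linalg.norm(u1)
        u2 = np.einsum('ijk,i,k->j', X, u1, u3); u2 /= np.linalg.norm(u2)
        u3 = np.einsum('ijk,i,j->k', X, u1, u2); u3 /= np.linalg.norm(u3)
    return np.einsum('ijk,i,j,k->', X, u1, u2, u3)

# For a rank-1 tensor a o b o c, the spectral norm is ||a|| ||b|| ||c||,
# which the iteration recovers.
a, b, c = np.ones(4), np.ones(5), np.ones(6)
X = np.einsum('i,j,k->ijk', a, b, c)
print(spectral_norm_3d(X))  # ~ 2 * sqrt(5) * sqrt(6) ≈ 10.95
```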

Lemma 1 provides an upper bound on the spectral norm of a tensor with independent random entries.

Lemma 1

Suppose that \(\mathcal {X} \in \mathbb {R}^{n_{1} \times... \times n_{K}}\) is a K-dimensional tensor whose entries are independent random variables that satisfy, for some \(s^{2}\),

$$ \mathbb{E}[\mathcal{X}_{i_{1},i_{2},...,i_{K}}] = 0, \ \ \mathbb{E}[e^{\epsilon\mathcal{X}_{i_{1},i_{2},...,i_{K}}}]\le e^{s^{2}\epsilon^{2}/2}, \ \ a.s. $$
(39)

Then

$$ P(\|\mathcal{X}\| \ge \mu) \le \delta $$
(40)

for some \(\delta \in [0,1]\), where

$$ \mu = \sqrt{8s^{2}\left(\left(\sum_{k=1}^{K}n_{k}\right)\log(4K/3)+\log(2/\delta)\right)}. $$
(41)

Proof

The proof is completed by combining Lemma 1 and Theorem 1 in [59]. □

We first define \(F(\mathcal {X})\) as the value of \(F_{\Omega }(\mathcal {X},\omega _{1},\omega _{2},\cdots,\omega _{W-1})\) under full observation with the boundaries \(\omega_{l}, l \in [W-1]\), known. Specifically,

$$ \begin{aligned} &F(\mathcal{X}) \\&= -\sum_{i_{1}=1}^{n_{1}}\sum_{i_{2}=1}^{n_{2}}\cdots \sum_{i_{K}=1}^{n_{K}} \sum_{l=1}^{W} \\&~~~~\boldsymbol{1}_{[\mathcal{Y}_{i_{1},i_{2},...,i_{K}}=l]}\log(f_{l}(\mathcal{X}_{i_{1},i_{2},...,i_{K}},\omega_{l-1}^{*}, \omega_{l}^{*})). \end{aligned} $$
(42)
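For concreteness, the sketch below evaluates (42) for the additive Gaussian noise model, where \(f_{l}(x,\omega_{l-1},\omega_{l}) = \Phi(\omega_{l}-x)-\Phi(\omega_{l-1}-x)\) and \(\Phi\) is the CDF of \(N(0,\sigma^{2})\), which is consistent with the \(\Phi\)-based expressions used in Appendix 5; this specific model and the variable names are assumptions of the sketch, not a verbatim reproduction of the paper's code.

```python
import numpy as np
from scipy.stats import norm

def neg_log_likelihood(X, Y, omega, sigma):
    """Evaluate F(X) in (42) under an additive Gaussian noise model (assumption):
    f_l(x) = Phi(omega_l - x) - Phi(omega_{l-1} - x), Phi the CDF of N(0, sigma^2).

    X     : real-valued tensor (ndarray), the candidate estimate
    Y     : integer tensor with quantized labels in {1, ..., W}
    omega : array [omega_0=-inf, omega_1, ..., omega_{W-1}, omega_W=+inf]
    """
    upper = omega[Y]        # omega_l      for each entry (Y in 1..W)
    lower = omega[Y - 1]    # omega_{l-1}
    prob = norm.cdf(upper - X, scale=sigma) - norm.cdf(lower - X, scale=sigma)
    return -np.sum(np.log(np.maximum(prob, 1e-12)))  # clipped for numerical stability

# Small usage example with W = 3 quantization levels.
omega = np.array([-np.inf, -0.5, 0.5, np.inf])
X = np.zeros((2, 3, 4))
Y = np.random.default_rng(0).integers(1, 4, size=X.shape)
print(neg_log_likelihood(X, Y, omega, sigma=0.3))
```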

Lemma 2

With probability at least 1−δ,

$$ \begin{aligned} &\|\nabla_{X}F(\mathcal{X}^{*})\| \\&\le \sqrt{8L_{\alpha}^{2}\left(\left(\sum_{k=1}^{K}n_{k}\right)\log(4K/3)+\log(2/\delta)\right)} \end{aligned} $$
(43)

Proof

Consider

$$\begin{aligned} \mathcal{Z}_{i_{1},i_{2},...,i_{K}}&:=[\nabla_{\mathcal{X}}F(\mathcal{X}^{*})]_{i_{1},i_{2},...,i_{K}} \\&= -\sum_{l=1}^{W}\frac{\dot{f}_{l}(\mathcal{X}^{*}_{i_{1},i_{2},...,i_{K}})}{f_{l}(\mathcal{X}^{*}_{i_{1},i_{2},...,i_{K}})}\boldsymbol{1}_{[Y_{i_{1},i_{2},...,i_{K}}=l]}. \end{aligned} $$

Recall that the probability that \(\mathcal {Y}_{i_{1},i_{2},\dots,i_{K}} = l\) given \(\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}^{*}\) is \(f_{l}(\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}^{*})\), which holds only at the ground truth \(\mathcal {X}^{*}\). Then, using the fact that \(\sum _{l=1}^{W} f_{l}(\mathcal {X}^{*}_{i_{1},i_{2},...,i_{K}})=1\) (differentiating this identity gives \(\sum_{l=1}^{W}\dot{f}_{l}(x)=0\)), we have \(\mathbb {E}[\mathcal {Z}_{i_{1},i_{2},...,i_{K}}]=0\) and \(-L_{\alpha } \le \mathcal {Z}_{i_{1},i_{2},...,i_{K}}\le L_{\alpha }\). By Hoeffding’s lemma, we obtain \(\mathbb {E}[e^{\epsilon \mathcal{Z}_{i_{1},i_{2},...,i_{K}}}]\le e^{(L_{\alpha } + L_{\alpha })^{2} \epsilon ^{2}/8} = e^{L_{\alpha }^{2} \epsilon ^{2}/2}\). Replacing s with Lα in Lemma 1 completes the proof. □

Lemma 3 and Lemma 4 describe the relation of \(\mathcal {X}^{*}\) with any data in the feasible set \(\mathcal {S}_{f}\) and \(\mathcal {S}_{fs}\). Considering any \(\mathcal {X}' \in \mathcal {S}_{f}\) and \(\mathcal {X}' \in \mathcal {S}_{fs}\), we can calculate the second-order Taylor expansion of \(F(\mathcal {X}')\) at \(\mathcal {X}^{*}\). Both lemmas indicate that the absolute value of the first-order term of the Taylor expansion can always be upper bounded by a term related to \(\|\mathcal {X}'-\mathcal {X}^{*}\|_{F}\).

Lemma 3

Let \(\theta '=\text {vec}(\mathcal {X}')\), \(\theta ^{*}=\text {vec}(\mathcal {X}^{*})\), \(\nabla _{\theta }F(\theta ^{*})=\text {vec}(\nabla _{\mathcal {X}}F(\mathcal {X}^{*}))\), and \(\mathcal {X}',\mathcal {X}^{*} \in \mathcal {S}_{f}\). Then with probability at least 1−δ,

$$ \begin{aligned} & |\left \langle \nabla_{\theta}F(\theta^{*}),\theta'-\theta^{*} \right \rangle| \le \\&\sqrt{16r^{K-1}L_{\alpha}^{2}((\sum_{k=1}^{K}n_{k})\log(\frac{4K}{3})+\log(\frac{2}{\delta}))}\|\mathcal{X}'-\mathcal{X}^{*}\|_{F} \end{aligned} $$
(44)

Proof

The tensor nuclear norm \(\|\mathcal {X}\|_{*}\) is defined as

$$ \begin{aligned} \|\mathcal{X}\|_{*} = \text{inf}&\{\sum_{i=1}^{r}|\zeta_{i}|: \mathcal{X} = \sum_{i=1}^{r}\zeta_{i} \mathbf{V_{1}}_{i} \circ \mathbf{V_{2}}_{i} \circ \dots \circ \mathbf{V_{K}}_{i}, \\& \|\mathbf{V_{k}}_{i}\|_{2}=1 \} \end{aligned} $$
(45)

According to Theorem 9.4 of [60], \(\|\mathcal {X}\|_{*}\) satisfies \(\|\mathcal {X}\|_{*} \le \sqrt {\frac {r_{1}r_{2}r_{3}}{\max (r_{1}, r_{2}, r_{3})}}\|\mathcal {X}\|_{F}\) when K=3, where \(r_{k}, k \in [K]\), is the k-rank of the tensor \(\mathcal {X}\), defined as the column rank of \(\mathbf{X}_{(k)}\). A generalization to any K is as follows:

$$ \begin{aligned} &\|\mathcal{X}\|_{*} \le \sqrt{\frac{\prod_{i=1}^{K}r_{i}}{\max(r_{1}, r_{2}, \cdots, r_{K})}}\|\mathcal{X}\|_{F}. \end{aligned} $$
(46)

The details can be found in [61]. Note that \(r_{k} \le r, \forall k \in [K]\), since \(\mathbf {X}_{(k)} = \mathbf {A_{k}}(\mathbf {A_{K}}\odot \dots \odot \mathbf {A_{k+1}}\odot \mathbf {A_{k-1}}\odot \dots \odot \mathbf {A_{1}})^{T}\). Therefore,

$$ \begin{aligned} &\|\mathcal{X}'-\mathcal{X}^{*}\|_{*} \le \sqrt{2r^{K-1}}\|\mathcal{X}'-\mathcal{X}^{*}\|_{F}. \end{aligned} $$
(47)

where the inequality holds because \(\|\cdot \|_{*} \le \sqrt {r}\|\cdot \|_{F}\) for any rank-\(r\) matrix. We then have

$$ \begin{aligned} &\|\nabla_{\mathcal{X}}F(\mathcal{X}^{*})\|\|\mathcal{X}'-\mathcal{X}^{*}\|_{*} \le \\& \sqrt{16r^{K-1}L_{\alpha}^{2}\left(\left(\sum_{k=1}^{K}n_{k}\right)\log\left(\frac{4K}{3}\right)+\log\left(\frac{2}{\delta}\right)\right)}\|\mathcal{X}'-\mathcal{X}^{*}\|_{F} \end{aligned} $$
(48)

holds with probability at least 1−δ. Then,

$$\begin{aligned} |\left \langle \nabla_{\theta}F(\theta^{*}),\theta'-\theta^{*} \right \rangle| &= |\left \langle \nabla_{\mathcal{X}}F(\mathcal{X}^{*}),\mathcal{X}'-\mathcal{X}^{*} \right \rangle| \\ &\le \|\nabla_{\mathcal{X}}F(\mathcal{X}^{*})\|\|\mathcal{X}'-\mathcal{X}^{*}\|_{*} \\ &\le \sqrt{16r^{K-1}L_{\alpha}^{2}\left(\left(\sum_{k=1}^{K}n_{k}\right)\log\left(\frac{4K}{3}\right)+\log\left(\frac{2}{\delta}\right)\right)}\|\mathcal{X}'-\mathcal{X}^{*}\|_{F} \end{aligned} $$

holds with probability at least 1−δ. The first inequality comes from the fact that \(|\left \langle \mathcal {A}, \mathcal {B} \right \rangle | \le \|\mathcal {A}\|\|\mathcal {B}\|_{*}\) for two tensors \(\mathcal {A}\) and \(\mathcal {B}\) [60], and the second inequality comes from (48). We then have the desired result. □

Lemma 4

Let \(\theta '=\text {vec}(\mathcal {X}')\), \(\theta ^{*}=\text {vec}(\mathcal {X}^{*})\), \(\nabla _{\theta }F(\theta ^{*})=\text {vec}(\nabla _{\mathcal {X}}F(\mathcal {X}^{*}))\), and \(\mathcal {X}',\mathcal {X}^{*} \in \mathcal {S}_{fs}\). Then with probability at least 1−δ,

$$ \begin{aligned} & |\left \langle \nabla_{\theta}F(\theta^{*}),\theta'-\theta^{*} \right \rangle| \le \\&\sqrt{16rL_{\alpha}^{2}\left(\left(\sum_{k=1}^{K}n_{k}\right)\log\left(\frac{4K}{3}\right)+\log\left(\frac{2}{\delta}\right)\right)}\|\mathcal{X}'-\mathcal{X}^{*}\|_{F}, \end{aligned} $$
(49)

Proof

Let \(\mathcal {T}_{i}\) denote \(\mathbf {V_{1}}_{i} \circ \mathbf {V_{2}}_{i} \circ \dots \circ \mathbf {V_{K}}_{i}\) in (2). One can easily check that \(\langle \mathcal {T}_{i}, \mathcal {T}_{j} \rangle = 0\) and \(\langle \mathcal {X}, \mathcal {T}_{i} \rangle = \zeta _{i}\) for \(i, j \in [r], i\neq j\). Then \(\|\mathcal {X}\|_{F} = \sqrt {\sum _{i=1}^{r} \zeta _{i}^{2}}\), and (45) implies that \(\|\mathcal {X}\|_{*} \le \sum _{i=1}^{r} |\zeta _{i}|\). From the Cauchy–Schwarz inequality, we have \(\|\mathcal {X}\|_{*} \le \sqrt {r}\|\mathcal {X}\|_{F}\). Therefore,

$$ \begin{aligned} &\|\mathcal{X}'-\mathcal{X}^{*}\|_{*} \le \sqrt{2r}\|\mathcal{X}'-\mathcal{X}^{*}\|_{F}. \end{aligned} $$
(50)

Following the same proof technique of (48), we have the desired result. □

Lemma 5 provides a lower bound on the second-order term of the second-order Taylor expansion. This lower bound is also related to \(\|\mathcal {X}'-\mathcal {X}^{*}\|_{F}\).

Lemma 5

Let \(\theta '=\text {vec}(\mathcal {X}')\), \(\theta ^{*}=\text {vec}(\mathcal {X}^{*})\), and \(\mathcal {X}',\mathcal {X}^{*} \in \mathcal {S}_{f}\). Then for any \(\tilde {\theta }=\theta ^{*}+\eta (\theta '-\theta ^{*})\) with \(\eta \in [0,1]\), we have

$$ \left \langle \theta'-\theta^{*}, (\nabla^{2}_{\theta\theta}F(\tilde{\theta}))(\theta'-\theta^{*})\right \rangle \ge \gamma_{\alpha}\|\mathcal{X}'-\mathcal{X}^{*}\|_{F}^{2}. $$
(51)

Proof

Lemma 5 is an extension of Lemma 7 in [13].

Using (42), it follows that

$$\begin{aligned} &\frac{\partial^{2} F(\mathcal{X})}{\partial^{2} \mathcal{X}_{i_{1},i_{2},...,i_{K}}}= \\& \sum_{l=1}^{W}(\frac{\dot{f}_{l}^{2}(\mathcal{X}_{i_{1},i_{2},...,i_{K}})}{f_{l}^{2}(\mathcal{X}_{i_{1},i_{2},...,i_{K}})}-\frac{\ddot{f}_{l}(\mathcal{X}_{i_{1},i_{2},...,i_{K}})}{f_{l}(\mathcal{X}_{i_{1},i_{2},...,i_{K}})})\boldsymbol{1}_{[\mathcal{Y}_{i_{1},i_{2},...,i_{K}}=l]} \end{aligned} $$

Then, we have

$$\begin{aligned} &\left \langle \theta'-\theta^{*}, (\nabla^{2}_{\theta\theta}F(\tilde{\theta}))(\theta'-\theta^{*})\right \rangle \\&= {\sum_{i_{1}=1}^{n_{1}}}\cdots{\sum_{i_{K}=1}^{n_{K}}}\frac{\partial^{2} F(\tilde{\mathcal{X}})}{\partial^{2} \mathcal{X}_{i_{1},i_{2},...,i_{K}}}(\mathcal{X}'_{i_{1},i_{2},...,i_{K}} - \mathcal{X}^{*}_{i_{1},i_{2},...,i_{K}})^{2} \\&\ge \gamma_{\alpha} {\sum_{i_{1}=1}^{n_{1}}}\cdots{\sum_{i_{K}=1}^{n_{K}}} (\mathcal{X}'_{i_{1},i_{2},...,i_{K}} - \mathcal{X}_{i_{1},i_{2},...,i_{K}}^{*})^{2} \\& = \gamma_{\alpha} \|\mathcal{X}' - \mathcal{X}^{*}\|_{F}^{2} \end{aligned} $$

where the inequality comes from the definition \(\gamma _{\alpha } = \min _{l\in [W]}\inf _{|x|\le 2\alpha }\left \{\frac {\dot {f}_{l}^{2}(x)}{f_{l}^{2}(x)}-\frac {\ddot {f}_{l}(x)}{f_{l}(x)}\right \}\) and the fact that the entries of \(\tilde{\mathcal{X}}\) lie in \([-\alpha,\alpha] \subseteq [-2\alpha,2\alpha]\). □
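As a numerical sanity check on the constant \(\gamma_{\alpha}\), the sketch below evaluates \(\dot{f}_{l}^{2}/f_{l}^{2}-\ddot{f}_{l}/f_{l}\) on a grid of \(|x|\le 2\alpha\) for the Gaussian noise model and takes the minimum over levels. It is only an approximation of the infimum (finite grid, finite-difference derivatives), and the boundaries, σ, and α below are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

# Gaussian noise model (assumption): f_l(x) = Phi(omega_l - x) - Phi(omega_{l-1} - x),
# with Phi the CDF of N(0, sigma^2).  Parameters below are illustrative.
sigma, alpha = 0.5, 1.0
omega = np.array([-np.inf, -0.6, 0.6, np.inf])   # W = 3 levels
W = len(omega) - 1

def f(l, x):
    return norm.cdf(omega[l] - x, scale=sigma) - norm.cdf(omega[l - 1] - x, scale=sigma)

def gamma_alpha(num=2001, h=1e-4):
    """Approximate min_l inf_{|x|<=2 alpha} (f_l'^2/f_l^2 - f_l''/f_l)
    with central finite differences on a grid (numerical approximation only)."""
    xs = np.linspace(-2 * alpha, 2 * alpha, num)
    vals = []
    for l in range(1, W + 1):
        fl = f(l, xs)
        d1 = (f(l, xs + h) - f(l, xs - h)) / (2 * h)           # first derivative
        d2 = (f(l, xs + h) - 2 * fl + f(l, xs - h)) / h ** 2    # second derivative
        vals.append(np.min(d1 ** 2 / fl ** 2 - d2 / fl))
    return min(vals)

print(gamma_alpha())   # positive for this model, as required by Lemma 5
```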

8 Appendix 2. Proofs of Theorems 1 and 2

Proof

The first bound 2α follows from the fact that \(\|\hat {\mathcal {X}}\|_{\infty },\|\mathcal {X}^{*}\|_{\infty } \le \alpha \). We have

$$ \begin{aligned} &\|\hat{\mathcal{X}}-\mathcal{X}^{*}\|_{F}/\sqrt{n_{1}n_{2}...n_{K}} \\& \le 2\alpha\sqrt{n_{1}n_{2}...n_{K}}/\sqrt{n_{1}n_{2}...n_{K}} = 2\alpha. \end{aligned} $$
(52)

Let \(\hat {\theta }=\text {vec}(\hat {\mathcal {X}})\) and define \(F(\hat {\theta })=F(\hat {\mathcal {X}})\). By the second-order Taylor theorem, we have

$$ \begin{aligned} F(\hat{\theta}) = &F(\theta^{*}) + \left \langle \nabla_{\theta}F(\theta^{*}),\hat{\theta}-\theta^{*} \right \rangle\\& + \frac{1}{2}\left \langle \hat{\theta}-\theta^{*}, (\nabla^{2}_{\theta\theta}F(\tilde{\theta}))(\hat{\theta}-\theta^{*}) \right \rangle, \end{aligned} $$
(53)

where \(\tilde {\theta }=\theta ^{*}+\eta (\hat {\theta }-\theta ^{*})\) for some \(\eta \in [0,1]\), with the corresponding tensor \(\tilde {\mathcal {X}}=\mathcal {X}^{*}+\eta (\hat {\mathcal {X}}-\mathcal {X}^{*})\).

Using the results of Lemma 3 and Lemma 5, we can obtain that

$$ \begin{aligned} & F(\hat{\mathcal{X}}) \ge F(\mathcal{X}^{*})-\\&\sqrt{16r^{K-1}L_{\alpha}^{2}((\sum_{k=1}^{K}n_{k})\log(\frac{4K}{3})+\log(\frac{2}{\delta}))}\|\hat{\mathcal{X}}-\mathcal{X}^{*}\|_{F} \\&+ \frac{\gamma_{\alpha}}{2}\|\hat{\mathcal{X}}-\mathcal{X}^{*}\|_{F}^{2} \end{aligned} $$
(54)

holds with probability at least 1−δ for \(\hat {\mathcal {X}},\mathcal {X}^{*} \in \mathcal {S}_{f}\). Note that \(\hat {\mathcal {X}}\) is a global minimizer of the optimization problem. Thus, \(F(\hat {\mathcal {X}}) \le F(\mathcal {X}^{*})\). We then have

$$ \begin{aligned} & \frac{\gamma_{\alpha}}{2}\|\hat{\mathcal{X}}-\mathcal{X}^{*}\|_{F}^{2} \le\\& \sqrt{16r^{K-1}L_{\alpha}^{2}\left(\left(\sum_{k=1}^{K}n_{k}\right)\log\left(\frac{4K}{3}\right)+\log(\frac{2}{\delta})\right)}\|\hat{\mathcal{X}}-\mathcal{X}^{*}\|_{F} \end{aligned} $$
(55)

holds with probability at least 1−δ. Thus,

$$ \|\hat{\mathcal{X}}-\mathcal{X}^{*}\|_{F}/\sqrt{n_{1}n_{2}...n_{K}} \le U_{\alpha} $$
(56)

holds with the same probability 1−δ, where

$$ \begin{aligned} &U_{\alpha} = \frac{\sqrt{64r^{K-1}L_{\alpha}^{2}((\sum_{k=1}^{K}n_{k})\log(\frac{4K}{3})+\log(\frac{2}{\delta}))}}{\gamma_{\alpha} \sqrt{n_{1}n_{2}...n_{K}}}. \end{aligned} $$
(57)

Similarly, using the results of Lemma 4 and Lemma 5, we can obtain that

$$ \begin{aligned} & F(\hat{\mathcal{X}}) \ge F(\mathcal{X}^{*})\\&-\sqrt{16rL_{\alpha}^{2}((\sum_{k=1}^{K}n_{k})\log(\frac{4K}{3})+\log(\frac{2}{\delta}))}\|\hat{\mathcal{X}}-\mathcal{X}^{*}\|_{F} \\&+ \frac{\gamma_{\alpha}}{2}\|\hat{\mathcal{X}}-\mathcal{X}^{*}\|_{F}^{2} \end{aligned} $$
(58)

holds with probability at least 1−δ for \(\hat {\mathcal {X}},\mathcal {X}^{*} \in \mathcal {S}_{fs}\). Following the same process as (55)–(57), we can obtain

$$ \|\hat{\mathcal{X}}-\mathcal{X}^{*}\|_{F}/\sqrt{n_{1}n_{2}...n_{K}} \le U_{\alpha}' $$
(59)

and

$$ \begin{aligned} &U_{\alpha}' = \frac{\sqrt{64rL_{\alpha}^{2}((\sum_{k=1}^{K}n_{k})\log(\frac{4K}{3})+\log(\frac{2}{\delta}))}}{\gamma_{\alpha} \sqrt{n_{1}n_{2}...n_{K}}}. \end{aligned} $$
(60)

Combining (52) and (56), (52) and (59), we have the results of Theorem 1 and Theorem 2, respectively. □

9 Appendix 3. Supporting lemmas used in the proof of Theorem 3

Lemma 6

Let ς≤1. There is a set \(\mathcal {S}_{X}\subset \mathcal {S}_{f}\) with

$$ |\mathcal{S}_{X}| \ge \exp(\frac{rn_{\text{max}}}{16}) $$
(61)

with the following properties:

1. For all \(\mathcal {X}\in \mathcal {S}_{X},|\mathcal {X}_{i_{1},i_{2}, \dots,i_{K}}|=\alpha \varsigma,\forall i_{1},i_{2}, \dots,i_{K}\)

2. For all \(\mathcal {X}^{(i)},\mathcal {X}^{(j)}\in \mathcal {S}_{X}\), ij,

$$ \|\mathcal{X}^{(i)}-\mathcal{X}^{(j)}\|_{F}^{2} > \alpha^{2}\varsigma^{2}(\frac{n_{1}n_{2}\cdots n_{K}}{2}). $$
(62)

3. For any \(\mathcal {X} \in \mathcal {S}_{X}\) and \(\mathcal {Y} = \mathcal {X} + \mathcal {N}\), we can bound the mutual information with the following inequality

$$ I(\mathcal{X},\mathcal{Y}) \le \frac{n_{1}n_{2} \cdots n_{K}}{2}\log(1+(\frac{\alpha\varsigma}{\sigma})^{2}) $$
(63)

Proof

Without loss of generality, we assume n1=nmax. We first construct a matrix \(\mathbf {D} \in \mathbb {R}^{n_{1} \times n_{2}}\) with rank at most r in the following way. The entries \(\mathbf{D}_{i,j}, i \in [n_{1}], j \in [r]\), are i.i.d. symmetric random variables taking values ±ας. We then construct the remaining columns of D as follows.

$$ \mathbf{D}_{i,j} := \mathbf{D}_{i,j'}, \text{ where } j'=j\ (\mathrm{mod}\ r)+1, \quad \forall j > r. $$
(64)

The matrix D thus consists of repeated \(n_{1} \times r\) blocks. Note that D can be decomposed as \(\sum _{i=1}^{r}\zeta _{i} \mathbf {V_{1}}_{i}\circ \mathbf {V_{2}}_{i}\). We then construct a tensor \(\mathbf {D} \circ I_{n_{3}} \circ I_{n_{4}} \circ \cdots \circ I_{n_{K}}=\sum _{i=1}^{r}\zeta _{i} \mathbf {V_{1}}_{i}\circ \mathbf {V_{2}}_{i} \circ I_{n_{3}} \circ I_{n_{4}} \circ \cdots \circ I_{n_{K}}\), where \(I_{n_{k}} \in \mathbb {R}^{n_{k}}\) is the all-ones vector. Therefore, the CP rank of this tensor is at most r. One can easily check that the matrix D is copied along dimensions 3 to K. By varying D, we obtain a set of low-rank tensors \(\mathcal {S}_{X}\). For any \(\mathcal {X}^{(i)}, \mathcal {X}^{(j)}\in \mathcal {S}_{X}\), we have

$$ \begin{aligned} &\|\mathcal{X}^{(i)}-\mathcal{X}^{(j)}\|_{F}^{2}\\& = \sum_{i_{1},i_{2},\cdots,i_{K}}(\mathcal{X}^{(i)}_{i_{1},i_{2},\cdots,i_{K}}-\mathcal{X}^{(j)}_{i_{1},i_{2},\cdots,i_{K}})^{2}\\& \ge n_{3}n_{4}\cdots n_{K}\lfloor\frac{n_{2}}{r}\rfloor \cdot\\&\sum_{i_{1}\in [n_{1}]}\sum_{i_{2}\in[r]}(\mathcal{X}^{(i)}_{i_{1},i_{2},\cdots,i_{K}}-\mathcal{X}^{(j)}_{i_{1},i_{2},\cdots,i_{K}})^{2}\\&= 4\alpha^{2}\varsigma^{2} n_{3}n_{4}\cdots n_{K}\lfloor\frac{n_{2}}{r}\rfloor \sum_{i=1}^{rn_{1}}\delta_{i}, \end{aligned} $$
(65)

where the δi’s are independent random variables taking values in {0,1} with mean 1/2. We then have

$$ \begin{aligned} & P(\min_{\mathcal{X}^{(i)} \neq \mathcal{X}^{(j)} \in \mathcal{S}_{X}}\|\mathcal{X}^{(i)}-\mathcal{X}^{(j)}\|_{F}^{2} \le \\&\alpha^{2}\varsigma^{2} n_{3}n_{4}\cdots n_{K}\lfloor\frac{n_{2}}{r}\rfloor rn_{1}) \\&\le \dbinom{|\mathcal{S}_{X}|}{2}\exp(-\frac{rn_{\text{max}}}{8}). \end{aligned} $$
(66)

Equation (66) follows from Hoeffding’s inequality and the union bound. Note that the right-hand side of (66) is less than 1 when \(|\mathcal{S}_{X}|\) is of the size given in (61). Thus, the event that

$$ \begin{aligned} &\|\mathcal{X}^{(i)}-\mathcal{X}^{(j)}\|_{F}^{2} > \alpha^{2}\varsigma^{2} n_{3}n_{4}\cdots n_{K}\lfloor\frac{n_{2}}{r}\rfloor rn_{1} \\&\ge \alpha^{2}\varsigma^{2}\frac{n_{1}n_{2}\cdots n_{K}}{2} \end{aligned} $$
(67)

for all \(\mathcal {X}^{(i)} \neq \mathcal {X}^{(j)} \in \mathcal {S}_{X}\) has nonzero probability, where the second inequality comes from the fact that \(\lfloor x \rfloor \ge x/2\) for all x≥1.

The third property follows from a modification of Lemma A.5 in [1]: replacing the matrix dimensions with the tensor dimensions yields the desired result. □
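To illustrate the construction in the proof of Lemma 6, the sketch below builds one random instance of the base matrix D (entries ±ας, columns copied from the first r columns) and lifts it to a K-way tensor by copying along the remaining dimensions. It is a minimal sketch with illustrative names and K=3 in the usage example.

```python
import numpy as np

def packing_tensor(n, r, alpha, varsigma, seed=0):
    """One element of the packing set S_X for dimensions n = (n1, ..., nK):
    a matrix D of rank at most r with +/- (alpha*varsigma) entries whose columns
    repeat those of the first n1 x r block, copied along dimensions 3..K."""
    rng = np.random.default_rng(seed)
    n1, n2 = n[0], n[1]
    # i.i.d. +/- (alpha * varsigma) entries in the first r columns.
    block = alpha * varsigma * rng.choice([-1.0, 1.0], size=(n1, r))
    # Every column of D is a copy of one of the first r columns.
    D = block[:, [j % r for j in range(n2)]]
    # Copy D along the remaining K-2 dimensions (outer product with all-ones vectors).
    X = D
    for nk in n[2:]:
        X = np.multiply.outer(X, np.ones(nk))
    return X  # CP rank of X is at most r

X = packing_tensor((8, 9, 4), r=3, alpha=1.0, varsigma=0.5)
print(X.shape)                              # (8, 9, 4)
print(np.allclose(X[..., 0], X[..., 1]))    # True: D is copied along dimension 3
```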

10 Appendix 4. Proof of Theorem 3

Proof

We first define ε as follows

$$ \varepsilon^{2} = \min\{\frac{\alpha^{2}}{16}, C_{1}^{2}\sigma^{2}\frac{rn_{\text{max}}-64}{n_{1}n_{2} \cdots n_{K}}\} $$
(68)

where C1 is a constant to be determined later. We consider ς in the range

$$ \frac{2\sqrt{2}\varepsilon}{\alpha} \le \varsigma \le \frac{4\varepsilon}{\alpha} \le 1. $$
(69)

We consider running an arbitrary recovery algorithm on the set \(\mathcal {S}_{X}\). Suppose, for the sake of contradiction, that there exists an algorithm that, for any \(\mathcal {X} \in \mathcal {S}_{f}\), given \(\mathcal{Y}\), returns an \(\hat {\mathcal {X}}\) such that

$$ \|\mathcal{X}-\hat{\mathcal{X}}\|^{2}_{F}/(n_{1}n_{2}\cdots n_{K}) \le \varepsilon^{2} $$
(70)

with probability at least 1/4. We will show that if \(\mathcal {X}^{*} = \arg \min _{\mathcal {X}'\in \mathcal {S}_{X}}\|\mathcal {X}'-\hat {\mathcal {X}}\|^{2}_{F}\), then \(\mathcal {X}^{*}=\mathcal {X}\). Based on (62) and (69), for any \(\mathcal {X}' \in \mathcal {S}_{X}\) with \(\mathcal {X}' \neq \mathcal {X}\), we have

$$ \|\mathcal{X}'-\mathcal{X}\|_{F} > \alpha\varsigma\sqrt{n_{1}n_{2}\cdots n_{K}/2} \ge 2\sqrt{n_{1}n_{2}\cdots n_{K}}\varepsilon. $$
(71)

Combining (70) and (71), we then have

$$ \begin{aligned} \|\mathcal{X}'-\hat{\mathcal{X}}\|_{F} &= \|\mathcal{X}'-\mathcal{X}+\mathcal{X}-\hat{\mathcal{X}}\|_{F}\\ & \ge \|\mathcal{X}'-\mathcal{X}\|_{F}-\|\mathcal{X}-\hat{\mathcal{X}}\|_{F} \\ &\ge 2\sqrt{n_{1}n_{2}\cdots n_{K}}\varepsilon - \sqrt{n_{1}n_{2}\cdots n_{K}}\varepsilon \\ &= \sqrt{n_{1}n_{2}\cdots n_{K}}\varepsilon. \end{aligned} $$
(72)

Since \(\mathcal {X} \in \mathcal {S}_{X}\) is also a candidate for \(\mathcal {X}^{*}\), we have

$$ \|\mathcal{X}^{*}-\hat{\mathcal{X}}\|_{F} \le \|\mathcal{X}-\hat{\mathcal{X}}\|_{F} \le \sqrt{n_{1}n_{2}\cdots n_{K}}\varepsilon. $$
(73)

Thus, if (70) holds, then \(\|\mathcal {X}^{*}-\hat {\mathcal {X}}\|_{F} < \|\mathcal {X}'-\hat {\mathcal {X}}\|_{F}\) for any \(\mathcal {X}' \in \mathcal {S}_{X}\) with \(\mathcal {X}' \neq \mathcal {X}\), and hence, we must have \(\mathcal {X}^{*} = \mathcal {X}\). By assumption, (70) holds with probability at least 1/4, and thus \(P(\mathcal {X} \neq \mathcal {X}^{*}) \le 3/4\). However, by Fano’s inequality, the probability that \(\mathcal {X} \neq \mathcal {X}^{*}\) is at least

$$ \begin{aligned} & P(\mathcal{X} \neq \mathcal{X}^{*}) \ge \frac{H(\mathcal{X}|\mathcal{Y})-1}{\log|\mathcal{S}_{X}|}\\ & = \frac{H(\mathcal{X})-I(\mathcal{X},\mathcal{Y})-1}{\log|\mathcal{S}_{X}|} \ge 1 - \frac{I(\mathcal{X},\mathcal{Y})+1}{\log|\mathcal{S}_{X}|}. \end{aligned} $$
(74)

Combining \(|\mathcal {S}_{X}|\) and \(I(\mathcal {X},\mathcal {Y})\) from Lemma 6, and using the inequality log(1+z)≤z, we obtain

$$ P(\mathcal{X} \neq \mathcal{X}^{*}) \ge 1- \frac{16}{rn_{\text{max}}} \left(\frac{n_{1}n_{2}\cdots n_{K}}{2}\left(\frac{\alpha\varsigma}{\sigma}\right)^{2}+1\right). $$
(75)

Combining (75) with (69), we obtain

$$ \frac{16}{rn_{\text{max}}} (8n_{1}n_{2}\cdots n_{K}(\frac{\varepsilon}{\sigma})^{2}+1) \ge \frac{1}{4}, $$
(76)

which implies that

$$ \varepsilon^{2} \ge \frac{\sigma^{2}}{512} \frac{rn_{\text{max}}-64}{n_{1}n_{2}\cdots n_{K}}. $$
(77)

Setting \(C_{1}^{2} < \frac {1}{512}\) will lead to a contradiction; hence, (70) must fail to hold with probability at least 3/4. This finishes the proof. □

11 Appendix 5. TAPGD: proof of the Lipschitz differential property and calculation of Lipschitz constants

We provide the Lipschitz differential property of H and compute the corresponding Lipschitz constants of its partial derivatives with respect to \(\mathbf {A_{k}}\in \mathbb {R}^{n_{k} \times r}, \forall k \in [K]\), \(\mathcal {X} \in \mathbb {R}^{n_{1} \times n_{2} \times \dots \times n_{K}}\), and \(\omega_{l}, l \in [W-1]\). We call a function Lipschitz differentiable if and only if all of its partial derivatives are Lipschitz continuous. The Lipschitz continuity of a function’s partial derivatives is defined in Definition 1.

Definition 1

[54] For any variable y and any function \(\Upsilon(y,z_{1},z_{2},...,z_{n})\), with the other variables \(z_{1},z_{2},...,z_{n}\) fixed, the partial derivative \(\nabla_{y}\Upsilon(y,z_{1},z_{2},\dots,z_{n})\) is said to be Lipschitz continuous with Lipschitz constant \(L_{p}(z_{1},z_{2},...,z_{n})\) if the following relation holds

$$ \begin{aligned} &\| \nabla_{y} \Upsilon(y_{1},z_{1},z_{2},...,z_{n}) - \nabla_{y} \Upsilon(y_{2},z_{1},z_{2},...,z_{n}) \|_{F} \\& \le L_{p}(z_{1},z_{2},...,z_{n}) \| y_{1} - y_{2} \|_{F},~~ \forall y_{1},y_{2}. \end{aligned} $$

Let \(L^{t+1}_{\mathbf {A_{k}}},\forall k \in [K],L^{t+1}_{\mathcal {X}}\), and \(L^{t+1}_{\omega _{l}}, \forall l \in [W-1]\) denote the smallest Lipschitz constants of \(\nabla _{\mathbf {A_{k}}} H,\nabla _{\mathcal {X}} H\), and \(\nabla _{\omega _{l}} H\) in the (t+1)-th iteration. The details of the calculation are shown in (78), (81), and (82).

$$ \begin{aligned} &\| \nabla_{\mathbf{A_{k}}} H(\mathbf{A_{k}}) - \nabla_{\mathbf{A_{k}}} H(\mathbf{A_{k}}') \|_{F} \\ & = \| (\mathbf{A_{k}}(\mathbf{B_{k}}^{t})^{T} - \mathcal{X}_{(k)}) \mathbf{B_{k}}^{t} \\&~~~~~ - (\mathbf{A_{k}}'(\mathbf{B_{k}}^{t})^{T} - \mathcal{X}_{(k)}) \mathbf{B_{k}}^{t}\|_{F} \\&= \| (\mathbf{A_{k}}-\mathbf{A_{k}}')(\mathbf{B_{k}}^{t})^{T} \mathbf{B_{k}}^{t}\|_{F} \\& \stackrel{(\rm{a})} \le \|(\mathbf{B_{k}}^{t})^{T} \mathbf{B_{k}}^{t}\|\|\mathbf{A_{k}}-\mathbf{A_{k}}'\|_{F} \\&\stackrel{(\rm{b})} = \frac{1}{\tau_{\mathbf{A_{k}}}(\mathbf{B_{k}}^{t})} \| \mathbf{A_{k}}-\mathbf{A_{k}}' \|_{F}, \end{aligned} $$
(78)

where \(\nabla _{\mathbf {A_{k}}} H(\mathbf {A_{k}})\) and \(\nabla _{\mathbf {A_{k}}} H(\mathbf {A_{k}}')\) are the abbreviations of \(\nabla _{\mathbf {A_{k}}} H(\mathbf {A_{1}}^{t+1}, \mathbf {A_{2}}^{t+1}, \cdots, \mathbf {A_{k-1}}^{t+1}, \mathbf {A_{k}}, \mathbf {A_{k+1}}^{t}, \cdots, \mathbf {A_{K}}^{t}, \mathcal {X}^{t}, \omega _{1}^{t}, \omega _{2}^{t}, \cdots, \omega _{W-1}^{t})\) and \(\nabla _{\mathbf {A_{k}}} H(\mathbf {A_{1}}^{t+1}, \mathbf {A_{2}}^{t+1}, \cdots, \mathbf {A_{k-1}}^{t+1}, \mathbf {A_{k}}', \mathbf {A_{k+1}}^{t}, \cdots, \mathbf {A_{K}}^{t}, \mathcal {X}^{t}, \omega _{1}^{t}, \omega _{2}^{t}, \cdots, \omega _{W-1}^{t})\), respectively. \(\mathbf{B_{k}}^{t}\) represents \(\mathbf{A_{K}}^{t}\odot \dots \odot \mathbf{A_{k+1}}^{t}\odot \mathbf{A_{k-1}}^{t+1}\odot \dots \odot \mathbf{A_{1}}^{t+1}\). (a) holds from the inequality \(\|\mathbf{A}\mathbf{B}\|_{F} \le \|\mathbf{A}\|_{F}\|\mathbf{B}\|\), where \(\|\cdot\|\) denotes the matrix spectral norm. (b) follows from

$$ \begin{aligned} \tau_{\mathbf{A_{k}}} = \frac{1}{\|(\mathbf{B_{k}})^{T}\mathbf{B_{k}}\|},\forall k \in [K], \end{aligned} $$
(79)

and (78) implies that

$$ \begin{aligned} &L^{t+1}_{\mathbf{A_{k}}} \leq \|(\mathbf{B_{k}}^{t})^{T} \mathbf{B_{k}}^{t}\|, \mathrm{ and } \\&\tau_{\mathbf{A_{k}}}(\mathbf{B_{k}}^{t}) \le 1/L^{t+1}_{\mathbf{A_{k}}}. \end{aligned} $$
(80)
$$ \begin{aligned} &\| \nabla_{\mathcal{X}} H(\mathcal{X}) - \nabla_{\mathcal{X}} H(\mathcal{X}') \|_{F} \\ & = \| \nabla_{\mathcal{X}} F_{\Omega}(\mathcal{X}, \omega_{1}^{t}, \omega_{2}^{t}, \cdots, \omega_{W-1}^{t}) \\&~~~~~ + \lambda (\mathcal{X} - \mathbf{A_{1}}^{t+1} \circ \mathbf{A_{2}}^{t+1} \circ...\circ \mathbf{A_{K}}^{t+1}) \\&~~~~~- \nabla_{\mathcal{X}} F_{\Omega}(\mathcal{X}', \omega_{1}^{t}, \omega_{2}^{t}, \cdots, \omega_{W-1}^{t}) \\&~~~~~- \lambda (\mathcal{X}' - \mathbf{A_{1}}^{t+1} \circ \mathbf{A_{2}}^{t+1} \circ...\circ \mathbf{A_{K}}^{t+1})\|_{F} \\& \stackrel{(\rm{c})} \le \| \nabla_{\mathcal{X}} F_{\Omega}(\mathcal{X}, \omega_{1}^{t}, \omega_{2}^{t}, \cdots, \omega_{W-1}^{t}) \\&~~~~~- \nabla_{\mathcal{X}} F_{\Omega}(\mathcal{X}', \omega_{1}^{t}, \omega_{2}^{t}, \cdots, \omega_{W-1}^{t}) \|_{F} \\&~~~~~+ \|\lambda(\mathcal{X}-\mathcal{X}')\|_{F} \\& \stackrel{(\rm{d})} = \| \text{diag}(\nabla^{2} F_{\Omega}(\bar{\mathcal{X}}))\text{vec}(\mathcal{X}-\mathcal{X}') \|_{2} + \|\lambda(\mathcal{X}-\mathcal{X}')\|_{F} \\&\stackrel{(\rm{e})} \le (\| \text{diag}(\nabla^{2} F_{\Omega}(\bar{\mathcal{X}}))\|_{\infty} + \lambda) \|\mathcal{X}-\mathcal{X}'\|_{F} \\& \stackrel{(\rm{f})} \le (\frac{1}{\sigma^{2}\beta^{2}} + \lambda) \|\mathcal{X}-\mathcal{X}'\|_{F} \\&\stackrel{(\rm{g})} = \frac{1}{\tau_{\mathcal{X}}} \| \mathcal{X}-\mathcal{X}'\|_{F}, \end{aligned} $$
(81)

where \(\nabla _{\mathcal {X}} H(\mathcal {X})\) and \(\nabla _{\mathcal {X}} H(\mathcal {X}')\) are the abbreviations of \(\nabla _{\mathcal {X}} H(\mathbf {A_{1}}^{t+1}, \mathbf {A_{2}}^{t+1}, \cdots, \mathbf {A_{K}}^{t+1}, \mathcal {X}, \omega _{1}^{t}, \omega _{2}^{t}, \cdots, \omega _{W-1}^{t})\) and \(\nabla _{\mathcal {X}} H(\mathbf {A_{1}}^{t+1}, \mathbf {A_{2}}^{t+1}, \cdots, \mathbf {A_{K}}^{t+1}, \mathcal {X}', \omega _{1}^{t}, \omega _{2}^{t}, \cdots, \omega _{W-1}^{t})\), respectively. In (81), (c) comes from the triangle inequality. (d) follows from the differential mean value theorem and the fact that \(\|\mathbf{A}\|_{F}=\|\text{vec}(\mathbf{A})\|_{2}\). \(\nabla ^{2} F_{\Omega }(\bar {\mathcal {X}}) \in \mathbb {R}^{n_{1} \times n_{2} \times \dots \times n_{K}}\) has its \((i_{1},i_{2}, \dots,i_{K})\)-th entry equal to \({\frac {\partial ^{2} F_{\Omega }}{\partial ^{2} \mathcal {X}_{i_{1},i_{2}, \dots,i_{K}}}|}_{\bar {\mathcal {X}}_{i_{1},i_{2}, \dots,i_{K}}}\), and \(\text {diag}(\nabla ^{2} F_{\Omega }(\bar {\mathcal {X}})) \in \mathbb {R}^{n_{1} n_{2} \dots n_{K}\times n_{1} n_{2} \dots n_{K}}\) is a diagonal matrix whose diagonal equals \(\text {vec}(\nabla ^{2} F_{\Omega }(\bar {\mathcal {X}}))\). (e) follows from the fact that the spectral norm of a diagonal matrix equals the largest absolute value of its diagonal entries, i.e., the entry-wise infinity norm. Note that the probability density function of the normal distribution and its derivative are upper bounded by \(\frac {1}{\sqrt {2\pi } \sigma }\) and \(\frac {e^{-1/2}}{\sqrt {2\pi } \sigma ^{2}}\), respectively. One can then check that \(\|\text {diag}(\nabla ^{2} F_{\Omega }(\bar {\mathcal {X}}))\|_{\infty }\) is bounded by \(\frac {1}{\sigma ^{2}\beta ^{2}}\). (f) follows from upper bounding \(\|\text {diag}(\nabla ^{2} F_{\Omega }(\bar {\mathcal {X}}))\|_{\infty }\) by \(\frac {1}{\sigma ^{2}\beta ^{2}}\). (g) comes from \(\tau _{\mathcal {X}} = \frac {1}{\frac {1}{\sigma ^{2}\beta ^{2}}+\lambda }\). Therefore, \(\tau _{\mathcal {X}} \le 1/L^{t+1}_{\mathcal {X}}\).

$$ \begin{aligned} &\| \nabla_{\omega_{l}} H(\omega_{l}) - \nabla_{\omega_{l}} H(\omega_{l}') \|_{F} \\ & = \| \sum_{(i_{1},i_{2},\cdots,i_{K}) \in \Omega} \\&~~~~~(\frac{\boldsymbol{1}_{[\mathcal{Y}_{i_{1},i_{2},...,i_{K}}=l+1]}\dot{\Phi}(\omega_{l}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})}{\Phi(\omega_{l+1}^{t}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})-\Phi(\omega_{l}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})} \\&~~~~~- \frac{\boldsymbol{1}_{[\mathcal{Y}_{i_{1},i_{2},...,i_{K}}=l]}\dot{\Phi}(\omega_{l}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})}{\Phi(\omega_{l}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})-\Phi(\omega_{l-1}^{t+1}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})}) \\&~~~~~-\sum_{(i_{1},i_{2},\cdots,i_{K}) \in \Omega}\\&~~~~~ (\frac{\boldsymbol{1}_{[\mathcal{Y}_{i_{1},i_{2},...,i_{K}}=l+1]}\dot{\Phi}(\omega_{l}'-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})}{\Phi(\omega_{l+1}^{t}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})-\Phi(\omega_{l}'-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})} \\&~~~~~- \frac{\boldsymbol{1}_{[\mathcal{Y}_{i_{1},i_{2},...,i_{K}}=l]}\dot{\Phi}(\omega_{l}'-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})}{\Phi(\omega_{l}'-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})-\Phi(\omega_{l-1}^{t+1}-\mathcal{X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})})\|_{F} \\& \stackrel{(\rm{h})} \le \| \langle \mathcal{G}_{l+1}, \nabla J(\mathcal{U}_{\omega_{l}}) \rangle (\omega_{l}-\omega_{l}')\|_{F} \\&~~~~~+ \|\langle \mathcal{G}_{l}, \nabla M(\mathcal{V}_{\omega_{l}}) \rangle (\omega_{l}-\omega_{l}')\|_{F} \\& \stackrel{(\rm{i})} \le \|\mathcal{G}_{l+1}\|_{F}\|\nabla J(\mathcal{U}_{\omega_{l}})\|_{\infty}\|\omega_{l}-\omega_{l}'\|_{F}\\&~~~~~ + \|\mathcal{G}_{l}\|_{F}\|\nabla M(\mathcal{V}_{\omega_{l}})\|_{\infty}\|\omega_{l}-\omega_{l}'\|_{F} \\& \stackrel{(\rm{j})} \le \|\mathcal{G}_{l+1}\|_{F}\frac{1}{\sigma^{2}\beta^{2}}\|\omega_{l}-\omega_{l}'\|_{F} + \|\mathcal{G}_{l}\|_{F}\frac{1}{\sigma^{2}\beta^{2}}\|\omega_{l}-\omega_{l}'\|_{F} \\& = \frac{1}{\sigma^{2}\beta^{2}}(\sqrt{G_{l}}+\sqrt{G_{l+1}})\|\omega_{l}-\omega_{l}'\|_{F} \\& \stackrel{(\rm{k})} = \frac{1}{\tau_{\omega_{l}}} \| \omega_{l}-\omega_{l}'\|_{F}, \end{aligned} $$
(82)

where \(\nabla _{\omega _{l}} H(\omega _{l})\) and \(\nabla _{\omega _{l}} H(\omega _{l}')\) are the abbreviations of \(\nabla _{\omega _{l}} H(\mathbf {A_{1}}^{t+1}, \mathbf {A_{2}}^{t+1}, \cdots, \mathbf {A_{K}}^{t+1}, \mathcal {X}^{t+1}, \omega _{1}^{t+1}, \omega _{2}^{t+1}, \cdots, \omega _{l-1}^{t+1}, \omega _{l}, \omega _{l+1}^{t}, \cdots, \omega _{W-1}^{t})\) and \(\nabla _{\omega _{l}} H(\mathbf {A_{1}}^{t+1}, \mathbf {A_{2}}^{t+1}, \cdots, \mathbf {A_{K}}^{t+1}, \mathcal {X}^{t+1}, \omega _{1}^{t+1}, \omega _{2}^{t+1}, \cdots, \omega _{l-1}^{t+1}, \omega _{l}', \omega _{l+1}^{t}, \cdots, \omega _{W-1}^{t})\), respectively. In (82), \(\mathcal {G}_{l}\) and \(\mathcal {G}_{l+1}\) are binary tensors whose entries equal one when the corresponding entries of \(\mathcal {Y}\) equal l and l+1, respectively, and equal zero otherwise; \(G_{l}\) denotes the number of nonzero entries of \(\mathcal {G}_{l}\), so that \(\|\mathcal {G}_{l}\|_{F}=\sqrt {G_{l}}\). (h) follows from the differential mean value theorem, where \(\mathcal {U}_{\omega _{l}},\mathcal {V}_{\omega _{l}} \in \mathbb {R}^{n_{1} \times n_{2} \times \dots \times n_{K}}\) have entries between \(\omega_{l}\) and \(\omega_{l}'\), as required by the differential mean value theorem. The \((i_{1},i_{2}, \dots,i_{K})\)-th entries of \(\nabla J(\mathcal {U}_{\omega _{l}})\) and \(\nabla M(\mathcal {V}_{\omega _{l}}) \in \mathbb {R}^{n_{1} \times n_{2} \times \dots \times n_{K}}\) are the partial derivatives of \(\frac {\dot {\Phi }(\omega _{l}-\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})}{\Phi (\omega _{l+1}^{t}-\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})-\Phi (\omega _{l}-\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})}\) and \(\frac {\dot {\Phi }(\omega _{l}-\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})}{\Phi (\omega _{l}-\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})-\Phi (\omega _{l-1}^{t+1}-\mathcal {X}_{i_{1},i_{2},\dots,i_{K}}^{t+1})}\) with respect to \(\omega_{l}\), evaluated at the points \((\mathcal {U}_{\omega _{l}})_{i_{1},i_{2}, \dots,i_{K}}\) and \((\mathcal {V}_{\omega _{l}})_{i_{1},i_{2}, \dots,i_{K}}\), respectively. (j) comes from the fact that \(\|\nabla J(\mathcal {U}_{\omega _{l}})\|_{\infty }\) and \(\|\nabla M(\mathcal {V}_{\omega _{l}})\|_{\infty }\) are upper bounded by \(\frac {1}{\sigma ^{2}\beta ^{2}}\). (k) comes from \(\tau _{\omega _{l}} = \frac {\sigma ^{2}\beta ^{2}}{\sqrt {G_{l}}+\sqrt {G_{l+1}}},\forall l \in [W-1]\). Thus, \(\tau _{\omega _{l}} \le 1/L^{t+1}_{\omega _{l}}\).

We remark that the results of (78) and (81) do not change when the boundaries \(\omega _{l}^{*}, \forall l \in [W-1]\) are known to TAPGD, since \(\omega _{l}^{t=1}, \forall l \in [W-1]\) are fixed values in (78) and (81).
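As a concrete illustration of the step sizes derived from (78)–(82), the sketch below computes \(\tau_{\mathbf{A_{k}}}\), \(\tau_{\mathcal{X}}\), and \(\tau_{\omega_{l}}\) from the current factors and the observed labels. It is a minimal sketch under the notation above: the Khatri–Rao product is implemented directly, σ, β, and λ are assumed to be given, and variable names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def khatri_rao(mats):
    """Column-wise Kronecker (Khatri-Rao) product of a list of n_k x r matrices."""
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, M).reshape(-1, out.shape[1])
    return out

def step_sizes(A, Y, sigma, beta, lam):
    """Step sizes from the Lipschitz bounds above:
    tau_{A_k}     = 1 / ||B_k^T B_k||, with B_k the Khatri-Rao product of the other factors,
    tau_X         = 1 / (1/(sigma^2 beta^2) + lambda),
    tau_{omega_l} = sigma^2 beta^2 / (sqrt(G_l) + sqrt(G_{l+1})),
    where G_l is the number of observed entries with label l."""
    K = len(A)
    tau_A = []
    for k in range(K):
        others = [A[j] for j in reversed(range(K)) if j != k]   # A_K, ..., A_{k+1}, A_{k-1}, ..., A_1
        Bk = khatri_rao(others)
        tau_A.append(1.0 / np.linalg.norm(Bk.T @ Bk, 2))        # spectral norm
    tau_X = 1.0 / (1.0 / (sigma ** 2 * beta ** 2) + lam)
    W = int(Y.max())
    counts = [np.sum(Y == l) for l in range(1, W + 1)]          # G_1, ..., G_W
    tau_omega = [sigma ** 2 * beta ** 2 / (np.sqrt(counts[l - 1]) + np.sqrt(counts[l]))
                 for l in range(1, W)]                          # l = 1, ..., W-1
    return tau_A, tau_X, tau_omega

# Usage with random factors and labels (illustrative sizes).
rng = np.random.default_rng(0)
A = [rng.standard_normal((n, 4)) for n in (6, 7, 8)]
Y = rng.integers(1, 6, size=(6, 7, 8))
print(step_sizes(A, Y, sigma=0.3, beta=0.9, lam=1.0))
```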

12 Appendix 6. Proof of Theorem 4

Proof

As described in Section 4.1 of the paper, \(\Psi _{1}(\mathcal {X})\) corresponds to setting \(\mathcal {X}_{i_{1},i_{2},...,i_{K}}\) to α if \(\mathcal {X}_{i_{1},i_{2},...,i_{K}} > \alpha \) and to −α if \(\mathcal {X}_{i_{1},i_{2},...,i_{K}} < -\alpha\), \(\forall i_{k} \in [n_{k}], k \in [K]\). \(\Psi_{2}(\omega_{l})\) corresponds to setting \(\omega_{l} = \min(\omega_{l+1}-\kappa_{l+1},\alpha_{\text{upper}})\) if \(\omega_{l} > \min(\omega_{l+1}-\kappa_{l+1},\alpha_{\text{upper}})\), and setting \(\omega_{l} = \max(\omega_{l-1}+\kappa_{l},\alpha_{\text{low}})\) if \(\omega_{l} < \max(\omega_{l-1}+\kappa_{l},\alpha_{\text{low}})\), \(\forall l \in [W-1]\). TAPGD is a special case of the proximal alternating linearized minimization (PALM) algorithm analyzed by Bolte et al. [54]. The global convergence of TAPGD to a critical point of (12) from any initial point can be proved in two steps: (1) \(H(\mathbf {A_{1}}, \mathbf {A_{2}}, \cdots, \mathbf {A_{K}}, \mathcal {X}, \omega _{1}, \omega _{2},\cdots, \omega _{W-1})\) is Lipschitz differentiable; (2) \(H(\mathbf {A_{1}}, \mathbf {A_{2}}, \cdots, \mathbf {A_{K}}, \mathcal {X}, \omega _{1}, \omega _{2},\cdots, \omega _{W-1}) + \Psi _{1}(\mathcal {X}) + \sum _{l=1}^{W-1}\Psi _{2}(\omega _{l})\) satisfies the Kurdyka–Łojasiewicz (KL) property.

The Lipschitz differential property of \(H(\mathbf {A_{1}}, \mathbf {A_{2}}, \cdots, \mathbf {A_{K}}, \mathcal {X}, \omega _{1}, \omega _{2},\cdots, \omega _{W-1})\) has been shown in Appendix 5. \(\Psi_{1}\) and \(\Psi_{2}\) are semi-algebraic functions, and according to [54], a semi-algebraic function satisfies the KL property. In addition, the function \(H(\mathbf {A_{1}}, \mathbf {A_{2}}, \cdots, \mathbf {A_{K}}, \mathcal {X}, \omega _{1}, \omega _{2},\cdots, \omega _{W-1})\) is real analytic, and hence is a KL function according to Xu et al. [62]. Therefore, \(H(\mathbf {A_{1}}, \mathbf {A_{2}}, \cdots, \mathbf {A_{K}}, \mathcal {X}, \omega _{1}, \omega _{2},\cdots, \omega _{W-1}) + \Psi _{1}(\mathcal {X}) + \sum _{l=1}^{W-1}\Psi _{2}(\omega _{l})\) satisfies the KL property, and the claim follows from Xu et al. [62]. By Remark 3.4 in the work of Bolte et al. [54], the convergence rate is at least \(O(t^{\frac {\theta - 1}{2\theta - 1}})\) for some \(\theta \in (\frac {1}{2},1)\). This completes the proof. □
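For completeness, a minimal sketch of the two projection steps \(\Psi_{1}\) and \(\Psi_{2}\) described in the proof: entrywise clipping of \(\mathcal{X}\) to \([-\alpha,\alpha]\), and clipping of each boundary to the interval allowed by its neighbors and the separation margin. A single margin κ is used for simplicity (the paper allows a margin per boundary), the neighbors are the current in-place values rather than the previous iterates, and all names are illustrative.

```python
import numpy as np

def proj_X(X, alpha):
    """Psi_1: clip every entry of X to the interval [-alpha, alpha]."""
    return np.clip(X, -alpha, alpha)

def proj_omega(omega, kappa, a_low, a_upper):
    """Psi_2 (sketch): keep the boundaries ordered, separated by at least kappa,
    and inside [a_low, a_upper], processed sequentially for l = 1, ..., W-1."""
    omega = np.asarray(omega, dtype=float).copy()
    for l in range(len(omega)):
        lo = a_low if l == 0 else max(omega[l - 1] + kappa, a_low)
        hi = a_upper if l == len(omega) - 1 else min(omega[l + 1] - kappa, a_upper)
        omega[l] = min(max(omega[l], lo), hi)
    return omega

# Usage: clip an out-of-range estimate and a vector of W-1 = 3 boundaries.
print(proj_X(np.array([[1.7, -0.2], [-2.1, 0.4]]), alpha=1.0))
print(proj_omega([-1.4, 0.0, 1.5], kappa=0.1, a_low=-1.0, a_upper=1.0))
```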

Availability of data and materials

The datasets analyzed during the current study are available at http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html and https://github.com/UCM-GAIA/Context-Dataset/wiki/Datasets.

Notes

  1. We use the notations g=O(n) and g=Θ(n) if, as n goes to infinity, g ≤ c·n and c1·n ≤ g ≤ c2·n eventually hold for some positive constants c, c1, and c2, respectively.

Abbreviations

CP:

CANDECOMP/PARAFAC

TSVD:

Tensor singular value decomposition

TAPGD:

Tensor-based alternating proximal gradient descent

TSVD-APGD:

TSVD-based alternating projected gradient descent

MNC-1bit-TR:

M-norm constrained 1-bit tensor recovery

NORT:

Nonconvex regularized tensor

References

  1. M. A. Davenport, Y. Plan, E. van den Berg, M. Wootters, 1-bit matrix completion. Inf. Infer.3(3), 189–223 (2014).


  2. R. Wang, M. Wang, J. Xiong, Data recovery and subspace clustering from quantized and corrupted measurements. IEEE J. Sel. Top. Signal Process.12(6), 1547–1560 (2018).


  3. J. Choi, J. Mo, R. W. Heath, Near maximum-likelihood detector and channel estimator for uplink multiuser massive mimo systems with one-bit adcs. IEEE Trans. Commun.64(5), 2005–2018 (2016).


  4. P. Gao, R. Wang, M. Wang, J. H. Chow, Low-rank matrix recovery from noisy, quantized and erroneous measurements. IEEE Trans. Signal Process.66(11), 2918–2932 (2018).


  5. A. Reinhardt, F. Englert, D. Christin, in Proc. Sustainable Internet and ICT for Sustainability (SustainIT). Enhancing user privacy by preprocessing distributed smart meter data, (2013), pp. 1–7. https://doi.org/10.1109/SustainIT.2013.6685194.

  6. Y. Li, C. Tao, G. Seco-Granados, A. Mezghani, A. L. Swindlehurst, L. Liu, Channel estimation and performance analysis of one-bit massive mimo systems. IEEE Trans. Signal Process.65(15), 4075–4089 (2017).


  7. S. Khobahi, N. Naimipour, M. Soltanalian, Y. C. Eldar, in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Deep signal recovery with one-bit quantization (IEEE, 2019), pp. 2987–2991. https://doi.org/10.1109/ICASSP.2019.8683876.

  8. Y. Plan, R. Vershynin, Robust 1-bit compressed sensing and sparse logistic regression: a convex programming approach. IEEE Trans. Inf. Theory. 59(1), 482–494 (2013).


  9. M. Slawski, P. Li, in Advances in Neural Information Processing Systems. b-bit marginal regression (NIPSMontreal, 2015), pp. 2062–2070.


  10. L. Zhang, J. Yi, R. Jin, in International Conference on Machine Learning. Efficient algorithms for robust one-bit compressive sensing (JMLRBeijing, 2014), pp. 820–828.


  11. S. Bhojanapalli, B. Neyshabur, N. Srebro, in Advances in Neural Information Processing Systems. Global optimality of local search for low rank matrix recovery (NIPSBarcelona, 2016), pp. 3873–3881.


  12. T. Zhao, Z. Wang, H. Liu, in Advances in Neural Information Processing Systems. A nonconvex optimization framework for low rank matrix estimation (NIPSMontreal, 2015), pp. 559–567.


  13. S. A. Bhaskar, Probabilistic low-rank matrix completion from quantized measurements. J. Mach. Learn. Res.17(60), 1–34 (2016).


  14. T. Cai, W. -X. Zhou, A max-norm constrained minimization approach to 1-bit matrix completion. J. Mach. Learn. Res.14(1), 3619–3647 (2013).


  15. Y. Fu, J. Gao, D. Tien, Z. Lin, in 2014 International Joint Conference on Neural Networks (IJCNN). Tensor LRR based subspace clustering (IEEE, 2014), pp. 1877–1884. https://doi.org/10.1109/IJCNN.2014.6889472.

  16. L. Baltrunas, M. Kaminskas, B. Ludwig, O. Moling, F. Ricci, A. Aydin, K. -H. Lüke, R. Schwaiger, in International Conference on Electronic Commerce and Web Technologies. Incarmusic: context-aware music recommendations in a car (Springer, 2011), pp. 89–100. https://doi.org/10.1007/978-3-642-23014-1_8.

  17. H. S. Sahambi, K. Khorasani, A neural-network appearance-based 3-d object recognition using independent component analysis. IEEE Trans. Neural Netw.14(1), 138–149 (2003).


  18. N. I. Bruce, B. Murthi, R. C. Rao, A dynamic model for digital advertising: the effects of creative format, message content, and targeting on engagement. J. Mark. Res.54(2), 202–218 (2017).


  19. R. Li, W. Zhang, Y. Zhao, Z. Zhu, S. Ji, Sparsity learning formulations for mining time-varying data. IEEE Trans. Knowl. Data Eng.27(5), 1411–1423 (2015).


  20. N. Cohen, A. Shashua, in International Conference on Machine Learning. Convolutional rectifier networks as generalized tensor decompositions (JMLRNew York, 2016), pp. 955–963.


  21. K. Maruhashi, M. Todoriki, T. Ohwa, K. Goto, Y. Hasegawa, H. Inakoshi, H. Anai, in Thirty-Second AAAI Conference on Artificial Intelligence. Learning multi-way relations via tensor decomposition with neural networks (AAAI PressNew Orleans, Louisiana, 2018).


  22. F. L. Hitchcock, The expression of a tensor or a polyadic as a sum of products. J. Math. Phys.6(1-4), 164–189 (1927).


  23. J. B. Kruskal, Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Appl.18(2), 95–138 (1977).


  24. L. R. Tucker, Some mathematical notes on three-mode factor analysis. Psychometrika. 31(3), 279–311 (1966).


  25. J. Liu, P. Musialski, P. Wonka, J. Ye, Tensor completion for estimating missing values in visual data. IEEE Trans. Pattern. Anal. Mach. Intell.35(1), 208–220 (2012).


  26. X. Zhang, Z. Zhou, D. Wang, Y. Ma, in Twenty-Eighth AAAI Conference on Artificial Intelligence. Hybrid singular value thresholding for tensor completion (AAAI PressQuébec City, 2014).


  27. Q. Zhao, L. Zhang, A. Cichocki, Bayesian cp factorization of incomplete tensors with automatic rank determination. IEEE Trans. Pattern. Anal. Mach. Intell.37(9), 1751–1763 (2015).


  28. T. Yokota, Q. Zhao, A. Cichocki, Smooth parafac decomposition for tensor completion. IEEE Trans. Signal Process.64(20), 5423–5436 (2016).


  29. J. A. Bengua, H. N. Phien, H. D. Tuan, M. N. Do, Efficient tensor completion for color image and video recovery: low-rank tensor train. IEEE Trans. Image Process.26(5), 2466–2479 (2017).


  30. Q. Yao, J. T. -Y. Kwok, B. Han, in International Conference on Machine Learning. Efficient nonconvex regularized tensor completion with structure-aware proximal iterations (JMLRLong Beach, CA, 2019), pp. 7035–7044.


  31. C. Mu, B. Huang, J. Wright, D. Goldfarb, in International Conference on Machine Learning. Square deal: lower bounds and improved relaxations for tensor recovery (JMLRBeijing, 2014), pp. 73–81.


  32. X. Zhang, D. Wang, Z. Zhou, Y. Ma, Robust low-rank tensor recovery with rectification and alignment. IEEE Trans. Pattern. Anal. Mach. Intell. (2019). https://doi.org/10.1109/TPAMI.2019.2929043.

  33. J. -H. Yang, X. -L. Zhao, T. -Y. Ji, T. -H. Ma, T. -Z. Huang, Low-rank tensor train for tensor robust principal component analysis. Appl. Math. Comput.367:, 124783 (2020).


  34. T. -X. Jiang, T. -Z. Huang, X. -L. Zhao, L. -J. Deng, Multi-dimensional imaging data recovery via minimizing the partial sum of tubal nuclear norm. J. Comput. Appl. Math.372:, 112680 (2020).


  35. Y. -B. Zheng, T. -Z. Huang, X. -L. Zhao, T. -X. Jiang, T. -H. Ma, T. -Y. Ji, Mixed noise removal in hyperspectral image via low-fibered-rank regularization. IEEE Trans. Geosci. Remote Sens.58(1), 734–749 (2019).


  36. T. G. Kolda, B. W. Bader, Tensor decompositions and applications. SIAM Rev.51(3), 455–500 (2009).


  37. V. De Silva, L. -H. Lim, Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM J. Matrix Anal. Appl.30(3), 1084–1127 (2008).


  38. J. Chen, Y. Saad, On the tensor svd and the optimal low rank orthogonal approximation of tensors. SIAM J. Matrix Anal. Appl.30(4), 1709–1734 (2009).


  39. J. Li, F. Huang, Guaranteed simultaneous asymmetric tensor decomposition via orthogonalized alternating least squares. arXiv preprint arXiv:1805.10348 (2018).

  40. W. P. Krijnen, T. K. Dijkstra, A. Stegeman, On the non-existence of optimal solutions and the occurrence of “degeneracy” in the candecomp/parafac model. Psychometrika. 73(3), 431–439 (2008).


  41. A. Aidini, G. Tsagkatakis, P. Tsakalides, 1-bit tensor completion. Electron. Imaging. 2018(13), 261–1 (2018).


  42. B. Li, X. Zhang, X. Li, H. Lu, Tensor completion from one-bit observations. IEEE Trans. Image Process.28(1), 170–180 (2019).


  43. N. Ghadermarzy, Y. Plan, O. Yilmaz, Learning tensors from partial binary measurements. IEEE Trans. Signal Process.67(1), 29–40 (2019).


  44. S. Zhe, K. Zhang, P. Wang, K. -c. Lee, Z. Xu, Y. Qi, Z. Ghahramani, in Advances in Neural Information Processing Systems. Distributed flexible nonlinear tensor factorization (NIPSBarcelona, 2016), pp. 928–936.


  45. S. Chen, M. R. Lyu, I. King, Z. Xu, in Advances in Neural Information Processing Systems. Exact and stable recovery of pairwise interaction tensors (NIPSLake Tahoe, 2013), pp. 1691–1699.


  46. E. Richard, A. Montanari, in Advances in Neural Information Processing Systems. A statistical model for tensor PCA (NIPSMontreal, 2014), pp. 2897–2905.


  47. X. Zhang, D. Wang, Z. Zhou, Y. Ma, in Advances in Neural Information Processing Systems. Simultaneous rectification and alignment via robust recovery of low-rank tensors (NIPSLake Tahoe, 2013), pp. 1637–1645.


  48. A. Smilde, R. Bro, P. Geladi, Multi-way Analysis: Applications in the Chemical Sciences (Wiley, Hoboken, 2005).


  49. Y. Baig, E. M. Lai, J. Lewis, in 2010 17th International Conference on Telecommunications. Quantization effects on compressed sensing video (IEEE, 2010), pp. 935–940.

  50. G. Zhang, J. Jia, T. -T. Wong, H. Bao, Consistent depth maps recovery from a video sequence. IEEE Trans. Pattern. Anal. Mach. Intell.31(6), 974–988 (2009).


  51. R. A. Harshman, et al., Foundations of the parafac procedure: models and conditions for an “explanatory” multimodal factor analysis. UCLA Working Papers in Phonetics. UCLA, 1–84 (1970).

  52. J. D. Carroll, J. -J. Chang, Analysis of individual differences in multidimensional scaling via an n-way generalization of “eckart-young” decomposition. Psychometrika. 35(3), 283–319 (1970).


  53. L. R. Tucker, Some mathematical notes on three-mode factor analysis. Psychometrika. 31(3), 279–311 (1966).


  54. J. Bolte, S. Sabach, M. Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program.146(1-2), 459–494 (2014).


  55. H. Golub, C. F. Van Loan, Matrix computations (Johns Hopkins Universtiy Press, 1996).

  56. A. S. Georghiades, B. Peter N, K. David J, From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern. Anal. Mach. Intell.23(6), 643–660 (2001).


  57. K. -C. Lee, J. Ho, D. J. Kriegman, Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans. Pattern. Anal. Mach. Intell.27(5), 684–698 (2005).


  58. E. J. Candes, Y. Plan, Matrix completion with noise. Proc. IEEE. 98(6), 925–936 (2010).


  59. R. Tomioka, T. Suzuki, Spectral norm of random tensors. arXiv preprint arXiv:1407.1870 (2014).

  60. S. Friedland, L. -H. Lim, Nuclear norm of higher-order tensors. Math. Comput.87(311), 1255–1281 (2018).


  61. B. Jiang, F. Yang, S. Zhang, Tensor and its tucker core: the invariance relationships. Numer. Linear Algebra Appl.24(3), 2086 (2017).


  62. Y. Xu, W. Yin, A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imaging Sci.6(3), 1758–1789 (2013).



Acknowledgements

This research is supported in part by ARO W911NF-17-1-0407 and the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons).

Author information



Contributions


Ren and Meng conceived and designed the method and the experiments. Ren performed the experiments and drafted the manuscript. Meng revised the manuscript. Jinjun provided many helpful suggestions. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Meng Wang.

Ethics declarations

Consent for publication

Informed consent was obtained from all authors included in the study.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Wang, R., Wang, M. & Xiong, J. Tensor recovery from noisy and multi-level quantized measurements. EURASIP J. Adv. Signal Process. 2020, 41 (2020). https://doi.org/10.1186/s13634-020-00698-z


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13634-020-00698-z

Keywords