

Open Access

On the effect of model mismatch for sequential Info-Greedy Sensing

EURASIP Journal on Advances in Signal Processing 2018, 2018:32

Received: 11 October 2017

Accepted: 1 May 2018

Published: 5 June 2018


We characterize the performance of sequential information-guided sensing (Info-Greedy Sensing) when the model parameters (means and covariance matrices) are estimated and inaccurate. Our theoretical results focus on Gaussian signals and establish performance bounds for signal estimators obtained by Info-Greedy Sensing, in terms of conditional entropy (related to the estimation error) and additional power required due to inaccurate models. We also show covariance sketching can be used as an efficient initialization for Info-Greedy Sensing. Numerical examples demonstrate the good performance of Info-Greedy Sensing algorithms compared with random measurement schemes in the presence of model mismatch.


Keywords: Sequential compressed sensing · Adaptive sensing · Mutual information · Model mismatch

1 Introduction

Sequential compressed sensing is a promising new information acquisition and recovery technique to process big data that arises in various applications such as compressive imaging [1–3], power network monitoring [4], and large-scale sensor networks [5]. The sequential nature of the problems arises either because the measurements are taken one after another or because the data is obtained in a streaming fashion and has to be processed in one pass.

To harvest the benefits of adaptivity in sequential compressed sensing, various algorithms have been developed (see [6] for a review). We may classify these algorithms as (1) being agnostic about the signal distribution and, hence, using random measurements [7–10], (2) exploiting additional structure of the signal (such as graphical structure [11], sparsity [12–14], low rank [15], and tree-sparse structure [16, 17]) to design measurements, and (3) exploiting the distributional information of the signal in choosing measurements [18], possibly through maximizing mutual information. The additional knowledge about signal structure or distributions constitutes various forms of information about the unknown signal. Such work includes the seminal Bayesian compressive sensing work [19], Gaussian mixture models (GMM) [20, 21], the classic information gain maximization [22] based on a quadratic approximation to the information gain function, and our earlier work [6], which is referred to as Info-Greedy Sensing. Info-Greedy Sensing is a framework that designs subsequent measurements to maximize the mutual information conditioned on previous measurements. Conditional mutual information is a natural metric here, as it captures exclusively the useful new information between the signal and the resulting measurements, disregarding noise and what has already been learned from previous measurements. Information may play a distinguishing role: as demonstrated by the compressive imaging example in Fig. 1 (see Section 4 for more details), with a bit of (albeit inaccurate) information estimated via random samples of small patches of the image, our Info-Greedy Sensing is able to recover details of a high-resolution image, whereas random measurements completely miss the image.
As shown in [6], Info-Greedy Sensing for a Gaussian signal becomes a simple iterative algorithm: choosing the measurement as the leading eigenvector of the conditional signal covariance matrix in that iteration and then updating the covariance matrix via a simple rank-one update or, equivalently, choosing measurement vectors a1,a2,… as the orthonormal eigenvectors of the signal covariance matrix Σ in a decreasing order of eigenvalues. Different from the earlier literature [22], Info-Greedy Sensing determines not only the direction but also the precise magnitude of the measurements.
Fig. 1

Value of information in sensing a high-resolution image of size 1904×3000. Here, compressive linear measurements correspond to extracting the so-called features in compressive imaging [1–3]. In this example, the compressive imaging system captures five low-resolution images of size 238×275 using masks designed by Info-Greedy Sensing or random sensing (this corresponds to compressing the data into 8.32% of its original dimensionality). Info-Greedy Sensing performs much better than random features and preserves richer details in the recovered image. Details are explained in Section 4.3.2

In practice, we usually need to estimate the signal covariance matrix, e.g., through a training session. For Gaussian signals, there are two possible approaches: either using training samples of the same dimension or through the new “covariance sketching” technique [23–25], which uses low-dimensional random sketches of the samples. Due to the inaccuracy of the estimated covariance matrices, measurement vectors usually deviate from the optimal directions as they are calculated as eigenvectors of the estimated covariance matrix. Hence, to understand the performance of information-guided algorithms in practice, it is crucial to quantify the performance of algorithms with model mismatch. This may also shed some light on how to properly initialize the algorithm.

In this paper, we aim at quantifying the performance of Info-Greedy Sensing when the parameters (in particular, the covariance matrices) are estimated. We focus on analyzing deterministic model mismatch, which is a reasonable assumption since we aim at providing instance-specific performance guarantees with sample-estimated or sketched initial parameters. We establish a set of theoretical results including (1) studying the bias and variance of the signal estimator via the posterior mean, by relating the error in the covariance matrix \(\|\Sigma -\widehat {\Sigma }\|\) to the entropy of the signal posterior distribution after each sequential measurement, (2) establishing an upper bound on the additional power needed to achieve the signal precision \(\|x-\hat {x}\|\leq \varepsilon \), where power is defined as the square of the norm of the measurement vector, and (3) translating these into requirements on the choice of the sample covariance matrix through direct estimation or through covariance sketching. Note that the power allocated for the measurements here is the minimum power required in order to achieve a prescribed precision for signal recovery within a fixed number of iterations. Furthermore, we also study Info-Greedy Sensing in a special setting where the measurement vector is desired to be one-sparse and establish an analogous set of theoretical results. Such a requirement arises from applications such as nondestructive testing (NDT) [26] or network tomography. We also present numerical examples to demonstrate the good performance of Info-Greedy Sensing compared to a batch method (where measurements are not adaptive) when there is mismatch. The main contribution of the paper is to study and understand the performance of the Info-Greedy algorithm [6] in the presence of perturbed parameters, rather than to propose new algorithms.

Some other related works include [27], where adaptive methods for recovering structured sparse signals with Gaussian and Gaussian joint posterior are discussed, and [28], which analyzes the recovery of Gaussian mixture models with estimated mean and covariance using maximum a posteriori estimation. In [29], orthogonal matching pursuit for detecting the support of sparse signals under faulty measurements is studied. In this work, we focus on the case where the estimated mean, covariance, as well as the prior probability for each separate Gaussian component are available. Another work [20] discusses an adaptive sensing method for GMM, which is a two-step strategy that first adaptively detects the classification of the GMM and then reconstructs the signal assuming it falls in the category determined in the previous step. While [20] assumes that there are sufficient samples for the first step in the first place, our earlier work [6] and this paper are different in that sensing for GMM signals works on signal recovery directly, without trying to identify the signal class as a first step. Hence, in general, our method is more tolerant to inaccuracy of the estimated parameters, and our algorithm can achieve good performance even without a large number of samples, as demonstrated by numerical examples. The design of information-guided sequential sensing is related to the design of sequential experiments (see [15, 30, 31]) and large computer experiment approximation (see [32]). However, compared to the literature on design of experiments (e.g., [30]), our work does not use a statistical criterion based on the output of each iteration. In other words, we design our measurements based on the knowledge of the assumed model of the signal instead of the outputs of the measurements.

Our notations are standard. Denote \([\!n] \triangleq \{1,2,\ldots,n\}\); \(\|X\|\), \(\|X\|_{F}\), and \(\|X\|_{*}\) represent the spectral norm, the Frobenius norm, and the nuclear norm of a matrix X, respectively; let ν i (Σ) denote the ith largest eigenvalue of a positive semi-definite matrix Σ; \(\|x\|_{0}\), \(\|x\|_{1}\), and \(\|x\|\) represent the ℓ0, ℓ1, and ℓ2 norms of a vector x, respectively; let \(\chi _{n}^{2}\) be the quantile function of the chi-squared distribution with n degrees of freedom; let \(\mathbb {E}[\!x]\) and Var[ x] denote the mean and the variance of a random variable x; we write \(X\succeq 0\) to indicate that the matrix is positive semi-definite; ϕ(x|μ,Σ) denotes the probability density function of the multivariate Gaussian with mean μ and covariance matrix Σ; let e j denote the jth column of the identity matrix I (i.e., e j is a vector with only one non-zero entry, at location j); and \((x)^{+} \triangleq \max \{x, 0\}\) for \(x\in \mathbb {R}\).

2 Method: Info-Greedy Sensing

A typical sequential compressed sensing setup is as follows. Let \(x \in \mathbb {R}^{n}\) be an unknown n-dimensional signal. We make K measurements of x sequentially
$$y_{k} = a_{k}^{\intercal} x + w_{k}, \quad k = 1, \ldots, K, $$
and the power of the measurement vector is \(\|a_{k}\|^{2}=\beta _{k}\). The goal is to recover x using measurements \(\{y_{k}\}_{k=1}^{K}\). Consider a Gaussian signal \(x \sim \mathcal {N}(0, \Sigma)\) with known zero mean and covariance matrix Σ (here, without loss of generality, we have assumed the signal has zero mean). Assume the rank of Σ is s and the signal is low rank, i.e., s≪n (however, the algorithm does not require the covariance to be low rank):
$$\text{rank}(\Sigma) = s\ll n. $$
Our goal is to estimate the signal x using sequential and adaptive measurements. Info-Greedy Sensing introduced in [6] is one of such adaptive methods which chooses each measurement to maximize the conditional mutual information
$$ a_{k} \leftarrow \underset{a}{\text{argmax}} \left\{\mathbb{I}\left[{x}; {{a}^{\intercal} x + w} | y_{j}, a_{j}, j < k \right]/a^{\intercal} a \right\}. $$
The goal of this sensing scheme is to use a minimum number of measurements (or to use the minimum total power) so that the estimated signal is recovered with precision ε; i.e., \(\|\widehat {x} - x\| < \varepsilon \) with a high probability p. Define
$$\chi_{n, p, \varepsilon} \triangleq \varepsilon^{2}/\chi_{n}^{2}(p), $$
and we will show in the following that this is a fundamental quantity that determines the termination condition of our algorithm to achieve the precision ε with the confidence level p. Note that χn,p,ε is a precision ε adjusted by the confidence level.
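As a quick numerical illustration (not part of the paper), χn,p,ε can be computed from the chi-squared quantile. The sketch below stays in the Python standard library by using the Wilson-Hilferty approximation to the quantile; the helper name `chi_n_p_eps` is ours, and scipy.stats.chi2.ppf(p, n) would give the exact value.

```python
from statistics import NormalDist


def chi_n_p_eps(n, p, eps):
    """chi_{n,p,eps} = eps^2 / chi_n^2(p), where chi_n^2(p) is the p-quantile
    of the chi-squared distribution with n degrees of freedom.

    The quantile is computed with the Wilson-Hilferty approximation (an
    assumption here, made to avoid external dependencies)."""
    z = NormalDist().inv_cdf(p)                      # standard normal p-quantile
    q = n * (1 - 2 / (9 * n) + z * (2 / (9 * n)) ** 0.5) ** 3
    return eps ** 2 / q
```

For example, with n = 100 and p = 0.95 the approximate quantile is about 124.34 (the exact value is 124.342), so the stopping level shrinks proportionally to ε².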

2.1 Gaussian signal

In [6], we have devised a solution to (1) when the signal is Gaussian. The measurement will be made in the directions of the eigenvectors of Σ in a decreasing order of eigenvalues, and the powers (or the number of measurements) will be such that the eigenvalues after the measurements are sufficiently small (i.e., less than ε). The power allocation depends on the noise variance, signal recovery precision ε, and confidence level p, as given in Algorithm 1. Note that in Step 6, the update of covariance matrix can also be implemented, equivalently, via \(\lambda \sigma ^{2} uu^{\intercal }/\left (\beta \lambda +\sigma ^{2}\right) + \Sigma ^{\perp u}\), as explained in (6). In the algorithm, the initializations μ and Σ are estimated and may not be very accurate.
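The iteration just described can be sketched in NumPy as follows. This is our minimal illustration of Algorithm 1, not the paper's code: the helper name `info_greedy_gaussian` and the floating-point tolerance in the stopping rule are ours. It uses the power allocation β = (1/χ − 1/λ)⁺σ² and the rank-one posterior update of (6), with `chi` standing for χn,p,ε.

```python
import numpy as np


def info_greedy_gaussian(x, Sigma, mu, sigma2, chi, K_max=50, rng=None):
    """Sketch of Info-Greedy Sensing for a Gaussian signal (Algorithm 1).

    Each iteration measures along the top eigenvector of the current assumed
    covariance, then applies the Gaussian posterior (rank-one) update."""
    rng = np.random.default_rng(rng)
    Sigma, mu = Sigma.astype(float), mu.astype(float)
    for _ in range(K_max):
        lam, U = np.linalg.eigh(Sigma)            # eigenvalues in ascending order
        lmax, u = lam[-1], U[:, -1]
        if lmax <= chi * (1 + 1e-8):              # all eigenvalues below threshold
            break
        beta = max(1.0 / chi - 1.0 / lmax, 0.0) * sigma2
        a = np.sqrt(beta) * u                     # measurement vector, ||a||^2 = beta
        y = a @ x + np.sqrt(sigma2) * rng.standard_normal()
        g = Sigma @ a / (a @ Sigma @ a + sigma2)  # Kalman-style gain
        mu = mu + g * (y - a @ mu)
        Sigma = Sigma - np.outer(g, a @ Sigma)    # rank-one update, cf. (6)
    return mu, Sigma
```

With this power allocation, each measured eigenvalue drops exactly to χ, so the loop terminates once every eigenvalue is at or below the stopping level.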

2.2 One-sparse measurement

The problem of Info-Greedy Sensing with a sparse measurement constraint, i.e., each measurement has only k0 non-zero entries (\(\|a\|_{0}=k_{0}\)), has been examined in [6] and solved using outer approximation (cutting planes). Here, we focus on one-sparse measurements, \(\|a\|_{0}=1\), an important instance arising in applications such as nondestructive testing (NDT).

Info-Greedy Sensing with one-sparse measurements can be readily derived. Note that the mutual information between x and the outcome using one-sparse measurement \(y_{1} = e_{j}^{\intercal } x + w_{1}\) is given by
$$\mathbb{I} [x;y_{1}]=\frac{1}{2}\text{ln} \left(\Sigma_{jj}/\sigma^{2}+1\right), $$
where Σ jj denotes the jth diagonal entry of the matrix Σ. Hence, the measurement that maximizes the mutual information is given by \(\phantom {\dot {i}\!}e_{j^{*}}\), where \(j^{*} \triangleq \arg \max _{j} \Sigma _{jj}\), i.e., we measure the signal coordinate with the largest variance or largest uncertainty. Info-Greedy Sensing measurements can then be found iteratively, as presented in Algorithm 2. Note that the correlation of signal coordinates is reflected in the update of the covariance matrix: if the ith and jth coordinates of the signal are highly correlated, then the uncertainty in j will also be greatly reduced if we measure in i. As with Algorithm 1, the initial parameters are not required to be accurate.
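A minimal sketch of this one-sparse procedure follows; this is our illustrative reading of Algorithm 2 (the helper name `info_greedy_one_sparse` is hypothetical). It repeatedly measures the coordinate with the largest posterior variance and applies the same rank-one posterior update, so correlated coordinates also see their uncertainty reduced.

```python
import numpy as np


def info_greedy_one_sparse(x, Sigma, mu, sigma2, K, beta=1.0, rng=None):
    """Sketch of one-sparse Info-Greedy Sensing (Algorithm 2):
    a_k = sqrt(beta) * e_{j*}, with j* the largest-variance coordinate."""
    rng = np.random.default_rng(rng)
    Sigma, mu = Sigma.astype(float), mu.astype(float)
    for _ in range(K):
        j = int(np.argmax(np.diag(Sigma)))        # coordinate of largest uncertainty
        a = np.zeros(len(mu))
        a[j] = np.sqrt(beta)
        y = a @ x + np.sqrt(sigma2) * rng.standard_normal()
        g = Sigma @ a / (a @ Sigma @ a + sigma2)
        mu = mu + g * (y - a @ mu)
        Sigma = Sigma - np.outer(g, a @ Sigma)    # shrinks row/column j and correlates
    return mu, Sigma
```

On a highly correlated covariance such as \(dd^{\intercal }+5I\), a few coordinate measurements already reduce the total posterior uncertainty substantially.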

2.3 Updating covariance with sequential data

If our goal is to estimate a sequence of data x1,x2,… (versus just estimating a single instance), we may be able to update the covariance matrix using the already estimated signals simply via
$$ \widehat{\Sigma}_{t} = \alpha \widehat{\Sigma}_{t-1} + (1-\alpha)\hat{x}_{t}\hat{x}_{t}^{\intercal}, \quad t = 1, 2, \ldots, $$

and the initial covariance matrix is specified by our prior knowledge \(\widehat {\Sigma }_{0} = \widehat {\Sigma }\). Using the updated covariance matrix \(\widehat {\Sigma }_{t}\), we design the next measurement for signal xt+1. This way, we may be able to correct the inaccuracy of \(\widehat {\Sigma }\) by including new samples. Here, α is a parameter for the update step-size. We refer to this method as “Info-Greedy-2” hereafter.
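A one-line sketch of the update (2); the function name and the default value of α are our own choices, not the paper's.

```python
import numpy as np


def update_covariance(Sigma_hat, x_hat, alpha=0.5):
    """Exponential-forgetting covariance update of Eq. (2) ("Info-Greedy-2"):
    blend the current estimate with the outer product of the newly
    recovered signal x_hat."""
    return alpha * Sigma_hat + (1 - alpha) * np.outer(x_hat, x_hat)
```

Since both terms are positive semi-definite for α in [0, 1], the updated matrix remains a valid covariance estimate.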

2.4 Gaussian mixture model signals

In this subsection we introduce the case of sensing Gaussian mixture model (GMM) signals. The probability density function of GMM is given by
$$ p(x) = \sum\limits_{c=1}^{C} \pi_{c} \phi(x|\mu_{c}, \Sigma_{c}), $$
where C is the number of classes, and π c is the probability that the sample is drawn from class c. Unlike for Gaussian signals, the mutual information of GMM has no explicit form. However, for GMM signals, there are two approaches that tend to work well: Info-Greedy Sensing derived based on a gradient descent approach [6, 21] uses the fact that the gradient of the conditional mutual information with respect to a is a linear transform of the minimum mean square error (MMSE) matrix [33, 34], and the so-called greedy heuristic [6], which approximately maximizes the mutual information, shown in Algorithm 3. The greedy heuristic picks the Gaussian component with the highest posterior π c at that moment and chooses the next measurement a as its eigenvector associated with the maximum eigenvalue. The greedy heuristic can be implemented more efficiently compared to the gradient descent approach and sometimes has competitive performance [6]. Also, the initialization for means, covariances, and weights can be off from the true values.

3 Performance bounds

In the following, we establish performance bounds, for cases when we (1) sense Gaussian signals using estimated covariance matrices and (2) sense Gaussian signals with one-sparse measurements.

3.1 Gaussian case with model mismatch

To analyze the performance of our algorithms when the assumed covariance \(\widehat {\Sigma }\) used in Algorithm 1 is different from the true signal covariance matrix Σ, we introduce the following notations. Let the eigenpairs of Σ with the eigenvalues (which can be zero) ranked from the largest to the smallest to be (λ1,u1),(λ2,u2),…,(λ n ,u n ), and let the eigenpairs of \(\widehat {\Sigma }\) with the eigenvalues (which can be zero) ranked from the largest to the smallest to be \((\hat {\lambda }_{1}, \hat {u}_{1}), (\hat {\lambda }_{2}, \hat {u}_{2}), \ldots, (\hat {\lambda }_{n}, \hat {u}_{n})\). Let the updated covariance matrix in Algorithm 1 starting from \(\widehat {\Sigma }\) after k measurements be \(\widehat {\Sigma }_{k}\) and the true posterior covariance matrix of the signal conditioned on these measurements be Σ k .

Note that since each time we measure in the direction of the dominating eigenvector of the posterior covariance matrix, \((\hat {\lambda }_{k}, \hat {u}_{k})\) and (λ k ,u k ) correspond to the largest eigenpair of \(\widehat {\Sigma }_{k-1}\) and Σk−1, respectively. Furthermore, define the difference between the true and the assumed conditional covariance matrices after k measurements as
$$E_{k}\triangleq\widehat{\Sigma}_{k}-\Sigma_{k},\quad k = 1, \ldots, K, $$
and their sizes
$$\delta_{k}\triangleq\Vert E_{k}\Vert, \quad k = 1, \ldots, K. $$
Let the eigenvalues of E k be \(e_{1}\geq e_{2}\geq \cdots \geq e_{n}\); then the spectral norm of E k is the maximum of the absolute values of the eigenvalues. Hence, δ k = max{|e1|,|e n |}. Let
$$\delta_{0} \triangleq \|\widehat{\Sigma} - \Sigma\| $$
denote the size of the initial mismatch.

3.1.1 Deterministic mismatch

First, we assume the mismatch is deterministic and find bounds for the bias and variance of the estimated signal. It is common in practice to use estimated covariance matrices, which may have a deterministic bias from the true covariances. Assume the initial mean is \(\hat {\mu }\) and the true signal mean is μ; let the updated mean using Algorithm 1 after k measurements be \(\hat {\mu }_{k}\) and the true posterior mean be μ k .

Theorem 1

[Unbiasedness] After k measurements, the expected difference between the updated mean and the true posterior mean is given by
$$\mathbb E[\hat{\mu}_{k} - \mu_{k}]=(\hat{\mu} - \mu) \cdot\prod_{j=1}^{k}\left(I_{n}-\frac{\beta_{j}\hat{\lambda}_{j} }{\beta_{j}\hat{\lambda}_{j}+\sigma^{2}} \hat{u}_{j} \hat{u}_{j}^{\intercal}\right). $$

Moreover, if \(\hat {\mu } = \mu \), i.e., the assumed mean is accurate, the estimator is unbiased throughout all the iterations \(\mathbb E[\hat {\mu }_{k} - \mu _{k}]=0\), for k=1,…,K.

Next, we show that the variance of the estimator, when the initial mismatch \(\|\widehat {\Sigma } - \Sigma \|\) is sufficiently small, reduces gracefully. This is captured through the reduction of entropy, which is also a measure of the uncertainty in the estimator. In particular, we consider the posterior entropy of the signal conditioned on the previous measurement outcomes. Since the entropy of a Gaussian signal \(x \sim \mathcal {N}(\mu, \Sigma)\) is given by \( \mathbb {H}[\!x] = \ln \left [(2\pi e)^{n/2} \det ^{1/2}(\Sigma)\right ], \) the conditional mutual information is the log of the determinant of the conditional covariance matrix, or equivalently the log of the volume of the ellipsoid defined by the covariance matrix. Here, to accommodate the scenario where the covariance matrix is low rank (our earlier assumption), we consider a modified definition for conditional entropy, which is the logarithm of the volume of the ellipsoid on the low-dimensional space that the signal lies on:
$$\mathbb{H}[{x}|y_{j}, a_{j}, j \leq k] = \ln [(2\pi e)^{s/2} {\textsf{Vol}}(\Sigma_{k})], $$
where Vol(Σ k ) is the volume of the ellipse, which equals to the product of the non-zero eigenvalues of Σ k :
$${\textsf{Vol}}(\Sigma_{k}) = \lambda_{1}\cdots\lambda_{s_{k}}, $$
where rank(Σ k )=s k .
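This modified entropy can be evaluated numerically as below (our illustration, not from the paper). Note that we take s to be the number of non-zero eigenvalues of Σ k , i.e., s k ; this and the rank tolerance are our assumptions.

```python
import numpy as np


def modified_entropy(Sigma_k, tol=1e-10):
    """Modified conditional entropy from the text: ln[(2*pi*e)^{s/2} Vol(Sigma_k)],
    with Vol(Sigma_k) the product of the non-zero eigenvalues of Sigma_k."""
    lam = np.linalg.eigvalsh(Sigma_k)
    lam = lam[lam > tol]                           # keep the non-zero spectrum
    return 0.5 * len(lam) * np.log(2 * np.pi * np.e) + np.sum(np.log(lam))
```

For a rank-2 covariance diag(2, 3, 0) this gives ln(2πe) + ln 6, the log-volume of the ellipsoid restricted to the signal's two-dimensional subspace.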

Theorem 2

[Entropy of estimator] If for some constant δ∈(0,1) the initial error satisfies
$$ \|\widehat{\Sigma}-\Sigma \|\leq \frac{\delta}{4^{K+1}}{\chi_{n, p, \varepsilon}}, $$
then for k=1,…,K,
$$ \mathbb{H}[{x}|y_{j}, a_{j}, j \leq k] \leq \frac{s}{2} \left\{\!\ln [2\pi e~\text{tr}(\Sigma)] - \sum\limits_{j=1}^{k}\ln (1/f_{j})\! \right\}\!, $$
$$ f_{k}\triangleq 1-\frac{1-\delta}{s}\frac{\beta_{k}\hat{\lambda}_{k}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}} \in (0, 1),\quad k = 1, \ldots, K. $$
Note that in (3), the allowable initial error decreases with K. This is because a larger K means the recovery precision criterion is stricter, and hence, the maximum tolerable initial bias is smaller. In the proof of Theorem 2, we track the trace of the underlying actual covariance matrix tr(Σ k ) as the cost function; it serves as a surrogate for the product of eigenvalues that determines the volume of the ellipsoid and hence the entropy, since it is much easier to calculate the trace of the observed covariance matrix \(\text {tr} (\widehat {\Sigma }_{k})\). The following recursion is crucial for the derivation: for an assumed covariance matrix Σ, after measuring in the direction of a unit-norm eigenvector u with eigenvalue λ using power β, the updated matrix takes the form of
$$\begin{array}{*{20}l} \Sigma - \Sigma \sqrt{\beta} u& \left({\sqrt{\beta}u}^{\intercal} \Sigma \sqrt{\beta} u + \sigma^{2} \right)^{-1} {\sqrt{\beta} u}^{\intercal} \Sigma \\ &= \frac{\lambda \sigma^{2}}{\beta \lambda + \sigma^{2}} u {u}^{\intercal} + \Sigma^{\perp u}, \end{array} $$
where Σu is the component of Σ in the orthogonal complement of u. Thus, the only change in the eigen-decomposition of Σ is the update of the eigenvalue of u from λ to λσ2/(βλ+σ2). Based on (6), after one measurement, the trace of the covariance matrix becomes
$$ \text{tr}\left(\widehat{\Sigma}_{k}\right)=\text{tr}\left(\widehat{\Sigma}_{k-1}\right)-\frac{\beta_{k}\hat{\lambda}_{k}^{2}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}}. $$
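Both (6) and (7) can be verified numerically. The short script below (our illustration) builds a generic covariance, measures along its top eigenvector, and checks the eigenvalue update and the trace recursion.

```python
import numpy as np

# Numerical check of Eqs. (6) and (7): measure a PSD covariance along its
# top eigenvector u with power beta, then apply the posterior update.
rng = np.random.default_rng(0)
n, sigma2, beta = 6, 0.1, 2.0
A = rng.standard_normal((n, n))
Sigma = A @ A.T                                   # a generic PSD covariance
lam, U = np.linalg.eigh(Sigma)
lmax, u = lam[-1], U[:, -1]
a = np.sqrt(beta) * u
Sigma_new = Sigma - np.outer(Sigma @ a, a @ Sigma) / (a @ Sigma @ a + sigma2)

# Eq. (6): only u's eigenvalue changes, to lmax*sigma2/(beta*lmax + sigma2)
assert np.isclose(u @ Sigma_new @ u, lmax * sigma2 / (beta * lmax + sigma2))
# Eq. (7): the trace drops by beta*lmax^2/(beta*lmax + sigma2)
assert np.isclose(np.trace(Sigma_new),
                  np.trace(Sigma) - beta * lmax ** 2 / (beta * lmax + sigma2))
```

Since \(\Sigma a = \sqrt{\beta}\,\lambda u\), the rank-one correction lives entirely in the span of u, which is why the rest of the eigen-decomposition is untouched.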

Remark 1

The upper bound of the posterior signal entropy in (4) shows that the amount of uncertainty reduction by the kth measurement is roughly (s/2) ln(1/f k ).

Remark 2

Using the inequality ln(1−x)≤−x for x∈(0,1), we have that in (4)
$$\begin{array}{*{20}l} {}\mathbb{H}[{x}|y_{j}, a_{j}, j \leq k] &\leq \frac{s}{2}\ln[2\pi e\text{tr}(\Sigma)]-\frac{1-\delta}{2}\sum\limits_{j=1}^{k} \frac{\beta_{j}\hat{\lambda}_{j}}{\beta_{j}\hat{\lambda}_{j}+\sigma^{2}}\\ &=\frac{s}{2}\ln[2\pi e\text{tr}(\Sigma)]-\frac{k(1-\delta)}{2} \\ &\quad+\frac{(1-\delta)}{2}\sum\limits_{j=1}^{k}\frac{{\chi_{n, p, \varepsilon}}}{\hat{\lambda}_{j}}. \end{array} $$
On the other hand, in the ideal case if the true covariance matrix is used, the posterior entropy of the signal is given by
$$ \mathbb{H}_{\text{ideal}}\left[x \,\middle|\, y_{j}, a_{j}, j \leq k\right] \!=\frac{1}{2} \ln \left[\!\!(2\pi e)^{s} \prod_{j=1}^{s} \lambda_{j}\!\! \right] -\frac{1}{2}\sum\limits_{j=1}^{k} \frac{\lambda_{j}}{{\chi_{n, p, \varepsilon}}}, $$
where \(\tilde {\beta }_{j} = (1/{\chi _{n, p, \varepsilon }}-1/\lambda _{j})^{+}\sigma ^{2}\). Hence, we have
$$\begin{array}{*{20}l} \mathbb{H}[{x}|y_{j}, a_{j}, j \leq k] &\leq \mathbb{H}_{\text{ideal}}\left[x \,\middle|\, y_{j}, a_{j}, j \leq k\right] \\ + C - \frac{1}{2} \sum\limits_{j=1}^{k} &\left[ \frac{\lambda_{j}}{{\chi_{n, p, \varepsilon}}} + (1-\delta)\left(1-\frac{{\chi_{n, p, \varepsilon}}}{\hat{\lambda}_{j}}\right) \right]. \end{array} $$

where \(C \triangleq (s/2) \ln [\text {tr}(\Sigma)/(\prod _{j=1}^{s}\lambda _{j})^{1/s}]\) is a constant independent of the measurements. This upper bound has a nice interpretation: it characterizes the amount of uncertainty reduction with each measurement. For example, when the number of measurements required using the assumed covariance matrix is the same as when using the true covariance matrix, we have λ j ≥χn,p,ε and \(\hat {\lambda }_{j} \geq {\chi _{n, p, \varepsilon }}\). Hence, the third term in (9) is upper bounded by −k/2, which means that the amount of reduction in entropy is roughly 1/2 nat per measurement.

Remark 3

Consider the special case where the errors occur only in the eigenvalues of the matrix but not in the eigenspace U, i.e., \(\widehat {\Sigma } - \Sigma = U \text {diag}\{e_{1}, \cdots, e_{s}\} U^{\intercal }\) and \(\max_{1\leq j\leq s}|e_{j}|=\delta_{0}\); then the upper bound in (8) can be further simplified. Suppose only the first K (K≤s) largest eigenvalues of \(\widehat {\Sigma }\) are larger than the stopping criterion χn,p,ε required by the precision, i.e., the algorithm takes K iterations in total. Then,
$$\begin{array}{*{20}l} \mathbb{H}[{x}|y_{j}, a_{j}, j \leq k] &\leq \mathbb{H}_{\text{ideal}}\left[x \,\middle|\, y_{j}, a_{j}, j \leq k\right] \\ + K\ln(1+\delta_{K}/{\chi_{n, p, \varepsilon}}) +&\sum\limits_{j=K+1}^{s} \ln(1+(\delta_{0}+\delta_{K})/\lambda_{j}). \end{array} $$

The additional entropy relative to the ideal case \(\mathbb {H}_{\text {ideal}}\) is typically small: \(\delta_{K}\leq \delta_{0}4^{K}\) (according to Lemma 7 in the Appendix 2) and δ0 is on the order of ε2, so the second term is on the order of Kε2; the third term is small because δ0 and δ K are small compared to λ j .

Note that, however, if the power allocations β i are calculated using the eigenvalues of the assumed covariance matrix \(\widehat {\Sigma }\), after K=s iterations, we are not guaranteed to reach the desired precision ε with probability p. However, this becomes possible if we increase the total power slightly. The following theorem establishes an upper bound on the amount of extra total power needed to reach the same precision ε compared to the total power Pideal if we use the correct covariance matrix.

Theorem 3

[Additional power required] Assume K≤s eigenvalues of Σ are larger than χn,p,ε. If
$$\|\widehat{\Sigma} - \Sigma\|\leq \frac{1}{4^{s+1}}{\chi_{n, p, \varepsilon}}, $$
then to reach a precision ε at confidence level p, the total power Pmismatch required by Algorithm 1 when using \(\widehat {\Sigma }\) is upper bounded by
$$P_{\text{mismatch}} < P_{\text{ideal}} + \left[\frac{20}{51}s+\frac{1}{272}K\right]\frac{\sigma^{2}}{{\chi_{n, p, \varepsilon}}}. $$
Note that when K=s eigenvalues of Σ are larger than χn,p,ε, under the conditions of Theorem 3, we have the simpler expression for the upper bound
$$\begin{array}{*{20}l} P_{\text{mismatch}} & < P_{\text{ideal}} + \frac{323}{816} \frac{\sigma^{2}}{{\chi_{n, p, \varepsilon}}}s. \end{array} $$

Note that the additional power required is quite small and is only linear in s.

3.2 One-sparse measurement

In the following, we provide performance bounds for the case of one-sparse measurements in Algorithm 2. Assume the signal covariance matrix is known precisely. Now that \(\|a_{k}\|_{0}=1\), we have \(a_{k}=\sqrt {\beta _{k}} u_{k}\), where \(u_{k}\in \{e_{1},\ldots,e_{n}\}\). Suppose the largest diagonal entry of Σ(k−1) is determined by
$$j_{k-1}=\text{arg} \max_{t}\Sigma_{tt}^{(k-1)}. $$
From the update equation for the covariance matrix in Algorithm 2, the largest diagonal entry of Σ(k) can be determined from
$$j_{k}=\text{arg}\max_{t}~\left\{ \Sigma_{tt}^{(k-1)}-\frac{\left(\Sigma_{t j_{k-1}}^{(k-1)}\right)^{2}}{\Sigma_{j_{k-1}j_{k-1}}^{(k-1)}+\sigma^{2}/\beta_{k}}\right\}. $$
Let the correlation coefficient be denoted as
$$\rho_{ij}^{(k)}\triangleq\frac{\left(\Sigma_{ij}^{(k)}\right)^{2}}{\Sigma_{ii}^{(k)}\Sigma_{jj}^{(k)}}, $$
where the covariance of the ith and jth coordinate of x after k measurements is denoted as \(\Sigma _{ij}^{(k)}\).

Lemma 1

[One-sparse measurement. Recursion for the trace of the covariance matrix] Assume the minimum correlation for the kth iteration is \(\rho ^{(k-1)}\in [0,1)\) such that \(\rho ^{(k-1)}\leq \left |\rho _{ij_{k-1}}^{(k-1)}\right |\) for any \(i\in [n]\). Then, for a constant γ>0, if the power of the kth measurement β k satisfies \(\beta _{k}\geq {\sigma ^{2}}/\left ({\gamma \max _{t}\Sigma _{tt}^{(k-1)}}\right)\), we have
$$ \text{tr}(\Sigma_{k})\leq \left[1-\frac{(n-1)\rho^{(k-1)}+1}{n(1+\gamma)}\right]\text{tr}(\Sigma_{k-1}). $$
Lemma 1 provides a good bound for a one-step ahead prediction for the trace of the covariance matrix, as demonstrated in Fig. 2. Using the above lemma, we can obtain an upper bound on the number of measurements needed for one-sparse measurements.
Fig. 2

One-step ahead prediction for the trace of the covariance matrix: the offline bound corresponds to applying (10) iteratively k times, and the online bound corresponds to predicting tr(Σ k ) using tr(Σk−1). Here n=100, p=0.95, ε=0.1, \(\Sigma =dd^{\intercal }+5I_{n}\) where \(d=[1,\cdots,1]^{\intercal }\)

Theorem 4

[Gaussian, one-sparse measurement] For constant γ>0, when power is allocated satisfying \(\beta _{k}\geq {\sigma ^{2}}/({\gamma \max _{t}\Sigma _{tt}^{(k-1)}})\) for k=1,2,…,K, we have \(\|\hat {x} - x\|\leq \varepsilon \) with probability p as long as
$$ K\geq \frac{\ln[\text{tr}(\Sigma)/{\chi_{n, p, \varepsilon}}]}{\ln\frac{1}{1-{1}/[n(1+\gamma)]}}. $$

The above theorem requires the number of iterations to be on the order of ln(1/ε) to reach a precision of ε (recall that \({\chi _{n, p, \varepsilon }} = \varepsilon ^{2}/\chi _{n}^{2}(p)\)), as expected. It also suggests a method of power allocation, which sets β k to be proportional to \(\sigma ^{2}/\max _{t}\Sigma _{tt}^{(k-1)}\). This captures the inter-dependence of the signal entries as the dependence will affect the diagonal entries of the updated covariance matrix.
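The bound (11) is easy to evaluate numerically; the helper below (our naming, not from the paper) makes the ln(1/ε) scaling visible, since χn,p,ε ∝ ε².

```python
import numpy as np


def num_measurements_bound(trace_Sigma, chi, n, gamma):
    """Upper bound (11) on the number of one-sparse measurements:
    K >= ln(tr(Sigma)/chi) / ln(1 / (1 - 1/(n*(1+gamma))))."""
    return np.log(trace_Sigma / chi) / -np.log(1.0 - 1.0 / (n * (1.0 + gamma)))
```

Halving ε quadruples nothing but the logarithm's argument, so the required K grows only logarithmically as the precision tightens.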

4 Results: numerical examples

In the following, we present three sets of numerical examples to demonstrate the performance of Info-Greedy Sensing when there is a mismatch in the signal covariance matrix, when the signal is sampled from a Gaussian model, and when it is sampled from a GMM, respectively. Below, in all figures, we present the estimation errors sorted from smallest to largest over all trials.

4.1 Sensing Gaussian with mismatched covariance matrix

In the two examples below, we generate true covariance matrices using random positive semi-definite matrices. When the assumed covariance matrix for the signal x is equal to its true covariance matrix, Info-Greedy Sensing is identical to the batch method [21] (the batch method measures using the largest eigenvectors of the signal covariance matrix). However, when there is a mismatch between the two, Info-Greedy Sensing outperforms the batch method due to its adaptivity, as the example in Fig. 3 demonstrates (with K=20). Further performance improvement can be achieved by updating the covariance matrix sequentially using the estimated signals, as described in (2). Info-Greedy Sensing also outperforms the sensing algorithm where a i are chosen to be random Gaussian vectors with the same power allocation, as it uses prior knowledge (albeit imprecise) about the signal distribution.
Fig. 3

Sensing a Gaussian signal of dimension n=100, when there is mismatch between the assumed covariance matrix and the true covariance matrix: \(\widehat {\Sigma } \propto \Sigma + RR^{\intercal }\), where \(R\in \mathbb {R}^{n\times 3}\) and each entry of \(R_{ij} \sim \mathcal {N}(0, 1)\). We repeat 1000 Monte Carlo trials, and for each trial, we use K=20 measurements. The Info-Greedy-2 method corresponds to (2), where we update the assumed covariance matrix sequentially each time we recover a signal and α=0.5

Figure 4 demonstrates that, when there is a mismatch in the assumed covariance matrix, better performance can be achieved by making many lower-power measurements rather than one full-power measurement, because we update the assumed covariance matrix in between. The performance in these scenarios is compared with the case without mismatch. The figure also shows that many lower-power measurements and one full-power measurement perform the same when the assumed model is exact.
Fig. 4

Comparison of sensing a Gaussian signal with dimension n=100 using unit-power measurements along the eigenvector direction, versus splitting each unit-power measurement into five smaller ones, each with amplitude \(\sqrt {1/5}\), updating the covariance matrix in between. The mismatched covariance matrix is \(\widehat {\Sigma } \propto \Sigma + RR^{\intercal }\), where \(R\in \mathbb {R}^{n\times 5}\) and each entry of R is i.i.d. \(\mathcal {N}(0, 1)\), and \(\widehat {\Sigma }\) is normalized to have unit spectral norm. Performance of the algorithm in the presence of mismatch is compared with that with exact parameters

4.2 Measuring Gaussian mixture model signals using one-sparse measurements

In this example, we sense a GMM signal using one-sparse measurements. Assume there are C=3 components and that we know the signal covariance matrices exactly. We consider two ways of generating the low-rank covariance matrix of each component: (a) completely at random and (b) with a certain structure. We expect “Info-Greedy” to outperform “Random” by a larger margin in case (b): since Info-Greedy has an advantage in exploiting structure in the covariance matrix, the structured case should yield a larger performance gain. In case (a), where the covariance matrix is generated randomly, the performance gain is not as significant.

Figure 5 shows the reconstruction error \(\|\hat {x} - x\|\) using K=40 one-sparse measurements for GMM signals. Note that Info-Greedy Sensing (Algorithm 2) with unit power \(\beta_{j}=1\) can significantly outperform the random approach with unit power (which corresponds to randomly selecting coordinates of the signal to measure). The experimental results confirm this expectation.
Fig. 5

Sensing a low-rank GMM signal of dimension n=100 using K=40 measurements with σ=0.001, when the covariance matrices are generated a completely at random, \(\Sigma _{c} \propto RR^{\intercal }\), \(R\in \mathbb {R}^{n\times 3}\), \(R_{ij} \sim \mathcal {N}(0, 1)\), or b with a certain structure, \(\Sigma _{c} \propto 11^{\intercal } + 20 \alpha ^{2} \cdot \text {diag}\left \{n, n-1, \cdots, 1\right \}\), \(\alpha \sim \mathcal {N}(0, 1)\). The covariance matrices \(\Sigma_{c}\) are normalized so that their spectral norms are 1
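For a single Gaussian component, the one-sparse Info-Greedy choice has a simple closed form: measuring coordinate i with power β yields mutual information \(\frac{1}{2}\ln(1+\beta\Sigma_{ii}/\sigma^{2})\), so the algorithm picks the coordinate with the largest posterior variance. A minimal single-component sketch (the structured covariance loosely mimics case (b) above; the full Algorithm 2 for GMM signals additionally handles the mixture components, which we omit here):

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, sigma2 = 100, 40, 1e-6

# Structured covariance loosely mimicking case (b) in the caption:
# Sigma ∝ 11^T + 20 alpha^2 diag{n, ..., 1}, normalized to unit spectral norm.
alpha = rng.standard_normal()
Sigma = np.ones((n, n)) + 20 * alpha**2 * np.diag(np.arange(n, 0, -1.0))
Sigma /= np.linalg.norm(Sigma, 2)

x = rng.multivariate_normal(np.zeros(n), Sigma)
mu, S = np.zeros(n), Sigma.copy()

for _ in range(K):
    i = int(np.argmax(np.diag(S)))   # coordinate with largest posterior variance
    y = x[i] + np.sqrt(sigma2) * rng.standard_normal()
    gain = S[:, i] / (S[i, i] + sigma2)
    mu = mu + gain * (y - mu[i])     # posterior mean update
    S = S - np.outer(gain, S[i, :])  # posterior covariance update

print("error:", np.linalg.norm(mu - x))
```

Because the covariance couples the coordinates (the \(11^{\intercal}\) term), each one-sparse measurement also shrinks the uncertainty of unmeasured coordinates.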

4.3 Real data

4.3.1 Sensing of a video stream using Gaussian model

In this example, we use a video from the Solar Data Observatory. In this scenario, one aims to compress a high-resolution video (before storage and transmission), and each measurement corresponds to a linear compression of a frame of size 232×292 pixels. We use the first 50 frames to form a sample covariance matrix \(\widehat {\Sigma }\) and use it to perform Info-Greedy Sensing on the remaining frames, taking K=90 measurements. As demonstrated in Fig. 6, Info-Greedy Sensing performs much better: it acquires more information, and the recovered image has much richer details.
Fig. 6

Recovery of solar flare images of size 224 by 288 with K=90 measurements and no sensing noise. We used the first 50 frames to estimate the mean and covariance matrix of a single Gaussian. a Original image for the 300th frame. b Ordered relative recovery errors of the 200th to the 300th frames. c Recovery of the 300th frame using random measurements. d Recovery of the 300th frame using Info-Greedy Sensing

4.3.2 Sensing of a high-resolution image using GMM

The second example is motivated by computational photography [35], where one takes a sequence of measurements and each measurement corresponds to the integrated light intensity through a designed mask. We consider a scheme for sensing a high-resolution image that exploits the fact that patches of the image can be approximated using a Gaussian mixture model, as demonstrated in Fig. 1. We break the image into 8×8 patches, which results in 89250 patches. We randomly select 500 patches (0.56% of the total pixels) to estimate a GMM with C=10 components and then, based on the estimated GMM, initialize Info-Greedy Sensing with K=5 measurements and sense the remaining patches. This means we can use a compressive imaging system to capture five low-resolution images of size 238×275 (this corresponds to compressing the data into 8.32% of its original dimensionality). With such a small number of measurements, the image recovered from Info-Greedy Sensing measurements has superior quality compared with that from random sensing measurements.
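The patch decomposition step can be sketched as follows; a random array stands in for the image, and the 232×292 size is borrowed from the video example above (the image in Fig. 1 is larger):

```python
import numpy as np

rng = np.random.default_rng(3)
img = rng.random((232, 292))   # stand-in for a high-resolution image
p = 8                          # patch size used in the text

# Crop to a multiple of the patch size and extract non-overlapping 8x8 patches,
# flattening each one into a 64-dimensional vector.
H, W = (img.shape[0] // p) * p, (img.shape[1] // p) * p
patches = (img[:H, :W]
           .reshape(H // p, p, W // p, p)
           .swapaxes(1, 2)
           .reshape(-1, p * p))

# A small random subset of patches (500 in the text) would then be used to fit
# the C-component GMM before sensing the remaining patches.
subset = patches[rng.choice(len(patches), size=500, replace=False)]
print(patches.shape, subset.shape)
```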

5 Covariance sketching

We may be able to initialize \(\widehat {\Sigma }\) with the desired precision via covariance sketching, i.e., using fewer samples to reach a “rough” estimate of the covariance matrix. In this section, we present a covariance sketching scheme, adapted from earlier works [24, 25]. The goal here is not to present a completely new covariance sketching algorithm, but rather to illustrate how to efficiently obtain an initialization for Info-Greedy Sensing.

Consider the following setup for covariance sketching. Suppose we are able to form a measurement of the form \(y = a^{\intercal } x + w\), as in the Info-Greedy Sensing algorithm.

Suppose there are N copies of the Gaussian signal, \(\tilde {x}_{1},\ldots, \tilde {x}_{N}\), sampled i.i.d. from \(\mathcal {N}(0, \Sigma)\), which we would like to sketch using M random vectors \(b_{1},\ldots,b_{M}\). Then, for each fixed sketching vector \(b_{i}\) and each fixed copy \(\tilde {x}_{j}\) of the signal, we acquire L noisy realizations of the projection result \(y_{ijl}\) via
$$y_{ijl}=b_{i}^{\intercal} \tilde{x}_{j} +w_{ijl}, \quad l = 1, \ldots, L. $$
We choose the random sampling vectors \(b_{i}\) to be i.i.d. Gaussian with zero mean and identity covariance matrix. Then, we average \(y_{ijl}\) over all realizations l = 1,…,L to form the ith sketch \(y_{ij}\) for a single copy \(\tilde {x}_{j}\):
$$y_{ij}= b_{i}^{\intercal}\tilde{x}_{j} +\underbrace{\frac{1}{L}\sum\limits_{l=1}^{L} w_{ijl}}_{w_{ij}}. $$
The average is introduced to suppress the measurement noise, and this scheme can be viewed as a generalization of sketching using just one sample. Denote \( w_{ij}\triangleq \frac {1}{L}{\sum \nolimits }_{l=1}^{L} w_{ijl}, \) which is distributed as \(\mathcal N(0, \sigma ^{2}/L)\). Then, we use the average energy of the sketches as our data \(\gamma_{i}\), i=1,…,M, for covariance recovery: \( \gamma _{i} \triangleq \frac {1}{N}{\sum \nolimits }_{j=1}^{N}y_{ij}^{2}. \) Note that \(\gamma_{i}\) can be further expanded as
$$ \gamma_{i} = \text{tr}\left(\widehat{\Sigma}_{N} b_{i} b_{i}^{\intercal}\right)+\frac{2}{N}\sum\limits_{j=1}^{N} w_{ij} b_{i}^{\intercal} \tilde{x}_{j} +\frac{1}{N}\sum\limits_{j=1}^{N} w_{ij}^{2}, $$
where \( \widehat {\Sigma }_{N}\triangleq \frac {1}{N}\sum _{j=1}^{N}\tilde {x}_{j} \tilde {x}^{\intercal }_{j} \) is the maximum likelihood estimate of Σ (and is also unbiased). We can write (12) in vector-matrix notation as follows. Let \(\gamma =[\gamma _{1},\cdots,\gamma _{M}]^{\intercal }\). Define a linear operator \(\mathcal B:\mathbb R^{n\times n}\mapsto \mathbb R^{M}\) such that \([\mathcal B(X)]_{i}=\text {tr}\left (X b_{i} b_{i}^{\intercal }\right)\). Thus, we can write (12) as a linear measurement of the true covariance matrix Σ: \(\gamma =\mathcal {B} (\Sigma)+\eta,\) where \(\eta \in \mathbb {R}^{M}\) contains all the error terms and corresponds to the noise in our covariance sketching measurements, with the ith entry given by
$$\eta_{i}=b_{i}^{\intercal}(\widehat{\Sigma}_{N}-\Sigma) b_{i}+\frac{2}{N}\sum\limits_{j=1}^{N} w_{ij}b_{i}^{\intercal} \tilde{x}_{j} +\frac{1}{N}\sum\limits_{j=1}^{N} w_{ij}^{2}. $$
Note that we can further bound the \(\ell_{1}\) norm of the error term as
$$\Vert\eta\Vert_{1} =\sum\limits_{i=1}^{M}\vert\eta_{i}\vert \leq \Vert\widehat{\Sigma}_{N}-\Sigma\Vert b+ 2\sum\limits_{i=1}^{M}\vert z_{i}\vert+w, $$
where \( b\triangleq \sum _{i=1}^{M} \Vert b_{i}\Vert ^{2}\), \(\mathbb E[b]=Mn\), \(\text {Var}[b] =2Mn\); \(w\triangleq \frac {1}{N}\sum _{i=1}^{M} \sum _{j=1}^{N} w_{ij}^{2}\), \(\mathbb E[w]=M\sigma ^{2}/L\), \(\text {Var} [w]=\frac {2M\sigma ^{4}}{NL^{2}}\); and
$$z_{i}\triangleq\frac{1}{N}\sum\limits_{j=1}^{N} w_{ij}b_{i}^{\intercal} \tilde{x}_{j},\ \mathbb E[z_{i}]=0\ \text{and}\ \text{Var} [z_{i}]=\frac{\sigma^{2} \text{tr}(\Sigma)} {NL}. $$
We may recover the true covariance matrix from the sketches γ using the convex optimization problem (13).
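The sketch formation above is easy to simulate directly. The following numerically checks the decomposition in (12), i.e., that \(\gamma_{i}\) equals \(b_{i}^{\intercal}\widehat{\Sigma}_{N}b_{i}\) up to the small noise terms; all parameter values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, M, L, sigma = 10, 200, 40, 50, 0.01   # illustrative parameter values

G = rng.standard_normal((n, n))
Sigma = G @ G.T / n
X = rng.multivariate_normal(np.zeros(n), Sigma, size=N)  # copies x_1, ..., x_N
B = rng.standard_normal((M, n))                          # sketching vectors b_i

# y_ij: sketch of copy j with vector i, averaged over L noisy realizations.
W = sigma * rng.standard_normal((M, N, L))
Y = B @ X.T + W.mean(axis=2)                             # shape (M, N)

gamma = (Y**2).mean(axis=1)                              # average sketch energy

# Check against (12): gamma_i ≈ b_i^T Sigma_N b_i up to the small noise terms.
Sigma_N = X.T @ X / N
quad = np.einsum('ij,jk,ik->i', B, Sigma_N, B)
print(np.max(np.abs(gamma - quad)))
```

With σ small and L moderate, the cross and squared noise terms are orders of magnitude below the quadratic term, which is what makes the recovery step workable.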

We need L to be sufficiently large to reach the desired precision. The following Lemma 2 follows from a simple tail probability bound for the Wishart distribution (since the sample covariance matrix follows a Wishart distribution).

Lemma 2

[Initialize with sample covariance matrix] For any constant δ>0, we have \(\Vert \widehat {\Sigma } -\Sigma \Vert \leq \delta \) with probability exceeding \(1-2n\exp (-\sqrt {n})\), as long as
$$L\geq 4n^{1/2}\text{tr}(\Sigma)\left(\Vert\Sigma\Vert/\delta^{2}+4/\delta\right). $$

Lemma 2 shows that the number of measurements needed to reach a precision δ for a sample covariance matrix is \(\mathcal {O}(1/\delta ^{2})\) as expected.

We may also use a covariance sketching scheme similar to that described in [23–25] to estimate \(\widehat {\Sigma }\). Covariance sketching is based on random projections of each training sample and hence is memory efficient when we are not able to store or operate on the full vectors directly. The scheme is as follows. Assume training samples \(\tilde {x}_{i}\), i=1,…,N, are drawn from the signal distribution. Each sample \(\tilde {x}_{i}\) is sketched M times using random sketching vectors \(b_{ij}\), j=1,…,M, through a noisy quadratic measurement \(\left (b_{ij}^{\intercal } \tilde{x}_{i} + w_{ijl}\right)^{2}\); we repeat this L times (l=1,…,L) and compute the average energy to suppress noise. This sketching process can be shown to be a linear operator \(\mathcal {B}\) applied to the original covariance matrix Σ. We may recover the original covariance matrix from the vector of sketching outcomes \(\gamma \in \mathbb {R}^{M}\) by solving the following convex optimization problem
$$ \begin{array}{rl} \widehat{\Sigma}= \text{argmin}_{X} & \text{tr}(X)\\ \text{subject\ to}& X\ \succeq 0,\ \Vert \gamma-\mathcal B(X)\Vert_{1}\leq \tau, \end{array} $$

where τ is a user parameter that depends on the noise level. In the following lemma (Lemma 3), we further establish conditions on the covariance sketching parameters N, M, L, and τ so that the recovered covariance matrix \(\widehat {\Sigma }\) reaches the precision required in Theorem 2, by adapting the results in [25].
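As a toy stand-in for (13): in the oversampled, noiseless regime (M at least n(n+1)/2 generic sketches), Σ can be recovered by plain least squares on the linear operator \(\mathcal{B}\), since the symmetric part of the unknown is then fully identifiable; the trace minimization (13) and a convex solver are needed in the compressive regime M ≪ n². A minimal sketch of the oversampled case (not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(5)
n, M = 8, 100                 # oversampled: M >= n(n+1)/2 quadratic sketches

G = rng.standard_normal((n, n))
Sigma = G @ G.T / n
B = rng.standard_normal((M, n))

# Noiseless sketches: gamma_i = b_i^T Sigma b_i = <b_i b_i^T, Sigma>.
gamma = np.einsum('ij,jk,ik->i', B, Sigma, B)

# The operator B as an M x n^2 design matrix acting on vec(Sigma).
A = np.stack([np.outer(b, b).ravel() for b in B])
vec, *_ = np.linalg.lstsq(A, gamma, rcond=None)
Sigma_rec = vec.reshape(n, n)
Sigma_rec = (Sigma_rec + Sigma_rec.T) / 2   # keep the symmetric part

print(np.linalg.norm(Sigma_rec - Sigma))
```

The minimum-norm least-squares solution lies in the span of the symmetric rank-one matrices \(b_{i}b_{i}^{\intercal}\), so in this regime it coincides with Σ exactly.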

Lemma 3

[Initialize with covariance sketching] For any δ>0, the solution to (13) satisfies \(\Vert \widehat {\Sigma }-\Sigma \Vert \leq \delta \) with probability exceeding \(1-2/n-{2}/{\sqrt {n}}-2n\exp (-\sqrt {n}) -\exp (-c_{1} M)\), as long as the parameters M, N, L, and τ satisfy the following conditions:
$$\begin{array}{*{20}l} M&>c_{0} ns, \quad N \geq 4n^{1/2}\text{tr}(\Sigma)\left(\frac{36 M^{2} n^{2} \Vert\Sigma\Vert}{\tau^{2}}+\frac{24Mn}{\tau}\right), \end{array} $$
$$\begin{array}{*{20}l} L & \!\geq\! \max\!\left\{ \!\frac{M}{4n^{2}\Vert\Sigma\Vert}\sigma^{2}\!, \ \frac{1}{\sqrt{2[\text{tr}{(\Sigma)}/\Vert\Sigma\Vert] Mn^{2}}}\sigma^{2}, \frac{6M}{\tau}\sigma^{2}\!\right\}, \end{array} $$
$$\begin{array}{*{20}l} \tau& =M \delta/c_{2}, \end{array} $$

where c0, c1, and c2 are absolute constants.

Finally, we present a numerical example to validate covariance sketching as an initialization for Info-Greedy, as shown in Fig. 7. We compare it with the case (“direct” in the figure) where the sample covariance matrix is estimated directly from the original samples. The parameters are as follows: signal dimension n=10; there are 30 samples and M=40 sketches for each sample (thus the dimensionality reduction ratio is \(40/10^{2}=0.4\)); precision level ε=0.1; confidence level p=0.95; and noise standard deviation σ0=0.01. The covariance matrix \(\widehat {\Sigma }\) is obtained by solving the optimization problem (13) using the standard optimization solver CVX, a package for specifying and solving convex programs [36, 37]. Note that covariance sketching has a higher error level (the price of dimensionality reduction); however, the errors are still below the precision level (ε=0.1), so the performance of covariance sketching is acceptable.
Fig. 7

Covariance sketching as initialization for Info-Greedy Sensing. Sorted estimation errors in 500 trials. In this example, the signal dimension is n=10 and there are M=40 sketches; thus, the dimensionality reduction ratio is \(40/10^{2}=0.4\). The errors of covariance sketching are higher than those using the direct covariance estimation as initialization (the price of dimensionality reduction); however, note that the errors of covariance sketching are still much below the pre-specified error tolerance ε=0.1 and thus are acceptable

6 Conclusions and discussions

In this paper, we have studied the robustness of sequential compressed sensing algorithms based on conditional mutual information maximization, the so-called Info-Greedy Sensing [6], when the parameters are learned from data. We quantified the algorithm's performance in the presence of estimation errors and further presented a covariance sketching based scheme for initializing the covariance matrix. Numerical examples demonstrated the robust performance of Info-Greedy Sensing.

Our results for Gaussian and GMM signals are quite general in the following sense. In high-dimensional problems, a commonly used low-dimensional signal model for x is to assume the signal lies in a subspace plus Gaussian noise, which corresponds to the case where the signal is Gaussian with a low-rank covariance matrix; GMM is also commonly used (e.g., in image analysis and video processing) as it models signals lying in a union of multiple subspaces plus Gaussian noise. In fact, parameterizing via low-rank GMMs is a popular way to approximate complex densities for high-dimensional data.

7 Appendix 1

7.1 Backgrounds

Lemma 4

[Eigenvalue of perturbed matrix [38]] Let Σ, \(\widehat {\Sigma }\in \mathbb {R}^{n\times n}\) be symmetric, with eigenvalues \(\lambda_{1}\geq\cdots\geq\lambda_{n}\) and \(\hat {\lambda }_{1}\geq \cdots \geq \hat {\lambda }_{n}\), respectively. Let \(E\triangleq \widehat {\Sigma }-\Sigma \) have eigenvalues \(e_{1}\geq\cdots\geq e_{n}\). Then for each \(i\in\{1,\ldots,n\}\), the perturbed eigenvalues satisfy \(\hat {\lambda }_{i}\in [\lambda _{i}+e_{n}, \lambda _{i}+e_{1}].\)
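Lemma 4 (Weyl's perturbation bounds) is easy to verify numerically; a quick check on random symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 30
A = rng.standard_normal((n, n)); Sigma = (A + A.T) / 2       # symmetric
P = rng.standard_normal((n, n)); E = (P + P.T) / 2           # perturbation
Sigma_hat = Sigma + E

lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]       # lambda_1 >= ... >= lambda_n
lam_hat = np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1]
e = np.sort(np.linalg.eigvalsh(E))[::-1]             # e_1 >= ... >= e_n

# Each perturbed eigenvalue lies in [lambda_i + e_n, lambda_i + e_1].
assert np.all(lam_hat >= lam + e[-1] - 1e-10)
assert np.all(lam_hat <= lam + e[0] + 1e-10)
print("Weyl bounds hold")
```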

Lemma 5

[Stability conditions for covariance sketching [25]] Denote by \(\mathcal A:\mathbb R^{n\times n}\mapsto \mathbb R^{m}\) a linear operator such that for \(X\in \mathbb R^{n\times n}\), \(\mathcal A(X)=\{a_{i}^{\intercal} X a_{i}\}_{i=1}^{m}\). Suppose the measurement is contaminated by noise \(\eta\in\mathbb{R}^{m}\), i.e., \(Y=\mathcal A(\Sigma)+\eta \), and assume \(\Vert\eta\Vert_{1}\leq\epsilon_{1}\). Then with probability exceeding \(1-\exp(-c_{1}m)\), the solution \(\widehat {\Sigma }\) to the trace minimization (13) satisfies
$$\Vert \widehat{\Sigma}-\Sigma \Vert_{F}\leq c_{0} \frac{\Vert\Sigma-\Sigma_{r}\Vert_{*}}{\sqrt{r}}+c_{2}\frac{\epsilon_{1}}{m}, $$
for all \(\Sigma\in\mathbb{R}^{n\times n}\), provided that \(m>c_{0}nr\). Here c0, c1, and c2 are absolute constants and \(\Sigma_{r}\) represents the best rank-r approximation of Σ. When Σ is exactly rank r,
$$\Vert \widehat{\Sigma}-\Sigma \Vert_{F}\leq c_{0}\frac{\epsilon_{1}}{m}. $$

Lemma 6

[Concentration of measure for the Wishart distribution [39]] If \(X\in \mathbb R^{n\times n}\sim \mathcal W_{n}(N,\Sigma)\), then for t>0,
$${} P\!\left\{\!\left\Vert\frac{1}{N}X\,-\,\Sigma\right\Vert\!\geq\! \left(\!\sqrt{\frac{2t(\theta+1)}{N}}+\frac{2t\theta}{N}\!\right)\!\Vert\Sigma \Vert\right\}\leq 2n\exp(-t), $$
where \(\theta=\text{tr}(\Sigma)/\Vert\Sigma\Vert\).

8 Appendix 2

8.1 Proofs

8.1.1 Gaussian signal with mismatch

Proof [Proof of Theorem 1] Let \(\xi _{k}\triangleq \hat {\mu }_{k}-\mu _{k}. \) From the update equation for the mean, \(\hat {\mu }_{k} = \hat {\mu }_{k-1} + \widehat {\Sigma }_{k-1} a_{k} \left (y_{k} - a_{k}^{\intercal } \hat {\mu }_{k-1}\right)/\left (a_{k}^{\intercal } \widehat {\Sigma }_{k-1} a_{k} +\sigma ^{2}\right),\) and since \(a_{k}\) is an eigenvector of \(\widehat {\Sigma }_{k-1}\), we have the following recursion:
$$\begin{array}{*{20}l} \xi_{k} & = \left(I_{n}-\frac{\hat{\lambda}_{k}a_{k}a_{k}^{\intercal}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}}\right)\xi_{k-1} \\&\quad+ \left[-\hat{\lambda}_{k} \frac{a_{k}^{\intercal}E_{k-1}a_{k}}{\left(\beta_{k}\hat{\lambda}_{k} + \sigma^{2}-a_{k}^{\intercal}E_{k-1}a_{k}\right){(\beta_{k}\hat{\lambda}_{k}+\sigma^{2}})}a_{k} \right.\\ {}&\left.\quad+\frac{E_{k-1}a_{k}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}-a_{k}^{\intercal}E_{k-1}a_{k}}\right]\left(a_{k}^{\intercal} (x-\mu_{k-1}) + w_{k}\right). \end{array} $$
From the recursion of ξ k in (16), for some vector C k defined properly, we have that
$$\begin{array}{*{20}l} \mathbb{E}[\xi_{k}]=& \left(I-\frac{\hat{\lambda}_{k} \beta_{k}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}}u_{k} u_{k}^{\intercal}\right) \mathbb{E}[\xi_{k-1}] \\&+C_{k} \underbrace{\mathbb E\left[a_{k}^{\intercal}(x-\mu_{k-1})+w_{k}\right]}_{0}, \end{array} $$
where the expectation is taken over the random variables x and w. Note that the second term is equal to zero by an argument based on iterated expectation:
$$\mathbb E\left[a_{k}^{\intercal}(x-\mu_{k-1}) + w_{k}\right]\!=a_{k}^{\intercal} \mathbb E[\mathbb E[x-\mu_{k-1}|y_{1}, \ldots, y_{k}]\!]=0. $$

Hence, Theorem 1 is proved by iteratively applying the recursion (17). When \(\hat {\mu }_{0} - \mu _{0}=0\), we have \( \mathbb E[\xi _{k}]=0\), k=0,1,…,K.

In the following, Lemmas 7 to 9 are used to prove Theorem 2.

Lemma 7

[Recursion in covariance matrix mismatch.]

If \(\delta_{k-1}\leq 3\sigma^{2}/(4\beta_{k})\), then \(\delta_{k}\leq 4\delta_{k-1}\).


Let \(\widehat {A}_{k}\triangleq {a}_{k}{a}^{\intercal }_{k}\); hence, \(\Vert \widehat {A}_{k} \Vert =\beta _{k}\). Recall that \(a_{k}\) is an eigenvector of \(\widehat {\Sigma }_{k-1}\). Using the definition \(E_{k} \triangleq \widehat {\Sigma }_{k} - \Sigma _{k}\), together with the recursions of the covariance matrices
$$\begin{array}{*{20}l} \widehat{\Sigma}_{k} &= \widehat{\Sigma}_{k-1} - \hat{\lambda}_{k} a_{k} a_{k}^{\intercal} \widehat{\Sigma}_{k-1}/\left(\beta_{k}\hat{\lambda}_{k}+\sigma^{2}\right), \end{array} $$
$$\begin{array}{*{20}l} \Sigma_{k} &= \Sigma_{k-1} - \Sigma_{k-1} a_{k} a_{k}^{\intercal} \Sigma_{k-1}/\left(a_{k}^{\intercal}\Sigma_{k-1}a_{k} +\sigma^{2}\right), \end{array} $$
we have
$$E_{k} =E_{k-1}+\frac{\Sigma_{k-1} {a}_{k} {a}_{k}^{\intercal}\Sigma_{k-1}}{{a}_{k}^{\intercal}\Sigma_{k-1} {a}_{k}+\sigma^{2}}-\frac{\hat{\lambda}_{k}{a}_{k}{a}_{k}^{\intercal}\widehat{\Sigma}_{k-1}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}}. $$
Based on this recursion, using \(\delta_{k}=\Vert E_{k}\Vert\), the triangle inequality, and the inequality \(\Vert AB\Vert\leq\Vert A\Vert\Vert B\Vert\), we have
$$\begin{array}{*{20}l} \delta_{k} &\leq \delta_{k-1} +\frac{\hat{\lambda}_{k}\,{a}_{k}^{\intercal} E_{k-1}{a}_{k}}{\left(\beta_{k}\hat{\lambda}_{k}+\sigma^{2}\right) \left(\beta_{k}\hat{\lambda}_{k}+\sigma^{2}-{a}_{k}^{\intercal} E_{k-1}{a}_{k}\right)}\\ &\quad\cdot \Vert \widehat{A}_{k}\widehat{\Sigma}_{k-1}\Vert+\frac{1}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}-{a}_{k}^{\intercal} E_{k-1}{a}_{k}}\\ &\quad\cdot[\hat{\lambda}_{k}(\Vert \widehat{A}_{k}E_{k-1}\Vert +\Vert E_{k-1}\widehat{A}_{k}\Vert) +\Vert E_{k-1}\widehat{A}_{k}E_{k-1}\Vert]\\ &\leq \delta_{k-1} +\frac{\beta_{k}^{2}\hat{\lambda}_{k}^{2}\delta_{k-1}}{\left(\beta_{k}\hat{\lambda}_{k}+\sigma^{2}\right)(\beta_{k}\hat{\lambda}_{k}+\sigma^{2}-\beta_{k}\delta_{k-1})}\\ &\quad+\frac{\beta_{k}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}-\beta_{k}\delta_{k-1}}[2\hat{\lambda}_{k}\delta_{k-1}+\delta_{k-1}^{2}]\\ &\leq \left(1+\frac{3\beta_{k}\hat{\lambda}_{k}}{ \beta_{k}\hat{\lambda}_{k}+\sigma^{2}-\beta_{k}\delta_{k-1} }\right)\delta_{k-1}\\ &\quad+\frac{\beta_{k}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}-\beta_{k}\delta_{k-1}}\delta_{k-1}^{2}. \end{array} $$
Hence, if \(\delta_{k-1}\leq 3\sigma^{2}/(4\beta_{k})\), i.e., \(\delta _{k-1}\beta _{k}\leq \frac {3}{4}\sigma ^{2}\), the right-hand side can be upper bounded by
$$\left(\!1\,+\,3\cdot\frac{\beta_{k}\hat{\lambda}_{k}}{ \beta_{k}\hat{\lambda}_{k}+\sigma^{2}/4}\!\right)\!\delta_{k-1} + 3\cdot\frac{\sigma^{2}/4}{\beta_{k} \hat{\lambda}_{k} + \sigma^{2}/4} \delta_{k-1} \,=\, 4\delta_{k-1}. $$
Hence, if \(\delta_{k-1}\leq 3\sigma^{2}/(4\beta_{k})\), we have \(\delta_{k}\leq 4\delta_{k-1}\).

Lemma 8

[Recursion for trace of the true covariance matrix] If \(\delta _{k-1}\leq \hat {\lambda }_{k}\),
$$ \text{tr}(\Sigma_{k})\leq \text{tr}(\Sigma_{k-1}) -\frac{\beta_{k}\hat{\lambda}_{k}^{2}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}}+\frac{3\beta_{k}\hat{\lambda}_{k}\delta_{k-1}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}-\beta_{k}\delta_{k-1}}. $$


Let \(\widehat {A}_{k}\triangleq {a}_{k}{a}^{\intercal }_{k}\). Using the definition of E k and the recursions (18) and (19), the perturbation matrix E k after k iterations is given by
$$\begin{array}{*{20}l} E_{k}&\,=\,E_{k-1} \,+\,\hat{\lambda}_{k}^{2}\widehat{A}_{k}\cdot\frac{{a}_{k}^{\intercal}E_{k-1}{a}_{k}}{\!\left(\!\beta_{k}\hat{\lambda}_{k} + \sigma^{2}\!\right)\left(\!\beta_{k}\hat{\lambda}_{k}+\sigma^{2}-{a}_{k}^{\intercal}E_{k-1}{a}_{k}\!\right)}\\ &\quad-\frac{\hat{\lambda}_{k}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}-{a}_{k}^{\intercal}E_{k-1}{a}_{k}} \cdot(\widehat{A}_{k}E_{k-1}+E_{k-1}\widehat{A}_{k})\\ &\quad+\frac{1}{\beta_{k}\hat{\lambda}_{k} + \sigma^{2}-{a}_{k}^{\intercal}E_{k-1}{a}_{k}}E_{k-1}\widehat{A}_{k}E_{k-1}. \end{array} $$
Note that \(\text {rank}(\widehat {A}_{k})=1\), and thus \(\text {rank}(\widehat {A}_{k}E_{k-1})\leq 1\); therefore, \(\widehat{A}_{k}E_{k-1}\) has at most one non-zero eigenvalue, and
$$\begin{array}{*{20}l} \left\vert\text{tr}\left(\widehat{A}_{k}E_{k-1}\right)\right\vert &\,=\,\left\vert\text{tr}\left({E_{k-1}\widehat{A}_{k}}\right)\right\vert \,=\,\left\Vert\widehat{A}_{k}E_{k-1}\right\Vert \leq\left\Vert\widehat{A}_{k}\right\Vert\Vert E_{k-1}\Vert \\&=\beta_{k}\delta_{k-1}. \end{array} $$
Since \(E_{k-1}\) is symmetric and \(\widehat{A}_{k}\) is positive semi-definite, we have \( \text {tr}(E_{k-1}\widehat {A}_{k}E_{k-1})\geq 0. \) Hence, from (21) we have
$$\begin{array}{*{20}l} \text{tr}(E_{k})&=\text{tr}(\widehat{\Sigma}_{k})-\text{tr}(\Sigma_{k})\geq \text{tr}(E_{k-1})\\ &\quad-\frac{3\beta_{k}\hat{\lambda}_{k}\left(\beta_{k}\hat{\lambda}_{k}+\frac{2\sigma^{2}}{3}\right)\delta_{k-1}}{(\beta_{k}\hat{\lambda}_{k}+\sigma^{2})\left(\beta_{k}\hat{\lambda}_{k}+\sigma^{2}-\beta_{k}\delta_{k-1}\right)}\\ &\quad\geq \text{tr}(E_{k-1})-\frac{3\beta_{k}\hat{\lambda}_{k}\delta_{k-1}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}-\beta_{k}\delta_{k-1}}. \end{array} $$
After rearranging terms we obtain
$$\begin{array}{*{20}l} \text{tr}(\Sigma_{k})&\leq \text{tr}(\Sigma_{k-1})+\left[\text{tr}\left(\widehat{\Sigma}_{k}\right)-\text{tr}\left(\widehat{\Sigma}_{k-1}\right)\right]\\ &\quad+ \frac{3\beta_{k}\hat{\lambda}_{k}\delta_{k-1}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}-\beta_{k}\delta_{k-1}}. \end{array} $$
Together with the recursion for the trace \(\text {tr}(\widehat {\Sigma }_{k})\) in (7), we have
$$\begin{array}{*{20}l} \text{tr}(\Sigma_{k})\leq \text{tr}(\Sigma_{k-1}) &-\frac{\beta_{k}\hat{\lambda}_{k}^{2}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}}+\frac{3\beta_{k}\hat{\lambda}_{k}\delta_{k-1}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}-\beta_{k}\delta_{k-1}}. \end{array} $$

Lemma 9

For a given positive semi-definite matrix \(X \in \mathbb R^{n\times n}\), and a vector \(h\in \mathbb R^{n}\), if
$$Y=X-\frac{1}{h^{\intercal}X h+\sigma^{2}}Xhh^{\intercal}X, $$
then rank (X)=rank(Y).


Clearly, for all \(x\in\ker(X)\), \(Yx=0\), i.e., \(\ker(X)\subseteq\ker(Y)\). Decompose \(X = Q^{\intercal }Q\). For all \(x\in\ker(Y)\), let \(b\triangleq Qh\), \(z\triangleq Qx\). If b=0, then Y=X; otherwise, when b≠0, we have
$$0=x^{\intercal}Yx=z^{\intercal}z-\frac{z^{\intercal}bb^{\intercal}z}{b^{\intercal}b +\sigma^{2}}. $$
$$ z^{\intercal} z=\frac{z^{\intercal}bb^{\intercal}z}{b^{\intercal}b +\sigma^{2}}\leq \frac{b^{\intercal} b}{ b^{\intercal} b+\sigma^{2}} z^{\intercal} z. $$

Therefore z=0, i.e., \(x\in\ker(X)\), and hence \(\ker(Y)\subseteq\ker(X)\). This shows that \(\ker(X)=\ker(Y)\), or equivalently, rank(X)=rank(Y).
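A quick numerical check of Lemma 9 (our own sketch; the dimensions and the value of σ² are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, s, sigma2 = 12, 4, 0.25
Q = rng.standard_normal((n, s))
X = Q @ Q.T                    # rank-s positive semi-definite matrix
h = rng.standard_normal(n)

# The rank-one update of Lemma 9; with sigma2 > 0 the rank is preserved.
Xh = X @ h
Y = X - np.outer(Xh, Xh) / (h @ Xh + sigma2)

print(np.linalg.matrix_rank(X, tol=1e-8), np.linalg.matrix_rank(Y, tol=1e-8))
```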

Proof [Proof of Theorem 2] Recall that for k=1,…,K, \(\hat {\lambda }_{k}\geq {\chi _{n, p, \varepsilon }}\). Using Lemma 7, we can show that for some 0<δ<1, if
$$ \delta_{0}\leq \delta {\chi_{n, p, \varepsilon}}/4^{K+1}\leq {3\sigma^{2}}/\left({4^{K+1}\beta_{1}}\right), $$
then for the first K measurements, we have
$$\delta_{k}\leq\frac{1}{4^{K-k+1}}\frac{\delta{\chi_{n, p, \varepsilon}}}{4}\leq\frac{1}{4^{K-k}}\frac{3\sigma^{2}}{4\beta_{1}},\quad k=1,\ldots, K. $$
Note that the second inequality in (22) comes from the fact that \((1/{\chi _{n, p, \varepsilon }}-1/\hat {\lambda }_{1}){\chi _{n, p, \varepsilon }}\sigma ^{2}\leq 3\sigma ^{2}\). Clearly, \(\delta_{k-1}\leq \delta{\chi_{n, p, \varepsilon}}/16\); hence, \((4+\delta)\delta_{k-1}\leq\delta\lambda_{k}\). Since \(\beta_{k}\delta_{k-1}\leq\sigma^{2}\) and \(\vert \lambda _{k}-\hat {\lambda }_{k}\vert \leq \delta _{k-1}\), we have \( \beta _{k}\lambda _{k}\leq \beta _{k}(\hat {\lambda }_{k}+\delta _{k-1})\leq \beta _{k}\hat {\lambda }_{k}+\sigma ^{2}. \) Thus, \( 4\delta _{k-1}\left (\beta _{k}\hat {\lambda }_{k}+\sigma ^{2}\right)+\delta \beta _{k} \lambda _{k} \delta _{k-1}\leq \delta \lambda _{k}\left (\beta _{k}\hat {\lambda }_{k}+\sigma ^{2}\right). \) Then, we have \( 3\beta _{k} \hat {\lambda }_{k}\delta _{k-1}\left (\beta _{k}\hat {\lambda }_{k}+\sigma ^{2}\right)\leq \beta _{k} \hat {\lambda }_{k}(\delta \lambda _{k} - \delta _{k-1})(\beta _{k}\hat {\lambda }_{k}+\sigma ^{2}-\beta _{k}\delta _{k-1}), \) which can be rewritten as \( \frac {3\beta _{k} \hat {\lambda }_{k}\delta _{k-1}}{\beta _{k} \hat {\lambda }_{k}+\sigma ^{2}-\beta _{k} \delta _{k-1}}\leq \frac {\beta _{k} \hat {\lambda }_{k}}{\beta _{k} \hat {\lambda }_{k}+\sigma ^{2}}(\delta \lambda _{k}-\delta _{k-1}). \) Hence, \( \frac {3\beta _{k} \hat {\lambda }_{k}\delta _{k-1}}{\beta _{k} \hat {\lambda }_{k}+\sigma ^{2}-\beta _{k} \delta _{k-1}}\leq \frac {\beta _{k} \hat {\lambda }_{k}}{\beta _{k} \hat {\lambda }_{k}+\sigma ^{2}}[(\delta -1)\lambda _{k}+\hat {\lambda }_{k}], \) which can be written as \( -\frac {\beta _{k} \hat {\lambda }_{k}^{2}}{\beta _{k}\hat {\lambda }_{k}+\sigma ^{2}}+\frac {3\beta _{k} \hat {\lambda }_{k}\delta _{k-1}}{\beta _{k} \hat {\lambda }_{k}+\sigma ^{2}-\beta _{k} \delta _{k-1}}\leq -(1-\delta)\frac {\beta _{k}\hat {\lambda }_{k}}{\beta _{k} \hat {\lambda }_{k}+\sigma ^{2}}\lambda _{k}. \) By applying Lemma 8, we have
$$\begin{array}{*{20}l} \text{tr}(\Sigma_{k}) &\leq \text{tr}{(\Sigma_{k-1})}-(1-\delta)\frac{\beta_{k}\hat{\lambda}_{k}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}}\lambda_{k}\leq \text{tr}(\Sigma_{k-1})\\&\quad-(1-\delta)\frac{\beta_{k}\hat{\lambda}_{k}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}}\frac{\text{tr}(\Sigma_{k-1})}{s} \triangleq f_{k} \text{tr}(\Sigma_{k-1}), \end{array} $$
where we have used the definition for f k in (5). Subsequently,
$$\text{tr}(\Sigma_{k})\leq \left(\prod_{j=1}^{k}f_{j}\right) \text{tr}(\Sigma_{0}). $$
Lemma 9 shows that the rank of the covariance matrix will not be changed by updating the covariance matrix sequentially: \(\text{rank}(\Sigma_{1})=\cdots=\text{rank}(\Sigma_{k})=s\). Hence, we may decompose the covariance matrix \(\Sigma _{k}=Q Q^{\intercal }\), with \(Q\in \mathbb R^{n\times s}\) being a full-rank matrix; then \({\textsf {Vol}}(\Sigma _{k})=\text {det}(Q^{\intercal } Q)^{1/2}.\) Since \(\text {tr}(Q^{\intercal } Q)=\text {tr}(Q Q^{\intercal })\), we have
$$\begin{array}{*{20}l} {\textsf {Vol}}^{2}(\Sigma_{k})& = \text{det}{(Q^{\intercal} Q)} \overset{(1)}{\leq} \prod_{j=1}^{s}(Q^{\intercal} Q)_{jj} \overset{(2)}{\leq} \left(\frac{\text{tr}(Q^{\intercal} Q)}{s}\right)^{s} \\&=\left(\frac{\text{tr}(\Sigma_{k})}{s}\right)^{s}, \end{array} $$
where (1) follows from Hadamard's inequality and (2) follows from the inequality of arithmetic and geometric means. Finally, we can bound the conditional entropy of the signal as
$$ \begin{aligned} \mathbb{H}[{x}|y_{j}, a_{j}, j \leq k] &= \ln\left[(2\pi e)^{s/2} {\textsf{Vol}}(\Sigma_{k})\right]\\ &\leq \frac{s}{2}\ln \left\{\frac{2\pi e}{s} \left(\prod_{j=1}^{k}f_{j}\right) \text{tr}(\Sigma_{0})\right\}, \end{aligned} $$

which leads to the desired result.

Proof [Proof of Theorem 3] Recall that rank(Σ)=s, and hence \(\lambda_{k}=0\) for k=s+1,…,n. Note that in each iteration, the eigenvalue of \(\widehat {\Sigma }_{k}\) in the direction of \(a_{k}\), which corresponds to the largest eigenvalue of \(\widehat {\Sigma }_{k}\), is driven below the threshold χn,p,ε. Therefore, as long as the algorithm continues, the largest eigenvalue of \(\widehat {\Sigma }_{k}\) is exactly the (k+1)th largest eigenvalue of \(\widehat {\Sigma }\). Now, if
$$ \delta_{0}\leq {\chi_{n, p, \varepsilon}}/4^{s+1}, $$
using Lemma 4 and Lemma 7, we have that
$$\begin{array}{*{20}l} {}\vert \hat{\lambda}_{k}-\lambda_{k} \vert &\leq \delta_{0},\ \text{for}\ k=1,\ldots,s, \qquad \vert \hat{\lambda}_{k} \vert\leq \delta_{0}\leq {\chi_{n, p, \varepsilon}}-\delta_{s},\ \text{for}\ k=s+1,\ldots,n. \end{array} $$
In the ideal case without perturbation, each measurement decreases the eigenvalue along a given eigenvector to below χn,p,ε. Suppose that in the ideal case the algorithm terminates after K≤s iterations, which means
$$\lambda_{1}(\Sigma) \geq\cdots\geq \lambda_{K}(\Sigma)\geq{\chi_{n, p, \varepsilon}} >\lambda_{K+1}(\Sigma)\geq\cdots\geq\lambda_{s}(\Sigma), $$
and the total power needed is
$$ P_{\text{ideal}}=\sum\limits_{k=1}^{K}\sigma^{2}\left(\frac{1}{{\chi_{n, p, \varepsilon}}}-\frac{1}{\lambda_{k}}\right). $$
On the other hand, in the presence of perturbation, the algorithm will terminate using more than K iterations since with perturbation, eigenvalues of Σ that are originally below χn,p,ε may get above χn,p,ε. In this case, we will also allocate power while taking into account the perturbation:
$$\beta_{k}=\sigma^{2}\left(\frac{1}{{\chi_{n, p, \varepsilon}}-\delta_{s}}-\frac{1}{\hat{\lambda}_{k}}\right). $$
This suffices to drive even the smallest eigenvalue below the threshold χn,p,ε since
$$\frac{\sigma^{2}\hat{\lambda}_{k}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}}= {\chi_{n, p, \varepsilon}}-\delta_{s} < {\chi_{n, p, \varepsilon}}. $$
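Substituting the power allocation \(\beta_{k}\) defined above into the standard Gaussian posterior eigenvalue update \(\sigma^{2}\hat{\lambda}_{k}/(\beta_{k}\hat{\lambda}_{k}+\sigma^{2})\) makes the algebra behind this identity explicit:

```latex
\beta_{k}\hat{\lambda}_{k}+\sigma^{2}
  =\sigma^{2}\left(\frac{\hat{\lambda}_{k}}{\chi_{n,p,\varepsilon}-\delta_{s}}-1\right)+\sigma^{2}
  =\frac{\sigma^{2}\hat{\lambda}_{k}}{\chi_{n,p,\varepsilon}-\delta_{s}},
\qquad\text{hence}\qquad
\frac{\sigma^{2}\hat{\lambda}_{k}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}}
  =\chi_{n,p,\varepsilon}-\delta_{s}.
```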
We first bound the total power needed to eliminate the eigenvalues \(\hat {\lambda }_{k}\) for K+1≤k≤s:
$$\begin{array}{*{20}l} \beta_{k} &=\sigma^{2}(1/({\chi_{n, p, \varepsilon}}-\delta_{s}) - 1/\hat{\lambda}_{k}) \leq \sigma^{2}(1/({\chi_{n, p, \varepsilon}}-\delta_{s})\\ &\quad- 1/({\chi_{n, p, \varepsilon}}+\delta_{0})) \leq \sigma^{2} \frac{(4^{s}+1)\delta_{0}}{({\chi_{n, p, \varepsilon}}-4^{s} \delta_{0})({\chi_{n, p, \varepsilon}}+\delta_{0})} \\ &\leq \frac{20}{51}\frac{\sigma^{2}}{{\chi_{n, p, \varepsilon}}}. \end{array} $$
where we have used the fact that δ s ≤4 s δ0 (a consequence of Lemma 7), the assumption (24), and monotonicity of the upper bound in s. The total power to reach precision ε in the presence of mismatch can be upper bounded by
$$\begin{array}{*{20}l} P_{\text{mismatch}} &\leq \sum\limits_{k=1}^{s}\beta_{k} \leq\sigma^{2}\left\{\sum\limits_{k=1}^{K}\left(\frac{1}{{\chi_{n, p, \varepsilon}}-\delta_{s}}-\frac{1}{\hat{\lambda}_{k}}\right)\right.\\ &\left.+{\vphantom{\sum\limits_{k=1}^{K}}}\frac{20(s-K)}{51}\frac{\sigma^{2}}{{\chi_{n, p, \varepsilon}}}\right\}. \end{array} $$
In order to achieve precision ε and confidence level p, the extra power needed is upper bounded as
$$\begin{array}{*{20}l} P_{\text{mismatch}}-P_{\text{ideal}} &\leq \sigma^{2}\left\{\sum\limits_{k=1}^{K} \left(\frac{1}{3}\frac{1}{{\chi_{n, p, \varepsilon}}}+\frac{\delta_{0}}{\lambda_{k}^{2}}\right)\right. \\ &\left.{\vphantom{\sum\limits_{k=1}^{K}}}\quad+\frac{20(s-K)}{51}\frac{1}{{\chi_{n, p, \varepsilon}}}\right\}\\ &\leq\sigma^{2}\!\left\{\!\frac{1}{4^{s+1}}\!\sum\limits_{k=1}^{K}\!\frac{{\chi_{n, p, \varepsilon}}}{\lambda_{k}^{2}}\,+\,\frac{20s-3K}{51}\frac{1}{{\chi_{n, p, \varepsilon}}}\right\} \\ &<\left(\frac{20}{51}s-\left(\frac{3}{51}-\frac{1}{4^{s+1}}\right)K\right)\frac{\sigma^{2}}{{\chi_{n, p, \varepsilon}}}\\ &\leq \left(\frac{20}{51}s+\frac{1}{272}K\right)\frac{\sigma^{2}}{{\chi_{n, p, \varepsilon}}}, \end{array} $$

where we have again used \(\delta_{s}\leq 4^{s}\delta_{0}\leq 4^{s}{\chi_{n, p, \varepsilon}}/4^{s+1}={\chi_{n, p, \varepsilon}}/4\), \(1/\hat {\lambda }_{k} - 1/\lambda _{k} \leq \delta _{0}/\lambda _{k}^{2}\), and the fact that \(\lambda_{k}\geq{\chi_{n, p, \varepsilon}}\) for k=1,…,K.

Proof [Proof of Lemma 2] It is a direct consequence of Lemma 6. Let \(\theta=\text{tr}(\Sigma)/\Vert\Sigma\Vert\geq 1\). For some constant δ>0, set
$$L\geq 4n^{1/2}\text{tr}(\Sigma)({\Vert\Sigma\Vert}/{\delta^{2}}+{4}/{\delta}). $$
Then, from Lemma 6, we have
$$\begin{array}{*{20}l} P\left\{\Vert\widehat{\Sigma}-\Sigma\Vert \leq \delta\right\} &\geq P\left\{{\vphantom{\left(\sqrt{2n^{1/2}(\theta+1)/L}+2\theta n^{1/2}/L\right)}}\Vert\widehat{\Sigma}-\Sigma\Vert\right.\\ &\leq\!\left. \left(\sqrt{2n^{1/2}(\theta+1)/L}+2\theta n^{1/2}/L\right)\Vert\Sigma\Vert\right\}\\ &> 1-2n\exp(-\sqrt{n}). \end{array} $$

The following lemma is used in the proof of Lemma 3.

Lemma 10

If the constants M, N, and L satisfy the conditions in Lemma 3, then \(\eta_{1} \leq \tau\) with probability exceeding \(1-{2}/{n}-{2}/{\sqrt {n}}-2n\exp (-c_{1}M)\) for some universal constant c1>0.


Proof [Proof of Lemma 10] Let \(\theta \triangleq \text {tr}(\Sigma)/\|\Sigma \|\). By Chebyshev’s inequality, we have that
$$\begin{array}{*{20}l} &\mathbb P\left\{\vert z_{i}\vert<\frac{\tau}{6M}\right\} \geq 1-\frac{36M^{2}\sigma^{2}\text{tr}(\Sigma)}{NL\tau^{2}}, \quad i = 1, \ldots, K, \\&\mathbb P\left\{|w|<M\frac{\sigma^{2}}{L}+\frac{\tau}{6}\right\} \geq 1-\frac{72\sigma^{4}M}{NL^{2}\tau^{2}}, \end{array} $$
$$\mathbb P\left\{|b|<(M+\sqrt{M})n\right\} \geq 1-\frac{2}{n}. $$
Moreover, the lower bound (26) on N,
$$ N \geq 4n^{1/2}\text{tr}(\Sigma)\left(\frac{36n^{2}M^{2}\Vert\Sigma\Vert}{\tau^{2}}+\frac{24nM}{\tau}\right), $$
together with the concentration inequality for the Wishart distribution in Lemma 6 and the definition of τ in (15), yields
$$\begin{array}{*{20}l} \mathbb P\left\{\Vert\widehat{\Sigma}_{N}-\Sigma\Vert \leq \frac{\tau}{3n(M+\sqrt{M})}\right\} &\geq \mathbb P\left\{\Vert\widehat{\Sigma}_{N}-\Sigma\Vert \leq \left(\sqrt{\frac{2n^{1/2}\theta}{N}}+\frac{2\theta n^{1/2}}{N}\right)\Vert\Sigma\Vert\right\} \\ &> 1-2n\exp(-\sqrt{n}). \end{array} $$
Furthermore, when L satisfies (14), we have
$$\begin{array}{*{20}l} &\mathbb P\left\{\vert z_{i}\vert<\frac{\tau}{6M}\right\} \geq 1-\frac{1}{M\sqrt{n}}, \quad \mathbb P\left\{|w|<\frac{\tau}{3}\right\} \geq 1-\frac{1}{\sqrt{n}},\\ &\mathbb P\left\{\vert b\vert<(M+\sqrt{M})n\right\} \geq 1-\frac{2}{n}. \end{array} $$

Therefore, \(\eta_{1} \leq \tau\) holds with probability at least \(1-{2}/{n}-{2}/{\sqrt {n}}-2n\exp (-\sqrt {n})\).

Proof [Proof of Lemma 3] By Lemma 10 with τ=Mδ/c2, the choices of M, N, and L ensure that \(\eta_{1}\leq M\delta/c_{2}\) with probability at least \(1-{2}/{n}-2/\sqrt {n}-2n\exp (-\sqrt {n})\). By Lemma 5 in Appendix 1 and noting that the rank of Σ is s, we have \(\Vert \widehat {\Sigma }-\Sigma \Vert _{F}\leq \delta.\) Therefore, with probability exceeding \(1-2/n-{2}/{\sqrt {n}}-2n\exp (-\sqrt {n})-\exp (-c_{0}c_{1}ns)\), \(\Vert \widehat {\Sigma }-\Sigma \Vert \leq \Vert \widehat {\Sigma }-\Sigma \Vert _{F}\leq \delta. \)

The proof will use the following two lemmas.

Lemma 11

[Moment generating function of multivariate Gaussian [40]] Assume \(X\sim \mathcal N(0,\Sigma)\). The moment generating function of \(\Vert X\Vert_{2}^{2}\) is \( \mathbb E[e^{s\Vert X \Vert _{2}^{2}}]=1/\sqrt {\det(I-2s\Sigma)}. \)
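The identity in Lemma 11 is easy to verify numerically; the following Monte Carlo sketch (ours, with arbitrary choices of Σ and s) compares the empirical average of \(e^{s\|X\|_2^2}\) with the closed form \(\det(I-2s\Sigma)^{-1/2}\):

```python
import numpy as np

# Monte Carlo sanity check (illustrative, not from the paper) of the
# moment generating function identity for X ~ N(0, Sigma):
#   E[exp(s * ||X||_2^2)] = det(I - 2 s Sigma)^(-1/2),
# which is valid for s < 1/(2 ||Sigma||).
rng = np.random.default_rng(1)
n = 3
Sigma = np.diag([0.5, 1.0, 1.5])
s = 0.1                                   # s < 1/(2 * 1.5), so the MGF exists

X = rng.multivariate_normal(np.zeros(n), Sigma, size=200000)
mc = np.mean(np.exp(s * np.sum(X**2, axis=1)))            # Monte Carlo estimate
exact = np.linalg.det(np.eye(n) - 2 * s * Sigma) ** (-0.5)  # closed form
print(abs(mc - exact) / exact)            # small relative error
```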

Note that |ϱ k | can be computed recursively. Let \(z_{k} \triangleq a_{k}^{\intercal }(x-\mu _{k-1}) + w_{k} = y_{k} - a_{k}^{\intercal } \mu _{k-1}\), and let \(\varrho _{k} \triangleq a^{\intercal }(\hat {\mu }_{k}-\mu _{k})\); note that \(\varrho _{k} = a^{\intercal } \xi _{k}\) for \(\xi _{k} = \hat {\mu }_{k}-\mu _{k}\) as in (16). Based on the recursion for \(\xi_{k}\) in (16) that we derived earlier, we have
$$\varrho_{k} =\frac{\sigma^{2}}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}}\left[\varrho_{k-1}+\frac{a_{k}^{\intercal} E_{k-1} a_{k} \left(y_{k}-a_{k}^{\intercal}\mu_{k-1}\right)}{\beta_{k}\hat{\lambda}_{k}+\sigma^{2}-a^{\intercal}_{k}E_{k-1}a_{k}}\right]$$
and hence
$$|\varrho_{k}| \leq \frac{1}{\hat{\lambda}_{k} (\beta_{k}/\sigma^{2})+1}\left[|\varrho_{k-1}|+\frac{\delta_{k} }{(\hat{\lambda}_{k}-\delta_{k})+\sigma^{2}/\beta_{k}} |z_{k}|\right].$$
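The scalar recursion above is straightforward to simulate. The sketch below (our illustration; all per-step quantities are hypothetical) iterates the upper bound and shows that with an exact model (δ k =0) the error contracts geometrically:

```python
import numpy as np

# Sketch (ours, not the paper's code) of the scalar recursion bounding the
# mean-estimate error rho_k = a^T(mu_hat_k - mu_k) derived above.  The inputs
# lam_hat, beta, sigma2, delta, z are hypothetical per-step quantities.
def rho_bound(rho_prev, lam_hat, beta, sigma2, delta, z):
    """One step of the upper bound on |rho_k|."""
    shrink = 1.0 / (lam_hat * beta / sigma2 + 1.0)       # contraction factor
    inject = delta / ((lam_hat - delta) + sigma2 / beta) * abs(z)  # mismatch term
    return shrink * (abs(rho_prev) + inject)

# With exact model knowledge (delta = 0) the error only shrinks:
rho = 1.0
for _ in range(5):
    rho = rho_bound(rho, lam_hat=2.0, beta=1.0, sigma2=0.1, delta=0.0, z=0.5)
print(rho)   # far below the initial value of 1.0
```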
Proof [Proof of Lemma 1] The recursion of the diagonal entries can be written as
$$\begin{array}{*{20}l} \Sigma_{ii}^{(k)} &=\Sigma_{ii}^{(k-1)}-\frac{\left(\Sigma_{i j_{k-1}}^{(k-1)}\right)^{2}}{\Sigma_{j_{k-1} j_{k-1}}^{(k-1)}+\sigma^{2}/\beta_{k}} \\&=\frac{\Sigma_{ii}^{(k-1)}\Sigma_{j_{k-1} j_{k-1}}^{(k-1)}\left(1-\rho^{(k-1)}_{i j_{k-1}}\right)+\Sigma_{ii}^{(k-1)}\sigma^{2}/\beta_{k}}{\Sigma^{(k-1)}_{j_{k-1} j_{k-1}}+\sigma^{2}/\beta_{k}}. \end{array} $$
Note that for i=jk−1,
$$\Sigma_{j_{k-1}j_{k-1}}^{(k)}=\frac{\Sigma_{j_{k-1}j_{k-1}}^{(k-1)}\sigma^{2}/\beta_{k}}{\Sigma^{(k-1)}_{j_{k-1} j_{k-1}}+\sigma^{2}/\beta_{k}}\leq \frac{\gamma}{1+\gamma}\Sigma_{j_{k-1}j_{k-1}}^{(k-1)}, $$
and for ijk−1,
$$\begin{array}{*{20}l} \Sigma_{ii}^{(k)} &\leq \frac{\Sigma_{ii}^{(k-1)}\Sigma_{j_{k-1} j_{k-1}}^{(k-1)}\left(1-\rho^{(k-1)}\right)+\Sigma_{ii}^{(k-1)}\sigma^{2}/\beta_{k}}{\Sigma^{(k-1)}_{j_{k-1} j_{k-1}}+\sigma^{2}/\beta_{k}} \\&\leq \Sigma_{ii}^{(k-1)}\frac{\Sigma_{j_{k-1} j_{k-1}}^{(k-1)}\left(1-\rho^{(k-1)}\right)+\sigma^{2}/\beta_{k}}{\Sigma_{j_{k-1} j_{k-1}}^{(k-1)}+\sigma^{2}/\beta_{k}}\\&\leq \Sigma_{ii}^{(k-1)}\frac{1-\rho^{(k-1)}+\gamma}{1+\gamma}. \end{array} $$
Summing over i and using \(\Sigma_{j_{k-1} j_{k-1}}^{(k-1)} \geq \text{tr}(\Sigma_{k-1})/n\), we obtain
$$\begin{array}{*{20}l} \text{tr}(\Sigma_{k}) &\leq \left(1-\frac{\rho^{(k-1)}}{1+\gamma}\right)\text{tr}(\Sigma_{k-1})-\frac{1-\rho^{(k-1)}}{1+\gamma}\Sigma_{j_{k-1}j_{k-1}}^{(k-1)} \\&\leq \left[1-\frac{(n-1)\rho^{(k-1)}+1}{n(1+\gamma)}\right]\text{tr}(\Sigma_{k-1}). \end{array} $$
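The one-step trace contraction in Lemma 1 can be checked numerically. The following sketch (ours) forms the rank-one covariance update for a measurement of the largest-variance coordinate and verifies the bound; taking γ = σ²/(β Σ_jj) is our reading of the lemma's condition:

```python
import numpy as np

# Numerical check (illustrative) of the one-step covariance update in Lemma 1:
# measuring coordinate j with power beta gives
#   Sigma_k = Sigma - (Sigma e_j)(Sigma e_j)^T / (Sigma_jj + sigma2/beta),
# and the trace contracts by at least the factor 1 - 1/(n(1+gamma)).
rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
Sigma = A @ A.T                                    # a generic PSD covariance
sigma2, beta = 0.1, 1.0
gamma = sigma2 / (beta * np.max(np.diag(Sigma)))   # gamma as in the lemma (assumed)

j = int(np.argmax(np.diag(Sigma)))                 # Info-Greedy picks max variance
col = Sigma[:, j]
Sigma_k = Sigma - np.outer(col, col) / (Sigma[j, j] + sigma2 / beta)

bound = (1 - 1 / (n * (1 + gamma))) * np.trace(Sigma)
print(np.trace(Sigma_k) <= bound)
```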
Proof [Proof of Theorem 4] Let \(\varepsilon \geq \sqrt {\|\Sigma _{K}\|\cdot \chi _{n}^{2}(p)}\), i.e., \(\Vert\Sigma_{K}\Vert \leq \chi_{n,p,\varepsilon}\). Then, Theorem 4 follows from
$$\begin{array}{*{20}l} &\mathbb{P}_{x\sim \mathcal{N}(\mu_{K}, \Sigma_{K})}[\|x-\mu_{K}\|_{2}\leq \varepsilon] \\ &\geq \mathbb{P}_{x\sim \mathcal{N}(\mu_{K}, \Sigma_{K})} \left[\|x-\mu_{K}\|_{2}\leq \sqrt{\|\Sigma_{K}\|\cdot\chi_{n}^{2}(p)}\right]\\ &\geq \mathbb{P}_{x\sim \mathcal{N}(\mu_{K}, \Sigma_{K})} \left[(x-\mu_{K})^{\intercal}\Sigma_{K}^{-1}(x-\mu_{K})\leq \chi_{n}^{2}(p)\right] = p. \end{array} $$
This says that if \(\Vert\Sigma_{K}\Vert \leq \chi_{n,p,\varepsilon}\), i.e., (27) holds, then \(\Vert \hat {x} -x \Vert \leq \varepsilon \) with probability at least p. From Lemma 1, we have that when the powers β i are sufficiently large,
$$\Vert\Sigma_{K}\Vert\leq \text{tr}(\Sigma_{K})\leq \left(1-\frac{1}{n(1+\gamma)}\right)^{K}\text{tr}(\Sigma). $$
Hence, for (27) to hold, we can simply require \(\left (1-\frac {1}{n(1+\gamma)}\right)^{K}\text {tr}(\Sigma) \leq {\chi _{n, p, \varepsilon }}\), or equivalently (11) in Theorem 4.
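The last inequality translates directly into a measurement count: the smallest integer K with \(\left(1-\frac{1}{n(1+\gamma)}\right)^{K}\text{tr}(\Sigma)\leq \chi_{n,p,\varepsilon}\). A minimal sketch (ours, with hypothetical numbers):

```python
import math

# Back-of-the-envelope sketch (ours) of the measurement count implied by the
# inequality above: the smallest integer K with
#   (1 - 1/(n*(1+gamma)))**K * tr_Sigma <= chi.
# All numerical values below are hypothetical.
def required_K(n, gamma, tr_Sigma, chi):
    r = 1 - 1 / (n * (1 + gamma))          # per-measurement trace contraction
    return math.ceil(math.log(chi / tr_Sigma) / math.log(r))

K = required_K(n=100, gamma=0.1, tr_Sigma=50.0, chi=0.5)
print(K)
```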

Our sketching scheme is slightly different from that used in [25] because we would like to use the squares of the noisy linear measurements \(y_{i}^{2}\) (whereas the measurement scheme in [25] assumes a slightly different noise model). In practice, this means that we may use the same measurement scheme in the first stage as training data to initialize the sample covariance matrix.
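The idea behind using squared measurements can be sketched as follows (our illustration): since \(y = a^{\intercal}x + w\) with \(x\sim\mathcal N(0,\Sigma)\) gives \(\mathbb E[y^2] = a^{\intercal}\Sigma a + \sigma^2\), averaging the \(y_i^2\) yields an unbiased estimate of the quadratic form \(a^{\intercal}\Sigma a\) used in covariance sketching:

```python
import numpy as np

# Illustration (ours) of the covariance-sketching idea discussed above: each
# energy measurement y = a^T x + w satisfies E[y^2] = a^T Sigma a + sigma^2,
# so averaging squared noisy measurements sketches the quadratic form.
rng = np.random.default_rng(3)
n, N = 5, 100000
Sigma = np.diag(np.arange(1.0, n + 1))     # a toy covariance (hypothetical)
sigma = 0.1                                # noise standard deviation
a = np.ones(n)                             # one fixed sketching direction

x = rng.multivariate_normal(np.zeros(n), Sigma, size=N)
y = x @ a + sigma * rng.standard_normal(N)
sketch = np.mean(y**2) - sigma**2          # unbiased estimate of a^T Sigma a
exact = a @ Sigma @ a                      # equals 15 for this toy Sigma
print(abs(sketch - exact) / exact)         # small relative error
```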




Abbreviations

GMM: Gaussian mixture models

NDT: Non-destructive testing



Acknowledgements

We would like to acknowledge Tsinghua University for supporting Ruiyang Song during his visit to the Georgia Institute of Technology.


Funding

This work is partially supported by an NSF CAREER Award CMMI-1452463, an NSF grant CCF-1442635, and an NSF grant CMMI-1538746. Ruiyang Song was visiting the H. Milton Stewart School of Industrial and Systems Engineering at the Georgia Institute of Technology while working on this paper.

Authors’ contributions

We consider a class of mutual information maximization-based algorithms, called Info-Greedy algorithms, and present a rigorous performance analysis for these algorithms in the presence of model parameter mismatch. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors’ Affiliations

Department of Electrical Engineering, Stanford University, Stanford, USA
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, USA


References

  1. A Ashok, P Baheti, MA Neifeld, Compressive imaging system design using task-specific information. Appl. Opt. 47(25), 4457–4471 (2008).
  2. J Ke, A Ashok, M Neifeld, Object reconstruction from adaptive compressive measurements in feature-specific imaging. Appl. Opt. 49(34), 27–39 (2010).
  3. A Ashok, MA Neifeld, Compressive imaging: hybrid measurement basis design. J. Opt. Soc. Am. A. 28(6), 1041–1050 (2011).
  4. W Boonsong, W Ismail, Wireless monitoring of household electrical power meter using embedded RFID with wireless sensor network platform. Int. J. Distrib. Sens. Networks. 2014(876914), 10 (2014).
  5. B Zhang, X Cheng, N Zhang, Y Cui, Y Li, Q Liang, in IEEE Int. Conf. Computer Communications (INFOCOM). Sparse target counting and localization in sensor networks based on compressive sensing (2014), pp. 2255–2258.
  6. G Braun, S Pokutta, Y Xie, Info-greedy sequential adaptive compressed sensing. IEEE J. Sel. Top. Signal Proc. 9(4), 601–611 (2015).
  7. J Haupt, R Nowak, R Castro, in IEEE 13th Digital Signal Processing Workshop and 5th IEEE Signal Processing Education Workshop (DSP/SPE). Adaptive sensing for sparse signal recovery (2009), pp. 702–707.
  8. A Tajer, HV Poor, Quick search for rare events. IEEE Trans. Inf. Theory. 59(7), 4462–4481 (2013).
  9. D Malioutov, S Sanghavi, A Willsky, Sequential compressed sensing. IEEE J. Sel. Top. Signal Proc. 4(2), 435–444 (2010).
  10. J Haupt, R Baraniuk, R Castro, R Nowak, in Proc. IEEE/SP Workshop on Statistical Signal Processing. Sequentially designed compressed sensing (2012).
  11. A Krishnamurthy, J Sharpnack, A Singh, in Annual Asilomar Conference on Signals, Systems, and Computers. Recovering graph-structured activations using adaptive compressive measurements (2013).
  12. J Haupt, R Castro, R Nowak, in International Conference on Artificial Intelligence and Statistics. Distilled sensing: selective sampling for sparse signal recovery (2009), pp. 216–223.
  13. MA Davenport, E Arias-Castro, in Information Theory Proceedings (ISIT), 2012 IEEE International Symposium On. Compressive binary search (2012), pp. 1827–1831.
  14. ML Malloy, RD Nowak, Near-optimal adaptive compressed sensing. IEEE Trans. Inf. Theory. 60(7), 4001–4012 (2014).
  15. S Jain, A Soni, J Haupt, in Signals, Systems and Computers, 2013 Asilomar Conference On. Compressive measurement designs for estimating structured signals in structured clutter: a Bayesian experimental design approach (2013), pp. 163–167.
  16. E Tanczos, R Castro, Adaptive sensing for estimation of structured sparse signals. arXiv:1311.7118 (2013).
  17. A Soni, J Haupt, On the fundamental limits of recovering tree sparse vectors from noisy linear measurements. IEEE Trans. Inf. Theory. 60(1), 133–149 (2014).
  18. HS Chang, Y Weiss, WT Freeman, Informative sensing. arXiv:0901.4275 (2009).
  19. S Ji, Y Xue, L Carin, Bayesian compressive sensing. IEEE Trans. Signal Process. 56(6), 2346–2356 (2008).
  20. JM Duarte-Carvajalino, G Yu, L Carin, G Sapiro, Task-driven adaptive statistical compressive sensing of Gaussian mixture models. IEEE Trans. Signal Process. 61(3), 585–600 (2013).
  21. W Carson, M Chen, R Calderbank, L Carin, Communication inspired projection design with application to compressive sensing. SIAM J. Imaging Sci. (2012).
  22. DJC MacKay, Information based objective functions for active data selection. Comput. Neural Syst. 4(4), 589–603 (1992).
  23. G Dasarathy, P Shah, BN Bhaskar, R Nowak, in Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference On. Covariance sketching (2012).
  24. G Dasarathy, P Shah, BN Bhaskar, R Nowak, Sketching sparse matrices. arXiv:1303.6544 (2013).
  25. Y Chen, Y Chi, AJ Goldsmith, Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Trans. Inf. Theory. 61(7), 4034–4059 (2015).
  26. C Hellier, Handbook of Nondestructive Evaluation (McGraw-Hill, 2003).
  27. P Schniter, in Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2011 4th IEEE International Workshop On. Exploiting structured sparsity in Bayesian experimental design (2011), pp. 357–360.
  28. G Yu, G Sapiro, Statistical compressed sensing of Gaussian mixture models. IEEE Trans. Signal Process. 59(12), 5842–5858 (2011).
  29. Y Li, Y Chi, C Huang, L Dolecek, Orthogonal matching pursuit on faulty circuits. IEEE Trans. Commun. 63(7), 2541–2554 (2015).
  30. H Robbins, in Herbert Robbins Selected Papers. Some aspects of the sequential design of experiments (Springer, 1985), pp. 169–177.
  31. CF Wu, M Hamada, Experiments: Planning, Analysis, and Optimization, vol. 552 (Wiley, 2011).
  32. R Gramacy, D Apley, Local Gaussian process approximation for large computer experiments. J. Comput. Graph. Stat., 1–28 (2014).
  33. DP Palomar, S Verdú, Gradient of mutual information in linear vector Gaussian channels. IEEE Trans. Inf. Theory. 52(1), 141–154 (2006).
  34. M Payaró, DP Palomar, Hessian and concavity of mutual information, entropy, and entropy power in linear vector Gaussian channels. IEEE Trans. Inf. Theory. 55(8), 3613–3628 (2009).
  35. DJ Brady, Optical Imaging and Spectroscopy (Wiley-OSA, 2009).
  36. M Grant, S Boyd, CVX: Matlab Software for Disciplined Convex Programming, version 2.1 (2014).
  37. M Grant, S Boyd, in Recent Advances in Learning and Control. Lecture Notes in Control and Information Sciences, ed. by V Blondel, S Boyd, and H Kimura. Graph implementations for nonsmooth convex programs (Springer, 2008), pp. 95–110.
  38. GW Stewart, J-G Sun, Matrix Perturbation Theory (Academic Press, Inc., 1990).
  39. S Zhu, A short note on the tail bound of Wishart distribution. arXiv:1212.5860 (2012).
  40. T Vincent, L Tenorio, M Wakin, Concentration of measure: fundamentals and tools. Lecture notes, Rice University (2015).


© The Author(s) 2018