On the effect of model mismatch for sequential InfoGreedy Sensing
EURASIP Journal on Advances in Signal Processing, volume 2018, Article number: 32 (2018)
Abstract
We characterize the performance of sequential information-guided sensing (InfoGreedy Sensing) when the model parameters (means and covariance matrices) are estimated and inaccurate. Our theoretical results focus on Gaussian signals and establish performance bounds for signal estimators obtained by InfoGreedy Sensing, in terms of conditional entropy (related to the estimation error) and additional power required due to inaccurate models. We also show that covariance sketching can be used as an efficient initialization for InfoGreedy Sensing. Numerical examples demonstrate the good performance of InfoGreedy Sensing algorithms compared with random measurement schemes in the presence of model mismatch.
Introduction
Sequential compressed sensing is a promising new information acquisition and recovery technique to process big data that arises in various applications such as compressive imaging [1–3], power network monitoring [4], and large-scale sensor networks [5]. The sequential nature of these problems arises either because the measurements are taken one after another or because the data are obtained in a streaming fashion and must be processed in one pass.
To harvest the benefits of adaptivity in sequential compressed sensing, various algorithms have been developed (see [6] for a review). We may classify these algorithms as (1) being agnostic about the signal distribution and, hence, using random measurements [7–10], (2) exploiting additional structure of the signal (such as graphical structure [11], sparsity [12–14], low rank [15], and tree-sparse structure [16, 17]) to design measurements, and (3) exploiting the distributional information of the signal in choosing measurements [18], possibly through maximizing mutual information. The additional knowledge about signal structure or distribution constitutes various forms of information about the unknown signal. Such work includes the seminal Bayesian compressive sensing work [19], Gaussian mixture models (GMM) [20, 21], the classic information gain maximization [22] based on a quadratic approximation to the information gain function, and our earlier work [6], which is referred to as InfoGreedy Sensing. InfoGreedy Sensing is a framework that aims at designing subsequent measurements to maximize the mutual information conditioned on previous measurements. Conditional mutual information is a natural metric here, as it captures exclusively the useful new information between the signal and the resulting measurements, disregarding noise and what has already been learned from previous measurements. Information may play a distinguishing role: as the compressive imaging example in Fig. 1 demonstrates (see Section 4 for more details), with a bit of (albeit inaccurate) information estimated via random samples of small patches of the image, InfoGreedy Sensing is able to recover details of a high-resolution image, whereas random measurements completely miss the image.
As shown in [6], InfoGreedy Sensing for a Gaussian signal becomes a simple iterative algorithm: choosing the measurement as the leading eigenvector of the conditional signal covariance matrix in that iteration and then updating the covariance matrix via a simple rank-one update or, equivalently, choosing measurement vectors a_{1},a_{2},… as the orthonormal eigenvectors of the signal covariance matrix Σ in decreasing order of eigenvalues. Unlike the earlier literature [22], InfoGreedy Sensing determines not only the direction but also the precise magnitude of the measurements.
In practice, we usually need to estimate the signal covariance matrix, e.g., through a training session. For Gaussian signals, there are two possible approaches: either using training samples of the same dimension or using the new “covariance sketching” technique [23–25], which uses low-dimensional random sketches of the samples. Because the estimated covariance matrices are inaccurate, measurement vectors usually deviate from the optimal directions, as they are calculated as eigenvectors of the estimated covariance matrix. Hence, to understand the performance of information-guided algorithms in practice, it is crucial to quantify the performance of algorithms with model mismatch. This may also shed some light on how to properly initialize the algorithm.
In this paper, we aim at quantifying the performance of InfoGreedy Sensing when the parameters (in particular, the covariance matrices) are estimated. We focus on analyzing deterministic model mismatch, which is a reasonable assumption since we aim at providing instance-specific performance guarantees with sample-estimated or sketched initial parameters. We establish a set of theoretical results including (1) studying the bias and variance of the signal estimator via the posterior mean, by relating the error in the covariance matrix \(\|\Sigma - \widehat {\Sigma }\|\) to the entropy of the signal posterior distribution after each sequential measurement, (2) establishing an upper bound on the additional power needed to achieve the signal precision \(\|x - \hat {x}\| \leq \varepsilon \), where power is defined as the square of the norm of the measurement vector, and (3) translating these into requirements on the choice of the sample covariance matrix through direct estimation or through covariance sketching. Note that the power allocated for the measurements here is the minimum power required in order to achieve a prescribed precision for signal recovery within a fixed number of iterations. Furthermore, we also study InfoGreedy Sensing in a special setting when the measurement vector is desired to be one-sparse and establish analogously a set of theoretical results. Such a requirement arises from applications such as nondestructive testing (NDT) [26] or network tomography. We also present numerical examples to demonstrate the good performance of InfoGreedy Sensing compared to a batch method (where measurements are not adaptive) when there is mismatch. The main contribution of the paper is to study and understand the performance of the InfoGreedy algorithm [6] in the presence of perturbed parameters, rather than proposing new algorithms.
Some other related works include [27], where adaptive methods for recovering structured sparse signals with Gaussian and Gaussian joint posterior are discussed, and [28], which analyzes the recovery of Gaussian mixture models with estimated mean and covariance using maximum a posteriori estimation. In [29], orthogonal matching pursuit, which aims at detecting the support of sparse signals while suffering from faulty measurements, is studied. In this work, we focus on the case where the estimated mean, covariance, as well as the prior probability for each separate Gaussian component are available. Another work [20] discusses an adaptive sensing method for GMM, which is a two-step strategy that first adaptively detects the class of the GMM signal and then reconstructs the signal assuming it falls in the category determined in the previous step. While [20] assumes that there are sufficient samples for the first step in the first place, our earlier work [6] and this paper differ in that sensing for a GMM signal works on signal recovery directly, without trying to identify the signal class as a first step. Hence, in general, our method is more tolerant to inaccuracy of the estimated parameters, and our algorithm can achieve good performance even without a large number of samples, as demonstrated by numerical examples. The design of information-guided sequential sensing is related to the design of sequential experiments (see [15, 30, 31]) and large computer experiment approximation (see [32]). However, compared to the literature on design of experiments (e.g., [30]), our work does not use a statistical criterion based on the output of each iteration. In other words, we design our measurements based on the knowledge of the assumed model of the signal instead of the outputs of the measurements.
Our notation is standard. Denote \([n] \triangleq \{1,2,\ldots,n\}\); ∥X∥, ∥X∥_{ F }, and ∥X∥_{∗} represent the spectral norm, the Frobenius norm, and the nuclear norm of a matrix X, respectively; let ν_{ i }(Σ) denote the ith largest eigenvalue of a positive semidefinite matrix Σ; ∥x∥_{0}, ∥x∥_{1}, and ∥x∥ represent the ℓ_{0}, ℓ_{1}, and ℓ_{2} norm of a vector x, respectively; let \(\chi _{n}^{2}(\cdot)\) be the quantile function of the chi-squared distribution with n degrees of freedom; let \(\mathbb {E}[x]\) and Var[x] denote the mean and the variance of a random variable x; we write X≽0 to indicate that the matrix is positive semidefinite; ϕ(x|μ,Σ) denotes the probability density function of the multivariate Gaussian with mean μ and covariance matrix Σ; let e_{ j } denote the jth column of the identity matrix I (i.e., e_{ j } is a vector with only one nonzero entry at location j); and \((x)^{+} \triangleq \max \{x, 0\}\) for \(x\in \mathbb {R}\).
Method: InfoGreedy Sensing
A typical sequential compressed sensing setup is as follows. Let \(x \in \mathbb {R}^{n}\) be an unknown n-dimensional signal. We make K measurements of x sequentially,
\(y_{k} = a_{k}^{\intercal } x + w_{k}, \quad k = 1, \ldots, K,\)
where \(a_{k} \in \mathbb {R}^{n}\) is the kth measurement vector and \(w_{k} \sim \mathcal {N}(0, \sigma ^{2})\) is the measurement noise,
and the power of the measurement vector is ∥a_{ k }∥^{2}=β_{ k }. The goal is to recover x using measurements \(\{y_{k}\}_{k=1}^{K}\). Consider a Gaussian signal \(x \sim \mathcal {N}(0, \Sigma)\) with known zero mean and covariance matrix Σ (here without loss of generality we have assumed the signal has zero mean). Assume the rank of Σ is s and the signal is low rank, i.e., s≪n (however, the algorithm does not require the covariance to be low rank).
Our goal is to estimate the signal x using sequential and adaptive measurements. InfoGreedy Sensing introduced in [6] is one of such adaptive methods which chooses each measurement to maximize the conditional mutual information
The goal of this sensing scheme is to use a minimum number of measurements (or to use the minimum total power) so that the estimated signal is recovered with precision ε; i.e., \(\|\widehat {x} - x\| < \varepsilon \) with a high probability p. Define
\({\chi _{n, p, \varepsilon }} \triangleq \varepsilon ^{2}/\chi _{n}^{2}(p),\)
and we will show in the following that this is a fundamental quantity that determines the termination condition of our algorithm to achieve the precision ε with the confidence level p. Note that χ_{n,p,ε} is a precision ε adjusted by the confidence level.
Gaussian signal
In [6], we have devised a solution to (1) when the signal is Gaussian. The measurement will be made in the directions of the eigenvectors of Σ in a decreasing order of eigenvalues, and the powers (or the number of measurements) will be such that the eigenvalues after the measurements are sufficiently small (i.e., less than ε). The power allocation depends on the noise variance, signal recovery precision ε, and confidence level p, as given in Algorithm 1. Note that in Step 6, the update of covariance matrix can also be implemented, equivalently, via \(\lambda \sigma ^{2} uu^{\intercal }/\left (\beta \lambda +\sigma ^{2}\right) + \Sigma ^{\perp u}\), as explained in (6). In the algorithm, the initializations μ and Σ are estimated and may not be very accurate.
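Algorithm 1 itself is not reproduced in this text, so the following is only a minimal sketch of the procedure described above, written under stated assumptions. It measures along the leading eigenvector of the current assumed covariance, uses the power allocation \(\beta = (1/\chi _{n,p,\varepsilon } - 1/\lambda)^{+}\sigma ^{2}\) that appears later in the text, and applies the standard Gaussian posterior update. The function name `info_greedy_gaussian`, its arguments, and the small stopping tolerance are illustrative choices, not part of the original algorithm listing.

```python
import numpy as np

def info_greedy_gaussian(x, Sigma, sigma2, chi, rng, mu=None):
    """Sense x sequentially under the assumed model N(mu, Sigma).

    Stops once every eigenvalue of the assumed posterior covariance
    is at most chi (the threshold chi_{n,p,eps} in the text).
    """
    n = Sigma.shape[0]
    mu = np.zeros(n) if mu is None else np.array(mu, dtype=float)
    Sigma = np.array(Sigma, dtype=float)
    total_power = 0.0
    for _ in range(n):  # at most n eigen-directions can exceed chi
        lam, U = np.linalg.eigh(Sigma)      # eigenvalues in ascending order
        lam_max, u = lam[-1], U[:, -1]
        if lam_max <= chi * (1 + 1e-9):     # tolerance is an assumption
            break
        beta = (1.0 / chi - 1.0 / lam_max) * sigma2  # power for this step
        a = np.sqrt(beta) * u                        # measurement vector
        y = a @ x + np.sqrt(sigma2) * rng.standard_normal()
        Sa = Sigma @ a
        denom = a @ Sa + sigma2
        mu = mu + Sa * (y - a @ mu) / denom          # posterior mean
        Sigma = Sigma - np.outer(Sa, Sa) / denom     # rank-one covariance update
        total_power += beta
    return mu, Sigma, total_power
```

With this allocation, the measured eigenvalue drops from λ to exactly χ_{n,p,ε}, since λσ^{2}/(βλ+σ^{2}) = χ_{n,p,ε} when β = (1/χ_{n,p,ε} − 1/λ)σ^{2}.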
One-sparse measurement
The problem of InfoGreedy Sensing with a sparse measurement constraint, i.e., each measurement has only k_{0} nonzero entries, ∥a∥_{0}=k_{0}, has been examined in [6] and solved using outer approximation (cutting planes). Here, we will focus on one-sparse measurements, ∥a∥_{0}=1, as it is an important instance arising in applications such as nondestructive testing (NDT).
InfoGreedy Sensing with one-sparse measurements can be readily derived. Note that the mutual information between x and the outcome using the one-sparse measurement \(y_{1} = e_{j}^{\intercal } x + w_{1}\) is given by
\(\mathbb {I}[x; y_{1}] = \frac {1}{2} \ln \left (\Sigma _{jj}/\sigma ^{2} + 1\right),\)
where Σ_{ jj } denotes the jth diagonal entry of matrix Σ. Hence, the measurement that maximizes the mutual information is given by \(e_{j^{*}}\) where \(j^{*} \triangleq \arg \max _{j} \Sigma _{jj}\), i.e., measuring in the signal coordinate with the largest variance or largest uncertainty. Then InfoGreedy Sensing measurements can be found iteratively, as presented in Algorithm 2. Note that the correlation of signal coordinates is reflected in the update of the covariance matrix: if the ith and jth coordinates of the signal are highly correlated, then the uncertainty in j will also be greatly reduced if we measure in i. Similar to the previous two algorithms, the initial parameters are not required to be accurate.
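One iteration of the one-sparse scheme can be sketched as follows. Algorithm 2 is not reproduced in this text, so this is an illustration based on the standard Gaussian conditioning formulas with the measurement vector \(a = \sqrt {\beta } e_{j^{*}}\); the function name `one_sparse_step` and its signature are hypothetical.

```python
import numpy as np

def one_sparse_step(mu, Sigma, x, beta, sigma2, rng):
    """Measure the coordinate with the largest assumed variance, then update."""
    j = int(np.argmax(np.diag(Sigma)))          # j* = argmax_j Sigma_jj
    y = np.sqrt(beta) * x[j] + np.sqrt(sigma2) * rng.standard_normal()
    col = Sigma[:, j]                           # covariances with coordinate j
    denom = beta * Sigma[j, j] + sigma2         # a^T Sigma a + sigma^2
    mu_new = mu + np.sqrt(beta) * col * (y - np.sqrt(beta) * mu[j]) / denom
    Sigma_new = Sigma - beta * np.outer(col, col) / denom
    return j, mu_new, Sigma_new
```

Because the update subtracts a multiple of the outer product of column j with itself, measuring coordinate j also shrinks the variance of every coordinate correlated with it, which is the effect described above.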
Updating covariance with sequential data
If our goal is to estimate a sequence of data x_{1},x_{2},… (versus just estimating a single instance), we may be able to update the covariance matrix using the already estimated signals simply via
and the initial covariance matrix is specified by our prior knowledge \(\widehat {\Sigma }_{0} = \widehat {\Sigma }\). Using the updated covariance matrix \(\widehat {\Sigma }_{t}\), we design the next measurement for signal x_{t+1}. This way, we may be able to correct the inaccuracy of \(\widehat {\Sigma }\) by including new samples. Here, α is a parameter for the update stepsize. We refer to this method as “InfoGreedy2” hereafter.
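The exact update rule (2) is not reproduced in this text. The sketch below therefore assumes one natural form, a convex combination of the previous covariance estimate and the outer product of the newly estimated signal, with α playing the role of the step-size parameter mentioned above. Both the form of the update and the name `update_covariance` are assumptions for illustration.

```python
import numpy as np

def update_covariance(Sigma_prev, x_hat, alpha):
    """Assumed update: alpha * Sigma_prev + (1 - alpha) * x_hat x_hat^T."""
    return alpha * Sigma_prev + (1.0 - alpha) * np.outer(x_hat, x_hat)
```

Under this assumed form, α close to 1 trusts the prior estimate, while smaller α lets newly recovered signals correct it more quickly; the result stays symmetric and positive semidefinite for α in [0, 1].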
Gaussian mixture model signals
In this subsection, we introduce the case of sensing Gaussian mixture model (GMM) signals. The probability density function of a GMM is given by
\(p(x) = \sum _{c=1}^{C} \pi _{c}\, \phi (x|\mu _{c}, \Sigma _{c}),\)
where C is the number of classes, and π_{ c } is the probability that the sample is drawn from class c. Unlike for Gaussian signals, the mutual information of GMM has no explicit form. However, for GMM signals, there are two approaches that tend to work well: InfoGreedy Sensing derived based on a gradient descent approach [6, 21], which uses the fact that the gradient of the conditional mutual information with respect to a is a linear transform of the minimum mean square error (MMSE) matrix [33, 34], and the so-called greedy heuristic [6], which approximately maximizes the mutual information, shown in Algorithm 3. The greedy heuristic picks the Gaussian component with the highest posterior π_{ c } at that moment and chooses the next measurement a as its eigenvector associated with the maximum eigenvalue. The greedy heuristic can be implemented more efficiently compared to the gradient descent approach and sometimes has competitive performance [6]. Also, the initialization for means, covariances, and weights can be off from the true values.
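One step of the greedy heuristic might look like the sketch below. Algorithm 3 is not reproduced in this text, so the posterior updates here follow standard Bayesian conditioning of a Gaussian mixture on a scalar linear measurement; the function name `gmm_greedy_step` and its argument layout (lists of per-component means and covariances) are illustrative assumptions.

```python
import numpy as np

def gmm_greedy_step(pi, mus, Sigmas, x, beta, sigma2, rng):
    """Pick the most probable component, measure along its top eigenvector,
    then update the weights, means, and covariances of all components."""
    c_star = int(np.argmax(pi))
    _, U = np.linalg.eigh(Sigmas[c_star])
    a = np.sqrt(beta) * U[:, -1]                 # leading eigenvector, power beta
    y = a @ x + np.sqrt(sigma2) * rng.standard_normal()
    log_w = np.empty(len(pi))
    new_mus, new_Sigmas = [], []
    for c in range(len(pi)):
        Sa = Sigmas[c] @ a
        s2 = a @ Sa + sigma2                     # predictive variance of y under c
        r = y - a @ mus[c]                       # innovation under component c
        log_w[c] = np.log(pi[c]) - 0.5 * (np.log(2 * np.pi * s2) + r * r / s2)
        new_mus.append(mus[c] + Sa * r / s2)
        new_Sigmas.append(Sigmas[c] - np.outer(Sa, Sa) / s2)
    log_w -= log_w.max()                         # stabilize before exponentiating
    new_pi = np.exp(log_w)
    new_pi /= new_pi.sum()
    return a, new_pi, new_mus, new_Sigmas
```

The weights are updated in the log domain so that a very unlikely component does not cause numerical underflow.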
Performance bounds
In the following, we establish performance bounds for cases when we (1) sense Gaussian signals using estimated covariance matrices and (2) sense Gaussian signals with one-sparse measurements.
Gaussian case with model mismatch
To analyze the performance of our algorithms when the assumed covariance \(\widehat {\Sigma }\) used in Algorithm 1 is different from the true signal covariance matrix Σ, we introduce the following notation. Let the eigenpairs of Σ, with the eigenvalues (which can be zero) ranked from the largest to the smallest, be (λ_{1},u_{1}),(λ_{2},u_{2}),…,(λ_{ n },u_{ n }), and let the eigenpairs of \(\widehat {\Sigma }\), with the eigenvalues (which can be zero) ranked from the largest to the smallest, be \((\hat {\lambda }_{1}, \hat {u}_{1}), (\hat {\lambda }_{2}, \hat {u}_{2}), \ldots, (\hat {\lambda }_{n}, \hat {u}_{n})\). Let the updated covariance matrix in Algorithm 1 starting from \(\widehat {\Sigma }\) after k measurements be \(\widehat {\Sigma }_{k}\) and the true posterior covariance matrix of the signal conditioned on these measurements be Σ_{ k }.
Note that since each time we measure in the direction of the dominating eigenvector of the posterior covariance matrix, \((\hat {\lambda }_{k}, \hat {u}_{k})\) and (λ_{ k },u_{ k }) correspond to the largest eigenpair of \(\widehat {\Sigma }_{k-1}\) and Σ_{k−1}, respectively. Furthermore, define the difference between the true and the assumed conditional covariance matrices after k measurements as
\(E_{k} \triangleq \widehat {\Sigma }_{k} - \Sigma _{k},\)
and their sizes
\(\delta _{k} \triangleq \|E_{k}\|.\)
Let the eigenvalues of E_{ k } be e_{1}≥e_{2}≥⋯≥e_{ n }; then the spectral norm of E_{ k } is the maximum of the absolute values of the eigenvalues. Hence, δ_{ k }= max{|e_{1}|,|e_{ n }|}. Let
\(\delta _{0} \triangleq \|\widehat {\Sigma } - \Sigma \|\)
denote the size of the initial mismatch.
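The identity between the spectral norm of a symmetric matrix and its extreme eigenvalues can be checked numerically. The snippet below is a small illustration on a randomly generated symmetric matrix standing in for E_{ k }; the matrix and variable names are placeholders, not quantities from the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
E = (A + A.T) / 2                      # a symmetric stand-in for E_k

eigs = np.linalg.eigvalsh(E)           # ascending: eigs[0] = e_n, eigs[-1] = e_1
delta = max(abs(eigs[-1]), abs(eigs[0]))
spectral = np.linalg.norm(E, 2)        # largest singular value of E
```

Only the largest and smallest eigenvalues need to be inspected, because the eigenvalue of largest magnitude is always one of the two.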
Deterministic mismatch
First, we assume the mismatch is deterministic and find bounds for bias and variance of the estimated signal. It is common in practice to use estimated covariance matrices, which may have deterministic bias from the true covariances. Assume the initial mean is \(\hat {\mu }\) and the true signal mean is μ, the updated mean using Algorithm 1 after k measurements is \(\hat {\mu }_{k}\), and the true posterior mean is μ_{ k }.
Theorem 1
[Unbiasedness] After k measurements, the expected difference between the updated mean and the true posterior mean is given by
Moreover, if \(\hat {\mu } = \mu \), i.e., the assumed mean is accurate, the estimator is unbiased throughout all the iterations: \(\mathbb E[\hat {\mu }_{k} - \mu _{k}]=0\), for k=1,…,K.
Next, we show that the variance of the estimator, when the initial mismatch \(\|\widehat {\Sigma } - \Sigma \|\) is sufficiently small, reduces gracefully. This is captured through the reduction of entropy, which is also a measure of the uncertainty in the estimator. In particular, we consider the posterior entropy of the signal conditioned on the previous measurement outcomes. Since the entropy of a Gaussian signal \(x \sim \mathcal {N}(\mu, \Sigma)\) is given by \( \mathbb {H}[x] = \ln \left [(2\pi e)^{n/2} \det ^{1/2}(\Sigma)\right ], \) the conditional entropy is determined, up to an additive constant, by the log of the determinant of the conditional covariance matrix, or equivalently by the log of the volume of the ellipsoid defined by the covariance matrix. Here, to accommodate the scenario where the covariance matrix is low rank (our earlier assumption), we consider a modified definition for conditional entropy, which is the logarithm of the volume of the ellipsoid on the low-dimensional space that the signal lies on:
where Vol(Σ_{ k }) is the volume of the ellipsoid, which equals the product of the nonzero eigenvalues of Σ_{ k }:
where rank(Σ_{ k })=s_{ k }.
Theorem 2
[Entropy of estimator] If for some constant δ∈(0,1) the initial error satisfies
then for k=1,…,K,
where
Note that in (3), the allowable initial error decreases with K. This is because a larger K means that the recovery precision criterion is stricter, and hence, the maximum tolerable initial bias is smaller. In the proof of Theorem 2, we track the trace of the underlying actual covariance matrix tr(Σ_{ k }) as the cost function, which serves as a surrogate for the product of eigenvalues that determines the volume of the ellipsoid and hence the entropy, since it is much easier to calculate the trace of the observed covariance matrix \(\text {tr} (\widehat {\Sigma }_{k})\). The following recursion is crucial for the derivation: for an assumed covariance matrix Σ, after measuring in the direction of a unit norm eigenvector u with eigenvalue λ using power β, the updated matrix takes the form of
\(\frac {\lambda \sigma ^{2}}{\beta \lambda +\sigma ^{2}} uu^{\intercal } + \Sigma ^{\perp u}, \quad (6)\)
where Σ^{⊥u} is the component of Σ in the orthogonal complement of u. Thus, the only change in the eigendecomposition of Σ is the update of the eigenvalue of u from λ to λσ^{2}/(βλ+σ^{2}). Based on (6), after one measurement, the trace of the covariance matrix becomes
\(\text {tr}(\Sigma) - \frac {\beta \lambda ^{2}}{\beta \lambda +\sigma ^{2}}.\)
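The eigenvalue update described above can be checked numerically against the standard Gaussian posterior covariance formula. The snippet below is a sanity check on a random covariance matrix, not part of the proof; the dimensions, noise level, and power are arbitrary placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, beta = 5, 0.5, 2.0
A = rng.standard_normal((n, n))
Sigma = A @ A.T                               # a random covariance matrix
lam_all, U = np.linalg.eigh(Sigma)
lam, u = lam_all[-1], U[:, -1]                # leading eigenpair

# Posterior covariance after measuring y = a^T x + w with a = sqrt(beta) * u.
a = np.sqrt(beta) * u
Sa = Sigma @ a
posterior = Sigma - np.outer(Sa, Sa) / (a @ Sa + sigma2)

# Closed form: shrink the eigenvalue of u, keep the orthogonal complement.
shrunk = lam * sigma2 / (beta * lam + sigma2)
closed_form = shrunk * np.outer(u, u) + (Sigma - lam * np.outer(u, u))

# The trace decreases by lam - shrunk = beta * lam^2 / (beta * lam + sigma2).
trace_drop = beta * lam**2 / (beta * lam + sigma2)
```

Here `posterior` and `closed_form` agree to numerical precision, and the trace of the posterior equals `np.trace(Sigma) - trace_drop`.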
Remark 1
The upper bound of the posterior signal entropy in (4) shows that the amount of uncertainty reduction by the kth measurement is roughly (s/2) ln(1/f_{ k }).
Remark 2
Using the inequality ln(1−x)≤−x for x∈(0,1), we have that in (4)
On the other hand, in the ideal case if the true covariance matrix is used, the posterior entropy of the signal is given by
where \(\tilde {\beta }_{j} = (1/{\chi _{n, p, \varepsilon }} - 1/\lambda _{j})^{+}\sigma ^{2}\). Hence, we have
where \(C \triangleq (s/2) \ln [\text {tr}(\Sigma)/(\prod _{j=1}^{s}\lambda _{j})^{1/s}]\) is a constant independent of measurements. This upper bound has a nice interpretation: it characterizes the amount of uncertainty reduction with each measurement. For example, when the number of measurements required using the assumed covariance matrix is the same as that required using the true covariance matrix, we have λ_{ j }≥χ_{n,p,ε} and \(\hat {\lambda }_{j} \geq {\chi _{n, p, \varepsilon }}\). Hence, the third term in (9) is upper bounded by −k/2, which means that the amount of reduction in entropy is roughly 1/2 nat per measurement.
Remark 3
Consider the special case where the errors only occur in the eigenvalues of the matrix but not in the eigenspace U, i.e., \(\widehat {\Sigma } - \Sigma = U \text {diag}\{e_{1}, \cdots, e_{s}\} U^{\intercal }\) and \(\max _{1\leq j\leq s}|e_{j}|=\delta _{0}\); then the upper bound in (8) can be further simplified. Suppose only the first K (K≤s) largest eigenvalues of \(\widehat {\Sigma }\) are larger than the stopping criterion χ_{n,p,ε} required by the precision, i.e., the algorithm takes K iterations in total. Then,
The additional entropy relative to the ideal case \(\mathbb {H}_{\text {ideal}}\) is typically small, because δ_{ K }≤δ_{0}4^{K} (according to Lemma 7 in Appendix 2), δ_{0} is on the order of ε^{2}, and hence the second term is on the order of K^{2}; the third term will be small because δ_{0} and δ_{ K } are small compared to λ_{ j }.
Note, however, that if the power allocations β_{ i } are calculated using the eigenvalues of the assumed covariance matrix \(\widehat {\Sigma }\), then after K=s iterations we are not guaranteed to reach the desired precision ε with probability p. This becomes possible if we increase the total power slightly. The following theorem establishes an upper bound on the amount of extra total power needed to reach the same precision ε compared to the total power P_{ideal} required if we use the correct covariance matrix.
Theorem 3
[Additional power required] Assume K≤s eigenvalues of Σ are larger than χ_{n,p,ε}. If
then to reach a precision ε at confidence level p, the total power P_{mismatch} required by Algorithm 1 when using \(\widehat {\Sigma }\) is upper bounded by
Note that in Theorem 3, when K=s eigenvalues of Σ are larger than χ_{n,p,ε}, under the conditions of Theorem 3, we have a simpler expression for the upper bound
Note that the additional power required is quite small and is only linear in s.
One-sparse measurement
In the following, we provide performance bounds for the case of one-sparse measurements in Algorithm 2. Assume the signal covariance matrix is known precisely. Now that ∥a_{ k }∥_{0}=1, we have \(a_{k}=\sqrt {\beta _{k}} u_{k}\), where u_{ k }∈{e_{1},⋯,e_{ n }}. Suppose the largest diagonal entry of Σ^{(k−1)} is determined by
From the update equation for the covariance matrix in Algorithm 2, the largest diagonal entry of Σ^{(k)} can be determined from
Let the correlation coefficient be denoted as
where the covariance of the ith and jth coordinate of x after k measurements is denoted as \(\Sigma _{ij}^{(k)}\).
Lemma 1
[One-sparse measurement: recursion for trace of covariance matrix] Assume the minimum correlation for the kth iteration is ρ^{(k−1)}∈[0,1) such that \(\rho ^{(k-1)}\leq \left |\rho _{ij_{k-1}}^{(k-1)}\right |\) for any i∈[n]. Then, for a constant γ>0, if the power of the kth measurement β_{ k } satisfies \(\beta _{k}\geq {\sigma ^{2}}/\left ({\gamma \max _{t}\Sigma _{tt}^{(k-1)}}\right)\), we have
Lemma 1 provides a good bound for a one-step-ahead prediction of the trace of the covariance matrix, as demonstrated in Fig. 2. Using the above lemma, we can obtain an upper bound on the number of measurements needed for one-sparse measurements.
Theorem 4
[Gaussian, one-sparse measurement] For constant γ>0, when power is allocated satisfying \(\beta _{k}\geq {\sigma ^{2}}/({\gamma \max _{t}\Sigma _{tt}^{(k-1)}})\) for k=1,2,…,K, we have \(\|\hat {x} - x\|\leq \varepsilon \) with probability p as long as
The above theorem requires the number of iterations to be on the order of ln(1/ε) to reach a precision of ε (recall that \({\chi _{n, p, \varepsilon }} = \varepsilon ^{2}/\chi _{n}^{2}(p)\)), as expected. It also suggests a method of power allocation, which sets β_{ k } to be proportional to \(\sigma ^{2}/\max _{t}\Sigma _{tt}^{(k-1)}\). This captures the interdependence of the signal entries, as the dependence will affect the diagonal entries of the updated covariance matrix.
Results: numerical examples
In the following, we have three sets of numerical examples to demonstrate the performance of InfoGreedy Sensing when there is mismatch in the signal covariance matrix, when the signal is sampled from Gaussian, and from GMM models, respectively. Below, in all figures, we present sorted estimation errors from the smallest to the largest over all trials.
Sensing Gaussian with mismatched covariance matrix
In the two examples below, we generate true covariance matrices using random positive semidefinite matrices. When the assumed covariance matrix for the signal x is equal to its true covariance matrix, InfoGreedy Sensing is identical to the batch method [21] (the batch method measures using the largest eigenvectors of the signal covariance matrix). However, when there is a mismatch between the two, InfoGreedy Sensing outperforms the batch method due to its adaptivity, as shown by the example demonstrated in Fig. 3 (with K=20). Further performance improvement can be achieved by updating the covariance matrix using estimated signal sequentially such as described in (2). InfoGreedy Sensing also outperforms the sensing algorithm where a_{ i } are chosen to be random Gaussian vectors with the same power allocation, as it uses prior knowledge (albeit being imprecise) about the signal distribution.
Figure 4 demonstrates that, when there is a mismatch in the assumed covariance matrix, better performance can be achieved by making many lower-power measurements than by making one full-power measurement, because we update the assumed covariance matrix in between. The performance of these scenarios is compared with the case without mismatch. The figure also shows that many lower-power measurements and one full-power measurement perform the same when the assumed model is exact.
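The equivalence in the exact-model case follows from the eigenvalue update λ → λσ^{2}/(βλ+σ^{2}), which adds β/σ^{2} to the precision 1/λ. The sketch below checks this for a single eigen-direction, assuming all split measurements are taken along the same eigenvector; the helper `shrink` and the numerical values are illustrative only and are not taken from the experiments.

```python
def shrink(lam, beta, sigma2):
    """Posterior eigenvalue after one measurement of power beta along u."""
    return lam * sigma2 / (beta * lam + sigma2)

lam, beta, sigma2, m = 3.0, 4.0, 0.5, 8

one_shot = shrink(lam, beta, sigma2)          # one full-power measurement

split = lam
for _ in range(m):                            # m measurements of power beta/m
    split = shrink(split, beta / m, sigma2)
```

Both `one_shot` and `split` equal 1/(1/λ + β/σ^{2}) up to rounding. With a mismatched model the directions are re-estimated between measurements, so this equivalence no longer applies.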
Measure Gaussian mixture model signals using one-sparse measurements
In this example, we sense a GMM signal with one-sparse measurements. Assume there are C=3 components and we know the signal covariance matrix exactly. We consider two cases of generating the covariance matrix for each signal: when the low-rank covariance matrices for each component are generated completely at random and when they have a certain structure. In this example, we expect “InfoGreedy” to have much better performance than “Random” in the second case (b) because there is a structure in the covariance matrix. Since InfoGreedy has an advantage in exploiting structure in covariance, it should have a larger performance gain. In the first case (a), the covariance matrix is generated randomly, and thus, the performance gain is not significant.
Figure 5 shows the reconstruction error \(\|\hat {x} - x\|\), using K=40 one-sparse measurements for GMM signals. Note that InfoGreedy Sensing (Algorithm 2) with unit power β_{ j }=1 can significantly outperform the random approach with unit power (which corresponds to randomly selecting coordinates of the signal to measure). The experimental results validate our expectation.
Real data
Sensing of a video stream using Gaussian model
In this example, we use a video from the Solar Data Observatory. In this scenario, one aims to compress the high-resolution video (before storage and transmission). Each measurement corresponds to a linear compression of a frame. The frame is of size 232×292 pixels. We use the first 50 frames to form a sample covariance matrix \(\widehat {\Sigma }\) and use it to perform InfoGreedy Sensing on the rest of the frames. We take K=90 measurements. As demonstrated in Fig. 6, InfoGreedy Sensing performs much better in that it acquires more information such that the recovered image has much richer details.
Sensing of a high-resolution image using GMM
The second example is motivated by computational photography [35], where one takes a sequence of measurements and each measurement corresponds to the integrated light intensity through a designed mask. We consider a scheme for sensing a high-resolution image that exploits the fact that the patches of the image can be approximated using a Gaussian mixture model, as demonstrated in Fig. 1. We break the image into 8×8 patches, resulting in 89250 patches. We randomly select 500 patches (0.56% of the total pixels) to estimate a GMM model with C=10 components, and then, based on the estimated GMM, initialize InfoGreedy Sensing with K=5 measurements and sense the rest of the patches. This means we can use a compressive imaging system to capture five low-resolution images of size 238×275 (this corresponds to compressing the data into 8.32% of its original dimensionality). With such a small number of measurements, the recovered image from InfoGreedy Sensing measurements has superior quality compared with those with random sensing measurements.
Covariance sketching
We may be able to initialize \(\widehat {\Sigma }\) with desired precision via covariance sketching, i.e., using fewer samples to reach a “rough” estimate of the covariance matrix. In this section, we present the covariance sketching scheme, by adapting the covariance sketching in earlier works [24, 25]. The goal here is not to present completely new covariance sketching algorithms, but rather to illustrate how to efficiently obtain initialization for InfoGreedy.
Consider the following setup for covariance sketching. Suppose we are able to form a measurement in the form of \(y = a^{\intercal } x + w\) like we have in the InfoGreedy Sensing algorithm.
Suppose there are N copies of the Gaussian signal that we would like to sketch, \(\tilde {x}_{1},\ldots, \tilde {x}_{N}\), which are i.i.d. samples from \(\mathcal {N}(0, \Sigma)\), and we sketch using M random vectors b_{1},…,b_{ M }. Then, for each fixed sketching vector b_{ i } and fixed copy of the signal \(\tilde {x}_{j}\), we acquire L noisy realizations of the projection result y_{ ijl } via
\(y_{ijl} = b_{i}^{\intercal } \tilde {x}_{j} + w_{ijl}, \quad l = 1, \ldots, L,\)
with i.i.d. noise \(w_{ijl} \sim \mathcal {N}(0, \sigma ^{2})\).
We choose the random sampling vectors b_{ i } as i.i.d. Gaussian with zero mean and covariance matrix equal to an identity matrix. Then, we average y_{ ijl } over all realizations l = 1,…,L to form the ith sketch y_{ i j } for a single copy \(\tilde {x}_{j}\):
\(y_{ij} = \frac {1}{L}{\sum \nolimits }_{l=1}^{L} y_{ijl} = b_{i}^{\intercal } \tilde {x}_{j} + w_{ij}.\)
The average is introduced to suppress measurement noise, which can be viewed as a generalization of sketching using just one sample. Here \( w_{ij}\triangleq \frac {1}{L}{\sum \nolimits }_{l=1}^{L} w_{ijl}, \) which is distributed as \(\mathcal N(0, \sigma ^{2}/L)\). Then, we will use the average energy of the sketches as our data γ_{ i }, i=1,…,M, for covariance recovery: \( \gamma _{i} \triangleq \frac {1}{N}{\sum \nolimits }_{j=1}^{N}y_{ij}^{2}. \) Note that γ_{ i } can be further expanded as
\(\gamma _{i} = \text {tr}\left (\widehat {\Sigma }_{N} b_{i} b_{i}^{\intercal }\right) + \frac {2}{N}{\sum \nolimits }_{j=1}^{N} w_{ij} b_{i}^{\intercal } \tilde {x}_{j} + \frac {1}{N}{\sum \nolimits }_{j=1}^{N} w_{ij}^{2}, \quad (12)\)
where \( \widehat {\Sigma }_{N}\triangleq \frac {1}{N}\sum _{j=1}^{N}\tilde {x}_{j} \tilde {x}^{\intercal }_{j} \) is the maximum likelihood estimate of Σ (and is also unbiased). We can write (12) in vector matrix notation as follows. Let \(\gamma =[\gamma _{1},\cdots, \gamma _{M}]^{\intercal }\). Define a linear operator \(\mathcal B:\mathbb R^{n\times n}\mapsto \mathbb R^{M}\) such that \([\mathcal {B}(X)]_{i}=\text {tr}\left (X b_{i} b_{i}^{\intercal }\right)\). Thus, we can write (12) as a linear measurement of the true covariance matrix Σ: \(\gamma =\mathcal {B} (\Sigma)+\eta,\) where \(\eta \in \mathbb {R}^{M}\) contains all the error terms and corresponds to the noise in our covariance sketching measurements, with the ith entry given by
\(\eta _{i} = b_{i}^{\intercal }\left (\widehat {\Sigma }_{N} - \Sigma \right) b_{i} + \frac {2}{N}{\sum \nolimits }_{j=1}^{N} w_{ij} b_{i}^{\intercal } \tilde {x}_{j} + \frac {1}{N}{\sum \nolimits }_{j=1}^{N} w_{ij}^{2}.\)
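The sketching operator and the energy data γ_{ i } can be simulated in a few lines. The following is a minimal sketch under the setup above: rows of `B` hold the sketching vectors b_{ i }, rows of `X_samples` hold the training copies, and the averaged noise w_{ ij } is drawn directly from \(\mathcal N(0, \sigma ^{2}/L)\) rather than by averaging L draws. The function names are hypothetical.

```python
import numpy as np

def sketch_operator(X, B):
    """[B(X)]_i = tr(X b_i b_i^T) = b_i^T X b_i, with b_i the rows of B."""
    return np.einsum("ij,jk,ik->i", B, X, B)

def sketch_energies(X_samples, B, sigma2, L, rng):
    """Average sketch energies gamma_i from N samples (rows of X_samples)."""
    proj = B @ X_samples.T                         # (M, N): entries b_i^T x_j
    # Averaged noise w_ij ~ N(0, sigma2 / L), sampled directly (an assumption
    # equivalent in distribution to averaging L i.i.d. N(0, sigma2) draws).
    w = np.sqrt(sigma2 / L) * rng.standard_normal(proj.shape)
    y = proj + w                                   # sketches y_ij
    return np.mean(y**2, axis=1)                   # gamma_i = (1/N) sum_j y_ij^2
```

In the noiseless case, `sketch_energies` returns exactly `sketch_operator` applied to the sample covariance matrix, which is the first term of the expansion of γ_{ i }.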
Note that we can further bound the ℓ_{1} norm of the error term as
where \( b\triangleq \sum _{i=1}^{M} \Vert b_{i}\Vert ^{2},\ \mathbb E[\!b]=Mn,\ \text {Var}[b] =2Mn, w\triangleq \frac {1}{N}\sum _{i=1}^{M} \sum _{j=1}^{N} w_{ij}^{2},\ \mathbb E[w]=M\sigma ^{2}/L,\ \text {and}\ \text {Var} [w]=\frac {2M\sigma ^{4}}{NL^{2}}, \) and
We may recover the true covariance matrix from the sketches γ using the convex optimization problem (13).
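The construction of the sketches γ can be illustrated numerically. The following is a minimal sketch under stated assumptions: the function names `sketch_operator` and `energy_sketches`, the dimensions, and the diagonal Σ are hypothetical choices for illustration and are not taken from the paper. It forms γ_{ i } as the average energy of the noise-averaged projections and evaluates the operator \([\mathcal {B}(X)]_{i}=\text {tr}(X b_{i} b_{i}^{\intercal })\).

```python
import numpy as np

def sketch_operator(X, B):
    """Linear map [B(X)]_i = tr(X b_i b_i^T) = b_i^T X b_i; the rows of B are the b_i."""
    return np.einsum("in,nm,im->i", B, X, B)

def energy_sketches(X_samples, B, sigma, L, rng):
    """gamma_i = (1/N) sum_j y_ij^2, where y_ij averages L noisy projections."""
    M, N = B.shape[0], X_samples.shape[0]
    proj = B @ X_samples.T                          # (M, N) array of b_i^T x_j
    noise = sigma * rng.standard_normal((M, N, L))  # w_ijl ~ N(0, sigma^2)
    y = proj + noise.mean(axis=2)                   # y_ij = b_i^T x_j + w_ij
    return (y ** 2).mean(axis=1)

rng = np.random.default_rng(0)
n, N, M, L = 5, 200, 12, 10                         # illustrative sizes only
Sigma = np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
X_samples = rng.multivariate_normal(np.zeros(n), Sigma, size=N)
B = rng.standard_normal((M, n))                     # i.i.d. Gaussian sketching vectors
Sigma_hat_N = X_samples.T @ X_samples / N           # sample covariance (zero mean)

gamma_clean = energy_sketches(X_samples, B, sigma=0.0, L=L, rng=rng)
gamma_noisy = energy_sketches(X_samples, B, sigma=0.01, L=L, rng=rng)
```

Without measurement noise, γ coincides with \(\mathcal {B}(\widehat {\Sigma }_{N})\); with noise, the difference is the contribution of the w_{ ij } terms to η.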
We need L to be sufficiently large to reach the desired precision. The following Lemma 2 arises from a simple tail probability bound of the Wishart distribution (since the sample covariance matrix follows a Wishart distribution).
Lemma 2
[Initialize with sample covariance matrix] For any constant δ>0, we have \(\Vert \widehat {\Sigma } - \Sigma \Vert \leq \delta \) with probability exceeding \(1-2n\exp (-\sqrt {n})\), as long as
Lemma 2 shows that the number of measurements needed to reach a precision δ for a sample covariance matrix is \(\mathcal {O}(1/\delta ^{2})\) as expected.
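The scaling can be checked empirically. The snippet below is a rough illustration, not a verification of the constants in Lemma 2: the helper `sample_cov_error`, the dimension, and the sample sizes are assumptions made for this example. It measures the spectral-norm error of the sample covariance matrix for two sample sizes.

```python
import numpy as np

def sample_cov_error(Sigma, N, rng):
    """Spectral-norm error of the zero-mean sample covariance built from N samples."""
    n = Sigma.shape[0]
    X = rng.multivariate_normal(np.zeros(n), Sigma, size=N)
    Sigma_hat = X.T @ X / N
    return np.linalg.norm(Sigma_hat - Sigma, ord=2)

rng = np.random.default_rng(1)
Sigma = np.diag(np.linspace(1.0, 0.1, 10))          # illustrative covariance, n = 10
errors = {N: sample_cov_error(Sigma, N, rng) for N in (100, 10_000)}
```

Since the required sample size grows as 1/δ², using 100 times more samples should reduce the error by roughly a factor of 10.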
We may also use a covariance sketching scheme similar to that described in [23–25] to estimate \(\widehat {\Sigma }\). Covariance sketching is based on random projections of each training sample, and hence, it is memory efficient when we are not able to store or operate on the full vectors directly. The covariance sketching scheme is described below. Assume training samples \(\tilde {x}_{i}\), i=1,…,N, are drawn from the signal distribution. Each sample \(\tilde {x}_{i}\) is sketched M times using random sketching vectors b_{ ij }, j=1,…,M, through a noisy linear measurement \(\left (b_{ij}^{\intercal } \tilde {x}_{i} + w_{ijl}\right)^{2}\); we repeat this L times (l=1,…,L) and compute the average energy to suppress noise^{Footnote 1}. This sketching process can be shown to be a linear operator \(\mathcal {B}\) applied to the original covariance matrix Σ. We may recover the original covariance matrix from the vector of sketching outcomes \(\gamma \in \mathbb {R}^{M}\) by solving the following convex optimization problem
where τ is a user parameter that depends on the noise level. In the following lemma, we further establish conditions on the covariance sketching parameters N, M, L, and τ so that the recovered covariance matrix \(\widehat {\Sigma }\) may reach the required precision in Theorem 2, by adapting the results in [25].
Lemma 3
[Initialize with covariance sketching] For any δ>0, the solution to (13) satisfies \(\Vert \widehat {\Sigma }-\Sigma \Vert \leq \delta \) with probability exceeding \(1-2/n-{2}/{\sqrt {n}}-2n\exp (-\sqrt {n}) -\exp (-c_{1} M)\), as long as the parameters M, N, L, and τ satisfy the following conditions
where c_{0}, c_{1}, and c_{2} are absolute constants.
Finally, we present one numerical example to validate covariance sketching as an initialization for InfoGreedy Sensing, as shown in Fig. 7. We compare it with the case (“direct” in the figure) in which the sample covariance matrix is estimated directly from the original samples. The parameters are as follows: signal dimension n=10; 30 samples, with m=40 sketches for each sample (thus the dimensionality reduction ratio is 40/10^{2}=0.4); precision level ε=0.1; confidence level p=0.95; and noise standard deviation σ_{0}=0.01. The covariance matrix \(\widehat {\Sigma }\) is obtained by solving the optimization problem (13) using CVX, a package for specifying and solving convex programs [36, 37]. Note that covariance sketching has a higher error level (the price of dimensionality reduction); however, the errors are still below the precision level (ε=0.1), and thus the performance of covariance sketching is acceptable.
Conclusions and discussions
In this paper, we have studied the robustness of a sequential compressed sensing algorithm based on conditional mutual information maximization, the so-called InfoGreedy Sensing [6], when the parameters are learned from data. We quantified the algorithm's performance in the presence of estimation errors. We further presented a covariance-sketching-based scheme for initializing covariance matrices. Numerical examples demonstrated the robust performance of InfoGreedy Sensing.
Our results for Gaussian and GMM signals are quite general in the following sense. In high-dimensional problems, a commonly used low-dimensional signal model for x is to assume the signal lies in a subspace plus Gaussian noise, which corresponds to the case where the signal is Gaussian with a low-rank covariance matrix; GMM is also commonly used (e.g., in image analysis and video processing) as it models signals lying in a union of multiple subspaces plus Gaussian noise. In fact, parameterizing via low-rank GMMs is a popular way to approximate complex densities for high-dimensional data.
Appendix 1
Backgrounds
Lemma 4
[Eigenvalue of perturbed matrix [38]] Let Σ, \(\widehat {\Sigma }\in \mathbb {R}^{n\times n}\) be symmetric, with eigenvalues λ_{1}≥⋯≥λ_{ n } and \(\hat {\lambda }_{1}\geq \cdots \geq \hat {\lambda }_{n}\), respectively. Let \(E\triangleq \widehat {\Sigma }-\Sigma \) have eigenvalues e_{1}≥⋯≥e_{ n }. Then for each i∈{1,⋯,n}, the perturbed eigenvalues satisfy \(\hat {\lambda }_{i}\in [\lambda _{i}+e_{n}, \lambda _{i}+e_{1}].\)
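As a quick numerical illustration of this eigenvalue perturbation bound (Weyl's inequality), the following sketch draws a random symmetric matrix and a small symmetric perturbation and checks that every perturbed eigenvalue lies in the stated interval. The matrix sizes, the perturbation scale, and the tolerance `tol` are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
G = rng.standard_normal((n, n))
Sigma = G @ G.T                                     # symmetric "true" matrix
P = rng.standard_normal((n, n))
E = 0.05 * (P + P.T)                                # symmetric perturbation
Sigma_hat = Sigma + E

# eigvalsh returns ascending eigenvalues; reverse to match lambda_1 >= ... >= lambda_n.
lam = np.linalg.eigvalsh(Sigma)[::-1]
lam_hat = np.linalg.eigvalsh(Sigma_hat)[::-1]
e = np.linalg.eigvalsh(E)[::-1]

tol = 1e-10                                         # slack for floating-point error
within = np.all((lam_hat >= lam + e[-1] - tol) & (lam_hat <= lam + e[0] + tol))
```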
Lemma 5
[Stability conditions for covariance sketching [25]] Denote \(\mathcal A:\mathbb R^{n\times n}\mapsto \mathbb R^{m}\) a linear operator and for \(X\in \mathbb R^{n\times n}\), \(\mathcal A(X)=\{a_{i}^{T} X a_{i}\}_{i=1}^{m}\). Suppose the measurement is contaminated by noise η∈R^{m}, i.e., \(Y=\mathcal A(\Sigma)+\eta \) and assume ∥η∥_{1}≤ε_{1}. Then with probability exceeding 1−exp(−c_{1}m), the solution \(\widehat {\Sigma }\) to the trace minimization (13) satisfies
for all Σ∈R^{n×n}, provided that m>c_{0}nr. Here c_{0}, c_{1}, and c_{2} are absolute constants and Σ_{ r } represents the best rank-r approximation of Σ. When Σ_{ r } is exactly rank-r
Lemma 6
[Concentration of measure for Wishart distribution [39]] If \(X\in \!\! \mathbb R^{n\times n}\sim \mathcal W_{n}(N,\Sigma)\), then for t>0,
where θ=tr(Σ)/∥Σ∥.
Appendix 2
Proofs
Gaussian signal with mismatch
Proof [Proof of Theorem 1] Let \(\xi _{k}\triangleq \hat {\mu }_{k}-\mu _{k}. \) From the update equation for the mean \(\hat {\mu }_{k} = \hat {\mu }_{k-1} + \widehat {\Sigma }_{k-1} a_{k} \left (y_{k} - a_{k}^{\intercal } \hat {\mu }_{k-1}\right)/\left ({a}_{k}^{\intercal } \widehat {\Sigma }_{k-1} a_{k} +\sigma ^{2}\right),\) and since a_{ k } is an eigenvector of \(\widehat {\Sigma }_{k-1}\), we have the following recursion:
From the recursion of ξ_{ k } in (16), for some vector C_{ k } defined properly, we have that
where the expectation is taken over random variables x and w’s. Note that the second term is equal to zero using an argument based on iterated expectation
Hence, Theorem 1 is proved by iteratively applying the recursion (17). When \(\hat {\mu }_{0} - \mu _{0}=0\), we have \( \mathbb E[\xi _{k}]=0, k=0,1,\ldots, K. \)
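The unbiasedness claim can be illustrated by simulation for a single measurement. The sketch below assumes the standard Gaussian posterior-mean update for both the true and the assumed model; the helper `gain`, the covariance matrices, and the trial count are hypothetical choices. With a correct initial mean, the mismatch ξ_{1} is a fixed vector times the zero-mean innovation, so its empirical average should be close to zero even though the assumed covariance is wrong.

```python
import numpy as np

def gain(Sigma, a, sigma2):
    """Gain vector Sigma a / (a^T Sigma a + sigma2) of the posterior-mean update."""
    return Sigma @ a / (a @ Sigma @ a + sigma2)

rng = np.random.default_rng(3)
n, sigma2, trials = 4, 0.01, 100_000
mu0 = np.array([1.0, -1.0, 0.5, 0.0])               # correct initial mean
Sigma = np.diag([2.0, 1.0, 0.5, 0.2])               # true covariance
Sigma_hat = Sigma + 0.1 * np.eye(n)                 # assumed (mismatched) covariance
Sigma_hat[0, 1] = Sigma_hat[1, 0] = 0.05

# Measurement direction: top eigenvector of the assumed covariance.
a = np.linalg.eigh(Sigma_hat)[1][:, -1]

X = rng.multivariate_normal(mu0, Sigma, size=trials)
y = X @ a + np.sqrt(sigma2) * rng.standard_normal(trials)

g_hat, g_true = gain(Sigma_hat, a, sigma2), gain(Sigma, a, sigma2)
# xi_1 = mu_hat_1 - mu_1 = (g_hat - g_true) * (y - a^T mu0), one row per trial.
xi = np.outer(y - mu0 @ a, g_hat - g_true)
xi_mean = xi.mean(axis=0)
```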
In the following, Lemma 7 to Lemma 9 are used to prove Theorem 2.
Lemma 7
[Recursion in covariance matrix mismatch.]
If δ_{k−1}≤3σ^{2}/(4β_{ k }), then δ_{ k }≤4δ_{k−1}.
Proof
Let \(\widehat {A}_{k}\triangleq {a}_{k}{a}^{\intercal }_{k}\). Hence, \(\Vert \widehat {A}_{k} \Vert =\beta _{k}\). Recall that a_{ k } is an eigenvector of \(\widehat {\Sigma }_{k-1}\). Using the definition \(E_{k} \triangleq \widehat {\Sigma }_{k} - \Sigma _{k}\), together with the recursions of the covariance matrices
we have
Based on this recursion, using δ_{ k }=∥E_{ k }∥, the triangle inequality, and inequality ∥AB∥≤∥A∥∥B∥, we have
Hence, if we set δ_{k−1}≤3σ^{2}/(4β_{ k }), i.e., \(\delta _{k-1}\beta _{k}\leq \frac {3}{4}\sigma ^{2}\), the last inequality can be upper bounded by
Hence, if δ_{k−1}≤3σ^{2}/(4β_{ k }), we have δ_{ k }≤4δ_{k−1}.
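A small numerical check of this recursion is sketched below. It assumes that both the assumed and the true covariance matrices follow the standard rank-one Gaussian posterior update with the same measurement vector a_{ k } (the recursions (18) and (19) are not reproduced in this text, so this form is an assumption), and that a_{ k } is the top eigenvector of the assumed covariance scaled to power β_{ k }. The helper `posterior_cov` and all numerical values are illustrative.

```python
import numpy as np

def posterior_cov(S, a, sigma2):
    """Rank-one Gaussian posterior covariance update for a measurement along a."""
    Sa = S @ a
    return S - np.outer(Sa, Sa) / (a @ Sa + sigma2)

sigma2, beta = 0.1, 2.0
Sigma = np.diag([2.0, 1.0, 0.5])                    # true covariance
E0 = 0.01 * np.array([[1.0, 0.5, 0.0],
                      [0.5, -1.0, 0.3],
                      [0.0, 0.3, 0.5]])             # symmetric initial mismatch
Sigma_hat = Sigma + E0                              # assumed covariance

u = np.linalg.eigh(Sigma_hat)[1][:, -1]             # top eigenvector of Sigma_hat
a = np.sqrt(beta) * u                               # so that ||a a^T|| = beta

delta0 = np.linalg.norm(E0, 2)
delta1 = np.linalg.norm(
    posterior_cov(Sigma_hat, a, sigma2) - posterior_cov(Sigma, a, sigma2), 2
)
condition_holds = delta0 <= 3 * sigma2 / (4 * beta)
```

In this instance the mismatch after one measurement stays well within the factor of 4 allowed by Lemma 7; the lemma gives a worst-case bound, not the typical growth.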
Lemma 8
[Recursion for trace of the true covariance matrix] If \(\delta _{k1}\leq \hat {\lambda }_{k}\),
Proof
Let \(\widehat {A}_{k}\triangleq {a}_{k}{a}^{\intercal }_{k}\). Using the definition of E_{ k } and the recursions (18) and (19), the perturbation matrix E_{ k } after k iterations is given by
Note that \(\text {rank}(\widehat {A}_{k})=1\), thus \(\text {rank}(\widehat {A}_{k}E_{k-1})\leq 1\); therefore, it has at most one nonzero eigenvalue,
Since E_{k−1} is symmetric and \(\widehat {A}_{k}\) is positive semidefinite, we have \( \text {tr}(E_{k-1}\widehat {A}_{k}E_{k-1})\geq 0. \) Hence, from (21) we have
After rearranging terms we obtain
Together with the recursion for \(\text {tr}(\widehat {\Sigma }_{k})\) in (7), we have
Lemma 9
For a given positive semidefinite matrix \(X \in \mathbb R^{n\times n}\), and a vector \(h\in \mathbb R^{n}\), if
then rank (X)=rank(Y).
Proof
Apparently, for all x∈ker(X), Yx=0, i.e., ker(X)⊂ker(Y). Decompose \(X = Q^{\intercal }Q\). For all x∈ ker(Y), let \(b\triangleq Qh\), \(z\triangleq Qx\). If b=0, Y=X; otherwise, when b≠0, we have
Thus,
Therefore z=0, i.e., x∈ ker(X), ker(Y)⊂ ker(X). This shows that ker(X)= ker(Y) or equivalently rank(X)=rank(Y).
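The rank-preservation property can be checked numerically. The sketch below assumes that Y has the form of the rank-one posterior update, Y = X − X h h^⊺ X/(h^⊺ X h + σ²) with σ²>0; this form is inferred from the proof and is an assumption, as are the dimensions and the random test matrices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, s, sigma2 = 8, 3, 0.5
Q = rng.standard_normal((s, n))
X = Q.T @ Q                                         # positive semidefinite, rank s
h = rng.standard_normal(n)

Xh = X @ h
Y = X - np.outer(Xh, Xh) / (h @ Xh + sigma2)        # assumed form of the update

rank_X = np.linalg.matrix_rank(X)
rank_Y = np.linalg.matrix_rank(Y)
```

With X = Q^⊺Q and b = Qh, the update equals Q^⊺(I − bb^⊺/(∥b∥²+σ²))Q, and the middle factor is positive definite whenever σ²>0, which is why the rank is unchanged.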
Proof [Proof of Theorem 2] Recall that for k=1,…,K, \(\hat {\lambda }_{k}\geq {\chi _{n, p, \varepsilon }}\). Using Lemma 7, we can show that for some 0<δ<1, if
then for the first K measurements, we have
Note that the second inequality in (22) comes from the fact that \((1/{\chi _{n, p, \varepsilon }}-1/\hat {\lambda }_{1}){\chi _{n, p, \varepsilon }}\sigma ^{2}\leq 3\sigma ^{2}\). Clearly, δ_{k−1}≤δχ_{n,p,ε}/16. Hence, (4+δ)δ_{k−1}≤δλ_{ k }. Since β_{ k }δ_{k−1}≤σ^{2} and \(\vert \lambda _{k}-\hat {\lambda }_{k}\vert \leq \delta _{k-1}\), we have \( \beta _{k}\lambda _{k}\leq \beta _{k}(\hat {\lambda }_{k}+\delta _{k-1})\leq \beta _{k}\hat {\lambda }_{k}+\sigma ^{2}. \) Thus, \( 4\delta _{k-1}\left (\beta _{k}\hat {\lambda }_{k}+\sigma ^{2}\right)+\delta \beta _{k} \lambda _{k} \delta _{k-1}\leq \delta \lambda _{k}\left (\beta _{k}\hat {\lambda }_{k}+\sigma ^{2}\right). \) Then, we have \( 3\beta _{k} \hat {\lambda }_{k} \delta _{k-1}\left (\beta _{k}\hat {\lambda }_{k}+\sigma ^{2}\right)\leq \beta _{k} \hat {\lambda }_{k}(\delta \lambda _{k} - \delta _{k-1})(\beta _{k}\hat {\lambda }_{k}+\sigma ^{2}-\beta _{k}\delta _{k-1}), \) which can be rewritten as \( \frac {3\beta _{k} \hat {\lambda }_{k}\delta _{k-1}}{\beta _{k} \hat {\lambda }_{k}+\sigma ^{2}-\beta _{k} \delta _{k-1}}\leq \frac {\beta _{k} \hat {\lambda }_{k}}{\beta _{k} \hat {\lambda }_{k}+\sigma ^{2}}(\delta \lambda _{k}-\delta _{k-1}). \) Hence, using \(-\delta _{k-1}\leq \hat {\lambda }_{k}-\lambda _{k}\), \( \frac {3\beta _{k} \hat {\lambda }_{k}\delta _{k-1}}{\beta _{k} \hat {\lambda }_{k}+\sigma ^{2}-\beta _{k} \delta _{k-1}}\leq \frac {\beta _{k} \hat {\lambda }_{k}}{\beta _{k} \hat {\lambda }_{k}+\sigma ^{2}}[(\delta -1)\lambda _{k}+\hat {\lambda }_{k}], \) which can be written as \( -\frac {\beta _{k} \hat {\lambda }_{k}^{2}}{\beta _{k}\hat {\lambda }_{k}+\sigma ^{2}}+\frac {3\beta _{k} \hat {\lambda }_{k}\delta _{k-1}}{\beta _{k} \hat {\lambda }_{k}+\sigma ^{2}-\beta _{k} \delta _{k-1}}\leq -(1-\delta)\frac {\beta _{k}\hat {\lambda }_{k}}{\beta _{k} \hat {\lambda }_{k}+\sigma ^{2}}\lambda _{k}. \) By applying Lemma 8, we have
where we have used the definition for f_{ k } in (5). Subsequently,
Lemma 9 shows that the rank of the covariance will not be changed by updating the covariance matrix sequentially: rank(Σ_{1})=⋯=rank(Σ_{ k })=s. Hence, we may decompose the covariance matrix \(\Sigma _{k}=Q Q^{\intercal }\), with \(Q\in \mathbb R^{n\times s}\) being a full-rank matrix, then \({\textsf {Vol}}(\Sigma _{k})=\text {det}(Q^{\intercal } Q).\) Since \(\text {tr}(Q^{\intercal } Q)=\text {tr}(Q Q^{\intercal })\), we have
where (1) follows from Hadamard’s inequality and (2) follows from the inequality of arithmetic and geometric means. Finally, we can bound the conditional entropy of the signal as
which leads to the desired result.
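The two inequalities used to bound the volume in this proof, Hadamard's inequality followed by the arithmetic-geometric mean inequality, can be checked on a random instance. The factor Q, its dimensions, and the variable names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, s = 8, 3
Q = rng.standard_normal((n, s))                     # full column rank almost surely
G = Q.T @ Q                                         # s x s Gram matrix Q^T Q

vol = np.linalg.det(G)                              # Vol(Sigma_k) = det(Q^T Q)
hadamard = np.prod(np.diag(G))                      # (1): det(G) <= product of diagonal
am_gm = (np.trace(G) / s) ** s                      # (2): product <= (tr(G)/s)^s
trace_match = np.isclose(np.trace(G), np.trace(Q @ Q.T))
```

Because tr(Q^⊺Q)=tr(QQ^⊺)=tr(Σ_{ k }), the final bound depends on Σ_{ k } only through its trace and rank.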
Proof [Proof of Theorem 3] Recall that rank(Σ)=s, and hence λ_{ k }=0, k=s+1,…,n. Note that for each iteration, the eigenvalue of \(\widehat {\Sigma }_{k}\) in the direction of a_{ k }, which corresponds to the largest eigenvalue of \(\widehat {\Sigma }_{k}\), is eliminated below the threshold χ_{n,p,ε}. Therefore, as long as the algorithm continues, the largest eigenvalue of \(\widehat {\Sigma }_{k}\) is exactly the (k+1)th largest eigenvalue of \(\widehat {\Sigma }\). Now, if
using Lemma 4 and Lemma 7, we have that
In the ideal case without perturbation, each measurement decreases the eigenvalue along a given eigenvector to be below χ_{n,p,ε}. Suppose in the ideal case, the algorithm terminates at K≤s iterations, which means
and the total power needed is
On the other hand, in the presence of perturbation, the algorithm will terminate using more than K iterations since with perturbation, eigenvalues of Σ that are originally below χ_{n,p,ε} may get above χ_{n,p,ε}. In this case, we will also allocate power while taking into account the perturbation:
This suffices to eliminate even the smallest eigenvalue to be below threshold χ_{n,p,ε} since
We first estimate the total amount of power used at most to eliminate eigenvalues \(\hat {\lambda }_{k}\), for K+1≤k≤s:
where we have used the fact that δ_{ s }≤4^{s}δ_{0} (a consequence of Lemma 7), the assumption (24), and monotonicity of the upper bound in s. The total power to reach precision ε in the presence of mismatch can be upper bounded by
In order to achieve precision ε and confidence level p, the extra power needed is upper bounded as
where we have again used δ_{ s }≤4^{s}δ_{0}≤4^{s}χ_{n,p,ε}/4^{s+1}=χ_{n,p,ε}/4, \(1/\hat {\lambda }_{k} - 1/\lambda _{k} \leq \delta _{0}/\lambda _{k}^{2}\), and the fact that λ_{ k }≥χ_{n,p,ε} for k=1,…,K.
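The ideal (mismatch-free) power accounting in this proof can be sketched in code. The example assumes the allocation β=σ²(1/χ−1/λ), chosen here because under the standard rank-one posterior update it sends the eigenvalue λ along the measured eigenvector to exactly the threshold χ; the function name `power_for_threshold`, the threshold value `chi`, and the diagonal covariance are hypothetical.

```python
import numpy as np

def power_for_threshold(lam, chi, sigma2):
    """Power beta solving lam*sigma2/(beta*lam + sigma2) = chi; zero if lam <= chi."""
    return max(0.0, sigma2 * (1.0 / chi - 1.0 / lam))

sigma2, chi = 0.1, 0.05
eigs0 = [2.0, 1.0, 0.3, 0.01]
Sigma = np.diag(eigs0)
total_power, num_measurements = 0.0, 0

for _ in range(Sigma.shape[0]):
    lam, U = np.linalg.eigh(Sigma)                  # ascending eigenvalues
    if lam[-1] <= chi * (1 + 1e-9):                 # all eigenvalues at or below threshold
        break
    beta = power_for_threshold(lam[-1], chi, sigma2)
    a = np.sqrt(beta) * U[:, -1]                    # measure along the top eigenvector
    Sa = Sigma @ a
    Sigma = Sigma - np.outer(Sa, Sa) / (a @ Sa + sigma2)
    total_power += beta
    num_measurements += 1

expected_power = sum(sigma2 * (1 / chi - 1 / l) for l in eigs0 if l > chi)
```

Only the eigenvalues that start above χ consume power, which is why the algorithm stops after K≤s measurements in the ideal case.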
Proof [Proof of Lemma 2] It is a direct consequence of Lemma 6. Let θ=tr(Σ)/∥Σ∥≥1. For some constant δ>0, set
Then, from Lemma 6, we have
The following Lemma is used in the proof of Lemma 3.
Lemma 10
If M, N, and L satisfy the conditions in Lemma 3, then ∥η∥_{1}≤τ with probability exceeding \(1-{2}/{n}-{2}/{\sqrt {n}}-2n\exp (-c_{1}M)\) for some universal constant c_{1}>0.
Proof
Let \(\theta \triangleq \text {tr}(\Sigma)/\Vert \Sigma \Vert \). By Chebyshev’s inequality, we have that
and
When
with the concentration inequality for Wishart distribution in Lemma 6 and plugging in the lower bound for N in (26) and the definition for τ in (15), we have
Furthermore, when L satisfies (14), we have
Therefore, ∥η∥_{1}≤τ holds with probability at least \(1-{2}/{n}-{2}/{\sqrt {n}}-2n\exp (-\sqrt {n})\).
Proof [Proof of Lemma 3] By Lemma 10 with τ=Mδ/c_{2}, the choices of M, N, and L ensure that ∥η∥_{1}≤Mδ/c_{2} with probability at least \(1-{2}/{n}-2/\sqrt {n}-2n\exp (-\sqrt {n})\). By Lemma 5 in Appendix 1 and noting that the rank of Σ is s, we have \(\Vert \widehat {\Sigma }-\Sigma \Vert _{F}\leq \delta.\) Therefore, with probability exceeding \(1-2/n-{2}/{\sqrt {n}}-2n\exp (-\sqrt {n})-\exp (-c_{0}c_{1}ns)\), we have \(\Vert \widehat {\Sigma }-\Sigma \Vert \leq \Vert \widehat {\Sigma }-\Sigma \Vert _{F}\leq \delta. \)
The proof will use the following two lemmas.
Lemma 11
[Moment generating function of multivariate Gaussian [40]] Assume \(X\sim \mathcal N(0,\Sigma)\). The moment generating function of \(\Vert X\Vert _{2}^{2}\) is \( \mathbb E[e^{s\Vert X \Vert _{2}^{2}}]=1/\sqrt {\det (I-2s\Sigma)} \) for all s such that I−2sΣ is positive definite.
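For the squared norm of a zero-mean Gaussian vector, the standard closed form of the moment generating function is det(I−2sΣ)^{−1/2}, valid when I−2sΣ is positive definite. The Monte Carlo sketch below compares this closed form with an empirical average; the 2×2 covariance, the value of s, and the sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
s = -0.5                                            # I - 2 s Sigma = I + Sigma is positive definite
X = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)

# Empirical E[exp(s * ||X||^2)] versus the closed form det(I - 2 s Sigma)^(-1/2).
mc = np.exp(s * np.sum(X ** 2, axis=1)).mean()
closed_form = 1.0 / np.sqrt(np.linalg.det(np.eye(2) - 2 * s * Sigma))
```

A negative s keeps the integrand bounded in (0, 1], so the Monte Carlo average converges quickly.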
Note that ϱ_{ k } can be computed recursively; we derive the recursion as follows. Let \(z_{k} \triangleq a_{k}^{\intercal }(x-\mu _{k-1}) + w_{k} = y_{k} - a_{k}^{\intercal } \mu _{k-1}\). Also let \(\varrho _{k} \triangleq a^{\intercal }(\hat {\mu }_{k}-\mu _{k})\). Note that \(\varrho _{k} = a^{\intercal } \xi _{k}\) for \(\xi _{k} = \hat {\mu }_{k}-\mu _{k}\) in (16). Based on the recursion for ξ_{ k } in (16) that we derived earlier, we have
and
Proof [Proof of Lemma 1] The recursion of the diagonal entries can be written as
Note that for i=j_{k−1},
and for i≠j_{k−1},
Therefore,
Proof [Proof of Theorem 4] Let \(\varepsilon \geq \sqrt {\Vert \Sigma _{K}\Vert \cdot \chi _{n}^{2}(p)}\), i.e., ∥Σ_{ K }∥≤χ_{n,p,ε}. Then, Theorem 4 follows from
This says that if ∥Σ_{ K }∥≤χ_{n,p,ε}, i.e., if (27) holds, then \(\Vert \hat {x} - x \Vert \leq \varepsilon \) with probability at least p. From Lemma 1, we have that when the powers β_{ i } are sufficiently large
Hence, for (27) to hold, we can simply require \(\left (1-\frac {1}{n(1+\gamma)}\right)^{K}\text {tr}(\Sigma) \leq {\chi _{n, p, \varepsilon }}\), or equivalently (11) in Theorem 4.
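The sufficient condition (1−1/(n(1+γ)))^{K} tr(Σ) ≤ χ_{n,p,ε} can be solved for the smallest integer K by taking logarithms. The helper below is a minimal sketch of that calculation; the function name `measurements_needed` and the example values are assumptions for illustration, and it requires n(1+γ)>1 so that the contraction rate lies in (0,1).

```python
import math

def measurements_needed(trace_sigma, chi, n, gamma):
    """Smallest integer K with (1 - 1/(n(1+gamma)))^K * trace_sigma <= chi."""
    if trace_sigma <= chi:
        return 0
    rate = 1.0 - 1.0 / (n * (1.0 + gamma))          # per-measurement contraction factor
    return math.ceil(math.log(chi / trace_sigma) / math.log(rate))

# Illustrative values only.
n, gamma, trace_sigma, chi = 10, 0.5, 10.0, 0.1
K = measurements_needed(trace_sigma, chi, n, gamma)
rate = 1.0 - 1.0 / (n * (1.0 + gamma))
```

The required K grows linearly in n(1+γ) and logarithmically in the ratio tr(Σ)/χ_{n,p,ε}.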
Notes
 1.
Our sketching scheme is slightly different from that used in [25] because we would like to use the square of the noisy linear measurements \(y_{i}^{2}\) (whereas the measurement scheme in [25] has a slightly different noise model). In practice, this means that we may use the same measurement scheme in the first stage as training to initialize the sample covariance matrix.
Abbreviations
 GMM: Gaussian mixture models
 NDT: Nondestructive testing
References
 1
A Ashok, P Baheti, MA Neifeld, Compressive imaging system design using taskspecific information. Appl. Opt. 47(25), 4457–4471 (2008).
 2
J Ke, A Ashok, M Neifeld, Object reconstruction from adaptive compressive measurements in featurespecific imaging. Appl. Opt. 49(34), 27–39 (2010).
 3
A Ashok, MA Neifeld, Compressive imaging: hybrid measurement basis design. J. Opt. Soc. Am. A. 28(6), 1041–1050 (2011).
 4
W Boonsong, W Ismail, Wireless monitoring of household electrical power meter using embedded RFID with wireless sensor network platform. Int. J. Distrib. Sens. Networks. 2014(876914), 10 (2014).
 5
B Zhang, X Cheng, N Zhang, Y Cui, Y Li, Q Liang, in Sparse Target Counting and Localization in Sensor Networks Based on Compressive Sensing. IEEE Int. Conf. Computer Communications (INFOCOM), (2014), pp. 2255–2258.
 6
G Braun, S Pokutta, Y Xie, Infogreedy sequential adaptive compressed sensing. IEEE J. Sel. Top. Signal Proc. 9(4), 601–611 (2015).
 7
J Haupt, R Nowak, R Castro, in Adaptive Sensing for Sparse Signal Recovery. IEEE 13th Digital Signal Processing Workshop and 5th IEEE Signal Processing Education Workshop (DSP/SPE), (2009), pp. 702–707.
 8
A Tajer, HV Poor, Quick search for rare events. IEEE Transactions on Information Theory. 59(7), 4462–4481 (2013).
 9
D Malioutov, S Sanghavi, A Willsky, Sequential compressed sensing. IEEE J. Sel. Topics Sig. Proc.4(2), 435–444 (2010).
 10
J Haupt, R Baraniuk, R Castro, R Nowak, in Sequentially Designed Compressed Sensing. Proc. IEEE/SP Workshop on Statistical Signal Processing, (2012).
 11
A Krishnamurthy, J Sharpnack, A Singh, in Recovering Graphstructured Activations Using Adaptive Compressive Measurements. Annual Asilomar Conference on Signals, Systems, and Computers, (2013).
 12
J Haupt, R Castro, R Nowak, in International Conference on Artificial Intelligence and Statistics. Distilled sensing: Selective sampling for sparse signal recovery, (2009), pp. 216–223.
 13
MA Davenport, E AriasCastro, in Information Theory Proceedings (ISIT), 2012 IEEE International Symposium On. Compressive binary search, (2012), pp. 1827–1831.
 14
ML Malloy, RD Nowak, Nearoptimal adaptive compressed sensing. IEEE Trans. Inf. Theory. 60(7), 4001–4012 (2014).
 15
S Jain, A Soni, J Haupt, in Signals, Systems and Computers, 2013 Asilomar Conference On. Compressive measurement designs for estimating structured signals in structured clutter: a Bayesian experimental design approach, (2013), pp. 163–167.
 16
E Tanczos, R Castro, Adaptive sensing for estimation of structure sparse signals. arXiv:1311.7118 (2013).
 17
A Soni, J Haupt, On the fundamental limits of recovering tree sparse vectors from noisy linear measurements. IEEE Trans. Info. Theory. 60(1), 133–149 (2014).
 18
HS Chang, Y Weiss, WT Freeman, Informative sensing. arXiv preprint arXiv:0901.4275 (2009).
 19
S Ji, Y Xue, L Carin, Bayesian compressive sensing. IEEE Trans. Sig. Proc. 56(6), 2346–2356 (2008).
 20
JM DuarteCarvajalino, G Yu, L Carin, G Sapiro, Taskdriven adaptive statistical compressive sensing of gaussian mixture models. IEEE Trans. Signal Process. 61(3), 585–600 (2013).
 21
W Carson, M Chen, R Calderbank, L Carin, Communication inspired projection design with application to compressive sensing. SIAM J. Imaging Sci (2012).
 22
DJC MacKay, Information based objective functions for active data selection. Comput. Neural Syst. 4(4), 589–603 (1992).
 23
G Dasarathy, P Shah, BN Bhaskar, R Nowak, in Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference On. Covariance Sketching, (2012).
 24
G Dasarathy, P Shah, BN Bhaskar, R Nowak, Sketching sparse matrices. ArXiv ID:1303.6544 (2013).
 25
Y Chen, Y Chi, AJ Goldsmith, Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Trans. Inf. Theory. 61(7), 4034–4059 (2015).
 26
C Hellier, Handbook of Nondestructive Evaluation (McGrawHill, 2003).
 27
P Schniter, in Computational Advances in MultiSensor Adaptive Processing (CAMSAP), 2011 4th IEEE International Workshop On. Exploiting structured sparsity in bayesian experimental design, (2011), pp. 357–360.
 28
G Yu, G Sapiro, Statistical compressed sensing of Gaussian mixture models. IEEE Trans. Signal Process. 59(12), 5842–5858 (2011).
 29
Y Li, Y Chi, C Huang, L Dolecek, Orthogonal matching pursuit on faulty circuits. IEEE Transactions on Communications. 63(7), 2541–2554 (2015).
 30
H Robbins, in Herbert Robbins Selected Papers. Some aspects of the sequential design of experiments (Springer, 1985), pp. 169–177.
 31
CF Wu, M Hamada, Experiments: Planning, Analysis, and Optimization, vol. 552 (Wiley, 2011).
 32
R Gramacy, D Apley, Local Gaussian process approximation for large computer experiments. J. Comput. Graph. Stat. (justaccepted):, 1–28 (2014).
 33
D Palomar, S Verdú, Gradient of mutual information in linear vector Gaussian channels. IEEE Trans. Info. Theory. 52, 141–154 (2006).
 34
M Payaró, DP Palomar, Hessian and concavity of mutual information, entropy, and entropy power in linear vector Gaussian channels. IEEE Trans. Info. Theory, 3613–3628 (2009).
 35
DJ Brady, Optical Imaging and Spectroscopy (WileyOSA, 2009).
 36
M Grant, S Boyd, CVX: Matlab Software for Disciplined Convex Programming, version 2.1 (2014). http://cvxr.com/cvx.
 37
M Grant, S Boyd, in Recent Advances in Learning and Control. Lecture Notes in Control and Information Sciences, ed. by V Blondel, S Boyd, and H Kimura. Graph implementations for nonsmooth convex programs (Springer, 2008), pp. 95–110.
 38
GW Stewart, JG Sun, Matrix Perturbation Theory (Academic Press, Inc., 1990).
 39
S Zhu, A short note on the tail bound of Wishart distribution. arXiv:1212.5860 (2012).
 40
T Vincent, L Tenorio, M Wakin, Concentration of measure: fundamentals and tools. Lect. Notes Rice Univ (2015).
Acknowledgements
We would like to acknowledge Tsinghua University for supporting Ruiyang Song while he visited at Georgia Institute of Technology.
Funding
This work is partially supported by an NSF CAREER Award CMMI1452463, an NSF grant CCF1442635 and an NSF grant CMMI1538746. Ruiyang Song was visiting the H. Milton Stewart School of Industrial and Systems Engineering at the Georgia Institute of Technology while working on this paper.
Author information
Contributions
We consider a class of mutual information maximizationbased algorithms which are called InfoGreedy algorithms, and we present a rigorous performance analysis for these algorithms in the presence of model parameter mismatch. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Yao Xie.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional information
Availability of data and materials
The Georgia Tech campus image is available at www2.isye.gatech.edu/~yxie77/campus.mat and the data for solar flare image is at www2.isye.gatech.edu/~yxie77/data_193.mat.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Keywords
 Sequential compressed sensing
 Adaptive sensing
 Mutual information
 Model mismatch