Denoising by Sparse Approximation: Error Bounds Based on Rate-Distortion Theory

If a signal x is known to have a sparse representation with respect to a frame, it can be estimated from a noise-corrupted observation y by finding the best sparse approximation to y. Removing noise in this manner depends on the frame efficiently representing the signal while it inefficiently represents the noise. The mean-squared error (MSE) of this denoising scheme and the probability that the estimate has the same sparsity pattern as the original signal are analyzed. First, an MSE bound that depends on a new bound on approximating a Gaussian signal as a linear combination of elements of an overcomplete dictionary is given. Further analyses are for dictionaries generated randomly according to a spherically symmetric distribution and signals expressible with single dictionary elements. Easily computed approximations for the probability of selecting the correct dictionary element and for the MSE are given. Asymptotic expressions reveal a critical input signal-to-noise ratio for signal recovery.


INTRODUCTION
Estimating a signal from a noise-corrupted observation of the signal is a recurring task in science and engineering. This paper explores the limits of estimation performance in the case where the only a priori structure on the signal x ∈ R^N is that it has known sparsity K with respect to a given set of vectors Φ = {ϕ_i}_{i=1}^M ⊂ R^N. The set Φ is called a dictionary and is generally a frame [1, 2]. Sparsity of K with respect to Φ means that the signal x lies in the set

Φ_K = { Σ_{i=1}^M α_i ϕ_i : at most K of the α_i's are nonzero }.    (1)

In many areas of computation, exploiting sparsity is motivated by reduction in complexity [3]; if K ≪ N, then certain computations may be made more efficiently on α than on x. In compression, representing a signal exactly or approximately by a member of Φ_K is a common first step in efficiently representing the signal, though much more is known when Φ is a basis or union of wavelet bases than is known in the general case [4]. Of more direct interest here is that sparsity models are becoming prevalent in estimation problems; see, for example, [5, 6].
The parameters of dimension N, dictionary size M, and sparsity K determine the importance of the sparsity model. Representative illustrations of Φ_K are given in Figure 1. With dimension N = 2, sparsity of K = 1 with respect to a dictionary of size M = 3 indicates that x lies on one of three lines, as shown in Figure 1a. This is a restrictive model, even if there is some approximation error in (1). When M is increased, the model stops seeming restrictive, even though the set of possible values for x has measure zero in R². The reason is that unless the dictionary has gaps, all of R² is nearly covered. This paper presents progress in explaining the value of a sparsity model for signal denoising as a function of (N, M, K).

Denoising by sparse approximation with a frame
Consider the problem of estimating a signal x ∈ R^N from the noisy observation y = x + d, where d ∈ R^N has the i.i.d. Gaussian N(0, σ²I_N) distribution. Suppose we know that x lies in a given K-dimensional subspace of R^N. Then projecting y onto the given subspace would remove a fraction of the noise without affecting the signal component. Denoting the projection operator by P, we would have

Py = Px + Pd = x + Pd,

and Pd has only a K/N fraction of the power of d.
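As a quick numerical illustration of this fact, the sketch below projects Gaussian noise onto a random K-dimensional subspace and checks that roughly a K/N fraction of the noise power survives. The problem sizes (N = 64, K = 4) are arbitrary choices for illustration, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, sigma = 64, 4, 0.5

# Orthonormal basis for a random K-dimensional subspace of R^N.
A = rng.standard_normal((N, K))
Q, _ = np.linalg.qr(A)          # columns of Q span the subspace
P = Q @ Q.T                     # projection operator onto the subspace

x = Q @ rng.standard_normal(K)  # true signal lies in the subspace, so Px = x
d = sigma * rng.standard_normal(N)
y = x + d

x_hat = P @ y                   # = x + Pd: the signal component is unchanged

# On average, ||Pd||^2 is about (K/N)||d||^2: projection keeps K/N of the noise.
trials = 20000
noise = sigma * rng.standard_normal((trials, N))
ratio = np.mean(np.sum((noise @ P.T) ** 2, axis=1)) / np.mean(np.sum(noise ** 2, axis=1))
print(ratio)  # close to K/N = 0.0625
```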
In this paper, we consider the more general signal model x ∈ Φ_K. The set Φ_K defined in (1) is the union of at most J = (M choose K) subspaces of dimension K. We henceforth assume that M > K (thus J > 1); if not, the model reduces to the classical case of knowing a single subspace that contains x. The distribution of x, if available, could also be exploited to remove noise. However, in this paper the denoising operation is based only on the geometry of the signal model Φ_K and the distribution of d.
With the addition of the noise d, the observed vector y will (almost surely) not be represented sparsely, that is, it will not be in Φ_K. Intuitively, a good estimate for x is the point in Φ_K that is closest to y in Euclidean distance. Formally, because the probability density function of d is a strictly decreasing function of ‖d‖₂, this is the maximum-likelihood estimate of x given y. The estimate is obtained by applying an optimal sparse approximation procedure to y. We will write

x̂_SA = arg min over x̂ ∈ Φ_K of ‖y − x̂‖₂    (3)

for this estimate and call it the optimal K-term approximation of y. Henceforth, we omit the subscript 2 indicating the Euclidean norm.
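The optimal K-term approximation can be computed by exhaustive search over all K-element subsets of the dictionary, projecting onto the span of each. The sketch below illustrates the idea; the helper name `sparse_approx` and the problem sizes are ours, and the search is exponential in K, so it is feasible only for tiny problems.

```python
import itertools
import numpy as np

def sparse_approx(y, Phi, K):
    """Optimal K-term approximation of y w.r.t. dictionary Phi (columns).

    Exhaustively searches all K-element subsets, projecting y onto the
    span of each and keeping the projection closest to y.
    """
    N, M = Phi.shape
    best, best_err = None, np.inf
    for idx in itertools.combinations(range(M), K):
        B = Phi[:, idx]
        # Least-squares coefficients give the projection of y onto span(B).
        coef, *_ = np.linalg.lstsq(B, y, rcond=None)
        proj = B @ coef
        err = np.sum((y - proj) ** 2)
        if err < best_err:
            best, best_err = proj, err
    return best

rng = np.random.default_rng(1)
N, M, K, sigma = 4, 8, 1, 0.1
Phi = rng.standard_normal((N, M))
Phi /= np.linalg.norm(Phi, axis=0)      # unit-norm dictionary elements

x = 2.0 * Phi[:, 3]                     # exact 1-term representation
y = x + sigma * rng.standard_normal(N)
x_sa = sparse_approx(y, Phi, K)
print(np.linalg.norm(x - x_sa), np.linalg.norm(x - y))
```

Because x itself lies in Φ_K, the returned approximation is always at least as close to y as x is, by definition of the minimization.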
The main results of this paper are bounds on the per-component mean-squared estimation error (1/N)E[‖x − x̂_SA‖²] for denoising via sparse approximation. These bounds depend on (N, M, K) but avoid further dependence on the dictionary Φ (such as the coherence of Φ); some results hold for all Φ and others are for randomly generated Φ.
To the best of our knowledge, the results differ from any in the literature in several ways.
(a) We study mean-squared estimation error for additive Gaussian noise, which is a standard approach to performance analysis in signal processing. In contrast, analyses such as [7] impose a deterministic bound on the norm of the noise.

(b) We concentrate on dependence solely on dictionary size rather than on more fine-grained properties of the dictionary. In particular, most signal recovery results in the literature are based on the noise being bounded above by a function of the coherence of the dictionary [8–14].

(c) Some of our results are for spherically symmetric random dictionaries. The series of papers [15–17] is superficially related because of randomness, but in those papers the signals of interest are sparse with respect to a single known, orthogonal basis and the observations are random inner products. The natural questions there include the number of measurements needed to robustly recover the signal.

(d) We use source-coding thought experiments in bounding estimation performance. This technique may be useful in answering other related questions, especially in sparse approximation source coding.
Our preliminary results were first presented in [18], with further details in [19, 20]. Probability-of-error results in a rather different framework for basis pursuit appear in a manuscript submitted while this paper was under review [21].

Connections to approximation
A signal with an exact K-term representation might arise because it was generated synthetically, for example, by a compression system. A more likely situation in practice is that there is an underlying true signal x that has a good K-term approximation rather than an exact K-term representation. At the very least, this is the goal in designing the dictionary Φ for a signal class of interest. It is then still reasonable to compute (3) to estimate x from y, but there are tradeoffs in the selections of K and M. Let f_{M,K} denote the squared Euclidean approximation error of the optimal K-term approximation using an M-element dictionary. It is obvious that f_{M,K} decreases with increasing K, and with suitably designed dictionaries, it also decreases with increasing M. One concern of approximation theory is to study the decay of f_{M,K} precisely. (For this, we should consider N very large or infinite.) For piecewise smooth signals, for example, wavelet frames give exponential decay with K [4, 22, 23].
When one uses sparse approximation to denoise, the performance depends on both the ability to approximate x and the ability to reject the noise. Approximation is improved by increasing M and K, but noise rejection is diminished. The dependence on K is clear, as the fraction of the original noise that remains on average is at least K/N. For the dependence on M, note that increasing M increases the number of subspaces and thus increases the chance that the selected subspace is not the best one for approximating x. Loosely, when M is very large and the dictionary elements are not too unevenly spread, there is some subspace very close to y, and thus x̂_SA ≈ y. This was illustrated in Figure 1.
Fortunately, there are many classes of signals for which M need not grow too quickly as a function of N to get good sparse approximations. Examples of dictionaries with good computational properties that efficiently represent audio signals were given by Goodwin [24]. For iterative design procedures, see the papers by Engan et al. [25] and Tropp et al. [26].
One initial motivation for this work was to give guidance for the selection of M. This requires combining approximation results (e.g., bounds on f_{M,K}) with results such as ours. The results presented here do not address approximation quality.

Related work
Computing optimal K-term approximations is generally a difficult problem. Given ε ∈ R₊ and K ∈ Z₊, determining whether there exists a K-term approximation x̂ such that ‖x − x̂‖ ≤ ε is an NP-complete problem [27, 28]. This computational intractability of optimal sparse approximation has prompted the study of heuristics. A greedy heuristic that is standard for finding sparse approximate solutions to linear equations [29] has been known as matching pursuit in the signal processing literature since the work of Mallat and Zhang [30]. Also, Chen et al. [31] proposed a convex relaxation of the approximation problem (3) called basis pursuit.
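For concreteness, here is a minimal sketch of the matching pursuit heuristic mentioned above; the dimensions are arbitrary, and this is offered only as background, since the paper analyzes optimal sparse approximation rather than this greedy algorithm.

```python
import numpy as np

def matching_pursuit(y, Phi, K):
    """Greedy K-term approximation: repeatedly pick the dictionary element
    most correlated with the residual. Phi must have unit-norm columns."""
    residual = y.copy()
    approx = np.zeros_like(y)
    for _ in range(K):
        corr = Phi.T @ residual
        m = int(np.argmax(np.abs(corr)))         # best-matching element
        approx += corr[m] * Phi[:, m]
        residual -= corr[m] * Phi[:, m]          # residual norm never grows
    return approx

rng = np.random.default_rng(2)
N, M, K = 16, 40, 3
Phi = rng.standard_normal((N, M))
Phi /= np.linalg.norm(Phi, axis=0)
y = rng.standard_normal(N)
y_hat = matching_pursuit(y, Phi, K)
print(np.linalg.norm(y - y_hat))
```

Each iteration subtracts the projection of the residual onto one atom, so the residual norm is non-increasing, but the final result is generally not the optimal K-term approximation.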
Two related discoveries have touched off a flurry of recent research.
(a) Stability of sparsity. Under certain conditions, the positions of the nonzero entries in a sparse representation of a signal are stable: applying optimal sparse approximation to a noisy observation of the signal will give a coefficient vector with the original support. Typical results are upper bounds (functions of the norm of the signal and the coherence of the dictionary) on the norm of the noise that allow a guarantee of stability [7–10, 32].

(b) Effectiveness of heuristics. Both basis pursuit and matching pursuit are able to find optimal sparse approximations under certain conditions on the dictionary and the sparsity of the signal [7, 9, 12, 14, 33, 34].
In contrast, in this paper we consider noise with unbounded support, and thus a positive probability of failing to satisfy a sufficient condition for stability as in (a) above; and we do not address algorithmic issues in finding sparse approximations. It bears repeating that finding optimal sparse approximations is presumably computationally intractable except in the cases where a greedy algorithm or convex relaxation happens to succeed. Our results are thus bounds on the performance of the algorithms that one would probably use in practice.
Denoising by finding a sparse approximation is similar to the concept of denoising by compression popularized by Saito [35] and Natarajan [36]. More recent works in this area include those by Krim et al. [37], Chang et al. [38], and Liu and Moulin [39]. All of these works use bases rather than frames. To put the present work into a similar framework would require a "rate" penalty for redundancy. Instead, the only penalty for redundancy comes from choosing a subspace that does not contain the true signal ("overfitting" or "fitting the noise"). The literature on compression with frames notably includes [40–44].
This paper uses quantization and rate-distortion theory only as a proof technique; there are no encoding rates because the problem is purely one of estimation. However, the "negative" results on representing white Gaussian signals with frames presented here should be contrasted with the "positive" encoding results of Goyal et al. [42]. The positive results of [42] are limited to low rates (and hence signal-to-noise ratios that are usually uninteresting). A natural extension of the present work is to derive negative results for encoding. This would support the assertion that frames in compression are useful not universally, but only when they can be designed to yield very good sparseness for the signal class of interest.

Preview of results and outline
To motivate the paper, we present a set of numerical results from Monte Carlo simulations that qualitatively reflect our main results. In these experiments, N, M, and K are small because of the high complexity of computing optimal approximations and because a large number of independent trials are needed to achieve adequate precision. Each data point shown is the average of 100 000 trials.
Consider a true signal x ∈ R⁴ (N = 4) that has an exact 1-term representation (K = 1) with respect to an M-element dictionary Φ. We observe y = x + d with d ∼ N(0, σ²I₄) and compute the estimate x̂_SA from (3). The signal is generated with unit norm so that the signal-to-noise ratio (SNR) is 1/σ², or −10 log₁₀ σ² dB. Throughout, we use the following definition for the mean-squared error:

MSE = (1/N) E[‖x − x̂_SA‖²].

To have tunable M, we used dictionaries of M maximally separated unit vectors in R^N, where separation is measured by the minimum pairwise angle among the vectors and their negations. These are cases of Grassmannian packings [45, 46] in the simplest case of packing one-dimensional subspaces (lines). We used packings tabulated by Sloane et al. [47].
Figure 2 shows the MSE as a function of σ for several values of M; note that for visual clarity, MSE/σ² is plotted. All of the same properties are illustrated for K = 2 in Figure 3. For small values of σ, the MSE is (1/4)σ². This is an example of the general statement that MSE → (K/N)σ² as σ → 0, as described in detail in Section 2.

Figure 2: Performance of denoising by sparse approximation when the true signal x ∈ R⁴ has an exact 1-term representation with respect to a dictionary that is an optimal M-element Grassmannian packing.

Figure 3: Performance of denoising by sparse approximation when the true signal x ∈ R⁴ has an exact 2-term representation with respect to a dictionary that is an optimal M-element Grassmannian packing.

For large values of σ, the scaled MSE approaches a constant value:

lim over σ → ∞ of MSE/σ² = g_{K,M},

where g_{K,M} is a slowly increasing function of M and lim over M → ∞ of g_{K,M} = 1. This limiting value makes sense because in the limit, x̂_SA ≈ y = x + d and each component of d has variance σ²; the denoising does not do anything. The characterization of the dependence of g_{K,M} on K and M is the main contribution of Section 3.

Another apparent pattern in Figure 2 that we would like to explain is the transition between low- and high-SNR behavior. The transition occurs at smaller values of σ for larger values of M. Also, MSE/σ² can exceed 1, so the sparse approximation procedure can in fact increase the noise. We are not able to characterize the transition well for general frames. However, in Section 4 we obtain results for large frames that are generated by choosing vectors uniformly at random from the unit sphere in R^N. There, we get a sharp transition between low- and high-SNR behavior.

PRELIMINARY COMPUTATIONS
Recall from the introduction that we are estimating a signal x ∈ Φ_K ⊂ R^N from an observation y = x + d, where d ∼ N(0, σ²I_N). Φ_K was defined in (1) as the set of vectors that can be represented as a linear combination of K vectors from Φ = {ϕ_m}_{m=1}^M. We are studying the performance of the estimator

x̂_SA = arg min over x̂ ∈ Φ_K of ‖y − x̂‖.

This estimator is the maximum-likelihood estimator of x in this scenario, in which d has a Gaussian density and the estimator has no probabilistic prior information on x. The subscript SA denotes "sparse approximation" because the estimate is obtained by finding the optimal sparse approximation of y. There are values of y for which x̂_SA is not uniquely defined; these collectively have probability zero, and we ignore them.

Finding x̂_SA can be viewed as a two-step procedure: first, find the subspace spanned by K elements of Φ that contains x̂_SA; then, project y onto that subspace. The identification of a subspace and the orthogonality of y − x̂_SA to that subspace will be used in our analyses. Let P_K = {P_i}_i be the set of projections onto subspaces spanned by K of the M vectors in Φ. Then P_K has at most J = (M choose K) elements, and the estimate of interest is given by

x̂_SA = P_T y,    (8)

where T is the index of the projection in P_K that brings y closest to Φ_K. The distribution of the error x − x̂_SA and the average performance of the estimator both depend on the true signal x.
Where there is no distribution on x, the performance measure analyzed here is the conditional MSE

e(x) = (1/N) E[‖x − x̂_SA‖² | x];    (9)

one could say that showing the conditioning in (9) is merely for emphasis.
In the case that T is independent of d, the projection in (8) is to a fixed K-dimensional subspace, so

e(x) = (K/N) σ².    (10)

This occurs when M = K (there is just one element in P_K) or in the limit of high SNR (small σ²). In the latter case, the subspace selection is determined by x, unperturbed by d.

RATE-DISTORTION ANALYSIS AND LOW-SNR BOUND
In this section, we establish bounds on the performance of sparse approximation denoising that apply for any dictionary Φ.One such bound qualitatively explains the low-SNR performance shown in Figures 2 and 3, that is, the right-hand side asymptotes in these plots.
The denoising bound depends on a performance bound for sparse approximation signal representation developed in Section 3.1.The signal representation bound is empirically evaluated in Section 3.2 and then related to low-SNR denoising in Section 3.3.We will also discuss the difficulties in extending this bound for moderate SNR.To obtain interesting results for moderate SNR, we consider randomly generated Φ's in Section 4.

Sparse approximation of a Gaussian source
Before addressing the denoising performance of sparse approximation, we give an approximation result for Gaussian signals. This result is a lower bound on the MSE when sparsely approximating a Gaussian signal; it is the basis for an upper bound on the MSE for denoising when the SNR is low. These bounds are in terms of the problem size parameters (M, N, K).

Theorem 1. Let Φ be an M-element dictionary, and let v ∼ N(v̄, σ²I_N). If v̂ is the optimal K-sparse approximation of v with respect to Φ, then

E[‖v − v̂‖²] ≥ c₁ Nσ²,    (11)

where c₁ depends only on (M, N, K). For v̄ = 0, a stronger bound also holds.
The proof follows from Theorem 2; see Appendix A.
Remarks. (i) Theorem 1 shows that for any Φ, there is an approximation error lower bound that depends only on the frame size M, the dimension of the signal N, and the dimension of the signal model K.
(ii) As M → ∞ with K and N fixed, c₁ → 0. This is consistent with the fact that it is possible to drive the approximation error to zero by letting the dictionary grow.
(iii) The decay of c₁ as M increases is slow. To see this, define a sparsity measure α = K/N and a redundancy factor ρ = M/N. Using a standard approximation for the binomial coefficient (see, e.g., [48, page 530]), we can compute the limit of c₁ as the problem size grows. The decay of the lower bound in (11) as ρ is increased behaves as ρ^{−2α/(1−α)}. This is slow when α is small.
The theorem below strengthens Theorem 1 by adding a dependence on the entropy of the subspace selection random variable T to the problem size parameters (M, N, K). The entropy of T is defined as

H(T) = −Σ_i p_T(i) log₂ p_T(i),

where p_T(i) is the probability mass function of T.
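The entropy H(T) can be computed directly from the probability mass function; a small sketch follows (the helper name `entropy_bits` is ours).

```python
import numpy as np

def entropy_bits(p):
    """Entropy H(T) = -sum_i p(i) log2 p(i) of a pmf, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                  # use the convention 0 log 0 = 0
    return float(-np.sum(p * np.log2(p)))

# Uniform selection over J subspaces maximizes the entropy at log2(J).
J = 8
print(entropy_bits(np.full(J, 1 / J)))   # 3.0 bits
print(entropy_bits([0.5, 0.25, 0.25]))   # 1.5 bits
```

Uniform subspace selection, which arises for the most symmetric signal distributions, gives the maximum entropy log₂ J.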
Theorem 2. Let Φ be an M-element dictionary, and let v ∼ N(v̄, σ²I_N). If v̂ is the optimal K-sparse approximation of v with respect to Φ and T is the index of the subspace that contains v̂, then

E[‖v − v̂‖²] ≥ c₂ Nσ²,

where c₂ depends on (M, N, K) and the entropy H(T). For v̄ = 0, a stronger bound also holds.
For the proof, see Appendix A.

Empirical evaluation of approximation error bounds
The bound in Theorem 1 does not depend on any characteristics of the dictionary other than M and N. Thus it will be nearest to tight when the dictionary is well suited to representing the Gaussian signal v. That the expression (11) is not just a bound but also a useful approximation is supported by the Monte Carlo simulations described in this section.
To empirically evaluate the tightness of the bound, we compare it to the MSE obtained with Grassmannian frames and certain random frames. The Grassmannian frames are from the same tabulation described in Section 1.4 [47]. The random frames are generated by choosing M vectors uniformly at random from the surface of a unit sphere. One such vector can be generated, for example, by drawing an i.i.d. Gaussian vector and normalizing it.
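This generation procedure can be sketched in a few lines (the helper name `random_frame` is ours):

```python
import numpy as np

def random_frame(N, M, rng):
    """M vectors drawn uniformly at random from the unit sphere in R^N:
    draw i.i.d. Gaussian vectors and normalize each to unit length."""
    V = rng.standard_normal((N, M))
    return V / np.linalg.norm(V, axis=0)

rng = np.random.default_rng(3)
Phi = random_frame(4, 100, rng)
print(Phi.shape)                                       # (4, 100)
print(np.allclose(np.linalg.norm(Phi, axis=0), 1.0))   # all columns unit norm
```

Normalized i.i.d. Gaussian vectors are uniform on the sphere because the Gaussian density is spherically symmetric.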
Figure 4 shows comparisons between the bound in Theorem 1 and the simulated approximation errors as a function of M for several values of N and K. All of the simulations use v̄ = 0; it is for v̄ = 0 that T is the closest to being uniformly distributed, and hence the bound is the tightest. Each of parts (a)–(c) covers a single value of N and combines K = 1 and K = 2. Part (d) shows results for N = 10 and N = 100 with K = 1. In all cases, the bound holds and gives a qualitative match to the dependence of the approximation error on K and M. In particular, the slopes on these log-log plots correspond to the decay as a function of ρ discussed in Remark (iii). We also find that the difference in approximation error between using a Grassmannian frame and a random frame is small.

Bounds on denoising MSE
We now return to the analysis of the performance of sparse approximation denoising as defined in Section 2. We wish to bound the estimation error e(x) for a given signal x and frame Φ.
To create an analogy between the approximation problem considered in Section 3.1 and the denoising problem, let v̄ = x, v − v̄ = d, and v = y. These correspondences fit perfectly, since d ∼ N(0, σ²I_N) and we apply sparse approximation to y to get x̂_SA. Theorem 2 then gives a lower bound on the approximation error, with c₂ defined as before. As illustrated in Figure 5, it is as if we are attempting to represent d by sparse approximation and we obtain d̂ = x̂_SA − x. The quantity we are interested in is e(x) = (1/N)E[‖d̂‖² | x]. In the case that x and x̂_SA are in the same subspace, d̂ is the projection of d onto that subspace, and the bounds above immediately give an upper bound on e(x).
The interesting case is when x and x̂_SA are not necessarily in the same subspace. Recalling that T is the index of the subspace selected in sparse approximation, orthogonally decompose d as d = d_T ⊕ d_{T⊥}, with d_T in the selected subspace, and similarly decompose d̂. Then d̂_T = d_T, and the expected squared norm of this component can be bounded above as in the previous paragraph. Unfortunately, ‖d̂_{T⊥}‖ can be larger than ‖d_{T⊥}‖ in proportion to ‖x‖, as illustrated in Figure 5. The worst case is ‖d̂_{T⊥}‖ = 2‖d_{T⊥}‖, which occurs when y lies equidistant from the subspace of x and the subspace of x̂_SA.
From this analysis, we obtain the weak bound (21) and the limiting low-SNR bound (22).

ANALYSIS FOR ISOTROPIC RANDOM FRAMES
In general, the performance of sparse approximation denoising is given by an integral over the density f(·) of the noise d. While this expression does not give any fresh insight, it does remind us that the performance depends on every element of Φ. In this section, we improve greatly upon (21) with an analysis that depends on each dictionary element being an independent random vector and on the dictionary being large. The results are expectations over both the noise d and the dictionary itself. In addition to analyzing the MSE, we also analyze the probability of error in the subspace selection, that is, the probability that x and x̂_SA lie in different subspaces. In light of the simulations in Section 3.2, we expect these analyses to qualitatively match the performance of a variety of dictionaries. Section 4.1 delineates the additional assumptions made in this section. The probability of error and MSE analyses are then given in Section 4.2. Estimates of the probability of error and MSE are numerically validated in Section 4.3, and finally limits as N → ∞ are studied in Section 4.4.

Modeling assumptions
This section specifies the precise modeling assumptions in analyzing denoising performance with large, isotropic, random frames. Though the results are limited to the case of K = 1, the model is described for general K. Difficulties in extending the results to general K are described in the concluding comments of the paper. While many practical problems involve K > 1, the analysis of the K = 1 case presented here illustrates a number of unexpected qualitative phenomena, some of which have been observed for higher values of K.
The model is unchanged from earlier in the paper except that the dictionary Φ and signal x are random.
(a) Dictionary generation. The dictionary Φ consists of M i.i.d. random vectors uniformly distributed on the unit sphere in R^N.

(b) Signal generation. The true signal x is a linear combination of the first K dictionary elements, so that x = Σ_{i=1}^K α_i ϕ_i for some random coefficients {α_i}. The coefficients {α_i} are independent of the dictionary except in that x is normalized to have ‖x‖² = N for all realizations of the dictionary and coefficients.

(c) Noise. The noisy signal y is given by y = x + d, where, as before, d ∼ N(0, σ²I_N); d is independent of Φ and x. We will let γ = 1/σ², which is the input SNR because of the scaling of x.

(d) Estimator. The estimator x̂_SA is defined as before to be the optimal K-sparse approximation of y with respect to Φ. Specifically, we enumerate the J = (M choose K) K-element subsets of Φ. The jth subset spans a subspace denoted by V_j, and P_j denotes the projection operator onto V_j. Then x̂_SA = P_T y, where T is the index of the subspace closest to y.

For the special case when M and N are large and K = 1, we will estimate two quantities.

Definition 1. The subspace selection error probability p_err is defined as

p_err = Pr(T ≠ j_true),    (27)

where T is the subspace selection index and j_true is the index of the subspace containing the true signal x, that is, j_true is the index of the subset {1, 2, ..., K}.

Definition 2. The normalized expected MSE is defined as

E_MSE = (1/(Nσ²)) E[‖x − x̂_SA‖²].    (28)

The normalized expected MSE is the per-component MSE normalized by the per-component noise variance (1/N)E[‖d‖²] = σ². The term "expected MSE" emphasizes that the expectation in (28) is over not just the noise d, but also the dictionary Φ and the signal x.
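The model above is easy to simulate directly for K = 1. The sketch below (trial counts and problem sizes are arbitrary and much smaller than in Section 4.3) estimates p_err empirically, using the fact that for K = 1 the closest one-dimensional subspace is spanned by the element with the largest absolute correlation with y.

```python
import numpy as np

def perr_monte_carlo(N, M, snr, trials, rng):
    """Empirical subspace selection error probability for K = 1.

    x is a multiple of the first dictionary element with ||x||^2 = N,
    gamma = 1/sigma^2, and the estimator picks the dictionary element
    whose span is closest to y, i.e., the largest |<phi_m, y>|.
    """
    sigma = np.sqrt(1.0 / snr)
    errors = 0
    for _ in range(trials):
        Phi = rng.standard_normal((N, M))
        Phi /= np.linalg.norm(Phi, axis=0)        # uniform on the unit sphere
        x = np.sqrt(N) * Phi[:, 0]                # ||x||^2 = N
        y = x + sigma * rng.standard_normal(N)
        selected = int(np.argmax(np.abs(Phi.T @ y)))
        errors += (selected != 0)
    return errors / trials

rng = np.random.default_rng(4)
# The error probability falls as the SNR gamma rises.
p_hi = perr_monte_carlo(N=10, M=100, snr=100.0, trials=200, rng=rng)
p_lo = perr_monte_carlo(N=10, M=100, snr=0.01, trials=200, rng=rng)
print(p_hi, p_lo)
```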
We will give tractable computations to estimate both p_err and E_MSE. Specifically, p_err can be approximated from a simple line integral, and E_MSE can be computed from a double integral.

Analyses of subspace selection error and MSE
The first result shows that the subspace selection error probability can be bounded by a double integral and approximately computed as a single integral. The integrands are simple functions of the problem parameters M, N, K, and γ. While the result is only proven for the case of K = 1, K is left in the expressions to indicate the precise role of this parameter.
Theorem 3. Consider the model described in Section 4.1. When K = 1 and M and N are large, the subspace selection error probability defined in (27) is bounded above by a double integral, and p_err is well approximated by the single-integral expression (30), where β(r, s) is the beta function and Γ(r) is the gamma function [49].
For the proof, see Appendix B.
It is interesting to evaluate p_err in two limiting cases. First, suppose that J = 1. This corresponds to the situation where there is only one subspace. In this case, C = 0 and (30) gives p_err = 0. This is expected since with one subspace, there is no chance of a subspace selection error.
At the other extreme, suppose that N, K, and γ are fixed and M → ∞. Then C → ∞ and p_err → 1. Again, this is expected: as the size of the frame increases, the number of possible subspaces increases and the probability of error increases.
The next result approximates the normalized expected MSE with a double integral. The integrand is relatively simple to evaluate, and it decays quickly as ρ → ∞ and u → ∞, so numerically approximating the double integral is not difficult.
Theorem 4. Consider the model described in Section 4.1. When K = 1 and M and N are large, the normalized expected MSE defined in (28) is given approximately by the double-integral expression (36), where f_r(u) is given in (35), g_r(ρ) is a probability density, and C, r, and a are defined in (32)–(34).
For the proof, see Appendix C.

Numerical examples
We now present simulation results to examine the accuracy of the approximations in Theorems 3 and 4. Three pairs of (N, M) values were used: (5, 1000), (10, 100), and (10, 1000).
Alyson K. Fletcher et al.

For each integer SNR from −10 dB to 35 dB, the subspace selection error and normalized MSE were measured over 5 × 10⁵ independent experiments. The resulting empirical probabilities of subspace selection error and normalized expected MSEs are shown in Figure 6. Plotted alongside the empirical results are the estimates p̂_err and Ê_MSE from (30) and (36).
Comparing the theoretical and measured values in Figure 6, we see that the theoretical values match the simulation closely over the entire SNR range. Also note that Figure 6b shows qualitatively the same behavior as Figures 2 and 3 (the direction of the horizontal axis is reversed). In particular, E_MSE ≈ K/N for high SNR, and the low-SNR behavior depends on M and N as described by (22).

Asymptotic analysis
The estimates p̂_err and Ê_MSE are not difficult to compute numerically, but the expressions (30) and (36) provide little direct insight. It is thus interesting to examine the asymptotic behavior of p_err and E_MSE as N and M grow. The following theorem gives an asymptotic expression for the limiting value of the error probability function.
Theorem 5. Consider the function p_err(N, M, K, γ) defined in (30). Define the critical SNR γ_crit as a function of M, N, and K by (38), where C, r, s, and J are defined in (32) and (33). For K = 1 and any fixed γ and γ_crit with γ ≠ γ_crit,

lim p_err(N, M, K, γ) = 0 if γ > γ_crit, and 1 if γ < γ_crit,

where the limit is on any sequence of M and N with γ_crit constant.
For the proof, see Appendix D.
The theorem shows that, asymptotically, there is a critical SNR γ_crit above which the error probability goes to zero and below which it goes to one. Thus, even though the frame is random, the error event asymptotically becomes deterministic.
A similar result holds for the asymptotic MSE.
Theorem 6. Consider the function E_MSE(M, N, K, γ) defined in (36) and the critical SNR γ_crit defined in (38). For K = 1 and any fixed γ and γ_crit with γ ≠ γ_crit,

lim E_MSE(M, N, K, γ) = 0 if γ > γ_crit, and E_lim(γ) if γ < γ_crit,

where the limit is on any sequence of M and N with γ_crit constant, and E_lim(γ) is a closed-form function of γ whose properties are discussed below. For the proof, see Appendix E.
Remarks. (i) Theorems 5 and 6 hold for any values of K. They are stated for K = 1 because the significance of p_err(N, M, K, γ) and E_MSE(M, N, K, γ) is proven only for K = 1.
(ii) Both Theorems 5 and 6 involve limits with γ crit constant.
It is useful to examine how M, N, and K must be related asymptotically for this condition to hold. One can use the definition of the beta function, β(r, s) = Γ(r)Γ(s)/Γ(r + s), along with Stirling's approximation to show that when K ≪ N, rβ(r, s) can be approximated as in (42). Substituting (42) into (38), we see that γ_crit ≈ J^{1/r} − 1. Also, for K ≪ N and K ≪ M, J^{1/r} ≈ (M/K)^{2K/N}, so that γ_crit ≈ (M/K)^{2K/N} − 1 for small K and large M and N. Therefore, for γ_crit to be constant, (M/K)^{2K/N} must be constant. Equivalently, the dictionary size M must grow as K(1 + γ_crit)^{N/(2K)}, which is exponential in the inverse sparsity N/K.
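The growth rule above can be inverted and checked numerically; a small sketch follows (the helper names are ours, and the formulas are the small-K approximations just discussed, not exact expressions).

```python
import numpy as np

def critical_snr(M, N, K):
    """Small-K approximation gamma_crit ≈ (M/K)^(2K/N) - 1, i.e., the
    inverse of the growth rule M = K (1 + gamma_crit)^(N/(2K))."""
    return (M / K) ** (2 * K / N) - 1

def dictionary_size(gamma_crit, N, K):
    """Dictionary size needed to hold a given critical SNR constant."""
    return K * (1 + gamma_crit) ** (N / (2 * K))

# The two expressions invert each other: growing N at fixed gamma_crit
# requires exponentially many dictionary elements.
M = dictionary_size(gamma_crit=3.0, N=20, K=1)
print(M, critical_snr(M, 20, 1))
```

Doubling N (at fixed K and γ_crit) squares the required M/K, which illustrates the exponential dependence on the inverse sparsity N/K.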
The asymptotic normalized MSE is plotted in Figure 7 for various values of the critical SNR γ_crit. When γ > γ_crit, the normalized MSE is zero. This is expected: from Theorem 5, when γ > γ_crit, the estimator will always pick the correct subspace. We know that for a fixed subspace estimator, the normalized MSE is K/N. Thus, as N → ∞, the normalized MSE approaches zero.
What is perhaps surprising is the behavior for γ < γ_crit. In this regime, the normalized MSE actually increases with increasing SNR. At the critical level, γ = γ_crit, the normalized MSE approaches its maximum value E_lim(γ_crit). When γ_crit > 1, the limit of the normalized MSE E_lim(γ) satisfies E_lim(γ) > 1. Consequently, the sparse approximation results in noise amplification instead of noise reduction. In the worst case, as γ_crit → ∞, E_lim(γ) → 2. Thus, sparse approximation can amplify the noise by a factor as large as 2. Contrast this with the factor of 4 in (21), which appears to be a very weak bound.

COMMENTS AND CONCLUSIONS
This paper has addressed properties of denoising by sparse approximation that are geometric in the sense that the signal model is membership in a specified union of subspaces, without a probability density on that set. The denoised estimate is the feasible signal closest to the noisy observed signal. The first main result (Theorems 1 and 2) is a bound on the performance of sparse approximation applied to a Gaussian signal. This lower bound on mean-squared approximation error is used to determine an upper bound on denoising MSE in the limit of low input SNR.
The remaining results apply to the expected performance when the dictionary itself is random, with i.i.d. elements selected according to an isotropic distribution. Easy-to-compute estimates for the probability that the subspace containing the true signal is not selected and for the MSE are given (Theorems 3 and 4). The accuracy of these estimates is verified through simulations. Unfortunately, these results are proven only for the case of K = 1. The main technical difficulty in extending them to general K is that the distances to the various subspaces are not mutually independent. (Though Lemma 2 does not extend to K > 1, we expect that a relation similar to (B.10) holds.) Asymptotic analysis (N → ∞) of the situation with a random dictionary reveals a critical value of the SNR (Theorems 5 and 6). Below the critical SNR, the probability of selecting the subspace containing the true signal approaches zero and the expected MSE approaches a constant with a simple, closed form; above the critical SNR, the probability of selecting the subspace containing the true signal approaches one and the expected MSE approaches zero.
Sparsity with respect to a randomly generated dictionary is a strange model for naturally occurring signals. However, most indications are that a variety of dictionaries lead to performance that is qualitatively similar to that of random dictionaries. Also, sparsity with respect to randomly generated dictionaries occurs when the dictionary elements are produced as the random instantiation of a communication channel. Both of these observations require further investigation.

APPENDIX A. PROOF OF THEOREMS 1 AND 2
We begin with a proof of Theorem 2; Theorem 1 will follow easily. The proof is based on analyzing an idealized encoder for v. Note that despite the idealization and the use of source-coding theory, the bounds hold for any values of (N, M, K); the results are not merely asymptotic. Readers unfamiliar with the basics of source-coding theory are referred to any standard text, such as [50–52], though the necessary facts are summarized below.
Consider the encoder for v shown in Figure 8. The encoder operates by first finding the optimal sparse approximation of v, which is denoted by v̂. The subspaces in Φ_K are assumed to be numbered, and the index of the subspace containing v̂ is denoted by T. v̂ is then quantized with a K-dimensional, b-bit quantizer, represented by the box "Q", to produce the encoded version of v, which is denoted by v_Q.
The subspace selection T is a discrete random variable that depends on v. The average number of bits needed to communicate T to a receiver that knows the probability mass function of T is given by the entropy of T, which is denoted by H(T) [51]. In analyzing the encoder for v, we assume that a large number of independent realizations of v are encoded at once. This allows b to be an arbitrary real number (rather than an integer) and allows the average number of bits used to represent T to be arbitrarily close to H(T). The encoder of Figure 8 can thus be considered to use H(T) + b bits. The crux of the proof is to represent the squared error that we are interested in, E‖v − v̂‖², in terms of squared errors of the overall encoder v → v_Q and the quantizer v̂ → v_Q. We will show the orthogonality relationship below and bound both terms. The two facts we need from rate-distortion theory are as follows [50–52].
Now we would like to define the quantizer "Q" in Figure 8 to get the smallest possible upper bound on E‖v̂ − v_Q‖². Since the distribution of v̂ does not have a simple form (e.g., it is not Gaussian), we have no better tool than fact (b), which requires us only to find (or upper bound) the variance of the input to a quantizer. Consider a two-stage quantization process for v̂. The first stage (with access to T) applies an affine, length-preserving transformation to v̂ such that the result has zero mean and lies in a K-dimensional space. The output of the first stage is passed to an optimal b-bit quantizer. Using fact (b), the performance of such a quantizer must satisfy E‖v̂ − v_Q‖² ≤ K σ²_{v̂|T} 2^{−2b/K}, (A.3) where σ²_{v̂|T} is the per-component conditional variance of v̂, in the K-dimensional space, conditioned on T.
From here on, we have slightly different reasoning for the zero-mean and nonzero-mean cases. For E[v] = 0, we get an exact expression for the desired conditional variance; for E[v] ≠ 0, we use an upper bound.
When E[v] = 0, symmetry dictates that E[v̂ | T] = 0 for all T and E[v̂] = 0. Thus, the conditional variance σ²_{v̂|T} and the unconditional variance σ²_{v̂} are equal. Taking the expectation of the orthogonality relation ‖v‖² = ‖v̂‖² + ‖v − v̂‖² gives K σ²_{v̂|T} = E‖v̂‖² = Nσ²_v − N·D_SA, (A.6) where we have used D_SA = E‖v − v̂‖²/N, which is the quantity we are bounding in the theorem. Substituting (A.6) into (A.3) now gives (A.7). To usefully combine (A.2) and (A.7), we need one more orthogonality fact. Since the quantizer Q operates in subspace T, its quantization error is also in subspace T. On the other hand, because v̂ is produced by orthogonal projection onto subspace T, v − v̂ is orthogonal to subspace T. So ‖v − v_Q‖² = ‖v − v̂‖² + ‖v̂ − v_Q‖². (A.8) Taking expectations, rearranging, and substituting (A.2) and (A.7) gives (A.9). Recalling that the left-hand side of (A.9) is N·D_SA and rearranging gives (A.10). Since this bound must be true for all b ≥ 0, one can maximize with respect to b to obtain the strongest bound. This maximization is messy; however, maximizing the numerator is easier and gives almost as strong a bound. The numerator is maximized at the value of b given in (A.11), and substituting this value of b into (A.10) gives (A.12). We have now completed the proof of Theorem 2 for E[v] = 0.

For E[v] ≠ 0, there is no simple expression for σ²_{v̂|T}, such as (A.6), that does not depend on the geometry of the dictionary, to use in (A.3). Instead, use σ²_{v̂|T} ≤ σ²_{v̂} ≤ σ²_v, where the first inequality holds because conditioning cannot increase variance, and the second follows from the fact that the orthogonal projection of v cannot increase its variance, even if the choice of projection depends on v. Now following the same steps as in the E[v] = 0 case yields a bound in place of (A.10), which is optimized over b to obtain the result.

The proof of Theorem 1 now follows directly: since T is a discrete random variable that can take at most J values, H(T) ≤ log₂ J.

B. PROOF OF THEOREM 3
Using the notation of Section 4.1, let V_j, j = 1, 2, ..., J, be the subspaces spanned by the J possible K-element subsets of the dictionary Φ. Let P_j be the projection operator onto V_j, and let T be the index of the subspace closest to y. Let j_true be the index of the subspace containing the true signal x, so that the probability of error is P_err = Pr(T ≠ j_true). For each j, let x_j = P_j y, so that the estimator x_SA in (26) can be rewritten as x_SA = x_T. Also, define the random variables ρ_j = ‖y − x_j‖²/‖y‖² to represent the normalized distances between y and the V_j's. Henceforth, the ρ_j's will be called angles, since ρ_j = sin²θ_j, where θ_j is the angle between y and V_j. The angles are well defined since ‖y‖² > 0 with probability one.
Lemma 1. For all j ≠ j_true, the angle ρ_j is independent of x and d.
Proof. Given a subspace V and vector y, define the function R(y, V) = ‖y − P_V y‖²/‖y‖², where P_V is the projection operator onto the subspace V. Thus, R(y, V) is the angle between y and V. With this notation, ρ_j = R(y, V_j). Since ρ_j is a deterministic function of y and V_j, and y = x + d, to show that ρ_j is independent of x and d it suffices to prove that ρ_j is independent of y. Equivalently, we need to show that for any function G(ρ) and vectors y_0 and y_1, E[G(R(y_0, V_j))] = E[G(R(y_1, V_j))]. This property can be proven with the following symmetry argument. Let U be any orthogonal transformation. Since U is orthogonal, P_{UV}(Uy) = U P_V y for all subspaces V and vectors y. Combining this with the fact that ‖Uv‖ = ‖v‖ for all v, we see that R(Uy, UV) = R(y, V). Also, for any scalar α > 0, it can be verified that R(αy, V) = R(y, V). Now, let y_0 and y_1 be any two possible nonzero values for the vector y. Then there exist an orthogonal transformation U and a scalar α > 0 such that y_1 = αU y_0. Since j ≠ j_true and K = 1, the subspace V_j is spanned by a vector ϕ_i drawn independently of the vector y. Therefore, by scale invariance, R(y_1, V_j) = R(U y_0, V_j). Now, since the elements of Φ are distributed uniformly on the unit sphere, the subspace UV_j is identically distributed to V_j, so E[G(R(U y_0, V_j))] = E[G(R(U y_0, U V_j))] = E[G(R(y_0, V_j))]. Combining this with (B.5) and (B.6) establishes the required property, and this completes the proof.
Lemma 2. The random angles ρ_j, j ≠ j_true, are i.i.d., each with a probability density function given by the beta distribution f(ρ) = (1/β(r, s)) ρ^{r−1}(1 − ρ)^{s−1}, 0 ≤ ρ ≤ 1, (B.8) where r = (N − K)/2 and s = K/2 as defined in (33).
Proof. Since K = 1, each of the subspaces V_j for j ≠ j_true is spanned by a single, unique vector in Φ. Since the vectors in Φ are independent and the random variables ρ_j are the angles between y and the spaces V_j, the angles are independent. Now consider a single angle ρ_j for j ≠ j_true. The angle ρ_j is the angle between y and a random subspace V_j. Since the distribution of the random vectors defining V_j is spherically symmetric and ρ_j is independent of y, ρ_j is identically distributed to the angle between any fixed subspace V and a random vector z uniformly distributed on the unit sphere. One way to create such a random vector z is to take z = w/‖w‖, where w ~ N(0, I_N). Let w_1, w_2, ..., w_K be the components of w in V, and let w_{K+1}, w_{K+2}, ..., w_N be the components in the orthogonal complement of V. If we define X = Σ_{i=1}^K w_i² and Y = Σ_{i=K+1}^N w_i², then the angle between z and V is ρ = Y/(X + Y). Since X and Y are the sums of K and N − K i.i.d. squared Gaussian random variables, they are chi-squared random variables with K and N − K degrees of freedom, respectively [53]. Now, a well-known property of chi-squared random variables is that if X and Y are chi-squared random variables with m and n degrees of freedom, then Y/(X + Y) has the beta distribution with parameters n/2 and m/2. Thus, ρ = Y/(X + Y) has the beta distribution with parameters r and s defined in (33). The probability density function for the beta distribution is given in (B.8).
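As an illustrative numerical check of Lemma 2 (our own sketch, not part of the paper), the following Python fragment samples dictionary elements uniformly on the unit sphere and confirms that, for K = 1, the empirical mean of the angle ρ matches the Beta(r, s) mean r/(r + s) = (N − 1)/N:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 8, 200_000
y = np.zeros(N); y[0] = 1.0          # fixed direction; by isotropy the choice is irrelevant

# Dictionary elements uniform on the unit sphere (normalized Gaussians)
phi = rng.standard_normal((trials, N))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)

# rho_j = sin^2 of the angle between y and span(phi_j) = 1 - <y, phi_j>^2 / ||y||^2
rho = 1.0 - (phi @ y) ** 2

r, s = (N - 1) / 2, 1 / 2            # Beta(r, s) parameters: r = (N - K)/2, s = K/2, K = 1
assert abs(rho.mean() - r / (r + s)) < 1e-2   # Beta mean r/(r+s) = (N-1)/N
```

The same experiment with the empirical variance against rs/((r + s)²(r + s + 1)) provides a second-moment check.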

Proof. Since Lemma 1 shows that each ρ_j is independent of x and d, it follows that ρ_min is independent of x and d as well. Also, for any j ≠ j_true, upper and lower bounds of the form (B.14) on Pr(ρ_j ≤ ε) are obtained by bounding the integrand of (B.8) over [0, ε]. Now, there are J − 1 subspaces V_j with j ≠ j_true, and by Lemma 2 the ρ_j's are mutually independent. Consequently, if we apply the upper bound of (B.14) and the inequality 1 − δ > exp(−δ/(1 − δ)) for δ ∈ (0, 1), with δ = ε^r/(rβ(r, s)), we obtain the lower bound (B.15). Similarly, using the lower bound of (B.14), we obtain the upper bound (B.16).
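Since the constant C of (32) is not reproduced in this excerpt, the structure of Lemma 3 can be checked directly in a short sketch of our own: for J − 1 i.i.d. Beta(r, s) angles, Pr(ρ_min > ε) matches the exact independent-minimum expression (1 − F(ε))^{J−1} and its exponential approximation exp(−(J − 1)F(ε)), the form appearing in (B.10):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, trials = 10, 200, 100_000
r, s = (N - 1) / 2, 1 / 2        # K = 1, so r = (N - K)/2, s = K/2
J1 = M - 1                       # number of "incorrect" subspaces

# J - 1 independent angles per trial, as guaranteed by Lemma 2
rho = rng.beta(r, s, size=(trials, J1))
rho_min = rho.min(axis=1)

eps = 0.5
p_emp = (rho_min > eps).mean()                        # empirical Pr(rho_min > eps)
F_eps = (rng.beta(r, s, size=trials) <= eps).mean()   # Monte Carlo estimate of the Beta cdf F(eps)

assert abs(p_emp - (1 - F_eps) ** J1) < 0.02    # independence: Pr(min > eps) = (1 - F(eps))^(J-1)
assert abs(p_emp - np.exp(-J1 * F_eps)) < 0.05  # exponential approximation, as in (B.10)
```

The parameter values are arbitrary; the exponential form becomes tight as J grows with F(ε) small.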

Proof of Theorem 3. Let V_true be the "correct" subspace, that is, V_true = V_j for j = j_true. Let D_true be the squared distance from y to V_true, and let D_min be the minimum of the squared distances from y to the "incorrect" subspaces V_j, j ≠ j_true.
Since the estimator selects the closest subspace, there is an error if and only if D_min ≤ D_true; thus, P_err = Pr(D_min ≤ D_true). Now consider D_min. For any j, x_j is the projection of y onto V_j. Thus, the squared distance from y to any subspace V_j is ‖y − x_j‖² = ρ_j‖y‖². Hence, the minimum of the squared distances from y to the subspaces V_j, j ≠ j_true, is D_min = ρ_min‖y‖². We will bound and approximate ‖y‖² to obtain the bound and approximation of the theorem. Notice that y = x + d = (x + d_0) + d_1, where x + d_0 ∈ V_true and d_1 ∈ V_true^⊥. Using this orthogonality and the triangle inequality, we obtain the bound (B.20). For an accurate approximation, note that since d_0 is the component of d in the K-dimensional space V_true, we have D_0 ≪ Nσ² unless the SNR is very low. Thus, starting with (B.20) substituted in (B.22): the first equality uses U = D_1/((N − K)σ²), which is a normalized chi-squared random variable with N − K = 2r degrees of freedom, and V = D_0/(Kσ²), which is a normalized chi-squared random variable with K = 2s degrees of freedom [53]; the last equality uses the definition of G from the statement of the theorem; and the final inequality is an application of Lemma 3. This yields (29).
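The error event P_err = Pr(D_min ≤ D_true) can also be estimated by direct simulation, in the spirit of Figure 6. The following sketch (our own, with arbitrary parameter choices) draws an isotropic random dictionary, places the true signal in V_{j_true}, and selects the closest one-dimensional subspace:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, trials = 10, 50, 4000

def error_prob(snr_db):
    """Empirical Pr(T != j_true) for K = 1 and an isotropic random dictionary."""
    sigma2 = 10 ** (-snr_db / 10) / N   # per-component noise variance, with ||x|| = 1
    errs = 0
    for _ in range(trials):
        phi = rng.standard_normal((M, N))
        phi /= np.linalg.norm(phi, axis=1, keepdims=True)
        x = phi[0]                       # true signal lies in V_{j_true} = span(phi_0)
        y = x + np.sqrt(sigma2) * rng.standard_normal(N)
        # the closest 1-D subspace maximizes <y, phi_j>^2, i.e., minimizes rho_j ||y||^2
        errs += int(np.argmax((phi @ y) ** 2) != 0)
    return errs / trials

p_low, p_high = error_prob(-5.0), error_prob(15.0)
assert p_high < p_low    # subspace selection improves with SNR
assert p_high < 0.05
assert p_low > 0.2
```

Sweeping snr_db over a finer grid reproduces the qualitative threshold behavior described in Theorems 5 and 6.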

C. PROOF OF THEOREM 4
We will continue with the notation of the proof of Theorem 3. To approximate the MSE, we will need yet another property of the random angles ρ_j.

Lemma 4. For any subspace index j, E[x_j | ρ_j, y] = (1 − ρ_j)y.
Proof. Define the random variable w_j = x_j − (1 − ρ_j)y, and let μ_j = E[w_j | ρ_j, y]. Then, E[x_j | ρ_j, y] = (1 − ρ_j)y + μ_j. (C.1) So the lemma will be proven if we can show that μ_j = 0. To this end, first observe that since x_j is the projection of y onto the space V_j, x_j − y is orthogonal to x_j. Using this fact along with the definition of ρ_j, one can verify that ⟨w_j, y⟩ = 0; that is, w_j is orthogonal to y. Consequently, μ_j = E[w_j | ρ_j, y] is orthogonal to y as well.
We can now show that μ_j = 0 via a symmetry argument similar to that used in the proof of Lemma 1. For any vector y and subspace V, define the function W(y, V) = P_V y − (1 − R(y, V))y, where, as in the proof of Lemma 1, P_V is the projection operator onto V, and R(y, V) is given in (B.3). Since ρ_j = R(y, V_j), we can rewrite w_j as w_j = W(y, V_j). The proof of Lemma 1 showed that for any orthogonal transformation U, P_{UV}(Uy) = U P_V y and R(Uy, UV) = R(y, V). Therefore, W(Uy, UV) = U W(y, V). (C.5) Now, fix y and let U be any fixed orthogonal transformation of R^N with the property that Uy = y. Since U is orthogonal and the space V_j is generated by random vectors with a spherically symmetric distribution, UV_j is identically distributed to V_j. Combining this with (C.5) and the fact that Uy = y gives μ_j = Uμ_j for every orthogonal transformation U such that Uy = y. Hence, μ_j must lie in the span of y. But we showed above that μ_j is orthogonal to y. Thus μ_j = 0, and this proves the lemma.
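Lemma 4 can be spot-checked unconditionally: averaging over isotropic one-dimensional subspaces, E[x_j] = E[1 − ρ_j]·y = (K/N)·y, since the Beta(r, s) mean gives E[ρ_j] = (N − K)/N. A minimal sketch of our own:

```python
import numpy as np

rng = np.random.default_rng(3)
N, trials = 6, 400_000
y = rng.standard_normal(N)                     # arbitrary fixed observation

# Isotropic 1-D subspaces V_j = span(phi_j)
phi = rng.standard_normal((trials, N))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)

proj = (phi @ y)[:, None] * phi                # x_j = P_j y for each random subspace

# Averaging Lemma 4 over rho_j: E[x_j] = E[1 - rho_j] y = y / N for K = 1
assert np.allclose(proj.mean(axis=0), y / N, atol=0.01)
```

This does not verify the full conditional statement, only its average over ρ_j, but it exercises the same symmetry that forces μ_j = 0.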
Proof of Theorem 4. As in the proof of Theorem 3, let D_0 and D_1 be the squared norms of the components of d on the spaces V_true and V_true^⊥, respectively. Also, let U = D_1/((N − K)σ²). Define the random variable in (C.7) and its conditional expectation F_0(ρ, u) in (C.8). Differentiating the approximate cumulative distribution function of ρ_min in Lemma 3, we see that ρ_min has the approximate probability density function f_r(ρ). Also, as argued in the proof of Theorem 3, U has the probability density function g_r(u). Therefore, (C.9) follows. In the last step, we have used the fact that D_0 = ‖d_0‖², where d_0 is the projection of d onto the K-dimensional subspace V_true. Since d has variance σ² per dimension, E[D_0] = Kσ². Comparing (C.9) with (36), the theorem will be proven if we can show that F_0(ρ, u) ≈ F(ρ, u), (C.10) where F(ρ, u) is given in (37). We consider two cases: T = j_true and T ≠ j_true. First, consider the case T = j_true. In this case, x_SA is the projection of y onto the true subspace V_true, and the error x − x_SA is precisely d_0, the component of the noise d on V_true. Thus, proceeding through the two cases yields (C.28). This shows that F_0(ρ, u) ≈ F(ρ, u) and completes the proof.

D. PROOF OF THEOREM 5
The function g_r(u) is the pdf of a normalized chi-squared random variable with 2r degrees of freedom [53]. That is, g_r(u) is the pdf of a variable of the form U_r = (1/(2r)) Σ_{i=1}^{2r} z_i², where the z_i are i.i.d. N(0, 1) random variables.
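Concretely, such a normalized chi-squared variable has mean 1 and variance 1/r, which is why U_r concentrates at 1 as r → ∞. A quick numerical check (our own sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
r, trials = 20, 100_000

# U_r = chi^2_{2r} / (2r), the variable with pdf g_r(u)
z = rng.standard_normal((trials, 2 * r))
U = (z ** 2).sum(axis=1) / (2 * r)

assert abs(U.mean() - 1.0) < 0.01     # E[U_r] = 1
assert abs(U.var() - 1.0 / r) < 0.01  # Var[U_r] = 2(2r)/(2r)^2 = 1/r, vanishing as r grows
```

Repeating with larger r shows the variance shrinking like 1/r, consistent with U_r → 1 almost surely.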

E. PROOF OF THEOREM 6
As in the proof of Theorem 5, let U_r be a normalized chi-squared variable with pdf g_r(u). Also let ρ_r be a random variable with pdf f_r(ρ). Then we can write E_MSE as in (E.1), where the expectation is over the random variables U_r and ρ_r.
As in the proof of Theorem 5, U_r → 1 almost surely as r → ∞. Integrating the pdf f_r(ρ) gives the cdf of ρ_r:

Figure 1: Two sparsity models in dimension N = 2. (a) Having sparsity K = 1 with respect to a dictionary with M = 3 elements restricts the possible signals greatly. (b) With the dictionary size increased to M = 100, the possible signals still occupy a set of measure zero, but a much larger fraction of signals is approximately sparse.

Figure 5: Illustration of variables to relate approximation and denoising problems. (An undesirable case in which x_SA is not in the same subspace as x.)

Figure 6: Simulation of subspace selection error probability and normalized expected MSE for isotropic random dictionaries. Calculations were made for integer SNRs (in dB), with 5 × 10^5 independent simulations per data point. In all cases, K = 1. The curve pairs are labeled by (N, M). Simulation results are compared to the estimates from Theorems 3 and 4.

Figure 8: The proof of Theorem 2 is based on the analysis of a hypothetical encoder for v. The sparse approximation box "SA" finds the optimal K-sparse approximation of v, denoted by v̂, by computing v̂ = P_T v. The subspace selection T can be represented with H(T) bits. The quantizer box "Q" quantizes v̂ with b bits, with knowledge of T. The overall output of the encoder is denoted by v_Q.
(a) The lowest possible per-component MSE for encoding an i.i.d. Gaussian source with per-component variance σ² using R bits per component is σ² 2^{−2R}. (b) Any source with per-component variance σ² can be encoded with R bits per component to achieve per-component MSE σ² 2^{−2R}. (The combination of facts (a) and (b) tells us that Gaussian sources are the hardest to represent when distortion is measured by MSE.) Applying fact (a) to the overall encoder v → v_Q gives (A.2).
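Fact (a) is a lower bound that no concrete quantizer can beat. The sketch below (our own; the uniform quantizer and its parameters are illustrative, not from the paper) checks that a simple R-bit scalar quantizer for a Gaussian source indeed incurs MSE above σ² 2^{−2R}:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, R, n = 1.0, 3, 500_000        # R bits per component

x = sigma * rng.standard_normal(n)

# Uniform midpoint quantizer with 2^R cells covering [-4 sigma, 4 sigma]
levels = 2 ** R
lo, width = -4 * sigma, 8 * sigma / levels
idx = np.clip(np.floor((x - lo) / width), 0, levels - 1)
xq = lo + (idx + 0.5) * width

mse = np.mean((x - xq) ** 2)
rd_bound = sigma ** 2 * 2 ** (-2 * R)   # fact (a): the distortion-rate lower bound

assert mse >= rd_bound
```

The gap between mse and rd_bound reflects the suboptimality of scalar uniform quantization; the bound is approached only by high-dimensional vector quantizers.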

Lemma 3. Let ρ_min = min_{j ≠ j_true} ρ_j. Then ρ_min is independent of x and d and has the approximate distribution Pr(ρ_min > ε) ≈ exp(−(Cε)^r) (B.10) for small ε, where C is given in (32); more precise two-sided bounds are given in (B.11).

Pr(ρ_r < x) = 1 − exp(−(Cx)^r), so ρ_r → 1/C in distribution. Therefore, taking the limit of (E.1) with K = 1 and C constant as N, M → ∞, and using the limit (D.4) and the definition γ_crit = C − 1 from the proof of Theorem 5, the condition (1 + a)/C < a is equivalent, in the limit as N → ∞, to γ < γ_crit. Therefore, the limit of the expected MSE is E_lim(γ) if γ < γ_crit and 0 if γ > γ_crit. (E.6)

From the proof of Theorem 3: Pr(D_min ≤ D_true) = Pr(ρ_min‖y‖² ≤ D_1). Note that by Lemma 3, ρ_min is independent of x and d; therefore, ρ_min is independent of D_0 and D_1. We can now obtain a bound and an approximation from (B.22) by taking expectations over D_0 and D_1. To obtain a bound, combine the lower bound on Pr(ρ_min > ε) from (B.11) with (B.20).