A unified approach to sparse signal processing

A unified view of the area of sparse signal processing is presented in tutorial form by bringing together various fields in which the property of sparsity has been successfully exploited. For each of these fields, various algorithms and techniques, which have been developed to leverage sparsity, are described succinctly. The common potential benefits of significant reduction in sampling rate and processing manipulations through sparse signal processing are revealed. The key application domains of sparse signal processing are sampling, coding, spectral estimation, array processing, component analysis, and multipath channel estimation. In terms of the sampling process and reconstruction algorithms, linkages are made with random sampling, compressed sensing, and rate of innovation. The redundancy introduced by channel coding in finite and real Galois fields is then related to over-sampling with similar reconstruction algorithms. The error locator polynomial (ELP) and iterative methods are shown to work quite effectively for both sampling and coding applications. The methods of Prony, Pisarenko, and MUltiple SIgnal Classification (MUSIC) are next shown to be targeted at analyzing signals with sparse frequency domain representations. Specifically, the relations of the approach of Prony to an annihilating filter in rate of innovation and ELP in coding are emphasized; the Pisarenko and MUSIC methods are further improvements of the Prony method under noisy environments. The iterative methods developed for sampling and coding applications are shown to be powerful tools in spectral estimation. Such narrowband spectral estimation is then related to multi-source location and direction of arrival estimation in array processing. Sparsity in unobservable source signals is also shown to facilitate source separation in sparse component analysis; the algorithms developed in this area such as linear programming and matching pursuit are also widely used in compressed sensing. Finally, the multipath channel estimation problem is shown to have a sparse formulation; algorithms similar to sampling and coding are used to estimate typical multicarrier communication channels.


I. INTRODUCTION
T HERE are many applications in signal processing and communication systems where the discrete signals are sparse in some domain such as time, frequency, or space i.e., most of the samples are zero, or alternatively their transform in another domain (normally called "frequency coefficients") Manuscript received February 05, 2009; revised ??? ??, 2009.F. Marvasti, A. Amini, F. Haddadi, B. Khalaj and M. Soltanolkotabi are affiliated with Advanced Communication Research Institute (ACRI), Electrical Engineering Department, Sharif University of Technology {marvasti, khalaj}@sharif.edu,{arashsil, farzanhaddadi, msoltan}@ee.sharif.edu A. Aldroubi is affiliated with Math Department, Vanderbilt University, akram.aldroubi@vanderbilt.edu.The work of Akram Aldroubi was supported in part by grant nsf-dms 0807464.
S. Sanei is affiliated with Centre of Digital Signal Processing, School of Engineering, Cardiff University, Cardiff, UK, SaneiS@cf.ac.ukJ. Chambers is affiliated with Electrical and Electronic Department, Loughborough University, j.a.chambers@lboro.ac.uk S. Holm is affiliated with the Department of Informatics, University of Oslo, sverre@ifi.uio.no The invited contributors to specific sections are listed in the acknowledgments.is sparse (see Figs. 1 and 2).There are trivial sparse transformations where the sparsity is preserved in both "time" and "frequency" domains; the identity transform matrix and its sorted versions are extreme examples.Wavelet transformations that preserve the local characteristics of a sparse signal can be regarded as "almost" sparse in the "frequency" domain; in general, for sparse signals, the more similar the transformation matrix is to an identity matrix, the sparser the signal is in the transform domain.In addition, the transform matrix may be sparse; wavelet transformation matrices are such examples.
In any of these scenarios, sampling and processing can be optimized using sparse signal processing.In other words, the sampling rate and the processing manipulations can be significantly reduced; hence, a combination of data compression and processing time reduction can be achieved 1 .
The columns of Table II consist of 0-category, 1-topics, 2sparsity domain, 3-type of sparsity, 4-information domain, 5-type of sampling in information domain, 6-minimum sampling rate, 7-reconstruction method, and 8-applications.The first rows (2-7) of column 1 are on sampling techniques.The 8-9 th rows are related to channel coding, row 10 is on spectral estimation and rows 11-13 are related to array processing.Rows 14-15 correspond to SCA and finally, row 16 covers multicarrier channel estimation, which is a rather new topic.As shown in column 2 of the table, depending on the topics, sparsity is defined in time, space, or "frequency" domains.In some applications, the sparsity is defined as the number of polynomial coefficients (which in a way could be regarded as "frequency"), the number of sources (which may depend on location or time sparsity for the signal sources), or the number of "words" (signal bases) in a dictionary.The type of sparsity is shown in column 3; for sampling schemes, it is usually low-pass, band-pass, or multiband [25], while for compressed sensing, and most other applications, it is random.Column 4 represents the information domain, where the number of sparsity, locations, and amplitudes can be determined by proper sampling (column 5) of this domain.
The other columns are self explanatory and will be discussed in more details in the following sections.
The rows 2-4 of Table II are related to the sampling (uniform or random) of signals that are bandlimited in the Fourier domain.Band-limitedness is a special case of sparsity where the nonzero coefficients in the frequency domain are consecutive.A better assumption in the frequency domain is to have random sparsity [26], [27], [28] as shown in row 5 and column 3. A generalization of the sparsity in the frequency domain is sparsity in any transform domain such as Discrete Cosine Transform (DCT) and wavelets; this concept is further generalized in compressed sensing (row 6) where sampling is taken by a linear combination of time domain samples [3], [29], [30], [31].Sampling of signals with finite rate of innovation (row 7) is related to piecewise smooth (polynomial based) signals.The position of discontinuous points is determined by annihilating filters that are equivalent to error locator polynomials in error correction codes and Prony's method [28] as discussed in Sections III and IV, respectively.
Random errors in a Galois field (row 8) and the additive impulsive noise in real-field error correction codes (row 9) are sparse disturbances that need to be detected and removed.For erasure channels, the impulsive noise can be regarded as the negative of the sample value [32]; thus the missing sampling problem, which can also be regarded as a special case of nonuniform sampling, is also a special case of the error correction problem.A subclass of impulsive noise for 2-D signals is salt and pepper noise [33].The information domain, where the sampling process occurs, is called the syndrome which is usually in a transform domain.In addition, for special binary codes such as Low Density Parity Check (LDPC) codes, the parity check matrix is extremely sparse.Sparse matrix inversions and manipulations [34], [35], [36] are utilized in Discrete Wavelet Transform (DWT) and LDPC decoding.
Spectral estimation (row 10) is the dual of error correction codes, i.e., the sparsity is in the frequency domain.MSL (row 11) and multi-target detection in radars are similar to spectral estimation since targets act as spatial sparse monotones; each target is mapped to a specific spatial frequency regarding its line of sight direction relative to the receiver.The techniques developed for this branch of science is quite unique; with examples such as MUSIC [8], Prony [9], and Pisarenko [10].We shall see that the techniques used in realfield error correction codes and SCA can also be used in this area.
The array processing category (rows 11-13) consists of three separate topics.The first one covers MSL in radars and sonars and DOA, which are similar to spectral estimation.The techniques developed for this field are similar to the spectral estimation methods with emphasis on the Minimum Description Length (MDL) [37].The second topic in the array processing category is related to the design of sparse arrays where some of the array elements are missing; the remaining nodes form a nonuniform sparse grid.In this case, one of the optimization problems is to find the sparsest array (number, location and weight of elements) for a given beampattern.This problem has some resemblance to the missing sampling  problem (which is a special case of real-field error correction codes), and may be solved by similar techniques.The third topic is on sensor networks (row 13).Distributed sampling and recovery of a physical field using an array of sparse sensors is a problem of increasing interest in environmental and seismic monitoring applications of sensor networks [38].
Sensor fields may be bandlimited or non-bandlimited.Since the power consumption is the most restricting issue in sensors, it is vital to use the lowest possible number of sensors (sparse sensor networks) with the minimum processing computation.
In Sparse Component Analysis (SCA), the number of observations is much less than the number of sources (signals).However, if the sources are sparse in the time domain, then the active sources and their amplitudes can be determined; this is equivalent to error correction codes.Sparse Dictionary Representation (SDR) is another new area where signals are represented by the sparsest number of words (signal bases) in a dictionary of finite number of words; this sparsity may result in tremendous amount of data compression.When the dictionary is overcomplete, there are many ways to represent the signal; however, we are interested in the sparsest representation.Normally, for extraction of statistically independent sources, Independent Component Analysis (ICA 4 ) is used for a complete set of linear mixtures.In the case of a non-complete (underdetermined) set of linear mixtures, ICA can work if the sources are also sparse; for this special case, ICA analysis is synonymous with SCA.
Finally, channel estimation is shown in row 16.In mobile communication systems, multipath reflections create a channel that can be modeled by a sparse FIR filter.For proper decoding of the incoming data, the channel characteristics should be estimated before it can be equalized.For this purpose, a training sequence is inserted within the main data, which enables the receiver to obtain the output of the channel by exploiting this training sequence.The channel estimation problem becomes a deconvolution problem under noisy environments.The sparsity criterion of the channel greatly improves the channel estimation; this is where the algorithms for extraction of a sparse signal could be employed [22], [23], [39].When sparsity is random, further signal processing is needed.In this case there are three items that need to be considered.1-Evaluating the number of sparse coefficients (or samples), 2-finding the position of sparse coefficients, and 3-determining the values of these coefficients.In some applications only the first two items are needed; e.g., in spectral estimation.However, in almost all the other cases mentioned in Table II, all the three items should be determined.Various types of Linear Programming (LP) and some iterative algorithms, such as the Iterative Method with Adaptive Thresholding (IMAT), determine the number, positions and values of sparse samples at the same time.On the other hand, the Minimum Description Length (MDL) method, used in DOA/MSL and spectral estimation, determines the number of sparse source locations or frequencies.In the subsequent sections, we shall describe, in more details, each algorithm for various areas and applications based on Table II.
Finally, it should be mentioned that the signal model for each topic or application may be deterministic or stochastic.For example, in the sampling category for rows 2-4 and 7, the signal model is typically deterministic although stochastic models could also be envisioned [40].On the other hand for random sampling and CS (rows 5-6), the signal model is stochastic although deterministic models may also be envisioned [41].In channel coding and estimation (rows 8-9 and 16), the signal model is normally deterministic.For Spectral and DOA estimation (rows 10-11), stochastic models are assumed; while for array beam-forming (row 12), deterministic models are used.In sensor networks (row 13), both deterministic and stochastic signal models are employed.Finally, in SCA (rows [14][15], statistical independence of sources may be necessary and thus stochastic models are applied.

II. SAMPLING: UNIFORM, NONUNIFORM, MISSING, RANDOM, COMPRESSED SENSING, RATE OF INNOVATION
Analog signals can be represented by finite rate discrete samples (uniform, nonuniform, or random) if the signal has some sort of redundancies such as band-limitedness, finite polynomial representation (e.g., periodic signals that are represented by a finite number of trigonometric polynomials), and nonlinear functions of such redundant functions [42], [43].The minimum sampling rate is the Nyquist rate for uniform sampling and its generalizations for nonuniform [2] and multiband signals [44].When a signal is discrete, the equivalent discrete representation in the "frequency" domain (DFT, DCT, DWT, Discrete Hartley Transform (DHT), Dis-crete Sine Transform (DST)) may be sparse, which is the discrete version of bandlimited or multiband analog signals.
For discrete signals, if the nonzero coefficients ("frequency" sparsity) are consecutive, depending on the location of the zeros, they are called lowpass, bandpass, or multiband discrete signals; otherwise, the "frequency" sparsity is random.The number of discrete time samples needed to represent a frequency-sparse signal follows the law of algebra, that is, the number of time samples should be equal to the number of coefficients in the "frequency" domain; this is equivalent to the Nyquist rate-twice the bandwidth (for discrete signals with DC components it is twice the bandwidth minus one).The dual of frequency-sparsity is time-sparsity, which can happen in a burst or a random fashion.The number of "frequency" coefficients needed follows the Nyquist criterion.This will be further discussed in Section III for sparse additive impulsive noise channels.

A. Sampling of Sparse Signals
If the sparsity locations of a signal are known in a transform domain, then the number of samples needed in the time (space) domain should be at least equal to the number of sparse coefficients, i.e., the so called Nyquist rate.However, depending on the type of sparsity (lowpass, bandpass, or random) and the type of sampling (uniform, periodic nonuniform, or random), the reconstruction may be unstable and the corresponding reconstruction matrix may be ill-conditioned [45], [46].Thus in many applications mentioned in Table II, the sampling rate in column 6 is higher than the minimum (Nyquist) rate.
When the location of sparsity is not known, by the law of algebra, the number of samples needed to specify the sparsity is at least twice the number of sparse coefficients.Again for stability reasons, the actual sampling rate is higher than this minimum figure [2], [44].To guarantee stability, instead of direct sampling of the signal, a combination of the samples can be used.Donoho [30] has recently shown that if we take linear combinations of the samples, the minimum stable sampling rate is of the order O(k log( n k )), where n and k are the frame size and the sparsity number, respectively.
Iterative Methods When the Location of Sparsity is Known: The reconstruction algorithms have to recover the original sparse signal from the information domain and the type of sparsity in the transform domain.We know the samples (both position and amplitude) and we know the location of sparsity in the transform domain.An iteration between these two   3 1) Convert the input to the i th iteration (x (i) ) into the transform domain (for instance the Fourier domain); x (0) is normally the initial received signal.2) Multiply the transformed signal (X (i) ) by a mask (for instance a band-limiting filter).3) Take the inverse transform of the result in step 2 to get r (i) .4) Set the new result as: domains (Fig. 3 and Table III) or consecutive Projections Onto Convex Sets (POCS) should yield the original signal [45], [54], [55], [58], [61], [62], [63], [64].
In case of the usual assumption that the sparsity is in the "frequency" domain and for the uniform sampling case of lowpass signals, one projection (bandlimiting in the frequency domain) suffices.However, if the frequency sparsity is random, the time samples are nonuniform, or the "frequency" domain is defined in a domain other than the DFT, then we need several iterations to have a good replica of the original signal.In general, this iterative method converges if the "Nyquist" rate is satisfied, i.e., the number of samples per block is greater than or equal to the number of coefficients.Figure 4 shows the improvement in dB versus the number of iterations for a random sampling set for a bandpass signal.In this figure,  besides the standard iterative method, accelerated iterations such as Chebyshev and Conjugate Gradient methods are also used (please see Appendix I for the algorithms) [65].
Iterative methods are quite robust against quantization and additive noise.In fact, we can prove that the iterative methods approach the pseudo-inverse (least squares) solution for a noisy environment; specially, when the matrix is illconditioned [44].
Iterative Method with Adaptive Threshold (IMAT) for Unknown Location of Sparsity: As expected, when sparsity is assumed to be random, further signal processing is needed.We need to evaluate the number of sparse coefficients (or samples), the position of sparsity, and the values of the coefficients.The above iterative method cannot work since projection (the masking operation in Fig. 3) onto the "frequency" domain is not possible without the knowledge of the positions of sparse coefficients.In this scenario, we need to use the knowledge of sparsity in some way.The introduction of an adaptive nonlinear threshold in the iterative method can do the trick, thus the name: Iterative Method with Adaptive Threshold (IMAT); the block diagram and the algorithm are depicted in Fig. 5 and Table IV, respectively.The algorithms in [66], [32], [26], [24] are variations of this method.Figure 5 shows that by alternate projections between information and sparsity domains (adaptively lowering or raising the threshold levels in the sparsity domain), the sparse coefficients are gradually picked up after several iterations.This method can be considered as a modified version of Matching Pursuit as described in Section VI-D.1; the results are shown in Fig. 6.The sampling rate in the time domain is twice the number of unknown sparse coefficients.This is called the full capacity rate; this figure shows that after about 15 iterations, the SNR reaches its peak value.In general, the higher the sampling rate relative to the full capacity, the faster is the convergence rate and the better is the SNR value.
Matrix Solutions: When the sparse nonzero locations are known, matrix approaches can be utilized to determine the values of sparse coefficients [52].Although these approaches are rather straight forward, they may not be robust against quantization or additive noise.
There are other approaches such as Spline interpolation [51], nonlinear/time varying methods [52], Lagrange interpolation   IV).[49] and Error Locator Polynomial (ELP) [67] that will not be discussed here.These methods work quite well in the absence of additive noise but they may not be robust in the presence of noise.However the ELP approach will be discussed in Sec.III-A; variations of this method are called the annihilating filter in sampling with finite rate of innovation (Sec.II-C) and the Prony's method in spectral and DOA estimation (Sec.IV-A).

B. Compressed Sensing (CS)
The relatively new topic of Compressed (Compressive) Sensing (CS) which was originally introduced in [30] and further extended in [68], [69] and [31] deals with sampling of sparse signals, in general.The idea is to introduce sampling schemes with low number of required samples which uniquely represent the original sparse signal; these methods have lower computational complexities than the traditional techniques that employ oversampling and then apply compression.In other words, compression is achieved exactly at the time of sampling.Unlike the classical sampling theorem [70], the signals are assumed to be sparse in an arbitrary transform domain, not necessarily the Fourier transform.Furthermore, there is no restricting assumption for the location of nonzero coefficients in the sparsity domain; i.e., the locations should not follow a specific pattern such as lowpass or multiband structure.Clearly, this assumption includes a more general class of signals than the ones previously studied.
Since the concept of sparsity in a transform domain is easier to study for discrete signals, most of the research in this field is focused along discrete type signals [71]; however, recent results [72] show that most of the work can be generalized to continuous signals that have a sparse representation in a Riesz basis5 [73].We first study discrete signals and then briefly discuss the extension to the continuous case.
1) CS Mathematical Modeling: Let the vector x ∈ R n be a finite length discrete signal in the time domain which has to be under-sampled.We assume that x has a sparse representation in a transform domain denoted by a unitary matrix Ψ n×n ; i.e., we have: where s is an n × 1 vector which has at most k non-zero elements (k-sparse vectors).In practical cases, s has at most k significant elements and the insignificant elements are set to zero which means s is an almost k-sparse vector.For example, x can be the pixels of an image and Ψ can be the corresponding IDCT matrix.In this case, most of the DCT coefficients are insignificant and if they are set to zero, the quality of the image will not degrade significantly.In fact, this is the main concept behind some of the lossy compression methods such as JPEG2000.Since the inverse transform on x yields s, the vector s can be used instead of x, which can be succinctly represented by the location and values of the nonzero elements of s.Although this method efficiently compresses x, it initially requires all the samples of x to produce s, which undermines the whole purpose of CS.Now let us assume that instead of samples of x, we take m linear combinations of the samples (called generalized samples).If we represent these linear combinations by the matrix Φ m×n and the resultant vector of samples by y m×1 , we have: The question is how the matrix Φ and the size m should be chosen to ensure that these samples uniquely represent the original signal x.Apparently, the case of Φ = I n×n for m = n yields a trivial solution (keeping all the samples of x) that does not employ the sparsity characteristic.We look for Φ matrices with as few rows as possible which can guarantee the invertibility of the sampling process for the class of sparse inputs.
To solve this problem, we introduce probabilistic measures; i.e., instead of exact recovery of signals, we focus on the probability that a random sparse signal (according to a given probability density function) fails to be reconstructed using its generalized samples.If the probability of failure approaches zero, we can state that the sampling scheme (the joint pair of Ψ, Φ) is successful in recovering x with probability 1.
Let us assume that Φ (m) represents the first m rows of an invertible matrix Φ n×n .It is apparent that if we use {Φ (m) } n m=0 as the sampling matrices for a given sparsity domain, the failure probabilities for Φ (0) and Φ (n) are one and zero respectively, and as the index m increases, the failure probability decreases.The important point is that the decreasing rate of the failure probability is exponential with respect to m k [74].Therefore, we expect to reach an almost zero failure probability much earlier than m = n despite the fact that the exact rate highly depends on the mutual behavior of the two matrices Ψ, Φ.More precisely, it is shown in [74] that: where c is a positive constant and µ(Ψ, Φ (m) ) is the maximum coherence between the rows of Ψ and Φ (m) defined by [75]: where ψ a , φ b are the a th and b th rows of the matrices Ψ and Φ, respectively.The above result implies that, the probability of reconstruction is close to one for: The above derivation implies that, the lower the maximum coherence between the two matrices, the lower is the number of required samples.Thus, to decrease the number of samples, we should look for matrices Φ with low coherence with Ψ.For this purpose, we use a random Φ.It is shown that the coherence of a random matrix with i.i.d.Gaussian distribution with any unitary Ψ is considerably small [30], which makes it a proper candidate for the sampling matrix.Investigation on the probability distribution has shown that the Gaussian PDF is not the only solution (for example a binary distribution is considered in [76]) but may be the simplest to analyze.
The inequality shown in (6) can be simplified in terms of m, n and k; it can be shown [3] and [71] that a sparse signal can be reconstructed from its compressed samples with a probability of almost one if: 2) Reconstruction from Compressed Measurements: In this subsection, we would like to consider the reconstruction algorithms and stability issues.Essentially, there are three methods: A-Geometric, B-Combinatorial, and C-Information Theoretic.We would like to briefly discuss these three methods.
Geometric Methods: The oldest methods for reconstruction from compressed sampling are geometric, i.e., ℓ 1 minimization techniques for finding a k-sparse vector s ∈ R n from a set of m = O k log( n k ) measurements (y i ); see e.g., [30], [74], [77], [78], [79].Let us assume that we have applied a suitable Φ which guarantees the invertibility of the sampling process.The reconstruction method should be a technique to recover a k-sparse vector s n×1 from the observed samples y m×1 = Φ m×n •Ψ n×n •s n×1 or possibly y m×1 = Φ m×n •s n×1 +ν m×1 , where ν denotes the noise vector.Suitability of Φ implies that s n×1 is the only k-sparse vector that produces the observed samples; therefore, s n×1 is also the sparsest solution for y = Φ • Ψ • s.Consequently, s can be found using: Minimization with respect to ℓ 0 -norm (sparsity) is an NPcomplete problem in general.However, it is shown in [80] that minimization with ℓ 1 -norm results in the same vector s for many cases.The interesting part is that the number of required samples to replace ℓ 0 with ℓ 1 -minimization has the same order of magnitude as the one for the invertibility of the sampling scheme.Hence, s can be derived from (8) using ℓ 1minimization.It is worthwhile to mention that replacement of ℓ 1 -norm with ℓ 2 -norm, which is faster to implement, does not necessarily produce reasonable solutions.However, there are greedy methods (Matching Pursuit as discussed in Sec.VI on SCA [81], [82]) which iteratively approach the best solution faster than ℓ 1 -norm optimization (Basis Pursuit as discussed in Sec.VI on SCA).
The technique of IMAT, discussed in random sampling and simulated in Fig. 6 for sparse DFT signals, can also be used for the recovery of a sparse signal in other transform domains such as DCT.It should be noted that this sampling process is a special case of CS.Instead of a linear combination of samples, the actual random samples are used [27].Simulation results shown in Fig. 7 confirm the relation represented in (7).In our simulation results, perfect reconstruction corresponds to an SNR value of 100 dB with a reliability of at least 80%.It is interesting to see that (7) closely tracks the DFT sparse signal for a range of k when c = 1.When k increases for a given n (i.e., the signal becomes less sparse), the relative number of needed measurements to specify the signal is decreased and the relation (7) is no longer valid.
A sufficient condition for these methods to work is that the matrix Φ • Ψ must satisfy the so called Restricted Isometric Property (RIP) [83], [84], [76]; which will be discussed in the following subsection: RIP: It is important that in the presence of noise, the algorithms produce the best approximate solution to within a precision; in other words, small perturbations in the signal caused by noise result in small distortions in the output solution.This characteristic is usually defined as stability.In compressed sensing, the stability of the reconstruction is determined by the characteristics of the sampling matrix Φ.We say that the matrix Φ has RIP of order k, when for all k-sparse vectors s, we have [31]: where 0 ≤ δ k < 1 (isometry constant).The RIP is a sufficient condition that provides us with the maximum and minimum power of the samples with respect to the input power and ensures that none of the k-sparse inputs fall in the null space of the sampling matrix.The RIP property essentially states that every 4k columns of the matrix Φ m×n must be almost orthonormal.The explicit construction of a matrix with such a property is difficult for any given n ≫ m; however, the problem has been studied in some cases [41], [85].Moreover, given such a matrix Φ, finding s (or alternatively x) via the minimization problem involves linear programming with n variables and m constraints which can be computationally expensive.
Among the matrices that satisfy the RIP condition are Gaussian random matrices.If Φ is a Gaussian random matrix with the number of rows satisfying (6), Φ•Ψ is also a Gaussian random matrix with the same number of rows and thus it satisfies RIP, which guarantees a stable recovery.Assume that instead of Φ • s, we have Φ • s + ν, where ν represents the additive noise vector.Since Φ • s + ν may not belong to the range space of Φ over k-sparse vectors, the ℓ 1 minimization of (8) may not produce a solution.Thus we employ the following minimization instead: where ǫ 2 is the maximum noise power.Let us denote the result of the above minimization for y = Φ • s + ν by ŝ, which is also a k-sparse vector.Now we have: Since both s and ŝ are k-sparse, ŝ − s is 2k-sparse, and by using the RIP, we get: Or equivalently, This shows that small perturbations in the input cause small perturbations in the output (stability).Moreover, as δ 2k approaches unity, the distortion in the output caused by the input additive noise becomes more significant; the ideal case is when δ 2k = 0.
Combinatorial: Another standard approach for reconstruction of compressed sampling is combinatorial.The sampling matrix Φ is found using bipartite graphs, and consists of binary entries, i.e., entries that are either 1 or 0. Binary search methods are then used to find an unknown k-sparse vector s ∈ R n , see e.g., [77], [86], [87], [88], [89], [90], [91], [92] and the references therein.Typically, the binary matrix Φ has m = O(k log n) rows, and there exist fast algorithms for finding the solution x from the m measurements (typically a linear combination).However, the construction of Φ is also difficult.
Information Theoretic: A more recent approach is adaptive and information theoretic [93].In this method, the signal s ∈ R n is assumed to be an instance of a vector random variable x = (x 1 , . . ., x n ) t , and the i th row of Φ is constructed using the value of the previous sample y i−1 .Tools from the theory of Huffman coding are used to develop a deterministic construction of a sequence of binary sampling vectors φ σ (i.e., the components of φ σ consist of 0 or 1) in such a way as to minimize the average number of sampling (rows of Φ) needed to determine a signal.
3) Almost Sparse Signals and Noisy Measurements: An important issue in compressed sampling is the robustness to noisy measurements.Specifically, the reconstruction of a ksparse vector s from noisy y i1 = φ i1 , x + η i should produce a k-sparse vector ŝ, which is close to s.
Another important aspect is the behavior of compressive sampling algorithms on almost sparse signals which are more likely to occur in applications than exactly k-sparse vectors.For example a k-sparse vector s may be corrupted by noise η ∈ R n , producing the vector s = s + η.Another example is the wavelet transform of an image which consists mostly of small coefficients and a few large coefficients.Obviously, any method for the sampling and reconstruction of sparse signals must also be well adapted to almost sparse signals, i.e., if a sampling and reconstruction method is applied to an almost k-sparse signal s, it must produce a k-sparse signal ŝ that includes the k most significant coefficients of s (up a to small error).
4) CS for Analog Signals: Recently, there have been efforts to extend the concept of CS to analog signals [72].The sparse signals are assumed to be the elements of a Shift-Invariant (SI) space generated by n kernels with period T .For instance, assume {a l (t)} n l=1 form a Riesz basis (see the footnote in page 6) [73] for L 2 ; the respective generated SI space is For example, the set of all lowpass signals with bandwidth B form an SI space with a single kernel (sinc(2Bt)).Similarly, the set of multiband signals with the frequency support defined as n fixed disjoint intervals of equal length form an nkernel SI space.The classical sampling theorems suggest that the sampling rate of n T is sufficient for the signal recovery for the space given in (14).Now, a signal x(t) ∈ SI is called k-sparse if in its basis representation, at most k generators among the total n are active; i.e., d l [k] = 0 for l / ∈ {l 1 , l 2 , . . ., l k } where {l 1 , l 2 , . . ., l k } is an arbitrary subset of {1, 2, . . ., n}.Similar to the discrete case, we are looking for sampling schemes that employ the inherent sparsity in order to decrease the sampling rate.In [72], it is shown that the rate could be as low as 2k T for k-sparse signals; this is the theoretical lower bound for invertible sampling rates.Instead of the sampling matrix Φ, a filter bank is used for sampling; the signal is passed through p filters (2k ≤ p < n) followed by samplers, each sampling at rate 1  T .The analog continuous signal x(t) is thus transformed to an infinite length vector; generalization of the CS for finite length vectors has also been studied in [72].

C. Sampling with Finite Rate of Innovation
The classical sampling theorem states that: where B is the bandwidth of x(t) with the Nyquist interval T s = 1 2B .These uniform samples can be regarded as the degrees of freedom of the signal; i.e., a lowpass signal with bandwidth B has one degree of freedom in each Nyquist interval T s .Replacing the sinc function with other kernels in (15), we can generalize the sparsity (bandlimitedness) in the Fourier domain to a wider class of signals known as the SI spaces: Similarly, the above signals have one degree of freedom in each T s period of time (the coefficients c i ).A more general definition for the degree of freedom is introduced in [4] and is named the Rate of Innovation.For a given signal model, if we denote the degree of freedom in the time interval of [t 1 , t 2 ] by C x (t 1 , t 2 ), the local rate of innovation is defined by and the global rate of innovation (ρ) is defined as provided that the limit exists; in this case we say that the signal has finite rate of innovation [4], [28], [94], [95].As an example, for the lowpass signals with bandwidth B we have ρ = 2B, which is the same as the Nyquist rate.In fact by proper choice of the sampling process, we are extracting the innovations of the signal.Now the question that arises is that whether the uniform sampling theorems can be generalized to the signals with finite rate of innovation.The answer is positive for a class of non-bandlimited signals including the SI spaces.Consider the following signals: where {ϕ r (t)} k r=1 are arbitrary but known functions and {t i } i∈Z is a realization of a point process with mean µ.The free parameters of the above signal model are {c i,r } and {t i }.Therefore, for this class of signals we have ρ = 2 µ ; however, the classical sampling methods cannot reconstruct these kinds of signals with the sampling rate predicted by ρ.There are many variations for the possible choices of the functions ϕ r (t); nonetheless, we just describe the simplest version.Let the signal x(t) be a finite mixture of sparse Dirac functions: where {t i } is assumed to be an increasing sequence.We intend to show that the samples generated by proper sampling kernels ϕ(t) (shown in Fig. 8) can be used to reconstruct the sparse Dirac functions.In fact we choose the kernel ϕ(t) to satisfy the so called Strang-Fix condition of order 2k: The above condition for the Fourier domain becomes: where Φ(Ω) denotes the Fourier transform of ϕ(t), and the superscript (r) represents the r th derivative.It is also shown that such functions are of the form ϕ(t) = f (t) * β 2k (t), where β 2k (t) is the B-spline of order 2k th and f (t) is an arbitrary function with nonzero DC frequency [94].Therefore, the function β 2k (t) is itself among the possible options for the choice of ϕ(t).
We can show that for the sampling kernels which satisfy the Strang-Fix condition (20), the innovations of the signal x(t) (19) can be extracted from the samples (y[j]): Thus, In other words, we have filtered the discrete samples (y[j]) in order to obtain the values τ r ; (23) shows that these values are only a function of the innovation parameters (amplitudes c i and time instants t i ).However, the values τ r are nonlinearly related to the time instants and therefore, the innovations cannot be extracted from τ r using linear algebra 6 .However, these nonlinear equations form a well-known system which was studied by Prony in the field of spectral estimation (see Sec. IV-A) and its discrete version is also employed in both real and Galois field versions of Reed-Solomon codes (see Sec. III-A).This method which is called the annihilating filter is as follows: The sequence {τ r } can be viewed as the solution of a recursive equation.In fact if we define ), we will have (see Sec. III-A and Appendices II, III for the proof of a similar theorem): In order to find the time instants t i , we find the polynomial H(z) (or the coefficients h i ) and we look for its roots.A recursive relation for τ r becomes: By solving the above linear system of equations, we obtain coefficients h i (for a discussion on invertibility of the left side matrix see [94], [96]) and consequently, by finding the root of H(z), the time instants will be revealed.It should be mentioned that the choice of τ 1 , . . ., τ 2k in (25), can be replaced with any 2k consecutive terms of {τ i }.After determining {t i }, (23) becomes a linear system of equations with respect to the values {c i } which could be easily solved.This reconstruction method can be used for other types of signals satisfying (18) such as the signals represented by piecewise polynomials [94].An important issue in nonlinear reconstruction is the noise analysis; for the purpose of denoising and performance under additive noise the reader is encouraged to see [28].

III. ERROR CORRECTION CODES: GALOIS AND REAL/COMPLEX FIELDS
The relation between sampling and channel coding is the result of the fact that over-sampling creates redundancy [97].This redundancy can be used to correct for "sparse" impulsive noise.Normally, the channel encoding is performed in finite Galois fields as opposed to real/complex fields; The reason is the simplicity of logic circuit implementation and insensitivity to the pattern of errors.On the other hand, the real/complex field implementation of error correction codes has stability problems with respect to the pattern of impulsive, quantization and additive noise [46], [53], [67], [98], [99], [100], [101].Nevertheless, such implementation has found applications in fault tolerant computer systems [102], [103], [104], [105], [106] and impulsive noise removal from 1-D and 2-D signals [32], [33].Similar to finite Galois fields, real/complex field codes can be implemented in both block and convolutional fashions.
A discrete real-field block code is an oversampled signal with n samples such that, in the transform domain (e.g., DFT), a contiguous number of high frequency components are zero.In general, the zeros do not have to be the high frequency components or contiguous.However, if they are contiguous, the resultant m equations (from the syndrome information domain) and m unknown erasures form a Vandermonde matrix, which ensures invertibility and consequently erasure recovery.The DFT block codes are thus a special case of Reed-Solomon (RS) codes in the field of real/complex numbers [97].A more general condition to have a Vandermonde matrix is that the indices of the zero frequencies form an arithmetic progression using mod(n).This is equivalent to having contiguous zeros in the Sorted DFT (SDFT7 ) domain [2], [53], [107].
Figure 9 represents a convolutional encoder of rate 1 2 of finite constraint length [97] and infinite precision per symbol.Fig. 9(a) is a systematic convolutional encoder and resembles an oversampled signal discussed in Section II if the FIR filter acts as an ideal lowpass filter.Fig. 9(b) is a nonsystematic encoder used in the simulations to be discussed subsequently.In case of additive impulsive noise, errors could be detected based on the side information that there are frequency gaps in the original oversampled signal (syndrome).In the following subsections, various algorithms for decoding along with simulation results are given for both block and convolutional codes.Some of these algorithms can be used in other applications such as spectral and channel estimation.

A. Decoding of Block Codes-ELP Method
Iterative reconstruction for an erasure channel is identical to the missing sampling problem [108] discussed in Section II-A.1 and therefore, will not be discussed here.Let us assume that we have a finite discrete signal x orig [i], where i = 1, . . ., l.The DFT of this sequence yields l complex coefficients in the frequency domain (X orig [j], j = 1, . . ., l).If we insert p consecutive zeros 8 to get n = l + p samples (X[j], j = 1, . . ., n) and take its inverse DFT, we end up with an oversampled version of the original signal with n complex samples (x[i], i = 1, . . ., n).This oversampled signal is real if Hermitian symmetry (complex conjugate symmetry) is preserved in the frequency domain, e.g., the set Θ of p zeros are centered at where r is a member of the complement of Θ and the index additions are in mod(n).After finding E[j] values, the spectrum of the recovered oversampled signal X[j] can be found by removing E[j] from the received signal (see (97) in Appendix II).Hence the original signal can be recovered by removing the inserted zeros at the syndrome positions of X[j].The above algorithm, called Error Locator Polynomial (ELP) algorithm, is capable of correcting any combination of erasures.However, if the erasures are bursty, the above algorithm may become unstable.To combat bursty erasures, we can use the SDFT [2], [53], [109], [110], [107] instead of DFT.The simulation results for block codes with erasure and impulsive noise channels are given in the following two subsections.

1) Simulation Results for Erasure Channels:
The simulation results for the ELP decoding implementation for n = 32, p = 16, and k = 16 erasures (a burst of 16 consecutive missing samples from position 1 to 16) are shown in Fig. 10.
Since consecutive sample losses represent the worst case [53], [110], the proposed method works better for random samples.In practice, the error recovery capability of this technique degrades with the increase of the block and/or burst size due to the accumulation of round-off errors.In order to reduce the round-off error, instead of the DFT, a transform based on the SDFT, or Sorted DCT (SDCT) can be used [2], [53], [110].These types of transformations act as an interleaver to break down the bursty erasures.
2) Simulation Results for Random Impulsive Noise Channel: There are several methods to determine the number, locations and values of the impulsive noise samples, namely: Modified Berlekamp-Massey for real fields [111], [112], ELP, IMAT, and CFAR-RDE.The Berlekamp-Massey method for real numbers is sensitive to noise and will not be discussed here [111].The ELP method is described in the next subsection, while the other two methods are discussed below.
ELP Method [96]: When the number and positions of the impulsive noise samples are not known, h t in ( 26) is not known for any t; therefore, we assume the maximum possible number of impulsive noise samples per block, i.e., k = ⌊ n−l 2 ⌋ as given in (94) in Appendix II.To solve for h t , we need to know only n − l samples of E in the positions where zeros are added in the encoding procedure.Once the values of h t are determined from the pseudo-inverse [96], the number and positions of impulsive noise can be found from (96) in Appendix II.The actual values of the impulsive noise can be determined from (26) as in the erasure channel case.For the actual algorithm, please refer to Appendix III.As we are using the above method in the field of real numbers, exact zeros of {H k }, which are the DFT of {h i }, are rarely observed; consequently, the zeros can be found by thresholding the magnitudes of H k .Alternatively, the magnitudes of H k can be used as a mask for soft-decision; in this case, thresholding is not needed.
CFAR-RDE and IMAT Methods [32]: The Constant False Alarm Rate with Recursive Detection Estimation (CFAR-RDE) method is similar to the Iteration Method with Adaptive Thresholding (IMAT) with the additional inclusion of the CFAR module to estimate the impulsive noise; CFAR is extensively used in radars to detect and remove clutter noise from data.In CFAR, we compare the noisy signal with its neighbors and determine if an impulsive (sparse) noise is present or not (using soft decision) 9 .After removing the impulsive noise in a "soft" fashion, we estimate the signal using the iterative method for an erasure channel as described in Section II-A.1 for random sampling or using the ELP method.The impulsive noise and signal detection and estimation go through several iterations in a recursive fashion as shown in Fig. 11.As the number of recursions increases, the certainty about the detection of impulsive noise locations also increases; thus, the soft decision is designed to act more like the hard decision   at the latter parts of the iteration steps, which yields the error locations.Meanwhile, further iterations are performed to enhance the quality of the original signal since suppression of the impulsive noise also suppresses the original signal samples at the location of the impulsive noise.The improvement of using CFAR-RDE over a simple soft decision RDE is shown in Fig. 12.

B. Decoding for Convolutional Codes
The performance of convolutional decoders depends on the coding rate, the number and values of FIR taps for the encoders, and the type of the decoder.Let us take the convolutional encoder of rate 1 2 of Fig. 9(b) as our platform for simulations.For simulation results, the taps of the encoder of Fig. 9(b) are The input signal is taken from a uniform random distribution of size 50 and the simulations are run 1000 times and then averaged.The following subsections describe the simulation results for erasure and impulsive noise channels.
1) Decoding for Erasure Channels: For the erasure channels, we employ two methods as described below: Iterations with Averaging: An iterative method to decode for erasures in the convolutional code of Fig. 9(b) is shown in Fig. 13.This figure is designed for the rate 1  2 convolutional encoder.At each stage of decoding, the results of the two branches are averaged.For the rate 1  2 and specific FIR structure, the SNR improvement versus the relative rate of erasures with respect to the theoretical maximum rate of correction capability (full capacity) is shown in Fig. 14.This figure shows that the SNR values gradually decrease as the sampling rate reaches the full capacity.
Decoding Using the Generator Matrix: The generator matrix of a convolutional encoder of the type depicted in Fig. 9(b) with taps given in (27) can be shown to be [5] An iterative decoding scheme for this matrix representation is similar to that of Fig. 3 except that the operator G consists of the generator matrix, a mask (erasure operation) and the transpose of the generator matrix.If the rate of erasure does not exceed the encoder full capacity, the matrix form of the operator G can be shown to be a nonnegative definite square matrix and therefore its inverse exists [1], [45].Thus, with a proper choice of the relaxation parameter, the iteration represented in Fig. 13 converges to the actual signal.
By using the above operator G in our iterative simulations, better results can be obtained in comparison with the averaging method of Fig. 14. Figure 15 shows that the SNR values gradually decrease as the rate of erasure reaches its maximum (capacity).This figure shows that the generator matrix approach of decoding using the iteration matrix performs much better than the averaging method represented in Figs. 13 and 14.However, the complexity of the matrix approach is higher than the averaging method.
2) Decoding for Impulsive Noise Channels: Let us consider x and y as the input and the output streams of the encoder, respectively, related to each other through the generator matrix G as y = Gx.
Denoting the observation vector at the receiver by ŷ, we have ŷ = y + ν, where ν is the impulsive noise vector.Multiplying ŷ by the transpose of the parity check matrix H T , we get Multiplying the resultant by the right pseudo-inverse of the H T , we derive: Thus by multiplying the received vector by H(H T H) −1 H T , we obtain an approximation of the impulsive noise.In the IMAT method, we apply the operator H(H T H) −1 H T in the iteration of Fig. 5; the threshold level is reduced exponentially at each iteration step.The block diagram of IMAT in Fig. 5 is modified as shown in Fig. 16.
For simulation results, we use the generator matrix shown in (28); its generator matrix can be calculated from [5] and is given below: In our simulations, the locations of the impulsive noise samples are generated randomly and their amplitudes have Gaussian distributions with zero mean and variance equal to 1, 2, 5 and 10 times the variance of the encoder output.The  results are shown in Fig. 17 after 300 iterations.This figure shows that the high variance impulsive noise has a better performance.

C. LDPC Codes
Low Density Parity Check (LDPC) codes are another example of the application of sparse matrices in reducing complexity [113], [114], [115], [116].LDPC codes are linear block codes whose parity check matrix, H is sparse 10 .The term "low-density" refers to this fact.A code with a sparse H matrix which has equal number of 1's in each row and equal number of 1's in each column is called a regular LDPC code.The iterative decoding algorithm that is used to decode a given code, with either a high or low density of ones in the H matrix, is such that the decoding complexity grows almost linearly with the block length.The coefficient of this increase is directly dependent on the number of 1's in the H matrix as is explained below.
For decoding LDPC codes, we use Tanner graphs [117]; these graphs may also be useful in decoding real-field codes when the parity matrix is sparse and possibly binary.Each iteration module of the decoding algorithm consists of two steps.In the first step, for each row, certain operations are performed for each 1 on that row.Similarly for each column, certain operations are performed for each 1 on that column in the second step.Therefore, the sparser the matrix is, the lower the decoding complexity will be.
As an example, consider an H matrix of a regular LDPC code which has 6 and 3 ones in its rows and columns, respectively.Consider another code whose H matrix has similarly 8 and 4 ones.The decoding complexity of the former code is 75% of the latter.Note that in general (assuming that 10 Note that the generator matrix G of an LDPC code is not necessarily sparse and the above discussions are only valid for decoding and not encoding. H matrix has full rank), for a given code rate R, the ratio of the number of ones in each column to the number of ones in each row is a constant which is equal to 1 − R. Therefore, if the number of ones in the rows is scaled by a factor of c, the number of ones in the columns has to be scaled with the same factor to keep the code rate fixed.Consequently, the decoding complexity is scaled by c.
1) LDPC Decoding Using Linear Programming (LP): 11The linear programming decoding of block codes was initially suggested by Feldman in 2003 [118], [119].Consider Maximum Likelihood (ML) decoding of block codes; this is an optimization problem where we want to minimize a cost function by finding a codeword that has the minimum Euclidean distance to the received vector from the channel.In the logarithm domain, the cost function can be transformed into a linear function of the coded bits (see [120]).Note that the codeword has to satisfy a linear set of parity check equations.Therefore, the problem of decoding a block code can be considered as an LP optimization, and the computational complexity grows exponentially with the block length.However, a sub-optimal solution to this problem significantly reduces the complexity [120].This modification however makes the overall performance and complexity of the method comparable to that of the conventional decoding method of LDPC codes, the Maximum A Posteriori (MAP) decoding on a bit level.
The performance of the LP decoding method is usually inferior to that of MAP decoding but it has been shown to be superior to that of the MIN-SUM, which is a simplification of the original MAP decoding algorithm [121].

IV. SPECTRAL ESTIMATION
In this section, we review some of the methods which are used to evaluate the frequency content of data [8], [9], [10], [11].In the field of signal spectrum estimation, there are several methods which are appropriate for different types of signals.Some methods are known to better estimate the spectrum of wideband signals, where some others are proposed for the extraction of narrow-band components.Since our focus is on sparse signals, it would be reasonable to assume sparsity in the frequency domain, i.e., we assume the signal to be a combination of several sinusoids plus white noise.
Conventional methods for spectrum analysis are nonparametric methods in the sense that they do not assume any model (statistical or deterministic) for the data, except that it is zero or periodic outside the observation interval.As a well known non-parametric method, we can name the periodogram Pper (f ) that can be computed via the FFT: where m is the number of observations, T s is the sampling interval (usually assumed as unity), and x r is the signal.Although non-parametric methods are robust with low computational complexity, they suffer from fundamental limitations.
The most important limitation is its resolution; too closely spaced harmonics cannot be distinguished if the spacing is smaller than the inverse of the observation period.
To overcome this resolution problem, parametric methods are devised.Assuming a statistical model with some unknown parameters, we can get more resolution by estimating the parameters from the data at the cost of more computational complexity.Theoretically, in parametric methods we can resolve two closely spaced harmonics with limited data length if the SNR goes to infinity 12 .
In this section, we shall discuss three parametric approaches for spectral estimation: the Pisarenko, the Prony, and the well known MUSIC algorithms.The first two are mainly used in spectral estimation, while the MUSIC method that was first developed for array processing has been extended to spectral estimation (see the next section).It should be noted that the parametric methods unlike the non-parametric approaches, require prior knowledge of the model order (the number of tones).This can be decided from the data using the Minimum Discription Length (MDL) method discussed in the next section.

A. Prony Method
The Prony method was originally proposed for modeling the expansion of gases [122]; however, now it is known as a general spectral estimation method.In fact, Prony tried to fit a weighted mixture of k damped complex exponentials to 2k data measurements.The original approach is related to the noiseless measurements; however, it has been extended to produce the least squared solutions for noisy measurements.We focus only on the noiseless case here.The signal is modeled as a weighted mixture of k complex exponentials with complex amplitudes and frequencies: where x r is the noiseless discrete sparse signal consisting of k exponentials with parameters where a i , θ i , f i represent the amplitude, phase and the frequency (f i is a complex number in general), respectively.Let us define the polynomial H(z) such that its roots represent the complex exponential functions related to the sparse tones (see section II-C on FRI, (26) on ELP and Appendix II): By shifting the index of (33) and multiplying by the parameter h j and summing over j we get: 12 Similar to array processing to be discussed in the next section, we can resolve any two closely spaced sources conditioned on 1) limited snapshots and infinite SNR or 2) limited SNR and infinite number of observations, while the spatial aperture of the array is kept finite.

TABLE V
BASIC PRONY ALGORITHM 1) Solve the recursive equation in (36) to evaluate h i 's.
2) Find the roots of the polynomial represented in (35); these roots are the complex exponentials defined as z i in (33).3) Solve (33) to obtain the amplitudes of the exponentials (b i 's).
where r is indexed in the range k + 1 ≤ r ≤ 2k.This formula implies a recursive equation to solve for h i [9].After the evaluation of h i 's, the roots of ( 35) yield the frequency components.Hence the amplitudes of the exponentials can be evaluated from a set of linear equations given in (33).The basic Prony algorithm is given in Table V.
The Prony method is sensitive to noise, which was also observed in the ELP and the annihilating filter methods discussed in sections III-A and II-C.There are extended Prony methods that are better suited for noisy measurements.

B. Pisarenko Harmonic Decomposition (PHD)
The PHD method is based on the polynomial of the Prony method that utilizes the eigen-decomposition of the data covariance matrix [11].Assume k complex tones are present in the spectrum of the signal.Then, decompose the covariance matrix of k + 1 dimensions into a k-dimensional signal subspace and a 1-dimensional noise subspace that are orthogonal to each other.The noise subspace is spanned by the eigenvector v which satisfies Rv = σ 2 v.
By including the additive noise, the observations are given by: where y is the observation sample and ν is a zero-mean noise term that satisfies E{ν r ν r+i } = σ 2 δ[i].By replacing x r = y r − ν r in the difference equation (36), we get which reveals the Auto-Regressive Moving Average (ARMA) structure (order (k, k)) of the observations y r as a random process.To benefit from the tools in linear algebra, let us define the following vectors Now (38) can be written as Multiplying both sides of (40) by y and taking the expected value, we get E{yy H }h = E{yν H }h. Note that 1) Given the model order k (number of sinusoids), find the autocorrelation matrix of the noisy observations with dimension k + 1 (Ryy).2) Find the smallest eigenvalue (σ 2 ) of Ryy and the corresponding eigenvector (h). 3) Set the elements of the obtained vector as the coefficients of the polynomial in (35).The roots of this polynomial are the estimated frequencies.
We thus have an eigen-equation which is the key equation of the Pisarenko method.The eigenequation of (43) states that the elements of the eigenvector of the covariance matrix, corresponding to the smallest eigenvalue (σ 2 ), are the same as the coefficients in the recursive equation of x[r] (coefficients of the ARMA model in (38)).Therefore, by evaluating the roots of the polynomial with coefficients that are the elements of this vector, we can find the tones in the spectrum.
Although we started by eigen-decomposition of R yy , we observed that only one of the eigenvectors is required; the one that corresponds to the smallest eigenvalue.This eigenvector can be found using simple approaches (in contrast to eigendecomposition).The PHD method is briefly shown in Table VI.
A different formulation of the PHD method with linear programming approach (refer to Sec.VI-D.2 for description of linear programming) for array processing is studied in [123].The PHD method is shown to be equivalent to a geometrical projection problem which can be solved using ℓ 1norm optimization.Let us convert the autocorrelation matrix R m×m into an m 2 × 1 vector.If R 1 and R 2 are the spatial autocorrelation matrices of two uncorrelated sources, the overall autocorrelation matrix is R 1 + R 2 when both sources are active, which is translated to a similar summation of the corresponding vectors.Although it seems that the vectorized notation generates a subspace for uncorrelated sources, only linear combinations of m 2 ×1 vectors with positive coefficients are acceptable (resulting in a hyper-cone) since R is positive definite.When the number of sources is less than the number of sensors, the respective vector is restricted to lie on the surface of the cone.Due to the existence of noise, the observed vector falls within the cone.It is shown [123] that the PHD method projects the measured vectors into the surface.For the purpose of the projection, ℓ 1 -norm optimization can be employed instead of the common Pisarenko method.

C. MUSIC
MUltiple SIgnal Classification (MUSIC), is a method originally devised for high resolution source direction estimation in the context of array processing that will be discussed in the next section [124].The inherent equivalence of the array processing and time series analysis paves the way for the employment of this method in spectral estimation.MUSIC can be understood as a generalization and improvement of the Pisarenko method.It is known that in the context of array processing, MUSIC attains the statistical efficiency 13 in the limit of asymptotically large number of observations [12].This is a valuable property since MUSIC contains only a 1-D search while efficient ML methods require k-dimensional searches. 14USIC estimates the frequencies by finding k local maxima of a 1-D spectrum function while ML exhaustively searches a k-dimensional parameter space to find the global maximum of a k-variable spectrum function.
In the PHD method, we construct an autocorrelation matrix of dimension k + 1 under the assumption that its smallest eigenvalue (σ 2 ) belongs to the noise subspace.Then we use the Hermitian property of the covariance matrix to conclude that the noise eigenvector should be orthogonal to the signal eigenvectors.In MUSIC, we extend this method, using a noise subspace of dimension greater than one to improve the performance.We also use some kind of averaging over noise eigenvectors to obtain a more reliable signal estimator.
The data model for the sum of exponentials plus noise can be written in the matrix form as where the length of data is taken as m > k and . . .e jω k . . . . . . . . .
where ν represents the noise vector.Since the frequencies are different, A is of rank k and the first term in (44) forms a k-dimensional signal subspace, while the second term is randomly distributed in both signal and noise subspaces; i.e., unlike the first term, it is not confined to a subspace of lower dimension.The correlation matrix of the observations is given by where the noise is assumed to be white with variance σ 2 .
If we decompose R into its eigenvectors, k eigenvalues corresponding to the k-dimensional subspace of the first term of (46) are essentially greater than the remaining m − k values, σ 2 , corresponding to the noise subspace; thus, by sorting the eigenvalues, the noise and signal subspaces can be determined.Assume ω is an arbitrary frequency and e(ω) = [1, e jω , . . ., e j(m−1)ω ].The MUSIC method estimates the spectrum content of the signal at frequency ω by projecting the vector e(ω) into the noise subspace.When the projected vector is zero, the vector e(ω) falls in the signal subspace and most likely, ω is among the spectral tones.In fact, the frequency content of the spectrum is inversely proportional to the ℓ 2 -norm of the projected vector: where v i 's are eigenvectors of R corresponding to the noise subspace.
The k peaks of P MU (ω) are selected as the frequencies of the sparse signal.The determination of the number of frequencies (model order) in MUSIC is based on the MDL and Akaike Information Criterion (AIC) methods to be discussed in the next section.
Figure 18 shows results for various spectral line estimation methods discussed above.The first upper figure shows the original spectral lines, and the three other figures belong to the results of the estimation methods.We can see that the Prony method (which is similar to ELP and annihilating filter of Sec.II-C and ( 26)) does not yield good results, while the PHD method is better.

V. SPARSE ARRAY PROCESSING
We address three types of array processing: 1-Estimation of Multi-Source Location (MSL) and Direction of Arrival (DOA), 2-Sparse Array Beam-forming and Design, and 3-Sparse Sensor Networks.The first topic is related to estimating the direction and/or the location of multiple targets; this problem is very similar to the problem of spectral estimation dealt with in the previous section.The second topic is related to the design of sparse arrays with some missing and/or random array sensors.The last topic, depending on the type of sparsity, is either similar to the second topic or related to CS of sparse signal fields in a network.Below each topic will be briefly described.

A. Array Processing for MSL and DOA Estimation
Among the important fields of active research in array processing are MSL and DOA estimation [124], [125], [126].In such schemes, a passive or active array of sensors is used to locate the sources of narrow-band signals.Some applications may assume far-field sources (e.g.radar signal processing) where the array is only capable of DOA estimation, while other applications (e.g.biomedical imaging systems) assume near-field sources where the array is capable of locating the sources of radiation.A closely related field of study is spectral estimation due to similar linear statistical models.The stochastic sparse signals pass through a partially known linear transform (e.g., array response or inverse Fourier transform) and are observed in a noisy environment.
In the array processing context, the common temporal frequency of the source signals are known.Spatial sampling of the signal is used to extract direction of the signal (spatial frequency).As a far-field approximation, the signal wavefronts are assumed to be planar.Consider a signal arriving with angle ϕ as in Fig. 19.Simultaneous sampling of this wavefront on the array will exhibit a phase change of the signal from sensor to sensor.In this way, discrete samples of a complex exponential are obtained, where its frequency can be translated to the direction of the signal source.The response of a Uniform Linear Array (ULA) to a wavefront impinging on the array from direction ϕ is where d is the inter-element spacing of the array, λ is the wavelength, and n is the number of sensors in the array.When multiple sources are present, the observed vector is the sum of the response(sweep) vectors and noise.This resembles the spectral estimation problem with the difference that sampling of the array elements is not limited in time.In fact, in array processing an additional degree of freedom (the number of elements) is present; thus, array processing is more general than spectral estimation.Two main fields in array processing are MSL and DOA for estimating the source locations and directions, respectively; for both purposes, the angle of arrival (azimuth and elevation) should be estimated while for MSL an extra parameter of range is also needed.The simplest case is the 1-D ULA (azimuthonly) for DOA estimation.
For the general case of k sources with angles ϕ 1 , . . ., ϕ k with respect to the array, the ULA response is given by the matrix A(ϕ) = [a(ϕ 1 ), . . ., a(ϕ k )], where the vector ϕ of DOA's is defined as ϕ = [ϕ 1 , . . ., ϕ k ].In the above notation, A is a matrix of size n × k and a(ϕ i )'s are column vectors.Now, the vector of observations at array elements (x[i]) is given by where the vector s[i] represents the multi-source signals and ν[i] is the white Gaussian noise.Source signals and additive noise are assumed to be zero-mean and i.i.d.normal processes with covariance matrices P and σ 2 I, respectively.With these assumptions, the observation vector x[i] will also follow an n-dimensional zero-mean normal distribution with the covariance matrix In the field of DOA estimation, extensive research has been accomplished in 1) source enumeration, and 2) DOA estimation methods.Both of the subjects correspond to the determination of parameters k and ϕ.
Although some methods are proposed for simultaneous detection and estimation of the model statistical characteristics [127], most of the literature is devoted to two-stage approaches; first, the number of active sources is detected and then their directions are estimated by techniques such as Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT) 15 [128], [129], [130], [131], [132].Usually the joint detection-estimation methods outperform the two-stage approaches with the cost of higher computational complexity.Source enumeration techniques often rely on the eigenvalues of the covariance matrix obtained from the received data.These eigenvalues are indicators of the energy level in certain principal components of the data.In the MSL estimation step, similar to the MUSIC method, the signal is decomposed into two orthogonal subspaces of signal and noise.The signal subspace corresponds to the column space of the sweep matrix A; thus, proper estimation of the signal subspace leads to estimation of the matrix A and consequently ϕ and range [12], [13].In the sequel, we take a closer look at these two steps.
1) Minimum Description Length (MDL): One of the most successful methods in array processing for source enumeration is the use of MDL criterion [133].This technique is very powerful and outperforms its older versions including AIC [134], [135], [136].Hence, we confine our discussion to MDL algorithms.
Preliminaries: MDL is an optimum method of finding the model order and parameters for the most compressed representation of the observed data.For the purpose of statistical modeling, the MAP probability or, the suboptimal criterion of ML is used; more precisely, conditioned on the observed data, the maximum probability among the possible options is found (hypotheses testing) [137].When the model parameters are not known, the MAP and ML criteria result in the most complex approach; consider fitting a finite sequence of data to a polynomial of unknown degree [37]: where is the observed Gaussian noise and k is the unknown model order (degree of the polynomial P (t)) which determines the complexity.Clearly, m − 1 is the maximum required order for unique description of the data (m observed samples) and the ML criterion, always selects this maximum value ( kML = m − 1); i.e., the ML method forces the polynomial P (t) to pass through all the points.MDL, on the other hand, yields a sparser solution ( kMDL < m − 1).
Due to the existence of additive noise, it is quite rational to look for a polynomial with degree less than m which also takes the complexity order into account.In MDL, the idea of how to consider the complexity order is borrowed from information theory: Given a specific statistical distribution, we can find an optimum source coding scheme (e.g., Huffman coding) which attains the lowest average code length for the symbols.Furthermore, if p x is the distribution of the source x and q x is another distribution, we have [138]: where H(x) is the entropy of the signal.This implies that the minimum average code length is obtained only for the correct source distribution (model parameters); in other words, the choice of wrong model parameters (distribution function) leads to larger code lengths.When a particular model with the set of parameters θ is assumed for the data a priori, each time a sequence x is received, the parameters first should be estimated.The optimum estimation method is usually the ML estimator which results in θML .Now, the probability distribution for a received sequence x becomes p(x| θML ) which according to information theory, requires an average code length of − log p(x| θML (x)) bits.In addition to the data, the model parameters should also be encoded which in turn require κ 2 log(m) bits where κ is the number of independent parameters to be encoded in the model and m is the number of data points 16 .Thus, the two part MDL selects the model that minimizes the whole required code length which is given by [139]: The first term is the ML term for data encoding and the second term is a penalty function that inhibits the number of free parameters of the model to get very large.MDL Source Enumeration: In the source enumeration problem, our model is a multivariate Gaussian random process with zero mean and covariance of the type shown in (51), where the number of active sources is unknown.In some enumeration methods (other than MDL), the exact form of ( 51) is employed which results in high computational complexity.In the conventional MDL method, it is assumed that the model is a covariance matrix with a spherical subspace 17 of dimension n − k.Suppose the sample covariance matrix is and assume the ordered eigenvalues of R are λ1 ≥ λ2 ≥ • • • ≥ λn , while the ordered eigenvalues of the exact covariance The normal distribution function of the received complex data x is [129] where tr(.) stands for the trace operator.The ML estimate of signal eigenvalues in R are λi , i = 1, . . ., k with the respective eigenvectors i=k+1 are all noise eigenvectors.Thus, the ML estimate of R given R is In fact, since we know that R has a spherical subspace of dimension n − k, we correct the observed R to obtain R ML .Now, we calculate − log p(x|R ML ) ; from Appendix IV, we have: which is independent of k and can be omitted in the minimization of (54).Thus, for the first term of (54) we only need the determinant |R ML | which is the product of the eigenvalues, and the MDL criterion becomes (see Appendix IV): where κ is the number of free parameters in the distribution.This expression should be computed for different values of 0 ≤ k ≤ n − 1 and its minimum point should be kMDL .Note that we can subtract the term m n i=1 log( λi ) from the expression, which is not dependent on k to get the well known MDL criterion [129] (also see Appendix IV): where the first term is the likelihood ratio for the sphericity test of the covariance matrix.This likelihood ratio is a function of arithmetic and geometric means of the noise subspace eigenvalues [140].Figure 20 is an example of MDL performance in determining the number of sources in array processing.It is evident that in low SNR's, the MDL has a strong tendency to underestimate the number of sources, while as SNR increases, it gives a consistent estimate.Also at high SNR's, underestimation is more probable than overestimation.Now we compute the number of independent parameters (κ) in the model.Since the noise subspace is spherical, the choice of eigenvectors in this subspace can accept any arbitrary orthonormal set; i.e., no information is revealed when these vectors are known.Thus, the set of parameters is {λ 1 , . . ., λ k , σ 2 , v 1 , . . ., v k }.The eigenvalues of a hermitian matrix (correlation matrix) are all real while the eigenvectors are normal complex vectors.Therefore, the eigenvalues (including σ 2 ) introduce k + 1 degrees of freedom.The first eigenvector has 2n − 2 degrees of freedom (since its first nonzero element can be adjusted to unity); while the second, due to its orthogonality to the first eigenvector, has 2n − 4 degrees of freedom.With the same argument, it can be shown that there are 2(n − i) free parameters in the i th eigenvector; hence where the last integer 1 can be omitted since it is independent of k.
The two-part MDL, despite its very low computational complexity, is among the most successful methods for source enumeration in array processing.Nonetheless, this method does not reach the best attainable performance for finite number of measurements [141].The new version of MDL, called one-part or Refined MDL has improved the performance for the cases of finite measurements which has not been applied to the array processing problem [37].

B. Sparse Array Beam-forming and Design
Sparsely sampled irregular and random arrays have been used in several fields such as radar, sonar, ultrasound imaging, and seismology.The objective is to improve economy of design, e.g., achieve better resolution for a given maximum number of array elements.The whole idea of sparse arrays depends on having a feature which is better than needed, e.g., sidelobe suppression which can be traded for resolution.
In array signal processing, the aperture smoothing function plays the same role as the transfer function of an LTI filter.Assume that n elements are spaced as a 1-D uniform linear array with a distance d and are located at x i = i • d for i = 0, 1, . . ., n−1 , as shown in Fig. 19.The aperture smoothing function when each element is weighted by a scalar w i is The variable u is defined by u = sin ϕ where ϕ is called the azimuth angle, λ the wavelength, and the weights, w i , are a standard window function [14].The weights, w i , could also be angle-dependent if the individual elements are directional (dipoles).The aperture smoothing function determines how the wavefield Fourier transform is filtered by a finite aperture, just as the filter transfer function determines how the received signal spectrum is shaped by the filtering operation.The condition for avoiding aliasing is that the argument in the exponent satisfies where β x = 2π u λ is the x component of the wave-number.The relationship between the array pattern for a regular 1-D array and a filter transfer function is By using these parallels, the time-frequency sampling theorem T ≤ π ωmax translates into the spatial sampling theorem d ≤ λmin 2 .The aperture smoothing function at the Nyquist rate of a uniform linear array is depicted in Fig. 21.In addition, array processing has some additional degrees of freedom that are not so common in time-frequency processing, such as irregular sampling, and things that are not found at all such as elements along curved surfaces.
A sparse array has elements removed from the aperture (thinning), and there are basically two approaches to finding the best thinning.The first is through optimization; This can be performed in 1-D and 2-D, and also with elements on an underlying regular grid or with freely chosen positions in the aperture (avoiding overlapping elements).It also makes a difference whether the array is flat or curved.The most popular optimization criterion is to minimize the peak sidelobe in the beampattern with a condition on the maximum mainlobe width.
The second approach is more heuristic, and is based on the experience that it is often not possible to formulate optimality in a sense that is compatible with standard optimization algorithms.Also, in an imaging system, more degrees of freedom can be obtained in the optimization, if one allows the layouts of the transmitter and the receiver to be different.We will still assume that the receiver and transmitter arrays are located in the same position (monostatic), but now there is partial or no overlap between the selected elements.
In the optimization approach, the challenge is the combinatorial problem of finding the best layout of sparse elements for one and two dimensions.The optimization problem is creation of beampatterns with low mainlobe width and low sidelobes.The layout optimization methods consist of LP, Genetic Algorithms (GA) and Simulated Annealing (SA) [142].
With a sparse optimized random array, we can get an array pattern as shown in Fig. 22. Comparing this figure to Fig. 21, we see that with even 1  4 of the elements, we can get comparable results for the peak sidelobe value.It is also evident that thinning has a price in that the overall energy in the sidelobe region has increased.
Another example is shown in Fig. 23 where enhanced simulated annealing has been used.This approach can handle arbitrary layouts where the elements are not locked to a Cartesian grid and where each individual element can have different directivity.Also, it can easily be extended to handle wideband excitations and the response optimization in the near  field (focused arrays) [143].
The simple examples of Fig. 23 illustrate the geometries that the algorithm can handle and shows a sparse hex-grid optimized array with hexagonally shaped elements, a linear grid design with square elements, and a ring array having increasing element sizes for increasing ring radii.
One of the optimization problems in array processing is to design the sparsest array for a given beam pattern [142], [145].In this case, as shown in Table II in row 12, the random sparsity is in the space domain and the information domain is a desirable aperture array pattern; we conjecture that, for a given beampattern, the IMAT used for random sampling and impulsive noise removal (Section II-A.1) can be used in synthesizing the number, locations and weights of the sparse array elements.Another optimization issue for a given number of array elements is equivalent to designing the weights and locations of sparse random arrays to yield an optimum array/beam-pattern [146], see also Figs. 24 and 25; this kind of optimization is unique for this application and has no parallel in other applications.The weights and the array locations can be optimized separately or jointly; joint optimization of thinning pattern and weights has been reported in sonar arrays.Joint optimization of positions and weights is also possible, usually by iterating over a sequence of position optimization followed by weight optimization.Optimization of the element positions of a sparse array is considerably more difficult than weight optimization.The reason is that for an array with n elements, the number of combinations for selecting k array elements is When n, k are large or for 2-D arrays, an exhaustive search is out of the question.There are several ways that this number can be reduced.The optimization criteria are the same for the layout problem as for the weight problem, i.e., • Minimize main sidelobe of the pattern while restricting the width of the main-lobe below a given limit.
• Minimize the total energy of the pattern wasted in the sidelobes while keeping the peak value of the main lobe above a given threshold.
The second heuristic approach builds on the properties of some of the basic building blocks of sparse arrays: random arrays, binned arrays, and periodic arrays.
Assume an array with n elements where only k elements are kept after random thinning.The average array pattern is equal to that of an aperture weighted with the probability density function of the thinning.This also determines the shape of the main lobe.For a uniform thinning, it turns out that the ratio of the average sidelobe power to the main lobe power is 1 k .But the variance is about k for |u| = | sin(ϕ)| > λ L , where L is the aperture.The relative peak level of a 1-D random array is √ k ln(k) [147] which gives, in our experience, a fairly good estimate of the peak level.
In the binned array, the aperture is divided into k equal size bins and one element is chosen at random in each bin.This resembles a nearest neighbor restriction as there can be no more than two neighbor elements in a 1-D binned array.With a uniform distribution in each bin, it can be shown that the binned array has the same properties as the random array except that the variance does not reach the value of k until the angle reaches |u| = | sin(ϕ)| > k λ L .Thus the binned array has random sidelobes close to the mainlobe that are much lower than the random array [147].
Periodic arrays are thinned with a periodicity, e.g., 1001001001, 101010101, or 11001100.This means that grating lobes are formed whose position and size are easily predicted.Periodic arrays have turned out to be particularly useful in imaging systems where one can use different periodicities for the transmitter and the receiver.As the two-way beampattern is the product of the transmitter and the receiver beampatterns, by proper design, grating lobes formed by the transmitter can be suppressed by the receiver and vice versa.A special case is the Vernier array which has periodicities p and p − 1 [148].
The simplest way to utilize the above properties is to use the properties of periodic arrays and variants where grating lobes in the transmitter are used to cancel the receiver grating lobes and vice versa.References [149], [150]  Arrays which have no overlap between the transmitter and receiver elements are • Binned arrays with different layouts for transmission and reception • Polar binned arrays where the bins are along rays emanating from the center of the array The proposed configurations have been tested on real data for an array that has been built.Good correspondence with simulations have been obtained.
Curved Arrays: More recently, there has been some research on optimizations of curved arrays.The beam pattern can be found by projecting the elements onto a flat surface (Fourier projection-slice theorem), therefore, the beampattern is equivalent to that of an array with unequal spacing between the elements.This reduces grating lobes.There is an optimal radius of the curvature for the array, which minimizes the peak sidelobe in the two-way response [151].The optimal value varies with the thinning method.Similar work for the optimization of one-way response has also been applied to cylindrical arrays for sonar applications [152].

C. Sparse Sensor Networks
Wireless sensor networks typically consist of a large number of sensor nodes, spatially distributed over a region of interest, that observe some physical environment including acoustic, seismic, and thermal fields with applications in a wide range of areas such as health care, geographical monitoring, homeland security, and hazard detection.The way sensor networks are used in practical applications can be divided into two general categories: 1) There exists a central node known as the Fusion Center (FC) that retrieves relevant field information from the sensor nodes and communication from the sensor nodes to FC generally takes place over a power-and bandwidth-constrained wireless channel.2) Such a central node does not exist and the nodes take specific decisions based on the information they obtain and exchange among themselves.Issues such as distributed computing and processing are of high importance in such scenarios.In general, there are three main tasks that should be implemented efficiently in a wireless sensor network: sensing, communication, and processing.The main challenge in design of practical sensor networks is to find an efficient way of jointly performing these tasks, while using the minimum amount of system resources (computation, power, bandwidth) and satisfying the required system design parameters (such as distortion levels).For example, one such metric is the so-called energy-distortion tradeoff which determines how much energy the sensor network consume in extracting and delivering relevant information up to a given distortion level.Although many theoretical results are already available in the case of point-to-point links in which separation between source and channel coding can be assumed, the problem of efficiently transmitting or sharing information among a vast number of distributed nodes remains a great challenge.This is due to the fact that well-developed theories and tools for distributed signal processing, communications, and information theory in large-scale networked systems are still under development.However, recent results on distributed estimation or detection indicate that joint optimization through some form of sourcechannel matching and local node cooperation can result in significant system performance improvement [153], [154], [155], [156], [157].
1) How sparsity can be exploited in a sensor network: Sparsity appears in many applications for which sensor networks are deployed, e.g., localization of targets in a large region or estimation of physical phenomena such as temperature fields that are sparse under a suitable transformation.For example, in radar applications, under a far-field assumption, the observation system is linear and can be expressed as a matrix of steering vectors [158], [159].In general, sparsity can arise in a sensor network from two main perspectives: 1) Sparsity of node distribution in spatial terms 2) Sparsity of the field to be estimated Although nodes in a sensor network can be assumed to be regularly deployed in a given environment, such an assumption is not valid in many practical scenarios.Therefore, the nonuniform distribution of nodes can lead to some type of sparsity in spatial domain that can be exploited to reduce the amount of sensing, processing, and/or communication.This issue is subsequently related to extensions of the nonuniform sampling techniques to two-dimensional domains through proper interpolation and data recovery when samples are spatially sparse [38], [160].The second scenario that provides a proper basis for exploiting the sparsity concepts arises when the field to be estimated is a sparse multi-dimensional signal.From this point of view, ideas such as those presented earlier in the context of compressed sensing (Sec.II-B) provide the proper framework to address the sparsity in such fields.
Spatial Sparsity and Interpolation in Sensor Networks: Although general two-dimensional interpolation techniques are well-known in various branches of statistics and signal processing, the main issue in a sensor network is exploring proper spatio/temporal interpolation such that communication and processing are also efficiently accomplished.While there is a wide range of interpolation schemes (polynomial, Fourier, and least squares [161]), many of these schemes are not directly applicable for spatial interpolation in sensor networks due to their communication complexity.
Another characteristic of many sensor networks is the nonuniformity of node distribution in the measurement field.Although non-uniformity has been dealt with extensively in contexts such as signal processing, geo-spatial data processing, and computational geometry [2], the combination of irregular sensor data sampling and intra-network processing is a main challenge in sensor networks.For example, reference [162] addresses the issue of spatio-temporal non-uniformity in sensor networks and how it impacts performance aspects of a sensor network such as compression efficiency and routing overhead.In order to reduce the impact of non-uniformity, the authors in [162] propose using a combination of spatial data interpolation and temporal signal segmentation.A simple interpolation wavelet transform for irregular sampling which is an extension of the 2-D irregular grid transform to 3-D spatio-temporal transform grids is also proposed in [163].Such a multi-scale transform extends the approach in [164] and removes the dependence on building a distributed mesh within the network.It should be noted that although wavelet compression allows the network to trade reconstruction quality for communication energy and bandwidth usage, such energy savings are naturally offset by the overhead cost of computing the wavelet coefficients.
Distributed wavelet processing within sensor networks is yet another approach to reduce communication energy and wireless bandwidth usage.Use of such distributed processing makes it possible to trade long-haul transmission of raw data to the FC for less costly local communication and processing among neighboring nodes [163].In addition, local collaboration among nodes decorrelates measurements and results in a sparser data set.
Compressive Sensing in Sensor Networks: Most natural phenomena in SN's are compressible through representation in a natural basis [79].Some examples of these applications are imaging in a scattering medium [158], MIMO radar [159], and geo-exploration via underground seismic data.In such cases, it is possible to construct a highly compressed version of a given field, in a decentralized fashion.If the correlations between data at different nodes are known a-priori, it is possible to use schemes that have very favorable powerdistortion-latency tradeoffs ( [153], [165], [166]).In such cases, distributed source coding techniques, such as Slepian-Wolf coding, can be used to design compression schemes without collaboration between nodes (see [165] and the references therein).Since prior knowledge of such correlations is not available in many applications, collaborative, intra-network processing and compression are used to determine unknown correlations and dependencies through information exchange between network nodes.In this regard, the concept of compressive wireless sensing has been introduced in [157] for energy-efficient estimation at the FC of sensor data, based on ideas from wireless communications [153], [155], [166], [167], [168] and compressive sampling theory [30], [84], [169].The main objective in such an approach is to combine processing and communications in a single distributed operation [170], [171], [172].
Methods to obtain the required sparsity in a SN: While transform-based compression is well-developed in traditional signal and image processing domains, the understanding of sparse transforms for networked data is not as trivial [173].There are methods such as associating a graph with a given network, where the vertices of the graph represent the nodes of the network, and edges between vertices represent relationships among data at adjacent nodes.The structure of the connectivity is the key to obtaining effective sparse transformations for networked data [173].For example, in the case of uniformly distributed nodes, tools such as DFT or DCT can be adopted to exploit the sparsity in the frequency domain.In more general settings, wavelet techniques can be extended to handle the irregular distribution of sampling locations [163].There are also scenarios in which standard signal transforms may not be directly applicable.For example, network monitoring applications rely on the analysis of communication traffic levels at the network nodes where network topology affects the nature of node relationships in complex ways.Graph wavelets [174] and diffusion wavelets [175] are two classes of transforms that have been proposed to address such complexities.In the former case, the wavelet coefficients are obtained by computing the digital differences of the data at different scales.The coefficients at the first scale are differences between neighboring data points, and those at subsequent spatial scales are computed by first aggregating data in neighborhoods and then computing differences between neighboring aggregations.The resulting graph wavelet coefficients are then defined by aggregated data at different scales, and computing differences between the aggregated data [174].In the latter scheme, diffusion wavelets are based on construction of an orthonormal basis for functions supported on a graph and obtaining a custom-designed basis by analyzing eigenvectors of a diffusion matrix derived from the graph adjacency matrix.The resulting basis vectors are generally localized to neighborhoods of varying size and may also lead to sparse representations of data on a graph [175].One example of such an approach is where the node data correspond to traffic rates of routers in a computer network.
Implementation of CS in a wireless SN: Two main approaches to implement random projections in a SN are discussed in the literature [173].In the first approach, the CS projections are simultaneously calculated through superposition of radio waves and communicated using amplitude-modulated coherent transmissions of randomly-weighted values directly from the nodes in the network to the FC (Fig. 27).This scheme, introduced in [157], [167] and further refined in [176], is based on the notion of so-called matched source-channel communication [166], [167].Although the need for complex routing, intra-network communications, and processing are alleviated, local phase synchronization among nodes is an issue to be addressed properly in this approach.
In the second approach, the projections can be computed and delivered to every subset of nodes in the network using gossip/consensus techniques, or be delivered to a single point using clustering and aggregation.This approach is typically used for networked data storage and retrieval applications.In this method, computation and distribution of each CS sample is accomplished through two simple steps [173].In the first step, each of the sensors multiplies its data with the corresponding element of the compressing matrix.Then, in the second step, the resulting local terms are simultaneously aggregated and distributed across the network using randomized gossip [177], which is a simple iterative decentralized algorithm for computing linear functions.Because each node only exchanges information with its immediate neighbors in the network, gossip algorithms are more robust to failures or changes in the network topology and cannot be easily compromised by eliminating a single server or fusion center [178].
Finally, it should be noted that in addition to the encoding process, the overall system performance is significantly affected by the decoding process [82], [179], [180]; this study and its extensions to sparse SN's remain as challenging tasks.
2) Sensing Capacity: Despite wide-spread development of SN ideas in recent years, understanding of fundamental performance limits of sensing and communication between sensors is still under development.One of the issues that has recently attracted attention in theoretical analysis of sensor networks, is the concept of sensor capacity.The sensing capacity was initially introduced for discrete alphabets in applications such as target detection [181], and later extended in [15], [182], [183] to the continuous case.The questions in this area are related to the problem of sampling of sparse signals, [30], [69], [169] and sampling with finite rate of innovation [4], [95].In the context of the CS, sensing capacity provides bounds on the maximum signal dimension or complexity per sensor measurement that can be recovered to a pre-defined degree of accuracy.Alternatively, it can be interpreted as the minimum number of sensors necessary to monitor a given region to a desired degree of fidelity based on noisy sensor measurements.The inverse of sensing capacity is the compression rate; i.e., the ratio of the number of measurements to the number of signal dimensions which characterizes the minimum rate to which the source can be compressed.As shown in [15], sensing capacity is a function of SNR, the inherent dimensionality of the information space, sensing diversity, and the desired distortion level.
Another issue to be noted with respect to the sensing capacity is the inherent difference between sensor network and CS scenarios in the way in which the SNR is handled [15], [183].In sensor networks composed of many sensors, fixed SNR can be imposed for each individual sensor.Thus, the sensed SNR per location is spread across the field of view leading to a row-wise normalization of the observation matrix.On the other hand, in CS, the vector-valued observation corresponding to each signal component is normalized by each column.This difference has led to different regimes of compression rate [183].In SN, in contrast to the CS setting, sensing capacity is generally small and correspondingly the number of sensors required does not scale linearly with the target sparsity.Specifically, the number of measurements is generally proportional to the signal dimension and is weakly dependent on target density sparsity.This issue has raised questions on compressive gains in power-limited SN applications based on sparsity of the underlying source domain.

A. Introduction
Recovery of the original source signals from their mixtures, without having a priori information about the sources and the way they are mixed, is called Blind Source Separation (BSS).This process is impossible if no assumption about the sources can be made.Such an assumption on the sources may be uncorrelatedness, statistical independence, lack of mutual information, or disjointness in some space [19], [20], [43].
The signal mixtures are often decomposed into their constituent principal components, independent components or are separated based on their disjoint characteristics described in a suitable domain.In the latter case the original sources should be sparse in that domain.Independent Component Analysis (ICA) is often used for separation of the sources in the former case whereas Sparse Component Analysis (SCA) is employed for the latter case.These two mathematical tools are described in the following sections followed by some results and illustrations of their applications.

B. Independent Component Analysis (ICA)
The main assumption in ICA is the statistical independence of the constituent sources.Based on this assumption, ICA can play a crucial role in the separation and denoising of signals (BSS).
There has been recent research interest in the field of BSS due to its practicality in a wide range of problems.For example, BSS of acoustic signals measured in a room is often referred to as the Cocktail Party problem, which means separation of individual sounds from a number of recordings in an echoic and noisy environment.Figure 28 illustrates the BSS concept, wherein the mixing block represents the multipath propagation model between the original sources and the microphone measurements.
Generally, BSS algorithms make assumptions about the environment in order to make the problem more tractable.There are typically three assumptions about the mixing medium.The most simple but widely used case is the instantaneous case, where the source signals arrive at the sensors at the same time.This has been considered for separation of biological signals such as the EEG where the signals have narrow bandwidths and the sampling frequency is normally low [184].The generative model for BSS in this case can be easily formulated as: where Generally, the mixing process can be nonlinear (due to inhomogenity of the environment and that the medium can change with respect to the source signal variations; e.g.stronger vibration of a drum as a medium, with louder sound).However, in an instantaneous linear case where the above problems can be avoided or ignored, the separation is performed by means of a separating matrix, W of size n × m, which uses only the information contained in x[i] to reconstruct the original source signals (or the independent components) as: where y[i] is the estimate for the source signal s[i].The early approaches in instantaneous BSS started from the work by Herault and Jutten [185] in 1986.In their approach, they considered non-Gaussian sources with equal number of independent sources and mixtures.They proposed a solution based on a recurrent artificial neural network for separation of the sources.
In the cases where the number of sources is known any ambiguity caused by false estimation of the number of sources can be avoided.If the number of sources is unknown, a criterion may be established to estimate the number of sources beforehand.In the context of model identification this is referred to as Model Order Selection and methods such as the Final Prediction Error (FPE), AIC, Residual Variance (RV), MDL and Hannan and Quinn (HNQ) methods [186] may be considered to solve this problem.
In acoustic applications, however, there are usually time lags between the arrival times of the signals at the sensors.The signals also may arrive through multiple paths.This type of mixing model is called a convolutive model [187].The convolutive mixing model can also be classified into two subcategories: anechoic and echoic.In both cases the vector representations of mixing and separating processes are modified as respectively, where * denotes the convolution operation.In an anechoic model, however, the expansion of the mixing process may be given as: where the attenuation, h r,j , and delay δ r,j of source j to sensor r would be determined by the physical position of the source relative to the sensors.Then the unmixing process to estimate the sources will be given as: where the w j,r 's are the elements of W. In an echoic mixing environment it is expected that the signals from the same sources reach the sensors through multiple paths.Therefore the expansion of the mixing and separating models will be changed to where L denotes the maximum number of paths for the sources, ν r [i] is the accumulated noise at sensor r, and (.) l refers to the l th path.The unmixing process will be formulated similarly to the anechoic one.For a known number of sources an accurate result may be expected if the number of paths is known; otherwise, the overall number of observations in an echoic case is infinite.The aim of BSS using ICA is to estimate an unmixing matrix W such that Y = WX best approximates the independent sources S, where Y and X are respectively matrices with columns y T and T .Thus the ICA separation algorithms are subject to permutation and scaling ambiguities in the output components, i.e.W = PDH −1 , where P and D are the permutation and scaling (diagonal) matrices, respectively.Permutation of the outputs is troublesome in places where either the separated segments of the signals are to be joined together or when a frequency-domain BSS is performed.
Mutual information is a measure of independence and maximizing the non-Gaussianity of the source signals is equivalent to minimizing the mutual information between them [188].
In those cases where the number of sources is more than the number of mixtures (underdetermined systems), the above BSS schemes cannot be applied simply because the mixing matrix is not invertible, and generally the original sources cannot be extracted.However, when the signals are sparse, the methods based on disjointness of the sources in some domain may be utilized.Separation of the mixtures of sparse signals is potentially possible in the situation where, at each sample instant, the number of nonzero sources is not more than a fraction of the number of sensors (see Table II, row and column 6).The mixtures of sparse signals can also be instantaneous or convolutive.

C. Sparse Component Analysis (SCA)
While the independence assumption for the sources is widely exploited in the design of BSS algorithms, the possible disjointness of the sources in some domain has not been considered.In SCA, this property is directly employed.Blind source separation by sparse decomposition has been addressed by Zibulevsky and Pearlmutter [189] for both overdetermined/exactly-determined and underdetermined systems using the maximum a posteriori approach.One way of formulating SCA is by representing the sources using a proper signal dictionary: where r = 1, . . ., m and n is the number of basis functions in the dictionary.The functions φ l [i] are called atoms or elements of the dictionary.These atoms do not have to be linearly independent and may form an overcomplete dictionary.The sparsity property requires that only a small number of the coefficients c r,l differ significantly from zero.Based on this definition, the mixing and unmixing systems are modeled as follows: x where ν[i] is an m × 1 vector.A and C can be determined by optimization of a cost function based on an exponential distribution for c i,j [189].Figure 29 shows three separated sources using the above approach.
In places where the sources are sparse and at each time instant, at most one of the sources has significant nonzero value, the columns of the mixing matrix may be calculated individually, which makes the solution to the underdetermined case possible.
The SCA problem can be stated as a clustering problem since the lines in the scatter plot can be separated based on their directionalities by means of clustering.A number of works on this method have been reported [19], [191], [192].In the work by Li et.al. [192], the separation has been performed in two different stages.First, the unknown mixing matrix is estimated using the k-means clustering method.Then, the source matrix is estimated using a standard linear programming algorithm.The line orientation of a data set may be thought of as the direction of its greatest variance.One way is to perform eigenvector decomposition on the covariance matrix of the data, the resultant principal eigenvector, i.e., the eigenvector with the largest eigenvalue, indicates the direction of the data, since it has the maximum variance.In [191], GAP statistics as a metric which measures the distance between the total variance and cluster variances, has been used to estimate the number of sources followed by a similar method to Li's algorithm explained above.In line with this approach, Bofill and Zibulevsky [16] developed a potential function method for estimating the mixing matrix followed by ℓ 1 -norm decomposition for the source estimation.Local maxima of the potential function correspond to the estimated directions of the basis vectors.After the mixing matrix is identified, the x 1 [i] x sources have to be estimated.Even when A is known the solution is not unique.So, a solution is found for which the ℓ 1norm is minimized.Therefore, for x[i] = a j s j [i], j |s j | is minimized using linear programming.
Geometrically, for a given feasible solution, each source component is a segment of length |s j | in the direction of the corresponding a j and, by concatenation, their sum defines a path from the origin to x[i].Minimizing j |s j | amounts therefore to finding the shortest path to x[i] over all feasible solutions j = 1, . . ., n, where n is the dimension of space of the independent basis vectors [19].Figure 30 shows the scatter plot, polar plot, and the shortest path from the origin to the data point x[i].
There are many cases for which the sources are disjoint in other domains, rather than the time-domain, or when they can be represented as sum of the members of a dictionary which can consist for example of wavelets or wavelet packets.In these cases the sparse component analysis can be performed in those domains more efficiently.Such methods often include transformation to time-frequency domain followed by a binary masking [193] or a BSS followed by binary masking [187].One such approach, called DUET [193], transforms the anechoic convolutive observations into the time-frequency domain using a short-time Fourier transform and the relative attenuation and delay values between the two observations are calculated from the ratio of corresponding time-frequency points.The regions of significant amplitudes (atoms) are then considered to be the source components in the timefrequency domain.In this method only two mixtures have been considered and as a major limit of this method, only one source has been considered active at each time instant.
For instantaneous separation of sparse sources, the common approach used by most researchers is to attempt to maximize the sparsity of the extracted signals at the output of the separator.The columns of the mixing matrix A assign each observed data point to only one source based on some measure of proximity to those columns [194], i.e., at each instant only one source is considered active.Therefore the mixing system can be presented as: where in an ideal case a j,r = 0 for r = j.Minimization of the ℓ 1 -norm is one of the most logical methods for estimation of sources as long as the signals can be considered sparse.ℓ 1 -norm minimization is a piecewise linear operation that partially assigns the energy of x[i] to the m columns of A around x[i] in R n space.The remaining n − m columns are assigned zero coefficients, therefore the ℓ 1 -norm minimization can be manifested as: A detailed discussion of signal recovery using ℓ 1 -norm minimization is presented by Takigawa et.al. [195] and described below.As mentioned above, it is important to choose a domain that sparsely represents the signals.
On the other hand, in the method developed by Pederson et.al., as applied to stereo signals, the binary masks are estimated after BSS of the mixtures and then applied to the microphone signals.The same technique has been used for convolutive sparse mixtures after the signals are transformed to the frequency domain.
In another approach [196] the effect of outlier noise has been reduced using median filtering then hybrid fast ICA filtering, and ℓ 1 -norm minimization have been used for separation of temporomandibular joint sounds.It has been shown that for such sources, this method outperforms both the Degenerate Unmixing Estimation Technique (DUET) and Li's algorithms.The authors of [197] have recently extended the DUET algorithm to separation of more than two sources in an echoic mixing scenario in the time-frequency domain.[199], [19].3) Estimate the source representation based on the sparsity assumption.A majority of proposed methods are primarily based on minimizing some norm or pseudonorm of the source representation vector.The most effective approaches are Matching Pursuit [199], [200], Basis Pursuit, [189], [201], [78], [202]], FOCUSS [203], IDE [66] and Smoothed ℓ 0 -norm [204].
In a very recent approach it has been considered that brain signal sources in the space-time-frequency domain are disjoint.Therefore, clustering the observation points in the space-timefrequency-domain can be effectively used for separation of brain sources [198].
As can be seen, generally, BSS exploits independence of the source signals whereas SCA benefits from the disjointness property of the source signals in some domain.While the BSS algorithms mostly rely on ICA with statistical properties of the signals, SCA uses their geometrical and behavioral properties.Therefore, in SCA, either a clustering approach or a masking procedure can result in estimation of the mixing matrix.Often, an ℓ 1 -norm is used to recover the source signals.Generally, in places where the source signals are sparse, the SCA methods often result in more accurate estimation of the signals with less ambiguities in the estimation.

D. SCA Algorithms
There are three main steps for the solution of an SCA problem as shown in Table VII [199].The first step of Table VII shows a linear model for the SCA problem, the second step consists of estimating the mixing matrix A using sparsity information, and finally the third step is to estimate the sparse source representation based on the estimate of A.
In the following, we present a survey of major approaches that are suggested for the third step.
1) Matching Pursuit: Mallat and Zhang have developed a general iterative method for approximating sparse decomposition [200].When the dictionary is orthogonal and the signal x is composed of k ≪ n atoms, the algorithm recovers the sparse decomposition exactly after n steps.As we will see, the algorithm is greedy [205].Since the algorithm is myopic, in some certain cases, wrong atoms are chosen in the first few iterations, and thus the remaining iterations are spent on correcting the first few mistakes.The algorithm is shown in Table VIII.
2) Basis Pursuit: The mathematical representation of counting the number of sparse components is denoted by ℓ 0 .However, ℓ 0 is not a proper norm and is not computationally tractable.The closest convex norm to ℓ 0 is ℓ 1 .The ℓ 1 optimization of an over complete dictionary is called Basis Pursuit.

TABLE VIII
MATCHING PURSUIT ALGORITHM 1) Let x (0) = 0 n×1 , r (0) = x and i = 1.2) Find index γ i such that r (i−1) , aγ i is maximum where a j corresponds to columns of the mixing matrix A (atoms). 3) Let s i = r (i−1) , aγ i , x (i) = x (i−1) + s i aγ i and r (i) = x − x (i) .4) If the last condition is not satisfied, increase i and return to step 2. However the ℓ 1 -norm is non-differentiable and we cannot use gradient methods for optimal solutions [206].On the other hand, the ℓ 1 solution is stable due to its convexity (the global optimum is the same as the local one) [21].
Formally, the Basis Pursuit can be formulated as: We now explain how the Basis Pursuit is related to Linear Programming (LP).The standard form of linear programming is a constrained optimization problem defined in terms of variable x ∈ R n by: where C T x is the objective function, Ax = b is a set of equality constraints and ∀i : x i ≥ 0 is a set of bounds.
Table IX shows this relationship.Thus, the solution of ( 75) can be obtained by solving the equivalent LP.There are two major approaches to solve LP: 1) Interior Point methods and 2) Simplex algorithms, depending on whether we solve the cost function and then check whether it satisfies the constraint bounds or vice versa.
3) FOCal Underdetermined System Solver (FOCUSS): FOCUSS is a non-parametric algorithm that consists of two parts [203].It starts by finding a low resolution estimation of the sparse signal, and then pruning this solution to a sparser signal representation through several iterations.The solution at each iteration step is found by taking the pseudoinverse of a modified weighted matrix.The pseudo-inverse of the modified weighted matrix is defined by (AW) + = (AW) H (AW • (AW) H ) −1 .This iterative algorithm is the solution of the following optimization problem: Find s = Wq, where: min q ℓ2 s.t.x = AWq (77) Description of this algorithm is given in Table X and an extended version is discussed in [203].
Step: Find indices of inactive sources:

Estimation
Step: Find the following projection as the new estimate: The solution is derived from Karush-Kuhn-Tucker system of equations.At the l + 1 th iteration: where the matrices and vectors are partitioned into inactive/active parts as A i , Aa, s i , sa and P = `Ai A T i ´−1 • stop after a fixed number of iterations.

4) Iterative Detection and Estimation (IDE):
The idea behind this method is based on a geometrical interpretation of the sparsity.Consider the elements of vector s are i.i.d.random variables.By plotting a sample distribution of vector s, which is obtained by plotting a large number of samples in the Sspace, it is observed that the points tend to concentrate first around the origin, then along the coordinate axes, then across the coordinate planes.The algorithm used in IDE is given in Table XI.In this table, s i 's are the inactive sources, s a 's are the active sources, A i is the column of A corresponding to the inactive s i and A a is the column of A corresponding to the active s a .Notice that IDE has some resemblances to the RDE method discussed in Sec.III-A.2,IMAT mentioned in Sec.III-A.2, and MIMAT explained in Sec.VII-A.2.
5) Smoothed ℓ 0 -norm (SL0) Method: As discussed earlier, the criterion for sparsity is the ℓ 0 -norm; thus our minimization is The ℓ 0 -norm has two major drawbacks: the need for a combinatorial search, and its sensitivity to noise.These problems arise from the fact that the ℓ 0 -norm is discontinuous.The idea of SL0 is to approximate the ℓ 0 -norm with functions of the type [204]: • For i = 1, . . ., K: Set ŝi = s.
• Final answer is ŝ = ŝK where σ is a parameter which determines the quality of the approximation.Note that we have For the vector s, we have s 0 ≈ n − F σ (s), where F σ (s) = n i=1 f σ (s i ).Now minimizing s 0 is equivalent to maximizing F σ (s) for some appropriate values of σ.For small values of σ, F σ (s) is highly non-smooth and contains many local maxima, and therefore its maximization over A • s = x may not be global.On the other hand, for larger values of σ, F σ (s) is a smoother function and contains fewer local maxima, and its maximization may be possible (in fact there is no local maxima for large values of σ [204]).Hence we use a decreasing sequence for σ in the steepest ascent algorithm and may escape from getting trapped into local maxima and reach the actual maximum for small values of σ, which gives the minimum ℓ 0 -norm solution.The algorithm is summarized in Table XII.

6) Comparison of Different Techniques:
The above techniques have been simulated and the results are depicted in Fig. 31.In order to compare the efficiency and computational complexity of these methods; we use a fixed synthetic mixing matrix and source vectors.The elements of the mixing matrix are obtained from zero mean independent Gaussian random variables with variance σ = 1.Sparse sources have been artificially generated using a Bernoulli-Gaussian model: s i = p N (0, σ on ) + (1 − p) N (0, σ of f ).We set σ of f = 0.01, σ on = 1 and p = 0.1.Then, we compute the noisy mixture vector x from x = As + ν, where ν is the noise vector.The elements of the vector ν are generated according to independent zero mean Gaussian random variables with variance σ 2 ν .We use Orthogonal Matching Pursuit (OMP) which is a variant of Matching Pursuit [200].OMP has a better performance in estimating the source vector in comparison to Performance of various methods with respect to the standard deviation.Matching Pursuit.Fig. 32 demonstrates the time needed for each algorithm to estimate the vector s with respect to the number of sources.This figure shows that IDE and SL0 have the lowest complexity.

E. Sparse Dictionary Representation (SDR) and Signal Modeling
A signal x ∈ R n may be sparse in a given basis but not sparse in a different basis.For example, an image may be sparse in a wavelet basis (i.e., most of the wavelet coefficients are small) even though the image itself may not be sparse (i.e., many of the gray values of the image are relatively large).Thus, given a class S ⊂ R n , an important problem is to find a basis or a frame in which all signals in S can be represented sparsely.More specifically, given a class of signals S ⊂ R n , it is important to find a basis (or a frame) D = {w j } d j=1 (if it exists) for R n such that every data vector x ∈ S can be represented by at most k ≪ n linear combinations of elements of D. The dictionary design problem has been addressed in [19], [20], [21], [81], [84], [95], [208].A related problem is the signal modeling problem in which the class S is to be modeled by a union of subspaces M = l i=1 V i where each V i is a subspace of R n with a dimension of V i ≤ k where k ≪ n [43].If the subspaces V i are known, then it is possible to pick a basis E i = {e i j } j for each V i and construct a dictionary D = l i=1 E i in which every signal of S has sparsity k (or is almost k sparse).The model M = l i=1 V i can be found from an observed set of data F = {f 1 , . . ., f m } ⊂ S by solving (if possible) the following non-linear least squares problem:   Find subspaces V 1 , . . ., V l of R n that minimize the expression over all possible choices of l subspaces with dimV i ≤ k < N .Here d denotes the Euclidian distance in R n and k is an integer with 1 ≤ k < n for i = 1, . . ., l.Note that e F, {V 1 , . . ., V l } is calculated as follows: for each f i ∈ F and fixed {V 1 , . . ., V l }, the subspace V i ∈ {V 1 , . . ., V l } closest to f i is found and the distance d 2 (f i , V j ) is computed.This process is repeated for all f i ∈ F and the squares of the distances are added together to find e F, {V 1 , . . ., V l } .The optimal model is then obtained as the union M = i V o i , where {V o 1 , . . ., V o l } minimize the expression (81).When l = 1 this problem reduces to the classical least squares problem.However, when l > 1 the set i V i is a nonlinear set and the problem is fully non-linear (see Figs. 34 and 35).A more general nonlinear least squares problem has been studied for finite and infinite Hilbert spaces [43].In that general setting, the existence of solutions is proved and a meta-algorithm for searching for the solution is described.
For the special finite dimensional case of R n in (81), the search algorithm is an iterative algorithm that alternates between data partition and the optimization of a simpler least squares problem.The search algorithm can be summarized as in Table XIII.
The above algorithm, which is similar to k-means, consists of two parts [209]: 1) an initialization; and 2) an iterative -initial partition {F 1 l , . . ., F 1 l } -Data set F • Iterations: 1) Use the SVD to find {V 1 1 , . . ., V 1 l } by minimizing e `F 1 i , V i ´for each i, and compute ´; 6) Increment j by 1, i.e., j → j + 1; 7) End while • Output: -P j and V P j .search algorithm.The initialization is a Hough-like transform that partitions the data set F into l classes F 1 1 , . . ., F 1 l , such that each class F 1 i consists of vectors that are close to the same (yet unknown) subspace V 1 i .For each partition F 1 i we find the subspace V 1 i closest to the vectors in i closets to the vectors in F 1 i for a fixed i is the classical linear least squares problem and can be found exactly using a Singular Value Decomposition.This initialization process produces a first approximation 81).We now use the subspaces V 1 1 , . . ., V 1 l to find a new partition of the data F 2 1 , . . ., F 2 l in such a way that the vectors in F 2 i consist of all the vectors in F that are closest to V 1 i , i = 1, . . ., l.Using this new partition F 2 1 , . . ., F 2 l of F , we find (for each i) the subspace V 2 i closest to the vectors in F 2 i using SVD as before.We continue this process until the value of e is unchanged between two consecutive iterations.This is a local minimum which is also likely to be a global one (see Fig. 35).

VII. MULTIPATH CHANNEL ESTIMATION
In wireless systems, channel estimation is required for the compensation of channel distortions.The transmitted signal bounces off different objects and arrives at the receiver from multiple paths.This phenomenon causes the received signal to be a mixture of reflected and scattered versions of the transmitted signal.The mobility of the transmitter, receiver, and scattering objects results in rapid changes in the channel response, and thus the channel estimation process becomes more complicated.Due to the sparse distribution of scattering objects, a multipath channel is sparse in the time domain as shown in Fig. 36.By taking sparsity into consideration, channel estimation can be simplified and/or made more accurate.The sparse time varying multipath channel is modeled as: where k is the number of taps, α l is the l th complex path gain, and τ l is the corresponding path delay.At time t, the transfer function is given by: The estimation of the multipath channel impulse response is very much similar to the determination of analog epochs and amplitudes of discontinuity for finite rate of innovation as shown in section II-B.4 Fig. 8 and (19).Essentially, if a known train of impulses is transmitted and the received signal from the multipath channel is filtered and sampled (information domain as shown in Fig. 8 -rate of innovation), the channel impulse response can be estimated from these samples using an annihilating filter (the Prony method [28]) defined in the Z-transform and a pseudo-inverse matrix inversion, in principle 18 .Once the channel impulse response is estimated, its effect is compensated; this process can be repeated according to the dynamics of the time varying channel.
A special case of multipath channel is an OFDM channel, which is widely used in ADSL, DAB, DVB, WLAN, WMAN, and WIMAX 19 .OFDM is a digital multi-carrier transmission technique where a single data stream is distributed over several sub-carrier frequencies to achieve robustness against multipath channels besides many other advantages.Channel estimation in OFDM can be easier than for other modulation schemes; the channel impulse response is now quantized 20 and instead of an annihilating filter defined in the Z-transform, we can use DFT and ELP of section III-A.Also, instead of a known train of impulses, some of the available sub-carriers of OFDM in each transmitting block are assigned to predetermined patterns, which are usually called comb-type pilots.These pilot tones help the receiver to extract some of the DFT samples of the discrete time varying channel (82) at the respective frequencies in each transmitting block.These characteristics make the OFDM channel estimation similar to unknown sparse signal recovery of section II-A.1 and the impulsive noise removal of section III-A.2.Because of these advantages, our main example and simulations are related to OFDM channel estimation.

A. OFDM Channel Estimation
For OFDM, the discrete version of the time varying channel of (83) in the frequency domain becomes where where T f and n are the symbol length and number of subcarriers in each OFDM symbol, respectively.∆f is the subcarrier spacing, and T s = 1 ∆f is the sample interval.The above equation shows that for the r th OFDM symbol, H[r, i] is the DFT of h[r, l].
Two major methods are used in the equalization process: 1) zero forcing and 2) MMSE.In the zero forcing method, regardless of the noise variance, equalization is obtained by dividing the received OFDM symbol by the estimated channel frequency response; while in the MMSE method, the approximation is chosen such that the MSE of the transmitted data vector E X − X 2 is minimized, which introduces the noise variance in the equations.

1) Statement of the Problem:
The goal of the channel estimation process is to obtain the channel impulse response from the noisy values of the channel transfer function in the pilot positions.This is equivalent to solving the following equation for h which is also shown graphically in Fig. 37.
where i p is an index vector denoting the pilot positions in the frequency spectrum, H ip is a vector containing the value of the channel frequency spectrum in these pilot positions and F ip denotes the matrix obtained from taking the rows of the DFT matrix pertaining to the pilot positions.ν ip is the additive noise on the pilot points in the frequency domain.Thus, the channel estimation problem is equivalent to finding the sparse vector h from the above set of equations for a set of pilots.Various channel estimation methods [210] have been used with the usual tradeoffs of optimality and complexity.The Least Square [210], ML [211], [212], Minimum Mean Square Error (MMSE) [213], [214], [215], and LMMSE [213], [211], [216] techniques are among some of these methods.But none of these techniques use the inherent sparsity of the multipath channel h, and thus, they are not as accurate.
2) Sparse OFDM Channel Estimation: In the following, we present two methods that utilize this sparsity to enhance the channel estimation process.
CS Based Channel Estimation: Recently the idea of using time-domain sparsity in OFDM channel estimation has been proposed by [217], [218], [219].The use of sparsity decreases the channel estimation error and hence the number of required pilots (overhead), thus increasing the bandwidth efficiency.In [217], the authors proposed to use CS for OFDM channel estimation and proved that, in case of uniform pilot insertion, the OFDM channel estimation problem satisfies the Restricted Isometric Property (RIP) described in section II-B, and thus LP-based algorithms similar to the ones discussed in section VI-D.2 can be used for channel estimation.Simulation results show that this method works effectively even in fast time varying channels.Furthermore, from (7), a lower bound on the number of pilots can be obtained for a given number of channel taps (sparsity).
However, the authors of [217] did not consider zero padding at the endpoints of the OFDM bandwidth in their scenario which is an essential part of the current standards based on OFDM transmission.This assumption causes the matrix F ip defined in ( 86) to be ill-conditioned and thus, the RIP  9) is not satisfied.Also we do not have any pilots in the zero padded parts which complicates the channel estimation techniques.In [24], we propose a method that exploits the inherent sparsity and also solves the zero padding problem.This algorithm is briefly discussed in the following.
Modified IMAT (MIMAT) for OFDM Channel Estimation: In this method, the spectrum of the channel is initially estimated using a simple interpolation such as linear interpolation between pilot sub-carriers.This initial estimate is further improved in a series of iterations between time (sparse) and frequency (information) domains to find the sparsest channel impulse response by using an adaptive thresholding scheme; in each iteration, after finding the location of the taps (locations with previously estimated amplitudes higher than the threshold), their respective amplitudes are again found using the MMSE criterion.In each iteration, due to thresholding, some of the false taps that are noise samples with amplitudes above the threshold are discarded.Thus, the new iteration starts with a lower number of false taps.Moreover, because of the MMSE estimator, the valid taps approach their actual values in each new iteration.In the last iteration, the actual taps are detected and the MMSE estimator gives their respective values.This method is similar to RDE and IDE methods discussed in section III-A.2 and VI-D.4.
Table XIV summarizes the steps in the MIMAT algorithm.In the threshold of the MIMAT algorithm, α and β are constants which depend on the number of taps and initial powers of noise and channel impulses.In the first iteration, the threshold is a small number and with each iteration it is gradually increased.Intuitively, this gradual increase of the threshold with the iteration number, results in a gradual reduction of false taps (taps that are created due to noise).In each iteration, the tap values are obtained from: where t denotes the index of nonzero impulses obtained from the previous step and F is obtained from F ip by keeping the columns determined by t.A graphical representation of the above equation is given in Fig. 38.The amplitudes of -Find an initial estimate of the time domain channel using linear interpolation: ĥ(0) = ĥlinear • Iterations: 1) Set Threshold=βe αi .
2) Using the threshold from the previous step, find the locations of the taps t by thresholding the time domain channel from the previous iteration( ĥ(i−1) ). 3) Solve for the value of the non-zero impulses using MMSE: 4) Find the new estimate of the channel ( ĥ(i) ) by substituting the taps in their detected positions.5) Stop if the estimated channel is the same as the previous iteration or when a maximum number of iterations is reached.nonzero impulses can be obtained from simple iterations, pseudo-inverse, or the MMSE equation ( 88) of Table XIV that yields better results under additive noise environments.
The equation that has to be solved in ( 87) is usually overdetermined which helps the suppression of the noise in each iteration step.Note that the solution presented in (88), represents a variant of the MMSE solution when the location of discrete impulses are known.If further statistical knowledge is available, this solution can be modified and a better estimation is obtained; however, this makes the approximation process more complex.This algorithm does not need many steps of iterations; the positions of the non-zero impulses are perfectly detected in 3 or 4 iterations for most types of channels.

B. Simulation Results and Discussions
For OFDM simulations, the DVB-H standard was used with the 16-QAM constellation in the 2K mode (2 11 FFT size).The channel profile was the Brazil channel D. Fig. 39 shows the Symbol Error Rate (SER) versus the Carrierto-Noise Ratio (CNR) after equalizing using the proposed MIMAT algorithm and the standard linear interpolation in the frequency domain using the noisy pilot samples.As can be seen in Fig. 39, the SER obtained from the MIMAT algorithm coincides with the one obtained from the hypothetical ideal channel (where the exact channel frequency response is used for equalization).Thus, in this sense the proposed channel estimation is perfect in time invariant channels.Also, this figure shows that the standard linear interpolation method performs poorly compared to MIMAT.This estimation technique is highly robust in rapidly changing channels and shows only minor performance degradation as the Doppler frequency increases as shown in Fig. 40.

VIII. CONCLUSION
A unified view of sparse signal processing has been presented in tutorial form.The sparsity in the key areas of sampling, coding, spectral estimation, array processing, component analysis, and channel estimation has been carefully exploited.Some form of uniform or random sampling has been shown to underpin the associated sparse processing methods used in each of these fields.The reconstruction methods used in each application domain have been introduced and the interconnections among them have been highlighted.This development has revealed; for example, that the iterative methods developed for random sampling can be applied to real-field block and convolutional channel coding for impulsive noise (salt-and-pepper noise in the case of images) removal, sparse component analysis, and channel estimation for orthogonal frequency division multiplexing systems.These iterative reconstruction methods have been shown to be naturally extendable to spectral estimation and sparse array processing due to their similarity to channel coding in terms of mathematical models.Conversely, the minimum description length method developed for spectral estimation and array processing has potential for application in other areas.The error locator polynomial method developed for channel coding has, moreover, been shown to be a discrete version of the annihilating filter used in sampling with a finite rate of innovation and the Prony method in spectral estimation; the Pisarenko and MUSIC methods are further improvements of the Prony method.
Linkages with emergent areas such as compressive sensing and sensor networks have also been considered.In addition, it has been suggested that the linear programming methods developed for compressive sensing and sparse component analysis can be applied to other applications with possible reduction of sampling rate.As such, this tutorial has provided the route for new applications of sparse signal processing to emerge, which can potentially reduce computational complexity and improve performance quality.

APPENDIX I ACCELERATION METHODS: CHEBYSHEV AND CONJUGATE GRADIENT (CG) [65]
A. Chebyshev Algorithm • Initialization: • For n = 2, . . ., N : where S and P are the sampling and filtering operators, respectively.Also, A and B are frame bound parameters and N is the number of iterations.

B. Conjugate Gradient Algorithm
• Initialization: • For n = 2, . . ., N : where S and P are the sampling and filtering operators, respectively.N is the number of iterations and x[i], y[i] denotes the inner product of the two functions x[i] and y[i].
APPENDIX II ELP DECODING FOR ERASURE CHANNELS [53] For lost samples, the polynomial locator for the erasure samples is , H(z im ) = 0, m = 1, 2, . . ., k where z i = e j 2π•i n .The polynomial coefficients h t , t = 0, . . ., k can be found from the product in (93); it is easier to find h t by obtaining the inverse FFT of H(z).By multiplying (94) Since X[j] = 0 for j ∈ Θ (see the footnote on page 10), then The remaining values of E[j] can be found from (96), by the following recursion: where r / ∈ Θ and the index additions are in mod(n).

APPENDIX III ELP DECODING FOR IMPULSIVE NOISE CHANNELS [32],
[96] For all integer values of r such that r ∈ Θ and r + k ∈ Θ, we obtain a system of k equations with k + 1 unknowns (h t coefficients).These equations yield a unique solution for the polynomial with the additional condition that the first nonzero h t is equal to one.After finding the coefficients, we need to determine the roots of the polynomial (93).Since the roots of H(z) are of the form e j where tr(.) represents the trace operator on matrices.We know that if both AB and BA are defined, tr(AB) = tr(BA).Thus: and since vH i vi = 1 for i = 1, . . ., n, we have: The first term of the above equation can be written as: where C(m) only depends on the parameter m and not k.
Thus we can ignore this term in the MDL criterion.[34], [35], [36] Many physical phenomena and engineering problems are represented by a mathematical model very often in the form of ordinary, or partial differential equations.The explicit solutions of these equations are rarely available, unless for special cases.It is customary to try to find approximate solutions to these problems by making them discrete and linear.The resulting linear system involves very large matrices, which hopefully contain many zeros.A matrix, with very few nonzero elements is called a sparse matrix.

APPENDIX V SPARSE MATRIX MANIPULATIONS
Solving large linear systems are difficult as it is very time consuming, costly and laborious, unless the matrices involved are sparse and have many zero elements.The sparse property of the matrix may reduce the storage and computing time considerably, crucial to solving the linear systems that represent such matrices.This is achieved by storing and computing only non-zero entries of the matrix.Various techniques can be applied to solve such a linear system [220].
There are two main approaches in finding admissible solutions for sparse linear systems, 1) Direct Methods [35], and 2) Iterative Methods [34], [221], [222]: 1) Direct Methods give the exact solutions in a finite number of elementary operations, provided that there are no rounding errors.Direct methods for a sparse linear system fall into three categories: (i) Gaussian elimination techniques, (ii) Triangular factorization, in particular using decomposition techniques, such as, LU, Incomplete Lower and Upper (ILU), and Cholesky factorization, (iii) Householder reduction to upper triangular form.Direct methods, in real applications, have an advantage of providing solutions with robust and predictable behavior.Detailed analysis and discussion of these methods have been well established, and can be found in [35], [36], and references therein.2) Iterative Methods consist of Jacobi, Gauss-Siedel, and Successive Over-Relaxation (SOR).The most suitable method for the sparse linear system is the projection method, and in particular the Krylov subspace method.
With the new development in iterative methods for approximate solution of linear systems, it has been realized that new iterative techniques of projection methods, and in particular the combination of preconditioning and projection onto Krylov subspace iterations, can be a simple and efficient solver of sparse linear systems, and can compete with the direct methods in its applications [34], [221].Each of the above two methods is suitable for different classes of matrices.Direct methods are best applied to those classes that produce only a few fill-ins -a fill-in being a new non-zero entry created at a position where the original matrix contains a zero entry after a linear algebraic operation is performed.Preconditioned iterative methods are the preferred choice for the class of matrices that produce many fill-ins in the process.
There are two types of sparse matrices: matrices with regularly patterned non-zeros (group I), and matrices with irregularly structured matrices (group II).Such a distinction between matrices is particularly important in the iterative solution methods since algebraic operations for the group I of matrices can be significantly reduced using a computer.There is a large number of different sparse matrices archived and accessible on the website managed by T. Davis, The University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices.

A. Solution of Linear Systems
Consider the linear system where A is an m × n matrix, x is the unknown, and b is an m × 1 vector.For the exact solution of this system, there are three cases: 1) The square matrix A (m = n) is invertible (nonsingular): there is a unique solution x = A −1 b.
2) The matrix A is singular and b ∈ R(A) (b belongs to the Range of A): there are infinitely many solutions x = x 0 + v, for all v ∈ Ker(A) (Null space of A), where x 0 is a particular solution of Ax = b.
3) The matrix A is singular and b / ∈ R(A): there is no solution.However, for large n, calculating the inverse of A is a complex task.Further, if approximate solutions are found, it may be difficult to estimate how accurate they are.This will depend on the entries of matrix A. Further, for the case 2 above, where n ≤ m and rank(A) = n, denote the pseudo inverse by A + = (A T A) −1 A T , the problem is converted to finding the vector x = A + b (an approximate solution) subject to the following conditions: • x satisfies the least square solution.Namely, vector x minimizes the norm of r , where the residual vector r = b − Ax. • x is the solution of the system Ax = b−r, when A T r = 0.
Note that in the case that A is the product of certain special matrices (diagonal, orthogonal, triangular, and other invertible matrices), the solution can be found by various direct methods [34], [35], [36].

B. Iterative Methods
Consider an n × n real coefficient matrix A and a real nvector b in linear system (107), the decomposition of A as: where D is the diagonal matrix of A with all non-zero entries, and L and U are the strict lower and upper matrices of A, respectively.The iterative solution vector x k+1 is given by or in a general form, where in Jacobi iterations G = G J (A) = I − D −1 A, f = D −1 b, and in Guass-Seidel iteration G = G GS (A) = I − (D + L) −1 A, f = (D + L) −1 b.This iteration can be viewed as a technique for solving the linear system: where the precondition matrix M is given by M J = D, and M GS = D + L. The iterations given above converge and the limit is a solution of the original linear system (for details see [36], [221], [222]).1) Krylov Subspace Methods: Let x 0 be an initial approximation to the solution of (107), r 0 = b − Ax 0 be the initial residual, and let K m (A, r 0 ) be the Krylov subspace of dimension m defined by K m (A, r 0 ) = Span r 0 , Ar 0 , . . ., A m−1 r 0 (112) where these subspaces are nested, i.e., K m ⊆ K m+1 , (m = 1, 2, 3, . . .).Krylov subspace methods are iterative methods in which x m , an approximation to the solution of ( 107), at the m th step, is found in x 0 + K m , namely the approximation is of the form x = x 0 + q m−1 (A)r 0 (113) where, q m−1 is a polynomial of degree at most m − 1.If the system is real, then coefficients of q m−1 are real.This The error satisfies e = x m − x * = p m (A)(x 0 − x * ), where x * is the exact solution of (107).Let us denote by P m the set of all polynomials p of degree at most m such that p(0) = 1.The approximation x m ∈ x 0 + K m is often found by requiring x m to be the minimizer of some functional.There are different methods depending on the characteristics of the matrix and implementation.Thus, each method defines implicitly a different polynomial p m ∈ P m (for details see [223]).
The extra condition imposed for convergence and completeness are • r rm is orthogonal to K m (Galerkin condition: r m ⊥K m ), • Minimum residual condition: We note that the nested property of the Krylov subspaces imply that any method for which one of the conditions (115) holds will terminate in at most n steps.The desired methods are those which produce a good approximation to the solution of (107) in many fewer than n iterations.An important ingredient that makes Krylov subspace methods work is the use of preconditioners, a matrix or operator M used to convert equation ( 107) into (111).There is no one method which is recommended for all problems.Some of the applications on Sparse matrices besides the one discussed in the main body of the paper are:

C. Applications in Photonics and Electromagnetics
Numerical simulation for the development of photonics and electromagnetic CAD software packages has been the subject of intensive research in the past decade.The most widely used simulation method is the Finite-Difference Time-Domain (FDTD) for time-domain analysis of Maxwell's equations.The more recently introduced schemes would require the solution of large sparse linear systems at each unit of time [224].A popular method for time-domain problems is the timedomain beam propagation method [225], which necessitates a multiplication of the input field vector with a very sparse matrix.This sparse multiplication is advantageous in that the method can become very efficient and highly parallel [226].Another method for the time-domain simulation of Maxwell's equations is the Finite-Element Time-Domain (FETD) technique [224].The FETD leads to an almost sparse linear system; the computational complexity of this method can be considerably reduced by inverting the sparse matrix [227].[228] Micro-arrays (DNA and protein) are parallel biosensors capable of detecting a large number of different genomic particles simultaneously.DNA micro-arrays that use tens of thousands of probe spots detect multiple targets in a single experiment.This is a wasteful use of the sensing resources in comparative DNA micro-array experiments.Generally, only a fraction of the total number of genes with respect to a reference sample is differentially expressed, and, thus, a vast number of probe spots may not provide any useful information.An alternative design is the so-called compressed micro-arrays; this translates to significantly lower costs, simpler image acquisition and processing, and smaller amount of genomic material needed for experiments.To recover signals from compressed micro-array measurements, ideas from CS (Sec.II-B) can be employed.For sparse measurement matrices, sparse algorithms can be used to lower the computational complexity than the widely used linear-programming-based methods [92], [228].

ACKNOWLEDGMENTS
We would like to sincerely thank our colleagues for their specific contributions in various sections in this paper.Especially, Dr. M. Nouri Moghadam from Math.Dept. of George Washington University, who wrote Appendix V; Dr. H. Saeedi from University of Massachusetts who contributed to LDPC codes, and Dr. K. Mehrany from EE Dept. of Sharif University who contributed to section V-C and also M. Valiollahzadeh who edited and contributed to the SCA section.We are especially indebted to Prof. B. Sankur from Bogazici University in Turkey for his careful review and comments.The contribution of some of the sparse array examples by Dr. A. Austeng from University of Oslo is also appreciated.We are also thankful to the students of the Multimedia Lab and members of ACRI at Sharif University for their invaluable help and simulations.We are specifically indebted to V. Montazer Hodjat, S. Jafarzadeh, A. Salemi, M. Soltanalian, E. Azizi and H. Firuzi.

Fig. 3 .
Fig. 3. Block diagram of the iterative reconstruction method.The Mask is an appropriate filter with coefficients of 1's and 0's depending on the type of sparsity in the original signal.

Fig. 4 .
Fig.4.SNR improvement vs. the no. of iterations for a random sampling set at the Nyquist rate (OSR=1) for a bandpass signal.

Fig. 5 .
Fig. 5.The Iterative Method with Adaptive Thresholding (IMAT) for detecting the number, location, and values of sparsity.

Fig. 7 .
Fig. 7. Relation between m and k for sparse DFT and DCT signals; the frame size is n = 2 13 .

n 2 .Fig. 9 .
Fig. 9. Convolutional Encoders (a) A real-field systematic convolutional encoder of rate 1 2 ; h[i]'s are the taps of an FIR filter; (b) A non-systematic convolutional encoder of rate 1 2 , h 1 [i]'s and h 2 [i]'s are the taps of 2 FIR filters.

Fig. 11 .
Fig.11.CFAR-RDE method with the use of adaptive soft thresholding and an iterative method for signal reconstruction.

Fig. 12 .
Fig. 12.Comparison of CFAR-RDE and a simple soft decision RDE for DFT block codes.

Fig. 17 .
Fig.17.Simulation results by using the IMAT method for detecting the location and amplitude of the impulsive noise, λ = 1.9.

Fig. 18 .
Fig. 18.Spectral estimation of a sparse mixture of sinusoids using Prony, Pisarenko and MUSIC methods; input SNR is 5dB and 1024 time samples are used.

Fig. 19 .
Fig. 19.Uniform linear array with element distance d, element length I, and a wave arriving from direction ϕ.

Fig. 20 .
Fig.20.An example of the performance of the MDL criterion in array processing.The MDL estimates the number of active sources, which is 2.

Fig. 27 .
Fig. 27.Computation of CS projections through superposition of radio waves of randomly weighted values directly from the nodes in the network to the FC (from [173]).

Fig. 28 .
Fig. 28.The BSS concept; the unobservable sources s 1 [i], . . ., sn[i] are mixed and corrupted by additive zero mean noise to generate the observations x 1 [i], . . ., xm[i].The target of BSS is to estimate an unmixing system to recover the original sources in y 1 [i], . . ., yn[i].
and ν[i] denote respectively the vector of source signals, size n × 1, observed signals size m × 1, and noise signals size m × 1. H is the mixing matrix of size m × n.

Fig. 29 .
Fig. 29.Separation of sparse signals using the algorithm given in [190], left (a) sources, (b) mixtures, and (c) reconstructed sources, in both time-frequency left, and time, right.

Fig. 30 .
Fig. 30.(a) the scatter plot and (b) the shortest path from the origin to the data point, x[i], extracted from [16].

)
Maximize the function Fσ on the feasible set S = {s|As = x} using L iterations of the steepest ascent algorithm (followed by projection onto the feasible set): -Initialization: s = ŝi−1 .-for j = 1, . . ., L (loop L times): a) Let: ∆s = [s 1 e T .b) Set s ← s − µ∆s (where µ is a small positive constant).c) Project s back onto the feasible set S: Fig. 31.Performance of various methods with respect to the standard deviation.

h = 1 ,
´for each i, and computeΓ j+1 = P i e `F j+1 i , V j+1 i

Fig. 35 .
Fig. 35.Data F belongs to two planes in R 3 .The algorithm uses a random initial partition in left hand side, and produces the final partition and optimal subspaces in the right hand side.

Fig. 37 .
Fig. 37. Graphical representation of the sparse problem involved in OFDM channel estimation

Fig. 38 .
Fig. 38.Graphical representation of the equation involved in obtaining the tap values above the threshold

Fig. 39 .
Fig.39.SER (Symbol Error Rate) vs. CNR (Carrier to Noise Ratio) for the ideal channel, linear interpolation, and the MIMAT for the Brazil channel.

Fig. 40 .
Fig. 40.SER (Symbol Error Rate) vs. CNR (Carrier to Noise Ratio) of MIMAT method for the Brazil channel with various Doppler frequencies.
by e[i m ] • z im r (where r is an integer) and summing over m, we get m ] • z im k+r−t = 0 (95) Since the inner summation is the DFT of the missing samples e[i m ], we get k t=0 h t • E[k + r − t] = 0 (96) where E[.] is the DFT of e[i].The received samples, d[i], can be thought of as the original over-sampled signal, x[i], minus the missing samples e[i m ].The error signal, e[i], is the difference between the corrupted and the original over-sampled signal and hence is equal to the values of the missing samples for i = i m and is equal to zero otherwise.In the frequency domain, we have E[j] = X[j] − D[j], j = 1, . . ., n 2π•im n , the inverse DFT (IDFT) of the {h m } k m=0 can be used.Before performing IDFT, we have to pad n − 1 − k zeros at the end of the {h m } k m=0 sequence to obtain an n-point signal.We refer to the new signal (after IDFT) as {H i } n−1 i=0 .Each zero in {H i } represents an error in r[i] at the same location.APPENDIX IV PROOFS FOR MDL FORMULAS Proof of (58): The eigenvectors of R −1ML are the same as the ones for R and its eigenvalues are the reciprocals of the eigenvalues of R ML and we know the eigenvectors constitute an orthonormal set.Thus we have:tr(R −1 ML R) = tr ( , which results in:tr(R −1 ML R) = k + n − k = n(103)Proof of (60): The term κ 2 log m is the penalty function and we have:− log f (x; R ML ) = − log 1 |πR ML | m e −tr(R −1 M L R) = m log(π) + m log(|R ML |) +tr(R −1 ML R) = m log |R ML | + n + m log(π) C(m)

−
log f (x; R ML ) = m n i=1 log( λi ) +m(n − k) log 1 n − k natural expression implies that the residual r m = b − Ax m is associated with the so-called residual polynomial p m of degree at most m with p m (0) = 1 because r m = b − Ax m = r 0 − Ap m (A)r 0 = p m (A)r 0

TABLE I COMMON
NOTATIONS USED THROUGHOUT THE PAPER.

TABLE II VARIOUS
TOPICS AND APPLICATIONS WITH SPARSITY PROPERTIES: THE SPARSITY, WHICH MAY BE IN TIME/SPACE OR "FREQUENCY" DOMAINS, CONSISTS OF UNKNOWN SAMPLES/COEFFICIENTS THAT NEED TO BE DETERMINED.THE INFORMATION DOMAIN CONSISTS OF KNOWN SAMPLES/COEFFICIENTS IN "FREQUENCY" OR TIME/SPACE DOMAIN (THE COMPLEMENT OF THE SPARSE DOMAIN).A LIST OF ACRONYMS IS GIVEN IN TABLE XV AT THE END OF THE PAPER; ALSO, A LIST OF COMMON NOTATIONS IS PRESENTED IN TABLE I. FOR DEFINITION OF ESPRIT ON ROW 11 AND COLUMN 7, SEE THE FOOTNOTE ON PAGE 18.

TABLE III THE
ITERATIVE ALGORITHM BASED ON THE BLOCK DIAGRAM OF FIG.

TABLE IV GENERIC
IMAT OF FIG. 5 FOR ANY SPARSITY IN THE DISCRETE TRANSFORM (DT), WHICH IS TYPICALLY THE FAST FOURIER

TABLE VII SCA
STEPS 1) Consider the model x = A • s, we need a linear transformation that applies to both sides of the equation to yield a new sparse source vector.2) Estimate the mixing matrix A. Several approaches are presented for this step, such as natural gradient ICA approaches, and clustering techniques with variants of k-means algorithm Set ŝ0 equal to the minimum ℓ 2 -norm solution of As = x, obtained by pseudo-inverse of A.2) Choose a suitable decreasing sequence for σ, [σ 1 , . . ., σ K ].

TABLE XIII SEARCH ALGORITHM
• Input:

TABLE XIV MIMAT
ALGORITHM FOR OFDM CHANNEL ESTIMATION • Initialization: