Sparse representations are most likely to be the sparsest possible

Given a signal S ∈ R N and a full-rank matrix D ∈ R N×L with L > N, we define the signal's overcomplete representations as all α satisfying S = Dα. Among all the possible solutions, we have special interest in the sparsest one—the one minimizing α 0. Previous work has established that a representation is unique if it is sparse enough, requiring α 0 < Spark(D)/2. The measure Spark(D) stands for the minimal number of columns from D that are linearly dependent. This bound is tight—examples can be constructed to show that with Spark(D)/2 or more nonzero entries, uniqueness is violated. In this paper we study the behavior of overcomplete representations beyond the above bound. While tight from a worst-case standpoint, a probabilistic point of view leads to uniqueness of representations satisfying α 0 < Spark(D). Furthermore, we show that even beyond this point, uniqueness can still be claimed with high confidence. This new result is important for the study of the average performance of pursuit algorithms—when trying to show an equivalence between the pursuit result and the ideal solution, one must also guarantee that the ideal result is indeed the sparsest.


General-sparse representations
In signal processing we are often interested in a replacement of the representation, seeking some simplification for an obvious gain. This is the rationale behind the many transforms proposed over the past several centuries, such as the Fourier, cosine, wavelet, and many others. The basic idea is to "change language," and describe the signal differently, in the hope that the new description is better for the application in mind. A natural justification for a transform is that given a signal, a representation has already been imposed due to the use of the trivial basis (e.g., samples as a function of time/space), and there is no reason to believe that this representation is the most appropriate one for our needs.
The ease with which linear transforms are operated and analyzed keeps them as first-priority candidates in defining alternative representations. It is therefore not surprising to find that linear transforms are the most popular ones in theory and practice in signal processing. A linear transform is defined through the use of a full-rank matrix D ∈ R N×L , where L ≥ N. Given the signal S ∈ R N , its representation α ∈ R L is defined by S = Dα. (1) For the case of L = N (and a nonsingular matrix D due to the full-rank property), the above relationship implies a linear operation both for the forward transform (from S to α) and its inverse. Many of the practical transforms are of this type, and many of them go further and simplify the matrix D to be structured and unitary, so that its inverse is easy to apply and both directions can be computed with nearly O(N) operations. Such is the case with the DFT, the DCT, the Hadamard transform, orthonormal wavelets, and other transforms.
In this paper we are interested in the case of L > N, referred to as the overcomplete transforms. When L > N, the relationship in (1) is an underdetermined linear set of equations, and thus in general it leads to an infinite number of possible solutions. Further information is therefore needed in order to uniquely define the transform, and this is typically achieved by defining the representation as the solution of

min α α p subject to S = Dα. (P p)

For p = 2 it is easy to show that again we obtain linearity in both directions (forward and inverse transforms). This case, typically referred to as "frame theory," has drawn a lot of attention because of this obvious simplicity. However, it is clear that two-way linearity poses a hard restriction on the space of possibilities, and may cost in performance.
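The two-way linearity for p = 2 can be verified numerically: the minimum ℓ2-norm solution of S = Dα is given in closed form by the pseudoinverse, so both the forward and inverse transforms are matrix multiplications. A minimal NumPy sketch (the 8 × 20 dimensions match the experiment in Section 2; all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 8, 20
D = rng.standard_normal((N, L))    # full-rank overcomplete dictionary
s = rng.standard_normal(N)

# Minimum ell-2-norm solution of S = D @ alpha: alpha = pinv(D) @ s
alpha = np.linalg.pinv(D) @ s
assert np.allclose(D @ alpha, s)   # feasible: exact representation

# Two-way linearity: the forward map s -> alpha is itself a matrix multiply
s2 = rng.standard_normal(N)
lhs = np.linalg.pinv(D) @ (s + s2)
rhs = alpha + np.linalg.pinv(D) @ s2
assert np.allclose(lhs, rhs)
```

Because the forward map is just multiplication by a fixed matrix, frame-theoretic representations inherit all the convenience of linear transforms, at the cost in performance noted above.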

EURASIP Journal on Applied Signal Processing
A different and far more complicated approach, advocated strongly in recent years, is to consider p = 0. The ℓ0 notation is an abused ℓp-norm with p → 0, effectively counting the number of nonzeros in the vector α. In such an approach we seek, among all feasible representations (satisfying the constraint in (P p)), the one with the fewest nonzero entries, this way achieving an ultimate simplicity in representation. Referring to the matrix D as a dictionary with signal prototypes as its columns, we build s as a linear combination of only a few of these columns, typically referred to as atoms. Thus, we can think of our signal as a molecule, and the forward transform decomposes it into its building atoms, where we try to use the fewest in this construction [1].
From the numerical standpoint, the forward transform, defined as (P 0), is a nonconvex and highly nonsmooth optimization problem, with many possible local minima. Prior work has established that this problem is NP-hard, with complexity growing exponentially with the number of columns in the dictionary [2,3]. Recent study of this problem and methods to approximate its solution give promising new results, indicating that even though it is complicated, means exist to solve it at least in some cases, using either greedy [4][5][6][7][8][9][10][11][12][13] or convex programming approaches [1,10,[14][15][16][17][18][19][20]. One aspect of these recent works is the uniqueness result, which will be the focus of this paper.

The uniqueness result-worst-case analysis
We consider the problem

min α α 0 subject to S = Dα. (P 0)

As explained above, the term α 0 stands for the count of the nonzero entries in α. Previous work has shown that if a feasible solution α 0 is sparse enough, it can be guaranteed to be the solution of (P 0) [18,19]. The argument is surprisingly simple and follows this reasoning: for a given dictionary D, its Spark is defined as the smallest number of columns from D that are linearly dependent. This scalar characterizes the dictionary with respect to sparse representations from a worst-case standpoint. By definition, every vector δ in the null space of the dictionary, Dδ = 0, must satisfy δ 0 ≥ Spark(D), since it linearly combines columns from D to give the zero vector, and at least Spark(D) such columns are necessary.
If α 0 represents S, that is, S = Dα 0 , it implies that all the alternative representations of the same signal are characterized as α 0 + δ, for δ ∈ Null(D). If α 0 satisfies α 0 < Spark(D)/2, no vector δ from this null space exists such that it could be added to α 0 , nulling more entries than the newly introduced ones. Thus, this representation must be the sparsest one possible.
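The Spark and the resulting uniqueness bound can be checked by brute force on a small dictionary. The sketch below (NumPy; the 4 × 6 sizes and the helper name spark are chosen here for illustration) verifies that a representation with fewer than Spark(D)/2 nonzeros admits no equally sparse or sparser alternative:

```python
import numpy as np
from itertools import combinations

def spark(D, tol=1e-10):
    """Smallest number of linearly dependent columns (brute-force search)."""
    N, L = D.shape
    for k in range(1, L + 1):
        for cols in combinations(range(L), k):
            if np.linalg.matrix_rank(D[:, cols], tol=tol) < k:
                return k
    return L + 1

rng = np.random.default_rng(1)
N, L = 4, 6
D = rng.standard_normal((N, L))
assert spark(D) == N + 1           # Gaussian columns: maximal Spark w.p. 1

# A representation with ||alpha||_0 = 2 < Spark(D)/2 = 2.5 must be unique:
alpha = np.zeros(L)
alpha[[0, 3]] = rng.standard_normal(2)
s = D @ alpha
for k in (1, 2):
    for cols in combinations(range(L), k):
        if cols == (0, 3):
            continue
        sub = D[:, cols]
        x, *_ = np.linalg.lstsq(sub, s, rcond=None)
        assert np.linalg.norm(sub @ x - s) > 1e-8   # no alternative found
```

The exhaustive search over all candidate supports is exactly what makes (P 0) intractable at scale; here the dictionary is small enough that the check is instantaneous.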
This result is very elementary and yet quite surprising, bearing in mind that (P 0 ) is a highly complicated optimization task of combinatorial flavor. In general, one cannot expect to successfully solve it unless a brute-force search is used. The above uniqueness result, while not constructive, implies that if a sparse enough solution is found via some approximation method, it can be guaranteed to be the desired global optimizer. In general, when solving such complicated optimization problems, even if a solution is proposed, one can at best guarantee that locally it is optimal, by searching a feasible descent direction and finding none. Here we are able to guarantee globally that this is the best solution, and hence the surprise.
Clearly, we can show that the above result is tight by construction. Suppose that we have a null-space vector δ such that Dδ = 0 with δ 0 = Spark(D)—this vector realizes the Spark, and thus its existence is guaranteed. Then, taking its first Spark(D)/2 nonzero entries and zeroing the rest, we get a new vector α with α 0 = Spark(D)/2, being the representation of a necessarily nonzero signal Dα = S. For this specific signal we have yet another representation of the same cardinality, because α − δ has Spark(D)/2 nonzeros too, representing the same signal. Similarly, when assigning more than half of the nonzeros of δ to α, the remaining entries form an alternative, sparser representation. This way we have constructed a special signal for which uniqueness cannot be guaranteed.
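This tightness construction can be reproduced directly: plant a dependency among Spark(D) = 4 columns, split the resulting null-space vector in half, and obtain two distinct representations of the same signal with Spark(D)/2 = 2 nonzeros each. A small NumPy sketch (sizes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N, L = 4, 6
D = rng.standard_normal((N, L))
c = rng.standard_normal(3)
D[:, 4] = D[:, :3] @ c            # plant a dependency: Spark(D) = 4 w.p. 1

# Null-space vector realizing the Spark: D @ delta = 0, ||delta||_0 = 4
delta = np.zeros(L)
delta[[0, 1, 2]] = c
delta[4] = -1.0
assert np.allclose(D @ delta, 0)

# Split delta: two distinct representations with Spark(D)/2 = 2 nonzeros each
alpha = np.zeros(L)
alpha[[0, 1]] = delta[[0, 1]]
beta = alpha - delta              # nonzeros at positions 2 and 4
s = D @ alpha
assert np.allclose(D @ beta, s)   # same signal, same cardinality
assert np.count_nonzero(beta) == 2
```

Note how specific this counterexample signal is: it lives exactly in the span of the dependent column group, which is the "rare" configuration the probabilistic analysis below will show has negligible weight.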

Behavior beyond the bound
Does the above example imply that beyond the Spark(D)/2 bound we are destined to nonuniqueness? The answer is yes, if we think in terms of worst case. Does the above example imply that a randomly chosen representation with cardinality beyond the bound is necessarily not unique? Definitely not! In fact, given an arbitrary sparse representation with a number of nonzeros beyond this bound but still relatively low, chances are that this is the sparsest possible representation for the signal it forms. Said differently, examples showing the tightness of the uniqueness theorem, as constructed above, are very few and rare, and when a probabilistic point of view is adopted, their relative weight is expected to diminish entirely.
Thus, the question we pose here is more general and addresses the uniqueness property of candidate solutions to (P 0 ), hoping to enable some guarantees even beyond the worst-case bound mentioned above.

This work and prior art
In this paper we study the behavior of overcomplete representations beyond the known uniqueness bound. When adopting a probabilistic point of view, we show both empirically and theoretically that uniqueness can be guaranteed with high confidence with Spark(D)/2 and more nonzero entries. We show that the above-mentioned counterexamples to uniqueness are of zero measure for representations satisfying α 0 < Spark(D). Furthermore, we show that even beyond this point, uniqueness can still be claimed with reasonable probability.
In order to build these results, we propose to characterize the dictionary in a way that extends the Spark, forming the signature of the matrix. Whereas the Spark "thinks" worst-case, the signature gets the more general picture by gathering all the subsets of columns from D that are linearly dependent.

This signature is used to analyze the uniqueness and compose the results presented.
At the heart of the analysis proposed here is a specific probability density function (PDF) assumed on the signal space. However, instead of specifying this PDF directly with respect to the signals, it is driven via the representations. A commonly used regularization in inverse problems forces sparsity of the representation of the unknown signal, and assumes independence of its coefficients [1]. Such regularization is essentially the manifestation of a PDF on the unknown signal: generating representations following these rules, signals emerge with a PDF that is a mixture of Gaussians. Thus, motivated by how sparsity is used in inverse problems, we propose a simple signal source model. The bottom line of this work is the claim that when a sparse representation is drawn from this source, it is most likely to be the sparsest one possible. The bound measuring how sparse is sparse enough for this claim to hold is less restrictive than previously believed.
One word of caution is necessary here: we use here a probabilistic model to describe how signals and their originating representations are constructed. For those signals once generated, the rules of uniqueness apply as indicated here. However, when a candidate representation describing a signal S is given from a different source, we cannot apply the given analysis. The reason is that the proposed representation may be drawn from a different distribution, with more emphasis on the "dark zone" of nonuniqueness. This, for example, explains why we cannot use the uniqueness results we obtain and impose them on the output of the pursuit algorithms as a simple test of success. Pursuit algorithms may (and will) tend to generate nonunique representations, which explains why a separate analysis for them is required. Still, for such analysis to take place, we must start with representations known to be unique, at least in probability, in order to carry out the study. Such analysis will benefit from the results given here.
Indeed, in a very recent pioneering work by Candès and Romberg (see [21]) and a parallel work by Donoho [22], the average performance of the basis pursuit has been studied, using the same signal source model as described above. A vital part of their analysis is the uniqueness claim: when trying to show an equivalence between the pursuit result and the ideal solution, one must also guarantee that the ideal result is indeed the sparsest one possible. In their work, the authors considered a special dictionary structure built of two unitary matrices, and focused on asymptotic results. Here we discuss the uniqueness for general dictionaries of arbitrary finite sizes, and take a completely different route.

This paper
In the next section we start with a simple experiment that illustrates the goal of this work. We show that the empirical probability of obtaining uniqueness is far better than theoretically suggested so far. In Section 3 we propose an analysis to explain this behavior. Section 4 summarizes some of our thoughts about the role of the pursuit algorithms in seeking approximate solutions to (P 0), and our expectations regarding their average performance, compared to the existing worst-case analysis. Section 5 summarizes and concludes this paper.

EMPIRICAL EVIDENCE
Before proving new properties of sparse representations, let us start with simple yet illustrative experiments that demonstrate the results we expect to document theoretically later on. We start with the construction of the dictionary. Assume that D ∈ R 8×20 is built by a concatenation of random white and zero-mean multivariate Gaussian vectors as its columns. We obtain a full-rank matrix whose Spark is 9, that is, no 8 columns in this matrix can be found to be linearly dependent. This is a general property of such random matrices, stemming from the fact that square random matrices are nonsingular with probability 1 (see the seminal work by Edelman and later by Shen [23,24] on the probabilistic behavior of the extreme singular values of such matrices). Based on the known uniqueness result, every representation with fewer than Spark(D)/2 = 4.5 nonzeros must be unique. Thus, we are interested in studying the representations with α 0 > 4. Clearly, there is no point in considering representations with α 0 > 8, since those cannot be unique by definition.
Studying representations with α 0 = 8 is also expected to give nonuniqueness, although of a weaker form. Any such representation can be replaced by equally good representations coming from all C(L, 8) combinations of 8 columns from D. Nevertheless, it is interesting to see whether better representations (with cardinality strictly smaller than 8) can be found in such cases. Thus, the range α 0 ∈ [5, 8] is to be experimented on. In our experiment we cover the interval [1, 8], the lower part of which is already guaranteed by the known theoretical result. We can generate many such representations by first choosing the k nonzero locations at random with uniform probability, and then assigning values to these k locations independently, using some scalar distribution rule. In our experiment we assume that these values are drawn from a zero-mean, unit-variance, independent Gaussian distribution.
In the above process we actually induce a probability density function on the representations of cardinality k, and through it a distribution on the signal family that has representations of that cardinality. This is a key feature that will be repeated in our theoretical analysis—rather than starting from the signal PDF, we choose to embark from the representation vectors. As sparsity of representations has been shown to be a powerful signal prior in inverse problems, generating signals this way makes a lot of sense, and not just due to the ease of analysis it brings along.
Given such a random representation α, the signal it represents is given by s = Dα. We can now search exhaustively through all other combinations of k or fewer columns from D to seek an alternative representation. For the size chosen, L = 20, we have at most Σ k=1..8 C(20, k) = 263,949 candidate supports to test in this experiment. For each such candidate set of columns, the least-squares (LS) solution with the submatrix containing the chosen columns dictates the most suitable coefficients, and if the LS error is below the arithmetic accuracy threshold, a candidate representation is assumed found.
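A scaled-down version of this exhaustive search can be sketched as follows (NumPy; a 6 × 10 dictionary replaces the paper's 8 × 20 so the sweep over supports stays fast, and the helper name sparser_exists is ours). With cardinality k = 5, beyond the worst-case bound, chances are the random representation still survives the search:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
N, L = 6, 10                       # smaller than the paper's 8x20, for speed
D = rng.standard_normal((N, L))    # Spark(D) = 7 with probability 1

def sparser_exists(s, k, tol=1e-8):
    """Exhaustive LS search for a representation of s with fewer than k nonzeros."""
    for m in range(1, k):
        for cols in combinations(range(L), m):
            sub = D[:, cols]
            x, *_ = np.linalg.lstsq(sub, s, rcond=None)
            if np.linalg.norm(sub @ x - s) < tol:
                return True
    return False

# One trial at cardinality k = 5, beyond the worst-case bound Spark(D)/2 = 3.5
k = 5
cols = rng.choice(L, size=k, replace=False)
alpha = np.zeros(L)
alpha[cols] = rng.standard_normal(k)
s = D @ alpha
assert not sparser_exists(s, k)    # no sparser alternative found
```

Repeating such trials per cardinality and recording the success rate reproduces the structure of the experiment described above.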
We have conducted this experiment as described, and Figure 1 documents its results. For every cardinality in the range [1, 8] we performed 100 experiments, and we present the relative number of experiments that ended with perfect success (no sparser solution found) and the relative number that ended with partial success (representations with the same cardinality could also be found). We see that uniqueness can be empirically guaranteed for all representations with cardinality smaller than or equal to 7. For representations with 8 nonzero entries, while there are other equally sparse representations, no better (sparser) ones were found.
A second experiment was performed following the same structure, but with a modified dictionary. After creating the random matrix as before, we replaced the first column with a linear combination of the last five. This way we changed the dictionary's Spark to 6 (or below, if we are really unlucky—again, probabilistic results on random matrices suggest that finding a group of fewer than 6 linearly dependent columns in this case has probability zero). Figure 2 presents the results obtained this time. We see several interesting effects.
(1) The existing uniqueness bound suggests that uniqueness can be guaranteed for 2 nonzero entries and below. We see that up to 5 = Spark(D) − 1 entries, we empirically get that the representations are unique.
(2) Even beyond this cardinality we can still get uniqueness with high probability-a phenomenon we will explain later on. For representations with 6 and 7 nonzero entries, even in cases of violated uniqueness, this violation is due to other representations of the same cardinality and not ones with a strictly sparser form.
(3) As in the previous experiment, for representations with 8 nonzero entries we can find equivalently sparse representations but not better ones, so this is a weaker uniqueness success.
The obvious question raised by the above two experiments is whether we can theoretically explain such results, and in this way discuss average uniqueness performance, rather than let extreme cases dictate the analysis of uniqueness. The next section provides this theoretical explanation, and as we will see, it is almost straightforward.

Some ugly preliminaries
Since our analysis is of a probabilistic flavor, it is clear that it has to be built on a specific random distribution of the signals, and the results will differ for different signal source models. As we have already indicated in the previous section, instead of specifying this signal PDF directly, it will be implied indirectly by the representation PDF we effectively use. We assume the following on the PDF of the representation vectors.
(1) The probability of each cardinality is fixed and known, Prob{ α 0 = k} = p k : For example, p k could be inversely proportional to k in some way, to indicate that signals tend to have sparse representations, or it could be uniform in the range [1, N] and zero elsewhere.
(2) The nonzeros in the representation vector are uniformly spread, with no preference of one zone over another: In fact, this condition can be relaxed substantially and replaced by a nonuniform probability, as long as we avoid degeneracy (i.e., no location is assigned zero probability).
(3) The locations of the different nonzero entries in a given representation are statistically independent: Thus, the K locations are chosen at random with equal and independent probability, and this means that every combination of K entries has an equal chance to be selected. Again, this condition could be relaxed to allow statistical dependencies, as long as no degeneracies are encountered.
(4) The K nonzero entries in a representation are randomly generated from a Gaussian distribution with zero mean and unit variance: There is no special reason for the Gaussianity, and any reasonable nondegenerate alternative distribution can be assumed instead.
(5) The nonzero entries in a representation are mutually independent.
We refer hereafter to representations coming from the above distribution as the output of the machine M. Those are first generated by randomly choosing the cardinality based on p k , then by choosing the involved columns, and finally by choosing the nonzero coefficients' values.
When considering representations of cardinality K, there are K specific columns from D multiplied by a random vector of length K that is a white, normalized, multivariate Gaussian. Thus, treating the dictionary as deterministic, the PDF Prob{s | support{α}} is a multivariate Gaussian distribution of dimensionality dictated by both the rank of the submatrix of chosen columns from D and α 0 . For example, for α 0 = K = Spark(D) − 1, the rank of the subdictionary is necessarily full (otherwise we contradict the definition of the Spark), and thus we get an ellipsoid in K dimensions describing the spread of the signals related to such representations. For Spark(D) columns or more, the rank of the submatrix used could be smaller, and then the dimensionality of the signal space is degenerate.
Due to the above, the distribution of the signals Prob{s | α 0 = K} is expected to be a mixture of C(L, K) equally probable Gaussians of the above form, each referring to a different choice of K columns from the dictionary. Similarly, the signal source model Prob{s} is also a mixture of Gaussians, this time combining mixtures of different cardinalities with the weights p k .
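The machine M and the resulting mixture-of-Gaussians signal model can be sketched as follows (NumPy; p k is taken uniform over [1, N] purely for illustration, and the helper name draw_from_M is ours). For a fixed support, the empirical covariance of s = Dα should approach D K D K^T:

```python
import numpy as np

rng = np.random.default_rng(4)
N, L = 8, 20
D = rng.standard_normal((N, L))
p = np.ones(N) / N                 # p_k: uniform over cardinalities 1..N

def draw_from_M():
    """One representation drawn from the machine M."""
    k = rng.choice(np.arange(1, N + 1), p=p)        # (1) cardinality
    support = rng.choice(L, size=k, replace=False)  # (2),(3) uniform support
    alpha = np.zeros(L)
    alpha[support] = rng.standard_normal(k)         # (4),(5) i.i.d. N(0, 1)
    return alpha

assert 1 <= np.count_nonzero(draw_from_M()) <= N

# For a fixed 3-column support, s is Gaussian with covariance D_K @ D_K.T
K = [0, 5, 9]
samples = (D[:, K] @ rng.standard_normal((3, 200_000))).T
emp_cov = samples.T @ samples / samples.shape[0]
assert np.max(np.abs(emp_cov - D[:, K] @ D[:, K].T)) < 0.3
```

The rank-3 covariance confirms that signals with this support occupy only a 3-dimensional subspace of R 8, as the analysis above requires.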
In the coming subsections we will study the uniqueness behavior for different cardinalities. We start with the easier case where Spark(D) = N + 1 and then turn to discuss the more general case of Spark(D) ≤ N. Throughout our analysis we assume that L, the number of columns in D, is finite.

Part 1: Spark(D) = N + 1
We start our analysis by treating the special case where the Spark of the dictionary is at its peak, being Spark(D) = N +1. In this case we can guarantee uniqueness for representations satisfying α 0 < (N + 1)/2. This case could refer to dictionaries generated as Grassmanian frames [25,26], random dictionaries as discussed before, and possibly other constructions. This is the most optimistic scenario, paralleling the first experiment from Section 2. In the coming analysis we consider the following ranges of interest: α 0 > N, α 0 = N, and N/2 < α 0 < N.

Top interval: α 0 > N
Assume a representation α is drawn from the above-described random source M, with α 0 > N. Clearly, we cannot claim it is the sparsest one describing the signal s = Dα. Since Spark(D) = N + 1, every subset of N columns from D is linearly independent. This implies that s can be represented by alternative representations with each of those N-column combinations, having only N nonzeros. Thus we have the following result.

Theorem 1 (Spark(D) = N + 1, α 0 > N). Assume a dictionary D ∈ R N×L is fixed with Spark(D) = N + 1, and a representation α generated from M with cardinality α 0 > N.

Then, the probability to find an alternative representation for s with cardinality smaller than α 0 is 1.
The above result implies that the given representation is necessarily nonunique, implying that for the signal s = Dα an alternative representation can be found with cardinality N at most.

Middle interval: α 0 = N
We now treat the case where the candidate representation drawn from M is of cardinality N. As before, all the subgroups of N columns in D must be linearly independent, and thus, whatever signal we get, it can be generated alternatively by all C(L, N) − 1 other combinations of N columns from the dictionary, leading to alternative representations with the same cardinality N.
Could we do better? Can we find a group of N − 1 columns realizing the signal s = Dα? The answer to this question is the essence of this paper, and its rationale will be used repeatedly in later cases as well. We will therefore try to motivate our reply from both algebraic and geometric considerations.
The signal in mind is originally generated as a linear combination of N linearly independent columns ( α 0 = N). Let us fix those columns and denote them as the submatrix D N ∈ R N×N . The N nonzero coefficients in α are random white normalized Gaussian values, implying that as far as those coefficients are involved, the spread of the representation vectors is spherical in this N-dimensional space.
Multiplying the "cloud" of possible representations sharing the same support by the matrix D N , we get that the signal is also a Gaussian random vector with zero mean and a full-rank autocorrelation matrix D N D H N . This signal occupies the N-dimensional space with a nondegenerate ellipsoidal density. By nondegenerate we mean that the volume of this ellipsoid is nonzero, an immediate consequence of the positive definiteness of the autocorrelation matrix we have formed.
Due to the value of the Spark, every subgroup of N − 1 (or fewer) columns from D is linearly independent, and as such spans a subspace of dimension N − 1 (or lower, resp.). Multiplied by normalized Gaussian random vectors representing the nonzero part of candidate representations, random signals are generated in an (N − 1)-dimensional space. Since there is a finite number of such subspaces to consider, being all the combinations of 1, 2, 3, . . . , N − 1 columns from D, their union cannot cover the entire N-dimensional space. Actually, this amalgam of subspaces has zero volume in the N-dimensional space. Thus, the chance that the signal we started with is covered by one of those subspaces is zero. This leads to the conclusion that a representation sparser than N cannot be found in the discussed case. We conclude with the following result.

Theorem 2 (Spark(D) = N + 1, α 0 = N). Assume a dictionary D ∈ R N×L is fixed with Spark(D) = N + 1, and a representation α generated from M with cardinality α 0 = N. Then, considering the signal s = Dα,
(1) there are C(L, N) − 1 alternative representations for s with the same cardinality N, and thus the probability to find such an alternative is 1;
(2) the probability to find an alternative representation for s with cardinality smaller than N is 0.
This is a weak form of uniqueness, but nevertheless one of interest, saying that we can find similarly sparse alternatives but not better ones. Let us give a very simple and intuitive example to better explain our result. Suppose that the dictionary D is of size 2 × 5, implying that our signals are points in 2D. We further assume that the Spark of D is 3, meaning that every 2 × 2 submatrix of D is full rank. Suppose that a specific signal is constructed as a linear combination of the first two columns. Since the 2 coefficients used in the linear combination are random normalized Gaussians, the signals we can possibly generate are also Gaussian and occupy the 2D space, although with a spread distorted from spherical to ellipsoidal. The shape of the 2D ellipsoid is dictated by the two eigenvalues of the 2 × 2 matrix formed by the first two columns used to build the signal.
In our analysis we concentrate on the 2D space of signals, and we have just found that every point in the plane is a possible signal (with varying nonzero probability). Now let us try building the same signal using only one column, in an attempt to find a sparser representation. Consider a specific column, and, by randomly choosing the representation coefficient, consider the signals this can generate. We get a set of 2D Gaussian vectors, all on a specific line passing through the origin, in a direction dictated by the column used. By considering all C(5, 1) = 5 columns, we have 5 such lines where the signals with representation cardinality 1 could reside. Any finite number of lines covers zero volume in the 2D plane, and thus the chance that our signal admits a sparser representation with cardinality 1 is zero.
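This 2D picture is easy to verify numerically: a signal built from two columns almost surely lies on none of the five lines spanned by single columns. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
D = rng.standard_normal((2, 5))            # 2x5 dictionary, Spark = 3 w.p. 1
s = D[:, :2] @ rng.standard_normal(2)      # signal from the first two columns

# Each single column spans a line through the origin; project s onto each
for j in range(5):
    d = D[:, j]
    proj = (s @ d) / (d @ d) * d
    assert np.linalg.norm(s - proj) > 1e-6  # s lies on none of the 5 lines
```

The nonzero projection residuals show that no cardinality-1 representation exists for this (generic) signal.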
Note that this analysis suggests that if we are to approximate a signal with some inaccuracies, rather than exactly represent it, the same approach could be used. This time, however, every such line should be replaced by a thickened version of it, increasing chances of failure (i.e., finding sparser representation alternatives). We leave such analysis for future work.

Bottom interval: N/2 < α 0 < N
The analysis required for the N/2 < α 0 < N case is quite similar to the one we presented earlier, with one major difference-whereas the previous case led to a weak version of uniqueness, here we will get more conclusive, strong uniqueness.
Suppose that a representation drawn from M is of cardinality α 0 = K, in the range (N/2, N). This implies that the signal in mind is originally generated as a linear combination of K linearly independent columns. As before, fixing those chosen columns and denoting them as the submatrix D K ∈ R N×K , due to the normalized Gaussianity of the K nonzero coefficients in α, the signal obtained is also a Gaussian random vector with zero-mean and a rank K autocorrelation matrix being D K D H K . These signals reside in the N-dimensional space, but fill only K < N dimensions of it.
Due to the value of the Spark, every subgroup of K − 1 (or fewer) columns from D is linearly independent, and as such spans a subspace of dimension K − 1 (or lower, resp.). As before, this finite number of subgroups of columns defines signal subspaces of dimensionality K − 1 and below, and the overlap of their union with the subspace in which the original signals reside has zero volume.
What about similarly sparse alternative representations? There are C(L, K) combinations of K columns from D that could build a competing representation. Choosing one such candidate group, it creates a "cloud" of signals of the same dimensionality K in the N-dimensional space. How overlapping are the original and the newly formed subspaces? We will show that this overlap is either complete or empty (in measure of volume).
A complete overlap implies that the two different groups of K columns span the same subspace. Take the first K-column group and add to it any column from the second group: since these K + 1 columns span a K-dimensional space, they must be linearly dependent, and this contradicts the Spark. Thus, complete overlap is impossible.
The alternative case, where the two subspaces of dimensionality K are different, implies that their overlap is of dimensionality K − 1 at most (as an example, for a 3D space with two subspaces of dimensionality 2, a complete overlap implies that the two planes passing through the origin are the same, and if they are not, their intersection is a line). As we have already stated, a subspace of dimensionality K − 1 has zero volume in the K-dimensional space. Even a union of many such subspaces does not change this fact, as long as a finite number of members participate in the union. This all leads to the conclusion that, with probability 1, even equally sparse alternative representations will not be found. We thus have the following result.

Theorem 3 (Spark(D) = N + 1, N/2 < α 0 < N). Assume a dictionary D ∈ R N×L is fixed with Spark(D) = N + 1, and a representation α generated from M with cardinality α 0 in the range (N/2, N). Then, considering the signal s = Dα, the probability to find an alternative representation for s with cardinality α 0 or smaller is 0.

This is a strong form of uniqueness, but as opposed to the classic result, it leans on probabilistic considerations, meaning that while counterexamples to this uniqueness result can be created, their overall weight is negligible in the space of signals we have formed.

Relation to the empirical results
In the first experiment in Section 2 we had N = 8 and Spark(D) = 9, matching the case studied here. Due to Theorem 1 it is clear that there is no uniqueness for α 0 > 8, and this range was not part of the simulation. Theorem 2 gives us a weak guarantee of uniqueness for α 0 = 8, with C(20, 8) − 1 alternative representations of the same cardinality and no sparser ones. This aligns well with the result documented in Figure 1. Theorem 3 supplies the results for α 0 < 8, guaranteeing uniqueness, as indeed empirically obtained. Figure 3 presents a graph parallel to Figure 1, as we expect to obtain for general N (assumed for convenience to be even).

Part 2: Spark(D) ≤ N
We now turn to discuss the more common case where Spark(D) ≤ N. This case refers to dictionaries generated from overcomplete wavelets, ridgelets, curvelets, many other types of frames, and amalgams of them [15-17, 19, 27, 28]. This is the more realistic scenario, paralleling the second experiment from Section 2.
In this case we can guarantee uniqueness for representations satisfying ‖α‖₀ < Spark(D)/2. This time the ranges of interest are ‖α‖₀ > N, Spark(D) ≤ ‖α‖₀ ≤ N, and Spark(D)/2 ≤ ‖α‖₀ ≤ Spark(D) − 1. As we will see next, the analysis here is similar but more involved. The range Spark(D) ≤ ‖α‖₀ ≤ N in particular is problematic, and requires the definition of the signature of a dictionary in order to evaluate the uniqueness probability.
Definition 1 (signature). For a matrix D ∈ R N×L , its signature is defined as the discrete function Sig D (K), for K = 1, 2, 3, . . . , L, counting the relative number of K-column combinations in D that are linearly dependent (i.e., the fraction of dependent combinations among all C(L, K) of them).
Here are some properties of the signature.
(i) Due to the definition of the Spark, Sig D (K) is zero for all K < Spark(D).
(ii) For K ≥ Spark(D) there are C(L, K) possible combinations and at least one is linearly dependent, thus leading to strictly positive values, Sig D (K) > 0.
(iii) For K > N we necessarily have Sig D (K) = 1, since every group of K such columns is linearly dependent.
(iv) We conjecture that the signature is monotonically nondecreasing. This property remains to be proven, but we leave it as an open problem for the moment, as it is not needed in the analysis that follows.
(v) For the case Spark(D) = N + 1, the signature is necessarily a simple step function, being zero for K ≤ N and 1 for K > N. This explains the ease with which the previous analysis was carried out, and the reason for studying this case separately.
(vi) The signature is NP-hard to compute, just as the Spark. Still, bounds on it can be derived. One such bound, based on a known Spark, is described in the appendix; it builds on a result due to Björner, related to the analysis of matroids [29].
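Since the signature is defined combinatorially, it can be computed by brute force for small dictionaries. A minimal sketch (the sizes and the rank tolerance below are illustrative assumptions):

```python
# Brute-force signature: Sig_D(K) is the fraction of K-column subsets
# of D whose submatrix has rank < K. Feasible only for small L.
import itertools
import numpy as np

def signature(D, tol=1e-10):
    N, L = D.shape
    sig = []
    for K in range(1, L + 1):
        combos = list(itertools.combinations(range(L), K))
        dependent = sum(
            1 for T in combos
            if np.linalg.matrix_rank(D[:, list(T)], tol=tol) < K
        )
        sig.append(dependent / len(combos))
    return sig

rng = np.random.default_rng(1)
D = rng.standard_normal((4, 6))   # random: Spark(D) = N + 1 = 5 a.s.
sig = signature(D)
# Properties (i)-(iii) and (v): zero below the Spark, 1 above N,
# and a step function in the maximal-Spark case.
print(sig)                        # [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
```

For a dictionary with a deliberately reduced Spark, the same routine yields the intermediate fractional values that the analysis below relies on.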
While for the worst-case analysis the exact value of Sig D (K) has no consequence, this value becomes of extreme importance in evaluating probabilities of uniqueness under our probabilistic regime of signals. For illustration, the signatures of both dictionaries used in Section 2 are given in Figure 4.

Top interval: ‖α‖₀ > N
Assume a representation α is drawn from the above-described random source M, with ‖α‖₀ > N. Just as in Section 3.2.1, we cannot claim it is the sparsest one describing the signal s = Dα. No matter what Spark(D) is, a subset of N linearly independent columns from D can be found, since we assume that D is full rank. Thus, s can be represented by alternative representations with only N nonzeros, leading to the following result.

Theorem 4 (Spark(D) ≤ N, ‖α‖₀ > N). Assume a dictionary D ∈ R N×L is fixed with Spark(D) ≤ N, and a representation α generated from M with cardinality ‖α‖₀ > N. Then, the probability of finding an alternative representation for s = Dα with cardinality smaller than ‖α‖₀ is 1.
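The constructive step behind this result can be sketched numerically; the sizes below, and the use of a dense α for simplicity, are illustrative assumptions:

```python
# Sketch of the reduction: whenever ||alpha||_0 > N, the signal
# s = D @ alpha can be rewritten exactly over any N linearly
# independent columns of D (such a set exists since D is full rank).
import numpy as np

rng = np.random.default_rng(2)
N, L = 4, 10
D = rng.standard_normal((N, L))

alpha = rng.standard_normal(L)          # dense: cardinality L > N
s = D @ alpha

# for a random full-rank dictionary, the first N columns are
# linearly independent almost surely
T = list(range(N))
beta = np.zeros(L)
beta[T] = np.linalg.solve(D[:, T], s)   # exact N-term representation

assert np.allclose(D @ beta, s)
print(np.count_nonzero(beta))           # 4 nonzeros instead of 10
```

The alternative representation is exact, so uniqueness is certainly lost in this range.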

This case resembles the case of ‖α‖₀ > N with the maximal Spark, as discussed in Section 3.2.1. There is one notable but inconsequential difference: whereas in the maximal-Spark case we could claim that any group of N columns is linearly independent, here we can only say that one such group exists. Again, this difference has no influence on the outcome, which is a complete and certain loss of uniqueness, as expected.

Medium interval: Spark(D) ≤ ‖α‖₀ ≤ N
Assume now that a representation drawn from M is of cardinality ‖α‖₀ = K in the range [Spark(D), N]. If the K columns chosen are linearly dependent, an immediate reduction can lead to an alternative representation with a smaller cardinality: one column in this group can be replaced by a linear combination of the K − 1 others, and uniqueness is lost. Thus, at least a Sig D (K) portion of the cases lose uniqueness this way. Note that all K-column combinations are equally probable due to the prior assumptions, and thus the signature value applies directly, without weighting.

If, on the other hand, the K columns pointed to by α are linearly independent, sparser alternative representations cannot be found, leaning on the same rationale exercised earlier: any other group of K − 1 or fewer columns could potentially create a competing representation for the signal in mind, but the union of the volumes of all these subspaces does not cover a substantial portion of the K-dimensional signal space, and thus no sparser solutions will be found. Thus, in the search for sparser representations, the probability that the given representation is the sparsest is 1 − Sig D (K), with problems encountered only for linearly dependent groups.
When addressing the quest for same-cardinality alternatives, we focus on the linearly independent cases, since those have no sparser alternatives. For a given such set of K columns used by the original representation, assume that the first K − 1 of them, together with one additional column, form a linearly dependent set. This implies that K − 1 alternative representations of the same cardinality are possible, obtained by replacing each of the first K − 1 columns with the external one; such combinations lead to a weak uniqueness result.
Similarly, if the first K − 2 columns in the original group can be joined with a different column to form a linear dependency, we get K − 2 alternative representations of the same cardinality. This continues with smaller groups, yielding K − 3, K − 4, . . . , down to Spark(D) − 1 alternatives. Going below groups of Spark(D) − 1 columns yields no further alternatives, since any smaller group joined with one external column is necessarily linearly independent.
Let us look closely at the first case generating competing solutions, and count the number of combinations that may lead to it. There are Sig D (K) · C(L, K) groups of K linearly dependent columns. Choosing one such group and replacing one of its columns by a different column from the remaining L − K ones, we get K(L − K) such replacements, all leading to the weak uniqueness result.
Similarly, taking the Sig D (K − 1) · C(L, K − 1) groups of K − 1 linearly dependent columns, we can propose per each (K − 1)(L − K + 1) replacements, and per each of those add an arbitrary column among the remaining L − K + 1 ones. Proceeding this way down to dependent groups of Spark(D) columns, we conclude that there are no more than

Σ_{j=0}^{K−Spark(D)} Sig D (K − j) · C(L, K − j) · (K − j)(L − K + j) · C(L − K + j, j)

possible combinations of K columns that lead to the existence of alternative representations. The last binomial term in each summand takes all the groups of j columns from the remaining L − K + j ones in order to finally reach K columns. Clearly, in gathering all those we should count only the linearly independent ones and discard repetitions; thus, the above number is an upper bound on the number of K-column combinations that lead to weak uniqueness. Divided by C(L, K), it gives a bound on the probability of weak uniqueness. We summarize with the following result.

Theorem 5 (Spark(D) ≤ N, Spark(D) ≤ ‖α‖₀ ≤ N). Assume a dictionary D ∈ R N×L is fixed with Spark(D) ≤ N, and a representation α generated from M with cardinality ‖α‖₀ = K in the range [Spark(D), N]. Then, considering the signal s = Dα, (1) the probability that the given representation is the sparsest of all (disregarding equally sparse alternatives) is 1 − Sig D (K); and (2) the probability of finding an alternative representation of the same cardinality is the above bound divided by C(L, K), or lower.
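The counting above can be transcribed directly into code. The following sketch (with a hypothetical helper name, and the signature values assumed known, e.g., from a brute-force computation) reproduces the probabilities quoted for the second experiment of Section 2:

```python
# Bound on the probability that a K-term representation admits an
# equally sparse alternative, assuming the counting takes the form
#   sum_j Sig_D(K-j) * C(L,K-j) * (K-j)(L-K+j) * C(L-K+j, j),
# divided by C(L, K).
from math import comb

def weak_uniqueness_bound(sig, L, K, spark):
    total = sum(sig[K - j] * comb(L, K - j) * (K - j) * (L - K + j)
                * comb(L - K + j, j)
                for j in range(K - spark + 1))
    return total / comb(L, K)

# Second experiment of Section 2: L = 20, Spark(D) = 6, and a single
# dependent 6-column group, so Sig_D(6) = 1/C(20,6); the dependent
# 7-groups are the 14 supersets of that group.
L, spark = 20, 6
sig = {6: 1 / comb(20, 6), 7: 14 / comb(20, 7)}
print(round(weak_uniqueness_bound(sig, L, 6, spark), 4))  # 0.0022
print(round(weak_uniqueness_bound(sig, L, 7, spark), 4))  # 0.0316
```

Recovering both quoted probabilities is a useful consistency check on the counting argument.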

Bottom interval: Spark(D)/2 ≤ ‖α‖₀ < Spark(D)
Suppose that a representation drawn from M has cardinality ‖α‖₀ = K in the range Spark(D)/2 ≤ K < Spark(D). This implies that the signal in mind is generated as a linear combination of K linearly independent columns. The same reasoning as before leads to the interpretation of those signals as a K-dimensional Gaussian cloud in the N-dimensional signal space. Every subgroup of K − 1 (or fewer) columns from D is linearly independent as well, and all such subgroups together span subspaces of dimension K − 1 (or below, resp.), leading to the conclusion that with probability 1 no sparser representation can be found. As to equally sparse alternatives, the same analysis as in Section 3.2.3 shows that none can be found. Thus we have the following strong uniqueness result.

Theorem 6 (Spark(D) ≤ N, Spark(D)/2 ≤ ‖α‖₀ < Spark(D)). Assume a dictionary D ∈ R N×L is fixed with Spark(D) ≤ N, and a representation α generated from M with cardinality ‖α‖₀ in the range [Spark(D)/2, Spark(D)). Then, considering the signal s = Dα, the probability of finding an alternative representation for s with cardinality ‖α‖₀ or smaller is 0.

Relation to the empirical results
In the second experiment in Section 2 we had N = 8 and Spark(D) = 6, matching the case studied here. Due to Theorem 4 it is clear that there is no uniqueness for ‖α‖₀ > 8.
Theorem 5 supplies the results for 6 ≤ ‖α‖₀ ≤ 8, stating that the probability of strong uniqueness is 1 − Sig D (‖α‖₀). Since this number is very close to 1 (e.g., for ‖α‖₀ = 8 this value is 1 − 7.2 × 10⁻⁴), the 100 experiments found no such cases, as can be seen in Figure 2.
As to equally sparse alternatives, the probability of finding those for K = 6 is 6 · 14/C(20, 6) = 0.0022, and for K = 7 it is 0.0316; in both cases quite low, but possible to encounter, as indeed displayed in the shown results. Figure 5 presents a graph parallel to Figure 2, assuming for convenience that Spark(D) is even.

RELATION TO AVERAGE PERFORMANCE OF PURSUIT ALGORITHMS
Given a signal S ∈ R N known to have a sparse representation over the dictionary D, we are interested in finding its representation faithfully, and with a reasonable amount of computation. We assume that the signal is drawn from the presented source model, by first generating a representation α at random from M and then computing S = Dα. This fully characterizes how the signals are distributed.
Applying pursuit algorithms to S, can we guarantee successful recovery of α? Clearly, if α is not the unique (sparsest) representation of S, there is no point to this question, since we do not want to recover α in those cases. So our question focuses on the cases where uniqueness holds true, asking whether the pursuit algorithms succeed. Previous work analyzed this question for several variants of the greedy algorithm [6-13]. Other work studied the basis pursuit algorithms [10, 14-20]. All these works concentrated on the worst-case scenario, just as described above with respect to the uniqueness property, showing that if the signal has a sparse enough representation, the pursuit will succeed. The bound on sparsity is more restrictive than the uniqueness one, and its development is far more complicated in general. This bound is built on the definition of the mutual incoherence M(D), the maximum over all absolute off-diagonal entries in the Gram matrix D H D.
In order to guarantee successful recovery of the representation, it should have fewer than 0.5(1 + M(D)⁻¹) nonzeros.
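Computing the mutual incoherence and the resulting worst-case guarantee is straightforward; a minimal sketch with an arbitrary random unit-norm dictionary (the sizes are illustrative, not the actual dictionaries of the experiments):

```python
# Mutual incoherence M(D): the maximal absolute off-diagonal entry of
# the Gram matrix of the column-normalized dictionary, and the largest
# cardinality strictly below the worst-case bound 0.5 * (1 + 1/M).
import numpy as np

rng = np.random.default_rng(3)
D = rng.standard_normal((30, 80))
D /= np.linalg.norm(D, axis=0)           # unit-norm columns

G = np.abs(D.T @ D)                      # Gram matrix magnitudes
np.fill_diagonal(G, 0)                   # ignore the unit diagonal
M = G.max()                              # mutual incoherence

bound = 0.5 * (1 + 1 / M)
guaranteed = int(np.ceil(bound)) - 1     # largest cardinality below the bound
print(M, guaranteed)
```

For dictionaries of this size, M typically lies around 0.5-0.7, so the worst-case guarantee covers only one nonzero entry, as reported below.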
Thus, parallel to the results summarized in the previous section, there is great interest in knowing whether the pursuit algorithms succeed beyond this bound in probability. As mentioned earlier, the works in [21, 22] are the first to address this question directly, with results for a specific structure of dictionaries and with an asymptotic formulation. Here we offer some empirical results that set the stage for a theoretical analysis of the behavior of pursuit algorithms with general dictionaries. Figures 6 and 7 present the results for two dictionaries of size 30 × 80. The first, being completely random, leads to the maximal Spark, and its mutual incoherence is 0.5575; thus, success of the pursuit algorithms is guaranteed only for representations with one nonzero entry. The second graph corresponds to a dictionary of the same size, but with a deliberate reduction of the Spark to 15. This dictionary's mutual incoherence is 0.7075, again implying that only representations with one nonzero entry are guaranteed to be recovered by the pursuit algorithms. As can be seen from the results, (1) in both cases the success rate is high for ‖α‖₀ ≤ 5 and decays gracefully from there; (2) although the two dictionaries have very different Sparks, the results of the pursuit algorithms are very similar in both cases; (3) the two pursuit algorithms perform very similarly, with weaker performance of the greedy algorithm for small cardinalities and a slower decay in performance as the cardinality grows.
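For concreteness, a bare-bones version of the greedy algorithm (orthogonal matching pursuit) can be sketched as follows; this is an illustrative implementation under assumed sizes, not the code used to produce the reported figures:

```python
# Minimal orthogonal matching pursuit: greedily pick k columns of D,
# re-fitting the coefficients over the chosen support at each step.
import numpy as np

def omp(D, s, k):
    support = []
    residual = s.copy()
    coef = np.zeros(0)
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # best-correlated column
        support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], s, rcond=None)
        residual = s - D[:, support] @ coef
    alpha = np.zeros(D.shape[1])
    alpha[support] = coef
    return alpha

rng = np.random.default_rng(7)
D = rng.standard_normal((30, 80))
D /= np.linalg.norm(D, axis=0)           # unit-norm columns

true = np.zeros(80)
true[[3, 40]] = [3.0, 1.5]               # an assumed 2-sparse representation
recovered = omp(D, D @ true, 2)
print(np.allclose(recovered, true, atol=1e-8))
```

Note that this 2-sparse example already exceeds the worst-case guarantee of one nonzero entry, in line with the average-case behavior discussed above.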
An analysis of these results from a theoretical point of view is valuable and should be carried out. In particular, it is interesting to ask whether the Spark or the signature has any role in dictating the pursuit results in probability.

CONCLUSIONS
In this paper we have studied the uniqueness of sparse representations of signals over a given overcomplete dictionary. We saw both empirically and theoretically that such representations are likely to be the sparsest ones for the signal they form if they are sparse enough. Previous work has shown that below half the Spark of the dictionary, the representation is necessarily the sparsest. Here we have extended this result and shown that representations with fewer than Spark(D) nonzero entries are the sparsest with probability 1, and that even beyond this cardinality, uniqueness can still be claimed with high probability.
A very helpful tool in our analysis is the signature of the dictionary. Further work is required in order to find ways to approximate or bound this function. Another promising direction for future research is the analysis of pursuit algorithms using the same probabilistic model presented here, extending the results in [21, 22]. Simulation results here and in [15] indicate that these algorithms are expected to perform far better than the worst-case bounds suggest. A similar analysis could shed light on this behavior.
Approximate representations, rather than exact ones, are appealing for many applications. A parallel study of the uniqueness of such representations is of great importance as well, extending prior results given in [10].

AN UPPER BOUND ON THE SIGNATURE
We have given the following definition.
Definition 2 (signature). For a matrix D ∈ R N×L , its signature is defined as the discrete function Sig D (K), for K = 1, 2, 3, . . . , L, counting the relative number of K-column combinations in D that are linearly dependent.
The signature is NP-hard to compute, just as the Spark (and in fact harder). Still, bounds on it can be derived. One such bound, shown here, assumes that the Spark is known. The result is due to Björner, who analyzed and bounded properties of matroids [29]; it was adapted to the bounding of the signature by Goldberg [30]. We state the result here without proof or discussion. Further work is required in order to bound the signature better, e.g., by taking into account known interactions between the dictionary's columns.
Theorem 7 (upper bound on the signature). For a full-rank dictionary D ∈ R N×L (rank N) with Spark(D) = σ ≤ N + 1, the signature of the dictionary is upper bounded, for K = 1, 2, . . . , N, by the expression given in (A.10). The rationale behind this result is that Spark(D) = σ does not necessarily mean that every combination of σ columns from D is linearly dependent; in fact, only a limited number of such sets can be dependent without violating the rank of the matrix. As a simple example, consider a full-rank dictionary of size 3 × 4 with Spark(D) = 3. If every triplet of columns were linearly dependent, the rank would be 2: we could drop the first column, as it is spanned by the next two; the remaining triplet is also dependent, so one more column could be dropped without affecting the spanned column space. The rank would thus be 2, violating the initial assumption that D is full rank. Hence only one of the 4 possible triplets can be linearly dependent. The above theorem generalizes this idea.
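The 3 × 4 example can be verified directly; a small sketch with an explicitly constructed dictionary (chosen here for illustration):

```python
# Verify the appendix example: a full-rank 3 x 4 dictionary with
# Spark(D) = 3 can have at most one linearly dependent column triplet.
import itertools
import numpy as np

a, b, c = np.eye(3)                       # the standard basis of R^3
D = np.column_stack([a, b, a + b, c])     # full rank, Spark(D) = 3
dependent = [T for T in itertools.combinations(range(4), 3)
             if np.linalg.matrix_rank(D[:, list(T)]) < 3]
print(len(dependent))                     # 1: only {a, b, a+b} is dependent
```

Any second dependent triplet would force the rank down to 2, exactly as argued above.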