- Open Access
Bayesian approach with prior models which enforce sparsity in signal and image processing
© Mohammad-Djafari; licensee Springer. 2012
- Received: 18 January 2012
- Accepted: 1 March 2012
- Published: 1 March 2012
In this review article, we propose to use the Bayesian inference approach for inverse problems in signal and image processing where we want to infer sparse signals or images. The sparsity may be directly in the original space or in a transformed space. Here, we consider it directly in the original space (impulsive signals). To enforce sparsity, we consider probabilistic prior models, try to give an exhaustive list of such models, and try to classify them. These models are either heavy tailed (generalized Gaussian, symmetric Weibull, Student-t or Cauchy, elastic net, generalized hyperbolic, and Dirichlet) or mixture models (mixture of Gaussians, Bernoulli-Gaussian, Bernoulli-Gamma, mixture of translated Gaussians, mixture of multinomials, etc.). Depending on the prior model selected, the Bayesian computations (optimization for the joint maximum a posteriori (MAP) estimate, or MCMC or variational Bayes approximation (VBA) for posterior means (PM) or complete density estimation) may become more or less complex. We present these models, discuss the different possible Bayesian estimators, derive the corresponding algorithms, and discuss their relative complexities and performances.
- Bayesian approach
- sparse priors
- inverse problems
which is given by f̂ = (H'H + λI)^{-1} H'g. When the regularization parameter λ = 0, one gets a generalized inverse, and when H is invertible, one gets the normal inverse solution f̂ = H^{-1}g. The regularization theory has been developed since the pioneering work of Tikhonov and of Tikhonov and Arsénine, who introduced a quadratic regularization term to account for some prior properties of the solution (smoothness). Since then, many different regularization terms have been proposed. In particular, in place of the L2 norm L2(f) = ||f||_2^2 = Σ_j |f_j|^2, it has been proposed to use the L0 pseudo-norm or the L1 norm L1(f) = ||f||_1 = Σ_j |f_j| to enforce the sparsity of the solution [3–11]. Then, due to the fact that L0(f) is not convex and L1(f) is convex but not differentiable at the origin, the optimization of a criterion with these expressions becomes more difficult than in the L2 norm case. For this reason, a great number of works have specialized in proposing algorithms for the optimization of such criteria.
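As a concrete illustration, the regularized solution above can be computed directly; the operator H, the impulsive signal, and the value of λ below are arbitrary toy choices, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (not from the article): g = H f + noise, with a sparse f.
n = 8
H = rng.standard_normal((n, n))
f_true = np.zeros(n)
f_true[[1, 5]] = [2.0, -1.0]
g = H @ f_true + 0.01 * rng.standard_normal(n)

def tikhonov(H, g, lam):
    """Regularized solution f_hat = (H'H + lam I)^(-1) H'g."""
    m = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(m), H.T @ g)

f_reg = tikhonov(H, g, lam=0.1)   # quadratic (Tikhonov) regularization
f_inv = tikhonov(H, g, lam=0.0)   # lam = 0: reduces to the inverse solution
```

With λ = 0 the same routine returns the (generalized) inverse solution, which here coincides with H^{-1}g since this toy H is invertible.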
Interestingly, defining the solution of problem (1) as the optimizer of a criterion with two parts can be interpreted as a maximum a posteriori (MAP) solution in a Bayesian approach, where the first term of the criterion (2) can be related to the likelihood and the second term to a prior model, as we will see in the following. The main objective is to show how the Bayesian approach can go further than regularization in at least the following aspects:
- a better account of the noise term characteristics;
- a better and easier way of translating the prior knowledge, in particular sparsity;
- new tools for assessing the regularization parameter, a great subject of discussion for all those who work with regularization theory;
- new solutions and new tools for doing the computations (optimizations and integrations).
1.1 The Bayesian approach
p(f|g, θ) = p(g|f, θ1) p(f|θ2) / p(g|θ1, θ2) ∝ p(g|f, θ1) p(f|θ2), where the sign ∝ stands for "proportional to", p(g|f, θ1) is the likelihood, p(f|θ2) the prior model, θ = (θ1, θ2) are their corresponding parameters (often called the hyper parameters of the problem), and p(g|θ1, θ2) is called the evidence of the model.
In this approach, the likelihood p(g|f, θ1) summarizes our knowledge about the noise and the model linking the observed data g to the unknowns f; the prior term p(f|θ2) summarizes our incomplete prior knowledge about the unknowns; and the posterior law p(f|g, θ) combines these two terms and contains our whole state of knowledge about the unknowns f after accounting for the prior and the observed data.
As a very simple example, when the noise is assumed to be Gaussian, the MAP solution is obtained as the optimizer of the criterion J(f) = ||g - Hf||^2 + λ Ω(f), where the expression of Ω(f) depends on the prior law. When the prior knowledge is translated as a Gaussian probability law, then Ω(f) = ||f||_2^2, and when it is translated as a Laplace probability law, then Ω(f) = ||f||_1 [12–14].
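The two choices of Ω(f) lead to very different computations, which a short sketch can make concrete. The closed form for the Gaussian prior and the use of ISTA (a standard proximal-gradient algorithm, named here only as an illustrative choice, not as the article's method) are sketched below; H, g, and λ are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
H = rng.standard_normal((n, n))
f_true = np.where(rng.random(n) < 0.2, 2.0, 0.0)   # sparse toy signal
g = H @ f_true + 0.01 * rng.standard_normal(n)
lam = 0.5

# Gaussian prior -> Omega(f) = ||f||_2^2: the MAP estimate has a closed form.
f_map_l2 = np.linalg.solve(H.T @ H + lam * np.eye(n), H.T @ g)

# Laplace prior -> Omega(f) = ||f||_1: no closed form; one standard choice
# is ISTA (proximal gradient descent on J(f) = ||g - Hf||^2 + lam ||f||_1).
def ista(H, g, lam, n_iter=500):
    L = np.linalg.norm(H, 2) ** 2       # ||H||_2^2; the gradient is 2L-Lipschitz
    f = np.zeros(H.shape[1])
    for _ in range(n_iter):
        u = f - H.T @ (H @ f - g) / L   # gradient step on the quadratic part
        f = np.sign(u) * np.maximum(np.abs(u) - lam / (2 * L), 0.0)  # soft threshold
    return f

f_map_l1 = ista(H, g, lam)
```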
The first advantage of the Bayesian approach over the regularization approach is to provide new tools for handling the hyper parameters θ.
1.2 Full Bayesian approach
In the full Bayesian approach, the hyper parameters θ are estimated along with the unknowns f, for example by first estimating θ from p(θ|g) and then using this estimate for the estimation of f from p(f|g, θ̂).
As we mentioned before, one of the main steps in the Bayesian approach is prior modeling, whose role is to translate our prior knowledge about the unknown signal or image into a probability law. Sparsity is one piece of prior knowledge we may want to translate. The main objective of this article is to examine the different possibilities.
1.3 Prior modeling
- generalized Gaussian (GG), with Gaussian (G) and Laplace or double exponential (DE) as particular cases;
- symmetric Weibull (W), with symmetric Rayleigh (R) and again the DE as particular cases;
- Student-t (St), with Cauchy (C) as a particular case;
- elastic net prior model;
- generalized hyperbolic model;
- Dirichlet and symmetric Dirichlet;
- mixture of two centered Gaussians (MoG2), one with a very small variance and one with a large variance;
- Bernoulli-Gaussian (BG), also called spike and slab;
- mixture of two Gammas (MoGamm);
- mixture of three Gaussians (MoG3), one centered with a very small variance and two centered symmetrically on the positive and negative axes with large variances;
- mixture of one Gaussian and two Gammas (MoGGammas); and, more briefly, the Bernoulli-multinomial (BMult) or mixture of Dirichlet (MoD) models.
Some of these models are well known [12–14, 18–26], others less so. In general, we can classify them into two categories: (i) simple non-Gaussian models with heavy tails and (ii) mixture models with hidden variables, which lead to hierarchical models.
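To give a feel for how these two categories enforce sparsity, the following sketch draws samples from three of the priors above; the mixture proportions and variances are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# (i) A simple heavy-tailed prior: Student-t samples (3 degrees of freedom).
f_st = rng.standard_t(df=3, size=n)

# (ii) MoG2: a centered Gaussian with very small variance (prob. 0.9) mixed
# with a centered Gaussian with large variance (prob. 0.1).
z = rng.random(n) < 0.9
f_mog2 = np.where(z, rng.normal(0.0, 0.05, n), rng.normal(0.0, 2.0, n))

# (iii) Bernoulli-Gaussian ("spike and slab"): exact zeros with prob. 0.9.
f_bg = np.where(rng.random(n) < 0.9, 0.0, rng.normal(0.0, 2.0, n))

sparsity = np.mean(f_bg == 0.0)                          # close to 0.9
kurt = np.mean(f_mog2 ** 4) / np.mean(f_mog2 ** 2) ** 2  # > 3: heavy tails
```

Note the qualitative difference: the heavy-tailed and MoG2 priors produce many small but nonzero values, while the Bernoulli-Gaussian prior produces exact zeros.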
In Section 2, we give more details about sparsity and all these prior models which enforce it.
1.4 Bayesian computation
The second main step in the Bayesian approach is to do the computations. Depending on the prior model selected, the Bayesian computations needed are:

- for simple prior models: optimization of the posterior p(f|g, θ) for the MAP estimate, or MCMC or VBA for the posterior mean;
- for hierarchical prior models with hidden variables z: joint optimization of p(f, z|g, θ) for the joint MAP estimate, or MCMC or VBA on the joint posterior.
The second main objective of this article is to discuss the relative complexities and performances of the algorithms obtained with the proposed prior laws.
The rest of the article is organized as follows:
In Section 2, we present the proposed prior models in detail and discuss their properties. For example, we will see that the Student-t model can be interpreted as an infinite mixture with a variance hidden variable, or that the BG model can be considered as the degenerate case of a MoG2 where one of the variances goes to zero. Also, we will examine the lesser-known MoG3 and MoGGammas models, where the heavy tails are obtained by combining a centered Gaussian with two large-variance non-centered Gaussians or Gammas.
In Section 3, we examine the expressions of the posterior laws obtained using these priors and then discuss the complexity of the corresponding Bayesian computations. In particular, for the mixture models, we give details of the joint estimation of the signal and the hidden variables, as well as of the hyper parameters (the parameters of the mixtures and of the noise) for unsupervised cases.
In Section 4, we give more details on the variational Bayesian approximation method, first for the general case and then for the case of mixture laws and more specifically the case of the Student-t considered as a continuous mixture.
Finally, we present the main conclusions of this article in Section 5.
First, as we mentioned, sparsity is a property which can be described either directly for the signal itself or after some transformation, for example on the derivative of the signal or, more generally, on the coefficients of the projection of the signal on any basis or any set of functions.
Different prior models have been used to enforce sparsity.
2.1 Generalized Gaussian (GG), Gaussian (G) and double exponentials (DE) models
The generalized Gaussian prior is p(f_j|β, γ) ∝ exp{-γ|f_j|^β}. Two particular cases are of importance:

- β = 2 (Gaussian): p(f_j|γ) ∝ exp{-γ f_j^2}; (12)
- β = 1 (double exponential or Laplace): p(f_j|γ) ∝ exp{-γ|f_j|}. (13)
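These two particular cases can be checked numerically with SciPy's `gennorm` distribution, which implements the generalized Gaussian for the fixed normalization γ = 1 (chosen here only for illustration):

```python
import numpy as np
from scipy.stats import gennorm, norm, laplace

x = np.linspace(-3.0, 3.0, 101)

# scipy's gennorm implements p(f) = beta / (2 Gamma(1/beta)) * exp(-|f|^beta),
# i.e., the generalized Gaussian density with gamma = 1.
p_beta2 = gennorm.pdf(x, beta=2)
p_beta1 = gennorm.pdf(x, beta=1)

# beta = 2 recovers a Gaussian (with sigma = 1/sqrt(2) in this normalization),
# beta = 1 recovers the Laplace / double-exponential density.
assert np.allclose(p_beta2, norm.pdf(x, scale=1.0 / np.sqrt(2.0)))
assert np.allclose(p_beta1, laplace.pdf(x))
```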
2.2 Symmetric Weibull (W) and symmetric Rayleigh (R) models
2.3 Student-t (St) and Cauchy (C) models
2.4 Elastic Net (EN) prior model
2.5 Generalized hyperbolic (GH) prior model
2.6 Dirichlet (D) and symmetric Dirichlet (SD) models
It is noted that the support of this distribution is [0, 1]^N and that ||f||_1 = Σ_j f_j = 1.
It is also interesting to note that a sample from the Dirichlet distribution is itself a probability distribution, specifically an N-dimensional discrete distribution, and that the support of an N-dimensional Dirichlet distribution is the open standard (N - 1)-simplex, a generalization of a triangle embedded in the next-higher dimension.
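A quick numerical sketch makes these two remarks concrete; the dimension N and the α values below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 20

# Symmetric Dirichlet draws: every sample lies on the (N-1)-simplex,
# i.e., f_j >= 0 and sum_j f_j = 1, so each draw is itself a discrete
# probability distribution over N outcomes.
f_sparse = rng.dirichlet(np.full(N, 0.1), size=1000)   # alpha < 1: near-sparse
f_dense = rng.dirichlet(np.full(N, 10.0), size=1000)   # alpha > 1: dense

assert np.allclose(f_sparse.sum(axis=1), 1.0)

# With alpha < 1 the mass concentrates near the simplex corners, so most
# components are nearly zero; with alpha > 1 the components spread evenly.
near_zero_sparse = np.mean(f_sparse < 1e-3)
near_zero_dense = np.mean(f_dense < 1e-3)
```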
2.7 Mixture of two Gaussians (MoG2) model
2.8 Bernoulli-Gaussian (BG) model
2.9 Mixture of three Gaussians (MoG3) model
2.10 Mixture of one Gaussian and two Gammas (MoGGammas) model
2.11 Bernoulli-Gamma (BGamma) model
2.12 Mixture of Dirichlet (MoD) model
The mixture of Dirichlet model is p(f|λ, α1, α2) = λ SD(f|α1) + (1 - λ) SD(f|α2), (38) where SD(f|α) is the symmetric Dirichlet distribution. We need to choose α1 > 1 for the dense part and 0 < α2 < 1 for the sparse part.
2.13 Bernoulli-multinomial (BMultinomial) model
Now, we consider different priors.
3.1 Simple prior models
We may look at each case to examine the range of the parameters for which this Hessian matrix is positive definite.
When a great number of samples are thus generated, we may compute their means, variances or any other statistics about them.
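A minimal sketch of this idea, using a random-walk Metropolis sampler on a toy one-dimensional target (not one of the article's posteriors), with the posterior mean and variance estimated from the chain:

```python
import numpy as np

rng = np.random.default_rng(4)

# Random-walk Metropolis sketch: draw samples from an (unnormalized)
# posterior and estimate its mean and variance from the generated samples.
def log_post(f):                      # toy target: a standard Gaussian
    return -0.5 * f ** 2

f, chain = 0.0, []
for _ in range(20_000):
    prop = f + rng.normal(0.0, 1.0)   # symmetric proposal
    if np.log(rng.random()) < log_post(prop) - log_post(f):
        f = prop                      # accept; otherwise keep the current f
    chain.append(f)

samples = np.array(chain[5000:])      # discard burn-in
post_mean, post_var = samples.mean(), samples.var()
```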
We recently implemented these algorithms for different applications, such as synthetic aperture radar (SAR) imaging, ...
3.2 Mixture models
For the mixture models, and in general for the models which can be expressed via hidden variables, we want to estimate jointly the original unknowns f and the hidden variables: τ in the Cauchy model, z in the MoG2, BG, or BGamma models, and z in the MoG3 or MoGGammas models. Let us examine these in a little more detail.
3.3 Student-t and Cauchy models
Note that τ_j is the inverse of a variance. We can interpret this scheme as an iterative quadratic regularization inversion followed by the estimation of the τ_j, which are used in the next iteration to define the variance matrix D(τ).
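This alternating scheme can be sketched as follows. The orthogonal toy operator H, the noise variance, and ν = 1 (the Cauchy case) are illustrative assumptions, and the τ_j update is written here as the standard posterior-mean update (ν + 1)/(ν + f_j²) of the Gamma mixing law:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 16
H, _ = np.linalg.qr(rng.standard_normal((n, n)))   # toy (orthogonal) operator
f_true = np.zeros(n)
f_true[[2, 9]] = [3.0, -2.0]
v_eps = 0.01
g = H @ f_true + np.sqrt(v_eps) * rng.standard_normal(n)
nu = 1.0                                           # nu = 1: Cauchy prior

# Alternate (i) a quadratic inversion with the current precisions tau_j on
# the diagonal of D(tau), and (ii) the update of tau_j from the current f_j
# (tau_j becomes small where |f_j| is large, relaxing the penalty there).
f = np.zeros(n)
tau = np.ones(n)
for _ in range(50):
    f = np.linalg.solve(H.T @ H / v_eps + np.diag(tau), H.T @ g / v_eps)
    tau = (nu + 1.0) / (nu + f ** 2)
```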
Here too, we may study the conditions under which the joint criterion is unimodal and its alternate optimization converges to the unique solution.
3.4 Mixture of two Gaussians (MoG2) model
3.5 BG model
For the case of BG, we have to be more careful, because the joint probability laws are degenerate. Two approaches are then possible:
i) Considering them as a particular case of the MoG models where the variance v0 is fixed to a small value or reduced gradually during the iterations.
ii) Trying first to integrate out f from the expression of p(f, z|g) to obtain p(z|g), optimizing it with respect to z (detection step), and then using the result for the estimation step.
where B(z) = H(v diag[z_j, j = 1, ..., n])H' + v_ϵ I. We see the complexity of this expression, which requires the inversion of the matrix B, and of its optimization, which is a combinatorial optimization requiring the evaluation of this expression 2^n times.
which again requires the inversion of the matrix B.
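The combinatorial cost of the detection step can be made explicit by brute-force enumeration for a small n; the model constants below (v, v_ϵ, λ and the toy H) are illustrative assumptions:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(6)
n = 6                                    # 2^6 = 64 supports: still feasible
H = rng.standard_normal((n, n))
v, v_eps, lam = 4.0, 0.01, 0.2
f_true = np.array([0.0, 3.0, 0.0, 0.0, -2.5, 0.0])
g = H @ f_true + np.sqrt(v_eps) * rng.standard_normal(n)

def log_p_z_given_g(z):
    # log p(z|g) up to a constant: Gaussian marginal N(g; 0, B(z)) with
    # B(z) = H diag(v z) H' + v_eps I, times the Bernoulli prior on z.
    B = H @ np.diag(v * z) @ H.T + v_eps * np.eye(n)
    _, logdet = np.linalg.slogdet(B)
    quad = g @ np.linalg.solve(B, g)
    log_prior = np.sum(z * np.log(lam) + (1 - z) * np.log(1 - lam))
    return -0.5 * (logdet + quad) + log_prior

# The detection step is combinatorial: all 2^n supports must be scored,
# each score requiring the inversion (here a solve) of an n x n matrix.
scores = {z: log_p_z_given_g(np.array(z, float)) for z in product([0, 1], repeat=n)}
z_map = max(scores, key=scores.get)
```

For n beyond a few tens this enumeration is hopeless, which is precisely why the approximate methods discussed next are needed.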
The exact computation of these estimates is often too costly, so one may try to obtain approximate solutions. Many approximations have been proposed; a good overview of these methods can be found in [30, Chap. 5] and also in [31, 32].
3.6 BGamma and MoGGammas model
In these cases, it is no longer possible to integrate out f analytically, as it was with Gaussians. One strategy here is to use MCMC methods to generate samples from the joint posterior. A second approach is to approximate the joint posterior by a simpler one, for example by one which is separable in f and the hidden variables z in the BGamma or MoGGammas cases. Very often, we can then do the computations analytically. However, it may happen that, even after these separable approximations, we still need to use MCMC methods on some of the variables. A detailed explanation of these general methods is beyond the scope of this article; see [30, 33, 34]. Here, we just give the details for the case of the Gaussian mixtures (MoG2 or MoG3).
when z is discrete valued.
The factor q1(f) will be easy to handle because it is the product of two Gaussians and is therefore a multivariate Gaussian, but the other two factors are not.
So, for a given model M, minimizing KL(q : p) is equivalent to maximizing the free energy F(q), which, when optimized, gives a lower bound for the evidence ln p(g|M).
where all the expectations are with respect to q.
where p(f, z, θ, g) = p(g|f, θ) p(f|z, θ) p(z|θ) p(θ), q2(z) = Π_j q2j(z_j), q3(θ) = Π_l q3l(θ_l), q2(z(-j)) = Π_{i≠j} q2i(z_i), and 〈·〉_q means the expected value with respect to q.
In that case, with appropriate models for the priors (exponential families) and hyper parameters (conjugate priors), we see that q(f) is a multivariate Gaussian, the q(θ_l) are either Gaussians (for the means) or Inverse Gammas (for the variances), and the q(z_j) are discrete distributions whose expressions can be written easily.
To illustrate this in more detail, we consider the case of the Student-t model.
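As a hedged numerical sketch of such a VBA scheme for the Student-t prior (with a toy orthogonal H and ν = 1; the updates below follow the standard conjugate forms, q(f) multivariate Gaussian and q(τ_j) Gamma, and are not claimed to be the article's exact algorithm):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 16
H, _ = np.linalg.qr(rng.standard_normal((n, n)))   # toy (orthogonal) operator
f_true = np.zeros(n)
f_true[[3, 11]] = [2.0, -3.0]
v_eps = 0.01
g = H @ f_true + np.sqrt(v_eps) * rng.standard_normal(n)
nu = 1.0

# Student-t prior as a continuous (Gaussian scale) mixture:
#   f_j | tau_j ~ N(0, 1/tau_j),  tau_j ~ Gamma(nu/2, nu/2).
# Factorized approximation q(f) q(tau) with conjugate updates:
#   q(f)     = N(mu, Sigma)     (multivariate Gaussian)
#   q(tau_j) = Gamma(a, b_j)    (shape a, rate b_j)
tau_mean = np.ones(n)
for _ in range(50):
    Sigma = np.linalg.inv(H.T @ H / v_eps + np.diag(tau_mean))
    mu = Sigma @ H.T @ g / v_eps
    a = (nu + 1.0) / 2.0
    b = (nu + mu ** 2 + np.diag(Sigma)) / 2.0      # uses <f_j^2> = mu_j^2 + Sigma_jj
    tau_mean = a / b                               # <tau_j> under q(tau_j)
```

Each pass updates the Gaussian q(f) with the current expected precisions and then the Gamma factors q(τ_j) with the current second moments of f, exactly the alternation described above for the hierarchical models.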