
# Enhanced multi-task compressive sensing using Laplace priors and MDL-based task classification

*EURASIP Journal on Advances in Signal Processing*
**volume 2013**, Article number: 160 (2013)

## Abstract

In multi-task compressive sensing (MCS), the original signals of multiple compressive sensing (CS) tasks are assumed to be correlated. This correlation is exploited to recover the signals jointly and thereby improve signal reconstruction performance. In this paper, we first develop an improved version of MCS that imposes sparseness over the original signals using Laplace priors. The newly proposed technique, termed the Laplace prior-based MCS (LMCS), adopts a hierarchical prior model, and MCS is shown analytically to be a special case of LMCS. This paper next considers the scenario where the CS tasks belong to different groups. In this case, the original signals from different task groups are not well correlated, which would degrade the signal recovery performance of both MCS and LMCS. We propose the use of the minimum description length (MDL) principle to enhance the MCS and LMCS techniques. New algorithms, referred to as MDL-MCS and MDL-LMCS, are developed. They first classify tasks into different groups and then reconstruct the signals in each cluster jointly. Simulations demonstrate that the proposed algorithms outperform several state-of-the-art benchmark techniques.

## 1 Introduction

If a signal is compressible in the sense that its representation in a certain linear canonical basis is sparse, it can then be recovered from measurements obtained at a rate much lower than the Nyquist frequency using the technique of compressive sensing (CS) [1–3]. Mathematically, in CS, the signal is measured via

$$\boldsymbol{y} = \mathbf{\Phi}_0\mathbf{\Psi}\boldsymbol{\theta} + \boldsymbol{n} = \mathbf{\Phi}\boldsymbol{\theta} + \boldsymbol{n}, \qquad (1)$$

where $\boldsymbol{\theta}$ is the $N \times 1$ original signal vector, $\mathbf{\Phi}_0$ denotes the $M \times N$ measurement matrix, $\mathbf{\Psi}$ denotes the $N \times N$ linear basis, $\mathbf{\Phi} = \mathbf{\Phi}_0\mathbf{\Psi}$, $\boldsymbol{y}$ is the $M \times 1$ compressive measurement vector, and $\boldsymbol{n}$ is the additive noise. Since $M$ is far smaller than $N$, the original signal is now compressively represented, but the inverse problem, namely recovering $\boldsymbol{\theta}$ from $\boldsymbol{y}$, is in general ill-posed. If $\boldsymbol{\theta}$ is sparse (i.e., most of its elements are zero), the signal reconstruction problem could become feasible. An approximation to the original signal in this case can be obtained through the technique of basis pursuit that solves

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \|\boldsymbol{\theta}\|_1 \quad \text{subject to} \quad \|\boldsymbol{y} - \mathbf{\Phi}\boldsymbol{\theta}\|_2 \le \varepsilon, \qquad (2)$$

where $\|\cdot\|_2$ and $\|\cdot\|_1$ denote the $l_2$-norm and the $l_1$-norm, respectively, and the scalar $\varepsilon$ is a small constant. Equation 2 has been the starting point for the development of many signal recovery methods in the literature. Among them, the recovery algorithms under the Bayesian framework provide some advantages over other formulations. These include providing probabilistic predictions, automatic estimation of model parameters, and the evaluation of the uncertainty of reconstruction. The existing Bayesian approaches include the Bayesian compressive sensing (BCS) [4] that stems from the relevance vector machine [5] and the Laplace prior-based BCS [6].
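As a rough illustration of how an $l_1$ penalty is used to attack the underdetermined system in Equation 1, the sketch below runs iterative soft-thresholding (ISTA) on a tiny made-up instance. This is a swapped-in technique: ISTA minimizes the Lagrangian form $\frac{1}{2}\|\boldsymbol{y}-\mathbf{\Phi}\boldsymbol{\theta}\|_2^2 + \tau\|\boldsymbol{\theta}\|_1$ rather than the constrained problem in Equation 2, and it is not one of the Bayesian methods discussed in this paper. The matrix, the weight `tau`, and all function names are illustrative assumptions.

```python
import math

def matvec(A, x):
    """Return A x for a list-of-rows matrix A."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rmatvec(A, r):
    """Return A^T r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]

def soft(v, t):
    """Soft-thresholding operator, the proximal map of t * |v|."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def objective(Phi, y, theta, tau):
    """0.5 * ||y - Phi theta||_2^2 + tau * ||theta||_1."""
    r = [yi - pi for yi, pi in zip(y, matvec(Phi, theta))]
    return 0.5 * sum(v * v for v in r) + tau * sum(abs(t) for t in theta)

def ista(Phi, y, tau, iters=500):
    # Step size 1/lip, where lip is the squared Frobenius norm of Phi,
    # an upper bound on the largest eigenvalue of Phi^T Phi.
    lip = sum(v * v for row in Phi for v in row)
    theta = [0.0] * len(Phi[0])
    for _ in range(iters):
        residual = [pi - yi for pi, yi in zip(matvec(Phi, theta), y)]
        grad = rmatvec(Phi, residual)
        theta = [soft(t - g / lip, tau / lip) for t, g in zip(theta, grad)]
    return theta

# Made-up example: M = 3 measurements of an N = 5 signal with one nonzero entry.
Phi = [[1.0, 0.0, 0.5, 0.0, -0.5],
       [0.0, 1.0, 0.5, 0.5, 0.0],
       [0.5, -0.5, 0.0, 1.0, 0.5]]
theta_true = [0.0, 2.0, 0.0, 0.0, 0.0]
y = matvec(Phi, theta_true)          # noiseless measurements
theta_hat = ista(Phi, y, tau=0.01)
```

With the conservative step size used here, each iteration cannot increase the penalized objective, which makes the sketch easy to sanity-check even though it says nothing about recovery guarantees.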

In [7], multi-task compressive sensing (MCS) was introduced within the Bayesian framework. In this work, a CS task refers to the union of an original signal vector, the measurement matrix, and the associated compressive measurement vector obtained using Equation 1. In contrast to the CS aim of recovering a single signal from its compressive measurements, MCS exploits the statistical correlation among the original signals of multiple CS tasks and recovers them jointly to improve the signal reconstruction performance. It has been shown in [7] that MCS allows recovering in a robust manner the signals whose compressive measurements are insufficient when they are reconstructed separately. The MCS technique has been investigated extensively in machine learning literature, where it was referred to as simultaneous sparse approximation (SSA) [8–12] as well as distributed compressed sensing [13]. In [14], an empirical Bayesian strategy for SSA was developed.

The contribution of this paper is twofold. We shall first extend the work of [6] on the Laplace prior-based BCS to the MCS scenario. A new MCS algorithm for signal recovery, termed as the Laplace prior-based MCS (LMCS), is developed. We impose Laplace priors on the original signals in a hierarchical manner and show that the MCS is indeed a special case of LMCS. The incorporation of Laplace priors enforces signal sparsity to a higher extent [15] and offers posterior distributions rather than point estimates as in MCS. Another advantage comes from the log-concavity of the Laplace distribution, which leads to unimodal posterior distribution and eliminates the presence of local minima as a result.

The second part of this work comes from the following observation. Specifically, in order to provide satisfactory signal reconstruction performance, the MCS technique from [7], together with the newly proposed LMCS method, requires that the original signals of the multiple CS tasks are well correlated statistically. This assumption may not be fulfilled in many practical applications. For instance, some original signals may be realizations of different signal templates that differ in their supports. In other words, they could belong to different signal groups, and the statistical correlation among them is weak, which would degrade the signal recovery performance. A possible approach to address this problem is to group the CS tasks before the signal reconstruction stage, as in the MCS with Dirichlet process priors (DP-MCS) [16].

The second contribution of this paper is the use of the minimum description length (MDL) principle to augment the MCS and LMCS methods. The obtained techniques are referred to as the MDL-MCS and MDL-LMCS algorithms. The MDL principle has been adopted to solve the model selection problem [17–19] and can also be used in other aspects, such as sparse coding and dictionary learning [20] and radar emitter classification [21–23]. In MDL, the best model for a given data $\boldsymbol{y}$ is the solution to the minimization problem

$$\hat{\omega} = \arg\min_{\omega \in \Omega} \mathit{DL}(\boldsymbol{y}, \omega).$$

Here, $\Omega$ represents the set of possible models and $\mathit{DL}(\boldsymbol{y}, \omega)$ is a codelength assignment function which defines the theoretical codelength required to describe $\boldsymbol{y}$ uniquely, which is the key component in any MDL-based classification technique. Common practice in MDL uses the ideal Shannon codelength assignment [24] to define $\mathit{DL}(\boldsymbol{y}, \omega)$ in terms of a probability assignment $p(\boldsymbol{y}, \omega)$ as $\mathit{DL}(\boldsymbol{y}, \omega) = -\log_2 p(\boldsymbol{y}, \omega)$. Applying $p(\boldsymbol{y}, \omega) = p(\boldsymbol{y}|\omega)p(\omega)$, we have

$$\hat{\omega} = \arg\min_{\omega \in \Omega} \left[-\log_2 p(\boldsymbol{y}|\omega) - \log_2 p(\omega)\right],$$

where $-\log_2 p(\omega)$ represents the model complexity. Note that the MCS and the new LMCS methods are both under the Bayesian framework, which enables their integration with the statistical MDL technique. Compared with the DP-MCS technique that utilizes variational Bayes (VB) inference and could suffer from local convergence, the newly proposed MDL-MCS and MDL-LMCS methods offer improved correct signal classification rate and better signal reconstruction performance. This is also illustrated via computer simulations in Section 5.
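The two-part codelength above can be made concrete with a toy model selection problem. The sketch below is only an illustration under simple assumptions: the data are a short bit string, the candidate models are Bernoulli distributions with fixed parameters, and the model prior is uniform. None of these choices come from the paper, and the function names are hypothetical.

```python
import math

def description_length(bits, p_one, n_models):
    """DL(y, w) = -log2 p(y | w) - log2 p(w) with a uniform prior over n_models."""
    ones = sum(bits)
    zeros = len(bits) - ones
    fit = -(ones * math.log2(p_one) + zeros * math.log2(1.0 - p_one))
    complexity = math.log2(n_models)      # -log2(1 / n_models)
    return fit + complexity

def mdl_select(bits, candidates):
    """Return the candidate Bernoulli parameter with the shortest description."""
    return min(candidates,
               key=lambda p: description_length(bits, p, len(candidates)))

bits = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]     # eight ones, two zeros
best = mdl_select(bits, [0.5, 0.8])
```

Because the prior is uniform, the complexity term is the same for every candidate here and the choice is driven by the fit term alone; a non-uniform prior would shift the balance toward models that are cheaper to encode.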

The remainder of this paper is structured as follows. In Section 2, we review the prior sharing concept in MCS and present the prior sharing framework in LMCS. Section 3 develops the proposed LMCS algorithm. We describe in Section 4 the MDL-based MCS and LMCS techniques, namely, the MDL-MCS and MDL-LMCS algorithms. Simulations are given in Section 5 to illustrate the performance of the proposed algorithms. Section 6 concludes the paper.

## 2 Prior sharing in MCS and LMCS

In the area of machine learning, information sharing among tasks is a well-known technique [25]. Typical approaches, to name a few, include sharing hidden nodes in neural networks [26, 27], assigning a common prior in hierarchical Bayesian models [28–30], placing a common structure on the predictor space [31], and the structured regularization in kernel methods [32]. Among them, the use of hierarchical Bayesian models with shared priors is one of the most important methods for multi-task learning [33–37], which is also essential for the development of MCS in [7] and the LMCS algorithm in this paper. For the sake of clarity, in the rest of this section, we shall first review the prior sharing in the MCS algorithm and then proceed to present the hierarchical Bayesian framework of LMCS.

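To facilitate the presentation, suppose there are $L$ CS tasks

$$\boldsymbol{y}_i = \mathbf{\Phi}_i\boldsymbol{\theta}_i + \boldsymbol{n}_i, \qquad (3)$$

where $i = 1, 2, \ldots, L$, $\boldsymbol{y}_i$ is the $M_i \times 1$ compressive measurement vector and $\mathbf{\Phi}_i$ is the $M_i \times N$ matrix ($M_i \ll N$) whose columns are $\mathbf{\Phi}_{i,j}$, $j = 1, 2, \ldots, N$, such that $\mathbf{\Phi}_i = [\mathbf{\Phi}_{i,1}, \ldots, \mathbf{\Phi}_{i,N}]$. Here, $\boldsymbol{\theta}_i = [\theta_{i,1}, \ldots, \theta_{i,N}]^T$ is the original signal for task $i$ and the measurement noise $\boldsymbol{n}_i$ is assumed to follow an i.i.d. Gaussian distribution with zero mean vector and covariance matrix $\beta^{-1}\mathbf{I}$. The conditional likelihood function of $\boldsymbol{y}_i$ is

$$p(\boldsymbol{y}_i|\boldsymbol{\theta}_i, \beta) = \mathcal{N}(\boldsymbol{y}_i|\mathbf{\Phi}_i\boldsymbol{\theta}_i, \beta^{-1}\mathbf{I}), \qquad (4)$$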

where $\mathcal{N}(\boldsymbol{y}_i|\mathbf{\Phi}_i\boldsymbol{\theta}_i, \beta^{-1}\mathbf{I})$ represents a Gaussian distribution with mean vector $\mathbf{\Phi}_i\boldsymbol{\theta}_i$ and covariance matrix $\beta^{-1}\mathbf{I}$. The noise precision $\beta$ follows a Gamma distribution

where *a* and *b* are the shape and scale parameters of the Gamma distribution and *Γ*(*a*) is the Gamma function.
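The Gaussian conditional likelihood of a single task is simple to evaluate numerically. The sketch below computes $\ln \mathcal{N}(\boldsymbol{y}_i|\mathbf{\Phi}_i\boldsymbol{\theta}_i, \beta^{-1}\mathbf{I})$ for list-based inputs; the function name and the example values are assumptions made for illustration only.

```python
import math

def gaussian_loglik(y, Phi, theta, beta):
    """ln N(y | Phi theta, beta^{-1} I) for a length-M measurement vector y."""
    m = len(y)
    predicted = [sum(p * t for p, t in zip(row, theta)) for row in Phi]
    sq_err = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    # (M/2) ln(beta / 2 pi) - (beta/2) ||y - Phi theta||^2
    return 0.5 * m * math.log(beta / (2.0 * math.pi)) - 0.5 * beta * sq_err

# Made-up task with M = 2 measurements and N = 3 unknowns.
Phi = [[1.0, 0.0, 2.0],
       [0.0, 1.0, 1.0]]
theta = [1.0, 0.0, 0.5]
beta = 2.0 * math.pi                      # chosen so the normalizer is ln(1) = 0
exact_fit = gaussian_loglik([2.0, 0.5], Phi, theta, beta)
off_by_one = gaussian_loglik([3.0, 0.5], Phi, theta, beta)
```

A larger noise precision $\beta$ sharpens the likelihood around $\mathbf{\Phi}_i\boldsymbol{\theta}_i$, so the same residual is penalized more heavily.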

### 2.1 Prior sharing in MCS

In MCS [7], the elements in $\boldsymbol{\theta}_i$ are statistically independent, and they follow a joint Gaussian distribution:

Here, $\boldsymbol{\alpha} = [\alpha_1, \alpha_2, \ldots, \alpha_N]^T$ is the information vector shared by the original signals $\boldsymbol{\theta}_i$ of all the $L$ tasks. Its distribution function is given by

In [7], the general strategy of setting the hyper-parameters $a$, $c$ to ones and $b$, $d$ to zeros in Equations 5 and 7 was adopted so that the priors of $\boldsymbol{\alpha}$ and $\beta$ are both uniformly distributed. As a result, they can be found via maximizing the following likelihood function:

This is equivalent to maximizing the posterior distribution of $\boldsymbol{\alpha}$ and $\beta$. The original signals $\boldsymbol{\theta}_i$ are then reconstructed using the estimated values of $\boldsymbol{\alpha}$ and $\beta$.

### 2.2 Prior sharing in LMCS

Within the LMCS framework, the original signals are assigned Laplace priors. A possible approach to achieve this is to impose Laplace priors directly on the original signal, or mathematically, let $p(\boldsymbol{\theta}_i|\lambda) = \frac{\lambda}{2}\exp\left(-\frac{\lambda}{2}\|\boldsymbol{\theta}_i\|_1\right)$ as in [6]. However, this formulation is not conjugate to the conditional distribution in Equation 4, which would render the Bayesian analysis intractable. Therefore, we adopt the hierarchical prior given by

where $\boldsymbol{\gamma} = [\gamma_1, \ldots, \gamma_N]^T$, and $p(\gamma_j|\lambda)$ and $p(\lambda|\nu)$ are the prior distributions of $\gamma_j$ and $\lambda$, respectively. Compared with the MCS model given in Equation 6, Equation 7 reveals that in LMCS, information sharing is realized via the vector $\boldsymbol{\gamma}$ and the hyper-parameter $\lambda$. We have from Equations 9 to 11

This verifies that the used hierarchical prior model results in Laplace priors for the original signals $\boldsymbol{\theta}_i$.

As in MCS, LMCS recovers the original signals in a two-step manner. In particular, it first estimates $\boldsymbol{\gamma}$, $\lambda$, $\beta$, and $\nu$ via maximizing the posterior distribution

Taking the logarithm on both sides of the above equation yields

It is straightforward to verify that

$$p(\boldsymbol{\theta}_i|\boldsymbol{\gamma},\lambda,\beta,\boldsymbol{y}_i) = \frac{p(\boldsymbol{y}_i|\boldsymbol{\theta}_i,\beta)\,p(\boldsymbol{\theta}_i|\boldsymbol{\gamma})\,p(\boldsymbol{\gamma}|\lambda)}{\int p(\boldsymbol{y}_i|\boldsymbol{\theta}_i,\beta)\,p(\boldsymbol{\theta}_i|\boldsymbol{\gamma})\,p(\boldsymbol{\gamma}|\lambda)\,d\boldsymbol{\theta}_i} = \frac{p(\boldsymbol{y}_i|\boldsymbol{\theta}_i,\beta)\,p(\boldsymbol{\theta}_i|\boldsymbol{\gamma})}{\int p(\boldsymbol{y}_i|\boldsymbol{\theta}_i,\beta)\,p(\boldsymbol{\theta}_i|\boldsymbol{\gamma})\,d\boldsymbol{\theta}_i}.$$

Furthermore, from Equations 4 and 9, we have that it also has a Gaussian distribution $\mathcal{N}(\boldsymbol{\theta}_i|\boldsymbol{\mu}_i^{\prime}, \mathbf{\Sigma}_i^{\prime})$ with the mean vector and covariance matrix equal to

where $\mathbf{\Gamma}_0 = \mathrm{diag}(1/\gamma_1, \ldots, 1/\gamma_N)$.

With the estimated $\boldsymbol{\gamma}$, $\lambda$, and $\nu$, LMCS then proceeds to reconstruct the original signals from all the $L$ CS tasks.

We illustrate the hierarchical prior model adopted in LMCS in Figure 1. It can be observed that, as in MCS, the distribution of the measurement noise $\boldsymbol{n}_i$ is dependent on the noise precision $\beta$ while the prior distribution functions of the original signals $\boldsymbol{\theta}_i$ depend on the information sharing vector $\boldsymbol{\gamma}$. The difference here is that LMCS has one more layer of prior information, which is embedded in $\lambda$. The introduction of $\lambda$ makes the prior distribution of the original signal Laplace, which is already shown in Equation 12. As a result, the proposed LMCS would promote the sparsity of the recovered signal, as pointed out in [15].
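The claim that a Gaussian prior with an exponentially distributed variance marginalizes to a Laplace prior can be checked numerically for a single coefficient. The sketch below assumes the standard construction used with Laplace priors in Bayesian CS, namely $\theta|\gamma \sim \mathcal{N}(0, \gamma)$ and $p(\gamma|\lambda) = \frac{\lambda}{2}\exp(-\lambda\gamma/2)$; this form is an assumption here, since the hierarchical prior equations are not reproduced in the text above. Under that assumption the marginal is $\frac{\sqrt{\lambda}}{2}\exp(-\sqrt{\lambda}|\theta|)$. The function names and integration settings are illustrative.

```python
import math

def mixture_density(theta, lam, upper=40.0, steps=200_000):
    """Trapezoidal estimate of the integral over gamma of
    N(theta | 0, gamma) * (lam / 2) * exp(-lam * gamma / 2).
    `upper` must be large enough that exp(-lam * upper / 2) is negligible."""
    def integrand(g):
        if g <= 0.0:
            return 0.0                     # limiting value for theta != 0
        gauss = math.exp(-theta * theta / (2.0 * g)) / math.sqrt(2.0 * math.pi * g)
        return gauss * 0.5 * lam * math.exp(-0.5 * lam * g)
    h = upper / steps
    total = 0.5 * (integrand(0.0) + integrand(upper))
    total += sum(integrand(k * h) for k in range(1, steps))
    return total * h

def laplace_density(theta, lam):
    """Closed-form marginal under the assumed hierarchy."""
    return 0.5 * math.sqrt(lam) * math.exp(-math.sqrt(lam) * abs(theta))

numeric = mixture_density(1.0, 4.0)
closed_form = laplace_density(1.0, 4.0)
```

The two values should agree to several decimal places, which is the single-coefficient version of the statement that the hierarchical model yields a Laplace prior on the signal.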

## 3 Multi-task compressive sensing using Laplace priors

We shall present the proposed LMCS algorithm in this section. The LMCS method differs from the MCS technique only in the step of identifying the information sharing vector $\boldsymbol{\gamma}$ and the parameters $\lambda$ and $\nu$, while their signal recovery steps are the same. As a result, we shall focus on the estimation of $\boldsymbol{\gamma}$, $\lambda$, and $\nu$. Interested readers are directed to [7] for details on the signal recovery process.

As shown in previous works [7, 38, 39], the signal reconstruction performance would be degraded if the noise precision $\beta$ is not properly initialized. Therefore, in this work, we consider $\beta$ as a nuisance parameter and integrate it out to reduce the number of unknowns and improve the robustness of the algorithm. For this purpose, the prior distributions of the original signals $\boldsymbol{\theta}_i$ are rewritten as in [7]:

where $\beta$ has a Gamma prior distribution

Note that in this case, $p(\boldsymbol{\theta}_i|\boldsymbol{\gamma}, \lambda, \beta, \boldsymbol{y}_i)$ given above Equation 15 is still Gaussian with the mean vector and the covariance matrix given in Equations 15 and 16. After taking integration with respect to $\beta$, we have

where $\det(\cdot)$ is the determinant operator and

Note that $p(\boldsymbol{\theta}_i|\boldsymbol{\gamma}, \lambda, \boldsymbol{y}_i)$ has the functional form of a Student's $t$ distribution, which is heavy tailed and as a result makes the LMCS algorithm more robust to the presence of outliers in the measurement noise in $\boldsymbol{y}_i$, if any, as pointed out in [40].

Taking integration with respect to $\beta$ on both sides of Equation 13, using Equation 19, and applying the logarithm yields the posterior distribution function

$$\sum_{i=1}^{L}\ln p(\boldsymbol{\theta}_i, \boldsymbol{\gamma}, \lambda, \nu|\boldsymbol{y}_i) = \sum_{i=1}^{L}\ln p(\boldsymbol{\theta}_i|\boldsymbol{\gamma}, \lambda, \boldsymbol{y}_i) + \sum_{i=1}^{L}\ln p(\boldsymbol{\gamma}, \lambda, \nu|\boldsymbol{y}_i).$$

We shall maximize it to estimate the information sharing vector $\boldsymbol{\gamma}$ and the parameter $\lambda$. We begin with integrating $\boldsymbol{\theta}_i$ out and applying the relationship $p(\boldsymbol{\gamma}, \lambda, \nu|\boldsymbol{y}_i) = p(\boldsymbol{y}_i, \boldsymbol{\gamma}, \lambda, \nu)/p(\boldsymbol{y}_i) \propto p(\boldsymbol{y}_i, \boldsymbol{\gamma}, \lambda, \nu)$ to obtain

where $\mathbf{B}_i = \mathbf{I} + \mathbf{\Phi}_i\mathbf{\Gamma}_0^{-1}\mathbf{\Phi}_i^T$, $\mathbf{B}_i^{-1} = \left(\mathbf{I} + \mathbf{\Phi}_i\mathbf{\Gamma}_0^{-1}\mathbf{\Phi}_i^T\right)^{-1} = \mathbf{I} - \mathbf{\Phi}_i\mathbf{\Sigma}_i\mathbf{\Phi}_i^T$, and $\det(\mathbf{B}_i) = (\det(\mathbf{\Gamma}_0))^{-1}(\det(\mathbf{\Sigma}_i))^{-1}$. The matrices $\mathbf{\Gamma}_0$ and $\mathbf{\Sigma}_i$ are defined under Equation 16 and in Equation 21, respectively.

In the rest of this section, we shall present two methods for identifying $\boldsymbol{\gamma}$ and $\lambda$. The first technique, described in Section 3.1, iteratively maximizes $\mathcal{L}(\boldsymbol{\gamma}, \lambda, \nu)$ to find the accurate solution. It has high computational complexity, which motivates the development of an alternative method with much lower complexity in Section 3.2.

### 3.1 Iterative solution

Differentiating $\mathcal{L}(\boldsymbol{\gamma}, \lambda, \nu)$ with respect to $\gamma_j$, $j = 1, 2, \ldots, N$, and setting the result to zero yields

After some straightforward manipulations, we obtain

where $\mu_{i,j}$ is the $j$th element of $\boldsymbol{\mu}_i$ and $\mathbf{\Sigma}_{i,jj}$ is the $j$th diagonal element of $\mathbf{\Sigma}_i$. Following a similar approach, $\lambda$ can be found to be

As in [6], we evaluate $\nu$ by solving

where $\psi(\nu/2)$ denotes the derivative of $\ln\Gamma(\nu/2)$ with respect to $\nu/2$.

The iterative algorithm starts with an initial guess of $\boldsymbol{\gamma}$, $\lambda$, and $\nu$. We next update the estimates of $\gamma_j$ using Equation 24 first, then proceed to evaluate $\lambda$ and $\nu$ using Equations 25 and 26. The above process is repeated until convergence. The iterative algorithm is based on alternating optimization and is computationally intensive. One of the computational burdens lies in the evaluation of Equations 20 and 21 required in the evaluation of Equation 24, where inverting matrices of size $N \times N$ is needed. This motivates the development of the following alternative algorithm.

### 3.2 Fast alternative solution

We start with decomposing $\mathbf{B}_i$ defined under Equation 22 as

$$\mathbf{B}_i = \mathbf{I} + \sum_{k=1,\,k\ne j}^{N}\gamma_k\mathbf{\Phi}_{i,k}\mathbf{\Phi}_{i,k}^T + \gamma_j\mathbf{\Phi}_{i,j}\mathbf{\Phi}_{i,j}^T = \mathbf{B}_{i,-j} + \gamma_j\mathbf{\Phi}_{i,j}\mathbf{\Phi}_{i,j}^T,$$

where $\mathbf{B}_{i,-j}$ is $\mathbf{B}_i$ with the contribution of the column $\mathbf{\Phi}_{i,j}$ in the matrix $\mathbf{\Phi}_i$ removed, such that we have $\det(\mathbf{B}_i) = \det(\mathbf{B}_{i,-j})\det\left(1 + \gamma_j\mathbf{\Phi}_{i,j}^T\mathbf{B}_{i,-j}^{-1}\mathbf{\Phi}_{i,j}\right)$. It can be verified via applying the matrix inversion lemma that the inverse of $\mathbf{B}_i$ is equal to

$$\mathbf{B}_i^{-1} = \mathbf{B}_{i,-j}^{-1} - \gamma_j\frac{\mathbf{B}_{i,-j}^{-1}\mathbf{\Phi}_{i,j}\mathbf{\Phi}_{i,j}^T\mathbf{B}_{i,-j}^{-1}}{1 + \gamma_j\mathbf{\Phi}_{i,j}^T\mathbf{B}_{i,-j}^{-1}\mathbf{\Phi}_{i,j}}.$$

With the above notations, we are able to introduce $\mathcal{L}_0(\boldsymbol{\gamma})$ that collects the terms relating to $\boldsymbol{\gamma}$ in $\mathcal{L}(\boldsymbol{\gamma}, \lambda, \nu)$ in Equation 22, which is defined as

Here, $\boldsymbol{\gamma}_{-j}$ is $\boldsymbol{\gamma}$ with $\gamma_j$ removed, $s_{i,j} \triangleq \mathbf{\Phi}_{i,j}^T\mathbf{B}_{i,-j}^{-1}\mathbf{\Phi}_{i,j}$, $q_{i,j} \triangleq \mathbf{\Phi}_{i,j}^T\mathbf{B}_{i,-j}^{-1}\boldsymbol{y}_i$, and $g_{i,j} \triangleq \boldsymbol{y}_i^T\mathbf{B}_{i,-j}^{-1}\boldsymbol{y}_i + 2b$.
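The rank-one determinant and inverse identities used in this decomposition can be verified on a small example. The sketch below works with $2 \times 2$ matrices (that is, a task with two measurements) so that everything fits in plain Python; the matrix `B_minus`, the column `phi_j`, and the value of `gamma_j` are made-up numbers, and the helper names are hypothetical.

```python
def inv2(A):
    """Inverse of a 2x2 matrix given as a list of rows."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def matvec2(A, x):
    return [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]

def dot2(u, v):
    return u[0] * v[0] + u[1] * v[1]

def rank_one_update(B_minus, phi, gamma_j):
    """Return det(B), B^{-1}, and s = phi^T B_minus^{-1} phi
    for B = B_minus + gamma_j * phi phi^T (B_minus symmetric)."""
    B_minus_inv = inv2(B_minus)
    u = matvec2(B_minus_inv, phi)                 # B_{i,-j}^{-1} Phi_{i,j}
    s = dot2(phi, u)                              # s_{i,j}
    scale = gamma_j / (1.0 + gamma_j * s)
    B_inv = [[B_minus_inv[r][c] - scale * u[r] * u[c] for c in range(2)]
             for r in range(2)]
    det_B = det2(B_minus) * (1.0 + gamma_j * s)
    return det_B, B_inv, s

B_minus = [[2.0, 0.5], [0.5, 1.5]]   # stands in for I + sum over k != j
phi_j = [1.0, 2.0]
gamma_j = 0.5
det_B, B_inv, s_ij = rank_one_update(B_minus, phi_j, gamma_j)

# Direct construction of B for comparison.
B_direct = [[B_minus[r][c] + gamma_j * phi_j[r] * phi_j[c] for c in range(2)]
            for r in range(2)]
```

The practical point of the identities is that the effect of a single $\gamma_j$ can be evaluated from the scalar $s_{i,j}$ and a matrix that does not depend on $\gamma_j$, without refactoring $\mathbf{B}_i$ from scratch.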

Differentiating $\mathcal{L}_0(\boldsymbol{\gamma})$ with respect to $\gamma_j$ and setting the result to zero, we obtain

Dividing both sides by $\gamma_j^2$, we can transform Equation 28 into

Applying the approximation $s_{i,j} \gg 1/\gamma_j$, which is generally valid numerically (e.g., typically we have $s_{i,j} > 20/\gamma_j$ [7]), simplifies the denominator of Equation 29 into $\left(s_{i,j} - q_{i,j}^2/g_{i,j}\right)s_{i,j}$. Meanwhile, let

$$A_0 \triangleq \sum_{i=1}^{L}\frac{s_{i,j} + \lambda - (M_i + 2a)\,q_{i,j}^2/g_{i,j}}{\left(s_{i,j} - q_{i,j}^2/g_{i,j}\right)s_{i,j}}, \quad B_0 \triangleq \sum_{i=1}^{L}\frac{\lambda s_{i,j} + (s_{i,j} + \lambda)\left(s_{i,j} - q_{i,j}^2/g_{i,j}\right)}{\left(s_{i,j} - q_{i,j}^2/g_{i,j}\right)s_{i,j}}, \quad C_0 \triangleq L\lambda,$$

and as a result, Equation 29 becomes

The approximate solution of $\gamma_j^{-1}$ from Equation 30 has the form

where $\Delta_0 = B_0^2 - 4A_0C_0$ and $C_0 \ge 0$.

As shown in Appendix 1, on the basis of the fact that $\gamma_j \ge 0$, the estimate from Equation 31 can only take two possible values, i.e.,

When $\gamma_j^{-1} = \infty$, it is equivalent to setting $\theta_{i,j}$ to zero (see Equation 17). This indicates that $\mathbf{\Phi}_{i,j}$ can be deleted from the matrix $\mathbf{\Phi}_i$. As a result, in contrast to the iterative approach for estimating $\gamma_j$ and $\lambda$ (see Section 3.1), the alternative algorithm would have a complexity depending on the number of retained columns in the matrix $\mathbf{\Phi}_i$. Moreover, the evaluation of Equations 32 and 33 is relatively easy since computing $s_{i,j}$ and $q_{i,j}$ required in $A_0$ and $B_0$ can be achieved via [7]:

where

We summarize the procedure of the fast algorithm in Algorithm 1. The convergence criterion there is

where $\Delta\mathcal{L}_0(\boldsymbol{\gamma}^k)$ is the increment of $\mathcal{L}_0(\boldsymbol{\gamma})$ in the $k$th iteration and *thresh* denotes a pre-specified threshold value. To improve the convergence speed, in step 5 of Algorithm 1, we select the $\gamma_j^k$ that leads to the largest increase in $\mathcal{L}_0(\boldsymbol{\gamma})$. Other steps in the algorithm, including updating $\boldsymbol{\mu}_i$, $\mathbf{\Sigma}_i$, $s_{i,j}$, $q_{i,j}$, and $g_{i,j}$ in steps 10 to 11 and changing the model as in steps 6 to 8, are the same as those detailed in 6 of [7].
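The per-coefficient update at the heart of the fast algorithm can be sketched as a root selection. The code below rests on an inference rather than on a stated formula: since the discriminant is $\Delta_0 = B_0^2 - 4A_0C_0$, it assumes the simplified stationarity condition is the quadratic $A_0x^2 + B_0x + C_0 = 0$ in $x = \gamma_j^{-1}$, with $B_0 > 0$ and $C_0 \ge 0$. Under that assumption a non-negative root exists only when $A_0 < 0$; otherwise $\gamma_j^{-1}$ is set to infinity and the column is pruned. The function name is hypothetical.

```python
import math

def gamma_inverse(A0, B0, C0):
    """Admissible value of x = 1/gamma_j for A0*x**2 + B0*x + C0 = 0
    (assumed form), or math.inf when no non-negative root exists."""
    if A0 >= 0.0:
        # With B0 > 0 and C0 >= 0 no root is positive: prune the column.
        return math.inf
    delta0 = B0 * B0 - 4.0 * A0 * C0     # at least B0**2 because A0 < 0 <= C0
    return (-B0 - math.sqrt(delta0)) / (2.0 * A0)

finite_case = gamma_inverse(-1.0, 1.0, 2.0)
mcs_like_case = gamma_inverse(-1.0, 2.0, 0.0)   # C0 = 0 corresponds to lambda = 0
pruned_case = gamma_inverse(1.0, 2.0, 1.0)
```

When $C_0 = 0$ the selected root collapses to $-B_0/A_0$, which is consistent with the remark in the text that the update reduces to the MCS form when $\lambda = 0$.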

#### Algorithm 1 **FAST LMCS**

Before the end of this section, we shall illustrate the relationship between the MCS algorithm and the newly proposed LMCS technique, in order to gain more insights. Within the MCS framework, the elements $\gamma_j$ in the information sharing vector $\boldsymbol{\gamma}$ are found via [7]:

On the other hand, from Equation 27, LMCS obtains the estimate of $\gamma_j$ through

Clearly, LMCS would reduce to MCS if $\lambda = 0$. This is somewhat expected from the comparison presented at the end of Section 2, where we show that, compared with MCS, LMCS introduces another layer of prior information embedded in the parameter $\lambda$. When $\lambda = 0$, we can verify that $A_0 = \sum_{i=1}^{L}\frac{s_{i,j} - (M_i + 2a)\,q_{i,j}^2/g_{i,j}}{\left(s_{i,j} - q_{i,j}^2/g_{i,j}\right)s_{i,j}}$, $B_0 = L$, and $C_0 = 0$. As a result, Equations 32 and 33 would become

which are identical to the approximate solutions to Equation 39 established in [7] (see Equations 39 and 40 in [7]). This corroborates the validity of the Bayesian derivations that lead to LMCS.

## 4 MDL-based task classification and signal reconstruction

The MCS algorithm and the newly proposed LMCS method both assume that the original signals of the *L* CS tasks are statistically correlated. In other words, the original signals belong to the same cluster or group, from the viewpoint of signal classification. When the above assumption is not fulfilled, the signal reconstruction performance of MCS and LMCS would be degraded. We shall develop in this section novel signal classification and recovery algorithms on the basis of the MDL principle. The new methods are referred to as MDL-MCS or MDL-LMCS so as to reflect the fact that we augment the MCS and LMCS techniques with MDL. We start this section with the theoretical derivation of the MDL-based classification for MCS and LMCS.

### 4.1 MDL-based classification

This subsection presents the basic MDL-based task classification framework. With MDL, the best model out of a family of competing statistical models for a given data is the one that yields the minimum description length for the data. Let $\mathbf{Y} = \{\boldsymbol{y}_1, \ldots, \boldsymbol{y}_L\}$ be the set collecting the compressive measurements of the $L$ CS tasks in consideration and $\boldsymbol{\iota} = [\iota_1, \ldots, \iota_L]$ be the partition of $\mathbf{Y}$ into $K$ clusters, where $\iota_i = k$ means that $\boldsymbol{y}_i$ belongs to the $k$th cluster, $i = 1, \ldots, L$, and $k = 1, \ldots, K$. Assuming statistical independence among signals from two different clusters, we can express the likelihood function of $\mathbf{Y}$ as

$$p(\mathbf{Y}|\mathbf{D}, \boldsymbol{\iota}) = \prod_{k=1}^{K} p_k(\mathbf{Y}_k|\boldsymbol{d}_k), \qquad (43)$$

where $\mathbf{D} = \{\boldsymbol{d}_1, \ldots, \boldsymbol{d}_K\}$ is the set of model parameters, $\boldsymbol{d}_k$ is the model parameter vector of the model for the $k$th cluster, $\mathbf{Y}_k$ contains the compressive measurements of the CS tasks in the $k$th cluster, and $p_k(\mathbf{Y}_k|\boldsymbol{d}_k)$ represents the likelihood function of $\mathbf{Y}_k$. The description length of $\mathbf{Y}$ under the model set $\mathbf{D}$ is then

$$\mathit{DL}(\mathbf{Y}, \mathbf{D}, \boldsymbol{\iota}) = \mathit{DL}(\mathbf{Y}|\mathbf{D}, \boldsymbol{\iota}) + \mathit{DL}(\mathbf{D}, \boldsymbol{\iota}), \qquad (44)$$

where $\mathit{DL}(\mathbf{Y}|\mathbf{D}, \boldsymbol{\iota}) = -\log_2 p\left([\mathbf{Y}|\mathbf{D}, \boldsymbol{\iota}]_\delta\right)$ measures the goodness of fit between the data and the model. Under the assumption that the model parameter set $\mathbf{D}$ and the CS task partition $\boldsymbol{\iota}$ are statistically independent, we have $\mathit{DL}(\mathbf{D}, \boldsymbol{\iota}) = -\log_2 p\left([\mathbf{D}]_\delta\right) - \log_2 p\left([\boldsymbol{\iota}]_\delta\right)$, and it acts as a penalty function measuring the model complexity. The notation $[\cdot]_\delta$ denotes elementwise quantization with precision $\delta$. With sufficient quantization precision, we have $p\left([\mathbf{Y}|\mathbf{D}, \boldsymbol{\iota}]_\delta\right) \approx p(\mathbf{Y}|\mathbf{D}, \boldsymbol{\iota})\,\delta^{S_{\mathbf{Y}}}$, $p\left([\mathbf{D}]_\delta\right) \approx p(\mathbf{D})\,\delta^{S_{\mathbf{D}}}$, and $p\left([\boldsymbol{\iota}]_\delta\right) \approx p(\boldsymbol{\iota})\,\delta^{S_{\boldsymbol{\iota}}}$ [20]. Here, $p(\mathbf{D})$ and $p(\boldsymbol{\iota})$ are the priors of $\mathbf{D}$ and $\boldsymbol{\iota}$. $S_{\mathbf{Y}}$, $S_{\mathbf{D}}$, and $S_{\boldsymbol{\iota}}$ denote the numbers of elements in $\mathbf{Y}$, $\mathbf{D}$, and $\boldsymbol{\iota}$. As a result, the description length of $\mathbf{Y}$ can be rewritten as

$$\mathit{DL}(\mathbf{Y}, \mathbf{D}, \boldsymbol{\iota}) \approx -\log_2 p(\mathbf{Y}|\mathbf{D}, \boldsymbol{\iota}) - \log_2 p(\mathbf{D}) - \log_2 p(\boldsymbol{\iota}) - \left(S_{\mathbf{Y}} + S_{\mathbf{D}} + S_{\boldsymbol{\iota}}\right)\log_2\delta. \qquad (45)$$

We proceed to evaluate Equation 45 for the cases of LMCS and MCS sequentially. In particular, as shown in Appendix 2, we have that for LMCS,

where $\mathbf{D}^{\text{LMCS}} = \{\boldsymbol{d}_k^{\text{LMCS}}\}$, $k = 1, \ldots, K$, is the set of the model parameters in LMCS, $\boldsymbol{d}_k^{\text{LMCS}} = \{\boldsymbol{\gamma}^{(k)}, \lambda^{(k)}\}$ contains the information sharing parameters of the $k$th cluster, $\mathbf{B}_i^{(k)} = \mathbf{I}^{(k)} + \mathbf{\Phi}_i^{(k)}\left(\mathbf{\Gamma}_0^{(k)}\right)^{-1}\left(\mathbf{\Phi}_i^{(k)}\right)^T$, $\mathbf{\Gamma}_0^{(k)} = \mathrm{diag}\left(1/\gamma_1^{(k)}, \ldots, 1/\gamma_N^{(k)}\right)$, and $L_k$ is the number of tasks in the $k$th cluster. Other variables are the same as those in Equation 22.

For MCS, according to Equation 30 in [7], we have

where $\mathbf{D}^{\text{MCS}} = \{\boldsymbol{d}_k^{\text{MCS}}\}$ is the set of model parameters for MCS, $\boldsymbol{d}_k^{\text{MCS}} = \{\boldsymbol{\alpha}_{\text{MCS}}^{(k)}\}$, $\boldsymbol{\alpha}_{\text{MCS}}^{(k)}$ is the information sharing vector of cluster $k$, $\mathbf{C}_i^{(k)} = \mathbf{I}^{(k)} + \mathbf{\Phi}_i^{(k)}\left(\mathbf{A}_{\text{MCS}}^{(k)}\right)^{-1}\left(\mathbf{\Phi}_i^{(k)}\right)^T$, and $\mathbf{A}_{\text{MCS}}^{(k)} = \mathrm{diag}\left(\boldsymbol{\alpha}_{\text{MCS}}^{(k)}\right)$. In MCS, $\mathbf{A}_{\text{MCS}}^{(k)}$ is distributed uniformly, so $-\log_2 p\left(\mathbf{D}^{\text{MCS}}\right)$ would be a constant (see Section 2.1).

We now compute $-\log_2 p(\boldsymbol{\iota})$ to complete the evaluation of Equation 45 for LMCS and MCS. Let $n(L, \boldsymbol{\iota})$ be the number of different ways to partition $L$ tasks into $K$ groups with each group having $L_k$ CS tasks and $\sum_{k=1}^{K}L_k = L$. It can be verified that $n(L, \boldsymbol{\iota})$ is equal to

The numerator represents the number of different partitions if we generate them by taking sequentially $L_k$ tasks out of the $L$ CS tasks while the denominator removes the partitions produced by simply swapping the tasks within a cluster without changing the clustering structure. Assuming that $\boldsymbol{\iota}$ has a uniform prior distribution, we have

Putting Equation 49 together with Equation 46 or 47 back into Equation 45 completes the description length computation for the compressive measurement set $\mathbf{Y}$ of the $L$ CS tasks under LMCS or MCS. Given a quantization precision $\delta$, the MDL criterion finds the optimal number of clusters $K$ via

### 4.2 MDL-LMCS and MDL-MCS

Solving Equation 50 directly may be computationally prohibitive since it requires calculating the description length of **Y** for all possible clustering structures. To address this difficulty, we shall propose the new MDL-LMCS and MDL-MCS algorithms for classifying the CS tasks and recovering all original signals in a joint and iterative manner. The algorithm flow is summarized in Algorithm 2. It takes as its input the sets **Y** and **Φ** that collect the compressive measurement vectors and the measurement matrices in the *L* CS tasks, respectively.

Since the tasks have not been classified at the beginning, the algorithm considers that they belong to a single cluster *clust*{1} = {$\mathbf{Y}$, $\mathbf{\Phi}$}, and as a result, it sets $K$, the number of obtained clusters, to be 1, and *num*, the number of unclassified tasks, to be $L$. The algorithm also initializes $\hat{\mathbf{Y}}$ and $\hat{\mathbf{\Phi}}$, the sets that collect the compressive measurements and the measurement matrices of the unclassified tasks, as $\hat{\mathbf{Y}} = \mathbf{Y}$ and $\hat{\mathbf{\Phi}} = \mathbf{\Phi}$. Signal reconstruction via LMCS or MCS for MDL-LMCS or MDL-MCS is then performed using $\hat{\mathbf{Y}}$ and $\hat{\mathbf{\Phi}}$ to obtain the reconstructed signal set $\hat{\mathbf{\Theta}}_1$ and the sharing parameter set $\hat{\mathbf{D}}_1$. We plug $\hat{\mathbf{D}}_1$ into Equation 46 or 47 to calculate the total description length (TDL) *mdl*_{1} for all the compressive measurements in $\mathbf{Y}$. This completes the initialization stage of the algorithm.

The proposed algorithm proceeds to classify the *L* tasks as follows. In the first iteration, it applies the operation CLASSIFY(·) to form a new cluster $\{\hat{\mathbf{Y}}_{\text{min}},\hat{\mathbf{\Phi}}_{\text{min}}\}$ from the unclassified task set $\hat{\mathbf{Y}}$. $\hat{\mathbf{Y}}_{\text{min}}$ has *L*_{min} tasks, and their measurement matrices are collected in $\hat{\mathbf{\Phi}}_{\text{min}}$. We remove $\hat{\mathbf{Y}}_{\text{min}}$ and $\hat{\mathbf{\Phi}}_{\text{min}}$ from $\hat{\mathbf{Y}}$ and $\hat{\mathbf{\Phi}}$ to update them, and the number of remaining unclassified tasks becomes *num* - *L*_{min}. Now, we have *K* = 2 clusters, $\mathit{clust}\{1\}=\{\hat{\mathbf{Y}},\hat{\mathbf{\Phi}}\}$ and $\mathit{clust}\{2\}=\{\hat{\mathbf{Y}}_{\text{min}},\hat{\mathbf{\Phi}}_{\text{min}}\}$^{a}. LMCS or MCS is then applied to both clusters to identify their original signals and sharing parameters. The results are kept in $\hat{\mathit{\Theta}}_{2}$ and $\hat{\mathbf{D}}_{2}$, the latter of which is substituted into Equation 46 (for MDL-LMCS) or Equation 47 (for MDL-MCS) to compute again the TDL of **Y**, denoted by *mdl*_{2}. This completes the processing of iteration 1. We then compare *mdl*_{1} with *mdl*_{2}. If *mdl*_{2} < *mdl*_{1}, the algorithm starts its second iteration to continue the task classification, where CLASSIFY(·) is applied to $\hat{\mathbf{Y}}$ and yields *clust*{3}. The above process is repeated until *mdl*_{K} > *mdl*_{K-1} occurs, which implies the appearance of over-fitting. The algorithm finally outputs the clusters available in the (*K*-2)th iteration.
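The iteration described above can be summarized by the following minimal Python sketch. It is only an illustration under stated assumptions: `mdl_cluster`, `classify`, and `total_description_length` are hypothetical names, the two callables stand in for CLASSIFY(·) and for running LMCS or MCS on every cluster followed by the evaluation of Equation 46 or 47, and tasks are represented by integer indices rather than by their measurements.

```python
from typing import Callable, List, Sequence

Cluster = List[int]  # indices of the tasks assigned to one cluster


def mdl_cluster(
    num_tasks: int,
    classify: Callable[[Cluster], Cluster],
    total_description_length: Callable[[Sequence[Cluster]], float],
) -> List[Cluster]:
    """Greedy MDL-driven task clustering (illustrative sketch only).

    classify(unclassified) is a placeholder for CLASSIFY(.): it returns the
    task indices of a new cluster drawn from `unclassified`.
    total_description_length(clusters) is a placeholder for reconstructing
    every cluster and evaluating the TDL.
    """
    unclassified = list(range(num_tasks))
    formed: List[Cluster] = []
    # Initialization: all tasks are treated as a single cluster.
    best_tdl = total_description_length([unclassified])

    while len(unclassified) > 1:
        new_cluster = classify(unclassified)
        remaining = [t for t in unclassified if t not in new_cluster]
        # The first entry plays the role of clust{1} (unclassified tasks).
        candidate = ([remaining] if remaining else []) + formed + [new_cluster]
        tdl = total_description_length(candidate)
        if tdl >= best_tdl:
            # The TDL stopped decreasing: keep the previous clustering.
            break
        formed.append(new_cluster)
        unclassified, best_tdl = remaining, tdl

    return ([unclassified] if unclassified else []) + formed
```

The sketch keeps the last clustering whose TDL was still decreasing, which mirrors the stopping rule in the text; how ties are handled (`>=` here) is a choice of this sketch.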

The function CLASSIFY(·) runs as follows. Each time CLASSIFY(·) is executed, it first randomly selects a task from the unclassified task set $\hat{\mathbf{Y}}$. With slight abuse of notation, we denote it as *y*_{i}. It is paired with each of the remaining tasks in $\hat{\mathbf{Y}}$, which yields *num*-1 two-task clusters. In the case of MDL-LMCS, we then apply LMCS to estimate the sharing parameters {*γ*^{(t)}, *λ*^{(t)}, *ν*^{(t)}} of the two-task cluster *t*, *t* = 1, 2, …, *num*-1, and compute the corresponding description length for *y*_{i} via

We next perform a grouping operation on the obtained *num*-1 description lengths $\mathit{DL}_{\text{LMCS}}^{(t)}(\boldsymbol{y}_{i})$ to identify those tasks in $\hat{\mathbf{Y}}$ that are likely to correlate well with *y*_{i} and should be grouped with *y*_{i} in a new cluster $\hat{\mathbf{Y}}_{\text{min}}$ (see Algorithm 2). Recall that each description length corresponds to a task in $\hat{\mathbf{Y}}$ other than *y*_{i}. The grouping procedure is based on the well-known *K*-means technique. The difference here is that before applying *K*-means, we first compute the arithmetic mean of $\mathit{DL}_{\text{LMCS}}^{(t)}(\boldsymbol{y}_{i})$ and set the values above the mean to be equal to the mean. This is equivalent to excluding the tasks that lead to a large value of $\mathit{DL}_{\text{LMCS}}^{(t)}(\boldsymbol{y}_{i})$ when paired with *y*_{i}, because they are unlikely to be well correlated with *y*_{i}. We next apply *K*-means to the remaining description lengths to obtain two groups. The mean description length of each group is then found. The tasks belonging to the group with the smaller mean description length are combined with *y*_{i} to produce the output of CLASSIFY(·), $\hat{\mathbf{Y}}_{\text{min}}$.
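The grouping step can be illustrated with a short Python sketch. It assumes that the description lengths of the *num*-1 pairings are already available in an array, and it implements the mean-based screening by dropping the values above the mean before a plain one-dimensional two-means pass. The function name `group_by_description_length`, the initialization of the two centers at the extremes, and the iteration cap are choices of this sketch, not details taken from the paper.

```python
import numpy as np


def group_by_description_length(dl, n_iter=100):
    """Return the indices of the pairings to be grouped with the selected task.

    dl[t] is the description length obtained when the selected task is
    paired with candidate t (hypothetical input of this sketch).
    """
    dl = np.asarray(dl, dtype=float)
    # Screening: discard pairings whose description length exceeds the mean.
    keep = np.flatnonzero(dl <= dl.mean())
    vals = dl[keep]

    # One-dimensional two-means, initialized at the smallest and largest value.
    centers = np.array([vals.min(), vals.max()])
    labels = np.zeros(vals.size, dtype=int)
    for _ in range(n_iter):
        labels = np.abs(vals[:, None] - centers[None, :]).argmin(axis=1)
        new_centers = np.array([
            vals[labels == k].mean() if np.any(labels == k) else centers[k]
            for k in range(2)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    # Keep the group with the smaller mean description length.
    low = int(np.argmin(centers))
    return keep[labels == low]
```

For instance, calling the function on `[1.0, 1.1, 0.9, 5.0, 5.2, 20.0]` first screens out the last value and then separates the three small values from the two intermediate ones.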

In the case of MDL-MCS, CLASSIFY(·) is realized in the same manner as described above, except that the description length for *y*_{i} is evaluated over every two-task cluster using

#### Algorithm 2 **MDL-LMCS (or MDL-MCS)**

### 4.3 Implementation aspect

The development of MDL-LMCS and MDL-MCS presented in the previous subsection implicitly assumes that the quantization precision *δ* is known *a priori*. Nevertheless, in an ideal case, *δ* should be determined jointly with the optimal number of clusters *K* through minimizing the right-hand side of Equation 50 with respect to *δ* and *K*.

We shall follow the approach similar to the one adopted in [20] to determine the quantization precision. First, it can be verified that the value of *δ* would have no impact on locating the unclassified tasks that are correlated with a randomly selected one if the compressive measurement vectors of all the tasks have the same dimension. This is because, in this case, the term depending on *δ* in Equations 51 and 52 would be the same for any value of *t*. As a result, *δ* will affect the task classification performance via Equations 46 and 47 only, from which it can be seen that a very fine quantization would lead to a smaller number of clusters. This may degrade the signal reconstruction performance as weakly correlated signals may be recovered jointly. A large value of *δ* would not necessarily improve performance, as in this case, the original signals may tend to be recovered separately. Our experiments suggest that *δ* be within the range of 0.01 to 0.1, depending on the type of data to be processed. Throughout the experiments in Section 5, we shall fix *δ* to be 0.1, instead of attempting to optimize it for different experiments.

## 5 Simulations

Monte Carlo (MC) simulations using synthetic data and images are performed to illustrate the performance of the LMCS algorithm developed in Section 3 and the MDL-augmented MCS algorithms, namely, the MDL-LMCS and MDL-MCS techniques presented in Section 4.

### 5.1 Synthetic signals

In each simulation of this subsection, the length of the original signals of all the CS tasks is fixed at *N* = 512, and we generate two sets of results. One set of results is produced when the non-zero elements of the original signals take binary values ±1 in a random manner. The other set is generated with the non-zero elements of the original signals being independently drawn from a zero-mean Gaussian distribution with unit variance. In both cases, the elements of the measurement matrix of any CS task are drawn from a Gaussian distribution with zero mean and unit variance. Each column of any measurement matrix is then normalized to have a unit norm.
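As an illustration of this setup, the following Python sketch generates one sparse signal and one column-normalized Gaussian measurement matrix. The function names, the random seed, and the example values for the number of non-zero elements and the number of measurements are assumptions made for the sketch.

```python
import numpy as np


def sparse_signal(n, k, kind, rng):
    """Length-n signal with k non-zero entries at random locations."""
    theta = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    if kind == "binary":
        theta[support] = rng.choice([-1.0, 1.0], size=k)   # random +/-1 spikes
    elif kind == "gaussian":
        theta[support] = rng.standard_normal(k)            # zero mean, unit variance
    else:
        raise ValueError("kind must be 'binary' or 'gaussian'")
    return theta


def measurement_matrix(m, n, rng):
    """m x n Gaussian matrix whose columns are scaled to unit Euclidean norm."""
    phi = rng.standard_normal((m, n))
    return phi / np.linalg.norm(phi, axis=0, keepdims=True)


# Example usage with illustrative sizes (k and m are arbitrary choices here).
rng = np.random.default_rng(0)
theta = sparse_signal(n=512, k=64, kind="binary", rng=rng)
phi = measurement_matrix(m=100, n=512, rng=rng)
```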

For the purpose of comparison, we implement the BCS and MCS techniques developed in [4] and [7], respectively. We shall denote them as ST-BCS and MCS in the figures. Here, ST stands for single task, and it is introduced to highlight that ST-BCS recovers the original signals separately, whereas MCS recovers them jointly. We also implement the Laplace prior-based BCS proposed in [6] and denote it as LST-BCS. When implementing the three benchmark algorithms (ST-BCS, MCS, and LST-BCS) and the three proposed methods (LMCS, MDL-LMCS, and MDL-MCS), we always initialize *a* = 10^{3} and *b* = 1 so that the noise precision *β* has the same prior distribution for all the algorithms considered (see Equation 5).

We shall follow the previous works [4, 6, 7] that proposed the three benchmark methods and use the average normalized signal reconstruction error as the primary performance metric. It is defined as $\frac{1}{L}\sum_{i=1}^{L}\|\boldsymbol{\theta}_{i}-\hat{\boldsymbol{\theta}}_{i}\|_{2}/\|\boldsymbol{\theta}_{i}\|_{2}$, where *θ*_{i} and $\hat{\boldsymbol{\theta}}_{i}$ are the true and the estimated original signal vectors of the *i*th CS task. Note that the average normalized signal reconstruction error measures the Euclidean distance between the waveforms of the recovered and the original signals. It is not very informative regarding the quality of the recovered signal supports. Therefore, we shall also include in some experiments performance results of different algorithms in recovering the signal supports, which are quantified by the average incorrect support recovery ratio $\frac{1}{L}\sum_{i=1}^{L}\|S(\boldsymbol{\theta}_{i})-S(\hat{\boldsymbol{\theta}}_{i})\|_{0}/N$. Here, ||·||_{0} denotes the *l*_{0}-norm and *S*(**x**) sets all the non-zero elements in **x** to be 1.
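The two metrics can be computed as in the Python sketch below. The function names are hypothetical, and the optional threshold `tol` used to decide whether an estimated entry counts as non-zero is an assumption of the sketch; with `tol=0` the support indicator matches the definition of *S*(·) given above.

```python
import numpy as np


def avg_normalized_error(thetas, theta_hats):
    """Average over tasks of ||theta - theta_hat||_2 / ||theta||_2."""
    errors = [
        np.linalg.norm(t - h) / np.linalg.norm(t)
        for t, h in zip(thetas, theta_hats)
    ]
    return float(np.mean(errors))


def avg_incorrect_support_ratio(thetas, theta_hats, tol=0.0):
    """Average over tasks of the fraction of positions whose support flag differs."""
    ratios = []
    for t, h in zip(thetas, theta_hats):
        s_true = np.abs(t) > tol      # indicator of the true support
        s_hat = np.abs(h) > tol       # indicator of the estimated support
        ratios.append(np.count_nonzero(s_true != s_hat) / t.size)
    return float(np.mean(ratios))
```

Because iterative solvers often return very small but non-zero coefficients, a small positive `tol` may be preferable in practice; the paper does not state whether such a threshold was used.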

#### 5.1.1 LMCS

We consider the case of *L* = 2 CS tasks as in [7], in order to illustrate the performance of the proposed LMCS technique and the existing methods under a simulation setup already used in the literature. The original signal of each task contains 64 non-zero elements at random locations. Zero-mean Gaussian noise with a standard deviation of 0.01 is added to the two obtained compressive measurement vectors^{b}.
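The two-task setup, together with the partial support overlap used in the simulations that follow, can be sketched in Python as below. The helper names and the way the shared and task-specific positions are drawn from a single random permutation are assumptions of this sketch.

```python
import numpy as np


def overlapping_supports(n, k, overlap, rng):
    """Two size-k supports in {0, ..., n-1} sharing round(overlap * k) positions."""
    n_shared = int(round(overlap * k))
    perm = rng.permutation(n)
    shared = perm[:n_shared]
    only_first = perm[n_shared:k]                # positions unique to task 1
    only_second = perm[k:2 * k - n_shared]       # positions unique to task 2
    return (np.concatenate([shared, only_first]),
            np.concatenate([shared, only_second]))


def noisy_measurements(phi, theta, sigma, rng):
    """Compressive measurements y = phi @ theta + zero-mean Gaussian noise."""
    return phi @ theta + sigma * rng.standard_normal(phi.shape[0])


# Example: 64 non-zero elements per signal, 75% overlap, noise std 0.01.
rng = np.random.default_rng(1)
supp1, supp2 = overlapping_supports(n=512, k=64, overlap=0.75, rng=rng)
```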

In the first simulation, we illustrate in Figure 2 the impact of different choices of the parameters *λ* and *ν* on the performance of LMCS. The two signals are assumed to have 75% of their non-zero elements overlapped. We realize LMCS with *λ* = 0, *λ* = 1, *λ* = 2, and *λ* estimated using Equation 25. The results shown are averaged over 200 runs. In particular, Figure 2a,b plots the average signal reconstruction error as a function of the number of compressive measurements for the two cases where the non-zero elements of the original signals are random binary numbers ±1 and zero-mean Gaussian random variables with unit variance, respectively. The results show that in both cases, the reconstruction error of LMCS gradually decreases as the number of compressive measurements increases, and the best performance is obtained when *λ* is estimated using Equation 25. Moreover, we can see that LMCS with *ν* = 0 and LMCS with *ν* estimated using Equation 26 yield similar signal reconstruction performance. The underlying reason is that the value of *λ* estimated jointly with *ν* is nearly identical to that obtained with *ν* = 0. This can be explained as follows. The value of *ν*, when it is identified together with *λ*, is generally non-zero but less than one in this simulation. Careful examination of Equation 25, which gives the estimate of *λ*, reveals that the impact of a small non-zero *ν* on *λ* is negligible when the original signal length *N* is large (in this section, *N* = 512) and the measurement noise level is low. A low noise level implies a large value of the noise precision *β* and, as a result, large values of the hyper-parameters *γ*_{j} for original signals having a unit variance (see Equation 17). Therefore, in the remaining simulations, we fix *ν* at zero when realizing LMCS and MDL-LMCS.

It is worthwhile to point out that, rigorously, *ν* = 0 is a boundary value for the Gamma distribution. As *ν* approaches 0, the prior distribution of *λ* would provide only vague information on *λ*, as $p(\lambda)\propto 1/\lambda$ (also see Equation 19 in [6]). However, this would not change the fact that a Laplace prior is imposed on the original signals, as shown in Equation 12. In other words, LMCS would still outperform MCS because it enhances the sparsity constraints on the non-zero elements of the original signals. This is also supported by the following simulation results (see Figures 3 and 4).

Figure 3 demonstrates the impact of the correlation between the two original signals on the performance of LMCS. It considers the cases where the two original signals have binary non-zero elements and have 75% and 50% of their non-zero elements overlapped. Figure 3a,b plots the average signal reconstruction error and the incorrect support recovery ratio of LMCS as a function of the number of compressive measurements. The results shown are averaged over 50 runs. For comparison, we also include in the figures the results from ST-BCS, LST-BCS, and MCS. We can observe from Figure 3a that LMCS and MCS greatly outperform ST-BCS and LST-BCS due to the utilization of the prior sharing mechanism (see Section 2). The performance of LMCS and MCS improves as the number of overlapping non-zero elements in the two original signals increases, as expected. More importantly, LMCS achieves a much lower signal reconstruction error than MCS in both cases, i.e., when the two original signals have 75% and 50% of their non-zero elements overlapped. The performance enhancement mainly comes from the use of Laplace priors on the original signals in LMCS. Compared with MCS, LMCS imposes another layer of prior information on the hyper-parameters of the original signals, which makes MCS a special case of LMCS, as shown in Equations 39 and 40 at the end of Section 3. As a result, LMCS offers more flexibility in modeling the sparsity of the original signals. This is also corroborated by Figure 3b, which shows that when the two original signals have 75% of their non-zero elements colocated, LMCS provides a lower incorrect support recovery ratio and can better recover the sparse signal support.

Figure 4 repeats the simulation experiment in Figure 3, but it considers the case where the two original signals have the non-zero elements drawn from zero-mean Gaussian distribution with unit variance. The obtained observations are similar to those in Figure 3.

#### 5.1.2 MDL-based task classification and signal reconstruction

In this subsection, we present simulation results to illustrate the performance of MDL-MCS and MDL-LMCS developed in Section 4. For the purpose of comparison, we also show the results of the ST-BCS, LST-BCS, MCS, and LMCS methods as well as the DP-MCS technique.

The simulated algorithms are used to recover the original signals of *L* = 40 CS tasks that belong to eight clusters with five tasks each. Every cluster has its own signal template, and the templates differ in their signal supports. All the original signals have 64 non-zero components, and their locations are initially chosen so that the correlation between any two original signals from different clusters is zero. We then apply the following perturbation to induce slight correlation among clusters. Specifically, in each ensemble run, six non-zero elements in each signal template are selected randomly and set to zero, while at the same time, six elements that are zero in the original template are set to be non-zero. In this way, the five signals within the same cluster are highly correlated, but the signals from different clusters have distinct sparsity structures. The simulation results are obtained via averaging over 50 ensemble runs.
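One possible way to generate such clustered supports is sketched below in Python. The text does not fully specify whether the perturbation is drawn once per template or once per task; the sketch assumes it is drawn independently for every task, which keeps signals within a cluster highly but not perfectly correlated. The function name and its default arguments are hypothetical.

```python
import numpy as np


def clustered_supports(n=512, n_clusters=8, tasks_per_cluster=5,
                       k=64, n_swap=6, rng=None):
    """Supports for n_clusters * tasks_per_cluster tasks (illustrative sketch).

    Templates are disjoint across clusters (requires n_clusters * k <= n).
    For every task, n_swap template positions are dropped and n_swap
    positions outside the template are added (per-task perturbation is an
    assumption of this sketch).
    """
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(n)
    supports, labels = [], []
    for c in range(n_clusters):
        template = perm[c * k:(c + 1) * k]              # disjoint templates
        outside = np.setdiff1d(np.arange(n), template)  # positions not in template
        for _ in range(tasks_per_cluster):
            kept = rng.choice(template, size=k - n_swap, replace=False)
            added = rng.choice(outside, size=n_swap, replace=False)
            supports.append(np.sort(np.concatenate([kept, added])))
            labels.append(c)
    return supports, np.array(labels)
```

With the default values, two tasks of the same cluster share at least 52 support positions, whereas two tasks from different clusters share at most 12.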

In Figure 5a,b, we plot the binary signal reconstruction error and the correct task classification ratio of the seven simulated algorithms as functions of the number of compressive measurements. As we can see from Figure 5a, pretending that the 40 CS tasks belong to the same group and recovering the original signals using LMCS or MCS would lead to a signal reconstruction error even higher than that of reconstructing the original signals separately via LST-BCS. This clearly demonstrates the impact of incorrect task classification on the signal recovery performance. On the other hand, the proposed MDL-LMCS and MDL-MCS algorithms achieve a lower signal reconstruction error than the DP-MCS technique. The performance improvement can be better explained by examining Figure 5b. We can see that the application of the MDL principle to augment LMCS and MCS leads to a greatly improved correct task classification ratio compared with the DP-MCS technique. With the CS tasks correctly grouped, MDL-LMCS and MDL-MCS can better recover the original signals of every group.

We repeat the simulation used to generate Figure 5 with the original signals being zero-mean Gaussian random variables with unit variance. The obtained results are summarized in Figure 6. The observations obtained are similar to those in Figure 5.

### 5.2 Images

In this subsection, we compare the performance of MDL-MCS and MDL-LMCS with that of DP-MCS in recovering 2-D images of random bars. In this experiment, the elements of the measurement matrices of the three algorithms in consideration are drawn from a uniform spherical distribution.

Figure 7 summarizes the reconstruction results from a particular run. The first three images in Figure 7a, labeled as tasks 1 to 3, are taken from [7], and they belong to the same cluster. The remaining six images in Figure 7a form another two clusters, where one cluster consists of tasks 4 to 6 and the other is composed of tasks 7 to 9. These six images are modified from the first three images by randomly permuting the intensities of the rectangles and shifting their positions by distances randomly sampled from a uniform distribution.

All original images have the dimension of 1,024 × 1,024. Here, we utilize the Haar wavelet expansion with a coarsest scale of 3 and a finest scale of 6. Figure 7a gives the result of the inverse wavelet transform with 4,096 samples, denoted as linear in the figure. This is the best performance achievable by all the CS algorithms considered here. The reconstruction result from DP-MCS is shown in Figure 7b, where we adopted the hybrid CS scheme that compresses the fine-scale coefficients only, as in [7], into *M*_{i} = 680 (*i* = 1, …, 9) measurements for each task. Figure 7c,d gives the recovery results of MDL-MCS and MDL-LMCS, respectively.

We fix the original images and repeat the above experiment 20 times, each time with independently generated measurement matrices for all the three algorithms. In every run, the image reconstruction error for each task is evaluated and averaged to obtain the normalized image reconstruction error, which is again averaged over 20 runs to yield the average image reconstruction error summarized in Table 1. We also include in Table 1 the correct classification ratio.

The results in Figure 7 and Table 1 show that MDL-LMCS has the best image reconstruction and classification performance, while MDL-MCS yields a better performance than DP-MCS. This is consistent with the observations obtained from Figures 5 and 6.

## 6 Conclusions

In this paper, we first extended previous works on the Laplace prior-based Bayesian CS to the scenario of multiple CS tasks and developed the LMCS technique. The hierarchical prior model was adopted to impose the Laplace priors, and it was shown that the MCS algorithm is indeed a special case of LMCS. Next, this paper considered the scenario where the multiple CS tasks are from different groups, under which the performance of both MCS and LMCS would be degraded, since they attempt to recover the uncorrelated signals jointly. We proposed the MDL-based MCS techniques, namely, MDL-MCS and MDL-LMCS, which first classify tasks into different groups using the MDL principle and then reconstruct signals of every cluster. Simulations verified the enhanced performance of MDL-MCS and MDL-LMCS in terms of lower signal reconstruction error over the benchmark MCS and DP-MCS techniques as well as single-task CS algorithms.

## Endnotes

^{a} It can be easily verified that in our algorithm, *K* is equal to the iteration index plus one. Besides, *clust*{1} always contains all the unclassified tasks and *clust*{*K*} is the newest cluster formed in the current iteration.

^{b} Our choice of the noise standard deviation of 0.01 is of the same order as the values adopted in the literature. For example, in [6] and [7], the noise standard deviation is set to be 0.03 and 0.005, respectively.

## Appendix 1

### Derivation and analysis of Equations 32 and 33

In this appendix, we shall present the derivation that leads to Equations 32 and 33 and show that it is only a suboptimal solution to the maximization of Equation 27.

Our derivation applies the approximation that *s*_{i,j} ≫ 1/*γ*_{j}, which has been found to be valid numerically [7]. This results in the estimate of *γ*_{j} having the functional form given in Equation 30. When *A*_{0} > 0, both solutions in Equation 30 would be negative, which violates the requirement that *γ*_{j} must be positive. If *A*_{0} < 0, only the solution $\gamma_{j}^{-1}=\left(-B_{0}-\sqrt{\Delta_{0}}\right)/\left(2A_{0}\right)$ is valid. For the case *A*_{0} = 0, from Equation 27, *γ*_{j} will have the accurate solution *γ*_{j} = 0. This completes the derivation of Equations 32 and 33.

We next show that the solution in Equations 32 and 33 is suboptimal. For this purpose, utilizing the approximation *s*_{i,j} ≫ 1/*γ*_{j} transforms Equation 28 into

We can also obtain easily

Substituting Equation 32 into Equation 54 yields

This indicates that the solution in Equation 32 is the unique maximizer of the approximated version of Equation 27. However, solving Equation 29 accurately, which is equivalent to finding all the candidate maximizers for Equation 27, may yield two or more positive estimates of *γ*_{j}. Among them, one would be relatively close to the approximate solution in Equation 32. In other words, the approximate solution is within the vicinity of a stationary point of Equation 27, which may only correspond to a local maximum.

## Appendix 2

### Derivation of Equation 46

To avoid confusion, we use superscript (*k*) to denote the *k* th cluster in the following derivation. For mathematical tractability, besides the independence among signals from two different clusters, we also assume the independence among signals within the same cluster. As a result, we have

where *L*_{k} is the number of tasks in the *k*th cluster such that $\sum_{k=1}^{K}L_{k}=L$, $\mathbf{D}^{\text{LMCS}}=\{\boldsymbol{d}_{k}^{\text{LMCS}}\}$, *k* = 1, …, *K*, and $\boldsymbol{d}_{k}^{\text{LMCS}}=\{\boldsymbol{\gamma}^{(k)},\lambda^{(k)}\}$ contains the information sharing parameters of the *k*th cluster. Similarly, assuming statistical independence among $\boldsymbol{d}_{k}^{\text{LMCS}}$, we obtain

Combining Equations 56 and 57 yields

From Equation 22, Equation 58 can be rewritten as

Carrying out the integration, simplifying and applying some straightforward manipulations give Equation 46.
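For orientation, the independence assumptions stated in this appendix imply a factorization of the following general form. This is only a sketch written in generic notation and is not a reproduction of Equations 56 to 59: here $p(\boldsymbol{y}_{i}^{(k)}\mid\boldsymbol{d}_{k})$ is assumed to denote the likelihood of the *i*th task of cluster *k* given the sharing parameters of that cluster, and $p(\boldsymbol{d}_{k})$ the prior on those parameters.

```latex
\begin{aligned}
p\left(\mathbf{Y}\mid\mathbf{D}\right)
  &= \prod_{k=1}^{K}\prod_{i=1}^{L_{k}}
     p\left(\boldsymbol{y}_{i}^{(k)}\mid\boldsymbol{d}_{k}\right),
\qquad
p\left(\mathbf{D}\right) = \prod_{k=1}^{K} p\left(\boldsymbol{d}_{k}\right),\\
-\log p\left(\mathbf{Y},\mathbf{D}\right)
  &= -\sum_{k=1}^{K}\left[\log p\left(\boldsymbol{d}_{k}\right)
     + \sum_{i=1}^{L_{k}}\log p\left(\boldsymbol{y}_{i}^{(k)}\mid\boldsymbol{d}_{k}\right)\right].
\end{aligned}
```

Under these assumptions the negative log-probability, and hence a description length built from it, decomposes into a sum of per-cluster terms, which is what allows the contribution of each cluster to be evaluated separately.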

## References

1. Baraniuk R: A lecture on compressive sensing. *IEEE Mag. Signal Process* 2007, 24(4):118-121.
2. Candés E, Romberg J, Tao T: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. *IEEE Trans. Inf. Theory* 2006, 52(2):489-509.
3. Donoho DL: Compressed sensing. *IEEE Trans. Inf. Theory* 2006, 52(4):1289-1306.
4. Ji S, Xue Y, Carin L: Bayesian compressive sensing. *IEEE Trans. Signal Process* 2008, 56(6):2346-2356.
5. Tipping ME: Sparse Bayesian learning and the relevance vector machine. *J. Mach. Learn. Res* 2001, 1: 211-244.
6. Babacan S, Katsaggelos A, Molina R: Bayesian compressive sensing using Laplace priors. *IEEE Trans. Image Process* 2010, 19(1):53-63.
7. Ji S, Dunson D, Carin L: Multi-task compressive sensing. *IEEE Trans. Signal Process* 2009, 57(1):92-106.
8. Leviatan D, Temlyakov VN: Simultaneous approximation by greedy algorithms. Technical report, University of South Carolina; 2003.
9. Cotter SF, Rao BD, Engan K, Kreutz-Delgado K: Sparse solutions to linear inverse problems with multiple measurement vectors. *IEEE Trans. Signal Process* 2005, 53(7):2477-2488.
10. Tropp JA, Gilbert AC, Strauss MJ: Algorithms for simultaneous sparse approximation. Part I: Greedy pursuit. *Signal Process* 2006, 86: 572-588.
11. Tropp JA: Algorithms for simultaneous sparse approximation. Part II: convex relaxation. *Signal Process* 2006, 86: 589-602.
12. Escoda D, Granai L, Vandergheynst P: On the use of a priori information for sparse signal approximations. *IEEE Trans. Signal Process* 2006, 54(9):3468-3482.
13. Baron D, Duarte MF, Sarvotham S: An information-theoretic approach to distributed compressed sensing. *Proceedings of the 43rd Allerton Conference on Communication, Control, and Computing, Monticello, IL, Sept 2005*.
14. Wipf DP, Rao BD: An empirical Bayesian strategy for solving the simultaneous sparse approximation problem. *IEEE Trans. Signal Process* 2007, 55(7):3704-3716.
15. Seeger MW, Nickisch H: Compressed sensing and Bayesian experimental design. *Proceedings of the 25th International Conference on Machine Learning, Helsinki, July 2008*.
16. Qi Y, Liu D, Carin L, Dunson D: Multi-task compressive sensing with Dirichlet process priors. *Proceedings of the 25th International Conference on Machine Learning, Helsinki, July 2008*.
17. Rissanen J: Modeling by shortest data description. *Automatica* 1978, 14: 465-471.
18. Rissanen J: Universal coding, information, prediction, and estimation. *IEEE Trans. Inf. Theory* 1984, 30(4):629-636.
19. Barron A, Rissanen J, Yu B: The minimum description length principle in coding and modeling. *IEEE Trans. Inf. Theory* 1998, 44(6):2743-2760.
20. Ramirez I, Sapiro G: An MDL framework for sparse coding and dictionary learning. *IEEE Trans. Signal Process* 2012, 60(6):2913-2927.
21. Liu J, Gao SW, Luo ZQ, Davidson TN, Lee JPY: The minimum description length criterion applied to emitter number detection and pulse classification. *Proceedings of the Ninth IEEE Workshop on Statistical Signal and Array Processing, Portland, OR, Sept 1998*.
22. Wong KM, Luo ZQ, Liu J, Lee JPY, Gao SW: Radar emitter classification using intrapulse data. *Int. J. Electron. Comm* 1999, 12: 324-332.
23. Liu J, Lee JPY, Li L, Luo Z, Wong KM: Online clustering algorithms for radar emitter classification. *IEEE Trans. Pattern Anal. Mach. Intell* 2005, 27(8):1185-1196.
24. Cover T, Thomas J: *Elements of Information Theory*. New York: Wiley; 2006.
25. Caruana R: Multi-task learning. *Mach. Learn* 1997, 28(1):41-75. 10.1023/A:1007379606734
26. Baxter J: Learning internal representations. *Proceedings of the Eighth Annual Conference on Computational Learning Theory, Santa Cruz, CA, July 1995*.
27. Baxter J: A model of inductive bias learning. *J. Artif. Intell. Res* 2000, 12: 149-198.
28. Lawrence ND, Platt JC: Learning to learn with the informative vector machine. *Proceedings of the 21st International Conference on Machine Learning, Banff, Alberta, July 2004, 65*.
29. Yu K, Tresp V, Schwaighofer A: Learning Gaussian processes from multiple tasks. *Proc. 22nd Int. Conf. Mach. Learn. (ICML 22), 2005*.
30. Zhang J, Ghahramani Z, Yang Y: Learning multiple related tasks using latent independent component analysis. *Advances in Neural Information Processing Systems (NIPS), Vancouver, British Columbia, Dec 2006*.
31. Ando RK, Zhang T: A framework for learning predictive structures from multiple tasks and unlabeled data. *J. Mach. Learn. Res* 2005, 6: 1817-1853.
32. Evgeniou T, Micchelli CA, Pontil M: Learning multiple tasks with kernel methods. *J. Mach. Learn. Res* 2005, 6: 615-637.
33. Burr D, Doss H: A Bayesian semiparametric model for random-effects meta-analysis. *J. Amer. Stat. Assoc* 2005, 100: 242-251. 10.1198/016214504000001024
34. Dominici F, Parmigiani G, Wolpert R, Reckhow K: Combining information from related regressions. *J. Agric. Biolog. Environ. Stat* 1997, 2(3):294-312. 10.2307/1400447
35. Hoff PD: Nonparametric modeling of hierarchically exchangeable data. Technical report, University of Washington; 2003.
36. Muller P, Quintana F, Rosner G: A method for combining inference across related nonparametric Bayesian models. *J. R. Stat. Soc. Ser. B* 2004, 66(3):735-749. 10.1111/j.1467-9868.2004.05564.x
37. Mallick BK, Walker SG: Combining information from several experiments with nonparametric priors. *Biometrika* 1997, 84(3):697-706. 10.1093/biomet/84.3.697
38. Tang L, Zhou Z, Shi L, Yao H, Ye Y, Zhang J: Laplace prior based distributed compressive sensing. *Proceeding of the 5th International ICST Conference on Communications and Networking in China, Beijing, Aug 2010*.
39. Themelis KE, Rontogiannis AA, Koutroumbas KD: A novel hierarchical Bayesian approach for sparse semisupervised hyperspectral unmixing. *IEEE Trans. Signal Process* 2012, 60(2):585-599.
40. Bishop CM: *Pattern Recognition and Machine Learning*. New York: Springer-Verlag; 2006.

## Acknowledgements

The authors wish to thank the editor and the anonymous reviewers for their constructive suggestions. The authors thank S. Derin Babacan, Shihao Ji, and David Dunson for sharing codes of their algorithms. This work was supported in part by Hunan Provincial Innovation Foundation for Postgraduates under Grant CX2012B019, Fund of Innovation, Graduate School of National University of Defense Technology under grant B120404, and National Natural Science Foundation of China (no. 61304264).

## Additional information

### Competing interests

The authors declare that they have no competing interests.

## Rights and permissions

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Wang, YG., Yang, L., Tang, L. *et al.* Enhanced multi-task compressive sensing using Laplace priors and MDL-based task classification.
*EURASIP J. Adv. Signal Process.* **2013**, 160 (2013). https://doi.org/10.1186/1687-6180-2013-160

DOI: https://doi.org/10.1186/1687-6180-2013-160

### Keywords

- Multi-task
- Compressive sensing
- Laplace priors
- Minimum description length
- Task classification