An augmented Lagrangian multiscale dictionary learning algorithm
EURASIP Journal on Advances in Signal Processing volume 2011, Article number: 58 (2011)
Abstract
Learning overcomplete dictionaries for sparse signal representation has attracted considerable research interest in recent years, yet most existing approaches share a serious drawback: they are prone to converging to local minima. In this article, we present a novel augmented Lagrangian multiscale dictionary learning (ALMDL) algorithm. It first recasts the constrained dictionary learning problem into an augmented Lagrangian (AL) scheme and then updates the dictionary after each inner iteration of the scheme, during which a majorization-minimization technique is employed to solve the inner subproblem. Refining the dictionary from low scale to high makes the proposed method less dependent on the initial dictionary and hence helps it avoid local optima. Numerical tests on synthetic data and denoising applications on real images demonstrate the superior performance of the proposed approach.
1. Introduction
In the last two decades, more and more studies have focused on dictionary learning, the goal of which is to model signals as sparse linear combinations of atoms that form a dictionary, within a certain error tolerance. Sparse representation of signals under a learned dictionary possesses significant advantages over prespecified dictionaries such as wavelets and the discrete cosine transform (DCT), as demonstrated in the literature [1–3], and it has been widely used in denoising, inpainting, and classification with state-of-the-art results [1–5]. Consider a signal b_{l} ∈ R^{M}. It can be represented by a linear combination of a few atoms either exactly as b_{l} = Ax_{l} or approximately as b_{l} ≈ Ax_{l}, where A represents the dictionary and x_{l} denotes the representation coefficients. Given an input matrix B = [b_{1}, ..., b_{L}] in R^{M×L} of L signals, the problem can then be formulated as an optimization problem jointly over a dictionary A = [a_{1}, ..., a_{J}] in R^{M×J} and the sparse representation matrix X = [x_{1}, ..., x_{L}] in R^{J×L}, namely
where ‖·‖_{0} denotes the l_{0} norm, which counts the number of nonzero coefficients of a vector, ‖·‖_{2} stands for the Euclidean norm on R^{M}, and τ is the tolerable limit of error in reconstruction.
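With these definitions, Equation 1 can be sketched in the following form. This is inferred from the surrounding text only: summing the sparsity terms over l is an assumption, and any normalization constraint on the atoms a_{j} is left out.

$$\underset{A,X}{\min }\sum _{l=1}^{L}{\Vert {x}_{l}\Vert }_{0}\quad \text{s.t.}\quad {\Vert {b}_{l}-A{x}_{l}\Vert }_{2}\le \tau ,\quad l=1,\dots ,L$$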
Most of the existing methods for solving Equation 1 can essentially be interpreted as generalizations of the K-means clustering algorithm, because they are usually two-step iterative approaches consisting of a sparse coding step, where a sparse approximation X is found with A fixed, and a dictionary update step, where A is optimized based on the current X [1]. After initialization of the dictionary A, these algorithms keep iterating between the two steps until either a predefined number of alternating optimizations has been run or a specific approximation error is reached. Concretely, at the sparse coding step, the solution of Equation 1 with respect to a fixed dictionary A can be sought by optimizing over each x_{l} individually as follows:
or in the equivalent form:
where λ is the regularization parameter related to τ, which tunes the weight between the regularization term ‖x_{l}‖_{0} and the fidelity term ${\Vert {b}_{l}-A{x}_{l}\Vert }_{2}^{2}$. Solving Equation 2 or 3 is an NP-hard problem [6]. One way to approach it is through greedy pursuit algorithms such as matching pursuit (MP) and its variants [7, 8]; another commonly used approach is to relax the optimization problem convexly via basis pursuit [9], such as iterated thresholding [10], FOCal Underdetermined System Solution (FOCUSS) [11], and the LARS-Lasso algorithm [12].
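To make the greedy option concrete, the following is a minimal sketch of plain matching pursuit with an error-based stopping rule, in pure Python. The function name `matching_pursuit`, the storage of the dictionary as a list of atoms, and the `max_iter` cap are illustrative assumptions, and the atoms are assumed to have unit Euclidean norm; it is not the implementation used in [7, 8].

```python
import math


def matching_pursuit(atoms, b, tau, max_iter=100):
    """Greedy sparse coding of b over unit-norm atoms (hypothetical helper).

    atoms: list of J atoms, each a list of M floats with unit Euclidean norm.
    b:     signal, a list of M floats.
    tau:   stop once the residual norm is at most tau.
    Returns the coefficient vector x and the final residual b - A x.
    """
    x = [0.0] * len(atoms)
    residual = list(b)
    for _ in range(max_iter):
        if math.sqrt(sum(r * r for r in residual)) <= tau:
            break
        # Correlation of every atom with the current residual.
        corr = [sum(a_i * r_i for a_i, r_i in zip(a, residual)) for a in atoms]
        # Select the atom that explains the most residual energy.
        j = max(range(len(atoms)), key=lambda i: abs(corr[i]))
        x[j] += corr[j]
        residual = [r_i - corr[j] * a_i for r_i, a_i in zip(residual, atoms[j])]
    return x, residual
```

Each pass adds the contribution of a single atom, so the number of nonzero coefficients grows by at most one per iteration until the error target τ is met.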
At the dictionary updating step, when the optimization problem Equation 1 is solved over bases A given fixed coefficients X, it reduces to a least squares problem with quadratic constraints as shown in Equation 4:
In general, this constrained optimization problem can be solved using several methods. One simple technique is gradient descent, as in maximum likelihood (ML) [13, 14] and maximum a posteriori (MAP) estimation with iterative projection [15]. Another is a dual version derived from its Lagrangian, proposed by Lee et al. [16]. The method of optimal directions (MOD) [17], proposed by Engan et al., is also a common technique, which solves the problem using the pseudoinverse of X. Among all these methods, the most important breakthrough is the K-singular value decomposition (K-SVD) proposed by Aharon et al. [1]. K-SVD uses a different strategy: the columns of A are updated sequentially, one at a time, using an SVD to minimize the approximation error. Hence, the dictionary updating step becomes a true generalization of K-means, since each patch can be represented by multiple atoms with different weights.
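For reference, with X fixed and the norm constraint on the atoms set aside, the least squares fit behind the pseudoinverse-based update has the well-known closed form below. The second equality assumes that XX^{T} is invertible; renormalizing the atoms afterwards is a common practice that is not stated in the text above.

$$A=B{X}^{+}=B{X}^{T}{\left(X{X}^{T}\right)}^{-1}$$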
Recently, much effort has been devoted to tightening or loosening the constraints on the dictionary. Some parametric dictionary learning algorithms are proposed in [18, 19], which optimize only the parameters of prespecified atoms (e.g., Gabor-like atoms) instead of the dictionary itself and thus decrease the dimensionality of the corresponding optimization problem. However, these algorithms depend heavily on selecting a proper parametric dictionary experimentally in advance and match well only the structure of a specific class of signals. In contrast, the nonparametric (Bayesian) approaches proposed in [20, 21] learn the dictionary using a prior stochastic process, which automatically estimates the dictionary size and makes no explicit assumption on the noise variance; their drawback is the computational load. Little attention in the literature, however, has been paid to making the generalized clustering ability of the dictionary more stable.
Since the traditional dictionary learning methods can be viewed as various extensions of the K-means clustering method, a common drawback is that they are prone to local minima, i.e., the efficiency of the algorithms depends heavily on either the sample type or the initialization. Figure 1 shows a two-dimensional toy example in which the atom identification ability of the K-SVD algorithm is investigated for two different sample types. Both comprise 1,000 samples, and each sample is a multiple of one of eight basis vectors plus additive noise. In the first type, the dictionary's atoms have a uniform angular distribution in the circle; in the second, seven atoms have a uniform angular distribution in half a circle and the last one does not. We run K-SVD over 50 runs with different realizations of coefficients and noise and obtain the average identification rate of the two types, which is 87% for the former and almost 100% for the latter. The main reason for this phenomenon is that the K-SVD algorithm is sensitive to initialization and has difficulty updating the atoms in the correct direction when the samples are distributed in a nondirectional, irregular way. A natural way to alleviate this problem is to first update the dictionary on a low-resolution or smoothed version of the samples and then let the smoothed samples converge asymptotically to the original samples while refining the dictionary.
In this article, we propose a specific realization of such a multiscale strategy: the constrained dictionary learning problem is first transformed into the augmented Lagrangian (AL) framework, and the dictionary is then refined from low scale to high. We name this approach the AL-based multiscale dictionary learning (ALMDL) algorithm. AL is a standard and elementary tool in the optimization field; it converges fast, even superlinearly, when its penalty parameter is driven to infinity [22, 23]. A closely related algorithm is the Bregman iterative method, originally proposed by Osher et al. [24] for the total variation regularization model; the two are identical when the constraint is linear [25]. In the setting studied in this article, AL is equivalent to the Bregman iterative method. We choose to follow the AL perspective instead of Bregman iteration only because AL is widely used in the optimization community. Usually, a "decoupling" strategy (e.g., the alternating direction method, ADM) is used to solve the subproblem of the AL scheme, which allows AL to be implemented efficiently in many inverse problems [26–29]. In this article, we resort to a variant in this spirit: we employ a modified majorization-minimization (MM) technique to tackle the subproblem, which provides both solution accuracy and implementation efficiency.
The rest of the article is organized as follows. Section 2 describes the proposed method in two parts, i.e., the multiscale dictionary learning framework and the subproblem of inner minimization. In Section 3, we conduct experiments on synthetic data and compare the ability of the method to recover the original dictionary with that of K-SVD and MOD. Its ability to denoise real images is then tested and compared with K-SVD in Section 4. Finally, Section 5 concludes the article with remarks.
2. The proposed method
This section introduces the ALMDL algorithm for solving the dictionary learning problem. It is achieved by first recasting the constrained dictionary learning problem into an AL scheme and then updating the dictionary after each inner iteration of the scheme, during which the MM technique is employed to solve the inner subproblem.
2.1 A multiscale dictionary learning framework
In this section, the l_{1} norm is used instead of the l_{0} norm to relax the minimization problem in Equation 2; the resulting optimization problem over x in R^{J}, with the subscript l omitted for clarity, is as follows:
By reformulating the feasible set {x: ‖b − Ax‖_{2} ≤ τ} as an indicator function ${\delta}_{\tau}^{2}\left(b-Ax\right)$, the constrained problem Equation 5 turns into an unconstrained one:
where ${\delta}_{\tau}^{2}\left(z\right)=\left\{\begin{array}{ll}0,& \text{if }{\Vert z\Vert }_{2}\le \tau \\ +\infty ,& \text{otherwise}\end{array}\right.$.
As in [26, 30], the resulting unconstrained problem is then converted into a different constrained problem by applying a variable splitting operation, namely:
We apply the AL method to solve this constrained problem. It replaces the problem with a sequence of unconstrained subproblems whose objective function is the original objective of the constrained optimization plus additional "penalty" terms, which are made up of the constraint functions multiplied by a positive coefficient (for more details of the AL scheme, see [22]), i.e.,
where ${L}_{\beta}\left(x,z\right)\stackrel{\Delta}{=}{\Vert x\Vert }_{1}+{\delta}_{\tau}^{2}\left(z\right)-\langle {y}^{k},Ax+z-b\rangle +\frac{1}{4\beta}{\Vert Ax+z-b\Vert }_{2}^{2}$ and 〈·,·〉 denotes the usual duality product.
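Consistent with lines 3 and 4 of Diagram 1, one AL iteration alternates a joint minimization over (x, z) with a multiplier update. The following is a sketch in per-sample notation, with signs inferred from the splitting constraint Ax + z − b = 0; the initial multiplier y^{0} is not specified in the text and is left open here.

$$\left({x}^{k+1},{z}^{k+1}\right)=\arg \underset{x,z}{\min }{L}_{\beta}\left(x,z\right),\qquad {y}^{k+1}={y}^{k}-\frac{1}{2\beta}\left(A{x}^{k+1}+{z}^{k+1}-b\right)$$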
For conventional dictionary learning approaches, the dictionary is updated after Equations 8 and 9 have been iterated to optimality, and the whole learning procedure alternates until some stopping condition is satisfied. In contrast, here we update the dictionary after each inner iteration of Equations 8 and 9, i.e., taking the derivative of the functional L_{β} with respect to A, we get the following gradient descent update rule:
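The rule can be checked in one step. Differentiating the smooth part of L_{β} with respect to A at (x^{k+1}, z^{k+1}) and substituting the multiplier update gives the expression below. This is a per-sample sketch: summing over the L samples to obtain the matrix form, and the step size μ > 0 taken from line 5 of Diagram 1, are assumptions about how the pieces fit together.

$$\frac{\partial {L}_{\beta}}{\partial A}=-{y}^{k}{x}^{T}+\frac{1}{2\beta}\left(Ax+z-b\right){x}^{T}=-\left[{y}^{k}-\frac{1}{2\beta}\left(Ax+z-b\right)\right]{x}^{T}=-{y}^{k+1}{x}^{T}$$

A descent step of size μ along the negative gradient, summed over all samples, then reads

$${A}_{k+1}={A}_{k}+\mu {Y}^{k+1}{\left({X}^{k+1}\right)}^{T}$$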
A merit of the AL methodology is its superior convergence property: Ax^{k} → Ax* = b − z* [22], where each iterate "Ax^{k}" can be viewed as a low-resolution or smoothed version of the true image patches "Ax*". If each iterative step is regarded as a scale, then the dictionary update, which sums the products of primal and dual variables (i.e., Equation 10), can be seen as a refinement process from the low scale to the high one. As discussed in the introduction, this method can avoid local optima because only the main features of the image patches exist at the initial stage of the iteration. The proposed ALMDL method is listed in Diagram 1.
Diagram 1. The general description of the ALMDL algorithm
1: initialization: X^{0} = 0; A_{0}
2: while stop criterion not satisfied
3:   for l = 1, ..., L, $\left\{{x}_{l}^{k+1},{z}_{l}^{k+1}\right\}=\arg \underset{{x}_{l},{z}_{l}}{\min }{L}_{\beta}\left({x}_{l},{z}_{l}\right)$, where ${L}_{\beta}\left({x}_{l},{z}_{l}\right)\stackrel{\Delta}{=}{\Vert {x}_{l}\Vert }_{1}+{\delta}_{\tau}^{2}\left({z}_{l}\right)-\langle {y}_{l}^{k},{A}_{k}{x}_{l}+{z}_{l}-{b}_{l}\rangle +\frac{1}{4\beta}{\Vert {A}_{k}{x}_{l}+{z}_{l}-{b}_{l}\Vert }_{2}^{2}$
4:   ${Y}^{k+1}={Y}^{k}+\frac{1}{2\beta}\left(-{A}_{k}{X}^{k+1}-{Z}^{k+1}+B\right)$
5:   A_{k+1} = A_{k} + μY^{k+1}(X^{k+1})^{T}
6: end while
2.2 The subproblem of inner minimization
From the pseudocode of the proposed algorithm in Diagram 1, it is clear that the speed and accuracy of the proposed method depend heavily on how the subproblem over the variables x and z is solved, so a simple and efficient method is needed to keep the whole algorithm efficient. Ideally, the minimization of Equation 8 with respect to z can be computed analytically and z can be eliminated:
Denoting b_{1} = −Ax + b + 2βy^{k}, the minimization of the second and third terms in Equation 11 with respect to z is obtained:
Moreover, it follows that
The next crucial problem is how to determine x. It is hard to minimize Equation 11, which is nonlinear with respect to the variable x, so we develop an iterative procedure to find an approximate solution. In the developed method, z is replaced by its last state z^{m}, and the MM technique is employed to add a proximal-like penalty at each inner step so as to cancel out the term ‖Ax‖_{2}^{2} (for more details of the MM technique, see [31–33]). Since both x and z are updated at each inner step, it is reasonable to expect that a satisfactory solution will be obtained after just a few steps. The experimental verification is presented in Section 4.3.
where γ ≥ eig(A^{T}A) and the Shrink operator is defined as $\text{Shrink}\left(f,\mu \right)=\left\{\begin{array}{ll}f-\mu ,& f\ge \mu \\ 0,& -\mu \le f<\mu \\ f+\mu ,& f<-\mu \end{array}\right.$.
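The closed-form elimination of z described above can be sketched as follows. Two readings are assumed rather than quoted: the signs in b_{1} = b − Ax + 2βy^{k}, and the interpretation of the operator TH_{τ} (used in Diagram 2) as the residual of the projection onto the ball of radius τ. For b_{1} = 0 both expressions are taken to be zero.

$${z}^{\ast }=\arg \underset{z}{\min }\left\{{\delta}_{\tau}^{2}\left(z\right)+\frac{1}{4\beta}{\Vert z-{b}_{1}\Vert }_{2}^{2}\right\}=\min \left(1,\frac{\tau}{{\Vert {b}_{1}\Vert }_{2}}\right){b}_{1}$$

$$T{H}_{\tau}\left({b}_{1}\right)\stackrel{\Delta}{=}{b}_{1}-{z}^{\ast }=\max \left(1-\frac{\tau}{{\Vert {b}_{1}\Vert }_{2}},0\right){b}_{1},\qquad {y}^{k+1}={y}^{k}-\frac{1}{2\beta}\left(Ax+{z}^{\ast }-b\right)=\frac{1}{2\beta}T{H}_{\tau}\left({b}_{1}\right)$$

The first line follows by completing the square in the terms of L_{β} that involve z, which turns the z-step into a Euclidean projection of b_{1} onto the feasible ball.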
In summary, the proposed ALMDL algorithm consists of a two-level nested loop: the outer loop updates the dual variables and the dictionary, while the inner loop minimizes over the primal variables to ensure the accuracy of the algorithm. The detailed description of the algorithm is listed in Diagram 2. The initial dictionary A_{0} in line 1 can be any predefined matrix (e.g., the redundant DCT dictionary); the operator TH_{τ}(Y) in line 4 is applied to each column of the matrix Y individually.
Diagram 2. The detailed description of the ALMDL algorithm
1: initialization: X^{0} = 0; A_{0}
2: while stop criterion not satisfied (loop in k):
3:   while stop criterion not satisfied (loop in m):
4:     ${Y}^{m+1}=\frac{1}{2\beta}T{H}_{\tau}\left[2\beta {C}^{k}-\left({A}_{k}{X}^{k,m}-B\right)\right]$
5:     ${X}^{k,m+1}=\text{Shrink}\left({X}^{k,m}+\frac{2\beta}{\gamma}{A}_{k}^{T}{Y}^{m+1},\frac{2\beta}{\gamma}\right)$
6:   end while
7:   C^{k+1} = Y^{m+1}; X^{k+1, 0} = X^{k, m+1}
8:   A_{k+1} = A_{k} + μC^{k+1}(X^{k+1, 0})^{T}
9: end while
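The inner loop (lines 3 to 6 of Diagram 2) can be sketched for a single sample in pure Python as follows. The function names, the storage of the dictionary as a list of rows, and the fixed iteration count `m_max` in place of a stop criterion are illustrative assumptions; `th_tau` implements TH_{τ} under the assumed reading that it returns the part of a vector lying outside the ball of radius τ.

```python
import math


def shrink(f, mu):
    """Soft-threshold a scalar f with threshold mu >= 0."""
    if f >= mu:
        return f - mu
    if f < -mu:
        return f + mu
    return 0.0


def th_tau(v, tau):
    """Residual of projecting v onto the l2 ball of radius tau (assumed TH_tau)."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm <= tau:
        return [0.0] * len(v)
    scale = (norm - tau) / norm
    return [scale * c for c in v]


def inner_loop(A, b, c, x, beta, gamma, tau, m_max):
    """Run m_max inner steps for one sample (hypothetical helper).

    A is an M x J dictionary stored as a list of rows, b the signal,
    c the multiplier kept from the previous outer iteration, x the start point.
    gamma should be at least the largest eigenvalue of A^T A.
    """
    M, J = len(A), len(A[0])
    y = list(c)
    step = 2.0 * beta / gamma
    for _ in range(m_max):
        # Line 4: y = TH_tau(2*beta*c - (A x - b)) / (2*beta)
        Ax = [sum(A[i][j] * x[j] for j in range(J)) for i in range(M)]
        arg = [2.0 * beta * c[i] - (Ax[i] - b[i]) for i in range(M)]
        y = [v / (2.0 * beta) for v in th_tau(arg, tau)]
        # Line 5: x = Shrink(x + (2*beta/gamma) A^T y, 2*beta/gamma)
        Aty = [sum(A[i][j] * y[i] for i in range(M)) for j in range(J)]
        x = [shrink(x[j] + step * Aty[j], step) for j in range(J)]
    return x, y
```

As a sanity check, with an identity dictionary, τ = 0, and a zero multiplier, the iteration settles at the soft-thresholded signal Shrink(b, 2β), which is the minimizer of ‖x‖_{1} + (1/(4β))‖x − b‖_{2}^{2}.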
2.3 A hybrid method for improving performance
At first glance, our proposed iterative scheme for x^{m+1} appears very similar to the iterative shrinkage/thresholding algorithm (ISTA), which has been intensively studied in the fields of compressed sensing and image recovery [10, 11, 25, 26, 28, 29, 32, 34]. To improve the efficiency of ISTA, various techniques have been applied to Equation 13. The simplest and fastest approaches of recent years include FPC [35], SpaRSA [36], and FISTA [34]. In fact, as noted in [32, 34], the MM technique we employ in Section 2.2 can lead to ISTA (for details, see also our derivation in Appendix 1). The main novelty of our work is that we accelerate the ISTA iteration for the variable x^{m+1} by using the up-to-date z^{m}, i.e., at each inner ISTA iteration, x^{m+1} benefits from the latest value z^{m}. As seen from Diagram 2, using the up-to-date z^{m} accelerates the convergence of both x and y, and therefore the corresponding update of the dictionary A, A_{k+1} = A_{k} + μY^{k+1}(X^{k+1, 0})^{T}, is accelerated accordingly.
After this paper was submitted for publication, we became aware^{a} of some very recent studies by Yang [29] and Ganesh [37]. The ADM framework adopted by these authors is very similar to ours, i.e., they first introduce auxiliary variables to reformulate the original problem in the form of an AL scheme and then apply alternating minimization to the corresponding AL functions. The main difference between these methods and ours lies in the application field: Ganesh's and Yang's methods, in the compressed sensing field, pursue the sparsest coefficient x under a predefined transform or dictionary, while our method is devoted to obtaining the optimal dictionary A.
With this in mind, the major distinctions between our method and Ganesh's and Yang's methods are as follows. First, Yang's study applies the basic ISTA to solve the inner minimization with respect to the variable x [29, p. 6]. Second, Ganesh's study applies FISTA, an accelerated variant of ISTA, to solve the inner minimization with respect to x [37, pp. 15–16]. Both Ganesh's and Yang's methods try to find the sparsest solution under a fixed transform or dictionary [29, 37]. Finally, in our work we aim to obtain the optimal dictionary, whose update in the iterative process is A_{k+1} = A_{k} + μY^{k+1}(X^{k+1, 0})^{T}. This indicates that the convergence of the update of A depends on both x and y. We therefore modified the naïve ISTA with respect to the variable x^{m+1} by taking advantage of the up-to-date z^{m}. Under these circumstances, both x and y are accelerated, and thereby the update of the dictionary A is accelerated. As a byproduct, through this modification the variable z is omitted and implicitly updated in the iterative scheme, so the whole iterative procedure reduces to a very simple and compact form. It is worth noting that, since the number of training samples is very large for the dictionary learning problem (the number adopted in the experiments of Section 4 is 62,001), a simple iteration formula is essential.
For comparison purposes, we modify and extend Yang's and Ganesh's methods to the dictionary learning problem by adding a dictionary update stage, i.e., we update the dictionary A in the same way as in Equation 10 of Section 2.1. We call the extended Yang's method ADM-ISTA-DL and the extended Ganesh's method ADM-FISTA-DL. Detailed descriptions of the two methods are presented in Diagrams 3 and 4 in Appendix 2, respectively. Furthermore, since both our method and Ganesh's can be viewed as acceleration techniques for ISTA, we can integrate them into a unified framework for our dictionary learning problem. Diagram 5 shows the pseudocode of the hybrid algorithm. As can be seen from the diagram, lines 5 and 6 come from our method, which accelerates the variables x and y, while lines 7 and 8, which belong to FISTA, aim at accelerating the variable x. Compared with ADM-FISTA-DL shown in Appendix 2, the proposed hybrid algorithm has a simpler form and faster convergence of x and y. Compared with our ALMDL shown in Diagram 2, it inherits the strength of FISTA. We therefore expect the hybrid algorithm to perform better than both, with a computational cost between those of our ALMDL and ADM-FISTA-DL; the numerical comparison of the three approaches is conducted in Section 4.3. For real applications of dictionary learning such as image denoising, we still choose the primary ALMDL because of its simple and compact form.
Diagram 5. The detailed description of the hybrid algorithm
1: initialization: X^{0} = 0; A_{0}
2: while stop criterion not satisfied (loop in k):
3:   W^{1} = X^{k}; Q^{1} = X^{k}; t_{1} = 1
4:   while stop criterion not satisfied (loop in m):
5:     ${Y}^{m+1}=\frac{1}{2\beta}T{H}_{\tau}\left[2\beta {C}^{k}-\left({A}_{k}{Q}^{m}-B\right)\right]$
6:     ${W}^{m+1}=\text{Shrink}\left({Q}^{m}+\frac{2\beta}{\gamma}{A}_{k}^{T}{Y}^{m+1},\frac{2\beta}{\gamma}\right)$
7:     ${t}_{m+1}=\frac{1}{2}\left(1+\sqrt{1+4{t}_{m}^{2}}\right)$
8:     ${Q}^{m+1}={W}^{m+1}+\frac{{t}_{m}-1}{{t}_{m+1}}\left({W}^{m+1}-{W}^{m}\right)$
9:   end while
10:   C^{k+1} = Y^{m+1}; X^{k+1} = W^{m+1}
11:   A_{k+1} = A_{k} + μC^{k+1}(X^{k+1})^{T}
12: end while
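Lines 7 and 8 of Diagram 5 use the standard FISTA momentum sequence. The short sketch below computes the extrapolation weights (t_{m} − 1)/t_{m+1} for the first few inner steps, with the minus sign in the numerator read from the usual FISTA rule; the function name `fista_weights` is a hypothetical helper, not part of the original algorithm listing.

```python
import math


def fista_weights(n):
    """Return the weights (t_m - 1) / t_{m+1} for m = 1..n, starting from t_1 = 1."""
    t = 1.0
    weights = []
    for _ in range(n):
        t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        weights.append((t - 1.0) / t_next)
        t = t_next
    return weights
```

The first weight is zero, so the first inner step coincides with a plain shrinkage step, and the weights then grow toward 1, which is what gives later steps their momentum.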
3. Synthetic experiments
To evaluate the proposed method, ALMDL, we first apply it to artificial data to test its ability to recover the original dictionary and then compare it with two other methods: K-SVD and MOD.
3.1 Test data and comparison criterion
The experiment described in [1] is repeated. First, a basis A^{orig} ∈ R^{M×J} is generated, consisting of J = 50 basis vectors of dimension M = 20. Then 1,500 data signals {b_{1}, b_{2}, ..., b_{1500}} are produced, each obtained as a linear combination of three basis vectors with independent identically distributed (i.i.d.) uniformly distributed coefficients in random and independent locations. Finally, Gaussian noise with varying SNR is added to the resulting data to obtain the test data.
For the comparison criterion, the learned bases were obtained by applying K-SVD, MOD, and ALMDL to the data. As in [1], we compare the learned basis with the original basis using the maximum overlap between each original basis vector ${a}_{j}^{\text{orig}}$ and the learned basis vector ${a}_{j}^{\text{learn}}$, i.e., whenever $\underset{j}{\max }\left(1-\left|{a}_{j}^{\text{orig}}{a}_{j}^{\text{learn}}\right|\right)$ is smaller than 0.01, we count this as a success [1].
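A minimal sketch of this success count in pure Python is given below. It assumes unit-norm atoms and reads "maximum overlap" as the largest absolute inner product between an original atom and any learned atom; the function name `count_recovered_atoms` and the `tol` parameter are illustrative.

```python
def count_recovered_atoms(orig_atoms, learned_atoms, tol=0.01):
    """Count original atoms matched by some learned atom (hypothetical helper).

    Both arguments are lists of unit-norm atoms, each a list of floats.
    An original atom counts as recovered when 1 - |<a_orig, a_learn>| < tol
    for its best-matching learned atom; the sign of the atom is ignored.
    """
    recovered = 0
    for a in orig_atoms:
        best_overlap = max(
            abs(sum(p * q for p, q in zip(a, d))) for d in learned_atoms
        )
        if 1.0 - best_overlap < tol:
            recovered += 1
    return recovered
```

Taking the absolute value makes the criterion insensitive to sign flips, which matters because an atom and its negation span the same direction.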
3.2 The parameter of the algorithm
The impact of the parameter β on the ALMDL algorithm is investigated in this section. In the case of SNR = 10 dB, we set β = 0.22, 0.44, 0.66, and 0.88, respectively, and run the algorithm for 180 iterations. Over the course of the iterations, we track the number of detected atoms and the root mean square error (RMSE), defined as $\text{RMSE}=1/\sqrt{ML}{\Vert B-{A}_{k}{X}^{k}\Vert }_{2}$. As can be seen from Figure 2, the RMSE increases and the number of successfully detected atoms (NSDA) decreases with increasing β. Interestingly, the larger the value of β, the less stable the NSDA; the NSDA increases more gradually and stably when β is very small. However, when β is very small, the algorithm needs more iterations. Thus, in practical implementations the parameter β should be given a relatively small value. For the experiments conducted below, β is set to 0.45 and the number of iterations k is set to 100. For a fair comparison, the number of learning iterations of K-SVD and MOD is also set to 100, which is larger than that in [1].
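The RMSE above can be computed as in the following sketch, which assumes that the matrix norm is the entrywise (Frobenius) norm and stores all matrices as lists of rows; the function name `rmse` is illustrative.

```python
import math


def rmse(B, A, X):
    """Root mean square reconstruction error of B by A X (hypothetical helper).

    B is M x L, A is M x J, X is J x L, all stored as lists of rows.
    The norm is taken entrywise, which is an assumption about the definition.
    """
    M, L, J = len(B), len(B[0]), len(X)
    total = 0.0
    for i in range(M):
        for l in range(L):
            recon = sum(A[i][j] * X[j][l] for j in range(J))
            total += (B[i][l] - recon) ** 2
    return math.sqrt(total) / math.sqrt(M * L)
```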
3.3 Comparison results
The ability to recover the original dictionary is tested for three methods, namely K-SVD, MOD, and ALMDL, and the comparison results are given in this section. We repeat the experiment 50 times at each SNR of 10, 20, 30, 40, and 50 dB. As in [1], for each noise level, we sort the 50 trials according to the number of successfully learned basis elements and order them in groups of 10 experiments. Figure 3 shows the results of K-SVD, MOD, and ALMDL. As can be seen, our algorithm outperforms both of the others; especially when the noise level is low, ALMDL recovers the atoms much more accurately. Because not only the test dictionary but also the coefficients are generated in random and independent locations, this specific distribution of the sample data widens the performance gap between our proposed ALMDL and K-SVD. This suggests that our method performs better on images with irregular objects, an advantage that is also examined for real applications in the next section.
4. Numerical experiments of image denoising
This section presents the dictionary learned by the ALMDL algorithm and demonstrates its behavior and properties in comparison with the K-SVD algorithm. We have tested our method in various denoising tests on a set of six 8-bit grayscale standard images shown in Figure 4: "Barbara", "House", "Boat", "Lena", "Peppers", and "Cameraman". In the experiment, the whole process involves the following steps:
• Let $\widehat{I}$ be a corrupted version of the image I (256 × 256) after the addition of white zero-mean Gaussian noise with power σ_{n}. Data examples {b_{1}, b_{2}, ..., b_{62001}} of 8 × 8 pixels are extracted from the noisy image $\widehat{I}$, and an initial dictionary A_{0} is chosen for both of the training algorithms.
• In the sparse coding stage of the learning procedure, each patch is extracted and sparse-coded. For ALMDL we set m = 7, β = 100, and the target error $\tau =C\sqrt{M}{\sigma}_{n}$ with the default value C = 1.15. The iteration is repeated until the error target has been met. Meanwhile, an error-constrained orthogonal MP (OMP) implementation is used in the K-SVD algorithm [2, 38] (the K-SVD codes are available at http://www.cs.technion.ac.il/~elad/software/) to solve Equation 1 with the same target error as mentioned above, and K-SVD runs for ten iterations. To enable a fair comparison, after the learning procedure the data samples are sparse-coded using OMP under the learned dictionary for both algorithms. These steps lead to approximate patches with reduced noise $\left\{{\tilde{b}}_{1},{\tilde{b}}_{2},\cdots ,{\tilde{b}}_{62001}\right\}$.
• The output image $\tilde{I}$ is obtained by placing the patches $\left\{{\tilde{b}}_{1},{\tilde{b}}_{2},\cdots ,{\tilde{b}}_{62001}\right\}$ in their proper locations and averaging the contributions at each pixel; the implementation is the same as in [2].
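The numbers in these steps can be checked with a short sketch. It assumes that patches are extracted at every pixel offset (fully overlapping, stride 1), which is consistent with 62,001 = 249 × 249 patches from a 256 × 256 image, and that M = 64 is the number of pixels in an 8 × 8 patch; the function names are illustrative.

```python
import math


def num_patches(image_size, patch_size):
    """Number of fully overlapping square patches in a square image (stride 1 assumed)."""
    per_axis = image_size - patch_size + 1
    return per_axis * per_axis


def target_error(sigma_n, patch_size, C=1.15):
    """Target error tau = C * sqrt(M) * sigma_n with M = patch_size ** 2."""
    M = patch_size * patch_size
    return C * math.sqrt(M) * sigma_n
```

For example, `num_patches(256, 8)` gives the 62,001 training examples used in the experiments, and `target_error(10.0, 8)` gives a target error of about 92 for σ_{n} = 10.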
4.1 The learned dictionary
In this section, we investigate the sensitivity of the dictionaries generated by ALMDL and K-SVD to initialization. First, two dictionaries are chosen as initializations: one is the redundant DCT dictionary (Figure 5a) and the other is a random matrix whose atoms are randomly chosen from the training data (Figure 5b). Both dictionaries consist of J = 256 atoms, and each atom is shown as an 8 × 8 pixel image. Then ALMDL and K-SVD are used to denoise the image "Cameraman" with σ = 10, and the two sequences of dictionaries generated by the two methods are shown in Figures 6 and 7, respectively. From the top row of each figure it can be seen that ALMDL changes the dictionary drastically while K-SVD does not; thus, the proposed algorithm has a good ability to recover the main prototypes in the first few stages. Moreover, these figures also show that ALMDL has another desirable property: it is insensitive to initialization, because the final learned dictionaries are very similar to each other regardless of the atom locations (see Figures 6e and 7e), while K-SVD depends too much on the initialization. Thus, our proposed method largely avoids getting trapped in local optima.
4.2 Denoised results
In this section, ALMDL is compared with K-SVD in image denoising applications. The six test images in Figure 4 can be classified into two categories based on the distributions of their overlapping patches, which can be distinguished by the patches' standard covariance and principal components. The first three images (i.e., "Barbara", "House", "Boat"), typically characterized by regular textures or edges, are classified as regular, while the latter three images (i.e., "Lena", "Peppers", "Cameraman"), typically characterized by irregular objects, are classified as irregular. Figure 8a,b shows the standard covariance matrices of the 62,001 patch examples extracted from "Barbara" and "Cameraman", respectively, with standard deviation σ = 20. The entries of each 64 × 64 matrix are between 0 and 1. As can be seen, the coordinates in the 64-dimensional space of the image "Barbara" are more closely related than those of the image "Cameraman". The projection of these patch examples onto the first two dimensions (through the PCA transform), presented in Figure 8c,d, also demonstrates the different forms of distribution.
We now present denoising results obtained by our ALMDL approach and the K-SVD method with noise levels in the range [5, 100]. Every reported result is an average over five experiments with different realizations of the noise. In Table 1, the PSNRs for the six test images using our ALMDL approach are compared with those of K-SVD when the redundant DCT is chosen as the initial dictionary, and the best result obtained by the two methods is highlighted. From this table we conclude that our method is better than K-SVD for all noise levels lower than σ = 25, and Table 2 shows that the conclusion remains valid when the initial dictionary is a random matrix. To better visualize the comparison, Figure 9 shows the difference between the denoising results of ALMDL and K-SVD. It can be seen that our proposed approach outperforms K-SVD for almost all noise levels, especially for the second type of images. Regardless of the initial dictionary, the PSNR obtained by ALMDL has an average advantage of 0.2 dB over K-SVD for all noise levels lower than σ = 25, but as the noise increases, the advantage of our approach gradually weakens; addressing this will be a future research direction. Figure 10 plots the initial dictionary, the dictionaries trained by K-SVD and by our ALMDL algorithm, and the corresponding denoised results for the image "Cameraman" with σ = 15. To facilitate the visual assessment of image quality, small regions of the image are boxed in Figure 10d–f, in which the differences in the edges and in the residual noise can be clearly observed. Figure 10e shows blurred edges, whereas the proposed method still keeps most of the edge. Moreover, the small boxes in Figure 10e,f also show that the K-SVD result retains some noise while ours does not.
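The text does not spell out how PSNR is computed. The sketch below assumes the usual definition for 8-bit images, 10·log10(255² / MSE), applied to images flattened into lists of pixel values; the function name `psnr` and the `peak` parameter are illustrative.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    Uses the conventional definition 10 * log10(peak^2 / MSE); peak = 255
    corresponds to 8-bit grayscale images (an assumption about the setup).
    """
    n = len(reference)
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / n
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)
```

Under this definition, a gain of 0.2 dB corresponds to a reduction of the mean squared error by a factor of about 10^{0.02}, i.e., roughly 4.5%.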
The above experiments are conducted with a fixed number of dictionary elements (i.e., J = 256). We now consider four different numbers of elements: 64, 128, 256, and 512. Figure 11 shows the PSNR values for the images "House" and "Peppers". As can be seen, the denoising ability of both ALMDL and K-SVD improves as the number of dictionary elements increases, while the gap between the PSNR values obtained by the two methods is larger when the number of elements is very small, which indicates that our proposed method is more robust.
4.3 The inner subproblem solving and its computational load
As mentioned in Section 2.2, the inner subproblem is essential to our algorithm, so we test ALMDL with three different numbers of inner iterations (i.e., m = 1, 4, 7). Figure 12 shows the differences between the three denoising results of ALMDL and those of K-SVD, which appears as a zero reference line. These comparisons are presented for the images "Boat" and "Lena". As can be seen, the number of iterations strongly affects the accuracy of the solution for noise levels lower than σ = 25, i.e., the larger the number of iterations, the better the denoising result; and again we conclude that the advantage of our proposed method over K-SVD is greater for noise levels lower than σ = 25, as demonstrated in Section 4.2. So, in practical implementations of the proposed algorithm, better results are often produced with more iterations because the approximation is more accurate. On the other hand, more accurate approximations need more inner iterations and thus more computation. Therefore, an appropriate value of m should be selected to trade off accuracy against efficiency. We suggest that m = 7 inner iterations is a good balance.
As analyzed in Section 2.3, our method is very similar to Yang's [29] and Ganesh's [37] methods despite the different application fields. We have therefore extended them to dictionary learning in Appendix 2, naming the extensions ADMISTADL and ADMFISTADL, respectively, and we compare them with our ALMDL and the consequent hybrid algorithm. We evaluate the four methods on three criteria: RMSE, average L_{1} norm (ALN), and computation time. The evolution of RMSE and ALN reflects an algorithm's effectiveness, while the computation time measures its efficiency.
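The two effectiveness criteria can be sketched as follows. The exact normalizations are assumptions made here for illustration: RMSE is taken as the root of the mean squared entry of the residual B − AX, and ALN as the L_{1} norm of each coefficient vector averaged over the L signals. The function names are hypothetical.

```python
import numpy as np

def rmse(B, A, X):
    """Root-mean-square representation error of B (M x L) under A (M x J), X (J x L).

    Assumption: the mean is taken over all M * L entries of the residual.
    """
    residual = B - A @ X
    return float(np.sqrt(np.mean(residual ** 2)))

def average_l1_norm(X):
    """Average over signals (columns of X) of the l1 norm of the coefficients.

    Assumption: ALN is normalized by the number of signals L.
    """
    return float(np.mean(np.sum(np.abs(X), axis=0)))
```

Tracking both quantities after every outer iteration gives curves of the kind used to compare convergence: a lower RMSE indicates a better fit, and a lower ALN indicates sparser coefficients.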
Figures 13 and 14 show the RMSE and ALN for the image "Boat" in the cases m = 4 and m = 7, respectively. First, compared with ADMISTADL, both ADMFISTADL and our ALMDL converge faster. As the iterations proceed, ADMFISTADL exhibits a slower increase of ALN, since it uses FISTA in the inner minimization and thus reduces ALN more quickly within the predefined number of iterations; our ALMDL exhibits a faster decrease of RMSE owing to the accelerated update of the variables z and y. Second, the hybrid method outperforms ADMISTADL, ADMFISTADL, and ALMDL. Figures 15 and 16 show the RMSE and ALN for the image "Lena" in the cases m = 4 and m = 7, respectively, and a similar phenomenon is observed. Finally, in terms of computation time, Table 3 shows that our method takes the least time when m = 4; when the number of inner iterations increases from m = 4 to m = 7, the computational cost of our method is a little larger than that of ADMISTADL. Considering all three criteria, we conclude that the proposed approach is very promising.
5. Conclusions
In this article, we have developed a primal-dual-based dictionary learning algorithm under the AL framework. The dictionary is updated by adding the product of the dual and primal variables after each iteration of the AL scheme. The main advantage of this strategy is that the algorithm does not depend heavily on the initialization, so it largely avoids getting trapped in local optima. Experiments on image denoising show that (1) our proposed approach outperforms the traditional alternating approaches, especially for "Cameraman"-like images whose constituent patches are distributed in a non-directional, irregular way; (2) our proposed approach is more tolerant of the number of dictionary elements, which is often unknown in signal/image processing applications.
We are currently considering several research directions. For instance, as proved in [22], updating the parameter β of the AL scheme in a nonincreasing way, which includes keeping β constant, guarantees convergence. An automatic selection of β should nevertheless accelerate convergence, and how to achieve it remains an open question.
Appendix 1
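In this appendix, we show that the iterative scheme (13), derived by employing the MM technique, is essentially an ISTA [25, 32]. As mentioned in [25], the standard ISTA formula for solving the general L1-minimization problem of the form
$\min_{x} \; \mu \|x\|_{1} + H(x)$
is
$x^{m+1} = \mathrm{shrink}\left(x^{m} - \delta \nabla H(x^{m}), \; \mu\delta\right)$.
Setting $\mu = 4\beta$, $H(x) = \|Ax - b - 2\beta y^{k} + z^{m}\|_{2}^{2}$, and $\delta = 1/(2\gamma)$, so that $\delta \nabla H(x^{m}) = \frac{1}{\gamma} A^{T}(Ax^{m} - b - 2\beta y^{k} + z^{m})$ and $\mu\delta = 2\beta/\gamma$, we obtain the iterative scheme (13):
$x^{m+1} = \mathrm{shrink}\left(x^{m} - \frac{1}{\gamma} A^{T}\left(Ax^{m} - b - 2\beta y^{k} + z^{m}\right), \; \frac{2\beta}{\gamma}\right)$ (13)
This is the same as the scheme obtained in Section 2.2.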
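The shrinkage iteration discussed in this appendix can be illustrated numerically with a minimal sketch. It solves the generic problem min μ‖x‖₁ + ‖Ax − c‖₂², where the vector `c` is a hypothetical stand-in for the constant terms collected inside H(x), and the step is δ = 1/(2γ). The default choice γ = λ_max(AᵀA) is an assumption made here so that the objective cannot increase; the function names are illustrative and are not the authors' code.

```python
import numpy as np

def shrink(v, threshold):
    """Soft-thresholding: move every entry toward zero by `threshold`."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

def ista(A, c, mu, num_iters, gamma=None):
    """ISTA for min_x mu * ||x||_1 + ||A x - c||_2^2 with step delta = 1 / (2 gamma)."""
    if gamma is None:
        # Assumption: gamma is the largest eigenvalue of A^T A.
        gamma = np.linalg.norm(A, 2) ** 2
    delta = 1.0 / (2.0 * gamma)
    x = np.zeros(A.shape[1])
    history = []
    for _ in range(num_iters):
        grad = 2.0 * A.T @ (A @ x - c)            # gradient of H(x) = ||A x - c||^2
        x = shrink(x - delta * grad, mu * delta)  # threshold mu * delta = mu / (2 gamma)
        history.append(mu * np.abs(x).sum() + np.sum((A @ x - c) ** 2))
    return x, history
```

With μ = 4β the threshold μδ equals 2β/γ. The returned `history` makes the accuracy versus cost trade-off of the inner loop visible: stopping after a small number of iterations m leaves a larger objective value than running more iterations.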
Appendix 2
In this appendix, we modify and extend Yang's [29] and Ganesh's [37] methods to the dictionary learning problem by adding a dictionary updating stage. We are grateful to a referee for pointing out Yang's [29] and Ganesh's [37] studies to us. The ADM framework adopted by these authors is very similar to ours: they first introduce auxiliary variables to reformulate the original problem in the form of an AL scheme, and then apply alternating minimization to the corresponding AL functions. In particular, Yang's study applies ISTA to solve the inner minimization with respect to the variable x [29, p. 6], while Ganesh's study applies the accelerated FISTA to the inner minimization instead [37, pp. 15-16]. Although both seek the sparsest solution under a fixed dictionary, we can extend them to our dictionary learning problem for comparison purposes by updating the dictionary A in the same way as in Equation 10. We call the extension of Yang's method ADMISTADL and the extension of Ganesh's method ADMFISTADL. The two methods are described in detail in Diagrams 3 and 4, respectively.
Diagram 3. The detailed description of the ADMISTADL algorithm
1: initialization: X^{0} = 0; A_{0}
2: while stop criterion not satisfied (loop in k):
3:   $z^{k+1} = \begin{cases} \frac{\tau}{\|b_{1}\|_{2}} b_{1}, & \text{if } \|b_{1}\|_{2} \ge \tau \\ b_{1}, & \text{otherwise} \end{cases}; \quad b_{1} = A_{k} x^{k} + b + 2\beta y^{k}$
4:   while stop criterion not satisfied (loop in m):
5:     $X^{k,m+1} = \mathrm{shrink}\left(X^{k,m} + \frac{1}{\gamma} A_{k}^{T}\left(Z^{k+1} - A_{k} X^{k,m} + B + 2\beta Y^{k}\right), \; \frac{2\beta}{\gamma}\right)$
6:   end while
7:   $X^{k+1} = X^{k,m+1}$; $Y^{k+1} = Y^{k} + \frac{1}{2\beta}\left(Z^{k+1} - A_{k} X^{k+1} + B\right)$
8:   $A_{k+1} = A_{k} + \mu Y^{k+1} (X^{k+1})^{T}$
9: end while
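Two building blocks of Diagram 3 lend themselves to a short sketch: the projection onto the ℓ2 ball of radius τ in step 3, and the dictionary update of step 8. The function names are hypothetical, the projection is written for a single signal vector, and the sketch is illustrative rather than a reproduction of the authors' implementation.

```python
import numpy as np

def project_l2_ball(b1, tau):
    """Step 3: rescale b1 onto the l2 ball of radius tau if it lies outside it."""
    norm = np.linalg.norm(b1)
    # The case norm == tau gives the same result in both branches; using a
    # strict inequality here avoids dividing by zero when tau == 0.
    if norm > tau:
        return (tau / norm) * b1
    return b1

def update_dictionary(A, Y, X, mu):
    """Step 8: A_{k+1} = A_k + mu * Y^{k+1} (X^{k+1})^T.

    Shapes: A is M x J, Y is M x L, X is J x L, so Y @ X.T is M x J.
    """
    return A + mu * Y @ X.T
```

The projection leaves any vector with norm at most τ unchanged, which corresponds to the "otherwise" branch of step 3.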
Diagram 4. The detailed description of the ADMFISTADL algorithm
1: initialization: X^{0} = 0; A_{0}
2: while stop criterion not satisfied (loop in k):
3:   $z^{k+1} = \begin{cases} \frac{\tau}{\|b_{1}\|_{2}} b_{1}, & \text{if } \|b_{1}\|_{2} \ge \tau \\ b_{1}, & \text{otherwise} \end{cases}; \quad b_{1} = A_{k} x^{k} + b + 2\beta y^{k}$
4:   $W^{1} = X^{k}$; $Q^{1} = X^{k}$; $t_{1} = 1$
5:   while stop criterion not satisfied (loop in m):
6:     $W^{m+1} = \mathrm{shrink}\left(Q^{m} + \frac{1}{\gamma} A_{k}^{T}\left(Z^{k+1} - A_{k} Q^{m} + B + 2\beta Y^{k}\right), \; \frac{2\beta}{\gamma}\right)$
7:     $t_{m+1} = \frac{1}{2}\left(1 + \sqrt{1 + 4 t_{m}^{2}}\right)$
8:     $Q^{m+1} = W^{m+1} + \frac{t_{m} - 1}{t_{m+1}}\left(W^{m+1} - W^{m}\right)$
9:   end while
10:  $X^{k+1} = W^{m+1}$; $Y^{k+1} = Y^{k} + \frac{1}{2\beta}\left(Z^{k+1} - A_{k} X^{k+1} + B\right)$
11:  $A_{k+1} = A_{k} + \mu Y^{k+1} (X^{k+1})^{T}$
12: end while
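The inner loop of Diagram 4 (steps 4 to 9) follows the FISTA pattern of [34]. The sketch below applies that pattern to the generic problem min μ‖x‖₁ + ‖Ax − c‖₂² rather than to the exact AL subproblem; the vector `c`, the default value of `gamma`, and the function names are assumptions introduced for illustration.

```python
import numpy as np

def shrink(v, threshold):
    """Soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

def fista(A, c, mu, num_iters, gamma=None):
    """FISTA for min_x mu * ||x||_1 + ||A x - c||_2^2 (generic stand-in problem)."""
    if gamma is None:
        # Assumption: gamma is the largest eigenvalue of A^T A.
        gamma = np.linalg.norm(A, 2) ** 2
    w = np.zeros(A.shape[1])  # W^1
    q = w.copy()              # Q^1
    t = 1.0                   # t_1
    for _ in range(num_iters):
        # Step 6: gradient step at the extrapolated point Q^m, then shrinkage.
        w_next = shrink(q - (1.0 / gamma) * A.T @ (A @ q - c), mu / (2.0 * gamma))
        # Step 7: update of the momentum parameter.
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        # Step 8: extrapolation using the two most recent iterates.
        q = w_next + ((t - 1.0) / t_next) * (w_next - w)
        w, t = w_next, t_next
    return w
```

Compared with plain ISTA, the only additions are the scalar sequence t_m and the extrapolated point Q^m, so the cost per inner iteration is almost unchanged.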
References
1. Aharon M, Elad M, Bruckstein AM: The K-SVD: an algorithm for designing of overcomplete dictionaries for sparse representations. IEEE Trans Signal Process 2006, 54(11):4311-4322.
2. Elad M, Aharon M: Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans Image Process 2006, 15(12):3736-3745.
3. Mairal J, Elad M, Sapiro G: Sparse representation for color image restoration. IEEE Trans Image Process 2008, 17(1):53-69.
4. Aharon M, Elad M: Sparse and redundant modeling of image content using an image-signature-dictionary. SIAM Imag. Sci 2008, 1: 228-247. 10.1137/07070156X
5. Mairal J, Bach F, Ponce J, Sapiro G: Online dictionary learning for sparse coding. In International Conference on Machine Learning ICML' 09. ACM, New York; 2009:689-696.
6. Donoho DL, Elad M, Temlyakov V: Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans Inf. Theory 2006, 50(1):6-18.
7. Mallat S, Zhang Z: Matching pursuits with time-frequency dictionaries. IEEE Trans Signal Process 1993, 41(12):3397-3415. 10.1109/78.258082
8. Tropp JA: Greed is good: algorithmic results for sparse approximation. IEEE Trans Inf. Theory 2004, 50(10):2231-2242. 10.1109/TIT.2004.834793
9. Chen SS, Donoho DL, Saunders MA: Atomic decomposition by basis pursuit. SIAM Rev 2001, 43(1):129-159. 10.1137/S003614450037906X
10. Elad M: Why simple shrinkage is still relevant for redundant representations? IEEE Trans Inf. Theory 2006, 52(12):5559-5569.
11. Gorodnitsky I, Rao B: Sparse signal reconstruction from limited data using FOCUSS: a reweighted minimum norm algorithm. IEEE Trans Signal Process 1997, 45(3):600-616. 10.1109/78.558475
12. Efron B, Hastie T, Johnstone I, Tibshirani R: Least angle regression. Ann. Statist 2004, 32(2):407-499. 10.1214/009053604000000067
13. Olshausen B, Field D: Sparse coding with an overcomplete basis set: a strategy employed by V1? Vis Res 1997, 37(23):3311-3325. 10.1016/S0042-6989(97)00169-7
14. Olshausen BA, Field DJ: Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images. Volume 381. Springer-Verlag, New York; 1996:607-609.
15. Kreutz-Delgado K, Murray J, Rao B, Engan K, Lee T, Sejnowski T: Dictionary learning algorithms for sparse representation. Neural Comput 2003, 15(2):349-396. 10.1162/089976603762552951
16. Lee H, Battle A, Rajat R, Ng AY: Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA; 2007:801-808.
17. Engan K, Aase SO, Husoy JH: Method of optimal directions for frame design. IEEE International Conference Acoust., Speech, Signal Process 1999, 5: 2443-2446.
18. Yaghoobi M, Daudet L, Davies M: Parametric dictionary design for sparse coding. IEEE Trans Signal Process 2009, 57(12):4800-4810.
19. Ataee M, Zayyani H, Zadeh MB, Jutten C: Parametric dictionary learning using steepest descent. In Proc. ICASSP2010. Dallas, TX; 2010:1978-1981.
20. Zhou M, Chen H, Paisley J, Ren L, Sapiro G, Carin L: Nonparametric bayesian dictionary learning for sparse image representations. Neural Information Processing Systems (NIPS) 2009.
21. Dobigeon N, Tourneret JY: Bayesian orthogonal component analysis for sparse representation. IEEE Trans Signal Process 2010, 58(5):2675-2685.
22. Bertsekas D: Constrained Optimization and Lagrange Multiplier Method. Academic Press; 1982.
23. Rockafellar RT: Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math Oper Res 1976, 1(2):97-116. 10.1287/moor.1.2.97
24. Osher S, Burger M, Goldfarb D, Xu J, Yin W: An iterative regularization method for total variation-based image restoration. SIAM JMMS 2005, 4: 460-489.
25. Yin W, Osher S, Goldfarb D, Darbon J: Bregman iterative algorithms for l1-minimization with applications to compressed sensing. SIAM J Imag Sci 2008, 1: 142-168.
26. Afonso M, Bioucas-Dias J, Figueiredo M: Fast image recovery using variable splitting and constrained optimization. IEEE Trans Image Process 2010, 19(9):2345-2356.
27. Goldstein T, Osher S: The split Bregman method for L1-regularized problems. SIAM J Imag Sci 2009, 2(2):323-343. 10.1137/080725891
28. Esser E: Applications of Lagrangian-based alternating direction methods and connections to split Bregman. CAM Report 09-31, UCLA 2009.
29. Yang J, Zhang Y: Alternating direction algorithms for l1 problems in compressive sensing. Technical Report, Rice University 2009. [http://www.caam.rice.edu/~zhang/reports/tr0937.pdf]
30. Tomioka R, Sugiyama M: Dual augmented lagrangian method for efficient sparse reconstruction. IEEE Signal Process. Lett 2009, 16(12):1067-1070.
31. Hunter D, Lange K: A tutorial on MM algorithms. Am Statist 2004, 58: 30-37. 10.1198/0003130042836
32. Daubechies I, De Friese M, De Mol C: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun Pure Appl Math 2004, 57: 3601-3608.
33. Oliveira J, Bioucas-Dias J, Figueiredo MAT: Adaptive total variation image deblurring: a majorization-minimization approach. Signal Process 2009, 89(9):1683-1693. 10.1016/j.sigpro.2009.03.018
34. Beck A, Teboulle M: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imag Sci 2009, 2(1):183-202. 10.1137/080716542
35. Hale E, Yin W, Zhang Y: A fixed-point continuation method for L1-regularized minimization with applications to compressed sensing. CAAM Technical report TR07-07, Rice University, Houston, TX 2007.
36. Wright S, Nowak R, Figueiredo M: Sparse reconstruction by separable approximation. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing 2008.
37. Ganesh A, Wahner A, Zhou Z, Yang AY, Ma Y, Wright J: Face recognition by sparse representation. 2010. [http://www.eecs.berkeley.edu/~yang/paper/face_chapter.pdf]
38. Rubinstein R, Zibulevsky M, Elad M: Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit. Technical Report, CS Technion 2008.
Acknowledgements
This study was partly supported by the High Technology Research Development Plan (863 plan) of P. R. China under 2006AA020805, the NSFC of China under 30670574, Shanghai International Cooperation Grant under 06SR07109, Region Rhône-Alpes of France under the project Mira Recherche 2008, and the joint project of Chinese NSFC (under 30911130364) and French ANR 2009 (under ANR09BLAN037201). The authors are indebted to two anonymous referees for their useful suggestions and for having drawn the authors' attention to additional relevant references.
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Liu, Q., Luo, J., Wang, S. et al. An augmented Lagrangian multiscale dictionary learning algorithm. EURASIP J. Adv. Signal Process. 2011, 58 (2011). https://doi.org/10.1186/16876180201158
Keywords
 dictionary learning
 augmented Lagrangian, multiscale
 refinement
 image denoising.