Open Access

SSIM-inspired image restoration using sparse representation

  • Abdul Rehman1 (corresponding author),
  • Mohammad Rostami1,
  • Zhou Wang1,
  • Dominique Brunet2 and
  • Edward R Vrscay2
EURASIP Journal on Advances in Signal Processing 2012, 2012:16

Received: 6 June 2011

Accepted: 20 January 2012

Published: 20 January 2012


Recently, sparse representation based methods have proven successful in solving image restoration problems. The objective of these methods is to use a sparsity prior of the underlying signal in terms of some dictionary and achieve optimal performance in terms of mean-squared error, a metric that has been widely criticized in the literature due to its poor performance as a visual quality predictor. In this work, we make one of the first attempts to employ the structural similarity (SSIM) index, a more accurate perceptual image measure, by incorporating it into the framework of sparse signal representation and approximation. Specifically, the proposed optimization problem solves for coefficients with minimum $\ell_0$ norm and maximum SSIM index value. Furthermore, a gradient descent algorithm is developed to achieve an SSIM-optimal compromise in combining the input and sparse dictionary reconstructed images. We demonstrate the performance of the proposed method by using image denoising and super-resolution methods as examples. Our experimental results show that the proposed SSIM-based sparse representation algorithm achieves better SSIM performance and better visual quality than the corresponding least squares-based method.

1 Introduction

In many signal processing problems, mean squared error (MSE) has been the preferred choice as the optimization criterion due to its ease of use and popularity, irrespective of the nature of signals involved in the problem. The story is not different for image restoration tasks. Algorithms are developed and optimized to generate the output image that has minimum MSE with respect to the target image [1–6]. However, MSE is not the best choice when it comes to image quality assessment (IQA) and signal approximation tasks [7]. In order to achieve better visual performance, it is desirable to replace the optimization criterion with one that predicts visual quality more accurately. SSIM has been quite successful in achieving superior IQA performance [8]. Figure 1 demonstrates the difference between the performance of SSIM and absolute error (the basis for $\ell_p$, MSE, PSNR, etc.). Figure 1c shows the quality map of the image in Figure 1b with reference to Figure 1a, obtained by calculating the absolute pixel-by-pixel error, which forms the basis of MSE calculation for quality evaluation. Figure 1d shows the corresponding SSIM quality map, which is used to calculate the SSIM index of the whole image. It is quite evident from the maps that SSIM does a better job of predicting perceived image quality. Specifically, the absolute error map is uniform over space, but the texture regions in the noisy image appear to be much less noisy than the smooth regions. Clearly, the SSIM map is more consistent with such observations.
Figure 1

Comparison of SSIM and MSE for "Barbara" image altered with additive white Gaussian noise. (a) Original image; (b) noisy image; (c) absolute error map (brighter indicates better quality/smaller absolute difference); (d) SSIM index map (brighter indicates better quality/larger SSIM value).

The SSIM index and its extensions have found a wide variety of applications, ranging from image/video coding, e.g., H.264 video coding standard implementation [9], image classification [10], restoration and fusion [11], to watermarking, denoising and biometrics (see [7] for a complete list of references). In most existing works, however, SSIM has been used for quality evaluation and algorithm comparison purposes only. SSIM possesses a number of desirable mathematical properties, making it easier to employ in optimization tasks than other state-of-the-art perceptual IQA measures [12]. However, much less has been done on using SSIM as an optimization criterion in the design and optimization of image processing algorithms and systems [13–19].

Image restoration problems are of particular interest to image processing researchers, not only for their practical value, but also because they provide an excellent test bed for image modeling, representation and estimation theories. When addressing general image restoration problems with a Bayesian approach, an image prior model is required. Traditionally, the problem of determining suitable image priors has been based on a close observation of natural images. This leads to simplifying assumptions such as spatial smoothness, low/max-entropy or sparsity in some basis set. Recently, a new approach has been developed for learning the prior based on sparse representations. A dictionary is learned either from the corrupted image or from a high-quality set of images with the assumption that it can sparsely represent any natural image. Thus, this learned dictionary encapsulates the prior information about the set of natural images. Such methods have proven to be quite successful in performing image restoration tasks such as image denoising [3] and image super-resolution [5, 20]. More specifically, an image is divided into overlapping blocks using a sliding window, and each block is then sparsely coded using the dictionary. The dictionary, ideally, models the prior of natural images and is therefore free from all kinds of distortions. As a result, the reconstructed blocks, obtained as linear combinations of the dictionary atoms, are distortion free. Finally, the blocks are put back into their places and combined together in light of a global constraint for which a minimum MSE solution is reached. The accumulation of many blocks at each pixel location might affect the sharpness of the image. Therefore, the distorted image must be considered as well in order to reach the best compromise between sharpness and admissible distortions.

Since MSE is employed as the optimization criterion, the resulting output image might not have the best perceptual quality. This motivated us to replace the role of MSE with SSIM in the framework. The solution of this novel optimization problem is not trivial because SSIM is non-convex in nature. There are two key problems that have to be resolved before effective SSIM-based optimization can be performed. First, how to optimally decompose an image as a linear combination of basis functions in maximal SSIM, as opposed to minimal MSE sense. Second, how to estimate the best compromise between the distorted and sparse dictionary reconstructed images for maximal SSIM. In this article, we provide solutions to these problems and use image denoising and image super-resolution as applications to demonstrate the proposed framework for image restoration problems.

We formulate the problem in Section 2.1 and provide our solutions to issues discussed above in Sections 2.2 and 2.3. Section 3.1 describes our approach to denoise the images. The proposed method for image super-resolution is described in Section 3.2 and finally we conclude in Section 4.

2 The proposed method

In this section we incorporate SSIM as our quality measure, particularly for sparse representation. Contrary to what one might expect, we show that sparse representation in the minimal $\ell_2$-norm sense can be easily converted to the maximal SSIM sense. We also use a gradient descent approach to solve a global optimization problem in the maximal SSIM sense. Our framework can be applied to a wide class of problems dealing with sparse representation to improve visual quality.

2.1 Image restoration from sparsity

The classic formulation of the image restoration problem is as follows:
$$y = \Phi x + n, \tag{1}$$

where $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$, $n \in \mathbb{R}^m$, and $\Phi \in \mathbb{R}^{m \times n}$. Here we assume x and y are vectorized versions, by column stacking, of the original and distorted 2-D images, respectively. n is the noise term, which is mostly assumed to be zero mean, additive, and independent Gaussian. Generally m < n and thus the problem is ill-posed. To solve the problem, a prior on the original image must be asserted. The early approaches used least squares (LS) [21] and Tikhonov regularization [22] as priors. Later, the minimal total variation (TV) solution [23] and sparse priors [3] were used successfully on this problem. Our focus in the current work is to improve, in terms of visual quality, algorithms that assert a sparsity prior on the solution in terms of a dictionary domain.

Sparsity prior has been used successfully to solve different inverse problems in image processing [3, 5, 24, 25]. If our desired signal, x, is sparse enough then it has been shown that the solution to (1) is the one with maximum sparsity which is unique (within some ϵ-ball around x) [26, 27]. It can be easily found by solving a linear programming problem or by orthogonal matching pursuit (OMP). Not all natural signals are sparse but a wide range of natural signals can be represented sparsely in terms of a dictionary and this makes it possible to use sparsity prior on a wide range of inverse problems. One major problem is that the image signals are considered to be high dimensional data and thus, solving (1) directly is computationally expensive. To tackle this problem we assume local sparsity on image patches. Here, it is assumed that all the image patches have sparse representation in terms of a dictionary. This dictionary can be trained over some patches [28].

Central to the process of image restoration, using local sparse and redundant representations, is the solution to the following optimization problems [3, 5],
$$\hat{\alpha}_{ij} = \arg\min_{\alpha}\ \mu_{ij}\|\alpha\|_0 + \|\Psi\alpha - R_{ij}X\|_2^2, \tag{2}$$
$$\hat{X} = \arg\min_{X}\ \|X - W\|_2^2 + \lambda\|DHX - Y\|_2^2, \tag{3}$$
where Y is the observed distorted image, X is the unknown output restored image, $R_{ij}$ is a matrix that extracts the (ij) block from the image, $\Psi \in \mathbb{R}^{n \times k}$ is the dictionary with k > n, $\alpha_{ij}$ is the sparse vector of coefficients corresponding to the (ij) block of the image, $\hat{X}$ is the estimated image, λ is the regularization parameter, and W is the image obtained by averaging the blocks reconstructed from the sparse coefficient vectors $\hat{\alpha}_{ij}$ calculated by solving the optimization problem in (2). This is a local sparsity-based method that divides the whole image into blocks and represents each block sparsely using some trained dictionary. Among other advantages, one major advantage of such a method is the ease of training a small dictionary as compared to one large global dictionary. This is achieved with the help of (2), which is equivalent to (4). As to the coefficients $\mu_{ij}$, those must be location dependent, so as to comply with a set of constraints of the form $\|\Psi\alpha - R_{ij}X\|_2^2 \le T$. Solving this using orthogonal matching pursuit [29] is easy, gathering one atom at a time, and stopping when the error $\|\Psi\alpha - R_{ij}X\|_2^2$ goes below T. This way, the choice of $\mu_{ij}$ has been handled implicitly. Equation (3) applies a global constraint on the reconstructed image and uses the local patches and the noisy image as input in order to construct an output that complies with local sparsity and also lies within the proximity of the distorted image, which is defined by the amount and type of distortion.
$$\hat{\alpha}_{ij} = \arg\min_{\alpha}\ \|\alpha\|_0 \quad \text{subject to} \quad \|\Psi\alpha - R_{ij}X\|_2^2 \le T. \tag{4}$$
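
The error-constrained pursuit in (4) can be illustrated with a minimal sketch. It is not the implementation of [29]: the function name `omp_error_threshold`, the `max_atoms` cap and the assumption of unit-norm atoms are illustrative choices.

```python
import numpy as np

def omp_error_threshold(Psi, x, T, max_atoms=None):
    """Greedy sparse coding: add atoms until ||Psi @ alpha - x||_2^2 <= T.

    Psi is an n x k dictionary whose columns (atoms) are assumed unit-norm.
    """
    n, k = Psi.shape
    max_atoms = n if max_atoms is None else max_atoms
    alpha = np.zeros(k)
    support = []
    residual = np.asarray(x, dtype=float).copy()
    while residual @ residual > T and len(support) < max_atoms:
        # Pick the unused atom most correlated with the current residual.
        scores = np.abs(Psi.T @ residual)
        if support:
            scores[support] = -np.inf
        support.append(int(np.argmax(scores)))
        # Refit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(Psi[:, support], x, rcond=None)
        residual = x - Psi[:, support] @ coef
    if support:
        alpha[support] = coef
    return alpha
```

Because the loop stops as soon as the squared error falls below T, the number of atoms adapts to each patch, which is how the choice of the weights in (2) is handled implicitly.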

In (3), we have assumed that the distortion operator Φ in (1) may be represented by the product DH, where H is a blurring filter and D the downsampling operator. Here we have assumed each non-overlapping patch of the image can be represented sparsely in the domain of Ψ. Assuming this prior on each patch, (2) refers to the sparse coding of local image patches with bounded prior, hence building a local model from sparse representations. This enables us to restore individual patches by solving (2) for each patch. By doing so, we face the problem of blockiness at the patch boundaries when denoised non-overlapping patches are placed back in the image. To remove these artifacts from the denoised images, overlapping patches are extracted from the noisy image and combined together with the help of (3). The solution of (3) demands proximity between the noisy image, Y, and the output image, X, thus enforcing the global reconstruction constraint. The $\ell_2$-optimal solution suggests taking the average of the overlapping patches [3], thus eliminating the problem of blockiness in the denoised image.
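
As a small sketch of the patch handling described above, the following code extracts all overlapping patches with a sliding window and averages them back into an image, which is how W is formed. The function names and the stride of one pixel are assumptions for illustration; patches are flattened row by row here rather than by column stacking, which does not matter as long as both functions agree.

```python
import numpy as np

def extract_patches(img, p):
    """Return all overlapping p x p patches (stride 1) as columns of a matrix."""
    H, Wd = img.shape
    cols = [img[i:i + p, j:j + p].reshape(-1)
            for i in range(H - p + 1) for j in range(Wd - p + 1)]
    return np.stack(cols, axis=1)

def average_patches(patches, shape, p):
    """Put patches back at their locations and average the overlaps."""
    H, Wd = shape
    acc = np.zeros(shape)   # sum of patch values landing on each pixel
    cnt = np.zeros(shape)   # number of patches covering each pixel
    idx = 0
    for i in range(H - p + 1):
        for j in range(Wd - p + 1):
            acc[i:i + p, j:j + p] += patches[:, idx].reshape(p, p)
            cnt[i:i + p, j:j + p] += 1
            idx += 1
    return acc / cnt
```

In a restoration pipeline each column would be replaced by its sparse reconstruction before averaging; with unmodified patches the round trip returns the input image.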

As stated earlier, we propose a modified restoration method which incorporates SSIM into the procedure defined by (2) and (3). It is defined as follows,
$$\hat{\alpha}_{ij} = \arg\min_{\alpha}\ \mu_{ij}\|\alpha\|_0 + \left(1 - S(\Psi\alpha, R_{ij}X)\right), \tag{5}$$
$$\hat{X} = \arg\max_{X}\ S(W, X) + \lambda S(DHX, Y), \tag{6}$$
where S(·,·) defines the SSIM measure. The expression for the SSIM index is
$$S(a, y) = \frac{2\mu_a\mu_y + C_1}{\mu_a^2 + \mu_y^2 + C_1} \cdot \frac{2\sigma_{a,y} + C_2}{\sigma_a^2 + \sigma_y^2 + C_2}, \tag{7}$$

with $\mu_a$ and $\mu_y$ the means of a and y, respectively, $\sigma_a^2$ and $\sigma_y^2$ the sample variances of a and y, respectively, and $\sigma_{a,y}$ the covariance between a and y. The constants $C_1$ and $C_2$ are stabilizing constants and account for the saturation effect of the human visual system (HVS).
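
For concreteness, the SSIM index of two patches as written in (7) can be computed as in the sketch below. The default constants $C_1 = (0.01 \cdot 255)^2$ and $C_2 = (0.03 \cdot 255)^2$ are the common choice for 8-bit images and are an assumption here, as is the function name.

```python
import numpy as np

def ssim_patch(a, y, C1=(0.01 * 255) ** 2, C2=(0.03 * 255) ** 2):
    """SSIM index of two equally sized patches, following Eq. (7)."""
    a = np.asarray(a, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    mu_a, mu_y = a.mean(), y.mean()
    var_a = a.var(ddof=1)                                    # sample variance
    var_y = y.var(ddof=1)
    cov_ay = np.sum((a - mu_a) * (y - mu_y)) / (a.size - 1)  # sample covariance
    mean_term = (2 * mu_a * mu_y + C1) / (mu_a**2 + mu_y**2 + C1)
    var_term = (2 * cov_ay + C2) / (var_a + var_y + C2)
    return mean_term * var_term
```

The index equals 1 only when the two patches coincide and is symmetric in its arguments.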

Equation (5) aims to provide the best approximation of a local patch in the SSIM sense using the minimum possible number of atoms. The process is performed locally for each block in the image, and the blocks are then combined by simple averaging to construct W. Equation (6) applies a global constraint and outputs the image that is the best compromise between the noisy image, Y, and W in the SSIM sense. This step is vital because it has been observed that the image W lacks sharpness in the structures present in the image. Due to the masking effect of the HVS, the same level of noise does not distort different visual content equally. Therefore, the noisy image is used to borrow content from those of its regions that are not severely corrupted by noise. SSIM is well suited for such a task, as compared to MSE, because it accounts for the masking effect of the HVS and allows us to capture improved structural details with the help of the noisy image. Note the use of 1 - S(·,·) in (5). This is motivated by the fact that 1 - S(·,·) is a squared variance-normalized $\ell_2$ distance [30]. Solutions to the optimization problems in (5) and (6) are given in Sections 2.2 and 2.3, respectively.

2.2 SSIM-optimal local model from sparse representation

This section discusses the solution to the optimization problem in (5). Equation (2) can be solved approximately using OMP [29] by including one atom at a time and stopping when the error $\|\Psi\alpha_{ij} - R_{ij}X\|_2^2$ goes below $T_{\text{mse}} = (C\sigma)^2$, where C is the noise gain and σ is the standard deviation of the noise. We solve the optimization problem in (5) based on the same philosophy. We gather one atom at a time and stop when $S(\Psi\alpha, x_{ij})$ goes above $T_{\text{ssim}}$, a threshold defined in terms of SSIM. In order to obtain $T_{\text{ssim}}$, we need to consider the relationship between MSE and SSIM. For mean-reduced a and y, the expression of SSIM reduces to the following equation
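$$S(a, y) = \frac{2\sigma_{a,y} + C_2}{\sigma_a^2 + \sigma_y^2 + C_2}. \tag{8}$$
Subtracting both sides of (8) from 1 yields
$$\begin{aligned}
1 - S(a, y) &= 1 - \frac{2\sigma_{a,y} + C_2}{\sigma_a^2 + \sigma_y^2 + C_2} \\
&= \frac{\sigma_a^2 + \sigma_y^2 - 2\sigma_{a,y}}{\sigma_a^2 + \sigma_y^2 + C_2} \\
&= \frac{\|a - y\|_2^2}{\sigma_a^2 + \sigma_y^2 + C_2}.
\end{aligned} \tag{9–12}$$

Equation (12) can be rearranged to arrive at the following result
$$S(a, y) = 1 - \frac{\|a - y\|_2^2}{\sigma_a^2 + \sigma_y^2 + C_2}. \tag{13}$$
With the help of the equation above, we can calculate the value of $T_{\text{ssim}}$ as follows
$$T_{\text{ssim}} = 1 - \frac{T_{\text{mse}}}{\sigma_a^2 + \sigma_y^2 + C_2}, \tag{14}$$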

where $C_2$ is the constant originally used in the SSIM index expression [8] and $\sigma_a^2$ is calculated based on the current approximation of the block given by $a := \Psi\alpha$.
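
As a rough numerical illustration of (14), assume 8-bit images with $C_2 = (0.03 \times 255)^2 \approx 58.5$ (the usual SSIM default, an assumption here), a noise standard deviation of σ = 20 and a noise gain of C = 1.15, so that $T_{\text{mse}} = (1.15 \times 20)^2 = 529$. The patch variances below are hypothetical values chosen only to show the trend.

```latex
\begin{align*}
\text{textured patch } (\sigma_a^2 = \sigma_y^2 = 1000):\quad
T_{\text{ssim}} &= 1 - \frac{529}{1000 + 1000 + 58.5} \approx 0.743,\\
\text{smooth patch } (\sigma_a^2 = 0,\ \sigma_y^2 = 400):\quad
T_{\text{ssim}} &= 1 - \frac{529}{0 + 400 + 58.5} \approx -0.154.
\end{align*}
```

The same MSE budget therefore maps to very different SSIM thresholds depending on the variance of the patch and of its current approximation.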

It has already been shown that the main difference between SSIM and MSE is the divisive normalization [30, 31]. This normalization is conceptually consistent with the light adaptation (also called luminance masking) and contrast masking effect of HVS. It has been recognized as an efficient perceptually and statistically non-linear image representation model [32, 33]. It is shown to be a useful framework that accounts for the masking effect in human visual system, which refers to the reduction of the visibility of an image component in the presence of large neighboring components [34, 35]. It has also been found to be powerful in modeling the neuronal responses in the visual cortex [36, 37]. Divisive normalization has been successfully applied in IQA [38, 39], image coding [40], video coding [31] and image denoising [41].

Equation (14) suggests that the threshold is chosen adaptively for each patch. The set of coefficients $\alpha = (\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_k)$ should be calculated such that we get the best approximation a in terms of SSIM. We search for the stationary points of S by setting its partial derivatives with respect to α to zero. The solution to this problem for an orthogonal set of basis functions is discussed in [30]. Here we aim to solve the more general case of linearly independent atoms. The $\ell_2$-based optimal coefficients, $\{c_i\}_{i=1}^{k}$, can be calculated by solving the following system of equations
$$\sum_{j=1}^{k} c_j \langle \psi_i, \psi_j \rangle = \langle y, \psi_i \rangle, \quad 1 \le i \le k. \tag{15}$$

We denote the inner product of a signal with the constant signal (1/n, 1/n,..., 1/n) of length n by $\langle \psi \rangle := \langle \psi, \mathbf{1}/n \rangle$, where $\langle \cdot, \cdot \rangle$ represents the inner product.

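First, we write the mean, the variance and the covariance of a in terms of α with n the size of the current block:
$$\mu_a = \Big\langle \sum_{i=1}^{k} \alpha_i \psi_i \Big\rangle = \sum_{i=1}^{k} \alpha_i \langle \psi_i \rangle, \tag{16}$$
$$(n-1)\,\sigma_a^2 = \langle a, a \rangle - n \langle a \rangle^2 = \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_i \alpha_j \langle \psi_i, \psi_j \rangle - n \mu_a^2, \tag{17}$$
$$(n-1)\,\sigma_{a,y} = \langle a, y \rangle - n \langle a \rangle \langle y \rangle = \sum_{i=1}^{k} \alpha_i \langle y, \psi_i \rangle - n \mu_a \mu_y, \tag{18}$$
where $\langle \cdot \rangle$ represents the sample mean. The partial derivatives are given as follows
$$\frac{\partial \mu_a}{\partial \alpha_i} = \langle \psi_i \rangle, \tag{19}$$
$$(n-1)\,\frac{\partial \sigma_a^2}{\partial \alpha_i} = 2 \sum_{j=1}^{k} \alpha_j \langle \psi_i, \psi_j \rangle - 2 n \mu_a \langle \psi_i \rangle, \tag{20}$$
$$(n-1)\,\frac{\partial \sigma_{a,y}}{\partial \alpha_i} = \langle y, \psi_i \rangle - n \mu_y \langle \psi_i \rangle. \tag{21}$$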
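The structural similarity can be written as
$$\log S = \log\left(2\mu_a\mu_y + C_1\right) + \log\left(2\sigma_{a,y} + C_2\right) - \log\left(\sigma_a^2 + \sigma_y^2 + C_2\right) - \log\left(\mu_a^2 + \mu_y^2 + C_1\right). \tag{22}$$
From logarithmic differentiation of (7) combined with (19)-(21), we have
$$\frac{1}{S}\frac{\partial S}{\partial \alpha_i} = \frac{2\mu_y \langle \psi_i \rangle}{2\mu_a\mu_y + C_1} - \frac{2\mu_a \langle \psi_i \rangle}{\mu_a^2 + \mu_y^2 + C_1} + \frac{2\left[\langle y, \psi_i \rangle - n\mu_y \langle \psi_i \rangle\right]}{(n-1)\left[2\sigma_{a,y} + C_2\right]} - \frac{2\left[\sum_{j=1}^{k} \alpha_j \langle \psi_i, \psi_j \rangle - n\mu_a \langle \psi_i \rangle\right]}{(n-1)\left[\sigma_a^2 + \sigma_y^2 + C_2\right]}. \tag{23}$$
After subtracting the corresponding DC values from all the blocks in the image, we are interested only in the particular case where the atoms are made of oscillatory functions, i.e., when $\langle \psi_i \rangle = 0$ for $1 \le i \le k$, thus reducing (23) to
$$\frac{1}{S}\frac{\partial S}{\partial \alpha_i} = \frac{2\langle y, \psi_i \rangle}{(n-1)\left[2\sigma_{a,y} + C_2\right]} - \frac{2\sum_{j=1}^{k} \alpha_j \langle \psi_i, \psi_j \rangle}{(n-1)\left[\sigma_a^2 + \sigma_y^2 + C_2\right]}. \tag{24}$$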
We equate (24) to zero in order to find the stationary points. The result is the following linear system of equations
$$\sum_{j=1}^{k} \alpha_j \langle \psi_i, \psi_j \rangle = \beta \langle y, \psi_i \rangle, \quad 1 \le i \le k, \tag{25}$$
$$\beta = \frac{\sigma_a^2 + \sigma_y^2 + C_2}{2\sigma_{a,y} + C_2}, \tag{26}$$
where β is an unknown constant dependent on the statistics of the unknown image block a. Comparing α with the optimal coefficients in the $\ell_2$ sense, denoted by c and given by (15), results in the following solution:
$$\alpha_i = \beta c_i, \quad 1 \le i \le k, \tag{27}$$
which implies that the optimal SSIM-based solution is just a scaling of the optimal $\ell_2$-based solution. The last step is to find β. It is important to note that the value of β varies over the image and is therefore content dependent. Also, the scaling factor, β, may lead to the selection of a different set of atoms from the dictionary, as compared to the $\ell_2$ case where β = 1, which are better suited to providing a closer and sparser approximation of the patch in the SSIM sense. Substituting (27) into the expression (26) for β via (16), (17) and (18) and then isolating β gives the following quadratic equation
$$\beta^2 (B - A) + \beta C_2 - \sigma_y^2 - C_2 = 0, \tag{28}$$
$$A = \frac{1}{n-1} \sum_{i=1}^{k} \sum_{j=1}^{k} c_i c_j \langle \psi_i, \psi_j \rangle, \tag{29}$$
$$B = \frac{2}{n-1} \sum_{j=1}^{k} c_j \langle y, \psi_j \rangle. \tag{30}$$
Solving for β and picking the positive value for maximal SSIM gives us
$$\beta = \frac{-C_2 + \sqrt{C_2^2 + 4 (B - A)\left(\sigma_y^2 + C_2\right)}}{2 (B - A)}. \tag{31}$$
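
The scaling step in (27)-(31) can be sketched as follows, assuming zero-mean, linearly independent atoms and a zero-mean patch as in the derivation. The function name, the default $C_2 = (0.03 \cdot 255)^2$ and the guard for a degenerate approximation are illustrative assumptions.

```python
import numpy as np

def ssim_optimal_scale(Psi_s, y, C2=(0.03 * 255) ** 2):
    """Return (c, beta): l2-optimal coefficients and the SSIM scaling of Eq. (31).

    Psi_s: n x s matrix of selected, zero-mean, linearly independent atoms.
    y:     zero-mean patch of length n.
    """
    n = y.size
    G = Psi_s.T @ Psi_s              # Gram matrix <psi_i, psi_j>
    b = Psi_s.T @ y                  # correlations <y, psi_i>
    c = np.linalg.solve(G, b)        # l2-optimal coefficients, Eq. (15)
    A = (c @ G @ c) / (n - 1)        # Eq. (29)
    B = 2.0 * (c @ b) / (n - 1)      # Eq. (30)
    var_y = (y @ y) / (n - 1)        # sample variance of the zero-mean patch
    d = B - A
    if d <= 1e-12:                   # approximation carries (almost) no energy
        return c, 1.0
    beta = (-C2 + np.sqrt(C2**2 + 4.0 * d * (var_y + C2))) / (2.0 * d)
    return c, beta
```

When c solves (15) exactly, $\sum_{i,j} c_i c_j \langle \psi_i, \psi_j \rangle = \sum_j c_j \langle y, \psi_j \rangle$, so B = 2A and B - A equals the variance of the $\ell_2$ approximation. Since that variance cannot exceed $\sigma_y^2$, the positive root satisfies β ≥ 1 under these assumptions.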

Now we have all the tools required for an OMP algorithm that performs the sparse coding stage in the optimal SSIM sense. The modified OMP algorithm is explained in Algorithm 1. There are two main differences between the OMP algorithm [29] and the one proposed in this work. First, the stopping criterion is based on SSIM. Unlike MSE, SSIM is adaptive according to the reference image. In particular, if the distortion is consistent with the underlying reference, e.g., contrast enhancement, the distortion is non-structural and is much less objectionable than structural distortions. Defining the stopping criterion according to SSIM essentially means that we are modifying the set of accepted points (image patches) around the noisy image patch which can be represented as a linear combination of dictionary atoms. This way, in the space of image patches, we are omitting image patches in the direction of structural distortion and including the ones which are in the same direction as the original image patch in the set of acceptable image patches. Therefore, we can expect to see more structures in the image constructed using sparsity as a prior. Second, we calculate the SSIM-optimal coefficients from the optimal coefficients in the $\ell_2$ sense using the derivation in Section 2.2; they are a scalar multiple of the optimal $\ell_2$-based coefficients.
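
The two differences described above can be put together in a minimal sketch of an SSIM-driven pursuit. This is not a transcription of Algorithm 1. It assumes zero-mean unit-norm atoms, a zero-mean patch, a per-pixel threshold `T_mse`, the default $C_2 = (0.03 \cdot 255)^2$, and the standard correlation rule for atom selection; all names are illustrative.

```python
import numpy as np

def ssim_zero_mean(a, y, C2):
    """SSIM of two zero-mean vectors (mean term dropped, as in Eq. (8))."""
    n = y.size
    return (2 * (a @ y) / (n - 1) + C2) / ((a @ a + y @ y) / (n - 1) + C2)

def ssim_omp(Psi, y, T_mse, C2=(0.03 * 255) ** 2, max_atoms=None):
    """Add atoms greedily until S(Psi @ alpha, y) reaches T_ssim of Eq. (14)."""
    n, k = Psi.shape
    max_atoms = n if max_atoms is None else max_atoms
    var_y = (y @ y) / (n - 1)
    alpha = np.zeros(k)
    support = []
    residual = np.asarray(y, dtype=float).copy()
    while len(support) < max_atoms:
        # Standard OMP selection on the l2 residual (an assumption here).
        scores = np.abs(Psi.T @ residual)
        if support:
            scores[support] = -np.inf
        support.append(int(np.argmax(scores)))
        D = Psi[:, support]
        c, *_ = np.linalg.lstsq(D, y, rcond=None)   # l2-optimal coefficients
        a_l2 = D @ c
        residual = y - a_l2
        # For least-squares coefficients, B - A in Eq. (28) equals var(a_l2).
        d = (a_l2 @ a_l2) / (n - 1)
        beta = 1.0 if d <= 1e-12 else (
            (-C2 + np.sqrt(C2**2 + 4 * d * (var_y + C2))) / (2 * d))
        a = beta * a_l2                              # SSIM-optimal rescaling
        T_ssim = 1.0 - T_mse / ((a @ a) / (n - 1) + var_y + C2)
        alpha[:] = 0.0
        alpha[support] = beta * c
        if ssim_zero_mean(a, y, C2) >= T_ssim:       # SSIM stopping rule
            break
    return alpha
```

The threshold is recomputed at every iteration because it depends on the variance of the current approximation.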

2.3 SSIM-based global reconstruction

The solution to the optimization problem defined in Equation (6) is the image that is the best compromise between the distorted image and the one obtained using sparse representation in the maximal SSIM sense. With the assumption of a known dictionary, the only other thing the optimization problem in (6) requires is the coefficients $\alpha_{ij}$, which can be obtained by solving the optimization problem in (5). SSIM is a local quality measure: when it is applied using a sliding window, it provides a quality map that reflects the variation of local quality over the whole image. The global SSIM is computed by pooling (averaging) the local SSIM map. The global SSIM for an image, Y, with respect to the reference image, X, is given by the following equation
$$S(X, Y) = \frac{1}{N_l} \sum_{ij} S(x_{ij}, y_{ij}), \tag{32}$$
where $x_{ij} = R_{ij}X$ and $y_{ij} = R_{ij}Y$, and $R_{ij}$ is an $N_w \times N$ matrix that extracts the (ij) block from the image. The expression for the local SSIM, $S(x_{ij}, y_{ij})$, is given by (7). $N_l$ is the total number of local windows and can be calculated as
$$N_l = \frac{1}{N_w} \operatorname{tr}\Big(\sum_{ij} R_{ij}^T R_{ij}\Big), \tag{33}$$

where tr(·) denotes the trace of a matrix.
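
The pooling in (32) and the window count in (33) can be checked with a short sketch. The window size of 8, the uniform (unweighted) windows, the default SSIM constants and the function names are assumptions for illustration.

```python
import numpy as np

def local_ssim(x, y, C1=(0.01 * 255) ** 2, C2=(0.03 * 255) ** 2):
    """SSIM of two flattened windows, Eq. (7), with sample (n - 1) statistics."""
    n = x.size
    mx, my = x.mean(), y.mean()
    vx = np.sum((x - mx) ** 2) / (n - 1)
    vy = np.sum((y - my) ** 2) / (n - 1)
    cxy = np.sum((x - mx) * (y - my)) / (n - 1)
    return ((2 * mx * my + C1) * (2 * cxy + C2)) / (
        (mx**2 + my**2 + C1) * (vx + vy + C2))

def global_ssim(X, Y, w=8):
    """Average the local SSIM map over all w x w sliding windows, Eq. (32)."""
    H, Wd = X.shape
    ssim_map = np.array([
        [local_ssim(X[i:i + w, j:j + w].ravel(), Y[i:i + w, j:j + w].ravel())
         for j in range(Wd - w + 1)]
        for i in range(H - w + 1)])
    return ssim_map.mean(), ssim_map

def window_count(shape, w=8):
    """N_l from Eq. (33): trace of sum_ij R_ij^T R_ij divided by N_w."""
    H, Wd = shape
    coverage = np.zeros(shape)   # diagonal of sum_ij R_ij^T R_ij, per pixel
    for i in range(H - w + 1):
        for j in range(Wd - w + 1):
            coverage[i:i + w, j:j + w] += 1
    return int(round(coverage.sum() / (w * w)))
```

Each window contributes $N_w$ ones to the diagonal of $\sum_{ij} R_{ij}^T R_{ij}$, which is why the trace divided by $N_w$ recovers the number of windows.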

We use a gradient-descent approach to solve the optimization problem given by (6). The update equation is given by
$$\begin{aligned}
\hat{X}_{k+1} &= \hat{X}_k + \lambda \nabla_Y S(X, Y) \\
&= \hat{X}_k + \lambda \frac{1}{N_l} \nabla_Y \sum_{ij} S(x_{ij}, y_{ij}) \\
&= \hat{X}_k + \lambda \frac{1}{N_l} \sum_{ij} R_{ij}^T \nabla_y S(x_{ij}, y_{ij}),
\end{aligned} \tag{34}$$
$$\nabla_y S(x, y) = \frac{2}{N_w B_1^2 B_2^2} \left[ A_1 B_1 (B_2 x - A_2 y) + B_1 B_2 (A_2 - A_1) \mu_x \mathbf{1} + A_1 A_2 (B_1 - B_2) \mu_y \mathbf{1} \right], \tag{35}$$
$$A_1 = 2\mu_x\mu_y + C_1, \quad A_2 = 2\sigma_{xy} + C_2, \quad B_1 = \mu_x^2 + \mu_y^2 + C_1, \quad B_2 = \sigma_x^2 + \sigma_y^2 + C_2,$$

where $N_w$ is the number of pixels in the local image patch, and $\mu_x$, $\sigma_x^2$ and $\sigma_{xy}$ represent the sample mean of x, the sample variance of x, and the sample covariance of x and y, respectively. Equation (34) suggests that the gradients of the local patches are averaged to obtain the global SSIM gradient, and thus the direction and distance of the k-th update in $\hat{X}$. More details regarding the computation of the SSIM gradient can be found in [42]. In our experiments, we found that this gradient-based approach is well behaved and takes only a few iterations for $\hat{X}$ to converge to a stationary point. We initialize $\hat{X}$ as the best MSE solution and then, using the SSIM gradient, follow the iterative procedure above to solve (6).
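
The local gradient (35) and one update of the form (34) can be sketched as follows for a single SSIM term with a fixed reference image; the two-term objective in (6) is not combined here. The sketch assumes statistics normalized by $N_w$ (for which (35) is the exact gradient), uniform windows, the default SSIM constants, and hypothetical function names.

```python
import numpy as np

C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2

def ssim_and_grad(x, y):
    """Local SSIM S(x, y) and its gradient with respect to y, Eq. (35)."""
    Nw = x.size
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()            # 1/N_w-normalized variances
    cov = np.mean((x - mu_x) * (y - mu_y))
    A1 = 2 * mu_x * mu_y + C1
    A2 = 2 * cov + C2
    B1 = mu_x**2 + mu_y**2 + C1
    B2 = var_x + var_y + C2
    S = (A1 * A2) / (B1 * B2)
    # The two scalar terms broadcast over the patch (multiplication by ones).
    grad = (2.0 / (Nw * B1**2 * B2**2)) * (
        A1 * B1 * (B2 * x - A2 * y)
        + B1 * B2 * (A2 - A1) * mu_x
        + A1 * A2 * (B1 - B2) * mu_y)
    return S, grad

def ssim_ascent_step(X_ref, Y, w=8, lam=1.0):
    """One update Y + lam * (1/N_l) * sum_ij R_ij^T grad_ij, as in Eq. (34)."""
    H, Wd = Y.shape
    G = np.zeros((H, Wd))
    n_windows = 0
    for i in range(H - w + 1):
        for j in range(Wd - w + 1):
            _, g = ssim_and_grad(X_ref[i:i + w, j:j + w].ravel(),
                                 Y[i:i + w, j:j + w].ravel())
            G[i:i + w, j:j + w] += g.reshape(w, w)   # R_ij^T scatters g back
            n_windows += 1
    return Y + lam * G / n_windows
```

A finite-difference comparison is a convenient way to confirm that the closed-form gradient matches the SSIM implementation in use.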

3 Applications

The framework we proposed provides a general approach that can be used for different applications. To show the effectiveness of our method we will provide two applications: image denoising and super-resolution.

3.1 Image denoising

We use the SSIM-based sparse representations framework developed in Sections 2.2 and 2.3 to perform the task of image denoising. The noise-contaminated image is obtained using the following equation
$$Y = X + N, \tag{36}$$

where Y is the observed distorted image, X is the noise-free image and N is additive Gaussian noise. Our goal is to remove the noise from the distorted image. Here we train a dictionary, Ψ, in whose domain the original image can be represented sparsely. We use the K-SVD method [28] to train the dictionary. In this method, the dictionary is trained directly over the noisy image, and denoising is done in parallel. We initialize the dictionary with a discrete cosine transform (DCT) dictionary and run a fixed number of iterations, J. In each step we update the image and then the dictionary. First, based on the current dictionary, sparse coding is done for each patch, and then K-SVD is used to update the dictionary (the interested reader can refer to [28] for details of dictionary updating). Finally, after repeating this procedure J times, we execute a global reconstruction stage, following the gradient descent procedure. The proposed image denoising algorithm is summarized in Algorithm 2.

The proposed image denoising scheme is tested on various images with different amounts of noise. In all the experiments, the dictionary used was of size 64 × 256, designed to handle patches of 8 × 8 pixels. The value of the noise gain, C, is selected to be 1.15 and λ = 30 [3]. Table 1 shows the results for the images Barbara, Lena, Peppers and House. It also compares the K-SVD method [3] with the proposed denoising method. It can be observed that the proposed denoising method achieves better performance in terms of SSIM, which is expected to imply better perceptual quality of the denoised image. Figures 2 and 3 show the denoised images using K-SVD [3] and the proposed method along with the corresponding SSIM maps. It can be observed that the SSIM-based method outperforms K-SVD especially in the texture regions, which confirms that the proposed denoising scheme preserves the structures better and therefore has better perceptual image quality.
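
A 64 × 256 overcomplete DCT dictionary of the kind used for initialization can be built as in the sketch below. The separable construction (a Kronecker product of a 1-D overcomplete DCT with itself) is a commonly used recipe and is an assumption here, since the text does not spell out the construction; the function name is illustrative.

```python
import numpy as np

def overcomplete_dct(patch=8, atoms_1d=16):
    """Return a (patch^2) x (atoms_1d^2) overcomplete DCT dictionary."""
    D1 = np.zeros((patch, atoms_1d))
    t = np.arange(patch)
    for k in range(atoms_1d):
        v = np.cos(t * k * np.pi / atoms_1d)
        if k > 0:
            v = v - v.mean()             # non-DC atoms are made zero-mean
        D1[:, k] = v / np.linalg.norm(v)
    # Separable 2-D atoms: every pair of 1-D atoms gives one 2-D atom.
    return np.kron(D1, D1)
```

With this construction every atom except the first (DC) one has zero mean, which matches the zero-mean atom assumption used in Section 2.2.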
Table 1

SSIM and PSNR comparisons of image denoising results (column key: noise std; row groups: PSNR comparison in dB, SSIM comparison)

Figure 2

Visual comparison of denoising results. (a) Original image; (b) noisy image; (c) SSIM-map of noisy image; (d) KSVD-MSE; (e) SSIM-map of KSVD-MSE; (f) KSVD-SSIM; (g) SSIM-map of KSVD-SSIM.

Figure 3

Visual comparison of denoising results. (a) Original image; (b) noisy image; (c) SSIM-map of noisy image; (d) KSVD-MSE; (e) SSIM-map of KSVD-MSE; (f) KSVD-SSIM; (g) SSIM-map of KSVD-SSIM.

3.2 Image super-resolution

In this section we demonstrate the performance of the SSIM-based sparse representations when used for image super-resolution. In this problem, a low resolution image, Y, is given and a high resolution version of the image, X, is required as output. We assume that the low resolution image is produced from high resolution image based on the following equation:
$$Y = DHX, \tag{37}$$
where H represents a blurring matrix, and D is a downsampling matrix. We use a local sparsity model as a prior to regularize this problem, which has infinitely many solutions that satisfy (37). Our approach is motivated by recent results in sparse signal representation, which suggest that the linear relationships among high-resolution signals can be accurately recovered from their low-dimensional projections. Here, we work with two coupled dictionaries, $\Psi_h$ for high-resolution patches, and $\Psi_l$ for low-resolution ones. The sparse representation of a low-resolution patch in terms of $\Psi_l$ will be directly used to recover the corresponding high-resolution patch from $\Psi_h$ [20]. Given these two dictionaries, each corresponding patch of the low-resolution image, y, and the high-resolution image, x, can be represented sparsely with the same coefficient vector, α, in Algorithm 2:
$$y = \Psi_l \alpha, \tag{38}$$
$$x = \Psi_h \alpha. \tag{39}$$

The patch at each location of the low-resolution image that needs to be scaled up is extracted and sparsely coded using the SSIM-optimal Algorithm 1. Once the sparse coefficients, α, are obtained, the high-resolution patches, x, are computed using (39) and finally merged by averaging in the overlap areas to create the resulting image. The proposed image super-resolution algorithm is summarized in Algorithm 3.
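
The coupled-dictionary step in (38) and (39) can be sketched for a single patch as follows. For brevity the sparse coder below is a plain OMP with a fixed number of atoms rather than the SSIM-optimal pursuit; the function names and the random dictionaries in the usage example are purely illustrative.

```python
import numpy as np

def omp_fixed(Psi, y, n_atoms=3):
    """Plain OMP that selects a fixed number of atoms."""
    support = []
    coef = np.zeros(0)
    residual = np.asarray(y, dtype=float).copy()
    for _ in range(n_atoms):
        scores = np.abs(Psi.T @ residual)
        if support:
            scores[support] = -np.inf
        support.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(Psi[:, support], y, rcond=None)
        residual = y - Psi[:, support] @ coef
    alpha = np.zeros(Psi.shape[1])
    alpha[support] = coef
    return alpha

def upscale_patch(y_low, Psi_l, Psi_h, n_atoms=3):
    """Code the low-resolution patch on Psi_l (38), then synthesize with Psi_h (39)."""
    alpha = omp_fixed(Psi_l, y_low, n_atoms)
    return Psi_h @ alpha, alpha

# Usage with random stand-in dictionaries of the sizes quoted in the text.
rng = np.random.default_rng(0)
Psi_l = rng.normal(size=(25, 1024))
Psi_l /= np.linalg.norm(Psi_l, axis=0)
Psi_h = rng.normal(size=(100, 1024))
x_high, alpha = upscale_patch(rng.normal(size=25), Psi_l, Psi_h)
```

The high-resolution patches produced this way would then be merged by averaging in the overlap areas.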

The proposed image super-resolution scheme is tested on various images. To be consistent with [20], patches of 5 × 5 pixels were used on the low-resolution image. Each patch is converted to a vector of length 25. The dictionaries are trained using KSVD [3] with sizes of 25 × 1024 and 100 × 1024 for the low- and the high-resolution dictionaries, respectively. 66 natural images are used for dictionary training, which are also used in [43] for a similar purpose. To remove artifacts on the patch edges, we set an overlap of one pixel during patch extraction from the image. A fixed number of atoms (3) has been used by [20] in the sparse coding stage. However, SSIM-OMP determines the number of atoms adaptively from patch to patch based on its importance according to the SSIM measure. In order to calculate the threshold, $T_{\text{ssim}}$, defined in (14), $T_{\text{mse}}$ is calculated using the MSE-based sparse coding stage in [20]. After calculating the sparse representation for all the low-resolution patches, we use them to reconstruct the patches, and then the difference with the original patch is calculated. We set $T_{\text{mse}}$ to the average of these differences. The performance comparison with state-of-the-art methods is given in Table 2. It can be observed that the proposed algorithm outperforms the other methods consistently in terms of SSIM evaluations. It is also interesting to observe PSNR improvements in some cases, though PSNR is not the optimization goal of the proposed approach. The improvements are not always consistent (for example, PSNR drops in some cases in Table 1, while SSIM always improves). There are complicated reasons behind these results. It should be noted that the so-called "MSE-optimal" algorithms include many suboptimal and heuristic steps and thus have the potential to be improved even in the MSE sense. Our methods differ from the "MSE-optimal" methods in multiple stages. Although the differences are made to improve SSIM, they may have a positive impact on MSE as well.
For example, when using the learned dictionary to reconstruct an image patch, if SSIM is used to replace MSE in selecting the atoms in the dictionary, then essentially the set of accepted atoms in the dictionary has been changed. In particular, since SSIM is variance normalized, the acceptable reconstructed patches near the noisy patch may be structurally similar but significantly different in variance. This may lead to different selections of the atoms in the dictionary, which, when appropriately scaled to approximate the noisy patch, may result in a better reconstruction. Although the visual and SSIM improvements are only moderate, these are promising results as an initial attempt at incorporating a perceptually more meaningful measure into the optimization problem of the KSVD-based super-resolution method. Figures 4 and 5 compare the reconstructed images obtained using [5] and the proposed method for the Raccoon and the Girl images, respectively. It can be seen that the proposed scheme preserves many local structures better and therefore has better perceptual image quality. The visual quality improvement is also reflected in the corresponding SSIM maps, which provide useful guidance on how local image quality is improved over space. It can be observed from the SSIM maps that the areas which are relatively more structured benefit more from the proposed algorithm, as the quality measure used is better at calculating the similarity of structures as compared to MSE.
Table 2

SSIM and PSNR comparisons of image super-resolution results (row groups: PSNR comparison in dB, SSIM comparison; surviving row labels: Yang et al., Zeyde et al.)

Figure 4

Visual comparison of super-resolution results. (a) Original image; (b) low resolution image; (c) Yang's method; (d) SSIM-map of Yang's method; (e) proposed method; (f) SSIM-map of proposed method.

Figure 5

Visual comparison of super-resolution results. (a) Original image; (b) low resolution image; (c) Yang's method; (d) SSIM-map of Yang's method; (e) proposed method; (f) SSIM-map of proposed method.

4 Conclusions

In this article, we attempted to combine perceptual image fidelity measurement with optimal sparse signal representation in the context of image denoising and image super-resolution, with the aim of improving two state-of-the-art algorithms in these areas. We proposed an algorithm to solve for the optimal coefficients of a sparse and redundant dictionary in the maximal SSIM sense. We also developed a gradient descent approach to achieve the best compromise between the distorted image and the image reconstructed using sparse representation. Our simulations demonstrate promising results and indicate the potential of SSIM to replace the ubiquitous PSNR/MSE as the optimization criterion in image processing applications. It should be kept in mind that this is only an early attempt along a new but promising direction, and that the main contribution of the current work lies mostly in the general framework and theoretical development. Significant improvement in visual quality can be expected from an SSIM-based dictionary learning process, since the dictionary encapsulates the prior knowledge about the image to be restored. An SSIM-optimal dictionary should capture the structures contained in the image more effectively, so that the restoration task results in a sharper output image. Further improvement is also expected in the future when some of the advanced mathematical properties of SSIM and normalized metrics [12] are incorporated into the optimization framework.

Algorithm 1: SSIM-inspired OMP

Initialize: D = {} (set of selected atoms), Sopt = 0, r = Y

while Sopt < Tssim

  • Add the next best atom in the ℓ2 sense to D

  • Find the optimal ℓ2-based coefficient(s) using (15)

  • Find the optimal SSIM-based coefficient(s) using (27) and (31)

  • Update the residual r

  • Find the SSIM-based approximation a

  • Calculate Sopt = S(a, y)
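The loop structure of Algorithm 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the article: the function names `ssim_patch` and `ssim_stopped_omp` are hypothetical, SSIM is computed from the global statistics of a single patch with the usual constants of [8] for 8-bit data, and the SSIM-based coefficient adjustment of (27) and (31) is omitted because those equations are not reproduced here. Only the greedy ℓ2 atom selection, the least-squares coefficient fit, and the SSIM-based stopping rule are shown.

```python
import numpy as np


def ssim_patch(a, y, dynamic_range=255.0):
    """SSIM index of two patches computed from their global statistics."""
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_a, mu_y = a.mean(), y.mean()
    cov = np.mean((a - mu_a) * (y - mu_y))
    luminance = (2 * mu_a * mu_y + c1) / (mu_a ** 2 + mu_y ** 2 + c1)
    structure = (2 * cov + c2) / (a.var() + y.var() + c2)
    return float(luminance * structure)


def ssim_stopped_omp(dictionary, y, t_ssim, max_atoms=None):
    """Greedy atom selection that stops once SSIM(approximation, y) >= t_ssim.

    Simplification: coefficients are the plain least-squares ones; the
    SSIM-optimal rescaling step of the algorithm is NOT implemented.
    """
    y = np.asarray(y, dtype=float)
    n_atoms = dictionary.shape[1]
    max_atoms = n_atoms if max_atoms is None else max_atoms
    selected, coeffs = [], np.zeros(0)
    residual, s_opt = y.copy(), 0.0
    while s_opt < t_ssim and len(selected) < max_atoms:
        # Next best atom in the l2 sense: largest correlation with the residual.
        correlations = np.abs(dictionary.T @ residual)
        if selected:
            correlations[selected] = 0.0
        best = int(np.argmax(correlations))
        if correlations[best] <= 1e-12:
            break  # no remaining atom can reduce the residual
        selected.append(best)
        sub = dictionary[:, selected]
        # Least-squares coefficients on the atoms selected so far.
        coeffs, *_ = np.linalg.lstsq(sub, y, rcond=None)
        approximation = sub @ coeffs
        residual = y - approximation
        s_opt = ssim_patch(approximation, y)
    alpha = np.zeros(n_atoms)
    if selected:
        alpha[selected] = coeffs
    return alpha, s_opt
```

Because the stopping rule depends on SSIM rather than on the residual energy, the number of selected atoms varies with the patch content and with the threshold: a stricter threshold generally requires more atoms for the same patch.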


Algorithm 2: SSIM-inspired image denoising

  1. Initialize: X = Y, Ψ = overcomplete DCT dictionary

  2. Repeat J times
    • Sparse coding stage: use SSIM-optimal OMP to compute the representation vectors αij for each patch

    • Dictionary update stage: use K-SVD [28] to calculate the updated dictionary and coefficients. Calculate SSIM-optimal coefficients using (27) and (31)

  3. Global reconstruction: use the gradient descent algorithm to optimize (6), where the SSIM gradient is given by (35).


Algorithm 3: SSIM-inspired image super resolution

  1. Dictionary training phase: train low and high resolution dictionaries Ψl, Ψh [20]

  2. Reconstruction phase
    • Sparse coding stage: use SSIM-optimal OMP to compute the representation vectors αij for all the patches of the low resolution image

    • High resolution patch reconstruction: reconstruct the high resolution patches by Ψh αij

  3. Global reconstruction: merge the high resolution patches by averaging over the overlapped region to create the high resolution image.
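The global reconstruction step of Algorithm 3 (averaging over overlapped regions) can be sketched as below. The function name `merge_patches` and its parameters are hypothetical, and the patches are assumed to be stored as row-major flattened columns visited in raster order. This is an illustration of plain overlap averaging, not the authors' code.

```python
import numpy as np


def merge_patches(patches, image_shape, size, step):
    """Rebuild an image from overlapping patches by averaging the overlaps.

    `patches` holds one flattened size x size patch per column, ordered by
    scanning the patch positions row by row with stride `step`.
    """
    accumulated = np.zeros(image_shape, dtype=float)
    counts = np.zeros(image_shape, dtype=float)
    index = 0
    for r in range(0, image_shape[0] - size + 1, step):
        for c in range(0, image_shape[1] - size + 1, step):
            accumulated[r:r + size, c:c + size] += patches[:, index].reshape(size, size)
            counts[r:r + size, c:c + size] += 1.0
            index += 1
    # Pixels not covered by any patch keep the value zero.
    counts[counts == 0] = 1.0
    return accumulated / counts
```

Each pixel value is the mean of all patch estimates that cover it, so pixels in the overlapped regions blend the neighbouring reconstructions, which is what suppresses seams at patch borders.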



This work was supported in part by the Natural Sciences and Engineering Research Council of Canada and in part by the Ontario Early Researcher Award program, which are gratefully acknowledged.

Authors’ Affiliations

(1) Department of Electrical and Computer Engineering, University of Waterloo
(2) Department of Applied Mathematics, University of Waterloo


  1. Dabov K, Foi A, Katkovnik V, Egiazarian K: Image denoising by sparse 3D transform-domain collaborative filtering. IEEE Trans Image Process 2007, 16: 2080-2095.
  2. Buades A, Coll B, Morel JM: A review of image denoising algorithms, with a new one. Multiscale Model Simul 2005, 4(2):490-530. 10.1137/040616024
  3. Elad M, Aharon M: Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans Image Process 2006, 15(12):3736-3745.
  4. Hou H, Andrews H: Cubic splines for image interpolation and digital filtering. IEEE Trans Signal Process 1978, 26: 508-517. 10.1109/TASSP.1978.1163154
  5. Yang J, Wright J, Huang T, Ma Y: Image super-resolution via sparse representation. IEEE Trans Image Process 2010, 19(11):2861-2873.
  6. Yang J, Wright J, Huang TS, Ma Y: Image super-resolution as sparse representation of raw image patches. Proc IEEE Comput Vis Pattern Recognit 2008, 1-8.
  7. Wang Z, Bovik AC: Mean squared error: love it or leave it? A new look at signal fidelity measures. IEEE Signal Process Mag 2009, 26: 98-117.
  8. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP: Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 2004, 13(4):600-612. 10.1109/TIP.2003.819861
  9. Joint Video Team (JVT) Reference Software [Online]
  10. Gao Y, Rehman A, Wang Z: CW-SSIM based image classification. In IEEE International Conference on Image Processing (ICIP). Brussels, Belgium; 2011:1249-1252.
  11. Piella G, Heijmans H: A new quality metric for image fusion. In IEEE International Conference on Image Processing (ICIP). Volume 3. Barcelona, Spain; 2003:173-176.
  12. Brunet D, Vrscay ER, Wang Z: On the Mathematical Properties of the Structural Similarity Index (Preprint). University of Waterloo, Waterloo; 2011.
  13. Channappayya SS, Bovik AC, Caramanis C, Heath R: Design of linear equalizers optimized for the structural similarity index. IEEE Trans Image Process 2008, 17(6):857-872.
  14. Wang Z, Li Q, Shang X: Perceptual image coding based on a maximum of minimal structural similarity criterion. IEEE Int Conf Image Process 2007, 2: II-121-II-124.
  15. Rehman A, Wang Z: SSIM-based non-local means image denoising. In IEEE International Conference on Image Processing (ICIP). Brussels, Belgium; 2011:1-4.
  16. Wang S, Rehman A, Wang Z, Ma S, Gao W: Rate-SSIM optimization for video coding. In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP 11). Prague, Czech Republic; 2011:833-836.
  17. Ou T, Huang Y, Chen H: A perceptual-based approach to bit allocation for H.264 encoder. SPIE Visual Communications and Image Processing 2010, 77441B.
  18. Mai Z, Yang C, Kuang K, Po L: A novel motion estimation method based on structural similarity for H.264 inter prediction. In IEEE Int Conf Acoust Speech Signal Process. Volume 2. Toulouse; 2006:913-916.
  19. Yang C, Wang H, Po L: Improved inter prediction based on structural similarity in H.264. In IEEE Int Conf Signal Process Commun. Volume 2. Dubai; 2007:340-343.
  20. Zeyde R, Elad M, Protter M: On single image scale-up using sparse-representations. In Curves & Surfaces. Avignon-France; 2010:711-730.
  21. Savitzky A, Golay MJE: Smoothing and differentiation of data by simplified least squares procedures. Anal Chem 1964, 36: 1627-1639. 10.1021/ac60214a047
  22. Tikhonov AN, Arsenin VY: Solutions of Ill-Posed Problems. V. H. Winston, Washington DC; 1977.
  23. Rudin LI, Osher S, Fatemi E: Nonlinear total variation based noise removal algorithms. Physica D 1992, 60: 259-268. 10.1016/0167-2789(92)90242-F
  24. Protter M, Elad M: Image sequence denoising via sparse and redundant representations. IEEE Trans Image Process 2009, 18: 27-35.
  25. Mairal J, Sapiro G, Elad M: Learning multiscale sparse representations for image and video restoration. Multiscale Model Simul 2008, 7: 214-241. 10.1137/070697653
  26. Candès EJ, Romberg J, Tao T: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans Inf Theory 2006, 52(2):489-509.
  27. Donoho DL: Compressed sensing. IEEE Trans Inf Theory 2006, 52(4):1289-1306.
  28. Aharon M, Elad M, Bruckstein A: K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans Signal Process 2006, 54(11):4311-4322.
  29. Pati Y, Rezaiifar R, Krishnaprasad P: Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Twenty Seventh Asilomar Conference on Signals, Systems and Computers. Volume 1. Pacific Grove, CA; 1993:40-44.
  30. Brunet D, Vrscay ER, Wang Z: Structural similarity-based approximation of signals and images using orthogonal bases. In Proc Int Conf on Image Analysis and Recognition. Edited by: M Kamel, A Campilho. Springer, Heidelberg; 2010:11-22. vol. 6111 of LNCS
  31. Wang S, Rehman A, Wang Z, Ma S, Gao W: SSIM-inspired divisive normalization for perceptual video coding. In IEEE International Conference on Image Processing (ICIP). Brussels, Belgium; 2011:1657-1660.
  32. Wainwright MJ, Simoncelli EP: Scale mixtures of Gaussians and the statistics of natural images. Adv Neural Inf Process Syst 2000, 12: 855-861.
  33. Lyu S, Simoncelli EP: Statistically and perceptually motivated nonlinear image representation. In Proc SPIE Conf Human Vision Electron Imaging XII. Volume 6492. San Jose, CA; 2007:649207-1-649207-15.
  34. Foley J: Human luminance pattern mechanisms: masking experiments require a new model. J Opt Soc Am 1994, 11: 1710-1719. 10.1364/JOSAA.11.001710
  35. Watson AB, Solomon JA: Model of visual contrast gain control and pattern masking. J Opt Soc Am 1997, 14: 2379-2391. 10.1364/JOSAA.14.002379
  36. Heeger DJ: Normalization of cell responses in cat striate cortex. Vis Neurosci 1992, 9: 181-198.
  37. Simoncelli EP, Heeger DJ: A model of neuronal responses in visual area MT. Vis Res 1998, 38: 743-761. 10.1016/S0042-6989(97)00183-1
  38. Li Q, Wang Z: Reduced-reference image quality assessment using divisive normalization-based image representation. IEEE J Sel Top Signal Process 2009, 3: 202-211.
  39. Rehman A, Wang Z: Reduced-reference SSIM estimation. In International Conference on Image Processing. Hong Kong, China; 2010:289-292.
  40. Malo J, Epifanio I, Navarro R, Simoncelli EP: Non-linear image representation for efficient perceptual coding. IEEE Trans Image Process 2006, 15: 68-80.
  41. Portilla J, Strela V, Wainwright MJ, Simoncelli EP: Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Trans Image Process 2003, 12: 1338-1351. 10.1109/TIP.2003.818640
  42. Wang Z, Simoncelli EP: Maximum differentiation (MAD) competition: a methodology for comparing computational models of perceptual quantities. J Vis 2008, 8(12):1-13. 10.1167/8.12.1
  43. Yang J, Wang Z, Lin Z, Huang T: Coupled dictionary training for image super-resolution. 2011.


© Rehman et al; licensee Springer. 2012

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.