Enhancing LTW image encoder with perceptual coding and GPU-optimized 2D-DWT transform
© Martínez-Rach et al.; licensee Springer. 2013
Received: 31 January 2013
Accepted: 1 August 2013
Published: 23 August 2013
When optimizing a wavelet image coder, the two main targets are to (1) improve its rate-distortion (R/D) performance and (2) reduce its coding time. In general, the encoding engine is mainly responsible for the R/D performance and is usually more complex than the decoder. Many works address R/D or complexity optimizations, but only a few tackle the problem of increasing R/D performance while reducing the computational cost at the same time, as Kakadu, an optimized implementation of JPEG2000, does. In this work we propose an optimization of the E_LTW encoder with the aim of increasing its R/D performance through perceptual encoding techniques and reducing the encoding time by means of a graphics processing unit (GPU)-optimized version of the two-dimensional discrete wavelet transform. The results show that in both performance dimensions our enhanced encoder achieves good results compared with the Kakadu and SPIHT encoders, with speedups of up to 6 times with respect to the original E_LTW encoder.
Wavelet transforms have been reported to perform well for image compression; therefore, many state-of-the-art image codecs, including the JPEG2000 image coding standard, use the discrete wavelet transform (DWT) [1, 2]. The use of wavelet coefficient trees and successive approximations was introduced by the embedded zerotree wavelet (EZW) algorithm with a bitplane coding approximation. SPIHT, an advanced version of EZW, processes the wavelet coefficient trees more efficiently by partitioning the coefficients according to their significance. Both EZW and SPIHT need to build coefficient trees in order to search for significant coefficients through an iterative process at each bitplane, which involves high computational complexity.
JPEG2000 implements bitplane coding over codeblocks with three passes per bitplane, so that the most important information, from a rate-distortion (R/D) point of view, is encoded first. It also uses an optional, low-complexity post-compression optimization algorithm based on the Lagrange multiplier method, as well as a large number of contexts for the arithmetic encoder. This post-compression rate-distortion optimization algorithm selects the most important coefficients by weighting them according to the mean square error (MSE) distortion measurement.
Wavelet-based image processing systems are typically implemented with memory-intensive algorithms and with higher execution times than encoders based on other transforms such as the discrete cosine transform. In usual two-dimensional (2D) DWT implementations, image decomposition is computed by means of a convolution filtering process, so complexity rises with the filter length. The image is transformed at every decomposition level, first column by column and then row by row.
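As a concrete illustration of this separable, convolution-based decomposition, the sketch below computes one 2D-DWT level by filtering every column and then every row. Orthonormal Haar filters stand in for the longer biorthogonal 9/7 pair used in practice, and periodic border extension is an assumption made only to keep the sketch short:

```python
import numpy as np

def analysis_1d(signal, lo, hi):
    """One DWT level along a 1-D signal: convolve with the low-pass and
    high-pass filters, then downsample by 2 (periodic border extension)."""
    n = len(signal)
    ext = np.concatenate([signal, signal[:len(lo) - 1]])  # periodic border
    approx = np.convolve(ext, lo, mode="valid")[:n:2]
    detail = np.convolve(ext, hi, mode="valid")[:n:2]
    return approx, detail

def dwt2d_level(image, lo, hi):
    """One 2D-DWT level: filter every column, then every row,
    producing the four subbands of the decomposition."""
    cols_a, cols_d = zip(*(analysis_1d(image[:, c], lo, hi)
                           for c in range(image.shape[1])))
    a = np.column_stack(cols_a)   # low-pass of the columns
    d = np.column_stack(cols_d)   # high-pass of the columns
    ll, hl = zip(*(analysis_1d(a[r, :], lo, hi) for r in range(a.shape[0])))
    lh, hh = zip(*(analysis_1d(d[r, :], lo, hi) for r in range(d.shape[0])))
    return np.array(ll), np.array(hl), np.array(lh), np.array(hh)

# Orthonormal Haar filters stand in for the 9/7 pair of the real codec.
lo = np.array([1.0, 1.0]) / np.sqrt(2.0)
hi = np.array([1.0, -1.0]) / np.sqrt(2.0)
img = np.arange(64, dtype=float).reshape(8, 8)
ll, hl, lh, hh = dwt2d_level(img, lo, hi)
```

Each further decomposition level would apply `dwt2d_level` again to the LL subband, which is how the multiresolution pyramid referred to throughout the text is built.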
In previous work, the authors proposed the E_LTW codec with sign coding, precise rate control, and optimizations that avoid bitplane processing, at the cost of not being embedded, but with low memory requirements and R/D performance similar to that obtained by embedded encoders such as JPEG2000 and SPIHT.
Part II of the JPEG2000 standard includes visual progressive weighting and visual masking, setting the weights based on the human visual system (HVS) through the contrast sensitivity function (CSF). Many other image encoders have incorporated knowledge of the HVS in order to obtain better perceptual quality in the compressed images. The most widely used characteristic is the contrast adaptability of the HVS, because the HVS is more sensitive to contrast than to absolute luminance. The CSF relates spatial frequency with contrast sensitivity.
This perceptual coding will improve the perceptual quality of the reconstructed images, so that for a desired rate range, a better perceptual R/D behavior is achieved.
Although most studies employ the peak signal-to-noise ratio (PSNR) metric to measure image quality, it is well known that this metric does not always capture the distortion perceived by human observers. Therefore, we use objective quality assessment metrics whose design is inspired by the HVS, since our proposal includes perceptual encoding techniques that may not be properly evaluated by the PSNR metric.
In this work, we propose the PE_LTW (perceptually enhanced LTW) as an enhanced version of the E_LTW encoder by including perceptual coding based on the CSF and the use of graphics processing unit (GPU)-optimized 2D-DWT algorithms based on the methods described in [4, 8].
After improving the perceptual R/D behavior of our proposal, we proceed to optimize the 2D-DWT transform module by GPU processing to reduce the overall encoding time. From previous work, we have defined a CUDA implementation of the 2D-DWT transform that is able to considerably reduce the 2D-DWT computation time.
To test the behavior of our proposal, we compare the performance of the PE_LTW encoder, in terms of perceptual quality and encoding delay, with the Kakadu implementation of the JPEG2000 standard, with and without its perceptual weighting mode enabled, and with the SPIHT image encoder.
2 Encoding system
The basic idea of this encoder is very simple: after computing the 2D-DWT transform of an image, the perceptually weighted wavelet coefficients are uniformly quantized and then encoded with arithmetic coding.
As mentioned, the 2D-DWT computation stage runs on a GPU and includes the perceptual weighting based on the CSF, implemented as an invariant scaling factor weighting (ISFW) that scales the obtained coefficients according to the importance of each frequency subband for HVS contrast sensitivity. We detail the CSF and the ISFW in the following sections.
The uniform quantization of the perceptually weighted coefficients is performed by means of two strategies: one coarser and another finer. The finer one consists of applying a scalar uniform quantization (Q) to the coefficients. The coarser one is based on removing the least significant bitplanes (rplanes) from coefficients.
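A minimal sketch of these two stages, assuming a hypothetical scalar step `q` together with the `rplanes` parameter (the real coder applies them to the perceptually weighted wavelet coefficients):

```python
def quantize(coeff, q, rplanes):
    """Two-stage uniform quantization: a finer scalar step q followed by
    removal of the rplanes least-significant bitplanes (coarser stage)."""
    sign = -1 if coeff < 0 else 1
    magnitude = int(abs(coeff) / q)   # finer: scalar uniform quantization
    magnitude >>= rplanes             # coarser: drop rplanes bitplanes
    return sign * magnitude

def dequantize(value, q, rplanes):
    """Approximate inverse: restore the dropped bitplanes at the midpoint
    of the uncertainty interval, then undo the scalar step."""
    sign = -1 if value < 0 else 1
    magnitude = abs(value) << rplanes
    if magnitude and rplanes:
        magnitude += 1 << (rplanes - 1)   # midpoint of the dropped range
    return sign * magnitude * q
```

The midpoint reconstruction in `dequantize` is one common convention, chosen here for illustration; the encoder itself only needs the forward direction.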
For the coding stage, if the absolute value of a coefficient and all its descendants (considering the classic quad-tree structure) is lower than a threshold value (2^rplanes), the entire tree is encoded with a single symbol, which we call the LOWER symbol (indicating that all the coefficients in the tree are lower than 2^rplanes and so form a lower tree). However, if a coefficient is lower than the threshold but not all its descendants are, that coefficient is encoded with an ISOLATED LOWER symbol. On the other hand, for each wavelet coefficient higher than 2^rplanes, we encode a symbol indicating the number of bits needed to represent it, along with a binary-coded representation of its bits and sign (note that the rplanes least-significant bits are not encoded).
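The symbol selection above can be sketched as follows; the `(value, children)` tuples are a stand-in for the real quad-tree that links subbands across scales:

```python
# Hypothetical tree node: (coefficient value, [child nodes]).
def max_abs(node):
    """Largest absolute coefficient in the whole (sub)tree."""
    value, children = node
    return max([abs(value)] + [max_abs(c) for c in children])

def classify(node, rplanes):
    """Return the symbol for the root coefficient: LOWER prunes the whole
    tree with one symbol, ISOLATED_LOWER marks a low coefficient whose
    descendants are not all low, and a significant coefficient yields the
    number of bits needed for its magnitude."""
    threshold = 2 ** rplanes
    value, children = node
    if max_abs(node) < threshold:
        return "LOWER"
    if abs(value) < threshold:
        return "ISOLATED_LOWER"
    return int(abs(value)).bit_length()

quiet = (3, [(1, []), (2, []), (0, []), (5, [])])   # everything below 2**3
mixed = (3, [(1, []), (40, []), (2, []), (0, [])])  # one large descendant
```

With `rplanes = 3`, `quiet` collapses into a single LOWER symbol, `mixed` yields ISOLATED_LOWER for its root, and a lone coefficient of 40 would be coded by its bit count.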
The encoder exploits the sign neighborhood correlation within each wavelet subband type (HL, LH, HH), as Deever and Hemami assessed, by encoding the prediction of the sign (success or failure).
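A toy sketch of prediction-based sign coding; the concrete neighbor rule below is invented purely for illustration, since Deever and Hemami derive the actual prediction from measured sign correlations in each subband type:

```python
# Illustrative only: real predictors come from measured per-subband
# sign correlations, not from this hand-picked rule.
def predicted_sign(subband, left, above):
    """Predict a coefficient's sign (+1/-1) from already-coded neighbors;
    0 means the neighbor is insignificant or absent."""
    if subband == "HL" and left:     # horizontal detail: use the row neighbor
        return -left
    if subband == "LH" and above:    # vertical detail: use the column neighbor
        return -above
    return 1                         # fall back to a fixed guess

def sign_symbol(actual, subband, left, above):
    """Emit success/failure of the prediction; a context-adaptive
    arithmetic coder then compresses this skewed binary stream."""
    return "SUCCESS" if actual == predicted_sign(subband, left, above) else "FAILURE"
```

The compression gain comes from the skew: when the predictor is right much more often than wrong, the success/failure stream costs well under one bit per sign.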
The proposed encoder also includes the rate control algorithm presented in previous work, taking into account the sign coding and the intrinsic error model of the rate control. As the rate control underestimates the target rate, the bits required to match the target bitrate are added to the bitstream. The selected bits correspond to the bitplanes (at or below the rplanes quantization parameter) of significant coefficients, added to the output bitstream in a fixed order, from the low-frequency subbands to the highest one.
2.2 The contrast sensitivity function
In previous work, the authors explained how the contrast sensitivity of the HVS can be exploited by means of the CSF curve to enhance the perceptual or subjective quality of DWT-encoded images. A comprehensive review of HVS models for quality assessment and image compression can be found in the literature. Most of these models take into account the varying sensitivity over spatial frequency and color, and the inhibiting effect of strong local contrasts or activity, called masking.
The CSF curve is modeled as A(f) = 2.6 (0.0192 + 0.114 f) e^(-(0.114 f)^1.1), where the spatial frequency f = sqrt(f_x^2 + f_y^2) has units of cycles/degree (f_x and f_y are the horizontal and vertical spatial frequencies, respectively). The frequency is measured in cycles per optical degree, which makes the CSF independent of the viewing distance.
One of the first works demonstrating that the MSE cannot reliably predict the difference in perceived quality between two images is that of Mannos and Sakrison. Through psychovisual experiments, they proposed the aforementioned CSF model, which is well suited and widely used [6, 14–16] for wavelet-based codecs; therefore, we adopt this model.
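The Mannos-Sakrison model can be evaluated directly; this sketch uses the standard form of the curve (the same one reproduced in the cited wavelet literature) and locates its sensitivity peak:

```python
import math

def csf(f):
    """Mannos-Sakrison CSF: contrast sensitivity as a function of
    spatial frequency f in cycles per optical degree."""
    return 2.6 * (0.0192 + 0.114 * f) * math.exp(-((0.114 * f) ** 1.1))

# The curve is band-pass: sensitivity rises to a peak at mid frequencies
# and decays toward both very low and very high frequencies.
peak = max(range(1, 60), key=csf)
```

Scanning integer frequencies puts the peak near 8 cycles/degree, which is why mid-frequency subbands receive the largest perceptual weights while the highest-frequency subbands can be quantized more aggressively.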
2.3 Using the CSF
Previous work has explained how the CSF can be implemented in wavelet-based codecs. Some codecs, like JPEG2000 Part II, introduce the CSF as a visual progressive single-factor weighting, replacing the MSE with the CSF-weighted MSE (WMSE) and optimizing system parameters to minimize the WMSE for a given bitrate. This is done in the post-compression rate-distortion optimization algorithm, where the WMSE replaces the MSE as the cost function that drives the formation of quality layers.
CSF weights can also be obtained by applying the appropriate contrast detection threshold to each frequency subband. In one such study, subjective experiments were performed to obtain a model that expresses the threshold DWT noise as a function of spatial frequency. Using this model, the authors obtained a perceptually lossless quantization matrix for the linear-phase 9/7 DWT. With this quantization matrix, each subband is quantized by a value that keeps the resulting quantized image at the threshold of artifact visibility. For suprathreshold quantization, a uniform quantization stage is performed afterward.
However, we introduce the CSF in the encoder using the ISFW strategy. From the CSF curve, we obtain the weights for scaling the wavelet coefficients. This weighting is introduced after the wavelet filtering stage and before the uniform quantization stage, and is a simple multiplication of the wavelet coefficients in each frequency subband by the corresponding weight. At the decoder, the inverse of this weight is applied. The CSF weights do not need to be explicitly transmitted to the decoder, and this stage is independent of the other encoder modules (wavelet filtering, quantization, etc.).
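The ISFW stage then reduces to one multiplication per coefficient, with the inverse weight applied at the decoder. The weight table below is a made-up placeholder, not the set of values derived from the CSF curve:

```python
# Hypothetical per-subband weights, keyed by (decomposition level,
# orientation); a real codec derives these from the CSF curve.
WEIGHTS = {(1, "HH"): 0.35, (1, "HL"): 0.56, (1, "LH"): 0.56,
           (2, "HH"): 0.73, (2, "HL"): 0.92, (2, "LH"): 0.92}

def weight_subband(coeffs, level, orientation, inverse=False):
    """Multiply every coefficient of a subband by its CSF weight
    (encoder side) or by its reciprocal (decoder side); no side
    information is transmitted because the table is fixed."""
    w = WEIGHTS[(level, orientation)]
    if inverse:
        w = 1.0 / w
    return [c * w for c in coeffs]

band = [10.0, -4.0, 0.5]
weighted = weight_subband(band, 1, "HH")
restored = weight_subband(weighted, 1, "HH", inverse=True)
```

Because both sides hold the same table, the round trip through `weight_subband` recovers the original coefficients up to floating-point precision, which is what lets the decoder invert the stage for free.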
The granularity of the correspondence between frequency and weighting value is a key issue. As wavelet-based codecs obtain a multiresolution signal decomposition, the simplest association assigns a single weighting value (or contrast detection threshold) to each wavelet frequency subband. If the frequency domain is decomposed further, for example with wavelet packets, a finer association between frequencies and weights can be made.
Proposed CSF weighting matrix
2.4 GPU 2D-DWT optimization
To develop the 2D-DWT-optimized version, we use an NVIDIA GTX 280 GPU, which contains 30 multiprocessors with eight cores each, 1 GB of global memory, and 16 kB of shared memory (SM) per block.
Firstly, we define our GPU-based 2D-DWT algorithm, named CUDA-Conv 9/7, as the reference algorithm. It uses the GPU shared memory space only to store the buffer that holds a copy of the working row/column data, while the constant memory space stores the filter taps. We call each CUDA kernel with a one-dimensional number of thread blocks, NBLOCKS, and a one-dimensional number of threads per block, NTHREADS.
In the horizontal DWT filtering process, each image row is stored in the threads' shared memory; in the vertical filtering, each column is processed in the same way. The row or column size determines the NBLOCKS parameter, which must be greater than or equal to the image width in the horizontal step or the image height in the vertical step. One of the goals of the proposed CUDA-based methods is not to increase memory requirements, so we store the resulting wavelet coefficients in the original image memory space.
For computing the DWT, the threads use the shared memory space, where access latency is extremely low. The CUDA-Sep 9/7 algorithm stores the original image in the GPU global memory but computes the filtering steps from shared memory.
We can reduce the number of idle threads by reducing the total number of threads per block and also using each thread to load multiple pixels into the shared memory. This ensures that all threads of each warp are active during the computation stage. Note that the number of threads in a block must be a multiple of the warp size (32 threads on GTX 280) for optimal efficiency.
The row filtering stage
The column filtering stage
Each row/column filtering stage is separated into two substages: (a) the threads load a block of pixels of one row/column from the global memory into the shared memory, and (b) each thread computes the filter over the data stored in the shared memory and the result is sent to the global memory. For the column filtering, the resulting coefficient is stored in the global memory after performing the perceptual weighting, i.e., multiplying the final coefficient by the perceptual weight corresponding to the wavelet subband of the coefficient.
In the row or column filtering, the pixels located in the image block borders also need adjacent pixels from other thread blocks to compute the DWT. The apron region must also be loaded in the shared memory, but only for reading purposes, because the filtered value of the pixels located there is computed by other thread blocks.
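The two substages, including the apron handling, can be simulated in a few lines; thread blocks become loop iterations, "shared memory" becomes a local buffer, and a short 3-tap kernel stands in for the 9-tap low-pass filter:

```python
def filter_tile(row, tile_start, tile_size, taps):
    """Simulate one thread block of the row-filtering stage: load the
    tile plus its apron (border pixels owned by neighboring blocks) into
    a local buffer, then filter only the tile's own pixels."""
    radius = len(taps) // 2
    n = len(row)
    # Substage (a): cooperative load into "shared memory", clamping the
    # apron reads at the image border.
    shared = [row[min(max(i, 0), n - 1)]
              for i in range(tile_start - radius,
                             tile_start + tile_size + radius)]
    # Substage (b): each "thread" filters one pixel of the tile; apron
    # pixels are read-only here and are written by other blocks.
    out = []
    for t in range(tile_size):
        acc = sum(shared[t + k] * taps[k] for k in range(len(taps)))
        out.append(acc)
    return out

row = list(range(16))
taps = [0.25, 0.5, 0.25]   # stand-in for the 9-tap low-pass filter
full = [filter_tile(row, s, 4, taps) for s in (0, 4, 8, 12)]
```

Stitching the per-tile outputs together reproduces the full filtered row, which is the invariant the apron exists to preserve: every block computes exactly its own pixels while reading a little beyond them.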
The speedup achieved by the GPU-based DWT algorithm is up to 20 times relative to the sequential implementation on one core. Note that the wavelet transform is only the first step of an image/video encoder.
3 Performance evaluation
All evaluated encoders were tested on an Intel Core 2 CPU at 1.8 GHz with 6 GB of RAM, using the NVIDIA GTX 280 GPU described in the previous section (30 multiprocessors with eight cores each, 1 GB of global memory, and 16 kB of shared memory per block).
The proposed encoder is compared with the Kakadu 5.2.5 and SPIHT 8.01 encoders over two sets of test images: (a) a 512×512 resolution set including Lena, Barbara, Balloon, Horse, Goldhill, Boat, Mandrill, and Zelda, and (b) a 2,048×2,560 resolution set including Cafe, Bike, and Woman. For Kakadu, we perform two comparisons: one labeled Kakadu_csf, with its perceptual weighting mode enabled, and the other, labeled Kakadu, without perceptual weights.
First, we analyze the speedup of the GPU-based encoder using 2D-DWT described in the previous section with respect to the traditional convolution algorithm running in a single core processor.
GPU vs. SEQ PE_LTW speedup and total encoding time comparison with SPIHT and Kakadu
When the target bitrate is low, i.e., at high compression rates, the uniform quantization of the wavelet coefficients produces a large number of nonsignificant coefficients in the low decomposition levels, with the roots of the zero trees located at higher decomposition levels. This reduces the computation cost because only the root of a zero tree needs to be encoded. As a consequence, the overall number of operations is reduced, and the gain of the GPU-optimized version is reduced too.
Speedup comparison by target bitrate
PE_LTW mean times
4 R/D evaluation
For evaluating image encoders, the most common performance metric is the well-known R/D trade-off between encoder bitrate (bpp) and reconstructed quality, typically measured in decibels through the PSNR of the luminance plane. However, it is also well known that the PSNR measurement is not close to human perception of quality and sometimes gives misleading scores, leading to erroneous conclusions when evaluating different encoding strategies.
Several studies discuss the convenience of using image quality assessment metrics other than PSNR that better fit human perceptual quality assessment (i.e., subjective test results) [14, 17, 19, 20]. One of the best-behaving objective metrics is visual information fidelity (VIF), which has been shown [17, 19] to correlate better with subjective perception than other metrics commonly used for encoder comparisons [14, 20]. The VIF metric uses statistical models of natural scenes in conjunction with distortion models to quantify the statistical information shared between the test and reference images.
Rate savings of PE_LTW vs. Kakadu, SPIHT, and Kakadu with perceptual weights Kakadu_csf
We have presented a perceptual image wavelet encoder whose 2D-DWT stage is implemented in CUDA and runs on a GPU. Our proposed perceptual encoder reveals the importance of exploiting the contrast sensitivity behavior of the HVS by means of an accurate perceptual weighting of wavelet coefficients. PE_LTW is very competitive in terms of perceptual quality, obtaining significant bitrate savings regardless of the image resolution and at any bitrate when compared with SPIHT and with Kakadu both with and without its perceptual weighting mode enabled. The PE_LTW encoder is able to produce a quality-equivalent image with respect to the other two encoders at a reduced rate.
As the 2D-DWT transform runs on a GPU, the overall encoding time is greatly reduced compared to the sequential version of the same encoder, with maximum speedups of 6.86 for 512×512 images and 4.39 for 2,048×2,560 images. Compared with SPIHT and Kakadu, our proposal is clearly faster than SPIHT but needs additional optimizations to outperform Kakadu's encoding times.
This research was supported by the Spanish Ministry of Education and Science under grant TIN2011-27543-C03-03.
- ISO: JPEG 2000 image coding system. Part 1: core coding system. Geneva: ISO; 2000.
- Said A, Pearlman WA: A new, fast and efficient image codec based on set partitioning in hierarchical trees. IEEE Trans. Circ. Syst. Video Technol. 1996, 6(3):243-250. doi:10.1109/76.499834
- Shapiro JM: A fast technique for identifying zerotrees in the EZW algorithm. Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. 1996, 3:1455-1458.
- Mallat SG: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 1989, 11(7):674-693. doi:10.1109/34.192463
- Lopez O, Martinez M, Pinol P, Malumbres MP, Oliver J: E-LTW: an enhanced LTW encoder with sign coding and precise rate control. In 2009 16th IEEE International Conference on Image Processing (ICIP). Piscataway: IEEE; 2009:2821-2824.
- Taubman DS, Marcellin MW: JPEG2000: Image Compression Fundamentals, Standards and Practice. Berlin: Springer; 2002.
- Sheikh HR, Bovik AC, de Veciana G: An information fidelity criterion for image quality assessment using natural scene statistics. IEEE Trans. Image Process. 2005, 14(12):2117-2128.
- Sweldens W: The lifting scheme: a custom-design construction of biorthogonal wavelets. Appl. Comput. Harmonic Anal. 1996, 3(2):186-200. doi:10.1006/acha.1996.0015
- Nadenau MJ, Reichel J, Kunt M: Wavelet-based color image compression: exploiting the contrast sensitivity function. IEEE Trans. Image Process. 2003, 12(1):58-70. doi:10.1109/TIP.2002.807358
- Deever A, Hemami SS: What's your sign?: efficient sign coding for embedded wavelet image coding. In Proceedings of the Data Compression Conference (DCC 2000). IEEE; 2000:273-282.
- López O, Martinez-Rach M, Oliver J, Malumbres MP: Impact of rate control tools on very fast non-embedded wavelet image encoders. In Visual Communications and Image Processing 2007. Piscataway: IEEE; 2007.
- Oliver J, Malumbres MP: Low-complexity multiresolution image compression using wavelet lower trees. IEEE Trans. Circ. Syst. Video Technol. 2006, 16(11):1437-1444.
- Mannos J, Sakrison D: The effects of a visual fidelity criterion of the encoding of images. IEEE Trans. Inf. Theory 1974, 20(4):525-536. doi:10.1109/TIT.1974.1055250
- Wang Z, Bovik AC, Sheikh HR, Simoncelli EP: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13(4):600-612. doi:10.1109/TIP.2003.819861
- Watson AB, Yang GY, Solomon JA, Villasenor J: Visibility of wavelet quantization noise. IEEE Trans. Image Process. 1997, 6(8):1164-1175. doi:10.1109/83.605413
- Moumkine N, Tamtaoui A, Ait Ouahman A: Integration of the contrast sensitivity function into wavelet codec. In Proceedings of the Second International Symposium on Communications, Control and Signal Processing (ISCCSP 2006). Marrakech; 13–15 Mar 2006.
- Gao X, Lu W, Tao D, Li X: Image quality assessment based on multiscale geometric analysis. IEEE Trans. Image Process. 2009, 18(7):1409-1423.
- Beegan AP, Iyer LR, Bell AE, Maher VR, Ross MA: Design and evaluation of perceptual masks for wavelet image compression. In Proceedings of the 2002 IEEE 10th Digital Signal Processing Workshop and the 2nd Signal Processing Education Workshop. Piscataway: IEEE; 2002:88-93.
- Martinez-Rach M, Lopez O, Piñol P, Oliver J, Malumbres MP: A study of objective quality assessment metrics for video codec design and evaluation. In Eighth IEEE International Symposium on Multimedia. San Diego: IEEE Computer Society; 2006:517-524.
- Sheikh HR, Sabir MF, Bovik AC: A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans. Image Process. 2006, 15(11):3440-3451.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.