Enhancing LTW image encoder with perceptual coding and GPU-optimized 2D-DWT transform
EURASIP Journal on Advances in Signal Processing volume 2013, Article number: 141 (2013)
When optimizing a wavelet image coder, the two main targets are to (1) improve its rate-distortion (R/D) performance and (2) reduce the coding times. In general, the encoding engine is mainly responsible for achieving R/D performance, and it is usually more complex than the decoding part. A large number of works about R/D or complexity optimizations can be found, but only a few tackle the problem of increasing R/D performance while reducing the computational cost at the same time, like Kakadu, an optimized version of JPEG2000. In this work we propose an optimization of the E_LTW encoder with the aim of increasing its R/D performance through perceptual encoding techniques and reducing the encoding time by means of a graphics processing unit-optimized version of the two-dimensional discrete wavelet transform. The results show that in both performance dimensions, our enhanced encoder achieves good results compared with the Kakadu and SPIHT encoders, with speedups of 6 times with respect to the original E_LTW encoder.
Wavelet transforms have been reported to have good performance for image compression; therefore, many state-of-the-art image codecs, including the JPEG2000 image coding standard, use the discrete wavelet transform (DWT) [1, 2]. The use of wavelet coefficient trees and successive approximations was introduced by the embedded zerotree wavelet (EZW) algorithm with a bitplane coding approach. SPIHT, an advanced version of EZW, processes the wavelet coefficient trees in a more efficient way by partitioning the coefficients depending on their significance. Both EZW and SPIHT need to build the coefficient trees and search them for significant coefficients through an iterative process repeated at each bitplane, which involves high computational complexity.
JPEG2000 implements bitplane coding by encoding each codeblock with three passes per bitplane, so that the most important information, from a rate-distortion (R/D) point of view, is encoded first. It also uses a large number of contexts for the arithmetic encoder, together with an optional, low-complexity post-compression optimization algorithm based on the Lagrange multiplier method. This post-compression rate-distortion optimization algorithm selects the most important coefficients by weighting them according to the mean square error (MSE) distortion measurement.
Wavelet-based image processing systems are typically implemented with memory-intensive algorithms and have higher execution times than encoders based on other transforms such as the discrete cosine transform. In usual two-dimensional (2D)-DWT implementations, image decomposition is computed by means of a convolution filtering process, so its complexity rises with the filter length. The image is transformed at every decomposition level, first column by column and then row by row.
The E_LTW codec was proposed with sign coding, precise rate control, and some optimizations to avoid bitplane processing, at the cost of not being embedded, but with low memory requirements and R/D performance similar to that of embedded encoders like JPEG2000 and SPIHT.
Part II of the JPEG2000 standard includes visual progressive weighting and visual masking by setting the weights based on the human visual system (HVS) using the contrast sensitivity function (CSF). Many other image encoders have included much of the knowledge of the human visual system in order to obtain a better perceptual quality of the compressed images. The most widely used characteristic is the contrast adaptability of the HVS, because the HVS is more sensitive to contrast than to absolute luminance. The CSF relates the spatial frequency with the contrast sensitivity.
This perceptual coding will improve the perceptual quality of the reconstructed images, so that for a desired rate range, a better perceptual R/D behavior is achieved.
Although most studies employ the peak signal-to-noise ratio (PSNR) metric to measure image quality performance, it is well known that this metric does not always capture the distortion perceived by the human being. Therefore, we decided to use objective quality assessment metrics whose design is inspired by the HVS, since our proposal includes perceptual-based encoding techniques that may not be properly evaluated by the PSNR metric.
In this work, we propose the PE_LTW (perceptually enhanced LTW) as an enhanced version of the E_LTW encoder by including perceptual coding based on the CSF and the use of graphics processing unit (GPU)-optimized 2D-DWT algorithms based on the methods described in [4, 8].
After improving the perceptual R/D behavior of our proposal, we proceed to optimize the 2D-DWT transform module by GPU processing to reduce the overall encoding time. From previous work, we have defined a CUDA implementation of the 2D-DWT transform that is able to considerably reduce the 2D-DWT computation time.
So as to test the behavior of our proposal, we have compared the performance of our PE_LTW encoder in terms of perceptual quality and encoding delays with the Kakadu implementation of the JPEG2000 standard, with and without enabling its perceptual weighting mode, and with the SPIHT image encoder.
2 Encoding system
The basic idea of this encoder is very simple: after computing the 2D-DWT transform of an image, the perceptually weighted wavelet coefficients are uniformly quantized and then encoded with arithmetic coding.
As mentioned, the 2D-DWT computation stage runs on a GPU and includes the perceptual weighting based on the CSF, implemented as an invariant scaling factor weighting (ISFW) that weights the obtained coefficients depending on the importance that the frequency subband has for the HVS contrast sensitivity. We detail the CSF and the ISFW in the following sections.
The uniform quantization of the perceptually weighted coefficients is performed by means of two strategies: one coarser and another finer. The finer one consists of applying a scalar uniform quantization (Q) to the coefficients. The coarser one is based on removing the least significant bitplanes (rplanes) from coefficients.
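The two quantization strategies can be illustrated with a short sketch. It is a minimal Python example, assuming truncation toward zero in the scalar stage and reconstruction at the lower edge of the quantization bin; the function names `quantize` and `dequantize` are hypothetical and are not taken from the E_LTW implementation.

```python
def quantize(coeff: float, q: float, rplanes: int) -> int:
    """Quantize one wavelet coefficient in two stages (illustrative only)."""
    sign = -1 if coeff < 0 else 1
    level = int(abs(coeff) / q)   # finer stage: uniform scalar quantization (truncation assumed)
    level >>= rplanes             # coarser stage: drop the rplanes least significant bitplanes
    return sign * level


def dequantize(level: int, q: float, rplanes: int) -> float:
    """Approximate inverse: restore the dropped bitplanes as zeros, then rescale by q."""
    sign = -1 if level < 0 else 1
    return sign * (abs(level) << rplanes) * q


if __name__ == "__main__":
    for c in (100.0, -100.0, 5.0):
        k = quantize(c, q=2.0, rplanes=2)
        print(c, "->", k, "->", dequantize(k, q=2.0, rplanes=2))
```

With Q = 2 and rplanes = 2, the coefficient 5.0 falls below the combined step and becomes zero, which is the kind of nonsignificant coefficient that the tree coding stage can group.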
For the coding stage, if the absolute value of a coefficient and all its descendants (considering the classic quad-tree structure) is lower than a threshold value (2^rplanes), the entire tree is encoded with a single symbol, which we call the LOWER symbol (indicating that all the coefficients in the tree are lower than 2^rplanes and so they form a lower tree). However, if a coefficient is lower than the threshold but not all its descendants are lower than the threshold, that coefficient is encoded with an ISOLATED LOWER symbol. On the other hand, for each wavelet coefficient higher than 2^rplanes, we encode a symbol indicating the number of bits needed to represent that coefficient, along with a binary-coded representation of its bits and sign (note that the rplanes least significant bits are not encoded).
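The symbol decision described above can be sketched as follows. This is an illustrative Python example, not the authors' code: the `Node` class, the `classify` function, and the tuple it returns are hypothetical, and a coefficient exactly equal to the threshold is assumed to be significant.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One integer wavelet coefficient and its quad-tree children."""
    value: int
    children: List["Node"] = field(default_factory=list)


def subtree_max(node: Node) -> int:
    """Largest absolute value in the subtree rooted at node (node included)."""
    return max([abs(node.value)] + [subtree_max(c) for c in node.children])


def classify(node: Node, rplanes: int) -> Tuple[str, int]:
    """Return (symbol, number of bits); the bit count is 0 for the two LOWER symbols."""
    threshold = 1 << rplanes  # 2^rplanes
    if abs(node.value) < threshold:
        if all(subtree_max(c) < threshold for c in node.children):
            return ("LOWER", 0)           # the whole tree is below the threshold
        return ("ISOLATED_LOWER", 0)      # some descendant is significant
    # Significant coefficient: report how many bits its magnitude needs.
    return ("SIGNIFICANT", abs(node.value).bit_length())


if __name__ == "__main__":
    tree = Node(1, [Node(9), Node(0), Node(2), Node(-3)])
    print(classify(tree, rplanes=2))              # root is below 4, but one child is not
    print(classify(tree.children[0], rplanes=2))  # 9 needs 4 bits
```

In a full encoder, the descendants of a LOWER root would not be visited at all; the sketch only shows the per-coefficient decision.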
The encoder exploits the sign neighborhood correlation of each wavelet subband type (HL, LH, HH), as assessed by Deever and Hemami, by encoding the outcome of the sign prediction (success or failure).
The proposed encoder also includes a rate control algorithm presented in earlier work, but taking into account the sign coding and the intrinsic error model of the rate control. As the rate control underestimates the target rate, the bits required to match the target bitrate are added to the bitstream. The selected bits correspond to the bitplanes (lower than or equal to the rplanes quantization parameter) of significant coefficients, and they are added to the output bitstream following a particular order, from the low-frequency subbands to the highest-frequency one.
2.2 The contrast sensitivity function
The sensitivity to contrast of the HVS can be exploited by means of the CSF curve to enhance the perceptual or subjective quality of DWT-encoded images. Comprehensive reviews of HVS models for quality assessment and image compression are available in the literature. Most of these models take into account the varying sensitivity over spatial frequency, color, and the inhibiting effects of strong local contrasts or activity, called masking.
Complex HVS models implement each of these low-level visual effects as a separate stage, and the overall model consists of the successive processing of each stage. One of the initial HVS stages is the visual sensitivity as a function of spatial frequency, which is described by the CSF. A closed-form model of the CSF for luminance images is given by
$$H(f) = 2.6\,(0.0192 + 0.114 f)\, e^{-(0.114 f)^{1.1}} \qquad (1)$$
where the spatial frequency is $f = \sqrt{f_x^2 + f_y^2}$, in units of cycles/degree ($f_x$ and $f_y$ are the horizontal and vertical spatial frequencies, respectively). The frequency is usually measured in cycles per optical degree, which makes the CSF independent of the viewing distance.
Figure 1 depicts the CSF curve obtained with Equation 1, and it characterizes luminance sensitivity as a function of normalized spatial frequency (CSF=1/Contrast threshold). As shown, CSF is a band-pass filter, which is most sensitive to normalized spatial frequencies between 0.025 and 0.125 and less sensitive to very low and very high frequencies. The reason why we cannot distinguish patterns with high frequencies is the limited number of photoreceptors in our eyes. CSF curves exist for chrominance as well. However, unlike luminance stimuli, human sensitivity to chrominance stimuli is relatively uniform across spatial frequency.
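The band-pass shape can be checked numerically. The sketch below assumes that the closed-form model is the Mannos and Sakrison expression, commonly written as H(f) = 2.6 (0.0192 + 0.114 f) exp(-(0.114 f)^1.1) with f in cycles/degree; the function names and the sampling grid are illustrative choices, and the frequencies here are not normalized as they are in Figure 1.

```python
import math


def radial_frequency(fx: float, fy: float) -> float:
    """Combine horizontal and vertical spatial frequencies (cycles/degree)."""
    return math.hypot(fx, fy)


def csf(f: float) -> float:
    """Assumed Mannos-Sakrison luminance CSF; f in cycles/degree."""
    return 2.6 * (0.0192 + 0.114 * f) * math.exp(-((0.114 * f) ** 1.1))


def peak_frequency(step: float = 0.1, f_max: float = 60.0) -> float:
    """Brute-force search for the most sensitive frequency on a regular grid."""
    n = int(f_max / step)
    return max((i * step for i in range(n + 1)), key=csf)


if __name__ == "__main__":
    for f in (0.5, 2.0, 8.0, 20.0, 40.0):
        print(f"{f:5.1f} cycles/degree -> sensitivity {csf(f):.3f}")
    print("peak near", round(peak_frequency(), 1), "cycles/degree")
```

Under this assumed model the sensitivity is low at both ends of the frequency range and highest at mid frequencies, which is the behavior the weighting stage relies on.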
One of the first works to demonstrate that the MSE cannot reliably predict the difference in perceived quality between two images also proposed, by way of psychovisual experiments, the aforementioned model of the CSF. This model is well suited to wavelet-based codecs and widely used in them [6, 14–16]; therefore, we adopt it.
2.3 Using the CSF
The CSF can be implemented in wavelet-based codecs in several ways. Some codecs, like the JPEG2000 standard Part II, introduce the CSF as a visual progressive single factor weighting, replacing the MSE by the CSF-weighted MSE (WMSE) and optimizing system parameters to minimize the WMSE for a given bitrate. This is done in the post-compression rate-distortion optimization algorithm, where the WMSE replaces the MSE as the cost function that drives the formation of quality layers.
CSF weights can also be obtained by applying the appropriate contrast detection threshold to each frequency subband. In earlier work, subjective experiments were performed to obtain a model that expresses the threshold DWT noise as a function of spatial frequency. Using this model, the authors obtained a perceptually lossless quantization matrix for the linear phase 9/7 DWT. With this quantization matrix, each subband is quantized by a value that places the overall resulting quantized image at the threshold of artifact visibility. For suprathreshold quantization, a uniform quantization stage is performed afterward.
However, we introduce the CSF in the encoder using the ISFW strategy. From the CSF curve, we obtain the weights for scaling the wavelet coefficients. This weighting can be introduced after the wavelet filtering stage and before the uniform quantization stage is applied. The weighting is a simple multiplication of the wavelet coefficients in each frequency subband by the corresponding weight. At the decoder, the inverse of this weight is applied. The CSF weights do not need to be explicitly transmitted to the decoder. This stage is independent of the other encoder modules (wavelet filtering, quantization, etc.).
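The weighting step itself is a per-subband multiplication, as the following sketch shows. The weights in `EXAMPLE_WEIGHTS` are hypothetical placeholders chosen for illustration (they are not the values of Table 1), and the `isfw` function and the dictionary layout are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Subband = Tuple[int, str]  # (decomposition level, orientation)

# Hypothetical weights for illustration only; NOT the values of Table 1.
EXAMPLE_WEIGHTS: Dict[Subband, float] = {
    (1, "HL"): 1.5, (1, "LH"): 1.5, (1, "HH"): 1.25,
    (2, "HL"): 3.0, (2, "LH"): 3.0, (2, "HH"): 2.5,
}


def isfw(subbands: Dict[Subband, List[float]],
         weights: Dict[Subband, float],
         inverse: bool = False) -> Dict[Subband, List[float]]:
    """Scale every coefficient of a subband by its weight (or by 1/weight at the decoder)."""
    scaled = {}
    for key, coeffs in subbands.items():
        w = weights.get(key, 1.0)  # subbands without a weight (e.g. LL) stay unscaled
        factor = 1.0 / w if inverse else w
        scaled[key] = [c * factor for c in coeffs]
    return scaled


if __name__ == "__main__":
    bands = {(2, "LL"): [10.0, -4.0], (1, "HH"): [2.0, -8.0], (2, "HL"): [1.0]}
    encoded = isfw(bands, EXAMPLE_WEIGHTS)                 # encoder side, before quantization
    decoded = isfw(encoded, EXAMPLE_WEIGHTS, inverse=True)  # decoder side, after dequantization
    print(encoded)
    print(decoded)
```

Because both sides hold the same fixed table, nothing about the weights has to travel in the bitstream, which matches the statement that they are not explicitly transmitted.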
The granularity of the correspondence between frequency and weighting value is a key issue. As wavelet-based codecs obtain a multiresolution signal decomposition, the easiest association is to find a unique weighting value (or contrast detection threshold) for each wavelet frequency subband. If the frequency domain is decomposed further, for example using packet wavelets, a finer association between frequency and weights is possible.
Our ISFW implementation follows this approach but increases the granularity at the subband level. This is done in the wavelet transform stage of the PE_LTW encoder by multiplying each coefficient in a wavelet subband by its corresponding weighting factor. Although the CSF (Equation 1) is independent of the viewing distance, the resolution and the viewing distance must be fixed in order to introduce it as a scaling factor. Although an observer can look at the images from any distance, the assumption of ‘worst case viewing conditions’ can produce CSF weighting factors that work properly for all viewing distances and media resolutions. So after fixing the viewing conditions, we obtain the weighting matrix presented in Table 1. For each wavelet decomposition level and frequency orientation, the weights are directly obtained from the CSF curve by normalizing the corresponding values so that the most perceptually important frequencies are scaled with higher values, while the less important ones are preserved. This scaling process increases the magnitude of all wavelet coefficients, except for those in the LL subband, which are neither scaled nor quantized in our coding algorithm. Our tests reveal that, thanks to the weighting process, the uniform quantization stage preserves a very good balance between bitrate and perceptual quality over the whole quantization range, from under-threshold (perceptually lossless) to suprathreshold (lossy) quantization.
2.4 GPU 2D-DWT optimization
In order to develop the optimized 2D-DWT version, we use an NVIDIA GTX 280 GPU that contains 30 multiprocessors with eight cores in each multiprocessor, 1 GB of global memory, and 16 kB of shared memory (SM) per block.
Firstly, we define our GPU-based 2D-DWT algorithm, named CUDA Conv 9/7, as the reference algorithm. It only uses the GPU shared memory space to store the buffer that contains a copy of the working row/column data. The constant memory space is used to store the filter taps. We call each CUDA kernel with a one-dimensional number of thread blocks, NBLOCKS, and a one-dimensional number of threads per block, NTHREADS.
In the horizontal DWT filtering process, each image row is stored in the shared memory of a thread block. After that, in the vertical filtering, each column is processed in the same way. The row or column size determines the NBLOCKS parameter, which must be greater than or equal to the image width in the horizontal step or the image height in the vertical step. One of the goals of the proposed CUDA-based methods is not to increase memory requirements, so we store the resulting wavelet coefficients in the original image memory space.
For computing the DWT, the threads use the shared memory space, where access latency is extremely low. The CUDA-Sep 9/7 algorithm stores the original image in the GPU global memory but computes the filtering steps from the shared memory.
Execution on the GPU is organized in groups of 32 threads called warps. Each warp must load a block of the image from the global memory into a shared memory array with BLOCKSIZE pixels. As can be seen in Figure 2, the number of thread blocks (or tiles), NBLOCKS, depends on BLOCKSIZE and the image dimensions. Moreover, pixels located at the border of a block also need neighbor pixels from other blocks to compute the convolution. These regions are called aprons and are shaded in the last row and column of Figure 2a, b. The size of the apron region depends on the filter radius, which is half of the filter length minus 1, i.e., (filter length − 1)/2. In both figure panels, the values of the filter radius and the filter length corresponding to the Daubechies 9/7 filter are presented.
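The relation between BLOCKSIZE, the filter radius, and the shared memory footprint can be made concrete with a few lines of arithmetic. The sketch below is only an illustration: the function names, the 256-pixel BLOCKSIZE, and the 4-byte sample size are assumptions, and the radius is taken as (filter length − 1)/2, which gives 4 for a nine-tap filter.

```python
import math


def filter_radius(filter_length: int) -> int:
    """Radius of an odd-length symmetric filter: (length - 1) / 2."""
    return (filter_length - 1) // 2


def tile_layout(line_length: int, block_size: int, filter_length: int,
                bytes_per_sample: int = 4) -> dict:
    """Tiles needed to cover one image row or column, and the per-tile shared memory."""
    radius = filter_radius(filter_length)
    shared_samples = block_size + 2 * radius  # tile plus an apron on each side
    return {
        "radius": radius,
        "tiles_per_line": math.ceil(line_length / block_size),
        "shared_samples": shared_samples,
        "shared_bytes": shared_samples * bytes_per_sample,
    }


if __name__ == "__main__":
    # Hypothetical BLOCKSIZE of 256 pixels with the nine-tap filter of the 9/7 pair.
    for length in (512, 2048, 2560):
        print(length, tile_layout(length, block_size=256, filter_length=9))
```

For these assumed values, each tile needs 264 samples (1,056 bytes as 32-bit floats), well within the 16 kB of shared memory per block mentioned earlier.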
We can reduce the number of idle threads by reducing the total number of threads per block and also using each thread to load multiple pixels into the shared memory. This ensures that all threads of each warp are active during the computation stage. Note that the number of threads in a block must be a multiple of the warp size (32 threads on GTX 280) for optimal efficiency.
To achieve higher efficiency and higher memory throughput, the GPU attempts to coalesce accesses from multiple threads into a single memory transaction. If all threads within a warp (32 threads) simultaneously read consecutive words, then a single large read of the 32 values can be performed at optimum speed. In the CUDA-Sep 9/7 algorithm, the convolution process is separated into two stages:
The row filtering stage
The column filtering stage
Each row/column filtering stage is separated into two substages: (a) the threads load a block of pixels of one row/column from the global memory into the shared memory, and (b) each thread computes the filter over the data stored in the shared memory and the result is sent to the global memory. For the column filtering, the resulting coefficient is stored in the global memory after performing the perceptual weighting, i.e., multiplying the final coefficient by the perceptual weight corresponding to the wavelet subband of the coefficient.
In the row or column filtering, the pixels located in the image block borders also need adjacent pixels from other thread blocks to compute the DWT. The apron region must also be loaded in the shared memory, but only for reading purposes, because the filtered value of the pixels located there is computed by other thread blocks.
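The role of the apron can be demonstrated on the CPU by filtering one row tile by tile and comparing the result with filtering the whole row at once. The sketch below is a plain Python emulation under several assumptions: the nine taps are a placeholder symmetric kernel (not the actual 9/7 coefficients), the image border is handled by symmetric extension, and the low-pass/high-pass split and downsampling of a real DWT are omitted.

```python
from typing import List

# Placeholder nine-tap symmetric kernel; NOT the real 9/7 filter coefficients.
TAPS = [t / 25.0 for t in (1, 2, 3, 4, 5, 4, 3, 2, 1)]


def reflect(i: int, n: int) -> int:
    """Symmetric extension at the image border (an assumed boundary rule)."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def filter_row(row: List[float], taps: List[float]) -> List[float]:
    """Reference: filter the whole row at once."""
    r, n = len(taps) // 2, len(row)
    return [sum(taps[k] * row[reflect(i + k - r, n)] for k in range(len(taps)))
            for i in range(n)]


def filter_row_tiled(row: List[float], taps: List[float], block_size: int) -> List[float]:
    """Emulate thread blocks: each tile is copied to a 'shared' buffer with its apron."""
    r, n = len(taps) // 2, len(row)
    out = [0.0] * n
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        # Substage (a): load the tile plus r apron samples on each side (read-only).
        shared = [row[reflect(i, n)] for i in range(start - r, end + r)]
        # Substage (b): filter only the tile's own samples, never the apron.
        for t in range(end - start):
            out[start + t] = sum(taps[k] * shared[t + k] for k in range(len(taps)))
    return out


if __name__ == "__main__":
    row = [float((7 * i) % 11) for i in range(20)]
    same = all(abs(a - b) < 1e-12
               for a, b in zip(filter_row(row, TAPS), filter_row_tiled(row, TAPS, 8)))
    print("tiled result matches whole-row result:", same)
```

Each tile reads its apron but writes only its own outputs, so neighboring tiles never compute the same sample twice; this mirrors the statement that apron pixels are filtered by other thread blocks.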
The speedup achieved by the GPU-based DWT algorithm is up to 20 times relative to the sequential implementation on one core. Note that the wavelet transform is only the first step in an image/video encoder.
3 Performance evaluation
All evaluated encoders have been tested on an Intel Pentium Core 2 CPU at 1.8 GHz with 6 GB of RAM. We use an NVIDIA GTX 280 GPU that contains 30 multiprocessors with eight cores in each multiprocessor, 1 GB of global memory, and 16 kB of shared memory per block (or SM).
The proposed encoder is compared with the Kakadu 5.2.5 and SPIHT (SPIHT 8.01) encoders using two sets of test images: (a) a 512×512 image resolution set including Lena, Barbara, Balloon, Horse, Goldhill, Boat, Mandrill, and Zelda, and (b) a 2,048×2,560 image resolution set including Cafe, Bike, and Woman. When comparing with Kakadu, we perform two comparisons: one labeled as Kakadu_csf, which has its perceptual weighting mode enabled, and the other one, labeled as Kakadu, without perceptual weights.
First, we analyze the speedup of the GPU-based encoder using 2D-DWT described in the previous section with respect to the traditional convolution algorithm running in a single core processor.
In Table 2, we show for each test image, at different bitrates, the encoding times for SPIHT, Kakadu, and our proposal in milliseconds. The first six columns are related to our proposal: The SEQ-DWT column shows the time required by the DWT when running on a single core. The GPU-DWT column shows the time of the CUDA-Sep 9/7 DWT version when running on GPU. The Rate & Coder column shows the time required by the rate control and the encoding stage, this time being common for both the sequential and GPU 2D-DWT versions. The T.SEQ column shows the total time for the sequential version and the T.GPU the total time for the GPU version. Finally, the Speedup column shows the speedup of the GPU version compared to the sequential version. The last two columns are the total execution time, also in milliseconds, for the other encoders, SPIHT and Kakadu.
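The relationship between these columns can be written down directly. The sketch below assumes, as the column descriptions suggest, that each total is the DWT time plus the shared Rate & Coder time; the function name is hypothetical and the timings are made-up values, not entries from Table 2.

```python
def overall_speedup(seq_dwt_ms: float, gpu_dwt_ms: float, rate_coder_ms: float) -> float:
    """Speedup = T.SEQ / T.GPU, where both totals share the same Rate & Coder time."""
    t_seq = seq_dwt_ms + rate_coder_ms
    t_gpu = gpu_dwt_ms + rate_coder_ms
    return t_seq / t_gpu


if __name__ == "__main__":
    # Hypothetical timings: the DWT alone is 20 times faster on the GPU,
    # but the unchanged coding stage limits the overall gain.
    print(overall_speedup(seq_dwt_ms=100.0, gpu_dwt_ms=5.0, rate_coder_ms=15.0))
```

In this made-up case the overall speedup is 5.75 even though the transform itself runs 20 times faster, because the coding stage is not accelerated.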
When the target bitrate is low, i.e., at high compression rates, the uniform quantization of the wavelet coefficients produces a great number of nonsignificant coefficients in the low decomposition levels, so the root of each zero tree is located at higher decomposition levels. This reduces the computational cost because only the root of a zero tree needs to be encoded. As a consequence, the overall number of operations is reduced and the gain of the GPU-optimized version is reduced too.
Table 3 shows the comparison of the average execution times (milliseconds) of each image in the test set at different compression rates. The PE_LTW is faster than SPIHT regardless of the target rate for any image size. However, the Kakadu encoder is still faster than the PE_LTW. Although the PE_LTW runs its DWT stage on the GPU, this is the only optimized stage in the whole encoder. By contrast, all encoding stages in Kakadu 5.2.5 are fully optimized. Besides exploiting multithread and multicore hardware capabilities, Kakadu uses processor intrinsics such as MMX/SSE/SSE2/SIMD and a very fast multicomponent transform, i.e., a block transform, which is well suited for parallelization.
4 R/D evaluation
For evaluating image encoders, the most common performance metric is the well-known R/D, the trade-off between the encoder bitrate (bpp) and the reconstructed quality, typically measured in decibels through the PSNR of the luminance color plane. However, it is also well known that the PSNR quality measurement is not close to the human perception of quality and sometimes gives wrong quality scores, leading to erroneous conclusions when evaluating different encoding strategies.
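For reference, PSNR depends only on the mean square error between the two images, which is why it cannot reflect where in the frequency spectrum the error lies. A minimal sketch, assuming 8-bit luminance samples (peak value 255) stored as flat lists; the function name is illustrative.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], test: Sequence[float], peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized sample lists."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    ref = [10, 20, 30, 40]
    print(psnr(ref, [11, 21, 31, 41]))  # every sample off by one, MSE = 1
    print(psnr(ref, [12, 20, 30, 40]))  # one sample off by two, same MSE
```

Both test signals in the usage example have the same MSE and therefore the same PSNR, even though the error is distributed differently.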
Figure 3 shows the R/D comparison of the Woman (2,048×2,560) image compressed with the PE_LTW encoder, SPIHT, Kakadu, and Kakadu_csf, using PSNR as the quality metric. A misleading conclusion after looking at the R/D curves for the PE_LTW and Kakadu_csf is that the encoding strategies of those proposals are inappropriate, since their quality results are always lower than those of the other encoders, especially at high bitrates.
There are several studies about the convenience of using image quality assessment metrics other than PSNR that better fit human perceptual quality assessment (i.e., subjective test results) [14, 17, 19, 20]. One of the best behaving objective quality assessment metrics is visual information fidelity (VIF), which has been proven [17, 19] to have a better correlation with subjective perception than other metrics that are commonly used for encoder comparisons [14, 20]. The VIF metric uses statistical models of natural scenes in conjunction with distortion models in order to quantify the statistical information shared between the test and reference images.
As an example of how measuring the perceptual quality of images with PSNR is misleading, we show in Figure 4 a subjective comparison of the three encoders with a cropped region of the Woman test image compressed at 0.25 bpp. In this case, the third image, encoded with PE_LTW, seems to have better subjective quality than the other two. This observation contradicts the conclusion obtained from Figure 3, which suggests that at this rate PE_LTW is worse than SPIHT and Kakadu. The same behavior can be observed with the other test images. So it is better not to rely on how PSNR ranks quality and to use instead a perceptually inspired quality assessment metric like VIF, which, as stated in [17, 19], has a better correlation with the human perception of image quality.
So we will use the VIF metric in our R/D comparisons. Figure 5 shows the R/D results for some of the test images. As shown, the PE_LTW encoder can achieve higher compression rates while maintaining the same perceptual quality as the other encoders, i.e., a bitrate saving is obtained when using the PE_LTW instead of Kakadu or SPIHT at a desired quality.
Table 4 shows the rate savings obtained with PE_LTW vs. Kakadu, SPIHT, and Kakadu_csf. The VIF interval varies from 0.1 to 0.95 VIF quality units, 0.1 being the worst quality. This table groups the results by image resolution. Results are expressed as percentages of saved rate in the aforementioned VIF interval.
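The text does not state how the savings are aggregated over the VIF interval, so the following sketch only illustrates one plausible building block: the rate saving at a single target VIF value, obtained by linear interpolation between measured R/D points. The function names and the R/D points are hypothetical and are not data from Table 4 or Figure 5.

```python
from typing import List, Tuple

RDPoint = Tuple[float, float]  # (rate in bpp, VIF quality)


def rate_at_quality(points: List[RDPoint], vif: float) -> float:
    """Linearly interpolate the rate needed to reach a given VIF value."""
    ordered = sorted(points, key=lambda p: p[1])
    for (r0, v0), (r1, v1) in zip(ordered, ordered[1:]):
        if v0 <= vif <= v1 and v1 > v0:
            return r0 + (r1 - r0) * (vif - v0) / (v1 - v0)
    raise ValueError("target VIF lies outside the measured range")


def rate_saving_percent(ours: List[RDPoint], other: List[RDPoint], vif: float) -> float:
    """Percentage of rate saved by 'ours' with respect to 'other' at equal VIF."""
    r_other = rate_at_quality(other, vif)
    r_ours = rate_at_quality(ours, vif)
    return 100.0 * (r_other - r_ours) / r_other


if __name__ == "__main__":
    # Made-up R/D points for two encoders.
    ours = [(0.1, 0.2), (0.5, 0.6)]
    other = [(0.2, 0.2), (1.0, 0.6)]
    print(rate_saving_percent(ours, other, vif=0.4))
```

Averaging such per-quality savings over a set of VIF values in the interval would give a single percentage per encoder pair, but that aggregation step is an assumption rather than the procedure used for Table 4.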
We have presented a perceptual image wavelet encoder whose 2D-DWT stage is implemented using CUDA running on a GPU. Our proposed perceptual encoder reveals the importance of exploiting the contrast sensitivity function behavior of the HVS by means of an accurate perceptual weighting of wavelet coefficients. PE_LTW is very competitive in terms of perceptual quality, being able to obtain important bitrate savings regardless of the image resolution and at any bitrate when compared with SPIHT and Kakadu with and without its perceptual weighting mode enabled. The PE_LTW encoder is able to produce a quality-equivalent image with respect to the other two encoders with a reduced rate.
As the 2D-DWT transform runs on a GPU, the overall encoding time is highly reduced compared to the sequential version of the same encoder, obtaining maximum speedups of 6.86 for 512×512 images and 4.39 for 2,048×2,560 images. Compared with SPIHT and Kakadu, our proposal is clearly faster than SPIHT but needs additional optimizations to outperform Kakadu times.
ISO: JPEG 2000 image coding system. Part 1: core coding system. Geneva: ISO; 2000.
Said A, Pearlman A: A new, fast and efficient image codec based on set partitioning in hierarchical trees. IEEE Trans. Circ. Syst. Video Technol 1996, 6(3):243-250. 10.1109/76.499834
Shapiro JM: A fast technique for identifying zerotrees in the EZW algorithm. Proc. IEEE Int. Conf. Acoust., Speech, Signal Process 1996, 3: 1455-1458.
Mallat SG: A theory for multi-resolution signal decomposition: the wavelet representation. IEEE Trans. Pat. Anal. Mach. Intel 1989, 11(7):674-693. 10.1109/34.192463
Lopez O, Martinez M, Pinol P, Malumbres MP, Oliver J: E-LTW: an enhanced LTW encoder with sign coding and precise rate control. In 2009 16th IEEE International Conference on Image Processing (ICIP). Piscataway: IEEE; 2009:2821-2824.
Taubman DS, Marcellin MW: JPEG2000 Image Compression Fundamentals, Standards and Practice. Berlin: Springer; 2002.
Sheikh HR, Bovik AC, de Veciana G: An information fidelity criterion for image quality assessment using natural scene statistics. IEEE Trans. Image Process 2005, 14(12):2117-2128.
Sweldens W: The lifting scheme: a custom-design construction of biorthogonal wavelets. Appl. Comput. Harmonic Anal 1996, 3(2):186-200. 10.1006/acha.1996.0015
Nadenau MJ, Reichel J, Kunt M: Wavelet-based color image compression: exploiting the contrast sensitivity function. IEEE Trans. Image Process 2003, 12(1):58-70. 10.1109/TIP.2002.807358
Deever A, Hemami SS: What’s your sign?: efficient sign coding for embedded wavelet image coding. In Proceedings of the Data Compression Conference, 2000 (DCC 2000). IEEE; 2000:273-282.
López O, Martinez-Rach M, Oliver J, Malumbres MP: Impact of rate control tools on very fast non-embedded wavelet image encoders. In Visual Communications and Image Processing 2007. Piscataway: IEEE; 2007.
Oliver J, Malumbres MP: Low-complexity multiresolution image compression using wavelet lower trees. IEEE Trans. Circ. Syst. Video Technol 2006, 16(11):1437-1444.
Mannos J, Sakrison D: The effects of a visual fidelity criterion of the encoding of images. IEEE Trans. Info. Theory 1974, 20(4):525-536. 10.1109/TIT.1974.1055250
Wang Z, Bovik A, Sheikh H, Simoncelli EP: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process 2004, 13(4):600-612. 10.1109/TIP.2003.819861
Watson AB, Yang GY, Solomon JA, Villasenor J: Visibility of wavelet quantization noise. IEEE Trans. Image Process 1997, 6(8):1164-1175. 10.1109/83.605413
Moumkine N, Tamtaoui A, Ait Ouahman A: Integration of the contrast sensitivity function into wavelet codec. In Proceedings of the Second International Symposium on Communications, Control and Signal Processing (ISCCSP 2006). Marrakech; 13–15 Mar 2006.
Gao X, Lu W, Tao D, Li X: Image quality assessment based on multiscale geometric analysis. IEEE Trans. Image Process 2009, 18(7):1409-1423.
Beegan AP, Iyer LR, Bell AE, Maher VR, Ross MA: Design and evaluation of perceptual masks for wavelet image compression. In Proceedings of the 2002 IEEE 10th Digital Signal Processing Workshop, 2002 and the 2nd Signal Processing Education Workshop. Piscataway: IEEE; 88-93.
Martinez-Rach M, Lopez O, Piñol P, Oliver J, Malumbres MP: A study of objective quality assessment metrics for video codec design and evaluation. In Eighth IEEE International Symposium on Multimedia, vol.1. San Diego: IEEE Computer Society; 2006:517-524.
Sheikh HR, Sabir MF, Bovik AC: A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans. Image Process 2006, 15(11):3440-3451.
This research was supported by the Spanish Ministry of Education and Science under grant TIN2011-27543-C03-03.
The authors declare that they have no competing interests.
Martínez-Rach, M.O., López-Granado, O., Galiano, V. et al. Enhancing LTW image encoder with perceptual coding and GPU-optimized 2D-DWT transform. EURASIP J. Adv. Signal Process. 2013, 141 (2013). https://doi.org/10.1186/1687-6180-2013-141
- Wavelet image coding
- Perceptual coding
- Contrast sensitivity function
- GPU optimization