GPU-based 3D lower tree wavelet video encoder
Vicente Galiano
Otoniel López-Granado
Manuel P Malumbres
Leroy Anthony Drummond
Hector Migallón
https://doi.org/10.1186/1687-6180-2013-24
© Galiano et al.; licensee Springer. 2013
- Received: 15 November 2012
- Accepted: 6 February 2013
- Published: 19 February 2013
Abstract
The 3D-DWT is a mathematical tool of increasing importance in applications that require efficient processing of huge amounts of volumetric information. Other applications, such as professional video editing, video surveillance, multi-spectral satellite imaging, and high-quality (HQ) video delivery, would rather use 3D-DWT encoders to reconstruct a frame as fast as possible. In this article, we introduce a fast GPU-based encoder that uses the 3D-DWT and lower trees. We also present an exhaustive analysis of the use of GPU memory. Our proposal shows a good trade-off between R/D performance, coding delay (as fast as MPEG-2 for high definition) and memory requirements (up to 6 times less memory than x264).
Keywords
- 3D-DWT
- Video coding
- GPU
- Manycore
1 Introduction
At video content production stages, digital video processing applications require fast frame random access to perform an undefined number of real-time decompressing-editing-compressing interactive operations, without a significant loss of original video content quality. Intra-frame coding is desirable as well in many other applications like video archiving, high-quality high-resolution medical and satellite video sequences, applications requiring simple real-time encoding like video-conference systems, or even professional or home video surveillance systems [1] and digital video recording (DVR) systems. However, intra coding does not take advantage of the temporal redundancy between frames.
In recent years, particularly in areas such as video watermarking [2] and 3D coding (e.g., compression of volumetric medical data [3] or multispectral images [4], 3D model coding [5], and especially video coding), three-dimensional wavelet transform (3D-DWT) based encoders have arisen as an alternative between simple intra coding and complex inter coding solutions that apply motion compensation between frames to exploit temporal redundancy.
In [6], the authors used 3-D spatio-temporal subband decomposition and geometric vector quantization (GVQ). In [7], a full color video coder based on 3-D subband coding with camera pan compensation was presented. In [8], an extension to 3D of the well-known embedded zerotree wavelet (EZW) algorithm developed by Shapiro [9] was presented. Similarly, an extension to 3D-DWT of the set partitioning in hierarchical trees (SPIHT) algorithm developed by Said and Pearlman [10] was presented in [11], using a tree with eight descendants per coefficient instead of the typical quad-trees of image coding. All of these 3D-DWT based encoders are faster than complex inter coding schemes but slower than simple intra coding solutions. In this study, we therefore try to speed up 3D video encoders to achieve coding delays as close as possible to those obtained by intra video encoders, but with clearly superior compression performance. To achieve this goal, we focus on GPU-based platforms.
Extensive research has been carried out to accelerate the DWT, especially the 2D-DWT, exploiting both multicore architectures and graphics processing units (GPUs). In [12], a Single Instruction Multiple Data (SIMD) algorithm runs the 2D-DWT on a GeForce 7800 GTX using Cg and OpenGL, with a remarkable speed-up. A similar effort was made in [13], combining Cg and the 7800 GTX to report a 1.2–3.4 speed-up versus a CPU counterpart. In [14], the authors present a CUDA implementation of the 2D-FWT running more than 20 times as fast as the sequential C version on a CPU, and more than twice as fast as the optimized OpenMP and Pthreads versions implemented on a multicore CPU. In [15], the authors present GPU implementations of the 2D-DWT obtaining speed-ups of up to 20 when compared to the sequential CPU algorithm.
In this study, we present a GPU 3D-DWT based video encoder using lower trees as the core coding system. The proposed encoder requires less memory than 3D-SPIHT [11] and has good R/D behavior. Furthermore, we present an in-depth analysis of the use of GPUs to accelerate the 3D-DWT. Using these strategies, the proposed encoder is able to compress a Full-HD video sequence in real time.
The rest of the article is organized as follows. Section 2 presents the proposed 3D-DWT based encoder. In Section 3, a performance evaluation in terms of R/D, memory requirements and coding time is presented. Section 4 describes several optimization proposals based on CUDA to process the 3D-DWT transform, while in Section 5 we analyze these proposals when applied to the proposed encoder. Finally in Section 6 some conclusions are drawn.
2 Encoding system
In this section, we present a 3D-DWT based encoder with low complexity and good R/D performance. As our main concern is a fast encoding process, no R/D optimization, motion estimation/motion compensation (ME/MC) or bit-plane processing is applied. This encoder is based on both the 3D-DWT and lower trees (3D-LTW).
After all 3D-DWT decomposition levels are applied, all the resulting wavelet coefficients are quantized and then, the encoding system compresses the input data to obtain the final bit-stream corresponding to that GOP. It is important to remark that the compressed bit-stream is ordered in such a way that the decoder obtains the bit-stream in the correct order.
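To make the transform stage concrete, the following is a minimal Python sketch of one decomposition level of a separable 3D-DWT over a GOP. It is an illustration only: it swaps in the Haar filter (average and difference of sample pairs) instead of the Daubechies 9/7F filter used by the encoder, the row, column, temporal filtering order is an assumption, and the names `haar_1d` and `dwt3d_one_level` are hypothetical.

```python
def haar_1d(signal):
    """One Haar analysis step: low-pass half followed by high-pass half."""
    half = len(signal) // 2
    low = [(signal[2 * i] + signal[2 * i + 1]) / 2 for i in range(half)]
    high = [(signal[2 * i] - signal[2 * i + 1]) / 2 for i in range(half)]
    return low + high


def dwt3d_one_level(gop):
    """Filter gop[t][y][x] along rows, then columns, then time (assumed order)."""
    frames, height, width = len(gop), len(gop[0]), len(gop[0][0])
    out = [[row[:] for row in frame] for frame in gop]
    # Row (horizontal) filtering of every frame.
    for t in range(frames):
        for y in range(height):
            out[t][y] = haar_1d(out[t][y])
    # Column (vertical) filtering of every frame.
    for t in range(frames):
        for x in range(width):
            column = haar_1d([out[t][y][x] for y in range(height)])
            for y in range(height):
                out[t][y][x] = column[y]
    # Temporal filtering across the frames of the GOP.
    for y in range(height):
        for x in range(width):
            line = haar_1d([out[t][y][x] for t in range(frames)])
            for t in range(frames):
                out[t][y][x] = line[t]
    return out


# A constant GOP (2 frames of 4 x 4 pixels) keeps all its energy in the
# low-pass corner; every high-pass coefficient is zero.
gop = [[[7.0] * 4 for _ in range(4)] for _ in range(2)]
coeffs = dwt3d_one_level(gop)
```

Further decomposition levels would repeat the same three passes on the low-pass corner of the result.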
2.1 Lower-tree wavelet coding
The proposed video coder is based on the LTW image coding algorithm [16]. As in the LTW encoder, the proposed video codec uses scalar uniform quantization controlled by two quantization parameters: rplanes and Q. The finer quantization consists of applying a scalar uniform quantization, Q, to all wavelet coefficients. The coarser quantization is based on removing the rplanes least significant bit planes from the wavelet coefficients.
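The two quantization steps can be sketched as follows. This is a hedged illustration, not the encoder's actual code: the function names are hypothetical, and rounding toward zero in the uniform quantizer is an assumption, since the text does not specify the rounding rule.

```python
def quantize(coefficient, q_step):
    """Finer step: uniform scalar quantization (truncation is an assumption)."""
    return int(coefficient / q_step)


def is_significant(quantized, rplanes):
    """A coefficient is significant if its magnitude reaches 2**rplanes."""
    return abs(quantized) >= (1 << rplanes)


def drop_bit_planes(quantized, rplanes):
    """Coarser step: discard the rplanes least significant bit planes."""
    magnitude = abs(quantized) >> rplanes
    return -magnitude if quantized < 0 else magnitude


# Example with Q = 2.0 and rplanes = 3 (threshold 2**3 = 8).
large = quantize(-37.5, 2.0)   # -18, significant
small = quantize(9.9, 2.0)     # 4, insignificant
```

With these values, `drop_bit_planes(large, 3)` keeps only the upper bit planes of the magnitude, while `small` falls below the threshold and becomes a candidate for a lower tree.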
The encoder uses a tree structure to reduce data redundancy among subbands (similar to that of [11]), and also as a fast way of grouping coefficients, reducing the number of symbols needed to encode the image. This structure is called a lower tree, and all the coefficients in the tree are lower than 2^{rplanes}. In Figure 1, an example of the relationship between subbands is presented.
Let us describe the coding algorithm. In the first stage (symbol computation), all wavelet subbands are scanned from the first decomposition level to the Nth (so that the lower trees can be built from leaves to root), and the encoder has to determine whether each 2 × 2 block of coefficients of both subband frames is part of a lower tree. In the first-level subband (see Figure 1), if the eight coefficients in these blocks (2 blocks of 2 × 2 coefficients) are insignificant (i.e., lower than 2^{rplanes}), they are considered to be part of the same lower tree and are labeled as LOWER_COMPONENT. Then, when scanning upper-level subbands, if both 2 × 2 blocks have eight insignificant coefficients and all their direct descendants are LOWER_COMPONENT, the coefficients in those blocks are labeled as LOWER_COMPONENT, increasing the lower-tree size.
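The LOWER_COMPONENT test described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function name and the flat list of eight coefficients are hypothetical, and the descendant labels are passed in explicitly rather than looked up in the subband structure.

```python
LOWER_COMPONENT = "LOWER_COMPONENT"


def blocks_are_lower_component(eight_coeffs, rplanes, descendant_labels=()):
    """Check whether two 2 x 2 blocks (eight coefficients) join a lower tree.

    At the first decomposition level there are no descendants, so
    descendant_labels is empty and only the magnitudes are tested.
    """
    threshold = 1 << rplanes
    all_insignificant = all(abs(c) < threshold for c in eight_coeffs)
    descendants_lower = all(
        label == LOWER_COMPONENT for label in descendant_labels
    )
    return all_insignificant and descendants_lower


# rplanes = 3, so the significance threshold is 8.
first_level = blocks_are_lower_component([1, -3, 0, 7, 2, -5, 4, 0], 3)
one_large = blocks_are_lower_component([1, -3, 0, 9, 2, -5, 4, 0], 3)
upper_level = blocks_are_lower_component(
    [0, 1, 0, 0, -2, 0, 0, 3], 3, [LOWER_COMPONENT, "ISOLATED_LOWER"]
)
```

Here `first_level` is accepted, `one_large` is rejected because one coefficient reaches the threshold, and `upper_level` is rejected because one direct descendant is not LOWER_COMPONENT.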
As in the original LTW image encoder, when there is at least one significant coefficient in one of the two blocks of 2 × 2 coefficients or in their descendant coefficients, we need to encode each coefficient separately. Recall that in this case, if a coefficient and all its descendants are insignificant, we use the LOWER symbol to encode the entire tree, but if the coefficient is insignificant and has a significant descendant, the coefficient is encoded as ISOLATED_LOWER. However, if all descendants of a significant coefficient are insignificant (LOWER_COMPONENT), we use a special symbol indicating the number of bits needed to represent it together with a superscript L (e.g., 4^{L}).
Finally, in the second stage, subbands are encoded from the LLL_{N} subband to the first-level wavelet subbands, and the symbols computed in the first stage are entropy coded by means of an arithmetic encoder. Recall that no LOWER_COMPONENT is encoded. The values of significant coefficients and their signs are raw encoded.
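The per-coefficient symbol choice can be summarized with a small Python sketch. The function name, the string form of the symbols, and the use of the plain bit count for a significant coefficient with significant descendants are assumptions made for illustration; only the LOWER, ISOLATED_LOWER and superscript-L cases are stated in the text.

```python
def classify(coefficient, descendants_all_insignificant, rplanes):
    """Return the symbol for one coefficient that must be coded separately."""
    significant = abs(coefficient) >= (1 << rplanes)
    if not significant:
        if descendants_all_insignificant:
            return "LOWER"           # the whole tree is insignificant
        return "ISOLATED_LOWER"      # insignificant, significant descendant
    nbits = abs(coefficient).bit_length()
    if descendants_all_insignificant:
        return f"{nbits}^L"          # e.g. 4^L: four bits, lower descendants
    return str(nbits)                # assumed: plain number-of-bits symbol


# rplanes = 3: magnitudes below 8 are insignificant.
symbols = [
    classify(3, True, 3),
    classify(3, False, 3),
    classify(-12, True, 3),
    classify(18, False, 3),
]
```

For the four cases above the sketch yields LOWER, ISOLATED_LOWER, 4^L and 5 respectively.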
3 Performance evaluation
In this section, we compare the performance of our proposed encoder (3D-LTW), using the Daubechies 9/7F filter for both the spatial and temporal domains, with the following video encoders: 3D-SPIHT, H.264, x264 and MPEG-2.
The performance metrics employed in the tests are R/D performance, coding and decoding delay and memory requirements. All the evaluated encoders have been tested on an Intel PentiumM Dual Core 3.0 GHz with 2 GB of RAM.
The test video sequences used in the evaluation are: Foreman (QCIF and CIF) 300 frames, Container (QCIF and CIF) 300 frames, News (QCIF and CIF) 300 frames, Hall (QCIF and CIF) 300 frames, Mobile (ITU D1 576p30) 40 frames, Station2 (HD 1024p25) 312 frames, Ducks (HD 1024p50) 130 frames and Ducks (SHD 2048p50) 130 frames.
It is important to remark that the evaluated MPEG-2 and x264 implementations are fully optimized, using CPU capabilities like Multimedia Extensions (MMX2, SSE2Fast, SSSE3, etc.) and multithreading, whereas the 3D-DWT based encoders (3D-SPIHT and 3D-LTW) are non-optimized C++ implementations.
3.1 Memory requirements
Memory requirements for evaluated encoders (KB)
Codec | QCIF | CIF | ITU-D1 | Full-HD |
---|---|---|---|---|
H264 | 35824 | 86272 | 227620 | 489960 |
x264 | 10752 | 18076 | 36600 | 178940 |
MPEG-2 | 4696 | 6620 | 9164 | 32820 |
3D-SPIHT | 10152 | 34504 | 118460 | 645720 |
3D-LTW | 1611 | 6390 | 20576 | 123072 |
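The memory savings can be read directly off the table as ratios. The short Python sketch below, with hypothetical variable and function names, divides the memory of each reference codec by that of 3D-LTW for every video format, using only the figures listed above.

```python
FORMATS = ["QCIF", "CIF", "ITU-D1", "Full-HD"]

# Memory requirements in KB, copied from the table above.
MEMORY_KB = {
    "x264": [10752, 18076, 36600, 178940],
    "3D-SPIHT": [10152, 34504, 118460, 645720],
    "3D-LTW": [1611, 6390, 20576, 123072],
}


def ratio_to_ltw(codec):
    """Memory of `codec` divided by memory of 3D-LTW, per video format."""
    return {
        fmt: MEMORY_KB[codec][i] / MEMORY_KB["3D-LTW"][i]
        for i, fmt in enumerate(FORMATS)
    }


x264_ratio = ratio_to_ltw("x264")
spiht_ratio = ratio_to_ltw("3D-SPIHT")

for fmt in FORMATS:
    print(f"{fmt}: x264 {x264_ratio[fmt]:.2f}x, 3D-SPIHT {spiht_ratio[fmt]:.2f}x")
```

The largest ratio with respect to x264 occurs at QCIF resolution, while the ratio with respect to 3D-SPIHT stays above 5 for all four formats.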
3.2 R/D performance
Average PSNR (dB) with different bit rate and coders
Bit rate (Kbps) / PSNR (dB) | x264 | MPEG-2 | x264 Intra | 3D-SPIHT | 3D-LTW |
---|---|---|---|---|---|
Foreman (CIF) | |||||
3040 | 44.99 | 40.74 | 39.95 | 40.32 | 41.38 |
1520 | 41.80 | 37.10 | 35.29 | 36.42 | 36.67 |
760 | 38.90 | 34.09 | 31.43 | 33.35 | 33.42 |
380 | 35.60 | 31.59 | 28.15 | 30.78 | 30.68 |
190 | 31.99 | 29.32 | 25.07 | 28.53 | 28.54 |
Container (CIF) | |||||
3040 | 47.20 | 43.59 | 37.97 | 47.82 | 46.54 |
1520 | 43.60 | 40.43 | 33.04 | 43.99 | 41.93 |
760 | 40.50 | 37.19 | 29.22 | 39.54 | 37.39 |
380 | 37.09 | 34.48 | 25.88 | 35.20 | 33.31 |
190 | 33.89 | 32.05 | 23.27 | 31.10 | 29.79 |
Hall (CIF) | |||||
3040 | 42.92 | 42.29 | 41.19 | 44.68 | 44.46 |
1520 | 40.55 | 39.89 | 36.60 | 42.27 | 41.66 |
760 | 38.94 | 37.95 | 31.89 | 40.11 | 38.93 |
380 | 37.25 | 35.95 | 27.32 | 37.39 | 35.43 |
190 | 34.80 | 33.59 | 23.88 | 33.56 | 31.90 |
Mobile (ITU-D1) | |||||
6400 | 40.33 | 37.82 | 35.56 | 38.24 | 38.86 |
3598 | 38.82 | 36.09 | 32.53 | 35.07 | 35.59 |
2100 | 37.57 | 34.37 | 30.12 | 32.53 | 32.69 |
1142 | 35.51 | 32.58 | 27.87 | 30.52 | 30.64 |
542 | 31.82 | 30.68 | 25.65 | 28.82 | 29.26 |
Ducks (Full-HD) 50fps | |||||
98304 | 37.34 | 38.49 | 36.26 | 37.77 | 36.07 |
49152 | 34.48 | 35.27 | 32.61 | 35.39 | 32.85 |
24576 | 32.46 | 32.28 | 29.16 | 33.68 | 31.49 |
12288 | 30.55 | 29.32 | 26.43 | 31.63 | 30.23 |
6144 | 28.47 | 27.82 | 24.19 | 28.99 | 29.19 |
3.3 Subjective evaluation
3.4 Coding/decoding time
4 3D-DWT optimizations
As the 3D-DWT computation requires more than 45% and up to 80% of the total coding time in the proposed encoder, in this section we present several GPU-based strategies to improve the 3D-DWT computation time.
Two different GPU architectures are used in this study. The first one is a GTX280, which contains 240 CUDA cores with 1 GB of dedicated video memory. The other one is a laptop GPU (GT540M) with 96 CUDA cores and 2 GB of dedicated video memory. There are significant differences between the two devices, which are reflected in the results shown in this section.
4.1 Performance evaluation of the GPU 3D-DWT
In this section, we present the performance evaluation of our GPU-based 3D-DWT algorithm in terms of computational and memory transfer times and the speed-ups obtained when compared to the CPU sequential algorithm. We present results for both previously mentioned GTX280 and GT540M platforms.
4.2 Memory access optimization
The previously presented algorithm uses the global memory to store both source and output data in the wavelet computation. A reasonable speed-up (13) has been obtained with high video resolutions. However, we can achieve better performance if we compute the filtering steps from the shared memory. A block of the frame (row/column or temporal array) can be loaded into a shared memory array with BLOCKSIZE pixels. The number of thread blocks, NBLOCKS, depends on BLOCKSIZE and the video frame resolution. Note that around the loaded video frame block there is an apron of neighboring pixels that must also be loaded into the shared memory in order to properly filter the video frame block. We can reduce the number of idle threads by reducing the total number of threads per block and by using each thread to load multiple pixels into shared memory. This ensures that all threads are active during the computation stage. Note that the number of threads in a block must be a multiple of the warp size (32 threads on the GTX280 and GT540M) for optimal efficiency. To achieve better efficiency and higher memory throughput, the GPU attempts to coalesce accesses from multiple threads into a single memory transaction. If all threads within a warp (32 threads) simultaneously read consecutive words, then a single large read of the 32 values can be performed at optimum speed.
In this new approach, each row/column/temporal filtering stage is separated into two sub-stages: (a) the threads load a block of pixels of one row/column/temporal array from the global memory into the shared memory, and (b) each thread computes the filter over the data stored in the shared memory and stores the results in the global memory. Special care is needed when a row or column processing tile is clamped by the video frame borders: the shared memory entries that fall outside the frame must be initialized with correct values. In this case, threads must also load into shared memory the values of adjacent pixels in order to compute the pixels located at the borders.
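The relationship between BLOCKSIZE, NBLOCKS and the number of threads per block can be illustrated with a short Python calculation. This is a sketch under assumptions: the function names and the `pixels_per_thread` parameter are hypothetical, the apron is ignored when sizing the thread block, and the concrete values are examples rather than the settings used in the experiments.

```python
WARP_SIZE = 32  # warp size stated in the text for the GTX280 and GT540M


def ceil_div(a, b):
    """Integer division rounded up."""
    return (a + b - 1) // b


def launch_config(line_length, block_size, pixels_per_thread):
    """Return (NBLOCKS, threads per block) for filtering one line of pixels.

    Each thread loads `pixels_per_thread` pixels, and the thread count is
    rounded up to a multiple of the warp size.
    """
    n_blocks = ceil_div(line_length, block_size)
    threads = ceil_div(block_size, pixels_per_thread)
    threads = ceil_div(threads, WARP_SIZE) * WARP_SIZE
    return n_blocks, threads


# A 1920-pixel row split into tiles of 256 pixels, two pixels per thread.
full_hd_row = launch_config(1920, 256, 2)
# A tile size that is not warp friendly gets its thread count rounded up.
odd_tile = launch_config(1080, 100, 3)
```

In the first case the row needs 8 thread blocks of 128 threads; in the second, 34 threads would suffice to cover a tile, so the count is rounded up to 64 to stay a multiple of the warp size.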
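The two sub-stages can be simulated on the CPU to check that tiled filtering with an apron gives the same result as filtering the whole line at once. The Python sketch below is only an illustration: a list stands in for the shared memory array, the 9-tap kernel is a placeholder rather than the Daubechies 9/7F coefficients, and symmetric (mirror) extension at the borders is an assumption, since the text does not state which border rule is used.

```python
def mirror(index, length):
    """Map an out-of-range index back inside by symmetric extension (assumed)."""
    if index < 0:
        return -index
    if index >= length:
        return 2 * (length - 1) - index
    return index


def filter_direct(line, taps):
    """Reference: filter the whole line straight from 'global memory'."""
    radius, n = len(taps) // 2, len(line)
    return [
        sum(taps[k] * line[mirror(i + k - radius, n)] for k in range(len(taps)))
        for i in range(n)
    ]


def filter_tiled(line, taps, block_size):
    """Filter the line tile by tile, as one thread block per tile would."""
    radius, n = len(taps) // 2, len(line)
    out = [0.0] * n
    n_blocks = (n + block_size - 1) // block_size
    for block in range(n_blocks):
        start = block * block_size
        end = min(start + block_size, n)
        # Sub-stage (a): load the tile plus its apron into "shared memory".
        shared = [line[mirror(g, n)] for g in range(start - radius, end + radius)]
        # Sub-stage (b): each "thread" filters one pixel from shared memory.
        for local in range(end - start):
            out[start + local] = sum(
                taps[k] * shared[local + k] for k in range(len(taps))
            )
    return out


# Placeholder symmetric 9-tap kernel (not the 9/7F filter coefficients).
TAPS = [w / 25 for w in (1, 2, 3, 4, 5, 4, 3, 2, 1)]
line = [float((7 * i) % 13) for i in range(37)]
tiled = filter_tiled(line, TAPS, 16)
direct = filter_direct(line, TAPS)
```

Because the apron has the same width as the filter radius on each side, every output pixel of a tile finds all the neighbors it needs in `shared`, and `tiled` matches `direct` even for the last, partially filled tile.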
5 Performance evaluation of the proposed encoder using GPUs
After analyzing the performance of the GPU 3D-DWT computation, we will present a comparison of the proposed encoder against the other encoders in terms of coding delay.
Although the GPU version of the 3D-LTW encoder has been sped up by up to 3.2 times, the bottleneck in the overall encoder is now the coding stage that follows the 3D-DWT, especially at low compression rates, where there are many significant coefficients to encode. Several strategies could be applied to speed up the proposed encoder even more, such as overlapping GPU computation with memory transfers, overlapping CPU processing with GPU processing, or using several GPUs to compute multiple 3D wavelet transforms from different GOPs.
6 Conclusions
In this article, we have presented the 3D-LTW video encoder based on 3D wavelet transform and lower trees with eight nodes. We have compared our algorithm against 3D-SPIHT, H.264, x264, and MPEG-2 encoders in terms of R/D, coding delay and memory requirements.
Regarding R/D, our proposal behaves better than MPEG-2. When compared to 3D-SPIHT, our proposal has a similar behavior for sequences with medium and high movement, but slightly lower performance for sequences with low movement like Container. However, our proposal requires 6 times less memory than 3D-SPIHT. Both 3D-DWT based encoders (3D-SPIHT and 3D-LTW) outperform x264 in Intra mode (by up to 11 dB) by exploiting only the temporal redundancy among video frames when applying the 3D-DWT. It is also worth noting the behavior of 3D-DWT based encoders when applied to high frame rate video sequences, where they obtain even better PSNR than x264 in Inter mode.
In order to speed up our encoder, we have presented an exhaustive analysis of GPU memory strategies to compute the 3D-DWT. As we have seen, the GPU 3D-DWT algorithm obtains good speed-ups, up to 16 on the GT540M platform and up to 39 on the GTX280. Using these optimizations, the proposed encoder (3D-LTW) is a very fast encoder, especially for Full-HD video resolutions, being able to compress a Full-HD video sequence in real time.
The fast coding/decoding process and the absence of motion estimation/motion compensation algorithms make the 3D-LTW encoder a good candidate for applications where the coding/decoding delay is critical for proper operation, or for applications where a frame must be reconstructed as soon as possible. 3D-DWT based encoders could be an intermediate solution between pure Intra encoders and complex Inter encoders.
Although the proposed 3D-LTW encoder has been developed for natural video sequences, for which the Daubechies 9/7F filter has been widely used in the literature for the 3D-DWT stage, other bi-orthogonal filters could be applied, depending on the final application. Even though longer filters better capture the frequency changes in an image, the R/D differences for natural images are negligible with respect to the Daubechies 9/7F filter. This effect could be extended to the temporal domain. However, longer filters increase the DWT computational complexity because more operations per pixel must be performed, making the encoder slower. On the other hand, if a longer filter is used in the DWT stage, the GPU speed-up will be greater, because more operations per pixel will be performed in parallel.
As future work, we intend to move other parts of the coding stage, such as the quantization stage, to the GPU in order to speed up the encoder even more. Furthermore, we intend to overlap the CPU computation stage with the GPU computation of the 3D-DWT stage. Regarding quantization on the GPU, our first attempts show that the time of the 3D-DWT stage on the GPU increases by 12% on average, while the time of the coding stage is reduced by 17% on average, which makes our encoder even faster.
Declarations
Acknowledgements
This research was supported by the Spanish Ministry of Education and Science under grant TIN2011-27543-C03-03 and the Spanish Ministry of Science and Innovation under grant number TIN2011-26254 and TEC2010-11776-E.
