GPU-based 3D lower tree wavelet video encoder
EURASIP Journal on Advances in Signal Processing, volume 2013, Article number: 24 (2013)
Abstract
The 3D-DWT is a mathematical tool of increasing importance in applications that require efficient processing of huge amounts of volumetric data. Applications such as professional video editing, video surveillance, multispectral satellite imaging, and high-quality video delivery would rather use 3D-DWT encoders to reconstruct a frame as fast as possible. In this article, we introduce a fast GPU-based encoder that uses the 3D-DWT and lower trees. We also present an exhaustive analysis of the use of GPU memory. Our proposal shows a good trade-off between R/D performance, coding delay (as fast as MPEG-2 for high-definition video) and memory requirements (up to 6 times less memory than x264).
1 Introduction
At video content production stages, digital video processing applications require fast random access to frames in order to perform an undefined number of real-time decompressing-editing-compressing interactive operations without a significant loss of the original video quality. Intra-frame coding is also desirable in many other applications, such as video archiving, high-quality high-resolution medical and satellite video sequences, applications requiring simple real-time encoding like videoconference systems, professional or home video surveillance systems[1], and digital video recording (DVR) systems. However, intra coding does not take advantage of the temporal redundancy between frames.
In recent years, three-dimensional wavelet transform (3D-DWT) based encoders have arisen as an alternative between simple intra coding and complex inter coding solutions that apply motion compensation between frames to exploit temporal redundancy. They have been used above all in areas such as video watermarking[2] and 3D coding (e.g., compression of volumetric medical data[3] or multispectral images[4], 3D model coding[5], and especially video coding).
In[6], the authors used 3D spatio-temporal subband decomposition and geometric vector quantization (GVQ). In[7], a full-color video coder based on 3D subband coding with camera pan compensation was presented. In[8], an extension to 3D of the well-known embedded zerotree wavelet (EZW) algorithm developed by Shapiro[9] was presented. Similarly, an extension to the 3D-DWT of the set partitioning in hierarchical trees (SPIHT) algorithm developed by Said and Pearlman[10] was presented in[11], using a tree with eight descendants per coefficient instead of the typical quadtrees of image coding. All of these 3D-DWT based encoders are faster than complex inter coding schemes but slower than simple intra coding solutions. In this study, we therefore try to speed up 3D video encoders so that their coding delays are as close as possible to those of intra video encoders, while keeping a clearly superior compression performance. To achieve this goal, we focus on GPU-based platforms.
Extensive research has been carried out to accelerate the DWT, especially the 2D-DWT, exploiting both multicore architectures and graphics processing units (GPUs). In[12], a Single Instruction Multiple Data (SIMD) algorithm runs the 2D-DWT on a GeForce 7800 GTX using Cg and OpenGL, with a remarkable speedup. A similar effort in[13] combines Cg and the 7800 GTX to report a speedup of 1.2 to 3.4 over a CPU counterpart. In[14], the authors present a CUDA implementation of the 2D-FWT that runs more than 20 times as fast as the sequential C version on a CPU, and more than twice as fast as optimized OpenMP and Pthreads versions on a multicore CPU. In[15], the authors present GPU implementations of the 2D-DWT that obtain speedups of up to 20 over the sequential CPU algorithm.
In this study, we present a GPU 3D-DWT based video encoder that uses lower trees as the core coding system. The proposed encoder requires less memory than 3D-SPIHT[11] and has good R/D behavior. Furthermore, we present an in-depth analysis of the use of GPUs to accelerate the 3D-DWT. Using these strategies, the proposed encoder is able to compress a Full-HD video sequence in real time.
The rest of the article is organized as follows. Section 2 presents the proposed 3D-DWT based encoder. Section 3 presents a performance evaluation in terms of R/D, memory requirements and coding time. Section 4 describes several CUDA-based optimization proposals to compute the 3D-DWT, and Section 5 analyzes these proposals when applied to the proposed encoder. Finally, Section 6 draws some conclusions.
2 Encoding system
In this section, we present a 3D-DWT based encoder with low complexity and good R/D performance. As our main concern is a fast encoding process, no R/D optimization, motion estimation/motion compensation (ME/MC) or bit-plane processing is applied. The encoder, called 3D-LTW, is based on both the 3D-DWT and lower trees.
First of all, the 3D-DWT is applied to a group of pictures (GOP). Figure 1 shows an example of a two-level 3D-DWT decomposition applied to an eight-frame video sequence. As can be seen on the left side, the spatial decomposition is applied to every video frame, resulting in four subbands ($LL\ast_1$, $LH\ast_1$, $HL\ast_1$, $HH\ast_1$). After the temporal decomposition is applied, we obtain the high-frequency temporal subbands (labeled $\ast\ast H_1$, in dark blue) and the low-frequency ones (labeled $\ast\ast L_1$, in light blue). The right side of Figure 1 shows the second decomposition level of the 3D-DWT: the same process is applied to the frames belonging to the $LLL_1$ subband, performing the spatial and temporal DWT filtering to obtain the corresponding subbands. Finally, the figure also shows the parent-offspring relationship between wavelet coefficients, which the coefficient encoder exploits. Each coefficient of a particular subband at the $N$th decomposition level has eight descendants in the $(N-1)$th decomposition level.
After all 3D-DWT decomposition levels have been applied, the resulting wavelet coefficients are quantized and the encoding system then compresses them to obtain the final bitstream corresponding to that GOP. Note that the compressed bitstream is ordered so that the decoder receives the data in the order in which it needs them.
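The subband geometry and the eight-descendant relationship described above can be sketched in a few lines. This is only an illustration under stated assumptions: coordinates are taken relative to the origin of each subband, and the descendants are obtained by the usual dyadic index doubling, which the text does not spell out. The function names `subband_shape` and `descendants` are hypothetical.

```python
from itertools import product


def subband_shape(width, height, frames, level):
    """Size (frames, rows, columns) of one subband at a given decomposition level.

    Each level halves the resolution in the two spatial dimensions and in time.
    """
    return (frames >> level, height >> level, width >> level)


def descendants(t, y, x):
    """Eight direct descendants of coefficient (t, y, x) in the next finer level.

    Assumption: indices are relative to the subband origin and each index is
    doubled, with an offset of 0 or 1 in every dimension.
    """
    return [(2 * t + dt, 2 * y + dy, 2 * x + dx)
            for dt, dy, dx in product((0, 1), repeat=3)]
```

For an eight-frame CIF GOP (352 × 288), `subband_shape(352, 288, 8, 2)` gives two frames of 88 × 72 coefficients per second-level subband, and every coefficient in such a subband maps to a 2 × 2 × 2 group in the corresponding first-level subband.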
2.1 Lower-tree wavelet coding
The proposed video coder is based on the LTW image coding algorithm[16]. As in the LTW encoder, the proposed video codec uses scalar uniform quantization controlled by two quantization parameters: rplanes and Q. The finer quantization consists of applying a scalar uniform quantization, Q, to all wavelet coefficients. The coarser quantization consists of removing the rplanes least significant bit planes from the wavelet coefficients.
The encoder uses a tree structure to reduce data redundancy among subbands (similar to that of[11]) and as a fast way of grouping coefficients, which reduces the number of symbols needed to encode the data. This structure is called a lower tree, and all the coefficients in the tree are lower than $2^{rplanes}$. Figure 1 shows an example of the relationship between subbands.
Let us describe the coding algorithm. In the first stage (symbol computation), all wavelet subbands are scanned from the first decomposition level to the $N$th (so that the lower trees can be built from leaves to root), and the encoder determines whether each 2 × 2 block of coefficients in both subband frames is part of a lower tree. In the first-level subbands (see Figure 1), if the eight coefficients in these blocks (two blocks of 2 × 2 coefficients) are insignificant (i.e., lower than $2^{rplanes}$), they are considered part of the same lower tree and are labeled LOWER_COMPONENT. Then, when scanning upper-level subbands, if both 2 × 2 blocks contain eight insignificant coefficients and all their direct descendants are LOWER_COMPONENT, the coefficients in those blocks are also labeled LOWER_COMPONENT, which increases the lower-tree size.
As in the original LTW image encoder, when there is at least one significant coefficient in one of the two 2 × 2 blocks or among their descendants, each coefficient must be encoded separately. In this case, if a coefficient and all its descendants are insignificant, the LOWER symbol encodes the entire tree; if the coefficient is insignificant but has a significant descendant, it is encoded as ISOLATED_LOWER. If a coefficient is significant and all its descendants are insignificant (LOWER_COMPONENT), we use a special symbol that indicates the number of bits needed to represent the coefficient, marked with a superscript L (e.g., $4^L$).
Finally, in the second stage, the subbands are encoded from the $LLL_N$ subband down to the first-level wavelet subbands, and the symbols computed in the first stage are entropy coded by means of an arithmetic encoder. Recall that no LOWER_COMPONENT symbol is encoded. The values of the significant coefficients and their signs are raw encoded.
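As a rough illustration of how the two quantization parameters interact, the sketch below applies a uniform step Q and then tests significance against the rplanes threshold. The rounding rule (truncation toward zero) and the function names are assumptions made for this example; the text does not specify them.

```python
def quantize(coeff, q):
    """Finer quantization: uniform scalar step q.

    Assumption: truncation toward zero; the exact rounding rule is not
    given in the text.
    """
    return int(coeff / q)


def is_significant(quantized, rplanes):
    """A coefficient is significant when its magnitude reaches 2**rplanes."""
    return abs(quantized) >= (1 << rplanes)


def drop_bitplanes(quantized, rplanes):
    """Coarser quantization: discard the rplanes least significant bit planes."""
    sign = -1 if quantized < 0 else 1
    return sign * (abs(quantized) >> rplanes)
```

With Q = 2.0 and rplanes = 4, a coefficient of 37.9 quantizes to 18, which is significant (18 ≥ 16), while a quantized value of 15 would fall inside a lower tree.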
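The LOWER_COMPONENT test of the first stage can be written as a single predicate. This is a minimal sketch, assuming the eight coefficients of the two 2 × 2 blocks are passed as a flat sequence of quantized integers and that the labels of the direct descendants have already been computed (empty for first-level subbands); the function name and the label passed for non-lower descendants are hypothetical.

```python
def is_lower_component(block_pair, rplanes, children_labels=()):
    """Return True when two 2x2 blocks (eight coefficients) join a lower tree.

    block_pair      -- the eight quantized coefficients of both blocks
    rplanes         -- significance threshold is 2**rplanes
    children_labels -- labels of all direct descendants; leave empty for
                       first-level subbands, which have no descendants
    """
    threshold = 1 << rplanes
    own_insignificant = all(abs(c) < threshold for c in block_pair)
    children_lower = all(label == "LOWER_COMPONENT" for label in children_labels)
    return own_insignificant and children_lower
```

Scanning from the first level upwards means `children_labels` is always available by the time a block pair at a higher level is visited, which is why the trees can be grown from leaves to root in one pass.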
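The symbol choice for a coefficient that must be encoded separately can be summarized as a small decision function. This sketch assumes quantized integer coefficients and a precomputed flag saying whether all descendants are insignificant. The string encoding of the symbols, and the plain bit-count symbol returned for a significant coefficient with significant descendants, are assumptions made for illustration.

```python
def classify(coeff, subtree_insignificant, rplanes):
    """Pick the symbol for one coefficient that is encoded separately.

    coeff                 -- quantized integer coefficient
    subtree_insignificant -- True when every descendant is insignificant
    rplanes               -- significance threshold is 2**rplanes
    """
    threshold = 1 << rplanes
    if abs(coeff) < threshold:
        # Insignificant coefficient: whole tree or isolated root.
        return "LOWER" if subtree_insignificant else "ISOLATED_LOWER"
    # Significant coefficient: the symbol carries the number of bits needed.
    nbits = abs(coeff).bit_length()
    return f"{nbits}L" if subtree_insignificant else str(nbits)
```

For example, with rplanes = 2 a coefficient of 9 needs four bits, so it yields the symbol written $4^L$ in the text when all its descendants are insignificant.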
3 Performance evaluation
In this section, we compare the performance of the proposed encoder (3D-LTW), which uses the Daubechies 9/7F filter in both the spatial and temporal domains, with the following video encoders:

3D-SPIHT[17].
H.264 (JM16.1 version) (high quality profile)[18].
MPEG-2 (ffmpegr25117), GOP size 15, sequence type IBBPBBP …[19].
x264 (mingw32libx264 r17131 high quality profile) in both Inter and Intra mode[19].
The performance metrics used in the tests are R/D performance, coding and decoding delay, and memory requirements. All the evaluated encoders have been tested on an Intel PentiumM Dual Core 3.0 GHz with 2 GB of RAM.
The test video sequences used in the evaluation are: Foreman (QCIF and CIF) 300 frames, Container (QCIF and CIF) 300 frames, News (QCIF and CIF) 300 frames, Hall (QCIF and CIF) 300 frames, Mobile (ITU D1 576p30) 40 frames, Station2 (HD 1024p25) 312 frames, Ducks (HD 1024p50) 130 frames and Ducks (SHD 2048p50) 130 frames.
Note that the evaluated MPEG-2 and x264 implementations are fully optimized, using CPU capabilities such as multimedia extensions (MMX2, SSE2Fast, SSSE3, etc.) and multithreading, whereas the 3D-DWT based encoders (3D-SPIHT and 3D-LTW) are non-optimized C++ implementations.
3.1 Memory requirements
Table 1 shows the memory requirements of the encoders under test. The 3D-LTW encoder running with a GOP size of 16 frames uses up to 6 times less memory than 3D-SPIHT, up to 22 times less memory than H.264 for QCIF resolution, and up to 6 times less memory than x264, an optimized implementation of H.264, for small sequence resolutions. Note that 3D-SPIHT keeps the compressed bitstream of a 16-frame GOP in memory until the whole compression is performed, whereas MPEG-2, H.264, 3D-LTW and x264 output the bitstream inline. Block-based encoders like MPEG-2 require less memory than the other encoders, especially for high-definition sequences.
3.2 R/D performance
Regarding R/D, Table 2 shows the R/D behavior of all evaluated encoders for different sequences. As shown, x264 obtains the best results for sequences with high movement, mainly due to the exhaustive ME/MC stage included in this encoder, in contrast to 3D-SPIHT and 3D-LTW, which do not include any ME/MC stage. The R/D behavior of 3D-SPIHT and 3D-LTW is similar for sequences with moderate to high motion activity, but for sequences with low movement 3D-SPIHT outperforms 3D-LTW, mainly due to the extra decomposition levels it applies in the high-frequency subbands. Figure 2 shows an example of this effect in two sequences, one with low motion activity (Container) and another with moderate motion activity (Foreman). Notice that the proposed 3D-LTW encoder outperforms the older MPEG-2 inter video encoder. It is also worth highlighting the significant R/D improvement of both 3D-LTW and 3D-SPIHT over the x264 intra encoder (up to 11 dB). This improvement is achieved solely by exploiting the temporal redundancy among video frames when applying the 3D-DWT. The behavior of the 3D-DWT based encoders for high frame rate video sequences like Ducks is also interesting: all 3D-DWT based encoders behave similarly to the other encoders, and even better than x264 in Inter mode.
3.3 Subjective evaluation
We have also performed a subjective evaluation of the proposed encoder. Figures 3 and 4 show the 33rd frame of the Ducks sequence in Full-HD format compressed at 13000 Kbps. As we can see, 3D-LTW and x264 obtain the best results, while MPEG-2 performs worse. Figure 4 also shows the poor performance of x264 Intra in this frame, where disturbing blocking artifacts appear. It is interesting that 3D-LTW performs well, even better than x264, although no ME/MC is applied in the proposed encoder.
3.4 Coding/decoding time
Figure 5 shows the total coding time (excluding I/O) of all evaluated encoders for different sequence resolutions. As can be seen, the MPEG-2 encoder is the fastest one due to its block-based processing algorithm. Regarding the 3D-DWT based encoders, the proposed 3D-LTW encoder is up to 7 times as fast as 3D-SPIHT and up to 6 times as fast as the x264 encoder.
Figure 6a shows the total coding time per frame of the 3D-LTW encoder for different video sequence resolutions as a function of the GOP size. For low-resolution sequences there are almost no differences in the total coding time, but for high-resolution video sequences the total coding time increases by up to 40% as the GOP size increases. Furthermore, the time required by the 3D-DWT stage ranges between 45 and 80% of the total coding time depending on the GOP size, as seen in Figure 6b. Improvements in the 3D-DWT computation will therefore substantially reduce the total coding time of the proposed encoder.
4 3D-DWT optimizations
As the 3D-DWT computation takes between 45% and 80% of the total coding time of the proposed encoder, in this section we present several GPU-based strategies to reduce the 3D-DWT computation time.
Two different GPU architectures are used in this study. The first one is a GTX280, which contains 240 CUDA cores and 1 GB of dedicated video memory. The other one is a laptop GPU (GT540M) with 96 CUDA cores and 2 GB of dedicated video memory. The significant differences between the two devices are reflected in the results shown in this section.
The algorithm used to compute the 3D-DWT on the GPUs is illustrated in Figure 7. Before the first computation step, the image data must be transferred from host memory to the global memory of the device. The number of frames transferred is given by the GOP size, so a larger GOP size requires more global memory on the GPU. All frames are stored in adjacent memory positions, so the memory required on the GPU to hold one GOP is Width × Height × frames × sizeof(float) bytes. As shown in Figure 7, this implementation uses two memory spaces in the global memory of the GPU: one for the input data and another for the output data after the filtering process. In the first step, each thread computes the row convolution and stores the result in the output memory. In the second step, the source data is the output of the previous step, so no memory copy is needed to prepare it; the column filter is applied, which completes the 2D-DWT of each frame, and the result is written back to the source memory space. In the third step, a 1D-DWT is performed along the temporal dimension, which leaves the result of the first-level 3D-DWT in the output memory space. In the last step, the data are transferred back to host memory so that the next GOP can be processed. If a second level is to be computed, the data must first be copied from the output space to the input space. In the second level, only half of the resolution in each dimension (the LLL subband) must be processed, iterating the same steps as for the first level.
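The three filtering steps and the two alternating memory spaces can be mimicked on the CPU with a short sketch. Two things here are assumptions made for brevity and are not what the encoder uses: a two-tap Haar averaging filter stands in for the Daubechies 9/7F filter, and plain nested lists stand in for GPU global memory. The names `gop_buffer_bytes`, `haar_1d` and `dwt3d_level` are hypothetical, and all dimensions are assumed to be even.

```python
def gop_buffer_bytes(width, height, frames, bytes_per_sample=4):
    """Bytes needed by ONE memory space (input or output) for a whole GOP."""
    return width * height * frames * bytes_per_sample


def haar_1d(v):
    """One-level Haar analysis (stand-in filter): averages, then differences."""
    half = len(v) // 2
    low = [(v[2 * i] + v[2 * i + 1]) / 2 for i in range(half)]
    high = [(v[2 * i] - v[2 * i + 1]) / 2 for i in range(half)]
    return low + high


def dwt3d_level(gop):
    """One 3D decomposition level of gop[t][y][x]: rows, columns, then time."""
    frames, height, width = len(gop), len(gop[0]), len(gop[0][0])
    # Step 1: row filtering, input space -> output space.
    out = [[haar_1d(row) for row in frame] for frame in gop]
    # Step 2: column filtering, output space -> input space.
    src = [[[0.0] * width for _ in range(height)] for _ in range(frames)]
    for t in range(frames):
        for x in range(width):
            col = haar_1d([out[t][y][x] for y in range(height)])
            for y in range(height):
                src[t][y][x] = col[y]
    # Step 3: temporal filtering, input space -> output space.
    for y in range(height):
        for x in range(width):
            tube = haar_1d([src[t][y][x] for t in range(frames)])
            for t in range(frames):
                out[t][y][x] = tube[t]
    return out
```

As a size check, a 1920 × 1080 GOP of 32 frames needs `gop_buffer_bytes(1920, 1080, 32)` = 265,420,800 bytes per memory space, so the two spaces together take roughly 506 MiB of global memory.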
4.1 Performance evaluation of the GPU 3D-DWT
In this section, we evaluate the performance of our GPU-based 3D-DWT algorithm in terms of computational and memory transfer times, and the speedups obtained over the sequential CPU algorithm. We present results for both the GTX280 and GT540M platforms.
Figure 8 shows the computational times for both GPU platforms and for two video sequence resolutions, with GOP sizes varying from 16 to 128 and four wavelet decomposition levels. For the ITU-D1 video frame resolution, the GTX280 is 2.3 times as fast as the GT540M in terms of GPU computational time. This is mainly due to the greater number of cores available in the GTX280 (2.5 times as many). Moreover, Figure 9a compares the GPU computational times of Figure 8 with the times needed to compute the wavelet transform on the CPU (Figure 6b) for a GOP size of 32: we obtain a speedup of around 16.6 on the GT540M and 38 on the GTX280. Computational times for Full-HD resolution on the GTX280 are not available due to global memory constraints.
However, only the computational time has been considered in this analysis. Figure 10 shows the total times including the transfer times between host memory and GPU memory. These total times are higher than the ones shown in Figure 8, by a factor of 1.3 on the GT540M and 3.73 on the GTX280. The total time including memory transfers is lower on the GT540M than on the GTX280 due to its significantly lower memory transfer time, thanks to a second-generation PCI Express bus that improves data transfers. As shown, data transfers between device memory and host memory introduce a significant penalty when using GPUs for general-purpose computing. Comparing the times of Figure 10 with the times measured on the CPU, we still obtain a good speedup of 12 on the GT540M and over 10 on the GTX280, as shown in Figure 9a.
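The relation between the compute-only speedups and the speedups that include transfers is simple arithmetic, shown here as a consistency check on the figures quoted in this section. The function name is hypothetical, and the inputs are the rounded values from the text, so the results are approximate.

```python
def speedup_with_transfers(compute_speedup, total_over_compute):
    """Speedup over the CPU once host/device transfers are included.

    compute_speedup    -- CPU time divided by GPU compute-only time
    total_over_compute -- (GPU compute + transfer time) / GPU compute time
    """
    return compute_speedup / total_over_compute


gt540m = speedup_with_transfers(16.6, 1.3)   # roughly 12.8
gtx280 = speedup_with_transfers(38.0, 3.73)  # roughly 10.2
```

Both values agree with the speedups of 12 and over 10 reported for the GT540M and the GTX280 when transfer times are taken into account.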
4.2 Memory access optimization
The algorithm presented above uses global memory to store both the source and the output data of the wavelet computation. A reasonable speedup (13) has been obtained for high video resolutions. However, we can achieve better performance if the filtering steps are computed from shared memory. A block of the frame (a segment of a row, column or temporal array) can be loaded into a shared memory array of BLOCKSIZE pixels. The number of thread blocks, NBLOCKS, depends on BLOCKSIZE and on the video frame resolution. Note that around the loaded block there is an apron of neighboring pixels that must also be loaded into shared memory in order to filter the block properly. Threads that only load apron pixels would remain idle during filtering; we can reduce the number of idle threads by reducing the total number of threads per block and using each thread to load multiple pixels into shared memory. This ensures that all threads are active during the computation stage. Note that the number of threads in a block should be a multiple of the warp size (32 threads on the GTX280 and GT540M) for optimal efficiency. To achieve better efficiency and higher memory throughput, the GPU attempts to coalesce accesses from multiple threads into a single memory transaction. If all threads within a warp (32 threads) simultaneously read consecutive words, then a single large read of the 32 values can be performed at optimum speed.
In this new approach, each row, column or temporal filtering stage is split into two substages: (a) the threads load a block of pixels of one row, column or temporal array from global memory into shared memory, and (b) each thread computes the filter over the data stored in shared memory and stores the result in global memory. We must also handle the cases in which a row or column processing tile is clamped by the video frame borders, initializing the out-of-range shared memory array entries with correct values. In either case, the threads must also load into shared memory the values of the adjacent pixels needed to compute the pixels located at the borders.
Figure 11 evaluates the new algorithm, which computes the wavelet transform using shared memory. As we can see, the execution time is considerably reduced on both GPUs. For example, for Full-HD video resolution and a GOP size of 32, the computational time improves by up to 1.83x on the GT540M and up to 3.5x on the GTX280 compared with the previous algorithm that uses global memory. Figure 9b compares the times shown in Figure 11 with the times needed to compute the 3D-DWT on the CPU, and shows a speedup of over 30 on the GT540M and 136 on the GTX280. However, the transfer times between host and GPU memory are too high for this improvement to be noticeable in the total GPU times. As shown in Figure 12, the total times increase considerably, and the wavelet computation accounts for only 8% of the total time needed to transfer the data and compute the transform. Figure 9b also shows the speedups of our proposal when the transfer times are taken into account: 19 and 11 on the GT540M and GTX280, respectively.
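The two substages and the apron handling can be checked on the CPU with a one-dimensional sketch in which a local list plays the role of the shared memory array. The border rule used here (clamping indices to the nearest valid sample) is an assumption; the text only says that clamped entries are initialized with correct values. The function names and the filter taps in the usage note are illustrative, not the 9/7 coefficients.

```python
def clamp(i, n):
    """Clamp index i to the valid range [0, n - 1]."""
    return max(0, min(n - 1, i))


def filter_direct(signal, taps):
    """Reference filter reading every sample straight from 'global memory'."""
    radius, n = len(taps) // 2, len(signal)
    return [sum(taps[k] * signal[clamp(i + k - radius, n)]
                for k in range(len(taps)))
            for i in range(n)]


def filter_tiled(signal, taps, block_size):
    """Same filter, computed tile by tile from a 'shared memory' copy."""
    radius, n = len(taps) // 2, len(signal)
    nblocks = (n + block_size - 1) // block_size  # NBLOCKS in the text
    out = [0] * n
    for b in range(nblocks):
        start = b * block_size
        # Substage (a): load the tile plus an apron of `radius` samples per side.
        shared = [signal[clamp(start - radius + j, n)]
                  for j in range(block_size + 2 * radius)]
        # Substage (b): filter from shared memory, store to global memory.
        for j in range(block_size):
            i = start + j
            if i < n:  # the last tile may be only partially filled
                out[i] = sum(taps[k] * shared[j + k] for k in range(len(taps)))
    return out
```

A nine-tap filter gives an apron of four samples on each side of a tile. Calling `filter_tiled(list(range(20)), [1, 2, 3, 4, 5, 4, 3, 2, 1], 8)` returns the same list as `filter_direct` on the same input, which confirms that the apron supplies every neighbor the filter needs.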
5 Performance evaluation of the proposed encoder using GPUs
Having analyzed the performance of the GPU 3D-DWT computation, we now compare the proposed encoder with the other encoders in terms of coding delay.
Figure 13 shows the coding speed (excluding I/O), in frames per second, of all evaluated encoders for different sequence resolutions at a quality of 30 dB. Our proposal now uses the GPU to compute the 3D-DWT stage. As can be seen, the 3D-LTW encoder is the fastest one, being up to 3.2 times as fast on average as the non-GPU version of the proposed encoder, up to 22 times as fast as 3D-SPIHT, and up to 19 times as fast as x264, which is a fully optimized implementation of H.264. After the GPU optimization of the 3D wavelet transform stage, the proposed encoder is able to compress a Full-HD sequence in real time. Note that the optimizations applied to our encoder are due only to GPU strategies, while other encoders like x264, H.263, MPEG-2, and MPEG-4 are fully optimized implementations that use CPU capabilities such as multimedia extensions (MMX2, SSE2Fast, SSSE3, etc.) and multithreading.
Although the GPU version of the 3D-LTW encoder is up to 3.2 times faster, the bottleneck of the overall encoder is now the coding stage that follows the 3D-DWT, especially at low compression rates, where there are many significant coefficients to encode. Several strategies could be applied to speed up the proposed encoder even further, such as overlapping GPU computation with memory transfers, overlapping CPU processing with GPU processing, or using several GPUs to compute the 3D wavelet transforms of different GOPs in parallel.
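Why the coding stage becomes the bottleneck follows from Amdahl's law: when only the 3D-DWT stage is accelerated, the overall gain is bounded by the share of time spent in the remaining stages. The sketch below is an illustration only; it is not a computation made by the authors, and the inputs (a DWT share of 80% and a stage speedup of 11) are picked from the ranges quoted earlier in the text.

```python
def overall_speedup(dwt_fraction, dwt_speedup):
    """Amdahl's law: encoder speedup when only the DWT stage is accelerated.

    dwt_fraction -- share of the original coding time spent in the 3D-DWT
    dwt_speedup  -- speedup of the 3D-DWT stage alone (transfers included)
    """
    return 1.0 / ((1.0 - dwt_fraction) + dwt_fraction / dwt_speedup)


example = overall_speedup(0.80, 11.0)  # about 3.67
ceiling = 1.0 / (1.0 - 0.80)           # 5.0, reached only with an infinitely fast DWT
```

Under these illustrative inputs the whole encoder gains a factor of about 3.7, the same order as the reported 3.2, and no further acceleration of the transform alone could exceed a factor of 5. This is why the remaining strategies target transfers and the CPU coding stage.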
6 Conclusions
In this article, we have presented the 3D-LTW video encoder, based on the 3D wavelet transform and lower trees with eight nodes. We have compared our algorithm with the 3D-SPIHT, H.264, x264, and MPEG-2 encoders in terms of R/D, coding delay and memory requirements.
Regarding R/D, our proposal behaves better than MPEG-2. Compared with 3D-SPIHT, our proposal behaves similarly for sequences with medium and high movement, but performs slightly worse for sequences with low movement like Container. However, our proposal requires up to 6 times less memory than 3D-SPIHT. Both 3D-DWT based encoders (3D-SPIHT and 3D-LTW) outperform x264 in Intra mode (by up to 11 dB), exploiting only the temporal redundancy among video frames when applying the 3D-DWT. It is also worth noting the behavior of the 3D-DWT based encoders on high frame rate video sequences, where they obtain even better PSNR than x264 in Inter mode.
In order to speed up our encoder, we have presented an exhaustive analysis of GPU memory strategies to compute the 3D-DWT. As we have seen, the GPU 3D-DWT algorithm obtains good speedups, up to 16 on the GT540M platform and up to 39 on the GTX280. Using these optimizations, the proposed encoder (3D-LTW) is a very fast encoder, especially for Full-HD video resolutions, being able to compress a Full-HD video sequence in real time.
The fast coding/decoding process and the absence of motion estimation/motion compensation algorithms make the 3D-LTW encoder a good candidate for applications where the coding/decoding delay is critical for proper operation, or where a frame must be reconstructed as soon as possible. 3D-DWT based encoders could be an intermediate solution between pure Intra encoders and complex Inter encoders.
Although the proposed 3D-LTW encoder has been developed for natural video sequences, for which the Daubechies 9/7F filter has been widely used in the literature for the 3D-DWT stage, other biorthogonal filters could be applied depending on the final application. Even though longer filters better capture the frequency changes in an image, the R/D differences for natural images are negligible with respect to the Daubechies 9/7F filter. This effect could be extended to the temporal domain. However, longer filters increase the complexity of the DWT computation because more operations per pixel must be performed, making the encoder slower. On the other hand, if a longer filter is used in the DWT stage, the GPU speedup is expected to be greater, because more operations per pixel are performed in parallel.
As future work, we intend to move other parts of the coding stage, such as the quantization stage, to the GPU to speed up the encoder even more. Furthermore, we intend to overlap the CPU computation stage with the GPU computation of the 3D-DWT stage. Regarding quantization on the GPU, our first attempts show that the time of the 3D-DWT stage on the GPU increases by 12% on average, while the time of the coding stage is reduced by 17% on average, which makes our encoder even faster.
References
1. Ryu JS, Kim ET: Fast intra coding method of H.264 for video surveillance system. Int. J. Comput. Sci. Netw. Secur 2007, 7(10):76-81.
2. Campisi P, Neri A: Video watermarking in the 3D-DWT domain using perceptual masking. In IEEE International Conference on Image Processing. Genoa, Italy; 2005:997-1000.
3. Schelkens P, Munteanu A, Barbarien J, Galca M, Giro-Nieto X, Cornelis J: Wavelet coding of volumetric medical datasets. IEEE Trans. Med. Imag 2003, 22(3):441-458.
4. Dragotti PL, Poggi G: Compression of multispectral images by three-dimensional SPIHT algorithm. IEEE Trans. Geosci. Rem. Sens 2000, 38(1):416-428.
5. Aviles M, Moran F, Garcia N: Progressive lower trees of wavelet coefficients: Efficient spatial and SNR scalable coding of 3D models. Lecture Notes Comput. Sci 2005, 3767:61-72.
6. Podilchuk CI, Jayant NS, Farvardin N: Three-dimensional subband coding of video. IEEE Trans. Image Process 1995, 4(2):125-135.
7. Taubman D, Zakhor A: Multirate 3D subband coding of video. IEEE Trans. Image Process 1994, 3(5):572-588.
8. Chen Y, Pearlman WA: Three-dimensional subband coding of video using the zerotree method. In Proc. SPIE Visual Communications and Image Processing. Orlando, FL; 1996:1302-1309.
9. Shapiro JM: Embedded image coding using zerotrees of wavelet coefficients. IEEE Trans. Signal Process 1993, 41(12):3445-3462.
10. Said A, Pearlman WA: A new, fast and efficient image codec based on set partitioning in hierarchical trees. IEEE Trans. Circ. Syst. Video Technol 1996, 6(3):243-250.
11. Kim BJ, Xiong Z, Pearlman WA: Low bit-rate scalable video coding with 3D set partitioning in hierarchical trees (3D SPIHT). IEEE Trans. Circ. Syst. Video Technol 2000, 10:1374-1387.
12. Wong TT, Leung CS, Heng PA, Wang J: Discrete wavelet transform on consumer-level graphics hardware. IEEE Trans. Multimedia 2007, 9(3):668-673.
13. Tenllado C, Setoain J, Prieto M, Pinuel L, Tirado F: Parallel implementation of the 2D discrete wavelet transform on graphics processing units: filter bank versus lifting. IEEE Trans. Parallel Distrib. Syst 2008, 19(3):299-310.
14. Franco J, Bernabé G, Fernández J, Acacio ME, Ujaldón M: The GPU on the 2D wavelet transform, survey and contributions. In Proceedings of Para 2010: State of the Art in Scientific and Parallel Computing. Lecture Notes in Computer Science. Reykjavik, Iceland; 2010:173-183.
15. Galiano V, López O, Malumbres MP, Migallón H: Improving the discrete wavelet transform computation from multicore to GPU-based algorithms. In Proceedings of International Conference on Computational and Mathematical Methods in Science and Engineering. Benidorm, Spain; 2011:544-555.
16. Oliver J, Malumbres MP: Low-complexity multiresolution image compression using wavelet lower trees. IEEE Trans. Circ. Syst. Video Technol 2006, 16(11):1437-1444.
17. Kim BJ, Xiong Z, Pearlman WA: Very low bit-rate embedded video coding with 3D set partitioning in hierarchical trees (3D SPIHT). 1997.
18. ISO/IEC 14496-10 and ITU Rec. H.264: Advanced video coding. 2003.
19. ffmpeg 2010. http://ffmpeg.zeranoe.com/
Acknowledgements
This research was supported by the Spanish Ministry of Education and Science under grant TIN2011-27543-C03-03 and the Spanish Ministry of Science and Innovation under grants TIN2011-26254 and TEC2010-11776-E.
Competing interests
The authors declare that they have no competing interests.
Keywords
 3D-DWT
 Video coding
 GPU
 Manycore