Multicore-based 3D-DWT video encoder
EURASIP Journal on Advances in Signal Processing volume 2013, Article number: 84 (2013)
Abstract
Three-dimensional wavelet transform (3D-DWT) encoders are good candidates for applications such as professional video editing, video surveillance, and multispectral satellite imaging, where a frame must be reconstructed as quickly as possible. In this paper, we present a new 3D-DWT video encoder based on a fast run-length coding engine. Furthermore, we present several multicore optimizations to speed up the 3D-DWT computation. An exhaustive evaluation of the proposed encoder (3D-GOP-RL) has been performed, and we have compared the evaluation results with other video encoders in terms of rate/distortion (R/D), coding/decoding delay, and memory consumption. Results show that the proposed encoder obtains good R/D results for high-resolution video sequences with nearly in-place computation using only the memory needed to store a group of pictures. After applying the multicore optimization strategies over the 3D-DWT, the proposed encoder is able to compress a full high-definition video sequence in real time.
1 Introduction
Currently, most of the popular video compression technologies operate in both intra and inter coding modes. Intra mode compression operates on a frame-by-frame basis, while inter mode achieves compression by applying motion estimation and compensation between frames, taking advantage of the temporal correlation between them. Inter mode compression is able to achieve increased coding efficiency over intra mode schemes. However, in video content production stages, digital video-processing applications require fast frame random access to perform an undefined number of real-time decompressing-editing-compressing interactive operations, without a significant loss of original video content quality. Intra-frame coding is desirable as well in many other applications like video archiving, high-quality high-resolution medical and satellite video sequences, applications requiring simple real-time encoding like videoconference systems or even professional or home video surveillance systems [1], and digital video recording systems, where the user equipment is usually not as powerful as the head end equipment.
There is another video encoding approach that may also be considered an inter coding approach but without the use of motion estimation/compensation. In this approach, known as three-dimensional (3D) coding, a video sequence is considered as a three-dimensional data set where each pixel has two spatial coordinates and one temporal coordinate. Most of the 3D encoders proposed in the literature are based on the three-dimensional wavelet transform (3D-DWT), mainly used in watermarking [2] and video coding applications (e.g., compression of volumetric medical data [3], multispectral images [4], or 3D model coding [5]). So, 3D-DWT-based encoders could be an intermediate approximation between intra and inter coding modes, because they avoid motion estimation and compensation, and the decoding latency depends on the GOP size.
For example, Taubman and Zakhor presented a full-color video coder based on 3D subband coding with camera pan compensation [6]. Podilchuk et al. utilized a 3D spatiotemporal subband decomposition and geometric vector quantization [7]. Chen and Pearlman [8] extended the two-dimensional (2D) embedded zerotree wavelet (EZW) method [9] to a 3D improved embedded zerotree wavelet (IEZW) for video coding and showed the promise of an effective and computationally simple video coding system without motion compensation, obtaining excellent numerical and visual results. In [10], instead of the typical quadtrees of image coding, a tree with eight descendants per coefficient is used to extend the set partitioning in hierarchical trees (SPIHT) image encoder to 3D video coding. In [11], a fast SPIHT version is presented using a Huffman-based entropy encoder instead of a context-adaptive arithmetic encoder. However, that image encoder has not been extended to a 3D version. Also in [12], an extension of the fast backward coding of wavelet trees (BCWT) image encoder [13] is presented, reporting a coding speed of 32 frames per second for a common intermediate format (CIF) resolution video sequence. The BCWT image encoder offers high coding speed, low memory usage, and a rate/distortion (R/D) performance similar to that of the SPIHT encoder. The key to the BCWT encoder is its unique one-pass backward coding, which starts from the lowest level of subbands and travels backwards. The calculation of the map of maximum quantization levels of descendants (MQD) and the coefficient encoding are carefully integrated inside this pass in such a way that there is as little redundancy as possible in computation and memory usage. A 3D zerotree coding through modified EZW has also been used with good results in the compression of volumetric images [14].
In this work, we present a fast 3D-DWT-based encoder with a run-length core coding system. The proposed encoder requires less memory than 3D-SPIHT [10] and has a good R/D behavior. Furthermore, we present an in-depth analysis of the use of multicore strategies to accelerate the 3D-DWT. Using these strategies, the proposed encoder is able to compress a full high-definition (HD) video sequence in real time.
The rest of the paper is organized as follows: section 2 presents the proposed 3D-DWT-based encoder. In section 3, a performance evaluation in terms of R/D, memory requirements, and coding time is presented. Section 4 describes several optimization proposals based on multicore processing strategies applied to the 3D-DWT computation, while in section 4.2, we analyze their performance. Furthermore, in section 4.3, we present a pipeline strategy to speed up the proposed encoder. Finally, in section 5, we show the performance of the improved proposed encoder against other state-of-the-art encoders, while in section 6, some conclusions are drawn.
2 Encoding system
In this section, we present a 3D-DWT-based encoder with low complexity and good R/D performance. As our main concern is a fast encoding process, no R/D optimization, motion estimation/motion compensation (ME/MC), or bit-plane processing is applied. This encoder is based on both the 3D-DWT and run-length encoding (3D-GOP-RL), and it is able to compress an ITU-D1 (576p30) video sequence at 40 frames per second.
In Figure 1, the whole encoding system scheme is shown. First of all, the 3D-DWT is applied to a group of pictures (GOP) as a combination of a 2D spatial DWT and a 1D temporal DWT, where the temporal DWT absorbs the motion in the GOP. The temporal DWT is carried out on the pixel values at the same location along the time axis. Our 3D-DWT implementation, like those of 3D-SPIHT and 3D-BCWT, uses the Daubechies 9/7F filter for both spatial and temporal domains because this filter has shown good results for lossy compression [15].
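The separable structure described above (a 2D spatial transform per frame followed by a 1D temporal transform per pixel position) can be illustrated with a minimal sketch. To keep it short, the sketch swaps in a one-level averaging Haar filter instead of the Daubechies 9/7F filter used by the encoder, and the names `haar_1d` and `dwt3d_one_level` are illustrative, not taken from the paper.

```python
def haar_1d(signal):
    """One level of an averaging Haar transform (a stand-in for the 9/7F filter).

    Returns the low-pass half followed by the high-pass half.
    """
    half = len(signal) // 2
    low = [(signal[2 * i] + signal[2 * i + 1]) / 2 for i in range(half)]
    high = [(signal[2 * i] - signal[2 * i + 1]) / 2 for i in range(half)]
    return low + high


def dwt3d_one_level(gop):
    """Apply one separable decomposition level to gop[t][y][x].

    Step 1: 2D spatial transform of every frame (rows, then columns).
    Step 2: 1D temporal transform of every pixel position along t.
    """
    frames, height, width = len(gop), len(gop[0]), len(gop[0][0])
    out = [[row[:] for row in frame] for frame in gop]

    for t in range(frames):
        for y in range(height):                      # filter each row
            out[t][y] = haar_1d(out[t][y])
        for x in range(width):                       # filter each column
            column = haar_1d([out[t][y][x] for y in range(height)])
            for y in range(height):
                out[t][y][x] = column[y]

    for y in range(height):                          # filter along the time axis
        for x in range(width):
            line = haar_1d([out[t][y][x] for t in range(frames)])
            for t in range(frames):
                out[t][y][x] = line[t]
    return out
```

For a GOP with no spatial or temporal variation, all the energy ends up in the single lowest-frequency coefficient, which is the behavior a run-length coder can exploit.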
After that, all wavelet coefficients are quantized, and then, subband frames are passed from the lowest frequency subband $LLL_n$ to the highest frequency subband $HHH_1$ to the run-length encoding system, which compresses the input data, and we obtain the final bitstream corresponding to that GOP. As in the 3D-BCWT encoder [12], only one pass is applied over the GOP to encode the coefficients, but contrary to the 3D-BCWT encoder, the compressed bitstream generated by our encoder is ordered in such a way that the decoder obtains the bitstream in the correct order.
2.1 Fast run-length coding
In the proposed encoder, the quantization process is performed by two strategies: one coarser and another finer. The finer one is done by applying a scalar uniform quantization to the wavelet coefficients using the Q parameter. The coarser one is done by removing bit planes from the least significant part of the wavelet coefficients. We define rplanes as the number of least significant bits to be removed, and we call significant those coefficients $c_{i,j}$ that are different from zero after discarding the least significant rplanes bits, in other words, if $|c_{i,j}| \ge 2^{rplanes}$.
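The two quantization stages and the significance test can be sketched as follows. This is only an illustration: the function names are hypothetical, rounding toward zero in the finer stage is an assumption (the text does not state the rounding rule), and the test is applied to the coefficient magnitude.

```python
def quantize(coeff, q):
    """Finer stage: uniform scalar quantization with step q.

    Truncation toward zero is an assumption made for this sketch.
    """
    return int(coeff / q)


def is_significant(c, rplanes):
    """Coarser stage: a coefficient is significant if it is still nonzero
    after dropping its rplanes least significant bits."""
    return abs(c) >= (1 << rplanes)
```

For instance, with a step of 2.0 a coefficient of 37.5 quantizes to 18, which is significant for rplanes = 4 (threshold 16) but not for rplanes = 5 (threshold 32).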
In the proposed coding algorithm, the wavelet coefficients are encoded as follows: the quantized coefficients in the subband buffer are scanned row by row (to exploit their locality). For each coefficient in that buffer, if it is not significant, a run-length count of insignificant symbols at this level is increased ($run\_length_L$). However, if it is significant, we encode both the count of previous insignificant symbols and the significant coefficient, and $run\_length_L$ is reset.
A significant coefficient is encoded by means of a symbol indicating the number of bits required to represent that coefficient. An arithmetic encoder with two contexts is used to efficiently store that symbol. As coefficients in the same subband have similar magnitude, an adaptive arithmetic encoder is able to represent this information in a very efficient way. After that, the significant bits and sign of the wavelet coefficient are raw-encoded to speed up the execution time.
In order to encode the count of insignificant symbols, we use a RUN symbol. After encoding this symbol, the run-length count ($run\_length_L$) is stored in a similar way as in the case of significant coefficients. First, the number of bits needed to encode the run value is arithmetically encoded (with a different context). Afterwards, the bits are raw-encoded.
Instead of using run-length count symbols, we could have used a single symbol to encode each insignificant coefficient. However, we would need to encode a larger number of symbols, and therefore, the complexity of the algorithm would increase (most of all, in the case of a large number of contiguous insignificant symbols, which usually occurs at moderate-to-high compression ratios). On the other hand, the compression performance increases if a specific symbol is used for every insignificant coefficient, since an arithmetic encoder handles many highly likely symbols more efficiently than a smaller number of less likely symbols. So, for short run lengths, we encode a LOWER symbol for each insignificant coefficient instead of coding a run-length count symbol for the whole sequence. The threshold to enter the run-length mode and start using run-length count symbols is defined by the enter_run_mode parameter. The formal description of the depicted algorithm can be found in Algorithm 1.
Algorithm 1 Run-length coding of the wavelet coefficients
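The following is a minimal sketch of the symbol stream that the prose above describes; it is not a transcription of the authors' Algorithm 1. Assumptions made here: the arithmetic and raw coding steps are replaced by a plain list of tuples, the coefficient magnitude is stored after dropping the rplanes bits, a pending run is flushed at the end of the buffer, and all function names are hypothetical.

```python
def nbits(value):
    """Number of bits needed to represent a non-negative magnitude."""
    return abs(value).bit_length()


def run_length_symbols(coeffs, rplanes, enter_run_mode):
    """Turn a row-scanned subband buffer into a list of coding symbols.

    ("LOWER",)                      one insignificant coefficient (short runs)
    ("RUN", nbits, run)             a run of insignificant coefficients
    ("NBITS", nbits, mag, is_neg)   a significant coefficient
    """
    symbols = []

    def flush(run):
        if run == 0:
            return
        if run < enter_run_mode:
            # Short run: one cheap, highly likely symbol per coefficient.
            symbols.extend([("LOWER",)] * run)
        else:
            # Long run: RUN symbol, then the bit count and the raw count.
            symbols.append(("RUN", nbits(run), run))

    run = 0
    for c in coeffs:
        if abs(c) < (1 << rplanes):      # insignificant coefficient
            run += 1
            continue
        flush(run)
        run = 0
        magnitude = abs(c) >> rplanes    # assumption: rplanes bits are dropped
        symbols.append(("NBITS", nbits(magnitude), magnitude, c < 0))
    flush(run)                           # assumption: flush a trailing run
    return symbols
```

In a real encoder, the first field after each tag would go through the adaptive arithmetic coder (with separate contexts for runs and coefficients), while the magnitude bits and the sign would be written raw.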
3 Performance evaluation
In this section, we compare the performance of our proposed encoder (3D-GOP-RL), using the Daubechies 9/7F filter for both spatial and temporal domains and a GOP size of 16, with the video encoders presented in Table 1.
The performance metrics employed in the tests are R/D performance, coding and decoding delay, and memory requirements. All the evaluated encoders have been tested on an Intel PentiumM Dual Core 3.0 GHz processor (Santa Clara, CA, USA) with 2 Gbytes of RAM.
The test video sequences used in the evaluation are the Foreman (QCIF and CIF) 300 frames, container (QCIF and CIF) 300 frames, news (QCIF and CIF) 300 frames, hall (QCIF and CIF) 300 frames, mobile (ITU D1 576p30) 40 frames, station2 (HD 1024p25) 312 frames, Ducks (HD 1024p50) 130 frames, and Ducks (SHD 2048p50) 130 frames.
It is important to remark that the H.263, MPEG-2, MPEG-4, and x264 encoders are evaluated using fully optimized implementations that exploit CPU capabilities like multimedia extensions (MMX2, SSE2Fast, SSSE3, etc.) and multithreading, whereas 3D-SPIHT and 3D-GOP-RL have non-optimized C++ implementations.
3.1 Memory requirements
In Table 2, the memory requirements of the different encoders under test are shown. The H.263 encoder, using only P frames, needs to keep just two frames in memory to accomplish the ME/MC stage, whereas encoders based on the 3D-DWT like 3D-SPIHT and 3D-GOP-RL need to keep more frames in memory to apply the temporal filter. The 3D-GOP-RL encoder running over a GOP size of 16 frames uses up to 6 times less memory than 3D-SPIHT, up to 22 times less memory than H.264 for QCIF sequence resolution, and up to 6 times less memory than x264, which is an optimized implementation of H.264, for small sequence resolutions. It is important to remark that 3D-SPIHT keeps the compressed bitstream of a 16-frame GOP in memory until the whole compression is performed, while encoders like MPEG-2, MPEG-4, H.263, H.264, 3D-GOP-RL, and x264 output the bitstream inline. Block-based encoders like MPEG-2 and MPEG-4 require less memory than the other encoders, especially for high-definition sequences. Also, the memory requirements of the proposed encoder (3D-GOP-RL) double when the GOP size is doubled.
3.2 R/D performance
Regarding R/D, in Table 3, we can see the R/D behavior of all evaluated encoders for different sequences. As shown, both H.264 and x264 obtain the best results for sequences with high movement, mainly due to the exhaustive ME/MC stage included in these encoders, in contrast to 3D-SPIHT and 3D-GOP-RL, which do not include any ME/MC stage. The R/D behavior of 3D-SPIHT and 3D-GOP-RL is similar for sequences with moderate-to-high motion activity, but for sequences with low movement, 3D-SPIHT outperforms 3D-GOP-RL, showing the power of its tree encoding system. The proposed encoder (3D-GOP-RL) has a similar behavior to H.263 and MPEG-2 and a slightly lower performance than MPEG-4. Also, we can see the improvement of 3D-GOP-RL and 3D-SPIHT when compared to x264 in intra mode (up to 11 dB). This R/D improvement is accomplished by exploiting only the temporal redundancy among video frames when applying the 3D-DWT. The behavior of the 3D-DWT-based encoders for high frame rate video sequences like Ducks is also interesting. As can be seen, all 3D-DWT-based encoders behave similarly to the other encoders, and even better than x264.
3.3 Encoding time
In Figure 2, we present the coding speed (excluding I/O) of all evaluated encoders for different sequence resolutions. As can be seen, the MPEG-2 and MPEG-4 encoders are the fastest ones due to their block-based processing algorithm. Regarding 3D-DWT-based encoders, the proposed encoder 3D-GOP-RL is up to seven times as fast as 3D-SPIHT and up to six times as fast as the x264 encoder.
Also, in Figure 3a, we present the total coding time of a frame for different video sequence resolutions as a function of the GOP size. As can be seen, for low-resolution sequences, there are nearly no differences in the total coding time, but for high-resolution video sequences, the total coding time increases by up to 40% as the GOP size increases. Furthermore, it is interesting to see that the time required to perform the 3D-DWT stage ranges between 45% and 80% of the total coding time depending on the GOP size, as seen in Figure 3b. So, improvements in the 3D-DWT computation will drastically reduce the total coding time of the proposed encoder.
4 3D-DWT optimizations
As the 3D-DWT computation requires between 45% and 80% of the total coding time in the proposed encoder, in this section, we present several parallel strategies to improve the 3D-DWT computation time.
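To get a rough feeling for what parallelizing only the 3D-DWT stage can deliver, Amdahl's law can be applied to the 45% to 80% range quoted above. This is a back-of-the-envelope model introduced here for illustration, not an analysis from the paper; it assumes the 3D-DWT scales perfectly and the coding stage stays sequential.

```python
def amdahl_speedup(dwt_fraction, processes):
    """Overall speedup when only the DWT fraction of the time is parallelized.

    dwt_fraction: share of total coding time spent in the 3D-DWT (0..1).
    processes:    number of processes applied to the 3D-DWT.
    """
    return 1.0 / ((1.0 - dwt_fraction) + dwt_fraction / processes)


def amdahl_bound(dwt_fraction):
    """Limit of the speedup as the number of processes grows without bound."""
    return 1.0 / (1.0 - dwt_fraction)
```

Under this model, a DWT share of 80% gives an overall speedup of 2.5 with four processes and can never exceed 5, while a share of 45% is bounded by about 1.8; the coding stage therefore quickly becomes the limiting factor.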
4.1 Multicore 3D wavelet transform
In the proposed encoder (3D-GOP-RL), the Daubechies 9/7 filter, proposed in [20], has been used to perform the regular filter-bank convolution in order to develop the parallel 3D-DWT algorithm. In [21], we proposed a convolution-based parallel 2D-DWT using an extra memory space in order to perform a nearly in-place computation, avoiding the requirement of twice the image size to store the computed coefficients. This strategy has also been followed to develop the parallel 3D-DWT algorithm.
We want to remark that we use four decomposition levels to compute the 3D-DWT, and the computation of each wavelet decomposition level is divided into two main steps: in the first step, the 2D-DWT is applied to each frame of the current GOP, and in the second step, the 1D-DWT is performed along the temporal axis. We have used the symmetric extension technique in order to avoid border effects at both the frame borders and the GOP borders.
If we consider the first step (i.e., the 2D-DWT applied to each video frame), the extra memory size depends on both the row size or column size (the larger one) and the number of processes in the parallel algorithm. The extra memory stores the frame row/column pixels plus the pixels required to perform the symmetric extension. For the Daubechies 9/7 filter, we must extend the row/column with four elements on both borders.
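As an illustration of the working buffer described above, the sketch below extends a row or column by four mirrored samples per border and estimates the extra memory as one such buffer per process. The whole-sample mirroring (the border sample itself is not repeated) is the usual choice for odd-length symmetric filters and is assumed here; the function names and the memory formula are a reading of the text, not values taken from Table 4.

```python
def symmetric_extend(line, pad=4):
    """Mirror `pad` samples around each border without repeating the edge.

    Requires len(line) > pad.
    """
    left = [line[k] for k in range(pad, 0, -1)]
    right = [line[-1 - k] for k in range(1, pad + 1)]
    return left + list(line) + right


def extra_memory_pixels(width, height, processes, gop_size=0, pad=4):
    """Estimated extra storage: one private extended line per process.

    The line length is the largest of row size, column size, and GOP size.
    """
    return processes * (max(width, height, gop_size) + 2 * pad)
```

For example, `symmetric_extend([0, 1, 2, 3, 4, 5], pad=2)` yields `[2, 1, 0, 1, 2, 3, 4, 5, 4, 3]`, and under this estimate four processes working on 1,920×1,080 frames would need 4 × (1,920 + 8) = 7,712 extra pixels.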
Table 4 shows the extra memory size (in pixels) and the percentage of memory increase for several video frame resolutions and numbers of processes used in the parallel algorithm. Note that each process stores its own working pixels, which are not shared with other processes. The worst case in Table 4, in terms of memory increase, is a very small value equal to 0.1109%. If the GOP size is larger than the row or column size, the amount of required extra memory is fixed by the GOP length. Percentage values in Table 4 have been obtained considering a GOP size equal to 32. In the second step of the 3D-DWT (i.e., the temporal 1D-DWT), we perform the symmetric extension in order to avoid border effects in the temporal domain. In all performed experiments, the maximum GOP size considered is 128; therefore, the extra memory used in the first step is large enough to be reused in the second step.
We have used the OpenMP [22] paradigm in order to develop the parallel 3D-DWT algorithm. The multicore platforms used in our tests are as follows:

Intel Core 2 Quad Q6600 2.4 GHz, with four cores.

HP Proliant SL390 G7 (HP, Palo Alto, CA, USA) with two Intel Xeon X5660, each CPU with six cores at 2.8 GHz.
4.2 Performance evaluation of the multicore 3D-DWT
In this section, we discuss the behavior of the parallel algorithm described in the previous section. Figure 4 presents the 3D-DWT computational times for a video frame resolution of 1,280×640, varying the GOP size and the number of processes. The 3D-DWT makes intensive use of memory; therefore, the improvement in the use of cache memory and data locality justifies efficiencies greater than 1. Values shown in Figure 4 correspond to executions on the multicore Q6600 platform. However, efficiencies greater than 1 are not observed for the multicore HP Proliant SL390 due to its higher memory access performance with respect to the multicore Q6600. The HP Proliant SL390 architecture provides high-bandwidth memory access through the Intel QPI at 6.4 GT/s; therefore, the global performance improvement is less significant than on the Q6600 platform. In Figure 5, we also present the computational times for the multicore HP Proliant SL390. The efficiencies obtained on both platforms are similar. However, comparing data obtained from video frames of different resolutions, we can conclude that the behavior on the multicore Q6600 becomes worse than on the multicore HP Proliant SL390 as the GOP size increases, i.e., when the global memory size increases.
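The efficiency figures discussed above follow the usual definition of parallel efficiency, sketched below. The timings used in the usage note are made-up numbers for illustration only and are not measurements from Figures 4 or 5.

```python
def speedup(t_sequential, t_parallel):
    """Ratio between the one-process time and the parallel time."""
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, processes):
    """Speedup per process; values above 1 indicate superlinear behavior,
    typically caused by better cache use when data is split across cores."""
    return speedup(t_sequential, t_parallel) / processes
```

With hypothetical timings of 10 s on one process and 2 s on four processes, the speedup is 5 and the efficiency is 1.25, i.e., greater than 1.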
The GOP size is an important parameter in the 3D-DWT computation when applied to video coding because the average video quality increases as we increase the GOP size, due to the smaller GOP boundary effect. However, the computational load and memory requirements also increase. Ideally, the GOP size would be equal to the total number of video frames. Since this is not possible due to device memory restrictions, we must select the GOP size considering both the video quality and the computational time. As we can see in Figures 4 and 5, the computational time increases as the GOP size increases. The minimum GOP size in our algorithm is 16 due to the four wavelet decomposition levels performed in the 3D-DWT ($2^4 = 16$).
In Figure 6, we present the computational time per frame.
We can observe that the parallel algorithm improves its behavior when both the number of processes and the GOP size increase. We want to remark that, upon setting the GOP size equal to 256, the results obtained for medium- and high-resolution video frames are not good due to the global memory size requirement. The optimal GOP size values are 64 and 128. Setting the GOP size to 128 reduces the border effects, while setting the GOP size to 64 reduces the memory requirements. Both GOP size values obtain the best results in terms of computation time per frame, as seen in Figure 6.
4.3 Overlapping the 3D-DWT stage and the coding stage
In section 4, we have analyzed the behavior of the parallel 3D-DWT for multicores, and we have presented a parallel algorithm that obtains good efficiencies using up to the maximum number of available cores (12 cores in the HP Proliant SL390). We have thus reduced the computational time of the 3D-DWT stage, but the time of the coding stage has not been considered so far. So, in order to improve the global coding time, we consider implementing a two-phase pipeline strategy covering both the 3D-DWT and the coding stage. Note that there are no dependencies between these two stages as long as they do not work on the same GOP.
As we have said, in the proposed pipeline strategy, we overlap the 3D-DWT computation and the coding stage, where both stages process different GOPs. In Figure 7, we show the pipeline strategy developed. At each step, we simultaneously compute the 3D-DWT of one GOP and encode the GOP transformed in the previous step. At the initial step, we only perform the 3D-DWT of the first GOP, and the last GOP is encoded at the final step without any overlapping task.
Firstly, in order to implement this pipeline procedure, we consider a multicore algorithm with two processes: the first one computes the 3D-DWT, and the second one computes the coding stage. There is an inherent penalty in this type of algorithm at both the initial step and the final step. Because of this penalty, the computational time reduction is slightly lower than the optimal value of 50%. Considering the optimal GOP size values (64 or 128 frames), the ideal computational time reductions are 46.9% and 48.5%, respectively. We want to remark that our algorithm achieves these ideal values, obtaining, therefore, efficiencies equal to 0.94 and 0.97, respectively.
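The start-up and drain penalty of a two-stage pipeline can be made concrete with a small model. It assumes that both stages take the same time per GOP, as the text does at this point, and the function names are illustrative. The number of GOPs behind the 46.9% and 48.5% figures is not stated in this section, so the model is not claimed to reproduce the authors' derivation.

```python
def pipeline_schedule(num_gops):
    """List of (gop_in_dwt, gop_in_coder) pairs, one per pipeline step.

    None marks an idle stage: the coder idles in the first step and the
    DWT idles in the last one.
    """
    steps = []
    for step in range(num_gops + 1):
        dwt = step if step < num_gops else None
        coder = step - 1 if step > 0 else None
        steps.append((dwt, coder))
    return steps


def time_reduction(num_gops):
    """Fraction of time saved versus running both stages back to back."""
    sequential_steps = 2 * num_gops      # DWT + coding for every GOP
    pipelined_steps = num_gops + 1       # one extra step to fill/drain
    return 1 - pipelined_steps / sequential_steps


def pipeline_efficiency(reduction):
    """Achieved reduction relative to the 50% optimum of two stages."""
    return reduction / 0.5
```

The reduction approaches 50% as the number of GOPs grows, and dividing a reduction by that optimum gives the efficiency: 46.9% corresponds to about 0.94 and 48.5% to 0.97, matching the values quoted above.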
The previous conclusions are drawn assuming that the computational times of both phases, the 3D-DWT stage and the coding stage, are similar. In Figure 8, we analyze the behavior of the computational time of both stages for the container (CIF) video sequence. As we can observe, the assumption that the computational times of both stages are similar is only valid for very low compression rates. The behavior shown in Figure 8 extends to the rest of the video sequences. Therefore, it is necessary to apply the parallel optimizations presented in section 4 in order to achieve ideal efficiencies. We want to remark that the improvements are focused on the 3D-DWT computation. To obtain the ideal efficiencies (using more than two processes), we must achieve two goals: reduce as much as possible the 3D-DWT computational time in the first step (at this step there is no overlapping), and reduce the 3D-DWT computational time in the following steps in order to obtain a time lower than or equal to the coding time (the other overlapped task).
Therefore, there are four different conditions in the parallel computation of a video sequence. In the initial step, we only compute the 3D-DWT of the first GOP. In the following steps, in which there are overlapped tasks, we must adapt the 3D-DWT computation in order to find the optimal number of processes to use in the 3D-DWT computation. In the third stage, we compute the 3D-DWT using the optimal number of processes obtained and the coding stage using one process. As we have said, the fourth step is the computation of the coding stage of the last GOP. Both the fork-join model of parallelism and the nested parallelism offered by OpenMP are used to implement these four stages.
Fork-join parallelism refers to a method of specifying the parallel execution of a program whereby the program flow diverges into two or more flows that can be executed concurrently, and then, all flows come back together into a single flow when all of the parallel work is completed. In nested parallelism, each flow can diverge into a new flow with two or more processes. In Figure 9, we show the structure of the parallel model developed using the fork-join model and nested parallelism. In the first step, we use the maximum number of processes in order to accelerate the initial 3D-DWT computation as much as possible. In the following steps (see Figure 7), the flow diverges into two processes, where the first one computes the 3D-DWT of the following GOP and the second one computes the coding stage of the previous GOP. The flow that computes the 3D-DWT must adapt the number of processes in order to obtain a 3D-DWT computational time lower than the computational time of the coding stage. We set the number of processes used to compute the 3D-DWT of the second GOP equal to half the maximum number of processes. In the following steps, the algorithm varies the number of processes, depending on the measured time of both the 3D-DWT and coding tasks, until the optimal value is found. Once we have obtained the optimal number of processes to compute the 3D-DWT, this value remains unchanged for the rest of the GOPs. The maximum number of processes used to compute the 3D-DWT is equal to the number of cores available minus one, since one core (or process) is used to compute the coding stage. As we can see in Figure 8, the coding stage time is between two and four times lower, depending on the bit rate. Therefore, the optimal number of processes to compute the 3D-DWT depends on the bit rate, varying between 2 and 6.
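One possible way to adapt the number of 3D-DWT processes is sketched below. The text only states that the search starts at half the maximum and varies the count according to the measured times, so the one-step-at-a-time search, the function name, and the timing callback are assumptions of this sketch rather than the authors' implementation.

```python
def choose_dwt_processes(dwt_time, coding_time, max_processes):
    """Find the smallest process count whose DWT time fits under the coding time.

    dwt_time:      callable returning the measured 3D-DWT time for a process
                   count (in a real encoder, one measurement per GOP).
    coding_time:   measured time of the overlapped coding stage.
    max_processes: available cores minus the one reserved for coding.
    """
    procs = max(1, max_processes // 2)       # starting point for the second GOP
    if dwt_time(procs) > coding_time:
        # The DWT is the slower task: add processes until it is hidden
        # behind the coding stage or no more processes are available.
        while procs < max_processes and dwt_time(procs) > coding_time:
            procs += 1
    else:
        # The DWT already fits: release processes while it still fits.
        while procs > 1 and dwt_time(procs - 1) <= coding_time:
            procs -= 1
    return procs
```

As a usage example with a hypothetical, perfectly scaling timing model `dwt_time = lambda p: 12.0 / p` and 11 available processes, a coding time of 3.0 leads to 4 processes, leaving the remaining cores idle.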
Using the proposed strategy, we increase the efficiency of the pipeline structure up to 0.97 and up to 0.98 for GOP sizes 64 and 128, respectively. Moreover, the optimal number of processes is lower than the number of available processes, especially for the HP Proliant SL390 platform. The developed pipeline structure leaves some cores idle, depending on the compression rate, and therefore, the parallelization of the coding stage can be analyzed in future work to improve the results.
Also, it is important to remark that upon joining the presented parallel strategies and the overlapping technique, we nearly reached the ideal speedups, where the bound of the speedup is determined by the computational time of the coding stage. Typical values of the speedup achievable are between 3 and 5.
5 Global performance evaluation
After analyzing both the performance of the multicore approach for the 3D-DWT computation and the aforementioned pipeline structure, we present a comparison of the proposed encoder against the other test encoders in terms of coding delay.
In Figure 10, we present the coding speed (excluding I/O) in frames per second of all evaluated encoders for different sequence resolutions. Now, our proposal uses the multicore optimization presented in section 4 to perform the 3D-DWT. As can be seen, the MPEG-2 and MPEG-4 encoders are still the fastest ones. However, the 3D-GOP-RL encoder is now up to four times as fast as the non-multicore version of the proposed encoder, being able to compress a full-HD sequence in real time.
Although the multicore version of the 3D-GOP-RL encoder has been sped up by up to four times, the bottleneck in the encoder is now the coding stage after computing the 3D-DWT, especially at low compression rates, where there are many significant coefficients to encode. Considering the overlapping strategy presented in section 4.3, the 3D-DWT computation is hidden and the total coding time is due only to the coding stage, except for the first GOP. Of course, extra memory for the second GOP is required in this approach. As can be seen in Figure 11, using this technique, the proposed encoder is the fastest one for full-HD video resolutions. Note that the optimizations performed are due only to multicore strategies, while other encoders like x264, H.263, MPEG-2, and MPEG-4 are fully optimized implementations, using CPU capabilities like multimedia extensions (MMX2, SSE2Fast, SSSE3, etc.) and multithreading.
6 Conclusions
In this paper, we have presented 3D-GOP-RL, a fast video encoder based on the 3D wavelet transform and efficient run-length coding. We have compared our algorithm against the 3D-SPIHT, H.264, x264, H.263, MPEG-2, and MPEG-4 encoders in terms of R/D, coding delay, and memory requirements.
Regarding R/D, our proposal has a similar behavior to MPEG-2 and H.263 and a slightly lower performance than MPEG-4. When compared with 3D-SPIHT, our proposal has a similar behavior for sequences with medium and high movement but lower performance for sequences with low movement, like container. However, our proposal requires six times less memory than 3D-SPIHT. Both 3D-DWT-based encoders (3D-SPIHT and 3D-GOP-RL) outperform x264 in intra mode (by up to 11 dB), exploiting only the temporal redundancy among video frames when applying the 3D-DWT. It is also important to note the behavior of 3D-DWT-based encoders when applied to high frame rate video sequences, where they obtain an even better PSNR than x264 in inter mode.
In order to speed up our encoder, we have presented an exhaustive analysis of the parallel strategies to compute the 3D-DWT. As we have seen, the parallel algorithm obtains good efficiencies, with the proper parameter setting, using the available cores, up to 12 in the multicore HP Proliant SL390 and up to 4 in the multicore Q6600. Moreover, we have applied multithreading strategies to hide the 3D-DWT computational time. Using these strategies, the proposed encoder (3D-GOP-RL) is the fastest encoder for full-HD video resolutions, being able to compress a full-HD video sequence in real time.
The fast coding/decoding process and the avoidance of motion estimation/motion compensation algorithms make the 3D-GOP-RL encoder a good candidate for applications where the coding/decoding delay is critical for proper operation or for applications where a frame must be reconstructed as soon as possible. 3D-DWT-based encoders could be an intermediate solution between pure intra encoders and complex inter encoders.
In future work, we intend to apply parallel strategies to speed up the encoder even more, this time focusing on the coding stage.
References
1. Jang-Seon R, Eung-Tea K: Fast intra coding method of H.264 for video surveillance system. Int. J. Comput. Sci. Netw. Secur 2007, 7(10):76-81.
2. Campisi P, Neri A: Video watermarking in the 3D-DWT domain using perceptual masking. In IEEE International Conference on Image Processing. NY: IEEE; 2005:997-1000.
3. Schelkens P, Munteanu A, Barbarien J, Galca M, Giro-Nieto X, Cornelis J: Wavelet coding of volumetric medical datasets. IEEE Trans. Med. Imaging 2003, 22(3):441-458. 10.1109/TMI.2003.809582
4. Dragotti PL, Poggi G: Compression of multispectral images by three-dimensional SPIHT algorithm. IEEE Trans. Geoscience Remote Sensing 2000, 38(1):416-428. 10.1109/36.823937
5. Aviles M, Moran F, Garcia N: Progressive lower trees of wavelet coefficients: efficient spatial and SNR scalable coding of 3D models. Lect. Notes Comput. Sci 2005, 3767:61-72. 10.1007/11581772_6
6. Taubman D, Zakhor A: Multirate 3-D subband coding of video. IEEE Trans. Image Process 1994, 3(5):572-588. 10.1109/83.334984
7. Podilchuk CI, Jayant NS, Farvardin N: Three dimensional subband coding of video. IEEE Trans. Image Process 1995, 4(2):125-135. 10.1109/83.342187
8. Chen Y, Pearlman WA: Three-dimensional subband coding of video using the zero-tree method. In Visual Communications and Image Processing. Bellingham: SPIE; 1996:1302-1309.
9. Shapiro JM: Embedded image coding using zerotrees of wavelet coefficients. IEEE Trans. Signal Process 1993, 41(12):3445-3462. 10.1109/78.258085
10. Kim BJ, Xiong Z, Pearlman WA: Low bit-rate scalable video coding with 3D set partitioning in hierarchical trees (3D SPIHT). IEEE Trans. Circuits Syst. Video Tech 2000, 10:1374-1387. 10.1109/76.889025
11. Bibhuprasad M, Abhishek S, Sudipta M: A high performance modified SPIHT for scalable image compression. Int. J. Image Process 2011, 5(4):390-402.
12. Ye L, Karp T, Nutter B, Mitra S, Guo J: Three-dimensional subband coding of video with 3-D BCWT. In Signals, Systems and Computers, 2006. ACSSC '06. Fortieth Asilomar Conference on. NY: IEEE; 2006:401-405.
13. Guo J, Mitra S, Nutter B, Karp T: A fast and low complexity image codec based on backward coding of wavelet trees. In Proceedings of the Data Compression Conference. NY: IEEE; 2006:292-301.
14. Luo J, Wang X, Chen CW, Parker KJ: Volumetric medical image compression with three-dimensional wavelet transform and octave zerotree coding. In Visual Communications and Image Processing. Bellingham: SPIE; 1996:579-590.
15. Sunil BM, Raj CP: Analysis of wavelet for 3D-DWT volumetric image compression. In Emerging Trends in Engineering and Technology (ICETET), 2010 3rd International Conference on. NY: IEEE; 2010:180-185.
16. Kim BJ, Pearlman WA: An embedded wavelet video coder using three-dimensional set partitioning in hierarchical trees (SPIHT). In Proceedings of the Data Compression Conference, 1997. NY: IEEE; 1997:251-260.
17. ISO/IEC 14496-10 and ITU Rec H.264. Coding of audio-visual objects - Part 10: Advanced Video Coding. 2003.
18. ITU-T Recommendation H.263: Video coding for low bit rate communication. 2005.
19. FFmpeg 2010. Available on http://ffmpeg.zeranoe.com
20. Mallat SG: A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern. Anal. Mach. Intell 1989, 11(7):674-693. 10.1109/34.192463
21. Galiano V, López O, Malumbres MP, Migallón H: Improving the discrete wavelet transform computation from multicore to GPU-based algorithms. In Proceedings of International Conference on Computational and Mathematical Methods in Science and Engineering. Salamanca: J. Vigo-Aguiar; 2011:544-555.
22. OpenMP application program interface: version 3.1. OpenMP Architecture Rev. Board 2011. Available on http://www.openmp.org
Acknowledgements
This research was supported by the Spanish Ministry of Science and Innovation under grant numbers TIN2011-26254 and TIN2011-27543-C03-03.
Competing interests
The authors declare that they have no competing interests.
Cite this article
Galiano, V., López-Granado, O., Malumbres, M.P. et al. Multicore-based 3D-DWT video encoder. EURASIP J. Adv. Signal Process. 2013, 84 (2013) doi:10.1186/1687-6180-2013-84
Keywords
 3D-DWT
 Video coding
 Multicore
 Wavelets
 Performance