Multicore-based 3D-DWT video encoder

Galiano, Vicente; López-Granado, Otoniel; Malumbres, Manuel P.; Migallón, Hector

doi:10.1186/1687-6180-2013-84

Research
Open access
Published: 20 April 2013

Multicore-based 3D-DWT video encoder

Vicente Galiano¹,
Otoniel López-Granado¹,
Manuel P. Malumbres¹ &
…
Hector Migallón¹

EURASIP Journal on Advances in Signal Processing volume 2013, Article number: 84 (2013) Cite this article

2705 Accesses
Metrics details

Abstract

Three-dimensional wavelet transform (3D-DWT) encoders are good candidates for applications like professional video editing, video surveillance, multi-spectral satellite imaging, etc. where a frame must be reconstructed as quickly as possible. In this paper, we present a new 3D-DWT video encoder based on a fast run-length coding engine. Furthermore, we present several multicore optimizations to speed-up the 3D-DWT computation. An exhaustive evaluation of the proposed encoder (3D-GOP-RL) has been performed, and we have compared the evaluation results with other video encoders in terms of rate/distortion (R/D), coding/decoding delay, and memory consumption. Results show that the proposed encoder obtains good R/D results for high-resolution video sequences with nearly in-place computation using only the memory needed to store a group of pictures. After applying the multicore optimization strategies over the 3D DWT, the proposed encoder is able to compress a full high-definition video sequence in real-time.

1 Introduction

Currently, most of the popular video compression technologies operate in both intra and inter coding modes. Intra mode compression operates in a frame-by-frame basis while inter mode achieves compression by applying motion estimation and compensation between frames and taking advantage of the temporal correlation between frames. Inter mode compression is able to achieve increased coding efficiency over intra mode schemes. However, in video content production stages, digital video-processing applications require fast-frame random access to perform an undefined number of real-time decompressing-editing-compressing interactive operations, without a significant loss of original video content quality. Intra-frame coding is desirable as well in many other applications like video archiving, high-quality high-resolution medical and satellite video sequences, applications requiring simple real-time encoding like video-conference systems or even for professional or home video surveillance systems [1], and digital video recording systems, where the user equipment is usually not as powerful as the head end equipment.

There is another video encoding approach that may be also considered as an inter coding approach but without the use of motion estimation/compensation. In this approach, known as three-dimensional (3D) coding, a video sequence is considered as a three-dimensional data set where each pixel has two spatial and one temporal coordinates. Most of the 3D encoders proposed in the literature are based on the three-dimensional wavelet transform (3D-DWT), mainly used in watermarking [2] and video coding applications (e.g., compression of volumetric medical data [3], multispectral images [4], or 3D model coding [5]). So, 3D-DWT-based encoders could be an intermediate approximation between intra and inter coding modes, because it avoids motion estimation and compensation, and the decoding latency will depend on the GOP size.

For example, Taubman and Zakhor presented a full-color video coder based on a 3D subband coding with camera pan compensation [6]. Podilchuk, et al. utilized a 3D spatio-temporal subband decomposition and geometric vector quantization [7]. Chen and Pearlman [8] extended to 3D improved embedded zero-tree wavelet (IEZW) for video coding the two-dimensional (2D) embedded zero-tree wavelet (EZW) method [9] and showed promise of an effective and computationally simple video coding system without motion compensation, obtaining excellent numerical and visual results. In [10], instead of the typical quad-trees of image coding, a tree with eight descendants per coefficient is used to extend the set partitioning in hierarchical trees (SPIHT) image encoder to 3D video coding. In [11], a fast SPIHT version is presented using a Huffman-based entropy encoder instead of a context-adaptive arithmetic encoder. However, the proposed image encoder has not been extended to the 3D version. Also in [12], an extension of the fast backward coding of wavelet trees (BCWT) image encoder [13] is presented, reporting a coding speed of 32 frames per second for a common intermediate format (CIF) resolution video sequence. The BCWT image encoder offers high coding speed, low memory usage, and a similar rate/distortion (R/D) performance than the SPIHT encoder. The key of the BCWT encoder is its unique one-pass backward coding, which starts from the lowest level of subbands and travels backwards. Maximum quantization levels of descendants (MQD) map calculation and coefficient encoding are all carefully integrated inside this pass in such a way that there is as little redundancy as possible for computation and memory usage. A 3D zero-tree coding through modified EZW has also been used with good results in compression of volumetric images [14].

In this work, we present a fast 3D-DWT-based encoder with a run-length core coding system. The proposed encoder requires less memory than 3D SPIHT [10] and has a good R/D behavior. Furthermore, we present an in-depth analysis of the use of multicore strategies to accelerate the 3D-DWT. Using these strategies, the proposed encoder is able to compress a full high-definition (HD) video sequence in real-time.

The rest of the paper is organized as follows: section 2 presents the proposed 3D-DWT-based encoder. In section 3, a performance evaluation in terms of R/D, memory requirements, and coding time is presented. Section 4 describes several optimization proposals based on multicore processing strategies applied to the 3D-DWT computation while in section 4.2, we analyze their performance. Furthermore, in section 4.3, we present a pipeline strategy to speed up the proposed encoder. Finally, in section 5, we show the performance of the improved proposed encoder against other state-of-the-art encoders while in section 6, some conclusions are drawn.

2 Encoding system

In this section, we present a 3D-DWT-based encoder with low complexity and good R/D performance. As our main concern is fast encoding process, no R/D optimization, motion estimation/motion compensation (ME/MC) or bitplane processing is applied. This encoder is based on both 3D-DWT and run-length encoding (3D-GOP-RL), and it is able to compress an ITU-D1 (576p30) video sequence at 40 frames per second.

In Figure 1, the whole encoding system scheme is shown. First of all, the 3D-DWT is applied to a group of pictures (GOP) in such a way that a combination of a 2D spatial DWT and a 1D temporal DWT is applied and the temporal DWT absorbs motion in the GOP. The temporal DWT is carried out on the pixel values of the same location along the time axis. Our 3D-DWT implementation, as how 3D-SPIHT and 3D-BCWT are done, uses the Daubechies 9/7F filter for both spatial and temporal domains because this filter has shown good results for lossy compression [15].

After that, all wavelet coefficients are quantized, and then, subband frames are passed from the lowest frequency subband L L L _nto the highest frequency subband H H H ₁ to the run-length encoding system which compresses the input data, and we obtain the final bit-stream corresponding to that GOP. As in the 3D-BCWT encoder [12], only one pass is applied over the GOP to encode the coefficients, but contrary to the 3D-BCWT encoder, the compressed bit-stream generated by our encoder is ordered in such a way that the decoder obtains the bit-stream in the correct order.

2.1 Fast run-length coding

In the proposed encoder, the quantization process is performed by two strategies: one coarser and another finer. The finer one is done by applying a scalar uniform quantization to the wavelet coefficients using the Q parameter. The coarser one is done by removing bit planes from the least significant part of the wavelet coefficients. We define rplanes as the number of less significant bits to be removed, and we call significant coefficient to those coefficients c _i,jthat are different to zero after discarding the least significant rplane bits, in other words, if c _i,j≥2^rplanes.

In the proposed coding algorithm, the wavelet coefficients are encoded as follows: the quantized coefficients in the subband buffer are scanned row by row (to exploit their locality). For each coefficient in that buffer, if it is not significant, a run-length count of insignificant symbols at this level is increased (r u n _l e n g t h _L). However, if it is significant, we encode both the count of previous insignificant symbols and the significant coefficient, and r u n _l e n g t h _Lis reset.

A significant coefficient is encoded by means of a symbol indicating the number of bits required to represent that coefficient. An arithmetic encoder with two contexts is used to efficiently store that symbol. As coefficients in the same subband have similar magnitude, an adaptive arithmetic encoder is able to represent this information in a very efficient way. After that, the significant bits and sign of the wavelet coefficient are raw-encoded to speed up the execution time.

In order to encode the count of insignificant symbols, we use a RUN symbol. After encoding this symbol, the run-length count (r u n _l e n g t h _L) is stored in a similar way as in the case of significant coefficients. First, the number of bits needed to encode the run value is arithmetically encoded (with a different context). Afterwards, the bits are raw-encoded.

Instead of using run-length count symbols, we could have used a single symbol to encode each insignificant coefficient. However, we would need to encode a larger amount of symbols, and therefore, the complexity of the algorithm would increase (most of all, in the case of a large number of insignificant contiguous symbols, which usually occurs in moderate-to-high compression ratios). However, the compression performance is increased if a specific symbol is used for every insignificant coefficient since an arithmetic encoder processes more efficiently many likely symbols than a lower amount of less likely symbols. So, for short run-lengths, we encode a LOWER symbol for each insignificant coefficient instead of coding a run-length count symbol for all the sequence. The threshold to enter the run-length mode and start using run-length count symbols is defined by the enter_run_mode parameter. The formal description of the depicted algorithm can be found in Algorithm 1.

Algorithm 1 Run-length coding of the wavelet coefficients

3 Performance evaluation

In this section, we will compare the performance of our proposed encoder (3D-GOP-RL) using the Daubechies 9/7F filter for both spatial and temporal domains and a GOP size of 16 with the video encoders presented in Table 1.

Table 1 Configuration parameters of the evaluated encoders

Full size table

The performance metrics employed in the tests are R/D performance, coding and decoding delay, and memory requirements. All the evaluated encoders have been tested on an Intel PentiumM Dual Core 3.0 GHz processor (Santa Clara, CA, USA) with a 2-Gbyte RAM memory.

The test video sequences used in the evaluation are the Foreman (QCIF and CIF) 300 frames, container (QCIF and CIF) 300 frames, news (QCIF and CIF) 300 frames, hall (QCIF and CIF) 300 frames, mobile (ITU D1 576p30) 40 frames, station2 (HD 1024p25) 312 frames, Ducks (HD 1024p50) 130 frames, and Ducks (SHD 2048p50) 130 frames.

It is important to remark that the H.263, MPEG-2, MPEG-4, and ×264 are evaluated by implementations that are fully optimized, using CPU capabilities like multimedia extensions (MMX2, SSE2Fast, SSSE3, etc.) and multithreading, whereas 3D-SPIHT and 3D-GOP-RL had non-optimized C++ implementations.

3.1 Memory requirements

In Table 2, the memory requirements of different encoders under test are shown. Obviously, the H.263 encoder, only using P frames, requires to keep in memory just two frames to accomplish the ME/MC stage, whereas encoders based on 3D-DWT like 3D-SPIHT and 3D-GOP-RL need to keep more frames in memory to apply the time filter. The 3D-GOP-RL encoder running over a GOP size of 16 frames uses up to 6 times less memory than 3D-SPIHT, up to 22 times less memory than H.264 for QCIF sequence resolution, and up to 6 times less memory than ×264 which is an optimized implementation of H.264, for small sequence resolutions. It is important to remark that 3D-SPIHT keeps the compressed bit-stream of a 16-GOP size in memory until the whole compression is performed, while encoders like MPEG-2, MPEG-4, H.263, H.264, 3D-GOP-RL, and ×264 output the bit-stream inline. Block-based encoders like MPEG-2 and MPEG-4 require less memory than the other encoders, specially at high-definition sequences. Also, the memory requirements in the proposed encoder (3D-GOP-RL) are doubled as the GOP size is doubled.

Table 2 Memory requirements for evaluated encoders (kB)

Full size table

3.2 R/D performance

Regarding R/D, in Table 3, we can see the R/D behavior of all evaluated encoders for different sequences. As shown, both H.264 and ×264 are the ones that obtain the best results for sequences with high movement, mainly due to the exhaustive ME/MC stage included in these encoders, which is contrary to 3D-SPIHT and 3D-GOP-RL that do not include any ME/MC stage. The R/D behavior of 3D-SPIHT and 3D-GOP-RL is similar for images with moderate-high motion activity, but for sequences with low movement, 3D-SPIHT outperform 3D-GOP-RL, showing the power of its tree encoding system. The proposed encoder (3D-GOP-RL) has a similar behavior to H.263 and MPEG-2 and a slightly lower performance than MPEG-4. Also, we can see the improvement of 3D-GOP-RL and 3D-SPIHT when compared to ×264 in intra mode (up to 11 dB). This R/D improvement is accomplished by exploiting only the temporal redundancy among video frames when applying the 3D-DWT. It is also interesting that the behavior of the 3D-DWT-based encoder for high frame rate video sequences like Ducks. As it can be seen, all 3D-DWT-based encoders have a similar behavior than the other encoders, even better than ×264.

Table 3 Average peak signal-to-noise ratio (PSNR) (dB) with different bit rate and coders

Full size table

3.3 Encoding time

In Figure 2, we present the coding speed (excluding I/O) of all evaluated encoders and for different sequence resolutions. As it can be seen, MPEG-2 and MPEG-4 encoders are the fastest ones due to their block-based processing algorithm. Regarding 3D-DWT-based encoders, the proposed encoder 3D-GOP-RL is up to seven times as fast as 3D-SPIHT and up to six times as fast as the ×264 encoder.

Also, in Figure 3a, we present the total coding time of a frame for different video sequence resolutions as a function of the GOP size. As it can be seen, for low resolution sequences, there are nearly no differences in the total coding time, but for high-resolution video sequences, the total coding time will increase up to 40% as the GOP size increases. Furthermore, it is interesting to see that the required time to perform the 3D-DWT stage ranges between 45% and 80% of the total coding time depending on the GOP size, as seen in Figure 3b. So, improvements in the 3D-DWT computation will drastically reduce the total coding time of the proposed encoder.

4 3D-DWT optimizations

As 3D-DWT computation requires more than 45% and up to 80% of the total coding time in the proposed encoder. In this section, we present several parallel strategies to improve the 3D-DWT computation time.

4.1 Multicore 3D wavelet transform

In the proposed encoder (3D-GOP-RL), the Daubechies 9/7 filter, proposed in [20], has been used to perform the regular filter-bank convolution in order to develop the parallel 3D-DWT algorithm. In [21], we proposed the convolution-based parallel 2D-DWT using an extra memory space in order to perform a nearly in-place computation, avoiding the requirement of twice the image size to store the computed coefficients. This strategy has been also followed to develop the parallel 3D-DWT algorithm.

We want to remark that we use four decomposition levels in order to compute the 3D-DWT, and the computation of each wavelet decomposition level is divided into two main steps: in the first step, the 2D-DWT is applied to each frame of the current GOP, and in the second step, the 1D-DWT is performed to consider the temporal axis. We have used the symmetric extension technique in order to avoid the border effects on both the frame borders and the GOP borders.

If we consider the first step (i.e., the 2D-DWT applied to each video frame), the extra memory size depends on both the row size or column size (the larger one), and the number of processes in the parallel algorithm. The extra memory stores, the frame row/column pixels, plus the pixels are required to perform the symmetric extension. For the Daubechies 9/7 filter, we must extend the row/column with four elements on both borders.

Table 4 shows the extra memory size (in pixels) and the percentage of memory increase for several video frame resolutions and number of processes used in the parallel algorithm. Note that each process stores its own working pixels which are not shared with other processes. The worst case in Table 4, attending to memory increase, is a very small value equal to 0.1109%. If the GOP size is larger than the row or column size, the amount of required extra memory is fixed by the GOP length. Percentage values in Table 4 have been obtained considering a GOP size equal to 32. In the second step of the 3D-DWT (i.e., the temporal 1D-DWT), we perform the symmetric extension in order to avoid the border effects in the temporal domain. In all performed experiments, the maximum GOP size considered is 128; therefore, the extra memory used in the first step is enough to be reused in the second step.

Table 4 Amount of extra memory size

Full size table

We have used the OpenMP [22] paradigm in order to develop the parallel 3D-DWT algorithm. The multicore platforms used in our tests are as follows:

Intel Core 2 Quad Q6600 2.4 GHz, with four cores.
HP Proliant SL390 G7 (HP, Palo Alto, CA, USA) with two Intel Xeon ×5,660, each CPU with six cores at 2.8 GHz.

4.2 Performance evaluation of the multicore 3D-DWT

In this section, we discuss the behavior of the parallel algorithm described in the previous section. Figure 4 presents the 3D-DWT computational times for a video frame resolution of 1,280×640 varying the GOP size and the number of processes. In the 3D-DWT, there is an intensive use of memory; therefore, the improvement in the use of the cache memory and data locality justifies efficiencies greater than 1. Values shown in Figure 4 correspond to the executions on the multicore Q6600 platform. However, efficiencies greater than 1 are not observed for the multicore HP Proliant SL390 due to the higher memory access performance respect to the multicore Q6600. The HP Proliant SL390 architecture provides a high-bandwidth memory access, through the Intel QPI Speed 64GT/s; therefore, the global performance improvement is less significant than in the Q6600 platform. In Figure 5, we also present the computational times for the multicore HP Proliant SL390. The efficiencies obtained on both platforms are similar. However, comparing data obtained from video frames of different resolutions, we can conclude that the behavior on the multicore Q6600 becomes worse than on the multicore HP Proliant SL390, as the GOP size increases, i.e., when the global memory size increases.

The GOP size is an important parameter in the 3D-DWT computation, when applied to video coding because the average video quality increases as we increase the GOP size due to the minor GOP boundary effect. However, the computational load and memory requirements increase. Ideally, the GOP size would be equal to the total number of video frames. Since this is not possible due to the device memory restrictions, we must select the GOP size attending to both the video quality and the computational time. As we can see in Figures 4 and 5, the computational time increases as the GOP size increases. The minimum GOP size in our algorithm is 16 due to the four wavelet decomposition levels performed in the 3D-DWT (2⁴).

In Figure 6, we present the computational time per frame.

We can observe that the parallel algorithm improves its behavior when both the number of processes and the GOP size increase. We want to remark that upon setting the GOP size equal to 256, for medium- and high-resolution video frames, the results obtained are not good due to the global memory size requirement. The optimal GOP size values are 64 and 128. Setting the GOP size to 128 reduces the border effects while setting the GOP size to 64 reduces the memory requirements. Both GOP size values obtain the best results in terms of computation time per frame, as seen in Figure 6.

4.3 Overlapping the 3D-DWT stage and the coding stage

In section 4, we have analyzed the behavior of the parallel 3D-DWT for multicores, and we have presented a parallel algorithm that obtains good efficiencies using up to the maximum number of available cores (12 cores in the HP Proliant SL390). Furthermore, we have reduced the computational time of the 3D-DWT stage, but the time of the coding stage has not been considered at this time. So, in order to improve the global coding time, we consider implementing a two-phase pipeline strategy considering both the 3D-DWT and the coding stage. Note that there are no dependencies between these two stages if the working frame of the GOP is not the same.

As we have said, in the pipeline strategy proposed, we overlap the 3D-DWT computation and the coding stage, where both stages process different GOPs. In Figure 7, we show that the pipeline strategy developed. At each step, we simultaneously compute the 3D-DWT of one GOP and encode the GOP transformed in the previous step. At the initial step, we only perform the 3D-DWT transform of the first GOP, and the last GOP is encoded at the final step without overlapping the task.

Firstly, in order to implement this pipeline procedure, we consider a multicore algorithm with two processes: the first one computes the 3D-DWT, and the second one computes the coding stage. There exists an inherent penalty in this type of algorithms at both the initial step and the final step. This penalty causes that the computational time reduction will be slightly lower than the optimal value equal to 50%. Considering the optimal GOP size values (64 or 128 frames), the ideal computational time reductions are 46.9% and 48.5%, respectively. We want to remark that our algorithm achieves these ideal values, obtaining, therefore, efficiencies equal to 0.94 and 0.97, respectively.

The previous conclusions are drawn considering that the computational time for both phases, the 3D-DWT stage and the coding stage, is similar. In Figure 8, we analyze the behavior of the computational time for both stages for the container (CIF) video sequence. As we can observe, the assumption that computational times for both stages are similar is only valid for very low compression rates. We can extend the behavior showed in Figure 8 to the rest of the video sequences. Therefore, it is necessary to apply the parallel optimizations presented in section 4 in order to achieve ideal efficiencies. We want to remark that the improvements are focused on the 3D-DWT computation. To obtain the ideal efficiencies (using more than two processes), we must achieve both goals, reduce at maximum the 3D-DWT computational time in the first step (at this step there is no overlapping), and reduce the 3D-DWT computational time in the following steps in order to obtain a time lower or equal to the coding time (the other overlapped task).

Therefore, there are four different conditions in the parallel computation of the first GOP of a video sequence. In the initial step, we only compute the 3D-DWT transform of the first frame of the GOP. In the following steps, in which there are overlapped tasks, we must adapt the 3D-DWT computation in order to obtain the optimal number of processes used in the 3D-DWT computation. In the third stage, we compute the 3D-DWT using the optimal number of processes obtained and the coding stage using one process. As we have said, the fourth step is the computation of the coding stage of the last GOP. Both the fork-join model of parallelism and the nested parallelism, offered by OpenMP, are used to implement these four discussed stages.

The fork-join parallelism refers to a method of specifying the parallel execution of a program whereby the program flow diverges into two or more flows that can be executed concurrently, and then, all flows come back together into a single flow when all of the parallel work is completed. In the nested parallelism, each flow can diverge into a new flow with two or more processes. In Figure 9, we show the structure of the parallel model developed using the fork-join model and the nested parallelism. In the first step, we use the maximum number of processes in order to accelerate at maximum the initial 3D-DWT computation. In the following steps (see Figure 7), the flow diverges into two processes where the first one computes the 3D-DWT of the following GOP and the second one computes the coding stage of the previous GOP. The flow that computes the 3D-DWT must adapt the number of processes in order to obtain a 3D-DWT computational time lower than the computational time of the coding stage. We set the number of processes to compute the 3D-DWT of the second GOP equal to half the maximum number of processes. In the following steps, the algorithm varies in the number of processes, depending on the measured time for both 3D-DWT and coding tasks, until the optimal value is found. Once we have obtained the optimal value of processes to compute the 3D-DWT, this value remains unchanged for the rest of the GOPs. The maximum number of processes used to compute the 3D-DWT is equal to the number of cores available minus one since this core (or process) is used to compute the coding stage. As we can see in Figure 8, the coding stage time is between two and four times lower, depending on the bit rate. Therefore, the optimal number of processes to compute the 3D-DWT depends on the bit rate, varying between 2 and 6.

Using the proposed strategy, we increase the efficiency of the pipeline structure up to 0.97 and up to 0.98 for GOP sizes 64 and 128, respectively. Moreover, the optimal value of processes is lower than the number of available processes, specially for the HP Proliant SL390 platform. The developed pipeline structure allows us to have idle cores, depending on the compression rate, and therefore, we can analyze the parallelization of the coding stage to improve the results in the future work.

Also, it is important to remark that upon joining the presented parallel strategies and the overlapping technique, we nearly reached the ideal speedups, where the bound of the speedup is determined by the computational time of the coding stage. Typical values of the speedup achievable are between 3 and 5.

5 Global performance evaluation

After analyzing both the performance of the multicore approach for the 3D-DWT computation and the aforementioned pipeline structure, we will present a comparison of the proposed encoder against the other test encoders in terms of coding delay.

In Figure 10, we present the coding speed (excluding I/O) in frames per second of all evaluated encoders and for different sequence resolutions. Now, our proposal uses the previously presented multicore optimization to perform the 3D-DWT in section 4. As it can be seen, MPEG-2 and MPEG-4 encoders still are the fastest ones. However, now, the 3D-GOP-RL encoder is up to four times as fast as the non-multicore version of the proposed encoder, being able to compress a full-HD sequence in real-time.

Although, the multicore version of the 3D-GOP-RL encoder has been speeded up to four times, now, the bottleneck in the encoder is the coding stage after computing the 3D-DWT transform, specially at low compression rates, where there are lots of significant coefficients to encode. Considering the overlapping strategy presented in section 4.3, the 3D-DWT computation is hidden and the total coding time will be due only to the coding stage, except for the first GOP. Of course, that extra memory for the second GOP is required in this approach. As it can be seen in Figure 11, using this technique, the proposed encoder is the fastest one for full-HD video resolutions. Remark, that the optimizations performed are due only to multicore strategies while other encoders like ×264, H.263, MPEG-2, and MPEG-4 are fully optimized implementations, using CPU capabilities like multimedia extensions (MMX2, SSE2Fast, SSSE3, etc.) and multithreading.

6 Conclusions

In this paper, we have presented the 3D-GOP-RL, a fast video encoder based on 3D wavelet transform and efficient run-length coding. We have compared our algorithm against 3D-SPIHT, H.264, ×264, H.263, MPEG-2, and MPEG-4 encoders in terms of R/D, coding delay, and memory requirements.

Regarding R/D, our proposal has a similar behavior to MPEG-2 and H.263 and a slightly lower performance than MPEG-4. When compared with 3D-SPIHT, our proposal has a similar behavior for sequences with medium and high movements but lower performance for sequences with low movement, like that of the container. However, our proposal requires six times less memory than the 3D-SPIHT. Both 3D-DWT-based encoders (3D-SPIHT and 3D-GOP-RL) outperform ×264 in intra mode (up to 11 dB), exploiting only the temporal redundancy among video frames when applying the 3D-DWT. It is also important to see the behavior of 3D-DWT-based encoders when applied to high frame rate video sequences, which is obtaining even better PSNR than ×264 in inter mode.

In order to speed up our encoder, we have presented an exhaustive analysis of the parallel strategies to compute the 3D-DWT transform. As we have seen, the parallel algorithm obtains good efficiencies, with the proper parameter setting, using the available cores, up to 12 in the multicore HP Proliant SL390 and up to 4 in the multicore Q6600. Even more, we have applied multithreading strategies to hide the 3D-DWT computational time. Using these strategies, the proposed encoder (3D-GOP-RL) is the fastest encoder for full-HD video resolutions, being able to compress a full-HD video sequence in real-time.

The fast coding/decoding process and the fact of avoiding the use of motion estimation/motion compensation algorithms make the 3D-GOP-RL encoder a good candidate for applications where the coding/decoding delay is critical for proper operation or for applications where a frame must be reconstructed as soon as possible. 3D-DWT-based encoders could be an intermediate solution between pure intra encoders and complex inter encoders.

In the future work, we intend to apply parallel strategies to speed up the encoder even more, but this time, we are focusing on the coding stage.

References

Jang-Seon R, Eung-Tea K: Fast intra coding method of H.264 for video surveillance system. Int. J. Comput. Sci. Netw. Secur 2007, 7(10):76-81.
Google Scholar
Campisi P, Neri A: Video watermarking in the 3D-DWT domain using perceptual masking. In IEEE International Conference on Image Processing. NY: IEEE; 2005:997-1000.
Google Scholar
Schelkens P, Munteanu A, Barbariend J, Galca M, Giro-Nieto X, Cornelis J: Wavelet coding of volumetric medical datasets. IEEE. Trans. Med. Imaging 2003, 22(3):441-458. 10.1109/TMI.2003.809582
Article Google Scholar
Dragotti PL, Poggi G: Compression of multispectral images by three-dimensional SPITH, algorithm. IEEE Trans. Geoscience Remote Sensing 2000, 38(1):416-428. 10.1109/36.823937
Article Google Scholar
Aviles M, Moran F, Garcia N: Progressive lower trees of wavelet coefficients: efficient spatial and SNR scalable coding of 3D models. Lect. Notes Comput. Sci 2005, 3767: 61-72. 10.1007/11581772_6
Article Google Scholar
Taubman D, Zakhor A: Multirate 3-D subband coding of video. IEEE Trans. Image Process 1994, 3(5):572-588. 10.1109/83.334984
Article Google Scholar
Podilchuk CI, Jayant NS, Farvardin N: Three dimensional subband coding of video. IEEE Trans. Image Process 1995, 4(2):125-135. 10.1109/83.342187
Article Google Scholar
Chen Y, Pearlman WA: Three-dimensional subband coding of video using the zero-tree method. In Visual Communications and Image Processing. Bellingham: SPIE; 1996:1302-1309.
Chapter Google Scholar
Shapiro JM: Embedded image coding using zerotrees of wavelet coefficients. IEEE Trans. Signal Process 1993, 41(12):3445-3462. 10.1109/78.258085
Article Google Scholar
Kim BJ, Xiong Z, Pearlman WA: Low bit-rate scalable video coding with 3D set partitioning in hierarchical trees (3D SPIHT). IEEE Trans. Circuits Syst. Video Tech 2000, 10: 1374-1387. 10.1109/76.889025
Article Google Scholar
Bibhuprasad M, Abhishek S, Sudipta M: A high performance modified SPIHT for scalable image compression. Int. J. Image Process 2011, 5(4):390-402.
Google Scholar
Ye L, Karp T, Nutter B, Mitra S, Guo J: Three-dimensional subband coding of video with 3-D BCWT. In Signals, Systems and Computers, 2006. ACSSC ’06. Fortieth Asilomar Conference on. NY: IEEE; 2006:401-405.
Chapter Google Scholar
Guo J, Mitra S, Nutter B, Karp T: A fast and low complexity image codec based on backward coding of wavelet trees. In Proceedings of the Data Compression Conference. NY: IEEE; 2006:292-301.
Google Scholar
Luo J, Wang X, Chen CW, Parker KJ: Volumetric medical image compression with three-dimensional wavelet transform and octave zerotree coding. In Visual Communications and Image Processing. Bellingham: SPIE; 1996:579-590.
Chapter Google Scholar
Sunil BM, Raj CP: Analysis of wavelet for 3d-dwt volumetric image compression. In Emerging Trends in Engineering and Technology (ICETET), 2010 3rd International Conference on. NY: IEEE; 2010:180-185.
Chapter Google Scholar
Kim BJ, Pearlman WA: An embedded wavelet video coder using three-dimensional set partitioning in hierarchical trees (SPIHT). In Proceedings of the Data Compression Conference, 1997. NY: IEEE; 1997:251-260.
Google Scholar
ISO/IEC 14496–10 and ITU Rec H.264. Coding of audio-visual objects - Part 10: Advanced Video Coding. 2003.
Google Scholar
ITU-T Recommendation H.263: Video coding for low bit rate communication. 2005.
Google Scholar
FFmpeg 2010.http://ffmpeg.zeranoe.com Available on
Mallat SG: A theory for multi-resolution signal decomposition: The wavelet representation. IEEE Trans. Pattern. Anal. Mach. Intell 1989, 11(7):674-693. 10.1109/34.192463
Article Google Scholar
Galiano V, López O, Malumbres MP, Migallón H: Improving the discrete wavelet transform computation from multicore to gpu-based algorithms. In proceedings of International Conference on Computational and Mathematical Methods in Science and Engineering. Salamanca: J. Vigo-Aguiar; 2011:544-555.
Google Scholar
OpenMP application program interface: version 3.1. OpenMP Architecture Rev. Board 2011. Available on http://www.openmp.org
Google Scholar

Download references

Acknowledgements

This research was supported by the Spanish Ministry of Science and Innovation under grant numbers TIN2011-26254 and TIN2011-27543-C03-03.

Author information

Authors and Affiliations

Physics and Computer Architecture Department, Miguel Hernández University, Elche, 03202, Spain
Vicente Galiano, Otoniel López-Granado, Manuel P. Malumbres & Hector Migallón

Authors

Vicente Galiano
View author publications
You can also search for this author in PubMed Google Scholar
Otoniel López-Granado
View author publications
You can also search for this author in PubMed Google Scholar
Manuel P. Malumbres
View author publications
You can also search for this author in PubMed Google Scholar
Hector Migallón
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hector Migallón.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Authors’ original file for figure 9

Authors’ original file for figure 10

Authors’ original file for figure 11

Authors’ original file for figure 12

Authors’ original file for figure 13

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Galiano, V., López-Granado, O., Malumbres, M.P. et al. Multicore-based 3D-DWT video encoder. EURASIP J. Adv. Signal Process. 2013, 84 (2013). https://doi.org/10.1186/1687-6180-2013-84

Download citation

Received: 30 October 2012
Accepted: 27 March 2013
Published: 20 April 2013
DOI: https://doi.org/10.1186/1687-6180-2013-84

Multicore-based 3D-DWT video encoder

Abstract

1 Introduction

2 Encoding system

2.1 Fast run-length coding

Algorithm 1 Run-length coding of the wavelet coefficients

3 Performance evaluation

3.1 Memory requirements

3.2 R/D performance

3.3 Encoding time

4 3D-DWT optimizations

4.1 Multicore 3D wavelet transform

4.2 Performance evaluation of the multicore 3D-DWT

4.3 Overlapping the 3D-DWT stage and the coding stage

5 Global performance evaluation

6 Conclusions

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords