EURASIP Journal on Applied Signal Processing 2003:1, 56–65. © 2003 Hindawi Publishing Corporation

3D Scan-Based Wavelet Transform and Quality Control for Video Coding

Wavelet coding has been shown to achieve better compression than DCT coding and, moreover, allows scalability. The 2D DWT can easily be extended to 3D and thus applied to video coding. However, 3D subband coding of video suffers from two drawbacks. The first is the amount of memory required for coding large 3D blocks; the second is the loss of temporal quality caused by splitting the sequence into temporal blocks. Indeed, 3D block-based video coders produce jerks, which appear at the temporal borders of the blocks during video playback. In this paper, we propose a new temporal scan-based wavelet transform method for video coding that combines the advantages of wavelet coding (performance, scalability) with reduced memory requirements and no additional CPU complexity, while avoiding jerks. We also propose an efficient quality allocation procedure to ensure constant quality over time.


INTRODUCTION
Although 3D subband coding of video [1,2,3,4,5] provides encouraging results compared to MPEG [6,7,8,9], its generalization suffers from significant memory requirements. One way to reduce memory requirements is to apply the temporal discrete wavelet transform (DWT) on 3D blocks coming from a temporal splitting of the sequence. But this block-based DWT method introduces temporal blocking artifacts which result in undesirable jerks during video playback. In this paper, we propose new tools for 3D subband codecs to guarantee the output frames constant quality over time.
Scan-based 2D wavelet transforms were first suggested for on-board satellite compression in [10,11] and by Chrysafis and Ortega in [12].
In Section 2, we propose a 3D scan-based DWT method and a 3D scan-based motion-compensated lifting DWT for video coding. The method allows the computation of the temporal wavelet decomposition of a sequence of infinite length using little memory and no extra CPU. Furthermore, the proposed wavelet transform provides better quality control than 3D block-based video compression schemes (avoiding jerks).
In Section 3, we propose an efficient model-based quality control procedure. This bit-allocation procedure controls the output frames quality over time. This new quality-control procedure takes advantage of the model-based rate allocation methods described in [13].
Finally, Section 4 presents experimental results obtained by our method.

Principle
The method generally used to reduce memory requirements for large-image coding is to split the image and then perform the transform on tiles, as in JPEG with its 8 × 8 DCT blocks or in JPEG2000 [14]. Unfortunately, the coefficients are then computed from periodic or symmetrical extensions of the signal, which results in undesirable blocking artifacts. For video coding, the same blocking artifacts in the temporal direction (introduced by temporal splitting) result in jerks.
In this section, we propose a 3D wavelet transform framework for video coding that requires storing a minimum amount of data without any additional CPU complexity [15]. The frames of the sequence are acquired and processed on the fly.

Definitions of the temporal coherence and the buffer names
We consider a temporal interval (set of input frames). We define the set of its temporally coherent wavelet coefficients as the set of all coefficients, in all subbands, obtained by a filter (or convolution of filters) centered on any one of the frames of this temporal interval. In this paper, we assume that encoding is allowed only when we have a temporally coherent set of wavelet coefficients. Temporal coherence improves the encoder performance since it allows optimal bit allocation for wavelet coefficients of the same temporal interval.
The set of buffers used to perform the temporal wavelet transform will be called filtering buffers. These buffers produce low- and high-frequency temporal wavelet coefficients. In the same way, we call synchronization buffers the set of buffers used to store output coefficients before their encoding.

Temporal scan-based video DWT and delay
Consider the case of a 3D wavelet transform which can be split into a 2D DWT on each frame and an additional 1D DWT in the time direction [16]. In this paper, we focus on an efficient implementation of the temporal wavelet transform and we propose a method independent of the choice of the spatial wavelet transform.
Each time a frame is received, we perform its 2D wavelet transform and send it into our scan-based temporal wavelet transform system. We consider symmetrical filters with odd length since they are the most widely used in image compression algorithms [14,17]. To simplify, we also suppose that the low-pass filter is longer than the high-pass one. Let L = 2S + 1 be the length of the low-pass filter with S ≥ 2. We want to design components that can be easily reused for any wavelet decomposition tree. Therefore, the memory used for the filtering buffers is supposed to be internal and cannot be shared with other filtering buffers nor with the synchronization buffers for wavelet coefficients storage. We propose a method that minimizes the total memory requirements for FIR filtering.

Single-stage DWT
We first consider a single stage of the temporal wavelet transform.
The length of the low-pass filter is L. Therefore, we need L frames of 2D wavelet coefficients in memory to compute one frame of low-frequency temporal wavelet coefficients. The high-pass filter is shorter. Thus, our filtering buffer must contain exactly L frames of 2D wavelet coefficients. Consequently, filtering buffers are FIFOs of length L. Figure 1 shows the scheme for a single stage of a 5/3 temporal wavelet decomposition: the filtering buffer contains five frames of 2D wavelet coefficients. The synchronization buffers are used to store output 3D wavelet coefficients until we get a temporally coherent set of 3D wavelet coefficients.
When the (S + 1)th 2D transformed frame is received, the filtering buffer is symmetrically filled up in order to avoid side effects. The central frame is the 2D wavelet transform of the first image of the sequence. We can compute the first low-frequency temporal coefficients by applying the low-pass filter to the central frame of the filtering buffer (gray frame in Figure 1). The first high-frequency temporal coefficients must be computed on the second 2D transformed frame. This frame (hatched frame in Figure 1) and all its necessary neighbours are already present in the filtering buffer since the high-pass filter is shorter than the low-pass one. Therefore, the high-frequency temporal wavelet coefficients can also be computed without any additional input frames.
Finally, we have to wait for only S + 1 input frames to get one low-frequency and one high-frequency temporal frame of wavelet coefficients. Then, for each pair of input frames, we can compute both low-frequency and high-frequency coefficients. Each pair of low- and high-frequency frames is a set of temporally coherent wavelet coefficients. Therefore, we need S + 1 input frames to get the first set of temporally coherent wavelet coefficients and S + 1 + 2(n − 1) = S + 2n − 1 input frames to get a set of n low-frequency and n high-frequency output frames.
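As a rough illustration of this buffering behaviour, the sketch below simulates a single filtering stage and reports when each (low, high) pair becomes computable. The name `single_stage_pairs` and the integer frame stand-ins are our own illustrative choices, and the filtering arithmetic itself is deliberately omitted: only the FIFO management and the delay are modelled.

```python
from collections import deque

def single_stage_pairs(num_frames, S=2):
    """Sketch of a single-stage scan-based temporal DWT buffer.
    Yields, for each (low, high) output pair, the number of input
    frames consumed when that pair becomes computable."""
    L = 2 * S + 1                      # low-pass filter length
    fifo = deque(maxlen=L)             # filtering buffer (FIFO of length L)
    preload = []                       # frames seen before the symmetric fill
    for consumed in range(1, num_frames + 1):
        frame = consumed - 1           # stand-in for a 2D-transformed frame
        if consumed <= S + 1:
            preload.append(frame)
            if consumed == S + 1:
                # symmetric extension: mirror frames 1..S in front so the
                # first frame of the sequence sits at the buffer centre
                for g in reversed(preload[1:]):
                    fifo.append(g)
                for g in preload:
                    fifo.append(g)
                yield consumed         # first (low, high) pair
        else:
            fifo.append(frame)
            if (consumed - (S + 1)) % 2 == 0:
                yield consumed         # one more (low, high) pair
```

Running it for S = 2 yields pairs after 3, 5, 7, … input frames, matching the S + 2n − 1 count above.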
When the input sequence is finished, input frames are replaced by a symmetrical extension using the frames present in the filtering buffer in order to flush it.

Multistage DWT
We now consider the general scheme of an N-level temporal wavelet decomposition. We focus only on the usual dyadic decomposition without additional high-frequency subband decomposition. We assume that decomposition levels are indexed from 1 to N, where level j corresponds to the coefficients produced by the jth wavelet decomposition (level 0 is the sequence of all 2D wavelet transformed frames).
We compute the encoding delay for a two-level wavelet decomposition. The first stage has to compute S + 1 low-frequency temporal frames to get coefficients in both the low-frequency and high-frequency subbands of the second level. At the same time, the first stage has also computed S + 1 high-frequency temporal frames. But, from Section 2.2.1, we know that these S + 1 low-frequency and S + 1 high-frequency output frames of 3D wavelet coefficients can only be computed after a delay of S + 2(S + 1) − 1 = 3S + 1 frames. Thus, we have to wait for 3S + 1 frames to get one frame of 3D coefficients in all subbands of the second decomposition level and S + 1 frames of 3D coefficients in the first level. Notice that, for temporal coherence, we need only the first two frames among the S + 1 of the first level.
To compute the delay for an N-level temporal wavelet decomposition, we define d_j as the number of frames required at the input of the jth filtering buffer to get temporally coherent coefficients in all subbands. The processing of the first set of 3D subbands of temporally coherent wavelet coefficients will be possible after D = d_1 frames have been received. From Section 2.2.1, we know that d_j = S + 2 d_{j+1} − 1 for j ∈ {1, …, N − 1} and d_N = S + 1. Solving these equations, we find that the number of input frames required at level j before the first wavelet coefficients are available for processing is d_j = (2^(N+1−j) − 1) S + 1. Therefore, for an N-level temporal wavelet decomposition, the number of input frames needed to get the first set of temporally coherent wavelet coefficients is D = (2^N − 1) S + 1. Thus, the number of frames needed for the synchronization of the multistage decomposition increases exponentially with the number of decomposition levels. Figure 2 shows the scheme of a three-level wavelet decomposition for S = 2. Dark frames in the synchronization buffers are the set of coefficients which will be processed together (quantized and encoded) as soon as we have coefficients in all temporal frequency bands; this set of coefficients is temporally coherent. At the beginning of the sequence, we have to wait for D input frames. Then, sets of temporally coherent coefficients become available every 2^N input frames. Table 1 shows the number of input frames needed to get the first set of temporally coherent wavelet coefficients for two widely used filter banks: a three-level decomposition introduces an encoding delay of less than one second with the 9/7 filter bank and only half a second with the 5/3 filter bank. In 3D block-based video coders, the delay is equal to the size of the temporal block. As blocks must be large to reduce the number of jerks, the delay is larger for 3D block-based wavelet transform video coders.
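The recursion d_j = S + 2 d_{j+1} − 1 and its closed form can be checked numerically. The few lines below (with the hypothetical helper name `coherence_delay`) reproduce the delays quoted for the 9/7 (S = 4) and 5/3 (S = 2) filter banks.

```python
def coherence_delay(S, N):
    """Number of input frames D before the first temporally coherent
    set of coefficients, for low-pass filter length L = 2S + 1 and an
    N-level dyadic temporal decomposition."""
    d = S + 1                  # d_N: last decomposition level
    for _ in range(N - 1):
        d = S + 2 * d - 1      # d_j = S + 2 d_{j+1} - 1
    return d
```

For instance, `coherence_delay(4, 3)` gives 29 frames for a three-level 9/7 decomposition, i.e. about 0.97 s at 30 fps, and `coherence_delay(2, 3)` gives 15 frames (0.5 s) for the 5/3 filter bank, consistent with the delays discussed in the text.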

Memory requirements
Memory requirements are given by the sum of the number of frames in the N filtering buffers and the number of frames in the synchronization buffers.
The memory requirements for the filtering buffers are equal to (2S + 1)N frames.
The synchronization buffers of the last decomposition level must contain one frame of 3D wavelet coefficients for both the low-frequency and high-frequency subbands. For the jth decomposition level (j < N), d_{j+1} low-frequency outputs need to be computed and, at the same time, d_{j+1} high-frequency outputs can be computed. As we know that temporal coherence requires fewer than d_{j+1} 3D frames of wavelet coefficients at level j, we can decide to delay the computation of the last computable high-frequency coefficients until the new set of temporally coherent 3D wavelet coefficients has been encoded. Once the set of temporally coherent coefficients has been encoded, we compute all the high-frequency coefficients for levels 1 to N − 1 and send them into the synchronization buffers. Then, the on-the-fly wavelet transform can resume normally. This trick spares one frame in the memory requirements of each synchronization buffer for levels 1 to N − 1. Thus, the memory requirements for the synchronization buffers are limited to 2 + Σ_{j=1}^{N−1} (d_{j+1} − 1); that is, we need to store M_S = (2^N − N − 1) S + 2 frames of coefficients for all the synchronization buffers. Therefore, the total memory requirements of this method are (2^N + N − 1) S + N + 2 frames, for an N-level temporal wavelet transform with filter length L = 2S + 1. When memory can be shared between filtering buffers and synchronization buffers, the total memory requirements are further reduced; see [18] for the complete memory requirement formulae. Tables 2 and 3 show the total memory requirements for the 9/7 and 5/3 filter banks, respectively, for independent and shared buffers.
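Assuming the buffer counts derived above (independent buffers, no sharing), the total storage in frames can be tabulated with a few lines of code. `memory_frames` is an illustrative name; the shared-buffer formula of [18] is not reproduced here.

```python
def memory_frames(S, N):
    """Total frames of storage for the independent-buffer scheme:
    N filtering FIFOs of length 2S+1, plus the synchronization
    buffers M_S = (2**N - N - 1)*S + 2 derived in the text."""
    filtering = (2 * S + 1) * N
    sync = (2**N - N - 1) * S + 2
    return filtering + sync      # equals (2**N + N - 1)*S + N + 2
```

For the 9/7 filter bank (S = 4) and three levels this gives 45 frames, of the same order as the 48-frame block-based configuration compared in the text.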
Memory requirements increase as an exponential function of the resolution N and as a linear function of the filter length.
Note that, for the same memory requirements (e.g., 48 frames) and three levels of the 9/7 DWT decomposition at a frame rate of 30 fps, the encoding delay for temporal block-based video coders is equal to 1.6 seconds, while it is 0.97 seconds in our case (from Table 1). Furthermore, block-based video coders produce jerks at each group of 48 frames, while our method avoids these annoying artifacts.
The CPU complexity of our temporal scan-based DWT is exactly the same as that of performing the regular 1D DWT in the temporal direction on the entire sequence.

Scan-based motion compensated lifting
The main drawback of the 3D scan-based DWT is that it does not take motion compensation into account. 3D motion compensated lifting is an efficient tool to take account of motion in video [4,6,9,19,20,21].
Thus, we propose a new 3D scan-based motion compensated lifting scheme [18,22]. This method combines the benefits of scan-based filtering, block-based coding, and quality control [22].
When filtering and synchronization buffers are independent, the total memory requirements become a function of a parameter β that depends on the filter: β = 6 for the 9/7 Daubechies DWT [23] and β = 4 for the 5/3 DWT. When memory can be shared between filtering and synchronization buffers, the total memory requirements are reduced further; the complete memory requirements computation can be found in [18]. The scan-based motion compensated lifting scheme saves memory compared to the regular filter-bank implementation. Furthermore, our method does not increase the CPU complexity compared to the usual lifting implementation. Tables 4 and 5 show the memory requirements for scan-based motion compensated lifting video coders, respectively for independent and shared buffers.
Thus, the scan-based motion compensated lifting scheme saves 12 to 33% of memory (compare Tables 2 and 4, or Tables 3 and 5) and takes motion compensation into account. A 32-frame memory (which is a reasonable GOP memory) allows implementing a 3D scan-based motion compensated lifting with efficient filters (9/7) and a three-level decomposition.
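To make the lifting structure concrete, here is a minimal one-level 5/3 lifting analysis/synthesis along the temporal axis, without motion compensation (the paper's scheme additionally warps the predict/update operands along motion trajectories). Function names and the symmetric boundary handling are our own sketch, with each frame reduced to a single float.

```python
def lift_53(x):
    """One level of 5/3 lifting analysis: predict, then update.
    x: even-length list of (scalar) frames. Returns (low, high)."""
    n = len(x)
    assert n % 2 == 0
    # predict: high[i] = x[2i+1] - (x[2i] + x[2i+2]) / 2, symmetric edge
    high = [x[2*i + 1] - 0.5 * (x[2*i] + (x[2*i + 2] if 2*i + 2 < n else x[n - 2]))
            for i in range(n // 2)]
    # update: low[i] = x[2i] + (high[i-1] + high[i]) / 4, symmetric edge
    low = [x[2*i] + 0.25 * ((high[i - 1] if i > 0 else high[0]) + high[i])
           for i in range(n // 2)]
    return low, high

def unlift_53(low, high):
    """Inverse lifting: undo the update step, then the predict step."""
    n = 2 * len(low)
    x = [0.0] * n
    for i in range(len(low)):
        x[2*i] = low[i] - 0.25 * ((high[i - 1] if i > 0 else high[0]) + high[i])
    for i in range(len(high)):
        r = x[2*i + 2] if 2*i + 2 < n else x[n - 2]
        x[2*i + 1] = high[i] + 0.5 * (x[2*i] + r)
    return x
```

Because each lifting step is inverted exactly by subtracting what was added, reconstruction is perfect regardless of the (possibly motion-compensated) prediction used, which is what makes the lifting formulation attractive here.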
The scan-based motion compensated lifting also removes jerks with quality control.

MODEL-BASED TEMPORAL QUALITY CONTROL
The bit allocation for the successive sets of temporally coherent coefficients can be performed with respect to either rate or quality constraints. In both cases, the goal is to find a set of quantizers, one per subband, whose performance lies on the convex hull of the global rate-distortion curve [24,25,26,27].
Three different methods can be used to model the rate and distortion.
(i) The first one, used in JPEG2000 [14], consists in prequantizing the wavelet coefficients with a small predetermined quantization step and encoding their bit planes until the rate or distortion constraint (depending on the application) is met. In this method, the quantization step of each wavelet coefficient can only be the chosen quantization step multiplied by an integer power of two. The distortion and bitrate functions are exact, but they are computed during the encoding process.
(ii) The second method uses asymptotic models for both the distortion and the bitrate. As the asymptotic rate and distortion functions are simple, the minimum of the rate or distortion allocation criterion can be computed analytically. This method is therefore the simplest one to get the quantization steps to apply in each subband. However, the asymptotic assumption is only true for high bitrate subbands.
(iii) We have proposed to use nonasymptotic theoretical models for both rate and distortion [13]. The rate and the distortion depend on the quantization step but also on the probability density function of the wavelet coefficients. Assuming that the probability density model is accurate, this method provides optimal rate-distortion performance.
In this section, we propose a new nonasymptotic temporal quality control procedure to ensure constant quality over time. The quality measure is based on the mean square error (MSE) between the compressed signal and the original one.

Principle of the model-based MSE allocation
The purpose of MSE allocation is to determine the optimal quantizers in each subband which minimize the total bitrate for a given output MSE. Since the 9/7 biorthogonal filter bank is nearly orthogonal, the MSE between the original image and the decoded one can be computed as a weighted sum of the mean squared quantization errors of each subband:

D = Σ_{i=1}^{#SB} Δ_i π_i σ²_{Q_i},

with #SB the number of 3D subbands, σ²_{Q_i} the mean squared quantization error for subband i, and {π_i} the weights used to take account of the nonorthogonality of the filter bank [28]. The weights {Δ_i} are optional and can be used for frequency selection or distortion measures. The output bitrate can be expressed as the following weighted sum:

R = Σ_{i=1}^{#SB} a_i R_i,

with R_i the output bitrate for subband i and a_i the weight of subband i in the total bitrate (a_i is the size of subband i divided by the size of the sequence). The subband quantizers are uniform scalar quantizers, defined by their quantization steps q_i. The solution of our constrained problem is obtained thanks to Lagrange multipliers by minimizing the following criterion:

J({q_i}, λ) = Σ_{i=1}^{#SB} a_i R_i + λ ( Σ_{i=1}^{#SB} Δ_i π_i σ²_{Q_i} − D_T ),

where D_T denotes the target output MSE and both R_i and σ²_{Q_i} depend on the quantization steps q_i. The models used for the bitrate and distortion functions are described in the next subsection.

Rate and distortion models
In each 3D subband, the probability density function of the wavelet coefficients is unimodal with zero mean and can be approximated by a generalized Gaussian [23,29]:

p_{α,σ}(x) = a e^{−(b|x|)^α},

with b = (1/σ) √(Γ(3/α)/Γ(1/α)) and a = bα / (2Γ(1/α)). We also assume that the wavelet coefficients are independent and identically distributed (i.i.d.) [13] in each subband. Let Pr(m) be the probability of the quantization level m, so that

Pr(m) = ∫_{(m−1/2)q}^{(m+1/2)q} p_{α,σ}(x) dx for m ≠ 0, and Pr(0) = ∫_{−q/2}^{q/2} p_{α,σ}(x) dx.

From these probabilities, we can approximate the bitrate R by the entropy of the output quantization levels:

R = − Σ_m Pr(m) log₂ Pr(m).

The best decoding value for the quantization level m [30] is the centroid of its quantization bin,

x̂_m = (1/Pr(m)) ∫_{(m−1/2)q}^{(m+1/2)q} x p_{α,σ}(x) dx for m ≠ 0, and x̂_0 = 0.

The mean squared quantization error is then given by

σ²_Q = Σ_m ∫_{(m−1/2)q}^{(m+1/2)q} (x − x̂_m)² p_{α,σ}(x) dx.

Inserting the value of x̂_m into this expression leads to Proposition 1.

Proposition 1. When p_{α,σ} is a generalized Gaussian distribution with standard deviation σ and shape parameter α, there is a family of functions f_{n,m}, depending only on α and q/σ, through which the above integrals can be expressed. The proof of Proposition 1 is given in [18].
Therefore, the bitrate R and the quantization distortion σ²_Q depend only on the shape parameter α and the normalized quantization step q/σ.
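This scale-invariance claim can be checked numerically. The sketch below implements the generalized Gaussian model and the entropy/MSE integrals above with a simple midpoint rule, and verifies that R and the normalized MSE σ²_Q/σ² are unchanged when σ and q are scaled together; the helper names and integration parameters are our own choices, not the paper's tables.

```python
import math

def gg_pdf_factory(alpha, sigma):
    """Generalized Gaussian density p(x) = a * exp(-(b*|x|)**alpha), with
    b = (1/sigma)*sqrt(Gamma(3/alpha)/Gamma(1/alpha)) and
    a = b*alpha/(2*Gamma(1/alpha)), as in the text."""
    b = (1.0 / sigma) * math.sqrt(math.gamma(3.0 / alpha) / math.gamma(1.0 / alpha))
    a = b * alpha / (2.0 * math.gamma(1.0 / alpha))
    return lambda x: a * math.exp(-(b * abs(x)) ** alpha)

def rate_distortion(alpha, sigma, q, mmax=40, k=200):
    """Entropy R (bits/coefficient) and MSE of a uniform scalar quantizer
    of step q with centroid decoding, by midpoint-rule integration."""
    p = gg_pdf_factory(alpha, sigma)
    def integ(f, lo, hi):
        h = (hi - lo) / k
        return h * sum(f(lo + (j + 0.5) * h) for j in range(k))
    R = D = 0.0
    for m in range(-mmax, mmax + 1):
        lo, hi = (m - 0.5) * q, (m + 0.5) * q
        pm = integ(p, lo, hi)                  # Pr(m)
        if pm <= 0.0:
            continue
        xm = 0.0 if m == 0 else integ(lambda x: x * p(x), lo, hi) / pm
        R -= pm * math.log2(pm)                # entropy contribution
        D += integ(lambda x: (x - xm) ** 2 * p(x), lo, hi)
    return R, D
```

Doubling both σ and q leaves R identical and multiplies the MSE by four, i.e. (R, σ²_Q/σ²) depends only on (α, q/σ).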

Optimal model-based quantization for MSE control
Therefore, the goal is to find the quantization steps {q_i} and the Lagrange multiplier λ which minimize the criterion. Differentiating the criterion with respect to q_i and λ, and writing q̃_i = q_i/σ_i and D̃_i = σ²_{Q_i}/σ_i² for the normalized quantization step and distortion of subband i, the quantizer parameters {q̃_i} must verify the following system of #SB + 1 equations with #SB + 1 unknowns:

h_α(q̃_i) = −a_i / (λ Δ_i π_i σ_i²), i = 1, …, #SB,
Σ_{i=1}^{#SB} Δ_i π_i σ_i² D̃_i(q̃_i) = D_T,   (28)

where h_α(q̃) = (∂D̃/∂q̃) / (∂R/∂q̃) is the slope of the normalized distortion-rate curve. The solution of the MSE allocation problem can then be obtained from

q̃_i = h⁻¹( −a_i / (λ Δ_i π_i σ_i²) ),   (29)

where h⁻¹ is the inverse function of h. The parameter λ can be found from (28), and then (29) provides the optimal quantization steps q_i. Unfortunately, as there is no analytical formula for h⁻¹, the MSE allocation problem will be solved using a parametric approach described below.

Parametric approach
Equation (29) gives the values of the quantization steps using tables of the function h for different shape parameters α. Figure 3 shows the tables of ln(−h_α(q̃)) for α = 1, 1/2, 1/3, and 1/4, together with their asymptotic approximation. To solve (28), we need tables linking D and λ. Using the rate and distortion models above, we plot, for a given α, the parametric curve (with parameter q̃) linking the normalized MSE to ln(−h_α(q̃)). Using (29), this parametric curve gives, in each subband, a relation between D and λ. The optimal λ is found using the constraint (28). Then, we have a relation between λ and the quantization step q_i in each subband.

Algorithm of the model-based MSE allocation
The proposed MSE allocation procedure is the following.
(1) Set the initial value of λ to its asymptotic optimum value λ = 1/(2 D_T ln 2).
(2) For each 3D subband i, compute ln(a_i/(λ Δ_i π_i σ_i²)) = ln(−h) and read the corresponding normalized MSE D̃_i using the tables shown in Figure 4.
(3) Compute the resulting global MSE and compare it with the target D_T. If the difference is lower than a given threshold, the constraint (28) is verified and the current λ is optimal. Otherwise, compute a new value of λ and go back to step (2).
(4) For each 3D subband i, compute ln(a_i/(λ Δ_i π_i σ_i²)) = ln(−h) with the optimal λ and read q_i/σ_i using the tables shown in Figure 3. This q_i is the optimal quantization step for subband i.
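The loop above can be emulated with a generic discrete Lagrangian search when the h tables are replaced by sampled rate-distortion curves. The following is a sketch of the allocation principle, not the paper's table-based procedure: the function name, the folding of Δ_i π_i σ_i² into a single weight w_i, and the synthetic D(R) = 2^(−2R) curves used in the example are all our own assumptions.

```python
import math

def mse_allocate(subbands, D_target, iters=80):
    """Discrete analogue of the model-based MSE allocation.
    subbands: list of (a_i, w_i, points_i), where points_i is a sampled
    convex rate-distortion curve [(R, D), ...] and w_i stands for
    Delta_i * pi_i * sigma_i**2. Bisects the Lagrange multiplier so the
    weighted distortion sum meets D_target."""
    def operating_points(lmb):
        tot_R = tot_D = 0.0
        picks = []
        for a, w, pts in subbands:
            # Lagrangian choice: minimize a*R + lambda*w*D per subband
            best = min(pts, key=lambda rd: a * rd[0] + lmb * w * rd[1])
            picks.append(best)
            tot_R += a * best[0]
            tot_D += w * best[1]
        return tot_R, tot_D, picks
    lo, hi = 1e-9, 1e9                 # lambda bracket, geometric bisection
    for _ in range(iters):
        lmb = math.sqrt(lo * hi)
        _, D, _ = operating_points(lmb)
        if D > D_target:
            lo = lmb                   # too much distortion: raise lambda
        else:
            hi = lmb                   # under target: lower lambda, save rate
    return operating_points(hi)
```

With convex per-subband curves, this picks equal-slope operating points, which is exactly the optimality condition h_α(q̃_i) = −a_i/(λ Δ_i π_i σ_i²) in discrete form.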
The tables shown in Figures 3 and 4 are stored for several shape parameters α. They are valid for any video sequence.

EXPERIMENTAL RESULTS
To show the efficiency of our 3D scan-based wavelet transform method in removing the temporal blocking artifacts (jerks), we first extended EBWIC [13] to 3D data. The quantized wavelet coefficients have been encoded using JPEG2000's bit-plane context-based arithmetic coder [14]. We first encoded a sequence with the proposed 3D scan-based temporal wavelet transform and a bitrate regulation for the temporally coherent coefficients of each group of 16 frames. Then, we encoded the same sequence with the block-based approach, where the temporal wavelet transform and the encoding were performed on independent temporal blocks of 16 frames. Figure 5 shows a mean global PSNR improvement of 0.11 dB with our approach. Furthermore, we have reduced the PSNR variance from 0.13 to 0.06. The peaks of the block-based approach coincide with the artifacts produced at temporal tile borders (jerks). Regarding visual quality, the proposed method is also better since the annoying jerks are cancelled out.
Then, we replaced the bitrate regulation by our new MSE allocation procedure. Figure 6 shows that the quality of successive groups of 8 frames is well controlled. The PSNR variations are less than 1 dB with our method while they were up to 9 dB with a bitrate control procedure. The global sequence PSNR is 32.7 dB in both cases. Therefore, our method provides the same global rate-distortion performance but ensures constant quality output frames. This results in a better visual quality.

CONCLUSION
In this paper, we have proposed methods for efficient quality control in video-coding applications.
In Section 2, we have proposed a 3D scan-based DWT method which allows the computation of the temporal wavelet decomposition of a sequence of infinite length using little memory and no extra CPU. Compared to the temporal tiling approaches often used to reduce memory requirements, our method avoids temporal tile artifacts. We have also shown in Section 2.3 that, for the same memory requirements, our method reduces the encoding delay. We have also proposed a scan-based motion compensated lifting scheme which results in both memory savings and temporal quality control.
In Section 3, we have proposed a new efficient modelbased quality control procedure. This bit allocation procedure controls the output frames quality over time. The extension to scalar quantizers with a deadzone [31,32,33] is straightforward.
These methods combine the advantages of wavelet coding (performance, scalability) with minimum memory requirements and low CPU complexity.

Michel Barlaud received his Thèse d'Etat from the University of Paris XII. He is currently a Professor of image processing at the University of Nice-Sophia Antipolis and the leader of the Image Processing Group of I3S. His research topics are image and video coding using scan-based wavelet transforms, inverse problems using half-quadratic regularization, and image and video segmentation using region-based active contours and PDEs. He is a regular reviewer for several journals and a member of the technical committees of several scientific conferences. He leads several national research and development projects with French industries and participates in several international academic collaborations (Universities of Maryland, Stanford, Boston, Louvain-la-Neuve). He is the author of a large number of publications in the area of image and video processing and the editor of the book Wavelets and Image Communication (Elsevier, 1994).