Rate Control for H.264 with Two-Step Quantization Parameter Determination but Single-Pass Encoding

We present an efficient rate control strategy for H.264 in order to maximize the video quality by appropriately determining the quantization parameter (QP) for each macroblock. To break the chicken-and-egg dilemma resulting from QP-dependent rate-distortion optimization (RDO) in H.264, a pre-analysis phase is conducted to gain the necessary source information, and then the coarse QP is decided for rate-distortion (RD) estimation. After motion estimation, we further refine the QP of each mode using the obtained actual standard deviation of motion-compensated residues. In the encoding process, RDO is performed only once for each macroblock, thus one-pass, while QP determination is conducted twice. Therefore, the increase in computational complexity is small compared to that of the JM 9.3 software. Experimental results indicate that our rate control scheme with two-step QP determination but single-pass encoding not only effectively improves the average PSNR but also controls the target bit rates well.


INTRODUCTION
H.264/MPEG-4 AVC is the latest international video coding standard, developed by the Joint Video Team (JVT) of the ISO/IEC Moving Picture Experts Group and the ITU-T Video Coding Experts Group [1][2][3][4][5]. As in other video standards such as MPEG-2 [6] and H.263 [7,8], rate control remains an open but important issue for H.264/AVC. A rate control scheme that maximizes the video quality while meeting the rate constraints is much desired for H.264/AVC.
In comparison with other video standards, there are several challenges for rate control in H.264 [9][10][11][12], due to its unique features. The first is the well-known chicken-and-egg dilemma in the rate-distortion optimization (RDO) process [10], briefly described as follows. In H.264, a quantization-parameter- (QP-) dependent RDO technique is adopted in the process of best prediction mode selection [11,13]. To perform RDO, QP should be decided first. But in order to perform rate control, QP can only be obtained according to the coding complexity and the number of target bits, which are calculated from motion-compensated residues after RDO mode decision. This poses a significant problem for rate control in H.264. Second, because H.264 adopts a richer set of prediction modes than previous standards, the number of header bits fluctuates greatly from Inter 16 × 16 to Inter 4 × 4 [11,12]. Thus, a good overhead model is necessary for accurate rate control. Third, better mode selection in H.264 often leads to small motion-compensated residues [11]. As a result, the coefficients of a large number of macroblocks are quantized to zero.
Although several rate control algorithms have recently been proposed to cope with these problems [9,12,14], the proper method for rate control in H.264/AVC has not been fully explored. A predictive rate control scheme [9] has been adopted in the H.264/AVC reference software JM 9.3 [15]. Its general idea is as follows: after pre-encoding a macroblock using the QP of the previously encoded macroblock, the block activity is measured by the sum of absolute differences (SAD). Using a linear model that captures the connection between the QP, buffer occupancy, and block activity, the QP is then determined from the buffer occupancy and block activity. The macroblock is re-encoded with the obtained QP if the difference between the two QPs exceeds a specific threshold; up to 20% of the MBs need to be encoded twice. Furthermore, linear modeling of the relation between QPs, buffer occupancy, and block activity may not achieve the best performance. In [12], a solution to the chicken-and-egg dilemma between rate control and RDO in H.264 is given, and different bits are allocated to different modes so that unfavorable situations for the quadratic rate-distortion (RD) model are avoided. Although this solution keeps the peak signal-to-noise ratio (PSNR) smoother than that of [9] and the generated bit rate matches the target bit rate accurately, the PSNR improvement is insignificant. In [16][17][18], a PSNR-and-MAD-based frame complexity estimation is proposed to allocate bits more accurately among frames. Two special cases, scene change and small texture bits, are taken into account when determining the QP at the frame layer. A frame-skipping decision is also used to proactively drop a simple frame in order to make room for later, more complex frames. However, this rate control scheme does not pay much attention to QP determination at the macroblock layer.
In [19], a frame-layer rate control scheme is presented, which computes the Lagrange multiplier for mode decision by using a quantization parameter which may be different from that used for encoding.
In this paper, we propose an RDO-based rate control scheme for H.264 with two-step QP determination but single-pass encoding in order to maximize the video quality by appropriately determining the QP for each macroblock, building on our previous work [11]. To break the chicken-and-egg dilemma resulting from QP-dependent rate-distortion optimization (RDO) in H.264, a pre-analysis phase is conducted to gain the necessary source information, and then the coarse QP is decided for R-D estimation. After QP-dependent motion estimation (with the coarse QP), we further refine the QP of each mode based on the obtained actual standard deviation of motion-compensated residues. Using the actual standard deviation, each possible mode's QP can be calculated; these QPs are then used in the comparison of each mode's rate-distortion (RD) cost (RDcost), and the encoder chooses the mode with the minimum value. Carefully selected QPs thus ensure accurate bit allocation to individual MBs according to their actual needs, and the QP refinement process helps achieve good video quality under the given bit budget. In addition, the header bits and coefficient bits are estimated separately, which further enhances the rate control accuracy. In the encoding process, RDO is performed only once for each macroblock, thus one-pass, while QP determination is conducted twice. Therefore, the increase in computational complexity is small compared to that of the JM 9.3 software. Experimental results indicate that our rate control scheme not only effectively improves the average PSNR but also controls the target bit rates well.
The rest of this paper is organized as follows. In Section 2, we derive models for bit rate and distortion estimation. In Section 3, our proposed rate control algorithm is presented in detail, including the solutions to the aforementioned difficulties and the two-step QP decision with single-pass encoding. Section 4 gives experimental results. Finally, Section 5 concludes the paper.

Figure 1 shows the basic ideas of the overall rate control process of our algorithm, which comprises two major steps. First, pre-analysis is performed to break the chicken-and-egg dilemma, thus obtaining the source information, which is used in determining the coarse QP for QP-dependent motion estimation. Second, RDO mode decision is conducted at the macroblock layer to select the best prediction mode for each macroblock. The refined QP of each possible mode is determined and used in the RDcost comparison. After RDO, the current macroblock is encoded with the selected mode and its corresponding refined quantization parameter.

MODELING RATE AND DISTORTION
To determine QP, an R-D model usually estimates the rate and distortion based on some measurements of frames or macroblocks. In this paper, we choose the R-D model of our previous work [11], in which the header bits, the coefficient bits, and the distortion of each macroblock are estimated. They are briefly described as follows. (For inter-predicted frames, the average QP over all MBs of the previously inter-predicted frame is used to pre-analyze the current frame.)

Header bits estimation
Most existing R-D models only consider the transform coefficient bits when estimating the rate of a macroblock; header bits are simply represented by a constant value. This is a reasonable simplification for previous standards such as MPEG-2 and H.263, because the header bits are relatively few in number due to the simplicity of prediction modes in these standards. However, header bits form a significant portion of an H.264/AVC bitstream [11]. Therefore, the number of header bits needs to be estimated separately from the coefficient bits for accurate rate estimation. In this paper, we use the following simple but effective model to estimate the number of header bits for one macroblock:

H_i = H_trd,        if σ_i^2 ≤ σ_trd^2,
H_i = C · com_i,    otherwise,        (1)

where H_i is the number of header bits for the ith macroblock in the current frame and σ_i is the predicted standard deviation of motion-compensated residues for Inter 16 × 16. In the following, we refer to the standard deviation of the motion-compensated residues obtained in the pre-analysis phase as the predicted standard deviation, since it may differ from the actual standard deviation if RDO selects a mode other than Inter 16 × 16 as the prediction mode. H_trd and σ_trd^2 are the averages of all recorded H_i and σ_i^2, which are explained below. The complexity measure is

com_i = [log(σ_i^2)]^2,        (2)

and C is a constant expressing the linear relation between H_i and com_i; writing the second case through com_i separates the two situations so that (1) looks more compact.
Two situations are considered in our header bits model. (1) When encoding the previous frame, we record H_i and σ_i^2 of the MBs whose H_i is smaller than a predefined constant (= 11). After encoding the previous frame, we calculate the averages of all recorded H_i and σ_i^2, which are referred to as H_trd and σ_trd^2, respectively. During the encoding of the current frame, if σ_i^2 ≤ σ_trd^2 for a macroblock, we conclude that this macroblock will produce a small number of header bits, and H_i is directly estimated by H_trd.
(2) Otherwise, the number of header bits of a macroblock is linear in [log(σ_i^2)]^2. Furthermore, C is adaptively updated macroblock by macroblock during the encoding process to make the model more robust, as discussed below. Further explanation of (1) and (2) is given as follows.
We use Inter 16×16 mode in the pre-analysis to compute the motion-compensated residues. A good prediction of the MB by Inter 16 × 16 will result in a small predicted standard deviation. So the chances are that Inter 16×16 will be selected as the best prediction mode. In contrast, a large predicted standard deviation implies a bad prediction and RDO may quite possibly select other modes such as Intra 4 × 4 or Inter 8 × 8 to do the prediction. In this sense, the prediction mode selected by RDO is, to some extent, dependent on the predicted standard deviation. On the other hand, as we know, in H.264, the number of header bits strongly depends on its prediction mode (e.g., Inter 16 × 16 has only one motion vector while Inter 8×8 may have up to 16 motion vectors). From the above analysis, we can say that the number of header bits depends on the predicted standard deviation as well. The larger the predicted standard deviation, the higher the possibility that header-bits-expensive modes, such as Inter 8 × 8, will be used. In other words, the number of header bits increases with the predicted standard deviation, as is suggested by (2).
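The two-case header-bits model above can be sketched in a few lines of code. This is an illustrative reading of model (1): the function name and argument layout are our own, and the recorded averages H_trd and σ_trd^2 are assumed to have been collected while encoding the previous frame as described.

```python
import math

def estimate_header_bits(sigma_sq, h_trd, sigma_sq_trd, c):
    """Estimate header bits H_i for one macroblock, per model (1).

    sigma_sq     -- predicted variance of the Inter 16x16 residues
    h_trd        -- average header bits of low-overhead MBs in the previous frame
    sigma_sq_trd -- average residue variance of those recorded MBs
    c            -- adaptively updated model constant C
    """
    if sigma_sq <= sigma_sq_trd:
        # Case 1: low-activity MB; reuse the recorded average directly.
        return h_trd
    # Case 2: header bits grow linearly with com_i = [log(sigma_i^2)]^2.
    com = math.log(sigma_sq) ** 2
    return c * com
```

A macroblock whose residue variance falls below the recorded threshold simply inherits the previous frame's average, which is what makes the model cheap to evaluate per MB.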

Coefficient bits estimation
The rate-quantization model proposed in [21] is used to estimate the coefficient bits:

F_i = A · K · σ_i^2 / Q_i^2,

where F_i denotes the bits required for encoding the DCT coefficients of the ith macroblock, σ_i denotes the standard deviation of motion-compensated residues, Q_i is the quantization step size, A is the number of pixels in a macroblock (i.e., 16 × 16 = 256), and K is a constant that can be set to e/ln 2 if the DCT coefficients are Laplacian distributed and independent [21]. However, since the DCT coefficients may not strictly follow the Laplacian distribution, it is better to adaptively update the value of K, macroblock by macroblock and frame by frame. More details are discussed in Section 3.3.
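Reading the model of [21] as F_i = A·K·σ_i²/Q_i² (the text gives K = e/ln 2 under the Laplacian assumption), the coefficient-bits estimate is a one-liner. The function name is ours; treat this as a sketch of the model, not the reference implementation.

```python
import math

A = 256                              # pixels in a 16x16 macroblock
K_LAPLACIAN = math.e / math.log(2)   # ~3.92 for Laplacian, independent DCT coefficients

def estimate_coeff_bits(sigma, q_step, k=K_LAPLACIAN, a=A):
    """Texture (DCT coefficient) bits for one macroblock: a * k * (sigma/q)^2."""
    return a * k * (sigma / q_step) ** 2
```

In practice K is re-estimated MB by MB (Section 3.3) rather than held at its Laplacian value.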

Distortion estimation
The following well-known distortion-quantization model [15] is used to measure the distortion of the encoded macroblocks:

D = Σ_{i=1}^{N} α_i^2 · Q_i^2 / 12,

where N is the total number of macroblocks in one frame and α_i is the distortion weight of the ith macroblock, which can be used to incorporate the importance of that macroblock's distortion. In this implementation, however, these weights are used to reduce the bit overhead caused by recording each macroblock's QP individually at low bit rates.
If the values of QP for consecutive macroblocks are differentially encoded in raster-scan order, frequent QP changes between macroblocks consume too many bits. This effect is negligible at high bit rates but becomes increasingly significant at low bit rates. We therefore control the dynamic range of QP simply by setting the values of α_i. At lower bit rates, α_i is determined from the respective standard deviation of residues σ_i by the method proposed in [15]. At higher bit rates (above 0.5 bits/pixel), all α_i are set to 1.

PROPOSED RATE CONTROL ALGORITHM

Figure 2 shows the flowchart of the proposed rate control scheme. The three major steps are the above-mentioned pre-analysis, frame-layer bit allocation, and macroblock-layer rate control.

Pre-analysis
Through pre-analysis using Inter 16 × 16 mode, we obtain the necessary source information for R-D estimation before the RDO. The predicted information is used to determine the bit budget for frames and the coarse QPs for macroblocks.

Frame-layer bit allocation
In [9], a fluid flow traffic model was proposed to compute the target bits for the current coding frame. Although this model achieves accurate bit-rate control, it only considers the buffer states (i.e., rate) and not distortion, which may limit the quality improvement. In our previous work [11], we proposed a frame-layer bit allocation scheme that integrates both rate-distortion cost and target bit rate. The scheme can be divided into two steps.
First, we determine the number of target bits for the current frame, without considering the buffer state, using the following equation: where R is the available channel bandwidth and f is the frame rate. J_cur is the RDcost of the current frame, defined as the sum of the RDcost of all MBs in the current frame. Note that macroblock-layer rate control is not yet enabled at this moment. Recall that the pre-analysis stage uses the Inter 16 × 16 mode for pre-encoding, so J_cur is actually the RDcost of the current frame under the Inter 16 × 16 mode. J̄ is the average RDcost of the encoded frames in the group of pictures (GOP); the GOP size is 100 frames. J_prev,0 is the sum of the RDcost of all zero-coefficient macroblocks in the previous frame, where a zero-coefficient macroblock is one whose coefficients are all quantized to zero after transform and quantization. P_n is the average PSNR of the most recent n frames, computed with a sliding window (of length 8), and P̄ is the average PSNR of the encoded frames in the GOP.

Second, the target number of bits for a frame is further adjusted according to the buffer state, in a way similar to the fluid flow traffic model [11,20]: where M is the buffer size and L is the currently observed buffer fullness. The strength of the restriction depends on the parameters λ_1 and λ_2, which are determined from the normalized buffer fullness (L/M) via two linear functions. λ_1 and λ_2 range linearly from 0 to 1 according to the current buffer state. The two functions converge at the point (0.2, 1), which means that no constraint is imposed when L/M is 0.2; a stronger restriction is imposed when the buffer level is extremely high or low. These control points of the linear functions can be adjusted to meet varying requirements and buffer conditions.
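The exact forms of the two linear λ functions are not reproduced in this text, but the constraints stated above (both range from 0 to 1, both equal 1 at normalized fullness 0.2, and both tighten toward the buffer extremes) admit a simple piecewise-linear instantiation. The following is a hypothetical sketch consistent with those constraints, not the paper's exact functions.

```python
def lambda1(x):
    """Hypothetical lambda_1(L/M): 0 at an empty buffer, rising linearly to 1 at x = 0.2."""
    return min(x / 0.2, 1.0)

def lambda2(x):
    """Hypothetical lambda_2(L/M): 1 at x <= 0.2, falling linearly to 0 at a full buffer."""
    return 1.0 if x <= 0.2 else (1.0 - x) / 0.8
```

At L/M = 0.2 both functions return 1 (no restriction on the frame target), while near 0 or 1 one of them approaches 0, tightening the buffer-based correction.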

Determining coarse QP
We mainly focus our discussion on the low-delay situation, where macroblock-layer rate control is more critical, and consider the IPPP... GOP structure. The most crucial task of macroblock-layer rate control is to determine the QP for every individual macroblock. For I frames, the method in the JM 9.3 reference software is used to determine the QPs in this implementation. In the following, we only discuss QP determination for P frames.
The optimized quantization step size Q_i* for the ith MB can be determined by minimizing the overall distortion D subject to a given bit budget B, namely, minimizing the RDcost as follows:

min D    subject to    Σ_{i=1}^{N} (F_i + H_i) ≤ B.

This kind of optimization problem can be solved by the Lagrangian optimization technique [21]:

Q_i* = sqrt( (σ_i/α_i) · A · K_{i−1} · Σ_{j=1}^{N} α_j σ_j / (B − C_i · Σ_{j=1}^{N} com_j) ).        (9)

Note that σ_i in the equation is the standard deviation of motion-compensated residues of the Inter 16 × 16 mode in the pre-analysis phase. Formula (9) is used to compute the coarse QP of each macroblock. The parameters K_{i−1} and C_i are recursively updated (MB by MB) during the encoding of the successive macroblocks; more details are given in Section 3.3.5.
From (9), we can see that if α_i is close to σ_i, the term σ_i/α_i becomes 1, and thus all the quantization steps in one frame are approximately equal; the range of QP is then reduced. This explains the aforementioned choice of the distortion weights.
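The closed-form coarse quantization steps can be computed in one pass over the frame. This sketch follows our reading of formula (9): the bit budget is first reduced by the modeled header bits C·Σcom_j, and the remaining coefficient budget is split across MBs in proportion to σ_i/α_i. The function name and defaults are illustrative.

```python
import math

def coarse_q_steps(sigmas, alphas, coms, budget, A=256, K=3.92, C=1.0):
    """Quantization steps minimizing weighted distortion under the frame
    bit budget, per the Lagrangian solution (9).

    sigmas -- predicted residue deviations sigma_i (Inter 16x16 pre-analysis)
    alphas -- distortion weights alpha_i
    coms   -- header complexity measures com_i
    budget -- target bits B for the frame
    """
    S = sum(a * s for a, s in zip(alphas, sigmas))   # sum of alpha_j * sigma_j
    coeff_budget = budget - C * sum(coms)            # bits left after modeled headers
    scale = A * K * S / coeff_budget
    return [math.sqrt((s / a) * scale) for s, a in zip(sigmas, alphas)]
```

When α_i = σ_i for all i, the σ_i/α_i factor is 1 and every MB receives the same step size, which is exactly the QP-range-flattening behavior discussed above.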

Motion estimation
The resultant Q_Coarse (i.e., Q_i*) and λ_Motion = 0.85 × 2^((QP_Coarse − 12)/3) are used in motion estimation to search for the best motion vectors for each macroblock under a given mode.
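The Lagrange multiplier above is the standard H.264 exponential-in-QP form and is trivial to compute; a minimal sketch (function name ours):

```python
def lagrange_multiplier(qp_coarse):
    """lambda_Motion = 0.85 * 2^((QP_coarse - 12) / 3), as used in H.264 ME."""
    return 0.85 * 2 ** ((qp_coarse - 12) / 3)
```

Because the multiplier doubles every 3 QP steps, a coarse QP that is off by a few steps noticeably shifts the rate-distortion trade-off of motion search, which is why the coarse QP is computed before ME.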

Quantization parameter refinement
From Section 2, we know that the coefficient model is based on the actual standard deviation of the motion-compensated residues. Clearly, the standard deviation obtained in the pre-analysis may differ from the actual standard deviation if the RDO process selects a prediction mode other than Inter 16 × 16. This introduces some error into the QP calculation, especially for high-motion videos at high bit rates, because Inter 16 × 16 is less likely to be selected in such situations.
We observe that for mode I_k, the standard deviation of motion-compensated residues σ_i*(I_k) can be obtained easily after motion estimation (ME) in the loop of the RDO process. Then, the QP of each mode, denoted QP_Ik, can be calculated using (9), where we simply replace σ_i with σ_i*(I_k). After all modes are checked by RDO, the encoder uses QP_Ik in the comparison of RDcost to choose the best prediction mode (I_best) for the current macroblock.
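The per-mode refinement can be sketched as follows: for each candidate mode, recompute the QP from the mode's actual residue deviation σ_i*(I_k), then clamp it to the coarse QP ± 4 (the restriction described in Section 3.4 for the RDO-on case). The function names are ours, and the QP-from-σ mapping is passed in as a callable standing in for formula (9).

```python
def refined_mode_qps(coarse_qp, qp_from_sigma, actual_sigmas):
    """Refined QP per candidate mode, clamped to coarse_qp +/- 4.

    coarse_qp     -- QP from the pre-analysis stage
    qp_from_sigma -- callable mapping a residue deviation to a QP (stand-in for (9))
    actual_sigmas -- {mode_name: sigma*_i(I_k)} measured after ME for each mode
    """
    refined = {}
    for mode, sigma in actual_sigmas.items():
        qp = qp_from_sigma(sigma)
        # limit QP fluctuation between neighboring macroblocks
        refined[mode] = max(coarse_qp - 4, min(coarse_qp + 4, qp))
    return refined
```

Each mode's RDcost is then evaluated with its own refined QP, and the mode with the minimum RDcost wins.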

Encoding of MBs using the best mode
To encode the ith macroblock with the best mode I_best, we define S_i = Σ_{j=i}^{N} α_j σ_j and T_i = Σ_{j=i}^{N} com_j, and rewrite (9) as follows:

Q_i* = sqrt( (σ_i/α_i) · A · K_i · S_i / (B_i − C_i · T_i) ),        (10)

where B_i is the unused number of target bits for the remaining macroblocks (from the ith to the Nth) in the current frame, and K_i and C_i are the updated values of the R-D model parameters K and C after encoding the first (i − 1) macroblocks. In this way, we can compute the QP of each macroblock by updating the required parameters macroblock by macroblock as the macroblocks are processed sequentially in one frame.
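The sequential per-MB computation with the shrinking sums S_i and T_i can be sketched as a single loop. In this illustration (names ours), the actual bits b_i spent by each MB are approximated by the model itself; a real encoder would substitute the measured bits and the re-estimated K_i and C_i after each macroblock.

```python
import math

def sequential_q_steps(sigmas, alphas, coms, budget, A=256, K=3.92, C=1.0):
    """Per-MB quantization steps following (10): each step consumes budget
    and removes its own terms from the remaining sums S_i and T_i."""
    S = sum(a * s for a, s in zip(alphas, sigmas))   # S_1
    T = sum(coms)                                    # T_1
    B = budget                                       # B_1
    q_steps = []
    for sigma, alpha, com in zip(sigmas, alphas, coms):
        q = math.sqrt((sigma / alpha) * A * K * S / (B - C * T))
        q_steps.append(q)
        # approximate the bits this MB spends by the model, then shrink the sums
        spent = A * K * (sigma / q) ** 2 + C * com
        B -= spent
        S -= alpha * sigma
        T -= com
    return q_steps
```

When every MB consumes exactly its modeled share, the remaining-budget ratio stays constant and all step sizes agree with the one-shot solution of (9), which is a useful sanity check on the bookkeeping.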

Updating some parameters of R-D model
(1) Updating B_i

B_{i+1} is updated as follows: where J_j is the R-D cost of the jth macroblock obtained in the pre-analysis stage, and b_j is the actual number of bits used to encode the jth macroblock. We adopt a weighted-average method to improve the accuracy and robustness of bit allocation. On the right-hand side of the equation, the first term indicates the unused bit budget for the remaining macroblocks to be encoded, while the second term updates the bit allocation according to the actual R-D cost of the macroblocks. Such updating according to the actual encoding results is necessary during the scan over all macroblocks.
(2) Updating K_i

(a) Compute K_i after encoding the current macroblock by inverting the coefficient bits model: K_i = F_i · Q_i^2 / (A · σ_i^2).

(b) If K_i > 0 and K_i ≤ 4.5, compute the average K̄ of the macroblocks encoded so far: where l is the number of macroblocks encoded so far whose K_i lies within (0, 4.5]. Otherwise, we regard the current value of K_i as an ineffective estimate and skip this step, so K_i remains unchanged after encoding the current macroblock.

(c) Find the weighted average of the initial estimate K_1 with K̄_i: where K_1 is the average K of the previous frame. This weighting improves the accuracy of the estimate of K, since when only the first few macroblocks in the current frame have been encoded (i.e., i is small), K̄_i is the average of only a few values and hence is not a robust estimate of K for the current frame. The updated K_i is then used in (9) and (10).
(3) Updating C_i

(a) Compute C_i after encoding the current macroblock: where Σ_{j=1}^{i} (b_j − F_j) is the total number of header bits used for encoding the first i macroblocks.

(b) Find the average C̄_i of all the encoded macroblocks in the current frame.

(c) Find the weighted average of the initial estimate C_1 with C̄_i: where C_1 is the average C of the previous frame. This weighted-average method is used for the same reason as (14). The updated C_i is then used in (9) and (10).
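The K update in step (2) can be sketched as follows; the C update in step (3) is analogous, dividing accumulated header bits by accumulated com_j. This is an illustrative reading (names ours): the model is inverted for the MB just encoded, implausible estimates outside (0, 4.5] are discarded, and the surviving estimates are averaged. The blend with the previous frame's K_1 in step (c) uses weights not reproduced in this text, so it is omitted here.

```python
def update_k(coeff_bits, q_step, sigma, valid_ks, A=256):
    """Re-estimate K from the macroblock just encoded.

    coeff_bits -- actual DCT coefficient bits F_i spent on the MB
    q_step     -- quantization step Q_i used
    sigma      -- actual residue deviation sigma_i
    valid_ks   -- running list of accepted K estimates for this frame (mutated)
    """
    # invert F_i = A * K * sigma^2 / Q^2  =>  K_i = F_i * Q^2 / (A * sigma^2)
    k_i = coeff_bits * q_step ** 2 / (A * sigma ** 2)
    if 0.0 < k_i <= 4.5:
        valid_ks.append(k_i)   # keep only plausible estimates
    # running average of the accepted estimates so far
    return sum(valid_ks) / len(valid_ks) if valid_ks else None
```

Rejected estimates leave the running average untouched, matching the "skip this step" rule in (b).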

Implementation issue related to RDO options
When our scheme was integrated into the JM 9.3 software, two different situations were considered: RDO on and RDO off (i.e., whether or not the RDO technique is applied in the mode decision process), which leads to small differences in the realization of our algorithm.

(1) RDO off
When the RDO option is switched off, RDcost comparison is not conducted for mode decision; only the SAD or SATD values (when the Hadamard transform is enabled) of each mode are compared to select the best prediction mode. Therefore, we simply examine the standard deviation of motion-compensated errors for the best mode and update its QP.
(2) RDO on

The situation is more complicated when the RDO option is switched on. The mean absolute difference (MAD) of each mode must be calculated in order to perform QP refinement. First, motion estimation is performed and all modes are checked in order; afterwards, some variables are updated if the best mode has changed, so we also apply our algorithm here. Similarly, we obtain the MAD of each 8 × 8 subblock and then introduce the refined QP for RDcost comparison. For QP refinement, the QP is restricted to a reasonable range, namely the coarse QP ± 4, to prevent excessive QP fluctuation between neighboring macroblocks.
Another issue is how many parameters of the rate control model in (9) should be updated for different modes. In fact, many model variables are associated with the standard deviation of motion-compensated residues σ_i*(I_k), but there is no need to modify them because they are less dominant than σ_i*(I_k) in deciding the refined QP. Another reason is that most of these variables are introduced in the pre-analysis phase at the frame layer, such as the number of target bits and the number of header bits; although these parameters carry some error if we do not recalculate them, it is also unsuitable to update them at the macroblock layer during the encoding process. Hence, we only trace the change of each mode's MAD and ignore other parameters that have only indirect relations with the standard deviations of motion-compensated residues. So, in our implementation, the only difference between (9) and (10) is σ_i*(I_k).

In the encoding process, the QP calculation is conducted twice in all. First, the coarse QP is obtained to compute the Lagrange multiplier for motion estimation. Second, the QPs are further refined for the different modes and used for R-D cost comparison in the RDO process. The final QP of the macroblock (i.e., the refined QP corresponding to the best mode) becomes more accurate and conforms to the actual R-D performance of the macroblock, enabling more effective and accurate rate control. The RDO process does not need to be performed again as in JVT-F086 [22]; hence we call our approach two-step QP determination but single-pass encoding.

Computational complexity analysis
The possible computational complexity overhead of our method comes from the pre-analysis stage, where the Inter 16 × 16 mode is performed to obtain the source information. However, since the results obtained in pre-analysis can be stored for use in the subsequent RDO process, there is no need to run Inter 16 × 16 again during RDO. Thus, pre-analysis only changes the algorithm flow, and the overall computational complexity increases only negligibly when the RDO option is switched on. In the RDO-off case, the encoding complexity increases by about 30% in terms of total encoding time.

RESULTS AND DISCUSSIONS
The proposed rate control scheme was implemented in the H.264 JM 9.3 encoder [23]. In this section, nine typical sequences of various resolutions and motion characteristics were tested, as listed in Table 1. The encoder configuration is shown in Table 2. The performance of our proposed scheme is evaluated in comparison with the original JM 9.3 encoder and the existing rate control functionality in JM 9.3. We also compared the proposed approach with the variant that does not refine the QP for mode decision. In the simulation, we first encoded each sequence using a fixed quantization parameter to determine the target bit rate. Then the same video was encoded again using the rate control scheme in JM 9.3 and our rate control algorithm, respectively, and the obtained PSNRs and bit rates were compared. We adopt the method in [20] to determine the starting quantization parameter QP_0, which is predefined based on the available channel bandwidth and the GOP length. In our implementation, the QP for the first I frame is 4 less than that of the fixed-QP scheme. The same starting QP is used in the JM 9.3 rate control scheme for a fair comparison of PSNR.
Tables 3 to 6 list the comparison of the experimental results among JM 9.3 rate control (RC), the proposed rate control without QP refinement (PRC w/o QP refinement), and the proposed rate control with QP refinement (PRC with QP refinement). We analyzed the performance of these three rate control schemes with JM 9.3 fixed QP (FQP) as the benchmark, where each video sequence was encoded at seven different bit rates with JM 9.3 for fixed QPs ranging from 20 to 44 (the QPs were kept unchanged for all frames). For the other three rate control schemes, the QPs in the tables were only used for I frames, and the QPs of P frames were dynamically adjusted by the aforementioned algorithm during the encoding process. R is the overall bit rate.
As observed from Tables 3 to 6, our rate control scheme with QP refinement outperforms the existing rate control schemes; only part of the results are presented in this paper to save page space. Figures 3 and 4 show frame-by-frame PSNR comparisons for "Salesman" and "Paris" in the "RDO on" case. Interestingly, our scheme is relatively more effective for sequences with low bit rates and low motion, because the Inter 16 × 16 mode is more likely to be selected by RDO in such situations; the inaccuracies resulting from the inconsistency of prediction modes between the pre-analysis stage and the RDO stage are thus largely avoided. Thanks to the QP refinement algorithm, however, the performance on high-motion and high-bit-rate sequences is also improved. In future work, we may use the Inter 8 × 8 mode for pre-encoding to obtain more accurate source information for such sequences.

CONCLUSION
We have presented a novel RDO-based rate control algorithm for H.264 that addresses the major difficulties of H.264 rate control. The pre-analysis stage is used to break the chicken-and-egg dilemma. Robust header bits and coefficient bits prediction models are established by adaptively updating the model parameters. The frame-layer bit allocation is simple and effective. With the two-step QP determination but single-pass encoding scheme at the macroblock layer, each macroblock's QP is further refined and thus conforms closely to its actual needs. As shown by the test results, our proposed rate control scheme significantly outperforms both the original JM 9.3 with fixed QP and the existing rate control scheme in JM 9.3 in terms of PSNR improvement, while maintaining bit-rate accuracy.