3.1. Architecture of the target system
Figure 1 shows a block diagram of the target system used in this research. The input video is encoded by an H.264/AVC encoder, and the reconstructed frames are stored for use in encoding subsequent frames. It is assumed that the encoder and decoder sides include modules to adjust the spatial resolution of the input and displayed videos, respectively. Based on the target bitrate and other encoding results, such as the current bitrate and PSNR, the bitrate control module determines the proper QP value for the encoder and the spatial resolution ratio for the resolution conversion module. To implement the proposed bitrate control module in Figure 1, the bitrate control algorithm in the JM 13.2 reference software is used to control the QP value, whereas a new algorithm, described next, is proposed for spatial resolution control. In this article, the spatial resolution for the best video quality is determined by considering the PSNR, an objective quality measure. PSNR values are obtained from the difference between the original video and the up-sampled reconstructed video. PSNR is chosen over VQM because the VQM calculation is computationally expensive and requires buffering several frames, whereas the PSNR computation is quite simple.
3.2. Spatial resolution control
The reconstructed and up-sampled video contains two kinds of distortion: one generated by the encoding process and the other caused by spatial up/down-sampling. When spatial resolution control is used for bitrate control, it is important to find the resolution ratio that yields the best video quality under a given target bitrate. According to [39], experiments have shown that the PSNR degradation due to down-sampling and up-sampling increases approximately in proportion to the bitrate and to the extent of the reduction in spatial resolution. Let PSNRcoding_down denote the PSNR of a video that has been down-sampled and encoded. Then, PSNRcoding_down is formulated as
(1)
where q1, q2, and q3 are constants that depend on the video content and R is the bitrate of the encoded stream. The term sa, referred to here as the spatial resolution ratio, represents the ratio of the down-sampled frame area to the original frame area. When sa is smaller than 1, the frame is down-sampled. Equation (1) describes the relationship between sa and PSNR and can thus be used to calculate the optimal spatial resolution. However, the parameters q1, q2, and q3 in Equation (1) depend on the video content and cannot be known prior to encoding, so this optimal solution cannot be applied on the fly in real-time systems.
Figure 2 shows the PSNR of the HD-sized video sequence Station2 at various spatial resolution ratios and at three bitrates: 600 kbps, 1 Mbps, and 2 Mbps. In Figure 2, the solid curves show the PSNR values obtained by simulation. Each curve has a peak PSNR value at a certain sa. The PSNR without spatial down-sampling (sa = 1) is denoted by PSNRfull. The peak PSNR of each curve is denoted by PSNRpeak, and the sa that gives PSNRpeak is denoted by sapeak. In the curve for the 600 kbps bitrate in Figure 2, PSNRfull is marked with a circle, while PSNRpeak and sapeak are marked with a triangle and a rectangle, respectively. If the spatial resolution ratio is adjusted to sapeak, the highest PSNR is achieved for the given bitrate.
In order to reduce the complexity of calculating sapeak, a method for finding sapeak based on a simplified model derived from Figure 2 is proposed. In Figure 2, dotted lines connect PSNRfull and PSNRpeak. Within the range of sa from sapeak to 1, the dotted lines are very close to the measured data. Based on this proximity, the expression for PSNRcoding_down in Equation (1) is reformulated as the simplified linear model for low-bitrate control shown in Equation (2), in which α and β are positive and α represents the slope of the modeled lines. If α, PSNRfull, and PSNRpeak are given, sapeak can be estimated using the linear model in Equation (2).
PSNRcoding_down = −α · sa + β (2)
The slope of the model, α, can be estimated using the PSNR and sa values obtained from encoded frames. Let PSNRprev and saprev denote the PSNR and sa of the previous GOP, respectively, and let PSNRfull be obtained when the first GOP is encoded with sa = 1. Based on the linear model in Equation (2), PSNRprev and PSNRfull are expressed as given in Equations (3) and (4), respectively.
PSNRprev = −α · saprev + β (3)
PSNRfull = −α + β (4)
Subtracting (4) from (3) yields an estimate of α, denoted by αest, as given in (5). The slope αest obtained from (5) is the α value delayed by one GOP; it rests on the assumption that, owing to the similarity between successive GOPs, an estimate derived from the current GOP remains valid for the next one.
αest = (PSNRprev − PSNRfull) / (1 − saprev) (5)
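As a concrete illustration, the slope estimate in (5) can be sketched as a small routine; the function name and signature below are illustrative, not from the reference software.

```python
def estimate_alpha(psnr_prev: float, psnr_full: float, sa_prev: float) -> float:
    """Estimate the slope alpha of the linear model from the previous GOP's
    PSNR and resolution ratio, per Eq. (5).

    Subtracting (4) from (3) gives
        PSNR_prev - PSNR_full = alpha * (1 - sa_prev),
    so alpha follows directly when the previous GOP was down-sampled.
    """
    if not (0.0 < sa_prev < 1.0):
        raise ValueError("Eq. (5) requires a down-sampled previous GOP (0 < sa_prev < 1)")
    return (psnr_prev - psnr_full) / (1.0 - sa_prev)
```

For example, if the previous GOP was encoded at saprev = 0.5 with a PSNR 2 dB above PSNRfull, the estimated slope is 4.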
PSNRpeak cannot be determined simply because it varies with both the image content and the target bitrate. In order to estimate PSNRpeak, the PSNRpeak/PSNRfull ratios of 12 video sequences are measured under various target bitrates and at various values of the slope α. The sequences used are Akiyo and Coast Guard at CIF (352 × 288) resolution; City, Crew, and Ice at 4CIF (704 × 576) resolution; Aspen, Factory, Old Town Cross, Parkrun, and Pedestrian Area at HD (1280 × 720) resolution; and West Wind Easy and Touchdown Pass at full HD (1920 × 1080) resolution. A sample of the results is shown in Figure 3, which shows that the PSNRpeak/PSNRfull ratio is approximately proportional to the slope α regardless of sequence type and bitrate. From this observation, PSNRpeak is calculated as given in (6), with the coefficients chosen experimentally.
(6)
Based on the linear model in (2), sapeak is calculated within the range from 0.1 to 1 as
sapeak = 1 − (PSNRpeak − PSNRfull) / α (7)
where α and PSNRpeak are obtained from (5) and (6), respectively.
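Under the linear model, (7) amounts to solving for the ratio at which the line through (1, PSNRfull) reaches PSNRpeak and clamping the result to the stated range [0.1, 1]; a minimal sketch with illustrative names:

```python
def estimate_sa_peak(psnr_peak: float, psnr_full: float, alpha: float) -> float:
    """Eq. (7): the resolution ratio at which the linear model reaches
    PSNR_peak, clamped to the range [0.1, 1] used in the article."""
    sa_peak = 1.0 - (psnr_peak - psnr_full) / alpha
    return max(0.1, min(1.0, sa_peak))
```

A steeper slope α places the peak closer to full resolution; a large predicted quality gain with a shallow slope drives the ratio toward the 0.1 floor.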
The slope α cannot be estimated before the second GOP because PSNRprev and saprev in (5) are not yet available. In this case, the initial value of α, referred to as αinit, is set to a small value between 0.1 and 0.3. If PSNRpeak from (6) is substituted into (7), sapeak is inversely proportional to α; hence, a small αinit yields a relatively large sapeak, which avoids the PSNR degradation caused by excessive down-sampling. Once PSNRprev and saprev are obtained from the results for the second GOP encoded with αinit, the αest calculated from (5) is used as follows:
α = αinit (until the second GOP is encoded), α = αest (thereafter) (8)
The target spatial resolution ratio, satarget, is adjusted only when the video quality is lower than an acceptable level. Let PSNRtarget denote the PSNR for the acceptable image quality required by users or applications. It is commonly assumed that the reconstructed image is visually indistinguishable from the original when the PSNR exceeds roughly 35 to 40 dB [40–43]; thus, PSNRtarget is set to 40 dB in this article. If PSNRfull is larger than PSNRtarget, no spatial resolution adjustment is necessary, that is, the target spatial resolution ratio is set to 1. Otherwise, satarget is adjusted to the sapeak given by (7). Thus, satarget is given by
satarget = 1, if PSNRfull > PSNRtarget; satarget = sapeak, otherwise (9)
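The decision rule in (9) is then a simple comparison; a sketch with illustrative names, using PSNRtarget = 40 dB as above:

```python
PSNR_TARGET = 40.0  # dB, the acceptable-quality threshold used in this article

def target_resolution_ratio(psnr_full: float, sa_peak: float) -> float:
    """Eq. (9): keep full resolution when the full-size PSNR already
    exceeds the target quality; otherwise adopt the estimated sa_peak."""
    return 1.0 if psnr_full > PSNR_TARGET else sa_peak
```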
3.3. The proposed bitrate control algorithm
Figure 4 shows the proposed bitrate control algorithm, in which the QP and the spatial resolution ratio are determined sequentially in order to reach PSNRtarget. The QP value is decided frame by frame, whereas the spatial resolution ratio is determined once per GOP. At initialization, PSNRtarget is defined and the algorithm starts at Step 1. In the first GOP of Step 1, the spatial resolution ratio is not changed; only the QP is controlled to meet the target bitrate, as in a conventional QP-based rate control algorithm. If the target bitrate, denoted by bitratetarget in Figure 4, cannot be satisfied in Step 1, the spatial resolution for the next GOP is simply down-sampled by a factor of 2 compared with that of the current GOP, because meeting bitratetarget is paramount; encoding of the next GOP then restarts from Step 1 at the halved spatial resolution. If the generated bitrate of the GOP in Step 1 meets bitratetarget, the algorithm proceeds to Step 2. In Step 2, PSNRpeak and sapeak for the GOP to be encoded are calculated from (6) and (7), using the PSNR obtained in Step 1 and αinit; satarget is then determined from (9). In Step 3, the PSNR obtained in Step 2 is used as PSNRprev in (5) to update α. Using the updated α, denoted by αest, the PSNRpeak, sapeak, and satarget for the GOP in Step 3 are calculated from (6), (7), and (9), respectively. The satarget determined in Step 3 is then used continually for the subsequent GOPs in Step 4. As long as the R-D characteristics of successive frames are similar, encoding with a fixed satarget works well. To cope with varying R-D characteristics, actions for bitrate change and QP change are described in Figure 4: if bitratetarget changes, the relation between the spatial resolution and the PSNR changes as well, so the slope αest must be refreshed through Steps 1, 2, and 3.
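The Step 1 behavior described above, halving the resolution ratio until the QP-based control can meet the target bitrate, can be sketched as follows; encode_gop is a hypothetical stand-in for one GOP of QP-controlled encoding that reports the generated bitrate and PSNR.

```python
def step1_resolution(encode_gop, bitrate_target, sa_min=0.1):
    """Step 1 of the proposed algorithm: repeatedly halve the spatial
    resolution ratio until the generated bitrate meets the target.
    Returns the first (sa, psnr) pair satisfying the target bitrate;
    when sa == 1, the returned psnr plays the role of PSNR_full."""
    sa = 1.0
    while sa >= sa_min:
        bitrate, psnr = encode_gop(sa)   # hypothetical encoder call
        if bitrate <= bitrate_target:
            return sa, psnr
        sa *= 0.5                        # down-sample by a factor of 2
    raise RuntimeError("target bitrate unreachable at the minimum resolution")
```

With a toy encoder model whose bitrate scales linearly with sa, the loop settles on the first halved ratio whose output fits the budget.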
Before the algorithm returns to Step 1, if the increase in bitratetarget is greater than THBR, the full size is reset to 1, i.e., full resolution with no reduction of spatial resolution. If the decrease in bitratetarget is greater than THBR, the full size is set to the current satarget; the new satarget for the decreased target bitrate will then be determined as a value smaller than the current one. In this article, THBR is calculated as 0.02 × frame rate × original spatial resolution. Even when bitratetarget is unchanged, the motion characteristics of the video can change: if the motion becomes faster, the average QP of the recent frames becomes higher than that of the previous ones, and vice versa. Therefore, satarget is adjusted in a fine-grained manner according to the change in the average QP. As shown in Figure 4, the whole flow for deciding the proper spatial resolution runs automatically and does not depend on advance knowledge of the video content or the specific coding methods. Therefore, the proposed algorithm can easily be applied to real-time applications.
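The bitrate-change handling above can be summarized in a short sketch; THBR follows the formula given in the text, and the function names are illustrative.

```python
def bitrate_change_threshold(frame_rate: float, width: int, height: int) -> float:
    """TH_BR = 0.02 x frame rate x original spatial resolution (in pixels)."""
    return 0.02 * frame_rate * width * height

def refreshed_full_size(sa_target, old_bitrate_target, new_bitrate_target, th_br):
    """Full-size setting used when re-entering Step 1 after a large change
    in the target bitrate; returns None when the change is below TH_BR
    and the algorithm simply stays in Step 4."""
    if new_bitrate_target - old_bitrate_target > th_br:
        return 1.0            # large increase: restart from full resolution
    if old_bitrate_target - new_bitrate_target > th_br:
        return sa_target      # large decrease: restart from the current ratio
    return None
```

For 720p video at 30 fps, THBR is 0.02 × 30 × 1280 × 720 ≈ 553 kbps, so only target-bitrate changes larger than that trigger a re-estimation of the slope.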