### 3.1. Architecture of the target system

Figure 1 shows a block diagram of the target system used in this research. The input video is encoded by an H.264/AVC encoder, and the reconstructed frames are stored as references for encoding subsequent frames. It is assumed that the encoder and decoder sides each include a module that adjusts the spatial resolution of the input video and the displayed video, respectively. Based on the target bitrate and encoding results such as the current bitrate and PSNR, the bitrate control module determines the QP value for the encoder and the spatial resolution ratio for the resolution conversion module. In the implementation of the proposed bitrate control module in Figure 1, the bitrate control algorithm of the JM 13.2 reference software controls the QP value, whereas a new algorithm, described next, is proposed for spatial resolution control. In this article, the spatial resolution that gives the best video quality is determined from the PSNR, an objective quality measure. PSNR values are obtained from the difference between the original video and the up-sampled reconstructed video. PSNR is chosen over VQM because the VQM calculation is computationally expensive and requires buffering of a few frames, whereas the PSNR computation is quite simple.
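The article does not spell out the PSNR formula, so the following is only a minimal sketch of the conventional definition, assuming 8-bit samples (peak value 255) and frames passed as flat sequences of luma values. The function name `psnr` and its parameters are illustrative and not taken from the article. The reconstructed frame is assumed to have already been up-sampled to the original size.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Return the PSNR in dB between two equally sized frames.

    Both frames are flat sequences of sample values. The reconstructed
    frame is assumed to be up-sampled back to the original resolution.
    """
    if len(original) != len(reconstructed) or len(original) == 0:
        raise ValueError("frames must be non-empty and of equal size")
    # Mean squared error over all samples.
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because only a per-sample difference and one logarithm are needed, the measurement can be taken per frame without buffering, which is the property the text relies on.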

### 3.2. Spatial resolution control

The reconstructed and up-sampled video contains two kinds of distortion: one is generated by the encoding process and the other is caused by spatial down-sampling and up-sampling. When spatial resolution control is used for bitrate control, it is important to find the resolution ratio that yields the best video quality under a given target bitrate. According to [39], experiments have shown that the PSNR degradation due to down-sampling and up-sampling increases approximately in proportion to the bitrate and to the extent of the reduction in spatial resolution. Let PSNR_{coding_down} denote the PSNR of a video that has been down-sampled and encoded. Then, PSNR_{coding_down} is formulated as

$$\text{PSNR}_{\text{coding\_down}} = q_{1} \cdot \log R + q_{2} - q_{3} \cdot \left(\frac{1}{sa} - 1\right) \cdot R \tag{1}$$

where *q*_{1}, *q*_{2}, and *q*_{3} are constants that depend on the video content and *R* is the bitrate of the encoded stream. The term *sa*, referred to here as the spatial resolution ratio, represents the ratio of the down-sampled frame area to the original frame area. When *sa* is smaller than 1, the frame is down-sampled. Equation (1) describes the relationship between *sa* and PSNR and thus can be used for calculating the optimal spatial resolution. However, the parameters *q*_{1}, *q*_{2}, and *q*_{3} in Equation (1) depend on the video content and cannot be known prior to encoding. Thus, this optimal solution cannot be applied on the fly in real-time systems.
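As a rough illustration, Equation (1) can be transcribed directly into code. This sketch only evaluates the model; it does not search for an optimum, since the constants are content-dependent. The function name and the sample constants used in the comments are hypothetical, and the natural logarithm is an assumption (the base is not stated in the text; a different base only rescales *q*_{1}).

```python
import math


def psnr_coding_down(bitrate, sa, q1, q2, q3):
    """Evaluate the model of Equation (1) for one bitrate and one ratio sa.

    q1, q2, and q3 are content-dependent constants that are not known
    before encoding; any values passed here are placeholders.
    """
    if not 0.0 < sa <= 1.0:
        raise ValueError("sa must lie in (0, 1]")
    if bitrate <= 0:
        raise ValueError("bitrate must be positive")
    coding_term = q1 * math.log(bitrate) + q2           # natural log is assumed
    resampling_penalty = q3 * (1.0 / sa - 1.0) * bitrate
    return coding_term - resampling_penalty


# At sa = 1 the resampling penalty vanishes, leaving only the coding term.
full_resolution = psnr_coding_down(1e6, 1.0, q1=2.0, q2=10.0, q3=1e-6)
```

The penalty term grows with both the bitrate and the amount of down-sampling, which matches the proportionality reported in [39].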

Figure 2 shows the PSNR of an HD-sized video sequence, *Station2*, at various spatial resolution ratios and at three bitrates: 600 kbps, 1 Mbps, and 2 Mbps. In Figure 2, the solid curves show the PSNR values obtained by simulation. Each graph has a peak PSNR value at a certain *sa*. The initial PSNR without spatial down-sampling (that is, when *sa* = 1) is denoted by PSNR_{full}. The peak PSNR of each graph is denoted by PSNR_{peak}, and the *sa* that gives PSNR_{peak} is denoted by *sa*_{peak}. In the graph for the 600 kbps bitrate in Figure 2, PSNR_{full} is marked with a circle, while PSNR_{peak} and *sa*_{peak} are marked with a triangle and a rectangle, respectively. If the spatial resolution ratio is adjusted to *sa*_{peak}, the highest PSNR is achieved for a given bitrate.

To reduce the complexity of calculating *sa*_{peak}, a method for finding *sa*_{peak} based on a simplified model derived from Figure 2 is proposed. In Figure 2, dotted lines connect PSNR_{full} and PSNR_{peak}. Within the range of *sa* from 1 to *sa*_{peak}, the dotted lines are very close to the measured data. Based on this proximity, the expression for PSNR_{coding_down} in Equation (1) is reformulated as a simplified model for low-bitrate control, shown in Equation (2), in which α and β are positive and α represents the slope in the modeled graphs. If α, PSNR_{full}, and PSNR_{peak} are known, *sa*_{peak} can be estimated using the linear model in Equation (2).

$$\text{PSNR}_{\text{coding\_down\_low}R} = -\alpha \cdot sa + \beta. \tag{2}$$

The slope of the model, α, can be estimated using the PSNR and *sa* values obtained from encoded frames. Let PSNR_{prev} and *sa*_{prev} denote the PSNR and *sa* of the previous GOP, respectively, and let PSNR_{full} be obtained when the first GOP is encoded with *sa* = 1. Based on the linear model in Equation (2), PSNR_{prev} and PSNR_{full} are expressed as given in Equations (3) and (4), respectively.

$$\text{PSNR}_{\text{prev}} = -\alpha \cdot sa_{\text{prev}} + \beta \tag{3}$$

$$\text{PSNR}_{\text{full}} = -\alpha \cdot 1 + \beta \tag{4}$$

Subtracting (4) from (3) gives an estimate of α, denoted by α_{est}, as in (5). The slope α_{est} obtained from (5) is the α value delayed by one GOP. Its use is based on the assumption that, because successive GOPs are similar, the spatial resolution derived from the current frames can be applied to the next GOP.

$$\alpha_{\text{est}} = \frac{\text{PSNR}_{\text{prev}} - \text{PSNR}_{\text{full}}}{1 - sa_{\text{prev}}} \tag{5}$$
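The slope estimate in (5) can be sketched as a small helper. The function name and the sample numbers are illustrative only. The guard against *sa*_{prev} = 1 is an assumption added here because (5) is undefined at that point.

```python
def estimate_alpha(psnr_prev, psnr_full, sa_prev):
    """Estimate the slope alpha from the previous GOP, following Equation (5).

    psnr_full is the PSNR measured at sa = 1, and psnr_prev is the PSNR of
    the previous GOP, which was encoded at spatial resolution ratio sa_prev.
    """
    if not 0.0 < sa_prev < 1.0:
        # Equation (5) divides by (1 - sa_prev), so sa_prev must be below 1.
        raise ValueError("sa_prev must lie strictly between 0 and 1")
    return (psnr_prev - psnr_full) / (1.0 - sa_prev)


# Illustrative numbers: a 0.6 dB gain at sa_prev = 0.7 gives a slope of about 2.
alpha_est = estimate_alpha(psnr_prev=32.6, psnr_full=32.0, sa_prev=0.7)
```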

PSNR_{peak} cannot be determined directly because it varies with both the image content and the target bitrate. To estimate PSNR_{peak}, the PSNR_{peak}/PSNR_{full} ratios of 12 video sequences are measured under various target bitrates and at various values of the slope α. The following videos are used in the evaluation: *Akiyo* and *Coast Guard* with CIF (352 × 288) resolution; *City*, *Crew*, and *Ice* with 4CIF (704 × 576) resolution; *Aspen*, *Factory*, *Old Town Cross*, *Parkrun*, and *Pedestrian Area* with HD (1280 × 720) resolution; and *West Windy Easy* and *Touchdown Pass* with full HD (1920 × 1080) resolution. A sample of the results is shown in Figure 3, which shows that the PSNR_{peak}/PSNR_{full} ratio is approximately proportional to the slope α regardless of the video sequence and bitrate. From this observation, PSNR_{peak} is calculated as given in (6), with the coefficients chosen experimentally.

$$\text{PSNR}_{\text{peak}} = \left(0.03 \times \text{PSNR}_{\text{full}} \times \left(\alpha - 0.5\right)\right) + \text{PSNR}_{\text{full}} \tag{6}$$
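A minimal sketch of (6) follows. The coefficients 0.03 and 0.5 are the experimentally chosen values from the equation, while the function name and the sample inputs are illustrative.

```python
def estimate_psnr_peak(psnr_full, alpha):
    """Estimate the peak PSNR from the full-resolution PSNR, per Equation (6)."""
    return 0.03 * psnr_full * (alpha - 0.5) + psnr_full


# With alpha = 0.5 the model predicts no gain over full resolution.
no_gain = estimate_psnr_peak(psnr_full=30.0, alpha=0.5)

# A steeper slope predicts a higher peak (about 30.9 dB for these inputs).
with_gain = estimate_psnr_peak(psnr_full=30.0, alpha=1.5)
```

Note that for α below 0.5 the estimate falls below PSNR_{full}, that is, the model predicts no benefit from down-sampling.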

Based on the linear model in (2), *sa*_{peak} is calculated within the range from 0.1 to 1 as

$$sa_{\text{peak}} = 1 - \frac{\text{PSNR}_{\text{peak}} - \text{PSNR}_{\text{full}}}{\alpha} \quad (0.1 \le sa_{\text{peak}} \le 1), \tag{7}$$

where α and PSNR_{peak} are obtained from (5) and (6), respectively.
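Equation (7), including its range restriction, can be sketched as follows. Implementing the restriction as a clamp is an interpretation of the stated range 0.1 ≤ *sa*_{peak} ≤ 1, and the function name and sample inputs are illustrative.

```python
def estimate_sa_peak(psnr_peak, psnr_full, alpha, sa_min=0.1, sa_max=1.0):
    """Compute sa_peak from the linear model, per Equation (7).

    The result is clamped to [sa_min, sa_max], which is how the range
    restriction in Equation (7) is interpreted in this sketch.
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    sa_peak = 1.0 - (psnr_peak - psnr_full) / alpha
    return min(sa_max, max(sa_min, sa_peak))


# A predicted 0.9 dB gain with alpha = 1.5 gives sa_peak of about 0.4.
sa_a = estimate_sa_peak(psnr_peak=30.9, psnr_full=30.0, alpha=1.5)

# A predicted peak below PSNR_full would give sa_peak > 1, so it is clamped to 1.
sa_b = estimate_sa_peak(psnr_peak=29.5, psnr_full=30.0, alpha=0.2)
```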

Slope α cannot be estimated before the second GOP because PSNR_{prev} and *sa*_{prev} in (5) are not yet available. In this case, α is set to a small initial value between 0.1 and 0.3, referred to as α_{init}. If PSNR_{peak} in (6) is substituted into (7), *sa*_{peak} is found to vary inversely with α. Thus, if a small α_{init} is used in (7), *sa*_{peak} is relatively large, and the PSNR degradation caused by excessive down-sampling can be avoided. Once PSNR_{prev} and *sa*_{prev} are obtained from the second GOP, which is encoded with α_{init}, the α_{est} calculated from (5) is used as follows:

$$\alpha = \begin{cases} \alpha_{\text{init}}, & \text{for the second GOP} \\ \alpha_{\text{est}}, & \text{after the second GOP} \end{cases} \tag{8}$$
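The selection rule in (8) can be sketched as below. Counting GOPs from 1 and the default α_{init} of 0.2 (a value inside the stated 0.1 to 0.3 range) are assumptions of this sketch, as is the function name.

```python
def select_alpha(gop_index, alpha_est=None, alpha_init=0.2):
    """Choose the slope for a GOP, following Equation (8).

    gop_index counts GOPs from 1. The first GOP is encoded at sa = 1 and
    needs no slope, so indices below 2 are rejected.
    """
    if gop_index < 2:
        raise ValueError("alpha is only defined from the second GOP onward")
    if gop_index == 2:
        return alpha_init  # no PSNR_prev / sa_prev available yet
    if alpha_est is None:
        raise ValueError("alpha_est is required after the second GOP")
    return alpha_est
```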

The target spatial resolution ratio, *sa*_{target}, is adjusted only when the video quality is lower than an acceptable level. Let PSNR_{target} denote the PSNR for the acceptable image quality as required by users or applications. It is assumed that the reconstructed image is visually indistinguishable from the original if the PSNR exceeds a threshold in the range of 35 to 40 dB [40–43]. Thus, PSNR_{target} is set to 40 dB in this article. If PSNR_{full} is greater than or equal to PSNR_{target}, no spatial resolution adjustment is necessary, that is, the target spatial resolution ratio is set to 1. Otherwise, *sa*_{target} is set to the *sa*_{peak} given by (7). Thus, *sa*_{target} is given by

$$sa_{\text{target}} = \begin{cases} 1, & \text{if } \text{PSNR}_{\text{full}} \ge \text{PSNR}_{\text{target}} \\ sa_{\text{peak}}, & \text{otherwise} \end{cases} \tag{9}$$
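The decision in (9) reduces to a single comparison, sketched here with the 40 dB target used in this article as the default. The function name is illustrative.

```python
def select_sa_target(psnr_full, sa_peak, psnr_target=40.0):
    """Pick the target spatial resolution ratio, following Equation (9)."""
    if psnr_full >= psnr_target:
        return 1.0  # quality is already acceptable at full resolution
    return sa_peak


# Full-resolution quality below the target: fall back to the estimated peak ratio.
sa_low_quality = select_sa_target(psnr_full=30.0, sa_peak=0.4)

# Full-resolution quality at or above the target: keep the original resolution.
sa_high_quality = select_sa_target(psnr_full=41.2, sa_peak=0.4)
```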

### 3.3. The proposed bitrate control algorithm

Figure 4 shows the proposed bitrate control algorithm, in which the QP and the spatial resolution ratio are determined sequentially in order to reach PSNR_{target}. The QP value is decided frame by frame, whereas the spatial resolution ratio is determined for each GOP. At initialization, PSNR_{target} is defined and the algorithm starts at Step 1. In the first GOP, in Step 1, the spatial resolution ratio is not changed; only the QP is controlled to meet the target bitrate, as in a conventional QP-based rate control algorithm. If the target bitrate, denoted by bitrate_{target} in Figure 4, cannot be satisfied in Step 1, the spatial resolution for the next GOP is simply down-sampled by a factor of 2 relative to that of the current GOP, because meeting bitrate_{target} is paramount. The encoding of the next GOP then starts again from Step 1 with the halved spatial resolution. If the generated bitrate of the GOP in Step 1 meets bitrate_{target}, the algorithm proceeds to Step 2. In Step 2, PSNR_{peak} and *sa*_{peak} for the GOP to be encoded are calculated from (6) and (7), using the PSNR obtained in Step 1 and α_{init}. The *sa*_{target} is then determined from (9). In Step 3, the PSNR obtained in Step 2 is used as PSNR_{prev} in (5) to update α. Using the updated α, denoted by α_{est}, the PSNR_{peak}, *sa*_{peak}, and *sa*_{target} for the GOP in Step 3 are calculated from (6), (7), and (9), respectively. The *sa*_{target} determined in Step 3 is then used for the subsequent GOPs in Step 4. As long as the R-D characteristics of successive frames are similar, encoding with this *sa*_{target} works well. To cope with varying R-D characteristics, Figure 4 also describes the actions taken when the bitrate or the QP changes.
If bitrate_{target} changes, the relation between the spatial resolution and the PSNR also changes, so α_{est} needs to be refreshed through Steps 1, 2, and 3. Before going back to Step 1, the full size is set to 1 (full resolution, with no reduction of spatial resolution) when the increase in bitrate_{target} is greater than TH_{BR}. If the decrease in bitrate_{target} is greater than TH_{BR}, the full size is set to the current *sa*_{target}; the new *sa*_{target} for the decreased target bitrate will then be determined as a value less than the current *sa*_{target}. In this article, TH_{BR} is calculated as 0.02 × frame rate × original spatial resolution. Even when bitrate_{target} stays the same, the motion characteristics of the video can change. If the motion becomes faster, the average QP of the recent frames becomes higher than that of the previous ones, and vice versa. Therefore, *sa*_{target} is adjusted in a fine-grained manner according to the change in the average QP. As shown in Figure 4, the whole flow for deciding the proper spatial resolution runs automatically and does not depend on advance information about the characteristics of the video content or on specific coding methods. Therefore, the proposed algorithm can easily be applied to real-time applications.
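The reaction to a change in bitrate_{target} can be sketched as follows. Several points are assumptions of this sketch rather than statements of the article: the original spatial resolution is taken as the pixel count per frame, bitrates are taken in bits per second, and a change smaller than TH_{BR} is assumed to leave the full size unchanged. All function and parameter names are hypothetical.

```python
def bitrate_threshold(frame_rate, width, height):
    """TH_BR = 0.02 x frame rate x original spatial resolution.

    The resolution is interpreted here as pixels per frame (an assumption).
    """
    return 0.02 * frame_rate * width * height


def update_full_size(old_bitrate, new_bitrate, th_br, current_full_size, sa_target):
    """Return the full size to use before restarting from Step 1."""
    change = new_bitrate - old_bitrate
    if change > th_br:
        return 1.0            # large increase: restart from full resolution
    if -change > th_br:
        return sa_target      # large decrease: restart from the current ratio
    return current_full_size  # small change: assumed to keep the full size


# For 720p at 30 frames per second the threshold is roughly 553 kbps.
th_br = bitrate_threshold(frame_rate=30, width=1280, height=720)
```

Restarting from the current *sa*_{target} after a large decrease reflects the statement that the new *sa*_{target} will be smaller than the current one, so larger ratios need not be revisited.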