Fast mode decision based on human noticeable luminance difference and rate distortion cost for H.264/AVC
EURASIP Journal on Advances in Signal Processing volume 2013, Article number: 60 (2013)
Abstract
This article proposes a fast mode decision algorithm based on the correlation between the just-noticeable-difference (JND) and the rate distortion cost (RD cost) to reduce the computational complexity of H.264/AVC. First, the relationship between the average RD cost and the number of JND pixels is modeled by Gaussian distributions. The RD cost of the Inter 16 × 16 mode is then compared with the thresholds predicted by these models for fast mode selection. In addition, we use the image content, the residual data, and the JND visual model for horizontal/vertical detection, and utilize the result to predict the partition of a macroblock. The experimental results show that a substantial time saving is achieved while the proposed algorithm effectively maintains rate distortion performance and visual quality.
1. Introduction
As technology advances, multimedia communication has become an important part of daily life. In addition to general telecommunications, widespread reliance on the Internet has made video communication essential. However, the quality of video communication depends highly on the efficiency and quality of video transmission. Therefore, many international standards have been developed in recent years. H.264/AVC is one of the most popular video coding standards [1]. It is widely applied in video transmission and compression products, e.g., mobile phones, video surveillance, and digital TV. Although H.264/AVC has high coding efficiency, it requires enormous computational complexity. In particular, the mode decision procedure accounts for the majority of the computational load, because several inter modes and nine intra predictive directions for Intra 4 × 4 must be evaluated, as shown in Figures 1 and 2, respectively. Many studies on reducing the computational complexity of mode decision have been proposed.
Bharanitharan et al. [2] proposed a classified region algorithm to reduce the inter mode candidates. Analyses of spatial/temporal homogeneity and edge direction were used to choose the inter modes needed for the rate distortion optimization (RDO) calculation. Choi et al. [3] considered that macroblocks (MBs) within the same object share the same motion vectors; they therefore tested the homogeneity of an MB through the characteristics of each 4 × 4 block after a Haar wavelet transform in order to select the candidate modes. Pan et al. [4] reordered the modes according to their probabilities and utilized the mean value and the standard deviation of the rate distortion costs (RD costs) as the early termination criterion of RDO. Lee and Lin [5] utilized the probabilities of several modes to calculate the average computation time of each mode. Yeh et al. [6] predicted the best mode based on Bayesian theory and refined the prediction with a Markov process; the computational complexity was efficiently reduced. A SKIP mode condition that considers the neighboring and co-located information was presented in [7] to reduce the coding time. The relation between the depth value and the mode distribution was analyzed in [8], and the mode candidates were chosen according to the depth level of an MB. Statistics were gathered for both the RD cost and the occurrence probability of each mode in [9], where the normal distribution of the RD cost was adopted to calculate the thresholds for early termination. A 2D map was generated from the neighboring motion vectors in [10], and inter modes were reordered or removed via this map. Ri et al. [11] defined a spatial-temporal mode prediction in which the calculated RD cost and the co-located mode were utilized to produce the threshold for mode selection.
Visual characteristics of tunnel surveillance videos were considered in [12], where the structure of neighboring inter/intra blocks was analyzed to adapt to the static, fixed backgrounds of such observation systems. In [13], codes compliant with the coding order of previously coded neighboring blocks were assigned to increase the opportunity for early termination. The relations between the discrete cosine transform (DCT), the sum of absolute differences, and the sum of squared differences were established as early termination conditions in [14]. Not only the DCT but also the magnitude order of the RD cost was discussed in [15]. The connection between the quantization parameter (QP) and the RD cost was investigated in [16] to serve as the threshold, where the activity was calculated from the motion vectors of the neighboring and co-located blocks of the current block.
In addition to methods that exploit rate distortion and motion information, the human visual system (HVS) is also useful for improving video coding. Just-noticeable-difference (JND) is one of the important characteristics of the HVS. A model of the luminance difference perception of human vision was developed in [17], providing knowledge about visually perceptible luminance distortion. The JND characteristics were employed in [18] to analyze video content in order to reduce computational complexity. In [19], JND was utilized to re-measure image distortion, and a perceptual rate distortion model was used to judge mode candidates. In [20], the gradient, variance, average contrast, and edge data of an MB provided the information needed to account for the HVS.
This article proposes a fast mode decision algorithm which utilizes the correlation between JND pixels and the RD cost to reduce the number of mode candidates. The rest of this article is organized as follows. In Section 2, the JND visual model and the total number of non-JND pixels are discussed. The proposed fast mode decision algorithm is described in Section 3. The extensive experimental results are presented in Section 4. In Section 5, concluding remarks are provided.
2. JND visual model
2.1. Human visual luminance difference
A JND model was applied as the human visual model, as mentioned in [17]. JND refers to the visual threshold determined by the background luminance: if the luminance difference between the foreground and the background is smaller than a certain threshold, the human eye is unable to detect it. That is, human eyes tolerate a certain amount of luminance distortion. This feature can be incorporated into a fast mode decision if the perception of a block by the human eye can be characterized by it, thereby reducing redundant computational complexity. The JND visual model, which indicates the visible luminance distortion, is shown in Figure 3.
JND(Y(i,j)) = T_{0} · (1 − (Y(i,j)/127)^{1/2}) + 3, if Y(i,j) ≤ 127
JND(Y(i,j)) = γ · (Y(i,j) − 127) + 3, if Y(i,j) > 127 (1)

where Y(i,j) is the background luminance, and T_{0} and γ are constants (T_{0} = 17, γ = 3/128). The horizontal axis of Figure 3 represents the gray level of the background; each value corresponds to a JND value on the vertical axis. If the gray level difference between the background and an object is smaller than the human visual distortion threshold, denoted as the JND value, the object cannot be detected visually. This concept of the visually perceptible gray level difference can be extended to the temporal domain, where the noticeable difference characteristic is used to observe the variation of the gray level over time. In a video stream, the gray level varies at every pixel location from frame to frame; some of these variations are easy to detect, while others appear imperceptible. The JND model can thus determine which magnitudes of luminance variation are noticeable to the human eye. This is the purpose of applying the noticeable difference in the temporal domain, and this pixel-level usage can also be extended to an MB. Therefore, the human visual distortion criterion of an MB is determined by this characteristic.
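The piecewise luminance model can be sketched as follows. This is a minimal illustration assuming the widely used background-luminance JND form with the constants T_{0} = 17 and γ = 3/128 stated above; the function and variable names are ours, not from the paper.

```python
import math

T0 = 17              # visibility threshold at zero background luminance
GAMMA = 3.0 / 128.0  # slope of the linear segment for bright backgrounds

def jnd_threshold(y):
    """Luminance JND for a background gray level y in [0, 255].

    Dark backgrounds (y <= 127) follow a square-root law; bright
    backgrounds follow a linear law with slope GAMMA.
    """
    if y <= 127:
        return T0 * (1.0 - math.sqrt(y / 127.0)) + 3.0
    return GAMMA * (y - 127) + 3.0
```

For example, jnd_threshold(0) gives 20.0, the minimum jnd_threshold(127) gives 3.0, and jnd_threshold(255) gives 6.0, reproducing the U-shaped curve of Figure 3.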
In the proposed algorithm, the residual value of every pixel in an MB is compared with its JND value. That is, the intensity values of the original pixels in the current MB are treated as the background luminance in the JND model, so that the JND value (visual threshold) can be obtained from the model. If the residual value is less than the JND value, the variation of the pixel cannot be perceived by human eyes.
2.2. Human visual characteristics in an MB
After describing the JND visual model, the details of how to utilize this visual characteristic in an MB are discussed. First, an MB is divided into four 8 × 8 blocks. Each 8 × 8 block is subtracted from the block of the reference frame at the predicted location, which is produced by the predictive motion vector, as illustrated in Figure 4. The component-wise medians of the motion vectors of the nearest neighboring coded MBs are taken as the predicted motion vectors. This is slightly different from the motion vector predictor (MVP) in H.264/AVC; an example is exhibited in Figure 5. In this step, the motion estimation of each 8 × 8 block does not need to be executed. The current MB is separated into four 8 × 8 blocks (C_{0}, C_{1}, C_{2}, C_{3}), while the coded neighboring MBs are, respectively, the left (L), top (T), and upper right (UR) MBs. The left MB uses the 8 × 8 mode, labeled as L_{0}, L_{1}, L_{2}, and L_{3}. The top MB uses the 8 × 16 mode, labeled as T_{0} and T_{1}. The upper right MB uses the 16 × 8 mode, labeled as UR_{0} and UR_{1}. In this example, the predictive motion vector of block C_{0} is calculated from the motion vectors of L_{1}, T_{0}, and UR_{1}, while that of block C_{1} is obtained from the motion vectors of L_{1}, T_{1}, and UR_{1}. The predictive motion vector of block C_{2} is obtained from the motion vectors of L_{3}, T_{0}, and UR_{1}, while that of block C_{3} is calculated from the motion vectors of L_{3}, T_{1}, and UR_{1}.
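The median predictor described above can be sketched as a component-wise median of the three neighboring motion vectors. The vector values below are illustrative, not taken from Figure 5:

```python
def median_mv(mv_a, mv_b, mv_c):
    """Component-wise median of three (x, y) motion vector tuples."""
    xs = sorted([mv_a[0], mv_b[0], mv_c[0]])
    ys = sorted([mv_a[1], mv_b[1], mv_c[1]])
    return (xs[1], ys[1])

# Example for block C0 of Figure 5: its neighbors are L1, T0, and UR1
# (the motion vector values here are made up for illustration).
mv_L1, mv_T0, mv_UR1 = (1, 2), (3, 4), (5, 0)
pred_mv_C0 = median_mv(mv_L1, mv_T0, mv_UR1)
```

With these illustrative inputs the predictor returns (3, 2), the median of each component taken independently.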
An 8 × 8 block is chosen as the unit of JND measurement instead of a 16 × 16 or 4 × 4 block in consideration of the mode structure of H.264/AVC. This mode structure can be imagined as a pyramid in which both coarse and detailed block partitions are considered; hence, there is a tradeoff in selecting the block size. It should be neither as coarse as a 16 × 16 block nor as detailed as a 4 × 4 block. The choice of layer is confined to the original mode structure of H.264/AVC, which only supports block sizes of 16, 8, and 4. Regardless of the image resolution, the largest block size is 16 × 16 and the smallest is 4 × 4; 8 × 8 is therefore intermediate between the two. If 16 × 16 were selected for JND measurement, the MB could not capture the different motions of smaller partitions, while if 4 × 4 were adopted, the predictive motion vectors could be too diverse. Selecting 8 × 8 compensates for the drawbacks of blocks that are too large or too small, so an 8 × 8 block is suitable as the basic unit for measuring visual distortion.
After the residual values of the four 8 × 8 blocks in an MB are obtained, the intensity values of the original pixels in these four blocks are treated as the background luminance in the JND model, which yields a JND value for every pixel location. If the displaced residual value is smaller than the JND value, the change of the gray level cannot be detected by human eyes, because the difference between the current pixel and the one at the predicted location is smaller than the noticeable luminance distortion. Pixels are counted in this manner for every 8 × 8 block, giving the number of non-JND pixels (N_{NJND}) in each 8 × 8 block. The summation over the four 8 × 8 blocks is the total number of non-JND pixels (TN_{NJND}) in an MB. N_{NJND} and TN_{NJND}, which derive from the original JND visual model, provide a criterion of visual perception. If an MB has a larger TN_{NJND}, it possesses more unnoticeable visual luminance distortion, because this count is the number of pixels with an unnoticeable difference. An MB with a large TN_{NJND} belongs to relatively low-complexity movement or image content, since most of the temporal differences at the predicted location cannot be detected by human eyes. On the contrary, if TN_{NJND} is small, the temporal difference is easily detected, and thus the MB has relatively high-complexity movement or image content.
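The counting procedure can be sketched as follows. The piecewise JND threshold is repeated here under the same assumption as before (the standard background-luminance form with T_{0} = 17, γ = 3/128); the function names are ours.

```python
import math

def jnd_threshold(y):
    # Piecewise background-luminance JND (T0 = 17, gamma = 3/128).
    if y <= 127:
        return 17 * (1.0 - math.sqrt(y / 127.0)) + 3.0
    return (3.0 / 128.0) * (y - 127) + 3.0

def count_non_jnd(orig_block, residual_block):
    """N_NJND: pixels whose |residual| is below the JND of the original pixel."""
    return sum(
        1
        for orig_row, res_row in zip(orig_block, residual_block)
        for o, r in zip(orig_row, res_row)
        if abs(r) < jnd_threshold(o)
    )

def total_non_jnd(block_pairs):
    """TN_NJND: sum of N_NJND over the four 8x8 (orig, residual) block pairs of an MB."""
    return sum(count_non_jnd(o, r) for o, r in block_pairs)
```

For instance, an 8 × 8 block of mid-gray pixels (value 128, JND about 3.02) with residuals of 1 everywhere contributes N_{NJND} = 64, whereas residuals of 10 contribute 0; four imperceptible blocks give TN_{NJND} = 256.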
Examples of N_{NJND} and TN_{NJND} under different conditions of image content and mode partition are exhibited in Figure 6. Following the process described above, the elements of visual judgment in an MB are obtained: N_{NJND0}, N_{NJND1}, N_{NJND2}, and N_{NJND3} are the N_{NJND} counts of the four 8 × 8 blocks, and TN_{NJND} is their sum. Figure 6a depicts an example of low variability. The difference is not obvious among the current MB, the motion compensated MB, and the residual data; consequently, each 8 × 8 block possesses many N_{NJND} pixels. The final mode partition is 8 × 16, which is a relatively large block type. Figure 6b gives an example of high variability, taken from the Foreman sequence, in which the content varies obviously. Correspondingly, each 8 × 8 block possesses few N_{NJND} pixels due to the high temporal variability, so it is conceivable that a relatively detailed mode partition should be selected as its block type.
According to the above description, a relationship is obtained between the temporal difference, the residual value, and N_{NJND} in each 8 × 8 block. The larger the temporal discontinuity, the more likely the block type is a comparatively detailed mode partition: massive temporal variability exceeds the unnoticeable visual distortion, so few N_{NJND} and TN_{NJND} pixels are obtained.
3. Proposed fast mode decision algorithm
The flowchart of the proposed algorithm is exhibited in Figure 7. The SKIP and 16 × 16 modes are evaluated first, as in the original flow of the coding standard. We observe that if TN_{NJND} is equal to 256 in an MB, the MB almost always tends to be selected as the SKIP or 16 × 16 mode, because the difference of this MB in the temporal domain is negligible. Therefore, we keep only the SKIP and 16 × 16 modes as candidates and terminate the mode decision. Statistical examples of the accuracy of this early termination are provided in Table 1; the high accuracy confirms that it is practical. In the usual cases, we also try the Intra 16 × 16 mode. When TN_{NJND} is equal to zero, we add the Intra 4 × 4 mode to the candidates, because the temporal difference of this MB could be very large. The high accuracy of this choice is exhibited in Table 2, which indicates the probability of Intra 4 × 4 when TN_{NJND} is equal to zero. Since such an MB has no characteristics of unnoticeable visual distortion, it cannot obtain good coding efficiency from motion compensation or from the spatial prediction of a large block (Intra 16 × 16). In this case, we take the Intra 4 × 4 mode into consideration for intra prediction.
If the required inter mode candidates in H.264/AVC can be predicted accurately, the computational complexity can be reduced. We observe the average RD cost of the 16 × 16 mode (RDcost_{16×16}; the 16 × 16 mode is always calculated for every MB in the original coding standard) for each final mode of the MBs in inter frames as a function of TN_{NJND}. A trend emerges: when the best mode is a relatively large block size, e.g., the SKIP or 16 × 16 mode, its RDcost_{16×16} is generally lower than that of an MB whose best mode is a more detailed partition. We demonstrate this phenomenon in Figure 8 with six QCIF (Foreman, Grandma, Mother & Daughter (M/D), News, Salesman, Football), three CIF (Bike, Bridge, Highway), three 4CIF (Ice, Soccer, City), and two HD (Stockholm, Parkrun) sequences. Therefore, a distribution model is built between the average RDcost_{16×16} and TN_{NJND}, from which the requisite mode candidates can be determined accurately: according to the relation between the average RDcost_{16×16} and TN_{NJND}, the mode candidates are decided to be relatively large or detailed mode partitions.
The statistics of the average RDcost_{16×16} are gathered when the best mode is 16 × 16, based on each value of TN_{NJND}, using the six QCIF sequences. The tendency judgment of the mode partition is modeled by a summation of Gaussian functions, defined as

Th_{j}(TN_{NJND}) = Σ_{i} a_{i,j} · exp(−((TN_{NJND} − b_{i,j}) / c_{i,j})^{2}) (2)

where j is the threshold index, and a_{i,j}, b_{i,j}, and c_{i,j} are the coefficients for calculating Th_{j}(TN_{NJND}). Th_{1}(TN_{NJND}) models the tendency judgment of the mode partition as shown in Figure 9. On the other hand, for the MBs whose current RDcost_{16×16} is larger than Th_{1}(TN_{NJND}) but whose best mode is still 16 × 16, further statistics of the average RDcost_{16×16} are gathered for each TN_{NJND}, excluding the MB samples already covered by Th_{1}(TN_{NJND}); these are also modeled by Equation (2) to produce Th_{2}(TN_{NJND}). The coefficients for Th_{1}(TN_{NJND}) and Th_{2}(TN_{NJND}) are listed in Tables 3 and 4, respectively. In Figure 9, the crossing of the curves is caused by the curve fitting: samples with TN_{NJND} close to 256 are much fewer, so the fitted curve falls where the number of samples is small. In our experiments, the higher threshold is selected in this condition.
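The Gaussian-sum threshold of Equation (2) can be evaluated as follows. The coefficient values below are illustrative placeholders, not the fitted values of Tables 3 and 4:

```python
import math

def th(tn_njnd, coeffs):
    """Th_j(TN_NJND) = sum_i a_i * exp(-((TN_NJND - b_i) / c_i)**2).

    `coeffs` is a list of (a_i, b_i, c_i) triples; the actual values come
    from the curve fitting reported in Tables 3 and 4 of the paper.
    """
    return sum(a * math.exp(-((tn_njnd - b) / c) ** 2) for a, b, c in coeffs)

# Illustrative two-term fit (NOT the paper's coefficients):
coeffs_demo = [(5000.0, 0.0, 80.0), (2000.0, 128.0, 60.0)]
threshold_at_64 = th(64, coeffs_demo)
```

Each Gaussian term peaks at its center b_{i} with height a_{i} and width controlled by c_{i}; summing a few such terms reproduces the fitted curves of Figure 9.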
In Table 5, the distributions of the various modes are compared. It can be observed that most modes belong to relatively large mode partitions: the SKIP and 16 × 16 modes account for 60.47% in total, the 16 × 8 and 8 × 16 modes together account for 15.16%, and the 8 × 8 mode and sub-MBs occupy a total of 20.09% of the mode distribution.
Statistics over the same six QCIF sequences, each with 300 frames at QP 24, analyzing the accuracy of the proposed Th_{j}(TN_{NJND}) are listed in Table 6. Mode X is the best mode for an MB after conducting the RDO. P_{a}(mode X), P_{b}(mode X), and P_{c}(mode X) are the probabilities of mode X, given all MBs with mode X in inter frames, under different conditions. P_{a}(mode X) denotes that the current RDcost_{16×16} is equal to or lower than Th_{1}(TN_{NJND}). P_{b}(mode X) means that the current RDcost_{16×16} is larger than Th_{1}(TN_{NJND}) and equal to or lower than Th_{2}(TN_{NJND}). P_{c}(mode X) indicates that the current RDcost_{16×16} is larger than Th_{2}(TN_{NJND}). The resulting probability distribution over Th_{1} and Th_{2} is listed in Table 6, and the corresponding regions are shown in Figure 10. The total probability is 92.44% for the SKIP mode and 85.65% for the 16 × 16 mode over the entire inter frames. Even if the current RDcost_{16×16} is equal to or lower than Th_{j}(TN_{NJND}), some smaller modes still need to be tested. In addition, as exhibited by P_{a}(mode X), P_{b}(mode X), P_{c}(mode X), and the TN_{NJND} conditions in Table 6, most of the other smaller modes (16 × 8, 8 × 16, 8 × 8, and sub-MB) are distributed where TN_{NJND} is under 127.
In the proposed scheme, neither the 8 × 8 mode nor the sub-MB modes are calculated when the current RDcost_{16×16} is equal to or lower than Th_{1}(TN_{NJND}) and TN_{NJND} is under 127, as shown by region 3 in Figure 10. The reason is that the occurrence probabilities of the 8 × 8 mode and sub-MBs are relatively low, as demonstrated in Table 5, and so is their distribution under the proposed Th_{j}(TN_{NJND}) conditions, as demonstrated in Table 6. In Table 5, these modes occupy only 20.09% of the total among the various modes, and according to the Th_{j}(TN_{NJND}) conditions in Table 6, most occurrences of the 8 × 8 mode and sub-MBs do not fall in region 3. Therefore, the occurrence probabilities of the 8 × 8 mode and sub-MBs in region 3 are small enough to be neglected.
Tables 7 and 8 show statistics in which sub-MBs are classified into four classes, sub-MB_n, where n (equal to 1, 2, 3, or 4) indicates the number of sub-MB partitions in an MB. The importance of sub-MBs is analyzed according to the utilization rate. For instance, if only one sub-MB in an MB uses the 8 × 4, 4 × 8, or 4 × 4 mode while the other three are 8 × 8 blocks, the MB belongs to the class sub-MB_1, and so on. Because the probability of an MB with four sub-MB partitions is obviously low, terminating early and ignoring the computation of sub-MBs does not increase the performance degradation much.
After the analyses of Tables 5, 6, 7, and 8, the relation between TN_{NJND}, the current RDcost_{16×16}, and Th_{j}(TN_{NJND}) is exhibited in Figure 10. In Figure 10, regions 1, 2, and 4 require checking all inter modes, while region 3 includes the SKIP, 16 × 16, 16 × 8, and 8 × 16 modes. For regions 5 and 6, the procedure can be terminated early after conducting the RDO of the SKIP and 16 × 16 modes. The condition on TN_{NJND} is determined first in the proposed algorithm, because TN_{NJND} is the key factor in deciding whether to check modes other than SKIP and 16 × 16; the decision follows the above analyses of the mode distributions in Table 6. If TN_{NJND} is equal to or lower than 127, the comparatively strict Th_{1}(TN_{NJND}) is chosen as the threshold. If the current RDcost_{16×16} is equal to or lower than Th_{1}(TN_{NJND}) (region 3 in Figure 10), the 16 × 8 and 8 × 16 modes are added to the final mode candidates; if it is larger than Th_{1}(TN_{NJND}) (regions 1 and 2 in Figure 10), all inter modes are candidates. If TN_{NJND} is larger than 127, the relatively loose Th_{2}(TN_{NJND}) is selected as the threshold. If the RDcost_{16×16} of the current MB is equal to or lower than Th_{2}(TN_{NJND}) (regions 5 and 6 in Figure 10), the procedure is terminated early and no other mode candidate is added; otherwise (region 4 in Figure 10), the 16 × 8, 8 × 16, and 8 × 8 modes and sub-MBs become final mode candidates.
According to Figure 10, the relation between TN_{NJND} and Th_{j}(TN_{NJND}) can be discussed. If TN_{NJND} is large in an MB, the MB contains more visually unnoticeable (non-JND) pixels. If an MB possesses a very low current RDcost_{16×16}, it tends to choose relatively large blocks, as illustrated in Figure 8; there are more opportunities to choose large blocks after the mode decision procedure. The proposed algorithm exploits this characteristic: if an MB has a larger TN_{NJND} or a very low current RDcost_{16×16}, the procedure has more opportunities to choose a relatively large block and to terminate earlier. Therefore, the unnecessary computational cost of choosing the best mode among the candidates can be reduced according to the total number of non-JND pixels, the current RDcost_{16×16}, and the fitted curves that give the appropriate Th_{j}(TN_{NJND}) produced from statistics.
3.1. Characteristics of image direction
Following the flowchart shown in Figure 7, the direction of the image texture is considered after the previous steps. When the directional characteristic in an MB is strong enough, only one of the two directional modes, 16 × 8 or 8 × 16, is needed, because these two partitions cannot coexist in an MB. Therefore, if only one of the two directional modes is kept in the final mode candidates, the computational cost can be further reduced.
The edge information of an MB is calculated by the Sobel edge detector to decide whether both the 16 × 8 and 8 × 16 modes are included as mode candidates. If the edge magnitude of any pixel in an MB is larger than 180, the horizontal or vertical decision is made according to Equations (3), (4), and (5). Part of the chin image in the Foreman sequence is shown in Figure 11a together with the real mode structure after encoding. The white pixels in Figure 11b indicate the pixels whose Sobel edge magnitudes are larger than 180.
Three groups of edge directions are calculated in the proposed algorithm: the horizontal and vertical calculations of the original gray values (H_{1}/V_{1}), of the residual compensated by the MVP in an MB (H_{2}/V_{2}), and of the N_{NJND} distribution of each 8 × 8 block (H_{3}/V_{3}). Figure 12a shows an example of the N_{NJND} distribution at the brim of the hat in the Foreman sequence. The N_{NJND} counts in Figure 12b,c are distributed in a horizontal structure; the best mode is 16 × 8, and H_{3} is much larger than V_{3}.
where i is equal to 1 or 2 for the original or residual blocks, respectively, x and y are the pixel coordinates in an MB, and D_{i,16×16} is the input information of an MB.
Afterwards, the horizontal and vertical magnitudes are compared to decide the directional mode of an MB, as exhibited in Equation (5). For instance, if the horizontal characteristics are larger than the vertical ones, the MB has a strong horizontal feature and the 16 × 8 mode is included in the final mode selection. On the contrary, if the horizontal characteristics are smaller than the vertical ones, the vertical feature is strong and only the 8 × 16 mode is included in the final mode candidates. Otherwise, both the 16 × 8 and 8 × 16 modes are included in the final mode candidates.
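The directional decision can be sketched as follows. Since Equations (3)-(5) are not reproduced here, this sketch assumes the horizontal characteristic H accumulates the vertical-gradient (horizontal-structure) Sobel responses and V accumulates the horizontal-gradient responses; the kernels and names are the standard Sobel forms, used as an assumption.

```python
def sobel_hv(block):
    """Accumulate Sobel responses over the interior of a 2D gray block.

    H grows with horizontal structure (vertical gradient Gy);
    V grows with vertical structure (horizontal gradient Gx).
    """
    h = v = 0
    rows, cols = len(block), len(block[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = (block[y-1][x+1] + 2*block[y][x+1] + block[y+1][x+1]
                  - block[y-1][x-1] - 2*block[y][x-1] - block[y+1][x-1])
            gy = (block[y+1][x-1] + 2*block[y+1][x] + block[y+1][x+1]
                  - block[y-1][x-1] - 2*block[y-1][x] - block[y-1][x+1])
            h += abs(gy)
            v += abs(gx)
    return h, v

def directional_modes(block):
    h, v = sobel_hv(block)
    if h > v:
        return ["16x8"]           # strong horizontal feature
    if v > h:
        return ["8x16"]           # strong vertical feature
    return ["16x8", "8x16"]       # no dominant direction
```

For a block whose top half is dark and bottom half bright, the vertical gradient dominates and only the 16 × 8 partition is kept; transposing the block yields 8 × 16.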
3.2. Complete algorithm
The following steps describe the complete algorithm:

1) Calculate the RD costs of the SKIP and 16 × 16 modes first, and obtain TN_{NJND}. If TN_{NJND} is equal to 256, go to step 8. Otherwise, add the Intra 16 × 16 mode to the mode candidates and go to step 2.

2) If TN_{NJND} is equal to zero, add the Intra 4 × 4 mode to the mode candidates and go to step 3. Otherwise, go to step 3 directly.

3) If TN_{NJND} is equal to or lower than 127, go to step 4. Otherwise, go to step 5.

4) If RDcost_{16×16} is equal to or lower than Th_{1}(TN_{NJND}), go to step 6 directly. Otherwise, add the 8 × 8 mode and sub-MBs to the mode candidates and go to step 6.

5) If RDcost_{16×16} is equal to or lower than Th_{2}(TN_{NJND}), go to step 8. Otherwise, add the 8 × 8 mode and sub-MBs to the mode candidates and go to step 6.

6) If the edge magnitude of any pixel in the MB is larger than 180, go to step 7. Otherwise, add the 16 × 8 and 8 × 16 modes and go to step 8.

7) Check the horizontal/vertical decision to add the 16 × 8 or 8 × 16 mode, and go to step 8.

8) Calculate the best mode from the final mode candidates.
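The eight steps above can be sketched as a single candidate-selection function. The threshold models and edge tests are passed in as callables, since their fitted coefficients live in Tables 3 and 4; all names are ours, and the mode labels are illustrative strings.

```python
def mode_candidates(tn_njnd, rdcost_16x16, th1, th2, has_strong_edge, hv_decision):
    """Return the candidate mode list following steps 1-8.

    th1/th2: callables implementing Th_1/Th_2(TN_NJND);
    has_strong_edge: True if any pixel's Sobel magnitude exceeds 180;
    hv_decision: callable returning the directional mode list,
    e.g., ["16x8"], ["8x16"], or both.
    """
    candidates = ["SKIP", "16x16"]                 # step 1: always evaluated
    if tn_njnd == 256:                             # step 1: early termination
        return candidates
    candidates.append("Intra16x16")
    if tn_njnd == 0:                               # step 2
        candidates.append("Intra4x4")
    if tn_njnd <= 127:                             # steps 3-4
        if rdcost_16x16 > th1(tn_njnd):
            candidates += ["8x8", "subMB"]
    else:                                          # step 5
        if rdcost_16x16 <= th2(tn_njnd):
            return candidates                      # early termination
        candidates += ["8x8", "subMB"]
    if has_strong_edge:                            # steps 6-7
        candidates += hv_decision()
    else:
        candidates += ["16x8", "8x16"]
    return candidates
```

For example, an MB with TN_{NJND} = 256 keeps only SKIP and 16 × 16, while one with TN_{NJND} > 127 and RDcost_{16×16} below Th_{2} terminates after SKIP, 16 × 16, and Intra 16 × 16.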
4. Experimental results
In order to evaluate the performance, the proposed algorithm is compared with those of Eduardo et al. [9], Zhao et al. [10], and the previous study [18]. Encoding is tested on a PC with an Intel Core 2 Quad Q9400 2.66 GHz CPU and 1.96 GB of memory. The time saving TS is defined as

TS = (T_{o} − T_{p}) / T_{o} × 100%
where T_{o} is the total encoding time of the original H.264/AVC software JM16.2 [21], and T_{p} is that of the compared algorithm. The peak-signal-to-noise-ratio reduction ΔPSNR is defined as

ΔPSNR = PSNR_{p} − PSNR_{o}
where PSNR_{o} is the original PSNR for JM16.2, and PSNR_{p} is that of the compared algorithm. The bitrate increase ΔBR is defined as

ΔBR = (bitrate_{p} − bitrate_{o}) / bitrate_{o} × 100%
where bitrate_{o} is the total bitrate encoded by the original JM16.2, and bitrate_{p} is that of the compared algorithm.
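The three comparison metrics can be computed as in the following sketch. The sign conventions are assumed to follow common practice (TS and ΔBR as percentages relative to the JM16.2 anchor, ΔPSNR as proposed minus original), since the original equation images are not reproduced here.

```python
def time_saving(t_o, t_p):
    """TS (%): relative reduction in total encoding time vs. the anchor."""
    return (t_o - t_p) / t_o * 100.0

def delta_psnr(psnr_o, psnr_p):
    """PSNR change in dB (negative means quality loss)."""
    return psnr_p - psnr_o

def delta_bitrate(br_o, br_p):
    """Bitrate change (%) relative to the anchor."""
    return (br_p - br_o) / br_o * 100.0
```

For instance, an encoder that needs 40 s where the anchor needs 100 s yields TS = 60%, and a bitrate of 100.409 kbps against an anchor of 100 kbps yields ΔBR = 0.409%.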
In Tables 9, 10, 11, and 12, the performance and coding efficiency comparisons between the proposed algorithm and JM16.2 are exhibited for the IPPP and IBBP frame structures, respectively. The BD-PSNR and BD-BR [22] are listed in Table 13. The performances of the proposed algorithm, Zhao et al.'s [10], and JND-MD [18] are compared in Table 14. The 12 tested benchmark video sequences are Foreman (QCIF), Grandma (QCIF), Mother & Daughter (QCIF), News (QCIF), Salesman (QCIF), Coastguard (CIF), Mobile (CIF), Silent (CIF), Stefan (CIF), Table (CIF), Stockholm (HD), and Parkrun (HD). The coding frame structures are IPPP and IBBP with 300 frames at QPs of 24, 28, 32, 36, and 40 in the H.264/AVC software JM16.2 [21]. The QPs of 24, 28, 32, and 36 are used for BD-PSNR and BD-BR. The other parameter settings are as follows: IntraPeriod is 10; ReferenceFrame is 5; SearchMode uses UMHexagonS; SymbolMode uses CABAC. The search range is ±32 for QCIF and CIF videos and ±64 for SD sequences. According to the experimental results, the coding efficiency and rate distortion performance of the proposed algorithm are much better than those of Zhao et al.'s [10] and of [18]. A time saving of 63.844% is achieved with a 0.409% increment of the total bitrate and an average 0.031 dB loss of PSNR. The time savings for the Coastguard, Mobile, Stockholm, and Parkrun sequences, which share the characteristic of camera motion, are smaller. This is the key factor affecting the performance under the criterion used for the mode decision: the temporal difference between the current block and the reference block is considered for measuring the activity of each MB. Therefore, if an MB has a smaller TN_{NJND}, it contains many visually noticeable differences; consequently, the RD cost is possibly higher than the average one because of the larger temporal difference.
Video sequences with camera motion have more MBs with large temporal differences than those with common video content, because of the variable movement produced by the displacement of the camera during filming.
In Tables 15 and 16, the proposed scheme is compared with Eduardo et al.'s [9] and JND-MD [18] in BD-PSNR and BD-BR using QPs of 28, 32, 36, and 40 with 100 frames. The other coding parameter settings and simulation environments are as previously mentioned. The ten tested benchmark video sequences are Akiyo (CIF), Container (CIF), Mobile (CIF), Paris (CIF), Carphone (QCIF), Claire (QCIF), Coastguard (QCIF), Highway (QCIF), Miss America (QCIF), and News (QCIF). The proposed scheme achieves outstanding coding efficiency: time savings of 71.784% on average for IPPP and 65.456% for IBBP are obtained. The proposed algorithm provides better coding efficiency than those of Eduardo et al. [9] and JND-MD [18].
The subjective quality comparisons are shown in Figures 13 and 14. It can be observed that subjective detail and important information in still content are not sacrificed; consequently, good subjective quality is also presented in continuous video sequences. Furthermore, the required coding time is substantially decreased, so high coding efficiency is achieved while the objective quality (PSNR/BD-PSNR and bitrate/BD-bitrate) and the subjective quality are maintained. The experimental results demonstrate that the proposed method, built on the correlation of the HVS and the RD cost, is both practical and efficient.
5. Conclusion
In this article, a fast mode decision algorithm is proposed for the H.264/AVC video coding standard. Human visual characteristics are employed to analyze an MB: the human eye is simulated by analyzing the residual data with a JND model, and statistics are gathered to establish the correlation between the RD cost and the JND. With the proposed algorithm, the number of mode candidates is reduced and the computational efficiency of H.264/AVC is improved. The performance of the proposed algorithm is shown to be better than those of previous studies.
References
Wiegand T, Sullivan GJ, Bjontegaard G, Luthra A: Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol 2003, 13(7):560-576.
Bharanitharan K, Liu BD, Yang JF: Classified region algorithm for fast inter mode decision in H.264/AVC encoder. EURASIP J. Adv. Signal Process 2010, 2010: 1-10.
Choi BD, Nam JH, Hwang MC, Ko SJ: Fast motion estimation and intermode selection for H.264. EURASIP J. Adv. Signal Process 2006, 2006: 1-8.
Pan F, Yu H, Lin Z: Scalable fast rate-distortion optimization for H.264/AVC. EURASIP J. Adv. Signal Process 2006, 2006: 1-10.
Lee YM, Lin Y: Asymptotic computation in mode decision for H.264/AVC inter frame coding. J. Signal Process. Syst 2012, 66(2):121-127. 10.1007/s11265-011-0585-y
Yeh CH, Fan KJ, Chen MJ, Li GL: Fast mode decision algorithm for scalable video coding using Bayesian theorem detection and Markov process. IEEE Trans. Circuits Syst. Video Technol 2010, 20(4):563-574.
Grecos C, Yang MY: Fast inter mode prediction for P slices in the H.264 video coding standard. IEEE Trans. Broadcast 2005, 51(2):256-263. 10.1109/TBC.2005.846192
Lin YH, Wu JL: A depth information based fast mode decision algorithm for color plus depth-map 3D videos. IEEE Trans. Broadcast 2011, 57(2):542-550.
Eduardo ME, Amaya JM, Fernando DM: An adaptive algorithm for fast inter mode decision in the H.264/AVC video coding standard. IEEE Trans. Consum. Electron 2010, 56(2):826-834.
Zhao T, Wang H, Kwong S, Kuo CCJ: Fast mode decision based on mode adaptation. IEEE Trans. Circuits Syst. Video Technol 2010, 20(5):697-705.
Ri SH, Vatis Y, Ostermann J: Fast inter-mode decision in an H.264/AVC encoder using mode and Lagrangian cost correlation. IEEE Trans. Circuits Syst. Video Technol 2009, 19(2):302-306.
Gan T, Alface PR: Fast mode decision for H.264/AVC encoding of tunnel surveillance video. Proceedings of the Second International Conferences on Advances in Multimedia 2010, 7-12.
Chen PH, Chen HM, Shie MC, Su CH, Mao WL, Huang CK: Adaptive fast block mode decision algorithm for H.264/AVC. Proceedings of the 5th IEEE Conference on Industrial Electronics and Applications 2010, 2002-2007.
Tang H, Shi HS: Fast mode decision algorithm for H.264/AVC based on all-zero blocks predetermination. Proceedings of the International Conference on Measuring Technology and Mechatronics Automation, 2 2009, 780-783.
Wang H, Kwong S, Kok CW: An efficient mode decision algorithm for H.264/AVC encoding optimization. IEEE Trans. Multimed 2007, 9(4):882-888.
Zeng H, Cai CH, Ma KK: Fast mode decision for H.264/AVC based on macroblock motion activity. IEEE Trans. Circuits Syst. Video Technol 2009, 19(4):491-499.
Chou CH, Li YC: A perceptually tuned subband image coder based on the measure of just-noticeable-distortion profile. IEEE Trans. Circuits Syst. Video Technol 1995, 5(6):467-476. 10.1109/76.475889
Li MS, Chen MJ: Fast HVS-based mode decision for H.264/AVC using just-noticeable-difference. Proceedings of the International Conference on Image Processing, Computer Vision, and Pattern Recognition 2011.
Wang H, Qian X, Liu G: Inter mode decision based on just noticeable difference profile. Proceedings of the 17th IEEE International Conference on Image Processing 2010, 297-300.
Shafique M, Molkenthin B, Henkel J: An HVS-based adaptive computational complexity reduction scheme for H.264/AVC video encoder using prognostic early mode exclusion. Proceedings of the Design, Automation & Test in Europe Conference & Exhibition 2010, 1713-1718.
H.264/AVC Reference Software. http://iphome.hhi.de/suehring/tml/
Bjontegaard G: Calculation of average PSNR differences between RD-curves, ITU-T SG16 Doc. VCEG-M33. 2001.
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Li, M.-S., Chen, M.-J., Tai, K.-H. et al. Fast mode decision based on human noticeable luminance difference and rate distortion cost for H.264/AVC. EURASIP J. Adv. Signal Process. 2013, 60 (2013). https://doi.org/10.1186/1687-6180-2013-60
Keywords
 Mode decision
 H.264/AVC
 Rate distortion cost
 Human visual system