Generally, ME&MC using more than one reference frame (i.e., MRFs) exhibits better rate-distortion performance than using a single reference frame (i.e., the immediate previous frame), at the expense of computational time [24–30]. The computational time with MRFs increases almost proportionally with the number of reference frames used for ME&MC. Dual reference frame techniques [24–26] represent a good compromise between single and multiple reference frames in terms of computational time and rate-distortion performance. The proposed scheme is a pattern-based video coding scheme under the H.264/AVC framework (wherein a pattern mode is embedded) with dual reference frames. Of the dual frames, one is the immediate previous frame and the other is the McFIS, on the assumption that motion areas and normal/uncovered static areas will be referenced from the immediate previous frame and the McFIS, respectively, through Lagrangian optimization.
The McFIS is generated by dynamic background modeling using Gaussian mixture models [14–16]. It is constructed from the already encoded frames at the encoder and the decoder using the same technique, so the McFIS need not be transmitted from the encoder to the decoder. When a frame is decoded at the encoder/decoder, the McFIS is updated using the newly decoded frame. The detailed procedure is described in Subsection 2.3. To exploit non-rectangular MB partitioning and a partially skipped mode, a pattern mode is incorporated as an extra mode into the conventional H.264 video coding standard, defining the PVC scheme [12, 13]. Figure 1 shows the PC comprising the 32 patterns used in the proposed scheme. Each pattern is a binary 16 × 16 pixel matrix, where the white region indicates 1 (i.e., capturing foreground) and the black region indicates 0 (i.e., capturing background). In effect, a pattern is used as a mask to segment the foreground from the background within a 16 × 16 pixel MB.
We first determine the MR for the current MB using the MBs from the current and reference frames. Then, after finding the best matched pattern from the PC through a similarity metric [12], ME&MC are carried out using only the pattern-covered MR (i.e., the area covered by the white region in Figure 1). In the proposed scheme, we also introduce a new pattern matching scheme for ME&MC so that we can overcome the occlusion problem of the existing schemes by exploiting the uncovered background.
If an MB contains both a moving object and background, the block is normally encoded using the pattern mode. In the pattern mode, the moving object is approximated using the best matched pattern, and the rest of the area is treated as a skipped area (i.e., it is not encoded). The existing PVC scheme determines the moving object using the MR, which is the difference between the current block and its collocated block in the reference frame (i.e., the immediate previous frame). If we calculate the MR in this way, we get two moving regions: (1) one due to the uncovered background, i.e., the old place where the object was at the (t − 1)th time, and (2) the other due to the movement of the object to its new location, i.e., the new place where the object is at the tth time. The MR in the second case represents the object, whereas the MR in the first case represents the uncovered background. Both MRs are drawn in Figure 2c. As the McFIS comprises only the background, if we extract the MR by comparing the current block against the McFIS, we obtain an MR that represents the true object rather than both the uncovered background and the object. In the proposed scheme, we therefore determine the best matched pattern for the pattern-covered areas and perform motion estimation and compensation using the immediate previous frame to find the best matched object. The rest of the block is copied from the collocated block in the background frame, i.e., the McFIS. The experimental results (see Figure 3) reveal that the proposed scheme encodes more RMBs compared to the existing schemes. Encoding more RMBs indicates more compression.
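The effect of the two MR definitions can be illustrated with a toy example. The sketch below uses hypothetical pixel values and a simple binarization threshold (not the paper's exact metric) to contrast the MR obtained against the immediate previous frame with the MR obtained against a background frame:

```python
import numpy as np

def moving_region(current_mb, reference_mb, threshold=2):
    """Binarize the absolute difference between two 16x16 blocks.

    The threshold is illustrative; the actual PVC scheme may apply
    smoothing before thresholding.
    """
    diff = np.abs(current_mb.astype(int) - reference_mb.astype(int))
    return (diff > threshold).astype(np.uint8)

# Toy 16x16 blocks: an "object" occupies different corners at t-1 and t.
prev = np.zeros((16, 16), dtype=np.uint8); prev[0:4, 0:4] = 200      # object's old place
curr = np.zeros((16, 16), dtype=np.uint8); curr[12:16, 12:16] = 200  # object's new place
mcfis = np.zeros((16, 16), dtype=np.uint8)                           # true background

mr_prev = moving_region(curr, prev)    # contains object AND uncovered background
mr_mcfis = moving_region(curr, mcfis)  # contains only the object
print(mr_prev.sum(), mr_mcfis.sum())   # the previous-frame MR is twice as large here
```

The MR against the McFIS isolates the true object, which is exactly why the pattern match improves.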
Thus, the new ideas are (1) extracting only moving object areas as the MR rather than both object and background areas, (2) performing motion estimation and motion compensation on the pattern-covered area (i.e., the moving object area) using the immediate previous frame, and (3) treating the background area as a skipped area and copying it from the McFIS. The detailed procedures are explained in the following subsections.
2.1 New ME&MC for uncovered background areas
Let C_k and R_k be the kth MBs from the ith and (i − 1)th frames, respectively. According to the PVC scheme [12], the MR is defined as follows:

M_k(x, y) = T(|C_k(x, y) − R_k(x, y)|), 0 ≤ x, y ≤ 15, (1)

where T(v) = 1 if v exceeds a small predefined threshold and 0 otherwise. The similarity of a pattern P_n ∈ PC with the MR in the kth MB is defined as

S_{n,k} = Σ_{x=0}^{15} Σ_{y=0}^{15} |M_k(x, y) − P_n(x, y)|. (2)

The best matched pattern for an MR is then selected as

P* = arg min_{P_n ∈ PC} S_{n,k}. (3)
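The matching in Equations 1 to 3 can be sketched as follows. The two patterns here are illustrative stand-ins (the actual 32-pattern codebook is given in Figure 1), and the pixel-wise mismatch count is a simple proxy for the similarity metric of [12]:

```python
import numpy as np

def best_pattern(mr, codebook):
    """Select the pattern minimizing the pixel-wise mismatch with the MR
    (a stand-in for the similarity metric of Equations 2 and 3)."""
    scores = [int(np.sum(np.abs(mr.astype(int) - p.astype(int)))) for p in codebook]
    return int(np.argmin(scores)), scores

# Two hypothetical 16x16 patterns; the real codebook has 32 (Figure 1).
p_left = np.zeros((16, 16), dtype=np.uint8); p_left[:, :8] = 1   # left half white
p_right = np.zeros((16, 16), dtype=np.uint8); p_right[:, 8:] = 1  # right half white
codebook = [p_left, p_right]

mr = np.zeros((16, 16), dtype=np.uint8); mr[:, :6] = 1  # object hugging the left edge
idx, scores = best_pattern(mr, codebook)
print(idx)  # 0: the left-half pattern matches best
```

ME&MC are then restricted to the pixels under the selected pattern's white region.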
Figure 2b shows a current frame, Figure 2a shows a reference frame, the MR (marked as texture) according to Equation 1 is shown in Figure 2c, and a true background without an object (here a moving ball) is shown in Figure 2e. From the figure, we can easily observe that the second block of the third row (marked as block A in Figure 2b) has both a moving object (see Figure 2a) and an uncovered background (see Figure 2b). When ME&MC are carried out for block A (i.e., uncovered background), there is no matched region for block A in the reference frame (i.e., in Figure 2a). Thus, neither the pattern mode nor any other mode can provide accurate ME&MC for blocks similar to A. This problem can be solved if we can generate a true background (Figure 2e) and carry out ME&MC using that background as a reference frame through any suitable H.264 mode or the pattern mode (if the MR is best matched with a pattern). In this work, we use the McFIS (actually a dynamic background frame; to be discussed in Subsection 2.3) for referencing the uncovered background.
When a pattern is matched against the MR (i.e., the part of the ball) in block B (Figure 2b), pattern 11, 14, or 30 (see Figure 1) would ideally be the best matched pattern; however, because the MR generated by (1) (see Figure 2c) comprises both the moving object and the uncovered background, pattern 21 becomes the best matched pattern. ME&MC using pattern 21 do not find a proper reference region in any reference frame (i.e., Figure 2a or Figure 2e) and result in poor rate-distortion performance. To solve this problem, we generate a new MR using the McFIS and the current frame (see Figure 2d), use the immediate previous frame for ME&MC on the pattern-covered region, and copy the rest of the MB from the collocated block in the background frame, i.e., the McFIS. In this process, the reference MB in (1) is replaced by the kth MB from the McFIS to find the object motion. However, we also retain two other options, i.e., the existing pattern matching with Equation 1 and ME&MC using the immediate previous frame (existing pattern matching), and pattern matching and ME&MC using the McFIS (pattern matching using McFIS), to maximize the rate-distortion performance when an MR is not well matched by the best pattern.
Figure 3 compares the average percentages of MBs selected by the Lagrangian optimization as the reference MBs for three relevant techniques: (1) the existing pattern matching [12] (where the MR is determined from the difference between the current block and the collocated block in the immediate previous frame), with matching and ME&MC carried out using the immediate previous frame; (2) pattern matching where the MR is determined from the difference between the current block and the collocated block in the McFIS, with the best matched pattern found for this MR and ME&MC carried out using the McFIS; and (3) pattern matching using the McFIS but ME&MC carried out with the immediate previous frame. Note that technique 3 is the pattern matching and ME&MC approach newly introduced in this work, while technique 2 is the existing MRF approach with the McFIS. Techniques 1 and 3 differ in how the MR of the current MB is generated (against the McFIS or the immediate previous frame) to find the best matched pattern at the encoder, but there is no difference from the decoding point of view, as both use the immediate previous frame as the reference frame. Thus, we accommodate all three techniques for the pattern mode. The first 300 frames of six standard video sequences, namely Paris, Bridge Close, Silent, News, Salesman, and Hall Objects, have been used for the evaluation. The figure shows that the proposed technique 3 selects the largest number of MBs compared to the other two techniques. A higher percentage represents higher effectiveness for referencing. The results indicate that the proposed pattern matching and ME&MC technique is expected to perform better, which will be further evidenced by the rate-distortion performance in Section 4.
Figure 3 shows that the percentage of RMBs ranges from 45% to 70% in the proposed scheme, whereas it ranges from 10% to 30% in the existing schemes. The rest of the MBs are encoded as traditional H.264 MBs. At a low bit rate (when the quantization parameter (QP) is high), the percentage of RMBs (i.e., MBs selected by the pattern mode) is larger, and it gradually decreases with the bit rate in the proposed scheme. This decreasing trend of the RMBs with bit rate is quite understandable. The rationale is that at low bit rates, even if the MR of an MB is not completely covered by the best matched pattern, the MB can still be encoded using the pattern mode (i.e., as an RMB), as the distortion due to the unmatched area may be insignificant compared to the bit rate saving in the Lagrangian cost function. At high bit rates, however, that distortion becomes significant compared to the bit saving, so the cost function selects other modes over the pattern mode.
2.2 Embedding pattern mode within the H.264 framework
Due to the object’s shape, motion characteristics, prediction accuracy, and the ratio of foreground to background in an MB, a given mode does not always have a specific rate-distortion (R-D) characteristic; however, the general trend is that when the Lagrangian multiplier is relatively high (at low bit rates), more emphasis is placed on the bit rate than on the distortion, whereas when the Lagrangian multiplier is relatively low (at high bit rates), more emphasis is placed on the distortion than on the bit rate. Thus, for a given 16 × 16 block, larger modes (such as 16 × 16, 16 × 8, and 8 × 16) might be chosen at low bit rates, whereas smaller modes (8 × 8, 8 × 4, 4 × 8, and 4 × 4) might be chosen at high bit rates. The general tendency of the pattern mode is that it requires fewer bits but yields a higher mean square error (MSE) compared to the other modes, because we consider only pattern-covered areas for the bits but the whole MB for calculating the MSE (the non-pattern-matched area contributes to the higher MSE). Figure 3 shows the same tendency, as the number of MBs encoded in the pattern mode decreases with the bit rate.
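This trade-off is the usual Lagrangian mode decision, J = D + λR. A minimal sketch, with purely illustrative per-mode distortion and bit figures (not measured values from the paper):

```python
def choose_mode(candidates, lam):
    """Return the mode with minimum Lagrangian cost J = D + lambda * R.

    candidates maps mode name -> (distortion, bits)."""
    return min(candidates, key=lambda m: candidates[m][0] + lam * candidates[m][1])

# Hypothetical (MSE, bits) figures for one MB.
candidates = {"16x16": (120.0, 90), "8x8": (80.0, 150), "pattern": (140.0, 40)}
print(choose_mode(candidates, lam=0.5))  # low lambda (high bit rate): distortion dominates
print(choose_mode(candidates, lam=5.0))  # high lambda (low bit rate): bits dominate
```

With a small λ the low-distortion 8 × 8 mode wins; with a large λ the low-bit pattern mode wins, mirroring the trend described above.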
In the proposed method, we have added the pattern mode and kept all other H.264 modes, including 4 × 4, 4 × 8, and 8 × 4. Thus, if the 8 × 8 block mode is selected, the 8 × 8 blocks are further decomposed into smaller modes, but we do not use any smaller-size pattern mode (e.g., 16-pixel patterns) in the decomposition. The pattern size is 64 pixels; thus, if we let part (i.e., 64 pixels) of a 16 × 16 block be inter-predicted and skip the rest, this setting can approximate any of the patterns. The main difference is that in the pattern mode, a 16 × 16 MB (i.e., a 256-pixel block) is represented by a smaller block, i.e., one of the 64-pixel patterns; H.264, on the other hand, treats the MB as a 256-pixel block by signaling zero motion vectors and zero residual errors for the skipped areas. In the PVC scheme, we need to send some bits for the pattern index, whereas the H.264 needs to send some bits to signal the zero motion vectors and zero residual errors for the skipped areas. The experimental results reveal that the pattern mode ultimately wins a significant proportion of the time through Lagrangian optimization.
As the size of a pattern (to capture and encode the MR) is one fourth (i.e., 64 pixels out of 256) of an MB, ME&MC using pattern-covered areas generally produce fewer bits (since only one fourth of the area is coded) and a higher MSE (due to the mismatch between a pattern and the MR) compared to the other modes such as 16 × 16, 16 × 8, 8 × 16, and 8 × 8. After analyzing a number of video sequences, we have observed that the average bits required by the 16 × 16, 16 × 8, 8 × 16, and 8 × 8 modes are 2.61, 2.78, 2.71, and 2.93 times those of the pattern mode, respectively. The corresponding MSE ratios are 0.91, 0.89, 0.89, and 0.86. Thus, using the conventional Lagrangian multiplier (LM) recommended in the H.264, i.e., λ = 0.85 × 2^((QP − 12)/3), the pattern-based video coding scheme encodes a large number of MBs as RMBs, which results in low bit rates with low peak signal-to-noise ratio (PSNR) compared to the H.264 for a similar QP. This may be a problem for existing rate-control mechanisms, as the relationship between the QPs and the rate-distortion may be different. To address this problem, a comprehensive R-D analysis is given by Paul and Murshed in [12], where a pattern mode has been embedded into the H.264 coding framework by modifying the LM. The Lagrangian multiplier (after embedding the pattern mode) is relatively smaller compared to the H.264-recommended LM. Paul and Murshed [12] recommended a new LM: λ_PVC = 0.4 × 2^((QP − 12)/3). If a video sequence has only a few RMBs (for example, a very high motion video sequence), the amendment of the LM may cause a problem in the existing rate-control mechanism.
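Both multipliers follow the same exponential form; a small sketch using the 0.85 and 0.4 scale factors quoted above:

```python
def lagrange_multiplier(qp, scale):
    """lambda = scale * 2**((QP - 12) / 3); scale = 0.85 for the H.264
    recommendation and 0.4 for the PVC recommendation of [12]."""
    return scale * 2 ** ((qp - 12) / 3)

qp = 30
lam_h264 = lagrange_multiplier(qp, 0.85)
lam_pvc = lagrange_multiplier(qp, 0.4)
print(round(lam_h264, 2), round(lam_pvc, 2))  # 54.4 25.6
```

With the smaller PVC multiplier, distortion carries relatively more weight in the cost, which curbs over-selection of the low-bit but higher-MSE pattern mode.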
An intuitive solution is to change the Lagrangian cost function (distortion plus the product of bits and the LM) of the pattern mode and then let this mode compete with the other modes under the H.264 optimization framework. In this way, the existing rate-control mechanism based on the QP-rate-distortion relationship is not affected much. As the pattern mode yields fewer bits with a higher MSE compared to the other H.264 modes, we can adjust the cost function of the pattern mode by adjusting its MSE and bits. Figure 4 shows (by dotted lines) the bit and MSE ratios between the conventional modes (i.e., 16 × 16, 16 × 8, 8 × 16, and 8 × 8) and the pattern mode against different QPs, before adjustment of the bits and MSE generated by the pattern mode, when the pattern mode is selected by the H.264-recommended LM. It shows that, on average, the conventional modes require 2.76 times the bits and obtain an MSE ratio of 0.89 compared to the pattern mode. We adjust (i.e., reduce) the MSE of the pattern mode by generating high-quality MRs (i.e., pattern-covered MRs) using finer quantization compared to the other modes. We also adjust the bits by multiplying them by a factor β (>1) in the cost function to prevent MBs that are poorly matched with the best pattern from being classified as RMBs.
Obviously, pushing the MSE ratio towards 1 (i.e., making the MSE of the pattern mode the same as that of the other modes) by finer quantization while keeping the bit requirement at its lowest level by multiplying with β would be the desirable case to improve the overall rate-distortion performance. Figure 4 shows (solid lines) the bit and MSE ratios between the conventional modes (i.e., 16 × 16, 16 × 8, 8 × 16, and 8 × 8) and the pattern mode against different QPs after adjustment (QP_pvcmode = QP_othermode − 2 and β = 1.5) of the bits and MSE. It shows that, on average, the conventional modes require 2.37 (instead of 2.76 before adjustment) times the bits and obtain an MSE ratio of 0.95 (instead of 0.89 before adjustment) compared to the pattern mode. Note that this adjustment ensures that, for a given QP, the PSNR of the PVC is comparable with that of the H.264 while the corresponding bit rate of the PVC is much lower, so that the PVC exhibits better overall rate-distortion performance than the H.264. To make the coding performance uniform over a wide range of bit rates and from low to high motion video sequences, we have adjusted the bits and distortion for the pattern mode based on experimental results.
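The bit-inflation adjustment can be sketched as follows; the distortion and bit figures and the λ value are illustrative only, while β = 1.5 is the value quoted above:

```python
def pattern_mode_cost(distortion, bits, lam, beta=1.5):
    """Adjusted Lagrangian cost for the pattern mode: the bit term is
    inflated by beta (> 1) to discourage poorly matched RMBs. The
    distortion is assumed to already benefit from the finer quantization
    (QP reduced by 2 for the pattern mode)."""
    return distortion + lam * (beta * bits)

# With bit inflation, a poorly matched pattern MB loses to a conventional mode.
lam = 10.0
j_pattern = pattern_mode_cost(150.0, 40, lam)  # 150 + 10 * 1.5 * 40
j_conventional = 120.0 + lam * 60              # plain J = D + lambda * R
print(j_pattern > j_conventional)  # True: the conventional mode wins this MB
```

Without the β factor the pattern mode's cost would be 150 + 400 = 550 and it would win; inflating the bits filters out exactly these poor matches.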
2.3 New McFIS generation technique
In a video scene, a pixel may be part of different objects and backgrounds over time (i.e., across the temporal frames). Each part can be represented by a Gaussian model expressed by pixel intensity variance, mean, and weight [14–16]. Thus, Gaussian mixture models are used to model a pixel over time. Intuitively, if a model has a large weight and a low variance, then it most probably represents the most stable background. The mean value of the best background model is taken as the background pixel intensity for that pixel. In this way, an entire background frame (i.e., the McFIS) is constructed. Instead of the mean value, the last satisfied pixel intensity (preserved when a pixel satisfies a model) may be taken as the background pixel intensity to avoid an artificial mean value [16]. As mentioned in [21], background generation using the pixel mean (or the most recent pixel value) is not very effective in video coding applications, as the McFIS is generated from the distorted image (i.e., the decoded frame); neighboring pixel intensities within the McFIS (i.e., spatial correlation) are therefore used to generate a better McFIS [21]. We have also observed that there is pixel intensity similarity among neighboring pixels. This relationship has also been observed by other researchers, and thus, pre-/post-filtering techniques were introduced that exploit neighboring pixels to reduce the pixel intensity discrepancy in decoded frames due to quantization and/or block-based ME&MC [31, 32]. Paul et al. in [33] generated the McFIS from modified decoded frames using neighboring pixel intensities of the decoded frames. On further investigation, we have found that exploiting temporal correlation, along with the spatial correlation, is also crucial to construct the McFIS. Thus, we modify the existing McFIS [33] as follows, assuming D_i and D_{i−1} are the ith and (i − 1)th McFISs, respectively:

D_i(x, y) = τD_i(x, y) + (1 − τ)D_{i−1}(x, y) if |D_i(x, y) − D_{i−1}(x, y)| ≤ T_p, and D_i(x, y) otherwise, (4)

where τ (0 < τ < 1) and T_p are the weighting factor and the threshold, respectively. It is obvious that there should be a strong correlation between consecutive McFISs, especially in the stable region (i.e., the background). A small difference (i.e., within T_p) may be due to quantization error rather than a changed environment. Thus, to rectify this variance, the current McFIS is formed as a weighted average with the previous McFIS. A large value of τ gives more emphasis to the current McFIS.
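A per-pixel sketch of this update, with illustrative values for τ and T_p (not the paper's settings):

```python
import numpy as np

def update_mcfis(d_curr, d_prev, tau=0.8, t_p=5.0):
    """Per-pixel sketch of Equation 4: where consecutive McFISs differ by
    no more than T_p, the gap is treated as quantization noise and the
    pixels are blended; elsewhere the current McFIS is kept as-is."""
    d_curr = d_curr.astype(float)
    d_prev = d_prev.astype(float)
    close = np.abs(d_curr - d_prev) <= t_p
    return np.where(close, tau * d_curr + (1 - tau) * d_prev, d_curr)

prev = np.full((4, 4), 100.0)
curr = np.full((4, 4), 103.0)  # within T_p everywhere: quantization-like jitter
curr[0, 0] = 160.0             # a genuine change at one pixel: kept unblended
out = update_mcfis(curr, prev)
print(out[1, 1], out[0, 0])    # blended background value; unchanged foreground value
```

Background pixels are smoothed towards the previous McFIS while genuinely changed pixels pass through untouched.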
Although the current pixel and the pixel at the collocated position of the previous McFIS are similar, the abovementioned T_p adjustment addresses quantization error rather than environmental changes (i.e., it is not for object movements in the background areas), and we have investigated the distortion due to quantization. As the quantization error varies with the quantization step, T_p should vary with QP: a large QP creates high distortion, while a low QP creates low distortion. To find the relationship of T_p with QP, we need the relationship between the distortion and the quantization step size. A theoretical derivation in [34] shows that the mean square quantization error varies with Δ²/12, where Δ is the quantization step size. This approximation is fairly accurate when the quantization step size is smaller than the signal standard deviation [34], i.e., in the middle range of quantization (but it is applicable over the entire range). The relationship (when distortion is defined as the mean quantization error) is plotted in Figure 5. We have also investigated this relationship using actual frames and their reconstructed frames for a number of video sequences such as Paris, Silent, Salesman, and News (the average result is plotted in Figure 5). The figure shows that the experimental value is smaller than the theoretical one due to other factors (more accurate ME&MC using VBS, a significant amount of static regions, etc., when the H.264 is used). We have fixed T_p (plotted in Figure 5) as twice the experimental value, as both the current and the previous McFIS can suffer from quantization error. To minimize the quantization error on background areas and improve the quality of the McFIS, T_p can be approximated as T_p = 0.6513e^(0.0861 × QP).
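As a quick check of how this threshold grows with QP, the fitted curve can be evaluated directly (the constants are those quoted above):

```python
import math

def t_p(qp):
    """Fitted threshold from the text: T_p = 0.6513 * exp(0.0861 * QP)."""
    return 0.6513 * math.exp(0.0861 * qp)

# The threshold roughly doubles every 8 QP steps, tracking the larger
# quantization error at coarser quantization.
for qp in (20, 28, 36):
    print(qp, round(t_p(qp), 2))
```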
2.4 Encoding and decoding of the proposed scheme
In the proposed scheme, the first frame of the video is encoded as an intra-frame, and the subsequent frames are encoded as inter-frames until a scene change [35] occurs. When a frame is encoded and decoded at the encoder, the McFIS is updated using the most recent decoded frame through background modeling. When a scene change occurs, the modeling parameters are reset and a new McFIS is generated. As the McFIS contains the stable portion of a scene, the sum of absolute differences (SAD) between the current frame and the McFIS is a good indicator of a scene change. Obviously, an automatic (as opposed to manually annotated) scene cut cannot be consistently defined and clearly confirmed. In this scheme, a scene change is detected in two different ways: (1) based on the ratio of SAD_i and SAD_{i−1}, where SAD_i is calculated between the McFIS and the ith frame (i.e., the current frame) and SAD_{i−1} between the McFIS and the (i − 1)th frame (i.e., the previous frame), and (2) based on the percentage of McFIS references in encoding. In the proposed scheme, we consider a scene change to have occurred when the ratio of SAD_i to SAD_{i−1} exceeds a threshold. Paul et al. [36] mentioned that the percentage of McFIS referencing is a good indication of the relevance of the current McFIS as a reference frame. Thus, we also generate a new McFIS if the percentage of McFIS references is below a threshold (e.g., 3% in the current implementation). For each MB, we examine all modes including the pattern mode using the two reference frames, and the ultimate mode is selected based on the LM. In the pattern mode, we conduct ME&MC using only regions covered by the best pattern (using Equation 3).
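The two scene change tests can be sketched as follows. The SAD ratio threshold here is a hypothetical value (the exact condition is not reproduced in this excerpt), while the 3% referencing threshold is the one quoted above:

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two frames."""
    return float(np.abs(a.astype(int) - b.astype(int)).sum())

def scene_changed(mcfis, frame_curr, frame_prev,
                  ratio_thresh=3.0, mcfis_ref_fraction=1.0, ref_thresh=0.03):
    """Two tests from the text: (1) the SAD-against-McFIS ratio jumps, or
    (2) McFIS referencing falls below 3%. ratio_thresh is hypothetical."""
    ratio = sad(mcfis, frame_curr) / max(sad(mcfis, frame_prev), 1.0)
    return ratio > ratio_thresh or mcfis_ref_fraction < ref_thresh

mcfis = np.zeros((8, 8)); prev = np.full((8, 8), 2); curr = np.full((8, 8), 100)
print(scene_changed(mcfis, curr, prev))  # True: the current frame departs sharply
print(scene_changed(mcfis, prev, prev))  # False: the scene is stable
```

On a detected change, the encoder would reset the background modeling parameters and start a fresh McFIS, as described above.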
To avoid more than four 4 × 4 DCT transformations for a pattern (as a pattern contains 64 1s), we rearrange the residual errors covered by the pattern into one 8 × 8 block. To arrange the residual errors, we scan the 16 × 16 residual block row-wise from top to bottom and left to right and place a value into the 8 × 8 block whenever we find a 1 at the corresponding position in the matched pattern. The arrangement of residual errors for a pattern is shown in Figure 6: a pattern is shown in Figure 6a, the numbering of residual errors according to the positions of 1s in the pattern is shown in Figure 6b, and the arrangement of the residual errors into an 8 × 8 block according to this numbering is shown in Figure 6c. The inverse arrangement is used at the decoder to recover the original shape of the pattern-covered residual errors for block reconstruction.
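The rearrangement and its inverse can be sketched with boolean-mask indexing, which visits pixels in exactly the row-wise scan order described above; the pattern used here is a hypothetical one:

```python
import numpy as np

def pack_residuals(residual_16x16, pattern_16x16):
    """Row-wise scan of the 16x16 residual block; the 64 pixels under the
    pattern's 1s are packed, in scan order, into one 8x8 block."""
    vals = residual_16x16[pattern_16x16 == 1]  # boolean mask: row-major order
    assert vals.size == 64, "each pattern contains exactly 64 ones"
    return vals.reshape(8, 8)

def unpack_residuals(block_8x8, pattern_16x16):
    """Inverse arrangement used at the decoder to restore the pattern shape."""
    out = np.zeros((16, 16), dtype=block_8x8.dtype)
    out[pattern_16x16 == 1] = block_8x8.ravel()
    return out

pattern = np.zeros((16, 16), dtype=np.uint8); pattern[:8, :8] = 1  # hypothetical pattern
residual = np.arange(256).reshape(16, 16)
packed = pack_residuals(residual, pattern)
restored = unpack_residuals(packed, pattern)
print(np.array_equal(restored[pattern == 1], residual[pattern == 1]))  # True
```

The round trip is lossless on the pattern-covered pixels, and the skipped area stays zero, matching the decoder behavior described above.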