Open Access

Memory bandwidth-scalable motion estimation for mobile video coding

EURASIP Journal on Advances in Signal Processing 2011, 2011:126

https://doi.org/10.1186/1687-6180-2011-126

Received: 17 March 2011

Accepted: 7 December 2011

Published: 7 December 2011

Abstract

The heavy memory access of motion estimation (ME) consumes significant power and can limit ME execution when the available memory bandwidth (BW) is reduced because of access congestion or changes in the power environment of modern mobile devices. To adapt to the changing BW while maintaining the rate-distortion (R-D) performance, this article proposes a novel data BW-scalable algorithm for ME on mobile multimedia chips. The available BW is modeled in an R-D sense and allocated to fit the dynamic contents. The simulation results show 70% BW savings with equivalent R-D performance compared with the H.264 reference software for low-motion CIF-sized video. For high-motion sequences, the results show that our algorithm can better use the available BW to save an average bit rate of up to 13% with up to a 0.1-dB PSNR increase for similar BW usage.

Keywords

motion estimation; memory bandwidth; H.264/AVC

1. Introduction

With the rapid progress of semiconductor technology, video coding is becoming popular in modern mobile devices to provide video services. In these devices, motion-compensated temporally predictive coding with motion estimation (ME) not only contributes the most to the coding efficiency of modern video encoder designs [1], but also requires large amounts of computation as well as data bandwidth (BW) [2]. This leads to severe design challenges for power-limited mobile devices. In a power-limited mobile device, the available power can change dynamically because of low battery power or dynamic power management, such as dynamic voltage and frequency scaling [2, 3]. In such cases, the available data BW can be inconsistent with the video requirements and lower than expected. Once this situation occurs, the video coding will be delayed or forced to drop frames. Either case leads to unwanted low video quality. This BW-constrained problem is getting worse with the increasing camera resolution of mobile devices.

Broadly speaking, the BW-constrained ME problem is a resource-constrained problem. Other resource-constrained designs [2-9] focus on lowering power consumption, with or without rate-distortion (R-D) optimization [2-5], or on adjusting computational complexity with rate-control-like methods [6-9]. He et al. [2] developed a new R-D analysis framework with a power constraint. Subsequently, the power-aware designs [3, 4] directly switch their search algorithms, without R-D optimization, to predesigned ones that fit a lower power mode. Chen et al. [5] used a fast algorithm and data reuse to achieve a power-aware design. Tai et al. [6] proposed a novel computation-aware scheme that determines the target amount of computation power allocated to a frame and allocates it to each block in a computation-distortion-optimized manner. The complexity-aware designs [7-9] used a rate-control-like method to combine complexity constraints into R-D optimization. The basic assumption of these approaches is that handheld devices have limited computational resources but sufficient memory BW. This assumption could easily fail because of the dynamic mobile environment, in which videos are coded and decoded at the same time, or because of the dynamic power management mentioned above.

To solve the above issue, we propose a BW-scalable ME algorithm to fit the available data BW constraint. We assume that the data BW is the limited resource and can change dynamically [3]. In full or normal battery mode, the working frequency is higher and the available data BW is sufficient. In low battery or power-saving mode, the available data BW is insufficient because of the lower working frequency or lower supply voltage. With a lower than expected BW supply, ME computations could fail to meet real-time constraints or lead to significant R-D performance loss due to skipped macroblock (MB) coding. The proposed method predicts and allocates the memory BW according to its R-D gain (RDG) and the available BW to model the bandwidth-rate-distortion (B-R-D) behavior of the existing ME algorithm. This B-R-D algorithm is a rate-control-like method for MB-based BW allocation, which maximizes the coding efficiency under the BW constraint. The simulation results show that the proposed algorithm can better utilize the BW instead of wasting it as other designs do, and that it can be scaled to the available BW.

The rest of this article is organized as follows. The review of related studies is presented in Section 2. In Section 3, we propose an analytical B-R-D optimized model. The online R-D optimized BW-scalable ME scheme is summarized in Section 4. Section 5 presents the simulation results and comparisons with traditional approaches. Finally, Section 6 concludes this article.

2. Review of related studies

To solve the computational complexity and data BW challenges of ME, various approaches have been proposed, such as parallel full search hardware design and fast ME algorithms.

Full search ME designs handle the computational complexity by using parallel processing elements for matching cost computation [10]. Furthermore, with its search center at (0, 0), it can reduce the data BW by reusing the overlapped search area, termed Level C data reuse in [11]. Such a design style is simple to use, but it will need constant data BW regardless of the video contents. Besides, to meet the Level C data reuse requirement, such a design also needs a larger search range (SR) to cover the possible best matching point due to the (0, 0) search center [12], which implies a waste of data BW compared to methods with a search center at the motion vector (MV) predictor (MVP).

On the other hand, fast ME algorithms only search a few candidates, so their computational complexity is lower. To facilitate such searching, most of the fast algorithms adopt the MVP as the search center [13]. As shown in [14], most of the best matching points lie around the MVP: an SR of ±8 around the MVP covers over 90% of the best matching points. Thus, such an algorithm can use a smaller SR and could have lower data BW even with poor data reuse between consecutive searches. However, even the fast ME algorithms still assume constant and sufficient data BW support for the required SR. Some designs with a dynamic SR [15-17] could have even lower data BW demands by changing the SR according to a content-dependent prediction, but they still assume constant and sufficient BW support in the planning of the chip design. Besides, none of these designs can adapt to a dynamic data BW. Several approaches have tried to reduce the required data BW. The designs in [18, 19] use a cache to maximize the possible data reuse for irregular search patterns. The bus BW-effective ME designs in [20, 21] lower the BW requirement by reducing the pixel representation from 8 bits to a binary pattern. However, these designs are only useful for specific search algorithms without a data BW constraint.

In summary, none of the above approaches has considered data BW as a limited resource or explored the possibility of optimizing its usage in an R-D sense. The assumption of constant and sufficient BW simplifies the design procedure and is thus widely used in VLSI hardware design, but it usually wastes a lot of data BW because only a portion of the MBs in a high-motion video need such a large amount of data. Such data BW waste is a serious problem for power-limited mobile devices because data access to DRAM is off-chip access and thus consumes significant power, which can be as much as the power consumption of the video chip [22]. As indicated in [22], the power consumption of external DRAM access could be up to 50% of the total power consumed by the video decoding chip. For encoding, this portion will be larger, but it is often neglected in previous designs. Besides, with a dynamically changing BW, the current approaches that assume constant and sufficient BW would have insufficient BW for coding: they either need more time to complete the coding and miss the real-time constraint, or drop MB coding and quality to meet the timing constraint. Neither situation is acceptable for a high-quality visual experience.

3. Analytical B-R-D optimized modeling

For a given video coding distortion (or, equivalently, picture quality), D, and bit rate, R, decreasing the available encoding BW makes the coding generate more distortion and more bits, which implies a higher D and R for the ME operation; conversely, a lower D and R requires more data BW for video coding. We introduce a set of BW control parameters, B = [β_1, β_2, ..., β_L], to control the search area of the ME module. The model with the BW control parameters is of a more generic form and captures the available data BW under different system conditions. Consequently, the ME SR selection is a function of these control parameters, denoted by SR(β_1, β_2, ..., β_L). Because the overall BW usage of an ME module is linearly proportional to its search area, within the BW-limited design framework the encoder BW requirement, denoted by BW, is a function of SR and therefore also a function of B, denoted by
$$BW = \Phi(SR) = BW(\beta_1, \beta_2, \ldots, \beta_L) \tag{1}$$
where Φ(·) is the SR selection model of the ME module. To optimize the BW usage, the available data BW, β_i, should be allocated dynamically among the MBs according to their motion characteristics. Thus, we execute the ME algorithm with the different SRs set by the BW control parameters and obtain the corresponding R-D data. According to our measurements and analysis, the R-D performance can be modeled well by the RDG as a function of the BW, denoted by RDG(BW(β_1, β_2, ..., β_L)), as in (2).
$$RDG(BW) = RDG(BW(\beta_1, \beta_2, \ldots, \beta_L)) \tag{2}$$
where
$$RDG = RDC_{init} - RDC_{BMA} \tag{3}$$
and the RDG is the difference between the Lagrange R-D cost (RDC) at the MVP (RDC_init) and the RDC at the final best matching position (RDC_BMA). The Lagrange RDC function is frequently employed as a measure of ME efficiency and is defined as
$$RDC_{motion}(mv, \lambda_{motion}) = \min\left[ SAD(s, c(mv)) + \lambda_{motion} \, R(mv - pmv) \right] \tag{4}$$

where mv is the MV received by the ME, and λ_motion indicates the Lagrange multiplier. The distortion term SAD(s, c(mv)) is the sum of the absolute differences between the original signal s and the coded video signal c. The rate term, λ_motion R(mv - pmv), represents the motion information, where R(mv - pmv) is the coded bit length of the MV difference (MVD) between the MV and the predicted MV (pmv). Note that Equation 2 is computationally intensive and is intended for offline analysis to obtain the B-R-D model.
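As an illustration of (3) and (4), the following minimal Python sketch evaluates the Lagrange RDC of one candidate MV and the resulting RDG. The function names and the list-of-rows block layout are hypothetical. The rate term is approximated by the signed Exp-Golomb code length of each MVD component, which is how H.264/AVC codes MVDs without CABAC; the text itself only says "coded bit length", so this choice is an assumption.

```python
def sad(block, ref_block):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block, ref_block)
               for a, b in zip(row_a, row_b))


def se_golomb_bits(v):
    """Bit length of the signed Exp-Golomb code word for integer v (assumed rate model)."""
    code_num = 2 * abs(v) - (1 if v > 0 else 0)
    return 2 * (code_num + 1).bit_length() - 1


def mv_rate_bits(mv, pmv):
    """R(mv - pmv): bits for both components of the MV difference."""
    return sum(se_golomb_bits(m - p) for m, p in zip(mv, pmv))


def rdc(block, ref_block, mv, pmv, lam):
    """Lagrange R-D cost of one candidate: SAD + lambda_motion * R(mv - pmv)."""
    return sad(block, ref_block) + lam * mv_rate_bits(mv, pmv)


def rdg(rdc_init, rdc_bma):
    """R-D gain: cost at the MVP minus cost at the best matching position."""
    return rdc_init - rdc_bma
```

The minimum in (4) is then taken over the RDC values of all candidates that the search algorithm visits inside the SR.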

Next, we optimally configure the BW control parameters to maximize the video quality (or minimize the video distortion) and minimize the video bit rate under the BW constraint. Mathematically, this can be formulated as in (5).
$$\max_{\beta_1, \beta_2, \ldots, \beta_L} RDG = RDG(BW(\beta_1, \beta_2, \ldots, \beta_L)) \quad \text{s.t.} \quad BW(\beta_1, \beta_2, \ldots, \beta_L) \le BW \tag{5}$$

where BW is the available BW pool for video encoding. The optimum solution, denoted by RDG(BW), describes the B-R-D behavior of the video encoder. The corresponding optimum BW control parameters are denoted by {β_i*(BW)}, 1 ≤ i ≤ L.

More specifically, we develop an analytical B-R-D model to perform on-line BW optimization for real-time video coding. For the simplicity of on-line execution, the RDG formulation can be well approximated by the following expression.
$$RDC_{init} - RDC_{BMA} = \gamma \times BW(\beta_1, \beta_2, \ldots, \beta_L) \tag{6}$$

where γ is a positive constant. In this study, we refer to BW as the maximum required data BW for ME.

4. Online R-D optimized BW-scalable ME

Section 3 provides a theoretical analysis of the data BW-limited performance of the B-R-D optimization. In this section, we discuss how this theoretical performance can be realized in practical video coding. Four major issues need to be addressed. First, the real BW calculation requires global knowledge of the on-chip SRAM buffer resource and the reuse strategy. Second, regarding the BW variations between video coding and decoding, we assume that the available data BW for video coding is time-varying because of the non-stationary video input on the real-time coding and decoding sides. Third, once the BW efficiency of the previously coded MBs is determined, we need a scheme to predict and allocate the BW interval under the video smoothness constraint. An exhaustive parameter adjustment for this purpose is computationally intensive and only suitable for offline analysis. For real-time video encoding on mobile devices, it is desirable to develop a low-complexity scheme that can estimate the BW interval parameters from the frame statistics collected during video coding. Fourth, to avoid under- or over-use of the BW pool, the target SR is further refined by the neighboring MVs. In the following, we discuss these issues.

4.1. BW budget initialization

First, the BW budget (BWbudget) is initialized for BW allocation of the overall data BW pool later in the coding process. This initialization takes the available system BW and converts it to a default system SR for the ME. Then, the BW budget is allocated with the above system SR for a GOP, as in (7).
$$BW_{budget} = \frac{BW_{Bus}}{Frame\_Rate} \times GOP\_size \tag{7}$$

where BW_Bus denotes the bus data transmission rate (bytes/s), Frame_Rate is the number of coded frames per second, and GOP_size denotes the number of frames in a GOP. A larger GOP size allows more freedom in adjusting the BW. To have a concrete example that represents common practice in video coding, the GOP size used for the BW budget is set to 16 frames in this article.
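A minimal Python sketch of the budget initialization in (7) follows. The function name and the 30 MB/s bus rate in the example are illustrative assumptions, not values from the simulations.

```python
def bw_budget_per_gop(bw_bus_bytes_per_s, frame_rate, gop_size):
    """BW budget of one GOP, Eq. (7): per-frame bus BW times frames per GOP."""
    return bw_bus_bytes_per_s / frame_rate * gop_size


# Hypothetical bus rate of 30 MB/s at 30 frames/s with 16 frames per GOP:
# each frame may move 1 MB, so the GOP budget is 16 MB.
budget = bw_budget_per_gop(30_000_000, 30, 16)
```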

4.2. BW evaluation in an R-D sense

To evaluate the BW usage based on (6), the BW efficiency, G_ave, is defined as the sum of the RDG over the MBs coded before the current kth MB divided by the total used BW (BW_usage^k), which denotes the accumulated data BW used up to the (k - 1)th MB, as in (8) and (9).
$$G_{ave} = \frac{\sum_{i=1}^{k-1} \left( RDC_{init}^{i} - RDC_{BMA}^{i} \right)}{BW_{usage}^{k}} \tag{8}$$
where
$$BW_{usage}^{k} = \sum_{i=1}^{k-1} BW_{usage}^{i} \tag{9}$$

and RDC_init^i denotes the RDC of the ith MB at the predicted MV position, RDC_BMA^i denotes the RDC of the ith MB after the motion search of the block-matching algorithm, and BW_usage^i inside the sum in (9) denotes the data BW used by the ith MB with a Level C data reuse scheme.

G_ave measures the BW efficiency by averaging the RDG over the BW used before the kth MB, that is, how much RDG is achieved per unit of data BW. Thus, the larger G_ave is, the better the BW and coding efficiency are. In the following step, we use G_ave for BW prediction.
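The following Python sketch computes G_ave from (8) and (9) for the MBs already coded. The function name and the per-MB lists are hypothetical bookkeeping; returning 0 when no BW has been used yet is an assumption, since the text does not define G_ave for the first MB.

```python
def bw_efficiency(rdc_init, rdc_bma, bw_usage):
    """G_ave of Eq. (8): accumulated RDG divided by accumulated BW.

    Each argument holds one value per already-coded MB (MBs 1 .. k-1).
    """
    total_bw = sum(bw_usage)  # Eq. (9)
    if total_bw == 0:
        return 0.0  # assumption: no history yet
    total_rdg = sum(init - bma for init, bma in zip(rdc_init, rdc_bma))
    return total_rdg / total_bw


# Three coded MBs: RDG = 200 + 500 + 50 = 750 over 1000 bytes of BW.
g_ave = bw_efficiency([1000, 1200, 900], [800, 700, 850], [256, 512, 232])
```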

4.3. BW prediction and allocation with the smoothness constraint

With the BW efficiency, G_ave, we can derive the allowed BW interval through BW prediction and allocation. The BW prediction predicts the available BW for the next coded MB under the smoothness constraint. The smoothness constraint maintains the quality and the smoothness (i.e., similar RDC) between consecutively coded MBs. With this constraint and the RDG per unit BW from (8), we can predict the forward and backward BW usage and thus constrain the possible BW usage of the next coded MB.

First, to keep the quality and the smoothness between the current and the previous MBs, we use the RDC data from previous MBs to make further predictions (10).
$$RDC_{init}^{k} - G_{ave} \, BW_{BP}^{k} = \frac{\sum_{i=1}^{k-1} RDC_{BMA}^{i}}{k-1} \tag{10}$$
where BW_BP denotes the backward BW prediction, given explicitly in (11). In (10), the left-hand side is the target RDC of the current MB, and the right-hand side is the average RDC of the previous MBs. To maintain the quality and the smoothness, the target RDC of the current MB should ideally equal the average past RDC. Thus, for a larger G_ave, (10) implies that less BW (i.e., BW_BP) is needed to maintain an RDG similar to that of the previous MBs. The backward prediction for the current kth MB follows from (10), as in (11).
$$BW_{BP}^{k} = \frac{RDC_{init}^{k} - \dfrac{\sum_{i=1}^{k-1} RDC_{BMA}^{i}}{k-1}}{G_{ave}} \tag{11}$$
In contrast to BW_BP, we define the forward prediction, BW_FP, to keep the quality and smoothness between the current and the future MBs by using the BW information, as in (12).
$$BW_{FP}^{k} = \frac{BW_{budget} - BW_{usage}^{k}}{n - (k - 1)} \tag{12}$$

where n is the total number of MBs in a GOP. Because we have no knowledge of the future RDG, the forward prediction, BW_FP, is set to the remaining BW budget divided by the number of MBs in the GOP that have not been coded yet.
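The two predictions in (11) and (12) can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that at least one MB has been coded and that G_ave is positive; the text does not say how the first MB of a GOP or a non-positive G_ave is handled.

```python
def backward_bw_prediction(rdc_init_k, past_rdc_bma, g_ave):
    """BW_BP of Eq. (11): BW needed to bring the current MB's RDC down
    to the average RDC_BMA of the previously coded MBs."""
    avg_past_rdc = sum(past_rdc_bma) / len(past_rdc_bma)
    return (rdc_init_k - avg_past_rdc) / g_ave


def forward_bw_prediction(bw_budget, bw_used, n_mbs, k):
    """BW_FP of Eq. (12): remaining budget shared equally by the
    n - (k - 1) MBs of the GOP that are not coded yet."""
    return (bw_budget - bw_used) / (n_mbs - (k - 1))


# Illustrative numbers: the 4th MB of a 103-MB GOP, 1000 of 16000 bytes used.
bw_bp = backward_bw_prediction(1100, [800, 700, 900], g_ave=0.75)   # 400.0
bw_fp = forward_bw_prediction(16000, 1000, n_mbs=103, k=4)          # 150.0
```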

These two BW predictions link the BW usage between the past MBs and the future MBs. Their relationship can be used to allocate the available BW as follows:

if (BW_FP > BW_BP) {    (condition 1)

    BW_lower = BW_BP + 0.5 × (BW_FP - BW_BP);

    BW_upper = BW_FP + 0.25 × (BW_FP - BW_BP);

}

else {    (condition 2)

    BW_lower = BW_FP - 0.5 × (BW_BP - BW_FP);

    BW_upper = BW_FP;

}

Here, BW_lower and BW_upper are the lower and upper bounds of the BW usage per MB, respectively. The parameters 0.5 and 0.25 are selected empirically and are easy to implement because they are powers of 2. They are obtained from a two-step process. In the first step, we execute the proposed BW-scalable ME algorithm with different parameter configurations to obtain the corresponding BW_lower, BW_upper, and R-D data. Note that this step is computationally intensive and is intended for offline analysis only, to obtain BW_lower, BW_upper, and the B-R-D model. Once the B-R-D model and the BW interval bounds BW_lower and BW_upper are established, we perform the second step, which optimizes the configuration of the BW control parameters to maximize the video quality under the system BW constraint. The empirically selected parameters in the following section are obtained by the same method.

For condition 1, as shown in Figure 1, BW_BP is smaller than BW_FP, which implies that less BW had been allocated to the previous MBs, and thus, more BW can be allocated to the next MB. As a result, we set the lower bound, BW_lower, higher than the average BW of the past MBs (equal to BW_BP + 0.5 × (BW_FP - BW_BP)), and set the upper bound, BW_upper, higher than the average BW predicted for the future MB coding (equal to BW_FP + 0.25 × (BW_FP - BW_BP)). This larger BW allocation enables better quality.

In contrast, for condition 2 in Figure 1, BW_FP is smaller than or equal to BW_BP, which implies that too much BW had been allocated to the previous MBs, and hence less BW can be allocated to the next MB. As a result, neither bound should exceed BW_FP, to keep the smoothness and quality: we set BW_lower equal to BW_FP - 0.5 × (BW_BP - BW_FP) and BW_upper equal to BW_FP.
Figure 1

Illustration of the available BW interval determination.
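The interval determination of Section 4.3 can be written as a small runnable Python function with one worked example per condition. The function name is hypothetical; the 0.5 and 0.25 factors are the empirical values given in the text.

```python
def bw_interval(bw_fp, bw_bp):
    """Return (BW_lower, BW_upper) for the next MB from the forward
    prediction BW_FP and the backward prediction BW_BP."""
    if bw_fp > bw_bp:
        # Condition 1: past MBs used little BW, so the interval is raised.
        lower = bw_bp + 0.5 * (bw_fp - bw_bp)
        upper = bw_fp + 0.25 * (bw_fp - bw_bp)
    else:
        # Condition 2: past MBs used too much BW, so the interval is capped at BW_FP.
        lower = bw_fp - 0.5 * (bw_bp - bw_fp)
        upper = bw_fp
    return lower, upper


cond1 = bw_interval(bw_fp=400, bw_bp=200)   # (300.0, 450.0)
cond2 = bw_interval(bw_fp=150, bw_bp=400)   # (25.0, 150)
```

In the second example the lower bound falls far below BW_FP because the backward prediction is much larger than the forward one; an implementation would probably also clamp the lower bound at zero, which the text does not discuss.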

4.4. SR decision and refinement

Finally, we employ the above available BW interval and the R-D data to make an SR decision for the next MB coding. The SR decision is divided into three cases, and the corresponding SR adjustment coefficients are resolution independent, as shown in Figure 2. Case 1 is the BW-limited case, in which the average BW usage of the previous MBs falls outside the available BW interval bounded by BW_upper and BW_lower. In this case, the current SR is decreased by 8 if the average BW usage is larger than BW_upper, or increased by 8 if it is smaller than BW_lower, for the next MB coding.
Figure 2

Illustration of the SR decision.

If the average BW usage of the previous MBs falls inside the available BW interval, sufficient BW is available for R-D optimization. This situation is further divided into two cases, case 2 and case 3. If the RDC of the current MB (R × D_cur) is larger than a predefined threshold (case 2), the video quality is poor, and thus, the SR is increased by 16 for better quality in the next MB. This threshold is set empirically to four times the average RDC of the previous MBs, i.e., 4(R × D_avg), for coarse-grained refinement of the quality. If the RDC (R × D_cur) is smaller than the predefined threshold (case 3), the quality is fairly smooth, and thus, the SR is adjusted only slightly. The SR remains unchanged if the RDG of the current MB (RDG_cur) is within the average RDG (RDG_avg) plus or minus an adaptive offset (empirically, RDC_BMA/20000, for fine-grained refinement of the quality). If RDG_cur is smaller than RDG_avg - offset, the video is of good enough quality, and thus, the SR is decreased by 4 to save BW. On the other hand, if RDG_cur is larger than RDG_avg + offset, the quality is low, and the SR is increased by 4 to improve the quality.
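The three-case SR decision can be sketched in Python as below. The function and parameter names are hypothetical; the step sizes (8, 16, 4), the threshold factor 4, and the offset divisor 20000 are the empirical values from the text. Treating RDC_BMA in the offset as the value of the current MB is an assumption.

```python
def decide_sr(sr, avg_bw_usage, bw_lower, bw_upper,
              rdc_cur, rdc_avg, rdg_cur, rdg_avg, rdc_bma):
    """Return the SR for the next MB according to cases 1-3."""
    # Case 1: BW limited, average usage lies outside [BW_lower, BW_upper].
    if avg_bw_usage > bw_upper:
        return sr - 8
    if avg_bw_usage < bw_lower:
        return sr + 8
    # Case 2: poor quality, current RDC exceeds 4x the average RDC.
    if rdc_cur > 4 * rdc_avg:
        return sr + 16
    # Case 3: smooth quality, fine-grained adjustment around the average RDG.
    offset = rdc_bma / 20000
    if rdg_cur < rdg_avg - offset:
        return sr - 4
    if rdg_cur > rdg_avg + offset:
        return sr + 4
    return sr


# Usage inside the interval, moderate RDC, RDG clearly above average: SR grows by 4.
next_sr = decide_sr(16, 200, 100, 400, rdc_cur=1000, rdc_avg=1000,
                    rdg_cur=110, rdg_avg=100, rdc_bma=20000)
```

The result is only the decision of Figure 2; it is still subject to the refinement described next and, in practice, to whatever minimum and maximum SR the ME engine supports.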

The above SR decisions are further refined to avoid BW waste by considering the adjacent MBs, as illustrated in Figure 3a. First, we get the MVs of the neighboring blocks and the MV of the co-located block in the previous frame, i.e., MV_UL, MV_U, MV_UR, MV_L, and MV_Cur, as shown in Figure 3b. All of these MVs are of sub-pel precision. Then, we compare these five MVs and choose the maximum MV (max_mv). After that, we set the available SR value using this maximum MV.

Figure 3

Illustration of the SR refinement. (a) Flowchart of the SR refinement method. (b) The relationship between neighboring blocks and the current block.

The refined SR, max_avail_SR, is

$$
\text{max\_avail\_SR} =
\begin{cases}
SR_{lower}, & \text{max\_mv} \le mv_{lower} \\
SR_{step} \times \mathrm{Ceil}\left( \dfrac{\text{max\_mv}}{SR_{step}} \right) + SR_{offset}, & mv_{lower} < \text{max\_mv} \le mv_{upper} \\
SR_{upper}, & \text{otherwise}
\end{cases}
\tag{13}
$$

in which the parameters SR_lower, SR_upper, SR_step, and SR_offset are resolution dependent. For our simulation, we set SR_lower equal to 4 for CIF and 26 for HD (720P) resolution. We set SR_upper, SR_step, and SR_offset equal to 32, 4, and 4 for CIF resolution and to 72, 8, and 2 for HD (720P) resolution. Finally, we set mv_lower and mv_upper equal to 2 and 24 for CIF resolution and to 24 and 64 for HD (720P) resolution.

Finally, the SR is selected by choosing the minimum SR between max_avail_SR and SR from Figure 2, for MB coding.
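A minimal Python sketch of the refinement in (13) with the CIF parameters follows. The names are hypothetical, and two points are assumptions because the text leaves them open: max_mv is taken as the largest absolute MV component among the five MVs, and it is expressed in the same units as mv_lower and mv_upper.

```python
import math

# CIF parameter set from the text.
CIF = dict(sr_lower=4, sr_upper=32, sr_step=4, sr_offset=4,
           mv_lower=2, mv_upper=24)


def max_avail_sr(max_mv, p):
    """Refined SR of Eq. (13) for a given maximum neighboring MV."""
    if max_mv <= p["mv_lower"]:
        return p["sr_lower"]
    if max_mv <= p["mv_upper"]:
        return p["sr_step"] * math.ceil(max_mv / p["sr_step"]) + p["sr_offset"]
    return p["sr_upper"]


def refine_sr(decided_sr, neighbor_mvs, p):
    """Final SR: the smaller of the decided SR and max_avail_SR."""
    max_mv = max(max(abs(x), abs(y)) for x, y in neighbor_mvs)  # assumption
    return min(decided_sr, max_avail_sr(max_mv, p))


# Five MVs (UL, U, UR, L, co-located); the largest component is 10,
# so max_avail_SR = 4 * ceil(10 / 4) + 4 = 16 and a decided SR of 24 is cut to 16.
mvs = [(3, -10), (0, 1), (2, 2), (-1, 0), (5, 4)]
final_sr = refine_sr(24, mvs, CIF)
```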

4.5. Summary of the algorithm

Figure 4 shows the proposed B-R-D optimized algorithm that can be combined with existing ME algorithms to make them BW scalable. This algorithm first models the available BW with its RDG and then predicts and allocates the BW in an R-D optimized sense to determine the available SR. The whole algorithm is repeated for all inter-coded frames in a GOP and consists of four steps, as described below.
Figure 4

Flowchart of the B-R-D optimized modeling method.

Step 1. Initialization: Create the BW budget from (7) for all MBs in a GOP.

Step 2. BW evaluation in an R-D sense: Evaluate the RDG in terms of the consumed BW as shown in (8) and (9) to model the BW in an R-D sense.

Step 3. BW prediction and allocation with the smoothness constraint: From the RDG obtained from step 2 and the available BW, the BW for the next coded MB is predicted in (10) to (12) and allocated as described in Section 4.3 to keep the video quality as smooth as possible using the smoothness constraint.

Step 4. SR decision and refinement: According to the available BW from step 3, the SR of next coded MB is determined and refined in (13) for ME execution.

5. Simulation results

5.1. Simulation conditions

The proposed algorithm was implemented in the H.264/AVC reference software, JM [23], for performance evaluation. The simulation conditions are CIF-sized test sequences with a baseline profile, no R-D optimization, one reference frame, a full-search algorithm as well as an Enhanced Predictive Zonal Search (EPZS) algorithm [24] for ME, IPPP sequences, 30 frames/s, and 16 frames per GOP. All of the block matching algorithms were implemented using Visual C++ on a PC with a 2.66 GHz Intel® Core™ 2 Duo CPU.

In the following simulations, we classify the BW conditions into two patterns: a constant data BW pattern and a variable data BW pattern. Both patterns provide the same amount of reference block data for the same SR of ±R. The constant data BW pattern assumes that the available BW is constant and fixed during ME operations, that is, the available BW is sufficient and the video encoder has no BW constraint during the encoding process. The variable data BW pattern assumes that the available BW varies during ME operations, that is, the available BW can be insufficient and the video encoder is BW constrained during the encoding process. The constant data BW pattern is the scenario used in traditional ME design, which does not consider the other components, while the variable data BW pattern simulates scenarios in which the BW changes because of situations such as simultaneous coding and decoding (defined as SCD mode) in a video phone or different low-power modes (defined as LP mode) in mobile applications. The SCD mode assumes that the decoding uses a sequence merged from Stefan, Akiyo, and Football (interleaved high-motion and low-motion sequences) with scene cuts at multiples of 32 frames. With this interleaved decoded sequence, the available BW for encoding changes dynamically, as shown in Figure 5a. Figure 5b shows the LP mode, with a descending trend in data BW in a power-aware system. In the following simulations, the SR of the search algorithm is ±R for both the constant data BW pattern and the variable data BW pattern.
Figure 5

Variable data BW pattern with ± 8 SR for: (a) the SCD mode and (b) the LP mode.

To show the benefit of the proposed scheme, we tested three different BW adaption schemes in the following simulations. The first scheme, denoted as fixed-SR, is for ME without any BW adaption scheme. Thus, the total BW for ME is equally distributed for all MB coding, and its SR setting is constant for the entire coding time. The second scheme, denoted as simple-SR, is for ME with a simple BW adaption scheme. Its BW adaption equally distributes the available data BW to all MBs in a period, as in the fixed-SR case, but the distribution will be changed when the available BW changes. Thus, its SR adapts as well. This adaption does not consider the used BW or the related R-D information. The final scheme, denoted as BRD-SR, is the proposed B-R-D optimized BW-scalable method.

5.2. B-R-D performance evaluation

Tables 1, 2, 3, 4, and 5 show the simulation results for the constant and variable BW patterns with the different BW adaption schemes. Figure 6 shows the average BW per frame for the high-motion Stefan sequence with the quantization parameter set to 28.
Table 1

Performance comparison with the fixed-SR scheme for CIF resolution

| Search algorithm | BW pattern | Akiyo ΔBW (%) | Akiyo ΔPSNR (dB) | Akiyo ΔBit-rate (%) | Foreman ΔBW (%) | Foreman ΔPSNR (dB) | Foreman ΔBit-rate (%) | Stefan ΔBW (%) | Stefan ΔPSNR (dB) | Stefan ΔBit-rate (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| FS | Const. 8^a | -35.2 | -0.02 | +0.24 | -4.78 | -0.02 | +1.79 | -1.01 | +0.10 | -13.42 |
| | Const. 16^a | -69.8 | -0.01 | -0.35 | -22.07 | -0.02 | +2.10 | -6.04 | +0.01 | -2.45 |
| | Const. 24^a | -82.8 | -0.01 | -0.45 | -43.74 | -0.02 | +1.99 | -17.59 | +0.01 | -1.21 |
| EPZS | Const. 8^a | -31.3 | -0.01 | +0.07 | -3.66 | -0.03 | +3.21 | -0.25 | -0.03 | +2.12 |
| | Const. 16^a | -65.4 | -0.01 | -0.17 | -21.26 | -0.03 | +2.53 | -7.14 | -0.04 | +3.13 |
| | Const. 24^a | -79.8 | +0.01 | -0.45 | -42.95 | -0.03 | +2.01 | -18.75 | -0.02 | +1.46 |

^a Constant BW; the SR is set from ±8 to ±24.

Table 2

Performance comparison with the simple-SR scheme for CIF resolution in the SCD mode

| Search algorithm | BW pattern | Akiyo ΔBW (%) | Akiyo ΔPSNR (dB) | Akiyo ΔBit-rate (%) | Foreman ΔBW (%) | Foreman ΔPSNR (dB) | Foreman ΔBit-rate (%) | Stefan ΔBW (%) | Stefan ΔPSNR (dB) | Stefan ΔBit-rate (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| FS | Variable 8^a | -37.8 | +0.01 | +0.17 | -12.30 | -0.02 | +1.98 | -1.38 | +0.07 | -9.83 |
| | Variable 16^a | -69.9 | 0.00 | +0.36 | -31.03 | -0.02 | +3.19 | -7.29 | +0.01 | -2.16 |
| | Variable 24^a | -82.8 | -0.01 | -0.34 | -45.56 | -0.02 | +1.69 | -19.10 | -0.01 | -1.13 |
| EPZS | Variable 8^a | -33.1 | +0.02 | -0.15 | -11.0 | -0.02 | +2.64 | -0.76 | -0.02 | +1.17 |
| | Variable 16^a | -65.6 | +0.01 | +0.20 | -29.54 | -0.02 | +2.37 | -7.69 | -0.03 | +2.98 |
| | Variable 24^a | -79.8 | 0.00 | -0.09 | -44.72 | -0.02 | +1.90 | -20.8 | -0.01 | +1.58 |

^a Variable BW; the SR is set from ±8 to ±24.

Table 3

Performance comparison with the simple-SR scheme for CIF resolution in the LP mode

| Search algorithm | BW pattern | Akiyo ΔBW (%) | Akiyo ΔPSNR (dB) | Akiyo ΔBit-rate (%) | Foreman ΔBW (%) | Foreman ΔPSNR (dB) | Foreman ΔBit-rate (%) | Stefan ΔBW (%) | Stefan ΔPSNR (dB) | Stefan ΔBit-rate (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| FS | Variable 8 | -37.9 | -0.01 | +0.12 | -5.05 | 0.00 | +0.10 | -3.49 | +0.03 | -2.83 |
| | Variable 16 | -70.2 | -0.01 | +0.34 | -30.1 | -0.02 | +2.43 | -16.5 | +0.07 | -9.29 |
| | Variable 24 | -83.0 | -0.01 | +0.04 | -51.2 | -0.02 | +1.20 | -32.6 | -0.01 | +0.04 |
| EPZS | Variable 8 | -32.9 | 0.00 | -0.01 | -3.44 | -0.01 | +0.37 | -2.73 | -0.02 | +1.42 |
| | Variable 16 | -65.7 | -0.01 | -0.13 | -27.8 | -0.03 | +2.84 | -16.2 | -0.05 | +3.35 |
| | Variable 24 | -79.9 | +0.01 | -0.11 | -49.8 | -0.01 | +1.49 | -32.1 | -0.01 | +1.25 |

Table 4

Execution-time comparison with the fixed-SR scheme for CIF resolution

| Search algorithm | BW pattern | Akiyo ΔTime (%) | Foreman ΔTime (%) | Stefan ΔTime (%) |
|---|---|---|---|---|
| FS | Const. 8 | +0.45 | +0.06 | +0.19 |
| | Const. 16 | -0.57 | -0.32 | -0.06 |
| | Const. 24 | -1.94 | -0.69 | -0.38 |
| EPZS | Const. 8 | -1.31 | -0.26 | -0.45 |
| | Const. 16 | -2.31 | -0.90 | -0.20 |
| | Const. 24 | -3.21 | -2.43 | -0.90 |

Table 5

Performance comparison with the fixed-SR scheme for 720P resolution.

| Search algorithm | BW pattern | Station2 ΔBW (%) | Station2 ΔPSNR (dB) | Station2 ΔBit-rate (%) | Sunflower ΔBW (%) | Sunflower ΔPSNR (dB) | Sunflower ΔBit-rate (%) | Tractor ΔBW (%) | Tractor ΔPSNR (dB) | Tractor ΔBit-rate (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| FS | Const. 56^a | -69.64 | -0.01 | +0.27 | -48.98 | -0.01 | +0.28 | -23.86 | 0.00 | -0.11 |
| | Const. 64^a | -75.97 | 0.00 | +0.29 | -59.09 | -0.01 | +0.20 | -37.97 | 0.00 | +0.06 |
| EPZS | Const. 56^a | -69.82 | -0.01 | -0.06 | -49.75 | +0.01 | -0.2 | -26.52 | 0.00 | +0.17 |
| | Const. 64^a | -76.15 | 0.00 | -0.26 | -59.69 | 0.00 | +0.39 | -40.43 | 0.00 | -0.02 |

^a means variable BW and SR is set within ±56 and ±64.

Figure 6

Constant BW patterns with SR equal to: (a) ± 8 (b) ± 16 (c) ± 24 and variable BW patterns with SR equal to (d) ± 8 (e) ± 16 (f) ± 24.

For the constant BW pattern case, Table 1 illustrates that the full search ME with the proposed BRD-SR scheme can attain quality similar to that with the fixed-SR scheme for the low-motion sequence (Akiyo) and the medium-motion sequence (Foreman), but with less BW. For the low-motion sequence, the proposed algorithm can save 35-83% of the BW with different SRs. For the medium-motion sequence, our algorithm can save 4-45% of the BW. For the high-motion sequence (Stefan), our algorithm can save an average bit rate of up to 13% and increase the PSNR by up to 0.1 dB under the low SR constraint. Applying the proposed algorithm to the fast EPZS algorithm gives results similar to those for the full search algorithm, which is due to our effective SR adjustment. For a fair comparison, the presented BW takes into account data reuse [11] in the overlapped region between search points, and thus, only new data that are not in the local buffer are loaded from external memory and counted in the BW usage. In summary, the proposed algorithm can save data BW for both the full search and EPZS algorithms.

For the variable BW pattern case, Tables 2 and 3 compare the results between the BRD-SR scheme and the simple-SR scheme in the SCD and LP modes. All of these results show trends in R-D performance and BW saving similar to those in Table 1. In summary, these results show our algorithm with B-R-D optimization can better utilize the BW for ME computation and achieves better performance than the fixed-SR and simple-SR schemes.

Table 4 shows the execution-time of the proposed algorithm and compares it to the fixed-SR scheme with the constant BW pattern. The results are similar to those found with the simple-SR scheme in the variable BW pattern case. Our proposed algorithm slightly improves execution time. However, the saving is not directly proportional to BW saving due to the calculation overhead of the MB-level BW-scalable scheme. These overheads can be reduced with further software optimization or better hardware implementation of the existing ME engine.

Table 5 shows the simulation results for the HD resolution videos and compares the proposed scheme with the fixed-SR scheme. The simulation uses three 720P-sized video sequences coded with the baseline profile, no R-D optimization, one reference frame, an IPPP structure, 30 frames/s, and 16 frames per GOP. All of the simulation results show savings similar to those found at CIF resolution, which are listed in Table 1. This demonstrates the applicability of the proposed algorithm to larger video sequences.

6. Conclusion

In this article, we propose a BW-scalable approach for ME that maximizes the R-D performance while dynamically allocating the available BW. Compared to the traditional methods, our algorithm saves up to 70% of the BW with the full-search algorithm and 65% of the BW with the EPZS algorithm at an average SR size of ± 16 for low-motion CIF resolution sequences. Compared to either the full-search or EPZS algorithm, our proposed algorithm saves up to 70% of the BW with an SR size of ± 56 for HD (720P) resolution video. These savings come from appropriate MB-level BW allocation. In addition, when coding high-motion sequences at CIF resolution, the simulation results show that our design saves an average bit rate of up to 13% and increases the average PSNR by up to 0.1 dB with similar BW usage. The proposed design can be combined with current ME designs. Further study could incorporate this work into the rate-control scheme or other resource-constrained algorithms for better performance.

Abbreviations

B-R-D: bandwidth-rate-distortion

BW: bandwidth

BWBP: data bandwidth backward prediction

BWbudget: bandwidth budget

BWFP: data bandwidth forward prediction

EPZS: enhanced predictive zonal search

max_mv: maximum motion vector

MB: macroblock

MBs: macroblocks

ME: motion estimation

MV: motion vector

MVD: motion vector difference

MVP: motion vector predictor

R-D: rate-distortion

RDC: Lagrange R-D cost

RDCBMA: Lagrange R-D cost at the final best matching position

RDCinit: Lagrange R-D cost at MVP

RDG: rate-distortion gain

SR: search range

Declarations

Acknowledgements

The authors thank the anonymous referees and the editor for their valuable comments and suggestions, which led to the improved version of this article.

Authors’ Affiliations

(1)
Department of Electronics Engineering & Institute of Electronics, National Chiao-Tung University

References

  1. Wiegand T, Sullivan GJ, Bjontegaard G, Luthra A: Overview of the H.264/AVC video coding standard. IEEE Trans Circ Syst Video Technol 2003,13(7):560-575.
  2. He Z, Liang Y, Chen L, Ahmad I, Wu D: Power-rate-distortion analysis for wireless video communication under energy constraints. IEEE Trans Circ Syst Video Technol 2005,15(5):645-658.
  3. Lian CJ, Chien SY, Lin CP, Tseng PC, Chen LG: Power-aware multimedia: concepts and design perspectives. IEEE Circ Syst Mag 2007,7(2):26-34.
  4. Chen YH, Chen TC, Chen LG: Power-scalable algorithm and reconfigurable macro-block pipelining architecture of H.264 encoder for mobile application. In Proceedings of IEEE International Conference on Multimedia and Expo. Ontario, Canada; 2006:281-284.
  5. Chen TC, Chen YH, Tsai CY, Tsai SF, Chien SY, Chen LG: 2.8 to 67.2 mW low-power and power-aware H.264 encoder for mobile applications. In Proceedings of IEEE Symposium on VLSI Circuits. Kyoto, Japan; 2007:222-223.
  6. Tai PL, Huang SY, Liu CT, Wang JS: Computation-aware scheme for software-based block motion estimation. IEEE Trans Circ Syst Video Technol 2003,13(9):901-913. doi:10.1109/TCSVT.2003.816510
  7. Ivanov YV, Bleakley CJ: Dynamic complexity scaling for real-time H.264/AVC video encoding. In Proceedings of the 9th International Conference on Multimedia. Augsburg, Germany; 2007:962-970.
  8. Ates HF, Altunbasak Y: Rate-distortion and complexity optimized motion estimation for H.264 video coding. IEEE Trans Circ Syst Video Technol 2008,18(2):159-171.
  9. Chang CY, Leou JJ, Kuo SS, Chen HY: A new computation-aware scheme for motion estimation in H.264. In Proceedings of IEEE International Conference on Computer and Information Technology. Sydney, Australia; 2008:561-565.
  10. Shen JF, Wang TC, Chen LG: A novel low-power full-search block-matching motion estimation design for H.263+. IEEE Trans Circ Syst Video Technol 2001,11(7):890-897. doi:10.1109/76.931116
  11. Tuan JC, Chang TS, Jen CW: On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture. IEEE Trans Circ Syst Video Technol 2002,12(1):61-72. doi:10.1109/76.981846
  12. Lin SS, Tseng PC, Chen LG: Low-power parallel tree architecture for full search block-matching motion estimation. In Proceedings of IEEE International Symposium on Circuits and Systems. British Columbia, Canada; 2004:313-316.
  13. Kuhn P: Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation. Kluwer Academic, Norwell, MA; 1999.
  14. Lin YK, Lin CC, Kuo TY, Chang TS: A hardware-efficient H.264/AVC motion-estimation design for high-definition video. IEEE Trans Circ Syst I 2008,55(6):1526-1535.
  15. Xu XZ, He Y: Modification of dynamic search range for JVT. In Joint Video Team, Doc JVT-Q088. Nice, France; 2005.
  16. Liu Z, Zhou J, Goto S, Ikenaga T: Motion estimation optimization for H.264/AVC using source image edge features. IEEE Trans Circ Syst Video Technol 2009,19(8):1095-1107.
  17. Shim H, Kyung CM: Selective search area reuse algorithm for low external memory access motion estimation. IEEE Trans Circ Syst Video Technol 2009,19(7):1044-1050.
  18. Chen WY, Ding LF, Tsung PK, Chen LG: Algorithm and architecture design of cache system for motion estimation in high definition H.264/AVC. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing. Las Vegas, USA; 2008:2193-2196.
  19. Chen TC, Chen YH, Tsai SF, Chien SY, Chen LG: Fast algorithm and architecture design of low-power integer motion estimation for H.264/AVC. IEEE Trans Circ Syst Video Technol 2007,17(5):568-577.
  20. Luo JH, Wang CN, Chiang TH: A novel all-binary motion estimation with optimized hardware architectures. IEEE Trans Circ Syst Video Technol 2002,12(8):700-712. doi:10.1109/TCSVT.2002.800859
  21. Wang SH, Tai SH, Chiang TH: A low-power and bandwidth-efficient motion estimation IP core design using binary search. IEEE Trans Circ Syst Video Technol 2009,19(5):760-765.
  22. Liu TM, Lin TA, Wang SZ, Lee WP, Yang JY, Hou KC, Lee CY: A 125 μW, fully scalable MPEG-2 and H.264/AVC video decoder for mobile applications. IEEE J Solid-State Circ 2007,42(1):161-169.
  23. Joint Video Team Reference Software JM12.2, ITU-T [Online] [http://iphome.hhi.de/suehring/tml/download/]
  24. Tourapis HYC, Tourapis AM: Fast motion estimation within the H.264 codec. In Proceedings of IEEE International Conference on Multimedia and Expo. Baltimore, USA; 2003:517-520.

Copyright

© Hsieh et al; licensee Springer. 2011

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.