Suppressing feedback in a distributed video coding system by employing real field codes

Louw, Daniel J; Kaneko, Haruhiko

doi:10.1186/1687-6180-2013-181

Research
Open access
Published: 05 December 2013

EURASIP Journal on Advances in Signal Processing volume 2013, Article number: 181 (2013)

Abstract

Single-view distributed video coding (DVC) is a video compression method that allows for the computational complexity of the system to be shifted from the encoder to the decoder. The reduced encoding complexity makes DVC attractive for use in systems where processing power or energy use at the encoder is constrained, for example, in wireless devices and surveillance systems. One of the biggest challenges in implementing DVC systems is that the required rate must be known at the encoder. The conventional approach is to use a feedback channel from the decoder to control the rate. Feedback channels introduce their own difficulties such as increased latency and buffering requirements, which makes the resultant system unsuitable for some applications. Alternative approaches, which do not employ feedback, suffer from either increased encoder complexity due to performing motion estimation at the encoder, or an inaccurate rate estimate. Inaccurate rate estimates can result in a reduced average rate-distortion performance, as well as unpleasant visual artifacts. In this paper, the authors propose a single-view DVC system that does not require a feedback channel. The consequences of inaccuracies in the rate estimate are addressed by using codes defined over the real field and a decoder employing successive refinement. The result is a codec with performance that is comparable to that of a feedback-based system at low rates without the use of motion estimation at the encoder or a feedback path. The disadvantage of the approach is a reduction in average rate-distortion performance in the high-rate regime for sequences with significant motion.

1 Introduction

In a standard transform-domain distributed video coding (DVC) system, frames are split into key frames (which are similar to I-frames in H.264) and Wyner-Ziv (WZ) frames. The key frames are intra-coded, and the WZ frames are discrete cosine-transformed, quantized and encoded using a systematic error correction code (ECC). The parity symbols from the encoding are then transmitted. At the decoder, advanced motion compensation methods are used to produce an estimate of the frame, also known as the side information (SI), from the intra-coded key frames. Errors in the SI are corrected using the received parity symbols.

The encoder must operate at a high enough rate (transmit enough parity symbols) to ensure that the errors can be corrected. However, calculating this rate at the encoder is fundamentally impossible without calculating the SI used at the decoder. Doing this at the encoder would defeat the purpose of DVC as it would increase the computational complexity of the encoder to the same level as that of conventional video coding. The conventional method used to solve this problem is to introduce a feedback path from the decoder [1]. If the rate estimate is too low, more bits are requested. This requires the video sequence to be decoded in real time and renders the codec unsuitable for any application where the compressed sequence needs to be stored for decompression at a later time. It also introduces latency in the process which can become severe if the decoder is far away from the encoder. Constraining the use of the feedback channel has been shown to alleviate some of these problems [2], though real-time decoding is still required.

The second method used to determine the rate is called suppressed feedback DVC, where a low complexity estimate of the SI is created at the encoder and used to estimate the rate. If the rate estimate is too low, then decoding failures can lead to significant distortion in the decoded frame. Conversely, if the rate is overestimated, any extra bits are wasted. Typically, the rate estimate is good for low motion sequences and poor for high or complex motion sequences. In order to achieve robust performance, extra bits may be transmitted to increase the likelihood of successful decoding. However, a finite probability of a decoding failure remains. Increasing the number of extra bits to reduce this probability implies that a larger percentage of transmitted bits will be redundant, yielding reduced rate-distortion (R-D) performance.

In this paper, the authors propose to solve this problem by changing the codec structure so that the output distortion will be a smoother function of the rate. As a result, any extra bits will continue to improve the distortion, and if the rate is underestimated, the image quality will degrade gradually. The rate-distortion characteristics are modified by changing the nature of the quantization region that the Wyner-Ziv code imposes on the signal space. This change is implemented by reversing the order of the quantizer and the low-density parity check (LDPC) encoder and by defining the code over the real field instead of over a finite field.

The results will show that the system reduces the variance of the distortion and improves the perceptual video quality, in comparison with more conventional feedback-free methods without increasing the encoding complexity. The trade-off introduced by the proposed method is a reduction in the average R-D performance as compared to systems with perfect rate knowledge.

This paper is structured as follows: Section 2 provides a short review of the relevant literature. Section 3 discusses real field coding and describes how it alleviates the effects of an imperfect rate estimate. Section 4 describes the overall design of the system. Specific aspects of the encoder and its subsystems are described in Section 5, while the decoder with its subsystems is discussed in Section 6. The complexity of the proposed system is analysed in Section 7 and the performance of the system is analysed in Section 8, with the conclusion following in Section 9.

2 Overview of the previous work

DVC is based on the Wyner-Ziv [3] and Slepian-Wolf (SW) [4] coding theorems. Several DVC systems have been proposed in the literature, along with many improvements to individual subsystems. The first practical systems were the Stanford system [5] and PRISM (Berkeley) [6]. Since then many improvements have been introduced. Notable milestones include the DISCOVER [1] and VISNET [7] codecs, both of which are based on the architecture of the Stanford system. The authors have also presented a codec based on the Stanford architecture [8, 9], of which the SI creation method is used in this paper.

The majority of competitive systems require a feedback path. Methods to remove feedback are based on estimating the required rate at the encoder and attempting to mitigate the visual effects of decoding failures at the decoder. Estimating the rate requires estimating the SI at the encoder, which increases the encoder complexity. Some estimate must be made and thus the increased complexity at the encoder becomes a trade-off point for R-D performance.

There have been some notable proposals for feedback suppressed systems. A pixel-domain feedback suppressed system was developed in [10]. However, the simulation results did not consider the key frames in the R-D performance, making comparisons difficult. In [11] and [12] there were attempts to learn the required rate from the SI estimate using machine learning and neural networks, respectively. In [13] a pixel-domain system, exploiting spatial and temporal correlations along with iterative decoding was presented. A system with a structure not relying on conventional Wyner-Ziv techniques and exploiting overlapped block motion estimation with probabilistic compensation (OBMEPC) and SI dependent correlation channel estimation (SID-CE) was presented in [14]. This system produced excellent results at a reduced encoding complexity. A system similar to [10] but for transform-domain DVC was presented in [15], where a multi-mode SI method was used at the encoder. More advanced and complete attempts at creating and analysing a feedback-free DVC system were presented in [16] and [17]. These systems rely on performing reduced complexity motion estimation at the encoder. Sophisticated techniques related to the correlation noise modelling, improved side information generation, and mitigation of decoding errors allowed authors in [16] to achieve R-D performance approaching that of the DISCOVER codec. A useful addition in [17] is the use of a hash to improve the motion estimation at the decoder. Future research may consider the complexity vs. rate-distortion trade-off of including low complexity motion estimation at the encoder, but in this paper, it is assumed that motion estimation is not desired at the encoder.

Encoding over the real field, quantizing the result and decoding with side information can be shown to be mathematically similar to compressive sensing (CS), if the error signal is assumed to be sparse or compressible. The effects of quantization and the R-D performance of quantized CS were analysed in [18]. In the proposed system, we show how the use of a binning quantizer [19] can improve the R-D performance when the rate estimate is accurate. CS has been applied to video compression previously, for example in [20]. However, previous systems differ from the one presented here, since in these systems CS is used to reduce the sampling requirements in the pixel domain and, as such, the discrete cosine transform (DCT) is not used.

3 Real field coding

In this section, real field coding and its effect on Wyner-Ziv coding are described. Wyner-Ziv coding can be understood as a process of creating discontiguous quantization bins over the signal space. This can most easily be seen when the Wyner-Ziv code is implemented as a nested lattice quantizer [21]. A scalar quantizer followed by an error correction code performs the same role, where the syndrome describes a coset of codewords, each of which maps to a quantization cell in the signal space. The expected distortion, $D_{W|Z}$, of a Wyner-Ziv code can be simplistically expressed as follows:

\begin{array}{l} D_{W | Z} = E [D_{B}] P [ε] + E [D_{Q}] (1 - P [ε]), \end{array}

(1)

where $P[ε]$ is the probability of a codeword error, $D_{B}$ is the distortion introduced by a codeword error and $D_{Q}$ is the distortion introduced by quantization. When the code is operating at a high enough rate, $P[ε] \to 0$ and the quantizer is the dominant distortion source. However, when the rate is too low, $P[ε]$ can no longer be considered negligible and the distortion of a codeword error significantly affects performance. A real field code creates a contiguous subspace over the signal space. This changes the distortion function to a more continuous and gradual function of the noise, since there is no codeword error-based distortion. If $x \in R^{N}$ is the signal, encoding can be expressed in a similar manner as for conventional error correction codes:

\begin{array}{l} y = G^{T} x, \end{array}

(2)

where $G \in R^{N \times (N + M)}$ is the generator matrix and $y \in R^{(N + M)}$ is the encoded signal. Since only the parity symbols are transmitted, the generator matrix is in systematic form:

\begin{array}{l} G = [I_{N} | P], \end{array}

(3)

where $P \in R^{N \times M}$ is a parity matrix. Let s be the parity symbols (also referred to as the syndrome in Wyner-Ziv literature):

\begin{array}{l} s & = {[y_{N + 1} \dots y_{N + M}]}^{T} \end{array}

(4)

\begin{array}{l} = P^{T} x . \end{array}

(5)
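As an illustration of (2) to (5), the following minimal sketch encodes a toy vector with a systematic real-valued generator matrix and keeps only the parity symbols. The sizes, the dense Gaussian parity matrix and the variable names are assumptions made for illustration; the codes actually used in this paper are sparse and are described in Section 5.1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 3                        # toy sizes, far smaller than a real subband

P = rng.normal(size=(N, M))        # hypothetical dense parity matrix
G = np.hstack([np.eye(N), P])      # systematic generator G = [I_N | P], eq. (3)

x = rng.laplace(size=N)            # toy source vector
y = G.T @ x                        # full codeword of length N + M, eq. (2)
s = y[N:]                          # parity symbols, eq. (4)

# The systematic part reproduces x, and the parity part equals P^T x, eq. (5).
print(np.allclose(y[:N], x), np.allclose(s, P.T @ x))
```

Only `s` would be quantized and transmitted; the first $N$ entries of `y` are never sent, since the decoder relies on the side information instead.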

The parity symbols identify a specific subspace with $(N - M)$ dimensions in $R^{N}$ in which $x$ must lie. For example, if $N = 3$ and $M = 2$, then $s$ will describe a specific line along which $x$ lies. Decoding will find the most likely point on this line. In finite field coding, the same is true over the finite-field-based signal space, but when the correct codewords are mapped back to the signal space, they map to non-contiguous quantization regions.

Assuming a spherically symmetrical i.i.d Gaussian distributed error vector, the components of the error not along the line will be removed during decoding. Thus, if the original error variance was $σ_{e}^{2}$ , then the decoded error variance, $σ_{d}^{2}$ , becomes

\begin{array}{l} σ_{d}^{2} = σ_{e}^{2} \frac{N - M}{N} . \end{array}

(6)
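The variance reduction in (6) can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a dense random matrix in the role of $P^{T}$, unquantized parity symbols, and an orthogonal projection onto the set of points consistent with the parity symbols. This projection is the maximum likelihood estimate for i.i.d. Gaussian noise, and it is not the GBP decoder used in the proposed system.

```python
import numpy as np

def project_onto_syndrome(y, H, s):
    """Closest point to y (Euclidean distance) that satisfies H x = s."""
    residual = H @ y - s
    return y - H.T @ np.linalg.solve(H @ H.T, residual)

rng = np.random.default_rng(0)
N, M, trials, sigma_e = 60, 20, 2000, 1.5    # toy values
H = rng.normal(size=(M, N))                  # plays the role of P^T

err_in = err_out = 0.0
for _ in range(trials):
    x = rng.normal(size=N)                   # source
    e = sigma_e * rng.normal(size=N)         # Gaussian side information error
    x_hat = project_onto_syndrome(x + e, H, H @ x)
    err_in += np.sum(e ** 2)
    err_out += np.sum((x_hat - x) ** 2)

ratio = err_out / err_in
print(f"measured ratio {ratio:.3f}, predicted (N - M) / N = {(N - M) / N:.3f}")
```

The projection removes the $M$ error components that the parity symbols can observe and leaves the remaining $N - M$ components untouched, which is where the factor $(N - M)/N$ comes from.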

If the noise is not Gaussian, the decoded variance can be reduced even further. For example, if the noise is K-sparse, which means $\|e\|_{0} = K$ with $K \ll N$, then with $M \geq K + 1$, perfect reconstruction is possible, assuming an $\ell_{0}$-minimization decoder, though the problem is NP-complete [22]. If the noise is Laplace distributed, as is typically the case for DVC, or more generally compressible, then the performance lies between these two cases. Therefore, while there is no guarantee of a specific R-D performance due to a lack of knowledge about the channel, the method will always reduce the distortion in the SI, and there will be no catastrophic decoding failures due to noise variance underestimation.

Before the real field encoded parity symbols can be transmitted or stored, they must be quantized. We thus calculate:

\begin{array}{l} s^{Q} = Q (s) . \end{array}

(7)

The quantized parity symbols describe a cell in $R^{N}$ that is $N$-dimensional, but bounded in $M$ of those dimensions. The size of the bounded dimensions depends on the quantization bin size. There will thus be an additional quantization noise component added to the distortion described in (6). Quantizer design is considered in Sections 5.3 and 5.4. From this point, we will refer to $P^{T}$ as $H$ to more closely align with the notation used in the Wyner-Ziv coding literature.

4 Description of the proposed system

Figure 1 shows a block diagram of the proposed system. Incoming frames are split into key frames, which are H.264 intra-coded, and WZ frames. Let the original pixel-domain WZ frame, at time $t$ in the group-of-pictures (GOP), be referred to as $f[t]$; $f[t](i,j)$ refers to the pixel at position $(i,j)$. The GOP size ($N_{G}$) refers to the number of WZ frames in between the key frames plus one. The last key frame from the previous GOP will be referred to as $f[0]$ and the key frame at the end of the current GOP as $f[N_{G}]$.

4.1 Encoder

The WZ frame is transformed with the block-based DCT to yield:

\begin{array}{l} F [t] = DCT (f [t]) . \end{array}

(8)

The $b$th subband of the frame will be referred to as $F_{b}[t]$, and the coefficient corresponding to the $(v,w)$th subblock will be referred to as $F_{b}[t](v,w)$. For notational brevity, we will now drop the time indices and assume that each WZ frame in the GOP is treated in a similar manner. The number of bits assigned to each subband is calculated using a rate estimation algorithm (Section 5.2) aiming for a specific distortion. The system attempts to achieve the same distortion as a conventional DVC system using a pre-defined quantization matrix. For this paper, eight quantization profiles taken from [23] are used in conjunction with a 4×4 DCT. We also consider quarter common intermediate format (QCIF) resolution (176×144) video sequences, which result in $N$ = 1,584 elements per subband. After estimating the required rate for each subband according to the quantization profile, the bit budget ($ℜ$) is allocated to a specific code size ($M_{b}$) and bits per quantized symbol ($q_{b}$) as described in Section 5.4. Let $x_{b} = vect(F_{b})$ be an $(N \times 1)$ vectorised version of $F_{b}$; then $x_{b}$ is encoded:

\begin{array}{l} s_{b} = H_{b} x_{b}, \end{array}

(9)

where $H_{b}$ is the $M_{b} \times N$ parity check matrix used for the $b$th subband. Each encoded subband is then quantized with a symmetric uniform quantizer (Section 5.3). The quantized subband will be referred to as follows:

\begin{array}{l} s_{b}^{Q} = Q_{b} (s_{b}), \end{array}

(10)

where $Q_{b}(\cdot)$ is the quantizer for the $b$th subband. If $q_{b}$ represents the number of bits assigned to $Q_{b}(\cdot)$, then $s_{b}^{Q}$ will have elements in $\{0, 1, \dots, L_{b} - 1\}$, where $L_{b} = 2^{q_{b}}$. Each quantized subband is then sent to the decoder along with the intra-coded key frames. Due to the lack of a feedback path, frames do not need to be buffered after encoding. The parameters required to describe the quantizer, as discussed in Section 5.3, must also be transmitted.

4.2 Decoder

At the decoder, the key frames and possibly other decoded frames are used to create an estimate of $f$. In the multi-hypothesis case, more than one estimate of $f$ is created. The $h$th hypothesis will be denoted by ${\hat{f}}_{h}$. For this system we use four hypotheses as described in [8]. Each hypothesis is transformed to produce ${\hat{F}}_{h}$ and each subband is vectorised to yield $y_{hb}$:

\begin{array}{l} {\hat{F}}_{h} & = DCT ({\hat{f}}_{h}) \end{array}

(11)

\begin{array}{l} y_{hb} & = vect ({\hat{F}}_{hb}), \end{array}

(12)

where ${\hat{F}}_{hb}$ is the b th subband of ${\hat{F}}_{h}$ . If the error for each hypothesis is given as

\begin{array}{l} r_{h} = f - {\hat{f}}_{h}, \end{array}

(13)

we calculate an estimate ${\hat{r}}_{h}$ . The method used to calculate the estimated error will differ for each hypothesis and depend on the method used to produce the hypothesis. From the error estimates, noise parameters are calculated at a coefficient level using the method developed in [23]. In order to improve the relative quality of the noise estimation for the different hypotheses, a motion-adaptive scaling factor is used to scale the Laplace noise parameters of the key frame and motion-based hypotheses as described in [8].

For each subband, the error estimates and the $N_{H}$ hypotheses are combined with the received parity information and decoded using a Gaussian belief propagation (GBP) algorithm which attempts to solve the equation:

\begin{array}{l} {\hat{x}}_{n} = \underset{x_{n} \in R}{arg max} f_{X | Y, S^{Q}} (x_{n} | Y, s^{Q}), \end{array}

(14)

where $n$ indicates the $n$th coefficient in a subband with $N$ coefficients and the subband indices are dropped for notational brevity. $Y$ is an $N_{H} \times N$ matrix with entries $y_{hn}$, where the $h$th row represents the $N$ received symbols of the $h$th hypothesis. As opposed to standard DVC, no reconstruction is required. The decoded bands are recompiled to produce $\hat{F}$, and then the pixel-domain output frame $\hat{f}$ is calculated:

\begin{array}{l} \hat{f} = IDCT (\hat{F}) . \end{array}

(15)

After decoding, the process is repeated in a method similar to successive refinement [24–27]. Using the decoded frame $\hat{f}$ , a more accurate motion estimation and compensation algorithm is used to produce higher quality SI. New correlation noise estimates must then be produced for the SI, after which the GBP algorithm is used to decode the improved SI. This process may be iterated a number of times, though two iterations are typically sufficient. If the original decoding introduced an error, then the iterative process may worsen performance as the motion-compensated interpolation (MCI) process may produce worse quality SI. We will now discuss the subsections in more detail.

5 Encoder

In this section we discuss the specifics of the different parts of the encoder. Both the time and subband indices have been dropped from most of the equations, since the same process is applied to all the subbands for each of the WZ coded frames.

5.1 Real field code design

Coding over the real field as opposed to a finite field has not been considered much due to the prevalence of digital systems. Some early analog codes were based on Bose-Chaudhuri-Hocquenghem (BCH) codes and the discrete Fourier transform (DFT) [28]. Large codes, analogous to finite field LDPC codes, were considered recently in the context of CS [29]. While exact design strategies are still being developed, it has been shown that codes that work well as binary LDPC codes also work well over the real field.

LDPC code design can be described as designing the decoding graph and choosing the connection weights. In this paper, we use the quasi-cyclic (QC) codes designed in [30] to design the decoding graph (the parity check matrix structure). QC codes were chosen because of their fast encoding algorithms and reduced storage requirements at the encoder, as compared with random codes. For a code with an M × N parity matrix H, the connection weights (non-zero elements) were selected at random from a Gaussian distribution such that the expected total energy of the encoded signal is the same as that of the original signal $M σ_{s}^{2} = N σ_{x}^{2}$ .
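The energy constraint $M σ_{s}^{2} = N σ_{x}^{2}$ can be turned into a weight variance under some simplifying assumptions. If every row of $H$ has the same number of non-zero weights (`row_weight` below), and the source symbols are zero mean, i.i.d. and independent of the weights, then $σ_{s}^{2} = \text{row\_weight} \cdot σ_{h}^{2} σ_{x}^{2}$, which gives $σ_{h}^{2} = N / (M \cdot \text{row\_weight})$. The sketch below checks this numerically. It uses a randomly placed sparse structure as a stand-in for the QC graph of [30], and the function name, the row weight and the sizes are illustrative assumptions rather than the values used in the system.

```python
import numpy as np

def random_sparse_parity_matrix(M, N, row_weight, rng):
    """Sparse M x N matrix with Gaussian weights scaled to preserve energy.

    The non-zero positions are random here; the paper takes them from a
    quasi-cyclic LDPC construction instead.
    """
    sigma_h = np.sqrt(N / (M * row_weight))
    H = np.zeros((M, N))
    for m in range(M):
        cols = rng.choice(N, size=row_weight, replace=False)
        H[m, cols] = rng.normal(scale=sigma_h, size=row_weight)
    return H

rng = np.random.default_rng(0)
N, M, row_weight = 1584, 198, 12             # M and row_weight are toy values
H = random_sparse_parity_matrix(M, N, row_weight, rng)

X = rng.laplace(size=(N, 200))               # 200 toy source vectors as columns
S = H @ X                                    # encoded signals
energy_ratio = np.sum(S ** 2) / np.sum(X ** 2)
print(f"encoded energy / source energy = {energy_ratio:.3f}")
```

The ratio is close to one only in expectation; any single matrix realisation deviates slightly because the weights are random.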

The codes from [30] are girth-6 codes that are constructed using Galois fields. Depending on the number of parity symbols, $M$, we use a different construction field so as to ensure that the code remains girth-6 while being as dense as possible. If the construction field is the Galois field, $GF (2^{q_{c}})$, then the value for $q_{c}$ is chosen as

\begin{array}{l} q_{c} = \begin{cases} 5 & \text{if } 70 \leq M \leq 120 \\ 6 & \text{if } 120 < M \leq 250 \\ 7 & \text{if } M > 250 . \end{cases} \end{array}

(16)

$M$ corresponds to the number of transmitted symbols. These values are sufficient for the video resolution considered in this paper, where $N$ = 1,584. If higher-resolution frames are encoded, then $N$ and $M$ might increase, and the codes should be designed with the required size in mind. We perform decoding using a version of GBP described in Section 6.1.

To demonstrate the performance of the real field codes, we evaluate the codes using random Laplace distributed data and noise. Figure 2 shows the ratio of the noise variance after decoding, $σ_{d}^{2}$ , to the initial noise variance, $σ_{e}^{2}$ , as a function of M for a range of signal-to-noise ratio (SNR) values. The number of quantization bits was fixed at a large value (q = 12) to remove the effect of the quantization noise from the analysis. The number of decoding iterations was fixed at 10. The figure shows that the decrease in the noise depends on the initial SNR. As M increases the slope decreases indicating a decreased effectiveness. This is however a function of the decoding algorithm and the number of decoding iterations. The performance at large M can be improved by increasing the number of iterations. This improvement would, however, come at the cost of a greater decoding complexity.

5.2 Rate estimation

The proposed system operates at an estimate of the theoretical Wyner-Ziv rate. This rate is calculated assuming a conventional system with a mid-rise symmetrical quantizer for the DC band and a dead-zone uniform quantizer for the AC bands. The bits assigned to each band for each rate-distortion point are taken from the quantization profiles in [23]. While these quantizers are used to calculate the bit budget assigned to each subband, a different quantizer is used after encoding. This is because we are quantizing the encoded signal and not the original signal. The code size $M_{b}$ and number of quantization bits $q_{b}$ are the design parameters that affect performance, where $ℜ = M_{b} q_{b}$ is the bit budget.

We estimate the rate required to reconstruct the source x quantized with bin size Δ. If x is the signal, e is the error and y is the side information, then the system model for a given subband is

\begin{array}{l} y = x + e, \end{array}

(17)

where the elements of $x$ and $e$ are Laplace distributed and drawn i.i.d. from $X \sim ℒ (0, α_{x})$ and $E \sim ℒ (0, α_{e})$ respectively [31]. $x$ is quantized to $x^{Q}$ with a $q$-bit quantizer, where the value for $q$ is taken from [23]. The required transmission rate is lower bounded by the entropy remaining in the quantized source given the side information [4]. The per symbol rate can thus be expressed as

\begin{array}{l} R \geq H (X^{Q} | Y^{Q}), \end{array}

(18)

where $H(X^{Q}|Y^{Q})$ is the conditional entropy of the quantized source, $X^{Q}$, given the quantized side information $Y^{Q}$. The final bit budget per subband will be $ℜ = NR$.

5.2.1 Correlation channel model

In order to estimate the rate, we require an estimate of the signal variance and the noise variance. We use the decoded key frames, $\hat{f} [0]$ and $\hat{f} [N_{G}]$ , which are automatically available at the encoder due to the intra-coding algorithm, to create an interpolated frame without motion estimation:

\begin{array}{l} \hat{f} [t] = \frac{1}{2} (\hat{f} [0] + \hat{f} [N_{G}]) . \end{array}

(19)

An estimate of the residual, $\hat{r}$ , is then used to calculate the required variance estimates:

\begin{array}{l} \hat{r} [t] & = f [t] - \hat{f} [t] \end{array}

(20)

\begin{array}{l} \hat{R} [t] & = DCT [\hat{r} [t]] \end{array}

(21)

\begin{array}{l} σ_{e}^{2} (b) & = VAR ({\hat{R}}_{b}) \end{array}

(22)

\begin{array}{l} σ_{x}^{2} (b) & = VAR (DCT [f] (b)) \end{array}

(23)

\begin{array}{l} σ_{y}^{2} (b) & = σ_{x}^{2} (b) + σ_{e}^{2} (b) . \end{array}

(24)

While the variance estimation method works fairly well for the DC band, the estimates for the AC bands are not as accurate. To improve the estimate, we estimate the noise variance of band $b$ as usual and then average the value with a variance predicted from the previous band:

\begin{array}{l} {\hat{σ}}_{e}^{2} {(b)}^{*} = \frac{1}{2} {\hat{σ}}_{e}^{2} (b) + \frac{1}{2} {\hat{σ}}_{e}^{2} {(b - 1)}^{*} \frac{{\hat{σ}}_{x}^{2} (b)}{{\hat{σ}}_{x}^{2} (b - 1)} . \end{array}

(25)
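The estimates in (19) to (25) can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the block transform is an orthonormal 4×4 DCT-II, the bands are indexed in raster order inside each block, the recursion in (25) starts from the unsmoothed DC estimate, and all function names are hypothetical.

```python
import numpy as np

def dct_matrix(n=4):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] /= np.sqrt(2.0)
    return C

def subbands(frame, n=4):
    """Block DCT of a frame; returns shape (n * n, number_of_blocks)."""
    C = dct_matrix(n)
    h, w = frame.shape
    blocks = frame.reshape(h // n, n, w // n, n).transpose(0, 2, 1, 3)
    coeffs = C @ blocks @ C.T                 # 2-D DCT of every block
    return coeffs.reshape(-1, n * n).T        # raster band order (assumption)

def noise_variance_estimates(f_t, key_prev, key_next):
    """Per-band signal and noise variances without motion estimation."""
    interp = 0.5 * (key_prev + key_next)              # eq. (19)
    var_e = subbands(f_t - interp).var(axis=1)        # eqs. (20)-(22)
    var_x = subbands(f_t).var(axis=1)                 # eq. (23)
    smoothed = var_e.copy()                           # DC band left as is
    for b in range(1, var_e.size):                    # eq. (25)
        if var_x[b - 1] > 0:
            predicted = smoothed[b - 1] * var_x[b] / var_x[b - 1]
            smoothed[b] = 0.5 * var_e[b] + 0.5 * predicted
    return var_x, var_e, smoothed

rng = np.random.default_rng(0)
key_prev = rng.uniform(0, 255, size=(144, 176))       # toy QCIF-sized frames
key_next = key_prev + rng.normal(scale=4.0, size=key_prev.shape)
f_t = 0.5 * (key_prev + key_next) + rng.normal(scale=2.0, size=key_prev.shape)

var_x, var_e, var_e_smoothed = noise_variance_estimates(f_t, key_prev, key_next)
var_y = var_x + var_e_smoothed                        # eq. (24)
print(subbands(f_t).shape)                            # 16 bands of N symbols
```

In the codec the two key frame arguments would be the decoded key frames $\hat{f}[0]$ and $\hat{f}[N_{G}]$; random arrays are used here only to make the sketch self-contained.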

5.2.2 Conventional bit plane based estimates

Most DVC systems encode each bit plane of a DCT subband independently using binary codes. As a result, an estimate of the required rate for each bit plane is required. The standard method for approaching this problem is to consider a binary symmetric channel (BSC) for each bit plane [16, 32]. The minimum rate for a given bit plane is calculated as

\begin{array}{l} R_{b} & \geq H_{b} (ε) \end{array}

(26)

\begin{array}{l} = - ε log ε - (1 - ε) log (1 - ε), \end{array}

(27)

where ε is the crossover probability of the virtual BSC. The crossover probability is defined as the probability that $x_{b} \neq {\hat{x}}_{b}$ , where:

\begin{array}{l} {\hat{x}}_{b} = arg max_{i = 0, 1} Pr (x_{b} = i | y, x_{b - 1}, \dots, x_{1}), \end{array}

(28)

and $x_{b}$ is the $b$th bit plane of $X^{Q}$. In practice, the bit plane entropy is calculated as [16, 33]

\begin{array}{l} p_{n} (i) & = \frac{Pr (x_{b} = i, x_{b - 1}, \dots, x_{1} | y)}{Pr (x_{b - 1}, \dots, x_{1} | y)} \end{array}

(29)

\begin{array}{l} H_{b} (p_{n}) & = - p_{n} (0) log p_{n} (0) - p_{n} (1) log p_{n} (1), \end{array}

(30)

where n is the n th symbol in the subband. The final entropy is then calculated as

\begin{array}{l} H_{b} = \frac{1}{N} \sum_{{WZ}^{j}} H_{b} (p_{n}), \end{array}

(31)

with $WZ^{j}$ representing all the symbols in the $j$th subband. This method requires $N$ entropy calculations for each bit plane of each subband.
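The averaging in (30) and (31) is straightforward once the per-symbol probabilities are available. The sketch below assumes that an array `p_one` holding $p_{n}(1)$ for every symbol has already been computed from the correlation model, which is the expensive part and is not shown; the function names are hypothetical.

```python
import numpy as np

def binary_entropy(p):
    """Binary entropy in bits, clipped to avoid log(0)."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bitplane_rate(p_one):
    """Average of the per-symbol binary entropies, as in (30) and (31)."""
    return float(np.mean(binary_entropy(p_one)))

# Toy probabilities for five symbols of one bit plane.
p_one = np.array([0.5, 0.9, 0.99, 0.1, 0.02])
print(f"estimated rate: {bitplane_rate(p_one):.3f} bits per symbol")
```

Symbols whose bit is almost certain contribute little to the rate, whereas symbols with $p_{n}(1)$ near one half cost close to a full bit.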

5.2.3 Symbol-wise conditional entropy estimation

The proposed system operates at a symbol level, but the bit-plane-based method could still be used by adding the entropy estimates for each bit plane in a symbol. However, there are some advantages to estimating the entropy directly. The definition of symbol-wise conditional entropy is as follows:

\begin{array}{l} H (X^{Q} | Y^{Q}) & = - \sum_{X^{Q}} \sum_{Y^{Q}} P (X^{Q}, Y^{Q}) log P (X^{Q} | Y^{Q}) \end{array}

(32)

\begin{array}{l} = - \sum_{X^{Q}} \sum_{Y^{Q}} P (X^{Q}, Y^{Q}) log \frac{P (X^{Q}, Y^{Q})}{P (Y^{Q})} . \end{array}

(33)

This method was used for a pixel-domain system [34] that assumed a uniform distribution for $X$. In general, estimating $H(X^{Q}|Y^{Q})$ directly from (32) requires evaluating $P(X^{Q}, Y^{Q})$ $L^{2}$ times, where $L = 2^{q}$ and $q$ is the number of bits in the quantizer. Instead, we propose to evaluate a different form of the expression:

\begin{array}{l} H (X^{Q} | Y^{Q}) & = H (X^{Q}) - I (X^{Q}; Y^{Q}) \end{array}

(34)

\begin{array}{l} = H (X^{Q}) - H (Y^{Q}) + H (Y^{Q} | X^{Q}), \end{array}

(35)

where $I(X^{Q}; Y^{Q})$ is the mutual information between $X^{Q}$ and $Y^{Q}$. The authors previously analysed this approach in [35]. We now make the simplifying assumption:

\begin{array}{l} H (Y^{Q} | X^{Q}) \approx H (E^{Q}), \end{array}

(36)

which allows the required rate to be calculated as a function of the entropies of three single variables, each of which is simpler to calculate than the conditional entropy. This is a high-rate assumption which will typically be inaccurate for large Δ values.
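The identity in (34) and (35) and the effect of the approximation in (36) can be examined on simulated data. The sketch below is a Monte Carlo illustration with assumed toy parameters: it uses a mid-rise index for $X$ and $Y$, a mid-tread index for $E$ (one reasonable choice, not necessarily the one used in the system), and empirical probability mass functions in place of the Laplace model.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(p_ab):
    """H(A|B) from a joint pmf with A on rows and B on columns, as in (33)."""
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    ratio = np.divide(p_ab, p_b, out=np.ones_like(p_ab), where=mask)
    return float(-(p_ab[mask] * np.log2(ratio[mask])).sum())

def empirical_joint(a, b):
    """Empirical joint pmf of two integer index arrays."""
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    counts = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(counts, (ia, ib), 1.0)
    return counts / counts.sum()

rng = np.random.default_rng(1)
n, delta = 200_000, 8.0                       # toy sample count and bin size
x = rng.laplace(scale=20.0, size=n)           # alpha_x = 1/20
e = rng.laplace(scale=2.0, size=n)            # alpha_e = 1/2
y = x + e                                     # eq. (17)

xq = np.floor(x / delta).astype(int)          # mid-rise indices
yq = np.floor(y / delta).astype(int)
eq = np.rint(e / delta).astype(int)           # mid-tread indices for the error

p_xy = empirical_joint(xq, yq)
h_x = entropy(p_xy.sum(axis=1))
h_y = entropy(p_xy.sum(axis=0))
h_x_given_y = conditional_entropy(p_xy)       # direct evaluation, eq. (33)
h_y_given_x = conditional_entropy(p_xy.T)
h_e = entropy(np.bincount(eq - eq.min()) / n)

exact = h_x - h_y + h_y_given_x               # eq. (35), an identity
approx = h_x - h_y + h_e                      # eq. (35) with assumption (36)
print(f"H(X|Y) = {h_x_given_y:.4f}, identity = {exact:.4f}, approx = {approx:.4f}")
```

The first two printed values agree up to floating point error because (35) is an identity; the gap between them and the third value is the error introduced by (36) for this particular bin size.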

Assuming that $X$ is Laplace distributed, $H(X^{Q})$ can be expressed as a summation that depends on the quantizer. When a dead-zone quantizer is used (as for the AC coefficients), we compute the entropy $H^{D}(X^{Q})$ as:

\begin{array}{l} H^{D} (X^{Q}) = - \sum_{i = - \infty}^{\infty} P (i Δ) \log_{2} [P (i Δ)] \end{array}

(37)

\begin{array}{l} P (i Δ) = \begin{cases} \int_{i Δ}^{(i + 1) Δ} f_{X} (x) dx & \text{if } i > 0 \\ \int_{(i - 1) Δ}^{(i + 1) Δ} f_{X} (x) dx & \text{if } i = 0 \\ \int_{(i - 1) Δ}^{i Δ} f_{X} (x) dx & \text{if } i < 0, \end{cases} \end{array}

(38)

yielding:

\begin{array}{l} P (i Δ) = \begin{cases} \frac{1}{2} e^{- i α_{x} Δ} (1 - e^{- α_{x} Δ}) & \text{if } i > 0 \\ 1 - e^{- α_{x} Δ} & \text{if } i = 0 \\ \frac{1}{2} e^{i α_{x} Δ} (1 - e^{- α_{x} Δ}) & \text{if } i < 0 . \end{cases} \end{array}

(39)

Calculating the infinite sum and recombining yields:

\begin{array}{l} H^{D} (X^{Q}) = - (1 - e^{- α_{x} Δ}) \log (1 - e^{- α_{x} Δ}) \\ \quad - e^{- α_{x} Δ} \left[ \log \left( \frac{1 - e^{- α_{x} Δ}}{2} \right) - \frac{α_{x} Δ \log (e)}{1 - e^{- α_{x} Δ}} \right] . \end{array}

(40)

For a uniform mid-rise quantizer that is symmetrical around zero, the entropy is denoted by $H^{S}(X^{Q})$ and can be calculated in a similar manner to yield:

\begin{array}{l} H^{S} (X^{Q}) = - 2 \left[ \frac{1}{2} (1 - e^{- α_{x} Δ}) \right] \log \left[ \frac{1}{2} (1 - e^{- α_{x} Δ}) \right] \\ \quad - e^{- α_{x} Δ} \left[ \log \left( \frac{1 - e^{- α_{x} Δ}}{2} \right) - \frac{α_{x} Δ \log (e)}{1 - e^{- α_{x} Δ}} \right] . \end{array}

(41)

For a uniform mid-tread quantizer, the entropy, $H^{M}(X^{Q})$, is

\begin{array}{l} H^{M} (X^{Q}) = - (1 - e^{- \frac{α_{x} Δ}{2}}) \log (1 - e^{- \frac{α_{x} Δ}{2}}) \\ \quad - e^{- \frac{α_{x} Δ}{2}} \left[ \log \left( \frac{e^{\frac{α_{x} Δ}{2}} - e^{- \frac{α_{x} Δ}{2}}}{2} \right) - \frac{α_{x} Δ \log (e)}{1 - e^{- α_{x} Δ}} \right] . \end{array}

(42)

These closed-form equations all assume that the quantizer is defined by the bin size and that the range of the variable is theoretically infinite. In reality, there will be a limited number of levels considered. We can thus also calculate $H(X^{Q})$ by evaluating (37) and (39) over a finite number of levels. However, the difference between these is small enough to be ignored and we choose to use the closed-form solutions provided by the infinite quantizer instead.
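The closed forms (40), (41) and (42) can be compared against a direct evaluation of the sum in (37) over a large but finite number of bins. The following sketch does this for one pair of toy parameter values; the function names and the values of `alpha` and `delta` are illustrative assumptions, and logarithms are taken to base 2 throughout.

```python
import math

LOG2E = math.log2(math.e)

def h_deadzone(alpha, delta):
    """Closed form (40): dead-zone quantizer, central bin [-delta, delta]."""
    a = math.exp(-alpha * delta)
    tail = math.log2((1 - a) / 2) - alpha * delta * LOG2E / (1 - a)
    return -(1 - a) * math.log2(1 - a) - a * tail

def h_midrise(alpha, delta):
    """Closed form (41): mid-rise quantizer with a bin edge at zero."""
    a = math.exp(-alpha * delta)
    p0 = 0.5 * (1 - a)
    tail = math.log2((1 - a) / 2) - alpha * delta * LOG2E / (1 - a)
    return -2 * p0 * math.log2(p0) - a * tail

def h_midtread(alpha, delta):
    """Closed form (42): mid-tread quantizer, central bin [-delta/2, delta/2]."""
    a = math.exp(-alpha * delta)
    r = math.exp(-alpha * delta / 2)
    tail = math.log2((1 / r - r) / 2) - alpha * delta * LOG2E / (1 - a)
    return -(1 - r) * math.log2(1 - r) - r * tail

def h_numeric(alpha, delta, first_edge, central_bin, n_bins=4000):
    """Direct evaluation of (37) with n_bins bins on each side of zero."""
    probs = [1 - math.exp(-alpha * first_edge)] if central_bin else []
    for i in range(n_bins):
        lo = first_edge + i * delta
        p = 0.5 * (math.exp(-alpha * lo) - math.exp(-alpha * (lo + delta)))
        probs += [p, p]                       # mirrored positive/negative bins
    return -sum(p * math.log2(p) for p in probs if p > 0)

alpha, delta = 0.2, 2.0                       # toy values
checks = {
    "dead-zone": (h_deadzone(alpha, delta), h_numeric(alpha, delta, delta, True)),
    "mid-rise": (h_midrise(alpha, delta), h_numeric(alpha, delta, 0.0, False)),
    "mid-tread": (h_midtread(alpha, delta), h_numeric(alpha, delta, delta / 2, True)),
}
for name, (closed, numeric) in checks.items():
    print(f"{name}: closed form {closed:.6f}, finite sum {numeric:.6f}")
```

With this many bins the truncated tail carries a negligible probability mass, so the two columns should agree to many decimal places; a practical quantizer with far fewer levels would show a larger, though still small, difference.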

If we assume that both the side information and the noise are Laplace distributed, then the equations derived for the source signal, (40), (41) and (42), can be used to calculate $H(Y^{Q})$ and $H(E^{Q})$ as well. However, many subbands in DVC coding are assigned only a small number of bits, and as a result the high-rate assumption may not hold. To improve the accuracy of the method, we use a slightly different approach for $H(E^{Q})$. For the DC band, a conventional mid-rise uniform quantizer is typically used on the signal. In this case, we use the equation for the entropy of a mid-tread quantizer, (42), to calculate $H(E^{Q})$. For the AC bands, a dead-zone quantizer is typically used on the signal. When $X$ is in the dead zone, $P(Y^{Q}|X^{Q})$ can be approximated by $P(E^{Q})$, assuming a dead-zone quantizer for the error. However, when $X$ is not in the dead zone, $P(Y^{Q}|X^{Q})$ is better approximated by $P(E^{Q})$, assuming a mid-tread quantizer. Thus, for the AC bands, we estimate $H(E^{Q})$ as follows:

\begin{array}{l} H (E^{Q}) = P (X^{Q} = 0) H^{D} (E^{Q}) + P (X^{Q} \neq 0) H^{M} (E^{Q}) . \end{array}

(43)

5.3 Quantizer

The encoded signal must be quantized before transmission. If the bit budget for a subband is $ℜ$ and the length of the code is M _b, then we use a $q_{b} = ℜ / M_{b}$ bit uniform quantizer with $L = 2^{q_{b}}$ levels. If the range of the quantizer is A = 2||s||_∞, then the quantizer function can be described as

\begin{array}{l} Q (s) = (⌊ f_{c} (s) / Δ ⌋ + \frac{LD}{2}) mod L, \end{array}

(44)

where

\begin{array}{l} f_{c} (s) & = \{\begin{matrix} s & if | s | \leq β A \\ sign (s) β A & otherwise, \end{matrix} \end{array}

(45)

\begin{array}{l} Δ & = \frac{β A}{LD} . \end{array}

(46)

The scaling parameter β allows for the quantizer to clip some measurements. The divisor D is a parameter that allows binning of the measurements to improve the rate performance. For D = 1, the quantizer is a standard uniform quantizer. The Δ, q _b and D parameters for each subband are transmitted along with the encoded sequence. Δ can be represented with 8 bits, q _b with 3 bits and D with 3 bits as well. As a result, the overhead, from transmitting these parameters, is negligible.
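The quantizer in (44) to (46) can be sketched in a few lines of Python. This is a literal transcription of the three equations under the assumption that L D is even (true for L a power of two); the function and argument names are illustrative and not taken from the authors' implementation.

```python
import math

def step_size(amplitude, levels, beta=1.0, divisor=1):
    """Bin width from (46): beta * A / (L * D)."""
    return beta * amplitude / (levels * divisor)

def quantize(s, amplitude, levels, beta=1.0, divisor=1):
    """Quantization index Q(s) from (44), with the clipping of (45).

    amplitude is the range A, levels is L = 2**q_b and divisor is D.
    """
    delta = step_size(amplitude, levels, beta, divisor)
    limit = beta * amplitude
    clipped = max(-limit, min(limit, s))       # f_c(s) in (45)
    shifted = math.floor(clipped / delta) + (levels * divisor) // 2
    return shifted % levels                    # wrap-around implements binning
```

For D = 1 and values inside the quantizer range, the indices are those of an ordinary uniform quantizer. For D > 1, values that lie L bins apart share the same index, which is the ambiguity the decoder must resolve when binning is used.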

5.4 Bit allocation

The rate estimation algorithm provides a specific bit budget, $ℜ$ , for each subband. A method is required to allocate M _b and q _b, where $q_{b} M_{b} = ℜ$ . The quantizer can also employ binning to increase the effective number of bits per symbol. Employing binning will create the possibility of a decoding failure. We discuss binning here to show that the real field coding method can be adapted to include binning (discontiguous quantization) if desired. If D is the binning divisor, then an extra q _D = log2 D bits are effectively gained. D should be chosen to be as large as possible while keeping the probability of a bin error at the decoder to a minimum. The maximum value for D is thus related to the noise variance and the quantizer range. Let Δ _B = A β/D be the size of the bin. If $σ_{x}^{2}$ is the variance of the signal x and $σ_{s}^{2}$ is the variance of the parity signal s then

\begin{array}{l} σ_{s}^{2} = σ_{x}^{2} \frac{N}{M_{b}}, \end{array}

(47)

by design of the parity check matrix. Similarly, if $σ_{e}^{2}$ is the variance of the correlation noise, and z = H e, then the variance of the noise on the parity sequence, $σ_{z}^{2}$ , is

\begin{array}{l} σ_{z}^{2} = σ_{e}^{2} \frac{N}{M_{b}} . \end{array}

(48)

The bin size can be described as

\begin{array}{l} Δ_{B} & = \frac{c σ_{s}}{D} \end{array}

(49)

\begin{array}{l} = \frac{c σ_{x}}{D} \sqrt{\frac{N}{M_{b}}}, \end{array}

(50)

where c is a constant such that c σ _s equals the range of the quantizer. The probability that the noise on the encoded signal, z, will yield a bin error can be approximated and upper bounded by ε :

\begin{array}{l} ε & > P (bin error) \end{array}

(51)

\begin{array}{l} = P (| z | > \frac{Δ_{B}}{2}) \end{array}

(52)

\begin{array}{l} = exp (- α \frac{Δ_{B}}{2}) \end{array}

(53)

\begin{array}{l} = exp (- α \frac{c σ_{x}}{2 D} \sqrt{\frac{N}{M_{b}}}) . \end{array}

(54)

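Here $α = \sqrt{2} / σ_{z}$ is the Laplace parameter of the noise z, with $σ_{z}$ given by (48). Substituting this value and solving for D, with $SNR = σ_{x}^{2} / σ_{e}^{2}$ , gives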

\begin{array}{l} D & < - \frac{c σ_{x}}{\sqrt{2} ln (ε) σ_{e} \sqrt{\frac{N}{M_{b}}}} \sqrt{\frac{N}{M_{b}}} \end{array}

(55)

\begin{array}{l} = \frac{c}{\sqrt{2} ln (1 / ε)} \sqrt{SNR} . \end{array}

(56)

For a typical choice of c = 6 and ε= 0.05, this yields

\begin{array}{l} D < 1.42 \sqrt{SNR} . \end{array}

(57)
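The bound in (56) is straightforward to evaluate numerically. The short Python sketch below does so; the function name and default arguments are illustrative, and SNR is taken to be the linear (not dB) ratio of signal variance to correlation noise variance.

```python
import math

def max_binning_divisor(snr, c=6.0, eps=0.05):
    """Upper bound on the binning divisor D from (56).

    snr is the linear ratio sigma_x^2 / sigma_e^2, c scales the quantizer
    range and eps is the tolerated bin error probability.
    """
    return c / (math.sqrt(2.0) * math.log(1.0 / eps)) * math.sqrt(snr)

def extra_bits(divisor):
    """Bits effectively gained by binning, q_D = log2(D)."""
    return math.log2(divisor)
```

For SNR = 1 the function reproduces the constant 1.42 in (57) to two decimal places, so binning with D = 2 only becomes admissible under this bound once the SNR is roughly 2 or larger.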

The bit allocation should be chosen to minimise the expected distortion. In general, for a bounded quantization cell, the area in the cell and the resultant distortion can be halved by adding an additional quantization bit per symbol or by doubling the number of constraints (M _b). This would indicate that the number of quantization bits per symbol (q _b) should be maximised. However, this is only true once the quantization region is bounded. This indicates that the allocation decision function should be piecewise defined. The number of constraints should be increased until the region is bounded at which point the number of bits per constraint should be increased.

Since we are aiming for a specific distortion level, we can estimate the number of errors larger than the allowable distortion. Let this number be K :

\begin{array}{l} K = NP (e > ε_{aim}) . \end{array}

(58)

If we treat the dimensionality of the error signal as K, then we need at least K dimensions to create a bounded quantization region for the error. We thus set the minimum value for M _b and the maximum value of q _b as follows:

\begin{array}{l} M_{b} & \geq K + 1 \end{array}

(59)

\begin{array}{l} q_{b} & \leq \frac{ℜ}{K + 1} . \end{array}

(60)

This is only an approximation as the noise model at the encoder is not exact and the decoder is suboptimal. M _b should be larger to improve the performance if the noise is underestimated. Our approach is to calculate M _b and q _b as above. With the effective number of bits defined as q _eff = q _b + q _D, we then limit q _b as

\begin{array}{l} q_{b} \leq q_{aim} + 2 - q_{D} . \end{array}

(61)

We also require that q _eff ≥ 2. After adjusting q _b to meet the above criteria, M _b is recalculated.
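One possible reading of this allocation procedure is sketched below in Python. Several details are assumptions rather than statements from the text: the correlation noise is taken to be Laplace so that P(|e| > ε_aim) = exp(-α ε_aim), q _aim is treated as a given input, fractional values are rounded conservatively, and M _b is capped at N. The function and argument names are hypothetical.

```python
import math

def allocate(budget, n, alpha_e, eps_aim, q_aim, q_d=0):
    """Split a subband bit budget into M_b symbols of q_b bits each.

    budget: bit budget for the subband; n: subband length N;
    alpha_e: assumed Laplace parameter of the correlation noise;
    eps_aim: allowable distortion; q_d: bits gained by binning (log2 D).
    """
    k = n * math.exp(-alpha_e * eps_aim)        # (58), Laplace assumption
    m_min = math.ceil(k) + 1                    # (59): M_b >= K + 1
    q_b = max(1, budget // m_min)               # (60): q_b <= budget / (K + 1)
    q_b = min(q_b, q_aim + 2 - q_d)             # (61)
    q_b = max(q_b, 2 - q_d, 1)                  # enforce q_eff >= 2
    m_b = min(n, budget // q_b)                 # recalculate M_b (cap is an assumption)
    return m_b, q_b
```

For example, `allocate(1200, 1584, 0.5, 4.0, 4)` first finds a minimum of 216 constraints, which allows at most 5 bits per symbol, and then spends the whole budget on 240 symbols of 5 bits.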

6 Decoder

6.1 Belief propagation over real fields

Belief propagation (BP) is a decoding algorithm that calculates an estimate of a marginal probability by exploiting factorization to reduce computational complexity. BP can be performed over any arbitrary field. The update equations require the computation of the product and the convolution of the marginal probability density functions (pdfs) being passed in the graph. When the coding is performed over the binary field, the pdf becomes a probability mass function (PMF) and can be represented using a single number. In the case of the real field, a full pdf is required. Calculating the convolution of a set of arbitrary functions is difficult. Thus, we use a version of relaxed or Gaussian BP [36], which assumes that the messages internal to the graph are Gaussian distributed. This allows for simple closed-form solutions of the update equations that rely only on the mean and the variance of the messages, reducing computational complexity. It also means that only two values need to be passed along each edge of the graph, reducing memory requirements.

Let μ _ij be the message from variable node i to check node j and v _ji be the message from check node j to variable node i. Let f _X(x _i) be the prior information of variable node i, and f _Y|X(y _i|x _i) be the multi-hypothesis side information. The standard update equation at the variable node is

\begin{array}{l} μ_{ij} (x_{i}) \propto f_{X} (x_{i}) f_{Y | X} (y_{i} | x_{i}) \prod_{k, k \neq j} v_{ki} (\frac{1}{H_{ji}} x_{i}), \end{array}

(62)

while the update at the check node is

\begin{array}{l} v_{ji} (x_{i}) \propto f_{S} (s_{j}) \underset{k, k \neq i}{\oplus} μ_{kj} (H_{jk} x_{k}), \end{array}

(63)

where $\oplus$ indicates convolution, H _ji is the connection weight from the parity check matrix, and f _S(s _j) is the pdf of the j th parity symbol. As mentioned, due to computational complexity, we make the simplifying assumption that these messages are Gaussian distributed. Thus, each message is represented with only a mean and a variance:

\begin{array}{l} μ_{ij}^{G} (x_{i}) & = (E [μ_{ij} (x_{i})], V [μ_{ij} (x_{i})]) \end{array}

(64)

\begin{array}{l} v_{ji}^{G} (x_{i}) & = (E [v_{ji} (x_{i})], V [v_{ji} (x_{i})]), \end{array}

(65)

where (64) and (65) can be calculated using the erf() function. As a result of the Gaussian assumption for the marginals, the update equations are simplified. A product of Gaussian pdfs is a Gaussian pdf:

\begin{array}{l} \prod_{i = 1}^{I} G (μ_{i}, σ_{i}^{2}) & = SG (μ_{p}, σ_{p}^{2}) \end{array}

(66)

\begin{array}{l} σ_{p}^{2} & = {(\sum_{i = 1}^{I} \frac{1}{σ_{i}^{2}})}^{- 1} \end{array}

(67)

\begin{array}{l} μ_{p} & = \frac{\sum_{i = 1}^{I} μ_{i} σ_{i}^{- 2}}{\sum_{i = 1}^{I} σ_{i}^{- 2}}, \end{array}

(68)

where S is an irrelevant scaling factor, and I is the number of pdfs in the product. The convolution of Gaussian pdfs is also Gaussian:

\begin{array}{l} \oplus_{i = 1}^{I} G (μ_{i}, σ_{i}^{2}) & = G (μ_{c}, σ_{c}^{2}) \end{array}

(69)

\begin{array}{l} μ_{c} & = \sum_{i = 1}^{I} μ_{i} \end{array}

(70)

\begin{array}{l} σ_{c}^{2} & = \sum_{i = 1}^{I} σ_{i}^{2} . \end{array}

(71)

Similarly, the connection weights are easily taken into account:

\begin{array}{l} μ_{ij}^{G} (H_{ji} x_{i}) & = (H_{ji} E [μ_{ij} (x_{i})], H_{ji}^{2} V [μ_{ij} (x_{i})]) \end{array}

(72)

\begin{array}{l} v_{ji}^{G} (\frac{1}{H_{ji}} x_{i}) & = (\frac{1}{H_{ji}} E [v_{ji} (x_{i})], \frac{1}{H_{ji}^{2}} V [v_{ji} (x_{i})]) . \end{array}

(73)
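The identities (66) to (73) reduce every message operation to arithmetic on (mean, variance) pairs. A minimal Python sketch of these primitives follows; the function names are illustrative, and the sketch covers only the Gaussian algebra, not the full update schedule of the decoder.

```python
def gaussian_product(messages):
    """Mean and variance of a product of Gaussian pdfs, (66)-(68).

    messages is a list of (mean, variance) pairs; the scale factor S is dropped.
    """
    precision = sum(1.0 / var for _, var in messages)
    mean = sum(mu / var for mu, var in messages) / precision
    return mean, 1.0 / precision

def gaussian_convolution(messages):
    """Mean and variance of a convolution of Gaussian pdfs, (69)-(71)."""
    return (sum(mu for mu, _ in messages), sum(var for _, var in messages))

def scale_message(message, weight):
    """Apply a connection weight as in (72); use 1 / H_ji for (73)."""
    mu, var = message
    return weight * mu, weight * weight * var
```

Note that the product always reduces the variance (precisions add), whereas the convolution always increases it (variances add), which is why messages sharpen at the variable nodes and spread out at the check nodes.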

To help with convergence, a damping parameter is added at the variable node. If we are calculating the update for iteration t + 1, $μ_{ij}^{G} {(x_{i})}^{t + 1}$ , then we combine the result of (64), represented by $μ_{ij}^{G} (x_{i})$ , with the output of the previous iteration $μ_{ij}^{G} {(x_{i})}^{t}$ :

\begin{array}{l} μ_{ij}^{G} {(x_{i})}^{t + 1} = β μ_{ij}^{G} {(x_{i})}^{t} + (1 - β) μ_{ij}^{G} (x_{i}), \end{array}

(74)

where β = 0.7 was found to yield good results. When employing binning in the quantizer, the pdf of the parity symbols, f _S(s _j), consists of disjoint sections. If we simply calculate the outputs of the check nodes as in (65), we might end up with inconsistent updates. To solve this, we first calculate the most likely bin at the check node. First, we estimate the value of the syndrome symbol from the incoming messages:

\begin{array}{l} ŝ_{j} = E [\underset{k}{\oplus} μ_{kj} (H_{jk} x_{k})] . \end{array}

(75)

Then, we select the bin with the highest probability given $\hat{s_{j}}$ :

\begin{array}{l} s^{*} & = arg max_{l} P (s_{l} | ŝ_{j}) \end{array}

(76)

\begin{array}{l} = arg min_{l} | s_{l} - ŝ_{j} |, \end{array}

(77)

where s _l, l ∈ {1,…,D} are the centres of the D bins corresponding to the received $s_{j}^{q}$ . Now, f _S(s _j) is a uniform pdf over the range corresponding to the most likely bin.
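The bin selection in (75) to (77) can be illustrated as follows. The syndrome estimate uses the fact that the mean of a convolution of scaled Gaussian messages is the sum of the scaled means. The expression for the candidate bin centres is an assumption derived from the quantizer definition (44), with L D assumed even; the function names are hypothetical.

```python
def syndrome_estimate(weights, means):
    """Estimate of the syndrome symbol, (75): sum of H_jk * E[mu_kj]."""
    return sum(h * mu for h, mu in zip(weights, means))

def bin_centres(index, levels, divisor, delta):
    """Centres of the D intervals that map to the received index under (44)."""
    base = index - (levels * divisor) // 2
    return [(base + m * levels + 0.5) * delta for m in range(divisor)]

def most_likely_bin(s_hat, centres):
    """Centre closest to the estimate, (77)."""
    return min(centres, key=lambda centre: abs(centre - s_hat))
```

Once the centre has been chosen, the check node update proceeds with a uniform pdf of width Δ around that centre in place of the disjoint f _S(s _j).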

6.2 Side information creation

6.2.1 Initial decoding

A multi-hypothesis SI creation method using four hypotheses as developed in [8] is used. The hypotheses are the two reference frames, $\hat{f} [0]$ and $\hat{f} [N_{G}]$ , an MCI-based frame using small blocks, ${\hat{f}}_{S}$ , as well as an MCI-based frame using big blocks, ${\hat{f}}_{B}$ . The method used to perform MCI is a variant of the one described in [37]. First, the key frames are mean filtered. The filtered key frames are then used for bidirectional pixel level accuracy block matching (BM) using large blocks and the modified sum of absolute difference (SAD) metric in [37]. These vectors are used as starting points for a BM algorithm using smaller blocks with a fixed search range. The motion vector (MV) field generated with large blocks is quadrupled in density (oversampled by a factor of two in both directions), before being median filtered. By increasing the density, we allow the median filter to produce smoother edges and more precise vectors. The MV field generated with smaller blocks is also quadrupled in density before being median filtered. These two MV fields are then used to create two MCI hypotheses by interpolating from the original key frames.

6.2.2 Successive refinement

After BP decoding, the subbands are recombined and a pixel domain version of the WZ frame is produced. Using this decoded frame, the motion estimation process is repeated to produce better quality side information. This is similar to [24–27] where successive refinement was used to improve reconstruction distortion. Due to the use of real field coding, the effect is slightly different from conventional DVC as improved side information can result in decreased distortion from the GBP decoding process, whereas for finite field coding, there can be no further gain (other than improved reconstruction) after successful decoding of the bit planes.

In order to use the decoded frame, the motion estimation method needs to be adjusted. Three different hypothesis frames are created. The first method performs pixel level accuracy bilinear BM motion estimation (ME) using the decoded frame as a hash (Figure 3) to produce ${\hat{f}}_{I}$ . The matching metric used during ME is

\begin{array}{l} M_{I} (L, M, R) = (1 - 2 λ_{M}) \sum | L - R | & + λ_{M} \sum | L - M | \\ + λ_{M} \sum | R - M |, \end{array}

(78)

where L,M and R represent the subblocks in the left key frame, the reconstructed frame and the right key frame, respectively. λ _M is a scaling factor that may be optimised. For this paper we used λ _M = 0.25.
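For a single candidate motion vector, the metric in (78) can be computed as in the following Python sketch. Blocks are represented as nested lists of pixel values purely for illustration, and the motion penalising term mentioned later in this section is not included.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q)
               for row_a, row_b in zip(block_a, block_b)
               for p, q in zip(row_a, row_b))

def metric_interp(left, mid, right, lam=0.25):
    """Matching metric (78) for the subblocks L, M and R."""
    return ((1.0 - 2.0 * lam) * sad(left, right)
            + lam * sad(left, mid)
            + lam * sad(right, mid))
```

With λ _M = 0.25, half of the weight goes to the agreement between the two key frame blocks and a quarter each to their agreement with the decoded frame, so the decoded frame acts as a hash that guides, but does not dominate, the search.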

The other two hypotheses, ${\hat{f}}_{L_{r}}$ and ${\hat{f}}_{R_{r}}$ , are created by matching the key frames individually with the reconstructed frame (Figure 4). This allows non-linear motion to be tracked across the GOP.

For these hypotheses, a simple SAD metric is used:

\begin{array}{l} M_{L_{r}} (L, M) & = \sum | L - M | \end{array}

(79)

\begin{array}{l} M_{R_{r}} (R, M) & = \sum | R - M | . \end{array}

(80)

The metrics are all adjusted by including a motion penalising term [37], which helps to ensure less noisy motion vectors.

6.3 Correlation noise modeling

The system uses a coefficient level noise model [23] which models the correlation noise for a single coefficient of a subband of a hypothesis as being Laplace distributed. The method for calculating this noise parameter (α) requires an estimate of the error in the hypothesis. This error is termed as the residual frame (r). For the multi-hypothesis system, a residual frame estimate is required for each hypothesis.

6.3.1 Residual estimation for the first decoding

For the MCI hypotheses, ${\hat{f}}_{S}$ and ${\hat{f}}_{B}$ , the residual estimates, ${\hat{r}}_{S}$ and ${\hat{r}}_{B}$ respectively, are the difference between the motion-compensated frames used to perform the interpolation. For frame t in the GOP, this is given by

\begin{array}{l} {\hat{r}}_{B} & = | {\hat{f}}_{L^{*}}^{B} - {\hat{f}}_{R^{*}}^{B} | \end{array}

(81)

\begin{array}{l} {\hat{r}}_{S} & = | {\hat{f}}_{L^{*}}^{S} - {\hat{f}}_{R^{*}}^{S} | . \end{array}

(82)

Here, ${\hat{f}}_{L^{*}}^{B}$ and ${\hat{f}}_{R^{*}}^{B}$ indicate the motion-compensated reference frames used to interpolate ${\hat{f}}_{B}$ , while ${\hat{f}}_{L^{*}}^{S}$ and ${\hat{f}}_{R^{*}}^{S}$ indicate the motion-compensated reference frames used to interpolate ${\hat{f}}_{S}$ . For the reference frame hypotheses, the residual estimates ( ${\hat{r}}_{L}$ and ${\hat{r}}_{R}$ ) are simply the difference between the two reference frames:

\begin{array}{l} {\hat{r}}_{L} & = \hat{f} [0] - \hat{f} [N_{G}] \end{array}

(83)

\begin{array}{l} {\hat{r}}_{R} & = \hat{f} [0] - \hat{f} [N_{G}] . \end{array}

(84)

The correlation noise for each hypothesis is estimated using the method in [23], and motion-adaptive scaling factors are applied as in [8]. All the hypotheses are then combined at a coefficient level using Bayesian fusion:

\begin{array}{l} f_{Y | X} (y | x) = \prod_{h = 1}^{H} f_{Y_{h} | X} (y_{h} | x), \end{array}

(85)

before being passed to the GBP algorithm for decoding.
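The fusion rule (85) can be illustrated for a single coefficient with Laplace correlation noise on each hypothesis. In the Python sketch below, the grid-based computation of the mean and variance is only a simple stand-in for the closed-form evaluation used by the decoder, the prior f _X is omitted, and all names are illustrative.

```python
import math

def laplace_pdf(y, x, alpha):
    """Laplace likelihood of side information value y given x."""
    return 0.5 * alpha * math.exp(-alpha * abs(y - x))

def fused_likelihood(x, hypotheses):
    """Product over hypotheses (y_h, alpha_h), as in (85)."""
    result = 1.0
    for y_h, alpha_h in hypotheses:
        result *= laplace_pdf(y_h, x, alpha_h)
    return result

def mean_and_variance(hypotheses, lo, hi, steps=2001):
    """Moments of the normalised fused likelihood on a uniform grid."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    ws = [fused_likelihood(x, hypotheses) for x in xs]
    total = sum(ws)
    mean = sum(w * x for w, x in zip(ws, xs)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(ws, xs)) / total
    return mean, var
```

A hypothesis with a large α (low estimated noise) dominates the product near its own value, so unreliable hypotheses are down-weighted automatically without an explicit selection step.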

6.3.2 Residual estimation for later iterations

After the frame has been decoded the first time, we calculate new side information frames using the methods discussed in the previous section. If ${\hat{f}}_{I}$ , ${\hat{f}}_{L_{r}}$ and ${\hat{f}}_{R_{r}}$ are the three hypotheses, we estimate a residual frame for each as:

\begin{array}{l} {\hat{r}}_{I} & = (1 - λ_{I}) | {\hat{f}}_{L^{*}} - {\hat{f}}_{R^{*}} | + λ_{I} | {\hat{f}}_{I} - M | \end{array}

(86)

\begin{array}{l} {\hat{r}}_{L_{r}} & = (1 - λ_{L_{r}}) | {\hat{f}}_{L_{r}} - {\hat{f}}_{R_{r}} | + λ_{L_{r}} | {\hat{f}}_{L_{r}} - M | \end{array}

(87)

\begin{array}{l} {\hat{r}}_{R_{r}} & = (1 - λ_{R_{r}}) | {\hat{f}}_{L_{r}} - {\hat{f}}_{R_{r}} | + λ_{R_{r}} | {\hat{f}}_{R_{r}} - M | . \end{array}

(88)

Here, ${\hat{f}}_{L^{*}}$ and ${\hat{f}}_{R^{*}}$ indicate the motion-compensated key frames used to create ${\hat{f}}_{I}$ , M is the previous decoded frame, and the λ s are tunable parameters. While they may be optimised in the future, we used $λ_{I} = λ_{L_{r}} = λ_{R_{r}} = 0.5$ .

From the residual estimate, we calculate a correlation noise estimate for each hypothesis. A method similar to [23] is used, but we include the output variance from the belief propagation decoder in calculating the value of the Laplace parameter α. The goal is to identify the areas where the decoder may have been in error, or where further refinement should occur. If $σ_{b}^{2 (d)} (u, v)$ is the output variance from the GBP decoder, the new equation is:

\begin{array}{l} D_{b} (u, v) = | {\hat{R}}_{b} (u, v) - μ_{b} | \end{array}

(89)

\begin{array}{l} {\hat{α}}_{b} (u, v) = \{\begin{array}{l} \sqrt{\frac{2}{λ {\hat{σ}}_{b}^{2} + (1 - λ) σ_{b}^{2 (d)} (u, v)}}, & {[D_{b} (u, v)]}^{2} \leq {\hat{σ}}_{b}^{2} \\ \sqrt{\frac{2}{λ {[D_{b} (u, v)]}^{2} + (1 - λ) σ_{b}^{2 (d)} (u, v)}}, & {[D_{b} (u, v)]}^{2} > {\hat{σ}}_{b}^{2} \end{array}, \end{array}

(90)

where λ is a tunable parameter that we set to 0.5.
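Equations (89) and (90) amount to the following computation per coefficient. The Python sketch uses hypothetical argument names; the two cases of (90) are merged by taking the larger of the squared distance and the band variance, which is equivalent to the piecewise definition.

```python
import math

def refined_alpha(residual, mu_b, var_band, var_decoder, lam=0.5):
    """Laplace parameter estimate from (89) and (90).

    residual: transformed residual coefficient R_b(u, v); mu_b: band mean;
    var_band: band variance estimate; var_decoder: output variance of the
    GBP decoder for this coefficient. Assumes the denominator is non-zero.
    """
    distance = abs(residual - mu_b)                    # (89)
    spread = max(distance * distance, var_band)        # case selection in (90)
    return math.sqrt(2.0 / (lam * spread + (1.0 - lam) * var_decoder))
```

A large decoder output variance lowers α for that coefficient, which widens the assumed noise distribution exactly where the previous decoding pass was least certain.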

7 Complexity analysis

In this section, we will discuss the complexity of the proposed system relative to conventional DVC as well as H.264 intra coding.

7.1 Encoder complexity

The overall complexity is a function of the GOP size, since it will be a combination of the key frame encoding complexity as well as the WZ encoding complexity. Initially, we consider only the complexity of the WZ frames, since the key frames have the same complexity in H.264 (intra) and in conventional DVC methods.

The complexity of the encoder can be described as the sum of the complexity of the subblocks. The main points where differences between the proposed method and conventional DVC may occur are the conditional entropy estimation block, the quantizer and the real field encoder. We do not consider the complexity of motion estimation at the encoder and do not compare against systems that employ motion estimation. The complexity of the DCT operation will be the same for all transform domain systems.

Real field encoding will have the same number of operations as finite field coding (assuming the same code size), but the type of operation will differ: real field (floating point) multiplication and addition versus Galois field multiplication and addition. The complexity of the different operations will be platform and implementation specific. Typically, one would expect finite field operations to be faster than their real field equivalents; however, many modern processors are optimised to perform floating point operations and can perform them at high speed.

A similar point can be made for bit-plane-based systems with binary codes. These codes will have a larger number of operations compared to symbol-based coding, but the operations will be simple binary XOR and AND operations which are much faster.

The proposed conditional entropy estimator is faster than the conventional method (see Section 5.2), O(1) vs. O(q _b N) per subband, but the complexity of creating the side information estimate is also at least O(N), so this does not represent a very large saving. The quantizers will also be similar in complexity though the proposed system only quantizes (M _b < N) values compared to the N values of a conventional system. Overall, we expect the encoding complexity of the proposed system to be similar to that of conventional DVC without motion estimation at the encoder. Exact comparisons will be implementation specific.

To experimentally evaluate the complexity we performed execution-time experiments. Simulation-based comparisons are difficult and always imperfect, but may still provide some insight into the expected performance of the systems. For these simulations we used the reference implementation of the H.264 intra codec. The proposed codec was implemented in C++. We also compare with the DISCOVER codec.

All the simulations were performed on a computer with a 2.93 GHz Intel Core 2 Duo processor (Intel, Santa Clara, CA, USA) and 2 GB of RAM running Ubuntu Linux 12.04 LTS. The results can be seen in Table 1. The encoder run time of DISCOVER was analysed in [38]. Without access to the source code, run-time results for the DISCOVER encoder could not be accurately generated. To facilitate a comparison, the run time of DISCOVER (on this computer) was predicted. This was done by calculating the ratio of the per-frame run time of DISCOVER to H.264 (intra) from results presented in [38]. The run time of H.264 (intra), as produced in this simulation, was then multiplied by the ratio to predict the run time of DISCOVER. While this is not ideal, it still provides a good idea of the relative performance of the different systems.

Table 1 Encoder run-time simulation


The table shows the average encoding time for a WZ frame as measured when N _G = 2. The overall encoding time for any GOP size can be estimated by combining the running time for the key frames and the WZ frames. From the table, one can see that the proposed system has a reduced run time when measured against the H.264 intra codec. The running time for the proposed system increases as the rate increases (larger RD points), since more subbands are encoded. A rate increase for a given subband also increases the size of the encoding matrix, which leads to a run-time increase.

The proposed system has approximately the same run time as the DISCOVER codec in the low-rate region. However, for the higher RD points, the proposed system has a longer run time than the DISCOVER codec. It should be mentioned that no attempt was made to optimise the encoder implementation, and it is expected that the run time can be significantly reduced in the future.

7.2 Decoder complexity

Decoding complexity is generally not considered a limitation for DVC and is not often considered when evaluating the performance of DVC systems. However, we will briefly discuss the effect of real field coding on the decoder complexity. The reduced complexity GBP algorithm used for decoding is similar to a binary BP decoding algorithm in terms of memory requirements and computational complexity. In GBP, each edge requires two values to be transmitted, while for binary BP one value is required. The update rules are also simple equations that scale with the check and variable node degree distributions. However, binary-coded systems decode each bit plane independently and thus have to perform more than one decoding for each subband. The GBP also converges in fewer than ten iterations, which is fewer than most binary BP algorithms require. GBP is thus expected to compare favourably to binary BP in terms of decoding speed. Non-binary BP is typically slower than binary BP and requires more memory, since each edge must transmit the entire PMF (L values). The update equations at the check node are also slow. Even the faster fast Fourier transform belief propagation (FFT-BP) algorithm still requires two FFT operations for every edge. Typically, FFT-BP is considered to be O(q _b) times slower than binary BP. There are lower complexity non-binary BP algorithms in the literature, but discussing their relative merits is beyond the scope of this article. In general, GBP is expected to be faster than non-binary BP.

We include some execution-time experimental results showing the decoding time in Table 2. These results were produced under the same conditions as for the encoder. In contrast to the encoder, a publicly available executable was used to produce the run-time results for the decoder of the DISCOVER codec.

Table 2 Decoder run-time simulation


The table shows the average decoding time for a WZ frame as measured when N _G = 2. From the table one can see that the proposed system runs faster than the DISCOVER codec in all simulations. While this is not proof that the system will always be faster, it does indicate that real field decoding does not represent a complexity increase.

8 Simulation results and discussion

The system presented in this paper is tested on four standard test sequences: foreman, coastguard, soccer and hall monitor. The sequences are all QCIF format with a frame rate of 15 Hz. All the rate-distortion curves show results for the average bit rate and the distortion as calculated over all the frames in the entire sequence (key and WZ frames). All sequences consist of 150 frames.

The system will be compared against the H.264 intra codec, the DISCOVER codec [1], which is a feedback-based benchmark system in DVC literature, and a system previously developed by the authors [8], adapted for this paper, which will be called the ‘FF system’ in this section. All the results for the proposed system were obtained without binning, thus D=1, to ensure that there are no decoding failures.

The finite field (FF) system is the same as in [8] and uses the same side information as the proposed system but uses non-binary codes defined over Galois fields. The modification for this paper is that it also uses the same estimated rate without feedback. The purpose of using the FF system is to highlight the effect of using real field coding while keeping most other aspects, such as the SI, constant. Comparisons with competing systems, while interesting, are more difficult to interpret. Key frames were encoded using the reference implementation of the intra mode of the H.264 codec. The QP parameters used are similar to those used by the DISCOVER codec and are provided in Table 3.

Table 3 Key frame quantization parameters for the RD points


The performance results for the foreman, coastguard, soccer and hall monitor sequences can be seen in Figures 5, 6, 7 and 8, respectively. The Bjontegaard differential rate and peak signal-to-noise ratio (PSNR) metrics [39, 40] were calculated for the proposed system, the FF system and the DISCOVER codec as measured against the H.264 (intra) results. The results can be seen in Table 4. In the table, ‘DSNR’ refers to the differential PSNR, and ‘Rate’ refers to the percentage difference in rate. The ‘Type’ column indicates whether the full range of RD points was considered or only the mid range.

Table 4 Bjontegaard metric results


The same trends are apparent in all the sequences. The proposed system performs comparably with the DISCOVER codec in the low-rate regime, but then deviates and performs less well in the high-rate regime. The FF-based system and the proposed system perform similarly in the coastguard sequence for the GOP size equal to 2 and 4. In all other cases, the proposed system has better performance. The performance gain when the GOP size is equal to 2 is small, but increases at higher rates. The performance gain is also larger for GOP sizes equal to 4 and 8. In general, the performance loss of the proposed system as measured against the DISCOVER codec increases with the GOP size.

In order to objectively analyse the perceptual quality of the proposed system, we include two sets of results. First, we include the performance of the proposed system and the FF system on all three sequences as measured using the structural similarity (SSIM) metric in Figure 9, as it is considered to measure perceptual quality better than the PSNR metric for a given frame. From the figure, one can see that the performance is similar for all three sequences when the GOP size is equal to 2. However, when the GOP size increases to 4, the proposed system performs better.

Secondly, to evaluate the perceptual quality of the entire sequence, we analyse the variance of the PSNR for each sequence. Figure 10 shows the standard deviation of the PSNR for the two systems over all profiles and all test sequences for a GOP size of 2. From the figure, it can be seen that in all cases, the standard deviation of the proposed system is less than that of the FF system. Figures 11 and 12 show the standard deviation of the PSNR for GOP sizes of 4 and 8 respectively. A similar trend is visible for these cases as well. A large variance in distortion results in unpleasant flicker in the video sequence. While the proposed system and the FF system have similar average R-D performance for GOP equal to 2, subjectively, the quality of the FF system appears worse because of image flickering. The proposed system on the other hand has a gentler degradation in image quality resulting in less flickering.
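The flicker measure used here can be computed from per-frame distortion values as in the Python sketch below. It assumes 8-bit video (peak value 255), non-zero per-frame mean squared errors and the population standard deviation; whether the authors used the population or the sample form is not stated, and the function names are illustrative.

```python
import math
import statistics

def psnr(mse, peak=255.0):
    """PSNR in dB of one frame from its mean squared error (mse > 0)."""
    return 10.0 * math.log10(peak * peak / mse)

def psnr_spread(frame_mses):
    """Mean and standard deviation of the per-frame PSNR of a sequence."""
    values = [psnr(mse) for mse in frame_mses]
    return statistics.mean(values), statistics.pstdev(values)
```

Two sequences can have the same average PSNR while differing strongly in the second value returned, which is the situation described above for a GOP size of 2.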

As an example, Figure 13 shows the same frame coded at the same rate by the two systems. The estimated rate is obviously lower than required. As a result, the quality in the FF system degraded significantly while the visual quality in the proposed system degraded much less. As a caveat, no attempt was made in the FF system to detect and ameliorate decoding failures. In cases where the estimated rate was high enough, the FF system decoded correctly and produced a higher quality frame than the proposed system.

Though the results are not plotted, the systems presented in [16] and [17] both have better average R-D performance than the proposed system and achieve similar average R-D performance as the DISCOVER codec. Using low complexity motion estimation at the encoder, they are able to achieve accurate rate estimation. The addition of hashes further improves the performance of these systems. The exact complexity and run-time losses incurred by the motion estimation are not available. The SSIM and PSNR variance results for these systems are also not available.

For the proposed system, it is expected that employing a similar low complexity motion estimation algorithm at the encoder will improve the accuracy of the rate estimation and by extension the average R-D performance of the proposed system as well. Furthermore, accurate rate estimation will allow for binning as described in Section 5.4, which can further improve the average rate-distortion performance. Hashing and more advanced SI creation methods could also be employed.

9 Conclusion

This paper proposed a new approach to feedback suppression in DVC systems that relies on codes defined over the real field. Despite the removal of the feedback path, the encoder complexity was not significantly increased, since no motion estimation was performed at the encoder. The system showed average R-D performance comparable to that of a feedback-based system at low rates. At high rates, there was a reduction in performance compared to feedback-based systems. However, compared to conventional finite-field-based feedback-free systems, the variance in the distortion was reduced. This resulted in improved perceptual visual quality.

References

Artigas X, Ascenso J, Dalai M, Klomp S, Ouaret M: The DISCOVER codec: architecture, techniques and evaluation. Picture Coding Symposium (PCS) (Lisbon, 7–9 November 2007)
Slowack J, Skorupa J, Deligiannis N, Lambert P, Munteanu A, Van De Walle R: Distributed video coding with feedback channel constraints. Circuits Syst. Video Technol., IEEE Trans 2012, 22(7):1014-1026.
Wyner A, Ziv J: The rate-distortion function for source coding with side information at the decoder. IEEE Trans. Inf. Theory 1976, 22: 1-10. 10.1109/TIT.1976.1055508
Slepian D, Wolf J: Noiseless coding of correlated information sources. IEEE Trans. Inf. Theory 1973, 19(4):471-480. 10.1109/TIT.1973.1055037
Aaron A, Zhang R, Girod B: Wyner-Ziv coding of motion video. Conference Record of the Thirty-Sixth Asilomar Conference on Signals, Systems and Computers, Pacific Grove, 3–6 November 2002, Volume 1 (2002), pp 240–244
Puri R, Ramchandran K: PRISM: a new robust video coding architecture based on distributed compression principle. Annual Allerton Conference on Communication, Control, and Computing (ACCC) (Urbana, 2–4 October 2002)
Ascenso J, Brites C, Dufaux F, Fernando A, Ebrahimi T, Pereira F, Tubaro S: The VISNET II DVC codec: architecture, tools and performance. European Signal Processing Conference (EUSIPCO) (Aalborg, 23–27 August 2010)
Louw D, Kaneko H: A multi-hypothesis non-binary LDPC code based distributed video coding system. Proceedings of the 13th IASTED International Conference on Signal and Image Processing, Volume 74, Dallas, 14–16 December 2011
Louw D, Kaneko H: A system combining extrapolated and interpolated side information for single view multi-hypothesis distributed video coding. Proceedings International Symposium on Information Theory and its Applications (ISITA), Honolulu, 28–31 October 2012, pp. 779–783
Morbée M, Prades-Nebot J, Roca A, Pižurica A, Philips W: Improved pixel-based rate allocation for pixel-domain distributed video coders without feedback channel. In Proceedings of the 9th international conference on Advanced concepts for intelligent vision systems, ACIVS’07.. Springer, Berlin, Heidelberg; 2007:663-674.
Chapter Google Scholar
Martinez J, Fernandez-Escribano G, Kalva WARJWeerakkody H, Fernando WAC, Garrido A: Feedback free DVC architecture using machine learning. 15th IEEE International Conference on Image Processing, 2008. ICIP 2008, San Diego, 12–15 October 2008, pp. 1140–1143
Google Scholar
Nickaein I, Rahmati M, Ghidary S, Zohrabi A: Feedback-free and hybrid distributed video coding using neural networks. 2012 11th International Conference on Information Science, Signal Processing and their Applications (ISSPA), Montreal 3–5 July 2012, pp. 528–532
Weerakkody WARJ, Fernando WAC, Adikari ABB: Unidirectional distributed video coding for low cost video encoding. IEEE Trans. Consumer Electron 2007, 53(2):788-795.
Article Google Scholar
Deligiannis N, Munteanu A, Clerckx T, Cornelis J, Schelkens P: Overlapped block motion estimation and probabilistic compensation with application in distributed video coding. IEEE Signal Process. Lett 2009, 16(9):743-746.
Article Google Scholar
Sheng T, Zhu X, Hua G, Guo H, Zhou J, Chen C: Feedback-free rate-allocation scheme for transform domain Wyner–Ziv video coding. Multimedia Syst 2010, 16(2):127-137. 10.1007/s00530-009-0179-8
Article Google Scholar
Brites C, Pereira F: An efficient encoder rate control solution for transform domain Wyner-Ziv video coding. IEEE Trans. Circuits Syst. Video Technol 2011, 21(9):1278-1292.
Article Google Scholar
Verbist F, Deligiannis N, Satti SM, Munteanu A, Schelkens P: Iterative Wyner-Ziv decoding and successive side-information refinement in feedback channel-free hash-based distributed video coding. Proceedings of SPIE 8499, Applications of Digital Image Processing XXXV, 84990O–1–84990O–10 (2012)
Google Scholar
Goyal V, Fletcher A, Rangan S: Compressive sampling and lossy compression. Signal Process. Mag. IEEE 2008, 25(2):48-56.
Article Google Scholar
Kamilov U, Goyal V, Rangan S: Message-passing de-quantization with applications to compressed sensing. IEEE Trans. Signal Process 2012, 60(12):6270-6281.
Article MathSciNet Google Scholar
Baig Y, Lai E, Punchihewa A: Distributed video coding based on compressed sensing. Proceedings of the IEEE International Conference on Multimedia and Expo Workshops (ICMEW), Melbourne, 9–13 July 2012, pp. 325–330
Liu Z, Cheng S, Liveris A, Xiong Z: Slepian-wolf coded nested lattice quantization for Wyner-Ziv coding: high-rate performance analysis and code design. IEEE Trans. Inf. Theory 2006, 52(10):4358-4379.
Article MathSciNet MATH Google Scholar
Donoho D: Compressed sensing. IEEE Trans. Inf. Theory 2006, 52(4):1289-1306.
Article MathSciNet MATH Google Scholar
Brites C, Pereira F: Correlation noise modeling for efficient pixel and transform domain Wyner-Ziv Video coding. IEEE Trans. Circuits Syst. Video Technol 2008, 18(9):1177-1190.
Article Google Scholar
Martins R, Brites C, Ascenso J, Pereira F: Refining side information for improved transform domain Wyner-Ziv video coding. IEEE Trans. Circuits Syst. Video Technol 2009, 19(9):1327-1341.
Article Google Scholar
Ailah AAE, Petrazzuoli G, Farah J, Cagnazzo M, Pesquet-Popescu B, Dufaux F: Side information improvement in transform-domain distributed video coding. Proceedings of the SPIE - Applications of Digital Image Processing XXXVII, San Diego, 17–21 August 2012
Deligiannis N, Verbist F, Slowack J, Van De Walle R, Schelkens P, Munteanu A: Joint successive correlation estimation and side information refinement in distributed video coding. 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), Bucharest, 27–31 August 2012, pp. 569–573
Ye S, Ouaret M, Dufaux F, Ebrahimi T: Improved side information generation for distributed video coding by exploiting spatial and temporal correlations. EURASIP J. Image Video Process 2009, 2009: 683510.
Article Google Scholar
Marshall T Jr: Coding of real-number sequences for error correction: a digital signal processing problem. IEEE J. Selected Areas Commun 1984, 2(2):381-392. 10.1109/JSAC.1984.1146063
Article Google Scholar
Dimakis A, Smarandache R, Vontobel P: LDPC codes for compressed sensing. IEEE Trans. Inf. Theory 2012, 58(5):3093-3114.
Article MathSciNet Google Scholar
Song S, Zhou B, Lin S, K Abdel-Ghaffar K: A unified approach to the construction of binary and nonbinary quasi-cyclic LDPC codes based on finite fields. IEEE Trans. Commun. 2009, 57: 84-93.
Article Google Scholar
Lam E, Goodman J: A mathematical analysis of the DCT coefficient distributions for images. IEEE Trans. Image Process 2000, 9(10):1661-1666. 10.1109/83.869177
Article MATH Google Scholar
Fu C, Kim J: Encoder rate control for block-based distributed video coding. 2010 IEEE International Workshop on Multimedia Signal Processing (MMSP), Saint Malo, 4–6 October 2010, pp. 333–338
Cheng S, Xiong Z: Successive refinement for the Wyner-Ziv problem and layered code design. IEEE Trans. Signal Process 2005, 53(8):3269-3281.
Article MathSciNet Google Scholar
Yaacoub C, Farah J, B Pesquet-Popescu B: Feedback channel suppression in distributed video coding with adaptive rate allocation and quantization for multiuser applications. EURASIP J. Wireless Commun. Netw 2008., 2008: doi:10.1155/2008/427247
Google Scholar
Louw D, Kaneko H: Efficient conditional entropy estimation for distributed video coding. Proceedings of the 30th Picture Coding Symposium (PCS), San Jose, 8–11 December 2013
Rangan S: Estimation with random linear mixing, belief propagation and compressed sensing. 2010 44th Annual Conference on Information Sciences and Systems (CISS), Princeton, 17–19 March 2010, pp. 1–6
Ascenso J, Brites C, Pereira F: Content adaptive Wyner-ZIV video coding driven by motion activity. IEEE International Conference on Image Processing, Atlanta, 8–11 October 2006, pp. 605–608
Pereira F, Ascenso J, Brites C: Studying the GOP size impact on the performance of a feedback channel-based Wyner-Ziv video codec. In Advances in Image and Video Technology, Volume 4872 of Lecture Notes in Computer Science. Ed. by D Mery, L Rueda. Springer, Berlin, Heidelberg; 2007:801-815.
Google Scholar
Bjontegaard G: Calculation of average PSNR differences between RD-curves. the 13th VCEG-M33 Meeting, Austin, April 2001
Sullivan G, Bjontegaard G: Recommended simulation common conditions for H.26L coding efficiency experiments on low-resolution progressive-scan source material. VCEG-N81, 14th Meeting, Santa Barbara, September 2001

Download references

Author information

Authors and Affiliations

Department of Computer Science, Tokyo Institute of Technology, Tokyo, 226-8552, Japan
Daniel J Louw & Haruhiko Kaneko

Corresponding author

Correspondence to Daniel J Louw.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Louw, D.J., Kaneko, H. Suppressing feedback in a distributed video coding system by employing real field codes. EURASIP J. Adv. Signal Process. 2013, 181 (2013). https://doi.org/10.1186/1687-6180-2013-181

Received: 15 April 2013
Accepted: 19 November 2013
Published: 05 December 2013
DOI: https://doi.org/10.1186/1687-6180-2013-181
