A robust fusion method for multiview distributed video coding
EURASIP Journal on Advances in Signal Processing volume 2014, Article number: 174 (2014)
Abstract
Distributed video coding (DVC) is a coding paradigm which exploits the redundancy of the source (video) at the decoder side, as opposed to predictive coding, where the encoder leverages the redundancy. To exploit the correlation between views, multiview predictive video codecs require the encoder to have the various views available simultaneously. However, in multiview DVC (MDVC), the decoder can still exploit the redundancy between views, avoiding the need for inter-camera communication. The key element of every DVC decoder is the side information (SI), which can be generated by leveraging intra-view or inter-view redundancy for multiview video data. In this paper, a novel learning-based fusion technique is proposed, which is able to robustly fuse an inter-view SI and an intra-view (temporal) SI. An inter-view SI generation method capable of identifying occluded areas is proposed and coupled with a robust fusion system able to improve the quality of the fused SI along the decoding process, through learning from already decoded data. Here, the estimated distributions of the SIs are fused, as opposed to conventional fusion algorithms based on fusing pixel values. The proposed solution achieves gains of up to 0.9 dB in Bjøntegaard difference when compared with the best-performing (in an RD sense) single-SI DVC decoder, chosen as the better of an inter-view and a temporal SI-based decoder.
1 Introduction
Distributed video coding (DVC) [1–3] is a coding paradigm based on the theoretical results of distributed source coding (DSC): the Slepian-Wolf [4] and the Wyner-Ziv (WZ) theorems [5]. These foundations establish a different way to compress information, namely, by independently coding the source data but jointly decoding it. Thus, in DVC, the source correlation is exploited at the decoder, as opposed to the widely adopted predictive coding solutions where the encoder is responsible for exploiting all the correlation. One of the key blocks of every DVC decoder is the side information (SI) generation module, which estimates the WZ frame to be decoded. Typically, in monoview systems, the SI creation exploits the temporal redundancy by making assumptions about the apparent motion in a video stream, e.g. linear motion between reference frames [6]. Then, at the encoder, parity bits (or syndromes) are generated and transmitted to the decoder, and the use of channel decoders allows obtaining the decoded frames given the available SI. The channel decoder requires soft inputs for the source data to be decoded, which can be calculated from a correlation noise model. This correlation noise model statistically describes the relationship between the SI and the source and is obtained by computing an online residual, without using the original WZ frame.
An efficient DVC system must be able to minimize the amount of data sent from the encoder for a certain decoded quality level. Therefore, the SI is critical to the rate-distortion (RD) performance of the DVC decoder; in fact, a high-quality SI, characterized by few errors, allows transmitting less error-correcting data (requiring a lower bitrate) and improves the decoded WZ frame quality.
In monoview DVC codecs, every frame is independently coded without any reference to other decoded frames. This allows a low encoding complexity, since the complex task of exploiting the temporal correlation (using motion estimation/compensation) is performed at the decoder. When different views of the same visual scene are coded in different camera nodes, e.g. in visual sensor networks, inter-view coding can further improve the coding performance, exploiting inter-camera redundancy. If a predictive multiview video codec is used, e.g. multiview video coding (MVC) [7], inter-camera communication is needed. MVC relies on the same coding tools used in H.264/AVC: decoded frames belonging to other views are inserted in the reference picture lists and used for disparity estimation/compensation. This approach requires inter-camera communication to enable one camera to use the frames of another camera for disparity compensation.
On the other hand, in DVC solutions for the multiview scenario, each camera can independently code its frames, relying on the decoder to exploit the correlation between the views [8, 9]. Typically, the multiview DVC (MDVC) decoder tries to exploit, at the same time, temporal intra-view and inter-view correlation, generating two SI frames: (1) the temporal SI, by means of motion estimation and interpolation, e.g. employing overlapped block motion compensation (OBMC) [6], and (2) the inter-view SI, generated by leveraging the inter-view redundancy [3]. To exploit the best part of each estimated SI frame, it is necessary to fuse the frames, choosing the best regions of each estimated SI frame to create a final SI frame used for decoding [8, 9]; typically, the regions are chosen according to an estimate of their quality. SI fusion is a hard problem, and the many fusion techniques available in the literature [8] have various degrees of efficiency. The goal of an efficient frame fusion technique is to deliver an RD performance better than the best-performing single-SI decoder, out of the one using the inter-view SI and the one using the temporal SI. In general, the larger the difference in RD performance between the SIs, the harder the fusion task, because incorrectly fusing a region of the frame may lead to significant losses in RD performance.
Considering these challenges, the main contributions of this work are the following:

(1).
A novel inter-view SI generation system called overlapped block disparity compensation (OBDC) is presented. This method is able to cope with large inter-camera distances and to detect occlusions caused by parts of the scene lying outside the field of view of one camera. It is also able to adapt to unknown camera distances

(2).
The fusion of the estimated distributions of the DCT coefficients of the SI

(3).
A novel learning technique that refines the quality of the fused SI along the decoding process by exploiting already decoded data
The three items are combined in a DVC setup providing a novel learning-based MDVC scheme. The fusion of distributions is here proposed as an alternative to the pixel-level fusion of the SI frames. Using distributions to estimate the reliability of the regions of the SI allows exploiting the high-performance noise modelling algorithms developed in the literature. The learning algorithm allows correcting wrong initial estimates of the quality of the SIs, leading to superior RD performance in the subsequent steps of the decoding process.
This paper is structured as follows: Section 2 deals with related work on inter-view SI creation and on pixel- and block-based SI fusion techniques. An overview of the DVC coding process is given in Section 3. The novel fusion algorithm, as well as the SI generation method, is described in Section 4. In Section 5, the performance of the proposed tools is assessed and compared with state-of-the-art distributed coding solutions, as well as monoview predictive codecs.
2 Related work
2.1 Inter-view SI creation
Disparity compensation view prediction (DCVP) [10] is one of the simplest inter-view SI generation techniques, where the same algorithm used for temporal interpolation is applied between adjacent views to perform disparity estimation and compensation. However, the DCVP SI quality deteriorates when the distance between views increases. The majority of the studies proposed in the literature focus on closely spaced cameras; for example, the distance between the cameras in [8] is 6.5 cm, and the problem of cameras moving with respect to each other is not addressed.
A different way to address the SI generation problem was proposed in [11], where multiview motion estimation (MVME) was presented. The key idea of MVME is to estimate a single SI frame by jointly exploiting the motion of neighbouring views and projecting the motion field onto the current view. MVME generates the SI in two separate steps: (1) motion estimation is performed on the available lateral (left and right) views and (2) motion compensation is performed using the reference (decoded) frames in the view to decode (the central view). A fusion step is performed in MVME to fuse various joint motion and disparity estimations, whereas in the previously discussed works the fusion was performed between a purely inter-view SI and a purely temporal one. MVME demonstrates high performance on fast-motion sequences, but it is outperformed by motion compensation and interpolation techniques in slow-motion cases [11]. More recently, in [12], a modified version of the temporal motion estimation algorithm employed in DISCOVER [13] was proposed for inter-view SI generation. The key novelty is the penalization of small disparities, which characterize background blocks.
2.2 SI fusion techniques
In recent years, SI fusion methods which use estimated distributions of the DCT coefficients were proposed for monoview DVC [14, 15] and applied to MDVC [16, 17]. In [14], optimal reconstruction for a multi-hypothesis decoder was proposed. In [16], the authors enhanced [14] by proposing a cluster-based noise modelling and fusion system. In [15], the concept of parallel decoding was introduced: the distributions of the available SIs were fused using different weights, generating, in that case, six different fused distributions. From each fused distribution, a set of conditional probabilities can be calculated and fed into six parallel LDPCA decoders. Thereafter, the decoders try to reconstruct the considered source bitplane in parallel for each new chunk of received parity bits. The process stops when the bitplane is successfully decoded by at least one LDPCA decoder. The method proposed in [15] can be seen as a brute-force rate-based optimization approach, but it suffers from high computational complexity; to perform an efficient SI fusion, several channel decoders need to be used. In [17], the method proposed in [15] was applied to stereo MDVC to fuse an inter-view and a temporal SI frame. Nevertheless, the complexity issue of [15] was not addressed, since [17] still relies on parallel LDPCA decoding.
In MDVC, pixel- and block-based fusion techniques are widely adopted [8, 9]. The results of [8] show that finding a fusion method able to perform robustly over a wide range of video sequences is difficult, in particular when the quality of the two SIs is very different and therefore the probability of making errors in the fusion process is high. A different approach for fusion in MDVC is proposed in [9], where a past decoded WZ frame and its corresponding SI are used to train a support vector machine classifier, which is then used to perform the fusion task, classifying the reliability of each pixel in the SIs. In [12], the fusion is performed according to an occlusion map: the temporal SI is used if pixels belonging to the left or right views are estimated to be occluded. In [12], adaptive validation is also introduced: for a small subset of the WZ frames, parity bits are requested for both the inter-view and the temporal SIs, introducing an overhead. If the two SIs require similar rates, the fused SI is chosen; otherwise, the single SI providing the lower rate is chosen.
Moreover, the partially decoded information obtained during the decoding process can be used to enhance the RD performance of a DVC codec by improving the correlation noise model [6, 18], the SI [19] or, as proposed in this work, the fusion process in a multiview decoder. In [20], the WZ frame is first decoded using either the inter- or the intra-view SI, according to the motion activity of the video. Then, the completely reconstructed WZ frame is used as the basis for generating a refined SI, using either disparity or motion compensation on a block basis. Lastly, the refined SI is used in a new reconstruction step, obtaining a higher quality reconstruction.
In [10], the encoder sends information to improve the fusion process: since the encoder has access to the original WZ frame and the key frames (KFs), a fusion mask can be generated based on the difference between the KFs and the WZ frame (both known at the encoder). The mask is then compressed and sent to the decoder to drive the fusion process. However, when the encoder participates in the fusion process, its computational complexity increases, which may be impractical for some applications. In addition, the overhead can lead to a significant increase of the bitrate, which may severely limit the improvements obtained from having a higher quality fused SI frame. In any case, none of the works above used past decoded information to improve the fusion process in a multiview decoder, as proposed in this work and described next.
2.3 Benchmarks for SI fusion
In [8, 9], many SI fusion solutions were reviewed and presented. However, it is worth describing one method often used for comparison, MDCDLin [8], and two (ideal) SI fusion solutions often used as benchmarks in the MDVC literature. These benchmarks are also used to assess the proposed technique in Section 5.
Let the original WZ frame be denoted as X. In all the benchmarks, the SIs employed for fusion are generated through OBMC and OBDC and denoted as Y_{OBMC} and Y_{OBDC}, respectively. The corresponding estimated residuals are denoted as R_{OBMC} and R_{OBDC}. The following SI fusion benchmarks were considered.
Motion and disparity compensated difference linear fusion (MDCDLin) is a multiview fusion technique [8] used as a benchmark in [9, 12]. The techniques presented in [9] are shown to perform either as well as MDCDLin or as well as the best single-SI decoder. Therefore, MDCDLin and two single-SI decoders are usually employed as benchmarks. MDCDLin fuses pixel values, using the estimated residuals as weights to generate the fused SI for the pixel at position x. The weight is calculated as follows:
The final SI is calculated as follows:
The residual for the final SI is calculated using the same weighted average for the residuals.
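For illustration, this residual-weighted linear fusion can be sketched in a few lines of Python. The exact weight equation of [8] is not reproduced in this excerpt, so the sketch assumes the common form in which each SI is weighted by the estimated residual of the competing SI; all function and variable names are illustrative:

```python
import numpy as np

def mdcd_lin_fuse(y_obmc, y_obdc, r_obmc, r_obdc, eps=1e-6):
    """Linear residual-weighted SI fusion (sketch of an MDCDLin-style rule).

    Each SI is weighted by the absolute estimated residual of the *other*
    SI, so the SI with the smaller residual dominates (assumed weight form).
    """
    w = (np.abs(r_obdc) + eps) / (np.abs(r_obmc) + np.abs(r_obdc) + 2 * eps)
    y_fused = w * y_obmc + (1.0 - w) * y_obdc
    # The residual of the fused SI uses the same weighted average.
    r_fused = w * np.abs(r_obmc) + (1.0 - w) * np.abs(r_obdc)
    return y_fused, r_fused
```

A pixel where R_{OBMC} is near zero thus takes its value almost entirely from Y_{OBMC}.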
Ideal fusion (IF), sometimes referred to as oracle fusion, is also considered [8, 9]. It is a common bound in the MDVC literature, often used as an upper bound on the performance a fusion technique can achieve. The fused SI is calculated as follows:
and the same rule is applied to the residuals in order to fuse them, obtaining the final residual. The technique requires the original WZ frame, X, to be known at the decoder; therefore, it is not applicable in a practical scenario, but it may be used as a performance bound. Even though IF is often used as an upper bound (e.g. [9]), it is not an upper bound in a strict sense, since it performs a distortion-based optimization on the quality of the SI, and an improved SI PSNR need not always lead to superior RD performance.
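The per-pixel oracle rule can be sketched as follows (the fused equation itself is elided in this excerpt; the sketch assumes the usual form of picking, for each pixel, the SI value closer to the original frame):

```python
import numpy as np

def ideal_fusion(x, y_obmc, y_obdc):
    """Oracle (ideal) fusion: per pixel, keep the SI value closer to the
    original WZ frame X. Usable only as a bound, since X is unknown at a
    real decoder."""
    pick_obmc = np.abs(x - y_obmc) <= np.abs(x - y_obdc)
    return np.where(pick_obmc, y_obmc, y_obdc)
```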
Block-based ideal fusion (IF BB) is also introduced here. Given a 4 × 4 pixel block B, corresponding to a DCT block, the SAD (sum of absolute differences) between the block in the SI and the corresponding block in the original WZ frame is calculated and used as a reliability measure to compute the weight:
The weight w_{B} is then used to fuse each pixel r belonging to B as in (2), as well as to generate the residual of the fused SI. Since IF BB requires knowledge of the original WZ frame, X, it cannot be employed in a realistic scenario (as for IF), but it is a useful bound on the performance that can be reached using the learning approach presented in the next section.
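A block-based variant can be sketched similarly. Since the elided weight formula is not reproduced here, the sketch assumes the weight for one SI is the normalized SAD of the competing SI (names are illustrative):

```python
import numpy as np

def ideal_fusion_bb(x, y_obmc, y_obdc, bs=4):
    """Block-based ideal fusion sketch: for every bs x bs block, the SAD
    against the original WZ frame X measures each SI's reliability, and
    the two SI blocks are blended with the resulting weight."""
    out = np.empty_like(x, dtype=float)
    h, w = x.shape
    eps = 1e-6
    for i in range(0, h, bs):
        for j in range(0, w, bs):
            sl = (slice(i, i + bs), slice(j, j + bs))
            sad_m = np.abs(x[sl] - y_obmc[sl]).sum()
            sad_d = np.abs(x[sl] - y_obdc[sl]).sum()
            # Assumed weight form: competing SAD, normalized (cf. (2)).
            w_b = (sad_d + eps) / (sad_m + sad_d + 2 * eps)
            out[sl] = w_b * y_obmc[sl] + (1 - w_b) * y_obdc[sl]
    return out
```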
3 Proposed MDVC codec architecture
The MDVC solution proposed in this paper adopts the widely used three-view scenario, although it may be generalized to scenarios with more cameras. All the views are independently encoded, without exploiting any inter-view correlation. However, the central view is decoded exploiting the inter-view correlation, while the left and right views are decoded independently of the other views and used to generate the SI for the central view. At the decoder, the MDVC solution has access to the decoded frames from the lateral and central views, as shown in Figure 1. To generate the SI, OBMC only needs access to the decoded frames I_{c,t − 1} and I_{c,t + 1}, since only the temporal correlation is exploited, while OBDC also requires the decoded frames I_{r,t} and I_{l,t}, since the disparity correlation is exploited; X denotes the WZ frame of the central view, unknown at the decoder. The central view is WZ encoded; the lateral views (left and right) are H.264/AVC Intra coded. The architecture of the proposed DVC codec is depicted in Figure 2 for the encoder and Figure 3 for the central view decoder (in Figure 3, the proposed tools are shaded). The overall encoding process for the multiview DVC encoder can be described as follows:
Central view encoder (Figure 2)

1.
First, the Video Splitting module classifies the video frames into WZ frames and key frames according to the group-of-pictures (GOP) structure. In a GOP, the first frame is a KF and the others are WZ frames. The frames selected as KFs are encoded by an H.264/AVC Intra encoder and sent to the decoder.

2.
For the WZ frames X, a DCT transform is applied, in this case an integer 4 × 4 DCT. The DCT coefficients are uniformly quantized (according to the selected RD point) and divided into bitplanes by the Quantization module.

3.
Each bitplane is fed to an LDPCA encoder [21], which generates syndromes; these are stored in a buffer and sent upon request from the decoder.
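The quantization and bitplane-extraction steps above can be sketched as follows. The quantization step and bit depth per band are sketch parameters (chosen by the RD point in the real codec), and sign handling is omitted:

```python
import numpy as np

def band_to_bitplanes(band, step, n_bits):
    """Uniformly quantize one DCT band and split the quantization indices
    into bitplanes (most significant first), as performed by the
    Quantization module before LDPCA encoding. Assumes non-negative
    coefficients for simplicity."""
    q = np.clip(np.floor(band / step).astype(int), 0, 2 ** n_bits - 1)
    return [((q >> b) & 1) for b in range(n_bits - 1, -1, -1)]
```

Each returned bitplane would then be LDPCA encoded independently, MSB first.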
Lateral view encoders (Figure 2)
In general, the only multiview codec requirement is that the lateral views (Figure 1) are encoded independently, i.e. without exploiting any past decoded frames of the same view or frames from the central view. In this setup, the lateral view frames (Ĩ_{ l }, Ĩ_{ r }) are coded with the H.264/AVC Intra Encoder, but other solutions could be used, e.g. a monoview DVC codec.
The overall decoding process for the multiview DVC decoder can be described as follows:
Lateral view decoders
In this case, the lateral view frames are H.264/AVC Intra decoded but, as previously stated, other solutions could be used, e.g. a monoview DVC codec. The left and right reconstructed frames are denoted as I _{ l } and I _{ r }, respectively.
Central view decoder (Figure 3)

1.
The KFs are decoded first, using an H.264/AVC decoder, obtaining I _{c,t − 1} and I _{c,t + 1}. In addition, the key frame quality should match the quality of the reconstructed WZ frames on average. Thus, to avoid quality fluctuations, appropriate quantization step sizes for the WZ and KF DCT coefficients must be selected.

2.
Then, I _{c,t − 1} and I _{c,t + 1} are used by the OBMC SI generation module to calculate the SI Y _{ OBMC } and the (online) residual R _{ OBMC }. Thereafter, Y _{ OBMC } and R _{ OBMC } are DCT transformed, and two sets of DCT coefficients, C _{ OBMC } and C _{R,OBMC}, are obtained. In this work, online residual estimation, as detailed in [6], is employed to estimate the relationship between the original WZ and SI frames without requiring access to the original WZ frame. The residual DCT coefficients C _{R,OBMC} are used by the Noise Modelling module to calculate the parameter α _{ OBMC } of the Laplacian distribution of the correlation noise model [6].

3.
The OBDC SI generation module calculates Y _{ OBDC } and the corresponding residual R _{ OBDC }. In OBDC, pre-aligned frames {I}_{l,t}^{\left(a\right)} and {I}_{r,t}^{\left(a\right)} are generated from the left view I _{l,t} and the right view I _{r,t}, respectively, by removing the lateral regions where no correspondence exists between the frames. These regions cannot be interpolated using disparity compensation, and thus the co-located pixels in Y _{ OBMC } are used. The SI frame and the residual are both DCT transformed, generating C _{ OBDC } and C _{R,OBDC}, respectively. Again, C _{R,OBDC} is used by the Noise Modelling module to calculate the parameter α _{ OBDC } of the Laplacian distribution of the correlation noise model [6].

4.
The Refined Fusion module generates the fused SI coefficients {C}_{F}^{{b}_{k}} for DCT band b _{ k }. The corresponding residual coefficient {C}_{R,F}^{{b}_{k}} after fusion is also calculated. Both sets of coefficients (SI and residual) are calculated as weighted averages of the corresponding OBMC and OBDC coefficients (or residuals). The weights are calculated using the mean absolute difference (MAD) distortion metric between the partially decoded WZ frame and the SI frames; see Section 4.3 for more details.

5.
The Distribution Fusion module calculates the joint distribution {f}_{Fus}^{{b}_{k}} from the three correlation noise models: OBMC, OBDC and the fused SI. Then, the joint distribution is used by the Soft Input Calculation module to calculate the conditional probabilities for the LDPCA decoder. The joint distribution allows the system to effectively fuse the three different SIs, taking into account the previously decoded information.

6.
The LDPCA decoder requests syndromes from the encoder using a feedback channel: initially, a subset of syndromes is received by the decoder, which attempts to decode the source bitplane. If the LDPCA decoding succeeds and an 8-bit CRC does not detect any error, the bitplane is assumed to be decoded; otherwise, new syndromes are requested via the feedback channel until successful decoding is achieved.

7.
Once all the bitplanes of the band b _{ k } are decoded, the DCT band is reconstructed by the Reconstruction module, using {f}_{Fus}^{{b}_{k}}, employing the optimal reconstruction technique outlined in [14].

8.
Finally, when all the bands are successfully decoded, the OBMC and OBDC SIs are fused again. The newly fused SI is used in a last reconstruction step in the Refined Reconstruction module to further improve the quality of the decoded WZ frame.
4 Multiview decoding tools
In this section, the proposed techniques are described and analyzed. The novel contributions are the inter-view OBDC SI generation, the distribution fusion and the fusion learning, which can be divided into two distinct elements: the Refined Fusion used during the decoding process and the Refined Reconstruction used at the end of the decoding process (Figure 3).
4.1 Inter-view side information generation
When using DCVP for inter-view SI generation, the algorithm designed for motion interpolation is applied between lateral views. This generates errors; for example, the appearance and disappearance of objects from the scene can create areas of wrong matches, because an object in one view may have few or no matches in the other view. Thus, wrong disparity vectors can be estimated, which in turn may lead to erroneous predictions. Typically, when content is acquired in a multiview system, there are regions which are present in one view but occluded in another, since the objects of the scene may be partially or totally occluded from the field of view of one camera compared to another. This occurs quite often in the lateral areas of the frames. On the other hand, there are regions with clear correspondences between the two views. In addition, when the disparity between views is high, a larger search range is needed to find correct correspondences, which may lead to wrong matches in weakly textured areas. A way to mitigate these two problems is to remove the lateral areas from the two frames by aligning them. Naturally, disparity estimation and compensation still need to be performed, as each object has its own disparity, depending on its distance to the cameras of the multiview system.
4.1.1 Overlapped block disparity compensation
As stated in the previous section, OBDC is conceptually similar to DCVP, but to allow for larger disparities, I_{r,t} and I_{l,t} are pre-aligned. This is done by finding the minimum average disparity and removing unmatched areas, as described below. Consider that each frame of the multiview system has n × m spatial resolution. The average disparity d_{avg} between the two views is calculated by the following:
where χ(q) is an indicator function, with χ(q) = 1 if q ≥ 0 and χ(q) = 0 otherwise, and r is the positive bound of the search range. If d_{avg} > 0, the pixels with i coordinates in the interval [0, d_{avg} − 1] are removed from the frame I_{l,t}(i,j), generating {I}_{l,t}^{\left(a\right)}, and for I_{r,t} the pixels in the area [m − 1 − d_{avg}, m − 1] are removed. If d_{avg} < 0, the roles of the two frames are inverted, as can be seen from the interval covered by the i variable in the first sum for a negative q.
The pixels contained in the lateral areas cannot be used for disparity estimation and interpolation, since they have no match in the other view; therefore, these two areas are removed, generating the aligned frames {I}_{l,t}^{\left(a\right)} and {I}_{r,t}^{\left(a\right)}, to which OBMC is applied, generating {Y}_{\mathit{OBDC}}^{\left(a\right)}. However, in {Y}_{\mathit{OBDC}}^{\left(a\right)} there are now two areas, d_{avg}/2 pixels wide, which cannot be interpolated, since their corresponding pixels are visible in only one KF view. The assumed structure of these areas in {Y}_{\mathit{OBDC}}^{\left(a\right)} follows from the symmetrical placement of the cameras. Therefore, the unmatched pixels are substituted with the co-located pixels in Y_{OBMC}. A schematic of the algorithm is depicted in Figure 4. The same substitution is applied to the residual of OBDC, since it suffers from the same problem.

The pre-alignment phase reduces the length of the disparity vectors. This allows a smaller search range, more reliable estimation (fewer wrong matches) and lower computational complexity. In addition, the disparity field is not calculated in the unmatched areas, allowing more robust motion estimation for the other blocks. In OBMC (which is the core of OBDC, see Figure 4), as in many similar motion estimation algorithms, the motion field is smoothed after its initial calculation, so erroneous disparity vectors may influence correct ones; with the alignment, this error propagation is avoided.
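The cropping step of the pre-alignment can be sketched as follows (a minimal sketch of the column removal described above; the average-disparity search itself is omitted):

```python
import numpy as np

def prealign(left, right, d_avg):
    """Crop the unmatched lateral strips before disparity estimation.
    For d_avg > 0, the first d_avg columns of the left view and the last
    d_avg columns of the right view have no correspondence and are
    removed; for d_avg < 0, the roles of the views are swapped."""
    if d_avg > 0:
        return left[:, d_avg:], right[:, :right.shape[1] - d_avg]
    if d_avg < 0:
        return left[:, :left.shape[1] + d_avg], right[:, -d_avg:]
    return left, right
```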
4.2 Fusion based on weighted distribution
The techniques previously proposed in the literature use the residual or similar features to estimate the reliability of a given pixel (or block) for the two SI estimations. Once the SI reliability is estimated locally, the SI estimates can be combined to achieve a higher reliability. Traditionally, many DVC fusion methods use a binary mask indicating how the two SI estimations should be fused to maximize the final SI frame quality. However, this approach makes a hard decision, which can be far from optimal, and the generation of a new correlation noise model for the fused SI frame is difficult. Here, a different approach is proposed: the correlation noise model distributions obtained independently for the two SI estimations are fused, thus avoiding the need to calculate a residual for the fused SI. The better the residual and correlation noise model estimation, the better the fusion process works. In addition, fusing the distributions according to the correlation model can improve as better correlation noise models are proposed in the literature. First, the correlation noise modelling presented in [6] is summarized for completeness. Defining {C}_{R}^{{b}_{k}} as the DCT transform of the estimated residual for band b_{ k }, D(u,v) measures the distance between individual coefficients and the average value of the coefficients within band b_{ k }:
The parameter {\alpha}^{{b}_{k}}\left(u,v\right) of the Laplacian distribution used in the noise modelling is calculated as in [6]:
where E[∙] denotes the expectation; the possible values of β are described in [6]. {\alpha}_{c}^{{b}_{k}} is calculated as follows, based on the cluster c (inliers or outliers) to which the position (u,v) belongs:
where N_{ c } is the number of positions belonging to cluster c.
To determine which cluster the coefficient {C}_{R}^{{b}_{k}}\left(u,v\right) belongs to, a mapping function is used, based on the classification (inliers or outliers) of the already decoded coefficients [6]. This classification is based on the estimated variance of the coefficient and on D(u,v) [6]. Once the already decoded coefficients are classified, the classification of the coefficients of band b_{ k } is estimated by the mapping function, as in [6]. The full algorithm is more involved [6]; only the elements necessary to follow the rest of this work are provided here.
Using the procedure outlined above for the generic Laplacian parameter {\alpha}^{{b}_{k}}\left(u,v\right), two sets of Laplacian parameters can be defined: one for the OBMC SI and one for the OBDC SI, {\alpha}_{\mathit{OBMC}}^{{b}_{k}}\left(u,v\right) and {\alpha}_{\mathit{OBDC}}^{{b}_{k}}\left(u,v\right), respectively. The weight for fusing the distributions is calculated as proposed in [16]:
Once the weights are calculated, the joint distribution for each position is defined as follows:
where {f}_{X\mid Y}^{{b}_{k},\left(u,v\right)} is the estimated distribution for the coefficient (u,v) in band b_{ k } given Y. The idea is that the weights give an indication of the reliability of the SIs, and therefore they are used to fuse the distributions. This may be applied in both pixel-based and block-based approaches. The system is compatible with, and exploits, the efficient block-based correlation noise estimations available in the literature.
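The per-coefficient distribution fusion can be sketched as a convex mixture of the two Laplacian densities. The weight formula of [16] is not reproduced in this excerpt; the sketch assumes a larger Laplacian parameter (i.e. lower estimated noise) earns a larger weight:

```python
import numpy as np

def laplacian_pdf(c, mu, alpha):
    """Laplacian correlation-noise density centred on the SI coefficient mu."""
    return 0.5 * alpha * np.exp(-alpha * np.abs(c - mu))

def fused_pdf(c, c_obmc, a_obmc, c_obdc, a_obdc):
    """Convex mixture of the OBMC and OBDC per-coefficient distributions.
    The weight rule (alpha-proportional) is an assumption of this sketch,
    standing in for the elided formula of [16]."""
    w = a_obmc / (a_obmc + a_obdc)
    return (w * laplacian_pdf(c, c_obmc, a_obmc)
            + (1.0 - w) * laplacian_pdf(c, c_obdc, a_obdc))
```

Since each component integrates to one, the mixture is itself a valid density, from which the soft inputs can be computed directly.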
4.3 Fusion learning
The SI fusion process described in the previous section can be improved using a learning-based approach that leverages the already decoded bands to perform a more reliable SI fusion. Assume that band b_{ k }, with k > 0, is being decoded (b_{0} indicates the DC coefficient) and that decoding follows a zigzag scan order; the previously decoded bands b_{ l }, l < k, can then be used to guide the fusion of each SI DCT coefficient. Consider a 4 × 4 DCT block in Y_{ OBMC }, denoted as B_{ OBMC }, and its corresponding block in the partially reconstructed frame, B_{ Rec }. Let {C}_{\mathit{OBMC}}^{{b}_{k}}\left(u,v\right) denote the coefficient in band b_{ k } at position (u,v). First, the non-decoded coefficients are forced to zero in B_{ OBMC } and in the partially reconstructed block B_{ Rec }. Then, both DCT blocks are inverse DCT transformed, and the MAD between the two blocks is calculated and denoted as the weight {w}_{F}^{\mathit{OBMC}}\left(u,v\right), as shown in Figure 5. The MAD indicates how close the previously decoded SI DCT coefficients were to those of the original WZ frame; note that the WZ frame itself is not used in this process. The same procedure is repeated for OBDC, using B_{ OBDC } and B_{ Rec }, generating the weight {w}_{F}^{\mathit{OBDC}}\left(u,v\right). The higher the weight, the lower the reliability of the corresponding SI. Therefore, {w}_{F}^{\mathit{OBMC}}\left(u,v\right) is used as the weighting factor for OBDC, while {w}_{F}^{\mathit{OBDC}}\left(u,v\right) is used as the weighting factor for OBMC.
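The MAD-based weight computation above can be sketched as follows. The codec's integer 4 × 4 DCT is approximated here by the exact orthonormal DCT-II, which is an assumption of this sketch:

```python
import numpy as np

def dct_mat(n=4):
    """Orthonormal DCT-II matrix (exact DCT standing in for the codec's
    integer 4 x 4 transform)."""
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def fusion_learning_weight(c_si, c_rec, decoded_mask):
    """MAD between one SI block and the partially reconstructed block,
    computed over the already decoded bands only: coefficients of bands
    not yet decoded are zeroed in both blocks before the inverse DCT."""
    m = dct_mat(c_si.shape[0])
    b_si = m.T @ (c_si * decoded_mask) @ m    # inverse DCT of masked SI block
    b_rec = m.T @ (c_rec * decoded_mask) @ m  # inverse DCT of masked reconstruction
    return np.abs(b_si - b_rec).mean()        # w_F: higher -> less reliable SI
```

The weight computed against the OBMC block would then scale the OBDC coefficients in the fusion, and vice versa, consistently with the cross-weighting described above.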
The set of weights is used to generate the fused SI coefficient:
and the corresponding residual estimation for the fused coefficient of the SI:
To use the correlation noise model of [6], the coefficients {C}_{F}^{{b}_{k}}\left(u,v\right) need to be divided into an inlier cluster and an outlier cluster. Therefore, (11) is used to calculate {C}_{F}^{{b}_{l}}\left(u,v\right), 0 ≤ l < k. The coefficients {C}_{F}^{{b}_{l}}\left(u,v\right) and the estimation function defined in [6] are used to segment the coefficients {C}_{F}^{{b}_{k}}\left(u,v\right) into the two clusters. The three SIs for k > 0 are fused using the distribution fusion framework. The final joint distribution is defined as follows:
where λ is the adaptive weight assigned to the fused SI distribution and {f}^{{b}_{k},\left(u,v\right)} is defined in (10).
The adaptive computation of the λ parameter ensures that a low weight is selected for the fused SI when it is not reliable, and that the weight increases rapidly, in line with the expected increase in the reliability of the fused SI. The conditional probability of each bit in the SI can then be calculated, taking into account the previously decoded bitplanes and the correlation noise model described by {f}_{Fus}^{{b}_{k},\left(u,v\right)}. The decoded bitplanes determine the interval [L,U) to which each coefficient belongs. To reconstruct the coefficient at position (u,v), the optimal reconstruction proposed in [14] is used, i.e. the expectation of the coefficient given the available SIs:
This procedure is carried out for each band b_{ k }, 0 ≤ k ≤ N_{ b }, where N_{ b } is the maximum number of decoded bands, updating the weights {w}_{F}^{\mathit{OBMC}}\left(u,v\right) and {w}_{F}^{\mathit{OBDC}}\left(u,v\right) at every step. Once band {b}_{{N}_{b}} is decoded, {C}_{F}^{{b}_{k}}\left(u,v\right) is calculated for each N_{ b } < k ≤ 16, and these values are used as coefficients in the reconstructed frame. The bands b_{ k }, 0 ≤ k ≤ N_{ b }, are then reconstructed a second time to enhance the quality of the reconstructed frame. The segmentation into the inlier and outlier clusters is calculated using the already reconstructed frame, i.e. the actual value of each decoded coefficient is used to determine the cluster it belongs to, as opposed to using the mapping function employed in the previous steps [6]. As residual, the difference between the previously decoded frame and the fused SI is used. In this case, λ = 0 in the reconstruction since, at this stage, the reliability of the fused SI is so high that it is not necessary to use the inter-view or temporal SIs.
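The optimal reconstruction step used above, following [14], takes the expectation of the coefficient over the decoded quantization interval. A minimal numerical sketch, assuming a Laplacian correlation noise model centred at the fused SI value (parameter names are illustrative):

```python
import numpy as np

def optimal_reconstruction(low, high, si_value, alpha, num=20001):
    """Centroid (expectation) of the coefficient over the decoded
    quantization interval [low, high), under a Laplacian noise model
    centred at the fused SI value."""
    x = np.linspace(low, high, num)
    pdf = 0.5 * alpha * np.exp(-alpha * np.abs(x - si_value))
    # Normalize over the interval, then take the expectation.
    return float(np.trapz(x * pdf, x) / np.trapz(pdf, x))
```

The reconstructed value always lies inside the decoded interval and is pulled toward the SI value, which is what clips the large errors that plain SI substitution would otherwise produce.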
5 Experimental results
In this section, the coding tools proposed in the previous section are evaluated using the DVC codec described in Section 3. The test conditions are first defined. Then, OBDC is compared with DCVP, demonstrating the gains resulting from the pre-alignment phase. For fairness, DCVP employs OBMC for disparity estimation and compensation. Furthermore, the performance of the fusion algorithm is analysed by comparing it with single SI decoders and alternative fusion techniques, using cameras at relatively close distance. Finally, the case of unknown disparity is analysed, examining the RD performance of the proposed decoder for 18 different camera configurations.
5.1 Test conditions
In the experiments, two sequences with still cameras and two sequences with moving cameras at constant inter-camera distance are analysed, in order to test the robustness of the system to global motion. The stream structure for the central view has a GOP size of 2. The full length of Outdoor and Book Arrival [22], 100 frames, is coded, as are the first 10 s of Kendo and Balloons [22], i.e. 300 frames. Concerning the spatio-temporal resolution, all the sequences are downsampled to CIF resolution:

Test sequences: Outdoor, Book Arrival, Kendo and Balloons [22]. These sequences are characterized by different types of motion content, depth structures and camera arrangements, providing a meaningful and varied set of test conditions, as outlined in Table 1; in the ‘Interval of used views’ column, ‘1’ corresponds to the rightmost view (among the recommended views [23]). In the experiments, the central view is kept fixed while the distance between the central and the lateral cameras is increased, spanning the intervals detailed in Table 1. The distance between two consecutive cameras is 6.5 cm [24] for Outdoor and Book Arrival, while it is 5 cm [22] for Kendo and Balloons.

WZ frames coding: The WZ frames are encoded at four RD points (Q_{ i }, i = 1, 4, 7, 8) corresponding to four different 4 × 4 DCT quantization matrices [13]. The RD point Q_{1} corresponds to the lowest bitrate and quality, and the RD point Q_{8} to the highest. The remaining test conditions associated with the DCT, quantization, noise modelling and reconstruction modules are the same as in [6]. For the LDPCA coding, a code length of 6,336 bits is used, and an 8-bit CRC is employed to check the correctness of the decoded result.

KFs coding: The KFs in the central view are H.264/AVC Intra-coded (Main profile), as is commonly done, e.g. in [6]. The quantization parameter (QP) of the KFs is selected in order to obtain a similar decoded quality for WZ frames and KFs at the same RD point. The QPs used for each RD point are reported in Table 2. As previously stated, the lateral views are coded with the same parameters as the KFs of the central view.

Quality and bitrate: Only the bitrate and PSNR of the luminance component are considered, as is commonly done in the literature. Both WZ frames and KFs are taken into account in the rate and PSNR calculations. The rate and PSNR of the lateral views are not taken into account, in order to better assess the performance of the proposed MDVC solution.
5.2 OBDC-based SI performance assessment
In this section, the RD performance of the DVC solution using OBDC, with the sliding window approach, is assessed and compared with that achieved when DCVP is used to generate the (inter-view) SI; the only difference between OBDC and DCVP is the pre-alignment phase. Table 3 shows the Bjøntegaard bitrate savings (BD-Rate) and Bjøntegaard PSNR gains (BD-PSNR) [25] between OBDC and DCVP when using as lateral views the ones closest to the central view (lowest disparity case), i.e. views 7 and 9 for Outdoor and Book Arrival and views 2 and 4 for Kendo and Balloons. Both SIs are evaluated using the same single SI decoder [6]. For DCVP, the parameters (e.g. search range, strength of the motion smoothing) are adapted to obtain the best average result in terms of RD performance, and the same parameters are then used for OBDC, for all the sequences and for all the configurations (distances of the lateral cameras). As can be observed from Table 3, OBDC improves the RD performance of the DVC codec when compared to DCVP, with PSNR gains up to 1.17 dB for the Book Arrival sequence, which is characterized by a complex depth structure. No appreciable gains are reported for Outdoor, the sequence displaying the simplest depth structure. Table 4 shows the BD-Rate savings and BD-PSNR gains between OBDC and DCVP when using as lateral views the ones furthest away from the central view (according to the view interval indicated in Table 1), i.e. views 1 and 15 for Outdoor and Book Arrival, and views 1 and 5 for Kendo and Balloons. In this case, the parameters for OBDC are the same as those used for generating the results in Table 3. On the other hand, the performance of DCVP is maximized through extensive simulations, finding, for each sequence, the parameters giving the best RD performance.
It was not possible to find parameters that perform well for all the sequences for DCVP; with the pre-alignment phase in OBDC, on the other hand, the disparity between views is normalized, leaving to the disparity estimation module only the task of accommodating minor differences.
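The Bjøntegaard metrics used throughout these comparisons quantify the average gap between two RD curves [25]. A minimal sketch of the BD-PSNR computation, using the standard cubic fit in the log-rate domain (this is a generic implementation of the metric, not the authors' tooling):

```python
import numpy as np

def bd_psnr(rate_ref, psnr_ref, rate_test, psnr_test):
    """Bjoentegaard delta PSNR: average vertical gap between two RD
    curves, each fitted with a cubic polynomial in the log-rate domain."""
    lr_ref = np.log10(np.asarray(rate_ref, dtype=float))
    lr_test = np.log10(np.asarray(rate_test, dtype=float))
    p_ref = np.polyfit(lr_ref, psnr_ref, 3)
    p_test = np.polyfit(lr_test, psnr_test, 3)
    # Integrate both fits over the overlapping log-rate interval.
    lo = max(lr_ref.min(), lr_test.min())
    hi = min(lr_ref.max(), lr_test.max())
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    return (int_test - int_ref) / (hi - lo)
```

A positive value means the test codec delivers higher PSNR than the reference at the same rate, on average over the overlapping rate range; BD-Rate is computed dually, integrating log-rate as a function of PSNR.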
5.3 MDVC RD performance assessment
In this section, the RD performance of the proposed MDVC coding solution is assessed and compared directly with the MDVC scheme MDCD-Lin. The RD performance of distributed decoding based on motion-only SI and on inter-view-only SI is also presented. Finally, the performance of predictive monoview codecs is provided for further comparison. The left, right and central views used in the experiments are reported in Table 5.
5.3.1 Coding benchmarks
The proposed MDVC coding solution (described in Section 4) is compared with the following DVC-based codecs:

OBMC: Single SI decoder, as presented in [6]. It is a single-view DVC solution, since it exploits only the temporal correlation.

OBDC: Single SI decoder; OBDC is used as SI (outlined in Section 4.1). It exploits the inter-view correlation for the majority of the frame, while the temporal correlation is used for the rest.

MDCD-Lin: Motion and disparity compensated linear fusion is the main benchmark. It is summarized in Section 2 and implemented following [8]. The weights (calculated from the online residuals) used to fuse the SIs are also used to fuse the corresponding residuals of the two SIs, to take into account that a wrong fusion has repercussions not only on the SI quality but also on the quality of the residual (which impacts the correlation model accuracy). The SI and the residual estimation are fed into the single SI decoder of [6]. While newer techniques have been proposed [9], they were unable to provide consistent gains over MDCD-Lin, which is therefore employed as the benchmark.

DISCOVER: This DVC-based codec [13] is still widely used as a benchmark in the literature. The system used as the basis for our codec [6] has a structure similar to DISCOVER, but it uses an enhanced SI generation module (OBMC) and an advanced noise modelling algorithm. DISCOVER is reported only for completeness; the focus is on the comparison with the other DVC coding solutions, the OBMC- and OBDC-based baseline decoders, in order to make clear how the proposed tools improve the RD performance of the system.
The performance of the proposed method is also compared with the bounds given by the following ideal fusion techniques:

IF BB: Ideal block-based fusion, summarized in Section 2. The fused SI and residual estimation are fed into the single SI decoder detailed in [6]; the ideal block-level weights are used to fuse both the SIs and their estimated residuals.

IF: Ideal pixel-level fusion, summarized in Section 2. The fused SI and residual estimation are fed into the single SI decoder detailed in [6]; the ideal pixel-level weights are used to fuse both the SIs and their estimated residuals.
The proposed MDVC decoder is finally compared with the following standard predictive coding schemes for reference:

H.264/AVC Intra: The H.264/AVC codec (Main profile) with only the Intra modes enabled. It is also used for coding the KFs and lateral views, and it constitutes a low-complexity encoding architecture.

H.264/AVC No Motion: Exploits the temporal redundancy in an IB prediction structure, setting the search range of the motion compensation to zero; therefore, motion estimation, the most computationally expensive encoding task, is not performed: the co-located blocks in the backward and/or forward reference frames are used for prediction.
5.3.2 RD performance
Table 6 reports the BD-Rate savings and BD-PSNR gains for the proposed MDVC coding solution when compared to the baseline OBMC- and OBDC-based DVC coding solutions, using the tools proposed in [6]. For each sequence, the best-performing single SI-based DVC solution is identified in boldface. The proposed MDVC video coding solution consistently outperforms the best single SI-based DVC solution, with PSNR gains up to 0.9 dB. In the worst-case scenario, Balloons, the improvement is still significant, allowing a bitrate reduction of up to around 7%. The results for the DISCOVER codec are also provided; the average BD-Rate savings are around 18%. Concerning the comparison with MDCD-Lin, the proposed method shows an average BD-PSNR gain of 0.62 dB. The improvement is robust, ranging from 0.58 to 0.74 dB. The gains of the proposed method over MDCD-Lin are in italics if MDCD-Lin is robust, i.e. if it is able to outperform both the single SI OBMC-based decoder and the single SI OBDC-based decoder.
Figure 6 reports the RD performance results obtained for Outdoor, Book Arrival, Kendo and Balloons, for the nine coding solutions mentioned above. The proposed solution outperforms OBMC, OBDC, DISCOVER and MDCD-Lin, which are all truly distributed decoders, i.e. they do not require the WZ frame. More specifically, the BD-PSNR gains of the proposed solution are up to 1.5 dB when compared with OBDC and up to 1.12 dB when compared with OBMC. The proposed decoder outperforms DISCOVER by up to 2 dB, because DISCOVER uses less advanced SI generation and correlation noise modelling. MDCD-Lin is able to robustly fuse the SIs for Outdoor, Book Arrival and Kendo, but not for Balloons. Furthermore, for the first three sequences, the improvements achieved with MDCD-Lin are lower than those of the proposed solution, reaching BD-PSNR gains of only up to 0.33 dB for Outdoor. Therefore, the proposed solution, leveraging the fusion of the distributions and the learning process, outperforms the other realistic distributed decoders. The use of weights derived from the distributions allows a more precise fusion, because the correlation noise modelling is built on the premise that the residual may contain errors. The learning process allows a refinement of the fused SI while decoding the frame, improving the SI quality through a more accurate fusion process. The ideal fusion-based coding solutions, IF and IF BB, require the original WZ frame; they therefore provide a bound but cannot be used in practice. The BD-PSNR gains of IF BB over the proposed coding solution range from 0.02 dB for Book Arrival to 0.28 dB for Kendo, showing that the proposed system reaches performance close to an ideal block-based fusion technique. However, pixel-level ideal fusion shows gains of up to 1.14 dB BD-PSNR over the proposed coding solution, for the Outdoor sequence.
Concerning the reference predictive coders, H.264/AVC Intra is outperformed by every distributed coding solution, regardless of the SI generation method. The proposed decoder reaches RD performance comparable with H.264/AVC No Motion for Kendo and Balloons. For Outdoor and Book Arrival, the only distributed decoder able to compete with H.264/AVC No Motion is the one with pixel-level IF. However, notice that H.264/AVC No Motion requires much higher encoding complexity, since it has to test several Intra and Inter modes using the neighbouring or co-located blocks as reference. It is difficult to provide a complete comparison with more recent works, such as [12], given that the resolution and the distance between cameras differ, i.e. different test conditions are used. Nevertheless, for the same views used in [12], we produced results for MDCD-Lin. The technique proposed in [12], referred to as AV, outperforms MDCD-Lin by 0.61 dB on average over the BD-PSNR values for the four sequences. It has to be noted that MDCD-Lin is used to fuse MCTI and DCVP, while the results for AV in [12] are based on fusing better-performing SIs. The proposed method achieves a similar improvement over MDCD-Lin (0.62 dB), but in this case the comparison is made using the same SIs for both fusion architectures. Direct comparison with [12] remains difficult because different resolutions are used. Nevertheless, for the four analysed sequences, AV performs well on Balloons (2.2 dB gain [12]), but the gains are minor (0.0 to 0.13 dB [12]) for the other three sequences we consider. The proposed method, on the other hand, provides reasonably robust gains (0.58 to 0.74 dB, Table 6) on all four sequences. As a final note, the occlusion detection mechanism presented in [12] addresses occlusions in the areas where the different views overlap.
The proposed method, instead, removes the areas that are occluded because they do not belong to the overlapping part of the views. It is reasonable to expect that combining both approaches can lead to even higher gains.
5.4 Camera distance impact
This section assesses the impact on the MDVC codec RD performance of varying the distance between the lateral and the central views. The test conditions are similar to those used in the previous subsection, except for the choice of the lateral views. Tables 7, 8, 9 and 10 show the BD-Rate savings and BD-PSNR gains for the proposed MDVC solution with respect to the baseline OBMC- and OBDC-based DVC coding solutions when varying the distance between the cameras, for the Outdoor, Book Arrival, Kendo and Balloons sequences. The BD gains of the proposed method with respect to MDCD-Lin are also provided. (The results are reported using boldface and italics, following the conventions of the previous section.) The ∆ value refers to the difference between the index of the central camera and the index of the right camera; note that the same value of ∆ may refer to different inter-camera spacings depending on the camera arrangement. According to the results obtained, the proposed MDVC solution is robust to changes in disparity: Outdoor, which is characterized by a simpler depth structure, shows a much more stable performance when compared with Book Arrival. In only one of the 18 examined cases is the proposed fusion solution unable to perform better than the best single SI-based DVC solution; there, the performance loss is negligible, and the BD between the RD performances of the two single SI decoders (one using OBMC, the other OBDC) is more than 3 dB, making the problem of increasing the performance by fusion extremely hard. Concerning the performance comparison with MDCD-Lin, the gains of the proposed method, in BD-PSNR, range from 0.50 dB (Outdoor, ∆ = 6) to 2.25 dB (Book Arrival, ∆ = 7). The proposed method shows higher stability and robustness when compared with MDCD-Lin, which is unable to efficiently fuse SIs of too different quality.
It has to be noted that, as opposed to [12], MDCD-Lin fuses the same SIs used by the proposed method; therefore, the assessment here is purely based on the performance of the fusion algorithm.
6 Conclusions
In this paper, a novel fusion approach is proposed, based on learning and on fusing the distributions of the SIs rather than their pixel values. This simplifies the problem of estimating the residual of the fused SI and allows the MDVC solution to leverage well-known techniques for residual estimation and correlation noise model calculation developed for single SI DVC schemes. The proposed MDVC coding solution proved to be robust to both increments and decrements of the distance between the cameras, a desirable feature in systems where cameras can move with respect to each other or where the distance between cameras is unknown. The proposed learning approach achieved a superior RD performance, on average, when compared with single SI decoders, and it showed higher robustness than a residual-based SI fusion technique. The proposed fusion reached performance similar to the bounds obtained with a block-based ideal fusion, which relies on knowledge of the original WZ frame. In the case of cameras moving with respect to the scene but keeping a fixed disparity, the MDVC solution achieved results close to H.264/AVC No Motion; in the case of fixed cameras, the difference is relatively small, in particular when compared with the RD performance loss of single SI DVC solutions.
References
Girod B, Aaron AM, Rane S, Rebollo-Monedero D: Distributed video coding. Proc. IEEE 2005, 93(1):71-83.
Puri R, Majumdar A, Ramchandran K: PRISM: a video coding paradigm with motion estimation at the decoder. IEEE Trans. Image Process. 2007, 16(10):2436-2448.
Guillemot C, Pereira F, Torres L, Ebrahimi T, Leonardi R, Ostermann J: Distributed monoview and multiview video coding. IEEE Signal Process. Mag. 2007, 24(5):67-76.
Slepian D, Wolf J: Noiseless coding of correlated information sources. IEEE Trans. Inf. Theory 1973, 19(4):471-480. 10.1109/TIT.1973.1055037
Wyner A, Ziv J: The rate-distortion function for source coding with side information at the decoder. IEEE Trans. Inf. Theory 1976, 22(1):1-10. 10.1109/TIT.1976.1055508
Huang X, Forchhammer S: Cross-band noise model refinement for transform domain Wyner-Ziv video coding. Signal Process. Image Commun. 2012, 27(1):16-30. 10.1016/j.image.2011.06.008
Vetro A, Wiegand T, Sullivan GJ: Overview of the stereo and multiview video coding extensions of the H.264/MPEG-4 AVC standard. Proc. IEEE 2011, 99(4):626-642.
Maugey T, Miled W, Cagnazzo M, Pesquet-Popescu B: Fusion schemes for multiview distributed video coding. In Proceedings of the European Signal Processing Conference. Glasgow; 2009:559-563.
Dufaux F: Support vector machine based fusion for multiview distributed video coding. In Proceedings of Digital Signal Processing (DSP). Corfu; 2011:1-7.
Ouaret M, Dufaux F, Ebrahimi T: Multiview distributed video coding with encoder driven fusion. In Proceedings of the 2007 European Signal Processing Conference (EUSIPCO 2007). Poznan; 2007.
Artigas X, Tarrés F, Torres L: Comparison of different side information generation methods for multiview distributed video coding. In Proceedings of SIGMAP 2007. Barcelona; 2007.
Petrazzuoli G, Cagnazzo M, Pesquet-Popescu B: Novel solutions for side information generation and fusion in multiview DVC. EURASIP J. Adv. Signal Process. 2013, 2013:154.
Artigas X, Ascenso J, Dalai M, Klomp S, Kubasov D, Ouaret M: The DISCOVER codec: architecture, techniques and evaluation. In Proceedings of the Picture Coding Symposium (PCS) 2007. Lisbon; 2007.
Kubasov D, Nayak J, Guillemot C: Optimal reconstruction in Wyner-Ziv video coding with multiple side information. In Proceedings of IEEE MMSP 2007. Chania, Crete; 2007:183-186.
Huang X, Brites C, Ascenso J, Pereira F, Forchhammer S: Distributed video coding with multiple side information. In Proceedings of the Picture Coding Symposium (PCS) 2009. Chicago, Illinois; 2009:385-388.
Li Y, Liu H, Liu X, Ma S, Zhao D, Gao W: Multi-hypothesis based multiview distributed video coding. In Proceedings of the Picture Coding Symposium (PCS) 2009. Chicago, Illinois; 2009:1-4.
Salmistraro M, Zamarin M, Forchhammer S: Multi-hypothesis distributed stereo video coding. In Proceedings of IEEE MMSP 2013. Pula, Sardinia; 30 September to 2 October 2013.
Luong H, Raket LL, Huang X, Forchhammer S: Side information and noise learning for distributed video coding using optical flow and clustering. IEEE Trans. Image Process. 2012, 21(12):4782-4796.
Brites C, Ascenso J, Pereira F: Learning based decoding approach for improved Wyner-Ziv video coding. In Proceedings of PCS 2012. Krakow; 2012:165-168.
Ouaret M, Dufaux F, Ebrahimi T: Iterative multiview side information for enhanced reconstruction in distributed video coding. EURASIP J. Image Video Process. 2009, 2009:3:1-3:17.
Varodayan D, Aaron A, Girod B: Rate-adaptive codes for distributed source coding. Signal Process. 2006, 86(11):3123-3130. 10.1016/j.sigpro.2006.03.012
Smolic A, Tech G, Brust H: Report on Generation of Stereo Video Database (Technical Report D2). 1 July 2010.
Nagoya University, Tanimoto Laboratory: Kendo specifications. www.tanimoto.nuee.nagoya-u.ac.jp/~fukushima/mpegftv/yuv/Kendo/readme.txt. Accessed 18 November 2014.
Feldmann I, Mueller M, Zilly F, Tanger R, Mueller K, Smolic A, Kauff P, Wiegand T: HHI Test Material for 3D Video. ISO, Archamps; May 2008.
Bjøntegaard G: Calculation of average PSNR differences between RD curves. In VCEG 13th Meeting. Austin, Texas; 2001.
Competing interests
The authors declare that they have no competing interests.
Salmistraro, M., Ascenso, J., Brites, C. et al. A robust fusion method for multiview distributed video coding. EURASIP J. Adv. Signal Process. 2014, 174 (2014). https://doi.org/10.1186/168761802014174