In this section, the proposed techniques are described and analyzed. The novel contributions are the inter-view OBDC SI generation, the distribution fusion, and the fusion learning; the latter can be divided into two distinct elements: the refined fusion used during the decoding process and the refined reconstruction applied at the end of the decoding process (Figure 3).
4.1 Inter-view side-information generation
When DCVP is used for inter-view SI generation, the same algorithm applied for motion interpolation is applied between the lateral views. This generates errors; for example, the appearance and disappearance of objects from the scene can create areas of wrong matches, because an object in one view may have few or no matches in the other view. Wrong disparity vectors can thus be estimated, which in turn may lead to erroneous predictions. Typically, when content is acquired in a multiview system, there are regions which are present in one view but occluded in another, since an object of the scene can be partially or totally occluded in the field-of-view of one camera compared to another. This occurs quite often in the lateral areas of the frames. On the other hand, there are regions with clear correspondences between the two views. In addition, when the disparity between views is high, a larger search range is needed to find correct correspondences, which may lead to wrong matches in areas with low texture. A way to mitigate these two problems is to remove the lateral areas from the two frames by aligning them. Naturally, disparity estimation and compensation still need to be performed, as each object has its own disparity, which depends on its distance to the cameras of the multiview system.
4.1.1 Overlapped block disparity compensation
As stated in the previous section, OBDC is conceptually similar to DCVP; however, to allow for larger disparities, I_r,t and I_l,t shall be pre-aligned. This is done by finding the minimum average disparity and removing the unmatched areas, as described below. Consider that each frame of the multiview system has n × m spatial resolution. The average disparity d_avg between the two views is calculated by the following:

  d_avg = arg min_{q ∈ [−r, r]} Σ_{j=0}^{n−1} Σ_{i=χ(q)·q}^{m−1+χ(−q)·q} |I_l,t(i, j) − I_r,t(i − q, j)|   (5)

where χ(q) is an indicator function, with χ(q) = 1 if q ≥ 0 and χ(q) = 0 otherwise, and r is the positive bound of the search range. If d_avg > 0, the pixels with i coordinates in the interval [0, |d_avg| − 1] are removed from the I_l,t(i, j) frame, and for I_r,t the pixels in the area [m − 1 − |d_avg|, m − 1] are removed. In case d_avg < 0, the roles of the two frames are inverted, as can be seen from the interval covered by the i variable in the first sum for a negative q.
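The brute-force search for the minimum average disparity can be sketched as follows. This is a minimal numpy sketch under the definitions above; the function name and the plain SAD search are illustrative, not the codec's actual implementation.

```python
import numpy as np

def average_disparity(I_l, I_r, r):
    """Estimate the global average disparity d_avg between two views.

    For each candidate shift q in [-r, r], the left view is compared with
    the right view displaced by q over their overlapping columns, and the
    shift minimising the mean absolute difference is returned.
    Frames are assumed to be n x m luminance arrays.
    """
    n, m = I_l.shape
    best_q, best_cost = 0, np.inf
    for q in range(-r, r + 1):
        if q >= 0:               # chi(q) = 1: drop the first q columns of I_l
            diff = I_l[:, q:m] - I_r[:, 0:m - q]
        else:                    # chi(q) = 0: the roles of the frames invert
            diff = I_l[:, 0:m + q] - I_r[:, -q:m]
        cost = np.abs(diff).mean()   # normalised SAD over the overlap
        if cost < best_cost:
            best_cost, best_q = cost, q
    return best_q
```

Normalising the SAD by the size of the overlap avoids biasing the search toward large shifts, which compare fewer columns.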
The pixels contained in the lateral areas cannot be used for disparity estimation and interpolation, since they have no match in the other view; therefore, these two areas are removed, generating the aligned frames, to which OBMC is applied, producing the aligned inter-view SI. However, this SI now contains two areas, |d_avg|/2 pixels wide, which cannot be interpolated, since their corresponding pixels are visible in only one KF view. The assumption on the structure of these areas comes from the symmetrical placement of the cameras. Therefore, the unmatched pixels are substituted with the co-located pixels in Y_OBMC. A schematic of the algorithm is depicted in Figure 4. The same substitution is applied to the residual of OBDC, since it suffers from the same problem.

With the pre-alignment phase, the length of the disparity vectors is reduced. This allows a smaller search range, a more reliable estimation (fewer wrong matches) and a lower computational complexity. In addition, the disparity field is not calculated in the unmatched areas, allowing a more robust estimation for the remaining blocks. In OBMC (which is the core of OBDC, see Figure 4) and in many similar motion estimation algorithms, the motion field is smoothed after its initial calculation, so erroneous disparity vectors may influence correct ones; with the alignment, this propagation of errors is avoided.
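The alignment and the substitution of the unmatched strips can be sketched as follows, assuming d_avg has already been estimated. `obmc_between` is a placeholder for the OBMC interpolation between the two aligned frames; all names here are illustrative.

```python
import numpy as np

def align_and_fill(I_l, I_r, Y_obmc, d_avg, obmc_between):
    """Pre-align the two key frames, interpolate between them, and fill
    the unmatched lateral strips with the co-located temporal SI pixels.
    """
    m = I_l.shape[1]
    d = abs(d_avg)
    if d_avg >= 0:          # drop the first d columns of I_l, the last d of I_r
        A_l, A_r = I_l[:, d:], I_r[:, :m - d]
    else:                   # the roles of the two views are inverted
        A_l, A_r = I_l[:, :m - d], I_r[:, d:]
    Y = obmc_between(A_l, A_r)          # aligned inter-view SI, width m - d
    # Re-insert the SI into a full-width frame: the two lateral strips,
    # about d/2 pixels wide each, have no inter-view match and are copied
    # from the temporal SI Y_OBMC instead.
    out = Y_obmc.copy()
    half = d // 2
    out[:, half:half + Y.shape[1]] = Y
    return out
```

The same filling step would be applied to the OBDC residual, which has holes in the same positions.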
4.2 Fusion based on weighted distribution
The techniques previously proposed in the literature make use of the residual or similar features to estimate the reliability of a given pixel (or block) for the two SI estimations. Once the SI reliability is estimated locally, it is possible to fuse the SI estimates to achieve a higher reliability. Traditionally, many fusion methods for DVC use a binary mask which indicates how the two SI estimations should be fused to maximize the quality of the final SI frame. However, with this approach a hard decision is made, which could be far from optimal, and the generation of a new correlation noise model for the fused SI frame is difficult. Here, a different approach is proposed: the correlation noise model distributions obtained independently for the two SI estimations are fused, thus avoiding the need to calculate a residual for the fused SI. The better the residual and correlation noise model estimations are, the better the fusion process works. In addition, fusing the distributions according to the correlation model can be improved as better correlation noise models are proposed in the literature. First, the correlation noise modelling presented in [6] is summarized here for completeness. Defining T_{b_k}(u,v) as the DCT transform of the estimated residual for band b_k, D(u,v) measures the distance between individual coefficients and the average value of the coefficients within band b_k:

  D(u,v) = |T_{b_k}(u,v)| − E[|T_{b_k}(u,v)|]   (6)

The parameter α_{b_k} of the Laplacian distribution used in the noise modelling is calculated as in [6]:

  α_{b_k} = β · sqrt(2 / E[D(u,v)^2])   (7)

where E[∙] denotes the expectation. The possible values of β are described in [6]. The cluster-conditioned parameter α_{b_k,c} is calculated as follows and is based on the cluster c (inliers or outliers) to which the position (u,v) belongs:

  α_{b_k,c} = sqrt(2 N_c / Σ_{(u,v)∈c} D(u,v)^2)   (8)

where N_c is the number of positions belonging to cluster c.
To determine the cluster to which a coefficient belongs, a mapping function is used, based on the classification (inliers or outliers) of the already decoded coefficients [6]. This classification is based on the estimated variance of the coefficient and on D(u,v) [6]. Once the already decoded coefficients are classified, the classification of the coefficients of band b_k is estimated by the mapping function, as in [6]. The full algorithm is more complex [6]; here, only the elements necessary to understand the rest of the work are provided.
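The cluster-based parameter estimation can be illustrated with the following sketch. It follows the inlier/outlier idea described above, not the exact formulas of [6]; all names, and the way the per-cluster variance is estimated, are assumptions for illustration.

```python
import numpy as np

def cluster_alpha(R_dct, inlier_mask):
    """Per-cluster Laplacian parameter for one DCT band of the residual.

    R_dct       : estimated residual DCT coefficients of band b_k
    inlier_mask : True where a position is classified as inlier
    """
    # Distance of each coefficient magnitude to the band average magnitude.
    D = np.abs(R_dct) - np.abs(R_dct).mean()
    alphas = {}
    for label, mask in (("inlier", inlier_mask), ("outlier", ~inlier_mask)):
        N_c = mask.sum()
        # A zero-mean Laplacian with variance s2 has alpha = sqrt(2 / s2);
        # here s2 is estimated over the N_c positions of the cluster only.
        s2 = (R_dct[mask] ** 2).sum() / N_c
        alphas[label] = float(np.sqrt(2.0 / s2))
    return D, alphas
```

Since inliers have smaller residual energy, their cluster receives a larger alpha, i.e. a narrower (more confident) noise distribution.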
Using the procedure outlined above for the generic Laplacian parameter, two sets of Laplacian parameters can be defined, one for the OBMC SI and one for the OBDC SI: α^OBMC_{b_k}(u,v) and α^OBDC_{b_k}(u,v), respectively. The weight for fusing the distributions is calculated as proposed in [16]:

  w(u,v) = α^OBMC_{b_k}(u,v) / (α^OBMC_{b_k}(u,v) + α^OBDC_{b_k}(u,v))   (9)

Once the weights are calculated, the joint distribution for each position is defined as follows:

  f_{b_k}(x|Y) = w(u,v) · (α^OBMC_{b_k}(u,v)/2) · exp(−α^OBMC_{b_k}(u,v) |x − Y_OBMC(u,v)|) + (1 − w(u,v)) · (α^OBDC_{b_k}(u,v)/2) · exp(−α^OBDC_{b_k}(u,v) |x − Y_OBDC(u,v)|)   (10)

where f_{b_k}(x|Y) is the estimated distribution for the coefficient (u,v) in band b_k given Y. The idea is that the weights give an indication of the reliability of the SIs and are therefore used to fuse the distributions. This approach may be applied both in pixel-based and block-based schemes, and it is compatible with, and exploits, the efficient block-based correlation noise estimations available in the literature.
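The fused distribution is a per-coefficient mixture of two Laplacians. A minimal sketch, with illustrative names, using the alpha-based weight discussed above:

```python
import numpy as np

def fused_pdf(x, y_obmc, y_obdc, a_obmc, a_obdc):
    """Weighted mixture of the two per-coefficient Laplacian noise models.

    The SI whose model has the larger alpha (narrower noise distribution)
    is considered more reliable and receives the larger mixture weight.
    """
    w = a_obmc / (a_obmc + a_obdc)           # reliability of the OBMC SI
    lap = lambda a, mu: 0.5 * a * np.exp(-a * np.abs(x - mu))
    return w * lap(a_obmc, y_obmc) + (1 - w) * lap(a_obdc, y_obdc)
```

Because it is a convex combination of two valid densities, the fused model is itself a valid density and can be plugged directly into the soft-input channel decoder.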
4.3 Fusion learning
The SI fusion process described in the previous section can be improved with a learning-based approach that leverages the knowledge of the already decoded bands. The idea is to use the already decoded bands to perform a more reliable SI fusion. Assume that band b_k, with k > 0, is being decoded (b_0 indicates the DC coefficient) and that the decoding follows a zig-zag scan order; then the previously decoded bands b_l, l < k, can be used to guide the fusion for each SI DCT coefficient. Consider a 4 × 4 DCT block in Y_OBMC, denoted as B_OBMC, and its corresponding block in the partially reconstructed frame, B_Rec. First, the non-decoded coefficients are forced to zero in B_OBMC and in the partially reconstructed block B_Rec. Then, both DCT blocks are inverse transformed and the MAD between the two resulting blocks is calculated; it is denoted as the weight w_OBMC, as shown in Figure 5. The MAD is an indicator of how close the previous SI DCT coefficients were to the ones belonging to the original WZ frame; it has to be noted that the WZ frame itself is not used in this process. The same procedure is repeated for OBDC, using B_OBDC and B_Rec, generating the weight w_OBDC. The higher the weight, the lower the reliability of the corresponding SI. Therefore, w_OBMC is used as the weighting factor for OBDC, while w_OBDC is used as the weighting factor for OBMC.
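The weight computation described above can be sketched as follows, using an orthonormal 4 × 4 DCT. The function name and the boolean `decoded` mask are illustrative, following the description accompanying Figure 5.

```python
import numpy as np

def dct_matrix(N=4):
    """Orthonormal DCT-II basis matrix."""
    C = np.array([[np.cos(np.pi * (2 * j + 1) * i / (2 * N))
                   for j in range(N)] for i in range(N)]) * np.sqrt(2.0 / N)
    C[0, :] /= np.sqrt(2.0)
    return C

def mad_weight(B_si, B_rec, decoded):
    """Fusion-learning weight for one 4x4 block.

    Zero the non-decoded DCT coefficients in both the SI block and the
    partially reconstructed block, inverse transform both, and take the
    MAD between the resulting pixel blocks.
    """
    C = dct_matrix(4)
    masked_si = np.where(decoded, B_si, 0.0)
    masked_rec = np.where(decoded, B_rec, 0.0)
    pix_si = C.T @ masked_si @ C     # inverse 2-D DCT
    pix_rec = C.T @ masked_rec @ C
    return np.abs(pix_si - pix_rec).mean()
```

A perfect match of the decoded coefficients yields a weight of zero, i.e. maximum reliability for that SI.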
The pair of weights is used to generate the fused SI coefficient:

  Y_Fus(u,v) = (w_OBDC · Y_OBMC(u,v) + w_OBMC · Y_OBDC(u,v)) / (w_OBMC + w_OBDC)   (11)

and the corresponding residual estimation for the fused coefficient of the SI:

  R_Fus(u,v) = (w_OBDC · R_OBMC(u,v) + w_OBMC · R_OBDC(u,v)) / (w_OBMC + w_OBDC)   (12)
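A minimal sketch of the cross-weighting idea, where each SI coefficient is weighted by the other SI's MAD so that a large MAD (low reliability) of one SI shifts the fused value toward the other (names are illustrative):

```python
def fuse_coeff(y_obmc, y_obdc, w_obmc, w_obdc):
    """Cross-weighted fusion of the two SI coefficients.

    w_obmc and w_obdc are the MAD-based weights of the OBMC and OBDC SIs;
    each SI is weighted by the *other* SI's MAD.
    """
    return (w_obdc * y_obmc + w_obmc * y_obdc) / (w_obmc + w_obdc)
```

The same combination would be applied to the two residual estimations to obtain the fused residual.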
To use the correlation noise model of [6], the coefficients need to be divided into the inlier and outlier clusters. Therefore, (11) is used to calculate the fused coefficients of the already decoded bands b_l, 0 ≤ l < k. These coefficients and the estimation function defined in [6] are used to segment the coefficients into the two clusters. The three SIs for k > 0 are then fused using the distribution fusion framework. The final joint distribution is defined as follows:

  f(x|Y) = λ · f_{b_k}(x|Y) + (1 − λ) · f_Fus(x|Y_Fus)   (13)

where λ is the adaptive weighting factor (14), f_Fus is the Laplacian model built from the fused SI and its residual (12), and f_{b_k}(x|Y) is defined in (10).
The adaptive computation of the λ parameter assures that a low weight is selected for the fused SI when the fused SI is not reliable, but that its weight increases rapidly, in line with the expected increase in reliability of the fused SI. The conditional probability of each bit in the SI can then be calculated, taking into account the previously decoded bitplanes and the correlation noise model described by f(x|Y). The decoded bitplanes determine the interval [L,U) to which each coefficient belongs. To reconstruct the coefficient at position (u,v), the optimal reconstruction proposed in [14] is used, i.e. the expectation of the coefficient given the decoded interval and the available SIs:

  x̂(u,v) = E[x | x ∈ [L,U), Y] = ∫_L^U x · f(x|Y) dx / ∫_L^U f(x|Y) dx   (15)
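The optimal reconstruction is the centroid of the decoded quantiser interval under the fused model. The following is a numerical stand-in for the closed-form expression of [14], working for an arbitrary model pdf; the function name and grid size are illustrative.

```python
import numpy as np

def optimal_reconstruction(L, U, pdf, num=10001):
    """Conditional expectation E[x | x in [L, U)] under the model `pdf`.

    L, U : bounds of the interval determined by the decoded bitplanes
    pdf  : vectorised correlation-model density f(x | Y)
    """
    x = np.linspace(L, U, num)
    p = pdf(x)
    dx = x[1] - x[0]
    # Riemann-sum approximation of the truncated-distribution centroid.
    return float((x * p).sum() * dx / (p.sum() * dx))
```

Compared to mid-point reconstruction, the centroid pulls the reconstructed value toward the side of the interval where the model places more mass.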
This procedure is carried out for each band b_k, 0 ≤ k ≤ N_b, where N_b is the maximum number of decoded bands, updating the weights w_OBMC and w_OBDC at every band. Once the bands are decoded, the fused coefficients Y_Fus(u,v) are calculated for each band N_b < k ≤ 16 and used as the coefficients of the reconstructed frame. The bands b_k, 0 ≤ k ≤ N_b, are then reconstructed a second time to enhance the quality of the reconstructed frame. In this pass, the segmentation into the inlier and outlier clusters is calculated using the already reconstructed frame, i.e. the actual value of each decoded coefficient is used to determine the cluster to which it belongs, as opposed to the mapping function employed in the previous steps [6]. As residual, the difference between the previously decoded frame and the fused SI is used. In this case, λ = 0 is used in the reconstruction, since at this stage the reliability of the fused SI is so high that it is not necessary to use the inter-view or temporal SIs.