Stereo Image Coder Based on the MRF Model for Disparity Compensation

This paper presents a stereoscopic image coder based on the MRF model and MAP estimation of the disparity field. The MRF model minimizes the noise of disparity compensation, because it takes into account the residual energy, smoothness constraints on the disparity field and the occlusion field. Disparity compensation is formulated as an MAP-MRF problem in the spatial domain, where the MRF field consists of the disparity vector and occlusion fields. The occlusion field is partitioned into three regions by an initial double-threshold setting. The MAP search is conducted in a block-based sense on one or two of the three regions, providing faster execution. The reference and residual images are decomposed by a discrete wavelet transform and the transform coefficients are encoded by employing the morphological representation of wavelet coefficients algorithm. As a result of the morphological encoding, the reference and residual images together with the disparity vector field are transmitted in partitions, lowering total entropy. The experimental evaluation of the proposed scheme on synthetic and real images shows beneficial performance over other stereoscopic coders in the literature.


Introduction
The perception of a scene with 3-D realism may be accomplished by a stereo image pair which consists of two images of the same scene recorded from two slightly different perspectives. The two images are distinguished as Left and Right image that present binocular redundancy and for that reason can be encoded more efficiently as a pair than independently. The stereoscopic vision has a very wide field of applications in robot vision, virtual machines, medical surgery etc. Typically, the transmission or the storage of a stereo image requires twice the bandwidth or the capacity of a single image. The objective on a bandwidth-limited transmission system is to develop an efficient coding scheme that will exploit the redundancies of the two images, that is, intra-image and cross-image correlation or similarities.
A typical compression scenario is the encoding of one image, which is called reference and the disparity compensation of the other, which is called target. In this work, the Left image is assigned as reference and the Right image as target. Transform coding is a method used to remove intra-spatial redundancy both from reference and target images. The cross-image redundant information is evaluated by considering the disparity between the two images. Disparity compensation procedure estimates the best prediction of the target image from the reference and results in an error image, which is called residual, together with a disparity vector field. The encoded reference and residual images together with the disparity vectors are entropy coded and transmitted. Therefore, the effectiveness of the encoding algorithm, the energy of the residual image and the smoothness of the disparity vector field affect the overall performance of the stereo coder.
Several methods have been developed for disparity compensation. The area-based methods, including either pixel or line or area matching, are simple approaches for disparity estimation [1,2]. The block-based matching method, either fixed or variable size (FSBM or VSBM), finds the distance between two blocks that have similar intensities within a predefined search window [3].
problem if the cameras are coplanar. The distance between two points of the stereo pair images that correspond to the same scene point is called disparity. The estimation of this distance (disparity vector or DV) is very important in stereo image compression because the target image (Right) can be predicted from the reference (Left) along with the disparity vectors. Then, the difference of the prediction from the original image (disparity compensated difference or DCD) is evaluated so that redundant information is not encoded and transmitted [17,35]. Disparity compensation usually employs BMA for the estimation of a residual or DCD block: where , are the corresponding blocks of the Right and the reconstructed Left images respectively and dv x , dv y are the disparity vector components for the best match, which is defined as: (2) where A is the window searching area and the matching criterion is MAE. In this work, the general case is considered where the disparity vector has horizontal and vertical components. The above described disparity compensation process is called closed-loop, because the prediction of the target image is performed with the reconstructed reference image. This is quite reasonable because the reconstruction of the target image will be performed with the assistance of the reconstructed reference image at the decoder's side [8]. Alternatively, disparity compensation may be performed with the reference image and is called open-loop. The openloop systems, although they are simpler since there is no need for inverse quantization and wavelet transform at the encoder's side, are less effective. The disparity compensation process exploits the spatial cross-image dependency in order to remove redundant information. However, some blocks that have no correspondence may be encountered and are called occluded blocks. The sides of the stereo pair that cannot be seen directly by both eyes, as well as the areas from object overlapping are occluded regions. The occluded regions are usually tracked and excluded during the disparity estimation process, since they contribute to high distortion in the residual image. MRF model penalizes the existence of an occluded block and encourages the connectivity of neighbouring occluded blocks, as they usually appear at the boundaries of objects where large intensity gradients prevail.

The MRF/GRF model
In this section, the basic concepts of the MRF model are reviewed [23,30]. Let: be a rectangular lattice of size NxN, which in this case is the disparity compensated image and a family of random variables defined on representing the random disparity field. Obviously, each disparity compensated image may be viewed as a discrete sample realization of S D , with a configuration , which is a set of each random variable. Each disparity compensated block of pixels may be viewed as a random variable in the spatial domain: The MRF model considers a neighbourhood system on S , which is defined as: where is the set of sites on the neighbourhood of the block. The definition of the neighbourhood is as follows: is a positive integer defining the order of a neighbourhood system. The first order neighbourhood, =1, which is used in the present work, is a four connected structuring element as shown in Fig. 1 The cliques are a subset of sites in , where each site is a neighbour of the other sites in the defined neighbourhood system. A family of random variables is an MRF model on with respect to if the following properties are satisfied: (  ,  )  ,  (  ,  |  (   ,  ,  , is the energy function for . represents the clique potential of all possible first order clique sets, which are single-site and double-site . Normalization factor Z is called partition function and has the following form: ( 1 (12) The practical value of the above is that the probability of a configuration may be specified in terms of prior potentials for all the cliques. Let us assume that the observation model r , the configuration , the a priori probability and the likelihood density are known. Normally, the best value of is given by an MAP estimation, which can be expressed with Bayes formula, as: where is the probability density function of the observation model, which does not affect the solution of (13). Therefore, an MAP solution is given by: ) (r p (14) )] According to Gibbs distribution equation (10), the MAP solution may be converted as follows: is the prior and the likelihood energies. Finally, configuration may be estimated by the minimization of energy equation (16) knowing the prior and likelihood energies for a given neighbourhood system.

. The morphological encoder
The conventional wavelet image coders decompose a "still" image into multiresolution bands providing better compression quality than the so far existing DCT transform [18]. The statistical properties of the wavelet coefficients led to the development of some very efficient encoding algorithms such as, the embedded zero tree wavelet coder (EZW) [19], the coder based on set partitioning in hierarchical trees (SPIHT) [20], the coder based on the morphological representation of wavelet data (MRWD) [21] and its enhanced version called significance-linked connected component analysis for wavelet image coding (SLCCA) [22].
MRWD algorithm, which is used in the present work, exploits the intra-band clustering and inter-band directional spatial dependency of the wavelet coefficients. This spatial dependency is shown in Fig. 2(a) for a three level wavelet transform. Hence, a prediction of the significant coefficients in a hierarchical manner is feasible starting from the coarsest scale. This may be accomplished using the morphological dilation operation with a structuring element. A dead-zone uniform step size quantizer quantizes all the subbands and the coefficients of the coarsest detail subbands constitute either the map of significance or insignificance, that is, a binary image with two partitions in every subband. The intra-band dependency of wavelet coefficients or the tendency to form clusters suggests that significant neighbours may be captured applying a morphological dilation operator. The finer scale significant coefficients, in the children subbands, may be predicted from the significant ones of the coarser scale, parent subbands, by applying the same morphological operator to an enlarged neighbourhood, because children subbands have double size than their parents. However, the significant partitions comprise insignificant coefficients that were captured as significant and correspondingly, the insignificant partitions comprise significant coefficients that were isolated. So, each of these two partitions may be further partitioned into two groups, so that the elements of each group to have the same properties. Fig. 2(b) shows binary images of the detail subbands with the formed clusters after the aforementioned morphological operation. The black areas denote significant coefficients, the white areas denote insignificant ones, while the grey areas illustrate insignificant coefficients that are captured as significant by dilation operation with a 3X3 structuring element. The approximation subband, which contains the low frequency components, is not subjected to this operation and all of its coefficients are considered as significant.
Consequently, the coefficients of the wavelet transform are partitioned into groups with the same characteristics and total composite entropy is lowered. The transmitted sequence of these partitions has a certain order of transmission including side information, which consists of the headers that define each partition, needed at the decoder's stage. The performance of this algorithm for "still" images is quite good with respect to other state of the art compression techniques. It provides PSNR values of about 1 dB better than EZW and has about the same performance as SPIHT [21]. The morphological encoder has also the capability, by assigning a set of embedded quantizers, to produce an embedded coding which insures resolution scalability at the decoder's side.

The proposed algorithm
The disparity field of a stereo image pair is an MRF/GRF model consisting of disparity, D and occlusion, , fields. The problem is to determine disparity and occlusion fields from the O observations which are the pair of images. The configurations and , for disparity and occlusion fields respectively, may be estimated by equation (16): , are the observations and represent the Left and Right images respectively. The first term represents the likelihood energy, the second term represents the prior disparity field when occlusion field is given and the third term represents the prior occlusion constraint.

. The likelihood energy
The likelihood energy, which is also called similarity constraint, indicates how similar two corresponding images are, when disparity and occlusion fields are known. Typically, this may be expressed as: where is a binary indication for the presence of an occluded block, are the pixels of the processed block and are the predicted pixels of the reconstructed Left block that are translated by in order to have a best match to the corresponding ones of the processed block. The best matching between two corresponding blocks is decided by the minimum value of their MAE.

The smoothness constraint
The prior disparity field, when occlusion field is given, is also called smoothness constraint. Minimization of the respective term in the general equation (17) provides a smooth disparity field except on the occluded points. This is expressed as follows: where is the disparity field of the first order neighbourhood system. As it is clear from the above equation, the occluded neighbours are not taken into account, since they represent local discontinuities. The effect of this procedure is to result in a more uniform disparity field, which provides better encoding. In this work, MAE is selected instead of MSE as a measure of the energy terms in equations (18) and (19), because it is simpler and less sensitive to outliers.

The occlusion constraint
The prior occlusion field, which is called occlusion constraint, is a binary field that defines local discontinuities. The occluded blocks are not compensated and their disparity vector is set to zero. The energy equation of the occlusion field has the following form: where are the occluded neighbours of the processed , and are the single and double clique sites respectively. First term provides the energy cost if a block becomes occluded and second term encourages occlusion connectivity.

The final equation for disparity estimation
The MRF/GRF model general equation (17), taking into account (18), (19) and (20), may be expressed as: where d λ and o λ are weighting constants that control each of the participating fields. Each term of the above equation depicts the energy cost of likelihood, smoothness constraint and occlusion functions respectively.

The proposed disparity compensation
The disparity field, which is estimated by equations (1) and (2), consists of the residual image and the vector field. The initial occlusion field is formed by employing a double threshold procedure as in [16]: Hence, the occlusion field is separated into three regions: • The non-occluded region, where the blocks are always predictable.
• The occluded region, where the blocks are always occluded and excluded from the MAP search. • The uncertain region, where the blocks are subjected into an MAP search, in order to enrol them as occluded or non-occluded. Disparity and occlusion fields are iteratively updated according to the non-optimal deterministic method proposed in [25], in order to reduce complexity: • Given the best initial estimate of the occlusion field, update disparity field by minimizing the first two terms of the final equation (21). This phase refers to blocks that belong to the non-occluded and uncertain regions, because occluded blocks are not compensated. • Given the best estimate of the disparity field, update occlusion field by minimizing the last two terms of the final equation. This phase is applied on blocks that belong to the uncertain region, in order to enrol them in one of the other two regions. First term penalizes the conversion of an uncertain block to an occluded or non-occluded block and second term favours the connectivity of the processed block. • The whole process is repeated until no further energy minimization takes place. The proposed MRF method converges in three or four iterations.
The last two terms of equation (21) represent the potential costs for the occlusion phase and are defined as follows: where , 0 C p λ , o λ are weighting constants. The first term of the above equation is the energy cost if an uncertain block is assigned as occluded and is expressed in terms of the mean residual block. The second term penalizes the connectivity of an uncertain block to its neighbours. The function is defined as: (⋅ δ is the Kronecker delta function and sign is the signum function.
If the neighbours of an uncertain block are occluded or non-occluded blocks, the cost increases with the number of neighbours that are of different kind. This term favours the connectivity of an uncertain block to its neighbourhood. If a neighbour of an uncertain block is also uncertain, the cost depends on their disparity vectors difference. If this is greater than a threshold, q λ , there is no energy cost. If the difference is less than the pre-specified threshold, the energy cost increases if the two uncertain blocks are of different kind. The threshold q λ becomes smaller over the iterations, as the disparity vector field becomes more uniform and in this work is defined as .

The computational complexity of the proposed algorithm
It is well known that the computational complexity of a disparity estimation algorithm is defined by the search algorithm, the cost function and the search range. Assume the macroblock size of 8x8 pixels and that the search range parameter is p. Let's also assume that a disparity estimation algorithm employs the MSE cost function and that the image size is MxN pixels. The computational complexity of BMA is given by: where is the complexity of the cost function requiring 259 operations. The exhaustive search technique requires searches per macroblock. If p=16, as proposed in this paper, the number of searches is 1089. The disparity field, unlike motion field, depends on the distance from the camera and thus is less uniform, as different parts of the background show different disparity. The occlusion field is decided by comparing the magnitude of the MAE of a matching block with a pre-selected threshold and we assume that consists of occluded macroblocks. The disparity compensation procedure is performed for macroblocks that are not occluded. Thus, the complexity of defining the occlusion field is given by the term that is about the complexity of the MAE cost function.
The computational complexity given by equation (25) is considered as the initial step of a typical MRF algorithm, [6]. To this complexity, it has to be added the required operations for updating the disparity field and the operations for updating the occlusion field, as mentioned in the previous subsection. The update of the disparity field is performed by the first two terms of equation (21), whereas the update of the occlusion field is performed by the last two terms of the same equation or equations (23) and (24). This may be expressed as: where and represent the computational complexity for updating the disparity and occlusion fields respectively, and is the number of required iterations. The update of the disparity field concerns only the non-occluded macroblocks, whereas the update of the occlusion field concerns all the macroblocks.
In our proposed algorithm, the update of the occlusion field is performed only on the uncertain region as indicated by (22), which is a fraction of the total image size. This reduces the computational complexity of a typical MRF method, which is expressed by equation (26), and renders the execution time faster. Moreover, MAE has been chosen as the cost function because of its simplicity compared to MSE, its direct hardware implementation and its robustness to outliers. It has been estimated that the time consumed by our proposed algorithm is about three times that of BMA and about 30% less than that of a typical MRF algorithm. The complexity of our proposed scheme may be reduced if the search range in the vertical direction is confined to ± 2 pixels. In that case the number of searches is reduced to 165. This is reasonable because the natural images used for experimental evaluation have been captured by fixed and aligned cameras. The complexity may be further improved if a fast search algorithm is employed for disparity estimation, as those that have been developed for motion estimation. For example, the three-step search algorithm presents a complexity O(log p) compared to O(p 2 ) that the exhaustive search presents. Also, the complexity of the hierarchical block matching algorithm is 50 times lower compared to the exhaustive search.

Experimental results
In this section, the experimental evaluation of the proposed coder is reported. Three grey scale stereo image pairs were employed for the experimental evaluation, from which two images are synthetic: "Room" (256X256) and "SYN.256" (256X256) and two images are real: "Fruit" (256X256) and "Aqua" (360X288) [26,27,28]. The proposed stereoscopic coder employs four level DWT with symmetric extension, based on the 9/7 biorthogonal Daubechies filters [29]. The parameter values are obtained by trial and error and are listed in Table 1.
• are thresholds that define an initial occlusion field. They are defined in terms of the average value of the initial disparity compensated field. The initial DCD or the initial residual image is attained after disparity estimation for all the macroblocks employing BMA.  λ is a variable threshold value in each iteration that penalizes the disparity vector difference between neighbouring uncertain blocks. Except for the thresholds T 1 and T 2 , which define the three regions of the occlusion field, minor alterations to the other parameters will not change considerable the experimental results. It is very difficult to estimate automatically their values or to correlate their estimation with the source stereoscopic image pair. For this reason, the parameters listed in Table 1 were kept constant throughout the experiments and for all the tested images.
The experimental evaluation of the proposed method is performed with the following criteria: • The subjective quality measure, which is the optical quality of the reproduced target image. The smoothness of the residual image and the disparity vector field are indicative of the final target image quality. The abnormalities that appear in the residual image due to occlusions make the bit cost larger. Also, the detection of the occlusion field by using thresholds is simple but contributes to a larger bit cost. • The objective quality measure of the reproduced images, which is expressed by the PSNR value in terms of the total bit-rate. where MSE L and MSE R are the mean square errors of Left and Right images respectively. The total bit-rate is the entropy of the DWT subband coefficients of reference and residual images, after their morphological representation and partitioning by the morphological encoder and the disparity vectors, which are DPCM encoded, since their transmission must be lossless.
• The entropy of the disparity vector field, which is defined as: where and denote the probability of the horizontal and vertical disparity vector components. This measure indicates the randomness of the disparity field and it is intended to be as low as possible. This is normal in most images which consist of smooth intensity objects, except around object boundaries. The MRF method, in contradiction to the classical BMA, takes care of that vector smoothness.
) ( x dv P ) ( y dv P • The normalized average energy or MSE of the residual image, which is defined as: where is the image lattice of dimensions. A lower residual energy means that fewer bits are needed for encoding, so it is indicative of the matching algorithm effectiveness.

S N N ×
The experimental evaluation involves the comparison of the proposed disparity compensation process, which is based on the MRF model, with respect to the classical BMA method and the performance of the proposed stereo coder with respect to other state-of the-art coders. In this coder, the disparity compensation process is implemented with blocks of 8X8 pixels in a searching area of 16 pixels. This size of blocks is found to be the best choice in terms of the produced noise and coding efficiency. Table 2 shows the normalized average energy of the residual image and the entropy of the disparity vector field for BMA and MRF processes at a specific total bit rate. As expected, the MRF residual images present lower energy and the disparity vector field is smoother than that of BMA processing. This lower energy and the smoothness of the vector field insure lower total entropy values. The occluded regions are usually tracked and excluded from the disparity compensation process, since they contribute to distortions increasing excessively the bit-rate. The occlusion indicators are transmitted because their residual coding results in a total bit-rate benefit. Also, their main role is to avoid mismatching blocks containing object boundaries and preventing disparity over-smoothing across discontinuities. The MRF model penalizes the existence of an occluded block and encourages the connectivity of neighbouring occluded blocks, which usually appear at objects boundaries where large intensity gradients prevail.  . 3 (a) and (b) show the initial and final occlusion fields for the "Room" stereo pair respectively. In (a), grey regions represent the uncertain field, black areas represent the occluded field and white areas represent the non-occluded field. The isolated occluded blocks are initially assigned as uncertain blocks, because it is desirable to exclude them as they increase entropy cost. It is also apparent in (b) that occlusion connectivity is favoured, as black areas have been enlarged. In Fig. 4, (a)-(d) show the residual image and the disparity vector of a BMA and an MRF based disparity compensation process for the "Room" stereo pair, at a bit rate of 0.20 bpp. In Fig. 5, (a)-(d) show the residual image and the disparity vector field of a BMA and an MRF based disparity compensation process for the "SYN.256" stereo pair, at a bit rate of 0.21 bpp. In both stereo pairs, the performance of the MRF disparity compensation process is better than the corresponding BMA. Apparently, the MRF model residual images present lower energy and their corresponding disparity vector fields are smoother than their BMA counterparts validating the results of Table 2.
In Fig. 6, (a) and (b) show the reconstructed target image of stereo pair "Room", for BMA and MRF respectively. The objective quality of BMA and MRF processes is 26.02 dB and 28.24 dB respectively, for a bit rate of 0.2 bpp. In Fig. 7, (a) and (b) show the reconstructed target image of stereo pair "SYN.256", for BMA and MRF respectively. The performance of BMA and MRF processes is 29.08 dB and 29.92 dB respectively, for a bit rate of 0.21 bpp.    Fig. 9. Quality performance evaluation of various stereoscopic coders for "Fruit" stereo pair.
At the lower bound the two algorithms converge, whereas at bit rates greater than 0.5 bpp our proposed scheme outperforms. The Hierarchical MRF stereo coder incorporates the typical MRF model and a variable size block matching scheme for disparity estimation. Consequently, we believe that a variable size block disparity estimation scheme, adapted to our MRF model, would improve the performance of our coder. Fig. 9 shows the experimental evaluation of various stereoscopic coders for the "Fruit" stereo image pair. The disparity compensated EZW coder is based on EZW encoding for both the reference and residual images employing fixed size block BMA for disparity estimation. The proposed MRF stereo coder presents beneficial PSNR values in comparison with the other coders. This proves that our coder behaves equally well not only with synthetic stereo images but with camera-acquired images, which present a more difficult disparity field as it depends on cameras distances and their alignment. It should be observed that the quality difference mitigates at lower bit-rates, which may be ascribed to the fixed size block matching. BMA disparity compensation with fixed size blocks does not exploit the constant disparity areas that exist in a scene and assigns more bits than actually required. Fig. 10 illustrates the performance of the proposed coder in comparison with other state-of the-art coders for the "Aqua" stereo image pair. Again, our proposed coder outperforms in the middle and high bit-rates of at least 0.8 dB the other stereo coders and its performance converges to the others at lower bit-rates. It is worth to note that Hierarchical MRF stereo coder presents inferior quality to the specific natural image, whereas our proposed scheme has a stable performance both in synthetic and natural images.  Apart from the MRF model, which treats disparity compensation very effectively, the wavelet based morphological encoder contributes to the good performance of our proposed scheme because it is more efficient than other coders. It presents, for "still" images, about 1 dB better performance over EZW and also outperforms DCT because of its wavelet nature. Apart from its simple implementation, fast execution and efficiency, MRWD encoder may provide embedded bit streams and spatial scalability that are prerequisites of a modern coding scheme.
The proposed algorithm may be applied to stereoscopic video coding with advantageous results as the smoother disparity field will imply better temporal prediction. Of course, this will imply a more complicated framework because motion and disparity fields along with their unpredictable fields must be integrated. The motion-disparity estimation procedure for compensating the auxiliary channel with techniques like the joint motion-disparity estimation, vector regularization as well as the GOP structure of the two channels should be considered for an effective coding scheme with low complexity. Also, the fixed block size framework employed in this paper assumes that all the pixels of a block have the same disparity, which is not the case. This assumption does not take advantage of the constant disparity areas. Therefore, as a plan for future work, the application of the proposed scheme in a variable size block framework should be considered.

Conclusions
In this work, an algorithm employing the MRF model is proposed for the disparity estimation of a stereo image pair. The MRF model is a popular method in video community for a consistent evaluation of motion fields. It provides the means to accomplish smooth disparity field without increasing the residual energy and thus to devote fewer bits to encode them. The proposed coder consists of a disparity compensation unit and an encoding unit. The disparity compensation unit constructs initially the disparity and occlusion fields using BMA. The occlusion field is separated into three regions by employing a two level threshold and the MAP search is performed on the uncertain region, which consists of blocks that have to enrol in the occlusion or non-occlusion regions. This approach permits faster execution times, as the MAP search is conducted in a fraction of the whole image. In addition, the choice of MAE provides more reliable disparity estimation than MSE because it is more robust and simpler. The encoding unit decomposes the reference and residual images with DWT and employs the morphological algorithm MRWD for compression. This algorithm partitions the coefficients of the wavelet transform and lowers their entropy. The obtained results show that the proposed method improves the quality of the reconstructed target images compared with the results that a plain BMA method may provide. Also, the proposed stereo coder outperforms some known state of the art coders.
To further investigate the contribution of the MRF model to the efficient handling of disparity estimation, its application on the subband domain may be tested. This may be done using BMA or coefficient matching, which may be embedded into the morphological encoder by using the same structuring element as that of the first order neighbourhood system.