Layer-based sparse representation of multiview images
EURASIP Journal on Advances in Signal Processing volume 2012, Article number: 61 (2012)
Abstract
This article presents a novel method to obtain a sparse representation of multiview images. The method is based on the fact that multiview data is composed of epipolar-plane image lines which are highly redundant. We extend this principle to obtain the layer-based representation, which partitions a multiview image dataset into redundant regions (which we call layers) each related to a constant depth in the observed scene. The layers are extracted using a general segmentation framework which takes into account the camera setup and occlusion constraints. To obtain a sparse representation, the extracted layers are further decomposed using a multidimensional discrete wavelet transform (DWT), first across the view domain followed by a two-dimensional (2D) DWT applied to the image dimensions. We modify the viewpoint DWT to take into account occlusions and scene depth variations. Simulation results based on nonlinear approximation show that the sparsity of our representation is superior to the multi-dimensional DWT without disparity compensation. In addition we demonstrate that the constant depth model of the representation can be used to synthesise novel viewpoints for immersive viewing applications and also de-noise multiview images.
1 Introduction
The notion of sparsity, namely the idea that the essential information contained in a signal can be represented with a small number of significant components, is widespread in signal processing and data analysis in general. Sparse signal representations are at the heart of many successful signal processing applications, such as signal compression and de-noising. In the case of images, successful new representations have been developed on the assumption that the data is well modelled by smooth regions separated by edges or regular contours. Besides wavelets, which have been successful for image compression [1], other examples of dictionaries that provide sparse image representations are curvelets [2], contourlets [3], ridgelets [4], directionlets [5], bandlets [6, 7] and complex wavelets [8, 9]. We refer the reader to a recent overview article [10] for a more comprehensive review on the theory of sparse signal representation.
In parallel and somewhat independently to these developments, there has been a growing interest in the capture and processing of multiview images. The popularity of this approach has been driven by the advent of novel exciting applications such as immersive communication [11] or free-viewpoint and three-dimensional (3D) TV [12]. At the heart of these applications is the idea that a novel arbitrary photorealistic view of a real scene can be obtained by proper interpolation of existing views. The problem of synthesising a novel image from a set of multiview images is known as image-based rendering (IBR) [13].
Multiview data sets are inherently multi-dimensional. In the most general case multiview images can be parameterised using a single 7D function called the plenoptic function [14]. The dimensions, however, can be reduced by making some simplifying assumptions as discussed in the next section. In particular, the assumption that a camera can move only along two directions leads to the 4D light field parameterisation [15]. If the camera moves only along a straight line, the 3D epipolar-plane image (EPI) volume is obtained. We will discuss and use these two parameterisations throughout the article. Intuitively, for a multiview image array which captures the same scene from different locations, a significantly sparser representation can be obtained by analysing the images jointly than by analysing each image independently. When dealing with multiview images, however, the data model must take into account appearing (disocclusions) and disappearing (occlusions) objects. This nonlinear property means that finding a sparse representation is inherently more difficult than in the two-dimensional (2D) case. For this reason, in this article we propose a hybrid method to obtain a sparse representation of multiview images. The fundamental component of the algorithm is the layer-based representation. In many situations, it is possible to divide the observed scene into a small number of depth layers that are parallel to the direction of camera motion. The layer-based representation partitions the multiview images into a set of layers each related to a constant depth in the observed scene. See Figure 1 for a visual example of the partition. We present a novel method to extract these regions, which takes into account the structure of multiview data to achieve accurate results. In the case of the 4D light field, the sparse representation of the data is then obtained by taking a 4D discrete wavelet transform (DWT) of each depth layer. First we take a view-compensated DWT along the two view directions, and then a 2D separable spatial DWT. Nonlinear approximation results show that this new representation is more effective than a standard separable DWT. In addition, we present IBR and de-noising applications based on the extracted layers.
The article is organised as follows. Next we review the structure of multiview data, discuss the layer-based representation and present a high-level overview of our proposed method. In Section 3 we present the layer extraction algorithm. The multi-dimensional DWT is discussed in Section 4. We finally evaluate the proposed sparse representation in Section 5 and conclude in Section 6.
2 Multiview data structure
We start by introducing the plenoptic function and the structure of multiview data. In addition we present a layer-based representation that exploits the multiview structure to partition the data into volumes each related to a constant depth in the scene.
2.1 Plenoptic function
In the IBR framework, multiview images form samples of a multi-dimensional structure called the plenoptic function [14]. Introduced by Adelson and Bergen, this function parameterises each light ray with a 3D point in space $(V_x, V_y, V_z)$ and its direction of arrival $(\theta, \phi)$. Two further variables $\lambda$ and $t$ are used to specify the wavelength and time, respectively. In total the plenoptic function is therefore seven dimensional:
$$I = P_7(V_x, V_y, V_z, \theta, \phi, \lambda, t), \tag{1}$$
where $I$ corresponds to the light ray intensity.
In practice, however, it is not feasible to store, transmit or capture the 7D function. A number of simplifications are therefore applied to reduce its dimensionality. Firstly, it is common to drop the λ parameter and instead deal with either the monochromatic intensity or the red, green, blue (RGB) channels separately. Secondly, the light rays can be recorded at a specific moment in time, thus dropping the t parameter. This simplification can for example be applied when viewing a stationary scene. The resulting object is a 5D function.
A popular parameterisation of the plenoptic function, known as the light field [15], defines each light ray by its intersection with a camera plane and a focal plane:
$$I = P_4(V_x, V_y, x, y), \tag{2}$$
where, as illustrated in Figure 2, $(V_x, V_y)$ and $(x, y)$ correspond to the coordinates of the camera and the focal plane, respectively. Observe that the dataset can be analysed as a 2D array of images, where each image is formed by the light rays which pass through a specific point on the camera plane. In Figure 3 we illustrate an example of a light field with 16 camera locations. The camera positions are evenly spaced on a 2D grid $(V_x, V_y)$.
The light field can be further simplified by restricting the 2D camera plane to a line. The result is known as the EPI volume [16]:
$$I = P_3(V_x, x, y). \tag{3}$$
In comparison to the light field, the EPI is easier to visualise and in the following sections we use it to present a number of concepts. All of the properties are however easily generalised to the light field. Next, we review the EPI and light field structure and present the layer-based representation.
2.2 EPI and light field structure
In this section we show that an EPI volume and a light field are structured datasets. By structure we mean that the fundamental components of multiview images are lines along which the intensity of the pixels is constant. This concept is shown in Figure 4c. This illustration is obtained by stacking an array of images into a volume and taking a cross section through the dataset. It can be clearly observed that pixels are redundant along lines of varying gradients. Each such line, along which the intensity of the volume is constant, is known as an EPI line.
In order to demonstrate why the fundamental components of multiview images are EPI lines, consider the setup in Figure 4a. Here we show a simplified version of the scene: the horizontal axis corresponds to the camera location line; the line parallel to it defines the focal plane of each camera; and the vertical axis defines the depth of the scene. The curved line thus corresponds to the surface of the object.
Given this setup consider a point in space with coordinates $(X, Y, Z)$. Assuming a Lambertian scene, this point will appear in each of the images with coordinates
$$x = \frac{fX}{Z} - \frac{f V_x}{Z}, \qquad y = \frac{fY}{Z}, \tag{4}$$
where $f$ is the focal length. As illustrated in Figure 4b, $x$ is linearly related to the camera location $V_x$. The rate of change in the pixel location, also known as the disparity $\Delta p = f/Z$, is inversely related to the depth of the object. Thus, a point in space maps to a line in the EPI volume.
However, the above analysis does not take into account occlusions. Clearly when two lines intersect, the EPI line corresponding to a smaller depth (larger disparity) will occlude all the EPI lines which are related to a larger depth (smaller disparity) in the scene. This principle is illustrated in Figure 4c.
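The linear relation between pixel position and camera position can be checked numerically. The following is a minimal sketch, assuming a pinhole model in which a point at lateral position X and depth Z projects to x = fX/Z - fV_x/Z; the function name `epi_x` and the sample values are illustrative only.

```python
import numpy as np

def epi_x(X, Z, Vx, f=1.0):
    """x-coordinate of a scene point (X, Z) seen from camera position Vx."""
    return f * X / Z - f * Vx / Z

Vx = np.arange(5, dtype=float)         # camera positions along a line
x_near = epi_x(X=2.0, Z=1.0, Vx=Vx)    # near point, disparity f/Z = 1.0
x_far = epi_x(X=2.0, Z=4.0, Vx=Vx)     # far point, disparity f/Z = 0.25

# Constant first differences confirm that each point traces a straight line
# in the (Vx, x) plane, with a steeper slope for the nearer point.
slope_near = np.diff(x_near)
slope_far = np.diff(x_far)
```

Under this convention the nearer point moves faster across the views, which is why its EPI line is the one that occludes the others when two lines cross.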
The above concepts can also be extended to the light field, where the camera is allowed to move along two dimensions $(V_x, V_y)$. In this case, a point $(X, Y, Z)$ maps onto a 2D plane as
$$x = \frac{fX}{Z} - \frac{f V_x}{Z}, \qquad y = \frac{fY}{Z} - \frac{f V_y}{Z}. \tag{5}$$
2.3 Layer-based representation
The layer-based representation is an extension of the EPI line concept. The representation partitions the multiview data into homogeneous regions, where each layer is a collection of EPI lines modelled by a constant depth plane. An example of a layer-based representation is shown in Figure 1.
Consider a set of EPI lines modelled by a constant disparity $\Delta p_k$ as shown in Figure 5a. We denote the layer carved out by the EPI lines with $\mathcal{H}_k$ and the boundary which delimits the region with $\Gamma_k$. Assuming there are no occlusions, observe that using (4) and (5), the boundary $\Gamma_k$ can be defined by a contour on one of the viewpoints projected to the remaining frames. More specifically, if we define the contour $\gamma_k(s) = [x(s), y(s)]$ to be the boundary on the viewpoint $(V_x = 0)$, we obtain the relationship
$$\Gamma_k(s, V_x) = \begin{bmatrix} x(s) - \Delta p_k V_x \\ y(s) \\ V_x \end{bmatrix}, \tag{6}$$
where $s$ parameterises the contour $\gamma_k(s)$.
In order to take into account occlusions, we can use the same principles as in the case of EPI lines; a layer will be occluded when it intersects with other layers which are related to a smaller depth in the scene. We illustrate this in Figure 5, which shows that when two layers intersect we obtain their visible representations $\mathcal{H}_1^V$ and $\mathcal{H}_2^V$. In this example the layers are ordered in terms of increasing depth (i.e., $\mathcal{H}_2$ corresponds to a larger depth than $\mathcal{H}_1$). In general, the visible regions of a layer can be defined as
$$\mathcal{H}_k^V = \mathcal{H}_k \cap \overline{\bigcup_{j=1}^{k-1} \mathcal{H}_j},$$
where the overline denotes the set complement.
We illustrate two such layers from the Animal Farm dataset in Figure 6.
There are a number of advantages to segmenting a multiview dataset into layers. Firstly, each layer is highly redundant in the direction of the disparity Δp. This is due to the fact that each layer consists of EPI lines modelled by a constant depth. Secondly, any occluded regions are explicitly defined by the representation. These regions correspond to artificial boundaries, and their specific locations can be used to design a transform which takes them into account. Thirdly, the boundary of each layer can be efficiently defined by a contour on one viewpoint γ (s) and its disparity Δp. This is important if a compression algorithm based on the sparse representation is to be implemented, where the segmentation of each layer must also be transmitted.
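To make the contour-plus-disparity description concrete, the sketch below builds layer volumes from a mask on the first viewpoint and resolves occlusions by depth order. It is a minimal illustration under simplifying assumptions that are not from the article: disparities are integers, the camera moves along one line, and `np.roll` wraps pixels around the image border. The function names are hypothetical.

```python
import numpy as np

def layer_volume(mask0, disparity, n_views):
    """Propagate a 2D mask defined on view 0 to all views (integer disparity)."""
    return np.stack([np.roll(mask0, -disparity * v, axis=1)
                     for v in range(n_views)])

def visible_layers(masks0, disparities, n_views):
    """Visible part of each layer; inputs are ordered by increasing depth."""
    occupied = np.zeros((n_views,) + masks0[0].shape, dtype=bool)
    visible = []
    for mask0, dp in zip(masks0, disparities):
        full = layer_volume(mask0, dp, n_views)
        visible.append(full & ~occupied)   # remove samples hidden by nearer layers
        occupied |= full
    return visible
```

Each layer is stored as one 2D mask and one number, yet the occluded samples in every view follow directly from the ordering, which is the property the representation relies on.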
2.4 Sparse representation method high-level overview
We use the above analysis to develop a new method that provides sparse representations of multiview images. The method is outlined in Figure 7. The first step of the method is to obtain a layer-based representation. As highlighted in Section 2.3 each layer is modelled by a constant depth plane and a contour on one of the image viewpoints. To extract these layers, we use a variational framework where the general segmentation results are modified to include the camera setup and the occlusion constraints.
In the following step we decompose the layers using a 4D DWT applied in a separable fashion across the viewpoint and the spatial dimensions. We modify the viewpoint transform to include disparity compensation and also efficiently deal with disoccluded regions. Additionally, the transform is implemented using the lifting scheme [17] to reduce the complexity and maintain invertibility.
In the following sections we describe the layer extraction and 4D DWT stages in more detail.
3 Layer-based segmentation
Data segmentation is the first stage of the proposed method. Here we introduce our segmentation algorithm which achieves accurate results by taking into account the structure of multiview data. We introduce the method by first describing a general segmentation problem and then showing how that solution can be adapted to extract layers from a light field dataset.
3.1 General region-based data segmentation
Consider a general segmentation problem shown in Figure 8. The aim is to partition an m-dimensional dataset into subsets $\mathcal{H}$ and $\overline{\mathcal{H}}$, where the boundary which delimits the two regions is defined by $\Gamma(\sigma)$ with $\sigma \in \mathbb{R}^{m-1}$. This type of problem can be solved using an optimisation framework, where the boundary is obtained by minimising an objective function $J$:
$$\Gamma = \arg\min_{\Gamma} J(\Gamma). \tag{9}$$
The cost function in (9) can be defined using either a boundary or region-based approach. The boundary methods evaluate the cost only on $\Gamma$ and, hence, they are influenced by local data properties and easily affected by noise. In contrast, the region-based methods evaluate the cost function over a complete region and are therefore more robust. A typical region-based cost function [18] can be defined as:
$$J(\Gamma) = \int_{\mathcal{H}} d(\mathbf{x}, \mathcal{H})\, d\mathbf{x} + \int_{\overline{\mathcal{H}}} d(\mathbf{x}, \overline{\mathcal{H}})\, d\mathbf{x} + \eta \int_{\Gamma} d\sigma, \tag{10}$$
where the descriptor $d(\cdot)$ measures the homogeneity of each region and $\mathbf{x} \in \mathbb{R}^m$. The descriptor can be designed such that when $\mathbf{x}$ belongs to the region $\mathcal{H}$, $d(\mathbf{x}, \mathcal{H})$ tends to zero and vice versa. Note also that (10) has an additional regularisation term weighted by $\eta$, which acts to minimise the length of the boundary.
The optimisation problem defined in (10) cannot be solved directly for $\Gamma$. An iterative solution can, however, be obtained by making the boundary a function of an evolution parameter $\tau$. Consider modelling the boundary using a partial differential equation (PDE), also known as an active contour [19]:
$$\frac{\partial \Gamma(\sigma, \tau)}{\partial \tau} = \mathbf{v}(\sigma, \tau) = F(\sigma, \tau)\, \mathbf{n}(\sigma, \tau), \tag{11}$$
where $\mathbf{v}$ is a velocity vector, which can be expressed in terms of a scalar force $F$ acting in the outward normal direction $\mathbf{n}$ to the boundary. The velocity vector $\mathbf{v}$ can be evaluated in terms of the descriptor $d(\cdot)$ by differentiating (10) with respect to $\tau$. Applying the Eulerian framework [18], the derivative can be expressed in terms of boundary integrals:
$$\frac{dJ(\tau)}{d\tau} = \int_{\Gamma} \left[ d(\mathbf{x}, \mathcal{H}) - d(\mathbf{x}, \overline{\mathcal{H}}) + \eta \kappa \right] (\mathbf{v} \cdot \mathbf{n})\, d\sigma, \tag{12}$$
where $\kappa$ is the curvature of the boundary $\Gamma$ and $\cdot$ denotes the dot product. Observe that $\mathbf{v}$ and $\mathbf{n}$ correspond to the velocity and the normal vectors in (11), respectively.
The velocity vector, which evolves $\Gamma$ in the steepest descent direction, can hence be deduced using the Cauchy-Schwarz inequality as:
$$\mathbf{v} = \left[ d(\mathbf{x}, \overline{\mathcal{H}}) - d(\mathbf{x}, \mathcal{H}) - \eta \kappa \right] \mathbf{n}. \tag{13}$$
The above framework is also known as 'competition-based' segmentation. This is clear from (13), where a point on the boundary will experience a positive force if it belongs to the region $\mathcal{H}$ and vice versa, hence evolving the contour in the correct direction. In conclusion, the general segmentation problem can be solved by modelling the boundary $\Gamma$ as a PDE and evolving the contour in the direction of the velocity vector $\mathbf{v}$.
3.2 Multiview image segmentation
In the case of a light field, the goal is to extract $N$ layers, where each volume is modelled by a constant depth $Z_k$ or the associated disparity $\Delta p_k$. In the context of the previous section, this is equivalent to segmenting the data into 4D layers $\{\mathcal{H}_1, \ldots, \mathcal{H}_N\}$, where the boundary of each layer is defined by $\{\Gamma_1, \ldots, \Gamma_N\}$ (the background volume is assigned the residual regions which do not belong to any other layer).
In this setup, $\mathcal{H}_k$ corresponds to a layer which is defined by a contour on one viewpoint and a disparity as outlined in Section 2.3. However, due to occlusions the complete layer will not be visible in the dataset. Therefore, we define the cost function in terms of the visible regions $\mathcal{H}_k^V$, and this leads to the following:
$$J(\Gamma_1, \ldots, \Gamma_N, \Delta p_1, \ldots, \Delta p_N) = \sum_{k=1}^{N} \int_{\mathcal{H}_k^V} d_k(\mathbf{x}, \Delta p_k)\, d\mathbf{x}, \tag{14}$$
where $\mathbf{x} = [x, y, V_x, V_y]^T$ and $\mathcal{H}_k^V$ correspond to the visible regions of each layer.
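Recall that under our assumptions, the intensity along each EPI line is constant. We therefore choose the descriptor
$$d_k(\mathbf{x}, \Delta p_k) = \left[ I(\mathbf{x}) - \mu(\mathbf{x}, \Delta p_k) \right]^2, \tag{15}$$
where $\mu(\mathbf{x}, \Delta p_k)$ is the mean of the EPI line which passes through a point $\mathbf{x}$ and has a disparity $\Delta p_k$.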
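A descriptor of this kind, the squared deviation of each sample from the mean of its EPI line, can be sketched for an EPI volume as follows. This is a minimal illustration assuming integer disparities, views indexed 0, 1, 2, ... along a line, and wrap-around at the image border; the name `epi_line_descriptor` is hypothetical.

```python
import numpy as np

def epi_line_descriptor(volume, disparity):
    """Mean and squared deviation along EPI lines of a (views, rows, cols) volume.

    Every view is shifted back onto the coordinates of view 0, so that samples
    lying on the same EPI line share one pixel position.
    """
    n_views = volume.shape[0]
    aligned = np.stack([np.roll(volume[v], disparity * v, axis=1)
                        for v in range(n_views)])
    mu = aligned.mean(axis=0)      # mean of each EPI line
    d = (aligned - mu) ** 2        # squared deviation of each sample
    return mu, d
```

When the assumed disparity matches the true one, every sample on a line has the same intensity and the descriptor vanishes; a wrong disparity mixes different scene points and the descriptor grows.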
The aim of the segmentation is then to obtain the layer boundaries Γ k and the disparity values Δp k for k = 1,..., N by minimising (14). Observe however that (14) has a large number of unknown variables. In order to minimise the function, we consider the problem of layer evolution and disparity estimation separately and then show how the problem is iteratively solved in Section 3.2.4.
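Assuming the layer disparities are known, the minimisation can be simplified to
$$\min_{\Gamma_1, \ldots, \Gamma_N} \sum_{k=1}^{N} \int_{\mathcal{H}_k^V} d_k(\mathbf{x}, \Delta p_k)\, d\mathbf{x}. \tag{16}$$
One way to minimise (16) is to iteratively evolve the boundary of each layer. For example, assuming there are three volumes $\mathcal{H}_1^V$, $\mathcal{H}_2^V$ and $\mathcal{H}_3^V$ and we choose to evolve the boundary of the first one, the energy function can be expressed as
$$J(\Gamma_1) = \int_{\mathcal{H}_1^V} d_1(\mathbf{x}, \Delta p_1)\, d\mathbf{x} + \int_{\overline{\mathcal{H}_1^V}} d_1^{\text{out}}(\mathbf{x})\, d\mathbf{x},$$
where $d_1^{\text{out}}(\mathbf{x}) = d_i(\mathbf{x}, \Delta p_i)$ when $\mathbf{x} \in \mathcal{H}_i^V$ for $i = 2, 3$.
In general, when evolving the k-th layer, the cost function can be simplified to
$$J(\Gamma_k) = \int_{\mathcal{H}_k^V} d_k(\mathbf{x}, \Delta p_k)\, d\mathbf{x} + \int_{\overline{\mathcal{H}_k^V}} d_k^{\text{out}}(\mathbf{x})\, d\mathbf{x}. \tag{19}$$
A possible solution would then be to evaluate the 4D velocity vector of the boundary corresponding to $\mathcal{H}_k^V$. This approach, however, would not explicitly take into account the structure of multiview data in the minimisation. In the following we show how (19) is solved by imposing the camera setup and the occlusion constraints.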
3.2.1 Imposing camera setup and occlusion constraints
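Recall that the background layer corresponds to the object with the largest depth (smallest disparity). If the boundary of this layer increases, it will automatically be occluded by the remaining layers in the dataset. Therefore, the structure of the visible layers will remain unchanged, and hence the cost must also remain the same. When evolving the k-th layer, we model this by using the following indicator function:
$$O_k(\mathbf{x}) = \begin{cases} 0 & \text{if } \mathbf{x} \in \bigcup_{j=1}^{k-1} \mathcal{H}_j, \\ 1 & \text{otherwise,} \end{cases} \tag{20}$$
where the layers $\{\mathcal{H}_1, \ldots, \mathcal{H}_N\}$ are ordered in terms of increasing depth. Incorporating this into (18) allows the cost to be expressed, up to terms that do not depend on $\Gamma_k$, in terms of $\mathcal{H}_k$ as follows:
$$J(\Gamma_k) = \int_{\mathcal{H}_k} O_k(\mathbf{x})\, d_k(\mathbf{x}, \Delta p_k)\, d\mathbf{x} + \int_{\overline{\mathcal{H}_k}} O_k(\mathbf{x})\, d_k^{\text{out}}(\mathbf{x})\, d\mathbf{x}. \tag{21}$$
Observe that the integration bounds now correctly correspond to the layer boundary $\Gamma_k$. This, therefore, allows the derivative of the cost to be defined as
$$\frac{dJ}{d\tau} = \int_{\Gamma_k} O_k(\mathbf{x}) \left[ d_k(\mathbf{x}, \Delta p_k) - d_k^{\text{out}}(\mathbf{x}) \right] (\mathbf{v}_{\Gamma_k} \cdot \mathbf{n}_{\Gamma_k})\, d\sigma. \tag{22}$$
Additionally, recall that using the camera setup constraint, the boundary $\Gamma_k$ can be parameterised by a 2D contour $\gamma_k(s) = [x(s), y(s)]$ on the image viewpoint $(V_x = 0)$. Substituting this into (22) we obtain
$$\frac{dJ}{d\tau} = \int_{\gamma_k} \left[ D_k(s, \Delta p_k) - D_k^{\text{out}}(s) \right] (\mathbf{v}_{\gamma_k} \cdot \mathbf{n}_{\gamma_k})\, ds, \tag{23}$$
where $\mathbf{v}_{\gamma_k}$ and $\mathbf{n}_{\gamma_k}$ now correspond to the velocity and the outward normal vector of the 2D boundary, respectively. In addition, the new objective functions $D_k(\cdot)$ and $D_k^{\text{out}}(\cdot)$ are defined as
$$D_k(s, \Delta p_k) = \iint O_k(\mathbf{x}(s, V_x, V_y))\, d_k(\mathbf{x}(s, V_x, V_y), \Delta p_k)\, dV_x\, dV_y, \qquad D_k^{\text{out}}(s) = \iint O_k(\mathbf{x}(s, V_x, V_y))\, d_k^{\text{out}}(\mathbf{x}(s, V_x, V_y))\, dV_x\, dV_y, \tag{24}$$
where
$$\mathbf{x}(s, V_x, V_y) = \left[ x(s) - \Delta p_k V_x,\; y(s) - \Delta p_k V_y,\; V_x,\; V_y \right]^T. \tag{25}$$
Note that the new descriptors $D_k^{\text{out}}(\cdot)$ and $D_k(\cdot)$ are simply the descriptors $d_k^{\text{out}}(\cdot)$ and $d_k(\cdot)$ integrated over the viewpoint dimensions.
The velocity vector which reduces the cost in the direction of steepest descent can therefore be chosen as
$$\mathbf{v}_{\gamma_k} = \left[ D_k^{\text{out}}(s) - D_k(s, \Delta p_k) \right] \mathbf{n}_{\gamma_k}. \tag{26}$$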
There are two main advantages in simplifying the evolution from a 4D to a 2D contour. First, the approach ensures that the layer boundary remains consistent across the views. Secondly, the complexity is reduced from evolving a 4D hypersurface to a 2D contour. We show a comparison between an unconstrained and constrained boundary evolution in Figure 9. Observe that by imposing the camera setup and occlusion constraints in Figure 9b we obtain a segmentation which is consistent with the EPI structure. In conclusion, (26) defines a velocity vector, which evolves the layer boundary γ k (s) towards the desired segmentation for each layer.
3.2.2 Disparity and number of layers estimation
In the previous section we presented an approach to derive the velocity vector for each layer. However, the knowledge of the disparities is required in order to evaluate the correct evolution. We evaluate these parameters by assuming the 2D layer contours $\{\gamma_1, \ldots, \gamma_N\}$ are constant. In this case, the objective function can be simplified to:
$$J(\Delta p_1, \ldots, \Delta p_N) = \sum_{k=1}^{N} \int_{\mathcal{H}_k^V} d_k(\mathbf{x}, \Delta p_k)\, d\mathbf{x}. \tag{27}$$
In contrast to the optimisation of the layer contours, this problem is significantly simpler. A solution can be obtained in an iterative approach by estimating the disparity of each layer assuming the remaining disparities are constant. Each parameter is then estimated using the MATLAB nonlinear optimisation toolbox.
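The per-layer disparity estimate can be illustrated with a deliberately simple stand-in for the nonlinear optimiser mentioned above: an exhaustive search over integer candidate disparities that minimises the squared error along the EPI lines inside a fixed layer mask. The search strategy, the integer restriction, the wrap-around border handling and the function names are assumptions of this sketch rather than the article's implementation.

```python
import numpy as np

def epi_cost(volume, mask0, disparity):
    """Squared error along EPI lines, restricted to a layer mask on view 0."""
    n_views = volume.shape[0]
    aligned = np.stack([np.roll(volume[v], disparity * v, axis=1)
                        for v in range(n_views)])
    sq_err = (aligned - aligned.mean(axis=0)) ** 2
    return float(sq_err[:, mask0].sum())

def estimate_disparity(volume, mask0, candidates):
    """Return the candidate disparity with the smallest EPI-line error."""
    costs = [epi_cost(volume, mask0, dp) for dp in candidates]
    return candidates[int(np.argmin(costs))]
```

Because the contour is held fixed, each layer contributes a one-dimensional problem in its own disparity, which is why this step is much cheaper than the contour evolution.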
In addition, observe that we require the knowledge of the number of layers N. In our approach we initialise this value using a stereo match algorithm [20]. Alternatively, one could estimate the number of layers using the spectral properties of the light field [21] as proposed in [22].
3.2.3 Level-set method for the boundary evolution
We have demonstrated an approach to derive the velocity vector for each boundary. We then implement the evolution of the active contours using the level-set method [23].
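This method, instead of directly evolving the 2D boundary, implicitly models the curve using a higher dimensional surface $z = \phi(x, y, \tau)$. The original boundary is then defined as the zero-level of the new function
$$\gamma(s, \tau) = \{ (x, y) : \phi(x, y, \tau) = 0 \}, \tag{28}$$
where $s$ parameterises the $(x, y)$ coordinates on the 2D boundary.
The evolution equation of the surface can then be derived as follows. First, by implicitly differentiating $\phi(\gamma(s, \tau), \tau) = 0$ with respect to $\tau$ we obtain
$$\frac{\partial \phi}{\partial \tau} + \nabla \phi \cdot \frac{\partial \gamma}{\partial \tau} = 0, \tag{29}$$
where $\nabla$ is the gradient operator. Second, observe that the normal to the surface $\phi(x, y, \tau)$ evaluated on the boundary $\gamma(s, \tau)$ corresponds to the outward normal vector of the boundary $\mathbf{n}_\gamma$. This implies that
$$\mathbf{n}_\gamma = \frac{\nabla \phi}{|\nabla \phi|}. \tag{30}$$
Combining (29) and (30) with the original boundary model $\partial \gamma / \partial \tau = F\, \mathbf{n}_\gamma$, we obtain the level-set evolution equation [23]
$$\frac{\partial \phi}{\partial \tau} = -F\, |\nabla \phi|. \tag{31}$$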
There are two main advantages to using the level-set method. First, the surface implicitly models any topological changes of the boundary. Second, unlike other parameterisation schemes, the approach does not suffer from instability issues since (31) is evaluated on a fixed Cartesian grid.
The evolution of the level-set method does however have a drawback in terms of increased complexity. To evolve the surface, the velocity vector must be evaluated at every position on the grid. In our approach, we deal with this problem by using the narrowband implementation [24], where only a region around the boundary is evolved instead of the complete surface.
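A single explicit update of a level-set surface on a fixed grid can be sketched as below. This is only an illustration of the standard update phi <- phi - dt * F * |grad phi|: it uses central differences instead of an upwind scheme, updates the whole grid rather than a narrow band, and assumes phi is negative inside the contour. The constant force and the grid size are arbitrary choices for the example.

```python
import numpy as np

def level_set_step(phi, force, dt):
    """One explicit Euler step of d(phi)/d(tau) = -F * |grad(phi)|."""
    gy, gx = np.gradient(phi)
    return phi - dt * force * np.sqrt(gx ** 2 + gy ** 2)

# Signed distance to a circle of radius 10 (negative inside the contour).
yy, xx = np.mgrid[0:64, 0:64]
phi = np.sqrt((xx - 32.0) ** 2 + (yy - 32.0) ** 2) - 10.0

area_before = int((phi < 0).sum())
for _ in range(10):
    phi = level_set_step(phi, force=1.0, dt=0.5)   # positive force: contour expands
area_after = int((phi < 0).sum())
```

In a segmentation setting the scalar force would vary over the grid and change sign, so parts of the contour grow while others shrink; a narrow-band variant would restrict the update to grid points near the zero level.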
3.2.4 Layer segmentation algorithm overview
An overview of the complete layer extraction algorithm is shown in Algorithm 1. First, the 2D contours and the disparity of each layer are initialised using a stereo matching algorithm [20]. The algorithm evaluates the disparity of each layer and then iteratively evolves the boundaries using the proposed velocity vector in (26). This process continues for a certain number of iterations or until the change in the overall cost is below a predefined threshold.
An example of the extracted layers using the method outlined in Algorithm 1 is shown in Figure 1. In addition, in Figure 10 we show a comparison between an initialised layer-boundary using the stereo matching algorithm and the final layer contour.
To obtain a sparse representation, the obtained layers are decomposed using a 4D DWT as explained in the following section.
4 Data decomposition
In this stage, the redundancy of the texture in each layer shown in Figure 1 is reduced using a multi-dimensional wavelet transform. In the following, we present the inter-view and the spatial transforms in more detail.
Algorithm 1 Layer extraction algorithm
STEP 1: Initialise the 2D boundary of each layer $\{\gamma_1, \gamma_2, \ldots, \gamma_N\}$ using a stereo matching algorithm (Algorithm [20] in our implementation).
STEP 2: Estimate the disparity of each layer $\{\Delta p_1, \Delta p_2, \ldots, \Delta p_N\}$ by minimising the squared error along the EPI lines.
STEP 3: Reorder the layers in terms of increasing depth.
STEP 4: Iteratively evolve the layer boundaries assuming the remaining layers are constant:
    for k = 1 to N-1 do
        Evaluate the velocity vector $\mathbf{v}_{\gamma_k}$ of the k-th layer.
        Evolve the boundary $\gamma_k$ according to the velocity vector.
    end for
STEP 5: Return to STEP 2 or exit algorithm if the change in the cost (14) is below a predefined threshold.
4.1 Inter-view 2D DWT
We implement the inter-view 2D DWT on each layer in two steps: first by applying a 1D disparity compensated DWT across the row images (V y ) followed by the column images (V x ) as illustrated in Figure 11. The process is iterated on the low-pass components to obtain a multiresolution decomposition.
In our implementation of the 1D DWT we use the disparity compensated Haar transform. This is motivated by the fact that the light field intensity along the EPI lines is constant. Therefore, a wavelet with one vanishing moment is enough to obtain a sparse representation. It is applied by modifying the standard lifting equations [17] and including a warping operator as follows:
where $P_o[n]$ and $P_e[n]$ represent 2D images with spatial coordinates $(x, y)$ located at odd $(2n + 1)$ and even $(2n)$ camera locations, respectively. Following (32), $L_e[n]$ contains the 2D low-pass subband and $H_o[n]$ the high-pass subband. Assuming that the warping operator $\mathcal{W}$ is invertible and the images are spatially continuous, the above transform can be shown to be equivalent to the standard DWT applied along the motion trajectories [25].
In both the prediction and update steps in (32), the warping operator $\mathcal{W}$ is chosen to maximise the inter-image correlation. This is achieved by using a projective operation that maps one image onto the same viewpoint as its odd/even complement in the lifting step. Using (4) and the fact that the layers are modelled by a constant disparity, we define the warping operation from viewpoint $n_1$ to $n_2$ along the $V_x$ dimension as:
$$\mathcal{W}_{n_1 \to n_2}\{P\}(x, y) = P\big(x + \Delta p\,(n_2 - n_1),\; y\big), \tag{33}$$
where $\Delta p$ is the layer disparity.
Note that in the case of an occlusion, the DWT leads to filtering across an artificial boundary and, thus, results in a reduced sparsity efficiency. To prevent this, we use the concept proposed in [26] to create a shape-adaptive transform in the view domain. The transform in (32) is modified whenever a pixel at an even or odd location is occluded such that
and the corresponding high-pass coefficient in $H_o[n]$ is set to zero. In (34), the warping operator is set to an integer pixel precision to ensure invertibility and is set to be the ceiling of the disparity in (33).
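The structure of a disparity-compensated Haar lifting step can be sketched as follows. This is not the article's exact transform: it assumes an unnormalised Haar lifting (predict, then update with weight 1/2), an integer disparity between neighbouring views, wrap-around at the border, and no occlusion handling. The function names are hypothetical.

```python
import numpy as np

def warp(image, shift):
    """Integer horizontal warp (wraps around at the image border)."""
    return np.roll(image, shift, axis=1)

def haar_forward(p_even, p_odd, disparity):
    """Predict the odd view from the warped even view, then update the even view."""
    high = p_odd - warp(p_even, -disparity)      # prediction residual (high-pass)
    low = p_even + 0.5 * warp(high, disparity)   # update step (low-pass)
    return low, high

def haar_inverse(low, high, disparity):
    """Undo the lifting steps in reverse order."""
    p_even = low - 0.5 * warp(high, disparity)
    p_odd = high + warp(p_even, -disparity)
    return p_even, p_odd
```

When the odd view is an exact disparity-shifted copy of the even view, the high-pass image is zero and all the energy sits in the low-pass image. The lifting structure also keeps the transform invertible for any pair of images, since each step is undone by subtracting exactly what was added.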
4.2 Spatial shape-adaptive 2D DWT
Following the inter-view transform we reduce the intra-view redundancy by using a 2D DWT. However, prior to applying the 2D DWT on each image, we recombine the transform coefficients into a single layer. This is done to increase the number of decompositions which can be applied by the spatial transform. A comparison between the original and recombined layers is illustrated in Figure 12. Note that due to occlusions and the way in which the inter-view transform is implemented, two or more layers may overlap in each subband. In this case, we apply a separate spatial transform to the overlapped pixels.
Note that the overlapped pixels are commonly bounded by an irregular (non-rectangular) shape. For that reason, the standard 2D DWT applied to the entire spatial domain is inefficient due to the boundary effect. We therefore use the shape-adaptive DWT [26] within arbitrarily shaped objects. The method reduces the magnitude of the high pass coefficients by symmetrically extending the texture whenever the wavelet filter is crossing the boundary. The 2D DWT is built as a separable transform with linear-phase symmetric wavelet filters (9/7 or 5/3 [27]), which, together with the symmetric signal extensions, leads to critically sampled transform subbands.
5 Evaluation
In this section we evaluate the performance of the proposed sparse representation using its nonlinear approximation properties. In addition, we demonstrate de-noising and IBR applications based on the proposed decomposition.
5.1 Nonlinear approximation
We evaluate the sparseness of the representation using its N-term nonlinear approximation. To implement the nonlinear approximation, we keep the N largest coefficients in the transform domain, reconstruct the data and evaluate the data fidelity in terms of PSNR.
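The N-term nonlinear approximation and the PSNR measure can be sketched as below. The sketch operates on an array of transform coefficients and leaves the inverse transform to the caller; the peak value of 255 assumes 8-bit intensities, and the function names are illustrative.

```python
import numpy as np

def n_term_approximation(coeffs, n_keep):
    """Keep the n_keep largest-magnitude coefficients and zero the rest."""
    flat = coeffs.ravel()
    kept = np.zeros_like(flat)
    if n_keep > 0:
        idx = np.argpartition(np.abs(flat), -n_keep)[-n_keep:]
        kept[idx] = flat[idx]
    return kept.reshape(coeffs.shape)

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((reference - estimate) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))
```

A typical use is to transform the data, call `n_term_approximation` for a range of `n_keep` values, invert the transform and plot `psnr` against the number of retained coefficients; a sparser representation gives a higher curve.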
Our results show that the proposed layer-based representation offers superior approximation properties compared to a typical multi-dimensional DWT. We demonstrate this in Figure 13 on three datasets: Tsukuba light field [272 × 368 × 4 × 4], Teddy EPI [368 × 352 × 4] and Doll EPI [368 × 352 × 4] (all from [28]), which vary in terms of scene complexity, number of images and spatial resolution. We show that in each case our approach achieves a sparser representation across the complete range of retained coefficients, with PSNR gains of up to 7 dB on the Tsukuba light field. The Tsukuba light field has a larger PSNR improvement than the respective Teddy and Doll EPI volumes due to the additional viewing dimension. This means that there exists more redundant information and this is fully exploited by our representation. We also show that the PSNR curves correspond to a subjective improvement in Figure 14.
We note that the nonlinear approximation metric is also a good indicator of the compression capability of the representation. In practice, the issue of compression is more complicated due to the additional problems of encoding the locations of the significant coefficients and of rate allocation. These issues are beyond the scope of this paper; however, we refer the reader to [29], where these problems are addressed and a complete multiview image compression method is presented.
5.2 De-noising
Here we demonstrate de-noising results based on the proposed sparse representation in the presence of additive white Gaussian noise (AWGN). We implement the de-noising by soft thresholding the wavelet coefficients in each subband. For each subband, the threshold is chosen by minimising the Stein's Unbiased Risk Estimate (SURE) of the mean squared error (MSE) [30].
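Soft thresholding with a SURE-selected threshold can be sketched for a single subband as follows. The risk expression is the standard SURE formula for soft thresholding under Gaussian noise of known standard deviation `sigma`; the brute-force search over candidate thresholds and the function names are choices made for this illustration, and whether they match the article's implementation is an assumption.

```python
import numpy as np

def soft_threshold(y, t):
    """Shrink coefficients towards zero by t."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def sure_soft(y, t, sigma):
    """SURE of the total squared error of soft thresholding at threshold t."""
    a = np.abs(y)
    return (y.size * sigma ** 2
            - 2.0 * sigma ** 2 * np.count_nonzero(a <= t)
            + np.sum(np.minimum(a, t) ** 2))

def sure_threshold(y, sigma):
    """Pick the threshold that minimises SURE among the coefficient magnitudes."""
    candidates = np.concatenate(([0.0], np.sort(np.abs(y).ravel())))
    risks = [sure_soft(y, t, sigma) for t in candidates]
    return float(candidates[int(np.argmin(risks))])
```

The search is quadratic in the subband size, which is acceptable for a sketch. The sparser the subband, the more coefficients fall below the chosen threshold, so a sparser representation benefits more from this kind of shrinkage.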
Note that the aim of this section is not to compare the results to the state-of-the-art in multiview de-noising techniques but to demonstrate that the sparse representation can be used for de-noising applications. In Figure 15 we compare our algorithm to the competitive SURE-LET OWT de-noising method [31] applied to each image independently. In this setup, we assume the noise is added to the extracted layers. Analysing the Tsukuba light field, Teddy EPI and Doll EPI datasets, our approach corresponds to a PSNR improvement of up to 2 dB. The light field has the most significant gain due to a sparser representation, which results from the larger number of images in the dataset.
The subjective results are illustrated in Figure 16 and these clearly show that the proposed sparse representation attains more visually pleasing results than the SURE-LET OWT method.
5.3 Image-based rendering
In this section we present viewpoint interpolation results based on the layer-based representation shown in Figure 1.
To render an image at an arbitrary viewpoint, we linearly interpolate the closest available images. Recall that the data pixels are highly correlated in the direction of the disparity. We take this into account by modifying the support of the rendering kernel according to the disparity of each layer. Additionally, we modify the interpolation in the presence of occlusions to further improve the results. In this case, only pixels that belong to the layer are used in the rendering process.
In order to obtain an objective evaluation, we use the leave-one-out approach. In this case the images located at odd camera viewpoint locations are removed and synthesised using the scene model (see endnote i).
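The evaluation protocol can be sketched as follows. Two assumptions are made here that the text does not state: viewpoints are indexed from zero, so that every removed view lies between two retained ones, and the SNR is the ratio of reference energy to error energy in decibels. The helper names `leave_one_out_split` and `snr_db` are hypothetical.

```python
import numpy as np


def leave_one_out_split(num_views):
    """Split viewpoint indices into retained (even) and removed (odd) sets.

    Assumption: zero-based indexing of the camera positions.
    """
    idx = np.arange(num_views)
    return idx[idx % 2 == 0], idx[idx % 2 == 1]


def snr_db(reference, estimate):
    """SNR in dB, assumed here to be reference energy over error energy."""
    reference = np.asarray(reference, dtype=float)
    error = reference - np.asarray(estimate, dtype=float)
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))


# Usage: for a 7-view EPI dataset, views 1, 3 and 5 would be synthesised
# from views 0, 2, 4 and 6 and compared against the held-out originals.
kept, removed = leave_one_out_split(7)
```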
We compare our results to a state-of-the-art stereo matching algorithm [32] and an EPI tubes extraction method [33]. These methods specify the structure of the EPI lines, and the interpolation is implemented using the same approach as in the proposed algorithm.
In Table 1 we show a comparison on four datasets: Dwarves EPI [555 × 695 × 7] [28], Lobby EPI [800 × 800 × 5], Desk light field [500 × 500 × 4 × 4] and Animal Farm EPI [235 × 625 × 32] (last three from [34]). Observe that the layer-based representation achieves an improved SNR in comparison to both the stereo matching and the EPI tubes extraction method on all datasets.
In addition to the quantitative evaluation, we show a subjective comparison in Figure 17. This shows that interpolation using the proposed layer-based representation achieves significantly improved results. We note that our method produces artifact-free and photo-realistic images. In comparison, aliasing artifacts are present in the results of the stereo matching and EPI tube extraction methods. This is due to incorrect compensation of the interpolation kernel, which stems from inaccurate depth correspondence in the scene.
6 Conclusion
We presented a novel method to obtain a sparse representation of multiview images. The fundamental component of the algorithm is the layer-based representation, which partitions the multiview images into a set of layers each related to a constant depth in the scene. We presented a novel method to obtain the layer-based representation using a general segmentation framework which takes into account the structure of multiview data to achieve accurate results. The obtained layers are then decomposed using a 4D DWT applied in a separable approach, first across the camera viewpoint and then the image dimensions. We modify the viewpoint transform to efficiently deal with occlusions and depth variations. Simulation results based on nonlinear approximation have shown that the sparsity of our representation is superior to a multi-dimensional DWT with the same decomposition structure without disparity compensation. In addition, we have shown that the proposed representation can be used to efficiently synthesise novel viewpoints for IBR applications and also de-noise multiview images in the presence of AWGN.
Endnotes
a. Each camera in the setup is modelled by the pinhole model [35].
b. Light ray intensity is constant when an object is observed from a different angle.
c. By visible regions we mean the EPI line segments which are present in the EPI volume.
d. We have not included the regularisation terms for the sake of clarity.
e. It can be shown that given a fronto-parallel depth plane, the inner product vγk · nγk is equal to vΓk · nΓk.
f. Note that in practice we also include a regularisation term to constrain the evolution according to the curvature of the boundary.
g. Note that the background layer is automatically assigned all of the regions which do not belong to the remaining layers and is therefore not evolved.
h. This multi-dimensional DWT has the same decomposition structure as our method, but no disparity compensation.
i. The extracted layers are obtained using the dataset with the removed images.
References
Taubman D, Marcellin M: JPEG2000 Image Compression Fundamentals, Standards and Practice. Kluwer Academic Publishers, Boston; 2004.
Candès EJ, Donoho DL: Curvelets: a surprisingly effective nonadaptive representation of objects with edges. Curve and Surface Fitting, University Press, Saint-Malo; 2000.
Do MN, Vetterli M: The contourlet transform: an efficient directional multiresolution image representation. IEEE Trans Image Process 2005, 14(12):2091-2106.
Do MN, Vetterli M: The finite ridgelet transform for image representation. IEEE Trans Image Process 2003, 12(1):16-28. 10.1109/TIP.2002.806252
Velisavljevic V, Beferull-Lozano B, Vetterli M, Dragotti PL: Directionlets: anisotropic multidirectional representation with separable filtering. IEEE Trans Image Process 2006, 15(7):1916-1933.
Le Pennec E, Mallat S: Sparse geometric image representations with bandelets. IEEE Trans Image Process 2005, 14(4):423-438.
Le Pennec E, Mallat S: Bandelet image approximation and compression. SIAM J Multiscale Model Simul 2005, 4(3):992-1039. 10.1137/040619454
Bayram I, Selesnick IW: On the dual-tree complex wavelet packet and M-band transforms. IEEE Trans Signal Process 2008, 56(6):2298-2310.
Selesnick IW, Baraniuk RG, Kingsbury NG: The dual-tree complex wavelet transform. IEEE Signal Process Mag 2005, 22(6):123-151.
Bruckstein AM, Donoho DL, Elad M: From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev 2009, 51: 34-81. 10.1137/060657704
Do MN, Nguyen QH, Nguyen HT, Kubacki D, Patel SJ: Immersive visual communication. IEEE Signal Process Mag 2011, 28(1):58-66.
Kubota A, Smolic A, Magnor M, Tanimoto M, Chen T, Zhang C: Multiview imaging and 3DTV. IEEE Signal Process Mag 2007, 24(6):10-21.
Zhang C, Chen T: A survey on image-based rendering-representation, sampling and compression. Signal Process Image Commun 2004, 19: 1-28. 10.1016/j.image.2003.07.001
Adelson EH, Bergen JR: The plenoptic function and the elements of early vision. In Computational Models of Visual Processing. MIT Press, Cambridge; 1991:3-20.
Levoy M, Hanrahan P: Light field rendering. In Proceedings of Computer Graphics (SIGGRAPH). New Orleans, Louisiana; 1996:31-42.
Bolles R, Baker H, Marimont D: Epipolar-plane image analysis: an approach to determining structure from motion. Int J Comput Vis 1987, 1(1):7-55. 10.1007/BF00128525
Daubechies I, Sweldens W: Factoring wavelet transforms into lifting steps. J Fourier Anal Appl 1998, 4(3):247-269. 10.1007/BF02476026
Jehan-Besson S, Barlaud M, Aubert G: DREAM2S: Deformable regions driven by an eulerian accurate minimization method for image and video segmentation. Int J Comput Vis 2003, 53: 45-70. 10.1023/A:1023031708305
Kass M, Witkin A, Terzopoulos D: Snakes: Active contour models. Int J Comput Vis 1988, 1(4):321-331. 10.1007/BF00133570
Kolmogorov V, Zabih R: Multi-camera scene reconstruction via graph cuts. In Proceedings of the 7th European Conference on Computer Vision-Part III (ECCV). Springer-Verlag, Copenhagen, Denmark; 2002:82-96.
Chai JX, Tong X, Chan SC, Shum HY: Plenoptic sampling. In Proceedings of Computer Graphics (SIGGRAPH). ACM Press/Addison-Wesley Publishing Co., New York; 2000:307-318.
Berent J, Dragotti PL, Brookes M: Adaptive layer extraction for image based rendering. In Proceedings of IEEE Workshop on Multimedia Signal Processing (MMSP). Rio De Janeiro, Brazil; 2009:266-271.
Sethian JA: Level Set Methods. Cambridge University Press, Cambridge; 1996.
Hötter M: Object-oriented analysis-synthesis coding based on moving two-dimensional objects. Signal Process Image Commun 1990, 2(4):409-428. 10.1016/0923-5965(90)90027-F
Secker A, Taubman D: Lifting-based invertible motion adaptive transform (LIMAT) framework for highly scalable video compression. IEEE Trans Image Process 2003, 12(12):1530-1542. 10.1109/TIP.2003.819433
Li S, Li W: Shape-adaptive discrete wavelet transforms for arbitrarily shaped visual object coding. IEEE Trans Circ Syst Video Technol 2000, 10(5):725-743. 10.1109/76.856450
Unser M, Blu T: Mathematical properties of the JPEG2000 wavelet filters. IEEE Trans Image Process 2003, 12(9):1080-1090. 10.1109/TIP.2003.812329
Scharstein D, Szeliski R: Middlebury datasets. [http://www.vision.middlebury.edu/stereo/data]
Gelman A, Dragotti PL, Velisavljević V: Multiview image compression using a layer-based representation. In Proceedings of the IEEE International Conference on Image Processing (ICIP). Hong Kong, China; 2010:13-16.
Donoho D, Johnstone IM: Adapting to unknown smoothness via wavelet shrinkage. J Am Stat Assoc 1995, 90: 1200-1224. 10.2307/2291512
Blu T, Luisier F: The SURE-LET approach to image denoising. IEEE Trans Image Process 2007, 16(11):2778-2786.
Ogale AS, Aloimonos Y: Shape and the stereo correspondence problem. Int J Comput Vis 2005, 65(3):147-162. 10.1007/s11263-005-3672-3
Criminisi A, Kang SB, Swaminathan R, Szeliski R, Anandan P: Extracting layers and analyzing their specular properties using epipolar-plane-image analysis. Microsoft Research Technical Report MSR-TR-2002-19; 2002.
Berent J: Coherent multi-dimensional segmentation of multi-view images using a variational framework and applications to image based rendering. PhD Thesis, Imperial College; 2008. [http://www.commsp.ee.ic.ac.uk/~pld/group/Thesis_Berent_08.pdf]
Hartley RI, Zisserman A: Multiple View Geometry in Computer Vision. 2nd edition. Cambridge University Press, Cambridge; 2004. ISBN:0521540518
Shum HY, Kang SB: A review of image-based rendering techniques. IEEE SPIE Vis Commun Image Process (VCIP) 2000, 213: 1-12.
Acknowledgements
We would like to thank the anonymous reviewers whose constructive comments led to an improved manuscript.
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Gelman, A., Berent, J. & Dragotti, P.L. Layer-based sparse representation of multiview images. EURASIP J. Adv. Signal Process. 2012, 61 (2012). https://doi.org/10.1186/1687-6180-2012-61
DOI: https://doi.org/10.1186/1687-6180-2012-61