Layer-based sparse representation of multiview images

Gelman, Andriy; Berent, Jesse; Dragotti, Pier Luigi

doi:10.1186/1687-6180-2012-61

Published: 09 March 2012


EURASIP Journal on Advances in Signal Processing volume 2012, Article number: 61 (2012)


Abstract

This article presents a novel method to obtain a sparse representation of multiview images. The method is based on the fact that multiview data is composed of epipolar-plane image lines which are highly redundant. We extend this principle to obtain the layer-based representation, which partitions a multiview image dataset into redundant regions (which we call layers) each related to a constant depth in the observed scene. The layers are extracted using a general segmentation framework which takes into account the camera setup and occlusion constraints. To obtain a sparse representation, the extracted layers are further decomposed using a multidimensional discrete wavelet transform (DWT), first across the view domain followed by a two-dimensional (2D) DWT applied to the image dimensions. We modify the viewpoint DWT to take into account occlusions and scene depth variations. Simulation results based on nonlinear approximation show that the sparsity of our representation is superior to the multi-dimensional DWT without disparity compensation. In addition we demonstrate that the constant depth model of the representation can be used to synthesise novel viewpoints for immersive viewing applications and also de-noise multiview images.

1 Introduction

The notion of sparsity, namely the idea that the essential information contained in a signal can be represented with a small number of significant components, is widespread in signal processing and data analysis in general. Sparse signal representations are at the heart of many successful signal processing applications, such as signal compression and de-noising. In the case of images, successful new representations have been developed on the assumption that the data is well modelled by smooth regions separated by edges or regular contours. Besides wavelets, which have been successful for image compression [1], other examples of dictionaries that provide sparse image representations are curvelets [2], contourlets [3], ridgelets [4], directionlets [5], bandlets [6, 7] and complex wavelets [8, 9]. We refer the reader to a recent overview article [10] for a more comprehensive review on the theory of sparse signal representation.

In parallel with, and somewhat independently of, these developments, there has been a growing interest in the capture and processing of multiview images. The popularity of this approach has been driven by the advent of novel exciting applications such as immersive communication [11] or free-viewpoint and three-dimensional (3D) TV [12]. At the heart of these applications is the idea that a novel arbitrary photorealistic view of a real scene can be obtained by proper interpolation of existing views. The problem of synthesising a novel image from a set of multiview images is known as image-based rendering (IBR) [13].

Multiview data sets are inherently multi-dimensional. In the most general case multiview images can be parameterised using a single 7D function called the plenoptic function [14]. The dimensions, however, can be reduced by making some simplifying assumptions as discussed in the next section. In particular, the assumption that a camera can move only along two directions leads to the 4D light field parameterisation [15]. If the camera moves only along a straight line the 3D epipolar-plane image (EPI) volume is obtained. We will discuss and use these two parameterisations throughout the article.

Intuitively, in the case of a multiview image array which captures the same scene from different locations, a significantly sparser representation can be obtained than by analysing each image independently. When dealing with multiview images, however, the data model must take into account appearing (disocclusions) and disappearing (occlusions) objects. This nonlinear property means that finding a sparse representation is inherently more difficult than in the two-dimensional (2D) case.

For this reason, in this article we propose a hybrid method to obtain a sparse representation of multiview images. The fundamental component of the algorithm is the layer-based representation. In many situations, it is possible to divide the observed scene into a small number of depth layers that are parallel to the direction of camera motion. The layer-based representation partitions the multiview images into a set of layers each related to a constant depth in the observed scene. See Figure 1 for a visual example of the partition. We present a novel method to extract these regions, which takes into account the structure of multiview data to achieve accurate results. In the case of the 4D light field, the sparse representation of the data is then obtained by taking a 4D discrete wavelet transform (DWT) of each depth layer. First we take a view compensated DWT along the two view directions, then the 2D separable spatial DWT is taken. This new representation is sparser than a standard separable DWT, as shown by nonlinear approximation results. In addition, we present IBR and de-noising applications based on the extracted layers.

The article is organised as follows. In Section 2 we review the structure of multiview data, discuss the layer-based representation and present a high-level overview of our proposed method. In Section 3 we present the layer extraction algorithm. The multi-dimensional DWT is discussed in Section 4. We finally evaluate the proposed sparse representation in Section 5 and conclude in Section 6.

2 Multiview data structure

We start by introducing the plenoptic function and the structure of multiview data. In addition we present a layer-based representation that exploits the multiview structure to partition the data into volumes each related to a constant depth in the scene.

2.1 Plenoptic function

In the IBR framework, multiview images form samples of a multi-dimensional structure called the plenoptic function [14]. Introduced by Adelson and Bergen, this function parameterises each light ray with a 3D point in space (V_x, V_y, V_z) and its direction of arrival (θ,ϕ). Two further variables λ and t are used to specify the wavelength and time, respectively. In total the plenoptic function is therefore seven dimensional:

I = P_{7} (V_{x}, V_{y}, V_{z}, θ, ϕ, λ, t),

(1)

where I corresponds to the light ray intensity.

In practice, however, it is not feasible to store, transmit or capture the 7D function. A number of simplifications are therefore applied to reduce its dimensionality. Firstly, it is common to drop the λ parameter and instead deal with either the monochromatic intensity or the red, green, blue (RGB) channels separately. Secondly, the light rays can be recorded at a specific moment in time, thus dropping the t parameter. This simplification can for example be applied when viewing a stationary scene. The resulting object is a 5D function.

A popular parameterisation of the plenoptic function, known as the light field [15], defines each light ray by its intersection with a camera plane and a focal plane:

I = P_{4} (V_{x}, V_{y}, x, y),

(2)

where, as illustrated in Figure 2, (V_x, V_y) and (x, y) correspond to the coordinates of the camera and the focal plane, respectively. Observe that the dataset can be analysed as a 2D array of images, where each image is formed by the light rays which pass through a specific point on the camera plane. In Figure 3 we illustrate an example of a light field with 16 camera locations. The camera positions are evenly spaced on a 2D grid (V_x, V_y).

The light field can be further simplified by setting the 2D camera plane to a line. This is also known as the EPI volume [16]:

I = P_{3} (V_{x}, x, y) .

(3)

In comparison to the light field, the EPI is easier to visualise and in the following sections we use it to present a number of concepts. All of the properties are however easily generalised to the light field. Next, we review the EPI and light field structure and present the layer-based representation.

2.2 EPI and light field structure

In this section we show that an EPI volume and a light field are structured datasets. By structure we mean that the fundamental components of multiview images are lines along which the intensity of the pixels is constant. This concept is shown in Figure 4c. This illustration is obtained by stacking an array of images into a volume and taking a cross section through the dataset. It can be clearly observed that pixels are redundant along lines of varying gradients. Each such line, along which the intensity of the volume is constant, is known as an EPI line.

In order to demonstrate why the fundamental components of multiview images are EPI lines, consider the setup in Figure 4a. Here we show a simplified version of the scene: the horizontal axis corresponds to the camera location line; the line parallel to it defines the focal plane of each camera^a; and the vertical axis defines the depth of the scene. The curved line thus corresponds to the surface of the object.

Given this setup consider a point in space with coordinates (X, Y, Z). Assuming a Lambertian scene^b this point will appear in each of the images with coordinates

x = \frac{f X}{Z} - \frac{f V_{x}}{Z},

(4)

y = \frac{f Y}{Z},

(5)

where f is the focal length. As illustrated in Figure 4b, x is linearly related to the camera location V_x. The rate of change in the pixel location, also known as the disparity $Δ p = \frac{f}{Z}$ , is inversely related to the depth of the object. Thus, a point in space maps to a line in the EPI volume.
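The linear relationship in (4) can be checked numerically. The following is a minimal sketch, assuming a hypothetical focal length and two hypothetical scene points at different depths; the function name `epi_line` and all numeric values are illustrative and not taken from the article.

```python
import numpy as np

def epi_line(X, Z, f, camera_positions):
    """Image x-coordinate of a scene point in every view, following (4)."""
    V = np.asarray(camera_positions, dtype=float)
    return f * X / Z - (f / Z) * V

# Hypothetical values, chosen only for illustration.
f = 2.0
V = np.arange(4)                      # camera positions V_x = 0, 1, 2, 3
near = epi_line(X=1.0, Z=2.0, f=f, camera_positions=V)
far = epi_line(X=1.0, Z=4.0, f=f, camera_positions=V)

# The step between consecutive views is constant, with magnitude f / Z,
# so the nearer point traces the steeper EPI line.
near_step = np.diff(near)             # every entry equals -f/Z = -1.0
far_step = np.diff(far)               # every entry equals -f/Z = -0.5
```

The nearer point moves twice as fast across the views as the farther one, which is the inverse relation between disparity and depth described above.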

However, the above analysis does not take into account occlusions. Clearly when two lines intersect, the EPI line corresponding to a smaller depth (larger disparity) will occlude all the EPI lines which are related to a larger depth (smaller disparity) in the scene. This principle is illustrated in Figure 4c.

The above concepts can also be extended to the light field, where the camera is allowed to move along two dimensions (V_x, V_y). In this case, a point (X, Y, Z) maps onto a 2D plane as

(\begin{matrix} X \\ Y \\ Z \end{matrix}) \to (\begin{matrix} x \\ y \\ V_{x} \\ V_{y} \end{matrix}) = (\begin{matrix} (X - V_{x}) f / Z \\ (Y - V_{y}) f / Z \\ V_{x} \\ V_{y} \end{matrix}) .

(6)

2.3 Layer-based representation

The layer-based representation is an extension of the EPI line concept. The representation partitions the multiview data into homogeneous regions, where each layer is a collection of EPI lines modelled by a constant depth plane. An example of a layer-based representation is shown in Figure 1.

Consider a set of EPI lines modelled by a constant disparity Δp_k as shown in Figure 5a. We denote the layer carved out by the EPI lines by $ℋ_{k}$ and the boundary which delimits the region by Γ_k. Assuming there are no occlusions, observe that using (4) and (5), the boundary Γ_k can be defined by a contour on one of the viewpoints projected to the remaining frames. More specifically, if we define the contour γ_k (s) = [x (s), y (s)] to be the boundary on the viewpoint (V_x = 0), we obtain the relationship

Γ_{k} (s, V_{x}) = (\begin{matrix} x (s) - Δ p_{k} V_{x} \\ y (s) \\ V_{x} \end{matrix}),

(7)

where s parameterises the contour γ_k (s).

In order to take into account occlusions, we can use the same principles as in the case of EPI lines; a layer will be occluded when it intersects with other layers which are related to a smaller depth in the scene. We illustrate this in Figure 5, which shows that when two layers intersect we obtain their visible representations^c $ℋ_{k - 1}^{V}$ and $ℋ_{k}^{V}$ . In this example the layers are ordered in terms of increasing depth (i.e., $ℋ_{k}$ corresponds to a larger depth than $ℋ_{k - 1}$ ). In general, the visible regions of a layer can be defined as

ℋ_{k}^{V} = ℋ_{k} ⋂ \bar{(⋃_{j = 1}^{k - 1} ℋ_{j})} .

(8)

We illustrate the layers $ℋ_{k}$ and $ℋ_{k}^{V}$ from the Animal Farm dataset in Figure 6.
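The visibility rule in (8) can be sketched with boolean masks. This is an illustrative sketch only: it assumes each layer is given as a boolean array over the same sample grid, ordered by increasing depth, and the helper name `visible_layers` is hypothetical.

```python
import numpy as np

def visible_layers(layers):
    """Visible part of each layer per (8).

    `layers` is a list of boolean masks ordered by increasing depth
    (nearest layer first). A sample of layer k is visible only if it is
    not covered by any nearer layer j < k.
    """
    occupied = np.zeros_like(np.asarray(layers[0], dtype=bool))
    visible = []
    for mask in layers:
        mask = np.asarray(mask, dtype=bool)
        visible.append(mask & ~occupied)   # H_k minus the union of nearer layers
        occupied |= mask
    return visible

# Two overlapping 1D layers: the nearer one hides part of the farther one.
h1 = np.array([0, 1, 1, 0, 0], dtype=bool)
h2 = np.array([0, 0, 1, 1, 1], dtype=bool)
v1, v2 = visible_layers([h1, h2])
```

The nearest layer is always fully visible, while the overlapping sample is removed from the farther layer.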

There are a number of advantages to segmenting a multiview dataset into layers. Firstly, each layer is highly redundant in the direction of the disparity Δp. This is due to the fact that each layer consists of EPI lines modelled by a constant depth. Secondly, any occluded regions are explicitly defined by the representation. These regions correspond to artificial boundaries, and their specific locations can be used to design a transform which takes them into account. Thirdly, the boundary of each layer can be efficiently defined by a contour on one viewpoint γ (s) and its disparity Δp. This is important if a compression algorithm based on the sparse representation is to be implemented, where the segmentation of each layer must also be transmitted.

2.4 Sparse representation method high-level overview

We use the above analysis to develop a new method that provides sparse representations of multiview images. The method is outlined in Figure 7. The first step of the method is to obtain a layer-based representation. As highlighted in Section 2.3 each layer is modelled by a constant depth plane and a contour on one of the image viewpoints. To extract these layers, we use a variational framework where the general segmentation results are modified to include the camera setup and the occlusion constraints.

In the following step we decompose the layers using a 4D DWT applied in a separable fashion across the viewpoint and the spatial dimensions. We modify the viewpoint transform to include disparity compensation and also efficiently deal with disoccluded regions. Additionally, the transform is implemented using the lifting scheme [17] to reduce the complexity and maintain invertibility.

In the following sections we describe the layer extraction and 4D DWT stages in more detail.

3 Layer-based segmentation

Data segmentation is the first stage of the proposed method. Here we introduce our segmentation algorithm which achieves accurate results by taking into account the structure of multiview data. We introduce the method by first describing a general segmentation problem and then showing how that solution can be adapted to extract layers from a light field dataset.

3.1 General region-based data segmentation

Consider a general segmentation problem shown in Figure 8. The aim is to partition an m-dimensional dataset $D \subset ℝ^{m}$ into subsets $ℋ$ and $\bar{ℋ}$ where the boundary which delimits the two regions is defined by Γ (σ) with σ ∈ ℝ^{m-1}. This type of problem can be solved using an optimisation framework, where the boundary is obtained by minimising an objective function J:

Γ = arg min {J (Γ)} .

(9)

The cost function in (9) can be defined using either a boundary or region-based approach. The boundary methods evaluate the cost only on Γ and, hence, they are influenced by local data properties and easily affected by noise. In contrast, the region-based methods evaluate the cost function over a complete region and are therefore more robust. A typical region-based cost function [18] can be defined as:

J (Γ) = \int_{ℋ} d (x, ℋ) d x + \int_{\bar{ℋ}} d (x, \bar{ℋ}) d x + \int_{Γ} η d σ,

(10)

where the descriptor d (·) measures the homogeneity of each region and x ∈ ℝ^m. The descriptor can be designed such that d(x, $ℋ$ ) tends to zero when x belongs to the region $ℋ$ , and d(x, $\bar{ℋ}$ ) tends to zero when x belongs to $\bar{ℋ}$ . Note also that (10) has an additional regularisation term weighted by η, which acts to minimise the length of the boundary.

The optimisation problem defined in (10) cannot be solved directly for Γ. An iterative solution can, however, be obtained by making the boundary a function of an evolution parameter τ. Consider modelling the boundary using a partial differential equation (PDE), also known as an active contour [19]:

\frac{\partial Γ (σ, τ)}{\partial τ} = v (σ, τ) = F (σ, τ) n (σ, τ),

(11)

where v is a velocity vector, which can be expressed in terms of a scalar force F acting in the outward normal direction n to the boundary. The velocity vector v can be evaluated in terms of the descriptor d(·) by differentiating (10) with respect to τ. Applying the Eulerian framework [18], the derivative can be expressed in terms of boundary integrals:

\frac{\partial J (Γ (τ))}{\partial τ} = \int_{Γ (τ)} [d (x, ℋ) - d (x, \bar{ℋ}) + η κ (x)] (v \cdot n) d σ,

(12)

where κ is the curvature of the boundary Γ and · denotes the dot product. Observe that v and n correspond to the velocity and the normal vectors in (11), respectively.

The velocity vector, which evolves Γ in the steepest descent direction can hence be deduced using the Cauchy-Schwarz inequality as:

v = [d (x, \bar{ℋ}) - d (x, ℋ) - η κ (x)] n .

(13)

The above framework is also known as 'competition-based' segmentation. This is clear from (13), where a point on the boundary will experience a positive (outward) force if it belongs to the region $ℋ$ and a negative (inward) force otherwise, hence evolving the contour in the correct direction. In conclusion, the general segmentation problem can be solved by modelling the boundary Γ as a PDE and evolving the contour in the direction of the velocity vector v.
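The sign behaviour of the force in (13) can be illustrated on a toy signal. The sketch below assumes a simple mean-intensity descriptor, d(x, region) = (I(x) - region mean)², and sets η = 0; both choices are assumptions made for illustration and are not the descriptor used later in the article. The function name `competition_force` is hypothetical.

```python
import numpy as np

def competition_force(image, inside):
    """Scalar force of (13) with eta = 0 and a mean-intensity descriptor.

    Positive values mean the sample fits the inside region better, so a
    boundary passing through it should move outward.
    """
    mean_in = image[inside].mean()
    mean_out = image[~inside].mean()
    d_in = (image - mean_in) ** 2      # d(x, H)
    d_out = (image - mean_out) ** 2    # d(x, complement of H)
    return d_out - d_in

# A 1D signal with a bright object on samples 0..3; the current estimate
# of the region only covers samples 0..2.
image = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
inside = np.array([True, True, True, False, False, False, False, False])
force = competition_force(image, inside)
```

Sample 3 belongs to the bright object but is currently outside the region, so it receives a positive force, whereas sample 4 receives a negative one.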

3.2 Multiview image segmentation

In the case of a light field, the goal is to extract N layers, where each volume is modelled by a constant depth Z_k or the associated disparity Δp_k. In the context of the previous section, this is equivalent to segmenting the data into 4D layers { $ℋ_{1}, . . ., ℋ_{N}$ }, where the boundary of each layer is defined by {Γ₁,..., Γ_N} (the background volume $ℋ_{N}$ is assigned the residual regions which do not belong to any other layer).

In this setup, $ℋ_{k}$ corresponds to a layer which is defined by a contour on one viewpoint and a disparity as outlined in Section 2.3. However, due to occlusions the complete layer will not be visible in the dataset. Therefore, we define the cost function^d in terms of the visible regions $ℋ_{k}^{V}$ , and this leads to the following:

min_{{Γ_{1}, \dots, Γ_{N}, Δ p_{1}, \dots, Δ p_{N}}} (\sum_{k = 1}^{N} \int_{ℋ_{k}^{V}} d_{k} (x, Δ p_{k}) d x),

(14)

where x = [x, y, V_x, V_y]^T and $ℋ_{k}^{V}$ corresponds to the visible region of the k-th layer.

Recall that under our assumptions, the intensity along each EPI line is constant. We therefore choose the descriptor d_k (x, Δp_k) as

d_{k} (x, Δ p_{k}) = {[I (x) - μ (x, Δ p_{k})]}^{2},

(15)

where μ (x, Δp_k) is the mean of the EPI line which passes through a point x and has a disparity Δp_k.
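The descriptor in (15) can be sketched for an EPI volume as follows. This is a simplified illustration under stated assumptions: the volume is stored as `volume[v, y, x]` with unit camera spacing, the disparity is an integer number of pixels, occlusions are ignored, and samples wrap around at the image border (via `np.roll`). The name `epi_descriptor` is hypothetical.

```python
import numpy as np

def epi_descriptor(volume, dp):
    """Squared deviation from the EPI-line mean, as in (15).

    Each view is shifted onto the viewpoint V_x = 0 so that every EPI line
    of disparity dp becomes a column along the view axis; the mean is then
    taken along that axis and the result is shifted back.
    """
    n_views = volume.shape[0]
    aligned = np.stack([np.roll(volume[v], dp * v, axis=-1)
                        for v in range(n_views)])
    mu = aligned.mean(axis=0, keepdims=True)
    d_aligned = (aligned - mu) ** 2
    return np.stack([np.roll(d_aligned[v], -dp * v, axis=-1)
                     for v in range(n_views)])

# Synthetic single-layer scene: view v shows the base image displaced
# according to (4) with a true disparity of 2 pixels per view.
rng = np.random.default_rng(0)
base = rng.random((4, 16))
true_dp = 2
volume = np.stack([np.roll(base, -true_dp * v, axis=-1) for v in range(4)])

cost_true = epi_descriptor(volume, true_dp).sum()
cost_wrong = epi_descriptor(volume, 0).sum()
```

For this synthetic layer the descriptor vanishes at the true disparity and is strictly positive at a wrong one.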

The aim of the segmentation is then to obtain the layer boundaries Γ_k and the disparity values Δp_k for k = 1,..., N by minimising (14). Observe however that (14) has a large number of unknown variables. In order to minimise the function, we consider the problem of layer evolution and disparity estimation separately and then show how the problem is iteratively solved in Section 3.2.4.

Assuming the layer disparities are known, the minimisation can be simplified to

min_{{Γ_{1}, \dots, Γ_{N}}} (\sum_{k = 1}^{N} \int_{ℋ_{k}^{V}} d_{k} (x, Δ p_{k}) d x) .

(16)

One way to minimise (16) is to evolve iteratively the boundary of each layer. For example assuming there are three volumes $ℋ_{1}$ , $ℋ_{2}$ and $ℋ_{3}$ and we choose to evolve the boundary of the first one, the energy function can be expressed as

J_{1} = \int_{ℋ_{1}^{V}} d_{1} (x, Δ p_{1}) d x + \underbrace{\int_{ℋ_{2}^{V}} d_{2} (x, Δ p_{2}) d x + \int_{ℋ_{3}^{V}} d_{3} (x, Δ p_{3}) d x}_{\int_{\bar{ℋ_{1}^{V}}} d_{1}^{out} (x) d x}

(17)

= \int_{ℋ_{1}^{V}} d_{1} (x, Δ p_{1}) d x + \int_{\bar{ℋ_{1}^{V}}} d_{1}^{out} (x) d x,

(18)

where $d_{1}^{out} (x) = d_{i} (x, Δ p_{i})$ when $x \in ℋ_{i}^{V}$ for i = 2, 3.

In general, when evolving the k-th layer, the cost function can be simplified to

J_{k} = \int_{ℋ_{k}^{V}} d_{k} (x, Δ p_{k}) d x + \int_{\bar{ℋ_{k}^{V}}} d_{k}^{out} (x) d x .

(19)

A possible solution would then be to evaluate the 4D velocity vector of the boundary corresponding to $ℋ_{k}^{V}$ . This approach, however, would not explicitly take into account the structure of multiview data in the minimisation. In the following we show how (19) is solved by imposing the camera setup and the occlusion constraints.

3.2.1 Imposing camera setup and occlusion constraints

Recall that the background layer corresponds to the object with the largest depth (smallest disparity). If the boundary of this layer increases, it will automatically be occluded by the remaining layers in the dataset. Therefore, the structure of the visible layers will remain unchanged, and hence the cost must also remain the same. When evolving the k-th layer, we model this by using the following indicator function:

ℐ_{k} (x) = \begin{cases} 0, & \text{if } x \in ⋃_{j = 1}^{k - 1} ℋ_{j} \\ 1, & \text{otherwise,} \end{cases}

(20)

where the layers { $ℋ_{1}, . . ., ℋ_{N}$ } are ordered in terms of increasing depth. Incorporating this into (19) allows the cost to be expressed in terms of $ℋ_{k}$ as follows:

J_{k} = \int_{ℋ_{k}} d_{k} (x, Δ p_{k}) ℐ_{k} (x) d x + \int_{\bar{ℋ_{k}}} d_{k}^{out} (x) ℐ_{k} (x) d x .

(21)

Observe that the integration bounds $ℋ_{k}$ now correctly correspond to the layer boundary Γ_k. This, therefore, allows the derivative of the cost to be defined as

\frac{d J_{k}}{d τ} = \int_{Γ_{k}} [d_{k} (x, Δ p_{k}) ℐ_{k} (x) - d_{k}^{out} (x) ℐ_{k} (x)] (v_{Γ_{k}} ∙ n_{Γ_{k}}) d σ .

(22)

Additionally, recall that using the camera setup constraint, the boundary Γ_k can be parameterised by a 2D contour γ_k (s) = [x (s), y (s)] on the image viewpoint (V_x = 0). Substituting this into (22) we obtain

\frac{d J_{k}}{d τ} = \int_{γ_{k}} [D_{k} (s, Δ p_{k}) - D_{k}^{out} (s)] (v_{γ_{k}} ∙ n_{γ_{k}}) d s,

(23)

where v_{γ_k} and n_{γ_k} now correspond to the velocity and the outward normal vector of the 2D boundary^e, respectively. In addition, the new objective functions D_k (⋅) and $D_{k}^{out} (\cdot)$ are defined as

D_{k} (s, Δ p_{k}) = \iint d_{k} (x, Δ p_{k}) ℐ_{k} (x) d V_{x} d V_{y}

(24)

D_{k}^{out} (s) = \iint d_{k}^{out} (x) ℐ_{k} (x) d V_{x} d V_{y},

(25)

where

x = [\begin{matrix} x (s) - Δ p_{k} V_{x} \\ y (s) - Δ p_{k} V_{y} \\ V_{x} \\ V_{y} \end{matrix}] .

Note that the new descriptors $D_{k}^{out} (\cdot)$ and D_k (·) are simply the descriptors $d_{k}^{out} (\cdot)$ and d_k (·) integrated over the viewpoint dimensions.

The velocity vector which reduces the cost in the direction of steepest descent can therefore be chosen as^f

v_{γ_{k}} = [D_{k}^{out} (s) - D_{k} (s, Δ p_{k})] n_{γ_{k}} .

(26)

There are two main advantages in simplifying the evolution from a 4D to a 2D contour. First, the approach ensures that the layer boundary remains consistent across the views. Secondly, the complexity is reduced from evolving a 4D hypersurface to a 2D contour. We show a comparison between an unconstrained and constrained boundary evolution in Figure 9. Observe that by imposing the camera setup and occlusion constraints in Figure 9b we obtain a segmentation which is consistent with the EPI structure. In conclusion, (26) defines a velocity vector, which evolves the layer boundary γ_k(s) towards the desired segmentation for each layer.

3.2.2 Disparity and number of layers estimation

In the previous section we presented an approach to derive the velocity vector for each layer. However, the knowledge of the disparities is required in order to evaluate the correct evolution. We evaluate these parameters by assuming the 2D layer contours {γ₁, ..., γ_N} are constant. In this case, the objective function can be simplified to:

min_{{Δ p_{1}, \dots, Δ p_{N}}} \sum_{k = 1}^{N} \int_{ℋ_{k}^{V}} d_{k} (x, Δ p_{k}) d x .

(27)

In contrast to the optimisation of the layer contours, this problem is significantly simpler. A solution can be obtained in an iterative approach by estimating the disparity of each layer assuming the remaining disparities are constant. Each parameter is then estimated using the MATLAB nonlinear optimisation toolbox.
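The per-layer disparity estimate can be sketched with a much simpler optimiser than the one used in the article. The example below replaces the MATLAB nonlinear optimisation toolbox with an exhaustive search over integer candidate disparities; this substitution, the `volume[v, y, x]` layout, the wrap-around border handling and the function names are all assumptions made for illustration.

```python
import numpy as np

def layer_cost(volume, dp, mask0):
    """Sum of squared deviations from the EPI-line means over one layer.

    `mask0` is the layer support on the reference view (V_x = 0); views
    are aligned to that viewpoint with integer, wrap-around shifts.
    """
    aligned = np.stack([np.roll(volume[v], dp * v, axis=-1)
                        for v in range(volume.shape[0])])
    sq_dev = (aligned - aligned.mean(axis=0)) ** 2
    return float(sq_dev[:, mask0].sum())

def estimate_disparity(volume, mask0, candidates):
    """Exhaustive search for the candidate disparity with the lowest cost."""
    costs = [layer_cost(volume, dp, mask0) for dp in candidates]
    return candidates[int(np.argmin(costs))]

# Synthetic single-layer EPI volume with a true disparity of 2 pixels.
rng = np.random.default_rng(1)
base = rng.random((4, 16))
volume = np.stack([np.roll(base, -2 * v, axis=-1) for v in range(4)])
mask0 = np.ones(base.shape, dtype=bool)
dp_hat = estimate_disparity(volume, mask0, candidates=[0, 1, 2, 3, 4])
```

With several layers, the same search would be repeated for each layer in turn while the other disparities are held fixed, following the iterative approach described above.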

In addition, observe that we require the knowledge of the number of layers N. In our approach we initialise this value using a stereo matching algorithm [20]. Alternatively, one could estimate the number of layers using the spectral properties of the light field [21] as proposed in [22].

3.2.3 Level-set method for the boundary evolution

We have demonstrated an approach to derive the velocity vector for each boundary. We then implement the evolution of the active contours using the level-set method [23].

This method, instead of evolving directly the 2D boundary, implicitly models the curve using a higher dimensional surface z = ϕ(x, y, τ). The original boundary is then defined as the zero-level of the new function

\begin{gathered} γ (s, τ) = arg {ϕ (x, y, τ)} \\ such that ϕ (x, y, τ) = 0, \end{gathered}

(28)

where s parameterises the (x, y) coordinates on the 2D boundary.

The evolution equation of the surface can then be derived as follows. First, by implicitly differentiating ϕ (γ (s, τ), τ) = 0 with respect to τ we obtain

\frac{\partial ϕ}{\partial τ} + \frac{\partial ϕ}{\partial x} \frac{\partial x}{\partial τ} + \frac{\partial ϕ}{\partial y} \frac{\partial y}{\partial τ} = 0 \Leftrightarrow \frac{\partial ϕ}{\partial τ} + \nabla ϕ (γ (s, τ), τ) \cdot \frac{\partial γ}{\partial τ} = 0,

(29)

where ∇ is the gradient operator. Second, observe that the normal to the surface ϕ (x, y, τ) evaluated on the boundary γ (s, τ) corresponds to the outward normal vector of the boundary n_γ. This implies that

\frac{\nabla ϕ}{|\nabla ϕ|} = n_{γ} .

(30)

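Combining (29) and (30) with the original boundary model $\frac{\partial γ}{\partial τ} = F n_{γ}$ , so that $\nabla ϕ \cdot \frac{\partial γ}{\partial τ} = F \nabla ϕ \cdot \frac{\nabla ϕ}{|\nabla ϕ|} = F |\nabla ϕ|$ , we obtain the level-set evolution equation [23]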

\frac{\partial ϕ (x, y, τ)}{\partial τ} = - F (x, y) |\nabla ϕ (x, y, τ)| .

(31)

There are two main advantages to using the level-set method. First, the surface implicitly models any topological changes of the boundary. Second, unlike other parameterisation schemes, the approach does not suffer from instability issues since (31) is evaluated on a fixed Cartesian grid.

The evolution of the level-set method does however have a drawback in terms of increased complexity. To evolve the surface, the velocity vector must be evaluated at every position on the grid. In our approach, we deal with this problem by using the narrowband implementation [24], where only a region around the boundary is evolved instead of the complete surface.
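A single explicit time step of (31) can be sketched as follows. This is a simplified illustration, not the narrowband implementation of [24]: it updates the whole grid, uses central differences from `np.gradient` rather than an upwind scheme, and assumes ϕ is negative inside the contour so that (30) yields the outward normal. The function name and the step size are hypothetical.

```python
import numpy as np

def level_set_step(phi, force, dt):
    """One forward-Euler update of (31): phi <- phi - dt * F * |grad phi|."""
    gy, gx = np.gradient(phi)              # derivatives along y (rows) and x (columns)
    grad_norm = np.sqrt(gx ** 2 + gy ** 2)
    return phi - dt * force * grad_norm

# Signed distance to a circle of radius 5 on a 21 x 21 grid
# (negative inside, positive outside).
y, x = np.mgrid[-10:11, -10:11]
phi = np.sqrt(x ** 2 + y ** 2) - 5.0

# A constant positive force pushes the zero level set outward.
phi_next = level_set_step(phi, force=1.0, dt=0.5)
area_before = int((phi < 0).sum())
area_after = int((phi_next < 0).sum())
```

Because the contour is represented implicitly, the same update applies unchanged if the region splits or merges during the evolution.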

3.2.4 Layer segmentation algorithm overview

An overview of the complete layer extraction algorithm is shown in Algorithm 1. First, the 2D contours and the disparity of each layer are initialised using a stereo matching algorithm [20]. The algorithm evaluates the disparity of each layer and then iteratively evolves the boundaries^g using the proposed velocity vector in (26). This process continues for a certain number of iterations or until the change in the overall cost is below a predefined threshold.

An example of the extracted layers using the method outlined in Algorithm 1 is shown in Figure 1. In addition, in Figure 10 we show a comparison between an initialised layer-boundary using the stereo matching algorithm and the final layer contour.

To obtain a sparse representation, the obtained layers are decomposed using a 4D DWT as explained in the following section.

4 Data decomposition

In this stage, the redundancy of the texture in each layer shown in Figure 1 is reduced using a multi-dimensional wavelet transform. In the following, we present the inter-view and the spatial transforms in more detail.

Algorithm 1 Layer extraction algorithm

STEP 1: Initialise the 2D boundary of each layer {γ₁, γ₂,..., γ_N} using a stereo matching algorithm (Algorithm [20] in our implementation).

STEP 2: Estimate the disparity of each layer {Δ p₁, Δp₂,... Δp_N} by minimising the squared error along the EPI lines.

STEP 3: Reorder the layers in terms of increasing depth.

STEP 4: Iteratively evolve the layer boundaries assuming the remaining layers are constant:

for k = 1 to N-1 do

Evaluate the velocity vector v_{γ_k} of the k-th layer.

Evolve the boundary γ_k according to the velocity vector.

end for

STEP 5: Return to STEP 2 or exit algorithm if the change in the cost (14) is below a predefined threshold.

4.1 Inter-view 2D DWT

We implement the inter-view 2D DWT on each layer in two steps: first by applying a 1D disparity compensated DWT across the row images (V_y) followed by the column images (V_x) as illustrated in Figure 11. The process is iterated on the low-pass components to obtain a multiresolution decomposition.

In our implementation of the 1D DWT we use the disparity compensated Haar transform. This is motivated by the fact that the light field intensity along the EPI lines is constant. Therefore, a wavelet with one vanishing moment is enough to obtain a sparse representation. It is applied by modifying the standard lifting equations [17] and including a warping operator $W$ as follows:

\begin{gathered} ℒ_{o} [n] = \frac{P_{o} [n] - W {P_{e} [n]}}{2}, \\ ℒ_{e} [n] = P_{e} [n] + W {ℒ_{o} [n]}, \end{gathered}

(32)

where P_o [n] and P_e [n] represent 2D images with spatial coordinates (x, y) located at odd (2n + 1) and even (2n) camera locations, respectively. Following (32), $ℒ_{e} [n]$ contains the 2D low-pass subband and $ℒ_{o} [n]$ the high-pass subband. Assuming that $W$ is invertible and the images are spatially continuous, the above transform can be shown to be equivalent to the standard DWT applied along the motion trajectories [25].

In both the prediction and update steps in (32), the warping operator $W$ is chosen to maximise the inter-image correlation. This is achieved by using a projective operation that maps one image onto the same viewpoint as its odd/even complement in the lifting step. Using (4) and the fact that the layers are modelled by a constant disparity, we define the warping operation from viewpoint n₁ to n₂ along the V_x dimension as:

W_{n_{1} \to n_{2}} {P [n_{1}]} (x, y) = P [n_{1}] (x + Δ p (n_{2} - n_{1}), y),

(33)

where Δp is the layer disparity.
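The lifting steps in (32) with the warp in (33) can be sketched for a single layer and a single view axis. The example assumes an integer disparity, images stored as `image[y, x]`, wrap-around at the image border and no occlusions; these simplifications and the function names are assumptions made for illustration.

```python
import numpy as np

def warp(image, n1, n2, dp):
    """Warp of (33): output(x, y) = image(x + dp * (n2 - n1), y)."""
    return np.roll(image, -dp * (n2 - n1), axis=1)

def haar_forward(views, dp):
    """One level of the disparity-compensated Haar lifting in (32)."""
    low, high = [], []
    for n in range(len(views) // 2):
        p_e, p_o = views[2 * n], views[2 * n + 1]
        l_o = (p_o - warp(p_e, 2 * n, 2 * n + 1, dp)) / 2.0   # prediction step
        l_e = p_e + warp(l_o, 2 * n + 1, 2 * n, dp)           # update step
        low.append(l_e)
        high.append(l_o)
    return low, high

def haar_inverse(low, high, dp):
    """Undo the lifting steps in reverse order."""
    views = []
    for n, (l_e, l_o) in enumerate(zip(low, high)):
        p_e = l_e - warp(l_o, 2 * n + 1, 2 * n, dp)
        p_o = 2.0 * l_o + warp(p_e, 2 * n, 2 * n + 1, dp)
        views += [p_e, p_o]
    return views

# Synthetic constant-disparity layer seen from four viewpoints.
rng = np.random.default_rng(2)
base = rng.random((6, 16))
dp = 2
views = [np.roll(base, -dp * n, axis=1) for n in range(4)]

low, high = haar_forward(views, dp)
restored = haar_inverse(low, high, dp)
```

For this idealised layer the high-pass subbands are exactly zero, which is the sparsity the disparity compensation is designed to achieve, and the lifting structure makes the transform invertible.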

Note that in the case of an occlusion, the DWT leads to filtering across an artificial boundary and, thus, reduces the sparsity of the representation. To prevent this, we use the concept proposed in [26] to create a shape-adaptive transform in the view domain. The transform in (32) is modified whenever a pixel at an even or odd location is occluded such that

ℒ_{e} [n] = \begin{cases} P_{e} [n], & \text{occlusion at } 2 n + 1 \\ \hat{W} {P_{o} [n]}, & \text{occlusion at } 2 n, \end{cases}

(34)

and the high pass coefficient in $ℒ_{o} [n]$ is set to zero. In (34), the warping operator $\hat{W}$ operates at integer pixel precision to ensure invertibility, using the ceiling of the disparity in (33).

4.2 Spatial shape-adaptive 2D DWT

Following the inter-view transform we reduce the intra-view redundancy by using a 2D DWT. However, prior to applying the 2D DWT on each image, we recombine the transform coefficients into a single layer. This is done to increase the number of decompositions which can be applied by the spatial transform. A comparison between the original and recombined layers is illustrated in Figure 12. Note that due to occlusions and the way in which the inter-view transform is implemented, two or more layers may overlap in each subband. In this case, we apply a separate spatial transform to the overlapped pixels.

Note that the overlapped pixels are commonly bounded by an irregular (non-rectangular) shape. For that reason, the standard 2D DWT applied to the entire spatial domain is inefficient due to the boundary effect. We therefore use the shape-adaptive DWT [26] within arbitrarily shaped objects. The method reduces the magnitude of the high pass coefficients by symmetrically extending the texture whenever the wavelet filter is crossing the boundary. The 2D DWT is built as a separable transform with linear-phase symmetric wavelet filters (9/7 or 5/3 [27]), which, together with the symmetric signal extensions, leads to critically sampled transform subbands.

5 Evaluation

In this section we evaluate the performance of the proposed sparse representation using its nonlinear approximation properties. In addition, we demonstrate de-noising and IBR applications based on the proposed decomposition.

5.1 Nonlinear approximation

We evaluate the sparseness of the representation using its N-term nonlinear approximation. To implement the nonlinear approximation, we keep the N largest coefficients in the transform domain, reconstruct the data and evaluate the data fidelity in terms of PSNR.
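The evaluation procedure can be sketched as follows. This is a minimal sketch of N-term approximation and PSNR, independent of the particular transform: the coefficient array is assumed to come from some invertible transform, and the function names and the default peak value of 255 (8-bit images) are assumptions made for illustration.

```python
import numpy as np

def keep_n_largest(coeffs, n):
    """Zero all but the n largest-magnitude coefficients."""
    coeffs = np.asarray(coeffs, dtype=float)
    flat = coeffs.ravel()
    n = int(min(max(n, 0), flat.size))
    out = np.zeros_like(flat)
    if n > 0:
        # Indices of the n largest magnitudes (order among them is arbitrary).
        kth = flat.size - n
        idx = np.argpartition(np.abs(flat), kth)[kth:]
        out[idx] = flat[idx]
    return out.reshape(coeffs.shape)

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB; peak=255 assumes 8-bit data."""
    ref = np.asarray(reference, dtype=float)
    rec = np.asarray(reconstruction, dtype=float)
    mse = np.mean((ref - rec) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

Sweeping `n` and plotting `psnr` of the inverse-transformed result against the number of retained coefficients gives curves of the kind compared in this section.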

Our results show that the proposed layer-based representation offers superior approximation properties compared to a typical multi-dimensional DWT^h. We demonstrate this in Figure 13 on three datasets: Tsukuba light field [272 × 368 × 4 × 4], Teddy EPI [368 × 352 × 4] and Doll EPI [368 × 352 × 4] (all from [28]), which vary in terms of scene complexity, number of images and spatial resolution. We show that in each case our approach achieves a sparser representation across the complete range of retained coefficients, with PSNR gains of up to 7 dB on the Tsukuba light field. The Tsukuba light field has a larger PSNR improvement than the respective Teddy and Doll EPI volumes due to the additional viewing dimension. This means that there exists more redundant information and this is fully exploited by our representation. We also show that the PSNR curves correspond to a subjective improvement in Figure 14.

We note that the nonlinear approximation metric is also a good indicator of the compression capability of the representation. In practice, compression is more complicated because the locations of the significant coefficients must also be encoded and the rate must be allocated. These issues are beyond the scope of this paper; however, we refer the reader to [29], where these problems are addressed and a complete multiview image compression method is presented.

5.2 De-noising

Here we demonstrate de-noising results based on the proposed sparse representation in the presence of additive white Gaussian noise (AWGN). We implement the de-noising by soft thresholding the wavelet coefficients in each subband. For each subband, the threshold is chosen by minimising Stein's unbiased risk estimate (SURE) of the mean squared error (MSE) [30].
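The per-subband thresholding step can be sketched as below. This is a simplified sketch of SURE-based soft thresholding in the spirit of [30], not the exact procedure used for the reported results: it assumes the noise standard deviation `sigma` in the subband is known, it searches only over the coefficient magnitudes as candidate thresholds, and it omits the hybrid fallback to a universal threshold. The function names are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink coefficients towards zero by t; values below t become zero."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sure_threshold(coeffs, sigma):
    """Soft threshold minimising Stein's unbiased risk estimate.

    For unit-variance noise, the risk estimate at threshold t is
        n - 2 * #{i : |x_i| <= t} + sum_i min(|x_i|, t)^2.
    Candidate thresholds are restricted to the coefficient magnitudes.
    """
    a = np.sort(np.abs(np.asarray(coeffs, dtype=float).ravel())) / sigma
    n = a.size
    k = np.arange(n)
    cumulative = np.cumsum(a ** 2)
    # Risk estimate evaluated at t = a[k] for every k at once.
    risk = n - 2.0 * (k + 1) + cumulative + (n - 1 - k) * a ** 2
    return sigma * a[np.argmin(risk)]

def denoise_subband(coeffs, sigma):
    """Soft-threshold one subband with its SURE-optimal threshold."""
    return soft_threshold(coeffs, sure_threshold(coeffs, sigma))
```

Because the threshold is chosen separately for each subband, sparse subbands receive a large threshold that removes most of the noise, while dense subbands receive a small one that preserves detail.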

Note that the aim of this section is not to compare the results to the state of the art in multiview de-noising but to demonstrate that the sparse representation can be used for de-noising applications. In Figure 15 we compare our algorithm to the competitive SURE-LET OWT de-noising method [31] applied to each image independently. In this setup, we assume the noise is added to the extracted layers. On the Tsukuba light field, Teddy EPI and Doll EPI datasets, our approach yields a PSNR improvement of up to 2 dB. The light field has the most significant gain due to a sparser representation, which results from the larger number of images in the dataset.

The subjective results are illustrated in Figure 16 and these clearly show that the proposed sparse representation attains more visually pleasing results than the SURE-LET OWT method.

5.3 Image-based rendering

In this section we present viewpoint interpolation results based on the layer-based representation shown in Figure 1.

To render an image at an arbitrary viewpoint, we linearly interpolate the closest available images. Recall that the data pixels are highly correlated in the direction of the disparity. We take this into account by modifying the support of the rendering kernel according to the disparity of each layer. Additionally, we modify the interpolation in the presence of occlusions to further improve the results. In this case, only pixels that belong to the layer are used in the rendering process.
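The rendering step for a single layer can be sketched as disparity-compensated linear interpolation between two neighbouring views. This is an illustrative 1D sketch under stated assumptions, not the authors' rendering kernel: it assumes a constant integer-rounded disparity per layer, the convention that a point at position x in the left view appears at x minus the disparity in the right view, and boolean masks marking which pixels belong to the layer in each view. All names are hypothetical.

```python
import numpy as np

def render_layer_row(left, right, mask_left, mask_right, disparity, alpha):
    """Interpolate one image row of a layer at virtual position alpha.

    alpha = 0 is the left camera, alpha = 1 the right camera.
    Assumed convention: a point at x in `left` appears at
    x - disparity in `right`. Only pixels flagged in the layer masks
    contribute, so occluded pixels are never blended in.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    mask_left = np.asarray(mask_left, dtype=bool)
    mask_right = np.asarray(mask_right, dtype=bool)
    n = left.size
    u = np.arange(n)
    # Source positions along the disparity direction (integer rounded).
    i_left = u + int(np.rint(alpha * disparity))
    i_right = u - int(np.rint((1.0 - alpha) * disparity))

    def sample(img, mask, idx):
        inside = (idx >= 0) & (idx < n)
        safe = np.clip(idx, 0, n - 1)
        return img[safe], inside & mask[safe]

    v_left, ok_left = sample(left, mask_left, i_left)
    v_right, ok_right = sample(right, mask_right, i_right)
    w_left = (1.0 - alpha) * ok_left
    w_right = alpha * ok_right
    w_sum = w_left + w_right
    covered = w_sum > 0
    out = np.full(n, np.nan)   # NaN where the layer contributes nothing
    out[covered] = (w_left * v_left + w_right * v_right)[covered] / w_sum[covered]
    return out, covered
```

With a feature at position 3 in the left row, position 1 in the right row, a disparity of 2 and `alpha = 0.5`, the feature is rendered at position 2 with its original intensity rather than as two half-intensity ghosts, which is what blending without disparity compensation would produce. A full renderer would composite the layers from back to front using the `covered` masks.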

In order to obtain an objective evaluation, we use the leave-one-out approach. In this case the images located at odd camera viewpoint locations are removed and synthesised using the scene modelⁱ.

We compare our results to a state-of-the-art stereo matching algorithm [32] and an EPI tubes extraction method [33]. These methods specify the structure of the EPI lines, and the interpolation is implemented using the same approach as in the proposed algorithm.

In Table 1 we show a comparison on four datasets: Dwarves EPI [555 × 695 × 7] [28], Lobby EPI [800 × 800 × 5], Desk light field [500 × 500 × 4 × 4] and Animal Farm EPI [235 × 625 × 32] (the last three from [34]). Observe that the layer-based representation achieves an improved SNR in comparison to both the stereo matching and the EPI tubes extraction method on all datasets.

Table 1 Image-based rendering evaluation


In addition to the quantitative evaluation, we show a subjective comparison in Figure 17. This shows that interpolation using the proposed layer-based representation achieves significantly improved results. We note that our method produces artifact-free, photo-realistic images. In comparison, aliasing artifacts are present in the stereo matching and EPI tube extraction methods. This is due to incorrect compensation of the interpolation kernel, which stems from inaccurate depth correspondence in the scene.

6 Conclusion

We presented a novel method to obtain a sparse representation of multiview images. The fundamental component of the algorithm is the layer-based representation, which partitions the multiview images into a set of layers each related to a constant depth in the scene. We presented a novel method to obtain the layer-based representation using a general segmentation framework which takes into account the structure of multiview data to achieve accurate results. The obtained layers are then decomposed using a 4D DWT applied in a separable approach, first across the camera viewpoint and then the image dimensions. We modify the viewpoint transform to efficiently deal with occlusions and depth variations. Simulation results based on nonlinear approximation have shown that the sparsity of our representation is superior to a multi-dimensional DWT with the same decomposition structure without disparity compensation. In addition, we have shown that the proposed representation can be used to efficiently synthesise novel viewpoints for IBR applications and also de-noise multiview images in the presence of AWGN.

Endnotes

^a Each camera in the setup is modelled by the pinhole model [35].
^b Light ray intensity is constant when an object is observed from a different angle.
^c By visible regions we mean the EPI line segments which are present in the EPI volume.
^d We have not included the regularisation terms for the sake of clarity.
^e It can be shown that given a fronto-parallel depth plane, the inner product v_γk · n_γk is equal to v_Γk · n_Γk.
^f Note that in practice we also include a regularisation term to constrain the evolution according to the curvature of the boundary.
^g Note that the background layer $ℋ_{N}^{V}$ is automatically assigned all of the regions which do not belong to the remaining layers and is therefore not evolved.
^h This multi-dimensional DWT has the same decomposition structure as our method, but without disparity compensation.
^i The extracted layers are obtained using the dataset with the removed images.

References

1. Taubman D, Marcellin M: JPEG2000 Image Compression Fundamentals, Standards and Practice. Kluwer Academic Publishers, Boston; 2004.
2. Candès EJ, Donoho DL: Curvelets: a surprisingly effective nonadaptive representation of objects with edges. Curve and Surface Fitting, University Press, Saint-Malo; 2000.
3. Do MN, Vetterli M: The contourlet transform: an efficient directional multiresolution image representation. IEEE Trans Image Process 2005, 14(12):2091-2106.
4. Do MN, Vetterli M: The finite ridgelet transform for image representation. IEEE Trans Image Process 2003, 12(1):16-28. 10.1109/TIP.2002.806252
5. Velisavljevic V, Beferull-Lozano B, Vetterli M, Dragotti PL: Directionlets: anisotropic multidirectional representation with separable filtering. IEEE Trans Image Process 2006, 15(7):1916-1933.
6. Le Pennec E, Mallat S: Sparse geometric image representations with bandelets. IEEE Trans Image Process 2005, 14(4):423-438.
7. Le Pennec E, Mallat S: Bandelet image approximation and compression. SIAM J Multiscale Model Simul 2005, 4(3):992-1039. 10.1137/040619454
8. Bayram I, Selesnick IW: On the dual-tree complex wavelet packet and M-band transforms. IEEE Trans Signal Process 2008, 56(6):2298-2310.
9. Selesnick IW, Baraniuk RG, Kingsbury NG: The dual-tree complex wavelet transform. IEEE Signal Process Mag 2005, 22(6):123-151.
10. Bruckstein AM, Donoho DL, Elad M: From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev 2009, 51: 34-81. 10.1137/060657704
11. Do MN, Nguyen QH, Nguyen HT, Kubacki D, Patel SJ: Immersive visual communication. IEEE Signal Process Mag 2011, 28(1):58-66.
12. Kubota A, Smolic A, Magnor M, Tanimoto M, Chen T, Zhang C: Multiview imaging and 3DTV. IEEE Signal Process Mag 2007, 24(6):10-21.
13. Zhang C, Chen T: A survey on image-based rendering-representation, sampling and compression. Signal Process Image Commun 2004, 19: 1-28. 10.1016/j.image.2003.07.001
14. Adelson EH, Bergen JR: The plenoptic function and the elements of early vision. In Computational Models of Visual Processing. MIT Press, Cambridge; 1991:3-20.
15. Levoy M, Hanrahan P: Light field rendering. In Proceedings of Computer Graphics (SIGGRAPH). New Orleans, Louisiana; 1996:31-42.
16. Bolles R, Baker H, Marimont D: Epipolar-plane image analysis: an approach to determining structure from motion. Int J Comput Vis 1987, 1(1):7-55. 10.1007/BF00128525
17. Daubechies I, Sweldens W: Factoring wavelet transforms into lifting steps. J Fourier Anal Appl 1998, 4(3):247-269. 10.1007/BF02476026
18. Jehan-Besson S, Barlaud M, Aubert G: DREAM2S: Deformable regions driven by an eulerian accurate minimization method for image and video segmentation. Int J Comput Vis 2003, 53: 45-70. 10.1023/A:1023031708305
19. Kass M, Witkin A, Terzopoulos D: Snakes: Active contour models. Int J Comput Vis 1988, 1(4):321-331. 10.1007/BF00133570
20. Kolmogorov V, Zabih R: Multi-camera scene reconstruction via graph cuts. In Proceedings of the 7th European Conference on Computer Vision-Part III (ECCV). Springer-Verlag, Copenhagen, Denmark; 2002:82-96.
21. Chai JX, Tong X, Chan SC, Shum HY: Plenoptic sampling. In Proceedings of Computer Graphics (SIGGRAPH). ACM Press/Addison-Wesley Publishing Co., New York; 2000:307-318.
22. Berent J, Dragotti PL, Brookes M: Adaptive layer extraction for image based rendering. In Proceedings of IEEE Workshop on Multimedia Signal Processing (MMSP). Rio De Janeiro, Brazil; 2009:266-271.
23. Sethian JA: Level Set Methods. Cambridge University Press, Cambridge; 1996.
24. Hötter M: Object-oriented analysis-synthesis coding based on moving two-dimensional objects. Signal Process Image Commun 1990, 2(4):409-428. 10.1016/0923-5965(90)90027-F
25. Secker A, Taubman D: Lifting-based invertible motion adaptive transform (LIMAT) framework for highly scalable video compression. IEEE Trans Image Process 2003, 12(12):1530-1542. 10.1109/TIP.2003.819433
26. Li S, Li W: Shape-adaptive discrete wavelet transforms for arbitrarily shaped visual object coding. IEEE Trans Circ Syst Video Technol 2000, 10(5):725-743. 10.1109/76.856450
27. Unser M, Blu T: Mathematical properties of the JPEG2000 wavelet filters. IEEE Trans Image Process 2003, 12(9):1080-1090. 10.1109/TIP.2003.812329
28. Scharstein D, Szeliski R: Middlebury datasets. [http://www.vision.middlebury.edu/stereo/data]
29. Gelman A, Dragotti PL, Velisavljević V: Multiview image compression using a layer-based representation. In Proceedings of the IEEE International Conference on Image Processing (ICIP). Hong Kong, China; 2010:13-16.
30. Donoho D, Johnstone IM: Adapting to unknown smoothness via wavelet shrinkage. J Am Stat Assoc 1995, 90: 1200-1224. 10.2307/2291512
31. Blu T, Luisier F: The SURE-LET approach to image denoising. IEEE Trans Image Process 2007, 16(11):2778-2786.
32. Ogale AS, Aloimonos Y: Shape and the stereo correspondence problem. Int J Comput Vis 2005, 65(3):147-162. 10.1007/s11263-005-3672-3
33. Criminisi A, Kang SB, Swaminathan R, Szeliski R, Anandan P: Extracting layers and analyzing their specular properties using epipolar-plane-image analysis. Microsoft Research Technical Report MSR-TR-2002-19; 2002.
34. Berent J: Coherent multi-dimensional segmentation of multi-view images using a variational framework and applications to image based rendering. PhD Thesis, Imperial College; 2008. [http://www.commsp.ee.ic.ac.uk/~pld/group/Thesis_Berent_08.pdf]
35. Hartley RI, Zisserman A: Multiple View Geometry in Computer Vision. 2nd edition. Cambridge University Press, Cambridge; 2004. ISBN:0521540518
36. Shum HY, Kang SB: A review of image-based rendering techniques. IEEE SPIE Vis Commun Image Process (VCIP) 2000, 213: 1-12.

Acknowledgements

We would like to thank the anonymous reviewers whose constructive comments led to an improved manuscript.

Author information

Authors and Affiliations

Communications and Signal Processing Group, Imperial College London, London, SW7 2AZ, UK
Andriy Gelman & Pier Luigi Dragotti
Google Inc., Brandschenkestrasse 110, Zurich, 8002, Switzerland
Jesse Berent

Corresponding author

Correspondence to Andriy Gelman.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Gelman, A., Berent, J. & Dragotti, P.L. Layer-based sparse representation of multiview images. EURASIP J. Adv. Signal Process. 2012, 61 (2012). https://doi.org/10.1186/1687-6180-2012-61

Received: 14 July 2011
Accepted: 09 March 2012
Published: 09 March 2012
DOI: https://doi.org/10.1186/1687-6180-2012-61
