A geometric approach to multi-view compressive imaging
- Jae Young Park^{1}Email author and
- Michael B. Wakin^{2}
https://doi.org/10.1186/1687-6180-2012-37
© Park and Wakin; licensee Springer. 2012
Received: 12 July 2011
Accepted: 20 February 2012
Published: 20 February 2012
Abstract
In this paper, we consider multi-view imaging problems in which an ensemble of cameras collect images describing a common scene. To simplify the acquisition and encoding of these images, we study the effectiveness of non-collaborative compressive sensing encoding schemes wherein each sensor directly and independently compresses its image using randomized measurements. After these measurements and also perhaps the camera positions are transmitted to a central node, the key to an accurate reconstruction is to fully exploit the joint correlation among the signal ensemble. To capture such correlations, we propose a geometric modeling framework in which the image ensemble is treated as a sampling of points from a low-dimensional manifold in the ambient signal space. Building on results that guarantee stable embeddings of manifolds under random measurements, we propose a "manifold lifting" algorithm for recovering the ensemble that can operate even without knowledge of the camera positions. We divide our discussion into two scenarios, the near-field and far-field cases, and describe how the manifold lifting algorithm could be applied to these scenarios. At the end of this paper, we present an in-depth case study of a far-field imaging scenario, where the aim is to reconstruct an ensemble of satellite images taken from different positions with limited but overlapping fields of view. In this case study, we demonstrate the impressive power of random measurements to capture single- and multi-image structure without explicitly searching for it, as the randomized measurement encoding in conjunction with the proposed manifold lifting algorithm can even outperform image-by-image transform coding.
Keywords
1. Introduction
CS is particularly useful in two scenarios. The first is when a high-resolution signal is difficult to measure directly. For example, conventional infrared cameras require expensive sensors, and with increasing resolution such cameras can become extremely costly. A compressive imaging camera has been proposed [3] that can acquire a digital image using far fewer (random) measurements than the number of pixels in the image. Such a camera is simple and inexpensive and can be used not only for imaging at visible wavelengths, but also for imaging at non-visible wavelengths.
A second scenario where CS is useful is when one or more high-resolution signals are difficult or expensive to encode. Such scenarios arise, for example, in sensor networks and multi-view imaging, where it may be feasible to measure the raw data at each sensor, but joint, collaborative compression of that data among the sensors would require costly communication. As an alternative to conventional Distributed Source Coding (DSC) methods [4], an extension of single-signal CS known as Distributed CS (DCS) [5] has been proposed, where each sensor encodes only a random set of linear projections of its own observed signal. These projections could be obtained either by using CS hardware as described above, or by using a random, compressive encoding of the data collected from a conventional sensor.
While DCS encoding is non-collaborative, an effective DCS decoder should reconstruct all signals jointly to exploit their common structure. As we later discuss, most existing DCS algorithms for distributed imaging reconstruction rely fundamentally on sparse models to capture intra- and inter-signal correlations [5–8]. What is missing from each of these algorithms, however, is an assurance that the reconstructed images have a global consistency, i.e. that they all describe a common underlying scene. This may not only lead to possible confusion in interpreting the images, but more critically may also suggest that the reconstruction algorithm is failing to completely exploit the joint structure of the ensemble.
To better extend DCS techniques specifically to problems involving multi-view imaging, we propose in this paper a general geometric framework in which many such reconstruction problems may be cast. We specifically focus on scenarios where a representation of the underlying scene is linearly related to the observations. This is mainly for simplicity, and there is plenty of room for the development of joint reconstruction algorithms given nonlinear mappings; however, we present a number of scenarios where a linear mapping can be found. For these problems, we explain how viewing the unknown images as living along a low-dimensional manifold within the high-dimensional signal space can inform the design of effective joint reconstruction algorithms. Such algorithms can build on existing sparsity-based techniques for CS but ensure a global consistency among the reconstructed images. We refine our discussion by focusing on two settings: far-field and near-field multi-view imaging. Finally, as a proof of concept, we demonstrate a "manifold lifting" algorithm in a specific far-field multi-view scenario where the camera positions are not known a priori and we only observe a small number of random measurements at each sensor. Even in such discouraging circumstances, by effectively exploiting the geometrical information preserved in the manifold model, we are able to accurately reconstruct both the underlying scene and the camera positions.
2. Background on signal models and compressive sensing
A. Concise signal models
Real-world signals typically contain some degree of structure that can be exploited to simplify their processing and recovery. Sparsity is one model of conciseness in which the signal of interest can be represented as a linear combination of only a few basis vectors from some dictionary. To provide a more formal statement, let us consider a signal x ∈ ℝ ^{ N } . (If the signal is a 2D image, we reshape it into a length-N vector.) We let Ψ ∈ ℝ^{N × N}denote an orthonormal basis^{a} for ℝ ^{ N } , with its columns acting as basis vectors, and we write x = Ψα, where α := Ψ ^{ T } x ∈ ℝ ^{ N } denotes the expansion coefficients of x in the basis Ψ. We say that x is K-sparse in the basis Ψ if α contains only K nonzero entries. Sparse representations with K ≪ N provide exact or approximate models for wide varieties of signal classes, as long as the basis Ψ is chosen to match the structure in x. In the case of images, the 2D Discrete Wavelet Transform (DWT) and 2D Discrete Cosine Transform (DCT) are reasonable candidates for Ψ [9].
The rich geometrical information that rests within an IAM makes it an excellent candidate for modeling in multi-view imaging. Letting θ represent camera position, all of the images in a multi-view ensemble will live along a common IAM, and as we will later discuss, image reconstruction in the IAM framework can ensure global consistency of the reconstructed images.
B. Compressive sensing
In conventional signal acquisition devices such as digital cameras and camcorders, we first acquire a full N-dimensional signal x and then apply a compression technique such as JPEG or MPEG [9]. These and other transform coding techniques essentially involve computing the expansion coefficients α describing the signal in some basis Ψ, keeping only the K-largest entries of α, and setting the rest to zero. While this can be a very effective way of consolidating the signal information, one could argue that this procedure of "first sample, then compress" is somewhat wasteful because we must measure N pieces of information only to retain K < N coefficients. For certain sensing modalities (such as infrared), it may be difficult or expensive to acquire so many high-resolution samples of the signal.
The recently emerged theory of CS suggests an alternative acquisition scheme. CS utilizes an efficient encoding framework in which we directly acquire a compressed representation of the underlying signal by computing simple linear inner products with a small set of randomly generated test functions. Let us denote the full-resolution discrete signal as x ∈ ℝ ^{ N } and suppose that we generate a collection of M random vectors, ϕ_{ i } ∈ ℝ ^{ N } , i = 1, 2, . . ., M . We stack these vectors into an M × N matrix Φ = [ϕ_{1}ϕ_{2} ... ϕ_{ M } ] ^{ T } , which we refer to as a measurement matrix. A CS encoder or sensor produces the measurements y = Φx ∈ ℝ ^{ M } , possibly without ever sampling or storing x itself.
which is again convex and can be solved efficiently.
If Φ satisfies the RIP of order 2K with δ_{2K}sufficiently small, it is known that (1) will perfectly recover any K-sparse signal in the basis Ψ and that (2) will incur a recovery error at worst proportional to ε[15]. The performance of both recovery techniques also degrades gracefully if x is not exactly K-sparse but rather is well approximated by a K-sparse signal.
It has been shown that we can obtain an RIP matrix Φ with high probability simply by taking $M=\mathcal{O}(K\mathrm{log}(N/K))$ and populating the matrix with i.i.d. Gaussian, Bernoulli, or more general subgaussian entries [17]. Thus, one of the hallmarks of CS is that this requisite number of measurements M is essentially proportional to the sparsity level K of the signal to be recovered.
In addition to families of K-sparse signals, random matrices can also provide stable embeddings for manifolds (see Figure 2). Letting M denote a smooth^{c}p-dimensional manifold, if we take $M=\mathcal{O}(p\mathrm{log}(N))$ and generate Φ randomly from one of the distributions above, we will obtain an embedding Φℳ:= {Φx : x ∈ ℳ} ∈ ℝ ^{ M } such that all pairwise distances between points on the manifold are approximately preserved [14], i.e. such that (3) holds for all ${x}_{{\theta}_{1}},{x}_{{\theta}_{2}}\in \mathcal{M}$. Geodesic distances are also approximately preserved. Again, the requisite number of measurements is merely proportional to the information level of the signal, which in this case equals p (the dimension of the manifold), rather than the sparsity level of the signal in any particular dictionary. All of this suggests that manifolds may be viable models to use in CS recovery; see [18] for additional discussion on the topic of using manifold models to recover individual signals.
We see from the above that random measurements have a remarkable "universal" ability to capture the key information in a signal, and this occurs with a number of measurements just proportional to the number of degrees of freedom in the signal. Only the decoder attempts to exploit the signal structure, and it can do so by positing any number of possible signal models.
In summary, in settings where a high-resolution signal x is difficult or expensive to measure directly, CS allows us to replace the "first sample, then compress" paradigm with a technique for directly acquiring compressive measurements of x. To do this in practice, we might resort to CS hardware that directly acquires the linear measurements y without ever sampling or storing x directly. Several forms of compressive imaging architectures have been proposed, ranging from existing data collection schemes in Magnetic Resonance Imaging (MRI) [19] to more exotic CS-based techniques. One architecture [3], for example, replaces the conventional CCD/CMOS sensor in a digital camera with a digital micromirror device (DMD), which modulates the incoming light and reflects it onto a single photodiode for measurement. Some intriguing uses of this inexpensive "single pixel camera" could include infrared or hyperspectral imaging, where conventional high-resolution sensors can cost hundreds of thousands of dollars.
Before proceeding, however, we note that CS can also be useful in settings where it is possible to acquire high-resolution signals, but is difficult or expensive to subsequently encode them. For example, x might represent a video signal, for which direct measurement is possible, but for which subsequent compression typically requires exploiting complicated spatio-temporal correlations [7, 8]. A more straightforward encoder might simply compute y = Φx for some random, compressive Φ. Other scenarios where data are difficult to encode efficiently might be in sensor networks or in multi-view imaging, which is the topic of this paper and is discussed further in the next section.
3. Problem setup and related work
A. Multi-view imaging using image-by-image random measurements
Let us now turn to the problem of distributed image compression for multi-view imaging. We imagine an ensemble of J distinct cameras that collect images x_{1}, x_{2}, . . ., x_{ J } ∈ ℝ ^{ N } describing a common scene, with each image x_{ j } taken from some camera position θ_{ j }∈ Θ. We would like to efficiently compress this ensemble of images, but as in any sensor network, we may be limited in battery power, computational horsepower, and/or communication bandwidth. Thus, although we may be able to posit sparse and manifold-based models for concisely capturing the intra- and inter-signal structures among the images in the ensemble, directly exploiting these models for the purpose of data compression may be prohibitively complex or require expensive collaboration among the sensors. This motivates our desire for an effective distributed encoding strategy.
The encoding of multiple signals in distributed scenarios has long been studied under the auspices of the distributed source coding (DSC) community. The Slepian-Wolf framework [4] for lossless DSC states that two sources X_{1} and X_{2} are able to compress at their conditional entropy rate without collaboration and can be decoded successfully when the correlation model (i.e., the joint probability distribution p(x_{1}, x_{2})) is known at the decoder. This work was extended to lossy coding by Wyner and Ziv when side information is available at the decoder [20], and in subsequent years, practical algorithms for these frameworks have been proposed based on channel coding techniques. However, one faces difficulties in applying these frameworks to multi-view imaging because the inter-image correlations are arguably better described geometrically than statistically. Several algorithms (e.g., [21–23]) have been proposed for combining these geometric and statistical frameworks, but fully integrating these concepts remains a very challenging problem.
As a simple alternative to these type of encoding schemes, we advocate the use of CS for distributed image coding, wherein for each sensor j ∈ {1, 2, . . ., J}, the signal x_{ j } ∈ ℝ ^{ N } is independently encoded using an M_{ j } × N measurement matrix Φ_{ j }, yielding the measurement vector ${y}_{j}={\Phi}_{j}{x}_{j}\in {\mathbb{R}}^{{M}_{j}}$. Such an encoding scheme is known in the CS literature as Distributed CS (DCS) [5]. While the primary motivation for DCS is to simplify the encoding of correlated high-resolution signals, one may of course bypass the potentially difficult acquisition of the high-resolution signals and directly collect the random measurements using CS hardware.
After the randomized encoding, the measurement vectors y_{1}, y_{2}, . . ., y_{ J } are then transmitted to a central node for decoding. Indeed, DCS differs from single-signal CS only in the decoding process. Rather than recover the signals one-by-one from the measurement vectors, an effective DCS decoder should solve a joint reconstruction problem, exploiting the intra- and inter-signal correlations among the signals {x_{ j } }, while ensuring consistency with the measurements {y_{ j } }.
The proper design of a DCS decoder depends very much on the type of data being collected and on the nature of the intra- and inter-signal correlations. Ideally, compared to signal-by-signal recovery, joint recovery should provide better reconstruction quality from a given set of measurement vectors, or equivalently, reduce the measurement burden needed to achieve a given reconstruction quality. For example, if each signal in the ensemble is K-sparse, we may hope to jointly recover the ensemble using fewer than the $\mathcal{O}(K\mathrm{log}(N/K))$ measurements per sensor that are required to reconstruct the signals separately. Like single-signal CS, DCS decoding schemes should be robust to noise and to dropped measurement packets. Joint reconstruction techniques should also be robust to the loss of individual sensors, making DCS well-suited for remote sensing applications.
B. Current approaches to DCS multi-view image reconstruction
For signals in general and images in particular, a variety of DCS decoding algorithms have been proposed to date. Fundamentally, all of these frameworks build upon the concept of sparsity for capturing intra- and inter-signal correlations.
One DCS modeling framework involves a collection of joint sparsity models (JSMs) [5]. In a typical JSM, we represent each signal x_{ j } ∈ ℝ ^{ N } in terms of a decomposition x_{ j } = z_{ C } + z_{ j } , where z_{ C } ∈ ℝ ^{ N } is a "common component" that is assumed to be present in all {x_{ j } }, and z_{ j } ∈ ℝ ^{ N } is an "innovation component" that differs for each signal. Depending on the application, different sparsity assumptions may be imposed on z_{ C } and z_{ j } . In some cases, these assumptions can dramatically restrict the space of possible signals. For example, all signals may be restricted to live within the same K-dimensional subspace. The DCS decoder then searches for a signal ensemble that is consistent with the available measurements and falls within the space of signals permitted by the JSM. For signal ensembles well modeled by a JSM, DCS reconstruction can offer a significant savings in the measurement rates. While each sensor must take enough measurements to account for its innovation component z_{ j } , all sensors can share the burden of measuring the common component z_{ C } .
Unfortunately, the applicability of JSMs to multi-view imaging scenarios can be quite limited. While two cameras in very close proximity may yield images having sparse innovations relative to a common background, any significant difference in the camera positions will dramatically increase the complexity of the innovation components. Because conventional JSMs are not appropriate for capturing any residual correlation that may remain among these innovations, we would expect JSM-based recovery to offer very little improvement over independent CS recovery.
Recently, a significant extension of the JSM framework has been proposed specifically for multi-view compressive imaging [6]. This framework assumes that images of a common scene are related by local or global geometrical transformations and proposes an overcomplete dictionary of basis elements consisting of various geometrical transformations of a generating mother function. It is assumed that each image can be decomposed into its own subset of these atoms plus the geometrically transformed atoms of the neighboring images. The benefit of this approach is that information about one image helps reduce the uncertainty about which atoms should be used to comprise the neighboring images. Unfortunately, there seems to be a limit as to how much efficiency may be gained from such an approach. To reconstruct a given image, the decoder may be tasked with solving for, say, K sparse coefficients. While the correlation model may help reduce the measurement burden at that sensor below $\mathcal{O}(K\mathrm{log}(N/K))$, it is not possible to reduce the number of measurements below K. As we will later argue, however, there is reason to believe that alternative reconstruction techniques based on the underlying scene (rather than the images themselves) can succeed with even fewer than K measurements.
Other approaches for multi-view image reconstruction could draw naturally from recent work in CS video reconstruction by ordering the static images {x_{ j } } according to their camera positions and reconstructing the sequence as a sort of "fly-by" video. One approach for video reconstruction exploits the sparsity of inter-frame differences [7]. For multi-view imaging, this would correspond to a difference image x_{ i } - x_{ j } having a sparse representation in some basis Ψ. Again, however, this condition may only be met if cameras i and j have very close proximity. We have also proposed a CS video reconstruction technique based on a motion-compensated temporal wavelet transform [8]. For multi-view imaging, we could modify this algorithm, replacing block-based motion compensation with disparity compensation. The challenge of such an approach, however, would be in finding the disparity information without prior knowledge of the images themselves. For video, we have addressed this challenge using a coarse-to-fine reconstruction algorithm that alternates between estimating the motion vectors and reconstructing successively higher resolution versions of the video using the motion-compensated wavelet transform.
What would still be missing from any of these approaches, however, is an assurance that the reconstructed images have a global consistency, i.e. that they all describe a common underlying scene. In the language of manifolds, this means that the reconstructed images do not necessarily live on a common IAM defined by a hypothetical underlying scene. This may not only lead to possible confusion in interpreting the images, but more critically may also suggest that the reconstruction algorithm is failing to completely exploit the joint structure of the ensemble-the images are in fact constrained to live in a much lower-dimensional set than the algorithm realizes.
4. Manifold lifting techniques for multi-view image reconstruction
In light of the above observations, one could argue that an effective multi-view reconstruction algorithm should exploit the underlying geometry of the scene by using an inter-signal modeling framework that ensures global consistency. To inform the design of such an algorithm, we find it helpful to view the general task of reconstruction as what we term a manifold lifting problem: we would like to recover each image x_{ j } ∈ ℝ ^{ N } from its measurements ${y}_{j}\in {\mathbb{R}}^{{M}_{j}}$ ("lifting" it from the low-dimensional measurement space back to the high-dimensional signal space), while ensuring that all recovered images live along a common IAM.
Although this interpretation does not immediately point us to a general purpose recovery algorithm (and different multi-view scenarios could indeed require markedly different algorithms), it can be informative for a number of reasons. For example, as we have discussed in Section 2-B, manifolds can have stable embeddings under random projections. If we suppose that Φ _{ j } = Φ ∈ ℝ ^{ M×N } for all j, then each measurement vector we obtain will be a point sampled from the embedded manifold Φℳ ⊂ ℝ ^{ M } . From samples of Φℳ in ℝ ^{ M } , we would like to recover samples of (or perhaps all of) ℳ in ℝ ^{ N } , and this may be facilitated if Φℳ preserves the original geometric structure of ℳ. In addition, as we have discussed in Section 2-A, many IAMs have a multiscale structure that has proved useful in solving non-compressive parameter estimation problems, and this structure may also be useful in solving multi-view recovery problems.
While this manifold-based interpretation may give us a geometric framework for signal modeling, it may not in isolation sufficiently capture all intra- and inter-signal correlations. Indeed, one cannot disregard the role that concise models such as sparsity may still play in an effective manifold lifting algorithm. Given an ensemble of measurements y_{1}, y_{2}, . . ., y_{ J } , there may be many candidates IAMs on which the original images x_{1}, x_{2}, . . ., x_{ J } may live. In order to resolve this ambiguity, one could employ either a model for the intra-signal structure (such as sparsity) or a model for the underlying structure of the scene (again, possibly sparsity). To do the latter, one must develop a representation for the underlying scene or phenomenon that is being measured and understand the mapping between that representation and the measurements y_{1}, y_{2}, . . ., y_{ J } . To keep the problem simple, this mapping will ideally be linear, and as we discuss in this section, such a representation and linear mapping can be found in a number of scenarios.
To make things more concrete, we demonstrate in this section how the manifold lifting viewpoint can inform the design of reconstruction algorithms in the context of two generic multi-view scenarios: far-field and near-field imaging. We also discuss how to address complications that can arise due to uncertainties in the camera positions. We hope that such discussions will pave the way for the future development of broader classes of manifold lifting algorithms.
A. Far-field multi-view imaging
We begin by considering the case where the cameras are far from the underlying scene, such as might occur in satellite imaging or unmanned aerial vehicle (UAV) remote sensing scenarios. In problems such as these, it may be reasonable to model each image x_{ j } ∈ ℝ ^{ N } as being a translated, rotated, scaled subimage of a larger fixed image. We represent this larger image as an element x drawn from a vector space such as ℝ ^{ Q } with Q > N, and we represent the mapping from x to x_{ j } (which depends on the camera position θ_{ j } ) as a linear operator that we denote as ${R}_{{\theta}_{j}}:{\mathbb{R}}^{Q}\to {\mathbb{R}}^{N}$. This operator ${R}_{{\theta}_{j}}$ can be designed to incorporate different combinations of translation, rotation, scaling, etc., followed by a restriction that limits the field of view.
Given y and Φ_{big}R, and assuming x is sparse in some basis Ψ (such as the 2D wavelet domain), we can solve the usual optimization problem as stated in (1) (or (2) if the measurements are noisy). If desired, one can use the recovered image $\widehat{x}$ to obtain estimates ${\widehat{x}}_{j}:{R}_{{\theta}_{j}}\widehat{x}$of the original subimages. These are guaranteed to live along a common IAM, namely$\mathcal{M}\left(\widehat{x}\right)$.
B. Near-field multi-view imaging
Near-field imaging may generally be more challenging than far-field imaging. Defining a useful representation for the underlying scene may be difficult, and due to effects such as parallax and occlusions, it may seem impossible to find a linear mapping from any such representation to the measurements. Fortunately, however, there are encouraging precedents that one could follow.
One representative application of near-field imaging is in Computed Tomography (CT). In CT, we seek to acquire a 3D volumetric signal x, but the signals x_{ j } that we observe correspond to slices of the Fourier transform of x. (We may assume y_{ j } = x_{ j } in such problems, and so the challenge is actually to recover ℳ(x), or equivalently just x, rather than the individual {x_{ j } }.) Given a fixed viewing angle θ_{ j } , this relationship between x and x_{ j } is linear, and so we may set up a joint recovery program akin to that proposed above for far-field imaging. Similar approaches have been used for joint recovery from undersampled frequency measurements in MRI [19].
For near-field imaging using visible light, there is generally no clear linear mapping between a 3D volumetric representation of the scene and the observed images x_{ j } . However, rather than contend with complicated nonlinear mappings, we suggest that a promising alternative may be to use the plenoptic function[24] as a centralized representation of the scene. The plenoptic function f is a hypothetical 5D function used to describe the intensity of light that could be observed from any point in space, when viewed in any possible direction. The value f(p_{ x } , p_{ y } , p_{ z } , p_{ θ } , p_{ ϕ } ) specifies the light intensity that would be measured by a sensor located at the position (p_{ x } , p_{ y } , p_{ z } ) and pointing in the direction specified by the spherical coordinates p_{ θ } , and p_{ ϕ } . (Additional parameters such as color channel can be considered.) By considering only a bounded set of viewing positions, the plenoptic function reduces to a 4D function known as the lumigraph[24].
Any image x_{ j } ∈ ℝ ^{ N } of the scene has a clear relationship to the plenoptic function. A given camera j will be positioned at a specific point (p_{ x } , p_{ y } , p_{ z } ) in space and record light intensities arriving from a variety of directions. Therefore, x_{ j } simply corresponds to a 2D "slice" of the plenoptic function, and once the camera viewpoint θ_{ j } is fixed, the mapping from f to x_{ j } is a simple linear restriction operator. Consequently, the structure of the IAM ℳ = ℳ(f) is completely determined by the plenoptic function.
As a proof of concept, we present a simple experiment in support of this approach. For the lumigraph shown in Figure 4b, which has J = 128 1D "images" that each contain N = 128 pixels, we collect M = 5 random measurements from each image. From these measurements, we attempt to reconstruct the entire lumigraph using wedgelets [27] following a multiscale technique outlined in Chapter 6 of [28]. The reconstructed lumigraph is shown in Figure 4d and is relatively accurate despite the small number of measurements.
Finally, to illustrate the rich interplay between geometry within the lumigraph and the underlying geometry of the scene, we show that it is actually possible to use the reconstructed lumigraph to estimate the underlying scene geometry. While we omit the precise details of our approach, the estimated wedgelets help us to infer three pieces of information: the positions of each local wedgelet patch in the v and t directions indicate a camera position and viewing direction, respectively, while the orientation of the wedgelet indicates a depth at which a point in the scene belongs to the object. Putting these estimates together, we obtain the reconstruction of the scene geometry shown in Figure 4e. This promising proof of concept suggests that wedgelets or surflets could indeed play an important role in the future for developing improved concise models for lumigraph processing.
C. Dealing with uncertainties in camera positions
In all of our discussions above, we have assumed the camera positions θ_{ j } were known. In some situations, however, we may have only noisy estimates $\hat{{\theta}_{j}}={\theta}_{j}+{n}_{j}$ of the camera positions. Supposing that we can define linear mappings between the underlying scene and the images x_{ j } , it is straightforward to extend the CS recovery problem to account for this uncertainty. In particular, letting R denote the concatenation of the mappings ${R}_{{\theta}_{j}}$ as in (4), and letting $\hat{R}$ denote the concatenation of the mappings ${R}_{{\hat{\theta}}_{j}}$ corresponding to the noisy camera positions, it follows that $y={\Phi}_{big}Rx={\Phi}_{big}\hat{R}x+n$ for some noise vector n, and so (2) can be used to obtain an approximation $\widehat{x}$ of the underlying scene. Of course, the accuracy of this approximation will depend on the quality of the camera position estimates.
When faced with significant uncertainty about the camera positions, the multiscale properties of IAMs help us to conceive of a possible coarse-to-fine reconstruction approach. As in Section 2-A, let h_{ s } denote a blurring kernel at scale s and suppose for simplicity that Θ = ℝ. Based on the arguments presented in [13], it follows that for most reasonable mappings θ → x_{ θ } , we will have $\left|\right|\frac{\partial \left({h}_{s}*{x}_{\theta}\right)}{{\partial}_{\theta}}|{|}_{2}\to 0$ as s → 0. What this implies is that, on manifolds of regularized images ℳ _{ s } = {h_{ s } * x_{ θ } : θ ∈ Θ}, the images will change slowly as a function of camera position, and so we can ensure that ${h}_{s}*\left({R}_{{\hat{\theta}}_{j}}x\right)$ is arbitrarily close to ${h}_{s}*\left({R}_{{\theta}_{j}}x\right)$ by choosing s sufficiently small (a sufficiently "coarse" scale). Now, suppose that some elements of each y_{ j } are devoted to measuring ${h}_{s}*{x}_{j}={h}_{s}*\left({R}_{{\theta}_{j}}x\right)$. We denote these measurements by y_{ j, s } = Φ_{ j, s }(h_{ s } * x_{ j } ). In practice, we may replace the convolution operator with a matrix H_{ s } and collect ${y}_{j,s}={\Phi}_{j,s}{H}_{s}{x}_{j}={\Phi}_{j,s}{H}_{s}{R}_{{\theta}_{j}}x$ instead. Concatenating all of the ${\left\{{y}_{j,s}\right\}}_{j=1}^{J}$, we may then use the noisy position estimates to define operators $\left\{{R}_{{\hat{\theta}}_{j}}\right\}$ and solve (2) as above to obtain an estimate $\widehat{x}$ of the scene. This estimate will typically correspond to a lowpass filtered version of x, since for many reasonable imaging models, we will have ${h}_{s}*\left({R}_{{\theta}_{j}}x\right)\approx {R}_{{\theta}_{j}}\left({h}_{s}^{\prime}*x\right)$ for some lowpass filter ${h}_{s}^{\prime}$, and this implies that ${y}_{j,s}\approx {\Phi}_{j,s}{R}_{{\theta}_{j}}\left({h}_{s}^{\prime}*x\right)$ contains only low frequency information about x.
Given this estimate, we may then re-estimate the camera positions by projecting the measurement vectors y_{ j, s } onto the manifold $\mathcal{M}\left(\widehat{x}\right)$. (This may be accomplished, for example, using the parameter estimation techniques described in [13].) Then, having improved the camera position estimates, we may reconstruct a finer scale (larger s) approximation to the true images {x_{ j } }, and so on, alternating between the steps of estimating camera positions and reconstructing successively finer scale approximations to the true images. This multiscale, iterative algorithm requires the sort of multiscale randomized measurements we describe above, namely y_{ j, s } = Φ_{ j, s }(h_{ s } * x_{ j } ) for a sequence of scales s. In practice, the noiselet transform [29] offers one fast technique for implementing these measurement operators Φ _{ j, s }H_{ s } at a sequence of scales. Noiselet scales are also nested, so measurements at a scale s_{1} can be re-used as measurements at any scale s_{2}> s_{1}.
The manifold viewpoint can also be quite useful in situations where the camera positions are completely unknown, as they might be in applications such as cryo-electron microscopy (Cryo-EM) [30]. Because we anticipate that an IAM ℳ will have a stable embedding Φℳ in the measurement space, it follows that the relative arrangement of the points {x_{ j } } on ℳ will be preserved in Φℳ. Since this relative arrangement will typically reflect the relative arrangement of the values {θ_{ j } } in Θ, we may apply to the compressive measurements^{d} any number of "manifold learning" techniques (such as ISOMAP [11]) that are designed to discover such parameterizations from unlabeled data. An algorithm such as ISOMAP will provide an embedding of J points in ℝ ^{ p } whose relative positions can be used to infer the relative camera positions; a similar approach has been developed specifically for the Cryo-EM problem [30]. (Some side information may be helpful at this point to convert these relative position estimates into absolute position estimates.) Once we have these estimates, we may resort to the iterative refinement scheme described above, alternating between the steps of estimating camera positions and reconstructing successively finer scale approximations to the true images.
5. Manifold lifting case study
A. Problem setup
We denote the vertical and horizontal position of satellite j by ${\theta}_{j}=\left({\theta}_{j}^{V},{\theta}_{j}^{H}\right)\in {\mathbb{R}}^{2}$. The satellite positions take real values and are chosen randomly except for the caveats that the fields of view all must fall within the square support of x and that each of the four corners of x must be seen by at least one camera. (These assumptions are for convenience but can be relaxed without major modifications to the recovery algorithm.) We let ${R}_{{\theta}_{j}}$ denote the N × Q linear operator that maps x to the image x_{ j } . This operator involves a resampling of x to account for the real-valued position vector θ_{ j } , a restriction of the field of view, and a spatial lowpass filtering and decimation, as we assume that x_{ j } has lower resolution (larger pixel size) than x.
In order to reduce data transmission burdens, we suppose that each satellite encodes a random set of measurements ${y}_{j}={\Phi}_{j}{x}_{j}\in {\mathbb{R}}^{{M}_{j}}$ of its incident image x_{ j } . Following the discussion in 4-C, these random measurements are collected at a sequence of coarse-to-fine scales s_{1}, s_{2}, . . ., s_{ T } using noiselets. (The noiselet measurements can actually be collected using CS imaging hardware [3], bypassing the need for a conventional N-pixel sensor.) We concatenate all of the measurement vectors ${\left\{{y}_{j,{s}_{i}}\right\}}_{i=1}^{T}$ into the length-M_{ j } measurement vector y_{ j } = Φ_{ j }x_{ j }. Finally, we assume that all satellites use the same set of measurement functions, and so we define M := M_{1} = M_{2} = M_{ J } and Φ:= Φ_{1} = Φ_{2} = ... = Φ_{ J }.
Our decoder will be presented with the ensemble of the measurement vectors y_{1}, y_{2}, . . ., y_{ J } but will not be given any information about the camera positions (save for an awareness of the two caveats mentioned above) and will be tasked with the challenge of recovering the underlying scene x. Although it would be interesting to consider quantization in the measurements, it is beyond the scope of this paper and we did not implement any quantization steps in the following simulations.
B. Manifold lifting algorithm
We combine the discussions provided in Sections 4-A and 4-C to design a manifold lifting algorithm that is specifically tailored to this problem.
1) Initial estimates of satellite positions
2) Iterations
where Ψ is a wavelet basis and ε is chosen^{e} to reflect the uncertainty in the camera positions θ_{ j }^{.} Given $\widehat{\alpha}$, we can then compute the corresponding estimate of the underlying scene as $\widehat{x}=\Psi \widehat{\alpha}$.
where again in each y_{ j } we use only the coarse scale measurements. To solve this problem, we use the multiscale Newton algorithm proposed in [13].
With the improved estimates ${\widehat{\theta}}_{j}$, we may then refine our estimate of $\widehat{x}$ but can do so by incorporating finer scale measurements. We alternate between the steps of reconstructing the scene $\widehat{x}$ and re-estimating the camera positions ${\widehat{\theta}}_{j}$, successively bringing in the measurements ${y}_{j,{s}_{2}},{y}_{j,{s}_{3}},\phantom{\rule{0.3em}{0ex}}.\phantom{\rule{0.3em}{0ex}}.\phantom{\rule{0.3em}{0ex}}.\phantom{\rule{0.3em}{0ex}},{y}_{j,{s}_{T}}$. (At each scale, it may help to alternate once or twice between the two estimation steps before bringing in the next finer scale of measurements. One can also repeat until convergence or until reaching a designated stopping criterion.) Finally, having brought in all of the measurements, we obtain our final estimate $\widehat{x}$ of the underlying scene.
3) Experiments
We run our simulations on an underlying image x of size Q = 192 × 192 that is shown in Figure 5a. We suppose that x corresponds to 1 square unit of land area. We observe this scene using J = 200 randomly positioned cameras, each with a limited field of view. Relative to x, each field of view is of size 128 × 128, corresponding to 0.44 square units of land area as indicated by the red boxes in Figure 5a. Within each field of view, we observe an image x_{ j } of size N = 64 × 64 pixels that has half the resolution (twice the pixel size) compared to x. The total number of noiselet scales for an image of this size is 6. For each image, we disregard the coarsest noiselet scale and set s_{1}, s_{2}, . . ., s_{5} corresponding to the five finest noiselet scales. For each image, we collect 96 random noiselet measurements: 16 at scale s_{1}, and 20 at each of the scales s_{2}, . . ., s_{5}. Across all scales and all cameras, we collect a total of 96 · 200 = 19, 200 ≈ 0: 52Q measurements.
In order to assess the effectiveness of our algorithm, we compare it to three different reconstruction methods. In all of these methods, we assume that the exact camera positions are known and we keep the total number of measurements fixed to 19,200. First, we compare to image-by-image CS recovery, in which we reconstruct the images x_{ j } independently from their random measurements y_{ j } and then superimpose and average them at the correct positions. As expected, and as shown in Figure 5b, this does not yield a reasonable reconstruction because there is far too little data collected (just 96 measurements) about any individual image to reconstruct it in isolation. Thus, we see the dramatic benefits of joint recovery.
Third, we compare to another alternative encoding scheme, where rather than encode 96 random noiselet measurements of each image, we encode the 96 largest wavelet coefficients of the image in the Haar wavelet basis. (We choose Haar due to its similarity with the noiselet basis, but the performance is similar using other wavelet bases.) This is a rough approximation for how a non-CS transform coder might encode the image, and for the encoding of a single image in isolation, this is typically a more efficient encoding strategy than using random measurements. (Recall that for reconstructing a single signal, one must encode about K log(N/K) random measurements to obtain an approximation comparable to K-term transform coding.) However, when we concatenate the ensemble of encoded wavelet coefficients and solve (1) to estimate $\widehat{x}$, we see from the result in Figure 8b that the reconstructed image has lower quality than that we obtained using a manifold lifting algorithm based on random measurements, even though the camera positions were unknown for the manifold lifting experiment. In a sense, by using joint decoding, we have reduced the CS overmeasuring factor from its familiar value of log(N/K) down to something below 1! We believe this occurs primarily because the images {x_{ j } } are highly correlated, and the repeated encoding of large wavelet coefficients (which tend to concentrate at coarse scales) results in repeated encoding of redundant information across the multiple satellites. In other words, it is highly likely that prominent features will be encoded by many satellites over and over again, whereas other features may not be encoded at all. As a result, by examining Figure 8b, we see that strong features such as streets and the edges of buildings (which have large wavelet coefficients) are relatively more accurately reconstructed than, for example, forests or cars in parking lots (which have smaller wavelet coefficients). Random measurements capture more diverse information within and among the images. To more clearly illustrate the specific benefit that random measurements provide over transform coding (for which the camera positions were known), we show in Figure 8c a reconstruction obtained using random measurements with known camera positions.
Reconstruction results with varying numbers of camera positions J
ML w/out cam. | |||||
---|---|---|---|---|---|
J | Independent CS | TC w/cam. | ML w/cam. | PSNR | $\frac{1}{J}{\sum}_{j}|{\theta}_{j}-{\widehat{\theta}}_{j}|$ |
200 | 14.4 | 22.8 | 24.7 | 23.6 | (0.0108, 0.0132) |
150 | 13.7 | 22.9 | 24.6 | 23.7 | (0.0110, 0.0148) |
100 | 15.1 | 23.5 | 25.1 | 23.9 | (0.0177, 0.0121) |
70 | 15.6 | 23.7 | 24.6 | 23.8 | (0.0059, 0.0143) |
6. Discussion and conclusion
In summary, we have discussed in this paper how non-collaborative CS measurement schemes can be used to simplify the acquisition and encoding of multi-image ensembles. We have presented a geometric framework in which many multi-view imaging problems may be cast and explained how this framework can inform the design of effective manifold lifting algorithms for joint reconstruction. We conclude with a few remarks concerning practical and theoretical aspects of the manifold lifting framework.
First, let us briefly discuss the process of learning camera positions when they are initially completely unknown. In our satellite experiments, we have observed that the accuracy of the ISOMAP embedding depends on the relative size of the subimages x_{ j } to the underlying scene x, with larger subimages leading us to higher quality embeddings. As the size of the subimages decreases, we need more and more camera positions to get a reasonable embedding, and we can reach a point where even thousands of camera positions are insufficient. In such cases, and in applications not limited to satellite imaging, it may be possible to get a reliable embedding by grouping local camera positions together. On a different note, once an initial set of camera position estimates has been obtained, it may also be possible to build on an idea suggested in [31] and seek a refinement of these position estimates that minimize the overall ${\ell}_{1}$ norm of the reconstructed image. A multiscale approach could again help such a technique converge if the initial estimates are far off.
Second, an interesting open question is whether the measurement matrices utilized in DCS multi-view imaging scenarios satisfy the RIP with respect to some reconstruction basis Ψ. Establishing an RIP bound would give a guide for the requisite number of measurements (ideally, at each scale) and also give a guarantee for reconstruction accuracy. Although we do not yet have a definitive answer to this question, we suggest that there may be promising connections between these matrices and other structured matrices that have been studied in the CS literature. For example, the measurement matrix Φ_{big}R employed in the satellite experiment is closely related to a partial circulant matrix, where the relative shifts between the rows represent the relative offsets between the camera positions. RIP results have been established for circulant matrices [32] that are generated by a densely populated random row vector. In our case, Φ_{big}R has more of a block circulant structure because it is generated by the submatrices Φ_{ j }, and so there may also be connections with the analysis in [33]. However, each row of Φ_{big}R will contain a large number of zeros, and it is conceivable that this could degrade the isometric property of Φ_{big}R. We believe, though, that by collecting multiple measurements from each camera, we are compensating for this degradation. Other possible directions for analysis could be to build on the concentration of measure bounds recently established for block diagonal matrices [34] and Toeplitz matrices [35].
Finally, another open question in the manifold lifting framework is what could be said about the uniqueness of ℳ(x) given samples of Φℳ(x). When all points on the manifold ℳ(x) are K-sparse, the RIP can be one avenue to proving uniqueness, but since our objective is to sample fewer than $\mathcal{O}(K\mathrm{log}(N/K))$ measurements for each signal, a stronger argument would be preferable. By considering the restricted degrees of freedom that these signal ensembles have, it seems reasonable to believe that we can in fact establish a stronger result. We are currently exploring geometric arguments for proving uniqueness.
Endnotes
^{a}It is also possible to consider other more general non-orthonormal dictionaries. ^{b}Depending on the scenario, the parameter space Θ could be a subset of ℝ ^{ p } , or it could be some more general topological manifold such as SO(3), e.g. if θ corresponds to the orientation of some object in 3D space. ^{c}Although an IAM ℳ may not itself be smooth, a regularized manifold ℳ _{ s } will be smooth, and later in this paper we discuss image reconstruction strategies based on random projections of ℳ _{ s } at a sequence of scales s. ^{d}We have found that this process also performs best using measurements of h_{ s } * x_{ j } for s small because of the smoothness of the manifold ℳ _{ s } at coarse scales. ^{e}In our experiments, we choose the parameter ε as somewhat of an oracle, in particular as 1.1$\left|\right|y-{\Phi}_{big}\hat{R}x|{|}_{2}$. In other words, this is slightly larger than the error that would result if we measured the true image x but with the wrong positions as used to define $\hat{R}$. This process should be made more robust in future work.
Declarations
Acknowledgements
The authors gratefully acknowledge Richard Baraniuk and Hyeokho Choi for many influential conversations concerning the lumigraph and for their help in developing the lumigraph experiments presented here. This research was partially supported by DARPA Grant HR0011-08-1-0078 and AFOSR Grant FA9550-09-1-0465. A preliminary version of some results in this paper originally appeared in [10].
Authors’ Affiliations
References
- Donoho D: Compressed sensing. IEEE Trans Inf Theory 2006, 52(4):1289-1306.MathSciNetView ArticleGoogle Scholar
- Candès E: Compressive sampling. In Proc Int Congress Math. Volume 3. Madrid, Spain; 2006:1433-1452.Google Scholar
- Duarte M, Davenport M, Takhar D, Laska J, Sun T, Kelly K, Baraniuk R: Single-pixel imaging via compressive sampling. IEEE Signal Process Mag 2008, 25(2):83-91.View ArticleGoogle Scholar
- Slepian D, Wolf J: Noiseless coding of correlated information sources. IEEE Trans Inf Theory 2003, 19(4):471-480.MathSciNetView ArticleGoogle Scholar
- Baron D, Wakin MB, Duarte M, Sarvotham S, Baraniuk RG: Distributed compressed sensing. Rice University Technical Report TREE-0612; 2006.Google Scholar
- Chen X, Frossard P: Joint reconstruction of compressed multi-view images. Proc IEEE Int Conf Acoustics, Speech, Signal Process (ICASSP) 2009.Google Scholar
- Marcia R, Willett R: Compressive coded aperture video reconstruction. Proc Eur Signal Process Conf (EUSIPCO) 2008.Google Scholar
- Park JY, Wakin MB: A multiscale framework for compressive sensing of video. Proc Picture Coding Symp (PCS) 2009.Google Scholar
- Mallat S: A Wavelet Tour of Signal Processing, 3rd edn. The Sparse Way. Academic Press; 2008.Google Scholar
- Wakin MB: A manifold lifting algorithm for multi-view compressive imaging. Proc Picture Coding Symp (PCS) 2009.Google Scholar
- Tenenbaum JB, Silva V, Langford JC: A global geometric framework for nonlinear dimensionality reduction. Science 2000, 290(5500):2319-2323. 10.1126/science.290.5500.2319View ArticleGoogle Scholar
- Donoho DL, Grimes C: Image manifolds which are isometric to Euclidean space. J Math Imaging Comp Vis 2005, 23(1):5-24. 10.1007/s10851-005-4965-4MathSciNetView ArticleGoogle Scholar
- Wakin MB, Donoho D, Choi H, Baraniuk RG: The multiscale structure of non-differentiable image manifolds. Proc Wavelets XI at SPIE Optics and Photonics 2005.Google Scholar
- Baraniuk R, Wakin M: Random projections of smooth manifolds. Found Comput Math 2009, 9(1):51-77. 10.1007/s10208-007-9011-zMathSciNetView ArticleGoogle Scholar
- Candès E: The restricted isometry property and its implications for compressed sensing. Comptes rendus de l'Académie des Sciences, Série I 2008, 346(9-10):589-592.Google Scholar
- Candès EJ, Tao T: Decoding by linear programming. IEEE Trans Inf Theory 2005, 51(12):4203-4215. 10.1109/TIT.2005.858979View ArticleGoogle Scholar
- Baraniuk R, Davenport M, DeVore R, Wakin M: A simple proof of the restricted isometry property for random matrices. Constr Approx 2008, 28(3):253-263. 10.1007/s00365-007-9003-xMathSciNetView ArticleGoogle Scholar
- Wakin M: Manifold-based signal recovery and parameter estimation from compressive measurements.2008. [http://arxiv.org/abs/1002.1247]Google Scholar
- Lustig M, Donoho DL, Santos JM, Pauly JM: Compressed sensing MRI. IEEE Signal Process Mag 2008, 25(2):72-82.View ArticleGoogle Scholar
- Wyner AD, Ziv J: The rate-distortion function for source coding with side information at the decoder. IEEE Trans Inf Theory 1976, 22: 1-10. 10.1109/TIT.1976.1055508MathSciNetView ArticleGoogle Scholar
- Tošić I, Frossard P: Distributed multi-view image coding with learned dictionaries. Proc Int ICST Mobile Multimedia Comm Conf 2009.Google Scholar
- Gehrig N, Dragotti PL: Geometry-driven distributed compression of the plenoptic function: performance bounds and constructive algorithms. IEEE Trans Image Process 2009, 18: 457-470.MathSciNetView ArticleGoogle Scholar
- Tošić I, Frossard P: Geometry-based distributed scene representation with omnidirectional vision sensors. IEEE Trans Image Process 2008, 17(7):1033-1046.MathSciNetView ArticleGoogle Scholar
- Gortler SJ, Grzeszczuk R, Szeliski R, Cohen MF: The lumigraph. Proc Ann Conf Computer Graphics Interactive Tech (SIGGRAPH) 1996.Google Scholar
- Berent J, Dragotti PL: Plenoptic manifolds. IEEE Signal Process Mag 2007., 24(6):Google Scholar
- Donoho DL: Wedgelets: Nearly-minimax estimation of edges. Ann Statist 1999, 27: 859-897. 10.1214/aos/1018031261MathSciNetView ArticleGoogle Scholar
- Chandrasekaran V, Wakin MB, Baron D, Baraniuk R: Representation and compression of multi-dimensional piecewise functions using surflets. IEEE Trans Inf Theory 2009, 55(1):374-400.MathSciNetView ArticleGoogle Scholar
- Wakin MB: The geometry of low-dimensional signal models. PhD dissertation, Department of Electrical and Computer Engineering, Rice University, Houston, TX; 2006.Google Scholar
- Coifman R, Geshwind F, Meyer Y: Noiselets. Appl Comput Harmon Anal 2001, 10(1):27-44. 10.1006/acha.2000.0313MathSciNetView ArticleGoogle Scholar
- Singer A, Coifman RR, Sigworth FJ, Chester DW, Shkolnisky Y: Detecting consistent common lines in cryo-em by voting. J Struct Biol 2010, 169(3):312-322. 10.1016/j.jsb.2009.11.003View ArticleGoogle Scholar
- Huff C, Muise R: Wide-area surveillance with multiple cameras using distributed compressive imaging. Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series 2011., 8055:Google Scholar
- Bajwa WU, Haupt JD, Raz GM, Wright SJ, Nowak RD: Toeplitz-structured compressed sensing matrices. Proc IEEE/SP Workshop Stat Signal Process (SSP) 2007.Google Scholar
- Marcia RF, Willett RM: Compressive coded aperture superresolution image reconstruction. Proc IEEE Int Conf Acoustics, Speech, Signal Process (ICASSP) 2008.Google Scholar
- Wakin MB, Park JY, Yap HL, Rozell CJ: Concentration of measure for block diagonal measurement matrices. Proc IEEE Int Conf Acoustics, Speech, Signal Process (ICASSP) 2010.Google Scholar
- Sanandaji BM, Vincent TL, Wakin MB: Concentration of measure inequalities for compressive Toeplitz matrices with applications to detection and system identification. Proc IEEE Conf Decision and Control (CDC) 2010.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.