3.1 General principle
The schematic layout of CLFIS is shown in Fig. 1. The system consists of two main parts: the fore optics and the imaging spectrometer. The fore optics is telecentric in image space so that the chief rays behind L1 are parallel to the optical axis. This telecentric structure ensures that the chief rays converge at the foci of the lens array, so the center of each microlens and the center of its sub-image coincide in the x and y directions. The aperture is placed at the focal plane of L1 and is randomly encoded in binary. L2 and L3 have the same focal length and form a 4f system. The light from the first image formed by L1 is dispersed by an Amici prism located at the common focal plane of L2 and L3. The light is then collected by L3 and imaged on the “light field sensor,” which is composed of a lens array and an area-array detector. The lens array consists of hundreds of microlenses with the same focal length and diameter. The distance between the lens array and the detector equals the focal length of the microlenses. The “light field sensor” records the intensity and direction of the light simultaneously. As shown in Fig. 1, the object plane is conjugate to the lens array, and the coded aperture is conjugate to the detector. Thus, CLFIS acquires encoded light field data that carry spectral information. Based on light field theory and a digital refocusing algorithm, the depth information of targets can be recovered from the light field data. Since the coded aperture is imaged onto the detector, the sub-image under each microlens is encoded in binary. The sub-images are sparse in some transform domain, such as the Fourier domain or the wavelet domain. As a result, the spatial and spectral information of targets can be recovered based on CS theory and the CASSI reconstruction technique [23].
3.2 Optical model
We assume that the light intensity from an arbitrary target point is denoted by L(xo, yo, zo, λ), where (xo, yo, zo) are the object-plane coordinates and λ is the wavelength under consideration. The coordinates of the coded aperture are denoted by (u, v), and the aperture plane is set as the origin of the optical axis (the zo axis, with the positive direction along the light path). The coordinates of the first image plane, the lens array and the detector are denoted by (xi, yi, zi), (s, t) and (xd, yd), respectively. The optical layout of the fore optics is shown in Fig. 2. The width of the aperture, which is assumed to be square, is D1. The first image point of an arbitrary object point P(xo, yo, zo) is represented by P′(xi, yi, zi). According to the Gaussian imaging formula, (xi, yi, zi) is given as
$$\left\{ \begin{gathered} x_{i} = x_{o} f_{1} /z_{o} \hfill \\ y_{i} = y_{o} f_{1} /z_{o} \hfill \\ z_{i} = \left( {z_{o} - f_{1} } \right)f_{1} /z_{o} \hfill \\ \end{gathered} \right.$$
(1)
The light field data record the direction information of the light. Since the coded aperture is conjugate to the detector, the pixels under each sub-image are encoded by the aperture, and the number of coded elements in the aperture equals the number of pixels under each microlens. The position of each sub-pupil is denoted by (um, vn) (m = 1, 2, …, M, n = 1, 2, …, N, where M and N are the numbers of coded elements of the aperture in the u and v directions, respectively, and m and n are the indexes). The aperture stop is randomly encoded in binary, as illustrated in Fig. 3. The transmission function printed on the aperture is represented as T(u, v),
$$T\left( {u,v} \right) = \sum\limits_{m = 1}^{M} {\sum\limits_{n = 1}^{N} {t_{m,n} } \,{\text{rect}}\left( {\frac{u}{{d_{a} }} - u_{m} ,\frac{v}{{d_{a} }} - v_{n} } \right)}$$
(2)
where da is the side length of each element, and tm,n represents the coded status (1 for open and 0 for closed) at location (um, vn).
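As an illustration, the following sketch shows how such a random binary code t_{m,n} and a sampled transmission function might be generated. The values of M, N, da and the 50% open ratio are assumptions chosen for this example only; they are not specified in the text above.

```python
import numpy as np

# Minimal sketch of a random binary coded aperture, following Eq. (2).
# M, N, d_a and the 50% open ratio are illustrative assumptions.
M, N = 11, 11                      # number of coded elements along u and v (assumed)
d_a = 0.2                          # side length of one element in mm (assumed)

rng = np.random.default_rng(seed=0)
t_mn = (rng.random((M, N)) < 0.5).astype(np.uint8)   # t_{m,n}: 1 = open, 0 = closed

# Element centers (u_m, v_n), with the pattern centered on the optical axis
u_centers = (np.arange(M) - (M - 1) / 2) * d_a
v_centers = (np.arange(N) - (N - 1) / 2) * d_a

def transmission(u, v):
    """Sampled T(u, v): 1 if (u, v) falls inside an open element, else 0."""
    m = int(np.floor(u / d_a + M / 2.0))
    n = int(np.floor(v / d_a + N / 2.0))
    if 0 <= m < M and 0 <= n < N:
        return int(t_mn[m, n])
    return 0
```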
We assume that the light from P(xo, yo, zo) passing through the sub-pupil A(um, vn, 0) intersects L1 at Q(ul, vl, f1). Since P, A and Q lie on the same ray, the vector PA is parallel to AQ according to geometrical optics [24], so we obtain
$$\frac{{u_{m} - x_{o} }}{{u_{l} - u_{m} }} = \frac{{v_{n} - y_{o} }}{{v_{l} - v_{n} }} = - \frac{{z_{o} }}{{f_{1} }}$$
(3)
The intersection point with L1 is given by
$$\left\{ {\begin{array}{*{20}l} {u_{l} = \frac{{f_{1} }}{{z_{o} }} \cdot x_{o} + \left( {1 - \frac{{f_{1} }}{{z_{o} }}} \right) \cdot u_{m} } \hfill \\ {v_{l} = \frac{{f_{1} }}{{z_{o} }} \cdot y_{o} + \left( {1 - \frac{{f_{1} }}{{z_{o} }}} \right) \cdot v_{n} } \hfill \\ \end{array} } \right.$$
(4)
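As a small numerical illustration of Eqs. (3) and (4), the sketch below traces one ray from an object point through a sub-pupil to L1 and verifies the parallelism condition. The focal length and all coordinates are arbitrary example values, not parameters of the actual system.

```python
import numpy as np

# Ray from P(x_o, y_o, z_o) through sub-pupil A(u_m, v_n, 0) to Q(u_l, v_l, f1).
# All numerical values are assumed for illustration.
f1 = 50.0                              # focal length of L1 in mm (assumed)
x_o, y_o, z_o = 2.0, -1.0, -2000.0     # object point in mm; z_o < 0 ahead of the aperture
u_m, v_n = 0.4, -0.2                   # sub-pupil center on the coded aperture (mm)

# Eq. (4): intersection point Q of the ray with L1
u_l = (f1 / z_o) * x_o + (1.0 - f1 / z_o) * u_m
v_l = (f1 / z_o) * y_o + (1.0 - f1 / z_o) * v_n
print(u_l, v_l)

# Consistency check with Eq. (3): (u_m - x_o)/(u_l - u_m) should equal -z_o/f1
assert np.isclose((u_m - x_o) / (u_l - u_m), -z_o / f1)
```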
Since L2 and L3 form a 4f system and the lens array is conjugate to the first image plane, the CLFIS without the dispersion element can be simplified to the system shown in Fig. 4. In this simplified system, the light from P(xo, yo, zo) passing through A(um, vn, 0) intersects the lens array at P′(x′a, y′a, f1 + li), where li is the distance between the first image plane and L1. Then,
$$\left\{ {\begin{array}{*{20}l} {x_{a}^{{\prime }} = - \frac{{f_{1} + l_{i} - z_{i} }}{{z_{i} - f_{1} }} \cdot u_{l} + \frac{{l_{i} }}{{z_{i} - f_{1} }} \cdot x_{i} } \hfill \\ {y_{a}^{{\prime }} = - \frac{{f_{1} + l_{i} - z_{i} }}{{z_{i} - f_{1} }} \cdot v_{l} + \frac{{l_{i} }}{{z_{i} - f_{1} }} \cdot y_{i} } \hfill \\ \end{array} } \right.$$
(5)
As seen in Fig. 1, the Amici prism is a direct-vision prism, so light at the center wavelength passes through it without deflection. The dispersion coefficient of the prism is denoted by α, and the dispersion is assumed to be linear. Then, the position of P′ in the CLFIS with dispersion is given by
$$\left\{ \begin{gathered} x_{a} = - \frac{{f_{1} + l_{i} - z_{i} }}{{z_{i} - f_{1} }} \cdot u_{l} + \frac{{l_{i} }}{{z_{i} - f_{1} }} \cdot x_{i} + \alpha \left( {\lambda - \lambda_{o} } \right) \hfill \\ y_{a} = - \frac{{f_{1} + l_{i} - z_{i} }}{{z_{i} - f_{1} }} \cdot v_{l} + \frac{{l_{i} }}{{z_{i} - f_{1} }} \cdot y_{i} \hfill \\ \end{gathered} \right.$$
(6)
where λo is the center wavelength; the “smile,” “keystone” and nonlinear dispersion effects are not considered. Finally, taking into account the encoding by the aperture stop, the intensity of the detector pixel associated with the sub-pupil (um, vn) under the microlens (px, py) is expressed as
$$\begin{aligned} & I\left( {p_{x} ,p_{y} ,u_{m} ,v_{n} } \right) \\ & \quad = \int\limits_{0}^{{X_{o} }} {\int\limits_{0}^{{Y_{o} }} {\int\limits_{0}^{{Z_{o} }} {L\left( {x_{o} ,y_{o} ,z_{o} ,u_{m} ,v_{n} } \right)} } } \cdot T\left( {u_{m} ,v_{n} } \right) \cdot dx_{o} dy_{o} dz_{o} \\ \end{aligned}$$
(7)
where L(·) is the light intensity from P(xo, yo, zo) passing through the sub-pupil (um, vn). Considering the discrete pixel sampling of the sensor, the pixel indices of the image (px, py) are given by
$$\left\{ {\begin{array}{*{20}l} {p_{x} = \frac{{x_{a} + 0.5w}}{d} + 1} \hfill \\ {p_{y} = \frac{{y_{a} + 0.5w}}{d} + 1} \hfill \\ \end{array} } \right.$$
(8)
where w is the width of sensor format and d is the width of each pixel of the sensor.
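The geometric model of Eqs. (1), (4), (6) and (8) can be summarized as a single mapping from an object point, a wavelength and a sub-pupil to a detector pixel. The sketch below transcribes these equations directly; all numerical parameters (focal length, li, dispersion coefficient, sensor width, pixel pitch) are illustrative assumptions rather than values taken from the text.

```python
import numpy as np

# Sketch of the geometric forward model of Eqs. (1), (4), (6) and (8):
# one monochromatic object point seen through one sub-pupil is mapped to
# a detector pixel index. All parameter values are assumed for illustration.
f1 = 50.0        # focal length of L1 (mm)
l_i = 55.0       # distance between the first image plane and L1 (mm, assumed)
alpha = 0.002    # linear dispersion coefficient of the Amici prism (mm/nm, assumed)
lam_o = 550.0    # center wavelength (nm)
w = 10.0         # width of the sensor format (mm)
d = 0.01         # width of one pixel (mm)

def forward_pixel(x_o, y_o, z_o, lam, u_m, v_n):
    # Eq. (1): paraxial image of the object point formed by L1
    x_i = x_o * f1 / z_o
    y_i = y_o * f1 / z_o
    z_i = (z_o - f1) * f1 / z_o

    # Eq. (4): intersection of the ray through sub-pupil (u_m, v_n) with L1
    u_l = (f1 / z_o) * x_o + (1 - f1 / z_o) * u_m
    v_l = (f1 / z_o) * y_o + (1 - f1 / z_o) * v_n

    # Eq. (6): intersection with the lens-array plane, including dispersion
    a = (f1 + l_i - z_i) / (z_i - f1)
    b = l_i / (z_i - f1)
    x_a = -a * u_l + b * x_i + alpha * (lam - lam_o)
    y_a = -a * v_l + b * y_i

    # Eq. (8): conversion to 1-based pixel indices on the detector
    p_x = int(np.floor((x_a + 0.5 * w) / d)) + 1
    p_y = int(np.floor((y_a + 0.5 * w) / d)) + 1
    return p_x, p_y

# Example call (arbitrary illustrative values):
# forward_pixel(x_o=2.0, y_o=-1.0, z_o=-2000.0, lam=600.0, u_m=0.4, v_n=-0.2)
```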
3.3 Datacube reconstruction
The detector of CLFIS records the compressed spectral light field data. To obtain spectral images with depth estimation, we first reconstruct the light field data at the different wavelengths using a CS algorithm and then recover the depth information from the spectral light field data using the digital refocusing technique. The sub-images of the spectral light field have a smooth spatial structure, because their pixels measure the light intensity coming from different directions of the same object point. As a result, the sub-images are sparse in some transform domain, such as the Fourier domain, the wavelet domain, or an orthogonal dictionary domain. We denote the original mixed data by g; its representation in the 2D wavelet domain is given by

$$g = W\theta$$

(9)
where g is the vector containing the 3D spatial and spectral information, W is the 2D wavelet transform matrix, and θ is the vector of transform coefficients of g in the 2D wavelet domain. The imaging process can be represented by

$$I = Hg = HW\theta$$

(10)
where H is the imaging matrix, which can be derived from Eq. (7). Given I, H and W, we can estimate θ under the sparsity assumption using a CS reconstruction algorithm. The estimate of θ, denoted by θ′, is obtained by solving the following optimization problem:
$$\theta^{{\prime }} = \arg \min \left[ {\left\| {I - HW\theta^{{\prime }} } \right\|_{2}^{2} + \tau \left\| {\theta^{{\prime }} } \right\|_{1} } \right]$$
(11)
where the ‖·‖2 term is the l2 norm of (I − HWθ′) and the ‖·‖1 term is the l1 norm of the estimate θ′. The first term minimizes the l2 error, and the second term minimizes the number of nonzero elements of θ′ to enforce sparsity. Many approaches have been proposed to solve this optimization problem, such as the Two-step Iterative Shrinkage/Thresholding (TwIST) algorithm [25], the Gradient Projection for Sparse Reconstruction (GPSR) algorithm [26], the Orthogonal Matching Pursuit (OMP) algorithm [27] and learning methods based on deep networks [28,29,30]. In this paper, we chose the traditional TwIST algorithm for the optimal sparse solution. TwIST uses the regularization function to penalize undesirable estimates of θ′, and τ is a tuning parameter that controls the sparsity of θ′. The code of TwIST is available online (http://www.lx.it.pt/~bioucas/code.htm). After estimating θ, we can calculate the estimated g through Eq. (9).

Since the estimated g is spectral light field data, digital refocusing of g is needed to obtain the depth information of targets. The digital refocusing method used in this paper was proposed in [31]. Refocusing remaps the light field onto the selected imaging plane and integrates it over the pupil plane. According to the Fourier projection-slice theorem, this projection integration in the spatial domain is equivalent to slicing in the Fourier domain: the imaging projection of the light field in the spatial domain corresponds to a slice of the Fourier transform of the light field in the frequency domain, with the slicing angle perpendicular to the projection angle. If the image distance after digital refocusing is assumed to be zi′ = γzi, the refocused image is given by
$$f^{{\prime }} \left( {x_{d}^{{\prime }} ,y_{d}^{{\prime }} } \right)_{\lambda } = \iint {f_{\lambda } }\left( {u_{m} ,v_{n} ,x_{d} ,y_{d} } \right)du_{m} dv_{n}$$
(12)
where
$$\left\{ {\begin{array}{*{20}l} {x_{d} = \frac{{x_{d}^{{\prime }} }}{\gamma } + u_{m} \left( {1 - \frac{1}{\gamma }} \right)} \hfill \\ {y_{d} = \frac{{y_{d}^{{\prime }} }}{\gamma } + v_{n} \left( {1 - \frac{1}{\gamma }} \right)} \hfill \\ \end{array} } \right.$$
(13)
According to Eq. (13), we can choose the value of γ to obtain sharply focused spectral images at different depths. Depth estimation approaches using light field camera data usually fall into three main categories: sub-aperture image matching-based methods, Epipolar-Plane Image (EPI)-based methods and learning-based methods. In this paper, we use the first category combined with the defocus cue described in Ng’s work [31]. The spectral light field images are refocused at a series of continuous depth candidates. Among these candidates, the one whose refocused image of the selected area is the most distinct, as judged by an average gradient function, is taken as the actual depth.
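A minimal sketch of this refocus-and-score procedure is given below, assuming the reconstructed light field of one spectral channel is stored as a 4D array lf[m, n, y, x] (one sub-aperture image per coded-aperture element) with the sub-pupil offsets expressed in detector pixel units. The helper names, the set of γ candidates and the region of interest are illustrative choices, not part of the original method.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def refocus(lf, u_m, v_n, gamma):
    """Shift-and-add refocusing of one spectral channel following Eqs. (12)-(13)."""
    M, N, H, W = lf.shape
    yy, xx = np.mgrid[0:H, 0:W].astype(float)      # refocused coordinates (x'_d, y'_d)
    out = np.zeros((H, W))
    for m in range(M):
        for n in range(N):
            # Eq. (13): map refocused coordinates back to detector coordinates
            x_d = xx / gamma + u_m[m] * (1.0 - 1.0 / gamma)
            y_d = yy / gamma + v_n[n] * (1.0 - 1.0 / gamma)
            # Eq. (12): accumulate the resampled sub-aperture image
            out += map_coordinates(lf[m, n], [y_d, x_d], order=1, mode='nearest')
    return out / (M * N)

def average_gradient(img):
    """Sharpness metric: mean magnitude of the image gradient."""
    gy, gx = np.gradient(img)
    return np.mean(np.sqrt(gx ** 2 + gy ** 2))

def estimate_depth(lf, u_m, v_n, gammas, roi):
    """Pick the gamma (depth candidate) whose refocused ROI is the sharpest."""
    scores = [average_gradient(refocus(lf, u_m, v_n, g)[roi]) for g in gammas]
    return gammas[int(np.argmax(scores))]

# Example with random data: 5x5 sub-pupils, 64x64 pixels in one spectral channel
lf = np.random.rand(5, 5, 64, 64)
u_m = v_n = (np.arange(5) - 2) * 2.0               # assumed sub-pupil offsets (pixels)
best_gamma = estimate_depth(lf, u_m, v_n,
                            gammas=np.linspace(0.8, 1.2, 21),
                            roi=(slice(16, 48), slice(16, 48)))
```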