Discrete shearlet transform on GPU with applications in anomaly detection and denoising
EURASIP Journal on Advances in Signal Processing volume 2014, Article number: 64 (2014)
Abstract
Shearlets have emerged in recent years as one of the most successful methods for the multiscale analysis of multidimensional signals. Unlike wavelets, shearlets form a pyramid of well-localized functions defined not only over a range of scales and locations, but also over a range of orientations and with highly anisotropic supports. As a result, shearlets are much more effective than traditional wavelets in handling the geometry of multidimensional data, and this has been exploited in a wide range of image and signal processing applications. However, despite their desirable properties, the wider applicability of shearlets is limited by the computational complexity of current software implementations. For example, denoising a single 512 × 512 image using a current implementation of the shearlet-based shrinkage algorithm can take between 10 s and 2 min, depending on the number of CPU cores, and much longer processing times are required for video denoising. On the other hand, due to the parallel nature of the shearlet transform, it is possible to use graphics processing units (GPU) to accelerate its implementation. In this paper, we present an open source stand-alone implementation of the 2D discrete shearlet transform using CUDA C++ as well as GPU-accelerated MATLAB implementations of the 2D and 3D shearlet transforms. We have instrumented the code so that we can analyze the running time of each kernel under different GPU hardware. In addition to denoising, we describe a novel application of shearlets for detecting anomalies in textured images. In this application, computation times can be reduced by a factor of 50 or more, compared to multicore CPU implementations.
1 Introduction
During the last decade, a new generation of multiscale systems has emerged which combines the power of the classical multiresolution analysis with the ability to process directional information with very high efficiency. Some of the most notable examples of such systems include the curvelets [1], the contourlets [2], and the shearlets [3]. Unlike classical wavelets, the elements of such systems form a pyramid of well-localized waveforms ranging not only across various scales and locations, but also across various orientations and with highly anisotropic shapes. Due to their richer structure, these more sophisticated multiscale systems are able to overcome the poor directional sensitivity of traditional multiscale systems and have been used to derive several state-of-the-art algorithms in image and signal processing (cf. [4, 5]).
Shearlets, in particular, offer a unique combination of very remarkable features: they have a simple and well-understood mathematical structure derived from the theory of affine systems [3, 6]; they provide optimally sparse representations, in a precise sense, for a large class of images and other multidimensional data where wavelets are suboptimal [7, 8]; and their directionality is controlled by shear matrices rather than rotations. This last property, in particular, enables a unified framework for the continuum and discrete settings, since shear transformations preserve the rectangular lattice, and is an advantage in deriving faithful digital implementations [9, 10].
The shearlet decomposition has been successfully employed in many problems from applied mathematics and signal processing, including decomposition of operators [11], inverse problems [12, 13], edge detection [14–16], image separation [17], and image restoration [18–20]. However, one major bottleneck to the wider applicability of the shearlet transform is that current discrete implementations tend to be very time-consuming, making its use impractical for large data sets and for real-time applications. For instance, the current (CPU-based) MATLAB implementation^{a} of the 2D shearlet transform, run on a typical desktop PC, takes about 2 min to denoise a noisy image of size 512×512 [9, 21]. The running time of the current (CPU-based) MATLAB implementation of the 3D shearlet transform for denoising a video sequence of size 192^{3} is about 5 min [20]. Running times for alternative shearlet implementations from ShearLab [10] as well as for the current implementation of the curvelet transform [22] are comparable.
In recent years, general-purpose graphics processing units (GPGPUs) have become ubiquitous not only on high-performance computing (HPC) clusters, but also on workstations. For example, Titan, which was until recently the world’s fastest supercomputer, contains 18,688 NVIDIA Tesla K20X GPUs. These GPUs provide about 90% of Titan’s peak computing performance, which is greater than 20 PetaFLOPS (quadrillion floating point operations per second). Due to their energy efficiency and capabilities, GPGPUs are also becoming mainstream on mobile platforms, such as iOS and Android devices. There are two main architectures for GPGPU computing: CUDA and OpenCL. CUDA was designed by NVIDIA and has been around since 2006. OpenCL was originally designed by Apple, Inc., and was introduced in 2008. OpenCL is an open standard maintained by the Khronos Group, whose members include Intel, AMD, NVIDIA, and many others, so it has broader industry acceptance than any other architecture. In 2009, Microsoft introduced DirectCompute as an alternative architecture for GPGPU computing, which is only available in Windows Vista and later. OpenCL has been designed to provide the developer with a common framework for doing computation on heterogeneous devices. One of the advantages of OpenCL is that it can potentially support any computing device, such as CPUs, GPUs, and FPGAs, as long as there is an OpenCL compiler available for such a processor. NVIDIA provides CUDA/OpenCL drivers, libraries, and development tools for the three major operating systems (Linux, Windows, and Mac OS X), while AMD/ATI™ and Intel provide OpenCL drivers and tools for their respective GPUs.
The objective of this paper is to introduce and demonstrate a new implementation of the 2D and 3D discrete shearlet transforms which takes advantage of the computational capabilities of the graphics processing unit (GPU). To demonstrate the effectiveness of the proposed implementations, we will illustrate their application to problems of image and video denoising and to a problem of feature recognition aimed at crack detection of railway components. In particular, we will show that our new implementation takes about 40 ms to denoise an image of size 512 × 512, which is a 233× speedup compared to a single-core CPU, and about 3 s to denoise a video of size 192^{3}, which is a 551× speedup compared to a single-core CPU.
The organization of the paper is as follows. In Section 2, we recall the construction of 2D and 3D shearlets. Next, in Section 3, we present our implementation of the discrete shearlet transform, and in Section 4, we benchmark our implementation using three specific applications. Finally, concluding remarks and future work are discussed in Section 5.
2 Shearlets
In this section, we recall the construction of 2D and 3D shearlets (cf.[6, 7]).
2.1 2D shearlets
To construct smooth Parseval frames of shearlets for $L^2(\mathbb{R}^2)$, we start by defining appropriate multiscale function systems supported in the following cone-shaped regions of the Fourier domain $\hat{\mathbb{R}}^2$:
Let $\phi \in C^{\infty}(\mathbb{R})$ be a ‘bump’ function with $\operatorname{supp}\phi \subset \left[-\frac{1}{8},\frac{1}{8}\right]$ and $\phi = 1$ on $\left[-\frac{1}{16},\frac{1}{16}\right]$. For $\xi = (\xi_1,\xi_2) \in \hat{\mathbb{R}}^2$, let $\Phi(\xi) = \Phi(\xi_1,\xi_2) = \phi(\xi_1)\,\phi(\xi_2)$ and define the function
Note that the functions $W_j^2 = W^2(2^{-2j}\cdot)$, $j \ge 0$, have support inside the Cartesian coronae
and that they produce a smooth tiling of the frequency plane:
Let $V \in C^{\infty}(\mathbb{R})$ be such that $\operatorname{supp} V \subset [-1,1]$, $V(0) = 1$, $V^{(n)}(0) = 0$ for all $n \ge 1$, and
For ${F}_{\left(1\right)}\left({\xi}_{1},{\xi}_{2}\right)=V\left(\frac{{\xi}_{2}}{{\xi}_{1}}\right)$ and ${F}_{\left(2\right)}\left({\xi}_{1},{\xi}_{2}\right)=V\left(\frac{{\xi}_{1}}{{\xi}_{2}}\right),$ the shearlet systems associated with the coneshaped regions${\mathcal{P}}_{\nu},\nu =1,2$ are defined as the countable collection of functions
where
and
Note that the dilation matrices A_{(1)}, A_{(2)} produce anisotropic dilations, namely, parabolic scaling dilations; by contrast, the shear matrices B_{(1)}, B_{(2)} are non-expanding, and their integer powers control the directional features of the shearlet system. Hence, the systems (1) form collections of well-localized functions defined at various scales, orientations, and locations, controlled by the indices j, ℓ, k, respectively. In particular, the functions ${\widehat{\psi}}_{j,\ell,k}^{(1)}$, given by (2) with ν = 1, can be written explicitly as
showing that their supports are contained inside the trapezoidal regions
in the Fourier plane (see Figure 1). Similar properties hold for the functions ${\widehat{\psi}}_{j,\ell ,k}^{\left(2\right)}$.
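For illustration, the window V can be realized concretely using the standard Meyer auxiliary polynomial. The sketch below is our own illustrative construction, not necessarily the exact filters of the shearlet toolbox: it satisfies supp V ⊂ [−1,1], V(0) = 1, and the three-shift tiling identity |V(u−1)|² + |V(u)|² + |V(u+1)|² = 1 on [−1,1] by construction (its derivatives at 0 vanish only up to finite order, so it is a simplified stand-in).

```python
import numpy as np

def meyer_aux(x):
    """Meyer auxiliary polynomial: v(0)=0, v(1)=1, v(x)+v(1-x)=1."""
    x = np.clip(x, 0.0, 1.0)
    return x**4 * (35.0 - 84.0*x + 70.0*x**2 - 20.0*x**3)

def V(u):
    """Smooth window with supp V in [-1,1], V(0)=1, and
    |V(u-1)|^2 + |V(u)|^2 + |V(u+1)|^2 = 1 for |u| <= 1."""
    u = np.asarray(u, dtype=float)
    vals = np.cos(0.5 * np.pi * meyer_aux(np.abs(u)))
    return np.where(np.abs(u) < 1.0, vals, 0.0)

u = np.linspace(-1.0, 1.0, 2001)
tiling = V(u - 1)**2 + V(u)**2 + V(u + 1)**2
print(float(V(0.0)), float(np.max(np.abs(tiling - 1.0))))
```

The tiling identity follows from v(x) + v(1−x) = 1 combined with cos² + sin² = 1.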
A smooth Parseval frame for the whole space $L^2(\mathbb{R}^2)$ is obtained by combining the two shearlet systems associated with the cone-based regions $\mathcal{P}_1$ and $\mathcal{P}_2$ together with a coarse-scale system associated with the low-frequency region. To ensure that all elements of this combined shearlet system are $C_c^{\infty}$ in the Fourier domain, the elements whose supports overlap the boundaries of the cone regions in the frequency domain are slightly modified. That is, we define a shearlet system for $L^2(\mathbb{R}^2)$ as
consisting of

the coarse-scale shearlets $\{\tilde{\psi}_{-1,k} = \Phi(\cdot - k) : k \in \mathbb{Z}^2\}$;

the interior shearlets $\{\tilde{\psi}_{j,\ell,k,\nu} = \psi_{j,\ell,k}^{(\nu)} : j \ge 0, |\ell| < 2^{j}, k \in \mathbb{Z}^2, \nu = 1,2\}$, where the functions $\psi_{j,\ell,k}^{(\nu)}$ are given by (2);

the boundary shearlets $\{\tilde{\psi}_{j,\ell,k} : j \ge 0, \ell = \pm 2^{j}, k \in \mathbb{Z}^2\}$, obtained by joining together slightly modified versions of $\psi_{j,\ell,k}^{(1)}$ and $\psi_{j,\ell,k}^{(2)}$ for $\ell = \pm 2^{j}$, after restricting them in the Fourier domain to the cones $\mathcal{P}_1$ and $\mathcal{P}_2$, respectively. We refer to [6] for details.
For brevity, let us denote the system (3) using the compact notation
where $M = M_C \cup M_I \cup M_B$ is the set of indices associated with coarse-scale shearlets, interior shearlets, and boundary shearlets, respectively. We have the following result from [6]:
Theorem 2.1.
The system of shearlets (3) is a Parseval frame for $L^2(\mathbb{R}^2)$. That is, for any $f \in L^2(\mathbb{R}^2)$,
All elements $\{\tilde{\psi}_{\mu}, \mu \in M\}$ are $C^{\infty}$ and compactly supported in the Fourier domain.
As mentioned above, it is proved in [7] that the 2D Parseval frame of shearlets $\left\{{\stackrel{~}{\psi}}_{\mu},\phantom{\rule{0.6em}{0ex}}\mu \in M\right\}$ provides essentially optimal approximations for functions of two variables which are C^{2} regular away from discontinuities along C^{2} curves.
The mapping from $f \in L^2(\mathbb{R}^2)$ into the elements $\langle f, \tilde{\psi}_{\mu}\rangle$, $\mu \in M$, is called the 2D shearlet transform.
2.2 3D shearlets
The construction outlined above extends to higher dimensions. In 3D, a shearlet system is obtained by appropriately combining three systems of functions associated with the pyramidal regions
in which the Fourier space ${\hat{\mathbb{R}}}^{3}$ is partitioned. With ϕ defined as above, for $\xi =\left({\xi}_{1},{\xi}_{2},{\xi}_{3}\right)\in {\hat{\mathbb{R}}}^{3}$, we now let
and $W(\xi) = \sqrt{\Phi^2(2^{-2}\xi) - \Phi^2(\xi)}$. As in the two-dimensional case, we have the smooth tiling condition
Hence, for $d = 1,2,3$, $\ell = (\ell_1,\ell_2) \in \mathbb{Z}^2$, the 3D shearlet systems associated with the pyramidal regions $\mathcal{P}_d$ are defined as the collections
where
${F}_{\left(1\right)}\left({\xi}_{1},{\xi}_{2},{\xi}_{3}\right)=V\left(\frac{{\xi}_{2}}{{\xi}_{1}}\right)V\left(\frac{{\xi}_{3}}{{\xi}_{1}}\right),{F}_{\left(2\right)}\left({\xi}_{1},{\xi}_{2},{\xi}_{3}\right)=V\left(\frac{{\xi}_{1}}{{\xi}_{2}}\right)V\left(\frac{{\xi}_{3}}{{\xi}_{2}}\right),{F}_{\left(3\right)}\left({\xi}_{1},{\xi}_{2},{\xi}_{3}\right)=V\left(\frac{{\xi}_{1}}{{\xi}_{3}}\right)V\left(\frac{{\xi}_{2}}{{\xi}_{3}}\right)$, the anisotropic dilation matrices A_{(d)} are given by
and the shear matrices are defined by
Similar to the 2D case, the shearlets ${\widehat{\psi}}_{j,\ell ,k}^{\left(1\right)}(\xi )$ can be written explicitly as
showing that their supports are contained inside the trapezoidal regions
Note that these support regions become increasingly more elongated at fine scales, due to the action of the anisotropic dilation matrices ${A}_{\left(1\right)}^{j}$, and the orientations of these regions are controlled by the shear parameters ℓ_{1},ℓ_{2}. A typical support region is illustrated in Figure 2. Similar properties hold for the elements associated with the regions ${\mathcal{P}}_{2}$ and ${\mathcal{P}}_{3}$.
A Parseval frame of shearlets for $L^2(\mathbb{R}^3)$ is obtained by using an appropriate combination of the systems of shearlets associated with the three pyramidal regions $\mathcal{P}_d$, $d = 1,2,3$, together with a coarse-scale system, which takes care of the low-frequency region. Similar to the 2D case, in order to build such a system in a way that all its elements are smooth in the Fourier domain, one has to appropriately define the elements of the shearlet systems overlapping the boundaries of the pyramidal regions $\mathcal{P}_d$ in the Fourier domain. We refer to [8, 15] for details. Hence, we define the 3D shearlet systems for $L^2(\mathbb{R}^3)$ as the collections
which again can be identified as the coarse-scale, interior, and boundary shearlets. It turns out that the 3D system of shearlets is a Parseval frame of $L^2(\mathbb{R}^3)$ [6] and it provides essentially optimal approximations for functions of three variables which are C^{2} regular away from discontinuities along C^{2} surfaces [8].
3 Discrete implementation
A faithful numerical implementation of the 2D shearlet transform was originally presented in [9]. Let us briefly recall the main steps of this implementation.
3.1 2D discrete shearlet transform
Recall that the shearlet coefficients associated with the interior shearlets can be expressed as
First, to compute $\widehat{f}(\xi_1,\xi_2)\,W(2^{-2j}\xi)$ in the discrete domain, at the resolution level j, we apply the Laplacian pyramid algorithm [23], which is implemented in the space domain. Let $\widehat{f}[k_1,k_2]$ denote the 2D discrete Fourier transform (DFT) of $f \in \ell^2(\mathbb{Z}_N^2)$, where we adopt the convention that brackets [·,·] denote arrays of indices and parentheses (·,·) denote function evaluations, and where we interpret the numbers $\widehat{f}[k_1,k_2]$ as samples $\widehat{f}[k_1,k_2] = \widehat{f}(k_1,k_2)$ from the trigonometric polynomial
The Laplacian pyramid algorithm accomplishes the multiscale partition illustrated in Figure 3 by decomposing $f_a^{\,j-1}[n_1,n_2]$, $0 \le n_1,n_2 < N_{j-1}$, into a low-pass filtered image $f_a^{\,j}[n_1,n_2]$, a quarter of the size of $f_a^{\,j-1}[n_1,n_2]$, and a high-pass filtered image $f_d^{\,j}[n_1,n_2]$. Observe that the matrix $f_a^{\,j}[n_1,n_2]$ has size $N_j \times N_j$, where $N_j = 2^{-2j}N$, and $f_a^{0}[n_1,n_2] = f[n_1,n_2]$ has size $N \times N$. In particular, we have that
and thus, $f_d^{\,j}[n_1,n_2]$ are the discrete samples of a function $f_d^{\,j}(x_1,x_2)$, whose Fourier transform is $\widehat{f_d^{\,j}}(\xi_1,\xi_2)$. Since this operation is implemented as a convolution in the space domain, this step of the algorithm is one of the most computationally expensive.
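As a minimal sketch of one level of this decomposition (illustrative NumPy code with a separable binomial filter decimating by 2 per axis, not the toolbox's nonseparable à trous filters), the detail image is defined as the difference between the input and the expanded low-pass image, so one-level reconstruction is exact by construction:

```python
import numpy as np

def blur(img, k=(1.0, 4.0, 6.0, 4.0, 1.0)):
    """Separable 5-tap binomial low-pass filter, periodic boundary."""
    k = np.asarray(k) / np.sum(k)
    for axis in (0, 1):
        img = sum(w * np.roll(img, s, axis=axis)
                  for w, s in zip(k, range(-2, 3)))
    return img

def reduce_(img):
    """Low-pass filter, then decimate by 2 along each axis."""
    return blur(img)[::2, ::2]

def expand(img):
    """Zero-upsample by 2 along each axis, then low-pass (gain 4)."""
    up = np.zeros((2 * img.shape[0], 2 * img.shape[1]))
    up[::2, ::2] = img
    return 4.0 * blur(up)

rng = np.random.default_rng(0)
f = rng.standard_normal((64, 64))
f_a = reduce_(f)            # coarse approximation (quarter the samples)
f_d = f - expand(f_a)       # detail (high-pass) image
f_rec = expand(f_a) + f_d   # one-level reconstruction, exact by construction
print(f_a.shape, np.max(np.abs(f_rec - f)))
```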
The next step produces the directional filtering, and this is achieved by computing the DFT on the pseudo-polar grid and then applying a one-dimensional band-pass filter to the components of the signal with respect to this grid. More precisely, let us define the pseudo-polar coordinates $(u,v) \in \mathbb{R}^2$ as follows:
After performing this change of coordinates, we obtain
where $g_j(u,w) = \widehat{f_d^{\,j}}(\xi_1,\xi_2)$. This shows that the directional components are obtained by simply translating the window function V. The discrete samples $g_j[n_1,n_2] = g_j(n_1,n_2)$ are the values of the DFT of $f_d^{\,j}[n_1,n_2]$ on a pseudo-polar grid.
Now let $\{v_{j,\ell}[n] : n \in \mathbb{Z}\}$ be the sequence whose discrete Fourier transform gives the samples of the window function $V(2^{j}k - \ell)$, i.e., $\widehat{v}_{j,\ell}[k] = V(2^{j}k - \ell)$. Then, for fixed $n_1 \in \mathbb{Z}$, we have
where ∗ denotes the one-dimensional convolution along the n_2 axis and $\mathcal{F}_1$ is the one-dimensional discrete Fourier transform. Thus, (6) gives the algorithmic implementation for computing the discrete samples of $g_j(u,w)\,v(2^{j}w - \ell)$. At this point, to compute the shearlet coefficients in the discrete domain, it suffices to compute the inverse PDFT or directly reassemble the Cartesian sampled values and apply the inverse two-dimensional FFT. Figure 3 illustrates the cascade of the Laplacian pyramid and directional filtering. Recall that once the discrete shearlet coefficients are obtained, the inverse shearlet transform is computed using the following steps: (i) convolution of discrete shearlet coefficients and synthesis directional filters, (ii) summation of all directional components, and (iii) reconstruction by inverse Laplacian pyramidal transformation.
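The directional filtering step can be illustrated in the Fourier domain: any family of angular windows that forms a partition of unity splits the signal into directional components whose sum reconstructs the input exactly. The sketch below uses normalized Gaussian bumps in angle as a stand-in for the Meyer-type windows; it is illustrative, not the paper's filter design.

```python
import numpy as np

def directional_windows(n, n_dir=8, sigma=0.5):
    """Angular windows on an n-by-n FFT grid that sum to one pointwise
    (normalized Gaussian bumps in angle; a stand-in for Meyer-type
    directional windows)."""
    k = np.fft.fftfreq(n) * n
    kx, ky = np.meshgrid(k, k, indexing='ij')
    theta = np.arctan2(ky, kx)                       # frequency angle
    centers = np.linspace(-np.pi, np.pi, n_dir, endpoint=False)
    d = np.angle(np.exp(1j * (theta[None] - centers[:, None, None])))
    w = np.exp(-0.5 * (d / sigma) ** 2)
    return w / w.sum(axis=0, keepdims=True)          # partition of unity

rng = np.random.default_rng(1)
f = rng.standard_normal((64, 64))
F = np.fft.fft2(f)
W = directional_windows(64)
components = [np.fft.ifft2(F * w) for w in W]        # directional pieces
recon = np.real(sum(components))                     # sums back to f
print(np.max(np.abs(recon - f)))
```

Because the windows sum to one at every frequency, summing the components recovers the input up to floating-point rounding.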
3.2 2D GPUbased implementation
Before implementing the 2D discrete shearlet transform algorithm on the GPU, we profiled the existing implementation available as a MATLAB toolbox at http://www.math.uh.edu/~dlabate/shearlet_toolbox.zip. Table 1 contains the breakdown of the processing times showing that the FFT computations used to perform directional filtering and the convolution part of the à trous algorithm used for pyramidal image decomposition and reconstruction take around 75% of the computation time. Hence, they were the first candidates for porting into CUDA.
Since most of the computing time for performing a discrete shearlet transform is spent in FFT function calls, it is crucial to have the best possible library to perform FFTs. The two main GPU vendors provide optimized FFT libraries: NVIDIA provides cuFFT as part of its CUDA Toolkit, and AMD provides clAmdFft as part of its Accelerated Parallel Processing Math Libraries (APPML). We have decided to use CUDA as our development architecture, both because there is better documentation and because of the availability of more mature development tools. We have implemented the device code in CUDA C++, while the host code is pure C++. Since both CUDA C/C++ and OpenCL are based on the C programming language, porting the code from CUDA to OpenCL should not be difficult. However, for code compactness, we have made extensive use of templates and operator overloading, which are supported in CUDA C++, but not in OpenCL, which is based on C99.
To facilitate the development, we have used GPUmat from the GPyou Group, a free (GPLv3) GPU engine for MATLAB^{®} (source code is available from http://sourceforge.net/projects/gpumat/). This framework provides two new classes, GPUsingle and GPUdouble, which encapsulate vectors of numerical data allocated on GPU memory and allow mathematical operations on objects of such classes via function and operator overloading. Transfers between CPU and GPU memory are as simple as doing type casting, and memory allocation and deallocation are done automatically. The idea is that existing MATLAB functions could be reused without any code changes. In practice, however, in order to get acceptable performance, it is necessary to handtune the code or even use lower level languages such as C/C++.
Fortunately, the GPUmat framework provides an interface for manipulating these objects from MEX files, and a mechanism for loading custom kernels. Although there are commercial alternatives to GPUmat, such as Jacket from AccelerEyes or the Parallel Computing Toolbox from MathWorks, we have found that GPUmat is quite robust and adds very little overhead to the execution time as long as we follow good programming practices, such as in-place operations and reuse of preallocated buffers.
Our implementation supports both single precision (32bit) and double precision (64bit) IEEE 754 floating point numbers (double precision is only supported on devices with compute capability 2.0 or newer due to limitations in the maximum amount of shared memory available per multiprocessor). We generate the filter bank of directional filters using the Fourierdomain approach from [9], where directional filters are designed as Meyertype window functions in the Fourier domain. Since this step only needs to be run once and does not depend on the image dimensions, we precompute these directional filters using the original MATLAB implementation.
For the Laplacian pyramidal decomposition, we ported the à trous algorithm using symmetric extension [2] into CUDA. This algorithm requires performing nonseparable convolutions with decimated signals. For efficiency reasons, the kernel that performs à trous convolutions preloads blocks of data into shared memory, so that the memory is only accessed once from each GPU thread.
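A CPU-side sketch may clarify what the à trous convolution computes: the kernel taps are spaced `dilation` pixels apart (zeros "inserted" into the filter), with symmetric boundary extension. The function below is illustrative NumPy code, not the ported CUDA kernel itself.

```python
import numpy as np

def atrous_conv2(img, kernel, dilation=1):
    """'A trous' (with holes) 2D correlation: kernel taps are spaced
    `dilation` pixels apart, with symmetric boundary extension."""
    kh, kw = kernel.shape
    ph, pw = (kh // 2) * dilation, (kw // 2) * dilation
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode='symmetric')
    out = np.zeros(img.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            di, dj = i * dilation, j * dilation
            out += kernel[i, j] * padded[di:di + img.shape[0],
                                         dj:dj + img.shape[1]]
    return out

rng = np.random.default_rng(2)
img = rng.standard_normal((32, 32))
ker = rng.standard_normal((3, 3))
y1 = atrous_conv2(img, ker, dilation=1)   # ordinary 3x3 correlation
y2 = atrous_conv2(img, ker, dilation=2)   # taps spaced 2 pixels apart
print(y1.shape, y2.shape)
```

The CUDA version described above performs the same accumulation but stages each image tile in shared memory so global memory is read only once per thread block.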
With the above GPU-based Laplacian pyramid and directional filter implementation, computing the forward and inverse shearlet transforms is just a matter of applying convolutions on the GPU.
The main steps of our GPUbased shearlet transform are as shown in Table 2.
3.3 3D discrete shearlet transform
The algorithm for the discretization of the 3D shearlet transform is very similar to the 2D shearlet transform, and our implementation of the 3D discrete shearlet transform adapts the code available from http://www.math.uh.edu/~dlabate/3Dshearlet_toolbox.zip and described in [20]. The main practical difference is that storing the 3D shearlet coefficients is much more memoryintensive. Since the memory requirement can easily exceed the available GPU memory, in our algorithm, we compute one convolution at a time in CUDA and add the result to the output.
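The memory-saving strategy can be sketched as follows: transform the volume once, then apply one frequency-domain filter per iteration and accumulate into a single output buffer, so only one coefficient volume is resident at a time (illustrative NumPy code, not the CUDA implementation; the filter bank here is a synthetic stand-in).

```python
import numpy as np

def filtered_sum_lowmem(vol, filters):
    """Apply a bank of frequency-domain filters to a volume one at a
    time, accumulating the result, so only one coefficient volume
    is held in memory at once."""
    V = np.fft.fftn(vol)
    out = np.zeros(vol.shape, dtype=float)
    for H in filters:                     # one convolution per iteration
        out += np.real(np.fft.ifftn(V * H))
    return out

rng = np.random.default_rng(3)
vol = rng.standard_normal((16, 16, 16))
# synthetic real filters; any bank that sums to one reproduces the input
filters = rng.random((4, 16, 16, 16))
filters /= filters.sum(axis=0, keepdims=True)
print(np.max(np.abs(filtered_sum_lowmem(vol, filters) - vol)))
```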
4 Applications
In the following, we illustrate the advantages of our new implementation of the discrete shearlet transform by considering three applications: denoising of natural images corrupted with white Gaussian noise, detection of cracks in textured images, and denoising of videos. The source code, sample data, as well as the MATLAB scripts used to generate all the figures in this paper are publicly available at http://www.umiacs.umd.edu/~gibert/ShearCuda.zip.
For benchmarking, we have evaluated the performance of the new discrete shearlet transform both on multicore CPUs and on GPUs. All CPU tests have been performed on a Dell PowerEdge C6145 with four-socket AMD Opteron™ 6274 processors at 2.2 GHz (64 cores total) and 256 GB RAM, running Red Hat Enterprise Linux (RHEL) 6. This machine is one of 16 identical nodes in the high-performance computing (HPC) cluster Euclid at the University of Maryland. During these benchmarks, we had exclusive access to this node, and no other processes were running, except for regular system services. To better understand the performance of this code when running on systems with different numbers of cores, we limited the number of available cores in some of the experiments. We found that neither MATLAB’s maxNumCompThreads nor -singleCompThread works reliably on non-Intel processors, so we used the taskset Linux command to set the processor affinity to the desired number of cores. GPU tests were performed on different machines running RHEL 5 or 6, and CUDA 4.2 or 5.0. The tests include devices with CUDA Compute Capabilities (CC) between 1.3 and 3.5. Table 3 summarizes the configurations used in our experiments.
4.1 Image denoising
As a first test, we evaluated the performance of our implementation of the discrete shearlet transform on a problem of image denoising, using a standard denoising algorithm based on hard thresholding of the shearlet coefficients. The setup is similar to the one described in [9]. That is, given an image $f \in \mathbb{R}^{N^2}$, we observe a noisy version of it given by u = f + ε, where $\epsilon \in \mathbb{R}^{N^2}$ is additive white Gaussian noise independent of f, i.e., $\epsilon \sim N(0,\sigma^2 \mathbf{I}_{N^2 \times N^2})$. Our goal is to compute an estimate $\tilde{f}$ of f from the noisy data u by applying a classical hard thresholding scheme [24] to the shearlet coefficients of u. The threshold levels are given by $\tau_{i,j,n} = \sigma_{\epsilon_{i,j}}^{2}/\sigma_{i,j,n}^{2}$, as in [2, 9, 25], where $\sigma_{i,j,n}^{2}$ denotes the variance of the n-th coefficient of the i-th directional subband at the j-th scale, and $\sigma_{\epsilon_{i,j}}^{2}$ is the noise variance at scale j and directional band i. The variances $\sigma_{\epsilon_{i,j}}^{2}$ are estimated by using a Monte Carlo technique in which the variances are computed for several normalized noise images and the estimates are then averaged.
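The shrinkage step itself reduces to an elementwise hard threshold of the coefficients. The sketch below uses a synthetic sparse subband and an illustrative 3σ threshold rather than the variance-ratio thresholds τ_{i,j,n} described above:

```python
import numpy as np

def hard_threshold(coeffs, tau):
    """Hard thresholding: keep a coefficient only if its magnitude
    exceeds tau; otherwise set it to zero."""
    return np.where(np.abs(coeffs) > tau, coeffs, 0.0)

# Synthetic sparse subband: a few large coefficients plus Gaussian noise.
rng = np.random.default_rng(4)
clean = np.zeros(1000)
clean[::100] = 5.0                               # 10 'signal' coefficients
sigma = 0.3
noisy = clean + sigma * rng.standard_normal(1000)
denoised = hard_threshold(noisy, tau=3 * sigma)  # illustrative 3-sigma rule
print(np.count_nonzero(denoised))
```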
For our experiments, we used five levels of the Laplacian pyramid decomposition, and we applied a directional decomposition on four of the five scales. We used 8 shear filters of sizes 32×32 for the first two scales (coarser scales), and 16 shear filters of sizes 16×16 for the third and fourth levels (fine scales). The shear filters are Meyertype windows [9]. We used the 512×512 Barbara image to test our algorithm, and to assess its performance, we used the peak signaltonoise ratio (PSNR), measured in decibels (dB), defined by
where $\|\cdot\|_F$ is the Frobenius norm, the given image f is of size N × N, and $\tilde{f}$ denotes the estimated image.
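In code, the PSNR defined above for an N × N image with peak value 255 is a direct transcription of the formula:

```python
import numpy as np

def psnr(f, f_est, peak=255.0):
    """PSNR in dB for an N x N image: 20*log10(peak*N / ||f - f_est||_F)."""
    n = f.shape[0]
    err = np.linalg.norm(f - f_est, 'fro')
    return 20.0 * np.log10(peak * n / err)

f = np.full((512, 512), 100.0)
noisy = f + 10.0                 # constant error of 10 gray levels
print(round(psnr(f, noisy), 2))  # 20*log10(255/10), about 28.13
```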
In order to minimize latency as well as bandwidth usage on the PCIe bus, we first transferred the input image to GPU memory, then we let all the computation happen on the GPU and we finally transferred the results back to CPU memory. We have verified that both CPU and GPU implementations provide an output PSNR of 29.9 dB when the input PSNR is 22.1 dB. At these noise levels, there is no difference in PSNR between the single and the double precision implementations.
To verify the numerical accuracy, we ran the shearlet decomposition and reconstruction on a noise-free image (without thresholding), and we obtained a reconstruction mean squared error (MSE) of 9.197 × 10^{−9} for single precision and 2.503 × 10^{−12} for double precision on a GeForce GTX 690. With the CPU implementation, we obtained reconstruction errors of 9.1711 × 10^{−9} and 1.6643 × 10^{−26}, respectively. This verifies that our implementation provides numerically exact reconstruction.

The running times vary significantly depending on the number of CPU cores available and the GPU model. Figure 4 shows a comparison of running times (wall times) of the image denoising algorithm on different hardware configurations. We can clearly see that the CPU implementation does not scale well as we increase the number of CPU cores, due to parts of the algorithm running sequentially. For a fair comparison of multicore vs. GPU, we would have to compare the performance to a fully optimized CPU implementation. It should be noted that there is enough coarse-level parallelism in this algorithm to accomplish full CPU utilization without incurring inter-CPU communication issues. However, the trend reveals that on this application, the GPU is more efficient than the CPU. In summary, the denoising algorithm takes 8.89 s on 4 CPU cores vs. 0.038 s on the GeForce GTX 690 (a 233× speedup) when using single precision. For double precision, it takes 10.7 s on 4 CPU cores vs. 0.127 s on the GeForce GTX 690 (an 84× speedup).
Table 1 shows the breakdown of different parts of the image denoising algorithm both on CPU and GPU.
4.2 Crack detection
Detection of cracks on concrete structures is a difficult problem due to the changes in width and direction of the cracks, as well as the variability of the surface texture. This problem has received considerable attention recently. Redundant representations, such as undecimated wavelets, have been extensively used for crack detection [26, 27]. However, wavelets have poor directional sensitivity and have difficulties in detecting weak diagonal cracks. To overcome this limitation, Ma et al. [28] proposed the use of the nonsubsampled contourlet transform [2] for crack detection. However, all these methods rely on the assumption that the background surface can be modeled as additive white Gaussian noise, and this assumption leads to matched filter solutions. In fact, in real images, textures are highly correlated, and applying linear filters leads to poor performance.
To address this problem, we propose a completely new approach to crack detection based on separating the image into morphological distinct components using sparse representations, adaptive thresholding, and variational regularization. This technique was pioneered by Starck et al. [29] and later extended and generalized by many authors (e.g., [17, 18, 30]). In particular, we will use the Iterative Shrinkage Algorithm with a combined dictionary of shearlets and wavelets to separate cracks from background texture.
To demonstrate the performance of the GPU-accelerated iterative shrinkage algorithm, we processed three 512 × 512 images. The images correspond to cracks on concrete railroad crossties collected by ENSCO Inc. during summer 2012 using four 2,048 × 1 line-scan cameras, which were assembled into 8,192 × 3,072 frames. The cameras were triggered using a calibrated encoder, producing images with square pixels of a constant size of 0.43 mm. We have manually cropped these images so that we can decouple crack detection from crosstie boundary tracking. As one can see from Figure 5, these cracks propagate in different directions and the background texture has a lot of variation. However, due to the fact that the information in these images is highly redundant, it is possible to separate the image into two components, that is, cracks and texture, by solving an ℓ_1 optimization problem [17].
More precisely, we will model an image x containing cracks on textural background as a superposition of a crack component x_{ c }with a textural component x_{ t }:
Let Φ_{1} and Φ_{2} be the dictionaries corresponding to shearlets and wavelets, respectively. We assume that x_{c} is sparse in the shearlet dictionary Φ_{1} and, similarly, that x_{t} is sparse in the wavelet dictionary Φ_{2}. That is, we assume that there are sparse coefficients a_{c} and a_{t} so that x_{c} = Φ_{1}a_{c} and x_{t} = Φ_{2}a_{t}. Then, one can separate these components from x via the coefficients a_{c} and a_{t} by solving the following optimization problem:
where, for an n-dimensional vector b, the ℓ_{1} norm is defined as $\|b\|_{1}=\sum_{i}|b_{i}|$. This image separation problem can be solved efficiently using the iterative shrinkage algorithm proposed in [17] (Figure 5).
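The alternating shrinkage idea behind this separation can be sketched in a few lines. The sketch below is ours, not the paper's implementation: it uses a toy single-level orthonormal Haar dictionary and the standard (identity) basis in place of the shearlet and wavelet dictionaries, and the function names (`separate`, `soft`, `haar_fwd`) are illustrative assumptions.

```python
import math

def haar_fwd(x):
    # single-level orthonormal Haar transform: averages first, details second
    s = 1 / math.sqrt(2)
    a = [(x[2*i] + x[2*i + 1]) * s for i in range(len(x) // 2)]
    d = [(x[2*i] - x[2*i + 1]) * s for i in range(len(x) // 2)]
    return a + d

def haar_inv(c):
    # exact inverse of haar_fwd (the transform is orthonormal)
    n = len(c) // 2
    s = 1 / math.sqrt(2)
    x = []
    for i in range(n):
        x.append((c[i] + c[n + i]) * s)
        x.append((c[i] - c[n + i]) * s)
    return x

def soft(v, t):
    # soft-thresholding (shrinkage) operator applied coefficientwise
    return [math.copysign(max(abs(u) - t, 0.0), u) for u in v]

def separate(x, fwd1, inv1, fwd2, inv2, lam=0.1, n_iter=200):
    # alternating shrinkage: x1 is sparse under dictionary 1, x2 under dictionary 2
    x1 = [0.0] * len(x)
    x2 = [0.0] * len(x)
    for _ in range(n_iter):
        x1 = inv1(soft(fwd1([a - b for a, b in zip(x, x2)]), lam))
        x2 = inv2(soft(fwd2([a - b for a, b in zip(x, x1)]), lam))
    return x1, x2

# toy 1-D signal: a flat "texture" plus one isolated "crack" spike
x = [2.0] * 8
x[3] += 5.0
ident = lambda v: list(v)  # identity basis: spikes are 1-sparse here
spiky, smooth = separate(x, ident, ident, haar_fwd, haar_inv)
```

In the actual algorithm, the forward/inverse pairs are the shearlet and wavelet analysis/synthesis operators, and the thresholds decrease across iterations; the loop structure is otherwise the same.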
In our numerical experiments, we used symlet wavelets with four decomposition levels to generate Φ_{2}, and a four-level shearlet decomposition, with Meyer filters of size 80 × 80 on all four scales, 8 directional filters on the first three scales, and 16 directional filters on the fourth scale, to generate Φ_{1}. To assess the performance of the separation algorithm, we visually compare detection results at the peak F_{1} score (Figure 6) and calculate the ROC curves for each image using the following two detection methods (Figure 7):

1. Shearlet-C. This method takes advantage of the Parseval property of the shearlet transform and performs crack detection directly in the transform domain. We first decompose the image into crack and texture components using iterative shrinkage with a combined shearlet and wavelet dictionary. Instead of using the reconstructed image, we analyze the values of the shearlet transform coefficients. For each scale in the shearlet transform domain, we examine the directional components corresponding to each displacement and collect the maximum magnitude across all directions. If the sign of the shearlet coefficient corresponding to the maximum magnitude is positive, we classify the corresponding pixel as background; otherwise, we assign to the pixel the norm of the vector containing the maximum responses at each scale and apply a threshold.

2. Shearlet-I. We first decompose the image into crack and texture components as described for the previous method. Then, we apply an intensity threshold to the reconstructed crack image.
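As a concrete illustration of the Shearlet-C scoring rule, the following sketch (our own, not the paper's code) computes a per-pixel crack score from a nested array `coeffs[scale][direction]` of transform coefficients. One plausible reading of the sign rule is assumed here: if the maximum-magnitude coefficient at any scale is positive, the pixel is treated as background.

```python
import math

def shearlet_c_score(coeffs):
    """coeffs[s][d] is a 2-D array of shearlet coefficients at scale s,
    direction d.  Returns per-pixel crack scores (0 = background)."""
    n_scales = len(coeffs)
    rows, cols = len(coeffs[0][0]), len(coeffs[0][0][0])
    scores = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            max_mag = []
            background = False
            for s in range(n_scales):
                vals = [coeffs[s][d][y][x] for d in range(len(coeffs[s]))]
                m = max(vals, key=abs)   # coefficient with maximum magnitude
                if m > 0:                # positive sign -> background pixel
                    background = True
                max_mag.append(abs(m))
            if not background:
                # norm of the vector of per-scale maximum responses
                scores[y][x] = math.sqrt(sum(v * v for v in max_mag))
    return scores
```

A pixel is then declared a crack when its score exceeds a chosen threshold.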
We compare our results to the following two basic methods not based on shearlets:

1. Intensity. This is the most basic approach, which uses only image intensity. After compensating for slow variations of intensity across the image, we apply a global threshold.

2. Canny. We use the Canny [31] edge detector as implemented in MATLAB, with the default $\sigma =\sqrt{2}$ and the default high-to-low threshold ratio of 40%.
After applying a low-level detector, it may be necessary to remove small isolated regions corresponding to false detections caused by random noise. This post-processing step may reduce the false detection rate of intensity-based methods. However, to provide an objective comparison, we generated the experimental results without any post-processing. We leave the performance analysis of a complete crack detector for future work.
To evaluate the performance of each crack detector, we manually annotated the crack pixels in each image. To mitigate the effect of ambiguous segmentation boundaries, we annotated the boundaries around the cracks as tightly as possible (making sure that only pixels completely contained inside the crack boundaries are annotated as such) and defined an envelope region around each crack whose labels are treated as ‘do not care’. Formally, let Ω denote the set of pixels in the image and F (foreground) the set of pixels labeled as cracks. We define the set B (background) as

$$B = \left\{ x \in \Omega \setminus F : \min_{f \in F} \|x - f\| > \delta \right\},$$
where ∥x − f∥ denotes the Euclidean distance between pixels x and f. In our experiments, we used δ = 3. To account for possible small inaccuracies in the ground truth, we performed a bipartite graph matching between the detected crack pixels and the crack pixels in the ground truth, allowing matches within a maximum distance of 2 pixels. This choice of matching metric does not penalize crack overestimation errors as long as these errors are contained in the envelope. This allows us to decouple errors in estimating the position of the crack centerline from errors in estimating the crack width, which is more sensitive to lighting variations. Let D be the set of pixels detected as cracks by a given detector.
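The envelope construction and the tolerant matching can be made concrete with a small sketch. This is our own illustration; it uses greedy nearest matching as a simplification of the full bipartite matching described above, and the function names are assumptions.

```python
import math

def background_set(omega, F, delta=3):
    # B: pixels outside F that lie farther than delta from every crack pixel;
    # pixels within delta of a crack (the envelope) are "do not care"
    return {x for x in omega - F
            if all(math.dist(x, f) > delta for f in F)}

def matched_detections(D, F, tol=2):
    # greedy one-to-one matching of detections to ground-truth crack pixels
    # (a simplification of bipartite graph matching)
    free = set(F)
    hits = set()
    for d in sorted(D):
        m = next((f for f in sorted(free) if math.dist(d, f) <= tol), None)
        if m is not None:
            free.discard(m)
            hits.add(d)
    return hits
```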
The probability of detection (PD) and the probability of false alarm (PF) are defined as

$$\mathit{PD} = \frac{|D \cap F|}{|F|}, \qquad \mathit{PF} = \frac{|D \cap B|}{|B|}.$$
A sequence of admissible detectors $D_{\mathit{PF} \le \varepsilon}$, for a given false alarm rate ε, 0 ≤ ε ≤ 1, produces monotonically increasing detection rates $\mathit{PD}_{\mathit{PF} \le \varepsilon}$. The receiver operating characteristic (ROC) curve is defined as PD as a function of PF:

$$\mathrm{ROC}(\varepsilon) = \mathit{PD}_{\mathit{PF} \le \varepsilon}.$$
One commonly used metric is the area under the ROC curve (AUC), defined by

$$\mathit{AUC} = \int_{0}^{1} \mathrm{ROC}(\varepsilon)\, d\varepsilon,$$
which corresponds to the probability that a sample randomly drawn from F will receive a score higher than a sample randomly drawn from B. The AUC measures the average performance of the detector across all possible sensitivity settings. Although it is an important measure, in practice we are interested in knowing how well the detector will work at a particular sensitivity setting. For this reason, we have selected constant false alarm rate (CFAR) detectors with PF = 10^{−3} and PF = 10^{−4}, and we report the corresponding PD. For completeness, we also report the F_{1} score (also known as the Dice similarity index), which is defined as

$$F_{1} = \frac{2\,|D \cap F|}{|D| + |F|}.$$
The F_{1} score is also known as the balanced F-score, since it is equivalent to the harmonic mean of the precision and recall:

$$F_{1} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$
where

$$\text{precision} = \frac{|D \cap F|}{|D|}, \qquad \text{recall} = \frac{|D \cap F|}{|F|}.$$
In this paper, we report the peak F_{1} score for all methods. The Canny edge detector estimates the location of the crack boundary, while the other three methods estimate the location of the crack itself. To allow a meaningful comparison, we generated separate ground truth masks for the crack outline so that the same matching metric can be used for the Canny method. For each method, we used the same algorithm parameters on all the images.
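Under the set definitions above, the evaluation metrics reduce to a few lines. In this sketch (ours, for illustration), D, F, and B are sets of pixel identifiers after matching, and the AUC integral is approximated by the trapezoidal rule over sampled ROC points.

```python
def pd_pf(D, F, B):
    # probability of detection and probability of false alarm
    pd = len(D & F) / len(F)
    pf = len(D & B) / len(B)
    return pd, pf

def f1_score(D, F):
    # Dice / balanced F-score: harmonic mean of precision and recall
    tp = len(D & F)
    precision = tp / len(D)
    recall = tp / len(F)
    return 2 * precision * recall / (precision + recall)

def auc(roc_points):
    # trapezoidal area under (PF, PD) points sorted by PF
    pts = sorted(roc_points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

Sweeping the detector threshold produces the (PF, PD) points fed to `auc`; the peak F_{1} score is the maximum of `f1_score` over that same sweep.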
Table 4 summarizes our results. We observe that our shearlet-based detectors perform consistently well on all evaluation metrics. Note that, on image 3, the Shearlet-I method, which is based on intensity in the reconstructed image, produces better results than all the other methods. Due to its simplicity, the intensity-based method is still used by industry; for example, the system recently proposed in [32] uses pixel intensities to detect cracks on road pavement. Based on the results in Table 4, we conclude that, with proper image preprocessing, intensity can still be a powerful feature for crack detection. However, the detection performance provided by shearlet-based features is more consistent across images. In future work, we will further explore the potential of combining intensity and shearlet-based features. With any of the methods described in this section, it may be possible to remove small artifacts in the detected crack boundary by adding a post-processing step, as was done in [27].
4.3 Video denoising
Video denoising can be performed using the same type of algorithm described above for image denoising, consisting essentially of computing the shearlet coefficients of the noisy data, followed by hard thresholding and reconstruction from the thresholded coefficients. As in the previous section, a noisy video is obtained by adding white Gaussian noise to a video sequence. We have tested our GPU-based implementation of the 3D shearlet video denoising algorithm on the 192 × 192 × 192 waterfall video sequence. Figure 8 shows frame 96 before and after denoising. Figure 9 compares the running times of the video denoising algorithm on CPU vs. GPU. When we go from one core to two, the run time drops from 27.5 to 14.4 min in single precision (a 1.91× speedup). However, when going from two cores to four, we only obtain a 1.62× speedup, and the rate of improvement keeps diminishing as we double the number of cores, to the point where the improvement from 1 to 64 cores is only a 9.45× speedup. In contrast, a GeForce GTX 480 produces the same result in just 3 s, a remarkable 551× speedup over a single CPU core and a 58× speedup over 64 CPU cores.
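The transform/threshold/reconstruct pipeline can be sketched in a transform-agnostic way. In this sketch (ours, for illustration only), a naive 1-D DFT stands in for the 3D shearlet decomposition; only the pipeline structure is the point.

```python
import cmath

def hard_threshold(coeffs, t):
    # keep only coefficients whose magnitude exceeds the threshold
    return [c if abs(c) > t else 0 for c in coeffs]

def dft(x, inverse=False):
    # naive O(n^2) DFT, a stand-in for the (shearlet) transform pair
    n = len(x)
    sgn = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sgn * 2j * cmath.pi * j * k / n)
               for k in range(n)) for j in range(n)]
    return [v / n for v in out] if inverse else out

def denoise(x, t):
    # transform -> hard threshold -> inverse transform
    return [v.real for v in dft(hard_threshold(dft(x), t), inverse=True)]
```

In the actual implementation, the forward and inverse steps are the GPU-accelerated 3D shearlet decomposition and reconstruction, and the threshold is chosen from the estimated noise level.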
5 Conclusions
The shearlet transform is an advanced multiscale method which has emerged in recent years as a refinement of the traditional wavelet transform and has been shown to perform very competitively on a wide range of image and data processing problems. However, standard CPU-based numerical implementations are very time-consuming, making the application of this method to large data sets and real-time problems impractical.
In this paper, we have described how to speed up the computation of the 2D/3D discrete shearlet transform by using GPU-based implementations. The development of algorithms on GPUs used to be tedious and to require very specialized knowledge of the hardware. With CUDA, this is no longer the case, and scientists with C/C++ programming skills can quickly develop efficient GPU implementations of data-intensive algorithms. In this paper, we have taken advantage of the GPU-based implementation of the fast Fourier transform and used the capabilities of MATLAB for quick prototyping. The results presented in this paper illustrate the practical benefits of this approach. For example, a GeForce GTX 480, a $200 graphics card, can perform video denoising 58 times faster than an expensive 64-core machine while consuming much less power.
Our new implementation enables the efficient application of the shearlet decomposition to a variety of image and data processing tasks for which the required CPU resources would be prohibitive. Further improvements and extensions could be achieved, such as precalculating the filter coefficients and porting the code to OpenCL so that it can also run on AMD and Intel GPUs, but these are beyond the scope of this paper.
Endnote
^{a} Note that this code also includes some C routines to speed up the computation time. This is true for both the 2D and 3D implementations.
References
1. Candès EJ, Donoho DL: New tight frames of curvelets and optimal representations of objects with C^{2} singularities. Comm. Pure Appl. Math. 2004, 57: 219-266. 10.1002/cpa.10116
2. Cunha A, Zhou J, Do M: The nonsubsampled contourlet transform: theory, design, and applications. IEEE Trans. Image Process. 2006, 15(10):3089-3101.
3. Labate D, Lim W, Kutyniok G, Weiss G: Sparse multidimensional representation using shearlets. Wavelets XI (San Diego, CA, 2005), SPIE Proc. 5914; 2005:254-262.
4. Kutyniok G, Labate D: Shearlets: Multiscale Analysis for Multivariate Data. Birkhäuser, Boston; 2012.
5. Starck JL, Murtagh F, Fadili JM: Sparse Image and Signal Processing: Wavelets, Curvelets, Morphological Diversity. Cambridge University Press, Cambridge; 2010.
6. Guo K, Labate D: The construction of smooth Parseval frames of shearlets. Math. Model. Nat. Phenom. 2013, 8: 82-105. 10.1051/mmnp/20138106
7. Guo K, Labate D: Optimally sparse multidimensional representation using shearlets. SIAM J. Math. Anal. 2007, 39: 298-318.
8. Guo K, Labate D: Optimally sparse representations of 3D data with C^{2} surface singularities using Parseval frames of shearlets. SIAM J. Math. Anal. 2012, 44: 851-886. 10.1137/100813397
9. Easley GR, Labate D, Lim W: Sparse directional image representations using the discrete shearlet transform. Appl. Comput. Harmon. Anal. 2008, 25: 25-46. 10.1016/j.acha.2007.09.003
10. Kutyniok G, Shahram M, Zhuang X: ShearLab: a rational design of a digital parabolic scaling algorithm. SIAM J. Imaging Sci. 2012, 5(4):1291-1332. 10.1137/110854497
11. Guo K, Labate D: Representation of Fourier integral operators using shearlets. J. Fourier Anal. Appl. 2008, 14: 327-371. 10.1007/s0004100890180
12. Colonna F, Easley GR, Guo K, Labate D: Radon transform inversion using the shearlet representation. Appl. Comput. Harmon. Anal. 2010, 29(2):232-250. 10.1016/j.acha.2009.10.005
13. Vandeghinste B, Goossens B, Van Holen R, Vanhove C, Pizurica A, Vandenberghe S, Staelens S: Iterative CT reconstruction using shearlet-based regularization. IEEE Trans. Nuclear Sci. 2013, 60(5):3305-3317.
14. Guo K, Labate D: Characterization and analysis of edges using the continuous shearlet transform. SIAM J. Imaging Sci. 2009, 2: 959-986. 10.1137/080741537
15. Guo K, Labate D: Analysis and detection of surface discontinuities using the 3D continuous shearlet transform. Appl. Comput. Harmon. Anal. 2011, 30: 231-242. 10.1016/j.acha.2010.08.004
16. Yi S, Labate D, Easley GR, Krim H: A shearlet approach to edge analysis and detection. IEEE Trans. Image Process. 2009, 18(5):929-941.
17. Kutyniok G, Lim W: Image separation using wavelets and shearlets. In Curves and Surfaces (Avignon, France, 2010), Lecture Notes in Computer Science 6920. Springer, Berlin Heidelberg; 2011:416-430.
18. Easley G, Labate D, Negi PS: 3D data denoising using combined sparse dictionaries. Math. Model. Nat. Phenom. 2013, 8: 60-74.
19. Patel VM, Easley G, Healy D: Shearlet-based deconvolution. IEEE Trans. Image Process. 2009, 18: 2673-2685.
20. Negi P, Labate D: 3D discrete shearlet transform and video processing. IEEE Trans. Image Process. 2012, 21: 2944-2954.
21. Easley G, Labate D, Patel VM: Directional multiscale processing of images using wavelets with composite dilations. J. Math. Imaging Vis. 2014, 48(1):13-43. 10.1007/s1085101203854
22. Candès EJ, Demanet L, Donoho D, Ying L: Fast discrete curvelet transforms. SIAM Multiscale Model. Simul. 2006, 5(3):861-899. 10.1137/05064182X
23. Burt PJ, Adelson EH: The Laplacian pyramid as a compact image code. IEEE Trans. Commun. 1983, 31(4):532-540. 10.1109/TCOM.1983.1095851
24. Donoho D, Johnstone I: Adapting to unknown smoothness via wavelet shrinkage. J. Am. Stat. Assoc. 1995, 90: 1200-1224. 10.1080/01621459.1995.10476626
25. Chang SG, Yu B, Vetterli M: Adaptive wavelet thresholding for image denoising and compression. IEEE Trans. Image Process. 2000, 9(9):1532-1546. 10.1109/83.862633
26. Subirats P, Dumoulin J, Legeay V, Barba D: Automation of pavement surface crack detection using the continuous wavelet transform. IEEE International Conference on Image Processing, Atlanta, GA; 3037.
27. Chambon S, Moliard J: Automatic road pavement assessment with image processing: review and comparison. Int. J. Geophys. 2011, article ID 989354, 20 pages. doi:10.1155/2011/989354
28. Ma C, Zhao C, Hou Y: Pavement distress detection based on nonsubsampled contourlet transform. Int. Conf. Comput. Sci. Softw. Eng. 2008, 1: 28-31.
29. Starck JL, Elad M, Donoho D: Image decomposition via the combination of sparse representation and a variational approach. IEEE Trans. Image Process. 2005, 14(10):1570-1582.
30. Bobin J, Starck JL, Fadili M, Moudden Y, Donoho D: Morphological component analysis: an adaptive thresholding strategy. IEEE Trans. Image Process. 2007, 16(11):2675-2681.
31. Canny J: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 8(6):679-698.
32. Oliveira H, Correia P: Automatic road crack detection and characterization. IEEE Trans. Intell. Transport. Syst. 2013, 14: 155-168.
Acknowledgements
The authors thank Amtrak, ENSCO, Inc. and the Federal Railroad Administration for providing the images used in Section 4.2. This work was supported by the Federal Railroad Administration under contract DTFR5313C00032. DL acknowledges support from NSF grant DMS 1008900/1008907 and DMS 1005799.
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cite this article
Gibert, X., Patel, V.M., Labate, D. et al. Discrete shearlet transform on GPU with applications in anomaly detection and denoising. EURASIP J. Adv. Signal Process. 2014, 64 (2014). https://doi.org/10.1186/1687-6180-2014-64
Keywords
 Shearlets
 Wavelets
 Image processing
 Parallelism
 Multicore
 GPU