
Discrete shearlet transform on GPU with applications in anomaly detection and denoising


Shearlets have emerged in recent years as one of the most successful methods for the multiscale analysis of multidimensional signals. Unlike wavelets, shearlets form a pyramid of well-localized functions defined not only over a range of scales and locations, but also over a range of orientations and with highly anisotropic supports. As a result, shearlets are much more effective than traditional wavelets in handling the geometry of multidimensional data, and this has been exploited in a wide range of applications in image and signal processing. However, despite their desirable properties, the wider applicability of shearlets is limited by the computational complexity of current software implementations. For example, denoising a single 512 × 512 image using a current implementation of the shearlet-based shrinkage algorithm can take between 10 s and 2 min, depending on the number of CPU cores, and much longer processing times are required for video denoising. On the other hand, due to the parallel nature of the shearlet transform, it is possible to use graphics processing units (GPUs) to accelerate its implementation. In this paper, we present an open source stand-alone implementation of the 2D discrete shearlet transform using CUDA C++ as well as GPU-accelerated MATLAB implementations of the 2D and 3D shearlet transforms. We have instrumented the code so that we can analyze the running time of each kernel on different GPU hardware. In addition to denoising, we describe a novel application of shearlets for detecting anomalies in textured images. In this application, computation times can be reduced by a factor of 50 or more compared to multicore CPU implementations.

1 Introduction

During the last decade, a new generation of multiscale systems has emerged which combines the power of the classical multiresolution analysis with the ability to process directional information with very high efficiency. Some of the most notable examples of such systems include the curvelets [1], the contourlets [2], and the shearlets [3]. Unlike classical wavelets, the elements of such systems form a pyramid of well-localized waveforms ranging not only across various scales and locations, but also across various orientations and with highly anisotropic shapes. Due to their richer structure, these more sophisticated multiscale systems are able to overcome the poor directional sensitivity of traditional multiscale systems and have been used to derive several state-of-the-art algorithms in image and signal processing (cf. [4, 5]).

Shearlets, in particular, offer a unique combination of remarkable features: they have a simple and well-understood mathematical structure derived from the theory of affine systems [3, 6]; they provide optimally sparse representations, in a precise sense, for a large class of images and other multidimensional data where wavelets are suboptimal [7, 8]; and the directionality is controlled by shear matrices rather than rotations. This last property, in particular, enables a unified framework for the continuum and discrete settings, since shear transformations preserve the rectangular lattice, and is an advantage in deriving faithful digital implementations [9, 10].

The shearlet decomposition has been successfully employed in many problems from applied mathematics and signal processing, including decomposition of operators [11], inverse problems [12, 13], edge detection [14–16], image separation [17], and image restoration [18–20]. However, one major bottleneck to the wider applicability of the shearlet transform is that current discrete implementations tend to be very time consuming, making its use impractical for large data sets and for real-time applications. For instance, the current (CPU-based) MATLAB implementation of the 2D shearlet transform, run on a typical desktop PC, takes about 2 min to denoise a noisy image of size 512 × 512 [9, 21]. The running time of the current (CPU-based) MATLAB implementation of the 3D shearlet transform for denoising a video sequence of size $192^3$ is about 5 min [20]. Running times for alternative shearlet implementations from Shearlab [10] as well as for the current implementation of the curvelet transform [22] are comparable.

In recent years, general-purpose graphics processing units (GPGPUs) have become ubiquitous not only on high-performance computing (HPC) clusters, but also on workstations. For example, Titan, which was until recently the world’s fastest supercomputer, contains 18,688 NVIDIA Tesla K20X GPUs. These GPUs provide about 90% of Titan’s peak computing performance, which is greater than 20 PetaFLOPS (quadrillion floating point operations per second). Due to their energy efficiency and capabilities, GPGPUs are also becoming mainstream on mobile platforms, such as iOS and Android devices. There are two main architectures for GPGPU computing: CUDA and OpenCL. CUDA was designed by NVIDIA and has been around since 2006. OpenCL was originally designed by Apple, Inc., and was introduced in 2008. OpenCL is an open standard maintained by the Khronos Group, whose members include Intel, AMD, NVIDIA, and many others, so it has broader industry acceptance than any other architecture. In 2009, Microsoft introduced DirectCompute as an alternative architecture for GPGPU computing, which is only available in Windows Vista and later. OpenCL has been designed to provide the developer with a common framework for doing computation on heterogeneous devices. One of the advantages of OpenCL is that it can potentially support any computing device, such as CPUs, GPUs, and FPGAs, as long as an OpenCL compiler is available for that processor. NVIDIA provides CUDA/OpenCL drivers, libraries, and development tools for the three major operating systems (Linux, Windows, and Mac OS X), while AMD/ATI™ and Intel provide OpenCL drivers and tools for their respective GPUs.

The objective of this paper is to introduce and demonstrate a new implementation of the 2D and 3D discrete shearlet transform which takes advantage of the computational capabilities of the graphics processing unit (GPU). To demonstrate the effectiveness of the proposed implementations, we will illustrate their application to problems of image and video denoising and to a problem of feature recognition aimed at detecting cracks in railway components. In particular, we will show that our new implementation takes about 40 ms to denoise an image of size 512 × 512, which is a 233× speed-up compared to single-core CPU, and about 3 s to denoise a video of size $192^3$, which is a 551× speed-up compared to single-core CPU.

The organization of the paper is as follows. In Section 2, we recall the construction of 2D and 3D shearlets. Next, in Section 3, we present our implementation of the discrete shearlet transform, and in Section 4, we benchmark our implementation using three specific applications. Finally, concluding remarks and future work are discussed in Section 5.

2 Shearlets

In this section, we recall the construction of 2D and 3D shearlets (cf. [6, 7]).

2.1 2D shearlets

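To construct smooth Parseval frames of shearlets for $L^2(\mathbb{R}^2)$, we start by defining appropriate multiscale function systems supported in the following cone-shaped regions of the Fourier domain $\widehat{\mathbb{R}}^2$:

$$\mathcal{P}_1 = \left\{ (\xi_1,\xi_2) \in \mathbb{R}^2 : \left|\tfrac{\xi_2}{\xi_1}\right| \le 1 \right\}, \qquad \mathcal{P}_2 = \left\{ (\xi_1,\xi_2) \in \mathbb{R}^2 : \left|\tfrac{\xi_2}{\xi_1}\right| > 1 \right\}.$$

Let $\phi \in C^\infty([0,1])$ be a ‘bump’ function with $\operatorname{supp}\phi \subset \left[-\tfrac{1}{8}, \tfrac{1}{8}\right]$ and $\phi = 1$ on $\left[-\tfrac{1}{16}, \tfrac{1}{16}\right]$. For $\xi = (\xi_1,\xi_2) \in \widehat{\mathbb{R}}^2$, let $\Phi(\xi) = \Phi(\xi_1,\xi_2) = \phi(\xi_1)\,\phi(\xi_2)$ and define the function

$$W(\xi) = W(\xi_1,\xi_2) = \sqrt{\Phi^2(2^{-2}\xi_1, 2^{-2}\xi_2) - \Phi^2(\xi_1,\xi_2)}.$$

Note that the functions $W_j^2 = W^2(2^{-2j}\,\cdot)$, $j \ge 0$, have support inside the Cartesian coronae

$$C_j = \left[-2^{2j-1}, 2^{2j-1}\right]^2 \setminus \left[-2^{2j-4}, 2^{2j-4}\right]^2$$

and that they produce a smooth tiling of the frequency plane:

$$\Phi^2(\xi_1,\xi_2) + \sum_{j \ge 0} W^2(2^{-2j}\xi_1, 2^{-2j}\xi_2) = 1 \quad \text{for } (\xi_1,\xi_2) \in \widehat{\mathbb{R}}^2.$$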
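The tiling identity can be checked by a telescoping argument. The short derivation below is a sketch that assumes $W^2(\xi) = \Phi^2(2^{-2}\xi) - \Phi^2(\xi)$, as in the definition of $W$ above, and uses only the fact that $\Phi = 1$ in a neighborhood of the origin:

```latex
\begin{align*}
\Phi^2(\xi) + \sum_{j=0}^{J} W^2(2^{-2j}\xi)
  &= \Phi^2(\xi) + \sum_{j=0}^{J} \left[ \Phi^2\!\left(2^{-2(j+1)}\xi\right) - \Phi^2\!\left(2^{-2j}\xi\right) \right] \\
  &= \Phi^2\!\left(2^{-2(J+1)}\xi\right) \longrightarrow \Phi^2(0) = 1 \quad \text{as } J \to \infty .
\end{align*}
```

For any fixed $\xi$, the argument $2^{-2(J+1)}\xi$ eventually falls inside the square where $\Phi = 1$, so the limit is attained after finitely many terms.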

Let $V \in C^\infty(\mathbb{R})$ be such that $\operatorname{supp} V \subset [-1,1]$, $V(0) = 1$, $V^{(n)}(0) = 0$ for all $n \ge 1$, and

$$|V(u-1)|^2 + |V(u)|^2 + |V(u+1)|^2 = 1 \quad \text{for } |u| \le 1.$$
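To make the partition condition on $V$ concrete, the following sketch builds one window that satisfies it numerically. The construction is an assumption made for illustration and is not taken from the paper: it uses the polynomial auxiliary function commonly associated with Meyer wavelets, so this particular `V` is only finitely differentiable rather than $C^\infty$, and the names `beta`, `V`, and `partition_sum` are hypothetical.

```python
import math


def beta(t):
    """Polynomial step (assumed, Meyer-style): beta(t) + beta(1 - t) = 1 on [0, 1]."""
    if t <= 0.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return t**4 * (35.0 - 84.0 * t + 70.0 * t**2 - 20.0 * t**3)


def V(u):
    """Illustrative window with V(0) = 1 and support in [-1, 1]."""
    a = abs(u)
    if a >= 1.0:
        return 0.0
    return math.cos(0.5 * math.pi * beta(a))


def partition_sum(u):
    """Left-hand side of the condition |V(u-1)|^2 + |V(u)|^2 + |V(u+1)|^2."""
    return V(u - 1.0) ** 2 + V(u) ** 2 + V(u + 1.0) ** 2


if __name__ == "__main__":
    for u in (-1.0, -0.4, 0.0, 0.25, 0.9):
        print(u, partition_sum(u))
```

The identity holds because, for $u \in [0,1]$, the two overlapping translates reduce to $\cos^2$ and $\sin^2$ of the same angle, while the third translate vanishes.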

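For $F_{(1)}(\xi_1,\xi_2) = V\!\left(\tfrac{\xi_2}{\xi_1}\right)$ and $F_{(2)}(\xi_1,\xi_2) = V\!\left(\tfrac{\xi_1}{\xi_2}\right)$, the shearlet systems associated with the cone-shaped regions $\mathcal{P}_\nu$, $\nu = 1,2$, are defined as the countable collections of functions

$$\left\{ \psi^{(\nu)}_{j,\ell,k} : j \ge 0,\ -2^j \le \ell \le 2^j,\ k \in \mathbb{Z}^2 \right\}, \tag{1}$$

where

$$\hat\psi^{(\nu)}_{j,\ell,k}(\xi) = \left|\det A_{(\nu)}\right|^{-j/2}\, W(2^{-2j}\xi)\, F_{(\nu)}\!\left(\xi A_{(\nu)}^{-j} B_{(\nu)}^{-\ell}\right) e^{2\pi i\, \xi A_{(\nu)}^{-j} B_{(\nu)}^{-\ell} k}, \tag{2}$$

and

$$A_{(1)} = \begin{pmatrix} 4 & 0 \\ 0 & 2 \end{pmatrix}, \quad B_{(1)} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \quad A_{(2)} = \begin{pmatrix} 2 & 0 \\ 0 & 4 \end{pmatrix}, \quad B_{(2)} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}.$$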

Note that the dilation matrices $A_{(1)}, A_{(2)}$ produce anisotropic dilations, namely, parabolic scaling dilations; by contrast, the shear matrices $B_{(1)}, B_{(2)}$ are non-expanding and their integer powers control the directional features of the shearlet system. Hence, the systems (1) form collections of well-localized functions defined at various scales, orientations, and locations, controlled by the indices $j, \ell, k$, respectively. In particular, the functions $\hat\psi^{(1)}_{j,\ell,k}$, given by (2) with $\nu = 1$, can be written explicitly as

$$\hat\psi^{(1)}_{j,\ell,k}(\xi) = 2^{-3j/2}\, W(2^{-2j}\xi)\, V\!\left(2^j \tfrac{\xi_2}{\xi_1} - \ell\right) e^{2\pi i\, \xi A_{(1)}^{-j} B_{(1)}^{-\ell} k},$$

showing that their supports are contained inside the trapezoidal regions

$$\Sigma_{j,\ell} := \left\{ (\xi_1,\xi_2) : \xi_1 \in \left[-2^{2j-1}, -2^{2j-4}\right] \cup \left[2^{2j-4}, 2^{2j-1}\right],\ \left|\tfrac{\xi_2}{\xi_1} - \ell\, 2^{-j}\right| \le 2^{-j} \right\}$$

in the Fourier plane (see Figure 1). Similar properties hold for the functions $\hat\psi^{(2)}_{j,\ell,k}$.
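The explicit form for $\nu = 1$ follows from a direct computation with row vectors. As a sketch, assuming that $\xi = (\xi_1,\xi_2)$ multiplies the matrices from the left, as in (2):

```latex
\begin{align*}
\xi A_{(1)}^{-j} &= \left( 2^{-2j}\xi_1,\ 2^{-j}\xi_2 \right), \qquad
B_{(1)}^{-\ell} = \begin{pmatrix} 1 & -\ell \\ 0 & 1 \end{pmatrix}, \\
\xi A_{(1)}^{-j} B_{(1)}^{-\ell} &= \left( 2^{-2j}\xi_1,\ 2^{-j}\xi_2 - \ell\, 2^{-2j}\xi_1 \right), \\
F_{(1)}\!\left( \xi A_{(1)}^{-j} B_{(1)}^{-\ell} \right)
  &= V\!\left( \frac{2^{-j}\xi_2 - \ell\, 2^{-2j}\xi_1}{2^{-2j}\xi_1} \right)
   = V\!\left( 2^{j}\frac{\xi_2}{\xi_1} - \ell \right), \\
\left|\det A_{(1)}\right|^{-j/2} &= 8^{-j/2} = 2^{-3j/2}.
\end{align*}
```

Since $V$ is supported in $[-1,1]$, the directional factor is nonzero only when $\left|\xi_2/\xi_1 - \ell\, 2^{-j}\right| \le 2^{-j}$, which is the slope condition defining the trapezoidal support regions.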

Figure 1

Frequency plane and frequency support. (a) The tiling of the frequency plane $\widehat{\mathbb{R}}^2$ induced by the shearlets. (b) Frequency support $\Sigma_{j,\ell}$ of a shearlet $\psi^{(1)}_{j,\ell,k}$, for $\xi_1 > 0$. The other half of the support, for $\xi_1 < 0$, is symmetrical.

A smooth Parseval frame for the whole space $L^2(\mathbb{R}^2)$ is obtained by combining the two shearlet systems associated with the cone-based regions $\mathcal{P}_1$ and $\mathcal{P}_2$ together with a coarse-scale system, associated with the low frequency region. To ensure that all elements of this combined shearlet system are $C_c^\infty$ in the Fourier domain, the elements whose supports overlap the boundaries of the cone regions in the frequency domain are slightly modified. That is, we define a shearlet system for $L^2(\mathbb{R}^2)$ as

$$\left\{ \tilde\psi_{-1,k} : k \in \mathbb{Z}^2 \right\} \cup \left\{ \tilde\psi_{j,\ell,k,\nu} : j \ge 0,\ |\ell| < 2^j,\ k \in \mathbb{Z}^2,\ \nu = 1,2 \right\} \cup \left\{ \tilde\psi_{j,\ell,k} : j \ge 0,\ \ell = \pm 2^j,\ k \in \mathbb{Z}^2 \right\}, \tag{3}$$

consisting of

  •  the coarse-scale shearlets $\left\{ \tilde\psi_{-1,k} = \Phi(\cdot - k) : k \in \mathbb{Z}^2 \right\}$;

  •  the interior shearlets $\left\{ \tilde\psi_{j,\ell,k,\nu} = \psi^{(\nu)}_{j,\ell,k} : j \ge 0,\ |\ell| < 2^j,\ k \in \mathbb{Z}^2,\ \nu = 1,2 \right\}$, where the functions $\psi^{(\nu)}_{j,\ell,k}$ are given by (2);

  •  the boundary shearlets $\left\{ \tilde\psi_{j,\ell,k} : j \ge 0,\ \ell = \pm 2^j,\ k \in \mathbb{Z}^2 \right\}$, obtained by joining together slightly modified versions of $\psi^{(1)}_{j,\ell,k}$ and $\psi^{(2)}_{j,\ell,k}$, for $\ell = \pm 2^j$, after these have been restricted in the Fourier domain to the cones $\mathcal{P}_1$ and $\mathcal{P}_2$, respectively. We refer to [6] for details.

For brevity, let us denote the system (3) using the compact notation

$$\left\{ \tilde\psi_\mu,\ \mu \in M \right\},$$

where $M = M_C \cup M_I \cup M_B$ are the indices associated with coarse-scale shearlets, interior shearlets, and boundary shearlets, respectively. We have the following result from [6]:

Theorem 2.1.

The system of shearlets (3) is a Parseval frame for $L^2(\mathbb{R}^2)$. That is, for any $f \in L^2(\mathbb{R}^2)$,

$$\sum_{\mu \in M} \left| \langle f, \tilde\psi_\mu \rangle \right|^2 = \|f\|^2.$$

All elements $\{\tilde\psi_\mu,\ \mu \in M\}$ are $C^\infty$ and compactly supported in the Fourier domain.

As mentioned above, it is proved in [7] that the 2D Parseval frame of shearlets $\{\tilde\psi_\mu,\ \mu \in M\}$ provides essentially optimal approximations for functions of two variables which are $C^2$ regular away from discontinuities along $C^2$ curves.

The mapping from $f \in L^2(\mathbb{R}^2)$ into the elements $\langle f, \tilde\psi_\mu \rangle$, $\mu \in M$, is called the 2D shearlet transform.

2.2 3D shearlets

The construction outlined above extends to higher dimensions. In 3D, a shearlet system is obtained by appropriately combining three systems of functions associated with the pyramidal regions

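$$\begin{aligned}
\mathcal{P}_1 &= \left\{ (\xi_1,\xi_2,\xi_3) \in \mathbb{R}^3 : \left|\tfrac{\xi_2}{\xi_1}\right| \le 1,\ \left|\tfrac{\xi_3}{\xi_1}\right| \le 1 \right\}, \\
\mathcal{P}_2 &= \left\{ (\xi_1,\xi_2,\xi_3) \in \mathbb{R}^3 : \left|\tfrac{\xi_1}{\xi_2}\right| < 1,\ \left|\tfrac{\xi_3}{\xi_2}\right| \le 1 \right\}, \\
\mathcal{P}_3 &= \left\{ (\xi_1,\xi_2,\xi_3) \in \mathbb{R}^3 : \left|\tfrac{\xi_1}{\xi_3}\right| < 1,\ \left|\tfrac{\xi_2}{\xi_3}\right| < 1 \right\},
\end{aligned}$$

in which the Fourier space $\widehat{\mathbb{R}}^3$ is partitioned. With $\phi$ defined as above, for $\xi = (\xi_1,\xi_2,\xi_3) \in \widehat{\mathbb{R}}^3$, we now let

$$\Phi(\xi) = \Phi(\xi_1,\xi_2,\xi_3) = \phi(\xi_1)\,\phi(\xi_2)\,\phi(\xi_3)$$

and $W(\xi) = \sqrt{\Phi^2(2^{-2}\xi) - \Phi^2(\xi)}$. As in the two-dimensional case, we have the smooth tiling condition

$$\Phi^2(\xi) + \sum_{j \ge 0} W^2(2^{-2j}\xi) = 1 \quad \text{for } \xi \in \widehat{\mathbb{R}}^3.$$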

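Hence, for $d = 1,2,3$ and $\ell = (\ell_1,\ell_2) \in \mathbb{Z}^2$, the 3D shearlet systems associated with the pyramidal regions $\mathcal{P}_d$ are defined as the collections

$$\left\{ \psi^{(d)}_{j,\ell,k} : j \ge 0,\ -2^j \le \ell_1,\ell_2 \le 2^j,\ k \in \mathbb{Z}^3 \right\},$$

where

$$\hat\psi^{(d)}_{j,\ell,k}(\xi) = \left|\det A_{(d)}\right|^{-j/2}\, W(2^{-2j}\xi)\, F_{(d)}\!\left(\xi A_{(d)}^{-j} B_{(d)}^{[-\ell]}\right) e^{2\pi i\, \xi A_{(d)}^{-j} B_{(d)}^{[-\ell]} k},$$

$F_{(1)}(\xi_1,\xi_2,\xi_3) = V\!\left(\tfrac{\xi_2}{\xi_1}\right) V\!\left(\tfrac{\xi_3}{\xi_1}\right)$, $F_{(2)}(\xi_1,\xi_2,\xi_3) = V\!\left(\tfrac{\xi_1}{\xi_2}\right) V\!\left(\tfrac{\xi_3}{\xi_2}\right)$, $F_{(3)}(\xi_1,\xi_2,\xi_3) = V\!\left(\tfrac{\xi_1}{\xi_3}\right) V\!\left(\tfrac{\xi_2}{\xi_3}\right)$, the anisotropic dilation matrices $A_{(d)}$ are given by

$$A_{(1)} = \begin{pmatrix} 4 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix}, \quad A_{(2)} = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 2 \end{pmatrix}, \quad A_{(3)} = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 4 \end{pmatrix},$$

and the shear matrices are defined by

$$B_{(1)}^{[\ell]} = \begin{pmatrix} 1 & \ell_1 & \ell_2 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad B_{(2)}^{[\ell]} = \begin{pmatrix} 1 & 0 & 0 \\ \ell_1 & 1 & \ell_2 \\ 0 & 0 & 1 \end{pmatrix}, \quad B_{(3)}^{[\ell]} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ \ell_1 & \ell_2 & 1 \end{pmatrix}.$$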

Similar to the 2D case, the shearlets $\hat\psi^{(1)}_{j,\ell,k}(\xi)$ can be written explicitly as

$$\hat\psi^{(1)}_{j,\ell_1,\ell_2,k}(\xi) = 2^{-2j}\, W(2^{-2j}\xi)\, V\!\left(2^j \tfrac{\xi_2}{\xi_1} - \ell_1\right) V\!\left(2^j \tfrac{\xi_3}{\xi_1} - \ell_2\right) e^{2\pi i\, \xi A_{(1)}^{-j} B_{(1)}^{[-\ell_1,-\ell_2]} k},$$

showing that their supports are contained inside the trapezoidal regions

$$\left\{ \xi : \xi_1 \in \left[-2^{2j-1}, -2^{2j-4}\right] \cup \left[2^{2j-4}, 2^{2j-1}\right],\ \left|\tfrac{\xi_2}{\xi_1} - \ell_1 2^{-j}\right| \le 2^{-j},\ \left|\tfrac{\xi_3}{\xi_1} - \ell_2 2^{-j}\right| \le 2^{-j} \right\}.$$

Note that these support regions become increasingly elongated at fine scales, due to the action of the anisotropic dilation matrices $A_{(1)}^{j}$, and the orientations of these regions are controlled by the shear parameters $\ell_1, \ell_2$. A typical support region is illustrated in Figure 2. Similar properties hold for the elements associated with the regions $\mathcal{P}_2$ and $\mathcal{P}_3$.

Figure 2

Frequency support of a representative shearlet function $\psi_{j,\ell,k}$, inside the pyramidal region $\mathcal{P}_1$. The orientation of the support region is controlled by $\ell = (\ell_1,\ell_2)$; its shape becomes more elongated as $j$ increases ($j = 4$ in this plot).

A Parseval frame of shearlets for $L^2(\mathbb{R}^3)$ is obtained by using an appropriate combination of the systems of shearlets associated with the three pyramidal regions $\mathcal{P}_d$, $d = 1,2,3$, together with a coarse-scale system, which takes care of the low frequency region. Similar to the 2D case, in order to build such a system in a way that all its elements are smooth in the Fourier domain, one has to appropriately define the elements of the shearlet systems overlapping the boundaries of the pyramidal regions $\mathcal{P}_d$ in the Fourier domain. We refer to [8, 15] for details. Hence, we define the 3D shearlet systems for $L^2(\mathbb{R}^3)$ as the collections

$$\left\{ \tilde\psi_{-1,k} : k \in \mathbb{Z}^3 \right\} \cup \left\{ \tilde\psi_{j,\ell,k,d} : j \ge 0,\ |\ell_1| < 2^j,\ |\ell_2| \le 2^j,\ k \in \mathbb{Z}^3,\ d = 1,2,3 \right\} \cup \left\{ \tilde\psi_{j,\ell,k} : j \ge 0,\ \ell_1, \ell_2 = \pm 2^j,\ k \in \mathbb{Z}^3 \right\},$$

which again can be identified as the coarse-scale, interior, and boundary shearlets. It turns out that the 3D system of shearlets is a Parseval frame of $L^2(\mathbb{R}^3)$ [6] and it provides essentially optimal approximations for functions of three variables which are $C^2$ regular away from discontinuities along $C^2$ surfaces [8].

3 Discrete implementation

A faithful numerical implementation of the 2D shearlet transform was originally presented in [9]. Let us briefly recall the main steps of this implementation.

3.1 2D discrete shearlet transform

Recall that the shearlet coefficients associated with the interior shearlets can be expressed as

$$\langle f, \psi^{(\nu)}_{j,\ell,k} \rangle = 2^{3j/2} \int_{\widehat{\mathbb{R}}^2} \hat f(\xi)\, W(2^{-2j}\xi)\, F_{(\nu)}\!\left(\xi A_{(\nu)}^{-j} B_{(\nu)}^{-\ell}\right) e^{2\pi i\, \xi A_{(\nu)}^{-j} B_{(\nu)}^{-\ell} k}\, d\xi.$$

First, to compute $\hat f(\xi_1,\xi_2)\, W(2^{-2j}\xi)$ in the discrete domain, at the resolution level $j$, we apply the Laplacian pyramid algorithm [23], which is implemented in space-domain. Let $\hat f[k_1,k_2]$ denote the 2D discrete Fourier transform (DFT) of $f \in \ell^2(\mathbb{Z}_N^2)$, where we adopt the convention that brackets $[\cdot,\cdot]$ denote arrays of indices and parentheses $(\cdot,\cdot)$ denote function evaluations, and where we interpret the numbers $\hat f[k_1,k_2]$ as samples $\hat f[k_1,k_2] = \hat f(k_1,k_2)$ from the trigonometric polynomial

$$\hat f(\xi_1,\xi_2) = \sum_{n_1,n_2=0}^{N-1} f[n_1,n_2]\, e^{-2\pi i \left( \frac{n_1}{N}\xi_1 + \frac{n_2}{N}\xi_2 \right)}.$$

The Laplacian pyramid algorithm will accomplish the multiscale partition illustrated in Figure 3, by decomposing $f_a^{j-1}[n_1,n_2]$, $0 \le n_1, n_2 < N_{j-1}$, into a low-pass filtered image $f_a^{j}[n_1,n_2]$, a quarter of the size of $f_a^{j-1}[n_1,n_2]$, and a high-pass filtered image $f_d^{j}[n_1,n_2]$. Observe that the matrix $f_a^{j}[n_1,n_2]$ has size $N_j \times N_j$, where $N_j = 2^{-2j}N$, and $f_a^{0}[n_1,n_2] = f[n_1,n_2]$ has size $N \times N$. In particular, we have that

$$\widehat{f_d^{j}}(\xi_1,\xi_2) = \hat f(\xi_1,\xi_2)\, W(2^{-2j}\xi_1, 2^{-2j}\xi_2)$$

and thus, $f_d^{j}[n_1,n_2]$ are the discrete samples of a function $f_d^{j}(x_1,x_2)$, whose Fourier transform is $\widehat{f_d^{j}}(\xi_1,\xi_2)$. Since this operation is implemented as a convolution in space-domain, this step of the algorithm is one of the most computationally expensive.

Figure 3

Succession of Laplacian pyramid and directional filtering.

The next step produces the directional filtering, and this is achieved by computing the DFT on the pseudo-polar grid and then applying a one-dimensional band-pass filter to the components of the signal with respect to this grid. More precisely, let us define the pseudo-polar coordinates $(u,w) \in \mathbb{R}^2$ as follows:

$$(u,w) = \left(\xi_1, \tfrac{\xi_2}{\xi_1}\right) \text{ if } (\xi_1,\xi_2) \in \mathcal{P}_1, \qquad (u,w) = \left(\xi_2, \tfrac{\xi_1}{\xi_2}\right) \text{ if } (\xi_1,\xi_2) \in \mathcal{P}_2.$$

After performing this change of coordinates, we obtain

$$\hat f(\xi_1,\xi_2)\, W(2^{-2j}\xi_1, 2^{-2j}\xi_2)\, F_{(\nu)}\!\left(\xi A_{(\nu)}^{-j} B_{(\nu)}^{-\ell}\right) = g_j(u,w)\, V(2^j w - \ell),$$

where $g_j(u,w) = \widehat{f_d^{j}}(\xi_1,\xi_2)$. This shows that the directional components are obtained by simply translating the window function $V$. The discrete samples $g_j[n_1,n_2] = g_j(n_1,n_2)$ are the values of the DFT of $f_d^{j}[n_1,n_2]$ on a pseudo-polar grid.

Now let $\{v_{j,\ell}[n] : n \in \mathbb{Z}\}$ be the sequence whose discrete Fourier transform gives the samples of the window function $V(2^j k - \ell)$, i.e., $\hat v_{j,\ell}[k] = V(2^j k - \ell)$. Then, for fixed $n_1 \in \mathbb{Z}$, we have

$$\mathcal{F}_1\!\left( \mathcal{F}_1^{-1}\!\left(g_j[n_1,n_2]\right) * v_{j,\ell}[n_2] \right) = g_j[n_1,n_2]\, \mathcal{F}_1\!\left(v_{j,\ell}[n_2]\right), \tag{6}$$

where $*$ denotes the one-dimensional convolution along the $n_2$ axis and $\mathcal{F}_1$ is the one-dimensional discrete Fourier transform. Thus, (6) gives the algorithmic implementation for computing the discrete samples of $g_j(u,w)\, V(2^j w - \ell)$. At this point, to compute the shearlet coefficients in the discrete domain, it suffices to compute the inverse pseudo-polar DFT (PDFT) or directly reassemble the Cartesian sampled values and apply the inverse two-dimensional FFT. Figure 3 illustrates the cascade of Laplacian pyramid and directional filtering. Recall that once the discrete shearlet coefficients are obtained, the inverse shearlet transform is computed using the following steps: (i) convolution of discrete shearlet coefficients and synthesis directional filters, (ii) summation of all directional components, and (iii) reconstruction by inverse Laplacian pyramidal transformation.
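The identity (6) is the discrete convolution theorem applied along one axis. The sketch below checks it numerically on a single row using a naive DFT from the Python standard library; the function names and the sample values are hypothetical, the convolution is taken to be circular, and a practical implementation would use an FFT library instead.

```python
import cmath


def dft(x):
    """Naive one-dimensional DFT (O(n^2)), adequate for a small check."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft()."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * m / n) for k in range(n)) / n
            for m in range(n)]


def circular_convolve(a, b):
    """Circular convolution of two sequences of equal length."""
    n = len(a)
    return [sum(a[m] * b[(i - m) % n] for m in range(n)) for i in range(n)]


def filter_by_convolution(g_row, window):
    """Left-hand side of (6): convolve F^{-1}(g) with v, then transform back."""
    v = idft(window)  # sequence whose DFT equals the window samples
    return dft(circular_convolve(idft(g_row), v))


def filter_by_product(g_row, window):
    """Right-hand side of (6): pointwise product with the window samples."""
    return [g * w for g, w in zip(g_row, window)]


if __name__ == "__main__":
    g = [1 + 2j, -0.5, 3j, 2.0, -1 - 1j, 0.25, 0.0, 4.0]
    w = [0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]
    lhs = filter_by_convolution(g, w)
    rhs = filter_by_product(g, w)
    print(max(abs(a - b) for a, b in zip(lhs, rhs)))
```

The product form is the one worth implementing, since it replaces a convolution per row with a single pointwise multiplication.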

3.2 2D GPU-based implementation

Before implementing the 2D discrete shearlet transform algorithm on the GPU, we profiled the existing implementation, which is available as a MATLAB toolbox. Table 1 contains the breakdown of the processing times, showing that the FFT computations used to perform directional filtering and the convolution part of the à trous algorithm used for pyramidal image decomposition and reconstruction take around 75% of the computation time. Hence, they were the first candidates for porting to CUDA.

Table 1 Comparison of processing times for denoising a single precision 512 × 512 image

Since most of the computing time for performing a discrete shearlet transform is spent in FFT function calls, it is crucial to have the best possible library to perform FFTs. The two main GPU vendors provide optimized FFT libraries: NVIDIA provides cuFFT as part of its CUDA Toolkit, and AMD provides clAmdFft as part of its Accelerated Parallel Processing Math Libraries (APPML). We have decided to use CUDA as our development architecture both because there is better documentation and because more mature development tools are available. We have implemented the device code in CUDA C++, while the host code is pure C++. Since both CUDA C/C++ and OpenCL are based on the C programming language, porting the code from CUDA to OpenCL should not be difficult. However, for code compactness, we have made extensive use of templates and operator overloading, which are supported in CUDA C++, but not in OpenCL, which is based on C99.

To facilitate the development, we have used GPUmat from the GP-you Group, a free (GPLv3) GPU engine for MATLAB® whose source code is publicly available. This framework provides two new classes, GPUsingle and GPUdouble, which encapsulate vectors of numerical data allocated on GPU memory and allow mathematical operations on objects of such classes via function and operator overloading. Transfers between CPU and GPU memory are as simple as doing type casting, and memory allocation and deallocation are done automatically. The idea is that existing MATLAB functions could be reused without any code changes. In practice, however, in order to get acceptable performance, it is necessary to hand-tune the code or even use lower level languages such as C/C++.

Fortunately, the GPUmat framework provides an interface for manipulating these objects from MEX files, and a mechanism for loading custom kernels. Although there are commercial alternatives to GPUmat such as Jacket from AccelerEyes, or the Parallel Computing Toolbox from MathWorks, we have found that GPUmat is quite robust and adds very little overhead to the execution time as long as we follow good programming practices such as in-place operations and reuse of preallocated buffers.

Our implementation supports both single precision (32-bit) and double precision (64-bit) IEEE 754 floating point numbers (double precision is only supported on devices with compute capability 2.0 or newer due to limitations in the maximum amount of shared memory available per multiprocessor). We generate the filter bank of directional filters using the Fourier-domain approach from [9], where directional filters are designed as Meyer-type window functions in the Fourier domain. Since this step only needs to be run once and does not depend on the image dimensions, we precompute these directional filters using the original MATLAB implementation.

For the Laplacian pyramidal decomposition, we ported the à trous algorithm using symmetric extension [2] to CUDA. This algorithm requires performing non-separable convolutions with decimated signals. For efficiency reasons, the kernel that performs à trous convolutions preloads blocks of data into shared memory, so that global memory is accessed only once by each GPU thread.
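The indexing pattern behind an à trous convolution with symmetric extension can be illustrated in one dimension. The sketch below is a simplification under stated assumptions: it is 1D rather than the non-separable 2D filtering used in the implementation, it assumes whole-sample symmetric extension at the borders, it runs on the CPU with no shared-memory staging, and the names `reflect` and `atrous_filter` are hypothetical. It evaluates a correlation, which coincides with a convolution for symmetric filters.

```python
def reflect(i, n):
    """Map any integer index into [0, n) by whole-sample symmetric extension."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i = abs(i) % period
    return i if i < n else period - i


def atrous_filter(x, h, level):
    """Filter x with the taps of h spaced 2**level samples apart ("holes" in between)."""
    hole = 2 ** level
    center = len(h) // 2
    n = len(x)
    return [
        sum(h[k] * x[reflect(i + hole * (k - center), n)] for k in range(len(h)))
        for i in range(n)
    ]


if __name__ == "__main__":
    impulse = [0.0] * 9
    impulse[4] = 1.0
    # The three taps land two samples apart at level 1.
    print(atrous_filter(impulse, [0.25, 0.5, 0.25], level=1))
```

In a GPU kernel, each thread would compute one output sample of this sum after the thread block has staged the required input neighborhood, including the reflected border, in shared memory.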

With the above GPU-based Laplacian pyramid and directional filter implementation, it is just a matter of applying convolutions in the GPU to find the forward and inverse shearlet transform.

The main steps of our GPU-based shearlet transform are as shown in Table 2.

Table 2 Main steps of the shearlet transform

3.3 3D discrete shearlet transform

The algorithm for the discretization of the 3D shearlet transform is very similar to the 2D shearlet transform, and our implementation of the 3D discrete shearlet transform adapts the code described in [20]. The main practical difference is that storing the 3D shearlet coefficients is much more memory-intensive. Since the memory requirement can easily exceed the available GPU memory, in our algorithm, we compute one convolution at a time in CUDA and add the result to the output.

4 Applications

In the following, we illustrate the advantages of our new implementation of the discrete shearlet transform by considering three applications: denoising of natural images corrupted with white Gaussian noise, detection of cracks in textured images, and denoising of videos. The source code, sample data, and the MATLAB scripts used to generate all the figures in this paper are publicly available.

For benchmarking, we have evaluated the performance of the new discrete shearlet transform on both multicore CPUs and GPUs. All CPU tests have been performed on a four-socket Dell PowerEdge C6145 with AMD Opteron™ 6274 processors at 2.2 GHz (64 cores total) and 256 GB RAM, running Red Hat Enterprise Linux (RHEL) 6. This machine is one of 16 identical nodes in the high-performance computing (HPC) cluster Euclid at the University of Maryland. During these benchmarks, we had exclusive access to this node, and no other processes were running, except for regular system services. To better understand the performance of this code when running on systems with different numbers of cores, we limited the number of available cores in some of the experiments. We found that neither MATLAB’s maxNumCompThreads nor -singleCompThread works reliably on non-Intel processors, so we used the taskset Linux command to set the processor affinity to the desired number of cores. GPU tests were performed on different machines running RHEL 5 or 6, and CUDA 4.2 or 5.0. The tests include devices with CUDA Compute Capabilities (CC) between 1.3 and 3.5. Table 3 summarizes the configurations used in our experiments.

Table 3 Specifications and computing environments for each of the graphics processors used on our benchmarks

4.1 Image denoising

As a first test, we evaluated the performance of our implementation of the discrete shearlet transform on a problem of image denoising, using a standard denoising algorithm based on hard thresholding of the shearlet coefficients. The setup is similar to the one described in [9]. That is, given an image $f \in \mathbb{R}^{N^2}$, we observe a noisy version of it given by $u = f + \varepsilon$, where $\varepsilon \in \mathbb{R}^{N^2}$ is an additive white Gaussian noise process which is independent of $f$, i.e., $\varepsilon \sim N(0, \sigma^2 I_{N^2 \times N^2})$. Our goal is to compute an estimate $\tilde f$ of $f$ from the noisy data $u$ by applying a classical hard thresholding scheme [24] on the shearlet coefficients of $u$. The threshold levels are given by $\tau_{i,j,n} = \sigma^2_{\varepsilon_{i,j}} / \sigma^2_{i,j,n}$, as in [2, 9, 25], where $\sigma^2_{i,j,n}$ denotes the variance of the $n$th coefficient at the $i$th directional subband in the $j$th scale, and $\sigma^2_{\varepsilon_{i,j}}$ is the noise variance at scale $j$ and directional band $i$. The variances $\sigma^2_{\varepsilon_{i,j}}$ are estimated by using a Monte Carlo technique in which the variances are computed for several normalized noise images and then the estimates are averaged.
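The hard thresholding step itself is a simple elementwise rule. As a minimal sketch, assuming the coefficients of one subband and their thresholds are already available as flat lists (the function name and the sample values are hypothetical):

```python
def hard_threshold(coeffs, thresholds):
    """Keep a coefficient if its magnitude exceeds its threshold; otherwise set it to zero."""
    return [c if abs(c) > t else 0.0 for c, t in zip(coeffs, thresholds)]


if __name__ == "__main__":
    subband = [0.2, -1.5, 0.9, 3.0]
    taus = [1.0, 1.0, 1.0, 1.0]
    print(hard_threshold(subband, taus))
```

Because every coefficient is processed independently, this step maps directly onto one GPU thread per coefficient.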

For our experiments, we used five levels of the Laplacian pyramid decomposition, and we applied a directional decomposition on four of the five scales. We used 8 shear filters of sizes 32×32 for the first two scales (coarser scales), and 16 shear filters of sizes 16×16 for the third and fourth levels (fine scales). The shear filters are Meyer-type windows [9]. We used the 512×512 Barbara image to test our algorithm, and to assess its performance, we used the peak signal-to-noise ratio (PSNR), measured in decibels (dB), defined by

$$\mathrm{PSNR} = 20 \log_{10} \frac{255\, N}{\| f - \tilde f \|_F},$$

where $\|\cdot\|_F$ is the Frobenius norm, the given image $f$ is of size $N \times N$, and $\tilde f$ denotes the estimated image.
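The PSNR formula can be evaluated directly from its definition. The following sketch assumes 8-bit images stored as square lists of lists; the function name `psnr` and the 2 × 2 sample data are hypothetical.

```python
import math


def psnr(f, f_est):
    """PSNR in dB for N x N images: 20 * log10(255 * N / ||f - f_est||_F)."""
    n = len(f)
    err_sq = sum(
        (a - b) ** 2
        for row_f, row_e in zip(f, f_est)
        for a, b in zip(row_f, row_e)
    )
    if err_sq == 0.0:
        return math.inf  # identical images
    return 20.0 * math.log10(255.0 * n / math.sqrt(err_sq))


if __name__ == "__main__":
    clean = [[10.0, 20.0], [30.0, 40.0]]
    noisy = [[11.0, 21.0], [31.0, 41.0]]
    print(psnr(clean, noisy))
```

This is equivalent to the more common form $10 \log_{10}(255^2 / \mathrm{MSE})$, since $\mathrm{MSE} = \|f - \tilde f\|_F^2 / N^2$.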

In order to minimize latency as well as bandwidth usage on the PCIe bus, we first transferred the input image to GPU memory, then we let all the computation happen on the GPU and we finally transferred the results back to CPU memory. We have verified that both CPU and GPU implementations provide an output PSNR of 29.9 dB when the input PSNR is 22.1 dB. At these noise levels, there is no difference in PSNR between the single and the double precision implementations.

To verify the numerical accuracy, we ran the shearlet decomposition and reconstruction on a noise-free image (without thresholding), and we obtained a reconstruction mean squared error (MSE) of $9.197 \times 10^{-9}$ for single precision and $2.503 \times 10^{-12}$ for double precision on a GeForce GTX 690. On the CPU implementation, we get reconstruction errors of $9.1711 \times 10^{-9}$ and $1.6643 \times 10^{-26}$, respectively. This confirms that our implementation provides exact reconstruction up to floating-point precision.

The running times vary significantly depending on the number of CPU cores available and the GPU model. Figure 4 shows a comparison of running times (wall times) of the image denoising algorithm on different hardware configurations. We can clearly see that the CPU implementation does not scale well as we increase the number of CPU cores, because parts of the algorithm run sequentially. For a fair comparison of multicore vs. GPU, we would have to compare the performance to a fully optimized CPU implementation. It should be noted that there is enough coarse-level parallelism in this algorithm to accomplish full CPU utilization without incurring inter-CPU communication issues. However, the trend reveals that in this application, the GPU is more efficient than the CPU. In summary, the denoising algorithm takes 8.89 s on 4 CPU cores vs. 0.038 s on the GeForce GTX 690 (a 233× speed-up) when using single precision. For double precision, it takes 10.7 s on 4 CPU cores vs. 0.127 s on the GeForce GTX 690 (an 84× speed-up).

Figure 4

Comparison of CPU vs. GPU run times for denoising a 512 × 512 image using shearlets.

Table 1 shows the breakdown of different parts of the image denoising algorithm both on CPU and GPU.

4.2 Crack detection

Detection of cracks on concrete structures is a difficult problem due to the changes in width and direction of the cracks, as well as the variability in the surface texture. This problem has received considerable attention recently. Redundant representations, such as undecimated wavelets, have been extensively used for crack detection [26, 27]. However, wavelets have poor directional sensitivity and have difficulties in detecting weak diagonal cracks. To overcome this limitation, Ma et al. [28] proposed the use of the nonsubsampled contourlet transform [2] for crack detection. However, all these methods rely on the assumption that the background surface can be modeled as additive white Gaussian noise, and this assumption leads to matched filter solutions. As a matter of fact, on real images, textures are highly correlated and applying linear filters leads to poor performance.

To address this problem, we propose a completely new approach to crack detection based on separating the image into morphologically distinct components using sparse representations, adaptive thresholding, and variational regularization. This technique was pioneered by Starck et al. [29] and later extended and generalized by many authors (e.g., [17, 18, 30]). In particular, we will use the Iterative Shrinkage Algorithm with a combined dictionary of shearlets and wavelets to separate cracks from background texture.

To demonstrate the performance of the GPU-accelerated iterative shrinkage algorithm, we processed three 512 × 512 images. The images correspond to cracks on concrete railroad crossties collected by ENSCO Inc. during summer 2012 using four 2,048 × 1 line-scan cameras, which were assembled into 8,192 × 3,072 frames. The cameras were triggered using a calibrated encoder, producing images with square pixels with a constant size of 0.43 mm. We have manually cropped these images so that we can decouple crack detection from crosstie boundary tracking. As one can see from Figure 5, these cracks propagate in different directions and the background texture has a lot of variation. However, due to the fact that the information in these images is highly redundant, it is possible to separate the image into two components, that is, cracks and texture, by solving an $\ell_1$ optimization problem [17].

Figure 5

Image separation. (a) Original images separated into (b) cracks and (c) textural background components, and (d) crack ground truth.

More precisely, we model an image $x$ containing cracks on a textured background as the superposition of a crack component $x_c$ and a texture component $x_t$:

$$x = x_c + x_t.$$

Let $\Phi_1$ and $\Phi_2$ be the dictionaries corresponding to shearlets and wavelets, respectively. We assume that $x_c$ is sparse in the shearlet dictionary $\Phi_1$ and, similarly, that $x_t$ is sparse in the wavelet dictionary $\Phi_2$. That is, we assume that there are sparse coefficient vectors $a_c$ and $a_t$ such that $x_c = \Phi_1 a_c$ and $x_t = \Phi_2 a_t$. One can then separate these components of $x$ by estimating $a_c$ and $a_t$ through the following optimization problem:

$$(\hat{a}_c, \hat{a}_t) = \arg\min_{a_c,\, a_t} \; \lambda \|a_c\|_1 + \lambda \|a_t\|_1 + \frac{1}{2} \|x - \Phi_1 a_c - \Phi_2 a_t\|_2^2,$$

where, for an $n$-dimensional vector $b$, the $\ell_1$ norm is defined as $\|b\|_1 = \sum_i |b_i|$. This image separation problem can be solved efficiently using the iterative shrinkage algorithm proposed in [17] (Figure 5).
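To make the separation problem concrete, the following is a minimal one-dimensional sketch of iterative shrinkage applied to the objective above. It is not the implementation used in this paper: as a loudly stated assumption, the shearlet dictionary is replaced by the identity basis (good at representing isolated spikes) and the wavelet dictionary by an orthonormal DCT basis (good at representing smooth oscillations). The names `dct_atoms`, `spike_atoms`, `separate`, and the values of `lam` and `iters` are all illustrative.

```python
import math

def dct_atoms(n):
    """Orthonormal DCT-II atoms (assumed stand-in for the wavelet dictionary)."""
    atoms = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        atoms.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return atoms

def spike_atoms(n):
    """Identity atoms (assumed stand-in for the shearlet dictionary)."""
    return [[1.0 if i == k else 0.0 for i in range(n)] for k in range(n)]

def synthesize(atoms, coeffs):
    """Compute Phi a as a weighted sum of atoms."""
    n = len(atoms[0])
    return [sum(c * atom[i] for c, atom in zip(coeffs, atoms)) for i in range(n)]

def analyze(atoms, signal):
    """Compute Phi^T r as inner products with each atom."""
    return [sum(u * v for u, v in zip(atom, signal)) for atom in atoms]

def soft(v, t):
    """Soft-thresholding (shrinkage) operator."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def separate(x, atoms_c, atoms_t, lam=0.05, iters=300):
    """Iterative shrinkage on lam*|a_c|_1 + lam*|a_t|_1 + 0.5*|x - Phi1 a_c - Phi2 a_t|^2."""
    a_c = [0.0] * len(atoms_c)
    a_t = [0.0] * len(atoms_t)
    step = 0.5  # 1/L, where L = 2 for the union of two orthonormal bases
    for _ in range(iters):
        x_c = synthesize(atoms_c, a_c)
        x_t = synthesize(atoms_t, a_t)
        r = [xi - ci - ti for xi, ci, ti in zip(x, x_c, x_t)]
        # Gradient step on the quadratic term, then shrinkage on each coefficient.
        a_c = [soft(a + step * g, step * lam) for a, g in zip(a_c, analyze(atoms_c, r))]
        a_t = [soft(a + step * g, step * lam) for a, g in zip(a_t, analyze(atoms_t, r))]
    return synthesize(atoms_c, a_c), synthesize(atoms_t, a_t)

# Toy signal: a smooth "texture" plus one dark "crack" sample at index 5.
n = 16
x = [3.0 * v for v in dct_atoms(n)[2]]
x[5] -= 2.0
x_c, x_t = separate(x, spike_atoms(n), dct_atoms(n))
crack_index = min(range(n), key=lambda i: x_c[i])
max_residual = max(abs(xi - ci - ti) for xi, ci, ti in zip(x, x_c, x_t))
print(crack_index, max_residual)
```

In this toy setting the spike ends up in `x_c` and the oscillation in `x_t` because each is much cheaper, in the $\ell_1$ sense, to represent in its own basis. The same argument motivates pairing shearlets with cracks and wavelets with texture.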

In our numerical experiments, we used symlet wavelets with four decomposition levels to generate $\Phi_2$ and a four-level shearlet decomposition to generate $\Phi_1$, with Meyer filters of size 80 × 80 on all four scales, 8 directional filters on the first three scales, and 16 directional filters on the fourth scale. To assess the performance of the separation algorithm, we visually compared detection results at peak F2 score (Figure 6) and calculated the ROC curves for each image using the following two detection methods (Figure 7).

  1. Shearlet-C. This method takes advantage of the Parseval property of the shearlet transform and performs crack detection directly in the transform domain. We first decompose the image into crack and texture components using iterative shrinkage with a shearlet dictionary and a wavelet dictionary. Instead of using the reconstructed image, we analyze the values of the shearlet transform coefficients. For each scale in the shearlet transform domain, we analyze the directional components corresponding to each displacement and collect the maximum magnitude across all directions. If the sign of the shearlet coefficient corresponding to the maximum magnitude is positive, we classify the corresponding pixel as background; otherwise, we assign to each pixel the norm of the vector containing the maximum responses at each scale, and we apply a threshold.

  2. Shearlet-I. We first decompose the image into crack and texture components as described for the previous method. Then, we apply an intensity threshold to the reconstructed crack image.
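The per-pixel rule used by Shearlet-C can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the layout `coeffs[scale][direction][pixel]` (pixels flattened into a list) and the function names are assumptions, and the sketch also assumes that a pixel is treated as background as soon as its dominant coefficient is positive at any scale, a detail the text does not spell out.

```python
import math

def shearlet_c_scores(coeffs):
    """Per-pixel crack score from coeffs[scale][direction][pixel] (hypothetical layout)."""
    n_pixels = len(coeffs[0][0])
    scores = []
    for p in range(n_pixels):
        responses = []
        background = False
        for scale in coeffs:
            # Coefficient with the largest magnitude across all directions.
            dominant = max((band[p] for band in scale), key=abs)
            if dominant > 0:
                # Assumption: a positive dominant response at any scale means background.
                background = True
                break
            responses.append(abs(dominant))
        # Norm of the vector of maximum responses, one entry per scale.
        scores.append(0.0 if background else math.sqrt(sum(r * r for r in responses)))
    return scores

def detect(scores, threshold):
    """Apply a global threshold to the per-pixel scores."""
    return [s > threshold for s in scores]

# Toy input: 2 scales, 2 directions, 3 pixels.
coeffs = [
    [[-3.0, 0.5, -0.1], [1.0, -0.2, 0.05]],
    [[-4.0, 0.1, -0.2], [0.5, 0.3, 0.1]],
]
scores = shearlet_c_scores(coeffs)
mask = detect(scores, threshold=1.0)
print(scores, mask)
```

Sweeping `threshold` over a range of values yields the family of detectors from which an ROC curve can be built.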

Figure 6

Crack detection results. (a) Using shearlet coefficients (Shearlet-C), (b) using thresholding in the image reconstructed using shearlets (Shearlet-I), (c) using intensity thresholding in the original image, and (d) using Canny edge detection. All results are generated at peak F2 score.

Figure 7

ROC curves for crack detection. (a) Image 1, (b) image 2, and (c) image 3.

We compare our results to the following two basic methods not based on shearlets:

  1. Intensity. This is the most basic approach, which uses only image intensity. After compensating for slow variations of intensity in the image, we apply a global threshold.

  2. Canny. We use the Canny [31] edge detector as implemented in MATLAB, using the default σ = √2 and the default low-to-high threshold ratio of 40%.

After using a low-level detector, it may be necessary to remove small isolated regions corresponding to false detections due to random noise. This postprocessing step may reduce the false detection rate on intensity-based methods. However, to provide an objective comparison, we have generated the experimental results without running any postprocessing. We leave the performance analysis of a complete crack detector for future work.

To evaluate the performance of each crack detector, we manually annotated the crack pixels in each image. To mitigate the effect of ambiguous segmentation boundaries, we annotated the boundaries around the cracks as tightly as possible (making sure that only pixels completely contained inside the crack boundaries are annotated as such) and defined an envelope region around each crack whose labels are treated as ‘do not care’. Formally, let Ω denote the set of pixels in the image, and F (foreground) denote the set of pixels labeled as cracks. We define the set B (background) as

$$B = \left\{ x \in \Omega : \min_{f \in F} \|x - f\| > \delta \right\},$$

where $\|x - f\|$ denotes the Euclidean distance between sites $x$ and $f$. In our experiments, we used $\delta = 3$. To account for possible small inaccuracies in the ground truth, we performed a bipartite graph matching between the detected crack pixels and the crack pixels in the ground truth. In our experiments, we allow matching within a maximum distance of 2 pixels. This choice of matching metric does not penalize crack overestimation errors as long as these errors are contained within the envelope. This allows us to decouple errors in estimating the position of the crack centerline from errors in estimating the crack width, which is more sensitive to lighting variations. Let $D$ be the set of pixels detected as cracks by a given detector and

$$\begin{aligned} tp &= |D \cap F|, & fn &= |\bar{D} \cap F|, & p &= tp + fn = |F|, \\ tn &= |\bar{D} \cap B|, & fp &= |D \cap B|, & n &= tn + fp = |B|. \end{aligned}$$

The probability of detection (PD) and the probability of false alarm (PF) are defined as

$$\mathrm{PD} = \frac{tp}{p}, \qquad \mathrm{PF} = \frac{fp}{n}.$$
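As a small worked example of these definitions, the sketch below builds the background set with a "do not care" envelope of radius δ around the annotated crack pixels and then computes the counts, PD, and PF. Pixels are represented as `(row, column)` tuples; the function names are illustrative, and the bipartite matching step described above is deliberately left out to keep the example short.

```python
import math

def background_set(shape, foreground, delta):
    """Pixels whose distance to every crack pixel exceeds delta."""
    rows, cols = shape
    return {
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if all(math.hypot(r - fr, c - fc) > delta for fr, fc in foreground)
    }

def detection_rates(shape, foreground, detected, delta=3):
    """Counts and rates; pixels inside the envelope but outside F are ignored."""
    background = background_set(shape, foreground, delta)
    tp = len(detected & foreground)
    fn = len(foreground - detected)
    fp = len(detected & background)
    tn = len(background - detected)
    return {
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        "PD": tp / (tp + fn),   # tp / p
        "PF": fp / (tn + fp),   # fp / n
    }

# Toy example on a 1 x 12 strip with a two-pixel crack.
foreground = {(0, 5), (0, 6)}
detected = {(0, 5), (0, 7), (0, 10)}
rates = detection_rates((1, 12), foreground, detected, delta=3)
print(rates)
```

In this example the detection at column 7 falls inside the envelope, so it counts neither as a true positive nor as a false alarm, whereas the detection at column 10 is a false alarm.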

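A sequence of admissible detectors $D \mid \mathrm{PF} \le \varepsilon$, for a given false detection rate $\varepsilon$ with $0 \le \varepsilon \le 1$, would produce monotonically increasing detection rates $\mathrm{PD} \mid \mathrm{PF} \le \varepsilon$. The receiver operating characteristic (ROC) curve is defined as PD as a function of PF:

$$\mathrm{ROC}(x) = \max_{\varepsilon \le x} \left( \mathrm{PD} \mid \mathrm{PF} = \varepsilon \right).$$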
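The maximum in this definition makes the ROC curve non-decreasing: for a false alarm budget x, it reports the best PD achieved by any operating point whose PF does not exceed x. A minimal sketch, assuming each pixel has a scalar score and a boolean label (True for crack pixels in F, False for background pixels in B, with "do not care" pixels already removed), could look like this. The function names are illustrative.

```python
def roc_points(scores, labels):
    """Operating points (PF, PD) obtained by sweeping a threshold over the scores."""
    p = sum(1 for is_crack in labels if is_crack)
    n = len(labels) - p
    points = [(0.0, 0.0)]  # threshold above every score: nothing detected
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, is_crack in zip(scores, labels) if s >= thr and is_crack)
        fp = sum(1 for s, is_crack in zip(scores, labels) if s >= thr and not is_crack)
        points.append((fp / n, tp / p))
    return points

def roc(points, x):
    """Best PD among operating points whose PF does not exceed x."""
    return max(pd for pf, pd in points if pf <= x)

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [True, False, True, True, False, False]
points = roc_points(scores, labels)
print(roc(points, 0.0), roc(points, 0.5), roc(points, 1.0))
```

Evaluating `roc(points, x)` at a small fixed `x` gives the detection rate of the detector operating at that false alarm rate.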

One commonly used metric is the area under the ROC curve (AUC), defined by

$$\mathrm{AUC} = \int_0^1 \mathrm{ROC}(x)\, dx,$$

which corresponds to the probability that a sample randomly drawn from $F$ will receive a higher score than a sample randomly drawn from $B$. AUC measures the average performance of the detector across all possible sensitivity settings. Although it is an important measure, in practice we are interested in knowing how well the detector works at a particular sensitivity setting. For this reason, we have selected constant false alarm rate (CFAR) detectors with PF = $10^{-3}$ and PF = $10^{-4}$, and we report the corresponding PD. For completeness, we also report the $F_1$ score (also known as the Dice similarity index), which is defined as

$$F_1 = \frac{2\,tp}{2\,tp + fn + fp}.$$

The F1 score is also known as the balanced F-score, since it is equivalent to the harmonic mean of the precision and recall:

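$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$

where

$$\mathrm{precision} = \frac{tp}{tp + fp}, \qquad \mathrm{recall} = \frac{tp}{tp + fn}.$$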
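The two expressions for the F1 score agree when precision is taken as tp / (tp + fp) and recall as tp / (tp + fn). The short check below uses made-up counts purely for illustration.

```python
def f1_from_counts(tp, fn, fp):
    """F1 written directly in terms of the confusion counts."""
    return 2 * tp / (2 * tp + fn + fp)

def f1_from_precision_recall(tp, fn, fp):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, not taken from the experiments in this paper.
tp, fn, fp = 80, 20, 10
f1_a = f1_from_counts(tp, fn, fp)
f1_b = f1_from_precision_recall(tp, fn, fp)
print(f1_a, f1_b)
```

Note that true negatives do not enter the F1 score, which is why it complements the PF-based metrics when background pixels vastly outnumber crack pixels.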

In this paper, we report the peak F1 score for all methods. The Canny edge detection method estimates the location of the crack boundary, while the other three methods estimate the location of the crack itself. To have a meaningful comparison, we have generated separate ground truth masks for the crack outline, so we can use the same matching metric on the Canny method. For each method, we have used the same algorithm parameters on all the images.

Table 4 summarizes our results. We observe that our shearlet-based detectors perform consistently well on all evaluation metrics. Note that, on image 3, the Shearlet-I method, which is based on intensity in the reconstructed image, produces better results than all other methods. Due to its simplicity, the intensity-based method is still being used by the industry. For example, the system recently proposed in [32] uses pixel intensities to detect cracks on road pavement. Based on the results from Table 4, we can conclude that, with the proper image preprocessing, intensity can still be a powerful feature for crack detection. However, the detection performance provided by shearlet-based features is more consistent across images. In future work, we will further explore the potential of combining both intensity and shearlet-based features. With any of the methods described in this section, it may be possible to further remove small artifacts in the detected crack boundary by adding a postprocessing step as was done in [27].

Table 4 Comparison of detection performance for different crack detection algorithms (best results are emphasized in italics)

4.3 Video denoising

Video denoising can be performed using the same type of algorithm described above for image denoising, which consists essentially of computing the shearlet coefficients of the noisy data, applying hard thresholding, and reconstructing from the thresholded coefficients. As in the image denoising case, a noisy video is obtained by adding white Gaussian noise to a video sequence. We have tested our GPU-based implementation of the 3D shearlet video denoising algorithm using the 192 × 192 × 192 waterfall video sequence. Figure 8 shows frame 96 before and after denoising. Figure 9 compares the running times of the CPU-based and GPU-based video denoising algorithms. When we go from a single core to a dual core, the run time drops from 27.5 min to 14.4 min in single precision (a 1.91× speed-up). However, when going from dual core to quad core, we only get a 1.62× speed-up, and the rate of improvement keeps diminishing as we keep doubling the number of cores, to the point where the improvement from a single core to 64 cores is just a 9.45× speed-up. On the other hand, a GeForce GTX 480 produces the same result in just 3 s, a remarkable 551× speed-up compared to a single-core CPU and a 58× speed-up over 64 CPU cores.
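The relationships between the reported timings can be checked with a few lines of arithmetic. The inputs below are the figures quoted in the text; the parallel efficiency and the implied GPU run time are derived here for illustration and are not reported values.

```python
single_core_min = 27.5        # single-core run time (reported)
dual_core_min = 14.4          # dual-core run time (reported)
speedup_64_cores = 9.45       # 64 cores vs. single core (reported)
gpu_speedup_vs_single = 551   # GPU vs. single core (reported)

# Speed-up from one to two cores.
dual_speedup = single_core_min / dual_core_min

# Parallel efficiency on 64 cores: achieved speed-up divided by core count.
efficiency_64 = speedup_64_cores / 64

# GPU speed-up relative to the 64-core run.
gpu_vs_64_cores = gpu_speedup_vs_single / speedup_64_cores

# GPU run time implied by the single-core time and the reported speed-up.
gpu_seconds = single_core_min * 60 / gpu_speedup_vs_single

print(f"dual-core speed-up:  {dual_speedup:.2f}x")
print(f"64-core efficiency:  {efficiency_64:.1%}")
print(f"GPU vs. 64 cores:    {gpu_vs_64_cores:.1f}x")
print(f"implied GPU time:    {gpu_seconds:.2f} s")
```

The low 64-core efficiency is what makes the GPU attractive here: adding CPU cores gives rapidly diminishing returns for this workload.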

Figure 8

Video denoising. (a) Original video frame. (b) Noise added. (c) Denoised frame.

Figure 9

Comparison of CPU vs. GPU run times for denoising a $192^3$ video using 3D shearlets. Time includes all transfers between CPU and GPU.

5 Conclusions

The shearlet transform is an advanced multiscale method that has emerged in recent years as a refinement of the traditional wavelet transform and has been shown to perform very competitively over a wide range of image and data processing problems. However, standard CPU-based numerical implementations are very time-consuming, which makes applying this method to large data sets and real-time problems impractical.

In this paper, we have described how to speed up the computation of the 2D/3D discrete shearlet transform by using GPU-based implementations. Developing algorithms for GPUs used to be tedious and required very specialized knowledge of the hardware. With CUDA, this is no longer the case, and scientists with C/C++ programming skills can quickly develop efficient GPU implementations of data-intensive algorithms. We have taken advantage of the GPU-based implementation of the fast Fourier transform and used the capabilities of MATLAB for quick prototyping. The results presented in this paper illustrate the practical benefits of this approach. For example, a GeForce GTX 480, a $200 graphics card, can perform video denoising 58 times faster than an expensive 64-core machine while consuming much less power.

Our new implementation enables the efficient application of the shearlet decomposition to a variety of image and data processing tasks for which the required CPU resources would be prohibitive. There are further improvements and extensions that can be achieved such as precalculating the filter coefficients and porting the code to OpenCL so it can also run on AMD and Intel GPUs, but this would go beyond the scope of this paper.


a Note that this code also includes some C routines to speed up the computation time. This is true both for the 2D and 3D implementations.


  1. Candès EJ, Donoho DL: New tight frames of curvelets and optimal representations of objects with $C^2$ singularities. Comm. Pure Appl. Math 2004, 57: 219-266. 10.1002/cpa.10116

  2. Cunha A, Zhou J, Do M: The nonsubsampled contourlet transform: theory, design, and applications. IEEE Trans. Image Process 2006, 15(10):3089-3101.

  3. Labate D, Lim W, Kutyniok G, Weiss G: Sparse multidimensional representation using shearlets. Wavelets XI (San Diego, CA, 2005), Volume SPIE Proc. 5914 2005, 254-262.

  4. Kutyniok G, Labate D: Shearlets: Multiscale Analysis for Multivariate Data. Birkhäuser, Boston; 2012.

  5. Starck JL, Murtagh F, Fadili JM: Sparse Image and Signal Processing: Wavelets, Curvelets, Morphological Diversity. Cambridge University Press, Cambridge; 2010.

  6. Guo K, Labate D: The construction of smooth Parseval frames of shearlets. Math. Model. Nat. Phenom 2013, 8: 82-105. 10.1051/mmnp/20138106

  7. Guo K, Labate D: Optimally sparse multidimensional representation using shearlets. SIAM J. Math. Anal 2007, 39: 298-318.

  8. Guo K, Labate D: Optimally sparse representations of 3D data with $C^2$ surface singularities using Parseval frames of shearlets. SIAM J. Math. Anal 2012, 44: 851-886. 10.1137/100813397

  9. Easley GR, Labate D, Lim W: Sparse directional image representations using the discrete shearlet transform. Appl. Comput. Harmon. Anal 2008, 25: 25-46. 10.1016/j.acha.2007.09.003

  10. Kutyniok G, Shahram M, Zhuang X: ShearLab: a rational design of a digital parabolic scaling algorithm. SIAM J. Imaging Sci 2012, 5(4):1291-1332. 10.1137/110854497

  11. Guo K, Labate D: Representation of Fourier integral operators using shearlets. J. Fourier Anal. Appl 2008, 14: 327-371. 10.1007/s00041-008-9018-0

  12. Colonna F, Easley GR, Guo K, Labate D: Radon transform inversion using the shearlet representation. Appl. Comput. Harmon. Anal 2010, 29(2):232-250. 10.1016/j.acha.2009.10.005

  13. Vandeghinste B, Goossens B, Van Holen R, Vanhove C, Pizurica A, Vandenberghe S, Staelens S: Iterative CT reconstruction using shearlet-based regularization. IEEE Trans. Nuclear Sci 2013, 60(5):3305-3317.

  14. Guo K, Labate D: Characterization and analysis of edges using the continuous shearlet transform. SIAM J. Imaging Sci 2009, 2: 959-986. 10.1137/080741537

  15. Guo K, Labate D: Analysis and detection of surface discontinuities using the 3D continuous shearlet transform. Appl. Comput. Harmon. Anal 2011, 30: 231-242. 10.1016/j.acha.2010.08.004

  16. Yi S, Labate D, Easley GR, Krim H: A shearlet approach to edge analysis and detection. IEEE Trans. Image Process 2009, 18(5):929-941.

  17. Kutyniok G, Lim W: Image separation using wavelets and shearlets. In Curves and Surfaces (Avignon, France, 2010), 416–430, Lecture Notes in Computer Science 6920. Springer Berlin Heidelberg; 2011.

  18. Easley G, Labate D, Negi PS: 3D data denoising using combined sparse dictionaries. Math. Model. Nat. Phenom 2013, 8: 60-74.

  19. Patel VM, Easley G, Healy D: Shearlet-based deconvolution. IEEE Trans. Image Process 2009, 18: 2673-2685.

  20. Negi P, Labate D: 3-D discrete shearlet transform and video processing. IEEE Trans. Image Process 2012, 21: 2944-2954.

  21. Easley G, Labate D, Patel VM: Directional multiscale processing of images using wavelets with composite dilations. J. Math. Imaging Vis 2014, 48(1):13-43. 10.1007/s10851-012-0385-4

  22. Candès EJ, Demanet L, Donoho D, Ying L: Fast discrete curvelet transforms. SIAM Multiscale Model. Simul 2006, 5(3):861-899. 10.1137/05064182X

  23. Burt PJ, Adelson EH: The Laplacian pyramid as a compact image code. IEEE Trans. Commun 1983, 31(4):532-540. 10.1109/TCOM.1983.1095851

  24. Donoho D, Johnstone I: Adapting to unknown smoothness via wavelet shrinkage. J. Am. Stat. Assoc 1995, 90: 1200-1224. 10.1080/01621459.1995.10476626

  25. Chang SG, Yu B, Vetterli M: Adaptive wavelet thresholding for image denoising and compression. IEEE Trans. Image Process 2000, 9(9):1532-1546. 10.1109/83.862633

  26. Subirats P, Dumoulin J, Legeay V, Barba D: Automation of pavement surface crack detection using the continuous wavelet transform. IEEE International Conference on Image Processing, Atlanta, GA 3037.

  27. Chambon S, Moliard J: Automatic road pavement assessment with image processing: review and comparison. Int. J. Geophys. article ID 989354 2011, 20 pages. doi:10.1155/2011/989354

  28. Ma C, Zhao C, Hou Y: Pavement distress detection based on nonsubsampled contourlet transform. Int. Conf. Comput. Sci. Softw. Eng. 2008, 1: 28-31.

  29. Starck JL, Elad M, Donoho D: Image decomposition via the combination of sparse representation and a variational approach. IEEE Trans. Image Process 2005, 14(10):1570-1582.

  30. Bobin J, Starck JL, Fadili M, Moudden Y, Donoho D: Morphological component analysis: an adaptive thresholding strategy. IEEE Trans. Image Process 2007, 16(11):2675-2681.

  31. Canny J: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 8(6):679-698.

  32. Oliveira H, Correia P: Automatic road crack detection and characterization. IEEE Trans. Intell. Transport. Syst. 2013, 14: 155-168.



The authors thank Amtrak, ENSCO, Inc. and the Federal Railroad Administration for providing the images used in Section 4.2. This work was supported by the Federal Railroad Administration under contract DTFR53-13-C-00032. DL acknowledges support from NSF grant DMS 1008900/1008907 and DMS 1005799.

Author information

Corresponding author

Correspondence to Xavier Gibert.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Gibert, X., Patel, V.M., Labate, D. et al. Discrete shearlet transform on GPU with applications in anomaly detection and denoising. EURASIP J. Adv. Signal Process. 2014, 64 (2014).
