Discrete shearlet transform on GPU with applications in anomaly detection and denoising
© Gibert et al.; licensee Springer. 2014
- Received: 4 November 2013
- Accepted: 19 April 2014
- Published: 10 May 2014
Shearlets have emerged in recent years as one of the most successful methods for the multiscale analysis of multidimensional signals. Unlike wavelets, shearlets form a pyramid of well-localized functions defined not only over a range of scales and locations, but also over a range of orientations and with highly anisotropic supports. As a result, shearlets are much more effective than traditional wavelets in handling the geometry of multidimensional data, and this has been exploited in a wide range of applications in image and signal processing. However, despite their desirable properties, the wider applicability of shearlets is limited by the computational complexity of current software implementations. For example, denoising a single 512 × 512 image using a current implementation of the shearlet-based shrinkage algorithm can take between 10 s and 2 min, depending on the number of CPU cores, and much longer processing times are required for video denoising. On the other hand, due to the parallel nature of the shearlet transform, it is possible to use graphics processing units (GPUs) to accelerate its implementation. In this paper, we present an open source stand-alone implementation of the 2D discrete shearlet transform using CUDA C++ as well as GPU-accelerated MATLAB implementations of the 2D and 3D shearlet transforms. We have instrumented the code so that we can analyze the running time of each kernel on different GPU hardware. In addition to denoising, we describe a novel application of shearlets for detecting anomalies in textured images. In this application, computation times can be reduced by a factor of 50 or more, compared to multicore CPU implementations.
- Image processing
During the last decade, a new generation of multiscale systems has emerged which combines the power of the classical multiresolution analysis with the ability to process directional information with very high efficiency. Some of the most notable examples of such systems include the curvelets, the contourlets, and the shearlets. Unlike classical wavelets, the elements of such systems form a pyramid of well-localized waveforms ranging not only across various scales and locations, but also across various orientations and with highly anisotropic shapes. Due to their richer structure, these more sophisticated multiscale systems are able to overcome the poor directional sensitivity of traditional multiscale systems and have been used to derive several state-of-the-art algorithms in image and signal processing (cf. [4, 5]).
Shearlets, in particular, offer a unique combination of remarkable features: they have a simple and well-understood mathematical structure derived from the theory of affine systems [3, 6]; they provide optimally sparse representations, in a precise sense, for a large class of images and other multidimensional data where wavelets are suboptimal [7, 8]; and the directionality is controlled by shear matrices rather than rotations. This last property, in particular, enables a unified framework for the continuum and discrete settings, since shear transformations preserve the rectangular lattice, and it is an advantage in deriving faithful digital implementations [9, 10].
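The lattice-preservation property can be checked with a small numerical example. The sketch below is illustrative only: it assumes the shear matrix has the form [[1, ℓ], [0, 1]] with an integer shear parameter ℓ, and the function names are hypothetical. It compares the image of an integer grid under a shear with its image under a rotation by 30 degrees.

```python
import math

def shear(point, ell):
    """Apply the shear matrix [[1, ell], [0, 1]] to a point (x, y)."""
    x, y = point
    return (x + ell * y, y)

def rotate(point, angle):
    """Rotate a point (x, y) counter-clockwise by `angle` radians."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def on_integer_lattice(point, tol=1e-9):
    """True if both coordinates are integers up to a small tolerance."""
    return all(abs(v - round(v)) < tol for v in point)

# A small patch of the integer grid.
grid = [(x, y) for x in range(-2, 3) for y in range(-2, 3)]

sheared_ok = all(on_integer_lattice(shear(p, 3)) for p in grid)
rotated_ok = all(on_integer_lattice(rotate(p, math.pi / 6)) for p in grid)

print(sheared_ok)  # True: an integer shear maps grid points to grid points
print(rotated_ok)  # False: a 30-degree rotation leaves the grid
```

With an integer shear parameter, no interpolation is needed to stay on the sampling grid, which is the practical advantage referred to above.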
The shearlet decomposition has been successfully employed in many problems from applied mathematics and signal processing, including decomposition of operators , inverse problems [12, 13], edge detection [14–16], image separation , and image restoration [18–20]. However, one major bottleneck to the wider applicability of the shearlet transform is that current discrete implementations tend to be very time consuming, making its use impractical for large data sets and for real-time applications. For instance, the current (CPU-based) MATLAB implementation (endnote a) of the 2D shearlet transform, run on a typical desktop PC, takes about 2 min to denoise a noisy image of size 512 × 512 [9, 21]. The running time of the current (CPU-based) MATLAB implementation of the 3D shearlet transform for denoising a video sequence of size $192^3$ is about 5 min . Running times for alternative shearlet implementations from ShearLab  as well as for the current implementation of the curvelet transform  are comparable.
In recent years, general-purpose graphics processing units (GPGPUs) have become ubiquitous not only on high-performance computing (HPC) clusters, but also on workstations. For example, Titan, which was until recently the world’s fastest supercomputer, contains 18,688 NVIDIA Tesla K20X GPUs. These GPUs provide about 90% of Titan’s peak computing performance, which is greater than 20 PetaFLOPS (quadrillion floating point operations per second). Due to their energy efficiency and capabilities, GPGPUs are also becoming mainstream on mobile platforms, such as iOS and Android devices. There are two main architectures for GPGPU computing: CUDA and OpenCL. CUDA was designed by NVIDIA and has been around since 2006. OpenCL was originally designed by Apple, Inc., and was introduced in 2008. OpenCL is an open standard maintained by the Khronos Group, whose members include Intel, AMD, NVIDIA, and many others, so it has broader industry acceptance than any other architecture. In 2009, Microsoft introduced DirectCompute as an alternative architecture for GPGPU computing, which is only available in Windows Vista and later. OpenCL has been designed to provide the developer with a common framework for doing computation on heterogeneous devices. One of the advantages of OpenCL is that it can potentially support any computing device, such as CPUs, GPUs, and FPGAs, as long as an OpenCL compiler is available for that processor. NVIDIA provides CUDA/OpenCL drivers, libraries, and development tools for the three major operating systems (Linux, Windows, and Mac OS X), while AMD/ATI™ and Intel provide OpenCL drivers and tools for their respective GPUs.
The objective of this paper is to introduce and demonstrate a new implementation of the 2D and 3D discrete shearlet transform which takes advantage of the computational capabilities of the graphics processing unit (GPU). To demonstrate the effectiveness of the proposed implementations, we will illustrate their application to problems of image and video denoising and to a problem of feature recognition aimed at crack detection in railway components. In particular, we will show that our new implementation takes about 40 ms to denoise an image of size 512 × 512, which is a 233× speed-up compared to a single-core CPU, and about 3 s to denoise a video of size $192^3$, which is a 551× speed-up compared to a single-core CPU.
The organization of the paper is as follows. In Section 2, we recall the construction of 2D and 3D shearlets. Next, in Section 3, we present our implementation of the discrete shearlet transform, and in Section 4, we benchmark our implementation using three specific applications. Finally, concluding remarks and future work are discussed in Section 5.
2.1 2D shearlets
the coarse-scale shearlets;
the interior shearlets, where the functions are given by (2);
the boundary shearlets, obtained by joining together slightly modified versions of the two cone-based shearlet families, for $\ell = \pm 2^j$, after they have been restricted in the Fourier domain to their respective cones. We refer to  for details.
where $M = M_C \cup M_I \cup M_B$ are the indices associated with coarse-scale shearlets, interior shearlets, and boundary shearlets, respectively. We have the following result from :
All elements are $C^\infty$ and compactly supported in the Fourier domain.
As mentioned above, it is proved in  that the 2D Parseval frame of shearlets provides essentially optimal approximations for functions of two variables which are $C^2$ regular away from discontinuities along $C^2$ curves.
The mapping from a function into its coefficients with respect to the elements of this system is called the 2D shearlet transform.
2.2 3D shearlets
which again can be identified as the coarse-scale, interior and boundary shearlets. It turns out that the 3D system of shearlets is a Parseval frame, and it provides essentially optimal approximations for functions of three variables which are $C^2$ regular away from discontinuities along $C^2$ surfaces .
A faithful numerical implementation of the 2D shearlet transform was originally presented in . Let us briefly recall the main steps of this implementation.
3.1 2D discrete shearlet transform
This shows that the directional components are obtained by simply translating the window function V. The discrete samples $g_j[n_1, n_2] = g_j(n_1, n_2)$ are the corresponding DFT values on a pseudo-polar grid.
where ∗ denotes the one-dimensional convolution along the $n_2$ axis and the accompanying transform is the one-dimensional discrete Fourier transform. Thus, (6) gives the algorithmic implementation for computing the discrete samples of $g_j(u, w)\, v(2^j w - \ell)$. At this point, to compute the shearlet coefficients in the discrete domain, it suffices to compute the inverse pseudo-polar DFT (PDFT) or to directly reassemble the Cartesian sampled values and apply the inverse two-dimensional FFT. Figure 3 illustrates the cascade of Laplacian pyramid and directional filtering. Recall that once the discrete shearlet coefficients are obtained, the inverse shearlet transform is computed using the following steps: (i) convolution of discrete shearlet coefficients and synthesis directional filters, (ii) summation of all directional components, and (iii) reconstruction by inverse Laplacian pyramidal transformation.
3.2 2D GPU-based implementation
Table: Comparison of processing times for denoising a single-precision 512 × 512 image (GTX 690 GPU)
Since most of the computing time for performing a discrete shearlet transform is spent in FFT function calls, it is crucial to have the best possible library to perform FFTs. The two main GPU vendors provide optimized FFT libraries: NVIDIA provides cuFFT as part of its CUDA Toolkit, and AMD provides clAmdFft as part of its Accelerated Parallel Processing Math Libraries (APPML). We have decided to use CUDA as our development architecture both because its documentation is better and because more mature development tools are available. We have implemented the device code in CUDA C++, while the host code is pure C++. Since both CUDA C/C++ and OpenCL are based on the C programming language, porting the code from CUDA to OpenCL should not be difficult. However, for code compactness, we have made extensive use of templates and operator overloading, which are supported in CUDA C++, but not in OpenCL, which is based on C99.
To facilitate the development, we have used GPUmat from the GP-you Group, a free (GPLv3) GPU engine for MATLAB® (source code is available from http://sourceforge.net/projects/gpumat/). This framework provides two new classes, GPUsingle and GPUdouble, which encapsulate vectors of numerical data allocated on GPU memory and allow mathematical operations on objects of such classes via function and operator overloading. Transfers between CPU and GPU memory are as simple as doing type casting, and memory allocation and deallocation are done automatically. The idea is that existing MATLAB functions could be reused without any code changes. In practice, however, in order to get acceptable performance, it is necessary to hand-tune the code or even use lower level languages such as C/C++.
Fortunately, the GPUmat framework provides an interface for manipulating these objects from MEX files, and a mechanism for loading custom kernels. Although there are commercial alternatives to GPUmat such as Jacket from AccelerEyes, or the Parallel Computing Toolbox from Mathworks, we have found that GPUmat is pretty robust and adds very little overhead to the execution time as long as we follow good programming practices such as in-place operations and reuse of preallocated buffers.
Our implementation supports both single precision (32-bit) and double precision (64-bit) IEEE 754 floating point numbers (double precision is only supported on devices with compute capability 2.0 or newer due to limitations in the maximum amount of shared memory available per multiprocessor). We generate the filter bank of directional filters using the Fourier-domain approach from , where directional filters are designed as Meyer-type window functions in the Fourier domain. Since this step only needs to be run once and does not depend on the image dimensions, we precompute these directional filters using the original MATLAB implementation.
For the Laplacian pyramidal decomposition, we ported the à trous algorithm using symmetric extension  to CUDA. This algorithm requires performing non-separable convolutions with decimated signals. For efficiency reasons, the kernel that performs à trous convolutions preloads blocks of data into shared memory, so that each GPU thread reads global memory only once.
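To make the à trous step concrete, the following is a minimal CPU-side sketch in plain Python, not the CUDA kernel itself. It assumes half-sample symmetric reflection at the borders (the exact extension used by the original code is not specified here) and a kernel whose taps are spread apart by 2^level samples; the names `reflect` and `atrous_conv2d` are hypothetical.

```python
def reflect(i, n):
    """Map any integer index into [0, n) by half-sample symmetric reflection.

    Example for n = 4: ... -2 -> 1, -1 -> 0, 4 -> 3, 5 -> 2 ...
    """
    period = 2 * n
    i %= period
    return i if i < n else period - 1 - i

def atrous_conv2d(image, kernel, level):
    """Non-separable 2D convolution with the kernel dilated by 2**level.

    `image` and `kernel` are lists of rows. Dilation inserts holes between
    the kernel taps instead of downsampling the image.
    """
    step = 2 ** level
    rows, cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    c_row, c_col = k_rows // 2, k_cols // 2  # kernel centre
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for a in range(k_rows):
                for b in range(k_cols):
                    # Convolution: the kernel offset is subtracted from the position.
                    rr = reflect(r - (a - c_row) * step, rows)
                    cc = reflect(c - (b - c_col) * step, cols)
                    acc += kernel[a][b] * image[rr][cc]
            out[r][c] = acc
    return out

# A normalized 3 x 3 low-pass kernel applied at dilation level 1.
lowpass = [[v / 16.0 for v in row] for row in ([1, 2, 1], [2, 4, 2], [1, 2, 1])]
flat = [[3.0] * 5 for _ in range(5)]
smoothed = atrous_conv2d(flat, lowpass, level=1)
```

In a GPU version, the four nested loops over `r`, `c`, `a`, and `b` would be split so that each thread handles one output pixel, with the neighbourhood staged in shared memory as described above.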
With the above GPU-based Laplacian pyramid and directional filter implementation, it is just a matter of applying convolutions in the GPU to find the forward and inverse shearlet transform.
Main steps of the shearlet transform
Forward FFT of directional components
Forward FFT of Laplacian components
Modulation with complex conjugate directional filter bank
Modulation of Laplacian components with directional filter bank
Inverse FFT of directional components
Inverse FFT of directional components
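The forward path listed above (forward FFT, modulation with the directional filter bank, inverse FFT) can be sketched in one dimension as follows. This is an illustration under simplifying assumptions: a naive O(n²) DFT stands in for the cuFFT calls, signals are 1D rather than 2D, and the two filters are arbitrary examples rather than Meyer-type windows. The function names are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (a stand-in for an FFT library call)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def directional_components(laplacian_component, filter_spectra):
    """Forward transform once, then modulate and inverse-transform per filter."""
    spectrum = dft(laplacian_component)                      # forward FFT
    components = []
    for filt in filter_spectra:
        modulated = [s * h for s, h in zip(spectrum, filt)]  # pointwise modulation
        restored = dft(modulated, inverse=True)              # inverse FFT
        components.append([v.real for v in restored])        # real input, real output
    return components

def circular_convolution(x, h):
    """Direct circular convolution, used only to check the Fourier-domain result."""
    n = len(x)
    return [sum(x[m] * h[(k - m) % n] for m in range(n)) for k in range(n)]

signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
filters = [
    [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25],
    [0.5, -0.25, 0.0, 0.0, 0.0, 0.0, 0.0, -0.25],
]
bands = directional_components(signal, [dft(h) for h in filters])
```

Because the forward transform of the input is computed once and reused for every filter, the per-direction cost is one pointwise product and one inverse transform, which is why FFT performance dominates the running time.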
3.3 3D discrete shearlet transform
The algorithm for the discretization of the 3D shearlet transform is very similar to the 2D shearlet transform, and our implementation of the 3D discrete shearlet transform adapts the code available from http://www.math.uh.edu/~dlabate/3Dshearlet_toolbox.zip and described in . The main practical difference is that storing the 3D shearlet coefficients is much more memory-intensive. Since the memory requirement can easily exceed the available GPU memory, in our algorithm, we compute one convolution at a time in CUDA and add the result to the output.
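The memory-saving pattern described above, computing one convolution at a time and adding it to the output, can be sketched as follows. This is a schematic Python illustration, not the CUDA code: it uses 1D circular convolution, a generator as a hypothetical source of filters, and an optional `process` hook standing in for any per-band operation such as thresholding.

```python
def circular_convolution(x, h):
    """Direct circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(x[m] * h[(k - m) % n] for m in range(n)) for k in range(n)]

def accumulate_bands(signal, filters, process=None):
    """Sum the (optionally processed) filtered bands, one band at a time.

    Only the running output and the current band are held in memory, so the
    peak footprint does not grow with the number of filters.
    """
    output = [0.0] * len(signal)
    for h in filters:  # `filters` may be a generator producing filters lazily
        band = circular_convolution(signal, h)
        if process is not None:
            band = process(band)
        for i, value in enumerate(band):
            output[i] += value
    return output

def filter_stream():
    """Yield two example filters whose sum is the unit impulse."""
    yield [0.5, 0.25, 0.0, 0.25]
    yield [0.5, -0.25, 0.0, -0.25]

signal = [1.0, -2.0, 3.0, 0.5]
result = accumulate_bands(signal, filter_stream())
```

Since the two example filters add up to a unit impulse, the accumulated output reproduces the input, which gives a quick sanity check of the accumulation logic.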
In the following, we illustrate the advantages of our new implementation of the discrete shearlet transform by considering three applications: denoising of natural images corrupted with white Gaussian noise, detection of cracks in textured images, and denoising of videos. The source code, sample data, as well as the MATLAB scripts used to generate all the figures in this paper are publicly available at http://www.umiacs.umd.edu/~gibert/ShearCuda.zip.
Table: Specifications and computing environments for each of the graphics processors used in our benchmarks (surviving entries: number of cores; GeForce GTX 480; GeForce GTX 690)
4.1 Image denoising
As a first test, we evaluated the performance of our implementation of the discrete shearlet transform on a problem of image denoising, using a standard denoising algorithm based on hard thresholding of the shearlet coefficients. The setup is similar to the one described in . That is, given an image f, we observe a noisy version of it given by u = f + ε, where ε is an additive white Gaussian noise process which is independent of f. Our goal is to compute an estimate of f from the noisy data u by applying a classical hard thresholding scheme  on the shearlet coefficients of u. The threshold levels are chosen as in [2, 9, 25], based on the variance of the n-th coefficient at the i-th directional subband in the j-th scale and on the noise variance at scale j and directional band i. The noise variances are estimated by using a Monte Carlo technique in which the variances are computed for several normalized noise images and then the estimates are averaged.
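The Monte Carlo estimate and the hard thresholding step can be sketched as follows. Everything here is a simplified stand-in: a small bank of 1D circular filters plays the role of the shearlet subbands, and the threshold rule T = k · σ · s_b (with s_b the estimated standard deviation of unit-variance noise in band b and k = 3) is an assumed, commonly used form rather than the exact rule used in this paper.

```python
import random

def circular_convolution(x, h):
    """Direct circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(x[m] * h[(k - m) % n] for m in range(n)) for k in range(n)]

def transform(signal, filters):
    """Toy subband decomposition: one filtered copy of the signal per filter."""
    return [circular_convolution(signal, h) for h in filters]

def estimate_band_std(filters, n, trials=20, seed=0):
    """Monte Carlo estimate of the per-band standard deviation of unit-variance noise."""
    rng = random.Random(seed)
    variance_sums = [0.0] * len(filters)
    for _ in range(trials):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for b, band in enumerate(transform(noise, filters)):
            variance_sums[b] += sum(v * v for v in band) / n  # zero-mean variance
    return [(s / trials) ** 0.5 for s in variance_sums]

def hard_threshold(band, t):
    """Keep coefficients whose magnitude exceeds t; set the rest to zero."""
    return [v if abs(v) > t else 0.0 for v in band]

def threshold_bands(signal, filters, band_std, sigma, k=3.0):
    """Apply the assumed rule T = k * sigma * band_std[b] to every band."""
    bands = transform(signal, filters)
    return [hard_threshold(band, k * sigma * s) for band, s in zip(bands, band_std)]

n = 128
filters = [
    [1.0] + [0.0] * (n - 1),       # identity filter: unit noise stays unit variance
    [0.5, 0.5] + [0.0] * (n - 2),  # two-tap average: noise variance is halved
]
band_std = estimate_band_std(filters, n)
```

For the two example filters, the estimates should come out close to 1 and to 1/√2 respectively, matching the sum of squared filter taps.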
where $\|\cdot\|_F$ is the Frobenius norm, the given image f is of size N × N, and denotes the estimated image.
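As an illustration, a common definition of the peak signal-to-noise ratio that is consistent with the symbols just described is PSNR = 20 log10(255 N / ‖f − f_est‖_F), in decibels. The sketch below assumes this form and an 8-bit peak value of 255; both are assumptions, and the function name `psnr` is hypothetical.

```python
import math

def psnr(reference, estimate, peak=255.0):
    """PSNR in dB between two equally sized images given as lists of rows.

    Uses 20 * log10(peak * sqrt(rows * cols) / ||reference - estimate||_F),
    which reduces to 20 * log10(peak * N / ||.||_F) for an N x N image.
    """
    rows, cols = len(reference), len(reference[0])
    squared_error = sum(
        (a - b) ** 2
        for ref_row, est_row in zip(reference, estimate)
        for a, b in zip(ref_row, est_row)
    )
    if squared_error == 0.0:
        return math.inf  # identical images
    frobenius = math.sqrt(squared_error)
    return 20.0 * math.log10(peak * math.sqrt(rows * cols) / frobenius)

# A 2 x 2 example with a uniform error of 25.5 grey levels (one tenth of the peak).
reference = [[100.0, 100.0], [100.0, 100.0]]
estimate = [[125.5, 125.5], [125.5, 125.5]]
value = psnr(reference, estimate)
```

A uniform error of one tenth of the peak value corresponds to 20 dB under this definition, which is a convenient check when validating an implementation.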
In order to minimize latency as well as bandwidth usage on the PCIe bus, we first transferred the input image to GPU memory, then we let all the computation happen on the GPU and we finally transferred the results back to CPU memory. We have verified that both CPU and GPU implementations provide an output PSNR of 29.9 dB when the input PSNR is 22.1 dB. At these noise levels, there is no difference in PSNR between the single and the double precision implementations.
Table 1 shows the breakdown of different parts of the image denoising algorithm both on CPU and GPU.
4.2 Crack detection
Detection of cracks on concrete structures is a difficult problem due to the changes in width and direction of the cracks, as well as the variability in the surface texture. This problem has received considerable attention recently. Redundant representations, such as undecimated wavelets, have been extensively used for crack detection [26, 27]. However, wavelets have poor directional sensitivity and have difficulties in detecting weak diagonal cracks. To overcome this limitation, Ma et al.  proposed the use of the nonsubsampled contourlet transform for crack detection. However, all these methods rely on the assumption that the background surface can be modeled as additive white Gaussian noise, and this assumption leads to matched filter solutions. As a matter of fact, on real images, textures are highly correlated and applying linear filters leads to poor performance.
To address this problem, we propose a completely new approach to crack detection based on separating the image into morphologically distinct components using sparse representations, adaptive thresholding, and variational regularization. This technique was pioneered by Starck et al.  and later extended and generalized by many authors (e.g., [17, 18, 30]). In particular, we will use the Iterative Shrinkage Algorithm with a combined dictionary of shearlets and wavelets to separate cracks from background texture.
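The idea of separating a signal into two components by iterative shrinkage over a combined dictionary can be illustrated on a toy 1D problem. The sketch below is not the shearlet and wavelet setup used in this paper: as stand-ins, it uses the identity basis for the spike-like component and an orthonormal DCT-II basis for the oscillatory component, with a hard threshold that decreases linearly across iterations. The schedule and all names are assumptions made for illustration.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis; row k is the k-th basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)])
    return rows

def analyze(basis, x):
    """Coefficients of x in an orthonormal basis (inner product with each row)."""
    return [sum(b * v for b, v in zip(row, x)) for row in basis]

def synthesize(basis, coeffs):
    """Reconstruct a signal from its coefficients in an orthonormal basis."""
    n = len(basis[0])
    return [sum(coeffs[k] * basis[k][t] for k in range(len(basis))) for t in range(n)]

def hard_threshold(values, t):
    return [v if abs(v) > t else 0.0 for v in values]

def separate(y, basis, iterations=10, lam_max=6.0, lam_min=1.5):
    """Alternate thresholded updates of the two components with a decreasing threshold."""
    n = len(y)
    spikes = [0.0] * n
    smooth = [0.0] * n
    for i in range(iterations):
        lam = lam_max - i * (lam_max - lam_min) / (iterations - 1)
        # Spike-like part: threshold the residual in the identity dictionary.
        spikes = hard_threshold([y[t] - smooth[t] for t in range(n)], lam)
        # Oscillatory part: threshold the residual in the DCT dictionary.
        coeffs = analyze(basis, [y[t] - spikes[t] for t in range(n)])
        smooth = synthesize(basis, hard_threshold(coeffs, lam))
    return spikes, smooth

n = 32
basis = dct_matrix(n)
texture = [8.0 * v for v in basis[3]]  # one DCT atom with coefficient 8
y = list(texture)
y[10] += 5.0                           # one spike of height 5
spikes, smooth = separate(y, basis)
```

Starting with a high threshold lets the strongest atom of each dictionary be captured first; as the threshold decreases, each component is refined on the residual left by the other, which is the mechanism that keeps the two parts from leaking into each other.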
Shearlet-C. This method takes advantage of the Parseval property of the shearlet transform and performs crack detection directly in the transform domain. We first decompose the image into cracks and texture components using iterative shrinkage with a shearlet dictionary and a wavelet one. Instead of using the reconstructed image, we analyze the values of the shearlet transform coefficients. For each scale in the shearlet transform domain, we analyze the directional components corresponding to each displacement and collect the maximum magnitude across all directions. If the sign of the shearlet coefficient corresponding to the maximum magnitude is positive, we classify the corresponding pixel as background; otherwise, we assign the norm of the vector containing the maximum responses at each scale to each pixel and we apply a threshold.
Shearlet-I. We first decompose the image into cracks and texture components as described for the previous method. Then, we apply an intensity threshold on the reconstructed cracks image.
Intensity. This is the most basic approach, which only uses image intensity. After compensating for slow variations of intensity in the image, we apply a global threshold.
Canny. We use the Canny edge detector as implemented in MATLAB with its default settings, including the default low-to-high threshold ratio of 40%.
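The per-pixel decision rule of the Shearlet-C method can be sketched as follows. The data layout (`coeffs[scale][direction]` for a single pixel) is hypothetical, and the sketch assumes one reading of the rule: a pixel is treated as background as soon as the dominant coefficient at any scale is positive.

```python
import math

def shearlet_c_score(coeffs):
    """Score one pixel from its shearlet coefficients.

    `coeffs[s][d]` is the coefficient at scale s and direction d for this pixel.
    Returns 0.0 for background; otherwise the Euclidean norm of the largest
    magnitudes found at each scale.
    """
    maxima = []
    for directions in coeffs:
        dominant = max(directions, key=abs)  # coefficient with the largest magnitude
        if dominant > 0:
            return 0.0                       # positive dominant response: background
        maxima.append(abs(dominant))
    return math.sqrt(sum(m * m for m in maxima))

def detect(pixels, threshold):
    """Return one boolean per pixel: True where the score exceeds the threshold."""
    return [shearlet_c_score(coeffs) > threshold for coeffs in pixels]

crack_like = [[-3.0, 1.0], [-4.0, 2.0]]  # negative dominant response at both scales
background = [[3.0, -1.0], [-4.0, 0.0]]  # positive dominant response at the first scale
mask = detect([crack_like, background], threshold=2.0)
```

For the first example pixel the per-scale maxima are 3 and 4, giving a score of 5; the second is rejected at the first scale regardless of its response at the second.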
After using a low-level detector, it may be necessary to remove small isolated regions corresponding to false detections due to random noise. This postprocessing step may reduce the false detection rate on intensity-based methods. However, to provide an objective comparison, we have generated the experimental results without running any postprocessing. We leave the performance analysis of a complete crack detector for future work.
In this paper, we report the peak F1 score for all methods. The Canny edge detection method estimates the location of the crack boundary, while the other three methods estimate the location of the crack itself. To have a meaningful comparison, we have generated separate ground truth masks for the crack outline, so we can use the same matching metric on the Canny method. For each method, we have used the same algorithm parameters on all the images.
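For reference, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R), and the peak F1 score is its maximum over all detection thresholds. The sketch below assumes exact pixel-wise matching between predictions and ground truth, which may be stricter than the matching metric used in the experiments; the function names are hypothetical.

```python
def f1_score(predicted, truth):
    """F1 score for two equally long sequences of booleans."""
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

def peak_f1(scores, truth):
    """Maximum F1 over all thresholds taken from the observed detector scores."""
    best = 0.0
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        best = max(best, f1_score(predicted, truth))
    return best

scores = [0.9, 0.2, 0.8, 0.1]       # detector response per pixel
truth = [True, True, False, False]  # ground-truth crack mask
best = peak_f1(scores, truth)
```

In this example the best threshold is 0.2, which detects both crack pixels at the cost of one false positive, giving precision 2/3, recall 1, and a peak F1 of 0.8.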
Comparison of detection performance for different crack detection algorithms (best results are emphasized in italics)
4.3 Video denoising
The shearlet transform is an advanced multiscale method which has emerged in recent years as a refinement of the traditional wavelet transform and was shown to perform very competitively over a wide range of image and data processing problems. However, standard CPU-based numerical implementations are very time-consuming and make the application of this method to large data sets and real-time problems very impractical.
In this paper, we have described how to speed up the computation of the 2D/3D discrete shearlet transform by using GPU-based implementations. The development of algorithms on GPUs used to be tedious and required very specialized knowledge of the hardware. Using CUDA, this is no longer the case, and scientists with C/C++ programming skills can quickly develop efficient GPU implementations of data-intensive algorithms. In this paper, we have taken advantage of the GPU-based implementation of the fast Fourier transform and used the capabilities of MATLAB for quick prototyping. The results presented in this paper illustrate the practical benefits of this approach. For example, a GeForce GTX 480, a $200 graphics card, can perform video denoising 58 times faster than an expensive 64-core machine while consuming much less power.
Our new implementation enables the efficient application of the shearlet decomposition to a variety of image and data processing tasks for which the required CPU resources would be prohibitive. There are further improvements and extensions that can be achieved such as precalculating the filter coefficients and porting the code to OpenCL so it can also run on AMD and Intel GPUs, but this would go beyond the scope of this paper.
a Note that this code also includes some C routines to speed up the computation time. This is true both for the 2D and 3D implementations.
The authors thank Amtrak, ENSCO, Inc. and the Federal Railroad Administration for providing the images used in Section 4.2. This work was supported by the Federal Railroad Administration under contract DTFR53-13-C-00032. DL acknowledges support from NSF grant DMS 1008900/1008907 and DMS 1005799.
1. Candès EJ, Donoho DL: New tight frames of curvelets and optimal representations of objects with $C^2$ singularities. Comm. Pure Appl. Math 2004, 57: 219-266. 10.1002/cpa.10116
2. Cunha A, Zhou J, Do M: The nonsubsampled contourlet transform: theory, design, and applications. IEEE Trans. Image Process 2006, 15(10):3089-3101.
3. Labate D, Lim W, Kutyniok G, Weiss G: Sparse multidimensional representation using shearlets. Wavelets XI (San Diego, CA, 2005), Volume SPIE Proc. 5914 2005, 254-262.
4. Kutyniok G, Labate D: Shearlets: Multiscale Analysis for Multivariate Data. Birkhäuser, Boston; 2012.
5. Starck JL, Murtagh F, Fadili JM: Sparse Image and Signal Processing: Wavelets, Curvelets, Morphological Diversity. Cambridge University Press, Cambridge; 2010.
6. Guo K, Labate D: The construction of smooth Parseval frames of shearlets. Math. Model. Nat. Phenom 2013, 8: 82-105. 10.1051/mmnp/20138106
7. Guo K, Labate D: Optimally sparse multidimensional representation using shearlets. SIAM J. Math. Anal 2007, 39: 298-318.
8. Guo K, Labate D: Optimally sparse representations of 3D data with $C^2$ surface singularities using Parseval frames of shearlets. SIAM J. Math. Anal 2012, 44: 851-886. 10.1137/100813397
9. Easley GR, Labate D, Lim W: Sparse directional image representations using the discrete shearlet transform. Appl. Comput. Harmon. Anal 2008, 25: 25-46. 10.1016/j.acha.2007.09.003
10. Kutyniok G, Shahram M, Zhuang X: ShearLab: a rational design of a digital parabolic scaling algorithm. SIAM J. Imaging Sci 2012, 5(4):1291-1332. 10.1137/110854497
11. Guo K, Labate D: Representation of Fourier integral operators using shearlets. J. Fourier Anal. Appl 2008, 14: 327-371. 10.1007/s00041-008-9018-0
12. Colonna F, Easley GR, Guo K, Labate D: Radon transform inversion using the shearlet representation. Appl. Comput. Harmon. Anal 2010, 29(2):232-250. 10.1016/j.acha.2009.10.005
13. Vandeghinste B, Goossens B, Van Holen R, Vanhove C, Pizurica A, Vandenberghe S, Staelens S: Iterative CT reconstruction using shearlet-based regularization. IEEE Trans. Nuclear Sci 2013, 60(5):3305-3317.
14. Guo K, Labate D: Characterization and analysis of edges using the continuous shearlet transform. SIAM J. Imaging Sci 2009, 2: 959-986. 10.1137/080741537
15. Guo K, Labate D: Analysis and detection of surface discontinuities using the 3D continuous shearlet transform. Appl. Comput. Harmon. Anal 2011, 30: 231-242. 10.1016/j.acha.2010.08.004
16. Yi S, Labate D, Easley GR, Krim H: A shearlet approach to edge analysis and detection. IEEE Trans. Image Process 2009, 18(5):929-941.
17. Kutyniok G, Lim W: Image separation using wavelets and shearlets. In Curves and Surfaces (Avignon, France, 2010), 416–430, Lecture Notes in Computer Science 6920. Springer Berlin Heidelberg; 2011.
18. Easley G, Labate D, Negi PS: 3D data denoising using combined sparse dictionaries. Math. Model. Nat. Phenom 2013, 8: 60-74.
19. Patel VM, Easley G, Healy D: Shearlet-based deconvolution. IEEE Trans. Image Process 2009, 18: 2673-2685.
20. Negi P, Labate D: 3-D discrete shearlet transform and video processing. IEEE Trans. Image Process 2012, 21: 2944-2954.
21. Easley G, Labate D, Patel VM: Directional multiscale processing of images using wavelets with composite dilations. J. Math. Imaging Vis 2014, 48(1):13-43. 10.1007/s10851-012-0385-4
22. Candès EJ, Demanet L, Donoho D, Ying L: Fast discrete curvelet transforms. SIAM Multiscale Model. Simul 2006, 5(3):861-899. 10.1137/05064182X
23. Burt PJ, Adelson EH: The Laplacian pyramid as a compact image code. IEEE Trans. Commun 1983, 31(4):532-540. 10.1109/TCOM.1983.1095851
24. Donoho D, Johnstone I: Adapting to unknown smoothness via wavelet shrinkage. J. Am. Stat. Assoc 1995, 90: 1200-1224. 10.1080/01621459.1995.10476626
25. Chang SG, Yu B, Vetterli M: Adaptive wavelet thresholding for image denoising and compression. IEEE Trans. Image Process 2000, 9(9):1532-1546. 10.1109/83.862633
26. Subirats P, Dumoulin J, Legeay V, Barba D: Automation of pavement surface crack detection using the continuous wavelet transform. IEEE International Conference on Image Processing, Atlanta, GA 3037.
27. Chambon S, Moliard J: Automatic road pavement assessment with image processing: review and comparison. Int. J. Geophys. article ID 989354 2011, 20 pages. doi:10.1155/2011/989354
28. Ma C, Zhao C, Hou Y: Pavement distress detection based on nonsubsampled contourlet transform. Int. Conf. Comput. Sci. Softw. Eng. 2008, 1: 28-31.
29. Starck JL, Elad M, Donoho D: Image decomposition via the combination of sparse representation and a variational approach. IEEE Trans. Image Process 2005, 14(10):1570-1582.
30. Bobin J, Starck JL, Fadili M, Moudden Y, Donoho D: Morphological component analysis: an adaptive thresholding strategy. IEEE Trans. Image Process 2007, 16(11):2675-2681.
31. Canny J: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell 1986, 8(6):679-698.
32. Oliveira H, Correia P: Automatic road crack detection and characterization. IEEE Trans. Intell. Transport. Syst. 2013, 14: 155-168.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.