Convolution of large 3D images on GPU and its decomposition
- Pavel Karas^{1} and
- David Svoboda^{1}
https://doi.org/10.1186/1687-6180-2011-120
© Karas and Svoboda; licensee Springer. 2011
Received: 2 September 2010
Accepted: 28 November 2011
Published: 28 November 2011
Abstract
In this article, we propose a method for computing the convolution of large 3D images. The convolution is performed in the frequency domain using the convolution theorem. The algorithm is accelerated on a graphics card by means of the CUDA parallel computing model. The convolution is decomposed in the frequency domain using the decimation in frequency algorithm. We pay attention to keeping our approach efficient in terms of both time and memory consumption, and also in terms of memory transfers between CPU and GPU, which have a significant influence on the overall computational time. We also study the implementation on multiple GPUs and compare the results between the multi-GPU and multi-CPU implementations.
1 Introduction
The convolution of two signals can be employed for blurring images, deconvolving blurred images, edge detection, noise suppression, and in many other applications [1–3]. For example, cross-correlation and phase-correlation (both important methods of image registration) are very similar to convolution, since they have essentially the same mathematical form except that convolution involves reversing one of the signals [[1], p. 211]. The convolution of large signals is also used for simulating image formation in optical systems such as light microscopes [4]. Convolution is a common method in image processing; however, its computation is very time-consuming for large images. Graphics cards can be employed to accelerate the computation. Some of the algorithms can be found in an NVIDIA whitepaper [5]. Here, a so-called naïve convolution and a convolution with a separable kernel are described, along with their optimized GPU implementations in CUDA. These algorithms can be used in many applications, such as fast Canny edge detection [6, 7]. However, these approaches are not suitable for general large kernels.
In optical microscopy, we often deal with both large input signals and large kernels. Thus, in this article, we discuss the time complexity of convolution with an emphasis on large 3D images. We recall the convolution theorem and its positive effect on the time complexity. For example, for a signal of 1000 × 1000 × 100 voxels and a filter kernel of 100 × 100 × 100 voxels, which is common in optical microscopy, the calculation using the convolution theorem takes tens of seconds, instead of several days, on the most recent CPU architectures. Even better times can be obtained using graphics cards. A GPU-based convolution using the convolution theorem is described in [8]. As indicated by the authors, the FFT-based approach is suitable for large non-separable kernels.
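The magnitude of this saving can be estimated with a back-of-the-envelope operation count. The sketch below (plain Python; the constant of 3 FFTs and the unit cost per butterfly stage are our simplifying assumptions, not measured figures) compares the two approaches for the image and kernel sizes quoted above:

```python
import math

# Sizes from the example in the text.
signal_voxels = 1000 * 1000 * 100   # 1e8 voxels
kernel_voxels = 100 * 100 * 100     # 1e6 voxels

# Naive spatial convolution: one multiply-add per signal voxel per kernel voxel.
direct_ops = signal_voxels * kernel_voxels

# FFT-based: pad both volumes to the output size, then two forward FFTs,
# a pointwise product, and one inverse FFT, each O(M log M).
padded = (1000 + 100 - 1) * (1000 + 100 - 1) * (100 + 100 - 1)
fft_ops = 3 * padded * math.log2(padded) + padded

print(f"direct/FFT operation ratio: ~{direct_ops / fft_ops:.0f}x")
```

Even with generous constants, the FFT route wins by several orders of magnitude, which is why days shrink to tens of seconds.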
The essential part of the algorithm described above is the Fourier transform. The first attempt to compute the fast Fourier transform on graphics hardware was described in [9]. The implementation was written in the OpenGL and Cg shading languages and tested in a convolution application. A comparison of convolution in the spatial and frequency domains (for a description of both approaches refer to the following section) was made in [10]. A significant speedup was achieved by implementing the algorithm on GPU, using HLSL and DirectX. Recently, the NVIDIA^{®} CUDA programming model [11] along with the CUFFT library [12] has offered a framework for implementing convolution in a straightforward manner. Besides CUFFT, other FFT libraries for GPU have been developed, such as [13] and [14]. The OpenCL framework allows implementing methods on heterogeneous platforms consisting of CPUs, GPUs, and other architectures [15, 16].
The bottleneck of GPU acceleration is that graphics hardware offers a rather small amount of memory. This poses a significant problem for attempts to accelerate the convolution of huge images on GPU. Due to convolution properties, the convolved image can be divided into arbitrarily small parts, but all the sub-images have to be extended with neighboring pixels by at least the size of the filter kernel (point spread function, PSF), as described in [17]. Thus, a lot of redundant computation needs to be performed, proportional to the PSF size. This approach was successfully used in [18] to compute spatially variant deconvolution of Hubble Space Telescope images. We propose a new approach, which is optimal in terms of both the number of per-voxel computations and the number of memory transfers.
1.1 Convolution
The discrete convolution of two 1D signals f and g is defined as

$\left(f\ast g\right)\left(m\right)={\sum }_{k}f\left(m-k\right)\phantom{\rule{0.1em}{0ex}}g\left(k\right),$     (1)

where M_{ f } and M_{ g } are the numbers of samples of f and g, respectively. The convolution then produces a signal of size M = M_{ f } + M_{ g } - 1. For the list of conventions used in the article, refer to Appendix A.
In practice, one of the signals has usually significantly larger size and is called simply the signal whereas the other is of a smaller size and is called the filter kernel. For instance, a kernel can be given by a simple function, such as Gaussian, or by so-called PSF, a function that describes the impulse response of an imaging system to a point source [[3], pp. 205-207].
Computing a convolution can be very time-consuming. From Equation (1), it can easily be deduced that the time complexity of the problem is O(M_{ f } M_{ g }). If the kernel is small (tens of samples), the convolution can be computed in a reasonable time, even for large signals. However, in some applications, such as optical microscopy, one deals with signals and kernels of more than a million samples each. In this case, the computation would take several days, which is unacceptable, and a different solution needs to be applied.
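For reference, a literal (unoptimized) implementation of Equation (1) in plain Python looks as follows; the two nested loops make the O(M_{ f } M_{ g }) cost explicit:

```python
def conv_direct(f, g):
    """Direct 1D convolution per Equation (1): O(Mf * Mg) multiply-adds."""
    Mf, Mg = len(f), len(g)
    M = Mf + Mg - 1               # size of the output signal
    out = [0.0] * M
    for m in range(M):            # one pass over every output sample...
        for k in range(Mg):       # ...times one pass over the kernel
            if 0 <= m - k < Mf:   # zero outside the signal's support
                out[m] += f[m - k] * g[k]
    return out

print(conv_direct([1, 2, 3], [1, 1]))  # [1.0, 3.0, 5.0, 3.0]
```

For a megavoxel signal and a megavoxel kernel, the inner body runs on the order of 10^{12} times, which is exactly the infeasible case described above.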
1.2 Convolution theorem
The convolution theorem states that

$\mathcal{F}\left[f\ast g\right]=\mathcal{F}\left[f\right]\cdot \mathcal{F}\left[g\right],$

where $\mathcal{F}$ denotes a Fourier transform. Therefore, instead of computing the convolution according to the definition, Fourier transforms can be applied to both the signal and the kernel, then their pointwise product is computed, and finally an inverse Fourier transform is applied to obtain the result [19].
Keep in mind that before the computation, both the signal and the kernel need to be padded to the same size (that is, to the size of the resulting convolved signal) to avoid problems with boundary values. For example, in the 1D case, both the signal and the kernel are padded to M = M_{ f } + M_{ g } - 1 samples. There are several ways to pad the data; usually, they are padded with zeros. The position of the padding influences the position of the resulting signal; refer to [19] for more details. According to the convolution theorem, the asymptotic time complexity of a convolution is that of the FFT [19, 21], i.e., O(M log M).^{a}
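The whole recipe (pad with zeros, transform, multiply pointwise, transform back) can be sketched in a few lines of Python/NumPy; this is a CPU-side illustration of the scheme, not the GPU code discussed later:

```python
import numpy as np

def conv_fft(f, g):
    """Linear convolution via the convolution theorem.
    Both inputs are zero-padded to M = Mf + Mg - 1 before transforming."""
    M = len(f) + len(g) - 1
    F = np.fft.fft(f, n=M)        # fft(x, n=M) zero-pads x to length M
    G = np.fft.fft(g, n=M)
    return np.fft.ifft(F * G).real

f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.0, -1.0])
# Agrees with the direct definition of convolution:
assert np.allclose(conv_fft(f, g), np.convolve(f, g))
```

Without the padding, the FFT would compute a circular convolution and the boundary samples would wrap around, which is precisely the problem the padding avoids.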
1.3 Memory complexity
As we mentioned in the previous section, to be able to apply the convolution theorem, both the signal and the kernel have to be padded before the computation. Hence, αM bytes of memory are required to store the signal and the same amount for the kernel, thus 2αM bytes in total, where α is the size of the data type used, e.g., typically 4 bytes for single precision.
The Fourier transform of real input data is Hermitian-symmetric:

$F\left(M-\mu \right)={F}^{\ast}\left(\mu \right),$

where * denotes a complex conjugate. This also means that Im [F(0)] = 0. In the n-D case, an analogous property holds. Therefore, real data can also be processed in-place, except that they need to be padded in the last dimension^{b} to size ${M}^{\prime}=2\left(\u230a\frac{M}{2}\u230b+1\right)$. Thus, only half of the Fourier domain needs to be stored in memory [19, 22].
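This symmetry is easy to verify numerically. The NumPy sketch below checks the conjugate symmetry and the reduced storage of the half-spectrum (⌊M/2⌋ + 1 complex bins, i.e., M' real values, matching the padding described above):

```python
import numpy as np

M = 10
x = np.random.rand(M)           # real-valued input signal
X = np.fft.fft(x)               # full complex spectrum

# Hermitian symmetry: F(M - mu) = F(mu)* for real input, hence Im[F(0)] = 0.
assert abs(X[0].imag) < 1e-12
assert np.allclose(X[M - 1], np.conj(X[1]))

# rfft stores only the non-redundant half: floor(M/2) + 1 complex bins,
# i.e., M' = 2 * (floor(M/2) + 1) real values.
Xh = np.fft.rfft(x)
assert len(Xh) == M // 2 + 1
assert np.allclose(np.fft.irfft(Xh, n=M), x)   # lossless round trip
```

This is the same storage trick the r2c FFT variants exploit in the benchmarks later in the article.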
2 Method
2.1 GPU accelerated convolution
In this section, we describe a basic GPU-based implementation of convolution using CUDA. CUDA is a parallel computing model introduced by NVIDIA. It provides C language extensions for implementing pieces of code on GPU. The so-called CUDA Toolkit also includes the CUBLAS and CUFFT libraries, providing algorithms for linear algebra and the fast Fourier transform, respectively.
The pointwise multiplication is done in a simple loop, so its parallelization is straightforward. GPU threads are mapped to individual image pixels, naturally providing coalesced memory accesses and massive parallelism. This approach is thus simple and optimal, limited only by the global memory bandwidth. For basic information and examples, refer to the CUDA Programming Guide and the CUDA SDK samples [11, 23].
The parallel Fourier transform computations are provided by the CUFFT library [12], as it is presently, to the best of our knowledge, the best library for FFT computation on GPU. In this part of the algorithm, all the optimization issues (e.g., memory coalescing, shared or texture memory usage, etc.) are the concern of the CUFFT library.
When dealing with both large 3D images and large kernels, it is impossible to compute the convolution at once on GPU, since recent graphics cards typically have only about 1 GB of memory. This is also due to CUFFT specifics: if an image is too large to be stored in shared memory, the FFT is performed out-of-place, so even more memory is required [12]. In this section, we propose an algorithm for the decomposition of convolution. First, we describe the decimation in frequency (DIF) algorithm. This approach is not new; it was used, e.g., in [10], to provide the whole FFT computation. However, our contribution is to employ the DIF method to decompose the data, i.e., to prepare it for convolution so that it can be processed in parts.
It should be noted that there are several other approaches to decomposing the FFT problem; however, they are suboptimal in terms of the number of per-pixel operations and data transfers. First, the decimation in time (DIT) method can be used instead of the DIF. However, unlike the DIF, this approach does not provide complete separability of the resulting sub-problems, which leads to a lot of redundant data transfers. Second, in the spatial domain, the convolved image can be divided into small parts. This method will be referred to as tiling. However, all the sub-images have to be extended with neighboring pixels by at least the size of the filter kernel [17]. Thus, a lot of redundant computation needs to be performed, proportional to the PSF size.
Methods for decomposition of the convolution problem and their requirements
Method | Number of operations | Number of memory transactions |
---|---|---|
DIF | $c\left({M}_{f}+{M}_{g}\right)\mathrm{log}\left({M}_{f}+{M}_{g}\right)+\left({M}_{f}+{M}_{g}\right)$ | $3\left({M}_{f}+{M}_{g}\right)$ |
DIT | $c\left({M}_{f}+{M}_{g}\right)\mathrm{log}\left({M}_{f}+{M}_{g}\right)+2\left({M}_{f}+{M}_{g}\right)$ | $\left(2+\mathcal{P}\right)\left({M}_{f}+{M}_{g}\right)$ |
Tiling | $c\left({M}_{f}+\mathcal{P}{M}_{g}\right)\mathrm{log}\left(\frac{{M}_{f}}{\mathcal{P}}+{M}_{g}\right)+\left({M}_{f}+2\mathcal{P}{M}_{g}\right)$ | $2{M}_{f}+\left(\mathcal{P}+1\right){M}_{g}$ |
2.2 Decimation in frequency
The decomposition can be employed in the n-D case. Since the Fourier transform is separable, an n-D transform can be expressed as a sequence of 1D transforms. Hence, the decomposition can be applied in any dimension. For 3D signals, a decomposition in the z dimension--or the first coordinate in a row-major order--should be applied, so that the $\mathcal{P}$ separated parts are obtained in $\mathcal{P}$ contiguous blocks of memory. Unlike interlaced data, contiguous blocks are optimal for data transfers.
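A minimal radix-2 DIF split, written here in Python/NumPy with the e^{-2πi/M} transform convention NumPy uses (the article's W_{ M } is defined with the opposite sign), shows the key property we exploit: after the split, the two halves can be transformed, and hence convolved, completely independently:

```python
import numpy as np

def decompose2(x):
    """One radix-2 DIF step: split x into two half-length sequences whose
    independent FFTs yield the even and odd frequency bins of fft(x)."""
    M = len(x)
    a, b = x[:M // 2], x[M // 2:]
    twiddle = np.exp(-2j * np.pi * np.arange(M // 2) / M)
    return a + b, (a - b) * twiddle

x = np.arange(16, dtype=complex)
even_in, odd_in = decompose2(x)
X = np.fft.fft(x)
assert np.allclose(np.fft.fft(even_in), X[0::2])   # X(2k)
assert np.allclose(np.fft.fft(odd_in), X[1::2])    # X(2k + 1)
```

Because the pointwise product of the convolution theorem acts bin by bin, each part can be multiplied with the correspondingly decomposed kernel on its own (e.g., on a different GPU, or sequentially on one GPU), and a mirror-image compose step then inverts the split.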
2.3 Implementation
In this section, we introduce a modified algorithm for computing a convolution of large 3D images on GPU. The algorithm uses the DIF method for decomposing the data. If the data are complex, then the decomposition can be performed in-place, whereas if the data are real, then the decomposition needs to be performed out-of-place. The real data could also be transformed into a complex form, but that is rather inefficient.
For the out-of-place decomposition of the data, 2αKLM bytes are required for storing the signal and the kernel, and 4αKLM bytes for the decomposed data, since complex numbers take up twice the size of real numbers. Thus, the overall memory space required is 6αKLM bytes.
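As a concrete check of the 6αKLM figure, the arithmetic below uses single precision (α = 4) and a padded volume of 256 × 1024 × 1024 voxels (the size of the largest GPU-convolved image in our tables):

```python
alpha = 4                      # bytes per single-precision real
K, L, M = 256, 1024, 1024      # padded output size (z, y, x)

real_inputs = 2 * alpha * K * L * M      # signal + kernel, real-valued
complex_parts = 4 * alpha * K * L * M    # decomposed data, complex (2 reals each)
total = real_inputs + complex_parts      # = 6 * alpha * K * L * M

print(f"host memory needed: {total / 2**30:.1f} GiB")  # 6.0 GiB
```

This also illustrates why such a volume cannot fit on a card with well under 1 GB of memory without decomposition.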
Our implementation offers (de)compositions into 2, 4, 8, or 16 parts. These are provided by procedures named Decompose2, Decompose4, Decompose8, and Decompose16 (and analogously Compose2, etc.). The Decompose2 and Compose2 procedures implement Equations (5) and (6), respectively. The Decompose4 and Compose4 procedures may either recursively call Decompose2 and Compose2, respectively, or directly split the data into four parts and reconstruct them back from the four parts. We have opted for the latter solution, since it requires a smaller number of operations per voxel. Refer to Appendix B for more details.
where ∘ denotes a composition of operations.
2.4 Multi-GPU implementation
where t_{ p } is the time required for padding the data, t_{ d } for the decomposition, t_{ a } for allocating memory and setting up FFT plans on GPU, t_{h→d} for data transfers from CPU to GPU (host to device), t_{conv} for the convolution itself, t_{d→h} for data transfers from GPU to CPU (device to host), and finally t_{ c } for the composition. Although most of the time is spent in the convolution phase, the other phases cannot be neglected. Note that the t_{ a } time needed for preparing the GPU can be overlapped with the t_{ p } and t_{ d } times needed for preparing the data on CPU.
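The benefit of overlapping the GPU preparation with the CPU-side padding and decomposition can be modeled with a simple max(); the times below are hypothetical placeholders, not measured values:

```python
# Phase times in seconds (illustrative only).
t_p, t_d, t_a = 0.8, 0.4, 0.9            # CPU padding, decomposition; GPU setup
t_h2d, t_conv, t_d2h, t_c = 0.5, 2.6, 0.5, 0.4

# Fully serial execution: every phase waits for the previous one.
serial = t_p + t_d + t_a + t_h2d + t_conv + t_d2h + t_c

# GPU setup (t_a) runs concurrently with CPU preparation (t_p + t_d).
overlap = max(t_a, t_p + t_d) + t_h2d + t_conv + t_d2h + t_c

print(f"serial {serial:.1f} s vs overlapped {overlap:.1f} s")
```

Whichever of t_a and t_p + t_d is shorter is hidden entirely, which matters in the multi-GPU measurements discussed later.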
3 Results
We have performed several benchmarks on a machine with an Intel^{®} Core™ i7 950 processor with 6 GB DDR3 RAM and an NVIDIA^{®} GeForce GTX 295 graphics card with 2 × 896 MB GDDR3 RAM. The CPU implementation uses the FFTW library, while the GPU implementation uses our algorithm. Both the decomposition and the composition are performed on CPU; therefore, we have written them in C and further optimized them with SSE intrinsics for better performance.
3.1 Single GPU implementation
The single-GPU implementation was compared with a single-thread CPU implementation. For the CPU implementation, two approaches are distinguished: one using complex-to-complex FFT and inverse FFT (the c2c implementation) and one using real-to-complex FFT and complex-to-real inverse FFT (the r2c implementation). The former can generally be used for convolving complex data (much like our approach); the latter can be used on real data only, but is twice as efficient (in terms of both memory requirements and speed).
Convolution of specifically sized images
Image | x × y × z | [Mpx] | $\mathcal{P}$ | GPU^{c2c} (s) | CPU^{c2c} (s) | CPU^{r2c} (s) |
---|---|---|---|---|---|---|
1 | 256 × 256 × 64 | 4.2 | 1 | 0.3 | 0.6 | 0.2 |
2 | 512 × 256 × 64 | 8.4 | 1 | 0.4 | 1.4 | 0.4 |
3 | 512 × 512 × 64 | 16.8 | 1 | 0.5 | 3.4 | 0.9 |
4 | 512 × 512 × 128 | 33.6 | 2 | 0.6 | 13.2 | 2.6 |
5 | 1024 × 512 × 128 | 67.1 | 4 | 1.4 | 29.3 | 5.4 |
6 | 1024 × 1024 × 128 | 134.2 | 8 | 2.8 | 54.3 | 12.9 |
7 | 1024 × 1024 × 256 | 268.4 | 16 | 6.2 | 104.3 | 24.0 |
8 | 2048 × 1024 × 256 | 536.9 | -- | -- | -- | 52.2 |
Convolution of arbitrarily sized images
Image | x × y × z | [Mpx] | $\mathcal{P}$ | GPU^{c2c} (s) | CPU^{c2c} (s) | CPU^{r2c} (s) |
---|---|---|---|---|---|---|
9 | 257 × 257 × 64 | 4.2 | 1 | 0.6 | 1.4 | 1.1 |
10 | 513 × 257 × 64 | 8.4 | 1 | 0.8 | 2.4 | 1.2 |
11 | 513 × 513 × 64 | 16.8 | 1 | 0.8 | 3.7 | 1.8 |
12 | 513 × 513 × 128 | 33.7 | 2 | 1.4 | 8.0 | 3.8 |
13 | 1025 × 513 × 128 | 67.3 | 4 | 2.6 | 19.8 | 14.7 |
14 | 1025 × 1025 × 128 | 134.5 | 8 | 5.6 | 52.8 | 34.8 |
15 | 1025 × 1025 × 256 | 269.0 | 16 | 11.5 | 118.9 | 70.5 |
16 | 2049 × 1025 × 256 | 537.7 | - | - | - | 189.0 |
In all tests, the decomposition parameter $\mathcal{P}$ was set to the least possible value such that the resulting sub-images fitted into GPU memory. For example, if the image was small enough, the decomposition was not performed at all; the larger the image, the more parts it was decomposed into.
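The selection rule can be sketched as follows. Note that the 4αKLM memory estimate for the complex signal and kernel parts is our simplification of the requirements discussed in Section 2.3, not the authors' exact bookkeeping, so the returned P may differ from the values in the tables:

```python
def smallest_p(k, l, m, gpu_bytes, alpha=4):
    """Smallest supported P (1, 2, 4, 8, 16) such that one complex
    sub-image of the signal plus the matching kernel part fits in
    GPU memory, under a simplified 4*alpha*k*l*m memory model."""
    need = 4 * alpha * k * l * m          # complex signal + kernel, whole volume
    for p in (1, 2, 4, 8, 16):
        if need / p <= gpu_bytes:
            return p
    raise ValueError("image too large even for P = 16")

# e.g., a padded 1024 x 1024 x 256 volume on an 896 MB card
print(smallest_p(256, 1024, 1024, 896 * 2**20))
```

A small image returns P = 1 (no decomposition), reproducing the behavior described above.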
The results show that the single-GPU implementation is more than ten times faster than the single-thread CPU c2c implementation and about four times faster than the single-thread CPU r2c implementation, provided the images are large enough. The r2c approach is the only one that can process images of more than 500 megavoxels; this is due to the limitations of the CPU memory.
We can also see that, in the case of specifically sized images, the difference between the time values is small for images smaller than 10 megavoxels. Nevertheless, in practice, arbitrarily sized images are more frequent. In this case, the GPU implementation is faster even on smaller images.
3.2 Multi-GPU implementation
Multi-GPU convolution of specifically sized images
Image | 2GPU^{c2c} (s) | 2CPU^{c2c} (s) | 2CPU^{r2c} (s) | 4CPU^{c2c} (s) | 4CPU^{r2c} (s) |
---|---|---|---|---|---|
1 | 0.5 | 0.3 | 0.1 | 0.3 | 0.1 |
2 | 0.5 | 0.7 | 0.2 | 0.5 | 0.3 |
3 | 0.7 | 1.7 | 0.5 | 1.1 | 0.6 |
4 | 0.9 | 6.3 | 1.3 | 3.5 | 0.9 |
5 | 1.3 | 14.5 | 2.8 | 8.9 | 1.8 |
6 | 2.5 | 30.7 | 6.6 | 19.9 | 3.7 |
7 | 4.8 | 53.4 | 12.3 | 31.4 | 6.8 |
8 | - | - | 26.7 | - | 14.5 |
Multi-GPU convolution of arbitrarily sized images
Image | 2GPU^{c2c} (s) | 2CPU^{c2c} (s) | 2CPU^{r2c} (s) | 4CPU^{c2c} (s) | 4CPU^{r2c} (s) |
---|---|---|---|---|---|
9 | 0.6 | 0.7 | 0.6 | 0.4 | 0.4 |
10 | 0.7 | 1.2 | 0.6 | 0.8 | 0.5 |
11 | 0.7 | 1.9 | 0.9 | 1.2 | 0.5 |
12 | 1.3 | 4.1 | 1.8 | 2.3 | 1.1 |
13 | 2.0 | 10.2 | 7.2 | 5.6 | 3.8 |
14 | 3.8 | 27.6 | 17.9 | 14.9 | 9.2 |
15 | 7.3 | 62.0 | 36.0 | 32.0 | 18.4 |
16 | - | - | 95.7 | - | 48.6 |
The results reveal several facts. Again, the GPU implementation is up to eight times faster than the CPU c2c implementation (depending on image sizes) and up to three times faster than the CPU r2c implementation when comparing two GPUs with two CPU cores. A test with four CPU cores was also made, and the GPU implementation still performed two times faster when the images were large enough. In general, the GPU implementation becomes advantageous on images larger than 50 megavoxels.
Note that in some cases a single GPU performed better than two GPUs (especially in the case of specifically sized images). We shall see in the following section that the algorithm spent too much time in the preliminary phases (such as data decomposition or memory allocation) at the expense of the convolution itself.
3.3 Time analysis
The convolution itself consumes slightly less than half of the overall time. The rest is spent on the other phases, such as the preparation phase (memory allocation and setting up FFT plans [12] on GPU; data padding and data decomposition on CPU), the data transfers between CPU and GPU, and finally the data composition. As the preparation phases on GPU and CPU are independent, they can be conducted simultaneously. The results can be compared with Equation 8. Here, the "Pad", "Decompose", "Allocate", "CPU>GPU", "Convolution", "GPU>CPU", and "Compose" times correspond to t_{ p }, t_{ d }, t_{ a }, t_{h→d}, t_{conv}, t_{d→h}, and t_{ c }, respectively.
In the case of the multi-GPU implementation, the time t_{ a } spent in the preparation phase on GPU doubles, see Figure 9(b). This unpleasant behavior is also observed on a machine with CUDA 4.0 with extended support for multi-GPU processing [25]. To the best of our knowledge, it is not mentioned in official NVIDIA documents. Fortunately, the time t_{ p } needed for padding the data on CPU is still larger; thus, the overhead induced by t_{ a } is hidden by t_{ p }. The times compared with Equation 9 give us a good idea of how the usage of multiple GPU cards can speed up the computation. Since a significant portion of time is spent in the sequential phases of the algorithm, the overall speedup is limited, in accordance with the famous Amdahl's law [26].
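The limit can be quantified with Amdahl's law: if only a fraction p of the run time (roughly the convolution phase here) scales with the number of GPUs, the overall speedup is bounded. The 0.5 fraction below is an illustrative figure matching the time breakdown above, not an exact measurement:

```python
def amdahl(parallel_fraction, n):
    """Amdahl's law: overall speedup when a fraction of the run time
    benefits from n-way parallelism and the rest stays sequential."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

# If roughly half the time is spent in the parallelizable convolution phase,
# two GPUs give at most ~1.33x overall speedup.
print(f"{amdahl(0.5, 2):.2f}x")
```

This is consistent with the modest multi-GPU gains in the tables above, and with the cases where two GPUs were no faster than one.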
We may draw the conclusion that it is reasonable to choose the smallest possible value of $\mathcal{P}$ for a given image size. The exact results depend on the particular dataset. Sometimes, one may prefer using the (De)Compose4 function instead of the (De)Compose2 function--and therefore setting $\mathcal{P}$ to 4 instead of 2, or to 16 instead of 8--since these two functions are equally efficient and the Fourier transforms may be more efficient on the smaller parts of the decomposed image than on the larger parts.
3.4 Precision analysis
where C_{ i } is the voxel value at position i in the image computed by the CPU, G_{ i } is the voxel value at position i in the image computed by the GPU, and Ω is the set of all coordinates in the image.
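One natural way to implement such a per-voxel comparison is a maximum difference normalized by the dynamic range of the CPU result; the exact normalization used for δ in our tables may differ, so treat this as an illustrative variant:

```python
import numpy as np

def delta(C, G):
    """Maximum per-voxel CPU/GPU difference, normalized by the dynamic
    range of the CPU result (an illustrative error measure)."""
    C = np.asarray(C, dtype=float)
    G = np.asarray(G, dtype=float)
    return np.max(np.abs(C - G)) / (C.max() - C.min())

C = np.array([0.0, 1.0, 2.0, 4.0])          # "reference" CPU result
G = C + np.array([0.0, 1e-3, -2e-3, 0.0])   # GPU result with small errors
print(delta(C, G))
```

A dimensionless measure of this kind makes results comparable across images with very different intensity ranges.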
Precision analysis of specifically sized images
Values of δ are given in units of 10^{-3}; the $\mathcal{P}=1$ to $\mathcal{P}=16$ columns are single precision, the last column is double precision.

Image | x × y × z | [Mpx] | $\mathcal{P}=1$ | $\mathcal{P}=2$ | $\mathcal{P}=4$ | $\mathcal{P}=8$ | $\mathcal{P}=16$ | $\mathcal{P}=1,\dots ,16$ |
---|---|---|---|---|---|---|---|---|
1 | 256 × 256 × 64 | 4.2 | 0.12 | 0.11 | 0.12 | 0.11 | 0.12 | 0.004 |
2 | 512 × 256 × 64 | 8.4 | 0.16 | 0.15 | 0.15 | 0.14 | 0.14 | 0.006 |
3 | 512 × 512 × 64 | 16.8 | 0.18 | 0.17 | 0.20 | 0.18 | 0.20 | 0.007 |
4 | 512 × 512 × 128 | 33.6 | - | 0.22 | 0.22 | 0.24 | 0.23 | 0.015 |
5 | 1024 × 512 × 128 | 67.1 | - | - | 0.26 | 0.25 | 0.25 | 0.019 |
6 | 1024 × 1024 × 128 | 134.2 | - | - | - | 0.28 | 0.29 | 0.018 |
7 | 1024 × 1024 × 256 | 268.4 | - | - | - | - | 0.26 | - |
8 | 2048 × 1024 × 256 | 536.9 | - | - | - | - | - | - |
Precision analysis of arbitrarily sized images
Single-precision values of δ (the $\mathcal{P}=1$ to $\mathcal{P}=16$ columns) are given in units of 10^{0}; double-precision values (last column) in units of 10^{-3}.

Image | x × y × z | [Mpx] | $\mathcal{P}=1$ | $\mathcal{P}=2$ | $\mathcal{P}=4$ | $\mathcal{P}=8$ | $\mathcal{P}=16$ | $\mathcal{P}=1,\dots ,16$ |
---|---|---|---|---|---|---|---|---|
9 | 257 × 257 × 64 | 4.2 | 0.02 | 0.56 | 1.12 | 1.90 | 3.24 | 0.018 |
10 | 513 × 257 × 64 | 8.4 | 0.01 | 0.94 | 1.69 | 2.65 | 4.15 | 0.003 |
11 | 513 × 513 × 64 | 16.8 | < 0.01 | 0.53 | 0.90 | 1.35 | 2.15 | 0.003 |
12 | 513 × 513 × 128 | 33.7 | - | 0.54 | 0.87 | 1.04 | 1.46 | 0.011 |
13 | 1025 × 513 × 128 | 67.3 | - | - | 1.19 | 1.38 | 1.67 | 0.004 |
14 | 1025 × 1025 × 128 | 134.5 | - | - | - | 0.80 | 0.91 | 0.010 |
15 | 1025 × 1025 × 256 | 269.0 | - | - | - | - | 0.61 | - |
16 | 2049 × 1025 × 256 | 537.7 | - | - | - | - | - | - |
We find that for images whose sizes can be factored into small primes, single precision is acceptable for most purposes. However, we are aware that in some applications single precision might not be enough. Recent NVIDIA graphics cards offer computing in double precision; however, the speed is much lower (1/8 of the single-precision speed). With the release of the Fermi architecture, computing in double precision on GPU will become feasible.
4 Conclusion
In this article, we have proposed a new method for convolving large images. We have taken advantage of the high computational power of GPUs and extended the algorithm with a decomposition. As a result, we are able not only to convolve images larger than the GPU memory, but also to employ multiple GPUs in parallel.
Our method is generally suitable for complex data. However, in the case of real data, it is rather inefficient: since the input data are represented by complex numbers with zero imaginary parts, rather than by real numbers, twice the effort is made to compute the result. On the other hand, the method can be optimized for real data in a few ways [22, 24, 27]. This is the subject of our future study.
The results show that it is reasonable to use our algorithm especially on very large images, where the speedup of the GPU implementation can be up to 5× for complex data and 2-3× for real data. We suggest decomposing images into the smallest number of parts possible, as this approach appears to be the most efficient.
We also studied the precision of the convolution in a practical application. The results revealed that for this application, computation in single precision is acceptable (and it will probably be so for many other applications). If single precision is not enough, it is also possible to compute in double precision. However, in this precision, recent graphics cards perform poorly. With the release of new architectures (e.g., NVIDIA Fermi), double precision will become feasible.
The proposed method can also be implemented in OpenCL and other languages besides CUDA. Besides graphics cards, other parallel architectures can be taken into account as well. As convolution is a key part of many deconvolution methods [28, 29], the application of the proposed algorithm in deconvolution will also be the subject of our future study.
Endnotes
^{a}In the 3D case, both signals are padded to size M = M_{ f } + M_{ g } - 1 in the x dimension, L = L_{ f } + L_{ g } - 1 in the y dimension, and K = K_{ f } + K_{ g } - 1 in the z dimension; the resulting complexity is O(KLM log(KLM)). ^{b}Supposing the data are stored in memory in a row-major order, the last dimension is the x dimension.
Appendix
A Conventions
We introduce the conventions used in the text:
f * g... convolution of signals f, g
$\mathcal{F}\left[f\right]$ ... Fourier transform of a signal f
${\mathcal{F}}^{-1}\left[F\right]$ ... inverse Fourier transform of a signal F
f(m) ... a signal is denoted by a lowercase letter with a Latin letter index
F(μ) ... a Fourier transform is denoted by an uppercase letter with a Greek letter index
${W}_{M}={\mathsf{\text{e}}}^{\mathsf{\text{i}}\frac{2\pi}{M}}\dots $ a base function of a Fourier transform
In 1D case we introduce
M_{ f }, M_{ g }... sizes of the signal and the kernel, respectively
M = M_{ f }+ M_{ g }- 1 ... size of the output convolved signal
In 3D case we introduce
K_{ f }, L_{ f }, M_{ f }... sizes of the signal in dimensions z, y, x, respectively
K_{ g }, L_{ g }, M_{ g }... sizes of the kernel in dimensions z, y, x, respectively
K, L, M ... sizes of the output convolved signal in dimensions z, y, x, respectively (K = K_{ f } + K_{ g } - 1, etc.)
B Decimation in frequency
Now let us introduce

$o\equiv n+M/4,\phantom{\rule{1em}{0ex}}p\equiv n+M/2,\phantom{\rule{1em}{0ex}}q\equiv n+3M/4,$
Declarations
Acknowledgements
This work was supported by the Ministry of Education of the Czech Republic (Projects No. MSM-0021622419, No. LC535, and No. 2B06052).
Authors’ Affiliations
References
- Gonzalez RC, Woods RE: Digital Image Processing. 2nd edition. Prentice-Hall; 2002.
- Pratt WK: Digital Image Processing. 3rd edition. John Wiley & Sons; 2001.
- Jähne B: Digital Image Processing. 6th edition. Springer; 2005.
- Svoboda D, Kozubek M, Stejskal S: Generation of digital phantoms of cell nuclei and simulation of image formation in 3D image cytometry. Cytometry Part A 2009, 75A(6):494-509. doi:10.1002/cyto.a.20714
- Podlozhnyuk V: Image convolution with CUDA. NVIDIA Corporation 2007. [http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_64_website/projects/convolutionSeparable/doc/convolutionSeparable.pdf]
- Luo Y, Duraiswami R: Canny edge detection on NVIDIA CUDA. Computer Vision and Pattern Recognition Workshops 2008, 1-8.
- Ogawa K, Ito Y, Nakano K: Efficient Canny edge detection using a GPU. International Conference on Natural Computation 2010, 279-280.
- Podlozhnyuk V: FFT-based 2D convolution. NVIDIA Corporation 2007. [http://developer.download.nvidia.com/compute/cuda/2_2/sdk/website/projects/convolutionFFT2D/doc/convolutionFFT2D.pdf]
- Moreland K, Angel E: The FFT on a GPU. HWWS '03: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware 2003, 112-119.
- Fialka O, Cadik M: FFT and convolution performance in image filtering on GPU. Tenth International Conference on Information Visualization (IV 2006) 2006, 609-614.
- CUDA™ SDK code samples 3.1. NVIDIA Corporation 2010. [http://developer.nvidia.com/cuda-toolkit-31-downloads]
- CUDA™ CUFFT Library 3.1. NVIDIA Corporation 2010. [http://developer.nvidia.com/cuda-toolkit-31-downloads]
- Nukada A, Ogata Y, Endo T, Matsuoka S: Bandwidth intensive 3-D FFT kernel for GPUs using CUDA. SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA: IEEE Press; 2008:1-11.
- Govindaraju NK, Lloyd B, Dotsenko Y, Smith B, Manferdelli J: High performance discrete Fourier transforms on graphics processors. SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA: IEEE Press; 2008:1-12.
- OpenCL. Khronos Group 2010. [http://www.khronos.org/opencl/]
- OpenCL 1.1 Reference Pages. Khronos Group 2010. [http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/]
- Trussell H, Hunt B: Image restoration of space variant blurs by sectioned methods. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '78) 1978, 3:196-198.
- Boden AF, Redding DC, Hanisch RJ, Mo J: Massively parallel spatially variant maximum-likelihood restoration of Hubble Space Telescope imagery. J Opt Soc Am A 1996, 13(7):1537-1545. doi:10.1364/JOSAA.13.001537 [http://josaa.osa.org/abstract.cfm?URI=josaa-13-7-1537]
- Bracewell RN: The Fourier Transform and Its Applications. 3rd edition. McGraw-Hill; 2000.
- Hanna JR, Rowland JH: Fourier Series, Transforms, and Boundary Value Problems. 2nd edition. John Wiley & Sons; 1990.
- Press WH, Teukolsky SA, Vetterling WT, Flannery BP: Numerical Recipes in C. 2nd edition, ch. 7. Cambridge University Press; 1992.
- Hey A: The FFT demystified. Engineering Productivity Tools Ltd., 21 Leaveden Road, Watford, Hertfordshire, UK. [http://www.engineeringproductivitytools.com/stuff/T0001/PT10.HTM]
- CUDA™ Programming Guide 3.1. NVIDIA Corporation 2010. [http://developer.nvidia.com/cuda-toolkit-31-downloads]
- Saidi A: Generalized FFT algorithm. IEEE International Conference on Communications (ICC 93), Geneva, Switzerland, May 23-26, 1993, 1-3:227-231.
- CUDA™ Toolkit 4.0. NVIDIA Corporation 2011. [http://developer.nvidia.com/cuda-toolkit-40]
- Amdahl GM: Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67). New York, NY, USA: ACM; 1967, 483-485. [http://doi.acm.org/10.1145/1465482.1465560]
- Brigham E: Fast Fourier Transform and Its Applications. 1st edition. Prentice-Hall; 1988.
- Verveer PJ: Computational and optical methods for improving resolution and signal quality in fluorescence microscopy. Ph.D. dissertation. Technische Universiteit Delft; 1998.
- Quammen CW, Feng D, Taylor RM II: Performance of 3D deconvolution algorithms on multi-core and many-core architectures. University of North Carolina at Chapel Hill, Department of Computer Science, Tech. Rep.; 2009.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.