Convolution of large 3D images on GPU and its decomposition

In this article, we propose a method for computing convolution of large 3D images. The convolution is performed in a frequency domain using a convolution theorem. The algorithm is accelerated on a graphic card by means of the CUDA parallel computing model. Convolution is decomposed in a frequency domain using the decimation in frequency algorithm. We pay attention to keeping our approach efficient in terms of both time and memory consumption and also in terms of memory transfers between CPU and GPU which have a significant inuence on overall computational time. We also study the implementation on multiple GPUs and compare the results between the multi-GPU and multi-CPU implementations.


Introduction
The convolution of two signals can be employed for blurring images, deconvolving blurred images, edge detection, noise suppression, and in many other applications [1][2][3]. For example, a cross-correlation and a phase-correlation (which are important methods of image registration) are both very similar to a convolution since they have basically the same mathematical meaning except that a convolution involves reversing a signal [ [1], p.211]. The convolution of large signals is also used for simulating image formation in optical systems such as light microscopes [4]. The convolution is a common method used in image processing; however, its computation is very time-consuming for large images. Graphic cards can be employed for accelerating the computation. Some of the algorithms can be found in NVIDIA whitepaper [5]. Here, a so-called naïve convolution and a convolution with separable kernel are described, along with their optimized GPU implementation in CUDA. These algorithms can be used in many applications, such as fast computation of Canny edge detection [6,7]. However, these approaches are not suitable for general large kernels.
In optical microscopy, we often deal with both large input signals and kernels. Thus, in this article, we discuss the time complexity of a convolution with emphasis on large 3D images. We recall the convolution theorem and its positive effect on the time complexity. For example, having a signal of 1000 × 1000 × 100 voxels and a filter kernel of 100 × 100 × 100 voxels, which is common in optical microscopy, the calculation using the convolution theorem takes tens of seconds, instead of several days, on the most recent CPU architecture. Even better times can be obtained using graphic cards. The GPU-based convolution using the convolution theorem is described in [8]. As indicated by authors, the FFT-based approach is suitable for large non-separable kernels.
The essential part of the algorithm described above is the Fourier transform. The first attempt to compute the fast Fourier transform on graphics hardware was described in [9]. The implementation was written in OpenGL and Cg shading languages and tested in the convolution application. The comparison of convolution in spatial and frequency domain (for the description of both approaches refer to the following section) was made in [10]. A significant speedup was achieved by implementing the algorithm on GPU, using HLSL and DirectX. Recently, the NVidia ® CUDA programming model [11] along with the CUFFT library [12] offers a framework for implementing convolution in a straightforward manner. Besides CUFFT, other FFT libraries for GPU were developed, such as [13] and [14]. The OpenCL framework allows implementing methods on heteregenous platforms, consisting of CPU, GPU and other architectures [15,16].
The bottleneck of the GPU acceleration is that graphics hardware offers rather small amount of memory. This poses a significant problem to attempts of accelerating convolution of huge images on GPU. Due to convolution properties, the convolved image can be divided into arbitrarily small parts, but all the sub-images have to be extended with neighboring pixels of at least size of the filter kernel (point spread function-PSF), as described in [17]. Thus, a lot of redundant computation needs to be performed, proportionally to the PSF size. This approach was successfully used in [18], to compute spatially variant deconvolution of Hubble Space Telescope images. We propose a new approach, which is optimal, in terms of both the number of per-voxel computations and the number of memory transfers.

Convolution
A 1D convolution of two discrete finite signals f, g is defined by following [19]: where M f and M g is the number of samples of f and g, respectively. The convolution then produces a signal of size M = M f +M g -1. For the list of conventions used in the article refer to Appendix A.
Obviously, we have to take into account the boundary conditions defining f(m) at m < 0 and m >M f -1 [19,20]. Usually we set In practice, one of the signals has usually significantly larger size and is called simply the signal whereas the other is of a smaller size and is called the filter kernel. For instance, a kernel can be given by a simple function, such as Gaussian, or by so-called PSF, a function that describes the impulse response of an imaging system to a point source [ [3], pp. 205-207].
A computation of convolution can be very time-consuming. From Equation (1) it can be easily deduced that the time complexity of the problem is O (M f M g ). In the case the kernel is small (tens of samples) the convolution can be computed in a reasonable time, even for large signals. However, in some applications such as optical microscopy one deals with both signals and kernels of more than a million of samples each. In this case computation would take several days which is unacceptable and a different solution needs to be applied.

Convolution theorem
The convolution theorem provides us with a powerful tool for convolving large signals. Having two signals f, g it can be proved that where F denotes a Fourier transform. Therefore, instead of a convolution according to the definition, Fourier transforms can be applied on both the signal and the kernel, then their pointwise product is computed and finally inverse Fourier transform is applied to obtain the result [19].
Keep in mind that before the computation both the signal and the kernel need to be padded to the same size (that is, to the size of the resulting convolved signal) to avoid problems with boundary values. For example, in 1D case both the signal and the kernel are padded to M = M f + M g -1 samples. There are several ways how to pad the data, usually they are padded with zeros. The position of the padding influences the position of the resulting signal. Refer to [19] for more details. According to the convolution theorem the asymptotic time complexity of a convolution is the complexity of FFT [19,21], i.e., O(M log M). a

Memory complexity
As we mentioned in the previous section, to be able to apply the convolution theorem both the signal and the kernel have to be padded before computation. Hence, aM bytes of memory is required to store signal and the same amount for kernel, thus 2aM bytes in total where a is the size of data type used, e.g., typically 4 bytes for a single precision.
The Fourier transform can be performed in-place on a complex signal, so no additional memory is required. If the Fourier transform is performed on a real signal, the following property holds: where * denotes a complex conjugate. This also means that Im [F(0)] = 0. In n-D case, an analogous property is held. Therefore, real data can also be processed in-place except that it needs to be padded in the last dimension b to size M = 2 M 2 + 1 . Thus, only half of the Fourier domain can be stored in the memory [19,22].

GPU accelerated convolution
In this section, we describe a basic GPU-based implementation of convolution using CUDA. CUDA is a parallel computing model introduced by the nVidia company. It provides C language extensions to implement pieces of code on GPU. This so-called CUDA Toolkit also includes the CUBLAS and the CUFFT libraries providing algorithms for linear algebra and fast Fourier transform, respectively.
Using the Toolkit, an implementation of a convolution is quite simple and straightforward, see the example in Figure 1. Basically, the algorithm consists of three parts: First, the data are transferred from CPU (host) memory to GPU (device) memory. Then, the Fourier transform, the pointwise multiplication, and the inverse Fourier transform are subsequently performed. Finally, the data are transferred back to the host memory.
The pointwise multiplication is done in a simple loop so its parallelization is straightforward. GPU threads are mapped to individual image pixels, naturally providing coalesced memory accesses and massive parallelism. This approach is thus simple and optimal, limited by a global memory bandwidth only. For basic information and examples refer to the CUDA Programming Guide document and CUDA SDK samples [11,23].
The Fourier transform parallel computations are provided by the CUFFT library [12], as it is presently to our best knowledge the best library for FFT computation on GPU. In this part of the algorithm, all the optimizing issues (e.g., memory coalescing, shared or texture memory usage, etc.) are concerns of the CUFFT library.
When dealing with both large 3D images and large kernels it is impossible to compute convolution at once on GPU since the recent graphic cards typically have about 1GB of memory. This is also due to CUFFT specifics-if an image is too large to be stored in a shared memory, the FFT is performed out-of-place. As a result, even more memory is required [12]. In this section, we propose an algorithm for the decomposition of convolution. First, we describe the decimation in frequency (DIF) algorithm. This approach is not new, it was used, e.g., in [10], to provide the whole FFT computation. However, our contribution is to employ the DIF method to decompose data, i.e., to prepare it for convolution so that it can be processed in parts.
It should be noted that there are several other approaches to decompose the FFT problem, however, they are sub-optimal in means of the number of perpixel operations and data transfers. First, the DIT method can be used instead of the DIF. However, unlike the DIF, this approach does not provide complete separability of the resulting sub-problems. This leads to a lot of redundant data transfers. Second, in the spatial domain, the convolved image can be divided into small parts. This method will be referred to as the tiling. However, all the sub-images have to be extended with neighboring pixels of at least size of the filter kernel [17]. Thus, a lot of redundant computation needs to be performed, proportionally to the PSF size.
The comparison of three approaches to divide the convolution problem is shown in Table 1. are divided into. The DIF method is the only one which is not depend on the P parameter.

Decimation in frequency
The DIF algorithm is a technique used in fast Fourier transform to split data into several disjoint parts. We will introduce the idea for 1D case first. Let us have a function f(m) and its Fourier transform F(μ), m, μ = 0, ..., M -1. Supposing that M is even we introduce new functions r(m') and s(m'), m' = 0, ..., M/2 -1 as follows [22,24]: where W M = e i 2π M is a base function of a Fourier transform. Vice versa, it is simple to deduce Then it can be proved that the Fourier transforms R (μ') and S(μ') of the functions fulfil the following property: As explained above, the input function f can be decomposed into two half-sized parts r and s. The most important property of r and s is that they are completely separated parts of the original function in the frequency domain. Furthermore, the DIF algorithm can be applied recursively on the functions r and s, so the function f can be decomposed into four parts, then into eight parts, etc., supposing that M is divisible by 4, 8, .... The decomposition can be employed in n-D case. Since a Fourier transform is separable, an n-D transform can be expressed as a sequence of 1D transforms. Hence, the decomposition can be applied in any dimension. For 3D signals, a decomposition in z dimension-or the first coordinate in a row-major order-should be applied so that P separated parts in P continuous blocks of a memory are obtained. Unlike interlaced data, continuous blocks are optimal for data transfers.

Implementation
In this section, we introduce a modified algorithm for computing a convolution of large 3D images on GPU. The algorithm uses the DIF method for decomposing data. If the data are complex then the decomposition   can be performed in-place, whereas if the data are real then the decomposition need to be performed out-ofplace. The real data could also be transformed into a complex form but it is rather ineficient.
For the out-of-place decomposition of the data, 2aKLM bytes are required for storing the signal and the kernel and 4aKLM bytes for the decomposed data since complex numbers use up twice the size of real numbers. Thus, the overall memory space required is 6aKLM bytes.
A general review of the algorithm is shown in Figure  3, full description is in Figure 4. It is assumed that size of the data in z dimension is divisible by P, otherwise the data need to be padded to the nearest multiple of P before the convolution. However, it does not introduce a significant overhead to the algorithm since the size of the padding, if needed, is less than 1% of the overall image size.
Our implementation offers (de)compositions into 2, 4, 8, or 16 parts. These are provided by particular procedures named Decompose2, Decompose4, Decompose8, and Decompose16 (and analogously Compose2, etc.). The Decompose2 and Compose2 procedures conduct Equations (5) and (6), respectively. The Decompose4 and Compose4 may either recursively call the Decom-pose2 and Compose2, respectively, or they may directly split the data into four parts and reconstruct back from the four parts, respectively. We have opted for the latter solution since it requires smaller number of operations per voxel. Refer to Appendix B for more details.
The remaining procedures were implemented as follows: where ∘ denotes a composition of operations.

Multi-GPU implementation
Since the decomposition separates the convolution into P parts it is plain enough to solve the problem by multiple GPUs. To analyze precisely the contribution of the multi-GPU implementation we have to examine times spent in individual phases of the algorithm. In the single-GPU implementation described in Section 2.3 the overall time T can be expressed as follows: where t p is the time required for the padding of the data, t d for decomposition, t a for allocating memory and setting up FFT plans on GPU, t h d for data transfers from CPU to GPU (host to device), t conv for the convolution itself, t d h for data transfers from GPU to CPU (device to host) and finally t c for composition. Despite the fact that most time is spent in the convolution phase, the other phases cannot be neglected. Please note that the t a time needed for preparing the GPU can be overlapped with the t p and t d times needed for preparing the data on CPU.
Supposing that G GPUs are available and, for the sake of simplicity, P is divisible by G, the overall time T' can be expressed as follows: assuming that data cannot be transferred to multiple devices concurrently so both the times t h d and t d h remain the same. As a result, a data transfer of two image tiles from CPU to two GPUs concurrently-that is one tile to each GPU-lasts double the time of a data transfer of one tile to a single GPU. Actually, the overall time complexity of multi-GPU implementation can be even better than determined in Equation (9)   Karas and Svoboda EURASIP Journal on Advances in Signal Processing 2011, 2011:120 http://asp.eurasipjournals.com/content/2011/1/120 data transfers can be overlapped by computations on those GPUs that already possess the data (see Figure 5). The precise time values differ in particular situations. In general, they depend on the ratio of the data transfers and the convolution itself so Equation (9) can be considered as the upper estimate.
The final multi-GPU implementation is described in algorithm in Figure 6. Also a simple implementation of a custom cuMemcpy function is described in Figure 7. This function guarantees unique memory access enabling faster overall computation time in this way.

Results
We have performed several benchmarks on a machine with the Intel ® Core™ i7 950 processor with 6 GB DDR3 RAM and the nVidia ® GeForce GTX 295 graphic card with 2 × 896 MB GDDR3 RAM. The CPU implementation uses the FFTW library while the GPU implementation uses our algorithm. Both the decomposition and the composition are performed on CPU. Therefore, we have written them in C language and further improved with SSE intrinsics for a better performance. Figure 5 Timeline: single-GPU versus multi-GPU (a model situation). The dark boxes depict data transfers between CPU and GPU while the light boxes represent convolution computations. In the first row there is the single-GPU implementation. In the second row there is a timeline for parallel usage of two GPUs. The data transfers are performed concurrently but through a common bus, therefore they last twice longer. For the third row the data transfers are synchronized so that only one transfer is made at a time. In the last row the data transfers are overlapped with a convolution execution.

Single GPU implementation
The single GPU implementation was compared with a single thread CPU implementation. Considering the CPU implementation two approaches are distinguished: The one using both complex-to-complex FFT and inverse FFT (we call it the c2c implementation) and the one using real-to-complex FFT and complex-to-real inverse FFT (the r2c implementation). The former implementation can generally be used for convolving complex data (much like our approach), the latter one can be used on real data only but is twice as effective (in means of both the memory requirements and the speed).
In the first dataset ( Table 2) the images are padded to the powers of two in each dimension so we refer to them as to specifically sized images. This property has a significant inuence on computation time because on these images the FFT algorithm performs the best. In the second dataset (Table 3) the images are arbitrarily sized except that in the z dimension the padded size is kept so that the decomposition can be performed without the need of additional image padding.
In all tests, the decomposition parameter P was set to the least possible value so that the resulting sub-images fitted into GPU memory. For example, if the image was small enough, the decomposition was not performed at all, the larger the image was the more parts it was decomposed into.
The computation times are presented in Tables 2 and  3. For the sake of understandability the times were also converted into computation speed (number of voxels processed per second), see Figure 8(a),(c).
The results show that a single-GPU implementation is more than ten times faster as compared to a single thread CPU c2c implementation and about four times faster than a single thread CPU r2c implementation if the images are large enough. The r2c approach is the only one which can process images with sizes of more than 500 megavoxels. This is due to the limitations of the CPU memory.
We can also see that in the case of specifically sized images the difference between the time values is small for images smaller than 10 megavoxels. Nevertheless, in practise the arbitrarily sized images are more frequent. In this case, the GPU implementation is faster even on smaller images.

Multi-GPU implementation
In the second experiment, the multi-GPU implementation (using both the GPUs of GTX 295) was compared with a multiple thread CPU implementation. The same dataset was used, e.g., both the specifically sized and the arbitrarily sized images. The results are presented in Tables 4 and 5 and in Figure 8(b),(d).
The results reveal several facts. Again, the GPU implementation is up to eight times faster than the CPU c2c implementation (depending on image sizes) and up to three times faster than the CPU r2c implementation when comparing two GPUs with two CPU cores. Also a test with four CPU cores was made and the GPU implementation has performed still two times faster if the   images were large enough. In general, the GPU implementation becomes advantageous on images larger than 50 megavoxels. Note that in some cases a single GPU has performed better than two GPUs (especially in the case of specifically sized images). We shall see in the following section that the algorithm spent too much time in preliminary phases (such as data decomposing or memory allocating) at the expense of the convolution itself.

Time analysis
We have analyzed the time distribution of the phases of the proposed algorithm on two images with a resulting image size of 640 × 640 × 128 voxels. Our method was tested with different settings of the parameter P  (d) Two GPUs, arbitrary sizes   (number of decomposed parts). The results are depicted in Figure 9(a). The convolution itself consumes slightly less than a half of the overall time. The rest is spent on the other phases such as the preparation phase (memory allocation and setting up FFT plans [12] on GPU; and the data padding and the data decomposition on CPU), data transfers between CPU and GPU and finally the data composition. As the preparation phases on GPU on CPU are independent, they can be conducted simultaneously. The results can be confronted with Equation 8. Here, the "Pad", "Decompose", "Allocate", "CPU>GPU", "Convolution", "GPU>CPU", and "Compose" times correspond to t p , t d , t a , t h d , t conv , t d h , and t c , respectively.
In the case of the multi-GPU implementation the time spent in the preparation phase on GPU t a is double, see Figure 9(b). This unpleasant behavior is also observed on a machine with CUDA 4.0 with extended support for multi-GPU processing [25]. To the best of our knowledge, it is not mentioned in official NVIDIA documents. Fortunately, the time needed for padding the data on CPU t p is still larger. Thus, the overhead inducted by t a is hidden by t p . The times confronted with Equation 9 give us a good idea of how the usage of multiple GPU cards can speed up the computation. Since the significant portion of time is spent in sequential phases of the algorithm, the overall speedup is limited, corresponding to the famous Amdahl's law [26].
We may draw the conclusion that it is reasonable to choose the smallest value possible for P for given image size. The exact results depend on particular dataset. Sometimes, one may prefer usage of the (De)Compose4 function instead of the (De)Compose2 function-and therefore setting P to 4 instead of 2, or to 16 instead of 8-since these two functions are equally efficient and the Fourier transforms may be more efficient on the smaller parts of the decomposed image, rather than on the larger parts.

Precision analysis
We tested our algorithm for real data from optical microscopy. The bit depth of the images was 16 bits;   however, the actual bit depth of the image contents was approximately 10-11 bits. This is often the case in the optical microscopy since the bit depth is limited by both parameters of CCD sensors, such as the accuracy of the A/D convertor, and the imaging conditions, namely the amount of light. The convolution normalized with the sum of the kernel values was computed and the results of the CPU implementation and the GPU implementation, both in single precision, were compared. Then the maximum difference between the two resulting images over all voxels was computed: where C i is the voxel value at the position i in the image computed by CPU, G i is the voxel value at the position i in the image computed by GPU and Ω is the set of all coordinates in the image.
The results for different settings of the parameter P as for both single and double precision are shown in Tables 6 and 7. For the latter, the results were approximately the same for all values of P, therefore only one column is shown for the sake of simplicity. As expected, the precision is not an issue for specifically sized images. For single precision the error is lower than 10 -3 , for double it is even lower or equal to 10 -5 . However, it starts to become an issue for arbitrarily sized images. As results show, the error rate depends not only on the overall image size, but also on the individual dimensions. In particular, the error is significantly higher for images of dimensions that cannot be factored into small primes. Thus, the worst results (δ > 1) were achieved for image where one dimension was equal to 257, which is prime itself. In some cases, image padding can be used to achieve better accuracy. For those applications where the accuracy is essential, the double precision is still required.
We find that for images of sizes that can be factored into small primes single precision is acceptable for most purposes. However, we are aware that in some applications single precision might not be enough. The recent nVidia graphic cards offer computing in double precision, however the speed is too low (1/8 of the speed in single precision). With the release of the Fermi architecture computing in double precision on GPU will become feasible.

Conclusion
In this article, we have proposed a new method for convolving large images. We have taken advantage of high computation power of GPU and extended the algorithm with a decomposition. As a result we are able not only to convolve images larger than the size of GPU memory, but also to employ multiple GPUs in parallel.
Our method is generally suitable for complex data. However, in the case of real data it is rather inefficient. Since the input data are represented by the complex numbers with zero imaginary parts, and not by the real numbers, double effort is made to compute the result. On the other hand, the method can be optimized for real data in a few ways [22,24,27]. This is the subject of our future study.
The results show that it is reasonable to use our algorithm especially on very large images where the speedup of the GPU implementation can be up to 5× for the complex data and 2-3× for the real data. We suggest to decompose images into the smallest number of parts possible as this approach seems to be the most efficient.
We also studied the precision of the convolution in a practical application. The results revealed that for this application the computation in a single precision is acceptable (and it will be probably so for many other applications). In the case the single precision is not enough it is also possible to compute in a double precision. However, in this precision the recent graphic cards perform poorly. With the release of new architectures (e.g., nVidia Fermi) double precision will become feasible.
The proposed method can be implemented also in OpenCL and other languages besides CUDA. Besides the graphic cards, other parallel architectures can be taken into account as well. As the convolution is a key part of many deconvolution methods [28,29], application of the proposed algorithm in the deconvolution will also be the subject of our future study.
Endnotes a In 3D case both signals are padded to size M = M f + M g -1 in x dimension, L = L f + L g -1 in y dimension and K = K f + K g -1 in z dimension; and the resulting complexity is O (KLM log(KLM)). b Supposing the data are stored in a memory in a row-major order, the last dimension is the x dimension.