Skip to main content

Convolution of large 3D images on GPU and its decomposition


In this article, we propose a method for computing convolution of large 3D images. The convolution is performed in a frequency domain using a convolution theorem. The algorithm is accelerated on a graphic card by means of the CUDA parallel computing model. Convolution is decomposed in a frequency domain using the decimation in frequency algorithm. We pay attention to keeping our approach efficient in terms of both time and memory consumption and also in terms of memory transfers between CPU and GPU which have a significant inuence on overall computational time. We also study the implementation on multiple GPUs and compare the results between the multi-GPU and multi-CPU implementations.

1 Introduction

The convolution of two signals can be employed for blurring images, deconvolving blurred images, edge detection, noise suppression, and in many other applications [13]. For example, a cross-correlation and a phase-correlation (which are important methods of image registration) are both very similar to a convolution since they have basically the same mathematical meaning except that a convolution involves reversing a signal [[1], p.211]. The convolution of large signals is also used for simulating image formation in optical systems such as light microscopes [4]. The convolution is a common method used in image processing; however, its computation is very time-consuming for large images. Graphic cards can be employed for accelerating the computation. Some of the algorithms can be found in NVIDIA whitepaper [5]. Here, a so-called naïve convolution and a convolution with separable kernel are described, along with their optimized GPU implementation in CUDA. These algorithms can be used in many applications, such as fast computation of Canny edge detection [6, 7]. However, these approaches are not suitable for general large kernels.

In optical microscopy, we often deal with both large input signals and kernels. Thus, in this article, we discuss the time complexity of a convolution with emphasis on large 3D images. We recall the convolution theorem and its positive effect on the time complexity. For example, having a signal of 1000 × 1000 × 100 voxels and a filter kernel of 100 × 100 × 100 voxels, which is common in optical microscopy, the calculation using the convolution theorem takes tens of seconds, instead of several days, on the most recent CPU architecture. Even better times can be obtained using graphic cards. The GPU-based convolution using the convolution theorem is described in [8]. As indicated by authors, the FFT-based approach is suitable for large non-separable kernels.

The essential part of the algorithm described above is the Fourier transform. The first attempt to compute the fast Fourier transform on graphics hardware was described in [9]. The implementation was written in OpenGL and Cg shading languages and tested in the convolution application. The comparison of convolution in spatial and frequency domain (for the description of both approaches refer to the following section) was made in [10]. A significant speedup was achieved by implementing the algorithm on GPU, using HLSL and DirectX. Recently, the NVidia® CUDA programming model [11] along with the CUFFT library [12] offers a framework for implementing convolution in a straightforward manner. Besides CUFFT, other FFT libraries for GPU were developed, such as [13] and [14]. The OpenCL framework allows implementing methods on heteregenous platforms, consisting of CPU, GPU and other architectures [15, 16].

The bottleneck of the GPU acceleration is that graphics hardware offers rather small amount of memory. This poses a significant problem to attempts of accelerating convolution of huge images on GPU. Due to convolution properties, the convolved image can be divided into arbitrarily small parts, but all the sub-images have to be extended with neighboring pixels of at least size of the filter kernel (point spread function--PSF), as described in [17]. Thus, a lot of redundant computation needs to be performed, proportionally to the PSF size. This approach was successfully used in [18], to compute spatially variant deconvolution of Hubble Space Telescope images. We propose a new approach, which is optimal, in terms of both the number of per-voxel computations and the number of memory transfers.

1.1 Convolution

A 1D convolution of two discrete finite signals f, g is defined by following [19]:

[ f * g ] ( n ) m = 0 M g - 1 f ( n - m ) g ( m ) , n = 0 , , M - 1 ,

where M f and M g is the number of samples of f and g, respectively. The convolution then produces a signal of size M = M f +M g - 1. For the list of conventions used in the article refer to Appendix A.

Obviously, we have to take into account the boundary conditions defining f(m) at m < 0 and m > M f - 1 [19, 20]. Usually we set

f ( m ) 0 , ( m < 0 ) ( m > M f - 1 )

(Dirichlet boundary conditions) or

f ( m ) f ( 0 ) , m < 0 , f ( M f - 1 ) , m > M f - 1

(Neumann boundary conditions).

In practice, one of the signals has usually significantly larger size and is called simply the signal whereas the other is of a smaller size and is called the filter kernel. For instance, a kernel can be given by a simple function, such as Gaussian, or by so-called PSF, a function that describes the impulse response of an imaging system to a point source [[3], pp. 205-207].

A computation of convolution can be very time-consuming. From Equation (1) it can be easily deduced that the time complexity of the problem is O (M f M g ). In the case the kernel is small (tens of samples) the convolution can be computed in a reasonable time, even for large signals. However, in some applications such as optical microscopy one deals with both signals and kernels of more than a million of samples each. In this case computation would take several days which is unacceptable and a different solution needs to be applied.

1.2 Convolution theorem

The convolution theorem provides us with a powerful tool for convolving large signals. Having two signals f, g it can be proved that

F f * g = F f F g ,
f * g = F - 1 ( F [ f ] F [ g ] ) ,

where F denotes a Fourier transform. Therefore, instead of a convolution according to the definition, Fourier transforms can be applied on both the signal and the kernel, then their pointwise product is computed and finally inverse Fourier transform is applied to obtain the result [19].

Keep in mind that before the computation both the signal and the kernel need to be padded to the same size (that is, to the size of the resulting convolved signal) to avoid problems with boundary values. For example, in 1D case both the signal and the kernel are padded to M = M f + M g - 1 samples. There are several ways how to pad the data, usually they are padded with zeros. The position of the padding influences the position of the resulting signal. Refer to [19] for more details. According to the convolution theorem the asymptotic time complexity of a convolution is the complexity of FFT [19, 21], i.e., O(M log M).a

1.3 Memory complexity

As we mentioned in the previous section, to be able to apply the convolution theorem both the signal and the kernel have to be padded before computation. Hence, αM bytes of memory is required to store signal and the same amount for kernel, thus 2αM bytes in total where α is the size of data type used, e.g., typically 4 bytes for a single precision.

The Fourier transform can be performed in-place on a complex signal, so no additional memory is required. If the Fourier transform is performed on a real signal, the following property holds:

F ( μ ) = F * ( M - μ ) , μ 0 , F ( 0 ) = F * ( 0 ) ,

where * denotes a complex conjugate. This also means that Im [F(0)] = 0. In n-D case, an analogous property is held. Therefore, real data can also be processed in-place except that it needs to be padded in the last dimensionb to size M =2 M 2 + 1 . Thus, only half of the Fourier domain can be stored in the memory [19, 22].

2 Method

2.1 GPU accelerated convolution

In this section, we describe a basic GPU-based implementation of convolution using CUDA. CUDA is a parallel computing model introduced by the nVidia company. It provides C language extensions to implement pieces of code on GPU. This so-called CUDA Toolkit also includes the CUBLAS and the CUFFT libraries providing algorithms for linear algebra and fast Fourier transform, respectively.

Using the Toolkit, an implementation of a convolution is quite simple and straightforward, see the example in Figure 1. Basically, the algorithm consists of three parts: First, the data are transferred from CPU (host) memory to GPU (device) memory. Then, the Fourier transform, the pointwise multiplication, and the inverse Fourier transform are subsequently performed. Finally, the data are transferred back to the host memory.

Figure 1
figure 1

Convolution in CUDA. The function called cuMemcpy provides data transfers between CPU (Host) and GPU (Device). The pwProd provides a pointwise multiplication of two signals on GPU. The norm constant normalizes the convolution if required. The result is stored in signal variable in order to save memory. The result could also be stored in another parameter in case we needed to keep the input data.

The pointwise multiplication is done in a simple loop so its parallelization is straightforward. GPU threads are mapped to individual image pixels, naturally providing coalesced memory accesses and massive parallelism. This approach is thus simple and optimal, limited by a global memory bandwidth only. For basic information and examples refer to the CUDA Programming Guide document and CUDA SDK samples [11, 23].

The Fourier transform parallel computations are provided by the CUFFT library [12], as it is presently to our best knowledge the best library for FFT computation on GPU. In this part of the algorithm, all the optimizing issues (e.g., memory coalescing, shared or texture memory usage, etc.) are concerns of the CUFFT library.

When dealing with both large 3D images and large kernels it is impossible to compute convolution at once on GPU since the recent graphic cards typically have about 1GB of memory. This is also due to CUFFT specifics--if an image is too large to be stored in a shared memory, the FFT is performed out-of-place. As a result, even more memory is required [12]. In this section, we propose an algorithm for the decomposition of convolution. First, we describe the decimation in frequency (DIF) algorithm. This approach is not new, it was used, e.g., in [10], to provide the whole FFT computation. However, our contribution is to employ the DIF method to decompose data, i.e., to prepare it for convolution so that it can be processed in parts.

It should be noted that there are several other approaches to decompose the FFT problem, however, they are sub-optimal in means of the number of per-pixel operations and data transfers. First, the DIT method can be used instead of the DIF. However, unlike the DIF, this approach does not provide complete separability of the resulting sub-problems. This leads to a lot of redundant data transfers. Second, in the spatial domain, the convolved image can be divided into small parts. This method will be referred to as the tiling. However, all the sub-images have to be extended with neighboring pixels of at least size of the filter kernel [17]. Thus, a lot of redundant computation needs to be performed, proportionally to the PSF size.

The comparison of three approaches to divide the convolution problem is shown in Table 1. M f and M g denote the size of the signal and kernel, respectively, c is a constant corresponding to a per-pixel number of arithmetic operations needed for the computation of FFT, and P denotes the number of parts the input data are divided into. The DIF method is the only one which is not depend on the P parameter.

Table 1 Methods for decomposition of the convolution problem and their requirements

2.2 Decimation in frequency

The DIF algorithm is a technique used in fast Fourier transform to split data into several disjoint parts. We will introduce the idea for 1D case first. Let us have a function f(m) and its Fourier transform F(μ), m, μ = 0, ..., M - 1. Supposing that M is even we introduce new functions r(m') and s(m'), m' = 0, ..., M/2 - 1 as follows [22, 24]:

r ( m ) f ( m ) + f ( m + M 2 ) ,
s ( m ) [ f ( m ) - f ( m + M 2 ) ] W M - m ,

where W M = e i 2 π M is a base function of a Fourier transform. Vice versa, it is simple to deduce

f ( m ) = 1 2 r ( m ) + s ( m ) W M m ,
f ( m + M 2 ) = 1 2 r ( m ) - s ( m ) W M m .

Then it can be proved that the Fourier transforms R(μ') and S(μ') of the functions fulfil the following property:

R ( μ ) = F ( 2 μ ) ,
S ( μ ) = F ( 2 μ + 1 ) .

As explained above, the input function f can be decomposed into two half-sized parts r and s. The most important property of r and s is that they are completely separated parts of the original function in the frequency domain. Furthermore, the DIF algorithm can be applied recursively on the functions r and s, so the function f can be decomposed into four parts, then into eight parts, etc., supposing that M is divisible by 4, 8, .... Figure 2 depicts a scheme of the DIF algorithm.

Figure 2
figure 2

Buttery scheme of the DIF. Equation (5) is applied on the signal f of length 8 (the black points) and then two discrete Fourier transform of the half size are performed. The transformed signal F is the result.

The decomposition can be employed in n-D case. Since a Fourier transform is separable, an n-D transform can be expressed as a sequence of 1D transforms. Hence, the decomposition can be applied in any dimension. For 3D signals, a decomposition in z dimension--or the first coordinate in a row-major order--should be applied so that P separated parts in P continuous blocks of a memory are obtained. Unlike interlaced data, continuous blocks are optimal for data transfers.

2.3 Implementation

In this section, we introduce a modified algorithm for computing a convolution of large 3D images on GPU. The algorithm uses the DIF method for decomposing data. If the data are complex then the decomposition can be performed in-place, whereas if the data are real then the decomposition need to be performed out-of-place. The real data could also be transformed into a complex form but it is rather ineficient.

For the out-of-place decomposition of the data, 2αKLM bytes are required for storing the signal and the kernel and 4αKLM bytes for the decomposed data since complex numbers use up twice the size of real numbers. Thus, the overall memory space required is 6αKLM bytes.

A general review of the algorithm is shown in Figure 3, full description is in Figure 4. It is assumed that size of the data in z dimension is divisible by P, otherwise the data need to be padded to the nearest multiple of P before the convolution. However, it does not introduce a significant overhead to the algorithm since the size of the padding, if needed, is less than 1% of the overall image size.

Figure 3
figure 3

A scheme description of the proposed algorithm: convolution with a decomposition in a frequency domain.

Figure 4
figure 4

Convolution in CUDA with a decomposition.

Our implementation offers (de)compositions into 2, 4, 8, or 16 parts. These are provided by particular procedures named Decompose2, Decompose4, Decompose8, and Decompose16 (and analogously Compose2, etc.). The Decompose2 and Compose2 procedures conduct Equations (5) and (6), respectively. The Decompose4 and Compose4 may either recursively call the Decompose2 and Compose2, respectively, or they may directly split the data into four parts and reconstruct back from the four parts, respectively. We have opted for the latter solution since it requires smaller number of operations per voxel. Refer to Appendix B for more details.

The remaining procedures were implemented as follows:

Decompose8 = Decompose4   Decompose2, Decompose16 = Decompose4   Decompose4, Compose8 = Compose2   Compose4, Compose16 = Compose4   Compose4,

where denotes a composition of operations.

2.4 Multi-GPU implementation

Since the decomposition separates the convolution into P parts it is plain enough to solve the problem by multiple GPUs. To analyze precisely the contribution of the multi-GPU implementation we have to examine times spent in individual phases of the algorithm. In the single-GPU implementation described in Section 2.3 the overall time T can be expressed as follows:

T = max ( t p + t d , t a ) + t h d + t conv + t d h + t c ,

where t p is the time required for the padding of the data, t d for decomposition, t a for allocating memory and setting up FFT plans on GPU, thdfor data transfers from CPU to GPU (host to device), tconv for the convolution itself, tdhfor data transfers from GPU to CPU (device to host) and finally t c for composition. Despite the fact that most time is spent in the convolution phase, the other phases cannot be neglected. Please note that the t a time needed for preparing the GPU can be overlapped with the t p and t d times needed for preparing the data on CPU.

Supposing that G GPUs are available and, for the sake of simplicity, P is divisible by G , the overall time T' can be expressed as follows:

T = max t p + t d , t a + t h d + t conv G + t d h + t c ,

assuming that data cannot be transferred to multiple devices concurrently so both the times thdand tdhremain the same. As a result, a data transfer of two image tiles from CPU to two GPUs concurrently--that is one tile to each GPU--lasts double the time of a data transfer of one tile to a single GPU. Actually, the overall time complexity of multi-GPU implementation can be even better than determined in Equation (9) because the data transfers can be overlapped by computations on those GPUs that already possess the data (see Figure 5). The precise time values differ in particular situations. In general, they depend on the ratio of the data transfers and the convolution itself so Equation (9) can be considered as the upper estimate.

Figure 5
figure 5

Timeline: single-GPU versus multi-GPU (a model situation). The dark boxes depict data transfers between CPU and GPU while the light boxes represent convolution computations. In the first row there is the single-GPU implementation. In the second row there is a timeline for parallel usage of two GPUs. The data transfers are performed concurrently but through a common bus, therefore they last twice longer. For the third row the data transfers are synchronized so that only one transfer is made at a time. In the last row the data transfers are overlapped with a convolution execution.

The final multi-GPU implementation is described in algorithm in Figure 6. Also a simple implementation of a custom cuMemcpy function is described in Figure 7. This function guarantees unique memory access enabling faster overall computation time in this way.

Figure 6
figure 6

Multi-GPU convolution. The for loop is performed on all the available graphic cards in parallel.

Figure 7
figure 7

Data transfer with the bus lock. A new (global) variable called lock is introduced. This variable (or a flag) is shared across all the GPUs sharing the common data bus so that each time a GPU needs to send or receive data, it waits until the bus is free, then it locks the bus, transfers data, and finally unlocks the bus. (a) Single GPU, specific sizes. (b) Two GPUs, specific sizes. (c) Single GPU, arbitrary sizes. (d) Two GPUs, arbitrary sizes.

3 Results

We have performed several benchmarks on a machine with the Intel® Core™ i7 950 processor with 6 GB DDR3 RAM and the nVidia® GeForce GTX 295 graphic card with 2 × 896 MB GDDR3 RAM. The CPU implementation uses the FFTW library while the GPU implementation uses our algorithm. Both the decomposition and the composition are performed on CPU. Therefore, we have written them in C language and further improved with SSE intrinsics for a better performance.

3.1 Single GPU implementation

The single GPU implementation was compared with a single thread CPU implementation. Considering the CPU implementation two approaches are distinguished: The one using both complex-to-complex FFT and inverse FFT (we call it the c2c implementation) and the one using real-to-complex FFT and complex-to-real inverse FFT (the r2c implementation). The former implementation can generally be used for convolving complex data (much like our approach), the latter one can be used on real data only but is twice as effective (in means of both the memory requirements and the speed).

In the first dataset (Table 2) the images are padded to the powers of two in each dimension so we refer to them as to specifically sized images. This property has a significant inuence on computation time because on these images the FFT algorithm performs the best. In the second dataset (Table 3) the images are arbitrarily sized except that in the z dimension the padded size is kept so that the decomposition can be performed without the need of additional image padding.

Table 2 Convolution of specifically sized images
Table 3 Convolution of arbitrarily sized images

In all tests, the decomposition parameter P was set to the least possible value so that the resulting sub-images fitted into GPU memory. For example, if the image was small enough, the decomposition was not performed at all, the larger the image was the more parts it was decomposed into.

The computation times are presented in Tables 2 and 3. For the sake of understandability the times were also converted into computation speed (number of voxels processed per second), see Figure 8(a),(c).

Figure 8
figure 8

The computation speed (in Megavoxels per second) on different datasets (specifically and arbitrarily sized) and configurations (single and multiple GPUs). Comparison between CPU and GPU.

The results show that a single-GPU implementation is more than ten times faster as compared to a single thread CPU c2c implementation and about four times faster than a single thread CPU r2c implementation if the images are large enough. The r2c approach is the only one which can process images with sizes of more than 500 megavoxels. This is due to the limitations of the CPU memory.

We can also see that in the case of specifically sized images the difference between the time values is small for images smaller than 10 megavoxels. Nevertheless, in practise the arbitrarily sized images are more frequent. In this case, the GPU implementation is faster even on smaller images.

3.2 Multi-GPU implementation

In the second experiment, the multi-GPU implementation (using both the GPUs of GTX 295) was compared with a multiple thread CPU implementation. The same dataset was used, e.g., both the specifically sized and the arbitrarily sized images. The results are presented in Tables 4 and 5 and in Figure 8(b),(d).

Table 4 Multi-GPU convolution of specifically sized images
Table 5 Multi-GPU convolution of arbitrarily sized images

The results reveal several facts. Again, the GPU implementation is up to eight times faster than the CPU c2c implementation (depending on image sizes) and up to three times faster than the CPU r2c implementation when comparing two GPUs with two CPU cores. Also a test with four CPU cores was made and the GPU implementation has performed still two times faster if the images were large enough. In general, the GPU implementation becomes advantageous on images larger than 50 megavoxels.

Note that in some cases a single GPU has performed better than two GPUs (especially in the case of specifically sized images). We shall see in the following section that the algorithm spent too much time in preliminary phases (such as data decomposing or memory allocating) at the expense of the convolution itself.

3.3 Time analysis

We have analyzed the time distribution of the phases of the proposed algorithm on two images with a resulting image size of 640 × 640 × 128 voxels. Our method was tested with different settings of the parameter P (number of decomposed parts). The results are depicted in Figure 9(a).

Figure 9
figure 9

Time analysis of the proposed convolution algorithm. The charts describe time (y-axis) spent in particular phases of the proposed algorithm depending on decomposition parameter P (x-axis). In the single-GPU implementation there are two parallel timelines, one for CPU, one for GPU. In the multi-GPU implementation two GPUs have been tested. Thus, there are three parallel timelines, one for CPU and one for each GPU. The overall time of the algorithm is equal to the top of the CPU timeline.

The convolution itself consumes slightly less than a half of the overall time. The rest is spent on the other phases such as the preparation phase (memory allocation and setting up FFT plans [12] on GPU; and the data padding and the data decomposition on CPU), data transfers between CPU and GPU and finally the data composition. As the preparation phases on GPU on CPU are independent, they can be conducted simultaneously. The results can be confronted with Equation 8. Here, the "Pad", "Decompose", "Allocate", "CPU>GPU", "Convolution", "GPU>CPU", and "Compose" times correspond to t p , t d , t a , thd, t conv , tdh, and t c , respectively.

In the case of the multi-GPU implementation the time spent in the preparation phase on GPU t a is double, see Figure 9(b). This unpleasant behavior is also observed on a machine with CUDA 4.0 with extended support for multi-GPU processing [25]. To the best of our knowledge, it is not mentioned in official NVIDIA documents. Fortunately, the time needed for padding the data on CPU t p is still larger. Thus, the overhead inducted by t a is hidden by t p . The times confronted with Equation 9 give us a good idea of how the usage of multiple GPU cards can speed up the computation. Since the significant portion of time is spent in sequential phases of the algorithm, the overall speedup is limited, corresponding to the famous Amdahl's law [26].

We may draw the conclusion that it is reasonable to choose the smallest value possible for P for given image size. The exact results depend on particular dataset. Sometimes, one may prefer usage of the (De)Compose4 function instead of the (De)Compose2 function--and therefore setting P to 4 instead of 2, or to 16 instead of 8--since these two functions are equally efficient and the Fourier transforms may be more efficient on the smaller parts of the decomposed image, rather than on the larger parts.

3.4 Precision analysis

We tested our algorithm for real data from optical microscopy. The bit depth of the images was 16 bits; however, the actual bit depth of the image contents was approximately 10-11 bits. This is often the case in the optical microscopy since the bit depth is limited by both parameters of CCD sensors, such as the accuracy of the A/D convertor, and the imaging conditions, namely the amount of light. The convolution normalized with the sum of the kernel values was computed and the results of the CPU implementation and the GPU implementation, both in single precision, were compared. Then the maximum difference between the two resulting images over all voxels was computed:

δ = max { | C i - G i | , i Ω } ,

where C i is the voxel value at the position i in the image computed by CPU, G i is the voxel value at the position i in the image computed by GPU and Ω is the set of all coordinates in the image.

The results for different settings of the parameter P as for both single and double precision are shown in Tables 6 and 7. For the latter, the results were approximately the same for all values of P, therefore only one column is shown for the sake of simplicity. As expected, the precision is not an issue for specifically sized images. For single precision the error is lower than 10-3, for double it is even lower or equal to 10-5. However, it starts to become an issue for arbitrarily sized images. As results show, the error rate depends not only on the overall image size, but also on the individual dimensions. In particular, the error is significantly higher for images of dimensions that cannot be factored into small primes. Thus, the worst results (δ > 1) were achieved for image where one dimension was equal to 257, which is prime itself. In some cases, image padding can be used to achieve better accuracy. For those applications where the accuracy is essential, the double precision is still required.

Table 6 Precision analysis of specifically sized images
Table 7 Precision analysis of arbitrarily sized images

We find that for images of sizes that can be factored into small primes single precision is acceptable for most purposes. However, we are aware that in some applications single precision might not be enough. The recent nVidia graphic cards offer computing in double precision, however the speed is too low (1/8 of the speed in single precision). With the release of the Fermi architecture computing in double precision on GPU will become feasible.

4 Conclusion

In this article, we have proposed a new method for convolving large images. We have taken advantage of high computation power of GPU and extended the algorithm with a decomposition. As a result we are able not only to convolve images larger than the size of GPU memory, but also to employ multiple GPUs in parallel.

Our method is generally suitable for complex data. However, in the case of real data it is rather inefficient. Since the input data are represented by the complex numbers with zero imaginary parts, and not by the real numbers, double effort is made to compute the result. On the other hand, the method can be optimized for real data in a few ways [22, 24, 27]. This is the subject of our future study.

The results show that it is reasonable to use our algorithm especially on very large images where the speedup of the GPU implementation can be up to 5× for the complex data and 2-3× for the real data. We suggest to decompose images into the smallest number of parts possible as this approach seems to be the most efficient.

We also studied the precision of the convolution in a practical application. The results revealed that for this application the computation in a single precision is acceptable (and it will be probably so for many other applications). In the case the single precision is not enough it is also possible to compute in a double precision. However, in this precision the recent graphic cards perform poorly. With the release of new architectures (e.g., nVidia Fermi) double precision will become feasible.

The proposed method can be implemented also in OpenCL and other languages besides CUDA. Besides the graphic cards, other parallel architectures can be taken into account as well. As the convolution is a key part of many deconvolution methods [28, 29], application of the proposed algorithm in the deconvolution will also be the subject of our future study.


aIn 3D case both signals are padded to size M = M f + M g - 1 in x dimension, L = L f + L g - 1 in y dimension and K = K f + K g - 1 in z dimension; and the resulting complexity is O (KLM log(KLM)). bSupposing the data are stored in a memory in a row-major order, the last dimension is the x dimension.


.1 Conventions

We introduce the conventions used in the text:

f * g... convolution of signals f, g

F f ... Fourier transform of a signal f

F - 1 F ... inverse Fourier transform of a signal F

f(m) ... a signal is denoted by a lowercase letter with a Latin letter index

F(μ) ... a Fourier transform is denoted by an uppercase letter with a greek letter index

W M = e i 2 π M a base function of a Fourier transform

In 1D case we introduce

M f , M g ... sizes of the signal and the kernel, respectively

M = M f + M g - 1 ... size of the output convolved signal

In 3D case we introduce

K f , L f , M f ... sizes of the signal in dimensions z, y, x, respectively

K g , L g , M g ... sizes of the kernel in dimensions z, y, x, respectively

K, L, M ... sizes of the output convolved signal in dimensions z, y, x, respectively (K = K f + Kg-1, etc.)

.2 Decimation in frequency

A signal f(m) can directly be decomposed into four parts by applying Equation 5 recursively:

r ( m ) t ( n ) r ( n ) + r ( n + M 4 ) , u ( n ) [ r ( n ) - r ( n + M 4 ) ] W M 2 - n ;
s ( m ) v ( n ) s ( n ) + s ( n + M 4 ) , w ( n ) [ s ( n ) - s ( n + M 4 ) ] W M 2 - n ,

where n = 0, ..., M/4 - 1. Then the Fourier transforms T(ν), U(ν), V (ν), and W(ν) hold the following property:

T ( v ) = F ( 4 ν ) ,
U ( v ) = F ( 4 ν + 2 ) ,
V ( v ) = F ( 4 ν + 1 ) ,
W ( v ) = F ( 4 ν + 3 ) .

Now let us introduce o n + M 4 , p n + M 2 , q n + 3 M 4 ,

then it can be proved that

t ( n ) = x ( n ) + x ( o ) + x ( p ) + x ( q ) ,
u ( n ) = ( x ( n ) - x ( o ) + x ( p ) - x ( q ) ) W M - 2 n ,
v ( n ) = [ ( x ( n ) - x ( o ) ) - i ( x ( p ) - x ( q ) ) ] W M - n ,
w ( n ) = [ ( x ( n ) - x ( o ) ) + i ( x ( p ) - x ( q ) ) ] W M - 3 n .

For x(n), x(o), x(p), x(q) we get the following:

x ( n ) = 1 4 ( t ( n ) + u ( n ) W M 2 n ) + ( v ( n ) W M n + w ( n ) W M 3 n ) ,
x ( o ) = 1 4 ( t ( n ) - u ( n ) W M 2 n ) + ( v ( n ) W M n - w ( n ) i W M 3 n ) ,
x ( p ) = 1 4 t ( n ) + u ( n ) W M 2 n ) - ( v ( n ) W M n + w ( n ) W M 3 n ) ,
x ( q ) = 1 4 ( t ( n ) - u ( n ) W M 2 n ) - ( v ( n ) W M n - w ( n ) i W M 3 n ) .


  1. Gonzales RC, Woods RE: Digital Image Processing. 2nd edition. Prentice-Hall; 2002.

    Google Scholar 

  2. Pratt WK: Digital Image Processing. 3rd edition. John Wiley & Sons; 2001.

    Book  Google Scholar 

  3. Jähne B: Digital Image Processing. 6th edition. Springer; 2005.

    Google Scholar 

  4. Svoboda D, Kozubek M, Stejskal S: Generation of Digital Phantoms of Cell Nuclei and Simulation of Image Formation in 3D Image Cytometry. CYTOMETRY PART A 2009,75A(6):494-509. 10.1002/cyto.a.20714

    Article  Google Scholar 

  5. Podlozhnyuk V: Image Convolution with CUDA. NVIDIA Corporation 2007. []

    Google Scholar 

  6. Luo Y, Duraiswami R: Canny edge detection on nvidia cuda. Computer Vision and Pattern Recognition Workshop 2008, 1-8.

    Google Scholar 

  7. Ogawa K, Ito Y, Nakano K: Efficient canny edge detection using a gpu. International Conference on Natural Computation 2010, 279-280.

    Google Scholar 

  8. Podlozhnyuk V: FFT-based 2D convolution. NVIDIA Corporation 2007. []

    Google Scholar 

  9. Moreland K, Angel E: The FFT on a GPU. HWWS '03: Proceedings of the ACM SIG-GRAPH/EUROGRAPHICS conference on Graphics hardware 2003, 112-119.

    Google Scholar 

  10. Fialka O, Cadik M: FFT and Convolution Performance in Image Filtering on GPU. Information Visualization, 2006. IV 2006. Tenth International Conference on 2006, 609-614.

    Google Scholar 

  11. CUDA™ SDK code samples 3.1 NVIDIA Corporation 2010. []

  12. CUDA™ CUFFT Library 3.1 NVIDIA Corporation 2010. []

  13. Nukada A, Ogata Y, Endo T, Matsuoka S: Bandwidth intensive 3-D FFT kernel for GPUs using CUDA. In SC '08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing. Piscataway, NJ, USA: IEEE Press; 2008:1-11.

    Google Scholar 

  14. Govindaraju NK, Lloyd B, Dotsenko Y, Smith B, Manferdelli J: High performance discrete Fourier transforms on graphics processors. In SC '08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing. Piscataway, NJ, USA: IEEE Press; 2008:1-12.

    Google Scholar 

  15. OpenCL Khronos Group 2010. []

  16. OpenCL 1.1 Reference Pages Khronos Group 2010. []

  17. Trussell H, Hunt B: Image restoration of space variant blurs by sectioned methods. Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '78. 1978, 3: 196-198.

    Article  Google Scholar 

  18. Boden AF, Redding DC, Hanisch RJ, Mo J: Massively parallel spatially variant maximum-likelihood restoration of Hubble Space Telescope imagery. J Opt Soc Am A 1996,13(7):1537-1545. [] 10.1364/JOSAA.13.001537

    Article  Google Scholar 

  19. Bracewell RN: The Fourier Transform and Its Applications. 3rd edition. McGraw-Hill; 2000.

    Google Scholar 

  20. Hanna JR, Rowland JH: Fourier Series, Transforms, and Boundary Value Problems. 2nd edition. John Wiley & Sons; 1990.

    Google Scholar 

  21. W VT, Press WilliamH, Teukolsky SaulA, Flannery PB: Numerical Recipes in C. Volume ch 7. 2nd edition. Cambridge University Press; 1992.

    Google Scholar 

  22. Hey A: The FFT demystified. Engineering Productivity Tools Ltd. 21 Leaveden Road, Watford, Hertfordshire, UK []

  23. CUDA™ Programming Guide 3.1 NVIDIA Corporation 2010. []

  24. Saidi A: Generalized FFT Algorithm. IEEE International Conference on Communications 93: Technical program, conference record, IEEE International Conference on Communications - Communications: TECHNOLOGY THAT UNITED NATIONS (ICC 93), Geneva, SWITZERLAND, May 23-26, 1993 1993, 1-3: 227-231.

    Google Scholar 

  25. CUDA™ Toolkit 4.0 NVIDIA Corporation 2011. []

  26. Amdahl GM: Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference, ser. AFIPS '67 (Spring) New York, NY, USA: ACM; 1967, 483-485. []

    Chapter  Google Scholar 

  27. Brigham E: Fast Fourier Transform and Its Applications. 1st edition. Prentice-Hall; 1988.

    Google Scholar 

  28. Verveer PJ: Computational and optical methods for improving resolution and signal quality in fluorescence microscopy. In Ph.D. dissertation. Technische Universiteit Te Delft; 1998.

    Google Scholar 

  29. Quammen CW, Feng D, Taylor RM II: Performance of 3D Deconvolution Algorithms on Multi-Core and Many-Core Architectures. University of North Carolina at Chapel Hill, Department of Computer Science, Tech. Rep 2009.

    Google Scholar 

Download references


This work was supported by the Ministry of Education of the Czech Republic (Projects No. MSM-0021622419, No. LC535, and No. 2B06052).

Author information

Authors and Affiliations


Corresponding author

Correspondence to Pavel Karas.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ original submitted files for images

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Karas, P., Svoboda, D. Convolution of large 3D images on GPU and its decomposition. EURASIP J. Adv. Signal Process. 2011, 120 (2011).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: