Unified commutation-pruning technique for efficient computation of composite DFTs

Castro-Palazuelos, David E.; Medina-Melendrez, Modesto Gpe.; Torres-Roman, Deni L.; Shkvarko, Yuriy V.

doi:10.1186/s13634-015-0285-z

Research
Open access
Published: 26 November 2015

Unified commutation-pruning technique for efficient computation of composite DFTs

David E. Castro-Palazuelos^1,2,
Modesto Gpe. Medina-Melendrez¹,
Deni L. Torres-Roman² &
…
Yuriy V. Shkvarko²

EURASIP Journal on Advances in Signal Processing volume 2015, Article number: 100 (2015) Cite this article

2469 Accesses
2 Citations
Metrics details

Abstract

An efficient computation of a composite length discrete Fourier transform (DFT), as well as a fast Fourier transform (FFT) of both time and space data sequences in uncertain (non-sparse or sparse) computational scenarios, requires specific processing algorithms. Traditional algorithms typically employ some pruning methods without any commutations, which prevents them from attaining the potential computational efficiency. In this paper, we propose an alternative unified approach with automatic commutations between three computational modalities aimed at efficient computations of the pruned DFTs adapted for variable composite lengths of the non-sparse input-output data. The first modality is an implementation of the direct computation of a composite length DFT, the second one employs the second-order recursive filtering method, and the third one performs the new pruned decomposed transform. The pruned decomposed transform algorithm performs the decimation in time or space (DIT) data acquisition domain and, then, decimation in frequency (DIF). The unified combination of these three algorithms is addressed as the DFT_COMM technique. Based on the treatment of the combinational-type hypotheses testing optimization problem of preferable allocations between all feasible commuting-pruning modalities, we have found the global optimal solution to the pruning problem that always requires a fewer or, at most, the same number of arithmetic operations than other feasible modalities. The DFT_COMM method outperforms the existing competing pruning techniques in the sense of attainable savings in the number of required arithmetic operations. It requires fewer or at most the same number of arithmetic operations for its execution than any other of the competing pruning methods reported in the literature. Finally, we provide the comparison of the DFT_COMM with the recently developed sparse fast Fourier transform (SFFT) algorithmic family. We feature that, in the sensing scenarios with sparse/non-sparse data Fourier spectrum, the DFT_COMM technique manifests robustness against such model uncertainties in the sense of insensitivity for sparsity/non-sparsity restrictions and the variability of the operating parameters.

1 Introduction

1.1 Motivation

Many signal processing applications require computation of the so-called pruned discrete Fourier transform (DFT), i.e., an efficient alternative to compute the required DFT when the input sequence and/or the required output sequences are smaller than the length of the full DFT (a full DFT means that all the output components are to be computed, and all the input elements are used to compute the transform); in the literature those are referred to as pruned fast Fourier transforms (FFTs) or pruned DFTs [1]. Common practical examples relate to, e.g., the least mean squared (LMS) optimal DFT-based pruned signal filtering [2], and the complexity-reduced computational implementation of the orthogonal frequency division multiplexing systems [3]. Another practical example relates to efficient implementation of the matched spatial filtering (MSF) algorithm for performing the range and azimuth data compression in unfocused of fractionally focused synthetic aperture radar (SAR) system that both employ the pruned DFT-based MSF processing of the trajectory data signals performed in a factorized fashion in the so-called slow time and fast time data acquisition scales [4–6]. Other examples relate to DFT-based analysis of remote sensing (RS) data acquired with a variety of sensor systems, ranging from seismology [7] to multispectral radiometry [8]. Other authors as Zhu et al. in [9] proposed an algorithm for performing SAR polar format re-gridding interpolation suited for the logic-in-memory paradigm (hardware/architecture solution) and to provide the necessary design automation tool chain to implement their proposed algorithm (e.g., FFTs for image formation) in advanced silicon technology. It is important to note that a majority of real-world RS data acquisition and processing problems can be qualified as sensing in harsh environments [4–8, 10, 11] in the sense of intrinsic problem model uncertainties peculiar for such RS modalities. In a context of pruned DFTs, realistic harsh sensing scenarios are characterized by the uncertainties attributed to zero-padded input data acquisition modes with variable composite length windowing of the input and/or output Fourier transform sequences, in general cases, with non-sparse Fourier spectra [10–12]. Those specifics motivate the development of efficient pruned DFT/FFT techniques particularly adapted for computational implementation with uncertain data acquired in harsh sensing scenarios.

1.2 Related work

Traditional DFT algorithms adapted for such uncertain scenarios typically employ some pruning methods without any commutations, which prevent them from attaining the potential computational efficiency. Most of the proposals reported in the literature are based on construction of pruning modalities of specific FFT-related algorithms. Some of them prune the input of a specific FFT algorithm, others prune the output, and just a few can prune the input and output (input-output) at the same time. Markel in [1], and Skinner in [13], proposed the input pruning methods based on a radix-2 FFTs, while Yuan et al., in [14], proposed an input pruning of a split-radix FFT. The approaches of Bouguezel et al. [15] and Fan et al. [16] are applicable for output pruning a radix-2 FFTs, while the Xu’s et al. [3] proposal suggests pruning the output of a split-radix FFT. In addition, Sreenivas et al., in [17], Roche, in [18], and Wang et al., in [19], developed the methods for pruning the input-output at the same time. The first one is based on a radix-2 FFT, the second one employs the split-radix FFT, and the third one performs the mixed-radix FFT, respectively. A majority of those methods are applicable only for computing DFTs with the length of a power of two that drastically restricts their applicability to general uncertain sensing scenarios.

On the other hand, a family of novel so-called sparse FFT (SFFT) algorithms adapted to computing the FFTs, when only a few Fourier spectrum coefficients of the input signal are different from zero (few largest coefficients of the Fourier transform spectrum), has been developed recently [20, 21]. The celebrated SFFT-related algorithms, so-called SFFTv1 and SFFTv2, were reported by Hassanieh et al., in [20]. Later, in [21], the improved SFFT-related versions, addressed as SFFTv3 and SFFTv4, were reported. Another algorithm that considers the Fourier spectrum sparsity restrictions is the so-called FADFT-2 reported and implemented in the AAFFT library [22]. However, the SFFT-related algorithms significantly outperform the AAFFT as it was corroborated in [20].

It is worthwhile to mention that the SFFT-related techniques are applicable only for the sparse sensing scenarios; e.g., referring to [20, 21], the authors exemplified the sparsity level by imposing the restriction that up to 89 % of the Fourier coefficients are zeroes or negligible, thus can be discarded. Such a restriction could be valid in a variety of data compressing applications, e.g., compression and recovery of video data not degraded by noise and/or imaging system instrumental function [20]. Nevertheless, the restriction on such sparsity is not valid for many real-world operational scenarios, e.g., processing of the RS data acquired in harsh sensing environments [4–8, 10–12]. For example, in SAR imaging of non-homogeneous scenes, e.g., urban areas, non-uniformly textured zones, etc., a majority of the Fourier transform coefficients should be considered for feature-enhanced MSF-based imaging [5, 6]; thus, an 89 % of sparsity level restriction is never a feasible model assumption.

In this paper, we are interested in developing the pruned DFT (DFTs of highly composite length) algorithms applicable for near-real-time signal processing and analysis in uncertain sensing scenarios (i.e., with non-guaranteed sparsity of the data Fourier spectra); that is why the family of the SFFT-related techniques is beyond our detailed study here. Nevertheless, for the purpose of generality, in Section 4, we perform comparative analysis of our developed methods with the SFFT under the same conditions and constraints for different combinations of the specified processing/operational parameters.

In the related literature, in which the feasible non-sparse scenarios are considered, two competing approaches for pruning the composite (no prime) length DFTs were addressed. Sorensen et al., in [23], proposed two methods to prune composite length DFTs, first one to prune the input and another one to prune the output. Next, the methodology of Medina-Melendrez et al., in [24], merges the methods developed originally in [23] to obtain a composite structure that is capable to prune the input and/or output of a general decomposed transform at the same time. It was demonstrated, in [24], that such a computational structure could be as efficient as the one based on specific FFT algorithms [15–17]. In [24], a new methodology for decomposition over a composite length DFT has been proposed as a modification of the Sorensen’s approach [23]. Furthermore, the [24] suggests, first, to perform decimation in frequency (DIF) and, second, a decimation in time (DIT). For processing of spatial data, the corresponding decimation in the space domain should be performed similarly to the DIT operation for time data processing. To avoid misunderstandings, in the rest of the paper, we will use the same abbreviation (DIT) for both processing models and consider the time data processing as a principal model. Nevertheless, all developments are directly transferable for the space data processing scenario.

Hence, the three basic stages to compute the composite length DFTs of non-sparse data encompass the input, the intermediate, and the output stages. The decomposed transform is then pruned by eliminating, from the input and output stages, additions and multiplications by zero, multiplications by one, and all other computations not needed to obtain the required Fourier transform coefficients. In [24], such the multistage decomposed and pruned transform is referred to as FFT_{DIF−DIT−TD} (here, that method is referred as DFT_{DIF−DIT−Pr}). Nevertheless, both methods addressed in [23, 24] do not achieve the lowest attainable number of the required arithmetic operations. A possible alternative for computing few Fourier coefficients from few input elements (all non-zero, thus non-sparse) can be addressed based on the application of the second-order Goertzel algorithm [23] modified to accept the input elements in a reverse order.

1.3 Novel contributions

The main contribution of this paper consists in the development of a new alternative method for efficient computing of a composite length DFT, when the input sequence and/or the required output sequence are smaller than the length of the full transform. Our proposal guarantees the same or smaller number of arithmetic operations in comparison with the competing methods in the literature. Moreover, it manifests robustness against sparsity/non-sparsity restrictions and the variability of the operating parameters as detailed in Sections 3 and 4.

The innovative idea is to automatically commute among three modalities to implement the DFT: the direct method, the recursive method, and the pruned decomposed transform. Thus, our new proposed composite approach unifies the decomposition of the DFT with its pruning. First, we develop an alternative technique to compute the pruned decomposed transform, in which the DIT is performed at the first stage followed by the DIF. We address this method as DFT_{DIT−DIF−Pr}. An analysis of the two alternatives (DFT_{DIT−DIF−Pr} and DFT_{DIF−DIT−Pr}) verifies that the DFT_{DIT−DIF−Pr} requires a smaller or as maximum equal number of arithmetic operations compared with the DFT_{DIF−DIT−Pr}, so the use of the DFT_{DIT−DIF−Pr} is strongly recommended when the decomposed and pruned transforms are required. Next, we demonstrate that our proposal requires a lower number of arithmetic operations than any of the pruning-based competing methods [3, 14, 23, 24]. Further, we demonstrate that both decomposed transforms (DFT_{DIF−DIT−Pr} and DFT_{DIT−DIF−Pr}) can be obtained from a general decomposition methodology. Also, it manifests the robustness in sparse and non-sparse sensing scenarios (i.e., operability for an arbitrary number of consecutive input elements (L _i), the number of consecutive outputs that should be computed (L _o), and the length of the full transform (N)) in contrast to the recently developed most prominent SFFT family-related methods [20, 21] operable in sparse scenarios only.

It is noteworthy to mention that in the majority of practical computational scenarios, significant savings in the number of arithmetic operations with the proposed technique are achieved, e.g., in Section 4.1, the DFT_COMM technique compared with split-radix FFT (SRFFT) algorithm produces savings of 42 to 92 %.

The rest of the paper is organized as follows: in Section 2, the general decomposition transform methodology is described and explained. An analysis of all feasible transform decomposition methods is presented next in Section 3 followed by the combinational hypotheses testing optimization-based selection of the best decomposition transform permutation modality that yields the unified commutation-pruning DFT_COMM technique. In Section 4, comparisons among the developed unified commutation-pruning technique and other competing algorithms in the sense of savings in the number of required arithmetic operations are presented and featured. Also, the proposed DFT_COMM method is compared in detail with the most prominent competing SFFT-related algorithms in the context of computing the DFTs in both sparse and non-sparse (harsh) sensing scenarios for different values of the operational parameters (L _i, L _o, and N). Concluding remarks in Section 5 summarize the study. The Appendix provides a pseudo-code for implementing the proposed method.

2 DFT transform decomposition

The definition of the DFT of a sequence of length N (DFT_N) is given by

$$ X(k)={\displaystyle \sum_{n=0}^{N-1}x(n){W}_N^{nk}}\kern0.5em \mathrm{f}\mathrm{o}\mathrm{r}\kern0.75em k=0,1,2,\dots, N-1 $$

(1)

where $ {W}_N^{nk}={e}^{-j2\pi nk/N} $ is the kernel of the transform. Let us define L _i as the number of consecutive input elements different from zero and L _o as the number of consecutive outputs that should be computed. If N is a composite number formed by multiplications of many integer factors, the DFT_N can be decomposed into smaller DFTs. In particular, the DFT_N can be decomposed into three stages of DFTs (an input stage, an intermediate stage, and an output stage) in order to avoid the arithmetic operations involving zeros, multiplications by one, and the operations not required to compute the final outputs. Here beneath, we briefly describe such feasible decompositions. Assuming that there are two integer factors, D _ip and D _op, of N such that N/D _ip D _op≡ P is an integer, the indexes n and k can be re-expressed as

$$ n={n}_1+{D}_{op}{n}_2+\left(\frac{N}{D_{ip}}\right){n}_3\kern1em \mathrm{f}\mathrm{o}\mathrm{r}\kern0.75em \left\{\begin{array}{l}{n}_1=0,1,2,\dots, {D}_{op}-1\\ {}{n}_2=0,1,2,\dots, {\scriptscriptstyle \raisebox{1ex}{$N$}\!\left/ \!\raisebox{-1ex}{${D}_{ip}{D}_{\mathrm{op}}$}\right.}-1\\ {}{n}_3=0,1,2,\dots, {D}_{ip}-1\end{array}\right. $$

(2)

$$ k={k}_1+{D}_{ip}{k}_2+\left(\frac{N}{D_{op}}\right){k}_3\kern1em \mathrm{f}\mathrm{o}\mathrm{r}\kern1em \left\{\begin{array}{l}{k}_1=0,1,2,\dots, {D}_{ip}-1\\ {}{k}_2=0,1,2,\dots, {\scriptscriptstyle \raisebox{1ex}{$N$}\!\left/ \!\raisebox{-1ex}{${D}_{ip}{D}_{op}$}\right.}-1\\ {}{k}_3=0,1,2,\dots, {D}_{op}-1.\end{array}\right. $$

(3)

Substituting n and k in (1) by (2), (3), the original DFT_N is decomposed into

$$ \begin{array}{l}X\left({k}_1+{D}_{ip}{k}_2+\frac{N}{D_{op}}{k}_3\right)={\displaystyle \sum_{n_1=0}^{D_{op}-1}{\displaystyle \sum_{n_2=0}^{P-1}{\displaystyle \sum_{n_3=0}^{D_{ip}-1}x\left({n}_1+{D}_{op}{n}_2+\frac{N}{D_{ip}}{n}_3\right)}}}\\ {}\kern10.5em \times {W}_N^{\left({n}_1+{D}_{op}{n}_2+\left(N/{D}_{ip}\right){n}_3\right)\left({k}_1+{D}_{ip}{k}_2+\left(N/{D}_{op}\right){k}_3\right)}.\end{array} $$

(4)

Here, it is assumed that D _ip and D _op are chosen in such a way that N/D _ip ≥ L _i and N/D _op ≈ L _o. Thus, index n ₃ is always equal to zero; k ₃ is near 0, hence (4) can be next rewritten as follows

$$ \begin{array}{l}X\left({k}_1+{D}_{ip}{k}_2+\frac{N}{D_{op}}{k}_3\right)={\displaystyle \sum_{n_1=0}^{D_{op}-1}{\displaystyle \sum_{n_2=0}^{P-1}x\left({n}_1+{D}_{op}{n}_2\right){W}_N^{\left({n}_1+{D}_{op}{n}_2\right)\left({k}_1+{D}_{ip}{k}_2+\left(N/{D}_{op}\right){k}_3\right)}}}\\ {}={\displaystyle \sum_{n_1=0}^{D_{op}-1}{\displaystyle \sum_{n_2=0}^{P-1}x\left({n}_1+{D}_{op}{n}_2\right){W}_N^{\left({n}_1{k}_1+{D}_{op}{n}_2{k}_1+{D}_{ip}{n}_1{k}_2+{D}_{ip}{D}_{op}{n}_2{k}_2+{n}_1{k}_3N/{D}_{op}+{n}_2{k}_3N\right)}}}.\end{array} $$

(5)

The computation of (5) is more efficient than the direct computation of the DFT_N since the complex arithmetic operations dependent on n ₃ have been pruned. The complex exponential in (5) can next be grouped in different ways, resulting in different structures for the pruned decomposed transform. The methodology of [24] suggests expressing the pruned decomposed transform as

$$ \begin{array}{l}X\left({k}_1+{D}_{ip}{k}_2+\frac{N}{D_{op}}{k}_3\right)={\displaystyle \sum_{n_1=0}^{D_{op}-1}\left\{{\displaystyle \sum_{n_2=0}^{P-1}\left[{W}_N^{\left(\left({n}_1+{D}_{op}{n}_2\right){k}_1\right)}x\left({n}_1+{D}_{op}{n}_2\right)\right.}\right.}\\ {}\kern10.5em \times \left.\left.{W}_{N/{D}_{ip}{D}_{op}}^{n_2{k}_2}\right]\right\}{W}_N^{\left({n}_1\left({D}_{ip}{k}_2+{k}_3N/{D}_{op}\right)\right)}.\end{array} $$

(6)

The pruned decomposed transform of (6) can be interpreted as follows: first, apply DIF to the DFT_N with D _ip as a decomposition factor, then, DIT to the resulting DFTs with D _op as a decomposition factor and, finally, perform the pruning. In [24], the pruned decomposed transform of (6) was addressed as an FFT_{DIF−DIT−TD} modality, that in our notations, we refer to as DFT_{DIF−DIT−Pr}. A computational diagram of such technique (6) is presented in Fig. 1.

An alternative grouping of the complex exponentials in (5) yields

$$ \begin{array}{l}X\left({k}_1+{D}_{ip}{k}_2+\frac{N}{D_{op}}{k}_3\right)={\displaystyle \sum_{n_1=0}^{D_{op}-1}\left\{{\displaystyle \sum_{n_2=0}^{P-1}\left[{W}_N^{\left({D}_{op}{n}_2{k}_1\right)}x\left({n}_1+{D}_{op}{n}_2\right){W}_P^{n_2{k}_2}\right]}\right\}}\\ {}\kern10.25em \times {W}_N^{\left({n}_1\left({k}_1+{D}_{ip}{k}_2+{k}_3N/{D}_{op}\right)\right)}\\ {}\kern9em ={\displaystyle \sum_{n_1=0}^{D_{op}-1}\left\{{\displaystyle \sum_{n_2=0}^{P-1}\left[y\left({n}_1,{n}_2,{k}_1\right){W}_P^{n_2{k}_2}\right]}\right\}}\\ {}\kern10.25em \times {W}_N^{\left({n}_1\left({k}_1+{D}_{ip}{k}_2+{k}_3N/{D}_{op}\right)\right)}.\end{array} $$

(7)

The computing of the pruned decomposed transform (7) requires, first, application of DIT to the DFT_N with D _op as a decomposition factor and, then, application of DIF to the resulting DFTs with D _ip as a decomposition factor.

Hence, we refer to the pruned decomposed transform of (7) as a DFT_{DIT−DIF−Pr} modality. A computational diagram of such the technique (7) is presented in Fig. 2.

The DFT_{DIT−DIF−Pr} involves three processing stages: an input stage (computation of y(n ₁, n ₂, k ₁)), an intermediate stage (computation of D _ip D _op DFTs of length P), and an output stage (computation of the complex multiplications and additions dependent on index n ₁).

3 Proposed method

Our method employs three different alternatives to compute the DFT_N: a direct method, a recursive method, and/or a pruned decomposed transform. Admissible permutations/allocations of all feasible decomposition-pruning modalities compose all possible hypotheses regarding the feasible alternative schemes for computing the composite DFTs.

All feasible commuting-pruning implementation structures are listed in Table 1. Those could be addressed as possible search “hypotheses” to be tested. Thus, the problem of selection of an optimal computing-pruning implementation structure can be recast as a hypotheses testing task. All feasible hypotheses relate to formal implementation structures specified in Table 1. Four of them prescribe cascade computational implementation involving cascade combinations of structures (hypotheses H₄,…, H₇), while four others (hypotheses H₉,…, H₁₂) prescribe combinational unions of the previous hypotheses. It is important to remark that (1), (6), and (7) are the mathematical definitions of H₈, H₄, and H₅, respectively. Hence, the decision-making process that is a selection from those feasible operational prescriptions cannot be formalized as an optimization strategy for minimization of some cost function subject to relevant restrictions/constraints specified in a closed analytical form. Thus, due to the composite combinations (hypotheses over hypotheses with cascade interlaces, as in the cases of hypotheses H₉,…, H₁₂), the proper selection of the preferable implementation structure cannot be cast as an analytically tractable closed-form optimization problem. Hence, it should be treated as a test of combinations of hypotheses (hypotheses over hypotheses, as in the case of H₉,…, H₁₂), sometimes referred to as a combinational (or combinatorial-type) hypothesis testing problem [23, 24]. The global optimal solution to such a kind of problems presumes test of all feasible hypotheses in the list, making the decision in favor of the best one (in the prescribed quality measure), and rejection of all other competing hypotheses [23, 24]. In our particular case, only 12 hypotheses are admissible/feasible; thus the (global) optimal selection of the best possible implementation structure can be found simply via employing the so-called brute force search over complete hypotheses list specified in Table 1.

Table 1 Complete list of hypotheses $ {\left\{{\mathrm{H}}_h\right\}}_{h=1}^{12} $ regarding feasible commuting-pruning implementation structures

Full size table

Sorensen et al. [23] sketched how to prune the input and output of DFTs using independent allocations listed in Table 1 as H₁, H₂, and H₃ and featured in Fig. 3a. However, the authors of [23] concluded that their pruning method is less efficient than other pruning methods in the cases when both the number of input and output elements are bounded. They recommended turning to the method proposed by Sreenivas et al., in [17], i.e., to prune the input and output of a power of two length FFTs. Furthermore, an efficient input-output pruning method for a power of two length FFTs was proposed by Roche in [18].

Later, a more efficient input-output pruning method for composite length DFTs was developed in [24]. Such commuting between H₄ ∪ H₆ leads to hypothesis H₉ as featured in Fig. 3a. In [24], such a technique was constructed as a modification of the transform decomposition proposed originally by Sorensen et al., in [23], but with extra capability to perform the input-output pruning at the same time. Additionally, the computation of each final output employs a commutation between a direct method and the 2BF filtering algorithm, i.e., the 2BF-filtering algorithm is an efficient method for computing a subset of final outputs from their decomposition transform [23, 24].

In our study, two additional feasible hypotheses are devised to perform unified commutation-pruning techniques for efficient computations of composite length DFTs (hypotheses H₁₁ and H₁₂) as reported in Fig. 3b. Therefore, our proposal relates to an adaptive commuting between feasible implementation structures specified by the union of hypotheses H₁₀ ∪ H₃ ∪ H₈ that is included in Table 1 as an alternative composite hypothesis H₁₂. A comparison of computational complexities related to implementation of the competing computational structures formalized by hypotheses H₉ and H₁₀ (in the number of required arithmetical operations) is reported in Table 2. Also, the relevant comparisons between two other feasible structures specified by hypotheses H₁₁ and H₁₂(referred here as DFT_{COMM−DIF−DIT−Pr} and DFT_{COMM−DIT−DIF−Pr}, respectively), are reported in Tables 2 and 3 and Figs. 5a–f (in the sense of the number of required arithmetic operations).

Table 2 Total number of arithmetic operations required to compute the input and output stages of DFT_{DIF−DIT−Pr} and DFT_{DIT−DIF−Pr}

Full size table

Table 3 Total number of arithmetic operations required to compute the input and output stages of DFT_{COMM−DIF−DIT−Pr} and DFT_{COMM−DIT−DIF−Pr} modalities

Full size table

The selection of proper permutation/allocation structure directly relates to the considered above problem of selection of an optimal commutation-pruning implementation structure casted and treated as a combinational hypotheses testing task. All feasible hypotheses $ {\left\{{\mathrm{H}}_h\right\}}_{h=1}^{12} $ relate to formal implementation structures specified in Table 1. Now, we are ready to find the best permutation/allocation structure in the sense of the imposed quality measure (in our case in the sense of the lowest possible number of required arithmetical operations).

3.1 Analysis of the hypotheses

Let us analyze, first, the pruned decomposed transform and deduce whether the direct or recursive method would be preferable. The total number of arithmetic operations (OPER_tot) required by the DFT_{DIF−DIT−Pr} and the DFT_{DIT−DIF−Pr} depends on the number of operations needed to be performed to implement the input stage (OPER_input), the output stage (OPER_output), and the intermediate stage (D _ip D _opOPER_DFTP), correspondingly. Thus, one could express OPER_tot of both pruned decomposed transforms as

$$ \mathrm{O}\mathrm{P}\mathrm{E}{\mathrm{R}}_{\mathrm{tot}}=\mathrm{O}\mathrm{P}\mathrm{E}{\mathrm{R}}_{\mathrm{input}}+{D}_{ip}{D}_{op}\mathrm{O}\mathrm{P}\mathrm{E}{\mathrm{R}}_{\mathrm{DFTP}}+\mathrm{O}\mathrm{P}\mathrm{E}{\mathrm{R}}_{\mathrm{output}}. $$

(8)

According to (8), OPER_tot depends on L _i, L _o, N, D _ip, D _op, and the algorithm employed to implement the D _ip D _op DFT_P blocks (OPER_DFTP).

At the input and output stages, there are multiplications by one, so those multiplications are avoided at all in our approach. Also, the multiplications by one at the input stage are also avoided depending on whether DFT_{DIF−DIT−Pr} or DFT_{DIT−DIF−Pr} was executed in the particular employed pruned decomposed transform modality.

If the DFT_{DIF−DIT−Pr} modality is employed (see the general diagram in Fig. 1), then:

At the input stage, the multiplications by one are excluded when n ₁ = n ₂ = 0 and k ₁ = 0.
Furthermore, the multiplications by one at the output stage are also avoided when n ₁ = 0 or k ₂ = 0.

Therefore, the DFT_{DIF−DIT−Pr} modality always requires fewer complex multiplications to compute the output stage than the DFT_DIT-DIF-Pr modality (this is reported in Tables 2 and 3).

On the other hand, if the DFT_DIT-DIF-Pr modality is used (see Fig. 2), then:

At the input stage, the multiplications by one are excluded when n ₂ = 0 or k ₁ = 0.
Also, at the output stage, the multiplications by one are avoided when k ₁ = k ₂ = 0 or n ₁ = 0.

Therefore, the DFT_DIT-DIF-Pr modality always requires fewer complex multiplications at the input stage than the DFT_{DIF−DIT−Pr} modality (as it is corroborated in the analysis reported in Tables 2 and 3).

The output stage of both pruned decomposed transform modalities can be computed by the direct addition of complex multiplications or a kind of recursive algorithm as those proposed in [23] (referred to as the 2BF filtering method), which reduces the number of required multiplications by about half. The number of arithmetic multiplications required by the output stage of the DFT_{DIF−DIT−Pr} algorithm is equal to 4 (L _o − D _ip) (D _op − 1) when (L _o > D _ip) and (D _op < 4). Next, the number of arithmetic multiplications is equal to (L _o − D _ip) (2D _op + 2) when (L _o > D _ip) and (D _op ≥ 4). Thus, the 2BF filtering algorithm can be effectively used to compute the output stage.

On the other hand, the number of arithmetic multiplications required to compute the output stage of the DFT_DIT-DIF-Pr algorithm is equal to 4(L _o − 1) (D _op − 1) when (D _op < 4); and the number of arithmetic multiplications is equal to (L _o − 1) (2D _op + 2) when (D _op ≥ 4). Thus, the 2BF filtering algorithm can also be effectively employed to compute the output stage.

In [23], it was proven that the 2BF filtering method is more efficient than the direct addition of complex multiplications when the number of input elements is larger than 4 (when the number of input elements is equal to 4, both methods manifest the same operational complexity performances). The output stages of both pruned decomposed transforms have the same structures, so same sort of commutations is required to efficiently compute the output stage of the DFT_{DIT−DIF−Pr}. The expressions for OPER_input and OPER_ouput for the DFT_{DIF−DIT−Pr} and the DFT_{DIT−DIF−Pr} are listed in Table 2, where it is implicitly assumed that each complex multiplication requires six arithmetic operations (four real multiplications and two real additions), and each complex addition requires two arithmetic operations (two real additions).

The performances of the pruned decomposed transforms depend on the decomposition factors, D _ip and D _op. A simple analysis can be carried out to deduce which decomposition factors are preferable to be used. Our unified commutation-pruning method performs the decomposition of the DFT_N into three stages of smaller dimension DFTs and pruning part of those inputs that are equal to zero and/or part of those outputs that are not needed to compute the final Fourier coefficients.

Thus, the decomposed transform algorithm always selects a pair (D _ip, D _op) for which the largest DFTs could be successfully pruned, or equivalently, a pair (D _ip, D _op) for which the intermediate stage results in the smallest dimension DFTs.

The DFTs of the intermediate stage have a size of N/D _ip D _op ≡ P, so D _ip and D _op should be chosen as large as possible. Furthermore, the values for the decomposition factors should satisfy the bound N/D _ip ≥ L _i (where, N/D _ip must be close to but higher than L _i) and N/D _op ≈ L _o, as it was considered in the derivation of (5). Hence, the pair of decomposition factors (D _ip, D _op) closest to (N/L _i, N/L _o) that satisfy D _ip ≤ N/L _i are used by the decomposed transform algorithm, according to the proximity evaluated by its Euclidean distance.

Let us now consider the cases when the number of input elements (L _i) or the number of the required Fourier coefficients (L _o) is too small. In these cases, for the both modalities, the general diagrams presented in Figs. 2 and 1 clarify the following features of the DFT_{DIT−DIF−Pr} and the DFT_{DIF−DIT−Pr} algorithms, respectively.

If L _i ≤ D _op, at most one input of each DFT_P (i.e., the first one) in the intermediate stage would be applied; therefore, their P outputs would be replicas of that single input.
For L _o ≤ D _ip, only the first output of each DFT_P (this corresponds to a simple addition of the input elements) is required to compute the final Fourier coefficients.

Thus, inefficient implementations of the DFT_Ps yield the inequality-type constraints Li ≤ D _op or L _o ≤ D _ip. In these cases, our method commutes to efficiently perform the direct computation of the DFT_N or an efficient recursive alternative (via performing the 2BF filtering technique).

Sorensen et al., in [23], proposed a method to compute a subset of the output components of their proposed specific DFT decomposition; this algorithm was referred to as a 2BF filtering method. The 2BF filtering method [23] was derived as a modification of the previously addressed Goertzel algorithm [25]. The 2BF filtering method takes advantages of the periodicity and the shifted cyclic convolution shape between the input sequence and the $ {W}_N^{nk}={e}^{-j\left(2\pi /N\right)kn} $ factor.

The transfer function H(z) of a system that performs the 2BF filtering method is given by the equation

$$ H(z)=\frac{z^{-1}\left(1-{z}^{-1}{W}_N^{-k}\right)}{1-2 \cos \left(\frac{2\pi k}{N}\right){z}^{-1}+{z}^{-2}} $$

(9)

The corresponding algorithmic diagram of the second-order 2BF method is presented in Fig. 4. Thus, (9) is the mathematical definition of H₃.

The poles of the system transfer function (the roots of the polynomial in the denominator of H(z)) have to be evaluated L times (n = 0, 1, 2,…, L − 1), while the zeros of the system transfer function (the roots of the numerator of H(z)) only once. Here, L represents the number of consecutive non-zero input elements of the 2BF filter; i.e., in the opposite case, it represents the number of consecutive non-zero output components of the employed pruned decomposed transform modality (DFT_{DIF−DIT−Pr} or DFT_{DIT−DIF−Pr}).

The computation of each pole of (9) requires two arithmetic multiplications (two real multiplications) and two arithmetic additions (two real additions). Furthermore, the computation of the zeros of (9) requires four arithmetic multiplications and four arithmetic additions only.

The Q ₁ node in Fig. 4 is initialized with f(L − 1); therefore, the computation starts from n = L − 2. When n = 0, the complex addition of the input is only required; then, the zero is computed after such a delay. Such computational organization saves two arithmetic multiplications and six arithmetic additions for finding of each required output component.

The 2BF filtering method employed to compute the output components required by the pruned decomposed transform performed by the DFT_{COMM−DIF−DIT−Pr} or the DFT_{COMM−DIT−DIF−Pr} algorithm can be featured as the following multistage procedure:

The structure of the DFT_{DIF−DIT−Pr} contains D _ip sets of D _op DFT_Ps from which the final outputs are computed (see the general diagram in Fig. 1).
The DFT_{COMM−DIF−DIT−Pr} algorithm employs the 2BF filtering method to implement the output stage of DFT_{DIF−DIT−Pr} with L = L _o, if ( (L _i > D _op) & (L _o > D _ip) ) & ( (L _o > D _ip)&(D _op ≥ 4) ) (as featured in Tables 2 and 3). Here, the required arithmetic operations are specified as follows: the number of arithmetic multiplications are equal to NumArithMult _2BF = (L _o − D _ip)(2D _op + 2) and the number of arithmetic additions are equal to NumArithAdd _2BF = 2 D _ip(D _op − 1) + (L _o − D _ip)(4D _op − 2).
Furthermore, the DFT_{COMM−DIF−DIT−Pr} algorithm employs the 2BF filtering method exclusively with L = L _i, if ( (L _i ≤ D _op) | (L _o ≤ D _ip) ) & (L _i ≥ 4) (as featured in Table 3) to compute the required Fourier coefficients. Here, the required arithmetic operations are specified as follows: NumArithMult _2BF = (L _o − 1)(2L _i + 2) and NumArithAdd _2BF = 2(L _i − 1) + (L _o − 1)(4L _i − 2).

In contrast, the DFT_{COMM-DIT-DIF-Pr} algorithm differs from the abovementioned in the following features:

The structure of the DFT_DIT-DIF-Pr contains D _op sets of D _ip DFT_Ps from which the final outputs are computed (as featured in Fig. 2).
The DFT_{COMM-DIT-DIF-Pr} algorithm employs the 2BF filtering method to implement the output stage of DFT_DIT-DIF-Pr with L = L _o, if ( (L _i > D _op) & (L _o > D _ip) ) & (D _op ≥ 4), (as featured in Tables 2 and 3). Here, the required arithmetic operations are specified as follows: NumArithMult _2BF = (L _o − 1)(2D _op + 2), and NumArithAdd _2BF = 2 (D _op − 1) + (L _o − 1)(4D _op − 2).
On the other hand, the DFT_{COMM-DIT-DIF-Pr} algorithm employs the 2BF filtering method exclusively with L = L _i, if ( (L _i ≤ D _op) | (L _o ≤ D _ip) ) & (L _i ≥ 4) (as reported in Table 3) to compute the required Fourier coefficients. Here, the required arithmetic operations are specified as follows: NumArithMult _2BF = (L _o − 1)(2L _i + 2) and NumArithAdd _2BF = 2(L _i − 1) + (L _o − 1)(4L _i − 2).

The computation of each input and/or output element in both cases detailed above is executed according to the diagram presented in Fig. 4. In closing, we note that the pseudo-code presented in the Appendix (see Fig. 9) contains all scripts needed to compute each Fourier coefficient employing the 2BF filtering method.

Note once again that the 2BF filtering method has to be employed if L _i is larger or equal to 4, in which case, it manifests a higher efficiency than the direct method for computing the DFT_N in (1). The total number of arithmetic operations required by our proposed method is reported in Table 3.

3.2 Selection of the permutation/allocation structure

In Fig. 5, the total number of required arithmetic operations to compute the DFT_{DIF−DIT−Pr} from [24] (H₉), DFT_{COMM−DIF−DIT−Pr} (H₁₁), and DFT_{COMM−DIT−DIF−Pr} (H₁₂) modalities are plotted for different values of L _i and L _o for the test examples with N = 8192 and N = 6561 (It is assumed that the DFT_Ps are implemented employing the split-radix algorithm from [26] for N = 8192 and employing the radix-3 algorithm from [27] for N = 6561.) All the competing alternatives corresponding to three feasible arrangements (H₉, H₁₁, and H₁₂) in the considered permutation/allocation structure are featured in Fig. 5. The DFT_{DIF−DIT−Pr} or the DFT_{DIT−DIF−Pr} could be used to implement the pruned decomposed transform in the DFT_{COMM−DIF−DIT−Pr} and DFT_{COMM−DIT−DIF−Pr} techniques. Here, the D _ip and D _op values are the pair specified by the rough selection method (the proximity evaluated by its Euclidean distance is referred as roughDP) and those obtained by an exhaustive search method (the total numbers of operations required to implement the DFT_{DIF−DIT−Pr} and the DFT_{DIT−DIF−Pr} were evaluated for each possible pair of (D _ip, D _op), and, then, the pair (D _ip, D _op) with the best performance metric is selected; this selection method is referred as exhDP). Fig. 5a–f demonstrate that two commutation-pruning techniques (related to hypotheses H₁₁ and H₁₂) require the same or smaller number of arithmetic operations than that specified by hypothesis H₉. Next, it is necessary to make a choice between H₁₁ and H₁₂.

Graphs in Fig. 5 indicate that the number of operations required to perform our commutation-pruning technique (DFT_{COMM−DIF−DIT−Pr} and DFT_{COMM−DIT−DIF−Pr}) with the selected decomposition factors using the roughDP method are equal to or slighty greater than those, in which the decomposition factors are specified employing exhDP. The differences correspond to the regions where the commutation conditions prescribe performing the pruned decomposed transform instead of the 2BF filtering method.

The DFT_{DIT−DIF−Pr} modality requires the same or a smaller number of arithmetic operations than the competing DFT_{DIF−DIT−Pr} for all the cases where the pruned decomposed transform is performed (as it follows from the data reported in Fig. 5). Since the same decomposition factors (D _ip, D _op) are used in both pruned decomposed transforms, it is sufficient to compare the number of required operations by their input and output stages (OPER_input + OPER_output) reported in Table 2 to distinguish which one is the most efficient. The comparison for the cases L _i ≤ D _op and L _o ≤ D _ip is not needed since in such scenarios, a direct or recursive method is employed instead of a pruned decomposed transform. For scenarios with D _op < 4, both pruned decomposed transforms require the same number of arithmetic operations for their execution. Otherwise, for D _op ≥ 4, the execution of DFT_{DIF−DIT−Pr} requires 2D _ip D _op − 8D _ip − 2D _op + 8 more arithmetic operations than DFT_{DIT−DIF−Pr} demonstrating that the latter manifests always the same or a better performance. Thus, from the combinational permutation analysis, it follows that it is always desirable to perform the DFT_{DIT−DIF−Pr} when a pruned decomposed transform would be required. In the following section, an efficient implementation of that proposed unified commutation-pruning technique is detailed considering that the pruned decomposed transform is implemented using the DFT_{DIT−DIF−Pr}. In summary, we now resume that the performed combinational hypothesis testing-based optimal selection of the preferable computational structure of the decomposed DFTs made the decision in favor of hypothesis H₁₂; this yields the proposed DFT_{COMM−DIT−DIF−Pr} method (referred further on for simplicity as DFT_COMM) with the highest possible computational efficiency. Being the optimal decision of the performed “brute force search” based testing of all feasible hypotheses, this method is guaranteed to be globally optimal one and thus is strongly recommended for performing the required commuting between three techniques to implement the overall composite DFT in the following arrangement mode: the direct method, the recursive method, and the pruned decomposed transform implemented via DFT_{DIT−DIF−Pr}.

4 Comparison with other competing algorithms

A variety of competing methods for pruning the DFTs in arbitrary (non-sparse) computational scenarios have been addressed in the literature (see [1, 3, 13–19, 23, 24]). In [24], the FFT_{DIF−DIT−TD} modality (that we here refer to as DFT_{DIF−DIT−Pr}) was proposed as an alternative technique for pruning the input and/or the output of DFTs. That method [24] was compared with other pruning techniques reported in the literature until 2009. Comparisons of the methods proposed by Bouguezel et al. [15], Fan et al. [16], Sreenivas et al. [17], Roche [18], and the DFT_{DIF−DIT−Pr} reported in [24] demonstrated that the DFT_{DIF−DIT−Pr} modality requires fewer arithmetic operations than those of [15–17], while attaining the operational performances similar to that of [18]. Additionally, in Section 3, it was corroborated that our proposed DFT_COMM technique requires equal or less arithmetic operations than [24]. Here beneath, we compare our approach with the recently reported most prominent competing pruning methods.

4.1 Comparisons with pruning-based algorithms

The first competing algorithm for pruning the output of a SRFFT was reported in [3]. That so-called SRFFT_pruning algorithm was developed for an implicit restriction that only a few consecutive output components (a number L equal to a power of two) are required. Fig. 6 reports the number of arithmetic operations required to perform SRFFT_pruning in comparison with our unified DFT_COMM method for multiple output pruning examples using the decomposition factors (D _ip, D _op) evaluated via the roughDP method and those specified by the exhDP method, respectively.

In both cases, it is considered that the DFTs of length P required by the intermediate stage of the pruned decomposed transform have been implemented by applying the split-radix FFT, e.g., [26]. Therefore, the total number of arithmetic operations required by our proposed DFT_COMM method in comparison with the competing pruning-based algorithms can be found in Table 4. The savings in the number of arithmetic operations attained with the new developed DFT_COMM technique are reported in Tables 5 and 6.

Table 4 Total number of arithmetic operations required to compute the SRFFT_pruning, SRFFT_{pruning-time-shift}, and DFT_COMM algorithms

Full size table

Table 5 Savings in the number of arithmetic operations attained with the DFT_COMM algorithm in comparison with the competing SRFFT (noprun) and SRFFT_pruning methods for N = 262,144

Full size table

Table 6 Savings in the number of arithmetic operations attained with the DFT_COMM algorithm in comparison with the competing SRFFT (noprun) and SRFFT_pruning methods for N = 1024

Full size table

From Fig. 6, one can deduce that our proposed DFT_COMM method requires fewer arithmetic operations than the competing SRFFT_pruning method in almost all the test cases (with the only one exception for the case L _o = N/2 and L _o = N/4). Next, Tables 5 and 6 report the savings in the number of arithmetic operations attained with our DFT_COMM in comparison with the competing SRFFT and the SRFFT_pruning techniques. In the scenarios with L _o = N and L _i = {2¹, 2²,…, N}, the DFT_COMM algorithm manifests 2.96 and 2.73 % savings in the number of arithmetic operations in comparison with the SRFFT_pruning for N = {262,144, 1024}, respectively.

In other cases, from Table 5, it follows that in the scenarios with L _i = 1027, L _i = 33, and L _o = {2¹, 2²,…, N}, the SRFFT_pruning method fails to deliver a result at all. Thus, from Table 5, it follows that in the cases when L _i = N = 262,144, L _i = 1027, L _i = 33, and L _o = {2¹, 2²,…, N}, the DFT_COMM algorithm produces savings of 42.76, 75.02, and 91.35 %, respectively, in the number of arithmetic operations required to compute the composite length DFT in comparison with the competing SRFFT algorithm. Furthermore, from Table 6, it follows that in the scenarios with L _i = 90, L _i = 13, and L _o = {2¹, 2²,…, N}, the SRFFT_pruning method fails to deliver a result at all. Thus, from Table 6, it follows that in the cases when L _i = N = 1024, L _i = 90, L _i = 13, and L _o = {2¹, 2²,…, N}, the DFT_COMM algorithm produces savings of 36.48, 59.30, and 81.65 %, respectively, in the number of arithmetic operations required to compute the composite length DFT in comparison with the competing SRFFT algorithm.

Yuan et al., in [14], proposed another competing, the so-called SRFFT_{pruning−time−shift} method via modifying the SRFFT_pruning employing a time shifting approach that yields the input pruning algorithm based on the SRFFT methodology for L consecutive non-zero input elements. It is noteworthy to stress that the SRFFT_{pruning−time−shift} approach implicitly assumes that lengths L and N may take values equal to the power of two only.

Figure 7 reports the number of required arithmetic operations to execute our proposed unified DFT_COMM method and those required by the competing pruned DFTs of [14]. These results verify that our approach requires fewer arithmetic operations than those required to perform the SRFFT_{pruning−time−shift} algorithm in all the reported tests. Again, it is implicitly assumed that the DFTs of length P involved in the DFT_{DIT−DIF−Pr} used by our DFT_COMM have been computed using the split-radix FFT [26], as reported in Table 4.

Next, Tables 7 and 8 report the savings in the number of arithmetic operations attained with our DFT_COMM in comparison with the competing SRFFT and the SRFFT_{pruning−time−shift} techniques. In the scenarios with L _o = N and L _i = {2¹, 2²,…, N}, the DFT_COMM algorithm manifests 5.11 and 8.71 % savings in the number of arithmetic operations in comparison with the SRFFT_{pruning−time−shift} for N = {262,144, 1024}, respectively.

Table 7 Savings in the number of arithmetic operations attained with the DFT_COMM algorithm in comparison with the competing SRFFT (noprun) and SRFFT_{pruning-time-shift} methods for N = 262,144

Full size table

Table 8 Savings in the number of arithmetic operations attained with the DFT_COMM algorithm in comparison with the competing SRFFT (noprun) and SRFFT_{pruning-time-shift} methods for N = 1024

Full size table

In other test cases, from Tables 7 and 8, it follows that for L _o = {1027, 90}, L _o = {33, 13}, and L _i = {2¹, 2²,…, N}, the SRFFT_{pruning−time−shift} algorithm fails to deliver a result at all. Furthermore, from Table 7, it follows that in the scenarios with L _o = {N, 1027, 33} and L _i = {2¹, 2²,…, N}, our DFT_COMM attains 43.26, 76.24, and 92.11 % savings for N = 262,144, respectively, in the number of arithmetic operations required to compute the composite length DFT. In addition, from Table 8, it follows that in the scenarios with L _o = {N, 90, 13} and L _i = {2¹, 2²,…, N}, our DFT_COMM attains 38.22, 59.22, and 82.45 % savings for N = 1024, respectively, in the number of arithmetic operations required to compute the composite length DFT.

Note that our DFT_COMM always requires fewer arithmetic operations than the competing SRFFT_pruning and SRFFT_{pruning−time−shift} algorithms due to the different butterfly schemes employed to implement the split-radix FFT algorithms [26] and the unified commutation-pruning technique employed (see Section 3). The SRFFT_pruning and SRFFT_{pruning−time−shift} algorithms perform the two-butterfly scheme [26], while our DFT_{DIT−DIF−Pr} algorithm employs the three-butterfly scheme to achieve a reduction in the number of arithmetic operations required to implement the DFT_P blocks. Furthermore, graphs of Fig. 6 report that the SRFFT_pruning algorithm fail to deliver a result at all in the scenarios with L equal to N due to their algorithmic construction as reported by the authors of [3]. For this reason, this algorithm cannot present a valid value for the last test of L _o (it is simply unable to stop to prune at all). In addition, Fig. 6 reports minimal differences between the numbers of arithmetic operations attained by the DFT_COMM evaluated using the roughDP- or exhDP-based selection for specifying D _ip and D _op. In summary, the number of arithmetic operations required to compute the SRFFT_pruning, SRFFT_{pruning-time-shift}, and DFT_COMM algorithms can be found in Table 4.

4.2 Comparison with the SFFT-related algorithms

In a context of pruned DFTs, real-world sensing scenarios are characterized by the uncertainties attributed to zero-padded input data acquisition modes with variable composite length windowing of the input and/or output Fourier transform sequences, in general cases, with non-sparse Fourier spectrum [10–12]. In contrast, the celebrated SFFT method developed and featured in [20] presumes “sparsity” of the Fourier spectrum that requires that majority of the Fourier coefficients are zeros or negligible; e.g., the authors of [20] exemplified such sparsity level at approximately 89 %, i.e., up to 89 % of the Fourier transform coefficients are to be zeroes or negligible for operability of their SFFT. Otherwise, the DFT should be specified and treated as a non-sparse transform.

Currently, a family of novel efficient algorithms for computing the FFTs applicable for sparse sensing scenarios when only a few Fourier transform coefficients (k _s largest coefficients of the N-length Fourier transform) of the input signal x are different from zero have been developed [20, 21], which compose a family of the so-called SFFT methods. To compute a reliable SFFT for typical high N > 2¹⁰, the sparsity level constraint requires that majority of the Fourier coefficients are zeros [20] (or negligible to be discarded). Such model assumptions are valid, for example, in video compressing applications [20]. Therefore, if majority of the Fourier transform coefficients are supposed to be zeros or can be discarded, then efficient computing techniques from the SFFT family can be employed. The celebrated algorithms from such a family are the SFFTv1 and the SFFTv2 developed and featured in [20] where the sparsity level was exemplified at 89 % of zero (negligible) Fourier coefficients. In [21], the SFFTv3 and SFFTv4 algorithms were proposed, where some computational improvements were introduced. SFFTv3 was implemented in [28] while the program code for implementation of the SFFTv4 algorithm is not available at this time. Another competing technique for computing of the FFT of sparse (in the frequency domain) signals was addressed in [22] as the so-called FADFT-2 algorithm from the AAFFT library [22]. However, in [20, 21], it was corroborated that the SFFT-related algorithms manifest better operational performances than FADFT-2 of [22].

To perform valid test comparisons between the SFFTv1, SFFTv2, SFFTv3, and the DFT_COMM algorithms, those should be tested under the same conditions and constrains. Here, we use the following feasible constraints: the values of N vary as follows: N = {2⁶, 2⁷,…, 2²⁰} and k _s = L _o, where L _o represents the number of consecutive output coefficients to be calculated. In different test scenarios, the SFFTv1, SFFTv2, and SFFTv3 algorithms deliver successful results: the first of them for N = {2¹³, 2¹⁴,…, 2²⁰} and k _s = L _o = 50, the second of them for N = {2¹³, 2¹⁴,…, 2²⁰} and k _s = L _o = 50, and finally, the third of them for N = {2¹⁰, 2¹¹,…, 2²⁰} and k _s = L _o = 50, respectively. Furthermore, it was experimentally corroborated that the DFT_COMM algorithm was able to deliver efficient results in all such tested sparse scenarios, as reported in Table 9.

Table 9 Comparisons of the SFFTv1, SFFTv2, SFFTv3, and DFT_COMM algorithms for different sizes (N) of the signal x, with N = L _i = {2⁶, 2⁷,…, 2²⁰} and k _s = L _o = 50

Full size table

In addition, DFT computations for other sparse test scenarios with different values of N and k _s were run, in particular, for N = L _i = {2¹³, 2¹⁴,…, 2¹⁷} and k _s = L _o = {1, 2, …, k _smax} with k _smax = 11 % of N. The test scenarios for the SFFT algorithms delivered successful results only for a few tested values of k _s. For example, the SFFTv1 algorithm is executed successfully for N = {2¹³, 2¹⁵} and k _s = {1, 2,…, 50}, for N = 2¹⁴ and k _s = {1, 2, …, 50} ∪ {56, 57, …, 63}, for N = 2¹⁶ and k _s = {1, 2, …, 50} ∪ {64, 65, …, 97}, and for N = 2¹⁷ and k _s = {1, 2,…, 74}.

The SFFTv2 algorithm is executed successfully for N = {2¹³, 2¹⁴,…, 2¹⁷} and k _s = {1, 2,…, 50}, while, the SFFTv3 algorithm performed successfully for N = 2¹³ and k _s = {4, 5,…, 673}, for N = 2¹⁴ and k _s = {4, 5,…, 1346}, for N = 2¹⁵ and k _s = {4, 5,…, 2692}, for N = 2¹⁶ and k _s = {4, 5,…, 5385}, and for N = 2¹⁷ and k _s = {4, 5,…, 10,771}. Furthermore, the DFT_COMM algorithm is executed successfully for all test cases (for N = {2¹³, 2¹⁴,…, 2¹⁷} in combination with all k _s = {1, 5,…, k _smax}, as follows from the data reported in Table 10.

Table 10 Comparisons of the SFFTv1, SFFTv2, SFFTv3, and DFT_COMM algorithms for different sizes (N) of the signal x, with N = L _i = {2¹³, 2¹⁴,…, 2¹⁷} and k _s = L _o = {1, 2,…, k _smax} in the tested sparse scenarios with k _smax ~ 11 % of N

Full size table

Table 11 reports the absolute average errors attained with the SFFTv1, SFFTv2, SFFTv3, and DFT_COMM algorithms, for N = {2¹³, 2¹⁴,…, 2¹⁸} and k _s = L _o = 50. In all test cases, the FFTW algorithm from [29] was used as a reference for computing the absolute error measures.

Table 11 Average absolute errors attained in sparse scenarios with the SFFTv1, SFFTv2, SFFTv3, and DFT_COMM algorithms for N = {2¹³, 2¹⁴,…, 2¹⁸} and k _s = L _o = 50

Full size table

From the data reported in Table 11, it follows that for N = 8192 and k _s = L _o = 50, the SFFTv1 and SFFTv2 algorithms manifest very close absolute error values; in particular, the attained average absolute error values were 5.6162 × 10⁻⁵ and 5.0689 × 10⁻⁵, respectively. However, the SFFTv3 attains a lower absolute average error values than other SFFT versions. It is noteworthy to mention that the lowest absolute average error was attained with the DFT_COMM algorithm at a value of 2.7642 × 10⁻¹⁰.

In addition, Fig. 8 reports the absolute values of errors of the compared tested SFFTv3 and the DFT_COMM algorithms for N = 8192 and k _s = L _o = 50 under the same sparse computing scenarios.

On the other hand, the SFFT-related algorithms demonstrate reliable operation for specific input parameter combinations, i.e., they are dependent on the combination of the dimension N of the input signal x, and the sparsity factor k _s. In contrast, the DFT_COMM algorithm manifests the operational robustness in the sense that it does not subject to any of such dimensional limitation and demonstrated perfect operational performances in all tested harsh (non-sparse) computational scenarios. Furthermore, all SFFT-related algorithms are probabilistic-type techniques [20, 21], in which the desired k _s largest coefficients of the Fourier spectrum of the input sequence are reconstructed (approximated) with a high probability (not mandatory with probability one). In contrast, the DFT_COMM algorithm is a deterministic technique, and it produces more reliable and accurate results than the family of the SFFT-related algorithms (as demonstrated in Fig. 8 and Tables 10 and 11).

It is also worthwhile to note that presently (in the sparsity-guaranteed computational scenarios only), the SFFT-related algorithms outperform the DFT_COMM in the computational speed due to their specially devised execution parallelism [20, 21, 28]. From the family of the SFFT-related algorithms, the SFFTv3 [28] manifests the most speed-up computational performances for any input sequence dimension N and any feasible value k _s in the sparsity-guaranteed scenarios only; in particular, when approximately only 8.2 % (or lower number) of the Fourier coefficients of the input signal are significant, thus not discarded (as shown in Table 10). In contrast, in all comparable (sparse or non-sparse) computational scenarios, the DFT_COMM algorithm manifested superior accuracy performances (lower absolute error values) than those attained with the SFFT-related algorithms.

In closing, it is noteworthy to mention that in a majority of practical computational scenarios, the savings in the number of arithmetic operations achievable with the optimized unified DFT_COMM technique are significant. As a concluding example, refer to the test scenario with N = 8192 and L _i = L _o =307 in which case the savings in the total number of required arithmetic operations attainable with the DFT_COMM algorithm in comparison with the most prominent competing split-radix FFT algorithm [3, 14, 23, 24] constitute 45 %.

5 Conclusions

We have developed a new technique that carries out an efficient computation of the DFTs of composite lengths of the input and/or output data sequences smaller than the dimension N of the full DFT/FFT. The addressed methodology unifies the commuting, filtering, and pruning paradigms yielding the new DFT_COMM method that outperforms the existing competing pruning-decomposition-based techniques in the sense of attainable savings in the number of required arithmetic operations.

Furthermore, our DFT_COMM method admits computing the DFT_P blocks at the intermediate stage of the pruned decomposed transform using any existing FFT algorithm. Based on the performed treatment of the combinational hypotheses testing-type problem regarding all feasible allocation-pruning modalities, the decision in favor of the preferable hypothesis was made that yields the proposed DFT_COMM method. Being the globally optimal decision making result of testing the complete list of all feasible hypotheses, the DFT_COMM method guarantees to require a fewer or at most the same number of arithmetic operations for its execution than any other of the competing pruning-decomposition-based methods reported in the literature.

In addition, we have corroborated that, in the scenarios with non-guaranteed sparsity of the data Fourier spectra, the DFT_COMM method manifests better reliability and accuracy than the family of the celebrated competing SFFT-related algorithms; while in scenarios with severe Fourier spectrum non-sparsity (i.e., when the majority of the data Fourier spectrum coefficients take non-zero values, thus cannot be discarded), the DFT_COMM technique always outperforms the celebrated SFFT-related algorithms because all those simply fail to execute the program code in such uncertain computational scenarios.

References

J Markel, FFT pruning. Audio and Electroacoustics, IEEE Transactions on 19(4), 305, 311 (1971). doi:10.1109/TAU.1971.1162205
Article Google Scholar
V Raghavan, KMM Prabhu, PCW Sommen, Complexity of pruning strategies for the frequency domain LMS algorithm. Signal Processing 86(10), 2836–2843 (2006). ISSN 0165–1684, http://dx.doi.org/10.1016/j.sigpro.2005.11.015
Article MATH Google Scholar
Y Xu; M-S Lim, Split-radix FFT pruning for the reduction of computational complexity in OFDM based cognitive radio system, in Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS), 69-72, May 30-June 2 2010. doi: 10.1109/ISCAS.2010.5537048.
FM Henderson, AV Lewis (eds.), Principles and applications of imaging radar, manual of remote sensing, vol. 3, 3dth edn. (Willey, NY, 1998)
Google Scholar
HH Barrett, KJ Myers, Foundations of image science (Willey, NY, 2004)
Google Scholar
YV Shkvarko, Unifying experiment design and convex regularization techniques for enhanced imaging with uncertain remote sensing data––part I: theory, part II: adaptive implementation and performance issues. IEEE Trans. Geoscience and Remote Sensing 48(1), 82–111 (2010)
Article Google Scholar
A Moni, CJ Bean, I Lokmer, S Rickard, Source separation on seismic data. IEEE Signal Processing Magazine 29(3), 16–28 (2012)
Article Google Scholar
RM Willet, MF Duarte, MA Davenport, RG Baraniuk, Sparsity and structure in hyperspectral imaging. IEEE Signal Processing Magazine 31(1), 116–126 (2014)
Article Google Scholar
Q Zhu, CR Berger, EL Turner, L Pileggi, F Franchetti, Polar format synthetic aperture radar in energy efficient application-specific logic-in-memory, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 1557- 1560, 25–30 March 2012. doi: 10.1109/ICASSP.2012.6288189.
YV Shkvarko, J Tuxpan, SR Santos, I Yaniez, High-resolution imaging with uncertain radar measurement data: a doubly regularized compressive sensing experiment design approach, in IEEE Intern. Symposium on Geoscience and Remote Sensing (IGRSS’2012), Munich, Germany, 6976–6970. (2012). ISBN: 978-1-46731159-51/12
YV Shkvarko, J Tuxpan, SR Santos, l ₂-l ₁ Structured descriptive experiment design regularization based enhancement of fractional SAR imagery. Signal Processing 93, 3553–3566 (2013). http://dx.doi.org/10.1016/j.sigpro.2013.03.024
S Foucart, H Rauhut, A mathematical introduction to compressive sensing (Springer, NY-Heidelberg, 2013)
Book MATH Google Scholar
DP Skinner, Pruning the decimation in-time FFT algorithm, in IEEE Transactions on Acoustics, Speech and Signal Processing, 24(2), 193–194 (1976). doi:10.1109/TASSP.1976.1162782
L Yuan, X Tian, Y Chen, Pruning split-radix FFT with time shift, International Conference on Electronics, Communications and Control (ICECC), 2011, 1581- 1586, 9–11 Sept. 2011. doi: 10.1109/ICECC.2011.6066654.
S Bouguezel, MO Ahmad, MNS Swamy, Efficient pruning algorithms for the DFT computation for a subset of output samples, in Proceedings of the 2003 International Symposium on Circuits and Systems, 2003. ISCAS '03, vol.4, pp. IV-97, IV-100 vol.4, 25–28 May 2003. doi: 10.1109/ISCAS.2003.1205782.
C-P Fan, G-A Su, Pruning fast Fourier transform algorithm design using group-based method, Signal Processing 87(11), 2781–2798 (2007), ISSN0165-1684, http://dx.doi.org/10.1016/j.sigpro.2007.05.012
Article MATH Google Scholar
TV Sreenivas, P Rao, FFT algorithm for both input and output pruning, in IEEE Transactions on Acoustics, Speech and Signal Processing, 27(3), 291–292 (1979). doi:10.1109/TASSP.1979.1163246
C Roche, A split-radix partial input/output fast Fourier transform algorithm, in IEEE Transactions on Signal Processing, 40(5), 1273, 1276 (1992). doi:10.1109/78.134493
L Wang, X Zhou, GE Sobelman, R Liu, Generic mixed-radix FFT pruning, in IEEE Signal Processing Letters, 19(3), 167, 170 (2012). doi:10.1109/LSP.2012.2184283
H Hassanieh, P Indyk, D Katabi, E Price, 2012. Simple and practical algorithm for sparse Fourier transform, in Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms (SODA '12) Kyoto, Japan, 17-19 Jan, 1183–1194, (2012)
H Hassanieh, P Indyk, D Katabi, E Price, Nearly optimal sparse Fourier transform, in Proceedings of the forty-fourth annual ACM symposium on Theory of computing (STOC '12), ACM, New York, (2012), 563–578. doi:10.1145/2213977.2214029. http://doi.acm.org/10.1145/2213977.2214029
Google Scholar
M Iwen, A Gilbert, M Strauss et al., Empirical evaluation of a sub-linear time sparse DFT algorithm, Communications in Mathematical Sciences 5(4), 981–998 (2007)
HV Sorensen, CS Burrus, Efficient computation of the DFT with only a subset of input or output points, in IEEE Transactions on Signal Processing, 41(3), 1184–1200 (1993). doi:10.1109/78.205723
M Medina-Melendrez, M Arias-Estrada, A Castro, Input and/or output pruning of composite length FFTs using a DIF-DIT transform decomposition, in IEEE Transactions on Signal Processing, 57(10), 4124, 4128 (2009). doi:10.1109/TSP.2009.2024855
AV Oppenheim, RW Schafer, Discrete-time signal processing, (Prentice Hall, 2nd Edition, U.S., 1999)
HV Sorensen, M Heideman, CS Burrus, On computing the split-radix FFT, in IEEE Transactions on Acoustics, Speech and Signal Processing, 34(1), 152–156 (1986). doi:10.1109/TASSP.1986.1164804
Y Suzuki, S Toshio, K Kido, A new FFT algorithm of radix 3,6, and 12, in IEEE Transactions on Acoustics, Speech and Signal Processing, 34(2), 380–383 (1986). doi:10.1109/TASSP.1986.1164826
J. Schumacher, M. Püschel, High performance sparse fast Fourier transform, Master´s thesis, ETH Zurich, Department of Computer Science (2013).
M Frigo, SG Johnson, The design and implementation of FFTW3, Proceedings of the IEEE 93(2), 216–231 (2005) doi:10.1109/JPROC.2004.840301

Download references

Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive criticism and comments that helped to improve the presentation of the paper.

Author information

Authors and Affiliations

Department of Electrical-Electronic Engineering, Culiacan Technological Institute, Ave. Juan de Dios Batiz S/N, Col. Guadalupe, C.P. 80220, Culiacan, Sinaloa, Mexico
David E. Castro-Palazuelos & Modesto Gpe. Medina-Melendrez
Telecommunications Group, CINVESTAV-IPN Campus Guadalajara, Ave. del Bosque 1145, Col. el Bajio, C.P. 45019, Zapopan, Jalisco, Mexico
David E. Castro-Palazuelos, Deni L. Torres-Roman & Yuriy V. Shkvarko

Authors

David E. Castro-Palazuelos
View author publications
You can also search for this author in PubMed Google Scholar
Modesto Gpe. Medina-Melendrez
View author publications
You can also search for this author in PubMed Google Scholar
Deni L. Torres-Roman
View author publications
You can also search for this author in PubMed Google Scholar
Yuriy V. Shkvarko
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to David E. Castro-Palazuelos.

Additional information

Competing interests

The authors declare that they have no competing interests.

Appendix

1.1 Main function

Fig 9 presents the pseudo-code of the main function that commute among the different alternatives to compute the DFT_N (DFT_COMM). When the pruned decomposed transform is not required (L _i ≤ D _op or L _o ≤ D _ip), the direct method or the 2BF filtering method could be employed. In both cases, the Fourier coefficient X(0) is computed as a simple addition of the elements in the input sequence x(n).

The directFourier function is used in the scenarios with L _i < 4 to compute the remaining Fourier coefficients (k = 1:1:L _o − 1). The 2BF filtering method is implemented when L _i ≥ 4. The directFourier function carries out the addition of complex multiplications of elements in x(n) by the complex exponential W _N ^nk defined in (1). The filterFourier function computes each Fourier coefficient by implementing a recursive algorithm similar to the second-order Goertzel algorithm of [25]. In the filterFourier function, the feedback signal is multiplied by the real part of the complex exponentials W _N ^k and, next, by the conjugate of W _N ^m. In our modification, the array of complex exponentials W _N ^m is pre-computed for m = 0:1:N − 1 and stored by duplicating in the vector W of length 2N (W = [W _N ^m, W _N ^m]), in such a way that W _N ^nk and W _N ^k could be read from it using nk and k as indexes, respectively. Accessing an element out of the vector W is impossible for these cases, as verified next. L _i is inferior than 4 (or equivalently L _i ≤ 3) when the direct method is used, thus n ≤ L _i − 1 ≤ 2 and k ≤ L _o − 1 ≤ N − 1, and consequently nk ≤ 2(N − 1) < 2 N. This assures that each element of W _N ^nk can be extracted from W just via accessing the element indexed by nk. Similarly, for k ≤ L _o − 1 ≤ N − 1, each element of W _N ^k is directly extracted from W accessing the element indexed by k. In order to avoid multiplications in the generation of the index, nk, the latter is computed by adding k to nk in each iteration of the loop n (inside the function directFourier).

In the scenarios with L _i > D _op and L _o > D _ip, the DFT_{DIT−DIF−Pr} is performed to compute the DFT_N. As it was explained previously, the DFT_{DIT−DIF−Pr} is performed in three commuting stages: the input stage, the intermediate stage, and the output stage. These stages are executed in a sequential order by calling the InputStage function, next the IntermediateStage function, and, finally, the OutputStage function.

1.2 InputStage function

The InputStage function generates the inputs to the intermediate D _ip D _op DFTs of length P (DFT_Ps), resulting in an array of three dimensions y(n ₁, n ₂, k ₁). The pseudo-code for implementing the InputStage function is listed in Fig. 10. The indexes, n ₁, n ₂, and k ₁ are varied using three nested loops (“for” instructions), in such an order that the number of accesses to each element in x(n) is reduced. This is achieved by specifying k ₁ for the inner loop, n ₁ for the intermediate loop, and n ₂ for the outer loop. With this order, once an element in x(n ₁ + D _op n ₂) is loaded, all the inputs of the DFT_Ps that depend on it are generated. To minimize the required computations, the nested loops have been broken down to avoid multiplications by one and the application of if-clauses.

In order to avoid overhead in the generation of the indexes, those are generated by additions only. After the InputStage function has been executed, the intermediate stage should be called.

1.3 Intermediate stage function

The intermediate stage consists in computing D _ip D _opDFTs of length P = N/D _ip D _op. This stage could be implemented with any algorithm for computing a DFT. For instance, the split-radix could be used if P is a power of two [26] or the radix-3 could be used if P is a power of three [27]. For a general case, we recommend using the FFTW (the fastest Fourier transform in the west) reported in [29] to compute the D _ip D _opDFT_Ps since this is the most efficient algorithm for an arbitrary length DFT. The selected algorithm should be applied over each vector obtained from y(n ₁, 0 : 1 : P − 1, k ₁) for each value of n ₁ and k ₁, resulting in a vector with output index k ₂ that is stored in the array z(n ₁, k ₂, k ₁). This array is then processed by the OuputStage function.

1.4 OutputStage function

The OutputStage is performed to compute the final Fourier coefficients from the outputs of the D _ip D _opDFT_Ps stored in z(n ₁, k ₂, k ₁). This function is listed in Fig. 11. In fact, the OutputStage function performs the computation of another stage of DFTs, although with a few outputs. As previously mentioned, there are two alternatives to compute each Fourier coefficient from z(n ₁, k ₂, k ₁), using a direct computation or using the 2BF filtering method. Thus, the OutputStage function could employ the direct-Fourier or the filterFourier functions listed in the pseudo-code of Fig. 9 to compute the final Fourier coefficients.

Each Fourier coefficient depends on D _op inputs (obtained from z(n ₁, k ₂, k ₁) by varying n ₁), so for D _op < 4, the direct method is desirable; otherwise, the 2BF filtering method is to be executed.

These nested loops should be implemented in the indicated order to specify the indexes of the final Fourier coefficients. Those indexes are obtained by increasing index k by a unit in each iteration of the loop indexed by k ₁. The directFourier function utilizes the complex exponential W _N ^nk, while the filterFourier function involves the complex exponential W _N ^k. All elements W _N ^k and W _N ^nk are extracted from W using k and nk as indexes, respectively. In order to reduce multiple copies of data and thus to achieve an enhanced efficiency of the algorithm, it is strongly desirable to implement inline functions and passing the arrays elements by reference instead of by value.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Reprints and permissions

About this article

Cite this article

Castro-Palazuelos, D., Medina-Melendrez, M., Torres-Roman, D. et al. Unified commutation-pruning technique for efficient computation of composite DFTs. EURASIP J. Adv. Signal Process. 2015, 100 (2015). https://doi.org/10.1186/s13634-015-0285-z

Download citation

Received: 30 March 2015
Accepted: 11 November 2015
Published: 26 November 2015
DOI: https://doi.org/10.1186/s13634-015-0285-z

Unified commutation-pruning technique for efficient computation of composite DFTs

Abstract

1 Introduction

1.1 Motivation

1.2 Related work

1.3 Novel contributions

2 DFT transform decomposition

3 Proposed method

3.1 Analysis of the hypotheses

3.2 Selection of the permutation/allocation structure

4 Comparison with other competing algorithms

4.1 Comparisons with pruning-based algorithms

4.2 Comparison with the SFFT-related algorithms

5 Conclusions

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Appendix

Appendix

1.1 Main function

1.2 InputStage function

1.3 Intermediate stage function

1.4 OutputStage function

Rights and permissions

About this article

Cite this article

Share this article

Keywords