Low power reconfigurable FPFFT core with an array of folded DA butterflies
EURASIP Journal on Advances in Signal Processing volume 2014, Article number: 144 (2014)
Abstract
A variable-length (32 to 2,048 points), low power, floating point fast Fourier transform (FPFFT) processor is designed and implemented using energy-efficient butterfly elements. The butterfly elements are implemented using the distributed arithmetic (DA) algorithm, which eliminates the power-consuming complex multipliers. The FFT computations are scheduled in a quasi-parallel mode on an array of 16 butterflies. The nodes of the data flow graph (DFG) of the FFT are folded onto these 16 butterflies by the control unit for any value of N. Register minimization is also applied after folding to decrease the number of scratch pad registers to (log_{ 2 }N − 1) × 16. The real and imaginary parts of the samples are each represented in 32-bit single-precision floating point notation to achieve high precision in the results. Thus, each sample is represented using 64 bits. The twiddle factor ROM size is reduced by 25% using the symmetry of the twiddle factors. Reconfigurability based on the sample size is achieved by the control unit. This distributed floating point arithmetic (DFPA)-based FFT processor, implemented in a 45-nm process, occupies an area of 0.973 mm^{2} and dissipates 68 mW at an operating frequency of 100 MHz. Compared with an FFT processor designed in the same technology with multiplier-based butterflies, this design shows 33% less area and 38% less power. The throughput for a 2,048-point FFT is 222 KS/s, and the energy spent per FFT is 7.4 to 14 nJ for 64 to 2,048 points, making it one of the most energy-efficient FFT processors.
1 Introduction
Fast Fourier transforms (FFTs) efficiently compute the coefficients of a discrete Fourier series (DFS). Also, FFT is one of the most commonly used signal processing algorithms in any communication or multimedia system. Direct applications of FFT include spectral analysis, spectral estimation, image processing, interpolation, decimation, convolution, correlation, filtering, etc. FFT is also used in all wideband digital communication systems, which use orthogonal frequency division multiplexing (OFDM) as the modulation technique.
1.1 Need for reconfigurable FFT
In a multimode, multiband, multifunctional wireless communication system such as software-defined radio (SDR), OFDM is used for baseband processing. Different OFDM-based applications require FFTs of different sizes. Table 1 lists the wired and wireless communication technologies that use OFDM as their modulation technique, together with their FFT sizes. The FFT size varies from 64 to 2,048 in these applications, so a variable-length reconfigurable FFT processor is needed. This need has motivated researchers to propose many methods and architectures for reconfigurable FFT processors.
1.2 Need for low power FFT processor
While implementing FFT algorithm on hardware, the area, power, and speed are the major performance parameters. FFT algorithm is a computationally intensive algorithm and the large number of complex multiplications consumes a lot of power and area.
Implementing the FFT and inverse fast Fourier transform (IFFT) blocks on digital signal processors (DSPs) was the method used in the early years, and it is still followed because the requirements can easily be reconfigured through software. But DSPs are power hungry and not suitable for battery-operated communications equipment. The FFT and IFFT can also be implemented on field-programmable gate arrays (FPGAs) and other reusable IP cores, but the area and power consumption are not as low as for dedicated hard FFT cores.
Also, as the technology node shrinks, with millions of switching transistors per μm^{2}, the total power dissipated by high-performance VLSI circuits greatly increases the temperature of the devices and reduces their reliability. It also demands greater cooling effort and increases the battery weight. In this scenario, a number of low power reconfigurable FFT processors with different architectures have been proposed in the literature; they are summarized in Section 2.
1.3 Review of FFT algorithm
Consider the basic discrete Fourier transform (DFT) of an N-point sequence x(n) consisting of the samples {x(0), x(1), x(2), ..., x(N − 1)}. The DFT X(k) is given by Equation 1.
$$X\left(k\right)=\sum _{n=0}^{N-1}x\left(n\right){W}_{N}^{\mathit{kn}}\text{,}$$(1)
where the variables ‘k’ and ‘n’ vary from 0 to N − 1. The transforming coefficient ${W}_{N}^{\mathit{kn}}$, commonly called the ‘twiddle factor’, is defined by Equation 2.
$${W}_{N}^{\mathit{kn}}={e}^{-j2\pi \mathit{kn}/N}\text{.}$$(2)
The direct computation or implementation of the DFT equation requires N^{2} complex multiplications and N(N − 1) number of complex additions. Fast Fourier transforms (FFTs) compute the DFT efficiently with reduced number of multiplications and additions. The basic FFT algorithm was developed by Cooley and Tukey in 1965 [1]. The techniques used in developing the FFT algorithm are breaking down the DFT of a long sequence into small DFTs and exploiting the following properties of the twiddle factor. Those two properties are given in Equations 3 and 4.

Symmetry property
$${W}_{N}^{\left(k+N/2\right)}=-{W}_{N}^{k}\text{.}$$(3)
Periodicity property
$${W}_{N}^{\left(k+N\right)}={W}_{N}^{k}\text{.}$$(4)
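Both properties are easy to confirm numerically. The following is a minimal sketch, assuming the usual definition W_N^k = exp(−j2πk/N); the function names are illustrative and not taken from the paper.

```python
import cmath


def twiddle(N: int, k: int) -> complex:
    """Twiddle factor W_N^k = exp(-j*2*pi*k/N)."""
    return cmath.exp(-2j * cmath.pi * k / N)


def check_twiddle_properties(N: int, tol: float = 1e-12) -> bool:
    """Return True if symmetry and periodicity hold for every k in 0..N-1."""
    for k in range(N):
        w = twiddle(N, k)
        # Symmetry: shifting the exponent by N/2 negates the twiddle factor.
        symmetric = abs(twiddle(N, k + N // 2) + w) < tol
        # Periodicity: shifting the exponent by N leaves it unchanged.
        periodic = abs(twiddle(N, k + N) - w) < tol
        if not (symmetric and periodic):
            return False
    return True
```

The symmetry property is what allows a butterfly to reuse one twiddle factor for a pair of outputs, and both properties together are what make a reduced twiddle ROM possible.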
There are hundreds of different versions of the FFT algorithm. Decimation in time (DIT) and decimation in frequency (DIF) are the two methods of grouping the N samples. The radix-2 DIF FFT algorithm is applied in this work.
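As a software reference for the radix-2 DIF decomposition, the sketch below splits an N-point transform into two N/2-point transforms and checks the result against the direct DFT. It is a plain recursive illustration in double precision, not a model of the hardware schedule; the function names are assumptions.

```python
import cmath


def fft_dif_radix2(x):
    """Radix-2 DIF FFT of a sequence whose length is a power of two."""
    N = len(x)
    if N == 0 or N & (N - 1):
        raise ValueError("length must be a power of two")
    if N == 1:
        return list(x)
    half = N // 2
    # One butterfly per pair (n, n + N/2): the sum feeds the even-indexed
    # outputs, the twiddled difference feeds the odd-indexed outputs.
    top = [x[n] + x[n + half] for n in range(half)]
    bot = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
           for n in range(half)]
    out = [0j] * N
    out[0::2] = fft_dif_radix2(top)
    out[1::2] = fft_dif_radix2(bot)
    return out


def dft_direct(x):
    """Direct O(N^2) evaluation of the DFT definition."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]
```

Each level of the recursion corresponds to one stage of N/2 butterflies, which gives the (N/2) log2 N butterfly operations counted later in the text.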
2 FFT processor architectures
2.1 General FFT architectures
Based on the FFT algorithm used, the radix chosen, the size of the FFT, and the number of channels, a variety of FFT architectures have been proposed in the literature [2–18]. When the FFT algorithm is mapped into hardware, three [2] or more [3] different architectures are generally followed.
2.1.1 Single PE architecture
A monoprocessor, i.e. a single processing element, is used to perform all the butterflies in the signal flow graph. As the single processing element is reused, usually a butterfly element of higher radix is preferred to reduce the latency. The advantage of single processing element (PE) architecture is high hardware utilization and the disadvantages are discontinuous input and output data streams [3].
2.1.2 Pipelined architecture
The pipelined architecture uses one PE for each stage, which increases the processing speed. Thus, many concurrent processing elements process different stages to achieve high throughput in fewer cycles [4, 5]. Single-path delay feedback (SDF), single-path delay commutator (SDC) [6], and multipath delay commutator (MDC) [7–9] are the common types of pipelined architectures.
2.1.3 Fully parallel FFT architecture
Parallel or column FFT processor maps the signal flow graph or a single stage of the signal flow graph, isomorphically, into a hardware structure. One stage of FFT computation is done using several processing elements and the same hardware is reused for the next stages. This architecture is hardware intensive.
2.1.4 Array architecture
Array-based architecture uses an array of processing elements to do the FFT computation. All the processing elements can be enabled in parallel to increase the speed of operation. Thus, an area-speed trade-off is made. The scheduling logic of the processing elements makes the design complex and hence, this architecture is not commonly used.
2.2 Low power FFT architectures and techniques
Several low power FFT implementation approaches have been proposed in the literature over the past two decades, but the search for an ultra low power implementation of the FFT continues. The research papers [19–21] on methods for motion estimation on customizable reconfigurable hardware motivate the search for biologically inspired FFT architectures, which might provide the best route to a low power FFT.
To achieve low power implementation of DSP circuits, pipelining, parallel processing, algebraic transformations, and algorithmic modifications are generally employed [10]. Reducing the switched capacitance, either by reducing the physical capacitance or by reducing the switching activity, is an appropriate way to achieve low power. The physical capacitance can be reduced by reducing the complexity of the architecture, while the switching activity can be reduced by an appropriate data encoding method, by proper reordering of the operations, and by using point-to-point data buses [10]. The reduction in complexity and increase in throughput are shown in [7] for a MIMO OFDM system using 4-channel radix-2^{3} (R2^{3}) and mixed-radix architectures, as R2^{3}SDF needs the smallest number of nontrivial multiplications.
Pipelined FFT processor architecture is put into practice in [2, 4, 5, 11]. Pipelining can be used either to increase the operating frequency or to lower the operating voltage, thus decreasing the power consumption. Radix-2, radix-4, and radix-8 butterflies are used in a pipeline to implement 64- to 2,048-point FFTs in [11], and a high-speed radix-2^{5}-based processor is presented in [14]. A pipelined low power FFT/IFFT processor with an optimized complex multiplier is designed for up to 2,048 points for WiMax applications in [15].
In [16], low power consumption is achieved by novel radix-2 and radix-4 butterfly elements, which share two complex multipliers. High throughput is also achieved in [16] using three distributed memories for loading the input data and for reading/writing data before and after computation. The low power FFT processor proposed in [5] uses the radix-2^{2} algorithm, and power is saved by using asynchronous memory instead of synchronous memory. The 64-point low power FFT processor of [4] uses a radix-2 pipelined architecture. Twiddle factor ROM size is reduced by using a reconfigurable complex multiplier. Five types of twiddle factor multiplications are identified and thus the number of computations is reduced, achieving low power consumption [4]. Many more architectures in the literature propose low power designs.
In [17], a 64- to 8,192-point FFT processor for low power applications is presented; it uses a dynamic data scaling scheme and thereby a small word length of 11 × 2. To compensate for the signal-to-quantization-noise ratio (SQNR) lost with the reduced word length, a ‘trounding’ (truncation and rounding) strategy is used instead of plain rounding or truncation. A power- and area-optimized reconfigurable FFT processor that employs radix factorization together with algorithmic, architectural, and circuit-level optimization is shown to be highly energy-efficient in [18], where the possibility of achieving the most energy-efficient FFT processor architecture is investigated in all dimensions. An area- and energy-efficient multimode processor proposed in [22] is based on a flexible-radix, multipath delay feedback architecture and achieves high throughput with good SQNR. Better SQNR and a throughput of 2.5 GS/s are reported in [14].
To summarize, the following methodologies are commonly employed in FFT processors to achieve low power.

Reducing the load capacitance C or the switching frequency ‘f’

Pipelining

Memory partitioning and reducing the twiddle ROM size

Using higher radix, mixed radix algorithms, and radix factorization

Using energyefficient processing blocks
3 The proposed methodology
Three major approaches are used to achieve both low power and reconfigurability of the FFT core in our work. In an FFT core, the major portion of the power is consumed in two blocks, namely the butterflies with complex twiddle factor multiplications and the internal data storage registers. These two issues are addressed in this design to achieve low power, and reconfigurability is also achieved, with the methodologies listed below.

A conventional butterfly with complex multipliers is replaced with a distributed arithmetic-based butterfly, which reduces the dynamic power of the butterfly computation by 80% (at 20 MHz); thus, the whole FFT computation consumes much less dynamic power.

Reconfigurability of the processor to accommodate different FFT lengths is made possible by a folded butterfly architecture built on an array of 16 coarse-grain butterfly processing elements. This is an atypical architecture, in contrast to the typical pipelined or parallel architectures.

Using register minimization technique [23], the internal memory requirement is reduced to (log_{ 2 }N − 1) × 16, which further reduces the power.
These three features are explained in detail in the following sections.
3.1 The DAA-based butterfly design
The butterfly operation is the basic computation in the FFT algorithm. The distributed arithmetic algorithm is used for low-power finite impulse response (FIR) filter implementation without multipliers. Such an implementation of an FIR filter, combining DA with common subexpression elimination (CSE) and a genetic algorithm on reconfigurable hardware, is shown in [24]. Relating distributed arithmetic to a butterfly computation and constructing an FFT processor based on the bit serial, high-latency butterfly are done for the first time in this project. The butterfly element is handcrafted using distributed floating point arithmetic (DFPA) for high energy efficiency. An FFT processor design using distributed arithmetic was proposed as far back as 1981 [25], but there the distributed arithmetic algorithm (DAA) is not applied to the butterfly operation; instead, it is used directly for computing prime-length DFTs, and many prime-length FFTs are combined to form a larger FFT. In [25], a pipelined prime factor FFT algorithm is implemented for 504 points using shorter 8-point, 9-point, and 7-point transforms, which are implemented using DAA. In this work, DA is applied at the butterfly level so that the design is modular and can be reused for variable lengths, thus making reconfigurability possible.
Figure 1 shows the radix-2 DIF butterfly operation. Here x and y are the two complex inputs to the DIF butterfly, and X and Y are its two outputs. The real and imaginary parts of the inputs x and y and of the outputs X and Y are represented as 32-bit single-precision floating point numbers in the IEEE 754 format. Thus, each sample is 64 bits wide.
The outputs X and Y of the butterfly in Figure 1 are given by Equations 5 and 6.
where x, y, X, and Y have real and imaginary parts and these complex values are represented in a 64bit format.
Equations 5 and 6 can be expanded as given in Equations 7 and 8, respectively.
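In the standard radix-2 DIF form, which is consistent with the adder and multiplier counts quoted in this section, the butterfly and its expansion into real and imaginary parts can be written as below. The subscripts r and i for real and imaginary parts and the symbol W = W_r + jW_i for the twiddle factor are notational assumptions, not necessarily the authors' notation. The first pair corresponds to Equations 5 and 6, the second pair to Equations 7 and 8.

$$X=x+y\text{,}\qquad Y=\left(x-y\right)W\text{.}$$

$$X=\left({x}_{r}+{y}_{r}\right)+j\left({x}_{i}+{y}_{i}\right)\text{,}$$

$$Y=\left[\left({x}_{r}-{y}_{r}\right){W}_{r}-\left({x}_{i}-{y}_{i}\right){W}_{i}\right]+j\left[\left({x}_{r}-{y}_{r}\right){W}_{i}+\left({x}_{i}-{y}_{i}\right){W}_{r}\right]\text{.}$$

Written this way, X needs two real additions, while Y needs two subtractions for x − y, four real multiplications, and two further additions or subtractions.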
Thus in a conventional butterfly, to compute output X, two floating point adders/subtractors are required. To compute output Y, four floating point adders/subtractors and four floating point multipliers are required. These floating point multipliers consume more dynamic power and occupy more area.
The DFPA-based butterfly (DFPABF) does not employ floating point complex multipliers. Instead, it uses two shift-accumulators and lookup tables (LUTs) to generate the output Y. DA is a bit serial computation technique for finding the inner product of two vectors when one of the vectors is known. The radix-2 butterfly, which is a two-point DFT, can be computed using DAA because the twiddle factors are known values. The design of a DAABF is described completely in [26], by the same authors. Figure 2 shows the block diagram of a DFPABF. Table 2 shows the hardware requirement of a conventional butterfly and the DFPABF.
When implemented in 45-nm technology, the DFPA butterfly shows 80% less power and 44% less area compared with the conventional butterfly. As DAA is a bit serial operation, this design has a latency of 31 cycles to produce the output of the butterfly operation. But this latency is used efficiently to operate the butterflies in a quasi-parallel mode in this design of the DFPABF-based FFT processor.
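The idea of replacing multipliers with a lookup table and a shift-accumulator can be illustrated in software. The sketch below is a deliberate simplification: it uses signed fixed-point (two's complement integer) inputs rather than the floating point format of the paper, and all names and the 16-bit width are assumptions. It computes the two inner products that make up Y, one per shift-accumulator, with the difference d = x − y as the bit serial input and the twiddle factor as the known vector.

```python
def build_lut(coeffs):
    """LUT entry at address a is the sum of coeffs[k] over the set bits k of a."""
    return [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1)
            for a in range(1 << len(coeffs))]


def da_inner_product(coeffs, xs, bits=16):
    """Bit serial sum(coeffs[k] * xs[k]) for signed integers xs of width `bits`."""
    lut = build_lut(coeffs)
    mask = (1 << bits) - 1
    words = [x & mask for x in xs]            # two's complement encoding
    acc = 0.0
    for b in range(bits):                     # one bit position per cycle
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> b) & 1) << k       # gather bit b of every input
        term = lut[addr] * (1 << b)           # power-of-two scaling is a shift
        # The sign bit carries negative weight in two's complement.
        acc += -term if b == bits - 1 else term
    return acc


def da_butterfly_y(dr, di, wr, wi, bits=16):
    """Y = (dr + j*di) * (wr + j*wi) using two DA units and no data multipliers."""
    yr = da_inner_product([wr, -wi], [dr, di], bits)
    yi = da_inner_product([wi, wr], [dr, di], bits)
    return complex(yr, yi)
```

For example, `da_butterfly_y(3, -5, 0.5, -0.25)` evaluates (3 − 5j)(0.5 − 0.25j). The loop over bit positions is the source of the multi-cycle latency that bit serial DA trades for the removal of multipliers.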
3.2 Mapping the DFG on the array of DA butterflies using folding transformation
Drawing on the FPGA architecture, 16 DFPA butterflies arranged in an array topology are used to compute FFTs of up to 2,048 points, instead of the pipelined architecture used in conventional FFT processors. In a pipelined architecture, one butterfly is employed for one stage of computation, and the hardware utilization of a pipelined processor is below 100% except for the larger FFT sizes. In this methodology, all the butterflies are used even for small FFT sizes.
A pipelined FFT architecture can be derived systematically via the folding transformation [27]. In this work, the folding transformation is used to arrive at an array architecture and to schedule this array of butterflies. In other words, the nodes in the data flow graph (DFG) of an FFT of any size can be folded onto the array of 16 butterfly elements available in the hardware. The folding transformation for a 32-point FFT is shown as an illustration to explain the reconfigurability of this design. The DFG of the N = 32 point FFT is shown in Figure 3. Its (N/2) log_{ 2 }N (= 80) butterfly operations are mapped using the folding transformation onto the butterfly block with 16 DFPABFs.
Note: The delays are shown only along the first set of edges and not shown for other edges for simplicity. They are present along the appropriate edges, and the pipelining delays used in the calculations are not shown.
For the folding transformation, the folding set, which is the set of nodes folded to a single computational unit, and the folding order must be determined. The folding set and the order of each node vary with the number of FFT points N. Null operations are not required in the folding set, as this is not a pipelined architecture. The instant at which A_{0} fires is taken as t = 0, though it fires only after receiving both x(0) and x(N/2), the latter arriving N/2 cycles after x(0). Considering the first N/2 cycles as the initial latency, the orders of all ‘A’-type nodes are ‘0’. The 80 butterfly nodes shown in the DFG for N = 32 are folded onto the 16 DFPABFs available in the hardware. Therefore, 16 folding sets, each containing log_{ 2 }N nodes, are formed as shown in Equation 9.
Here the folding factor F is 5 for all the sets. The folding order is the time instance at which a particular node in the set fires, and it varies from 0 to F − 1. For example, in the folding set BF0 containing five operations, the folding order of A_{0} is 0, of B_{0} is 1, of C_{0} is 2, of D_{0} is 3, and of E_{0} is 4. The folding edges and switched inputs/outputs for set S_{1}, folded to butterfly element BF0 = {A_{0}, B_{0}, C_{0}, D_{0}, E_{0}}, are derived as follows.
Consider an edge e with weight w(e) in the original DFG, from a node U whose l^{th} iteration is scheduled at Fl + u time units to a node V whose l^{th} iteration is scheduled at Fl + v time units, with F as the folding factor. The new weight on the folded edge is calculated using the formula given in Equation 10 [23],
$${D}_{F}\left(U\to V\right)=F\,w\left(e\right)-{P}_{U}+v-u\text{,}$$(10)
where P_{ U } is the number of pipeline stages in the butterfly unit, which is 31 for the DFPABF. The new weights of all the edges of the folded set for BF0 are calculated as follows. Pipelining delays are added on the edges to get positive delays on the folded edges.
Pipelining registers are added along the feed-forward cut set so that all the delays of the folded DFG are positive, which makes the block realizable. The delays added as pipelining delays along the edges are not shown in the DFG diagram. Using the folding equations given in Equation 11, the set of nodes {A_{0}, B_{0}, C_{0}, D_{0}, E_{0}} is folded to one computational element BF0 as shown in Figure 4. The DAA-based BF takes 31 cycles by itself for a complete butterfly operation including the twiddle factor multiplication. This is considered the internal pipelining delay of the node.
Similar folding switches are added for the ‘output 2’ terminal of the butterfly too. All the other sets are folded in the same way to form BF1, BF2, etc. Thus, the DFG for N = 32 with 80 nodes is folded (rolled over) in the horizontal direction onto the 16 BFs, in contrast to the vertical folding done in the pipelined architecture. The same technique can be used for an FFT, and the corresponding DFG, of any size.
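The folding sets for N = 32 can be enumerated programmatically. The sketch below assumes that the stages are labelled A to E, that butterfly BFi receives the i-th node of every stage, and that the folding order equals the stage index; this matches the one set (BF0) spelled out in the text, but the labelling of the remaining sets is an assumption.

```python
import string


def folding_sets_n32(num_bf=16, stages=5):
    """Return {BF name: [(node label, folding order), ...]} for N = 32."""
    sets = {}
    for i in range(num_bf):
        # Node s of set i is the i-th butterfly of stage s, fired at order s.
        sets[f"BF{i}"] = [(f"{string.ascii_uppercase[s]}{i}", s)
                          for s in range(stages)]
    return sets
```

With these defaults the function yields 16 sets of five nodes each, covering all 80 butterfly nodes of the 32-point DFG.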
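The folded-edge weight is simple to evaluate. The sketch below uses the folding equation of [23], D_F(U → V) = F·w(e) − P_U + v − u, and computes how many delays would have to be added to an edge to make the folded delay non-negative. The example edge (order 0 to order 1, no delay in the original DFG) is an assumption chosen to resemble A_{0} → B_{0}; the function names are illustrative.

```python
def folded_delay(F, w, P_u, u, v):
    """Folded edge delay D_F(U -> V) = F*w(e) - P_U + v - u."""
    return F * w - P_u + v - u


def extra_edge_delays(F, w, P_u, u, v):
    """Smallest number of delays to add to w(e) so that the folded delay is >= 0."""
    deficit = -folded_delay(F, w, P_u, u, v)
    return max(0, -(-deficit // F))   # ceiling division, clamped at zero
```

With F = 5 and P_U = 31, an edge from order 0 to order 1 with w(e) = 0 has a folded delay of −30, so six delays on that edge bring it to zero. This is why pipelining delays are needed before the folded DFG becomes realizable.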
3.3 Register minimization
The number of internal registers required for storing the outputs of the nodes is determined systematically using the register minimization technique explained in [23]. Lifetime analysis is done for the five nodes in BF0 and the output variables they produce. For each node, T_{input} → T_{output} is calculated. T_{input} is the time at which the node produces its data, T_{input} = u + P_{ U }, where u is the folding order and P_{ U } is the pipelining delay of the node. T_{output} = u + P_{ U } + max_{V} D_{ F }(U → V), where max_{V} D_{ F }(U → V) is the longest folded path delay over all edges leaving node U. Every butterfly node has two outputs, so the edges along both outputs should be considered. The lifetime table for the set BF0 is given in Table 3.
The lifetime chart is given in Figure 5. It shows that only four registers are required, which is the maximum number of live variables at any time instance. With 16 butterflies in the hardware, each requiring four registers, the total register array requirement is 64 for N = 32. Each time N doubles, only 16 additional registers are required. The number of registers R required is given by R = (log_{ 2 }N − 1) × 16. Thus for N = 2,048, only 10 × 16 = 160 internal storage registers are needed, a drastic reduction. Each register is 64 bits wide, as 32-bit floating point numbers are used for both the real and imaginary parts of the data.
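The register count can be tabulated directly from R = (log2 N − 1) × 16. The helper below is a small sketch with an illustrative name; it assumes N is a power of two.

```python
def scratch_registers(N, num_bf=16):
    """Number of 64-bit scratch registers: (log2(N) - 1) * num_bf."""
    stages = N.bit_length() - 1        # log2(N) for a power-of-two N
    return (stages - 1) * num_bf


# Register count for every supported FFT size.
table = {N: scratch_registers(N) for N in (32, 64, 128, 256, 512, 1024, 2048)}
```

The table grows by 16 registers per doubling of N, from 64 registers at N = 32 to 160 registers at N = 2,048.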
4 The adapted architecture
The FFT processor proposed in this work is reconfigurable for processing up to 2,048 input samples using an array of 16 low power DFPA butterflies onto which all the nodes in the DFG are folded. The delays along the folded edges differ with the FFT size N and are configured by the stage control unit. Two register banks of 64 words each (64 × 64) are fabricated in the FFT processor as a basic internal RAM, and they are used alternately for storing the incoming data and as an internal register array. An additional register array of 32 registers is set aside to reach the maximum of 160 registers required for a 2K-point FFT. The butterflies, which carry out the FFT computation, are fired with their inputs one after the other, once in three clock cycles, under the control of the control unit. This array- and memory-based floating point FFT processor architecture, given in Figure 6, is presented in this section.
4.1 The IO block and butterfly block
The IO (input/output) block is the interface with the outside world. It receives the input samples in 64-bit format and writes the incoming samples to the RAM. The FFT output (64 bits), which is available in one of the RAMs after all the stages of processing, is transmitted out by the IO module.
The butterfly block consists of 16 DFPA-based butterfly elements. Each folded butterfly receives a pair of data from the read control block. The addresses of these data are generated by the folding control block, which is programmed with the address-generating algorithm for the different values of N. Only one of the butterflies gets its inputs at a time, and it ends the process after 31 cycles. The compute finite-state machine (FSM) is shown in Figure 7. In the meantime, the other butterflies receive their data sequentially, once in every two clock cycles. Thus, there is an added latency of 1 cycle. The outputs come sequentially from the butterfly block, once in every two clock cycles after the initial latency of 31 cycles, and are stored in the register array for the next stage of processing by one of the folded butterflies. The scheduling of the 16 butterflies is shown in Figure 8, and the allocation of the butterfly resources to the computation using the minor butterfly cycles is shown in Figure 9, for N = 64.
4.2 Reduced size twiddle ROM
A reduced twiddle factor ROM of size 256 × 64 bits (2 KB) is used in this processor. For an N-point FFT, there are N/2 distinct twiddle factors, but there is inherent symmetry among them. With additional logic, the number of twiddle factor entries in the ROM can be reduced to either N/4 or N/8 using the π/2 or π/4 symmetry of the sine and cosine values [28]. In this design, additional glue logic calculates the twiddle factors from N/8 stored values. Thus for a 2,048-point FFT, the 1,024 distinct twiddle factors are obtained from only 256 values.
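The π/4 symmetry can be sketched in software as follows. The table below stores cosine and sine pairs for the first octant only, and the lookup function plays the role of the glue logic by swapping and negating the stored values. Two points are assumptions of this sketch rather than details from the paper: the table holds N/8 + 1 entries (the octant boundary is stored rather than hard-wired), and values are kept in double precision.

```python
import math


def build_twiddle_rom(N):
    """(cos, sin) pairs for angles 2*pi*m/N, m = 0..N/8 (first octant only)."""
    return [(math.cos(2 * math.pi * m / N), math.sin(2 * math.pi * m / N))
            for m in range(N // 8 + 1)]


def twiddle_from_rom(rom, N, k):
    """Reconstruct W_N^k = cos(t) - j*sin(t), t = 2*pi*k/N, for 0 <= k < N/2."""
    q, e = N // 4, N // 8
    if k <= e:                      # t in [0, pi/4]: read directly
        c, s = rom[k]
    elif k <= q:                    # t = pi/2 - phi: swap cos and sin
        s, c = rom[q - k]
    elif k <= q + e:                # t = pi/2 + phi: swap, negate cos
        s, c = rom[k - q]
        c = -c
    else:                           # t = pi - phi: negate cos
        c, s = rom[N // 2 - k]
        c = -c
    return complex(c, -s)
```

For N = 2,048 this table has 257 entries, one more than the 256 quoted above because the boundary angle π/4 is stored explicitly here.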
4.3 Configuration registers and control
The FFT size N is given as the input to the configuration block. To support reconfiguration, the configuration registers configure the control unit for the required number of stages, the number of butterflies per stage, the number of times the BF block is used, etc. The stage control FSM shown in Figure 8 controls the whole computation and reconfiguration process. On receiving information from the configuration registers and other blocks, it controls the flow of data from the RAM and from the IO block. It also controls the data flow in the BF block and all the stages of the FFT computation. The addresses for accessing the data and the twiddle factors from the RAM and ROM, respectively, are generated by the read, write, and twiddle blocks, but are monitored and controlled by signals generated by the stage control block.
Once one of the RAMs is filled with the samples to be processed, the stage control FSM initiates the read module to read a pair of samples from the RAM. The samples to be fed to a particular butterfly are read one after the other as a pair in three clock cycles. Then, one of the 16 butterflies is enabled and starts processing. As the butterfly operation is based on the DA algorithm, which is a bit serial operation, it takes 31 cycles to produce the output. In the meantime, the next butterfly receives its data samples and starts processing. Writing the two outputs to RAM again requires three clock cycles (Figure 10).
Thus, the first butterfly finishes the whole process in 2 + 31 + 3 = 36 clock cycles. By the time all 16 butterflies are enabled, the first butterfly has finished and is ready to process the next set of data. The scheduling of the 16 butterflies is shown in Figure 6. As shown in Figure 6, BF0 produces its output at the 36th clock cycle, and after that, one set of outputs is produced every two cycles and stored in the register array.
Thus, there is an initial latency of 36 clock cycles, to get the first output of the first stage of FFT computation. One cycle of computation of all the 16 butterflies is called one minor cycle. One minor cycle gets completed in 66 clock cycles.
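The 66-cycle minor cycle follows from the 36-cycle latency of the first butterfly and the two-cycle spacing between successive outputs. The sketch below reproduces that arithmetic; the per-phase constants are taken from the sum 2 + 31 + 3 in the text, and the uniform two-cycle stagger for all 16 butterflies is an assumption of this model.

```python
READ_CYCLES = 2       # read phase, from the sum 2 + 31 + 3
COMPUTE_CYCLES = 31   # bit serial DA butterfly
WRITE_CYCLES = 3      # write back of the two outputs
STAGGER = 2           # cycles between outputs of successive butterflies
NUM_BF = 16


def output_cycle(i):
    """Clock cycle at which butterfly BF<i> has delivered its outputs."""
    return READ_CYCLES + COMPUTE_CYCLES + WRITE_CYCLES + STAGGER * i


def minor_cycle_length():
    """Cycles until the last of the 16 butterflies has delivered its outputs."""
    return output_cycle(NUM_BF - 1)
```

Under these assumptions BF0 delivers at cycle 36 and BF15 at cycle 36 + 2 × 15 = 66, which is the length of one minor cycle.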
For N = 32, one stage of computation is done in one minor cycle, and five minor cycles finish the computation. The inputs and outputs of all the 16 butterflies are fed back and forwarded from the register array using the folded architecture/switches. The data flow between the folded butterfly nodes is controlled by the stage control block. If N = 64, two minor cycles are required to finish one stage of FFT computation, and the folding architecture is different. For N = 1,024, 32 minor cycles complete one stage. When one stage is completed, the next stage is carried out with another set of minor cycles. The design works at a clock frequency of 100 MHz, i.e., a clock period of 10 ns. Thus, the final output of a 64-point FFT is available after six stages, each consisting of two minor cycles of the BF block. For N = 64, the final FFT output is therefore available after an initial latency of 2 × 66 × 6 × 10 ns plus the stage change-over delays, which comes to 7.64 μs.
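The number of minor cycles per stage and the resulting cycle count can be written as a short calculation. This sketch counts only the product stages × minor cycles per stage × 66; it leaves out the stage change-over delays, whose size is not given here, so it is not expected to reproduce the reported latency figures. The function names are illustrative.

```python
def minor_cycles_per_stage(N, num_bf=16):
    """Each stage has N/2 butterflies, processed num_bf at a time."""
    return (N // 2) // num_bf


def base_cycles(N, minor_len=66, num_bf=16):
    """Stages * minor cycles per stage * cycles per minor cycle."""
    stages = N.bit_length() - 1        # log2(N) for a power-of-two N
    return stages * minor_cycles_per_stage(N, num_bf) * minor_len
```

For N = 64 this gives 6 × 2 × 66 = 792 clock cycles, which is the 2 × 66 × 6 factor in the latency expression above.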
4.4 Data routers
There are two routers, which route the data to and from the two RAMs. Data router I receives data from the IO module and the outputs from the BF block and sends them to RAM0 or RAM1 based on the control signals. Similarly, router II receives data from RAM0 and RAM1. It routes them to the BF block for processing and, at the end of the FFT, to the IO block as output. All the data are 64 bits wide, as 32 bits are used for the real part and 32 bits for the imaginary part. The diagrams of the routers are given in Figure 11.
5 Chip implementation and results
The proposed DFPABF-based reconfigurable processor core is implemented in the Verilog hardware description language and synthesized using Cadence RTL Compiler (Cadence Design Systems, San Jose, CA) with a standard 45-nm technology library, a V_{dd} supply of 1.08 V, and normal PVT conditions. The back-end physical design, up to the layout of the chip, is done using Cadence Encounter (Cadence Design Systems, San Jose, CA) for a process with six metal layers and one poly layer. The layout is shown in Figure 12. This design runs at a maximum clock frequency of 100 MHz. A reconfigurable 64- to 2,048-point FFT processor using conventional multiplier-based butterflies with the same array architecture is also written in Verilog and implemented in the same technology, in order to compare and demonstrate the higher performance of the distributed arithmetic-based FFT processor.
5.1 Reduced area and power reports
The proposed distributed arithmetic-based FFT processor achieves reduced area as well as power, as the computations are distributed over many clock cycles with less hardware. The latency created by these bit serial distributed operations is exploited in the architecture of the processor, making this design area and power efficient. This reconfigurable FFT core is of the coarse-grain type, whose basic building blocks are the power- and area-efficient radix-2 DIF butterflies.
The chip implementation details of the proposed FFT core are given in Table 4. The performance of the proposed FFT processor is evaluated by comparing it with that of an FFT processor designed with the same configuration and architecture. The synthesis results of the DAA-based butterfly and the conventional butterfly, and the results of the DFPABF-based reconfigurable (64 to 2K points) FFT processor and the conventional butterfly-based (64 to 2K points) FFT processor with the same architecture, are compared in Table 5, which shows 33% less area and 38% less power for the same architecture. The power consumption of the proposed processor at various operating frequencies is observed by synthesizing the design at different frequencies. The processor consumes less power at lower frequencies, and the frequency versus power graph is shown in Figure 13.
5.2 Latency and throughput of the design
The FFT output points are generated with an initial latency which depends on the FFT size. Each minor cycle takes 66 clock cycles, and change-over delays are encountered at the end of the stages. The latency in getting the first output for different values of N is shown in Table 6. After the initial latency, output data is generated at the rate of one FFT point per 10 ns.
6 Comparison with prior low power FFT processors
The performance of the FFT processor designed in this work is compared against other existing low power processors. As the implementation technologies, frequencies, word sizes, FFT lengths, and latencies differ among these processors, they are ranked on normalized area and power. The concept of using normalized area/power to compare designs implemented in different technologies was first introduced in [29]. Many variations of the formulae for normalized area and power, with respect to factors such as FFT size and operating frequency, are found in the literature [4, 8, 18, 22]. In this work, the word size of the complex data is 64 bits (as the IEEE 754 standard single-precision floating point representation is used), whereas no other design has used such a long word size. The highest data width found for the complex data is 32 bits, whereas most designs have used a data width of 16/20/22 bits. Thus, it is necessary to include the word size factor when normalizing the values with respect to this design. The formulae used for normalized area and power with respect to this implementation are given in Equations 12 and 13. Energy per FFT is calculated using the formula given in Equation 14. Normalized energy is not computed, as the execution times of the other processors are not known.
Note: *Data width used in this design is 64
^{$}Supply voltage in this work is 1.08 V
Table 7 compares the FFT processor proposed in this work with six other processors on various parameters. From the table, it can be observed that the proposed FFT processor has less normalized area and power than five of the processors in the table, as illustrated in Figure 14. All the other processors have adopted a pipelined architecture with variations like SDF or MDF with multiple streams and mixed radix algorithms. Only our work uses a novel array architecture with 16 BF processing elements, each being fired one after the other, thus making them work in parallel with the required time delay. This architecture is therefore suitable for the serial operation of the distributed arithmetic butterfly. The inherent advantage of distributed arithmetic makes our processor both area and power efficient compared with most of the existing designs.
The throughput of the proposed processor is in the range of 119 to 222 KS/s. As the execution times of all the processors are not known, the normalized energy per FFT could not be calculated. The energy per FFT of this processor is 7.4 nJ for a 64-point FFT and 14.4 nJ for a 2,048-point FFT, as shown in Table 8. Thus, it is more energy efficient than many other existing processors. This is achieved by the energy-efficient butterflies, register minimization, and the efficient scheduling of butterflies with folding transformation.
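Equation 14 is not reproduced in this text, so the following Python sketch rests on an assumption: that energy per FFT is the average power multiplied by the execution time of one FFT, which is the usual definition. The function name and the 100 ns execution time are hypothetical placeholders and are not values from this work. Only the 68 mW power figure comes from the text.

```python
def energy_per_fft_nj(power_mw: float, exec_time_ns: float) -> float:
    """Energy per FFT in nanojoules.

    Assumed definition (Equation 14 itself is not shown in the text):
        energy = average power x execution time
    Unit check: 1 mW x 1 ns = 1e-12 J = 1e-3 nJ.
    """
    return power_mw * exec_time_ns * 1e-3


POWER_MW = 68.0                    # power dissipation at 100 MHz, from the text
HYPOTHETICAL_EXEC_TIME_NS = 100.0  # placeholder, NOT a value from the paper

print("energy per FFT (nJ):",
      energy_per_fft_nj(POWER_MW, HYPOTHETICAL_EXEC_TIME_NS))
```

Under this definition the execution time of each design is required, which is consistent with the statement that normalized energy could not be computed for processors whose execution times are unknown.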
7 Conclusions
In this paper, we have presented an array architecture with folding transformation for a reconfigurable (32/64/128/256/512/1,024/2,048 points) FFT processor. The systematic folding transformation is illustrated for N = 32, and the same approach is used for the other FFT sizes. The computational nodes are ultra low power, low-area, distributed arithmetic-based FP butterflies, which yield a processor with lower power and smaller silicon area than existing low power FFT processors. The array of 16 folded butterfly elements works in a quasi-parallel mode. The number of butterflies was chosen as 16 after analyzing different implementation factors and the control mechanism. Another new feature of this processor is that it uses very low power butterfly elements whose design is based on DAA. The processor designed in this work occupies a silicon area of 0.973 mm^{2} with a power dissipation of 68 mW at an operating frequency of 100 MHz. The throughput is in the range of 119 to 222 KS/s, where one sample is 64 bits. The energy per FFT ranges from 7.4 to 14.4 nJ for FFT sizes from 64 to 2,048. Thus, this design is among the most energy-efficient FFT processors reported so far.
References
 1.
Cooley JW, Tukey JW: An algorithm for the machine calculation of complex Fourier series. Math Comp 1965, 19: 297-301. 10.1090/S0025-5718-1965-0178586-1
 2.
Weidong L, Lars W: A pipeline FFT processor. IEEE Workshop on Signal Processing Systems (SiPS). 1999, 654-662.
 3.
Cheng Han S, Kun Bin L, Chein Wei J: Design and Implementation of a Scalable Fast Fourier Transform Core. Proceedings of 2002 Asia-Pacific Conference on ASIC. 2002, 295-298.
 4.
Chu Y, Mao Hsu Y, Pao Ann H, Sao Jie C: A low power 64-point pipeline FFT/IFFT processor for OFDM applications. IEEE Trans. Consum. Electron 2011, 57(1):40-45.
 5.
Gin Der W, Yi Ming L: Radix 2² Based Low Power Reconfigurable FFT Processor. Proc. of IEEE International Symposium on Industrial Electronics (ISIE 2009). 2009, 1134-1138.
 6.
Xue LIU, Feng YU, Z k Wang NG: A pipelined architecture for normal I/O order FFT. Journal of Zhejiang University-Science C (Computers & Electronics) 2011, 12(1):76-82. 10.1631/jzus.C1000234
 7.
Byungcheol K, Jaeseok K: Low complexity multi-point 4-channel FFT Processor for IEEE 802.11n MIMO-OFDM WLAN system. International Conference on Green and Ubiquitous Technology (GUT). 2012, 94-97. 7–8
 8.
Yang KJ, Tsai SH: MDC FFT/IFFT processor with variable length for MIMO OFDM systems. IEEE Trans. Very Large Scale Integration (VLSI) Syst 2013, 21(4):1188-1203.
 9.
Garrido M, Grajal J, Sanchez MA, Gustafsson O: Pipelined radix-2^{k} feedforward FFT architectures. IEEE Trans. VLSI 2013, 21: 23-32.
 10.
Arslan T, Erdogan DH, Horrocks AT: Low power design for DSP methodologies and techniques. Microelectron J 1996, 27: 731-744. Elsevier Science Ltd 10.1016/0026-2692(96)00010-9
 11.
Liu G, Feng Q: ASIC Design of Low Power Reconfigurable FFT Processor. 7th International Conference on ASIC (ASICON '07), IEEE. 2007.
 12.
Shuenn Shyang W, Chien Sung LI: An area-efficient design of variable-length fast Fourier transform processor. J. VLSI Signal Processing Systems 2008, 51: 245-256. Springer Science 10.1007/s11265-007-0063-8
 13.
Weidong L, Lars W: Low power FFT processors. Swedish System-on-Chip Conference (SSoCC'01), Arild, Sweden. 2001, 20-21.
 14.
Taesang C, Hanho L: A high-speed low-complexity modified radix-2^{5} FFT processor for high rate WPAN applications. IEEE Trans. Very Large Scale Integration (VLSI) Syst 2013, 21(1):187-191.
 15.
Manish S, Patil TD, Chhatbar, Darji AD: An area efficient and low power implementation of 2048 point FFT/IFFT processor for mobile WiMAX. Proceedings of Signal Processing and Communication Conference (SPCOM). 2010, 1-4.
 16.
Xiaojin L, ZongSheng L: A low power and small area FFT processor for OFDM demodulator. IEEE Trans. Consum. Electron 2007, 53(2):274-277.
 17.
Yifan B, Renfeng D, Jun H, Xiaoyang Z: A hardware efficient variable-length FFT processor for low power applications. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). 2013, 1-4.
 18.
Chia Hsiang Y, Tsung Han Y, Marković D: Power and area minimization of reconfigurable FFT processors: a 3GPP-LTE example. IEEE J. Solid State Circuits 2012, 47(3).
 19.
Guillermo B, Uwe Meyer B, Antonio G, Manuel R: Quantization analysis and enhancement of a VLSI gradient-based motion estimation architecture. Digital Signal Process 2012, 22(6):1174-1187. ISSN 1051–2004 10.1016/j.dsp.2012.05.013
 20.
Botella G, Garcia A, Rodriguez-Alvarez M, Ros E, Meyer-Baese U, Molina MC: Robust bioinspired architecture for optical-flow computation. IEEE Trans. Very Large Scale Integration (VLSI) Syst 2010, 18(4):616-629.
 21.
Botella G, Martín HJA, Santos M, Meyer-Baese U: FPGA-based multimodal embedded sensor system integrating low- and mid-level vision. Sensors 2011, 11: 8164-8179. 10.3390/s110808164
 22.
Song Nien T, Chi Hsiang L, Tsin Yuan C: An area efficient, multimode FFT processor for WPAN/WLAN/WMAN systems. IEEE J. Solid State Circuits 2012, 47(6).
 23.
Parhi KK: VLSI Digital Signal Processing Systems: Design and Implementation. Wiley India Pvt. Limited; 2007.
 24.
Meyer Baese U, Botella G, Romero DE, Kumm M: Optimization of high speed pipelining in FPGA-based FIR filter design using genetic algorithm. SPIE Defense, Security, and Sensing. International Society for Optics and Photonics 2012, 84010R.
 25.
Chow P, Vranesic ZG, Yen JL: A pipelined distributed arithmetic PFFT processor. IEEE Trans. Comput 1983, C-32(12):1128-1136.
 26.
Augusta S, Srinivasan R, Raja J: Distributed arithmetic based butterfly element for FFT processor, in 45 nm technology. ARPN J. Eng. Appl. Sci 2013, 8: 1.
 27.
Ayinala M, Brown M, Parhi KK: Pipelined parallel FFT architectures via folding transformation. IEEE Trans. VLSI Syst 2012, 20(6).
 28.
Kang HJ, Lee JY, Kim JH: Low-complexity twiddle factor generation for FFT processor. Electron Lett 2013, 49(23):1443-1445. 10.1049/el.2013.2461
 29.
Baas BM: A low power high performance 1024 point FFT processor. IEEE J. Solid State Circ 1999, 34(3):380-387. 10.1109/4.748190
Acknowledgements
The authors express their heartfelt thanks to Mr. P. Radhakrishnan, Open Silicon Pvt Ltd, for his valuable inputs and constant guidance in completing this project. The authors also extend their gratitude to Dr. Anand Samuel, Dr. Menaka, Dr. Ravi Shankar, Prof. Reena, and Mr. Avinash of VIT University, Chennai, India for their valuable suggestions in the manuscript preparation and constant moral support in completing this work.
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Beulet Paul, A.S., Raju, S. & Janakiraman, R. Low power reconfigurable FPFFT core with an array of folded DA butterflies. EURASIP J. Adv. Signal Process. 2014, 144 (2014). https://doi.org/10.1186/1687-6180-2014-144
Keywords
 Fast Fourier transform (FFT)
 Distributed floating point arithmetic (DFPA)
 Twiddle factor
 DIF FFT
 Butterfly element
 Array architecture