Low power reconfigurable FPFFT core with an array of folded DA butterflies
 Augusta Sophy Beulet Paul^{1},
 Srinivasan Raju^{2} and
 Raja Janakiraman^{3}
https://doi.org/10.1186/168761802014144
© Beulet Paul et al.; licensee Springer. 2014
Received: 7 May 2014
Accepted: 4 September 2014
Published: 17 September 2014
Abstract
A variable-length (32 ~ 2,048), low power, floating point fast Fourier transform (FPFFT) processor is designed and implemented using energy-efficient butterfly elements. The butterfly elements are implemented using the distributed arithmetic (DA) algorithm, which eliminates the power-consuming complex multipliers. The FFT computations are scheduled in a quasi-parallel mode on an array of 16 butterflies. For any value of N, the control unit folds the nodes of the data flow graph (DFG) of the FFT onto these 16 butterflies. Register minimization is also applied after folding to decrease the number of scratch pad registers to (log_{2}N − 1) × 16. The real and imaginary parts of the samples are each represented in 32-bit single-precision floating point notation to achieve high precision in the results, so each sample is represented using 64 bits. The twiddle factor ROM is reduced to 25% of the N/2 distinct entries by using the symmetry of the twiddle factors. Reconfigurability based on the sample size is achieved by the control unit. This distributed floating point arithmetic (DFPA)-based FFT processor, implemented in a 45-nm process, occupies an area of 0.973 mm^{2} and dissipates 68 mW at an operating frequency of 100 MHz. Compared with an FFT processor designed in the same technology with multiplier-based butterflies, this design shows 33% less area and 38% less power. The throughput for a 2,048-point FFT is 222 KS/s, and the energy spent per FFT is 7.4 to 14 nJ for 64 to 2,048 points, which places this design among the most energy-efficient FFT processors.
1 Introduction
Fast Fourier transforms (FFTs) efficiently compute the coefficients of a discrete Fourier series (DFS). Also, FFT is one of the most commonly used signal processing algorithms in any communication or multimedia system. Direct applications of FFT include spectral analysis, spectral estimation, image processing, interpolation, decimation, convolution, correlation, filtering, etc. FFT is also used in all wideband digital communication systems, which use orthogonal frequency division multiplexing (OFDM) as the modulation technique.
1.1 Need for reconfigurable FFT
Wired/wireless technologies which use OFDM
Applications  FFT points N 

High-performance local area network (LAN)  64 
Wireless LAN  64 
Multiple-input and multiple-output (MIMO) OFDM system  64/128 
Institute of Electrical and Electronics Engineers (IEEE) 802.16 based wireless systems  128 ~ 2,048 
Digital audio broadcasting (DAB)  256 ~ 2,048 
Very-high-bit-rate digital subscriber line (VDSL)  256 ~ 2,048 
Asymmetric digital subscriber line (ADSL)  512 
Worldwide interoperability for microwave access  2,048 
Digital video broadcasting-terrestrial (DVB-T)  2,048/8,192 
1.2 Need for low power FFT processor
When an FFT algorithm is implemented in hardware, area, power, and speed are the major performance parameters. The FFT is computationally intensive, and its large number of complex multiplications consumes considerable power and area.
Implementing FFT and inverse fast Fourier transform (IFFT) blocks on digital signal processors (DSPs) was the method used in the initial years, and it is still followed because the requirements can be reconfigured easily through software. But DSPs are power hungry and not suitable for battery-operated communications equipment. The FFT and IFFT can also be implemented on field-programmable gate arrays (FPGAs) and other reusable IP cores, but the area and power consumption are not as low as for dedicated hard FFT cores.
Also, as the technology node shrinks, with millions of switching transistors per μm^{2}, the total power dissipated by high-performance VLSI circuits greatly increases the temperature of the devices and reduces their reliability. It also demands greater cooling effort and increases the battery weight. In this scenario, a number of low power reconfigurable FFT processors with different architectures have been proposed in the literature; they are summarized in Section 2.
1.3 Review of FFT algorithm
The direct computation of the DFT equation requires N^{2} complex multiplications and N(N − 1) complex additions. Fast Fourier transforms (FFTs) compute the DFT efficiently with a reduced number of multiplications and additions. The basic FFT algorithm was developed by Cooley and Tukey in 1965 [1]. The FFT algorithm is developed by breaking down the DFT of a long sequence into small DFTs and by exploiting the following two properties of the twiddle factor, given in Equations 3 and 4.

Symmetry property: $W_{N}^{k+N/2} = -W_{N}^{k}$ (3)

Periodicity property: $W_{N}^{k+N} = W_{N}^{k}$ (4)
There are hundreds of different versions of the FFT algorithm. Decimation in time (DIT) and decimation in frequency (DIF) are the two methods of grouping the N samples. The radix-2 DIF FFT algorithm is applied in this work.
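As a software reference for the radix-2 DIF algorithm named above, the following minimal Python sketch compares an iterative DIF FFT with the direct DFT. It is a textbook formulation, not the hardware data path of this work; the function names (`dft_direct`, `fft_dif_radix2`, `butterfly_count`) are illustrative assumptions.

```python
import cmath

def dft_direct(x):
    """Direct DFT: N^2 complex multiplications."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft_dif_radix2(x):
    """Iterative radix-2 DIF FFT; len(x) must be a power of two."""
    a = list(x)
    n = len(a)
    stages = n.bit_length() - 1
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for j in range(span):
                w = cmath.exp(-2j * cmath.pi * j / (2 * span))  # twiddle factor
                p, q = a[start + j], a[start + j + span]
                a[start + j] = p + q              # butterfly output X
                a[start + j + span] = (p - q) * w  # butterfly output Y
        span //= 2
    # DIF leaves the results in bit-reversed order.
    return [a[bit_reverse(k, stages)] for k in range(n)]

def butterfly_count(n):
    """Number of radix-2 butterflies in an n-point FFT: (n/2) * log2(n)."""
    return (n // 2) * (n.bit_length() - 1)
```

For N = 32, `butterfly_count(32)` gives 80 butterflies, against 32 × 32 = 1,024 complex multiplications for the direct DFT.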
2 FFT processor architectures
2.1 General FFT architectures
Based on the FFT algorithm used, the radix chosen, the size of the FFT, and the number of channels, a variety of FFT architectures have been proposed in the literature [2–18]. When the FFT algorithm is mapped into hardware, generally three [2] or more [3] different architectures are followed.
2.1.1 Single PE architecture
A monoprocessor, i.e. a single processing element, is used to perform all the butterflies in the signal flow graph. As the single processing element is reused, usually a butterfly element of higher radix is preferred to reduce the latency. The advantage of single processing element (PE) architecture is high hardware utilization and the disadvantages are discontinuous input and output data streams [3].
2.1.2 Pipelined architecture
The pipelined architecture uses one PE for each stage, which increases the speed of processing. Thus, many concurrent processing elements are used to process different stages to achieve high throughput in fewer cycles [4, 5]. Single-path delay feedback (SDF), single-path delay commutator (SDC) [6], and multipath delay commutator (MDC) [7–9] are the common types of pipelined architectures.
2.1.3 Fully parallel FFT architecture
Parallel or column FFT processor maps the signal flow graph or a single stage of the signal flow graph, isomorphically, into a hardware structure. One stage of FFT computation is done using several processing elements and the same hardware is reused for the next stages. This architecture is hardware intensive.
2.1.4 Array architecture
Array-based architecture uses an array of processing elements to do the FFT computation. All the processing elements can be enabled in parallel to increase the speed of operation. Thus, an area-speed trade-off is made. The scheduling logic of the processing elements makes the design complex, and hence this architecture is not commonly used.
2.2 Low power FFT architectures and techniques
Several low power FFT implementation approaches have been proposed in the literature over the past two decades, but still there is a continuing search for an ultra low power implementation of FFT. The research papers [19–21] on methods for motion estimation on a customizable reconfigurable hardware motivate the researchers to search for biologically inspired FFT architectures which might provide the best solution to design a low power FFT.
To achieve low power implementation of DSP circuits, pipelining, parallel processing, algebraic transformations, and algorithmic modifications are generally employed [10]. Reducing the switched capacitance, either by reducing the physical capacitance or by reducing the switching activity, is an appropriate way to achieve low power. The physical capacitance can be reduced by reducing the complexity of the architecture, while the switching activity can be reduced by an appropriate data encoding method, by proper reordering of the operations, and by using point-to-point data buses [10]. The reduction in complexity and the increase in throughput are demonstrated in [7] for a MIMO OFDM system using a 4-channel radix-2^{3} (R2^{3}) and mixed radix architecture, as R2^{3}SDF needs the smallest number of nontrivial multiplications.
Pipelined FFT processor architecture is put into practice in [2, 4, 5, 11]. Pipelining can be used either to increase the operating frequency or to lower the operating voltage, thus decreasing the power consumption. Radix-2, radix-4, and radix-8 butterflies are used in a pipeline to implement 64- to 2,048-point FFTs in [11], and a high-speed radix-2^{5} based processor is presented in [14]. A pipelined low power FFT/IFFT processor with an optimized complex multiplier is designed for up to 2,048 points for WiMax applications in [15].
In [16], low power consumption is achieved by novel radix-2 and radix-4 butterfly elements, which share two complex multipliers. High throughput is also achieved in [16] using three distributed memories for loading the input data and for reading/writing data before and after computation. The low power FFT processor proposed in [5] uses the radix-2^{2} algorithm, and power saving is achieved by using asynchronous memory instead of synchronous memory. The 64-point low power FFT processor of [4] uses a radix-2 pipelined architecture. Its twiddle factor ROM size is reduced by using a reconfigurable complex multiplier. Five types of twiddle factor multiplications are identified, and thus the number of computations is reduced, achieving low power consumption [4]. There are many more architectures in the literature that propose low power designs.
In [17], a 64- to 8,192-point FFT processor for low power applications is presented, which uses a dynamic data scaling scheme and thereby a small word length of 11 × 2. To compensate for the signal-to-quantization-noise ratio (SQNR) of the reduced word length, a ‘trounding’ (truncation and rounding) strategy is used instead of rounding or truncation alone. A power- and area-optimized reconfigurable FFT processor, which employs radix factorization together with algorithmic, architectural, and circuit level optimization, is shown to be highly energy-efficient in [18]. There, the possibility of achieving the most energy-efficient FFT processor architecture is investigated in all dimensions. An area- and energy-efficient multimode processor proposed in [22] is also based on a flexible-radix, multipath delay feedback architecture and achieves high throughput with good SQNR. Better SQNR and a throughput of 2.5 GS/s are reported in [14].
To summarize, the following methodologies are commonly employed in FFT processors to achieve low power.

Reducing the load capacitance C or the switching frequency ‘f’

Pipelining

Memory partitioning and reducing the twiddle ROM size

Using higher radix, mixed radix algorithms, and radix factorization

Using energyefficient processing blocks
3 The proposed methodology
Three major approaches are used to achieve both low power and reconfigurability of the FFT core in our work. In an FFT core, the major portion of the power consumption occurs in two blocks, namely the butterflies with complex twiddle factor multiplications and the internal data storage registers. These two issues are addressed in this design to achieve low power, and reconfigurability is also achieved, with the methodologies listed below.

The conventional butterfly with complex multipliers is replaced with a distributed arithmetic-based butterfly, which reduces the dynamic power consumed by the butterfly computation by 80% (at 20 MHz), so the whole FFT computation consumes much less dynamic power.

Reconfigurability of the processor to accommodate different lengths of FFT is made possible by the folded butterfly architecture built on an array of 16 coarse-grain butterfly processing elements. This is an atypical architecture in contrast to the typical pipelined or parallel architectures.

Using the register minimization technique [23], the internal memory requirement is reduced to (log_{2}N − 1) × 16, which further reduces the power.
These three features are explained in detail in the following sections.
3.1 The DAA-based butterfly design
The butterfly operation is the basic computation in the FFT algorithm. The distributed arithmetic algorithm has been used for low-power finite impulse response (FIR) filter implementation without multipliers; [24] shows such an FIR filter implementation on reconfigurable hardware, combining DA with common subexpression elimination (CSE) and a genetic algorithm. Relating distributed arithmetic to a butterfly computation and constructing an FFT processor based on the bit-serial, high-latency butterfly are done for the first time in this project. The butterfly element is handcrafted using distributed floating point arithmetic (DFPA) for high energy efficiency. An FFT processor design using distributed arithmetic was proposed in the literature as far back as 1981 [25], but there the distributed arithmetic algorithm (DAA) is not applied to the butterfly operation; instead, it is used directly for computing prime number DFTs, and many prime number FFTs are combined to form a larger length FFT. In [25], a pipelined prime factor FFT algorithm is implemented for 504 points using shorter transforms of 8, 9, and 7 points, which are implemented using DAA. In this work, by contrast, DA is applied at the butterfly level so that the design can be modularized and reused for variable lengths, thus making reconfigurability possible.
In the butterfly operation, the inputs x and y and the outputs X and Y each have real and imaginary parts, and these complex values are represented in a 64-bit format.
Thus in a conventional butterfly, to compute output X, two floating point adders/subtractors are required. To compute output Y, four floating point adders/subtractors and four floating point multipliers are required. These floating point multipliers consume more dynamic power and occupy more area.
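For reference, the radix-2 DIF butterfly in its standard textbook form can be written as below. The symbols a, b, W_r, and W_i are introduced here only for illustration and are not taken from the original equations. The operation counts agree with those stated above: two real additions for X, and two subtractions, four multiplications, and two further additions/subtractions for Y.

```latex
\begin{aligned}
X &= x + y, \\
Y &= (x - y)\,W_N^{k}, \\
a + jb &= x - y, \qquad W_N^{k} = W_r + jW_i, \\
\operatorname{Re}(Y) &= a\,W_r - b\,W_i, \\
\operatorname{Im}(Y) &= a\,W_i + b\,W_r .
\end{aligned}
```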
Hardware used for conventional BF and DFPA-BF
Hardware  Conventional BF  DAA-BF 

Floating point adder/subtractor  6  4 
Floating point multiplier  4  0 
Shift accumulators  0  2 
4-input LUT  0  2 
When implemented in 45-nm technology, the DFPA butterfly shows 80% less power and 44% less area than the conventional butterfly. As DAA is a bit-serial operation, this design has a latency of 31 cycles to produce the output of the butterfly operation. This latency is, however, used efficiently to operate the butterflies in a quasi-parallel mode in the DFPA-BF-based FFT processor.
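The principle of replacing the twiddle multiplications by a look-up table and a shift accumulator can be sketched as follows. This is a simplified illustration under stated assumptions: it uses signed fixed-point integers rather than the floating point format of this design, it addresses each LUT with two input bits per cycle, and all names (`build_lut`, `da_inner_product`, `da_twiddle_multiply`) are hypothetical.

```python
def build_lut(coeffs):
    """LUT[addr] = sum of the coefficients whose address bit is set."""
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << len(coeffs))]

def da_inner_product(xs, coeffs, bits):
    """Bit-serial distributed arithmetic for sum(c * x), x in two's complement.

    Each of the `bits` cycles reads one bit of every input, looks up the
    precomputed coefficient sum, and shift-accumulates it. No multiplier is used.
    """
    lut = build_lut(coeffs)
    mask = (1 << bits) - 1
    words = [x & mask for x in xs]          # two's complement encoding
    acc = 0
    for b in range(bits):
        addr = 0
        for i, w in enumerate(words):
            addr |= ((w >> b) & 1) << i
        partial = lut[addr] << b
        # The sign bit of a two's complement word carries negative weight.
        acc += -partial if b == bits - 1 else partial
    return acc

def da_twiddle_multiply(a, b, wr, wi, bits=16):
    """(a + jb) * (wr + j*wi) using two DA inner products (assumed mapping)."""
    re = da_inner_product([a, b], [wr, -wi], bits)
    im = da_inner_product([a, b], [wi, wr], bits)
    return re, im
```

In this sketch the two inner products play the role of the two LUTs and two shift accumulators listed in the hardware table, and the one-bit-per-cycle loop is the source of the bit-serial latency.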
3.2 Mapping the DFG on the array of DA butterflies using folding transformation
Getting the insight from the FPGA architecture, 16 DFPA butterflies arranged in an array topology are used to compute FFT up to 2,048 points, instead of the pipelined architecture used in the conventional FFT processor. In pipelined architecture, one butterfly is employed for one stage of computation and the hardware utilization of a pipelined processor is not 100% except for the higher FFT points. But in this methodology, all the butterflies are used even for small FFT size.
Note: The delays are shown only along the first set of edges and not shown for other edges for simplicity. They are present along the appropriate edges, and the pipelining delays used in the calculations are not shown.
Here the folding factor F is 5 for all the sets. The folding order is the time instance at which a particular node in the set fires, and it varies from 0 to F − 1. For example, in the folding set BF0 containing five operations, the folding order of A_{0} is 0, of B_{0} is 1, of C_{0} is 2, of D_{0} is 3, and of E_{0} is 4. The folding edges and switched inputs/outputs for set S_{1} folded to butterfly element BF0 = {A_{0}, B_{0}, C_{0}, D_{0}, E_{0}} are derived as follows.
Similar folding switches are added for the ‘output 2’ terminal of the butterfly too. All the sets are folded to form BF1, BF2, etc. Thus, the DFG for N = 32 with 80 nodes is folded/rolled over in the horizontal direction onto the 16 BFs, in contrast to the vertical folding done in the pipelined architecture. The same technique can be used for an FFT and the corresponding DFG of any size.
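The assignment of DFG nodes to folding sets can be sketched in a few lines of Python. For N = 32 it reproduces the set BF0 = {A_{0}, B_{0}, C_{0}, D_{0}, E_{0}} with folding orders 0 to 4. The rule used for larger N (node index modulo 16 selects the butterfly) is an assumption made for illustration, and the function name `folding_sets` is hypothetical.

```python
NUM_BF = 16  # butterflies in the array, as stated in the text

def folding_sets(n, num_bf=NUM_BF):
    """Map every butterfly node of an n-point radix-2 DFG onto num_bf butterflies.

    Returns {bf: [(folding_order, label), ...]}. A label is a stage letter
    (A, B, C, ...) followed by the node's index within that stage.
    """
    stages = n.bit_length() - 1                    # log2(n) for a power of two
    nodes_per_stage = n // 2
    slots_per_stage = nodes_per_stage // num_bf    # 1 when n = 32
    sets = {bf: [] for bf in range(num_bf)}
    for stage in range(stages):
        for idx in range(nodes_per_stage):
            bf = idx % num_bf                      # assumed assignment rule
            order = stage * slots_per_stage + idx // num_bf  # firing slot
            label = f"{chr(ord('A') + stage)}{idx}"
            sets[bf].append((order, label))
    return sets

sets32 = folding_sets(32)
print(sets32[0])  # [(0, 'A0'), (1, 'B0'), (2, 'C0'), (3, 'D0'), (4, 'E0')]
```

Each of the 16 sets for N = 32 holds five nodes, which accounts for all 80 nodes of the DFG and gives the folding factor F = 5.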
3.3 Register minimization
Lifetime of nodes in the folding set BF0
Node  T_{input} → T_{output1}  T_{input} → T_{output2} 

A_{0}  31 → 71  31 → 71 
B_{0}  32 → 52  32 → 32 
C_{0}  33 → 43  33 → 33 
D_{0}  34 → 39  34 → 34 
E_{0}  35 → 35  35 → 35 
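The register count after minimization, (log_{2}N − 1) × 16, can be evaluated with the short sketch below. The conversion to bytes assumes one 64-bit word per scratch pad register, which is an assumption of this sketch; the function names are illustrative.

```python
def scratch_registers(n, num_bf=16):
    """Scratch pad registers after register minimization: (log2(n) - 1) * 16."""
    stages = n.bit_length() - 1      # log2(n) for a power of two
    return (stages - 1) * num_bf

def scratch_bytes(n, word_bits=64):
    """Storage in bytes, assuming one 64-bit word per register."""
    return scratch_registers(n) * word_bits // 8

for n in (32, 64, 128, 256, 512, 1024, 2048):
    print(n, scratch_registers(n), scratch_bytes(n))
```

For N = 2,048 this gives 160 registers, i.e. 1,280 bytes or 1.25 KB under the 64-bit assumption.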
4 The adapted architecture
4.1 The IO block and butterfly block
The IO (input/output) block is the interface with the outside world. It receives the input samples in 64-bit format and writes the incoming samples to the RAM. The FFT output (64 bits), which is available in one of the RAMs after all the stages of processing, is transmitted out by the IO module.
4.2 Reduced size twiddle ROM
A reduced twiddle factor ROM of size 256 × 64 bits (2 KB) is used in this processor. For an N-point FFT, there are N/2 distinct twiddle factors, but there is inherent symmetry among them. With additional logic, the number of twiddle factor entries in the ROM can be reduced further to either N/4 or N/8 using the π/2 or π/4 symmetry of the sine and cosine values [28]. In this design, the additional glue logic calculates each twiddle factor from the N/8 stored values. Thus, for a 2,048-point FFT, the 1,024 distinct twiddle factors are obtained from only 256 stored values.
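How the π/4 symmetry recovers all N/2 twiddle factors from one octant of stored values can be sketched as follows. The glue logic of this design is not described in detail, so the octant mapping below is a generic illustration; for simplicity the table keeps N/8 + 1 entries (it includes the π/4 endpoint), and the names `build_rom` and `twiddle` are hypothetical.

```python
import cmath
import math

def build_rom(n):
    """Store (cos, sin) only for angles 0 .. pi/4, i.e. indices 0 .. n/8."""
    return [(math.cos(2 * math.pi * m / n), math.sin(2 * math.pi * m / n))
            for m in range(n // 8 + 1)]

def twiddle(rom, n, k):
    """Return W_N^k = cos(t) - j*sin(t), t = 2*pi*k/n, for 0 <= k < n/2."""
    eighth, quarter, half = n // 8, n // 4, n // 2
    if k <= eighth:                      # 0 .. pi/4: read directly
        c, s = rom[k]
    elif k <= quarter:                   # pi/4 .. pi/2: swap cos and sin
        cm, sm = rom[quarter - k]
        c, s = sm, cm
    elif k <= quarter + eighth:          # pi/2 .. 3pi/4: swap and negate cos
        cm, sm = rom[k - quarter]
        c, s = -sm, cm
    else:                                # 3pi/4 .. pi: negate cos
        cm, sm = rom[half - k]
        c, s = -cm, sm
    return complex(c, -s)

rom = build_rom(2048)
w = twiddle(rom, 2048, 700)              # one of the 1,024 distinct factors
```

Only swaps and sign changes are needed on top of the table read, which is why the extra logic is small compared with the ROM it saves.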
4.3 Configuration registers and control
For N = 32, one stage of computation is done in one minor cycle, and five minor cycles finish the computation. The inputs and outputs of all the 16 butterflies are fed back and forwarded from the register array through the switches of the folded architecture. The data flow between the folded butterfly nodes is controlled by the stage control block. If N = 64, two minor cycles are required to finish one stage of FFT computation, and the folding architecture is different. For N = 1,024, 32 minor cycles complete one stage. When one stage is completed, the next stage is carried out with another set of minor cycles. The design works at a clock frequency of 100 MHz, i.e. a clock period of 10 ns. Thus, the final output of a 64-point FFT is available after six stages, each consisting of two minor cycles of the BF block. For N = 64, the final FFT output is therefore available after an initial latency of 2 × 66 × 6 × 10 ns which, together with some stage-over delays, comes to 7.64 μs.
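The stage and minor-cycle counts quoted above follow from the array size: each stage has N/2 butterfly operations, and 16 of them are processed per minor cycle. A minimal sketch (the function name `schedule` is illustrative, and timing overheads are deliberately left out):

```python
def schedule(n, num_bf=16):
    """Return (stages, minor cycles per stage, total minor cycles)."""
    stages = n.bit_length() - 1             # log2(n) for a power of two
    minor_per_stage = (n // 2) // num_bf    # N/2 butterflies, 16 per minor cycle
    return stages, minor_per_stage, stages * minor_per_stage

for n in (32, 64, 128, 256, 512, 1024, 2048):
    print(n, schedule(n))
```

This gives one minor cycle per stage for N = 32, two for N = 64, and 32 for N = 1,024, in agreement with the figures in the text.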
4.4 Data routers
5 Chip implementation and results
5.1 Reduced area and power reports
The proposed distributed arithmetic-based FFT processor results in reduced area as well as power, as the computations are distributed over many clock cycles with less hardware. The latency created by these bit-serial distributed operations is exploited in the architecture of the processor, making this design area and power efficient. This reconfigurable FFT core is of the coarse-grain type, whose basic building blocks are the power- and area-efficient radix-2 DIF butterflies.
Chip implementation details
Technology  45-nm CMOS 

Voltage  1.08 V 
Process  1P6M 
PVT conditions  Typical 
Word length  64 bits 
FFT size  32 to 2,048 
Internal RAM  1.25 KB 
ROM  2 KB 
Maximum frequency  100 MHz 
Core area  0.973 mm^{2} 
Cell count  307,201 
Leakage power  0.034 mW 
Total power  68.17 mW 
Energy per FFT  14 nJ for 2,048 points 
Comparison of conventional and DAA-based designs at 20 MHz
Parameter  DFPA-BF  Conventional BF  Percentage saving  Proposed FFT processor  Conventional BF-based processor  Percentage saving 

Maximum frequency (MHz)  100  20  −  100  20  − 
No. of cells  18,074  46,886  61.45  245,452  571,590  57.06 
Area (mm^{2})  0.031  0.055  43.64  0.694  1.04  33.27 
Leakage power (nW)  1,318  2,711  51.38  26,574  48,679  45.41 
Total power at 20 MHz (mW)  0.9878  4.937  79.99  28.9  46.85  38.31 
5.2 Latency and throughput of the design
Latency as a function of N
FFT size (N)  Latency  Throughput (KS/s) 

64  7 μs  119.375 
128  15.56 μs  139.375 
256  38.36 μs  159.843 
512  87.36 μs  180.625 
1,024  0.196 ms  201.601 
2,048  0.435 ms  222.625 
6 Comparison with prior low power FFT processors
Note: *Data width used in this design is 64
Comparison of design features and performance of various FFT processors
Parameter  This work  [KaiJiun]  [Chia]  [Song]  [Chu yu]  [Manish]  

Technology  45 nm  90 nm  65 nm  180 nm  180 nm  180 nm 
Voltage (V_{dd})  1.08 V  1 V  0.45 V  1.8 V  1.8 V  1.8 V 
Architecture/algorithm  Array-based, DFPA-BF/radix-2  MDC, 4-stream, radix-4/8  Mixed radix MDF  Flexible radix, MDF multiple stream  Pipelined, SDF  R2SDF 
FFT size/modes  Variable 64 to 2,048  Variable 128 to 2,048  Variable 128 to 2,048  128/256/512/1,024  Fixed 64  Variable 128 to 2,048 
14 streams  
Maximum frequency  100 MHz  40 MHz  20 MHz  300 MHz  20 MHz  40 MHz 
Word length  64 bits  16 bits (input)  24  20  16  32 
Memory  3.25 KB internal memory (RAM + ROM)  Dual port SRAM (10,224 × 16 bits)  48 KB of register file  Mixed SRAM  DL buffers  FIFO of varying sizes 
Core area  0.973 mm^{2}  3.1 mm^{2}  1.375 mm^{2}  3.2 mm^{2}  0.88 mm^{2}  4.52 mm^{2} 
Power consumption  68 mW  63.72 mW  4.05  507 mW at 512 points  9.79 mW  55.64 mW 
Normalized area  0.475  1.51  0.858  1.25  3.45  0.275 
Normalized power  0.332 μW  3.62 μW  1.51 μW  3.8 μW  11 μW  0.489 μW 
Energy per FFT
FFT size (N)  Execution time (μs)  Energy per FFT (nJ) 

64  7.64  7.43 
128  17.84  8.79 
256  40.92  10.18 
512  92.48  11.60 
1,024  206.44  13.01 
2,048  456.08  14.44 
7 Conclusions
In this paper, we have presented an array architecture with folding transformation for a reconfigurable (32/64/128/256/512/1,024/2,048 points) FFT processor. The systematic folding transformation is illustrated for N = 32, and the same approach is used for the other FFT sizes. The computational nodes are ultra low power, low-area, distributed arithmetic-based FP butterflies, which yield a low power processor with less silicon area compared with existing low power FFT processors. The array of 16 folded butterfly elements works in a quasi-parallel mode. The number of butterflies was selected as 16 after analyzing different implementation factors and the control mechanism. Another new feature of this processor is its use of very low power butterfly elements whose design is based on DAA. The processor designed in this work occupies a silicon area of 0.973 mm^{2} with a power dissipation of 68 mW at a 100-MHz operating frequency. The throughput is calculated to be in the range of 119 to 222 KS/s, where one sample is 64 bits. The energy efficiency is also very high, ranging from 7.4 to 14.4 nJ/FFT for FFT sizes from 64 to 2,048. Thus, this design is one of the most energy-efficient processors designed so far.
Declarations
Acknowledgements
The authors express their heartfelt thanks to Mr. P. Radhakrishnan, Open Silicon Pvt Ltd, for his valuable inputs and constant guidance in completing this project. The authors also extend their gratitude to Dr. Anand Samuel, Dr. Menaka, Dr. Ravi Shankar, Prof. Reena, and Mr. Avinash of VIT University, Chennai, India for their valuable suggestions in the manuscript preparation and constant moral support in completing this work.
References
[1] Cooley JW, Tukey JW: An algorithm for the machine calculation of complex Fourier series. Math Comp 1965, 19: 297–301. 10.1090/S00255718196501785861
[2] Weidong L, Lars W: A Pipeline FFT Processor. IEEE Workshop on Signal Processing Systems (SIPS) 1999, 654–662.
[3] Cheng Han S, Kun Bin L, Chein Wei J: Design and Implementation of a Scalable Fast Fourier Transform Core. Proceedings of 2002 Asia-Pacific Conference on ASIC. 2002, 295–298.
[4] Chu Y, Mao Hsu Y, Pao Ann H, Sao Jie C: A low power 64-point pipeline FFT/IFFT processor for OFDM applications. IEEE Trans. Consum. Electron 2011, 57(1):40–45.
[5] Gin Der W, Yi Ming L: Radix 2² Based Low Power Reconfigurable FFT Processor. Proc. of IEEE International Symposium on Industrial Electronics (ISIE 2009). 2009, 1134–1138.
[6] Xue LIU, Feng YU, Z k Wang NG: A pipelined architecture for normal I/O order FFT. Journal Zhejiang University-Science C (Computers & Electronics) 2011, 12(1):76–82. 10.1631/jzus.C1000234
[7] Byungcheol K, Jaeseok K: Low complexity multipoint 4-channel FFT Processor for IEEE 802.11n MIMO-OFDM WLAN system. International Conference on Green and Ubiquitous Technology (GUT). 2012, 94–97. 7–8
[8] Yang KJ, Tsai SH: MDC FFT/IFFT processor with variable length for MIMO OFDM systems. IEEE Trans. Very Large Scale Integration (VLSI) Syst 2013, 21(4):1188–1203.
[9] Garrido M, Grajal J, Sanchez MA, Gustafsson O: Pipelined radix-2^{k} feed forward FFT architectures. IEEE Trans. VLSI 2013, 21: 23–32.
[10] Arslan T, Erdogan DH, Horrocks AT: Low power design for DSP methodologies and techniques. Microelectron J 1996, 27: 731–744. Elsevier Science Ltd 10.1016/00262692(96)000109
[11] Liu G, Feng Q: ASIC Design of Low Power Reconfigurable FFT Processor. ASICON '07, 7th International Conference on ASIC, IEEE. 2007.
[12] Shuenn Shyang W, Chien Sung LI: An area-efficient design of variable-length fast Fourier transform processor. J. VLSI Signal Processing Systems 2008, 51: 245–256. Springer Science 10.1007/s1126500700638
[13] Weidong L, Lars W: Low Power FFT Processors. Swedish System-on-Chip Conference, SSoCC'01, Arild, Sweden 2001, 20–21.
[14] Taesang C, Hanho L: A high-speed low-complexity modified radix-2^{5} FFT processor for high rate WPAN applications. IEEE Trans. Very Large Scale Integration (VLSI) Syst 2013, 21(1):187–191.
[15] Manish S, Patil TD, Chhatbar, Darji AD: An area efficient and low power implementation of 2048 point FFT/IFFT processor for mobile WiMax. Proceedings of Signal Processing and Communication Conference (SPCOM) 2010, 14.
[16] Xiaojin L, ZongSheng L: A low power and small area FFT processor for OFDM demodulator. IEEE Trans. Consum. Electron 2007, 53(2):274–277.
[17] Yifan B, Renfeng D, Jun H, Xiaoyang Z: A Hardware efficient Variable-length FFT Processor for Low Power Applications. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) 2013, 14.
[18] Chia Hsiang Y, Tsung H, Yu D, Marković : Power and area minimization of reconfigurable FFT processors: A 3GPP-LTE Example. IEEE J. Solid State Circuits 2012, 47: 3.
[19] Guillermo B, Uwe Meyer B, Antonio G, Manuel R: Quantization analysis and enhancement of a VLSI gradient-based motion estimation architecture. Digital Signal Process 2012, 22(6):1174–1187. ISSN 1051–2004 10.1016/j.dsp.2012.05.013
[20] Botella G, Garcia A, Rodriguez-Alvarez M, Ros E, Meyer-Baese U, Molina MC: Robust bioinspired architecture for optical-flow computation. IEEE Trans. Very Large Scale Integration (VLSI) Syst 2010, 18(4):616–629.
[21] Botella G, Martín HJA, Santos M, Meyer-Baese U: FPGA-based multimodal embedded sensor system integrating low- and mid-level vision. Sensors 2011, 11: 8164–8179. 10.3390/s110808164
[22] Song Nein T, Chi Hsang L, Tsin Yuan C: An area efficient, multimode FFT processor for WPAN/WLAN/WMAN systems. IEEE Trans. Solid States 2012, 47: 6.
[23] Parhi KK: VLSI Digital Signal Processing Systems: Design and Implementation. Wiley India Pvt. Limited; 2007.
[24] Meyer Baese U, Botella G, Romero DE, Kumm M: Optimization of high speed pipelining in FPGA-based FIR filter design using genetic algorithm. SPIE Defense, Security, and Sensing (pp 84010R–84010R). International Society for Optics and Photonics 2012.
[25] Chow P, Vranesic ZG, Yen JL: A pipelined distributed arithmetic PFFT processor. IEEE Trans. Comput 1983, C-32(12):1128–1136.
[26] Augusta S, Srinivasan R, Raja J: Distributed arithmetic based butterfly element for FFT processor, in 45 nm technology. ARPN J. Eng. Appl. Sci 2013, 8: 1.
[27] Ayinila M, Brown M, Parhi KK: Pipelined parallel FFT architectures via folding transformation. IEEE Trans. VLSI Syst 2012, 20: 6.
[28] Kang HJ, Lee JY, Kim JH: Low-complexity twiddle factor generation for FFT processor. Electron Lett 2013, 49(23):1443–1445. 10.1049/el.2013.2461
[29] Bass BM: A low power high performance 1024 point FFT processor. IEEE J. Solid State Circ 1999, 34(3):380–387. 10.1109/4.748190
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.