Instruction scheduling heuristic for an efficient FFT in VLIW processors with balanced resource usage
EURASIP Journal on Advances in Signal Processing volume 2016, Article number: 38 (2016)
Abstract
The fast Fourier transform (FFT) is perhaps today’s most ubiquitous algorithm used with digital data; hence, it is still being studied extensively. Besides the benefit of reducing the arithmetic count in the FFT algorithm, memory references and the scheme’s projection onto the processor’s architecture are critical for a fast and efficient implementation. One of the main bottlenecks lies in the long-latency memory accesses to butterflies’ legs and in the redundant references to twiddle factors. In this paper, we describe a new FFT implementation on high-end very long instruction word (VLIW) digital signal processors (DSPs), which presents improved performance in terms of clock cycles due to the resulting low-level resource balance and to the reduced memory accesses of twiddle factors. The method introduces a trade-off parameter between accuracy and speed. Additionally, we suggest a cache-efficient implementation methodology for the FFT, depending on the provided VLIW hardware resources and cache structure. Experimental results on a TI VLIW DSP show that our method reduces the number of clock cycles by an average of 51 % (a 2-times acceleration) when compared to the most assembly-optimized and vendor-tuned FFT libraries. The FFT was generated using an instruction-level scheduling heuristic. It is a modulo-based register-sensitive scheduling algorithm, which is able to compute an aggressively efficient sequence of VLIW instructions for the FFT, maximizing the parallelism rate and minimizing clock cycles and register usage.
Introduction
The discrete Fourier transform (DFT) is a fundamental transform for the spectral analysis of finite-domain discrete-time signals and is widely employed in signal processing systems. For decades, studies have been conducted to improve the algorithmic efficiency of this technique, and this is still an active research field. The frequency response of a discrete signal x[n] can be computed using the DFT formula over N samples as in (1).
\( X[k]={\displaystyle \sum_{n=0}^{N-1}}x[n]\,{W}_N^{nk} \)  (1)

for k ∈ {0,…,N − 1}, where \( {W}_N^{nk}={e}^{-j\left(2\pi /N\right)nk} \)
The DFT algorithmic complexity is O(N^2). In order to reduce this arithmetic count, and therefore enhance implementation efficiency, a family of methods was proposed. These methods are commonly known as fast Fourier transforms (FFTs), and they present a valuable complexity enhancement to O(N log(N)). The FFT was first discovered by Gauss in the early nineteenth century and re-proposed by Cooley and Tukey in 1965 [1]. The idea is based on the fundamental principle of dividing the computation of a DFT into smaller successive DFTs and recursively repeating this process. The fixed-radix category of FFT algorithms mainly includes radix-2 (dividing the DFT into 2 parts), radix-4 (into 4 parts), radix-2^2, and radix-8 [2]. Mixed-radix FFTs combine several fixed-radix algorithms for better convenience [3]. Split-radix FFTs offer a lower arithmetic count than the fixed- or mixed-radix ones, using a special irregular decomposition [4, 5]. Also, a recursive FFT can be implemented as in [6], and a combination of the decimation-in-frequency (DIF) and the decimation-in-time (DIT) FFT is proposed in [7].
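The divide-and-conquer principle can be illustrated with a short sketch: a generic radix-2 DIT recursion compared against the direct O(N^2) evaluation of (1). This is an illustrative Python model (function names are ours), not the radix-2^2 DIF scheme adopted later in this paper.

```python
import cmath

def dft(x):
    # Direct O(N^2) evaluation of the DFT definition in (1).
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft(x):
    # Recursive radix-2 DIT FFT, O(N log N); N must be a power of 2.
    N = len(x)
    if N == 1:
        return x[:]
    even = fft(x[0::2])   # DFT of even-indexed samples
    odd = fft(x[1::2])    # DFT of odd-indexed samples
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]  # twiddle factor W_N^k
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out
```

Both functions produce the same spectrum; only the operation count differs.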
The FFT algorithm has been implemented in both hardware and software, on different platforms. Hardware IP implementations on ASICs or FPGAs can provide real-time high-speed solutions but lack flexibility [8]. Equivalent implementations on general-purpose processors (GPPs) offer flexibility, but they are generally slower and cannot meet tight real-time constraints. Multicore digital signal processors (DSPs) are interesting hardware platforms achieving a trade-off between flexibility and performance. They share with FPGAs a well-earned reputation of being difficult targets for developing parallel applications. As a result, several languages (such as OpenCL) seeking to exploit the power of multicores while remaining platform independent have recently been explored [9, 10]. OpenCL, for instance, is an industry attempt to unify embedded multicore programming, allowing data parallelism, SIMD instructions, and data locality as well [11].
Modern low-power multicore DSP architectures have attracted many real-time applications with power restrictions. One of the primary examples in this field is the C6678 multicore DSP from Texas Instruments, which can provide up to 16 GFLOPS/watt. In [12], a real-time low-power motion estimation algorithm based on the McGM gradient model was implemented on the TI C6678 DSP, exploiting DSP features and loop-level very long instruction word (VLIW) parallelism. The implementation provided significant power efficiency gains over current high-end architectures (multicore CPUs, manycore GPUs). In [13], a low-level optimization of the 4-, 8-, and 16-point FFTs on the C6678 DSP is presented.
The most recent high-end DSP architectures are VLIW, which mainly support an instruction-level parallelism (ILP) feature, offering the possibility to execute multiple instructions simultaneously, and a data-level parallelism allowing access to multiple data during each cycle. Therefore, these processors are known to have greater performance than RISC or CISC ones, despite having a simpler and more explicit internal design. However, unlike superscalar processors, where parallelism opportunities are discovered by hardware at run time, VLIW DSPs leave this task to the software compiler. It has been shown that constructing an optimal compiler performing instruction mapping and scheduling for VLIW architectures is an NP-complete problem [14]. Faced with this increasingly difficult task, compilers have been unable to capitalize on these existing features and often cannot produce efficient code [15].
Several compilation techniques have tackled this problem. A classical method is list scheduling [16], which schedules instructions within a single block by building a directed acyclic dependency graph (DAG) (as in Fig. 1). Trace scheduling was introduced afterwards by Fisher [17]; it selects and acts on acyclic paths having the highest probability of execution (called traces). Trace scheduling was unable to properly optimize loops; a more effective solution to this specific problem was later provided by software pipelining [18]. The most successful software pipelining technique is known as modulo scheduling [19]. We have proposed an efficient enumeration-based heuristic for modulo scheduling in [20].
One of the bottlenecks toward an efficient FFT implementation on these VLIW DSP platforms is memory latency. Indeed, in addition to the memory accesses to butterflies’ inputs and outputs over several stages (log_2(N) stages for a radix-2 FFT, shown in Fig. 2), conventional algorithms excessively and redundantly load twiddle factors (W_N^k). The latter are classically stored as an N-sized complex vector, requiring O(N log(N)) accesses. In [21], a novel memory reference reduction method is introduced, grouping together butterflies that use the same twiddle factors, thereby decreasing the number of memory references due to twiddle factors in DSPs by 76 %. Another FFT scheme is presented in [22], reducing the memory access frequency and the number of multiplication operations. Also, other results in [23] decrease the number of twiddle accesses by an asymptotic factor of log(N), based on G and G^T transforms. On the other hand, FFT performance is tightly dependent on the processor’s architecture and its memory and cache structure. Kelefouras et al. [24] propose a methodology to speed up the FFT algorithm depending on the memory hierarchy of the processor’s architecture. State-of-the-art FFT libraries such as FFTW [25–27] and UHFFT [28] maximize performance by adapting to the hardware at run time, usually using a planner searching over a large space of parameters in order to pick the best implementation. FFTS in [29] claims an efficient cache-oblivious FFT scheme that does not require any machine-specific calibration.
In the present paper, we propose an efficient modulo-like FFT scheduling for VLIW processors. Our implemented methodology allows a better core-level resource balance, exploiting the fact that the twiddle factors can be calculated recurrently using multipliers and entirely generated in masked time. The resulting scheme creates a vital balance between the computation capability and the data bandwidth required by the FFT algorithm, taking the VLIW architecture into account. Moreover, a significant amount of input buffer space can be freed, since twiddle factors are no longer stored nor referenced from memory, avoiding significant memory stalls. Since the recurrent computation of twiddle factors using multipliers induces a processing error, a trade-off parameter between accuracy and speed is introduced.
Besides, our proposed implementation methodology takes into account the memory hierarchy, the memory banks, the cache size, the cache associativity, and its line size, in order to adapt the FFT algorithm well to a broad range of embedded VLIW processors. The bit reversal was efficiently designed to take advantage of the cache structure, and a mixed scheme was proposed for FFT/iFFT sizes that do not fit in the cache.
The VLIW assembly code of the FFT/iFFT was generated using a scheduling heuristic. The proposed FFT-specific modulo instruction scheduling algorithm is resource-constrained and register-sensitive, and it uses controlled backtracking. This heuristic reorders the scheduling array to accelerate the backtracking search for the best schedule within the NP-complete state space. Our algorithm applies a strict optimization on internally used registers, so that generating the twiddle factors of the new FFT scheme can be effectively masked in parallel with other VLIW instructions.
The idea of recurrently generating twiddle factors during the FFT calculation is also discussed in our previous paper [30]. In the present paper, we additionally propose a VLIW-generic recurrent FFT scheme and the related instruction scheduling generator.
The remainder of this paper is organized as follows: background on the FFT algorithm of interest is given in Section 2; an overview of VLIW DSP processors is presented in Section 3; the new FFT scheme is explained in Section 4; the modulo-like scheduling of the suggested FFT is described in Section 5; finally, experimental results are given in Section 6.
Background on the FFT algorithm of interest
Many factors other than the pure number of arithmetic operations must be considered for an efficient FFT implementation on a VLIW DSP; these derive from memory-induced stalls, regularity, and the algorithm’s projection onto hardware VLIW architectures. One of the FFT algorithms that has proved satisfactory on DSPs is the radix-4 FFT, mostly implemented by manufacturers. This is mainly due to its relatively fewer memory accesses (during log_4(N) stages vs. log_2(N) stages in a radix-2 scheme), in addition to its regular, straightforward, and less complex algorithm (compared, for instance, to split-radix, radix-8, and higher-radix FFTs). The radix-4 FFT is usually mixed with a last radix-2 stage, enabling it to handle all power-of-2 sizes instead of being limited to power-of-4 sizes.
We distinguish between two basic types of radix-4 algorithms: decimation-in-time (DIT), obtained by decomposing the input x[n], and decimation-in-frequency (DIF), obtained by decomposing the output X[k]. Both DIT and DIF present the same computational cost. In this paper, we are interested only in DIF versions.
Building the radix-4 algorithm can be done starting from the DFT formula in (1), which can be rewritten in a decomposed form, consequently obtaining the DIF radix-4 butterfly description. An FFT computation is transformed into four smaller FFTs through a divide-and-conquer approach, making a radix-4 scheme with log_4(N) stages, each containing N/4 butterflies.
Through the given scheme, when the FFT input is in its natural order, the output will be arranged in the so-called digit-reversed order, which is defined depending on the used radix. It is worth noting that radix-2 FFT schemes present an easier reordering process.
In 1996, He and Torkelson proposed in [31] a radix-4 variant with the same computational complexity, which they called the radix-2^2 FFT. Its main advantage is that it retains the same butterfly structure as the radix-2 FFT and therefore preserves a bit-reversed order at the output. This is useful not only when reordering the output (compared to the 4-based digit-reversed reordering) but also when mixing it with a last radix-2 stage, which in this case can be done directly, thanks to the inherent compatibility between radix-2 and radix-2^2 schemes. The new butterfly definition is given in Fig. 3 and the resulting algorithm’s structure in Fig. 4.
The radix-2^2 FFT is adopted in this work and forms the basis of our new FFT scheme. Next, we describe the targeted VLIW family and the related state-of-the-art modulo scheduling.
VLIW DSP processors
Architecture overview
VLIW platforms allow several heterogeneous operations to be executed in parallel, due to the multiple execution units in their architecture, achieving high computation capability (Fig. 5). In this case, the instruction size is increased depending on the number of units (usually 128 or 256 bits). Variable instruction sizes can be used as well to avoid additional NOP operations, for code optimization. Although each operation requires a number of cycles to execute, VLIW instructions can be fully pipelined; consequently, new operations can start every cycle at every functional unit.
In VLIW processors, instructions are pipelined and loop iterations can be overlapped; iterations are thus initiated in parallel at a constant rate called the initiation interval (II) [19] (Fig. 6).
Among industrial VLIW platforms, we mention the Texas Instruments TMS320C6x, Texas Instruments 66AK2Hx, Freescale StarCore MSC8x, ADI TigerSHARC, and Infineon Carmel.
Background on the modulo scheduling
Modulo scheduling is a software pipelining technique, exploiting ILP to schedule multiple iterations of a loop in an overlapped form, initiating iterations at a constant rate called the initiation interval.
Modulo instruction scheduling schemes are mainly heuristic-based, as finding an optimal solution is proven to be an NP-complete problem. The basic common operations of this technique are as follows: compute a lower bound of the initiation interval (denoted MII), which depends on the provided resources and dependencies; next, starting from this MII value, search for a valid schedule respecting hardware and graph dependency constraints; if no valid schedule can be found, increase the II value by 1 and search for a feasible schedule again; repeat this last process until a solution is found. Higher performance is achieved for lower II; increasing II reduces the amount of exploited parallelism but makes finding a valid schedule easier [32]. The main techniques evoked in the literature for modulo scheduling include iterative modulo scheduling (IMS) [19], which uses backtracking by scheduling and unscheduling operations at different time slots while searching for a better solution. Slack modulo scheduling [33] minimizes the needed registers by reducing the lifetimes of operands, using their scheduling freedom (or slack). Integrated register-sensitive iterative software pipelining (IRIS) [34] modifies the IMS technique to minimize register requirements. Swing modulo scheduling (SMS) [35, 36] does not use backtracking but orders graph nodes in a way that guarantees effective schedules with low register pressure. The modulo scheduling with integrated register spilling (MIRS) in [37] integrates the possibility of temporarily storing data out to memory (using spill code) when a schedule would exceed the number of available registers in the processor.
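The II search common to these techniques can be sketched as a minimal Python model. The operation/unit counts and the `fits` feasibility test below are hypothetical placeholders for a real scheduler, which would also account for dependency recurrences and register constraints.

```python
from math import ceil

def resource_mii(op_counts, unit_counts):
    # Resource-constrained lower bound on II: for each unit type, the
    # number of cycles needed if that type were the only constraint.
    return max(ceil(op_counts[u] / unit_counts[u]) for u in op_counts)

def modulo_schedule(op_counts, unit_counts, fits):
    # Generic II search: start at MII and increase II by 1 until a
    # feasible kernel is found (as in iterative modulo scheduling).
    ii = resource_mii(op_counts, unit_counts)
    while not fits(ii):
        ii += 1
    return ii

# Toy example: 22 loads/stores on 2 .D units dominate -> MII = 11.
ops = {'.M': 24, '.D': 22, '.LS': 44}
units = {'.M': 8, '.D': 2, '.LS': 8}
```

If the feasibility test rejects II = 11 (e.g., because of register pressure), the search simply settles for the next II that works.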
In general, the problem statement starts from a directed acyclic graph (DAG), with nodes representing the operations of a loop and edges the intra- or inter-iteration dependencies between them; edges are weighted with instruction latencies. The desired schedule must be functionally correct with respect to data dependencies and hardware conflicts, minimizing both II and register usage, therefore reducing the execution time. Register pressure is a critical element to consider while searching for a schedule; a commonly used strategy is to minimize the lifetimes of instructions’ inputs/outputs.
Accordingly, modulo scheduling focuses on arranging instructions in a window of II slots called the kernel. When m iterations are overlapped within it, then m − 1 iterations must be done separately before and after entering the kernel; these are called the prolog and epilog, respectively. In order to reduce the code expansion naturally induced by modulo scheduling (typically for the prolog/epilog parts), hardware facilities for software pipelining are implemented in VLIW processors.
Next, our new FFT scheme is discussed for implementation possibilities on VLIW DSPs.
Our implementation methodology for the FFT on VLIW DSPs
A motivating example
The key idea behind the FFT scheme that we are proposing is to create a balance between the computation capability and the data bandwidth required by the FFT algorithm. In the following, we analyze the VLIW operations needed by the conventional FFT scheme on a TI C66 device, in order to evaluate the default efficiency of resource usage.
The TI TMS320C66 is a fixed- and floating-point DSP. It contains eight cores, each working at a clock frequency of 1.0 to 1.25 GHz. A C66 CorePac core is able to execute up to eight separate instructions in parallel, thanks to the eight functional units in its VLIW structure, as in Fig. 7. The maximum ability of one core is to execute 8 fixed/floating-point 32-bit multiplications per cycle using the .M units, 8 fixed/floating-point 32-bit additions/subtractions per cycle using the .L and .S units, and 128-bit loads/stores per cycle using the .D units. The internal register files are composed of 64 32-bit registers [A31:A0 and B31:B0] [38].
As seen previously, the conventional radix-2^2 FFT algorithm needs log_4(N) stages (assuming N a power-of-4); and within each stage, N/4 radix-2^2 butterflies are computed. Besides, 8 loads/stores are needed for each butterfly’s legs, in addition to 3 load operations for twiddle factors, as shown in Fig. 3; making a total of 11 loads/stores per butterfly. Moreover, eight complex additions/subtractions, three complex multiplications, and a special (−j) multiplication are required to complete the processing of a single butterfly.
Let us denote n_d the number of loaded/stored data in bytes during the whole FFT algorithm, n_m the total number of required real multiplications, n_a the total number of required real additions/subtractions, and n_j the total number of (−j) multiplications. Then, the conventional algorithm needs the (n_d, n_m, n_a, n_j) expressed in (2) (always assuming N a power-of-4, and denoting B_N = N log_4(N)/4 the total number of radix-4 butterflies):

(n_d, n_m, n_a, n_j) = (88 B_N, 12 B_N, 22 B_N, B_N)  (2)

The n_d formula is scaled by a factor of 8 since the used samples are single-precision floating point, hence coded in 8 bytes for both real and imaginary parts. The n_m formula takes into consideration that each complex multiplication is translated into 4 real multiplications, and the n_a formula sums the additions/subtractions that are needed in both complex additions/subtractions {16} and in complex multiplications {6}.
A VLIW core can notably issue loads/stores, multiplications, additions/subtractions, and (−j) multiplications in parallel. In the C66 core case, the load/store capacity is 16 bytes per cycle using the .D units (denoted next by p_d). Eight real floating-point single-precision multiplications can be done per cycle (p_m) using the .M units, and eight real floating-point single-precision additions/subtractions are achievable per cycle (p_a) using both the .L and .S units. Finally, 2 multiplications by (−j) are possible per cycle (p_j) using the .L and .S units as well; the latter are simplified into combinations of move and sign-change operations between real and imaginary parts.
According to these VLIW core capacities, the maximum peak performance in terms of clock cycles for the whole conventional FFT on the device is equal to:

Cycles_min = MAX(n_d/p_d, n_m/p_m, n_a/p_a, n_j/p_j)
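Under the operation counts of Eq. (2) and the C66 capacities above, this bound can be checked numerically. The following is an illustrative Python sketch (variable names are ours):

```python
from math import ceil

# Per-cycle capacities of one C66 core: 16 load/store bytes (.D units),
# 8 real mults (.M), 8 real adds/subs (.L/.S), 2 (-j) mults (.L/.S).
C66 = {'d': 16, 'm': 8, 'a': 8, 'j': 2}

def conventional_counts(N):
    # Eq. (2): B_N = N*log4(N)/4 butterflies; per butterfly, 11 complex
    # loads/stores of 8 bytes, 12 real mults, 22 real adds/subs, 1 (-j).
    log4 = (N.bit_length() - 1) // 2  # exact for N a power of 4
    B = N * log4 // 4
    return {'d': 88 * B, 'm': 12 * B, 'a': 22 * B, 'j': B}

def peak_cycles(N, p):
    # Lower bound on cycles: the most loaded resource dictates the count.
    c = conventional_counts(N)
    return max(ceil(c[k] / p[k]) for k in c)
```

For N = 4096, this gives 33,792 cycles dictated by loads/stores (versus 16,896 for adds/subs and 9216 for multiplications), confirming that memory accesses are the conventional scheme’s bottleneck.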
The minimal need for each set of operations on the given C66 VLIW core is shown in Fig. 8. The most optimized conventional way of implementing this algorithm at the instruction level leads to the kernel form presented in Fig. 9. It shows the mapping of a butterfly’s instructions onto the eight VLIW functional units (.D1, .L1, .S1 …). Since each C66 core contains two side banks (A and B), we processed one butterfly per side, symmetrically.
In Fig. 9, the loads of the four butterfly legs are marked m0, m1, m2, and m3. Stores are represented by the entities sf0, sf1, sf2, and sf3. Additions and subtractions are referred to as a0, a1, a2, a3, v0, v1, v2, and v3. A multiplication by (−j) is translated into two operations on the .L/.S units: a3ii (a3i1, a3i2). In addition, multiplications are marked mf1, mf2, and mf3; and since we are processing complex data (with real and imaginary parts), extra additions are needed to complete the complex multiplication operation: f1, f2, f3. Finally, the symbols w1, w2, and w3 represent loads from memory of the needed twiddle factors (W_N^k).
Hence, there is indeed an imbalance between the required computation capability and the load/store bandwidth in this VLIW case, with excessive loads/stores versus a lower use of computation units.
Our new method changes the structure of the algorithm to fit the VLIW hardware, creating a balance in resource usage and therefore minimizing the overall clock cycles. In an attempt to reduce the load/store pressure, we suggest not loading the twiddle factors but generating them internally instead. This idea uses the fact that the twiddle factors of the n-indexed butterfly, which are (W_N^{2n}, W_N^n, W_N^{3n}), can be deduced from those of the (n − 1)-indexed butterfly according to the formulas: W_N^{2n} = W_N^{2(n − 1)} W_N^2, W_N^n = W_N^{n − 1} W_N, and W_N^{3n} = W_N^{3(n − 1)} W_N^3. This trades loads/stores for multiplications/additions and makes an FFT scheme with butterflies that are dependent on each other; therefore, we cannot start the processing of the n-indexed butterfly if the (n − 1)-indexed butterfly has not yet computed its twiddle factors.
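This recurrence can be sketched numerically (an illustrative Python model in double precision; function names are ours). It also exhibits the source of the accuracy/speed trade-off: each recurrent multiplication accumulates a small rounding error, which periodic reseeding from exact values can bound.

```python
import cmath

def twiddles_recurrent(N):
    # Yield (W_N^n, W_N^2n, W_N^3n) for n = 0..N/4-1, each triple deduced
    # from the previous one by one complex multiplication (no loads).
    W1 = cmath.exp(-2j * cmath.pi / N)
    W2, W3 = W1 * W1, W1 * W1 * W1
    w1 = w2 = w3 = 1 + 0j
    for _ in range(N // 4):
        yield (w1, w2, w3)
        w1, w2, w3 = w1 * W1, w2 * W2, w3 * W3

# Accumulated error versus directly computed twiddle factors.
N = 1024
err = max(abs(w1 - cmath.exp(-2j * cmath.pi * n / N))
          for n, (w1, _, _) in enumerate(twiddles_recurrent(N)))
```

In double precision the drift over N/4 steps stays negligible; on a fixed-precision DSP datapath, the reseeding period becomes the trade-off parameter between accuracy and speed.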
We decided to process one butterfly per C66 core side bank here as well, grouping two butterflies within a single larger iteration. This makes an II large enough to wait for the generation of the needed twiddle factors by subsequent pipeline stages (7 cycles are required on the TI C66 device to complete a floating-point complex multiplication).
This new scheme reduces the loads/stores in exchange for an increase in arithmetic operations (Fig. 10). Indeed, the number of needed cycles for loads/stores becomes N log_4(N) since twiddle factors are no longer loaded, while the numbers of multiplications and additions/subtractions are increased to 3N log_4(N)/4 cycles and N log_4(N) cycles, respectively.
Our modulo scheduling heuristic, which will be detailed in the next sections, was able to generate the aggressively optimized schedule of Fig. 11 for the new FFT scheme. The obtained initiation interval (II) is 8 cycles instead of 11, fitting a limited core register set, despite the extra arithmetic operations added for twiddle factor generation. This improves the raw time needed for the FFT by 27 % (excluding memory stall gains) over the conventional implementation with loaded twiddle factors. Moreover, this scheme requires no references to twiddle factors and no space for their memory storage either.
Proposed FFT implementation methodology
Our implemented methodology allows a better core-level resource balance, exploiting the fact that the twiddle factors can be calculated recurrently using multipliers during the execution. The resulting scheme, regarding a VLIW architecture, creates a vital balance between the computation capability and the data bandwidth required by the FFT algorithm. Besides, it takes into account the memory hierarchy, the memory banks, the cache size, the cache associativity, and its line size, in order to adapt the FFT algorithm well to a broad range of embedded VLIW processors. The bit reversal was efficiently designed to take advantage of the cache structure, and a mixed scheme was proposed for FFT/iFFT sizes that do not fit in the cache.
A recurrent FFT scheme without memory references of twiddle factors
The proposed FFT scheme generates twiddle factors using multipliers instead of loading them from a precomputed array. These are recurrently computed from previously handled butterflies. Therefore, the processing of the n-indexed butterfly is time-constrained by the (n − 1)-indexed butterfly.
Let us denote t_w the latency, in cycles, needed to compute a twiddle factor from a previously calculated one; consequently, if an iteration processes one butterfly, then the II must be greater than or equal to t_w, waiting the required time to generate the twiddle factors for the next iteration.
For maximum FFT performance, the initiation interval must be minimized, expressed for our new FFT as MII = MAX(RCPB for loads/stores, RCPB for multiplications, RCPB for adds/subs and (−j) multiplications, t_w), where RCPB denotes the required cycles per butterfly.
In order to mask the effect of t_w on the II, we unroll a number U of successive iterations into a single large one, so that dependencies are reduced to between groups of merged iterations. The new MII expression becomes MII = MAX(U × RCPB for loads/stores, U × RCPB for multiplications, U × RCPB for adds/subs and (−j) multiplications, t_w).
In the earlier equation, the 3U twiddle factors per group are deduced from 3U previously calculated ones, which reduces the dependencies toward the constant delay t_w. The minimum value of U minimizing MII (having MII ≥ t_w) will be denoted U_m next. During each n-indexed group of U_m butterflies, 3U_m twiddle factors are needed, which are computed from those of the (n − 1)-indexed group as in (3) (denoting the twiddle factors by γ_{n,k}). On the other hand, treating many butterflies simultaneously per iteration requires the usage of more VLIW core registers.
Denoting \( {\gamma}_{n,k}={W}_N^{n{U}_m+k} \), the recurrence holds for 0 ≤ k < U_m and \( 1\le n<\frac{N}{4{U}_m} \):

\( {\gamma}_{n,k}={\gamma}_{n-1,k}\,{W}_N^{U_m} \)  (3)

(the squared and cubed factors γ_{n,k}^2 and γ_{n,k}^3 are deduced likewise, using the constants W_N^{2U_m} and W_N^{3U_m}).
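The choice of U_m and the resulting MII can be sketched numerically (an illustrative Python fragment; the per-butterfly cycle costs are those derived above for the new C66 scheme, and t_w = 7 is the C66 floating-point complex-multiply latency):

```python
from math import ceil

def minimal_unroll(rcpb, t_w):
    # Smallest U such that a kernel of U butterflies is long enough to
    # mask the twiddle-generation latency t_w; also return the MII.
    worst = max(rcpb)                      # dominant resource, cycles/butterfly
    U = max(1, ceil(t_w / worst))
    mii = max(max(U * r for r in rcpb), t_w)
    return U, mii

# C66 new scheme: 4 cycles/butterfly of loads/stores, 3 of multiplications,
# 4 of adds/subs; t_w = 7 cycles for a complex multiplication.
U_m, MII = minimal_unroll((4, 3, 4), 7)
```

This recovers the kernel dimensions of Fig. 11: U_m = 2 butterflies per iteration and II = 8 cycles.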
The resulting structure of a group of U_m butterflies composing a single larger iteration is shown in Fig. 12. The minimum II will then be expressed as follows:
where (n′_d, n′_m, n′_a, n′_j) are the new needs in terms of arithmetic and load/store operations, expressed as in (4).
Consequently, this new scheme reduces memory accesses by 27 %, which is an implementation advantage on a broad range of architectures, as most FFT algorithms use memory extensively. In addition, it gives the opportunity to use the unexploited VLIW units for a possibly masked generation of twiddle factors. Besides, since it is not an obvious task to generate an efficient pipelined schedule (having II = MII) with respect to hardware constraints and available core registers, we suggest in later sections an aggressive FFT-specific scheduling heuristic.
This scheme requires no references to twiddle factors and no space for their memory storage either, making significant gains on the related memory latencies.
The key parameters of our scheme are the VLIW core features (p_d, p_m, p_a, p_j, t_w). By computing MII_1 when using an FFT scheme with loaded twiddle factors (using n_d, n_m, n_a, and n_j, expressed in Eq. (2)), and MII_2 when using an FFT scheme with recurrent computation of twiddle factors (using n′_d, n′_m, n′_a, and n′_j, expressed in Eq. (4)), then if MII_1 < MII_2, the provided VLIW core is not suitable for the proposed scheme; otherwise, the minimal gain is expressed as 100(MII_1 − MII_2)/MII_1.
Setup code integration within the pipelined flow
The previous section described the low-level instruction mapping of the innermost loop of the new FFT scheme. In order to complete the FFT/iFFT implementation, intermediate setup iterations (representing outer loops) must be injected into the pipelined flow of iterations. The straightforward way is to completely drain the last iterations of the inner loop, execute the setup code (constants reset, pointer updates …), and then resume the pipelined execution; this turns out to be time-consuming due to the time needed for the prolog/epilog parts. Indeed, if the dynamic length (DL) denotes the number of cycles needed by an inner loop iteration for its processing, then the whole FFT will require at least B_N MII/U_m + (DL − II)(N/48 − 1/3) cycles (assuming N a power-of-4). The integer expression (N/48 − 1/3) counts the number of interruptions of the inner loop kernel, assuming that the last two FFT stages are specially treated and done without setup code merging. For N = 4k on the C66, for example, the setup code interruptions represent 7 % of the main processing.
VLIW architectures can support the merging of setup code into the pipelined iterations (as in the C66 case), making it possible to add a customized iteration concurrently with the others, without draining the pipeline. Therefore, we can insert an additional II-cycle iteration, setting up the changes required by the outer loops to begin the next sequence of inner loop iterations; the whole FFT/iFFT routine then needs only B_N MII/U_m + II(N/48 − 1/3) cycles. This enhances the efficiency, the setup code representing only 2.7 % of the main processing when N = 4k on C66 cores.
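These cycle counts can be sketched as follows (an illustrative Python model; the dynamic length DL = 28 is an assumed value, chosen here only so that the drained-case overhead lands near the reported 7 %, since the paper does not state DL explicitly):

```python
def setup_overhead(N, MII, U_m, II, DL):
    # Main kernel cycles plus the cost of interrupting the inner loop
    # (N/48 - 1/3) times: either draining the pipeline (DL - II extra
    # cycles per interruption) or merging one II-cycle setup iteration.
    log4 = (N.bit_length() - 1) // 2   # exact for N a power of 4
    B = N * log4 // 4                  # total radix-4 butterflies B_N
    interrupts = (N - 16) // 48        # integer value of N/48 - 1/3
    main = B * MII // U_m
    drained = (DL - II) * interrupts
    merged = II * interrupts
    return main, drained, merged

# C66, N = 4096: MII = 8, U_m = 2, II = 8; DL = 28 is assumed.
main, drained, merged = setup_overhead(4096, 8, 2, 8, 28)
```

With these values the merged setup costs 680 cycles against 24,576 main cycles (≈2.8 %, matching the order of the 2.7 % figure), while draining would cost 1700 cycles (≈6.9 %).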
Cache-efficient bit reversal computation
The FFT naturally returns an N-sized output in bit-reversed order, so post-reordering the data is necessary. Bit reversing an element at a 0-based index k consists of moving it to the position of index m, such that m is obtained by reversing the log_2(N) bits of the binary form of k. Processors usually implement hardware bit reversal of a 32-bit integer; hence, the wanted function can be obtained by left-shifting an index by 32 − log_2(N) bits and bit-reversing the whole word afterwards.
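This word-reversal trick can be modeled as follows (a Python sketch of the hardware operation; function names are ours):

```python
def reverse32(w):
    # Bit-reverse a full 32-bit word (a single instruction on many DSPs).
    r = 0
    for _ in range(32):
        r = (r << 1) | (w & 1)
        w >>= 1
    return r

def bit_reverse(k, log2N):
    # Reverse the low log2(N) bits of index k: shift them to the top of
    # a 32-bit word, then reverse the whole word.
    return reverse32((k << (32 - log2N)) & 0xFFFFFFFF)
```

For example, with log_2(N) = 3, index 1 (001) maps to 4 (100) and index 6 (110) maps to 3 (011).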
In order to increase computation efficiency, one can integrate the reordering step into the last stage of the FFT/iFFT, rather than creating a separate routine. One difficulty encountered in bit reversal is accessing scattered positions of memory, causing many memory stalls. Indeed, a common L1D cache architecture divides it into several banks, such that simultaneous accesses to distinct banks are possible, while concurrent accesses to the same bank induce latencies. In processor architectures that allow multiple data to be accessed in a cycle, the L1D cache level is divided into banks that are defined by the lower bits of an address (AMD processors, TI C6x DSPs, …). In the C66 CorePac architecture, for example, there are 8 L1D banks, such that the address range [4b + 32k, 4(b + 1) + 32k) (where k and b are integers) is linked with the bank number b (b ∈ [0,7]).
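This mapping reduces to taking bits [4:2] of the byte address, which makes the conflict pattern easy to check (a sketch under the 8-bank, 4-byte-wide layout just described, with 8-byte complex samples):

```python
def l1d_bank(addr):
    # [4b + 32k, 4(b + 1) + 32k) maps to bank b: bits [4:2] of the address.
    return (addr >> 2) & 0x7

# The store indexes of the first butterflies (0, N/4, N/2, 3N/4) all
# fall into the same bank for a large power-of-4 N...
N = 4096
same_bank = {l1d_bank(8 * i) for i in (0, N // 4, N // 2, 3 * N // 4)}

# ...whereas consecutive indexes i and i + 1 hit different banks,
# which is what the reordered store pattern below exploits.
```

The set `same_bank` collapses to a single bank, illustrating why two parallel stores of butterfly legs are likely to conflict in the straightforward layout.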
It turns out that the store indexes related to the first butterflies (0, N/2, N/4, …) usually all belong to the same memory bank (as long as N is large); consequently, 2 parallel stores in a constructed kernel will likely target different addresses from the same bank, inducing stalls. Besides, it is recommended to access consecutive addresses, which can possibly be merged into larger accesses by subsequent data paths.
The straightforward implementation provides no successive stores, nor does it avoid bank conflicts. A possible enhancement can be achieved by bit-reversing the input indexes of the last FFT stage instead of the output ones. The resulting structure of this FFT/iFFT stage is described below (processing four butterflies per iteration):
i = 0;
loop N/16 times
    br = bit_reverse(i);
    load_legs(br, br + 1, br + 2, br + 3);
    load_legs(N/2 + br, N/2 + br + 1, N/2 + br + 2, N/2 + br + 3);
    load_legs(N/4 + br, N/4 + br + 1, N/4 + br + 2, N/4 + br + 3);
    load_legs(3N/4 + br, 3N/4 + br + 1, 3N/4 + br + 2, 3N/4 + br + 3);
    process_butterflies();
    store_legs(i, i + N/2, i + N/4, i + 3N/4);
    store_legs(i + 1, i + N/2 + 1, i + N/4 + 1, i + 3N/4 + 1);
    store_legs(i + 2, i + N/2 + 2, i + N/4 + 2, i + 3N/4 + 2);
    store_legs(i + 3, i + N/2 + 3, i + N/4 + 3, i + 3N/4 + 3);
    i = i + 4;
end loop
Doing so, we can always issue parallel stores targeting different memory banks, thus avoiding bank conflicts (for example, data (i) in one bank in parallel with data (i + 1) in another bank). Furthermore, 4 consecutive store accesses are now possible. In this case, butterflies are processed in an order that provides the maximum number of consecutive stores.
During the FFT, in-place computation on an input buffer is performed until the last stage, where the stored data are put into the output buffer. Processing 2, 4, or more butterflies per iteration increases register pressure; we have applied the same scheduling heuristic that will be described later to find a feasible implementation with the additional constraints on data accesses. The schedule found has II = MII with all constraints verified, ensuring consecutive stores and avoiding bank conflicts. The obtained performance of the FFT routine with bit reversal was similar to that of an FFT without bit reversal.
For FFT sizes that are a power of 2 but not a power of 4, an additional radix-2 stage is added; the bit reversal is merged into it in the same manner.
Adapting the FFT to the cache associativity
When the FFT size is greater than the allocated cache size, the considered radix-2^{2} scheme may present some inefficiency toward the L1D cache. The L1D cache is composed of a number of cache lines (usually of 64 bytes), used to store external data prior to their use by the CPU. A cache miss occurs when the requested data is not yet available in the cache; in this case, the CPU is stalled waiting for a cache line to be updated. Many cache structures have been used in CPU architectures: a direct-mapped cache associates each address of the external memory with a unique position in the cache (therefore with one cache line); as a result, two addresses that are separated by a multiple of the cache size cannot survive in the cache at the same time. A more advanced mechanism is the set-associative cache, where the cache memory is divided into a number p of ways, such that when a cache miss occurs, data is transferred to the way whose cache line is the least recently used (LRU); consequently, there are p possible locations in the cache for every address. The main advantage of increasing the associativity is to let non-contiguous data survive in cache lines without overwriting each other (without cache thrashing).
A radix-2^{2} butterfly loads its legs from the indexes 0, N/4, N/2, and 3N/4. All this data should exist in the cache at the same time. If the L1D cache is 2-way set-associative and denoting by L1D_S the allocated cache size in bytes, no cache thrashing happens if N is less than or equal to L1D_S/8. Otherwise, the cache lines for indexes (0 and N/2) or (N/4 and 3N/4) will overwrite each other continuously, decreasing the cache efficiency. A solution to this consists of applying radix-2 FFT stages for larger sizes, down to the size (L1D_S/8) where radix-2^{2} can be used without cache thrashing. Indeed, since radix-2 FFTs only access pairs of elements at indexes like (0 and N/2), no cache thrashing occurs no matter how large N is, as long as the cache is 2-way set-associative.
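The thrashing condition can be illustrated with a small sketch that maps the four leg addresses to cache sets (a simplified set-associative model with 64-byte lines; the function names and the 8-byte complex sample size are illustrative assumptions):

```python
LINE_SIZE = 64  # bytes per cache line (typical)

def cache_set(addr, cache_size, ways):
    """Set index of a byte address in a set-associative cache."""
    sets = cache_size // (LINE_SIZE * ways)
    return (addr // LINE_SIZE) % sets

def radix22_thrashes(n, cache_size, ways, bytes_per_sample=8):
    """True if the 4 radix-2^2 legs (0, N/4, N/2, 3N/4) cannot coexist:
    more legs map to a single cache set than there are ways."""
    legs = [0, n // 4, n // 2, 3 * n // 4]
    counts = {}
    for leg in legs:
        s = cache_set(leg * bytes_per_sample, cache_size, ways)
        counts[s] = counts.get(s, 0) + 1
    return max(counts.values()) > ways
```

With a 32-KB 2-way L1D, N = 4096 (= L1D_S/8) incurs no thrashing, while N = 8192 does; with a 4-way cache the same N = 8192 fits, matching the CACHE_A < 4 condition in the pseudocode below.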
A radix-2 FFT without memory references to twiddle factors was similarly built and generated using our scheduling heuristic, leading to a schedule of II = MII (merging 4 radix-2 butterflies in a single iteration).
Denoting the cache associativity parameter by CACHE_A, the pseudocode of our cache-adapted FFT scheme is written below:
function fft_cache()
begin
    step = N;
    while (step > L1D_S/8 and CACHE_A < 4) loop
        for (k = 0; k < N/step; k++)
            fft_radix2_stage(input + k*step, step); // radix-2 FFT stage
        step = step/2;
    end loop
    for (k = 0; k < N/step; k++)
        fft(input + k*step, step); // radix-2^{2} FFT
    bit_reversal(input, output);
end
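The driver logic above can be sketched as runnable Python, a simplification in which numpy FFTs stand in for the optimized radix-2^{2} sub-FFTs and a block reordering plays the role of bit_reversal; sizes and cache parameters are illustrative:

```python
import numpy as np

def dif_radix2_stage(x):
    """One in-place decimation-in-frequency radix-2 stage over a block."""
    n = len(x)
    half = n // 2
    w = np.exp(-2j * np.pi * np.arange(half) / n)  # twiddles W_n^k
    a, b = x[:half].copy(), x[half:].copy()
    x[:half] = a + b
    x[half:] = (a - b) * w

def fft_cache(x, l1d_s=32768, cache_a=2):
    """Sketch of the cache-adapted driver: peel DIF radix-2 stages until the
    sub-FFT size meets the 2-way cache condition (step <= L1D_S/8), then
    finish with full sub-FFTs and a final reordering."""
    x = np.array(x, dtype=complex)
    n = len(x)
    step = n
    while step > l1d_s // 8 and cache_a < 4:
        for k in range(0, n, step):
            dif_radix2_stage(x[k:k + step])
        step //= 2
    nblocks = n // step
    stages = nblocks.bit_length() - 1  # number of peeled radix-2 stages
    out = np.empty_like(x)
    for b in range(nblocks):
        # block b's sub-FFT outputs land at stride nblocks, offset by the
        # bit-reversed block index (the job of bit_reversal in the paper)
        j = int(format(b, 'b').zfill(stages)[::-1], 2) if stages else 0
        out[j::nblocks] = np.fft.fft(x[b * step:(b + 1) * step])
    return out
```

With `l1d_s=1024` and a 512-point input, two radix-2 stages are peeled before the sub-FFTs, and the result matches a direct FFT.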
A bit reversal routine was designed separately in this case and cache-optimized; its II (= MII) was extended so that each iteration fully treats 4 cache lines.
Using a radix-2 FFT for the first stages causes a slight drop in the overall efficiency. In fact, a full-radix-2 scheme requires more time than a full-radix-2^{2} scheme (on C66 cores, it needs N log_{2}(N) cycles at peak performance instead of N log_{4}(N)). Even so, its gain is largely dominant for large FFTs by avoiding cache thrashing. For example, the 16k-FFT performance on C66 cores using a full-radix-2^{2} scheme is 367,890 cycles; it decreased to 245,206 cycles (33 % gain) using the scheme that avoids cache thrashing.
The inverse FFT is the same as the FFT, except that it uses conjugated twiddle factors and extra multiplications by 1/N added to the last stage. These modifications can be performed without decreasing performance, by exploiting the ILP feature of VLIW processors.
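The stated construction (conjugated twiddle factors plus a final 1/N scaling) matches the standard identity iFFT(X) = conj(FFT(conj(X)))/N, which can be checked with a short sketch:

```python
import numpy as np

def ifft_via_fft(x):
    """Inverse FFT using a forward FFT: conjugating the input is equivalent
    to conjugating every twiddle factor; the 1/N scaling is applied at the
    end (merged into the last stage in the paper's implementation)."""
    x = np.asarray(x, dtype=complex)
    return np.conj(np.fft.fft(np.conj(x))) / len(x)
```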
FFT scheme accuracy
A possible side effect of the proposed implementation is a slightly reduced precision. Indeed, twiddle factors computed internally in a recurrent fashion are less accurate than precomputed ones obtained using trigonometric functions and loaded from memory. We introduce in Fig. 13 a tradeoff parameter (tradeoff_factor) between accuracy and speed. The tradeoff scheme injects more precomputed twiddle factors within the FFT flow instead of using only one, which reduces error accumulation effects. However, since the pipeline is regularly stopped to process more precomputed twiddle factors, the speed performance slightly drops.
The key idea of the algorithm in Fig. 13 is to use more than one precomputed twiddle factor per FFT stage in order to limit the error propagation. Indeed, if the tradeoff_factor is 0, then only 1 twiddle factor is used to feed the whole FFT process. Otherwise, 2^(tradeoff_factor)/2 precomputed twiddle factors are used per FFT stage. We will denote the tradeoff_factor by T in the following.
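As an illustrative sketch (not the paper's VLIW kernel, and modeling a single factor sequence rather than per-stage seeding), the reseeding idea can be written as follows: with T = 0 every twiddle factor derives from one seed by repeated multiplication, while with larger T the chain is restarted from an exactly computed seed every few steps, shortening each error-accumulating chain:

```python
import cmath

def twiddles_recurrent(n, tradeoff_factor=0):
    """Generate the n twiddle factors W_n^k = exp(-2*pi*i*k/n) by repeated
    multiplication, reseeding from exactly computed values every `seg` steps.
    tradeoff_factor = 0 -> a single seed for the whole chain; larger values
    -> 2**T / 2 seeds (shorter chains, less accumulated roundoff, at the
    cost of extra precomputed-factor handling)."""
    n_seeds = max(1, (2 ** tradeoff_factor) // 2)
    seg = n // n_seeds                      # chain length between reseeds
    step = cmath.exp(-2j * cmath.pi / n)    # W_n^1
    w = []
    for k in range(n):
        if k % seg == 0:
            cur = cmath.exp(-2j * cmath.pi * k / n)  # precomputed seed
        else:
            cur = cur * step                          # recurrent multiply
        w.append(cur)
    return w
```

Comparing against exact values shows the maximum error shrinking as more seeds are injected, at the cost of more precomputed factors.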
The calculation error of twiddle factors using the repeated multiplication algorithm grows as O(N), as shown in [39]. Therefore, using our tradeoff method, the twiddle-factor accuracy can be expressed as follows:
where \( {\widehat{\omega}}_N^k \) is the approximated twiddle factor value and u is the unit roundoff. For IEEE754 singleprecision floating point, the unit roundoff is equal to u = 5.96 × 10^{− 8}.
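As a quick sanity check of the quoted value, the binary32 unit roundoff is 2^{−24} (half the machine epsilon):

```python
# Unit roundoff of IEEE 754 single precision: u = 2**-24
u = 2.0 ** -24
print(f"{u:.3g}")  # 5.96e-08
```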
For increasing values of T, the method's accuracy increases, but the FFT time also increases according to the following formula:
where DL denotes the dynamic length of an inner loop iteration. The parameter T thus creates a tradeoff between accuracy and speed.
Instruction scheduling heuristic for FFT/iFFT implementations on VLIW
Introduction
According to the proposed FFT scheme, a group of U_m butterflies must be processed during a single iteration. Furthermore, extra operations are added to compute the needed twiddle factors in the masked time; this yields an iteration with a much larger number of operations to be scheduled during II = MII within a limited core register set. The data dependence graph of this inner-loop FFT iteration is shown in Fig. 14; each node of the graph (among 30U_m nodes) can be executed on 1 functional unit of the VLIW core.
In a TI C66 device for instance, since we aim for a schedule with II = MII = 8, the kernel can execute up to 64 nodes (due to the 8-way VLIW architecture). This leads to a functional unit pressure of 60 out of 64 possible slots (94 % unit pressure), which shows the difficulty class of the actual scheduling problem in the presence of a limited register set.
The new FFT scheme merges U_m totally independent butterflies in a single iteration, making it possible to symmetrically divide the computation on processors with a symmetric core-bank structure (ADI TigerSHARC, TI C66 [Fig. 7], …). This has the effect of reducing the problem size by half (to U_m/2 butterflies) and of avoiding core-bank communication, which is usually limited and subject to many other constraints.
Each node of the graph is associated with a specific assembly instruction, requiring input/output operands with precise sizes, and a list of possible functional units for execution (examples of some TI C66 instructions are given in Table 1).
At the core-register level, the FFT algorithm needs to allocate a number of registers exclusively to store data pointers, constants, counters, and twiddle factors. First, 1 register is needed to store the loop iteration counter; 4 registers per butterfly contain pointers to input and output butterfly legs; 1 register holds the FFT stride; 3 registers per butterfly hold jump offsets; 1 register holds a pointer to the final output buffer; and 1 register is the stack pointer (where some core registers are spilled). Besides, 3 register pairs are exclusively needed for \( \left({W}_N^{U_m},{W}_N^{2{U}_m},{W}_N^{3{U}_m}\right) \), and 3U_m twiddle factors per butterfly must be allocated as well. The remaining registers must be used for operand allocation of the entire FFT/iFFT DAG. Our scheduling algorithm must take this limiting register constraint into account.
One of the most efficient scheduling heuristics is SMS, as evaluated in [32, 35, 36]. Applied to our FFT problem on the TI C66 core, SMS produced a schedule of II = MII with a minimum register usage of 20 per core side (40 needed registers), which is greater than the number of registers available for the DAG allocation, meaning that this is not an implementable schedule. Our scheduling technique aims to find a valid schedule with II = MII and a minimum register requirement in a reasonable time.
A proposed modulo scheduling algorithm
The new scheduling algorithm starts by ordering the graph nodes of one core side into a 1-dimensional array, such that if a node is indexed j in this array, all of its predecessors are indexed less than j. This is required because the scheduling algorithm presented next uses this generated array order to schedule/unschedule graph nodes in a backtracking fashion and computes the best starting time of each node (regarding register lifetime) based on all of its already scheduled predecessors. The scheduling order is critical, and our algorithm uses a special ordering of graph nodes, which will be presented later.
For each node in the ordered list, we define two parameters, the node possible start time (NPST) and the node end time (NET), which describe the freedom or possible slots in the actual schedule; they are defined according to (5). We define as well the node start time (NST) as the effective start time, varying between NPST and NET, with pred_{i} being the ith predecessor of a node. Special freedom ranges are attributed to root nodes without any predecessor and to those with inter-iteration dependences.
Based on the slot freedom of each node in the formed ordered list, we schedule them in a 2-dimensional kernel starting from their NPST. When it is not feasible to schedule an operation due to a unit or write conflict, other possible units are tried, or another cycle in [NPST + 1; NET] is used in a backtracking mode. The algorithm is lifetime-sensitive, integrating an accumulative measure of how long the returned operands remain in core registers before being used. This criterion at a graph node is defined as the total lifetime elapsed since results are returned by all its predecessors until they are consumed by this node; it is expressed in (6), where NRS (“node result size”) is the size in terms of core registers of a node's results. Notice that this lifetime is weighted by the amount of registers that are held pending use. The scheduling algorithm should minimize the overall accumulative lifetime, which takes every node of the graph into consideration; in this case, register usage is in general also minimized.
The total length of the accumulative lifetime divided by II gives a lower bound on register pressure, denoted AvgLive. Although a minimized AvgLive generally implies smaller register usage, it does not necessarily provide the lowest. For example, one found schedule with a total lifetime of 110 requires a minimum of 17 registers, while another schedule with a lifetime of 118 requires only a minimum of 16 registers. A better register lower bound is computed by considering overlapped lifetimes over the II cycles, producing an array (called LiveVector) of II elements as described in [33]. The maximum among the LiveVector values is called MaxLive, which is a precise lower-bound measure of the number of needed registers. It was shown in [33] that a schedule requires at most MaxLive + 1 registers.
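The LiveVector/MaxLive computation of [33] can be sketched as follows (the lifetime list in the usage example is illustrative):

```python
def max_live(lifetimes, ii):
    """LiveVector/MaxLive in the sense of [33]: fold each value's live
    interval [def_cycle, last_use_cycle) modulo II and count the overlaps
    per kernel cycle; the maximum entry is MaxLive, a lower bound on the
    registers the schedule needs. `lifetimes` is a list of
    (def_cycle, last_use_cycle) pairs from a modulo schedule."""
    live_vector = [0] * ii
    for start, end in lifetimes:
        for c in range(start, end):      # value is live in [start, end)
            live_vector[c % ii] += 1
    return max(live_vector), live_vector
```

For instance, with II = 4 and three values live over [0,3), [1,5), and [2,4), the LiveVector is [2, 2, 3, 2], so MaxLive = 3 (while the total lifetime 9 gives AvgLive = ceil(9/4) = 3 here as well).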
Our algorithm merges the calculation of MaxLive into the search process, informing of the register pressure at any partial schedule. This allows us to efficiently cut off useless branches in the state space, significantly reducing the scheduling time. The pseudocode of our scheduling algorithm is given in Fig. 15.
As the state space cannot be scanned entirely due to its NP-complete nature, despite the MaxLive cutoffs, we propose to make smaller successive searches with different starting points in the search space. This has the advantage of finding a better schedule faster. Indeed, if we are not able to find a better solution within a specified amount of backtracking tries (100M nodes as an example), then it is most likely because the first placed operations constrain the efficiency and therefore must be changed. The algorithm thus starts with an initial ordering of the scheduling array; subsequently, if the backtracking limit is reached, the scheduling array is reordered according to specific rules and the search process starts again from a new initial state; this sequence is repeated until a solution fitting the available registers is found.
The reordering part tries to guarantee different initial schedules and a fast convergence rate. The main criterion used while sorting is that when an operation v is unscheduled, the next operations to be rescheduled must be those that maximize their effect on this operation v. To illustrate this criterion, let us take an example and assume that the scheduling array is arranged as follows: {mw1, mw2, mw3, m0, m1, m2, m3, a0, a1, a2, a3, a3i1, a3i2, v0, v1, v2, v3, f1, f2, f3, sf0, sf1, sf2, sf3, w1, w2, w3}.
Next, we assume that the operations up to “sf3” were placed successfully into the kernel and found a valid slot. If the operation “w1” cannot be placed into the schedule (either because it presents a higher MaxLive pressure, or because no free place could be found within its freedom range of slots), then the algorithm will reschedule the previous operation in the scheduling array, which is “sf3”, and check whether this enhances the scheduling opportunities for “w1”. It will not have an effect, because “sf3” neither shares the same resource unit as “w1” nor is among its direct predecessors. Hence, rescheduling “sf3”, “sf2”, “sf1”, or “sf0” merely leads to useless states; only “mw1” (its direct predecessor) or operations using the same unit resource could have a chance to enable a valid placement for “w1”. In order to avoid scanning useless states first, a better ordering should be used (making “mw1” close to “w1” in the scheduling array, for example).
Accordingly, reordering the scheduling array takes the following considerations into account:

▪ If an operation is j-indexed in this array, all its predecessors must be indexed less than j {1}

▪ Each operation must be as close as possible to its direct predecessors {2}

▪ Each operation must be as close as possible to operations using the same unit resource {3}

▪ Operations with larger input/output sizes are more critical to reordering considerations {4}
We next define, for each graph node in the ordered list, a measure of its rescheduling easiness, denoted RF. It expresses how well a graph node meets the previously mentioned criteria regarding its position in the scheduling array. Equation (7) defines the considered RF for a graph node operation op:
The “.id” field in (7) represents the ordering position of a node within the array. The pn and cn fields denote, respectively, the number of predecessors of op and the number of concurrent operations that use the same unit resource as op and are indexed less than op. The formula is scaled by the number of registers (buffer size “.bs”) required by op.
The sum of the RF over every operation in the array reflects the ordering penalty (referred to next as OrdP). A better ordering should have a minimized sum. The reordering routine generates a number of ordering possibilities; the one having the minimal penalty is picked and used next. This routine is as follows (assuming the array is presented with condition {1} verified):
function change_order(Sch_Array)
begin
    loop
        choose op: an operation from the Sch_Array;
        min_i = index(op);
        max_i = min(index(successors(op))); // if no successor, set max_i to the upper bound
        choose a position s in the range [min_i, max_i - 1];
        move in the array the operation from position min_i to s;
        compute the OrdP and compare it to the best penalty found so far;
    end loop
    return the order having the lowest OrdP;
end
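The legal move range in change_order preserves invariant {1}: an operation may only move between its current position and just before its earliest successor, so it never crosses one of its successors (and, moving only forward, never crosses a predecessor). A hedged sketch of this move logic, with the OrdP evaluation of Eq. (7) omitted and illustrative helper names:

```python
def legal_positions(order, successors, op):
    """Positions to which `op` may move while keeping every operation
    before all of its successors (invariant {1}). `order` is the current
    scheduling array; `successors` maps an op to its successor list."""
    min_i = order.index(op)
    succ_idx = [order.index(s) for s in successors.get(op, [])]
    max_i = min(succ_idx) if succ_idx else len(order)  # no successor: upper bound
    return range(min_i, max_i)  # candidate slots s in [min_i, max_i - 1]

def move(order, op, s):
    """Return a new array with `op` moved to position s."""
    new_order = [o for o in order if o != op]
    new_order.insert(s, op)
    return new_order
```

For example, with order [a, b, c, d] and d the only successor of a, a may occupy positions 0 to 2; moving it to 2 yields [b, c, a, d], still ahead of d.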
The heuristic presented in this section was tested on a TI C66 VLIW core and generated an efficient schedule with II = MII and MaxLive = 12 within a few minutes. Figure 16 shows the effect of the ordering penalty on the convergence speed. The resulting kernel form for the FFT/iFFT on C66 is shown in Fig. 11.
Our proposed scheduling algorithm aggressively minimized register usage, improving the MaxLive by a factor of 1.7 on the TI C66 core compared to the SMS scheduling method (which returned a MaxLive of 20), making an FFT entirely free of twiddle-factor memory references implementable.
The key parameters of our scheduling method are mainly the computed MII and U_m from Section 4.2.1 and the number of symmetric clusters (denoted SymC) in the VLIW core (the method schedules U_m/SymC butterflies per cluster, reducing the algorithmic problem size by a factor of SymC). The scheduling also depends on the VLIW instructions' delay slots, operand sizes, and possible execution units, and on the number of available core registers per cluster.
Implementation and experimental results
The preceding implementation strategy for the FFT/iFFT was implemented on the high-end TMS320C6678 VLIW DSP and on the 66AK2H12 DSP, using standard C66 assembly. Data samples were single-precision complex floating point, with imaginary parts at odd indexes and real parts at even ones. During benchmarks, L1D/L1P was fully used as cache. Input/output buffers are stored in the L2 memory (512 KB in the C6678, 1 MB in the 66AK2H12). Moreover, the program code is mapped to the local L2 memory. The experimented FFT sizes are in the range [256, 64k], which corresponds to most signal processing applications. The comparison is made with the most vendor-tuned, linear-assembly-optimized FFT of TI (found in the DSPLib version 3.1.1.1), with all the TI compiler's optimizations active (-o3, compiler version 7.3.2).
A common efficiency measure for the FFT algorithm will be used, namely the number of cycles per pseudo-radix-2 butterfly (which we denote CPPR2B), defined as the FFT cycle count divided by N log_{2}(N)/2. A perfect FFT implementation on the CorePac C66 would have a CPPR2B of 1; otherwise, it is greater than 1. The performance comparison between our new FFT/iFFT (bit reversal included) and the optimized TI routines is presented in Fig. 17 and Table 2 (results are for 1 core of the C6678, running at a frequency of 1 GHz).
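The metric is straightforward to compute from a cycle count; the 245,206-cycle figure in the usage example is the cache-adapted 16k-FFT measurement quoted earlier:

```python
import math

def cppr2b(cycles, n):
    """Cycles per pseudo-radix-2 butterfly: the cycle count divided by
    (N/2) * log2(N), the number of radix-2 butterflies in an N-point FFT."""
    return cycles / (n * math.log2(n) / 2)
```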
Our FFT implementation shows great improvements over TI's; the peak performance was reached for N = 4k, as the limited cache associativity forces our special optimization to take place for N = 8k and larger, inducing relatively lower efficiency due to the integrated radix-2 stages. Small FFTs usually suffer from non-negligible overhead relative to the main processing. We are thus able to reach an average gain of 50.56 % (2 times acceleration) over TI's routines, with a maximum performance of 1.119 CPPR2B (89.36 % of absolute efficiency). This gain is explained by the proven 27 % gain of Section 4, by the suppressed latencies of twiddle factors, and by the other described optimizations.
Besides the speed performance, our FFT saves 50 % of buffer memory compared to conventional (TI) FFT routines (Table 3). Indeed, our scheme does not require twiddle factors to be stored or referenced from external memory.
The bit reversal computation did not affect the performance for FFT sizes less than 8k (0 % increase, due to our cache-efficient bit reversal optimization), while it increased the cycle count by an average of 18.4 % starting from 8k sizes; the loss represents the price of processing the bit reversal separately for large FFTs. Besides, the proposed FFT/iFFT adaptation to the cache associativity yields a gain of at least 69 % in terms of efficiency. The inverse FFT was optimally reformed such that its performance overlapped with the FFT routine, despite the fact that it includes extra arithmetic operations.
Figure 18 shows the relative RMS error comparison between the new FFT and TI's. The proposed FFT using 1 twiddle factor (tradeoff factor of 0) achieves for instance an RMS error of about 1E−05 (FFT size of 4k), against 5E−07 for TI's. Our FFT variant with a tradeoff factor of 1 (using log(N) precomputed twiddle factors) achieved about a 10 times enhancement in precision, while preserving exactly the same speed performance presented above. For most signal processing applications, this FFT accuracy is sufficient; for instance, the signal to FFT-noise ratio (SNR) we achieve in Fig. 18 for a 4k-FFT is about 107 dB, compared to 127 dB for the TI FFT and to 138 dB for the maximum achievable SNR of IEEE 754 single-precision floating point. It is worth noting that a 16-bit ADC converter has an SQNR of about 84 dB [40] (which is even less than the SNR achieved given the FFT's computation accuracy); the SQNR is usually raised by coherent signal processing gains (e.g., a 4k-FFT processing can increase the SQNR by 10 log_{10}(N/2) dB, or 33 dB).
On the other hand, our scheduling heuristic generates the kernel codes of the FFT/iFFT and aggressively optimizes the number of cycles and the register usage. It is able to compute the best schedule respecting tight pressure constraints with a fast convergence rate, improving on the results given by SMS by a factor of 1.7 (40 % gain) on the found MaxLive. The best generated schedule of instructions, with MaxLive = 12, was computed within a 2 to 15 min range.
Furthermore, a 4k-sample floating-point FFT was performed in-chip in 2.6 μs within a 10 W power consumption on a TI 66AK2H12 DSP device, a remarkable FFT implementation efficiency of 9.5 GFLOPS/watt. This makes it suitable for several compute-intensive applications, such as radar processing [41]. Our work has been used within the official FFT library of Texas Instruments [42].
In contrast, previous works on on-the-fly generation of twiddle factors, as in [43–45], used the CORDIC algorithm or related generation designs to compute the needed twiddle factors instead of performing ROM accesses. These techniques target hardware FFT designs in FPGAs or ASICs and are not applicable to CPU or DSP platforms. Indeed, the idea of generating twiddle factors using multipliers (or equivalent operations) in software for CPU/DSP is usually avoided, as it requires in most cases more latency than FFT schemes with precomputed twiddle factors. To the best of our knowledge, no published work has proposed a software-efficient solution to perform an FFT with on-the-fly generation of twiddle factors. The idea becomes possible with recent high-end VLIW processors, where parallel instructions computing the twiddle factors can be issued in the masked time; however, it requires proper scheduling and low-level control of the execution pattern to be done successfully.
Conclusions
In the present paper, a new radix-2^{2}-based FFT/iFFT scheme was proposed to fit VLIW processors. This structure balances the VLIW computation capabilities against the data bandwidth pressure, optimally exploiting parallelism opportunities and reducing memory references to twiddle factors, leading to an average efficiency gain of 51 % over the most assembly-optimized and vendor-tuned FFT on a high-end VLIW DSP. Our implementation methodology took into account the VLIW hardware resources and the cache structure, adapting the FFT algorithm to a broad range of embedded VLIW processors. In addition, a resource-constrained and register-sensitive modulo scheduling heuristic was designed to find the best low-level schedule for our FFT scheme, minimizing clock cycles and register usage using controlled backtracking, and generating an efficient assembly-optimized FFT with balanced resource usage.
References
 1.
JW Cooley, JW Tukey, An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19, 297–301 (1965)
 2.
GD Bergland, A radix-eight fast Fourier transform subroutine for real-valued series. IEEE Trans. on Electroacoust. 17(2), 138–144 (1969)
 3.
RC Singleton, An algorithm for computing the mixed radix fast Fourier transform. IEEE Trans. Audio Electroacoust. 1(2), 93–103 (1969)
 4.
P Duhamel, H Hollmann, Split radix FFT algorithm. Electronics Letters 20, 14–16 (1984)
 5.
D Takahashi, An extended split-radix FFT algorithm. IEEE Signal Processing Letters 8(5), 145–147 (2001)
 6.
AR Varkonyi-Koczy, A recursive fast Fourier transform algorithm. IEEE Trans. on Circuits and Systems, II 42, 614–616 (1995)
 7.
A Saidi, Decimationintimefrequency FFT algorithm. Proc. ICAPSS 3, 453–456 (1994)
 8.
BM Baas, A lowpower, highperformance, 1024point FFT processor. IEEE J. SolidState Circuits 34(3), 380–387 (1999)
 9.
R Weber et al., Comparing hardware accelerators in scientific applications: a case study. IEEE Trans. Parallel Distrib. Syst. 22(1), 58–68 (2011). doi:10.1109/TPDS.2010.125
 10.
T Fryza, J Svobodova, F Adamec, R Marsalek, J Prokopec, Overview of parallel platforms for common high performance computing. Radioengineering 21(1), 436–444 (2012)
 11.
Jia-Jhe Li, Chi-Bang Kuan, Tung-Yu Wu, and Jenq Kuen Lee, Enabling an OpenCL compiler for embedded multicore DSP systems, in Proceedings of the 2012 41st International Conference on Parallel Processing Workshops (ICPPW '12) (IEEE Computer Society, Washington, DC, USA, 2012), pp. 545–552
 12.
Francisco D. Igual, Guillermo Botella, Carlos García, Manuel Prieto, Francisco Tirado, Robust motion estimation on a lowpower multicore DSP. EURASIP Journal on Advances in Signal Processing. 99, 115 (2013)
 13.
T. Fryza and R. Mego, Low level source code optimizing for single/multi-core digital signal processors, in 2013 23rd International Conference Radioelektronika (RADIOELEKTRONIKA), Pardubice, 2013, pp. 288–291. doi:10.1109/RadioElek.2013.6530933
 14.
JA Fisher, P Faraboschi, C Young, Embedded computing: A VLIW approach to architecture, compilers, and tools. (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005), ISBN: 9780080477541.
 15.
V Živojnović, Compilers for digital signal processors. DSP & Multimedia Technology Magazine 4(5), 27–45 (1995)
 16.
J. P. Grossman, Compiler and architectural techniques for improving the effectiveness of VLIW compilation. [Online]. Available: http://www.ai.mit.edu/projects/aries/Documents/vliw.pdf [24-March-2016]
 17.
JA Fisher, Trace scheduling: a technique for global microcode compaction. IEEE Trans. Comput. 30(7), 478–490 (1981)
 18.
M Lam, Software pipelining: an effective scheduling technique for VLIW machines (Proc. SIGPLAN '88 Conference on Programming Language Design and Implementation, Atlanta, 1988), pp. 318–328
 19.
B Ramakrishna Rau, Iterative modulo scheduling: an algorithm for software pipelining loops. Proc. 27^{th} Annual International Symposium on Microarchitecture, 1994, pp. 63–74
 20.
M. Bahtat, S. Belkouch, P. Elleaume, P. Le Gall, Fast enumeration-based modulo scheduling heuristic for VLIW architectures, in 26th International Conference on Microelectronics (ICM), 2014, pp. 116–119. doi:10.1109/ICM.2014.7071820
 21.
Y Wang, Y Tang, Y Jiang, JG Chung, SS Song, MS Lim, Novel memory reference reduction methods for FFT implementation on DSP processors. IEEE Trans. Signal Process 55, 2338–2349 (2007). doi:10.1109/TSP.2007.892722
 22.
Y Jiang, T Zhou, Y Tang, Y Wang, Twiddlefactorbased FFT algorithm with reduced memory access, in Proc. 16th Int. Symp. Parallel Distrib. Process (IEEE Computer Soc, Washington, 2002), p. 70
 23.
K.J. Bowers, R.A. Lippert, R.O. Dror, D.E. Shaw, Improved twiddle access for fast Fourier transforms. IEEE Trans. Signal Process. 58(3), 1122–1130 (2010)
 24.
VI Kelefouras, G Athanasiou, N Alachiotis, HE Michail, A Kritikakou, CE Goutis, A methodology for speeding up fast Fourier transform focusing on memory architecture utilization. IEEE Trans. Signal Process 59(12), 6217–6226 (2011)
 25.
M Frigo, SG Johnson, The fastest Fourier transform in the west, in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1998
 26.
M Frigo, A fast Fourier transform compiler. SIGPLAN Not. 39, 642–655 (2004)
 27.
S Johnson, M Frigo, A modified split-radix FFT with fewer arithmetic operations. IEEE Trans. Signal Process 55(1), 111–119 (2006)
 28.
D Mirkovic, L Johnsson, Automatic performance tuning in the UHFFT library, in Computational Science—ICCS 2001 (Springer, New York, 2001), pp. 71–80
 29.
AM Blake, IH Witten, MJ Cree, The fastest Fourier transform in the south. IEEE Trans. Signal Process 61(19), 4707–4716 (2013)
 30.
M. Bahtat, S. Belkouch, P. Elleaume, P. Le Gall, Efficient implementation of a complete multibeam radar coherentprocessing on a telecom SoC, in 2014 International Radar Conference (Radar), pp. 16, 2014. doi: 10.1109/RADAR.2014.7060412
 31.
S He, M Torkelson, A new approach to pipeline FFT processor, in Proc. IEEE Parallel Processing Symp, 1996, pp. 766–770
 32.
J.M. Codina, J. Llosa, A. González, A comparative study of modulo scheduling techniques, in Proceedings of the 16th International Conference on Supercomputing (ICS '02) (ACM Press, 2002), p. 97
 33.
RA Huff, Lifetime-sensitive modulo scheduling, in Proc. of the ACM SIGPLAN '93 Conf. on Programming Language Design and Implementation, pp. 258–267 (1993)
 34.
AK Dani, VJ Ramanan, R Govindarajan, Register-sensitive software pipelining, in Parallel Processing Symposium, 1998 (IPPS/SPDP, Orlando, FL, 1998), pp. 194–198
 35.
J. Llosa, A. González, E. Ayguadé, M. Valero, Swing modulo scheduling: a lifetime-sensitive approach, in PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques (Boston, MA, 1996), pp. 80–86
 36.
J Llosa, E Ayguade, A Gonzalez, M Valero, J Eckhardt, Lifetimesensitive modulo scheduling in a production environment. IEEE Trans. Comput. 50(3), 234–249 (2002)
 37.
J Zalamea, J Llosa, E Ayguade, M Valero, Register constrained modulo scheduling. IEEE Trans. Parallel Distrib. Syst. 15(5), 417–430 (2004)
 38.
TMS320C6678, Multicore fixed and floating-point digital signal processor, Data Manual, Texas Instruments. SPRS691E. March 2014. [Online]. Available: www.ti.com/lit/gpn/tms320c6678 [25-March-2016]
 39.
M Tasche, H Zeuner, Improved roundoff error analysis for precomputed twiddle factors. J. Comput. Anal. Appl. 4(1), 1–18 (2012)
 40.
JJ Alter, JO Coleman, Radar digital signal processing, Chapter 25 in Merrill I. Skolnik, Radar Handbook, Third Edition, (McGrawHill, 2008)
 41.
A Klilou, S Belkouch, P Elleaume, P Le Gall, F Bourzeix, MM Hassani, Real-time parallel implementation of pulse-Doppler radar signal processing chain on a massively parallel machine based on multicore DSP and serial RapidIO interconnect. EURASIP Journal on Advances in Signal Processing 2014, 161 (2014)
 42.
Texas Instruments. FFT library for C66x floating-point devices, C66x FFTLIB, version 2.0. [Online]. Available: http://www.ti.com/tool/FFTLIB [25-March-2016]
 43.
Sang Yoon Park, Nam Ik Cho, Sang Uk Lee, Kichul Kim, Jisung Oh, Design of 2K/4K/8K-point FFT processor based on CORDIC algorithm in OFDM receiver, in Communications, Computers and Signal Processing, 2001 (PACRIM 2001), IEEE Pacific Rim Conference on, vol. 2, pp. 457–460 (2001). doi:10.1109/PACRIM.2001.953668
 44.
T Pitkänen, T Partanen, J Takala, Lowpower twiddle factor unit for FFT computation. Embedded Computer Systems: Architectures, Modeling, and Simulation Lecture Notes in Computer Science 4599(2007), 65–74 (2007)
 45.
JC Chi, SG Chen, An efficient FFT twiddle factor generator (Proc. European Signal Process. Conf, Vienna, 2004)
Acknowledgements
This research is sponsored in part by a Research and Development Contract from Thales Air Systems.
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Bahtat, M., Belkouch, S., Elleaume, P. et al. Instruction scheduling heuristic for an efficient FFT in VLIW processors with balanced resource usage. EURASIP J. Adv. Signal Process. 2016, 38 (2016). https://doi.org/10.1186/s1363401603360
Keywords
 FFT
 VLIW
 Scheduling heuristic
 Twiddle factors
 Modulo scheduling
 Software pipelining