Instruction scheduling heuristic for an efficient FFT in VLIW processors with balanced resource usage
Mounir Bahtat^{1}, Said Belkouch^{1}, Philippe Elleaume^{2} and Philippe Le Gall^{2}
https://doi.org/10.1186/s13634-016-0336-0
© Bahtat et al. 2016
 Received: 23 October 2015
 Accepted: 14 March 2016
 Published: 31 March 2016
Abstract
The fast Fourier transform (FFT) is perhaps today's most ubiquitous algorithm used with digital data; hence, it is still being studied extensively. Besides the benefit of reducing the arithmetic count in the FFT algorithm, memory references and the scheme's projection onto the processor's architecture are critical for a fast and efficient implementation. One of the main bottlenecks is in the long-latency memory accesses to butterflies' legs and in the redundant references to twiddle factors. In this paper, we describe a new FFT implementation on high-end very long instruction word (VLIW) digital signal processors (DSP), which presents improved performance in terms of clock cycles due to the resulting low-level resource balance and to the reduced memory accesses of twiddle factors. The method introduces a trade-off parameter between accuracy and speed. Additionally, we suggest a cache-efficient implementation methodology for the FFT, depending on the provided VLIW hardware resources and cache structure. Experimental results on a TI VLIW DSP show that our method reduces the number of clock cycles by an average of 51 % (a 2 times acceleration) when compared to the most assembly-optimized and vendor-tuned FFT libraries. The FFT was generated using an instruction-level scheduling heuristic. It is a modulo-based register-sensitive scheduling algorithm, which is able to compute an aggressively efficient sequence of VLIW instructions for the FFT, maximizing the parallelism rate and minimizing clock cycles and register usage.
Keywords
 FFT
 VLIW
 Scheduling heuristic
 Twiddle factors
 Modulo scheduling
 Software pipelining
1 Introduction
The DFT algorithmic complexity is O(N^{2}). In order to reduce this arithmetic count and therefore enhance its implementation efficiency, a set of methods was proposed. These methods are commonly known as fast Fourier transforms (FFTs), and they present a valuable enhancement in complexity of O(N log(N)). The FFT was first discovered by Gauss in the early nineteenth century and re-proposed by Cooley and Tukey in 1965 [1]. The idea is based on the fundamental principle of dividing the computation of a DFT into smaller successive DFTs and recursively repeating this process. The fixed-radix category of FFT algorithms mainly includes radix-2 (dividing the DFT into 2 parts), radix-4 (into 4 parts), radix-2^{2}, and radix-8 [2]. Mixed-radix FFTs combine several fixed-radix algorithms for better convenience [3]. Split-radix FFTs offer a lower arithmetic count than the fixed- or mixed-radix ones, using a special irregular decomposition [4, 5]. Also, a recursive FFT can be implemented as in [6], and a combination of the decimation-in-frequency (DIF) and the decimation-in-time (DIT) FFT is proposed in [7].
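The divide-and-conquer principle can be illustrated with a minimal recursive radix-2 FFT (an illustrative sketch in Python, not the radix-2^{2} scheme used later in this paper), checked against the O(N^{2}) DFT definition:

```python
import cmath

def dft(x):
    # Naive O(N^2) DFT, straight from the definition.
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft(x):
    # Recursive radix-2 decimation-in-time FFT, O(N log N); len(x) must be a
    # power of 2. Each call splits the DFT into two half-size DFTs.
    N = len(x)
    if N == 1:
        return x[:]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]   # twiddle factor W_N^k
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out
```

Both routines agree to machine precision on any power-of-2 input, which makes the recursion a convenient reference when validating optimized schemes.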
The FFT algorithm has been implemented in either hardware or software, on different platforms. Hardware IP implementations on ASIC or FPGA can provide real-time high-speed solutions but lack flexibility [8]. Equivalent implementations on general-purpose processors (GPP) offer flexibility, but they are generally slower and cannot meet tight real-time constraints. Multi-core digital signal processors (DSPs) are interesting hardware platforms achieving a trade-off between flexibility and performance. They share with FPGAs a well-earned reputation of being difficult targets for developing parallel applications. As a result, several languages (such as OpenCL) seeking to exploit the power of multi-cores while remaining platform independent have recently been explored [9, 10]. OpenCL, for instance, is an industry attempt to unify embedded multi-core programming, allowing data parallelism, SIMD instructions, and data locality as well [11].
Modern low-power multi-core DSP architectures have attracted many real-time applications with power restrictions. One of the primary examples in this field is the C6678 multi-core DSP from Texas Instruments, which can provide up to 16 GFLOPS/watt. In [12], a real-time low-power motion estimation algorithm based on the McGM gradient model was implemented on the TI C6678 DSP, exploiting DSP features and loop-level very long instruction word (VLIW) parallelism. The implementation provided significant power efficiency gains over current high-end architectures (multi-core CPUs, many-core GPUs). In [13], a low-level optimization of the 4-, 8-, and 16-point FFTs on the C6678 DSP is presented.
The most recent high-end DSP architectures are VLIW, which mainly support an instruction-level parallelism (ILP) feature, offering the possibility to execute multiple instructions simultaneously, and a data-level parallelism allowing access to multiple data during each cycle. Therefore, these processors are known to have greater performance compared to RISC or CISC designs, despite having a simpler and more explicit internal design. However, unlike superscalar processors where parallelism opportunities are discovered by hardware at run time, VLIW DSPs leave this task to the software compiler. It has been shown that constructing an optimal compiler performing instruction mapping and scheduling for VLIW architectures is an NP-complete problem [14]. Facing this increasingly difficult task, compilers have been unable to capitalize on these existing features and often cannot produce efficient code [15].
In the present paper, we propose an efficient modulo-like FFT scheduling for VLIW processors. Our implemented methodology allows a better core-level resource balance, exploiting the fact that the twiddle factors can be calculated recurrently using multipliers and entirely generated in masked time. The resulting scheme creates a vital balance between the computation capability and the data bandwidth that are required by the FFT algorithm, taking into account the VLIW architecture. Moreover, an important amount of input buffers can be freed since twiddle factors are no longer stored, nor referenced from memory, avoiding significant memory stalls. Since the recurrent computation of twiddle factors using multipliers induces a processing error, a trade-off parameter between accuracy and speed is introduced.
Besides, our proposed implementation methodology takes into account the memory hierarchy, the memory banks, the cache size, the cache associativity, and its line size, in order to adapt the FFT algorithm well to a broad range of embedded VLIW processors. The bit reversal was efficiently designed to take advantage of the cache structure, and a mixed scheme was proposed for FFT/iFFT sizes not fitting the cache size.
VLIW assembly code of the FFT/iFFT was generated using a scheduling heuristic. The proposed FFT-specific modulo instruction scheduling algorithm is resource-constrained and register-sensitive, and uses controlled backtracking. This heuristic reorders the scheduling array to accelerate the backtracking search for the best schedule within the NP-complete state space. Our algorithm applies a strict optimization on internally used registers, so that generating twiddle factors of the new FFT scheme can be effectively masked in parallel with other VLIW instructions.
The idea of recurrently generating twiddle factors during the FFT calculation is also discussed in our paper [30]. In the present paper, we additionally propose a VLIW-generic recurrent FFT scheme and the related instruction scheduling generator.
In the following, a background on the FFT algorithm of interest is given in Section 2. We then present an overview of VLIW DSP processors in Section 3; the new FFT scheme is explained in Section 4, and the modulo-like scheduling of the suggested FFT is described in Section 5. Finally, experimental results are given in Section 6.
2 Background on the FFT algorithm of interest
Many factors other than the pure number of arithmetic operations must be considered for an efficient FFT implementation on a VLIW DSP; these can be derived from memory-induced stalls, regularity, and the algorithm's projection onto hardware VLIW architectures. Yet, one of the FFT algorithms that has proved satisfactory on DSPs is the radix-4 FFT, mostly implemented by manufacturers. This is mainly due to its relatively fewer memory accesses (during log_{4}(N) stages vs. log_{2}(N) stages in a radix-2 scheme), in addition to its regular, straightforward, and less complex algorithm (compared, for instance, to split-radix FFTs, radix-8 and higher-radix FFTs). The radix-4 FFT is usually mixed with a last radix-2 stage, enabling it to treat sizes that are a power-of-2 and not only being limited to power-of-4 sizes.
We distinguish between two basic types of radix-4 algorithms: decimation-in-time (DIT) is the consequence of decomposing the input x[n], and decimation-in-frequency (DIF) of decomposing the output X[k]. Both DIT and DIF present the same computational cost. In this paper, we will only be interested in DIF versions.
Building the radix-4 algorithm can be done starting from the DFT formula in (1), which can be rewritten in a decomposed form, consequently obtaining the DIF radix-4 butterfly description. An FFT computation is transformed into four smaller FFTs through a divide-and-conquer approach, making a radix-4 scheme with log_{4}(N) stages, each containing N/4 butterflies.
Through the given scheme, when the FFT input is in its natural order, the output will be arranged in the so-called digit-reversed order, which is defined depending on the used radix. It is worth noting that radix-2 FFT schemes present an easier reordering process.
The radix-2^{2} FFT is adopted in this work and forms the basis of our new FFT scheme. Next, we describe the targeted VLIW family and the related state-of-the-art modulo scheduling.
3 VLIW DSP processors
3.1 Architecture overview
Among industrial VLIW platforms, we mention the Texas Instruments TMS320C6x, Texas Instruments 66AK2Hx, Freescale StarCore MSC8x, ADI TigerSHARC, and Infineon Carmel.
3.2 Background on the modulo scheduling
Modulo scheduling is a software pipelining technique, exploiting ILP to schedule multiple iterations of a loop in an overlapped form, initiating iterations at a constant interval called the initiation interval (II).
Modulo instruction scheduling schemes are mainly heuristic-based, as finding an optimal solution is proven to be an NP-complete problem. The basic operations of this technique include computing a lower bound on the initiation interval (denoted MII), which depends on the provided resources and dependencies. Next, starting from this MII value, a search is made for a valid schedule respecting hardware and graph dependency constraints. If no valid schedule can be found, the II value is increased by 1 and the search is repeated; this process continues until a solution is found. Higher performance is achieved for lower II; increasing II reduces the amount of exploited parallelism but makes finding a valid schedule easier [32]. The main techniques evoked in the literature for modulo scheduling include iterative modulo scheduling (IMS) [19], which uses backtracking by scheduling and unscheduling operations at different time slots while searching for a better solution. Slack modulo scheduling [33] minimizes the needed registers by reducing the lifetimes of operands, using their scheduling freedom (or slack). Integrated register-sensitive iterative software pipelining (IRIS) [34] modifies the IMS technique to minimize register requirements. Swing modulo scheduling (SMS) [35, 36] does not use backtracking but orders graph nodes, guaranteeing effective schedules with low register pressure. The modulo scheduling with integrated register spilling (MIRS) in [37] suggests integrating the possibility of temporarily storing data out to memory (using spill code) when a schedule would exceed the number of available registers in the processor.
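The II search loop shared by these techniques can be sketched as follows; the greedy placement against a modulo reservation table is a simplified stand-in for the backtracking used by IMS-like schedulers, not any specific published algorithm:

```python
import math

def res_mii(op_counts, unit_counts):
    # Resource-constrained lower bound on II: ceil(uses / units) per unit type.
    return max(math.ceil(op_counts[u] / unit_counts[u]) for u in op_counts)

def modulo_schedule(ops, deps, units, mii, max_ii=64):
    # ops: {name: unit} listed in topological order; deps: (src, dst, latency).
    # Try II = MII, MII+1, ... until a conflict-free placement is found.
    for ii in range(mii, max_ii + 1):
        table = {u: [0] * ii for u in units}   # uses per unit per modulo slot
        start, ok = {}, True
        for op, unit in ops.items():
            earliest = max((start[s] + lat for s, d, lat in deps if d == op),
                           default=0)
            t = earliest
            while t < earliest + ii and table[unit][t % ii] >= units[unit]:
                t += 1                          # look for a free modulo slot
            if table[unit][t % ii] >= units[unit]:
                ok = False                      # all II slots busy: enlarge II
                break
            table[unit][t % ii] += 1
            start[op] = t
        if ok:
            return ii, start
    return None
```

A toy load/multiply/add/store chain on one unit of each kind already shows the structure: two .D operations force MII = 2, and the schedule found at II = 2 respects all latencies.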
In general, the problem statement starts from a directed acyclic graph (DAG), with nodes representing the operations of a loop and edges the intra- or inter-iteration dependencies between them; edges are weighted by the instructions' latencies. The wanted schedule must be functionally correct regarding data dependencies and hardware conflicts, minimizing both II and register usage, therefore reducing the execution time. Register pressure is a critical element to consider while searching for a schedule; a commonly used strategy is to minimize the lifetimes of an instruction's inputs/outputs.
Accordingly, modulo scheduling focuses on arranging instructions in a window of II slots called the kernel; when m iterations are overlapped within it, m − 1 iterations must be done separately before and after entering the kernel; these are called the prolog and epilog, respectively. In order to reduce the code expansion naturally required by modulo scheduling (typically for the prolog/epilog parts), hardware facilities for software pipelining are implemented in VLIW processors.
Next, our new FFT scheme is discussed for implementation possibilities on VLIW DSPs.
4 Our implementation methodology for the FFT on VLIW DSPs
4.1 A motivating example
The key idea behind the FFT scheme that we are proposing is to create a balance between the computation capability and the data bandwidth that are required by the FFT algorithm. In the following, we analyze the conventional FFT scheme regarding the needed VLIW operations on a TI C66 device, in order to evaluate the default efficiency of resource usage.
As seen previously, the conventional radix-2^{2} FFT algorithm needs log_{4}(N) stages (assuming N a power-of-4); and within each stage, N/4 radix-2^{2} butterflies are computed. Besides, 8 loads/stores are needed for each butterfly's legs, in addition to 3 load operations for twiddle factors as shown in Fig. 3, making a total of 11 loads/stores per butterfly. Moreover, eight complex additions/subtractions, three complex multiplications, and a special (−j) multiplication are required to complete the processing of a single butterfly.
A VLIW core can issue loads/stores, multiplications, additions/subtractions, and (−j) multiplications in parallel. In the C66 core case, the load/store capacity is 16 bytes per cycle using .D units (denoted next by p_{d}). Eight real floating-point single-precision multiplications can be done per cycle (p_{m}) using .M units, and eight real floating-point single-precision additions/subtractions are achievable per cycle (p_{a}) using both .L and .S units. Finally, 2 multiplications by (−j) per cycle (p_{j}) are possible using .L and .S units as well; these last are simplified into combinations of move and sign-change operations between real and imaginary parts.
In Fig. 9, loads of the four butterfly legs are marked by m0, m1, m2, and m3. Stores are represented by the entities sf0, sf1, sf2, and sf3. Additions and subtractions are referred to as a0, a1, a2, a3, v0, v1, v2, and v3. A multiplication by (−j) is translated into two operations on .L/.S units: a3ii (a3i1, a3i2). In addition, multiplications are marked by mf1, mf2, and mf3; and since we are processing complex data (with real and imaginary parts), extra additions are needed to complete the complex multiplication operation: f1, f2, f3. Finally, the symbols w1, w2, and w3 represent loads from memory of the needed twiddle factors (W_{N}^{k}).
Hence, there is indeed an imbalance between the required computation capability and the load/store bandwidth for this VLIW case, with excessive loads/stores vs. a lower use of computation units.
Our new method changes the structure of the algorithm to fit the VLIW hardware, creating a balance in resource usage and therefore minimizing the overall clock cycles. In an attempt to reduce the load/store pressure, we suggest not loading twiddle factors but generating them internally instead. This idea uses the fact that the twiddle factors of the n-indexed butterfly, namely (W_{N}^{2n}, W_{N}^{n}, W_{N}^{3n}), can be deduced from those of the (n − 1)-indexed butterfly according to the formulas W_{N}^{2n} = W_{N}^{2(n−1)}·W_{N}^{2}, W_{N}^{n} = W_{N}^{n−1}·W_{N}, and W_{N}^{3n} = W_{N}^{3(n−1)}·W_{N}^{3}. This trades loads/stores for multiplications/additions and makes an FFT scheme with butterflies that depend on each other; therefore, we cannot start the processing of the n-indexed butterfly until the (n − 1)-indexed butterfly has computed its twiddle factors.
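The recurrence is easy to verify numerically; the sketch below (our own illustration) generates the three twiddle factors of successive butterflies by repeated multiplication and compares them with the trigonometric definition:

```python
import cmath

def recurrent_twiddles(N, count):
    # Generate (W_N^n, W_N^2n, W_N^3n) for n = 1..count by multiplying the
    # previous butterfly's factors by the constants W_N, W_N^2, W_N^3,
    # instead of loading them from a precomputed table.
    w1c = cmath.exp(-2j * cmath.pi / N)
    w2c, w3c = w1c * w1c, w1c * w1c * w1c
    w1, w2, w3 = 1.0, 1.0, 1.0              # factors for n = 0
    out = []
    for n in range(1, count + 1):
        w1, w2, w3 = w1 * w1c, w2 * w2c, w3 * w3c
        out.append((w1, w2, w3))
    return out
```

In double precision the drift after a few hundred multiplications stays far below single-precision resolution, which is why the accumulated error only matters over long chains (addressed by the trade-off parameter later in this section).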
We decide to process one butterfly per C66 core-side bank as well, grouping two butterflies within a single larger iteration. This makes an II large enough to wait for the generation of the needed twiddle factors by subsequent pipeline stages (7 cycles are required on the TI C66 device to complete a floating-point complex multiplication).
4.2 Proposed FFT implementation methodology
Our implemented methodology allows a better core-level resource balance, exploiting the fact that the twiddle factors can be calculated recurrently using multipliers during the execution. The resulting scheme, regarding a VLIW architecture, creates a vital balance between the computation capability and the data bandwidth that are required by the FFT algorithm. Besides, it takes into account the memory hierarchy, the memory banks, the cache size, the cache associativity, and its line size, in order to adapt the FFT algorithm well to a broad range of embedded VLIW processors. The bit reversal was efficiently designed to take advantage of the cache structure, and a mixed scheme was proposed for FFT/iFFT sizes not fitting the cache size.
4.2.1 A recurrent FFT scheme without memory references of twiddle factors
The proposed FFT scheme generates twiddle factors using multipliers instead of loading them from a precomputed array. These are recurrently computed from previously handled butterflies. Therefore, the processing of the n-indexed butterfly is time-constrained by the (n − 1)-indexed butterfly.
Let us denote by t_{w} the latency, in cycles, needed to compute a twiddle factor from a previously calculated one; consequently, if an iteration processes one butterfly, then the II must be greater than or equal to t_{w}, waiting the required time to generate the twiddle factors for the next iteration.
For maximum FFT performance, the initiation interval must be minimized; for our new FFT it is expressed as MII = MAX(RCPB {required cycles per butterfly} for loads/stores, RCPB for multiplications, RCPB for adds/subs and j multiplications, t_{w}).
In order to mask the effect of t_{w} on the II, we unroll a number U of successive iterations into a single large one, so that dependencies exist only between groups of merged iterations. The new MII expression becomes MII = MAX(U × RCPB for loads/stores, U × RCPB for multiplications, U × RCPB for adds/subs and j multiplications, t_{w}).
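The masking effect of unrolling on t_{w} can be sketched with illustrative per-butterfly cycle counts (the values below are our assumptions for demonstration; the actual C66 figures follow from Eqs. (2) and (4)):

```python
import math

def mii_unrolled(U, rcpb_ds, rcpb_mul, rcpb_add, t_w):
    # Resource bounds scale with the unroll factor U, while the twiddle
    # recurrence latency t_w is paid once per group: a large enough U
    # masks t_w entirely behind the resource-bound MII.
    return max(math.ceil(U * rcpb_ds), math.ceil(U * rcpb_mul),
               math.ceil(U * rcpb_add), t_w)
```

With 4 load/store cycles per butterfly and t_w = 7, a single-butterfly iteration is latency-bound (MII = 7), while merging two butterflies makes the resource bound dominate (MII = 8) and the twiddle generation fully masked.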
For 0 ≤ k < U_{m} and 1 ≤ n < N/(4U_{m}), we denote γ_{n,k} = W_{N}^{nU_{m}+k}.
Consequently, this new scheme reduces memory accesses by 27 %, an implementation advantage on a broad range of architectures, as most FFT algorithms use memory extensively. In addition, it gives an opportunity to use the non-exploited VLIW units for a possibly masked generation of twiddle factors. Besides, since it is not an obvious task to generate an efficient pipelined schedule (having II = MII) with respect to hardware constraints and available core registers, we suggest in later sections an aggressive FFT-specific scheduling heuristic.
This scheme requires no memory references to twiddle factors and no memory space for their storage, making significant gains on the related memory latencies.
The key parameters of our scheme are the VLIW core features (p_{d}, p_{m}, p_{a}, p_{j}, t_{w}). By computing MII_{1} when using an FFT scheme with loaded twiddle factors (using n_{d}, n_{m}, n_{a}, and n_{j}, expressed in Eq. (2)), and MII_{2} when using an FFT scheme with recurrent computation of twiddle factors (using n′_{d}, n′_{m}, n′_{a}, and n′_{j}, expressed in Eq. (4)), then if MII_{1} < MII_{2}, the provided VLIW core would not be suitable for the proposed scheme; otherwise, the minimal gain is expressed as 100(MII_{1} − MII_{2})/MII_{1}.
4.2.2 Setup code integration within the pipelined flow
The previous section described the low-level instruction mapping of the innermost loop of the new FFT scheme. In order to complete the FFT/iFFT implementation, intermediate setup iterations (representing outer loops) must be injected into the pipelined flow of iterations. The straightforward way is to completely drain the last iterations of the inner loop, execute the setup code (constants reset, pointer updates, …), and then resume the pipelined execution; this turns out to be time-consuming due to the time needed for the prolog/epilog parts. Indeed, if the dynamic length (DL) denotes the number of cycles needed by an inner loop iteration for its processing, then the whole FFT will require at least B_{N}·MII/U_{m} + (DL − II)(N/48 − 1/3) cycles (assuming N a power-of-4). The integer expression (N/48 − 1/3) counts the number of interruptions that must be made to the inner loop kernel, assuming that the last two FFT stages can be treated specially and done without setup code merging. For N = 4k on the C66, for example, we can see that the setup code interruptions represent 7 % of the main processing.
VLIW architectures can support the merging of setup code into the pipelined iterations (the case of C66), making it possible to add a customized iteration concurrently with others without draining the pipeline. Therefore, we can insert an additional II-cycle iteration, setting up the changes required by outer loops to begin the next sequence of inner loop iterations; this needs only B_{N}·MII/U_{m} + II(N/48 − 1/3) cycles for the whole FFT/iFFT routine. This enhances the efficiency, the overhead representing only 2.7 % of the main processing when N = 4k on C66 cores.
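The two cycle formulas can be compared numerically; in the sketch below the value DL = 28 is our assumption chosen for illustration (only MII = II = 8 and U_m = 2 come from the C66 discussion), with B_N taken as the total butterfly count over all log_4(N) stages:

```python
import math

def fft_cycles(N, mii, ii, dl, u_m, merged_setup):
    # Total cycles = kernel time + setup-code overhead; N assumed a power of 4.
    stages = round(math.log(N, 4))
    b_n = (N // 4) * stages                 # B_N: total butterfly count
    interrupts = (N - 16) // 48             # integer form of N/48 - 1/3
    overhead = (ii if merged_setup else dl - ii) * interrupts
    return b_n * mii // u_m + overhead
```

For N = 4096 this reproduces the proportions quoted above: the drained variant pays (DL − II)·85 = 1700 extra cycles (about 7 % of the 24,576-cycle kernel), versus II·85 = 680 cycles (about 2.7 %) with merged setup iterations.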
4.2.3 Cacheefficient bit reversal computation
The FFT naturally returns an N-sized output in a bit-reversed order, so post-reordering the data is necessary. Bit-reversing an element at a 0-based index k consists of moving it to the position index m, such that m is obtained by reversing the log_{2}(N) bits of the binary form of k. Processors usually implement hardware bit reversal of a 32-bit integer; hence, the wanted function can be obtained by left-shifting an index by 32 − log_{2}(N) bits and bit-reversing the whole word afterwards.
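This shift-then-reverse trick can be emulated in software as follows (a sketch; on real DSPs the 32-bit reversal is a single hardware instruction, emulated here by a bit loop):

```python
def bit_reverse_index(k, n):
    # Reverse the log2(n) low bits of k by left-shifting into a 32-bit word
    # and reversing the whole word, as a hardware bit-reverse unit would.
    nbits = n.bit_length() - 1              # log2(n) for a power-of-2 n
    word = (k << (32 - nbits)) & 0xFFFFFFFF
    rev = 0
    for _ in range(32):                     # software stand-in for the HW op
        rev = (rev << 1) | (word & 1)
        word >>= 1
    return rev
```

The mapping is its own inverse and is a permutation of 0..N−1, which is what allows the reordering to be merged into the last FFT stage.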
In order to increase computation efficiency, one can integrate the reordering step into the last stage of the FFT/iFFT, rather than creating a separate routine. One difficulty encountered in bit reversal is accessing scattered positions of memory, causing many memory stalls. Indeed, a common L1D cache architecture divides it into several banks, such that simultaneous accesses to distinct banks are possible, while multiple concurrent accesses to the same bank induce latencies. In processor architectures that allow multiple data to be accessed in a cycle, the L1D cache level is divided into banks that are defined by the lower bits of an address (AMD's processors, TI C6x DSPs, …). In the C66 CorePac architecture, for example, there are 8 L1D banks such that the address range [4b + 32k, 4(b + 1) + 32k) (where k and b are integers) is linked with the bank number b (b ∈ [0, 7]).
It turns out that the store indexes related to the first butterflies (0, N/2, N/4, …) usually all belong to the same memory bank (for large N); consequently, 2 parallel stores in a constructed kernel will likely target different addresses from the same bank, inducing stalls. Besides, it is recommended to access consecutive addresses, so they can possibly be merged into larger accesses by subsequent data paths.
i = 0;
loop N/16 times
  br = bit_reverse(i);
  load_legs(br, br + 1, br + 2, br + 3);
  load_legs(N/2 + br, N/2 + br + 1, N/2 + br + 2, N/2 + br + 3);
  load_legs(N/4 + br, N/4 + br + 1, N/4 + br + 2, N/4 + br + 3);
  load_legs(3N/4 + br, 3N/4 + br + 1, 3N/4 + br + 2, 3N/4 + br + 3);
  process_butterflies();
  store_legs(i, i + N/2, i + N/4, i + 3N/4);
  store_legs(i + 1, i + N/2 + 1, i + N/4 + 1, i + 3N/4 + 1);
  store_legs(i + 2, i + N/2 + 2, i + N/4 + 2, i + 3N/4 + 2);
  store_legs(i + 3, i + N/2 + 3, i + N/4 + 3, i + 3N/4 + 3);
  i = i + 4;
end loop
Doing so, we can always issue parallel stores targeting different memory banks, thus avoiding bank conflicts (for example, data (i) at a specific bank in parallel with data (i + 1) at another bank). Furthermore, 4 consecutive store accesses are now possible. In this case, butterflies are processed in an order that provides the maximum of consecutive stores.
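The C66-style bank mapping described above can be sketched as follows (assuming 8-byte single-precision complex elements; the helper name is ours):

```python
def l1d_bank(addr):
    # C66-style L1D banking: 8 banks, each 4 bytes wide, selected by address
    # bits [4:2] -- addresses of the form 4b + 32k map to bank b, b in [0, 7].
    return (addr >> 2) & 0x7
```

With 8-byte complex elements, the first-butterfly store indexes 0, N/4, N/2, and 3N/4 all land in the same bank for power-of-4 N, whereas consecutive indexes i and i + 1 never do, which is exactly why the reordered store pattern above avoids conflicts.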
During the FFT, in-place computation on an input buffer is performed until the last stage, where the stored data are put into the output buffer. Processing 2, 4, or more butterflies per iteration increases the register pressure; we have applied the same scheduling heuristic that will be described later to find a feasible implementation with advanced constraints on data accesses. The found schedule has II = MII with all constraints verified, ensuring consecutive stores and avoiding bank conflicts. The obtained performance of the FFT routine with bit reversal was similar to that of an FFT without bit reversal.
For FFT sizes that are a power-of-2 and not a power-of-4, an additional radix-2 stage is added; that is where the bit reversal is merged in the same manner.
4.2.4 Adapting the FFT to the cache associativity
When the FFT size is greater than the allocated cache size, the considered radix-2^{2} scheme may present some inefficiency toward the L1D cache. The L1D cache is composed of a number of cache lines (usually of 64 bytes), used to store external data prior to their use by the CPU. A cache miss occurs when the requested data is not yet available in the cache; in this case, the CPU is stalled waiting for a cache line to be updated. Many cache structures have been used in CPU architectures: a direct-mapped cache associates each address of the external memory with a unique position in the cache (therefore with one cache line); as a result, two addresses that are separated by a multiple of the cache size cannot survive in the cache at the same time. An advanced mechanism is the set-associative cache, where the cache memory is divided into a number p of ways, such that when a cache miss occurs, data is transferred to the way whose cache line is the least recently used (LRU); consequently, there are p unique locations in the cache for every address. The main advantage of increasing the associativity is to let non-contiguous data survive in cache lines without overwriting each other (without cache thrashes).
A radix-2^{2} butterfly loads its legs from the indexes 0, N/4, N/2, and 3N/4. All this data should exist in the cache at the same time. If the L1D cache is 2-way set-associative and denoting by L1D_S the allocated cache size in bytes, no cache thrash will happen if N is less than or equal to L1D_S/8. Otherwise, the cache lines for indexes (0 and N/2) or (N/4 and 3N/4) will overwrite each other continuously, decreasing the cache efficiency. A solution to this consists of applying radix-2 FFTs for the larger sizes, down to the size (L1D_S/8) where radix-2^{2} can be used without cache thrashes. Indeed, since radix-2 FFTs only access elements at indexes like (0 and N/2), no cache thrash will occur no matter how large N is, as long as the cache is 2-way set-associative.
A radix-2 FFT without references to twiddle factors was similarly built and generated using our scheduling heuristic, leading to a schedule with II = MII (merging 4 radix-2 butterflies in a single iteration).
function fft_cache()
begin
  step = N;
  while (step > L1D_S/8 and CACHE_A < 4) loop
    for (k = 0; k < N/step)
      fft_radix2_stage(input + k*step, step);  // radix-2 FFT
    step = step/2;
  end loop
  for (k = 0; k < N/step)
    fft(input + k*step, step);                 // radix-2^2 FFT
  bit_reversal(input, output);
end
A bit reversal routine was designed separately in this case and cache-optimized; the II (MII) was extended in order to optimally (fully) treat 4 cache lines in each iteration.
Using a radix-2 FFT for the first stages slightly drops the overall efficiency. In fact, a full-radix-2 scheme requires more time than a full-radix-2^{2} scheme (on C66 cores, it needs N log_{2}(N) cycles at peak performance instead of N log_{4}(N)). Even so, its gain is far dominant for large FFTs by avoiding cache thrashes. For example, the 16k-FFT performance on C66 cores using a full-radix-2^{2} scheme is 367,890 cycles; it decreased to 245,206 cycles (a 33 % gain) using the scheme avoiding cache thrashing.
The inverse FFT is the same as an FFT, except that it uses conjugated twiddle factors and (1/N) extra multiplications added to the last stage. These modifications can be performed without decreasing performance, by exploiting the ILP feature of VLIW processors.
4.2.5 FFT scheme accuracy
A possible side effect of the proposed implementation is a slightly reduced precision. Indeed, twiddle factors computed internally in a recurrent fashion are less accurate than those loaded precomputed using trigonometric functions. We introduce in Fig. 13 a trade-off parameter (tradeoff_factor) between accuracy and speed. The trade-off scheme injects more precomputed twiddle factors within the FFT flow instead of using only one, which reduces error accumulation effects. However, since the pipeline is regularly stopped to process more precomputed twiddle factors, the speed performance slightly drops.
The key idea of the algorithm in Fig. 13 is to use more than one precomputed twiddle factor per FFT stage in order to limit the error propagation. Indeed, if the tradeoff_factor is 0, then only 1 twiddle factor will be used to feed the whole FFT process. Otherwise, 2^{tradeoff_factor}/2 precomputed twiddle factors will be used per FFT stage. We will denote the tradeoff_factor by T in the following.
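The effect of T can be sketched by regenerating one stage's twiddle factors with periodic refreshes from exact trigonometric values (our own illustration of the idea, not the Fig. 13 algorithm itself):

```python
import cmath

def stage_twiddles(N, T):
    # First-stage factors W_N^n for n = 0..N/4-1. The recurrence is re-seeded
    # from an exact (trigonometric) value at each of the 2^T / 2 refresh
    # points; T = 0 means a single seed feeds the whole stage.
    count = N // 4
    refreshes = max(1, 2 ** T // 2)
    seg = count // refreshes                # recurrence chain length
    w_step = cmath.exp(-2j * cmath.pi / N)
    out = []
    for n in range(count):
        if n % seg == 0:
            w = cmath.exp(-2j * cmath.pi * n / N)   # exact precomputed factor
        else:
            w = w * w_step                          # recurrent update
        out.append(w)
    return out
```

Shorter recurrence chains accumulate less rounding error, at the cost of more pipeline interruptions to consume the precomputed seeds: that is the trade-off T controls.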
Here, DL denotes the dynamic length of an inner loop iteration. The parameter T thus creates a trade-off between accuracy and speed.
5 Instruction scheduling heuristic for FFT/iFFT implementations on VLIW
5.1 Introduction
On a TI C66 device, for instance, and as we aspire for the schedule to be done with II = MII = 8, the kernel can then execute up to 64 nodes (due to the 8-way VLIW architecture). This leads to a functional unit pressure of 60 per 64 possible slots (94 % unit pressure); this shows the difficulty class of the actual scheduling problem, in the presence of a limited register set.
The new FFT scheme merges U_{m} totally independent butterflies in a single iteration, making it possible to divide the computation symmetrically on processors with a symmetric core-bank structure (ADI TigerSHARC, TI C66 [Fig. 7], …). This has the effect of reducing the problem size by half (to U_{m}/2 butterflies) and avoiding core-bank communication, which is usually limited by many other constraints.
Instructions' features of some TI C66 operations

Instruction | Operand size (op1, op2, dst) [in number of core registers] | Delay slot latency (DS) | Possible execution units
DADDSP      | (2, 2, 2)                                                  | 2                       | L, S
DSUBSP      | (2, 2, 2)                                                  | 2                       | L, S
CMPYSP      | (2, 2, 4)                                                  | 3                       | M
STDW        | (2, 0, 1)                                                  | 0                       | D
LDDW        | (1, 0, 2)                                                  | 4                       | D
ADD         | (1, 1, 1)                                                  | 0                       | L, S, D
At the core-register level, the FFT algorithm needs to allocate a number of registers exclusively to store data pointers, constants, counters, and twiddle factors. First, 1 register is needed to store the iteration loop counter; 4 registers per butterfly contain pointers to the input and output butterfly legs; 1 register holds the FFT stride; 3 registers per butterfly hold jump offsets; 1 register holds a pointer to the final output buffer; and 1 register is the stack pointer (where some core registers are spilled). Besides, 3 register pairs are exclusively needed for (W_{N}^{U_{m}}, W_{N}^{2U_{m}}, W_{N}^{3U_{m}}), and 3U_{m} twiddle factors per butterfly must be allocated as well. The remaining registers must be used for the operand allocation of the entire FFT/iFFT DAG. Our scheduling algorithm must take this limiting register constraint into account.
One of the most efficient scheduling heuristics is SMS, as evaluated in [32, 35, 36]. Applied to our FFT problem on the TI C66 core, SMS produced a schedule of II = MII with a minimum register usage of 20 per core side (40 registers in total), which exceeds the registers available for DAG allocation; the schedule is therefore not implementable. Our scheduling technique aims to find a valid schedule with II = MII and a minimum register requirement in a reasonable time.
5.2 A proposed modulo scheduling algorithm
The new scheduling algorithm starts by ordering the graph nodes of one core side into a 1-dimensional array, such that if a node is indexed j in this array, all of its predecessors are indexed less than j. This matters because the scheduling algorithm presented next uses this array order to schedule/unschedule graph nodes in a backtracking fashion, and computes the best starting time of each node (with respect to register lifetime) based on all of its already-scheduled predecessors. The scheduling order is critical, and our algorithm uses a special ordering of graph nodes, which will be presented later.
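The predecessors-first property described above is exactly a topological order of the DAG; a minimal sketch using Kahn's algorithm (the node names in the example are hypothetical):

```python
from collections import deque

def topo_order(preds):
    """Order DAG nodes so every node appears after all of its predecessors.
    `preds` maps each node to the list of its predecessor nodes."""
    succs = {v: [] for v in preds}
    indeg = {v: len(p) for v, p in preds.items()}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)
    ready = deque(v for v, d in indeg.items() if d == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for s in succs[v]:       # a node becomes ready once all preds are placed
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return order

# toy chain: load -> multiply -> store
order = topo_order({'ld': [], 'mpy': ['ld'], 'st': ['mpy']})
```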
The total accumulated lifetime divided by II gives a lower bound on register pressure, denoted AvgLive. Although a minimized AvgLive suggests smaller register usage, it does not necessarily yield the lowest: for example, one schedule with a total lifetime of 110 required a minimum of 17 registers, while another with a lifetime of 118 required only 16. A better register lower bound is computed by counting the lifetimes that overlap each of the II cycles, producing an array of II elements (called LiveVector), as described in [33]. The maximum among the LiveVector values, named MaxLive, is a precise lower bound on the number of needed registers; it was shown in [33] that a schedule requires at most MaxLive + 1 registers.
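The LiveVector/MaxLive computation can be sketched directly from this description (a hedged illustration, not the paper's implementation):

```python
def live_vector(lifetimes, ii):
    """Overlapped live values per modulo cycle. `lifetimes` is a list of
    (def_cycle, end_cycle) pairs in absolute schedule time; a value live at
    absolute cycle t occupies kernel slot t % ii."""
    vec = [0] * ii
    for start, end in lifetimes:
        for t in range(start, end):
            vec[t % ii] += 1
    return vec

def max_live(lifetimes, ii):
    """MaxLive: the peak of the LiveVector, a precise register lower bound."""
    return max(live_vector(lifetimes, ii))
```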
Since the state space cannot be scanned entirely due to the NP-complete nature of the problem, even with MaxLive cutoffs, we propose to run smaller successive searches from different starting points in the search space. This has the advantage of finding a better schedule faster: if no better solution is found within a specified backtracking budget (100M nodes, as an example), it is most likely because the first placed operations constrain the search, and they must therefore be changed. The algorithm thus starts with an initial ordering of the scheduling array; whenever the backtracking limit is reached, the scheduling array is reordered according to specific rules and the search restarts from the new initial state. This sequence is repeated until a solution fitting the available registers is found.
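The restart strategy can be summarized in a small control loop; this is a skeleton with hypothetical callback names, not the paper's code (the toy demo stands in for a budget-limited backtracking search):

```python
def schedule_with_restarts(order, try_schedule, reorder, max_restarts=10):
    """Run a budget-limited search; on failure, reorder the scheduling
    array and retry from the new initial state (sketch)."""
    for _ in range(max_restarts):
        result = try_schedule(order)   # None when the backtracking budget is hit
        if result is not None:
            return result
        order = reorder(order)         # new starting point in the search space
    return None

# toy demonstration: the search "succeeds" only after one reordering
def _try(o):
    return o if o == ['b', 'a'] else None
def _reorder(o):
    return list(reversed(o))
found = schedule_with_restarts(['a', 'b'], _try, _reorder)
```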
The reordering step aims to guarantee different initial schedules and a fast convergence rate. The main criterion used while sorting is that when an operation v is unscheduled, the next operations to be rescheduled should be those that maximize their effect on v. To illustrate this criterion, let us take an example and assume that the scheduling array is arranged as follows: {mw1, mw2, mw3, m0, m1, m2, m3, a0, a1, a2, a3, a3i1, a3i2, v0, v1, v2, v3, f1, f2, f3, sf0, sf1, sf2, sf3, w1, w2, w3}.
Next, we assume that all operations up to “sf3” were placed successfully into the kernel. If operation “w1” cannot be placed into the schedule (either because it raises the MaxLive pressure, or because no free slot exists within its freedom range), the algorithm reschedules the previous operation in the scheduling array, “sf3”, and checks whether this improves the scheduling opportunities for “w1”. It will have no effect, because “sf3” neither shares an execution unit with “w1” nor is among its direct predecessors. Hence, rescheduling “sf3”, “sf2”, “sf1”, or “sf0” merely leads to useless states; only “mw1” (the direct predecessor of “w1”) or operations using the same unit resource could enable a valid placement for “w1”. To avoid scanning useless states first, a better order should be used (for example, placing “mw1” close to “w1” in the scheduling array).

▪ If an operation is indexed j in this array, all its predecessors must be indexed less than j {1}
▪ Each operation must be as close as possible to its direct predecessors {2}
▪ Each operation must be as close as possible to operations using the same unit resource {3}
▪ Operations with larger input/output sizes are more critical to reordering considerations {4}
The “.id” field in (7) represents the ordering position of a node within the array. The pn and cn fields denote, respectively, the number of predecessors of op and the number of concurrent operations that use the same unit resource as op and are indexed below it. The formula is scaled by the number of registers (buffer size “.bs”) required by op.
function change_order(Sch_Array)
begin
  loop
    choose op: an operation from the Sch_Array;
    min_i = index(op);
    max_i = min(index(successors(op))); // if no successor, set max_i to upper bound
    choose a position s in the range [min_i, max_i - 1];
    move in array the operation from position min_i to s;
    compute the OrdP; compare it to the best penalty found so far;
  end loop
  return the order having the lowest OrdP;
end
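A minimal executable rendering of this pseudocode follows. It is a sketch under stated assumptions: `succ_index` and `ord_penalty` are hypothetical callbacks (the latter standing in for formula (7), which is not reproduced here), and the toy demo uses no precedence constraints:

```python
import random

def change_order(sch_array, succ_index, ord_penalty, tries=100):
    """Move one operation toward (but never past) its earliest successor,
    keeping the ordering with the lowest penalty OrdP (sketch)."""
    best, best_p = list(sch_array), ord_penalty(sch_array)
    for _ in range(tries):
        cand = list(best)
        op = random.choice(cand)
        min_i = cand.index(op)
        max_i = succ_index(cand, op)        # op must stay before its successors
        if max_i - 1 <= min_i:
            continue                        # no legal move for this op
        s = random.randrange(min_i, max_i)  # new position in [min_i, max_i - 1]
        cand.insert(s, cand.pop(min_i))
        p = ord_penalty(cand)
        if p < best_p:
            best, best_p = cand, p
    return best

# toy demonstration with a penalty that prefers one target ordering
random.seed(7)
no_preds = lambda order, op: len(order)            # no precedence constraints
prefers = lambda order: 0 if order == ['a', 'c', 'b'] else 1
best = change_order(['a', 'b', 'c'], no_preds, prefers, tries=200)
```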
Our proposed scheduling algorithm aggressively minimizes register usage, improving MaxLive by a factor of 1.7 on the TI C66 core compared to the SMS scheduling method (which returned a MaxLive of 20), thereby making an FFT entirely free of twiddle-factor memory references implementable.
The key parameters of our scheduling method are the computed MII and U_m of Section 4.2.1 and the number of symmetric clusters (denoted SymC) in the VLIW core; the method schedules U_m/SymC butterflies per cluster, reducing the algorithmic problem size by a factor of SymC. The scheduling also depends on the VLIW instructions’ delay slots, operand sizes, and possible execution units, and on the number of available core registers per cluster.
6 Implementation and experimental results
The implementation strategy above was realized on the high-end TMS320C6678 VLIW DSP and on the 66AK2H12 DSP, using standard C66 assembly. Data samples were single-precision complex floating point, with imaginary parts at odd indexes and real parts at even ones. During benchmarks, L1D/L1P was fully used as cache. Input/output buffers were stored in the L2 memory (512 Kbytes in the C6678, 1 Mbyte in the 66AK2H12), and the program code was mapped to the local L2 memory. The experimented FFT sizes lie in the range [256, 64k], which covers most signal processing applications. The comparison is made against TI’s most vendor-tuned, linear-assembly-optimized FFT (DSPLib version 3.1.1.1), with all TI compiler optimizations enabled (-o3, version 7.3.2).
Performance comparison between the new FFT/iFFT and TI’s

Implementation | FFT size | CPPR2B | Cycles | Relative gain over TI (%)
New FFT | 1k | 1.206 | 6175 | 37.62
TI FFT | 1k | 1.933 | 9899 | –
New FFT | 4k | 1.119 | 27502 | 46.65
TI FFT | 4k | 2.097 | 51547 | –
New FFT | 8k | 1.896 | 100979 | 53.63
TI FFT | 8k | 4.090 | 217782 | –
New iFFT | 1k | 1.211 | 6198 | 47.89
TI iFFT | 1k | 2.323 | 11894 | –
New iFFT | 4k | 1.130 | 27779 | 53.50
TI iFFT | 4k | 2.431 | 59745 | –
New iFFT | 8k | 1.896 | 100977 | 56.59
TI iFFT | 8k | 4.368 | 232613 | –
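As a check on the table’s metric, CPPR2B works out to cycles divided by the (N/2)·log2 N radix-2 butterflies of an N-point FFT — a hedged interpretation, but one that reproduces every reported figure (e.g., 6175 cycles at N = 1k gives 1.206):

```python
import math

def cppr2b(cycles, n):
    """Cycles per (pseudo) radix-2 butterfly: an N-point FFT contains
    (N/2)*log2(N) such butterflies."""
    return cycles / ((n / 2) * math.log2(n))

def relative_gain(new_cycles, ref_cycles):
    """Cycle-count reduction relative to the reference, in percent."""
    return 100.0 * (ref_cycles - new_cycles) / ref_cycles
```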
Our FFT implementation shows substantial improvements over TI’s. Peak performance is reached at N = 4k, since the limited cache associativity triggers our dedicated optimization only for N = 8k and larger, making the integrated radix-2 stages relatively less efficient there, while small FFTs suffer non-negligible overhead relative to the main processing. We thus reach an average gain of 50.56 % (a 2× acceleration) over TI’s routines, with a peak performance of 1.119 CPPR2B (89.36 % of absolute efficiency). This gain is explained by the 27 % improvement proven in Section 4, the suppressed twiddle-factor latencies, and the other optimizations described above.
Our scheduling heuristic, for its part, generates the kernel codes of the FFT/iFFT and aggressively optimizes cycle counts and register usage. It computes the best schedule under tight pressure constraints with a fast convergence rate, improving on the MaxLive obtained by SMS by a factor of 1.7 (a 40 % gain). The best generated instruction schedule, with MaxLive = 12, was computed within 2 to 15 min.
Furthermore, a 4k-sample floating-point FFT was performed in-chip in 2.6 μs at a power consumption of 10 W on a TI 66AK2H12 DSP device, a remarkable FFT implementation efficiency of 9.5 GFLOPS/watt. This makes the implementation suitable for several compute-intensive applications, such as radar processing [41]. Our work has been used within the official FFT library of Texas Instruments [42].
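The 9.5 GFLOPS/watt figure can be checked with a back-of-envelope calculation using the conventional 5·N·log2 N flop count for a complex radix-2 FFT (a hedged sketch; the exact flop count of the implemented scheme may differ):

```python
import math

def fft_gflops_per_watt(n, seconds, watts):
    """Estimated energy efficiency of one N-point complex FFT, assuming
    the conventional 5*N*log2(N) floating-point operation count."""
    flops = 5 * n * math.log2(n)
    return flops / seconds / watts / 1e9

# 4k-point FFT in 2.6 us at 10 W, as reported above
eff = fft_gflops_per_watt(4096, 2.6e-6, 10.0)   # close to the reported 9.5
```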
In contrast to our approach, previous works on on-the-fly generation of twiddle factors [43–45] used the CORDIC algorithm or related generation designs to compute the needed twiddle factors instead of performing ROM accesses. These techniques target hardware FFT designs in FPGAs or ASICs and do not apply to CPU or DSP platforms. Indeed, generating twiddle factors with multipliers (or equivalent operations) in software on a CPU/DSP is usually avoided, as it requires in most cases more latency than FFT schemes with precomputed twiddle factors. To the best of our knowledge, no published work has proposed a software-efficient FFT with inline generation of twiddle factors. The idea becomes feasible with recent high-end VLIW processors, where parallel instructions can compute the twiddle factors in the masked time; however, it requires proper scheduling and low-level control of the execution pattern to succeed.
7 Conclusions
In the present paper, a new radix-2^2-based FFT/iFFT scheme was proposed to fit VLIW processors. The structure balances the VLIW computation capabilities against the data-bandwidth pressure, optimally exploiting parallelism opportunities and reducing memory references to twiddle factors, leading to an average efficiency gain of 51 % over the most assembly-optimized and vendor-tuned FFT on a high-end VLIW DSP. Our implementation methodology takes into account the VLIW hardware resources and the cache structure, adapting the FFT algorithm to a broad range of embedded VLIW processors. In addition, a resource-constrained and register-sensitive modulo scheduling heuristic was designed to find the best low-level schedule for our FFT scheme, minimizing clock cycles and register usage through controlled backtracking and generating efficient assembly-optimized FFTs with balanced resource usage.
Declarations
Acknowledgements
This research is sponsored in part by a Research and Development Contract from Thales Air Systems.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. JW Cooley, JW Tukey, An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19, 297–301 (1965)
2. GD Bergland, A radix-eight fast Fourier transform subroutine for real-valued series. IEEE Trans. Electroacoust. 17(2), 138–144 (1969)
3. RC Singleton, An algorithm for computing the mixed radix fast Fourier transform. IEEE Trans. Audio Electroacoust. 1(2), 93–103 (1969)
4. P Duhamel, H Hollmann, Split radix FFT algorithm. Electron. Lett. 20, 14–16 (1984)
5. D Takahashi, An extended split-radix FFT algorithm. IEEE Signal Process. Lett. 8(5), 145–147 (2001)
6. AR Varkonyi-Koczy, A recursive fast Fourier transform algorithm. IEEE Trans. Circuits Syst. II 42, 614–616 (1995)
7. A Saidi, Decimation-in-time-frequency FFT algorithm. Proc. ICASSP 3, 453–456 (1994)
8. BM Baas, A low-power, high-performance, 1024-point FFT processor. IEEE J. Solid-State Circuits 34(3), 380–387 (1999)
9. R Weber et al., Comparing hardware accelerators in scientific applications: a case study. IEEE Trans. Parallel Distrib. Syst. 22(1), 58–68 (2011). doi:10.1109/TPDS.2010.125
10. T Fryza, J Svobodova, F Adamec, R Marsalek, J Prokopec, Overview of parallel platforms for common high performance computing. Radioengineering 21(1), 436–444 (2012)
11. J-J Li, C-B Kuan, T-Y Wu, JK Lee, Enabling an OpenCL compiler for embedded multicore DSP systems, in Proceedings of the 41st International Conference on Parallel Processing Workshops (ICPPW '12) (IEEE Computer Society, Washington, DC, 2012), pp. 545–552
12. FD Igual, G Botella, C García, M Prieto, F Tirado, Robust motion estimation on a low-power multicore DSP. EURASIP J. Adv. Signal Process. 99, 115 (2013)
13. T Fryza, R Mego, Low level source code optimizing for single/multi-core digital signal processors, in 23rd International Conference Radioelektronika, Pardubice, 2013, pp. 288–291. doi:10.1109/RadioElek.2013.6530933
14. JA Fisher, P Faraboschi, C Young, Embedded Computing: A VLIW Approach to Architecture, Compilers, and Tools (Morgan Kaufmann, San Francisco, 2005). ISBN 9780080477541
15. V Živojnović, Compilers for digital signal processors. DSP & Multimedia Technology Magazine 4(5), 27–45 (1995)
16. JP Grossman, Compiler and architectural techniques for improving the effectiveness of VLIW compilation. [Online]. Available: http://www.ai.mit.edu/projects/aries/Documents/vliw.pdf [24-Mar-2016]
17. JA Fisher, Trace scheduling: a technique for global microcode compaction. IEEE Trans. Comput. 30(7), 478–490 (1981)
18. M Lam, Software pipelining: an effective scheduling technique for VLIW machines, in Proc. SIGPLAN '88 Conference on Programming Language Design and Implementation, Atlanta, 1988, pp. 318–328
19. BR Rau, Iterative modulo scheduling: an algorithm for software pipelining loops, in Proc. 27th Annual International Symposium on Microarchitecture, 1994, pp. 63–74
20. M Bahtat, S Belkouch, P Elleaume, P Le Gall, Fast enumeration-based modulo scheduling heuristic for VLIW architectures, in 26th International Conference on Microelectronics (ICM), 2014, pp. 116–119. doi:10.1109/ICM.2014.7071820
21. Y Wang, Y Tang, Y Jiang, JG Chung, SS Song, MS Lim, Novel memory reference reduction methods for FFT implementation on DSP processors. IEEE Trans. Signal Process. 55, 2338–2349 (2007). doi:10.1109/TSP.2007.892722
22. Y Jiang, T Zhou, Y Tang, Y Wang, Twiddle-factor-based FFT algorithm with reduced memory access, in Proc. 16th Int. Symp. Parallel Distrib. Process. (IEEE Computer Society, Washington, 2002), p. 70
23. KJ Bowers, RA Lippert, RO Dror, DE Shaw, Improved twiddle access for fast Fourier transforms. IEEE Trans. Signal Process. 58(3), 1122–1130 (2010)
24. VI Kelefouras, G Athanasiou, N Alachiotis, HE Michail, A Kritikakou, CE Goutis, A methodology for speeding up fast Fourier transform focusing on memory architecture utilization. IEEE Trans. Signal Process. 59(12), 6217–6226 (2011)
25. M Frigo, SG Johnson, The fastest Fourier transform in the west, in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1998
26. M Frigo, A fast Fourier transform compiler. SIGPLAN Not. 39, 642–655 (2004)
27. S Johnson, M Frigo, A modified split-radix FFT with fewer arithmetic operations. IEEE Trans. Signal Process. 55(1), 111–119 (2006)
28. D Mirkovic, L Johnsson, Automatic performance tuning in the UHFFT library, in Computational Science—ICCS 2001 (Springer, New York, 2001), pp. 71–80
29. AM Blake, IH Witten, MJ Cree, The fastest Fourier transform in the south. IEEE Trans. Signal Process. 61(19), 4707–4716 (2013)
30. M Bahtat, S Belkouch, P Elleaume, P Le Gall, Efficient implementation of a complete multi-beam radar coherent-processing on a telecom SoC, in 2014 International Radar Conference (Radar), 2014, pp. 1–6. doi:10.1109/RADAR.2014.7060412
31. S He, M Torkelson, A new approach to pipeline FFT processor, in Proc. IEEE Parallel Processing Symp., 1996, pp. 766–770
32. JM Codina, J Llosa, A González, A comparative study of modulo scheduling techniques, in Proceedings of the 16th International Conference on Supercomputing (ICS '02) (ACM Press, 2002), p. 97
33. RA Huff, Lifetime-sensitive modulo scheduling, in Proc. ACM SIGPLAN '93 Conf. on Programming Language Design and Implementation, 1993, pp. 258–267
34. AK Dani, VJ Ramanan, R Govindarajan, Register-sensitive software pipelining, in Parallel Processing Symposium (IPPS/SPDP), Orlando, FL, 1998, pp. 194–198
35. J Llosa, A González, E Ayguadé, M Valero, Swing modulo scheduling: a lifetime-sensitive approach, in Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques (PACT '96), Boston, MA, 1996, pp. 80–86
36. J Llosa, E Ayguadé, A González, M Valero, J Eckhardt, Lifetime-sensitive modulo scheduling in a production environment. IEEE Trans. Comput. 50(3), 234–249 (2002)
37. J Zalamea, J Llosa, E Ayguadé, M Valero, Register constrained modulo scheduling. IEEE Trans. Parallel Distrib. Syst. 15(5), 417–430 (2004)
38. TMS320C6678, Multicore fixed and floating-point digital signal processor, Data Manual, Texas Instruments, SPRS691E, March 2014. [Online]. Available: www.ti.com/lit/gpn/tms320c6678 [25-Mar-2016]
39. M Tasche, H Zeuner, Improved roundoff error analysis for precomputed twiddle factors. J. Comput. Anal. Appl. 4(1), 1–18 (2012)
40. JJ Alter, JO Coleman, Radar digital signal processing, in MI Skolnik (ed.), Radar Handbook, 3rd edn. (McGraw-Hill, 2008), Chapter 25
41. A Klilou, S Belkouch, P Elleaume, P Le Gall, F Bourzeix, MM Hassani, Real-time parallel implementation of pulse-Doppler radar signal processing chain on a massively parallel machine based on multi-core DSP and serial RapidIO interconnect. EURASIP J. Adv. Signal Process. 2014, 161 (2014)
42. Texas Instruments, FFT library for C66x floating point devices, C66x FFTLIB, version 2.0. [Online]. Available: http://www.ti.com/tool/FFTLIB [25-Mar-2016]
43. SY Park, NI Cho, SU Lee, K Kim, J Oh, Design of 2K/4K/8K-point FFT processor based on CORDIC algorithm in OFDM receiver, in 2001 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), vol. 2, 2001, pp. 457–460. doi:10.1109/PACRIM.2001.953668
44. T Pitkänen, T Partanen, J Takala, Low-power twiddle factor unit for FFT computation, in Embedded Computer Systems: Architectures, Modeling, and Simulation, Lecture Notes in Computer Science 4599 (2007), pp. 65–74
45. JC Chi, SG Chen, An efficient FFT twiddle factor generator, in Proc. European Signal Process. Conf., Vienna, 2004