EURASIP Journal on Applied Signal Processing 2002:9, 961–974
© 2002 Hindawi Publishing Corporation

A Reduced-Complexity Fast Algorithm for Software Implementation of the IFFT/FFT in DMT Systems

The discrete multitone (DMT) modulation/demodulation scheme is the standard transmission technique for asymmetric digital subscriber lines (ADSL) and very-high-speed digital subscriber lines (VDSL). Although DMT can achieve a higher data rate than other modulation/demodulation schemes, its computational complexity is too high for cost-efficient implementations. For example, it requires a 512-point IFFT/FFT as the modulation/demodulation kernel in ADSL systems, and an even larger one in VDSL systems. The large block size results in a heavy computational load on programmable digital signal processors (DSPs). In this paper, we derive a computationally efficient fast algorithm for the IFFT/FFT. The proposed algorithm avoids the complex-domain operations that are inevitable in conventional IFFT/FFT computation, so the resulting software function has lower computational complexity. We show that it requires only 17% of the multiplications needed by the Cooley-Tukey algorithm to compute the IFFT and FFT. Hence, the proposed fast algorithm is well suited to firmware development that aims to reduce the MIPS count in programmable DSPs.


INTRODUCTION
The rapid growth of Internet access has created a strong demand for high-speed data transmission. To overcome the transmission bottleneck of conventional twisted-pair telephone lines, several sophisticated modulation/demodulation schemes have been proposed, including carrierless amplitude-phase (CAP) modulation [1], discrete multitone (DMT) modulation [2,3,4,5], and QAM technology [6]. Among these advanced modulation schemes, DMT can achieve the highest transmission rate since it incorporates many advanced DSP techniques such as dynamic bit allocation, multidimensional tone encoding, frequency-domain equalization, and so forth. As a consequence, DMT has been chosen as the physical layer transmission standard by the ADSL standardization committee.
One major disadvantage of the DMT scheme is its high computational complexity. In particular, the large block size of the IFFT/FFT consumes a great deal of computing power on programmable DSPs [7]. In [8], we considered a cost-efficient lattice VLSI architecture to realize the IFFT/FFT in integrated circuits. In this paper, we propose computationally efficient fast algorithms to run the IFFT/FFT function in software implementations such as programmable DSP processors (DSPs). By making use of the symmetric/antisymmetric properties of the Fourier transform, we first decompose the IFFT/FFT into a combination of two new real-domain transform kernels: the Modified DCT and the Modified DST. These two transform functions are used to replace the complex-domain IFFT/FFT. Then we employ the divide-and-conquer approach in [9] to derive novel recursive algorithms and butterfly architectures for the Modified DCT and Modified DST. The new scheme avoids the redundant complex-domain operations of the IFFT/FFT. That is, it involves only real-valued operations to compute the IFFT/FFT. Hence, we can avoid the special data structures needed in software to run complex-domain addition/multiplication operations in computing the IFFT/FFT. In addition, our analysis shows that we need only 17% of the multiplications in computing the IFFT and FFT compared with the Cooley-Tukey algorithm [10]. The low computational complexity, together with the real-domain operations, makes the algorithm very suitable for firmware coding in DSPs, which helps to reduce the MIPS count. Also, the DSP program can be written in recursive form, which requires less ROM/RAM program storage space to implement the IFFT/FFT.
The rest of this paper is organized as follows. Section 2 shows the derivation of the IFFT algorithm. In Section 3, the derivation of the FFT algorithm is discussed. The computational complexity comparison is given in Section 4, where the finite-precision effect of our algorithm is also discussed. Finally, we conclude our work in Section 5.

The IFFT derivation
The IFFT/FFT block diagram in the DMT system is shown in Figure 1. At the transmitter side, to ensure that the IFFT generates only real-valued outputs, the inputs of the IFFT in the DMT standard must satisfy the conjugate-symmetry constraint (1) [11], where $X(k) = X_r(k) + j \cdot X_i(k)$ are the encoded complex symbols. As defined in [12, Chapter 9], the IFFT of a finite-length sequence of length 2N is given by (2), for n = 0, 1, . . . , 2N − 1. By decomposing the summation into the first half and the second half, (2) can be split into two partial sums. Next, by substituting (3) into (4) and using (1), we can simplify (4) to (5) (see Appendix A), which holds for n = 0, 1, . . . , 2N − 1.
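For reference, the relations described in this paragraph can be written out as follows. This is a reconstruction from the surrounding definitions, not a verbatim copy of (1), (2), and (5). It assumes the usual $1/2N$ normalization of the inverse transform and $X(0) = X(N) = 0$, as stated in Appendix A.

```latex
\begin{align}
X(2N-k) &= X^{*}(k), \qquad k = 1, 2, \ldots, N-1, \\
x(n) &= \frac{1}{2N}\sum_{k=0}^{2N-1} X(k)\, e^{\,j 2\pi nk/2N}, \qquad n = 0, 1, \ldots, 2N-1, \\
x(n) &= \frac{1}{N}\left[\sum_{k=0}^{N-1} X_r(k)\cos\frac{2\pi nk}{2N}
        \;-\; \sum_{k=0}^{N-1} X_i(k)\sin\frac{2\pi nk}{2N}\right].
\end{align}
```

The last line follows by pairing the terms $k$ and $2N-k$ of the sum: each pair contributes $2\,\mathrm{Re}\{X(k)e^{j2\pi nk/2N}\}$, which turns the $1/2N$ factor into $1/N$.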

From (5), we can see that the computation of the IFFT is decomposed into two real-valued operations. One is a discrete cosine transform (DCT)-like operation with $X_r(k)$, k = 0, 1, 2, . . . , N − 1, as the inputs. The other is a discrete sine transform (DST)-like operation with $X_i(k)$, k = 0, 1, 2, . . . , N − 1, as the inputs. We name the first term the Modified DCT (MDCT) and the second term the Modified DST (MDST). Note that the MDCT and MDST involve only real-valued operations. Furthermore, it can be shown that MDCT(n) = MDCT(2N − n) and MDST(n) = −MDST(2N − n), for n = 0, 1, . . . , N − 1.
Hence, we can focus on computing MDCT(n) and MDST(n) for n = 0, 1, . . . , N − 1, and then expand the results to n = N + 1, N + 2, . . . , 2N − 1. For the special cases n = 0 and n = N, the MDCT simplifies to $\mathrm{MDCT}(0) = \sum_{k=0}^{N-1} X_r(k)$ and $\mathrm{MDCT}(N) = \sum_{k=0}^{N-1} (-1)^k X_r(k)$, and the MDST simplifies to $\mathrm{MDST}(0) = \mathrm{MDST}(N) = 0$. These simple relationships require no multiplications and therefore save additional computation.
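The decomposition and the symmetry-based expansion can be checked numerically. The sketch below is illustrative only: it evaluates the two kernels by direct summation (not by the fast algorithm), assumes the kernel definitions $\sum_k X_r(k)\cos(2\pi nk/2N)$ and $\sum_k X_i(k)\sin(2\pi nk/2N)$ with $X(0) = X(N) = 0$, and uses hypothetical helper names (`idft`, `ifft_via_mdct_mdst`) and made-up test symbols.

```python
import cmath
import math

def idft(X):
    """Naive inverse DFT with 1/M scaling, used only as a reference."""
    M = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * n * k / M) for k in range(M)) / M
            for n in range(M)]

def ifft_via_mdct_mdst(X_half):
    """X_half holds X(0..N-1) with X(0) = 0; X(N) = 0 is assumed.
    Returns the 2N real IFFT outputs using real arithmetic only."""
    N = len(X_half)
    xr = [v.real for v in X_half]
    xi = [v.imag for v in X_half]
    # Direct (non-fast) evaluation of both kernels for n = 0..N.
    mdct = [sum(xr[k] * math.cos(math.pi * n * k / N) for k in range(N))
            for n in range(N + 1)]
    mdst = [sum(xi[k] * math.sin(math.pi * n * k / N) for k in range(N))
            for n in range(N + 1)]
    # Expand to n = N+1..2N-1 using MDCT(n) = MDCT(2N-n), MDST(n) = -MDST(2N-n).
    for n in range(N + 1, 2 * N):
        mdct.append(mdct[2 * N - n])
        mdst.append(-mdst[2 * N - n])
    # Scaling by 1/N (a right shift by log2(N) bits in fixed point).
    return [(c - s) / N for c, s in zip(mdct, mdst)]

N = 8
X_half = [0j] + [complex(k, -2 * k + 1) for k in range(1, N)]   # arbitrary symbols
# Build the conjugate-symmetric 2N-point input for the reference transform.
X_full = X_half + [0j] + [X_half[2 * N - k].conjugate() for k in range(N + 1, 2 * N)]
ref = idft(X_full)
out = ifft_via_mdct_mdst(X_half)
max_err = max(abs(r - o) for r, o in zip(ref, out))
max_imag = max(abs(r.imag) for r in ref)
```

Here `max_err` measures the difference between the real-domain route and the complex reference, and `max_imag` confirms that the reference output is real up to rounding.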

MDCT/MDST operations of the IFFT
From the preceding discussion, we can see that the implementation issue of the IFFT is to realize the MDCT and MDST in a cost-efficient way. We can then simply combine the results of the MDCT and MDST to obtain the IFFT results based on (5). Here, we first consider the implementation of the MDCT. We follow the derivation in [9] and define $C_{2N}^{nk} = \cos(2\pi nk/2N)$, with which the MDCT can be written as (9). Decomposing the MDCT into even and odd indices of k, (9) can be rewritten as (10), where $g(n)$ and $h'(n)$ are defined in (11). Define $h(n) = 2C_{2N}^{n} h'(n)$. Following the derivation in Lee's algorithm [9], we obtain (12), which gives (13). On the other hand, by replacing index n with (N − n) in (12), we obtain (14). The special case MDCT(N/2) needs to be computed separately and simplifies to (15). The mapping of (13), (14), and (15) is shown in Figure 2. As we can see, the N-point MDCT is decomposed into two N/2-point MDCTs ($g(n)$ and $h'(n)$) plus some pre-processing and post-processing modules. We can then apply the divide-and-conquer technique to recursively expand the N/2-point MDCT until a 1-point MDCT is formed. That is, we repeat the decomposition in (10) and (11) until N = 1.

Next, we consider the recursive implementation of the MDST. We define $S_{2N}^{nk} = \sin(2\pi nk/2N)$. As with the derivation in (10), (11), (12), (13), and (14), we obtain (16). It is worth noting that the injected item is zero in the MDST. The MDST also has a special case for index N/2, which is computed separately. The mapping of the MDST structure in Figure 3 is similar to the MDCT structure, except that the minimum processing block is the 2-point MDST (see Figure 3) and the injected items do not exist in the MDST implementation. That is, we repeat the decomposition in (16) until N = 2. Note that the 1-point MDST is always equal to zero.
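The recursion can be sketched in a few lines. The code below is one plausible reading of the scheme, derived here from the kernel definitions rather than copied from (10) to (16): the odd-indexed inputs are replaced by the "odd summation" $X(2k+1) + X(2k-1)$ (with $X(-1) = 0$), the MDCT receives the injected item $(-1)^n X(N-1)$, and the result is scaled by $1/(2C_{2N}^{n})$. The function names and the test vector are hypothetical.

```python
import math

def mdct(x):
    """N-point kernel out[n] = sum_k x[k]*cos(pi*n*k/N), n = 0..N-1, N a power of two."""
    N = len(x)
    if N == 1:
        return [x[0]]
    half = N // 2
    even, odd = x[0::2], x[1::2]
    # Odd summation: x(2k+1) + x(2k-1), taking x(-1) = 0.
    odd_sum = [odd[0]] + [odd[k] + odd[k - 1] for k in range(1, half)]
    g = mdct(even)       # N/2-point kernel on the even-indexed inputs
    h = mdct(odd_sum)    # N/2-point kernel on the odd-summation inputs
    out = [0.0] * N
    for n in range(half):
        injected = x[N - 1] if n % 2 == 0 else -x[N - 1]   # (-1)^n * x(N-1)
        h_prime = (h[n] + injected) / (2.0 * math.cos(math.pi * n / N))
        out[n] = g[n] + h_prime
        if n > 0:
            out[N - n] = g[n] - h_prime
    # Special case n = N/2: alternating sum of the even-indexed inputs.
    out[half] = sum(even[k] if k % 2 == 0 else -even[k] for k in range(half))
    return out

def mdst(x):
    """N-point kernel out[n] = sum_k x[k]*sin(pi*n*k/N), n = 0..N-1, N a power of two."""
    N = len(x)
    if N == 1:
        return [0.0]     # the 1-point MDST is always zero
    half = N // 2
    even, odd = x[0::2], x[1::2]
    odd_sum = [odd[0]] + [odd[k] + odd[k - 1] for k in range(1, half)]
    g = mdst(even)
    h = mdst(odd_sum)
    out = [0.0] * N      # out[0] stays zero; no injected item is needed
    for n in range(1, half):
        h_prime = h[n] / (2.0 * math.cos(math.pi * n / N))
        out[n] = g[n] + h_prime
        out[N - n] = h_prime - g[n]
    # Special case n = N/2: alternating sum of the odd-indexed inputs.
    out[half] = sum(odd[k] if k % 2 == 0 else -odd[k] for k in range(half))
    return out

x = [float((3 * k) % 7 - 2) for k in range(16)]   # arbitrary real test vector
N = len(x)
ref_c = [sum(x[k] * math.cos(math.pi * n * k / N) for k in range(N)) for n in range(N)]
ref_s = [sum(x[k] * math.sin(math.pi * n * k / N) for k in range(N)) for n in range(N)]
err_c = max(abs(a - b) for a, b in zip(mdct(x), ref_c))
err_s = max(abs(a - b) for a, b in zip(mdst(x), ref_s))
```

The index n = N/2 is handled separately because $C_{2N}^{N/2} = 0$, so the scaling step cannot be applied there. In a DSP implementation the reciprocals $1/(2C_{2N}^{n})$ would presumably be stored in a table rather than computed at run time.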
Finally, based on (5), we combine the MDCT and MDST results together with the scaling operation (which is achieved by shifting right by $\log_2(N)$ bits) to obtain the IFFT results. This is done in the post-processing operation.

Matrix notation of the MDCT/MDST
In this section, we present the matrix notation of the proposed fast IFFT algorithm. The matrix form helps to show the divide-and-conquer nature of our approach. Following the notation in [13], we rewrite $X_r(k)$ and MDCT(n) as vectors. Then (9) can be represented as a matrix-vector product in which $[T_{N,\mathrm{MDCT}}]$ denotes the transform kernel matrix of the MDCT operation. The injected items of (13), the odd-summation operation, the scaling operation, and the special case of the MDCT in (15) are each represented by a matrix of their own: the injected-item matrix, the odd-summation matrix, the scaling matrix, and the special-case matrix, respectively.
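As a small illustration of the kernel matrix, its entries follow directly from the definition $C_{2N}^{nk} = \cos(2\pi nk/2N)$. The vector symbols $\mathbf{X}_r$ and $\mathbf{MDCT}$ below are assumed notation, and the $N = 4$ instance is a worked example rather than a matrix taken from the paper.

```latex
\begin{align}
\bigl[T_{N,\mathrm{MDCT}}\bigr]_{n,k} &= C_{2N}^{nk} = \cos\frac{2\pi nk}{2N},
   \qquad n, k = 0, 1, \ldots, N-1, \\
\mathbf{MDCT} &= \bigl[T_{N,\mathrm{MDCT}}\bigr]\,\mathbf{X}_r, \\
\bigl[T_{4,\mathrm{MDCT}}\bigr] &=
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & \tfrac{\sqrt{2}}{2} & 0 & -\tfrac{\sqrt{2}}{2} \\
1 & 0 & -1 & 0 \\
1 & -\tfrac{\sqrt{2}}{2} & 0 & \tfrac{\sqrt{2}}{2}
\end{bmatrix}.
\end{align}
```

Row $n = 2 = N/2$ contains only $0$ and $\pm 1$, which is why that output can be formed without multiplications.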

Note that the MDST is similar to the MDCT except that there are no injected items. Also, the special-case matrix is modified accordingly. The block diagram of the MDST in matrix form is shown in Figure 6.

The FFT derivation
At the receiver side (see Figure 1), the 512-point FFT is used to demodulate the received signals, as given by (30). Note that $\tilde{x}(n)$, n = 0, 1, . . . , 2N − 1, are real-valued numbers. Hence, (30) can be rewritten as (32). Equation (32) shows that the computation of the FFT is decomposed into a combination of two real-domain kernels, MDCT(k) and MDST(k). Both the MDCT and the MDST use $\tilde{x}(n)$, n = 0, 1, . . . , 2N − 1, as the inputs. Since only these two real-valued kernels are employed, no complex-valued operations are required in computing the FFT. In addition, in the DMT system, the lower N-point FFT outputs are conjugate-symmetric to the upper N-point outputs. We are only interested in the N-point data for k = 0, 1, . . . , N − 1. Hence, we can neglect the outputs $\tilde{X}(k)$ for k = N, N + 1, . . . , 2N − 1.

Figure 5: Block diagram of the MDCT operation in matrix form.
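Written out, the decomposition described in this section takes the following form. This is a reconstruction from the text rather than a verbatim copy of (30) and (32): it assumes the standard forward-DFT kernel $e^{-j2\pi nk/2N}$ and keeps the factor $-j$ outside the MDST, which may differ from the paper's sign convention.

```latex
\begin{align}
\tilde{X}(k) &= \sum_{n=0}^{2N-1} \tilde{x}(n)\, e^{-j 2\pi nk/2N} \\
 &= \underbrace{\sum_{n=0}^{2N-1} \tilde{x}(n)\cos\frac{2\pi nk}{2N}}_{\mathrm{MDCT}(k)}
    \;-\; j\,\underbrace{\sum_{n=0}^{2N-1} \tilde{x}(n)\sin\frac{2\pi nk}{2N}}_{\mathrm{MDST}(k)} .
\end{align}
```

Because $\tilde{x}(n)$ is real, both underbraced sums are real, so the real and imaginary parts of $\tilde{X}(k)$ are obtained separately with real arithmetic.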

MDCT/MDST operations of the FFT
In (32), the transform kernels are the 2N-point MDCT(k) and MDST(k). Here, we propose a novel approach to further reduce the computational complexity. We generate two N-point sequences, $\tilde{x}_c(n)$ and $\tilde{x}_s(n)$, from the 2N input samples, and we have (36), where $\tilde{x}_c(0) = 0$ and $\tilde{x}_s(0) = 0$. Since the block size is reduced from 2N points (see (32)) to N points (see (36)), we only need to perform the N-point MDCT/MDST.
Next, following the derivation of the IFFT in Section 2, we obtain the corresponding recursive decomposition of the MDCT(k). Similarly, the MDST(k) is decomposed in the same way, and the two special cases for index N/2 are computed separately. The block diagram of the MDCT(k) is shown in Figure 7. The mapping of the MDST structure is similar to the MDCT structure in Figure 7, except that the minimum processing block is the 2-point MDST and the injected items do not exist in the MDST(k) implementation (see Figure 8). We can then simply combine the MDCT(k) and MDST(k) outputs, followed by adding $\tilde{x}(0)$ and $\tilde{x}(N)(-1)^k$, to obtain the FFT results based on (36).
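The reduction from 2N-point to N-point kernels can be made explicit by pairing sample $n$ with sample $2N - n$. The following is a sketch that is consistent with the boundary values $\tilde{x}_c(0) = \tilde{x}_s(0) = 0$ and with the post-processing terms $\tilde{x}(0)$ and $\tilde{x}(N)(-1)^k$ stated in the text; it is not a verbatim copy of (36), and the definitions of $\tilde{x}_c(n)$ and $\tilde{x}_s(n)$ for $n \ge 1$ are an assumption.

```latex
\begin{align}
\tilde{x}_c(n) &= \tilde{x}(n) + \tilde{x}(2N-n), \qquad
\tilde{x}_s(n) = \tilde{x}(n) - \tilde{x}(2N-n), \qquad n = 1, \ldots, N-1, \\
\tilde{X}(k) &= \tilde{x}(0) + (-1)^k\,\tilde{x}(N)
   + \sum_{n=0}^{N-1} \tilde{x}_c(n)\cos\frac{2\pi nk}{2N}
   - j\sum_{n=0}^{N-1} \tilde{x}_s(n)\sin\frac{2\pi nk}{2N}.
\end{align}
```

The pairing works because $e^{-j2\pi(2N-n)k/2N} = e^{+j2\pi nk/2N}$, so the two samples share the same cosine and have sines of opposite sign.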

Overall FFT computation procedures
The overall computation flow of the FFT is shown in Figure 9. The operations are as follows.
(2) In the first phase, the generated $\tilde{x}_c(n)$ are fed into the recursive butterfly operation to obtain the MDCT(k) outputs.
(3) In the second phase, we repeat the computation with $\tilde{x}_s(n)$ as the inputs to the recursive butterfly operation to obtain the MDST(k) outputs.
(4) We combine the MDCT(k) and MDST(k) results and then add $\tilde{x}(0)$ and $\tilde{x}(N)(-1)^k$ to obtain the FFT results based on (36). This is done in the post-processing operation.

Compared with the IFFT, the difference is that the FFT requires a pre-processing stage to compute $\tilde{x}_c(n)$ and $\tilde{x}_s(n)$. The block diagrams of the MDCT and MDST are shown in Figures 10 and 11, respectively.
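The overall flow (pre-processing, two real kernels, post-processing) can be sketched as follows. This is a minimal illustration under stated assumptions: the sequences are taken as $\tilde{x}_c(n) = \tilde{x}(n) + \tilde{x}(2N-n)$ and $\tilde{x}_s(n) = \tilde{x}(n) - \tilde{x}(2N-n)$ for n ≥ 1, the kernels are evaluated by direct summation in place of the recursive butterflies, and the function name `fft_real_folded` and the test signal are hypothetical.

```python
import cmath
import math

def fft_real_folded(x):
    """x: 2N real samples. Returns the N outputs X(0..N-1) of the 2N-point FFT."""
    N = len(x) // 2
    # Pre-processing: fold the 2N inputs into two N-point sequences.
    xc = [0.0] + [x[n] + x[2 * N - n] for n in range(1, N)]
    xs = [0.0] + [x[n] - x[2 * N - n] for n in range(1, N)]
    out = []
    for k in range(N):
        # Phase 1: N-point cosine kernel (direct sum stands in for the butterflies).
        c = sum(xc[n] * math.cos(math.pi * n * k / N) for n in range(N))
        # Phase 2: N-point sine kernel.
        s = sum(xs[n] * math.sin(math.pi * n * k / N) for n in range(N))
        # Post-processing: add x(0) and x(N)(-1)^k, then form the complex output.
        sign = 1.0 if k % 2 == 0 else -1.0
        out.append(complex(c + x[0] + sign * x[N], -s))
    return out

x = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(16)]   # 2N = 16
M = len(x)
ref = [sum(x[n] * cmath.exp(-2j * math.pi * n * k / M) for n in range(M))
       for k in range(M // 2)]
max_err = max(abs(a - b) for a, b in zip(fft_real_folded(x), ref))
```

Only the final `complex(...)` packaging touches a complex type; every arithmetic operation before it is real-valued, which is the property the algorithm relies on.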

Comparison of hardware complexity
In this section, we compare the computational complexity of the proposed algorithm with that of the traditional Cooley-Tukey 2N-point IFFT/FFT (see Table 1). The complexity ratio is defined as $O_2/O_1$, where $O_1$ and $O_2$ are the numbers of multiplications (or additions) in other fast algorithms and in our approach, respectively. We can see that the complexity ratio of the multiplications is only 17% for N = 256 compared with the conventional IFFT/FFT. Table 1 also shows that our approach gains more computational savings as N gets larger, as in VDSL systems [14].
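A rough count shows how a ratio of this size can arise. The sketch below is not a reproduction of Table 1; it rests on assumptions made here: each recursion level of an N-point kernel costs N/2 − 1 real multiplications (the scaling for n = 0 is a factor of 1/2, that is, a shift), the reciprocal scaling factors are precomputed, the baseline is a radix-2 complex FFT with (M/2) log2(M) complex multiplications at 4 real multiplications each, and no trivial-twiddle savings are credited to the baseline.

```python
import math

def kernel_mults(N):
    """Assumed real-multiplication count of one N-point recursive MDCT or MDST."""
    if N <= 2:
        return 0                      # only additions and shifts remain
    return 2 * kernel_mults(N // 2) + (N // 2 - 1)

def cooley_tukey_real_mults(M):
    """Radix-2 M-point complex FFT: (M/2)*log2(M) complex mults, 4 real mults each."""
    return 4 * (M // 2) * int(math.log2(M))

N = 256
proposed = 2 * kernel_mults(N)        # one MDCT plus one MDST
baseline = cooley_tukey_real_mults(2 * N)
ratio_percent = 100.0 * proposed / baseline
```

Under these assumptions the count is 1538 real multiplications against 9216, a ratio of about 16.7%, which is in line with the 17% quoted for N = 256. A different baseline convention (for example, 3 real multiplications per complex multiplication) would change the figure.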

Experiment results
There are many DSP processors on the market. Because of the variety of hardware structures, coding styles, compilers, and so forth, we do not attempt detailed optimization for specific processors. Instead, we compare the proposed algorithm with the Cooley-Tukey algorithm, which is a baseline for FFT realization. The implementation platform is the TI TMS320C54 evaluation board (http://www.ti.com). Both algorithms are written in C without any assembly-level programming tricks, and the TI C54X C compiler is used without any special compilation options. Table 2 compares the proposed algorithm and the conventional FFT in terms of clock cycles. As we can see, the proposed algorithm requires only about 30% of the clock cycles of the Cooley-Tukey algorithm. This result is consistent with our observation in Table 1.

Finite-precision effect
In a fixed-point implementation of the IFFT/FFT kernels, it is important to consider the effects of finite register length on the IFFT/FFT calculations (see [12, Chapter 9] and [15]). To compare the butterfly approach and our approach in fixed-point implementation, we conducted extensive computer simulations in MATLAB for a finite-wordlength IFFT/FFT architecture. Figure 12 shows the SNR performance for assigned wordlengths of B = 8, 16, and 32 bits. We observe that the SNR performance with B = 16 bits is good enough for practical fixed-point implementations. The simulation results also show that the SNR performance of our approach is comparable to that of the traditional butterfly approach under the same wordlength.

CONCLUSIONS
In this paper, we develop a computationally efficient fast algorithm for the software implementation of the IFFT/FFT kernel in the DMT system. We reformulate the IFFT/FFT functions so as to avoid complex-domain operations. The complexity ratio of the multiplications is only 17% compared with the direct butterfly implementation approach. The proposed algorithm provides a good solution in reducing MIPS count in programmable DSP implementation for the applications of the DMT transceiver systems.

A. DERIVATION OF (4)
Decomposing (4) into the first half and the second half, and using the fact that X(0) = X(N) = 0, (4) can be represented as (B.1). Use n = 2N − n to replace the variable in the second term. Then, we have (B.2).
Because n is a dummy variable, we can rewrite (B.2) as