EURASIP Journal on Applied Signal Processing 2002:9, 944–953
© 2002 Hindawi Publishing Corporation

Frequency Spectrum Based Low-Area Low-Power Parallel FIR Filter Design

Parallel (or block) FIR digital filters can be used either for high-speed or for low-power (with reduced supply voltage) applications. Traditional parallel filter implementations cause a linear increase in the hardware cost with respect to the block size. Recently, an efficient parallel FIR filter implementation technique requiring a less-than-linear increase in the hardware cost was proposed. This paper makes two contributions. First, the filter spectrum characteristics are exploited to select the best fast filter structures. Second, a novel block filter quantization algorithm is introduced. Using filter benchmarks, it is shown that the use of the appropriate fast FIR filter structures and the proposed quantization scheme can reduce the number of binary adders by up to 20%.


INTRODUCTION
Finite impulse response (FIR) filters are widely used in various DSP applications. In some applications, the FIR filter circuit must be able to operate at high sample rates, while in other applications, the FIR filter circuit must be a low-power circuit operating at moderate sample rates. The low-power or low-area techniques developed specifically for digital filters can be found in [1,2,3,4,5,6,7].
Parallel (or block) processing can be applied to digital FIR filters to either increase the effective throughput or reduce the power consumption of the original filter. While sequential FIR filter implementation has been given extensive consideration, very little work has been done that deals directly with reducing the hardware complexity or power consumption of parallel FIR filters.
Traditionally, the application of parallel processing to an FIR filter involves the replication of the hardware units that exist in the original filter. If the area required by the original circuit is A, then the L-parallel circuit requires an area of L×A. Recently, an efficient parallel FIR filter implementation technique requiring a less-than-linear increase in the hardware cost was proposed using fast FIR algorithms (FFAs) [8].
In [9], it was shown that the power consumption of arithmetic units can be reduced if statistical properties of the input signals are exploited. In this paper, based on [10], it is shown that the hardware cost can be reduced by exploiting the frequency spectrum characteristics of the given transfer function. This is achieved by selecting appropriate FFA structures out of many possible FFA structures, all of which have similar hardware complexity at the word level but can differ significantly in complexity at the bit level. For example, in narrowband low-pass filters, the signs of consecutive unit sample response values do not change much, and therefore their difference can require fewer bits than their sum. This favors a parallel structure whose subfilters are formed from the difference of consecutive unit sample response values rather than from their sum.
In addition to the appropriate selection of FFA structures, proper quantization of subfilters is important for low-power or low-hardware-cost implementation of parallel FIR filters. It is shown in [5,6,7] that if the filter coefficients are first scaled before the quantization process is performed, the resulting filter will have much better frequency-space characteristics. When the quantized filter is implemented, a postprocessing scale factor (PPSF) is used to properly adjust the magnitude of the filter output. In cases where large levels of parallelism are used, the number of required subfilters is large, and consequently the PPSFs can contribute a significant amount of hardware overhead. In [8], PPSFs are restricted to a set of simple values to reduce this overhead. Since the original PPSF is replaced with the new simple PPSF that is nearest in value, the quantized filter coefficients must also be modified accordingly. However, this approach is not guaranteed to give optimal quantized coefficients, since already quantized coefficients are modified again. To avoid this problem, we propose the look-ahead maximum absolute difference (LMAD) quantization algorithm, which gives optimal quantized coefficients for a given simple PPSF value.
In Section 2, FFAs are briefly reviewed. Also, frequency spectrum related hardware complexities for different types of FFAs are discussed. Section 3 presents a quantization method suitable for block FIR filters. Section 4 presents several block filter design examples.

FAST FIR ALGORITHMS
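Consider the general formulation of a length-N FIR filter,

$$y(n) = \sum_{i=0}^{N-1} h_i\, x_{n-i}, \qquad n = 0, 1, 2, \ldots, \tag{1}$$

where $\{x_i\}$ is an infinite-length input sequence and $\{h_i\}$ are the length-N FIR filter coefficients. Then the polyphase representation of a traditional L-parallel FIR filter [11] can be expressed as

$$\sum_{i=0}^{L-1} Y_i(z^L)\, z^{-i} = \left( \sum_{i=0}^{L-1} X_i(z^L)\, z^{-i} \right) \left( \sum_{j=0}^{L-1} H_j(z^L)\, z^{-j} \right), \tag{2}$$

where

$$X_i(z) = \sum_{k=0}^{\infty} z^{-k} x_{Lk+i}, \qquad H_j(z) = \sum_{k=0}^{N/L-1} z^{-k} h_{Lk+j}, \qquad Y_i(z) = \sum_{k=0}^{\infty} z^{-k} y(Lk+i). \tag{3}$$

This block FIR filtering equation shows that the parallel FIR filter can be realized using $L^2$ FIR filters of length N/L. This linear complexity can be reduced using various FFA structures.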
For L = 2, (2) becomes

$$Y_0 + z^{-1} Y_1 = (X_0 + z^{-1} X_1)(H_0 + z^{-1} H_1) = H_0 X_0 + z^{-1}(H_0 X_1 + H_1 X_0) + z^{-2} H_1 X_1,$$

which implies that

$$Y_0 = H_0 X_0 + z^{-2} H_1 X_1, \qquad Y_1 = H_0 X_1 + H_1 X_0. \tag{4}$$

Direct implementation of (4) is shown in Figure 1. This structure computes a block of 2 outputs using 4 length-N/2 FIR filters and 2 postprocessing additions, which requires 2N multipliers and 2N − 2 adders.

Figure 1: Traditional 2-parallel FIR filter.

If (4) is written in a different form, the (2 × 2) FFA0 (FFA type 0) is obtained,

$$Y_0 = H_0 X_0 + z^{-2} H_1 X_1, \qquad Y_1 = H_{0+1} X_{0+1} - H_0 X_0 - H_1 X_1, \tag{5}$$

where $H_{i+j} = H_i + H_j$ and $X_{i+j} = X_i + X_j$. Implementation of (5) is shown in Figure 2. This structure computes a block of 2 outputs using 3 length-N/2 FIR filters and 4 preprocessing and postprocessing additions, which requires 3N/2 multipliers and 3(N/2 − 1) + 4 adders.

Figure 2: 2-parallel FIR filter using FFA0.

By a simple modification of (5), the following FFA1 (FFA type 1) is derived [11]:

$$Y_0 = H_0 X_0 + z^{-2} H_1 X_1, \qquad Y_1 = H_0 X_0 + H_1 X_1 - H_{0-1} X_{0-1}. \tag{6}$$

In (6), $H_{0-1} = H_0 - H_1$ and $X_{0-1} = X_0 - X_1$. The structure derived by FFA1 is shown in Figure 3. The structures derived by FFA0 and FFA1 are essentially the same except for some sign changes. Notice that, in FFA1, $H_{0-1}$ is used instead of $H_{0+1}$.

When an FIR filter is implemented using a multiplierless approach, the hardware complexity is directly proportional to the number of nonzero bits in the filter coefficients. If the signs of the given impulse response sequence do not change frequently, as in the narrowband low-pass filter case, the coefficient magnitudes of $H_0 + H_1$ are likely to be larger than those of $H_0 - H_1$. Then $H_0 + H_1$ has more nonzero bits in its coefficients than $H_0 - H_1$. (See examples in Section 4.) If the signs of the given impulse response sequence change frequently, as in the wideband low-pass filter case, $H_0 - H_1$ is likely to have more nonzero bits in its coefficients than $H_0 + H_1$. Thus, to achieve minimum hardware cost, it is necessary to select either FFA0 or FFA1 depending upon the frequency spectrum specifications.
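The equivalence of the two (2 × 2) structures with direct filtering can be checked numerically. The following is a minimal sketch, not the authors' implementation: it works in the time domain on finite sequences, assumes even-length input and filter, and the function names `convolve` and `ffa_2parallel` are illustrative only.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def ffa_2parallel(x, h, kind="FFA0"):
    """Filter x with h using three length-N/2 subfilters (even lengths assumed)."""
    if len(x) % 2 or len(h) % 2:
        raise ValueError("x and h must have even length")
    x0, x1 = x[0::2], x[1::2]   # polyphase components of the input
    h0, h1 = h[0::2], h[1::2]   # polyphase components of the filter
    a = convolve(h0, x0)        # H0 X0
    b = convolve(h1, x1)        # H1 X1
    if kind == "FFA0":
        # Third subfilter uses the sums H0 + H1 and X0 + X1.
        c = convolve([p + q for p, q in zip(h0, h1)],
                     [p + q for p, q in zip(x0, x1)])
        y1 = [ci - ai - bi for ai, bi, ci in zip(a, b, c)]
    elif kind == "FFA1":
        # Third subfilter uses the differences H0 - H1 and X0 - X1.
        c = convolve([p - q for p, q in zip(h0, h1)],
                     [p - q for p, q in zip(x0, x1)])
        y1 = [ai + bi - ci for ai, bi, ci in zip(a, b, c)]
    else:
        raise ValueError("kind must be 'FFA0' or 'FFA1'")
    # Y0 = H0 X0 + z^-2 H1 X1: the H1 X1 branch is delayed by one block.
    y0 = [ai + bi for ai, bi in zip(a + [0.0], [0.0] + b)]
    y1 = y1 + [0.0]
    # Interleave the output phases: y(2k) = y0[k], y(2k+1) = y1[k].
    y = [0.0] * (2 * len(y0))
    y[0::2] = y0
    y[1::2] = y1
    return y
```

Both values of `kind` should reproduce `convolve(h, x)` followed by one trailing zero, while using three half-length convolutions instead of four.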

Cascading FFAs
The (2 × 2) and (3 × 3) FFAs can be cascaded together to achieve higher levels of parallelism. The cascading of FFAs is a straightforward extension of the original FFA application [8]. For example, an (m × m) FFA can be cascaded with an (n×n) FFA to produce an (m×n)-parallel filtering structure. The set of FIR filters that result from the application of the (m × m) FFA are further decomposed, one at a time, by the application of the (n × n) FFA. The resulting set of filters will be of length N/(m × n).
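As a worked count for the simplest cascade, take m = n = 2 and assume N is divisible by 4. Each application of the (2 × 2) FFA replaces one filter by three half-length subfilters, so two levels give nine subfilters of length N/4, whereas the traditional 4-parallel structure needs $L^2 = 16$ subfilters of the same length:

$$\underbrace{3 \times 3}_{\text{subfilters}} \times \frac{N}{4} = \frac{9N}{4} \ \text{multipliers (cascaded FFA)}, \qquad \underbrace{4^2}_{\text{subfilters}} \times \frac{N}{4} = 4N \ \text{multipliers (traditional)}.$$

Only multipliers are counted here; the preprocessing and postprocessing adders are not included in this comparison.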
For example, the (4 × 4) FFA can be obtained by first applying the (2 × 2) FFA0 to (2) and then applying the (2 × 2) FFA0 or the (2 × 2) FFA1 to each of the filtering operations that result from the first application of the FFA0. The resulting (4 × 4) FFA structure is shown in Figure 7. Each filter block $F_0$, $F_0 + F_1$, and $F_1$ represents a (2 × 2) FFA structure and can be replaced separately by either the (2 × 2) FFA0 or the (2 × 2) FFA1. Each filter block is composed of three subfilters as follows, where the third subfilter is the sum of the first two for FFA0 and their difference for FFA1:

$$F_0:\ H_0,\ H_2,\ H_0 \pm H_2; \qquad F_0 + F_1:\ H_{0+1},\ H_{2+3},\ H_{0+1} \pm H_{2+3}; \qquad F_1:\ H_1,\ H_3,\ H_1 \pm H_3.$$

When the filter block $F_0 + F_1$ is implemented using the FFA1 structure, the subfilters are $H_{0+1}$, $H_{2+3}$, and $H_{0+1} - H_{2+3}$. Because $H_{0+1}$ and $H_{2+3}$ are still sums of consecutive coefficients, optimum performance is not guaranteed even though the FFA1 structure is used for a slowly varying impulse response sequence. In this case, better performance can be obtained by using the FFA1′ shown in Figure 8. Since the subfilters in FFA1′ are $H_{0-1}$, $H_{2-3}$, and $H_{0-1} - H_{2-3}$, FFA1′ gives a smaller number of nonzero bits than FFA1 for slowly varying impulse response sequences. Notice that the FFA1′ structure can be derived by first applying the (2 × 2) FFA1 (instead of the (2 × 2) FFA0) to (2). When the filter block $F_0 + F_1$ in Figure 7 is replaced by the FFA1′ in Figure 8, it can be shown that the outputs are y(4k), −y(4k + 1), y(4k + 2), and −y(4k + 3).

Selection of FFA types
For given length-N unit sample response values $\{h_i\}$ and block size L, the best FFA type can be roughly determined by comparing the signs of the values in the subfilters. For example, in the case of L = 2 and even N, $H_0$ and $H_1$ are

$$H_0 = \{h_0, h_2, \ldots, h_{N-2}\}, \qquad H_1 = \{h_1, h_3, \ldots, h_{N-1}\}.$$

Figure 4: 3-parallel FIR filter using FFA0.

Figure 5: 3-parallel FIR filter using FFA1.
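For L = 2, the sign comparison can be sketched as follows. This is a minimal illustration rather than the paper's exact procedure: it pairs each value of $H_0$ with the value of $H_1$ at the same index, counts same-sign and opposite-sign pairs, and prefers FFA1 when same-sign pairs dominate, which matches the rule applied in the design examples. The function name `choose_ffa_type`, the handling of zero-valued coefficients (ignored), and the tie-break (FFA0) are assumptions.

```python
def choose_ffa_type(h):
    """Pick FFA0 or FFA1 for a 2-parallel filter by comparing coefficient signs.

    Returns (kind, same, opposite), where same/opposite count the pairs
    (h[2k], h[2k+1]) whose signs agree or differ. Pairs containing a zero
    are ignored (an assumption, not taken from the paper).
    """
    if len(h) % 2:
        raise ValueError("h must have even length")
    same = opposite = 0
    for even, odd in zip(h[0::2], h[1::2]):
        product = even * odd
        if product > 0:
            same += 1
        elif product < 0:
            opposite += 1
    # Same signs dominate: H0 - H1 has smaller magnitudes, so use FFA1.
    # Otherwise (including ties, by assumption) use FFA0.
    kind = "FFA1" if same > opposite else "FFA0"
    return kind, same, opposite
```

For example, `choose_ffa_type([1, 2, 3, 3, 2, 1])` selects FFA1 because all three pairs have the same sign, while an alternating-sign sequence selects FFA0.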

LOOK-AHEAD MAD QUANTIZATION
It is shown in [5,6,7] that if the filter coefficients are first scaled before the quantization process is performed, the resulting filter will have much better frequency-space characteristics. The NUS algorithm [6] employs a scalable quantization process. To begin the process, the ideal filter is normalized so that the largest coefficient has an absolute value of 1. The normalized ideal filter is then multiplied by a variable scale factor (VSF). The VSF steps through the range of numbers from 0.4375 to 1.13 with a step size of $2^{-W}$, where W is the coefficient word length. Signed power-of-two (SPT) terms are then allocated to the quantized filter coefficient that represents the largest absolute difference between the scaled ideal filter and the quantized filter. The NUS algorithm iteratively allocates SPT terms until the desired number of SPT terms is allocated or until the desired normalized peak ripple (NPR) specification is met. Once the allocation of terms stops, the NPR is calculated. The process is then repeated for a new scale factor. The quantized filter leading to the minimum NPR is chosen.
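The SPT allocation step for a single VSF value can be sketched as below. This is only an illustration of the greedy idea described above, not the NUS algorithm of [6] itself: the choice of the power of two nearest to the remaining error, the allowed exponents 0 to W, the early stop once no term can reduce the error, and the name `allocate_spt_terms` are all assumptions. The NPR-based stopping rule is omitted, and `scaled` is assumed to be non-empty.

```python
def allocate_spt_terms(scaled, num_terms, word_length):
    """Greedily allocate signed power-of-two terms to scaled coefficients.

    scaled      -- normalized ideal coefficients already multiplied by the VSF
    num_terms   -- total budget of SPT terms for the whole filter
    word_length -- W; terms are restricted to +/- 2**-k for k = 0..W
    Returns (quantized coefficients, number of terms actually used).
    """
    powers = [2.0 ** -k for k in range(word_length + 1)]
    quantized = [0.0] * len(scaled)
    used = 0
    while used < num_terms:
        errors = [s - q for s, q in zip(scaled, quantized)]
        # Coefficient with the largest absolute difference gets the next term.
        i = max(range(len(errors)), key=lambda k: abs(errors[k]))
        magnitude = abs(errors[i])
        # Below half of the smallest term, adding a term would not help.
        if magnitude <= 2.0 ** -(word_length + 1):
            break
        term = min(powers, key=lambda p: abs(p - magnitude))
        quantized[i] += term if errors[i] > 0 else -term
        used += 1
    return quantized, used
```

With `scaled = [0.75, -0.25, 0.1]`, a budget of four terms, and W = 4, the sketch spends two terms on the first coefficient (1 − 0.25) and one on each of the others.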
In parallel FIR filters, the NPR cannot be used as a selection criterion for choosing the best quantized filter, since passband/stopband ripples cannot be defined for the set of subfilters obtained by the application of FFAs. In [8], it is shown that the maximum absolute difference (MAD) between the frequency responses of the ideal filter and the quantized filter can be used as an efficient selection criterion for parallel filters.

Figure 6: 3-parallel FIR filter using FFA2.
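A MAD value of this kind can be computed as in the following sketch. The details are assumptions made for illustration: the difference is taken between complex frequency responses sampled on a uniform grid over [0, π], the quantized coefficients are multiplied by the PPSF before comparison, and the function names and the default grid size are hypothetical.

```python
import cmath
import math


def frequency_response(h, num_points=512):
    """Sample H(e^{jw}) = sum_n h[n] e^{-jwn} on a uniform grid over [0, pi]."""
    response = []
    for k in range(num_points):
        w = math.pi * k / (num_points - 1)
        response.append(sum(c * cmath.exp(-1j * w * n) for n, c in enumerate(h)))
    return response


def max_abs_difference(h_ideal, h_quantized, ppsf=1.0, num_points=512):
    """Largest |H_ideal - PPSF * H_quantized| over the sampled frequencies."""
    ideal = frequency_response(h_ideal, num_points)
    actual = frequency_response([ppsf * c for c in h_quantized], num_points)
    return max(abs(a - b) for a, b in zip(ideal, actual))
```

For instance, comparing `[0.3, 0.5, 0.3]` with its quantized version `[0.25, 0.5, 0.25]` gives a MAD of about 0.1, reached at ω = 0 where the two coefficient errors add in phase.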
When the quantized filter is implemented, a postprocessing scale factor (PPSF) is used to properly adjust the magnitude of the filter output. The PPSF is calculated as given in (13). In cases where large levels of parallelism are used, the PPSFs can contribute a significant amount of hardware overhead. In [8], to reduce this hardware overhead, the PPSFs are restricted to the following set of values: {0…}. The original PPSF is replaced with the new simple PPSF that is nearest in value, and the quantized coefficients are modified accordingly. This is accomplished using the following three steps: (i) determine effective coefficients, effective coeffs. = quantized coeffs. × PPSF; (ii) determine shifted coefficients with new PPSF, shifted coeffs. = effective coeffs./new PPSF; (iii) quantize the shifted coefficients. However, the above steps are not guaranteed to give optimal quantized coefficients for the new PPSF value, because the quantization in (iii) is performed on already quantized coefficients.
To avoid this problem, the LMAD quantization algorithm is proposed. In the proposed algorithm, the PPSF for a given VSF is computed by (13) before the quantization step begins. If the number of nonzero bits in the computed PPSF is less than a prespecified value, then the normalized coefficients are scaled by the VSF and the scaled coefficients are quantized. Otherwise, the procedure is repeated for the next VSF value.
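The look-ahead test on the PPSF needs a count of its nonzero bits. One way to obtain such a count is sketched below, under assumptions that are not taken from the paper: the PPSF is rounded to `frac_bits` fractional bits, and nonzero digits are counted in its canonical signed-digit form (a minimal signed power-of-two representation). The names `csd_nonzero_digits` and `ppsf_is_simple` and the default precision are hypothetical.

```python
def csd_nonzero_digits(value, frac_bits):
    """Count nonzero digits in the canonical signed-digit form of value.

    value is first rounded to frac_bits fractional bits; the sign is ignored.
    """
    n = abs(round(value * (1 << frac_bits)))
    count = 0
    while n:
        if n & 1:
            # Digit is +1 when n mod 4 == 1 and -1 when n mod 4 == 3;
            # subtracting the digit makes n divisible by 4 (or zero).
            n -= 2 - (n & 3)
            count += 1
        n >>= 1
    return count


def ppsf_is_simple(ppsf, limit, frac_bits=8):
    """True if the PPSF has fewer than `limit` nonzero digits (look-ahead test)."""
    return csd_nonzero_digits(ppsf, frac_bits) < limit
```

A PPSF of 0.875 = 1 − 2⁻³ needs two nonzero digits, so it passes the test with `limit=3`, whereas 0.6875 = 1 − 2⁻² − 2⁻⁴ needs three and is skipped in favor of the next VSF value.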
In [8], the number of nonzero bits in the PPSF is fixed. In the proposed approach, however, the number of nonzero bits in the PPSF can be varied and the PPSF value giving the best performance can be selected. In our simulations, increasing the number of nonzero bits in the PPSF beyond three did not improve the numerical performance significantly.

Notice that the MAD value obtained by the proposed method is only 45% of the MAD value in [8]. Frequency responses are compared in Figure 9. Table 1 shows that, for the two low-pass FIR filter examples in [8], the proposed method can save up to 24% of the adders. In [8], only FFA type 0 is used for each value of L. However, as can be seen from Table 1, better results are obtained by selecting the FFA type(s) properly for each L.
To compare the hardware savings due to the quantization and due to the proper selection of FFA types, only the $H_{0+1}$ or $H_{0-1}$ subfilters are considered. From Table 2, the number of nonzero bits for $H_{0+1}$ of the nonscaled FFA0 filter is 14, while the number of nonzero bits for $H_{0+1}$ of the scaled FFA0 filter is 10 (including the PPSF). Thus, in addition to the word-length reduction, a hardware saving of about 28% can be obtained by LMAD scaling.
From Table 4, the number of nonzero bits for $H_{0-1}$ of the scaled FFA1 filter is 7 (including the PPSF). Thus, a further saving of 22% is obtained by the selection of the proper filter type. In this example, about half of the saving is due to the LMAD quantization and the other half is due to proper filter type selection.

DESIGN EXAMPLES
In this section, three design examples with various frequency specifications are given.
Example 3. Consider a narrowband low-pass filter with filter order = 35, passband edge = 0.2π, maximum passband ripple = 0.185 dB, stopband edge = 0.3π, and minimum stopband attenuation = 33.5 dB. As can be seen from Figure 11, the signs of the impulse response sequences (designed by the Remez exchange algorithm) change slowly.
For L = 2, according to the discussion in Section 2.4, the number of pairs with the same signs is 16, while the number of pairs with opposite signs is only 2. Thus, FFA1 is more efficient than FFA0. With the LMAD quantization algorithm, the number of nonzero bits required for $H_{0+1}$ is 42, but the number of nonzero bits required for $H_{0-1}$ is 24. Thus the hardware cost of $H_{0-1}$ is about 57% of the hardware cost of $H_{0+1}$. The frequency responses for L = 2 are compared in Figure 12.

For L = 4, the number of pairs with opposite signs in the subfilter pair $\{H_0, H_2\}$ is 7, while the number of pairs with the same signs is 2. Thus, FFA0 is the most efficient for $F_0$. The number of pairs with opposite signs in the subfilter pair $\{H_1, H_3\}$ is 7, while the number of pairs with the same signs is 2. Thus, FFA0 is the most efficient for $F_1$. By a similar procedure, it can be shown that FFA1 is the most efficient choice for $F_0 + F_1$.
The design results for L = 2, 3, and 4 are summarized in Table 5. For L = 2 and L = 3, about 20% of the hardware can be saved by a proper choice of FFA types. However, for L = 4, only a 7% hardware saving can be achieved by a proper choice of FFA types. The main reason is that the correlation of filter coefficients between subfilters is reduced as the block size increases.

… Table 6. For L = 2 and L = 3, about 12%-15% of the hardware can be saved by a proper choice of FFA types. For L = 4, a 4% hardware saving can be achieved by a proper choice of FFA types.

CONCLUSIONS
It has been shown that the hardware cost and power consumption of parallel FIR filters can be reduced significantly by exploiting the frequency spectrum characteristics. For example, in narrowband low-pass filters, the signs of consecutive unit sample response values do not change much, and therefore their difference (FFA1) can require fewer bits than their sum (FFA0). In wideband low-pass filters, the signs of consecutive unit sample response values change frequently, and therefore their sum (FFA0) can require fewer bits than their difference (FFA1). To determine the best FFA type for a given impulse response sequence and block size L, a sign-comparing procedure was proposed. The usefulness of the proposed sign-comparing procedure was demonstrated by several examples. Also, the proposed look-ahead MAD quantization algorithm was shown to be very efficient for the implementation of parallel FIR filters. Substructure sharing is the process of examining the hardware implementation of the filter coefficients and sharing the hardware units that are common among the filter coefficients. Using the substructure sharing techniques in [8], further savings in hardware cost and power consumption can be achieved.
Developing a similar approach to power reduction of adaptive FIR filters is an interesting direction for future research. Further research also needs to be directed towards finite word-length analysis of these low-power parallel FIR filters.