FIR filter optimization for video processing on FPGAs
- Martin Kumm^{1},
- Diana Fanghänel^{1},
- Konrad Möller^{1},
- Peter Zipf^{1} and
- Uwe Meyer-Baese^{2}
https://doi.org/10.1186/1687-6180-2013-111
© Kumm et al.; licensee Springer. 2013
Received: 31 January 2013
Accepted: 30 April 2013
Published: 25 May 2013
Abstract
Two-dimensional finite impulse response (FIR) filters are an important component in many image and video processing systems. The processing of complex video applications in real time requires high computational power, which can be provided using field programmable gate arrays (FPGAs) due to their inherent parallelism. The most resource-intensive components in computing FIR filters are the multiplications of the folding operation. This work proposes two optimization techniques for high-speed implementations of the required multiplications with the least possible number of FPGA components. Both methods use integer linear programming formulations which can be optimally solved by standard solvers. In the first method, a formulation for the pipelined multiple constant multiplication problem is presented. In the second method, multiplication structures based on look-up tables are also taken into account. Due to the low coefficient word size in video processing filters of typically 8 to 12 bits, an optimal solution is found for most of the filters in the benchmark used. A complexity reduction of 8.5% for a Xilinx Virtex 6 FPGA could be achieved compared to state-of-the-art heuristics.
Introduction
Two-dimensional linear filters with finite impulse response (FIR) are one of the most fundamental operations used in image and video processing. They are used, e. g., in applications which contain contrast improvement, denoising, sharpening, target matching, and feature enhancement [1]. Compared to infinite impulse response filters, FIR filters are inherently stable, and high-throughput implementations are easily possible using pipelining as no recursions are involved. However, they are computationally expensive as many multiply-accumulate (MAC) operations are necessary for each pixel of the resulting image. While this is very demanding for a microprocessor or digital signal processor, the inherent parallelism of field programmable gate arrays (FPGAs) can be used to accelerate the FIR operation.
Modern FPGAs directly incorporate embedded multipliers or DSP blocks which also include pre- and post-adders for MAC operations. Xilinx’s DSP blocks (Xilinx Inc., San Jose, CA, USA) of Virtex 5/6, Spartan 6, and the 7 series FPGAs provide 18×25-bit signed multipliers. More flexible are the variable precision DSP blocks of the latest FPGAs of Altera, the Stratix V, Cyclone V, and Arria V devices (Altera, San Jose, CA, USA). Each DSP block can be configured as three independent 9×9-bit multipliers, two independent 16×16-bit, 15×17-bit, or 14×18-bit multipliers, or a single 18×36-bit or 27×27-bit multiplier.
However, embedded multipliers and DSP blocks are limited resources even on modern low-cost FPGAs, and they may have a higher power consumption compared to constant multiplication using the carry-chain resources [2]. Especially in image processing, embedded multipliers are often underutilized because of the small word lengths used. Typically, only 8 bits per color and 10 bits for a luminance representation are used.
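- (a) Multiplication using additions, subtractions, and bit shifts (adder graphs)
- (b) Multiplication using look-up tables (LUTs)
- (c) Distributed arithmetic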
In method (a), constant multiplications are realized using additions, subtractions, and bit shifts only. These operations form a so-called adder graph, so this method is called the adder graph MCM method in the following. It was originally developed for software or VLSI applications [3] but also maps well to the fast carry chains of FPGAs. In method (b), the input word is split into smaller chunks that fit into the input word size of the FPGA LUTs. The LUT results are then shifted and added to form the multiplication result. LUTs and adders are also used in method (c), but there the folding equation of the FIR filter is rearranged in such a way that identical LUTs can be used. This is very beneficial in sequential FIR implementations but costly in parallel implementations. Because it has been shown in recent years that multiplier blocks using add, subtract, and shift operations (method a) consume considerably fewer logic resources than parallel distributed arithmetic (DA) implementations [5–7], the DA approach is not considered further. Due to the relatively large routing delays compared to the fast carry chain, a pipelined implementation of the adder graph is necessary to obtain the maximum speed of the FPGA [2, 5–10]. It was shown by Faust et al. [35] that the LUT-based approach (method b) is competitive with the adder graph method. Thus, pipelined circuits using the combination of methods (a) and (b) are investigated in this paper.
Contribution of this work
The main contribution of this article is the description of two novel optimization methods, one for the adder graph MCM problem including pipelining (the pipelined MCM problem [9]) and one for a combination of this method with a pipelined realization of the LUT-based method mentioned above. Each method is formulated as a Boolean integer linear program (BILP, or 0-1 ILP) and then reduced to a mixed integer linear program (MILP). Hence, if the MILP solver finds a solution in reasonable time, an optimal solution for the given cost model is found. To the best of our knowledge, this is the first time an optimal method for solving the pipelined MCM (PMCM) problem has been proposed.
The complexity of the adder graph MCM method heavily depends on the coefficient values, while the complexity of the LUT-based approach mainly depends on the input word size. Therefore, neither method delivers the better result in all cases. For this reason, a combination of both methods is proposed in this work by incorporating the LUT-based multipliers in the integer linear programming (ILP) formulation of the PMCM problem. Due to the low coefficient word size of typically 8 to 12 bits in image processing, a short convergence time of the ILP solver is very likely, which makes the proposed optimization an ideal candidate for image processing.
The remainder of this paper is organized as follows. The related work is discussed in the next section, followed by an introduction of the FIR architectures used for image processing. Then, an ILP formulation for the PMCM problem is described, which is later extended for additional LUT-based multiplication. Finally, results from the optimizations and FPGA synthesis are presented and discussed, followed by a conclusion.
Related work
MCM using additions, subtractions, and bit shifts
Different methods have been proposed over the years to realize constant multiplication using additions, subtractions, and bit shifts only. Finding the optimal configuration of these operations is known as the MCM problem, which has been an active research topic for almost two decades [3–33]. The objective is usually to minimize the number of adders and subtractors (shifts are assumed to be free, as they can be implemented using wires).
The MCM problem is NP-complete [4]. Hence, most of the proposed algorithms are heuristics, and less work has been directed toward optimal solutions. Early work was done by Bull and Horrocks [3], which was later extended by Dempster and Macleod to the modified Bull and Horrocks algorithm [14]. In the same work, the n-dimensional reduced adder graph (RAG-n) algorithm was proposed, which was one of the leading heuristics for years. Major improvements were achieved by Voronenko and Püschel with their cumulative benefit heuristic (H_{cub}) [4]. By spending a bit more algorithmic complexity and evaluating adder graph topologies up to a depth of three, they could reduce the required additions/subtractions by 20% on average compared to RAG-n. A competing approach based on difference graphs was proposed by Gustafsson [16]. It tends to be beneficial compared to H_{cub} when large coefficient sets and/or low coefficient word lengths are used but may be worse in other cases.
Many approaches use ILP formulations, for which optimal solvers exist. However, the search space is often significantly reduced due to the used number representation which leads to non-optimal results. Minimum signed digit (MSD) number systems like canonic signed digit (CSD) are often used as they have a reduced complexity compared to the binary representation [18]. In MSD, a number is coded using the digits {-1,0,1}, such that the number of non-zero digits is minimal and, hence, the number of partial products is reduced. A 0-1 ILP model that uses subexpressions of length two in the CSD number system (i. e., subexpressions with at most two non-zeros) was used by Yurdakul and Dündar [19]. A 0-1 ILP model which can be used for any number system was proposed by Flores et al. [18]. Results for binary, CSD and MSD were presented. This work was further extended by Aksoy et al. with additional delay constraints [20], low-level area models [21], and a heuristic variation [22].
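The CSD recoding mentioned above can be illustrated with a short sketch. It is not taken from the cited works; the function names `to_csd` and `nonzero_digits` and the least-significant-digit-first output order are assumptions made for this example.

```python
def to_csd(n: int) -> list[int]:
    """Return the canonic signed digit (CSD) digits of n, least significant first.

    Each digit is -1, 0 or 1, and no two adjacent digits are non-zero.
    """
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    digits = []
    while n != 0:
        if n % 2 == 0:
            digit = 0
        else:
            # n mod 4 == 1 gives digit +1, n mod 4 == 3 gives digit -1,
            # so that the remaining value is divisible by 4.
            digit = 2 - (n % 4)
        digits.append(digit)
        n = (n - digit) // 2
    return digits


def nonzero_digits(digits: list[int]) -> int:
    """Number of non-zero digits, i.e. partial products to add or subtract."""
    return sum(1 for d in digits if d != 0)


if __name__ == "__main__":
    c = 159  # one of the benchmark coefficients listed in the results section
    csd = to_csd(c)
    print(csd, nonzero_digits(csd), bin(c).count("1"))
```

Comparing `nonzero_digits(to_csd(c))` with `bin(c).count("1")` shows how much the signed digit representation reduces the number of partial products for a given coefficient.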
So far, the discussed publications did not result in globally optimal solutions as they use the reduced search space of a given number representation. Breadth-first search [23] and depth-first search [17] algorithms were proposed by Aksoy et al. to optimally solve the MCM problem in a graph-based way. The depth-first search is able to solve MCM instances in a reasonable time but cannot handle different constraints or cost metrics. Another interesting method to optimally solve the MCM problem was given by Gustafsson, who transferred the MCM problem to the problem of finding a Steiner hypertree in a directed hypergraph [24]. He used an optimal 0-1 ILP formulation which is very generic and can be flexibly adapted to different cost metrics (at adder or logic level) and different constraints (adder depth and fan-out). The main drawback is its computational complexity. Nevertheless, it can be used for small MCM instances or to find lower bounds [25] by relaxing the model to a continuous LP problem. A 0-1 ILP formulation for optimally solving the special case of minimum depth adder graphs in a graph-based fashion was recently proposed [26]. In this work, the search space of a graph-based search is compared to the MSD search space in terms of variables of the ILP. The graph-based search needs three times more variables for 8-bit coefficients and 18 times more variables for 13-bit coefficients compared to MSD.
Recently, a growing number of MCM algorithms with different objectives have been proposed. One objective is minimizing the power of the adder graph by reducing or minimizing the adder depth (AD) of each output, which is defined as the number of adder stages needed to compute a coefficient [26–29]. Limiting the maximum AD of all outputs can be used to find adder graphs with low delay [30]. If this delay is still too large, pipelining can be used to speed up the circuit [5, 8, 9, 31].
Pipelining plays a crucial role in high-performance adder graph realizations on FPGAs as they have relatively large routing delays compared to their fast carry chains. However, the addition of pipeline registers may significantly increase the complexity. An example of the pipelined adder graph (PAG) of Figure 1a using an as-soon-as-possible (ASAP) scheduling for placing the pipeline registers is shown in Figure 1b. Here, each rectangular node includes a pipeline register, i. e., nodes without any ‘+’ or ‘-’ operator are pure registers. Many resources can be saved by finding the optimal schedule of pipeline registers for a given adder graph. Compared to the ASAP schedule, 29% fewer pipelined operations and speedups of about 300% were achieved on average using a slice overhead of only 18% compared to the non-pipelined adder graph [8]. The PAG using the optimal schedule of the adder graph of Figure 1a is shown in Figure 1c. However, directly considering pipelining during the adder graph optimization can further reduce the resources, as demonstrated in Figure 1d. Heuristics for this kind of direct optimization were proposed recently [7, 9]. A reduction of pipelined operations by 10% compared to the optimal pipelined adder graphs [8] could be achieved by the reduced pipelined adder graph (RPAG) algorithm [9]. Slice reductions of 16% on average were reported using the best result out of three algorithms (C1^{+}, RSG Improved and a genetic algorithm) [7].
MCM using look-up tables
A totally different method for single constant multiplication which uses the look-up tables of FPGAs was proposed by Wirthlin [34]. The idea is to split the multiplication into smaller chunks, e. g., 4 bits for FPGAs with four-input LUTs, which can be directly realized using a single stage of LUTs. These LUT results have to be shifted and added to get the final product. Several techniques were proposed to minimize the number of redundant LUTs [34]. An extension of LUT-based multipliers to MCM was presented by Faust et al. [35], where identical LUTs are also shared between different constant multipliers. It was shown that the maximal number of required LUTs is far less than combinatorially possible. Their benchmark results include a comparison with adder graph MCM, which shows that their graph-based MinLD MCM algorithm [29] sometimes uses fewer resources and sometimes more resources than the LUT-based method for an input word size of 8 bits on an FPGA with four-input LUTs. As their method does not include pipelining, shorter delays could be achieved using the LUT-based MCM.
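The chunk-wise principle described above can be sketched in a few lines. This is only a behavioral model under stated assumptions: the input is treated as an unsigned word, one software table stands in for the bank of hardware LUTs (in hardware, each output bit of the table would be one LUT), and the names `build_product_table` and `lut_constant_multiply` are hypothetical.

```python
def build_product_table(c: int, chunk_bits: int = 4) -> list[int]:
    """Table of c * k for every possible chunk value k."""
    return [c * k for k in range(1 << chunk_bits)]


def lut_constant_multiply(x: int, c: int, word_bits: int = 8, chunk_bits: int = 4) -> int:
    """Multiply the unsigned input x by the constant c using table look-ups only."""
    if not 0 <= x < (1 << word_bits):
        raise ValueError("x must be an unsigned word_bits-bit value")
    table = build_product_table(c, chunk_bits)
    mask = (1 << chunk_bits) - 1
    result = 0
    for shift in range(0, word_bits, chunk_bits):
        chunk = (x >> shift) & mask        # select one chunk of the input word
        result += table[chunk] << shift    # shift the partial product and add it
    return result


if __name__ == "__main__":
    print(lut_constant_multiply(200, 159), 200 * 159)
```

For an 8-bit input and 4-bit chunks, two look-ups and one addition are needed; a 10- or 12-bit input requires three chunks and therefore two additions.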
Two-dimensional FIR filter architectures
where $u=\lfloor \frac{P}{2}\rfloor$ and $v=\lfloor \frac{Q}{2}\rfloor$ denote the center of the folding matrix H. Note that P and Q are often identical (matrix X is square) and odd.
Here, N _{ A } and ${N}_{A}^{T}$ are the adder costs before and after transposition, respectively. MCM is a special case of vector matrix multiplication with N _{ i } = 1, so if N _{ A,MCM} adders are needed for an optimal MCM adder graph with N unique coefficients, N _{ A,SOP} = N _{ A,MCM} + N - 1 adders are needed for the corresponding optimal SOP. Therefore, the eight additional adders shown in the transposed form in Figure 2b are exactly the additional adders needed to compute the SOP in Figure 2a. Hence, in terms of complexity, there is no difference between the direct and the transposed form. If we look at pipelined implementations of MCM blocks, the situation becomes worse. While transposing a pipelined MCM block still leads to a valid pipeline, the pipeline registers may no longer be located in the critical path between the adders. Therefore, we concentrate in the following on pipelined implementations of the MCM block, which can be directly incorporated in the transposed form, as shown in Figure 2b.
A totally different filter structure can be realized if the folding matrix is separable, i. e., matrix H can be separated into two vectors h _{1} and h _{2} with H = h _{1}·h _{2}. Then, the two-dimensional filter can be realized by cascading two one-dimensional filters, one for processing the rows and one for the columns. This reduces the PQ multiplications to P + Q multiplications [43]. If the filter cannot be separated, then there exist methods to decompose the filter into a sum of separable filters using singular value decomposition [45].
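The equivalence between a separable two-dimensional filter and two cascaded one-dimensional filters can be checked with a small sketch. All function names are hypothetical, the filters use correlation-style indexing without kernel flipping, and only the region where the kernel fully overlaps the image is computed; these are simplifying assumptions that do not affect the separability argument.

```python
def outer(h1, h2):
    """Folding matrix H with H[i][j] = h1[i] * h2[j]."""
    return [[a * b for b in h2] for a in h1]


def filter_2d(X, H):
    """Direct 2D filter: P*Q multiplications per output pixel."""
    P, Q = len(H), len(H[0])
    rows, cols = len(X), len(X[0])
    return [[sum(H[i][j] * X[m + i][n + j] for i in range(P) for j in range(Q))
             for n in range(cols - Q + 1)]
            for m in range(rows - P + 1)]


def filter_rows(X, h):
    """1D filter along each row: Q multiplications per output pixel."""
    Q = len(h)
    return [[sum(h[j] * row[n + j] for j in range(Q))
             for n in range(len(row) - Q + 1)]
            for row in X]


def filter_cols(X, h):
    """1D filter along each column: P multiplications per output pixel."""
    P = len(h)
    rows, cols = len(X), len(X[0])
    return [[sum(h[i] * X[m + i][n] for i in range(P)) for n in range(cols)]
            for m in range(rows - P + 1)]


if __name__ == "__main__":
    h1, h2 = [1, 2, 1], [1, 0, -1]          # example vectors, chosen arbitrarily
    X = [[(3 * r + 5 * c * c) % 17 for c in range(6)] for r in range(5)]
    direct = filter_2d(X, outer(h1, h2))
    cascaded = filter_cols(filter_rows(X, h2), h1)
    print(direct == cascaded)
```

With P = Q = 3, the direct form needs nine multiplications per pixel and the cascade needs six, in line with the reduction from PQ to P + Q stated above.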
However, only a fraction (which is typically less than one third) of the multipliers of the unseparated folding matrix have to be realized due to symmetric, zero, or power-of-two coefficients. Furthermore, in both MCM methods discussed above, many resources can be saved when intermediate computation results are shared between the constant multipliers. This is only possible if all multipliers have the same input which is not the case for separated filters. Therefore, we concentrate on the architecture shown in Figure 2b as it is generic (any folding matrix can be realized) and pipelined MCM blocks can be directly used.
Optimally solving the pipelined MCM problem
Pipelined MCM problem formulation
with configuration q = (l _{1},l _{2},r,sg), where ${l}_{1},{l}_{2},r\in {\mathbb{N}}_{0}$ are shift factors, the sign bit sg ∈ {0,1} denotes whether an addition or subtraction is performed and u and v are positive, odd input arguments. A valid configuration q is a combination of l _{1}, l _{2}, r, and sg such that the result is a positive odd integer.
The greatest effort during MCM optimization is finding the numerical values of non-output nodes, i. e., the values of all u and v which are not in the coefficient set. Once all these node values are found, it is easy to determine the configuration q of the corresponding adder graph, e. g., using the optimal part of RAG-n [14] or H_{cub}[4]. Since the same can be applied for PAG optimization, it is appropriate to define a set X _{ s } for each pipeline stage s, containing the node values of the corresponding stage. The pipeline sets for the PAG in Figure 1d are, e. g., X _{0} = {1}, X _{1} = {7,9}, and X _{2} = {11,43,65}. With this representation, we can formally define the PMCM problem:
Definition 1 (Pipelined MCM problem)
Given a set of positive target constants T = {t _{1},…,t _{ M }} and the number of pipeline stages S, find sets X _{1},…,X _{ S-1} ⊆ {1,3,…,x _{max}} with minimal area cost such that for all w∈X _{ s } for s = 1,…,S there exists a valid $\mathcal{A}$-configuration q such that $w={\mathcal{A}}_{q}(u,v)$ with u,v ∈ X _{ s-1}, X _{0} = {1} and X _{ S } = {odd(t) | t ∈ T ∖ {0}}, where odd(t) is the absolute value of t divided by 2 until it is odd.
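Definition 1 can be made concrete with a small validity checker. The sketch assumes the usual form of the $\mathcal{A}$-operation from the MCM literature [4], $|2^{l_1}u + (-1)^{sg}2^{l_2}v|\,2^{-r}$, because the defining equation is not reproduced in this text; the shift bound `max_shift` and all function names are likewise assumptions for illustration.

```python
from itertools import product


def odd(t: int) -> int:
    """Absolute value of t divided by 2 until it is odd (t must be non-zero)."""
    t = abs(t)
    if t == 0:
        raise ValueError("odd() is not defined for 0")
    while t % 2 == 0:
        t //= 2
    return t


def a_operation(u, v, l1, l2, r, sg):
    """Assumed A-operation: |2^l1 * u + (-1)^sg * 2^l2 * v| * 2^(-r).

    Returns None when the configuration is not valid, i.e. when the result
    is not a positive odd integer.
    """
    value = abs((u << l1) + (-1) ** sg * (v << l2))
    if value == 0 or value % (1 << r) != 0:
        return None
    value >>= r
    return value if value % 2 == 1 else None


def reachable(prev, max_shift=12):
    """All positive odd values obtainable from pairs in prev with one A-operation."""
    values = set()
    for u, v in product(prev, repeat=2):
        for l1, l2, sg in product(range(max_shift + 1), range(max_shift + 1), (0, 1)):
            value = abs((u << l1) + (-1) ** sg * (v << l2))
            if value != 0:
                values.add(odd(value))  # odd() applies the right shift r
    return values


def is_valid_pipeline(sets, max_shift=12):
    """Check that every element of X_s is computable from X_{s-1}."""
    return all(set(sets[s]) <= reachable(sets[s - 1], max_shift)
               for s in range(1, len(sets)))


if __name__ == "__main__":
    pipeline = [{1}, {7, 9}, {11, 43, 65}]  # pipeline sets quoted in the text
    print(is_valid_pipeline(pipeline))
```

The pipeline sets quoted for Figure 1d pass this check, e.g., 11 = 2·9 - 7, 43 = 4·9 + 7, and 65 = 8·7 + 9. A pure register is covered as well, since u = (u + u)/2 is a valid configuration with r = 1.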
The area cost of pipeline sets X _{0 … S } depends on the target architecture, and it will be discussed in the following sections.
FPGA cost of pipelined adder graphs
where B _{ x } is the input word size and $w={\mathcal{A}}_{q}(u,v)$.
ILP formulation for the pipelined adder graph problem
Hence, there is one variable ${a}_{w}^{s}$ for each stage s and each possible element w and one variable ${a}_{(u,v,w)}^{s}$ for each w and each possible pair (u,v), from which w can be computed with a single $\mathcal{A}$-operation.
Using these variables and auxiliary sets, we can define the ILP formulation for the PAG problem as follows:
The objective is to minimize the BLE cost of all required $\mathcal{A}$-operations. Constraint (C1) simply forces all target coefficients to be realized in the last pipeline stage. To realize an element ${a}_{w}^{s}$, one of the possible $\mathcal{A}$-operations to compute w must be available, which is constrained by (C2). The last constraint (C3) ensures that an $\mathcal{A}$-operation $w={\mathcal{A}}_{q}(u,v)$ can only be realized in stage s when u and v are both available in stage s - 1.
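One way to write an objective and constraints that match this description is sketched below. It is a reconstruction from the prose only, not a reproduction of the original formulation: the cost symbol $\mathrm{cost}(u,v,w)$, the candidate set $\mathcal{S}^{s}$ of possible elements in stage s, and the use of inequalities in (C2) and (C3) are assumptions.

```latex
\begin{align*}
\min \quad & \sum_{s=1}^{S} \; \sum_{(u,v,w) \in \mathcal{T}^{s}}
             \mathrm{cost}(u,v,w)\, a_{(u,v,w)}^{s} \\
\text{(C1)} \quad & a_{w}^{S} = 1
             && \forall\, w \in X_{S} \\
\text{(C2)} \quad & a_{w}^{s} \le \sum_{\substack{(u',v',w') \in \mathcal{T}^{s} \\ w' = w}}
             a_{(u',v',w')}^{s}
             && \forall\, w \in \mathcal{S}^{s},\ s = 1,\dots,S \\
\text{(C3)} \quad & a_{(u,v,w)}^{s} \le a_{u}^{s-1}, \qquad
             a_{(u,v,w)}^{s} \le a_{v}^{s-1}
             && \forall\, (u,v,w) \in \mathcal{T}^{s},\ s = 1,\dots,S \\
& a_{w}^{s} \in \{0,1\}, \qquad a_{1}^{0} = 1
\end{align*}
```

In this sketch, the fixed value $a_{1}^{0} = 1$ encodes X _{0} = {1}, so that every operation in the first stage can only use the input itself.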
In the formulation, we use the relaxed condition ${a}_{(u,v,w)}^{s}\ge 0$ instead of requiring ${a}_{(u,v,w)}^{s}\in \{0,1\}$ which results in an MILP formulation. Using continuous variables ${a}_{(u,v,w)}^{s}$ instead of binary ones reduces the runtime of the optimization substantially. This is possible since we can construct a binary solution that is feasible and minimal from the relaxed solution.
Proof
and since the objective function is minimized, the new solution is also minimal and the objective values are equal. If we do this construction for all variables ${a}_{w}^{s}=1$, we obtain an optimum solution where all variables are binary. □
A modified cost function can also be used to influence the number of adders within a pipeline stage. Instead of placing pipeline registers after each stage of adders, they could be placed behind multiple adder stages (e.g., every second or third stage). This would reduce the register resources for applications where a lower speed is sufficient or a lower latency is required. For that, the cost function for pure registers (as for ASICs) can be set to zero in each stage in which no registers should be placed. In the implementation, these registers are exchanged by wires.
While (C3a) is nearly identical to (C3), (C3b) allows the bypass of several stages using shift registers. Of course, these values are not accessible for intermediate stages.
Multiple constant multiplication using LUT multipliers
In this section, a LUT-based constant multiplier method is introduced, which is used to extend ILP Formulation 1.
LUT-based constant multiplication
LUT minimization techniques
- (a) Removal of constant LUTs, i. e., LUTs that are always ‘0’ or ‘1’ can be replaced by a constant.
- (b) Removal of LUTs which are identical to one of the inputs, i. e., the output can be connected to the corresponding input.
- (c) Removal of redundant LUTs, i. e., LUTs with identical content and identical inputs can be shared.
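The three minimization techniques can be illustrated by classifying the truth tables of all output bits. The sketch below models a single input chunk that is shared by several constant multipliers; the function names and the four category labels are hypothetical, and the sign handling and final adder stage of a real LUT multiplier are left out.

```python
def lut_tables(c: int, chunk_bits: int = 4):
    """Truth tables (one per output bit) of the product c * k for a chunk k."""
    n = 1 << chunk_bits
    out_bits = (c * (n - 1)).bit_length()
    return [tuple((c * k >> j) & 1 for k in range(n)) for j in range(out_bits)]


def classify_luts(constants, chunk_bits: int = 4):
    """Count LUTs per category for several constants fed by the same input chunk."""
    n = 1 << chunk_bits
    # Truth tables of the chunk's own input bits, used for technique (b).
    input_tables = {tuple((k >> i) & 1 for k in range(n)) for i in range(chunk_bits)}
    seen = set()
    counts = {"constant": 0, "wire": 0, "shared": 0, "needed": 0}
    for c in constants:
        for table in lut_tables(c, chunk_bits):
            if len(set(table)) == 1:
                counts["constant"] += 1   # (a) always '0' or always '1'
            elif table in input_tables:
                counts["wire"] += 1       # (b) identical to one of the inputs
            elif table in seen:
                counts["shared"] += 1     # (c) same content, same inputs
            else:
                counts["needed"] += 1     # a LUT that must actually be built
                seen.add(table)
    return counts


if __name__ == "__main__":
    print(classify_luts([3, 6]))
```

Only the tables counted as `needed` consume a LUT in this model; sharing between the constants 3 and 6 arises because the product 6k is the product 3k shifted by one bit.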
ILP formulation for the combined pipelined adder/LUT graph optimization
If a LUT multiplier is inserted in the adder graph to compute w from node v, the variable ${a}_{(0,v,w)}^{s}$ is set to one (the case u = 0 does not exist in ILP Formulation 1). The corresponding (0,v,w) triplets have to be added to ${\mathcal{T}}^{s}$. Now, ILP Formulation 1 can be extended by two constraints (C4) and (C5) (the remaining constraints remain nearly identical) and a modified objective function:
The additional constraints (C4) and (C5) are defined to incorporate LUT realizations during optimization. Constraint (C4) is similar to (C3): If a LUT multiplier is used to compute w in stage s from v in stage s - LD(w/v), the corresponding node v must be available in that stage. Constraint (C5) ensures that the corresponding LUTs are available in the correct pipeline stage to compute w from v. In first experiments, we found only cases where LUT multipliers directly compute target coefficients of the last stage. Hence, we decided to reduce the search space by evaluating the constraints (C4) and (C5) only for the target constants of the last stage s = S.
Results
Parameters of the used benchmark filters
Filter type | Filter size | B _{ c } | N _{uq} | S | Parameter | Unique odd coefficients |
---|---|---|---|---|---|---|
Gaussian | 3 × 3 | 8 | 3 | 2 | σ = 0.5 | 3 21 159 |
Gaussian | 5 × 5 | 12 | 4 | 3 | σ = 0.5 | 1 23 343 1,267 |
Laplacian | 3 × 3 | 8 | 3 | 2 | α = 0.2 | 5 21 107 |
Unsharp | 3 × 3 | 8 | 3 | 2 | α = 0.2 | 3 11 69 |
Unsharp | 3 × 3 | 12 | 3 | 3 | α = 0.2 | 43 171 1,109 |
Lowpass | 5 × 5 | 8 | 5 | 2 | f _{pass} = 0.2, f _{stop} = 0.4 | 11 33 35 53 103 |
Lowpass | 9 × 9 | 10 | 13 | 2 | f _{pass} = 0.2, f _{stop} = 0.4 | 1 5 7 25 31 63 65 67 73 97 117 165 303 |
Lowpass | 15 × 15 | 12 | 26 | 3 | f _{pass} = 0.2, f _{stop} = 0.4 | 1 5 7 13 17 19 21 27 41 43 45 53 61 79 93 101 103 113 133 137 199 331 333 613 1,097 1,197 |
Highpass | 5 × 5 | 8 | 5 | 2 | f _{stop} = 0, f _{pass} = 0.2 | 1 3 5 7 121 |
Highpass | 9 × 9 | 10 | 6 | 2 | f _{stop} = 0, f _{pass} = 0.2 | 1 3 5 7 11 125 |
Highpass | 15 × 15 | 12 | 13 | 2 | f _{stop} = 0, f _{pass} = 0.2 | 1 3 5 7 9 11 13 15 17 19 21 23 507 |
The BILP solver of Matlab is slow compared to the CPLEX optimizer from IBM [46] and does not provide a solver for MILP problems. CPLEX provides a text file interface with a convenient human-readable syntax (LP file format [47]). Hence, Matlab was used to generate the MILP models as LP files, and CPLEX was used for solving them. Then, Matlab was used to read in the CPLEX solution file in XML format [47] and to generate synthesizable VHDL code. All MCM blocks were optimized for an input word size B _{ x } of 8, 10, and 12 bits. The results are compared with the RPAG algorithm [9] and the LUT MCM method of [35], which was applied to the pipelined realization, as shown in Figure 5. The RPAG algorithm is a greedy heuristic. As the best local choice in a greedy algorithm does not necessarily lead to the best global solution, RPAG can randomly select one of the n best decisions instead. The best result out of several runs can then be taken to improve the optimization. RPAG was configured both for a single run (R = 1, pure greedy) and for R = 50 runs per MCM instance in which the locally first or second best decision was randomly selected. These results were considered as the state-of-the-art reference. The optimization was performed for the Virtex 6/7 FPGA architecture, and CPLEX was configured with a computation time limit of 8 hours.
Optimization results in terms of the number of basic logic elements (BLE) for the previous methods RPAG[9] using R iterations, a LUT MCM block[35] including pipelining and the proposed methods for optimal pipelined adder graphs using ILP Formulation 1 (Optimal PAG) and optimal pipelined adder/LUT graphs using ILP Formulation 2 (Optimal PALG), including the coefficients realized by LUTs (LUT Coeffs)
Filter type | Filter size | B _{ c } | B _{ x } | RPAG [9] (R = 1) | RPAG [9] (R = 50) | LUT MCM [35] | Optimal PAG | Optimal PALG | LUT coeffs |
---|---|---|---|---|---|---|---|---|---|
Gaussian | 3 × 3 | 8 | 8 | 63 | 58 | 56 | 58 | 56 | All |
Gaussian | 5 × 5 | 12 | 8 | 125 | 125 | 77 | 111 | 77 | All |
Laplacian | 3 × 3 | 8 | 8 | 79 | 61 | 54 | 61 | 54 | All |
Unsharp | 3 × 3 | 8 | 8 | 56 | 56 | 51 | 56 | 51 | All |
Unsharp | 3 × 3 | 12 | 8 | 112 | 107 | 64 | 91 | 64 | All |
Lowpass | 5 × 5 | 8 | 8 | 98 | 98 | 91 | 98 | 91 | All |
Lowpass | 9 × 9 | 10 | 8 | 235 | 221 | 192 | 221 | 192 | All |
Lowpass | 15 × 15 | 12 | 8 | 480 | 475 | 371 | ≤478 | 371 | All |
Highpass | 5 × 5 | 8 | 8 | 74 | 74 | 69 | 74 | 69 | All |
Highpass | 9 × 9 | 10 | 8 | 85 | 85 | 83 | 85 | 83 | All |
Highpass | 15 × 15 | 12 | 8 | 186 | 186 | 170 | 186 | 170 | All |
Gaussian | 3 × 3 | 8 | 10 | 73 | 68 | 71 | 68 | 68 | None |
Gaussian | 5 × 5 | 12 | 10 | 145 | 143 | 98 | 129 | 98 | All |
Laplacian | 3 × 3 | 8 | 10 | 93 | 71 | 69 | 71 | 68 | All |
Unsharp | 3 × 3 | 8 | 10 | 66 | 66 | 66 | 66 | 66 | None |
Unsharp | 3 × 3 | 12 | 10 | 130 | 123 | 80 | 105 | 80 | All |
Lowpass | 5 × 5 | 8 | 10 | 114 | 114 | 118 | 114 | 112 | 33 |
Lowpass | 9 × 9 | 10 | 10 | 271 | 255 | 276 | 255 | 254 | 117 |
Lowpass | 15 × 15 | 12 | 10 | 550 | 543 | 546 | ≤547 | ≤545 | None |
Highpass | 5 × 5 | 8 | 10 | 88 | 88 | 95 | 88 | 87 | 121 |
Highpass | 9 × 9 | 10 | 10 | 101 | 101 | 115 | 101 | 101 | None |
Highpass | 15 × 15 | 12 | 10 | 218 | 218 | 254 | 218 | 216 | 23 |
Gaussian | 3 × 3 | 8 | 12 | 83 | 78 | 143 | 78 | 78 | None |
Gaussian | 5 × 5 | 12 | 12 | 165 | 161 | 203 | 147 | 147 | None |
Laplacian | 3 × 3 | 8 | 12 | 107 | 81 | 142 | 81 | 81 | None |
Unsharp | 3 × 3 | 8 | 12 | 76 | 76 | 135 | 76 | 76 | None |
Unsharp | 3 × 3 | 12 | 12 | 148 | 139 | 173 | 119 | 119 | None |
Lowpass | 5 × 5 | 8 | 12 | 130 | 130 | 242 | 130 | 130 | None |
Lowpass | 9 × 9 | 10 | 12 | 307 | 289 | 589 | 289 | 289 | None |
Lowpass | 15 × 15 | 12 | 12 | 620 | 611 | 1203 | ≤620 | ≤620 | None |
Highpass | 5 × 5 | 8 | 12 | 102 | 102 | 192 | 102 | 102 | None |
Highpass | 9 × 9 | 10 | 12 | 117 | 117 | 232 | 117 | 117 | None |
Highpass | 15 × 15 | 12 | 12 | 250 | 250 | 523 | 250 | 250 | None |
Average | | | | 168.09 | 162.73 | 207.36 | 160.3 | 150.97 | |
Improvement to RPAG (R = 50) | | | | – | – | -27.43% | 1.49% | 7.23% | |
1. A single Virtex 6 BLE is able to realize two flip-flops (see Figure 3b), but sometimes, ISE uses one BLE to realize a single register and sometimes to realize two registers.
2. Sometimes, ISE maps a single flip-flop in a BLE that is also used for a full adder.
Synthesis results using the same benchmark instances as in Table 2 providing the actual BLEs as well as the maximum clock frequency f _{max} on a Virtex 6 FPGA
Filter type | Filter size | B _{ c } | B _{ x } | RPAG [9] (R = 50) BLE | RPAG [9] (R = 50) f _{max} (MHz) | LUT MCM [35] BLE | LUT MCM [35] f _{max} (MHz) | Optimal PAG BLE | Optimal PAG f _{max} (MHz) | Optimal PALG BLE | Optimal PALG f _{max} (MHz) |
---|---|---|---|---|---|---|---|---|---|---|---|
Gaussian | 3 × 3 | 8 | 8 | 58 | 713.8 | 58 | 720.5 | 58 | 730.5 | 58 | 720.5 |
Gaussian | 5 × 5 | 12 | 8 | 125 | 635.3 | 68 | 633.3 | 111 | 619.2 | 68 | 633.3 |
Laplacian | 3 × 3 | 8 | 8 | 61 | 749.1 | 52 | 718.9 | 61 | 701.8 | 52 | 718.9 |
Unsharp | 3 × 3 | 8 | 8 | 56 | 727.8 | 51 | 731.0 | 56 | 709.7 | 51 | 731.0 |
Unsharp | 3 × 3 | 12 | 8 | 107 | 636.9 | 59 | 701.3 | 91 | 657.9 | 59 | 701.3 |
Lowpass | 5 × 5 | 8 | 8 | 98 | 665.8 | 93 | 652.3 | 98 | 657.9 | 93 | 652.3 |
Lowpass | 9 × 9 | 10 | 8 | 221 | 593.5 | 186 | 573.1 | 221 | 582.4 | 186 | 573.1 |
Lowpass | 15 × 15 | 12 | 8 | 469 | 530.2 | 368 | 525.8 | 478 | 489.0 | 368 | 525.8 |
Highpass | 5 × 5 | 8 | 8 | 74 | 605.3 | 67 | 726.2 | 74 | 711.7 | 67 | 726.2 |
Highpass | 9 × 9 | 10 | 8 | 85 | 718.4 | 81 | 655.3 | 85 | 689.2 | 81 | 655.3 |
Highpass | 15 × 15 | 12 | 8 | 184 | 613.9 | 166 | 630.5 | 183 | 598.4 | 166 | 630.5 |
Gaussian | 3 × 3 | 8 | 10 | 64 | 713.3 | 70 | 776.4 | 68 | 745.7 | 64 | 666.7 |
Gaussian | 5 × 5 | 12 | 10 | 139 | 653.6 | 82 | 690.6 | 129 | 637.8 | 82 | 690.6 |
Laplacian | 3 × 3 | 8 | 10 | 71 | 682.6 | 59 | 745.2 | 71 | 656.6 | 59 | 745.2 |
Unsharp | 3 × 3 | 8 | 10 | 66 | 729.9 | 60 | 769.8 | 66 | 712.3 | 66 | 712.3 |
Unsharp | 3 × 3 | 12 | 10 | 118 | 682.1 | 69 | 690.1 | 103 | 605.3 | 69 | 690.1 |
Lowpass | 5 × 5 | 8 | 10 | 110 | 640.6 | 106 | 627.0 | 114 | 646.8 | 111 | 647.3 |
Lowpass | 9 × 9 | 10 | 10 | 255 | 557.4 | 225 | 574.4 | 251 | 602.1 | 252 | 528.3 |
Lowpass | 15 × 15 | 12 | 10 | 539 | 526.0 | 443 | 414.8 | 546 | 514.7 | 529 | 470.8 |
Highpass | 5 × 5 | 8 | 10 | 88 | 700.3 | 89 | 698.8 | 86 | 704.2 | 87 | 721.5 |
Highpass | 9 × 9 | 10 | 10 | 97 | 616.5 | 91 | 644.8 | 97 | 677.5 | 97 | 677.5 |
Highpass | 15 × 15 | 12 | 10 | 213 | 561.8 | 218 | 600.6 | 218 | 567.9 | 203 | 583.8 |
Gaussian | 3 × 3 | 8 | 12 | 78 | 646.4 | 129 | 688.2 | 78 | 644.3 | 78 | 644.3 |
Gaussian | 5 × 5 | 12 | 12 | 161 | 619.6 | 150 | 615.4 | 147 | 624.2 | 147 | 635.7 |
Laplacian | 3 × 3 | 8 | 12 | 81 | 684.0 | 104 | 672.0 | 81 | 724.1 | 81 | 724.1 |
Unsharp | 3 × 3 | 8 | 12 | 76 | 721.5 | 129 | 638.2 | 72 | 638.6 | 72 | 638.6 |
Unsharp | 3 × 3 | 12 | 12 | 132 | 614.3 | 122 | 620.0 | 119 | 638.6 | 119 | 638.6 |
Lowpass | 5 × 5 | 8 | 12 | 127 | 601.3 | 198 | 557.1 | 126 | 653.2 | 130 | 625.8 |
Lowpass | 9 × 9 | 10 | 12 | 285 | 603.1 | 412 | 606.4 | 278 | 581.7 | 273 | 602.4 |
Lowpass | 15 × 15 | 12 | 12 | 606 | 521.4 | 835 | 530.8 | 619 | 530.8 | 620 | 517.9 |
Highpass | 5 × 5 | 8 | 12 | 102 | 669.8 | 145 | 647.7 | 94 | 714.3 | 94 | 714.3 |
Highpass | 9 × 9 | 10 | 12 | 113 | 655.7 | 159 | 624.6 | 113 | 680.7 | 113 | 680.7 |
Highpass | 15 × 15 | 12 | 12 | 238 | 591.4 | 388 | 593.5 | 239 | 539.7 | 250 | 590.7 |
Average | | | | 160.52 | 641.9 | 167.64 | 645.28 | 158.52 | 642.08 | 146.82 | 648.94 |
Improvement to RPAG (R = 50) | | | | – | – | -4.4% | -0.5% | 1.25% | 0.03% | 8.53% | 1.09% |
Hence, the cost model overestimates the register cost by assuming that each single flip-flop occupies a BLE of its own. Due to the large shifts in the LUT multiplier architecture (see Figure 5), the least significant bits in the adder tree are pure flip-flops. Thus, the cost of a LUT multiplier is more likely to be overestimated, so that adders are selected instead. However, the synthesis results show that a BLE reduction of 1.3% and 8.5% on average compared to RPAG could be achieved for the optimal PAG and PALG method, respectively. The speed of the different architectures is very similar. In 75.8% and 78.8% of the cases for optimal PAG and PALG, respectively, the circuits were even faster than the maximum clock frequency of the embedded DSP48E1 multiplier (which is 600 MHz).
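The average improvements quoted above follow directly from the column averages of the two results tables. The helper below is only an arithmetic check; the function name is hypothetical, and the numbers are the average BLE counts copied from the tables, with RPAG (R = 50) as the reference.

```python
def improvement_percent(reference: float, value: float) -> float:
    """Relative reduction of `value` with respect to `reference`, in percent."""
    return 100.0 * (reference - value) / reference


# Average BLE counts copied from the tables (RPAG with R = 50 is the reference).
cost_model = {"reference": 162.73, "Optimal PAG": 160.30, "Optimal PALG": 150.97}
synthesis = {"reference": 160.52, "Optimal PAG": 158.52, "Optimal PALG": 146.82}

if __name__ == "__main__":
    for name, table in (("cost model", cost_model), ("synthesis", synthesis)):
        for method in ("Optimal PAG", "Optimal PALG"):
            gain = improvement_percent(table["reference"], table[method])
            print(f"{name:10s} {method:12s} {gain:5.2f}%")
```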
Conclusion and outlook
Two ILP formulations to optimize the multiple constant multiplication on FPGAs were presented and analyzed using synthesis experiments. The first one is a formulation of the PMCM problem, for which only heuristics exist [7, 9]. It was shown that better results are achievable for the low word size coefficients used. For most instances, the RPAG heuristic is also able to find an optimal solution (in 24 out of 32 cases). The second ILP formulation incorporates pipelined LUT-based multipliers. It was shown that this combined method outperforms the PMCM realization, in particular for a low input word size of 8 to 10 bits.
The ILP solver used, CPLEX, was able to find an optimal solution for most of the test instances within 8 h of computation time. Otherwise, the best feasible solution was close to that of the RPAG heuristic. Synthesis experiments were performed for the Xilinx Virtex 6 architecture, showing that compact and fast multiple constant multipliers are obtained. A resource reduction of 8.5% was achieved compared to the state of the art at approximately the same speed. Thus, the proposed optimizations can be beneficially used in many real-time video processing applications that involve FIR filters.
Future work could proceed in several directions. The examination of the mismatch between cost model and synthesis results could be used to improve the results. The slice flip-flops in adders that were left unused by synthesis could be utilized to implement pure registers. Furthermore, pure registers could be forced to use all eight available slice flip-flops. In the ILP model, this could be taken into account by reducing the cost of registers, e. g., by setting the cost of one flip-flop to half a BLE instead of a full BLE. The physical realization can be done using low-level placement constraints. However, this could reduce the performance, as fixed placements (even relative ones) may limit timing-driven place & route optimization.
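The effect of such a reduced register cost can be illustrated with a small Python sketch. It is not the cost model of this work: the word size formula (input word size plus the rounded-up base-2 logarithm of the constant), the function names, and the example adder graph are assumptions made only for illustration.

```python
from math import ceil, log2


def word_size(constant: int, input_bits: int) -> int:
    """Assumed word size in bits of constant * x for an input_bits-wide x."""
    return input_bits + ceil(log2(constant))


def total_cost(adders, registers, input_bits, register_weight=1.0):
    """Cost in BLEs: adders count fully, registers are scaled by register_weight."""
    adder_cost = sum(word_size(c, input_bits) for c in adders)
    register_cost = sum(word_size(c, input_bits) for c in registers)
    return adder_cost + register_weight * register_cost


# Hypothetical pipelined adder graph: constants produced by adders and
# constants that are only delayed by pure registers.
adders = [3, 21, 159]
registers = [1, 3]

# One BLE per flip-flop versus half a BLE per flip-flop.
full = total_cost(adders, registers, input_bits=8)
half = total_cost(adders, registers, input_bits=8, register_weight=0.5)
print(full, half)
```

With a register weight of 0.5, solutions that rely on pure registers become cheaper relative to solutions that insert additional adders, which is the shift in the optimum that the proposed cost change aims for.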
Another extension could be the optimization of adder graphs containing ternary adders, i. e., adders with three inputs. Modern FPGAs provide means to implement ternary adders with the same number of slices/ALMs as a two-input adder of equal output word size, but at reduced speed due to a longer critical path [49, 50]. The ILP formulations could be extended in that direction by using quadruplets instead of the (u,v,w) triplets, which allows modeling three inputs instead of the two inputs u and v. However, this would substantially increase the search space, so it has to be investigated whether a solution can be found in an acceptable runtime; this is left open for future research.
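How quickly the search space grows can be estimated with a brute-force count. The Python sketch below is a simplified illustration, not the ILP formulation of this work: it ignores pipeline stages, restricts all values to odd constants up to a small bound, and models an adder as a signed sum of shifted inputs whose result is reduced to its odd part. The function names and the bounds `max_value` and `max_shift` are assumptions.

```python
from itertools import combinations_with_replacement, product


def odd_part(n: int) -> int:
    """Remove all factors of two from n; returns 0 for n == 0."""
    if n == 0:
        return 0
    while n % 2 == 0:
        n //= 2
    return n


def reachable(inputs, max_value, max_shift):
    """Odd constants obtainable as |sum(+-2^s * x)| over the given inputs."""
    results = set()
    shifts = range(max_shift + 1)
    for shift_combo in product(shifts, repeat=len(inputs)):
        for signs in product((1, -1), repeat=len(inputs)):
            total = sum(sg * (x << s) for sg, x, s in zip(signs, inputs, shift_combo))
            w = odd_part(abs(total))
            if 0 < w <= max_value:
                results.add(w)
    return results


def count_tuples(n_inputs, max_value, max_shift):
    """Number of (inputs..., w) tuples for adders with n_inputs inputs."""
    odd_values = range(1, max_value + 1, 2)
    return sum(
        len(reachable(inputs, max_value, max_shift))
        for inputs in combinations_with_replacement(odd_values, n_inputs)
    )


triplets = count_tuples(2, max_value=15, max_shift=4)     # (u, v, w)
quadruplets = count_tuples(3, max_value=15, max_shift=4)  # three inputs and w
print(triplets, quadruplets)
```

Even for these small bounds the number of quadruplets exceeds the number of triplets, which indicates why the runtime of an extended ILP formulation would have to be examined.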
Appendices
Appendix 1. Convolution matrices of the benchmark filters
Filter | Convolution matrix |
---|---|
Gaussian 3 × 3, $B_c$ = 8 bits | $\left(\begin{array}{lll}3& 21& 3\\ 21& 159& 21\\ 3& 21& 3\end{array}\right)$ |
Gaussian 5 × 5, $B_c$ = 12 bits | $\left(\begin{array}{lllll}0& 0& 1& 0& 0\\ 0& 46& 343& 46& 0\\ 1& 343& 2534& 343& 1\\ 0& 46& 343& 46& 0\\ 0& 0& 1& 0& 0\end{array}\right)$ |
Laplacian 3 × 3, $B_c$ = 8 bits | $\left(\begin{array}{lll}5& 21& 5\\ 21& -107& 21\\ 5& 21& 5\end{array}\right)$ |
Unsharp 3 × 3, $B_c$ = 8 bits | $\left(\begin{array}{lll}-3& -11& -3\\ -11& 69& -11\\ -3& -11& -3\end{array}\right)$ |
Unsharp 3 × 3, $B_c$ = 12 bits | $\left(\begin{array}{lll}-43& -171& -43\\ -171& 1109& -171\\ -43& -171& -43\end{array}\right)$ |
Lowpass 5 × 5, $B_c$ = 8 bits | $\left(\begin{array}{lllll}22& 88& 132& 88& 22\\ 88& 140& 103& 140& 88\\ 132& 103& 106& 103& 132\\ 88& 140& 103& 140& 88\\ 22& 88& 132& 88& 22\end{array}\right)$ |
Lowpass 9 × 9, $B_c$ = 10 bits | $\left(\begin{array}{lllllllll}-1& -7& -25& -50& -62& -50& -25& -7& -1\\ -7& -25& -10& 73& 130& 73& -10& -25& -7\\ -25& -10& 117& 165& 126& 165& 117& -10& -25\\ -50& 73& 165& 194& 303& 194& 165& 73& -50\\ -62& 130& 126& 303& 268& 303& 126& 130& -62\\ -50& 73& 165& 194& 303& 194& 165& 73& -50\\ -25& -10& 117& 165& 126& 165& 117& -10& -25\\ -7& -25& -10& 73& 130& 73& -10& -25& -7\\ -1& -7& -25& -50& -62& -50& -25& -7& -1\end{array}\right)$ |
Lowpass 15 × 15, $B_c$ = 12 bits | $\left(\begin{array}{lllllllllllllll}0& 0& 1& 5& 13& 27& 40& 45& 40& 27& 13& 5& 1& 0& 0\\ 0& 2& 7& 13& 2& -41& -103& -133& -103& -41& 2& 13& 7& 2& 0\\ 1& 7& 10& -21& -93& -137& -101& -64& -101& -137& -93& -21& 10& 7& 1\\ 5& 13& -21& -106& -122& -41& -17& -43& -17& -41& -122& -106& -21& 13& 5\\ 13& 2& -93& -122& -8& 79& 199& 304& 199& 79& -8& -122& -93& 2& 13\\ 27& -41& -137& -41& 79& 333& 613& 662& 613& 333& 79& -41& -137& -41& 27\\ 40& -103& -101& -17& 199& 613& 904& 1097& 904& 613& 199& -17& -101& -103& 40\\ 45& -133& -64& -43& 304& 662& 1097& 1197& 1097& 662& 304& -43& -64& -133& 45\\ 40& -103& -101& -17& 199& 613& 904& 1097& 904& 613& 199& -17& -101& -103& 40\\ 27& -41& -137& -41& 79& 333& 613& 662& 613& 333& 79& -41& -137& -41& 27\\ 13& 2& -93& -122& -8& 79& 199& 304& 199& 79& -8& -122& -93& 2& 13\\ 5& 13& -21& -106& -122& -41& -17& -43& -17& -41& -122& -106& -21& 13& 5\\ 1& 7& 10& -21& -93& -137& -101& -64& -101& -137& -93& -21& 10& 7& 1\\ 0& 2& 7& 13& 2& -41& -103& -133& -103& -41& 2& 13& 7& 2& 0\\ 0& 0& 1& 5& 13& 27& 40& 45& 40& 27& 13& 5& 1& 0& 0\end{array}\right)$ |
Highpass 5 × 5, $B_c$ = 8 bits | $\left(\begin{array}{lllll}-2& -7& -10& -7& -2\\ -7& -3& 8& -3& -7\\ -10& 8& 121& 8& -10\\ -7& -3& 8& -3& -7\\ -2& -7& -10& -7& -2\end{array}\right)$ |
Highpass 9 × 9, $B_c$ = 10 bits | $\left(\begin{array}{lllllllll}0& -2& -6& -11& -14& -11& -6& -2& 0\\ -2& -7& -10& -3& 4& -3& -10& -7& -2\\ -6& -10& -1& -2& -11& -2& -1& -10& -6\\ -11& -3& -2& -11& -1& -11& -2& -3& -11\\ -14& 4& -11& -1& 500& -1& -11& 4& -14\\ -11& -3& -2& -11& -1& -11& -2& -3& -11\\ -6& -10& -1& -2& -11& -2& -1& -10& -6\\ -2& -7& -10& -3& 4& -3& -10& -7& -2\\ 0& -2& -6& -11& -14& -11& -6& -2& 0\end{array}\right)$ |
Highpass 15 × 15, $B_c$ = 12 bits | $\left(\begin{array}{lllllllllllllll}0& 0& 0& -1& -3& -6& -8& -10& -8& -6& -3& -1& 0& 0& 0\\ 0& 0& -2& -5& -8& -8& -5& -4& -5& -8& -8& -5& -2& 0& 0\\ 0& -2& -6& -9& -8& -6& -10& -13& -10& -6& -8& -9& -6& -2& 0\\ -1& -5& -9& -8& -8& -14& -14& -11& -14& -14& -8& -8& -9& -5& -1\\ -3& -8& -8& -8& -15& -15& -15& -19& -15& -15& -15& -8& -8& -8& -3\\ -6& -8& -6& -14& -15& -17& -21& -18& -21& -17& -15& -14& -6& -8& -6\\ -8& -5& -10& -14& -15& -21& -19& -23& -19& -21& -15& -14& -10& -5& -8\\ -10& -4& -13& -11& -19& -18& -23& 2028& -23& -18& -19& -11& -13& -4& -10\\ -8& -5& -10& -14& -15& -21& -19& -23& -19& -21& -15& -14& -10& -5& -8\\ -6& -8& -6& -14& -15& -17& -21& -18& -21& -17& -15& -14& -6& -8& -6\\ -3& -8& -8& -8& -15& -15& -15& -19& -15& -15& -15& -8& -8& -8& -3\\ -1& -5& -9& -8& -8& -14& -14& -11& -14& -14& -8& -8& -9& -5& -1\\ 0& -2& -6& -9& -8& -6& -10& -13& -10& -6& -8& -9& -6& -2& 0\\ 0& 0& -2& -5& -8& -8& -5& -4& -5& -8& -8& -5& -2& 0& 0\\ 0& 0& 0& -1& -3& -6& -8& -10& -8& -6& -3& -1& 0& 0& 0\end{array}\right)$ |
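Because of their symmetry, the matrices above contain only a few distinct coefficients. A preprocessing step commonly used in the MCM literature removes signs, zeros, and factors of two, since these cost no adders, and keeps each remaining odd constant once. The following Python sketch shows that reduction for two of the matrices; the function name is hypothetical, and the exact preprocessing used for the benchmark is an assumption here.

```python
def odd_fundamentals(matrix):
    """Unique odd positive constants after removing signs, zeros and powers of two."""
    result = set()
    for row in matrix:
        for c in row:
            c = abs(c)
            if c == 0:
                continue  # a zero coefficient needs no multiplier
            while c % 2 == 0:
                c //= 2   # a factor of two is a wire shift in hardware
            result.add(c)
    return sorted(result)


gaussian_3x3 = [[3, 21, 3], [21, 159, 21], [3, 21, 3]]
highpass_5x5 = [
    [-2, -7, -10, -7, -2],
    [-7, -3, 8, -3, -7],
    [-10, 8, 121, 8, -10],
    [-7, -3, 8, -3, -7],
    [-2, -7, -10, -7, -2],
]

print(odd_fundamentals(gaussian_3x3))  # [3, 21, 159]
print(odd_fundamentals(highpass_5x5))  # [1, 3, 5, 7, 121]
```

The 25 coefficients of the 5 × 5 highpass filter thus reduce to five odd constants, of which the constant 1 requires no adder at all.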
Appendix 2. Adder graphs results of PMCM optimization
References
- Bovik A: The Essential Guide to Image Processing. Academic Press, Waltham; 2009.
- Kumm M, Zipf P: Hybrid multiple constant multiplication for FPGAs. In International Conference on Electronics, Circuits and Systems (ICECS). IEEE, Piscataway; 2012:556-559.
- Bull DR, Horrocks DH: Primitive operator digital filters. IEE Proc. G: Circuits, Devices Syst 1991, 138(3):401-412. 10.1049/ip-g-2.1991.0066
- Voronenko Y, Püschel M: Multiplierless multiple constant multiplication. ACM Trans. Algorithms (TALG) 2007, 3(2):1-38.
- Meyer-Baese U, Chen J, Chang CH, Dempster AG: A comparison of pipelined RAG-n and DA FPGA-based multiplierless filters. In Asia Pacific Conference on Circuits and Systems (APCCAS). IEEE, Piscataway; 2006:1555-1558.
- Mirzaei S, Kastner R, Hosangadi A: Layout aware optimization of high speed fixed coefficient FIR filters for FPGAs. Int. J. Reconfigurable Comput 2010, 3: 1-17.
- Meyer-Baese U, Botella G, Romero D, Kumm M: Optimization of high speed pipelining in FPGA-based FIR filter design using genetic algorithm. In SPIE Defense Security+Sensing, Volume 8401. SPIE, Baltimore; 2012:1-12.
- Kumm M, Zipf P: High speed low complexity FPGA-based FIR filters using pipelined adder graphs. In International Conference on Field Programmable Technology (ICFPT). IEEE, Piscataway; 2011:1-4.
- Kumm M, Faust M, Zipf P, Chang CH: Pipelined adder graph optimization for high speed multiple constant multiplication. In International Symposium on Circuits and Systems (ISCAS). IEEE, Piscataway; 2012:49-52.
- Kumm M, Liebisch K, Zipf P: Reduced complexity single and multiple constant multiplication in floating point precision. In International Conference on Field Programmable Logic and Applications (FPL). IEEE, Piscataway; 2012:255-261.
- Hartley R: Subexpression sharing in filters using canonic signed digit multipliers. IEEE Trans. Circuits Syst. II: Analog Digit. Signal Process 1996, 43(10):677-688. 10.1109/82.539000
- Mirzaei S, Hosangadi A, Kastner R: FPGA implementation of high speed FIR filters using add and shift method. In International Conference on Computer Design (ICCD). IEEE, Piscataway; 2006:308-313.
- Imran M, Khursheed K, O’Nils M: On the number representation in sub-expression sharing. In International Conference on Signals and Electronic Systems (ICSES). IEEE, Piscataway; 2010:17-20.
- Dempster AG, Macleod MD: Constant integer multiplication using minimum adders. IEE Proc. Circuits, Devices Syst 1994, 141(5):407-413. 10.1049/ip-cds:19941191
- Dempster AG, Macleod MD: Use of minimum-adder multiplier blocks in FIR digital filters. IEEE Trans. Circuits Syst. II: Analog Digit. Signal Process 1995, 42(9):569-577. 10.1109/82.466647
- Gustafsson O: A difference based adder graph heuristic for multiple constant multiplication problems. In International Symposium on Circuits and Systems (ISCAS). IEEE, Piscataway; 2007:1097-1100.
- Aksoy L, Günes E, Flores P: Search algorithms for the multiple constant multiplications problem: exact and approximate. Microprocessors and Microsystems 2010, 34(5):151-162. 10.1016/j.micpro.2009.10.001
- Flores P, Monteiro J, Costa E: An exact algorithm for the maximal sharing of partial terms in multiple constant multiplications. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE Computer Society, Washington; 2005:13-16.
- Yurdakul A, Dündar G: Multiplierless realization of linear DSP transforms by using common two-term expressions. J. VLSI Signal Process 1999, 22: 163-172. 10.1023/A:1008125221674
- Aksoy L, Costa E, Flores P, Monteiro J: Optimization of area under a delay constraint in digital filter synthesis using SAT-based integer linear programming. In 43rd ACM/IEEE Design Automation Conference (DAC). IEEE, Piscataway; 2006:669-674.
- Aksoy L, Costa E, Flores P, Monteiro J: Optimization of area in digital FIR filters using gate-level metrics. In 44th ACM/IEEE Design Automation Conference (DAC). IEEE; 2007:420-423.
- Aksoy L, da Costa E, Flores P, Monteiro J: Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications. IEEE Trans. Computer-Aided Design Integrated Circuits Syst 2008, 27(6):1013-1026.
- Aksoy L, Günes E, Flores P: An exact breadth-first search algorithm for the multiple constant multiplications problem. In NORCHIP. IEEE, Piscataway; 2008:41-46.
- Gustafsson O: Towards optimal multiple constant multiplication: a hypergraph approach. In 42nd Asilomar Conference on Signals, Systems and Computers. IEEE, Piscataway; 2008:1805-1809.
- Gustafsson O: Lower bounds for constant multiplication problems. IEEE Trans. Circuits Syst. II: Express Briefs 2007, 54(11):974-978.
- Aksoy L, Costa E, Flores P, Monteiro J: Design of low-power multiple constant multiplications using low-complexity minimum depth operations. In Proceedings of the 21st Edition of the Great Lakes Symposium on VLSI. ACM, New York; 2011:79-84.
- Dempster A, Demirsoy S, Kale I: Designing multiplier blocks with low logic depth. In International Symposium on Circuits and Systems (ISCAS). IEEE, Piscataway; 2002:773-776.
- Johansson K: Low Power and Low Complexity Shift-and-Add Based Computations. PhD thesis, Linköping University, Department of Electrical Engineering; 2008.
- Faust M, Chang CH: Minimal logic depth adder tree optimization for multiple constant multiplication. In International Symposium on Circuits and Systems (ISCAS). IEEE, Piscataway; 2010:457-460.
- Kang HJ, Park IC: FIR filter synthesis algorithms for minimizing the delay and the number of adders. IEEE Trans. Circuits Syst. II: Analog Digit. Signal Process 2001, 48(8):770-777. 10.1109/82.959867
- Aksoy L, Costa E, Flores P, Monteiro J: Optimization of gate-level area in high throughput multiple constant multiplications. In European Conference on Circuit Theory and Design (ECCTD). IEEE, Piscataway; 2011:609-612.
- Gustafsson O, Dempster A: On the use of multiple constant multiplication in polyphase FIR filters and filter banks. 2004.
- Aksoy L, Costa E, Flores P, Monteiro J: Design of low-complexity digital finite impulse response filters on FPGAs. 2012.
- Wirthlin M: Constant coefficient multiplication using look-up tables. J. VLSI Signal Process 2004, 36: 7-15.
- Faust M, Chang CH: Bit-parallel multiple constant multiplication using look-up tables on FPGA. In International Symposium on Circuits and Systems (ISCAS). IEEE, Piscataway; 2011:657-660.
- Croisier A, Esteban DJ, Levilion ME, Riso V: Digital filter for PCM encoded signals. US Patent No. 3777130 (1973).
- Zohar S: New hardware realizations of nonrecursive digital filters. IEEE Trans. Comput 1973, 22(4):328-338.
- Peled A, Liu B: A new hardware realization of digital filters. IEEE Trans. Acoustics, Speech Signal Process 1974, 22(6):456-462. 10.1109/TASSP.1974.1162619
- White SA: Applications of distributed arithmetic to digital signal processing: a tutorial review. IEEE ASSP Mag 1989, 6(3):4-19.
- Sen W, Bin T, Jim Z: Distributed arithmetic for FIR filter design on FPGA. In International Conference on Communications, Circuits and Systems (ICCCAS). IEEE, Piscataway; 2007:620-623.
- Meher P, Chandrasekaran S, Amira A: FPGA realization of FIR filters by efficient and flexible systolization using distributed arithmetic. IEEE Trans. Signal Process 2008, 56(7):3009-3017.
- Kumm M, Möller K, Zipf P: Reconfigurable FIR filter using distributed arithmetic on FPGAs. In International Symposium on Circuits and Systems (ISCAS). IEEE, Piscataway; (accepted for publication in 2013).
- Bailey D: Design for Embedded Image Processing on FPGAs. Wiley-IEEE Press, New York; 2011.
- Willson A: Desensitized half-band filters. IEEE Trans. Circuits Syst. I: Regular Papers 2010, 57: 152-167.
- Lu WS, Wang HP, Antoniou A: Design of two-dimensional FIR digital filters by using the singular-value decomposition. IEEE Trans. Circuits Syst 1990, 37: 35-4. 10.1109/31.45689
- IBM Inc: IBM ILOG CPLEX Optimizer. http://www.ilog.com/products/cplex. Accessed 16 April 2013.
- IBM Inc: IBM ILOG CPLEX V12.1 - File Formats Supported by CPLEX. 2009.
- Lavin C, Padilla M, Lamprecht J, Lundrigan P, Nelson B, Hutchings B: RapidSmith: do-it-yourself CAD tools for Xilinx FPGAs. In International Conference on Field Programmable Logic and Applications (FPL). IEEE; 2011:349-355.
- Simkins JM, Philofsky BD: Structures and methods for implementing ternary adders/subtractors in programmable logic devices. US Patent No. 7274211, Xilinx Inc. (2006).
- Baeckler G, Langhammer M, Schleicher J, Yuan R: Logic cell supporting addition of three binary words. US Patent No. 7565388, Altera Corp. (2009).
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.