FIR filter optimization for video processing on FPGAs
EURASIP Journal on Advances in Signal Processing volume 2013, Article number: 111 (2013)
Abstract
Two-dimensional finite impulse response (FIR) filters are an important component in many image and video processing systems. The processing of complex video applications in real time requires high computational power, which can be provided by field programmable gate arrays (FPGAs) due to their inherent parallelism. The most resource-intensive components in computing FIR filters are the multiplications of the folding operation. This work proposes two optimization techniques for high-speed implementations of the required multiplications with the least possible number of FPGA components. Both methods use integer linear programming formulations which can be optimally solved by standard solvers. In the first method, a formulation for the pipelined multiple constant multiplication problem is presented. In the second method, multiplication structures based on lookup tables are also taken into account. Due to the low coefficient word size of typically 8 to 12 bits in video processing filters, an optimal solution is found for most of the filters in the benchmark used. A complexity reduction of 8.5% for a Xilinx Virtex 6 FPGA could be achieved compared to state-of-the-art heuristics.
Introduction
Two-dimensional linear filters with finite impulse response (FIR) are one of the most fundamental operations used in image and video processing. They are used, e. g., in applications such as contrast improvement, denoising, sharpening, target matching, and feature enhancement [1]. Compared to infinite impulse response filters, FIR filters are strictly stable, and high-throughput implementations are easily possible using pipelining as no recursions are involved. However, they are computationally expensive as many multiply-accumulate (MAC) operations are necessary for each pixel of the resulting image. While this is very demanding for a microprocessor or digital signal processor, the inherent parallelism of field programmable gate arrays (FPGAs) can be used to accelerate the FIR operation.
Modern FPGAs directly incorporate embedded multipliers or DSP blocks which also include pre- and post-adders for MAC operations. Xilinx's DSP blocks (Xilinx Inc., San Jose, CA, USA) of Virtex 5/6, Spartan 6, and the 7 series FPGAs provide 18×25-bit signed multipliers. More flexible are the variable precision DSP blocks of the latest FPGAs of Altera, the Stratix V, Cyclone V, and Arria V devices (Altera, San Jose, CA, USA). Each DSP block can be configured as three independent 9×9-bit multipliers, two independent 16×16-bit, 15×17-bit, or 14×18-bit multipliers, or a single 18×36-bit or 27×27-bit multiplier.
However, embedded multipliers and DSP blocks are limited resources even on modern low-cost FPGAs, and they may have a higher power consumption compared to constant multiplication using the carry-chain resources [2]. Especially in image processing, embedded multipliers are often underutilized because of the small word lengths used. Typically, only 8 bits per color and 10 bits for a luminance representation are used.
In most filter applications, the coefficients are fixed, which can be used to reduce the complexity of the multiplication. Furthermore, partial results can be shared inside a single multiplier or between multipliers of different constants to reduce hardware resources. Different methods have been proposed over the years for such multiple constant multiplications (MCM):
 (a) adder graph methods, which realize the constant multiplications using additions, subtractions, and bit shifts
 (b) lookup table (LUT)-based constant multipliers
 (c) distributed arithmetic (DA)
In method (a), constant multiplications are realized using additions, subtractions, and bit shifts only. These operations form a so-called adder graph, so this method is called the adder graph MCM method in the following. It was originally developed for software or VLSI applications [3] but also maps well to the fast carry chains of FPGAs. In method (b), the input word is split into smaller chunks that fit into the input word size of the FPGA LUTs. These LUT results are shifted and added afterward to form the multiplication result. LUTs and adders are also used in method (c), but there the folding equation of the FIR filter is rearranged in such a way that identical LUTs can be used. This is very beneficial in sequential FIR implementations but costly in parallel implementations. Because it was shown in recent years that multiplier blocks using add, subtract, and shift operations (method (a)) consume considerably less logic resources compared to parallel DA implementations [5–7], the DA approach is not considered further. Due to the relatively large routing delays compared to the fast carry chain, a pipelined implementation of the adder graph is necessary to obtain the maximum speed of the FPGA [2, 5–10]. It was shown by Faust et al. [35] that the LUT-based approach (method (b)) is competitive with the adder graph method. Thus, pipelined circuits using the combination of methods (a) and (b) are investigated in this paper.
Contribution of this work
The main contribution of this article is the description of two novel optimization methods, one for the adder graph MCM problem including pipelining (the pipelined MCM problem [9]) and one for a combination of this method with a pipelined realization of the LUT-based method mentioned above. Each method is formulated as a boolean integer linear program (BILP, or 0-1 ILP) and then relaxed to a mixed integer linear program (MILP). Hence, if the MILP solver finds a solution in reasonable time, an optimal solution for the given cost model is found. To the best of our knowledge, this is the first time an optimal method for solving the pipelined MCM (PMCM) problem is proposed.
The complexity of the adder graph MCM method heavily depends on the coefficient values, while the complexity of the LUT-based approach mainly depends on the input word size. Therefore, sometimes one method or the other delivers better results. For this reason, a combination of both methods is proposed in this work by incorporating the LUT-based multipliers in the integer linear programming (ILP) formulation of the PMCM problem. Due to the low coefficient word size of typically 8 to 12 bits in image processing, a short convergence time of the ILP solver is very likely, which makes the proposed optimization an ideal candidate for image processing.
The remainder of this paper is organized as follows. The related work is discussed in the next section, followed by an introduction of the FIR architectures used for image processing. Then, an ILP formulation for the PMCM problem is described, which is later extended for additional LUT-based multiplication. Finally, results from the optimizations and FPGA synthesis are presented and discussed, followed by a conclusion.
Related work
MCM using additions, subtractions, and bit shifts
Different methods have been proposed over the years to realize constant multiplication using additions, subtractions, and bit shifts only. Finding the optimal configuration of these operations is known as the MCM problem, which has been an active research topic for almost two decades [3–33]. The objective is usually defined by minimizing the number of adders and subtractors (shifts are assumed to be free, as they can be implemented using wires).
An example adder graph which realizes a multiplier block with the constants of the set {44,130,172} is shown in Figure 1a. Each node in the graph corresponds to an adder or subtractor, indicated by '+' or '−'. The numeric node value represents the realized multiple of the input, i. e., node '1' corresponds to the input of the MCM block. To have a unique representation, all node values are defined to be odd. They can be made even by a simple bit shift as shown at the output. All edge weights represent left shifts, e. g., node '5' is realized by left shifting the input x by 2 bits and adding the unshifted input: 2^{2} x + 2^{0} x = 5 x. Right shifts are represented by negative edge weights.
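Such a shift-and-add multiplier block is easy to check in software. The sketch below realizes the set {44,130,172} using one valid decomposition with the intermediate fundamentals 7 and 9 (chosen for illustration; the exact nodes in Figure 1 may differ):

```python
def mcm_44_130_172(x):
    # Stage 1: fundamentals built directly from the input
    n7 = (x << 3) - x        #  7x = 8x - x
    n9 = (x << 3) + x        #  9x = 8x + x
    # Stage 2: fundamentals built from stage-1 nodes
    n11 = (n9 << 1) - n7     # 11x = 18x - 7x
    n43 = (n9 << 2) + n7     # 43x = 36x + 7x
    n65 = (n7 << 3) + n9     # 65x = 56x + 9x
    # Even outputs are odd fundamentals followed by output left shifts
    return n11 << 2, n65 << 1, n43 << 2   # 44x, 130x, 172x
```

Only additions, subtractions, and wires (shifts) are used, which is exactly the resource profile the MCM problem minimizes.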
The MCM problem is NP-complete [4]. Hence, most of the proposed algorithms are heuristics, and less work was directed toward optimal solutions. Early work was done by Bull and Horrocks [3], which was later extended by Dempster and Macleod to the modified Bull and Horrocks algorithm [14]. In the same work, the n-dimensional reduced adder graph (RAGn) algorithm was proposed, which was one of the leading heuristics for years. Major improvements could be achieved by the work of Voronenko and Püschel with their cumulative benefit heuristic (H_{cub}) [4]. By spending a bit more algorithmic complexity and evaluating adder graph topologies up to a depth of three, they could reduce the required additions/subtractions by 20% on average compared to RAGn. A competing approach based on difference graphs was proposed by Gustafsson [16]. It tends to be beneficial compared to H_{cub} in case large coefficient sets and/or low coefficient word lengths are used but may be worse in other cases.
Many approaches use ILP formulations, for which optimal solvers exist. However, the search space is often significantly reduced due to the used number representation, which leads to non-optimal results. Minimum signed digit (MSD) number systems like canonic signed digit (CSD) are often used as they have a reduced complexity compared to the binary representation [18]. In MSD, a number is coded using the digits {−1,0,1} such that the number of nonzero digits is minimal and, hence, the number of partial products is reduced. A 0-1 ILP model that uses subexpressions of length two in the CSD number system (i. e., subexpressions with at most two nonzeros) was used by Yurdakul and Dündar [19]. A 0-1 ILP model which can be used for any number system was proposed by Flores et al. [18]. Results for binary, CSD, and MSD were presented. This work was further extended by Aksoy et al. with additional delay constraints [20], low-level area models [21], and a heuristic variation [22].
So far, the discussed publications did not result in globally optimal solutions as they use the reduced search space of a given number representation. Breadth-first search [23] and depth-first search [17] algorithms were proposed by Aksoy et al. to optimally solve the MCM problem in a graph-based way. The depth-first search is able to solve MCM instances in a reasonable time but cannot handle different constraints or cost metrics. Another interesting method to optimally solve the MCM problem was given by Gustafsson, who transferred the MCM problem to the problem of finding a Steiner hypertree in a directed hypergraph [24]. He used an optimal 0-1 ILP formulation which is very generic and can be flexibly adapted to different cost metrics (at adder or logic level) and different constraints (adder depth and fanout). The main drawback is its computational complexity. Nevertheless, it can be used for small MCM instances or to find lower bounds [25] by relaxing the model to a continuous LP problem. A 0-1 ILP formulation for optimally solving the special case of minimum depth adder graphs in a graph-based fashion was recently proposed [26]. In this work, the search space of a graph-based search is compared to the MSD search space in terms of variables of the ILP. The graph-based search needs three times more variables for 8-bit coefficients and 18 times more variables for 13-bit coefficients compared to MSD.
More recently, MCM algorithms with different objectives have been proposed. One objective is minimizing the power of the adder graph by reducing or minimizing the adder depth (AD) of each output, which is defined as the number of adder stages needed to compute a coefficient [26–29]. Limiting the maximum AD of all outputs can be used to find adder graphs with low delay [30]. If this delay is still too large, pipelining can be used to speed up the circuit [5, 8, 9, 31].
Pipelining plays a crucial role in high-performance adder graph realizations on FPGAs as they have relatively large routing delays compared to their fast carry chains. However, the addition of pipeline registers may significantly increase the complexity. An example of the pipelined adder graph (PAG) of Figure 1a using an as-soon-as-possible (ASAP) scheduling for placing the pipeline registers is shown in Figure 1b. Here, each rectangular node includes a pipeline register, i. e., nodes without any '+' or '−' operator are pure registers. Many resources can be saved by finding the optimal schedule of pipeline registers for a given adder graph. Compared to the ASAP schedule, 29% less pipelined operations and speedups of about 300% were achieved on average using a slice overhead of only 18% compared to the non-pipelined adder graph [8]. The PAG using the optimal schedule of the adder graph of Figure 1a is shown in Figure 1c. However, directly considering pipelining during the adder graph optimization can further reduce the resources as demonstrated in Figure 1d. Heuristics for this kind of direct optimization were proposed recently [7, 9]. A reduction of pipelined operations by 10% compared to the optimal pipelined adder graphs [8] could be achieved by the reduced pipelined adder graph (RPAG) algorithm [9]. Slice reductions of 16% on average were reported using the best result out of three algorithms (C1^{+}, RSG Improved and a genetic algorithm) [7].
MCM using lookup tables
A totally different method for single constant multiplication which uses the lookup tables of FPGAs was proposed by Wirthlin [34]. The idea is to split the multiplication into smaller chunks, e. g., 4 bits for FPGAs with four-input LUTs, which can be directly realized using a single stage of LUTs. These LUT results have to be shifted and added to get the final product. Several techniques were proposed to minimize the number of redundant LUTs [34]. An extension of LUT-based multipliers to MCM was presented by Faust et al. [35], where identical LUTs are also shared between different constant multipliers. It was shown that the maximal number of required LUTs is far less than combinatorially possible. Their benchmark results include a comparison with adder graph MCM, which shows that their graph-based MinLD MCM algorithm [29] sometimes uses fewer resources and sometimes more resources than the LUT-based method for an input word size of 8 bits on an FPGA with four-input LUTs. As their method does not include pipelining, shorter delays could be achieved using the LUT-based MCM.
Twodimensional FIR filter architectures
An image of height M and width N is usually represented by an M × N matrix X. The matrix elements are written in lower case in the following, i. e., x_{m,n} denotes the luminance of a pixel at position (m,n). The two-dimensional folding with the P × Q folding matrix H is defined as

$$ y_{m,n} = \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} h_{p,q}\, x_{m+u-p,\, n+v-q} \qquad (1) $$

where $u=\lfloor\frac{P}{2}\rfloor$ and $v=\lfloor\frac{Q}{2}\rfloor$ denote the center of the folding matrix H. Note that P and Q are often identical (matrix H is square) and odd.
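As a purely functional reference (not a hardware description), the folding equation can be sketched in a few lines of Python. Output pixels whose window leaves the image are set to zero here, since boundary handling is an architectural choice:

```python
def fold2d(X, H):
    """2-D FIR folding of image X (M x N) with folding matrix H (P x Q),
    both given as lists of lists. Boundary outputs are set to zero."""
    M, N = len(X), len(X[0])
    P, Q = len(H), len(H[0])
    u, v = P // 2, Q // 2          # center of the folding matrix
    Y = [[0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc, valid = 0, True
            for p in range(P):
                for q in range(Q):
                    i, j = m + u - p, n + v - q
                    if 0 <= i < M and 0 <= j < N:
                        acc += H[p][q] * X[i][j]
                    else:
                        valid = False   # window leaves the image
            Y[m][n] = acc if valid else 0
    return Y
```

Each interior output pixel costs P·Q multiply-accumulate operations, which is exactly the workload the architectures below parallelize.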
Architectures for computing this folding equation can be classified by the computed output pixels per clock cycle. A sequential realization may compute a single MAC operation per clock cycle, producing one output pixel every PQ clock cycles, as is done on conventional microprocessors. As we are interested in accelerating the filter operation, parallel realizations are considered in the following. A direct implementation of the folding equation for a 3×3 folding matrix in a parallel way which is able to compute one output pixel every clock cycle is shown in Figure 2a. First, the input pixel stream is clocked into a chain of registers and FIFO buffers to provide a 3×3 block of the image. Then, this block is processed in parallel by a sum-of-products (SOP) operation according to (1), i. e., each pixel of the block is multiplied by its corresponding element in the folding matrix and all products are added to yield the final output pixel. An alternative architecture can be obtained by transposing the structure of Figure 2a by reversing the directions of each edge, replacing branches by adders and vice versa, and swapping the input and output [43]. The resulting architecture is shown in Figure 2b. The SOP block results in an MCM block after transposition, i. e., now a single input has to be multiplied by all elements of the folding matrix. Note that the outputs of both architectures are invalid for the boundary pixels of the input image. Thus, if the output image has to have the same size as the input image, a suitable boundary treatment (setting boundary pixels to black/white, copying from neighbor pixels, etc.) has to be implemented [43].
MCM blocks of Figure 2b can be obtained by one of the algorithms discussed above. The SOP circuit of Figure 2a can either be obtained by using LUT multipliers and additional adders or by transposing the adder graph which was obtained by an adder graph MCM method. It is well known that the adder cost of a single-input single-output adder graph is equal to that of its transposed form [44]. More generally, Gustafsson et al. derived the cost of an adder graph with N_i inputs and N_o outputs after transposition for the generalized case of vector matrix multiplication [32]:

$$ N_{A}^{T} = N_{A} + N_{o} - N_{i} $$
Here, N_A and ${N}_{A}^{T}$ are the adder costs before and after transposition, respectively. MCM is a special case of vector matrix multiplication with N_i = 1, so if N_{A,MCM} adders are needed for an optimal MCM adder graph with N unique coefficients, N_{A,SOP} = N_{A,MCM} + N − 1 adders are needed for the corresponding optimal SOP. Therefore, the eight additional adders shown in the transposed form in Figure 2b are exactly the additional adders needed to compute the SOP in Figure 2a. Hence, in terms of complexity, there is no difference between the direct and transposed forms. If we take a look at pipelined implementations of MCM blocks, the situation becomes worse. While transposing a pipelined MCM block still leads to a valid pipeline, the pipeline registers may no longer be located in the critical path between the adders. Therefore, we concentrate in the following on pipelined implementations of the MCM block, which can be directly incorporated in the transposed form, as shown in Figure 2b.
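The transposition cost relation is easy to exercise numerically. The sketch below uses a hypothetical MCM adder count of 5 (an arbitrary illustrative value, not from the paper) and checks that transposing an MCM block with N_i = 1 input and N_o = 9 unique coefficients, as in a 3×3 SOP, adds exactly N − 1 = 8 adders:

```python
def transposed_adder_cost(n_adders, n_inputs, n_outputs):
    # Transposition rule for vector-matrix multiply graphs:
    # N_A^T = N_A + N_o - N_i
    return n_adders + n_outputs - n_inputs
```

For the MCM-to-SOP case, `transposed_adder_cost(n, 1, N) - n` is always N − 1, independent of the adder graph itself.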
A totally different filter structure can be realized if the folding matrix is separable, i. e., matrix H can be separated into two vectors h_1 and h_2 with H = h_1 · h_2. Then, the two-dimensional filter can be realized by cascading two one-dimensional filters, one for processing the rows and one for the columns. This reduces the PQ multiplications to P + Q multiplications [43]. If the filter cannot be separated, there exist methods to decompose the filter into a sum of separable filters using singular value decomposition [45].
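The equivalence of the cascade and the direct form is easy to verify in software. The sketch below uses the separable 3×3 binomial kernel H = [1,2,1]ᵀ·[1,2,1] (a standard example, not taken from the paper) and compares a direct 2-D 'valid' folding, P·Q multiplications per pixel, against a row/column cascade with P + Q multiplications per pixel:

```python
def conv1d(row, h):
    # 'valid' 1-D folding: output only where h fully overlaps the row
    k = len(h)
    return [sum(h[i] * row[j + k - 1 - i] for i in range(k))
            for j in range(len(row) - k + 1)]

def filter_rows_then_cols(X, h1, h2):
    # Cascade of two 1-D filters: P + Q multiplications per pixel
    rows = [conv1d(r, h2) for r in X]              # filter along rows
    cols = list(zip(*rows))                        # transpose
    out = [conv1d(list(c), h1) for c in cols]      # filter along columns
    return [list(r) for r in zip(*out)]            # transpose back

def filter_2d(X, H):
    # Direct 2-D 'valid' folding: P * Q multiplications per pixel
    P, Q = len(H), len(H[0])
    M, N = len(X), len(X[0])
    return [[sum(H[p][q] * X[m + P - 1 - p][n + Q - 1 - q]
                 for p in range(P) for q in range(Q))
             for n in range(N - Q + 1)]
            for m in range(M - P + 1)]
```

Both paths produce identical interior pixels whenever H is the outer product of h_1 and h_2.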
However, only a fraction (which is typically less than one third) of the multipliers of the unseparated folding matrix have to be realized due to symmetric, zero, or poweroftwo coefficients. Furthermore, in both MCM methods discussed above, many resources can be saved when intermediate computation results are shared between the constant multipliers. This is only possible if all multipliers have the same input which is not the case for separated filters. Therefore, we concentrate on the architecture shown in Figure 2b as it is generic (any folding matrix can be realized) and pipelined MCM blocks can be directly used.
Optimally solving the pipelined MCM problem
Pipelined MCM problem formulation
In an adder graph, each node other than the '1' node is generated by a so-called $\mathcal{A}$-operation, introduced by Voronenko and Püschel [4] (compare with Figure 1a):

$$ \mathcal{A}_{q}(u,v) = \left| 2^{l_1} u + (-1)^{sg}\, 2^{l_2} v \right| 2^{-r} $$

with configuration q = (l_1, l_2, r, sg), where ${l}_{1},{l}_{2},r\in {\mathbb{N}}_{0}$ are shift factors, the sign bit sg ∈ {0,1} denotes whether an addition or subtraction is performed, and u and v are positive, odd input arguments. A valid configuration q is a combination of l_1, l_2, r, and sg such that the result is a positive odd integer.
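The A-operation is a one-liner to emulate. A sketch following the definition above, returning None for configurations that do not yield a positive odd integer:

```python
def a_op(u, v, l1, l2, r, sg):
    """A-operation: w = |2^l1 * u + (-1)^sg * 2^l2 * v| * 2^-r."""
    t = abs((u << l1) + (-1) ** sg * (v << l2))
    if r > 0:
        if t % (1 << r) != 0:
            return None            # right shift must not discard bits
        t >>= r
    return t if t > 0 and t % 2 == 1 else None
```

For example, node '5' from Figure 1a corresponds to the configuration q = (2, 0, 0, 0) applied to (1, 1).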
The greatest effort during MCM optimization is finding the numerical values of the non-output nodes, i. e., the values of all u and v which are not in the coefficient set. Once all these node values are found, it is easy to determine the configuration q of the corresponding adder graph, e. g., using the optimal part of RAGn [14] or H_{cub} [4]. Since the same can be applied for PAG optimization, it is appropriate to define a set X_s for each pipeline stage s, containing the node values of the corresponding stage. The pipeline sets for the PAG in Figure 1d are, e. g., X_0 = {1}, X_1 = {7,9}, and X_2 = {11,43,65}. With this representation, we can formally define the PMCM problem:
Definition 1 (Pipelined MCM problem)
Given a set of positive target constants T = {t_1,…,t_M} and the number of pipeline stages S, find sets X_1,…,X_{S−1} ⊆ {1,3,…,x_max} with minimal area cost such that for all w ∈ X_s for s = 1,…,S there exists a valid $\mathcal{A}$-configuration q such that $w={\mathcal{A}}_{q}(u,v)$ with u,v ∈ X_{s−1}, X_0 = {1}, and X_S = {odd(t) | t ∈ T ∖ {0}}, where odd(t) is the absolute value of t divided by 2 until it is odd.
The upper bound is usually chosen as ${x}_{\text{max}}={2}^{{b}_{\text{max}}+1}$ [4], where b_max is equal to the maximum bit width of T. The number of stages S has to be at least the largest minimal adder depth of the coefficients. The minimal AD of an integer x can be directly computed using

$$ \mathrm{AD}_{\min}(x) = \lceil \log_2(\mathrm{nz}(x)) \rceil $$
where nz(x) represents the minimal number of nonzero digits of x in canonic signed digit (CSD) representation [29]. As each additional pipeline stage introduces additional nodes in the PAG, it is very unlikely that there exists a graph with higher S but less cost. Therefore, we define S to be the minimum number of stages which is possible:

$$ S = \max_{t \in T \setminus \{0\}} \mathrm{AD}_{\min}(\mathrm{odd}(t)) $$
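These quantities are cheap to compute. The following sketch counts CSD nonzero digits with the standard recoding recurrence and derives AD_min and the minimum stage count S as defined above:

```python
def odd(t):
    """Divide |t| by 2 until it is odd."""
    t = abs(t)
    while t and t % 2 == 0:
        t //= 2
    return t

def nz(x):
    """Minimal number of nonzero digits of x in CSD representation."""
    count = 0
    while x:
        if x & 1:
            count += 1
            # runs of ones are recoded: choose digit -1 or +1
            x += 1 if (x & 3) == 3 else -1
        x >>= 1
    return count

def ad_min(x):
    n = nz(odd(x))
    return (n - 1).bit_length()     # = ceil(log2(n)) for n >= 1

def min_stages(targets):
    return max(ad_min(t) for t in targets if t != 0)
```

For the example set {44,130,172}, this yields S = 2, matching the two-stage PAG of Figure 1d.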
The area cost of pipeline sets X _{0 … S } depends on the target architecture, and it will be discussed in the following sections.
FPGA cost of pipelined adder graphs
We use the area (i. e., the number of FPGA components) as cost measure, since the maximum speed results from the architecture and is given by the largest ripple carry adder and the routing delay, which is hard to predict. Different cost metrics are used for evaluating the size of an FPGA circuit. Common metrics are counting the LUTs, flip-flops, slices (Xilinx devices), or logic elements (Altera devices). As different FPGA components are used in this work, the smallest common piece of logic, consisting of LUT(s), flip-flop(s), and full adder logic, is counted and referred to as basic logic element (BLE) in the following. Simplified block schematics of BLEs of the Virtex 4 and Virtex 6/7 architectures are shown in Figure 3. On Virtex 4, each BLE is half of an FPGA slice, and on Virtex 6/7, it is a quarter of a slice. While the Virtex 4 BLE provides a four-input LUT, a flip-flop, and additional logic for building a full adder (AND gate, XOR gate, and a multiplexer), the BLEs of the later generations provide a six-input LUT which can be used as two five-input LUTs with shared inputs, two flip-flops, and full adder logic (XOR gate and multiplexer).
Each BLE can realize a single full adder including the register for pipelining. Thus, for evaluating adder graphs, the number of full adders or pure registers can be counted. A detailed cost model of common adder graphs mapped to FPGAs was recently presented in [33]. This model respects the replacement of full adders by simple wires if the two shifted numbers do not overlap at every bit position. However, for pipelined adders on FPGAs, full adder and pipeline register are realized in a single BLE resource. Hence, saving some full adders has no effect on BLE resources as the pipeline register is needed in any case. The number of BLEs of an $\mathcal{A}$-operation is equal to the output word size of the corresponding adder. If there is no right shift (r = 0), the output word size is identical to the sum of the input word size and the word size needed to represent node value w. If a right shift r > 0 is used, the corresponding adder has r additional output bits. Even if these additional output bits are ignored after the right shift, they are needed to compute the carry-in for higher order bits to yield a correct result. Hence, the number of BLEs of each $\mathcal{A}$-operation in a pipelined adder graph can be summarized to

$$ \mathrm{cost}_{\mathcal{A}}(u,v,w) = B_x + \lceil \log_2(w+1) \rceil + r $$

where B_x is the input word size and $w={\mathcal{A}}_{q}(u,v)$.
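A sketch of this BLE cost model (an interpretation of the description above: input word size plus the word size of w, plus r extra bits for a right shift):

```python
def ble_cost(w, b_x, r=0):
    """BLE count of one pipelined A-operation: the adder output word size.

    b_x: input word size of the MCM block; w: odd node value;
    r: right-shift amount of the configuration.
    """
    # int.bit_length() gives the number of bits needed to represent w
    return b_x + w.bit_length() + r
```

For example, node '5' with an 8-bit input occupies 11 BLEs under this model.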
ILP formulation for the pipelined adder graph problem
Before the ILP formulation for the PAG problem is described, some variables and auxiliary sets are introduced. There are two types of binary decision variables used in the formulation: a variable ${a}_{w}^{s} \in \{0,1\}$, which is one if the element w is realized in pipeline stage s, and a variable ${a}_{(u,v,w)}^{s} \in \{0,1\}$, which is one if w is computed in stage s from the elements u and v of the previous stage using a single $\mathcal{A}$-operation. Hence, there is one variable ${a}_{w}^{s}$ for each stage s and each possible element w and one variable ${a}_{(u,v,w)}^{s}$ for each w and each possible pair (u,v) from which w can be computed with a single $\mathcal{A}$-operation.
To determine these combinations, we use the definition of the ${\mathcal{A}}_{\ast}(u,v)$ set [4], which contains all elements that can be computed from u and v using a single $\mathcal{A}$-operation:

$$ \mathcal{A}_{\ast}(u,v) = \left\{ \mathcal{A}_{q}(u,v) \mid q \text{ is a valid configuration} \right\} $$
For convenience, the ${\mathcal{A}}_{\ast}$ set is also defined for a set X:

$$ \mathcal{A}_{\ast}(X) = \bigcup_{u,v \in X} \mathcal{A}_{\ast}(u,v) $$
Note that ${\mathcal{A}}_{\ast}\left(\{u,v\}\right)$ is different from ${\mathcal{A}}_{\ast}(u,v)$ as the first one contains all combinations of u and v: ${\mathcal{A}}_{\ast}\left(\{u,v\}\right)={\mathcal{A}}_{\ast}(u,u)\cup {\mathcal{A}}_{\ast}(u,v)\cup {\mathcal{A}}_{\ast}(v,v)$. With this set, we can now define the set ${\mathcal{A}}^{s}$ of all possible odd variables which may be used at pipeline stage s. It can be recursively defined by

$$ \mathcal{A}^{s} = \mathcal{A}_{\ast}\left(\mathcal{A}^{s-1}\right) \quad \text{with} \quad \mathcal{A}^{0} = \{1\} $$
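These sets can be enumerated by brute force for small word sizes, which is exactly what keeps the ILP tractable for 8- to 12-bit coefficients. A sketch that enumerates left shifts up to the word size of x_max and folds right shifts into the odd normalization:

```python
def a_star(u, v, x_max):
    """All odd w <= x_max reachable from u, v by one A-operation."""
    shift_max = x_max.bit_length()
    out = set()
    for l1 in range(shift_max + 1):
        for l2 in range(shift_max + 1):
            for sign in (1, -1):
                w = abs((u << l1) + sign * (v << l2))
                while w and w % 2 == 0:   # absorb right shifts (r > 0)
                    w //= 2
                if 0 < w <= x_max:
                    out.add(w)
    return out

def a_star_set(X, x_max):
    """A_*(X): union over all pairs (u, v) from X."""
    out = set()
    for u in X:
        for v in X:
            out |= a_star(u, v, x_max)
    return out
```

Applying `a_star_set` recursively, starting from {1}, yields the stage sets A^s.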
As not all elements of ${\mathcal{A}}^{s}$ are needed to compute the target coefficients, the set ${\mathcal{S}}^{s}\subseteq {\mathcal{A}}^{s}$ is defined, which contains all single elements that may be used to compute the target coefficients T in the last stage. Furthermore, with ${\mathcal{T}}^{s}$ we denote the set of (u,v,w)-triplets for which $w\in {\mathcal{S}}^{s}$ can be computed using u and v of the previous stage s−1. Now, the sets ${\mathcal{S}}^{s}$ and ${\mathcal{T}}^{s}$ can be computed recursively, starting from the last stage S, where ${\mathcal{S}}^{S}$ is identical to the odd target coefficients excluding zero:

$$ \mathcal{S}^{S} = \left\{ \mathrm{odd}(t) \mid t \in T \setminus \{0\} \right\} $$

$$ \mathcal{T}^{s} = \left\{ (u,v,w) \mid w \in \mathcal{S}^{s},\ u,v \in \mathcal{A}^{s-1},\ w \in \mathcal{A}_{\ast}(u,v) \right\} $$

$$ \mathcal{S}^{s-1} = \left\{ u \mid (u,v,w) \in \mathcal{T}^{s} \right\} \cup \left\{ v \mid (u,v,w) \in \mathcal{T}^{s} \right\} $$
Using these variables and auxiliary sets, we can define the ILP formulation for the PAG problem as follows:
ILP Formulation 1: Pipelined adder graph optimization

$$ \min \sum_{s=1}^{S} \sum_{(u,v,w) \in \mathcal{T}^{s}} \mathrm{cost}_{\mathcal{A}}(u,v,w)\, a_{(u,v,w)}^{s} $$

subject to

(C1) $a_{t}^{S} = 1$ for all $t \in \mathcal{S}^{S}$

(C2) $a_{w}^{s} - \sum_{(u,v,w) \in \mathcal{T}^{s}} a_{(u,v,w)}^{s} \le 0$ for all $w \in \mathcal{S}^{s}$, $s = 1,\ldots,S$

(C3) $a_{(u,v,w)}^{s} - a_{u}^{s-1} \le 0$ and $a_{(u,v,w)}^{s} - a_{v}^{s-1} \le 0$ for all $(u,v,w) \in \mathcal{T}^{s}$, $s = 1,\ldots,S$

with $a_{w}^{s} \in \{0,1\}$ and $a_{(u,v,w)}^{s} \ge 0$.
The subject is to minimize the BLE cost of all required $\mathcal{A}$operations. Constraint (C1) simply forces that all target coefficients are realized in the last pipeline stage. To realize an element ${a}_{w}^{s}$, one of the possible $\mathcal{A}$operations to compute w must be available, which is constrained by (C2). The last constraint (C3) ensures that an $\mathcal{A}$operation $w={\mathcal{A}}_{q}(u,v)$ can only be realized in stage s when u and v are both available in stage s  1.
In the formulation, we use the relaxed condition ${a}_{(u,v,w)}^{s}\ge 0$ instead of requiring ${a}_{(u,v,w)}^{s}\in \{0,1\}$ which results in an MILP formulation. Using continuous variables ${a}_{(u,v,w)}^{s}$ instead of binary ones reduces the runtime of the optimization substantially. This is possible since we can construct a binary solution that is feasible and minimal from the relaxed solution.
Proof
Assume an optimal solution of ILP Formulation 1 is given. Due to (C3), all variables ${a}_{(u,v,w)}^{s}$ satisfy the condition

$$ a_{(u,v,w)}^{s} \le \min\left( a_{u}^{s-1},\ a_{v}^{s-1} \right) \le 1, $$

but in the optimum solution of ILP Formulation 1, a variable ${a}_{w}^{s}=1$ can exist with several variables ${a}_{(u,v,w)}^{s}>0$. Let us define a set ${\mathcal{T}}_{w}^{\prime}$, which contains all triplets (u,v,w) with ${a}_{(u,v,w)}^{s}>0$. Then, there exists at least one triplet $({u}^{\prime},{v}^{\prime},w)\in {\mathcal{T}}_{w}^{\prime}$ with the least cost contribution:

$$ ({u}^{\prime},{v}^{\prime},w) = \underset{(u,v,w) \in \mathcal{T}_{w}^{\prime}}{\arg\min}\ \mathrm{cost}_{\mathcal{A}}(u,v,w) $$
Now, a feasible binary solution of the formulation can be obtained by assigning new values ${\hat{a}}_{({u}^{\prime},{v}^{\prime},w)}^{s}=1$ and ${\hat{a}}_{(u,v,w)}^{s}=0$ for all $(u,v,w)\in {\mathcal{T}}_{w}^{\prime}\setminus \left\{({u}^{\prime},{v}^{\prime},w)\right\}$. Since the contribution of the triplets in ${\mathcal{T}}_{w}^{\prime}$ to the objective function is

$$ \sum_{(u,v,w) \in \mathcal{T}_{w}^{\prime}} \mathrm{cost}_{\mathcal{A}}(u,v,w)\, \hat{a}_{(u,v,w)}^{s} = \mathrm{cost}_{\mathcal{A}}({u}^{\prime},{v}^{\prime},w) \le \sum_{(u,v,w) \in \mathcal{T}_{w}^{\prime}} \mathrm{cost}_{\mathcal{A}}(u,v,w)\, a_{(u,v,w)}^{s} $$

and since the objective function is minimized, the new solution is also minimal and the objective values are equal. If we do this construction for all variables ${a}_{w}^{s}=1$, we obtain an optimum solution where all variables are binary. □
ILP Formulation 1 is very generic and is extendable into different directions. For optimizing MCM blocks on application specific integrated circuits (ASICs), the cost function in the objective can be easily divided into two cost functions: one for pipelined adders (${\text{cost}}_{\mathcal{A}}$) and one for pure registers (cost_{R}), as both require different ASIC resources:

$$ \min \sum_{s=1}^{S} \Bigg( \sum_{\substack{(u,v,w) \in \mathcal{T}^{s} \\ u \ne w \text{ or } v \ne w}} \mathrm{cost}_{\mathcal{A}}(u,v,w)\, a_{(u,v,w)}^{s} + \sum_{(w,w,w) \in \mathcal{T}^{s}} \mathrm{cost}_{R}(w)\, a_{(w,w,w)}^{s} \Bigg) $$
A modified cost function can also be used to influence the number of adders within a pipeline stage. Instead of placing pipeline registers after each stage of adders, they could be placed behind multiple adder stages (e.g., every second or third stage). This would reduce the register resources for applications where a lower speed is sufficient or a lower latency is required. For that, the cost function for pure registers (as for ASICs) can be set to zero in each stage in which no registers should be placed. In the implementation, these registers are exchanged by wires.
Another extension of the ILP formulation would be the support of dedicated shift registers which are provided by the latest Xilinx FPGA architectures. These shift registers are realized in a single FPGA LUT and provide up to 16 bits on Virtex 4 FPGAs (SRL16 primitive) and up to 32 bits on Virtex 5/6, Spartan 6, and 7 series FPGAs (SRLC32E primitive). In other words, up to 16 or 32 registers (plus one additional register at the output) can be realized at the same BLE cost as a single register. As long as S < 17 or S < 33, respectively, constraint (C3) in ILP Formulation 1 can be replaced by:

(C3a) $a_{(u,v,w)}^{s} - a_{u}^{s-1} \le 0$ and $a_{(u,v,w)}^{s} - a_{v}^{s-1} \le 0$ for all $(u,v,w) \in \mathcal{T}^{s}$ with $u \ne w$ or $v \ne w$

(C3b) $a_{(w,w,w)}^{s} - \sum_{s'=0}^{s-1} a_{w}^{s'} \le 0$ for all $(w,w,w) \in \mathcal{T}^{s}$

While (C3a) is nearly identical to (C3), (C3b) allows the bypass of several stages using shift registers. Of course, these values are not accessible in the intermediate stages.
Multiple constant multiplication using LUT multipliers
In this section, a LUTbased constant multiplier method is introduced, which is used to extend ILP Formulation 1.
LUTbased constant multiplication
Consider a number x which is represented in two's complement format using B_x bits:

$$ x = -2^{B_x - 1} x_{B_x - 1} + \sum_{i=0}^{B_x - 2} 2^{i} x_{i} $$
If this number is multiplied by a constant c_n with B_c bits, the resulting B_c × B_x multiplication can be divided into several smaller multiplications by rearranging the partial sum terms:

$$ c_n\, x = \sum_{k=0}^{K-1} 2^{kL}\, c_n\, x_k \qquad (25) $$

where $x_k$ denotes the k-th L-bit chunk of x (the most significant chunk being signed). Now, $K=\lceil\frac{B_x}{L}\rceil$ multiplications of size B_c × L are necessary. In case B_x is not divisible by L, the number x has to be sign extended to ${B}_{x}^{\prime}=KL$ bits. The idea of Wirthlin [34] was to realize each B_c × L multiplier using L-input LUTs, where L has to be chosen to fit the FPGA LUT input size. The resulting multiplier structure for an example of a 12 × 4 multiplier using four-input LUTs [34] is shown in Figure 4.
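The decomposition can be emulated in software: precompute one table of 2^L products (modeling the LUT stage), then shift and add the looked-up partial products per chunk. The sketch below handles unsigned inputs only for brevity; the signed case needs the sign extension described above:

```python
def lut_mult(c, x, b_x=12, L=4):
    """Multiply unsigned x (b_x bits) by constant c using
    K = ceil(b_x / L) lookups into a 2^L-entry product table."""
    K = -(-b_x // L)                        # ceiling division
    table = [c * k for k in range(1 << L)]  # one L-input LUT stage
    result = 0
    for k in range(K):
        chunk = (x >> (k * L)) & ((1 << L) - 1)
        result += table[chunk] << (k * L)   # shift-and-add tree
    return result
```

Note that the table depends only on the constant c, which is why identical LUT contents can be shared between multipliers of different constants.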
LUT minimization techniques
Several methods were proposed by Wirthlin to reduce the hardware complexity of the circuit. In the following cases, LUTs can be eliminated:

(a) Removal of constant LUTs, i. e., LUTs that are always '0' or '1' can be replaced by a constant.
(b) Removal of LUTs which are identical to one of the inputs, i. e., the output can be connected to the corresponding input.
(c) Removal of redundant LUTs, i. e., LUTs with identical content and identical inputs can be shared.
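With each LUT modeled as its 2^L-entry truth table, the three elimination rules reduce to simple checks. A sketch using a hypothetical representation (a list of output bits indexed by the input value; names are illustrative):

```python
def is_constant(lut):
    # case (a): output is always '0' or always '1'
    return len(set(lut)) == 1

def equals_input(lut, bit, L=4):
    # case (b): output is identical to input bit 'bit'
    return all(lut[i] == ((i >> bit) & 1) for i in range(1 << L))

def share_redundant(luts):
    # case (c): keep one instance per unique (content, inputs) pair;
    # luts is a list of (name, content, inputs) tuples
    unique = {}
    for name, content, inputs in luts:
        unique.setdefault((tuple(content), inputs), name)
    return unique
```

Only rule (c) carries over to the pipelined realization discussed next, since the other two still leave a flip-flop that occupies a BLE.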
In particular, the removal of redundant LUTs can be used to reduce logic when multiple constants are involved [35]. It was shown that identical LUTs are very likely to occur in MCM operations. For example, only 52 unique LUTs are required to be able to compute all signed products with constants from 1 to 2^{12} using four-input LUTs. Compared to adder graphs with minimal adder depth, the latency could be reduced by approximately 33% on average for an input word size of 8 bits. However, even for this small input word size, only clock frequencies of about 100 MHz could be achieved, which is far less than what is possible on the used Virtex 4 FPGA. Hence, in our work, pipelined realizations of LUT multipliers are investigated, as shown in Figure 5. In the first pipeline stage, the LUT content is looked up, and the results are shifted and added according to (25) using a pipelined adder tree. Note that some of the bit shifts can be moved toward the output of the adder tree to reduce the word size in each stage (not shown in Figure 5). No extra BLEs have to be reserved for pure registers, as the flip-flops of the BLEs whose LUTs and full adders are used can take this role. However, of the reduction methods, only the removal of redundant LUTs can be applied, as these LUTs can be shared together with the flip-flop, while the other methods would still require a BLE for a single flip-flop.
ILP formulation for the combined pipelined adder/LUT graph optimization
ILP Formulation 1 of the PAG optimization is now extended to include LUT multipliers. Thus, the ILP solver should decide whether a target coefficient is realized using adders or using LUTs. Of course, if more than one LUT multiplier is used, the sharing of identical LUTs should be included in the cost model. For that, the set ${\mathcal{L}}_{v,w}^{s}$ is defined, which contains all LUTs of stage s that are needed to compute each bit of w from v, and a new Boolean decision variable is introduced:
If a LUT multiplier is inserted in the adder graph to compute w from node v, the variable ${a}_{(0,v,w)}^{s}$ is set to one (the case u = 0 does not exist in ILP Formulation 1). The corresponding (0,v,w) triplets have to be added to ${\mathcal{T}}^{s}$. Now, ILP Formulation 1 can be extended by the two constraints (C4) and (C5) (the other constraints stay nearly identical) and a modified objective function:
ILP Formulation 2: Combined pipelined adder graph/LUT optimization
Two cost functions were added: cost_{ LUT } is the cost of a single LUT, and cost_{ AT }(u,v,w) corresponds to the cost of the adder tree which is necessary to compute the sum of the LUT outputs. The splitting of the cost function is necessary to respect the reduced cost for shared LUTs. For Virtex 4 or Virtex 6/7 using six-input LUTs, each LUT consumes one BLE (cost_{ LUT } = 1). For Virtex 6/7 using five-input LUTs, the LUT costs can be approximated as cost_{ LUT } = 1/2, as only one half of a BLE is used for a single LUT. The adder tree costs (cost_{ AT }(u,v,w)) depend on v, w, and the LUT multiplier pipeline depth, which is defined by the depth of the adder tree plus one for the LUT registers:
The additional constraints (C4) and (C5) are defined to incorporate LUT realizations during the optimization. Constraint (C4) is similar to (C3): if a LUT multiplier is used to compute w in stage s from v in stage s - LD(w/v), the corresponding node v must be available in that stage. Constraint (C5) ensures that the corresponding LUTs are available in the correct pipeline stage to compute w from v. In initial experiments, we found only cases where LUT multipliers directly compute target coefficients of the last stage. Hence, we decided to reduce the search space by evaluating the constraints (C4) and (C5) only for the target constants of the last stage s = S.
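The LUT multiplier pipeline depth and the split LUT/adder-tree cost model described above can be sketched as follows (an approximation with function names of our choosing; the exact cost_{ AT } expression is not reproduced here):

```python
import math

def lut_mult_pipeline_depth(B_x, L):
    """Pipeline depth of a LUT multiplier: a registered LUT stage plus
    a binary pipelined adder tree over the K = ceil(B_x/L) partial
    products, i.e., the adder tree depth plus one."""
    K = math.ceil(B_x / L)
    return math.ceil(math.log2(K)) + 1 if K > 1 else 1

def lut_cost(n_luts, shared, cost_lut=1.0):
    """LUT part of the cost model: shared LUTs (identical content and
    inputs) are counted once; cost_lut = 0.5 approximates Virtex 6/7
    five-input LUTs, where one BLE hosts two LUTs."""
    return (n_luts - shared) * cost_lut
```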
Results
To evaluate the performance of the two ILP formulations, a benchmark set of folding matrices was designed using Matlab. The coefficients were computed using the fspecial() function, except for the lowpass and highpass filters, which were obtained by the Parks-McClellan algorithm (remez() and ftrans2() functions). All real coefficients were quantized to a word size of B _{ c }. The complete convolution matrices are given in Appendix 1. The filter parameters are summarized in Table 1 with their matrix size, word size, the required number of pipeline stages S using PMCM, the parameters used for their design, and the N _{ uq } unique odd coefficients of their folding matrix.
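The quantization step can be sketched as follows (a hypothetical scaling rule for illustration; the paper does not state the exact normalization used):

```python
def quantize_coefficients(H, B_c):
    """Quantize real filter coefficients to signed B_c-bit integers by
    scaling the largest magnitude to the full signed range and rounding.
    (Illustrative rule; other normalizations are possible.)"""
    max_abs = max(abs(c) for row in H for c in row)
    scale = (2 ** (B_c - 1) - 1) / max_abs
    return [[round(c * scale) for c in row] for row in H]
```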
The BILP solver of Matlab is slow compared to the CPLEX optimizer from IBM [46] and does not provide a solver for MILP problems. CPLEX provides a text file interface with a comfortable, human-readable syntax (the LP file format [47]). Hence, Matlab was used to generate the MILP models as LP files, and CPLEX was used for solving them. Then, Matlab was used to read in the solution file of CPLEX in XML format [47] and to generate synthesizable VHDL code. All MCM blocks were optimized for input word sizes B _{ x } of 8, 10, and 12 bits. The results are compared with the RPAG algorithm [9] and the LUT MCM method of [35], which was applied to the pipelined realization, as shown in Figure 5. The RPAG algorithm is a greedy heuristic. As the best local choice in a greedy algorithm does not necessarily lead to the best global solution, it allows the random selection of one of the n best local decisions. The best result out of several runs can then be taken to improve the optimization. RPAG was configured for a single run (R = 1, pure greedy) and for R = 50 runs per MCM instance, where the locally first or second best decision was randomly selected. These results were considered as the state-of-the-art reference. The optimization was performed for the Virtex 6/7 FPGA architecture, and CPLEX was configured with a computation time limit of 8 h.
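The LP file interface can be illustrated with a minimal generator sketch (a hypothetical helper; the actual models were generated by Matlab):

```python
def write_lp(path, objective, constraints, binaries):
    """Emit a minimal model in the human-readable CPLEX LP file format:
    an objective, named constraints, and binary variable declarations."""
    lines = ["Minimize", " obj: " + objective, "Subject To"]
    for name, expr in constraints:
        lines.append(f" {name}: {expr}")
    lines += ["Binary", " " + " ".join(binaries), "End"]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```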
The results are summarized in Table 2 for RPAG (R = 1 and R = 50) and the LUT MCM method, which are compared to the proposed pipelined adder graph (optimal PAG) and pipelined adder/LUT graph (optimal PALG) methods. An optimal solution could be found within the time limit for all cases except the lowpass 15 × 15 instances. Here, the best obtained result is shown as an upper bound. All solutions are given as PAGs in Appendix 2. The coefficients which have to be replaced by LUT multipliers to obtain the PALG solution are given in the last column of Table 2. Comparing RPAG with the optimal PAG solution, it is apparent that RPAG (R = 50) often finds an optimal solution (in 24 out of 32 cases). For the three cases where the timeout of the ILP solver occurred, RPAG finds slightly better solutions. Comparing the pipelined LUT MCM method with RPAG and optimal PAG, the LUT MCM method performs better for low input word sizes, as expected. For B _{ x } = 8, it is always the best choice; for B _{ x } = 10, it is sometimes the best; and for B _{ x } = 12, it is never the best choice. This is in contrast to the combined PALG optimization, which found the best results in all cases where an optimal solution was obtained. For B _{ x } = 8, all instances were pure LUT MCM realizations; for B _{ x } = 10, there is a mixture of pure adder graphs, pure LUT realizations, and combinations of both; while for B _{ x } = 12, the adder graph realizations dominate. An example is given in Figure 6. While the PAG in Figure 6a needs two elements in the second pipeline stage (1 and 7) to compute coefficient 121, this coefficient is realized with a two-stage LUT multiplier in Figure 6b, saving one element in stage two. An overall reduction of 1.5% and 7.2% of BLEs compared to RPAG could be achieved for the optimal PAG and PALG methods, respectively.
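For illustration, one shift-and-add decomposition of coefficient 121 that is consistent with stage-two nodes 1 and 7 is the following (an assumption on our part, as the figure itself is not reproduced here):

```python
# A two-adder realization of 121 from nodes 1 and 7:
seven = (1 << 3) - 1     # 7 = 8*1 - 1: one adder from the input
coeff = (1 << 7) - seven # 121 = 128*1 - 7: one further adder
assert coeff == 121
```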
The minimum, maximum, and average computation times on an Intel Nehalem 2.93 GHz computer (Intel, Santa Clara, CA, USA) for all used optimization algorithms are given in Table 3. All heuristics take up to a few seconds; the optimal methods are very fast for the pipeline depth S = 2 cases in Table 1 but take hours for the three matrices with S = 3 (up to the time limit of 8 h = 28,800 s for the lowpass 15 × 15 instances). However, the optimization time is reasonable compared to the common FPGA development effort.
To validate the optimization results, synthesis experiments were performed for a Virtex 6 XC6VLX75T-2FF484 FPGA. For that, VHDL code generators were developed in Matlab for all the examined architectures. The synthesis was performed using Xilinx ISE v13.4. Unfortunately, the BLE usage cannot be directly obtained by counting the FFs, LUTs, or slices reported by the tool, as a Virtex 6 may use more than one FF per BLE, a LUT may be unused, and many slices may not be fully used but remain available for other logic. Therefore, the resulting netlist after place & route was converted to an XDL file, and the XDLParser from the RapidSmith project [48] was used to obtain the BLEs directly from the netlist. All synthesis results are listed in Table 4. It can be seen that in most cases, the optimal PALG solution leads to the best synthesis result. However, there are some cases where the estimated cost model fails to predict the synthesis result. The following causes were identified by examining the netlists with the Xilinx FPGA Editor:

1.
A single Virtex 6 BLE is able to realize two flip-flops (see Figure 3b), but ISE sometimes uses one BLE to realize a single register and sometimes to realize two registers.

2.
Sometimes, ISE maps a single flip-flop into a BLE that is also used for a full adder.
Hence, the cost model overestimates the cost by assuming that each single flip-flop is mapped to its own BLE. Due to the large shifts in the LUT multiplier architecture (see Figure 5), the least significant bits in the adder tree are pure flip-flops. Thus, it is more likely that a LUT multiplier is overestimated and adders are used instead. However, the synthesis results show that a BLE reduction of 1.3% and 8.5% on average compared to RPAG could be achieved for the optimal PAG and PALG methods, respectively. The speed of the different architectures is very similar. In 75.8% and 78.8% of the cases for optimal PAG and PALG, respectively, they were even faster than the maximum clock frequency of the embedded DSP48E1 multiplier (600 MHz).
Conclusion and outlook
Two ILP formulations to optimize multiple constant multiplication on FPGAs were presented and analyzed using synthesis experiments. The first one is a formulation of the PMCM problem, for which only heuristics existed so far [7, 9]. It was shown that better results are achievable for the low coefficient word sizes used. For most instances (24 out of 32 cases), the RPAG heuristic is also able to find an optimal solution. The second ILP formulation incorporates pipelined LUT-based multipliers. It was shown that this combined method outperforms the PMCM realization, in particular for low input word sizes of 8 to 10 bits.
The used ILP solver, CPLEX, was able to find an optimal solution for most of the test instances within 8 h of computation time. If not, the best feasible solution was close to that of the RPAG heuristic. Synthesis experiments were performed for the Xilinx Virtex 6 architecture, showing that compact and fast multiple constant multipliers are obtained. A resource reduction of 8.5% was achieved compared to the state-of-the-art while having approximately the same speed. Thus, the proposed optimizations can be beneficially used for many real-time video processing applications which involve FIR filters.
Future work could proceed in several directions. The examination of the mismatch between the cost model and the synthesis results could be used to improve the results. The slice flip-flops in adders which were not used by synthesis could be utilized to implement pure registers. Furthermore, pure registers could be forced to use all of the eight available slice flip-flops. In the ILP model, this could be respected by reducing the cost for registers, e. g., setting the cost for one flip-flop to half a BLE instead of a full BLE. The physical realization can be done using low-level placement constraints. However, this could reduce the performance, as fixed placements (even relative ones) may limit a timing-driven place & route optimization.
Another extension could be the optimization of adder graphs containing ternary adders, i. e., adders with three inputs. Modern FPGAs provide methods to implement ternary adders with the same number of slices/ALMs as needed for a two-input adder with equal output word size, but with a reduced speed due to a longer critical path [49, 50]. The ILP formulations could be extended in that direction by using quadruplets instead of the (u,v,w) triplets, which allows the modeling of three inputs instead of the two inputs u and v. However, this would substantially increase the search space, so it has to be investigated whether a solution can be found in an acceptable runtime, which is left open for future research.
Appendices
Appendix 1. Convolution matrices of the benchmark filters
The convolution matrices which were used as the benchmark are listed in Table 5. They were used to produce the results listed in Tables 2, 3, and 4. Their properties are summarized in Table 1.
Appendix 2. Adder graphs results of PMCM optimization
The pipelined adder graph solutions of the benchmark convolution matrices of Table 5 are shown in Figure 7. Only the realization of the unique odd coefficients is shown; their even representations can be realized by additional left shifts. All solutions are optimal (obtained by optimal PAG) except for lowpass 15 × 15, for which the optimal solution is unknown (obtained by RPAG).
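The reduction to unique odd coefficients can be sketched as follows (a hypothetical helper name):

```python
def odd_fundamental(c):
    """Reduce a nonzero integer coefficient to its unique odd
    fundamental: divide out the factor 2^s, which is realizable in
    hardware as a left shift by s bits."""
    assert c != 0
    s = 0
    while c % 2 == 0:
        c //= 2
        s += 1
    return c, s
```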
References
 1.
Bovik A: The Essential Guide to Image Processing. Academic Press, Waltham; 2009.
 2.
Kumm M, Zipf P: Hybrid multiple constant multiplication for FPGAs. In International Conference on Electronics, Circuits and Systems (ICECS). IEEE Piscataway; 2012:556-559.
 3.
Bull DR, Horrocks DH: Primitive operator digital filters. IEE Proc. Circuits, Devices Syst 1991, 138(3):401-412. 10.1049/ip-g-2.1991.0066
 4.
Voronenko Y, Püschel M: Multiplierless multiple constant multiplication. ACM Trans. Algorithms (TALG) 2007, 3(2):1-38.
 5.
Meyer-Baese U, Chen J, Chang CH, Dempster AG: A comparison of pipelined RAG-n and DA FPGA-based multiplierless filters. In Asia Pacific Conference on Circuits and Systems (APCCAS). IEEE Piscataway; 2006:1555-1558.
 6.
Mirzaei S, Kastner R, Hosangadi A: Layout aware optimization of high speed fixed coefficient FIR filters for FPGAs. Int. J. Reconfigurable Comput 2010, 3:1-17.
 7.
Meyer-Baese U, Botella G, Romero D, Kumm M: Optimization of high speed pipelining in FPGA-based FIR filter design using genetic algorithm. In SPIE Defense, Security+Sensing, Volume 8401. SPIE Baltimore; 2012:1-12.
 8.
Kumm M, Zipf P: High speed low complexity FPGA-based FIR filters using pipelined adder graphs. In International Conference on Field Programmable Technology (ICFPT). IEEE Piscataway; 2011:1-4.
 9.
Kumm M, Faust M, Zipf P, Chang CH: Pipelined adder graph optimization for high speed multiple constant multiplication. In International Symposium on Circuits and Systems (ISCAS). IEEE Piscataway; 2012:49-52.
 10.
Kumm M, Liebisch K, Zipf P: Reduced complexity single and multiple constant multiplication in floating point precision. In International Conference on Field Programmable Logic and Applications (FPL). IEEE Piscataway; 2012:255-261.
 11.
Hartley R: Subexpression sharing in filters using canonic signed digit multipliers. IEEE Trans. Circuits Syst. II: Analog Digit. Signal Process 1996, 43(10):677-688. 10.1109/82.539000
 12.
Mirzaei S, Hosangadi A, Kastner R: FPGA implementation of high speed FIR filters using add and shift method. In International Conference on Computer Design (ICCD). IEEE Piscataway; 2006:308-313.
 13.
Imran M, Khursheed K, O’Nils M: On the number representation in subexpression sharing. In International Conference on Signals and Electronic Systems (ICSES). IEEE Piscataway; 2010:17-20.
 14.
Dempster AG, Macleod MD: Constant integer multiplication using minimum adders. IEE Proc. Circuits, Devices Syst 1994, 141(5):407-413. 10.1049/ip-cds:19941191
 15.
Dempster AG, Macleod MD: Use of minimum-adder multiplier blocks in FIR digital filters. IEEE Trans. Circuits Syst. II: Analog Digit. Signal Process 1995, 42(9):569-577. 10.1109/82.466647
 16.
Gustafsson O: A difference based adder graph heuristic for multiple constant multiplication problems. In International Symposium on Circuits and Systems (ISCAS). IEEE Piscataway; 2007:1097-1100.
 17.
Aksoy L, Günes E, Flores P: Search algorithms for the multiple constant multiplications problem: exact and approximate. Microprocessors and Microsystems 2010, 34(5):151-162. 10.1016/j.micpro.2009.10.001
 18.
Flores P, Monteiro J, Costa E: An exact algorithm for the maximal sharing of partial terms in multiple constant multiplications. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE Computer Society Washington; 2005:13-16.
 19.
Yurdakul A, Dündar G: Multiplierless realization of linear DSP transforms by using common two-term expressions. J. VLSI Signal Process 1999, 22:163-172. 10.1023/A:1008125221674
 20.
Aksoy L, Costa E, Flores P, Monteiro J: Optimization of area under a delay constraint in digital filter synthesis using SAT-based integer linear programming. In 43rd ACM/IEEE Design Automation Conference (DAC). IEEE Piscataway; 2006:669-674.
 21.
Aksoy L, Costa E, Flores P, Monteiro J: Optimization of area in digital FIR filters using gate-level metrics. In 44th ACM/IEEE Design Automation Conference (DAC). IEEE; 2007:420-423.
 22.
Aksoy L, da Costa E, Flores P, Monteiro J: Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications. IEEE Trans. Computer-Aided Design Integrated Circuits Syst 2008, 27(6):1013-1026.
 23.
Aksoy L, Gunes E, Flores P: An exact breadth-first search algorithm for the multiple constant multiplications problem. In NORCHIP. IEEE Piscataway; 2008:41-46.
 24.
Gustafsson O: Towards optimal multiple constant multiplication: a hypergraph approach. In 42nd Asilomar Conference on Signals, Systems and Computers. IEEE Piscataway; 2008:1805-1809.
 25.
Gustafsson O: Lower bounds for constant multiplication problems. IEEE Trans. Circuits Syst. II: Express Briefs 2007, 54(11):974-978.
 26.
Aksoy L, Costa E, Flores P, Monteiro J: Design of low-power multiple constant multiplications using low-complexity minimum depth operations. In Proceedings of the 21st Edition of the Great Lakes Symposium on VLSI. ACM New York; 2011:79-84.
 27.
Dempster A, Demirsoy S, Kale I: Designing multiplier blocks with low logic depth. In International Symposium on Circuits and Systems (ISCAS). IEEE Piscataway; 2002:773-776.
 28.
Johansson K: Low Power and Low Complexity Shift-and-Add Based Computations. PhD thesis, Linköping University, Department of Electrical Engineering, 2008
 29.
Faust M, Chang CH: Minimal logic depth adder tree optimization for multiple constant multiplication. In International Symposium on Circuits and Systems (ISCAS). IEEE Piscataway; 2010:457-460.
 30.
Kang HJ, Park IC: FIR filter synthesis algorithms for minimizing the delay and the number of adders. IEEE Trans. Circuits Syst. II: Analog Digital Signal Process 2001, 48(8):770-777. 10.1109/82.959867
 31.
Aksoy L, Costa E, Flores P, Monteiro J: Optimization of gate-level area in high throughput multiple constant multiplications. In European Conference on Circuit Theory and Design (ECCTD). IEEE Piscataway; 2011:609-612.
 32.
Gustafsson O, Dempster A: On the use of multiple constant multiplication in polyphase FIR filters and filter banks. 2004.
 33.
Aksoy L, Costa E, Flores P, Monteiro J: Design of low-complexity digital finite impulse response filters on FPGAs. 2012.
 34.
Wirthlin M: Constant coefficient multiplication using lookup tables. J. VLSI Signal Process 2004, 36:7-15.
 35.
Faust M, Chang CH: Bit-parallel multiple constant multiplication using lookup tables on FPGA. In International Symposium on Circuits and Systems (ISCAS). IEEE Piscataway; 2011:657-660.
 36.
Croisier A, Esteban DJ, Levilio ME, Riso V: Digital filter for PCM encoded signals. US Patent No. 3777130 (1973)
 37.
Zohar S: New hardware realizations of nonrecursive digital filters. IEEE Trans. Comput 1973, 22(4):328-338.
 38.
Peled A, Liu B: A new hardware realization of digital filters. IEEE Trans. Acoustics, Speech Signal Process 1974, 22(6):456-462. 10.1109/TASSP.1974.1162619
 39.
White SA: Applications of distributed arithmetic to digital signal processing: a tutorial review. IEEE ASSP Mag 1989, 6(3):4-19.
 40.
Sen W, Bin T, Jim Z: Distributed arithmetic for FIR filter design on FPGA. In International Conference on Communications, Circuits and Systems (ICCCAS). IEEE Piscataway; 2007:620-623.
 41.
Meher P, Chandrasekaran S, Amira A: FPGA realization of FIR filters by efficient and flexible systolization using distributed arithmetic. IEEE Trans. Signal Process 2008, 56(7):3009-3017.
 42.
Kumm M, Möller K, Zipf P: Reconfigurable FIR filter using distributed arithmetic on FPGAs. In International Symposium on Circuits and Systems (ISCAS). IEEE Piscataway; (accepted for publication in 2013)
 43.
Bailey D: Design for Embedded Image Processing on FPGAs. Wiley-IEEE Press, New York; 2011.
 44.
Willson A: Desensitized half-band filters. IEEE Trans. Circuits Syst. I: Regular Papers 2010, 57:152-167.
 45.
Lu WS, Wang HP, Antoniou A: Design of two-dimensional FIR digital filters by using the singular-value decomposition. IEEE Trans. Circuits Syst 1990, 37:354. 10.1109/31.45689
 46.
IBM Inc: IBM ILOG CPLEX Optimizer. http://www.ilog.com/products/cplex. Accessed 16 April 2013
 47.
IBM Inc: IBM ILOG CPLEX V12.1 - File Formats Supported by CPLEX (2009).
 48.
Lavin C, Padilla M, Lamprecht J, Lundrigan P, Nelson B, Hutchings B: RapidSmith: do-it-yourself CAD tools for Xilinx FPGAs. In International Conference on Field Programmable Logic and Applications (FPL). IEEE; 2011:349-355.
 49.
Simkins JM, Philofsky BD: Structures and methods for implementing ternary adders/subtractors in programmable logic devices. US Patent No 7274211, Xilinx Inc. (2006)
 50.
Baeckler G, Langhammer M, Schleicher J, Yuan R: Logic cell supporting addition of three binary words. US Patent No 7565388, Altera Corp. (2009)
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Kumm, M., Fanghänel, D., Möller, K. et al. FIR filter optimization for video processing on FPGAs. EURASIP J. Adv. Signal Process. 2013, 111 (2013). https://doi.org/10.1186/1687-6180-2013-111
Keywords
 Integer Linear Programming
 Mixed Integer Linear Program
 Finite Impulse Response
 Finite Impulse Response Filter
 Full Adder