FIR filter optimization for video processing on FPGAs

Two-dimensional finite impulse response (FIR) filters are an important component in many image and video processing systems. The processing of complex video applications in real time requires high computational power, which can be provided using field programmable gate arrays (FPGAs) due to their inherent parallelism. The most resource-intensive components in computing FIR filters are the multiplications of the folding operation. This work proposes two optimization techniques for high-speed implementations of the required multiplications with the least possible number of FPGA components. Both methods use integer linear programming formulations which can be optimally solved by standard solvers. In the first method, a formulation for the pipelined multiple constant multiplication problem is presented. In the second method, also multiplication structures based on look-up tables are taken into account. Due to the low coefficient word size in video processing filters of typically 8 to 12 bits, an optimal solution is found for most of the filters in the benchmark used. A complexity reduction of 8.5% for a Xilinx Virtex 6 FPGA could be achieved compared to state-of-the-art heuristics.


Introduction
Two-dimensional linear filters with finite impulse response (FIR) are one of the most fundamental operations used in image and video processing. They are used, e. g., in applications which contain contrast improvement, denoising, sharpening, target matching, and feature enhancement [1]. Compared to infinite impulse response filters, FIR filters have a strict stability, and highthroughput implementations are easily possible using pipelining as no recursions are involved. However, they are computationally expensive as many multiply accumulate (MAC) operations are necessary for each pixel of the resulting image. While this is very demanding for a microprocessor or digital signal processor, the inherent parallelism of field programmable gate arrays (FPGAs) can be used to accelerate the FIR operation.
Modern FPGAs directly incorporate embedded multipliers or DSP blocks which also include pre-and postadders for MAC operations. Xilinx's DSP blocks (Xilinx Inc., San Jose, CA, USA) of Virtex 5/6, Spartan 6, and the 7 series FPGAs provide 18×25-bit signed multipliers. More flexible are the variable precision DSP blocks of the latest *Correspondence: kumm@uni-kassel.de 1 Digital Technology Group, University of Kassel, Kassel 34121, Germany Full list of author information is available at the end of the article FPGAs of Altera, the Stratix V, Cyclone V, and Aria V devices (Altera, San Jose, CA, USA). Each DSP block can be configured as three independent 9 × 9-bit multipliers, two independent 16 × 16-bit, 15 × 17-bit, or 14 × 18-bit multipliers, or a single 18×36-bit or 27×27-bit multiplier.
However, embedded multipliers and DSP blocks are limited resources even on modern low-cost FPGAs, and they may have a higher power consumption compared to constant multiplication using the carry-chain resources [2]. Especially in image processing, embedded multipliers are often underutilized because of the small word lengths used. Typically, only 8 bits per color and 10 bits for a luminance representation are used.
In most filter applications, the coefficients are fixed, which can be used to reduce the complexity of the multiplication. Furthermore, partial results can be shared inside a single multiplier or between multipliers of different constants to reduce hardware resources. Different methods have been proposed over the years for such multiple constant multiplications (MCM): (a) MCM using additions, subtractions, and bit shifts ; (b) MCM using look-up tables (LUTs) and adders [34,35]; (c) Distributed arithmetic (DA) [36][37][38][39][40][41][42]. http://asp.eurasipjournals.com/content/2013/1/111 In method (a), constant multiplications are realized using additions, subtractions, and bit shifts only. These operations form a so-called adder graph, so this method is called the adder graph MCM method in the following. It was originally developed for software or VLSI applications [3] but also maps well to the fast carry chains of FPGAs. In method (b), the input word is split into smaller chunks that fit into the input word size of the FPGA LUTs. These LUT results are shifted and added afterward to form the multiplication result. LUTs and adders are also used in method (c), but there the folding equation of the FIR filter is rearranged in such a way that identical LUTs can be used. This is very beneficial in sequential FIR implementations but costly in parallel implementations. Because it was shown in the recent years that multiplier blocks using add, subtract, and shift operations (method a) consume considerable less logic resources compared to parallel DA implementations [5][6][7], the DA approach is not further considered. Due to the relatively large routing delays compared to the fast carry chain, a pipelined implementation of the adder graph is necessary to obtain the maximum speed of the FPGA [2,[5][6][7][8][9][10]. It was shown by Faust et al. [35] that the LUT-based approach (method b) is competitive to the adder graph method. Thus, pipelined circuits using the combination of methods (a) and (b) are investigated in this paper.

Contribution of this work
The main contribution of this article is the description of two novel optimization methods, one for the adder graph MCM problem including pipelining (the pipelined MCM problem [9]) and one for a combination of this method with a pipelined realization of the LUT-based method mentioned above. Each method is formulated as a boolean integer linear program (BILP, or 0-1 ILP) and then reduced to a mixed integer linear program (MILP). Hence, if the MILP solver finds a solution in reasonable time, an optimal solution for the given cost model is found. To the best of our knowledge, this is the first time an optimal method for solving the pipelined MCM (PMCM) problem is proposed.
The complexity of the adder graph MCM method heavily depends on the coefficient values, while the complexity of the LUT-based approach mainly depends on the input word size. Therefore, sometimes, one method or the other delivers better results. For this, a combination of both methods is proposed in this work by incorporating the LUT-based multipliers in the integer linear programming (ILP) formulation of the PMCM problem. Due to the low coefficient word size of typically 8 to 12 bits in image processing, a short convergence time of the ILP solver is very likely, which makes the proposed optimization an ideal candidate for image processing.
The remaining of this paper is organized as follows. The related work is discussed in the next section, followed by an introduction of the used FIR architectures for image processing. Then, an ILP formulation for the PMCM problem is described which is later extended for additional LUT-based multiplication. Finally, results from the optimizations and FPGA synthesis are presented and discussed, followed by a conclusion.

MCM using additions, subtractions, and bit shifts
Different methods have been proposed over the years to realize constant multiplication using additions, subtractions, and bit shifts only. Finding the optimal configuration of these operations is known as MCM problem, which has been an active research topic for almost the last two decades . The objective is usually defined by minimizing the number of adders and subtractors (shifts are assumed to be free, as they can be implemented using wires).
An example adder graph which realizes a multiplier block with the constants of the set {44, 130, 172} is shown in Figure 1a. Each node in the graph corresponds to an adder or subtractor, indicated by '+' or '−. ' The numeric node value represents the realized multiple of the input, i. e., node '1' corresponds to the input of the MCM block. To have a unique representation, all node values are defined to be odd. They can be made even by a simple bit shift as shown at the output. All edge weights represent left shifts, e. g., node '5' is realized by left shifting the input x by 2 bits and adding the unshifted input: 2 2 x + 2 0 x = 5x. Right shifts are represented by negative edge weights.
The MCM problem is NP complete [4]. Hence, most of the proposed algorithms are heuristics, and less work was directed toward optimal solutions. Early work was done by Bull and Horrocks [3] which was later extended by Dempster and Macleod to the modified Bull and Horrocks algorithm [14]. In the same work, the ndimensional reduced adder graph (RAG-n) algorithm was proposed which was one of the leading heuristics for years. Major improvements could be achieved by the work of Voronenko and Püschel with their cumulative benefit heuristic (H cub ) [4]. By spending a bit more algorithmic complexity and evaluating adder graph topologies up to a depth of three, they could reduce the required additions/subtractions by 20% on average compared to RAGn. A competing approach based on difference graphs was proposed by Gustafsson [16]. It tends to be beneficial compared to H cub in case large coefficient sets and/or low coefficient word lengths are used but may be worse in other cases.
Many approaches use ILP formulations, for which optimal solvers exist. However, the search space is often http://asp.eurasipjournals.com/content/2013/1/111  significantly reduced due to the used number representation which leads to non-optimal results. Minimum signed digit (MSD) number systems like canonic signed digit (CSD) are often used as they have a reduced complexity compared to the binary representation [18]. In MSD, a number is coded using the digits {−1, 0, 1}, such that the number of non-zero digits is minimal and, hence, the number of partial products is reduced. A 0-1 ILP model that uses subexpressions of length two in the CSD number system (i. e., subexpressions with at most two non-zeros) was used by Yurdakul and Dündar [19]. A 0-1 ILP model which can be used for any number system was proposed by Flores et al. [18]. Results for binary, CSD and MSD were presented. This work was further extended by Aksoy et al. with additional delay constraints [20], low-level area models [21], and a heuristic variation [22].
So far, the discussed publications did not result in globally optimal solutions as they use the reduced search space of a given number representation. Breadth-first search [23] and depth-first search [17] algorithms were proposed by Aksoy et al. to optimally solve the MCM problem in a graph-based way. The depth-first search is able to solve MCM instances in a reasonable time but cannot handle different constraints or cost metrics. Another interesting method to optimally solve the MCM problem was given by Gustafsson who transferred the MCM problem to the problem of finding a Steiner hypertree in a directed hypergraph [24]. He used an optimal 0-1 ILP formulation which is very generic and can be flexibly adopted to different cost metrics (at adder or logic level) and different constraints (adder depth and fan-out). The main drawback is its computational complexity. Nevertheless, it can be used for small MCM instances or to find lower bounds [25] by relaxing the model to a continuous LP problem. A 0-1 ILP formulation for optimally solving the special case of minimum depth adder graphs in a graph-based fashion was recently proposed [26]. In this work, the search space of a graph-based search is compared to the MSD search space in terms of variables of the ILP. The graph-based search needs three times more variables for 8-bit coefficients and 18 times more variables for 13-bit coefficients compared to MSD.
In the recent time, more and more MCM algorithms with different objectives were proposed. One objective is minimizing the power of the adder graph by reducing or minimizing the adder depth (AD) of each output, which is defined as the number of adder stages needed to compute a coefficient [26][27][28][29]. Limiting the maximum AD of all outputs can be used to find adder graphs with low delay [30]. If this delay is still too large, pipelining can be used to speed up the circuit [5,8,9,31].
Pipelining plays a crucial role in high-performance adder graph realizations on FPGAs as they have relatively large routing delays compared to their fast carry chains. However, the addition of pipeline registers may significantly increase the complexity. An example of the pipelined adder graph (PAG) of Figure 1a using an as-soon-as-possible (ASAP) scheduling for placing the pipeline registers is shown in Figure 1b. Here, each rectangular node includes a pipeline register, i. e., nodes without any '+' or '−' operator are pure registers. Many resources can be saved by finding the optimal schedule of pipeline registers for a given adder graph. Compared to the ASAP schedule, 29% less pipelined operations and speedups of about 300% were achieved on average using a slice overhead of only 18% compared to the non-pipelined adder graph [8]. The PAG using the optimal schedule of the adder graph of Figure 1a is shown in Figure 1c. However, directly considering pipelining during the adder graph optimization can further reduce the resources as demonstrated in Figure 1d. Heuristics for this kind of direct optimization were proposed recently [7,9]. A reduction http://asp.eurasipjournals.com/content/2013/1/111 of pipelined operations by 10% compared to the optimal pipelined adder graphs [8] could be achieved by the reduced pipelined adder graph (RPAG) algorithm [9]. Slice reductions of 16% on average were reported using the best result out of three algorithms (C1 + , RSG Improved and a genetic algorithm) [7].

MCM using look-up tables
A totally different method for single constant multiplication which uses the look-up tables of FPGAs was proposed by Wirthlin [34]. The idea is to split the multiplication into smaller chunks, e. g., 4 bits for FPGAs with fourinput LUTs, which can be directly realized using a single stage of LUTs. These LUT results have to be shifted and added to get the final product. Several techniques were proposed to minimize the number of redundant LUTs [34]. An extension of LUT-based multipliers to MCM was presented by Faust et al. [35], where identical LUTs are also shared between different constant multipliers. It was shown that the maximal number of required LUTs is far less than combinatorially possible. Their benchmark results include a comparison with adder graph MCM, which shows that their graph-based MinLD MCM algorithm [29] sometimes uses less resources and sometimes more resources than the LUT-based method for an input word size of 8 bits on an FPGA with four-input LUTs. As their method does not include pipelining, shorter delays could be achieved using the LUT-based MCM.

Two-dimensional FIR filter architectures
An image of height M and width N is usually represented by an M × N matrix X. The matrix elements are written in lower case in the following, i. e., x m,n denotes the luminance of a pixel at position (m, n). The two-dimensional folding with the P × Q folding matrix H is defined as where u = P 2 and v = Q 2 denote the center of the folding matrix H. Note that P and Q are often identical (matrix X is square) and odd.
Architectures for computing this folding equation can be classified by the computed output pixels per clock cycle. A sequential realization may compute a single MAC operation per clock cycle producing one output pixel every PQ clock cycles, like this is done on conventional microprocessors. As we are interested in accelerating the filter operation, parallel realizations are considered in the following. A direct implementation of the folding equation for a 3 × 3 folding matrix in a parallel way which is able to compute one output pixel every clock cycle is shown in Figure 2a. First, the input pixel stream is clocked into a chain of registers and FIFO buffers to provide a 3 × 3 block of the image. Then, this block is processed in parallel by a sum-of-products (SOP) operation according to (1), i. e., each pixel of the block is multiplied by its corresponding element in the folding matrix and all products are added to yield the final output pixel. An alternative architecture can be obtained by transposing the structure of Figure 2a by reversing the directions of each edge, replacing branches by adders and vice versa, and swapping the input and output [43]. The resulting architecture is shown in Figure 2b. The SOP block results in an MCM block after transposition, i. e., now a single input has to be multiplied by all elements of the folding matrix. Note that the outputs of both architectures are invalid for the boundary pixels of the input image. Thus, if the output image has to have the same size as the input image, a suitable boundary treatment (setting boundary pixels to black/white, copying from neighbor pixels, etc.) has to be implemented [43]. MCM blocks of Figure 2b can be obtained by one of the algorithms discussed above. The SOP circuit of Figure 2a can be either obtained by using LUT multipliers and additional adders, or by transposing the adder graph which was obtained by an adder graph MCM method. It is well known that the adder cost for a single-input single-output adder graph is equal to its transposed form [44]. More general, Gustafsson et al. derived the cost of an adder graph with N i inputs and N o outputs after transposition for the generalized case of vector matrix multiplication [32]: Here, N A and N T A are the adder cost before and after transposition, respectively. MCM is a special case of vector matrix multiplication with N i = 1, so if N A,MCM adders are needed for an optimal MCM adder graph with N unique coefficients, N A,SOP = N A,MCM + N − 1 adders are needed for the corresponding optimal SOP. Therefore, the eight additional adders shown in the transposed form in Figure 2b are exactly the additional adders needed to compute the SOP in Figure 2a. Hence, from its complexity, there is no difference between the direct or transposed form. If we take a look on pipelined implementations of MCM blocks, the situation becomes worse. While transposing a pipelined MCM block still leads to a valid pipeline, the pipeline registers may not any longer be located in the critical path between the adders. Therefore, we concentrate in the following on pipelined implementations of the MCM block, which can be directly incorporated in the transposed form, as shown in Figure 2b.
A totally different filter structure can be realized if the folding matrix is separable, i. e., matrix H can be separated in two vectors h 1 and h 2 with H = h 1 · h 2 . Then, the two-dimensional filter can be realized by cascading http://asp.eurasipjournals.com/content/2013/1/111 two one-dimensional filters, one for processing the rows and one for the columns. This reduces the PQ multiplications to P + Q multiplications [43]. If the filter cannot be separated, then there exist methods to decompose the filter into a sum of separable filters using singular value decomposition [45]. However, only a fraction (which is typically less than one third) of the multipliers of the unseparated folding matrix have to be realized due to symmetric, zero, or powerof-two coefficients. Furthermore, in both MCM methods discussed above, many resources can be saved when intermediate computation results are shared between the constant multipliers. This is only possible if all multipliers have the same input which is not the case for separated filters. Therefore, we concentrate on the architecture shown in Figure 2b as it is generic (any folding matrix can be realized) and pipelined MCM blocks can be directly used.

Pipelined MCM problem formulation
In an adder graph, each node unequal one is generated by a so called A-operation, introduced by Voronenko et al. [4] (compare with Figure 1a) with configuration q = (l 1 , l 2 , r, sg), where l 1 , l 2 , r ∈ N 0 are shift factors, the sign bit sg ∈ {0, 1} denotes whether an addition or subtraction is performed and u and v are positive, odd input arguments. A valid configuration q is a combination of l 1 , l 2 , r, and sg such that the result is a positive odd integer. The greatest effort during MCM optimization is finding the numerical values of non-output nodes, i. e., the values of all u and v which are not in the coefficient set. Once all these node values are found, it is easy to determine the configuration q of the corresponding adder graph, e. g., using the optimal part of RAG-n [14] or H cub [4]. Since the same can be applied for PAG optimization, it is appropriate to define a set X s for each pipeline stage s, containing the node values of the corresponding stage. The pipeline sets for the PAG in Figure 1d are, e. g., X 0 = {1}, X 1 = {7, 9}, and X 2 = {11, 43, 65}. With this representation, we can formally define the PMCM problem: The upper bound is usually chosen as x max = 2 b max +1 [4], where b max is equal to the maximum bit width of T. The number of stages S has to be at least the largest minimal adder depth of the coefficients. The minimal AD of an integer x can be directly computed using where nz(x) represents the minimal number of non-zero digits of x in canonic signed digit (CSD) representation [29]. As each additional pipeline stage introduces additional nodes in the PAG, it is very unlikely that there exists a graph with higher S but less cost. Therefore, we define S to be the minimum number of stages which is possible: The area cost of pipeline sets X 0...S depends on the target architecture, and it will be discussed in the following sections.

FPGA cost of pipelined adder graphs
We use the area (i. e., the number of FPGA components) as cost measure, since the maximum speed results from the architecture and is given by the largest ripple carry adder and the routing delay which is hard to predict. Different cost metrics are used for evaluating the size of an FPGA circuit. Common metrics are counting the LUTs, flip-flops, slices (Xilinx devices), or logic elements (Altera http://asp.eurasipjournals.com/content/2013/1/111 devices). As different FPGA components are used in this work, the smallest common piece of logic, consisting of LUT(s), FF(s), and full adder logic, is counted and referred to as basic logic element (BLE) in the following. Simplified block schematics of BLEs of the Virtex 4 and Virtex 6/7 architectures are shown in Figure 3. On Virtex 4, each BLE is half of an FPGA slice, and on Virtex 6/7, it is a quarter of a slice. While the Virtex 4 BLE provides a four-input LUT, a flip-flop and additional logic for building a full adder (AND gate, XOR gate, and a multiplexer), the BLE from the later generations provide a six-input LUT which can be used as two five-input LUTs with shared inputs, two flip-flops, and full adder logic (XOR gate and multiplexer).
Each BLE can realize a single full adder including the register for pipelining. Thus, for evaluating adder graphs, the number of full adders, or pure registers can be counted. A detailed cost model of common adder graphs mapped to FPGAs was recently presented in [33]. This model respects the replacement of full adders by simple wires if the two shifted numbers do not overlap at every bit position. However, for pipelined adders on FPGAs, full adder and pipeline register are realized in a single BLE resource. Hence, saving some full adders has no effect on BLE resources as the pipeline register is needed in any case. The number of BLEs of an A-operation is equal to the output word size of the corresponding adder. If there is no right shift (r = 0), the output word size is identical to the sum of input word size and the word size needed to represent node value w. If a right shift r > 0 is used, the word size of the corresponding adder has r additional output bits. Even if these additional output bits are ignored after the right shift, they are needed to compute the carryin for higher order bits to yield a correct result. Hence, the number of BLEs of each A-operation in a pipelined adder graph can be summarized to where B x is the input word size and w = A q (u, v).

ILP formulation for the pipelined adder graph problem
Before the ILP formulation for the PAG problem is described, some variables and auxiliary sets are introduced. There are two types of binary decision variables used in the formulation: Hence, there is one variable a s w for each stage s and each possible element w and one variable a s (u,v,w) for each w and each possible pair (u,v), from which w can be computed with a single A-operation.
To determine these combinations, we use the definition of the A * (u, v) set [4], which contains all elements that can be computed from u and v using a single A-operation: For convenience, the A * set is also defined for a set X: Note that A * ({u, v}) is different from A * (u, v) as the first one contains all combinations of u and v: With this set, we can now define the set A s of all possible odd variables which may be used at pipeline stage s. It can be recursively defined by As not all elements of A s are needed to compute the target coefficients, the set S s ⊆ A s is defined which contains all single elements that may be used to compute the target coefficients T in the last stage. Furthermore, with T s we denote the set of (u, v, w) triplets for which w ∈ S s can be computed using u and v of the previous stage s − 1. Now, the sets S s and T s can be computed recursively, starting from the last stage S, where S S is identical to the odd target coefficients excluding zero: Using these variables and auxiliary sets, we can define the ILP formulation for the PAG problem as follows: http://asp.eurasipjournals.com/content/2013/1/111

ILP Formulation 1 : Pipelined adder graph optimization
The subject is to minimize the BLE cost of all required A-operations. Constraint (C1) simply forces that all target coefficients are realized in the last pipeline stage.
To realize an element a s w , one of the possible Aoperations to compute w must be available, which is constrained by (C2). The last constraint (C3) ensures that an A-operation w = A q (u, v) can only be realized in stage s when u and v are both available in stage s − 1.
In the formulation, we use the relaxed condition a s (u,v,w) ≥ 0 instead of requiring a s (u,v,w) ∈ {0, 1} which results in an MILP formulation. Using continuous variables a s (u,v,w) instead of binary ones reduces the runtime of the optimization substantially. This is possible since we can construct a binary solution that is feasible and minimal from the relaxed solution.
Proof. Assume an optimal solution of ILP Formulation 1 is given. Due to (C3), all variables a s (u,v,w) satisfy the condition but in the optimum solution of ILP Formulation 1, a variable a s w = 1 can exist with several variables a s (u,v,w) > 0. Let us define a set T w , which contains all triplets (u, v, w) with a s (u,v,w) > 0. Then, there exist at least one triplet (u , v , w) ∈ T w with the least cost contribution: Now, a feasible binary solution of the formulation can be obtained by assigning new valuesâ s (u ,v ,w) = 1 and a s (u,v,w) = 0 for all (u, v, w) ∈ T w \ {(u , v , w)}. Since the contribution of the triplets in T w to the objective function is and since the objective function is minimized, the new solution is also minimal and the objective values are equal. If we do this construction for all variables a s w = 1, we obtain an optimum solution where all variables are binary.
ILP Formulation 1 is very generic and is extendable into different directions. For optimizing MCM blocks on application specific integrated circuits (ASICs), the cost function in the objective can be easily divided into two cost functions: one for pipelined adders (cost A ) and one for pure registers (cost R ) as both require different ASIC resources: A modified cost function can also be used to influence the number of adders within a pipeline stage. Instead of placing pipeline registers after each stage of adders, they could be placed behind multiple adder stages (e.g., every second or third stage). This would reduce the register resources for applications where a lower speed is sufficient or a lower latency is required. For that, the cost function for pure registers (as for ASICs) can be set to zero in each stage in which no registers should be placed. In the implementation, these registers are exchanged by wires. http://asp.eurasipjournals.com/content/2013/1/111 Another extension of the ILP formulation would be the support of dedicated shift registers which are provided by the latest Xilinx FPGA architectures. These shift registers are realized in a single FPGA LUT and provides up to 16 bits on Virtex 4 FPGAs (SRL16 primitive) and up to 32 bits on Virtex 5/6, Spartan 6 and 7 series FPGAs (SRLC32E primitive). In other words, up to 16 or 32 registers (plus one additional register at the output) can be realized at the same BLE cost like a single register. As long as S < 17 or S < 33, constraint (C3) in ILP Formulation 1 can be replaced by: While (C3a) is nearly identical to (C3), (C3b) allows the bypass of several stages using shift registers. Of course, these values are not accessible for intermediate stages.

Multiple constant multiplication using LUT multipliers
In this section, a LUT-based constant multiplier method is introduced, which is used to extend ILP Formulation 1.

LUT-based constant multiplication
Consider a number x which is represented in two's complement format using B x bits: If this number is multiplied by a constant c n with B c bits, the resulting B c × B x multiplication can be divided into several smaller multiplications by rearranging the partial sum terms: Now, K = B x L multiplications of size B c × L are necessary. In case that B x is not divisible by L, the number x has to be sign extended to B x = KL bits. The idea of Wirthlin [34] was to realize each B c × L multiplier using L-input LUTs, where L has to be chosen to fit the FPGA LUT input size. The resulting multiplier structure for an example of a 12 × 4 multiplier using four-input LUTs [34] is shown in Figure 4.

LUT minimization techniques
Several methods were proposed by Wirthlin to reduce the hardware complexity of the circuit. In the following cases, LUTs can be eliminated: (a) Removal of constant LUTs, i. e., LUTs that are always '0' or '1' can be replaced by a constant. (b) Removal of LUTs which are identical to one of the inputs, i. e., the output can be connected to the corresponding input. (c) Removal of redundant LUTs, i. e., LUTs with identical content and identical inputs can be shared.
In particular, the removal of redundant LUTs can be used to reduce logic when multiple constants are involved [35]. It was shown that it is very likely that identical LUTs occur in MCM operations. For example, only 52 unique LUTs are required to be able to compute all signed products with constants from 1 to 2 12 using four-input LUTs. Compared to adder graphs with minimal adder depth, the latency could be reduced by approximately 33% on average for an input word size of 8 bits. However, even for this small input word size, clock frequencies of about 100 MHz could be achieved which is far less than what is possible on the used Virtex 4 FPGA. Hence, in our work, pipelined realizations of LUT multipliers are investigated, as shown in Figure 5. In the first pipeline stage, the LUT content is looked up, and the results are shifted and added according to (25) using a pipelined adder tree. Note that some of the bit shifts can be moved toward the output of the adder tree to reduce the word size in each stage (not shown in Figure 5). No extra BLE has to be reserved for pure registers as the flip-flops of BLEs using LUTs, and full adders are used. But from the reduction methods, only the removal of redundant LUTs can be used as these can be shared together with the flip-flop, while the other methods still require a BLE for a single flip-flop.

ILP formulation for the combined pipelined adder/LUT graph optimization
ILP Formulation 1 of the PAG optimization is now extended to include LUT multipliers. Thus, the ILP solver should decide whether a target coefficient is realized using adders or using LUTs. Of course, if more than one LUT multiplier is used, the sharing of identical LUTs should be included in the cost model. For that, the set L s v,w is defined, which contains all LUTs of stage s which are needed to compute each bit of w from v and a new boolean decision variable is introduced: If a LUT multiplier is inserted in the adder graph to compute w from node v, the variable a s (0,v,w) is set to one (the case u = 0 does not exist in ILP Formulation 1). The corresponding (0, v, w) triplets have to be added to T s . Now, ILP Formulation 1 can be extended by two constraints (C4) and (C5) (the remaining constraints remain nearly identical) and a modified objective function: Two cost functions were added, cost LUT includes the cost of a single LUT and cost AT (u, v, w) corresponds to the cost of the adder tree which is necessary to compute the sum of the LUT outputs. The splitting of the cost function is necessary to respect the reduced cost for shared LUTs. For Virtex 4 or Virtex 6/7 using sixinput LUTs, each LUT consumes one BLE (cost LUT = 1). For Virtex 6/7 using five-input LUTs, the LUT costs can be approximated to cost LUT = 1/2 as only one half of a BLE is used for a single LUT. The adder tree costs (cost AT (u, v, w)) depend on v, w and the LUT multiplier pipeline depth, which is defined by the depth of the adder tree plus one for the LUT registers:

ILP Formulation 2 : Combined pipelined adder graph/LUT optimization
The additional constraints (C4) and (C5) are defined to incorporate LUT realizations during optimization. Constraint (C4) is similar to (C3): If a LUT multiplier is used to compute w in stage s from v in stage s − LD(w/v), the corresponding node v must be available in that stage. Constraint (C5) ensures that the corresponding LUTs are available in the correct pipeline stage to compute w from v. In first experiments, we found only cases where LUT multipliers directly compute target coefficients of the last stage. Hence, we decided to reduce the search space by evaluating the constraints (C4) and (C5) only for the target constants of the last stage s = S.

Results
To evaluate the performance of the two ILP formulations, a benchmark set of folding matrices was designed using Matlab. The coefficients were computed using the fspecial() function except from the lowpass and highpass filters. These were obtained by the Parks-McClellan algorithm (remez() and ftrans2() functions). All real coefficients were quantized to a word size of B c . The complete convolution matrices are given in Appendix 1; the filter parameters are summarized in Table 1 with their matrix size, word size, the required pipeline stages S using PMCM, parameters for their design, and the N uq unique odd coefficients of their folding matrix.
The BILP solver of Matlab is slow compared to the CPLEX optimizer from IBM [46] and does not provide a solver for MILP problems. CPLEX provides a text file interface with a comfortable human readable syntax (LP file format [47]). Hence, Matlab was used to generate the MILP models as LP files, and CPLEX was used for solving them. Then, Matlab was used to read in the solution file of CPLEX in XML format [47] for generating synthesizable VHDL code. All MCM blocks were optimized for an input word size B x of 8, 10, and 12 bits. The results are compared with the RPAG algorithm [9] and the LUT MCM method of [35], which was applied to the pipelined realization, as shown in Figure 5. The RPAG Table 2 Optimization results in terms of the number of basic logic elements (BLE) for the previous methods RPAG [9] using R iterations, a LUT MCM block [35] including pipelining and the proposed methods for optimal pipelined adder graphs using ILP Formulation 1 (Optimal PAG) and optimal pipelined adder/LUT graphs using ILP Formulation 2 (Optimal PALG), including the coefficients realized by LUTs (LUT Coeffs) algorithm is a greedy heuristic. As the best local choice in a greedy algorithm does not necessarily lead to the best global solution, it allows to randomly select one of the nth best decisions. Then, the best result out of several runs can be taken to improve the optimization. RPAG was configured for a single run (R = 1, pure greedy), and R = 50 runs per MCM instance where the locally first or second best decision was randomly selected. These results were considered as state-of-the-art reference. The optimization was performed for the Virtex 6/7 FPGA architecture, and CPLEX was configured with a computation time limit of 8 hours.
The results are summarized in Table 2 for RPAG (R = 1 and R = 50) and the LUT MCM method which are compared to the proposed pipelined adder graph (Optimal PAG) and pipelined adder/LUT graph (Optimal PALG) methods. An optimal solution could be found within the time limit for all cases, except for the lowpass 15 × 15 instances. Here, the best obtained result is shown as upper bound. All solutions are given as PAG in Appendix 2.
The coefficients which has to be replaced by LUT multipliers to obtain the PALG solution are given in the last column of Table 2. Comparing RPAG with the optimal PAG solution, it is apparent that RPAG (R = 50) often finds an optimal solution (in 24 out of 32 cases). For the three cases where the timeout of the ILP solver occurred, RPAG finds slightly better solutions. Comparing the pipelined LUT MCM method with RPAG and optimal PAG, the LUT MCM method performs better for low-input word sizes as expected. For B x = 8, it is always the best choice; for B x = 10, it is sometimes the best; and for B x = 12, it is never the best choice. This is different to the combined PALG optimization which found the best results in all cases where an optimal solution was obtained. For B x = 8, all instances were pure LUT MCM realizations; for B x = 10, there is a mixture of pure adder graphs, pure LUT realizations, and combinations of both; while for B x = 12, the adder graph realizations dominate. An example is given in Figure 6. While the PAG in Figure 6a needs two elements in the second pipeline stage (1 and 7) to compute coefficient 121, this coefficient is realized with a two-stage LUT multiplier in Figure 6b, saving one element in stage two. An overall reduction of 1.5% and 7.2% of BLEs compared to RPAG could be achieved for the optimal PAG and PALG methods, respectively. The minimum, maximum, and average computation times on an Intel Nehalem 2.93-GHz computer (Intel, Santa Clara, CA, USA) for all used optimization algorithms are given in Table 3. All heuristics take up to a few seconds; the optimal methods are very fast for the pipeline depth S = 2 cases in Table 1 but take hours for the three matrices with S = 3 (up to the time limit of 8 h = 28,800 s for the lowpass 15 × 15 instances). However, the optimization time is reasonable compared to the common FPGA development effort.
To validate the optimization results, synthesis experiments were performed for a Virtex 6 XC6VLX75T-2FF484 FPGA. For that, VHDL code generators were developed in Matlab for all the examined architectures. The synthesis was performed using Xilinx ISE v13.4. Unfortunately, the BLE usage cannot be directly obtained by counting the FFs, LUTs, or slices reported by the tool, as a Virtex 6 may use more than one FF per BLE, a LUT may be unused, and many slices may not be fully used but can be used for other logic. Therefore, the resulting netlist after place&route was converted to an XDL file, and the XDL-Parser from the RapidSmith Project [48] was used to obtain the BLEs directly from the netlist. All synthesis results are listed in Table 4. It can be seen that in most cases the optimal PALG solution leads to the best synthesis result. However, there are some cases where the estimated cost model fails Table 4 Synthesis results using the same benchmark instances as in Table 2 2  13  7  2  0  0  0  1  5  13  27  40  45  40  27 13 to predict the synthesis. The following cases have been identified by examining the netlists with the Xilinx' FPGA Editor: 1. A single Virtex 6 BLE is able to realize two flip-flops (see Figure 3b), but sometimes, ISE uses one BLE to realize a single register and sometimes to realize two registers. 2. Sometimes, ISE maps a single flip-flop in a BLE that is also used for a full adder.
Hence, an overestimate was done in the cost model with the assumption that single flip-flops are mapped to a single BLE. Due to the large shifts in the LUT multiplier architecture (see Figure 5), the least significant bits in the adder tree are pure flip-flops. Thus, it is more likely that a LUT multiplier is overestimated and adders are used instead. However, the synthesis results show that a BLE reduction of 1.3% and 8.5% on average compared to RPAG could be achieved for the optimal PAG and PALG method, respectively. The speed of the different architectures is very similar. In 75.8% and 78.8% of the cases for optimal PAG and PALG, respectively, they were even faster than the maximum clock frequency of the embedded DSP48E1 multiplier (which is 600 MHz).

Conclusion and outlook
Two ILP formulations to optimize the multiple constant multiplication on FPGAs were presented and analyzed using synthesis experiments. The first one is a formulation of the PMCM problem, for which only heuristics exist [7,9]. It was shown that better results are achievable for the used low word size coefficients. For most http://asp.eurasipjournals.com/content/2013/1/111 instances, the RPAG heuristic is also able to find an optimal solution (in 24 out of 32 cases). The second ILP formulation incorporates pipelined LUT-based multipliers. It was shown that this combined method outperforms the PMCM realization in particular for a low-input word size of 8 to 10 bits.
The used ILP solver CPLEX was able to find an optimal solution for most of the test instances within 8 h of computation time. If not, the best feasible solution was close to that of the RPAG heuristic. Synthesis experiments were performed for the Xilinx Virtex 6 architecture, showing that compact and fast multiple constant multipliers are obtained. A resource reduction of 8.5% was achieved compared to the state-of-the-art while having approximately the same speed. Thus, the proposed optimizations can be beneficially used for many Future work could be extended into different directions. The examinations about the mismatch between cost model and synthesis results could be used to improve the results. The slice flip-flops in adders which were not used by synthesis could be utilized to implement pure registers. Furthermore, pure registers could be forced to use all of the eight available slice flip-flops. In the ILP model, this could be respected by reducing the cost for registers, e. g., setting the cost function for one flip-flop to a half BLE instead of a full BLE. The physical realization can be done using low-level placement constraints. However, this could reduce the performance as fixed placements (even relative) may limit a timing driven place & route optimization.
Another extension could be the optimization with adder graphs containing ternary adders, i. e., adders with three inputs. Modern FPGAs provide methods to implement ternary adders with the same number of slices/ALMs as needed for a two-input adder with equal output word size but with a reduced speed due to a longer critical path [49,50]. The ILP formulations could be extended in that direction using quadruplets instead of the (u, v, w) triplets which allows the modeling of three inputs instead of the two inputs u and v. However, this would substantially increase the search space, so one has to investigate if a solution can be found in an acceptable runtime which is left open for future research.