Theoretical lower bounds for parallel pipelined shift-and-add constant multiplications with n-input arithmetic operators
EURASIP Journal on Advances in Signal Processing volume 2017, Article number: 31 (2017)
Abstract
New theoretical lower bounds for the number of operators needed in fixed-point constant multiplication blocks are presented. The multipliers are constructed with the shift-and-add approach, where every arithmetic operation is pipelined, and with the generalization that n-input pipelined additions/subtractions are allowed, along with pure pipelining registers. These lower bounds, tighter than the state-of-the-art theoretical limits, are particularly useful in early design stages for a quick assessment of the hardware utilization of low-cost constant multiplication blocks implemented in the newest families of field-programmable gate array (FPGA) integrated circuits.
Introduction
Multiplication by constants is a common operation in digital signal processing (DSP) systems. In hardware, a multiplication is demanding in terms of area and power consumption. However, the single constant multiplication (SCM) and multiple constant multiplication (MCM) operations can be implemented by using only shifts, additions, and subtractions, with the last two usually referred to collectively as additions [1]. The SCM case is when an input is multiplied by a constant coefficient (Fig. 1a), and the MCM operation is when an input is multiplied by a set of constant coefficients (Fig. 1b) [2]. Theoretical lower bounds for the number of adders and for the number of depth levels, i.e., the maximum number of serially connected adders (also known as the critical path), in SCM, MCM, and other constant multiplication blocks that are constructed with two-input adders under the shift-and-add scheme have been presented in [3]. Tighter lower bounds, as well as a new bound, namely, the one for the number of extra adders required to preserve the lowest number of depth levels, were presented in [4] for the SCM case. Nevertheless, there are no theoretical lower bounds for constant multiplication blocks that include multiple-input additions/subtractions and pipeline registers in the involved arithmetic operations. This type of operation has become very important, mainly when the pipelined constant multiplication blocks are implemented in the increasingly popular field-programmable gate array (FPGA) platforms. This is because the logic blocks of FPGAs include memory elements, and thus pipelining comes at a low extra cost [5,6,7,8,9,10,11,12]. Currently, the use of three-input adders has started to gain importance, since the logic blocks of the newest families of FPGAs are bigger and can fit more complex adders using nearly the same amount of hardware resources [10,11,12].
In particular, in the last two decades, many efficient high-level synthesis algorithms have been introduced for the multiplierless design of constant multiplication blocks. The common cost function to be minimized in these algorithms is the number of arithmetic operations (additions and subtractions) needed to implement the multiplications. Nevertheless, the critical path has the main negative impact on speed and power consumption [13,14,15,16,17,18]. Therefore, substantial research activity is currently targeting both application-specific integrated circuits (ASICs) [19,20,21] and FPGAs [5,6,7,8,9,10,22,23,24,25], where the ultimate goal is to minimize the number of arithmetic operations subject to a minimum number of depth levels.
On the other hand, even though ASICs still provide higher performance and lower power consumption, the increased development time and manufacturing cost that come with smaller CMOS transistor technologies have opened a large market for FPGAs. FPGA technology gives signal processing engineers the ability to construct a custom data path that is tailored to the application at hand [26, 27]. FPGAs offer the flexibility of instruction-set digital signal processors, while providing the processing power and flexibility of an ASIC, and enable significant design cycle compression and time-to-market advantages, an important consideration in an economic climate with ever-decreasing market windows and short product life cycles [28, 29].
The novelty of this paper is the introduction of theoretical lower bounds for the number of operations necessary to implement pipelined single constant multiplication (PSCM) and pipelined multiple constant multiplication (PMCM) blocks that are constructed with the shift-and-add scheme. For the derivation of these bounds, we consider that an n-input (where n is an integer) pipelined addition/subtraction and a single pipeline register have the same cost. As mentioned earlier, this assumption fits particularly well when n is set equal to 3 and the target platforms for implementation are the newest FPGAs from the two most dominant manufacturers, Xilinx and Altera. However, it is worth highlighting that n = 2 is still in common use in many applications. This contribution is important because the optimality of different algorithms that reduce the number of operations in PSCM and PMCM blocks can be tested using appropriate theoretical lower bounds. Additionally, these bounds can be useful for developing new algorithms.
The paper is organized as follows. In the next section, definitions and methods needed to address the proposal are given. Section 3 presents the new theoretical lower bounds along with theorems and proofs to support the derivation of these bounds. Comparisons with previous theoretical lower bounds from [3] and [4] are provided in Section 4. Finally, conclusions are given in Section 5.
Definitions of terms
The constant multiplications referred to here are expressed in fixed-point arithmetic because implementations in this number representation have higher speed and lower cost, and are thus usually employed in DSP algorithms [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,30,31,32,33,34,35,36,37,38,39,40]. Only positive odd integer constants are considered, since this is a useful simplification that does not affect the formulation of constant multiplication problems. In this sense, a constant c can be expressed simply in binary form, as follows:
\( c=\sum_{i=0}^{B-1}{b}_i{2}^i, \) (1)
where b_{i} ∈ {0, 1} is the i-th bit and B is the word length [31]. We can express the product of a variable input X by a constant c with the shift-and-add approach, using the binary representation of that constant to dictate the multiplier structure. For example, the product 47X, with 47 = 2^{5} + 2^{3} + 2^{2} + 2^{1} + 2^{0} (i.e., a binary string “101111”), needs four additions and has a critical path of three additions, as shown in Fig. 2. The implementation cost of a shift-and-add constant multiplier is the number of arithmetic operations, since products by powers of two are implemented as hardwired shifts with no practical cost.
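As a rough illustration of the binary shift-and-add scheme, the following Python sketch multiplies an input by a constant using one shifted copy of the input per “1” bit and counts the additions. The function name shift_add_binary is hypothetical, and the terms are summed sequentially for simplicity, so the sketch counts operations but does not reproduce the adder tree of Fig. 2.

```python
def shift_add_binary(x: int, c: int) -> tuple[int, int]:
    """Multiply x by the positive constant c using shifts and additions only.

    Returns (product, number_of_additions). One shifted copy of x is
    produced per '1' bit of c; the copies are then summed.
    """
    if c <= 0:
        raise ValueError("c must be a positive integer")
    # Shifted copies of x, one for each nonzero bit of c (shifts are free).
    terms = [x << i for i in range(c.bit_length()) if (c >> i) & 1]
    product = terms[0]
    additions = 0
    for term in terms[1:]:
        product += term
        additions += 1
    return product, additions


product, additions = shift_add_binary(3, 47)
print(product, additions)  # 141 4
```

With five “1” bits in 47, the sketch uses four additions, in agreement with the count given for 47X.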
It is worth highlighting that additions and subtractions require practically the same amount of resources in a hardware implementation. Hence, signed digit (SD) representations of a constant can reduce the aforementioned implementation cost because they employ negative digits, which represent subtractions. An SD representation of a constant is given in the form
\( c=\sum_{i=0}^{B-1}{d}_i{2}^i, \) (2)
where d_{i} ∈ {−1, 0, 1}, with “−1” usually expressed as \( \overline{1} \) [32]. Among them, the canonical signed digit (CSD) representation is convenient since its number of nonzero digits is the minimum number of signed digits (MNSD) [3]. Besides, each nonzero digit is followed by at least one zero, which makes the representation unique. The CSD form of a constant can be found from its binary form by iteratively substituting every string of k digits “1” (say, “1111”) with a string of k − 1 digits “0” between a “1” and a “−1” (the string “1111” becomes “\( 1000\overline{1} \)”). In this case, the product 47X, with 47 = 2^{6} − 2^{4} − 2^{0} (i.e., a CSD string “\( 10\overline{1}000\overline{1} \)”), needs two subtractions and has two operations in its critical path, as shown in Fig. 3.
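The conversion to CSD can be sketched in Python as follows. This is a minimal illustration that uses a digit-by-digit recoding (inspecting the constant modulo 4) instead of the string substitution described above; since the CSD form is unique, both routes give the same digits. The names csd_digits and mnsd are hypothetical.

```python
def csd_digits(c: int) -> list[int]:
    """Return the CSD digits of c >= 0, least significant digit first.

    Each digit is -1, 0, or 1, and no two adjacent digits are nonzero.
    """
    if c < 0:
        raise ValueError("c must be non-negative")
    digits = []
    while c:
        if c & 1:
            # c mod 4 == 1 -> digit +1; c mod 4 == 3 -> digit -1.
            d = 2 - (c & 3)
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def mnsd(c: int) -> int:
    """Number of nonzero CSD digits, i.e., the MNSD of c."""
    return sum(1 for d in csd_digits(c) if d)


print(csd_digits(47)[::-1])  # [1, 0, -1, 0, 0, 0, -1]
print(mnsd(47))              # 3
```

The three nonzero digits found for 47 correspond to the two subtractions mentioned above.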
In a constant multiplication block, the A-operation [30] represents a two-input addition or subtraction along with shifts, and it is defined as
\( {A}_q\left({u}_1,{u}_2\right)=\left|{2}^{l_1}{u}_1+{\left(-1\right)}^{s_2}{2}^{l_2}{u}_2\right|{2}^{-r}, \) (3)
where l_{1} ≥ 0 and l_{2} ≥ 0 are left shifts, r ≥ 0 is a right shift, s_{2} is a binary value, i.e., s_{2} ∈ {0, 1}, q is the set of parameters (the so-called configuration) of the A-operation, i.e., q = {l_{1}, l_{2}, r, s_{2}}, and u_{1} and u_{2} are odd integers. For three-input adders, the A-operation is [10]
\( {A}_q\left({u}_1,{u}_2,{u}_3\right)=\left|{2}^{l_1}{u}_1+{\left(-1\right)}^{s_2}{2}^{l_2}{u}_2+{\left(-1\right)}^{s_3}{2}^{l_3}{u}_3\right|{2}^{-r}, \) (4)
where l_{1} ≥ 0, l_{2} ≥ 0, and l_{3} ≥ 0 are left shifts, r ≥ 0 is a right shift, s_{2} and s_{3} are binary values, q = {l_{1}, l_{2}, l_{3}, s_{2}, s_{3}, r} is the configuration of the A-operation, and u_{1}, u_{2}, and u_{3} are odd integers. Generalizing to n inputs, the A-operation is expressed as
\( {A}_q\left({u}_1,\dots, {u}_n\right)=\left|{2}^{l_1}{u}_1+\sum_{k=2}^n{\left(-1\right)}^{s_k}{2}^{l_k}{u}_k\right|{2}^{-r}, \) (5)
where l_{1} ≥ 0, …, l_{n} ≥ 0 are left shifts, r ≥ 0 is a right shift, s_{2}, …, s_{n} are binary values, q = {l_{1}, …, l_{n}, s_{2}, …, s_{n}, r} is the configuration of the A-operation, and u_{1}, …, u_{n} are odd integers.
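A minimal Python sketch of the n-input A-operation is given below. It assumes the usual form of the operation, in which each input u_k is shifted left by l_k, the first input is always added, each remaining input is added or subtracted according to its sign bit s_k, and the absolute value of the result is shifted right by r. The function name a_operation and its argument layout are hypothetical.

```python
def a_operation(u, l, s, r=0):
    """Sketch of an n-input A-operation (assumed form, see lead-in).

    u: inputs u_1..u_n (odd integers in the text)
    l: left shifts l_1..l_n, each >= 0
    s: sign bits s_2..s_n (0 = add, 1 = subtract); u_1 is always added
    r: right shift, >= 0; it must divide the result exactly
    """
    if len(l) != len(u) or len(s) != len(u) - 1:
        raise ValueError("expected n inputs, n left shifts, n - 1 sign bits")
    signs = [0] + list(s)
    total = sum((-1) ** sk * (uk << lk) for uk, lk, sk in zip(u, l, signs))
    total = abs(total)
    if total % (1 << r):
        raise ValueError("right shift r would discard nonzero bits")
    return total >> r


print(a_operation([1, 1], [2, 0], [0]))            # 5  = 4 + 1
print(a_operation([1, 1, 1], [6, 4, 0], [1, 1]))   # 47 = 64 - 16 - 1
```

The second call shows how a single 3-input operation covers the three nonzero CSD digits of 47.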
An array of interconnected A-operations forms an SCM or an MCM block. The MCM block is built upon SCM because the latter is the simplest case. The SCM array is represented using directed acyclic graphs (DAGs) with the following characteristics [33,34,35,36]:

The output of each A-operation is called a fundamental.

For a graph with m A-operations, there are m + 1 vertices and m fundamentals.

Each vertex has in-degree n, except for the input vertex, which has in-degree zero.

A vertex with in-degree n corresponds to an n-input A-operation.

Each vertex has out-degree larger than or equal to one, except for the output vertex, which has out-degree zero.

The constant resulting from the last A-operation is the output fundamental (OF). The constants resulting from previous A-operations are non-output fundamentals (NOFs).
In the MCM case, there are several OFs.
The DAG representation is the most useful for saving arithmetic operations because it makes it possible to exploit structures for interconnecting A-operations that cannot be seen in the CSD representation. This expands the opportunity to optimize the constant multiplication blocks. For example, the product 45X, with 45 = 2^{6} − 2^{4} − 2^{2} + 2^{0} (i.e., a CSD string “\( 10\overline{1}0\overline{1}01 \)”), needs three 2-input additions and has a critical path of two additions, as shown in Fig. 4a. However, by using the DAG approach, the multiplication 45X requires only two 2-input additions and has a critical path of two additions. In this case, the constant can be factorized into two factors, namely, 5 and 9, as shown in Fig. 4b.
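The two realizations of 45X can be compared with a short Python sketch. The grouping of the CSD terms below is one possible arrangement with three operations and two depth levels, not necessarily the one drawn in Fig. 4a; the function names are hypothetical.

```python
def times_45_csd(x: int) -> int:
    # 45 = 2^6 - 2^4 - 2^2 + 2^0: three 2-input operations.
    a = (x << 6) - (x << 4)   # operation 1 (depth 1): 48x
    b = x - (x << 2)          # operation 2 (depth 1): -3x
    return a + b              # operation 3 (depth 2): 45x


def times_45_dag(x: int) -> int:
    # 45 = 5 * 9: two cascaded 2-input operations.
    x5 = (x << 2) + x         # operation 1 (depth 1): 5x
    return (x5 << 3) + x5     # operation 2 (depth 2): 9 * (5x) = 45x


print(times_45_csd(7), times_45_dag(7))  # 315 315
```

Both functions have two depth levels, but the factored version reuses the intermediate value 5x and therefore saves one operation.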
It is important to mention that a multiplicative graph is the graph obtained by cascading subgraphs, and the union point between two cascaded subgraphs in a multiplicative graph is called an articulation point [37]. This is illustrated in Fig. 5a. A particular case is the completely multiplicative graph, where each cascaded subgraph is composed of one A-operation, as shown in Fig. 5b [4]. The graph presented in Fig. 4b is an example of a completely multiplicative graph with 2-input A-operations. Graphs without articulation points are referred to as non-multiplicative graphs [37]. A cascaded interconnection of a completely multiplicative graph with a non-multiplicative graph is called a generalized graph; see Fig. 5c.
The speed of a design is restricted by the critical path. The pipelining technique shortens the critical path by introducing registers along the data path [38]. In FPGA implementations, constant multiplications based on shift-and-add operations can be made fully pipelined at a low extra cost. Pipelining has a small overhead because the logic blocks in FPGAs include memory elements, which are otherwise unused [39, 40]. For example, Table 1 shows the number of logic elements used to implement the multiplier 45X (for an 8-bit input) in an Altera Cyclone IV EP4CE115F29C7 FPGA. We observe that only three extra logic elements are needed in the pipelined implementation, which represents an increase of 9.7% in resource utilization compared with the non-pipelined case. Nevertheless, the frequency of operation is increased by 31.7%.
Based on the aforementioned observation, the implementation cost is measured by the number of registered operations (R-operations), i.e., either an addition-register pair or a single register, needed to implement constant multiplications. Two R-operations with the same cost are illustrated in a simplified way in Fig. 6. Hence, the PSCM problem consists in finding the pipelined array of A-operations that forms a single-constant multiplier using the minimum number of R-operations. Similarly, the PMCM problem consists in finding the pipelined array of A-operations that forms a multiple-constant multiplier using the minimum number of R-operations.
To calculate the lower bounds for the number of R-operations required to implement PSCM and PMCM blocks, we need the following information from a constant:

1) Its MNSD, denoted by S. We will also refer to this number in a more informal manner as “the number of nonzero digits”.

2) Its number of prime factors (it does not matter if these prime factors are repeated). This number is denoted by Ω.
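The second quantity can be obtained by simple trial division, as in the following Python sketch. It assumes that repeated prime factors are counted each time they appear (e.g., 45 = 3 × 3 × 5 gives Ω = 3), which is our reading of the definition above; the function name big_omega is hypothetical.

```python
def big_omega(c: int) -> int:
    """Number of prime factors of c, counted with multiplicity."""
    if c < 1:
        raise ValueError("c must be a positive integer")
    count = 0
    p = 2
    while p * p <= c:
        while c % p == 0:
            c //= p
            count += 1
        p += 1
    if c > 1:
        count += 1  # the remaining cofactor is prime
    return count


print(big_omega(45), big_omega(47))  # 3 1
```

Trial division is adequate here because the constants of interest have modest word lengths.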
Proposed lower bounds
In the following, we state, in Subsection 3.1, Theorems 1 to 8 to derive the lower bounds of R-operations in PSCM and, in Subsection 3.2, Theorems 9 and 10 for PMCM, along with their corresponding proofs. The pipelining operation, which was not addressed in the previous works [3] and [4], is explicitly included in the proposed lower bounds through the R-operations.
PSCM case
Whenever a constant c is mentioned in the theorems of this subsection (Theorems 1 to 8), we consider that the MNSD of that constant is S and its number of prime factors is Ω.
Theorem 1 provides the upper limit of nonzero digits that can be generated by any graph with a given number of depth levels, regardless of its number of R-operations. From this, we obtain the minimum number of depth levels that a graph must have to implement a constant with a given S.
Theorems 2 and 3 prove the properties of the completely multiplicative graphs, namely, that they generate the upper limit of nonzero digits mentioned in Theorem 1 with the minimum possible number of R-operations. It follows that the completely multiplicative graph is a solution that meets the lower bound for the number of R-operations. However, this graph has articulation points, and every articulation point represents the union between two cascaded subgraphs, i.e., the product of two smaller constants. Therefore, Theorem 4 uses Ω to identify which constants can be implemented with the completely multiplicative graph (for example, prime constants cannot be factorized into smaller constants; thus, they cannot be implemented by a completely multiplicative graph).
Theorem 5 identifies the minimum number of R-operations needed in any non-multiplicative graph with a given number of depth levels, and Theorem 6 proves that non-multiplicative graphs can generate the upper limit of nonzero digits mentioned in Theorem 1 with that minimum number of R-operations. Then, Theorem 7 establishes the lower bound for the number of R-operations needed to implement a prime constant (Ω = 1).
Finally, Theorem 8 completes the information of Theorems 4 and 7, namely, it gives the lower bound of R-operations needed to implement non-prime constants that have fewer prime factors than the number of subgraphs used in a completely multiplicative graph.
Theorem 1
A graph with p depth levels can provide at most n^{p} nonzero digits for a constant.
Proof
The proof is by induction (see the proof of Theorem 6.9 in [39] for the case of 2-input A-operations):

1) The base case corresponds to the first depth level, where an n-input A-operation can form a constant with at most n nonzero digits. This is true since the input of any graph has one nonzero digit [3, 4, 39].

2) As the inductive step, we assume that, in the p-th level, there are at most n^{p} nonzero digits. In the (p + 1)-th level, an A-operation can form a constant whose number of nonzero digits is at most the sum of the numbers of nonzero digits at the inputs of that A-operation. This is at most n times the maximum number of nonzero digits available in the previous level, i.e., n × n^{p} = n^{p + 1} nonzero digits.

Since assuming that the theorem is true for p implies that it is also true for p + 1, and since the base case is true, the proof is complete. In other words, an adder, regardless of its number of inputs, cannot generate more nonzero digits than the sum of the numbers of nonzero digits of its inputs. Thus, the MNSD can grow by at most a factor of n per depth level, which occurs when all the inputs of the n-input adder placed in a given depth level come from the immediately previous depth level. ■
Theorem 2
A completely multiplicative graph with p A-operations can generate n^{p} nonzero digits.
Proof
This proof is a straightforward extension of the proof of Theorem 6.8 in [39], which corresponds to completely multiplicative graphs with 2-input A-operations. As stated earlier, the input of a graph has one nonzero digit. In the completely multiplicative graph, there are at most n nonzero digits after the A-operation placed at the first depth level. Cascading an A-operation to that output yields at most n × n nonzero digits, and so on. The number of nonzero digits at depth level p is at most n times the number of nonzero digits of the fundamental at the (p − 1)-th depth level. Consequently, the maximum number of nonzero digits at the p-th depth level is n^{p}. ■
Theorem 3
A completely multiplicative graph with p depth levels needs only p R-operations.
Proof
The completely multiplicative graph with p depth levels has p A-operations, and every A-operation forms a subgraph. Pipelining between two subgraphs needs only one register, according to [38], because the pipelining occurs at the articulation point. As a result, every A-operation is followed by a register. Since an A-operation followed by a register is considered an R-operation, there are only p R-operations in total. This is illustrated in Fig. 7. ■
Theorem 4
A constant with (n^{p − 1} + 1) ≤ S ≤ n^{p} and Ω ≥ p needs at least p R-operations.
Proof
From Theorem 2, we have that a constant with (n^{p − 1} + 1) ≤ S ≤ n^{p} nonzero digits can be implemented with at least p depth levels, which implies at least p A-operations. From Theorem 3, we have that a completely multiplicative graph can generate those values of S with only p R-operations. The completely multiplicative graph with p R-operations consists of p cascaded subgraphs; thus, a constant implemented with that graph must have at least p prime factors. Since Ω ≥ p holds, the completely multiplicative graph can be employed to implement that constant using p R-operations. ■
Theorem 5
A non-multiplicative graph with p depth levels needs at least (2p − 1) R-operations.
Proof
According to Theorem 3, if a graph with p depth levels has only p R-operations in total, it must be a pipelined completely multiplicative graph. According to Theorem 2, that graph can generate the maximum possible number of nonzero digits, namely, n^{p}. To make that optimal graph non-multiplicative, the (p − 1) articulation points must be eliminated. From [38], it is known that at least one additional R-operation must be added for every eliminated articulation point. Therefore, at least (2p − 1) R-operations are required, i.e., the original minimum of p R-operations in the form of addition-delay pairs plus the additional (p − 1) R-operations in the form of pure delays. Figure 8 shows an example with p = 3. ■
Theorem 6
A non-multiplicative graph with p depth levels and (2p − 1) R-operations can generate n^{p} nonzero digits.
Proof
Consider a graph with p depth levels formed by two completely multiplicative graphs of (p − 1) levels each, connected in parallel from the input of the graph, and one A-operation placed in the p-th level that sums the outputs of these two graphs. The output of one of these graphs is connected to n − 1 inputs of the last A-operation, and the output of the other graph is connected to the remaining input. This is a non-multiplicative graph because it is not formed by cascading subgraphs, and it is composed of (2p − 1) A-operations. According to Theorem 2, we can obtain n^{p − 1} nonzero digits from each completely multiplicative graph, and according to Theorem 3, these graphs can be pipelined without requiring extra registers. Since the last A-operation can add up the n^{p − 1} nonzero digits present at each one of its n inputs and can be pipelined without extra cost, the resulting graph generates n^{p} nonzero digits using (2p − 1) R-operations. An example of this is shown in Fig. 9. ■
Theorem 7
A constant with (n^{p − 1} + 1) ≤ S ≤ n^{p} and Ω = 1 needs at least (2p − 1) R-operations.
Proof
Since Ω = 1 holds, a non-multiplicative graph must be employed to implement that constant. From Theorem 6, we have that a constant with (n^{p − 1} + 1) ≤ S ≤ n^{p} nonzero digits can be implemented with at least p depth levels and at least (2p − 1) R-operations. This is a lower bound for the number of R-operations, since from Theorem 5, we have that a non-multiplicative graph with p levels needs at least (2p − 1) R-operations. ■
Theorem 8
A constant with (n^{p − 1} + 1) ≤ S ≤ n^{p} and 1 < Ω < p needs at least (2p − Ω) R-operations.
Proof
From Theorem 1, we have that p depth levels are necessary to achieve the values of S in the specified range. Since Ω < p holds, we can take advantage of a completely multiplicative graph with at most Ω − 1 R-operations, which, according to Theorem 2, generates at most n^{Ω − 1} nonzero digits and represents the product of Ω − 1 factors. The last factor can be formed with a non-multiplicative subgraph with [p − (Ω − 1)] depth levels. According to Theorem 5, this subgraph needs at least 2[p − (Ω − 1)] − 1 R-operations, and according to Theorem 6, it can generate n^{[p − (Ω − 1)]} nonzero digits. The total graph, illustrated in Fig. 10, can generate at most n^{Ω − 1} × n^{[p − (Ω − 1)]} = n^{p} nonzero digits and uses at least (Ω − 1) + 2[p − (Ω − 1)] − 1 = 2p − 2(Ω − 1) + (Ω − 1) − 1 = 2p − (Ω − 1) − 1 = (2p − Ω) R-operations. ■
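Finally, from Theorem 1, we have that the number of depth levels necessary to achieve S is p = ⌈log_{n}(S)⌉. Substituting this value for p and using Theorems 4, 7, and 8, we obtain the lower bound for the number of R-operations needed to form a PSCM block as follows:
\( {L}_{PSCM}=\left\{\begin{array}{ll}\left\lceil {\log}_n(S)\right\rceil; & \varOmega \ge \left\lceil {\log}_n(S)\right\rceil, \\ {}2\left\lceil {\log}_n(S)\right\rceil -\varOmega; & \varOmega <\left\lceil {\log}_n(S)\right\rceil .\end{array}\right. \) (6)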
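The PSCM bound that results from Theorems 4, 7, and 8 can be evaluated with the following Python sketch. It takes S, Ω, and n as inputs and returns p when Ω ≥ p and 2p − Ω otherwise, with p = ⌈log_{n}(S)⌉ computed in integer arithmetic to avoid floating-point rounding. The function names min_depth and pscm_lower_bound are hypothetical.

```python
def min_depth(S: int, n: int) -> int:
    """Smallest p with n**p >= S, i.e., ceil(log_n(S)), using integers only."""
    if S < 1 or n < 2:
        raise ValueError("need S >= 1 and n >= 2")
    p, reach = 0, 1
    while reach < S:
        reach *= n
        p += 1
    return p


def pscm_lower_bound(S: int, omega: int, n: int = 2) -> int:
    """Lower bound on R-operations for a single constant."""
    p = min_depth(S, n)
    if omega >= p:
        return p            # Theorem 4
    return 2 * p - omega    # Theorem 7 (omega == 1) and Theorem 8 (1 < omega < p)


# 45 has S = 4 and Omega = 3; 47 has S = 3 and Omega = 1.
print(pscm_lower_bound(4, 3, n=2))  # 2
print(pscm_lower_bound(3, 1, n=2))  # 3
print(pscm_lower_bound(3, 1, n=3))  # 1
```

For 45 with 2-input operations, the value 2 coincides with the two cascaded additions of the factored realization 5 × 9.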
PMCM case
The theorems in this section are stated for N constants c_{1}, c_{2}, …, c_{N}, whose respective MNSDs are S_{1}, S_{2}, …, S_{N} and whose respective numbers of prime factors are Ω_{1}, Ω_{2}, …, Ω_{N}, such that S_{1} ≤ S_{2} ≤ … ≤ S_{N}.
Theorem 9 indicates the lower bound for the number of n-input A-operations needed to form an MCM block. If pipelining is added, more R-operations than the aforementioned lower bound may be needed because the constants with fewer prime factors may use non-multiplicative graphs, which require extra R-operations (see Theorems 5 to 8). Besides, all the outputs of the PMCM block must have an equal number of depth levels to balance the input–output delay, which may also require extra R-operations. Based on these observations, Theorem 10 extends the lower bound provided in Theorem 9 by identifying at least how many extra R-operations would be needed. From these theorems, we obtain the lower bound for the number of R-operations needed to form a PMCM block.
Theorem 9
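At least K n-input A-operations are needed to build an MCM block, where K is given by
\( K=\left\lceil {\log}_n\left({S}_1\right)\right\rceil +\sum_{i=1}^{N-1}E\left({S}_i,{S}_{i+1}\right), \) (7)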
with \( E\left({S}_i,{S}_{i+1}\right)=\left\{\begin{array}{c}\hfill 1;\kern5em {S}_i={S}_{i+1},\hfill \\ {}\hfill \left\lceil { \log}_n\frac{S_{i+1}}{S_i}\right\rceil; \kern0.75em {S}_i<{S}_{i+1}.\hfill \end{array}\right. \)
Proof
Recall that every A-operation has only one possible configuration and therefore can generate only one fundamental. Only shifted (i.e., scaled by a power of two) versions of that fundamental can be obtained from that A-operation. Since the target constants are odd integers by definition, it is not possible to obtain two target constants from the same A-operation. Therefore, there must be at least N n-input A-operations for the N constants. Note that, since the terms S_{i} are sorted in ascending order, S_{1} corresponds to the simplest constant, i.e., the one with the smallest number of nonzero digits. From Theorem 1, we have that with p depth levels we can obtain at most n^{p} nonzero digits. By using the relation n^{p} ≥ S_{1}, we have that the minimum number of levels necessary to generate S_{1} nonzero digits is ⌈log_{n}(S_{1})⌉, which implies the existence of at least ⌈log_{n}(S_{1})⌉ A-operations for that constant. Finally, if S_{i+1} > n × S_{i} holds, a single A-operation is not able to generate the constant c_{i+1} when only coefficients with at most S_{i} nonzero digits are available, because the number of nonzero digits at the output of an A-operation is at most the sum of the numbers of nonzero digits at its inputs. Therefore, at least ⌈log_{n}(S_{i+1}/S_{i})⌉ A-operations will be required. This proof is a straightforward extension of the proof given in [3] for the lower bound of 2-input A-operations that form an MCM block. ■
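The counting argument of this proof can be sketched in Python. The code assumes that K is the sum of ⌈log_{n}(S_{1})⌉ and the terms E(S_{i}, S_{i+1}) for consecutive constants, which is the form suggested by the proof; this reading, as well as the function names, is an assumption of the sketch. Integer arithmetic is used for the ceiling of the logarithm.

```python
def ceil_log_ratio(a: int, b: int, n: int) -> int:
    """Smallest k >= 0 with a * n**k >= b, i.e., ceil(log_n(b / a)) for b >= a."""
    k = 0
    while a < b:
        a *= n
        k += 1
    return k


def mcm_lower_bound(S_values, n: int = 2) -> int:
    """Assumed form: K = ceil(log_n(S_1)) + sum of E(S_i, S_{i+1})."""
    S = sorted(S_values)
    K = ceil_log_ratio(1, S[0], n)  # ceil(log_n(S_1))
    for s_i, s_next in zip(S, S[1:]):
        # E(S_i, S_{i+1}): 1 if equal, else ceil(log_n(S_{i+1} / S_i)).
        K += 1 if s_i == s_next else ceil_log_ratio(s_i, s_next, n)
    return K


# MNSDs of the constants {3; 13; 21; 37} are 2, 3, 3, 3.
print(mcm_lower_bound([2, 3, 3, 3], n=2))  # 4
```

Each constant after the first contributes at least one A-operation, and more only when its MNSD exceeds n times that of its predecessor.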
Theorem 10
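At least L R-operations are needed to build a PMCM block, where L = K + F + G, with
\( F=\left\{\begin{array}{ll}\underset{i}{\max}\left\{\left\lceil {\log}_n\left({S}_i\right)\right\rceil -{\varOmega}_i\right\}; & \forall\ i\ \mathrm{such}\ \mathrm{that}\ {\varOmega}_i<\left\lceil {\log}_n\left({S}_i\right)\right\rceil, \\ {}0; & \mathrm{otherwise},\end{array}\right. \)
\( G=\sum_{i=1}^{N-1}\left(\left\lceil {\log}_n\left({S}_N\right)\right\rceil -\left\lceil {\log}_n\left({S}_i\right)\right\rceil \right), \)
and K given in (7).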
Proof
Consider that there is a constant c_{m} that satisfies Ω_{m} < ⌈log_{n}(S_{m})⌉ and, if more than one constant satisfies this condition, c_{m} is the one with the greatest difference [⌈log_{n}(S_{m})⌉ − Ω_{m}]. From Theorem 8, we have that the constant can be formed by cascading a non-multiplicative graph with a completely multiplicative graph, where the non-multiplicative graph needs 2[⌈log_{n}(S_{m})⌉ − (Ω_{m} − 1)] − 1 R-operations. Since Theorem 9 does not take into consideration the number of prime factors, only [⌈log_{n}(S_{m})⌉ − (Ω_{m} − 1)] A-operations have been accounted for in that theorem, under the assumption that the constant c_{m} can be constructed with the optimal completely multiplicative graph. Therefore, at least [⌈log_{n}(S_{m})⌉ − (Ω_{m} − 1)] − 1 extra R-operations must be included when pipelining is applied, which explains the term F. The term G is explained by the fact that extra R-operations may be needed to achieve the same number of pipeline stages from input to output for every constant. Since the minimum depth level of a constant is given by ⌈log_{n}(S)⌉, the differences between the minimum depth level of the constant c_{N} (which has the greatest depth level among all the constants) and the minimum depth levels of the other constants are accumulated in the term G. ■
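From Theorem 10, we can express the lower bound for the number of R-operations in the PMCM case as
\( {L}_{PMCM}=\left\lceil {\log}_n\left({S}_1\right)\right\rceil +\sum_{i=1}^{N-1}E\left({S}_i,{S}_{i+1}\right)+F+\sum_{i=1}^{N-1}\left(\left\lceil {\log}_n\left({S}_N\right)\right\rceil -\left\lceil {\log}_n\left({S}_i\right)\right\rceil \right), \) (8)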
with \( E\left({S}_i,{S}_{i+1}\right)=\left\{\begin{array}{c}\hfill 1;\kern5em {S}_i={S}_{i+1},\hfill \\ {}\hfill \left\lceil { \log}_n\frac{S_{i+1}}{S_i}\right\rceil; \kern0.75em {S}_i<{S}_{i+1},\hfill \end{array}\right. \)
and \( F=\left\{\begin{array}{c}\hfill {\displaystyle \underset{i}{ \max }}\left\{\left\lceil { \log}_n\left({S}_i\right)\right\rceil -{\varOmega}_i\right\};\kern0.5em \forall\ i\kern0.5em \mathrm{such}\ \mathrm{that}\kern0.75em {\varOmega}_i<\left\lceil { \log}_n\left({S}_i\right)\right\rceil, \hfill \\ {}\hfill 0;\kern8.25em \mathrm{otherwise}.\hfill \end{array}\right. \)
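A Python sketch of the PMCM bound L = K + F + G is given below. It rests on two assumptions drawn from the proofs of Theorems 9 and 10 rather than from explicit formulas: K is taken as ⌈log_{n}(S_{1})⌉ plus the sum of the terms E(S_{i}, S_{i+1}), and G is taken as the sum of the differences between the minimum depth of c_{N} and the minimum depths of the other constants. The function names and the (S, Ω) input layout are hypothetical.

```python
def ceil_log_ratio(a: int, b: int, n: int) -> int:
    """Smallest k >= 0 with a * n**k >= b, i.e., ceil(log_n(b / a)) for b >= a."""
    k = 0
    while a < b:
        a *= n
        k += 1
    return k


def pmcm_lower_bound(constants, n: int = 2) -> int:
    """constants: iterable of (S_i, Omega_i) pairs, in any order."""
    pairs = sorted(constants)                       # ascending S_i
    depths = [ceil_log_ratio(1, S, n) for S, _ in pairs]

    # K (assumed form): ceil(log_n(S_1)) + sum of E(S_i, S_{i+1}).
    K = depths[0]
    for (s_i, _), (s_next, _) in zip(pairs, pairs[1:]):
        K += 1 if s_i == s_next else ceil_log_ratio(s_i, s_next, n)

    # F: largest shortfall of prime factors with respect to the minimum depth.
    F = max([d - om for d, (_, om) in zip(depths, pairs) if om < d], default=0)

    # G (assumed form): depth differences with respect to the deepest constant.
    G = sum(depths[-1] - d for d in depths[:-1])
    return K + F + G


# {3; 13; 21; 37}: S = (2, 3, 3, 3), Omega = (1, 1, 2, 1).
block = [(2, 1), (3, 1), (3, 2), (3, 1)]
print(pmcm_lower_bound(block, n=2))  # 6
print(pmcm_lower_bound(block, n=3))  # 4
```

For n = 3, every constant in this set fits in a single depth level, so both F and G vanish and only K contributes.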
Results and comparisons
In this section, the proposed lower bounds are compared with the lower bounds currently available in the literature, detailing the PSCM and PMCM cases in Subsections 4.1 and 4.2, respectively. In all cases, two- and three-input additions were considered.
First, the PSCM case is addressed for n = 2 (i.e., 2-input additions) with an illustration of the lower bounds averaged over all the constants with a word length of B bits, where B goes from 1 to 14. This illustration compares the proposed lower bound with the existing lower bounds from [3] and [4], showing that the proposed lower bound is tighter. An example is also included, where the pipelined shift-and-add multipliers for some constants are constructed with 2-input and 3-input additions.
The effectiveness of the PMCM lower bound is demonstrated by examples, where pipelined shift-and-add multiple constant multiplication blocks are constructed using the algorithms from [7, 8, 22, 30] and [36] for the case of 2-input additions and the algorithm from [10] for the case of 3-input additions. The proposed lower bound is compared with the lower bound from [3] in the case of 2-input additions, and in most of the cases, it provides a better estimation of the number of required R-operations. For n = 3 (i.e., 3-input additions), there are no theoretical lower bounds currently available in the literature. Thus, the proposed lower bound is only compared with the solution from [10]. In that case, the proposed lower bound falls short by only one R-operation.
PSCM case
The lower bounds from [3] and [4], as well as the proposed lower bound L_{PSCM} from (6), are averaged over all constants with B bits, where B is between 1 and 14. These averages are shown in Fig. 11. We can observe the tightening achieved by the proposed lower bound, i.e., the proposed lower bound is in general greater than the lower bounds currently available in the literature. Table 2 presents, for n = 2, the percentage of constants with improved lower bounds among 10,000 random 14-bit constants and among 10,000 random B-bit constants, with B between 15 and 32.
Example 1 presents the pipelined shift-and-add multipliers for the constants 11467, 11093, and 13003 constructed with 2-input additions (shown in Fig. 12a, c, and e, respectively) and 3-input additions (shown in Fig. 12b, d, and f, respectively). In all the cases, the optimal solutions have the number of R-operations predicted by the proposed lower bound, as shown in Table 3. Besides, for the case of 2-input additions, the proposed lower bound outperforms the ones from [3] and [4] because the lower bound from [3] falls short by two R-operations and the lower bound from [4] falls short by one R-operation.
Example 1 The constants 11467, 11093, and 13003 have similar graphs and the same lower bounds, as shown in Table 3. The corresponding graphs are presented in Fig. 12.
PMCM case
In Example 2, the multiplier block with constants {44; 130; 172}, formed with 2-input additions, is presented. Table 4 lists the number of R-operations obtained by the algorithms H_{cub} [30] with pipelining, PAG using ASAP pipelining [8], and the optimal PAG [8]. Additionally, the lower bound of [3] and the proposed lower bound L_{PMCM} from (8) are given. The proposed lower bound is closer to the number of R-operations needed to implement the multiplier block than the lower bound of [3].
Example 3 presents the group of constants {3; 13; 21; 37} that form a multiplier block. The R-operations needed to implement the multiplier block using 2-input additions are obtained with the algorithms RAG-n [36] with pipelining, RSG [22], and OFL [7]. The resulting values are shown in Table 5, where it can be observed that the OFL algorithm requires the fewest R-operations. The lower bound of [3] and the proposed lower bound L_{PMCM} from (8) are also given in Table 5. In this example, the proposed lower bound coincides with the number of R-operations used by the OFL algorithm to implement the multiplier block.
A multiplier block formed with the constants {7,567; 20,406} is illustrated in Example 4. The R-operations needed to implement the multiplier block using 2-input additions are obtained with the algorithm PAG [8]. Table 6 shows the resulting number of R-operations together with the estimates given by the lower bound of [3] and the proposed lower bound L_{PMCM} from (8). The R-operations needed to implement the multiplier block using 3-input additions are obtained with the algorithm PAG for 3-input additions [10]. Table 7 shows the resulting number of R-operations along with the estimate given by the proposed lower bound L_{PMCM} from (8).
Finally, Example 5 presents the constants {87,381; 689,493} that form a multiplier block. The R-operations needed to implement the multiplier block using 2-input additions are obtained with the algorithm PAG [8], and those needed using 3-input additions are obtained with the algorithm PAG for 3-input additions [10]. Table 8 shows the resulting number of R-operations for 2-input additions together with the estimates given by the lower bound of [3] and the proposed lower bound L_{PMCM} from (8). Table 9 shows the resulting number of R-operations for 3-input additions along with the estimate given by the proposed lower bound L_{PMCM} from (8). The proposed lower bound provides a reliable estimate of the number of R-operations needed to implement the multiplier block.
Example 2 (example given in [8]) A multiplier block with the constants from the set {44; 130; 172} has the estimated number of R-operations shown in Table 4 (the resulting graphs are shown in Fig. 1 of [8]).
Example 3 (example given in [7]) A multiplier block with the constants from the set {3; 13; 21; 37} has the estimated number of R-operations shown in Table 5 (the resulting graphs can be seen in Fig. 4 of [7]).
Example 4 (example given in [10]) A multiplier block with the constants from the set {7,567; 20,406} has the estimated number of R-operations shown in Table 6 for 2-input adders and in Table 7 for 3-input adders (Fig. 3 of [10] shows the corresponding graphs).
Example 5 A multiplier block with the constants from the set {87,381; 689,493} has the estimated number of R-operations shown in Table 8 for 2-input adders and in Table 9 for 3-input adders. The corresponding graphs are shown in Fig. 13.
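To make the counting of R-operations concrete, the sketch below checks a small two-stage pipelined graph for the constants of Example 2 and counts one R-operation per node, whether the node is an addition/subtraction or a pure register. The graph is a hypothetical one constructed here for illustration; it is not taken from [8] or from Table 4. The sketch also assumes that even constants are obtained from their odd parts by wired shifts at no cost. The data layout and the names STAGES, odd_part, and count_r_operations are assumptions.

```python
# Each stage maps a node value to its operands: (source value, left shift, sign).
# Sources must exist in the previous stage; stage 0 holds only the input (value 1).
STAGES = [
    {7: [(1, 3, +1), (1, 0, -1)],    # 7  = 8*1 - 1
     9: [(1, 3, +1), (1, 0, +1)]},   # 9  = 8*1 + 1
    {11: [(9, 1, +1), (7, 0, -1)],   # 11 = 2*9 - 7
     43: [(9, 2, +1), (7, 0, +1)],   # 43 = 4*9 + 7
     65: [(9, 3, +1), (7, 0, -1)]},  # 65 = 8*9 - 7
]


def odd_part(c: int) -> int:
    """Strip factors of two from a positive constant."""
    if c <= 0:
        raise ValueError("constant must be a positive integer")
    while c % 2 == 0:
        c //= 2
    return c


def count_r_operations(stages, n_inputs: int = 2) -> int:
    """Validate every node of a pipelined graph and return the node count."""
    available = {1}
    total = 0
    for stage in stages:
        for value, operands in stage.items():
            if not 1 <= len(operands) <= n_inputs:
                raise ValueError(f"node {value} has an invalid operand count")
            if any(src not in available for src, _, _ in operands):
                raise ValueError(f"node {value} reads from a non-adjacent stage")
            computed = sum(sign * (src << shift) for src, shift, sign in operands)
            if computed != value:
                raise ValueError(f"node {value} evaluates to {computed}")
            total += 1
        # Full pipelining: only the nodes of this stage feed the next one.
        available = set(stage)
    return total


targets = {odd_part(c) for c in (44, 130, 172)}
assert targets == set(STAGES[-1])
print(count_r_operations(STAGES))
```

A pure pipelining register would be written as a node with the single operand (value, 0, +1), so it is counted in the same way as an addition.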
Conclusions
New theoretical lower bounds for the number of R-operations in the fully pipelined SCM and the fully pipelined MCM cases with n-input additions/subtractions have been presented. The proposed lower bounds are tighter than the existing ones because pipelining registers are explicitly considered. It was also observed that the use of articulation points allows a rapid increase in the number of nonzero digits from one depth level to the next. The new theoretical lower bounds give a better estimate of the number of operations needed to implement a single multiplier or a multiplier block. The tightness of the new lower bounds was illustrated with examples in the comparisons section.
References
1. R Guo, LS DeBrunner, K Johansson, Truncated MCM Using Pattern Modification for FIR Filter Implementation. Paper presented at the Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), Paris, France, p. 3881–3884, May 30–Jun 2, 2010.
2. L Aksoy, EO Günes, P Flores, Search algorithms for the multiple constant multiplication problem: exact and approximate. Microprocess. Microsyst. 34(5), 151–162 (2010). doi:10.1016/j.micpro.2009.10.001.
3. O Gustafsson, Lower bounds for constant multiplication problems. IEEE Trans. Circuits and Syst. II: Express briefs 54(11), 974–978 (2007). doi:10.1109/TCSII.2007.903212.
4. DET Romero, U Meyer-Baese, GJ Dolecek, On the inclusion of prime factors to calculate the theoretical lower bounds in multiplierless single constant multiplications. EURASIP Journal on Advances in Signal Processing 122, 1–9 (2014). doi:10.1186/1687-6180-2014-122.
5. S Mirzaei, R Kastner, A Hosangadi, Layout aware optimization of high speed fixed coefficient FIR filters for FPGAs. Int. Journal of Reconfigurable Computing (2010). doi:10.1155/2010/697625.
6. M Kumm, P Zipf, High speed low complexity FPGA-based FIR filters using pipelined adder graphs. Paper presented at the Int. Conference on Field Programmable Technology (FPT), Indian Institute of Technology Delhi, New Delhi, India, p. 1–4, 12–14 December 2011.
7. U Meyer-Baese, G Botella, DET Romero, M Kumm, Optimization of high speed pipelining in FPGA-based FIR filter design using genetic algorithm. Proc. SPIE 8401, Independent Component Analyses, Compressive Sampling, Wavelets, Neural Net, Biosystems, and Nanoengineering X, 84010R112 (2012). doi:10.1117/12.918934.
8. M Kumm, P Zipf, M Faust, CH Chang, Pipelined adder graph optimization for high speed multiple constant multiplication. Paper presented at the Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), p. 49–52, COEX, Seoul, Korea, 20–23 May 2012.
9. M Kumm, D Fanghanel, K Moller, P Zipf, U Meyer-Baese, FIR filter optimization for video processing on FPGAs. EURASIP J Adv Sig Process 111, 1–18 (2013). doi:10.1186/1687-6180-2013-111.
10. M Kumm, M Hardieck, J Willkomm, P Zipf, U Meyer-Baese, Multiple constant multiplications with ternary adders. Paper presented at the International Conference on Field Programmable Logic and Applications (FPL), Porto, Portugal, p. 1–8, 2–4 Sept. 2013.
11. M Kumm, P Zipf, Pipelined compressor tree optimization using integer linear programming. Paper presented at the 24th International Conference on Field Programmable Logic and Applications (FPL), p. 1–8, Munich, Germany, 2–4 Sept. 2014.
12. M Kumm, P Zipf, Efficient high speed compression trees on Xilinx FPGAs. Paper presented at the MBMV, IBM Germany Research and Development, Böblingen, Germany, p. 171–182, 10–12 March 2014.
13. L Aksoy, E Costa, P Flores, J Monteiro, Exact and approximate algorithms for the optimization of area and delay in multiple constant multiplications. IEEE Trans. Comput.-Aided Des. Integr. Circuits 27(6), 1013–1026 (2008). doi:10.1109/TCAD.2008.923242.
14. L Aksoy, E Costa, P Flores, J Monteiro, Finding the optimal tradeoff between area and delay in multiple constant multiplications. Elsevier Journal Microprocessors and Microsystems 35(8), 729–741 (2011). doi:10.1016/j.micpro.2011.08.009.
15. AG Dempster, SS Demirsoy, I Kale, Designing multiplier blocks with low logic depth. Paper presented at the Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), Scottsdale, Arizona, p. 773–776, 26–29 May 2002.
16. M Faust, CH Chang, Minimal logic depth adder tree optimization for multiple constant multiplication. Paper presented at the Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), Paris, France, p. 457–460, May 30–Jun 2, 2010.
17. K Johansson, O Gustafsson, LS DeBrunner, L Wanhammar, Minimum adder depth multiple constant multiplication algorithm for low power FIR filters. Paper presented at the Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), Rio de Janeiro, Brazil, p. 1439–1442, 15–18 May 2011.
18. AG Dempster, MD Macleod, Using all signed-digit representations to design single integer multipliers using subexpression elimination. Paper presented at the Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), Vancouver, British Columbia, p. 165–168, 23–26 May 2004.
19. L Aksoy, E Costa, P Flores, J Monteiro, Multiplierless design of linear DSP transforms. VLSI-SoC: Advanced Research for Systems on Chip, ed. by S Mir, CY Tsui, R Reis, O Choy (Springer, 2011), p. 73–93.
20. YH Ho, CU Lei, HK Kwan, N Wong, Global optimization of common subexpressions for multiplierless synthesis of multiple constant multiplications. Paper presented at the Proceedings of Asia and South Pacific Design Automation Conference, Seoul, South Korea, p. 119–124, 21–24 January 2008.
21. A Hosangadi, F Fallah, R Kastner, Simultaneous optimization of delay and number of operations in multiplierless implementation of linear systems. Paper presented at the Proceedings of International Workshop on Logic Synthesis, Lake Arrowhead, California, p. 1–8, 8–10 June 2005.
22. KN Macpherson, RW Stewart, Rapid prototyping - Area efficient FIR filters for high speed FPGA implementation. IEE Proceedings - Vision, Image Signal Processing 156, 711–720 (2006). doi:10.1049/ip-vis:20045133.
23. U Meyer-Baese, J Chen, CH Chang, AG Dempster, A comparison of pipelined RAG-n and DA FPGA-based multiplierless filters. Paper presented at the IEEE Asian-Pacific Conference on Circuits and Systems, Singapore, p. 1555–1558, 4–7 December 2006.
24. L Aksoy, E Costa, P Flores, J Monteiro, Design of low-complexity digital finite impulse response filters on FPGAs. Paper presented at the Proceedings of Design, Automation and Test in Europe Conference, Dresden, Germany, p. 1197–1202, 12–16 March 2012.
25. M Faust, CH Chang, Bit-parallel Multiple Constant Multiplication using Look-Up Tables on FPGA. Paper presented at the Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), p. 657–660, Rio de Janeiro, Brazil, 15–18 May 2011.
26. G Botella, A Garcia, M Rodriguez-Alvarez, E Ros, U Meyer-Baese, MC Molina, Robust bioinspired architecture for optical-flow computation. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 18(4), 616–629 (2010). doi:10.1109/TVLSI.2009.2013957.
27. G Botella, U Meyer-Baese, A Garcia, M Rodriguez, Quantization analysis and enhancement of a VLSI gradient-based motion estimation architecture. Digital Signal Processing 22(6), 1174–1187 (2012). doi:10.1016/j.dsp.2012.05.013.
28. G Botella, U Meyer-Baese, A Garcia, Bioinspired robust optical flow processor system for VLSI implementation. Electron Lett 45(25), 1304–1305 (2009). doi:10.1049/el.2009.1718.
29. E Castillo, A Lloris, DP Morales, L Parrilla, A Garcia, G Botella, A new area-efficient BCD-digit multiplier. Digital Signal Processing 62, 1–10 (2017). doi:10.1016/j.dsp.2016.10.011.
30. Y Voronenko, M Püschel, Multiplierless multiple constant multiplication. ACM Transactions on Algorithms 3(2), 11 (2007). doi:10.1145/1240233.1240234.
31. I Koren, Computer Arithmetic Algorithms (Prentice Hall, 1993).
32. U Meyer-Baese, Digital Signal Processing with Field Programmable Gate Arrays, 4th edn. (Springer, 2014).
33. DR Bull, DH Horrocks, Primitive operator digital filters. IEE Proceedings G - Circuits, Devices and Systems 138(3), 401–412 (1991). doi:10.1049/ip-g-2.1991.0066.
34. K Johansson, O Gustafsson, L Wanhammar, Switching activity estimation for shift-and-add based constant multipliers. Paper presented at the Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), Seattle, Washington, p. 676–679, 18–21 May 2008.
35. J Chen, CH Chang, High-level synthesis algorithm for the design of reconfigurable constant multiplier. IEEE Trans Computer-Aided Des Integr Circ Syst 28(12), 1844–1856 (2009). doi:10.1109/TCAD.2009.2030446.
36. AG Dempster, MD Macleod, Use of minimum-adder multiplier blocks in FIR digital filters. IEEE Trans. Circ Syst II – Analog Digit Sig Process 42(9), 569–577 (1995). doi:10.1109/82.466647.
37. O Gustafsson, AG Dempster, K Johansson, MD Macleod, L Wanhammar, Simplified design of constant coefficient multipliers. Circuits Syst. Signal Process 25(2), 225–251 (2006). doi:10.1007/s00034-005-2505-5.
38. KK Parhi, VLSI digital signal processing systems: design and implementation (John Wiley & Sons, 2007).
39. O Gustafsson, Contributions to Low-Complexity Digital Filters, 837 (Linköping Studies and technology dissertations, 2003).
40. R Kastner, A Hosangadi, F Fallah, Arithmetic Optimization Techniques for Hardware and Software Design (Cambridge University Press, 2010).
Acknowledgements
This paper has been supported by CONACYT scholarship no. 224191. The authors are grateful to D. E. T. Romero for his helpful comments during the development of this proposal.
Funding
This work is the result of a doctoral thesis developed at the Institute INAOE; the thesis has been supported by a CONACYT grant.
Authors’ contributions
MGCJ contributed to the main development of the theorems and examples in this proposal. UMB is the advisor in the development of low-complexity FPGA-based arithmetic blocks and contributed to the review of theorems and examples. GJD, as thesis advisor, directed all the work, and the paper was written under her supervision. All authors read and approved the final manuscript.
Authors’ information
Miriam Guadalupe Cruz Jimenez received the BS degree from the Minatitlan Institute of Technology and the MS degree from the National Institute for Astrophysics, Optics and Electronics (INAOE), Mexico. She received the best paper award at the conference CIIECC 2013. Currently, she is a PhD student in the Institute INAOE. She is a reviewer for the journals IEEE Transactions on Circuits and Systems I and Circuits, Systems & Signal Processing.
Dr. Uwe H. Meyer-Baese (IEEE, S'91–M'93) was born in Kassel, Germany, on July 10, 1964. He received his BSEE, MSEE, and Ph.D. “Summa cum Laude” from the Darmstadt University of Technology in 1987, 1989, and 1995, respectively. In 1994 and 1995, he held a postdoctoral position at the “Institute of Brain Research,” Magdeburg, Germany. In 1996 and 1997, he was a visiting professor at the University of Florida, Gainesville. From 1998 to 2000, he worked as a Research Scientist in the ASIC industry. He joined the Electrical and Computer Engineering Department at the FAMU-FSU College of Engineering in 2001 and is now an Associate Professor. He holds 3 patents, has published over 100 journal and conference papers and 5 books, and has supervised more than 60 master thesis projects in the real-time DSP/FPGA area. He is the author of the bestselling Springer textbook on DSP with FPGAs. He was a recipient of the Max-Kade Award in Neuroengineering in 1997, the ECE Department Research Award in 2005, Who’s Who in Science membership in 2005, the SPIE Best Presentation Award in 2006, the FAMU-FSU College of Engineering Teaching Award in 2007, and the Humboldt Fellowship in 2009. He has served as Faculty Senator of the FSU senate since Spring 2011. He has been an elected member of the editorial board for the journal Signal, Image and Video Processing for 2011–2015 and has been elected as a board member as well as an associate editor for the EURASIP Journal on Advances in Signal Processing for 2011–2013.
Gordana Jovanovic Dolecek received a BS degree from the Faculty of Electrical Engineering, University of Sarajevo, an MS degree from the University of Belgrade, and a Ph.D. degree from the Faculty of Electrical Engineering, University of Sarajevo. She was with the Faculty of Electrical Engineering, University of Sarajevo until 1993, as a research assistant, assistant professor, associate professor, and full professor. From 1986 to 1991, she was chairman of the Department of Telecommunication. During 1993–1995, she was with the Institute Mihailo Pupin, Belgrade. In 1995, she joined the Institute INAOE, Department for Electronics, Puebla, Mexico, where she works as a professor and researcher. She is the author of three books and more than 100 papers. She is also the author of four lectures for TechOnLine University. Her research interests include digital signal processing and digital communications. She is a member of IEEE and The National Researcher System (SNI) of Mexico.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Cruz Jiménez, M.G., Meyer-Baese, U. & Jovanovic Dolecek, G. Theoretical lower bounds for parallel pipelined shift-and-add constant multiplications with n-input arithmetic operators. EURASIP J. Adv. Signal Process. 2017, 31 (2017). doi:10.1186/s13634-017-0466-z
Keywords
 SCM
 MCM
 FPGA
 Multiplication
 Lower bound