An optimised twin precision multiplier for ASIC environment
 Rosi Asirvatham^{1} and
 Seshasayanan Ramachandran^{1}
https://doi.org/10.1186/s136340150197y
© Asirvatham and Ramachandran; licensee Springer. 2015
Received: 1 March 2014
Accepted: 13 January 2015
Published: 4 March 2015
Abstract
In this paper, we present the performance of the twin precision technique in a reduced computation modified Booth (RCMB) multiplier to achieve double throughput, and we propose an algorithm for it. The twin precision technique is an efficient way to obtain double throughput in multipliers, and we describe how to apply it to RCMB multipliers. Implementing twin precision in an RCMB multiplier requires fewer changes to the partial product array to obtain double throughput. Multiplexers usually perform the signal selection for N and N/2 bit multiplication. In the RCMB multiplier, the [N/2] + 1 partial product rows are reduced to N/2 rows. Implementing the twin precision technique in RCMB lowers the multiplexer count to about [N/2] + 3, which opens the way to optimising the twin precision (TP) multiplier. We thereby achieve a reduction in multiplexer utilisation of about 40% to 50% (for N = 8 to 128) compared to the existing twin precision modified Booth multiplier. In the proposed optimised TP modified Booth multiplier, this reduction in multiplexers leads to an overall reduction in area, power and delay: an area reduction of about 5% to 18%, a delay reduction of 5% to 20% and a power reduction of 8% to 32% were observed for N = 8 to 128. The proposed optimised TP multiplier is implemented in FFT complex multiplication as an application case study and achieves better performance (area, delay and power) compared to the prior TP multiplier. All our evaluations are made using Cadence RTL Compiler with the TSMC 180 nm library.
1 Introduction
Multiplication is an influential arithmetic operation in processors and digital signal-processing applications, and thus it plays a foremost role in digital computation. In multiplication, the processing delay is directly proportional to the length of the critical path. In order to design an efficient multiplier, constraints such as multiplier speed, power and area have to be considered thoroughly. In this paper, we narrow our research work towards achieving double throughput in signed multipliers with lower hardware complexity. Twin precision (TP) is a technique that can be exploited to obtain dual output [1] in multipliers, and it has been implemented for both signed and unsigned multipliers [2].
Achieving double throughput multipliers in an application-specific integrated circuit (ASIC) environment is a challenging task because, unlike in a field-programmable gate array (FPGA), reprogrammability is not possible. The two technologies, FPGA and ASIC, have their own pros and cons [3]. One of the biggest advantages of an FPGA is reprogrammability. Modern FPGAs have many variable precision embedded multipliers [4]. Even so, we cannot simply say that they are more advantageous than ASIC multipliers, because further optimization of FPGA embedded multipliers is not possible. In an ASIC, reprogrammability is not possible, but increased throughput in multipliers is achieved efficiently by the twin precision technique.
Initially, double throughput in very large scale integration (VLSI) multipliers was achieved by using several multipliers, at least two of which share the same route, as adopted in [5,6]. However, these methods have several disadvantages, such as increased multiplier area and high fanout, which increase the overall delay and area. They use multiplexers to connect the active multiplier to the output. Twin precision can be regarded as a subset of subword parallelism (SWP) in which the lower precision multiplications stop at two levels, i.e., if a 16-bit multiplier performs two 8-bit multiplications, then it is a TP multiplication. SWP is capable of performing four 4-bit multiplications or two 8-bit multiplications in the same multiplier architecture, and we refer to this as multiple SWP. A subword is a lower precision unit of data contained within a word. By exploiting SWP in signed multipliers, multiple lower precision multiplications can be performed, so unlike TP, the throughput is not limited to two. However, when multiple (more than two) SWP is performed on a modified Booth multiplier, many changes have to be made to apply the MB algorithm at all levels of lower precision multiplication: steps like sign extension, inversion of bits and carry suppression have to be performed at every level. Multiplexers are also needed at every level of lower precision multiplication to select the appropriate partial products, because the sign extension and the inversion of the most significant bits (MSB) in the partial product rows vary between multiple SWP and full precision multiplication. Though SWP offers more flexibility, such as multiple throughputs, its hardware complexity increases with the growing multiplexer utilisation needed to implement a signed multiplication algorithm like modified Booth (MB). Several architectures based on SWP have been implemented in [7-9]. These architectures are specially designed for media processing.
In [2], double throughput is realised effectively by adopting the TP technique in an MB multiplier. The MB algorithm [10] is a widely used signed multiplication algorithm since it has a reduced number of partial product rows. A multiplier that combines N and N/2 bit (b) multiplication in the same N b tree multiplier is called a TP multiplier, where N is the bit width of the multiplier. Here the partial product bits of the N b multiplier can be split in such a way that one N/2 b multiplication is performed in the least significant part (LSP) of the multiplier in parallel with another N/2 b multiplication in the most significant part (MSP). This is done in the partial product reduction tree without including any additional logic, as explained in [1]. Multiplexers (muxes) are generally employed in a TP multiplier to select the appropriate partial products for N and N/2 b multiplication.
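The splitting of the partial product array can be illustrated with a minimal sketch. It assumes a plain unsigned AND-array rather than the signed MB array discussed in this paper, and the function names and the `half_mode` control flag are hypothetical. Partial products whose two operand bits come from different halves are gated to zero, so the same array returns one N b product or two independent N/2 b products.

```python
def partial_products(x, y, n, half_mode):
    """Return the n*n AND-array of an unsigned n-bit multiplier as (weight, bit) pairs.

    ASSUMPTION: unsigned operands; the signed MB/RCMB arrays need extra
    sign-handling bits that this sketch does not model.
    """
    half = n // 2
    pps = []
    for i in range(n):      # bit index of y (row of the array)
        for j in range(n):  # bit index of x (position within the row)
            same_half = (i < half) == (j < half)
            # Third AND input: control signal that zeroes cross-half products.
            enable = 1 if (not half_mode or same_half) else 0
            bit = ((x >> j) & 1) & ((y >> i) & 1) & enable
            pps.append((i + j, bit))
    return pps


def twin_multiply(x, y, n=8, half_mode=False):
    """Sum the gated partial products according to their column weights."""
    return sum(bit << weight for weight, bit in partial_products(x, y, n, half_mode))


x, y = 0xB7, 0x5C
full = twin_multiply(x, y, 8, half_mode=False)   # one 8 x 8 product
twin = twin_multiply(x, y, 8, half_mode=True)    # two 4 x 4 products
lsp, msp = twin & 0xFF, twin >> 8                # LSP and MSP results
```

In half mode, the lower N sum bits hold the product of the two lower halves and the upper N sum bits hold the product of the two upper halves, which is the behaviour the paragraph above describes for the LSP and MSP.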
When the TP technique is implemented in signed multipliers like MB [2], multiplexers are needed to select the partial products during N and N/2 b multiplication. These additional multiplexers give rise to an area overhead and add significant delay to the TP multiplier. Counting the multiplexers shows that the TP implementation in the MB multiplier [2] uses N + 3 multiplexers to select the partial products. The partial products that are not used for N/2 bit multiplication are set to zero by replacing the two-input AND gate that produces each partial product with a three-input AND gate whose third input is the control signal. Since the area overhead in the TP multiplier is caused by the multiplexers, we aim at optimising the TP multiplier by reducing the multiplexer utilisation. The number of multiplexers decreases only when fewer changes have to be made for N/2 b multiplication. We achieve an optimised TP MB multiplier by implementing the TP technique in reduced computation modified Booth (RCMB), which consists of [N/2] rows rather than the [N/2] + 1 rows of the modified Booth algorithm and uses a simple sign extension prevention scheme that requires fewer changes for the TP implementation, i.e., for N/2 b multiplication.
In this paper, our goal of obtaining an optimised double throughput signed multiplier is achieved without the inclusion of complex logic, which is reflected in reduced area, delay and power. We have implemented twin precision efficiently with fewer hardware resources than the previous implementation, and a suitable algorithm is proposed. Implementing TP in RCMB [11] achieves reduced area, delay and power compared to the prior TP technique applied to the MB algorithm. The rest of the paper analyses why implementing the TP technique in RCMB results in better performance.
2 TP implementation in modified booth
The implementation of the TP technique is not the same for the Baugh-Wooley (BW) and MB algorithms. Figure 3 illustrates the 8 × 8 MB multiplication, which uses the sign extension scheme proposed by Fadavi-Ardekani [16] and the recoding logic proposed in [15]. Achieving double throughput in the MB algorithm requires appropriate selection of partial product signals in N and N/2 b multiplication, i.e., sign extension prevention in the LSB and MSB multiplications has to be taken care of when N/2 b multiplications are performed. Simply put, the steps performed for N b multiplication have to be carried over to N/2 b multiplication. Apart from Booth encoding and decoding, the steps (changes) to be performed in N/2 LSB and MSB multiplication for the MB algorithm [2] are the following: the precomputed LSB (PLSB) and its carry a_i have to be calculated separately, and the inversion of the MSB in each partial product row and the sign bit pattern of 1s and 0s differ between the N/2 LSB and MSB multiplications. Multiplexers make these changes in the N b partial product array by selecting the appropriate partial products for N and N/2 b multiplication.
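For readers unfamiliar with the Booth encoding step mentioned above, the following is a minimal sketch of textbook radix-4 (modified) Booth recoding. It is not the specific recoding logic of [15]; the function names are hypothetical. It shows why an N b signed multiplier operand produces only N/2 partial product rows before the extra sign handling row is added.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's complement value y into n/2 digits in {-2, ..., 2}.

    Digit k is -2*y[2k+1] + y[2k] + y[2k-1], with y[-1] taken as 0.
    """
    bits = [(y >> k) & 1 for k in range(n)]  # Python's >> sign-extends negatives
    digits = []
    prev = 0  # implicit bit y[-1]
    for k in range(0, n, 2):
        digits.append(-2 * bits[k + 1] + bits[k] + prev)
        prev = bits[k + 1]
    return digits


def booth_multiply(x, y, n):
    """Multiply by summing one partial product (digit * x) per Booth digit."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, n)))


digits = booth_radix4_digits(-3, 8)  # four digits, hence four partial product rows
product = booth_multiply(57, -3, 8)
```

Each digit selects 0, ±x or ±2x, so negative digits are what create the need for the twos complement correction bits and sign extension handling that the TP changes in this section deal with.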
To obtain TP in modified Booth, the partial products that are shaded (grey) in Figure 4 are needed when the multiplier performs N/2 b multiplication, whereas the non-shaded partial products are set to zero. The partial products are set to zero by connecting the MB decoder output to a two-input AND gate whose other input is the control signal. Depending on this control signal, either the MB decoded signal or zero appears at the output. Multiplexers are needed to select the appropriate partial products for N and N/2 b multiplications. The encoding scheme for N/2 b multiplication is shown in Figure 5.

Partial products P40 and P41 of the normal 8 × 8 multiplication (Figure 3) have to be inverted, as MSBs, for sign extension prevention in the N/2 LSB multiplication (Figure 4).

Partial products P42 and P43 of the normal (N bit) multiplication have to be replaced by PLSB2 and PLSB3 for the N/2 MSB multiplication.

aMSP0 and aMSP1 have to be added for the N/2 MSB multiplication.

The pattern of 1s and 0s differs between N and N/2 bit multiplication.
In Figure 6, the partial products are mapped to the high-performance multiplier (HPM) reduction tree [17], and the shaded regions indicate the partial products involved in N/2 b multiplication. Another issue in N/2 b multiplication is that the multiplication in the LSB part must not interfere with the MSB multiplication, which is why the encoding logic in Figure 5 is followed. The pattern of 1s and 0s in the last row of the partial product array is for sign extension prevention, and the most significant ‘1’ inverts the S_{2N−1} bit. While performing N/2 LSB multiplication, the most significant ‘1’ present in the S7 column produces a carry, and if it propagates, it will affect the N/2 MSB multiplication result. To avoid this problem, the output of S7 is passed to an exclusive OR (XOR) gate, where the inversion is made. This is illustrated in Figure 6, where the most significant ‘1’ is set to zero and the S7 output is passed to the XOR gate.
Table 1 Changes needed for twin precision
Sum bit  Twin 2009 [2] signal selection
S0  No change 
S1  No change 
S2  No change 
S3  No change 
S4  1/0 and P40/\( \overline{\mathrm{P}40} \) 
S5  1/a2 
S6  P41/\( \overline{\mathrm{P}41} \) 
S7  0/a3 
S8  PLSB2/P42 and 0/1 
S9  aMSP0/1 
S10  PLSB3/P43 
S11  aMSP1/1 
S12  1/0 
S13  No change 
S14  No change 
S15  No change 
By comparing Figures 3 and 4, the changes needed in each column of the 8 × 8 MB multiplication for TP are tabulated in Table 1. Depending on the multiplication performed (N or N/2 b), the appropriate changes in the partial product array are made by multiplexers. In N b multiplication (Figure 3), the S4 column normally has no sign bit, but when N/2 LSB multiplication (Figure 4) is performed, a ‘1’ has to be included as the sign bit and the MSB (P40) has to be inverted. So two multiplexers are required in the S4 column to make these two changes. As explained earlier, the MSB of each partial product row has to be inverted, so the P41 bit has to be inverted for N/2 LSB multiplication. In the seventh column, the potential carry (a3) in the N bit partial product array has to be replaced by the sign bit 1 for N/2 LSB multiplication. Likewise, the changes made for N/2 MSB multiplication, such as the precomputed LSB (PLSB2, PLSB3), the potential carry (aMSP0, aMSP1) and the sign bit pattern of 1s and 0s, are tabulated in Table 1.
In total, an 8 × 8 TP MB multiplication needs 11 multiplexers. The multiplexer (mux) utilisation is calculated for various bit widths. From the analysis of the mux utilisation for the implementation of TP in MB [2], it is clear that this method requires N + 3 multiplexers to select the appropriate partial products, where N is the bit width of the multiplier.
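The count of 11 can be checked directly against Table 1. The sketch below transcribes the changed columns of Table 1 into a dictionary; the dictionary layout, the `~` notation for an inverted bit and the helper name are illustrative assumptions. Each "a/b" pair is counted as one two-way multiplexer.

```python
# Changed columns transcribed from Table 1 (8 x 8 MB multiplier, twin 2009 [2]).
# "~P40" stands for the inverted P40 bit; "and" joins two selections in one column.
TABLE_1 = {
    "S4": "1/0 and P40/~P40",
    "S5": "1/a2",
    "S6": "P41/~P41",
    "S7": "0/a3",
    "S8": "PLSB2/P42 and 0/1",
    "S9": "aMSP0/1",
    "S10": "PLSB3/P43",
    "S11": "aMSP1/1",
    "S12": "1/0",
}


def count_muxes(table):
    """Count one multiplexer per "a/b" selection listed in the table."""
    return sum(len(entry.split(" and ")) for entry in table.values())


N = 8
total = count_muxes(TABLE_1)
```

For N = 8 this gives 11 selections, which agrees with the N + 3 expression stated above.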
3 Implementation of TP in RCMB
In this study, we have designed an optimised TP multiplier. The optimisation is achieved by implementing the TP technique in RCMB, and an algorithm is proposed that is applicable for bit widths N that are multiples of eight. Implementing TP in RCMB reduces the mux utilisation to (N/2) + 3. Muxes are normally employed in a TP multiplier for partial product selection in N and N/2 b multiplication. In prior work [2], the implementation of TP involves N + 3 multiplexers for partial product selection. By applying our proposed algorithm, TP can be obtained in the RCMB multiplier.
In this paper, we have implemented the TP technique in the RCMB multiplier, which has N/2 partial product rows. The sign extension prevention scheme proposed in [18] is used in the RCMB. In our proposed algorithm, much more work has to be done in the upper bound than in the lower bound. Inverting the most significant bit in the partial product rows and adding the sign bit 1s are the two major steps followed to perform twin precision. In the prior implementation of the twin precision technique [2], more changes (steps) have to be performed in the partial product array than in the proposed implementation of TP in RCMB, which in turn increases the need for multiplexers. Therefore, the area overhead of implementing twin precision in the MB algorithm is higher than that of our proposed method. While obtaining twin precision in the RCMB algorithm, the N/2 rows increase to (N/2) + 1; the design uses the sign extension prevention scheme in [18] and the MB recoding logic proposed in [19]. When the TP technique is implemented in RCMB, the multiplexer utilisation reduces drastically, and its impact on area, delay and power is analysed.
B_{T}(i)  total bits (BT) in row i
For the 8 × 8 RCMB multiplier illustrated in Figure 8, B_{T} = 11, and this is the same for all rows. When we perform N/2 LSB multiplication in the N b tree, the most significant bit in each partial product row has to be inverted (negated). To do this, we need to know the bit position (Xi) in the upper bound for the corresponding row (i), given as B[Xi] in Equation 4, where B is the bit and X denotes the bit position. In the lower bound, no inversion of bits is needed for N/2 MSB multiplication because the inversion was already made for the N bit multiplication. The inversion of bits that has to be implemented in the upper bound at the corresponding bit position of the ith row is described in Equation 4.
B[Xi]  bit position in the corresponding row i
C_{N}  constant, which depends on the value of N
Proof
j_{ou(I)}  ones (O) in the upper (U) bound for iteration (I)
Proof
So 1s are inserted in the fifth, sixth and eighth columns.
Table 2 Changes to be made to obtain TP
Sum bit  Twin 2009 [2] signal selection  Proposed twin signal selection
S15  No change  No change 
S14  No change  No change 
S13  No change  No change 
S12  1/0  1 
S11  aMSP1/1  No change 
S10  PLSB3/P43  P81/neg3 
S9  aMSP0/1  No change 
S8  PLSB2/P42 and 0/1  P23/neg2 
S7  0/a3  P51/1 
S6  P41/\( \overline{P41} \)  P41/\( \overline{P41} \) 
S5  1/a2  P31/1 
S4  1/0 and P40/\( \overline{P40} \)  P02/1 and P40/\( \overline{P40} \) 
S3  No change  No change 
S2  No change  No change 
S1  No change  No change 
S0  No change  No change 
Proof
In the thirteenth column, a 1 is added.
OB_{L(j)}  one (O) in the lower bound (BL) in the corresponding column (j)
From the proof of Equation 6, it is clear that a 1 has to be added in the 13th column. All the steps of the algorithm are applied to the RCMB multiplier and mapped onto the HPM reduction tree [17] shown in Figure 10. For N b multiplication, the sum (S) bits run from S0 to S15, and for N/2 LSB and MSB multiplication, the sum bits are S0 to S7 and S8 to S15, respectively.
In RCMB [11], the N/2 + 1 rows are reduced to N/2 rows by adding the last row sign bit, i.e., the (neg((N/2) − 1)) bit, to the MSB part of the first row partial product. In our chosen example in Figure 7, it is the neg3 bit, which would otherwise lead to (N/2) + 1 rows, that is added to the first row, whose partial products are represented as qij. We exploited the TP technique in the RCMB multiplier, and a suitable algorithm is proposed. In our proposed TP implementation in RCMB, when we perform N/2 MSB multiplication, the first row (i = 3) of the lower bound is not modified as in N bit multiplication because that would increase the complexity of the design.
Proof
\( j_{(4)} = j_{(4-1)} + 2 = 9 + 2 = 11 \), so neg3 → 11 (the neg3 bit is added to the eleventh column)
Note: i = 1 to N/2 (first step in the ‘Algorithm for obtaining twin precision in RCMB’ section). For N = 8, i varies from 1 to 4. Since i ends at 4, it is not necessary to find j_{(i)} for values of i beyond N/4 + 2, i.e., for i = N/4 + 3 = 5, because there is no fifth row (i = 5) in the 8 × 8 (N = 8) RCMB multiplier.
The negk (k = 0 to (N/2 − 1)) bits are used to generate the twos complement in the RCMB algorithm. During N bit multiplication, the negk bits are added in the LSB part of each partial product row. So for N/2 LSB multiplication, there is no problem in adding the neg(0 to (N/4 − 1)) bits in the LSB part, i.e., in the upper bound. But for N/2 MSB multiplication, the neg(N/4 to (N/2 − 1)) bits have to be added in the corresponding LSB part of each partial product row in the lower bound. By Equation 7, this is done by adding the neg(N/4 to (N/2 − 1)) bits to the corresponding columns. From the proof of Equation 7, it is clear that neg2 and neg3 are added in the 9th and 11th columns. In Figure 9, the neg2 bit is joined with P23, i.e., for N/2 MSB multiplication, the multiplexer selects the neg2 bit instead of P23. Likewise, in the eleventh column, the neg3 bit is joined with P81.
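The column placement of the neg bits in the lower bound can be sketched from the step j(i) = j(i−1) + 2 used in the proof above. The starting column is an assumption: the sketch places the first lower-bound neg bit in column N + 1 (counting columns from 1), which reproduces the 9th and 11th columns of the N = 8 example but is not taken from Equation 7 itself. The function name is hypothetical.

```python
def neg_columns_lower_bound(n):
    """Map k -> 1-based column of the neg_k bit, for k = n/4 .. n/2 - 1.

    ASSUMPTION: the first lower-bound neg bit sits in column n + 1, i.e. the
    least significant column of the MSB half; later rows follow the step
    j(i) = j(i-1) + 2 quoted in the text.
    """
    columns = {}
    j = n + 1
    for k in range(n // 4, n // 2):
        columns[k] = j
        j += 2  # each Booth row is shifted two columns relative to the previous one
    return columns


placement = neg_columns_lower_bound(8)  # neg2 and neg3 for the 8 x 8 example
```

For N = 8 the sketch yields column 9 for neg2 and column 11 for neg3, i.e., sum bits S8 and S10, matching the entries P23/neg2 and P81/neg3 in Table 2. Values for other bit widths follow from the stated assumption only.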
By implementing the TP technique in RCMB, we found that our approach requires fewer muxes than the previous implementation [2]. Figure 9 depicts the implementation of all the above steps for N and N/2 bit multiplication. When the multiplier performs N/2 b multiplication, the unwanted partial products that are not shaded in Figure 9 are set to zero, and multiplexers are inserted wherever changes are required. The columns where changes have to be made in RCMB to obtain twin output using our algorithm are S4, S5, S7, S8, S10 and S12, as charted in Table 2. By consolidating these changes, we can formulate the mux utilisation for our approach.
Here in this study, we have chosen the 8 × 8 RCMB multiplier to explain our idea. From the consolidated changes, we find that (N/2) + 3 multiplexers are needed, which was verified for various bit widths that are multiples of eight, as outlined in Table 3. The earlier work [2] uses N + 3 multiplexers to select the appropriate signals. Since the multiplier structure is an interconnection of full adders and half adders, the adders in the critical path have to wait for the previous carry signal before producing the output sum bit S. Implementing our proposed algorithm for obtaining TP in RCMB results in lower multiplexer utilisation, whose impact on area, delay and power is analysed in the Results and discussion section. Because of this drastic reduction in multiplexers, the area, delay and power of the TP multiplier are reduced compared to the prior work of implementing TP in the modified Booth algorithm [2].
4 SWP
The bounded regions in Figure 11 that are green illustrate four 4-bit (N/4) multiplications, and the red bounded region shows TP, i.e., N/2 multiplications. Multiple SWP multiplication means that either N/2 or N/4 bit multiplication is possible. When four 4-bit multiplications are performed, the first lower precision multiplication takes inputs x0 to x3 and y0 to y3, the second takes x4 to x7 and y4 to y7, the third takes x8 to x11 and y8 to y11 and the fourth takes x12 to x15 and y12 to y15. The RCMB algorithm is applied to all four lower precision multiplications. In this algorithm, the MSB of every partial product row has to be inverted to achieve twos complement, 1s are added for sign extension and negk bits are added in the LSB of each partial product row. For the first lower precision multiplication, the first two rows (only the green bounded region) of partial products in Figure 11 for N = 16 are taken into consideration. Since the inputs are x0 to x3 and y0 to y3, the MSBs for this multiplication are P30 and P31, which have to be inverted to achieve twos complement. A multiplexer selects either P30 or \( \overline{P30} \) and either P31 or \( \overline{P31} \). The sign bit ‘1’ has to be added in the fourth, fifth and seventh columns for sign extension in the RCMB algorithm. In the fourth column, no partial product is available to merge with the sign bit ‘1’, so this leads to an extra half adder during the addition of partial products to produce the sum bit (S), and this sign bit is 0 for N/2 and N bit operation. In the fifth and seventh columns, the sign bit 1 is merged as P02/1 and P22/1, i.e., a mux is employed in place of P02 and P22 that selects either the partial product or 1 depending on the operation. neg0 and neg1 are added in the LSB. This leads to the use of four multiplexers in the first lower precision region bounded in green to select the partial products for N/4 operation.
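The expected outputs of such lane-wise signed multiplications can be described by a small behavioural reference model. This is a sketch for checking results only, not a model of the partial product array or its multiplexers; the helper names and the choice to pack each product into a field twice as wide as its operands are assumptions. Lane 0 corresponds to inputs x0 to x3 and y0 to y3 when N = 16 and four lanes are used.

```python
def to_signed(value, width):
    """Interpret the low `width` bits of value as a two's complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def unpack(word, width, lanes):
    """Split a packed word into `lanes` signed subwords, lane 0 in the LSBs."""
    return [to_signed(word >> (k * width), width) for k in range(lanes)]


def pack(values, width):
    """Pack signed values into adjacent two's complement fields of `width` bits."""
    mask = (1 << width) - 1
    word = 0
    for k, v in enumerate(values):
        word |= (v & mask) << (k * width)
    return word


def swp_reference(x, y, n=16, lanes=4):
    """Lane-wise signed products, each stored in a field of 2 * (n / lanes) bits."""
    w = n // lanes
    products = [a * b for a, b in zip(unpack(x, w, lanes), unpack(y, w, lanes))]
    return pack(products, 2 * w)


x = pack([-3, 5, -8, 7], 4)
y = pack([2, -4, -8, 1], 4)
four_lane = unpack(swp_reference(x, y, 16, 4), 8, 4)  # four 4-bit multiplications
two_lane = unpack(swp_reference(pack([-100, 90], 8), pack([50, -2], 8), 16, 2), 16, 2)
```

Setting `lanes=2` gives the TP case (two N/2 b products) and `lanes=1` gives the ordinary N b signed product, so the same model covers the red and green regions described above.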
Table 3 Mux utilisation
S number  Bit width  Twin 2009 mux (N + 3)  Proposed twin mux ((N/2) + 3)
1  8  11  7 
2  16  19  11 
3  32  35  19 
4  64  67  35 
5  128  131  67 
5 Results and discussion
Table 4 Mux utilisation for SWP and TP
S number  Bit width  Proposed twin mux  Mux count for multiple SWP (N/2 + N/4)
1  16  11  27 
2  32  19  52 
To our knowledge, TP has been implemented for signed multipliers in [2], and no further optimization of the TP multiplier has been done. Compared to prior work [2], the implementation of our proposed algorithm requires fewer changes for performing N/2 multiplication. These changes, i.e., the selection of the appropriate partial products, are typically made using multiplexers. Table 1 illustrates the signal selection that has to be made for N/2 multiplication in MB. The TP implementation in MB contains (N/2) + 1 partial product rows and adopts the sign extension scheme presented in [16], which requires more changes for N/2 bit multiplication. Multiplexer utilisation is directly related to the changes to be made in the partial product rows for N/2 b multiplication: for every change, a multiplexer has to be deployed to select the required partial product for N and N/2 b multiplication.
For the twin precision implementation in the modified Booth algorithm [2], the changes (steps) that have to be made for N/2 b multiplication apart from Booth encoding and decoding are as follows: the PLSB and its potential carry in the most significant part (aMSP) have to be computed separately for N/2 LSB and MSB multiplication, and sign extension also has to be made separately for N/2 b multiplication. So these three steps have to be performed in N/2 LSB and MSB multiplication. In RCMB [11], apart from Booth encoding and decoding, only two steps have to be performed for N/2 b multiplication: the neg bit is added in the LSB of each partial product row to generate the twos complement, and 1s are added for sign extension. RCMB does not require the PLSB and potential carry (less complexity), which makes RCMB more suitable for TP implementation and lets it use fewer multiplexers. For an 8 × 8 TP multiplier, twin 2009 [2] requires 11 multiplexers, whereas our proposed method requires only 7 multiplexers, as consolidated in Table 3. The reduction in multiplexers occurs because only two steps have to be performed in RCMB N/2 b multiplication compared to the three steps in [2].
Our proposed method has been tested for various bit widths that are multiples of eight, and the results are compared with prior work [2] and tabulated in Table 4. From the analysis for various bit widths, it is inferred that our method needs (N/2) + 3 multiplexers, whereas the multiplexer utilisation of the previous implementation of the TP technique [2] has been formulated as N + 3. Owing to the reduction in multiplexers, the overall area, delay and power of the TP multiplier are reduced. This reduction in multiplexer utilisation also reduces the design complexity of the TP multiplier.
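The two multiplexer formulas can be evaluated side by side for the bit widths considered in this paper. The short sketch below only applies the expressions N + 3 and (N/2) + 3 stated in the text; the function name and the percentage calculation are illustrative additions.

```python
def mux_counts(n):
    """Return (prior, proposed, saving in percent) for bit width n."""
    prior = n + 3          # twin 2009 [2]
    proposed = n // 2 + 3  # proposed TP in RCMB
    saving = 100.0 * (prior - proposed) / prior
    return prior, proposed, saving


rows = {n: mux_counts(n) for n in (8, 16, 32, 64, 128)}
for n, (prior, proposed, saving) in rows.items():
    print(f"N={n:3d}  prior={prior:3d}  proposed={proposed:2d}  saving={saving:.1f}%")
```

The counts reproduce the multiplexer table (11 versus 7 for N = 8, up to 131 versus 67 for N = 128). The relative saving computed this way grows from about 36% at N = 8 to about 49% at N = 128, which is close to the range of roughly 40% to 50% quoted in the paper.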
Table 5 Comparison chart of the existing TP multiplier, the proposed TP multiplier and the non-TP multiplier
Design  Non-TPM  Twin 2009 [2]  Proposed twin

Bit width  8  16  32  64  128  8  16  32  64  128  8  16  32  64  128
Mux  -  -  -  -  -  11  19  35  67  131  7  11  19  35  67
Area (μm^{2}) (cells)  398  1,385  5,247  19,706  74,390  444  1,603  6,412  25,780  103,320  412  1,458  216,552  90,836  
Delay (ps)  359  692.27  1,273.8  2,553.5  5,481.65  394.6  818.5  1,627.4  3,453.8  7,865.3  378.2  7438  1,415.4  2,935.05  6,449 
Power (nW)  13,051.8  362,256.4  731,984.4  1,548,441.2  3,478,721.9  12,198  320,581  609,987  1,200,342  2,576,831  11,222  280,212  5,123,891  912,259  1,700,708 
Cadence RTL Compiler with the TSMC 180 nm library is used to synthesise the designs and analyse the cell area, delay and power. All the evaluations, including power consumption, are taken after post place and route synthesis in the Cadence digital flow. For higher multiplier bit widths, the reduction in area, delay and power is more noticeable for the proposed twin compared with twin 2009 [2]. A TP multiplier requires muxes for the selection of partial products, so in Table 5, the non-TP multiplier (non-TPM) has less area than twin 2009 [2] and the proposed twin. The non-TP multiplier requires less area because of the absence of multiplexers, but the TP multiplier produces double output, which is not possible in a non-TP multiplier. Though the non-TP multiplier has less delay than the TP multiplier, it cannot produce either one N b or two N/2 b outputs at a time. Moreover, the increase in delay for TP is only 15% when compared to the non-TP multiplier. While performing N/2 b multiplication in the TP multiplier, the unwanted partial products are set to zero, and this is the reason for the lower power of the TP multiplier compared to the non-TP multiplier.
From Table 5, it is inferred that our TP implementation in RCMB achieves a reduction in delay of about 5% to 18%, in area of about 5% to 20% and in power of 8% to 32% due to the reduction in mux utilisation. The mux utilisation is reduced by up to 40% compared with the previous method [2]. When N/2 multiplication is performed, i.e., when the multiplier performs a narrow width operation, a significant reduction in power is achieved. Most of the partial products are set to zero in N/2 b multiplication, which leads to an overall power reduction.
From the results in Figure 11, it is inferred that our implementation gives better performance compared with the twin 2009 [2] methodology.
In hardware design, obtaining a reconfigurable architecture in an ASIC is challenging. Through the TP technique, double throughput is achieved in signed multipliers. In this study, we have proposed an optimised twin precision multiplier. To demonstrate the effectiveness of the proposed multiplier, we implemented both the twin 2009 [2] multiplier and the proposed optimised TP multiplier in the FFT complex multiplication and analysed the results. Apart from the effective hardware utilisation, our proposed TP multiplier yields reduced area, power and delay compared to the twin 2009 [2] TP multiplier.
First, our analysis is made by replacing the three multipliers in the complex multiplication in Figure 13 with TP multipliers. Second, the experiment was conducted with a 32-bit TP multiplier. So in a single 32-bit multiplier, two 16-bit multiplications (N/2 multiplication) or one 32-bit multiplication (N bit multiplication) can be performed. During N bit multiplication, the data width is 32 bit and the twiddle factor is also 32 bit, whereas in N/2 multiplication, the data width is 16 bit and the twiddle factor is 16 bit. When the three 32-bit multipliers in Figure 13 are replaced by TP multipliers, each can perform one 32-bit multiplication or two 16-bit multiplications as per the TP logic.
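A complex multiplication with three real multipliers can be sketched as follows. Figure 13 is not reproduced here, so the particular arrangement of additions below is an assumption: it is one common three-multiplication form and may differ from the structure actually used. The `multiply` parameter is a hypothetical hook showing where a TP multiplier model could be substituted.

```python
def complex_multiply_3mult(a, b, c, d, multiply=lambda p, q: p * q):
    """Return (real, imag) of (a + jb)(c + jd) using three real multiplications.

    ASSUMPTION: this rearrangement is illustrative and is not claimed to be
    the exact datapath of Figure 13.
    """
    k1 = multiply(c, a + b)
    k2 = multiply(a, d - c)
    k3 = multiply(b, c + d)
    return k1 - k3, k1 + k2  # (ac - bd, ad + bc)


# Data sample (3 + 4j) multiplied by a twiddle-like factor (5 - 2j).
real, imag = complex_multiply_3mult(3, 4, 5, -2)
```

The saving of one multiplier compared with the direct four-multiplication form comes at the cost of extra additions before and after the multipliers.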
Table 6 TP in complex multiplication
Parameter  Twin 2009 [2] multiplier  Proposed TP multiplier (proposed twin)
Data width  32  32 
Area (μm^{2})  20,186.47  16,956.99 
Delay (ps)  5,123  4,252 
Power (nW)  1,867,221.23  1,456,432.67 
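The relative improvement implied by the complex multiplication results can be worked out directly from the tabulated values. The sketch below only performs this arithmetic; the dictionary keys and helper name are illustrative.

```python
# (twin 2009 [2], proposed twin) values for the 32-bit complex multiplication case.
COMPLEX_MULT_RESULTS = {
    "area_um2": (20186.47, 16956.99),
    "delay_ps": (5123.0, 4252.0),
    "power_nW": (1867221.23, 1456432.67),
}


def relative_saving(prior, proposed):
    """Percentage reduction of the proposed value relative to the prior value."""
    return 100.0 * (prior - proposed) / prior


savings = {
    name: round(relative_saving(prior, proposed), 1)
    for name, (prior, proposed) in COMPLEX_MULT_RESULTS.items()
}
```

With these figures, the proposed TP multiplier is about 16% smaller, 17% faster and 22% lower in power than the twin 2009 [2] multiplier in the complex multiplication case study.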
6 Conclusions
Double throughput in an ASIC environment is achieved effectively by implementing the TP technique in RCMB, and a suitable algorithm is proposed. Our implementation requires fewer changes in the partial product array to obtain TP. Depending on whether the multiplication is N or N/2 bit, the appropriate partial products in the TP multiplier are selected using multiplexers. Since our implementation uses fewer multiplexers, about (N/2) + 3, the overall delay, area and power are reduced compared to the prior implementation of the TP technique in the MB algorithm. Our proposed TP implementation uses 40% to 50% fewer multiplexers (for N = 8 to 128) and gives better performance (area, delay and power) than the prior implementation of TP in the MB algorithm. To test the efficiency of the system, our proposed TP multiplier is implemented in FFT complex multiplication, where it also gives better performance.
References
[1] M Själander, H Eriksson, P Larsson-Edefors, An efficient twin precision multiplier, in Proc. 22nd IEEE Int. Conf. Comput. Des, 2004, pp. 30–33
[2] M Själander, P Larsson-Edefors, Multiplication Acceleration Through Twin Precision, in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 9, 2009
[3] Differences between FPGA and ASIC. http://www.xilinx.com/fpga/asic.htm.
[4] About ASIC and FPGA. http://www.altera.com
[5] GH Loh, Exploiting data-width locality to increase superscalar execution bandwidth, in Proc. 35th Int. Symp. Microarchitecture, 2002, pp. 395–405
[6] D Brooks, M Martonosi, Dynamically exploiting narrow width operands to improve processor power and performance, in Proc. 5th Int. Symp. High Perform. Comput. Arch, 1999, pp. 13–22
[7] S Krithivasan, MJ Schulte, Multiplier Architectures for Media Processing, in IEEE Conference, 2003
[8] S Khan, E Casseau, D Menard, Reconfigurable SWP operator for multimedia processing, in IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2009
[9] A Danysh, D Tan, Architecture and Implementation of a vector/SIMD multiply-accumulate unit, in IEEE Transactions on Computers, vol. 54, 2005, pp. 284–293
[10] AD Booth, A signed binary multiplication technique, in Quarterly J. Mechanical and Applied Math, vol. 4, 1951, pp. 236–240
[11] F Lamberti, N Andrikos, E Antelo, P Montuschi, Reducing the Computation Time in (Short Bit-Width) Two’s Complement Multipliers, in IEEE Transactions On Computers, vol. 60, no. 2, 2011
[12] OL MacSorley, High speed arithmetic in binary computers. Proc. Inst. Radio Eng. 49(1), 67–97 (1961)
[13] Z Huang, MD Ercegovac, High-Performance Low-Power Left-to-Right Array Multiplier Design. IEEE Trans. Comput. 54(3), 272–283 (2005)
[14] R Zimmermann, DQ Tran, Optimized Synthesis of Sum-of-Products, in Proc. Conf. Record of the 37th Asilomar Conf. Signals, Systems and Computers, vol. 1, 2003, pp. 867–872
[15] WC Yeh, CW Jen, High-speed Booth encoded parallel multiplier design. IEEE Trans. Comput. 49(7), 692–701 (2000)
[16] J Fadavi-Ardekani, M × N Booth encoded multiplier generator using optimized Wallace trees, in IEEE Trans. Very Large Scale Integr. (VLSI) Syst, vol. 1, no. 2, 1993, pp. 120–125
[17] H Eriksson, P Larsson-Edefors, M Sheeran, M Själander, D Johansson, M Schölin, Multiplier reduction tree with logarithmic logic depth and regular connectivity, in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2006, pp. 4–8
[18] MD Ercegovac, T Lang, Digital Arithmetic. Morgan Kaufmann Publishers, an imprint of Elsevier, 2004
[19] JY Kang, JL Gaudiot, A Fast and Well-Structured Multiplier, in Proc. Euromicro Symp. Digital System Design, 2004, pp. 508–515
[20] YS Algnabi, R Teymourzadeh, M Othman, S Islam, FPGA Implementation of Pipeline Digit-Slicing Multiplier-Less Radix 2^{2} DIF SDF Butterfly for Fast Fourier Transform Structure, Institute of MicroEngineering and Nanoelectronics IMEN, VLSI Design Department, Malaysia, in The 5th European Conference on Antennas and Propagation (EUCAP2011), pp. 4168–4172
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.