An optimized two-level discrete wavelet implementation using residue number system
- Husam Y. Alzaq^{1} and
- B. Berk Ustundag^{1}
https://doi.org/10.1186/s13634-018-0559-3
© The Author(s) 2018
- Received: 11 October 2017
- Accepted: 30 May 2018
- Published: 25 June 2018
Abstract
Using the discrete wavelet transform (DWT) in high-speed signal processing applications demands careful attention to hardware resource availability, latency, and power consumption. In this paper, we investigate the design and implementation of a multiplier-free two-level DWT based on the residue number system (RNS). The proposed two-level design performs the multiplication operations using only memory, without involving dedicated multiplier units, which preserves valuable resources for other critical tasks within the FPGA. The design was implemented and synthesized on the ZYNQ ZC706 development kit, taking advantage of embedded block RAMs (BRAMs). The experimental results show a considerable improvement in the proposed two-level DWT design with regard to latency and the peak signal-to-noise ratio (PSNR) of the final output.
Keywords
- Discrete wavelet transform (DWT)
- Digital signal processing (DSP)
- Residue number system (RNS)
- Field programmable gate array (FPGA)
1 Introduction
In this work, we preferred the conventional convolution-based DWT implementation over the lifting-based scheme (LS) for the following reasons. In LS, as the critical path delay (CPD) increases, the energy per operation increases and the operating frequency decreases [10]. In [11, 12], the authors found that the CPD increases with the filter length (N). Hence, the sequence of multiplications and additions is longer than in the convolution-based scheme. LS therefore scales poorly and is inappropriate for large filter lengths [13, 14]. In addition, LS requires temporary registers to store the intermediate results, which takes up more storage area and consumes more power [15, 16]. For these reasons, we decided to implement the DWT using the convolution-based approach, but with a multiplierless architecture.
Multiplierless approaches eliminate the use of multipliers by replacing individual coefficient multipliers with a single multiplier block, known as a multiple constant multiplication (MCM). Because the filter coefficients are fixed and determined in advance, implementing the multiplication of the input by these coefficients as an MCM block leads to area-, delay-, and power-efficient architectures [17].
The existing multiplierless algorithms can be divided into two general classes: they either reduce the number of multipliers or replace them entirely with simplified logic circuits. The most popular reduction algorithms are graph-based elimination (GE) [18] and common subexpression elimination (CSE) techniques [19–21]. The drawback of CSE algorithms is that their performance depends on the representation of the coefficients and is limited by the constant bit widths [20], whereas GE algorithms require more computational resources because of a larger search space [22].
On the other hand, several multiplierless architectures that eliminate all multipliers have been proposed. Distributed arithmetic (DA) efficiently performs the inner product function in a bit-serial manner via a look-up table (LUT) scheme, followed by shift accumulation operations [23–25]. Based on our previous experience, we identified that the ROM size in DA-based structures increases with the increase in the word length [26]. Residue number system (RNS) is a highly parallel non-weighted arithmetic system that is based on the residue of division operation of integers using the look-up table (LUT) scheme [27–29]. The key advantage of RNS is gained by reducing an arithmetic operation to a set of concurrent, but simple, operations. Another advantage of RNS is its large dynamic range, which is divided into independent smaller ranges, where addition and multiplication operations are performed in parallel without a carry propagation among them. Several applications, such as digital filters, benefit from the RNS implementation, e.g., [30–32]. To the best of our knowledge, the aforementioned approaches consider only one-level DWT implementation.
1.1 Contribution of this paper
This article focuses exclusively on the implementation of a two-level multiplier-free DWT. We propose a new design of a two-level RNS-based DWT that efficiently uses the memory elements in the first-level DWT and does not employ any memory element in the subsequent levels. In addition, this design eliminates the use of multiple residue-to-binary converters (RBCs) between consecutive levels. Generally, the number of levels is bounded by the output word length, and we determine this bound mathematically (Eqs. 15 and 16). Finally, the proposed RNS-based approach can achieve high PSNR values with a simple hardware structure and lower power consumption.
The remainder of this paper is organized as follows: In Section 2, the theoretical background on RNS is given. Section 3 illustrates the implementation of discrete wavelet transform. The implementation of the proposed two-level RNS is also presented. We further show an analytical comparison between these approaches. Section 4 presents the performance results. Finally, conclusions are drawn in Section 5.
2 Preliminaries
2.1 Discrete wavelet transform
where N is the number of filter taps. For simplicity in representing Eq. (4), x[n−k] is replaced by x[k].
2.2 Residue number system (RNS)
RNS [27, 28] is a non-weighted number system that performs parallel carry-free addition and multiplication arithmetic. In DSP applications, which require intensive computations, the carry-free propagation allows for concurrent computation in each residue channel.
Another aspect of using RNS is that an integer within a large dynamic range can be uniquely represented by a set of residues, P, that are of much smaller values, corresponding to the size of the moduli set.
M determines the range of unsigned numbers in [0,M−1]. In particular, M should be greater than the largest expected output.
where ∘ represents the addition, subtraction, or multiplication operation; and m_{ i }∈P.
where \( \hat {M_{i}} = M/{m_{i}} \) and \( \alpha _{i} = |\hat {M_{i}}^{-1}|_{m_{i}} \) is the multiplicative inverse of \( \hat {M_{i}} \) with respect to m_{ i }.
for each m_{ i }∈P. This implies that a q-channel DWT is implemented by q FIR filters that are working in parallel.
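The channel-wise arithmetic and the reverse conversion described above can be illustrated with a minimal Python sketch. It assumes the moduli set P_7 = {127, 128, 255} and the standard Chinese remainder theorem form built from the \( \hat{M_{i}} \) and \( \alpha_{i} \) terms defined above; the helper names `to_rns`, `rns_op`, and `from_rns` are illustrative and do not come from the paper.

```python
import operator
from math import prod

P7 = (127, 128, 255)  # {2^n - 1, 2^n, 2^(n+1) - 1} with n = 7


def to_rns(x, moduli=P7):
    """Forward conversion: one residue per channel."""
    return tuple(x % m for m in moduli)


def rns_op(op, a, b, moduli=P7):
    """Apply op independently in each channel (no carries cross channels)."""
    return tuple(op(ra, rb) % m for ra, rb, m in zip(a, b, moduli))


def from_rns(residues, moduli=P7):
    """Reverse conversion with the Chinese remainder theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        M_hat = M // m             # M_hat_i = M / m_i
        alpha = pow(M_hat, -1, m)  # multiplicative inverse of M_hat_i mod m_i
        total += M_hat * ((alpha * r) % m)
    return total % M


x, y = 538, 1234
for op in (operator.add, operator.sub, operator.mul):
    expected = op(x, y) % prod(P7)  # results are only defined modulo M
    assert from_rns(rns_op(op, to_rns(x), to_rns(y))) == expected
```

The three-argument `pow` with a negative exponent requires Python 3.8 or later. A negative result such as x − y comes back as its representative in [0, M−1], which is why signed values need an explicit mapping.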
For designing an efficient RNS-based DWT, the choice of the moduli set and the hardware design of the residue-to-binary conversion are two critical issues. The most widely studied moduli sets are built around powers of two because of their attractive arithmetic properties. For example, {2^{ n }−1,2^{ n },2^{n+1}−1} [37], {2^{ n }−1,2^{ n },2^{ n }+1} [38], and {2^{ n },2^{2n}−1,2^{2n}+1} [39] have been investigated. Four-moduli sets have been suggested to increase the dynamic range, e.g., {2^{ n }−1,2^{ n },2^{ n }+1,2^{n+1}−1} [40] and {2^{ n }−1,2^{ n },2^{ n }+1,2^{2n}+1} [41].
In this work, the moduli set P_{ n }={2^{ n }−1,2^{ n },2^{n+1}−1} is used for three reasons. First, the modular adder is simple and identical for both m_{1}=2^{ n }−1 and m_{3}=2^{n+1}−1. Second, even for a small n=7, the dynamic range of P_{7} is large: M equals 4145280, which can efficiently express real numbers in the range [−2.5,2.5] using a 16-bit fixed-point representation, provided that scaling and rounding are done properly. We assume that this interval is sufficient to map the input values, which do not exceed ± 2. Third, the reverse converter unit is simple and regular [36] because it does not employ any memory.
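As a quick check of the dynamic range quoted above, the following sketch computes M for the moduli set P_n and confirms that the moduli are pairwise coprime, which the RNS representation requires. The function names are illustrative.

```python
from itertools import combinations
from math import gcd


def moduli_set(n):
    """Return P_n = {2^n - 1, 2^n, 2^(n+1) - 1}."""
    return (2**n - 1, 2**n, 2**(n + 1) - 1)


def dynamic_range(n):
    """Return M, the product of the three moduli of P_n."""
    m1, m2, m3 = moduli_set(n)
    return m1 * m2 * m3


def pairwise_coprime(moduli):
    """True when every pair of moduli has greatest common divisor 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


for n in (7, 10, 13):
    print(n, moduli_set(n), dynamic_range(n), pairwise_coprime(moduli_set(n)))
```

The output word needs 3n+1 bits (n, n, and n+1 bits for the three channels), which gives the 22-, 31-, and 40-bit word lengths used for n = 7, 10, and 13.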
3 DWT implementation methodology
As mentioned in the previous sections, the wavelet transform of a signal can be performed by FIR filters, where the convolution operations are achieved by multiplying an input signal by the wavelet coefficients. In contrast, RNS-based approach has replaced the multiplication units with a suitable memory to perform the multiplication operations.
3.1 DWT implementation using RNS
3.1.1 Binary-to-residue converter (BRC)
The BRC is used to convert the result of multiplying an input number by a wavelet coefficient to q residue numbers by using LUT, shift, and modulo adders, where q is the number of channels. This procedure ensures that the multiplication operation is performed by using only memory.
3.1.1.1 RNS-system number conversion
The received input samples and the wavelet coefficients are real numbers and might take small values. One of the main limitations of the RNS number representation is that it only operates on non-negative integers in [0,M−1]. The DWT coefficients are generally close to zero and lie between − 1 and 1. Therefore, it is important to cope with both negative numbers and small numbers. To handle negative numbers, we mapped the real numbers to the RNS range. Assuming the input samples are in [− 2.5,2.5], we mapped any value in this range to a unique value in [0,(M−1)]. Any sample that does not fit this interval will produce incorrect values. Hence, the interval should be large enough to map all the numbers.
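A minimal sketch of this mapping is given below. It uses the offset of 2.5 and the input scaling factor 2^8 from the example in Section 3.2; rounding to the nearest integer is an assumption that reproduces the value 538 quoted there, and the name `map_input` is illustrative.

```python
OFFSET = 2.5  # shifts [-2.5, 2.5] to [0, 5]
Y = 8         # input scaling factor: samples are multiplied by 2^Y


def map_input(x, offset=OFFSET, y=Y):
    """Map a real sample to a non-negative integer suitable for the RNS range."""
    if not -offset <= x <= offset:
        # Samples outside the interval would produce incorrect results.
        raise ValueError(f"sample {x} is outside [-{offset}, {offset}]")
    return round((x + offset) * 2**y)


print(map_input(-0.4))  # the sample used in Section 3.2
```

With these settings the largest mapped value is 5·2^8 = 1280, which fits comfortably in the 16-bit input word.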
3.1.1.2 Modulo m _{ i } multiplier
The multiplication of the received sample by the filter coefficients, which are constants, can be performed by indexing the LUT. It is critical to identify the size of the LUT because the number of memory locations grows as 2^{ w } with the word length, w, of the received sample. Additionally, the design requires q LUTs to perform the modulo multiplication.
The memory content of h_{0} = −0.1294 or 757(− 266) multiplied by 2^{11} in P_{7}={127,128,255} when word length is 4
Location i | \( |-266 \cdot i|_{m_{1}} \) | \( |-266 \cdot i|_{m_{2}} \) | \( |-266 \cdot i|_{m_{3}} \) | ROM(i) (Eq. 10) |
---|---|---|---|---|
0 | 0 | 0 | 0 | 0 |
1 | 115 | 118 | 244 | 3798772 |
2 | 103 | 108 | 233 | 3402985 |
3 | 91 | 98 | 222 | 3007198 |
4 | 79 | 88 | 211 | 2611411 |
5 | 67 | 78 | 200 | 2215624 |
6 | 55 | 68 | 189 | 1819837 |
7 | 43 | 58 | 178 | 1424050 |
8 | 31 | 48 | 167 | 1028263 |
9 | 19 | 38 | 156 | 632476 |
10 | 7 | 28 | 145 | 236689 |
11 | 122 | 18 | 134 | 4002438 |
12 | 110 | 8 | 123 | 3606651 |
13 | 98 | 126 | 112 | 3243632 |
14 | 86 | 116 | 101 | 2847845 |
15 | 74 | 106 | 90 | 2452058 |
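The table contents can be regenerated with a short sketch. The packing of the three residues into one 3n+1-bit word (m_1 residue in the top n bits, m_2 residue in the next n bits, m_3 residue in the low n+1 bits) is inferred from the ROM(i) column rather than taken from Eq. 10, so it should be read as an assumption; `build_rom` is an illustrative name.

```python
def build_rom(coeff, n, address_bits):
    """Return the LUT contents for one scaled filter coefficient.

    Entry i holds |coeff * i| modulo each modulus of P_n, packed into
    a single (3n + 1)-bit word.
    """
    m1, m2, m3 = 2**n - 1, 2**n, 2**(n + 1) - 1
    rom = []
    for i in range(2**address_bits):
        product = coeff * i
        r1, r2, r3 = product % m1, product % m2, product % m3
        # r1 occupies n bits, r2 occupies n bits, r3 occupies n + 1 bits.
        rom.append((r1 << (2 * n + 1)) | (r2 << (n + 1)) | r3)
    return rom


rom = build_rom(coeff=-266, n=7, address_bits=4)
for i, word in enumerate(rom):
    print(i, word)
```

Python's `%` operator returns a non-negative result for a positive modulus, so the negative coefficient needs no special handling here.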
If the input word length is 16 bits, the LUT becomes very large because 2^{16} locations are needed. One way to reduce the memory size is to divide it into smaller memories, each consisting of 2×22 bits or 4×22 bits. Figure 5 shows the block diagram of the binary-to-residue converter with two and four memories, respectively. However, the outputs of the memories must be combined so that the final result is correct. It is worth noting that this division comes at the cost of additional adders and registers (discussed in Section 3.4).
With these improvements, the RNS-based system works as follows (step 5 from Fig. 3). Suppose that four memories are used, each with 16 locations. The input X_{16−bit}=(x_{1},x_{2},x_{3},x_{4}) is divided into four segments. Each 4-bit segment is fed into one memory to read a 22-bit word, which is then divided into three outputs corresponding to \( |h_{k}*x_{l}*2^{11}|_{m_{i}} \). We want to emphasize that this result is the product of each 4-bit segment and a filter coefficient with respect to m_{ i }.
To obtain the final multiplication’s result, each m_{ i } output will be shifted by l positions, where l is the index of the lowest input bit (4, 8, or 12). The modular multiplication and shift for 2^{ n }−1 and 2^{n+1}−1 can be achieved by a left circular shift (left rotate) for l positions, whereas the modular multiplication and shift for 2^{ n } can be achieved by a left shift for l positions [37]. Finally, the modulo adder adds the corresponding output.
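The shift step can be sketched in Python as follows: multiplication by 2^l is a left rotation for the moduli 2^n−1 and 2^{n+1}−1, and a truncating left shift for 2^n. The helper names `rotl` and `shift_residues` are illustrative.

```python
def rotl(value, shift, width):
    """Rotate a width-bit value left by shift positions."""
    shift %= width
    mask = (1 << width) - 1
    return ((value << shift) | (value >> (width - shift))) & mask


def shift_residues(residues, l, n):
    """Multiply the residues of P_n by 2^l using shifts only."""
    r1, r2, r3 = residues
    return (
        rotl(r1, l, n),             # modulo 2^n - 1: n-bit rotation
        (r2 << l) & ((1 << n) - 1), # modulo 2^n: shifted-out bits are dropped
        rotl(r3, l, n + 1),         # modulo 2^(n+1) - 1: (n+1)-bit rotation
    )


# The rotation agrees with ordinary modular multiplication by 2^l.
for l in (4, 8, 12):
    r = (103, 108, 233)
    expected = tuple((x << l) % m for x, m in zip(r, (127, 128, 255)))
    assert shift_residues(r, l, 7) == expected
```

The rotation works because 2^n ≡ 1 modulo 2^n−1, so a bit shifted out at the top re-enters at the bottom with the same weight.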
3.1.2 Modulo adder (MA)
To improve the design and enhance the speed, a parallel-prefix carry computational structure is used [42–44], which allows the implementation of highly efficient combinational and pipelined circuits for modular arithmetic.
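For reference, the arithmetic that such a modular adder must produce can be modelled behaviourally. The sketch below is not the parallel-prefix circuit of [42–44]; it only models the expected result, using the end-around-carry property of the moduli 2^n−1, and the function names are illustrative.

```python
def add_mod_2n_minus_1(a, b, n):
    """Add two residues modulo 2^n - 1 with an end-around carry."""
    mask = (1 << n) - 1
    s = a + b
    s = (s & mask) + (s >> n)     # fold the carry-out back into bit 0
    return 0 if s == mask else s  # the all-ones pattern also represents zero


def add_mod_2n(a, b, n):
    """Add two residues modulo 2^n by discarding the carry-out."""
    return (a + b) & ((1 << n) - 1)


print(add_mod_2n_minus_1(79, 62, 7), add_mod_2n(96, 28, 7))
```

The same routine serves both m_1 = 2^n−1 and m_3 = 2^{n+1}−1 by passing n or n+1 as the width, which is the regularity the moduli set was chosen for.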
3.1.3 The reverse converter
3.2 Example
In this subsection, an example with input x[1] = − 0.4 is given to illustrate how the RNS-based scheme works. Each input is added to 2.5 and then multiplied by 2^{8}. Therefore, if x[1] = − 0.4, the result is 538. Then, this value is multiplied by the scaled h_{0}=− 266 in P_{7}. The sample input can be rewritten as (0000 0010 0001 1010)_{2}, or x_{1}=0, x_{2}=2, x_{3}=1, and x_{4}=10. These values are used to index the memory, and the result of multiplying each x_{ i } by h_{0} can be found in Table 1, i.e., 0, 3402985, 3798772, and 236689, respectively. From the table, the residues corresponding to x_{1} are \((0, 0, 0)_{P_{7}}\), to x_{2} are \((103, 108, 233)_{P_{7}}\), to x_{3} are \((115, 118, 244)_{P_{7}}\), and to x_{4} are \( (7,28,145)_{P_{7}} \). After that, the value of h_{0}∗x_{2} is shifted by eight positions and the value of h_{0}∗x_{3} is shifted by four positions in each channel m_{ i }. For x_{2}, the middle value is shifted 8 bits to the left, whereas the other values are circularly shifted 8 bits to the left, and the result becomes \( (79, 0, 233)_{P_{7}}\). It is worth noting that the middle value is always 0 here because the word width, 7, is less than the shift amount, 8. For x_{3}, the middle value is shifted 4 bits to the left, whereas the other values are circularly shifted by four positions, and the result becomes \( (62, 96, 79)_{P_{7}}\). The sum of the partial results, computed by a tree of two-input MAs, is \( (148,124,457)_{P_{7}} \) before reduction, i.e., \( (21,124,202)_{P_{7}} \), which is equivalent to (mod(− 143108,127), mod(− 143108,128), mod(− 143108,255)). Finally, the output of this memory-based multiplication is aggregated with the next filter taps using MAs.
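The example can be replayed end to end with a short sketch. For brevity, the LUT read is modelled by computing the residues directly, and the shift is written as a modular multiplication by 2^l, which is arithmetically equivalent to the rotations used in hardware. The name `lut_multiply` is illustrative.

```python
P7 = (127, 128, 255)
H0 = -266  # scaled filter coefficient h_0 from Table 1


def lut_multiply(x, coeff=H0, moduli=P7, seg_bits=4, word_bits=16):
    """Multiply a 16-bit input by a constant, one 4-bit segment at a time."""
    acc = [0] * len(moduli)
    for l in range(0, word_bits, seg_bits):
        segment = (x >> l) & ((1 << seg_bits) - 1)    # LUT address
        for ch, m in enumerate(moduli):
            entry = (coeff * segment) % m             # value read from the LUT
            acc[ch] = (acc[ch] + (entry << l)) % m    # shift by l, modular add
    return tuple(acc)


x = round((-0.4 + 2.5) * 2**8)  # mapped input sample, 538
print(x, lut_multiply(x))
assert lut_multiply(x) == tuple((H0 * x) % m for m in P7)
```

The final assertion checks the segment-wise result against a direct computation of −266·538 modulo each modulus.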
3.3 Two-level DWT implementation
where τ_{ FCMA } is the RNS-based filter latency and τ_{ RBC } is the RNS-to-binary converting delay, respectively.
3.3.1 The design of the second FCMA
3.4 Hardware complexity
3.4.1 Memory usage
Occupied memories that are used by RNS-based DB2 DWT approaches. The input word length, w, is 16 bits and b=3∗n+1
Approach | Occupied memories (a = 8) | Occupied memories (a = 4) |
---|---|---|
Memory-size | (8×b) | (4×b) |
One-level | 8 | 16 |
Two-level | 16 | 32 |
Optimized FCMA (memory-based) | 24 | 36 |
Optimized FCMA (shift-based) | 0 | 0 |
For instance, each tap of a three-channel P_{10} FCMA has 6 memories of (4×10) and 3 memories of (4×11). Therefore, the FCMA at the second stage requires 36 memory elements, and 52 memory elements are required in total for the whole design.
The design of FCMA at the second stage can be improved by eliminating all the memory at FCMA, as shown in Fig. 8b. In this case, the proposed shift-based FCMA performs the multiplying operations via shift operations and MA units. The shift operation is always performed via rewiring the bits [37], which has no cost in terms of delay and hardware.
3.4.2 Adder counts
In addition to the memory complexity, we could derive an expression for the overall adder counts. In the following analysis, we can neglect the difference between (2^{ n }−1) and (2^{n+1}−1) MAs because it will not affect the total number of MAs.
For instance, a three-channel DB2 implementation requires 9 MA blocks to sum up the final result, and the P_{7} RNS-based implementation has a total of 49 MA blocks when w = 16 and a = 4 bits.
If the FCMA of the second stage is implemented by means of memory, then each unit requires \( (q*((\lceil \frac {n}{a} \rceil) - 1)) \) MAs to sum the output of each tap (Fig. 8a). In addition, (q∗(N−1)) MAs are required to sum all the output of all taps. This means, \( (N * q*((\lceil \frac {n}{a} \rceil) - 1) + q * (N-1)) \) MAs are used for the memory based proposed approach.
Memory usage and adders for RNS-based approaches for N-tap DWT
Approach | Number of memories | Number of MAs |
---|---|---|
One-level | N∗w/a | q∗(N∗w/a−1)+4 |
Two-level | 2∗N∗w/a | 2∗(q∗(N∗w/a−1)+4) |
Optimized memory-based FCMA | \( q * N * (\lceil \frac {n}{a} \rceil) \) | \( N * q*((\lceil \frac {n}{a} \rceil) - 1) + q * (N-1) \) |
Optimized shift-based FCMA | 0 | \(N* q (\bar {n} - 1) + q * (N-1) \) |
As (w/a) increases, the number of MAs increases because (w/a−1) MAs are required to construct the MA tree. Hence, the critical path delay (CPD) involves one memory-based multiplication followed by an MA tree with log_{2}(w/a−1) levels. As a consequence, the number of memories and their size must be traded off against the overall performance of the system.
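The expressions in the table above can be evaluated directly. The sketch below encodes the one-level, two-level, and memory-based FCMA rows as written; the function names are illustrative, and the shift-based row is omitted because its \( \bar{n} \) term is not defined in this section.

```python
from math import ceil


def one_level(N, w, a, q):
    """Memories and MAs for a one-level N-tap RNS-based DWT."""
    memories = N * w // a
    adders = q * (N * w // a - 1) + 4
    return memories, adders


def two_level(N, w, a, q):
    """Memories and MAs when the one-level design is simply duplicated."""
    memories, adders = one_level(N, w, a, q)
    return 2 * memories, 2 * adders


def memory_based_fcma(N, n, a, q):
    """Memories and MAs of the optimized memory-based second-stage FCMA."""
    segments = ceil(n / a)
    memories = q * N * segments
    adders = N * q * (segments - 1) + q * (N - 1)
    return memories, adders


# DB2 (N = 4 taps), w = 16-bit input, q = 3 channels
print(one_level(N=4, w=16, a=4, q=3))
print(two_level(N=4, w=16, a=4, q=3))
print(memory_based_fcma(N=4, n=10, a=4, q=3))
```

For a = 4 these expressions reproduce the 16 and 32 memories listed for the one- and two-level designs, the 36 memories of the memory-based FCMA, and the 49 MA blocks quoted for the one-level P_7 design.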
4 Simulation results, performance analysis, and validation
In the previous section, we demonstrated the design of the DWT using a residue number system. The two-level RNS-based DWT has been designed, implemented, and tested with a series of simulations to verify its functionality. Experiments were carried out on the Xilinx ZC706 evaluation board [45]. The performance of the proposed approach was compared with distributed arithmetic (DA) [3], which is a multiplierless DWT. We also considered the direct DWT implementation using an IP FIR Compiler 6.3 (FIR6.3) block, which provides a common interface to generate highly parameterizable, area-efficient, high-performance FIR filters [46].
In the following experiments, the moduli sets P_{7}={127,128,255}, P_{10}={1023,1024,2047}, and P_{13}={8191,8192,16383} were used. The dynamic ranges of these sets are M=4145280, 2144338944, and 1099310309376, respectively. The moduli sets P_{10} and P_{13} were selected because their dynamic ranges are greater than th_{ o }. For instance, Eq. 15 shows that th_{ o }=1279020283 for P_{10} with y=6, z=11, and \( \sum (h_{i}) = 1.5436 \). In all RNS-based implementations, the input word length was set to 16 bits.
4.1 Resource utilization and system performance
FPGA resource utilization and system performance for the RNS components— i.e., FCMA and reverse converter
Resources | FCMA (n=7) | RBC (n=7) | FCMA (n=10) | RBC (n=10) | S-FCMA^{c} (n=10) | S-FCMA^{c} (n=13) | M-FCMA (n=10) |
---|---|---|---|---|---|---|---|
Number of slice LUTs | 234 | 114 | 335 | 143 | 731 | 999 | 348 |
Number of slice registers | 375 | 148 | 478 | 187 | 792 | 1024 | 471 |
Number of occupied slices | 121 | 55 | 158 | 57 | 360 | 524 | 164 |
Number of RAMB18E1 | 8 | 0 | 8 | 0 | 0 | 0 | 24 |
Output word length (bits) | 22 | 0 | 31 | 0 | 31 | 40 | 31 |
Worst negative slack (ns) | 7.3 | 7.2 | 7.1 | 7.29 | 7.23 | 7.26 | 7.29 |
Max. operating freq (MHz) | 367.1 | 353.7 | 346.6 | 369.7 | 360.1 | 365.7 | 368.9 |
Data path delay (ns) | 2.599 | 2.65 | 2.66 | 2.5 | 2.76 | 2.7 | 2.65 |
Estimated power (mW)^{a} | 25 | 3 | 29 | 3 | 6 | 7 | 21 |
Block RAM power (mW) | 16 | 0 | 16 | 0 | 0 | 0 | 16 |
Latency (CC)^{b} | 5 | 6 | 5 | 5 | 6 | 6 | 5 |
4.2 Two-level DWT evaluation
FPGA resource utilization and system performance of two-level DB2 DWT implementation with ZC706
Resources | FIR | DA | RNS n=7 (Full) | RNS n=10 (Full) | RNS n=10 (M-FCMA) | RNS n=10 (S-FCMA^{**}) | RNS n=13 (Full) | RNS n=13 (S-FCMA) |
---|---|---|---|---|---|---|---|---|
Number of slice LUTs | 92 | 1108 | 730 | 1000 | 882 | 1759 | 1261 | 2455 |
Number of slice registers | 494 | 1250 | 1007 | 1307 | 1231 | 1648 | 1643 | 2204 |
Number of occupied slices | 122 | 411 | 351 | 434 | 444 | 656 | 561 | 885 |
Number of memory | 0 | 44 | 16 | 16 | 32 | 8 | 16 | 8 |
Number of DSP | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Worst neg. slack (ns) | 7.654 | 6.01 | 6.59 | 6.87 | 4.73 | 6.6 | 6.86 | 6.2 |
Max. operating freq (MHz) | 426.3 | 250.6 | 293.3 | 316 | 189.7 | 294.4 | 318.8 | 263.2 |
Data path delay (ns) | 2.017 | 3.73 | 3.152 | 2.94 | 4.84 | 3.07 | 2.9 | 3.54 |
Estimated power (mW) | 13 | 63 | 42 | 50 | 55 | 44 | 81 | 63 |
Block RAM power (mW) | 0 | 37 | 20 | 23 | 29 | 8 | 46 | 15 |
It is also observed that the maximum frequency of all RNS-based schemes is higher than that of the DA-based DWT. Because the only change among the P_{7}, P_{10}, and P_{13} implementations is the moduli-set width, the maximum operating frequencies change only slightly among these designs. Furthermore, the two-level DB2 filter bank was designed with maximum operating frequencies between 260 and 360 MHz for the full and optimized shift-based FCMA, respectively. However, the P_{7} RNS-based design is the only one that uses fewer resources than the DA-based design, because of its small word length.
The effect of using four memories in each filter-tap with ZC706
Resources | DA | RNS n=7 (Full) | RNS n=10 (Full) | RNS n=10 (M-FCMA) | RNS n=10 (S-FCMA^{**}) | RNS n=13 (Full) | RNS n=13 (S-FCMA) |
---|---|---|---|---|---|---|---|
Number of memory | 44 | 32 | 32 | 52 | 16 | 32 | 16 |
Output word length (bits) | 22 | 22 | 31 | 31 | 31 | 40 | 40 |
Estimated power (mW) | 78 | 70 | 85 | 88 | 60 | 142 | 92 |
Block RAM power (mW) | 51 | 39 | 43 | 49 | 18 | 88 | 36 |
4.3 Functionality verification
4.4 Precision analysis
Generally, convolution-based DWT involves floating-point operations, which introduce rounding errors. Because filter banks designed with floating-point coefficients require large hardware resources to retain the precision, we replaced the floating-point method with the RNS numbering system. We simply multiplied the input by 2^{ y } and the filter coefficients by 2^{ z }. At the end, we converted the result back to a floating-point number. PSNR is the most commonly used measure of the quality of the result. It is computed from the peak value relative to the error: a high PSNR means better quality, i.e., less error is introduced into the result.
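A minimal sketch of a PSNR computation is shown below. It uses the conventional definition, 10·log10(peak²/MSE); taking the peak as the largest absolute value of the reference signal is an assumption, since the exact peak used for the reported figures is not stated in this section.

```python
import math


def psnr(reference, result, peak=None):
    """Return the PSNR in dB of result measured against reference."""
    if len(reference) != len(result) or not reference:
        raise ValueError("signals must be non-empty and of equal length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, result)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: no error at all
    if peak is None:
        peak = max(abs(r) for r in reference)  # assumed peak definition
    return 10 * math.log10(peak**2 / mse)


reference = [1.0, -1.0, 0.5, 0.25]
fixed_point = [0.99, -1.0, 0.5, 0.26]  # hypothetical quantized output
print(round(psnr(reference, fixed_point), 2))
```

In this setting, `reference` would be the floating-point DWT output and `result` the RNS output after it has been scaled back by 2^{−(y+z)}.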
The PSNR values of one- and two-level of different DWT implementations
Implementation | DA | FIR | RNS P_{7} | RNS P_{10} | RNS P_{13} | Optimized RNS P_{10}^{a} | Optimized RNS P_{13}^{a} |
---|---|---|---|---|---|---|---|
Architecture | 1L/2L | 1L/2L | 1L/2L | 1L/2L | 1L/2L | 2L | 2L |
Input precision | Q _{5,16} | Q _{5,16} | y=8/8^{b} | y=12/12 | y=13/12 | y=6 | y=11 |
Coeff. precision | Q _{1,15} ^{ c } | Q _{0,15} | z=11 | z=16/11 | z=18/13 | z=11 | z=13 |
Internal word length | 22 bit | NA | 22 bit | 31 bit | 40 bit | 31 bit | 40 bit |
PSNR (dB) | 73.5/63.5 | 86.3/78.7 | 56.5/41.87 | 84/53 | 90/54 | 48.5^{d} | 54.5 |
The optimized two-level design with P_{10} has a maximum input scaling factor of 6 (due to Eq. 16). As a consequence, its scaling factors cannot be adapted further. In contrast, the optimized two-level design with P_{13} supports higher input and filter-coefficient scaling factors because of its larger word length, which enables higher accuracy.
5 Conclusions
In this article, we have addressed the development of a multiplierless scheme for a two-level RNS-based DWT, which can be adapted to any moduli set with any number of channels. This approach uses memory intensively to speed up the overall processing. In order to achieve low latency, we incorporated two novel ideas into the proposed two-level design: (1) eliminating the intermediate RBC unit and (2) replacing the internal memory of the second level with simple circular shift operations. A key feature of this approach is that the user can change the scaling factors, y and z, either to achieve high PSNR values or to lower the PSNR in order to design a multi-level DWT with low latency.
A comparison between RNS-based implementations using moduli-set P_{ n }
Moduli set | P_{7} | P_{10} | P_{13} |
---|---|---|---|
Pros | ∙ Design is simple ∙ Consumes less power | ∙ Design is of moderate complexity ∙ The shift-based scheme has lower latency and consumes less power | ∙ Can be extended to three levels (with y=8 and z=9) ∙ The shift-based scheme has lower latency and consumes less power |
Cons | ∙ Low PSNR ∙ Cannot be extended to multi-level DWT | ∙ Cannot be extended to three-level DWT | ∙ Word length is high ∙ Consumes large resources |
Given the implementation examples for experimental verifications and analysis, the approach was validated on a ZYNQ ZC706 development kit. The co-simulation results have also been verified and compared with the simulation environment. The complexity and optimization of multi-level DWT with respect to hardware structure provides a foundation for employing an appropriate algorithm for high-performance applications, such as in cognitive communication, where DWT analysis is combined with machine learning algorithms.
6 Appendix 1
6.1 Acronyms
BRAM Block RAM
BRC Binary-to-residue converter
CC Clock cycle
CLB Configurable logic block
CPD Critical path delay
CRT Chinese remainder theorem
CSE Common subexpression elimination
DA Distributed arithmetic
DSP Digital signal processing
DWT Discrete wavelet transform
GCD Greatest common divisor
GE Graph-based eliminations
FCMA Forward-converter and modular adders
FIR Finite impulse response
FPGA Field-programmable gate array
LS Lifting-based scheme
LUT Look-up table
MA Modular adder
MAC Multiplier-accumulator
MCM Multiple constant multiplications
PSNR Peak signal-to-noise ratio
RBC Residue-to-binary converter
RNS Residue number system
7 Appendix 2
7.1 Mathematical symbols
a×b The memory word size
l Number of DWT levels
M The maximum range of P_{ n }
N Number of filter taps
h_{ k } The k^{th} low-pass filter coefficient
m The number of magnitude bits
m_{ i } The i^{th} modulus of P_{ n }
n The moduli set base (e.g. P_{7})
q Number of RNS channels
τ Latency
w Word length
y The input scaling factor
z The filter scaling factor
Declarations
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the article.
Authors’ contributions
All authors contributed to the work. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
- P Yang, Q Li, Wavelet transform-based feature extraction for ultrasonic flaw signal classification. Neural Comput. & Applic.24(3-4), 817–826 (2014).View ArticleGoogle Scholar
- SK Madishetty, A Madanayake, RJ Cintra, VS Dimitrov, Precise VLSI architecture for AI based 1-D/ 2-D Daub-6 wavelet filter banks with low adder-count. IEEE Trans. Circ. Syst. I Regular Papers. 61(7), 1984–1993 (2014).View ArticleGoogle Scholar
- M Martina, G Masera, MR Roch, G Piccinini, Result-biased distributed-arithmetic-based filter architectures for approximately computing the DWT. IEEE Trans. Circ. Systems I Regular Papers. 62(8), 2103–2113 (2015).MathSciNetView ArticleGoogle Scholar
- H Alzaq, BB Ustundag, in European Wireless 2015; 21th European Wireless Conference; Proceedings Of. Wavelet Preprocessed Neural Network Based Receiver for Low SNR Communication System (VDE Budapest, 2015), pp. 1–6.Google Scholar
- N Carta, D Pani, L Raffo, Impact of Threshold Computation Methods in Hardware Wavelet Denoising Implementations for Neural Signal Processing (Springer, Cham, 2015). http://doi.org/10.1007/978-3-319-26129-4_5.View ArticleGoogle Scholar
- S Mallat, A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way, 3rd edn. (Academic Press, Philadelphia, PA, USA, 2008).MATHGoogle Scholar
- M Vetterli, C Herley, Wavelets and filter banks: theory and design. IEEE Trans. Signal Process.40(9), 2207–2232 (1992).View ArticleMATHGoogle Scholar
- S Gnavi, B Penna, M Grangetto, E Magli, G Olmo, Wavelet kernels on a DSP: a comparison between lifting and filter banks for image coding. EURASIP J. Adv. Signal Process.2002(9), 458215 (2002).View ArticleMATHGoogle Scholar
- I Daubechies, W Sweldens, Factoring wavelet transforms into lifting steps. J Fourier Anal. Appl.4(3), 247–269 (1998).MathSciNetView ArticleMATHGoogle Scholar
- M MAB, NM Sk, in 2018 31st International Conference on VLSI Design and 2018 17th International Conference on Embedded Systems (VLSID). An efficient vlsi architecture for convolution based dwt using mac (IEEEPune, 2018), pp. 271–276. https://doi.org/10.1109/VLSID.2018.75.Google Scholar
- A Gacic, M Puschel, JMF Moura, in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing. Automatically generated high-performance code for discrete wavelet transforms, vol 5 (IEEEMontreal, 2004), pp. 69–725. https://doi.org/10.1109/ICASSP.2004.1327049.Google Scholar
- E Ramola, JS Manoharan, in 2011 3rd International Conference on Electronics Computer Technology. An area efficient vlsi realization of discrete wavelet transform for multiresolution analysis, vol 6 (IEEEKanyakumari, 2011), pp. 377–381.View ArticleGoogle Scholar
- I Mamatha, S Tripathi, TSB Sudarshan, in 2017 International Conference on Computing, Communication and Automation (ICCCA). Convolution based efficient architecture for 1-d dwt (IEEEGreater Noida, 2017), pp. 1436–1440. https://doi.org/10.1109/CCAA.2017.8230023.View ArticleGoogle Scholar
- M I, S Tripathi, S TSB, in 2016 3rd International Conference on Signal Processing and Integrated Networks (SPIN). Pipelined architecture for filter bank based 1-d dwt (IEEENoida, 2016), pp. 47–52. https://doi.org/10.1109/SPIN.2016.7566660.Google Scholar
- PK Meher, BK Mohanty, MMS Swamy, in 2015 28th International Conference on VLSI Design. Low-Area and Low-Power Reconfigurable Architecture for Convolution-Based 1-D DWT Using 9/7 and 5/3 Filters (IEEEBangalore, 2015), pp. 327–332. https://doi.org/10.1109/VLSID.2015.61.View ArticleGoogle Scholar
- J Ramírez, A García, PG Fernandez, A Lloris, in 2000 10th European Signal Processing Conference. An efficient rns architecture for the computation of discrete wavelet transforms on programmable devices (IEEETampere, 2000), pp. 1–4.Google Scholar
- L Aksoy, P Flores, J Monteiro, A tutorial on multiplierless design of FIR filters: algorithms and architectures. Circ. Syst. Signal Process.33(6), 1689–1719 (2014).View ArticleGoogle Scholar
- Y Voronenko, M Püschel, Multiplierless multiple constant multiplication. ACM Trans. Algorithm.3(2) (2007).Google Scholar
- F Al-Hasani, MP Hayes, A Bainbridge-Smith, A common subexpression elimination tree algorithm. IEEE Trans. Circ. Syst. I Regular Papers. 60(9), 2389–2400 (2013).MathSciNetView ArticleGoogle Scholar
- X Lou, YJ Yu, PK Meher, New approach to the reduction of sign-extension overhead for efficient implementation of multiple constant multiplications. IEEE Trans. Circ. Syst. I Regular Papers. 62(11), 2695–2705 (2015).MathSciNetView ArticleGoogle Scholar
- H Liu, A Jiang, in 2016 8th International Conference on Wireless Communications Signal Processing (WCSP). Efficient design of fir filters using common subexpression elimination (IEEEYangzhou, 2016), pp. 1–5. https://doi.org/10.1109/WCSP.2016.7752701.Google Scholar
- L Aksoy, P Flores, J Monteiro, Multiplierless design of folded dsp blocks. ACM Trans. Des. Autom. Electron. Syst.20(1), 14–11424 (2014).View ArticleGoogle Scholar
- A Peled, B Liu, A new hardware realization of digital filters. IEEE Trans. Acoust. Speech Signal Process.22(6), 456–462 (1974).View ArticleGoogle Scholar
- DJ Allred, W Huang, V Krishnan, H Yoo, DV Anderson, in Field-Programmable Custom Computing Machines, 2004. FCCM 2004. 12th Annual IEEE Symposium On. An FPGA Implementation for a High Throughput Adaptive Filter using Distributed Arithmetic (IEEENapa, 2004), pp. 324–325. https://doi.org/10.1109/FCCM.2004.15.View ArticleGoogle Scholar
- H Yoo, DV Anderson, in Proceedings. (ICASSP ’05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005. Hardware-efficient Distributed Arithmetic Architecture for High-Order Digital Filters, vol 5 (IEEEPhiladelphia, 2005), pp. 125–1285. https://doi.org/10.1109/ICASSP.2005.1416256.Google Scholar
- H Alzaq, BB Üstündağ, in 2017 10th International Conference on Electrical and Electronics Engineering (ELECO). Multiplier-less 1-level discrete wavelet transform implementations on ZC706 development kit (IEEE, Bursa, 2017), pp. 1122–1126.
- S Pontarelli, G Cardarilli, M Re, A Salsano, Optimized implementation of RNS FIR filters based on FPGAs. J. Signal Process. Syst. 67(3), 201–212 (2012).
- W Jenkins, B Leon, The use of residue number systems in the design of finite impulse response digital filters. IEEE Trans. Circ. Syst. 24(4), 191–201 (1977).
- CH Chang, AS Molahosseini, AAE Zarandi, TF Tay, Residue number systems: a new paradigm to datapath optimization for low-power and high-performance digital signal processing applications. IEEE Circ. Syst. Mag. 15(4), 26–44 (2015).
- J Ramírez, U Meyer-Bäse, F Taylor, A García, A Lloris, Design and implementation of high-performance RNS wavelet processors using custom IC technologies. J. VLSI Signal Process. Syst. Signal Image Video Technol. 34(3), 227–237 (2003).
- GC Cardarilli, A Nannarelli, M Petricca, M Re, in 2015 IEEE 58th International Midwest Symposium on Circuits and Systems (MWSCAS). Characterization of RNS multiply-add units for power efficient DSP (IEEE, Fort Collins, 2015), pp. 1–4. https://doi.org/10.1109/MWSCAS.2015.7282052.
- R Conway, J Nelson, Improved RNS FIR filter architectures. IEEE Trans. Circ. Syst. II Express Briefs. 51(1), 26–28 (2004).
- I Daubechies, Ten Lectures on Wavelets (Society for Industrial and Applied Mathematics, Philadelphia, 1992).
- KH Rosen, Elementary Number Theory and Its Applications, 5th edn. (Addison-Wesley, Reading, MA, 2004).
- PVA Mohan, RNS-to-binary converter for a new three-moduli set 2^{n+1}−1, 2^{n}, 2^{n}−1. IEEE Trans. Circ. Syst. II Express Briefs. 54(9), 775–779 (2007).
- S-H Lin, M-h Sheu, C-H Wang, Y-C Kuo, in Circuits and Systems, 2008. APCCAS 2008. IEEE Asia Pacific Conference On. Area-time-power efficient VLSI design for residue-to-binary converter based on moduli set (2^{n}, 2^{n+1}−1, 2^{n}+1) (IEEE, Macao, 2008), pp. 168–171. https://doi.org/10.1109/APCCAS.2008.4745987.
- KS Reddy, S Bajaj, SS Kumar, in TENCON 2014 - 2014 IEEE Region 10 Conference. Shift add approach based implementation of RNS-FIR filter using modified product encoder (IEEE, Bangkok, 2014), pp. 1–6. https://doi.org/10.1109/TENCON.2014.7022321.
- CH Vun, AB Premkumar, W Zhang, A new RNS based DA approach for inner product computation. IEEE Trans. Circ. Syst. I Regular Papers. 60(8), 2139–2152 (2013).
- A Hariri, K Navi, R Rastegar, A new high dynamic range moduli set with efficient reverse converter. Comput. Math. Appl. 55(4), 660–668 (2008).
- B Cao, T Srikanthan, C-H Chang, Efficient reverse converters for the four-moduli sets (2^{n}−1, 2^{n}, 2^{n}+1, 2^{n+1}−1) and (2^{n}−1, 2^{n}, 2^{n}+1, 2^{n−1}−1). IEE Proc. Comput. Digit. Tech. 152(5), 687–696 (2005).
- B Cao, C-H Chang, T Srikanthan, An efficient reverse converter for the 4-moduli set 2^{n}−1, 2^{n}, 2^{n}+1, 2^{2n}+1 based on the new Chinese remainder theorem. IEEE Trans. Circ. Syst. I Fundam. Theory Appl. 50(10), 1296–1303 (2003).
- R Zimmermann, in Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336). Efficient VLSI implementation of modulo (2^{n}±1) addition and multiplication (IEEE, Adelaide, 1999), pp. 158–167. https://doi.org/10.1109/ARITH.1999.762841.
- L Kalampoukas, D Nikolos, C Efstathiou, HT Vergos, J Kalamatianos, High-speed parallel-prefix modulo 2^{n}−1 adders. IEEE Trans. Comput. 49(7), 673–680 (2000).
- G Dimitrakopoulos, DG Nikolos, HT Vergos, D Nikolos, C Efstathiou, in 2005 12th IEEE International Conference on Electronics, Circuits and Systems. New architectures for modulo 2^{n}−1 adders (IEEE, Gammarth, 2005), pp. 1–4. https://doi.org/10.1109/ICECS.2005.4633502.
- Xilinx Inc., Zynq-7000 All Programmable SoC ZC706 evaluation kit. https://www.xilinx.com/products/boards-and-kits/ek-z7-zc706-g.html. Accessed 30 Aug 2017.
- Xilinx: LogiCORE IP FIR Compiler v6.3. Product Specification DS795 (Oct 2011). http://www.xilinx.com/support/documentation/ip_documentation/fir_compiler/v6_3/ds795_fir_compiler.pdf. Accessed 25 Sept 2017.