Implementation of the Least-Squares Lattice with Order and Forgetting Factor Estimation for FPGA

A high-performance RLS lattice filter with estimation of the unknown order and forgetting factor of the identified system was developed and implemented as a PCORE coprocessor for Xilinx EDK. The coprocessor implemented in FPGA hardware can fully exploit the parallelism in the algorithm and remove load from a microprocessor. The EDK integration allows effective programming and debugging of hardware-accelerated DSP applications. The RLS lattice core, extended by the order and forgetting factor estimation, was implemented using the logarithmic number system (LNS) arithmetic. An optimal mapping of the RLS lattice onto the LNS arithmetic units, found by cyclic scheduling, was used. The schedule allows us to run four independent filters in parallel on one arithmetic macro set. The coprocessor containing the RLS lattice core is highly configurable. It can exploit the modular structure of the RLS lattice filter and construct a pipelined serial connection of filters for even higher performance. It can also run independent parallel filters on the same input with different forgetting factors in order to estimate which order and exponential forgetting factor better describe the observed data. The FPGA coprocessor implementation presented in the paper is able to evaluate the RLS lattice filter of order 504 at a 12 kHz input data sampling rate. For filters of order up to 20, the probabilities of the order and forgetting factor hypotheses can be continually estimated. It has been demonstrated that the implemented coprocessor accelerates the Microblaze solution up to 20 times. It has also been shown that the coprocessor performs up to 2.5 times faster than a highly optimized solution using a 50 MIPS SHARC DSP processor, while the Microblaze remains capable of performing other tasks concurrently.


INTRODUCTION
A number of possible applications in digital signal processing (DSP), such as parameter estimation [1], echo suppression [2], or beam-forming [3], can be found for adaptive least squares filters. Their recursive form, known as recursive least squares (RLS) [4], is the solution of the minimum mean square error problem. The convergence rate of the RLS is far superior to that of the well-known least mean square (LMS) [5] algorithm and its normalized sibling, the NLMS.
The hardware implementation of the RLS algorithm is rather difficult due to its high computational complexity and problems with numerical stability. The complexity can be decreased by exploiting the serial structure of the input data, typical for DSP applications. This reduces the asymptotic complexity of the RLS from O(N^2) to O(N). One of the fast versions of the RLS algorithm is represented by the fast transversal filters (FTF) [6][7][8]. The problems with numerical instability of the FTF led to the development of its stabilized version [9, 10]. Motivated by the need for numerically stable fast RLS filters, the least-squares lattice (LSL) filters [11, 12] and fast QR decomposition (QR-RLS) [13] algorithms were developed. While the numerical stability of the fast QR-RLS algorithm was proven analytically [14], the good numerical properties of an LSL filter were found experimentally. An analysis of the QR-RLS and LSL algorithms can be found in [15]. This work focuses on the LSL algorithm because of its slightly lower complexity.
The implementation of the LSL filter with error feedback [16][17][18] has proven good numerical behavior. In [16], it was shown that the filter can be efficiently implemented in field programmable gate arrays (FPGAs). In the following text, this algorithm is referred to as the RLS lattice. Its computational complexity is 24N, which can be further reduced by the utilization of parallel hardware in an FPGA.
The possibilities of efficient FPGA implementation of the RLS lattice algorithm were exhaustively investigated in [19, 20]. A similar approach to the optimization of other DSP algorithms for FPGAs can be found in [21][22][23]. As shown in these works, the resulting intellectual property (IP) cores can outperform floating-point DSP microprocessor solutions by one order of magnitude. There also exist other RLS filter FPGA implementations, such as [1, 3], where IP cores are implemented and integrated in a custom single-purpose design. Another implementation of the RLS filter is presented in [24], where calculations are distributed between an FPGA and the NIOS microprocessor in a single chip.
Our aim is to provide a versatile, highly configurable hardware RLS core for DSP applications. We focus on the implementation of a hardware coprocessor rather than a standalone IP core. Our target application scheme is to use one lattice filter and estimate the order probability, or to use several parallel lattice filters with different forgetting factors on a single channel and estimate which forgetting factor yields better results. We also expect that the parallel lattice filters can be interconnected serially to create a high-performance pipelined solution. The algorithm is integrated into the Xilinx EDK environment as a Microblaze accelerator, resulting in a versatile, easy-to-use, and compact RLS lattice coprocessor, which can easily be accessed by standard C programming and debugging.

PROBABILISTIC APPROACH TO SYSTEM IDENTIFICATION
In order to outline the development of the RLS lattice algorithm with order probability estimation and to describe its hardware implementation, a brief insight into the recursive least squares estimation from the probabilistic theory viewpoint is provided. The probabilistic approach [25] to system identification provides the link between probabilistic theory and least square error estimation, which allows us to extend the estimation task by the hypotheses probability estimation. In this approach, the hypotheses probability update of an autonomous one-dimensional system [26] can be formulated by the Bayes rule

p(h_n | D_n) = p(y_n | D_{n-1}, h_n) p(h_n | D_{n-1}) / p(y_n | D_{n-1}),    (1)

where y_n are the data observed at the unknown system output at time n, and the variables D_n and D_{n-1} are the previously observed data y_0 through y_n and y_{n-1}, respectively. The hypothesis h_n is the ordered pair h = (i, λ), where i ∈ {0, . . ., N} is the unknown order and λ ∈ {λ_1, . . ., λ_M} is the unknown forgetting factor. The term p(y_n | D_{n-1}, h_n) is the probabilistic description of the modeled system with the order and exponential forgetting given by the hypothesis h_n. This probability can be evaluated in closed form (2), where Λ_h is the optimal solution error of the model for the hypothesis h, the matrix V_{z,h,n} is the autocorrelation matrix defined in (3) and (4), and the operator Γ in (2) is the gamma function. The quantity ϑ represents the "amount of data" accumulated in the matrix V_{z,h,n} through the estimation process; it is updated recursively by (5). It should be noted that, according to [25], the relation (2) is correct only when the autoregression model inherent to the hypothesis h is defined as a conditional probability density function (pdf) (6), where the symbol Θ_{h,n} denotes the parametrization of the model by the vector of unknown parameters α and an unknown noise variance. The autoregression model of an unknown system in (6) can equivalently be described by (7), where e_{h,n} is the prediction error of the model inherent to the hypothesis h at time n. As shown in [25], if e_{h,n} is a normally distributed random variable, the conditional pdf for the model parameters Θ_{h,n} can be written in the form (9), where c_{h,n} is a normalizing constant and V_{h,n} is the augmented autocorrelation matrix which keeps the information about the shape of the conditional pdf (9). This data matrix can be updated recursively by (11). For better clarity, the subscript n will be omitted in the following text. The matrix V_h used in (10) can be decomposed as in (12); consequently, the optimal solution A_h can be written in terms of this decomposition (13), (14). Recursive formulas for the direct update of the decomposed matrix V_h can be derived from (11). The recursive solution for A_h, maximizing the pdf given by (9), is also known as the RLS algorithm [27]. It is important to note that the estimation of the (N + 1)M hypotheses by the Bayes formula (1) requires estimating all models defined by the estimated hypotheses, which means calculating an NM-array of RLS filters.
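The hypothesis update in (1) can be illustrated with a short sketch. The snippet below is a minimal illustration rather than the paper's implementation: it assumes the per-hypothesis likelihoods p(y_n | D_{n−1}, h) have already been evaluated, and all names and values are hypothetical.

```python
def update_hypotheses(prior, likelihood):
    """One step of the Bayes update (1) over the hypothesis grid.

    prior      -- p(h | D_{n-1}) for each hypothesis h = (order, lambda)
    likelihood -- p(y_n | D_{n-1}, h), one value per hypothesis
    Both are dicts keyed by (i, lam); the posterior p(h | D_n) is returned.
    """
    numerator = {h: likelihood[h] * prior[h] for h in prior}  # numerator of (1)
    evidence = sum(numerator.values())                        # denominator of (1)
    return {h: v / evidence for h, v in numerator.items()}

# Toy grid: orders 0..2, two forgetting factors, uniform prior.
grid = [(i, lam) for i in range(3) for lam in (0.95, 0.99)]
prior = {h: 1.0 / len(grid) for h in grid}
likelihood = {(0, 0.95): 0.2, (0, 0.99): 0.4, (1, 0.95): 0.9,
              (1, 0.99): 0.5, (2, 0.95): 0.1, (2, 0.99): 0.3}
posterior = update_hypotheses(prior, likelihood)
```

With a uniform prior, the posterior is simply proportional to the likelihoods, so the most probable hypothesis here is order 1 with λ = 0.95.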

HYPOTHESES ESTIMATION
Despite the possibility of implementing the RLS estimation by the formulas introduced in Section 2, it is more convenient to use one of the state-of-the-art RLS algorithms. The recursive least-squares lattice [4] in the error-feedback form was chosen as the most convenient algorithm for implementation. As suggested in [17], the normalized a posteriori errors are used to reduce the complexity of the algorithm. The computational complexity of this algorithm is 24N, where N denotes the filter order (dimension).
As mentioned above, the estimation of the probability of hypotheses h by (1) requires performing one RLS estimation for each hypothesis h. Thus, an NM-array of RLS filters has to be calculated.
The most important property making the RLS lattice suitable for the hypotheses estimation is its modular structure. The RLS lattice filter consists of a cascade of identical modules. Each module implements the order update: it uses the ith-order output from the preceding module and increases the order of estimation to i + 1. Consequently, the estimates of all orders up to N are obtained during the computation. Using this principle, the number of RLS filters required for the hypotheses estimation can be reduced to M.
In our solution, the evaluation of the probability estimates is divided into two stages. The first stage performs the order update, which takes the "old" probability estimates and updates them with the new data. This operation is represented by the numerator of (1). In the second stage, the normalization of the updated order estimates is performed; the normalization is represented by the denominator of (1). The forgetting on the hypotheses pdf is applied in the normalization stage. The order update can be integrated into the RLS lattice algorithm. The RLS lattice algorithm with the order and forgetting factor update is summarized in Algorithm 1 (RLS lattice algorithm with exponential forgetting and probability update evaluation). The labels Txx and Ux there denote the individual arithmetic operations contained in the right-hand side of each equation. For illustration, the update of the estimates from order i to order i + 1 is depicted in Figure 1.
The algorithm presented in Algorithm 1 is in the form where the input u_n is used to estimate the desired value d_n. For the identification of a one-dimensional autoregression model, the input must be connected as presented in Figure 2. Then, the probabilistic approach given in Section 2 can be used for the estimation of the order and forgetting factor probability, as also shown in the figure. The RLS lattice algorithm parameters are summarized in Table 1.

Table 1: The parameters of the RLS lattice algorithm with exponential forgetting and probability update evaluation.

N — filter order
λ ∈ (0; 1⟩ — forgetting factor
ϕ — hypotheses forgetting factor

For the probability p(h_n | D_n), the simpler notation p_{h,n} = p_{i,λ,n} will be further used, where h was defined as the ordered pair (i, λ), i ∈ {0, . . ., N}, h ∈ {1, . . ., M}. Before the first iteration of the algorithm, the initial hypotheses pdf has to be set and the look-up tables have to be initialized. The initial hypotheses pdf can be selected as in (16), where p_{i,λ,n} is the probability of order i and forgetting factor λ at time n; the value n = −1 represents the initialization step. The look-up tables are initialized for the limit value of ϑ as given in (17) and (18). After the values p_{i,λ,n−1} have been updated, the normalization step has to be performed to obtain the actualized probability p_{i,λ,n}. The pdf given by (1), extended with the forgetting of hypotheses, can be evaluated by (19), where p^d_{i,λ} is the updated, but not yet normalized, probability of order i and forgetting factor λ, and the symbol ϕ is the forgetting factor of hypotheses.
Equation (19) can be calculated more efficiently as (20), where p^sd_{N,λ} is the sum of the updated probabilities p^d_{i,λ} biased by the forgetting factor of hypotheses. Comparing (19) and (20), it can be shown that the sum of the updated probabilities p^d_{i,λ} can be calculated as in (21). The value of p^sd_{i,λ} is then obtained by updating p^sd_{i−1,λ} from its initial value p^sd_{−1,λ} = (N + 1)ϕ, as shown on the right-hand side of (21). This step is labeled as operation U7 in Algorithm 1.
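The normalization stage can be sketched as follows. Since the bodies of (19)-(21) are not reproduced here, the snippet encodes our reading of the running-sum scheme: each updated probability is biased by ϕ, and the common denominator is accumulated per forgetting factor from the initial value (N + 1)ϕ. All names and values are illustrative.

```python
def normalize_with_forgetting(p_d, phi):
    """Normalization stage, following our reading of (19)-(21).

    p_d -- updated, not yet normalized probabilities p^d_{i,lambda},
           a list of N+1 rows (orders) of M columns (forgetting factors)
    phi -- forgetting factor of the hypotheses
    """
    n_orders = len(p_d)
    n_lambdas = len(p_d[0])
    # Running sum p^sd per forgetting factor: started from (N+1)*phi and
    # accumulated over the orders (operation U7 in Algorithm 1).
    p_sd = [n_orders * phi + sum(row[j] for row in p_d)
            for j in range(n_lambdas)]
    total = sum(p_sd)
    # One division per hypothesis by the common biased sum, as in (20).
    return [[(v + phi) / total for v in row] for row in p_d]

p = normalize_with_forgetting([[0.2, 0.1], [0.5, 0.3]], phi=0.05)
```

Whatever the biased values are, the result sums to one, which is the property the normalization stage has to guarantee.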
By adding the order and forgetting estimation, the complexity of the original RLS lattice algorithm increases to 31N. Thus, the number of operations for maintaining the N + 1 order and M forgetting factor hypotheses is 31NM. The normalization of the updated probabilities requires 2M(N + 1) + M − 1 operations. Considering these figures, the complexity of the RLS lattice with the estimation of hypotheses is 33NM + 3M − 1 operations, provided that the division of two powers is regarded as one operation.
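The closed-form operation count can be checked against its two components. A trivial sketch (function names hypothetical):

```python
def lattice_ops(N, M):
    """Operations per iteration of the RLS lattice with hypotheses estimation."""
    update = 31 * N * M                      # lattice + probability update
    normalization = 2 * M * (N + 1) + M - 1  # normalization of probabilities
    return update + normalization

def lattice_ops_closed_form(N, M):
    # The closed form quoted in the text: 33NM + 3M - 1.
    return 33 * N * M + 3 * M - 1

ops = lattice_ops(16, 4)
```

Expanding 31NM + 2M(N + 1) + M − 1 gives 33NM + 3M − 1, so the two functions agree for any N, M.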
It is evident that the implementation of the M RLS lattice estimations needed to test each hypothesis can easily be parallelized. For each forgetting factor hypothesis, one RLS lattice instance extended by the probability update can be evaluated in parallel. When all filters have been calculated, the normalization is performed before new data are acquired. Such an arrangement can be efficiently implemented in FPGAs.

FPGA IMPLEMENTATION
The hardware implementation of the RLS lattice requires an ALU providing the ADD, SUB, MUL, and DIV operations. For the probability estimation, the POW and SQRT operations are also required. The logarithmic number system (LNS) arithmetic [28, 29] has been identified as the most convenient alternative to floating point for the hardware implementation. This selection is supported by [16, 19].
Numbers in the LNS are represented as fixed-point base-2 logarithms of the numbers to be represented. The LNS arithmetic provides extremely efficient MUL, DIV, and SQRT operations. The ADD/SUB operations are more complex and thus require more resources. For our RLS lattice implementation, the high-speed logarithmic arithmetic (HSLA) library was used [29].
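The asymmetry between the cheap and expensive LNS operations can be sketched in a few lines. This is an illustrative floating-point model of LNS values, not the HSLA implementation; in hardware, the nonlinear term in the addition is the part evaluated from tables.

```python
import math

def lns_mul(a, b):
    # MUL: log2(X * Y) = log2 X + log2 Y, a single fixed-point addition.
    return a + b

def lns_div(a, b):
    # DIV: log2(X / Y) = log2 X - log2 Y, a single fixed-point subtraction.
    return a - b

def lns_sqrt(a):
    # SQRT: log2(sqrt(X)) = (log2 X) / 2, a shift of the fixed-point value.
    return a / 2.0

def lns_add(a, b):
    # ADD: log2(X + Y) = a + log2(1 + 2^(b - a)); the nonlinear term is what
    # needs table lookup (and interpolation) in a hardware LNS unit.
    if b > a:
        a, b = b, a
    return a + math.log2(1.0 + 2.0 ** (b - a))

x, y = 6.0, 5.0            # LNS images of 64 and 32
product = lns_mul(x, y)    # log2(64 * 32) = 11
total = lns_add(x, y)      # log2(64 + 32) = log2 96
```

Multiplication, division, and square root never leave the fixed-point domain, which is exactly why they are so cheap in LNS hardware.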
The proposed hardware provides solutions to several implementation challenges: (i) conversions between fixed-point and LNS numbers; (ii) implementation of the power function, or directly of the division of powers; (iii) efficient mapping of the algorithm onto the LNS ALU and scheduling of the operations; (iv) ensuring numerically robust behavior; (v) implementation of the optimized RLS lattice core; (vi) support for the integration of the core into a microprocessor system.
In the following sections, these issues are addressed and their solutions are presented.

Conversions
In audio DSP applications, the 16-bit two's complement integer is a typical input and output data format. The same precision was used for the input and output of the RLS lattice filter implementation.
A conversion of unsigned 16-bit fixed-point numbers to the LNS format and vice versa, introduced in [19], was implemented. The method can easily be modified for signed integers. The integer-to-LNS conversion is based on the LNS addition, which can be written as

log_2(X + Y) = i + log_2(1 + 2^(j − i)),    (22)

where i = log_2 |X| and j = log_2 |Y| are the fixed-point numbers representing X and Y in the LNS, respectively. The 16-bit integer input can be written as

Z = 2^8 z_1 + z_2,    (23)

where z_1 and z_2 are the high and low parts of the integer Z, respectively. Then, the conversion of Z into the LNS can be written as the LNS addition (24) of the images of 2^8 z_1 and z_2. It is clear that for the calculation of the LNS image of Z, the values i = log_2 |X| = log_2 |2^8 z_1| and j = log_2 |z_2| have to be known. The values of i and j can be tabulated as T_i and T_j, each of depth 256. The conversion from the LNS to fixed point is implemented as a binary search for the nearest lower number in T_i. The value of T_i is then subtracted from the original in the LNS domain, and the search continues in T_j. The integer result is formed from the addresses of the found values in the tables T_i and T_j. The hardware solution of the conversions, using the LNS addition and two tables, delivers maximal conversion performance for the input and output data.
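The integer-to-LNS direction of this scheme can be sketched as follows. The snippet models the two depth-256 tables and the single LNS addition in ordinary floating point; the table names and the zero handling are illustrative, not the hardware's actual encoding.

```python
import math

# Tables of depth 256: T_i[z1] = log2(2^8 * z1), T_j[z2] = log2(z2).
# Index 0 has no logarithm; the hardware reserves a special zero code,
# which is modeled here as -inf.
T_i = [-math.inf] + [math.log2(z1 << 8) for z1 in range(1, 256)]
T_j = [-math.inf] + [math.log2(z2) for z2 in range(1, 256)]

def int_to_lns(z):
    """Convert an unsigned 16-bit integer to its LNS image via (22)-(24)."""
    z1, z2 = z >> 8, z & 0xFF          # Z = 2^8 * z1 + z2
    i, j = T_i[z1], T_j[z2]
    if j == -math.inf:                 # low byte zero: table value suffices
        return i
    if i == -math.inf:                 # high byte zero: table value suffices
        return j
    # One LNS addition (22) of the two table values.
    if j > i:
        i, j = j, i
    return i + math.log2(1.0 + 2.0 ** (j - i))

z_lns = int_to_lns(1000)               # approximately log2(1000)
```

The reverse direction, a binary search for the nearest lower entry in T_i followed by T_j, operates on the same two tables.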
The initialization of the algorithm presented in Algorithm 1 requires constants and initial vector values in the LNS. Software conversions for an FPGA soft processor were implemented using the functions provided in HSLA.

Division of powers
As mentioned in Section 2, the division of two general powers in (2) is considered as one floating-point-like operation. In the LNS arithmetic, this operation can be calculated as

log_2(A^{e1} / B^{e2}) = e1 · a − e2 · b,    (25)

where a = log_2 |A| and b = log_2 |B| are the LNS representations of A and B, respectively. The values of e1 and e2 are the fixed-point exponent values stored in the tables defined in (18). According to (25), the division of two powers in the LNS can be implemented very efficiently as fixed-point multiplications of the operands, which are in fact fixed-point integers representing base-2 logarithms, by the fixed-point exponents; the results are then subtracted. The FPGA implementation benefits from the possibility to perform the integer subtraction in full precision and to truncate the result to the desired LNS width at the end.
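The operation (25) can be sketched as two fixed-point multiplications and one subtraction. The Q8.8 quantization of the exponents below mirrors the 16-bit, 8-fractional-bit storage described in the next section; the helper names are hypothetical.

```python
def to_q8_8(x):
    """Quantize an exponent to 16-bit unsigned fixed point, 8 fractional bits."""
    return round(x * 256) / 256.0

def lns_pow_div(a, e1, b, e2):
    """log2(A^e1 / B^e2) = e1*a - e2*b: two fixed-point MULs and one SUB."""
    return to_q8_8(e1) * a - to_q8_8(e2) * b

# A = 4 (a = 2), B = 2 (b = 1): 4^1.5 / 2^2 = 8 / 4 = 2, i.e. LNS value 1.
r = lns_pow_div(2.0, 1.5, 1.0, 2.0)
```

No table-based LNS addition is needed anywhere in this operation, which is what makes it cheap in the LNS domain.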
The fixed-point exponents in the tables e1 and e2 are stored as 16-bit unsigned fixed-point numbers with 8 fractional bits. This decision makes an efficient implementation possible without loss of precision. Although such a representation limits the possible range of the parameter λ to λ ∈ ⟨0.95; 0.995⟩ and of the parameter i to i ∈ {0, . . ., 20}, this range covers the most used options for recursive identification and order estimation. The restriction put on the order i corresponds to the range in which the probabilistic approach to the order estimation can provide reliable results, as shown in [30].

ALU and scheduling
Since our implementation of the RLS lattice filter is designed for 16-bit two's complement integers as input and output, the 19-bit LNS arithmetic provides sufficient precision. The bit allocation within the 19-bit LNS number is as follows: 1 sign bit and an 18-bit two's complement fixed-point number. Special values are reserved for zero, NaN, and Inf.
In order to exploit the maximal possible parallelism in the implementation of the RLS lattice filter, cyclic scheduling of the RLS lattice inner loop was used [20, 31]. It was found that the addition hardware macro is utilized by less than 25%. For higher utilization of the resources, four RLS lattice filter modules can share one dual-port ADD/SUB unit; that is, four independent RLS lattice filters can be implemented without using more ADD/SUB units. The other hardware macros used in the RLS lattice filter implementation are four MUL and DIV macros, one SQRT macro, and one POW macro. All hardware macros are fully pipelined.
Using the method of cyclic scheduling, four independent RLS lattice filter modules were mapped onto the LNS arithmetic units so that the schedule consists of a 2-clock-cycle prologue, a 25-clock-cycle main loop body, and a 34-clock-cycle epilogue. The resulting schedule is depicted in Figure 3, where the evaluation of one iteration of the RLS lattice filter is displayed. The operations of one iteration of the algorithm presented in Algorithm 1 are spread over three consecutive iterations in the hardware implementation and evaluated concurrently. The operations in Algorithm 1 were labeled T1-T26 for the RLS lattice algorithm and U1-U7 for the probability update. In Figure 3, the corresponding operations are displayed as 4-clock-cycle-long operations, which consist of four consecutive calls, one for each instance of the RLS lattice module.

Numerical behavior
To guarantee the numerical stability, it is required to avoid divisions close to zero in the operations labeled T19 and T20. A sufficient solution is to add a small positive constant ν_0 to the denominator before the divisions are evaluated [18]. Such a solution provides unbiased energies B_{i,n} and F_{i,n}. Alternatively, the constant ν = ν_0(1 − λ) can be used in the operations labeled T13 and T17 [17]. We use this version of the RLS lattice, which introduces a small bias to the energies B_{i,n} and F_{i,n}.
The "biased" version is more suitable for parallelization than the unbiased version, because the evaluation of the energy B_{i,n} lies on the critical path of the algorithm. The situation on the critical path for both options is compared in Figure 4. It can be seen that, at the expense of a small bias added to the energies F_{i,n} and B_{i,n}, the iteration time of the lattice loop is decreased from 33 to 25 clock cycles, that is, by 24%.
According to [17, 18], the constant ν equal to 2^{−b}(1 − λ) should be used. Here, b is the mantissa length to which the prediction energies are quantized and λ is the forgetting factor. For the 19-bit LNS arithmetic case, b = 12 was used. The worst-case forgetting factor λ = 0.95 is determined by the constraint introduced in Section 4.2. The relation for ν is correct under the assumption that the input signal y_n is within the range ⟨−1, 1⟩, which is always satisfied by the input and output conversions described in Section 4.1.
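For the values used in the paper, the regularization constant can be computed directly (a trivial sketch; the function name is illustrative):

```python
def regularization_constant(b, lam):
    """nu = 2^(-b) * (1 - lambda), added to the prediction energies."""
    return 2.0 ** (-b) * (1.0 - lam)

# Worst case used in the paper: b = 12, lambda = 0.95.
nu = regularization_constant(12, 0.95)
```

The closer λ is to 1, the smaller ν becomes, so λ = 0.95 is indeed the worst case (largest bias) within the allowed range ⟨0.95; 0.995⟩.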

Optimized RLS lattice core
The result of the RLS lattice implementation is a standalone IP core. The internal variables of the core are expected to be stored in external memories, which allows a more convenient integration of the core into the soft-core processor system.
A special memory organization, grouping the instances of one internal variable into one distributed memory element, was used. It allows us to reduce the size of the multiplexer connected to the shared arithmetic unit macros, as demonstrated in Figure 5. The experiments show that 8% of the FPGA resources were saved by this approach.

Integration to a microprocessor system
In contrast to common solutions, where the algorithm is used as an IP core in a single-purpose hardware design, our aim is to provide a versatile, configurable RLS lattice solution. Thus, the optimized RLS lattice core was integrated as a coprocessor into the Microblaze processor system. For its development, the Xilinx System Generator (XSG) was used. It makes it possible to create the coprocessor PCORE, which can be directly integrated into the Microblaze system. Such a solution provides maximal applicability of the RLS lattice core, although at the expense of a slight performance loss against a single-purpose design based on the same core. The XSG schematic of the coprocessor can be seen in Figure 6. It consists of the RLS lattice core denoted LSL (which can also be used as a standalone core), of the input and output data buffers denoted inBuf and outBuf, of the control words mb2hc control and hc2mb status, and of the communication interface denoted Comm.
The Comm block is responsible for the batch processing of the input data by the RLS lattice core in the LSL block. It stores the filter results to the output buffer. It is capable of working with an RLS lattice core containing one to four filter instances.
The LSL module consists of the order-updating loop (see Algorithm 1) with an input consisting of four values. After increasing the estimation to a higher order, the same four values can be regarded as the output; otherwise, only the internal states are altered. As a consequence, one filter can use another filter's output as its input and continue in the computation of the order updates. A higher-order filter can thus be obtained as a pipelined solution consisting of multiple RLS lattice filter instances. This is why the Comm block in the coprocessor design has to be capable of reconfiguring the data path between the RLS lattice instances: they can be used either as a parallel four-channel RLS lattice filter (each channel with order up to 126), or up to four instances can be connected serially to create a pipelined RLS lattice solution with order up to 504, achieving 4× higher performance. This is also the reason why, in Figure 6, the fifo block storing the intermediate results is used in the pipelined organization of the coprocessor.
Four versions of the PCORE, with one, two, three, and four RLS lattice modules, were implemented. The resource requirements in the Xilinx Virtex4 SX35 device are summarized in Table 2. For illustration, the floor plan of the system with the Microblaze and the PCORE with four RLS lattice modules as a coprocessor is presented in Figure 7. The area occupied by the Microblaze processor is displayed in yellow, the FSL communication interfaces in blue, the batch processing control unit in purple, and the RLS lattice filter core in green. The entire system occupies 78% of the Xilinx Virtex4 SX35 chip.

DYNAMIC RECONFIGURATION
The implementation of the RLS lattice as a coprocessor connected to the Microblaze makes it possible to use dynamic reconfiguration for loading and unloading the coprocessor while the microcontroller is running. The loading of the coprocessor can be initiated on demand.
A software version of the RLS lattice filter was developed. The floating-point number representation is used in the Microblaze, whereas the 19-bit LNS is used in the hardware. Software conversions based on the HSLA library were used to implement the migration between hardware and software. The conversions are based on the evaluation of the base-2 logarithm contained in the Microblaze glibc library. Such conversions are time consuming even if hardware support of floating point is included in the Microblaze.
To control the load of the processor and the power consumption, a mechanism for migrating the task from the processor to the coprocessor was developed. The software version of the RLS lattice runs in the Microblaze when no other tasks require the processor. When there is a need to use the Microblaze for other tasks, the processor is freed and the RLS lattice is run in the coprocessor. The available run-time configurations are presented in Figure 8: one is the software solution, and the remaining ones are the coprocessor versions containing one to four RLS lattice filters.

PERFORMANCE RESULTS
The performance in each run-time state was measured and the key results are summarized in Table 3. The table is divided into three parts. The first part summarizes the performance of the RLS lattice filter with the probability estimation. In general, the estimation of a model order higher than 15-20 does not give satisfactory results, as was shown in [30]; thus, a higher order of estimation is not expected to be used. As can be seen in the table, the performance decreases with the number of employed RLS lattice instances. The performance of the hardware coprocessor is 5.5 times higher than that of the software solution, even though the clock frequency of the microprocessor is 4 times higher (this can be seen in the table by comparing the Microblaze and PCORE4 rows for N = 16, M = 4). The second part of the table summarizes the performance of the RLS lattice filter with deactivated probability estimation, where the hardware fully exploits the cyclic scheduling. Here, the acceleration of the computation is nearly 20 times compared to the software solution running at a 4 times higher clock speed. The last part of the table shows the results for the pipelined version of the RLS lattice filter. The pipelined solution uses all four RLS lattice instances, each computing 1/4 of the overall estimation order. The hardware solution of order 504 is equivalent to the software with M = 4 and N = 126, and the speedup of 20 times is reached again. The performance of the pipelined solution can be seen in the last row of Table 3. In the pipelined solution, the probability estimates cannot be maintained, because the normalization cannot be implemented in this case.
The switching cost between the run-time states was measured. The time for the conversion is 411 μs per order. To change the state, the local memories in the PCORE have to be reorganized. The time required for this state change was found by another experiment to be 4 μs per order. Thus, we can formulate the state reorganization and conversion times as T_reorg = 4MN μs and T_conv = 411MN μs, respectively.
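These two costs can be written as a small helper (names hypothetical):

```python
def migration_times_us(N, M):
    """Run-time state switch costs measured in the paper (microseconds)."""
    t_conv = 411 * M * N   # software <-> hardware: float <-> LNS conversion
    t_reorg = 4 * M * N    # hardware <-> hardware: PCORE memory reorganization
    return t_conv, t_reorg

# Example: order N = 16 with M = 4 filter instances.
t_conv, t_reorg = migration_times_us(16, 4)
```

The two-orders-of-magnitude gap between T_conv and T_reorg is what makes the software-hardware migration so much more expensive than a hardware-to-hardware reconfiguration.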
An example of the order estimation is presented in Figure 9. The upper plot shows the time when the model order was switched from 0 to 1 and back. The bottom plot shows the corresponding probabilities of orders 0 and 1.

RELATED WORK
The error-feedback RLS lattice filter algorithm was implemented before, as presented in [32]. Our implementation of the lattice PCORE has a similar performance. However, in our implementation, three more lattice filters can run at the same time and the hypotheses probability can be evaluated. We have also demonstrated that the four filters in the coprocessor can be serialized to perform as one 4-level pipelined filter. The performance can then be 4 times better, but the hypotheses estimation cannot be maintained.
There exists another RLS filter FPGA implementation, presented in [33]. Its performance cannot be compared, since the only information provided in the paper is the clock frequency of 5.3 MHz and the resource usage consisting of 2685 slices and 7 multipliers for the floating-point version and 3971 slices and 24 multipliers for the 17-bit LNS version (at 4 MHz). The PCORE implemented in our work uses the 19-bit LNS and can run at 44 MHz. In the smallest configuration, it occupies 4032 slices, 42 BRAMs, and 12 DSPs.
Our implementation is based on the algorithm presented in [17]. Its implementation for the Analog Devices 21061 DSP (SHARC) running at 50 MHz can evaluate the order N = 290 at an 8 kHz sampling rate. From Table 3, it can be extrapolated that our solution at 44 MHz can, in the case of one lattice module, operate at order N = 168 at the 8 kHz sampling rate (1.7× slower than the SHARC solution). When all four RLS lattice modules are used, the pipelined solution can theoretically reach up to N = 753. However, the limitation of the current implementation allows a maximal value of N = 504 (1.7× faster than the SHARC). Alternatively, the filter of order N = 290 can operate at a sampling rate of 20 kHz (2.5× faster than the SHARC). It is important to note that our architecture allows the microprocessor to execute any other task while the RLS lattice coprocessor is busy.

CONCLUSIONS
The easy-to-use PCORE integrated in the Xilinx EDK, with straightforward programming and debugging, can perform 5 times faster than the software solution for the Microblaze. At the maximal order with deactivated probability estimation, the acceleration reaches up to 20 times.
Dynamic reconfiguration can be used to adapt the DSP capabilities to the actual demand by changing the contents of the RLS lattice coprocessor. The migration of the processing from the microprocessor to the hardware requires 411NM microseconds, where N is the order and M is the number of filter instances. This time is determined mainly by the conversion from the Microblaze floating-point representation to the LNS. In the case of reconfiguration between different hardware versions, the reorganization of the internal state takes only 4NM microseconds. The reconfiguration controller must be designed with respect to the high cost of the migration between the software and hardware solutions.

Figure 1: Data flow graph of one order update step of the algorithm.

Figure 3: Schedule of operations of four RLS lattice algorithms with probability estimation. The operations Txx and Ux are defined in Algorithm 1.

Figure 4: Comparison of two approaches to the regularization of the division in T20. The critical path fragment is marked in red.

Figure 5: Reduction of multiplexers by grouping the block memories.

Figure 6: The RLS lattice coprocessor integration into the Microblaze processor via FSL.

Table 2: RLS lattice coprocessor resource usage in the Xilinx Virtex4 SX35 device.

Table 3: Performance of the RLS lattice coprocessor: the execution time for 180 inputs is presented.