A New Pipelined Systolic Array-Based Architecture for Matrix Inversion in FPGAs with Kalman Filter Case Study

A new pipelined systolic array-based (PSA) architecture for matrix inversion is proposed. The PSA architecture is suitable for FPGA implementations as it efficiently uses the available resources of an FPGA. It is scalable to different matrix sizes, and its parameterisation makes it suitable for customisation to application-specific needs. The new architecture has a processing element complexity of O(n), compared to O(n²) in other systolic array structures, where the input matrix is of size n × n. The use of the PSA architecture for a Kalman filter, which requires different structures for different numbers of states, is illustrated as an implementation example. The resulting precision error is analysed and shown to be negligible.


INTRODUCTION
Many DSP algorithms, such as the Kalman filter, involve several iterative matrix operations, the most complicated being matrix inversion, which requires O(n³) computations (n is the matrix size). This becomes the critical bottleneck in the processing time of such algorithms.
With their inherent parallelism and pipelining, systolic arrays have been used for the implementation of recurrent algorithms such as matrix inversion. The lattice arrangement of the basic processing units in a systolic array is well suited to executing regular matrix-type computations. Historically, systolic arrays have been widely used in VLSI implementations when inherent parallelism exists in the algorithm [1].
In recent years, FPGAs have improved considerably in speed, density, and functionality, which makes them ideal for system-on-a-programmable-chip (SOPC) designs for a wide range of applications [2]. In this paper we demonstrate how FPGAs can be used efficiently to implement systolic arrays as an underlying architecture for matrix inversion and for the implementation of a Kalman filter.
The main contributions of this paper are the following.
(1) A new pipelined systolic array (PSA) architecture suitable for matrix inversion and FPGA implementation, which is scalable and parameterisable so that it can be easily used for new applications.
(2) A new efficient approach for hardware-implemented division in FPGAs, which is required in matrix inversion.
(3) A Kalman filter implementation, which demonstrates the advantages of the PSA.
The paper is organised as follows. In Section 2, the Schur complement for the matrix inversion operation is described and a generic systolic array structure for its implementation is shown. Then a new design of a modified array structure, called the PSA, is proposed. In Section 3, the performance of two approaches to scalar division, direct division by a divider and approximated division by a lookup table (LUT) and multiplier, is compared, and an efficient LUT-based scheme with minimum round-off error and resource consumption is proposed. In Section 4, the PSA implementation is described. In Section 5, the system performance and results verification are presented in detail; benchmark comparisons and design limitations are discussed to show the advantages as well as the limitations of the proposed design. In Section 6, a Kalman filter implementation using the proposed PSA structure is presented. Section 7 presents concluding remarks.

MATRIX INVERSION
Hardware implementation of matrix inversion has been discussed in many papers [3]. In this section, a systolic-array-based inversion is introduced, targeting a more efficient implementation in FPGAs.

Schur complement in the Faddeev algorithm
For a compound matrix M in the Faddeev algorithm [4],

M = [ A  B ; −C  D ],

where A, B, C, and D are matrices of size (n × n), (n × l), (m × n), and (m × l), respectively, the Schur complement, D + CA⁻¹B, can be calculated provided that matrix A is nonsingular [4]. First, a row operation is performed to multiply the top row by a matrix W and add the result to the bottom row:

[ A  B ; −C + WA  D + WB ].

When the lower left-hand quadrant of matrix M is nullified, the Schur complement appears in the lower right-hand quadrant. Therefore, W behaves as a decomposition operator and should be equal to W = CA⁻¹, so that

−C + WA = 0,   D + WB = D + CA⁻¹B.

By properly substituting matrices A, B, C, and D, a matrix operation or a combination of operations can be executed via the Schur complement, for example, as follows.
(i) Multiply and add: with A = I, the Schur complement gives D + CB.
(ii) Matrix inversion: with B = I, C = I, and D = 0, the Schur complement gives A⁻¹.
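As a software sketch (not the hardware design), the Schur complement can be obtained by Gaussian elimination on the compound matrix, mirroring the Faddeev row operation with W = CA⁻¹; the elimination is written without pivoting, so it assumes the leading minors of A are nonsingular:

```python
def schur_complement(A, B, C, D):
    """Compute E = D + C A^{-1} B by eliminating the first n columns
    of the compound matrix M = [[A, B], [-C, D]] (pure Python lists)."""
    n = len(A)          # A is n x n
    l = len(B[0])       # B is n x l
    m = len(C)          # C is m x n
    M = [list(map(float, A[i] + B[i])) for i in range(n)]
    M += [[-float(v) for v in C[i]] + list(map(float, D[i])) for i in range(m)]
    # Nullify the lower-left quadrant column by column (no pivoting:
    # assumes the leading minors of A are nonsingular).
    for k in range(n):
        for i in range(k + 1, n + m):
            f = M[i][k] / M[k][k]
            for j in range(k, n + l):
                M[i][j] -= f * M[k][j]
    # The Schur complement appears in the lower-right quadrant.
    return [row[n:] for row in M[n:]]

# Matrix inversion as the special case B = I, C = I, D = 0.
A = [[4.0, 1.0], [2.0, 3.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
Z = [[0.0, 0.0], [0.0, 0.0]]
Ainv = schur_complement(A, I, I, Z)   # 1/10 * [[3, -1], [-2, 4]]
```

The elimination triangularises A and annuls −C in one pass, which is exactly the triangulation/annulment split the systolic cells perform in modes 1 and 2.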

Systolic array for Schur complement implementation
The Schur complement computation is a process of matrix triangulation and annulment [5]. Systolic arrays, because of their regular lattice structure and parallelism, are a good platform for implementing the Schur complement. Different systolic array structures that compute the Schur complement are presented in the literature [3, 6-8]. However, when choosing an array structure one must take into account design efficiency, structure regularity, modularity, and communication topology [9]. The array structure presented in [6] is taken as the starting point for our approach. It consists of only two types of cells, boundary and internal cells, whereas the structure in [3] needs three types of cells. The cell arrangement in the chosen structure is two-dimensional, while the cells in [7] are connected in three-dimensional space with much higher complexity.
The other consideration when choosing the target structure was the type of operations in the cells. In the preferred structure [6], all computations executed in the cells are linear, while [8] would require operations such as square and square root calculations.
A cell is a basic processing unit that accepts input data and computes outputs according to the specified control signal. Both the boundary and internal cells have two different operating modes that determine the computation algorithms employed inside the cells: mode 1 executes matrix triangulation and mode 2 performs annulment. The operating mode of a cell depends on the comparison between the input data and the register content in the cell. The cell operations are described in Figure 1.
To create a systolic array for Schur complement evaluation, E = D + CA⁻¹B, cells are placed in the inverse trapezium pattern shown in Figure 2. The systolic array size is determined by the size of the output matrix E, which is a square matrix in the case of matrix inversion. The number of cells in the top row is twice the size of E, and the number of internal cells decreases by one in each subsequent layer. The matrix is then fed into the systolic array in columns. A and B require mode 1 cell operation, while C and D are computed in mode 2. The result is obtained from the bottom row in skewed form corresponding to the input sequence. Figure 3 gives an illustration.

Modifying systolic array structure
A new systolic array can be constructed from other array structures to achieve certain specifications using the following four techniques [6].
(i) Off-the-peg maps the algorithm onto an existing systolic array directly. Data is preprocessed but the array design is preserved. However, data may be manipulated to ensure that the algorithm works correctly under the array structure.
(ii) Cut-to-fit customises an existing systolic array to adjust for special data structures or to achieve specific system performance. In this case, data is preserved but the array structure is modified.
(iii) Ensemble merges several existing systolic arrays into a new structure to execute one algorithm only. Both data and array structures are preserved, with dataflow transferring between arrays.
(iv) Layer is similar to the ensemble technique. Several existing systolic arrays are joined to form a new array, which switches its operation modes depending on the data. Only part of the new array is utilised at any one time.
In order to overcome the growth of the basic systolic array presented in Section 2.2 with the size of the input matrices, a modified PSA is proposed in this section. When comparing two consecutive layers of the basic array in Figure 2, it can be noted that the cell arrangement is identical except that the lower layer has one less internal cell than its immediate upper layer. This leads to the conclusion that the topmost layer is the only one that has the processing capabilities of all other layers and can be reused to perform the function of any other layer, given the appropriate input data to each cell. In other words, the topmost layer's processing elements can be reused (shared) to implement the functionality of any layer (logical layer) at different times. Obviously, for this to be possible, the intermediate results from the logical layers have to be stored in temporary memories and made available for subsequent calculation. The sharing of the topmost layer's processing elements is achieved by transmitting the output data back to the same layer through feedback paths and pipeline registers. The dataflow graph of the PSA is shown in Figure 4.
In the PSA, the regular lattice structure of the basic systolic array is simplified to include only the first (topmost, physical) layer. Referring to Figure 4, data first enters the single cell row and the outputs are passed to the registers in the same column. These registers, which store the temporary results, are connected in series and also provide the feedback paths. The end of each register column connects to the input ports of the cell in the adjacent column, and the feedback data becomes the input data of the adjacent cell. The corresponding dataflow paths in the two array structures are shown in Figure 5, highlighted in bold arrows. The data originally passing through the basic systolic array re-enters the same single processing layer four times during three recursions.
In order to implement the PSA structure for an n × n matrix, the required numbers of elements are:
(i) number of boundary cells: C_bc = 1;
(ii) number of internal cells: C_ic = 2n − 1;
(iii) number of layers in a column of the register bank: R_L = 2(n − 1);
(iv) total number of registers: R_tot = 2(n − 1)(2n − 1).
The exact structure of the PSA for the example of Figure 5 is presented in Figure 6. As can be seen, when the input matrix size increases, the number of cells required to build the PSA increases as O(n), which is much smaller than the O(n²) of other systolic array structures. The price paid is the number of additional registers used for storage of intermediate results. However, as the complexity of registers is much lower than that of systolic array cells, substantial savings in the implementation can be achieved, as illustrated in Figure 7 for different matrix sizes. Resource utilisation is expressed in the number of logic elements of the FPGA device used for implementation.
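The element counts above can be tabulated directly; the following sketch applies the formulas from the text, together with the basic-array cell count n² + 1 + 2 + ··· + (2n − 1) quoted later for the array of Figure 2, to show the O(n) versus O(n²) growth:

```python
def psa_resources(n):
    """Element counts for the PSA inverting an n x n matrix
    (formulas taken from the text)."""
    return {
        "boundary_cells": 1,
        "internal_cells": 2 * n - 1,
        "register_layers": 2 * (n - 1),
        "total_registers": 2 * (n - 1) * (2 * n - 1),
    }

def basic_array_cells(n):
    # n^2 cells plus the triangular tail 1 + 2 + ... + (2n - 1)
    # of the trapezium-shaped basic systolic array.
    return n * n + sum(range(1, 2 * n))

r4 = psa_resources(4)
cells_psa = r4["boundary_cells"] + r4["internal_cells"]   # 2n = 8 cells
cells_basic = basic_array_cells(4)                        # 44 cells
```

For n = 4 the PSA needs 8 cells plus 42 registers, against 44 cells for the basic array, matching the trade of cells for cheap registers described above.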

Division with multiplication
Scalar division is the most critical arithmetic operation within a processing element in terms of both resource utilisation and propagation delay. This is particularly typical for FPGAs, where a large number of logic elements is usually needed to implement division. For an efficient implementation of division that still satisfies the accuracy requirements, an approach using an LUT and an additional multiplier has been proposed and implemented.
Noting that the numerical result of "a divided by b" is the same as "a multiplied by 1/b," the FPGA built-in multiplier can be used to calculate the division if an LUT of all possible values of 1/b is available in advance.
FPGA devices provide a limited amount of memory that can be used for LUTs. Since b can be treated as an integer, the value of 1/b falls on a decreasing hyperbolic curve: as b grows, the difference between two consecutive values of 1/b decreases dramatically. To reduce the size of the LUT, the inverse value curve can be segmented into several sections with different mapping ratios. This is achieved by storing one inverse value, the median of the group, in the LUT to represent the results of 1/b for a group of consecutive values of b. This process is illustrated in Figure 8. The larger the mapping ratio, the smaller the amount of memory needed for the LUT. Obviously, such segmentation induces a precision error, and the way the inverse curve is segmented is important because it directly affects the result accuracy. A further reduction in memory size is achieved by storing only positive values in the LUT; the sign of the division result can be evaluated by an XOR gate.
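The segmentation idea can be sketched in software. The split point and mapping ratio below are illustrative choices, not the scheme of Table 1: small values of b, where 1/b changes quickly, are mapped one-to-one, while larger values share one stored median reciprocal per group:

```python
def build_lut(b_max, split, ratio):
    """LUT of reciprocals: exact entries for b < split, then one median
    reciprocal per group of 'ratio' consecutive values of b."""
    lut = {}
    for b in range(1, split):          # fine region, mapping ratio 1
        lut[b] = 1.0 / b
    b, idx = split, split
    while b <= b_max:                  # coarse region
        group = list(range(b, min(b + ratio, b_max + 1)))
        lut[idx] = 1.0 / group[len(group) // 2]   # median of the group
        b += ratio
        idx += 1
    return lut

def lut_index(b, split, ratio):
    return b if b < split else split + (b - split) // ratio

def divide(a, b, lut, split, ratio):
    # a / b approximated as |a| * LUT(1/|b|); the sign is restored
    # separately (in hardware, by an XOR of the operand sign bits).
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * abs(a) * lut[lut_index(abs(b), split, ratio)]

lut = build_lut(b_max=1024, split=64, ratio=8)
fine = divide(100.0, 3, lut, split=64, ratio=8)     # exact: b in fine region
signed = divide(-100.0, 3, lut, split=64, ratio=8)  # sign via the XOR rule
```

The dictionary here stands in for the on-chip memory; in the hardware the index arithmetic is the "extra addressing hardware" mentioned below.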
On an Altera APEX device, combining the LUT and multiplier into a single division module, a 16-bit by 26-bit multiplier consumes 838 logic elements (LEs), operates at a 25 MHz clock frequency, and requires a total of 53 248 memory bits on the target FPGA device. The overall speed improvement achieved by the DLM method is 3.5 times compared to a traditional divider. Because of the extra hardware required for efficiently addressing the LUT, the improvement in terms of LEs is rather modest: the hardware divider supplied by Altera, configured as 16-bit by 26-bit, consumes 1 123 LEs when synthesised for the same APEX device.

Optimum segmentation scheme
Since b is a 16-bit number (in 1.15 format), there are 2¹⁵ − 1 = 32 767 different values of 1/b. The performance of various linear and nonlinear segmentation approaches was evaluated with respect to precision error and resource consumption. The absolute error is calculated by subtracting the true value of the inverse 1/b from the LUT output, and the average error is the mean of the absolute error over the 32 767 values. Since the value of 1/b retrieved from the LUT is later multiplied by a to generate the division result, any precision error in the LUT is eventually magnified by the multiplier. Therefore, the worst-case error is more critical than the average precision error: for each segment, it is the largest absolute error of 1/b among all values of b mapped to that segment. The error analysis investigated both the average and the worst-case absolute error. As a result of this analysis, an optimum segmentation scheme, tabulated in Table 1, was determined. It provides the minimum precision required for a typical hardware-implemented matrix inversion operation, which was verified by simulation using the Matlab DSP blockset for a number of applications. The resulting LUT holds 4 096 inverse values with a 26-bit word length in 16.10 data format.
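The error analysis described above can be reproduced in outline. The sketch below evaluates a naive uniform segmentation (one mapping ratio over the whole range, unlike the optimised scheme of Table 1) and computes the average and worst-case absolute error of the stored median reciprocal:

```python
def segment_errors(b_max, ratio):
    """Average and worst-case absolute error of a uniform segmentation
    that stores the median reciprocal of each group of 'ratio' values."""
    abs_errors = []
    b = 1
    while b <= b_max:
        group = list(range(b, min(b + ratio, b_max + 1)))
        stored = 1.0 / group[len(group) // 2]        # median reciprocal
        abs_errors += [abs(1.0 / v - stored) for v in group]
        b += ratio
    average = sum(abs_errors) / len(abs_errors)
    worst = max(abs_errors)    # this is what the multiplier magnifies
    return average, worst

avg8, worst8 = segment_errors(b_max=32767, ratio=8)
avg16, worst16 = segment_errors(b_max=32767, ratio=16)
```

Running this shows the worst-case error is dominated by the smallest values of b, which is exactly why the optimised scheme uses different mapping ratios along the curve rather than one uniform ratio.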

PIPELINED SYSTOLIC ARRAY IMPLEMENTATION
The implementation block diagram of the PSA structure is shown in Figure 9. The datapath architecture is illustrated in Figure 10. The interfacing of the control unit with the other internal and external cells is shown in Figure 11.

Control unit
The control unit is a timing module responsible for generating the control signals at specific time instances. It is synchronous to the system clock, and counters are its main components. The I/O data of the control unit are listed below.

Inputs
(i) 1-bit system clock clk: for synchronisation; the basic unit of the timing circuitry.
(ii) 1-bit reset signal reset: resets the control unit operation. Counters are reset to their initial values and restart their counting sequences.

Outputs
(i) 1-bit cell operation signal mode: decides the cell operation mode; "1" for mode 1 and "0" for mode 2.
(ii) 1-bit register clear signal clear: activates the content-clear function in the cell internal registers; "1" for enable and "0" for disable.
(iii) 1-bit multiplexer select signal sel: controls the input data source selection in the datapath multiplexers; "1" for input from the matrix and "0" for input from the feedback path.
Since the modules in the PSA are arranged in a systolic structure and connected synchronously, the control signals required to operate them are also generated in regular timing patterns. Figure 12 demonstrates the control signals required for operating PSAs of different sizes.

Resource consumption and timing restrictions
Compared to other systolic arrays in the literature, the small logic resource consumption is the main advantage of the proposed PSA structure. For example, for inverting an n × n matrix, the PSA requires the instantiation of 2n cells, while the systolic array in Figure 2 requires (n² + 1 + 2 + ··· + (2n − 1)) = n² + n(2n − 1) cells. Because of the feedback paths and the single cell layer structure of the PSA, the number of processing elements required for implementation has been reduced, and therefore the hardware complexity has changed from O(n²) to O(n).
A generic PSA has a customisable size and configurable structure. The final size of the PSA can be estimated by summing the resource consumption of its building blocks: I internal cells, R 16-bit pipelining registers, M 16-bit input select multiplexers, and D 3-bit signal delay D-FFs. It should be noted that the actual size of the synthesised PSA on an FPGA device will be affected by the architecture and routing resources of the FPGA.
The processing time for an n × n matrix inversion in the PSA is 2(n² − 1) clock cycles, at a maximum clock frequency of 16.5 MHz for n < 10 in our implementation (Altera APEX EP20K200EFC484-2). When a larger PSA is synthesised, the system clock frequency decreases as the critical path extends.
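The latency figure above can be turned into a small calculator; this is a sketch using the cycle count and the 16.5 MHz clock quoted in the text (valid for n < 10 on the stated device):

```python
def psa_inversion_time_us(n, f_clk_hz=16.5e6):
    """Latency in microseconds of an n x n inversion on the PSA:
    2(n^2 - 1) cycles at the given clock frequency."""
    cycles = 2 * (n * n - 1)
    return cycles * 1e6 / f_clk_hz

t4 = psa_inversion_time_us(4)   # 30 cycles at 16.5 MHz, about 1.8 us
```

For n = 4 this gives 30 cycles, i.e. roughly 1.8 microseconds per full inversion at the quoted clock.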

Comparisons with other implementations
The PSA performance has been compared with some other matrix inversion structures based on systolic arrays in terms of number of processing elements (or cells), number of cell types, logic element consumption, maximum clock frequency, and design flexibility.
For an n × n matrix inversion, the PSA requires 2n cells, while n(3n + 1)/2 cells are used in the systolic array based on the Gauss-Jordan elimination algorithm [10]. In the PSA, cells are classified as either boundary or internal cells, while the processing elements in the matrix inversion array structure in [5] are divided into three different functional groups.
When working with a 4 × 4 matrix, it takes 4 784 LEs to implement the PSA on an Altera APEX device, while 8 610 LEs are used to implement the same function in a matrix-based systolic algorithm engineering (MBSAE) Kalman filter [11].

When synthesised on an Altera APEX device (EP20K200EFC484-2), the PSA allows a maximum throughput of 16 MHz, compared to only 2 MHz for the systolic array based design reported in [11] and 10 MHz for the geometric arithmetic parallel processor (GAPP) in [12]. The PSA is designed to be customisable and parameterisable, while the other systolic arrays in the literature are all fixed-size structures.

Limitations
In our design, several built-in modules from the vendor library were used for basic dataflow control and arithmetic calculations. Therefore, the results reported in this paper are valid only for specific FPGA devices. However, as libraries provided by other FPGA vendors have equivalent functionalities readily available, the proposed design can be easily modified and ported to other FPGA device families.
One disadvantage of the PSA design is that input data has to be in skewed form before entering the array. When the PSA interfaces with other processors, a data wrapping preprocessing stage may be required to pack the data into the specific skewed form shown in Figure 13. Output data from the PSA is unpacked to rearrange the results back into regular matrix form.
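The packing and unpacking stages can be modelled as follows; this is a sketch of the usual systolic skewing, where column j of the matrix is delayed by j clock steps and zeros model the idle input slots (the exact format of Figure 13 may differ in detail):

```python
def skew(matrix):
    """Pack a matrix into skewed form: column j enters j cycles late,
    so an n x m matrix becomes a stream of n + m - 1 time steps."""
    n, m = len(matrix), len(matrix[0])
    steps = n + m - 1
    stream = [[0] * m for _ in range(steps)]
    for t in range(steps):
        for j in range(m):
            i = t - j
            if 0 <= i < n:
                stream[t][j] = matrix[i][j]
    return stream

def unskew(stream, n, m):
    """Inverse operation: recover the regular matrix from the stream."""
    return [[stream[i + j][j] for j in range(m)] for i in range(n)]

M = [[1, 2], [3, 4]]
packed = skew(M)                  # [[1, 0], [3, 2], [0, 4]]
restored = unskew(packed, 2, 2)   # back to [[1, 2], [3, 4]]
```

Round-tripping a matrix through skew/unskew is a quick sanity check that the wrapper and unwrapper stages agree on the timing offsets.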

Effects of the finite word length
The finite word length performance of the PSA structure was analysed. All quantities in the structure are represented using fixed-point numbers. It should be noted that only multiplication, and division (which is itself computed by multiplication), introduce round-off error [13]; addition and subtraction do not produce any round-off noise. The approach used here was to follow the arithmetic operations in the update equations of the different variables and keep track of the errors arising from finite-precision quantisation. As described earlier in the paper, all multiplication operations are performed on 26-bit data; the computation results, as well as the data in the LUT, are 26 bits long. To a large extent, this eliminates the possibility of overflow for matrices of small size, regardless of the actual data values. Simulation shows that the inverse of a matrix of size up to 10 × 10, with data represented in 26 bits, can be computed with minimal error, which is sufficient for most practical applications. Obviously, as the size of the matrix increases, the error also increases. However, as the proposed design is fully parameterised, the word length used in the computation can be increased accordingly, at the cost of higher FPGA resource usage.
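The way multiplication magnifies quantisation error can be illustrated with a small sketch. It quantises a reciprocal onto a grid with 10 fractional bits (the 16.10 format used for the LUT contents in the text) and checks the resulting division error against the half-LSB bound:

```python
FRAC_BITS = 10          # fractional bits of the 16.10 LUT format

def quantize(x, frac_bits=FRAC_BITS):
    """Round x to the nearest multiple of 2^-frac_bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def fixed_point_divide(a, b):
    # Division as multiplication by a quantised reciprocal: the result
    # inherits the reciprocal's round-off error, magnified by |a|.
    return a * quantize(1.0 / b)

err = abs(fixed_point_divide(1000.0, 7) - 1000.0 / 7)
bound = 1000.0 * 2 ** -(FRAC_BITS + 1)   # |a| * half an LSB
```

As the sketch shows, the error after the multiply is bounded by |a| times half an LSB of the reciprocal, which is why the worst-case (rather than average) LUT error drives the precision analysis.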

Kalman filter
Since its introduction in the early 1960s [14], the Kalman filter has been used in a wide range of applications; it falls into the category of recursive least squares (RLS) filters. As a powerful linear estimator for dynamic systems, the Kalman filter invokes the concept of state space [15]. The state-space formulation allows Kalman filters to compute a new state estimate from the previous state estimate and new input data [16]. The Kalman filter algorithm consists of six equations in a recursive loop, so results are continuously calculated step by step. To derive the Kalman filter equations, a mathematical model is built to describe the dynamics and the measurement system in the form of the linear equations (9) and (10).
(i) Process equation:

x(n + 1) = A x(n) + w(n),    (9)

(ii) Measurement equation:

s(n) = B x(n) + v(n),    (10)

where x(n) is the state at time instance n, s(n) is the measurement at time instance n, A is the processing matrix, B is the measurement matrix, w(n) is the system processing noise, and v(n) is the measurement noise. In (9), A describes the plant and the changes of the state vector x(n) over time, while w(n) is a plant disturbance vector of zero-mean Gaussian white noise. In (10), B linearly relates the system states to the measurements, where v(n) is a measurement noise vector of zero-mean Gaussian white noise. The Kalman filter equations can be grouped into two basic operations: prediction and filtering. Prediction, sometimes referred to as the time update, estimates the new state and its uncertainty. An estimated state vector is denoted x̂(n). When an estimate of x(n) is computed before the current measurement data s(n) become available, the estimate is classified as an a priori estimate; when it is made after the measurement s(n) arrives, it is called an a posteriori estimate [16]. Filtering, usually referred to as the measurement update, corrects the previous estimate with the arrival of new measurement data. The prediction error is computed from the difference between the actual measurement and the estimated value, and is used to refine the parameters of the prediction immediately in order to generate a more accurate estimate in the future. The full set of Kalman filter equations can be found in [17].
It is evident from the Kalman filter equations that the algorithm comprises a set of matrix operations, including matrix addition, matrix subtraction, matrix multiplication, and matrix inversion. Among these, matrix inversion is the most computationally expensive and is thus the bottleneck in the processing time of the algorithm, such that the overall system processing time mainly depends on the matrix inversion speed [10]. In Section 2, a new implementation of matrix inversion, which is in fact the "heart" of the Kalman filter, was presented. The hardware implementation of another critical operation, division, was presented in Section 3.
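The predict/update cycle described above can be sketched with a one-state (scalar) filter; this is an illustrative example only, not the PSA's 9-step Schur complement schedule of Table 2, and the noise parameters are arbitrary assumptions:

```python
def kalman_1d(measurements, a=1.0, b=1.0, q=1e-4, r=0.25, x0=0.0, p0=1.0):
    """One-state Kalman filter: a, b are the scalar process and
    measurement coefficients; q, r the process and measurement
    noise variances; x0, p0 the initial estimate and uncertainty."""
    x, p = x0, p0
    estimates = []
    for s in measurements:
        # Prediction (time update): a priori estimate and uncertainty
        x_prior = a * x
        p_prior = a * p * a + q
        # Filtering (measurement update): correct with the measurement
        k = p_prior * b / (b * p_prior * b + r)      # Kalman gain
        x = x_prior + k * (s - b * x_prior)
        p = (1.0 - k * b) * p_prior
        estimates.append(x)
    return estimates

# Noisy observations of a constant signal near 1.0: the estimate
# should settle close to the true value.
est = kalman_1d([1.2, 0.8, 1.1, 0.9, 1.05, 0.95, 1.0, 1.0])
```

In the matrix-form filter every scalar here becomes a matrix, and the gain computation involves the matrix inversion whose PSA implementation is the subject of this paper.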

Kalman filter in PSA-based structure
As a case study to verify the performance of the proposed PSA, a Kalman-filter-based echo cancellation application was implemented. By appropriate substitutions of matrices A, B, C, and D (Table 2), the matrix-form Kalman filter equations can be computed by the PSA in 9 steps. A complete execution of the 9 steps produces the state estimates for the next time instance and constitutes one recursion of the Kalman filter algorithm.
The components of the four input matrices are queued in a skewed package entering the PSA cells row by row. It can be noted from Table 2 that some Schur complement results are used as input data in later steps; thus, extra registers are required to store the intermediate results. To ensure that the intermediate results are reloaded into the specific cells at the correct time instances, a new datapath and control unit were designed. The results produced by the implementation match the theoretical values; the small residual error observed in the resulting data is attributed to the finite word length effects typical of the fixed-point structure of the proposed design.

Comparison with other implementations
There are several hardware implementations of the Kalman filter in the literature. For a 4-state Kalman filter, all the Kalman filter equations can be expressed as 30 scalar equations. Similar to the PSA, direct matrix inversion is also avoided in the matrix decomposition method (MDM), where the Kalman gain calculation turns into a set of 4 scalar equations with scalar division and addition. With the short processing time of 169.4 nanoseconds reported in [18], MDM appears faster than the PSA (280 nanoseconds) on the same target APEX device. However, the PSA structure offers the following advantages.

Flexibility
When the number of states in a Kalman filter changes, all the scalar equations in MDM become invalid, as the matrix dimensions in the algorithm depend on the size of the state vector. Considerable design time is required to decompose the matrix-form equations again. In the PSA, however, a Kalman filter with a different number of states can be generated by modifying one parameter (the number of states, i.e., the matrix size) in the heading of the VHDL code. The PSA serves as an IP block for a generic Kalman filter in VHDL, while MDM is a hard-wired implementation of a fixed Kalman filter.

Clock speed
The advantages and conditions of using an LUT with a multiplier to perform scalar division have been discussed in Section 3.2. This approach enables the PSA to run at a system clock frequency 3.5 times higher than when using scalar dividers only.

Resource usage
In the MDM method, 32 addition/subtraction operations, 22 multiplications, and 4 divisions are involved in the scalar operations. The overall logic element usage of the PSA is 40% lower than that of an equivalent MDM-based design for a 4-state Kalman filter implementation.

CONCLUSIONS
In this paper, an optimised systolic-array-based matrix inversion for implementation in FPGAs was proposed and used for rapid prototyping of a Kalman filter. Matrix inversion is the computational bottleneck and the most complex operation in Kalman filtering. The PSA matrix inversion results in a simple, yet fast, implementation of the operation. It is scalable to matrices of various sizes and is implemented as a parameterised design, which allows its direct customisation and instantiation for application-specific problems. Resource utilisation is low and depends linearly on the matrix size.
Modified from the Schur complement systolic array, the PSA simplifies the recursive matrix-form equations of Kalman filters to scalar operations and inherits the design advantages of parallelism and pipelining. Within the proposed PSA design, a new approach to the implementation of scalar division has also been proposed, which speeds up the division operation 3.5 times over traditional dividers while using fewer logic elements and resources.

Figure 1: Operations of boundary cell and internal cell.

Figure 7: Logic resource usage comparison between the PSA and the basic systolic array.

Figure 8: A simple demonstration of segments with different mapping ratios.

Figure 13: Procedures for input data packing and output data unpacking.

Figure 14: The original data paths of the PSA.

Table 1: The optimum segmentation scheme.

Table 2: Matrix substitutions for Kalman filter algorithms.