Real-Time Signal Processing for Multiantenna Systems: Algorithms, Optimization, and Implementation on an Experimental Test-Bed

A recently realized concept of a reconfigurable hardware test-bed suitable for real-time mobile communication with multiple antennas is presented in this paper. We discuss the reasons and prerequisites for real-time capable MIMO transmission systems which may allow channel adaptive transmission to increase link stability and data throughput. We describe a concept of an efficient implementation of MIMO signal processing using FPGAs and DSPs. We focus on some basic linear and nonlinear MIMO detection and precoding algorithms and their optimization for a DSP target, and a few principal steps for computational performance enhancement are outlined. An experimental verification of several real-time MIMO transmission schemes at high data rates in a typical office scenario is presented and results on the achieved BER and throughput performance are given. The different transmission schemes used either channel state information at both sides of the link or at one side only (transmitter or receiver). Spectral efficiencies of more than 20 bits/s/Hz and a throughput of more than 150 Mbps were shown with a single-carrier transmission. The experimental results clearly show the feasibility of real-time high data rate MIMO techniques with state-of-the-art hardware and that more sophisticated baseband signal processing will be an essential part of future communication systems. A discussion on implementation challenges towards future wireless communication systems supporting higher data rates (1 Gbps and beyond) or high mobility concludes the paper.


Motivation
The widespread use of wireless and mobile communication devices has changed everyday life over the past decade. The introduction of cellular networks laid the foundation for mobile communication almost everywhere, anytime, and with everyone. A growing use of data communication, mainly over the internet, for example, email, news, or information of any kind, produces an increasing demand for wireless data traffic as well. Since wireless connections are generally not exclusive point-to-point connections like the land lines used, for example, for telephone and DSL, the available frequency spectrum has to be shared with other users and radio systems.
The high expectations towards the growth of mobile communications made the available spectrum valuable and expensive to license. Therefore, it is a prerequisite for all service providers and radio systems to exploit the limited frequency spectrum very efficiently.
A new transmission concept proposed by Foschini [1] using multiple antennas at each side of the radio link promises a significant increase in spectral efficiency. A fundamental information-theoretic work by Telatar [2] on the capacity of multiantenna channels triggered intensive research activities in the multiple-input multiple-output (MIMO) area worldwide. The new domain to be exploited is the spatial domain, taking into account the separability of the spatial signatures belonging to data streams transmitted from different antennas. MIMO transmission allows several radio links to be supported at the same time, in the same frequency band, and without any need for code separation.

State of the art and related work
The increasing demand for faster and more reliable wireless communication links reopened discussions on how to exploit the degrees of freedom in wireless communication which come basically from time, frequency, space, or scenarios with many users to choose from. Since the time and frequency domains are already exploited to a high extent, the spatial domain offers an additional degree of freedom. The work of Foschini [1,3] inspired discussion about radio transmission systems with multiple antennas at both ends of the link, so-called MIMO systems. The achievable capacity in a single-cell multiuser scenario [4] was well understood, and it was also well known that the use of several antennas at one side of the transmission link can increase the system capacity and performance due to transmit or receive diversity [5]. In recent years, it was found that MIMO systems have the ability to reach higher spectral efficiency than systems using antenna arrays only at one side of the link [6]. This so-called spatial multiplexing was studied in [1,7,8,9] and is based on the fact that under a sum power constraint the capacity can be increased by establishing several parallel links (MIMO) instead of one single-input single-output (SISO) link. When the transmission with spatial multiplexing is separable, then the sum capacity is given by the sum of the individual capacities which is always bigger than that of a single-antenna link. Reference [10] showed that there exists a fundamental tradeoff between multiplexing and diversity gain for any multiantenna system.
EURASIP Journal on Applied Signal Processing
In 1998, a first successful experimental demonstration [11] proved the practical feasibility of spatial multiplexing in narrowband frequency-flat channels which boosted the research effort in the MIMO area.
For the case of channel state information (CSI) at the transmitter, the link performance can be enhanced by appropriate signal processing at the transmitter before emitting the signal from the antennas. The simplest way is to exploit transmit diversity [12], while linear transmit precoding, proposed in [13][14][15] or in the context of CDMA [16,17], needs more complex signal processing at the transmit side. A first real-time implementation of adaptive linear precoding has recently been presented in [18].
If CSI is available at the Tx and the Rx, then eigenmode transmission [19][20][21] is the optimum strategy. The data streams are coupled into the eigenspaces of the channel and decoupled at the Rx, providing full decorrelation due to the orthogonal subspaces. An ASIC implementation of the algorithms for slow flat-fading channels has recently been presented [22], while [23] realized a narrowband, low-data-rate implementation of eigenmode transmission with low-cost off-the-shelf RF components and DSPs.
A further important contribution to the overall multiantenna system performance is made by proper coding against noise distortion and, more importantly, bad fading channel states, for example, [24,25]. The additional spatial dimension allows for so-called space-time codes which basically transmit replicas of the same information over, for example, different antennas in different time slots. In parallel, very efficient and powerful error-correcting codes like turbo codes [26] or low-density parity check (LDPC) codes [27] have been developed over recent years and are now entering the application stage [28,29]. Coded transmission, which is a research area in itself, is not considered in this paper; this does not mean that we disregard the impact of channel and source coding on the final system performance.
Practical transmission systems normally apply neither Gaussian alphabets nor infinite interleaving, as would be required from the capacity point of view. Nevertheless, we are interested in how to achieve optimum rate and performance with, for example, discrete modulation alphabets and/or symbol-by-symbol decisions. This problem is generally referred to as bit loading and can be performed in time, space, and frequency [30]. Reference [31] gave theoretical sufficient conditions for discrete bit loading to be optimum in the context of OFDM. References [32][33][34][35][36][37][38] proposed bit-loading strategies for fixed-rate applications. A recent work in [39] has discussed an analytical optimization of the joint error rate with successive interference cancellation at fixed rate by means of power and bit allocation. In [40], it was shown that a transmission using an MMSE-SIC receiver combined with adaptive modulation and coding is capacity achieving at high SNR, at least in theory.
A slightly different bit-loading approach is outlined in this paper. The idea exploits the fact that CSI is available to the transmission system, so that channel-aware bit loading can be performed in the sense that transmission in bad channels is avoided. Exploiting CSI and the detector structure, we can predict the achieved signal-to-interference-plus-noise ratio (SINR) in front of the decision unit. Based on symbol-by-symbol decisions, we can now adapt power and bit allocation such that all data streams have a desired error probability [41,42] which can be controlled. The proposed scheme has a variable rate but an upper-bounded and assured BER, which requires error-correcting codes only to contribute SNR gain instead of protection against fading. This allows for codes with high code rates, for example, Reed-Solomon codes or product accumulate codes [43], and schemes like automatic repeat request (ARQ) [44][45][46][47][48] are supported ideally since the achieved BER and FER can be controlled to the desired working point. References [18,49] showed the advantages of channel-aware bit loading in experiments at high data rate. The resulting variable data rate in a single-user scenario might appear unusual, but with an increasing number of users, a multiuser scheduling algorithm can control the data streams individually and match them to the requested data rates of each user.
In the reality of multiuser scenarios the user scheduling becomes a challenging task when spectral efficiency and quality of service (QoS), for example, average rate or delay, are included in the optimization. Works in [50][51][52][53][54] proposed a powerful framework to solve the complex scheduling task very efficiently, such that a real-time implementation [55] on today's hardware could show the gains towards sum rate and individual QoS requirements of scheduling policies derived from a cross-layer optimization.
In Section 2, we will introduce the technical challenges involved with high-data-rate MIMO signal processing. In Section 3, we describe our reconfigurable experimental test-bed and in Section 4 we discuss the computational expenses and achievable performance with optimization of several basic MIMO algorithms. Section 5 presents some results from transmission experiments conducted on the test-bed. Section 6 finally summarizes the paper and gives a short outlook on technical challenges which have to be met for a further increase of spectral efficiency, data rate, and adaptivity of multiantenna systems.

REAL-TIME MIMO SIGNAL PROCESSING: CHALLENGES AND IMPLEMENTATION ASPECTS
The advantages of MIMO techniques towards spectral efficiency and enhancing the link stability are well understood and generally accepted by the community, but there is still a lot of work to be done to bring those techniques into real-world systems. We are now on the verge of the wider introduction of MIMO techniques for various deployments, and the technical challenges require solutions. This is where reprogrammable MIMO platforms for rapid prototyping are needed. The analysis of the theoretically well-understood MIMO algorithms has to be done under all constraints given by the real world, for example, limited processing capability of state-of-the-art signal processing architectures, imperfections of RF components (dirty RF), frequency selectivity and time variance of the transmission channel, cochannel interference by other users using the same frequency resource, and so forth.
So an experimental analysis of several transmission, detection, and precoding schemes by implementing them exemplarily on a test-bed is a challenging task, since high-speed data reconstruction and algorithmic flexibility are required at the same time. Our approach and its realization will be described in the following.
The reconstruction of the data streams transmitted over MIMO channels requires very fast matrix-vector multiplications at the symbol rate. Therefore, the digitized signals from all Rx antennas have to be available in a joint processing unit, which means a very high number of digital I/O ports. This can be met, for example, by FPGAs which are equipped with sufficient parallel I/O ports. A classical 32-bit bus architecture common with PCs and DSPs is not appropriate because the amount of data from the A/D converters (ADCs) easily exceeds the capability of those buses. To illustrate the immense amount of data necessary for MIMO baseband signal processing, the following example is given for OFDM with direct downconversion at a bandwidth of 20 MHz (2x oversampling), 5 Rx antennas, and 12-bit resolution in I and Q: 2 · 20 MHz · 2 · 5 · 12 bits = 4.8 Gbps, which is quite a remarkable data rate and is hard to realize with today's computer buses.
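The arithmetic of this example can be reproduced with a few lines of Python. This is only an illustrative sketch; the function name and its parameters are our own naming and are not part of the test-bed software.

```python
def adc_aggregate_rate_bps(bandwidth_hz, oversampling, n_rx, bits_per_sample,
                           iq_branches=2):
    """Aggregate ADC output rate in bits per second.

    Each Rx antenna delivers an I and a Q branch (iq_branches=2), each
    sampled at oversampling * bandwidth with bits_per_sample resolution.
    """
    sample_rate_hz = oversampling * bandwidth_hz
    return sample_rate_hz * iq_branches * n_rx * bits_per_sample


# Values from the example in the text: 20 MHz, 2x oversampling, 5 Rx, 12 bit.
rate = adc_aggregate_rate_bps(20e6, 2, 5, 12)
print(f"{rate / 1e9:.1f} Gbps")  # prints "4.8 Gbps"
```

The same function shows how quickly the requirement grows: doubling either the bandwidth or the number of Rx antennas doubles the rate that the joint processing unit has to accept.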
For the signal reconstruction, we assume a block data frame detection using matrix × vector multiplications on a symbol-by-symbol basis. In static or quasistatic scenarios, this allows that the MIMO filters (matrices) can be used for the reconstruction of the entire data block. But, even those relaxed assumptions require strong hardware capabilities concerning bus architecture, processing power, and so forth.
With rising mobility, the channel becomes more time-variant and the filter coefficients for the data detection have to be recalculated within a fraction of the coherence time of the channel. This alone can already be challenging in flat-fading scenarios when the number of Tx and Rx antennas is growing and more sophisticated algorithms like, for example, V-BLAST or SVD, are performed. A recently presented 1 Gbps implementation of near-ML decoding [56] over a fading channel simulator has shown the enormous hardware complexity involved when MIMO-OFDM with many carriers has to be processed in real time at very high data rate.
For indoor scenarios, the channel coherence time can be several milliseconds, which seems to be a quite relaxed time frame for the computation of, for example, filter matrices in single-carrier transmission schemes. Assuming OFDM (footnote 1), even this time window of a few milliseconds can be a limiting factor if the number of subcarriers is increased. Such an increase is necessary with increasing frequency selectivity of the channel and desirable with respect to spectral efficiency, because the necessary length of the guard interval with OFDM is determined by the radio propagation environment.
When the channel is changing more rapidly, which can be caused, for example, by high mobility of the user (car, train, etc.), then the time limits become even more restrictive, because faster channel tracking is required and this cannot be done with simple phase and amplitude tracking as in the SISO case.
Another aspect which has to be considered is nonlinearities and imperfections in the RF chain, for example, I/Q mismatch which can cause I/Q or image crosstalk and have to be compensated by the baseband signal processing. This often requires a real-valued baseband processing which doubles the computational effort with matrix computations, in general.

THE REAL-TIME MIMO TEST-BED: A HYBRID SIGNAL PROCESSING APPROACH
The real-time MIMO test-bed described here was developed in the German HyEff project. The goal was to show the feasibility of MIMO in real time in a single-carrier link based on the well-known flat-fading algorithms, and to speed up the signal processing in this first step beyond the natural limits set by the temporal dispersion found in typical indoor channels. We evaluated various architectures and implemented one promising approach which has been fully operational since July 2003 (see Figure 1). This prototype was presented with real-time transmission experiments at the Globecom conference in San Francisco in December 2003.
1 Note that for OFDM, the frame structure and the channel estimation have to be adapted to a specific environment satisfying $Z \cdot M \cdot 1/B_{\mathrm{Sig}} \ll \tau(H)$, with $Z$ denoting the number of OFDM symbols per frame and $M$ the number of subcarriers. $B_{\mathrm{Sig}}$ is the baseband signal bandwidth and $\tau(H)$ denotes the channel coherence time. If the channel coherence time is held fixed, then an increase of signal bandwidth always allows for more subcarriers and OFDM symbols per frame. This is very important since MIMO-OFDM in general requires pilot symbols for the MIMO channel estimation, and the length of the pilot preamble cannot be reduced below a certain minimum depending on the number of Tx antennas and the desired accuracy of the channel estimation [57]. We can conclude that a signal bandwidth increase supports higher rate and spectral efficiency, in general.

General concept of the multiantenna test-bed
To exploit the multiplexing and diversity potential of multiantenna systems, a higher effort of baseband signal processing is a prerequisite. To match those signal processing requirements, a hybrid design was chosen for the test-bed (see Figure 2). The main baseband signal processing units consist of an FPGA for very fast matrix-vector multiplications and a DSP for a flexible implementation of more sophisticated algorithms. This baseband design concept unites real-time high-data-rate capability and a high flexibility regarding the detection and precoding algorithms under investigation. The D/A and A/D converters use duplex mode (footnote 2) and are integrated on a special board which is plugged onto the FPGA board.
The RF frontend uses direct up- and downconversion (DUC/DDC) with a center frequency of 5.2 GHz for the local oscillator (LO).

Transmitter
In the setup under investigation, we use four transmit antennas. The 5.2 GHz radio hardware has a bandwidth of roughly 100 MHz and performs direct analog upconversion using four I/Q mixers, each followed by a +20 dBm power amplifier (ZRON-8G, Mini Circuits); see Figure 3. Up to four independent complex-valued data streams are transmitted over the air. The data generation and the modulation are realized within a Xilinx Virtex II 8000 FPGA. The output signals are D/A converted with 12-bit resolution and used to modulate the carrier. One reason to use FPGAs instead of DSPs is the need for joint signal processing of multiple data streams. The limited number of input and output ports of current DSPs may not allow multiple high-data-rate streams in parallel. Due to the FPGA realization, all the signal processing must be carefully programmed in VHDL to allow a proper timing control. The periodically transmitted signal consists of a preamble and a data block. Each I and Q branch of the Tx antennas is tagged with a different 127-bit Gold sequence transmitted in BPSK format in the preamble. The length of the pilots is intentionally oversized in the experimental system to get precise channel estimates. The pilots are followed by a pseudorandom data block with 1024 symbols on each stream. The modulation of the data is set independently on each I and Q branch with up to 16 PAM levels, allowing schemes from BPSK to 256-QAM.

Receiver
The received signals from 5 antennas are directly downconverted using analog I/Q demodulators and digitized using 12-bit AD converters (see Figure 4).
The analog design creates a severe I/Q imbalance (3-4 degrees for commercial I/Q mixers) which has to be taken into account in the entire system concept. In principle, we treat the complex-valued MIMO baseband system with 4 Txs and 5 Rxs as a real-valued system having 8 Txs and 10 Rxs to compensate the I/Q crosstalk.
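The step from a complex-valued 4 × 5 system to a real-valued 8 × 10 system can be illustrated as follows. The sketch assumes an ideal channel without I/Q imbalance, in which case the real-valued matrix is fully determined by the real and imaginary parts of the complex one; with I/Q imbalance this block structure no longer holds exactly, which is why all 80 real-valued entries are treated individually. The helper names are ours and purely illustrative.

```python
import random


def to_real_matrix(H):
    """Map an m x n complex matrix to the equivalent 2m x 2n real matrix
    [[Re(H), -Im(H)], [Im(H), Re(H)]]."""
    top = [[h.real for h in row] + [-h.imag for h in row] for row in H]
    bottom = [[h.imag for h in row] + [h.real for h in row] for row in H]
    return top + bottom


def to_real_vector(x):
    """Stack real parts on top of imaginary parts."""
    return [v.real for v in x] + [v.imag for v in x]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


random.seed(0)
n_tx, n_rx = 4, 5
# Hypothetical random channel and QPSK symbols, for illustration only.
H = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n_tx)]
     for _ in range(n_rx)]
x = [complex(random.choice([-1, 1]), random.choice([-1, 1]))
     for _ in range(n_tx)]

y_complex = matvec(H, x)
H_real = to_real_matrix(H)
y_real = matvec(H_real, to_real_vector(x))

print(len(H_real), "x", len(H_real[0]))  # 10 x 8
```

Both descriptions yield the same receive vector, so a detector designed for the real-valued model separates the I and Q components of all streams jointly.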
Note that the I/Q imbalance can be compensated at each transmit and receive antenna after a careful calibration is done. This is of even greater importance for OFDM schemes [58] due to the crosstalk between the image frequencies. For the SISO-OFDM case, [59][60][61] proposed the estimation of the I/Q imbalances based on statistical measures, but these concepts are not straightforwardly applicable to multiple antennas since signals coming from different transmit antennas are not separable by this method. Therefore, our concept of real-valued data separation can be used here as well, but now the symbols on subcarrier $f_i$ have to be reconstructed together with the symbols from subcarrier $-f_i$ [62], which expands the detector matrix, for example, the MMSE filter, by a factor of 2 in each dimension. For a MIMO-OFDM system with 4 Tx and 5 Rx antennas, this would mean that a real-valued matrix with $2(2n_T) \times 2(2m_R) = 320$ entries would have to be computed and processed in real time with the received data vector. If the number of multipliers in the FPGA is limited, then an I/Q preequalization at the Tx antennas and an I/Q equalization at the Rx antennas is a reasonable alternative, but careful calibration is needed in advance. For low signal bandwidth (< 50 MHz), digital up- and downconversion is another favorable option.

Channel estimation
In the Rx-FPGA, 80 correlation circuits (CCs) are implemented using the known training sequences. Since binary pilot sequences are used, the CCs need no multipliers. The next bit in the sequence may change the sign of the signal to be accumulated, so the CC switches between addition and subtraction. Additional CCs based on unused sequences are used to estimate the noise variance of each receive branch. The channel estimates are available immediately after the last bit in the training sequence and are stored in dedicated registers. These registers are read out by a separate DSP (Texas Instruments 6713) connected to the FPGA via a parallel bus (24-bit flat ribbon cable). The DSP is used to calculate the coefficients of, for example, a linear MMSE filter which are then sent back to the dedicated weight registers in the FPGA via the same link. The read and write operations of the DSP are fully asynchronous to the transmitted frame structure.
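The principle of a multiplier-free correlation circuit can be sketched in software as follows. This is not the VHDL design of the test-bed; random binary sequences stand in for the 127-bit Gold sequences, and the channel gains are hypothetical values chosen for illustration.

```python
import random


def correlate_pm1(samples, pilot_bits):
    """Correlate received samples with a binary (+1/-1) pilot sequence.

    The pilot bit only selects between addition and subtraction, so no
    multiplier is needed inside the accumulation loop.
    """
    acc = 0.0
    for r, bit in zip(samples, pilot_bits):
        if bit:        # pilot chip +1: add the sample
            acc += r
        else:          # pilot chip -1: subtract the sample
            acc -= r
    return acc / len(pilot_bits)  # normalize by the sequence length


def chip(bit):
    return 1.0 if bit else -1.0


random.seed(1)
N = 127
# Stand-ins for two different pilot sequences (assumption: not real Gold codes).
pilot_a = [random.random() < 0.5 for _ in range(N)]
pilot_b = [random.random() < 0.5 for _ in range(N)]
h_a, h_b = 0.8, -0.3  # hypothetical real-valued channel gains

# Noise-free superposition of both Tx branches at one Rx branch.
rx = [h_a * chip(a) + h_b * chip(b) for a, b in zip(pilot_a, pilot_b)]

est_a = correlate_pm1(rx, pilot_a)
rho = sum(chip(a) * chip(b) for a, b in zip(pilot_a, pilot_b)) / N
print(f"estimate {est_a:.3f}, residual cross-correlation {rho:.3f}")
```

The estimate equals the wanted gain plus the gain of the other branch weighted by the normalized cross-correlation of the two sequences, which is why sequences with low cross-correlation are used for the pilots.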

MIMO detection
Two linear detection schemes, ZF and MMSE, were implemented in the Rx-FPGA as a matrix-vector multiplication unit to separate the spatially multiplexed data streams. Note that for a 4 × 5 MIMO system, this unit consumes 80 dedicated multipliers, which sets an upper limit to the number of antennas depending on the FPGA size (Virtex II, Virtex II Pro 70/100, etc.). If a matrix-vector multiplication of bigger size has to be performed, then, for example, a row-wise multiplication of $H^{\dagger} \cdot y$ can help to overcome the limited number of multiplier units, where $H^{\dagger}$ denotes the MMSE pseudoinverse of the channel and $y$ denotes the receive vector.
For nonlinear detection like SIC and V-BLAST, a decision feedback equalizer (DFE) structure (footnote 3) was implemented. The feedforward matrix GF uses the same matrix block as for the linear equalization. After each symbol decision, the decided symbols are fed back by a multiplication with a triangular feedback matrix B − I. The DFE design was implemented such that for the detection of one symbol vector, the DFE loop is passed several times until the last element of the symbol vector is detected. With 8 real-valued data streams, the maximum symbol rate of this DFE design is limited to 1 MSymbol/s, due to the 25 MHz FPGA system clock, which was the FPGA clock rate for the flat-fading design at the time of the implementation. In principle, this was sufficient for symbol rates up to 10 MHz due to the measured temporal dispersion in our lab. To support higher symbol rates with SIC, the DFE detection unit can be run at a higher system clock rate (100-150 MHz), or the structure can be set up in parallel at the cost of more multiplication units.
The DFE design in Figure 5 allows a fair comparison of several detection schemes by simply loading different matrices for the feedback and feedforward filters, for example, for ZF and MMSE, the feedback matrix B−I is loaded with zeros.
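The following sketch illustrates how one detection structure can serve both linear and nonlinear schemes. It assumes a zero-forcing DFE whose matrices are derived from a QR decomposition of a real-valued channel matrix, which is one common construction and not necessarily the one used on the test-bed; detection ordering as in V-BLAST is omitted, and all function names are ours.

```python
def gram_schmidt_qr(H):
    """Thin QR of a real m x n matrix (list of rows), modified Gram-Schmidt.
    Returns the columns of Q and the upper triangular R."""
    m, n = len(H), len(H[0])
    cols = [[H[i][j] for i in range(m)] for j in range(n)]
    q_cols, R = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            R[k][j] = sum(q * x for q, x in zip(q_cols[k], v))
            v = [x - R[k][j] * q for x, q in zip(v, q_cols[k])]
        R[j][j] = sum(x * x for x in v) ** 0.5
        q_cols.append([x / R[j][j] for x in v])
    return q_cols, R


def zf_dfe_matrices(H):
    """Feedforward GF = D^-1 Q^T and feedback B - I, where R = D B and
    B is upper triangular with unit diagonal."""
    q_cols, R = gram_schmidt_qr(H)
    n = len(R)
    GF = [[q / R[k][k] for q in q_cols[k]] for k in range(n)]
    B_minus_I = [[R[k][j] / R[k][k] if j > k else 0.0 for j in range(n)]
                 for k in range(n)]
    return GF, B_minus_I


def dfe_detect(GF, B_minus_I, y, decide):
    """Feedforward filtering followed by successive decisions with feedback.
    With an all-zero feedback matrix this reduces to linear detection."""
    z = [sum(g * v for g, v in zip(row, y)) for row in GF]
    n = len(z)
    s_hat = [0.0] * n
    for k in reversed(range(n)):
        feedback = sum(B_minus_I[k][j] * s_hat[j] for j in range(k + 1, n))
        s_hat[k] = decide(z[k] - feedback)
    return s_hat


def bpsk(v):
    return 1.0 if v >= 0 else -1.0


# Hypothetical real-valued 4 x 3 channel and a noise-free BPSK symbol vector.
H = [[1.0, 0.5, 0.2], [0.3, 1.1, -0.4], [-0.2, 0.6, 0.9], [0.4, -0.1, 0.7]]
x = [1.0, -1.0, 1.0]
y = [sum(h * s for h, s in zip(row, x)) for row in H]

GF, B_minus_I = zf_dfe_matrices(H)
detected = dfe_detect(GF, B_minus_I, y, bpsk)
print(detected)
```

In this formulation, switching between detection schemes only means loading different matrices into `GF` and `B_minus_I`; the detection loop itself stays unchanged.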

MIMO precoding
Several MIMO transmission schemes like SVD-MIMO or joint transmission/linear channel inversion require spatial precoding at the transmitter. The spatial precoding was implemented in the Tx-FPGA after the parallel PAM modulation block with a matrix multiplication unit similar to that of the Rx but using only 64 dedicated multipliers. At the time of the experiments, the matrix entries were calculated by the DSP as well and loaded via the 24-bit DSP-FPGA parallel bus. At the time of writing, the test-bed is equipped with reciprocal transceivers as proposed in [63], such that the spatial precoding can be calculated by the Tx independently, relying on a channel estimation in the opposite direction in TDD mode.

Demodulation
The separated streams are demodulated using hard decisions in each I- and Q-branch.
The temporal dispersion in the multipath indoor channel obviously sets the upper limit to the maximal symbol rate, which was 10 Msymbols/s in our lab. Using symbol rates of 5 Msymbols/s, this corresponds to an overall data rate of 40 Mbps with QPSK and 120 Mbps with 64-QAM modulation on all four Tx antennas (8 bps/Hz and 24 bps/Hz). Therefore, the current bandwidth extension to 100 MHz required multicarrier techniques (OFDM).
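The rates quoted above follow directly from symbol rate, number of streams, and bits per symbol. A short sketch, with function names of our own choosing and under the assumption that the occupied bandwidth equals the symbol rate:

```python
import math


def gross_rate_bps(symbol_rate, n_streams, qam_order):
    """Uncoded gross data rate of n_streams parallel M-QAM streams."""
    return symbol_rate * n_streams * math.log2(qam_order)


def spectral_efficiency(symbol_rate, n_streams, qam_order):
    """Bits/s/Hz, assuming the occupied bandwidth equals the symbol rate."""
    return gross_rate_bps(symbol_rate, n_streams, qam_order) / symbol_rate


for name, order in (("QPSK", 4), ("64-QAM", 64)):
    rate = gross_rate_bps(5e6, 4, order)
    eff = spectral_efficiency(5e6, 4, order)
    print(f"{name}: {rate / 1e6:.0f} Mbps, {eff:.0f} bps/Hz")
```

For QPSK and 64-QAM on four streams at 5 Msymbols/s, this gives 40 Mbps at 8 bps/Hz and 120 Mbps at 24 bps/Hz, matching the figures in the text.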
The signal processing itself can support even higher rates and more complex schemes like, for example, MIMO-OFDM which has been implemented on the reconfigurable signal processing platform, recently.

Bit error rate measurements
The BER measurement is performed automatically on all data streams by comparing the separated and demodulated signals at the Rx with the data coming from a PRBS data generator that is also programmed inside the Rx-FPGA. The error measurement is performed on both bit and frame level and can be logged to a file on the PC.

Synchronization
The synchronization between Tx and Rx was realized by two cables, one for the symbol clock and one for the frame clock.
Since the channel impulse response causes spikes with exponential decay when changing from symbol to symbol, the symbols are sampled at about 70% to 80% of their length. By this adjustment, a reliable channel measurement could be achieved up to symbol rates of 10 Msymbols/s. Synchronization over the air is currently being implemented for MIMO-OFDM but was not finalized at the time when the experiments were conducted with the single-carrier setup.

Channel tracking
With respect to higher mobility, it becomes critical to track the MIMO channel sufficiently fast. The most challenging part is the weight calculation when there are a few dozen OFDM carriers and a weight matrix has to be calculated for each of them. Appropriate algorithms for the implementation on a DSP are discussed in Section 4.6. If those weights are available within one or a few milliseconds (footnote 4), channel tracking is expected to be fast enough for indoor and pedestrian applications. For higher mobility, channel tracking within each frame becomes mandatory.

Bit loading or rate control
Bit loading is calculated at the Rx. The DSP calculates the currently feasible PAM constellation based on the expected noise enhancement after the MIMO detector. This is equivalent to the SINR in front of the demodulator. Here, the I/Q imbalances cause different noise enhancement in I and Q (see also Figure 14). Therefore, we control the modulation independently for the I- and Q-part of each symbol by using PAM instead of M-QAM. This higher channel adaptivity translates directly into a higher throughput and link reliability.
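A minimal sketch of such a per-branch loading rule is given below. It is not the algorithm running on the test-bed DSP: it assumes Gaussian noise after the detector, uses the textbook symbol error rate of M-PAM as the criterion instead of a BER target, and the target value of 1e-3 is a hypothetical choice.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def pam_symbol_error_rate(m_levels, sinr_linear):
    """Symbol error rate of M-PAM; sinr_linear is the mean symbol energy
    divided by the real-valued noise variance."""
    arg = math.sqrt(3.0 * sinr_linear / (m_levels ** 2 - 1))
    return 2.0 * (1.0 - 1.0 / m_levels) * q_function(arg)


def choose_pam_levels(sinr_db, target_ser=1e-3, candidates=(16, 8, 4, 2)):
    """Largest PAM constellation that meets the target error rate.
    Returns 0 if even 2-PAM fails, i.e., the branch is switched off."""
    sinr = 10.0 ** (sinr_db / 10.0)
    for m in candidates:
        if pam_symbol_error_rate(m, sinr) <= target_ser:
            return m
    return 0


# Hypothetical post-detection SINRs (dB) of the 8 real-valued branches
# (I and Q of four streams); I and Q of one stream may differ.
branch_sinr_db = [30.0, 27.0, 20.0, 18.0, 10.0, 12.0, 5.0, 3.0]
loading_vector = [choose_pam_levels(s) for s in branch_sinr_db]
print(loading_vector)
```

Because every I and Q branch is loaded separately, a stream can carry, for example, 16-PAM in I and 8-PAM in Q, and branches with poor SINR are left unused rather than transmitted with a high error rate.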

Feedback link
Based on the channel estimates, the DSP may calculate the optimal modulation in each stream. Note that the test-bed is currently operational only in simplex mode. So the loading vector is sent back to the Tx-FPGA via a parallel bus, thus realizing an ideal feedback link.

Basic algorithmic strategies for real-time multiantenna systems with high data rates
With a view to real-time capable algorithm implementation for very high data rates, the complexity of algorithms often becomes a limiting factor. Therefore, it is reasonable to search for solutions which have a high performance and match the capability of dedicated hardware. The hybrid FPGA/DSP architecture of the test-bed gives a high flexibility over the algorithms used for data stream separation at the Tx and/or the Rx and for rate and power control. Those algorithms are run on the DSP while the fixed part (e.g., channel estimation, data separation, mod/demod, BER) is performed by the FPGA. The DSP works fully asynchronously and refreshes, for example, the necessary MMSE weights and/or the bit-loading vector at the Tx-FPGA within a millisecond or less.
Following this divide-and-rule strategy, we are able to support high data rates in a MIMO transmission and still have the flexibility towards algorithms.
To realize this ambitious approach, we implemented the high-speed matrix-vector multiplications for the reconstruction of the data streams in VHDL on the FPGA, while the DSP performs the calculation of the required matrices. The complexity which can be implemented in the FPGA is mainly limited by the number of dedicated multipliers, RAM, and so forth, and particularly by the maximum clock rate at which the design can be routed within the required delay limits. The more resources of the FPGA are used (70% or more), the more difficult the place & route procedure becomes. The limiting factor for high-speed signal processing in the FPGA is determined by the ADC, DAC, and FFT/IFFT blocks (e.g., OFDM), which run at the highest clock rates. In practice, these clock rates are limited to 150-200 MHz (Virtex II Pro 100), which in turn limits the signal bandwidth usable for transmission. This means that for high data rates of several 100 Mbps to 1 Gbps or more, higher modulation levels and spatial multiplexing are a necessity.
A recent FPGA implementation of MIMO-OFDM at a clock rate of 100 MHz [64] has allowed a reliable low-mobility transmission with a gross data rate of 1 Gbps with 3 Tx and 5 Rx antennas using 48 active OFDM carriers and 100 MHz bandwidth at 5.2 GHz.
If the data transfer on the parallel bus between DSP and FPGA is optimized, then the calculation of the detection matrices itself can become the most time-consuming part. The received signals of the current MIMO-OFDM system with 3 Tx and 5 Rx antennas and 48 carriers are again treated as real-valued in our implementation. Therefore, the DSP calculates 48 MMSE solutions where each matrix has size 10 × 6. If we remember that matrix inversions have roughly a complexity ∼ N^3 for square matrices, it becomes clear that the optimization of DSP code is crucial. If the number of subcarriers is high (256 or 1024), we will use DSP clusters which can work in parallel to perform the calculation task still within the channel coherence time. In many transmission scenarios, the channel has only a few taps (10 or less); hence, theoretically, assuming perfect channel knowledge, the same number of subcarriers would be sufficient to equalize the channel. But for reasons of spectral efficiency in OFDM, many more subcarriers are often used, which then carry redundant information. This redundancy can be exploited to reduce the MIMO signal processing significantly. A promising approach is the calculation of an exact solution (e.g., ZF-pseudoinverse as proposed by [65]) on (L − 1)(N_T − 1) + 1 subcarriers only and to interpolate the filter solutions in between (footnote 5). If this is done in an appropriate trigonometrical fashion [66], the interpolated filter matrices can reconstruct the multiplexed data streams with high accuracy. The savings in time for the calculation of the MMSE solutions have to be traded carefully against the additional effort for the interpolation.
MIMO transmission schemes require specific algebraic procedures to be performed in order to precode or decode the data appropriately. Some useful algorithms are discussed in the following paragraphs. Most of them were implemented on the DSP in C language and used for the calculation of the MIMO filter matrices in the transmission experiments.

DSP-architecture and optimization
One of the initial decisions which has to be taken is between floating-point and fixed-point arithmetic. Fixed-point DSPs are offered on the market at much higher clock rates (e.g., 1 GHz) than floating-point DSPs (300 Mhz), so one might say let us take the faster one. But this is only true if all calculations are performed in the integer domain and the dynamic range is fixed and well known. If floating types like float or double are used, the mapping to integer numbers is performed automatically by the compiler. A simple test showed that, for example, a matrix inversion on a 16-bit fixed-point TI-DSP (1 GHz) performs slower than the 300 MHz 32-bit floating-point DSP (TI6713) by a factor of 10. A way out is to optimize the mapping by hand using additional knowledge about the dynamic range, and so forth. A major drawback of this approach is that hand-optimized program code is hard to read and therefore very error-prone and not very flexible to code changes, not to mention a lot of overhead may occur when different people are contributing to the same algorithm library without necessarily knowing all details on dynamic range of the possible input and output values. Furthermore, assembly code optimization is more difficult on a fixed point target.
Therefore, we chose the floating-point architecture (TI6713) with 225 MHz for the test-bed to have as much algorithmic flexibility as possible.
Reference [67] investigated several MIMO algorithms in great detail regarding general C-code and assembly optimization. We will limit ourselves to the performance results in Section 4.6.

Footnote 5: The classical approach of interpolation of the frequency channel estimates by a transfer into time domain, appropriate windowing, and a back transformation to the required number of subcarriers in the frequency domain improves the accuracy of the channel estimation but does not help to reduce the calculation effort at all. Note that the filter envelopes of analogue or digital filters which are used for image band suppression have to be measured carefully before interpolation techniques can be exploited. This is important in particular when more than 80% of the OFDM subcarriers are used, which can be done with channel adaptive bit loading.

Matrix inversion and decompositions
Many MIMO precoding and reception techniques are based on matrix-vector multiplications, either in a linear sense or in a nonlinear sense, which means repeated matrix-vector operations with decisions in between. The required matrices are mostly obtained by matrix decompositions or matrix inversions, so we will focus on those very important algebraic algorithms. Since real-time capability is mandatory for high-data-rate MIMO applications, speed and numerical stability are of great importance. Another aspect is fixed or variable computational time, since in many applications it is not the average computation time which matters but the worst-case time. Therefore, a fixed computation time is desirable and often easier to optimize.

The inverse of a matrix and the pseudoinverse
By definition, the inverse of a matrix only exists for matrices with the same number of rows and columns. Let $A$ be a matrix of size $m_R \times n_T$ with $m_R = n_T$. Then we define $A^{-1}$ as the inverse of the matrix $A$ if it holds that
$$A^{-1} A = I_{n_T}, \tag{1}$$
where $I_{n_T}$ is the identity matrix of size $n_T \times n_T$. If $A$ is of rectangular shape $m_R \times n_T$ with $m_R > n_T$, then an inverse is not defined. Therefore, a so-called pseudoinverse has to be computed instead:
$$A^{\dagger} = (A^H A)^{-1} A^H, \tag{2}$$
where $(A^H A)^{-1}$ has square shape and standard algorithms for matrix inversion are applicable. $A^{\dagger}$ then satisfies $I_{n_T} = A^{\dagger} A$, similar to (1). When using $(\cdot)^{\dagger}$ in the following, we will refer to the Moore-Penrose pseudoinverse, which causes the lowest noise enhancement when multiplied with the receive vector.
In multiple-antenna systems, the signals coming from all Tx antennas are superimposed at the Rx antennas. For the separation of these signals, a linear filter can be used, for example. A simple realization can be achieved with a zero-forcing (ZF) filter, while the minimum mean-square error (MMSE) filter is more complex but takes the noise at the Rx into account and outperforms ZF regarding the BER, especially in the low SNR region. Both solutions require one matrix inversion each.
A linear equalization at the Rx corresponds to a multiplication of the receive vector $y$ with a matrix $H^{\dagger}$. The transmitted data can then be estimated as
$$\hat{x} = H^{\dagger} y, \tag{3}$$
where the ZF-pseudoinverse of $H$ for $m_R \geq n_T$ is
$$H^{\dagger} = (H^H H)^{-1} H^H, \tag{4}$$
or, if we additionally consider the receiver noise, the corresponding MMSE filter reads
$$H^{\dagger}_{\mathrm{MMSE}} = H^H (H H^H + \sigma_N^2 I_{m_R})^{-1}, \tag{5}$$
where the noise variance $\sigma_N^2$ is assumed to be the same for all receivers for a more convenient notation. Note that in general we have to expect different noise variances for each receiver if, for example, independent automatic gain controls are used.
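As a minimal sketch of the two linear filters, the following example builds a ZF and an MMSE filter for a small real-valued channel with 3 receive and 2 transmit dimensions. It is an illustration under stated assumptions, not the test-bed implementation: the function names and channel values are hypothetical, the MMSE filter is written in the algebraically equivalent form $(H^T H + \sigma_N^2 I)^{-1} H^T$ so that only a 2 × 2 matrix has to be inverted, and the 2 × 2 inverse uses the closed-form adjugate expression.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def inv2(M):
    """Closed-form inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    recip_det = 1.0 / (a * d - b * c)
    return [[d * recip_det, -b * recip_det], [-c * recip_det, a * recip_det]]


def linear_filter(H, noise_var=0.0):
    """Return (H^T H + noise_var * I)^-1 H^T for a real H with two columns.
    noise_var = 0 gives the ZF pseudoinverse, noise_var > 0 the MMSE filter."""
    Ht = transpose(H)
    gram = matmul(Ht, H)
    for i in range(2):
        gram[i][i] += noise_var
    return matmul(inv2(gram), Ht)


# Hypothetical real-valued channel: 3 receive dimensions, 2 transmit dimensions.
H = [[1.0, 0.5], [0.2, 1.0], [0.3, -0.4]]
x = [1.0, -1.0]
y = matvec(H, x)                       # noise-free receive vector

x_zf = matvec(linear_filter(H), y)                     # recovers x exactly
x_mmse = matvec(linear_filter(H, noise_var=0.1), y)    # slightly shrunk estimate
print(x_zf, x_mmse)
```

Without noise, the ZF filter reconstructs the transmit vector, while the MMSE filter trades a small bias for reduced noise enhancement.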

Calculation of the inverse/pseudoinverse
One straightforward approach to implement the calculation of the inverse and/or pseudoinverse is using Greville's method [68]. This algorithm provides full flexibility in the number of Tx and Rx antennas and even some columns or rows can contain zero vectors.
While the ZF filter from (4) can be calculated directly from $H$ instead of inverting $H^H H$, the MMSE filter from (5) requires two extra matrix multiplications and the inversion of $(H H^H + \sigma_N^2 I)$, which is of size $m_R \times m_R$. Keeping in mind that the computational effort of multiplications and inversions increases by $\sim N^3$ with $N = \max(n_T, m_R)$, we can choose a dimension-reduced formulation of the MMSE filter for the implementation:
$$H^{\dagger}_{\mathrm{MMSE}} = (H^H H + \sigma_N^2 I_{n_T})^{-1} H^H, \tag{6}$$
where $\sigma_N^2$ is now the equivalent noise variance per data stream. Furthermore, the range of the data is an important issue in conjunction with algorithms to calculate a pseudoinverse, since a calculation of $H^H H$ doubles the binary range from, for example, 12 bits to 24 bits, which can decrease the algorithmic stability. In other words, the condition number (see footnote 6) of the matrix to be inverted is squared when $H^H H$ is inverted instead of $H$. This range extension is not required when Greville's method is used, so this may be an algorithm of choice for a fixed-point implementation.
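The three numerical arguments of this paragraph (cubic cost, doubled word length, squared condition number) can be checked with a few lines of arithmetic. This is only a sketch: the diagonal example matrix, the unsigned 12-bit magnitude, and the variable names are assumptions chosen to keep the numbers exact.

```python
def condition_number(singular_values):
    """Ratio of the largest to the smallest singular value."""
    return max(singular_values) / min(singular_values)


# For a diagonal H = diag(4, 0.5), the singular values are the diagonal entries,
# and the singular values of H^H H are their squares.
sv_h = [4.0, 0.5]
cond_h = condition_number(sv_h)
cond_gram = condition_number([s * s for s in sv_h])

# Product of two 12-bit magnitudes needs 24 bits.
product_bits = (4095 * 4095).bit_length()

# Cubic cost: inverting a 10 x 10 matrix versus a 6 x 6 matrix
# (real-valued 3 Tx / 5 Rx case).
cost_ratio = 10 ** 3 / 6 ** 3

print(cond_h, cond_gram, product_bits, round(cost_ratio, 2))
```

For the real-valued 3 Tx and 5 Rx configuration, the dimension-reduced form therefore inverts a 6 × 6 instead of a 10 × 10 matrix, which is roughly a factor of 4.6 fewer operations under the $N^3$ rule of thumb.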
Another algorithm which can be used is based on a modification of the Frobenius formula [68], where the calculation of a pseudoinverse can be performed by the calculation of pseudoinverses of submatrices:
$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} K^{-1} & -K^{-1} B D^{-1} \\ -D^{-1} C K^{-1} & D^{-1} + D^{-1} C K^{-1} B D^{-1} \end{pmatrix}, \tag{7}$$
where $K = A - B D^{-1} C$. If the submatrices of the Frobenius decomposition are regular and of square shape (e.g., $A$), then the inversion can be performed by calculating the elements of the inverse matrix $A^{-1}$ directly with Cramer's rule
$$A^{-1} = \frac{\operatorname{adj}(A)}{\det(A)}. \tag{8}$$
The implementation of (8) is quite straightforward up to a matrix size of 4×4 real values. For instance, if the matrix $H$ is of size 6×6 or 8×8, then a decomposition into 3×3 or 4×4 submatrices is advised, respectively. Note that the calculation of a matrix inverse with Cramer's rule (8) is not advised with regard to numerical stability due to the determinant in the denominator.

Footnote 6: The condition number is used here as the ratio of the biggest and the smallest singular value of a matrix.
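The block-wise inversion can be sketched for a 4 × 4 real-valued matrix split into 2 × 2 submatrices, each inverted in closed form with Cramer's rule. The sketch assumes the standard block-inverse identity with $K = A - B D^{-1} C$ and regular blocks $D$ and $K$; all function names and the example matrix are hypothetical.

```python
def mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def neg(A):
    return [[-a for a in row] for row in A]


def inv2(M):
    """Cramer's rule for a 2 x 2 matrix: adjugate times reciprocal determinant."""
    (a, b), (c, d) = M
    recip_det = 1.0 / (a * d - b * c)
    return [[d * recip_det, -b * recip_det], [-c * recip_det, a * recip_det]]


def block_inverse(M):
    """Invert a 4 x 4 matrix via its 2 x 2 blocks (D and K must be regular)."""
    A = [row[:2] for row in M[:2]]
    B = [row[2:] for row in M[:2]]
    C = [row[:2] for row in M[2:]]
    D = [row[2:] for row in M[2:]]
    Di = inv2(D)
    Ki = inv2(sub(A, mul(mul(B, Di), C)))          # K = A - B D^-1 C
    top_right = neg(mul(mul(Ki, B), Di))
    bottom_left = neg(mul(mul(Di, C), Ki))
    bottom_right = add(Di, mul(mul(mul(Di, C), Ki), mul(B, Di)))
    top = [r1 + r2 for r1, r2 in zip(Ki, top_right)]
    bottom = [r1 + r2 for r1, r2 in zip(bottom_left, bottom_right)]
    return top + bottom


# Hypothetical, diagonally dominant example matrix.
M = [[4.0, 1.0, 0.0, 2.0],
     [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 5.0, 1.0],
     [2.0, 0.0, 1.0, 4.0]]
M_inv = block_inverse(M)
identity_check = mul(M, M_inv)
```

Only two reciprocals are needed here (one per 2 × 2 determinant); everything else is multiplications and additions, which fits the operation-counting convention used for the DSP.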
For the special case of the inversion of a square matrix with full rank, which is true for the MMSE solution with nonzero noise in (5) and (6), there is another option to obtain a matrix inverse. Following the outline of [69], Gauss-Jordan elimination has the advantage of a high numerical stability, especially when full pivoting is used. Furthermore, the structure of the algorithm allows a very efficient manual optimization of the C-code.
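A compact sketch of Gauss-Jordan elimination with full pivoting is given below. It is written for readability and is not the optimized C-code of the test-bed; the function name and the bookkeeping of column swaps through a permutation list are choices made for this illustration.

```python
def gauss_jordan_inverse(A):
    """Invert a square matrix by Gauss-Jordan elimination with full pivoting."""
    n = len(A)
    # Augmented matrix [A | I].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(A)]
    perm = list(range(n))   # perm[k]: original column now at position k
    for k in range(n):
        # Full pivoting: largest magnitude in the remaining submatrix.
        p, q = max(((i, j) for i in range(k, n) for j in range(k, n)),
                   key=lambda ij: abs(aug[ij[0]][ij[1]]))
        if aug[p][q] == 0.0:
            raise ValueError("matrix is singular")
        aug[k], aug[p] = aug[p], aug[k]             # row swap
        for row in aug:                             # column swap (left part only)
            row[k], row[q] = row[q], row[k]
        perm[k], perm[q] = perm[q], perm[k]
        recip = 1.0 / aug[k][k]                     # one reciprocal per pivot
        aug[k] = [v * recip for v in aug[k]]
        for i in range(n):
            if i != k and aug[i][k] != 0.0:
                f = aug[i][k]
                aug[i] = [vi - f * vk for vi, vk in zip(aug[i], aug[k])]
    # Undo the column swaps: row k of the result belongs to original index perm[k].
    inverse = [None] * n
    for k in range(n):
        inverse[perm[k]] = aug[k][n:]
    return inverse


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A3 = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
A3_inv = gauss_jordan_inverse(A3)
```

The loop structure has a fixed number of iterations for a given matrix size, which matches the preference for a fixed computation time stated earlier.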
Besides the three given examples, many more algorithms were optimized, implemented, and evaluated with regard to numerical stability and speed. A short overview including QR and QL decomposition is given in Figure 6.

Performance analysis
To evaluate and compare algorithms, we have to characterize the complexity, that is, the required computational effort. Very often the measure is given in flops (floating-point operations), where the definitions vary among different authors. Instead, we will compare all algorithms by the number of required multiplications. Since additions mostly occur in pairs with multiplications, we only have to count the latter.
Reciprocal values ($1/X$), square roots ($\sqrt{X}$), and reciprocal square roots ($1/\sqrt{X}$) are counted separately, since their computation needs more cycles on the DSP. In the algorithmic optimization process, the minimization of those operations has a high priority. Unavoidable divisions will always be replaced by reciprocal values. All algorithms are used on matrices of size $m \times n$, and
$$\left(mn^3 + n^2,\; n\,\tfrac{1}{X},\; n\,\tfrac{1}{\sqrt{X}}\right) \tag{9}$$
denotes an algorithm consisting of $mn^3 + n^2$ multiplications (additions), $n$ reciprocal values, and $n$ reciprocal roots. In Table 1, the complexity of several algorithms is summarized.
Figure 7 illustrates a complexity comparison of typical linear (Figures 7(a), 7(b)) and nonlinear (Figures 7(c), 7(d)) MIMO algorithms based on real multiplications. It can clearly be seen that complex calculations (see footnote 7) (Figures 7(b), 7(d)) reduce the complexity significantly but can only be exploited when the I/Q-imbalance is negligible. On the other hand, real-valued SIC detection offers exploitable performance gains even without I/Q-imbalance, as shown in [70].
In Figure 7 [...] outperformed by the QRD pre- and postsort approach (bullets) proposed by [71] only for large numbers of antennas $N \geq 10$ when a complex calculation would be performed. For the real-valued signal processing, a comparable complexity is achieved at about 6 Tx and Rx antennas. The computational gain is thus rather to be seen in the fact that the postsorting algorithm has to be run only when the detection order has to be tracked permanently, for example, with fixed-rate transmission. In case of adaptive bit loading, the detection order is computed only once for every bit-loading procedure and is then held fixed until the next bit loading; hence most of the time QRD is sufficient for tracking the channel. Therefore, the additional expense for the occasional V-BLAST ordering is less of a burden on the time budget.
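Why the complex-valued calculation needs fewer real multiplications can be seen from a naive count for a matrix-matrix product. The sketch below assumes schoolbook multiplication and 4 real multiplications per complex multiplication; both the function names and this counting model are assumptions for illustration, not the counts of Table 1.

```python
def real_mults_complex_product(n):
    """N x N complex product: N^3 complex multiplications, 4 real ones each."""
    return 4 * n ** 3


def real_mults_real_valued_product(n):
    """Same product in the stacked real-valued form of size 2N x 2N."""
    return (2 * n) ** 3


for n in (2, 4, 8):
    print(n, real_mults_complex_product(n), real_mults_real_valued_product(n))
```

Under this simple model the real-valued form costs twice as many multiplications, which is the price paid for being able to treat I and Q branches independently.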
So by carefully counting all necessary operations, a principal performance prediction, for example, for rising matrix size, can be given. An implementation of the algorithms on a DSP might give different results, since every dedicated DSP architecture supports some algorithmic structures better than others. Therefore, the experienced programmer matches the algorithm implementation to the computational strengths of a specific DSP type. Still, limitations like a certain number of possible parallel assembly instructions or a limited cache size can have the effect that even slight changes in the code (e.g., loop length or matrix size) change the number of required cycles significantly. Figure 8 shows the algorithm speed implemented on the TI6713 DSP for a single-carrier system. [...] Figure 8(d) show the performance of some algorithms used for nonlinear detection. All algorithms are performed with real-valued calculation. For a 48-subcarrier OFDM, the run time exceeds the 1-millisecond (indoor environment) level already for small numbers of antennas ($N < 6$), even for the linear schemes. This shows that further acceleration including assembly programming, multiple DSPs, and/or interpolation techniques is inevitable.
The black square in Figure 8(a) and Figure 8(c) depicts the performance which was achieved with an exemplary assembly code optimization for 2 Tx and 2 Rx antennas (4 × 4 real-valued matrix). This measurement together with an assembly design for an 8×8 real-valued matrix was used to predict the assembler performance for some MIMO algorithms. The estimated run-times (in microseconds) for an OFDM system with 48 subcarriers are collected in Table 2.
Assuming an OFDM frame length of 2 milliseconds, which is adapted to a nomadic indoor environment with small- and medium-sized office rooms, we define 1 millisecond to be the critical computational time which should not be exceeded in order to guarantee that the next frame can be detected with a new filter based on the channel estimation in the current frame. We can expect that for square antenna configurations ZF filters with up to 8 × 8 antennas and MMSE-pseudoinverses with up to 5 × 5 antennas can be calculated with an optimized assembler implementation in one DSP. Nonlinear detection seems to be feasible with up to 6 × 6 antennas without optimum ordering. If additionally a V-BLAST ordering is required for every filter, then the matrix size is limited to a 4 × 4 antenna configuration.
The MIMO-OFDM configurations with higher antenna numbers can be supported with one TI6713 DSP only when the channel coherence time is much longer (quasistatic scenarios) or alternatively a DSP cluster must be used to partition the calculation effort subcarrier-wise and work in parallel.
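A rough sizing of such a DSP cluster can be sketched from the ZF-I-LUD run-time estimates for 48 subcarriers quoted in Table 2. The sketch assumes that the run time scales linearly with the number of subcarriers and that the work can be split across DSPs without overhead; both are simplifying assumptions, and the function name is hypothetical.

```python
import math

# Estimated ZF-I-LUD run times in microseconds for 48 subcarriers (Table 2),
# indexed by the number of antennas n_T = m_R.
ZF_LUD_US = {2: 25, 3: 48, 4: 86, 5: 140, 6: 220, 8: 460}
BUDGET_US = 1000          # critical computational time of 1 millisecond


def dsps_needed(n_antennas, n_subcarriers):
    """Number of DSPs so that each one stays within the 1 ms budget,
    assuming linear scaling in the subcarrier count and ideal partitioning."""
    scaled_us = ZF_LUD_US[n_antennas] * n_subcarriers / 48
    return math.ceil(scaled_us / BUDGET_US)


for n_sub in (48, 256, 1024):
    print(n_sub, dsps_needed(8, n_sub))
```

Under these assumptions, an 8 × 8 ZF solution fits on one DSP for 48 subcarriers, while 256 and 1024 subcarriers would require about 3 and 10 DSPs, respectively.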

Transmit and receive configurations
Thanks to the reconfigurability of the test-bed, we could run a wide range of transmission schemes on the platform by simply calculating different solutions for the transmit precoding and/or the receive decoding in the DSP and loading the matrices into the Tx- and the Rx-FPGA. So, the flexible algorithmic part is performed by the DSP, while the FPGAs always perform the same straightforward matrix-vector multiplications with the solutions currently loaded from the DSP.
Table 3 helps to bring more transparency into all possible transmit and receive configurations. The table has to be read in the following way. The first column gives the transmission scheme under investigation and the corresponding uplink (UP) or downlink (DL) scenario to which it can be applied. The next two columns contain the matrices which are loaded into the Tx- and the Rx-FPGA. The column modulation contains the modulation levels which are assigned, for example, per antenna, per data stream, and so forth. The last column contains the parameter for the bit loading, which is specific to each scheme. This parameter represents the expected noise enhancement or SINR in front of the decision unit, which is used for the bit allocation. The scaling parameter α used for the adaptive channel inversion (ACI) is necessary to limit the transmitted signals to the 12-bit DAC range.

Table 2
Number of antennas n_T = m_R    2     3     4     5     6     8
ZF-I-LUD                        25    48    86    140   220   460
ZF-I-GJ                         36    88    180   330   550

Table 3 Transmission

Adaptive transmission schemes-flat fading
The transmission schemes summarized in Table 3 were implemented on the MIMO test-bed with a single carrier at 5.2 GHz, data symbol rates from 1 Msymbol/s to 10 Msymbols/s, and adaptive modulation from 2-PAM to 16-PAM, where 16-PAM corresponds to 256-QAM as the highest modulation scheme. The detailed experimental results are published in [18,49,55,72]. Besides one extra antenna at the Rx, adaptive bit loading was essential to make the MIMO link much more stable and reliable, since transmission over bad channels was avoided. It was found during the experiments that channel tracking and bit loading or multiuser scheduling can be performed at different time scales, since a change in the channel first causes phase and amplitude changes while the SINR behind a MIMO detector changes much more slowly. Keeping in mind that switching from one QAM level to the next or back requires about 6 dB more or less SINR, it can be easily understood that bit loading can be run on another time scale. During all our measurements, the Rx antenna set was moving at 4 cm/s along a 5-meter long railway-like construction, so channel tracking within one millisecond was sufficient, while bit loading could be done about every 100 milliseconds without losing throughput or violating the average BER target.
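The two time scales can be made plausible with a short calculation. The sketch below relates the antenna displacement per update interval to the carrier wavelength and evaluates the SNR step between neighboring PAM levels using the textbook relation that the average energy of M-PAM at fixed minimum distance is proportional to $M^2 - 1$. The function names are hypothetical and the calculation is an illustration, not part of the measurement procedure.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0   # m/s


def wavelength_m(carrier_hz):
    return SPEED_OF_LIGHT / carrier_hz


def displacement_in_wavelengths(speed_m_s, interval_s, carrier_hz):
    """Distance moved during one update interval, in units of the wavelength."""
    return speed_m_s * interval_s / wavelength_m(carrier_hz)


def pam_snr_step_db(m):
    """Extra SNR needed to go from M-PAM to 2M-PAM at equal minimum distance."""
    return 10 * math.log10(((2 * m) ** 2 - 1) / (m ** 2 - 1))


tracking = displacement_in_wavelengths(0.04, 1e-3, 5.2e9)      # channel tracking
bit_loading = displacement_in_wavelengths(0.04, 100e-3, 5.2e9)  # bit loading
print(round(wavelength_m(5.2e9) * 100, 2), "cm")
print(tracking, bit_loading)
print([round(pam_snr_step_db(m), 2) for m in (2, 4, 8)])
```

At 4 cm/s the antennas move less than a thousandth of a wavelength per millisecond and less than a tenth of a wavelength per 100 milliseconds, and each additional bit per PAM dimension costs between about 6 and 7 dB.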
The reproducibility of channel realizations, achieved by always moving the Rx antennas along the same path through the room, was a key issue for comparing various transmission and detection schemes. As discussed in [49], the measured channel statistics in the laboratory, seen from the pdf of the singular values, behave very similarly to an i.i.d. Rayleigh channel with a slight Rician component. Furthermore, the deteriorating effect of I/Q imbalances was reported to be seen in a split-up of the singular values, which should be pairwise degenerate otherwise [49]. This also underlines that real-valued baseband signal processing is a good option with direct analogue up- and downconversion as used in the test-bed.
Due to the similarity of the channel in our lab to an i.i.d. Rayleigh channel, we could measure the MIMO diversity slopes (dotted lines) in Figure 9 in very good accordance with what was expected from theory under the assumption of uncoded fixed-modulation transmission and a linear detector. The average SNR per Rx antenna was calculated indirectly from the measured channel along the track.
Throughput experiments with several MIMO transmission schemes combined with channel adaptive bit loading as described in [49] were conducted. The results are summarized in Figures 10 and 11. Figure 10 shows the measured sum throughput at a BER $\leq 10^{-2}$ for three transmission schemes: SVD-MIMO [...] two schemes by a higher throughput even at high SNR values. Note that here we would expect from theory a similar throughput performance for SVD-MIMO and MMSE-SIC, which is known to be capacity achieving as well [40]. A certain modulation and coding should only shift the capacity curve on the SNR axis, also known as the SNR gap. The observed difference at high SNR can only be explained by error propagation, which can become significant due to the uncoded transmission.
Since we perform adaptive bit loading in such a manner that all layers meet a certain BER target, we have to consider the effect of error propagation in the bit-loading algorithm. The weaker the BER decay (diversity slope), the more extra transmit power is necessary to fulfill the target. As an example, let us assume a BER target of $10^{-3}$ for all layers. Since all layers including the last layer must meet this BER target, we have to set the BER target for each layer lower such that, including error propagation, we will satisfy the targeted BER. Assuming 4 Tx and 5 Rx antennas and a multiplexing of 4 data streams, we can expect a BER diversity order $\sim \mathrm{SNR}^{-2}$. If we had 100% error propagation, then as a rule of thumb the last layer would suffer from 3/4 of possibly propagated errors and 1/4 of its own decision errors, meaning that we should set the target BER to $1/4 \cdot 10^{-3}$. At the given diversity slope, this corresponds to an SNR loss of approximately 3-4 dB, which is comparable to the measurements. This SNR loss is expected to increase to about 6-8 dB with 4 Tx and 4 Rx antennas.
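The rule of thumb can be reproduced numerically. If the BER decays as $\mathrm{SNR}^{-d}$, lowering the BER target by a factor $k$ costs $10 \log_{10}(k) / d$ dB. The sketch below assumes the diversity order $d = m_R - n_T + 1$ of a linear detector, which is consistent with the $\mathrm{SNR}^{-2}$ slope quoted for 4 Tx and 5 Rx antennas; the function names are hypothetical.

```python
import math


def diversity_order(n_tx, n_rx):
    """Diversity order of a linear detector (assumed: n_rx - n_tx + 1)."""
    return n_rx - n_tx + 1


def snr_loss_db(ber_reduction_factor, d):
    """Extra SNR in dB needed to lower the BER by the given factor
    when the BER decays as SNR^-d."""
    return 10 * math.log10(ber_reduction_factor) / d


loss_4x5 = snr_loss_db(4, diversity_order(4, 5))   # BER target 1/4 * 10^-3
loss_4x4 = snr_loss_db(4, diversity_order(4, 4))
print(round(loss_4x5, 2), round(loss_4x4, 2))
```

The result of about 3 dB for 4 × 5 and about 6 dB for 4 × 4 antennas matches the lower ends of the ranges given in the text.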
Generally, this means that the SNR loss against the waterfilling or SVD-MIMO scheme increases with the number of layers/transmit antennas and decreases with the number of extra receive antennas/degree of receive diversity. Furthermore, the correlation of the data streams influences the error propagation, for example, orthogonal transmit channel vectors do not propagate errors from one detection layer to another. So in reality the SNR margin has to be found by averaging over a statistical ensemble of channels and can later be adapted automatically if the channel entanglement is changing in different deployments. Furthermore, the SNR gap can be closed by introducing FEC on each layer, but at the cost of increased buffer size and processing delay which can be significant for long block length.
At low SNR, SVD-MIMO achieves a tremendous relative gain compared to MMSE and MMSE-SIC. This high throughput advantage can be explained by the fact that with SVD one data stream is coupled into one eigenmode of the channel. The other two schemes couple each data stream into all eigenmodes depending on the actual channel realization, which means on average 1/4 of each data stream. At very low SNR, when only one complex stream is transmitted in all schemes, MMSE and SIC transmit only 1/4 of their one and only stream over the best eigenmode. On average, this should result in a disadvantage of about 6 dB on the SNR scale, which is roughly the measured value at low SNR.
The dashed lines in Figure 10 show the behavior when the maximum modulation level is limited to 8-PAM or 64-QAM, respectively. The cutoff rate is already approached within our measurement range, which shows that the maximum achievable slope of the average throughput, that is, the maximum achieved spatial multiplexing gain, is determined by the cutoff rate due to the limited modulation levels.
With an M-ary QAM level of 1024 (if implementable in multiantenna schemes), a smaller gap between theory and practice with regard to the spatial multiplexing gain might be achievable. Other groups, for example, [23], showed the feasibility of high modulation schemes (512 cross-QAM) in combination with coding. Figure 11 shows the empirical cumulative distribution function of the measured sum throughput at the highest possible SNR point. We see that the fitted curve is steepest for SVD-MIMO and has the longest tail at low rates for the linear MMSE. This is in good accordance with capacity simulations from the measured channels. Especially at low outage probabilities, the three schemes differ strongly in throughput. For example, at an outage of 0.01 we find 11 bps/Hz for MMSE, 17 bps/Hz for MMSE-SIC, and 21 bps/Hz for SVD-MIMO. Those results are comparable with the spectral efficiencies achieved by [23].

MIMO-OFDM for frequency-selective channels
The extension of the well-studied flat-fading algorithms towards frequency-selective channels offers equalization of the MIMO channel in the time or frequency domain. For reasons of simplicity, a frequency-domain equalization with OFDM was implemented for a 2 × 3 MIMO system as a first step. 48 out of 64 subcarriers were used for data transmission, compliant with 802.11g plus an additional C-preamble for the estimation of the MIMO channel, which was described in [57]. For a 20 MHz bandwidth version, the OFDM parameters were the following: center frequency: 5.2 GHz, frame length: 2 milliseconds, symbol length: 4 microseconds, guard interval: 800 nanoseconds, training sequence length: 64 OFDM symbols maximum.
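The listed OFDM parameters are mutually consistent, which can be checked with a short calculation. The sketch assumes a 64-point FFT sampled at 20 MHz, which is suggested by the 64 subcarriers and the 20 MHz bandwidth but not stated explicitly; the constant names are chosen for this illustration.

```python
FFT_SIZE = 64
BANDWIDTH_HZ = 20_000_000
GUARD_NS = 800
FRAME_NS = 2_000_000
MAX_TRAINING_SYMBOLS = 64

subcarrier_spacing_hz = BANDWIDTH_HZ / FFT_SIZE
useful_symbol_ns = 1_000_000_000 * FFT_SIZE // BANDWIDTH_HZ   # FFT window length
symbol_ns = useful_symbol_ns + GUARD_NS                       # including guard interval
symbols_per_frame = FRAME_NS // symbol_ns
max_training_overhead = MAX_TRAINING_SYMBOLS / symbols_per_frame

print(subcarrier_spacing_hz, symbol_ns, symbols_per_frame, max_training_overhead)
```

The FFT window of 3.2 microseconds plus the 800-nanosecond guard interval gives the quoted symbol length of 4 microseconds, so one 2-millisecond frame holds 500 OFDM symbols, of which at most 64 (12.8%) are spent on training.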
In order to reuse as many modules as possible from the flat-fading FPGA design, all correlation units and the multiplication unit (MIMO detector) have to be reused 48 times within one OFDM symbol length. Since the signals for each frequency leave the FFT unit one after the other, the filter weights, and so forth, can be changed from subcarrier to subcarrier. Figure 12 shows the fully reconstructed OFDM pilot symbols after the MIMO detection in the baseband. Each of the four figures displays the reconstructed complex OFDM symbols transmitted from two Tx antennas. The signals are ordered as follows (from top to bottom): I-signal of Tx1, Q-signal of Tx1, I-signal of Tx2, and Q-signal of Tx2. The arrow in the top-left figure shows the symbol length of 4 microseconds. The Hadamard sequences used for the C-preamble can clearly be seen in the bottom-left figure. In the top-right figure, we see a data symbol vector using BPSK. The degrading effect of severe I/Q imbalance is visible in the remaining image crosstalk in the I/Q branches, which should be zero with perfect spatial reconstruction. In the bottom-right figure, we see the noise enhancement after the MMSE MIMO detector due to singularities in the MIMO matrices in the upper OFDM frequency band. Here, we do not find deep fades as known from SISO systems; instead, the MIMO matrix becomes close to singular, which causes severe noise enhancement due to the matrix inversion involved in the MMSE filter. This effect degrades all spatial MIMO channels, in general. This observation is very important for proper space-frequency coding, since redundant information can be placed at another Tx antenna but must be placed well separated in the frequency domain to avoid degradation from the same "fading hole."
A recent implementation of the MIMO-OFDM system with a 100 MHz FPGA design allowed 1 Gbps with 3 Tx and 5 Rx antennas and 64-QAM on 48 active subcarriers [64].
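The order of magnitude of the 1 Gbps figure can be checked with a raw bit-rate estimate. The sketch assumes that the 100 MHz design keeps the 64-point OFDM structure of the 20 MHz version, so that the symbol length scales from 4 microseconds to 0.8 microseconds; this scaling, the neglect of training overhead, and the function name are assumptions for illustration only.

```python
def raw_bit_rate_bps(n_tx, n_data_subcarriers, bits_per_symbol, symbol_ns):
    """Uncoded bit rate of a spatially multiplexed OFDM link,
    ignoring preamble and training overhead."""
    bits_per_ofdm_symbol = n_tx * n_data_subcarriers * bits_per_symbol
    return bits_per_ofdm_symbol * 1e9 / symbol_ns


# 3 Tx antennas, 48 active subcarriers, 64-QAM (6 bits per symbol),
# assumed OFDM symbol length of 800 ns at 100 MHz.
rate = raw_bit_rate_bps(3, 48, 6, 800)
print(rate / 1e9, "Gbps")
```

Under these assumptions, 864 bits are carried per OFDM symbol, which gives a raw rate slightly above 1 Gbps.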
An upgrade to 128 subcarriers and channel adaptive bit loading now allows a 1 Gbps transmission with only 2 Tx and 4 Rx antennas when 116 subcarriers are used for data transmission. A revised RF front end allowed 256-QAM in good channels. A first public presentation was given at the CeBIT fair in Hannover in early March. Figure 13 shows the bit allocation for a particular channel realization in our lab. Figure 14 shows screen shots of the reconstructed symbols at different subcarriers, showing that even with a good image suppression timing imperfections can cause significant differences in noise enhancement in the real and imaginary parts of the data symbol. Therefore, independent modulation in I and Q is an appropriate solution.

CONCLUSIONS AND CHALLENGES FOR FUTURE MIMO IMPLEMENTATIONS AND APPLICATIONS
A multiantenna experimental test-bed based on a hybrid approach consisting of FPGAs and DSPs, which was developed at FhG-HHI, was presented. The internal signal processing structures were described in detail and critical implementation issues were pointed out. The MIMO filter algorithms which were calculated on a DSP were analyzed with regard to complexity and optimization potential in C-code or assembly code. Several implementations were compared on the DSP target used for the test-bed, and a selection of those algorithms was applied to real-time high-data-rate MIMO transmission experiments using a single-carrier MIMO design and a MIMO-OFDM design. The experimental results clearly show that multiantenna techniques are an essential ingredient of signal processing structures for future wireless systems. The spatial diversity and multiplexing gains could be measured in good accordance with what was predicted from information theory. Using channel adaptive bit loading in the single-carrier mode, average spectral efficiencies of more than 20 bps/Hz with an assured BER better than $10^{-2}$ were shown.
Necessary further steps towards higher spectral efficiency and possible transmission rates beyond 1 Gbps are outlined in the following, together with some of the technical challenges involved.
If a higher bandwidth efficiency with OFDM is desired the number of subcarriers should be increased since the length of the guard interval is generally determined by the deployment scenario. Therefore, faster MIMO-filter computation is required, which could be solved by parallel computing, filter interpolation, faster clocking of the DSPs, and assembly code.
The next challenging task is to be seen in channel adaptive transmission using adaptive modulation and coding. Here, a higher number of subcarriers does not appear to be a limitation, since adjacent subcarriers are highly correlated and channel bundling with common modulation can be applied. The bit loading for adaptive transmission requires good error protection for the modulation level signalling over the feedback channel or, alternatively, some modulation signalling, for example, sent directly after the MIMO training sequence, to inform the Rx about the modulation levels used by the transmitter at every Tx antenna and subcarrier. Furthermore, the channel coding must have sufficient granularity to ensure an error protection always matched to the actual channel quality and the requested BER target. It is still considered an open problem which channel coding strategies are well matched to MIMO systems with/without frequency diversity and adaptive/nonadaptive modulation under real-time transmission and decoding requirements.
If a bandwidth extension is taken into consideration for data rate enhancement, all ADCs, DACs, and FPGA clocks have to be set to higher rates, which demands a very good VHDL design to comply with all timing constraints required by symbol-wise MIMO signal processing. Furthermore, a higher signal bandwidth sets tighter limits on digital up- and downconversion, which are common approaches to combat I/Q imbalances by low-IF digital frequency conversion. Here, the IF concept may exceed the capabilities of the ADCs and/or DACs of commercially available products. As an alternative, direct up- and downconversion becomes more attractive again, and the compensation of I/Q crosstalk is required by appropriate calibration and signal processing at the Txs and Rxs.