Low-mobility channel tracking for MIMO–OFDM communication systems

It is now well understood that, by exploiting the additional spatial dimensions, multiple-input multiple-output (MIMO) communication systems provide capacity gains over single-input single-output systems without increasing the overall transmit power or requiring additional bandwidth. However, these large capacity gains are feasible only when perfect knowledge of the channel is available to the receiver. Consequently, when the channel knowledge is imperfect, as is common in practical settings, its impact on the achievable capacity needs to be evaluated. In this study, we begin with a general MIMO framework and specialize it to the case of orthogonal frequency division multiplexing (OFDM) systems by decoupling channel estimation from data detection. Cyclic-prefixed OFDM systems have attracted widespread interest due to several appealing characteristics, not least of which is the fact that a single-tap frequency-domain equalizer per subcarrier is sufficient due to the circulant structure of the resulting channel matrix. We consider a low-mobility wireless channel which exhibits inter-block channel variations and apply Kalman tracking when MIMO–OFDM communication is performed. Furthermore, we consider the signal transmission to contain a stream of training and information symbols followed by information symbols alone. By relying on predicted channel states when training symbols are absent, we aim to understand how the improvements in channel capacity are affected by imperfect channel knowledge. We show that the Kalman recursion procedure can be simplified by the optimal minimum mean square error training design. Using the simplified recursion, we derive capacity upper and lower bounds to evaluate the performance of the system.


Introduction
In the presence of a rich scattering environment, multiple-input multiple-output (MIMO) systems enable a linear increase in capacity with no increase in bandwidth or transmit power compared to single-input single-output (SISO) systems. However, the seminal work of [1] is based on the assumption that the channel is perfectly known to the receiver. In practical systems, the channel estimated using training sequences can be imperfect. As a result, there is potentially a mutual information loss between the input and the output of the channel. Given a power budget and a desired data rate, the time and power spent on training versus information symbols have to be judiciously selected since there is an interesting interplay involving information throughput and the quality of the channel estimates. If a large fraction of the time and/or power is spent on training, excellent channel estimates can be obtained at the expense of poor information throughput. Conversely, expending too little time and/or power on training results in poor channel estimates that lead to error-prone information symbol transmission. Receivers that rely on channel estimates to perform information symbol decoding are termed "mismatched" receivers [2][3][4][5]. In this article, we study this scenario involving a transmitter with no channel state information (CSI) communicating with a receiver that relies on imperfect channel estimates. A different problem, which deals with transmit and receive precoder design under the assumption that CSI is available at the transmitter, has been studied extensively in the published literature, e.g., see [6][7][8].

*Correspondence: srikanthp@alum.wpi.edu. Department of Electrical and Computer Engineering, Worcester Polytechnic Institute, Worcester, MA 01609, USA. Full list of author information is available at the end of the article. http://asp.eurasipjournals.com/content/2013/1/78

The problem of channel estimation has been studied in numerous contexts.
Here, we list a few relevant studies. For an exhaustive survey of the area of channel estimation using known pilot sequences, see [9]. One of the earliest works in formulating training designs to obtain channel estimates for orthogonal frequency division multiplexing (OFDM) systems was [10]. In [11], optimal training designs were derived for single-carrier and OFDM systems by maximizing a tight lower bound on the ergodic training-based independent and identically distributed (i.i.d.) capacity. Optimal pilot symbol design and placement in a packet were addressed for both SISO and MIMO systems in [12] by minimizing the Bayesian Cramer-Rao bound (CRB) of a semi-blind channel estimator. In [13], a general affine-precoding framework [14] was considered and it was shown that decoupling channel estimation from symbol detection and optimizing a least-squares channel estimator naturally leads to an OFDM system with information and training symbols on disjoint subcarriers. Considering the same framework, Ohno and Giannakis [15] provide a link between optimal training designs and maximizing the channel capacity lower bound similar to [11]. This work was extended in [16] to include a MIMO communications setup. Furthermore, by considering block-processing of transmitted symbols with a cyclic prefix or zero-padding, optimal training designs are provided that maximize the channel capacity lower bound when a linear minimum mean square error (LMMSE) estimator is employed.
The impact of receiver estimation error from an information-theoretic viewpoint has also been studied extensively. One of the earliest studies was conducted in [4], where the relationship between lower and upper bounds on the mutual information between transmitted and estimated Gaussian symbols is derived by modeling a time-varying frequency-selective channel as a random component with a known mean and a covariance that accounts for the channel estimation error. Specifically, it was shown that the signal-to-noise ratio in the mutual information lower bound is lowered as a result of imperfect channel knowledge. In [17], the achievable data rate of a flat-fading interleaved MIMO channel is related to the LMMSE covariance matrix. In [5], the transmission of Gaussian symbols through a flat-fading channel was considered and it was demonstrated that when the Gaussianity assumption on the additive noise is rendered invalid due to channel estimation errors, scaled nearest neighbor detection is suboptimal. In [18], a lower bound on the capacity of a time-multiplexed training scheme in the presence of a flat-fading channel was studied and related to the variance of an LMMSE channel estimator. In [19], two pilot arrangement schemes were considered and the impact of the receiver estimation error was analyzed when CSI is available only at the receiver and when it is also fed back to the transmitter. In both cases, maximum likelihood channel estimation was considered. The relationship between the symbol Bayesian CRB and the mutual information between estimated and transmitted symbols was shown in [20]. In that study, two strategies were considered: in the first, the receiver obtains joint Bayesian channel and symbol estimates; in the second, the receiver computes channel estimates and then uses them to obtain symbol estimates.
The model presented in [18] was generalized in [21] by considering a superimposed training scheme of which time multiplexed training can be termed as a special case. Based on the mutual information bounds derived, a comparison between the superimposed training and the conventional time multiplexed training is performed by optimizing over training design, number of transmit antennas, and a training/information symbol power budget. While Hassibi and Hochwald [18] provide the optimal noise covariance matrix that maximizes a tight lower bound on the mutual information between the input and the output when both the transmitter and the receiver have imperfect CSI, Ding and Blostein [22] provide the optimal signal covariance matrix and show that the uniform power allocation scheme is suboptimal.
Our most important contribution in this article is the result provided in Theorem 1. Although this result is similar to that provided in ([23], Lemma 1), the approach that we have taken, i.e., the application of the theory of complex-valued differentials to compute the Bayesian Fisher Information Matrix (FIM), is novel. Second, although the decoupling of channel and symbol estimation has been noted based on the structure of the least-squares channel estimator for SISO systems [13] as well as MIMO-OFDM systems [24], we arrive at this conclusion by maximizing the Bayesian FIM of a general affine-precoded MIMO system. Moreover, we extend the analysis of Kalman channel tracking for SISO-OFDM systems in [25] to MIMO-OFDM systems. In the process, we extend the discussion in [18,21] by considering a "slowly" time-varying frequency-selective channel. In other words, while both [18,21] consider a block-invariant frequency-flat channel, we consider a frequency-selective channel that is correlated over successive symbol blocks.
The system model considered is described in Section 2. Based on this system model, we derive the Bayesian FIM of a general MIMO communications system that employs affine precoding at the transmitter in Section 3. We then show that in order to decouple channel estimation from data detection, an orthogonality constraint has to be met between the training and linear precoder matrices. A solution to the orthogonality constraint is the MIMO-OFDM system with frequency domain multiplexing (FDM) training symbols. We consider a MIMO channel that undergoes block-wise variations according to a first-order autoregressive (AR) model. Moreover, in order to improve the information throughput while understanding the impact of imperfect channel estimates, we formulate a scheme where during the training phase, the OFDM symbol contains training and information symbols, whereas in the data transmission phase, only information symbols are transmitted. Consequently, in Section 4 we consider a scheme wherein channel tracking is performed by a Kalman filter during the training phase, followed by the estimation of information symbols during the data phase based on channel state prediction. Using this setup, we derive the capacity upper and lower bounds in Section 5 based on a training scheme that is optimal in the MMSE sense. We then provide simulation examples in Section 6 to support the theoretical results.

System model
In our analysis, we consider a MIMO communications system consisting of $K$ transmit antennas that transmit $N$ blocks of training and information symbols over a time-varying frequency-selective block fading channel. We design superimposed training symbols optimally such that the channel estimates from $N_t$ consecutive blocks of training symbols are utilized in the data detection of the following $N_d$ information symbol blocks. Without loss of generality, we assume that the receiver also has $K$ receive antennas. The maximum order of the discrete-time complex baseband wireless channels, $L$, is assumed to be known.

Training phase
In the training phase, training symbols and information symbols are affinely precoded [14] and transmitted over $K$ antennas. A matrix formulation of this system for an arbitrary time index $n$ is as follows. Assuming that the information symbol vector at each antenna is of size $M$, we stack the symbols transmitted across the $K$ transmit antennas as

$$\tilde{x}_n \triangleq [\,\tilde{x}_{n,1}^T \;\; \tilde{x}_{n,2}^T \;\; \cdots \;\; \tilde{x}_{n,K}^T\,]^T, \qquad (1)$$

where the $n$th block of $M$ symbols from the $k$th transmit antenna is represented as

$$\tilde{x}_{n,k} \triangleq [\,\tilde{x}_{n,k}(0) \;\; \tilde{x}_{n,k}(1) \;\; \cdots \;\; \tilde{x}_{n,k}(M-1)\,]^T. \qquad (2)$$

The affine-precoder output vector is similarly arranged as

$$x_n \triangleq [\,x_{n,1}^T \;\; x_{n,2}^T \;\; \cdots \;\; x_{n,K}^T\,]^T. \qquad (3)$$

Denoting the precoder matrix of size $KP \times KM$ as $Q$ and the additive pilot-symbol vector of size $KP \times 1$ as $t$, we can now write the equation for the transmitted symbol vector during the training mode as follows:

$$x_n = Q\,\tilde{x}_n + t, \qquad (4)$$

where $t \triangleq \mathrm{vec}([\,t_1 \;\; t_2 \;\; \cdots \;\; t_K\,])$. Furthermore, the matrix $Q$ is such that the data stream transmitted from an antenna is precoded independently of the data streams from the other antennas. In other words, $Q$ has a block diagonal structure and hence $Q \triangleq \mathrm{diag}(Q_1, Q_2, \ldots, Q_K)$. Despite this restriction on the structure of $Q$, it is still general enough to encapsulate not only a MIMO system employing $K$ antennas but also a multi-user system, e.g., $K_U$ users utilizing $K$ antennas in total and communicating with a base station equipped with $K$ antennas. Also, restricting the structure of $Q$ to be block diagonal simplifies an orthogonality condition (cf. Theorem 2) that helps in the design of the linear precoder and the training vector.
After pre-multiplying the above vector by $I_K \otimes C_T$, where $C_T \triangleq [\,[\,0_{L\times(P-L)} \;\; I_L\,]^T \;\; I_P\,]^T$ inserts a cyclic prefix of length $L$ and $\bar{P} \triangleq P + L$, the resulting $K\bar{P} \times 1$ vector undergoes digital-to-analog conversion followed by pulse shaping to yield a continuous-time signal. Assuming perfect timing and carrier synchronization at the receiver, the signal is sampled to obtain the received symbol vector. Subsequently, the cyclic prefix is removed by pre-multiplication with $I_K \otimes C_R$, where $C_R \triangleq [\,0_{P\times L} \;\; I_P\,]^T$, and an ISI-free received vector of size $KP \times 1$, given in (5), is available for processing. Each $P \times P$ block of the channel matrix $H_n$ in (5) is circulant because of the cyclic prefix. We now define a channel vector $h_n$ that stacks the $L + 1$ tap gains of all $K^2$ transmit-receive antenna pairs. By exploiting the commutativity of discrete convolution, (5) can be rewritten in terms of the MIMO channel vector $h_n$ and the pilot symbol matrix $T \triangleq [\,T_1 \;\; T_2 \;\; \cdots \;\; T_K\,]$, which gives (8). In (5) and (8), we use the subscript $n$ in $H_n$ and $h_n$ to indicate the time dependence of the random channel. The system model described above needs to satisfy conditions (C1), (C2), and (C3).
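The role of the cyclic prefix in the paragraph above can be checked numerically. The following is a minimal single-antenna sketch, not the article's MIMO formulation: the helper names (`add_cyclic_prefix`, `linear_channel`, `remove_cyclic_prefix`, `circular_convolution`) and the block size are hypothetical choices made for illustration. It shows that inserting and removing a prefix of length $L$ turns the linear FIR channel into a circular convolution, which is what makes each channel block circulant.

```python
import random

def add_cyclic_prefix(block, L):
    """Prepend the last L samples of the block (the role played by C_T)."""
    return block[-L:] + block

def linear_channel(tx, h):
    """Causal FIR channel with zero initial state: y[i] = sum_l h[l] * tx[i - l]."""
    return [sum(h[l] * tx[i - l] for l in range(len(h)) if i - l >= 0)
            for i in range(len(tx))]

def remove_cyclic_prefix(rx, L):
    """Discard the first L received samples (the role played by C_R)."""
    return rx[L:]

def circular_convolution(block, h):
    """Circular convolution of one block with the channel taps."""
    P = len(block)
    return [sum(h[l] * block[(i - l) % P] for l in range(len(h)))
            for i in range(P)]

random.seed(0)
P, L = 8, 3  # illustrative block size and channel order (assumed values)
h = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(L + 1)]
x = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(P)]

y = remove_cyclic_prefix(linear_channel(add_cyclic_prefix(x, L), h), L)
y_circ = circular_convolution(x, h)
max_gap = max(abs(u - v) for u, v in zip(y, y_circ))
```

Here `max_gap` measures the largest difference between the prefix-based output and the circular convolution; it should be at the level of floating-point rounding.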
Remark: Condition (C1) is enforced as a simple way of introducing redundancy in the precoding process [7,26]. Condition (C2) ensures that enough dimensions are available for the identification of the unknown channel coefficients in a linear least-squares sense. As we shall show in Theorem 2, the extra dimensions that are available as a result of employing a full-column rank, strictly-tall precoding matrix are useful in designing the training vector. Condition (C2) also suggests that given the knowledge of the channel order and for a fixed number of transmit antennas, the data-block size has to be at least equal to the product of the number of channel taps and the number of transmit antennas. Condition (C3) which complements (C2) implies that each element of the set, {T k } is also of full column-rank.

Data transmission phase
Since no training symbols are available in the data transmission phase, the system model reduces to (9), where $s_n = \tilde{Q}\tilde{s}_n$ and $\tilde{s}_n$ is obtained in a manner similar to (1). A few assumptions on the system model shown in (8) and (9) are now in order.
(A1) The channel vector $h_n$ is zero-mean, i.i.d. complex Gaussian with variance $\sigma_h^2$, i.e., $h_n \sim \mathcal{CN}(0, \sigma_h^2 I_{K^2(L+1)})$. Moreover, each channel tap gain is assumed to be an independent AR process. We only consider a first-order AR model (cf. Appendix 1.1 for a brief discussion of the general AR model) for each tap gain so that

$$h_n = a\,h_{n-1} + u_n,$$

where $a \in [0, 1]$ is the AR coefficient for the $l$th channel tap gain and the excitation noise is $u_n \sim \mathcal{CN}(0, \sigma_u^2 I_{K^2(L+1)})$. In order to match the correlation functions at lags 0 and 1 and thus make the random process WSS for $n \ge 0$, we select

$$\sigma_u^2 = (1 - a^2)\,\sigma_h^2.$$

(A2) The transmitted symbol vectors $\tilde{x}_n$ and $\tilde{s}_n$ are i.i.d. complex Gaussian with variances $\sigma_x^2$ and $\sigma_s^2$, i.e., $\tilde{x}_n \sim \mathcal{CN}(0, \sigma_x^2 I_{KP})$ and $\tilde{s}_n \sim \mathcal{CN}(0, \sigma_s^2 I_{KP})$, respectively.

(A3) The additive noise vector $z_n$ is zero-mean, circularly-symmetric i.i.d. complex Gaussian noise with variance $\sigma_z^2$, i.e., $z_n \sim \mathcal{CN}(0, \sigma_z^2 I_{KP})$.
Remark: Assumption (A1) indicates that the channel is modeled as a Rayleigh-fading random vector. This assumption represents a standard model for a rich scattering environment in the absence of line-of-sight. An expression for $a$ in terms of the channel Doppler spread and the transmission bandwidth was shown in [4]. However, the first-order AR model possibly incurs considerable estimation error and results in numerous erroneous symbol decisions [27,28]. One reason for making assumption (A2) is to satisfy the regularity conditions related to the evaluation of the Bayesian FIM described below. They require that the joint distribution $p(y_n, \tilde{x}_n, h_n)$ be absolutely continuous with respect to $x_{n,k}(p)$. A data vector modeled as Gaussian meets this criterion. For transmit symbol vectors modeled on other distributions, Theorem 1 gives an approximation. Another reason for making this assumption lies in the fact that a zero-mean uncorrelated complex Gaussian signal maximizes the lower bound (which is given with respect to a zero-mean uncorrelated complex Gaussian noise vector) on the mutual information between the input and the output of MIMO channels [18,29]. Moreover, for a block transmission scheme such as an OFDM system with a large number of subcarriers, the transmit symbol vector obtained by linearly precoding the information-symbol vector with an IDFT matrix can be claimed to be Gaussian by an appeal to the central limit theorem ([30], Figure 4.21). Hence, (A2) is not a particularly restrictive assumption.
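To illustrate the first-order AR model in (A1), the sketch below simulates a single channel tap and checks empirically that choosing the excitation variance as $(1 - a^2)\sigma_h^2$ keeps the tap power at $\sigma_h^2$ and yields a lag-1 correlation coefficient of $a$. The function names, the seed, and the values of $a$ and $\sigma_h^2$ are assumptions made for this example only.

```python
import math
import random

def complex_gaussian(rng, variance):
    """Draw one circularly-symmetric complex Gaussian sample with the given variance."""
    s = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))

def simulate_ar1_tap(a, sigma_h2, num_blocks, rng):
    """Simulate h_n = a * h_{n-1} + u_n for one tap, started in the stationary state."""
    sigma_u2 = (1.0 - a * a) * sigma_h2  # excitation variance that keeps the tap power fixed
    h = complex_gaussian(rng, sigma_h2)
    taps = [h]
    for _ in range(num_blocks - 1):
        h = a * h + complex_gaussian(rng, sigma_u2)
        taps.append(h)
    return taps

rng = random.Random(7)
a, sigma_h2 = 0.9, 0.25  # illustrative values (assumed)
taps = simulate_ar1_tap(a, sigma_h2, 100_000, rng)

# Empirical lag-0 power and lag-1 correlation of the simulated tap.
power = sum(abs(h) ** 2 for h in taps) / len(taps)
lag1 = sum((taps[i + 1] * taps[i].conjugate()).real
           for i in range(len(taps) - 1)) / (len(taps) - 1)
correlation = lag1 / power
```

With a long enough run, `power` should be close to `sigma_h2` and `correlation` close to `a`; the agreement is statistical rather than exact.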

Decoupled channel and symbol estimation
An observation of (8) reveals that the knowledge of the MIMO channel vector is contained not only in the known training symbols, but also in the unknown information symbols. However, the joint estimation of the channel vector and the detection of the information symbol vector is a non-linear problem, and its solution may not exist in certain cases [13]. On the other hand, a sub-optimal approach is to decouple the channel estimation problem from the data detection process. In order to do so, we may consider the channel vector to be a deterministic unknown within the classical approach to statistical estimation or as a random vector by adopting the Bayesian viewpoint. In this study, we consider the latter approach and derive the FIM of the channel vector based on (8). That is, we derive the Bayesian FIM concerning the estimation of the channel vector using the $KP \times 1$ observations gathered from all the receive antennas at an arbitrary time instant $n$. We then maximize the Bayesian FIM, which is equivalent to minimizing the Bayesian Cramer-Rao lower bound, and obtain an orthogonality criterion. Finally, we formulate an affine precoder scheme that meets this condition.

Strategy: Bayesian FIM maximization
Theorem 1. Assuming that the likelihood function $p(y_n; h_n)$ for the system model given in (8) satisfies the regularity condition, the complex FIM for estimating the MIMO channel, $I(h_n)$, is given by (11).

Proof. See Appendix 1.2.
Remark: Since we will be considering a non-decisionaided setup where any information about the channel coefficients that is contained in data symbols is discarded, the term (Q) represents potentially useful information that is not utilized. Consequently, as it is independent of t, the maximization of I(h n ) in such a scenario is possible by working with (t, Q) alone. The maximization of I(h n ) leads us to the orthogonality condition shown in Theorem 2.

Theorem 2.
If the affine precoder scheme $(t, Q)$ satisfies conditions (C1) and (C2), then the orthogonality condition (13) is necessary and sufficient for a non-decision-aided, training-only estimator to maximize the Bayesian FIM $I(h_n)$ obtained in (11).

The expression for the Bayesian FIM that we have obtained in Theorem 1 is analogous to the result provided in ([23], Lemma 1). Moreover, as shown in Appendix 1.2, we have not based this result on the block-diagonal structure of $Q$. Hence, the result in Theorem 1 is a general one. Also, the result that we derived in Theorem 2 was shown previously in [13] for SISO systems and in [24] for MIMO-OFDM systems using minimum least-squares estimation error variance arguments. In [16,23], the orthogonality condition was derived within a Bayesian framework, with the former relying on an LMMSE channel estimator while the latter uses a Bayesian FIM expression similar to this study. However, while we focus on a block diagonal structure of the linear precoder, Vosoughi and Scaglione [23] focus on a general linear precoder matrix.

OFDM with FDM training: an orthogonal affine precoder scheme
We see from ([13], Theorem 1) for the case of a SISO system that the affine precoder scheme which uses linearly precoded OFDM along with an FDM training sequence that modulates a disjoint set of tones not used for data transmission meets the orthogonality criterion. Similarly, for MIMO systems, Theorem 3 establishes that although the training symbols and information-bearing symbols overlap in the time domain, orthogonality between the subcarriers in the frequency domain satisfies (13).
Theorem 3. The affine precoder scheme $(t, Q)$ that satisfies the orthogonality condition given in (13) irrespective of the FIR channel provides a non-data-aided channel estimator if it is selected from the class given in (14a) and (14b). In these equations, the $M \times M$ linear precoding matrix is any full-rank matrix and $P^{(t)}$ is a permutation matrix that places the $L + 1$ possible non-zero entries of $\tilde{t}_k$ on non-data-bearing subcarriers, whereas $W_{m,n} = \frac{1}{\sqrt{P}} \exp(j2\pi mn/P)$.
Proof. The proof is a straightforward generalization of ([13], Appendix I).
In the subsequent sections, we focus our analysis on a MIMO-OFDM communication system. That is, we assume $\tilde{x}_{n,k}$ to be the result of a linear precoding operation involving a general full-rank $M \times M$ matrix before it is IDFT-modulated. Moreover, the same set of subcarriers is used for transmitting training symbols across all the antennas.
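The decoupling that FDM training provides can be illustrated with a small single-antenna example. This is only a sketch under simplifying assumptions (one antenna, noise-free reception, unit pilots on four equispaced tones, an unnormalized forward DFT); the function names and parameter values are hypothetical and are not taken from the article. It shows that the pilot-tone outputs do not depend on the data symbols and that the $L + 1$ taps can be recovered from them alone.

```python
import cmath
import random

def dft(x, inverse=False):
    """Plain O(P^2) DFT; the inverse transform carries the 1/P factor."""
    P = len(x)
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / P if inverse else 1.0
    return [scale * sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / P)
                        for n in range(P))
            for k in range(P)]

def circular_channel(x, h):
    """Channel after cyclic-prefix removal: a circular convolution with the taps."""
    P = len(x)
    return [sum(h[l] * x[(i - l) % P] for l in range(len(h))) for i in range(P)]

def pilot_tone_outputs(data_symbols, pilots, pilot_tones, h, P):
    """Place data and pilots on disjoint tones, transmit, and read back the pilot tones."""
    freq = [0j] * P
    data_tones = [k for k in range(P) if k not in pilot_tones]
    for k, s in zip(data_tones, data_symbols):
        freq[k] = s
    for k, t in zip(pilot_tones, pilots):
        freq[k] = t
    rx_freq = dft(circular_channel(dft(freq, inverse=True), h))
    return [rx_freq[k] for k in pilot_tones]

def estimate_taps(pilot_outputs, pilots):
    """With L + 1 equispaced pilots, the taps follow from a short inverse DFT."""
    return dft([y / t for y, t in zip(pilot_outputs, pilots)], inverse=True)

random.seed(1)
P, L = 16, 3                 # illustrative sizes (assumed)
pilot_tones = [0, 4, 8, 12]  # L + 1 equispaced tones reserved for training
pilots = [1 + 0j] * (L + 1)
h = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(L + 1)]

def random_data():
    return [complex(random.gauss(0, 1), random.gauss(0, 1))
            for _ in range(P - len(pilot_tones))]

out_a = pilot_tone_outputs(random_data(), pilots, pilot_tones, h, P)
out_b = pilot_tone_outputs(random_data(), pilots, pilot_tones, h, P)
h_hat = estimate_taps(out_a, pilots)

data_leakage = max(abs(u - v) for u, v in zip(out_a, out_b))
tap_error = max(abs(u - v) for u, v in zip(h_hat, h))
```

Both `data_leakage` and `tap_error` should be at the level of rounding error, since the two transmissions differ only in their data symbols.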

Training phase
By substituting the result of Theorem 3 in (5) and considering the signal at an arbitrary receive antenna $k$, we obtain the received block at that antenna. Multiplying this equation by $P^{(t)T}_{0:V-1} W$ decouples channel estimation from data detection and yields (16), where $\tilde{T}_k \triangleq \mathrm{diag}(\tilde{t}_k)$ and $\hat{W}_{0:L} \triangleq \sqrt{P}\, P^{(t)T}_{0:V-1} W_{0:L}$. Stacking the pilot-tone observations of all $K$ receive antennas then gives the MIMO system model (20) for the measured signal due to the pilot tones. It can be shown that enforcing conditions (C1), (C2), and (C3) naturally results in two more standard conditions regarding the structure of $\tilde{T}$, which in turn satisfy the dimensionality of (20).
By employing operations similar to those that helped in obtaining (20), the equation for the observation vector affected by the information symbols alone is as follows: where the KM × KM channel matrix, H n is as follows: Each matrix in the set, {H

Data transmission phase
Although the linear precoder matrix,Q can be any fullcolumn rank matrix in general, we focus on a block diagonal structure. We consider each element in the set, {Q k } to be a P × P IDFT matrix that modulates an information symbol vector which has been linearly precoded by a general full-rank matrix,¯ P×P .
In the resulting data-phase model (24), $\tilde{r}_n \triangleq W r_n$ and $H_n$ is the $KP \times KP$ channel matrix.

Remark: By enforcing the orthogonality condition and by choosing MIMO-OFDM with FDM training symbols as the affine precoder scheme, we have broken down (8) into (20) and (21). As a result, the impact of overlapping data-bearing symbols on the channel estimator has been circumvented. Moreover, we carry over the linear precoder from the training phase to the data transmission phase by introducing a simple modification to the dimensions of the IDFT matrix.
Before we study the MMSE characteristics during training and data transmission phases when a Kalman filter is employed to track the time-varying channel vector, h n we note that the following time and power budget constraints are enforced over (20), (21), and (24), where P is the total average transmit power that is split into P t , the average power allocated for training, P dt , the average power allocated for information symbols in the training phase, and P d , the average power allocated for information symbols in the data transmission phase. In addition, P t is distributed equally among the transmit antennas, i.e., where

Blockwise Kalman tracking
Due to the AR(1) random process model for time variations of the channel vector, computing the channel estimator in the MMSE sense requires the past and the current observations, $\{\tilde{y}^{(t)}_{nN+k} : k \in [0, N_t - 1],\; n = 1, 2, \ldots\}$. An MMSE channel estimator can then be formed as the conditional mean of the channel vector given these observations. However, a batch processing approach would necessitate the use of large datasets. A natural choice is the sequential MMSE approach, which is implemented by a Kalman filter. A Kalman filter is well known for its computational efficiency, which results from the fact that only the most recent estimate needs to be stored in order to refine the MMSE estimate of the unknown parameter of interest based on new observations. For the problem at hand, we compute the channel estimate during the training phase based on (20) and utilize the predicted channel vector in the data transmission phase. The Kalman filter recursion algorithm for estimating the MIMO channel vector in the setup considered is summarized in (29a)-(29e) [31].

Kalman filter recursion
It can be noticed that when the system converges to a steady state, the MMSE of the channel estimator is not stationary during each cycle of $N$ blocks. In the data transmission phase, the MMSE associated with the channel estimator's predicted state increases monotonically from the $N_t$th block to the $(N - 1)$th block. Thus, the maximum steady-state MMSE in the data transmission phase occurs at the last information symbol block of each cycle. On the other hand, since the channel estimator computed based on the observations of the 0th block in the $n$th cycle refines the predicted channel state at the end of the last information symbol block of the $(n - 1)$th cycle, the steady-state MMSE decreases monotonically from the 0th training block to the $(N_t - 1)$th training block. Before we derive the steady-state MMSE expressions for the two cases described above, we derive the steady-state MMSE when all the blocks are training symbols and make an interesting observation. The steady-state MMSE when all blocks are training symbols is given by the solution to the Riccati equation (30), which is based on (29e), where $M^{(\infty)} \triangleq \lim_{n\to\infty} M_{n|n-1}$ and $K^{(\infty)} \triangleq \lim_{n\to\infty} K_n$. Although several techniques have been proposed in the published literature to solve the system of equations obtained in (30), such as eigenvector solutions [32], Schur vector approaches [33], and iterative solving for scalar polynomials [34], we will show that by utilizing the following lemma describing the optimal design of the training symbols in the MMSE sense, this system of equations is greatly simplified for MIMO-OFDM systems.

Lemma 1. For the system model shown in (20), the minimum error variance of the MMSE channel estimator is given by (31). The optimal $\tilde{T}$, denoted $\tilde{T}^{(\mathrm{opt})}$, that attains this error variance is given by (32).

Proof. See Appendix 1.4.
Remark: By employing the training design described in (32), $\tilde{T}^H\tilde{T}$ in (D.7) is diagonal and the MMSE of (31) is attained. The time-domain training sequences can be obtained from (14b) in a straightforward manner by using the relation $t = (I_K \otimes W^H P^{(t)}_{0:V-1})\tilde{T}$. It can be noticed that a simple way of making the term in (13) equal to zero is to allow only $(L + 1)$ out of the $V$ subcarriers dedicated to training symbols to be used at any given antenna, with these equispaced and equipowered training symbols occupying disjoint sets of subcarriers at each transmit antenna. On the other hand, the general training design described in (32) uses all $V$ non-data-bearing subcarriers for channel estimation purposes.
Remark: In [16], disjoint sets of subcarriers were considered to reduce the MMSE channel estimation error. Training designs similar to ours were shown in [24] by minimizing the least-squares channel estimation error and in [20] by minimizing the MSE of the LMMSE channel estimate. In [35], several classes of training schemes are derived by minimizing the least-squares channel estimation error. In that study, the disjoint allocation of subcarriers for training symbols from different antennas is referred to as an FDM scheme and the phase-shift orthogonal design as a code-division multiplexing scheme in the frequency domain.
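The diagonal structure mentioned in the remarks above can be checked for the disjoint, equispaced allocation. The sketch below is an illustration under assumed conventions (an unnormalized DFT with a negative exponent, unit pilot power, and the hypothetical helper `pilot_gram`); it is not the general design of (32). It forms the Gram matrix of the DFT rows at each antenna's pilot tones, restricted to the first $L + 1$ columns, and compares it with a contiguous allocation.

```python
import cmath

def pilot_gram(tones, num_taps, P, power=1.0):
    """Gram matrix F^H diag(power) F with F[m][l] = exp(-j 2 pi tone_m l / P)."""
    return [[sum(power * cmath.exp(-2j * cmath.pi * m * (l2 - l1) / P) for m in tones)
             for l2 in range(num_taps)]
            for l1 in range(num_taps)]

def max_off_diagonal(gram):
    """Largest magnitude among the off-diagonal entries."""
    n = len(gram)
    return max(abs(gram[i][j]) for i in range(n) for j in range(n) if i != j)

K, P, L = 2, 32, 3        # values used in the simulation section
spacing = P // (L + 1)    # equispaced pilots, one offset per antenna (assumed layout)
tone_sets = [[k + spacing * i for i in range(L + 1)] for k in range(K)]

grams = [pilot_gram(tones, L + 1, P) for tones in tone_sets]
contiguous_gram = pilot_gram(list(range(L + 1)), L + 1, P)  # counterexample
```

For each antenna, `grams[k]` is $(L + 1)$ times the identity up to rounding, while `contiguous_gram` has large off-diagonal entries, which is why equispacing matters in this sketch.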
If we were to initialize the Kalman recursion by substituting the scaled identity covariance matrix of $h_n$ for $M_{-1|-1}$, the one-step prediction error matrix $M_{n|n-1}$ is always a scaled identity matrix. Consequently, the matrix $K_n (I_K \otimes \tilde{T})$ is also a scaled identity matrix since $(I_K \otimes \tilde{T}^H\tilde{T})$ is designed to be a scaled identity matrix. This is better understood by writing the alternative version of the Kalman gain matrix using the matrix inversion lemma. As an extension of the above fact, due to assumption (A1) and the optimal training design described by Lemma 1, $M^{(\infty)}$ is also a scaled identity matrix. It can be shown that an arbitrary diagonal element $m^{(\infty)} \triangleq M^{(\infty)}[l, l]$, $0 \le l \le K^2(L + 1) - 1$, admits a closed-form expression. This steady-state Riccati solution is the lower bound on the MMSE for estimating any of the $K^2(L + 1)$ channel filter taps, irrespective of the particular phase being considered. To compute the steady-state MMSE characteristics, we let $n \to \infty$ and define the steady-state MMSE of the $j$th block in a cycle, for $j \in [0, N - 1]$. We can now review the closed-form expressions for steady-state channel MMSEs in training and data transmission phases based on [36].
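Because every matrix in the recursion is a scaled identity under the optimal training design, the Riccati equation collapses to a scalar fixed-point problem per tap. The sketch below works this out under an explicit assumption that is not taken from the article: the training contributes a scalar gain `c` per tap, i.e., the training Gram matrix is `c` times the identity. With that assumption, the one-step prediction error obeys $m \mapsto a^2 m \sigma_z^2 / (\sigma_z^2 + c\,m) + \sigma_u^2$, whose positive fixed point solves $c\,m^2 + [\sigma_z^2(1 - a^2) - c\,\sigma_u^2]\,m - \sigma_u^2\sigma_z^2 = 0$. The function names and numerical values are hypothetical.

```python
import math

def riccati_step(m_pred, a, sigma_u2, sigma_z2, c):
    """One scalar Kalman cycle: measurement update followed by AR(1) prediction."""
    m_filt = m_pred * sigma_z2 / (sigma_z2 + c * m_pred)  # filtered error variance
    return a * a * m_filt + sigma_u2                      # next prediction error variance

def riccati_fixed_point(a, sigma_u2, sigma_z2, c):
    """Positive root of c m^2 + [sigma_z2 (1 - a^2) - c sigma_u2] m - sigma_u2 sigma_z2 = 0."""
    b = sigma_z2 * (1.0 - a * a) - c * sigma_u2
    return (-b + math.sqrt(b * b + 4.0 * c * sigma_u2 * sigma_z2)) / (2.0 * c)

# Illustrative values (assumed): AR coefficient, tap variance, noise variance, training gain.
a, sigma_h2, sigma_z2, c = 0.99, 0.25, 0.1, 1.0
sigma_u2 = (1.0 - a * a) * sigma_h2

m = sigma_h2  # start from the prior variance of the tap
for _ in range(1000):
    m = riccati_step(m, a, sigma_u2, sigma_z2, c)

m_closed_form = riccati_fixed_point(a, sigma_u2, sigma_z2, c)
```

Iterating the recursion and evaluating the closed-form root should agree to within rounding, and both lie strictly between zero and the prior variance.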

Lemma 2.
When the training vectors are designed according to (32) and a Kalman filter is employed to perform channel tracking, the steady-state channel MMSEs for the system model corresponding to (20), (21), and (24) are given as follows: Proof. See proof of Lemma 1 in [36].
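The qualitative behavior described before Lemma 2 (the MMSE falls across the training blocks and rises across the data blocks of each cycle) can be reproduced with a scalar per-tap recursion. This sketch does not evaluate the closed-form expressions of the lemma; it simply iterates the filter under the assumption, not taken from the article, that training contributes a scalar gain `c` per tap. The function name and all numerical values are hypothetical.

```python
def steady_state_cycle(a, sigma_h2, sigma_z2, c, n_train, n_data, num_cycles=500):
    """Per-block error variance over one cycle after many cycles of N = n_train + n_data blocks."""
    sigma_u2 = (1.0 - a * a) * sigma_h2
    m = sigma_h2  # prior variance before any observation
    cycle = []
    for _ in range(num_cycles):
        cycle = []
        for j in range(n_train + n_data):
            m_pred = a * a * m + sigma_u2  # AR(1) prediction step
            if j < n_train:
                # Training block: measurement update with scalar training gain c.
                m = m_pred * sigma_z2 / (sigma_z2 + c * m_pred)
            else:
                # Data block: no pilots, so the predicted state is used as is.
                m = m_pred
            cycle.append(m)
    return cycle

n_train, n_data = 2, 6  # illustrative cycle layout (assumed)
cycle = steady_state_cycle(a=0.99, sigma_h2=0.25, sigma_z2=0.1, c=1.0,
                           n_train=n_train, n_data=n_data)

training_part = cycle[:n_train]
data_part = cycle[n_train - 1:]  # from the last training block to the last data block
```

In this sketch the largest value in `cycle` is its last entry, matching the statement that the worst-case MMSE occurs at the last information symbol block of each cycle.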

Capacity bounds with sequential MMSE channel estimation
Similar to [18], we adopt the definition of capacity in bits per channel use to be the maximum over the distribution of the transmit signal of the mutual information between the known training symbols and the observations and the unknown transmitted signal. In other words, for the system model shown in (20), (21), and (24), the channel capacity averaged over the random channel is defined as follows: bits/channel use.

Upper bound on the channel capacity
To benchmark the maximum achievable capacity, we consider the ideal scenario where the channel estimation is perfect. We also utilize the Gaussianity assumption on the distribution of the information symbol vectors,x n ands n due to (A2) in the channel capacity expression. We now have the following result: Theorem 4. The upper bound on the channel capacity for the system model shown in (20), (21), and (24) is obtained when the information symbol vectors,x n ands n are Gaussian distributed and is given by the expression: Proof. See Appendix 1.5.

Lower bound on the channel capacity
From [4] and ([18], Theorem 1), we know that the lower bound on the mutual information between the channel input and its output is obtained when the additive noise is Gaussian distributed. In other words, when imperfect channel estimates are employed for estimating information symbols, a zero-mean uncorrelated complex Gaussian noise vector minimizes the upper bound over the distribution of the information symbol vector of the mutual information between the transmitted and observed information symbols. For the problem under consideration, the signal models (42a) and (42b) can be written by expressing the estimated channel matrix as a sum of the conditional mean and the random error component. In (42a), we made use of a relationship in which $\tilde{X}_{n,k} \triangleq \mathrm{diag}(\tilde{x}_{n,k})$. It should be observed that in (21) and (24), the channel is unknown, whereas in (42a) and (42b), the channel is known. Furthermore, the additive noise in the former two equations is Gaussian and independent of the information symbols, whereas in the latter two, it is possibly neither. This is because each of the effective additive noise vectors, $\tilde{z}^{(dt)}_n \triangleq (I_K \otimes \tilde{X}_n)\check{h}_n + \tilde{z}_n$ and $\tilde{z}^{(d)}_n \triangleq (I_K \otimes \tilde{S}_n)\check{h}_n + \tilde{z}_n$, is a sum of a Gaussian vector and a vector whose elements are obtained by summing products of Gaussian random variables. As a result, we will merely derive the lower bound by replacing the effective noise vectors with Gaussian noise vectors that possess the same average powers. The expressions for the average noise powers in each phase are as shown below.

Training phase
E{ trace{X n,kW 0:LW
where we substituted $\sigma_x^2 = P_{dt}$ to account for the power budget on the transmit symbols in the training phase.

Data transmission phase
E{ trace{PS n,kW 0:LW
where we substituted $\sigma_s^2 = P_d$ to account for the power budget on the transmit symbols in the data transmission phase.
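The Gaussian replacement described above can be illustrated in scalar form. The following is a minimal sketch, assuming a single effective noise sample $e = x\,h_{err} + z$ in which the symbol $x$, the channel error $h_{err}$, and the noise $z$ are independent, zero-mean, circular complex Gaussian; the function names and the error variance of 0.05 used in the usage note are hypothetical and are not taken from the paper. It only shows that a Gaussian substitute with variance $\sigma_x^2 \sigma_{err}^2 + \sigma_z^2$ matches the average power of the non-Gaussian effective noise.

```python
import math
import random


def matched_noise_power(symbol_power, error_variance, noise_variance):
    """Average power of e = x * h_err + z for independent zero-mean terms."""
    return symbol_power * error_variance + noise_variance


def complex_gaussian(rng, variance):
    """Draw one circular complex Gaussian sample with the given variance."""
    s = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def empirical_noise_power(symbol_power, error_variance, noise_variance,
                          n_samples=100_000, seed=0):
    """Monte Carlo estimate of E|x * h_err + z|^2 (scalar illustration only)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = complex_gaussian(rng, symbol_power)        # information symbol
        h_err = complex_gaussian(rng, error_variance)  # channel estimation error
        z = complex_gaussian(rng, noise_variance)      # receiver noise
        total += abs(x * h_err + z) ** 2
    return total / n_samples
```

For instance, `matched_noise_power(0.32, 0.05, 0.01)` gives the variance that a substitute Gaussian noise sample would be assigned, and `empirical_noise_power(0.32, 0.05, 0.01)` should agree with it up to Monte Carlo error.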
The lower bound on the channel capacity when the estimated MIMO channels are taken to be the true channels is now given by the following result.
Theorem 5. The worst-case lower bound on the channel capacity for the system model shown in (20), (21), and (24) is obtained when the additive noise is Gaussian distributed and is maximized when the information symbol vectors, $x_n$ and $s_n$, are Gaussian distributed. It is given by the expression:

Proof. See Appendix 1.6.

Simulation results
In our simulation, we selected $K = 2$, $P = 32$, $L = 3$, and $M = 24$ (since $M = P - V$ and $V$ is set to $K(L + 1)$). The training vectors are generated according to (32). We also set the total power to $\mathcal{P} = 1$, so that the SNR is defined as $\mathrm{SNR} \triangleq -10 \log_{10} \sigma_z^2$. In designing the optimal training vector and in generating Gaussian information symbols over each of the $K$ transmit antennas, their variances have been scaled such that the total power constraint on the overall system is satisfied. We selected the Rayleigh channel variance to be $\sigma_h^2 = 1/(L + 1)$. Thus, the Rayleigh channel adopted is an uncorrelated uniform scattering model. Moreover, we averaged the results over 500 randomly generated MIMO channel vectors. Because the channel capacity lower bound given by (49) is quite involved, we do not attempt to provide analytical results for the optimal power allocation and the optimal number of blocks out of $N$ that carry superimposed training symbols. Consequently, we resort to numerical optimization to determine the optimal $P_{dt}$, $P_t$, $P_d$, and $N_t$.
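The arithmetic behind these settings can be checked with a short sketch. It assumes unit total power, so that the SNR definition above gives $\sigma_z^2 = 10^{-\mathrm{SNR}/10}$; the function name and the 20 dB operating point are hypothetical choices for illustration.

```python
def simulation_setup(K=2, P=32, L=3, snr_db=20.0):
    """Derive the quantities quoted in the text from K, P, L and an SNR in dB."""
    V = K * (L + 1)                       # V is set to K(L + 1)
    M = P - V                             # M = P - V
    sigma_h2 = 1.0 / (L + 1)              # Rayleigh channel variance per tap
    sigma_z2 = 10.0 ** (-snr_db / 10.0)   # from SNR = -10 log10(sigma_z^2), unit total power
    return {"V": V, "M": M, "sigma_h2": sigma_h2, "sigma_z2": sigma_z2}
```

With the default arguments this returns $V = 8$ and $M = 24$, matching the values stated above.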

Performance evaluation of optimal training designs over non-time-varying wireless channels
In this section, we generated the Rayleigh channels such that there is no correlation between successive block indices. In other words, the MIMO channel vector at any index is assumed to be independent of the MIMO channel vector at any other index. Moreover, we consider each block to contain training and information symbols, so that channel tracking is not performed. Thus, each block is represented by (20) and (21) alone. This also implies that $P_{dt} + P_t = 1$ and $P_d = 0$.

Comparison of the MIMO channel estimator and BCRB

From Figure 1, we notice that as a larger fraction of the total power is allocated to training symbols, the lower bound of the channel estimator progressively decreases. When the training symbols carry a small fraction of the total power, the MMSE channel estimator, which is evaluated based solely on (20), visibly under-performs relative to the lower bound. On the other hand, as the power of the training symbols increases, the difference between the MMSE channel estimator variance and the achievable Bayesian lower bound becomes negligible. In other words, the role played by the $\sigma_x^4$ term of (11), which depends on $Q$, is progressively minimized.

Figure 2 describes the performance of an MMSE equalizer for estimating the information symbols using (21). We provide the MMSE variance characteristics for the case when the true channel was used in (21) as well as the case where estimated channels were used. Unsurprisingly, the curves corresponding to true channel values show that the MMSE variance of the estimated information symbols is lower than the variance that results when estimated channel vectors are used. However, a more interesting observation is that the performance is impaired both when $P_t$ is too low and when it is too high. Specifically, when $P_t = 0.25$, only a small fraction of the total power is employed to gather channel estimates. Since the channel estimates are bound to contain numerous errors in this scenario, data estimation suffers. Conversely, when $P_t = 0.95$, only a small fraction of the total power is expended on information symbols and hence the data estimation suffers. On the other hand, when $P_t = 0.5$, the performance appears to be better. However, we refrain from computing the optimal power allocation by considering a capacity lower bound similar to (49) for the non-time-varying wireless channel scenario and reserve such an analysis for the next section.

Kalman tracking of time-varying wireless channels
In this section, we selected the MIMO channel vectors such that they are correlated. The excitation noise is generated with the appropriate variance so that the channel vectors are WSS. Throughout this section, we set $N = 10$. We then performed a numerical optimization as mentioned above and determined that the following values maximize the lower bound obtained in (49): $P_{dt} = 0.32$, $P_t = 0.41$, $P_d = 0.27$, and $N_t = 4$. When a non-optimal value of $P_{dt}$ is allocated to training symbols, the division of the remaining power between information symbols in the training phase and in the data transmission phase is chosen arbitrarily.
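As an illustration of how the excitation variance can be chosen to keep the channel WSS, the following is a minimal sketch for a single channel tap. It assumes a first-order scalar AR model $h_n = a\,h_{n-1} + u_n$, for which a stationary variance $\sigma_h^2$ requires $\sigma_u^2 = (1 - |a|^2)\,\sigma_h^2$; the first-order assumption and the function names are ours, and the paper's general model is the AR process described in the appendix.

```python
import math
import random


def excitation_variance(a, sigma_h2):
    """Driving-noise variance that keeps an AR(1) tap at stationary variance sigma_h2."""
    return (1.0 - abs(a) ** 2) * sigma_h2


def simulate_ar1_tap(a, sigma_h2, n_blocks, rng):
    """Generate n_blocks correlated samples of one tap: h[n] = a * h[n-1] + u[n]."""
    def complex_gaussian(variance):
        s = math.sqrt(variance / 2.0)
        return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))

    sigma_u2 = excitation_variance(a, sigma_h2)
    h = complex_gaussian(sigma_h2)   # start in the stationary distribution
    taps = [h]
    for _ in range(n_blocks - 1):
        h = a * h + complex_gaussian(sigma_u2)
        taps.append(h)
    return taps
```

For example, `simulate_ar1_tap(0.95, 0.25, 10, random.Random(0))` produces one tap trajectory over $N = 10$ blocks whose average power stays at $\sigma_h^2 = 1/(L+1) = 0.25$ in expectation.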

Steady-state MMSE of the channel estimator
In Figure 3, we provide the steady-state MMSE characteristics of the channel estimator when a Kalman filter is used to track the channel. We set $a = 0.95$ and fixed $N_t = 4$ in order to generate these characteristics. We also analyzed the characteristics for $P_t = 0.25$, $0.5$, and $0.95$ in addition to the optimal value. The MMSE lower bound shown in Figure 3 corresponds to (36), whereas the normalized MMSE corresponds to averaging (38b) and (38c) every $N$ blocks. We notice that for small $P_t$, the errors committed due to channel predictions in the data transmission phase cause a significant deviation of the normalized steady-state MMSE from the lower bound. Only at high values of $P_t$ do these errors become insignificant. Of particular importance is the fact that even at an SNR of 30 dB and with optimal training power allocation, the deterioration suffered due to prediction errors w.r.t. the lower bound is close to 3 dB.
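The qualitative behavior described here, with the error variance growing over the blocks that rely on predictions alone, can be reproduced with a generic scalar Kalman recursion. The sketch below is not the paper's recursion (38b) and (38c); it assumes a real AR(1) coefficient $a$, a scalar observation $y = \sqrt{P_{pilot}}\,h + z$ in each training block, and prediction only in the remaining blocks. All names are hypothetical.

```python
def kalman_mse_over_frame(a, sigma_h2, sigma_z2, pilot_power, N, N_t, n_frames=200):
    """Per-block error variance of a scalar Kalman tracker in the last simulated frame.

    Blocks 0..N_t-1 of each frame carry pilots (prediction + measurement update);
    blocks N_t..N-1 rely on prediction alone.
    """
    sigma_u2 = (1.0 - a * a) * sigma_h2   # keeps the tap variance at sigma_h2
    mse = sigma_h2                        # start from the prior variance
    frame = []
    for _ in range(n_frames):             # iterate until the periodic steady state
        frame = []
        for n in range(N):
            mse = a * a * mse + sigma_u2  # time update (prediction)
            if n < N_t:                   # measurement update only when pilots exist
                mse = mse * sigma_z2 / (pilot_power * mse + sigma_z2)
            frame.append(mse)
    return frame
```

Calling `kalman_mse_over_frame(0.95, 0.25, 0.01, 0.41, 10, 4)` returns ten values whose last six grow monotonically, which mirrors the prediction-induced degradation discussed above; averaging the list gives a per-frame normalized error variance.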

MMSE estimation of information symbols due to Kalman channel tracking
Figure 4 shows the resulting MMSE estimation error variance characteristics of information symbols. The solid curves rely on true channel values, whereas the dashed curves depend not only on the estimated channel states but also on Kalman predictions. Similar to Figure 2, we see that when $P_t$ is too small or too large, the error variance of information symbol estimation suffers greatly. Even at high SNR, a non-judicious power allocation combined with Kalman predictions during the data transmission phase leads to numerous errors. On the other hand, optimal power allocation between training and information symbols leads to the lowest possible information symbol estimation error variance.

Capacity bounds
The final simulation example that we consider concerns the capacity upper and lower bound characteristics. While the upper bound characteristics for varying levels of $P_t$ exhibit a gradual improvement toward the theoretical upper bound, the lower bound characteristics are more abrupt. This can be attributed to the prediction errors that occur in the Kalman prediction stage during the data transmission phase. When $P_t = 0.95$, the fraction of the total power available for information symbols in the training phase and the data transmission phase is minuscule and hence the achievable capacity lower bound is small. In contrast, this value can be improved by more than 15 bits/channel use with an optimal allocation of training and data powers (Figure 5).

Conclusion
In this article, we have shown that similar to a SISO case, an OFDM linear precoder with an FDM training sequence satisfies the orthogonality condition and results in decoupled channel estimation and symbol detection. Furthermore, we have derived optimal training sequences such that the FDM training sequences between different antennas are phase-shift orthogonal to each other. Based on the structure of the training matrices, the Kalman filter recursion was simplified to a scalar recursion. Eventually, the upper and lower bounds on the channel capacity were obtained by utilizing the Kalman filter's MMSE expressions to account for imperfect channel estimates. We showed that the Kalman filter predictions affect the capacity calculations substantially. Taking this degradation into account, we numerically determined the optimal training power allocation and optimal number of training blocks to achieve the best possible capacity lower bound. Finally, we provided numerical results to support the theoretical results.

Modeling time-variations in low-mobility wireless channels
While the complex random variable description of a wireless channel as Rayleigh, Rician, etc., forms one aspect of characterization, another one involves taking the time-variations of the channel filter taps into consideration. A common assumption on the random process that drives the time-variations of the channel filter taps is its wide-sense stationarity. In other words, the mean and the autocorrelation functions of each filter tap are assumed to be independent of time, with the latter being a function of the time-difference alone. Further, each tap at a given time instant is assumed to be independent of every other tap at any time instant. Together, these two assumptions give rise to the wide-sense stationary, uncorrelated scattering (WSSUS) model.

Autoregressive model: A widely used approach to model time variations of a WSSUS channel is a general $P$th order AR random process. By considering (7a), the AR model that helps us to specify the correlation between the current state of the system and the past states is as shown below:

In (A.1), each element in $\{A_p\}$ is termed an AR coefficient matrix or a state-transition matrix and $u_n$ the excitation or driving noise vector. The eigenvalues of each element in $\{A_p\}$ are assumed to be less than 1 in magnitude and the driving noise is assumed to be i.i.d. and complex Gaussian distributed with zero mean. The AR model admits the following Yule-Walker equations to describe the covariance function of the process [37].
Assuming that $R_h[0]$ and $\{A_p\}$ are known, we can apply the fact that $R_h[n] = R_h[-n]$, and recursively find $\{R_h[n]\}$ for $n = 1, 2, \ldots, P$. We can also find a non-unique $B$ by computing the square-root of (A.2) for $n = 0$ ([38], p. 358).
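The procedure just described can be sketched for the simplest non-trivial case. The code below assumes a real, scalar, second-order AR process, for which the Yule-Walker equations read $r[n] = a_1 r[n-1] + a_2 r[n-2]$ for $n \geq 1$ and $r[0] = a_1 r[1] + a_2 r[2] + \sigma_u^2$; the scalar excitation variance $\sigma_u^2$ plays the role of the quantity whose square root yields $B$ in the matrix-valued model. The function names are hypothetical.

```python
def ar2_autocorrelation(a1, a2, r0, max_lag):
    """Autocorrelation r[0..max_lag] of a real scalar AR(2) process with known r[0]."""
    # The lag-1 equation r[1] = a1*r[0] + a2*r[-1] uses the symmetry r[-1] = r[1].
    r1 = a1 * r0 / (1.0 - a2)
    r = [r0, r1]
    for n in range(2, max_lag + 1):
        r.append(a1 * r[n - 1] + a2 * r[n - 2])   # higher lags follow recursively
    return r[: max_lag + 1]


def ar2_excitation_variance(a1, a2, r0):
    """Driving-noise variance from the lag-0 Yule-Walker equation."""
    r = ar2_autocorrelation(a1, a2, r0, 2)
    return r[0] - a1 * r[1] - a2 * r[2]
```

Setting `a2 = 0` reduces this to the first-order case, where `ar2_excitation_variance(0.95, 0.0, 0.25)` equals $(1 - 0.95^2) \times 0.25$.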

Proof of Theorem 1
From [39], we know that the complex FIM is given by the equation,

In the second equality of the above equation, the inner expectation in the first term is w.r.t. $y_n$, whereas the outer expectation is w.r.t. $h_n$. The log-likelihood function of the probability density function $p(y_n | h_n)$ in (B.1) and its derivative are as follows:

where $u \triangleq y_n - (I_K \otimes T) h_n$ and $[\,\cdots\; Q_{j,K}]$ are obtained from each of the $KM$ columns of $Q$. The matrices $\{Q_{j,k}\}$ are a result of applying the commutativity property of convolution. It should be noted that we have not utilized the block-diagonal structure of $Q$ in obtaining $\{Q_{j,k}\}$. In other words, the matrices $\{Q_{j,k}\}$ are constructed without explicit consideration of the fact that $(K - 1)P$ out of $KP$ elements in each column of $Q$ are zeros. In addition, we have utilized the matrix inversion lemma in obtaining

We now evaluate the two partial derivatives in (B.2b) separately.
Using ([40], (9)), we note that

Here, $D_{R_{y_n|h_n}} \ln |R_{y_n|h_n}| = R^{-T}_{y_n|h_n}$ and $D_{R^*_{y_n|h_n}} \ln |R_{y_n|h_n}| = 0$ ([41], Table II). Moreover, from (B.3a), we see that (cf. ([40], (1)), (B.5b)

From the above equations and ([40], Table III), we notice that $D_{h_n^*} R_{y_n|h_n} = \sigma_x^2 \sum_{j=0}^{KM-1} (Q_j^* \otimes Q_n h_n)$. It should be noted that the definition of the partial derivative of a scalar function w.r.t. a column vector adopted by Hjørungnes and Gesbert results in a row vector ([40], Table III, 2nd row). We consider this definition to lead to a transposed derivative and perform a transpose operation on the results obtained based on ([40], (9)) in order to obtain the FIM with appropriate dimensions. Consequently,

$\frac{\partial \ln |R_{y_n|h_n}|}{\partial h_n^*} = \left( D_{R_{y_n|h_n}} \ln |R_{y_n|h_n}| \right) D_{h_n^*} R_{y_n|h_n}$ = vec T ∂ ln |R y n |h n | ∂ R y n |h n ∂vec R y n |h n

Using ([40], (9)), we can similarly show that

Hence, from (B.2b),

Before we evaluate the inner expectation in the first term of (B.1), we recall that,

Incidentally, by utilizing the above result, we can see that

Substituting this result along with (B.10) and (B.3b) in (B.1) gives: (B.11)

Proof of Theorem 2
From (12c), we see that $G$ is the inverse of the sum of two full-rank positive-definite matrices. This is because $H_n$ is a Rayleigh-fading channel matrix of full rank (with probability 1) due to (A1), and (C1) stipulates that $Q$ be a full column-rank matrix. Hence, $Q^H H_n^H H_n Q$ is a matrix with strictly positive eigenvalues. Together with the fact that $I_{KM}$ is also a matrix with strictly positive eigenvalues, we arrive at the result that $G \succ 0$. By making a similar argument, we can show that $R^{-1}_{y_n|h_n} \succ 0$. As a result of the above statements, we can claim that (t, Q) 0 and (t) 0. Combining the above results with (C1) leads us to conclude that I(h n ) 0. Now, based on a previous observation that only (t, Q) is the term under the designer's control, we see that I (opt) (h n ) I(h n ) where the optimal Bayesian FIM for a training-based channel estimator is as follows:

is the Bayesian CRB of a non-decision-aided channel estimator for the system model described in (8). We can now see that $I^{(opt)}(h_n)$ is obtained by making (t, Q) = 0, which in turn is possible by enforcing the condition:

In other words,

We now utilize the commutativity property of convolution and, in a manner similar to the construction of the matrices $\{T_k\}$, we see that H

Proof of Lemma 1
First, it is easy to see that the optimal MMSE estimator coincides with the linear MMSE estimator for the system under consideration, i.e., (20), due to the jointly Gaussian nature of the unknown parameter and the observation vectors. The optimal MMSE channel estimator, $\hat{h}_n$, is now ([31], (11.33) and (11.35)):

where the estimation error is $h_n - \hat{h}_n$. The resulting channel estimator error variance is,

The optimal $T$, $T^{(opt)}$, needs to minimize this error variance subject to (27). An equivalent representation of this pilot power constraint that will be useful for finding $T^{(opt)}$ is as follows:

trace (T H

where the equality is attained if and only if $A$ is diagonal. Therefore, if $T^{(opt)}$ is employed to perform the MIMO-OFDM channel estimation, the resulting variance of the MMSE channel estimator is as follows:

(D.6)

and equality in the above equation is attained when $T^H T$ is diagonal.
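The diagonality condition appears to rest on the standard bound $\operatorname{trace}(A^{-1}) \geq \sum_i 1/[A]_{ii}$ for a positive-definite matrix $A$, with equality if and only if $A$ is diagonal. Assuming that this is the inequality invoked in the proof, the following minimal sketch checks it for a real symmetric $2 \times 2$ matrix, where both sides have closed forms; the function names are hypothetical.

```python
def trace_of_inverse_2x2(a, b, d):
    """trace(A^{-1}) for the real symmetric matrix A = [[a, b], [b, d]]."""
    det = a * d - b * b
    if a <= 0 or det <= 0:
        raise ValueError("A must be positive definite")
    # A^{-1} = (1/det) * [[d, -b], [-b, a]], so its trace is (a + d) / det.
    return (a + d) / det


def sum_of_reciprocal_diagonal(a, d):
    """Lower bound on trace(A^{-1}): the sum of reciprocals of the diagonal of A."""
    return 1.0 / a + 1.0 / d
```

Since $ad - b^2 \leq ad$, the first quantity is never smaller than the second, and the two coincide exactly when the off-diagonal entry $b$ is zero.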

Optimal training design
We now construct the optimal training design that achieves the minimum MSE variance shown in (D.6). We will see that in order to attain this bound, the pilot sequences of each transmit antenna, as well as their relationship with the training sequences emitted from every other transmit antenna, need to satisfy certain specific properties. These properties are a direct consequence of (D.3b) and (D.5). A closer observation of $T^H T$ reveals the following:

where the $(L + 1) \times (L + 1)$ dimensional submatrix $R_{k_1,k_2}$ is defined based on (19) as:

The minimum variance as shown in (D.6) is therefore attained when

Case $k_1 = k_2$: In order to understand the conditions that need to be imposed on the structure of $T_k$, we examine an arbitrary element of $R_{k,k}$. From (D.9), we notice that

The above conditions indicate that the pilot symbols used for channel estimation must be equispaced in the subcarrier domain and equipowered. Due to (C8), we see that $T_k^* T_k = \frac{P_t}{KV} I_V$. Combined with the fact that $\hat{W}^H_{0:L} \hat{W}_{0:L} = V I_{L+1}$, we see that $R_{k,k} = \frac{P_t}{K} I_{L+1}$. We now see that when $K = 1$, the following pilot sequence design:

Case $k_1 \neq k_2$: We now incorporate the consequences of imposing the condition $R_{k_1,k_2} = 0$ when $k_1 \neq k_2$ in (D.11). We again utilize (D.9) and apply (C7). We see that

[ R k 1 ,k 2 ] l 1 ,l 2 = exp{−j2π l s (l 1 − l 2 )/P} × exp{−j2π vS(l 1 − l 2 )/P}, (D.12)

which equals zero when

We selected $E_k$ as shown in (D.14) so that we can exploit the property of summation of the roots of unity. In order to do so, we require that the term $(f_{k_1} - f_{k_2} + l_1 - l_2)$ not be an integer multiple of $V$. So, we choose $f_k = k(V - L - 1)$.
In conclusion, the training design shown in (32) meets not only conditions (C6), (C7), and (C8) but also (D.13) so that phase-shift orthogonality is maintained between the pilot sequences of any pair of transmit antennas.
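The roots-of-unity argument behind the choice $f_k = k(V - L - 1)$ can be checked numerically. The sketch below does not reproduce the pilot expression (32); it only assumes that the cross-antenna terms reduce to sums of the form $\sum_{v=0}^{V-1} \exp\{-j2\pi v m / V\}$ with $m = f_{k_1} - f_{k_2} + l_1 - l_2$, and that $V = K(L+1)$ as in the simulation setup. The function names are hypothetical.

```python
import cmath


def root_of_unity_sum(m, V):
    """Sum of exp(-j*2*pi*v*m/V) over v = 0..V-1; it vanishes unless V divides m."""
    return sum(cmath.exp(-2j * cmath.pi * v * m / V) for v in range(V))


def max_cross_antenna_leakage(K, L):
    """Largest |sum| over all antenna pairs k1 != k2 and tap pairs (l1, l2)."""
    V = K * (L + 1)                          # assumed, as in the simulation section
    f = [k * (V - L - 1) for k in range(K)]  # phase-shift index of each antenna
    worst = 0.0
    for k1 in range(K):
        for k2 in range(K):
            if k1 == k2:
                continue
            for l1 in range(L + 1):
                for l2 in range(L + 1):
                    m = f[k1] - f[k2] + l1 - l2
                    worst = max(worst, abs(root_of_unity_sum(m, V)))
    return worst
```

For $K = 2$ and $L = 3$, `max_cross_antenna_leakage(2, 3)` is zero up to floating-point rounding, which is consistent with the requirement that $(f_{k_1} - f_{k_2} + l_1 - l_2)$ never be an integer multiple of $V$.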

Proof of Theorem 4
By denoting the entropy using H(.) and applying the definition of mutual information, we can write the following expression, I (ỹ (dt) n ;x n | H n ) = H (x n | H n ) − H (x n | H n ,ỹ (dt)