Nonlinear joint transmit-receive processing for coordinated multi-cell systems: centralized and decentralized

This paper proposes a nonlinear joint transmit-receive (tx-rx) processing scheme for downlink coordinated multi-cell systems with multi-stream multi-antenna users. The nonlinear joint tx-rx processing is formulated as an optimization problem that maximizes the minimum signal-to-interference-plus-noise ratio (SINR) of the streams in order to guarantee fairness among the streams of each user. Nonlinear Tomlinson-Harashima precoding (THP) is applied at the transmitters, and linear receive processing is applied at the receivers, to eliminate the inter-user and inter-stream interference. We consider multi-cell systems under two coordinated modes, centralized and decentralized, corresponding to systems with high- and low-capacity backhaul links, respectively. For the centralized coordinated mode, the transmit and receive processing matrices are jointly determined by the central processing unit based on the global channel state information (CSI) shared by the base stations (BSs). For the decentralized coordinated mode, the transmit and receive processing matrices are computed independently based on the local CSI at each BS. Correspondingly, we propose a centralized and a decentralized algorithm to solve the optimization problem under the two modes. The feasibility and computational complexity of the proposed algorithms are also analyzed. Simulation results show that the proposed nonlinear joint tx-rx processing scheme achieves fairness by equalizing the bit error rate (BER) among the streams of each user and that it outperforms the existing linear joint tx-rx processing. Moreover, consistent with previous research results, the proposed centralized nonlinear joint tx-rx processing scheme is shown to perform better than the decentralized one.


Introduction
Coordinated multi-cell transmission is a promising technology for reducing inter-cell interference and increasing the user data rate, and it has been considered as one of the potential technologies for LTE Advanced [1,2]. To fully exploit coordinated multi-cell technology, it is essential to manage the multi-user interference (MUI) within the coordinated area appropriately, as it is directly related to the achievable spectrum efficiency [3]. Precoding is a well-known technique for MUI mitigation in multi-user multiple-input multiple-output (MU-MIMO) systems [4,5]. Joint transmit-receive (tx-rx) processing can further improve the downlink performance of MU-MIMO systems by optimizing the transmit precoding and receive filter matrices jointly. According to how the transmit precoding is performed, joint tx-rx processing can be divided into two types: linear and nonlinear schemes.
The coordinated multi-cell technology can be implemented in a centralized or a decentralized mode, depending on the backhaul capacity of the system. The centralized coordinated mode can achieve a higher data rate at the cost of high-capacity backhaul links, which enable the base stations (BSs) to share their data and their channel state information (CSI) (defined as local CSI). Hence, the centralized approach is limited to systems with sufficient backhaul capacity [6,7]. The decentralized coordinated mode does not require the BSs to share their local CSI, and the precoding or tx-rx processing is conducted at each BS [8]. This approach places a lower requirement on the backhaul link capacity, at the cost of some loss in data rate compared with the centralized coordinated mode.
In recent years, joint tx-rx processing in coordinated multi-cell systems has been widely studied under either the centralized [9][10][11][12][13][14][15][16][17][18][19] or the decentralized mode [20][21][22][23]. For designing the nonlinear joint tx-rx processing, many different optimization objectives have been considered, such as minimizing the sum mean square error (S-MSE) or maximizing the SINR; yet, fairness among the streams of each user has not been addressed for coordinated multi-cell systems with multi-stream multi-antenna users.

Prior art
Linear joint tx-rx processing algorithms have been widely studied for coordinated multi-cell systems under the centralized mode [9][10][11][12][13][14]. In [9], block diagonalization (BD) precoding was designed to maximize the weighted sum rate of all users. The tx-rx processing optimization with the criterion of minimizing the S-MSE was presented in [10][11][12], and the authors of [13] proposed a weighted S-MSE minimization algorithm that uses the channel gain as the weight factor. In [14], the energy efficiency was considered in the tx-rx processing design. A new criterion of maximizing the weighted sum energy efficiency was formulated, and the optimization problem was solved by an iterative algorithm. For the decentralized coordinated mode, D. Gesbert and R. Holakouei, et al. recently studied decentralized linear precoding techniques for systems with single-antenna users [20][21][22]. In [20], a distributed precoding scheme based on the zero-forcing (ZF) criterion (defined as DZF) was proposed together with several centralized power allocation approaches. In [21,22], a characterization of the optimal linear precoding strategy was derived. Distributed virtual SINR (DV-SINR) precoding approaches, in which each BS balances the ratio between the signal gain at the intended user and the interference caused to other users, were proposed for the particular case of two users in [21] and generalized to multiple users in [22]. The DV-SINR scheme was shown to satisfy the optimal precoding characterization and to outperform DZF.
Compared with linear joint tx-rx processing schemes, nonlinear joint tx-rx processing schemes are more complex but can obtain more system gain, and they have gained much attention recently. Most research on nonlinear precoding focuses on Tomlinson-Harashima precoding (THP), as it can approach the performance of the optimal dirty paper coding with much lower complexity [5]. For the centralized coordinated mode, the tx-rx processing scheme was designed to minimize the S-MSE in [15] and to maximize the SINR in [16]; both designs must be solved by an iterative method, resulting in high computational complexity. Low-complexity schemes with closed-form solutions were derived based on the minimum average bit error rate (BER) in [17], the minimum mean square error (MMSE) in [18], and the ZF criterion in [19]. In [18], the receive processing matrix was first computed from the CSI. Then, the transmit processing matrix and the receive weight coefficient were computed based on MMSE. In [19], the algorithm decomposed the MU-MIMO channel into parallel independent single-user MIMO (SU-MIMO) channels, and closed-form expressions of the transmit and receive processing matrices were then derived to optimize the performance of each user. The above works on nonlinear tx-rx processing were all developed for the centralized coordinated mode. Works for the decentralized coordinated mode are relatively few. A decentralized nonlinear precoding scheme, ZF-THP, was proposed in [23] but can only be applied to a system with a single user. To the best of our knowledge, for systems with multi-stream multi-antenna users, tx-rx processing solutions under the decentralized coordinated mode have not been addressed in the literature.
Previous work did not consider fairness among the streams of each user in coordinated multi-cell systems with multi-stream multi-antenna users. It is essential to study fairness for the nonlinear scheme, as unfairness is an inherent characteristic of THP and the worst stream determines the overall performance of the user [24].

Contributions
In this paper, a nonlinear joint tx-rx processing scheme is proposed to improve fairness among the streams of each multi-antenna user. The nonlinear joint tx-rx processing is formulated as an optimization problem that maximizes the minimum SINR of the streams. The performance of the proposed scheme is evaluated under both the centralized and the decentralized coordinated modes. Two algorithms for solving the optimization problem are derived.
The main work of this paper can be summarized as follows.
A nonlinear joint tx-rx processing scheme is developed for a coordinated multi-cell system with multi-stream multi-antenna users under two coordinated modes, centralized and decentralized. Two algorithms, one centralized and one decentralized, are proposed to solve the optimization problem, and both yield closed-form solutions. The algorithms guarantee fairness among the streams of each user, which not only boosts the performance of each user but also brings much convenience to the modulation/demodulation and coding/decoding procedures.
The remainder of this paper is organized as follows. Section 2 presents the coordinated multi-cell system model. The proposed nonlinear joint tx-rx processing scheme is described in detail in Section 3. A performance analysis of the proposed algorithms is developed in Section 4. Simulation results and conclusions are presented in Section 5 and Section 6, respectively.

Notation
We use uppercase boldface letters to denote matrices and lowercase boldface letters to denote vectors. The operators $(\cdot)^T$, $(\cdot)^H$, $(\cdot)^{\dagger}$, $E(\cdot)$, and $\mathrm{Tr}(\cdot)$ stand for the transpose, Hermitian transpose, Moore-Penrose pseudo-inverse, expectation, and trace of a matrix, respectively. diag(⋅) and blockdiag(⋅) denote a diagonal and a block diagonal matrix, respectively. $\mathbf{I}$ and $\mathbf{0}$ are the identity and the all-zero matrix, respectively, with appropriate dimensions. $\|\cdot\|_F$ represents the Frobenius norm of a matrix. $[\cdot]_{i:j,k:l}$ denotes the submatrix comprised of row i through row j and column k through column l of a matrix.

System model
Consider a downlink coordinated multi-cell system, where N BSs cooperatively serve K users. Each BS and each user is equipped with $n_t$ and $n_r$ antennas, respectively. All BSs share user data and cooperatively transmit the data to an intended user. Each BS transmits $L = \sum_{k=1}^{K} l_k$ data streams to the K users, where $l_k$ is the number of transmitted data streams for user k.
We assume that the BSs' transmit power for every user is P. Therefore, the total transmit power of the BSs is KP. Denote $\mathbf{x}_k = [(\mathbf{x}_k^1)^T, \cdots, (\mathbf{x}_k^N)^T]^T$, where $\mathbf{x}_k^n$ denotes the preprocessed signal transmitted by the nth BS for user k. The received signal of the kth user is given by Equation 1, where $\mathbf{H}_k = [\mathbf{H}_k^1, \cdots, \mathbf{H}_k^N]$ is the global CSI between the BSs and the kth user and $\mathbf{H}_k^n \in \mathbb{C}^{n_r \times n_t}$ denotes the local CSI between the nth BS and the kth user, whose entries are independent and identically distributed (i.i.d.) complex Gaussian variables with zero mean and unit variance. In Equation 1, the second term on the right-hand side is the MUI, and $\mathbf{n}_k \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I}_{n_r})$ is the additive white Gaussian noise vector.
Each user decodes the desired data by multiplying the received signal with the receive processing matrix. The received data of the kth user is given by Equation 2, where $\mathbf{R}_k \in \mathbb{C}^{l_k \times n_r}$ denotes the receive processing matrix of the kth user and $\tilde{\mathbf{n}}_k = \mathbf{R}_k \mathbf{n}_k$ is the equivalent received noise vector at the kth user.
Let $\mathbf{y} = [\mathbf{y}_1^T, \cdots, \mathbf{y}_K^T]^T$ represent the received signal of the K users. Equation 2 can be expressed in stacked form as:
$$\mathbf{y} = \mathbf{R}\mathbf{H}\mathbf{x} + \mathbf{n} \quad (3)$$
where $\mathbf{R} = \mathrm{blockdiag}(\mathbf{R}_1, \cdots, \mathbf{R}_K)$ is an $L \times K n_r$ matrix, $\mathbf{H} = [\mathbf{H}_1^T, \cdots, \mathbf{H}_K^T]^T$ denotes the global CSI between the BSs and the K users, and $\mathbf{H}_n$ denotes the local CSI between the nth BS and the K users. $\mathbf{x} = \sum_{k=1}^{K} \mathbf{x}_k$ denotes the transmit signal at the BSs, and $\mathbf{x}_n = \sum_{k=1}^{K} \mathbf{x}_k^n$ is the transmit signal at the nth BS. $\mathbf{n} = [\tilde{\mathbf{n}}_1^T, \cdots, \tilde{\mathbf{n}}_K^T]^T$ is the combination of the receive noise at the K users.
The rate of the kth user is denoted by $r_k$, and the system sum rate is then obtained as $r = \sum_{k=1}^{K} r_k$. The coverage of the N coordinated BSs is defined as one coordinated area. We mainly focus on the interference within the coordinated area. The interference from other coordinated areas is ignored in this paper; it can be eliminated by inter-cell interference coordination technology [25] or interference alignment technology [26].
For the centralized coordinated mode, it is assumed that all BSs exchange their local CSI, and the tx-rx processing matrices are jointly designed at the central processing unit. The system can be seen as a virtual MU-MIMO system with N t = Nn t transmit antennas. On the contrary, for the decentralized coordinated mode, BSs do not share their CSI, and every BS only has knowledge of local CSI between itself and K users. Therefore, the tx-rx processing matrices are independently designed at each BS.
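The difference between the two CSI views can be made concrete with a short NumPy sketch. It is only an illustration under stated assumptions: the function names `local_channels` and `global_channel`, the random seed, and the list-of-lists layout are hypothetical choices and not part of the paper; the channel statistics (i.i.d. complex Gaussian, zero mean, unit variance) follow the system model above.

```python
import numpy as np

def local_channels(N, K, n_t, n_r, rng):
    """Draw i.i.d. CN(0, 1) local channels; H_local[n][k] has shape (n_r, n_t)."""
    scale = np.sqrt(0.5)  # unit variance split over real and imaginary parts
    return [[scale * (rng.standard_normal((n_r, n_t))
                      + 1j * rng.standard_normal((n_r, n_t)))
             for _ in range(K)]
            for _ in range(N)]

def global_channel(H_local, k):
    """Centralized view: user k's local channels concatenated over all BSs."""
    return np.hstack([H_n[k] for H_n in H_local])

rng = np.random.default_rng(0)          # seed chosen arbitrarily
N, K, n_t, n_r = 3, 3, 6, 3             # 3 BSs, 3 users, as in one simulated setup
H_local = local_channels(N, K, n_t, n_r, rng)

# Centralized mode: the central unit sees an n_r x (N * n_t) channel per user.
H_1 = global_channel(H_local, 0)
print(H_1.shape)

# Decentralized mode: BS n only knows its own blocks H_local[n][k].
print(H_local[1][0].shape)
```

In the centralized view the system behaves like a virtual MU-MIMO link with $N_t = N n_t$ transmit antennas, whereas in the decentralized view every BS works with $n_r \times n_t$ blocks only.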

Nonlinear joint transmit-receive processing algorithm
In this section, we present nonlinear joint tx-rx processing algorithms for a coordinated multi-cell system under two different coordinated modes. The algorithm structure is firstly shown. Then, we formulate the optimization problem, aiming at maximizing the minimum SINR of streams to guarantee the fairness among the streams of each user. Finally, the algorithms for different coordinated modes are proposed.

Algorithm structure
The structure of the proposed algorithm is shown in Figure 1. In the proposed algorithms, nonlinear preprocessing is applied at transmitters; meanwhile, linear processing is applied at each receiver.
At the nth (n = 1, …, N) transmitter, $\mathbf{s} = [\mathbf{s}_1^T, \cdots, \mathbf{s}_K^T]^T$ denotes the modulated data vector satisfying $E\{\mathbf{s}\mathbf{s}^H\} = \mathbf{I}$, where $\mathbf{s}_k$ is comprised of the $l_k$ data streams for the kth user. In THP, the feedback matrix $\mathbf{B}_n$ is a unit lower triangular matrix, whose diagonal block $\mathbf{B}_n^{k,k}$ is a unit lower triangular matrix of size $l_k \times l_k$. $\mathbf{u}_n$ is the output data of THP. The lth data stream of $\mathbf{u}_n$ is interfered by the first (l − 1) data streams; in other words, the lth (l = 2, ⋯, L) element $u_n^l$ of $\mathbf{u}_n$ is a linear combination of $s_j$ (j ≤ l). Assume that an M-ary square constellation is employed for $\mathbf{s}$. To keep the real and the imaginary parts of $u_n^l$ within a bounded interval, a modulo operation is applied to each element of the THP output, which is equivalent to adding an offset vector $\mathbf{d}_n$ whose elements are determined by the integers $z_I$ and $z_Q$. Define $\mathbf{v}_n = \mathbf{s} + \mathbf{d}_n$; then, $\mathbf{u}_n$ is written as:
$$\mathbf{u}_n = \mathbf{B}_n^{-1}\mathbf{v}_n$$
There is a power enhancement of τ = M/(M − 1) due to THP, i.e., $E\{\mathbf{u}_n\mathbf{u}_n^H\} = \tau\mathbf{I}$ [27]. The transmit signal $\mathbf{x}_n$ at the nth BS is then obtained from $\mathbf{u}_n$ by the transmit processing, and at the receivers, the received signal in Equation 3 can be rewritten in terms of $\mathbf{u}_n$. The user data are finally obtained by the modulo operation and demodulation. The received noise power of the $l_k$ streams of the kth user is given by Equation 10.
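The THP feedback loop described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation. It assumes the common textbook convention in which an M-ary square QAM symbol has odd-integer real and imaginary parts and the modulo base is $2\sqrt{M}$, so that each part of the output lies in $[-\sqrt{M}, \sqrt{M})$; the symbols are left unnormalized for readability, whereas the paper uses $E\{\mathbf{s}\mathbf{s}^H\} = \mathbf{I}$. The names `mod_tau` and `thp_encode` and the random feedback matrix are hypothetical.

```python
import numpy as np

def mod_tau(x, base):
    """Symmetric modulo: map real and imaginary parts into [-base/2, base/2)."""
    re = x.real - base * np.floor(x.real / base + 0.5)
    im = x.imag - base * np.floor(x.imag / base + 0.5)
    return re + 1j * im

def thp_encode(s, B, base):
    """Successive THP feedback loop for a unit lower triangular matrix B."""
    L = len(s)
    u = np.zeros(L, dtype=complex)
    for l in range(L):
        # Interference from the streams that were already precoded.
        interference = np.dot(B[l, :l], u[:l])
        u[l] = mod_tau(s[l] - interference, base)
    return u

rng = np.random.default_rng(1)           # seed chosen arbitrarily
M = 16                                    # 16-QAM, assumed for the example
base = 2 * np.sqrt(M)
levels = np.array([-3, -1, 1, 3])
L = 6
s = rng.choice(levels, L) + 1j * rng.choice(levels, L)

# Hypothetical unit lower triangular feedback matrix.
B = np.eye(L, dtype=complex) + np.tril(
    rng.standard_normal((L, L)) + 1j * rng.standard_normal((L, L)), k=-1)

u = thp_encode(s, B, base)
# B u equals v = s + d, so a modulo at the receiver side removes the offset d.
recovered = mod_tau(B @ u, base)
print(np.allclose(recovered, s))
```

The loop makes the identity $\mathbf{B}_n\mathbf{u}_n = \mathbf{v}_n$ explicit: each output element is the data symbol minus the already-known interference, shifted by a multiple of the modulo base.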

Problem formulation
From Equation 1, it can be seen that the received signal of every user is affected by MUI. In order to free every user from MUI, the relevant matrices in this algorithm are designed to satisfy the ZF criterion (Equation 11). The SINR of the kth user is then given by Equation 12. In order to guarantee fairness among the streams of each user, we investigate the design of the tx-rx processing matrices that maximizes the minimum SINR over the streams of each user, which is formulated as the optimization problem in Equation 13, for k = 1, ⋯, K and n = 1, ⋯, N. In Equation 13, constraint (a) denotes the ZF criterion; constraint (b) denotes that $\mathbf{B}_n$ is a unit lower triangular matrix, where $\mathbf{e}_i$ is the ith column of $\mathbf{I}_L$; and constraint (c) is used to guarantee the power constraint.

Centralized algorithm
In Equation 13, the relevant matrices are entangled with each other. To solve this problem, we start from the ZF constraint. Every BS is assumed to have the same feedback matrix, denoted as $\mathbf{B}$, which is determined at the central processing unit based on the global CSI. Constraint (a) in Equation 13 can then be rewritten so that, for the kth (k > 1) user, the transmit processing matrix $\mathbf{F}_k$ is restricted by the receive processing matrices of the preceding users (Equation 15). We assume that $\mathbf{F}_k$ is represented as the product of two factors: the first is named the transmit space matrix, and the second is the transmit diversity matrix. The above analysis applies to the kth (k > 1) user. Since the first user is not limited by Equation 15, we use the identity matrix as its transmit space matrix. With $\mathbf{F}_k$ and THP, the proposed algorithm decomposes the MU-MIMO channel into parallel independent SU-MIMO channels [19]. This can be understood as follows: for the kth user, $\mathbf{F}_k$ is designed to avoid the interference from users (k + 1, ⋯, K). Meanwhile, THP is used to eliminate the interference from the first (k − 1) users. Therefore, user k does not suffer from MUI. For any user k (k = 1, ⋯, K), $\mathbf{B}^{k,k}$, $\mathbf{F}_k$, and $\mathbf{R}_k$ satisfy a ZF relation that involves only the quantities of user k, where $\mathbf{B}^{k,k}$ is a unit lower triangular matrix of size $l_k \times l_k$. Therefore, $\mathbf{B}^{k,k}$, $\mathbf{F}_k$, and $\mathbf{R}_k$ (k = 1, ⋯, K) can be designed separately for each user. For the kth user, Equation 13 can be reduced to Equation 19. Its solution is built on a decomposition of the effective channel $\mathbf{H}_k\mathbf{F}_k$ with factors $\mathbf{Q}_k \in \mathbb{C}^{n_r \times S}$, $\mathbf{P}_k$, and a lower triangular matrix whose diagonal elements are determined, through the scalar $\lambda_k$, by $\lambda_k^i$, the ith largest positive singular value of $\mathbf{H}_k\mathbf{F}_k$. The receive processing matrix, the transmit diversity matrix, and $\mathbf{B}^{k,k}$ then follow in closed form (Equation 22). Based on Equation 10 and Equation 22, the received noise power of every stream of the kth user is $\sigma^2\lambda_k^{-2}$. Therefore, the SINR of the kth user is given by Equation 23, and it can be seen that every stream of the kth user achieves equal SINR.
Note that computing the transmit space matrix of the kth user requires the receive processing matrices $\mathbf{R}_t$ (t < k) of the first (k − 1) users, and computing the receive processing matrix $\mathbf{R}_k$ of the kth user requires the transmit space matrix of the kth user. Therefore, $\mathbf{F}_k$ and $\mathbf{R}_k$ are designed step by step: the procedure starts by computing the transmit space matrix and the receive processing matrix of the first user, then computes the matrices of the second user by using the receive processing matrix of the first user, and so on. All of the matrices are designed at the central processing unit, and the receive processing matrices are transmitted to each user over the downlink channel. The procedure of the proposed centralized algorithm is summarized in Table 1.
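The step-by-step ordering can be sketched as follows. This is an illustration of the successive null-space idea only, under loud assumptions: the receive filter used here (the leading left singular vectors of the effective channel) is a placeholder and not the paper's closed-form solution, the transmit diversity matrix and the feedback matrix are omitted, and the names `null_space` and `successive_design` are hypothetical.

```python
import numpy as np

def null_space(A, tol=1e-10):
    """Orthonormal basis of the null space of A, obtained from the SVD."""
    _, s, Vh = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    return Vh[rank:].conj().T

def successive_design(H, l):
    """H[k]: (n_r, N_t) global channel of user k; l[k]: its number of streams."""
    N_t = H[0].shape[1]
    R, F = [], []
    for k in range(len(H)):
        if k == 0:
            # The first user is unconstrained: identity transmit space matrix.
            F_k = np.eye(N_t, dtype=complex)
        else:
            # Stay orthogonal to the effective channels of the earlier users.
            stacked = np.vstack([R[t] @ H[t] for t in range(k)])
            F_k = null_space(stacked)
        # Placeholder receive filter (assumption, not the paper's solution).
        U, _, _ = np.linalg.svd(H[k] @ F_k)
        R_k = U[:, :l[k]].conj().T
        F.append(F_k)
        R.append(R_k)
    return R, F

rng = np.random.default_rng(2)           # seed chosen arbitrarily
K, N_t, n_r = 3, 18, 3
l = [2, 2, 2]
H = [np.sqrt(0.5) * (rng.standard_normal((n_r, N_t))
                     + 1j * rng.standard_normal((n_r, N_t))) for _ in range(K)]
R, F = successive_design(H, l)

# Leakage from user k into every earlier user t < k should vanish.
leak = max(np.linalg.norm(R[t] @ H[t] @ F[k])
           for k in range(K) for t in range(k))
print(leak < 1e-8)
```

With this ordering the combined matrix of receive filters, channels, and transmit space matrices is block lower triangular, so the remaining interference from earlier users is causal and can be removed by the THP feedback loop.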

Decentralized algorithm
In this scenario, as the BSs do not exchange their local CSI, each BS independently preprocesses the user data with its own local CSI. The data processed by each BS cannot be obtained by the other BSs. In order to ensure that each user's received signal is free of MUI, the relevant matrices at each BS should satisfy the ZF criterion. Therefore, Equation 11 is reduced to a per-BS ZF condition (Equation 24). The receive processing matrix $\mathbf{R}_k$ (k = 1, ⋯, K) of each user is related to the transmit signals from the N BSs. If $\mathbf{R}_k$ were computed at the BSs, each BS could only decide it independently, as the local CSI of each BS is not exchanged. Generally, the $\mathbf{R}_k$ derived at different BSs would then take different values, which is unreasonable. Otherwise, each user would need to exchange information frequently with all coordinated BSs, which would largely increase the system computational complexity. Therefore, we first compute $\mathbf{R}_k$ (k = 1, ⋯, K) at the users. Let $\mathbf{U}_k^1 \in \mathbb{C}^{n_r \times l_k}$ denote the matrix computed at user k from its CSI. Then $\mathbf{R}_k$ can be obtained as $\mathbf{R}_k = \mathbf{G}_k(\mathbf{U}_k^1)^H$, where $\mathbf{G}_k$ is a diagonal matrix for normalizing the received signal and is determined at the BSs. For a frequency division duplex system, user k can only feed back the equivalent local CSI $(\mathbf{U}_k^1)^H\mathbf{H}_k^n$ to the nth BS. Therefore, Equation 10 can be rewritten as Equation 25. Assume that every BS has equal transmit power p = P/N. Based on the above analysis, Equation 13 is equivalent to the optimization problem in Equation 26, stated for k = 1, ⋯, K and n = 1, ⋯, N, whose constraints include the ZF criterion (a) and the per-BS power constraint.
In Equation 26, the relevant matrices are again entangled with each other. Similarly, to solve this problem, we start from the ZF constraint. Take the nth BS as an example. Constraint (a) in Equation 26 can be rewritten as Equation 27. As $\mathbf{G}$ and $\mathbf{W}_n$ are diagonal matrices and $\mathbf{B}_n$ is a unit lower triangular matrix, the left side of Equation 27 is a lower triangular matrix (Equation 28). We assume that $\mathbf{F}_k^n$ is represented as the product of a transmit space matrix and a transmit diversity matrix. The above analysis applies to the kth (k > 1) user. Since the first user is not limited by Equation 28, we use the identity matrix $\mathbf{I}_{n_t}$ as its transmit space matrix. For k > 1, the transmit space matrix is obtained from a null-space computation. Similarly, with the transmit space matrices and THP, the algorithm decomposes the MU-MIMO channel into parallel independent SU-MIMO channels. Define the equivalent channel of user k at the nth BS as the product of its equivalent local CSI and its transmit space matrix. For any user k (k = 1, ⋯, K), $\mathbf{B}_n^{k,k}$, $\mathbf{F}_k^n$, and $\mathbf{G}_k$ satisfy a relation that involves only the quantities of user k. Therefore, $\mathbf{B}_n^{k,k}$, $\mathbf{F}_k^n$, and $\mathbf{G}_k$ (k = 1, ⋯, K) can be designed separately. For the kth user, Equation 26 can be reduced to Equation 32.

The optimal solution of Equation 32 is obtained when all $l_k$ streams attain equal SINR [29]. According to Equation 12 and Equation 25, this is equivalent to requiring equal values for the diagonal elements of $\mathbf{G}_k$, expressed as $\mathbf{G}_k = \alpha_k\mathbf{I}$. Equation 32 is then equivalent to Equation 33, where $\gamma_n = \alpha_k / p_k^n$. Constraint (a) in Equation 33 can be rewritten in terms of the columns of $\mathbf{B}_n^{k,k}$ (Equation 34), where $\mathbf{B}_n^{k,k}\mathbf{e}_k^i$ denotes the ith column of $\mathbf{B}_n^{k,k}$. The objective of Equation 34 is equivalent to minimizing $\|(\mathbf{H}_k^n)^{\dagger}\mathbf{B}_n^{k,k}\mathbf{e}_k^i\|^2$ for every i (i = 1, ⋯, $l_k$), where $\mathbf{H}_k^n$ here stands for the equivalent channel of user k at the nth BS.
By differentiating Equation 35 with respect to $\mathbf{b}_n^i$ and setting the result to zero, $\mathbf{b}_n^i$ is obtained in closed form. The ith column of $\mathbf{B}_n^{k,k}$ is then obtained by Equation 37, and $\mathbf{B}_n^{k,k}$ is obtained by combining all columns $\mathbf{B}_n^{k,k}\mathbf{e}_k^i$ (i = 1, ⋯, $l_k$). Therefore, $\gamma_n$ can be derived. By combining (a) and (d) in Equation 33, $\mathbf{G}_k = \alpha_k\mathbf{I}$ is obtained. Finally, $\mathbf{B}_n$ is determined by $\mathbf{B}_n = \mathbf{W}_n^{-1}\mathbf{R}\mathbf{H}_n\mathbf{F}_n$. In this algorithm, $\mathbf{R}_k$ (k = 1, ⋯, K) are designed at the users, and the other matrices are designed at the BSs. The design of the matrices is performed independently at every BS. The quantities needed to form $\mathbf{G}_k$ should be transmitted to the kth (k = 1, ⋯, K) user over the downlink channel. The procedure of the proposed decentralized algorithm is described in Table 2.

Remark 1 (applicability)
It should be noted that the two proposed algorithms are also suitable for systems in which a single data stream is transmitted to each user. Moreover, both algorithms are applicable to the noncoordinated system. However, the centralized scheme is recommended for the noncoordinated system, as the decentralized scheme is a suboptimal solution in this situation.

Performance analysis
From Equation 23 and Equation 38, it is noted that both proposed algorithms achieve equal SINR for every stream of a user. They guarantee balanced performance among the streams of each user, which brings much convenience to the modulation/demodulation and coding/decoding procedures. In this section, we analyze the feasibility and the computational complexity of the two proposed algorithms.

Feasibility analysis
In a MIMO system, in order to distinguish every transmit stream, the number of transmit data streams must be no more than the number of transmit antennas and the number of receive antennas. For the centralized and the decentralized coordinated modes, the constraint on the number of transmit data streams is specified as follows.
Lemma 1: For the centralized coordinated mode, the number of transmit data streams is bounded by $L \le N_t$, $l_k \le n_r$; for the decentralized coordinated mode, the number of transmit data streams is bounded by $L \le n_t$, $l_k \le n_r$.
In the proposed centralized algorithm, the design of the transmit space matrix requires $N_t - \sum_{i=1}^{k-1} l_i > 0$. Furthermore, to guarantee that the optimization problem in Equation 19 has solutions, $S \ge l_k$ is required. As the entries of $\mathbf{H}_k\mathbf{F}_k$ are zero-mean complex Gaussian variables, the rank of $\mathbf{H}_k\mathbf{F}_k$ is $S = \min\left(n_r, N_t - \sum_{i=1}^{k-1} l_i\right)$ with a probability of 1. Therefore, $S \ge l_k$ is the necessary condition to carry out the algorithm. Based on Lemma 1, this necessary condition is satisfied in the centralized coordinated system. Therefore, the proposed centralized algorithm is feasible.
In the proposed decentralized algorithm, $n_t - \sum_{i=1}^{k-1} l_i > 0$ is required to guarantee the existence of the transmit space matrix. Moreover, the solution of the optimization problem in Equation 34 requires $n_t - \sum_{i=1}^{k-1} l_i \ge l_k$, which is satisfied in the decentralized coordinated system according to Lemma 1. Therefore, the proposed decentralized algorithm is feasible.
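The stream-count conditions of Lemma 1 are easy to check programmatically. The helper below is a hypothetical convenience function (its name and signature are not from the paper); it simply evaluates $L \le N_t$ or $L \le n_t$ together with $l_k \le n_r$ for a given configuration.

```python
def streams_feasible(mode, N, n_t, n_r, l):
    """Check the Lemma 1 bounds for a list l of per-user stream counts."""
    if mode == "centralized":
        tx_antennas = N * n_t      # virtual array of all coordinated BSs
    elif mode == "decentralized":
        tx_antennas = n_t          # each BS designs its matrices alone
    else:
        raise ValueError("mode must be 'centralized' or 'decentralized'")
    return sum(l) <= tx_antennas and all(l_k <= n_r for l_k in l)

# Configuration used in several simulations: N = 3, n_t = 6, n_r = 3, l_k = 2.
print(streams_feasible("centralized", 3, 6, 3, [2, 2, 2]))
print(streams_feasible("decentralized", 3, 6, 3, [2, 2, 2]))

# A fourth two-stream user exceeds n_t = 6 in the decentralized mode only.
print(streams_feasible("decentralized", 3, 6, 3, [2, 2, 2, 2]))
print(streams_feasible("centralized", 3, 6, 3, [2, 2, 2, 2]))
```

The last two calls illustrate why the decentralized mode supports fewer streams for the same hardware: its bound depends on the antennas of a single BS rather than on the whole coordinated array.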

Computational complexity
For simplicity, the number of floating-point operations is used to measure the computational complexity of the proposed algorithms.
In the proposed centralized algorithm, the design of the relevant matrices for the kth user includes the following: a one-time multiplication of an $l_{k-1} \times n_r$ matrix and an $n_r \times N_t$ matrix, the complexity of which is $O(l_{k-1} n_r N_t)$; a one-time computation of the null space of a matrix with $\sum_{i=1}^{k-1} l_i$ rows; a one-time multiplication of an $n_r \times N_t$ matrix and a matrix with $N_t$ rows; and a one-time computation of the singular values of an $n_r \times \left(N_t - \sum_{i=1}^{k-1} l_i\right)$ matrix. In the proposed decentralized algorithm, every BS has the same computational complexity. For any BS, the design of the relevant matrices for the kth user includes the following: a one-time computation of the singular vectors of an $n_r \times n_t$ matrix with $O(n_r^2 n_t)$ complexity; a one-time multiplication of an $l_k \times n_r$ matrix and an $n_r \times n_t$ matrix, the complexity of which is $O(l_k n_r n_t)$; a one-time computation of the null space; a one-time multiplication of an $l_k \times n_t$ matrix and an $n_t \times \left(n_t - \sum_{i=1}^{k-1} l_i\right)$ matrix, the complexity of which is $O\left(l_k n_t \left(n_t - \sum_{i=1}^{k-1} l_i\right)\right)$; and $l_k$ computations of a Moore-Penrose pseudo-inverse. The complexity of the other scalar computations can be ignored. Assume that the number of data streams is equal for every user, i.e., $l_1 = \cdots = l_K = l$. Thus, the complexity of the proposed centralized algorithm is $O(K L^2 N_t + K n_r^2 N_t + K n_r N_t^2)$.

Remark 2 (backhaul latency effect)
For the centralized coordinated mode, the tx-rx processing matrices are jointly computed at the central processing unit and then reported to every BS through the backhaul link. The resulting backhaul latency can affect the system performance. We ignore the backhaul latency effect in this paper and leave its study to future work.

Numerical results and discussions
This section presents simulation results to evaluate the BER performance of the two proposed algorithms. We compare them with the following algorithms: the interference-free algorithm, the joint transmit-receive processing algorithm proposed in [19], and the centralized BD (CBD) and decentralized BD (DBD). As the traditional BD cannot be directly applied in a decentralized manner, in DBD the receive processing matrix is first derived with the same method as in the proposed decentralized algorithm, and the precoding matrix is then derived based on the ZF criterion. For the system with a single stream transmitted to each user, i.e., $l_k = 1$ (k = 1,…,K), we also compare the proposed decentralized algorithm with DZF [20] and DV-SINR [22]. Flat Rayleigh fading channels are considered in the simulations. The elements of the channels are i.i.d. complex Gaussian variables with zero mean and unit variance. A 64-QAM modulation scheme is employed in the simulation. The signal-to-noise ratio (SNR) is defined as SNR = $P/(SM\sigma^2)$, where M = 4 is the signal constellation size and S is the average number of data streams transmitted to each user.

Balance BER performance among streams of each user
Figure 2 verifies the balanced performance among the streams of each user in the proposed algorithms. We consider a 3-cell coordinated system with $n_t = 6$ transmit antennas and K = 3 users, each equipped with $n_r = 3$ receive antennas. There are $l_k = 2$ (k = 1,…,K) data streams transmitted to each user. The BER performance of the two streams of the first user and of the third user is shown. It can be seen that the two streams of any user achieve approximately equal BER, both for the centralized algorithm and for the decentralized algorithm. The simulation results are in accordance with the theoretical analysis.

BER performance comparison of different algorithms
Figure 3 presents the BER performance comparison of the six algorithms. A 2-cell coordinated system with $n_t = 6$ transmit antennas and K = 3 users, each equipped with $n_r = 5$ antennas, is considered. The number of data streams transmitted to each user is set to 2, i.e., $l_k = 2$ (k = 1,…,K). Note that in the interference-free algorithm, only a single user is served by the BSs. On the whole, the centralized algorithms perform better than the decentralized algorithms, at the cost of information exchange among the BSs. For the proposed algorithms, the centralized algorithm achieves a gain of about 7 dB relative to the decentralized algorithm at BER = $10^{-3}$. Compared with the existing algorithms, at BER = $10^{-3}$, the proposed centralized algorithm has a gain of approximately 5 dB over the algorithm in [19] and of 10 dB over CBD. Also, a gain of about 10 dB is achieved by the proposed decentralized algorithm relative to DBD at BER = $10^{-2}$.
Figure 3 BER performance of centralized and decentralized algorithms with N = 2, K = 3, $n_t = 6$, $n_r = 5$, $l_k = 2$ (k = 1,…,K).
In Figure 4, we consider the BER performance of a 3-cell coordinated system with $n_t = 6$ transmit antennas and K = 3 users, each equipped with $n_r = 3$ antennas. The number of data streams transmitted to each user is set to 2, i.e., $l_k = 2$ (k = 1,…,K). As mentioned for Figure 3, in the interference-free algorithm, only a single user is served by the BSs. As can be seen from Figure 4, the centralized algorithms have better BER performance than the decentralized algorithms. The proposed centralized algorithm has a lower BER than the algorithm in [19] and CBD, and the proposed decentralized algorithm achieves better performance than DBD. Compared with Figure 3, the performance gains among the algorithms are different, as they depend on the system configuration.
In Figure 5, the BER performance of the proposed decentralized algorithm for the system with a single stream transmitted to each user, i.e., $l_k = 1$ (k = 1,…,K), is verified and compared with the existing decentralized algorithms DZF [20] and DV-SINR [22]. A 3-cell coordinated system with $n_t = 6$ transmit antennas and K = 5 users, each equipped with $n_r = 3$ antennas, is considered. As can be seen from Figure 5, the proposed decentralized algorithm has a lower BER than the other algorithms. At BER = $10^{-3}$, it achieves a gain of approximately 7 dB over DZF and of 5 dB over DV-SINR.

The effect of the number of receive antennas and users on centralized algorithms
Figure 6 illustrates the effect of the number of receive antennas on the proposed centralized algorithm, the algorithm in [19], and CBD. We consider a 3-cell coordinated system with $n_t = 6$ transmit antennas and K = 3 users. The number of data streams transmitted to each user is set to 2, i.e., $l_k = 2$ (k = 1,…,K). As can be seen from Figure 6, the performance difference between the proposed centralized algorithm and the algorithm in [19] increases with the number of receive antennas. In the proposed centralized algorithm, the receive processing matrix is taken into account in the MU-MIMO channel decomposition. Compared with the algorithm in [19], the decomposed SU-MIMO channels have larger dimensions, which increases the system diversity gain and improves the system performance. With a larger number of receive antennas, the decomposed SU-MIMO channels keep the same dimension in the proposed centralized algorithm but have smaller dimensions in the algorithm in [19]. Therefore, as the number of receive antennas increases, the proposed centralized algorithm achieves more performance gain than the algorithm in [19].
In Figure 7, the effect of the number of users on the proposed centralized algorithm, the algorithm in [19], and CBD is illustrated. We consider a 3-cell coordinated system with $n_t = 6$ transmit antennas and $n_r = 3$ receive antennas. The number of data streams transmitted to each user is set to 2, i.e., $l_k = 2$ (k = 1,…,K). As can be seen from Figure 7, an increased number of users enlarges the performance differences among the algorithms. In CBD, the tx-rx processing matrices of each user are used to eliminate the interference of all other users. In contrast, in the proposed centralized algorithm and the algorithm in [19], they are used to eliminate the interference of only part of the other users, which leaves more spatial dimensions for the diversity gain.

BER performance of the proposed algorithms in a noncoordinated system
In Figure 8, we illustrate the performance of the proposed algorithms for the noncoordinated system with N = 1 and compare them with CBD. In this situation, CBD is equivalent to the traditional BD in a single-cell MIMO system. An MU-MIMO system with $n_t = 8$ transmit antennas and K = 4 users, each equipped with $n_r = 2$ antennas, is considered. The number of data streams transmitted to each user is set to 2, i.e., $l_k = 2$ (k = 1,…,K). It is shown that the proposed algorithms achieve better BER performance than BD and that the proposed decentralized algorithm is only a suboptimal scheme for the noncoordinated system, as part of the receive processing matrix is not jointly derived with the transmit processing matrix. In this situation, the proposed centralized algorithm is verified to achieve a lower BER than the proposed decentralized algorithm. It exhibits a gain of approximately 6 dB over the decentralized scheme at BER = $10^{-2}$.
Figure 6 The effect of the number of receive antennas $n_r$ to different centralized algorithms with N = 3, $n_t = 6$, $l_k = 2$ (k = 1,…,K).

Conclusions
Nonlinear joint tx-rx processing for a coordinated multi-cell system with multi-stream multi-antenna users has been studied. The capacity of the backhaul link determines the coordinated mode among the BSs, which is either centralized or decentralized. The proposed centralized algorithm derives the tx-rx processing matrices jointly at the central processing unit. The proposed decentralized algorithm allows each BS to design its transmit precoding in a decentralized manner, which alleviates the demand on the backhaul capacity. The analysis and simulation results show that the centralized algorithm achieves better performance than the decentralized algorithm. In addition, the proposed algorithms achieve better performance than the existing joint tx-rx processing algorithms and the decentralized linear precoding schemes.
Figure 7 The effect of the number of users K to different centralized algorithms with N = 3, $n_t = 6$, $l_k = 2$ (k = 1,…,K).