EURASIP Journal on Applied Signal Processing 2003:13, 1335–1345
© 2003 Hindawi Publishing Corporation

Low-Complexity Decoding of Block Turbo-Coded System with Antenna Diversity

The goal of this paper is to reduce the decoding complexity of space-time block turbo-coded systems with low performance degradation. Two block turbo-coded systems with antenna diversity are considered: the simple serial concatenation of an error control code with a space-time block code, and the recently proposed transmit antenna diversity scheme using forward error correction techniques. It is shown that the former performs better than the latter in terms of bit error rate (BER) under the same spectral efficiency (by up to 7 dB at a BER of $10^{-5}$ for a quasistatic channel with two transmit and two receive antennas). For the former system, a computationally efficient approach is proposed for the soft decoding of the space-time block code. Compared to the original maximum likelihood decoding algorithm, it reduces the computation by up to 70% without any performance degradation. Additionally, for the considered outer code, a block turbo code, reducing the number of test patterns scanned in the Chase algorithm and using an alternative computation of the extrinsic information during iterative decoding yields an extra 0.3 dB to 0.4 dB of coding gain over previous approaches, with negligible hardware overhead. The overall decoding complexity is approximately ten times less than that of the near-optimum block turbo decoder, with a coding gain loss of 0.5 dB at a BER of $10^{-5}$ over the AWGN channel.


INTRODUCTION
One of the major challenges in wireless communications is the severe channel fading caused by multipath and movement in the radio link. Recently, in order to exploit the improved capacity of multiple-input multiple-output (MIMO) systems over flat Rayleigh fading channels [1], different transmit diversity techniques have been developed to benefit from antenna diversity in the downlink while placing the diversity burden on the base station [2, 3]. Although the space-time block code (STBC) has attracted a lot of attention, few papers have been published on its hardware implementation. The authors in [4] addressed the hard decoding of STBCs, which is based on the maximum likelihood decoding algorithm presented in [3].
STBC provides the maximum possible diversity advantage for a multiple transmit antenna system with a very low complexity decoding algorithm. However, in order to achieve significant coding gain, it should be concatenated with a powerful outer code [5, 6, 7]. Current powerful error control codes use iterative soft-input soft-output (SISO) decoding to achieve performance approaching the Shannon limit. Thus, the concatenated STBC decoder must provide soft output, that is, the reliability information of the decision bit, to the SISO block turbo decoder. Therefore, an efficient soft decoding algorithm for STBC should be considered.
In [8], a near-optimum iterative algorithm for decoding block turbo codes (BTCs) was proposed, which is based on the Chase algorithm [9]. Unfortunately, in spite of its near-optimum performance, comparable to that of the convolutional turbo code (CTC) [10], its decoding complexity is fairly high. In order to offer a compromise between performance and complexity, several complexity reduction schemes have been discussed and presented [11, 12, 13, 14, 15, 16].
More recently, the authors in [17] proposed to achieve antenna diversity by directly mapping the turbo-coded bits to the transmit antennas. This idea has also been extended to BTCs [18]. Simulation results showed that, in terms of coding gain, BTCs associated with transmit and receive diversity (the BTC-Diversity system) perform as well as CTCs. In this paper, the serial concatenation of BTC and STBC (the BTC-STBC system) is simulated; it achieves additional coding gain compared to the BTC-Diversity system under the same spectral efficiency (up to 7 dB at a bit error rate (BER) of $10^{-5}$ over a quasistatic channel with two transmit and two receive antennas). An STBC with code rate 1 is chosen to preserve the code rate of the whole system. A new efficient decoding approach is proposed for the STBC. It introduces no performance degradation and requires much lower hardware complexity, which makes it more suitable for real implementation. For the chosen outer error control code, the BTC, we also present a new power-efficient method which gains an extra 0.3 dB to 0.4 dB of coding gain compared to the scheme presented in [12]. The hardware overhead is negligible. This implies that the complexity of our new block turbo decoder is about ten times less than that of the near-optimum block turbo decoder [19], with a performance degradation of only 0.5 dB at a BER of $10^{-5}$ over the additive white Gaussian noise (AWGN) channel. Thus, the very large scale integration (VLSI) implementation of the space-time block turbo-coded system with low complexity and acceptable error correction capability is possible.
This paper is organized as follows. In Section 2, two space-time block turbo-coded systems are briefly introduced and their performances are compared under the same spectral efficiency over block fading or quasistatic fading channels with two transmit and one or two receive antennas. Section 3 presents the complexity reduction approaches for soft decoding of STBC in the system with better BER performance. Section 4 is devoted to the complexity reduction schemes for the block turbo decoder. Section 5 provides the conclusions.

SPACE-TIME BLOCK TURBO-CODED SYSTEMS
In this section, space-time block codes with maximum likelihood decoding algorithm are briefly explained and the performances of the two space-time block turbo-coded systems are compared under the same spectral efficiency.
Assuming a flat Rayleigh fading matrix channel and perfect channel state information, the log a posteriori probability (LAPP) of the two transmitted symbols $c_1$ and $c_2$ for the STBC with two transmit antennas is given in [5] by (1) for the symbol $c_1$ and by (2) for the symbol $c_2$, where $r_t^j$ is the signal received at antenna $j$ at time slot $t$, $h_{i,j}$ is the path gain from transmit antenna $i$, $1 \le i \le n$, to receive antenna $j$, $1 \le j \le m$, and $s_k$ is a possible complex constellation symbol.

BTC-STBC system versus BTC-Diversity system
A simple STBC concatenated with a powerful forward error correction channel code as outer code is expected to provide significant coding gain in addition to the diversity advantage. The block diagram of the space-time block turbo-coded system is illustrated in Figure 1.
At the receiver end, the output of the STBC decoder is the LAPP of each transmitted symbol. Before these are passed to the block turbo decoder, the log-likelihood ratios (LLRs) of the individual bits have to be calculated as in (3), which resembles the inverse of the Gray mapping applied at the transmitter.

The other system considered, a BTC with transmit antenna diversity, is shown in Figure 2. This straightforward system is chosen because it has recently drawn much interest and achieves much better performance than the original space-time trellis code [17]. Denoting the set of constellation points by $\{c_i\}_{i=1}^{2^M}$, the LLRs of $b_l$, $l = 1, 2, \ldots, nM$, using $m$ received signals from $n$ transmit antennas, can be obtained as in (5) (see [17]), where $N_0$ stands for the noise power spectral density. To reduce the computational complexity, the approximation (6) is used in our simulation.

Both the BTC-Diversity and BTC-STBC systems are flexible, since the block turbo decoder remains the same no matter which modulation scheme or fading channel is employed. Nevertheless, the BTC-STBC system has two more building blocks (the space-time block encoder and decoder). Furthermore, some modifications have to be made to the STBC codec if the number of transmit antennas is increased.
However, the overall complexity of the BTC-STBC system is not increased, since its LLR computation module is much simpler. From (5) and (6), it is easily seen that the number of computations $N$ required to obtain the LLR of each bit in BTC-Diversity grows as the $n$th power of the constellation size $2^M$ ($N = 2^{M \times n}$, where $n$ stands for the number of transmit antennas), that is, exponentially in $n$. On the other hand, for the BTC-STBC system, this number grows only linearly with the constellation size ($N = 2^M$; see (1), (2), and (3)).
For example, if 16-QAM is adopted for both systems with two transmit antennas, 256 comparison terms have to be calculated for the BTC-Diversity system, while only 16 comparison terms need to be calculated for the BTC-STBC system. This significant hardware reduction is very attractive for VLSI implementation.
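The two counts quoted above follow directly from $N = 2^{M \times n}$ and $N = 2^M$. The short sketch below simply evaluates both expressions; the function names are illustrative and do not come from the paper.

```python
def terms_btc_diversity(bits_per_symbol: int, n_tx: int) -> int:
    """Comparison terms per bit LLR when the n_tx symbols are detected jointly: 2^(M*n)."""
    return 2 ** (bits_per_symbol * n_tx)


def terms_btc_stbc(bits_per_symbol: int) -> int:
    """Comparison terms per bit LLR after STBC decoding has separated the symbols: 2^M."""
    return 2 ** bits_per_symbol


if __name__ == "__main__":
    # M = bits per constellation symbol; two transmit antennas as in the text.
    for name, m_bits in [("BPSK", 1), ("QPSK", 2), ("16-QAM", 4)]:
        print(name, terms_btc_diversity(m_bits, 2), terms_btc_stbc(m_bits))
```

For 16-QAM with two transmit antennas this gives 256 versus 16 terms, matching the figures in the text.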

Performance comparison under the same spectral efficiency
The considered BTC is composed of two identical systematic extended Hamming codes, [exHamming(32, 26, 4)]$^2$, with code rate $R = 0.660$. The STBC is defined by the transmission matrix $G_2$ as in [2]. A helical interleaver as described in [20] is employed in our simulation. For a fair comparison, the spectral efficiencies of the two systems are kept the same. In the case of two transmit antennas, the BTC-STBC system transmits two symbols in two time slots while the BTC-Diversity system transmits two symbols in just one time slot. Therefore, for $2R$ bits/s/Hz (1.32 bits/s/Hz), BTC-STBC uses QPSK while BTC-Diversity uses BPSK modulation. For $4R$ bits/s/Hz (2.64 bits/s/Hz), BTC-STBC uses 16-QAM while BTC-Diversity uses QPSK modulation. Here, $R$ refers to the code rate of the BTC.

All performance results are evaluated over either the block fading channel or the quasistatic fading channel. Here, block fading channel means that the path gains are constant for $L$ consecutive channel symbols, where $L$ is smaller than the frame length (1024 bits for the considered [exHamming(32, 26, 4)]$^2$ code). These $L$ adjacent symbols are also called a faded block since they are affected by the same fading value. On the other hand, quasistatic fading channel means that the path gains are constant over a frame and change independently from one frame to the next. The quasistatic channel is thus a special case of the block fading channel in which $L$ is equal to the frame length. Two different values of $L$ are simulated: 2 and 64. The case $L = 2$ guarantees the validity of the STBC decoding algorithm, which is based on the assumption that the path gains are constant over two successive transmissions. The case $L = 64$ means that there are four (half rate, $4R$ bits/s/Hz) or eight (full rate, $2R$ bits/s/Hz) differently faded blocks per frame. The BER comparison for two transmit and two receive antennas with $2R$ bits/s/Hz over different channels is shown in Figure 3a.
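The numerical values in this setup can be checked with a few lines of arithmetic. The sketch below assumes that the product code rate is the square of the component code rate and that the number of faded blocks per frame is the number of channel symbols in a frame divided by $L$; the helper names are illustrative only.

```python
def btc_code_rate(n: int = 32, k: int = 26) -> float:
    """Rate of the two-dimensional product code built from two identical (n, k) codes."""
    return (k / n) ** 2


def frame_length_bits(n: int = 32) -> int:
    """One frame is one n x n product codeword."""
    return n * n


def faded_blocks_per_frame(frame_bits: int, bits_per_symbol: int, block_len: int) -> int:
    """Number of independently faded blocks when the gains hold for block_len symbols."""
    symbols_per_frame = frame_bits // bits_per_symbol
    return symbols_per_frame // block_len


if __name__ == "__main__":
    rate = btc_code_rate()
    print(f"R = {rate:.4f}, 2R = {2 * rate:.2f}, 4R = {4 * rate:.2f} bits/s/Hz")
    bits = frame_length_bits()
    print("QPSK, L = 64:", faded_blocks_per_frame(bits, 2, 64), "faded blocks")
    print("16-QAM, L = 64:", faded_blocks_per_frame(bits, 4, 64), "faded blocks")
```

Under these assumptions the sketch reproduces $R \approx 0.660$, the 1024-bit frame, and the eight (QPSK) or four (16-QAM) faded blocks per frame quoted for $L = 64$.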
As $L$ increases, the SNR has to be increased accordingly to maintain the same BER performance. At a BER of $10^{-5}$, the advantage of the BTC-STBC over the BTC-Diversity system is only around 1.5 dB over the $L = 2$ and $L = 64$ block fading channels, while this additional coding gain is up to 8 dB over the quasistatic channel.
Similar results are obtained for the case of two transmit antennas and one receive antenna (Figure 3b). For the $L = 2$ block fading channel, the BTC-STBC system demonstrates an additional coding gain of 3 dB at a BER of $10^{-5}$. This extra coding gain is 6 dB over the $L = 64$ block fading channel. More coding gain is expected over the quasistatic fading channel.
In Figure 4, the spectral efficiency is increased from $2R$ bits/s/Hz to $4R$ bits/s/Hz. Significant coding gains of the BTC-STBC system over the BTC-Diversity system are also observed. At a BER of $10^{-5}$, for two transmit and two receive antennas, the coding gain is 2 dB over the $L = 64$ block fading channel and 7.5 dB over the quasistatic fading channel. It is interesting to note that for $L = 2$, the performances of the two systems are comparable. For the system with two transmit antennas and one receive antenna, the coding gain is 4 dB over the $L = 2$ block fading channel and 11 dB over the $L = 64$ block fading channel.

COMPLEXITY REDUCTION OF SPACE-TIME BLOCK DECODER
In this section, an efficient algorithm is described for evaluating the bit LLRs in (3). As an example, the transmission matrix for two transmit antennas $G_2$ [2] and the BPSK, QPSK, and 16-QAM modulation schemes are adopted here. Similar approaches can easily be applied to other transmission matrices and modulation schemes.
Denoting $s_k = s_I + j s_Q$, we can rewrite the decision metric used for the LAPP computation in (3) in the form (7), in terms of the quantities $\alpha$, $\beta$, and $\gamma$ defined in (8). From (7), further simplifications can be made as follows: (1) the term $\alpha^2 + \beta^2$ is common to all $s_k$ and can thus be excluded from the comparisons; (2) for M-PSK with equal-energy signal constellations, $(\gamma + 1)(s_I^2 + s_Q^2)$ can also be cancelled out. The result is (9). From (9), it is observed that the bit LLRs for M-PSK depend only on the values of $\alpha$ and $\beta$ and on the modulation scheme, which determines $s_I$ and $s_Q$. In the following, the computation of the bit LLRs is described for each considered modulation scheme in turn.

BPSK and QPSK
The signal constellations for BPSK and QPSK are illustrated in Figure 5. Gray mapping is assumed.
As seen in Figure 5, the BPSK constellation has no imaginary component, that is, $s_Q = 0$. According to (9), the bit LLR for the BPSK case therefore depends only on $\alpha$. In a straightforward manner, the two bit LLRs for QPSK are simplified in terms of $\alpha$ and $\beta$.
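The effect of dropping the common terms can be illustrated with a small numerical check. The sketch below is not the paper's derivation: it assumes a decision metric of the form $\alpha^2 + \beta^2 - 2\alpha s_I - 2\beta s_Q + (\gamma + 1)(s_I^2 + s_Q^2)$ (consistent with the two terms named above), unnormalized constellation points $\pm 1$ and $\pm 1 \pm j$, a bit value of 1 mapped to the positive amplitude, a max-log LLR, and an omitted common noise scaling. Under these assumptions the brute-force comparison over all constellation points collapses to $4\alpha$ and $4\beta$.

```python
def metric(alpha: float, beta: float, gamma: float, s: complex) -> float:
    """Negated decision metric (larger is more likely); common noise scaling omitted."""
    return -((alpha - s.real) ** 2 + (beta - s.imag) ** 2 + gamma * abs(s) ** 2)


def max_log_llr(alpha, beta, gamma, constellation, bit_index):
    """Brute force: best metric with bit = 1 minus best metric with bit = 0."""
    best1 = max(metric(alpha, beta, gamma, s)
                for bits, s in constellation.items() if bits[bit_index] == 1)
    best0 = max(metric(alpha, beta, gamma, s)
                for bits, s in constellation.items() if bits[bit_index] == 0)
    return best1 - best0


# Assumed mappings: bit 1 -> +1 on the corresponding axis (Gray mapping for QPSK).
BPSK = {(1,): complex(1, 0), (0,): complex(-1, 0)}
QPSK = {(b0, b1): complex(2 * b0 - 1, 2 * b1 - 1) for b0 in (0, 1) for b1 in (0, 1)}


def llr_bpsk(alpha: float) -> float:
    """Simplified BPSK soft output: only alpha is needed."""
    return 4.0 * alpha


def llr_qpsk(alpha: float, beta: float):
    """Simplified QPSK soft outputs: one bit from alpha, the other from beta."""
    return 4.0 * alpha, 4.0 * beta


if __name__ == "__main__":
    a, b, g = 0.3, -1.7, 0.9
    print(max_log_llr(a, b, g, QPSK, 0), max_log_llr(a, b, g, QPSK, 1), llr_qpsk(a, b))
```

Because all points of an equal-energy constellation share the same $(\gamma + 1)(s_I^2 + s_Q^2)$ term, $\gamma$ has no influence on the result, which is why it never needs to be computed in this case.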

16-QAM
The signal constellations for 16-QAM are illustrated in Figure 6. Gray mapping is also assumed. For the 16-QAM case, due to the unequal signal energies of the constellation points, the term $(\gamma + 1)(s_I^2 + s_Q^2)$ in (7) has to be considered in the comparisons. For the first bit $b_0$, we have

Because the compared constellation points are located in four quadrants and are symmetric, the most likely constellation point, which maximizes the decision metric, can be determined just by observing the signs of $\alpha$ and $\beta$. Therefore, there are merely four cases. If $\alpha > 0$ and $\beta > 0$,

The reason for the second step is that the points $s_2$ and $s_3$, and likewise $s_6$ and $s_7$, have the same $s_Q$ value. In the third step, the two maximum terms can always be cancelled out since the two finally chosen points have the same $s_I$ values. By the same method, $\Lambda(b_0)$ can be computed for the three other cases, that is, (i) $\alpha > 0$ and $\beta < 0$, (ii) $\alpha < 0$ and $\beta > 0$, and (iii) $\alpha < 0$ and $\beta < 0$. As another example, for the case $\alpha < 0$ and $\beta < 0$,

One general expression can be used to summarize all the results:

Similarly, the LLR for the second bit $b_1$ is

However, the other two bits $b_2$ and $b_3$ are slightly more complicated since the compared constellation points are not located in four different quadrants. For the fourth bit $b_3$, the eight compared signals are symmetric about the I-axis. Thus, four of them can be eliminated just by observing the sign of $\beta$. The remaining four points in each compared group are always simultaneously in the lower or the upper half plane and are symmetric about the Q-axis. Consequently, $s_Q$ can always be cancelled out, that is, $\Lambda(b_3)$ depends only on the sign, not on the absolute value, of $\beta$.
Otherwise,

In this case, in order to further reduce the complexity, the concept of a "bias point" can be introduced as in [4], which depends on the variable $\gamma$. The four compared signals originally within one quadrant are then separated into four new quadrants with the bias point acting as the new "origin." The new value of each signal is redefined as the difference between its original real value and the corresponding bias point. By observing the signs of the new values, the possible candidates can be further reduced from four to one. For $\alpha$, there are two bias points, one in the right-half plane and the other in the left-half plane. No bias point is needed for $\beta$ since it is already cancelled out in the decision metric. As a result, the procedure to compute $\Lambda(b_3)$ has the following two steps. First, calculate the bias points: $\mathrm{bias} = 2(1 + \gamma)$, $\alpha_1 = \alpha - \mathrm{bias}$, $\alpha_2 = \alpha + \mathrm{bias}$. Second, observe the signs of $\alpha_1$ and $\alpha_2$ to compute the right soft output. Consequently, there are four possible cases: (1) if ($\alpha_1 > 0$ and $\alpha_2 > 0$), (2) else if ($\alpha_1 > 0$ and $\alpha_2 < 0$), (3) else if ($\alpha_1 < 0$ and

In a similar approach, the LLR for the third bit is calculated. Nevertheless, the cancelled-out terms here are $s_I$ instead of $s_Q$:

The bias points are $\mathrm{bias} = 2(1 + \gamma)$, $\beta_1 = \beta - \mathrm{bias}$, $\beta_2 = \beta + \mathrm{bias}$. Then, the soft output is

In other words, all three variables $\alpha$, $\beta$, and $\gamma$ are required to compute the LLRs for 16-QAM modulation. However, through the bias point calculation approach, many comparisons among half the constellation size of signals are avoided.
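The role of the bias point $2(1 + \gamma)$ can be illustrated numerically. The sketch below is an independent illustration, not the paper's expressions: the bit labelling of Figure 6 is not reproduced here, so it assumes amplitude levels $\{-3, -1, +1, +3\}$ per axis with a hypothetical Gray labelling (sign bit, inner bit), the same assumed decision metric with common terms dropped, $\gamma + 1 > 0$, and a max-log LLR without noise scaling. It compares a brute-force search over all 16 points with closed forms that only inspect the signs of $\alpha \mp \mathrm{bias}$ and $\beta \mp \mathrm{bias}$.

```python
import math

# Assumed per-axis Gray labelling: (sign bit, inner bit) -> amplitude.
LEVELS = {(0, 0): -3.0, (0, 1): -1.0, (1, 1): 1.0, (1, 0): 3.0}


def axis_metric(x: float, gamma: float, s: float) -> float:
    """Per-axis part of the metric once the common alpha^2 + beta^2 term is dropped."""
    return 2.0 * x * s - (gamma + 1.0) * s * s


def brute_force_llr(alpha: float, beta: float, gamma: float, bit_index: int) -> float:
    """Max-log LLR by scanning all 16 points; bits = (sign_I, inner_I, sign_Q, inner_Q)."""
    best = {0: -math.inf, 1: -math.inf}
    for bits_i, s_i in LEVELS.items():
        for bits_q, s_q in LEVELS.items():
            value = (bits_i + bits_q)[bit_index]
            m = axis_metric(alpha, gamma, s_i) + axis_metric(beta, gamma, s_q)
            best[value] = max(best[value], m)
    return best[1] - best[0]


def axis_llrs_with_bias(x: float, gamma: float):
    """Closed-form LLRs of the two bits carried by one axis, using the bias point."""
    bias = 2.0 * (1.0 + gamma)
    x1, x2 = x - bias, x + bias
    # Sign bit: slope 4 near the origin, slope 8 beyond either bias point.
    llr_sign = 4.0 * x + 4.0 * max(x1, 0.0) + 4.0 * min(x2, 0.0)
    # Inner bit: positive when |x| is below the bias point.
    llr_inner = -4.0 * x1 if x >= 0 else 4.0 * x2
    return llr_sign, llr_inner


def bias_point_llrs(alpha: float, beta: float, gamma: float):
    """All four bit LLRs without any search over constellation points."""
    return axis_llrs_with_bias(alpha, gamma) + axis_llrs_with_bias(beta, gamma)


if __name__ == "__main__":
    a, b, g = 2.7, -0.4, 0.2
    print([brute_force_llr(a, b, g, i) for i in range(4)])
    print(bias_point_llrs(a, b, g))
```

With this labelling, the bits carried by the in-phase axis need only $\alpha$ and $\gamma$, and those carried by the quadrature axis need only $\beta$ and $\gamma$, so each soft output is obtained from a couple of sign tests instead of comparisons over half the constellation.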

Complexity analysis
In this section, the hardware complexities of the original and the proposed maximum likelihood decoding algorithms are compared. Complexity is measured as the number of multiplications and additions per decoded symbol. The following assumptions are used as in [4]. The comparison results are displayed in Table 1. For example, in the BPSK case the proposed algorithm only needs $\alpha$ to obtain the soft output $\Lambda(b)$. For the symbol $c_1$ in (8), computing the real parts of $r_1^j h_{2,j}^*$ and $(r_2^j)^* h_{1,j}$ for two transmit antennas, $j = 1, 2$, needs $(2N - 1) \times 4 = 8N - 4$ operations. Three more additions are necessary to obtain $\alpha$; thus, the overall decoding complexity is $(8N - 4) + 3 = 8N - 1$ operations. In the original algorithm, for the symbol $c_1$, $\alpha + j\beta$ for two transmit antennas requires $(8N - 1) \times 2 = 16N - 2$ operations. Additionally, $(2N - 1) \times 4 + 1 = 8N - 3$ operations are needed for $\gamma$ and $2 \times (N - 1) + 2 = 2N$ operations for each compared signal $s_k$; another three additions are required for the final soft output (see (1) and (3)). The total number of operations is the sum of these terms. By a similar method, the total number of operations for QPSK and 16-QAM with both the original and the proposed algorithms can also be obtained.
As observed in Table 1, the proposed soft decoding algorithm for STBC with two transmit antennas reduces the total number of operations by 52% to 72%. Similar results are expected for other transmission matrices with more transmit antennas. This significant computation reduction will consequently lead to much lower power consumption in a VLSI implementation.
According to our simulation results under various configurations, the proposed simplified soft decoding approach achieves exactly the same performance as the original maximum likelihood algorithm for the space-time block decoder shown in Section 2; the corresponding results are omitted here. For the details of the BTC decoder, we refer the reader to [19].

COMPLEXITY REDUCTION OF BLOCK TURBO DECODER
Since our major goal in this paper is to reduce the decoding complexity of the space-time block turbo-coded system, the simplified decoding algorithm for the space-time block decoder was proposed and evaluated in Section 3. In this section, we investigate the complexity reduction issues for the block turbo decoder.

Iterative decoding of BTCs based on Chase algorithm
The BTC, also called turbo product code, is decoded by sequentially decoding the rows and columns, based on the Chase algorithm [9], in order to reduce the decoding complexity. The main idea of the Chase algorithm is to limit the number of reviewed codewords to a codeword subset $\Omega$ formed by the following steps.
Step 1: Determine the $p$ least reliable positions using the channel information $R$.
Step 2: Form the $2^p$ binary $n$-tuple test patterns $T$ at the $p$ least reliable positions.
Step 3: Decode the test sequences $Z_q = r \oplus t_q$ using an algebraic decoder to form the subset $\Omega$.
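The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the decoder used in the paper: a (7, 4) Hamming code with a simple syndrome decoder stands in for the extended Hamming component code, $r$ is taken to be the hard decision of $R$, bit 1 is mapped to $+1$, and the function names are hypothetical.

```python
from itertools import product


def hamming74_decode(bits):
    """Syndrome decoder for the (7, 4) Hamming code whose parity-check
    columns are the binary representations of 1..7 (stand-in algebraic decoder)."""
    syndrome = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            syndrome ^= position
    decoded = list(bits)
    if syndrome:
        decoded[syndrome - 1] ^= 1  # correct the single flagged position
    return decoded


def chase_decode(R, p, decode=hamming74_decode):
    """Return the closest codeword found and the candidate subset Omega."""
    n = len(R)
    hard = [1 if r > 0 else 0 for r in R]
    # Step 1: the p least reliable positions have the smallest |r_j|.
    weak = sorted(range(n), key=lambda j: abs(R[j]))[:p]
    omega = set()
    # Step 2: all 2^p test patterns supported on the weak positions.
    for flips in product((0, 1), repeat=p):
        test_sequence = list(hard)
        for position, flip in zip(weak, flips):
            test_sequence[position] ^= flip  # z_q = r XOR t_q
        # Step 3: algebraic decoding of each test sequence.
        omega.add(tuple(decode(test_sequence)))

    def squared_distance(codeword):
        return sum((r - (2 * b - 1)) ** 2 for r, b in zip(R, codeword))

    decision = min(omega, key=squared_distance)
    return list(decision), omega


if __name__ == "__main__":
    # All-zero codeword sent as -1, with two weakly wrong positions.
    R = [-1.0, -0.9, 0.2, -1.1, 0.1, -0.8, -1.2]
    print(chase_decode(R, p=2)[0])
```

In this example plain hard-decision decoding is misled by the two unreliable positions, whereas scanning the four test patterns recovers the transmitted all-zero codeword.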
To maintain the near-optimum performance, the iterative SISO approach is employed. The soft input to the decoder $R(m)$ is

$$R(m) = R + \alpha(m) W(m),$$

where $m$ is the decoding step, $R$ is the received channel information, $W(m)$ is the extrinsic information input to the next iteration, and $\alpha(m)$ is the scaling factor, which takes a small value in the first decoding step and increases as the BER tends to zero. The extrinsic information is the difference between the soft output (normalized LLR) and the soft input of the decoder and is calculated by (25) or, when $C$ does not exist in the considered subset, by (26), where $D$ is the maximum likelihood decoded (MLD) codeword, $C$ is the competing codeword of $D$, that is, the codeword at minimum distance from $R$ such that $c_j \ne d_j$, and $\beta$ is the empirically determined reliability factor.
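The extrinsic information computation can be sketched in code. Since the expressions themselves are not reproduced here, the sketch assumes the commonly used form of this rule: the soft output is $d_j (|R - C|^2 - |R - D|^2)/4$ when a competing codeword exists and $\beta d_j$ otherwise, and the extrinsic value is the soft output minus the soft input $r_j$. Codewords are represented in $\pm 1$ form and all names are illustrative.

```python
def squared_distance(R, codeword):
    """Squared Euclidean distance between the soft input and a +/-1 codeword."""
    return sum((r - c) ** 2 for r, c in zip(R, codeword))


def extrinsic_information(R, D, omega, beta):
    """Extrinsic values w_j for decision D given the candidate subset omega.

    Assumed rule: distance-based soft output if a competitor with c_j != d_j
    exists in omega, reliability factor beta otherwise.
    """
    dist_d = squared_distance(R, D)
    W = []
    for j, (r_j, d_j) in enumerate(zip(R, D)):
        competitors = [c for c in omega if c[j] != d_j]
        if competitors:
            dist_c = min(squared_distance(R, c) for c in competitors)
            soft_output = (dist_c - dist_d) / 4.0 * d_j
        else:
            soft_output = beta * d_j
        W.append(soft_output - r_j)
    return W


def next_soft_input(R, W, alpha):
    """Soft input for the next decoding step: R + alpha * W."""
    return [r + alpha * w for r, w in zip(R, W)]


if __name__ == "__main__":
    R = [0.5, -1.0]
    D = [1, -1]
    omega = [[1, -1], [-1, -1]]
    W = extrinsic_information(R, D, omega, beta=0.5)
    print(W, next_soft_input(R, W, alpha=0.5))
```

The inner search over `omega` is repeated for every symbol position, which is the cost that the complexity reduction techniques discussed next aim to avoid.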

Complexity reduction techniques
For the block turbo decoder described above, there are two major sources of complexity. If we consider the decoding of a column of the matrix, the first source lies in step 3 of the procedure to find the codeword subset $\Omega$. For this column, each of the $q = 2^p$ test sequences requires one syndrome decoding, that is, the decoding complexity of one column for this procedure is $q \times m$ times the complexity of a syndrome decoder, where $m$ stands for the number of decoding steps.
The second source of complexity is the extensive computation of the extrinsic information $W(m)$ associated with the MLD codeword $D$. For each $w_j$, this procedure has to search among the $q$ codewords in the subset $\Omega$ for a competing codeword $C$ at the smallest distance from $R$ such that $c_j \ne d_j$. Thus, $D$ is the same for all symbols of $R$, while $C$ may be different for each symbol. If we find $C$, we use (25); otherwise we use (26) to compute $w_j$. The decoding complexity of one column for this second procedure is $q \times n \times m$ times the complexity of an elementary compare-and-save operation, where $n$ stands for the block length. Therefore, in order to reduce the complexity of the block turbo decoder, we can either decrease the number of test patterns $q$ or simplify the extrinsic information computation.

Simplifying the extrinsic information computation
We first look at the second possibility. To avoid searching for the competing codeword $C$ for each symbol of the block code, $C$ can be replaced by the MLD codeword of the last decoding step, $D(m - 1)$, when computing the extrinsic information; this is called the gradient algorithm [12]. In terms of complexity reduction, this is very effective since the decoding complexity of one column for the second procedure is reduced to $n \times m$ times the complexity of an elementary compare-and-save operation, that is, the complexity is decreased by more than ten times. Nevertheless, its drawback is that the replacement competing codeword $C = D(m - 1)$ is not always a codeword. The decoder guarantees that we have codewords along the rows (columns) of the matrix in the current decoding step but not along the columns (rows) in the next decoding step. Thus, there is no guarantee that $W(m + 1)$ has the same interpretation in this gradient algorithm as in the near-optimum one.
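The operation counts discussed in this section can be compared with a few lines of arithmetic. The sketch below only evaluates the expressions $q \times m$, $q \times n \times m$, and $n \times m$; the example values ($p = 4$, hence $q = 16$ test patterns, block length $n = 32$, and $m = 8$ decoding steps) are assumptions chosen for illustration.

```python
def per_column_costs(p: int, n: int, m: int) -> dict:
    """Per-column operation counts for m decoding steps, in units of one
    syndrome decoding or one elementary compare-and-save operation."""
    q = 2 ** p  # number of test patterns
    return {
        "syndrome_decodings": q * m,
        "competitor_search_ops": q * n * m,
        "gradient_ops": n * m,
    }


if __name__ == "__main__":
    costs = per_column_costs(p=4, n=32, m=8)
    print(costs)
    print("reduction factor:", costs["competitor_search_ops"] // costs["gradient_ops"])
```

The reduction factor equals $q$ regardless of $n$ and $m$, so with 16 test patterns the extrinsic information computation becomes 16 times cheaper, consistent with the "more than ten times" stated above.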
A new gradient algorithm is proposed to compute the extrinsic information without searching for the competing codeword $C$ extensively [15]. The main idea is to divide the codeword matrix $[D(m)]$ into a codeword matrix for columns, $[D^{col}(m)]$, and one for rows, $[D^{row}(m)]$. We consider the $m$th decoding step of the BTC and suppose that we start by decoding the columns of the BTC. For odd values of $m$, the decoder processes the columns of the block turbo code as follows:

when $d_j^{col}(m) \ne d_j^{col}(m - 1)$, otherwise we use

while for even values of $m$, the decoder processes the rows of the BTC

when $d_j^{row}(m) \ne d_j^{row}(m - 1)$, otherwise we use

Here is another interpretation of this algorithm. Since the rows and columns of the BTC are always decoded alternately, one after another, the proposed algorithm can be equivalently considered as using $D(m - 2)$ instead of $D(m - 1)$ to compute the extrinsic information $W(m + 1)$: for $m \ge 2$,

when $d_j(m) \ne d_j(m - 2)$, otherwise we use

When $m < 2$, the nongradient algorithm can be used. Compared to the gradient algorithm in [12], this new algorithm guarantees that the matrix $[D^{col}(m - 1)]$ or $[D^{row}(m - 1)]$ is always a codeword. As a result, the performance is better. In fact, an extra 0.3 dB to 0.4 dB of coding gain is obtained. The hardware overhead is negligible since only one small buffer is needed to store the single-bit codeword information.

Reducing the number of test patterns
For the first possibility, using the algebraic structure of the extended Hamming codes that constitute the BTCs and the syndrome of a received word in a component code, one can show that the required number $N(p, d)$ of test patterns is as follows [11]: (1) no error detection:
The performance comparison between our new gradient algorithm and that in [12] for the [exHamming(32, 26, 4)]$^2$ and [exHamming(64, 57, 4)]$^2$ BTCs is shown in Figures 7 and 8, respectively. From these two figures, extra coding gain can be clearly observed with our new gradient algorithm using separate row and column MLD codeword matrices compared with that using only one codeword matrix. At a BER of $10^{-5}$, the extra coding gain is 0.4 dB for the [exHamming(32, 26, 4)]$^2$ BTC and 0.3 dB for the [exHamming(64, 57, 4)]$^2$ BTC at the 4th iteration. Compared to the original near-optimum algorithm using 16 test patterns, using only 8 test patterns introduces negligible performance degradation (less than 0.1 dB for both the [exHamming(32, 26, 4)]$^2$ and the [exHamming(64, 57, 4)]$^2$ block turbo codes). This verifies the statement that reducing the number of test patterns from $2^p$ down to $N(p, d)$ for extended Hamming codes introduces no performance loss.
By implementing the proposed algorithm, the coding gain loss is reduced to 0.55 dB at a BER of $10^{-5}$ for the [exHamming(32, 26, 4)]$^2$ code. For the [exHamming(64, 57, 4)]$^2$ block turbo code, the result is even better and the degradation is only 0.5 dB at the 4th iteration. This is a very good trade-off between complexity and performance since it reduces the complexity of the block turbo decoder by more than ten times.
Other important complexity reduction issues such as how to adaptively choose the scaling factors α and β under various simulation situations and memory reduction techniques have been addressed in [14,15].

CONCLUSIONS
In this paper, a new efficient scheme for the soft decoding of STBC is presented. It achieves the same optimum performance with up to 70% reduction in hardware complexity. Because this space-time block decoder provides soft information, it can be concatenated flexibly with any soft-input soft-output decoder, with much lower power consumption. The simulation results using the space-time block turbo-coded system show that the simplified algorithm is correct. Compared to the most recent block turbo code for space-time systems, this serial concatenation scheme is still more favorable in terms of bit error performance and complexity under the same spectral efficiency. Decoding complexity reduction techniques are also explored for the considered block turbo code, including test pattern reduction and an efficient alternative computation of the extrinsic information. Consequently, the decoding complexity is reduced by approximately ten times with a coding gain loss of 0.5 dB at a BER of $10^{-5}$ over the AWGN channel. Thus, the VLSI implementation of the space-time block turbo-coded system with low complexity and acceptable error correction capability is possible.

Figure 4: BER comparison for BTC-STBC system and BTC-Diversity system: 4R bits/s/Hz, 4 iterations, two transmit antennas and (a) two or (b) one receive antennas.

Table 1: Complexity comparison between original and proposed decoding algorithm. The signal energies for BPSK and QPSK are assumed to be known in advance and their computations are excluded from the complexity count. For the 16-QAM case, the signal energies and their multiplication with $\gamma$ are only counted 4 instead of 16 times due to the inherent symmetry property.