Parallelism Efficiency in Convolutional Turbo Decoding
© Olivier Muller et al. 2010
Received: 25 May 2010
Accepted: 21 November 2010
Published: 6 December 2010
Parallel turbo decoding is becoming mandatory in order to achieve high throughput and to reduce latency, both crucial in emerging digital communication applications. This paper explores and analyzes parallelism techniques in convolutional turbo decoding with the BCJR algorithm. A three-level structured classification of parallelism techniques is proposed and discussed: BCJR metric level parallelism, BCJR-SISO decoder level parallelism, and Turbo-decoder level parallelism. The second level of this classification is thoroughly analyzed on the basis of parallelism efficiency criteria, since it offers the best tradeoff between achievable parallelism degree and area overhead. At this level, and for subblock parallelism, we illustrate how subblock initializations are more efficient with the message passing technique than with the acquisition approach. Moreover, subblock parallelism becomes quite inefficient for high subblock parallelism degrees. Conversely, component-decoder parallelism efficiency increases with the subblock parallelism degree. This efficiency also depends on the BCJR computation scheme and on propagation time. We show that component-decoder parallelism using shuffled decoding maximizes architecture efficiency and is hence well suited for hardware implementation of high-throughput turbo decoders.
Turbo decoding is increasingly proposed in emerging and future digital communication systems, for example, fiber-optic communication, wireless communication, and storage applications. Practical turbo decoder designs, such as those used for the IEEE 802.16e, Long-Term Evolution (LTE), or IEEE 802.11 standards, require high data throughput (several hundred Mbps) and low latency (about ten ms). To cope with these requirements, turbo decoder implementations have to be massively parallel. Therefore, the parallelism involved in implementations of this iterative process has to be carefully analyzed under real industrial constraints such as area and throughput.
In iterative decoding algorithms, the underlying turbo principle relies on extrinsic information exchanges and iterative processing between different Soft-Input Soft-Output (SISO) modules. Using input information and a priori extrinsic information, each SISO module computes a posteriori extrinsic information. This constitutes the a priori information for the other modules and is exchanged via interleaving and deinterleaving processes. For convolutional turbo codes, the SISO modules execute the BCJR, or forward-backward, algorithm, which is the optimal algorithm for maximum a posteriori (MAP) decoding of convolutional codes. A BCJR SISO first computes branch metrics (or γ metrics), which represent the probability of a transition occurring between two trellis states. It then computes forward and backward recursions. The forward recursion (or α recursion) computes a trellis section (i.e., the probability of all states of the trellis) using the previous trellis section and the branch metrics between these two sections, while the backward recursion (or β recursion) computes a trellis section using the future trellis section and the branch metrics between these two sections. Finally, extrinsic information is computed from the forward recursion, the backward recursion, and the extrinsic part of the branch metrics. Because of the iterative decoding process and the complexity of the BCJR algorithm (bidirectional recursive computing), turbo decoders must exploit parallelism as fully as possible in order to achieve the high data rates required in present and future applications.
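The forward recursion, backward recursion, and soft-output computation described above can be sketched on a toy example. The following is a minimal illustration, assuming the max-log-MAP approximation, a hypothetical 2-state trellis (`TOY_TRELLIS`), an encoder starting in state 0, and an unknown final state. It returns a posteriori log-likelihood ratios; a real SISO would additionally remove the systematic and a priori contributions to obtain extrinsic information. All names are illustrative and do not come from the paper.

```python
import math

NEG_INF = -math.inf

# Hypothetical 2-state trellis: (from_state, to_state, input_bit).
TOY_TRELLIS = [(0, 0, 0), (0, 1, 1), (1, 1, 0), (1, 0, 1)]


def max_log_map(transitions, gammas, num_states):
    """Return one log-likelihood ratio per trellis section.

    gammas[k][t] is the log-domain branch metric of transition t in section k.
    """
    n = len(gammas)

    # Forward (alpha) recursion; the encoder is assumed to start in state 0.
    alpha = [[NEG_INF] * num_states for _ in range(n + 1)]
    alpha[0][0] = 0.0
    for k in range(n):
        for t, (s, s_next, _) in enumerate(transitions):
            candidate = alpha[k][s] + gammas[k][t]
            alpha[k + 1][s_next] = max(alpha[k + 1][s_next], candidate)

    # Backward (beta) recursion; the final state is assumed unknown (uniform).
    beta = [[NEG_INF] * num_states for _ in range(n + 1)]
    beta[n] = [0.0] * num_states
    for k in range(n - 1, -1, -1):
        for t, (s, s_next, _) in enumerate(transitions):
            candidate = gammas[k][t] + beta[k + 1][s_next]
            beta[k][s] = max(beta[k][s], candidate)

    # Soft output: best path metric with bit 1 minus best path metric with bit 0.
    llrs = []
    for k in range(n):
        best = {0: NEG_INF, 1: NEG_INF}
        for t, (s, s_next, bit) in enumerate(transitions):
            metric = alpha[k][s] + gammas[k][t] + beta[k + 1][s_next]
            best[bit] = max(best[bit], metric)
        llrs.append(best[1] - best[0])
    return llrs
```

The two recursions are independent of each other until the soft-output step, which is what the BCJR computation schemes discussed later exploit.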
Parallelism in convolutional turbo decoding has been widely investigated in the literature over the last few years. At a fine-grain level, parallelism was investigated inside the BCJR algorithm on elementary computations [3, 4]. At a coarse grain level, explored parallelism techniques are based on frame decoding schemes. State-of-the-art research is mainly focused on parallel processing of frame subblocks and on the parallel processing issues, such as computation complexity [3, 5], memory saving [3, 5, 6], initializations [5, 7], or on-chip communication requirements [8, 9]. Recently, a new parallelism technique named shuffled decoding was introduced to process in parallel the component decoders [10, 11]. However, interactions between these diverse parallelism techniques and different granularity levels are rarely discussed. Thus, predicting the combination of these parallelism techniques, in order to reach optimal parallelism for given performance requirements, becomes a complex task of hardware implementation.
In this paper, we propose a three-level classification of existing parallelism techniques in convolutional turbo decoding with the BCJR algorithm and analyze more thoroughly the second level which includes subblock parallelism and component-decoder parallelism. Performance analyses of these techniques are conducted separately on the basis of parallelism efficiency criteria. Then, efficiency of combined parallelism techniques is improved collectively by taking into account interactions between these parallelism techniques.
The rest of the paper is organized as follows. The next section presents definitions of parallelism metrics that will be used in the paper. Section 3 analyzes all parallel processing techniques of turbo decoding and proposes a three-level classification of these techniques. In Sections 4 and 5, subblock parallelism and component-decoder parallelism (shuffled decoding) are, respectively, analyzed on the basis of parallelism efficiency criteria. Finally, Section 6 summarizes the obtained results and concludes the paper.
2. Definitions of Parallelism Metrics
Speedup is defined as $S(N) = T(1)/T(N)$, where $T(1)$ is the execution time of the algorithm without parallelism and $T(N)$ its execution time using a parallelism degree $N$. The speedup generally follows the well-known Amdahl's law, $S(N) = 1/(s + p/N)$, where $s$ is the sequential part ratio and $p$ the parallelizable part ratio (such that $s + p = 1$).
Parallelism efficiency is defined as $E(N) = \eta(N)/\eta(1)$, where $\eta(N)$ is the efficiency of a system using a parallelism degree $N$ for the studied parallelism and $\eta(1)$ its efficiency without this parallelism. This metric can also be seen as the ratio between the decoding speedup $S(N)$ of a parallelism and its area overhead $A(N)/A(1)$, where $A(N)$ (resp. $A(1)$) is the chip area of the system using a parallelism degree $N$ (resp. without parallelism).
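The two metrics can be made concrete with a small numerical sketch. The function names below are illustrative, and the linear area model in `area_overhead` (only a fixed fraction of the reference area is duplicated with each added unit) is an assumption introduced for this example, not a model taken from the paper.

```python
def amdahl_speedup(sequential_ratio, degree):
    """Speedup predicted by Amdahl's law for a given parallelism degree."""
    if not 0.0 <= sequential_ratio <= 1.0:
        raise ValueError("sequential_ratio must lie in [0, 1]")
    if degree < 1:
        raise ValueError("degree must be at least 1")
    parallel_ratio = 1.0 - sequential_ratio
    return 1.0 / (sequential_ratio + parallel_ratio / degree)


def area_overhead(duplicated_area_ratio, degree):
    """Assumed model: only a fraction of the reference area is duplicated."""
    return 1.0 + duplicated_area_ratio * (degree - 1)


def parallelism_efficiency(speedup, overhead):
    """Efficiency as the ratio between speedup and area overhead."""
    return speedup / overhead
```

Under this assumed model, an efficiency above 1 means the speedup grows faster than the chip area, which is the situation the rest of the analysis looks for.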
3. Parallel Processing Levels
In turbo decoding with the BCJR algorithm, parallelism techniques can be classified at three levels: BCJR metric level parallelism, BCJR SISO decoder level parallelism, and Turbo-decoder level parallelism. The first (lowest) parallelism level concerns elementary symbol computations inside a SISO decoder processing the BCJR algorithm. Parallelism between these SISO decoders, inside one turbo decoder, belongs to the second parallelism level. The third (highest) parallelism level duplicates the turbo decoder itself.
3.1. BCJR Metric Level Parallelism
The BCJR metric level parallelism concerns the processing of all metrics involved in the decoding of each received symbol inside a BCJR SISO decoder. It exploits the inherent parallelism of the trellis structure [3, 4], and also the parallelism of BCJR computations [3–5].
3.1.1. Parallelism of Trellis Transitions
Trellis-transition parallelism can easily be extracted from the trellis structure, as the same operations are repeated for all transition pairs. In the log domain, these operations are either ACS operations (Add-Compare-Select) for the max-log-MAP algorithm or ACSO operations (ACS with a correction offset) for the log-MAP algorithm.
Each BCJR computation requires a number of ACS-like operations equal to half the number of transitions per trellis section. Thus, this number, which depends on the structure of the convolutional code, constitutes the upper bound of the trellis-transition parallelism degree.
Furthermore this parallelism implies low area overhead (as only the ACS units have to be duplicated). In particular, no additional memories are required since all parallelized operations are executed on the same trellis section and, in consequence, on the same data.
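As an illustration of the two elementary operations and of the upper bound stated above, the following sketch assumes the usual Jacobian-logarithm form of the correction offset, log(1 + exp(-|a - b|)). Function names and the example trellis sizes are illustrative.

```python
import math


def acs(metric_a, branch_a, metric_b, branch_b):
    """Add-Compare-Select, as used by max-log-MAP."""
    return max(metric_a + branch_a, metric_b + branch_b)


def acso(metric_a, branch_a, metric_b, branch_b):
    """ACS with a correction offset, as used by log-MAP."""
    a = metric_a + branch_a
    b = metric_b + branch_b
    # The offset makes the result equal to log(exp(a) + exp(b)).
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_transition_parallelism(num_states, transitions_per_state):
    """Upper bound: half the number of transitions per trellis section."""
    return num_states * transitions_per_state // 2
```

For example, an 8-state binary trellis (2 transitions per state) allows at most 8 parallel ACS-like operations per BCJR computation, while an 8-state double-binary trellis (4 transitions per state) allows 16.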
3.1.2. Parallelism of BCJR Computations
Parallel computation of the backward recursion and extrinsic information was proposed with the original forward-backward scheme, depicted in Figure 1(b). In this scheme, the BCJR computation parallelism degree is equal to one in the forward part and two in the backward part. It thus enables a speedup of 1.5 in comparison with the sequential BCJR algorithm (whose BCJR computation parallelism degree is always one, as depicted in Figure 1(a)), and its state metric memory depth is equal to the frame length. To increase this parallelism degree, several schemes were proposed. The most common is the butterfly scheme (Figure 1(c)), which doubles the parallelism degree of the original scheme by computing the forward and backward recursions in parallel (degree 2 in the first half of the butterfly and degree 4 in the second half), achieving a speedup of 3. Various schemes have been derived from this butterfly scheme. For example, the replica butterfly scheme (Figure 1(d)) extends the extrinsic information computations to the first half of the butterfly. This scheme was proposed to improve the convergence of shuffled decoding, but it requires four BCJR computation units (used full time for a speedup of 3) and state metric storage between two consecutive iterations, since the extrinsic information computations in the first half of the butterfly require one recursion computation (forward or backward) and take the other one (backward or forward) from the state metric storage. Globally, the memory size is the same with this scheme as with the butterfly scheme or the forward-backward one. A further example is the forward butterfly scheme (Figure 1(e)), which performs extrinsic information computation only in the forward direction. Consequently, its BCJR computation parallelism degree is three in both halves of the butterfly. Furthermore, the state metric memory is half as deep in this case.
The resulting BCJR-SISO decoder requires a smaller area (typically 30% less than the other butterfly schemes), but the iterative process converges more slowly (see Section 4). Note that, with all these schemes, BCJR computation parallelism is achieved without any memory increase and only BCJR computation resources have to be duplicated.
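The speedups quoted for these schemes follow from simple bookkeeping, sketched below. Each scheme is modeled as a list of phases given as (duration in frame lengths, BCJR computation parallelism degree), with the sequential algorithm taking three frame lengths (forward, backward, extrinsic). This phase encoding is an assumption made for illustration, derived from the degrees stated in the text rather than from Figure 1.

```python
from fractions import Fraction

# Each phase: (duration in frame lengths, BCJR computation parallelism degree).
SCHEMES = {
    "sequential": [(Fraction(3), 1)],
    "forward-backward": [(Fraction(1), 1), (Fraction(1), 2)],
    "butterfly": [(Fraction(1, 2), 2), (Fraction(1, 2), 4)],
    "replica butterfly": [(Fraction(1, 2), 4), (Fraction(1, 2), 4)],
    "forward butterfly": [(Fraction(1, 2), 3), (Fraction(1, 2), 3)],
}


def scheme_metrics(phases):
    """Return speedup, required computation units, and total work of a scheme."""
    duration = sum(length for length, _ in phases)
    work = sum(length * degree for length, degree in phases)
    units = max(degree for _, degree in phases)
    sequential_duration = sum(length for length, _ in SCHEMES["sequential"])
    return {
        "speedup": sequential_duration / duration,
        "units": units,
        "work": work,
    }
```

With this encoding, the forward-backward scheme yields a speedup of 3/2 and the three butterfly variants a speedup of 3; the replica butterfly performs four frame lengths of work instead of three because it computes extrinsic information in both halves.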
As a conclusion, BCJR metric level parallelism (trellis-transition parallelism and BCJR computation parallelism) induces a minimal area overhead as it does not affect memory size, which occupies most of the area in a turbo decoder circuit. Nevertheless the parallelism degree is limited by the code structure and the decoding algorithm. Thus, achieving higher parallelism degree implies exploring higher processing levels.
3.2. BCJR-SISO Decoder Level Parallelism
The second level of parallelism concerns the SISO decoder level. It consists in using multiple SISO decoders, each executing the BCJR algorithm and processing a subblock of the same frame in one of the two component decoders. At this level, parallelism can be applied on subblocks, on component decoders, or on both.
3.2.1. Subblock Parallelism
In subblock parallelism, each frame is divided into subblocks and each subblock is then processed on a BCJR-SISO decoder using adequate initializations [6, 16, 17]. In fact, only two different initialization methods exist, namely acquisition and message passing (also known as next-iteration initialization); they are analyzed in Section 4.
Besides duplication of BCJR-SISO decoders, this parallelism leads to an on-chip communication issue related to the interleaver. Indeed, interleaving has to be parallelized in order to extend the communication bandwidth proportionally. In consequence, the complexity of the communication structure (and also the communication time) increases with the parallelism degree, and access conflicts may occur (even if the latest standards use conflict-free interleavers). The latter problem can be resolved using an interleaving mapping to avoid conflicts or using a communication structure to manage conflicts on the fly.
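The notion of access conflict can be illustrated with a short check. The sketch below assumes that the frame is split into equal subblocks, that all SISO decoders advance in lockstep (one symbol per time step), and that extrinsic memory is split into one bank per subblock with bank index given by integer division of the interleaved address by the subblock length. These conventions are common but are assumptions here, not the paper's architecture.

```python
def count_bank_conflicts(interleaver, num_subblocks):
    """Count extra accesses hitting an already-addressed bank at the same step.

    interleaver[i] is the interleaved address of the symbol at position i.
    """
    frame_length = len(interleaver)
    if frame_length % num_subblocks != 0:
        raise ValueError("frame length must be a multiple of num_subblocks")
    subblock_length = frame_length // num_subblocks

    conflicts = 0
    for step in range(subblock_length):
        # Bank addressed by each SISO decoder at this time step.
        banks = [
            interleaver[k * subblock_length + step] // subblock_length
            for k in range(num_subblocks)
        ]
        conflicts += len(banks) - len(set(banks))
    return conflicts
```

For an 8-symbol frame and two subblocks, the permutation `[0, 3, 6, 1, 4, 7, 2, 5]` is conflict-free under this mapping, whereas `[0, 4, 1, 5, 2, 6, 3, 7]` produces a conflict at every time step.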
3.2.2. Component-Decoder Parallelism
Component-decoder parallelism is a new kind of parallelism that has become operational with the introduction of the shuffled decoding technique. The basic idea of shuffled decoding is to execute all component decoders in parallel and to exchange extrinsic information as soon as it is created. Using this method, the iteration period is halved in comparison with the originally proposed serial turbo decoding. Nevertheless, component-decoder parallelism may require additional iterations, as explained in Section 5.
This level of parallelism can reach a reasonable parallelism degree and preserve memory area. Due to its great potential for scalability and mastered area overhead, new explorations are focused on this second level of parallelism.
3.3. Turbo-Decoder Level Parallelism
The highest level of parallelism duplicates the whole turbo decoder to process frames in parallel. Iteration parallelism occurs in a pipelined fashion (each parallel instance works on a different frame and transmits its results to the instance in charge of the next iteration) with a maximum pipeline depth equal to twice the number of iterations, whereas frame parallelism (each parallel instance completely decodes its own frames) has no limit on the parallelism degree. Nevertheless, turbo-decoder level parallelism is too area-expensive (all memories and computation resources are duplicated) and offers no gain in decoding latency; it is therefore not considered in this work.
4. Initialization in Subblock Parallelism
As described in Section 3, subblock parallelism takes place at frame level and requires initializations. Proper initialization is mandatory to achieve correct decoding since information on recursion metrics is available at frame ending points, but not at subblock ending points. Estimation of undetermined information can be obtained either by acquisition or by message passing between neighboring subblocks.
4.1. Initialization by Acquisition
This widely used initialization method consists in estimating recursion metrics by means of an overlapping region called the acquisition window or prolog. Starting from a trellis section where all the states are initialized to a uniform constant, the acquisition window is processed over its whole length, denoted AL, to provide reliable recursion metrics at subblock ending points. This acquisition length is determined at design time in order to achieve negligible error-rate degradation. It is fixed according to the number of redundancy bits in the prolog (typically 6 bits). Another empirical rule recommends an acquisition length of 3 to 5 times the constraint length of the code.
This speedup follows Amdahl's law, where the sequential and incompressible part is mainly related to the computation of acquisition windows, which are mandatory for reliable initializations. Thus subblock parallelism with initialization by acquisition encounters a throughput ceiling value and the maximum speedup (obtained when equals ) is roughly equal to (by neglecting ).
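The ceiling effect can be illustrated with a deliberately simplified timing model. The sketch below assumes that the decoding time of one iteration is proportional to the subblock length plus the acquisition length AL, and that no acquisition is needed without subblock parallelism. This model is an assumption made for illustration and is not the paper's speedup equation; the function names are hypothetical.

```python
def acquisition_length_range(constraint_length):
    """Empirical rule: 3 to 5 times the constraint length of the code."""
    return 3 * constraint_length, 5 * constraint_length


def acquisition_speedup(frame_length, acquisition_length, degree):
    """Assumed model: each subblock costs its own length plus the prolog."""
    if degree < 1:
        raise ValueError("degree must be at least 1")
    if degree == 1:
        return 1.0  # no subblock boundary, hence no acquisition window
    return frame_length / (frame_length / degree + acquisition_length)
```

Under this model, the acquisition term does not shrink when subblocks are added, so the speedup can never exceed the ratio of the frame length to the acquisition length, however large the parallelism degree.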
4.2. Initialization by Message Passing
The second method dynamically initializes a subblock with recursion metrics computed during the last iteration in the neighboring subblocks. This technique therefore requires no additional memory, only on-chip communication resources between BCJR SISO units. It has been shown that the asymptotic error rate, that is, the error rate achieved with an infinite number of iterations, is not affected by the message passing approach, whatever the parallelism degree. Consequently, initialization by message passing can be used without error-correction performance degradation, at the expense of additional iterations.
Let be the mean number of iterations for an architecture with subblocks communicating with the message passing technique. For accuracy reasons, is obtained with the genie stopping criterion, that is, the decoding is stopped immediately after an iteration returning the right codeword and is not started for undecodable frame.
Like subblock parallelism with initialization by acquisition, an Amdahl's law can be recognized in this speedup equation if terms are neglected. Thus, the sequential part of the subblock parallelism with message passing initialization is around (compared to for acquisition initialization). Similarly, we can show (by neglecting terms and by remarking that is much smaller than ) that the maximum speedup (obtained when equals ) is roughly equal to .
4.3. Subblock Parallelism Efficiency and Performance Comparison
To evaluate the efficiency of initialization methods, a comparison of their subblock parallelism speedups is accurate enough since the ratios ( ) between the BCJR-SISO decoder area and the overall architecture area are very close for both methods (the initialization methods only require memorization overheads, that are negligible with respect to the BCJR-SISO decoder area).
Each subblock parallelism speedup takes into account acquisition overhead and additional iterations. Note that using additional iterations with the acquisition method has almost no effect on error-correction performance.
Thus, the subblock parallelism efficiency has an extremum when the parallelism degree is equal to . Indeed, this extremum is a maximum as the first derivative is positive below this value (the denominator is always positive) and negative above. The maximum efficiency depends on , on the block size, code rate and component decoders. Above this value, the architecture efficiency is degraded and other parallelisms have to be considered such as component-decoder parallelism.
5. Component-Decoder Parallelism Analysis
As described in Section 3.2.2, component-decoder parallelism takes advantage of the shuffled decoding technique that executes all component decoders in parallel and exchanges extrinsic information as soon as created. However, the relevance of this technique depends greatly on other parallelism choices and on interleaving rules.
5.1. Shuffled Decoding Speedup
where is the component-decoder parallelism degree (equals to dec for shuffled decoding), the subblock parallelism degree used in each component decoder, and itshuffled the number of iterations required by the shuffled decoding process.
The impact of the propagation time will be discussed in Section 5.3. In Section 5.2, a zero propagation time will be considered and the convergence speed of the shuffled decoding process, defined as , is analysed.
Simulation results demonstrate that the convergence speed of shuffled decoding ranges from 0.6 to 0.95 depending on the choice of parallelism techniques and on the interleaving scheme.
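To give an order of magnitude for what these convergence speeds mean, the sketch below assumes a zero propagation time and assumes that the shuffled decoding speedup is the product of the component-decoder parallelism degree (two component decoders here) and the convergence speed. This relation is an assumption made for illustration, consistent with the iteration period being halved, and is not a reproduction of the paper's speedup equation.

```python
def shuffled_speedup(convergence_speed, num_component_decoders=2):
    """Assumed relation: parallel decoders times relative convergence speed."""
    if convergence_speed <= 0.0:
        raise ValueError("convergence_speed must be positive")
    return num_component_decoders * convergence_speed
```

Under this assumption, the reported range of 0.6 to 0.95 corresponds to speedups between roughly 1.2 and 1.9 for two component decoders, and a convergence speed of 0.5 would cancel the benefit of shuffled decoding entirely.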
5.2. Combining Shuffled Decoding with Other Parallelism Techniques
The convergence speed of shuffled decoding varies widely depending on the BCJR computation parallelism and subblock parallelism with which it is combined.
The influence of the BCJR computation scheme was pointed out with the replica scheme proposal. The replica principle consists in generating two extrinsic information values per symbol per iteration (one in the forward direction and one in the backward direction), instead of one in the other schemes. Thus, decoders have more up-to-date extrinsic information and converge faster. Using the schemes presented in Figure 1, our simulations also reveal that shuffled decoding convergence speed is always greatest with the replica butterfly scheme (around 0.8) and lowest with the forward butterfly scheme (around 0.6). However, in these conditions, shuffled decoding is not worthwhile for implementation, since its resulting speedup is less than the one obtained using subblock parallelism (with the same number of BCJR-SISO decoders) instead of component-decoder parallelism.
where (resp. ) is the subblock parallelism speedup (resp. the shuffled decoding speedup). Note that the area overhead is considered to be similar for both parallelisms. Consequently, the global overhead depends on the product of the parallelism degrees.
5.3. Shuffled Decoding Implementation Issues
5.3.1. Propagation Time Effect on Shuffled Decoding
In real implementations of a parallel turbo decoder, exchanged extrinsic information is not immediately updated in the targeted SISO. We define the propagation time as the time required to update the extrinsic information value of a symbol. It includes the computation time of the extrinsic information value and the communication time. As complex communication structures (whose complexity increases with subblock parallelism) are needed to perform interleaving without conflicts, communication time is far from negligible. However, for most hardware implementations, propagation time is less than .
Because of this propagation time, consistency conflicts (i.e., a component decoder performs a read access before the write access of the other component decoder is completed) may occur in the extrinsic information memory. Hence, the symbols suffering from a consistency conflict must have one memory bank per decoder, and their extrinsic information values are exchanged in the time interval of two iterations instead of one for the other symbols. Consequently, the convergence of the shuffled decoding process is slowed down.
Thus, preserving a good convergence speed of the shuffled process requires the preservation of a reasonable percentage of exchanging symbol reliabilities in the process. To better understand the influence of propagation time and interleaving rules on this percentage, we propose a geometrical explanation.
5.3.2. Geometrical Explanation of Propagation Time Effect
Let be the SISO decoder that processes the th subblock of the frame on the th component decoder (e.g., for a concatenation of two convolutional codes). Let denote the propagation time. According to the considered scheme of BCJR computations (Section 3.1.2), at the th iteration, can deliver extrinsic information for the estimated symbol in the forward recursion or/and in the backward recursion. Extrinsic information transmission time will be denoted for transmission in forward direction and for transmission in backward direction.
To describe the interleaving design rule, the two-dimensional representation of an interleaver is well suited. In this representation, the natural order (resp. interleaved order) is depicted on the horizontal axis (resp. vertical axis) and the symbol with index in natural order is represented with the point . In this representation, (16) translates into banned regions constituting the mask of the interleaver. Symbols therefore converge more slowly inside the interleaver mask than outside.
For the interleaver mask of the replica butterfly scheme (Figure 8(b)), and are both defined for each symbol. Hence, each symbol is concerned by the four inequalities, which define four diagonal banned regions. Thus, the interleaver mask is the intersection of these diagonals, that is, the square in the middle and the one distributed on the corners (iterations are assumed to be executed continuously, that is, the first and last symbols of the frame are neighbors in processing time). In comparison with the butterfly mask, the replica mask is smaller. So, in the random interleaving case, the replica mask allows more exchanges and, consequently, enables a better convergence of the iterative process. As with the butterfly mask, the replica mask covers the entire space when is equal to . However, convergence speed is not slowed down to 0.5 (see Figure 7) since, with the replica scheme, additional extrinsic information exchanges exist in the time interval of one iteration. Indeed, one information update is used at two different instants (forward and backward), but (16) only considers the closest instant. These additional exchanges are the secondary reason for the robustness of replica shuffled decoding to long propagation times.
5.3.3. Concluding Remark on Real Shuffled Decoding
Finally, in a real implementation context (propagation time less than 3 ), the propagation time effect on the convergence speed (and efficiency) of the shuffled decoding process introduces loss always less than 10% for replica shuffled decoding and less than 12% for butterfly shuffled decoding. Therefore, with the efficiency gap observed in Figure 5 between implementations using only subblock parallelism and implementations using ideal shuffled decoding, we can conclude that real shuffled decoding implementations based on replica and butterfly schemes are more efficient for high parallelism degree.
6. Conclusion
This paper thoroughly examines parallel convolutional turbo decoding. We have analyzed and classified the various parallelism techniques that could be used in convolutional turbo decoding with the BCJR algorithm. The proposed three-level classification includes BCJR metric level parallelism, BCJR SISO decoder level parallelism, and Turbo-decoder level parallelism. Considering its achievable high parallelism degrees and mastered area overhead, our analyses focus on the BCJR SISO decoder level, which includes subblock parallelism and component-decoder parallelism. On the one hand, we demonstrate that subblock initialization is more efficient with the message passing technique than with the acquisition technique and also that subblock parallelism becomes inefficient for high subblock parallelism degrees. On the other hand, we show that component-decoder parallelism efficiency depends on the BCJR computation scheme, on the subblock parallelism degree, and on propagation time. Furthermore, results point out that shuffled decoding, until now never considered in hardware implementation, improves the efficiency of very high-throughput, low-latency implementations.
This work was supported in part by the UDEC project of the French Research Agency (ANR).
- Berrou C, Glavieux A, Thitimajshima P: Near Shannon limit error-correcting coding and decoding: turbo-codes. Proceedings of the IEEE International Conference on Communications (ICC '93), May 1993, Geneva, Switzerland 1064-1070.
- Bahl LR, Cocke J, Jelinek F, Raviv J: Optimal decoding of linear codes for minimizing symbol error rate. IEEE Transactions on Information Theory 1974, 20(2):284-287.
- Boutillon E, Gross WJ, Gulak PG: VLSI architectures for the MAP algorithm. IEEE Transactions on Communications 2003, 51(2):175-185. 10.1109/TCOMM.2003.809247
- Masera G, Piccinini G, Roch MR, Zamboni M: VLSI architectures for turbo codes. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 1999, 7(3):369-379.
- Mansour MM, Shanbhag NR: VLSI architectures for SISO-APP decoders. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2003, 11(4):627-650.
- Schurgers C, Catthoor F, Engels M: Memory optimization of MAP turbo decoder algorithms. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2001, 9(2):305-312.
- Muller O, Baghdadi A, Jézéquel M: Exploring parallel processing levels for convolutional turbo decoding. Proceedings of International Conference on Information & Communication Technologies: From Theory to Applications (ICTTA '06), April 2006.
- Tarable A, Benedetto S, Montorsi G: Mapping interleaving laws to parallel turbo and LDPC decoder architectures. IEEE Transactions on Information Theory 2004, 50(9):2002-2009. 10.1109/TIT.2004.833353
- Thul MJ, Gilbert F, Vogt T, Kreiselmaier G, Wehn N: A scalable system architecture for high-throughput turbo-decoders. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology 2005, 39(1-2):63-77.
- Zhang J, Fossorier MPC: Shuffled iterative decoding. IEEE Transactions on Communications 2005, 53(2):209-213. 10.1109/TCOMM.2004.841982
- Muller O, Baghdadi A, Jézéquel M: On the parallelism of convolutional turbo decoding and interleaving interference. Proceedings of the Global Telecommunications Conference (GLOBECOM '06), December 2006.
- Amdahl GM: Validity of the single processor approach to achieving large scale computing capabilities. AFIPS Spring Joint Computing Conference, April 1967 483-485.
- Robertson P, Hoeher P, Villebrun E: Optimal and sub-optimal maximum a posteriori algorithms suitable for turbo decoding. European Transactions on Telecommunications 1997, 8(2):119-125. 10.1002/ett.4460080202
- Benedetto S, Divsalar D, Montorsi G, Pollara F: Soft-input soft-output maximum a posteriori (MAP) module to decode parallel and serial concatenated codes. TDA Progress Report 1996.
- Zhang J, Wang Y, Fossorier M, Yedidia JS: Replica shuffled iterative decoding. Proceedings of the IEEE International Symposium on Information Theory, September 2005, Adelaide, Australia.
- Yoon S, Bar-Ness Y: A parallel MAP algorithm for low latency turbo decoding. IEEE Communications Letters 2002, 6(7):288-290. 10.1109/LCOMM.2002.801310
- Wolf T: Initialization of sliding windows in turbo decoders. Proceedings of the 3rd International Symposium on Turbo Codes and Related Topics, September 2003, Brest, France 219-222.
- Douillard C, Jézéquel M, Berrou C, Brengarth N, Tousch J, Pham N: The turbo code standard for DVB-RCS. Proceedings of the 2nd International Symposium on Turbo Codes & Related Topics, 2000, Brest, France 535-538.
- Heegard C, Wicker SB: Turbo Coding. Kluwer Academic Publishers, Dordrecht, The Netherlands; 1999.
- Gnaëdig D, Boutillon E, Tousch J, Jézéquel M: Towards an optimal parallel decoding of turbo codes. Proceedings of the 4th International Symposium on Turbo Codes and Related Topics, April 2006, Munich, Germany.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.