The turbo decoding principle is based on an exchange of probabilistic information, called *extrinsic information*, between two (or more) component decoders dealing with the same received set of data. As shown in Fig. 2, a typical turbo decoder consists of two decoders operating iteratively on the received frame. The first component (SISO decoder 0 in Fig. 2) works in the natural domain while the second (SISO decoder 1 in Fig. 2) works in the interleaved domain. The soft-input soft-output (SISO) decoders operate on soft information to improve the decoding performance. Thus, besides its own channel input data, each SISO decoder uses the extrinsic information generated by the other SISO decoder in order to improve its estimation over the iterations. Usually, but not necessarily, the computations are done in the logarithmic domain. Each decoder calculates the log-likelihood ratio (LLR) for the *i*th data bit *d*_{i} as

$$\begin{array}{@{}rcl@{}} L(d_{i}) & = & \ln \frac{\Pr(d_{i} = 1|y)} {\Pr(d_{i} = 0|y)} \end{array} $$

(2)
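
As a numerical illustration of this definition, the sign of the LLR indicates the more likely bit value and its magnitude the reliability of that decision. A minimal Python sketch (the function name is ours, for illustration only):

```python
import math

def llr(p1: float) -> float:
    """Log-likelihood ratio of a bit: ln(Pr(d=1) / Pr(d=0))."""
    return math.log(p1 / (1.0 - p1))

# A bit that is 73% likely to be 1 has a positive LLR;
# an equiprobable bit carries no information (LLR = 0).
print(round(llr(0.73), 3))  # 0.995
print(llr(0.5))             # 0.0
```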

The input LLRs driving the trellis transitions can be decomposed into three independent terms as

$$\begin{array}{@{}rcl@{}} L(d_{i}) & = & L^{\text{apr}}(d_{i}) + L^{\text{sys}}(d_{i}) + L^{\text{par}}(d_{i}) \end{array} $$

(3)

where *L*^{apr}(*d*_{i}) is the a-priori information of *d*_{i}, while *L*^{sys}(*d*_{i}) and *L*^{par}(*d*_{i}) are the channel measurements of the systematic and parity parts, respectively. Each SISO decoder generates extrinsic information that is sent to the other decoder. This extrinsic information becomes the a-priori information *L*^{apr}(*d*_{i}) of the other decoder, as shown in Fig. 2.

Several algorithms for this SISO decoding have been proposed in the literature. The soft-output Viterbi algorithm (SOVA) and the maximum a posteriori (MAP) algorithm are the most frequently used. The latter has been simplified in [23] into the Max-Log-MAP algorithm, which is most often adopted because it lends itself to efficient hardware implementation. For a better understanding of the architectural and configuration issues highlighted in the rest of this paper, the next sub-section provides a short introduction to Max-Log-MAP decoding.
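
The simplification behind Max-Log-MAP is the replacement of the exact Jacobian logarithm ln(e^a + e^b) = max(a, b) + ln(1 + e^{-|a-b|}) by max(a, b) alone, trading a bounded approximation error (at most ln 2) for a much cheaper datapath. An illustrative Python sketch (function names are ours):

```python
import math

def max_star(a: float, b: float) -> float:
    """Exact Jacobian logarithm: ln(e^a + e^b)."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_log(a: float, b: float) -> float:
    """Max-Log-MAP approximation: keep only the max term."""
    return max(a, b)

# The error is largest (ln 2) when the two metrics are equal
# and vanishes as they move apart.
print(round(max_star(2.0, 2.0) - max_log(2.0, 2.0), 3))  # 0.693
print(round(max_star(5.0, 0.0) - max_log(5.0, 0.0), 3))  # 0.007
```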

### 3.1 Max-Log-MAP algorithm

In order to briefly explain the underlying computations, let us consider the 8-state double binary turbo code (DBTC) of the WiMAX standard. For each received double binary symbol (S0,S1)_{k}, the SISO decoder first computes the branch metrics *γ*_{k}(*s*^{′},*s*), which represent the probability of a transition occurring between two trellis states (*s*^{′}, starting state; *s*, ending state). These branch metrics can be decomposed, as defined by the following expressions, into an intrinsic part (\(\gamma _{k}^{\text {intr}_{x}}(s',s)\)) due to the systematic information (\(\gamma _{k}^{\text {sys}_{x}}(s',s)\)) and the a-priori information (\(\gamma _{k}^{\mathrm {n.apr}_{x}}(s',s)\)), and a redundancy part due to the parity information (\(\gamma _{k}^{\text {par}_{y}}(s',s)\)).

$$\begin{array}{*{20}l} \gamma_{k}(s',s)&=\gamma_{k}^{\text{intr}_{x}}(s',s) + \gamma_{k}^{\text{par}_{y}}(s',s)\\ & \forall{(x,y=00,01,10,11)} \end{array} $$

(4)

$$\begin{array}{*{20}l} \gamma_{k}^{\text{intr}_{x}}(s',s)&=\gamma_{k}^{\text{sys}_{x}}(s',s) + \gamma_{k}^{\mathrm{n.apr}_{x}}(s',s)\\ & \forall{(x=00,01,10,11)} \end{array} $$

(5)

where \(\gamma _{k}^{\mathrm {n.apr}_{x}}(s',s)\) is the normalized a-priori information of the *k*th symbol, i.e., the normalized extrinsic information (\(Z_{k}^{\mathrm {n.ext}}\)) sent by the other component decoder (expression given below). Furthermore, the systematic and the parity information in these expressions represent the *symbol* log-likelihood ratios (LLRs), which can be obtained by direct addition and subtraction operations between the received channel *bit* LLRs (S0, S1, P0, P1, P0^{′}, P1^{′}).
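
As an illustration of this bit-to-symbol conversion for the systematic part, the following sketch uses one common convention in which decision 00 serves as the reference (its symbol LLR is zero); the exact mapping and sign conventions in a given implementation may differ:

```python
def symbol_llrs(s0: float, s1: float) -> dict:
    """Systematic symbol LLRs of a double binary symbol (S0, S1),
    built from the two received channel bit LLRs, with decision 00
    taken as the reference."""
    return {
        "00": 0.0,        # reference decision
        "01": s1,         # only the second bit is 1
        "10": s0,         # only the first bit is 1
        "11": s0 + s1,    # both bits are 1: sum of the bit LLRs
    }

print(symbol_llrs(1.5, -0.5))
# {'00': 0.0, '01': -0.5, '10': 1.5, '11': 1.0}
```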

Then the SISO decoder runs the forward and backward recursions over the trellis. The forward state metrics *α*_{k}(*s*) of the *k*th symbol are computed recursively from those of the (*k*−1)th symbol and the branch metrics of the corresponding trellis section. The backward state metrics *β*_{k}(*s*) are computed in the same way during the backward recursion, which traverses the trellis in the reverse direction.

$$\begin{array}{*{20}l} \alpha_{k}(s)&=\max_{s'}(\alpha_{k-1}(s')+\gamma_{k}(s',s))\\ & \forall{(s',s=0,1,\ldots,7)} \end{array} $$

(6)

$$\begin{array}{*{20}l} \beta_{k}(s')&=\max_{s}(\beta_{k+1}(s)+\gamma_{k+1}(s',s))\\ & \forall{(s',s=0,1,\ldots,7)} \end{array} $$

(7)
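
The two max-log recursions above can be sketched as follows. This is a toy Python model over a hypothetical 2-state trellis with hand-picked branch metrics (not the 8-state WiMAX trellis); each trellis section is a dictionary mapping a transition (s′, s) to its branch metric:

```python
NEG_INF = float("-inf")

def forward(gamma, n_states, alpha_init):
    """Forward recursion: alpha_k(s) = max over s' of alpha_{k-1}(s') + gamma_k(s', s)."""
    alpha = [alpha_init]
    for gk in gamma:
        prev, cur = alpha[-1], [NEG_INF] * n_states
        for (sp, s), g in gk.items():
            cur[s] = max(cur[s], prev[sp] + g)
        alpha.append(cur)
    return alpha

def backward(gamma, n_states, beta_init):
    """Backward recursion: the same max-log update, traversing the trellis in reverse."""
    beta = [beta_init]
    for gk in reversed(gamma):
        nxt, cur = beta[0], [NEG_INF] * n_states
        for (sp, s), g in gk.items():
            cur[sp] = max(cur[sp], nxt[s] + g)
        beta.insert(0, cur)
    return beta

# Two trellis sections; transition (s', s) -> branch metric.
gamma = [{(0, 0): 1.0, (0, 1): -1.0, (1, 0): 0.5, (1, 1): 2.0},
         {(0, 0): 0.0, (0, 1): 1.5, (1, 0): -0.5, (1, 1): 1.0}]
alpha = forward(gamma, 2, [0.0, NEG_INF])  # trellis starts in state 0
beta = backward(gamma, 2, [0.0, 0.0])      # final state unknown
print(alpha[-1])  # [1.0, 2.5]
print(beta[0])    # [2.5, 3.0]
```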

Finally, the extrinsic information of the *k*th symbol is computed for all possible decisions (00,01,10,11) using the forward state metrics, the backward state metrics, and the branch metrics, from which the intrinsic part is then subtracted, as formulated in the following expressions:

$$\begin{array}{*{20}l} Z_{k}^{\text{apos}}(d(s',s)=x)&=\max_{(s',s)/d(s',s)=x}(\alpha_{k-1}(s')+\gamma_{k}(s',s)+\beta_{k}(s))\\ & \forall{(x=00,01,10,11)} \end{array} $$

(8)

$$\begin{array}{*{20}l} Z_{k}^{\text{ext}}(d(s',s)=x)&= Z_{k}^{\text{apos}}(d(s',s)=x) - \gamma_{k}^{\text{intr}_{x}}(s',s)\\ & \forall{(x=00,01,10,11)} \end{array} $$

(9)

The extrinsic information can be normalized by subtracting its minimum value in order to reduce the related storage and communication requirements; thus, only three extrinsic values need to be exchanged for each symbol.

$$\begin{array}{*{20}l} Z_{k}^{\mathrm{n.ext}}(d(s',s)=x)&=Z_{k}^{\text{ext}}(d(s',s)=x)-\min_{x}(Z_{k}^{\text{ext}}(d(s',s)=x))\\ & \forall{(x=00,01,10,11)} \end{array} $$

(10)
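
This normalization simply subtracts the minimum of the four extrinsic values, so that one of them becomes zero and only the remaining three need to be stored and exchanged. A minimal sketch with made-up values:

```python
def normalize_extrinsic(z_ext: dict) -> dict:
    """Subtract the minimum so one of the four values becomes 0;
    only the three non-zero values then need to be exchanged."""
    m = min(z_ext.values())
    return {x: v - m for x, v in z_ext.items()}

print(normalize_extrinsic({"00": 1.25, "01": -0.75, "10": 3.0, "11": 0.5}))
# {'00': 2.0, '01': 0.0, '10': 3.75, '11': 1.25}
```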

Executing one forward-backward recursion on all symbols of the received frame in the natural order completes one half iteration. A second half iteration should be executed in the interleaved order to complete one full turbo decoding iteration. Once all the iterations are completed (usually 6–7 iterations), the turbo decoder produces a hard decision for each symbol \(Z_{k}^{\mathrm {hard\ dec.}} \in (00,01,10,11)\).
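
The final hard decision per symbol amounts to selecting the decision with the largest a-posteriori metric, e.g. (a Python sketch with illustrative values):

```python
def hard_decision(z_apos: dict) -> str:
    """Pick the symbol value with the largest a-posteriori metric."""
    return max(z_apos, key=z_apos.get)

print(hard_decision({"00": 0.2, "01": 1.7, "10": -0.3, "11": 0.9}))  # 01
```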

For SBTC, the use of trellis compression (radix-4) [24] represents an efficient parallelism technique and allows for efficient resource sharing with a DBTC SISO decoder, as two single binary trellis sections (two bits) can be merged into one double binary trellis section.

The next section introduces the different levels of parallelism that can be exploited considering a Max-Log-MAP SISO decoder. It particularly highlights the SISO decoder level parallelism.

### 3.2 Parallelism in turbo decoding

Turbo decoding provides an efficient solution to reach very low error rates, at the cost of a high processing time for data retrieval. To overcome this limitation, many efforts have been conducted to exploit parallelism and achieve high throughput. These parallelism levels can be categorized into three groups: metric level, SISO decoder level, and turbo decoder level.

The metric level parallelism deals with the processing of all metrics involved in the decoding of each received symbol inside a Max-Log-MAP SISO decoder. For that purpose, the inherent parallelism of the trellis structure [25, 26] and the parallelism of the MAP computation can be exploited [25–27]. The SISO decoder level parallelism consists in duplicating the SISO decoders in natural and interleaved domains, each executing the MAP algorithm on a sub-block of the frame to be decoded. Finally, the turbo decoder level parallelism proposes to duplicate whole turbo decoders to process iterations and/or frames in parallel. However, this level of parallelism is not relevant due to the huge area overhead of such an approach (all memories and computation resources are duplicated). Moreover, this solution presents no gain in frame decoding latency.

The SISO decoder level parallelism strongly impacts the configuration process of a multi-processor turbo decoder. Indeed, the number of SISO decoders that have to be configured and the configuration parameters associated with each SISO decoder both depend on this parallelism level. At this level, two techniques are available: frame sub-blocking and shuffled decoding. These two techniques are detailed hereafter.

**Frame sub-blocking:** In sub-block parallelism, each frame is divided into *M* sub-blocks and each sub-block is then processed by a Max-Log-MAP SISO decoder (Fig. 3). Besides the duplication of Max-Log-MAP SISO decoders, this parallelism imposes two other constraints. On the one hand, interleaving has to be parallelized in order to scale the communication bandwidth proportionally. Due to the scrambling property of interleavers, this parallelism can induce communication conflicts, except for the interleavers of emerging standards, which are conflict-free for certain parallelism degrees. In case of conflicts, an appropriate communication structure, e.g., a NoC, should be implemented for conflict management [15]. On the other hand, the Max-Log-MAP SISO decoders have to be initialized adequately, either by acquisition or by message passing (*β*_{i} and *α*_{i} in Fig. 3). A detailed analysis of the parallelism efficiency of these two methods is presented in [28]; it favors the message passing technique. Message passing, which initializes a sub-block with the recursion metrics (*α* and *β*) computed during the previous iteration in the neighboring sub-blocks (Fig. 3), incurs a negligible time overhead compared to the acquisition method.
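
The message-passing initialization can be sketched as follows (the function and variable names are hypothetical, and the recursion metrics are abstracted to string labels): each sub-block starts its next iteration from the boundary metrics its neighbors computed in the previous one.

```python
def init_from_neighbors(alpha_out, beta_out, boundary):
    """alpha_out[m]: final forward metrics of sub-block m (previous iteration);
    beta_out[m]: first backward metrics of sub-block m.
    Returns the initial (alpha, beta) pair each sub-block uses next."""
    M = len(alpha_out)
    init = []
    for m in range(M):
        a0 = alpha_out[m - 1] if m > 0 else boundary      # left neighbor's final alpha
        bN = beta_out[m + 1] if m < M - 1 else boundary   # right neighbor's first beta
        init.append((a0, bN))
    return init

# Three sub-blocks; frame boundaries use the known trellis state.
print(init_from_neighbors(["a0_end", "a1_end", "a2_end"],
                          ["b0_start", "b1_start", "b2_start"],
                          "known_state"))
# [('known_state', 'b1_start'), ('a0_end', 'b2_start'), ('a1_end', 'known_state')]
```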

**Shuffled turbo decoding:** The principle of the shuffled turbo decoding technique was introduced in [29]. In this mode, all component decoders of the natural and interleaved domains work in parallel and exchange extrinsic information as soon as it is created. Thus, the shuffled turbo decoding technique performs decoding (computation time) and interleaving (communication time) fully concurrently, whereas serial decoding implies waiting for the update of all extrinsic information before starting the next half iteration. Hence, by doubling the number of Max-Log-MAP SISO decoders, shuffled turbo decoding halves the iteration period in comparison with the originally proposed serial turbo decoding. Nevertheless, to preserve the error-rate performance with shuffled turbo decoding, an overhead of 5 to 50% additional iterations is required, depending on the MAP computation scheme, the degree of sub-block parallelism, the propagation time, and the interleaving rules [28].

Frame sub-blocking and shuffled turbo decoding greatly impact the number of SISO decoders that have to be configured for a given throughput objective. Thus, the configuration load and the configuration latency depend on these two techniques. Moreover, these levels of parallelism affect the error-rate performance of the decoder and therefore directly impact the number of decoding iterations that have to be performed for a given BER objective [30].

This section has provided the basic background on turbo decoding and on the different levels of parallelism that can be exploited to reach the high-throughput requirements imposed by emerging communication standards. The next section introduces the reconfigurable multi-ASIP UDec architecture for turbo decoding.