
# Dynamic configuration management of a multi-standard and multi-mode reconfigurable multi-ASIP architecture for turbo decoding

Vianney Lapotre^{1}, Guy Gogniat^{1}, Amer Baghdadi^{2} and Jean-Philippe Diguet^{1}

**2017**:35

https://doi.org/10.1186/s13634-017-0468-x

© The Author(s) 2017

**Received:** 3 March 2016 **Accepted:** 22 April 2017 **Published:** 23 May 2017

## Abstract

The multiplication of connected devices goes along with a large variety of applications and traffic types with diverse requirements. Accompanying this connectivity evolution, recent years have seen considerable evolution of wireless communication standards in the domains of mobile telephone networks, local/wide wireless area networks, and Digital Video Broadcasting (DVB). In this context, intensive research has been conducted to provide flexible turbo decoders targeting high throughput, multi-mode and multi-standard support, and power efficiency. However, flexible turbo decoder implementations have rarely considered dynamic reconfiguration, although this context requires high-speed configuration switching. Starting from this observation, this paper proposes the first solution that allows frame-by-frame run-time configuration management of a multi-processor turbo decoder without compromising the decoding performance.

## Keywords

- Wireless communication
- Turbo codes
- ASIP
- Dynamic configuration

## 1 Introduction

Recent years have seen considerable evolution of wireless communication standards in the domains of mobile telephone networks, local/wide wireless area networks, and Digital Video Broadcasting (DVB). Besides the increasing requirements in terms of throughput and robustness against destructive channel effects, the convergence of services in a single smart terminal becomes a crucial and challenging feature. Channel coding is a key technique of a wireless communication standard. It allows reliable data transfer targeting high throughput over unreliable communication channels. A channel coding technique is typically associated with a variety of parameters and configuration options (frame size, communication channel, signal-to-noise ratio, etc.). Among channel coding techniques, turbo codes [1] are frequently adopted in recent wireless standards to reach a very low bit error rate (BER).

The introduction of contention-free interleavers in recent communication standards, such as long-term evolution (LTE) [2] and Worldwide Interoperability for Microwave Access (WiMAX) [3], enables high-throughput implementations such as those presented in [4–9]. These architectures use multiple soft-input soft-output (SISO) decoders to reach the high throughput requirements of emerging and future standards. They offer certain degrees of flexibility to adapt, for instance, the number of SISO decoders, the turbo code mode, i.e., single binary turbo code (SBTC) or double binary turbo code (DBTC), or the frame size. However, these efforts do not present any configuration infrastructure to support fast and efficient dynamic configuration switching. In [10], the authors propose a solution to support dynamic configuration. They present a field-programmable gate array (FPGA) implementation of a high-speed MAP decoder architecture for turbo decoding achieving 346 Mbps. The configuration latency cost of such an implementation is not evaluated. The configuration latency of Xilinx FPGAs [11] depends on the targeted FPGA technology, the bitstream size, and the medium used to transfer the configuration bitstream. In any case, the configuration latency overhead remains significant (from around 100 *μ*s to 100 ms [11]). Recent work has investigated general purpose processor (GPP) implementations using high-performance multi-core architectures taking advantage of the Intel SSE (Streaming SIMD Extensions) instructions. In [12], a 418 Mbps turbo decoder for LTE is implemented on an Intel Xeon X5670 processor with a parallelism level of 12 threads. In [13], an adaptive turbo decoder implementation on an Intel i7-960 core is investigated. The authors propose to adapt the decoding algorithm depending on the communication channel quality. However, neither [12] nor [13] discusses the context switching cost when the turbo decoder configuration has to be changed. Moreover, these GPP implementations were initially developed for base stations. Thus, they are not suitable for mobile terminals due to the high power consumption of such processors.

Recently, application-specific instruction-set processor (ASIP) solutions have been investigated in order to offer architectures providing good trade-offs in terms of flexibility, throughput, and power dissipation. In [14], a flexible and high-performance ASIP model for turbo decoding was proposed, which can be configured to support all single and double binary turbo codes up to eight states. The architecture uses shuffled decoding with frame sub-blocking. The extrinsic information is iteratively and concurrently exchanged between multiple component decoders via an on-chip communication network presented in [15]. Afterwards, an optimized implementation of the ASIP supporting both turbo codes and LDPC codes, called DecASIP, has been presented in [16]. Similarly, the authors in [17] introduce the FlexiTreP ASIP presented in [18] in a multi-ASIP architecture for turbo decoding to reach the 150 Mbps throughput requirement of LTE. These works provide an efficient way to reach the high performance requirements of emerging standards. However, the dynamic reconfiguration aspect of these platforms is only superficially addressed. In [19], the authors propose a reconfigurable multi-processor approach in order to decode multiple data streams in parallel. However, the timing impact of such a reconfiguration process is not detailed. Among the few works that consider this issue, we can cite the recent architecture presented in [20], where the authors propose solutions for the reconfiguration management of the network-on-chip (NoC) based multi-processor turbo/LDPC decoder architecture presented in [21]. Up to 35 processing elements (PEs) and up to 8 configuration buses have been implemented. Each PE is configured through a configuration memory, which is organized as a circular buffer. The reconfiguration process to switch from one configuration to another can be masked by the current decoding task if the configuration memory provides enough free space and if a high-speed configuration infrastructure is provided. Dynamic reconfiguration during one frame duration is possible when the current configuration is small enough to load a new configuration in the memory. If not, the authors provide management solutions to deal with this issue, such as erasing the current configuration during the last decoding iteration and continuing the reconfiguration process during the first iteration of the new configuration. However, this solution is not always sufficient. In that case, stopping the current processing to load the new configuration is unavoidable and leads to a decoding quality loss in terms of BER. The authors of [22] propose a dynamically reconfigurable ASIP-based architecture for turbo decoding allowing reconfiguration of the entire platform during the current decoding task in order to offer a frame-by-frame dynamic configuration. This architecture has been optimized based on the initial work presented in [16]. Up to 64 processors are reconfigured using a bus-based configuration infrastructure implementing optimized transfer mechanisms.

In this context, this paper aims to bring a complete configuration management solution for multi-processor turbo decoders, providing novel solutions allowing for the first time: (1) a run-time evaluation of the number of decoding iterations and the level of sub-block parallelism with respect to throughput and bit error rate (BER) requirements and (2) a run-time configuration generation. As a base architecture, the reconfigurable UDec ASIP-based turbo decoder presented in [22] is considered. Therefore, the corresponding architectural parameters in terms of memory bank sizes and communication interfaces between the ASIPs have been used. In this paper, no specific optimizations have been introduced regarding the turbo decoding itself.

The rest of this paper is organized as follows. Section 2 gives more insights about the motivation of this work. Section 3 provides basics about turbo decoding and related parallelism techniques. Section 4 introduces the Reconfigurable UDec architecture implementing the RDecASIP processor. Section 5 presents the proposed method to dynamically evaluate the number of decoding iterations and the level of sub-block parallelism that have to be used to reach throughput and BER objectives. Section 6 describes the proposed configuration management method and evaluates the obtained performances. Finally, Section 7 concludes the paper.

## 2 Motivation

The maximum configuration latency (MCL) of frame *k* ensuring a null extra delay between two frames is evaluated using (1).

where *k* is the *k*th received frame, *N*_{PrevFrame}(*k*) is the number of consecutive frames decoded with the same configuration that precede frame *k*, FrameSize(*k*−1) is the size of the (*k*−1)th frame in bits, Throughput(*k*−1) is the throughput requirement associated with the (*k*−1)th data frame, and *R*_{c}(*k*−1) is the code rate associated with the (*k*−1)th data frame. MCL, FrameSize, and Throughput are expressed in seconds, bits, and bits/s, respectively. Assuming the worst case, *N*_{PrevFrame}(*k*) = 1, the maximum configuration latency decreases sharply with the high throughput targeted by emerging and future wireless communication standards, as shown in Fig. 1b. This figure presents the decoding latency (the frame duration in Fig. 1b) of a 2048-bit data frame for different wireless communication standards. As throughput requirements grow, the decoding latency of a frame decreases and will reach a few microseconds in the LTE-Advanced standard. Thus, considering the dynamic configuration scenario presented in this section, emerging and future high-throughput multi-mode and multi-standard architectures will have to deal with maximum configuration latencies of a few microseconds. This paper presents solutions to this challenging issue.
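To give a feeling for the orders of magnitude discussed above, the following minimal sketch approximates a frame duration as frame size divided by throughput. It deliberately ignores the code-rate term of (1) and assumes *N*_{PrevFrame} = 1; the function name and the 1 Gbit/s figure (used here as an assumed LTE-Advanced-class peak rate) are illustrative and not taken from the paper.

```python
def frame_duration_s(frame_size_bits: int, throughput_bps: float) -> float:
    """Rough frame duration in seconds: bits divided by bits per second.

    Simplification (assumption): the code-rate term of the MCL
    expression is ignored.
    """
    if throughput_bps <= 0:
        raise ValueError("throughput must be positive")
    return frame_size_bits / throughput_bps


if __name__ == "__main__":
    # 150 Mbit/s is the LTE requirement quoted in the introduction;
    # 1 Gbit/s is an assumed, illustrative high-throughput target.
    for label, rate in (("150 Mbit/s", 150e6), ("1 Gbit/s", 1e9)):
        print(f"2048-bit frame at {label}: {frame_duration_s(2048, rate) * 1e6:.2f} us")
```

With these assumed rates, a 2048-bit frame lasts roughly 13.65 μs and 2.05 μs, respectively, which matches the "few microseconds" budget mentioned in the text.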

## 3 Turbo decoding

Turbo decoding is based on the exchange of *extrinsic information* between two (or more) component decoders dealing with the same received set of data. As shown in Fig. 2, a typical turbo decoder consists of two decoders operating iteratively on the received frame. The first component (SISO decoder 0 in Fig. 2) works in the natural domain while the second (SISO decoder 1 in Fig. 2) works in the interleaved domain. The soft-input soft-output (SISO) decoders operate on soft information to improve the decoding performance. Thus, besides its own channel input data, each SISO decoder deals with the extrinsic information generated by the other SISO decoder in order to improve its estimation over the iterations. Usually, but not necessarily, the computations are done in the logarithmic domain. Each decoder calculates the log-likelihood ratio (LLR) for the *i*th data bit *d*_{i} as

where *L*^{apr}(*d*_{i}) is the a-priori information of *d*_{i}, and *L*^{sys}(*d*_{i}) and *L*^{par}(*d*_{i}) are the channel measurements of the systematic and parity parts, respectively. Each SISO decoder generates extrinsic information that is sent to the other decoder. This extrinsic information becomes the a-priori information *L*^{ap}(*d*_{i}) for the other decoder, as shown in Fig. 2.

Several algorithms for this SISO decoding have been proposed in the literature. The soft output Viterbi algorithm (SOVA) and the maximum a posteriori probability (MAP) algorithm are the most frequently used. The latter has been simplified in [23] into the Max-Log-MAP algorithm, which is most often adopted because it lends itself to efficient hardware implementation. For a better understanding of the architectural and configuration issues highlighted in the rest of this paper, the next sub-section provides a short introduction to Max-Log-MAP decoding.

### 3.1 Max-Log-MAP algorithm

For the *k*th symbol, the SISO decoder first computes the branch metrics \(\gamma_{k}(s',s)\), which represent the probability of a transition between two trellis states (\(s'\), starting state; \(s\), ending state). These branch metrics can be decomposed, as defined by the following expressions, into an intrinsic part (\(\gamma _{k}^{\text {intr}_{x}}(s',s)\)) due to the systematic information (\(\gamma _{k}^{\text {sys}_{x}}(s',s)\)) and the a-priori information (\(\gamma _{k}^{\mathrm {n.apr}_{x}}(s',s)\)), and a redundancy part due to the parity information (\(\gamma _{k}^{\text {par}_{y}}(s',s)\)).

where \(\gamma _{k}^{\mathrm {n.apr}_{x}}(s',s)\) is the normalized a-priori information of the *k*th symbol, i.e., the normalized extrinsic information (\(Z_{k}^{\mathrm {n.ext}}\)) sent by the other decoder component (expression given below). Furthermore, the systematic and the parity information in these expressions represent the *symbol* log-likelihood ratios (LLRs), which can be obtained by direct addition and subtraction operations between the received channel *bit* LLRs (S0, S1, P0, P1, P0′, P1′).
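Read literally, the decomposition described in the prose corresponds to the structure below. This is only a sketch inferred from that description, assuming additive terms in the logarithmic domain; the exact definition of each term, and how the indices *x* and *y* map to symbol decisions and parity values, should be checked against the original expressions.

\[
\gamma_{k}(s',s) = \gamma_{k}^{\text{intr}_{x}}(s',s) + \gamma_{k}^{\text{par}_{y}}(s',s),
\qquad
\gamma_{k}^{\text{intr}_{x}}(s',s) = \gamma_{k}^{\text{sys}_{x}}(s',s) + \gamma_{k}^{\mathrm{n.apr}_{x}}(s',s)
\]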

The forward state metrics \(\alpha_{k}(s)\) of the *k*th symbol are computed recursively using those of the (*k*−1)th symbol and the branch metrics of the corresponding trellis section. The backward state metrics \(\beta_{k}(s)\) are computed similarly by the backward recursion (traversing the trellis in the reverse direction).
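The two recursions can be illustrated with the textbook Max-Log-MAP update, in which each state metric is the maximum, over the connected states, of a neighboring state metric plus a branch metric. The sketch below is a generic pure-Python illustration, not the RDecASIP implementation: the `gamma[k][sp][s]` layout, the zero-based indexing, and the function name are assumptions made for this example.

```python
NEG_INF = float("-inf")  # marks transitions that do not exist in the trellis


def forward_backward(gamma, alpha_init, beta_init):
    """Max-Log-MAP state-metric recursions over one window.

    gamma[k][sp][s] is the branch metric of transition sp -> s in trellis
    section k. alpha_init / beta_init are the metrics at the left / right
    window border (for instance values obtained by message passing).
    Returns alpha and beta, each with len(gamma) + 1 rows.
    """
    n_sections = len(gamma)
    n_states = len(alpha_init)

    # Forward recursion: alpha[k+1][s] = max over sp of alpha[k][sp] + gamma[k][sp][s]
    alpha = [list(alpha_init)]
    for k in range(n_sections):
        prev = alpha[-1]
        alpha.append([
            max(prev[sp] + gamma[k][sp][s] for sp in range(n_states))
            for s in range(n_states)
        ])

    # Backward recursion: beta[k][sp] = max over s of beta[k+1][s] + gamma[k][sp][s]
    beta = [list(beta_init)]
    for k in reversed(range(n_sections)):
        nxt = beta[0]
        beta.insert(0, [
            max(nxt[s] + gamma[k][sp][s] for s in range(n_states))
            for sp in range(n_states)
        ])
    return alpha, beta


if __name__ == "__main__":
    # Toy 2-state trellis with a single section (values are arbitrary).
    g = [[[1.0, 2.0], [3.0, NEG_INF]]]
    a, b = forward_backward(g, [0.0, 0.0], [0.0, 0.0])
    print(a, b)
```

Taking the border metrics as arguments is a deliberate choice in this sketch: it lets the same routine be reused for every window or sub-block, whatever initialization scheme is used.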

The extrinsic information of the *k*th symbol is computed for all possible decisions (00, 01, 10, 11) using the forward state metrics, the backward state metrics, and the extrinsic part of the branch metrics as formulated in the following expressions:

Executing one forward-backward recursion on all symbols of the received frame in the natural order completes one half iteration. A second half iteration should be executed in the interleaved order to complete one full turbo decoding iteration. Once all the iterations are completed (usually 6–7 iterations), the turbo decoder produces a hard decision for each symbol \(Z_{k}^{\mathrm {hard\ dec.}} \in (00,01,10,11)\).

For SBTC, the use of trellis compression (radix-4) [24] represents an efficient parallelism technique and allows for efficient resource sharing with a DBTC SISO decoder, as two single binary trellis sections (two bits) can be merged into one double binary trellis section.

The next section introduces the different levels of parallelism that can be exploited considering a Max-Log-MAP SISO decoder. It particularly highlights the SISO decoder level parallelism.

### 3.2 Parallelism in turbo decoding

Turbo decoding provides an efficient solution to reach very low error rate performance at the cost of high processing time for data retrieval. In order to overcome this limitation, many efforts targeting the exploitation of parallelism have been conducted in order to achieve high throughput. These parallelism levels can be categorized into three groups: metric level, SISO decoder level, and turbo decoder level.

The metric level parallelism deals with the processing of all metrics involved in the decoding of each received symbol inside a Max-Log-MAP SISO decoder. For that purpose, the inherent parallelism of the trellis structure [25, 26] and the parallelism of the MAP computation can be exploited [25–27]. The SISO decoder level parallelism consists in duplicating the SISO decoders in natural and interleaved domains, each executing the MAP algorithm on a sub-block of the frame to be decoded. Finally, the turbo decoder level parallelism proposes to duplicate whole turbo decoders to process iterations and/or frames in parallel. However, this level of parallelism is not relevant due to the huge area overhead of such an approach (all memories and computation resources are duplicated). Moreover, this solution presents no gain in frame decoding latency.

The SISO decoder level parallelism strongly impacts the configuration process of a multi-processor turbo decoder. Indeed, the number of SISO decoders that have to be configured and the configuration parameters associated with each SISO decoder both depend on this parallelism level. At this level, two techniques are available: frame sub-blocking and shuffled decoding. These two techniques are detailed hereafter.

**Frame sub-blocking:** In sub-block parallelism, each frame is divided into *M* sub-blocks and each sub-block is processed on a Max-Log-MAP SISO decoder (Fig. 3). Besides the duplication of Max-Log-MAP SISO decoders, this parallelism imposes two other constraints. On the one hand, interleaving has to be parallelized in order to scale the communication bandwidth proportionally. Due to the scrambling property of interleavers, this parallelism can induce communication conflicts, except for interleavers of emerging standards that are conflict-free for certain parallelism degrees. In case of conflicts, an appropriate communication structure, e.g., a NoC, should be implemented for conflict management [15]. On the other hand, Max-Log-MAP SISO decoders have to be initialized adequately, either by acquisition or by message passing (*β*_{i} and *α*_{i} in Fig. 3). In [28], a detailed analysis of the parallelism efficiency of these two methods is presented; it favors the message passing technique. Message passing, which initializes a sub-block with recursion metrics (*α* and *β*) computed during the previous iteration in the neighboring sub-blocks (Fig. 3), presents negligible time overhead compared to the acquisition method.
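As an illustration of frame sub-blocking, the sketch below splits a frame of symbols into *M* contiguous sub-blocks. The near-equal split policy and the function name are assumptions made for this example; the actual partitioning used by a given decoder may differ.

```python
def split_into_sub_blocks(n_symbols: int, m: int) -> list[tuple[int, int]]:
    """Return M contiguous (start, end) ranges, end exclusive.

    Assumption: sub-blocks are as equal in size as possible, the first
    ones receiving one extra symbol when n_symbols is not a multiple of M.
    """
    if not 1 <= m <= n_symbols:
        raise ValueError("need 1 <= m <= n_symbols")
    base, extra = divmod(n_symbols, m)
    bounds = []
    start = 0
    for i in range(m):
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


if __name__ == "__main__":
    # A 1024-symbol frame on 4 SISO decoders: four 256-symbol sub-blocks.
    print(split_into_sub_blocks(1024, 4))
```

With message passing, the metrics exchanged at each boundary returned here would be the *α* values leaving one sub-block and the *β* values leaving the next, reused as initial values at the following iteration.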

**Shuffled turbo decoding:** The principle of the shuffled turbo decoding technique has been introduced in [29]. In this mode, all component decoders of natural and interleaved domains work in parallel and exchange extrinsic information as soon as it is created. Thus, the shuffled turbo decoding technique performs decoding (computation time) and interleaving (communication time) fully concurrently while serial decoding implies waiting for the update of all extrinsic information before starting the next half iteration. Thus, by doubling the number of Max-Log-MAP SISO decoders, shuffled turbo decoding parallelism halves the iteration period in comparison with originally proposed serial turbo decoding. Nevertheless, to preserve error-rate performance with shuffled turbo decoding, an overhead of iterations between 5 and 50% is required depending on the MAP computation scheme, on the degree of sub-block parallelism, on the propagation time, and on the interleaving rules [28].
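The net benefit of shuffled decoding follows from the two figures quoted above: the iteration period is halved, but 5 to 50% more iterations are needed. A minimal sketch of this arithmetic is given below; the function name is hypothetical, and the model ignores any effect other than these two factors.

```python
def shuffled_speedup(iteration_overhead: float) -> float:
    """Decoding-time speedup of shuffled over serial decoding.

    Serial time   ~ n_iter * period
    Shuffled time ~ n_iter * (1 + overhead) * period / 2
    so the ratio is 2 / (1 + overhead).
    """
    if iteration_overhead < 0:
        raise ValueError("overhead must be non-negative")
    return 2.0 / (1.0 + iteration_overhead)


if __name__ == "__main__":
    for overhead in (0.05, 0.50):
        print(f"overhead {overhead:.0%}: speedup x{shuffled_speedup(overhead):.2f}")
```

Under this simple model the speedup ranges from about 1.9 (5% overhead) down to about 1.33 (50% overhead), obtained at the cost of twice as many SISO decoders.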

Frame sub-blocking and shuffled turbo decoding greatly impact the number of SISO decoders that have to be configured for a given throughput objective. Thus, the configuration load and the configuration latency depend on these two techniques. Moreover, these levels of parallelism impact the performance of the decoder and therefore the number of decoding iterations that have to be performed for a given BER objective [30]. This section has provided the basic background on turbo decoding and on the different levels of parallelism that can be exploited in order to reach the high-throughput requirements imposed by emerging communication standards. The next section introduces the reconfigurable multi-ASIP UDec architecture for turbo decoding.

## 4 Reconfigurable UDec architecture

The entire platform is configured through a bus-based configuration infrastructure that implements unicast, multi-cast, and broadcast mechanisms. The proposed bus architecture can be split into three functional blocks: *Master Interface* (MI) ⑦, *Slave Interface* (SI) ⑧, and *Selector* ⑨. Each configuration memory is connected to the bus through an SI. The configuration manager deals with the configuration generation, which is based on internal decisions and on external information and commands, as described in Sections 5 and 6. The MI provides an interface allowing the connection of the configuration manager to the bus, the SI provides an interface between the bus and the configuration memory, and the Selector provides a simple and efficient solution to select, at run-time, the RDecASIPs that are targeted by the next configuration data. For clarity, connections between the Selector and the SIs are not represented in Fig. 4. This infrastructure allows the transfer of a single data item into the RDecASIP configuration memory in 5 clock cycles. More details on the UDec configuration infrastructure architecture and implementation can be found in [22, 31].

Each RDecASIP processes its sub-block as *N*_{w} windows, each with a maximum size of 64 symbols. Each ASIP can manage a maximum of 12 windows. Since the RDecASIP is designed to work in a multi-ASIP architecture, it requires several parameters to deal with a sub-block of the data frame and several parameters to configure the ASIP mode. Concerning the sub-block partitioning, each ASIP is configured with the size and the number of windows it has to decode. Furthermore, the last window size can be different, so it corresponds to an additional parameter. In SBTC decoding mode, the address of the tail bits in memory, the size, and the number of windows for the specific decoding phase of the tail bits have to be configured. Parameters for the ASIP mode correspond to the location of the ASIP in the architecture, the number of ASIPs required, the parameter that defines whether the current ASIP is in charge of tail bits or not, the target standard, and the scaling factor for extrinsic information. Finally, some seed values are necessary for interleaving address generation in order to exchange information over the NoC that connects the ASIPs of each decoder component. All these parameters are required for each new configuration of an ASIP within the platform and are stored in a configuration memory (⑤ in Fig. 4). This configuration load represents 253 bits per ASIP. Thus, at run-time, all these parameters need to be computed and loaded into the configuration memories when a new configuration is required. To optimize the configuration latency, the configuration memory has been organized in a way that allows broadcasting and multi-casting transfers [32, 33]. In this context, the configuration latency of the UDec platform using the proposed configuration infrastructure is defined by (11) [31].

where *N*_{ASIP} is the number of RDecASIPs and *F*_{clk} is the frequency of the proposed bus architecture. Thirty-one clock cycles are necessary to transfer the parameters common to all ASIPs (broadcasting) and the parameters common to ASIPs of the same decoder component (multi-casting). Three additional clock cycles are necessary to transfer parameters that are different for each ASIP.
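The cycle counts given above suggest the following reading of the configuration latency: a fixed cost of 31 cycles plus 3 cycles per ASIP, divided by the bus frequency. The sketch below encodes that reading; the function names are hypothetical, the interpretation of the three cycles as a per-ASIP cost is an assumption, and the 500 MHz bus frequency in the usage example is an arbitrary illustrative value, not a figure from the paper.

```python
COMMON_CYCLES = 31    # broadcast + multi-cast parameters (from the text)
PER_ASIP_CYCLES = 3   # parameters specific to each ASIP (assumed per-ASIP cost)


def config_latency_cycles(n_asip: int) -> int:
    """Bus clock cycles needed to configure n_asip RDecASIPs (assumed reading)."""
    if n_asip < 1:
        raise ValueError("at least one ASIP is required")
    return COMMON_CYCLES + PER_ASIP_CYCLES * n_asip


def config_latency_s(n_asip: int, f_clk_hz: float) -> float:
    """Configuration latency in seconds for a bus clocked at f_clk_hz."""
    return config_latency_cycles(n_asip) / f_clk_hz


if __name__ == "__main__":
    # 64 ASIPs on a hypothetical 500 MHz configuration bus.
    print(config_latency_cycles(64), "cycles")
    print(f"{config_latency_s(64, 500e6) * 1e6:.3f} us")
```

Under these assumptions the latency grows linearly with the number of ASIPs, which is why sharing common parameters through broadcast and multi-cast transfers matters.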

*N*_{instr} = 4 instructions per iteration are needed to process one symbol, which is composed of two bits. *F*_{clk} and *N*_{iter} are the clock frequency of the system and the number of decoding iterations, respectively. (*N*_{ASIP}/2) reflects the level of sub-block parallelism. This throughput is multiplied by two when shuffled decoding is enabled. The decoding latency is dictated by the processing latency of the RDecASIPs. In the complete receiver architecture, the input memories of the turbo decoder are duplicated to allow buffering of the next input frame streamed from the demapper while executing the iterative turbo decoding on the current received frame. Similarly, the Butterfly NoCs are dimensioned to accommodate the required communication bandwidth dictated by the RDecASIPs. Therefore, regarding the RDecASIP latency, 4 clock cycles (i.e., 4 instructions) are needed to process 1 symbol, which is composed of 2 bits. This latency should be multiplied by the number of iterations and the number of symbols in a frame (FrameSize/2). It should then be divided by the number of ASIPs per component decoder (*N*_{ASIP}/2). Finally, 10 clock cycles should be added once due to the 10 pipeline stages of the ASIP [32, 33].
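The latency description above can be transcribed step by step as follows. This is a sketch that follows the prose literally (4 cycles per symbol, multiplied by the iterations and the FrameSize/2 symbols, divided by *N*_{ASIP}/2, plus 10 pipeline cycles); the function names are hypothetical and the exact form of the original latency expression is not reproduced here.

```python
def decoding_latency_cycles(frame_size_bits: int, n_iter: int, n_asip: int,
                            n_instr: int = 4, pipeline_stages: int = 10) -> float:
    """Clock cycles needed to decode one frame, following the prose literally."""
    n_symbols = frame_size_bits / 2       # one symbol carries two bits
    asips_per_component = n_asip / 2      # ASIPs per component decoder
    return n_instr * n_iter * n_symbols / asips_per_component + pipeline_stages


def decoding_latency_s(frame_size_bits: int, n_iter: int, n_asip: int,
                       f_clk_hz: float) -> float:
    """Same latency expressed in seconds for a clock of f_clk_hz."""
    return decoding_latency_cycles(frame_size_bits, n_iter, n_asip) / f_clk_hz


if __name__ == "__main__":
    # 2048-bit frame, 6 iterations, 8 ASIPs (illustrative values).
    print(decoding_latency_cycles(2048, 6, 8), "cycles")
```

For a 2048-bit frame, 6 iterations, and 8 ASIPs, this gives 4 × 6 × 1024 / 4 + 10 = 6154 cycles.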

In (12), the throughput of the platform is mainly influenced by the number of ASIPs and the number of decoding iterations that have to be performed. In [30], the authors show that determining the level of sub-block parallelism and the number of decoding iterations for a given couple (throughput, BER) is not a trivial task. The next section presents a low-overhead method to estimate the level of sub-block parallelism and the number of decoding iterations, which can be efficiently used at run-time.

## 5 Dynamic estimation of the number of decoding iterations for turbo decoding

This section analyses the impact of sub-block parallelism on the number of decoding iterations that have to be performed to reach a given FER (Frame Error Rate) or BER. Then, a low-complexity method, suitable for run-time execution, is proposed to dynamically estimate the number of decoding iterations with respect to the level of sub-block parallelism.

### 5.1 Case of sub-block parallelism

In the context of this work, which considers the Reconfigurable UDec architecture presented in Section 4, the sub-block parallelism method is associated with initialization by message passing as described in Sub-section 3.2. This method dynamically initializes a sub-block with recursion metrics computed during the last decoding iteration in the neighboring sub-blocks. The authors of [28] have studied the impact of sub-blocking on the turbo decoding performance in terms of FER considering message passing and conclude that the asymptotic error rate is not affected by the message passing approach, whatever the parallelism degree. Thus, initialization by message passing can be used without degrading the decoding quality, provided that the number of decoding iterations is increased with the level of sub-block parallelism.

^{−6} while the configuration presented in Fig. 6 presents the case of a communication with low SNR. The results of the different conducted simulations show that the necessary number of iterations can be roughly estimated using a simple equation taking into account a base number of decoding iterations, the level of sub-block parallelism, and a threshold value that drives the evolution of the number of decoding iterations that have to be performed to reach the target BER. The number of necessary decoding iterations is given by (14).

where *N*_{iterBase} is the number of decoding iterations that has to be performed when the level of sub-block parallelism is one for a fixed target BER, *T* is a constant threshold that can be evaluated by studying the linear behavior of the evolution of the number of decoding iterations, and *P* represents the level of sub-block parallelism that has to be used to reach a given throughput. Thus, *N*_{iterBase} and *T* are determined by analyzing the simulation results. In our experiments, *N*_{iterBase} has been fixed using one additional iteration in order to provide a BER at least as good as the original BER objective. In the examples of Figs. 5 and 6, *N*_{iterBase} is equal to 7 and 8, respectively. *T* values have been estimated by analyzing the slope of the curve created with the simulation results, shown as a solid dark line in Figs. 5 and 6. In Figs. 5 and 6, *T* is equal to 4 and 11, respectively. It is worth noting that, for a given configuration, the value of *T* is constant whatever the target BER for typical *N*_{iterBase} values (from 5 to 10). Indeed, the degradation of the decoding quality due to the increasing level of sub-block parallelism is independent of the targeted BER. It is related to the message passing method used for Max-Log-MAP SISO decoder initialization. This observation is illustrated in Fig. 6, where the behavior for three different BER objectives is shown. We observe that the slopes of the curves are roughly identical, allowing an estimation with a unique value *T* for a fixed SNR. Estimation results using (14) are presented as gray dotted lines in Figs. 5 and 6. Results show that (14) provides a low-complexity solution that can be easily implemented on a mobile device to estimate the number of necessary decoding iterations with at most one more decoding iteration compared to the simulation results. It is worth noting that when the estimated number of decoding iterations is not equal to that obtained through simulations, it is overestimated, which preserves the decoding performance in terms of BER.

### 5.2 Case of shuffled decoding

**Table 1** Comparison of the necessary number of decoding iterations with respect to the level of sub-block parallelism for 53-byte DVB-RCS interleaving, code rate = 6/7, SNR = 4.0 dB, Log-MAP algorithm, FER = 1.6×10^{−3}

Sub-block parallelism | Number of iterations without shuffled decoding | Number of iterations with shuffled decoding | Shuffled decoding efficiency |
---|---|---|---|
1 | 8 | 12 | 0.66 |
4 | 11 | 15 | 0.73 |
8 | 16 | 20 | 0.8 |
53 | 47 | 51 | 0.92 |

**Table 2** Comparison of the necessary number of decoding iterations with respect to the level of sub-block parallelism for 188-byte DVB-RCS interleaving, code rate = 6/7, SNR = 4.0 dB, Log-MAP algorithm, FER = 1.6×10^{−3}

Sub-block parallelism | Number of iterations without shuffled decoding | Number of iterations with shuffled decoding | Shuffled decoding efficiency |
---|---|---|---|
1 | 8 | 11 | 0.72 |
2 | 9 | 11 | 0.82 |
4 | 9 | 12 | 0.75 |
16 | 13 | 15 | 0.86 |
64 | 19 | 23 | 0.83 |
128 | 34 | 37 | 0.92 |

where *N*_{iterShuffled} is a constant that depends on the considered standard and frame size.
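The two tables above can be cross-checked numerically. The sketch below recomputes, for each row, the number of extra iterations required by shuffled decoding and the ratio of serial to shuffled iterations. Interpreting the efficiency column as that ratio is an observation drawn from the data, not a definition stated in the text.

```python
# Rows: (sub-block parallelism, iterations without shuffling,
#        iterations with shuffling, efficiency as listed in the table)
TABLE_1 = [(1, 8, 12, 0.66), (4, 11, 15, 0.73), (8, 16, 20, 0.8), (53, 47, 51, 0.92)]
TABLE_2 = [(1, 8, 11, 0.72), (2, 9, 11, 0.82), (4, 9, 12, 0.75),
           (16, 13, 15, 0.86), (64, 19, 23, 0.83), (128, 34, 37, 0.92)]


def extra_iterations(table):
    """Additional iterations needed when shuffled decoding is enabled."""
    return [with_sh - without for _, without, with_sh, _ in table]


def efficiency(table):
    """Ratio of serial to shuffled iteration counts (assumed meaning of the column)."""
    return [without / with_sh for _, without, with_sh, _ in table]


if __name__ == "__main__":
    for name, table in (("53 bytes", TABLE_1), ("188 bytes", TABLE_2)):
        print(name, extra_iterations(table), [round(e, 3) for e in efficiency(table)])
```

The offset is exactly 4 iterations for every row of the 53-byte table and between 2 and 4 for the 188-byte table, which is consistent with treating it as a near-constant per frame size. The recomputed ratios agree with the listed efficiencies to within 0.01.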

In the next subsection, the proposed estimation method is considered for the configuration process of the reconfigurable UDec architecture. An algorithm jointly determining the number of decoding iterations and the level of sub-block parallelism is proposed to be used at run-time in order to dynamically generate the configuration data for the reconfigurable UDec platform.

### 5.3 Configuration parameters search algorithm

*N*_{iterBase} and the threshold value *T*, which are used to compute the necessary number of decoding iterations as explained in the two previous sub-sections, and (3) *P*_{max}, which is the maximum level of sub-block parallelism supported by the platform. While the frame size and the target throughput are transferred to the UDec platform at run-time, the base number of decoding iterations *N*_{iterBase} and the threshold value *T*, which take into account different SNR and BER objectives, have to be stored in a memory associated with the configuration manager. Each couple of values (*N*_{iterBase}, *T*) can be stored using 2 bytes. Thus, the number of bytes necessary to store these parameters can be determined using the following equation:

where *N*_{SNR} and *N*_{BER} represent the number of supported SNR and BER objectives, respectively.
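Given that each couple (*N*_{iterBase}, *T*) occupies 2 bytes and that one couple is needed per supported (SNR, BER) combination, a plausible reading of this storage requirement is the product below. This is a sketch inferred from the surrounding definitions; any additional factor, for instance per standard or per frame size, is not covered.

\[
\text{Mem}_{\text{bytes}} = 2 \times N_{\text{SNR}} \times N_{\text{BER}}
\]

For example, with the purely illustrative values \(N_{\text{SNR}} = 8\) and \(N_{\text{BER}} = 3\), this gives \(2 \times 8 \times 3 = 48\) bytes.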

The proposed algorithm is built around a search loop that increments the level of sub-block parallelism. Each increment increases the throughput by raising the level of sub-block parallelism, i.e., the number of activated RDecASIPs (line 5). For each level of parallelism, the corresponding number of decoding iterations is deduced from *N*_{iterBase} and *T* (line 3). Then, the UDec throughput (Throughput_{UDec}) corresponding to the level of sub-block parallelism and the computed number of decoding iterations is calculated (line 4). Finally, the UDec throughput and the target throughput are compared. If the target throughput is greater than the current UDec throughput, the level of sub-block parallelism has to be increased to reach the throughput requirement. Once the loop iterations are finished (line 7), the UDec throughput and the target throughput values are compared again. If the UDec throughput is greater than the target throughput, then a configuration solution exists with a level of sub-block parallelism of *P* and *N*_{iter} decoding iterations. If no solution is found, shuffled decoding can be enabled if the conditions in terms of frame size and code rate are met, and a second search (lines 9 to 15) can be performed. Indeed, shuffled decoding cannot be used efficiently on small frame sizes and high code rate configurations [28]. Moreover, the algorithm favors a serial decoding configuration since it requires fewer decoding iterations, as shown in Tables 1 and 2. If shuffled decoding cannot be used, the reconfigurable UDec architecture is not able to support such a configuration while respecting the required decoding performance. In this case, a default configuration respecting either the target throughput or the BER objective can be generated. However, the system should be dimensioned at design-time to support the worst-case scenario. It is worth noting that the proposed algorithm does not provide an optimum solution in terms of decoding time. Indeed, the algorithm stops when the first solution is found. Thus, it guarantees that, for a given configuration, the minimum number of RDecASIPs is used. However, more RDecASIPs could be used to decode a frame in a shorter time (i.e., by increasing the throughput of the platform).
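The search described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the iteration model and the throughput model are not reproduced in this section, so they are passed in as callables, and the names `search_configuration`, `Config`, `toy_iterations`, and `toy_throughput` are hypothetical. The sketch also accepts a throughput equal to the target, which is an assumption.

```python
from typing import Callable, NamedTuple, Optional


class Config(NamedTuple):
    parallelism: int  # level of sub-block parallelism P
    iterations: int   # number of decoding iterations N_iter
    shuffled: bool    # True when shuffled decoding is enabled


def search_configuration(
    target_throughput: float,
    p_max: int,
    iterations_for: Callable[[int, bool], int],
    throughput_for: Callable[[int, int, bool], float],
    shuffled_allowed: bool,
) -> Optional[Config]:
    """Return the first configuration that meets the target, or None."""
    # Serial decoding is searched first because it needs fewer iterations;
    # shuffled decoding is only a fallback, and only when the frame size
    # and code rate conditions (summarised by shuffled_allowed) are met.
    modes = [False, True] if shuffled_allowed else [False]
    for shuffled in modes:
        for p in range(1, p_max + 1):
            n_iter = iterations_for(p, shuffled)
            if throughput_for(p, n_iter, shuffled) >= target_throughput:
                # First hit: smallest P, hence fewest activated RDecASIPs.
                return Config(p, n_iter, shuffled)
    return None  # caller falls back to a default configuration


# Toy stand-ins (hypothetical, for illustration only; not the paper's models).
def toy_iterations(p: int, shuffled: bool) -> int:
    return 6 + p // 8 + (2 if shuffled else 0)


def toy_throughput(p: int, n_iter: int, shuffled: bool) -> float:
    return (2 if shuffled else 1) * p * 125.0 / n_iter


if __name__ == "__main__":
    print(search_configuration(100.0, 32, toy_iterations, toy_throughput, True))
```

Because the loop returns at the first level of parallelism that meets the target, the result uses the minimum number of RDecASIPs rather than the shortest decoding time, which matches the behaviour noted in the text.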

The next section presents the proposed configuration management solution, which ensures a frame-by-frame configuration process, i.e., the worst-case configuration scenario presented in Section 2.

## 6 Run-time configuration management

1. The configuration manager (shown in Fig. 4) receives the configuration order associated with the frame parameters (i.e., frame size, standard, throughput, and targeted BER) necessary to generate the configuration for the RDecASIPs.
2. The configuration manager generates the configuration parameters for the configuration memory of each selected RDecASIP, presented in Section 4.
3. The configuration parameters for each selected RDecASIP are transferred through the configuration infrastructure presented in Section 4.

For this study, we assume that the configuration manager generates at run-time the configuration information for the entire UDec architecture based on the configuration parameters received with the configuration order for the next frame. These parameters are the frame size, the throughput requirement, the target standard, and the BER objectives which are used to compute the number of decoding iterations as explained in Section 5. From these parameters, the number of activated RDecASIPs is first determined depending on the number of decoding iterations using the search algorithm presented in Fig. 7. Then, the contents of the different configuration memories of the reconfigurable UDec architecture can be generated.

*n*_{min} is equal to the maximal configuration generation latency plus the maximal configuration transfer latency.

In order to estimate the configuration generation latency of the reconfigurable UDec architecture, C code performing run-time configuration generation has been implemented on an ARM Cortex-A15 core running at 1600 MHz. It is important to note that this C code has been neither fully optimized nor parallelized. It consists of two steps. The first step is an implementation of the algorithm shown in Fig. 7, which determines the number of decoding iterations to be performed and the level of sub-block parallelism for a given configuration. The second step generates the configuration data for each configuration memory of the UDec architecture. Implementation results show that the configuration generation latency for a maximum level of sub-block parallelism of 32 (64 RDecASIPs) is 4.14 *μ*s. The first step requires 1.44 *μ*s, while the second step requires 2.7 *μ*s to generate the 18,590 configuration bits that have to be loaded into the 64 RDecASIP configuration memories and into the platform configuration memory.

**Table 3** Configuration transfer latency in nanoseconds for different levels of sub-block parallelism

Level of sub-block parallelism (P) | Number of RDecASIPs | Transfer latency (ns), estimated with (11) |
---|---|---|
2 | 4 | 86 |
3 | 6 | 98 |
4 | 8 | 110 |
8 | 16 | 158 |
16 | 32 | 254 |
32 | 64 | 446 |
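The tabulated transfer latencies grow by a constant 6 ns per additional RDecASIP, so they can be reproduced with a simple linear fit. The sketch below is only an observation read off the table, not equation (11), which is not reproduced in this section. The name `transfer_latency_ns` and the constants 62 and 6 are assumptions derived from the tabulated values, and the fit should not be extrapolated beyond them.

```python
# Tabulated values: number of RDecASIPs -> transfer latency in ns.
TABLE_3 = {4: 86, 6: 98, 8: 110, 16: 158, 32: 254, 64: 446}


def transfer_latency_ns(n_asips: int) -> int:
    """Linear fit to the tabulated latencies (assumed, not equation (11))."""
    fixed_ns = 62     # offset common to all rows (fitted)
    per_asip_ns = 6   # increment per additional RDecASIP (fitted)
    return fixed_ns + per_asip_ns * n_asips


if __name__ == "__main__":
    for n_asips, tabulated in TABLE_3.items():
        print(n_asips, transfer_latency_ns(n_asips), tabulated)
```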

Adding the maximum configuration generation latency (4.14 *μ*s) to the maximum configuration transfer latency (446 ns for 64 RDecASIPs) gives a maximum configuration latency of 4.586 *μ*s. This maximum configuration latency represents the latency to generate a configuration for 64 RDecASIPs plus the latency to send the configuration data into the configuration memories of the platform. Considering this maximum configuration latency, a maximum theoretical throughput ensuring the configuration constraints described in Section 2 for a given frame size can be determined using (17). However, this maximal throughput is limited by the number of decoding iterations and the number of RDecASIPs implemented in the platform. Table 4 shows the maximal throughput considering the number of decoding iterations and the decoding mode. These results are obtained from (12) considering 64 RDecASIPs.

**Table 4** Maximum throughput of the reconfigurable UDec platform with 64 RDecASIPs, *N*_{instr} = 4, *F*_{clk} = 500 MHz

Number of decoding iterations | Max. throughput, serial decoding (Mbps) | Max. throughput, shuffled decoding (Mbps) |
---|---|---|
4 | 1000 | 2000 |
6 | 667 | 1334 |
7 | 571 | 1142 |
8 | 500 | 1000 |
9 | 444 | 888 |
10 | 400 | 800 |
12 | 333 | 666 |
16 | 250 | 500 |
20 | 200 | 400 |
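As a cross-check, the tabulated values are consistent, up to rounding, with a throughput proportional to *P* · *F*_{clk} / (*N*_{instr} · *N*_{iter}) that doubles in shuffled mode, with *P* = 32 for 64 RDecASIPs. The sketch below encodes this relation. It is inferred from the table only and is not equation (12), which is not reproduced in this section; the function name `max_throughput_mbps` is hypothetical.

```python
# Tabulated values: iterations -> (serial Mbps, shuffled Mbps).
TABLE_4 = {
    4: (1000, 2000), 6: (667, 1334), 7: (571, 1142),
    8: (500, 1000), 9: (444, 888), 10: (400, 800),
    12: (333, 666), 16: (250, 500), 20: (200, 400),
}


def max_throughput_mbps(
    p: int,
    n_iter: int,
    f_clk_mhz: float = 500.0,
    n_instr: int = 4,
    shuffled: bool = False,
) -> float:
    """Relation inferred from the tabulated values (assumed, not equation (12))."""
    serial = p * f_clk_mhz / (n_instr * n_iter)
    # The shuffled column is twice the serial column in every row.
    return 2.0 * serial if shuffled else serial


if __name__ == "__main__":
    for n_iter, (serial, shuffled) in TABLE_4.items():
        print(n_iter,
              round(max_throughput_mbps(32, n_iter)), serial,
              round(max_throughput_mbps(32, n_iter, shuffled=True)), shuffled)
```

The shuffled column of the table appears to double the already rounded serial value, which is why some entries differ by 1 Mbps from the unrounded relation.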

Tables 5 and 6 give the maximum throughput supporting the frame-by-frame configuration scenario in serial and shuffled decoding modes, respectively. Two kinds of values can be distinguished. The first one concerns throughput values limited by the maximum configuration latency (4.586 *μ*s), which are shown in white cells. In this case, the maximum throughput for a given frame size is determined using (17). The second one concerns throughput values limited by the number of ASIPs integrated in the platform, which are shown in gray cells. In this case, the maximum throughput for a given frame size is determined using the results given in Table 4. These results show that the proposed configuration management of the reconfigurable UDec architecture offers an efficient solution respecting the configuration scenario presented in Section 2, with maximum throughput values of up to 667 and 1334 Mbps in serial and shuffled decoding modes, respectively.
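The interplay of the two limits can be illustrated numerically. The sketch below assumes that the latency-limited bound is the frame size divided by the maximum configuration latency, i.e., the frame must last at least as long as the configuration of the next frame. Equation (17) is not reproduced in this section, so this form is an assumption, and the name `frame_by_frame_limit_mbps` is hypothetical. The platform limit is taken from the 64-RDecASIP throughput values.

```python
GENERATION_LATENCY_US = 4.14  # configuration generation for 64 RDecASIPs
TRANSFER_LATENCY_US = 0.446   # configuration transfer for 64 RDecASIPs
# Sum of both contributions, about 4.586 microseconds.
CONFIG_LATENCY_US = GENERATION_LATENCY_US + TRANSFER_LATENCY_US


def frame_by_frame_limit_mbps(
    frame_bits: int,
    platform_limit_mbps: float,
    latency_us: float = CONFIG_LATENCY_US,
) -> float:
    """Smaller of the latency bound (assumed form) and the platform limit."""
    # Bits per microsecond is numerically equal to Mbps.
    latency_bound = frame_bits / latency_us
    return min(latency_bound, platform_limit_mbps)


if __name__ == "__main__":
    # Serial decoding, 6 iterations: platform limit of 667 Mbps.
    for frame_bits in (96, 480, 880, 1920, 4800, 6144):
        print(frame_bits, round(frame_by_frame_limit_mbps(frame_bits, 667.0), 1))
```

With these assumptions, small frames are bounded by the configuration latency regardless of the number of iterations, while large frames saturate at the platform limit. The results agree with the tabulated values to within 1 Mbps.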

**Table 5** Maximum throughput supporting frame-by-frame configuration scenario in serial decoding with 64 RDecASIPs, *N*_{instr} = 4, *F*_{clk} = 500 MHz

Frame size (bits) | Max. throughput (Mbps), 6 iter. | Max. throughput (Mbps), 10 iter. | Max. throughput (Mbps), 20 iter. |
---|---|---|---|
96 | 21 | 21 | 21 |
480 | 105 | 105 | 105 |
880 | 192 | 192 | 192 |
1920 | 418 | 400 | 200 |
4800 | 667 | 400 | 200 |
6144 | 667 | 400 | 200 |

**Table 6** Maximum throughput supporting frame-by-frame configuration scenario in shuffled decoding with 64 RDecASIPs, *N*_{instr} = 4, *F*_{clk} = 500 MHz

Frame size (bits) | Max. throughput (Mbps), 6 iter. | Max. throughput (Mbps), 10 iter. | Max. throughput (Mbps), 20 iter. |
---|---|---|---|
96 | 21 | 21 | 21 |
480 | 105 | 105 | 105 |
880 | 192 | 192 | 192 |
1920 | 418 | 418 | 400 |
4800 | 1047 | 800 | 400 |
6144 | 1334 | 800 | 400 |

**Table 7** Comparison of supported dynamic configuration features with relevant existing works

Work | Supported standards | Maximum throughput (Mbps) | Frame-by-frame configuration | Run-time configuration generation |
---|---|---|---|---|
[16] | LDPC | 312, 263 | No | No |
[16] | DBTC | 173 @6iter. | No | No |
[16] | SBTC | 173 @6iter. | No | No |
[17] | DBTC | 21 @6iter./ASIP | No | No |
[17] | SBTC | 21 @6iter./ASIP | No | No |
[20] | LDPC | 455 | Yes with BER degradation | No |
[20] | DBTC | 292 @8iter. | Yes with BER degradation | No |
This work | DBTC | 1334 @6iter. | Yes | Yes |
This work | SBTC | 1334 @6iter. | Yes | Yes |

## 7 Conclusions

This paper presents the first solution that allows frame-by-frame run-time configuration management of a high-throughput multi-processor turbo decoder. It provides an analysis of how the number of decoding iterations evolves with the level of sub-block parallelism, so that this evolution can be integrated into the configuration management of the UDec architecture. A configuration management approach in which the configuration information is generated at run-time has been proposed. This solution provides an efficient method for exploiting the capacity of the reconfigurable UDec architecture while ensuring a frame-by-frame configuration process with respect to the application requirements in terms of throughput and error rate. Considering a maximum configuration latency of 4.586 *μ*s, the maximum throughput supported by the architecture implementing 64 RDecASIPs is 1334 Mbps when shuffled turbo decoding is enabled.

## Declarations

### Authors’ contributions

VL designed and implemented the proposed architecture and wrote the paper. AB scientifically supervised the work, provided the simulation results used in this work, and participated in writing the paper. GG and J-PD scientifically supervised the work and contributed in implementing the proposed architecture. All authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


## References

1. C Berrou, A Glavieux, P Thitimajshima, in Proc. of the IEEE International Conference on Communications (ICC), 2. Near Shannon limit error-correcting coding and decoding: turbo-codes. 1, (1993), pp. 1064–1070. doi:10.1109/ICC.1993.397441.
2. 3GPP TS 36.212. Evolved Universal Terrestrial radio access (E-UTRA); multiplexing and channel coding, version 8.4.0 (2008). http://www.etsi.org/deliver/etsi_ts/136200_136299/136212/08.04.00_60/ts_136212v080400p.pdf.
3. IEEE Standard for Local and Metropolitan Area Networks Part 16: Air Interface for Fixed and Mobile Broadband Wireless Access Systems (Std., 2006). doi:10.1109/IEEESTD.2006.99107.
4. C-C Wong, H-C Chang, Reconfigurable turbo decoder with parallel architecture for 3GPP LTE system. IEEE Trans. Circuits Syst. II: Express Briefs. **57**(7), 566–570 (2010). doi:10.1109/TCSII.2010.2048481.
5. J-H Kim, I-C Park, in Proc. of the IEEE Custom Integrated Circuits Conference (CICC). A unified parallel radix-4 turbo decoder for mobile wimax and 3GPP-LTE, (2009), pp. 487–490. doi:10.1109/CICC.2009.5280790.
6. D-S Cho, H-J Park, H-C Park, in Proc. of the International Conference on Telecommunications (ICT). Implementation of an efficient UE decoder for 3G LTE system, (2008), pp. 1–5. doi:10.1109/ICTEL.2008.4652642.
7. D Wu, R Asghar, Y Huang, D Liu, in Proc. of the IEEE 8th International Conference on ASIC (ASICON). Implementation of a high-speed parallel turbo decoder for 3GPP LTE terminals, (2009), pp. 481–484. doi:10.1109/ASICON.2009.5351623.
8. C-H Lin, C-Y Chen, E-J Chang, A-Y Wu, in Proc. of the 13th International Symposium on Integrated Circuits (ISIC). A 0.16nj/bit/iteration 3.38mm2 turbo decoder chip for WiMAX/LTE standards, (2011), pp. 168–171. doi:10.1109/ISICir.2011.6131904.
9. M May, T Ilnseher, N Wehn, W Raab, in Proc. of the Design, Automation and Test in Europe Conference & Exhibition (DATE). A 150Mbit/s 3GPP LTE turbo code decoder, (2010), pp. 1420–1425. doi:10.1109/DATE.2010.5457035.
10. R Shrestha, R Paily, in Proc. of the 26th International Conference on VLSI Design and 12th International Conference on Embedded Systems (VLSID). Design and implementation of a high speed MAP decoder architecture for turbo decoding, (2013), pp. 86–91. doi:10.1109/VLSID.2013.168.
11. Xilinx, Partial Reconfiguration User Guide UG702 (v14.5).
12. S Zhang, R Qian, T Peng, R Duan, K Chen, in Proc. of the 7th International ICST Conference on Communications and Networking in China (CHINACOM). High throughput turbo decoder design for GPP platform, (2012), pp. 817–821. doi:10.1109/ChinaCom.2012.6417597.
13. L Huang, Y Luo, H Wang, F Yang, Z Shi, D Gu, in Proc. of the IET International Conference on Communication Technology and Application (ICCTA). A high speed turbo decoder implementation for CPU-based SDR system, (2011), pp. 19–23. doi:10.1049/cp.2011.0622.
14. O Muller, A Baghdadi, M Jezequel, in Proc. of the Design, Automation and Test in Europe Conference & Exhibition (DATE), 1. ASIP-based multiprocessor SoC design for simple and double binary turbo decoding, (2006), pp. 1–6. doi:10.1109/DATE.2006.244126.
15. H Moussa, O Muller, A Baghdadi, M Jezequel, in Proc. of the Design, Automation Test in Europe Conference & Exhibition (DATE). Butterfly and Benes-based on-chip communication networks for multiprocessor turbo decoding, (2007), pp. 1–6. doi:10.1109/DATE.2007.364668.
16. P Murugappa, A-K R., A Baghdadi, M Jézéquel, in Proc. of Design, Automation and Test in Europe Conference & Exhibition (DATE). A flexible high throughput multi-ASIP architecture for LDPC and turbo decoding, (2011), pp. 1–6. doi:10.1109/DATE.2011.5763047.
17. C Brehm, T Ilnseher, N Wehn, in Proc. of the International SoC Design Conference (ISOCC). A scalable multi-ASIP architecture for standard compliant trellis decoding, (2011), pp. 349–352. doi:10.1109/ISOCC.2011.6138782.
18. T Vogt, N Wehn, A reconfigurable ASIP for convolutional and turbo decoding in an SDR environment. IEEE Trans. Very Large Scale Integration (VLSI) Syst. **16**(10), 1309–1320 (2008). doi:10.1109/TVLSI.2008.2002428.
19. S Kunze, E Matus, G Fettweis, T Kobori, in Proc. of the IEEE Workshop on Signal Processing Systems (SIPS). A “multi-user” approach towards a channel decoder for convolutional, turbo and ldpc codes, (2010), pp. 386–391. http://ieeexplore.ieee.org/document/5624878/.
20. C Condo, M Martina, G Masera, VLSI implementation of a multi-mode turbo/LDPC decoder architecture. IEEE Trans. Circuits Syst. I: Reg. Papers. **60**(6), 1441–1454 (2012). doi:10.1109/TCSI.2012.2221216.
21. C Condo, M Martina, G Masera, in Proc. of the Design, Automation and Test in Europe Conference & Exhibition (DATE). A network-on-chip-based turbo/LDPC decoder architecture, (2012), pp. 1525–1530. doi:10.1109/DATE.2012.6176715.
22. V Lapotre, P Murugappa, G Gogniat, A Baghdadi, M Hubner, J-P Diguet, A dynamically reconfigurable multi-ASIP architecture for multistandard and multimode turbo decoding. IEEE Trans. Very Large Scale Integration (VLSI) Syst. **PP**(99), 1–1 (2015). doi:10.1109/TVLSI.2015.2396941.
23. P Robertson, E Villebrun, P Hoeher, in Proc. of the IEEE International Conference on Communications (ICC), 2. A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain, (1995), pp. 1009–1013. doi:10.1109/ICC.1995.524253.
24. M Bickerstaff, L Davis, C Thomas, D Garrett, C Nicol, in Proc. of the 2003 IEEE International Solid-State Circuits Conference (ISSCC). A 24mb/s radix-4 logmap turbo decoder for 3GPP-HSDPA mobile wireless, (2003), pp. 150–484. doi:10.1109/ISSCC.2003.1234244.
25. G Masera, G Piccinini, MR Roch, M Zamboni, VLSI architectures for turbo codes. IEEE Trans. Very Large Scale Integration (VLSI) Syst. **7**(3), 369–379 (1999). doi:10.1109/92.784098.
26. E Boutillon, WJ Gross, PG Gulak, VLSI architectures for the MAP algorithm. IEEE Trans. Commun. **51**(2), 175–185 (2003). doi:10.1109/TCOMM.2003.809247.
27. Y Zhang, KK Parhi, in Proceedings of the 2004 International Symposium on Circuits and Systems (ISCAS), 2. Parallel turbo decoding, (2004), pp. 509–512. doi:10.1109/ISCAS.2004.1329320.
28. O Muller, A Baghdadi, M Jezequel, Parallelism efficiency in convolutional turbo decoding. EURASIP J. Adv. Signal Process. **2010**(1), 927–920 (2010).
29. J Zhang, MPC Fossorier, Shuffled iterative decoding. IEEE Trans. Commun. **53**(2), 209–213 (2005). doi:10.1109/TCOMM.2004.841982.
30. O Muller, A Baghdadi, M Jezequel, in Information and Communication Technologies, 2006. ICTTA ’06. 2nd, 2. Exploring parallel processing levels for convolutional turbo decoding, (2006), pp. 2353–2358. doi:10.1109/ICTTA.2006.1684774.
31. V Lapotre, P Murugappa, G Gogniat, A Baghdadi, M Huebner, J-P Diguet, in Proc. of the 2013 16th Euromicro Conference on Digital System Design (DSD). Stopping-free dynamic configuration of a multi-asip turbo decoder, (2013). http://ieeexplore.ieee.org/document/6628272/.
32. V Lapotre, P Murugappa, G Gogniat, A Baghdadi, J-P Diguet, J-N Bazin, M Huebner, in Proc. of the 2013 IEEE International Symposium on Circuits and Systems (ISCAS). Optimizations for an efficient reconfiguration of an ASIP-based turbo decoder, (2013). http://ieeexplore.ieee.org/document/6571888/.
33. V Lapotre, P Murugappa, G Gogniat, A Baghdadi, M Huebner, J-P Diguet, in Proc. of the 2013 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). A reconfigurable multi-standard ASIP-based turbo decoder for an efficient dynamic reconfiguration in a multi-ASIP context, (2013). http://ieeexplore.ieee.org/document/6654620/.
34. C Schurgers, F Catthoor, M Engels, Memory optimization of MAP turbo decoder algorithms. IEEE Trans. Very Large Scale Integration (VLSI) Syst. **9**(2), 305–312 (2001). doi:10.1109/92.924051.
35. S Benedetto, D Divsalar, G Montorsi, F Pollara, Soft-output decoding algorithms in iterative decoding of turbo codes (1996). The Telecommunications and Data Acquisition Progress Report 42-124. NASA Code 315-91-20-20-53. https://ipnpr.jpl.nasa.gov/progress_report/42-124/title.htm.