EURASIP Journal on Applied Signal Processing 2003:13, 1317–1327
© 2003 Hindawi Publishing Corporation

A Novel High-Speed Configurable Viterbi Decoder for Broadband Access

A novel design and implementation of an online reconfigurable Viterbi decoder is proposed, based on an area-efficient add-compare-select (ACS) architecture, in which the constraint length and traceback depth can be dynamically reconfigured. A design-space exploration to trade off decoding capability, area, and decoding speed has been performed, from which the maximum level of pipelining against the number of ACS units to be used has been determined while maintaining an in-place path metric updating. An example design with constraint lengths from 7 to 10 and a 5-level ACS pipelining has been successfully implemented on a Xilinx Virtex FPGA device. FPGA implementation results, in terms of decoding speed, resource usage, and BER, have been obtained using a tailored testbench. These confirmed the functionality and the expected higher speeds and lower resources.


INTRODUCTION
Overcoming the variable deterioration in the reliability of a broadband communication channel in real time is a critical issue. That is why channel-coding techniques such as convolutional codes represent an important part of any broadband communication system. For example, DSL, WLAN, and 3G standards all require variations of convolutional coding with differing coding performance (constraint length and code rate) at differing data rates and therefore require differing decoding performance, usually using Viterbi decoding [1]. Therefore, from the viewpoint of channel-coding techniques, this demands both high decoding speed and variable decoding capability to match the channel conditions. Furthermore, it is becoming increasingly important to develop hardware implementations that can operate over a range of standards and can support multiple networks without redesign. Hence both hardware performance and flexibility are crucial. This requires high-speed, low-power dynamically reconfigurable forward error control coding dedicated hardware architectures that can operate within a range of channel conditions under a number of speed/power performance constraints at different time intervals.
Designing and implementing such architectures is a challenging problem for Viterbi decoders with large constraint lengths, since decoding capability and decoding complexity are closely related to the constraint length used. A larger constraint length can offer a higher decoding capability but at the expense of a higher decoder complexity, often expressed as a cost function of resource usage versus decoding delay versus decoding capability, depending on the specific hardware architecture adopted. A useful Viterbi decoder architecture will therefore offer the flexibility to trade off the parameters of this cost function with reasonable performance. This requires architectural-level decisions to allow optimum resource sharing and maximum pipelining to achieve a practical compromise between resource usage and decoding performance for a range of constraint lengths. Such architectural decisions range from state-parallel to state-serial architectures. On the one hand, a state-parallel architecture, in which the number of ACSs is equal to the number of states and all ACSs operate in parallel, can offer high decoding speed, which depends only on the computation delay of the ACS feedback loop. However, the hardware complexity increases exponentially with the constraint length of the convolutional code, which often makes these architectures unsuitable for applications requiring codes with large constraint lengths such as 3G (constraint length 9). On the other hand, in a state-serial architecture (sometimes referred to as a software solution), all states share one ACS; although flexible, such an architecture would result in a huge decoding delay for large constraint lengths, and hence a throughput too limited for most broadband applications. An area-efficient/foldable architecture as proposed in [2,3,4,5] uses more than one ACS.
The number of ACSs to be used depends on the requirement of resource usage, and as such this class of architectures is attractive for a configurable implementation solution for large constraint lengths without excessive penalties in terms of resource usage. However, their speed performance suffers when the ratio of number of states to number of ACS units increases. Therefore, such architectures would only be possible for broadband access performance if their design space is explored in terms of maximum speedup (pipelining) versus number of ACS units (area) versus constraint length (decoding capability).
In this paper, we investigate the design space for area-efficient Viterbi decoders and develop an online reconfigurable architecture that will support a range of constraint lengths without an excessive loss of speed performance.
A scheduling program is used to systematically determine the maximum level of pipelining (speedup) that can be applied to the decoder in an area-efficient/foldable architecture with in-place path metric updating [6]. This enables the exploration of the trade-off of decoding speed (throughput) versus area (number of ACS units) for a range of constraint lengths.
This exploration is undertaken for constraint lengths from 7 to 10, a range selected because it covers many broadband access applications and is challenging enough in terms of complexity to validate the design approach adopted. The optimum solution in terms of throughput versus area versus decoding capability (which is limited here to constraint lengths 7 to 10) yielded a maximum of 5 pipeline levels for an area-efficient architecture with 8 ACS units using in-place path metric updating. This gives a speedup of 5 times over designs using a similar area-efficient/foldable architecture and achieves 5/8 the speed of a state-parallel architecture. The speed/throughput is of course determined by the requirements of the lowest constraint length, in this case, 7. In addition to the in-place updating, pipelining also enables a reduction in path metric memory by allowing lower bit resolution for the computations.
The design is then implemented on a Virtex FPGA and tested using a developed hardware testbench. Actual hardware performance figures and BER curves are obtained to confirm the functionality and performance improvements.
It is important to note that Viterbi decoders have been widely investigated and implementations of configurable decoders have been reported in many papers. For example, [7] implemented an adaptive Viterbi decoder (AVD) based on a reconfigurable processor board (RCPB), in which the constraint length can be reconfigured from 7 to 15. The AVD is specifically designed for an FPGA platform by using the features of FPGA configuration, so it is not suitable for applications where instant online reconfiguration is required, because FPGA configuration is very slow. In [8], a reconfigurable Viterbi decoder architecture, in which the constraint length can be reconfigured from 3 up to 7, was proposed by adopting a state-parallel ACS module. Because the hardware complexity of state-parallel ACS architectures grows exponentially with the constraint length, this approach is not suitable for large constraint lengths.

Table 1: ACS pipeline levels and throughput versus the ratio N/P.

N/P                        1    2    4    8     16    32
ACS pipeline levels        1    1    2    5     10    20
Throughput/speed (Mbps)    F    F/2  F/2  5F/8  5F/8  5F/8
To our knowledge, the approach adopted in this paper, the level of performance improvements, and the trade-offs achieved have not been reported before.
The paper is organised as follows. A brief design-space exploration is given in Section 2. The architecture of a configurable Viterbi decoder example is described in Section 3. FPGA implementations and performance comparisons based on the FPGA prototype are given in Section 4. Comparisons and conclusions are presented in Sections 5 and 6, respectively.

DESIGN-SPACE EXPLORATION FOR AREA-EFFICIENT ARCHITECTURES
As already mentioned in the introduction, the trade-off of area versus speed versus decoding capability is crucial in a reconfigurable area-efficient/foldable Viterbi architecture. In our case, decoding capability corresponds to the constraint length, area corresponds to the number of ACS units used, and speed corresponds to the throughput achieved, which in this case is determined by the number of pipeline levels that can be inserted in the ACS feedback loop. A software program was written to explore this 3D design space in order to determine an optimum solution while maintaining a standard resource-saving technique known as in-place path metric update. The results are shown in Table 1.
A number of interesting observations can be made at this stage. The first column refers to a state-parallel architecture (N = P), which achieves the best speed/throughput, denoted here by F (Mbps). The second and third columns show that halving the number of ACS units (P = N/2) is the worst solution as it does not give any speedup (pipelining) advantage. In fact, we can achieve the same throughput rate of F/2 by using a 2-level pipelining of the ACS feedback loop on a quarter of the number of ACS units (P = N/4). This corresponds to a speedup by a factor of 2. The extreme case of the last column shows that a throughput rate of 5F/8 can, in theory, be maintained on a number of ACS units P = N/32 as long as we can insert 20 levels of pipelining. Of course, pipeline balancing is a critical issue in this case and adopting such a solution in practice would not be advisable.
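The entries of Table 1 are consistent with a simple relation: the relative throughput equals the number of ACS pipeline levels divided by the ratio N/P. The following minimal Python sketch tabulates this relation. The function name and the relation itself are assumptions inferred from the table values, not a formula stated by the authors.

```python
from fractions import Fraction

# (N/P ratio, ACS pipeline levels) pairs as listed in Table 1.
TABLE_1 = [(1, 1), (2, 1), (4, 2), (8, 5), (16, 10), (32, 20)]


def relative_throughput(ratio, pipeline_levels):
    """Throughput as a fraction of the state-parallel rate F.

    Assumed model: each ACS unit serves `ratio` states per iteration,
    and pipelining the feedback loop speeds the clock up by
    `pipeline_levels`.
    """
    return Fraction(pipeline_levels, ratio)


if __name__ == "__main__":
    for ratio, levels in TABLE_1:
        rate = relative_throughput(ratio, levels)
        print(f"N/P = {ratio:2d}, levels = {levels:2d}, throughput = {rate} x F")
```

Under this model, the columns N/P = 8, 16, and 32 all reach 5/8 of F, and N/P = 8 does so with the fewest pipeline levels.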
The optimum solution from a practical hardware implementation viewpoint is the fourth column, which corresponds to using a number of ACS units P = N/8. This gives a throughput of 5F/8 with 5 levels of ACS pipelining. The next section explains in detail the issues involved in the context of a design example.

CONFIGURABLE VITERBI DECODER ARCHITECTURE
A reconfigurable Viterbi decoder, which is based on an area-efficient ACS architecture, is composed of a branch metric (BM) module, an ACS module, a best-state module, and a traceback module.

BM module
The BM module generates the BMs [9]. For ease of implementation, a ROM (128 × 2) is used to store the 120 2-bit index data needed for each butterfly (BF) unit. For each ROM, the 120 2-bit index data are arranged as shown in Table 2. The first 8 addresses (0 to 7) are not used, and then 8 addresses (8 to 15), 16 addresses (16 to 31), 32 addresses (32 to 63), and 64 addresses (64 to 127) are used for constraint lengths 7, 8, 9, and 10, respectively.
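The ROM layout described above places the region for constraint length K at addresses 2^(K−4) to 2^(K−3) − 1. The short Python sketch below computes these regions; the function name `rom_range` is hypothetical and the closed-form expression is inferred from the address ranges quoted in the text.

```python
def rom_range(constraint_length):
    """Return (first_address, count) of the ROM region for K in 7..10.

    Inferred from the text: K = 7 uses addresses 8 to 15, K = 8 uses
    16 to 31, K = 9 uses 32 to 63, and K = 10 uses 64 to 127.
    """
    if not 7 <= constraint_length <= 10:
        raise ValueError("supported constraint lengths are 7 to 10")
    base = 1 << (constraint_length - 4)  # 8, 16, 32, 64
    return base, base                    # region is [base, 2 * base - 1]


if __name__ == "__main__":
    for k in range(7, 11):
        first, count = rom_range(k)
        print(f"K = {k}: addresses {first} to {first + count - 1}")
```

The four regions together hold 8 + 16 + 32 + 64 = 120 entries, which matches the 120 index data stored per ROM.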

ACS module
In the proposed architecture, this module is the most critical part, in which a novel ACS pipeline scheme is implemented to achieve higher ACS computation speed. To better describe the ACS pipeline scheme, we consider the case of constraint length 7, so the number of states is 64. Assume that the number of available ACS units is 8. The key feature of the proposed ACS pipeline scheme is to speed up ACS operations by inserting the maximum number of ACS pipeline levels.
For the purpose of simplification, BF units, rather than ACS units, are used to explain the proposed scheme. The diagram of BF unit is illustrated in Figure 1. Each BF unit consists of two ACS units that share the same input and output states. More specifically, for each BF, the path metrics for two current states are obtained from the current BMs and the path metrics of two previous states, which lead to current states by executing two ACS operations.
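As an illustration of the BF operation just described, the following Python sketch models one butterfly as two ACS operations on a shared pair of previous states. It assumes the usual shift-register state numbering, in which previous states j and j + N/2 lead to current states 2j and 2j + 1, and it assumes that a smaller path metric is better. The function names and the branch metric arguments are hypothetical.

```python
def butterfly_states(j, num_states):
    """Return ((prev_a, prev_b), (cur_0, cur_1)) for butterfly j.

    Assumed numbering: previous states j and j + N/2 both lead to
    current states 2j and 2j + 1.
    """
    half = num_states // 2
    return (j, j + half), (2 * j, 2 * j + 1)


def butterfly(pm_a, pm_b, bm_a0, bm_b0, bm_a1, bm_b1):
    """Two ACS operations sharing the same pair of previous states.

    Returns ((new_pm_0, survivor_0), (new_pm_1, survivor_1)).
    Survivor bit 0 selects previous state a, 1 selects previous state b.
    """
    def acs(candidate_a, candidate_b):
        # Add was done by the caller; here we compare and select.
        if candidate_a <= candidate_b:
            return candidate_a, 0
        return candidate_b, 1

    return acs(pm_a + bm_a0, pm_b + bm_b0), acs(pm_a + bm_a1, pm_b + bm_b1)


if __name__ == "__main__":
    prev, cur = butterfly_states(0x01, 64)
    print([f"{s:02X}" for s in prev], "->", [f"{s:02X}" for s in cur])
    print(butterfly(3, 5, 1, 0, 2, 0))
```

With 64 states, butterfly 01 pairs previous states 01 and 21 with current states 02 and 03.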
The overall architecture of the ACS module is shown in Figure 2. BF0, BF1, BF2, and BF3 are BF units. There are 4 BF units, which make up the 8 ACS units used in our area-efficient ACS module. Switch0 and Switch1 are 4 × 4 switches, the function of which, as given in Table 3, is to permute the path metric network in such a way that the global routing network can be localized by these regular bus-switch components. Unlike [10], in order to have an identical simplified architecture for all BF units, a 4 × 4 switch is used instead of two 2 × 2 switches. DpRAM0 to DpRAM7 are dual-port RAMs used for path metric memory. With in-place path metric updating, the required path metric memory size is equal to the number of path metrics, which is the same as the number of states (there are 64 states in our case). So the depth of each path metric memory DpRAM is 8.
The initial arrangement of all the 64 path metrics in the path metric memory is given at iteration 0 in Table 4, in which the state number is used to denote the corresponding path metric. For instance, the path metric of state 2D is assigned into dual-port memory DpRAM1 at address 5, and will be the output to BF0 as PmIn01 for ACS computation. Following the architecture of the ACS module shown in Figure 2, with proper selection control as shown in Table 3, the state distribution at iteration 1 can be obtained from iteration 0 after 8 cycles by executing in-place path metric updating. Each iteration takes 8 cycles and the initial arrangement of the state of path metrics in DpRAM is re-established after 6 iterations in terms of the property of in-place path metric updating technique [6]. Only iterations 0 and 1 are given in Table 4, in which we can see that due to in-place path metric updating, the path metric distributions are different between iterations 0 and 1.
Figure 2: The architecture of the ACS module.
Address scrambling is required for in-place path metric updating to be executed. In other words, address scrambling is used to schedule the right path metric into the right cycle so that the same set of path metrics is read into the BF units for ACS operation at the same cycles of any iteration. There are many different address scrambling methods, all of which can meet the requirements of in-place path metric updating. However, a further requirement of address scrambling is that the maximum number of pipeline levels be obtained without affecting in-place path metric updating. For further discussion, we consider two specific address scrambling methods as shown in Table 5, in which only the first two iterations are given.
For address scrambling 1, for any path metric memory, the path metric is read from address i at cycle i of iteration 0, where i is from 0 to 7. At iteration 1, for path metric memories DpRAM0 to DpRAM3, the path metrics are read from addresses 0, 2, 4, 6, 1, 3, 5, and 7 at cycles 0, 1, 2, 3, 4, 5, 6, and 7, respectively, while for DpRAM4 to DpRAM7, the path metrics are read from addresses 1, 3, 5, 7, 0, 2, 4, and 6 at cycles 0, 1, 2, 3, 4, 5, 6, and 7, respectively. With address scrambling, at any iteration, the same path metrics are read out at the same cycles as in the first iteration. For example, at cycle 4 of any iteration, the path metrics of states 09, 29, 19, 39, 01, 21, 11, and 31 must be read from the path metric memory into the 4 BF units, BF0, BF1, BF2, and BF3. After the multiplexing of the two switches, Switch0 and Switch1, the output path metrics of states 02, 22, 12, 32, 03, 23, 13, and 33 are written back to the path metric memory at the same addresses. From Tables 3 and 4, we can see that the output path metrics of states 02, 22, 12, and 32 will not be read until 6 cycles later, while the output path metrics of states 03, 23, 13, and 33 will not be read until 10 cycles later. Therefore, 6 cycles are available for the ACS computations of the path metrics read out at cycle 4 without any impact on in-place path metric updating. Likewise, for any other cycle, the number of cycles allowed for the corresponding ACS computation can be worked out, as given in Table 6. From the point of view of the entire ACS module, with address scrambling 1, 4 cycles are available for the ACS computation; in other words, 4 pipeline levels can be inserted into the ACS feedback loop to speed up ACS computation.
By applying the same method to address scrambling 2, which is obtained from address scrambling 1 by swapping the addresses between cycles 3 and 4, the corresponding allowed cycles for ACS are obtained as in Table 7. With address scrambling 2, 5 pipeline levels are available for ACS operations. From the above discussion, for our area-efficient ACS module with constraint length 7 and the area-saving requirement of 8 ACS units, at least 5 pipeline levels can be introduced for the ACS operation. Moreover, by using an exhaustive computer search, we found that 5 is the maximum number of pipeline levels that can be introduced for the above area-efficient ACS module.
With the usage of 8 ACS units, the maximum number of ACS pipeline levels can be worked out for constraint lengths from 7 to 10 as shown in Table 8.
Therefore, in order to implement our ACS module, in which constraint length can be reconfigurable from 7 to 10 with the restriction of 8 ACS units, 5 ACS pipeline levels can be inserted into ACS feedback loop.
To reduce the delay of the ACS computational loop, two's complement arithmetic [11] is normally used for implicit renormalization of the path metrics. Furthermore, in order to enable modulo normalization of the path metrics, according to [12,13], the minimum resolution of the path metrics is given by (1), where N is the number of states, λmax is the maximum BM, and k is 1 and 2 for radix-2 ACS and radix-4 ACS, respectively. Hence, for a maximum constraint length of 10 and radix-2 ACS with 3-bit quantisation, N = 512, k = 1, and λmax = 14; thus (1) gives a minimum resolution of the path metrics of 9 bits. In other words, at least a 9-bit data width is required for the path metric memory in order to use modulo normalization for the path metrics. However, in our reconfigurable Viterbi decoder, the 5-level ACS pipeline scheme allows a modified variable shift path metric normalization [12] and saturation protection circuits to be inserted into the ACS feedback loop in a pipeline fashion. This allows an even lower resolution to be used for the path metric without decoding performance loss. The modified variable shift path metric normalization is realized by subtracting a constant value from all path metrics if all path metrics are greater than this constant value, rather than subtracting the minimum path metric from all path metrics. Hence, no minimum path metric selection is required in our modified variable shift path metric normalization. The saturation protection circuit, which is used to avoid catastrophic overflow, is implemented by setting any overflowing path metric to the maximum value. With our modified variable shift path metric normalization and saturation protection scheme, a 6-bit path metric is sufficient for the path metric computation in the proposed reconfigurable Viterbi decoder, without suffering from a decoding performance penalty.
Therefore, a 33% reduction of path metric memory usage has been achieved, compared with the case of modulo normalization of the path metrics. In [5], a 12-bit path metric was used for adequate resolution; with path metric rescaling and saturation protection, however, a 6-bit path metric is sufficient in the proposed configurable Viterbi decoder without a decoding performance penalty. This is a 50% reduction of path metric memory usage compared with the case of [5].
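The numbers in this discussion can be checked with a short Python sketch. The bound used in `modulo_norm_bits` is one commonly quoted form for modulo normalization and is an assumption here, since the equation itself is not reproduced in the text; it does yield the 9 bits quoted for N = 512, k = 1, and λmax = 14. The `normalize` function is a hypothetical model of the modified variable shift normalization with saturation protection, assuming unsigned path metrics.

```python
import math


def modulo_norm_bits(num_states, lambda_max, k=1):
    """Assumed bound: ceil(log2(lambda_max * (log2(N) + k))) + 1 bits."""
    span = lambda_max * (math.log2(num_states) + k)
    return math.ceil(math.log2(span)) + 1


def memory_reduction(old_bits, new_bits):
    """Fractional saving in path metric memory when narrowing the word."""
    return 1 - new_bits / old_bits


def normalize(path_metrics, threshold, width=6):
    """Subtract `threshold` only if every metric has reached it,
    then clamp any metric that exceeds the word width (saturation)."""
    max_value = (1 << width) - 1
    if all(pm >= threshold for pm in path_metrics):
        path_metrics = [pm - threshold for pm in path_metrics]
    return [min(pm, max_value) for pm in path_metrics]


if __name__ == "__main__":
    print(modulo_norm_bits(512, 14, k=1))      # bits for modulo normalization
    print(round(memory_reduction(9, 6), 2))    # saving versus 9 bits
    print(round(memory_reduction(12, 6), 2))   # saving versus 12 bits
    print(normalize([40, 50, 70], threshold=32))
```

Note that no minimum search is needed in `normalize`: the only global test is whether all metrics have reached the constant threshold.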

Best-state module
There are two traceback solutions in a Viterbi decoder, best state and fixed state. In a best-state solution, the best-state survivor path is found for the traceback operation, while in a fixed-state solution the survivor path of an arbitrary state, usually state 0, is used for tracing back. An in-depth discussion of decoding performance for best-state and fixed-state solutions is given in [14]. It is shown that, for comparable performance, the traceback depth of the fixed-state solution is roughly twice that of the best-state solution.
The size of the survivor memory is proportional to the traceback depth, and a larger traceback depth results in more memory usage. Therefore, the survivor memory usage of a fixed-state solution can be twice that of a best-state solution. Generally, fixed-state decoding is only employed when it is expensive to find the best state, such as in the case of a state-parallel architecture with a large constraint length. For our reconfigurable Viterbi decoder, because only 8 ACS units operate in parallel, only 7 compare-select (CS) units are needed to pick out the best state, and only a 3-cycle extra initial delay is introduced. The best-state module consists of 7 CS units working in a pipeline to find the best state for the traceback module to execute the best-state traceback. Therefore, the hardware overhead for the best-state solution is very low.
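One arrangement that matches the figures above (7 CS units, 3 cycles of delay) is a binary comparison tree over the 8 parallel ACS outputs: 4 + 2 + 1 comparators in 3 levels. The Python sketch below models such a tree. The tree arrangement, the convention that a smaller path metric is better, and the function name are assumptions; how results are combined across the cycles of one iteration is not described in the text and is not modelled here.

```python
def best_state(candidates):
    """Reduce (path_metric, state) pairs with a tree of compare-select steps.

    Returns (best_pair, cs_operations, tree_levels). The number of
    candidates is assumed to be a power of two (8 in the text).
    """
    level = list(candidates)
    cs_operations = 0
    tree_levels = 0
    while len(level) > 1:
        # Each pair of neighbours feeds one compare-select unit.
        level = [min(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        cs_operations += len(level)
        tree_levels += 1
    return level[0], cs_operations, tree_levels


if __name__ == "__main__":
    outputs = [(5, 0), (3, 1), (9, 2), (3, 3), (7, 4), (8, 5), (6, 6), (4, 7)]
    print(best_state(outputs))
```

In a pipelined hardware version of this tree, each level would contribute one register stage, which is consistent with the 3-cycle initial delay.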

Traceback module
In the configurable traceback module, a dual-port RAM-based survivor memory is used to perform the traceback operation. With 8 ACS units in parallel, each ACS unit outputs one survivor information bit, and an 8-bit dual-port RAM data width is used to simplify interfacing between the survivor memory and the 8 parallel ACS units. For the ACS operations to be time-efficient, which demands that no ACS be idle at any time, traceback must be executed in such a way that no overflow takes place for the 8-bit survivor data stream from the ACS module. In other words, the traceback module and the ACS module must operate in a pipeline fashion at the same throughput rate. For a time-efficient implementation of our reconfigurable Viterbi decoder, the overall throughput rates have to be 1/8, 1/16, 1/32, and 1/64 bit/cycle for constraint lengths 7, 8, 9, and 10, because all states are scheduled into 8, 16, 32, and 64 cycles for constraint lengths 7, 8, 9, and 10, respectively. We consider the case of constraint length 7 to show how to design a configurable traceback module that meets the overall throughput rate (1/8 bit/cycle). Generally, a traceback depth of five times the constraint length is needed for best-state traceback, and hence for constraint length 7, the required traceback depth is 35. Furthermore, in order to match the high-speed clock of the area-efficient ACS module, the traceback module needs to be sped up by scheduling 2 cycles into each traceback step. Therefore, at least 70 cycles are required to finish one traceback operation. In our reconfigurable Viterbi decoder, one traceback operation is scheduled for every 16 iterations of ACS operation. Because each iteration contains 8 cycles for constraint length 7, 128 cycles are available for one traceback operation, while 100 cycles, calculated from (35 + 15) × 2, are needed to retrieve 16 decoded bits at each traceback operation.
In this way, time-efficient decoding can be achieved since the number of cycles needed for each traceback operation is less than that of 16 iterations. If it is highly desirable to minimise the initial decoding delay, we can schedule one traceback operation every 12 iterations. This also meets the requirement of a time-efficient implementation, as the number of cycles for 12 ACS iterations, 12 × 8, is still greater than the (35 + 11) × 2 cycles needed to retrieve 12 decoded bits. The only drawback is a more complicated hardware architecture because 12 is not a value of the form 2^n. By using the same method, a time-efficient traceback schedule can be worked out as in Table 9.
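The scheduling arithmetic for constraint length 7 can be expressed as a small Python check. The sketch assumes, as in the text, 2 cycles per traceback step and one decoded bit per ACS iteration; the function names are hypothetical.

```python
CYCLES_PER_STEP = 2  # each traceback step is scheduled into 2 cycles


def cycles_available(iterations, cycles_per_iteration):
    """Cycles that elapse between two successive traceback operations."""
    return iterations * cycles_per_iteration


def cycles_needed(depth, decoded_bits):
    """Cycles to trace back `depth` steps and then output `decoded_bits` bits."""
    return (depth + decoded_bits - 1) * CYCLES_PER_STEP


def max_depth(iterations, cycles_per_iteration):
    """Largest traceback depth that still fits in the available cycles."""
    steps = cycles_available(iterations, cycles_per_iteration) // CYCLES_PER_STEP
    return steps - (iterations - 1)


if __name__ == "__main__":
    # Constraint length 7: 8 cycles per ACS iteration, depth 35.
    for iterations in (16, 12):
        need = cycles_needed(35, iterations)
        have = cycles_available(iterations, 8)
        print(f"{iterations} iterations: need {need}, have {have}, ok = {need <= have}")
    print("maximum depth with 16 iterations:", max_depth(16, 8))
```

For 16 iterations the largest depth that fits is 49, which agrees with the maximum traceback depth quoted for constraint length 7.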
To work out the survivor memory size required for our configurable Viterbi decoder, we have to consider the largest survivor memory usage, which occurs at constraint length 10. Because one traceback operation is scheduled every 2 ACS iterations (128 cycles) and the traceback depth is required to be not less than 50 for constraint length 10, 50 × 64 × 8 bits must be reserved for the 50 traceback steps needed to retrieve 2 decoded bits, and the traceback operation takes 102 cycles to finish. To achieve nonstop ACS operation, an extra 102 × 8 bits are needed to buffer the new survivor data from the ACS module during the traceback operation. Therefore, the overall memory required is 50 × 64 × 8 + 102 × 8 bits, that is, 3302 × 8 bits. After rounding up to the next power of two, we use a dual-port RAM (4096 × 8) as survivor memory.
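The memory sizing for constraint length 10 can be reproduced with the following Python sketch, which counts 8-bit words. The function names are hypothetical; the figures 50, 64, and 102 are taken from the text.

```python
def survivor_memory_words(depth, cycles_per_iteration, traceback_cycles):
    """8-bit words needed: one word per cycle for `depth` iterations,
    plus a buffer for data arriving while the traceback is running."""
    return depth * cycles_per_iteration + traceback_cycles


def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n."""
    return 1 << (n - 1).bit_length()


if __name__ == "__main__":
    words = survivor_memory_words(depth=50, cycles_per_iteration=64,
                                  traceback_cycles=102)
    print(words, "->", next_power_of_two(words))
```

Rounding up to a power of two keeps the address decoding simple at the cost of some unused words.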
It can be calculated from Table 9 that the maximum traceback depths are 49, 57, 61, and 63 for constraint lengths 7, 8, 9, and 10, respectively. For our FPGA prototype, due to
Before going into the details of the architecture of the configurable traceback SP module, we start with the data format in the survivor memory, because the traceback logic is determined by this format. The input data bus of the DpRAM is connected to the survivor data output from the BF units in the ACS module. From Tables 4 and 5, we know that, in the area-efficient ACS module, addresses are swapped between cycles 3 and 4 to maximise the speed of ACS computation by inserting 5 pipeline levels into the ACS loop. In order to simplify the hardware architecture of the traceback operation, an address exchange between cycles 3 and 4, which cancels the address-swapping operation of the address scrambling in Table 5, is applied before writing into the survivor memory DpRAM.
To better explain the traceback logic of the configurable traceback SP module, we start by considering constraint length 7. The survivor data generated in each ACS iteration are 8 × 8 bits, which occupy 8 address entries in the survivor memory, and the survivor memory receives survivor data from the ACS module iteration by iteration and stores them one iteration after another. A 12-bit address is required to access all data in the DpRAM (4096 × 8). The low 3 bits of the address are used to access data within one iteration and the high 9 bits are used to identify the iteration number. Table 10 shows the resulting survivor data arrangement in the DpRAM. Because the data format is the same for any iteration, Table 10 only gives the data arrangement for one iteration.
Let I be a 9-bit iteration number, let C be the low 3-bit address of the 12-bit survivor memory address, and let R be 3-bit index of 8-bit data in survivor memory. So any survivor bit in survivor memory can be identified by I, C, and R. In addition, let V be the survivor bit value with the corresponding I, C, and R. In order for traceback logic to be clearly described, I, C, R, and V are packed together and are called traceback packet in Figure 3.
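The traceback packet can be modelled as a 16-bit word with the fields I (9 bits), C (3 bits), R (3 bits), and V (1 bit). The Python sketch below packs and unpacks such a word and forms the 12-bit survivor memory address from I and C. The bit ordering (I in the most significant bits, V in the least significant bit) is an assumption that follows the order in which the fields are listed; the function names are hypothetical.

```python
def pack_packet(i, c, r, v):
    """Pack I (9 bits), C (3 bits), R (3 bits), V (1 bit) into 16 bits."""
    assert 0 <= i < 512 and 0 <= c < 8 and 0 <= r < 8 and v in (0, 1)
    return (i << 7) | (c << 4) | (r << 1) | v


def unpack_packet(packet):
    """Inverse of pack_packet: return (I, C, R, V)."""
    return (packet >> 7) & 0x1FF, (packet >> 4) & 0x7, (packet >> 1) & 0x7, packet & 1


def survivor_address(i, c):
    """12-bit address into the DpRAM (4096 x 8): high 9 bits I, low 3 bits C."""
    return (i << 3) | c


if __name__ == "__main__":
    packet = pack_packet(i=5, c=0b100, r=0b101, v=1)
    print(f"{packet:016b}", unpack_packet(packet), survivor_address(5, 0b100))
```

The R field then selects one of the 8 bits in the word read from `survivor_address(i, c)`.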
Figure 3: Traceback packet (I8 I7 I6 I5 I4 I3 I2 I1 I0, C2 C1 C0, R2 R1 R0, V).
With the current traceback packet information (I, C, R, and V), the previous traceback packet can be obtained from the trellis diagram of the Viterbi algorithm. By checking all states, the traceback formulas can be deduced as in (2) to (4), where the subscripts prv and cur denote the previous and current traceback steps. Equation (4) is straightforward because the iteration number is simply reduced by one for each traceback step. As an example to verify (2) and (3), assume that the current state is 03 and the corresponding survivor bit value is "1"; it can be seen from Table 10 that the corresponding current R and C are "101" and "100," respectively. Using (2) and (3), the corresponding previous R and C can be calculated, and they give state 21 as the previous state. On the other hand, it can be seen from the trellis diagram of the Viterbi algorithm that, with survivor bit value 1, the state previous to state 03 is state 21. This agrees with the result of (2) and (3).
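The trellis side of this check can be sketched in Python. The function below works directly on state numbers rather than on the R and C fields, so it is not a reconstruction of (2) and (3). It assumes the common shift-register convention in which the survivor bit is the most significant bit of the previous state; the function name is hypothetical.

```python
def previous_state(current_state, survivor_bit, constraint_length):
    """Step one stage back in the trellis.

    Assumed convention: the current state is the previous state shifted
    left by one with the new input bit appended, so the survivor bit
    restores the bit that was shifted out of the top.
    """
    top_bit = constraint_length - 2  # states have K - 1 bits
    return (current_state >> 1) | (survivor_bit << top_bit)


if __name__ == "__main__":
    prev = previous_state(0x03, 1, constraint_length=7)
    print(f"{prev:02X}")
```

For constraint length 7, current state 03 with survivor bit 1 leads back to state 21, as in the example in the text.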
For constraint lengths 8, 9, and 10, the traceback formulas (5) to (12) are deduced in the same way; for example, for constraint length 8, I_prv = I_cur − 1. In these formulas, the subscripts prv and cur again denote the previous and current traceback steps. From (2) to (12), we can see that, for each constraint length, only two exclusive ORs and a down counter are needed to implement the traceback mechanism. Moreover, the two exclusive ORs can be shared by all constraint lengths in our configurable traceback SP module. In other words, the traceback logic of the configurable traceback SP module can be implemented by using four down counters (9-bit, 8-bit, 7-bit, and 6-bit), two exclusive ORs, and some multiplexers.

IMPLEMENTATION RESULTS OF THE FPGA PROTOTYPE
In order to validate the configurable Viterbi decoder and evaluate its performance in terms of decoding delay, speed, and resource usage, a synthesisable core of the decoder has been developed in VHDL and implemented on a Xilinx Virtex FPGA device [15]. The core's top-level interfacing is shown in Figure 5, in which the constraint length and the traceback depth can be instantly reconfigured through two configuration signals, ConstraintLength and TracebackDepth.
Figure 5: Reconfigurable Viterbi decoder core.
In the FPGA prototype, the path metric RAMs are mapped onto Virtex distributed memory, while Virtex built-in block dual-port RAMs are used for survivor memory. One port is used to receive the survivor data from the ACS module and the other accommodates the traceback operation. This leads to a very simple and regular traceback architecture. The main specifications of the FPGA implementation are given in Table 11.
The decoding throughput and initial delay are given in Table 12. This is the best possible decoding throughput rate for the area-efficient architecture with 8 ACS units in parallel because no ACS is idle at any time. In addition, the proposed configurable Viterbi decoder can work with any size of data frame, so the initial delay can be ignored for a large enough frame. For BER testing, a PC-controlled BER testbench, as shown in Figure 6, has been developed, which works in conjunction with the FPGA prototype. In order for the hardware testbench to be general and flexible, most functional modules such as message generation, FEC encoding, and the channel model are implemented in software. Ethernet communication is used to download channel data to the hardware FPGA FEC decoder and upload the decoded results for decoding performance evaluation. BER results for constraint lengths 7 to 10 with a traceback depth of five times the constraint length have been obtained and are shown in Figure 7. The measured BER results agree with the expected theoretical results [9].

COMPARISONS
Comparisons in terms of area (gates) and speed (throughput in Mbps) have been obtained from actual FPGA implementations. These are shown in Table 13. A fixed constraint-length (K = 7) Viterbi decoder was implemented using both a state-parallel architecture and an area-efficient architecture with 5 levels of pipelining and 8 ACS units, in order to evaluate the pipeline scheme. With only 30% of the hardware resources of the state-parallel implementation, the area-efficient implementation achieved a throughput of 13.5 Mbps, which is not too far off the theoretical expected rate (5/8 × 32 = 20 Mbps), taking into account the nonuniform delays across the FPGA. In order to evaluate the reconfiguration overhead, a fixed constraint-length (K = 10) decoder was also implemented and compared with the reconfigurable decoder (K = 7 to 10). As shown in Table 13, the reconfiguration overhead is only 1% while the throughputs are comparable.
The only previous work that is directly comparable to ours is the one reported in [8], based on a state-parallel implementation for constraint lengths 3 to 7 only. From Table 13, for constraint length 7, the throughput rate obtained in our case is in line with the expected ratio of 5/8 compared to the state-parallel implementation in [8]; of course, a significant area overhead would be incurred by a state-parallel implementation for constraint lengths from 8 to 10. Overall, the results obtained confirm the design-space analysis in Section 2, taking into account that the prototypes are based on FPGA implementations. ASIC implementations would be expected to yield considerably better overall performance.

CONCLUSIONS
Broadband access raises new demands for channel coding. Besides higher decoding speed and decoding capability, reconfigurable decoding performance is highly desired, which suggests that decoding speed can be traded for decoding capability to adapt to the dynamic condition of a channel. In this paper, a novel design and implementation of an online reconfigurable Viterbi decoder has been proposed based on an area-efficient ACS architecture in which the constraint length and traceback depth can be dynamically reconfigured. A design-space exploration to trade off decoding capability, area, and decoding speed has been performed, from which the maximum level of pipelining against the number of ACS units to be used has been determined while maintaining an in-place path metric updating. A challenging example design with constraint lengths from 7 to 10 has been presented together with the new ACS schedule scheme, which provides 5 level ACS pipelining in this case and which can be applied for any constraint length in a totally uniform way. In general, this pipeline scheme can be applied to any area-efficient architecture with more than 8 time units for each ACS iteration. A modified variable shift path metric normalization and saturation protection are included in the ACS pipelining which allows for the path metric memory to be further reduced by 33% through using lower resolution for the path metric, compared with the case of modulo path metric normalization. In addition, best-state traceback is used to allow significant reduction of survivor memory. The design has been successfully implemented on Xilinx Virtex FPGA devices. FPGA implementation results, in terms of decoding speed, resource usage, and BER, have been obtained using a tailored testbench. These confirmed the functionality and the expected higher speeds and lower resources. Furthermore, the reconfigurable decoding performance, trading decoding speed, and area for decoding capability, has been verified. 
Further analysis will be carried out to confirm the expected improvement in power consumption offered by the proposed architecture.

Mohammed Benaissa is currently a Senior Lecturer in the Electronic and Electrical Engineering Department at the University of Sheffield. He is a member of the Electronic Systems Group. He has been actively working in the area of VLSI signal processing, coding, and cryptography for the past 15 years. He has published more than 40 papers in recognized journals and conferences. His recent research concentrates on investigating configurable approaches to optimum hardware implementation of error control coding and cryptographic techniques and their incorporation in SOCs.
Yiqun Zhu received the B.S. degree in electrical engineering and M.S. degree in image processing from Beijing University of Aeronautics and Astronautics, China, in 1988 and 1991, respectively. From 1991 to 1998, he worked in China Aerospace Corporation as a DSP Engineer. He is currently with the Electronic Systems Group, Department of Electronic and Electrical Engineering, the University of Sheffield, pursuing his Ph.D degree.