EURASIP Journal on Applied Signal Processing 2003:13, 1346–1354
© 2003 Hindawi Publishing Corporation

Design of Application-Specific Instructions and Hardware Accelerator for Reed-Solomon Codecs

This paper presents new application-specific digital signal processor (ASDSP) instructions and their hardware accelerator to efficiently implement Reed-Solomon (RS) encoding and decoding, which is one of the most widely used forward error control (FEC) algorithms. The proposed ASDSP architecture supports various programmable primitive polynomials, and thus, hardwired RS codecs can be replaced. The new instructions and their hardware accelerator perform Galois field (GF) operations using the proposed GF multiplier and adder. Therefore, the proposed digital signal processor (DSP) architecture can significantly reduce the number of clock cycles compared with existing DSP chips. The proposed GF multiplier was implemented using the Faraday 0.25 µm standard cell library, and it can perform RS decoding at a rate of up to 228.1 Mbps at 130 MHz.


INTRODUCTION
With the rapid progress of communication technologies, various broadband access systems have been developed, such as very-high-data-rate digital subscriber line (VDSL), cable modem, wireless LAN, gigabit Ethernet, 4G wireless communication, and so forth. Currently, the software defined radio (SDR) can support various communication standards since a common hardware platform can be adapted to them by means of software [1]. However, ASIC chips face several limitations such as lack of flexibility for various communication standards, high development costs, and slow time-to-market. Due to these restrictions, implementation methods have shifted to digital signal processor (DSP)-based communication systems, which have advantages in several aspects [2]. Programmable DSPs greatly improve time-to-market and allow faster changes and upgrades than hardwired ASIC chips. In addition, DSPs can be used for various applications besides the Reed-Solomon (RS) decoder.
RS codes, providing the capability to efficiently correct burst errors as well as random errors, have been extensively used in various communications and digital data storage systems, such as power line communications (PLC) [3], digital video broadcasting terrestrial (DVB-T) system [4], vestigial sideband (VSB) system [5], cable modem [6], satellite and mobile communications [7], magnetic recording [8], and so forth. This paper presents new application-specific DSP (ASDSP) instructions and their hardware accelerator to efficiently implement RS codecs. Various algorithm blocks for RS codecs require Galois field (GF) multiply and add operations. Therefore, a typical RS decoder has been designed as a hardwired ASIC chip since an RS decoder needs special GF arithmetic units [9,10,11,12,13,14,15,16]. Moreover, the RS decoder should be redesigned to accommodate the various primitive polynomials in recent communication systems.
Existing DSP chips [17,18] require many clock cycles for GF multiply and add operations since they use general ALUs. The method that uses a lookup table (LUT) instead of GF operation units consumes a significant amount of power due to its large memory and incurs many memory access delays. Hence, existing DSP chips have not yet satisfied the requirements of high-speed communication standards. However, if DSP chips can be made to support the special architecture for the RS algorithm, they will be able to implement RS codecs for various communication standards [19]. Thus, with application-specific instructions and their hardware accelerator for the RS algorithm, the ASDSP can support various broadband communication standards.
This paper is organized as follows. Section 2 analyzes the implementation and hardware architectures of existing DSP chips [17,18] and custom-designed RS processors [9,10,11,12,13,14,15,16]. Section 3 describes the proposed RS decoding instructions and their hardware accelerator. Section 4 presents the performance comparisons with existing DSP chips. Finally, Section 5 contains conclusions.

IMPLEMENTATION OF THE EXISTING DSP-BASED RS DECODERS AND HARDWIRED RS PROCESSORS
This section describes the typical RS processor to briefly review the decoding process and analyzes the existing DSP-based implementation of RS.

Typical RS processor
Depending on the application, a typical RS processor is made up of several hardware blocks for parallel processing. Such an architecture can achieve higher transmission rates than required by current communication standards; however, due to its lack of flexibility regarding the primitive polynomials in various standards, the RS processor has to be redesigned to meet these standards.

RS encoder architecture
The architecture of the RS processor inserts 16 (2t) surplus symbols when t = 8. The generator polynomial for this architecture is represented by (1) [19,20,21]. Figure 1 shows the typical RS encoder, which has the linear feedback shift register (LFSR) structure based on the generator polynomial. When the architecture is enabled, each register is initialized to "0." As the message polynomial m(x) is inserted, the operation is executed by combining m(x) and g(x) through the LFSR structure. Once the insertion of the message polynomial m(x) has ended, the remaining values in the registers are output as parity symbols.
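The LFSR encoding described above can be sketched in software as follows. This is a minimal illustration under assumptions that are not taken from the paper: the field is GF(2^8) with the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D), the generator roots are α^0 to α^(2t−1), and the function names are hypothetical.

```python
PRIM_POLY = 0x11D  # assumed primitive polynomial x^8 + x^4 + x^3 + x^2 + 1

def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-XOR with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:          # degree reached 8: reduce by the polynomial
            a ^= PRIM_POLY
        b >>= 1
    return result

def generator_poly(two_t):
    """Coefficients of g(x) = prod_i (x + alpha^i), highest degree first.

    Assumes the roots alpha^0 .. alpha^(2t-1), with alpha = 2.
    """
    g = [1]
    root = 1
    for _ in range(two_t):
        nxt = g + [0]                      # g(x) * x
        for j, coeff in enumerate(g):      # plus g(x) * root
            nxt[j + 1] ^= gf_mul(coeff, root)
        g = nxt
        root = gf_mul(root, 2)
    return g

def rs_encode_lfsr(message, two_t=16):
    """Return the 2t parity symbols left in the LFSR registers."""
    g = generator_poly(two_t)
    regs = [0] * two_t                     # every register starts at 0
    for sym in message:
        feedback = sym ^ regs[0]
        shifted = [regs[i + 1] ^ gf_mul(feedback, g[i + 1])
                   for i in range(two_t - 1)]
        regs = shifted + [gf_mul(feedback, g[two_t])]
    return regs

parity = rs_encode_lfsr(list(range(1, 240)), two_t=16)
print(len(parity))  # 16 parity symbols for t = 8
```

Each message symbol costs one GF multiply and one GF add per register, which is the multiply-and-add pattern that recurs in the decoder blocks.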

RS decoder architecture
The RS decoding process is as follows. First, the syndrome value, which is the error pattern, is calculated, and then the error-locator polynomial is calculated to find the error locations. Second, the error values are determined and corrected. Figure 2 illustrates the typical RS decoder [20,21,22,23,24]. Figure 3 shows the syndrome calculation block. The syndrome is calculated using the roots of the generator polynomial g(x), which is used in the encoder. The syndrome polynomial represents the error pattern of the received code word. By using this error pattern, the key for error correction is decoded.
The number of cells in the syndrome block is twice the number of correctable errors. When the error correction capability (t) of the RS decoder is 8, 2t = 16 cells are needed for the syndrome block, as shown in Figure 3.
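As a rough software model of the syndrome block, each of the 2t cells can be viewed as one Horner step per received symbol: multiply the cell register by its root and add the incoming symbol. The sketch below assumes GF(2^8) with primitive polynomial 0x11D and roots α^0 to α^(2t−1); neither choice is stated in the text, and the function names are hypothetical.

```python
PRIM_POLY = 0x11D  # assumed primitive polynomial; not specified in the text

def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-XOR with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result

def syndromes(received, two_t=16):
    """Compute 2t syndromes; cell i evaluates r(x) at alpha^i (assumed roots)."""
    roots = [1]
    for _ in range(two_t - 1):
        roots.append(gf_mul(roots[-1], 2))      # alpha^0, alpha^1, ...
    regs = [0] * two_t                          # one register per cell
    for sym in received:                        # highest-degree symbol first
        regs = [gf_mul(reg, root) ^ sym for reg, root in zip(regs, roots)]
    return regs

# An error-free (all-zero) word gives all-zero syndromes.
print(any(syndromes([0] * 255)))  # False
```

A single nonzero symbol in the last position yields the same value in every cell, which gives a quick sanity check of the cell update.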
The error-locator and error-value polynomials are calculated using this syndrome polynomial. The calculation of the error-locator and error-value polynomials is the most complicated and time-consuming process in the RS decoding. The Berlekamp-Massey [9,10], Euclid's [11,12], or the modified Euclid's [13,14,15] algorithms are used in this process. In general, the architecture of the Berlekamp-Massey algorithm is smaller than that of the Euclid's algorithm. However, the serial structure of the Berlekamp-Massey algorithm has long latency and its parallel structure requires a large gate count. Figure 4 shows the architecture of the modified Euclid's algorithm [13,14,15]. This architecture is more suitable for high-speed transmission systems than that of the Berlekamp-Massey algorithm. The modified Euclid's algorithm can efficiently reduce the area since it does not require an LUT for the quotient calculation.
Figure 3: Syndrome calculation block.
After the error-locator and error-value polynomials are obtained using the Euclid's algorithm, the error locations are calculated using the Chien search [22,23], and the error values are then calculated using the Forney algorithm [13]. The algorithm for calculating the roots of the error-locator polynomial is described in Figure 5. The roots of error locations are calculated using the coefficients (λ_i) of the error-locator polynomial. The error values are computed using the coefficients (λ_i) of the error-locator polynomial and the error-value polynomial coefficients (R_i), as shown in Figure 6.
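The root search over the error-locator polynomial can be illustrated with a small software sketch of a Chien search. It keeps one register per coefficient λ_j and multiplies it by α^j at every step, so the sum of the registers equals λ(α^i). The field polynomial 0x11D, the code length 255, and the function names are assumptions made for illustration.

```python
PRIM_POLY = 0x11D  # assumed primitive polynomial; not specified in the text

def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-XOR with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result

def chien_search(lam, n=255):
    """Return every exponent i with lam(alpha^i) == 0.

    lam[j] is the coefficient of x^j in the error-locator polynomial.
    """
    steps = [1]
    for _ in range(1, len(lam)):
        steps.append(gf_mul(steps[-1], 2))       # alpha^j for register j
    terms = list(lam)                            # lam_j * alpha^(i*j), i = 0
    roots = []
    for i in range(n):
        acc = 0
        for term in terms:                       # GF add is XOR
            acc ^= term
        if acc == 0:
            roots.append(i)
        terms = [gf_mul(t, s) for t, s in zip(terms, steps)]
    return roots

# Locator (1 + alpha^3 x) for a single error at position 3
print(chien_search([1, 8]))  # [252], since alpha^252 = alpha^(-3)
```

Under these assumptions a root at α^i corresponds to the error position (255 − i) mod 255.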
Typical RS ASIC chips require hardwired GF operation units, such as modulo multipliers and adders, and thus, the architecture of the GF operation units has to be redesigned for each primitive polynomial and standard.

Existing DSP-based RS decoder
It is possible to implement the RS decoder with an existing DSP chip; however, implementing a GF operation on such a chip requires a number of ALU operations to be executed repeatedly. These operations have to be programmed as a subroutine, and this subroutine is called from the GF operation part of the main RS program [20].
Generally, a GF multiplication consists of two steps. In the first step, the two operand polynomials are multiplied as in (2). For each bit of the multiplier, starting from the least significant bit (LSB), the multiplicand is copied down if the bit is one; otherwise, zeros are copied down. The partial products copied down in successive lines are shifted one position to the left from the previous partial product. The 15-bit product, which is the third equation of (2), is obtained by XORing all partial products. In the second step, the GF operation is executed according to the primitive polynomial to convert the 15-bit data into 8-bit data. GF multiplications are shown as the "⊗" symbols in Figures 1, 3, 5, and 6. Additions and subtractions in GF operations can be implemented using XOR operations in the ALU.
Figure 7 shows the GF multiplication flow of general DSP chips that do not support the RS decoding. To implement (2), each bit of (A), from the LSB to the MSB, is ANDed with the 8 bits of (B) in cycle 1. Then, the results are shifted according to their bit positions in cycle 2. The eight shifted results are combined by XOR operations to obtain the 15-bit data in the third equation of (2). Finally, the GF operation is executed in cycle 3. The GF operation can be implemented using AND and XOR.
To implement this procedure, general purpose DSP chips require quite a number of clock cycles. The DSP used here should be accessible by a bit as well as a byte. If the DSP is a 32-bit machine, it can compute two GF multiply operations simultaneously. If the DSP is a 64-bit machine, it can compute four GF multiply operations simultaneously. If N ALUs can be operated at the same time, the number of cycles needed to compute the GF multiplication is reduced to 1/N. However, if the DSP cannot be accessed by a byte, a number of additional cycles are required.
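The two-step GF multiplication described above (AND, shift, and XOR to form a 15-bit product, then reduction by the primitive polynomial) can be sketched as follows. The default polynomial 0x11D (x^8 + x^4 + x^3 + x^2 + 1) is an assumption chosen for illustration, and the function names are hypothetical.

```python
def clmul8(a, b):
    """Step 1: carry-less multiply of two 8-bit values (up to 15 bits)."""
    product = 0
    for i in range(8):
        if (b >> i) & 1:           # multiplier bit is one: copy multiplicand
            product ^= a << i      # shifted partial product, combined by XOR
    return product

def reduce15(product, prim_poly=0x11D):
    """Step 2: reduce the 15-bit product modulo the primitive polynomial."""
    for bit in range(14, 7, -1):
        if (product >> bit) & 1:
            product ^= prim_poly << (bit - 8)
    return product

def gf_mul(a, b, prim_poly=0x11D):
    """GF(2^8) multiply; prim_poly is a parameter, so it stays programmable."""
    return reduce15(clmul8(a, b), prim_poly)

print(hex(clmul8(0xFF, 0xFF)))   # 0x5555, a full 15-bit product
print(hex(gf_mul(0x02, 0x80)))   # 0x1d, x * x^7 = x^8 reduced by 0x11D
```

On a general ALU each loop iteration maps to separate AND, SHIFT, and XOR instructions, which is where the large cycle count comes from.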
Hence, we cannot get a fast RS decoding rate since the hardware architecture and instructions of existing DSP chips do not support the GF multiplication. Therefore, for the RS decoding, the existing DSP chips can be used only in slow-speed data communication. The more recent TMS320C64x has 8 GF multipliers, and its GMPY4 instruction can perform four GF multiplications on two integers, each of which contains 4 packed bytes. Two GMPY4 instructions can be executed in parallel; hence, 8 GF multiplications can be performed in a single cycle. However, it supports only the GF multiply operation [19] and does not support the GF multiply and add operations. Moreover, it has a large hardware size and high power consumption due to its VLIW architecture.
SC140 does not support GF operations and is also a VLIW architecture with similar disadvantages. In addition, it consumes more power and needs larger memory since it uses the LUT method [25]. In the implementation using an LUT, the results of GF operations are stored in ROM or RAM, and they are accessed when they are needed [25]. When m is equal to 8, a 2^8 × 2^8 = 64 Kbyte storage device is needed. Even in a highly integrated DSP, it is hard to dedicate on-chip memory to storing these values. Regardless of the data width of the DSP, only one GF operation at a time is possible. Moreover, additional cycles are needed to access the on-chip and off-chip memories. Hence, most DSPs implement the RS decoding without using an LUT.

NEW INSTRUCTIONS AND THEIR ARCHITECTURE
This section presents three instructions for the RS decoder implementation and the proposed operation flows, and their new architecture. The proposed instructions include modulo-add (MADD), modulo-multiply (MMUL), and modulo-MAC (MMAC).
Various algorithm blocks for RS codecs require repetitive multiply and add operations, as shown in Figure 8. The Berlekamp-Massey [9,10] algorithm, the Euclid [11,12] algorithm, and the modified Euclid [13,14,15] algorithm also use the circuit shown in Figure 8 [9,10,11,12,13,14,15,19] to implement the RS decoding. The multiplier and adder used for RS have the same circuit shown in Figure 8 regardless of various algorithms or primitive polynomials. The architecture of the hardwired RS codec is redesigned based on the primitive polynomial. In general, implementing the RS decoder on an existing DSP chip is not effective since the instructions of DSP chips do not support GF multiply and add operations. The GF multiply and add operations, shown in Figure 8, are different from general multiply and add operations. Hence, we need an ASDSP chip that has a programmable architecture to support various primitive polynomials according to various communication standards. Figure 9 represents the proposed MADD, MMUL, and MMAC instructions. The MADD instruction performs the modulo (GF) add operation and can be implemented with an XOR operation of an existing ALU; thus, we do not need additional hardware for the MADD instruction. The MMUL instruction can implement the GF multiply operation for error-value detection with the proposed GF multiplier shown in Figure 10. The proposed GF multiplier can perform successive GF multiply operations by adding a small amount of extra hardware, consisting of XOR gates and AND gates. The MMAC instruction can perform successive operations of the MADD and MMUL instructions. The MMAC instruction takes one cycle to execute the general modulo MAC instruction.
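The intended semantics of the three instructions can be modeled in software as below. This is only a behavioral sketch: the operand order of MMAC (accumulator ⊕ (a ⊗ b)), the primitive polynomial 0x11D, and the Python function names are assumptions, since Figure 9 is not reproduced here.

```python
PRIM_POLY = 0x11D  # assumed; the proposed hardware makes this programmable

def madd(a, b):
    """Modulo (GF) add: a plain XOR, as on an existing ALU."""
    return a ^ b

def mmul(a, b, prim_poly=PRIM_POLY):
    """Modulo (GF) multiply in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= prim_poly
        b >>= 1
    return result

def mmac(acc, a, b, prim_poly=PRIM_POLY):
    """Modulo multiply-accumulate: acc + a * b over GF(2^8) (assumed order)."""
    return madd(acc, mmul(a, b, prim_poly))

def poly_eval(coeffs, x):
    """Horner evaluation, one mmac per coefficient (highest degree first)."""
    acc = 0
    for coeff in coeffs:
        acc = mmac(coeff, acc, x)
    return acc

print(mmac(5, 3, 7))             # 12, since 3 * 7 = 9 in GF(2^8) and 5 ^ 9 = 12
print(poly_eval([1, 3, 2], 2))   # 0, because x = 2 is a root of (x + 1)(x + 2)
```

The Horner loop shows why a fused multiply-add matters: the encoder, syndrome, Chien search, and Forney blocks all reduce to chains of this one step.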
The proposed instructions are used extensively in RS algorithm blocks, such as the encoder, the syndrome computation block, the modified Euclid's algorithm block, the Chien search block, and the Forney algorithm block, as shown in Figures 1, 3, 5, and 6. In contrast, TMS320C64x supports the modulo MUL operation but does not support the modulo MAC operation. Hence, the proposed architecture can improve the performance of the RS codec. Figure 10 shows the proposed GF multiplier block used for the MMUL and MMAC instructions in GF(2^m), m = 8. The required number of AND operations shown in the upper side of Figure 10 is the same as the value of m. In Figure 10, after two 8-bit data a and b are multiplied, the 15-bit ω(i), which is the third equation in (2), is obtained through the modulo add operation of the multiplication results. Then the 8-bit result Ω(i) can be obtained from GF multiply operations of the 15-bit ω(i).
The proposed GF multiplier uses about 630 gates including the primitive polynomial decoder. The gate count of the proposed GF multiplier is larger than that of a GF multiplier of the hardwired RS ASIC chip (about 261 gates). However, the hardwired RS ASIC chip uses about 89 GF multipliers for t = 8 [13]: 16 GF multipliers for the syndrome calculation block, 64 GF multipliers for the modified Euclid's algorithm block, 8 GF multipliers for the Chien search block, and one GF multiplier for the Forney algorithm. The proposed ASDSP uses only 8 of the proposed GF multipliers, and thus requires a much lower gate count for GF multiplication than the hardwired RS ASIC chip does. Therefore, the ASDSP has little extra hardware. When m is greater than 8, the adder can be implemented with additional XOR gates, and the GF multiplier shown in Figure 10 can also be implemented with additional AND and XOR gates.
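The gate-count comparison can be checked with a few lines of arithmetic. The sketch uses only the approximate figures quoted above and counts GF multiplier gates alone; registers, control logic, and the rest of the DSP datapath are deliberately ignored, so the ratio is an indication rather than a full area comparison.

```python
# Approximate figures quoted in the text
ASIC_GATES_PER_MULT = 261      # hardwired GF multiplier
ASDSP_GATES_PER_MULT = 630     # proposed multiplier incl. polynomial decoder

asic_multipliers = {
    "syndrome": 16,
    "modified_euclid": 64,
    "chien_search": 8,
    "forney": 1,
}
asdsp_multiplier_count = 8

asic_total_mults = sum(asic_multipliers.values())
asic_gates = asic_total_mults * ASIC_GATES_PER_MULT
asdsp_gates = asdsp_multiplier_count * ASDSP_GATES_PER_MULT

print(asic_total_mults)                    # 89
print(asic_gates)                          # 23229
print(asdsp_gates)                         # 5040
print(round(asic_gates / asdsp_gates, 1))  # 4.6
```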
The primitive polynomial decoder of the proposed GF multiplier holds the information on whether each ω(i) is enabled or disabled. About 8 cases, according to the m values and the primitive polynomials, are used in various communication standards. Hence, the decoder receives 3 bits (8 = 2^3) and outputs 15 × 8 = 120-bit control signals, as shown in Figure 11. The proposed GF multiplier performs the GF operation with these control signals. The primitive polynomial decoder is designed with combinational circuits. To implement 8 different combinations using ASIC chips, 8 different hardware implementations are required. However, the proposed ASDSP can efficiently implement these combinations.
Figure 9: The proposed MADD, MMUL, and MMAC instructions.
Figure 12 shows the overall architecture of the proposed ASDSP, based on the modified Harvard architecture. Two 16-bit data memories can be accessed in a single clock cycle since the address generation unit (AGU) generates two addresses. The data processing unit (DPU) consists of two MACs, two ALUs, and one barrel shifter to efficiently support RS. The 8 GF multipliers are also included in the DPU. The proposed ASDSP employs 7 pipeline stages: prefetch, fetch, decode, execute1, execute2, execute3, and write back. Every instruction, including program control instructions, is executed in a single cycle. The DO instruction, one of the most frequently used instructions, can also be executed in a cycle.
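One possible reading of the 15 × 8 control signals produced by the primitive polynomial decoder is a matrix whose row i tells which of the 8 output bits the product bit ω(i) contributes to, that is, x^i reduced modulo the selected primitive polynomial. The sketch below follows that interpretation; it is an assumption rather than the circuit of Figure 11, and the polynomial 0x11D and all function names are hypothetical.

```python
def control_matrix(prim_poly, m=8):
    """Return 2m-1 rows of m-bit masks; row i is x^i mod prim_poly."""
    rows = []
    value = 1
    for _ in range(2 * m - 1):
        rows.append(value)
        value <<= 1
        if (value >> m) & 1:       # degree m reached: fold back
            value ^= prim_poly
    return rows

def clmul(a, b, m=8):
    """Carry-less multiply producing the (2m-1)-bit product omega."""
    product = 0
    for i in range(m):
        if (b >> i) & 1:
            product ^= a << i
    return product

def apply_controls(omega, rows):
    """XOR together the mask of every product bit that is set."""
    out = 0
    for i, mask in enumerate(rows):
        if (omega >> i) & 1:
            out ^= mask
    return out

rows = control_matrix(0x11D)       # assumed polynomial for one standard
print(len(rows) * 8)               # 120 control bits
print(hex(apply_controls(clmul(0x02, 0x80), rows)))  # 0x1d
```

Under this reading, switching standards only means loading a different matrix, while the AND and XOR network stays fixed.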

PERFORMANCE COMPARISONS
The proposed GF multiplier used for the MMUL and MMAC instructions is implemented with a combinational circuit and can perform high-speed GF multiplication. However, the general ALU of existing DSP chips takes quite a number of clock cycles just for a GF multiplication, since it has to repeat the AND, SHIFT, and XOR instructions shown in Figure 7. Table 1 shows the performance comparisons of RS decoding between the ASDSP having 8 proposed GF multipliers shown in Figure 10 and the existing DSP chips [17,18,25]. Note that the performance figures of commercial DSP chips are given by their datasheets or references [17,18]. The overall latency of the SC140 is between 819 and 1115 clock cycles for t = 2. However, it has less error correction capability (t = 2) than the ASDSP (t = 8). The overall latency of the SC140 becomes more than double for t = 8. In addition, the proposed ASDSP reduces the overall latency by 25% compared with TMS320C64x, which supports only the GF multiplication but not the modulo MAC operation. Moreover, these VLIW DSPs have much larger hardware size and higher power consumption than the proposed one. Thus, the ASDSP having the proposed GF multiplier shows better performance than the other DSP chips in Table 1.
Figure 12: Overall architecture of the proposed ASDSP.

CONCLUSIONS
This paper proposed new ASDSP instructions and their hardware accelerator for high-speed RS decoding. First, we proposed the MADD, MMUL, and MMAC instructions that are necessary to perform the RS decoding, and proposed an architecture to support these instructions. The proposed GF multiplier, having little extra hardware overhead, can perform the GF multiplication faster than the general ALU of existing DSP chips in terms of execution cycles. Hence, the proposed ASDSP having the proposed GF multiplier can support an RS decoding rate of up to 228.1 Mbps at a 130 MHz operating frequency even with the 0.25 µm technology. In addition, the ASDSP can be adapted to various communication standards and can support SDR because of its programmability. In the near future, all of these features will be implemented on an ASDSP chip.