EURASIP Journal on Applied Signal Processing 2002:9, 936–943 c ○ 2002 Hindawi Publishing Corporation P-CORDIC: A Precomputation Based Rotation CORDIC Algorithm

This paper presents a CORDIC (coordinate rotation digital computer) algorithm and architecture for the rotation mode in which the directions of all micro-rotations are precomputed while maintaining a constant scale factor. Thus, an examination of the sign of the angle after each iteration is no longer required. The algorithm is capable to perform the CORDIC computation for an operand word-length of 54 bits. Additionally, there is a higher degree of freedom in choosing the pipeline cutsets due to the novel feature of independence of the iterations and in the CORDIC rotation.


INTRODUCTION
CORDIC (coordinate rotation digital computer) [1,2] is an iterative algorithm for the calculation of the rotation of a 2dimensional vector, in linear, circular, or hyperbolic coordinate systems, using only add and shift operations. It has a wide range of applications including discrete transformations such as Hartley transform [3], discrete cosine transform [4], fast Fourier transform (FFT) [5], chirp Z transform (CZT) [6], solving eigenvalue and singular value problems [7], digital filters [8], Toeplitz system and linear system solvers [9], and Kalman filters [10]. It is also able to detect multiuser in code division multiple access (CDMA) wireless systems [11].
The CORDIC algorithm consists of two operating modes, the rotation mode and the vectoring mode, respectively. In the rotation mode, a vector (x, y) is rotated by an angle θ to obtain the new vector (x * , y * ) (see Figure 1). In every micro-rotation i, fixed angles of the value arctan(2 −i ) are subtracted or added from/to the angle remainder θ i , so that the angle remainder approaches zero. In the vectoring mode, the length R and the angle towards the x-axis α of a vector (x, y) are computed. For this purpose, the vector is rotated towards the x-axis so that the y-component approaches zero. The sum of all angle rotations is equal to the value of α, while the value of the x-component corresponds to the length R of the vector (x, y). The mathematical relations for the CORDIC rotations are as follows: where σ i is the weight of each micro-rotation and m steers the choice of rectangular (m = 0), circular (m = 1), or hyperbolic (m = −1) coordinate systems. The required microrotations are not perfect rotations, they increase the length of the vector. In order to maintain a constant vector length, the obtained results have to be scaled by a scale factor K. Nevertheless, assuming consecutive rotations in positive and/or negative directions, the scale factor is constant and can be precomputed according to The computation of the scale factor can be truncated after n/2 iterations because the multiplicands in the last n/2 iterations are 1 due to the finite word-length and do not affect the final value of K 1 , There are two different approaches for the computation of the CORDIC algorithm. The first one uses consecutive rotations in positive and/or negative direction, where the weight of each rotation is 1. Hence, σ i is either −1 or 1, depending on the sign of the angle remainder z(i). In every iteration a significant amount of time is used to examine the most significant bit in case of a binary architecture or the most significant three digits of a redundant architecture to predict the sign of z(i) and hence the rotation direction σ i . In comparison to the CORDIC implementations with constant scale factor, other implementations use a minimally redundant radix-4 or an even higher radix number representation [12,13,14]. These architectures make use of a wider range of σ i . In case of a minimally redundant radix-4 architecture, σ i ∈ {−2, −1, 0, 1, 2}. By using this numbering system, the number of iterations can be reduced. However, the computation time per iteration increases, since it takes more time to differentiate between five different rotation direction values and to generate five different multiples of arctan(2 −i ). The scale factor also becomes variable and has to be computed every time, due to the absence of consecutive rotations leading to an increase in area.
To speed up the computation time of the CORDIC algorithm, either the number of iterations or the delay of each iteration have to be minimized. The proposed algorithm introduces a novel approach, in which the rotation direction can be precomputed by adding the rotation angle θ, a constant and a variable adjustment which is stored in a table. Hence, a significant speedup of the delay per iteration is obtained. Since all rotation directions are known before the actual rotation begins, more than one rotation can also be performed in one iteration leading to a reduction in latency. The proposed architecture also eliminates the z-datapath and reduces the area of the implementation. This paper is organized as follows. Section 2 presents the theoretical background for the novel CORDIC algorithm for rotation mode and Section 3 presents the novel architecture. Section 4 performs an evaluation of different CORDIC architectures while Section 5 concludes the paper.

Mathematical derivation using Taylor series
The summation of all micro-rotation with their corresponding weight σ i is equivalent to the rotation angle θ where σ i ∈ {−1, 1}, corresponding to the addition and subtraction of the micro-angles θ i . Since consecutive rotations are employed, the scale factor is constant. The value of σ can be interpreted as a number in radix-2 representation. The goal of the proposed method is to compute the sequence of the micro-rotation without performing any iteration. To accomplish this, σ i is recoding as 2d i − 1 leading to a binary representation in which a zero corresponds to the addition of a micro-angle [15,16]. This allows the use of simple binary adders. Adding and subtracting 2 −i to (4) results in where c 1 corresponds to c 1 = 2 − ∞ i=0 (2 −i − arctan(2 −i )). Solving (8) for d results in where c corresponds to 0.5c 1 . Table 1 shows the values of the partial offsets i for the first 10 values of i and indicates that the value of i decreases approximately by a factor of 8 with increasing i. Hence, the summation of d i i can be limited to Rather than storing the partial offsets i and computing the sum over all i of the product d i i , δ = n/3 i=1 d i i can be precomputed and stored. Hence, the only difficulty consists of determining which offset corresponds to the input θ. This can be achieved by comparing the input θ with a reference angle θ ref . The reference angles θ ref correspond to the summation of the first n/3 micro-rotation. To be certain to obtain the correct offset, θ has to be larger than the reference angle θ ref . All reference angles are stored in a ROM and are accessed by the most significant n/3 bits of θ. In addition to the reference angles, the values of δ are stored. In case of a negative difference θ ref − θ, the corresponding δ is selected, otherwise the next smaller value of δ is chosen to be subtracted from θ + c − sign(θ) · 0 .
Example 1. Assuming we have a word-length of 16 bits and θ = 0.9773844. According to Table 2, θ ref corresponds to 0.97337076 and δ = 0.03644375. Hence, d is computed as

High precision
By using a mantissa of n = 54 bits (corresponding to the floating point precision), the ROM for storing all offsets would require 2 18 entries. This is rather impractical since the required area to implement the ROM will exceed by far the area for the CORDIC implementation. To reduce the area for the ROM, δ can be split into two parts, where δ ROM is stored in a ROM while δ r is computed. By examining the Taylor series expansion of arctan(2 −i ), it becomes obvious that the partial offset for iteration i and i + 1 By comparing (13) and (16), it can be seen that (13) is about 2 3 times larger than (16). Assuming a word-length of n bits and i > n/5 − 2, the factor is 2 3 . Hence, the term n/5 −1 = −2 −3( n/5 −1) /3 + 2 −5( n/5 −1) /5 can be stored in a ROM and the remaining offset δ r is computed as The largest magnitude of δ r is smaller than 2 −3( n/5 −1) .

Example for high precision
Assume that we have a word-length of 50 bits and θ = 0.977384381116. Using the most significant 9 bits of θ, δ ROM = 0.03644501895249 can be obtained. Hence, d is computed according to

The rotation mode in hyperbolic coordinate systems
Similar to the circular coordinate system, a simple correlation between the input angle θ and the directions of the micro-rotation can be obtained. Due to the incomplete representation of the hyperbolic rotation angle θ i , some iterations have to be performed twice. In [2], it was recommended that every 4th, 13th, (3k + 1)th iteration should be repeated to complete the angle representation. Similar to the rotation mode in circular coordinate system, the rotation angle θ is equivalent to the summation of all micro-rotation with their corresponding weight. This leads to Performing a Taylor series expansion and applying σ i = 2d i − 1 results in where c corresponds to Since these extra rotation are not known in advance, an efficient high precision VLSI implementation is not possible. However, for signal processing applications using a wordlength of less than 13 bits, the ROM size corresponds to only 14 entries.

THE NOVEL ROTATION-CORDIC ARCHITECTURE
For an implementation with the operand word-length of n bits, the pre-processing part consists of a ROM of 2 n/5−2 entries in which the reference angles θ ref and the corresponding offsets δ are stored, respectively (see Figure 2). To avoid a second access to the ROM in case of θ ref > θ the next smaller offset δ k−1 is additionally stored in the kth entry of the ROM. The ROM is accessed by the n/5−2 MSB bits of θ. A binary tree adder computes, whether θ is smaller or larger than the chosen reference angle θ ref and selects the corresponding offset (either δ k or δ k−1 ). Using a 3 : 2 compressor and another fast binary tree adder, the two required additions to obtain d approx = 0.5θ + c 2 + δ ROM can be performed, where c 2 corresponds to c+sign(θ) 0 . Using the bits d n/5 −1 to d n/3 , δ r can be computed according to (17) and has to be added to d approx . For the worst case scenario, there is a possible ripple from the bit d 3( n/5 −1) to the bit d ( n/5 ) which would call for a time consuming ripple adder. However, by employing an extra rotation for d 3( n/5 −1)−1 this limitation can be resolved. This extra rotation corresponds to the overflow bit of the addition from the bits d approx −3( n/5 −1)···n and δ r . The additional rotation also does not affect the scale factor, since 3( n/5 − 1) > n/2. For a precision of n ≤ 16 bits, there are less than 32 offsets which can be stored in a ROM and the additional overhead to compute δ r can be removed.
The alternative architecture can be chosen by realizing that the directions of the micro-rotations are required in a most significant bit first manner (see Figure 2). As in the previous architecture, a fast binary adder is employed to determine which offset has to be selected. A redundant sign digit adder adds 0.5θ, c, and δ ROM and an on-the-fly converter starts converting resulting into the corresponding binary representation. Normally, the most significant bit cannot be determined until the least significant digit is converted. However, such worst cases do not exist in the CORDIC implementation, due to the redundant representations of the angles arctan(2 −i ), where as opposed to the binary representation  iteration has to be rotated into the opposite direction. This happens if the angle remainder z i ≈ 0. Table 3 shows the maximum number of consecutive unidirectional rotations depending on the iteration number i. This limitation leads to a reduction in the complexity of the online converters and its most significant bits can already be used to start the rotations in the x/ y datapath.
The next rotation has to be performed in the negative directions, since θ 4 > 0. Hence, it is not possible to obtain rotation sequence like σ 0···4 = 01111 but it has to be σ 0···4 = 01110.

Delay analysis
In this paper, we assume a similar delay model as proposed in [14]. Nevertheless in [14], the unit delay is set to a gate delay while in our evaluation the unit delay is set to a full-adder delay. Hence, the delays for 2-input (NAND, NOR) gate, XOR, multiplexer, register, and full-adder are 0.25, 0.5, 0.5, 0.5, and 1t FA . The determination of which offset has to be chosen consists of the delay of the decoder, the ROM, a fast binary nbit tree adder and a multiplexer. Assuming a delay of log 2 (m) gate delays for the decoder, where m corresponds to the number of rows in the ROM (m < log 2 (n) + 1), one for the word-line driver and another for the ROM, log 2 (n) · t Mux for the fast binary adder and 0.5 · t FA for the multiplexer, we can obtain the correct value of δ ROM after a delay of (0.5 log 2 (n) + 1 + 0.25 log 2 (log 2 (n))) · t FA .
A 3 : 2 compressor can be employed to reduce the number of partial products to two. An additional fast binary tree adder can compute the final value of d approx . Hence, the entire delay to obtain d approx corresponds to 0.5 log 2 (n) + 1 + 0.25 log 2 log 2 (n) After obtaining the bits d n/5 −1 to d n/3 , δ r can be computed.
Since the value of δ r is smaller than 2 −3( n/5 −1) and the value of d approx + δ r is not required before 2 3( n/5 ) t FA the computation of δ r is not in the critical path. Alternatively to the 3 : 2 compressor and the tree adder, a minimally redundant radix-4 sign digit adder can be employed which has a delay of two full-adders. Hence, all output digits are available after these two full-adder delays. An additional on-the-fly converter converts the digits into its equivalent binary representation starting with the MSD. It requires a delay of multiplexer and four NANDs/NORs to convert one digit which results in 1.5t FA per digit (1 digit = 2 bits). The last digit is converted after a delay of (n/2 + 1) · 1.5t FA . As already described in Table 3, bit n/3 is stable as soon as the last digit (corresponding to bit n) has been converted. Hence, the n/3 rotation can be performed after a delay of (n/2 + 1) · 1.5t FA . Therefore, the iterations i = 0 can already be performed after a delay of (n/2 + 1) · 1.5t FA − n/3 · 2t FA = (1/12 · n + 1)t FA . Note that the conversion of one redundant digit is performed faster than the addition/subtraction of the x/ y datapath. Hence, an initial delay of (1/12 · n + 1)t FA + (log 2 (n) + 2.25)t FA = (1/12 · n + log 2 (n) + 3.25)t FA has to be added to the delay of the x/ y datapath.

Area analysis
Previously, the area of the z-datapath consists of n/2 iterations in which (n+log 2 n+2) multiplexers and (n+log 2 n+2) full-adders and registers are employed. Additionally, due to the Booth encoding, in the last n/4 iterations, about 2(n + log 2 n + 2) multiplexers and (n + log 2 n + 2) full-adders are required. Assuming A FA = 1.93 · A mux and A FA = 1.61 · A reg (values are based on layouts), the hardware complexity of the z-datapath results in A z = 1.7·n(n+log 2 n+2)A FA . Assuming a word-length of 54 bits and neglecting the required area for the examination of the most significant three digits, about 5700A FA are required.
The proposed architecture utilizes a ROM of word-length n and 2 n/5−2 entries, requiring an area of n · 2 n/5−2 · A FA · 1/50 resulting in 552A FA for a word-length of 54 bits. The implementation of the decoders can be done in multiple ways. NOR based decoders with precharge lead to the fastest implementation. However, the decoder area becomes larger. The decoder size per word-line corresponds to A dec = 0.83A FA . Since 2 n/5 −2 decoders are required, the area for all decoder corresponds to A dec,total = 0.83 · 2 n/5 −2 = 424A FA , assuming a 54 bit word-length. The ROM has to store θ ref , δ k , and δ k−1 . This results in a total area for the ROM and the decoder of about 2080A FA . The computation of δ r requires n/3 − n/5 + 2 = 2n/15 + 2 rows of CSA (carry-saveadders) and Muxes and a final fast binary tree adder. Note that each row of CSA adders and Muxes only consists of (n − 3n/5 + 6 = 2n/5 + 6) bits (the more significant bits are zero). The required area corresponds to 10 · 27A FA + 10 · 27A mux and 5 · 27A FA , respectively. Hence, the computation of δ r requires 540A FA . Moreover, the two redundant sign digit adder require 2n · A FA , while the converter consists of about (0.5n 2 + n)A mux . This corresponds to 108 and 696A FA for a word-length of 54 bits. This makes a total of 3426A FA , which is about 60% of the z-datapath previously employed.

Evaluation of the x/ y datapath
In the first n/2 micro-rotations, the critical path of the x/ y rotator part consists of a multiplexer and a 4 : 2 compressor, which has a combined critical path of 2 full-adders. The last n/2 micro-rotations can be performed only using n/4 iterations, since Booth encoding can be employed. However, the delay of the selection for the multiple of the shifted x/ y components requires slightly more time, resulting in a delay of about one full-adder delay. The delay for the 4 : 2 compressor remains 1.5 full-adder. Hence, the critical path of the entire x/ y rotator part consists of n/2 · 2t FA + n/4 · 2.5 · t FA = 1.625n · t FA . Note that the direction of the first iteration is already known; hence, the first iteration is not in the critical path. Therefore, the critical path of the entire x/ y rotator part consists of (1.625n − 2)t FA .
As an example, for a word-length of n = 16 bits, the x/ y datapath delay and the entire delay of the CORDIC algorithm corresponds to 24 and 32.5 full-adder delays, respectively.

Scale factor compensation
Since the scale factor is constant, the x and y values can already be scaled while the rotation direction is being computed. The scaling requires an adder of word-length (n + log 2 (n)) bits. Using a binary tree adder, this results in a delay of log 2 (n + log 2 (n)) · t Mux . For the scale factor, a CSD (canonic signed digit) representation can be used, leading to at most n/3 nonzero digits. Applying a Wallace-tree for the partial product reduction, the total delay of the scaling results into (0.5 log 2 (n + log 2 (n)) + log 1.5 (n/3)) · t FA < (1/12 · n + log 2 (n) + 3.25) · t FA = t initial . Hence, the scaling of the x and y coordinates does not affect the total latency of the novel algorithm.

OVERVIEW OF PREVIOUSLY REPORTED CORDIC ALGORITHMS
The delay of every iteration can be decomposed into two different time delays, t d,σ and t d,xy , where t d,σ corresponds to the time delay to predict the new rotation direction while t d,xy corresponds to the time delay of the multiplexer/add structure of the x/ y datapath. Various implementations have been proposed to obtain a speedup of the CORDIC algorithm. Improvements have been especially made in the reduction of t d,σ .
In [17], the angle remainder has been decomposed every k = 3k + 1 iteration. From the given angle θ, the first four rotation directions can be immediately determined. After performing the corresponding addition/subtraction of the terms σ i · α i from the input angle θ using CSA arithmetic, a fast binary tree adder computes the nonredundant result z 4 . The bits 4 to 13 of z 4 deliver the rotation direction σ 4 to σ 13 which are used to perform the rotation in the x/ y datapath and the computation of the next angle remainder z 40 . Hence, a low latency CORDIC algorithm is obtained. However, a significant reduction in latency is achieved at the cost of an irregular design. Furthermore, it is difficult to perform a π/2 initial rotation or the rotation of index i = 0 for circular coordinates, as it would force a conversion from redundant to conventional arithmetic for the z coordinate just after the first micro-rotation which is costly in time and area. Hence, this parallel and nonpipelined architecture only converges in the range of [−1, 1]. The overall latency of this architecture corresponds to about 2n + log 3 (n) + log 2 (n) full-adder delay.
In [18], a direct correlation between the z remainder after n/3 rotations and the remaining rotation direction have been shown. Hence, no more examination of the direction of the micro-rotation has to be performed leading to a considerable reduction in latency. However, in the first n/3 iteration a conventional method has to be employed.
In [19], the directions of the micro-rotation have been recoded using an offset binary coding (OBC) [20]. The obtained correlation is approximately piecewise linear since small elementary angles can be approximated by a(i) = arctan(2 −i ) ≈ s · 2 n−i−2 , where s is the slope of the linearity. This is valid for i ≥ m, where m is an integer which makes the approximation tolerable (normally m = n/3 ). Hence, the following correlation can be obtained: By performing some arithmetic computations, the following correlation of the rotation direction can be obtained: Hence, a multiplication by the inverse of the slope s is required. This multiplication can be simplified to two stages of addition for an operand word-length of 9 bits. However, in most digital signal processing application, the operands have a word-length of up to 16 bits. Hence, for those applications, the presented method requires more stages of addition to compensate the multiplication resulting in a more complex implementation and an increase in delay.
In [21], a double rotation method is introduced which compensates for the scale factor while performing the regular x/ y rotations. However, due to the double rotation nature of this method, t d,xy is increased to about twice its original value.
To reduce the latency of the CORDIC operation, [22] proposed an algorithm using online arithmetic. However, this results in a variable scale factor. This drawback is removed in [23]. In every iteration a significant amount of time is used to examine the most significant three digits to predict σ i . The employed random logic requires a delay of about 1.5 full-adder delays. Since the x/ y datapath consists of a 4-2 compressor, it requires also a delay of 2 full-adders. Hence, the overall iteration delay corresponds to 3.5 full-adder delays. To maintain a constant scale factor, consecutive rotations are required in the first n/2, where n corresponds to the word-length of the operands. For the computation of the last n/2 bits, Booth encoding can be employed reducing the number of iterations by a factor of 2. However, the selection of multiple of the shifted x and y operands requires an additional multiplexer delay and increases the overall iteration delay to 4 full-adder delays. Hence, the number of iteration is equivalent to 0.75n which corresponds to a total latency of 3n full-adders (this does not include the scale operation and the conversion).
Other implementations like [24] remove the extra rotations by a branching mechanism in case that the sign of the remainder cannot be determined (most significant three digits are zero). Hence, no extra-rotations are required while the required implementation area is doubled. Nevertheless, the most significant three digits (or most significant six bits) still have to be examined for the prediction of the next rotation direction. In [25], the double step branching CORDIC algorithm is introduced which performs two rotations in a single step. Nevertheless, this method requires an examination of the most significant six digits to detect two rotation directions. Since some of the digits can be examined in parallel, the delay increases only to 2t FA . The computation time of a double rotation in the x/ y datapath is slightly reduced compared to two normal x/ y rotations. Hence, the total amount of computation time corresponds to 0.5n(2t FA + 3t FA ) = 2.5t FA .
In [26], the signs of all micro-rotations are computed serially. However, a speed up of the sampling rate is achieved by separating the computation of the sign and the magnitude of every z i or y i remainder. The sign of every remainder is computed by a pipelined carry-ripple adder (CRA) leading to an initial latency of n full-adders before the first CORDIC rotation can be performed. Nevertheless, after this initial latency, the following signs can be obtained with a delay of only one Table 4: An overview between the proposed algorithm and other CORDIC implementations.
In comparison to the CORDIC implementations with constant scale factor, other implementations use a minimally redundant radix-4 or an even higher radix number representation [12,13,14]. By using this number system, the number of iterations can be reduced. However, the prediction of the σ i becomes more complicated, since there are more possible values for σ i . In addition, the scale factor becomes variable and has to be computed every time, due to the absence of consecutive rotations. An online computation of the scale factor and a parallel scaling of the x and y operands can be achieved. Depending of the use of CSA or fast carry-propagate-adders (CCLA), the number of iterations can be reduced to 2 n/3 + 4 and n/2 + 1, respectively. The iteration delay t d,CSA of the architecture using CSA adders corresponds to the same delay as already described for the last n/2 iteration in the constant scale factor using Booth-encoding, while the architecture employing the fast CCLA adders requires 1.5· d,CSA [14]. Hence, the overall latency of these CORDIC algorithm using a minimally redundant radix-4 digit set corresponds to about 2n full-adder delays. Table 4 provides a delay comparison between the proposed algorithm and other CORDIC implementations. Some of the delays have been taken from [14,17,26].

CONCLUSION
This paper presented a CORDIC algorithm for the rotation mode which computes the directions of the required microrotation before the actual CORDIC computations start while maintaining a constant scale factor. This is obtained by using a linear correlation between the rotation angle θ and the corresponding direction of all micro-rotations for the rotation mode. The rotation directions are obtained by adding the rotation angle θ to a constant and a variable offset which is stored in a ROM. An implementation for high precision is also provided which reduces the size of the required ROM. Hence, neither extra or double rotations nor a variable scale factor are required. The implementation is suitable for wordlengths up to 54 bits, while maintaining a reasonable ROM size.