Using the complement of the cosine to compute trigonometric functions

The computation of the sine and cosine functions is required in devices ranging from application-specific signal processors to general purpose floating-point units. Even in the latter case, the required functionality can be reduced to computing the sine and/or cosine of multiples of a constant angle. The latency of a sine/cosine generator can be reduced by using look-up tables. However, a direct implementation with look-up tables may be unfeasible if the input space is huge. In such a case, look-up tables with a number of entries lower than the size of the input space can be used indirectly. In previously published methods, the reduction in the number of table entries is obtained at the expense of increasing the table width and the computational cost. This paper introduces an alternative technique that makes it possible to reduce the size of the look-up tables as well as the required multiplications. The proposed technique can be used to implement sine/cosine generators of huge input space. It has been used to implement several twiddle factor generators in reconfigurable hardware and has enabled the number of look-up tables to be reduced by between 6 and 26% with respect to previous table-based techniques. Also, these implementations are about 50% faster than those based on Volder’s algorithm.

the IEEE 754-1985 double-precision format [4]. x is coded in a 64-bit word with 3 fields called sign (1 bit), exponent (11 bits), and significand (52 bits). The significand is a number in the range [ 1, 2 − 2 −52 ] coded in fixed-point, while the exponent is an integer laying in the range [ −1022, 1023] coded in excess 1023. If the exponent is lower than −27, then the unit can simply return 1 as the cosine and x as the sine, assuming rounding to the nearest representable value [5]. Otherwise, it can return the sine and/or cosine of nφ, where φ is the constant 2 −27−52 = 2 −79 and n is the integer x2 79 . • Generating the set of coefficients, called twiddle factors, of a discrete Fourier transform (DFT) [6]. The DFT of a complex sequence x of length L is another complex sequence X of the same length defined by X(k) = L−1 t=0 x(t)W tk L , where W L = e φi and φ = −2π/L. The twiddle factors are the integer powers of W L , and there are L different twiddle factors. Thus, the twiddle factor of index n is W n L = (e φi ) n = e nφi = sin(nφ)i + cos(nφ).
Since the sine and cosine functions are computationally expensive, in applications where a low latency is required, the generator is implemented using look-up tables (LUT). This implementation approach is problematic if the input space is large. For example, consider the arithmetic unit previously mentioned: as stated, exponents lower than −27 can be dismissed. However, even if the input angle is restricted to [ 0, 2π), it is still necessary to consider 30 different exponent values and 2 52 different significand values. A direct implementation would therefore require an LUT of 30 * 2 52 entries. Another example is given by the DFT engines required in applications such as required in Power Line Communications (PLC) [7], Digital Video Broadcasting-Terrestrial 2 (DVB-T2) [8], photon counting [9], and radio astronomy [10]. In those applications, the sequence can be as long as 2 13 , 2 15 , 2 27 , and 2 30 , respectively, and hence the coefficient tables are large in comparison with other elements of the DFT engine [11]. In this paper, we propose an innovative technique to reduce the resources required to implement a sine/cosine generator in application-specific integrated circuit (ASIC) and configurable logic. We have implemented an open-source tool to automate the design of twiddle factor generators of arbitrary size and precision using the proposed technique. The rest of the paper is organized as follows. In the next section, the notation used is introduced and optimization techniques are presented to reduce the number of entries of the required LUT to a number proportional to the input space. In Section 3, optimization techniques are given that enable LUTs to be employed with a total number of entries that grows sublinearly with the input space. The new proposed technique is introduced in Section 4. The experiments are described in Section 5. The corresponding performance results are shown and discussed in Section 6. The last section provides a summary of the conclusions.

Argument reduction
As mentioned in the introduction, our objective is to efficiently implement a functional unit that provides sin(nφ) and cos(nφ), where n is an integer provided as an input to the unit, and φ is a constant angle that depends on the application. Hereinafter, the input of the functional unit will be denoted as I and the number of bits of I will be denoted as w. Furthermore, the following definitions will be used:

Definition 1 A real number φ is trigonometric rational if and only if φ π is rational.
For example, the angle φ = −2π/L, used in the definition of the twiddle factors in Section 1, is trigonometric rational, while the angle φ = 2 −79 mentioned in the arithmetic unit example of Section 1 is not.

Definition 2 The trigonometric Carmichael function of a trigonometric rational number
This definition is useful in the calculation of the size of the output space of the generator. This size is the minimum of λ t (φ) and the size of the input space. Note that λ t (0) = 1. In the DFT example, λ t (φ) is equal to the length of the transform. The function λ t can also be employed to make the following simplification. Suppose that φ has been defined as a trigonometric rational number whose absolute value is very large: |φ| 2π. In this case, an angle α with |α| < 2π can be found such that the functionality of the generator, that is, computing sin(nφ) and cos(nφ), is equivalent to computing sin(nα) and cos(nα). In order to obtain such α: 2. Take the remainder r of the division k λ t (φ) . Note that k and r have the same sign and |r| < λ t (φ).

Definition 3 The trigonometric Shannon entropy of a trigonometric rational number
If φ is trigonometric rational, in order to maximize the size of the output space of the generator, that is, get an output space of size λ t (φ), the minimum number of bits required to code the input is H t (φ) .

Definition 4 φ is trigonometric binary if and only if it is trigonometric rational and H t (φ) is an integer.
The latter definition is relevant since, in many applications, the constant angle φ is trigonometric binary. For example, consider the algorithms designed to compute efficiently the DFT called fast Fourier transform (FFT) [12] algorithms: many of these algorithms require the length of the transform L to be a power of 2 [13], that is, log 2 (L) must be an integer, and hence φ must be trigonometric binary since H t (φ) = log 2 (λ t (−2π/L)) = log 2 (L). Moreover, in applications where φ is trigonometric binary, it is irrelevant whether the representation of n is either unsigned or two's complement as long as w ≥ H t (φ). As an example, consider the functional unit specified in [1]. This unit computes the sine and cosine of πx where x is a number in the interval [ −1, 1) coded in fixed-point two's complement. Let S be the value represented by the input I in integer two's complement, x = S/2 w−1 , and hence πx = Sπ/2 w−1 . Thus, the functional unit computes the sine and cosine of Sφ, where φ = π/2 w−1 . Let n be the value represented by I in unsigned integer representation. It is easy to prove that sin(Sφ) = sin(nφ) and cos(Sφ) = cos(nφ). Therefore, the functionality of the unit is equivalent to the computation of the sine and cosine of nφ. This is exemplified in Table 1 for w = 3. Hereinafter, an unsigned notation for n is assumed. In the following subsections, we will see optimization techniques that require a trigonometric rational value of φ. In these subsections, the trigonometric Carmichael function of φ, λ t (φ), is abbreviated to L.

Periodicity
If the size of the input space of the functional unit is greater than L, then the periodicity of the sine and cosine can be used to compute sin(nφ) and cos(nφ) in the following way: 1. Compute n mod L. As noted by [1], if φ is trigonometric binary, then this computation has no cost since the result is simply the H t (φ) least significant bits of the input I. 2. Use a subgenerator to compute the sine and cosine of (n mod L)φ. The input space of the subgenerator is Z L , smaller than the original input space, and hence, it can be implemented using a smaller LUT.
In the optimization shown in the following subsections, it is assumed that the input space of the generator is Z L .

Sign reduction
If the input space of the functional unit is Z L , then it is possible to implement it with another subgenerator with the same value of φ but whose input space is Z L/2 +1 , that is, its size is roughly half of the size of the original input space. This optimization takes into account the following trigonometric identities: If n ≤ L/2, then the input of the subgenerator is n and its output is the output of the functional unit. Otherwise, the input of the subgenerator is L − n, the cosine output of the unit is the cosine output of the subgenerator, and the sine output of the unit is the opposite of the sine output of the subgenerator. Note that if φ is trigonometric binary, then L − n Table 1 Values returned by the functional unit specified in [1] when the input I has exactly three bits (φ = π/2 3−1 = π/4) can be obtained by simply taking the two's complement. In the next subsection, another optimization is presented for the implementation of the subgenerator when L is even.

Quadrant reduction
If L is even and the size of the input space of the functional unit is L/2 + 1 = L/2 + 1, then it can be implemented using a subgenerator with the same value of φ but whose input space is reduced to Z L/4 +1 . This optimization uses the following trigonometric identities: If n ≤ L/4, then the input of the subgenerator is n and its output is the output of the functional unit. Otherwise, the input of the subgenerator is L/2 − n, the sine output of the unit is the sine output of the subgenerator, and the cosine output of the unit is the opposite of the cosine output of the subgenerator. Again, the computation of L/2 − n is simply a two's complement if φ is trigonometric binary. In turn, the optimization described in the next subsection can be used to implement the subgenerator if L is multiple of 4.

Octant reduction
Finally, if L is a multiple of 4 and the input space of the functional unit is Z L/4 +1 = Z L/4+1 , then it can be implemented with a subgenerator with the same value of φ but whose input space is reduced to Z L/8 +1 by applying the following trigonometric identities: If n ≤ L/8, then the input of the subgenerator is n and its output is the output of the functional unit. Otherwise, the input of the subgenerator is L/4 − n, the sine output of the unit is the cosine output of the subgenerator and the sine output of the unit is the cosine output of the subgenerator. Once more, a simple two's complement provides L/4 − n if φ is trigonometric binary.
With the previous optimizations, a sine/cosine generator can be implemented with an LUT of L/8 + 1 entries. For example, the functional unit described in [1] previously mentioned can be implemented as shown in Fig. 1. As previously discussed, the functionality of that unit is equivalent to computing the sine and cosine of nφ, where n is the number provided by input I in unsigned integer notation, the angle φ is 2π/2 w , and w is the number of bits of I. In this case, φ is a trigonometric binary. In this implementation, the output is provided in some type of sign-magnitude notation such as one of the IEEE 754-1985 floating-point formats, and w > 3 so λ t (φ) is multiple of 4 and an LUT of only 2 w−3 + 1 entries is required. The LUT of the figure should return the sine and cosine of angles in the range [ 0, π/4] and, since they are all positive, there is no need to store the sign bits. Instead, they are computed using a simple XOR gate (4c). The LUT has been implemented using a direct access memory. In order to prevent the problem of dealing with a direct access memory with a number of positions that is not a power of two, the access of the entry of the LUT corresponding to n = 2 w−3 is detected by a simple logic gate (4a) and is treated separately. In that case, the LUT returns the sine and cosine of (2020) 2020:35 Page 6 of 21

Fig. 1
Argument reduction employed to optimize a sine/cosine generator π/4, that is, 1/ √ 2, using a pair of multiplexers (2b). The address lines of the memory are fed with the w − 3 least significant bits of I or with its two's complement depending on I w−3 , using the adder (3) and the multiplexer (2a). The gate (4b) is employed to ascertain whether the magnitude of the sine and the cosine should be interchanged with the multiplexers (2c).

Sublinear optimizations
The optimizations described in the previous sections have the following drawbacks: • They require a subgenerator with an input space greater than λ t (φ)/8, that is, its input space grows linearly with λ t (φ). In many applications, the subgenerator cannot be directly implemented using an LUT with a number of entries proportional to the input space since it is excessively large. For example, even if the octant optimization could be directly applied to the arithmetic unit mentioned in Section 1, it would only reduce the size of the input space of the required subgenerator to roughly 30 * 2 49 . • They can only be directly applied if φ is trigonometric rational. Furthermore, the quadrant and octant optimizations require λ t (φ) to be even and a multiple of 4, respectively. Hence, in order to apply them in the arithmetic unit of the example of Section 1, a workaround similar to that shown in [1,14] is necessary. For example, assuming the angle is positive, the arithmetic unit could execute the following steps to compute the sine and cosine: 1. Divide the angle by 2π. This can be implemented efficiently by multiplying the angle by the precomputed value of the reciprocal of 2π. 2. Take the fixed-point representation I of the fractional part of the previous division. 3. Return the sine and cosine of nφ, where n is the number represented by I in unsigned integer notation, φ = 2π 2 w , and w is the width of I.
This last step can be carried out by a generator that can be implemented using the optimizations described in the previous section. However, this approach has its own drawbacks: first, the cost of a multiplication is introduced; second, the reciprocal of 2π is not rational, and hence, its exact representation cannot be stored. Moreover, even if we could use an exact representation of the reciprocal of 2π, the exact product of the angle by the reciprocal is not rational unless the angle is zero, that is, the exact result of the division of the angle by 2π is not rational. Hence, its representation cannot be exact and an error is introduced [14].
In the following subsections, optimizations without these drawbacks are described.

Branching
This optimization, used in [1], accepts an arbitrary value of φ, although it was originally employed for a trigonometric rational value. When branching is applied, the generator is implemented using two subgenerators, M 1 and M 0 , which we call branches. The inputs of the branches are denoted by A(1) and A(0), while the widths of these inputs are denoted by L(1) and L(0), respectively. These are chosen so that the width of the input of the generator, I, is w = L(1) + L(0). M 1 provides the sine and cosine of integer multiples of φ 1 , that is, the sine and cosine of n 1 φ 1 , where n 1 is the value represented by A(1) and On the other hand, M 0 provides the sine and cosine of n 0 φ 0 , where n 0 is the value represented by A(0) and φ 0 = φ. The least significant bits of I are connected to A(0), while the rest are connected to A(1). Since I is the concatenation of A(1) and A(0), the value represented by I is and hence, Since the sines and cosines of n 1 φ 1 and n 0 φ 0 are provided by M 1 and M 0 , the sine and cosine of their sum can be computed by applying the following trigonometric identities: Alternatively, we can say that each subgenerator M k provides the complex sin(n k φ k )i + cos(n k φ k ) = e n k φ k i , and the generator can provide the value e nφi by computing the complex product e n 1 φ 1 i e n 0 φ 0 i . Indeed, computing the product of two complexes, each of a unitary module, is equivalent to computing the sine and cosine of the sum of two angles from the sine and cosine of those angles and implies four real products, a real sum, and a real subtraction. A generalization of this branching technique was proposed in [15] to compute twiddle factors. Note that the sum of the sizes of the input spaces of M 1 and M 0 is minimum when L(1) and L(0) differ by no more than 1. In this case, such a sum grows with the square root of the size of the original input space, that is, sublinearly [15]. Accordingly to [1], floating-point sine/cosine applications will benefit from a fixed-point conversion of the datapath around these functions.

Tree generator
The implementation of the branches was not detailed in the previous subsection. In the generator described in [1], the branch M 0 computes its output using the Taylor series, while M 1 is implemented with an LUT of affordable size. Further optimization could be achieved if one or both branches were, in turn, implemented with sub-branches. This recursive application of the branch optimization is used by the tree generator described in [16]. In general, the tree generator requires a set of subgenerators that we will call leaves, complex multipliers, and, if the implementation is sequential or pipelined, registers. The following notation is employed for its description: : total number of input lines of the leaves with an index lower than k • φ k : angle defined by φ k = (2 SL(k) )φ Each leaf subgenerator M k provides the sine and cosine of n k φ k . The leaves are chosen such that the sum of the widths of their inputs is equal to the width of the input of the tree generator: The input lines of each leaf M k are connected to the input lines of the tree generator from I SL(k) to I SL(k+1)−1 , that is, each input line A(k) t is connected to I t+SL(k) : Hence, the input value n represented by I becomes: and therefore the angle whose sine and cosine must be computed by the tree generator can be written as: Hence, the angle nφ is the sum of the subangles n k φ k , or, alternatively, the complex e nφi is the product e n 0 φ 0 i e n 1 φ 1 i . . . e n m−1 φ m−1 i . Again, since the sine and cosine of the subangles are provided by the leaves, the sine and cosine of nφ can be computed with complex multiplications. Taking this into account, the structure of the generator described in [16] becomes a directed rooted binary tree with m leaves. Each vertex corresponds to a component whose output is a complex of unitary module. Each internal vertex has exactly two children and corresponds to a complex multiplier that computes the product of the outputs of the components associated to these children. The components corresponding to the m leaves are the m subgenerators and provide the complex values e n k φ k i . The output of the tree generator is the output of the component corresponding to the root vertex. Hereinafter, the height of the tree will be denoted as h. The following recommendations may improve the efficiency of the design: • It is desirable to minimize the height of the tree h in order to reduce latency and rounding errors. This is achieved if the structure of the generator is a complete binary tree.
• If each leaf is implemented with an LUT, the total number of entries is minimum when the width of the inputs of those LUTs differ by no more than 1. To this end, let q be the quotient obtained by dividing w by m, and let r be the remainder. A total of r LUTs must have inputs of width q + 1. The other LUTs must have inputs of width q.
• If the above recommendation is followed, then the total number of entries decreases when m increases. For a fixed height h, the maximum possible value of m is 2 h , and therefore, the total number of entries can be minimized by using 2 h leaves.
In order to ascertain the power of this approach, suppose we use a complete binary tree with height h = log 2 (w) . In this case, the number of subgenerators m would be no greater than w, and each subgenerator would have no more than 2 input lines. If each subgenerator is implemented with an LUT, then an upper bound on the total number of entries is 4w, that is, the total number of entries grows logarithmically with the size of the input space of the tree generator. Hence, the implementation of a sine/cosine generator with an input space as large as that required by the arithmetic unit mentioned in Section 1 is feasible with a tree generator. Note that a tree generator can be combined with the argument reduction mentioned in Section 2. For example, in [1], argument reduction is first applied, and hence, only a subgenerator with an input space of roughly 1/8 of the original size is required. That subgenerator is then implemented with a tree generator of height h = 1.
In the following sub-subsections, we will see several optimizations that can be applied to the tree generator. In the rest of the paper, φ > 0 is assumed for the sake of simplicity, although in practice this is not a restriction since cos(nφ) = cos(n|φ|) and sin(nφ) = sgn(φ) * sin(n|φ|).

Quadrant restriction
This optimization can be applied to quadrant restricted sine/cosine generators, which are defined as follows:

Definition 5 Given a functional unit with an integer input n ≥ 0 that computes one or more trigonometric functions of nφ, where φ > 0 is a constant, the unit is quadrant restricted if and only if π 2φ is an upper bound on its input space.
If a sine/cosine generator is quadrant restricted, then it must compute the sine and cosine of an angle nφ in the interval [ 0, π/2]. Since both functions are positive in that interval, the following optimizations are possible: • As in the example of Section 2, if the generator is implemented with an LUT, there is no need to store the sign bits.
• If it is implemented with a tree generator, no signed adders, subtracters, nor multipliers are required.
For example, the tree generator used in [1] is quadrant restricted, and hence, the complex multiplier requires no signed arithmetic components and the LUT employed does not need to store the sign bits. Note that even if a tree generator is not quadrant restricted, it may contain quadrant restricted branches that can benefit from these optimizations.

Leading zeros of the sine
This optimization is useful when the sine values of a quadrant-restricted generator are coded in fixed-point. In this case, an upper bound on the sine output is sin(n max φ), where n max is the maximum of the input space. Consequently, if the k most significant bits of the fixed-point representation of sin(n max φ) are 0, those bits of the sine output of the generator are always 0, and the following optimizations are possible: • If the generator is implemented with an LUT, then there is no need to store the k most significant bits of the sine.
• If the generator feeds a complex multiplier of a tree generator, the size of its real multipliers can be reduced.

Leading ones of the cosine
This optimization is useful when the cosine values of a quadrant-restricted generator are coded in fixed-point using all the bits for the fractional part. In this case, the representable value nearest to cos(0) = 1 corresponds to the word with all the bits equal to 1. A lower bound on the cosine output is cos(n max φ), where n max is the maximum of the input space, and therefore, if the k most significant bits of the fixed-point representation of cos(n max φ) are 1, those bits of the cosine output of the generator are always 1. Hence, if the generator is implemented with an LUT, there is no need to store the k most significant bits of the cosine.
To exemplify these optimizations, suppose a generator must provide the sine and cosine of nφ in fixed-point notation with 8 fractional bits and no integer bits rounding to the nearest representable value. In this example, φ = 2π/2 11 = π/2 10 , that is, it is trigonometric binary and λ t (φ) = 2 11 is a multiple of 4, and hence, we first apply argument reduction and treat the case n = 2 8 separately as in the example of Section 2. A subgenerator still has to be subsequently implemented with an input space of size 2 8 (w = H t (φ) − 3 = 8). We implement this subgenerator with a tree generator of height h = 1, and therefore, there are 2 leaves (m = 2 h = 2) as shown in Fig. 2. Since this tree generator is quadrant restricted, it does not require signed arithmetic components. Each leaf is implemented with direct access memory (5) of depth 2 4 . Note that the sine/cosine values must be stored with a precision of 17 bits to compensate for rounding errors. The leaf M 0 provides the sines and cosines of the multiples of φ 0 = 2 0 φ = π/2 10 , while the leaf M 1 provides the sines and cosines of the multiples of φ 1 = 2 4 φ = π/2 6 . The greatest angle whose sine and cosine is stored in M 0 is 15π/2 10 . The 4 most significant bits of the representation of the sine of this angle are 0 and, since the generator is quadrant restricted, the 4 most significant bits of the other representations of the sines stored in M 0 are also 0, and therefore, there is no need for them to be stored. On the other hand, the 9 most significant bits of the representation of the cosine of that angle are equal to 1, and therefore, there is no need for them to be stored either. Similar optimizations can be applied to M 1 , but in this case, only one bit can be saved. Note that we only need two integer multipliers of size 13 × 17 (6a) and two of size 17 × 17 (6b) instead of four multipliers of size 17 × 17 thanks to the leading zeros of the sine. The leading ones of the cosine cannot be employed to reduce the size the arithmetic components in a similar way. In the last stage, an adder (3) provides the sine of the tree generator and a subtracter (7) provides the cosine.

Sine/complement generator
In the optimizations described in Section 3, the angle whose sine/cosine must be computed is decomposed into two subangles, A and B. Two subgenerators, called branches, are employed to compute the sine/cosine of A and B, and then the sine/cosine of A + B is computed by applying the identities 6. This method presents the following drawbacks: • If the maximum value of one of the angles A or B is small, then its sine is close to 0, while its cosine is close to 1. In this case, the product cos  These problems can be solved by using another type of generator that we call sine/complement generator. Such generator receives an integer n and computes the sine and the complement of the cosine of nφ, that is defined as follows:

Definition 6 The complement of the cosine of x is com(x) = 1 − cos(x)
It is possible to compute the sine and the complement of the cosine of the sum of two angles, A and B, from the sines and complements of the cosines of those angles by using a functional unit called a trigonometric adder [17]. Similar to the complex multiplier, the trigonometric adder can be implemented with adders (3), subtracters (7), and multipliers (6) as depicted in Fig. 3. This trigonometric adder implementation uses the following trigonometric identities derived from those of 6: Trigonometric adders enable the implementation of a sine/complement generator using a tree structure similar to that described in Section 3.2. Such an implementation, described in [18], requires a set of sine/complement subgenerators (the leaves of the tree) as well as trigonometric adders (the internal vertex). Furthermore, if the generator is quadrant restricted, then optimizations similar to those described in Section 3.2 can be applied: • Since the complement of the cosine is also positive in [ 0, π/2], the trigonometric adders can be implemented without signed arithmetic components. • If fixed-point representation is used, the leading zeros of the sines make it possible to reduce the size of the integer multipliers and that of the leaves.
In this case, the optimization described in Section 3.2.3 cannot be applied, but if fixedpoint representation is used, we can use the following optimization that we call leading zeros of the complement: if a branch is quadrant restricted, an upper bound on its complement output is com(n max φ), where n max is the maximum of the input space of the branch. Therefore, if the k most significant bits of the fixed-point representation of com(n max φ) are 0, those bits of the complement output of the branch are always 0. If the branch is implemented with an LUT, then there is no need to store those bits. Note that, unlike the optimization described in Section 3.2.3, this optimization enables a reduction of the size of the required multipliers.
A sine/cosine generator can be implemented with a sine/complement generator by simply adding a trivial arithmetic circuit to subtract the complement of the cosine from 1. In fact, if fixed-point notation is used, then this arithmetic circuit is not necessary since the trigonometric adder corresponding to the root can be easily modified to provide cos(nφ) instead of com(nφ) at no additional cost. To this end, instead of com(nφ) = com (A + B), the root vertex computes its opposite −com(nφ) = −com(A + B) by using the following equation: subsequently 1 can be added to −com(nφ) in order to obtain cos(nφ). Note that this last operation is merely toggling the integer bit of the representation of −com(nφ). If a sine/cosine generator or a branch of it is quadrant restricted, then it should be implemented employing a complement generator due to the following reasons: • The smearing problems are lessened by using the formulae 10.
• In contrast to the leading ones of the cosine optimization, the leading zeros of the complement optimization make it possible to reduce the size of the multipliers.

Experiments
In order to measure the possible enhancements that the proposed approach may provide, we have written an open-source tool, called twiddle.py, to automate the design of twiddle factor generators that use the proposed optimization. The tool admits arbitrary output precision and tree height. We selected the same size for the input, the sine output, and the cosine output. In the current version, the sequence length must be a powers of 2 so it is possible to apply argument reduction and only quadrant-restricted generators are needed. The sequence lengths in our experiments ranged from 2 16  6 Results and discussion

Multipliers
As described in Section 3.2.3, the leading ones of the cosine optimization make it possible to reduce the size of the memories, but not the size of the required arithmetic components. We claimed that the leading zeros of the complement optimization, described in Section 4, were more convenient because they make it possible to reduce not only the size of the memories, but also the size of the multipliers. To ascertain such claim, we used the twiddle.py tool to implement generators of height 1 with input size ranging from 16 to 23 bits. The comparison of the size of the complement multiplicands and the corresponding cosine multiplicands is shown in Table 2. The width in bits of the cosine and the complement of the cosine is reported in the subcolumns cosine and compl respectively. The relative reduction is reported in the subcolumns red.
The first fractional bit of the cosine multiplicand is always zero. However, the precision of the multiplicands must be two bits higher than the output precision to compensate for rounding errors. For this reason, the size of the cosine multiplicands of both branches must be a bit greater than the output data width. The size of the corresponding complement multiplicand of the left branch is just one bit lower so the relative saving decreases with the data width. On the other hand, the size of the complement multiplicand of the right branch is almost constant and is never greater than 4 so the relative saving increases with the data width and ranges from 83 to 90%.
As noted by [1], an obvious way to further reduce the size of the memories is to replace them by sub-branches with memories of smaller depth at the cost of more multiplications. Of course, if the implementation is fully pipelined or combinational, the additional multiplications should be carried out by new multipliers. If such replacement is carried out, the angles corresponding to most of these new memories are remarkably smaller. For this reason, the leading zeros of the complement optimization should further reduce the size of the additional multipliers. To ascertain this, we used the twiddle.py to carry out such replacements by increasing the tree height one stage. The saving obtained for each branch is reported in the next sub-subsections.

Left branch
The size of the operands of the left branch is shown in Table 3. As in the tree of height 1, the saving obtained by using the proposed implementation in its left sub-branch is negligible, but the saving in its right sub-branch is remarkable, ranging from 41 to 53%.

Right branch
As shown in Table 4, the saving obtained by using the proposed implementation in the right branch is astonishing. To begin, the saving corresponding to its left sub-branch is very high, ranging from 77 to 88%. Moreover, the saving corresponding to its right subbranch is always 100%. This means that the complement multiplicand of the right subbranch is always zero and the corresponding multipliers can be removed.

Implementation
The hardware description of the twiddle factor generators with a sine/complement tree of height 1 were implemented in a Xilinx (Virtex) 7 XC7VX485T-2FFG1761 fieldprogrammable gate array (FPGA) chip. We also implemented the equivalent sine/cosine tree generators described in [1] and cordinate rotation digital computer (CORDIC) generators to measure the relative enhancements and penalties. Following the notation used in [1], the reference sine/cosine tree implementations will be called SinAndCos. The proposed implementations will be called SinAndCom. The SinAndCos and CORDIC implementations are generated by the flopoco open-source tool version 4.1.2 available at http://flopoco.gforge.inria.fr. The tested implementations are combinational, but are embedded in a dummy sequential module in order to obtain delay estimations from the synthesis tool. Synthesis was carried out with the (Vivado) Design Suite tool of Xilinx version 2017.2.1 using the default options. The only exception is that the use of the Digital Signal Processing (DSP) blocks was disabled in order to extrapolate the results to other  reconfigurable devices as well as to ASIC. The obtained results are reported in the next sub-subsections.

Delay
The delay of the slowest path of the generators is shown in Tables 5, 6, and 7. The SinAnd-Com approach turned out to be about 50% faster than CORDIC but was slower than SinAndCom in most cases. For the lower data widths, the penalty was as high as 26%. One of the reasons is that, in order to apply Eq. 10, it is necessary to execute more explicit additions and subtractions than in Eq. 6. The penalty is not that severe for higher widths, and there are some little speed-ups ranging from 5 to 7%.

Resources
The resource utilization of the generators are shown in Tables 8, 9, and 10. No DSP units were used. As mentioned in Section 3.2, in order to save resources, the flopoco tool does not implement the right branch of the SinAndCom generators with a memory. Instead, the corresponding values are evaluated by using the Taylor series. The same optimization can be applied to the SinAndCom approach, but the twiddle.py does not implement it yet. For this reason, this comparison is not fair to the SinAndCom approach but, even without the Taylor series optimization, SinAndCom used less LUTs in every case. The best saving was as high as 26%. The implementation of the Taylor optimization should provide higher relative savings. Also, as mentioned in Section 6.1, the relative saving should be remarkable if higher tree structures are used. Unfortunately, unlike twiddle.py, the current version of flopoco only supports tree structures of height 1 so we were unable to ascertain this. Regarding CORDIC, for low data widths, the SinAndCom has enabled the number of look-up tables to be reduced by between 20 and 27%, while for higher data widths, it introduces a penalty as high as 15% in the resource utilization.