Using the complement of the cosine to compute trigonometric functions
EURASIP Journal on Advances in Signal Processing volume 2020, Article number: 35 (2020)
Abstract
The computation of the sine and cosine functions is required in devices ranging from application-specific signal processors to general-purpose floating-point units. Even in the latter case, the required functionality can be reduced to computing the sine and/or cosine of multiples of a constant angle. The latency of a sine/cosine generator can be reduced by using lookup tables. However, a direct implementation with lookup tables may be unfeasible if the input space is huge. In such a case, lookup tables with a number of entries lower than the size of the input space can be used indirectly. In previously published methods, the reduction in the number of table entries is obtained at the expense of increasing the table width and the computational cost. This paper introduces an alternative technique that makes it possible to reduce the size of the lookup tables as well as the required multiplications. The proposed technique can be used to implement sine/cosine generators of huge input space. It has been used to implement several twiddle factor generators in reconfigurable hardware and has enabled the number of lookup tables to be reduced by between 6 and 26% with respect to previous table-based techniques. Also, these implementations are about 50% faster than those based on Volder’s algorithm.
1 Introduction
The computation of sine and cosine functions is fundamental in a wide range of applications, including signal processing [1, 2]. Obvious examples are the computation of the discrete cosine transform (DCT), the discrete sine transform (DST), and their inverses (IDCT and IDST) [3]. A fused sine-and-cosine implementation is of major interest because various methods compute both and numerous applications require both [1]. In this paper, the focus is on the implementation of functional units that provide the sine and cosine of multiples of a constant angle ϕ, that is, sin(nϕ) and cos(nϕ), where n is an integer given as an input. Applications of such functional units include the following:

Implementing the sine and/or cosine functions in arithmetic units. For example, suppose an arithmetic unit must compute the sine and/or cosine of a number x using the IEEE 754-1985 double-precision format [4]. x is coded in a 64-bit word with 3 fields called sign (1 bit), exponent (11 bits), and significand (52 bits). The significand is a number in the range [1,2−2^{−52}] coded in fixed-point, while the exponent is an integer lying in the range [−1022,1023] coded in excess 1023. If the exponent is lower than −27, then the unit can simply return 1 as the cosine and x as the sine, assuming rounding to the nearest representable value [5]. Otherwise, it can return the sine and/or cosine of nϕ, where ϕ is the constant 2^{−27−52}=2^{−79} and n is the integer x2^{79}.

Generating the set of coefficients, called twiddle factors, of a discrete Fourier transform (DFT) [6]. The DFT of a complex sequence x of length L is another complex sequence X of the same length defined by \(X(k)=\sum_{t=0}^{L-1} x(t)W_{L}^{tk}\), where W_{L}=e^{ϕi} and ϕ=−2π/L. The twiddle factors are the integer powers of W_{L}, and there are L different twiddle factors. Thus, the twiddle factor of index n is \(W_{L}^{n}=(e^{\phi i})^{n}=e^{n \phi i}=\sin (n \phi)i+\cos (n \phi)\).
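The arithmetic-unit example can be sketched in a few lines of Python. This is only an illustration of the decomposition x=nϕ with ϕ=2^{−79}; the function name `double_to_multiple` is hypothetical and the use of exact rational arithmetic is an assumption made for clarity, not a description of how a hardware unit would extract n.

```python
from fractions import Fraction

def double_to_multiple(x: float):
    """Return the integer n with x == n * 2**-79, or None when |x| < 2**-27.

    Hypothetical helper: for |x| < 2**-27 (exponent lower than -27) the
    unit can return cos = 1 and sin = x directly, so no n is needed.
    """
    if abs(x) < 2.0 ** -27:
        return None
    # Fraction(x) is the exact value of the double; with an exponent of at
    # least -27 and 52 fractional significand bits, x * 2**79 is an integer.
    n = Fraction(x) * 2 ** 79
    assert n.denominator == 1
    return int(n)
```

For instance, x=1.0 maps to n=2^{79}, which shows why a direct lookup table indexed by n is out of the question.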
Since the sine and cosine functions are computationally expensive, in applications where a low latency is required, the generator is implemented using lookup tables (LUT). This implementation approach is problematic if the input space is large. For example, consider the arithmetic unit previously mentioned: as stated, exponents lower than −27 can be dismissed. However, even if the input angle is restricted to [0,2π), it is still necessary to consider 30 different exponent values and 2^{52} different significand values. A direct implementation would therefore require an LUT of 30∗2^{52} entries. Another example is given by the DFT engines required in applications such as Power Line Communications (PLC) [7], Digital Video Broadcasting - Terrestrial 2 (DVB-T2) [8], photon counting [9], and radio astronomy [10]. In those applications, the sequence can be as long as 2^{13}, 2^{15}, 2^{27}, and 2^{30}, respectively, and hence the coefficient tables are large in comparison with other elements of the DFT engine [11]. In this paper, we propose an innovative technique to reduce the resources required to implement a sine/cosine generator in application-specific integrated circuits (ASIC) and configurable logic. We have implemented an open-source tool to automate the design of twiddle factor generators of arbitrary size and precision using the proposed technique.
The rest of the paper is organized as follows. In the next section, the notation used is introduced and optimization techniques are presented to reduce the number of entries of the required LUT to a number proportional to the input space. In Section 3, optimization techniques are given that enable LUTs to be employed with a total number of entries that grows sublinearly with the input space. The new proposed technique is introduced in Section 4. The experiments are described in Section 5. The corresponding performance results are shown and discussed in Section 6. The last section provides a summary of the conclusions.
2 Argument reduction
As mentioned in the introduction, our objective is to efficiently implement a functional unit that provides sin(nϕ) and cos(nϕ), where n is an integer provided as an input to the unit, and ϕ is a constant angle that depends on the application. Hereinafter, the input of the functional unit will be denoted as I and the number of bits of I will be denoted as w. Furthermore, the following definitions will be used:
Definition 1
A real number ϕ is trigonometric rational if and only if \(\frac {\phi }{\pi }\) is rational.
For example, the angle ϕ=−2π/L, used in the definition of the twiddle factors in Section 1, is trigonometric rational, while the angle ϕ=2^{−79} mentioned in the arithmetic unit example of Section 1 is not.
Definition 2
The trigonometric Carmichael function of a trigonometric rational number ϕ is the minimum natural number λ_{t}(ϕ) such that \(\frac {\lambda _{t}(\phi)\phi }{2\pi }\) is an integer.
This definition is useful in the calculation of the size of the output space of the generator. This size is the minimum of λ_{t}(ϕ) and the size of the input space. Note that λ_{t}(0)=1. In the DFT example, λ_{t}(ϕ) is equal to the length of the transform. The function λ_{t} can also be employed to make the following simplification. Suppose that ϕ has been defined as a trigonometric rational number whose absolute value is very large: |ϕ|≫2π. In this case, an angle α with |α|<2π can be found such that the functionality of the generator, that is, computing sin(nϕ) and cos(nϕ), is equivalent to computing sin(nα) and cos(nα). In order to obtain such an α:

1. Take the integer \(k=\frac{\lambda_{t}(\phi)\phi}{2\pi}\).

2. Take the remainder r of the division \(\frac{k}{\lambda_{t}(\phi)}\). Note that k and r have the same sign and |r|<λ_{t}(ϕ).

3. Take \(\alpha=\frac{2\pi r}{\lambda_{t}(\phi)}\).
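The three steps above can be sketched as follows, under the assumption that ϕ is supplied as an exact rational multiple of π (ϕ=qπ with q a `Fraction`). The function names `trig_carmichael` and `reduce_angle` are illustrative and do not come from the paper.

```python
from fractions import Fraction

def trig_carmichael(q: Fraction) -> int:
    """lambda_t(phi) for phi = q*pi: least natural lam with lam*phi/(2*pi) integer."""
    # phi/(2*pi) = q/2; in lowest terms its denominator is the least such lam.
    return (q / 2).denominator

def reduce_angle(q: Fraction) -> Fraction:
    """Return alpha/pi with |alpha| < 2*pi and the same sin/cos of every multiple."""
    lam = trig_carmichael(q)
    k = (q / 2 * lam).numerator      # step 1: k = lam*phi/(2*pi), an integer
    r = abs(k) % lam                 # step 2: remainder with the sign of k
    if k < 0:
        r = -r
    return Fraction(2 * r, lam)      # step 3: alpha = 2*pi*r/lam, in units of pi
```

As a usage example, ϕ=37π/4 gives λ_{t}(ϕ)=8, k=37, r=5, and α=5π/4; the two angles differ by 8π, so every multiple nϕ differs from nα by a multiple of 2π.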
Definition 3
The trigonometric Shannon entropy of a trigonometric rational number ϕ is H_{t}(ϕ)=log_{2}(λ_{t}(ϕ)).
If ϕ is trigonometric rational, in order to maximize the size of the output space of the generator, that is, get an output space of size λ_{t}(ϕ), the minimum number of bits required to code the input is ⌈H_{t}(ϕ)⌉.
Definition 4
ϕ is trigonometric binary if and only if it is trigonometric rational and H_{t}(ϕ) is an integer.
The latter definition is relevant since, in many applications, the constant angle ϕ is trigonometric binary. For example, consider the algorithms designed to compute the DFT efficiently, called fast Fourier transform (FFT) algorithms [12]: many of these algorithms require the length of the transform L to be a power of 2 [13], that is, log_{2}(L) must be an integer, and hence ϕ must be trigonometric binary since H_{t}(ϕ)=log_{2}(λ_{t}(−2π/L))=log_{2}(L). Moreover, in applications where ϕ is trigonometric binary, it is irrelevant whether the representation of n is unsigned or two’s complement as long as w≥H_{t}(ϕ). As an example, consider the functional unit specified in [1]. This unit computes the sine and cosine of πx where x is a number in the interval [−1,1) coded in fixed-point two’s complement. Let S be the value represented by the input I in integer two’s complement; then x=S/2^{w−1}, and hence πx=Sπ/2^{w−1}. Thus, the functional unit computes the sine and cosine of Sϕ, where ϕ=π/2^{w−1}. Let n be the value represented by I in unsigned integer representation. It is easy to prove that sin(Sϕ)= sin(nϕ) and cos(Sϕ)= cos(nϕ). Therefore, the functionality of the unit is equivalent to the computation of the sine and cosine of nϕ. This is exemplified in Table 1 for w=3.
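The equivalence between the unsigned and two’s complement readings of I can be checked numerically. The sketch below is a brute-force verification over all w-bit words, assuming ϕ=π/2^{w−1} as in the unit of [1]; the function name `check_equivalence` is hypothetical.

```python
import math

def check_equivalence(w: int) -> bool:
    """Check sin(S*phi) == sin(n*phi) and cos(S*phi) == cos(n*phi) for all w-bit words."""
    phi = math.pi / 2 ** (w - 1)
    for n in range(2 ** w):                          # n: unsigned reading of I
        s = n - 2 ** w if n >= 2 ** (w - 1) else n   # S: two's complement reading of I
        same_sin = math.isclose(math.sin(s * phi), math.sin(n * phi), abs_tol=1e-12)
        same_cos = math.isclose(math.cos(s * phi), math.cos(n * phi), abs_tol=1e-12)
        if not (same_sin and same_cos):
            return False
    return True
```

The check succeeds because n and S differ by either 0 or 2^{w}, and 2^{w}ϕ=2π.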
Hereinafter, an unsigned notation for n is assumed. In the following subsections, we will see optimization techniques that require a trigonometric rational value of ϕ. In these subsections, the trigonometric Carmichael function of ϕ, λ_{t}(ϕ), is abbreviated to L.
2.1 Periodicity
If the size of the input space of the functional unit is greater than L, then the periodicity of the sine and cosine can be used to compute sin(nϕ) and cos(nϕ) in the following way:

1. Compute n mod L. As noted by [1], if ϕ is trigonometric binary, then this computation has no cost since the result is simply the H_{t}(ϕ) least significant bits of the input I.

2. Use a subgenerator to compute the sine and cosine of (n mod L)ϕ. The input space of the subgenerator is \(\mathbb {Z}_{L}\), smaller than the original input space, and hence, it can be implemented using a smaller LUT.
In the optimizations described in the following subsections, it is assumed that the input space of the generator is \(\mathbb {Z}_{L}\).
2.2 Sign reduction
If the input space of the functional unit is \(\mathbb {Z}_{L}\), then it is possible to implement it with another subgenerator with the same value of ϕ but whose input space is \(\mathbb {Z}_{\lfloor L/2 \rfloor +1}\), that is, its size is roughly half of the size of the original input space. This optimization takes into account the following trigonometric identities:
\[\begin{aligned}\sin(n\phi)&=-\sin((L-n)\phi),\\ \cos(n\phi)&=\cos((L-n)\phi).\end{aligned}\]
If n≤L/2, then the input of the subgenerator is n and its output is the output of the functional unit. Otherwise, the input of the subgenerator is L−n, the cosine output of the unit is the cosine output of the subgenerator, and the sine output of the unit is the opposite of the sine output of the subgenerator. Note that if ϕ is trigonometric binary, then L−n can be obtained by simply taking the two’s complement. In the next subsection, another optimization is presented for the implementation of the subgenerator when L is even.
2.3 Quadrant reduction
If L is even and the size of the input space of the functional unit is ⌊L/2⌋+1=L/2+1, then it can be implemented using a subgenerator with the same value of ϕ but whose input space is reduced to \(\mathbb {Z}_{\lfloor L/4 \rfloor +1}\). This optimization uses the following trigonometric identities:
\[\begin{aligned}\sin(n\phi)&=\sin((L/2-n)\phi),\\ \cos(n\phi)&=-\cos((L/2-n)\phi).\end{aligned}\]
If n≤L/4, then the input of the subgenerator is n and its output is the output of the functional unit. Otherwise, the input of the subgenerator is L/2−n, the sine output of the unit is the sine output of the subgenerator, and the cosine output of the unit is the opposite of the cosine output of the subgenerator. Again, the computation of L/2−n is simply a two’s complement if ϕ is trigonometric binary. In turn, the optimization described in the next subsection can be used to implement the subgenerator if L is a multiple of 4.
2.4 Octant reduction
Finally, if L is a multiple of 4 and the input space of the functional unit is \(\mathbb {Z}_{\lfloor L/4 \rfloor +1}=\mathbb {Z}_{L/4+1}\), then it can be implemented with a subgenerator with the same value of ϕ but whose input space is reduced to \(\mathbb {Z}_{\lfloor L/8 \rfloor +1}\) by applying the following trigonometric identities (stated here for Lϕ=2π, as in ϕ=2π/L):
\[\begin{aligned}\sin(n\phi)&=\cos((L/4-n)\phi),\\ \cos(n\phi)&=\sin((L/4-n)\phi).\end{aligned}\]
If n≤L/8, then the input of the subgenerator is n and its output is the output of the functional unit. Otherwise, the input of the subgenerator is L/4−n, the sine output of the unit is the cosine output of the subgenerator, and the cosine output of the unit is the sine output of the subgenerator. Once more, a simple two’s complement provides L/4−n if ϕ is trigonometric binary.
With the previous optimizations, a sine/cosine generator can be implemented with an LUT of ⌊L/8⌋+1 entries. For example, the functional unit described in [1] previously mentioned can be implemented as shown in Fig. 1. As previously discussed, the functionality of that unit is equivalent to computing the sine and cosine of nϕ, where n is the number provided by input I in unsigned integer notation, the angle ϕ is 2π/2^{w}, and w is the number of bits of I. In this case, ϕ is trigonometric binary. In this implementation, the output is provided in some type of sign-magnitude notation such as one of the IEEE 754-1985 floating-point formats, and w>3 so λ_{t}(ϕ) is a multiple of 4 and an LUT of only 2^{w−3}+1 entries is required. The LUT of the figure should return the sine and cosine of angles in the range [0,π/4] and, since they are all positive, there is no need to store the sign bits. Instead, they are computed using a simple XOR gate (4c). The LUT has been implemented using a direct access memory. In order to prevent the problem of dealing with a direct access memory with a number of positions that is not a power of two, the access of the entry of the LUT corresponding to n=2^{w−3} is detected by a simple logic gate (4a) and is treated separately. In that case, the LUT returns the sine and cosine of π/4, that is, \(1/\sqrt{2}\), using a pair of multiplexers (2b). The address lines of the memory are fed with the w−3 least significant bits of I or with its two’s complement depending on I_{w−3}, using the adder (3) and the multiplexer (2a). The gate (4b) is employed to ascertain whether the magnitude of the sine and the cosine should be interchanged with the multiplexers (2c).
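The chain of reductions of this section can be illustrated in software. The following sketch assumes ϕ=2π/L with L=2^{w} and w≥3, stores floating-point values rather than the sign-magnitude words of Fig. 1, and does not model the gates and multiplexers of that figure; the names `make_octant_lut` and `sincos` are hypothetical.

```python
import math

def make_octant_lut(w: int):
    """LUT with L/8 + 1 entries holding (sin, cos) of n*phi for n in [0, L/8]."""
    L = 2 ** w
    phi = 2 * math.pi / L
    return [(math.sin(n * phi), math.cos(n * phi)) for n in range(L // 8 + 1)]

def sincos(n: int, w: int, lut):
    """Return (sin(n*phi), cos(n*phi)) using only the octant LUT."""
    L = 2 ** w
    n %= L                    # periodicity: keep the w least significant bits
    neg_sin = n > L // 2      # sign reduction
    if neg_sin:
        n = L - n
    neg_cos = n > L // 4      # quadrant reduction
    if neg_cos:
        n = L // 2 - n
    swap = n > L // 8         # octant reduction
    if swap:
        n = L // 4 - n
    s, c = lut[n]
    if swap:                  # undo the reductions in reverse order
        s, c = c, s
    if neg_cos:
        c = -c
    if neg_sin:
        s = -s
    return s, c
```

For w=6 the table holds 9 entries instead of 64, and the reductions only add comparisons, subtractions, a swap, and sign changes.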
3 Sublinear optimizations
The optimizations described in the previous section have the following drawbacks:

They require a subgenerator with an input space greater than λ_{t}(ϕ)/8, that is, its input space grows linearly with λ_{t}(ϕ). In many applications, the subgenerator cannot be directly implemented using an LUT with a number of entries proportional to the input space since it is excessively large. For example, even if the octant optimization could be directly applied to the arithmetic unit mentioned in Section 1, it would only reduce the size of the input space of the required subgenerator to roughly 30∗2^{49}.

They can only be directly applied if ϕ is trigonometric rational. Furthermore, the quadrant and octant optimizations require λ_{t}(ϕ) to be even and a multiple of 4, respectively. Hence, in order to apply them in the arithmetic unit of the example of Section 1, a workaround similar to that shown in [1, 14] is necessary. For example, assuming the angle is positive, the arithmetic unit could execute the following steps to compute the sine and cosine:

1. Divide the angle by 2π. This can be implemented efficiently by multiplying the angle by the precomputed value of the reciprocal of 2π.

2. Take the fixed-point representation I of the fractional part of the previous division.

3. Return the sine and cosine of nϕ, where n is the number represented by I in unsigned integer notation, \(\phi =\frac {2\pi }{2^{w}}\), and w is the width of I.
This last step can be carried out by a generator that can be implemented using the optimizations described in the previous section. However, this approach has its own drawbacks: first, the cost of a multiplication is introduced; second, the reciprocal of 2π is not rational, and hence, its exact representation cannot be stored. Moreover, even if we could use an exact representation of the reciprocal of 2π, the exact product of the angle by the reciprocal is not rational unless the angle is zero, that is, the exact result of the division of the angle by 2π is not rational. Hence, its representation cannot be exact and an error is introduced [14].

In the following subsections, optimizations without these drawbacks are described.
3.1 Branching
This optimization, used in [1], accepts an arbitrary value of ϕ, although it was originally employed for a trigonometric rational value. When branching is applied, the generator is implemented using two subgenerators, M_{1} and M_{0}, which we call branches. The inputs of the branches are denoted by A(1) and A(0), while the widths of these inputs are denoted by L(1) and L(0), respectively. These are chosen so that the width of the input of the generator, I, is w=L(1)+L(0). M_{1} provides the sine and cosine of integer multiples of ϕ_{1}, that is, the sine and cosine of n_{1}ϕ_{1}, where n_{1} is the value represented by A(1) and ϕ_{1}=2^{L(0)}ϕ. On the other hand, M_{0} provides the sine and cosine of n_{0}ϕ_{0}, where n_{0} is the value represented by A(0) and ϕ_{0}=ϕ. The least significant bits of I are connected to A(0), while the rest are connected to A(1). Since I is the concatenation of A(1) and A(0), the value represented by I is
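\[n=n_{1}2^{L(0)}+n_{0},\]
and hence,
\[n\phi=n_{1}2^{L(0)}\phi+n_{0}\phi=n_{1}\phi_{1}+n_{0}\phi_{0}.\]
Since the sines and cosines of n_{1}ϕ_{1} and n_{0}ϕ_{0} are provided by M_{1} and M_{0}, the sine and cosine of their sum can be computed by applying the following trigonometric identities:
\[\begin{aligned}\sin(A+B)&=\sin(A)\cos(B)+\cos(A)\sin(B),\\ \cos(A+B)&=\cos(A)\cos(B)-\sin(A)\sin(B).\end{aligned}\tag{6}\]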
Alternatively, we can say that each subgenerator M_{k} provides the complex number \(\sin (n_{k}\phi _{k})i+\cos (n_{k}\phi _{k})=e^{n_{k}\phi _{k} i}\), and the generator can provide the value e^{nϕi} by computing the complex product \(e^{n_{1}\phi _{1} i}e^{n_{0}\phi _{0} i}\). Indeed, computing the product of two complex numbers of unit modulus is equivalent to computing the sine and cosine of the sum of two angles from the sine and cosine of those angles, and it requires four real products, a real sum, and a real subtraction. A generalization of this branching technique was proposed in [15] to compute twiddle factors. Note that the sum of the sizes of the input spaces of M_{1} and M_{0} is minimum when L(1) and L(0) differ by no more than 1. In this case, such a sum grows with the square root of the size of the original input space, that is, sublinearly [15]. According to [1], floating-point sine/cosine applications will benefit from a fixed-point conversion of the datapath around these functions.
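A minimal software sketch of branching follows, assuming both branches are plain lookup tables holding floating-point values (in [1], M_{0} uses a Taylor series instead). The function name `branching_generator` and the closure-based interface are illustrative choices, not part of the original design.

```python
import math

def branching_generator(phi: float, l1: int, l0: int):
    """Build a generator of (sin(n*phi), cos(n*phi)) from two branch tables."""
    phi1 = 2 ** l0 * phi   # angle step of the branch fed by the upper bits
    m0 = [(math.sin(n * phi), math.cos(n * phi)) for n in range(2 ** l0)]
    m1 = [(math.sin(n * phi1), math.cos(n * phi1)) for n in range(2 ** l1)]

    def generate(n: int):
        n0 = n & (2 ** l0 - 1)   # A(0): the l0 least significant bits of I
        n1 = n >> l0             # A(1): the remaining bits of I
        s1, c1 = m1[n1]
        s0, c0 = m0[n0]
        # Angle-sum identities: four real products, one sum, one subtraction.
        return s1 * c0 + c1 * s0, c1 * c0 - s1 * s0

    return generate
```

With w=10 and L(1)=L(0)=5, the two tables hold 32+32 entries instead of the 1024 of a direct table, at the cost of one complex multiplication.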
3.2 Tree generator
The implementation of the branches was not detailed in the previous subsection. In the generator described in [1], the branch M_{0} computes its output using the Taylor series, while M_{1} is implemented with an LUT of affordable size. Further optimization could be achieved if one or both branches were, in turn, implemented with subbranches. This recursive application of the branch optimization is used by the tree generator described in [16]. In general, the tree generator requires a set of subgenerators that we will call leaves, complex multipliers, and, if the implementation is sequential or pipelined, registers. The following notation is employed for its description:

w: width of the input of the tree generator

I=I_{w−1}I_{w−2}…I_{1}I_{0}: input of the tree generator

\(n=\sum _{t=0}^{w-1}{I_{t}2^{t}}\): number represented by the input of the tree generator

m: number of leaf subgenerators employed

M_{0},M_{1},…, M_{m−1}: the m leaves

L(k): width of the input of the leaf M_{k}

A(k)=A(k)_{L(k)−1}…A(k)_{0}: input of the leaf M_{k}

\(n_{k}=\sum _{t=0}^{L(k)-1}{A(k)_{t}2^{t}}\): number represented by the input A(k)

\(SL(k)=\sum _{t=0}^{k-1}{L(t)}=\left \{\begin {array}{ll} 0 & \textrm {if } k=0 \\ L(k-1)+SL(k-1) & \textrm {if } k>0 \end {array}\right.\): total number of input lines of the leaves with an index lower than k

ϕ_{k}: angle defined by ϕ_{k}=(2^{SL(k)})ϕ
Each leaf subgenerator M_{k} provides the sine and cosine of n_{k}ϕ_{k}. The leaves are chosen such that the sum of the widths of their inputs is equal to the width of the input of the tree generator:
\[\sum_{k=0}^{m-1}L(k)=w.\]
The input lines of each leaf M_{k} are connected to the input lines of the tree generator from I_{SL(k)} to I_{SL(k+1)−1}, that is, each input line A(k)_{t} is connected to I_{t+SL(k)}:
\[A(k)_{t}=I_{t+SL(k)},\quad 0\le t<L(k).\]
Hence, the input value n represented by I becomes:
\[n=\sum_{t=0}^{w-1}I_{t}2^{t}=\sum_{k=0}^{m-1}\sum_{t=0}^{L(k)-1}A(k)_{t}2^{t+SL(k)}=\sum_{k=0}^{m-1}n_{k}2^{SL(k)},\]
and therefore the angle whose sine and cosine must be computed by the tree generator can be written as:
\[n\phi=\sum_{k=0}^{m-1}n_{k}2^{SL(k)}\phi=\sum_{k=0}^{m-1}n_{k}\phi_{k}.\]
Hence, the angle nϕ is the sum of the subangles n_{k}ϕ_{k}, or, alternatively, the complex number e^{nϕi} is the product \(e^{n_{0}\phi _{0} i}e^{n_{1}\phi _{1} i} \ldots e^{n_{m-1}\phi _{m-1} i}\). Again, since the sine and cosine of the subangles are provided by the leaves, the sine and cosine of nϕ can be computed with complex multiplications. Taking this into account, the structure of the generator described in [16] becomes a directed rooted binary tree with m leaves. Each vertex corresponds to a component whose output is a complex number of unit modulus. Each internal vertex has exactly two children and corresponds to a complex multiplier that computes the product of the outputs of the components associated with these children. The components corresponding to the m leaves are the m subgenerators and provide the complex values \(e^{n_{k}\phi _{k} i}\). The output of the tree generator is the output of the component corresponding to the root vertex. Hereinafter, the height of the tree will be denoted as h. The following recommendations may improve the efficiency of the design:

It is desirable to minimize the height of the tree h in order to reduce latency and rounding errors. This is achieved if the structure of the generator is a complete binary tree.

If each leaf is implemented with an LUT, the total number of entries is minimum when the widths of the inputs of those LUTs differ by no more than 1. To this end, let q be the quotient obtained by dividing w by m, and let r be the remainder. A total of r LUTs must have inputs of width q+1. The other LUTs must have inputs of width q.

If the above recommendation is followed, then the total number of entries decreases when m increases. For a fixed height h, the maximum possible value of m is 2^{h}, and therefore, the total number of entries can be minimized by using 2^{h} leaves.
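The width recommendation above amounts to a single integer division. The sketch below uses hypothetical helper names, and the choice of giving the extra address line to the lowest-indexed leaves is an assumption; the text only fixes how many leaves have width q+1.

```python
def leaf_widths(w: int, m: int):
    """Split w input lines among m leaves so that widths differ by at most 1."""
    q, r = divmod(w, m)
    # r leaves get q + 1 lines, the other m - r leaves get q lines.
    return [q + 1] * r + [q] * (m - r)

def total_entries(widths):
    """Total number of LUT entries when every leaf is a direct-access table."""
    return sum(2 ** width for width in widths)
```

For w=11 and m=4 this yields widths 3, 3, 3, and 2, that is, 28 entries in total, compared with 2^{11}=2048 for a single table.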
In order to ascertain the power of this approach, suppose we use a complete binary tree with height h=⌊log_{2}(w)⌋. In this case, the number of subgenerators m would be no greater than w, and each subgenerator would have no more than 2 input lines. If each subgenerator is implemented with an LUT, then an upper bound on the total number of entries is 4w, that is, the total number of entries grows logarithmically with the size of the input space of the tree generator. Hence, the implementation of a sine/cosine generator with an input space as large as that required by the arithmetic unit mentioned in Section 1 is feasible with a tree generator. Note that a tree generator can be combined with the argument reduction mentioned in Section 2. For example, in [1], argument reduction is first applied, and hence, only a subgenerator with an input space of roughly 1/8 of the original size is required. That subgenerator is then implemented with a tree generator of height h=1.
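The tree generator can be modelled in software as a set of leaf tables whose outputs are combined pairwise by complex multiplications. This is a behavioural sketch using Python complex floats, with no fixed-point rounding, registers, or pipelining; the names `build_leaves` and `tree_generate` are hypothetical.

```python
import cmath

def build_leaves(phi: float, widths):
    """Return one (SL(k), L(k), table) triple per leaf M_k."""
    leaves, sl = [], 0
    for width in widths:
        phi_k = (2 ** sl) * phi                      # phi_k = 2^SL(k) * phi
        table = [cmath.exp(1j * n * phi_k) for n in range(2 ** width)]
        leaves.append((sl, width, table))
        sl += width
    return leaves

def tree_generate(n: int, leaves) -> complex:
    """Return e^{i*n*phi}; the real part is the cosine, the imaginary part the sine."""
    # Each leaf reads its own slice of input lines: bits SL(k) .. SL(k+1)-1.
    values = [table[(n >> sl) & ((1 << width) - 1)] for sl, width, table in leaves]
    while len(values) > 1:                           # one tree level per pass
        paired = [values[i] * values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:                          # odd vertex passes through
            paired.append(values[-1])
        values = paired
    return values[0]
```

With four leaves the loop runs twice, matching a complete binary tree of height h=2 with three complex multipliers.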
In the following subsubsections, we will see several optimizations that can be applied to the tree generator. In the rest of the paper, ϕ>0 is assumed for the sake of simplicity, although in practice this is not a restriction since cos(nϕ)= cos(n|ϕ|) and sin(nϕ)=sgn(ϕ)∗ sin(n|ϕ|).
3.2.1 Quadrant restriction
This optimization can be applied to quadrant restricted sine/cosine generators, which are defined as follows:
Definition 5
Given a functional unit with an integer input n≥0 that computes one or more trigonometric functions of nϕ, where ϕ>0 is a constant, the unit is quadrant restricted if and only if \(\frac {\pi }{2\phi }\) is an upper bound on its input space.
If a sine/cosine generator is quadrant restricted, then it must compute the sine and cosine of an angle nϕ in the interval [0,π/2]. Since both functions are positive in that interval, the following optimizations are possible:

As in the example of Section 2, if the generator is implemented with an LUT, there is no need to store the sign bits.

If it is implemented with a tree generator, no signed adders, subtracters, or multipliers are required.
For example, the tree generator used in [1] is quadrant restricted, and hence, the complex multiplier requires no signed arithmetic components and the LUT employed does not need to store the sign bits. Note that even if a tree generator is not quadrant restricted, it may contain quadrant restricted branches that can benefit from these optimizations.
3.2.2 Leading zeros of the sine
This optimization is useful when the sine values of a quadrant-restricted generator are coded in fixed-point. In this case, an upper bound on the sine output is sin(n_{max}ϕ), where n_{max} is the maximum of the input space. Consequently, if the k most significant bits of the fixed-point representation of sin(n_{max}ϕ) are 0, those bits of the sine output of the generator are always 0, and the following optimizations are possible:

If the generator is implemented with an LUT, then there is no need to store the k most significant bits of the sine.

If the generator feeds a complex multiplier of a tree generator, the size of its real multipliers can be reduced.
3.2.3 Leading ones of the cosine
This optimization is useful when the cosine values of a quadrant-restricted generator are coded in fixed-point using all the bits for the fractional part. In this case, the representable value nearest to cos(0)=1 corresponds to the word with all the bits equal to 1. A lower bound on the cosine output is cos(n_{max}ϕ), where n_{max} is the maximum of the input space, and therefore, if the k most significant bits of the fixed-point representation of cos(n_{max}ϕ) are 1, those bits of the cosine output of the generator are always 1. Hence, if the generator is implemented with an LUT, there is no need to store the k most significant bits of the cosine.
To exemplify these optimizations, suppose a generator must provide the sine and cosine of nϕ in fixed-point notation with 8 fractional bits and no integer bits, rounding to the nearest representable value. In this example, ϕ=2π/2^{11}=π/2^{10}, that is, it is trigonometric binary and λ_{t}(ϕ)=2^{11} is a multiple of 4, and hence, we first apply argument reduction and treat the case n=2^{8} separately as in the example of Section 2. A subgenerator still has to be subsequently implemented with an input space of size 2^{8} (w=H_{t}(ϕ)−3=8). We implement this subgenerator with a tree generator of height h=1, and therefore, there are 2 leaves (m=2^{h}=2) as shown in Fig. 2. Since this tree generator is quadrant restricted, it does not require signed arithmetic components. Each leaf is implemented with direct access memory (5) of depth 2^{4}. Note that the sine/cosine values must be stored with a precision of 17 bits to compensate for rounding errors. The leaf M_{0} provides the sines and cosines of the multiples of ϕ_{0}=2^{0}ϕ=π/2^{10}, while the leaf M_{1} provides the sines and cosines of the multiples of ϕ_{1}=2^{4}ϕ=π/2^{6}. The greatest angle whose sine and cosine is stored in M_{0} is 15π/2^{10}. The 4 most significant bits of the representation of the sine of this angle are 0 and, since the generator is quadrant restricted, the 4 most significant bits of the other representations of the sines stored in M_{0} are also 0, and therefore, there is no need for them to be stored. On the other hand, the 9 most significant bits of the representation of the cosine of that angle are equal to 1, and therefore, there is no need for them to be stored either. Similar optimizations can be applied to M_{1}, but in this case, only one bit can be saved. Note that we only need two integer multipliers of size 13×17 (6a) and two of size 17×17 (6b) instead of four multipliers of size 17×17 thanks to the leading zeros of the sine. The leading ones of the cosine cannot be employed to reduce the size of the arithmetic components in a similar way. In the last stage, an adder (3) provides the sine of the tree generator and a subtracter (7) provides the cosine.
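The bit counts quoted in this example can be reproduced with a short script. It assumes 17-bit words with all bits fractional, rounding to nearest, and saturation of the value 1 to the all-ones word; the helper names `leading_zeros` and `leading_ones` are hypothetical.

```python
import math

def to_word(value: float, bits: int) -> int:
    """Nearest fixed-point word with `bits` fractional bits, saturated to all ones."""
    return min(round(value * 2 ** bits), 2 ** bits - 1)

def leading_zeros(value: float, bits: int) -> int:
    """Number of most significant bits equal to 0 in the word of `value`."""
    return bits - to_word(value, bits).bit_length()

def leading_ones(value: float, bits: int) -> int:
    """Number of most significant bits equal to 1 in the word of `value`."""
    inverted = to_word(value, bits) ^ (2 ** bits - 1)   # flip every bit
    return bits - inverted.bit_length()
```

Applied to the greatest angle stored in M_{0}, 15π/2^{10}, these helpers give 4 leading zeros for the sine and 9 leading ones for the cosine; for the greatest angle stored in M_{1}, 15π/2^{6}, they give no leading zeros and a single leading one.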
4 Sine/complement generator
In the optimizations described in Section 3, the angle whose sine/cosine must be computed is decomposed into two subangles, A and B. Two subgenerators, called branches, are employed to compute the sine/cosine of A and B, and then the sine/cosine of A+B is computed by applying the identities (6). This method presents the following drawbacks:

If the maximum value of one of the angles A or B is small, then its sine is close to 0, while its cosine is close to 1. In this case, the product cos(A) cos(B) can be orders of magnitude greater than sin(A) sin(B), and hence, smearing may occur when computing cos(A+B)= cos(A) cos(B)− sin(A) sin(B).

Unlike the optimization described in Section 3.2.2, the one described in Section 3.2.3 fails to help in the reduction of the required arithmetic components.
These problems can be solved by using another type of generator that we call a sine/complement generator. Such a generator receives an integer n and computes the sine and the complement of the cosine of nϕ, where the complement of the cosine is defined as follows:
Definition 6
The complement of the cosine of x is com(x)=1− cos(x)
It is possible to compute the sine and the complement of the cosine of the sum of two angles, A and B, from the sines and complements of the cosines of those angles by using a functional unit called a trigonometric adder [17]. Similar to the complex multiplier, the trigonometric adder can be implemented with adders (3), subtracters (7), and multipliers (6) as depicted in Fig. 3. This trigonometric adder implementation uses the following trigonometric identities derived from those of (6):
\[\begin{aligned}\sin(A+B)&=\sin(A)+\sin(B)-\sin(A)\,\mathrm{com}(B)-\mathrm{com}(A)\sin(B),\\ \mathrm{com}(A+B)&=\mathrm{com}(A)+\mathrm{com}(B)-\mathrm{com}(A)\,\mathrm{com}(B)+\sin(A)\sin(B).\end{aligned}\tag{10}\]
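The behaviour of a trigonometric adder can be sketched as follows. The expressions are obtained by substituting cos(x)=1−com(x) into the angle-sum identities; the grouping of operations in Fig. 3 may differ, and the function names are hypothetical.

```python
import math

def com(x: float) -> float:
    """Complement of the cosine, com(x) = 1 - cos(x)."""
    return 1.0 - math.cos(x)

def trig_add(sin_a: float, com_a: float, sin_b: float, com_b: float):
    """Return (sin(A+B), com(A+B)) from the sines and complements of A and B."""
    # sin(A+B) = sin A (1 - com B) + (1 - com A) sin B
    s = sin_a + sin_b - sin_a * com_b - com_a * sin_b
    # com(A+B) = 1 - (1 - com A)(1 - com B) + sin A sin B
    c = com_a + com_b - com_a * com_b + sin_a * sin_b
    return s, c
```

Like the complex multiplier, this formulation needs four real products, but for small angles every product involves only small quantities (sines and complements), which is what allows the multipliers to be narrowed.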
Trigonometric adders enable the implementation of a sine/complement generator using a tree structure similar to that described in Section 3.2. Such an implementation, described in [18], requires a set of sine/complement subgenerators (the leaves of the tree) as well as trigonometric adders (the internal vertices). Furthermore, if the generator is quadrant restricted, then optimizations similar to those described in Section 3.2 can be applied:

Since the complement of the cosine is also positive in [0,π/2], the trigonometric adders can be implemented without signed arithmetic components.

If fixed-point representation is used, the leading zeros of the sines make it possible to reduce the size of the integer multipliers and that of the leaves.
In this case, the optimization described in Section 3.2.3 cannot be applied, but if fixed-point representation is used, we can use the following optimization that we call leading zeros of the complement: if a branch is quadrant restricted, an upper bound on its complement output is com(n_{max}ϕ), where n_{max} is the maximum of the input space of the branch. Therefore, if the k most significant bits of the fixed-point representation of com(n_{max}ϕ) are 0, those bits of the complement output of the branch are always 0. If the branch is implemented with an LUT, then there is no need to store those bits. Note that, unlike the optimization described in Section 3.2.3, this optimization enables a reduction of the size of the required multipliers.
A sine/cosine generator can be implemented with a sine/complement generator by simply adding a trivial arithmetic circuit to subtract the complement of the cosine from 1. In fact, if fixed-point notation is used, then this arithmetic circuit is not necessary since the trigonometric adder corresponding to the root can be easily modified to provide cos(nϕ) instead of com(nϕ) at no additional cost. To this end, instead of com(nϕ)=com(A+B), the root vertex computes its opposite −com(nϕ)=−com(A+B) by using the following equation:
\[-\mathrm{com}(A+B)=\mathrm{com}(A)\,\mathrm{com}(B)-\sin(A)\sin(B)-\mathrm{com}(A)-\mathrm{com}(B).\]
Subsequently, 1 can be added to −com(nϕ) in order to obtain cos(nϕ). Note that this last operation merely toggles the integer bit of the representation of −com(nϕ). If a sine/cosine generator or a branch of it is quadrant restricted, then it should be implemented employing a sine/complement generator for the following reasons:

The smearing problems are lessened by using Eq. 10.

In contrast to the leading ones of the cosine optimization, the leading zeros of the complement optimization make it possible to reduce the size of the multipliers.
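The behaviour of a trigonometric adder and of the modified root vertex can be checked numerically with the following sketch. It assumes com(x)=1−cos(x) and derives the sum formulas from the standard angle-addition identities; the exact arrangement of terms used in the article's equations may differ. The fixed-point layout (one integer bit, `frac_bits` fractional bits, two's complement) and all function names are assumptions made for illustration.

```python
import math

def com(x):
    # Complement of the cosine: 1 - cos(x).
    return 2.0 * math.sin(x / 2.0) ** 2

def trig_add(sa, ca, sb, cb):
    # (sin, com) of A+B from (sin, com) of A and of B.
    s = sa + sb - (sa * cb + ca * sb)
    c = ca + cb + (sa * sb - ca * cb)
    return s, c

def root_cosine(sa, ca, sb, cb, frac_bits):
    # The root computes -com(A+B), which lies in [-1, 0] when A+B is in
    # the first quadrant.
    neg_com = (ca * cb - sa * sb) - ca - cb
    mask = (1 << (frac_bits + 1)) - 1
    # Two's complement word with one integer bit and frac_bits fractional bits.
    word = round(neg_com * (1 << frac_bits)) & mask
    # Adding 1 modulo 2 is the same as toggling the integer bit.
    cos_word = word ^ (1 << frac_bits)
    # Read the result as an unsigned value with one integer bit.
    return cos_word / (1 << frac_bits)

a, b = 0.3, 0.5
s, c = trig_add(math.sin(a), com(a), math.sin(b), com(b))
approx_cos = root_cosine(math.sin(a), com(a), math.sin(b), com(b), 16)
print(s, c, approx_cos)
```

In this sketch, no adder is needed to turn −com(nϕ) into cos(nϕ): the carry out of the integer bit is simply discarded, so a single XOR on that bit is enough.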
As an example, Fig. 4 shows how to implement a sine/cosine generator with an input I of width w=11 using a sine/complement generator whose topology is a tree of height h=2. The sine/complement generator uses m=2^{h}=4 subgenerators, M_{0}, M_{1}, M_{2}, and M_{3}, which have been implemented with direct access memories (8). Following the recommendations of Section 3.2 to minimize the total number of memory locations, M_{3} has 2 address lines (q=⌊w/m⌋=2) and each of the other 3 remaining memories (r=w−mq=3) has an additional address line, and therefore, L(3)=2 and L(2)=L(1)=L(0)=3. Hence, SL(0)=0, ϕ_{0}=2^{0}ϕ, SL(1)=3, ϕ_{1}=2^{3}ϕ, SL(2)=6, ϕ_{2}=2^{6}ϕ, SL(3)=9, and ϕ_{3}=2^{9}ϕ. Each memory M_{k} contains the sines and the complement of the cosines of the multiples of ϕ_{k}=(2^{SL(k)})ϕ, and therefore, its output provides the sine and the complement of the cosine of n_{k}ϕ_{k}, where n_{k} is the value of its address lines. Each address line t of each memory M_{k} is connected to I_{t+SL(k)}, that is, the inputs of M_{0}, M_{1}, M_{2}, and M_{3} are connected to I_{2}I_{1}I_{0}, I_{5}I_{4}I_{3}, I_{8}I_{7}I_{6}, and I_{10}I_{9}, respectively. Hence, n=n_{0}2^{SL(0)}+n_{1}2^{SL(1)}+n_{2}2^{SL(2)}+n_{3}2^{SL(3)} ⇒ nϕ=n_{0}2^{SL(0)}ϕ+n_{1}2^{SL(1)}ϕ+n_{2}2^{SL(2)}ϕ+n_{3}2^{SL(3)}ϕ=n_{0}ϕ_{0}+n_{1}ϕ_{1}+n_{2}ϕ_{2}+n_{3}ϕ_{3}. Three trigonometric adders are used (9). Those connected directly to the memories are employed to compute the sine and the complement of the cosine of the angles n_{0}ϕ_{0}+n_{1}ϕ_{1} and n_{2}ϕ_{2}+n_{3}ϕ_{3}. The other computes the sine and the complement of the cosine of nϕ=n_{0}ϕ_{0}+n_{1}ϕ_{1}+n_{2}ϕ_{2}+n_{3}ϕ_{3}. A trivial arithmetic circuit (10) subtracts the complement of the cosine of nϕ from 1 to obtain the cosine of nϕ.
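The address-line partition of this example can be reproduced with a short sketch. It assumes, as in the example, that the r memories with the lowest indices receive the additional address line; the function names `partition` and `split_input` are hypothetical.

```python
def partition(w, h):
    # m = 2**h subgenerators; q = floor(w/m) address lines each,
    # plus one extra line for r = w - m*q of them.
    m = 2 ** h
    q, r = divmod(w, m)
    L = [q + 1 if k < r else q for k in range(m)]   # address lines of M_k
    SL = [sum(L[:k]) for k in range(m)]             # shift of M_k: phi_k = 2**SL[k] * phi
    return L, SL

def split_input(n, L, SL):
    # n_k is the field of n made of L[k] bits starting at bit SL[k].
    return [(n >> SL[k]) & ((1 << L[k]) - 1) for k in range(len(L))]

L, SL = partition(11, 2)
parts = split_input(1453, L, SL)   # 1453 = 0b10110101101
rebuilt = sum(n_k << SL[k] for k, n_k in enumerate(parts))
print(L, SL, parts, rebuilt)
```

For w=11 and h=2, the sketch yields L=[3, 3, 3, 2] and SL=[0, 3, 6, 9], which matches the values given in the example, and the fields n_{k} recombine to the original input n.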
5 Experiments
In order to measure the possible enhancements that the proposed approach may provide, we have written an open-source tool, called twiddle.py, to automate the design of twiddle factor generators that use the proposed optimization. The tool admits arbitrary output precision and tree height. We selected the same size for the input, the sine output, and the cosine output. In the current version, the sequence length must be a power of 2, so it is possible to apply argument reduction and only quadrant-restricted generators are needed. The sequence lengths in our experiments ranged from 2^{16} to 2^{23}. The twiddle.py tool follows the recommendations of Section 3.2 to minimize the size of the memories. More information about the tool can be found in the section “Availability of data and materials.” The values computed by the generators are faithfully rounded.
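The reason why a power-of-2 sequence length only requires quadrant-restricted generators can be sketched as follows. The sketch assumes a sequence length of 2^{w} with ϕ=2π/2^{w} and uses the standard quadrant symmetries of the sine and cosine; it is not taken from twiddle.py, and the names `twiddle` and `quadrant_gen` are hypothetical. The sign convention of the twiddle factor exponent is left out.

```python
import math

def twiddle(n, w, quadrant_gen):
    # Returns (cos, sin) of 2*pi*n / 2**w, assuming w >= 2.
    q = (n >> (w - 2)) & 3              # two most significant bits: quadrant
    m = n & ((1 << (w - 2)) - 1)        # remaining bits: angle in [0, pi/2)
    s, c = quadrant_gen(m)              # (sin, cos) of m * 2*pi / 2**w
    if q == 0:
        return c, s
    if q == 1:
        return -s, c
    if q == 2:
        return -c, -s
    return s, -c

# Reference quadrant-restricted generator used only to exercise the sketch.
w = 6
phi = 2 * math.pi / 2**w
ref_gen = lambda m: (math.sin(m * phi), math.cos(m * phi))
results = [twiddle(n, w, ref_gen) for n in range(2**w)]
```

In a hardware generator, `quadrant_gen` would be replaced by the quadrant-restricted sine/complement tree, with the cosine obtained from the complement.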
6 Results and discussion
6.1 Multipliers
As described in Section 3.2.3, the leading ones of the cosine optimization make it possible to reduce the size of the memories, but not the size of the required arithmetic components. We claimed that the leading zeros of the complement optimization, described in Section 4, is more convenient because it makes it possible to reduce not only the size of the memories, but also the size of the multipliers. To verify this claim, we used the twiddle.py tool to implement generators of height 1 with input size ranging from 16 to 23 bits. The comparison of the size of the complement multiplicands and the corresponding cosine multiplicands is shown in Table 2. The width in bits of the cosine and the complement of the cosine is reported in the subcolumns cosine and compl, respectively. The relative reduction is reported in the subcolumns red.
The first fractional bit of the cosine multiplicand is always zero. However, the precision of the multiplicands must be two bits higher than the output precision to compensate for rounding errors. For this reason, the size of the cosine multiplicands of both branches must be a bit greater than the output data width. The size of the corresponding complement multiplicand of the left branch is just one bit lower, so the relative saving decreases with the data width. On the other hand, the size of the complement multiplicand of the right branch is almost constant and is never greater than 4 bits, so the relative saving increases with the data width and ranges from 83 to 90%.
As noted in [1], an obvious way to further reduce the size of the memories is to replace them by subbranches with memories of smaller depth at the cost of more multiplications. Of course, if the implementation is fully pipelined or combinational, the additional multiplications must be carried out by new multipliers. If such a replacement is carried out, the angles corresponding to most of these new memories are remarkably smaller. For this reason, the leading zeros of the complement optimization should further reduce the size of the additional multipliers. To verify this, we used twiddle.py to carry out such replacements by increasing the tree height by one level. The saving obtained for each branch is reported in the next subsubsections.
6.1.1 Left branch
The size of the operands of the left branch is shown in Table 3. As in the tree of height 1, the saving obtained by using the proposed implementation in its left subbranch is negligible, but the saving in its right subbranch is remarkable, ranging from 41 to 53%.
6.1.2 Right branch
As shown in Table 4, the saving obtained by using the proposed implementation in the right branch is considerably larger. To begin, the saving corresponding to its left subbranch is very high, ranging from 77 to 88%. Moreover, the saving corresponding to its right subbranch is always 100%. This means that the complement multiplicand of the right subbranch is always zero and the corresponding multipliers can be removed.
6.2 Implementation
The hardware descriptions of the twiddle factor generators with a sine/complement tree of height 1 were implemented in a Xilinx Virtex-7 XC7VX485T-2FFG1761 field-programmable gate array (FPGA) chip. We also implemented the equivalent sine/cosine tree generators described in [1] and coordinate rotation digital computer (CORDIC) generators to measure the relative enhancements and penalties. Following the notation used in [1], the reference sine/cosine tree implementations will be called SinAndCos. The proposed implementations will be called SinAndCom. The SinAndCos and CORDIC implementations are generated by the flopoco open-source tool version 4.1.2 available at http://flopoco.gforge.inria.fr. The tested implementations are combinational, but are embedded in a dummy sequential module in order to obtain delay estimations from the synthesis tool. Synthesis was carried out with the Xilinx Vivado Design Suite version 2017.2.1 using the default options. The only exception is that the use of the Digital Signal Processing (DSP) blocks was disabled in order to extrapolate the results to other reconfigurable devices as well as to ASIC. The obtained results are reported in the next subsubsections.
6.2.1 Delay
The delay of the slowest path of the generators is shown in Tables 5, 6, and 7. The SinAndCom approach turned out to be about 50% faster than CORDIC but was slower than SinAndCos in most cases. For the lower data widths, the penalty was as high as 26%. One of the reasons is that, in order to apply Eq. 10, it is necessary to execute more explicit additions and subtractions than in Eq. 6. The penalty is less severe for higher widths, and there are some small speedups ranging from 5 to 7%.
6.2.2 Resources
The resource utilization of the generators is shown in Tables 8, 9, and 10. No DSP units were used. As mentioned in Section 3.2, in order to save resources, the flopoco tool does not implement the right branch of the SinAndCos generators with a memory. Instead, the corresponding values are evaluated by using the Taylor series. The same optimization can be applied to the SinAndCom approach, but twiddle.py does not implement it yet. For this reason, this comparison is not fair to the SinAndCom approach but, even without the Taylor series optimization, SinAndCom used fewer LUTs in every case. The best saving was as high as 26%. The implementation of the Taylor optimization should provide higher relative savings. Also, as mentioned in Section 6.1, the relative saving should be remarkable if higher tree structures are used. Unfortunately, unlike twiddle.py, the current version of flopoco only supports tree structures of height 1, so we were unable to verify this. Regarding CORDIC, for low data widths, SinAndCom reduced the number of lookup tables by between 20 and 27%, while for higher data widths, it introduced a penalty as high as 15% in the resource utilization.
7 Conclusions
In this paper, we propose a table-based sine/cosine computation technique. In the proposed technique, the complement of the cosine is computed before the cosine itself in order to reduce the size of the required multiplications. We have released an open-source tool to automate the design of twiddle factor generators of arbitrary size and precision using the proposed technique. Several twiddle factor generators have been implemented in a Xilinx Virtex-7 XC7VX485T-2FFG1761 FPGA chip using the proposed technique and other techniques described in [1]. The proposed technique is about 50% faster than CORDIC. Also, when compared with previous table-based implementations with a tree structure of height 1, the proposed technique enabled a saving of up to 26% in LUTs at the expense of delay. To increase the relative saving, some of the memories could be replaced by circuits that evaluate the corresponding values. Further research is required to measure the benefits that such optimization could provide. Also, it would be interesting to make new comparisons if a version of flopoco supporting higher tree structures is released.
Abbreviations
ASIC: Application-specific integrated circuit
CORDIC: Coordinate rotation digital computer
DCT: Discrete cosine transform
DSP: Digital signal processing
DFT: Discrete Fourier transform
DST: Discrete sine transform
DVB-T2: Digital video broadcasting - terrestrial 2
FFT: Fast Fourier transform
FPGA: Field-programmable gate array
IDCT: Inverse discrete cosine transform
IDST: Inverse discrete sine transform
PLC: Power line communications
LUT: Lookup table
References
F. de Dinechin, M. Istoan, G. Sergent, Fixed-point trigonometric functions on FPGAs. SIGARCH Comput. Archit. News. 41(5), 83–88 (2014). https://doi.org/10.1145/2641361.2641375.
K. J. Lin, C. C. Hou, in Proceedings of the IEEE 2nd Global Conference on Consumer Electronics (GCCE 2013). Implementation of trigonometric custom functions hardware on embedded processor (Tokyo, 2013), pp. 155–157. https://doi.org/10.1109/GCCE.2013.6664782.
H. Huang, L. Xiao, J. Liu, CORDIC-based unified architectures for computation of DCT/IDCT/DST/IDST. Circ. Syst. Signal Proc. 33(3), 799–714 (2014). https://doi.org/10.1007/s0003401396619.
D. Goldberg, What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. 23(1), 5–48 (1991). https://doi.org/10.1145/103162.103163.
V. Lefevre, J. M. Muller, in Proceedings of the 15th IEEE Symposium on Computer Arithmetic. ARITH-15 2001. Worst cases for correct rounding of the elementary functions in double precision (Vail, 2001), pp. 111–118. https://doi.org/10.1109/ARITH.2001.930110.
T. Kulshreshtha, A. S. Dhar, CORDIC-based high throughput sliding DFT architecture with reduced error-accumulation. Circ. Syst. Signal Proc. 37(11), 5101–5126 (2018). https://doi.org/10.1007/s000340180810z.
IEEE Standard for broadband over power line networks: medium access control and physical layer specifications. IEEE Std 1901-2010, 1–1586 (2010). https://doi.org/10.1109/IEEESTD.2010.5678772.
S. Y. Lin, C. L. Wey, M. D. Shieh, Low-cost FFT processor for DVB-T2 applications. IEEE Trans. Consum. Electron. 56(4), 2072–2079 (2010). https://doi.org/10.1109/TCE.2010.5681074.
R. H. Stanton, in Proceedings of the 31st Annual SAS Symposium on Telescope Science. Photon counting - one more time (Big Bear Lake, 2012), pp. 177–184. http://adsabs.harvard.edu/abs/2012SASS...31..177S.
H. Nakahara, H. Nakanishi, T. Sasao, On a wideband fast Fourier transform for a radio telescope. SIGARCH Comput. Archit. News. 40(5), 46–51 (2012). https://doi.org/10.1145/2460216.2460225.
F. Qureshi, O. Gustafsson, in Proceedings of the 2009 Conference Record of the Forty-Third Asilomar Conference on Signals, Systems and Computers. Analysis of twiddle factor memory complexity of radix-2^i pipelined FFTs (Pacific Grove, 2009), pp. 217–220. https://doi.org/10.1109/ACSSC.2009.5470121.
J. G. Nash, Distributed-memory-based FFT architecture and FPGA implementations. Electronics. 7(7) (2018). https://doi.org/10.3390/electronics7070116.
J. W. Cooley, P. A. W. Lewis, P. D. Welch, Historical notes on the fast Fourier transform. Proc. IEEE. 55(10), 1675–1677 (1967). https://doi.org/10.1109/PROC.1967.5959.
R. A. Smith, A continued-fraction analysis of trigonometric argument reduction. IEEE Trans. Comput. 44(11), 1348–1351 (1995). https://doi.org/10.1109/12.475133.
H. Kang, B. Yang, J. Lee, Low complexity twiddle factor multiplication with ROM partitioning in FFT processor. Electron. Lett. 49(9), 589–591 (2013). https://doi.org/10.1049/el.2013.0689.
D. Guerrero, J. Viejo, P. Ruiz-de-Clavijo, J. Juan, M. J. Bellido, A. Millan, E. Ostua, J. I. Villar, J. Quiros, A. Muñoz, Digital electronic circuit for calculating sines and cosines of multiples of an angle. WO2018104566A1 (2018).
D. Guerrero, A. Millan, J. Juan, J. Viejo, M. J. Bellido, P. Ruiz-de-Clavijo, E. Ostua, Dispositivo Electrónico Calculador de Funciones Trigonométricas. P201831134 (2019).
D. Guerrero, A. Millan, J. Juan, J. Viejo, M. J. Bellido, P. Ruiz-de-Clavijo, E. Ostua, Dispositivo Electrónico Calculador de Funciones Trigonométricas Y Usos Del Mismo. P201831133 (2019).
Acknowledgements
Not applicable.
Funding
This work has been partially supported by the Ministerio de Economía, Industria y Competitividad of Spain under project TIN201789951P (BootTimeIoT) and by the European Regional Development Fund (ERDF).
Author information
Contributions
Conceptualization, D.G.; data curation, D.G. and J.J.; formal analysis, D.G.; funding acquisition, J.J. and P.R.; investigation, D.G.; methodology, D.G.; project administration, J.J.; resources, A.M., M.B., and E.O.; software, D.G. and A.M.; supervision, M.B.; validation, D.G.; visualization, P.R.; writing of the original draft and preparation, D.G. and J.J.; writing, review, and editing, A.M., J.V., and M.B. The authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors developed the inventions covered by patents WO2018104566A1, P201831134, and P201831133.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Guerrero Martos, D., Millán Calderón, A., Juan Chico, J. et al. Using the complement of the cosine to compute trigonometric functions. EURASIP J. Adv. Signal Process. 2020, 35 (2020). https://doi.org/10.1186/s13634020006925
DOI: https://doi.org/10.1186/s13634020006925
Keywords
 Trigonometric functions
 Computational cost
 Signal processing
 Discrete Fourier transform