Scaled AAN for Fixed-Point Multiplier-Free IDCT

An efficient algorithm derived from the AAN algorithm (proposed by Arai, Agui, and Nakajima in 1988) for computing the inverse discrete cosine transform (IDCT) is presented. We replace the multiplications in the conventional AAN algorithm with additions and shifts to realize a fixed-point, multiplier-free computation of the IDCT, and we adopt coefficient and compensation matrices to improve the precision of the algorithm. Our 1D IDCT can be implemented with 46 additions and 20 shifts. Because it contains no multiplications, the modified algorithm takes less time than the conventional AAN algorithm. Owing to its higher computational precision, the algorithm has low drift in decoding and fully complies with the IEEE 1180 and ISO/IEC 23002-1 specifications. The implementation of the algorithm for 32-bit hardware is discussed, and implementations for 24-bit and 16-bit hardware, which are more suitable for mobile communication devices, are also introduced.


Introduction
Discrete cosine transforms (DCTs) are widely used in speech coding and image compression. Among the four types of discrete cosine transform (type-I, type-II, type-III, and type-IV), DCT-II and DCT-III are the ones most frequently adopted in codecs. The 1D DCT-II (also known as the forward DCT) and DCT-III (also known as the inverse DCT) are defined as follows:

DCT-II:
$$X(n) = \sqrt{\frac{2}{N}}\, c(n) \sum_{k=0}^{N-1} x(k) \cos\frac{n(2k+1)\pi}{2N} \quad (n = 0, 1, \ldots, N-1), \tag{1}$$

DCT-III:
$$x(k) = \sqrt{\frac{2}{N}} \sum_{n=0}^{N-1} c(n)\, X(n) \cos\frac{n(2k+1)\pi}{2N} \quad (k = 0, 1, \ldots, N-1), \tag{2}$$

where $c(0) = 1/\sqrt{2}$ and $c(n) = 1$ for $n \neq 0$.

Many existing image and video coding standards (such as JPEG, H.261, H.263, MPEG-1, MPEG-2, and MPEG-4 Part 2) require the implementation of an integer-output approximation of the 8 × 8 inverse discrete cosine transform (IDCT) function, defined as follows:

$$x(k, l) = \frac{1}{4} \sum_{n=0}^{7} \sum_{m=0}^{7} c(n)\, c(m)\, X(n, m) \cos\frac{n(2k+1)\pi}{16} \cos\frac{m(2l+1)\pi}{16} \quad (k, l = 0, 1, \ldots, 7), \tag{3}$$

where $c(0) = 1/\sqrt{2}$ and $c(n) = 1$ for $n = 1, \ldots, 7$.

In this paper, we propose an efficient algorithm for implementing (3). The inverse DCT is expected to decode, with low drift, data produced by different encoders. Several classical DCT/IDCT algorithms have been proposed, such as the Lee [1], AAN [2], and LLM [3] algorithms. However, the slightly irregular structures of these classical algorithms require many floating-point multipliers and adders, which are time-consuming to implement in both hardware and software. Therefore, many fast algorithms for the DCT/IDCT have been proposed in recent years [4-12].

In order to decrease the implementation complexity, some researchers developed recursive transform algorithms that take advantage of the local connectivity and simple structures of their circuit realizations, which makes them particularly suitable for very large scale integration (VLSI) implementations [4-8]. However, compared with the other fast algorithms, longer computation time and larger round-off errors limit the use of the recursive algorithms.

To reduce the computational complexity, look-up tables and accumulators can be used instead of multipliers to compute inner products. This method is widely used in many DSP applications, such as the DFT, the DCT, convolutions, and digital filters [9,10]. However, the hardware is likely to run out of memory, especially on mobile devices, because the look-up tables require a large amount of storage.
For low-power implementations of the IDCT on mobile devices with few or no floating-point multipliers, and to meet the requirements of higher precision, lower computational complexity, and smaller storage, several multiplier-free DCTs have been presented. Among them, a multiplier-free approximation of the DCT based on the lifting scheme has been developed [11,12]. The method replaces the butterfly computational structures in the original DCT signal flow graph with lifting structures. The advantage of the lifting structures is that each lifting step is a biorthogonal transform whose inverse has a similar lifting structure, which means that a lifting step is inverted simply by subtracting what was added in the forward transform. Hence, the original signals can still be perfectly reconstructed even if the results of the floating-point multiplications in the lifting steps are rounded to integers, as long as the same procedure is applied in both the forward and inverse transforms. To obtain a multiplier-free algorithm, the floating-point lifting coefficients are approximated by hardware-friendly dyadic values, so that the transform can be implemented with only shift and addition operators. This kind of approximation of the original DCT is called BinDCT.

However, in most cases BinDCT is not the best choice, because the forward and inverse transforms are not always implemented by the same procedure. Moreover, BinDCT introduces more multiplication operators into the signal flow graph, which markedly decreases the computational precision. If we use BinDCT only for the inverse transform and the original DCT for the forward transform, the differences between the original data and the recovered data cannot be neglected. This means that BinDCT does not perform well when recovering data encoded by other forward DCTs.
In this paper, we propose a novel multiplier-free IDCT algorithm derived from the conventional AAN algorithm [2]. The algorithm contains no multiplications and is implemented using only fixed-point integer arithmetic. In order to improve the precision and reduce the computational complexity, we use the scale factors to construct a coefficient matrix and a compensation matrix.
The paper is organized as follows. In Section 2, we present the improvement of the 1D IDCT algorithm, in which the multiplication operators of the conventional algorithm are removed and replaced with addition and shift operators. We then discuss the 8 × 8 2D IDCT and its 32-bit hardware implementation in Section 3. For low-power implementations of the IDCT on mobile devices with few or no floating-point multipliers, we propose 8 × 8 2D IDCTs for 24-bit and 16-bit hardware in Section 4. We then show the performance of the proposed algorithms, including their computational complexity and precision, in Section 5. Finally, we give the conclusions in Section 6.

Implementation of 1D IDCT
In this section, we first give a general method that can be used to reform many existing 1D IDCTs. We then apply this approach to the traditional AAN algorithm. Finally, considering the characteristics of the AAN flow graph, we propose a new and more efficient algorithm.

General Method to Reform Existing 1D IDCTs. The butterfly computational structures found in most existing IDCTs, such as those in [1-3], compute weighted sums of two inputs, where u and v are scale factors and a and b are integer inputs. With the scale factors absorbed into two constant weights $w_1$ and $w_2$, each butterfly output can be written as

$$T = w_1 a + w_2 b. \tag{7}$$

The details of how to replace the multiplications in (7) with additions and shifts are given below. Without loss of generality, assume that $w_1$ and $w_2$ are positive numbers and expand them as binary numbers with t fractional bits:

$$w_1 \approx \sum_{i=0}^{t} m_i\, 2^{-i}, \qquad w_2 \approx \sum_{i=0}^{t} n_i\, 2^{-i}, \qquad m_i, n_i \in \{0, 1\},$$

then

$$T \approx \sum_{i=0}^{t} (a m_i + b n_i)\, 2^{-i}.$$

If the terms $a m_i + b n_i$ ($i = 0, 1, \ldots, t$) are available, T can be obtained with t shifts and t additions. Because $m_i$ and $n_i$ ($i = 0, 1, \ldots, t$) are each equal to either 0 or 1, there are only four possible combinations with a and b: 0, a, b, and a + b. So T can actually be calculated with t − s additions and t − s shifts of a, b, and a + b, where s denotes the number of terms $a m_i + b n_i$ that are equal to 0. Since $w_1$ and $w_2$ are constants, the values of $m_i$ and $n_i$ ($i = 0, 1, \ldots, t$) are known in advance, so an optimal scheme of additions and shifts can be designed to minimize the number of operations.
Most fast IDCT and DCT algorithms convert each multiplication into additions and shifts separately, whereas the proposed method implements linear combinations of multiplications via additions and shifts, which considerably reduces the computational complexity.
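As a rough illustration of this idea, the following Python sketch evaluates a weighted sum $w_1 a + w_2 b$ from the binary digits of the two weights, adding one of the precomputed values a, b, or a + b at each non-zero bit position. The function names, the default of 17 fractional bits, and the sample weights cos(3π/8) and cos(π/8) are assumptions made for the example and are not the optimal schemes of the paper. The sketch accumulates with left shifts on a scaled integer so that the result can be checked exactly, whereas a hardware version would use right shifts.

```python
import math


def binary_digits(w, t):
    """Return digits [d_0, ..., d_t] with w ~= sum(d_i * 2**-i), for 0 <= w < 2."""
    scaled = int(w * (1 << t))  # truncate w to t fractional bits
    return [(scaled >> (t - i)) & 1 for i in range(t + 1)]


def shift_add_combination(a, b, w1, w2, t=17):
    """Approximate T = w1*a + w2*b using only shifts and additions.

    Returns (T, terms), where terms is the number of non-zero digit pairs,
    i.e. the number of shifted values that actually have to be added.
    """
    m = binary_digits(w1, t)
    n = binary_digits(w2, t)
    # The only values that can appear at any bit position.
    candidates = {(0, 0): 0, (1, 0): a, (0, 1): b, (1, 1): a + b}
    acc = 0    # holds T * 2**t exactly
    terms = 0
    for i in range(t + 1):
        pair = (m[i], n[i])
        if pair == (0, 0):
            continue  # nothing to add at this bit position
        acc += candidates[pair] << (t - i)
        terms += 1
    return acc >> t, terms


# Hypothetical sample weights and inputs.
w1, w2 = math.cos(3 * math.pi / 8), math.cos(math.pi / 8)
T, terms = shift_add_combination(100, 37, w1, w2)
print(T, terms)
```

Because a + b is formed once and reused, bit positions where both weights have a 1 cost a single addition instead of two, which is where the saving over per-multiplication schemes comes from.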

Reformation of AAN Algorithm Based on the General Method. The 1D IDCT flow graph of the AAN fast algorithm [2] for n = 8 is shown in Figure 1, where the symbol ci denotes cos(iπ/16) and $A_i$ (i = 0, ..., 7) are the scale coefficients. We reform this flow graph with the general method described above. The new flow graph is shown in Figure 2.

In Figure 2, the butterfly computational structures are replaced with the formulas of T1 and T2, each of which is a weighted sum of Input a and Input b with constant factors $w_1$ and $w_2$, as in (7). With the optimal scheme of additions and shifts, 8 additions and 8 shifts are used to implement the computation of T1.

In the codes, x, $x_1$, $x_2$, and $x_3$ are all variables, and the symbol ">>" denotes the right-shift operator. The binary numbers B1 and B2 following each code line are the coefficients of Input a and Input b, respectively, so the result of each code line can be expressed with B1, B2, a, and b. We now explain this step in detail.

The factors $w_1$ and $w_2$ can be expressed as binary numbers. Considering the precision and complexity of the computation, we choose $2^{17}$ as the denominator. When Input a is right-shifted by r bits, the value is equal to $2^{-r} \cdot a$. So the code "x1 = a − (a >> 3);" can be written mathematically as

$$x_1 = a - 2^{-3} a = \left(1 - (0.001)_2\right) a = (0.111)_2\, a.$$

So the binary number B1 following this code line, the coefficient of Input a, is equal to $(0.111)_2$.
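The bookkeeping described here can be checked with a few lines of Python. This is only a sketch: the helper `to_binary` and the sample input are hypothetical, and the coefficient is tracked as an exact fraction alongside the integer code line from the text.

```python
from fractions import Fraction


def to_binary(frac, bits):
    """Format a non-negative Fraction as a binary string with `bits` fractional digits (truncated)."""
    scaled = int(frac * (1 << bits))
    whole, rest = divmod(scaled, 1 << bits)
    return f"{whole:b}." + format(rest, f"0{bits}b")


a = 1 << 10                                   # sample input, a multiple of 8 so the shift is exact
x1 = a - (a >> 3)                             # the code line "x1 = a - (a >> 3);"
coeff_x1 = Fraction(1) - Fraction(1, 1 << 3)  # coefficient of a: 1 - 2**-3

print(to_binary(coeff_x1, 3), x1)
```

For inputs that are not multiples of 8, the integer right shift discards the low bits, so the code line then differs from the exact product by less than one unit.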

Revision Based on the Characteristics of AAN Algorithm. In Figure 2, Input a and Input b are each computed twice. To remove this redundancy, another algorithm is presented in Figure 3.

The implementation of Figure 3 uses the intermediate products $t_a = g(6) \times \cos(3\pi/8)$, $t_b = g(6) \times \cos(\pi/8)$, $t_c = g(7) \times \cos(3\pi/8)$, and $t_d = g(7) \times \cos(\pi/8)$. We compute h(6) and h(7) with a method similar to the one introduced in Section 2.2 and again choose $2^{17}$ as the denominator. In order to reduce the computational complexity further, one of the expressions in this step is replaced by a simpler one. The computational complexity of the improved method is tabulated in Table 2.
For the 1D IDCT, the total number of additions and shifts is 46 and 20, respectively.

Implementation of 2D IDCT
In order to implement the 8 × 8 2D IDCT defined in (3), we decompose it into a cascade of 1D IDCTs applied to each row and each column of the 8 × 8 IDCT coefficient matrix. The 1D IDCT algorithm has been discussed in Section 2. In this section, we focus on the scale coefficients $A_i$ (i = 0, ..., 7) and the matrices derived from them, which strongly affect the computational precision of the algorithm.
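The row-column decomposition can be illustrated with a floating-point reference in Python. This sketch is not the fixed-point algorithm of the paper: it evaluates the 1D IDCT directly from its cosine sum, assumes the common normalization in which the 8-point 1D transform carries a factor 1/2 and weights c(0) = 1/√2, c(n) = 1 otherwise, and uses hypothetical function names. Such a reference is mainly useful for checking a fast implementation.

```python
import math

N = 8


def c(n):
    """Normalization weight of the DC term (assumed convention)."""
    return 1 / math.sqrt(2) if n == 0 else 1.0


def idct_1d(X):
    """Direct 8-point IDCT of one row or column."""
    return [
        0.5 * sum(c(n) * X[n] * math.cos(n * (2 * k + 1) * math.pi / 16) for n in range(N))
        for k in range(N)
    ]


def idct_2d(block):
    """Row-column 2D IDCT: transform every row, then every column."""
    rows = [idct_1d(row) for row in block]
    cols = [idct_1d([rows[i][j] for i in range(N)]) for j in range(N)]
    # cols[j][i] holds output sample (i, j); transpose back to row-major order.
    return [[cols[j][i] for j in range(N)] for i in range(N)]


def idct_2d_direct(block):
    """Direct evaluation of the separable double sum, for comparison."""
    return [
        [
            0.25 * sum(
                c(n) * c(m) * block[n][m]
                * math.cos(n * (2 * k + 1) * math.pi / 16)
                * math.cos(m * (2 * l + 1) * math.pi / 16)
                for n in range(N) for m in range(N)
            )
            for l in range(N)
        ]
        for k in range(N)
    ]


# A block with only a DC term of 8 decodes to a constant block of ones.
dc_block = [[0] * N for _ in range(N)]
dc_block[0][0] = 8
flat = idct_2d(dc_block)
```

The row-column form needs 16 one-dimensional transforms per block, which is why the cost of the 1D algorithm dominates the cost of the 2D transform.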

Choice of Coefficient Matrices and Compensation Matrices. To ensure the precision of the transform, we use a coefficient matrix and a compensation matrix to scale the input X(n, m) (n, m = 0, ..., 7) in the preprocessing stage. Both matrices are derived from the scale coefficients $A_i$; the coefficient matrix carries $p_1$ bits of fixed-point precision and the compensation matrix a further $p_2$ bits.

In order to improve the computational precision, we want $p_1$ to be as large as possible. However, the value of $p_1$ is limited by the bit width of the registers. So we use the compensation matrix coef1[i][j] to improve the precision of the computations. Since coef1[i][j] stores $p_2$ bits of additional information, its introduction improves the precision of the computations by up to $p_2$ bits.

The matrix block[i][j] (i, j = 0, 1, ..., 7) is defined as the 8 × 8 matrix of input coefficients. In the preprocessing (28), each element of block[i][j] is scaled by the coefficient matrix and corrected with the compensation matrix coef1[i][j].

For proper rounding at the final stage of the transform, we would add $2^{p_1 - 1}$ to all output data and right-shift them by $p_1$ bits. Observing the flow graph of the 1D IDCT algorithm in Figure 3, we find that if we add $2^t$ to X(0), then $2^t$ is added to every output x(k, l) (k, l = 0, ..., 7). So we only need to add a bias of $2^{p_1 - 1}$ to the DC term X(0, 0) and then simply right-shift all outputs x(k, l) (k, l = 0, ..., 7) by $p_1$ bits to round the 2D IDCT.

Implementation for 32-Bit Hardware. For 8-bit DCT coefficients, the corresponding IDCT coefficients are at most 11-bit data. Taking into account the growth caused by the additions in the flow graph, for the 32-bit hardware implementation we let $p_1 = 18$ and $p_2 = 3$, which determines the coefficient matrix and the compensation matrix. After preprocessing, we process the 64 coefficients according to the flow graph shown in Figure 3 in row-column order. Finally, we right-shift the outputs by $p_1 = 18$ bits to return them to the original scale.

The multiplications by the coefficient matrix and the compensation matrix in (28) can also be implemented with shifts and additions. The compensation matrix coef1[i][j] contains both positive and negative elements. This design reduces the absolute values of the elements and thereby decreases the computational complexity. Owing to limited space, the optimal schemes of additions and shifts for all elements of the matrices are omitted.
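One plausible way to build such a pair of matrices is sketched below in Python. Several things here are assumptions rather than the paper's definitions: the scale coefficients are taken to be the usual AAN IDCT prescale factors $A_0 = 1/(2\sqrt{2})$ and $A_i = \cos(i\pi/16)/2$, the coefficient matrix is taken to be the product $A_i A_j 2^{p_1}$ rounded to the nearest integer, the compensation matrix is taken to be the rounding residual scaled by $2^{p_2}$, and all function names are hypothetical. Only the values $p_1 = 18$ and $p_2 = 3$ and the rounding bias come from the text.

```python
import math

P1, P2 = 18, 3  # values given in the text for 32-bit hardware

# Assumed AAN prescale factors (not stated explicitly in the text).
A = [1 / (2 * math.sqrt(2))] + [math.cos(i * math.pi / 16) / 2 for i in range(1, 8)]


def build_matrices(p1=P1, p2=P2):
    """Return (coef, coef1): rounded scale products and their scaled rounding residuals."""
    coef = [[0] * 8 for _ in range(8)]
    coef1 = [[0] * 8 for _ in range(8)]
    for i in range(8):
        for j in range(8):
            exact = A[i] * A[j] * (1 << p1)
            coef[i][j] = round(exact)
            # The residual lies in [-0.5, 0.5], so coef1 can be positive or negative.
            coef1[i][j] = round((exact - coef[i][j]) * (1 << p2))
    return coef, coef1


def preprocess(block, coef, coef1, p2=P2):
    """Scale each input by coef and add the compensation term shifted back by p2 bits."""
    return [
        [block[i][j] * coef[i][j] + ((block[i][j] * coef1[i][j]) >> p2) for j in range(8)]
        for i in range(8)
    ]


def round_shift(value, p1=P1):
    """Add the bias 2**(p1 - 1), then right-shift by p1 bits (round half up)."""
    return (value + (1 << (p1 - 1))) >> p1


coef, coef1 = build_matrices()
```

Under these assumptions the pair (coef, coef1) represents each scale product to within $2^{-(p_2 + 1)}$ of a unit at scale $2^{p_1}$, instead of within half a unit for coef alone, which matches the claim that the compensation matrix contributes about $p_2$ extra bits.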

Implementations of 2D IDCT for 24-Bit and 16-Bit Hardware
In Section 3, we implemented the 2D IDCT for 32-bit hardware. However, this implementation cannot be applied in some cases. On mobile devices, for example, the bit width is not sufficient for 32-bit computation. So in this section we present implementations of the IDCT for 24-bit and 16-bit hardware.
4.1. Implementation for 24-Bit Hardware. To implement the above algorithm in a 24-bit frame with the same idea, we let $p_1 = 11$ and $p_2 = 5$, which determines the coefficient matrix and the compensation matrix. After preprocessing, we process the 64 coefficients according to the flow graph shown in Figure 3 in row-column order. Finally, we right-shift the outputs by $p_1 = 11$ bits to return them to the original scale. The computations of the 1D IDCTs for 24-bit hardware and 32-bit hardware are nearly the same; the only difference is the bit width.

In order to complete the calculations of the IDCT according to Figure 3 within a 16-bit width, we use a combination of two 16-bit words as a unit. We denote two 16-bit buffers as buf0 and buf1, which store the high 16 bits and the low 8 bits of the original 24-bit datum, respectively. Formally, let the original 24-bit datum be x, and let the data stored in buf0 and buf1 be $x_0$ and $x_1$, respectively. Then

$$x = 2^{8} x_0 + x_1.$$

In order to have a unique representation of each datum, we define a standard storage format that $x_0$ and $x_1$ must satisfy. This format is illustrated in Figure 4.

In Figure 4, S, $S_0$, and $S_1$ are all sign bits, and $S_0 = S$ and $S_1 = 0$ when data are stored in the standard storage format.
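The two-buffer representation can be sketched in Python as follows. The sketch assumes that the standard format means $x = 2^8 x_0 + x_1$ with $0 \le x_1 < 2^8$, so that the sign is carried entirely by $x_0$ (consistent with $S_0 = S$ and $S_1 = 0$); the function names are hypothetical, and Python integers do not enforce the 16-bit width of the buffers.

```python
def split24(x):
    """Split a signed 24-bit integer into (x0, x1) with x = x0 * 256 + x1 and 0 <= x1 < 256."""
    assert -(1 << 23) <= x < (1 << 23)
    x0 = x >> 8    # arithmetic shift keeps the sign in the high word
    x1 = x & 0xFF  # low 8 bits, always non-negative
    return x0, x1


def join24(x0, x1):
    """Rebuild the 24-bit value from its two buffers."""
    return (x0 << 8) + x1


def normalize(x0, x1):
    """Return an equivalent pair in standard format by moving carry or borrow into x0."""
    carry = x1 >> 8  # floor division by 256, also correct for negative x1
    return x0 + carry, x1 & 0xFF


def add24(p, q):
    """Add two split values word by word, then restore the standard format."""
    return normalize(p[0] + q[0], p[1] + q[1])


def sub24(p, q):
    """Subtract two split values word by word, then restore the standard format."""
    return normalize(p[0] - q[0], p[1] - q[1])


print(add24(split24(1000), split24(-300)))
```

Keeping the low word non-negative means that an addition or subtraction of low words can overflow 8 bits by at most one carry or borrow, which `normalize` folds back into the high word.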
Due to the limit of the 16-bit width, we cannot implement the preprocessing according to (28). Here we also use a combination of two 16-bit words as a unit, which contains 30 data bits and 1 sign bit, together with the method of additions and shifts to carry out the calculations in the preprocessing. The data structure is shown in Figure 5.

Because there are 30 data bits, we do not need the compensation matrix. Instead we use a new coefficient matrix coef16[i][j], defined with the scale exponent $p = p_1 + p_2 = 11 + 5 = 16$. After preprocessing, we transform the data into the standard format mentioned above and complete the computation of the IDCT.

As discussed above, the algorithm for 16-bit hardware is in theory the same as that for 24-bit hardware; only the implementations differ. In other words, we complete the 24-bit computations within a 16-bit width, so that we obtain the same precision.

... process. In such cases, scaling can be executed without taking any extra resources. So we do not take these computational complexities into account in Table 5.

Conclusions
In this paper, we propose a new general method for computing the fast IDCT, which can be applied to most existing IDCT algorithms with butterfly computational structures. We also introduce a specific algorithm derived from the AAN algorithm. Considering the characteristics of the AAN algorithm, coefficient matrices and compensation matrices are introduced into the algorithm to improve its precision. By varying the scale exponents $p_1$ and $p_2$, we can adjust the precision to meet the requirements of different hardware, such as 32-bit and 24-bit hardware. However, in order to implement our algorithm on 16-bit hardware with satisfactory precision, we have to revise the design, which then includes a new coefficient matrix, a new data structure, and some new methods of data manipulation.

The new IDCT algorithm achieves a good compromise between precision and computational complexity. The idea of optimizing the IDCT algorithm can also be extended to other similar fast algorithms.
EURASIP Journal on Advances in Signal Processing

Figure 2: The revised flow graph of AAN, n = 8.

Figure 3: The revised flow graph of AAN based on its characteristics, n = 8.

Figure 4: The standard storage data structure.

Figure 5: The storage data structure for preprocessing.
In these codes, x, $x_1$, $x_2$, and $x_3$ are all variables, a is the input, and the output is approximately equal to $(\sqrt{2}/2) \cdot a$. 4 additions and 4 shifts are used to implement each multiplication by $\sqrt{2}/2$ in the 1D IDCT computation. The computational complexity of the 1D IDCT is tabulated in Table 1.

Table 1: The statistics of computational complexity.

Table 2: The statistics of computational complexity.
Then the preprocessing can be expressed as follows with the new coefficient matrix coef16[i][j]:

$$\mathrm{block}[i][j] = \mathrm{block}[i][j] \times \mathrm{coef16}[i][j], \qquad \mathrm{block}[i][j] = \mathrm{block}[i][j] \gg p_2.$$
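A minimal Python sketch of this 16-bit preprocessing is given below. It rests on assumptions that the text does not confirm: coef16 is taken to be the product of the usual AAN IDCT prescale factors ($A_0 = 1/(2\sqrt{2})$, $A_i = \cos(i\pi/16)/2$) scaled by $2^{p}$ and rounded, and the function names are hypothetical. Only $p_1 = 11$, $p_2 = 5$, $p = 16$, and the multiply-then-shift structure come from the text.

```python
import math

P1, P2 = 11, 5
P = P1 + P2  # 16, as stated in the text

# Assumed AAN prescale factors (not stated explicitly in the text).
A = [1 / (2 * math.sqrt(2))] + [math.cos(i * math.pi / 16) / 2 for i in range(1, 8)]


def build_coef16():
    """Rounded scale products at scale 2**16 (assumed definition of coef16)."""
    return [[round(A[i] * A[j] * (1 << P)) for j in range(8)] for i in range(8)]


def preprocess16(block, coef16):
    """Multiply by coef16, then right-shift by p2 so the result is at scale 2**p1."""
    return [[(block[i][j] * coef16[i][j]) >> P2 for j in range(8)] for i in range(8)]


coef16 = build_coef16()
dc = [[0] * 8 for _ in range(8)]
dc[0][0] = 8
scaled = preprocess16(dc, coef16)
print(scaled[0][0])
```

Under these assumptions every entry of coef16 fits in a signed 16-bit word, and the product with an 11-bit input stays within the 30 data bits of the two-word unit before the shift brings it back to the 24-bit scale.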