### 2.1 Algorithm of the 32-point IDCT

The transform computation in HEVC uses a set of IDCT matrices. In general, a 2-D inverse transform can be obtained by performing two 1-D IDCTs through the row-column decomposition method.

$$\begin{array}{*{20}l} \mathbf{y}&=\mathbf{CZC^{T}} \end{array} $$

(1)

$$\begin{array}{*{20}l} \mathbf{y^{T}}&=\mathbf{C\left(CZ\right)^{T}} = \mathbf{Cx^{T}} \end{array} $$

(2)
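The row-column idea behind Eqs. (1) and (2) can be illustrated with a minimal sketch. It assumes the 4-point HEVC core matrix as a small stand-in for the 32×32 matrix **C**, omits the scaling and rounding stages of a real codec, and uses hypothetical helper names (`matmul`, `transpose`, `transform_2d_row_column`) that do not appear in the original design.

```python
# 4-point HEVC core transform matrix, used only as a small stand-in
# for the 32x32 coefficient matrix C of the text (assumption).
C4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]

def matmul(a, b):
    """Plain integer matrix product a @ b for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*m)]

def transform_2d_direct(c, z):
    """Reference: y = C Z C^T evaluated as two full matrix products."""
    return matmul(matmul(c, z), transpose(c))

def transform_2d_row_column(c, z):
    """Row-column form: a 1-D pass, a transpose, and a second 1-D pass."""
    x = matmul(c, z)                # first 1-D pass: x = C Z
    y_t = matmul(c, transpose(x))   # same 1-D pass applied to x^T
    return transpose(y_t)           # undo the transpose to obtain y

Z = [[10, -3, 0, 1], [4, 0, 0, 0], [-2, 1, 0, 0], [0, 0, 0, 5]]
print(transform_2d_row_column(C4, Z) == transform_2d_direct(C4, Z))
```

Both passes call the same 1-D routine, which is what allows a single 1-D core plus a transpose buffer to produce the 2-D result.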

The 32-point 1-D IDCT can be expressed as follows:

$$\begin{array}{*{20}l} \mathbf{x}&={\mathbf{C}\mathbf{Z}} \end{array} $$

(3)

$$\begin{array}{*{20}l} \mathbf{x}&=\left[\begin{array}{cccc} x_{0} & x_{1} & \ldots & x_{31} \end{array} \right]^{T} \end{array} $$

(4)

$$\begin{array}{*{20}l} \mathbf{Z}&=\left[\begin{array}{cccc} Z_{0} & Z_{1} & \ldots & Z_{31} \end{array} \right]^{T} \end{array} $$

(5)

where **C** indicates the 32×32 coefficient matrix.

According to the symmetric property, Eq. (3) can be decomposed into two separate equations:

$$\begin{array}{*{20}l} \mathbf{x_{16u}}&= \mathbf{C_{16e}}\mathbf{Z_{16e}}+\mathbf{C_{16o}}\mathbf{Z_{16o}}=\mathbf{A}+\mathbf{B} \end{array} $$

(6)

$$\begin{array}{*{20}l} \mathbf{x_{16d}}&= \mathbf{C_{16e}}\mathbf{Z_{16e}}-\mathbf{C_{16o}}\mathbf{Z_{16o}}=\mathbf{A}-\mathbf{B} \end{array} $$

(7)

where

$$\begin{array}{*{20}l} \mathbf{x_{16u}}&=\left[\begin{array}{cccc} x_{0} & x_{1} & \ldots & x_{15} \end{array} \right]^{T} \end{array} $$

(8)

$$\begin{array}{*{20}l} \mathbf{x_{16d}}&=\left[\begin{array}{cccc} x_{16} & x_{17} & \ldots & x_{31} \end{array} \right]^{T} \end{array} $$

(9)

$$\begin{array}{*{20}l} \mathbf{Z_{16e}}&=\left[\begin{array}{cccc} Z_{0} & Z_{2} & \ldots & Z_{30} \end{array} \right]^{T} \end{array} $$

(10)

$$\begin{array}{*{20}l} \mathbf{Z_{16o}}&=\left[\begin{array}{cccc} Z_{1} & Z_{3} & \ldots & Z_{31} \end{array} \right]^{T} \end{array} $$

(11)

*C*_{16e} and *C*_{16o} are the 16-point even and odd coefficient matrices, respectively, for the 32-point transform. The coefficients of *C*_{16e} are presented in Eq. (12). The 16-point even-part computation can be further divided into 8-point even and odd computations.

$$ \begin{aligned} \mathbf{C_{16e}} = \left[\begin{array}{cccccccccccccccc} 64& 64 & 64 & 64 &64 &64 &64 &64 &64 &64 &64 &64 &64 &64 &64 &64 \\ 90& 87 & 80 & 70 &57 &43 &25 &9 &-9 &-25&-43&-57&-70&-80&-87& -90 \\ 89& 75 & 50 & 18 &-18 &-50&-75&-89&-89&-75&-50&-18& 18& 50& 75& 89 \\ 87& 57 & 9 &-43 &-80 &-90&-70&-25&25 &70 &90 &80 &43 &-9 &-57& -87 \\ 83& 36 & -36&-83 &-83 &-36& 36&83 &83 &36 &-36&-83&-83&-36& 36& 83 \\ 80& 9 & -70&-87 &-25 &57 &90 &43 &-43&-90&-57& 25& 87& 70&-9 &-80 \\ 75& -18& -89&-50 &50 &89 &18 &-75&-75&18 &89 &50 &-50&-89&-18& 75 \\ 70& -43& -87& 9 &90 &25 &-80&-57& 57&80 &-25&-90&-9 &87 &43 &-70 \\ 64& -64& -64& 64 &64 &-64&-64& 64& 64&-64&-64& 64& 64&-64&-64& 64 \\ 57& -80& -25& 90 &-9 &-87& 43& 70&-70&-43& 87& 9 &-90& 25& 80& -57 \\ 50& -89& 18& 75 &-75 &-18& 89&-50&-50&89 &-18&-75& 75& 18&-89& 50 \\ 43& -90& 57& 25 &-87 & 70& 9 &-80& 80&-9 &-70& 87&-25&-57& 90& -43 \\ 36& -83& 83& -36& -36& 83&-83& 36& 36&-83& 83&-36&-36& 83&-83& 36 \\ 25& -70& 90& -80& 43 &9 &-57& 87&-87& 57&-9 &-43& 80&-90& 70& -25 \\ 18& -50& 75& -89& 89 &-75& 50&-18&-18& 50&-75& 89&-89& 75&-50& 18 \\ 9 & -25 & 43& -57& 70 &-80& 87&-90& 90&-87& 80&-70& 57&-43& 25& -9 \end{array} \right] \end{aligned} $$

(12)
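The symmetry that permits the further split of the 16-point even part can be checked directly on the entries of Eq. (12). The following sketch classifies each row as mirror-symmetric or mirror-antisymmetric; the name `row_symmetry` is hypothetical and the matrix is typed in with the HEVC 16-point values, whose second row ends in −90.

```python
# Entries of the 16x16 matrix of Eq. (12), one list per row.
C16E = [
    [64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64],
    [90, 87, 80, 70, 57, 43, 25, 9, -9, -25, -43, -57, -70, -80, -87, -90],
    [89, 75, 50, 18, -18, -50, -75, -89, -89, -75, -50, -18, 18, 50, 75, 89],
    [87, 57, 9, -43, -80, -90, -70, -25, 25, 70, 90, 80, 43, -9, -57, -87],
    [83, 36, -36, -83, -83, -36, 36, 83, 83, 36, -36, -83, -83, -36, 36, 83],
    [80, 9, -70, -87, -25, 57, 90, 43, -43, -90, -57, 25, 87, 70, -9, -80],
    [75, -18, -89, -50, 50, 89, 18, -75, -75, 18, 89, 50, -50, -89, -18, 75],
    [70, -43, -87, 9, 90, 25, -80, -57, 57, 80, -25, -90, -9, 87, 43, -70],
    [64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64],
    [57, -80, -25, 90, -9, -87, 43, 70, -70, -43, 87, 9, -90, 25, 80, -57],
    [50, -89, 18, 75, -75, -18, 89, -50, -50, 89, -18, -75, 75, 18, -89, 50],
    [43, -90, 57, 25, -87, 70, 9, -80, 80, -9, -70, 87, -25, -57, 90, -43],
    [36, -83, 83, -36, -36, 83, -83, 36, 36, -83, 83, -36, -36, 83, -83, 36],
    [25, -70, 90, -80, 43, 9, -57, 87, -87, 57, -9, -43, 80, -90, 70, -25],
    [18, -50, 75, -89, 89, -75, 50, -18, -18, 50, -75, 89, -89, 75, -50, 18],
    [9, -25, 43, -57, 70, -80, 87, -90, 90, -87, 80, -70, 57, -43, 25, -9],
]

def row_symmetry(row):
    """Classify a row as 'symmetric', 'antisymmetric', or 'none' about its centre."""
    n = len(row)
    if all(row[i] == row[n - 1 - i] for i in range(n // 2)):
        return "symmetric"
    if all(row[i] == -row[n - 1 - i] for i in range(n // 2)):
        return "antisymmetric"
    return "none"

kinds = [row_symmetry(r) for r in C16E]
# Even-indexed rows should mirror, odd-indexed rows should mirror with a sign flip.
print(kinds == ["symmetric", "antisymmetric"] * 8)
```

Because half of each row is determined by the other half, only the left eight columns need to be multiplied out; the remaining outputs follow from one addition or subtraction each.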

$$\begin{array}{*{20}l} \mathbf{x_{16u}}&= \left[\begin{array}{c} \mathbf{x_{8u}} \\ \mathbf{x_{8d}}\end{array} \right] \end{array} $$

(13)

$$\begin{array}{*{20}l} \mathbf{x_{8u}}&= \mathbf{C_{8e}}\mathbf{Z_{8e}}+\mathbf{C_{8o}}\mathbf{Z_{8o}}=\mathbf{a}+\mathbf{b} \end{array} $$

(14)

$$\begin{array}{*{20}l} \mathbf{x_{8d}}&= \mathbf{C_{8e}}\mathbf{Z_{8e}}-\mathbf{C_{8o}}\mathbf{Z_{8o}}=\mathbf{a}-\mathbf{b} \end{array} $$

(15)

and

$$\begin{array}{*{20}l} \mathbf{x_{8u}}&=\left[\begin{array}{cccc} x_{0} & x_{1} & \ldots & x_{7} \end{array} \right]^{T} \end{array} $$

(16)

$$\begin{array}{*{20}l} \mathbf{x_{8d}}&=\left[\begin{array}{cccc} x_{8} & x_{9} & \ldots & x_{15} \end{array} \right]^{T} \end{array} $$

(17)

$$\begin{array}{*{20}l} \mathbf{Z_{8e}}&=\left[\begin{array}{cccc} Z_{0} & Z_{4} & \ldots & Z_{28} \end{array} \right]^{T} \end{array} $$

(18)

$$\begin{array}{*{20}l} \mathbf{Z_{8o}}&=\left[\begin{array}{cccc} Z_{2} & Z_{6} & \ldots & Z_{30} \end{array} \right]^{T} \end{array} $$

(19)

Moreover, the 8-point even-part computation can be divided into the following equations:

$$\begin{array}{*{20}l} \mathbf{x_{8u}}&=\left[\begin{array}{c} \mathbf{x_{4u}} \\ \mathbf{x_{4d}} \end{array} \right] \end{array} $$

(20)

$$\begin{array}{*{20}l} \mathbf{x_{4u}}&= \mathbf{C_{4e}}\mathbf{Z_{4e}}+\mathbf{C_{4o}}\mathbf{Z_{4o}}=\alpha+\beta \end{array} $$

(21)

$$\begin{array}{*{20}l} \mathbf{x_{4d}}&= \mathbf{C_{4e}}\mathbf{Z_{4e}}-\mathbf{C_{4o}}\mathbf{Z_{4o}}=\alpha-\beta \end{array} $$

(22)

and

$$\begin{array}{*{20}l} \mathbf{x_{4u}}&=\left[\begin{array}{cccc} x_{0} & x_{1} & x_{2} & x_{3} \end{array} \right]^{T} \end{array} $$

(23)

$$\begin{array}{*{20}l} \mathbf{x_{4d}}&=\left[\begin{array}{cccc} x_{4} & x_{5} & x_{6} & x_{7} \end{array} \right]^{T} \end{array} $$

(24)

$$\begin{array}{*{20}l} \mathbf{Z_{4e}}&=\left[\begin{array}{cccc} Z_{0} & Z_{8} & Z_{16} & Z_{24} \end{array} \right]^{T} \end{array} $$

(25)

$$\begin{array}{*{20}l} \mathbf{Z_{4o}}&=\left[\begin{array}{cccc} Z_{4} & Z_{12} & Z_{20} & Z_{28} \end{array} \right]^{T} \end{array} $$

(26)

Thus, the entire 32-point IDCT computation can be decomposed into 16-, 8-, and 4-point operations, as displayed in Fig. 1. The 16-point IDCT computation can be decomposed into 8- and 4-point operations (the 4-point even, 4-point odd, BF4, 8-point odd, and BF8 modules); 8-point IDCT can be calculated by using 4-point even, 4-point odd, and BF4 modules. The 4-point IDCT is implemented as a 4-point even module.
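The nested decomposition described above can be sketched recursively. The example below is a scaled-down illustration on the 8-point HEVC matrix rather than the 32-point one, and it assumes the usual convention that the rows of the table are the basis functions, so the inverse transform sums over rows. The function names are hypothetical, and the difference half is written out in mirrored order, which is how the matrix symmetry pairs the outputs in this sketch; the equations above do not spell out that ordering.

```python
# 8-point HEVC core transform matrix (rows are basis functions).
M8 = [
    [64,  64,  64,  64,  64,  64,  64,  64],
    [89,  75,  50,  18, -18, -50, -75, -89],
    [83,  36, -36, -83, -83, -36,  36,  83],
    [75, -18, -89, -50,  50,  89,  18, -75],
    [64, -64, -64,  64,  64, -64, -64,  64],
    [50, -89,  18,  75, -75, -18,  89, -50],
    [36, -83,  83, -36, -36,  83, -83,  36],
    [18, -50,  75, -89,  89, -75,  50, -18],
]

def idct_direct(z):
    """Reference 1-D inverse transform: x[i] = sum_k M8[k][i] * z[k]."""
    return [sum(M8[k][i] * z[k] for k in range(8)) for i in range(8)]

def idct_even_odd(z, stride=1):
    """Recursive even/odd split; `stride` selects the rows of M8 that
    form the matrix of the current (smaller) transform size."""
    n = len(z)
    if n == 1:
        return [M8[0][0] * z[0]]
    half = n // 2
    # Even-indexed coefficients form a half-size inverse transform.
    even = idct_even_odd(z[0::2], stride * 2)
    # Odd-indexed coefficients need a half-size matrix product.
    odd = [sum(M8[k * stride][i] * z[k] for k in range(1, n, 2))
           for i in range(half)]
    # Butterfly: sums give the first half, differences the mirrored second half.
    upper = [even[i] + odd[i] for i in range(half)]
    lower = [even[i] - odd[i] for i in range(half)]
    return upper + lower[::-1]

Z = [100, -20, 7, 0, 3, 0, 0, -1]
print(idct_even_odd(Z) == idct_direct(Z))
```

Each recursion level corresponds loosely to one odd-part matrix product followed by one butterfly stage, which mirrors the roles of the odd-part and BF modules named in the text.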

### 2.2 Proposed architecture

In contrast to the multiple-computation-path IDCT [19], the proposed 2-D IDCT core is composed of one 1-D transform core and one transposed memory (TMEM) to achieve a small-area design. The 1-D IDCT core uses the proposed time-shared data scheme so that the throughput rate remains equal to the operation frequency. The 1-D core supports full HD 1080p, which requires 1080×1920×60=124,416,000 pel/s ≃125 MP/s. The entire architecture is illustrated in Fig. 2.
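The throughput requirement quoted above can be reproduced with a few lines of arithmetic. The sketch counts one sample per pixel, as the figure in the text does, and the parameter `samples_per_cycle=1` is an assumption that reflects a throughput equal to the operation frequency; the function name is hypothetical.

```python
WIDTH, HEIGHT, FPS = 1920, 1080, 60  # full HD 1080p at 60 frames/s

# Samples that must be transformed per second (one sample per pixel).
samples_per_second = WIDTH * HEIGHT * FPS

def min_clock_hz(rate, samples_per_cycle=1):
    """Lowest clock frequency that sustains `rate` samples per second."""
    return rate / samples_per_cycle

print(samples_per_second)                      # 124416000
print(min_clock_hz(samples_per_second) / 1e6)  # clock in MHz at one sample per cycle
```

Under these assumptions a clock of roughly 125 MHz is sufficient for the stated format.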

#### 2.2.1 1-D IDCT core

The 1-D 32-point IDCT core comprises a 4-point even-part process element (PEE4), a 4-point odd-part process element (PEO4), an 8-point odd-part process element (PEO8), a 16-point odd-part process element (PEO16), and three butterfly (BF) modules. The process elements (PEs) are designed with shift-and-add operations so that hardware resources can be shared. The matrix product of the PEO4 computation can be expressed as follows:

$$ \left[\begin{array}{c} \beta_{0}\\ \beta_{1}\\ \beta_{2}\\ \beta_{3} \end{array} \right] = \left[\begin{array}{cccc} 89 & 75 & 50 & 18 \\ 75 & -18 & -89 & -50 \\ 50 & -89 & 18 & 75 \\ 18 & -50 & 75 & -89 \end{array} \right] \left[\begin{array}{c} Z_{0}\\ Z_{1}\\ Z_{2}\\ Z_{3} \end{array} \right] $$

(27)

Four coefficients {89,75,50,18} with different signs are used to multiply the inputs \(\left [\begin {array}{cccc} Z_{0} & Z_{1} & Z_{2} & Z_{3} \end {array} \right ]^{T}\). Thus, the matrix product operation can be simplified using the multiple constant multiplication technique.

$$\begin{array}{*{20}l} 89\cdot Z&=64\cdot Z + 25\cdot Z \end{array} $$

(28)

$$\begin{array}{*{20}l} 75\cdot Z&=(2+1)\cdot25\cdot Z \end{array} $$

(29)

$$\begin{array}{*{20}l} 50\cdot Z&=2\cdot25\cdot Z \end{array} $$

(30)

$$\begin{array}{*{20}l} 18\cdot Z&=2\cdot(9\cdot Z) \end{array} $$

(31)
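The sharing of partial products among the four constants can be sketched in software with shifts and additions only. The intermediate terms 9·Z and 25·Z follow the factorizations above, but the way they are themselves built from shifts here is an assumption and is not taken from the SAU4 circuit; the function name `shared_products` is hypothetical.

```python
def shared_products(z):
    """Return (89*z, 75*z, 50*z, 18*z) using only shifts and additions."""
    z9 = (z << 3) + z      # 9z  = 8z + z        (assumed construction)
    z25 = (z << 4) + z9    # 25z = 16z + 9z      (assumed construction)
    z50 = z25 << 1         # 50z = 2 * 25z
    z75 = z50 + z25        # 75z = 2 * 25z + 25z
    z89 = (z << 6) + z25   # 89z = 64z + 25z
    z18 = z9 << 1          # 18z = 2 * 9z
    return z89, z75, z50, z18

print(shared_products(3))  # (267, 225, 150, 54)
```

In this arrangement four additions produce all four products, whereas four independent constant multipliers would each need their own adder tree.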

The sharing architecture, called the four-operand SAU (SAU4), is displayed on the left side of Fig. 3. SAU4 uses shift-and-add operations instead of multipliers to reduce the area cost. Furthermore, it shares the same hardware resources among the constant multiplications. Then, the sign-and-interconnection circuit routes each product, with the appropriate sign, according to the matrix in Eq. (27). Finally, four accumulators (ACCs) sum the product results over every four clock cycles. Thus, every four clock cycles, the outputs *β*_{0},*β*_{1},*β*_{2}, and *β*_{3} complete the computation in Eq. (27).
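The behaviour described in this paragraph can be modelled cycle by cycle. The sketch below assumes that one input sample arrives per clock cycle, which is one reading of the four-cycle accumulation; the internal adder arrangement of `sau4` and all function names are hypothetical and do not reproduce the circuit of Fig. 3.

```python
# Coefficient matrix of Eq. (27); row i feeds accumulator i.
PEO4 = [
    [89,  75,  50,  18],
    [75, -18, -89, -50],
    [50, -89,  18,  75],
    [18, -50,  75, -89],
]

def sau4(z):
    """Shift-and-add products of z with the four constant magnitudes."""
    z9 = (z << 3) + z
    z25 = (z << 4) + z9
    return {89: (z << 6) + z25, 75: (z25 << 1) + z25, 50: z25 << 1, 18: z9 << 1}

def peo4(z_in):
    """Accumulate four input samples, one per modelled clock cycle."""
    acc = [0, 0, 0, 0]
    for cycle, z in enumerate(z_in):
        products = sau4(z)                 # shared constant multiplications
        for i in range(4):
            coeff = PEO4[i][cycle]
            term = products[abs(coeff)]    # interconnection: pick the magnitude
            acc[i] += term if coeff > 0 else -term   # sign selection
    return acc                             # beta_0 .. beta_3 after four cycles

print(peo4([1, 0, 0, 0]))  # [89, 75, 50, 18]
```

Each cycle consumes one column of the matrix, so the four accumulators hold the complete matrix-vector product after the fourth input.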

#### 2.2.2 Architecture of the 8-, 16-, and 32-point IDCTs

The architecture of the 8-point IDCT, which is called PEE8, is displayed in Fig. 4. PEE8 consists of the PEE4, PEO4, and BF4 modules, which execute the computations in Eqs. (20)–(22). The PEO4 module executes the matrix product *C*_{4o}*Z*_{4o}, as illustrated in Fig. 3. The even-part computation (*C*_{4e}*Z*_{4e}) is likewise implemented with SAU3, sign-and-interconnection circuits, ACCs, and registers (D). The four ACCs and four registers sum the product results over every four clock cycles and send them out in the following four clock cycles. The BF4 module adds and subtracts *C*_{4o}*Z*_{4o} and *C*_{4e}*Z*_{4e} to output *x*_{4u} and *x*_{4d}.

Moreover, the 16-point IDCT consists of the PEO8, PEE8, and BF8 modules. The PEO8 module calculates the odd part of the 16-point transformation (*C*_{8o}*Z*_{8o}), as indicated in Eqs. (14) and (15). The lower half of Fig. 5 illustrates the architecture of the PEO8 module. The SAU8 module shares the hardware resources by using the shift-and-add architecture, and the BF8 module controls the addition and subtraction output.

The BF16 module calculates the final results before transpose and output. Thus, *C*_{16o}*Z*_{16o} and *C*_{16e}*Z*_{16e} in Eqs. (6) and (7) can be calculated using PEO16 and PEE16, respectively. The architecture of PEO16 is displayed in Fig. 6. The mixed SAU16 (SAU16M) module, which uses the shift-and-add architecture, executes the 16-point matrix product *C*_{16o}*Z*_{16o} as well as the 16-point *C*_{16e}*Z*_{16e}, 8-point *C*_{8e}*Z*_{8e}, and 4-point *C*_{4e}*Z*_{4e} by supporting variable transform sizes (32-, 16-, 8-, and 4-point matrix products). Thus, *x*_{16o},*x*_{16u},*x*_{8u}, and *x*_{4u} can be obtained from PEO16 according to the adaptive transform size.

#### 2.2.3 Data flow of the proposed IDCT

The proposed IDCT core has a 1-D core and TMEM. The 1st-D and 2nd-D computations can be executed in the same 1-D core through the proposed data control scheme to save hardware cost. Thus, the proposed IDCT core can achieve a high throughput and low area.

Through the reorder registers and MUX, the 1st-D/2nd-D data is input into the 16-point odd-/even-part PE during the first 16 cycles of the 32-cycle period. The 1st-D/2nd-D data is then input into the 16-point even-/odd-part PE during the following 16 cycles of the 32-cycle period. Thus, the 1st-D and 2nd-D computations can share the same hardware resources during the 32-cycle period.

For the 32-point transform, the PEE4 module executes in the first four clock cycles, the PEO4 module executes in the following four cycles, and the PEE4 module outputs its results to BF4. When the PEO4 module outputs its results to BF4, the BF4 module begins calculating the addition and subtraction as per Eqs. (21) and (22). In the following eight cycles, the PEO8 module calculates the matrix product *C*_{8o}*Z*_{8o} and the BF4 module simultaneously outputs its results. In cycles 16–24, the PEO8 module outputs the computation results to BF8 and BF8 executes addition and subtraction. The BF8 module then outputs the addition results in cycles 16–24 and the subtraction results in cycles 24–32. The PEO16 module executes the matrix product *C*_{16o}*Z*_{16o} when BF8 outputs the addition and subtraction results to BF16. In the 48th cycle, BF16 outputs the first 32-point 1-D transform data and inputs the following 32-point transform data.

After 1008 cycles, the 2nd-D data is output from the TMEM and fed into the PEE4 module. In these 16 cycles, the PEE4, PEO4, BF4, PEO8, and BF8 modules process the 2nd-D data, using the idle time of these circuit resources. In the following 16 cycles, the PEE4, PEO4, BF4, PEO8, and BF8 modules process the 1st-D data and the PEO16 and BF16 modules process the 2nd-D data. The 2-D transform data begins to be output at cycle 1040; thus, the latency of the proposed core is 1040 clock cycles. The core takes 2064 cycles to complete the 32×32 IDCT. According to the proposed data flow (Fig. 7), the proposed circuit maintains a throughput rate equal to the operation frequency.
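The cycle counts quoted in the data flow are mutually consistent, as the following bookkeeping sketch shows. The starting point of 1008 cycles and the two 16-cycle phases are taken from the text; the final step assumes that, once output begins, one 2-D sample leaves the core per clock cycle, which is an interpretation of the stated throughput and not a figure given explicitly. All variable names are hypothetical.

```python
N = 32                      # transform size (32x32 block)

tmem_read_start = 1008      # cycle at which 2nd-D data leaves the TMEM (from the text)
small_pe_phase = 16         # PEE4/PEO4/BF4/PEO8/BF8 work on 2nd-D data
large_pe_phase = 16         # PEO16/BF16 work on 2nd-D data

# First 2-D output sample appears after both 16-cycle phases.
first_output_cycle = tmem_read_start + small_pe_phase + large_pe_phase

# Assumption: one output sample per cycle for all N*N samples of the block.
total_cycles = first_output_cycle + N * N

print(first_output_cycle)   # 1040
print(total_cycles)         # 2064
```

Under that one-sample-per-cycle assumption, the 1024 cycles between the latency and the completion time account for exactly the 32×32 output samples.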