The discrete cosine transform (DCT)[1] has become one of the basic tools in signal and image processing, mainly because of its good energy compaction properties. The Karhunen-Loeve transform (KLT) is considered statistically optimal for energy concentration[2, 3], but it is data dependent and requires more computation than the DCT. The DCT is suboptimal, yet its lower computational cost makes it the preferred substitute for the KLT. Indeed, the DCT has found applications in many image and video compression standards such as JPEG[4], MPEG-1[5], MPEG-2[6], H.261[7], H.263[8], and H.264/AVC[9, 10]. During the JPEG process, an image is divided into 8 × 8 blocks, and the two-dimensional discrete cosine transform (2-D DCT) is applied to encode each block. The two-dimensional DCT of order N × N is defined as

T_{\text{DCT}}(u,v)=\alpha(u)\,\alpha(v)\sum_{i=0}^{N-1}\sum_{j=0}^{N-1}X(i,j)\cos\left[\frac{\pi(2i+1)u}{2N}\right]\cos\left[\frac{\pi(2j+1)v}{2N}\right],\quad \text{for } 0\le i,j,u,v\le N-1

(1)

where

\alpha(k)=\begin{cases}\sqrt{\frac{1}{N}} & \text{for } k=0\\ \sqrt{\frac{2}{N}} & \text{otherwise}\end{cases}\qquad k\in\{u,v\}
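As an illustration only, Eq. (1) can be evaluated directly with four nested loops. The sketch below is not the implementation used in this paper; the function names `alpha` and `dct2` and the constant test block are assumptions introduced here. A constant block is a convenient check, because all of its energy should land in the coefficient at (0, 0).

```python
import math

def alpha(k, n):
    """Normalization factor from the definition above: sqrt(1/N) for k = 0, sqrt(2/N) otherwise."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct2(block):
    """2-D DCT of an N x N block, evaluated term by term from Eq. (1)."""
    n = len(block)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            acc = 0.0
            for i in range(n):
                ci = math.cos(math.pi * (2 * i + 1) * u / (2 * n))
                for j in range(n):
                    cj = math.cos(math.pi * (2 * j + 1) * v / (2 * n))
                    acc += block[i][j] * ci * cj
            out[u][v] = alpha(u, n) * alpha(v, n) * acc
    return out

# Hypothetical test input: a constant 8 x 8 block with every sample equal to 100.
flat = [[100.0] * 8 for _ in range(8)]
coeffs = dct2(flat)
```

This direct form costs on the order of N^4 multiplications per block, which is the kind of arithmetic load that fast and approximate DCT algorithms aim to avoid.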

In general, the floating-point DCT decorrelates the data being transformed so that most of its energy is packed into the low-frequency region. This property suits well-known image compression techniques[11–15], but the floating-point DCT does not meet the requirements of very fast real-time compression applications. For this reason, there has been considerable interest in fixed-point, multiplication-free DCT algorithms[16–32] that can be implemented as low-power, area-efficient digital circuits and are therefore useful for mobile imaging devices.
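The energy compaction property can be sketched numerically. The example below assumes a smooth, hypothetical 8 × 8 ramp block (not data from this paper), computes the floating-point 2-D DCT in the separable matrix form T = C X Cᵀ, and measures how much of the total energy falls in the 2 × 2 low-frequency corner. The helper names are illustrative.

```python
import math

N = 8

def dct_matrix(n):
    """n x n orthonormal DCT matrix; row u holds alpha(u) * cos(pi * (2i + 1) * u / (2n))."""
    rows = []
    for u in range(n):
        a = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        rows.append([a * math.cos(math.pi * (2 * i + 1) * u / (2 * n)) for i in range(n)])
    return rows

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
            for r in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

C = dct_matrix(N)

# Hypothetical smooth block: a gentle intensity ramp in both directions.
block = [[100.0 + 5.0 * i + 3.0 * j for j in range(N)] for i in range(N)]

# Separable form of the 2-D DCT: transform the columns, then the rows.
T = matmul(matmul(C, block), transpose(C))

total_energy = sum(t * t for row in T for t in row)
low_energy = sum(T[u][v] ** 2 for u in range(2) for v in range(2))
low_fraction = low_energy / total_energy
print(f"fraction of energy in the 2 x 2 low-frequency corner: {low_fraction:.5f}")
```

For a block this smooth, nearly all of the energy sits in four of the 64 coefficients. Natural image blocks are less regular, so the fraction observed in practice depends on the content.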

In this scenario, a large number of DCT approximations have recently been proposed. These approximate algorithms provide a meaningful estimate of the 8-point DCT at low complexity. Cham[16] proposed the integer cosine transforms (ICT) using the principle of dyadic symmetry. The performance of the ICT is very close to that of the DCT. Haweel[17] proposed a signed DCT (SDCT) by applying the signum function to the DCT matrix; it maintains the good decorrelation and power compaction properties of the DCT but requires 24 additions and is not orthogonal. Lengwehasatit and Ortega[18] suggested two 8 × 8 transform matrices, one for the coarsest and another for the finest approximation. Using these two matrices, a trade-off between speedup and accuracy in various bit ranges can be achieved. The coding performance shows a 73% reduction in complexity with only 0.2 dB degradation in peak signal-to-noise ratio (PSNR). Tran[13] proposed a family of 8 × 8 biorthogonal transforms called binDCT, which approximate the popular 8 × 8 DCT. The binDCT requires 31 additions and 14 shift operations, has a coding gain ranging from 8.77 to 8.82 dB, approximates the exact DCT closely, and is suitable for VLSI implementation.

Bouguezel et al. proposed a series of DCT approximation techniques[19–23] that trade off computational complexity against image compression performance. Cintra and Bayer[24] proposed an approximate DCT based on the round-off function, which requires 22 additions and produces fewer blocking artifacts. Bouguezel et al.[23] proposed a low complexity parametric transform for image compression, which requires 18 additions and 2 multiplications. This computational complexity can be reduced by varying the parameter *a*. Usually, the parameter *a* is selected as a small integer in order to minimize the computational complexity. In Bouguezel et al.[23], the suggested values are *a* ∊ {0, 1/2, 1}. For *a* = 1/2, the two multiplications become just bit-shift operations. If *a* = 1, no shift operation is necessary and the transform requires only 18 additions. For *a* = 0, the complexity reduces to 16 additions. Brahimi and Bouguezel[25] proposed an efficient fast integer DCT that is also claimed to require only 16 additions, but it is not orthogonal. Senapati et al.[26] proposed a low complexity orthogonal 8 × 8 transform matrix for fast image compression, which requires 14 additions and two shift operations. Bayer and Cintra[27] further reduced this computational complexity to 14 additions, with better image compression performance than the classic SDCT[17] and the Bouguezel et al.[23] transforms. Cintra et al.[28] proposed a very low complexity DCT approximation obtained via pruning, which is claimed to require only 10 additions. However, the performance results reported in[28] are not reproduced here, since the present work concentrates on non-pruned techniques.

On the other hand, integrating encoding or decoding hardware for multiple standards into a single chip increases the area and power consumption. Numerous low-power, high-speed, and area-efficient hardware architectures have been proposed for DCT computation[32–35].
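To make the signum construction described for the SDCT[17] concrete, the sketch below applies an element-wise sign function to the 8 × 8 floating-point DCT matrix and checks whether the rows of the result are mutually orthogonal. It follows only the description given above; any scaling factor used in[17] is omitted, and the helper names are assumptions.

```python
import math

N = 8

def dct_matrix(n):
    """n x n orthonormal DCT matrix; row u holds alpha(u) * cos(pi * (2i + 1) * u / (2n))."""
    rows = []
    for u in range(n):
        a = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        rows.append([a * math.cos(math.pi * (2 * i + 1) * u / (2 * n)) for i in range(n)])
    return rows

def signum(x):
    """Return +1, 0, or -1 according to the sign of x."""
    return (x > 0) - (x < 0)

C = dct_matrix(N)

# Element-wise signum of the DCT matrix: every entry becomes +1 or -1,
# so applying S to a vector needs only additions and subtractions.
S = [[signum(c) for c in row] for row in C]

# Gram matrix S * S^T: a transform with orthogonal rows would give a diagonal matrix.
gram = [[sum(S[r][k] * S[c][k] for k in range(N)) for c in range(N)] for r in range(N)]
off_diag = [(r, c, gram[r][c]) for r in range(N) for c in range(N)
            if r != c and gram[r][c] != 0]
is_orthogonal = not off_diag

print("rows orthogonal:", is_orthogonal)
print("nonzero off-diagonal entries (row, col, value):", off_diag[:4])
```

The nonzero off-diagonal inner products are consistent with the remark above that the SDCT is not orthogonal, which is one motivation for the orthogonal approximations cited later in this section.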

In general, DCT approximations with low computational complexity and low bit rates are preferred. In this paper, a low complexity multiplier-less DCT approximation is proposed; the absence of multipliers is essential for hardware realization. The derived fast algorithm requires only 12 additions, fewer than any existing DCT approximation[17–27, 29–31]. To examine the performance and trade-offs associated with the algorithm, we have coded the proposed as well as the existing algorithms[17, 19, 21–24, 26, 27] in MATLAB and Verilog HDL and synthesized them for the Xilinx Virtex 7 XC7V585T-2LFFG1761C device (Xilinx, Inc., San Jose, CA, USA)[36] and with the Cadence® RTL Compiler®[37] using the UMC 90 nm standard cell library.

The rest of the paper is structured as follows. Section 2 presents the proposed transform and compares the factors influencing its performance and computational complexity with those of the existing methods. Section 3 details and analyzes an image compression simulation and the hardware implementation of the proposed and existing DCT approximations. Conclusions and final remarks are given in Section 4.