- Research Article
- Open Access
Harmonic Enhancement in Low Bitrate Audio Coding Using an Efficient Long-Term Predictor
© Jeongook Song et al. 2010
- Received: 8 February 2010
- Accepted: 29 July 2010
- Published: 15 August 2010
This paper proposes an audio coding scheme that uses an efficient long-term prediction method to enhance the perceptual quality of audio codecs for speech input signals at low bitrates. MPEG-4 AAC-LTP exploited a similar concept, but its improvement was not significant because of the small prediction gain caused by long prediction lags and by the aliased components that result from a transform based on time-domain aliasing cancelation (TDAC). The proposed algorithm increases the prediction gain by employing a deharmonizing predictor and a long-term compensation filter. The look-back memory elements are first constructed by applying the deharmonizing predictor to the input signal; the prediction residual is then encoded and decoded by transform audio coding. Finally, the long-term compensation filter is applied to the updated look-back memory of the decoded prediction residual to obtain the synthesized signal. Experimental results show that the proposed algorithm has much lower spectral distortion and higher perceptual quality than conventional approaches, especially for harmonic signals such as voiced speech.
- Residual Signal
- Speech Sample
- Spectral Distance
- Prediction Coefficient
- Modified Discrete Cosine Transform
The main objective of speech and audio coding algorithms is to represent an input signal with as few bits as possible while maintaining high perceptual quality; however, their fundamental design concepts are somewhat different. The reason lies in the characteristics of the input signals to be encoded and in the application areas of each codec. For example, speech coding, which exploits the voice production mechanism, is used for bidirectional communications, while audio coding, which exploits the hearing mechanism, is generally used for one-way broadcasting services. Because of these different design concepts, a method designed for one type of input signal does not work well for the other.
As communication and broadcasting networks merge, the demand for a unified speech/audio codec is rapidly increasing. As a first step toward this unification, MPEG standardized MPEG-4 audio, which combines a large set of codecs covering different signal characteristics and operating bitrates. The 3GPP also standardized the adaptive multirate wideband plus (AMR-WB+) codec, which combines ACELP technology with a transform-based coding (TCX) scheme. Recently, MPEG initiated a new standard to provide a unified coding tool for speech and audio signals. In response to the Call for Proposals (CfP) on unified speech and audio coding (USAC), eight candidate systems were submitted, and a reference model was selected through a competitive evaluation process [5, 6]. The reference model has a combined architecture containing two separate coding branches: one is a modification of advanced audio coding (AAC), and the other is derived from traditional linear prediction-based coding, especially AMR-WB+ [6, 7].
To design a unified speech and audio codec, it is important to fully understand the characteristics of the input signal as well as the types of distortion introduced by the codec, that is, the effect of encoding speech signals with an audio codec and vice versa. It is well known that transform-based codecs cannot efficiently represent speech input signals, especially at low bitrates [8, 9]. Among several explanations for the distortion of coded speech in transform-based codecs, the smearing effect caused by loose tracking of pitch variation is said to be one of the most significant [10, 11]. In other words, a relatively long transform analysis window leads to roughness, because the pitch often varies within the transform duration, and thus perceptual bit allocation is not guaranteed to preserve the harmonic components in the frequency domain. It should also be noted that each peak and valley arising from the pitch harmonics may be coded independently in the transform domain, which is less efficient than the long-term prediction used in many speech coders.
The AAC-LTP introduced a long-term prediction concept to the transform coder with the intention of removing harmonic redundancy, where the prediction was designed to reduce interframe redundancy. However, the quality improvement was marginal because of an inherent structural limitation in the encoding step, that is, a modified discrete cosine transform (MDCT) with time-domain aliasing cancelation (TDAC) [14, 15]. Since the MDCT in AAC has a long frame size and needs an additional one-frame delay to reconstruct the aliasing-free time-domain signal, the lag of the predictor must be very long. Therefore, the prediction gain becomes low because the predictor operates on less correlated signals. Consequently, the method is not well suited to speech input signals having pitch harmonics; rather, it may be appropriate for coding very tone-like stationary musical solo signals such as pitch pipe and violin.
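To illustrate why a long lag lowers the prediction gain, the following minimal sketch measures the gain of a first-order long-term predictor on a synthetic harmonic tone whose pitch glides slowly. The test signal, the sampling rate, the two lag ranges, and all function names are illustrative assumptions; none of them are taken from AAC-LTP or from the experiments in this paper.

```python
import math

def make_gliding_harmonic(n_samples, fs=16000.0, f_start=100.0, f_end=120.0, n_harm=5):
    """Synthetic voiced-like tone: harmonics of a pitch gliding from f_start to f_end (assumed values)."""
    x, phase = [], 0.0
    for n in range(n_samples):
        f0 = f_start + (f_end - f_start) * n / n_samples
        phase += 2.0 * math.pi * f0 / fs
        x.append(sum(math.sin(k * phase) / k for k in range(1, n_harm + 1)))
    return x

def ltp_gain_db(x, lag, start, length):
    """Gain (dB) of predicting x[n] from g * x[n - lag] with the least-squares g on one segment."""
    cur = x[start:start + length]
    past = x[start - lag:start - lag + length]
    den = sum(p * p for p in past)
    g = sum(c * p for c, p in zip(cur, past)) / den if den > 0.0 else 0.0
    energy = sum(c * c for c in cur)
    err = sum((c - g * p) ** 2 for c, p in zip(cur, past))
    return float("inf") if err == 0.0 else 10.0 * math.log10(energy / err)

def best_gain_db(x, lags, start, length):
    """Best prediction gain over a set of candidate lags."""
    return max(ltp_gain_db(x, lag, start, length) for lag in lags)

signal = make_gliding_harmonic(4096)
# Lags around one pitch period (short) versus lags beyond two assumed 1024-sample frames (long).
short_lag_gain = best_gain_db(signal, range(100, 200), start=3000, length=512)
long_lag_gain = best_gain_db(signal, range(2048, 2148), start=3000, length=512)
```

In this toy setting the short-lag predictor sees a nearly identical previous pitch cycle, whereas the long-lag predictor looks back across many cycles over which the pitch has drifted, so its gain is expected to be much smaller.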
This paper proposes a new long-term prediction structure that can be integrated into transform-based audio coding algorithms. The harmonic components of the input signal are first reduced by a deharmonizing long-term predictor, and the prediction residual is then encoded and decoded by a transform coder. Finally, the effect of the deharmonizing predictor is compensated by a long-term synthesis filter that minimizes the overall quantization error between the input and the synthesized signal. Since the look-back memory of the compensation filter is updated with the decoded signal of the previous frame, it provides higher prediction gain, which results in much lower perceptual distortion. The performance of the proposed algorithm is verified by implementing it in the Enhanced aacPlus (EAAC) codec released by 3GPP. Simulation results obtained from objective and subjective tests confirm the superiority of the proposed algorithm, especially for speech and concatenated signals.
The AAC-LTP was designed to enhance the harmonic components of the input signal using a long-term predictor.
Table: The prediction gain of AAC-LTP. Columns: operated frame rate (%), prediction gain (dB); row: 1 frame delay.
3.1. Deharmonization Predictor and Harmonic Compensation Filter
3.2. The Decoder of the Proposed Algorithm
3.3. Flexible Frame Length Algorithm
The last term in (8) causes artifacts over all frequencies. If the original signal is band-limited, the residual signal is also band-limited. However, as the difference in prediction gain between consecutive frames becomes larger, the artifacts become more severe.
The proposed algorithm is integrated into the EAAC codec released by 3GPP, which was chosen for its high encoding efficiency and good sound quality compared with other AAC versions and which does not have a long-term prediction module. The deharmonization predictor and the compensation filter are both first-order filters, which are more stable and require only a small number of additional bits.
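The interplay of a first-order deharmonization predictor and a first-order compensation filter can be sketched as follows. This is a generic long-term analysis/synthesis pair and not the authors' exact formulation: the function names, the lag, the gain, and the uniform quantizer that stands in for the transform codec are all assumptions, and the same gain is reused in both filters, whereas the paper derives the compensation filter to minimize the overall quantization error.

```python
import math

def deharmonize(x, lag, gain):
    """Analysis filter: r[n] = x[n] - gain * x[n - lag], with zero history before n = 0."""
    return [x[n] - (gain * x[n - lag] if n >= lag else 0.0) for n in range(len(x))]

def compensate(r_hat, lag, gain):
    """Synthesis filter: y[n] = r_hat[n] + gain * y[n - lag], fed by its own decoded look-back memory."""
    y = []
    for n, r in enumerate(r_hat):
        y.append(r + (gain * y[n - lag] if n >= lag else 0.0))
    return y

def coarse_quantize(r, step):
    """Hypothetical stand-in for the transform codec: uniform quantization of the residual."""
    return [step * round(v / step) for v in r]

def energy(v):
    return sum(s * s for s in v)

# Assumed toy input: a sinusoid whose period equals the predictor lag.
period, gain = 40, 0.8
x = [math.sin(2.0 * math.pi * n / period) for n in range(400)]

residual = deharmonize(x, period, gain)            # harmonic energy is reduced
lossless = compensate(residual, period, gain)      # without quantization, x is recovered
decoded = compensate(coarse_quantize(residual, 0.01), period, gain)
```

With a gain magnitude below one the synthesis recursion is stable, and a quantization error bounded by half a step in the residual is amplified by at most 1 / (1 - gain) in the synthesized signal.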
4.1. Additional Bit Allocation
Table: Additional bit allocation (total per frame).
4.2. Perceptual Entropy in T/F Encoder
The modified perceptual entropy (PE) is computed from the power spectral density of the residual signals and the masking threshold density of the original signals, the latter being computed in each scale factor band. The modified PE and the masking threshold are utilized for the quantization and encoding process.
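As a rough illustration of how a PE-style bit demand depends on the residual power spectral density and the masking threshold, the sketch below uses a simplified rate approximation of half a bit per spectral line for every doubling of the signal-to-threshold ratio. It is neither Johnston's exact expression nor the EAAC implementation; the function name, the band layout, and all numbers are assumptions chosen for readability.

```python
import math

def perceptual_entropy_bits(band_psd, masking_threshold, band_widths):
    """Simplified PE estimate: width * 0.5 * log2(psd / threshold) per band, zero for masked bands."""
    total = 0.0
    for p, t, w in zip(band_psd, masking_threshold, band_widths):
        if p > t > 0.0:
            total += w * 0.5 * math.log2(p / t)
    return total

# Hypothetical per-band values; the threshold is derived from the original signal in both cases.
threshold = [1.0, 1.0, 1.0, 1.0]
widths = [4, 4, 8, 8]
original_psd = [8.0, 16.0, 2.0, 0.5]
residual_psd = [2.0, 4.0, 1.0, 0.5]   # assumed effect of deharmonization

pe_original = perceptual_entropy_bits(original_psd, threshold, widths)
pe_residual = perceptual_entropy_bits(residual_psd, threshold, widths)
```

Under these assumptions, lowering the power of the signal that enters the transform coder while keeping the masking threshold of the original signal reduces the estimated bit demand.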
5.1. Experimental Setup
Experimental conditions: test materials from MPEG USAC; bitrates of 12, 16, and 20 kbps; long window only.
In the encoding block diagram of the proposed algorithm depicted in Figure 3, the T/F encoder covers a frequency bandwidth of up to 3.328 kHz. To remove quality variation caused by block switching, the signal is processed in the long window mode only. This restriction is realistic in practice because short-window processing is hardly used in low bitrate codecs owing to the limited bit budget. Test signals were selected from the database used for testing the quality of reference speech and audio codecs during the initial stage of the MPEG USAC standardization activity. To analyze the quality impact of the proposed algorithm separately, the input data set was partitioned into four clusters: speech, music, mixed, and concatenated.
5.2. Objective Quality Analysis
The spectral distance is computed between the original spectrum and the synthesized spectrum.
Table: Average spectral distance of speech signal. Column: average spectral distance (dB); row: EAAC without LTP.
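A spectral distance between an original and a synthesized spectrum can be sketched as below. The sketch assumes the common root-mean-square log-spectral distance in dB, averaged over frames, which may differ from the definition used by the authors; the function names and the small power floor are likewise assumptions.

```python
import math

def log_spectral_distance_db(orig_power, synth_power, floor=1e-12):
    """RMS difference (dB) between two power spectra of equal length."""
    squared = [
        (10.0 * math.log10(max(p, floor) / max(q, floor))) ** 2
        for p, q in zip(orig_power, synth_power)
    ]
    return math.sqrt(sum(squared) / len(squared))

def average_spectral_distance_db(orig_frames, synth_frames):
    """Mean of the per-frame log-spectral distances."""
    distances = [log_spectral_distance_db(p, q) for p, q in zip(orig_frames, synth_frames)]
    return sum(distances) / len(distances)
```

A value of 0 dB means that the two spectra coincide in every bin; a uniform factor of four in power corresponds to about 6.02 dB.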
5.3. Subjective Quality Analysis
We performed the MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) test to evaluate subjective quality at 12, 16, and 20 kbps. Eleven trained listeners participated, using headphones (Sennheiser HD600). Results are reported as mean values with 95% confidence intervals of the test scores.
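For reference, a mean score and its 95% confidence half-width for a panel of eleven listeners can be computed as in the sketch below. The scores are hypothetical placeholders, not data from this study, and the function name is an assumption; the critical value 2.228 is the two-sided 95% Student t quantile for 10 degrees of freedom.

```python
import math
import statistics

def mean_with_ci(scores, t_crit):
    """Return (mean, half-width) of the confidence interval based on the sample standard deviation."""
    mean = statistics.mean(scores)
    half_width = t_crit * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width

T_CRIT_DF10 = 2.228  # two-sided 95% t quantile for 11 listeners (10 degrees of freedom)

# Hypothetical MUSHRA scores (0-100 scale) from eleven listeners for one condition.
hypothetical_scores = [62, 70, 58, 75, 66, 71, 60, 68, 73, 64, 69]
mean_score, half_width = mean_with_ci(hypothetical_scores, T_CRIT_DF10)
```

Two conditions are usually regarded as significantly different when their intervals, mean plus or minus half-width, do not overlap.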
Since audio codecs are designed to allocate bits based on a psychoacoustic model in the transform domain, they do not efficiently compress speech-like components. A new long-term prediction module that combines the deharmonization predictor and the harmonic compensation filter has been proposed. As in state-of-the-art speech codecs, the analysis frame is divided into subframes to obtain pitch information. Both subjective listening tests and objective tests confirmed the superiority of the proposed algorithm over the conventional audio codec, EAAC.
- van Schijndel NH, Bensa J, Christensen MG, Colomes C, Edler B, Heusdens R, Jensen J, Jensen SH, Kleijn WB, Kot V, Kövesi B, Lindblom J, Massaloux D, Niamut OA, Nordén F, Plasberg JH, Vafin R, Van De Par S, Virette D, Wübbolt O: Adaptive RD optimized hybrid sound coding. Journal of the Audio Engineering Society 2008, 56(10):787-809.
- Brandenburg K, Bosi M: Overview of MPEG audio: current and future standards for low-bit-rate audio coding. Journal of the Audio Engineering Society 1997, 45(1-2):4-21.
- Brandenburg K, Kunz O, Sugiyama A: MPEG-4 natural audio coding. Signal Processing: Image Communication 2000, 15(4):423-444. 10.1016/S0923-5965(99)00056-9
- Mäkinen J, Bessette B, Bruhn S, Ojala P, Salami R, Taleb A: AMR-WB+: a new audio coding standard for 3rd generation mobile audio services. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), March 2005 2: II1109-II1112.
- ISO/IEC JTC1/SC29/WG11: Report on Unified Speech and Audio Coding Call for Proposals. N10047, 2008.
- Neuendorf M, et al.: A novel scheme for low bitrate unified speech and audio coding - MPEG RM0. Proceedings of the 126th AES Convention, May 2009, Munich, Germany.
- ISO/IEC JTC1/SC29/WG11: WD2 of USAC. MPEG2009/N10418, 2009.
- Noll P: Wideband speech and audio coding. IEEE Communications Magazine 1993, 31(11):34-44. 10.1109/35.256878
- Yang M: Low bit rate speech coding. IEEE Potentials 2004, 23(4):32-36. 10.1109/MP.2004.1343228
- Edler B, Disch S, Stefan B, Fuchs G, Geiger R: A time-warped MDCT approach to speech transform coding. Proceedings of the 126th AES Convention, May 2009, Munich, Germany.
- Tan RKC, Lin AHJ: A time-scale modification algorithm based on the subband time-domain technique for broad-band signal applications. Journal of the Audio Engineering Society 2000, 48(5):437-449.
- Kondoz AM: Digital Speech, Coding for Low Bit Rate Communication Systems. John Wiley & Sons, New York, NY, USA; 1995.
- Ojanpera J, Vaananen M, Yin L: Long term predictor for transform domain perceptual audio coding. Proceedings of the 107th AES Convention, September 1999, New York, NY, USA.
- Bosi M, Goldberg RE: Introduction to Digital Audio Coding and Standards. Kluwer Academic Publishers, Dordrecht, The Netherlands; 2003.
- Princen JP, Bradley AB: Analysis/synthesis filter bank design based on time domain aliasing cancellation. IEEE Transactions on Acoustics, Speech, and Signal Processing 1986, 34(5):1153-1161. 10.1109/TASSP.1986.1164954
- 3GPP Technical Specification TS26.403: Enhanced aacPlus general audio codec. http://www.3gpp.org/
- Liu C-M, Lee W-C: Unified fast algorithm for cosine modulated filter banks in current audio coding standards. Journal of the Audio Engineering Society 1999, 47(12):1061-1075.
- Birgmeier M, Bernhard H, Kubin G: Nonlinear long-term prediction of speech signals. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), April 1997 1283-1286.
- de Cheveigné A: YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America 2002, 111(4):1917-1930. 10.1121/1.1458024
- Johnston JD: Estimation of perceptual entropy using noise masking criteria. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '88) 2524-2527.
- Painter T, Spanias A: Perceptual coding of digital audio. Proceedings of the IEEE 2000, 88(4):451-512. 10.1109/5.842996
- ISO/IEC JTC1/SC29/WG11: Workplan for Exploration of Speech and Audio Coding. MPEG2007/N9096, 2007.
- Recommendation ITU-R BS.1534-1: Method for the subjective assessment of intermediate quality level of coding systems. 2001–2003.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.