 Research Article
 Open Access
High-Quality Time Stretch and Pitch Shift Effects for Speech and Audio Using the Instantaneous Harmonic Analysis
 Elias Azarov^{1},
 Alexander Petrovsky (EURASIP Member)^{1, 2} and
 Marek Parfieniuk (EURASIP Member)^{2}
https://doi.org/10.1155/2010/712749
© Elias Azarov et al. 2010
 Received: 6 May 2010
 Accepted: 10 November 2010
 Published: 30 November 2010
Abstract
The paper presents methods for instantaneous harmonic analysis with application to high-quality pitch, timbre, and time-scale modifications. The analysis technique is based on narrow-band filtering using special analysis filters with frequency-modulated impulse responses. The main advantages of the technique are highly accurate estimation of harmonic parameters and adequate harmonic/noise separation, which allow audio and speech effects to be implemented with a low level of audible artifacts. Time stretch and pitch shift effects are considered as the primary application in the paper.
Keywords
 Instantaneous Frequency
 Harmonic Component
 Analysis Frame
 Periodic Part
 Pitch Contour
1. Introduction
Parametric representation of audio and speech signals has become an integral part of modern effect technologies. The choice of an appropriate parametric model largely determines the overall quality of the implemented effects. The present paper describes an approach to parametric signal processing based on deterministic/stochastic decomposition. The signal is considered as a sum of periodic (harmonic) and residual (noise) parts. The periodic part can be efficiently described as a sum of sinusoids with slowly varying amplitudes and frequencies, and the residual part is assumed to be an irregular noise signal. This representation was introduced in [1] and has since been studied in depth and significantly enhanced. The model provides good parameterization of both voiced and unvoiced frames and allows different modification techniques to be used for them. It ensures effective and simple processing in the frequency domain; however, the crucial point is the accuracy of harmonic analysis. The harmonic part of the signal is specified by sets of harmonic parameters (amplitude, frequency, and phase) for every instant of time. A number of methods have been proposed to estimate these parameters. The majority of analysis methods assume local stationarity of amplitude and frequency parameters within the analysis frame [2, 3]. This makes the analysis procedure easier but, on the other hand, degrades the accuracy of parameter estimation and periodic/residual separation.
(i) simplified closed-form expressions for instantaneous parameters estimation;
(ii) pitch detection and smooth pitch contour estimation;
(iii) improved harmonic parameters estimation accuracy.
The analyzed signal is separated into periodic and residual parts, which are then processed by the modification techniques. The processed signal can then be easily synthesized in the time domain at the output of the system. The deterministic/stochastic representation significantly simplifies the processing stage. As shown in the experimental section, the combination of the proposed analysis, processing, and synthesis techniques provides good quality of signal analysis, modification, and reconstruction.
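The deterministic/stochastic model described in this section can be written compactly as follows. This is a generic statement of the sinusoids-plus-residual model; the symbols $s(n)$, $A_k(n)$, $f_k(n)$, $\varphi_k(n)$, $r(n)$, $K$, and $F_s$ are notation assumed here for illustration and may differ from the paper's own.

```latex
% Harmonic (periodic) part plus residual (noise) part
s(n) = \sum_{k=1}^{K} A_k(n)\,\cos\varphi_k(n) + r(n),
\qquad
% Phase of the k-th component accumulated from its instantaneous frequency (Hz)
\varphi_k(n) = \varphi_k(0) + \frac{2\pi}{F_s}\sum_{i=1}^{n} f_k(i)
```

Here $A_k(n)$ and $f_k(n)$ are the slowly varying amplitude and frequency of the $k$-th sinusoid, $F_s$ is the sampling frequency, and $r(n)$ is the residual.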
2. Time-Frequency Representations and Harmonic Analysis
where N is the length of the frame. The transformation gives a spectral representation of the signal by sinusoidal components of multiple frequencies. The balance between frequency and time resolution is defined by the length of the analysis frame N. Because of the local stationarity assumption, the DFT can hardly provide an accurate estimate of frequency-modulated components, which has given rise to approaches such as the harmonic transform [10] and the fan-chirp transform [11]. The general idea of these approaches is to use the Fourier transform of the time-warped signal.
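As a reminder of the time/frequency trade-off mentioned above, the standard DFT of one analysis frame and its bin spacing can be written as follows. The symbols $s(n)$, $X(k)$, $N$, $F_s$, and $T$ are assumed here for illustration and are not necessarily the paper's own notation.

```latex
% Standard DFT of an N-sample frame
X(k) = \sum_{n=0}^{N-1} s(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1
% Frame duration and spacing between DFT bins
T = \frac{N}{F_s}, \qquad \Delta f = \frac{F_s}{N} = \frac{1}{T}
```

For example, a frame of $T = 48$ ms gives $\Delta f = 1/0.048 \approx 20.8$ Hz, while halving the frame doubles the bin spacing. A longer frame therefore resolves frequency more finely but makes the local stationarity assumption harder to satisfy.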
The Hilbert transform and DESA can be applied only to monocomponent signals, because for multicomponent signals the notion of a single-valued instantaneous frequency and amplitude becomes meaningless. Therefore, the signal should be split into single components before these techniques are used. Narrow-band filtering can be used for this purpose [6]. However, in the case of frequency-modulated components, this is not always possible because of their wide frequency band.
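The limitation to monocomponent signals can be illustrated with a small sketch. It computes the analytic signal with a direct DFT (a textbook construction, not the paper's method) and takes the instantaneous frequency from the phase increment between samples. The sampling rate, frame length, tone frequencies, and all function names are assumptions made for this illustration.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal of a real sequence via a direct DFT (O(N^2), even N)."""
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Keep DC and Nyquist, double positive frequencies, zero negative ones.
    for k in range(n):
        if 0 < k < n // 2:
            spectrum[k] *= 2.0
        elif k > n // 2:
            spectrum[k] = 0.0
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def instantaneous_frequency(x, fs):
    """Instantaneous frequency (Hz) from the phase increment of the analytic signal."""
    z = analytic_signal(x)
    return [
        cmath.phase(z[t] * z[t - 1].conjugate()) * fs / (2.0 * math.pi)
        for t in range(1, len(z))
    ]


FS = 8000.0  # assumed sampling rate, Hz
N = 256      # assumed frame length; both tones complete whole cycles in it

# One sinusoid (monocomponent) and a sum of two sinusoids (multicomponent).
mono = [math.cos(2 * math.pi * 500 * t / FS) for t in range(N)]
multi = [mono[t] + 0.8 * math.cos(2 * math.pi * 1500 * t / FS) for t in range(N)]

if_mono = instantaneous_frequency(mono, FS)
if_multi = instantaneous_frequency(multi, FS)

print("mono  min/max:", min(if_mono), max(if_mono))
print("multi min/max:", min(if_multi), max(if_multi))
```

For the single tone the estimate stays at the tone frequency, whereas for the two-tone signal it swings from sample to sample and does not describe either component, which is why the components have to be separated first.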
3. Instantaneous Harmonic Analysis
3.1. Instantaneous Harmonic Analysis of Nonstationary Harmonic Components
The center frequency contour is adjusted within the analysis frame, providing narrow-band filtering of frequency-modulated components.
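The idea of letting the center frequency follow a contour within the frame can be sketched as follows. This is a simplified illustration, not the paper's closed-form expressions: the frame is demodulated by the phase of an assumed center frequency contour and averaged under a Hamming window, which acts as a narrow low-pass filter around that contour. The function names, the linear test chirp, and the sampling rate are all assumptions.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def contour_phase(freqs, fs, ref):
    """Phase (rad) of a per-sample frequency contour (Hz), zero at index ref."""
    step = [2.0 * math.pi * f / fs for f in freqs]
    phase = [0.0] * len(freqs)
    for i in range(ref + 1, len(freqs)):
        phase[i] = phase[i - 1] + step[i]
    for i in range(ref - 1, -1, -1):
        phase[i] = phase[i + 1] - step[i + 1]
    return phase


def estimate_component(frame, fc_contour, fs):
    """Amplitude and phase at the frame center of the component near fc_contour."""
    n = len(frame)
    w = hamming(n)
    phase = contour_phase(fc_contour, fs, n // 2)
    # Demodulate along the contour, then average under the window.
    acc = sum(w[i] * frame[i] * cmath.exp(-1j * phase[i]) for i in range(n))
    z = 2.0 * acc / sum(w)
    return abs(z), cmath.phase(z)


FS = 8000.0  # assumed sampling rate, Hz
N = 384      # 48 ms at the assumed rate
A = 0.7      # true amplitude of the test component

# Test component: frequency rises linearly from 1000 to 1480 Hz in the frame.
contour = [1000.0 + 480.0 * i / (N - 1) for i in range(N)]
signal_phase = contour_phase(contour, FS, 0)
frame = [A * math.cos(p) for p in signal_phase]

amp_fm, phase_fm = estimate_component(frame, contour, FS)
amp_fixed, _ = estimate_component(frame, [contour[N // 2]] * N, FS)

print("contour-following estimate:", amp_fm)
print("fixed-frequency estimate:  ", amp_fixed)
```

With the contour matched to the component, the demodulated signal is nearly constant and the amplitude is recovered closely; with a fixed center frequency the same component is smeared over the band and its amplitude is underestimated.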
3.2. Filter Properties
3.3. Estimation Technique
In this subsection the general technique of sinusoidal parameters estimation is presented. The technique does not assume harmonic structure of the signal and therefore can be applied both to speech and audio signals [13].
The analysis was carried out using the following settings: an analysis frame length of 48 ms, an analysis step of 14 ms, filter bandwidths of 70 Hz, and the Hamming window as the windowing function. The synthesized periodic part is shown in Figure 6(b). As can be seen from the spectrogram, the periodic part contains only long sinusoidal components with high energy localization. The transients are left untouched in the residual signal, which is presented in Figure 6(c).
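To make the frame settings concrete, the sketch below converts the 48 ms frame length and 14 ms step into sample counts and lists the frame positions for one second of signal. The sampling rate of 16 kHz is an assumption chosen for illustration (the text does not state one here), and the helper names are hypothetical.

```python
import math


def ms_to_samples(duration_ms, fs):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms * fs / 1000.0))


def frame_starts(num_samples, frame_len, hop):
    """Start indices of all complete frames in a signal of num_samples."""
    return list(range(0, num_samples - frame_len + 1, hop))


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


FS = 16000                        # assumed sampling rate, Hz
FRAME_LEN = ms_to_samples(48, FS)  # analysis frame length from the text
HOP = ms_to_samples(14, FS)        # analysis step from the text

starts = frame_starts(FS, FRAME_LEN, HOP)  # frames covering one second
window = hamming(FRAME_LEN)
overlap = 1.0 - HOP / FRAME_LEN            # fraction shared by adjacent frames

print(FRAME_LEN, HOP, len(starts), round(overlap, 3))
```

With these settings adjacent frames share roughly 70% of their samples, so each instant of the signal is covered by several analysis frames.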
3.4. Speech Analysis
(i) initial fundamental frequency contour estimation;
(ii) harmonic parameters estimation with fundamental frequency adjustment.
In voiced speech analysis, the problem of initial fundamental frequency estimation comes down to finding a periodic component with the lowest possible frequency and sufficiently high energy. Within the possible fundamental frequency range (in this paper, it is defined as Hz) all periodic components are extracted, and then the suitable one is taken as the fundamental. In order to reduce computational complexity, the source signal is filtered by a low-pass filter before the estimation.
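The selection rule described above (lowest frequency with sufficiently high energy) can be sketched as below. This is only an illustration of the rule as stated: the function name, the relative energy threshold, and the search range used in the example are assumptions, since the paper's actual range and energy criterion are not given in this text.

```python
def pick_fundamental(components, f_min, f_max, rel_energy=0.1):
    """Pick the fundamental among extracted periodic components.

    components: list of (frequency_hz, energy) pairs.
    Returns the lowest in-range frequency whose energy is at least
    rel_energy times the strongest in-range component, or None.
    """
    in_range = [(f, e) for f, e in components if f_min <= f <= f_max]
    if not in_range:
        return None
    strongest = max(e for _, e in in_range)
    strong = [f for f, e in in_range if e >= rel_energy * strongest]
    return min(strong)


# Hypothetical components extracted from one low-pass filtered frame.
candidates = [(50.0, 5.0), (70.0, 0.01), (110.0, 1.0), (220.0, 0.8), (330.0, 0.5)]

# The 60-400 Hz range is an assumed example, not the paper's value.
f0 = pick_fundamental(candidates, f_min=60.0, f_max=400.0)
print(f0)
```

In this example the 50 Hz component lies outside the assumed range and the 70 Hz component is too weak, so the 110 Hz component is selected.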
Results of synthetic speech analysis.

|  | Harmonic transform method |  | Instantaneous harmonic analysis |  |
| --- | --- | --- | --- | --- |
| Signal 1: , random constant harmonic amplitudes |  |  |  |  |
|  | 41.5 | 41.5 | 50.4 | 50.4 |
| 40 | 38.5 | 41.4 | 41.2 | 44.7 |
| 20 | 20.8 | 29.2 | 21.9 | 26.2 |
| 10 | 10.7 | 19.5 | 11.9 | 16.4 |
| 0 | 1.2 | 9.2 | 2.9 | 6.0 |
| Signal 2: fundamental frequency changes from 150 to 220 Hz at a rate of 0.1 Hz/ms, constant harmonic amplitudes that model sound [a] |  |  |  |  |
|  | 41.5 | 41.5 | 48.3 | 48.3 |
| 40 | 38.2 | 40.7 | 41.0 | 44.3 |
| 20 | 21.0 | 29.5 | 22.1 | 26.4 |
| 10 | 11.0 | 20.3 | 12 | 17.1 |
| 0 | 1.3 | 9.3 | 2.7 | 6.5 |
| Signal 3: fundamental frequency changes from 150 to 220 Hz at a rate of 0.1 Hz/ms, variable harmonic amplitudes that model a sequence of vowels |  |  |  |  |
|  | 19.6 | 19.7 | 34.0 | 34.0 |
| 40 | 17.3 | 17.5 | 31.2 | 31.8 |
| 20 | 17.7 | 21.3 | 20.1 | 25.5 |
| 10 | 8.7 | 15.6 | 10.3 | 15.1 |
| 0 | −0.8 | 7.55 | 0.94 | 5.2 |
| Signal 4: fundamental frequency changes from 150 to 220 Hz at a rate of 0.1 Hz/ms, variable harmonic amplitudes that model a sequence of vowels, harmonic frequencies deviate from integer multiples of the fundamental frequency by 10 Hz |  |  |  |  |
|  | 13.2 | 14.0 | 26.9 | 27.0 |
| 40 | 10.6 | 11.9 | 24.8 | 25.3 |
| 20 | 11.9 | 13.6 | 19.3 | 22.7 |
| 10 | 6.9 | 12.1 | 9.6 | 14 |
| 0 | −1.6 | 6.1 | 0.5 | 4.2 |
4. Effects Implementation
Since the periodic part of the signal is expressed by harmonic parameters, it is easy to synthesize the periodic part while slowing down or speeding up the tempo. The amplitude and frequency contours are interpolated at the respective moments of time, and then the output signal is synthesized. The noise part is parameterized by spectral envelopes and then time-scaled as described in [15]. Separate periodic/noise processing provides high-quality time-scale modifications with a low level of audible artifacts.
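The time-scaling of the periodic part can be sketched as follows: the amplitude and frequency contours are read at stretched time positions, and the phase is re-accumulated from the interpolated frequency so that the pitch stays the same. This is a minimal sketch under several assumptions: per-sample contours, linear interpolation, zero initial phases, and no processing of the noise part. The function names are hypothetical.

```python
import math


def interp(contour, pos):
    """Linear interpolation of a per-sample contour at fractional index pos."""
    i = min(int(pos), len(contour) - 2)
    frac = pos - i
    return (1.0 - frac) * contour[i] + frac * contour[i + 1]


def time_stretch_harmonics(amps, freqs, fs, factor):
    """Resynthesize harmonics on a stretched time axis.

    amps[k][n], freqs[k][n]: amplitude and frequency (Hz) of harmonic k
    at input sample n. factor > 1 slows the tempo down, factor < 1 speeds it up.
    """
    n_in = len(amps[0])
    n_out = int(round(n_in * factor))
    out = [0.0] * n_out
    for a, f in zip(amps, freqs):
        phase = 0.0  # assumed initial phase
        for m in range(n_out):
            pos = min(m / factor, n_in - 1)  # position on the input time axis
            if m > 0:
                phase += 2.0 * math.pi * interp(f, pos) / fs
            out[m] += interp(a, pos) * math.cos(phase)
    return out


FS = 8000.0  # assumed sampling rate, Hz
N_IN = 400

# One harmonic with constant amplitude 0.5 and constant frequency 200 Hz.
amps = [[0.5] * N_IN]
freqs = [[200.0] * N_IN]

stretched = time_stretch_harmonics(amps, freqs, FS, factor=2.0)
print(len(stretched))
```

Because the phase is integrated from the interpolated frequency rather than copied from the input, the output is twice as long while its period remains 40 samples (200 Hz at the assumed rate).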
5. Experimental Results
In this section an example of vocal processing is shown. The processing system in question is aimed at pitch shifting in order to assist a singer.
The source signal was analyzed using the proposed harmonic analysis, and then the pitch shifting technique was applied as described above.
6. Conclusions
The stochastic/deterministic model can be applied to voice processing systems. It provides efficient signal parameterization in a way that is convenient for implementing voice effects such as pitch shifting and timbre and time-scale modifications. The practical application of the proposed harmonic analysis technique has shown encouraging results. The described approach might be a promising solution to harmonic parameter estimation in speech and audio processing systems [13].
Declarations
Acknowledgment
This work was supported by the Polish Ministry of Science and Higher Education (MNiSzW) in years 2009–2011 (Grant no. N N516 388836).
Authors’ Affiliations
References
1. Quatieri TF, McAulay RJ: Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing 1986, 34(6):1449-1464. 10.1109/TASSP.1986.1164985
2. Spanias AS: Speech coding: a tutorial review. Proceedings of the IEEE 1994, 82(10):1541-1582. 10.1109/5.326413
3. Serra X: Musical sound modeling with sinusoids plus noise. In Musical Signal Processing. Edited by: Roads C, Pope S, Picialli A, De Poli G. Swets & Zeitlinger; 1997:91-122.
4. Boashash B: Estimating and interpreting the instantaneous frequency of a signal. Proceedings of the IEEE 1992, 80(4):520-568. 10.1109/5.135376
5. Maragos P, Kaiser JF, Quatieri TF: Energy separation in signal modulations with application to speech analysis. IEEE Transactions on Signal Processing 1993, 41(10):3024-3051. 10.1109/78.277799
6. Abe T, Kobayashi T, Imai S: Harmonics tracking and pitch extraction based on instantaneous frequency. Proceedings of the 20th International Conference on Acoustics, Speech, and Signal Processing, May 1995 756-759.
7. Abe T, Honda M: Sinusoidal model based on instantaneous frequency attractors. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(4):1292-1300.
8. Azarov E, Petrovsky A, Parfieniuk M: Estimation of the instantaneous harmonic parameters of speech. Proceedings of the 16th European Signal Processing Conference (EUSIPCO '08), 2008, Lausanne, Switzerland
9. Azarov I, Petrovsky A: Harmonic analysis of speech. Speech Technology 2008, (1):67-77.
10. Zhang F, Bi G, Chen YQ: Harmonic transform. IEE Proceedings: Vision, Image and Signal Processing 2004, 151(4):257-263. 10.1049/ip-vis:20040604
11. Weruaga L, Képesi M: The fan-chirp transform for nonstationary harmonic signals. Signal Processing 2007, 87(6):1504-1522. 10.1016/j.sigpro.2007.01.006
12. Gabor D: Theory of communication. Proceedings of the IEE 1946, 93(3):429-457.
13. Petrovsky A, Azarov E, Petrovsky AA: Harmonic representation and auditory model-based parametric matching and its application in speech/audio analysis. Proceedings of the 126th AES Convention, 2009, Munich, Germany 13. Preprint 7705
14. Azarov E, Petrovsky A: Instantaneous harmonic analysis for vocal processing. Proceedings of the 12th International Conference on Digital Audio Effects (DAFx '09), September 2009, Como, Italy
15. Levine S, Smith J: A sines+transients+noise audio representation for data compression and time/pitch scale modifications. Proceedings of the 105th AES Convention, September 1998, San Francisco, Calif, USA Preprint 4781
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.