High-Quality Time Stretch and Pitch Shift Effects for Speech and Audio Using the Instantaneous Harmonic Analysis
© Elias Azarov et al. 2010
Received: 6 May 2010
Accepted: 10 November 2010
Published: 30 November 2010
The paper presents methods for instantaneous harmonic analysis with application to high-quality pitch, timbre, and time-scale modifications. The analysis technique is based on narrow-band filtering using special analysis filters with frequency-modulated impulse responses. The main advantages of the technique are accurate estimation of harmonic parameters and adequate harmonic/noise separation, which allow audio and speech effects to be implemented with a low level of audible artifacts. Time stretch and pitch shift effects are considered as the primary applications in the paper.
Parametric representation of audio and speech signals has become an integral part of modern effect technologies. The choice of an appropriate parametric model largely determines the overall quality of the implemented effects. The present paper describes an approach to parametric signal processing based on deterministic/stochastic decomposition. The signal is considered as a sum of periodic (harmonic) and residual (noise) parts. The periodic part can be efficiently described as a sum of sinusoids with slowly varying amplitudes and frequencies, and the residual part is assumed to be an irregular noise signal. This representation was introduced in  and since then has been profoundly studied and significantly enhanced. The model provides good parameterization of both voiced and unvoiced frames and allows different modification techniques to be used for them. It ensures effective and simple processing in the frequency domain; however, the crucial point is the accuracy of harmonic analysis. The harmonic part of the signal is specified by sets of harmonic parameters (amplitude, frequency, and phase) for every instant of time. A number of methods have been proposed to estimate these parameters. The majority of analysis methods assume local stationarity of amplitude and frequency parameters within the analysis frame [2, 3]. This assumption makes the analysis procedure easier but degrades the accuracy of both parameter estimation and periodic/residual separation.
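For reference, the deterministic/stochastic model described above can be written compactly as follows. This is a sketch of the usual sinusoids-plus-noise formulation; the symbols $K$, $A_k$, $f_k$, $\varphi_k$, $r$, and $F_s$ are notation assumed here and are not taken from the paper.

```latex
s(n) = \sum_{k=1}^{K} A_k(n)\cos\varphi_k(n) + r(n), \qquad
\varphi_k(n) = \varphi_k(0) + \frac{2\pi}{F_s}\sum_{i=1}^{n} f_k(i)
```

Here $A_k(n)$ and $f_k(n)$ are the slowly varying amplitude and frequency (in Hz) of the $k$-th sinusoid, $\varphi_k(n)$ is its instantaneous phase, $F_s$ is the sampling frequency, and $r(n)$ is the residual (noise) part.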
simplified closed form expressions for instantaneous parameters estimation;
pitch detection and smooth pitch contour estimation;
improved harmonic parameters estimation accuracy.
The analysed signal is separated into periodic and residual parts, which are then processed by the modification techniques. The processed signal can then be synthesized in the time domain at the output of the system. The deterministic/stochastic representation significantly simplifies the processing stage. As shown in the experimental section, the combination of the proposed analysis, processing, and synthesis techniques provides good quality of signal analysis, modification, and reconstruction.
2. Time-Frequency Representations and Harmonic Analysis
where is the length of the frame. The transformation gives a spectral representation of the signal by sinusoidal components of multiple frequencies. The balance between frequency and time resolution is defined by the length of the analysis frame . Because of the local stationarity assumption, the DFT can hardly provide an accurate estimate of frequency-modulated components, which has given rise to approaches such as the harmonic transform  and the fan-chirp transform . The general idea of these approaches is to apply the Fourier transform to a time-warped version of the signal.
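The trade-off between frame length and frequency resolution can be illustrated with a small sketch. It uses a plain DFT on a stationary test tone; the sampling rate, the tone frequency, and the two frame lengths are arbitrary values chosen for illustration, not settings from the paper.

```python
import cmath
import math

def dft(frame):
    """Plain O(N^2) discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def peak_frequency(frame, fs):
    """Frequency (Hz) of the strongest bin below the Nyquist frequency."""
    spectrum = dft(frame)
    half = len(frame) // 2
    k = max(range(1, half), key=lambda i: abs(spectrum[i]))
    return k * fs / len(frame)

def tone(n_samples, freq, fs):
    """Stationary cosine test signal."""
    return [math.cos(2 * math.pi * freq * t / fs) for t in range(n_samples)]

fs = 8000.0  # assumed sampling rate (illustrative)

# Short frame: bin spacing fs/64 = 125 Hz, good time resolution.
short_estimate = peak_frequency(tone(64, 440.0, fs), fs)
# Long frame: bin spacing fs/512 = 15.625 Hz, poorer time resolution.
long_estimate = peak_frequency(tone(512, 440.0, fs), fs)
```

The longer frame locates the 440 Hz tone more precisely, but only because the tone stays stationary over the whole frame; for a frequency-modulated component a longer frame smears the peak instead, which is the limitation the time-warping approaches address.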
The Hilbert transform and DESA can be applied only to monocomponent signals, because for multicomponent signals the notion of a single-valued instantaneous frequency and amplitude becomes meaningless. Therefore, the signal should be split into single components before using these techniques. It is possible to use narrow-band filtering for this purpose . However, in the case of frequency-modulated components, it is not always possible due to their wide frequency.
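To make the monocomponent requirement concrete, the sketch below applies the DESA-2 variant of the energy separation algorithm to a single sinusoid. The text does not say which DESA variant is meant, so the choice of DESA-2 is an assumption, as are the test signal and the function names.

```python
import math

def teager(x, n):
    """Teager-Kaiser energy operator at sample n."""
    return x[n] * x[n] - x[n - 1] * x[n + 1]

def desa2(x, n):
    """DESA-2 estimate of (frequency in rad/sample, amplitude) at sample n.

    Requires samples x[n-2] .. x[n+2] and a frequency below pi/2 rad/sample.
    """
    # Symmetric difference y(i) = x(i+1) - x(i-1) for i = n-1, n, n+1.
    y = [x[i + 1] - x[i - 1] for i in range(n - 1, n + 2)]
    psi_x = teager(x, n)
    psi_y = y[1] * y[1] - y[0] * y[2]
    omega = 0.5 * math.acos(1.0 - psi_y / (2.0 * psi_x))
    amp = 2.0 * psi_x / math.sqrt(psi_y)
    return omega, amp

# Monocomponent test signal (illustrative values).
x = [0.8 * math.cos(0.3 * n + 0.5) for n in range(100)]
omega, amp = desa2(x, 50)
```

For this single sinusoid the estimates match the true frequency (0.3 rad/sample) and amplitude (0.8) up to rounding. For a sum of several sinusoids the same formulas no longer describe any one component, which is why the signal has to be split first.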
3. Instantaneous Harmonic Analysis
3.1. Instantaneous Harmonic Analysis of Nonstationary Harmonic Components
3.2. Filter Properties
3.3. Estimation Technique
In this subsection the general technique of sinusoidal parameters estimation is presented. The technique does not assume harmonic structure of the signal and therefore can be applied both to speech and audio signals .
The analysis was carried out using the following settings: analysis frame length of 48 ms, analysis step of 14 ms, filter bandwidth of 70 Hz, and the Hamming window as the windowing function. The synthesized periodic part is shown in Figure 6(b). As can be seen from the spectrogram, the periodic part contains only long sinusoidal components with high-energy localization. The transients are left untouched in the residual signal, which is presented in Figure 6(c).
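The following sketch shows how the frame settings above translate into samples and how a narrow-band analysis filter can return the amplitude and phase of one sinusoidal component. It is a simplified, stationary stand-in: the filter here is a Hamming-windowed complex exponential with a fixed centre frequency, whereas the paper's filters have a frequency-modulated impulse response that is not reproduced. The sampling rate, the test tone, and all function names are assumptions.

```python
import cmath
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def estimate_component(x, fs, fc, center, frame_len):
    """Amplitude and phase (at sample `center`) of the component near fc Hz.

    Stationary narrow-band filter: correlate the windowed frame with a
    complex exponential at fc; the bandwidth is set by the window length.
    """
    w = hamming(frame_len)
    start = center - frame_len // 2
    acc = 0j
    for i in range(frame_len):
        t = (start + i - center) / fs  # time relative to the frame centre
        acc += x[start + i] * w[i] * cmath.exp(-2j * math.pi * fc * t)
    acc *= 2.0 / sum(w)  # compensate window gain and the one-sided spectrum
    return abs(acc), cmath.phase(acc)

fs = 8000.0                    # assumed sampling rate
frame_len = round(0.048 * fs)  # 48 ms frame -> 384 samples
step = round(0.014 * fs)       # 14 ms step  -> 112 samples

# Illustrative test tone: 200 Hz, amplitude 0.5, initial phase 0.3 rad.
x = [0.5 * math.cos(2 * math.pi * 200.0 * n / fs + 0.3) for n in range(2000)]
amp, phase = estimate_component(x, fs, 200.0, 1000, frame_len)
```

Sliding `center` forward by `step` samples would give one parameter set per analysis step. In this sketch the estimates are only accurate while the component stays close to `fc` over the whole frame.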
3.4. Speech Analysis
initial fundamental frequency contour estimation;
harmonic parameters estimation with fundamental frequency adjustment.
In voiced speech analysis, the problem of initial fundamental frequency estimation reduces to finding a periodic component with the lowest possible frequency and sufficiently high energy. Within the possible fundamental frequency range (in this paper, it is defined as Hz) all periodic components are extracted, and then the most suitable one is taken as the fundamental. In order to reduce computational complexity, the source signal is filtered by a low-pass filter before the estimation.
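The selection rule described above can be sketched as follows. The function name, the relative energy threshold, the candidate list, and the search range used in the example are all hypothetical; in particular, the range is a placeholder and is not the range defined in the paper.

```python
def pick_fundamental(components, f_min, f_max, rel_energy=0.1):
    """Pick the fundamental among extracted periodic components.

    components: list of (frequency_hz, energy) pairs.
    Returns the lowest-frequency component inside [f_min, f_max] whose
    energy is at least rel_energy times the strongest in-range component,
    or None if no component lies in the range.
    """
    in_range = [(f, e) for f, e in components if f_min <= f <= f_max]
    if not in_range:
        return None
    threshold = rel_energy * max(e for _, e in in_range)
    strong = [(f, e) for f, e in in_range if e >= threshold]
    return min(strong, key=lambda c: c[0])

# Hypothetical candidates: a weak low-frequency component, three harmonics,
# and one component outside the search range.
candidates = [(70.0, 0.002), (118.0, 0.40), (236.0, 0.90), (354.0, 0.30), (900.0, 0.50)]
fundamental = pick_fundamental(candidates, 60.0, 400.0)  # placeholder range
```

With these values the 70 Hz candidate is rejected for insufficient energy and the 118 Hz component is selected, even though the second harmonic at 236 Hz is stronger.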
Table: Results of synthetic speech analysis (methods compared: harmonic transform method and instantaneous harmonic analysis).
4. Effects Implementation
Since the periodic part of the signal is expressed by harmonic parameters, it is easy to synthesize the periodic part while slowing down or speeding up the tempo. Amplitude and frequency contours should be interpolated at the respective time instants, and then the output signal can be synthesized. The noise part is parameterized by spectral envelopes and then time-scaled as described in . Separate periodic/noise processing provides high-quality time-scale modifications with a low level of audible artifacts.
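A minimal sketch of the time-scaling step for a single harmonic is given below. It assumes per-sample amplitude and frequency contours, linear interpolation, and phase obtained by accumulating the interpolated frequency; these choices and all names are assumptions made for illustration, and the noise part is not handled.

```python
import math

def interp(contour, pos):
    """Linear interpolation of a per-sample contour at fractional position pos."""
    i = min(int(pos), len(contour) - 2)
    frac = pos - i
    return (1.0 - frac) * contour[i] + frac * contour[i + 1]

def stretch_harmonic(amps, freqs, fs, rate):
    """Resynthesize one harmonic with its duration scaled by `rate`.

    amps, freqs: per-sample amplitude and frequency (Hz) contours.
    rate > 1 slows the tempo down, rate < 1 speeds it up; pitch is kept
    because the frequency values themselves are not changed.
    """
    last = len(amps) - 1
    n_out = int(round(last * rate)) + 1
    out, phase = [], 0.0
    for m in range(n_out):
        pos = min(m / rate, last)  # position on the original time axis
        out.append(interp(amps, pos) * math.cos(phase))
        phase += 2.0 * math.pi * interp(freqs, pos) / fs
    return out

fs = 8000.0            # assumed sampling rate
amps = [0.5] * 800     # illustrative constant contours: 0.1 s of a 100 Hz harmonic
freqs = [100.0] * 800
stretched = stretch_harmonic(amps, freqs, fs, 2.0)
```

The output is twice as long, while the period of the harmonic stays at 80 samples, that is, the pitch is unchanged. A complete effect would sum all harmonics and add the separately time-scaled noise part.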
5. Experimental Results
In this section an example of vocal processing is shown. The processing system under consideration performs pitch shifting in order to assist a singer.
The source signal was analyzed using the proposed harmonic analysis, and then the pitch shifting technique was applied as described above.
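As a rough illustration of pitch shifting in the harmonic domain, the sketch below scales the harmonic frequencies of one frame and reads the new amplitudes from a piecewise-linear spectral envelope drawn through the original harmonics, so that the timbre is approximately kept. This is a common generic scheme and is an assumption here; it is not necessarily the technique used in the paper, and the function names and frame values are hypothetical.

```python
def envelope(freqs, amps, f):
    """Piecewise-linear spectral envelope through (freqs[k], amps[k]).

    Values outside the covered frequency range are clamped to the end points.
    """
    if f <= freqs[0]:
        return amps[0]
    for k in range(1, len(freqs)):
        if f <= freqs[k]:
            t = (f - freqs[k - 1]) / (freqs[k] - freqs[k - 1])
            return (1.0 - t) * amps[k - 1] + t * amps[k]
    return amps[-1]

def pitch_shift_frame(freqs, amps, factor):
    """Scale harmonic frequencies by `factor`, resampling amplitudes from the envelope."""
    new_freqs = [f * factor for f in freqs]
    new_amps = [envelope(freqs, amps, f) for f in new_freqs]
    return new_freqs, new_amps

# Hypothetical frame: four harmonics of a 100 Hz fundamental.
freqs = [100.0, 200.0, 300.0, 400.0]
amps = [1.0, 0.5, 0.25, 0.125]
new_freqs, new_amps = pitch_shift_frame(freqs, amps, 1.5)
```

With a factor of 1.5 the fundamental moves to 150 Hz and takes the envelope value halfway between the first two original harmonics (0.75). A practical implementation would also discard shifted harmonics that exceed the Nyquist frequency.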
The stochastic/deterministic model can be applied to voice processing systems. It provides efficient signal parameterization in a way that is convenient for implementing voice effects such as pitch shifting, timbre modification, and time-scale modification. The practical application of the proposed harmonic analysis technique has shown encouraging results. The described approach might be a promising solution to harmonic parameter estimation in speech and audio processing systems .
This work was supported by the Polish Ministry of Science and Higher Education (MNiSzW) in years 2009–2011 (Grant no. N N516 388836).
- Quatieri TF, McAulay RJ: Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing 1986, 34(6):1449-1464. 10.1109/TASSP.1986.1164985
- Spanias AS: Speech coding: a tutorial review. Proceedings of the IEEE 1994, 82(10):1541-1582. 10.1109/5.326413
- Serra X: Musical sound modeling with sinusoids plus noise. In Musical Signal Processing. Edited by: Roads C, Pope S, Piccialli A, De Poli G. Swets & Zeitlinger; 1997:91-122.
- Boashash B: Estimating and interpreting the instantaneous frequency of a signal. Proceedings of the IEEE 1992, 80(4):520-568. 10.1109/5.135376
- Maragos P, Kaiser JF, Quatieri TF: Energy separation in signal modulations with application to speech analysis. IEEE Transactions on Signal Processing 1993, 41(10):3024-3051. 10.1109/78.277799
- Abe T, Kobayashi T, Imai S: Harmonics tracking and pitch extraction based on instantaneous frequency. Proceedings of the 20th International Conference on Acoustics, Speech, and Signal Processing, May 1995, 756-759.
- Abe T, Honda M: Sinusoidal model based on instantaneous frequency attractors. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(4):1292-1300.
- Azarov E, Petrovsky A, Parfieniuk M: Estimation of the instantaneous harmonic parameters of speech. Proceedings of the 16th European Signal Processing Conference (EUSIPCO '08), 2008, Lausanne, Switzerland.
- Azarov I, Petrovsky A: Harmonic analysis of speech. Speech Technology 2008, (1):67-77.
- Zhang F, Bi G, Chen YQ: Harmonic transform. IEE Proceedings: Vision, Image and Signal Processing 2004, 151(4):257-263. 10.1049/ip-vis:20040604
- Weruaga L, Képesi M: The fan-chirp transform for non-stationary harmonic signals. Signal Processing 2007, 87(6):1504-1522. 10.1016/j.sigpro.2007.01.006
- Gabor D: Theory of communication. Proceedings of the IEE 1946, 93(3):429-457.
- Petrovsky A, Azarov E, Petrovsky AA: Harmonic representation and auditory model-based parametric matching and its application in speech/audio analysis. Proceedings of the 126th AES Convention, 2009, Munich, Germany 13. Preprint 7705.
- Azarov E, Petrovsky A: Instantaneous harmonic analysis for vocal processing. Proceedings of the 12th International Conference on Digital Audio Effects (DAFx '09), September 2009, Como, Italy.
- Levine S, Smith J: A sines+transients+noise audio representation for data compression and time/pitch scale modifications. Proceedings of the 105th AES Convention, September 1998, San Francisco, Calif, USA. Preprint 4781.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.