- Research Article
- Open Access
High-Quality Time Stretch and Pitch Shift Effects for Speech and Audio Using the Instantaneous Harmonic Analysis
EURASIP Journal on Advances in Signal Processing volume 2010, Article number: 712749 (2010)
The paper presents methods for instantaneous harmonic analysis with application to high-quality pitch, timbre, and time-scale modifications. The analysis technique is based on narrow-band filtering using special analysis filters with frequency-modulated impulse response. The main advantages of the technique are high accuracy of harmonic parameters estimation and adequate harmonic/noise separation that allow implementing audio and speech effects with low level of audible artifacts. Time stretch and pitch shift effects are considered as primary application in the paper.
Parametric representation of audio and speech signals has become an integral part of modern effect technologies. The choice of an appropriate parametric model largely determines the overall quality of the implemented effects. The present paper describes an approach to parametric signal processing based on deterministic/stochastic decomposition. The signal is considered as a sum of periodic (harmonic) and residual (noise) parts. The periodic part can be efficiently described as a sum of sinusoids with slowly varying amplitudes and frequencies, and the residual part is assumed to be an irregular noise signal. This representation was introduced in  and has since been studied extensively and significantly enhanced. The model provides good parameterization of both voiced and unvoiced frames and allows different modification techniques to be used for them. It ensures effective and simple processing in the frequency domain; however, the crucial point is the accuracy of the harmonic analysis. The harmonic part of the signal is specified by sets of harmonic parameters (amplitude, frequency, and phase) for every instant of time. A number of methods have been proposed to estimate these parameters. The majority of analysis methods assume local stationarity of the amplitude and frequency parameters within the analysis frame [2, 3]. This makes the analysis procedure easier but, on the other hand, degrades the accuracy of parameter estimation and of periodic/residual separation.
Good alternatives are methods that estimate instantaneous harmonic parameters. The notion of instantaneous frequency was introduced in [4, 5], and estimation methods have been presented in [4–9]. The aim of the current investigation is to study the applicability of the instantaneous harmonic analysis technique described in [8, 9] to a processing system for audio and speech effects (such as pitch, timbre, and time-scale modifications). The analysis method is based on narrow-band filtering using analysis filters with a closed-form impulse response. It has been shown  that analysis filters can be adjusted in accordance with the pitch contour in order to get an adequate estimate of high-order harmonics with rapid frequency modulations. The technique presented in this paper has the following improvements:
simplified closed form expressions for instantaneous parameters estimation;
pitch detection and smooth pitch contour estimation;
improved harmonic parameters estimation accuracy.
The analysed signal is separated into periodic and residual parts, which are then processed by the modification techniques. The processed signal can then be synthesized in the time domain at the output of the system. The deterministic/stochastic representation significantly simplifies the processing stage. As shown in the experimental section, the combination of the proposed analysis, processing, and synthesis techniques provides good quality of signal analysis, modification, and reconstruction.
2. Time-Frequency Representations and Harmonic Analysis
The sinusoidal model assumes that the signal $s(n)$ can be expressed as the sum of its periodic and stochastic parts:
$$s(n) = \sum_{k=1}^{K} A_k(n)\cos\varphi_k(n) + r(n), \tag{1}$$
where $A_k(n)$ is the instantaneous magnitude of the $k$th sinusoidal component, $K$ is the number of components, $\varphi_k(n)$ is the instantaneous phase of the $k$th component, and $r(n)$ is the stochastic part of the signal. Instantaneous phase $\varphi_k(n)$ and instantaneous frequency $f_k(n)$ are related as follows:
$$\varphi_k(n) = \sum_{i=0}^{n}\frac{2\pi f_k(i)}{F_s} + \varphi_k(0), \tag{2}$$
where $F_s$ is the sampling frequency and $\varphi_k(0)$ is the initial phase of the $k$th component. The harmonic model states that the frequencies $f_k(n)$ are integer multiples of the fundamental frequency $f_0(n)$ and can be calculated as
$$f_k(n) = k f_0(n). \tag{3}$$
The harmonic model is often used in speech coding since it represents voiced speech in a highly efficient way. The parameters $A_k(n)$, $f_k(n)$, and $\varphi_k(0)$ are estimated by means of the sinusoidal (harmonic) analysis. The stochastic part can be calculated as the difference between the source signal and the estimated sinusoidal part:
$$r(n) = s(n) - \sum_{k=1}^{K} A_k(n)\cos\varphi_k(n). \tag{4}$$
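A minimal numerical sketch of the harmonic-plus-residual model described above may help fix the notation. It assumes NumPy; the function name `synthesize_harmonics`, the array layout (one row per harmonic), and the test contour are illustrative choices, not part of the original paper.

```python
import numpy as np

def synthesize_harmonics(f0, amps, fs, phi0=None):
    """Periodic part of the model: sum of harmonics of a fundamental contour.

    f0   : (N,) fundamental frequency contour in Hz (assumed known)
    amps : (K, N) instantaneous magnitudes, one row per harmonic
    fs   : sampling frequency in Hz
    phi0 : (K,) initial phases in radians (zeros if omitted)
    """
    f0 = np.asarray(f0, dtype=float)
    amps = np.asarray(amps, dtype=float)
    num_harm = amps.shape[0]
    if phi0 is None:
        phi0 = np.zeros(num_harm)
    k = np.arange(1, num_harm + 1)[:, None]
    f_k = k * f0[None, :]                      # harmonic frequencies, f_k = k * f0
    # Instantaneous phase as the running sum of instantaneous frequency.
    phase = 2.0 * np.pi * np.cumsum(f_k, axis=1) / fs + np.asarray(phi0)[:, None]
    return np.sum(amps * np.cos(phase), axis=0)

# Illustrative signal: three harmonics on a rising pitch contour plus weak noise.
fs = 8000
n = np.arange(fs // 10)
f0 = np.linspace(120.0, 150.0, n.size)
amps = np.vstack([np.full(n.size, 1.0 / k) for k in (1, 2, 3)])
periodic = synthesize_harmonics(f0, amps, fs)
noise = 0.01 * np.random.default_rng(0).standard_normal(n.size)
signal = periodic + noise

# With exact parameters, subtracting the periodic part recovers the noise.
residual = signal - synthesize_harmonics(f0, amps, fs)
```

In practice the parameters are estimates, so the residual also contains the estimation error; this is why the accuracy of the harmonic analysis matters for the quality of the separation.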
Assuming that sinusoidal components are stationary (i.e., have constant amplitude and frequency) over a short period of time that correspond to the length of the analysis frame, they can be estimated using DFT:
where is the length of the frame. The transformation gives a spectral representation of the signal by sinusoidal components of multiple frequencies. The balance between frequency and time resolution is defined by the length of the analysis frame . Because of the local stationarity assumption, the DFT can hardly provide an accurate estimate of frequency-modulated components, which has given rise to approaches such as the harmonic transform  and the fan-chirp transform . The general idea of these approaches is to apply the Fourier transform to a time-warped signal.
The signal warping can be carried out before transformation or directly embedded in the transform expression :
where is frequency and is the chirp rate. The transform is able to identify components with linear frequency change; however, their spectral amplitudes are assumed to be constant. There are several methods for estimating instantaneous harmonic parameters. Some of them are connected with the notion of the analytic signal based on the Hilbert transform (HT). A unique complex signal $z(t)$ can be generated from a real signal $s(t)$ using the Fourier transform . This can also be done with the following time-domain procedure:
$$z(t) = s(t) + jH[s(t)] = a(t)e^{j\varphi(t)},$$
where $H[\cdot]$ is the Hilbert transform, defined as
$$H[s(t)] = \mathrm{p.v.}\int_{-\infty}^{\infty}\frac{s(t-\tau)}{\pi\tau}\,d\tau,$$
where p.v. denotes the Cauchy principal value of the integral. $z(t)$ is referred to as Gabor's complex signal, and $a(t)$ and $\varphi(t)$ can be considered as the instantaneous amplitude and instantaneous phase, respectively. Signals $s(t)$ and $H[s(t)]$ are theoretically in quadrature. Being a complex signal, $z(t)$ can be expressed in polar coordinates, and therefore $a(t)$ and $\varphi(t)$ can be calculated as follows:
$$a(t) = \sqrt{s^2(t) + H^2[s(t)]},\qquad \varphi(t) = \arctan\frac{H[s(t)]}{s(t)}.$$
Recently the discrete energy separation algorithm (DESA) based on the Teager energy operator was presented . The energy operator is defined as
where the derivative operation is approximated by the symmetric difference. The instantaneous amplitude and frequency can be evaluated as
The Hilbert transform and DESA can be applied only to monocomponent signals, because for multicomponent signals the notion of a single-valued instantaneous frequency and amplitude becomes meaningless. Therefore, the signal should be split into single components before using these techniques. It is possible to use narrow-band filtering for this purpose . However, in the case of frequency-modulated components, this is not always possible because they occupy a wide frequency range.
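The two monocomponent estimators discussed above can be sketched as follows. This is an illustration under stated assumptions, not the formulation of the cited works: the analytic signal is built with the usual FFT construction (zeroing negative frequencies), and the energy-separation step uses the DESA-2 variant, which is assumed here because the text mentions the symmetric difference. All function names are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT: keep DC, double positive frequencies."""
    x = np.asarray(x, dtype=float)
    n = x.size
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * gain)

def hilbert_estimates(x, fs):
    """Instantaneous amplitude (N,) and frequency in Hz (N-1,)."""
    z = analytic_signal(x)
    phase = np.unwrap(np.angle(z))
    return np.abs(z), np.diff(phase) * fs / (2.0 * np.pi)

def teager(x):
    """Discrete Teager energy operator: x(n)^2 - x(n-1) x(n+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x, fs, eps=1e-12):
    """DESA-2 amplitude and frequency (Hz) for samples n = 2 .. N-3."""
    x = np.asarray(x, dtype=float)
    y = x[2:] - x[:-2]                      # symmetric difference
    psi_x = np.maximum(teager(x)[1:-1], eps)
    psi_y = np.maximum(teager(y), eps)
    ratio = np.clip(psi_y / (4.0 * psi_x), 0.0, 1.0)
    freq = np.arcsin(np.sqrt(ratio)) * fs / (2.0 * np.pi)   # valid below fs/4
    amp = 2.0 * psi_x / np.sqrt(psi_y)
    return amp, freq

# Monocomponent test tone with an integer number of periods in the frame.
fs = 8000
n = np.arange(800)
tone = 0.7 * np.cos(2.0 * np.pi * 440.0 * n / fs + 0.3)
amp_h, freq_h = hilbert_estimates(tone, fs)
amp_d, freq_d = desa2(tone, fs)
```

Both estimators recover the amplitude and frequency of this single tone; adding a second sinusoid to `tone` makes the outputs oscillate, which illustrates why the components have to be separated first.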
3. Instantaneous Harmonic Analysis
3.1. Instantaneous Harmonic Analysis of Nonstationary Harmonic Components
The proposed analysis method is based on the filtering technique that provides direct parameters estimation . In voiced speech, harmonic components are spaced in the frequency domain, and each component can be limited to a narrow frequency band. Therefore, harmonic components can be separated within the analysis frame by filters with nonoverlapping bandwidths. These considerations point to the applicability and effectiveness of the filtering approach to harmonic analysis. The signal is represented as a sum of bandlimited cosine functions with instantaneous amplitude, phase, and frequency. Let us denote the number of cosines and frequency separation borders (in Hz) , where , . The given signal can be represented as its convolution with the impulse response of the ideal low-pass filter :
where —the impulse response of the band-pass filter with passband , —bandlimited output signal. The impulse response can be written in the following way:
where and . Parameters and correspond to the center frequency of the passband and the half of bandwidth, respectively. Convolution of finite signal () and can be expressed as the following sum:
The expression can be rewritten as a sum of zero frequency components:
Thus, considering (15), the expression (14) is a magnitude and frequency-modulated cosine function:
with instantaneous magnitude , phase , and frequency that can be calculated as
In this way the signal frame () can be represented by cosines with instantaneous amplitude and frequency. Instantaneous sinusoidal parameters of the filter output are available at every instant of time within the analysis frame. The filter output can be interpreted as an analytic signal in the following way:
The bandwidth specified by border frequencies and (or by parameters and ) should cover the frequency of the periodic component that is being analyzed. In many applications there is no need to represent entire signal as a sum of modulated cosines. In hybrid parametric representation it is necessary to choose harmonic components with smooth contours of frequency and amplitude values. For accurate sinusoidal parameters estimation of periodical components with high-frequency modulations a frequency-modulated filter can be used. The closed form impulse response of the filter is modulated according to frequency contour of the analyzed component. This approach is quite applicable to analysis of voiced speech since rough harmonic frequency trajectories can be estimated from the pitch contour. Considering centre frequency of the filter bandwidth as a function of time , (15) can be rewritten in the following form:
The required instantaneous parameters can be calculated using expressions (18). The frequency-modulated filter has a warped band pass, aligned to the given frequency contour that provides adequate analysis of periodic components with rapid frequency alterations. This approach is an alternative to time warping that is used in speech analysis . In Figure 1 an example of parameters estimation is shown. The frequency contour of the harmonic component can be covered by the filter band pass specified by the centre frequency contour and the bandwidth .
Figure 1: The centre frequency contour is adjusted within the analysis frame, providing narrow-band filtering of frequency-modulated components.
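The idea of a filter whose passband follows a frequency contour can be illustrated with a short sketch. It is not the closed-form expression of the paper: here the signal is demodulated by the phase of the centre-frequency contour and smoothed by a Hamming-windowed sinc low-pass filter, which yields a complex envelope from which amplitude, phase, and frequency are read. The function name, the number of taps, and the test signal are assumptions made for illustration.

```python
import numpy as np

def fm_filter_analysis(signal, fc, fs, half_bw, taps=257):
    """Track one component whose frequency stays within fc(n) +/- half_bw.

    signal  : (N,) real input
    fc      : (N,) centre-frequency contour of the band in Hz
    half_bw : half of the bandwidth in Hz
    taps    : odd length of the windowed low-pass prototype (assumed value)
    Returns instantaneous amplitude, phase (rad), and frequency (Hz).
    """
    signal = np.asarray(signal, dtype=float)
    fc = np.asarray(fc, dtype=float)
    phi_c = 2.0 * np.pi * np.cumsum(fc) / fs           # phase of the band centre
    m = np.arange(taps) - (taps - 1) / 2.0
    h = np.sinc(2.0 * half_bw * m / fs) * np.hamming(taps)
    h /= h.sum()                                       # unit gain at the band centre
    # Shift the band centre to 0 Hz, then low-pass: the band follows fc(n).
    z = np.convolve(signal * np.exp(-1j * phi_c), h, mode="same")
    amp = 2.0 * np.abs(z)
    dev = np.gradient(np.unwrap(np.angle(z))) * fs / (2.0 * np.pi)
    return amp, phi_c + np.angle(z), fc + dev

# Two harmonics on a rising pitch contour; analyse the second one using a
# rough (slightly biased) pitch estimate as the centre-frequency contour.
fs = 8000
n = np.arange(1600)
f0 = np.linspace(140.0, 160.0, n.size)
phi0 = 2.0 * np.pi * np.cumsum(f0) / fs
x = np.cos(phi0) + 0.5 * np.cos(2.0 * phi0 + 0.4)
rough_f0 = f0 + 1.0
amp, phase, freq = fm_filter_analysis(x, 2.0 * rough_f0, fs, half_bw=35.0)
mid = slice(300, 1300)                                 # ignore frame borders
```

Away from the borders, `amp[mid]` stays close to 0.5 and `freq[mid]` close to `2 * f0[mid]`, even though the harmonic sweeps over a range wider than the fixed band would allow.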
3.2. Filter Properties
Estimation accuracy degrades close to the borders of the frame because of signal discontinuity and spectral leakage. However, the estimation error can be reduced by using a wider passband (Figure 2).
In any case the passband should be wide enough in order to provide adequate estimation of harmonic amplitudes. If the passband is too narrow, the evaluated amplitude values become lower than they are in reality. It is possible to determine the filter bandwidth as a threshold value that gives desired level of accuracy. The threshold value depends on length of analysis window and type of window function. In Figure 3 the dependence for Hamming window is presented, assuming that amplitude attenuation should be less than −20 dB.
It is evident that the required bandwidth becomes narrower as the length of the window increases. It is also clear that a wide passband degrades estimation accuracy when the signal contains noise. The noise sensitivity of filters with different bandwidths is demonstrated in Figure 4.
3.3. Estimation Technique
In this subsection the general technique of sinusoidal parameters estimation is presented. The technique does not assume harmonic structure of the signal and therefore can be applied both to speech and audio signals .
In order to locate sinusoidal components in the frequency domain, the estimation procedure uses iterative adjustments of the filter bands with a predefined number of iterations (Figure 5). At every step the centre frequency of each filter is changed in accordance with the calculated frequency value in order to position the energy peak at the centre of the band. At the initial stage, the frequency range of the signal is covered by overlapping bands (where is the number of bands) with constant central frequencies , respectively. At every step the respective instantaneous frequencies are estimated by formulas (15) and (18) at the instant that corresponds to the centre of the frame . Then the central bandwidth frequencies are reset , and the next estimation is carried out. When all the energy peaks are located, the final sinusoidal parameters (amplitude, frequency, and phase) can be calculated using expressions (15) and (18) as well. During the peak location process, some of the filter bands may locate the same component. Duplicated parameters are discarded by comparing the centre band frequencies .
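The iterative band adjustment described above can be sketched as follows. This is a simplified illustration rather than the paper's closed-form estimator: each band is a Hamming-windowed sinc filter applied after demodulation, the frequency at the frame centre is taken from the phase advance between two adjacent filter outputs, and the convergence tolerance, amplitude threshold, and band spacing are assumed values. The function names are hypothetical.

```python
import numpy as np

def band_estimate(frame, fc, fs, half_bw):
    """Amplitude and frequency (Hz) near fc, evaluated at the frame centre."""
    n = np.arange(frame.size)
    taps = frame.size - 1
    m = np.arange(taps) - (taps - 1) / 2.0
    h = np.sinc(2.0 * half_bw * m / fs) * np.hamming(taps)
    h /= h.sum()
    base = frame * np.exp(-2j * np.pi * fc * n / fs)   # move band centre to 0 Hz
    z_a = np.dot(base[:-1], h)                         # output at the frame centre
    z_b = np.dot(base[1:], h)                          # output one sample later
    freq = fc + np.angle(z_b * np.conj(z_a)) * fs / (2.0 * np.pi)
    return np.abs(z_a) + np.abs(z_b), freq

def locate_components(frame, fs, half_bw=35.0, step=50.0, iters=8,
                      min_amp=0.05, tol=1.0):
    """Re-centre overlapping bands on energy peaks, then drop duplicates."""
    found = []
    for fc in np.arange(step, fs / 4.0, step):         # initial band centres
        for _ in range(iters):
            _, f_est = band_estimate(frame, fc, fs, half_bw)
            fc = float(np.clip(f_est, 0.0, fs / 2.0))  # reset the band centre
        amp, f_est = band_estimate(frame, fc, fs, half_bw)
        # Keep only bands that settled on a peak with noticeable energy.
        if amp >= min_amp and abs(f_est - fc) <= tol:
            found.append((f_est, amp))
    found.sort()
    merged = []
    for f, a in found:
        if merged and f - merged[-1][0] < half_bw:     # same component found twice
            if a > merged[-1][1]:
                merged[-1] = (f, a)
        else:
            merged.append((f, a))
    return merged

# One 48 ms frame containing two stationary sinusoids.
fs = 8000
n = np.arange(384)
frame = (np.cos(2.0 * np.pi * 440.0 * n / fs)
         + 0.5 * np.cos(2.0 * np.pi * 1000.0 * n / fs + 0.3))
components = locate_components(frame, fs)
near_440 = min(components, key=lambda c: abs(c[0] - 440.0))
near_1000 = min(components, key=lambda c: abs(c[0] - 1000.0))
```

Several initial bands converge on the same sinusoid, which is why the final merge step is needed; frame-to-frame tracking, as described next in the text, would then remove short-lived detections.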
In order to discard short-term components (that apparently are transients or noise and should be taken to the residual), sinusoidal parameters are tracked from frame to frame. The frequency and amplitude values of adjacent frames are compared, providing long-term component matching. The technique has been used in the hybrid audio coder , since it is able to pick out the sinusoidal part and leave the original transients in the residual without any prior transient detection. In Figure 6 a result of the signal separation is presented. The source signal is a jazz tune (Figure 6(a)).
The analysis was carried out using the following settings: analysis frame length, 48 ms; analysis step, 14 ms; filter bandwidth, 70 Hz; windowing function, the Hamming window. The synthesized periodic part is shown in Figure 6(b). As can be seen from the spectrogram, the periodic part contains only long sinusoidal components with high-energy localization. The transients are left untouched in the residual signal, which is presented in Figure 6(c).
3.4. Speech Analysis
In speech processing, it is assumed that signal frames can be either voiced or unvoiced. In voiced segments the periodic constituent prevails over the noise; in unvoiced segments the opposite takes place, and therefore harmonic analysis is unsuitable in that case. In the proposed analysis framework, voiced/unvoiced frame classification is carried out using a pitch detector. The harmonic parameters estimation procedure consists of the following two stages:
initial fundamental frequency contour estimation;
harmonic parameters estimation with fundamental frequency adjustment.
In voiced speech analysis, the problem of initial fundamental frequency estimation comes to finding a periodical component with the lowest possible frequency and sufficiently high energy. Within the possible fundamental frequency range (in this paper, it is defined as Hz) all periodical components are extracted, and then the suitable one is considered as the fundamental. In order to reduce computational complexity, the source signal is filtered by a low-pass filter before the estimation.
Having estimated the fundamental contour, it is possible to calculate filter impulse responses aligned to the fundamental frequency contour. The central frequency of the filter band is calculated as the instantaneous frequency of the fundamental multiplied by the number of the corresponding harmonic . The procedure goes from the first harmonic to the last, adjusting the fundamental frequency at every step (Figure 7). The fundamental frequency recalculation formula can be written as follows:
The fundamental frequency values become more precise while moving up the frequency range. It allows making proper analysis of high-order harmonics with significant frequency modulations. Harmonic parameters are estimated using expressions (10)-(11). After parameters estimation, the periodical part of the signal is synthesized by formula (1) and subtracted from the source in order to get the noise part.
In order to test applicability of the proposed technique, a set of synthetic signals with predefined parameters was used. The signals were synthesized with different harmonic-to-noise ratio defined as
where is the energy of the deterministic part of the signal and is the energy of its stochastic part. All the signals were generated using a specified fundamental frequency contour and the same number of harmonics—20. Stochastic parts of the signals were generated as white noise with such energy that provides specified values. After analysis the signals were separated into stochastic and deterministic parts with new harmonic-to-noise ratios:
Quantitative characteristics of accuracy were calculated as the signal-to-noise ratio:
$$\mathrm{SNR} = 10\log_{10}\frac{E_H}{E_e},$$
where $E_H$ is the energy of the estimated harmonic part and $E_e$ is the energy of the estimation error (the energy of the difference between the source and estimated harmonic parts). The signals were analyzed using the proposed technique and the STFT-based harmonic transform method . The same frame length (64 ms) and the same window function (Hamming window) were used during analysis. In both methods, it was assumed that the fundamental frequency contour is known and that the frequency trajectories of the harmonics are integer multiples of the fundamental frequency. The results, reported in Table 1, show that the measured signal-to-noise ratio decreases as the harmonic-to-noise ratio decreases. However, for nonstationary signals, the proposed technique provides higher signal-to-noise ratios even when the harmonic-to-noise ratio is low.
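The evaluation measures used in this experiment are simple energy ratios, and a short sketch shows how such a test can be set up. It assumes that both ratios are expressed in decibels as ten times the base-10 logarithm of an energy ratio; the helper names and the toy "estimate" are hypothetical and only serve to exercise the formulas.

```python
import numpy as np

def energy(x):
    """Signal energy: sum of squared samples."""
    return float(np.sum(np.square(x)))

def ratio_db(num, den):
    """Energy ratio of two signals in decibels."""
    return 10.0 * np.log10(energy(num) / energy(den))

def scale_noise_to_hnr(harmonic, noise, hnr_db):
    """Scale white noise so that harmonic/noise energy ratio equals hnr_db."""
    gain = np.sqrt(energy(harmonic) / (energy(noise) * 10.0 ** (hnr_db / 10.0)))
    return noise * gain

# Synthetic test signal with a prescribed harmonic-to-noise ratio of 20 dB.
fs = 8000
n = np.arange(fs // 4)
harmonic = np.cos(2.0 * np.pi * 200.0 * n / fs)
noise = scale_noise_to_hnr(
    harmonic, np.random.default_rng(1).standard_normal(n.size), 20.0)
hnr_in = ratio_db(harmonic, noise)

# Stand-in for an analysis result: the true harmonic part with a 1% gain error.
estimated = 0.99 * harmonic
snr = ratio_db(estimated, harmonic - estimated)
```

Here `hnr_in` equals the requested 20 dB, and `snr` measures how closely the estimated harmonic part matches the true one, independently of how much noise the input contained.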
An example of natural speech analysis is presented in Figure 8. The source signal is a phrase uttered by a female speaker ( kHz). The estimated harmonic parameters were used to synthesize the periodic part of the signal, which was subtracted from the source in order to get the residual. All harmonics of the source are modeled by the harmonic analysis, while the residual contains the transient and noise components, as can be seen in the respective spectrograms.
4. Effects Implementation
The harmonic analysis described in the previous section results in a set of harmonic parameters and residual signal. Instantaneous spectral envelopes can be estimated from the instantaneous harmonic amplitudes and the fundamental frequency obtained at the analysis stage . The linear interpolation can be used for this purpose. The set of frequency envelopes can be considered as a function of two parameters: sample number and frequency. Pitch shifting procedure affects only the periodic part of the signal that can be synthesized as follows:
Phases of harmonic components are calculated according to a new fundamental frequency contour :
Harmonic frequencies are calculated by formula (3):
An additional phase parameter is used in order to keep the original phases of the harmonics relative to the phase of the fundamental:
Since the described pitch shifting does not change the spectral envelope of the source signal and keeps the relative phases of the harmonic components, the processed signal has a natural sound with completely new intonation. The timbre of the speaker's voice is defined by the spectral envelope function . If we consider the envelope function as a matrix
then any timbre modification can be expressed as a conversion function that transforms the source envelope matrix into a new matrix :
Since the periodic part of the signal is expressed by harmonic parameters, it is easy to synthesize the periodic part while slowing down or speeding up the tempo. Amplitude and frequency contours should be interpolated at the respective moments of time, and then the output signal can be synthesized. The noise part is parameterized by spectral envelopes and then time-scaled as described in . Separate periodic/noise processing provides high-quality time-scale modifications with a low level of audible artifacts.
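The processing of the periodic part described in this section can be sketched in a few lines. The sketch makes several simplifying assumptions that are not taken from the paper: the spectral envelope at each instant is the piecewise-linear curve through the harmonic amplitudes (held constant outside the measured range), the relative phase term is omitted, and the residual is not processed. The function name and argument layout are illustrative.

```python
import numpy as np

def modify_periodic(f0, amps, fs, pitch=1.0, stretch=1.0):
    """Pitch-shift and time-stretch the periodic part from harmonic parameters.

    f0      : (N,) fundamental frequency contour in Hz
    amps    : (K, N) instantaneous harmonic amplitudes
    pitch   : multiplicative pitch factor (2.0 = one octave up)
    stretch : duration factor (1.5 = 50% longer)
    """
    f0 = np.asarray(f0, dtype=float)
    amps = np.asarray(amps, dtype=float)
    num_harm, num_samples = amps.shape
    out_len = int(round(num_samples * stretch))
    # Time scaling: read the contours at interpolated source instants.
    src = np.linspace(0.0, num_samples - 1.0, out_len)
    grid = np.arange(num_samples)
    f0_t = np.interp(src, grid, f0)
    amps_t = np.vstack([np.interp(src, grid, a) for a in amps])
    # Pitch shifting: new harmonics sample the original spectral envelope.
    f0_new = pitch * f0_t
    k = np.arange(1, num_harm + 1)
    new_amps = np.empty((num_harm, out_len))
    for i in range(out_len):
        new_amps[:, i] = np.interp(k * f0_new[i], k * f0_t[i], amps_t[:, i])
    new_amps[k[:, None] * f0_new[None, :] >= fs / 2.0] = 0.0   # avoid aliasing
    # Phases follow the new fundamental contour.
    phase0 = 2.0 * np.pi * np.cumsum(f0_new) / fs
    out = np.sum(new_amps * np.cos(k[:, None] * phase0[None, :]), axis=0)
    return out, f0_new, new_amps

# Flat 100 Hz contour with amplitudes 1/k, shifted up an octave and stretched.
fs = 8000
num = 800
f0 = np.full(num, 100.0)
amps = np.vstack([np.full(num, 1.0 / k) for k in range(1, 6)])
out, f0_new, new_amps = modify_periodic(f0, amps, fs, pitch=2.0, stretch=1.5)
```

After the shift, the new first harmonic at 200 Hz takes the amplitude that the envelope had at 200 Hz (the old second harmonic), so the formant structure stays in place while the pitch changes.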
5. Experimental Results
In this section an example of vocal processing is shown. The concerned processing system is aimed at pitch shifting in order to assist a singer.
The voice of the singer is analyzed by the proposed technique and then synthesized with pitch modifications to help the singer stay in tune with the accompaniment. The target pitch contour is predefined by analysis of a reference recording. Since only the pitch contour is changed, the source voice maintains its identity. The output signal, however, is damped in regions where the energy of the reference signal is low, in order to provide proper synchronization with the accompaniment. The reference signal, shown in Figure 9, is a recorded male vocal. The recording was made in a studio with a low level of background noise. The fundamental frequency contour was estimated from the reference signal as described in Section 3. As can be seen from Figure 10, the source vocal has a different pitch and is not completely noise free.
The source signal was analyzed using proposed harmonic analysis, and then the pitch shifting technique was applied as has been described above.
The synthesized signal with pitch modifications is shown in Figure 11. As can be seen, the output signal follows the pitch contour of the reference signal but retains the timbre and energy of the source voice. The noise part of the source signal (including background noise) remained intact.
The stochastic/deterministic model can be applied to voice processing systems. It provides efficient signal parameterization in a way that is convenient for making voice effects such as pitch shifting, timbre modification, and time-scale modification. The practical application of the proposed harmonic analysis technique has shown encouraging results. The described approach might be a promising solution to harmonic parameters estimation in speech and audio processing systems .
[1] Quatieri TF, McAulay RJ: Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing 1986, 34(6):1449-1464. 10.1109/TASSP.1986.1164985
[2] Spanias AS: Speech coding: a tutorial review. Proceedings of the IEEE 1994, 82(10):1541-1582. 10.1109/5.326413
[3] Serra X: Musical sound modeling with sinusoids plus noise. In Musical Signal Processing. Edited by: Roads C, Pope S, Piccialli A, De Poli G. Swets & Zeitlinger; 1997:91-122.
[4] Boashash B: Estimating and interpreting the instantaneous frequency of a signal. Proceedings of the IEEE 1992, 80(4):520-568. 10.1109/5.135376
[5] Maragos P, Kaiser JF, Quatieri TF: Energy separation in signal modulations with application to speech analysis. IEEE Transactions on Signal Processing 1993, 41(10):3024-3051. 10.1109/78.277799
[6] Abe T, Kobayashi T, Imai S: Harmonics tracking and pitch extraction based on instantaneous frequency. Proceedings of the 20th International Conference on Acoustics, Speech, and Signal Processing, May 1995 756-759.
[7] Abe T, Honda M: Sinusoidal model based on instantaneous frequency attractors. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(4):1292-1300.
[8] Azarov E, Petrovsky A, Parfieniuk M: Estimation of the instantaneous harmonic parameters of speech. Proceedings of the 16th European Signal Processing Conference (EUSIPCO '08), 2008, Lausanne, Switzerland
[9] Azarov I, Petrovsky A: Harmonic analysis of speech. Speech Technology 2008, (1):67-77.
[10] Zhang F, Bi G, Chen YQ: Harmonic transform. IEE Proceedings: Vision, Image and Signal Processing 2004, 151(4):257-263. 10.1049/ip-vis:20040604
[11] Weruaga L, Képesi M: The fan-chirp transform for non-stationary harmonic signals. Signal Processing 2007, 87(6):1504-1522. 10.1016/j.sigpro.2007.01.006
[12] Gabor D: Theory of communication. Proceedings of the IEE 1946, 93(3):429-457.
[13] Petrovsky A, Azarov E, Petrovsky AA: Harmonic representation and auditory model-based parametric matching and its application in speech/audio analysis. Proceedings of the 126th AES Convention, 2009, Munich, Germany 13. Preprint 7705
[14] Azarov E, Petrovsky A: Instantaneous harmonic analysis for vocal processing. Proceedings of the 12th International Conference on Digital Audio Effects (DAFx '09), September 2009, Como, Italy
[15] Levine S, Smith J: A sines+transients+noise audio representation for data compression and time/pitch scale modifications. Proceedings of the 105th AES Convention, September 1998, San Francisco, Calif, USA Preprint 4781
This work was supported by the Polish Ministry of Science and Higher Education (MNiSzW) in years 2009–2011 (Grant no. N N516 388836).
Azarov, E., Petrovsky (EURASIPMember), A. & Parfieniuk (EURASIPMember), M. High-Quality Time Stretch and Pitch Shift Effects for Speech and Audio Using the Instantaneous Harmonic Analysis. EURASIP J. Adv. Signal Process. 2010, 712749 (2010). https://doi.org/10.1155/2010/712749
- Instantaneous Frequency
- Harmonic Component
- Analysis Frame
- Periodic Part
- Pitch Contour