# High-Quality Time Stretch and Pitch Shift Effects for Speech and Audio Using the Instantaneous Harmonic Analysis

- Elias Azarov^{1}
- Alexander Petrovsky (EURASIP Member)^{1, 2} (email author)
- Marek Parfieniuk (EURASIP Member)^{2}

**2010**:712749

https://doi.org/10.1155/2010/712749

© Elias Azarov et al. 2010

**Received:** 6 May 2010

**Accepted:** 10 November 2010

**Published:** 30 November 2010

## Abstract

The paper presents methods for instantaneous harmonic analysis with application to high-quality pitch, timbre, and time-scale modifications. The analysis technique is based on narrow-band filtering using special analysis filters with frequency-modulated impulse response. The main advantages of the technique are high accuracy of harmonic parameters estimation and adequate harmonic/noise separation that allow implementing audio and speech effects with low level of audible artifacts. Time stretch and pitch shift effects are considered as primary application in the paper.

## 1. Introduction

Parametric representation of audio and speech signals has become an integral part of modern effect technologies. The choice of an appropriate parametric model largely determines the overall quality of the implemented effects. The present paper describes an approach to parametric signal processing based on deterministic/stochastic decomposition. The signal is considered as a sum of periodic (harmonic) and residual (noise) parts. The periodic part can be efficiently described as a sum of sinusoids with slowly varying amplitudes and frequencies, and the residual part is assumed to be an irregular noise signal. This representation was introduced in [1] and has since been studied extensively and significantly enhanced. The model provides good parameterization of both voiced and unvoiced frames and allows different modification techniques to be used for them. It ensures effective and simple processing in the frequency domain; however, the crucial point is the accuracy of the harmonic analysis. The harmonic part of the signal is specified by sets of harmonic parameters (amplitude, frequency, and phase) for every instant of time. A number of methods have been proposed to estimate these parameters. The majority of analysis methods assume local stationarity of the amplitude and frequency parameters within the analysis frame [2, 3]. This assumption simplifies the analysis procedure but degrades the accuracy of both parameter estimation and periodic/residual separation.

- (i) simplified closed-form expressions for instantaneous parameters estimation;
- (ii) pitch detection and smooth pitch contour estimation;
- (iii) improved harmonic parameters estimation accuracy.

The analysed signal is separated into periodic and residual parts and then processed through modification techniques. The processed signal can then be easily synthesized in the time domain at the output of the system. The deterministic/stochastic representation significantly simplifies the processing stage. As shown in the experimental section, the combination of the proposed analysis, processing, and synthesis techniques provides good quality of signal analysis, modification, and reconstruction.

## 2. Time-Frequency Representations and Harmonic Analysis

The harmonic parameters (amplitude, frequency, and phase of each component) are estimated by means of the sinusoidal (harmonic) analysis. The stochastic part can then be calculated as the difference between the source signal and the estimated sinusoidal part.
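In one common notation (the symbols here are assumptions chosen for illustration, not necessarily the paper's own), the deterministic/stochastic decomposition and the resulting residual can be written as:

```latex
s(n) = \sum_{k=1}^{K} A_k(n)\cos\varphi_k(n) + r(n),
\qquad
\varphi_k(n) = \varphi_k(0) + \frac{2\pi}{F_s}\sum_{i=1}^{n} f_k(i),

r(n) = s(n) - \sum_{k=1}^{K} \hat{A}_k(n)\cos\hat{\varphi}_k(n)
```

Here $s(n)$ is the source signal, $K$ the number of sinusoidal components, $A_k(n)$, $f_k(n)$, and $\varphi_k(n)$ the instantaneous amplitude, frequency (in Hz), and phase of the $k$-th component, $F_s$ the sampling frequency, $r(n)$ the residual, and hats denote estimated values.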

The discrete Fourier transform (DFT) gives a spectral representation of a signal frame by sinusoidal components of multiple frequencies. The balance between frequency and time resolution is defined by the length of the analysis frame. Because of the local stationarity assumption, the DFT can hardly provide an accurate estimate of frequency-modulated components, which has given rise to approaches such as the harmonic transform [10] and the fan-chirp transform [11]. The general idea of these approaches is to apply the Fourier transform to the time-warped signal.
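The effect of time warping can be illustrated with a small experiment. The sketch below is not the harmonic transform or the fan-chirp transform themselves; it only resamples a linear chirp onto a warped time axis, with a chirp rate `alpha` that is assumed to be known, and compares how concentrated the DFT energy is before and after. All names and numeric values are hypothetical.

```python
import cmath
import math


def hann_power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of a Hann-windowed frame via a direct DFT."""
    n = len(frame)
    windowed = [frame[i] * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i in range(n)]
    return [abs(sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                    for i in range(n))) ** 2
            for k in range(n // 2 + 1)]


def peak_concentration(power):
    """Fraction of the total energy held by the strongest bin and its two neighbours."""
    peak = max(range(len(power)), key=power.__getitem__)
    lo, hi = max(peak - 1, 0), min(peak + 1, len(power) - 1)
    return sum(power[lo:hi + 1]) / sum(power)


def warp_resample(frame, fs, alpha):
    """Resample onto a uniform grid of warped time tau = t + alpha * t**2 / 2 (alpha > 0)."""
    n = len(frame)
    t_max = (n - 1) / fs
    tau_max = t_max + 0.5 * alpha * t_max ** 2
    out = []
    for m in range(n):
        tau = m * tau_max / (n - 1)
        t = (math.sqrt(1.0 + 2.0 * alpha * tau) - 1.0) / alpha  # invert the warp
        pos = t * fs
        i = min(int(pos), n - 2)
        frac = pos - i
        out.append((1.0 - frac) * frame[i] + frac * frame[i + 1])  # linear interpolation
    return out


# Illustrative values: a 64 ms frame whose frequency rises as f0 * (1 + alpha * t).
fs, n, f0, alpha = 8000, 512, 500.0, 15.0
chirp = [math.cos(2 * math.pi * f0 * (i / fs + 0.5 * alpha * (i / fs) ** 2))
         for i in range(n)]

conc_plain = peak_concentration(hann_power_spectrum(chirp))
conc_warped = peak_concentration(hann_power_spectrum(warp_resample(chirp, fs, alpha)))
print(f"energy in peak +/- 1 bin: plain {conc_plain:.2f}, warped {conc_warped:.2f}")
```

In warped time the chirp becomes a stationary sinusoid, so its energy collapses into the main lobe of the window instead of being smeared over the swept band.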

The Hilbert transform and DESA can be applied only to monocomponent signals, because for multicomponent signals the notion of a single-valued instantaneous frequency and amplitude becomes meaningless. Therefore, the signal should be split into single components before using these techniques. It is possible to use narrow-band filtering for this purpose [6]. However, in the case of frequency-modulated components, this is not always possible, because such components can occupy a wide frequency band.
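The monocomponent restriction is easy to reproduce numerically. The following sketch, with hypothetical function names and test tones, builds the analytic signal with a direct DFT (equivalent to applying the Hilbert transform) and reads instantaneous amplitude and frequency from it, first for one sinusoid and then for a sum of two.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal via the DFT: drop negative frequencies, double positive ones."""
    n = len(x)
    spec = [sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
            for k in range(n)]
    for k in range(1, n):
        if k < n / 2:
            spec[k] *= 2.0
        elif k > n / 2:
            spec[k] = 0.0
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * i / n) for k in range(n)) / n
            for i in range(n)]


def instantaneous_parameters(x, fs):
    """Instantaneous amplitude (per sample) and frequency in Hz (per sample pair)."""
    z = analytic_signal(x)
    amp = [abs(v) for v in z]
    freq = [cmath.phase(z[i + 1] * z[i].conjugate()) * fs / (2 * math.pi)
            for i in range(len(z) - 1)]
    return amp, freq


fs, n = 2560, 256  # 10 Hz bin spacing, so both test tones are periodic in the frame
mono = [0.5 * math.cos(2 * math.pi * 200 * i / fs) for i in range(n)]
multi = [mono[i] + 0.3 * math.cos(2 * math.pi * 300 * i / fs) for i in range(n)]

amp_mono, freq_mono = instantaneous_parameters(mono, fs)
amp_multi, freq_multi = instantaneous_parameters(multi, fs)
print("mono:  amplitude", round(min(amp_mono), 3), "to", round(max(amp_mono), 3))
print("multi: amplitude", round(min(amp_multi), 3), "to", round(max(amp_multi), 3))
```

For the single tone the estimates are constant and correct. For the two-tone signal the "amplitude" beats at the difference frequency and the "frequency" wanders between the two tones, describing neither component.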

## 3. Instantaneous Harmonic Analysis

### 3.1. Instantaneous Harmonic Analysis of Nonstationary Harmonic Components

The center frequency contour of the analysis filter is adjusted within the analysis frame, providing narrow-band filtering of frequency-modulated components.
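A minimal sketch of this idea follows. It is not the paper's closed-form filter: it assumes the center frequency contour is already known, demodulates the frame along that contour, and uses a Hann window as an assumed low-pass prototype. The function names, the 8 kHz sampling rate, and the sweep are hypothetical.

```python
import cmath
import math


def contour_phase(freq_contour, fs, centre):
    """Phase (rad) of a frequency contour given in Hz per sample, zero at `centre`."""
    n = len(freq_contour)
    phase = [0.0] * n
    step = math.pi / fs  # trapezoidal rule: 2*pi * (f[i-1] + f[i]) / 2 / fs
    for i in range(centre + 1, n):
        phase[i] = phase[i - 1] + step * (freq_contour[i - 1] + freq_contour[i])
    for i in range(centre - 1, -1, -1):
        phase[i] = phase[i + 1] - step * (freq_contour[i] + freq_contour[i + 1])
    return phase


def estimate_component(frame, fs, freq_contour):
    """Amplitude and phase at the frame centre of the component tracking freq_contour."""
    n = len(frame)
    centre = n // 2
    phase = contour_phase(freq_contour, fs, centre)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    # Demodulate along the contour; the window then acts as the low-pass prototype.
    z = sum(frame[i] * window[i] * cmath.exp(-1j * phase[i]) for i in range(n))
    z *= 2.0 / sum(window)
    return abs(z), cmath.phase(z)


# Illustrative frame: 48 ms at 8 kHz, one component sweeping 500 -> 1000 Hz.
fs, n, true_amp = 8000, 384, 0.8
sweep = [500.0 + 500.0 * i / (n - 1) for i in range(n)]
true_phase = [p + 0.3 for p in contour_phase(sweep, fs, 0)]
frame = [true_amp * math.cos(p) for p in true_phase]

amp_fm, phase_fm = estimate_component(frame, fs, sweep)
amp_fixed, _ = estimate_component(frame, fs, [sweep[n // 2]] * n)
print(f"true {true_amp}, FM-adapted {amp_fm:.3f}, fixed centre frequency {amp_fixed:.3f}")
```

With the contour matched to the component, the narrow-band estimate recovers the amplitude and phase at the frame centre. A filter held at a fixed center frequency underestimates the amplitude, because the sweep leaves its passband.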

### 3.2. Filter Properties

### 3.3. Estimation Technique

In this subsection the general technique of sinusoidal parameters estimation is presented. The technique does not assume harmonic structure of the signal and therefore can be applied both to speech and audio signals [13].

The analysis was carried out using the following settings: an analysis frame length of 48 ms, an analysis step of 14 ms, filter bandwidths of 70 Hz, and the Hamming window as the windowing function. The synthesized periodic part is shown in Figure 6(b). As can be seen from the spectrogram, the periodic part contains only long sinusoidal components with high energy localization. The transients are left untouched in the residual signal, which is presented in Figure 6(c).

### 3.4. Speech Analysis

- (i) initial fundamental frequency contour estimation;
- (ii) harmonic parameters estimation with fundamental frequency adjustment.

In voiced speech analysis, the problem of initial fundamental frequency estimation comes down to finding a periodic component with the lowest possible frequency and sufficiently high energy. Within the possible fundamental frequency range (defined in this paper as a range in Hz), all periodic components are extracted, and then the suitable one is taken as the fundamental. In order to reduce computational complexity, the source signal is filtered by a low-pass filter before the estimation.
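The selection rule can be sketched as follows, assuming the periodic components have already been extracted as (frequency, energy) pairs. The search bounds and the relative energy threshold are hypothetical parameters, not the values used in the paper.

```python
def pick_fundamental(components, f_min, f_max, rel_energy=0.1):
    """Lowest-frequency component inside [f_min, f_max] with sufficiently high energy.

    components: iterable of (frequency_hz, energy) pairs.
    rel_energy: assumed threshold, relative to the strongest in-range component.
    Returns the chosen frequency in Hz, or None if no component is in range.
    """
    in_range = [(f, e) for f, e in components if f_min <= f <= f_max]
    if not in_range:
        return None
    threshold = rel_energy * max(e for _, e in in_range)
    return min(f for f, e in in_range if e >= threshold)


# Made-up components: a strong out-of-range hum, a weak spurious peak, three harmonics.
components = [(55.0, 0.30), (90.0, 0.01), (110.0, 1.00),
              (220.0, 0.80), (330.0, 0.50), (1500.0, 2.00)]
f0 = pick_fundamental(components, f_min=70.0, f_max=500.0)
print("initial fundamental estimate:", f0, "Hz")
```

The 90 Hz peak is rejected because its energy is too low, and the 55 Hz and 1500 Hz components because they lie outside the search range, so the 110 Hz component is selected.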

Results of synthetic speech analysis.

## 4. Effects Implementation

Since the periodic part of the signal is expressed by harmonic parameters, it is easy to synthesize the periodic part while slowing down or speeding up the tempo. The amplitude and frequency contours are interpolated at the respective moments of time, and then the output signal is synthesized. The noise part is parameterized by spectral envelopes and then time-scaled as described in [15]. Separate periodic/noise processing provides high-quality time-scale modifications with a low level of audible artifacts.
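For the periodic part, this procedure can be sketched as follows. The sketch makes several assumptions: contours are stored per harmonic at the sampling rate, interpolation is linear, and initial phases are set to zero for simplicity. The residual part is not handled here, and all names are hypothetical.

```python
import math


def interpolate(contour, pos):
    """Linearly interpolate a contour (at least two samples) at a fractional index."""
    i = min(int(pos), len(contour) - 2)
    frac = pos - i
    return (1.0 - frac) * contour[i] + frac * contour[i + 1]


def time_stretch(amps, freqs, fs, rate):
    """Resynthesize the periodic part with its duration scaled by `rate`.

    amps[k][n], freqs[k][n]: amplitude and frequency (Hz) contours of harmonic k.
    rate > 1 slows the tempo down, rate < 1 speeds it up; pitch is unchanged
    because frequencies are read from the contours, not scaled.
    """
    n_in = len(amps[0])
    n_out = int(round((n_in - 1) * rate)) + 1
    out = [0.0] * n_out
    for amp_k, freq_k in zip(amps, freqs):
        phase = 0.0  # simplification: original initial phases are discarded
        for m in range(n_out):
            pos = min(m / rate, n_in - 1)  # matching instant on the input time axis
            out[m] += interpolate(amp_k, pos) * math.cos(phase)
            phase += 2.0 * math.pi * interpolate(freq_k, pos) / fs
    return out


# Two stationary harmonics (100 Hz and 200 Hz) described by 101 contour samples.
fs, n_in = 1000, 101
amps = [[1.0] * n_in, [0.5] * n_in]
freqs = [[100.0] * n_in, [200.0] * n_in]

slow = time_stretch(amps, freqs, fs, rate=2.0)
fast = time_stretch(amps, freqs, fs, rate=0.5)
print(len(slow), len(fast))
```

Because the phase of each harmonic is accumulated from the interpolated frequency contour, the output stays phase-continuous at any stretch factor.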

## 5. Experimental Results

In this section, an example of vocal processing is shown. The processing system in question is aimed at pitch shifting in order to assist a singer.

The source signal was analyzed using the proposed harmonic analysis, and then the pitch shifting technique was applied as described above.

## 6. Conclusions

The stochastic/deterministic model can be applied to voice processing systems. It provides efficient signal parameterization in a way that is convenient for implementing voice effects such as pitch shifting, timbre modification, and time-scale modification. The practical application of the proposed harmonic analysis technique has shown encouraging results. The described approach might be a promising solution to harmonic parameters estimation in speech and audio processing systems [13].

## Declarations

### Acknowledgment

This work was supported by the Polish Ministry of Science and Higher Education (MNiSzW) in years 2009–2011 (Grant no. N N516 388836).

## Authors’ Affiliations

## References

1. Quatieri TF, McAulay RJ: Speech analysis/synthesis based on a sinusoidal representation. *IEEE Transactions on Acoustics, Speech, and Signal Processing* 1986, 34(6):1449-1464. doi:10.1109/TASSP.1986.1164985
2. Spanias AS: Speech coding: a tutorial review. *Proceedings of the IEEE* 1994, 82(10):1541-1582. doi:10.1109/5.326413
3. Serra X: Musical sound modeling with sinusoids plus noise. In *Musical Signal Processing*. Edited by: Roads C, Pope S, Piccialli A, De Poli G. Swets & Zeitlinger; 1997:91-122.
4. Boashash B: Estimating and interpreting the instantaneous frequency of a signal. *Proceedings of the IEEE* 1992, 80(4):520-568. doi:10.1109/5.135376
5. Maragos P, Kaiser JF, Quatieri TF: Energy separation in signal modulations with application to speech analysis. *IEEE Transactions on Signal Processing* 1993, 41(10):3024-3051. doi:10.1109/78.277799
6. Abe T, Kobayashi T, Imai S: Harmonics tracking and pitch extraction based on instantaneous frequency. *Proceedings of the 20th International Conference on Acoustics, Speech, and Signal Processing, May 1995* 756-759.
7. Abe T, Honda M: Sinusoidal model based on instantaneous frequency attractors. *IEEE Transactions on Audio, Speech and Language Processing* 2006, 14(4):1292-1300.
8. Azarov E, Petrovsky A, Parfieniuk M: Estimation of the instantaneous harmonic parameters of speech. *Proceedings of the 16th European Signal Processing Conference (EUSIPCO '08), 2008, Lausanne, Switzerland*
9. Azarov I, Petrovsky A: Harmonic analysis of speech. *Speech Technology* 2008, (1):67-77.
10. Zhang F, Bi G, Chen YQ: Harmonic transform. *IEE Proceedings: Vision, Image and Signal Processing* 2004, 151(4):257-263. doi:10.1049/ip-vis:20040604
11. Weruaga L, Képesi M: The fan-chirp transform for non-stationary harmonic signals. *Signal Processing* 2007, 87(6):1504-1522. doi:10.1016/j.sigpro.2007.01.006
12. Gabor D: Theory of communication. *Proceedings of the IEE* 1946, 93(3):429-457.
13. Petrovsky A, Azarov E, Petrovsky AA: Harmonic representation and auditory model-based parametric matching and its application in speech/audio analysis. *Proceedings of the 126th AES Convention, 2009, Munich, Germany* 13. Preprint 7705
14. Azarov E, Petrovsky A: Instantaneous harmonic analysis for vocal processing. *Proceedings of the 12th International Conference on Digital Audio Effects (DAFx '09), September 2009, Como, Italy*
15. Levine S, Smith J: A sines+transients+noise audio representation for data compression and time/pitch scale modifications. *Proceedings of the 105th AES Convention, September 1998, San Francisco, Calif, USA* Preprint 4781

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.