Joint DOA and multi-pitch estimation based on subspace techniques
© Zhang et al; licensee Springer. 2012
Received: 26 March 2011
Accepted: 2 January 2012
Published: 2 January 2012
In this article, we present a novel method for high-resolution joint direction-of-arrival (DOA) and multi-pitch estimation based on subspaces decomposed from a spatio-temporal data model. The resulting estimator is termed multi-channel harmonic MUSIC (MC-HMUSIC). Unlike traditional methods, it is capable of resolving sources under adverse conditions, for example when multiple sources impinge on the array from approximately the same angle or have similar pitches. The effectiveness of the method is demonstrated on simulated anechoic array recordings with source signals from real recorded speech and clarinet. Furthermore, a statistical evaluation with synthetic signals shows increased robustness in DOA and fundamental frequency estimation compared to a state-of-the-art reference method.
The problem of estimating the fundamental frequency, or pitch, of a periodic waveform has been of interest to the signal processing community for many years. Fundamental frequency estimators are important for many practical applications such as automatic note transcription in music, audio and speech coding, classification of music, and speech analysis. Numerous algorithms have been proposed for both the single- and multi-pitch scenarios [1–5]. The single-pitch problem is considered well-posed. However, in real-world signals, the multi-pitch scenario occurs quite frequently [2, 6]. Multi-pitch estimation algorithms are often based on, e.g., various modifications of the auto-correlation function [1, 7], maximum likelihood, optimal filtering, and subspace techniques [2, 3, 8]. In real-life recordings, problems such as frequency overlap of sources, reverberation, and colored noise strongly limit the performance of multi-pitch estimators, and estimators designed for single-channel recordings often use simplified signal models. One widely used simplification in multi-pitch estimators, for example, is the sparseness of the signal, where the frequency spectra of the sources are assumed not to overlap. This assumption may be appropriate when the sources consist of a mixture of several speech signals having different pitches. However, for audio signals it is less likely to be true. This is especially so in western music, where instruments are most often played in accord, which causes the harmonics to overlap or even coincide. With only a single-channel recording it is, therefore, hard, or perhaps even impossible, to estimate pitches with overlapping harmonics, unless additional information, such as a temporal or spectral model, is included.
Recently, multi-channel approaches have attracted considerable attention in both single- and multi-pitch scenarios. By exploiting the spatial information of the sources, more robust pitch estimators have been proposed [10–14]. Most of those multi-channel methods are still based on auto-correlation function-related approaches, although a few exceptions can be found in [15–18]. In direction-of-arrival (DOA) estimation, audio and speech signals are often modeled as broadband signals, whereas standard subspace methods such as MUSIC and ESPRIT are only defined for a narrow-band signal model and therefore cannot operate directly on broadband signals. One often used concept is band-pass filtering of broadband signals into subbands, where narrow-band estimators can be applied to each subband. In the narrow-band case, a delay in the signal is equivalent to a phase shift that depends on the frequency of each complex exponential. An alternative method is, however, as follows: since harmonic signals consist of sinusoidal components, we can model each source as multiple narrow-band signals with distinct frequencies arriving from the same DOA.
In this article, we propose a parametric method for joint fundamental frequency and DOA estimation based on subspace techniques, where the quantities of interest are jointly estimated using a MUSIC-like approach. We term the proposed estimator multi-channel multi-pitch harmonic MUSIC (MC-HMUSIC). The spatio-temporal data model used in MC-HMUSIC is based on the JAFE data model [21, 22]. Originally, the JAFE data model was used for jointly estimating the unconstrained frequencies and DOAs of complex exponentials using ESPRIT, which is referred to as the joint angle-frequency estimation (JAFE) algorithm. Other related work on joint frequency-DOA methods includes [23–25]. In this article, we have parametrized the harmonic structure of periodic signals in the signal model to model the fundamental frequency and the DOA of individual sources. An estimator is constructed for jointly estimating the parameters of interest. Incorporating the DOA parameter in finding the fundamental frequency may give better robustness against signals with overlapping harmonics. Similarly, it can be expected that the DOA can be found more accurately when the nature of the signal of interest is taken into account.
The remainder of this article is organized in four sections. In Section 2, we introduce some notation, the spatio-temporal signal model, for which we also derive the associated Cramér-Rao lower bound, and the JAFE data model. In Section 3, we present the proposed method. In Section 4, we present the experimental results obtained using the proposed method. Finally, in Section 5, we conclude on our work.
2.1. Spatio-temporal signal model
for sample index $n = 0, \ldots, N-1$, where subscript $k$ denotes the $k$th source and $l$ the $l$th harmonic. Moreover, $A_{l,k}$ is the real-valued positive amplitude of the complex exponential, $L_k$ is the number of harmonics, $K$ is the number of sources, $\omega_k$ is the fundamental frequency of the $k$th source, $\gamma_{l,k}$ is the phase of the individual harmonics, $\phi_k$ is the phase shift caused by the DOA, and $e_i(n)$ is complex symmetric white Gaussian noise. The phase shift between array elements is determined by the following quantities: $d$ is the spacing between the elements measured in wavelengths, $c$ is the speed of propagation in [m/s], $\theta_k$ is the DOA defined for $\theta_k \in [-90°, 90°]$, and $f_s$ is the signal sampling frequency. The problem of interest is to estimate $\omega_k$ and $\theta_k$. In the following, we assume that the number of sources $K$ is known and that the number of harmonics $L_k$ of the individual sources is known or found in some other, possibly joint, way. We note that a number of ways of doing this have been proposed in the past [2, 26–28].
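The signal model described above can be illustrated with a short simulation. The sketch below is not taken from the article. It assumes that sensor $i$ observes $\sum_k \sum_l A_{l,k} e^{j(\omega_k l n + \phi_k l i + \gamma_{l,k})} + e_i(n)$ with the sensor index starting at zero, that the inter-element phase shift is $\phi_k = \omega_k f_s d \sin\theta_k / c$ with $d$ expressed in metres, and that $c = 343$ m/s. The function name `synth_array_signal` and its arguments are hypothetical.

```python
import numpy as np


def synth_array_signal(omegas, thetas_deg, amps, phases, M, N, fs, d,
                       c=343.0, noise_std=0.0, seed=0):
    """Return an M x N matrix of array samples for K harmonic sources.

    omegas     : K fundamental frequencies in radians per sample
    thetas_deg : K DOAs in degrees
    amps       : K sequences, amps[k][l-1] = A_{l,k}
    phases     : K sequences, phases[k][l-1] = gamma_{l,k}
    d          : sensor spacing in metres (assumption, see lead-in)
    """
    rng = np.random.default_rng(seed)
    n = np.arange(N)[None, :]   # sample index as a row
    i = np.arange(M)[:, None]   # sensor index 0..M-1 as a column
    x = np.zeros((M, N), dtype=complex)
    for w, th, A, g in zip(omegas, thetas_deg, amps, phases):
        # Assumed inter-element phase shift of the fundamental.
        phi = w * fs * d * np.sin(np.deg2rad(th)) / c
        for l, (a, gam) in enumerate(zip(A, g), start=1):
            # The l-th harmonic scales both the temporal and spatial frequency.
            x += a * np.exp(1j * (w * l * n + phi * l * i + gam))
    # Complex white Gaussian noise with total variance noise_std**2.
    noise = noise_std * (rng.standard_normal((M, N))
                         + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    return x + noise
```

With a single source and a single harmonic, neighbouring rows of the returned matrix differ only by the factor $e^{j\phi_k}$, which is the narrow-band delay-to-phase equivalence that the model relies on.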
2.2. Cramér-Rao lower bound
2.3. The JAFE data model
where $\mathbf{E} \in \mathbb{C}^{M \times N}$ is a matrix containing $N$ samples of the noise vector $\mathbf{e}(n)$.
In speech and audio signal processing, it is common to model each source as a set of multiple harmonics with model order $L_k > 1$. Due to the narrow-band approximation of the steering vector, multiple complex components with distinct frequencies that impinge on the array from an identical DOA result in non-unique spatial frequencies, which gives the spatial frequencies $\phi_k l$, $\forall l$, a harmonic structure as well. Multiple sources that impinge on the array from different DOAs and consist of various frequency components may, for certain frequency combinations, give the same array steering vector, which causes the matrix $\mathbf{A}$ to be rank deficient. Normally, this ambiguous mapping of the steering vector is mitigated by band-pass filtering the signal into subbands, where the DOA of the signal is uniquely modeled by the narrow-band steering vector [20, Chap. 9].
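The steering-vector ambiguity can be checked numerically. The following sketch is an illustration under assumptions, not the article's code: it takes the spatial frequency of a component with temporal frequency $\omega$ and DOA $\theta$ to be $\omega f_s d \sin\theta / c$ (with $d$ in metres), and it picks a second component whose frequency and DOA are chosen so that the two spatial frequencies coincide. The helper `steering_vector` and all numerical values are hypothetical.

```python
import numpy as np


def steering_vector(omega, theta_deg, M, fs, d, c=343.0):
    """Narrow-band steering vector of an M-element uniform linear array."""
    # Assumed spatial frequency of a component at (omega, theta).
    phi = omega * fs * d * np.sin(np.deg2rad(theta_deg)) / c
    return np.exp(1j * phi * np.arange(M))


M, fs, d = 8, 8000.0, 0.04

# Component a: frequency 0.2 rad/sample arriving from 60 degrees.
omega_a, theta_a = 0.2, 60.0
# Component b: twice the frequency, with a DOA chosen so that
# omega_b * sin(theta_b) equals omega_a * sin(theta_a).
omega_b = 0.4
theta_b = np.rad2deg(np.arcsin(omega_a * np.sin(np.deg2rad(theta_a)) / omega_b))

A = np.column_stack([steering_vector(omega_a, theta_a, M, fs, d),
                     steering_vector(omega_b, theta_b, M, fs, d)])
# Two distinct (frequency, DOA) pairs, but only one independent column.
rank = np.linalg.matrix_rank(A, tol=1e-9)
```

Although the two components differ in both frequency and DOA, the spatial steering matrix has rank one, so a purely spatial model cannot separate them. This is the situation that the temporal dimension of the data model is meant to resolve.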
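where $\bar{\mathbf{A}}_t = [\mathbf{A}\ \ \mathbf{A}\boldsymbol{\Phi}\ \cdots\ \mathbf{A}\boldsymbol{\Phi}^{t-1}]^T$ and $\mathbf{B}_t = [\mathbf{b}\ \ \boldsymbol{\Phi}\mathbf{b}\ \cdots\ \boldsymbol{\Phi}^{N-t}\mathbf{b}]$. The number of complex exponentials that the temporally smoothed data matrix $\mathbf{X}_t$ can resolve is bounded; within that limit, the columns of $\bar{\mathbf{A}}_t$ are linearly independent for any distinct $\theta$ and $\omega$.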
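Temporal smoothing can be sketched as stacking time-shifted copies of the data matrix. The code below assumes the usual JAFE-style construction, in which $t$ shifted $M \times (N-t+1)$ blocks are stacked vertically, which matches the $t$ blocks in $\bar{\mathbf{A}}_t$ and the $N-t+1$ columns in $\mathbf{B}_t$. The function name `temporal_smooth` is hypothetical, and the exact construction used in the article may differ in details.

```python
import numpy as np


def temporal_smooth(X, t):
    """Stack t time-shifted copies of the M x N data matrix X.

    Returns a (t*M) x (N-t+1) matrix whose j-th block row is
    X[:, j:j+N-t+1], i.e. the data delayed by j samples.
    """
    M, N = X.shape
    if not 1 <= t <= N:
        raise ValueError("t must satisfy 1 <= t <= N")
    cols = N - t + 1
    return np.vstack([X[:, j:j + cols] for j in range(t)])
```

For noise-free data consisting of a single complex exponential, each block row is the previous one multiplied by a fixed phase factor, so the smoothed matrix keeps rank one while gaining $t$ times as many rows.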
where the matrices denoted $\mathbf{I}$, such as $\mathbf{I}_t \in \mathbb{R}^{t \times t}$, are identity matrices, and $\otimes$ is the Kronecker product.
It is interesting to note that the noise term $\mathbf{E}_{t,s}$ is no longer white after the spatio-temporal smoothing procedure, because the smoothing introduces correlation between the different rows of (23). A pre-whitening step can be implemented in (23) to mitigate this. We note, however, that according to previously reported results, pre-whitening is only of interest for signals with low SNR, where a minor estimation improvement can be achieved. In this article, the main interest is to propose a multi-channel joint DOA and multi-pitch estimator, for which reason the whitening process is left without further description. We also note that, aside from spatial smoothing, forward-backward averaging could also be implemented to reduce the influence of correlated sources [19, 22, 31].
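Forward-backward averaging, mentioned above as an option, is commonly implemented on a sample covariance matrix as $\mathbf{R}_{fb} = \tfrac{1}{2}(\mathbf{R} + \mathbf{J}\mathbf{R}^{*}\mathbf{J})$, where $\mathbf{J}$ is the exchange matrix and $^{*}$ denotes complex conjugation. The sketch below shows this textbook form for a uniform linear array. It is an assumption that the same form would be applied to the smoothed data in the article, and the function name `forward_backward` is hypothetical.

```python
import numpy as np


def forward_backward(R):
    """Forward-backward average of a square covariance estimate R."""
    # Exchange matrix: ones on the anti-diagonal.
    J = np.eye(R.shape[0])[::-1]
    # Average R with its conjugated, row- and column-reversed version.
    return 0.5 * (R + J @ R.conj() @ J)
```

The averaged matrix stays Hermitian and becomes persymmetric, which is the structure an ideal covariance matrix of a uniform linear array with uncorrelated sources would have.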
3. The proposed method
3.1. Coarse estimates
3.2. Refined estimates
with $\mathrm{Re}(\cdot)$ denoting the real part. The gradient can be used to find refined estimates using standard methods.
The method is initialized for i = 0 using the coarse estimates obtained from (32).
4. Experimental results
4.1. Signal examples
4.2. Statistical evaluation
Next, we use Monte Carlo simulations on synthetic signals embedded in noise to assess the statistical properties of the proposed method and compare it with the exact CRLB. As a reference method for pitch and DOA estimation, we use the JAFE algorithm for jointly estimating unconstrained frequencies and DOAs. The unconstrained frequencies are then grouped according to their corresponding DOAs, where closely related directions are grouped together. A fundamental frequency is formed from these grouped frequencies by a weighted combination. We refer to this as the WLS estimator. In order to remove errors due to erroneous amplitude estimates, we assume that the WLS estimator is given the exact signal amplitudes. The WLS estimator is a computationally efficient pitch estimation method with good statistical properties. The reference DOA estimate is obtained in a similar way from the mean value of these grouped DOAs.
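The reference procedure described above (group by DOA, then combine the grouped frequencies) can be sketched as follows. Several details are assumptions and not stated in the text: the weights are taken as the squared amplitudes, so that the fundamental is $\hat{\omega}_0 = \sum_l l A_l^2 \hat{\omega}_l / \sum_l l^2 A_l^2$; within a group the sorted frequencies are assigned to harmonics $l = 1, 2, \ldots$ with none missing; and grouping uses a simple tolerance on consecutive sorted DOAs. All function names and the tolerance value are hypothetical.

```python
import numpy as np


def wls_pitch(freqs, amps):
    """Weighted least-squares fit of omega_0 to freqs[l-1] ~ l * omega_0."""
    freqs = np.asarray(freqs, dtype=float)
    amps = np.asarray(amps, dtype=float)
    order = np.argsort(freqs)            # assume l-th smallest is harmonic l
    freqs, amps = freqs[order], amps[order]
    l = np.arange(1, freqs.size + 1)
    w = amps ** 2                        # assumed weights: squared amplitudes
    return float(np.sum(w * l * freqs) / np.sum(w * l ** 2))


def group_by_doa(doas, tol_deg):
    """Group indices whose sorted DOAs differ by at most tol_deg."""
    doas = np.asarray(doas, dtype=float)
    order = [int(j) for j in np.argsort(doas)]
    groups = [[order[0]]]
    for prev, cur in zip(order[:-1], order[1:]):
        if doas[cur] - doas[prev] <= tol_deg:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    return groups


def reference_estimates(freqs, doas, amps, tol_deg=2.0):
    """Return one (pitch, mean DOA) pair per DOA group."""
    freqs = np.asarray(freqs, dtype=float)
    doas = np.asarray(doas, dtype=float)
    amps = np.asarray(amps, dtype=float)
    return [(wls_pitch(freqs[idx], amps[idx]), float(np.mean(doas[idx])))
            for idx in group_by_doa(doas, tol_deg)]
```

For example, calling `reference_estimates` with five unconstrained (frequency, DOA) pairs that cluster around two directions yields two (pitch, DOA) pairs, ordered by increasing DOA.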
In this article, we have generalized the single-channel multi-pitch problem into a multi-channel multi-pitch estimation problem. To solve this new problem, we propose an estimator for joint estimation of the fundamental frequencies and DOAs of multiple sources. The proposed estimator is based on subspace analysis using a spatio-temporal data model. The method is shown to have potential in applications to real signals with simulated anechoic array recordings, and a statistical evaluation demonstrates its robustness in DOA and fundamental frequency estimation as compared to a state-of-the-art reference method. Furthermore, the proposed method is shown to have good statistical performance under adverse conditions, for example, for sources with similar DOAs or fundamental frequencies.
The study of Zhang was supported by the Marie Curie EST-SIGNAL Fellowship, Contract No. MEST-CT-2005-021175.
- Klapuri A: Automatic music transcription as we know it today. J New Music Res 2004, 33: 269-282.
- Christensen MG, Jakobsson A: Multi-Pitch Estimation. Synthesis Lectures on Speech and Audio Processing 2009.
- Rabiner L: On the use of autocorrelation analysis for pitch detection. IEEE Trans Signal Process 1996, 44: 2229-2244.
- Zhang JX, Christensen MG, Jensen SH, Moonen M: A robust and computationally efficient subspace-based fundamental frequency estimator. IEEE Trans Acoust Speech Language Process 2010, 18(3):487-497.
- de Cheveigne A, Kawahara H: YIN, a fundamental frequency estimator for speech and music. J Acoust Soc Am 2002, 111(4):1917-1930.
- Wang DL, Brown GJ: Computational Auditory Scene Analysis: Principle, Algorithm, and Applications. Wiley, IEEE Press, New York; 2006.
- Klapuri A: Multiple fundamental frequency estimation based on harmonicity and spectral smoothness. IEEE Trans Speech Audio Process 2003, 11: 804-816.
- Emiya V, Bertrand D, Badeau R: A parametric method for pitch estimation of piano tones. IEEE International Conference on Acoustics, Speech, and Signal Processing 2007, 1: 249-252.
- Rickard S, Yilmaz O: Blind separation of speech mixtures via time-frequency masking. IEEE Trans Signal Process 2004, 52: 1830-1847.
- Wohmayr M, Kepsi M: Joint position-pitch extraction from multichannel audio. Proceedings of the Interspeech 2007.
- Qian X, Kumaresan R: Joint estimation of time delay and pitch of voiced speech signals. Record of the Asilomar Conference on Signals, Systems, and Computers 1996, 2.
- Wrigley SN, Brown GJ: Recurrent timing neural networks for joint F0-localisation based speech separation. IEEE International Conference on Acoustics, Speech and Signal Processing 2007.
- Flego F, Omologo M: Robust F0 estimation based on a multi-microphone periodicity function for distant-talking speech. EUSIPCO 2006.
- Armani L, Omologo M: Weighted auto-correlation-based F0 estimation for distant-talking interaction with a distributed microphone network. IEEE International Conference on Acoustics, Speech and Signal Processing 2004, 1: 113-116.
- Chazan D, Stettiner Y, Malah D: Optimal multi-pitch estimation using the em algorithm for co-channel speech separation. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 1993.
- Liao G, So HC, Ching PC: Joint time delay and frequency estimation of multiple sinusoids. IEEE International Conference on Acoustics, Speech and Signal Processing 2001, 5: 3121-3124.
- Wu Y, So HC, Tan Y: Joint time-delay and frequency estimation using parallel factor analysis. Elsevier Signal Process 2009, 89: 1667-1670.
- Ngan LY, Wu Y, So HC, Ching PC, Lee SW: Joint time delay and pitch estimation for speaker localization. Proceedings of the IEEE International Symposium on Circuits and Systems 2003, 722-725.
- Stoica P, Moses R: Spectral Analysis of Signals. Prentice-Hall, Upper Saddle River; 2005.
- Brandstein M, Ward D: Microphone Arrays. Springer, Berlin; 2001.
- van der Veen AJ, Vanderveen M, Paulraj A: Joint angle and delay estimation using shift invariance techniques. IEEE Trans Signal Process 1998, 46: 405-418.
- Lemma AN, van der Veen AJ, Deprettere EF: Analysis of joint angle-frequency estimation using ESPRIT. IEEE Trans Signal Process 2003, 51: 1264-1283.
- Viberg M, Stoica P: A computationally efficient method for joint direction finding and frequency estimation in colored noise. Record of the Asilomar Conference on Signals, Systems, and Computers 1998, 2: 1547-1551.
- Lin JD, Fang WH, Wang YY, Chen JT: FSF MUSIC for joint DOA and frequency estimation and its performance analysis. IEEE Trans Signal Process 2006, 54: 4529-4542.
- Wang S, Caffery J, Zhou X: Analysis of a joint space-time doa/foa estimator using MUSIC. IEEE International Symposium on Personal, Indoor and Mobile Radio Communications 2001, B138-B142.
- Christensen MG, Stoica P, Jakobsson A, Jensen SH: Multi-pitch estimation. Elsevier Signal Process 2008, 88(4):972-983.
- Christensen MG, Jakobsson A, Jensen SH: Joint high-resolution fundamental frequency and order estimation. IEEE Trans Acoust Speech Signal Process 2007, 15(5):1635-1644.
- Zhang JX, Christensen MG, Jensen SH, Moonen M: An iterative subspace-based multi-pitch estimation algorithm. Elsevier Signal Process 2011, 91: 150-154.
- Lemma AN: ESPRIT based joint angle-frequency estimation algorithms and simulations. PhD Thesis Delft University 1999.
- Shu T, Liu XZ: Robust and computationally efficient signal-dependent method for joint DOA and frequency estimation. EURASIP J Adv Signal Process 2008, 2008: Article ID 10.1155/2008/134853.
- Krim H, Viberg M: Two decades of array processing research-the parametric approach. IEEE SP Mag 1996.
- Christensen MG, Jakobsson A, Jensen SH: Multi-pitch estimation using Harmonic MUSIC. Record of the Asilomar Conference on Signals, Systems, and Computers 2006, 521-525.
- Christensen MG, Jakobsson A, Jensen SH: Sinusoidal order estimation using angles between subspaces. EURASIP J Adv Signal Process 2009, 1-11. Article ID 948756.
- Veen BDV, Buckley KM: Beamforming: a versatile approach to spatial filtering. IEEE ASSP Mag 1988, 5: 4-24.
- Li H, Stoica P, Li J: Computationally efficient parameter estimation for harmonic sinusoidal signals. Elsevier Signal Process 2000, 1937-1944.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.