Skip to main content

Joint DOA and multi-pitch estimation based on subspace techniques


In this article, we present a novel method for high-resolution joint direction-of-arrivals (DOA) and multi-pitch estimation based on subspaces decomposed from a spatio-temporal data model. The resulting estimator is termed multi-channel harmonic MUSIC (MC-HMUSIC). It is capable of resolving sources under adverse conditions, unlike traditional methods, for example when multiple sources are impinging on the array from approximately the same angle or similar pitches. The effectiveness of the method is demonstrated on a simulated an-echoic array recordings with source signals from real recorded speech and clarinet. Furthermore, statistical evaluation with synthetic signals shows the increased robustness in DOA and fundamental frequency estimation, as compared with to a state-of-the-art reference method.

1. Introduction

The problem of estimating the fundamental frequency, or pitch, of a period waveform has been of interest to the signal processing community for many years. Fundamental frequency estimators are important for many practical applications such as automatic note transcription in music, audio and speech coding, classification of music, and speech analysis. Numerous algorithms have been proposed for both the single- and multi-pitch scenarios [15]. The problem for single-pitch scenarios is considered as well-posed. However, in real-world signals, the multi-pitch scenario occurs quite frequently [2, 6]. The multi-pitch estimation algorithms are often based on, i.e., various modification of the auto-correlation function [1, 7], maximum likelihood, optimal filtering, and subspace techniques [2, 3, 8]. In real-life recordings, problems such as frequency overlap of sources, reverberation, and colored noise will strongly limit the performance of multi-pitch estimator and estimator designed for single channel recordings often use simplified signal models. One widely used signal simplification in multi-pitch estimators, for example, is the sparseness of the signal, where the frequency spectrum of sources are assumed to not overlap [2]. This assumption may be appropriate when sources consist of mixture of several speech signals having different pitches [9]. However, for audio signals it is less likely to be true. This is especially so in western music, where instruments are most often played in accord, something that causes the harmonics to overlap or even coincide. With only single-channel recording it is, therefore, hard, or perhaps even impossible, to estimate pitches with overlapping harmonics, unless additional information, such as a temporal or spectral model, is included.

Recently, multi-channel approaches have attracted considerable attention both in single- and multi-pitch scenarios. By exploring the spatial information of the sources, more robust pitch estimators have been proposed [1014]. Most of those multi-channel methods are still mainly based on auto-correlation function-related approaches, however, although a few exceptions can be found in [1518]. In direction-of-arrival (DOA) estimators, audio and speech signals are often modeled as broadband signal, and standard subspace methods such as MUSIC and ESPRIT are only defined for narrow-band signal model, which then fail to directly operate on broadband signals [19]. One often used concept is band-pass filtering of broadband signals into subbands, where narrow-band estimators can be applied to each subband [20]. In the narrow-band case, a delay in the signal is equivalent to a phase shifts according to the frequencies of complex exponentials. An alternative method is, however, as follows: since harmonic signals consist of sinusoidal components, we can model each source as multiple narrow-band signal with distinct frequencies arriving at the same DOA.

In this article, we propose a parametric method for solving the problem of joint fundamental frequency and DOA estimation based on subspace techniques where the quantities of interest are jointly estimated using a MUSIC-like approach. We term the proposed estimator Multi-channel multi-pitch Harmonic MUSIC (MC-HMUSIC). The spatio-temporal data model used in MC-HMUSIC is based on the JAFE data model [21, 22]. Originally, the JAFE data model was used for estimating joint unconstrained frequencies and DOAs estimates of complex exponential using ESPRIT, which is referred as joint angle-frequency estimation (JAFE) algorithm. Other-related work with joint frequency-DOA methods includes [2325]. In this article, we have parametrized the harmonic structure of periodic signals in the signal model to model the fundamental frequency and the DOA of individual sources. An estimator is constructed for jointly estimating the parameters of interest. Incorporating the DOA parameter in finding the fundamental frequency may give better robustness against a signal with overlapping harmonics. Similarly, it can be expected that the DOA can be found more accurately when the nature of the signal of interest is taken into account.

The remainder of this article is comprised four sections: Section 2, in which we will introduce some notation, the spatio-temporal signal model, for which we also derive the associated Cramér-Rao lower bound, along with the JAFE data mode; Section 3, where we then present the proposed method; Section 4, in which we present the experimental results obtained using the proposed method; and, finally, Section 5, where we conclude on our work.

2. Fundamentals

2.1. Spatio-temporal signal model

Next, the signal model employed throughout the article will be presented. Without multi-path propagation of sources, it is given as follows: the signal x i received by microphone element i arranged in a uniform linear array (ULA) configuration, i = 1,..., M, is given by

x i ( n ) = k = 1 K l = 1 L k β l , k e j ( ω k l n + ϕ k l ( i - 1 ) ) + e i ( n ) , β l , k = A l , k e j γ l , k ,

for sample index n = 0,..., N - 1, where subscript k denotes the k th source and l the l th harmonic. Moreover, A l,k is the real-valued positive amplitude of the complex exponential, L k is the number of harmonics, K is number of sources, γ l,k is the phase of the individual harmonics, ϕ k is the phase shift caused by the DOA, and e i (n) is complex symmetric white Gaussian noise. The phase shift between array elements is given as ϕ k = ω k f s d c sin ( θ k ) , where d is the spacing between the elements measured in wavelengths, c is the speed of propagation in unit [m/s], θ k is the DOA defined for θ k [-90°, 90°], f s is the signal sampling frequency. The problem of interest is to estimate ω k and θ k . We in the following assume that the number of sources K is known and the number of harmonics L k of individual sources is known or found in some other, possibly joint, way. We note that a number of ways of doing this has been proposed in the past [2628, 2].

2.2. Cramér-Rao lower bound

We will now proceed to derive the exact Cramér-Rao lower bound (CRLB) for the problem of estimating the parameters of interest. First, we define the M × 1 deterministic signal model vector s(n, μ) with column element as

s i ( n , μ ) = k = 1 K l = 1 L k β l , k e j ( ω k l n + ϕ k l ( i - 1 ) ) , β l , k = A l , k e j γ l , k ,

where s(n, μ) = [s1(n, μ) ... s M (n, μ)]T. Furthermore, the parameter vector μ is given by

μ = [ ω 1 ω K θ 1 θ K A 1 , 1 γ 1 , 1 A L K , K γ L K , K ] .

Recall that the observed signal vector with additive white noise is given by

x ( n ) = s ( n , μ ) + e ( n ) = s 1 ( n , μ ) s M ( n , μ ) + e ( n ) ,

with e(n) being the noise column vector. The CRLB is defined as the variance of an unbiased estimate of the p th element of μ, which is lower bounded as

var ( μ p ) [ C - 1 ] p p ,

where C is the so-called Fisher information matrix given by

C = 2 σ 2 Re n = 0 N - 1 s ( n , μ ) H μ s ( n , μ ) μ T .

The partial derivative matrix is denoted as

s ( n , μ ) μ = s 1 ( n , μ ) μ s M ( n , μ ) μ ,

where vector s i ( n , μ ) μ is the partial derivatives with respect to the entries in the vector μ. The expression for the columns in s ( n , μ ) μ is given as

s i ( n , μ ) μ = l = 1 L 1 j l n + ( i - 1 ) f s d c sin ( θ 1 ) β l , 1 e j ( ω 1 l n + ϕ 1 l ( i - 1 ) ) l = 1 L K j l n + ( i - 1 ) f s d c sin ( θ K ) β l , K e j ( ω K l n + ϕ K l ( i - 1 ) ) l = 1 L 1 j l ( i - 1 ) ω 1 f s d c cos ( θ 1 ) β l , 1 e j ( ω 1 l n + ϕ 1 l ( i - 1 ) ) l = 1 L K j l ( i - 1 ) ω K f s d c cos ( θ K ) β l , K e j ( ω K l n + ϕ K l ( i - 1 ) ) e j γ 1 , 1 e j ( ω 1 n + ϕ 1 ( i - 1 ) ) j A 1 , 1 e j γ 1 , 1 e j ( ω 1 n + ϕ 1 ( i - 1 ) ) e j γ L K , K e j ( ω K L K n + ϕ K L K ( i - 1 ) ) j A L K , K e j γ L K , K e j ( ω K L K n + ϕ K L K ( i - 1 ) ) .

2.3. The JAFE data model

Next, we will introduce the specifics of the JAFE data model [22, 29] that our method is based on. At a time instant n the received signal from the M array elements are x(n) = [x1(n) x2(n) ... x M (n)]T, which can be written as

x ( n ) = A Φ b + e ( n ) ,

where e(n) M×1 is the noise vector, and A = [A1 ... A K ] is a Vandermonde matrix containing parameters ω k and θ k for sources k = 1, . . . , K, i.e.,

A k = a ( θ k , ω k 1 ) a ( θ k , ω k L k ) ,

with a(θ, ω) being the array steering vector given by

a ( θ , ω ) = 1 e j ω f s d c ( M - 1 ) sin ( θ ) T .

Here, (·)T denotes the vector transpose. Unlike the steering vector defined in [22, 21], where only the DOA is parametrized, here, a general definition of the vector (11) is used, in which it depends on both θ and ω [29]. The frequency components are expressed in Φ n = diag Φ 1 n Φ K n where the matrix for each source is given by

Φ k = diag e j ω k e j ω k L k .

The complex amplitudes for involving components are represented by the following vector:

b = β 1 , 1 β L 1 , 1 β 1 , K β L K , K T .

To capture the temporal behavior, N time-domain data samples of the array output x(n) are collected to form the M × N data matrix X, which is defined as

X = x ( n ) x ( N ) .

Due to the structure of the harmonic components, the data matrix is given by

X = A b Φ b Φ N - 1 b + E ,

where E M×N is a matrix containing N sample of the noise vector e(n).

In speech and audio signal processing, it is common to model each source as a set of multiple harmonics with model order L k > 1. Due to the narrow-band approximation of the steering vector, the multiple complex components with distinct frequencies impinge on the array with identical DOA will result in a non-unique spatial frequencies which cause a harmonic structure in the spatial frequencies ϕ k l l as well. The multiple sources impinge on the array with different DOAs consisting of various frequency components may, for certain frequency combinations, give the same array steering vector, which cause the matrix A to be rank deficient. Normally, this ambiguous mapping of the steering vector is mitigated by band-pass filtering the signal into its subbands, where the DOA of the signal is uniquely modeled by the narrow-band steering vector [20, Chap. 9].

Here, the ambiguities and the rank-deficiency are avoided by introducing temporal smoothness in order to restore the rank of A. The temporally smoothed data matrix is obtained by stacking t times temporally shifted versions of the original data matrix [22, 21, 29], given as

X t = A [ b Φ b Φ N - t b ] A Φ [ b Φ b Φ N - t b ] A Φ t - 1 [ b Φ b Φ N - t b ] + E t ,

where X t tM×N-t+1 is the temporally smoothed data matrix, and E t is the noise term constructed from E in a similar way as X t . In using the signal model where the amplitudes are assumed stationary for n = 0, . . . , N - 1, X t can be factorized as

X t = A A Φ A Φ t - 1 b Φ b Φ N - t b + E t .

With some additional definitions, we can also write this expression more compactly as

X t = A ̄ t B t + E t ,

where Ā t = [A ... t-1]T and B t = [b Φb ... ΦN-t b]. The temporally smoothed data matrix X t can maximally resole up to tM k = 1 K L k complex exponentials, where Ā t is linearly independent for any distinct θ and ω [30].

When multiple sources with distinct DOA with the same fundamental frequency impinge on the array, it will result in correlation between the underlying signals, which will make it harder to separate the corresponding components into its eigenvectors [22, 31]. To mitigate this problem, spatial smoothing is introduced, which works as follows. An array of M sensors is subdivided into S subarrays. In this article, the subarrays are spatially shifted with one element in each subarrays, the number of elements in each subarray being M S = M - S + 1. For s = 1, . . . , S, let J s t M s × t M be the selection matrix corresponding to the s th subarray for the data matrix X t . Then, the spatio-temporally smoothed data matrix X t , S t M s × S ( N - t + 1 ) is given by

X t , s = J 1 X t J S X t .

Furthermore, X t,s can be factorized as

X t , s = J 1 A ̄ t J S A ̄ t B t B t + E t , s ,

where E t,s is the noise term constructed from E in a similar way as X t,s . Using the shift invariance structure in A m , the term J s A m for s = 1, . . . , S is given by

J s A ̄ t = J 1 A ̄ t Θ s - 1 ,


Θ = diag e j ϕ 1 1 e j ϕ 1 L 1 e j ϕ K 1 e j ϕ K L K ,

which is simply the phase difference between array elements. With (21), the matrix X t,s can be written in a compact form as

X t , s = J 1 A ̄ t B t Θ B t Θ S - 1 B t + E t , s ,

with selection matrix expressed as

J 1 = I t [ I M s 0 ] ,

where I t t×t and I M s M s × M s are the identity matrices, is the Kroneker product as defined in [22].

It is interesting to note that the noise term E t,s is no longer white due to the spatio-temporal smoothing procedure, as correlation between the different rows of (23) is obtained. A pre-whitening step can be implemented in (23) to mitigate this. We note, however, that according to results reported in [22], pre-whitening step is only interesting for signals with low SNR where minor estimation improvement can be achieved. In this article, the main interest is to propose a multi-channel joint DOA and multi-pitch estimator, for which reason the whitening process is left without further description, but we refer the interested reader to [22]. We also note that aside from spatial smoothing, forward-backward averaging could also be implemented to reduce the influence of the correlated sources [22, 31, 19].

3. The proposed method

3.1. Coarse estimates

From the final spatio-temporally smoothed data matrix, a basis for the signal and noise subspaces can be obtained as follows. The singular value decomposition (SVD) of the data matrix (23) is given by

X t , s = U Σ V H ,

where the columns of U are the singular vectors, i.e.,

U = u 1 u t M S .

A basis of the orthogonal complement of the signal subspace, also called the noise subspace, is formed from singular vector associated with the mM S - Q least significant singular values, i.e.,

G = u Q + 1 u m M S ,

with Q = k = 1 K L k being the total number of complex exponentials in the signal. Similarly, the signal subspace is spanned by the Q largest singular values, i.e.,

S = u 1 u Q .

The defined signal subspace and noise subspace have similar property as traditional subspaces where estimators such as joint DOA and frequency, or fundamental frequency estimators can be constructed using the principle used in MUSIC [19, 32, 27, 26, 4]. According to the signal noise subspace orthogonality principle, the following relationship holds:

J 1 A ̄ t G = 0 ,

where we, for notational simplicity, have introduced J1Ā t = A ts . The matrix A ts is comprised Vandermonde matrices for sources k = 1, . . . , K. The matrix for each individual source is given by

A t s , k = 1 1 e j ϕ k e j ϕ k L k e j ϕ k S e j ϕ k L k S e j ω k ( t - 1 ) e j ω k L k ( t - 1 ) e j ϕ k e j ω k ( t - 1 ) e j ϕ k L k e j ω k L k ( t - 1 ) e j ϕ k S e j ω k ( t - 1 ) e j ϕ k L k S e j ω k L k ( t - 1 ) .

The cost function of the proposed joint DOA and multi-pitch estimator is then

J ( ω k , θ k ) = A t s , k H G F 2 ,

where ||·|| F is the Frobenius Norm. Note that this measure is closely related to the angles between the subspaces as explained in [33] and can hence be used as a measure of the extent to which (29) holds for a candidate fundamental frequency and DOA. The pair of fundamental frequency and DOA can, therefore, be found as the combination that is the closest to being orthogonal to G, i.e.,

{ ω k , θ k } k = 1 K = arg min { θ k } 1 K , { ω k } 1 K A t s , k H G F 2 .

The multi-channel estimators will have a cost function which is more well-behaved compared to those of single channel multi-pitch estimators (see, e.g., [26, 32, 28] for some examples of such).

3.2. Refined estimates

For many applications, only a coarse estimate of involved fundamental frequencies and DOAs are needed, in which case the cost function in (32) is evaluated on pre-defined search region with some specified granularity. If, however, very accurate estimates are desired, a refined estimate can be found as described next. For a rough estimate of the parameter of interests, refined estimates are obtained by minimizing the cost function in (32) using a cyclic minimization approach. The gradient of the cost function (32) for fundamental frequency and DOA are given as

ω k J ( ω k , θ k ) = 2 Re Tr A t s , k H G G H ω k A t s , k ,
θ k J ( ω k , θ k ) = 2 Re Tr A t s , k H G G H θ k A t s , k ,

with Re (·) denoting the real value. The gradient can be used for finding refined estimate using standard methods.

Here, we iteratively find a refined estimate using a cyclic approach. During an iteration, ω k is first estimated with

ω ^ k i + 1 = ω ^ k i - δ ω k J ( ω ^ k i , θ ^ k i ) ,

where i is the iteration index and δ is a small positive constant that is found using line search. The estimated ω ^ k i + 1 is then used to initialize the minimization function for DOA, which is then found as

θ ^ k i + 1 = θ ^ k i - δ θ k J ( ω ^ k i + 1 , θ ^ k i ) .

The method is initialized for i = 0 using the coarse estimates obtained from (32).

4. Experimental results

4.1. Signal examples

We start the experimental part of this article by illustrating the application of the proposed method to analyzing a mixed signal consisting of speech and clarinet signals, sampled at f s = 8000 Hz. The single-channel signals are converted into a multi-channel signal by introducing different delays according to two pre-determined DOA to simulate a microphone array with M = 8 channels. The simulated DOAs of the speech and the clarinet signals are, respectively, θ1 = -45° and θ2 = 45°. The spectrogram of the mixed signal of the first channel is illustrated in Figure 1. To avoid spatial ambiguities, the distance between two sensor is half the wavelength of the highest frequency in the observed signal, here d = 0.0425 m. The mixed signal is segmented into 50% overlapped signal segments with N = 128. The user parameter selected in this experiment is t= 2 N 3 and s= M 2 . The cost function is evaluated with a Vandermonde matrix with L = 5 complex exponentials, and the noise subspace is formed from an overestimated signal subspace with assumption of signal subspace containing N/2 = 64 complex exponentials. The signal subspace overestimation technique is usually used when the true order of the signal subspace is unknown, the signal subspace is assumed to be larger than the true one which can minimize the signal subspace components in the noise subspace. An added benefit of posing the problem as a joint estimation problem is that the multi-pitch estimation problem can be seen as several single-pitch problems for a distinct set of DOAs, one per source. Therefore, it is less important to select an exact signal model order than single-channel multi-pitch estimators would need [28]. The cost function is evaluated for frequencies from 100 to 500 with granularity of 0.52 Hz. The evaluated results are illustrated in Figure 2 where the upper panel contains the fundamental frequency estimates and lower panel the DOA estimates. It can be seen that the proposed algorithm can track the fundamental frequency and the DOA of the speech signal well, with only a few observed errors on regions with low signal energy. The clarinet signal's DOA and fundamental frequencies have also been estimated well for all segments.

Figure 1

The mixed spectrogram of the real recorded speech and clarinet signal.

Figure 2

The estimation results using the proposed methods: (a) fundamental frequency, (b) the DOA with the horizontal axis denoting time axis.

For the purpose of further comparison, the same signal will be analyzed using a standard time delay-and-sum beamformer [34] for DOA estimates and a single-channel maximum-likelihood based pitch estimator applied on the beamformed output signals [2]. The results are shown in Figure 3. The figure clearly shows that the delay-sum beamformer cannot satisfactory resolve the DOAs with M = 8 array elements which will further affect the performance of the single-channel pitch estimator, as shown in the upper panel. In this example, the proposed algorithm shown in Figure 2 is superior compared to reference method shown in Figure 3. The low resolution performance of the reference method will make the statistical evaluation of this method uninteresting, and we, therefore, will not be using it any further in the experiments to follow.

Figure 3

The estimates of (a) the fundamental frequency using maximum-likelihood estimator at the output of the beamformer, (b) the DOA using a delay-sum beamformer.

4.2. Statistical evaluation

Next, we use Monte Carlo simulations evaluated on synthetic signals embedded in noise in assessing the statistical properties of the proposed method and compare it with the exact CRLB. As a reference method for pitch and DOA estimation, we use the JAFE algorithm proposed in [22] for jointly estimating unconstrained frequencies and DOAs. Next, the unconstrained frequencies are grouped according to their corresponding DOAs where closely related directions are grouped together. A fundamental frequency is formed from these grouped frequencies in a weighted way as proposed in [35]. We refer this as the WLS estimator. In order to remove the errors due to the erroneous estimate of amplitudes, we assume WLS having the exact signal amplitude given. The WLS estimator is a computationally efficient pitch estimation method with good statistical properties. The reference DOA estimate is easily obtained in a similar way from the mean value of these grouped DOAs according to [22].

Here, we consider a M = 8 element ULA with sensor distance d = 0.0425 with a sampling frequency of f s = 8000. The estimators are evaluated for two signal setups, first with two sources having ω1 = 252.123 and ω2 = 300.321 with L1,2 = 3, and second with one harmonic source of ω1 = 252.123 and L1 = 3. All amplitudes on individual harmonics are set to unity A k,l = 1 for tractability. Both sources are assumed to be far-field sources impinging on the array with DOAs at θ1 = -43.23° and θ2 = 70°, respectively, and for one source having a DOA of θ1 = -43.23°. All simulation results are based on 100 Monte Carlo runs. The performance is measured using the root mean squared estimation error (RMSE) as defined in [28, 32, 26, 27]. The user parameter for JAFE data model is selected to the optimal values as proposed in [22] with temporal and spatial smoothness parameters, t= 2 N 3 and s= M 2 , respectively. We note that in practical applications, the computational complexity has to also be considered in selecting the appropriate parameters t and s. An example of the 2-dimensional (2D) cost function of our proposed method evaluated on two mixed signal is illustrated in Figure 4, where a coarser estimate of the DOA and fundamental estimates can be identified from the two peaks in the 2D cost function.

Figure 4

Example of cost functions for two synthetic sources having three harmonic each, N = 64 and M = 8. The true fundamental frequency of ω1 = 252.123 and ω2 = 300.321 having DOA θ1 = -43.23° and θ2 = 70°, respectively.

In the first simulation, we evaluate the proposed method's statistical properties in a single source scenario for varying sample lengths and SNRs. The RMSEs on signal with varying N are shown in Figure 5, and with varying SNR in Figure 6. It can be seen from these figures that both estimators perform well for all SNR above 0 dB with WLS being slightly better for fundamental frequency estimation while the proposed estimator is better in DOA estimation. Both methods are also able to follow CRLB closely for around sample length N > 60. The better DOA estimation capabilities of the proposed method can be explained by the joint estimation of the fundamental frequency and DOA, which leads to increased robustness under adverse conditions. Both estimators can be considered as consistent in the single-pitch scenario.

Figure 5

RMSE as a function of N for SNR = 40 dB evaluated on single-pitch signal with unit amplitude: (a) fundamental frequency estimates; (b) DOA estimates.

Figure 6

RMSE as a function of SNR for N = 64 evaluated on single-pitch signal with unit amplitudes: (a) fundamental frequency estimates; (b) DOA estimates.

Next, we evaluate our method for the multi-pitch scenario. The so-obtained RMSEs for varying N and SNR are depicted in Figures 7 and 8. In Figure 7, it clearly shows that the proposed method is better than the WLS estimator for short sample lengths. The WLS estimator is not following CRLB until N > 80 samples while the proposed estimator is for N > 64. The remaining gap between CRLB and both evaluated estimators for N > 80 are due to the mutual interference between the harmonic sources. The slowly converging performance of WLS is mainly due to the bad estimate of the unconstrained frequency estimate using the JAFE method. With our selected simulation setup, the JAFE estimator is not giving consistent estimates for all harmonic components, which, in turn, results in poor performance in the WLS estimates. In general, the WLS estimator is sensitive to spurious estimate of the unconstrained frequencies. Moreover, the proposed estimator, which is jointly estimating both the DOA and the fundamental frequency, yields better estimates for smaller sample length N. The results in terms of RMSEs for varying SNRs are shown in Figure 8. This figure shows that the proposed estimator is again more robust than the WLS estimator for both DOA and fundamental frequency estimation.

Figure 7

RMSE as a function of N for SNR = 40 dB evaluated on multi-pitch signal with unit amplitudes: (a) joint fundamental frequency estimates; (b) joint DOA estimates.

Figure 8

RMSE as a function of SNR for N = 64 evaluated on multi-pitch signal with unit amplitudes: (a) joint fundamental frequency estimates; (b) joint DOA estimates.

In next two experiments, we will study the performance as a function of the difference in fundamental frequencies and DOAs for multiple sources. We start with studying the RMSE as a function of the difference between the fundamental frequencies of two harmonic sources, i.e., Δω = |ω1 - ω2|, with θ1 = -43.321° and θ2 = 70°. Here, we use an SNR set to 40 dB, and a sample length N = 64 with M = 8 array elements. The obtained RMSEs are shown in Figure 9. The figure clearly shows that both methods can successfully estimate the fundamental frequencies and DOAs. Once again the proposed estimator gives more robust estimates, close to the CRLB. Additionally, it should be noted that both methods are correctly estimating the DOA even when the both fundamental frequencies are identical ω1 = ω2, something that would not be possible with only a single channel. MC-HMUSIC has the ability to estimate the fundamental frequencies when both harmonics are identical provided that the DOAs are distinct and vice versa. Estimation of the parameters of signals with overlapping harmonics is a crucial limitation in multi-pitch estimation using only single-channel recordings. In the final experiment, the RMSE as a function of the difference between the DOAs of two harmonic sources Δθ = |θ1 - θ2| is analyzed for an SNR set to 40 dB and a sample length of N = 64 with M = 8 array elements. The fundamental frequencies are ω1 = 252.123 and ω2 = 300.321, respectively. The observations and conclusions are basically the same as before, with the proposed method outperforming the reference method so far.

Figure 9

RMSE as a function of Δ ω: (a) joint fundamental frequency estimates; (b) joint DOA estimates.

5. Conclusion

In this article, we have generalized the single-channel multi-pitch problem into a multi-channel multi-pitch estimation problem. To solve this new problem, we propose an estimator for joint estimation of fundamental frequencies and DOAs of multiple sources. The proposed estimator is based on subspace analysis using a time-space data model. The method is shown to have potential in applications to real signals with simulated anechoic array recording, and a statistical evaluation demonstrates its robustness in DOA and fundamental frequency estimation as compared to a state-of-the-art reference method. Furthermore, the proposed method is shown to have good statistical performance under adverse conditions, for example for sources with similar DOA or fundamental frequency.


  1. 1.

    Klapuri A: Automatic music transcription as we know it today. J New Music Res 2004, 33: 269-282.

    Article  Google Scholar 

  2. 2.

    Christensen MG, Jakobsson A: Multi-Pitch Estimation. Synthesis Lectures on Speech and Audio Processing 2009.

    Google Scholar 

  3. 3.

    Rabiner L: On the use of autocorrelation analysis for pitch detection. IEEE Trans Signal Process 1996, 44: 2229-2244.

    Article  Google Scholar 

  4. 4.

    Zhang JX, Christensen MG, Jensen SH, Moonen M: A robust and computationally efficient subspace-based fundamental frequency estimator. IEEE Trans Acoust Speech Language Process 2010, 18(3):487-497.

    Article  Google Scholar 

  5. 5.

    de Cheveigne A, Kawahara H: YIN, a fundamental frequency estimator for speech and music. J Acoust Soc Am 2002, 111(4):1917-1930.

    Article  Google Scholar 

  6. 6.

    Wang DL, Brown GJ: Computational Auditory Scene Analysis: Principle, Algorithm, and Applications. Wiley, IEEE Press, New York; 2006.

    Chapter  Google Scholar 

  7. 7.

    Klapuri A: Multiple fundamental frequency estimation based on harmonicity and spectral smoothness. IEEE Trans Speech Audio Process 2003, 11: 804-816.

    Article  Google Scholar 

  8. 8.

    Emiya V, Bertrand D, Badeau R: A parametric method for pitch estimation of piano tones. IEEE International Conference on Acoustics, Speech, and Signal Processing 2007, 1: 249-252.

    Google Scholar 

  9. 9.

    Rickard S, Yilmaz O: Blind separation of speech mixtures via time-frequency masking. IEEE Trans Signal Process 2004, 52: 1830-1847.

    MathSciNet  Article  Google Scholar 

  10. 10.

    Wohmayr M, Kepsi M: Joint position-pitch extraction from multichannel audio. Proceedings of the Interspeech 2007.

    Google Scholar 

  11. 11.

    Qian X, Kumaresan R: Joint estimation of time delay and pitch of voiced speech signals. Record of the Asilomar Conference on Signals, Systems, and Computers 1996., 2:

    Google Scholar 

  12. 12.

    Wrigley SN, Brown GJ: Recurrent timing neural networks for joint F0-localisation based speech separation. IEEE International Conference on Acoustics, Speech and Signal Processing 2007.

    Google Scholar 

  13. 13.

    Flego F, Omologo M: Robust F0 estimation based on a multi-microphone periodicity function for distant-talking speech. EUSIPCO 2006.

    Google Scholar 

  14. 14.

    Armani L, Omologo M: Weighted auto-correlation-based F0 estimation for distant-talking interaction with a distributed microphone network. IEEE International Conference on Acoustics, Speech and Signal Processing 2004, 1: 113-116.

    Google Scholar 

  15. 15.

    Chazan D, Stettiner Y, Malah D: Optimal multi-pitch estimation using the em algorithm for co-channel speech separation. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 1993.

    Google Scholar 

  16. 16.

    Liao G, So HC, Ching PC: Joint time delay and frequency estimation of multiple sinusoids. IEEE International Conference on Acoustics, Speech and Signal Processing 2001, 5: 3121-3124.

    Google Scholar 

  17. 17.

    Wu Y, So HC, Tan Y: Joint time-delay and frequency estimation using parallel factor analysis. Elsevier Signal Process 2009, 89: 1667-1670.

    Article  Google Scholar 

  18. 18.

    Ngan LY, Wu Y, So HC, Ching PC, Lee SW: Joint time delay and pitch estimation for speaker localization. Proceedings of the IEEE International Symposium on Circuits and Systems 2003, 722-725.

    Google Scholar 

  19. 19.

    Stoica P, Moses R: Spectral Analysis of Signals. Prentice-Hall, Upper Saddle River; 2005.

    Google Scholar 

  20. 20.

    Brandstein M, Ward D: Microphone Arrays. Springer, Berlin; 2001.

    Book  Google Scholar 

  21. 21.

    van der Veen AJ, Vanderveen M, Paulraj A: Joint angle and delay estimation using shift invariance techniques. IEEE Trans Signal Process 1998, 46: 405-418.

    Article  Google Scholar 

  22. 22.

    Lemma AN, van der Veen AJ, Deprettere EF: Analysis of joint angle-frequency estimation using ESPRIT. IEEE Trans Signal Process 2003, 51: 1264-1283.

    MathSciNet  Article  Google Scholar 

  23. 23.

    Viberg M, Stoica P: A computationally efficient method for joint direction finding and frequency estimation in colored noise. Record of the Asilomar Conference on Signals, Systems, and Computers 1998, 2: 1547-1551.

    Google Scholar 

  24. 24.

    Lin JD, Fang WH, Wang YY, Chen JT: FSF MUSIC for joint DOA and frequency estimation and its performance analysis. IEEE Trans Signal Process 2006, 54: 4529-4542.

    Article  Google Scholar 

  25. 25.

    Wang S, Caffery J, Zhou X: Analysis of a joint space-time doa/foa estimator using MUSIC. IEEE International Symposium on Personal, Indoor and Mobile Radio Communications 2001, B138-B142.

    Google Scholar 

  26. 26.

    Christensen MG, Stoica P, Jakobsson A, Jensen SH: Multi-pitch estimation. Elsevier Signal Process 2008, 88(4):972-983.

    Article  Google Scholar 

  27. 27.

    Christensen MG, Jakobsson A, Jensen SH: Joint high-resolution fundamental frequency and order estimation. IEEE Trans. Acoust Speech Signal Process 2007, 15(5):1635-1644.

    Google Scholar 

  28. 28.

    Zhang JX, Christensen MG, Jensen SH, Moonen M: An iterative subspace-based multi-pitch estimation algorithm. Elsevier Signal Process 2011, 91: 150-154.

    Article  Google Scholar 

  29. 29.

    Lemma AN: ESPRIT based joint angle-frequency estimation algorithms and simulations. PhD Thesis Delft University 1999.

    Google Scholar 

  30. 30.

    Shu T, Liu XZ: Robust and computationally efficient signal-dependent method for joint DOA and frequency estimation. EURASIP J Adv Signal Process 2008., 2008: Article ID 10.1155/2008/134853

    Google Scholar 

  31. 31.

    Krim H, Viberg M: Two decades of array processing research-the parametric approach. IEEE SP Mag 1996.

    Google Scholar 

  32. 32.

    Christensen MG, Jakobsson A, Jensen SH: Multi-pitch estimation using Harmonic MUSIC. Record of the Asilomar Conference on Signals, Systems, and Computers 2006, 521-525.

    Google Scholar 

  33. 33.

    Christensen MG, Jakobsson A, Jensen SH: Sinusoidal order estimation using angles between subspaces. EURASIP J Adv Signal Process 2009, 1-11. Article ID 948756

    Google Scholar 

  34. 34.

    Veen BDV, Buckley KM: Beamforming: a versatile approach to spatial filtering. IEEE ASSP Mag 1988, 5: 4-24.

    Article  Google Scholar 

  35. 35.

    Li H, Stoica P, Li J: Computationally efficient parameter estimation for harmonic sinusoidal signals. Elsevier Signal Process 2000, 1937-1944.

    Google Scholar 

Download references


The study of Zhang was supported by the Marie Curie EST-SIGNAL Fellowship, Contract No. MEST-CT-2005-021175.

Author information



Corresponding author

Correspondence to Johan Xi Zhang.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ original submitted files for images

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and Permissions

About this article

Cite this article

Zhang, J.X., Christensen, M.G., Jensen, S.H. et al. Joint DOA and multi-pitch estimation based on subspace techniques. EURASIP J. Adv. Signal Process. 2012, 1 (2012).

Download citation


  • multi-pitch estimation
  • direction-of-arrival estimation
  • subspace orthogonality
  • array processing