Joint DOA and multi-pitch estimation based on subspace techniques

In this article, we present a novel method for high-resolution joint direction-of-arrivals (DOA) and multi-pitch estimation based on subspaces decomposed from a spatio-temporal data model. The resulting estimator is termed multi-channel harmonic MUSIC (MC-HMUSIC). It is capable of resolving sources under adverse conditions, unlike traditional methods, for example when multiple sources are impinging on the array from approximately the same angle or similar pitches. The effectiveness of the method is demonstrated on a simulated an-echoic array recordings with source signals from real recorded speech and clarinet. Furthermore, statistical evaluation with synthetic signals shows the increased robustness in DOA and fundamental frequency estimation, as compared with to a state-of-the-art reference method.


Introduction
The problem of estimating the fundamental frequency, or pitch, of a period waveform has been of interest to the signal processing community for many years. Fundamental frequency estimators are important for many practical applications such as automatic note transcription in music, audio and speech coding, classification of music, and speech analysis. Numerous algorithms have been proposed for both the single-and multi-pitch scenarios [1][2][3][4][5]. The problem for single-pitch scenarios is considered as well-posed. However, in real-world signals, the multi-pitch scenario occurs quite frequently [2,6]. The multi-pitch estimation algorithms are often based on, i.e., various modification of the auto-correlation function [1,7], maximum likelihood, optimal filtering, and subspace techniques [2,3,8]. In real-life recordings, problems such as frequency overlap of sources, reverberation, and colored noise will strongly limit the performance of multi-pitch estimator and estimator designed for single channel recordings often use simplified signal models. One widely used signal simplification in multi-pitch estimators, for example, is the sparseness of the signal, where the frequency spectrum of sources are assumed to not overlap [2]. This assumption may be appropriate when sources consist of mixture of several speech signals having different pitches [9]. However, for audio signals it is less likely to be true. This is especially so in western music, where instruments are most often played in accord, something that causes the harmonics to overlap or even coincide. With only single-channel recording it is, therefore, hard, or perhaps even impossible, to estimate pitches with overlapping harmonics, unless additional information, such as a temporal or spectral model, is included.
Recently, multi-channel approaches have attracted considerable attention both in single-and multi-pitch scenarios. By exploring the spatial information of the sources, more robust pitch estimators have been proposed [10][11][12][13][14]. Most of those multi-channel methods are still mainly based on auto-correlation function-related approaches, however, although a few exceptions can be found in [15][16][17][18]. In direction-of-arrival (DOA) estimators, audio and speech signals are often modeled as broadband signal, and standard subspace methods such as MUSIC and ESPRIT are only defined for narrowband signal model, which then fail to directly operate on broadband signals [19]. One often used concept is band-pass filtering of broadband signals into subbands, where narrow-band estimators can be applied to each subband [20]. In the narrow-band case, a delay in the signal is equivalent to a phase shifts according to the frequencies of complex exponentials. An alternative method is, however, as follows: since harmonic signals consist of sinusoidal components, we can model each source as multiple narrow-band signal with distinct frequencies arriving at the same DOA.
In this article, we propose a parametric method for solving the problem of joint fundamental frequency and DOA estimation based on subspace techniques where the quantities of interest are jointly estimated using a MUSIC-like approach. We term the proposed estimator Multi-channel multi-pitch Harmonic MUSIC (MC-HMUSIC). The spatio-temporal data model used in MC-HMUSIC is based on the JAFE data model [21,22]. Originally, the JAFE data model was used for estimating joint unconstrained frequencies and DOAs estimates of complex exponential using ESPRIT, which is referred as joint angle-frequency estimation (JAFE) algorithm. Other-related work with joint frequency-DOA methods includes [23][24][25]. In this article, we have parametrized the harmonic structure of periodic signals in the signal model to model the fundamental frequency and the DOA of individual sources. An estimator is constructed for jointly estimating the parameters of interest. Incorporating the DOA parameter in finding the fundamental frequency may give better robustness against a signal with overlapping harmonics. Similarly, it can be expected that the DOA can be found more accurately when the nature of the signal of interest is taken into account.
The remainder of this article is comprised four sections: Section 2, in which we will introduce some notation, the spatio-temporal signal model, for which we also derive the associated Cramér-Rao lower bound, along with the JAFE data mode; Section 3, where we then present the proposed method; Section 4, in which we present the experimental results obtained using the proposed method; and, finally, Section 5, where we conclude on our work.

Spatio-temporal signal model
Next, the signal model employed throughout the article will be presented. Without multi-path propagation of sources, it is given as follows: the signal x i received by microphone element i arranged in a uniform linear array (ULA) configuration, i = 1,..., M, is given by for sample index n = 0,..., N -1, where subscript k denotes the kth source and l the lth harmonic.
Moreover, A l,k is the real-valued positive amplitude of the complex exponential, L k is the number of harmonics, K is number of sources, g l,k is the phase of the individual harmonics, j k is the phase shift caused by the DOA, and e i (n) is complex symmetric white Gaussian noise. The phase shift between array elements is given as φ k = ω k f s d c sin(θ k ) , where d is the spacing between the elements measured in wavelengths, c is the speed of propagation in unit [m/s], θ k is the DOA defined for θ k [-90°, 90°], f s is the signal sampling frequency. The problem of interest is to estimate ω k and θ k . We in the following assume that the number of sources K is known and the number of harmonics L k of individual sources is known or found in some other, possibly joint, way. We note that a number of ways of doing this has been proposed in the past [26][27][28]2].

Cramér-Rao lower bound
We will now proceed to derive the exact Cramér-Rao lower bound (CRLB) for the problem of estimating the parameters of interest. First, we define the M × 1 deterministic signal model vector s(n, μ) with column element as where s(n, μ) = [s 1 (n, μ) ... s M (n, μ)] T . Furthermore, the parameter vector μ is given by Recall that the observed signal vector with additive white noise is given by . . .
with e(n) being the noise column vector. The CRLB is defined as the variance of an unbiased estimate of the pth element of μ, which is lower bounded as where C is the so-called Fisher information matrix given by The partial derivative matrix is denoted as where vector is the partial derivatives with respect to the entries in the vector μ. The expression for the columns in ∂s(n,μ) ∂μ is given as

The JAFE data model
Next, we will introduce the specifics of the JAFE data model [22,29] that our method is based on. At a time instant n the received signal from the M array elements are x(n) = [x 1 (n) x 2 (n) ... x M (n)] T , which can be written as where e(n) ℂ M×1 is the noise vector, and A = [A 1 ... A K ] is a Vandermonde matrix containing parameters ω k and θ k for sources k = 1, . . . , K, i.e., with a(θ, ω) being the array steering vector given by Here, (·) T denotes the vector transpose. Unlike the steering vector defined in [22,21], where only the DOA is parametrized, here, a general definition of the vector (11) is used, in which it depends on both θ and ω [29]. The frequency components are expressed in n = diag n 1 · · · n K where the matrix for each source is given by The complex amplitudes for involving components are represented by the following vector: To capture the temporal behavior, N time-domain data samples of the array output x(n) are collected to form the M × N data matrix X, which is defined as Due to the structure of the harmonic components, the data matrix is given by where E ℂ M×N is a matrix containing N sample of the noise vector e(n).
In speech and audio signal processing, it is common to model each source as a set of multiple harmonics with model order L k >1. Due to the narrow-band approximation of the steering vector, the multiple complex components with distinct frequencies impinge on the array with identical DOA will result in a non-unique spatial frequencies which cause a harmonic structure in the spatial frequencies j k l ∀l as well. The multiple sources impinge on the array with different DOAs consisting of various frequency components may, for certain frequency combinations, give the same array steering vector, which cause the matrix A to be rank deficient. Normally, this ambiguous mapping of the steering vector is mitigated by band-pass filtering the signal into its subbands, where the DOA of the signal is uniquely modeled by the narrow-band steering vector [20,Chap. 9].
Here, the ambiguities and the rank-deficiency are avoided by introducing temporal smoothness in order to restore the rank of A. The temporally smoothed data matrix is obtained by stacking t times temporally shifted versions of the original data matrix [22,21,29], given as . . .
where X t ℂ tM×N-t+1 is the temporally smoothed data matrix, and E t is the noise term constructed from E in a similar way as X t . In using the signal model where the amplitudes are assumed stationary for n = 0, . . . , N -1, X t can be factorized as With some additional definitions, we can also write this expression more compactly as . The temporally smoothed data matrix X t can maximally resole up to tM ≥ K k=1 L k complex exponentials, where Ā t is linearly independent for any distinct θ and ω [30].
When multiple sources with distinct DOA with the same fundamental frequency impinge on the array, it will result in correlation between the underlying signals, which will make it harder to separate the corresponding components into its eigenvectors [22,31]. To mitigate this problem, spatial smoothing is introduced, which works as follows. An array of M sensors is subdivided into S subarrays. In this article, the subarrays are spatially shifted with one element in each subarrays, the number of elements in each subarray being M S = M -S + 1. For s = 1, . . . , S, let J s ∈ C tM s ×tM be the selection matrix corresponding to the sth subarray for the data matrix X t . Then, the spatio-temporally smoothed data matrix X t,S ∈ C tM s ×S(N−t+1) is given by Furthermore, X t,s can be factorized as where E t,s is the noise term constructed from E in a similar way as X t,s . Using the shift invariance structure in A m , the term J s A m for s = 1, . . . , S is given by where = diag e jφ 1 1 · · · e jφ 1 L 1 · · · e jφ K 1 · · · e jφ K L K , (22) which is simply the phase difference between array elements. With (21), the matrix X t,s can be written in a compact form as with selection matrix expressed as where I t ℝ t×t and I M s ∈ R M s ×M s are the identity matrices, ⊗ is the Kroneker product as defined in [22].
It is interesting to note that the noise term E t,s is no longer white due to the spatio-temporal smoothing procedure, as correlation between the different rows of (23) is obtained. A pre-whitening step can be implemented in (23) to mitigate this. We note, however, that according to results reported in [22], pre-whitening step is only interesting for signals with low SNR where minor estimation improvement can be achieved. In this article, the main interest is to propose a multi-channel joint DOA and multi-pitch estimator, for which reason the whitening process is left without further description, but we refer the interested reader to [22]. We also note that aside from spatial smoothing, forward-backward averaging could also be implemented to reduce the influence of the correlated sources [22,31,19].

Coarse estimates
From the final spatio-temporally smoothed data matrix, a basis for the signal and noise subspaces can be obtained as follows. The singular value decomposition (SVD) of the data matrix (23) is given by where the columns of U are the singular vectors, i.e., A basis of the orthogonal complement of the signal subspace, also called the noise subspace, is formed from singular vector associated with the mM S -Q least significant singular values, i.e., with Q = K k=1 L k being the total number of complex exponentials in the signal. Similarly, the signal subspace is spanned by the Q largest singular values, i.e., The defined signal subspace and noise subspace have similar property as traditional subspaces where estimators such as joint DOA and frequency, or fundamental frequency estimators can be constructed using the principle used in MUSIC [19,32,27,26,4]. According to the signal noise subspace orthogonality principle, the following relationship holds: where we, for notational simplicity, have introduced J 1 Ā t = A ts . The matrix A ts is comprised Vandermonde matrices for sources k = 1, . . . , K. The matrix for each individual source is given by The cost function of the proposed joint DOA and multi-pitch estimator is then where ||·|| F is the Frobenius Norm. Note that this measure is closely related to the angles between the subspaces as explained in [33] and can hence be used as a measure of the extent to which (29) holds for a candidate fundamental frequency and DOA. The pair of fundamental frequency and DOA can, therefore, be found as the combination that is the closest to being orthogonal to G, i.e., The multi-channel estimators will have a cost function which is more well-behaved compared to those of single channel multi-pitch estimators (see, e.g., [26,32,28] for some examples of such).

Refined estimates
For many applications, only a coarse estimate of involved fundamental frequencies and DOAs are needed, in which case the cost function in (32) is evaluated on pre-defined search region with some specified granularity. If, however, very accurate estimates are desired, a refined estimate can be found as described next. For a rough estimate of the parameter of interests, refined estimates are obtained by minimizing the cost function in (32) using a cyclic minimization approach.
The gradient of the cost function (32) for fundamental frequency and DOA are given as with Re (·) denoting the real value. The gradient can be used for finding refined estimate using standard methods.
Here, we iteratively find a refined estimate using a cyclic approach. During an iteration, ω k is first estimated withω where i is the iteration index and δ is a small positive constant that is found using line search. The estimated ω i+1 k is then used to initialize the minimization function for DOA, which is then found aŝ The method is initialized for i = 0 using the coarse estimates obtained from (32).

Signal examples
We start the experimental part of this article by illustrating the application of the proposed method to analyzing a mixed signal consisting of speech and clarinet signals, sampled at f s = 8000 Hz. The single-channel signals are converted into a multi-channel signal by introducing different delays according to two pre-determined DOA to simulate a microphone array with M = 8 channels. The simulated DOAs of the speech and the clarinet signals are, respectively, θ 1 = -45°and θ 2 = 45°. The spectrogram of the mixed signal of the first channel is illustrated in Figure 1. To avoid spatial ambiguities, the distance between two sensor is half the wavelength of the highest frequency in the observed signal, here d = 0.0425 m. The mixed signal is segmented into 50% overlapped signal segments with N = 128. The user parameter selected in this experiment is t = 2N 3 and s = M 2 . The cost function is evaluated with a Vandermonde matrix with L = 5 complex exponentials, and the noise subspace is formed from an overestimated signal subspace with assumption of signal subspace containing N/2 = 64 complex exponentials. The signal subspace overestimation technique is usually used when the true order of the signal subspace is unknown, the signal subspace is assumed to be larger than the true one which can minimize the signal subspace components in the noise subspace. An added benefit of posing the problem as a joint estimation problem is that the multi-pitch estimation problem can be seen as several single-pitch problems for a distinct set of DOAs, one per source. Therefore, it is less important to select an exact signal model order than single-channel multi-pitch estimators would need [28]. The cost function is evaluated for frequencies from 100 to 500 with granularity of 0.52 Hz. The evaluated results are illustrated in Figure 2 where the upper panel contains the fundamental frequency estimates and lower panel the DOA estimates. It can be seen that the proposed algorithm can track the fundamental frequency and the DOA of the speech signal well, with only a few observed errors on regions with low signal energy. The clarinet signal's DOA and fundamental frequencies have also been estimated well for all segments.
For the purpose of further comparison, the same signal will be analyzed using a standard time delayand-sum beamformer [34] for DOA estimates and a single-channel maximum-likelihood based pitch estimator applied on the beamformed output signals [2]. The results are shown in Figure 3. The figure clearly shows that the delay-sum beamformer cannot satisfactory resolve the DOAs with M = 8 array elements which will further affect the performance of the single-channel pitch estimator, as shown in the upper panel. In this example, the proposed algorithm shown in Figure 2 is superior compared to reference method shown in Figure 3. The low resolution performance of the reference method will make the statistical evaluation of this method uninteresting, and we, therefore, will not be using it any further in the experiments to follow.

Statistical evaluation
Next, we use Monte Carlo simulations evaluated on synthetic signals embedded in noise in assessing the statistical properties of the proposed method and compare it with the exact CRLB. As a reference method for pitch and DOA estimation, we use the JAFE algorithm proposed in [22] for jointly estimating unconstrained frequencies and DOAs. Next, the unconstrained frequencies are grouped according to their corresponding DOAs where closely related directions are grouped together. A fundamental frequency is formed from these grouped frequencies in a weighted way as proposed in [35]. We refer this as the WLS estimator. In order to  remove the errors due to the erroneous estimate of amplitudes, we assume WLS having the exact signal amplitude given. The WLS estimator is a computationally efficient pitch estimation method with good statistical properties. The reference DOA estimate is easily obtained in a similar way from the mean value of these grouped DOAs according to [22]. Here, we consider a M = 8 element ULA with sensor distance d = 0.0425 with a sampling frequency of f s = 8000. The estimators are evaluated for two signal setups, first with two sources having ω 1 = 252.123 and ω 2 = 300.321 with L 1,2 = 3, and second with one harmonic source of ω 1 = 252.123 and L 1 = 3. All amplitudes on individual harmonics are set to unity A k,l = 1 for tractability. Both sources are assumed to be far-field sources impinging on the array with DOAs at θ 1 = -43.23°and θ 2 = 70°, respectively, and for one source having a DOA of θ 1 = -43.23°. All simulation results are based on 100 Monte Carlo runs. The performance is measured using the root mean squared estimation error (RMSE) as defined in [28,32,26,27]. The user parameter for JAFE data model is selected to the optimal values as proposed in [22] with temporal and spatial smoothness parameters, t = 2N 3 and s = M 2 , respectively. We note that in practical applications, the computational complexity has to also be considered in selecting the appropriate parameters t and s. An example of the 2dimensional (2D) cost function of our proposed method evaluated on two mixed signal is illustrated in Figure 4, where a coarser estimate of the DOA and fundamental estimates can be identified from the two peaks in the 2D cost function.
In the first simulation, we evaluate the proposed method's statistical properties in a single source scenario for varying sample lengths and SNRs. The RMSEs on signal with varying N are shown in Figure 5, and with varying SNR in Figure 6. It can be seen from these figures that both estimators perform well for all SNR above 0 dB with WLS being slightly better for fundamental frequency estimation while the proposed estimator is better in DOA estimation. Both methods are also able to follow CRLB closely for around sample length N >60. The better DOA estimation capabilities of the proposed method can be explained by the joint estimation of the fundamental frequency and DOA, which leads to increased robustness under adverse conditions. Both estimators can be considered as consistent in the singlepitch scenario.
Next, we evaluate our method for the multi-pitch scenario. The so-obtained RMSEs for varying N and  Figure 3 The estimates of (a) the fundamental frequency using maximum-likelihood estimator at the output of the beamformer, (b) the DOA using a delay-sum beamformer. SNR are depicted in Figures 7 and 8. In Figure 7, it clearly shows that the proposed method is better than the WLS estimator for short sample lengths. The WLS estimator is not following CRLB until N >80 samples while the proposed estimator is for N >64. The remaining gap between CRLB and both evaluated estimators for N >80 are due to the mutual interference between the harmonic sources. The slowly converging performance of WLS is mainly due to the bad estimate of the unconstrained frequency estimate using the JAFE method. With our selected simulation setup, the JAFE estimator is not giving consistent estimates for all harmonic components, which, in turn, results in poor performance in the WLS estimates. In general, the WLS estimator is sensitive to spurious estimate of the unconstrained frequencies. Moreover, the proposed estimator, which is jointly estimating both the DOA and the fundamental frequency, yields better estimates for smaller sample length N. The results in terms of RMSEs for varying SNRs are shown in Figure 8. This figure shows that the proposed estimator is again more robust than the WLS estimator for both DOA and fundamental frequency estimation. In next two experiments, we will study the performance as a function of the difference in fundamental frequencies and DOAs for multiple sources. We start with studying the RMSE as a function of the difference between the fundamental frequencies of two harmonic sources, i.e., Δω = |ω 1 -ω 2 |, with θ 1 = -43.321°and θ 2 = 70°. Here, we use an SNR set to 40 dB, and a sample length N = 64 with M = 8 array elements. The obtained RMSEs are shown in Figure 9. The figure clearly shows  that both methods can successfully estimate the fundamental frequencies and DOAs. Once again the proposed estimator gives more robust estimates, close to the CRLB. Additionally, it should be noted that both methods are correctly estimating the DOA even when the both fundamental frequencies are identical ω 1 = ω 2 , something that would not be possible with only a single channel. MC-HMUSIC has the ability to estimate the fundamental frequencies when both harmonics are identical provided that the DOAs are distinct and vice versa. Estimation of the parameters of signals with overlapping harmonics is a crucial limitation in multi-pitch estimation using only single-channel recordings. In the final experiment, the RMSE as a function of the difference between the DOAs of two harmonic sources Δθ = |θ 1θ 2 | is analyzed for an SNR set to 40 dB and a sample length of N = 64 with M = 8 array elements. The fundamental frequencies are ω 1 = 252.123 and ω 2 = 300.321, respectively. The observations and conclusions are basically the same as before, with the proposed method outperforming the reference method so far.

Conclusion
In this article, we have generalized the single-channel multi-pitch problem into a multi-channel multi-pitch estimation problem. To solve this new problem, we propose an estimator for joint estimation of fundamental frequencies and DOAs of multiple sources. The proposed estimator is based on subspace analysis using a time-space data model. The method is shown to have potential in applications to real signals with simulated anechoic array recording, and a statistical evaluation demonstrates its robustness in DOA and fundamental frequency estimation as compared to a state-of-the-art reference method. Furthermore, the proposed method is shown to have good statistical performance under