DOA-informed source extraction in the presence of competing talkers and background noise
- Maja Taseska^{1} and
- Emanuël A. P. Habets^{1}
https://doi.org/10.1186/s13634-017-0495-7
© The Author(s) 2017
- Received: 24 March 2017
- Accepted: 10 August 2017
- Published: 22 August 2017
Abstract
A desired speech signal in hands-free communication systems is often degraded by noise and interfering speech. Even though the number and locations of the interferers are often unknown in practice, it is justified to assume in certain applications that the direction-of-arrival (DOA) of the desired source is approximately known. Using the known DOA, fixed spatial filters such as the delay-and-sum beamformer can be steered to extract the desired source. However, it is well known that fixed data-independent spatial filters do not provide sufficient reduction of directional interferers. Instead, the DOA information can be used to estimate the statistics of the desired and the undesired signals and to compute optimal data-dependent spatial filters. One way in which the DOA is exploited for optimal spatial filtering in the literature is by designing DOA-based narrowband detectors to determine whether a desired or an undesired signal is dominant at each time-frequency (TF) bin. Subsequently, the statistics of the desired and the undesired signals can be estimated during the TF bins where the respective signal is dominant. In a similar manner, a Gaussian signal model-based detector which does not incorporate DOA information has been used in scenarios where the undesired signal consists of stationary background noise. However, when the undesired signal is non-stationary, resulting for example from interfering speakers, such a Gaussian signal model-based detector is unable to robustly distinguish desired from undesired speech. To this end, we propose a DOA model-based detector to determine the dominant source at each TF bin and estimate the desired and undesired signal statistics. We demonstrate that data-dependent spatial filters that use the statistics estimated by the proposed framework achieve very good undesired signal reduction, even when using only three microphones.
Keywords
- Spatial filtering
- Speech enhancement
- PSD matrix estimation
- RTF estimation
- Signal detection
1 Introduction
In applications that require hands-free capture of speech, the desired speech signal is often corrupted by background noise and interfering speech signals. Such applications involve human-to-human and human-to-machine communication, where speech enhancement is crucial: in the former, to improve the communication comfort, and in the latter, to ensure a low error rate of speech recognisers. In this work, we address scenarios where the desired speaker has a known DOA with respect to the microphones, such as in-car applications, or voice-controlled devices where the source of interest is restricted to a pre-defined DOA. Given the DOA of the desired source and assuming anechoic propagation, fixed spatial filters such as the delay-and-sum beamformer (DSB) [1] or superdirective beamformers [2] can be used. However, these filters are suboptimal as they do not consider the spatio-temporal statistics of the signals, and often provide insufficient interference reduction. Moreover, propagation model mismatch due to reverberation and DOA errors further limits the performance. In this work, we focus on optimal data-dependent spatial filters [3, 4]. Two main paradigms can be distinguished which aim at optimal filtering given the source DOA: robust adaptive beamformers (RABs) and informed spatial filters (ISFs). While RABs seek to improve the robustness to errors in the DOAs and the signal propagation vectors, ISFs address the estimation of propagation vectors and signal statistics from the microphone signals, and their usage for optimal spatial filtering.
RAB representatives include Bayesian beamformers [5, 6] and spatial filters with eigenvector constraints [7, 8], which are implemented using linearly constrained minimum variance (LCMV) filters. These filters seek to minimise the undesired signal power at the output, while imposing constraints to ensure that the desired signal from the DOA of interest is undistorted. Another approach is proposed in [9], where the desired signal power spectral density (PSD) matrix is computed by integrating the free field-based PSD matrices across the region of possible source DOAs. Note that the increased robustness in these approaches often comes at the cost of worse undesired signal reduction. RABs can also be implemented in a generalised sidelobe canceller (GSC) structure, where the robustness to DOA and propagation vector mismatches is ensured by using an adaptive blocking matrix [10, 11], and by imposing constraints on the adaptive noise cancellers [10]. The robust GSCs require a desired signal detector, as the noise cancellers need to be updated when the desired signal is absent, while the blocking matrix needs to be updated when the desired signal is present [12].
ISFs, in contrast to RABs, estimate the desired signal propagation vector and the undesired signal statistics from the data, and substitute them in optimal filter expressions such as the minimum variance distortionless response (MVDR) or the multichannel Wiener filter (MWF) [13–16]. As the ISFs are estimated and implemented in the frequency domain, the relevant statistics correspond to the PSD matrices of the desired and the undesired signals at each frequency. The advantage of estimating the propagation vectors from the data, rather than using anechoic propagation models, has been well known since the development of the transfer function-GSC [17] and the relative transfer function (RTF)-GSC [18]. However, a less often addressed question is how to design the narrowband signal detectors which are required to estimate the propagation vectors and the PSD matrices, or to perform the filter adaptations in the adaptive GSCs. Signal detection in the presence of non-stationary interferers is a very challenging problem [19]. The Gaussian model-based detectors used in state-of-the-art systems [15, 20, 21] assume that the noise is significantly more stationary than speech, which is not true for speech interferers.
The question addressed in this paper is how to design a robust narrowband signal detector, by using the microphone signals, narrowband DOA estimates extracted from the signals, and the information about the desired source DOA. Narrowband DOAs have been previously used for desired speech detection in the literature. For instance, in [22], the authors use narrowband DOAs to control the a priori desired speech presence probability (DSPP) in a Gaussian signal model, while in [23] a Gaussian DOA model is used to compute a DSPP and apply it as a single-channel gain to the output of a spatial filter. We propose a different statistical model for the narrowband DOA estimates which is used for desired signal detection and estimation of the propagation vectors and the PSD matrices in an ISF framework. Initial results obtained using the proposed framework were presented in [24]. In this paper, we provide a more detailed description of the system, further discussions and comparison to the state-of-the-art approaches, as well as an extended set of experiments to evaluate the performance of the narrowband signal detector and the quality of the extracted source signal at the ISF output.
2 Signal model and problem formulation
In addition, we introduce the hypothesis \(\mathcal {H}_{u}= \mathcal {H}_{i} \cup \mathcal {H}_{v}\) that an undesired signal is dominant, regardless of whether it is a competing speaker or background noise.
The objective in this work is to define likelihood models for the hypotheses in (5), design a detector that associates each TF bin to the correct hypothesis, estimate the PSD matrices and the RTF vector g _{1}, and finally, compute the ISF coefficients w _{opt} required for source extraction in (3). Narrowband DOA estimates play a key role in the framework, and appropriate state-of-the-art estimators are briefly discussed in the following section.
3 Narrowband DOA estimation
The most important criteria when choosing a DOA estimator for our framework are the ability to obtain nearly instantaneous narrowband DOA estimates without requiring temporally averaged covariance matrices, as subspace-based estimators [25] do, and a sufficiently low complexity suitable for real-time implementation. We briefly review two estimators which satisfy these requirements.
3.1 Least squares (LS)-fitting of instantaneous phase differences [26]
where (·)^{+} denotes the Moore-Penrose pseudoinverse of a matrix.
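As an illustration of how a pseudoinverse-based LS fit can turn pairwise phase differences into a DOA, the following minimal sketch assumes a far-field plane-wave model, a planar array, and no phase wrapping. The estimator equations of this section are not reproduced in this text, so the function name `ls_doa_from_phases`, the sign convention, and the three-microphone geometry are assumptions made for this example and not the exact estimator of [26].

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (assumed value)


def ls_doa_from_phases(phase_diffs, pair_vectors, freq_hz):
    """Least-squares fit of a 2-D direction vector to pairwise phase differences.

    phase_diffs  : (P,) phase differences in radians, one per microphone pair
    pair_vectors : (P, 2) position differences r_i - r_j of each pair, in metres
    freq_hz      : centre frequency of the TF bin in Hz
    Returns the DOA in radians, measured in the array plane.
    """
    # Assumed model: phase_diff = (2*pi*f/c) * pair_vectors @ n, where n is the
    # unit vector pointing from the array towards the source.
    scaled = phase_diffs * SPEED_OF_SOUND / (2.0 * np.pi * freq_hz)
    n_hat = np.linalg.pinv(pair_vectors) @ scaled  # Moore-Penrose pseudoinverse
    return np.arctan2(n_hat[1], n_hat[0])


# Synthetic check: three microphones on a circle of radius 3 cm, source at 40 degrees.
mic_angles = np.deg2rad([0.0, 120.0, 240.0])
mics = 0.03 * np.column_stack([np.cos(mic_angles), np.sin(mic_angles)])
pairs = np.array([mics[0] - mics[1], mics[0] - mics[2], mics[1] - mics[2]])
true_doa = np.deg2rad(40.0)
n_true = np.array([np.cos(true_doa), np.sin(true_doa)])
freq = 1000.0
phases = (2.0 * np.pi * freq / SPEED_OF_SOUND) * (pairs @ n_true)
estimated_doa = ls_doa_from_phases(phases, pairs, freq)
```

With noise-free phases the fit recovers the simulated DOA; with measured phases, the same call yields one DOA estimate per TF bin without any temporal averaging.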
3.2 LS-fitting of cross PSD phase differences [27]
The estimators given by (9) and (12) assume that for each microphone pair, the spatial aliasing frequency lies above \(\frac {F_{s}}{2}\), where F _{ s } is the sampling rate. Alternatively, frequency-dependent binary weights can be used to exclude microphone pairs at the frequency bins where spatial aliasing might occur for those pairs, as done in [27].
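The frequency-dependent binary weights mentioned above can be sketched as follows. The sketch assumes the usual half-wavelength rule, i.e. a microphone pair with spacing d is free of spatial aliasing below c/(2d); the function name `aliasing_free_weights` and the example spacings are hypothetical and not taken from [27].

```python
import numpy as np


def aliasing_free_weights(pair_distances_m, freqs_hz, speed_of_sound=343.0):
    """Binary weights of shape (pairs, bins): 1 where a pair is alias-free, else 0."""
    # Assumed rule: a pair with spacing d is unambiguous below f_alias = c / (2 d).
    alias_freqs = speed_of_sound / (2.0 * np.asarray(pair_distances_m, dtype=float))
    freqs = np.asarray(freqs_hz, dtype=float)
    return (freqs[None, :] < alias_freqs[:, None]).astype(float)


# Two hypothetical pairs (2 cm and 5 cm spacing) evaluated at three frequencies.
weights = aliasing_free_weights([0.02, 0.05], [1000.0, 4000.0, 8000.0])
```

In this example, the 2 cm pair stays usable up to 8 kHz, whereas the 5 cm pair is excluded at 4 kHz and 8 kHz.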
4 State-of-the-art DOA-informed source extraction
4.1 DSB and MPDR beamforming
If the signal propagation is modelled as a pure delay, the DSB is the simplest filter which can be applied for source extraction. However, the DSB offers suboptimal performance as it does not consider the signal statistics and the reverberation.
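A minimal sketch of a DSB under the pure-delay model is given below. It assumes a far-field source in the plane of the array; the helper name `dsb_weights`, the microphone positions, and the speed of sound are illustrative assumptions.

```python
import numpy as np


def dsb_weights(mic_positions, doa_rad, freq_hz, speed_of_sound=343.0):
    """Delay-and-sum weights and steering vector for a far-field planar source."""
    direction = np.array([np.cos(doa_rad), np.sin(doa_rad)])
    # Relative propagation delays in seconds (microphones closer to the source lead).
    delays = -(mic_positions @ direction) / speed_of_sound
    steering = np.exp(-2j * np.pi * freq_hz * delays)
    return steering / len(mic_positions), steering


# Hypothetical three-microphone array, steered to 30 degrees at 2 kHz.
mics = np.array([[0.0, 0.0], [0.04, 0.0], [0.0, 0.04]])
w_dsb, d_look = dsb_weights(mics, np.deg2rad(30.0), 2000.0)
_, d_off = dsb_weights(mics, np.deg2rad(120.0), 2000.0)
gain_look = np.vdot(w_dsb, d_look)      # w^H d for the look direction
gain_off = abs(np.vdot(w_dsb, d_off))   # magnitude response for another direction
```

The weights depend only on the geometry and the steering direction, which is why the DSB cannot adapt to the statistics of the interferers.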
In contrast to the MVDR filter which is expressed in terms of the undesired signal PSD matrix Φ _{ u }, the MPDR filter is expressed in terms of Φ _{ y }, which contains the desired signal as well. Therefore, if the RTF vector is inaccurate due to the anechoic model mismatch in reverberant environments, or due to DOA errors, the MPDR filter causes severe distortion of the desired signal [28].
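The difference between the two filters can be illustrated with the standard distortionless closed form w = Φ^{-1} g / (g^H Φ^{-1} g), evaluated once with Φ _{ u } (MVDR) and once with Φ _{ y } (MPDR). The sketch below assumes a rank-one desired signal model, Φ _{ y } = φ_s g g^H + Φ _{ u }, and uses randomly generated matrices; all numerical values are hypothetical.

```python
import numpy as np


def distortionless_filter(psd_matrix, rtf):
    """Return w = Phi^{-1} g / (g^H Phi^{-1} g) for a Hermitian positive-definite Phi."""
    phi_inv_g = np.linalg.solve(psd_matrix, rtf)
    return phi_inv_g / np.vdot(rtf, phi_inv_g)


rng = np.random.default_rng(0)
g = np.array([1.0, 0.8 * np.exp(0.3j), 0.6 * np.exp(-0.5j)])  # hypothetical RTF vector
a = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
phi_u = a @ a.conj().T + 0.1 * np.eye(3)       # undesired signal PSD matrix
phi_y = 2.0 * np.outer(g, g.conj()) + phi_u    # assumed rank-one desired signal model

w_mvdr = distortionless_filter(phi_u, g)
w_mpdr = distortionless_filter(phi_y, g)
```

When the RTF vector is exact, the two filters coincide, so the distortion discussed above arises only once the vector used in the MPDR filter deviates from the true one.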
4.2 Informed spatial filtering
The authors in [22] incorporate the DOA information in the a priori DSPP q _{ s }, to provide more robust discrimination between desired and undesired speakers. If \(\Theta _{\theta,\hat {\theta }}(t,k)\) denotes the angle between the true DOA θ of the source of interest, and \(\hat {\theta }_{tk}\) the DOA estimate at TF bin (t,k), the a priori DSPP in [22] is computed as \(q_{s}(t,k) = w\left (\Theta _{\theta,\hat {\theta }}(t,k)\right)\), where w(Θ) is a Gaussian window centred at Θ=0.
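A Gaussian window on the angular deviation can be sketched as below. The window width is not specified in this text, so the 20° standard deviation, the planar (single-angle) DOA representation, and the function name `doa_based_prior` are assumptions made for this example.

```python
import numpy as np


def doa_based_prior(doa_estimate_rad, desired_doa_rad, window_std_rad=np.deg2rad(20.0)):
    """A priori DSPP as a Gaussian window centred at zero angular deviation."""
    # Wrap the difference to (-pi, pi] so that 350 degrees and -10 degrees coincide.
    deviation = np.angle(np.exp(1j * (doa_estimate_rad - desired_doa_rad)))
    return np.exp(-0.5 * (deviation / window_std_rad) ** 2)


q_on_target = doa_based_prior(np.deg2rad(90.0), np.deg2rad(90.0))
q_off_target = doa_based_prior(np.deg2rad(150.0), np.deg2rad(90.0))
q_wrapped = doa_based_prior(np.deg2rad(350.0), 0.0)
q_negative = doa_based_prior(np.deg2rad(-10.0), 0.0)
```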
5 Proposed DOA model-based signal detection
The Gaussian model-based DSPP is very sensitive to non-stationarity of the undesired signal, as the expression (15) requires an estimate of the PSD matrix Φ _{ u }. To estimate the DSPP at TF bin (t,k), the PSD matrix estimate \(\widehat {\mathbf {\Phi }}_{\boldsymbol {u}}(t-1,k)\) from the previous frame t−1 is used, which leads to estimation errors when the undesired signal changes in consecutive frames. The DOA-based a priori DSPP used in [22] in the spherical harmonic domain seeks to reduce this sensitivity in scenarios with non-stationary interferers. Nevertheless, our experiments for a posteriori DSPP estimation in the traditional signal domain indicated that the DOA-based a priori DSPP is often insufficient to compensate for errors in the likelihoods (15) occurring due to erroneous \(\widehat {\mathbf {\Phi }}_{\boldsymbol {u}}\) estimates. This is our motivation to develop a different method to incorporate DOA information in the a posteriori DSPP estimation, by using a generative probabilistic model of the narrowband DOAs.
5.1 Likelihood model for the narrowband DOA estimates
The normalisation \(c_{\mathcal {M}}(\kappa)=[2 \pi I_{0}(\kappa)]^{-1}\) is derived in [30], where I _{0} is the modified Bessel function of the first kind of order zero. If the DOA estimator is unbiased, the mean \(\tilde {\theta }\) is equal to the DOA of the desired source. The concentration parameter κ reflects the uncertainty in the DOA estimates, where larger concentration indicates smaller DOA estimation error variance, while smaller concentration indicates larger DOA estimation error variance. Factors which commonly affect the concentration include the array geometry, the number of microphones, the coherent signal-to-diffuse signal ratio, as well as the DOA estimator. The concentration κ is an unknown model parameter and its computation is discussed in Section 5.3.2.
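The role of the normalisation and of κ can be checked numerically with the von Mises density \(c_{\mathcal {M}}(\kappa)\,e^{\kappa \cos (\theta - \tilde {\theta })}\). The sketch below is only an illustration; the grid size and the κ values are arbitrary choices.

```python
import numpy as np


def von_mises_pdf(theta, mean, kappa):
    """von Mises density with normalisation [2*pi*I0(kappa)]^{-1}."""
    return np.exp(kappa * np.cos(theta - mean)) / (2.0 * np.pi * np.i0(kappa))


# Periodic grid over one full circle; a plain Riemann sum integrates it accurately.
n_points = 4096
grid = np.linspace(-np.pi, np.pi, n_points, endpoint=False)
step = 2.0 * np.pi / n_points

integral_k8 = float(np.sum(von_mises_pdf(grid, 0.5, 8.0)) * step)
uniform_value = float(von_mises_pdf(0.3, 0.0, 0.0))   # kappa = 0: uniform on the circle
peak_k1 = float(von_mises_pdf(0.0, 0.0, 1.0))
peak_k8 = float(von_mises_pdf(0.0, 0.0, 8.0))
```

For κ=0 the density is uniform on the circle, and the density becomes more peaked around the mean as κ grows.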
5.2 Desired speech presence probability and optimal detection
where the a priori probabilities \(q_{s} = p(\mathcal {H}_{s})\), \(q_{i} = p(\mathcal {H}_{i})\) and \(q_{v} = p(\mathcal {H}_{v})\) satisfy q _{ s }+q _{ i }+q _{ v }=1.
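Given likelihood values under the three hypotheses and the a priori probabilities, the posterior probabilities follow from Bayes' rule, and the posterior of \(\mathcal {H}_{u}\) is the sum of the interferer and noise posteriors. The sketch below uses made-up likelihood and prior values for a single TF bin; the function name `hypothesis_posteriors` is hypothetical.

```python
import numpy as np


def hypothesis_posteriors(likelihoods, priors):
    """Bayes' rule for mutually exclusive hypotheses whose priors sum to one."""
    likelihoods = np.asarray(likelihoods, dtype=float)
    priors = np.asarray(priors, dtype=float)
    if not np.isclose(priors.sum(), 1.0):
        raise ValueError("a priori probabilities must sum to one")
    joint = likelihoods * priors
    return joint / joint.sum()


# Hypothetical values for one TF bin, ordered as (H_s, H_i, H_v).
post_s, post_i, post_v = hypothesis_posteriors([1.2, 0.1, 0.16], [0.3, 0.3, 0.4])
post_u = post_i + post_v  # undesired signal dominant: interferer or noise
```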
where \(\mathcal {I}_{\mathcal {H}_{a}}\) is a binary indicator which equals one if the hypothesis in the subscript is true, and zero otherwise. Using the binary indicator, only the PSD matrix of the dominant signal is updated, as discussed in Section 6 in more detail.
5.3 Estimation of the likelihood model parameters
5.3.1 Estimation of the a priori probabilities q _{ s },q _{ i }, and q _{ v }
In this manner, the a priori SPP in the proposed model exploits the spatio-temporal properties of the signal vector y(t,k) and knowledge of the noise PSD matrix to aid the discrimination between noise and speech-dominated TF bins, prior to the estimation of the narrowband DOA at the current TF bin.
5.3.2 Estimation of the concentration parameter κ
It was mentioned in Section 5.1 that the concentration parameter κ related to the mode and the notch of the DOA-related likelihoods often depends on the coherent-to-diffuse ratio (CDR), the array geometry, and the DOA estimator. For a given array geometry and a given DOA estimator, a single concentration parameter can be estimated for instance by collecting the DOA estimates from all TF bins during a training period when only the desired speech source and background noise are present, and finding the maximum likelihood (ML) estimate. However, this way of obtaining a single concentration parameter does not take into account the fact that many of the TF bins used for training are noise-dominated and do not contain significant speech energy. Instead of providing an average concentration parameter, we seek to quantify the uncertainty of the DOA estimate at each TF bin. By quantifying the certainty of each DOA estimate, we provide additional information to the proposed signal detector for determining the dominant source at each TF bin. Therefore, the concentration parameter κ of the von Mises distribution needs to be estimated for each TF bin as well.
where l _{max} determines the maximum value of the function, \(c\in \mathbb {R}\) controls the offset along the \(\widehat \Gamma _{tk}\) axis, and \(\rho \in \mathbb {R}^{+}\) controls the steepness of the transition region of the sigmoid-like function. The minimum value of the function, \(f\left (\widehat {\Gamma }_{tk}\right) =0\), attained in the limit \(\widehat {\Gamma }_{tk} \rightarrow -\infty \), indicates that the distribution of the DOA is a uniform distribution on the circle in the absence of a coherent signal.
1. Simulate a short signal segment by convolving a white Gaussian noise signal with an anechoic room impulse response, and add an ideally diffuse noise signal simulated according to [34], with a specified signal-to-noise ratio (SNR). Note that although the CDR also depends on the reverberation from directional sources, the spatial properties of late reverberation closely resemble those of a diffuse sound field.
2. Repeat the simulation for different SNRs (we used the range [−30,30] dB, with steps of 5 dB), and for each simulation store the CDR estimates and the DOA estimates for each TF bin.
3. Make a histogram of the CDR estimates stored from all simulations and associate to each histogram bin the corresponding DOA estimates.
4. If the set of DOA estimates associated with the n-th histogram bin is \(\Theta _{n} = \left \{ \theta _{1}, \ldots, \theta _{L_{n}} \right \}\), a maximum likelihood estimate of the concentration parameter for this histogram bin is obtained by first computing

   $$ r = \sqrt{\left(\frac{1}{L_{n}}\,\sum_{i=1}^{L_{n}} \cos \theta_{i}\right)^{2} + \left(\frac{1}{L_{n}}\,\sum_{i=1}^{L_{n}} \sin \theta_{i}\right)^{2} }, $$ (25)

   and using the following approximation (see [29], Section 5.3.1, for details)

   $$ \kappa_{n,\text{ML}} = \left\{\begin{array}{ll} 2\,r + r^{3} + \frac{5}{6}\,r^{5} & \text{if}\ r<0.53, \\ -0.4 + 1.39\,r + \frac{0.43}{1-r} & \text{if}\ 0.53 \leq r < 0.85, \\ \frac{1}{2(1-r)} & \text{if}\ r \geq 0.85. \end{array}\right. $$ (26)

5. For each histogram bin n, store the CDR value of the bin centre and the corresponding ML estimate of the concentration parameter as a pair (Γ _{ n },κ _{ n,ML}).
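Step 4 of the procedure above can be sketched as follows: the function evaluates (25) and the piecewise approximation (26) for one histogram bin. The synthetic von Mises samples only serve as a sanity check and are not part of the training procedure; the function name `kappa_ml` and the sample size are assumptions.

```python
import numpy as np


def kappa_ml(doas_rad):
    """Approximate ML concentration from DOA estimates in radians, following (25)-(26)."""
    doas_rad = np.asarray(doas_rad, dtype=float)
    # Mean resultant length r of the unit vectors (cos(theta_i), sin(theta_i)).
    r = float(np.hypot(np.mean(np.cos(doas_rad)), np.mean(np.sin(doas_rad))))
    if r < 0.53:
        return 2.0 * r + r**3 + 5.0 * r**5 / 6.0
    if r < 0.85:
        return -0.4 + 1.39 * r + 0.43 / (1.0 - r)
    # Diverges as r approaches 1, i.e. when all DOA estimates coincide.
    return 1.0 / (2.0 * (1.0 - r))


rng = np.random.default_rng(1)
samples = rng.vonmises(0.7, 2.0, size=200_000)   # synthetic DOAs with true kappa = 2
kappa_hat = kappa_ml(samples)
kappa_spread = kappa_ml([0.0, np.pi / 2, np.pi, 3 * np.pi / 2])  # evenly spread DOAs
```

DOA estimates spread evenly over the circle give a concentration close to zero, while tightly clustered estimates give a large concentration.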
Following this data-driven procedure, we have experimentally found a correspondence between the CDR estimates and the concentration parameter κ. Given the pairs (Γ _{ n },κ _{ n,ML}), we can now determine the parameters of the sigmoid-like mapping function. First, note that although in theory the logarithmic range of the CDR is (−∞,∞), in practice, the CDR estimators saturate and are limited to a relatively small range of values around 0 dB. For our particular estimator, we observed that the range of estimates was [−10, 20] dB, which allows us to determine the maximum value l _{max} of the concentration parameter by observing the values of κ _{ n,ML} for the histogram bins where Γ _{ n }≈20 dB. To find the parameter c that determines the offset along the \(\widehat \Gamma \) axis, we note that for any value of ρ, the value of \(\widehat \Gamma \) for which the resulting κ is exactly in the midpoint of its range [0,l _{max}] satisfies \(\widehat {\Gamma } = 10\,c\). Therefore, by looking for the pair (Γ _{ n },κ _{ n,ML}) in our training results where κ _{ n,ML} is as close as possible to l _{max}/2, we can use the corresponding Γ _{ n } to compute the parameter c. Having fixed c and l _{max} and noting that due to the aforementioned saturation of the CDR estimator, the concentration parameter is approximately 0 for Γ _{ n }≈−10 dB, there is only a small range of values for ρ which satisfy the constraints on the maxima and the minima of the sigmoid-like function (i.e., f(−10)≈0 and f(20)≈l _{max}). This range was ρ∈[0.2, 2] in our case, and the best fit for ρ can be easily found by visual inspection of the curves obtained by substituting several values for ρ from this range. The above-described procedure for our data resulted in l _{max}=8, c=1.5, and ρ=1.2, which we kept constant for all the experiments.
6 Application to informed spatial filtering
where \(\tilde {\alpha }_{v} \in [0,1)\) is a pre-defined constant. In contrast to (29), the noise averaging parameter (30) leads to a soft recursive update, as often done when the undesired signal is stationary [15, 20, 21].
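The distinction between a hard and a soft recursive update can be sketched as below. Since (29) and (30) are not reproduced in this text, both functions are assumptions: the hard update freezes the PSD matrix unless the detector marks the signal as dominant, and the soft update follows the common probability-controlled form in which the effective averaging constant approaches one as the noise posterior decreases. The names `hard_update`, `soft_noise_update`, and `noise_posterior` are hypothetical.

```python
import numpy as np


def hard_update(psd_prev, y, alpha, is_dominant):
    """Recursive update applied only in TF bins where the signal is detected as dominant."""
    if not is_dominant:
        return psd_prev
    return alpha * psd_prev + (1.0 - alpha) * np.outer(y, y.conj())


def soft_noise_update(psd_prev, y, alpha_tilde, noise_posterior):
    """Soft update (assumed form): the less likely noise is, the less the estimate changes."""
    alpha = alpha_tilde + (1.0 - alpha_tilde) * (1.0 - noise_posterior)
    return alpha * psd_prev + (1.0 - alpha) * np.outer(y, y.conj())


y = np.array([1.0 + 1.0j, 0.5 - 0.2j, -0.3j])   # microphone signal vector at one TF bin
phi0 = np.eye(3, dtype=complex)                 # previous PSD matrix estimate

phi_kept = hard_update(phi0, y, 0.92, is_dominant=False)
phi_soft_frozen = soft_noise_update(phi0, y, 0.95, noise_posterior=0.0)
phi_soft_full = soft_noise_update(phi0, y, 0.95, noise_posterior=1.0)
```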
which has a role similar to that of the single-channel DOA-based gain in [23] and the DOA-based TF mask common for source separation [36]. Applying the DSPP as a multiplicative factor provides additional undesired signal reduction; however, when inaccurately estimated, it causes audible distortion to the desired signal. This is further evaluated in the experiments in Section 7.2.
7 Performance evaluation
To evaluate the performance for different reverberation levels, simulated data was used. RIRs were computed using the simulator in [37]. Diffuse noise was simulated as described in [34] and the microphone signals were obtained by adding the speech signals convolved with the RIRs, the diffuse noise signal, and a spatially and temporally uncorrelated noise signal. The processing was done at a sampling rate of 16 kHz, with an STFT frame length of 64 ms with 50% overlap, windowed by a Hamming window. Unless stated otherwise, the DOA estimator with instantaneous phase differences, described in Section 3.1, was employed.
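For orientation, the stated STFT settings translate into the following frame parameters. The sketch is a plain NumPy analysis stage under these settings and is not the authors' implementation; the test signal is white noise chosen only to exercise the code.

```python
import numpy as np

fs = 16000                     # sampling rate in Hz
frame_len = fs * 64 // 1000    # 64 ms frame -> 1024 samples
hop = frame_len // 2           # 50% overlap -> 512 samples (32 ms)
window = np.hamming(frame_len)


def stft(x):
    """Return an array of shape (frames, bins) with one-sided spectra."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[t * hop : t * hop + frame_len] * window for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


x = np.random.default_rng(0).standard_normal(fs)   # one second of white noise
spec = stft(x)
bin_spacing_hz = fs / frame_len
```

Each frame therefore yields 513 frequency bins spaced 15.625 Hz apart, and a new frame starts every 32 ms.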
7.1 Detector evaluation in terms of receiver operating characteristics (ROC)
The ROC curves are obtained by computing the FPR and FNR as \(\frac {C_{su}}{C_{us}}\) varies from 0 to ∞. The FPR and FNR are computed for the three sources (French female, English female, and English male) during 20 s of multi-talk. The average FPR and FNR used for the ROC curves are obtained by averaging over the segments for each of the three sources (hence over 60 s of speech in total). In all experiments, the desired-to-interfering speech ratio (DSIR) was in the range [ 5,8] dB. The DSIR for each source is computed at one of the microphones from the nearest array.
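The two error rates can be computed from binary detector decisions and a ground-truth labelling of the TF bins as sketched below. The convention that "positive" means "desired signal dominant" and the toy labels are assumptions made for this example.

```python
import numpy as np


def fpr_fnr(decisions, truth):
    """False positive and false negative rates over a set of TF bins.

    decisions : True where the detector declares the desired signal dominant
    truth     : True where the desired signal is actually dominant
    """
    decisions = np.asarray(decisions, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    false_pos = np.sum(decisions & ~truth)
    false_neg = np.sum(~decisions & truth)
    return false_pos / np.sum(~truth), false_neg / np.sum(truth)


# Toy example with five TF bins.
fpr, fnr = fpr_fnr([1, 0, 1, 0, 0], [1, 1, 0, 0, 0])
```

Repeating this computation for each value of the cost ratio gives one (FPR, FNR) point per detector setting.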
7.1.1 Experiment 1
To evaluate the detectors for different reverberation levels, the setup shown in Fig. 3 was simulated for T _{60} of 0.2 s, 0.35 s, 0.5 s, and 0.65 s and diffuse babble noise with an SNR of 22 dB. As shown in Fig. 4 c, reverberation has a stronger effect on the ROC than the noise, as the curves shift more notably with increasing reverberation. However, the proposed detector clearly outperforms the signal model-based detector in all cases.
7.1.2 Experiment 2
In this experiment, we simulated scenarios with one desired source and one interferer, for different angular separations between the desired source and the interferer, namely, 160°, 95°, 50°, 25°, and 0°. In all cases, the desired source is located at 0.7 m from the array, whereas the interferer is located at 1.5 m from the array. The reverberation time was T _{60}=0.35 s and diffuse babble noise with an SNR of 22 dB and uncorrelated sensor noise with an SNR of 35 dB were added. As expected, with decreasing angular separation, the detection accuracy deteriorates, as visible in Fig. 4 b. Note that even when the desired and the undesired source have equal DOA, the detector provides good accuracy due to the fact that the desired signal is stronger than the interferer at its respective nearest array. Another reason is that the CDR in interferer-dominated TF bins is lower than the CDR in desired signal-dominated TF bins, hence allowing the CDR-controlled concentration κ to aid the detection even when the sources have equal DOA.
7.1.3 Experiment 3
The detector ROC curves obtained with the two DOA estimators discussed in Section 3, the one with instantaneous, and the one with time-averaged phase differences, are illustrated in Fig. 4 d. Although time averaging of the phase differences generally provides less noisy DOA estimates (i.e., smoother across the TF spectrum), the detector performance is better when instantaneous DOA estimates are used.
7.2 Objective evaluation of extracted signals
To estimate the PSD matrices and the RTF vector, we computed a Bayes detector according to (22), where we used C _{ su }=1 and C _{ us }=2. These costs were chosen after investigating the objective performance measures and the results of informal listening tests in different acoustic conditions, where they proved to achieve the best performance among all (C _{ su },C _{ us }) pairs across the ROC. The chosen costs resulted in an FPR of 0.1 and an FNR of 0.9 on average (across the different experiments), which corroborates the observation made in Section 7.1.1, that the FPR needs to be very low in order to ensure a good extracted signal quality. The averaging constants for the PSD matrices were \(\tilde {\alpha }_{v} = 0.95\), \(\tilde {\alpha }_{s} = \tilde {\alpha }_{u} = 0.92\) (corresponding to time constants of 0.62 s and 0.38 s, respectively). The performance was evaluated in terms of segmental noise reduction (NR), segmental interference reduction (IR), speech distortion (SD) index ν _{sd}, PESQ score improvement Δ _{PESQ} [38], and improvement of the short-time objective intelligibility (STOI) score [39], Δ _{STOI}. Five spatial filtering frameworks are evaluated: (i) an oracle MVDR filter, where the PSD matrices are computed using recursive averaging with an ideal detector, denoted by \(\mathcal {D}_{\text {id}}\), (ii) a DSB steered to the desired source DOA, (iii) an MPDR filter steered to the desired source DOA, (iv) an informed MVDR filter obtained using the Gaussian signal model-based detector with a DOA-based a priori SPP, denoted by \(\mathcal {D}_{\text {sm}}\), and (v) an informed MVDR filter obtained using the proposed DOA-based detector, denoted by \(\mathcal {D}_{\text {dm}}\).
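The quoted time constants can be reproduced from the averaging constants with the usual first-order relation τ = −T_hop / ln(α), where T_hop = 32 ms is the frame shift implied by 64 ms frames with 50% overlap. The relation itself is a standard assumption for recursive averaging and is not stated explicitly in the text; the function name `time_constant` is hypothetical.

```python
import math

HOP_SECONDS = 0.032  # 64 ms frames with 50% overlap


def time_constant(alpha, hop_s=HOP_SECONDS):
    """Time constant in seconds of the recursion phi(t) = alpha*phi(t-1) + (1-alpha)*x(t)."""
    return -hop_s / math.log(alpha)


tau_noise = time_constant(0.95)    # averaging constant of the noise PSD matrix
tau_speech = time_constant(0.92)   # averaging constant of the desired/undesired speech PSD matrices
```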
7.2.1 Experiment 1
Results for Source1 (top), Source2 (middle), and Source3 (bottom)
Columns marked (10 dB) and (2 dB) correspond to an average input SNR of 10 dB and 2 dB, respectively.

Metric | DSB (10 dB) | MPDR (10 dB) | \(\mathcal {D}_{\text {id}}\) (10 dB) | \(\mathcal {D}_{\text {sm}}\) (10 dB) | \(\mathcal {D}_{\text {dm}}\) (10 dB) | DSB (2 dB) | MPDR (2 dB) | \(\mathcal {D}_{\text {id}}\) (2 dB) | \(\mathcal {D}_{\text {sm}}\) (2 dB) | \(\mathcal {D}_{\text {dm}}\) (2 dB) |
---|---|---|---|---|---|---|---|---|---|---|
NR | 1.4 | 6.4 | 7.5 | 6.1 | 7.2 | 1.4 | 7.0 | 8.9 | 7.6 | 9.2 |
IR | 1.9 | 8.1 | 14.4 | 5.5 | 12.9 | 1.9 | 8.0 | 12.8 | 6.3 | 11.8 |
ν _{sd} | 0.03 | 0.25 | 0.02 | 0.11 | 0.06 | 0.03 | 0.28 | 0.03 | 0.08 | 0.07 |
Δ _{PESQ} | 0.02 | 0.17 | 0.75 | -0.02 | 0.68 | 0.02 | 0.12 | 0.59 | 0.17 | 0.53 |
Δ _{STOI} | 0.01 | 0.07 | 0.17 | -0.01 | 0.15 | 0.01 | 0.05 | 0.18 | 0.04 | 0.16 |
NR | 0.9 | 4.3 | 7.1 | 7.0 | 6.0 | 0.9 | 2.8 | 11.7 | 7.5 | 6.1 |
IR | 0.5 | 2.9 | 15.7 | 5.3 | 13.9 | 0.5 | 2.1 | 13.9 | 6.2 | 11.0 |
ν _{sd} | 0.06 | 0.19 | 0.03 | 0.11 | 0.03 | 0.06 | 0.17 | 0.03 | 0.10 | 0.03 |
Δ _{PESQ} | 0.03 | -0.01 | 0.85 | 0.20 | 0.76 | 0.03 | 0.04 | 0.73 | 0.31 | 0.51 |
Δ _{STOI} | 0.01 | -0.03 | 0.13 | 0.01 | 0.12 | 0.01 | -0.02 | 0.16 | 0.05 | 0.11 |
NR | 1.4 | 10.6 | 6.4 | 5.6 | 6.5 | 1.4 | 12.8 | 8.4 | 7.7 | 8.1 |
IR | 1.7 | 15.9 | 13.8 | 5.6 | 11.8 | 1.7 | 15.0 | 11.2 | 6.4 | 10.4 |
ν _{sd} | 0.03 | 0.87 | 0.02 | 0.09 | 0.04 | 0.04 | 0.81 | 0.02 | 0.07 | 0.04 |
Δ _{PESQ} | 0.02 | -1.10 | 0.61 | 0.20 | 0.53 | 0.02 | -0.63 | 0.51 | 0.31 | 0.47 |
Δ _{STOI} | 0 | -0.37 | 0.07 | 0.01 | 0.07 | 0.01 | -0.25 | 0.10 | 0.06 | 0.09 |
7.2.2 Experiment 2
In this experiment, the proposed system \(\mathcal {D}_{\text {dm}}\) and the output of the DSB are multiplied by the a posteriori DSPP. A system where DOA-based DSPP is applied at the output of a fixed spatial filter is proposed in [23], and the goal of the current experiment is to confirm that the benefit of the DSPP is even larger when it is used in combination with a data-dependent, informed spatial filter, rather than a fixed spatial filter. The experiment is repeated with the two DOA estimators discussed in Section 3.
Results when the spatial filter output is multiplied by the estimated DSPP
Metric | DSB-inst | DSB-cPSD | \(\mathcal {D}_{\text {dm}}\)-inst | \(\mathcal {D}_{\text {dm}}\)-cPSD |
---|---|---|---|---|
NR | 14.3 | 14.2 | 19.3 | 18.9 |
IR | 15.7 | 15.6 | 24.9 | 24.3 |
ν _{sd} | 0.36 | 0.36 | 0.33 | 0.34 |
Δ _{PESQ} | 0.52 | 0.47 | 0.88 | 0.79 |
Δ _{STOI} | 0.10 | 0.08 | 0.13 | 0.11 |
NR | 9.1 | 8.2 | 15.7 | 15.4 |
IR | 12.3 | 11.5 | 25.0 | 23.1 |
ν _{sd} | 0.12 | 0.13 | 0.15 | 0.15 |
Δ _{PESQ} | 0.67 | 0.60 | 1.07 | 1.00 |
Δ _{STOI} | 0.08 | 0.07 | 0.11 | 0.10 |
NR | 12.2 | 11.4 | 16.7 | 16.0 |
IR | 14.0 | 13.2 | 22.5 | 21.6 |
ν _{sd} | 0.27 | 0.26 | 0.24 | 0.23 |
Δ _{PESQ} | 0.54 | 0.48 | 0.87 | 0.84 |
Δ _{STOI} | 0.04 | 0.02 | 0.05 | 0.04 |
7.2.3 Experiment 3
7.2.4 Experiment 4
8 Conclusions
We addressed the problem of source extraction in the presence of background noise and speech interferers. The DOA of the desired source was assumed to be approximately known, while the number and locations of the interferers were unknown. Designing robust spatial filters is a challenging task in such scenarios, as the PSD matrix of the undesired speech signals needs to be estimated from the data. We proposed an informed spatial filtering framework, where the first step is to design an appropriate desired signal detector. We discussed and experimentally showed that the commonly used Gaussian signal model-based detector is not suitable when the undesired signals contain speech. Therefore, we proposed a DOA model-based detector, where narrowband DOA estimates are used for discrimination of desired and undesired speakers, while the Gaussian signal model aids the detection of noisy TF bins. The performance of the detector was evaluated in terms of ROC curves, and by objective evaluation of the extracted signals when the detector is applied for PSD matrix estimation in an informed MVDR filtering framework.
Declarations
Acknowledgements
The authors would like to thank the Editorial board and the Reviewers for considering and revising this manuscript.
Funding
No funding was received or used to prepare this manuscript.
Authors’ contributions
Both authors had significant contribution to the development of early ideas and design of the final algorithms. Throughout all stages, the authors discussed the importance and the quality of the algorithms and the structure of the manuscript. Both authors read and approved the submitted version of the manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
- JL Flanagan, JD Johnston, R Zahn, GW Elko, Computer-steered microphone arrays for sound transduction in large rooms. J. Acoust. Soc. Am.5(78), 1508–1518 (1985).View ArticleGoogle Scholar
- S Doclo, M Moonen, Superdirective beamforming robust against microphone mismatch. IEEE Signal Process. Lett.15(2), 617–631 (2007).Google Scholar
- J Benesty, J Chen, Y Huang, Microphone Array Signal Processing (Springer, Berlin, 2008).Google Scholar
- M Souden, J Benesty, S Affes, On optimal frequency-domain multichannel linear filtering for noise reduction. IEEE Trans. Audio Speech Lang. Process.18(2), 260–276 (2010).View ArticleGoogle Scholar
- KL Bell, Y Ephraim, HL Van Trees, A Bayesian approach to robust adaptive beamforming. IEEE Trans. Signal Process.48(2), 386–398 (2000).View ArticleGoogle Scholar
- CJ Lam, AC Singer, Bayesian beamforming for DOA uncertainty: theory and implementation. IEEE Trans. Signal Process.54(11), 4435–4445 (2006).View ArticleGoogle Scholar
- Y Grenier, A microphone array for car environments. Speech Commun.12:, 25–39 (1993).View ArticleGoogle Scholar
- MK Buckley, Spatial/spectral filtering with linearly constrained minimum variance beamformers. IEEE Trans. Acoust. Speech Signal Process.35(3), 249–266 (1987).View ArticleGoogle Scholar
- CA Anderson, PD Teal, MA Poletti, Spatially robust far-field beamforming using the von Mises(-Fisher) distribution. IEEE Trans. Acoust. Speech Signal Process.23(12), 2189–2197 (2015).Google Scholar
- O Hoshuyama, A Sugiyama, A Hirano, A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters. IEEE Trans. Signal Process.47(10), 2677–2684 (1999).View ArticleGoogle Scholar
- BJ Yoon, I Tashev, A Acero, in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Robust adaptive beamforming algorithm using instantaneous direction of arrival with enhanced noise suppression capability (HI, USA, 2007), pp. 133–136.Google Scholar
- O Hoshuyama, B Begasse, A Sugiyama, A Hirano, in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). A real time robust adaptive microphone array controlled by an SNR estimate (ICASSP, Seattle, 1998), pp. 3605–3608.Google Scholar
- M Taseska, EAP Habets, Informed spatial filtering with distributed arrays. IEEE Trans. Audio Speech Lang. Process.22(7), 1195–1207 (2014).View ArticleGoogle Scholar
- M Taseska, EAP Habets, Spotforming: Spatial filtering with distributed arrays for position-selective sound acquisition. IEEE/ACM Trans. Audio Speech Lang. Process.24(7), 1291–1304 (2016).View ArticleGoogle Scholar
- M Souden, J Chen, J Benesty, S Affes, An integrated solution for online multichannel noise tracking and reduction. IEEE Trans. Audio Speech Lang. Process.19(7), 2159–2169 (2011).View ArticleGoogle Scholar
- T Higuchi, N Ito, T Yoshioka, T Nakatani, in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise (Shanghai, 2016).
- S Affès, Y Grenier, A signal subspace tracking algorithm for microphone array processing of speech. IEEE Trans. Speech Audio Process. 5(5), 425–437 (1997).
- S Gannot, D Burshtein, E Weinstein, Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. Signal Process. 49(8), 1614–1626 (2001).
- D Van Compernolle, Adaptive filter structures for enhancing cocktail party speech from multiple microphone recordings. Colloque sur le traitement du signal et des images, 513–516 (1989). http://documents.irevues.inist.fr/handle/2042/11518?show=full.
- I Cohen, Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. IEEE Trans. Speech Audio Process. 11(5), 466–475 (2003).
- M Taseska, EAP Habets, in International Workshop on Acoustic Echo and Noise Control (IWAENC). MMSE-based blind source extraction in diffuse noise fields using a complex coherence-based a priori SAP estimator (2012).
- DP Jarrett, EAP Habets, PA Naylor, in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Spherical harmonic domain noise reduction using an MVDR beamformer and DOA-based second-order statistics estimation (Vancouver, 2013).
- I Tashev, A Acero, in International Workshop on Acoustic Echo and Noise Control (IWAENC). Microphone array post-processor using instantaneous direction-of-arrival (Paris, 2006).
- M Taseska, EAP Habets, in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Minimum Bayes risk signal detection for speech enhancement based on a narrowband DOA model (Brisbane, 2015).
- R Roy, T Kailath, ESPRIT - estimation of signal parameters via rotational invariance techniques. IEEE Trans. Acoust. Speech Signal Process. 37, 984–995 (1989).
- S Araki, H Sawada, R Mukai, S Makino, DOA estimation for multiple sparse sources with arbitrarily arranged multiple sensors. J. Signal Process. Syst. 63, 265–275 (2011).
- O Thiergart, W Huang, EAP Habets, in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). A low complexity weighted least squares narrowband DOA estimator for arbitrary array geometries (ICASSP, Shanghai, 2016), pp. 340–344.
- H Cox, RM Zeskind, MM Owen, Robust adaptive beamforming. IEEE Trans. Acoust. Speech Signal Process. 35(10), 1365–1376 (1987).
- KV Mardia, PE Jupp, Directional Statistics (Wiley-Blackwell, New York, 1999).
- M Abramowitz, IA Stegun (eds.), Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables (United States Department of Commerce, USA, 1972).
- S Kay, Fundamentals of Statistical Signal Processing, Volume II: Detection Theory (Prentice-Hall, Inc., NJ, USA, 1998).
- M Souden, J Chen, J Benesty, S Affès, Gaussian model-based multichannel speech presence probability. IEEE Trans. Audio Speech Lang. Process. 18(5), 1072–1077 (2010).
- O Thiergart, G Del Galdo, EAP Habets, On the spatial coherence in mixed sound fields and its application to signal-to-diffuse ratio estimation. J. Acoust. Soc. Am. 132(4), 2337–2346 (2012).
- EAP Habets, I Cohen, S Gannot, Generating nonstationary multisensor signals under a spatial coherence constraint. J. Acoust. Soc. Am. 124(5), 2911–2917 (2008).
- A Krueger, E Warsitz, R Haeb-Umbach, Speech enhancement with a GSC-like structure employing eigenvector-based transfer function ratios estimation. IEEE Trans. Audio Speech Lang. Process. 19(1), 206–219 (2011).
- S Araki, H Sawada, R Mukai, S Makino, in IEEE International Workshop on Acoustic Signal Enhancement (IWAENC). A novel blind source separation method with observation vector clustering (IWAENC, Eindhoven, 2005).
- EAP Habets, Room impulse response generator. Technical report, Technische Universiteit Eindhoven (2006).
- ITU-T: Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs.
- CH Taal, RC Hendriks, R Heusdens, J Jensen, An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 19(7), 2125–2136 (2011).