  • Research
  • Open Access

Analysis of dual-channel ICA-based blocking matrix for improved noise estimation

EURASIP Journal on Advances in Signal Processing 2014, 2014:26

https://doi.org/10.1186/1687-6180-2014-26

  • Received: 17 June 2013
  • Accepted: 10 February 2014
  • Published:

Abstract

For speech enhancement or blind signal extraction (BSE), estimating interference and noise characteristics is decisive for its performance. For multichannel approaches using multiple microphone signals, a BSE scheme combining a blocking matrix (BM) and spectral enhancement filters was proposed in numerous publications. For such schemes, the BM provides a noise estimate by suppressing the target signal only. The estimated noise reference is then used to design spectral enhancement filters for the purpose of noise reduction. For designing the BM, ‘Directional Blind Source Separation (BSS)’ was already proposed earlier. This method combines a generic BSS algorithm with a geometric constraint derived from prior information on the target source position to obtain an estimate for all interfering point sources and diffuse background noise. In this paper, we provide a theoretical analysis to show that Directional BSS converges to a relative transfer function (RTF)-based BM. The behavior of this informed signal separation scheme is analyzed and the blocking performance of Directional BSS under various acoustical conditions is evaluated. The robustness of Directional BSS regarding the localization error for the target source position is verified by experiments. Finally, a BSE scheme combining Directional BSS and Wiener-type spectral enhancement filters is described and evaluated.

Keywords

  • Blocking Matrix
  • Blind Source Separation
  • Noise Estimate
  • Target Source
  • Minimum Variance Distortionless Response

1 Introduction

Blind signal extraction (BSE) aiming at extracting one source signal from a mixture of an unknown number of acoustic sources in noisy environments is a generic task in acoustic signal processing. It has a wide range of applications in many fields: As popular examples, hands-free interfaces for acoustic communications and human-machine interaction offer many challenging and relevant application scenarios, such as teleconferencing, interactive television, humanoid robots, and gaming. Moreover, acoustic signal extraction techniques are also highly relevant for assistive devices, such as hearing aids.

If multiple microphones are available, data-dependent multichannel approaches for signal extraction can be classified into unsupervised and supervised approaches. The class of unsupervised methods does not require prior knowledge on the spatial distribution of sources and sensors. The lack of prior knowledge is compensated by exploiting fundamental signal characteristics. Conventional unsupervised signal extraction approaches are, e.g., independent component analysis (ICA)-based [1, 2] or sparseness-based blind source separation (BSS) algorithms [3, 4]. However, conventional ICA-based approaches cannot be used for underdetermined cases, where the number of sensors is less than the number of sources, and sparseness-based methods are highly dependent on the sparsity of the mixing signals. Recently, model-based multichannel approaches have gained a lot of attention. These are, e.g., approaches based on a spatial covariance model [5] or multichannel nonnegative matrix factorization (NMF) methods based on modeling complex Gaussian distributions [6, 7]. As opposed to [1–4], they do not solely rely on the independence or the sparsity of the underlying signals and can be used for underdetermined source separation.

Unlike unsupervised methods, the class of supervised methods needs reference information. Typical supervised signal extraction approaches are, e.g., multichannel Wiener filtering (MWF) [8, 9] or beamforming approaches, such as linearly constrained minimum variance (LCMV) beamformer [10]. MWF approaches are based on minimum mean square error (MMSE) estimators requiring noise and interference statistics as references. The LCMV beamformer requires reference signal(s) represented by linear constraints in order to suppress interfering sources and to preserve target signal(s) from known directions [11]. As an alternative form of the LCMV beamformer, the generalized sidelobe canceler (GSC) was proposed in [12], which converts the constrained optimization problem into an unconstrained problem.

Under realistic acoustic conditions, prior information is often exploited for practical realizations of supervised and unsupervised signal extraction approaches. This leads to the class of informed signal processing algorithms, where relevant information of the underlying conditions is exploited to realize the signal extraction algorithms or to render these algorithms more robust and reliable for practical conditions. Prior knowledge, which can be spatial information in terms of the direction of arrival (DoA) of source signals, coherence or diffuseness of the sound field, etc., may be given or estimated from the acquired sensor data. An overview of the relevant work belonging to this class is given in the following.

To realize the MWF, an estimate of the second-order statistics (SOS) of the noise signals is required. Based on the assumption of a diffuse noise field, several methods have been derived for estimating the SOS of the noise components in terms of the auto-power spectral density (PSD) [13–15] or the cross PSDs between all channels for both the target source(s) and the noise and interference components [16]. Furthermore, it was recently proposed to exploit the direct-to-diffuse ratio (DDR) to realize the MWF for stationary noise and babble noise conditions [17]. It was also suggested to exploit the position information to estimate the cross PSDs of directional speech interferences [18]. For unsupervised algorithms, prior spatial information such as information about the source positions or the sensor constellation is often incorporated to improve the robustness. Model-based multichannel approaches [6, 7] can incorporate the directional information by initialization of a part of the spatial model. Parra and Alvino [19] proposed to combine an ICA algorithm with geometric constraints in order to improve the separation performance, where BSS was regarded as a set of beamformers whose response is constrained to a set of DoAs for recovering all sources from the mixture. Inspired by [19], Directional BSS [20] was proposed to serve as a blocking matrix (BM) by using a different constraint for the opposite purpose: this constraint essentially forces a spatial null towards a certain direction in order to suppress the target source and to preserve the interfering and noise components. The precondition not only for Directional BSS but also for Parra's method is that the DoA information on the target source(s) must be given.
Furthermore, based on the noise estimate produced by Directional BSS, a two-unit source extraction/noise reduction scheme combining a BM and a noise reduction unit was proposed in [21], where the spectral weights in the noise reduction stage are designed based on a diffuse noise field assumption. In this paper, we focus on the discussion of Directional BSS operating as a BM.

The concept of a BM was originally proposed in [12] for the structure as shown in Figure 1. The structure separates the LCMV beamformer into two main processing paths: the first path comprises a fixed beamformer (FB) with constraints on the target signal. The second path contains a BM and an adaptive interference canceler (AIC) that adaptively minimizes the noise power in the output. The BM is defined as a matrix used to reject (block) the target signal at its output, hence providing references of all undesired interference signals and noise components required for interference cancellation schemes.

Figure 1. Structure of a generalized sidelobe canceler (GSC).

Originally in [12], the BM was designed for time-invariant free-field environments and rejected the source signal from one direction only, requiring precise source location information. This BM can be regarded as a minimum variance distortionless response (MVDR) BM, as the MVDR beamformer imposes the distortionless constraint only for the desired direction. For the dual-channel case, the conventional MVDR BM is given by the delay-and-subtract beamformer (DSB) and can only suppress the direct path of the target source. Theoretically, the conventional LCMV BM can suppress the direct path and reflections by formulating the corresponding constraints in the BM if perfect knowledge of the angle of arrival for each reflection is given [22]. However, the conventional LCMV/MVDR BM will likely lead to target signal leakage as it is conceived for time-invariant scenarios, and any movement of the target source will lead to a steering error relative to the true DoA for the target signal and its reflections. To improve the robustness against the steering error, an adaptive BM was proposed in [23, 24], which needs an adaptation control requiring source activity information. In [25], the relative transfer function (RTF)-based BM for LCMV/MVDR beamforming was proposed. The RTF-based BM can perfectly suppress the target signal if the RTFs are given. However, estimating RTFs usually requires estimation of the source activity or a double-talk detector, as noise-only frames or time segments where both the transfer functions (TFs) and the noise signals are assumed to be stationary need to be available for RTF estimation [25–27].

More recently, ICA-based BSS algorithms were proposed to realize a BM [20, 28]. The approach presented in [28] is very efficient in noise estimation but can only be used for overdetermined/determined scenarios (i.e., the number of sensors is larger than or equal to the number of sources), as [28] is a generic ICA-based BSS algorithm. In [20], exploiting Directional BSS as a BM for noise estimation was proposed, where the noise includes interfering sources and diffuse background noise. This approach can be applied in both determined and underdetermined scenarios. Unlike for beamforming approaches, correlated components arriving from other directions, i.e., reflections and reverberation, will also be suppressed to the greatest extent possible by Directional BSS. This concept can deal with underdetermined scenarios such that a meaningful instantaneous estimate for all undesired signals comprising interfering speech signals and diffuse background noise can be obtained using only two microphones and regardless of the noise statistics. Note that for applying the directional constraint, the directional information on the target source must be given or estimated by a source localizer. Even with a source localizer, a predefined angular range of the target source must be given. This range was set to be −20° to 20° in front of the microphone array [20]. The algorithm of Directional BSS was introduced and its efficiency with respect to the blocking performance was shown in [20].
In this paper, we provide an in-depth analysis of the heuristically motivated BM in [20] and provide new insights with respect to several decisive aspects: (1) the relation of the ICA-based BM to other BMs, (2) the blocking performance if the target source arrives from directions which are different from broadside direction, (3) the robustness against localization errors, and (4) the BSE/speech enhancement performance when using the noise estimate produced by Directional BSS for Wiener-type spectral enhancement. Additionally, a BSE scheme combining Directional BSS and spectral enhancement filters under various acoustical conditions will be evaluated. Therefore, the main contributions of this paper compared to our earlier work are the following: For one, we show by a theoretical analysis and by experimental results that Directional BSS converges to an RTF-based BM. In addition, the performance of the proposed method is for the first time analyzed regarding some practically highly relevant aspects, e.g., the blocking ability for sources impinging from arbitrary directions and the noise reduction performance of the applied noise reduction scheme.

The paper is organized as follows: In Section 2, the generic BSS algorithm is reviewed. In Section 3, we provide a theoretical analysis to show that Directional BSS converges to an RTF-based BM and describe the algorithm of Directional BSS. Moreover, the relation/difference of Directional BSS to other conventional/state-of-the-art BMs is discussed. In Section 4, experimental results with respect to the blocking performance and the robustness against localization errors in various acoustical scenarios are presented. Finally, a BSE scheme combining Directional BSS and Wiener-type spectral enhancement filters is presented and evaluated in Section 5. Note that in this paper, we restrict our consideration to two-channel cases.

2 Determined blind source separation: generic ICA-based BSS algorithm

In this section, we briefly review a two-channel ICA-based BSS algorithm. Figure 2 depicts the basic two-channel BSS signal model for two point sources s1,s2. The microphone signals can be described in the discrete time domain by
$$x_p(k) = \sum_{m=1}^{2} h_{mp}(k) * s_m(k), \quad p \in \{1, 2\},$$
(1)

Figure 2. Basic two-channel linear BSS signal model.

where $*$ represents convolution, $h_{mp}(k)$, $m \in \{1, 2\}$, denotes the finite acoustic impulse response from the $m$-th point source to the $p$-th microphone in discrete time, and $k$ is the discrete time index.

BSS algorithms aim at determining demixing filters to extract the individual sources from the mixed signals. The output signals of the demixing system, $y_q(k)$, $q \in \{1, 2\}$, are described by
$$y_q(k) = \sum_{p=1}^{2} w_{pq}(k) * x_p(k), \quad q \in \{1, 2\},$$
(2)

where $w_{pq}(k)$ denotes the demixing filter from the $p$-th microphone to the $q$-th output channel.
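The mixing model (1) and the demixing structure (2) can be sketched in a few lines of plain Python. The impulse responses and source samples below are made-up toy values, and the demixing filters are built from the known mixing filters (an oracle solution, not a blind estimate), only to illustrate the index convention $h_{mp}$ and $w_{pq}$.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences (lists of floats)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def add(a, b):
    """Element-wise sum; the shorter sequence is zero-padded."""
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0.0) + (b[i] if i < len(b) else 0.0)
            for i in range(n)]


def negate(a):
    """Sign-inverted copy of a sequence."""
    return [-v for v in a]


def mix(h, s):
    """Eq. (1): x_p = sum_m h_mp * s_m, stored as h[m][p] with 0-based indices."""
    return [add(convolve(h[0][p], s[0]), convolve(h[1][p], s[1]))
            for p in range(2)]


def demix(w, x):
    """Eq. (2): y_q = sum_p w_pq * x_p, stored as w[p][q] with 0-based indices."""
    return [add(convolve(w[0][q], x[0]), convolve(w[1][q], x[1]))
            for q in range(2)]


# Made-up short impulse responses h[m][p] (source m -> microphone p).
h = [[[1.0, 0.5], [0.6, 0.3]],
     [[0.4, 0.2], [1.0, -0.5]]]
s = [[1.0, 0.0, -1.0, 2.0], [0.5, 1.5, 0.0, -0.5]]
x = mix(h, s)

# Oracle demixing filters built from the known h:
# w11 = h22, w21 = -h21 cancels s2 in y1; w12 = -h12, w22 = h11 cancels s1 in y2.
w = [[h[1][1], negate(h[0][1])],
     [negate(h[1][0]), h[0][0]]]
y = demix(w, x)

# Each output is one source filtered by the common term h11*h22 - h12*h21.
common = add(convolve(h[0][0], h[1][1]), negate(convolve(h[0][1], h[1][0])))
expected = [convolve(common, s[0]), convolve(common, s[1])]
```

The outputs are separated only up to a common filtering, which is the usual filtering ambiguity of convolutive BSS.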

The various criteria used for identifying $w_{pq}$ in (2) (see, e.g., [1, 2, 29]) are essentially based on the assumption that sources are statistically independent. In this paper, we use triple-N independent component analysis for convolutive mixtures (TRINICON) [30] for BSS, where the mutual information between the output channels $\mathbf{y} = [\mathbf{y}_1^{T}(k), \mathbf{y}_2^{T}(k)]^{T}$ should be minimized. As the algorithm is derived for block processing of convolutive mixtures, for each output $y_q(k)$, a sequence of $D$ output samples corresponding to $D$ successive time lags is taken into account.

The generic cost function used to determine a demixing system W is then given by [31]
$$J_{\mathrm{BSS}}(\mathbf{W}) = \hat{E}\left\{ \log \frac{\hat{p}_{y,PD}(\mathbf{y})}{\prod_{q=1}^{P} \hat{p}_{y_q,D}(\mathbf{y}_q)} \right\},$$
(3)

where $\hat{E}\{\cdot\}$ is the estimate of the statistical expectation, with ensemble averaging being replaced by temporal averaging over $N$ blocks assuming ergodicity within the individual blocks. $\hat{p}_{y,PD}$ is an estimate of the joint probability density function (pdf) of dimension $PD$ over all $P$ (here, $P = 2$) output channels, and $\hat{p}_{y_q,D}$ is the estimated multivariate pdf for channel $q$ of dimension $D$. Matrix $\mathbf{W}$ captures all the impulse response coefficients of the demixing filters, with a detailed description of its structure given in [31, 32]. Minimizing $J_{\mathrm{BSS}}(\mathbf{W})$ corresponds to minimizing the Kullback-Leibler divergence (KLD) between $\hat{p}_{y,PD}(\mathbf{y})$ and $\prod_{q=1}^{P} \hat{p}_{y_q,D}(\mathbf{y}_q)$, which leads to maximization of the statistical independence of the output vectors $\mathbf{y}_q$.
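To give a feeling for what the criterion measures, the following sketch evaluates the KLD between the joint pdf and the product of the marginal pdfs for the simplest special case: two zero-mean Gaussian channels with a single time lag ($D = 1$). Under this assumption, which is made here purely for illustration and is not the estimator used by TRINICON, the KLD reduces to $\tfrac{1}{2}(\log r_{11} + \log r_{22} - \log\det\mathbf{R})$. The mixing gains and signal lengths are arbitrary.

```python
import math
import random


def gaussian_mutual_information(y1, y2):
    """0.5 * (log r11 + log r22 - log det R) for two scalar channels (D = 1)."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    r11 = sum((a - m1) ** 2 for a in y1) / n
    r22 = sum((b - m2) ** 2 for b in y2) / n
    r12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n
    return 0.5 * (math.log(r11) + math.log(r22) - math.log(r11 * r22 - r12 ** 2))


random.seed(0)
s1 = [random.gauss(0.0, 1.0) for _ in range(5000)]
s2 = [random.gauss(0.0, 1.0) for _ in range(5000)]

# Instantaneous toy mixture (made-up gains), standing in for unseparated outputs.
x1 = [a + 0.8 * b for a, b in zip(s1, s2)]
x2 = [0.6 * a + b for a, b in zip(s1, s2)]

j_mixed = gaussian_mutual_information(x1, x2)      # clearly larger than zero
j_separated = gaussian_mutual_information(s1, s2)  # close to zero
```

The value is near zero for the independent signals and grows with the correlation between the channels, which is the behavior a separation algorithm exploits when it minimizes the criterion.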

3 Directional blind source separation as a blocking matrix

In this section, we firstly discuss the relation of Directional BSS with a conventional RTF-based BM in Subsection 3.1. The Directional BSS algorithm is described in Subsection 3.2 before comparing it to alternative approaches in Subsection 3.3.

3.1 From system identification to RTF-based blocking matrix

In [33], the relation between the optimum broadband solution of blind source separation and blind system identification was presented. For a single-input/multiple-output (SIMO) system as shown in Figure 3, the perfect suppression of a broadband source implies for system identification:
$$h_{11}(k) * w_{11}(k) + h_{12}(k) * w_{21}(k) = 0 \;\Leftrightarrow\; h_{11}(k) * w_{11}(k) = -h_{12}(k) * w_{21}(k).$$
(4)

Figure 3. Blind system identification based on a SIMO model.

The optimum filters fulfilling (4) read in the z-domain [33]:
$$W_{11}(z) = \alpha H_{12}(z), \quad W_{21}(z) = -\alpha H_{11}(z).$$
(5)

As a precondition for identifying this solution, $H_{11}(z)$ and $H_{12}(z)$ must not have common zeros, and the filter lengths must equal the lengths of the room impulse responses. Obviously, the optimum filters can only be determined up to a scaling factor $\alpha$.

Let us consider the case where $w_{21}$ is forced to be a delay $\tau$; then (4) reads in the $z$-domain:
$$W_{11}(z) = -\frac{H_{12}(z)}{H_{11}(z)}\, z^{-\tau}, \quad W_{21}(z) = z^{-\tau}.$$
(6)
In the frequency domain, (6) can be expressed by:
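$$\underline{w}_{11}(\Omega) = -\frac{\underline{h}_{12}(\Omega)}{\underline{h}_{11}(\Omega)}\, e^{-j\Omega\tau}, \quad \underline{w}_{21}(\Omega) = e^{-j\Omega\tau},$$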
(7)

where underlined characters denote frequency-domain representations. The normalized frequency $\Omega$ is given as $2\pi f / f_s$, where $f_s$ denotes the sampling frequency. The ratio of the two frequency responses, $\underline{h}_{12}(\Omega) / \underline{h}_{11}(\Omega)$, is known as the RTF or the TF ratio.

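If we divide $\underline{w}_{11}(\Omega)$ by $\underline{w}_{21}(\Omega)$, we get:
$$\tilde{\underline{h}}_{\mathrm{RTF}} = -\frac{\underline{h}_{12}(\Omega)}{\underline{h}_{11}(\Omega)},$$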
(8)

which is exactly the form of the RTF-based BM proposed in [25].
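The following sketch checks numerically that the filter pair in (7) blocks the target at every evaluated frequency, i.e., that $\underline{h}_{11}\underline{w}_{11} + \underline{h}_{12}\underline{w}_{21} = 0$. The two impulse responses and the delay $\tau$ are made-up toy values; `freq_response` and `rtf_blocking_filters` are hypothetical helper names introduced only for this illustration.

```python
import cmath
import math


def freq_response(h, omega):
    """Frequency response of an FIR filter h at normalized frequency omega."""
    return sum(c * cmath.exp(-1j * omega * k) for k, c in enumerate(h))


def rtf_blocking_filters(h11, h12, omega, tau):
    """Eq. (7): w11 = -(H12 / H11) * exp(-j*omega*tau), w21 = exp(-j*omega*tau)."""
    delay = cmath.exp(-1j * omega * tau)
    rtf = freq_response(h12, omega) / freq_response(h11, omega)
    return -rtf * delay, delay


# Toy impulse responses from the target to microphones 1 and 2 (made-up values;
# h11 has no zeros on the unit circle, so the division is well defined).
h11 = [1.0, 0.4, 0.1]
h12 = [0.0, 0.9, 0.3, 0.05]
tau = 8

omegas = [2.0 * math.pi * nu / 64 for nu in range(33)]
leakage = []
for om in omegas:
    w11, w21 = rtf_blocking_filters(h11, h12, om, tau)
    leakage.append(abs(freq_response(h11, om) * w11 + freq_response(h12, om) * w21))

max_leakage = max(leakage)  # zero up to rounding error
```

In practice the transfer functions are unknown, which is why the paper relies on an adaptive algorithm to approach this solution.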

For a multiple-input/multiple-output (MIMO) system, in [33], it is shown that the optimum BSS solution is the generalization of the SIMO identification solution. This holds however only for determined cases. For an underdetermined scenario as shown in Figure 4, there is no determined solution. However, here, our aim is not to find a determined BSS solution in underdetermined scenarios, but to exploit BSS as a BM to suppress the target source s1 only. Therefore, it still follows that
$$s_1(k) * h_{11}(k) * w_{11}(k) + s_1(k) * h_{12}(k) * w_{21}(k) = 0 \;\Leftrightarrow\; h_{11}(k) * w_{11}(k) = -h_{12}(k) * w_{21}(k),$$
(9)

Figure 4. ICA-based BSS for an underdetermined scenario.

which is the same as in (4) for the system identification in a SIMO system. As BSS has no determined solution in underdetermined scenarios, the problem is how to force BSS to suppress the target source only and preserve the other sources to form a joint noise estimate. For this purpose, we combine the generic BSS with a geometric constraint to force a spatial null towards the direction of the target source. We denote the combined algorithm as ‘Directional BSS’ and analyze it in the following sections.

3.2 Algorithm

Blind source separation can be regarded as 'blind adaptive beamforming' (blind ABF) [34], as BSS and ABF have similar goals and a similar structure: Both attempt to extract a target signal and reduce the interference by multichannel array processing as described in [35, 36]. In [34] it is shown that BSS is equivalent to a set of adaptive beamformers which form multiple null-beams steered towards the directions of interfering sources and their reflections. On the other hand, there are fundamental (characteristic) differences between BSS and ABF: generic BSS usually does not require prior information on source locations and sensor constellations, while ABF requires spatial information on the locations of sources and sensors. In [19] a method was proposed to combine BSS and beamforming for achieving a better separation performance by utilizing the geometric information of sources. This kind of combination is known as geometric source separation, where the response of the BSS demixing filters is additionally constrained to a set of directions.

The original algorithm of geometric source separation was described in the discrete Fourier transform (DFT) domain. The response of BSS at the q th BSS output is constrained to the direction θ, which can be expressed by
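$$\underline{\mathbf{w}}_q^{T}(\Omega)\, \mathbf{d}(\Omega, \theta) = \xi,$$
(10)
where $\xi$ denotes the constraint, $\underline{\mathbf{w}}_q(\Omega) = [\underline{w}_{1q}(\Omega), \underline{w}_{2q}(\Omega)]^{T}$ describes the demixing filters for the $q$-th BSS output channel at the frequency $\Omega = 2\pi\nu / N$ ($\nu$ is the frequency bin and $N$ is the length of the demixing filter) in the DFT domain; $\{\cdot\}^{T}$ is the transpose operator; $\mathbf{d}(\Omega, \theta)$ is the steering vector pointing to direction $\theta$:
$$\mathbf{d}(\Omega, \theta) = [e^{-j\zeta}, 1]^{T},$$
(11)
$$\zeta = \frac{\Omega\, d_{\mathrm{mic}}\, f_s \sin\theta}{c},$$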
(12)

where c is the sound velocity. Note that both the microphone spacing dmic and the angle θ relative to the array axis must be given. For simplicity, we omit the frequency variable Ω in the sequel.
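As an illustration of (10) to (12), the sketch below builds the steering vector for one DFT bin and evaluates the constrained response $\mathbf{w}_q^{T}\mathbf{d}$ for two hand-picked filter vectors. The sign of the exponent, the sound velocity of 343 m/s, and the chosen bin are assumptions made for this example; `steering_vector` and `response` are hypothetical helper names.

```python
import cmath
import math


def steering_vector(omega, theta_deg, d_mic, fs, c=343.0):
    """Eqs. (11)-(12): d = [exp(-j*zeta), 1]^T; the sign convention is assumed."""
    zeta = omega * d_mic * fs * math.sin(math.radians(theta_deg)) / c
    return [cmath.exp(-1j * zeta), 1.0 + 0.0j]


def response(w_q, d):
    """Left-hand side of (10): w_q^T d (plain transpose, no conjugation)."""
    return w_q[0] * d[0] + w_q[1] * d[1]


N = 1024          # demixing filter length, as used in Section 4.1
fs = 16000.0      # sampling frequency in Hz
d_mic = 0.06      # microphone spacing in m
theta = 30.0      # assumed target direction in degrees
nu = 64           # DFT bin; corresponds to 1 kHz for these settings
omega = 2.0 * math.pi * nu / N

d = steering_vector(omega, theta, d_mic, fs)
w_null = [1.0 + 0.0j, -d[0]]       # response 0: a spatial null towards theta
w_unit = [0.0 + 0.0j, 1.0 + 0.0j]  # response 1: passes the reference channel
null_response = response(w_null, d)
unit_response = response(w_unit, d)
```

Choosing $\xi = 0$ therefore corresponds to a null-steering filter pair, while $\xi = 1$ corresponds to a distortionless response towards $\theta$.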

More generally, to constrain the response of the BSS demixing matrix to a set of P=2 directions Θ, we write:
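$$\mathbf{W}\, \mathbf{D}(\Theta) = \mathbf{C},$$
(13)

where $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2]^{T}$ is the BSS demixing matrix and $\mathbf{D}(\Theta) = [\mathbf{d}(\theta_1), \mathbf{d}(\theta_2)]$ contains steering vectors pointing to $\Theta = [\theta_1, \theta_2]$. The $2 \times 2$ matrix $\mathbf{C}$ refers to the constraints.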

Two constraints were proposed in [19], and they are
$$\mathrm{diag}(\mathbf{W}\, \mathbf{D}(\Theta)) = \mathbf{I},$$
(14)
or
$$\mathbf{W}\, \mathbf{D}(\Theta) = \mathbf{I},$$
(15)
where I refers to a 2×2 identity matrix. As both of the two constraints aim at extracting the sources, not at blocking the sources, we will not discuss them here, but a detailed discussion can be found in [19, 37]. The constraint for blocking sources was proposed in [37]:
$$\mathrm{offdiag}(\mathbf{W}\, \mathbf{D}(\Theta)) = \mathbf{0},$$
(16)

which restricts the output channels to have a zero response for the signals arriving from the directions given in Θ, i.e., it forces each output channel to form a null beamformer steered to the source which should be blocked in this output channel.

The constraint (16) can be incorporated into the overall cost function for the source separation (3) as an additional penalty term:
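$$J_C(\mathbf{W}) = \left\| \mathrm{offdiag}(\mathbf{W}\, \mathbf{D}(\Theta)) \right\|_F^2,$$
(17)
where $\|\mathbf{A}\|_F^2 = \mathrm{trace}\{\mathbf{A}\mathbf{A}^{H}\}$ is the squared Frobenius norm of the matrix $\mathbf{A}$ and $\{\cdot\}^{H}$ refers to the conjugate transpose operator. Combining this with the cost function for the generic BSS algorithm given in (3), we obtain:
$$J_{\mathrm{total}}(\mathbf{W}) = J_{\mathrm{BSS}}(\mathbf{W}) + \eta_C\, J_C(\mathbf{W}),$$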
(18)

where the weighting parameter ηC can be chosen to control the importance of the geometric constraint relative to the separation criterion represented by JBSS (3).

As Directional BSS serves as a BM for a single desired source, only the target source needs to be suppressed. Therefore, $J_C(\mathbf{W})$ is modified by considering the following conditions:

  • The direct path for the target signal is suppressed by the penalty term analogously to null-steering beamforming, i.e., a spatial null is forced toward the direction $\theta$ of the target source.

  • As only the target signal needs to be suppressed, only one BSS output channel is controlled by the geometric constraint. Without loss of generality, output channel 1 is chosen to be controlled with the penalty term $J_C(\mathbf{W})$ in the sequel.

  • In order to converge to the RTF-based BM, w21 is set to be a pure delay and remains unchanged during adaptation of the demixing system. Note that we could equivalently use the first channel as the reference and in that case w11 is a pure delay.

The simplified cost function for the constraint then reads:
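$$J_C(\mathbf{W}) = \left\| \begin{bmatrix} \mathbf{w}_1^{T}\, \mathbf{d}(\theta) \\ 0 \end{bmatrix} \right\|_F^2.$$
(19)
As $J_C(\mathbf{W})$ is a real-valued function of the complex-valued demixing filters, the gradient-descent update for the constrained part for $\mathbf{W}$ is obtained by taking the derivative of the cost function $J_C(\mathbf{W})$ with respect to $\mathbf{W}^{H}$ [38]. Besides, as we want to keep $w_{21}$ fixed, the constraint must be applied to the demixing filter $w_{11}$ only. Thus, the filter update term for the constraint part in the DFT domain yields:
$$\frac{\partial J_C(\mathbf{W})}{\partial \mathbf{W}^{H}} = \begin{bmatrix} \dfrac{\partial\, [\mathbf{w}_1^{T}\mathbf{d}(\theta)]\, [\mathbf{w}_1^{T}\mathbf{d}(\theta)]^{H}}{\partial w_{11}^{*}} & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} \dfrac{\partial\, (w_{11} e^{-j\zeta} + w_{21})\, (w_{11} e^{-j\zeta} + w_{21})^{*}}{\partial w_{11}^{*}} & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} w_{21}\, e^{j \Omega d_{\mathrm{mic}} f_s \sin\theta / c} + w_{11} & 0 \\ 0 & 0 \end{bmatrix},$$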
(20)
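where $a^{*}$ refers to the complex conjugate of $a$. It should be noted that both frequency-domain and time-domain BSS algorithms can be combined with the geometric constraint. As we use the time-domain TRINICON SOS-based algorithm given in [32], the filter is updated in the time domain for block $\breve{m}$ after iteration $\breve{k}$ as follows:
$$\mathbf{W}^{\breve{k}+1}(\breve{m}) = \mathbf{W}^{\breve{k}}(\breve{m}) - \breve{\mu}\, \Delta\mathbf{W}_{\mathrm{total}},$$
(21)
where $\breve{\mu}$ is the step size and $\Delta\mathbf{W}_{\mathrm{total}}$ is given as [20]
$$\Delta\mathbf{W}_{\mathrm{total}} = \frac{\partial J_{\mathrm{BSS}}(\mathbf{W})}{\partial \mathbf{W}} + \eta_C\, \mathrm{DFT}^{-1}\left\{ \frac{\partial J_C(\mathbf{W})}{\partial \mathbf{W}^{H}} \right\},$$
(22)
where $\mathrm{DFT}^{-1}\{\cdot\}$ denotes the inverse discrete Fourier transform yielding a nonzero update contribution of the same length as the demixing filter length $N$. The term $\partial J_C(\mathbf{W}) / \partial \mathbf{W}^{H}$ is given in (20). A detailed description of the applied TRINICON-based update $\partial J_{\mathrm{BSS}}(\mathbf{W}) / \partial \mathbf{W}$ can be found in [32]. The natural gradient update is given as [32]:
$$\frac{\partial J_{\mathrm{BSS}}(\mathbf{W})}{\partial \mathbf{W}} = 2 \sum_{\breve{i}=0}^{\infty} \beta(\breve{i}, \breve{m})\, \mathbf{W}^{\breve{k}}(\breve{m}) \times \left[ \mathbf{R}_{yy} - \mathrm{bdiag}\, \mathbf{R}_{yy} \right] \mathrm{bdiag}^{-1}\, \mathbf{R}_{yy},$$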
(23)

where $\mathbf{R}_{yy}$ denotes the $2D \times 2D$ correlation matrix of the output signal vector $\mathbf{y}$ of length $2D$, and bdiag operates on a block matrix by setting all off-diagonal blocks to zero. $\beta(\breve{i}, \breve{m})$ is a weighting function normalized to $\sum_{\breve{i}=0}^{\breve{m}} \beta(\breve{i}, \breve{m}) = 1$, allowing for online, offline, or block-online realization of the algorithm [32].
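The constraint part of the update can be illustrated in isolation. The sketch below runs gradient descent on $J_C$ alone for a single DFT bin, using the top-left entry of (20) as the gradient and keeping $w_{21}$ fixed as a pure delay. The BSS term of (22) is deliberately omitted, so this is not Directional BSS itself; the step size, iteration count, delay, and steering-vector sign convention are assumptions chosen for the example.

```python
import cmath
import math


def constraint_gradient(w11, w21, zeta):
    """Top-left entry of (20): d J_C / d w11^* = w21 * exp(+j*zeta) + w11."""
    return w21 * cmath.exp(1j * zeta) + w11


def adapt_w11(w11, w21, zeta, step=0.5, iterations=60):
    """Gradient descent on J_C only; w21 is held fixed as in Section 3.2."""
    for _ in range(iterations):
        w11 = w11 - step * constraint_gradient(w11, w21, zeta)
    return w11


omega = 2.0 * math.pi * 64 / 1024          # one DFT bin (1 kHz at 16 kHz)
zeta = omega * 0.06 * 16000.0 * math.sin(math.radians(20.0)) / 343.0
w21 = cmath.exp(-1j * omega * 4)           # pure delay of 4 samples (arbitrary)
w11 = adapt_w11(0.0 + 0.0j, w21, zeta)

# Response towards the constrained direction, w_1^T d with d = [exp(-j*zeta), 1]^T.
d1 = cmath.exp(-1j * zeta)
residual = abs(w11 * d1 + w21)
```

With the penalty alone, $w_{11}$ converges to the free-field null-steering solution; in the full algorithm the BSS gradient additionally drives the filters towards suppressing reflections of the target.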

By applying Directional BSS, the target source s1 should be suppressed in BSS output channel 1. Thus, the noise estimate n ̂ is given by the output y1 as follows:
$$\hat{n} = y_1 = \underbrace{w_{11}\, x_1}_{v_1} + \underbrace{w_{21}\, x_2}_{v_2} = \underbrace{w_{11}\, (x_{s,1} + x_{n,1})}_{v_1} + \underbrace{w_{21}\, (x_{s,2} + x_{n,2})}_{v_2} \approx w_{11}\, x_{n,1} + w_{21}\, x_{n,2},$$
(24)

where $x_{s,p}$ and $x_{n,p}$ denote the target and the noise component contained in microphone $p$, respectively.

Besides, in [19] the efficiency of a proper geometrical initialization was shown. For the geometric constraint, the direction of the target source needs to be known a priori or it needs to be estimated. If the target source position is not known, an additional source localizer is necessary. Many localization algorithms can be used, e.g., GCC-PHAT [39] or an ICA-based source localizer [40]. With the given DoA information, we can initialize the filter structure corresponding to a DSB in order to accelerate convergence. The initialization is performed after each movement of the target source. Defining a vector $\mathbf{d}_{\mathrm{sub}}(\theta) = [1, -e^{-j \Omega d_{\mathrm{mic}} f_s \sin\theta / c}]^{T}$, for the constraint-controlled channel 1, the filter coefficients can be initialized as follows:
$$\mathbf{w}_1 = \mathbf{d}_{\mathrm{sub}}(\theta).$$
(25)
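A short sketch of the DSB initialization under a free-field assumption: it builds the delay-and-subtract filter pair for a steering direction and evaluates how much of a plane wave from the true direction leaks through. The sign conventions, the sound velocity of 343 m/s, and the example angles are assumptions for illustration; `dsb_init` and `free_field_leakage` are hypothetical helper names.

```python
import cmath
import math


def delay_in_samples(theta_deg, d_mic, fs, c=343.0):
    """Inter-microphone delay of a far-field source, in samples."""
    return d_mic * fs * math.sin(math.radians(theta_deg)) / c


def dsb_init(omega, theta_deg, d_mic, fs):
    """Delay-and-subtract pair w_1 = [1, -exp(-j*zeta)]^T steered to theta."""
    zeta = omega * delay_in_samples(theta_deg, d_mic, fs)
    return [1.0 + 0.0j, -cmath.exp(-1j * zeta)]


def free_field_leakage(omega, theta_steer, theta_true, d_mic, fs):
    """|w_1^T d(theta_true)| for a DSB steered to theta_steer (no reflections)."""
    w = dsb_init(omega, theta_steer, d_mic, fs)
    zeta_true = omega * delay_in_samples(theta_true, d_mic, fs)
    d = [cmath.exp(-1j * zeta_true), 1.0 + 0.0j]
    return abs(w[0] * d[0] + w[1] * d[1])


omega = 2.0 * math.pi * 2000.0 / 16000.0   # 2 kHz at fs = 16 kHz
exact = free_field_leakage(omega, 20.0, 20.0, 0.06, 16000.0)
off10 = free_field_leakage(omega, 30.0, 20.0, 0.06, 16000.0)
```

The leakage vanishes for a perfectly steered DSB and becomes clearly nonzero for a 10° steering error, which is why the DSB serves only as a starting point for the adaptation.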

3.3 Comparison to alternative approaches

The original BM proposed by Griffiths and Jim [12] is constructed by subtracting pairs of time-aligned signals with respect to the target signal. For the dual-channel case, this is exactly a DSB, which is attractive for its simple structure. However, a major limitation in real acoustic scenarios is that the performance of a DSB will significantly degrade for imprecise target source position information, i.e., for steering errors. Additionally, due to reflections of the target signal impinging from directions other than the steering direction, significant signal leakage into the noise reference needs to be expected. As a possible countermeasure, an adaptive BM (ABM) with coefficient constraints was proposed in [23]. In this conventional ABM, the output of the FB (see Figure 1) is used as a reference signal for the target source and adaptively subtracted from the microphone signal. The least mean squares (LMS) algorithm is usually used for the ABM adaptation. However, the adaptation can only be carried out in time segments where only the target source is active. Therefore, a double-talk detector is necessary, which requires significant sophistication and will still be imperfect in complex acoustic scenarios. Our approach differs from this BM in that (1) a different adaptation criterion is used, which is important for practical applicability, and (2) a double-talk detector is not required.

The transfer-function-generalized sidelobe canceler (TF-GSC) was proposed by Gannot et al. [25], where the BM is constructed based on RTFs. This approach takes the reverberant nature of the enclosure into account. The RTFs are estimated by a least squares method, for which two assumptions are necessary: (1) the RTFs change slowly over time compared to the time variations of the signals, which effectively precludes movements of the source, and (2) time segments are available where both the TFs and the noise signal are assumed to be stationary. In Section 3.1 we showed that our approach converges to an RTF-based BM. In contrast to [25], our approach does not rely on such time segments; only a coarse DoA estimate is required.

Warsitz et al. presented a BM based on a generalized eigenvalue decomposition. They construct the BM directly by using the beamformer filter coefficients resulting from maximizing the output signal-to-noise ratio, where the filter coefficients are computed iteratively by solving a generalized eigenvalue problem [41, 42]. This approach indirectly estimates the RTFs and does not require periods of absence of noise and the DoA of the target source. On the other hand, it works only for stationary noise while our approach can work in a nonstationary multispeaker scenario.

Recently, a subspace approach for estimating RTFs in multiple-noise scenarios was proposed in [27], which was used to construct the RTF-based BM efficiently. However, this approach needs an estimation for source activities.

The conventional noise estimation methods other than BM-based approaches are mostly based on source activity estimation [43] or minimum statistics noise power estimation [44, 45]. For those approaches, it is usually assumed that the sources are statistically independent and the noise is more stationary than the target signal. The recently well-studied model-based NMF approaches [5–7] can be used directly for noise reduction [46, 47] or for noise estimation [48]. Those methods usually rely on prior knowledge of the noise type (point source or diffuse noise) to define the model parameters. Therefore, they can only be as efficient as the models match the current scenario and are prone to fail if the model assumptions do not hold or the parameters could not be properly learned. The latter is especially crucial for online algorithms in time-varying scenarios. On the other hand, Directional BSS will fail if the interfering source arrives from the same direction as the target source, as then Directional BSS is not able to provide an estimate for the interfering source. Compared to these alternative approaches, the main advantage of the proposed approach is that no target source activity estimation or prior knowledge on the source characteristics is necessary and no model needs to be matched; only a coarse DoA estimate is required.

4 Evaluation of blocking matrices

In order to evaluate the proposed BM, comprehensive experiments were carried out. The system behavior and the target suppression performance of Directional BSS in single-source scenarios (only one directional source is active) and multiple-source scenarios (multiple directional sources are simultaneously active) are evaluated. For showing the system behavior, Directional BSS is compared to (1) an ideal RTF-based BM and (2) a BM based on a DSB. The ideal RTF-based BM is calculated from the measured room impulse responses (RIRs). In order to evaluate the target suppression performance, Directional BSS is compared to (1) a perfect adaptive BM (which we call the ideal ABM) and (2) a BM based on a DSB. The ideal ABM is adapted in a single-source scenario. It should be noted that state-of-the-art BMs are mostly based on an estimation of source activities and require perfectly detected target source-only time segments. Therefore, we always compare Directional BSS with the DSB and not with other BMs, as only these two BMs do not require estimating any source activities. The comparison to the ideal ABM shows how closely Directional BSS approaches the perfectly supervised case. Besides, the robustness of Directional BSS against localization errors is analyzed in Section 4.3.

4.1 Experimental setup

Two real rooms were considered for evaluation: (1) room A: a living-room-like environment with a moderate reverberation time of T60≈250 ms and a critical distance [49] of 1.3 m and (2) room B: a more reverberant living room with T60≈400 ms and a critical distance of 0.9 m. Source-array distances of 1 and 2 m were considered. The experiments are based on RIR measurements carried out with a two-channel array. The measurements were performed for two different microphone spacings, $d_{\mathrm{mic}} \in \{6, 11.5\}$ cm, at a sampling frequency of 48 kHz using the maximum length sequences (MLS) method [49]. For the following evaluation, the RIRs were downsampled to a sampling frequency of $f_s = 16$ kHz. We combine the efficient SOS-based online BSS algorithm presented in [32] with the geometric constraint (22) to perform Directional BSS. The filter length of the finite impulse response (FIR) filters $w_{pq}$ (21) is 1024, and the block length for estimation of the correlation matrix $\mathbf{R}_{yy}$ (23) is 2048. The number of iterations per data block of 125 ms is 15 (see [32] for details on the adaptation). Three male and female speech signals of length 10 s were used as source signals. Diffuse background noise components were simulated using the method proposed in [50]. All sources (including speech sources and diffuse source signals) are continuously active and normalized to equal average power. For the experiments, the DoA information of the target source is given. However, as a source localizer is necessary in practice to estimate the target DoA, the robustness of Directional BSS against localization errors is investigated in Section 4.3.

4.2 Performance of the blocking matrices

As performance measures, we use (1) the frequency response of the overall system to show the system behavior in different scenarios under various acoustical conditions, (2) the target suppression gain and the root mean square error between the estimated RTF and the true RTF to measure the blocking performance, and (3) the signal-to-interference ratio (SIR) difference between the BM input and the BM output signal to measure the ability of the BM to preserve all interfering signals. Note that here the interference includes interfering sources and diffuse background noise.

4.2.1 Frequency response of the overall system

To study the overall system behavior, we investigate the frequency response of the transfer function for a BM supplied with perfect localization information. The transfer function for different source positions −90°≤ϕ≤90° is evaluated as depicted in Figure 5. The spatiotemporal frequency response associated with the BM is given by
$$h_{\mathrm{trans}}(\phi) = \frac{\hat{n}_s}{s} = h_{11}(\phi)\, w_{11} + h_{12}(\phi)\, w_{21}, \tag{26}$$
Figure 5

System to evaluate the frequency response of different BMs.

where $\hat{n}_s$ refers to the residual of the target signal component $s$ in the noise estimate $\hat{n}$ (‘leakage’). This characterization is similar, but not identical, to a beam pattern in the usual sense, where it is assumed that the acoustic waves propagate in the free field and no scattering is considered. Instead, (26) also considers the acoustic environment by accounting for the transfer functions from the source position to the microphones. Thereby, $h_{\mathrm{trans}}$ captures reflections of source signals determined by the given source positions. Thus, if (26) exhibits a minimum for a certain angle at a certain distance to the microphone array, it indicates that all signal components originating from this angle at this distance, including possible reflections at surfaces in the acoustic environment, are suppressed to the given extent.
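As an illustration of (26), the following minimal sketch evaluates the leakage response per frequency bin from time-domain impulse responses and BM filters. The function name `leakage_response`, the toy one-sample-delay impulse responses, and the cross-relation filters `w11 = h12`, `w21 = -h11` are assumptions made for this example only; they are not the measured RIRs or the converged filters of the paper.

```python
import numpy as np

def leakage_response(h11, h12, w11, w21):
    """Evaluate h_trans = h11*w11 + h12*w21 of (26) in the frequency domain.

    h11, h12: impulse responses from the source position to microphones 1 and 2.
    w11, w21: BM filters applied to microphones 1 and 2.
    Time-domain convolution becomes a per-bin product after zero-padded FFTs.
    """
    n_lin = max(len(h11), len(h12)) + max(len(w11), len(w21)) - 1
    n_fft = 1 << (n_lin - 1).bit_length()  # next power of two, avoids circular wrap
    H11, H12 = np.fft.rfft(h11, n_fft), np.fft.rfft(h12, n_fft)
    W11, W21 = np.fft.rfft(w11, n_fft), np.fft.rfft(w21, n_fft)
    return H11 * W11 + H12 * W21

# Hypothetical toy paths: microphone 2 receives the source one sample later.
h11 = np.array([1.0, 0.0, 0.0])
h12 = np.array([0.0, 1.0, 0.0])

# Illustrative blocking filters (cross-relation form): the target cancels.
w11, w21 = h12, -h11
mag = np.abs(leakage_response(h11, h12, w11, w21))
```

Evaluating `mag` for impulse responses measured at several angles ϕ would give one column of a magnitude plot per angle; a deep minimum at the target angle corresponds to a strong spatial null.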

We show the magnitude response of (1) Directional BSS, (2) the ideal RTF-based BM, and (3) a DSB. With the ideal RTF-based BM, the target component is perfectly suppressed. Directional BSS is expected to converge to this ideal solution. For the BM based on a DSB, the filter coefficients are calculated according to the fractional delay between the two microphone signals. For Directional BSS, the BM coefficients are a set of converged BSS demixing filters of length 1024 adapted for the corresponding scenarios. We first show the magnitude response for the BMs adapted/calculated for single-source scenarios.

In Figure 6, the magnitude responses of the three BMs (ideal RTF-based BM, Directional BSS, DSB) steered towards 0° are depicted for the arrays with dmic = 6 cm and dmic = 11.5 cm, respectively, in room A with 1-m source-array distance. Comparing all plots in Figure 6, the three BMs have similar magnitude responses. For all three BMs, spatial aliasing is unavoidable at f > 5 kHz (dmic = 6 cm) and at f > 3 kHz (dmic = 11.5 cm). Besides, they do not exhibit significant spatial selectivity at low frequencies (below 300 Hz), and the frequency range without spatial selectivity is larger for dmic = 6 cm than for dmic = 11.5 cm. In this frequency range, not only the target source but also interferers located at positions other than 0° are suppressed to a large extent, and consequently, no noise estimate can be obtained. Despite the similar behavior at low frequencies, it is clearly noticeable that both the ideal RTF-based BM and Directional BSS achieve a more pronounced spatial null than the DSB, which reflects a much better suppression performance of these two BMs compared to the DSB.

In Figures 7 and 8, the magnitude responses of the three BMs are depicted for the steering directions −45° and −90° in room A with 1-m source-array distance. The behavior of the BMs changes as the target source moves towards −90°. For the steering direction of −45°, the ideal RTF-based BM can still perfectly suppress the source, but the null becomes broader. For Directional BSS and the DSB, the spatial null becomes weaker and broader. Besides, the spatial null reaches only up to approximately 4 kHz. In practice, this will not significantly affect the performance of Directional BSS for suppressing speech signals, as most of the energy of speech signals is usually in the frequency range below 4 kHz. For the steering direction of −90°, the spatial null of Directional BSS becomes broader, especially at low frequencies, whereas almost no spatial null can be observed for the DSB. The target suppression gain (discussed in Section 4.2.2) for Directional BSS degrades from 20 dB for the target signal at 0° to about 10 dB for the target signal at 90°. The target suppression performance is then still acceptable, but the missing selectivity will suppress interfering sources located close to the target source as well. In the following experiments, we therefore limit the target source position to the range [−20°, 20°] relative to the broadside of the microphone array. Besides, we note that the spatial null of the proposed method is limited to frequencies below 4 kHz, which is due to the fact that we use speech signals as test signals and the energy of speech signals is concentrated in the frequency range below 4 kHz. However, the spatial null can be extended to a higher frequency range by using other wideband signals with sufficient support at those frequencies.

In Figure 9a-9d, the magnitude responses of Directional BSS for different rooms and different source-array distances are depicted. It can be seen that the spatial null of Directional BSS becomes slightly weaker with increased reverberation.
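The aliasing limits and the missing low-frequency selectivity can be made plausible with a simple free-field, far-field model. The sketch below assumes an idealized delay-and-subtract pair and a speed of sound of 343 m/s; both the model and the names `first_alias_frequency` and `dsb_magnitude` are assumptions of this example and ignore the room acoustics captured by (26).

```python
import math

C_SOUND = 343.0  # assumed speed of sound in m/s

def first_alias_frequency(d_mic, steer_deg=0.0):
    """Lowest frequency at which the steered null repeats within -90..90 degrees.

    In the free-field model the null repeats wherever
    f * d_mic * (sin(phi) - sin(phi0)) / c is a nonzero integer.
    """
    s0 = math.sin(math.radians(steer_deg))
    max_diff = max(1.0 - s0, 1.0 + s0)  # largest |sin(phi) - sin(phi0)|
    return C_SOUND / (d_mic * max_diff)

def dsb_magnitude(f, d_mic, phi_deg, steer_deg=0.0):
    """Magnitude response of an idealized delay-and-subtract pair."""
    diff = math.sin(math.radians(phi_deg)) - math.sin(math.radians(steer_deg))
    phase = 2.0 * math.pi * f * d_mic * diff / C_SOUND
    return abs(1.0 - complex(math.cos(phase), -math.sin(phase)))

f_alias_6cm = first_alias_frequency(0.06)     # about 5.7 kHz
f_alias_11cm = first_alias_frequency(0.115)   # about 3.0 kHz
low_freq_gain = dsb_magnitude(100.0, 0.06, 90.0)  # small: little selectivity
```

Under these assumptions, the broadside null repeats from about c/dmic upwards, which is roughly in line with the limits of 5 kHz and 3 kHz stated above. The response magnitude for off-target directions also shrinks with decreasing frequency and spacing, which matches the observation that the range without spatial selectivity is larger for the smaller array.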
Figure 6

Magnitude responses for all three BMs steered to 0° (room A, 1 m). (a) Ideal RTF-BM (dmic = 6 cm). (b) Ideal RTF-BM (dmic = 11.5 cm). (c) DirBSS (dmic = 6 cm). (d) DirBSS (dmic = 11.5 cm). (e) Dsub BF (dmic = 6 cm). (f) Dsub BF (dmic = 11.5 cm).

Figure 7

Magnitude responses for all three BMs steered to -45° (room A, 1 m). (a) Ideal RTF-BM (dmic = 6 cm). (b) Ideal RTF-BM (dmic = 11.5 cm). (c) DirBSS (dmic = 6 cm). (d) DirBSS (dmic = 11.5 cm). (e) Dsub BF (dmic = 6 cm). (f) Dsub BF (dmic = 11.5 cm).

Figure 8

Magnitude responses for all three BMs steered to -90° (room A, 1 m). (a) Ideal RTF-BM (dmic = 6 cm). (b) Ideal RTF-BM (dmic = 11.5 cm). (c) DirBSS (dmic = 6 cm). (d) DirBSS (dmic = 11.5 cm). (e) Dsub BF (dmic = 6 cm). (f) Dsub BF (dmic = 11.5 cm).

Figure 9

Magnitude responses for Directional BSS and MSC of target signal under various testing conditions (dmic = 6 cm). (a) Room A, 1 m. (b) Room A, 2 m. (c) Room B, 1 m. (d) Room B, 2 m. (e) MSC (room A, 1 m). (f) MSC (room A, 2 m). (g) MSC (room B, 1 m). (h) MSC (room B, 2 m).

To explain the performance degradation, we plot the magnitude squared coherence (MSC) of the target signal between the two microphones for each testing scenario in Figure 9e-9h. The MSC is estimated using Welch’s averaged periodogram method. The block length for estimating the MSC is 2048, which is the same as the block length for the BSS adaptation. As can be seen, the target signal for room A with 1-m distance is strongly coherent (MSC ≈ 1). With increasing reverberation, the coherence of the target signal becomes weaker. Consequently, the blocking performance of Directional BSS degrades. If the block length is increased beyond the length of the measured RIRs, the bias of the coherence estimate towards zero is reduced according to [51], and the MSC will be close to 1 again. Therefore, theoretically, increasing both the filter length of the demixing system and the block length will improve the performance. However, a deteriorating convergence of the BSS algorithm must be expected for very long demixing filters. This is a general problem of adaptive filtering realized in the time domain.

In the above figures, we showed the behavior of the three BMs in single-source scenarios. For multiple-source scenarios, the ideal RTF-based BM and the BM based on a DSB remain unchanged. However, the adaptation of Directional BSS is affected by the presence of the interfering sources. Consequently, the performance of Directional BSS differs from its performance in single-source scenarios. Figure 10 illustrates the magnitude responses of Directional BSS steered towards 0°, with one interfering point source at 30°. It can be seen that the spatial null is only slightly weaker compared to the single-source case (Figure 9), especially at low frequencies. This indicates that the target source is slightly less suppressed (a degradation of about 1 to 4 dB in terms of the target suppression gain) due to the presence of the interfering signal.
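The block-length dependence of the MSC estimate discussed for Figure 9e-9h can be reproduced with a small Welch-style estimator. The sketch below is an assumption-laden illustration: the Hann window, the 50% overlap, the function name `msc_welch`, and the white-noise test signals with pure delays are choices of this example, not the settings or data of the paper (only the block length of 2048 is taken from the text).

```python
import numpy as np

def msc_welch(x1, x2, block_len=2048, overlap=0.5):
    """Magnitude squared coherence from averaged windowed periodograms."""
    hop = int(block_len * (1.0 - overlap))
    window = np.hanning(block_len)
    s11 = np.zeros(block_len // 2 + 1)
    s22 = np.zeros(block_len // 2 + 1)
    s12 = np.zeros(block_len // 2 + 1, dtype=complex)
    for start in range(0, len(x1) - block_len + 1, hop):
        X1 = np.fft.rfft(window * x1[start:start + block_len])
        X2 = np.fft.rfft(window * x2[start:start + block_len])
        s11 += np.abs(X1) ** 2          # auto-spectrum, channel 1
        s22 += np.abs(X2) ** 2          # auto-spectrum, channel 2
        s12 += X1 * np.conj(X2)         # cross-spectrum
    return np.abs(s12) ** 2 / (s11 * s22 + 1e-20)

rng = np.random.default_rng(0)
s = rng.standard_normal(32000)           # hypothetical source signal

# Path difference much shorter than the block: coherence stays close to 1.
msc_short = msc_welch(s, np.roll(s, 2))
# Path difference comparable to the block: the estimate is biased towards 0.
msc_long = msc_welch(s, np.roll(s, 1500))
```

The second case mimics, in a very simplified way, what happens when the impulse responses are long relative to the analysis block: the signals are still fully determined by one source, but the estimated coherence drops.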
Figure 10

Magnitude responses for Directional BSS with one-point interfering source at -30° (dmic = 6 cm). (a) Room A, 1 m. (b) Room A, 2 m. (c) Room B, 1 m. (d) Room B, 2 m.

4.2.2 Target suppression performance

The blocking performance should be quantified to show how well the target source can be suppressed. We propose to use two measures to evaluate the performance. One is the target speech suppression gain which is defined as follows:
$$\mathrm{Gain}_{\mathrm{sup}} = \frac{1}{2} \sum_{p=1}^{2} 10 \log_{10} \frac{\sigma^2_{x_{s,p}}}{\sigma^2_{\hat{n}_s}}, \tag{27}$$

where $\sigma_a^2$ denotes the (long-term averaged) power of the signal $a$, $x_{s,p}$ denotes the target component contained in the p-th microphone signal, and $\hat{n}_s$ denotes the target residual contained in the noise estimate. The target suppression gain of the ideal RTF-based BM is infinite. A higher target suppression gain indicates a better blocking performance for the signal from the target source direction. This measure is very similar to the ‘signal blocking factor’ used in [52]. We compare the target suppression performance with that of (1) the DSB and (2) a simulated ideal ABM, where one microphone channel is adaptively subtracted from the other with an LMS-type algorithm. For this simulation, an ideal case is assumed, i.e., the microphone signals contain only the target signal. The simulated ABM can be regarded as an ideal (supervised) version of a conventional ABM using an LMS algorithm as proposed in [23, 24].
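A minimal sketch of how (27) could be evaluated is given below. It assumes that the target components at the two microphones and the target residual at the BM output are available as separate signals (as in a simulation); the function name and the toy signals are hypothetical.

```python
import numpy as np

def target_suppression_gain(x_s, n_hat_s):
    """Target suppression gain of (27) in dB.

    x_s:     array of shape (2, T), target component at each microphone.
    n_hat_s: array of shape (T,), target residual in the noise estimate.
    """
    p_in = np.mean(np.asarray(x_s, dtype=float) ** 2, axis=1)   # power per microphone
    p_res = np.mean(np.asarray(n_hat_s, dtype=float) ** 2)      # residual power
    return float(np.mean(10.0 * np.log10(p_in / p_res)))

# Toy check: a residual with one tenth of the input amplitude gives 20 dB.
x_s = np.ones((2, 100))
n_hat_s = 0.1 * np.ones(100)
gain_db = target_suppression_gain(x_s, n_hat_s)
```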

Additionally, we calculate the normalized squared error (NSE) between the estimated RTFs and the ideal RTF calculated from the measured RIR to evaluate the estimation of the RTF. The NSE is calculated as follows:
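$$\mathrm{NSE}_{\mathrm{BM}} = 10 \log_{10} \frac{\sum_{k=0}^{N-1} \left| \widetilde{\mathrm{RTF}}_{\mathrm{BM}}(k) - \mathrm{RTF}_{\mathrm{true}}(k) \right|^2}{\sum_{k=0}^{N-1} \left| \mathrm{RTF}_{\mathrm{true}}(k) \right|^2}, \tag{28}$$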
where $\widetilde{\mathrm{RTF}}_{\mathrm{BM}}$ denotes the RTF estimated by a BM, e.g., $\widetilde{\mathrm{RTF}}_{\mathrm{DirBSS}}$ refers to the RTF estimated by Directional BSS, k is the time sample index, and N is the filter length of the BM, which was chosen to be 1024 in our experiments.

The scenarios depicted in Figure 11a are considered for evaluating the blocking performance. In scenarios 1 to 3, only point sources are active. One male speech signal of 10 s was used as the target source. A female speech signal of the same length was used as the interferer in scenario 2. For scenario 3, a female and a male speech signal were used as the interferers. In scenario 4, additional diffuse background noise is added to the microphone signals. All test signals are normalized to equal power.
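Returning to the NSE measure in (28), the following sketch shows one possible evaluation for two filters of equal length. The function name `nse_db` and the short toy filters are assumptions of this example; in the experiments described here the filters would have N = 1024 taps.

```python
import numpy as np

def nse_db(rtf_est, rtf_true):
    """Normalized squared error of (28) in dB between two time-domain RTFs."""
    rtf_est = np.asarray(rtf_est, dtype=float)
    rtf_true = np.asarray(rtf_true, dtype=float)
    err = np.sum(np.abs(rtf_est - rtf_true) ** 2)   # squared estimation error
    ref = np.sum(np.abs(rtf_true) ** 2)             # energy of the true RTF
    return float(10.0 * np.log10(err / ref))

# Toy check: a 10% amplitude error on every tap corresponds to -20 dB.
rtf_true = np.array([1.0, 0.5, 0.0, 0.0])
nse_value = nse_db(1.1 * rtf_true, rtf_true)
```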
Figure 11

Test scenarios and performance comparison for evaluating the BMs (dmic = 6 cm). (a) Testing scenarios. (b) Target suppression gain. (c) Normalized squared error.

Figure 11b shows the target suppression gain for the three BMs. As the DSB depends only on the target direction and the target source-array distance, the target suppression gain of the DSB for scenarios 2 to 4 is the same as for scenario 1. We simulate the ideal ABM by adapting the BM filter in a single-source scenario, i.e., scenario 1. Therefore, the performance of the ideal ABM is only shown for this scenario. It can be seen that in a single-source scenario, the target suppression gain of the ideal ABM is only slightly higher than that of Directional BSS, which indicates that in a single-source scenario, Directional BSS comes close to this upper limit. For scenarios 2 to 4, the target suppression gain of Directional BSS degrades but always remains above 10 dB and is clearly superior to that of the DSB. In Figure 11c, $\mathrm{NSE}_{\mathrm{DirBSS}}$ and $\mathrm{NSE}_{\mathrm{idealABM}}$ are shown. For scenario 1, where only the target source is active, the NSEs of both BMs are very close and very low, which indicates that in a single-source scenario, the estimated RTFs are very close to the true RTF. With an increased number of sources or with more complicated acoustical conditions (higher reverberation time and larger source-array distance), it is more difficult to estimate the RTF. We can see that $\mathrm{NSE}_{\mathrm{DirBSS}}$ increases and the target suppression gain degrades. However, even in such complex scenarios, Directional BSS can produce an acceptable estimate of the RTF without any source activity detection. In the latest work [53], more evaluation results comparing the estimated RTFs with the true RTFs are shown. The performance of Directional BSS is somewhat dependent on the signal characteristics (stationary or nonstationary, speech signal or white noise) of the involved sources. The energy of white noise is distributed over the full band, while the energy of speech-like sources is usually concentrated at low frequencies. Besides, nonstationary sources make the adaptation of Directional BSS more difficult, as it needs to track the variation of the signals within short frames. Therefore, different signal characteristics will lead to different results.

4.2.3 Preservation of interfering sources

The target suppression gain can only be used to evaluate the blocking performance for the target signal. From the magnitude response for the overall system, it can be seen that spatial aliasing appears in a certain frequency range. The goal of a BM is to produce a noise reference by suppressing the target source, which indicates that the noise signals should be well preserved while the target source should be well suppressed. Therefore, besides the target suppression gain, we need to measure how well the noise signals are preserved. To this end, we define the SIRdiff as follows:
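$$\mathrm{SIR}_{\mathrm{diff}} = \mathrm{SIR}_{\mathrm{in}} - \mathrm{SIR}_{\mathrm{outBM}}, \tag{29}$$
where SIRin and SIRoutBM are given by
$$\mathrm{SIR}_{\mathrm{in}} = \frac{1}{2} \sum_{p=1}^{2} 10 \log_{10} \frac{\sigma^2_{x_{s,p}}}{\sigma^2_{x_{n,p}}}, \qquad \mathrm{SIR}_{\mathrm{outBM}} = 10 \log_{10} \frac{\sigma^2_{\hat{n}_s}}{\sigma^2_{\hat{n}_n}}, \tag{30}$$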
where $\hat{n}_s$ and $\hat{n}_n$ denote the target and the noise component contained in the noise estimate $\hat{n}$, respectively. The higher SIRdiff, the better the noise signals are preserved relative to the target signal.

We carried out a test for the scenario shown in Figure 12a, where the target source is located at 0°, while the DoA of the interfering source varies from −90° to −10°.
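The measure defined in (29) and (30) can be sketched as follows, assuming (as is possible in a simulation) that the target and noise components are available separately at the microphones and at the BM output. The helper names `power` and `sir_diff` and the toy signals are hypothetical.

```python
import numpy as np

def power(x):
    """Long-term average power along the time axis."""
    return np.mean(np.square(np.asarray(x, dtype=float)), axis=-1)

def sir_diff(x_s, x_n, n_hat_s, n_hat_n):
    """SIR difference of (29) in dB.

    x_s, x_n:         shape (2, T), target and noise components at the microphones.
    n_hat_s, n_hat_n: shape (T,), target and noise components at the BM output.
    """
    sir_in = np.mean(10.0 * np.log10(power(x_s) / power(x_n)))   # averaged over both mics
    sir_out_bm = 10.0 * np.log10(power(n_hat_s) / power(n_hat_n))
    return float(sir_in - sir_out_bm)

# Toy check: input SIR of 0 dB, target attenuated by 20 dB, noise untouched.
x_s, x_n = np.ones((2, 100)), np.ones((2, 100))
value_db = sir_diff(x_s, x_n, 0.1 * np.ones(100), np.ones(100))
```

In this toy case the result is 20 dB. If the BM attenuated the noise as strongly as the target, the value would drop to 0 dB, which is the situation described above for interferers close to the target direction.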
Figure 12

Comparison of SIRdiff achieved by Directional BSS and DSB (dmic = 6 cm). (a) Scenario for measuring the SIRdiff. (b) SIRdiff of Directional BSS. (c) SIRdiff of DSB.

Figure 12b,c shows the SIRdiff achieved by Directional BSS and the DSB for the testing scenario. We observe that SIRdiff decreases with increased reverberation. If an interfering source is located close to the target source, e.g., at −20° or −10°, BSS may treat the two as one source, so that both the interfering source and the target source are suppressed to a certain extent. Comparing the performance of Directional BSS and the DSB, it can be seen that Directional BSS is clearly superior to the DSB, especially in reverberant environments.

As the target source is assumed to impinge from the range −20° to 20° relative to the broadside of the microphone array, the blocking performance, including the preservation of the interfering source, for a target located at directions other than 0° is of interest. Figure 13 shows the obtained Gainsup and SIRdiff for a scenario where the target is located at −10° or −20°, an interfering source is located at 60°, and diffuse noise is active. It can be seen that Directional BSS can still achieve a target suppression gain of more than 10 dB. The results of the DSB for the same scenarios are not shown, as even for the source located at 0°, a Gainsup of less than 10 dB is obtained (see Figure 11b). For a source located off 0°, the performance of the DSB degrades further.
Figure 13

Performance of Directional BSS for different target directions (dmic = 6 cm). (a) Target suppression gain. (b) SIR difference.

4.3 Robustness against localization errors

In practical applications, the target direction is usually unknown and needs to be estimated using a source localization algorithm [39, 40], so that estimation errors must be expected. Hence, the robustness of Directional BSS against localization errors is of special interest. To this end, we experimentally evaluate the sensitivity of Directional BSS with respect to localization errors. The scenario for evaluation is illustrated in Figure 14a, where the target source is always located at 0°, and the direction of one active interferer varies from −90° to −10°. The target localization error is 5°, 10°, or 15°. Various values for ηC from 0.1 to 0.8 are applied for the constraint.
Figure 14

Target suppression gain as a function of the localization error and the weighting parameter ηC. (a) Scenarios for evaluating the sensitivity against localization errors. (b) dmic=6 cm, error = 5°. (c) dmic=11.5 cm, error = 5°. (d) dmic=6 cm, error = 10°. (e) dmic=11.5 cm, error = 10°. (f) dmic=6 cm, error = 15°. (g) dmic=11.5 cm, error = 15°.

In Figure 14b-14g, the target suppression results are shown. The SIRdiff, which measures the preservation of the interfering sources as discussed in Section 4.2.3, is shown in Figure 15. From Figure 14b,d,f and Figure 15a,c,e, it can be seen that for dmic=6 cm, a localization error of 15° can be tolerated: Gainsup is above 10 dB and SIRdiff is above 6 dB if the interferer is far from the target source. However, as indicated above, if the interferer is close to the target source, BSS might treat the interferer and the target source as a single source and jointly suppress both. This ambiguity in the BSS adaptation results in a performance degradation (low SIRdiff for an interferer close to the target source). Similar results are observed for dmic=11.5 cm. They show that with a larger microphone spacing, BSS is more sensitive to the localization error, especially if the interfering sources are close to the target source. Besides, for a large localization error, Directional BSS achieves a better performance with a lower ηC.

Basically, if BSS runs freely without any constraint (ηC=0), it automatically adapts to the true source direction for source separation. However, this holds only for determined cases, i.e., for scenarios with only two sources when two microphones are available. For an underdetermined case with more than two simultaneously active point sources, BSS will always try to produce mutually statistically independent outputs. Therefore, in an underdetermined situation, a determined BSS will divide the sources into two groups, which leads to an unpredictable suppression/separation of the sources, e.g., it may treat the target source and the nearest interfering source together as one source/one group and produce a compromise suppression. Therefore, we need to constrain BSS to suppress only the source from a predefined direction, but without constraining BSS so strongly that a possible localization error cannot be tolerated. This is balanced by the weighting factor ηC in (22), which controls the importance of the geometric constraint relative to the separation criterion. A lower ηC indicates less weight for the geometric constraint, so that the estimation of the demixing filters relies more on statistical independence for source separation. Hence, a lower ηC should be chosen for unreliable DoA information so that Directional BSS can better adapt to the true target direction.
Figure 15

SIRdiff as a function of the localization error and the weighting parameter ηC. (a) dmic=6 cm, error = 5°. (b) dmic=11.5 cm, error = 5°. (c) dmic=6 cm, error = 10°. (d) dmic=11.5 cm, error = 10°. (e) dmic=6 cm, error = 15°. (f) dmic=11.5 cm, error = 15°.

5 Application of a DirBSS-based noise estimate to blind signal extraction

In this section, a two-channel BSE scheme combining Directional BSS as BM and Wiener-type spectral enhancement filters will be presented and evaluated.

5.1 A two-unit scheme: BM plus spectral enhancement filters

The noise estimate obtained by the above method can be used for various applications. Conventionally, it can be used for a realization which relies on a noise estimate produced by an RTF-based BM [25, 53] or for the MWF, which requires an SOS estimate of the noise [54]. In this section, we discuss one generic application to show the effectiveness of the noise estimation of the proposed BM: a two-channel BSE scheme combining Directional BSS as BM and Wiener-type spectral enhancement filters. The scheme is depicted in Figure 16a. It comprises two units. In the first unit, an estimate of the noise components is produced by a BM. The noise estimate and the microphone signals are then fed into the noise reduction unit so that the desired speech components can be extracted from the microphone signals.
Figure 16

Speech enhancement/blind signal extraction scheme and test scenarios for evaluating its performance. (a) Speech enhancement/blind signal extraction scheme. (b) Test scenarios.

Typical approaches which can be considered for the speech enhancement unit include an interference canceler or Wiener-type spectral enhancement filters. However, in underdetermined cases, an interference canceler is not able to suppress all noise sources [55]. Therefore, Wiener-type spectral enhancement filters based on the obtained noise estimate are used for the noise reduction unit. The real-valued spectral weights for frequency Ω at output channel p are given by [56]
$$g_p = \max\left( 1 - \mu \frac{\hat{S}_{\hat{n}\hat{n}}}{\hat{S}_{v_p v_p}},\; g_{\min} \right), \tag{31}$$

where $\hat{S}_{aa}$ represents the auto-PSD of $a$, $v_p = w_{p1} x_p$ denotes the p-th microphone signal filtered by Directional BSS, $g_{\min}$ refers to the minimum value of the spectral weights (spectral floor), and μ is a real number which controls the trade-off between noise reduction and speech distortion. Note that the spectral weights can be designed in many forms, e.g., they can be derived from Bayesian estimation using the maximum likelihood (ML), maximum a posteriori (MAP), or MMSE estimator [57].

The spectrum of the enhanced signal at the p th output channel is thus given by
$$z_p = g_p\, x_p. \tag{32}$$
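A minimal per-frequency-bin sketch of (31) and (32) is given below. The PSD estimates are taken as given arrays; how they are estimated and the default values chosen here for `mu` and `g_min` are assumptions of this example, not the settings used in the experiments.

```python
import numpy as np

def spectral_weights(S_nn, S_vv, mu=1.0, g_min=0.1):
    """Real-valued spectral weights of (31), one value per frequency bin.

    S_nn: auto-PSD estimate of the noise reference.
    S_vv: auto-PSD estimate of the filtered microphone signal v_p.
    """
    ratio = S_nn / np.maximum(S_vv, 1e-12)        # guard against empty bins
    return np.maximum(1.0 - mu * ratio, g_min)    # apply the spectral floor

def enhance(X_p, S_nn, S_vv, mu=1.0, g_min=0.1):
    """Enhanced spectrum of (32): weights applied to the microphone spectrum."""
    return spectral_weights(S_nn, S_vv, mu, g_min) * X_p

# Toy bins: no noise, half noise, and noise-dominated (clipped to the floor).
g = spectral_weights(np.array([0.0, 0.5, 2.0]), np.ones(3))
```

A larger `mu` subtracts more of the estimated noise at the cost of more speech distortion, while `g_min` bounds the maximum attenuation per bin.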

5.2 Improved noise estimate by assuming an ideal diffuse noise field

The obtained noise estimate $\hat{n}$ is biased relative to the original noise components as (1) $\hat{n}$ is spectrally shaped by the BSS filters and (2) $\hat{n}$ is a sum of all the filtered interference and noise components. A bias correction function based on an assumed noise coherence is therefore proposed [21, 56]. If the noise field is approximated as spherically isotropic [49], the theoretical noise coherence function reads:
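$$\Gamma_{\mathrm{diffuse}} = \operatorname{sinc}\left( \frac{\Omega\, d_{\mathrm{mic}}\, f_s}{c} \right). \tag{33}$$
The noise estimate is corrected with the noise coherence function [21]:
$$\hat{S}'_{\hat{n}\hat{n}} = \frac{\hat{S}_{\hat{n}\hat{n}}}{2 \left( 1 + \Re\{ \Gamma_{\mathrm{diffuse}} \} \right)}, \tag{34}$$
where $\Re\{\cdot\}$ denotes the real part of a complex value. Note that $\Gamma_{\mathrm{diffuse}}$ and $\hat{S}_{\hat{n}\hat{n}}$ are frequency-dependent. The spectral weights are then calculated with the corrected noise estimate:
$$g'_p = \max\left( 1 - \mu \frac{\hat{S}'_{\hat{n}\hat{n}}}{\hat{S}_{v_p v_p}},\; g_{\min} \right). \tag{35}$$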

This correction function was discussed in detail in [21]. Other possible correction functions could be based on coherence measurements during target inactivity [54, 58] or a method combined with minimum statistics [59]. However, a more detailed discussion of the possible correction functions and choices of spectral enhancement filters is outside the scope of this paper.
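The correction in (33) and (34) can be sketched as follows. This example assumes that Ω is the normalized angular frequency 2πf/fs, that sinc denotes the unnormalized function sin(x)/x, and that c = 343 m/s; the function names are hypothetical. Equation (34) is implemented in the form in which it appears in the text.

```python
import numpy as np

def diffuse_coherence(freqs_hz, d_mic, c=343.0):
    """Coherence of an ideal spherically isotropic noise field, cf. (33).

    np.sinc(t) is the normalized sinc sin(pi*t)/(pi*t), so the argument
    2*f*d/c yields sin(x)/x with x = 2*pi*f*d/c.
    """
    return np.sinc(2.0 * np.asarray(freqs_hz, dtype=float) * d_mic / c)

def corrected_noise_psd(S_nn, gamma):
    """Bias-corrected noise PSD following (34) as given in the text."""
    return S_nn / (2.0 * (1.0 + np.real(gamma)))

# Evaluate at 0 Hz and at the first zero of the coherence, f = c / (2 * d_mic).
freqs = np.array([0.0, 343.0 / (2.0 * 0.06)])
gamma = diffuse_coherence(freqs, 0.06)
S_corr = corrected_noise_psd(np.array([4.0, 2.0]), gamma)
```

The corrected PSD would then replace the uncorrected noise PSD when computing the spectral weights in (35).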

5.3 Experimental results

In this section, the performance of the proposed scheme with two spectral enhancement filters ((31) and (35)) is evaluated in terms of signal-to-interference ratio improvement (SIRgain) and speech distortion (SD) for various scenarios under various testing conditions. Besides, we show the performance of the proposed scheme based on the noise estimate provided by the ideal RTF-based BM and the DSB for comparison.

5.3.1 Experimental setup

Two different rooms and two source-array distances are considered for evaluating the BMs. The scenarios as shown in Figure 16b are considered for evaluating the performance of the speech enhancement scheme. All tested signals are the same as used in the experiments for evaluating the performance of BMs.

The same algorithm of Directional BSS and the same parameters are used as in Section 4 (see (22) and (23)). The frequency-domain Wiener filter is implemented with a polyphase filter bank [60] using a prototype FIR filter of length 1024, with 512 complex-valued subbands and a downsampling rate of 128.

5.3.2 Performance measures

The performance of the proposed scheme is evaluated in terms of SIRgain and SD using the following definitions:
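$$\mathrm{SIR}_{\mathrm{gain}} = \mathrm{SIR}_{\mathrm{out}} - \mathrm{SIR}_{\mathrm{in}}, \tag{36}$$
$$\mathrm{SIR}_{\mathrm{in}} = \frac{1}{2} \sum_{p=1}^{2} 10 \log_{10} \frac{\sigma^2_{x_{s,p}}}{\sigma^2_{x_{n,p}}}, \tag{37}$$
$$\mathrm{SIR}_{\mathrm{out}} = \frac{1}{2} \sum_{p=1}^{2} 10 \log_{10} \frac{\sigma^2_{z_{s,p}}}{\sigma^2_{z_{n,p}}}, \tag{38}$$
$$\mathrm{SD} = \frac{1}{2} \sum_{p=1}^{2} 10 \log_{10} \frac{\hat{E}\left\{ \left( x_{s,p}(k - \tau_g) - z_{s,p}(k) \right)^2 \right\}}{\sigma^2_{x_{s,p}}}, \tag{39}$$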

where $x_{s,p}$ and $z_{s,p}$ denote the target speech components at the p-th input and the p-th output of the proposed scheme, respectively; $\sigma^2_{x_{s,p}}$ and $\sigma^2_{z_{s,p}}$ denote the (long-term) signal power of the target speech components at the p-th input and the p-th output; whereas $\sigma^2_{x_{n,p}}$ and $\sigma^2_{z_{n,p}}$ denote the (long-term) signal power of all noise and interference components at the p-th input and the p-th output, respectively. $\tau_g$ refers to the overall signal delay caused by the filter bank.
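The speech distortion measure (39) requires aligning the input target component with the delayed output. A minimal sketch under the assumption of an integer delay in samples is shown below; the function name and the toy signals are hypothetical, and the expectation is replaced by a time average.

```python
import numpy as np

def speech_distortion_db(x_s, z_s, tau_g):
    """Speech distortion of (39) in dB.

    x_s, z_s: shape (2, T), target components at the inputs and outputs.
    tau_g:    overall processing delay in samples (integer, assumed known).
    """
    x_s = np.asarray(x_s, dtype=float)
    z_s = np.asarray(z_s, dtype=float)
    n = x_s.shape[1] - tau_g               # samples that overlap after alignment
    x_delayed = x_s[:, :n]                 # x_{s,p}(k - tau_g) for k >= tau_g
    z_aligned = z_s[:, tau_g:tau_g + n]    # z_{s,p}(k) for k >= tau_g
    err = np.mean((x_delayed - z_aligned) ** 2, axis=1)
    ref = np.mean(x_s ** 2, axis=1)
    return float(np.mean(10.0 * np.log10(err / ref)))

# Toy check: output is the input delayed by 5 samples and scaled by 0.9.
x_s = np.ones((2, 1000))
z_s = np.zeros((2, 1000))
z_s[:, 5:] = 0.9 * x_s[:, :-5]
sd_db = speech_distortion_db(x_s, z_s, tau_g=5)
```

With a 10% amplitude error and correct delay compensation, the toy example gives about −20 dB; omitting the delay compensation would inflate the measured distortion even for an otherwise undistorted output.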

5.3.3 Performance of the proposed BSE scheme

In order to establish a reference for the two-stage BSE methods, we first consider a 2×2-channel time-domain ICA algorithm as an uninformed BSE system where we identify the target signal in one of the output channels. The SIR improvement for the various scenarios is shown in Figure 17. The SD is not shown, as it is meaningless for BSS due to the known filtering ambiguity of optimum BSS solutions [31]. As a generic ICA algorithm is designed only for the determined case, this algorithm can achieve a high separation performance only in scenario 1. For scenario 2, it has no determined solution. The three sources are separated into two groups, but the grouping is unpredictable, as can be seen from the results: e.g., for the testing conditions [rooms (A,B), 1 m], the target source is extracted alone, while for the other testing conditions, the target source is separated together with one interfering source as one group. For scenarios 3 and 4, where diffuse noise is active, generic ICA is usually not capable of separating the point source from the diffuse noise. A corresponding analysis can be found in [28]. These results document that the generic ICA algorithm cannot be expected to extract the target source in underdetermined scenarios.
Figure 17

SIR improvement obtained by the conventional time-domain BSS algorithm (dmic = 6 cm).

The SIR improvement and speech distortion of the speech enhancement scheme with the two spectral filters $g_p$ (31) and $g'_p$ (35) are shown in Figure 18a,b, respectively. The upper plot and the lower plot show the results for the spectral filter $g_p$ and the improved spectral filter $g'_p$, respectively. It should be noted that for the four considered scenarios, the parameters of the spectral filters are optimized with respect to the best perceptual quality as judged by informal listening tests. For all scenarios, the proposed BSE scheme with the two spectral filters can achieve a good noise reduction performance (above 6 dB) and maintain a very low speech distortion (lower than −11 dB). For scenarios 1 and 2, where only point sources are active, the performance of the spectral filter $g'_p$, which is based on the incorrect assumption of ideal diffuse noise, is not improved compared to the scheme without the bias correction. For scenarios 3 and 4, where additional diffuse background noise is present, i.e., the assumption of ideal diffuse noise is matched to a certain degree, a significant improvement for the spectral filter $g'_p$ can be observed.
Figure 18

SIR improvement and speech distortion for BSE scheme based on spectral filters $g_p$ and $g'_p$ (dmic = 6 cm). (a) SIR improvement. (b) Speech distortion.

For comparison, the performance of the proposed scheme based on the noise estimate produced by the ideal RTF-based BM and the DSB is shown in Figure 19a,b. Note that here the speech enhancement scheme with only the spectral filter g (31) is evaluated.
Figure 19

SIR improvement and speech distortion for BSE scheme based on ideal RTF- and DSB-based BM. (No bias correction, dmic=6 cm). (a) SIR improvement. (b) Speech distortion.

Although the ideal RTF-based BM can perfectly suppress the target source, the produced noise estimate is still biased relative to the true noise components contained in the microphone signals. Therefore, we cannot expect a perfect noise reduction for the speech enhancement scheme with the biased noise estimate. However, the noise reduction performance achieved with this noise estimate can be regarded as the upper limit for the scheme with the noise estimate produced by Directional BSS without applying a bias correction. It can be seen that for all scenarios, with Directional BSS as BM, the noise reduction performance is almost the same as the performance achieved with the ideal RTF-based BM (1 dB less), but clearly superior to the performance achieved with a DSB. For the latter, the large residual of the target component reduces the SIR improvement and leads to a significant distortion of the target source.

6 Conclusions

In our earlier work, Directional BSS was proposed as a BM for source extraction. The concept combines BSS with a geometric constraint to cope with the underdetermined scenario such that a meaningful and joint estimate of all interfering speech signals and diffuse background noise can be obtained using only two microphones. In this paper, we show that Directional BSS converges to an RTF-based ideal BM. Experimental results analyzing the system behavior and the blocking performance of Directional BSS under various acoustical conditions were presented. These results verify that Directional BSS can successfully estimate the RTF in underdetermined nonstationary noise scenarios without requiring source activity information. The target suppression performance of Directional BSS is clearly superior to a common DSB. Simulation results confirm also that Directional BSS is very robust against localization errors. Additionally, we evaluate a source extraction scheme which combines Directional BSS and Wiener-type spectral enhancement filters. It is shown that the noise reduction performance achieved by this scheme using Directional BSS is very close to the performance achieved by the proposed scheme using an ideal RTF-based BM. Therefore, for exploiting Directional BSS as a BM, no source activity information and no information on the number of active sources is necessary. The only required information for this informed algorithm is some coarse DoA information of the target source.

Declarations

Authors’ Affiliations

(1)
Chair of Multimedia Communications and Signal Processing, University of Erlangen-Nuremberg, Cauerstraße 7, Erlangen, 91054, Germany

References

  1. Hyvärinen A, Oja E: Independent Component Analysis. Wiley, New York; 2001.View ArticleMATHGoogle Scholar
  2. Makino S, Lee TW, Sawada H: Blind Speech Separation. Springer, Berlin; 2007.View ArticleGoogle Scholar
  3. Yilmaz Ö, Rickard S: Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process 2004, 52: 1830-1847. 10.1109/TSP.2004.828896MathSciNetView ArticleGoogle Scholar
  4. Araki S, Sawada H, Mukai R, Makino S: A novel blind source separation method with observation vector clustering. In Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC). Eindhoven; September 2005:117-120.Google Scholar
  5. Duong NK, Vincent E, Gribonval R: Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process 2010, 18: 1830-1840.View ArticleGoogle Scholar
  6. Ozerov A, Fevotte C: Multichannel nonnegative matrix factorization in convolutive mixtures with application to blind audio source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Taipei; March 2009:550-563.Google Scholar
  7. Sawada H, Kameoka H, Araki S, Ueda N: New formulations and efficient algorithms for multichannel NMF. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New York; October 2011:153-156.Google Scholar
  8. Van den Bogaert T, Doclo S, Wouters J, Moonen M: Speech enhancement with multichannel Wiener filter techniques in multimicrophone binaural hearing aids. J. Acoust. Soc. Am 2009, 125: 360-371. 10.1121/1.3023069View ArticleGoogle Scholar
  9. Cornelis B, Moonen M: A VAD-robust multichannel Wiener filter algorithm for noise reduction in hearing aids. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Prague; May 2011:281-284.Google Scholar
  10. Frost OL: An algorithm for linearly constrained adaptive array processing. Proc. IEEE 1972, 60: 926-935.View ArticleGoogle Scholar
  11. Van Trees HL: Optimum Array Processing (Detection, Estimation, and Modulation Theory, Part IV), 1st edn. Wiley, New York; 2002.Google Scholar
  12. Griffiths LJ, Jim CW: An alternative approach to linear constrained adaptive beamforming. IEEE Trans. Speech Audio Process 1982, 30: 27-34.Google Scholar
  13. Zelinski R: A microphone array with adaptive post-filtering for noise reduction in reverberant rooms. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Las Vegas; March 2008:2578-2581.Google Scholar
  14. McCowan I, Bourlard H: Microphone array post-filter based on noise field coherence. IEEE Trans. Speech Audio Process 2003, 11: 709-716. 10.1109/TSA.2003.818212View ArticleGoogle Scholar
  15. Ito N, Shimizu H, Ono N, Sagayama S: Diffuse noise suppression using crystal-shaped microphone arrays. IEEE Trans. Audio Speech Lang. Process 2011, 19: 2101-2110.View ArticleGoogle Scholar
  16. R Hendriks T: Gerkmann, Noise correlation matrix estimation for multi-microphone speech enhancement. IEEE Trans. Audio Speech Lang. Process 2012, 20: 223-233.View ArticleGoogle Scholar
  17. Taseska M, Habets EAP: MMSE-based blind source extraction in diffuse noise fields using a complex coherence-based a priori sap estimator. In Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC). Aachen, Germany; September 2012.Google Scholar
  18. Taseska M, Habets EAP: MMSE-based source extraction using position-based posterior probabilities. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Vancouver; May 2013.Google Scholar
  19. Parra L, Alvino C: Geometric source separation: merging convolutive source separation with geometric beamforming. IEEE Trans. Speech Audio Process 2002, 10: 352-362. 10.1109/TSA.2002.803443View ArticleGoogle Scholar
  20. Zheng Y, Reindl K, Kellermann W: BSS for improved interference estimation for blind speech signal extraction with two microphones. In Proceedings of 3rd International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). Dutch Antilles, Aruba; December 2009:253-256.Google Scholar
  21. Maas R, Schwarz A, Zheng Y, Reindl K, Meier S, Sehr A, Kellermann W: A two-channel acoustic front-end for robust automatic speech recognition in noisy and reverberant environments. In Proceedings of the International Workshop on Machine Listening in Multisource Environments (CHiME). Florence; September 2011:41-46.Google Scholar
  22. Van Veen BD, Buckley KM: Beamforming: a versatile approach to spatial filtering. IEEE ASSP Mag 1988, 5: 4-24.View ArticleGoogle Scholar
  23. Hoshuyama O, Sugiyama A: A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Atlanta; May 1996:925-928.Google Scholar
  24. Herbordt W, Kellermann W: Frequency-domain integration of acoustic echo cancellation and a generalized sidelobe canceller with improved robustness. Eur. Trans. Telecommun 2002, 13: 123-132. 10.1002/ett.4460130207View ArticleGoogle Scholar
  25. Gannot S, Burshtein D, Weinstein E: Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. Signal Process 2001, 49: 1614-1626. 10.1109/78.934132View ArticleGoogle Scholar
  26. Cohen I: Relative transfer function identification using speech signals. IEEE Trans. Speech Audio Process 2004, 12(5):451-459. 10.1109/TSA.2004.832975View ArticleGoogle Scholar
  27. Golan S, Gannot S, Cohen I: Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals. IEEE Trans. Audio Speech Lang. Process 2009, 17: 1071-1086.View ArticleGoogle Scholar
  28. Takahashi Y, Takatani T, Osako K, Saruwatari H, Shikano K: Blind spatial subtraction array for speech enhancement in noisy environment. IEEE Trans. Audio Speech Lang. Process 2009, 17: 650-664.View ArticleGoogle Scholar
  29. Smaragdis P: Blind separation of convolved mixtures in the frequency domain. Neurocomputing 1998, 22: 21-34. 10.1016/S0925-2312(98)00047-2MATHView ArticleGoogle Scholar
  30. Buchner H, Aichner R, Kellermann W: The TRINICON framework for adaptive MIMO signal processing with focus on the generic Sylvester constraint. In Proceedings of the ITG Conference on Speech Communication. Aachen, October; 2008.Google Scholar
  31. Buchner H, Aichner R, Kellermann W: Blind source separation for convolutive mixtures exploiting nongaussianity, nonwhiteness, and nonstationarity. In Proceedings of International Workshop on Acoustic Echo and Noise Control (IWAENC). Kyoto; September 2003:275-278.Google Scholar
  32. Aichner R, Buchner H, Yan F, Kellermann W: A real-time blind source separation scheme and its application to reverberant and noisy acoustic environments. Signal Process 2006, 86: 1260-1277. 10.1016/j.sigpro.2005.06.022MATHView ArticleGoogle Scholar
  33. Buchner H, Aichner R, Kellermann W: Relation between blind systemidentification and convolutive blind source separation. In Proceedings of Workshop for Hands-Free Speech Communication and Microphone Arrays (HSCMA). Piscataway; March 2005.Google Scholar
  34. Araki S, Makino S, Hinamoto Y, Mukai R, Nishikawa T, Saruwatari H: Equivalence between frequency-domain blind source separation and frequency-domain adaptive beamforming for convolutive mixtures. EURASIP J. Adv. Signal Process 2003, 2003: 1157-1166. 10.1155/S1110865703305074MATHView ArticleGoogle Scholar
  35. Gerven SV, Compernolle DV: Signal separation by symmetric adaptive decorrelation: stability, convergence, and uniqueness. IEEE Trans. Signal Process 1995, 43: 1602-1612. 10.1109/78.398721View ArticleGoogle Scholar
  36. Weinstein E, Feder M, Oppenheim A: Multichannel signal separation by decorrelation. IEEE Trans. Speech Audio Process 1993, 1: 405-413. 10.1109/89.242486View ArticleGoogle Scholar
  37. Zheng Y, Lombard A, Kellermann W: An improved combination of directional BSS and a source localizer for robust source separation in rapidly time-varying acoustic scenarios. In Proceedings of the Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA). Edinburgh; May 2011.Google Scholar
  38. Hjørungnes A: Complex-valued matrix derivatives: with applications in signal processing and communications. Cambridge University Press, Cambridge; 2011.View ArticleMATHGoogle Scholar
  39. Knapp C, Carter G: The generalized correlation method for estimation of time delay. IEEE Trans. Acoustics Speech Signal Process. ASSP 1976, 24: 320-327. 10.1109/TASSP.1976.1162830View ArticleGoogle Scholar
  40. Lombard A, Rosenkranz T, Buchner H, Kellermann W: Multidimensional localization of multiple sound sources using averaged directivity patterns of blind source separation systems. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Taipei; April 2009:233-236.Google Scholar
  41. Warsitz E, Krueger Ar, Haeb-Umbach R: Speech enhancement with a new generalized eigenvector blocking matrix for application in a generalized sidelobe canceller. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Las Vegas; March 2008:73-76.Google Scholar
  42. Krueger A, Warsitz E, Haeb-Umbach R: Speech enhancement with a GSC-like structure employing eigenvector-based transfer function ratios estimation. IEEE Trans. Audio Speech Lang. Process 2011, 19: 206-219.View ArticleGoogle Scholar
  43. Gerkmann T, Hendriks R: Noise power estimation based on the probability of speech presence. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz; October 2011:145-148.Google Scholar
  44. Martin R: Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process 2001, 9(5):504-512. 10.1109/89.928915View ArticleGoogle Scholar
  45. Cohen I: Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. IEEE Trans. Speech Audio Process 2003, 11(5):466-475. 10.1109/TSA.2003.811544View ArticleGoogle Scholar
  46. Ozerov A, Vincent E: Using the FASST source separation toolbox for noise robust speech recognition. In Proceedings of International Workshop on Machine Listening in Multisource Environments (CHiME 2011). Florence; September 2011.Google Scholar
  47. Moritz N, Schädler M, Adiloglu K, Meyer B, Jürgens T, Gerkmann T, Goetze S: Noise robust distant automatic speech recognition utilizing NMF based source separation and auditory feature extraction. In Proceedings of The 2nd International Workshop on Machine Listening in Multisource Environments (CHiME 2013). Vancouver; June 2013:1-1.Google Scholar
  48. Jeon K, Park N, Kim H, Choi M, Hwang K: Mechanical noise suppression based on non-negative matrix factorization and multi-band spectral subtraction for digital cameras. IEEE Trans. Consum. Electron 2013, 59(2):296-302.View ArticleGoogle Scholar
  49. Kuttruff H: Room Acoustics. Taylor & Francis, London; 2000.Google Scholar
  50. Habets EAP, Gannot S: Generating sensor signals in isotropic noise fields. J. Acoustical Soc. Am 2007, 122: 3464-3470. 10.1121/1.2799929View ArticleGoogle Scholar
  51. Martin R: Kurzfassung Freisprecheinrichtung mit mehrkanaliger Echokompensation und Störgeräuschreduktion. RWTH Aachen University, PhD thesis; 1995.Google Scholar
  52. Talmon R, Cohen I, Gannot S: Relative transfer function identification using convolutive transfer function approximation. IEEE Trans. Audio Speech Lang. Process 2009, 17: 546-555.View ArticleGoogle Scholar
  53. Reindl K, Barfuss H, Gannot S, Kellermann W, Markovich-S Golan: Geometrically constrained TRINICON-based relative transfer function estimation in underdetermined scenarios. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New York; October 2013.Google Scholar
  54. Kellermann W, Reindl K, Zheng Y: Method and acoustic signal processing system for interference and noise suppression in binaural microphone configurations. US Patent 2011/0307249 A1 2011.Google Scholar
  55. Reindl K, Zheng Y, Kellermann W: Speech enhancement for binaural hearing aids based on blind source separation. In Proceedings of 4th International Symposium on Communications, Control, and Signal Processing (ISCCSP). Limassol; March 2010.Google Scholar
  56. Reindl K, Zheng Y, Schwarz A, Meier S, Maas R, Sehr A, Kellermann W: A stereophonic acoustic signal extraction scheme for noisy and reverberant environments. Comput. Speech Lang. (CSL) 2012, 27: 726-745.View ArticleGoogle Scholar
  57. Vary P, Martin R: Digital Speech Transmission: Enhancement, Coding and Error Concealment. Wiley, Chichester; 2006.View ArticleGoogle Scholar
  58. Kim K, Jeong S, Jeong J, Oh K, Kim J: Dual channel noise reduction method using phase difference-based spectral amplitude estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Dallas; March 2010:217-220.Google Scholar
  59. Jeong S, Kim K, Jeong J, Oh K, Kim J: Adaptive noise power spectrum estimation for compact dual channel speech enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Dallas; March 2010:1630-1633.Google Scholar
  60. Vaidyanathan PP: Multirate Systems and Filter Banks. Prentice-Hall, Upper Saddle River; 1993.MATHGoogle Scholar

Copyright

© Zheng et al.; licensee Springer. 2014

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
