Analysis of dual-channel ICA-based blocking matrix for improved noise estimation
EURASIP Journal on Advances in Signal Processing volume 2014, Article number: 26 (2014)
Abstract
For speech enhancement or blind signal extraction (BSE), estimating interference and noise characteristics is decisive for performance. For multichannel approaches using multiple microphone signals, a BSE scheme combining a blocking matrix (BM) and spectral enhancement filters has been proposed in numerous publications. In such schemes, the BM provides a noise estimate by suppressing the target signal only. The estimated noise reference is then used to design spectral enhancement filters for noise reduction. For designing the BM, ‘Directional Blind Source Separation (BSS)’ was proposed earlier. This method combines a generic BSS algorithm with a geometric constraint derived from prior information on the target source position to obtain an estimate for all interfering point sources and diffuse background noise. In this paper, we provide a theoretical analysis to show that Directional BSS converges to a relative transfer function (RTF)-based BM. The behavior of this informed signal separation scheme is analyzed, and the blocking performance of Directional BSS under various acoustical conditions is evaluated. The robustness of Directional BSS regarding the localization error for the target source position is verified by experiments. Finally, a BSE scheme combining Directional BSS and Wiener-type spectral enhancement filters is described and evaluated.
1 Introduction
Blind signal extraction (BSE), which aims at extracting one source signal from a mixture of an unknown number of acoustic sources in noisy environments, is a generic task in acoustic signal processing. It has a wide range of applications in many fields: As popular examples, hands-free interfaces for acoustic communications and human-machine interaction offer many challenging and relevant application scenarios, such as teleconferencing, interactive television, humanoid robots, and gaming. Moreover, acoustic signal extraction techniques are also highly relevant for assistive devices, such as hearing aids.
If multiple microphones are available, data-dependent multichannel approaches for signal extraction can be classified into unsupervised and supervised approaches. The class of unsupervised methods does not require prior knowledge on the spatial distribution of sources and sensors. The lack of prior knowledge is compensated by exploiting fundamental signal characteristics. Conventional unsupervised signal extraction approaches are, e.g., independent component analysis (ICA)-based [1, 2] or sparseness-based blind source separation (BSS) algorithms [3, 4]. However, conventional ICA-based approaches cannot be used for underdetermined cases, where the number of sensors is less than the number of sources, and sparseness-based methods are highly dependent on the sparsity of the mixing signals. Recently, model-based multichannel approaches gained a lot of attention. These are, e.g., approaches based on a spatial covariance model [5] or multichannel nonnegative matrix factorization (NMF) methods based on modeling complex Gaussian distributions [6, 7]. As opposed to [1–4], they do not solely rely on the independence or the sparsity of the underlying signals and can be used for underdetermined source separation.
Unlike unsupervised methods, the class of supervised methods needs reference information. Typical supervised signal extraction approaches are, e.g., multichannel Wiener filtering (MWF) [8, 9] or beamforming approaches, such as the linearly constrained minimum variance (LCMV) beamformer [10]. MWF approaches are based on minimum mean square error (MMSE) estimators requiring noise and interference statistics as references. The LCMV beamformer requires reference signal(s) represented by linear constraints in order to suppress interfering sources and to preserve target signal(s) from known directions [11]. As an alternative form of the LCMV beamformer, the generalized sidelobe canceler (GSC) was proposed in [12], which converts the constrained optimization problem into an unconstrained problem.
Under realistic acoustic conditions, prior information is often exploited for practical realizations of supervised and unsupervised signal extraction approaches. This leads to the class of informed signal processing algorithms, where relevant information of the underlying conditions is exploited to realize the signal extraction algorithms or to render these algorithms more robust and reliable for practical conditions. Prior knowledge, which can be spatial information in terms of the direction of arrival (DoA) of source signals, coherence or diffuseness of the sound field, etc., may be given or estimated from the acquired sensor data. An overview of the relevant work belonging to this class is given in the following.
To realize the MWF, an estimate of the second-order statistics (SOS) of the noise signals is required. Based on the assumption of a diffuse noise field, several methods are derived for estimating the SOS of the noise components in terms of the auto-power spectral density (PSD) [13–15] or the cross PSDs between all channels for both the target source(s) and the noise and interference components [16]. Furthermore, it was recently proposed to exploit the direct-to-diffuse ratio (DDR) to realize the MWF for stationary noise and babble noise conditions [17]. It was also suggested to exploit the position information to estimate the cross PSDs of directional speech interferences [18]. For unsupervised algorithms, prior spatial information such as information about the source positions or the sensor constellation is often incorporated to improve the robustness. Model-based multichannel approaches [6, 7] can incorporate the directional information by initialization of a part of the spatial model. Parra and Alvino [19] proposed to combine an ICA algorithm with geometric constraints in order to improve the separation performance, where BSS was regarded as a set of beamformers whose response is constrained to a set of DoAs for recovering all sources from the mixture. Inspired by [19], Directional BSS [20] was proposed to serve as a blocking matrix (BM) when using a different constraint for the opposite purpose: this constraint essentially forces a spatial null towards a certain direction in order to suppress the target source and to preserve the interfering and noise components. The precondition not only for Directional BSS but also for Parra’s method is that the DoA information on the target source(s) must be given.
Furthermore, based on the noise estimate produced by Directional BSS, a two-unit source extraction/noise reduction scheme combining a BM and a noise reduction unit was proposed in [21], where the spectral weights in the noise reduction stage are designed based on a diffuse noise field assumption. In this paper, we focus on the discussion of Directional BSS operating as a BM.
The concept of a BM was originally proposed in [12] for the structure as shown in Figure 1. The structure separates the LCMV beamformer into two main processing paths: the first path comprises a fixed beamformer (FB) with constraints on the target signal. The second path contains a BM and an adaptive interference canceler (AIC) that adaptively minimizes the noise power in the output. The BM is defined as a matrix used to reject (block) the target signal at its output, hence providing references of all undesired interference signals and noise components required for interference cancellation schemes.
Originally in [12], the BM was designed for time-invariant free-field environments and rejected the source signal from one direction only, requiring precise source location information. This BM can be regarded as a minimum variance distortionless response (MVDR) BM, as the MVDR beamformer imposes the distortionless constraint only for the desired direction. For the dual-channel case, the conventional MVDR BM is given by the delay-and-subtract beamformer (DSB) and can only suppress the direct path of the target source. Theoretically, the conventional LCMV BM can suppress the direct path and reflections by formulating the corresponding constraints in the BM if perfect knowledge of the angle of arrival for each reflection is given [22]. However, the conventional LCMV/MVDR BM will likely lead to target signal leakage as it is conceived for time-invariant scenarios, and any movement of the target source will lead to a steering error relative to the true DoA for the target signal and its reflections. To improve the robustness against the steering error, an adaptive BM was proposed in [23, 24], which needs an adaptive control requiring source activity information. In [25], the relative transfer function (RTF)-based BM for LCMV/MVDR beamforming was proposed. The RTF-based BM can perfectly suppress the target signal if the RTFs are given. However, estimating RTFs usually requires estimation of the source activity or a double-talk detector, as noise-only frames or time segments where both the transfer functions (TFs) and the noise signals are assumed to be stationary need to be available for RTF estimation [25–27].
More recently, ICA-based BSS algorithms were proposed to realize a BM [20, 28]. The approach presented in [28] is very efficient in noise estimation but can only be used for overdetermined/determined scenarios (i.e., the number of sensors is larger than or equal to the number of sources) as [28] is a generic ICA-based BSS algorithm. In [20], it was proposed to exploit Directional BSS as a BM for noise estimation (here, the noise includes interfering sources and diffuse background noise). This approach can be applied in both determined and underdetermined scenarios. Unlike for beamforming approaches, correlated components arriving from other directions, i.e., reflections and reverberation, will also be suppressed to the greatest extent possible by Directional BSS. This concept can deal with underdetermined scenarios such that a meaningful instantaneous estimate for all undesired signals comprising interfering speech signals and diffuse background noise can be obtained using only two microphones and regardless of the noise statistics. Note that for applying the directional constraint, the directional information on the target source must be given or estimated by a source localizer. Even with a source localizer, a predefined angular range of the target source must be given. This range was set to be −20° to 20° in front of the microphone array [20]. The algorithm of Directional BSS was introduced and its efficiency was shown in [20] with respect to the blocking performance.
In this paper, we provide an in-depth analysis of the heuristically motivated BM in [20] and provide new insights with respect to several decisive aspects: (1) the relation of the ICA-based BM to other BMs, (2) the blocking performance if the target source arrives from directions which are different from the broadside direction, (3) the robustness against localization errors, and (4) the BSE/speech enhancement performance when using the noise estimate produced by Directional BSS for Wiener-type spectral enhancement. Additionally, a BSE scheme combining Directional BSS and spectral enhancement filters under various acoustical conditions will be evaluated. Therefore, the main contributions of this paper compared to our earlier work are the following: For one, we show by a theoretical analysis and by experimental results that Directional BSS converges to an RTF-based BM. In addition, the performance of the proposed method is for the first time analyzed regarding some practically highly relevant aspects, e.g., the blocking ability for sources impinging from arbitrary directions and the noise reduction performance of the applied noise reduction scheme.
The paper is organized as follows: In Section 2, the generic BSS algorithm is reviewed. In Section 3, we provide a theoretical analysis to show that Directional BSS converges to an RTF-based BM and describe the algorithm of Directional BSS. Moreover, the relation/difference of Directional BSS to other conventional/state-of-the-art BMs is discussed. Furthermore, in Section 4, experimental results with respect to the blocking performance and the robustness against localization errors in various acoustical scenarios are presented. Finally, a BSE scheme combining Directional BSS and Wiener-type spectral enhancement filters is presented and evaluated in Section 5. Note that in this paper, we restrict our consideration to two-channel cases.
2 Determined blind source separation: generic ICA-based BSS algorithm
In this section, we briefly review a two-channel ICA-based BSS algorithm. Figure 2 depicts the basic two-channel BSS signal model for two point sources $s_1, s_2$. The microphone signals $x_p$, $p\in\{1,2\}$, can be described in the discrete time domain by
$$x_p(k) = \sum_{m=1}^{2} h_{mp}(k) * s_m(k), \tag{1}$$
where $*$ represents convolution and $h_{mp}(k)$, $m\in\{1,2\}$, denote the finite acoustic impulse responses from the $m$th point source to the $p$th microphone in discrete time and $k$ is the discrete time index.
BSS algorithms aim at determining demixing filters to extract the individual sources from the mixed signals. The output signals of the demixing system $y_q$, $q\in\{1,2\}$, are described by
$$y_q(k) = \sum_{p=1}^{2} w_{pq}(k) * x_p(k), \tag{2}$$
where $w_{pq}(k)$ denotes the demixing filter from the $p$th microphone to the $q$th output channel.
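As a minimal sketch of the mixing and demixing models described above, the following Python code builds two microphone signals from two toy sources and passes them through a 2×2 demixing system. The random impulse responses, the signal lengths, and the helper names `mix` and `demix` are assumptions chosen for illustration only; they are not measured responses or part of the evaluated algorithm.

```python
import numpy as np

def mix(sources, h):
    """Convolutive mixing: x_p(k) = sum_m h[m][p] * s_m(k)."""
    n_src, n_mic = len(h), len(h[0])
    length = len(sources[0]) + len(h[0][0]) - 1
    x = np.zeros((n_mic, length))
    for p in range(n_mic):
        for m in range(n_src):
            x[p] += np.convolve(h[m][p], sources[m])
    return x

def demix(x, w):
    """Demixing: y_q(k) = sum_p w[p][q] * x_p(k)."""
    n_mic, n_out = len(w), len(w[0])
    length = x.shape[1] + len(w[0][0]) - 1
    y = np.zeros((n_out, length))
    for q in range(n_out):
        for p in range(n_mic):
            y[q] += np.convolve(w[p][q], x[p])
    return y

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 1000))          # two toy source signals
h = rng.standard_normal((2, 2, 16)) * 0.5   # h[m, p, :]: toy impulse responses
x = mix(s, h)

# Trivial demixing system: unit impulses on the direct paths, zeros on the cross paths
w = np.zeros((2, 2, 8))
w[0, 0, 0] = 1.0
w[1, 1, 0] = 1.0
y = demix(x, w)
```

With this trivial demixing system, the outputs simply reproduce the microphone signals; a BSS algorithm would instead adapt `w` so that the outputs become mutually independent.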
The various criteria used for identifying $w_{pq}$ in (2) (see, e.g., [1, 2, 29]) are essentially based on the assumption that the sources are statistically independent. In this paper, we use triple-N independent component analysis for convolutive mixtures (TRINICON) [30] for BSS, where the mutual information between the output channels $\mathbf{y}=[\mathbf{y}_{1}^{T}(k),\mathbf{y}_{2}^{T}(k)]^{T}$ should be minimized. As the algorithm is derived for block processing of convolutive mixtures, for each output $y_q(k)$, a sequence of $D$ output samples corresponding to $D$ successive time lags is taken into account.
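The generic cost function used to determine a demixing system $\mathbf{W}$ is then given by [31]
$$J_{\mathrm{BSS}}(\mathbf{W}) = \widehat{\mathbf{E}}\left\{\log\frac{\widehat{p}_{\mathbf{y},PD}(\mathbf{y})}{\prod_{q=1}^{P}\widehat{p}_{\mathbf{y}_q,D}(\mathbf{y}_q)}\right\}, \tag{3}$$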
where $\widehat{\mathbf{E}}\{\cdot\}$ is the estimate of the statistical expectation, with ensemble averaging being replaced by temporal averaging over $N$ blocks assuming ergodicity within the individual blocks. $\widehat{p}_{\mathbf{y},PD}$ is an estimate of the joint probability density function (pdf) of dimension $PD$ over all $P$ (here, $P=2$) output channels, and $\widehat{p}_{\mathbf{y}_q,D}$ is the estimated multivariate pdf for channel $q$ of dimension $D$. Matrix $\mathbf{W}$ captures all the impulse response coefficients of the demixing filters, with a detailed description of its structure given in [31, 32]. Minimizing $J_{\mathrm{BSS}}(\mathbf{W})$ corresponds to minimizing the Kullback-Leibler divergence (KLD) between $\widehat{p}_{\mathbf{y},PD}(\mathbf{y})$ and $\prod_{q=1}^{P}\widehat{p}_{\mathbf{y}_q,D}(\mathbf{y}_q)$, which leads to maximization of the statistical independence of the output vectors $\mathbf{y}_q$.
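To make the KLD criterion more tangible, the following sketch evaluates it in closed form under the additional assumption that the output vector is zero-mean Gaussian, in which case the KLD between the joint pdf and the product of the channel-wise pdfs reduces to $\frac{1}{2}\left(\sum_q \log\det\mathbf{R}_{qq} - \log\det\mathbf{R}\right)$. The Gaussian model, the toy covariance matrices, and the function name `gaussian_kld` are assumptions for illustration and do not reproduce the TRINICON implementation.

```python
import numpy as np

def gaussian_kld(R, D):
    """KLD between N(0, R) and the product of its P marginals of dimension D."""
    P = R.shape[0] // D
    _, logdet_joint = np.linalg.slogdet(R)
    logdet_marg = 0.0
    for q in range(P):
        block = R[q * D:(q + 1) * D, q * D:(q + 1) * D]
        logdet_marg += np.linalg.slogdet(block)[1]
    return 0.5 * (logdet_marg - logdet_joint)

rng = np.random.default_rng(1)
D = 4
A = rng.standard_normal((2 * D, 2 * D))
R_dependent = A @ A.T + 1e-3 * np.eye(2 * D)   # full covariance: channels are correlated
R_independent = R_dependent.copy()
R_independent[:D, D:] = 0.0                    # zero the off-diagonal blocks
R_independent[D:, :D] = 0.0

kld_dep = gaussian_kld(R_dependent, D)
kld_ind = gaussian_kld(R_independent, D)
```

The divergence is positive for correlated output channels and vanishes once the cross-channel blocks of the covariance matrix are zero, which is the condition the separation criterion drives the outputs towards in the second-order case.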
3 Directional blind source separation as a blocking matrix
In this section, we first discuss the relation of Directional BSS to a conventional RTF-based BM in Subsection 3.1. The Directional BSS algorithm is described in Subsection 3.2 before comparing it to alternative approaches in Subsection 3.3.
3.1 From system identification to RTF-based blocking matrix
In [33], the relation between the optimum broadband solution of blind source separation and blind system identification was presented. For a single-input/multiple-output (SIMO) system as shown in Figure 3, the perfect suppression of a broadband source implies for system identification:
$$h_{11}(k) * w_{11}(k) + h_{12}(k) * w_{21}(k) = 0. \tag{4}$$
The optimum filters fulfilling (4) read in the z-domain [33]:
$$W_{11}(z) = \alpha H_{12}(z), \qquad W_{21}(z) = -\alpha H_{11}(z). \tag{5}$$
As a precondition for identifying this solution, $H_{11}(z)$ and $H_{12}(z)$ must not have common zeros, and the filter lengths must equal the lengths of the room impulse responses. Obviously, the optimum filters can only be determined up to a scaling factor $\alpha$.
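The cancellation behind this system identification view can be checked numerically. The sketch below assumes toy random impulse responses and an arbitrary scaling factor; it filters the two microphone signals of a single source with cross-wise scaled copies of the acoustic responses and verifies that the source vanishes at the output. All numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
L = 32
h11 = rng.standard_normal(L)    # toy impulse response: source 1 -> microphone 1
h12 = rng.standard_normal(L)    # toy impulse response: source 1 -> microphone 2
s1 = rng.standard_normal(2000)  # toy broadband source signal

alpha = 0.7                     # arbitrary scaling factor
w11 = alpha * h12               # filter applied to microphone 1
w21 = -alpha * h11              # filter applied to microphone 2

x1 = np.convolve(h11, s1)
x2 = np.convolve(h12, s1)
y1 = np.convolve(w11, x1) + np.convolve(w21, x2)

# Residual power at the output relative to the power at microphone 1, in dB
residual_db = 10 * np.log10(np.sum(y1 ** 2) / np.sum(x1 ** 2) + 1e-300)
```

Because convolution is commutative, the two branches cancel for any value of `alpha`, which illustrates why the solution is only determined up to a scaling factor.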
Let us consider the case where $w_{21}$ is forced to be a delay $\tau$; then (4) reads in the z-domain:
$$H_{11}(z)\,W_{11}(z) + H_{12}(z)\,z^{-\tau} = 0. \tag{6}$$
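In the frequency domain, (6) can be expressed by:
$$\underline{h}_{11}(\Omega)\,\underline{w}_{11}(\Omega) + \underline{h}_{12}(\Omega)\,e^{-j\Omega\tau} = 0, \tag{7}$$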
where underlined characters denote frequency-domain representations. The normalized frequency $\Omega$ is given as $\frac{2\pi f}{f_{\mathrm{s}}}$, where $f_{\mathrm{s}}$ denotes the sampling frequency. The ratio of the two frequency responses $\frac{\underline{h}_{12}(\Omega)}{\underline{h}_{11}(\Omega)}$ is known as the RTF or the TF ratio.
If we divide $\underline{w}_{11}(\Omega)$ by $\underline{w}_{21}(\Omega)$, we get:
$$\frac{\underline{w}_{11}(\Omega)}{\underline{w}_{21}(\Omega)} = -\frac{\underline{h}_{12}(\Omega)}{\underline{h}_{11}(\Omega)}, \tag{8}$$
which is exactly the form of the RTF-based BM proposed in [25].
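The same relation can be illustrated in the DFT domain. The following sketch fixes the second filter to a pure delay and derives the first filter from the RTF, then checks that the target response at the BM output is zero in every bin. The decaying random impulse responses, the DFT size, and the delay value are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_fft = 256
decay = np.exp(-0.3 * np.arange(16))
h11 = rng.standard_normal(16) * decay           # toy decaying impulse responses
h12 = rng.standard_normal(16) * decay
H11 = np.fft.rfft(h11, n_fft)
H12 = np.fft.rfft(h12, n_fft)

tau = 8                                         # delay in samples for the fixed filter w_21
omega = 2 * np.pi * np.arange(n_fft // 2 + 1) / n_fft
W21 = np.exp(-1j * omega * tau)                 # pure delay
rtf = H12 / H11                                 # relative transfer function
W11 = -rtf * W21                                # blocking filter implied by the RTF

leakage = H11 * W11 + H12 * W21                 # target response at the BM output
```

Note that `W11` obtained this way is in general not a causal FIR filter of finite length; the sketch only verifies the frequency-domain identity.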
For a multiple-input/multiple-output (MIMO) system, it is shown in [33] that the optimum BSS solution is the generalization of the SIMO identification solution. This holds, however, only for determined cases. For an underdetermined scenario as shown in Figure 4, there is no determined solution. However, here, our aim is not to find a determined BSS solution in underdetermined scenarios, but to exploit BSS as a BM to suppress the target source $s_1$ only. Therefore, it still follows that
$$h_{11}(k) * w_{11}(k) + h_{12}(k) * w_{21}(k) = 0, \tag{9}$$
which is the same as in (4) for the system identification in a SIMO system. As BSS has no determined solution in underdetermined scenarios, the problem is how to force BSS to suppress the target source only and preserve the other sources to form a joint noise estimate. For this purpose, we combine the generic BSS with a geometric constraint to force a spatial null towards the direction of the target source. We denote the combined algorithm as ‘Directional BSS’ and analyze it in the following sections.
3.2 Algorithm
Blind source separation can be regarded as ‘blind adaptive beamforming’ (blind ABF) [34], as BSS and ABF have similar goals and a similar structure: Both attempt to extract a target signal and reduce the interference by multichannel array processing as described in [35, 36]. In [34], it is shown that BSS is equivalent to a set of adaptive beamformers which form multiple null beams steered towards the directions of interfering sources and their reflections. On the other hand, there are fundamental (characteristic) differences between BSS and ABF: generic BSS usually does not require prior information on source locations and sensor constellations, while ABF requires spatial information on the locations of sources and sensors. In [19], a method was proposed to combine BSS and beamforming for achieving a better separation performance by utilizing the geometric information of sources. This kind of combination is known as geometric source separation, where the response of the BSS demixing filters is additionally constrained to a set of directions.
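The original algorithm of geometric source separation was described in the discrete Fourier transform (DFT) domain. The response of BSS at the $q$th BSS output is constrained to the direction $\theta$, which can be expressed by
$$\underline{\mathbf{w}}_{q}^{\mathrm{T}}(\Omega)\,\underline{\mathbf{d}}(\Omega,\theta) = \xi, \tag{10}$$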
where $\xi$ denotes the constraint, $\underline{\mathbf{w}}_{q}(\Omega)=[\underline{w}_{1q}(\Omega),\underline{w}_{2q}(\Omega)]^{\mathrm{T}}$ describes the demixing filters for the $q$th BSS output channel at the frequency $\Omega=\frac{2\pi\nu}{N}$ ($\nu$ is the frequency bin and $N$ is the length of the demixing filter) in the DFT domain; $\{\cdot\}^{\mathrm{T}}$ is the transpose operator; $\underline{\mathbf{d}}(\Omega,\theta)$ is the steering vector pointing to direction $\theta$:
where c is the sound velocity. Note that both the microphone spacing d_{mic} and the angle θ relative to the array axis must be given. For simplicity, we omit the frequency variable Ω in the sequel.
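The role of the steering vector in such a directional constraint can be sketched as follows. The code assumes a far-field model with a phase term $\Omega d_{\text{mic}} f_{\mathrm{s}} \sin\theta / c$ and a negative sign convention for the second microphone; this sign convention, the speed of sound of 343 m/s, and the function names are assumptions of this sketch. It then constructs delay-and-subtract weights whose response towards the target direction is zero.

```python
import numpy as np

C_SOUND = 343.0  # assumed speed of sound in m/s

def steering_vector(omega, theta, d_mic, fs, c=C_SOUND):
    """Far-field two-microphone steering vector (sign convention assumed)."""
    phase = omega * d_mic * fs * np.sin(theta) / c
    return np.array([1.0, np.exp(-1j * phase)])

def null_weights(omega, theta, d_mic, fs, c=C_SOUND):
    """Weights w with w^T d(omega, theta) = 0 (delay-and-subtract null steering)."""
    phase = omega * d_mic * fs * np.sin(theta) / c
    return np.array([1.0, -np.exp(1j * phase)])

fs = 16000.0
d_mic = 0.06
omega = 2 * np.pi * 1000.0 / fs          # normalized frequency for f = 1 kHz
theta_target = np.deg2rad(20.0)

w = null_weights(omega, theta_target, d_mic, fs)
# For 1-D arrays, "@" computes the unconjugated inner product w^T d
resp_target = w @ steering_vector(omega, theta_target, d_mic, fs)
resp_other = w @ steering_vector(omega, np.deg2rad(-60.0), d_mic, fs)
```

The response vanishes for the constrained direction while a source from another direction passes with a non-zero gain, which is the behavior a blocking constraint aims for in the free field.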
More generally, to constrain the response of the BSS demixing matrix to a set of $P=2$ directions $\Theta$, we write:
$$\underline{\mathbf{W}}\,\underline{\mathbf{D}}(\Theta) = \mathbf{C}, \tag{12}$$
where $\underline{\mathbf{W}}=[\underline{\mathbf{w}}_{1},\underline{\mathbf{w}}_{2}]^{\mathrm{T}}$ is the BSS demixing matrix and $\underline{\mathbf{D}}(\Theta)=[\underline{\mathbf{d}}(\theta_{1}),\underline{\mathbf{d}}(\theta_{2})]$ contains steering vectors pointing to $\Theta=[\theta_{1},\theta_{2}]$. The 2×2 matrix $\mathbf{C}$ refers to the constraints.
Two constraints were proposed in [19], and they are
where I refers to a 2×2 identity matrix. As both of the two constraints aim at extracting the sources, not at blocking the sources, we will not discuss them here, but a detailed discussion can be found in [19, 37]. The constraint for blocking sources was proposed in [37]:
which restricts the output channels to have a zero response for the signals arriving from the directions given in Θ, i.e., it forces each output channel to form a null beamformer steered to the source which should be blocked in this output channel.
The constraint (16) can be incorporated into the overall cost function for the source separation (3) as an additional penalty term:
where $\|\mathbf{A}\|_{\mathrm{F}}^{2}=\text{trace}\{\mathbf{A}\mathbf{A}^{H}\}$ is the squared Frobenius norm of the matrix $\mathbf{A}$ and $\{\cdot\}^{H}$ refers to the conjugate transpose operator. Combining this with the cost function for the generic BSS algorithm given in (3), we obtain:
$$J(\mathbf{W}) = J_{\mathrm{BSS}}(\mathbf{W}) + \eta_{\mathrm{C}}\,J_{\mathrm{C}}(\underline{\mathbf{W}}), \tag{18}$$
where the weighting parameter η_{C} can be chosen to control the importance of the geometric constraint relative to the separation criterion represented by J_{BSS} (3).
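The Frobenius-norm penalty and its weighting can be sketched numerically. The code below assumes a generic constraint residual of the form $\underline{\mathbf{W}}\,\underline{\mathbf{D}}(\Theta)-\mathbf{C}$ with an identity matrix as an example target response, an assumed steering-vector sign convention, and a placeholder value for the separation cost. It is meant to show how the trace expression and the weighting parameter act, not to reproduce the specific blocking constraint of the paper.

```python
import numpy as np

def steering_matrix(omega, thetas, d_mic, fs, c=343.0):
    """Columns are far-field steering vectors (sign convention assumed)."""
    phases = omega * d_mic * fs * np.sin(np.asarray(thetas)) / c
    return np.vstack([np.ones_like(phases), np.exp(-1j * phases)])

def constraint_penalty(W, D, C):
    """Squared Frobenius norm of the constraint residual W D - C."""
    A = W @ D - C
    return np.real(np.trace(A @ A.conj().T))

omega = 2 * np.pi * 2000.0 / 16000.0
D = steering_matrix(omega, np.deg2rad([0.0, 45.0]), d_mic=0.115, fs=16000.0)
rng = np.random.default_rng(4)
W = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
C = np.eye(2)                        # example target response, chosen for illustration

penalty = constraint_penalty(W, D, C)
W_exact = C @ np.linalg.inv(D)       # satisfies W D = C exactly
penalty_exact = constraint_penalty(W_exact, D, C)

eta_c = 0.5                          # hypothetical weighting parameter
j_bss_placeholder = 1.0              # stand-in value, not computed here
j_total = j_bss_placeholder + eta_c * penalty
```

A larger `eta_c` makes deviations from the geometric constraint more costly relative to the separation criterion, whereas `eta_c = 0` recovers the unconstrained cost.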
As Directional BSS serves as a BM for a single desired source, only the target source needs to be suppressed. Therefore, $J_{\mathrm{C}}(\underline{\mathbf{W}})$ is modified by considering the following conditions:

The direct path for the target signal is suppressed by the penalty term analogously to null-steering beamforming, i.e., a spatial null is forced toward the direction $\theta$ of the target source.

As only the target signal needs to be suppressed, only one BSS output channel is controlled by the geometric constraint. Without loss of generality, output channel 1 is chosen to be controlled with the penalty term $J_{\mathrm{C}}(\underline{\mathbf{W}})$ in the sequel.

In order to converge to the RTF-based BM, $w_{21}$ is set to be a pure delay and remains unchanged during adaptation of the demixing system. Note that we could equivalently use the first channel as the reference, and in that case $w_{11}$ is a pure delay.
The simplified cost function for the constraint then reads:
As $J_{\mathrm{C}}(\underline{\mathbf{W}})$ depends on the complex-valued $\underline{\mathbf{W}}$, the gradient-descent update for the constrained part for $\underline{\mathbf{W}}$ is obtained by taking the derivative of the cost function $J_{\mathrm{C}}(\underline{\mathbf{W}})$ with respect to $\underline{\mathbf{W}}^{H}$ [38]. Besides, as we want to keep $w_{21}$ fixed, the constraint must be applied to the demixing filter $w_{11}$ only. Thus, the filter update term for the constraint part in the DFT domain yields:
where $a^{*}$ refers to the complex conjugate of $a$. It should be noted that both frequency-domain and time-domain BSS algorithms can be associated with the geometric constraint. As we use the time-domain TRINICON SOS-based algorithm given in [32], the filter is updated in the time domain for block $\breve{m}$ after iteration $\breve{k}$ as follows:
where $\breve{\mu}$ is the step size and $\Delta\mathbf{W}_{\text{total}}$ is given as [20]
where $\mathrm{DFT}^{-1}\{\cdot\}$ denotes the inverse discrete Fourier transform yielding a nonzero update contribution of the same length as the demixing filter length $N$. $\frac{\partial J_{\mathrm{C}}(\underline{\mathbf{W}})}{\partial \underline{\mathbf{W}}^{H}}$ is already given in (20). In [32], a detailed description of the applied TRINICON-based update $\frac{\partial J_{\text{BSS}}(\mathbf{W})}{\partial \mathbf{W}}$ can be found. The natural gradient update is given as [32]:
where $\mathbf{R}_{\mathbf{yy}}$ denotes the $2D\times 2D$ correlation matrix of the output signal vector $\mathbf{y}$ of length $2D$, and bdiag denotes the operation of setting all off-diagonal blocks of a block matrix to zero. $\beta(\breve{i},\breve{m})$ is a weighting function normalized to $\sum_{\breve{i}=0}^{\breve{m}}\beta(\breve{i},\breve{m})=1$, allowing for online, offline, or block-online realization of the algorithm [32].
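The constraint part of this update can be sketched in isolation. The code below assumes that the simplified penalty is the squared magnitude of the response of output channel 1 towards the target direction, uses an assumed steering-vector sign convention, and omits the TRINICON natural gradient term entirely. The step size, filter length, and number of iterations are toy values chosen for illustration.

```python
import numpy as np

def constraint_gradient(W11, W21, d):
    """Derivative of |W11*d[0] + W21*d[1]|^2 with respect to conj(W11), per bin."""
    response = W11 * d[0] + W21 * d[1]
    return response * np.conj(d[0])

N = 64                                   # filter length / DFT size for this toy example
fs, d_mic, c = 16000.0, 0.06, 343.0
theta = np.deg2rad(10.0)                 # assumed target direction
nu = np.fft.fftfreq(N) * N               # signed bin indices (negative frequencies in upper bins)
omega = 2 * np.pi * nu / N
d = np.stack([np.ones(N, dtype=complex),
              np.exp(-1j * omega * d_mic * fs * np.sin(theta) / c)])

tau = N // 2
W21 = np.exp(-1j * omega * tau)          # fixed pure delay, never updated
W11 = np.zeros(N, dtype=complex)         # adapted filter

mu = 0.5                                 # toy step size
cost = []
for _ in range(30):
    cost.append(np.sum(np.abs(W11 * d[0] + W21 * d[1]) ** 2))
    W11 = W11 - mu * constraint_gradient(W11, W21, d)

w11_time = np.fft.ifft(W11)              # time-domain contribution of length N
```

With only the penalty active, `W11` converges to the null-steering solution for the assumed direction; in the full algorithm, this term is combined with the separation update so that reflections of the target can be suppressed as well.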
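By applying Directional BSS, the target source $s_1$ should be suppressed in BSS output channel 1. Thus, the noise estimate $\widehat{n}$ is given by the output $y_1$ as follows:
$$\widehat{n}(k) = y_1(k) = \sum_{p=1}^{2} w_{p1}(k) * \left(x_{\mathrm{s},p}(k) + x_{\mathrm{n},p}(k)\right), \tag{24}$$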
where $x_{\mathrm{s},p}$ and $x_{\mathrm{n},p}$ denote the target and the noise component contained in microphone $p$, respectively.
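The decomposition of the noise estimate into a target leakage part and a filtered noise part can be illustrated with a toy example. The sketch assumes random impulse responses, one target and one interfering point source, and idealized blocking filters derived from the known target responses; a converged Directional BSS system would only approximate such filters.

```python
import numpy as np

rng = np.random.default_rng(5)
L = 24
h = rng.standard_normal((2, 2, L))      # h[m, p]: toy impulse responses, source m -> microphone p
s_target = rng.standard_normal(4000)
s_interf = rng.standard_normal(4000)

# Target and noise components at each microphone
x_s = np.stack([np.convolve(h[0, p], s_target) for p in range(2)])
x_n = np.stack([np.convolve(h[1, p], s_interf) for p in range(2)])
x = x_s + x_n

# Idealized blocking filters for output channel 1 (known target responses assumed)
w11, w21 = h[0, 1], -h[0, 0]

def bm_output(x1, x2):
    """Output channel 1 of the BM: w11 * x1 + w21 * x2."""
    return np.convolve(w11, x1) + np.convolve(w21, x2)

n_hat = bm_output(x[0], x[1])           # noise estimate y_1
leak = bm_output(x_s[0], x_s[1])        # target leakage in the noise estimate
noise_part = bm_output(x_n[0], x_n[1])  # filtered noise component
```

Since the BM is linear, the noise estimate is the sum of both parts; with ideal blocking filters the leakage vanishes, while the interferer is preserved in a filtered form.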
Besides, in [19], the efficiency of a proper geometrical initialization was shown. For the geometric constraint, the direction of the target source needs to be known a priori or it needs to be estimated. If the target source position is not known, an additional source localizer is necessary. Many localization algorithms can be used, e.g., GCC-PHAT [39] or an ICA-based source localizer [40]. With the given DoA information, we can initialize the filter structure corresponding to a DSB in order to accelerate convergence. The initialization is performed after each movement of the target source. Defining a vector $\underline{\mathbf{d}}_{\text{sub}}(\theta)=[1,e^{j\frac{\Omega d_{\text{mic}}f_{\mathrm{s}}\sin\theta}{c}}]^{T}$, for the constraint-controlled channel 1, the filter coefficients can be initialized as follows:
3.3 Comparison to alternative approaches
The original BM proposed by Griffiths and Jim [12] is constructed by subtracting pairs of time-aligned signals with respect to the target signal. For the dual-channel case, this is exactly a DSB, which is attractive for its simple structure. However, a major limitation in real acoustic scenarios is that the performance of a DSB will significantly degrade for imprecise target source position information, i.e., for steering errors. Additionally, due to reflections of the target signal impinging from directions other than the steering direction, significant signal leakage into the noise reference needs to be expected. As a possible countermeasure, an adaptive BM (ABM) with coefficient constraints was proposed in [23]. In this conventional ABM, the output of the FB (see Figure 1) is used as a reference signal for the target source and adaptively subtracted from the microphone signal. The least mean squares (LMS) algorithm is usually used for the ABM adaptation. However, the adaptation can only be carried out in time segments where only the target source is active. Therefore, a double-talk detector is necessary, which requires significant sophistication and will still be imperfect in complex acoustic scenarios. The difference of our approach to this BM is that (1) the adaptation criterion is different, which is very important for its practical relevance, and (2) a double-talk detector is not required.
The transfer-function generalized sidelobe canceler (TF-GSC) was proposed by Gannot et al. [25], where the BM is constructed based on RTFs. This approach takes the reverberant nature of the enclosure into account. The RTFs are estimated by a least squares method, and for this, two assumptions are necessary: (1) the RTFs change slowly over time compared to the time variations of the signals, which effectively precludes movements of the source, and (2) time segments are available where both the TFs and the noise signal are assumed to be stationary. In Section 3.1, we have already shown that our approach converges to an RTF-based BM. In contrast to [25], our approach does not rely on such time segments; only a coarse DoA estimation is required.
Warsitz et al. presented a BM based on a generalized eigenvalue decomposition. They construct the BM directly by using the beamformer filter coefficients resulting from maximizing the output signal-to-noise ratio, where the filter coefficients are computed iteratively by solving a generalized eigenvalue problem [41, 42]. This approach indirectly estimates the RTFs and does not require periods of absence of noise and the DoA of the target source. On the other hand, it works only for stationary noise, while our approach can work in a non-stationary multi-speaker scenario.
Recently, a subspace approach for estimating RTFs in multiple-noise scenarios was proposed in [27], which was used to construct the RTF-based BM efficiently. However, this approach needs an estimation of source activities.
The conventional noise estimation methods other than BM-based approaches are mostly based on source activity estimation [43] or minimum statistics noise power estimation [44, 45]. For those approaches, it is usually assumed that the sources are statistically independent and the noise is more stationary than the target signal. The recently well-studied model-based NMF approaches [5–7] can be used directly for noise reduction [46, 47] or for noise estimation [48]. Those methods usually rely on prior knowledge of the noise type (point source or diffuse noise) to define the model parameters. Therefore, they can only be as efficient as the models match the current scenario and are prone to fail if the model assumptions do not hold or the parameters could not be properly learned. The latter is especially crucial for online algorithms in time-varying scenarios. On the other hand, Directional BSS will fail if the interfering source arrives from the same direction as the target source, as then Directional BSS is not able to provide an estimate for the interfering source. Compared to these alternative approaches, the main advantage of the proposed approach is that no target source activity estimation or prior knowledge on the source characteristics is necessary and no model needs to be matched; only a coarse DoA estimation is required.
4 Evaluation of blocking matrices
In order to evaluate the proposed BM, comprehensive experiments were carried out. The system behavior and the target suppression performance of Directional BSS in single-source scenarios (only one directional source is active) and multiple-source scenarios (multiple directional sources are simultaneously active) are evaluated. For showing the system behavior, Directional BSS is compared to (1) an ideal RTF-based BM and (2) a BM based on a DSB. The ideal RTF-based BM is calculated from the measured room impulse responses (RIRs). In order to evaluate the target suppression performance, Directional BSS is compared to (1) a perfect adaptive BM (we name it ideal ABM) and (2) a BM based on a DSB. The ideal ABM is adapted in a single-source scenario. It should be noted that state-of-the-art BMs are mostly based on an estimation of source activities and require perfectly detected target-source-only time segments. Therefore, we always compare Directional BSS with the DSB and not with other BMs, as only these two BMs do not require estimating any source activities. The comparison to the ideal ABM shows how close Directional BSS can come to the perfectly supervised case. Besides, the robustness of Directional BSS against localization errors is analyzed in Section 4.3.
4.1 Experimental setup
Two real rooms were considered for evaluation: (1) room A: a living-room-like environment with a moderate reverberation time of $T_{60}\approx 250$ ms and a critical distance [49] of 1.3 m and (2) room B: a more reverberant living room with $T_{60}\approx 400$ ms and a critical distance of 0.9 m. Source-array distances of 1 and 2 m were considered. The experiments are based on RIR measurements carried out with a two-channel array. The measurements were performed for two different microphone spacings, $d_{\text{mic}}\in\{6, 11.5\}$ cm, at a sampling frequency of 48 kHz using the maximum length sequences (MLS) method [49]. For the following evaluation, the RIRs were downsampled to a sampling frequency $f_{\mathrm{s}}=16$ kHz. We combine the efficient SOS-based online BSS algorithm presented in [32] with the geometric constraint (22) to perform Directional BSS. The filter length of the finite impulse response (FIR) filters $w_{pq}$ (21) is 1024, and the block length for estimation of the correlation matrix $\mathbf{R}_{\mathbf{yy}}$ (23) is 2048. The number of iterations per data block of 125 ms is 15 (see [32] for details on the adaptation). Three male and female speech signals of length 10 s were used as source signals. Diffuse background noise components were simulated using the method proposed in [50]. All sources (including speech sources and diffuse source signals) are continuously active and normalized to equal average power. For the experiments, the DoA information of the target source is given. However, as in practice a source localizer is necessary to estimate the target DoA, the robustness of Directional BSS against localization errors is investigated in Section 4.3.
4.2 Performance of the blocking matrices
As performance measures, we use (1) the frequency response of the overall system to show the system behavior in different scenarios under various acoustical conditions, (2) the target suppression gain and the normalized squared error between the estimated RTF and the true RTF to measure the blocking performance, and (3) the signal-to-interference ratio (SIR) difference between the BM input and the BM output signal to measure the ability of the BM to preserve all interfering signals. Note that here the interference includes interfering sources and diffuse background noise.
4.2.1 Frequency response of the overall system
To study the overall system behavior, we investigate the frequency response of the transfer function for a BM supplied with perfect localization information. The transfer function for different source positions −90°≤ϕ≤90° is evaluated as depicted in Figure 5. The spatiotemporal frequency response associated with the BM is given by
where $\underline{\widehat{n}}_{s}$ refers to the residual of the target signal component s in the noise estimate $\widehat{n}$ (‘leakage’). This characterization is similar to but not equal to a beam pattern in the usual sense, where it is assumed that the acoustic waves propagate in free field and no scattering is considered. Instead, (26) also considers the acoustic environment by accounting for the transfer functions from the source position to the microphones. Thereby, $\underline{h}_{\text{trans}}$ captures reflections of source signals determined by the given source positions. Thus, if (26) exhibits a minimum for a certain angle relative to a certain distance to the microphone array, it indicates that all signal components originating from this angle at this distance, including possible reflections at surfaces in the acoustic environment, are suppressed to the given extent.
We show the magnitude response of (1) Directional BSS, (2) the ideal RTF-based BM, and (3) a DSB. With the ideal RTF-based BM, the target component is perfectly suppressed. Directional BSS is expected to converge to this ideal solution. For the BM based on a DSB, the filter coefficients are calculated according to the fractional delay between the two microphone signals. For Directional BSS, the BM coefficients are a set of converged BSS demixing filters of length 1024 adapted for the corresponding scenarios. We first show the magnitude response for the BMs adapted/calculated for single-source scenarios.
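As a rough illustration of the DSB-based BM mentioned above, the following sketch computes the fractional inter-microphone delay for a far-field source and the resulting free-field magnitude response of a delay-and-subtract structure. The speed of sound (343 m/s), the far-field and free-field model, and the function names are assumptions made for this example; reverberation, which the measured responses in the figures include, is ignored.

```python
import cmath
import math

C_SOUND = 343.0  # assumed speed of sound in m/s (not stated in the text)


def tdoa_samples(angle_deg, d_mic, fs):
    """Inter-microphone delay in samples for a far-field source (0 deg = broadside)."""
    return d_mic * math.sin(math.radians(angle_deg)) * fs / C_SOUND


def dsb_bm_magnitude(freq_hz, src_deg, steer_deg, d_mic, fs=16000):
    """Free-field magnitude response of a delay-and-subtract blocking matrix."""
    # Delay that remains after time-aligning both channels to the steering direction.
    residual = tdoa_samples(src_deg, d_mic, fs) - tdoa_samples(steer_deg, d_mic, fs)
    omega = 2.0 * math.pi * freq_hz / fs  # normalized angular frequency
    return abs(1.0 - cmath.exp(-1j * omega * residual))


if __name__ == "__main__":
    # Response at 1 kHz for a 6 cm array steered to broadside.
    for angle in (-90, -45, 0, 45, 90):
        mag = dsb_bm_magnitude(1000.0, angle, 0.0, 0.06)
        print(f"{angle:+4d} deg: |H| = {mag:.3f}")
```

In this idealized model the null at the steering direction is perfect; the weaker null of the DSB in a real room stems from reflections that a pure delay cannot capture.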
In Figure 6, the magnitude responses for three BMs (ideal RTF-based BM, Directional BSS, DSB) steered towards 0° are depicted for the arrays with d_{mic}= 6 cm and d_{mic}= 11.5 cm, respectively, in room A with 1 m source-array distance. Comparing all plots in Figure 6, the three BMs have similar magnitude responses. For all three BMs, spatial aliasing is unavoidable at f>5 kHz (d_{mic}= 6 cm) and at f>3 kHz (d_{mic}= 11.5 cm). Besides, they do not have a significant spatial selectivity for low frequencies (lower than 300 Hz), and it is observed that the frequency range with no spatial selectivity is larger for d_{mic}= 6 cm than for d_{mic}= 11.5 cm. In this frequency range, not only the target source but also interferers located at positions differing from 0° are suppressed to a large extent and, consequently, no noise estimate can be obtained. Despite similar behaviors in the range of low frequencies, it is clearly noticeable that both the ideal RTF-based BM and Directional BSS achieve a more pronounced spatial null than the DSB, which reflects a much better suppression performance of these two BMs compared to the DSB.
In Figures 7 and 8, the magnitude responses for three BMs are depicted for steering directions −45° and −90° in room A with 1 m source-array distance. Obviously, the behaviors of the BMs change if the target source moves towards −90°. For the steering direction of −45°, the ideal RTF-based BM can still perfectly suppress the source but the null becomes broader. For Directional BSS and a DSB, the spatial null becomes apparently weaker and broader. Besides, it is observed that the spatial null reaches only up to approximately 4 kHz. In practice, this will not affect the performance of Directional BSS for suppressing speech signals too significantly, as most of the energy of speech signals is usually in the frequency range below 4 kHz.
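The aliasing limits quoted above for a null steered to broadside can be cross-checked with a small calculation. The sketch assumes a free-field, far-field two-microphone model and a speed of sound of 343 m/s (neither is stated in the text): a null steered to φ0 repeats at angles φ where f·d_mic·(sin φ − sin φ0)/c is a nonzero integer, so the first additional null enters the range −90° to 90° at f = c / (d_mic·(1 + |sin φ0|)). The function name is hypothetical.

```python
import math


def first_aliasing_frequency(d_mic, steer_deg=0.0, c=343.0):
    """Lowest frequency (Hz) at which a second spatial null appears within +/-90 deg.

    Assumes far-field, free-field propagation and an assumed speed of sound c.
    """
    max_sine_difference = 1.0 + abs(math.sin(math.radians(steer_deg)))
    return c / (d_mic * max_sine_difference)


if __name__ == "__main__":
    for d_mic in (0.06, 0.115):
        f_alias = first_aliasing_frequency(d_mic)
        print(f"d_mic = {100 * d_mic:.1f} cm: aliasing above about {f_alias:.0f} Hz")
```

Under these assumptions the broadside limits come out near 5.7 kHz for 6 cm and 3.0 kHz for 11.5 cm, in line with the ranges reported for Figure 6. The same expression also suggests that the limit drops as the steering direction moves away from broadside.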
For the steering direction of −90°, the spatial null of Directional BSS becomes broader, especially at low frequencies, whereas almost no spatial null can be observed for the DSB. The target suppression gain (discussed in Section 4.2.2) for Directional BSS degrades from 20 dB for the target signal at 0° to about 10 dB for the target signal at −90°, where the target suppression performance is still acceptable, but the missing selectivity will suppress interfering sources located close to the target source as well. In the following experiments, we limit the target source position to the range [−20°, 20°] relative to the broadside of the microphone array. Besides, we note that the spatial null of the proposed method is limited to frequencies below 4 kHz, which is due to the fact that we use speech signals as the test signals and the energy of speech signals is concentrated in the frequency range below 4 kHz. However, it should be noted that the spatial null can be extended to a higher frequency range by using other wideband signals with sufficient support at those frequencies.
In Figure 9a, the magnitude responses of Directional BSS for different rooms and different source-array distances are depicted. It can be seen that the spatial null of Directional BSS becomes slightly weaker with increased reverberation.
To explain the performance degradation, we plot the magnitude squared coherence (MSC) of the target signal between the two microphones for each testing scenario in Figure 9e-9h. The MSC is estimated by using Welch’s averaged periodogram method. The block length for estimating the MSC is 2048, which is the same as the block length for BSS adaptation. As can be seen, the target signal for room A with 1 m distance is strongly correlated (MSC ≈1). With increasing reverberation, the coherence of the target signal becomes weaker. Consequently, the blocking performance of Directional BSS degrades. If we increase the block length to be larger than the length of the measured RIRs, the bias of the coherence towards zero will reduce according to [51], and the MSC will be close to 1 again. Therefore, theoretically, increasing both the filter length of the demixing system and the block length will increase the performance. However, a deteriorating convergence of the BSS algorithm must be expected for very long demixing filters. This is a general problem of adaptive filtering realized in the time domain.
In the above figures, we showed the behavior of the three BMs in single-source scenarios. For multiple-source scenarios, the ideal RTF-based BM and the BM based on a DSB remain unchanged. However, the adaptation of Directional BSS is affected by the presence of the interfering sources. Consequently, the performance of Directional BSS is different from the performance in single-source scenarios. Figure 10 illustrates the magnitude responses of Directional BSS steering at 0°, with one interfering point source at 30°. It can be seen that the spatial null is only slightly weaker compared to the single-source case (Figure 9), especially at low frequencies. This indicates that the target source is slightly less suppressed (degradation of about 1 to 4 dB in terms of the target suppression gain) due to the presence of the interfering signal.
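The coherence bias discussed above, where the estimated MSC drops when the block length is short relative to the propagation path between the two channels, can be illustrated with a toy Welch estimator. This is a minimal sketch and not the estimator used for Figure 9: the Hann window, the 50% overlap, the block length of 32, the white-noise test signal, and a pure delay standing in for a long impulse response are all assumptions chosen to keep the example small.

```python
import cmath
import math
import random


def half_dft(block):
    """Naive DFT of a real block, bins 0..N/2 (adequate for short illustrative blocks)."""
    n = len(block)
    return [sum(v * cmath.exp(-2j * math.pi * m * k / n) for k, v in enumerate(block))
            for m in range(n // 2 + 1)]


def msc_welch(x, y, block_len):
    """Magnitude squared coherence via Welch averaging (Hann window, 50% overlap)."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * k / block_len) for k in range(block_len)]
    n_bins = block_len // 2 + 1
    sxx, syy, sxy = [0.0] * n_bins, [0.0] * n_bins, [0j] * n_bins
    for start in range(0, min(len(x), len(y)) - block_len + 1, block_len // 2):
        xb = half_dft([w * v for w, v in zip(window, x[start:start + block_len])])
        yb = half_dft([w * v for w, v in zip(window, y[start:start + block_len])])
        for m in range(n_bins):
            sxx[m] += abs(xb[m]) ** 2
            syy[m] += abs(yb[m]) ** 2
            sxy[m] += xb[m] * yb[m].conjugate()
    return [abs(sxy[m]) ** 2 / (sxx[m] * syy[m]) for m in range(n_bins)]


def delayed(x, delay):
    """Delay a signal by an integer number of samples, keeping its length."""
    return [0.0] * delay + x[:len(x) - delay]


if __name__ == "__main__":
    random.seed(0)
    x = [random.gauss(0.0, 1.0) for _ in range(1024)]
    for delay in (2, 40):  # shorter and longer than the block length of 32
        msc = msc_welch(x, delayed(x, delay), block_len=32)
        print(f"delay {delay:2d}: mean MSC = {sum(msc) / len(msc):.2f}")
```

Both channel pairs are perfectly linearly related, yet the pair whose delay exceeds the block length yields a low estimated MSC, which mirrors the bias towards zero described in the text.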
4.2.2 Target suppression performance
The blocking performance should be quantified to show how well the target source can be suppressed. We propose to use two measures to evaluate the performance. One is the target speech suppression gain which is defined as follows:
where ${\sigma}_{a}^{2}$ denotes the (long-term averaged) signal power of the signal a, x_{s,p} denotes the target component contained in the p th microphone signal, and ${\widehat{n}}_{s}$ denotes the target residual contained in the noise estimate. The target suppression gain of the ideal RTF-based BM is infinity. A higher target suppression gain indicates a higher blocking performance for the signal from the target source direction. This measure is very similar to the ‘signal blocking factor’ used in [52]. We compare the target suppression performance with (1) the DSB and (2) a simulated ideal ABM, where one microphone channel is simply adaptively subtracted from the other with an LMS-type algorithm. For this simulation, an ideal case is assumed, i.e., the microphone signal contains only the target signal. The simulated ABM can be regarded as an ideal version (in a supervised case) of a conventional ABM using an LMS algorithm as proposed in [23, 24].
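A minimal sketch of how such a target suppression gain could be computed from separately available signal components is given below. It assumes the gain is the ratio, in dB, of the long-term power of the target component at one reference microphone to the power of the target residual in the noise estimate; the exact normalization of the original definition (for example, averaging over both microphones) may differ, and the function names are hypothetical.

```python
import math


def power(signal):
    """Long-term average power of a real-valued signal."""
    return sum(v * v for v in signal) / len(signal)


def target_suppression_gain_db(target_at_mic, target_residual):
    """Assumed form: 10*log10(power of x_{s,p} / power of the residual n_hat_s)."""
    return 10.0 * math.log10(power(target_at_mic) / power(target_residual))


if __name__ == "__main__":
    x_s = [math.sin(0.1 * k) for k in range(1000)]  # toy target component
    n_hat_s = [0.1 * v for v in x_s]                # residual attenuated by a factor of 10
    print(f"{target_suppression_gain_db(x_s, n_hat_s):.1f} dB")
```

A residual of exactly zero, as for the ideal RTF-based BM, would make the ratio unbounded, which matches the statement that the gain is infinite in that case.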
Additionally, we calculate the normalized squared error (NSE) between the estimated RTFs and the ideal RTF calculated from the measured RIR to evaluate the estimation of the RTF. The NSE is calculated as follows:
where $\widetilde{\text{RTF}}_{\text{BM}}$ denotes the RTF estimated by a BM, e.g., $\widetilde{\text{RTF}}_{\text{DirBSS}}$ refers to the RTF estimated by Directional BSS, k is the time sample index, and N is the filter length of the BM; in our experiments, it was chosen to be 1024.
The scenarios depicted in Figure 11a are considered for evaluating the blocking performance. In scenarios 1 to 3, only point sources are active. One male speech signal of 10 s was used as the target source. A female speech signal of the same length was used as the interferer in scenario 2. For scenario 3, a female and a male speech signal were used as the interferers. In scenario 4, additional diffuse background noise is added to the microphone signals. All test signals are normalized to equal power.
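The NSE defined above compares two time-domain filters of length N. A minimal sketch follows, assuming the error energy is normalized by the energy of the true RTF and expressed in dB; both the normalization and the dB scale are assumptions, as are the function name and the toy coefficients.

```python
import math


def nse_db(rtf_estimated, rtf_true):
    """Assumed form: 10*log10( sum_k (est[k] - true[k])^2 / sum_k true[k]^2 )."""
    if len(rtf_estimated) != len(rtf_true):
        raise ValueError("both filters must have the same length N")
    error_energy = sum((e - t) ** 2 for e, t in zip(rtf_estimated, rtf_true))
    reference_energy = sum(t * t for t in rtf_true)
    return 10.0 * math.log10(error_energy / reference_energy)


if __name__ == "__main__":
    rtf_true = [1.0, 0.5, -0.25, 0.125]       # toy coefficients, not measured data
    rtf_est = [1.1 * v for v in rtf_true]     # estimate with a 10% scaling error
    print(f"NSE = {nse_db(rtf_est, rtf_true):.1f} dB")
```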
Figure 11b shows the target suppression gain for the three BMs. As the DSB is only dependent on the target direction and the target source-array distance, the target suppression gain of the DSB for scenarios 2 to 4 is the same as for scenario 1. We simulate the ideal ABM by adapting the BM filter in a single-source scenario, i.e., scenario 1. Therefore, the performance of the ideal ABM is only shown for this scenario. It can be seen that in a single-source scenario, the target suppression gain of an ideal ABM is only slightly higher than that of Directional BSS, which indicates that in a single-source scenario, Directional BSS comes close to the upper limit. For scenarios 2 to 4, the target suppression gain of Directional BSS degrades but is always over 10 dB and clearly superior to that of the DSB. In Figure 11c, NSE_{DirBSS} and NSE_{idealABM} are shown. For scenario 1, where only the target source is active, the NSEs of both BMs are very close and very low, which indicates that in a single-source scenario, the estimated RTFs are very close to the true RTF. With an increased number of sources or with more complicated acoustical conditions (higher reverberation time and larger source-array distance), it is more difficult to estimate the RTF. We can see that NSE_{DirBSS} increases and the target suppression gain degrades. However, even in such complex scenarios, Directional BSS can produce an acceptable estimate of the RTF without any source activity detection. In the latest work [53], more evaluation results for comparing the estimated RTFs with the true RTFs are shown. The performance of Directional BSS is somewhat dependent on the signal characteristics (stationary or nonstationary, speech signal or white noise) of the involved sources. The energy of white noise is distributed over the full band, while the energy of speech-like sources is usually concentrated at low frequencies. Besides, nonstationary sources make the adaptation of Directional BSS difficult, as it needs to track the variation of the signals within short frames. Therefore, different signal characteristics will lead to different results.
4.2.3 Preservation of interfering sources
The target suppression gain can only be used to evaluate the blocking performance for the target signal. From the magnitude response for the overall system, it can be seen that spatial aliasing appears in a certain frequency range. The goal of a BM is to produce a noise reference by suppressing the target source, which indicates that the noise signals should be well preserved while the target source should be well suppressed. Therefore, besides the target suppression gain, we need to measure how well the noise signals are preserved. To this end, we define the SIR_{diff} as follows:
where SIR_{in} and SIR_{outBM} are given by
where ${\widehat{n}}_{s}$ and ${\widehat{n}}_{n}$ denote the target and the noise component contained in the noise estimate $\widehat{n}$, respectively. The higher SIR_{diff}, the better the noise signals are preserved relative to the target signal.
We carried out a test for the scenario shown in Figure 12a, where the target source is located at 0°, while the interfering source is located at a varying DoA from −90° to −10°.
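The SIR difference introduced above can be sketched as follows. The sketch assumes that both SIRs are target-to-interference power ratios in dB and that SIR_{diff} is the input SIR minus the SIR at the BM output, so that stronger target suppression with preserved noise gives a larger value; this sign convention is inferred from the statement that a higher SIR_{diff} is better and is not confirmed by the text. All function names are hypothetical.

```python
import math


def power(signal):
    """Long-term average power of a real-valued signal."""
    return sum(v * v for v in signal) / len(signal)


def sir_db(target, interference):
    """Target-to-interference power ratio in dB."""
    return 10.0 * math.log10(power(target) / power(interference))


def sir_diff_db(x_s, x_n, n_hat_s, n_hat_n):
    """Assumed form: SIR_in - SIR_outBM, with both SIRs defined as target over noise."""
    return sir_db(x_s, x_n) - sir_db(n_hat_s, n_hat_n)


if __name__ == "__main__":
    x_s = [1.0, -1.0] * 50          # toy target component at the BM input
    x_n = [1.0, 1.0, -1.0, -1.0] * 25  # toy noise component with equal power
    n_hat_s = [0.1 * v for v in x_s]   # target attenuated by 20 dB at the BM output
    n_hat_n = list(x_n)                # noise fully preserved
    print(f"SIR_diff = {sir_diff_db(x_s, x_n, n_hat_s, n_hat_n):.1f} dB")
```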
Figure 12b,c shows the SIR_{diff} achieved by Directional BSS and the DSB for the testing scenario. We observe that SIR_{diff} decreases with increased reverberation. If the interfering source is located near the target source, e.g., at −20° or −10°, BSS may treat both as one source, so that both the interfering source and the target source are suppressed to a certain extent. Comparing the performance of Directional BSS and the DSB, it can be seen that Directional BSS is clearly superior to the DSB, especially in reverberant environments.
As the target source is defined to impinge from the range −20° to 20° relative to the broadside of the microphone array, the blocking performance, including the preservation of the interfering source, for target positions other than 0° is of interest. Figure 13 shows the obtained Gain_{sup} and SIR_{diff} for a scenario where the target is located at −10° or −20°, an interfering source is located at 60°, and diffuse noise is active. It can be seen that Directional BSS can still achieve a target suppression gain of more than 10 dB. Here, the results of the DSB for the same scenarios are not shown, as even for the source located at 0°, only less than 10 dB Gain_{sup} can be obtained (see Figure 11b). For a source located off 0°, the performance of the DSB degrades further.
4.3 Robustness against localization errors
In practical applications, the target direction is usually unknown and needs to be estimated using a source localization algorithm [39, 40], so that estimation errors must be expected. Hence, the robustness of Directional BSS against localization errors is of special interest. To this end, we experimentally evaluate the sensitivity of Directional BSS with respect to localization errors. The scenario for evaluation is illustrated in Figure 14a, where the target source is always located at 0°, and one active interferer varies its direction from −90° to −10°. The target localization error is 5°, 10°, or 15°. Various values for η_{C} from 0.1 to 0.8 are applied for the constraint.
In Figure 14b-14g, the target suppression results are shown. The SIR_{diff}, which measures the preservation of the interfering sources as discussed in Section 4.2.3, is shown in Figure 15. From Figure 14b,d,f and Figure 15a,c,e, it can be seen that for d_{mic}=6 cm, a localization error of 15° can be tolerated: Gain_{sup} is above 10 dB and SIR_{diff} is above 6 dB if the interferer is far from the target source. However, as indicated above, if the interferer is close to the target source, BSS might treat the interferer and the target source as a single source and jointly suppress them. This leads to a certain confusion in BSS adaptation, which results in a performance degradation (low SIR_{diff} for the interferer close to the target source). Similar results are observed for d_{mic}=11.5 cm. The results show that with larger microphone spacing, BSS is more sensitive to the localization error, especially if the interfering sources are near the target source. Besides, for a large localization error, Directional BSS achieves a better performance with lower η_{C}. Basically, if BSS runs freely without any constraint (η_{C}=0), it automatically adapts to the true source direction for source separation. However, this holds only for determined cases, i.e., for scenarios in which the number of sources does not exceed the number of microphones (here, two sources and two microphones). For an underdetermined case with more than two simultaneously active point sources, BSS will always try to produce mutually statistically independent outputs. Therefore, in an underdetermined situation, a determined BSS will divide the sources into two groups, which leads to an unpredictable suppression/separation of the sources, e.g., it may treat the target source and the nearest interfering source together as one source/one group and produce a compromise suppression.
Therefore, we need to constrain BSS to suppress the source from a predefined direction only but not to constrain BSS too much in order to tolerate a possible localization error. This is balanced by the weighting factor η_{C} in (22) which controls the importance of the geometric constraint relative to the separation criterion. A lower η_{C} indicates less weight for the geometric constraint, and the estimation of the demixing filters is more based on statistical independence for source separation. Hence, a lower η_{C} should be chosen for unreliable DoA information so that Directional BSS can better adapt to the true target direction.
5 Application of a DirBSS-based noise estimate to blind signal extraction
In this section, a two-channel BSE scheme combining Directional BSS as BM and Wiener-type spectral enhancement filters will be presented and evaluated.
5.1 A two-unit scheme: BM plus spectral enhancement filters
The noise estimate obtained by the above method can be used for various applications. Conventionally, it can be used for a realization which relies on a noise estimate produced by an RTF-based BM [25, 53] or for the MWF, which requires an SOS estimate of the noise [54]. In this section, we discuss one generic application to show the effectiveness of the noise estimation of the proposed BM. A two-channel BSE scheme combining Directional BSS as BM and Wiener-type spectral enhancement filters will be presented. The scheme is depicted in Figure 16a. It comprises two units. In the first unit, an estimate of the noise components is produced by a BM. The noise estimate as well as the microphone signals are fed into the noise reduction unit so that the desired speech components can be extracted from the microphone signals.
Typical approaches which can be considered for the speech enhancement unit include an interference canceler or Wiener-type spectral enhancement filters. However, in underdetermined cases, an interference canceler is not able to suppress all noise sources [55]. Therefore, Wiener-type spectral enhancement filters based on the obtained noise estimate are used for the noise reduction unit. The real-valued spectral weights for frequency Ω at output channel p are given by [56]
where $\widehat{S}_{aa}$ represents the auto-PSD of a, v_{ p }=w_{p 1}∗x_{ p } denotes the p th microphone signal filtered by Directional BSS, $\underline{g}_{\min}$ refers to the minimum value of the spectral weights (spectral floor), and μ is a real number which is used to achieve a trade-off between noise reduction and speech distortion. Note that the spectral weights can be designed in many forms, e.g., they can be derived from Bayesian estimation using the maximum likelihood (ML) method, the maximum a posteriori (MAP) estimator, or the MMSE estimator [57].
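The role of the quantities just defined can be sketched with a generic Wiener-type (spectral-subtraction-like) weighting rule. The form max(1 − μ·Ŝ_n̂n̂/Ŝ_vv, g_min) used here is an assumption consistent with the symbols above; the exact expression in the original equation may differ, and the function names and toy PSD values are hypothetical.

```python
def spectral_weights(psd_noise, psd_v, mu=1.0, g_min=0.1):
    """Assumed Wiener-type weights per frequency bin: max(1 - mu * S_nn / S_vv, g_min)."""
    weights = []
    for s_nn, s_vv in zip(psd_noise, psd_v):
        # Fall back to the spectral floor where the reference PSD is zero.
        gain = 1.0 - mu * s_nn / s_vv if s_vv > 0.0 else g_min
        weights.append(max(gain, g_min))
    return weights


def apply_weights(weights, spectrum):
    """Scale each bin of a (possibly complex) spectrum by its real-valued weight."""
    return [g * x for g, x in zip(weights, spectrum)]


if __name__ == "__main__":
    psd_noise = [1.0, 4.0, 0.5]  # toy noise PSD estimate per bin
    psd_v = [4.0, 4.0, 5.0]      # toy PSD of the filtered microphone signal per bin
    g = spectral_weights(psd_noise, psd_v, mu=1.0, g_min=0.1)
    print(g)
    print(apply_weights(g, [1 + 1j, 2 - 1j, 0.5j]))
```

A larger μ subtracts more of the noise estimate and thus trades stronger noise reduction for more speech distortion, while the floor g_min limits the maximum attenuation in any bin.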
The spectrum of the enhanced signal at the p th output channel is thus given by
5.2 Improved noise estimate by assuming an ideal diffuse noise field
The obtained noise estimate $\widehat{n}$ is biased relative to the original noise components as (1) the noise estimate $\widehat{n}$ is spectrally shaped by the BSS filters and (2) $\widehat{n}$ is a sum of all the filtered interference and noise components. A bias correction function is proposed based on an assumed coherence [21, 56]. If the noise field is approximated as spherically isotropic [49], the theoretical noise coherence function reads:
The noise estimate is corrected with the noise coherence function [21]:
where ℜ denotes the real part of a complex value. Note that Γ_{diffuse} and $\widehat{S}^{\prime}_{\widehat{n}\widehat{n}}$ are frequency-dependent. The spectral weights are then calculated with the corrected noise estimate:
This correction function was discussed in detail in [21]. Other possible correction functions could be based on coherence measurements during target inactivity [54, 58] or a method combined with minimum statistics [59]. However, a more detailed discussion of the possible correction functions and choices of spectral enhancement filters is outside the scope of this paper.
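For orientation, the well-known coherence of a spherically isotropic noise field between two omnidirectional microphones, sin(2πf·d_mic/c)/(2πf·d_mic/c), can be evaluated as follows. The speed of sound of 343 m/s and the function name are assumptions, and the sketch covers only the coherence model, not the bias correction itself.

```python
import math


def diffuse_coherence(freq_hz, d_mic, c=343.0):
    """Coherence of an ideal spherically isotropic noise field between two microphones."""
    x = 2.0 * math.pi * freq_hz * d_mic / c
    return 1.0 if x == 0.0 else math.sin(x) / x


if __name__ == "__main__":
    for d_mic in (0.06, 0.115):
        values = [diffuse_coherence(f, d_mic) for f in (0.0, 500.0, 1000.0, 2000.0, 4000.0)]
        print(f"d_mic = {100 * d_mic:.1f} cm:", [round(v, 2) for v in values])
```

The coherence is close to 1 at low frequencies and first crosses zero at f = c/(2·d_mic), so the diffuse noise components at the two microphones are strongly correlated in the low-frequency range, more so for the smaller spacing.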
5.3 Experimental results
In this section, the performance of the proposed scheme with two spectral enhancement filters ((31) and (35)) is evaluated in terms of signal-to-interference ratio improvement (SIR_{gain}) and speech distortion (SD) for various scenarios under various testing conditions. Besides, we show the performance of the proposed scheme based on the noise estimate provided by the ideal RTF-based BM and the DSB for comparison.
5.3.1 Experimental setup
Two different rooms and two source-array distances are considered for evaluating the BMs. The scenarios shown in Figure 16b are considered for evaluating the performance of the speech enhancement scheme. All test signals are the same as those used in the experiments for evaluating the performance of the BMs.
The same algorithm of Directional BSS and the same parameters are used as in Section 4 (see (22) and (23)). The frequency-domain Wiener filter is implemented with a polyphase filter bank [60] using a prototype FIR filter of length 1024, with 512 complex-valued subbands and a downsampling rate of 128.
5.3.2 Performance measures
The performance of the proposed scheme is evaluated in terms of SIR_{gain} and SD using the following definitions:
where x_{s,p} and z_{s,p} denote the target speech components at the p th input and the p th output of the proposed scheme, respectively; ${\sigma}_{{x}_{s,p}}^{2}$ and ${\sigma}_{{z}_{s,p}}^{2}$ denote the (long-term) signal power of the target speech components at the p th input and the p th output; whereas ${\sigma}_{{x}_{n,p}}^{2}$ and ${\sigma}_{{z}_{n,p}}^{2}$ denote the (long-term) signal power of all noise and interference components at the p th input and p th output, respectively. τ_{g} refers to the overall signal delay caused by the filter bank.
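The two measures can be sketched from the quantities listed above. The sketch assumes that SIR_{gain} is the output SIR minus the input SIR in dB and that SD is the power of the difference between the output target component and the input target component delayed by τ_{g}, normalized by the power of the delayed input target component, in dB. These forms are assumptions consistent with the listed symbols rather than the exact original definitions, and the function names are hypothetical.

```python
import math


def power(signal):
    """Long-term average power of a real-valued signal."""
    return sum(v * v for v in signal) / len(signal)


def sir_gain_db(x_s, x_n, z_s, z_n):
    """Assumed form: output SIR minus input SIR, both in dB."""
    sir_in = 10.0 * math.log10(power(x_s) / power(x_n))
    sir_out = 10.0 * math.log10(power(z_s) / power(z_n))
    return sir_out - sir_in


def speech_distortion_db(x_s, z_s, delay):
    """Assumed form: power of z_s[k] - x_s[k - delay] relative to the power of x_s[k - delay]."""
    reference = x_s[:len(x_s) - delay]  # input target, aligned to the delayed output
    output = z_s[delay:]
    error = [o - r for o, r in zip(output, reference)]
    # Note: a distortion-free output (zero error) has no finite value in dB.
    return 10.0 * math.log10(power(error) / power(reference))


if __name__ == "__main__":
    x_s = [math.sin(0.2 * k) for k in range(400)]
    x_n = [math.cos(0.7 * k) for k in range(400)]
    z_s = [0.0, 0.0] + [0.9 * v for v in x_s[:-2]]  # delayed by 2 samples, slightly attenuated
    z_n = [0.1 * v for v in x_n]                    # noise attenuated by 20 dB
    print(f"SIR gain: {sir_gain_db(x_s, x_n, z_s, z_n):.1f} dB")
    print(f"SD: {speech_distortion_db(x_s, z_s, delay=2):.1f} dB")
```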
5.3.3 Performance of the proposed BSE scheme
To establish a reference for the two-stage BSE methods, we first consider a 2×2-channel time-domain ICA algorithm as an uninformed BSE system, where we identify the target signal in one of the output channels. The SIR improvement for the various scenarios is shown in Figure 17. The SD is not shown as it is meaningless for BSS due to the known filtering ambiguity of optimum BSS solutions [31]. As a generic ICA algorithm is designed only for the determined case, this algorithm can achieve high separation performance only in scenario 1. For scenario 2, it has no determined solution. The three sources are separated into two groups, but the grouping is unpredictable, as can be seen from the results: e.g., for the testing conditions [rooms (A,B), 1 m], the target source is extracted alone, while for the other testing conditions, the target source is separated together with one interfering source as one group. For scenarios such as 3 and 4, where the diffuse noise is active, generic ICA is usually not capable of separating the point source from the diffuse noise. A corresponding analysis can be found in [28]. These results document that the generic ICA algorithm cannot be expected to extract the target source in underdetermined scenarios.
The SIR improvement and speech distortion of the speech enhancement scheme with two spectral filters $\underline{g}$ (31) and $\underline{g}^{\prime}$ (35) are shown in Figure 18a,b, respectively. The upper plot and the lower plot show the results for the spectral filter $\underline{g}$ and the improved spectral filter $\underline{g}^{\prime}$, respectively. It should be noted that for the four considered scenarios, the parameters of the spectral filters are optimized with respect to the best perceptual quality as judged by informal listening tests. For all scenarios, the proposed BSE scheme with the two spectral filters can achieve a good noise reduction performance (above 6 dB) and maintain a very low speech distortion (lower than −11 dB). For scenarios 1 and 2, where only point sources are active, the performance of the spectral filter $\underline{g}^{\prime}$ based on the incorrect assumption of ideal diffuse noise is not improved compared to the scheme without the bias correction. For scenario 3 and scenario 4, where additional diffuse background noise is present, i.e., the assumption of ideal diffuse noise is matched to a certain degree, a significant improvement for the spectral filter $\underline{g}^{\prime}$ can be observed.
For comparison, the performance of the proposed scheme based on the noise estimate produced by the ideal RTF-based BM and the DSB is shown in Figure 19a,b. Note that here the speech enhancement scheme with only the spectral filter $\underline{g}$ (31) is evaluated.
Although the ideal RTF-based BM can perfectly suppress the target source, the produced noise estimate is still biased relative to the true noise components contained in the microphone signal. Therefore, we cannot expect a perfect noise reduction for the speech enhancement scheme with the biased noise estimate. However, the noise reduction performance achieved with this noise estimate can be regarded as the upper limit for the scheme with the noise estimate produced by Directional BSS without applying a bias correction. It can be seen that for all scenarios, with Directional BSS as BM, the noise reduction performance is almost the same as the performance achieved with the ideal RTF-based BM (1 dB less), but clearly superior to the performance achieved with a DSB. Obviously, for the latter, the large residual of the target component reduces the SIR improvement and leads to a significant distortion of the target source.
6 Conclusions
In our earlier work, Directional BSS was proposed as a BM for source extraction. The concept combines BSS with a geometric constraint to cope with the underdetermined scenario such that a meaningful and joint estimate of all interfering speech signals and diffuse background noise can be obtained using only two microphones. In this paper, we show that Directional BSS converges to an RTF-based ideal BM. Experimental results analyzing the system behavior and the blocking performance of Directional BSS under various acoustical conditions were presented. These results verify that Directional BSS can successfully estimate the RTF in underdetermined nonstationary noise scenarios without requiring source activity information. The target suppression performance of Directional BSS is clearly superior to that of a common DSB. Simulation results also confirm that Directional BSS is very robust against localization errors. Additionally, we evaluate a source extraction scheme which combines Directional BSS and Wiener-type spectral enhancement filters. It is shown that the noise reduction performance achieved by this scheme using Directional BSS is very close to the performance achieved by the proposed scheme using an ideal RTF-based BM. Therefore, for exploiting Directional BSS as a BM, no source activity information and no information on the number of active sources is necessary. The only required information for this informed algorithm is some coarse DoA information of the target source.
References
 1.
Hyvärinen A, Oja E: Independent Component Analysis. Wiley, New York; 2001.
 2.
Makino S, Lee TW, Sawada H: Blind Speech Separation. Springer, Berlin; 2007.
 3.
Yilmaz Ö, Rickard S: Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process 2004, 52: 1830-1847. 10.1109/TSP.2004.828896
 4.
Araki S, Sawada H, Mukai R, Makino S: A novel blind source separation method with observation vector clustering. In Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC). Eindhoven; September 2005:117-120.
 5.
Duong NK, Vincent E, Gribonval R: Underdetermined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process 2010, 18: 1830-1840.
 6.
Ozerov A, Fevotte C: Multichannel nonnegative matrix factorization in convolutive mixtures with application to blind audio source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Taipei; March 2009:550-563.
 7.
Sawada H, Kameoka H, Araki S, Ueda N: New formulations and efficient algorithms for multichannel NMF. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New York; October 2011:153-156.
 8.
Van den Bogaert T, Doclo S, Wouters J, Moonen M: Speech enhancement with multichannel Wiener filter techniques in multimicrophone binaural hearing aids. J. Acoust. Soc. Am 2009, 125: 360-371. 10.1121/1.3023069
 9.
Cornelis B, Moonen M: A VAD-robust multichannel Wiener filter algorithm for noise reduction in hearing aids. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Prague; May 2011:281-284.
 10.
Frost OL: An algorithm for linearly constrained adaptive array processing. Proc. IEEE 1972, 60: 926-935.
 11.
Van Trees HL: Optimum Array Processing (Detection, Estimation, and Modulation Theory, Part IV), 1st edn. Wiley, New York; 2002.
 12.
Griffiths LJ, Jim CW: An alternative approach to linear constrained adaptive beamforming. IEEE Trans. Speech Audio Process 1982, 30: 27-34.
 13.
Zelinski R: A microphone array with adaptive postfiltering for noise reduction in reverberant rooms. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Las Vegas; March 2008:2578-2581.
 14.
McCowan I, Bourlard H: Microphone array postfilter based on noise field coherence. IEEE Trans. Speech Audio Process 2003, 11: 709-716. 10.1109/TSA.2003.818212
 15.
Ito N, Shimizu H, Ono N, Sagayama S: Diffuse noise suppression using crystal-shaped microphone arrays. IEEE Trans. Audio Speech Lang. Process 2011, 19: 2101-2110.
 16.
Hendriks R, Gerkmann T: Noise correlation matrix estimation for multi-microphone speech enhancement. IEEE Trans. Audio Speech Lang. Process 2012, 20: 223-233.
 17.
Taseska M, Habets EAP: MMSE-based blind source extraction in diffuse noise fields using a complex coherence-based a priori SAP estimator. In Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC). Aachen, Germany; September 2012.
 18.
Taseska M, Habets EAP: MMSE-based source extraction using position-based posterior probabilities. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Vancouver; May 2013.
 19.
Parra L, Alvino C: Geometric source separation: merging convolutive source separation with geometric beamforming. IEEE Trans. Speech Audio Process 2002, 10: 352-362. 10.1109/TSA.2002.803443
 20.
Zheng Y, Reindl K, Kellermann W: BSS for improved interference estimation for blind speech signal extraction with two microphones. In Proceedings of 3rd International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). Dutch Antilles, Aruba; December 2009:253-256.
 21.
Maas R, Schwarz A, Zheng Y, Reindl K, Meier S, Sehr A, Kellermann W: A two-channel acoustic front-end for robust automatic speech recognition in noisy and reverberant environments. In Proceedings of the International Workshop on Machine Listening in Multisource Environments (CHiME). Florence; September 2011:41-46.
 22.
Van Veen BD, Buckley KM: Beamforming: a versatile approach to spatial filtering. IEEE ASSP Mag 1988, 5: 4-24.
 23.
Hoshuyama O, Sugiyama A: A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Atlanta; May 1996:925-928.
 24.
Herbordt W, Kellermann W: Frequency-domain integration of acoustic echo cancellation and a generalized sidelobe canceller with improved robustness. Eur. Trans. Telecommun 2002, 13: 123-132. 10.1002/ett.4460130207
 25.
Gannot S, Burshtein D, Weinstein E: Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. Signal Process 2001, 49: 1614-1626. 10.1109/78.934132
 26.
Cohen I: Relative transfer function identification using speech signals. IEEE Trans. Speech Audio Process 2004, 12(5):451-459. 10.1109/TSA.2004.832975
 27.
Golan S, Gannot S, Cohen I: Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals. IEEE Trans. Audio Speech Lang. Process 2009, 17: 1071-1086.
 28.
Takahashi Y, Takatani T, Osako K, Saruwatari H, Shikano K: Blind spatial subtraction array for speech enhancement in noisy environment. IEEE Trans. Audio Speech Lang. Process 2009, 17: 650-664.
 29.
Smaragdis P: Blind separation of convolved mixtures in the frequency domain. Neurocomputing 1998, 22: 21-34. 10.1016/S0925-2312(98)00047-2
 30.
Buchner H, Aichner R, Kellermann W: The TRINICON framework for adaptive MIMO signal processing with focus on the generic Sylvester constraint. In Proceedings of the ITG Conference on Speech Communication. Aachen; October 2008.
 31.
Buchner H, Aichner R, Kellermann W: Blind source separation for convolutive mixtures exploiting nongaussianity, nonwhiteness, and nonstationarity. In Proceedings of International Workshop on Acoustic Echo and Noise Control (IWAENC). Kyoto; September 2003:275-278.
 32.
Aichner R, Buchner H, Yan F, Kellermann W: A real-time blind source separation scheme and its application to reverberant and noisy acoustic environments. Signal Process 2006, 86: 1260-1277. 10.1016/j.sigpro.2005.06.022
 33.
Buchner H, Aichner R, Kellermann W: Relation between blind system identification and convolutive blind source separation. In Proceedings of Workshop for Hands-Free Speech Communication and Microphone Arrays (HSCMA). Piscataway; March 2005.
 34.
Araki S, Makino S, Hinamoto Y, Mukai R, Nishikawa T, Saruwatari H: Equivalence between frequency-domain blind source separation and frequency-domain adaptive beamforming for convolutive mixtures. EURASIP J. Adv. Signal Process 2003, 2003: 1157-1166. 10.1155/S1110865703305074
 35.
Gerven SV, Compernolle DV: Signal separation by symmetric adaptive decorrelation: stability, convergence, and uniqueness. IEEE Trans. Signal Process 1995, 43: 1602-1612. 10.1109/78.398721
 36.
Weinstein E, Feder M, Oppenheim A: Multichannel signal separation by decorrelation. IEEE Trans. Speech Audio Process 1993, 1: 405-413. 10.1109/89.242486
 37.
Zheng Y, Lombard A, Kellermann W: An improved combination of directional BSS and a source localizer for robust source separation in rapidly time-varying acoustic scenarios. In Proceedings of the Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA). Edinburgh; May 2011.
 38.
Hjørungnes A: Complex-Valued Matrix Derivatives: With Applications in Signal Processing and Communications. Cambridge University Press, Cambridge; 2011.
 39.
Knapp C, Carter G: The generalized correlation method for estimation of time delay. IEEE Trans. Acoustics Speech Signal Process. ASSP 1976, 24: 320-327. 10.1109/TASSP.1976.1162830
 40.
Lombard A, Rosenkranz T, Buchner H, Kellermann W: Multidimensional localization of multiple sound sources using averaged directivity patterns of blind source separation systems. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Taipei; April 2009:233-236.
 41.
Warsitz E, Krueger A, Haeb-Umbach R: Speech enhancement with a new generalized eigenvector blocking matrix for application in a generalized sidelobe canceller. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Las Vegas; March 2008:73-76.
 42.
Krueger A, Warsitz E, Haeb-Umbach R: Speech enhancement with a GSC-like structure employing eigenvector-based transfer function ratios estimation. IEEE Trans. Audio Speech Lang. Process 2011, 19: 206-219.
 43.
Gerkmann T, Hendriks R: Noise power estimation based on the probability of speech presence. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz; October 2011:145-148.
 44.
Martin R: Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process 2001, 9(5):504-512. 10.1109/89.928915
 45.
Cohen I: Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. IEEE Trans. Speech Audio Process 2003, 11(5):466-475. 10.1109/TSA.2003.811544
 46.
Ozerov A, Vincent E: Using the FASST source separation toolbox for noise robust speech recognition. In Proceedings of International Workshop on Machine Listening in Multisource Environments (CHiME 2011). Florence; September 2011.
 47.
Moritz N, Schädler M, Adiloglu K, Meyer B, Jürgens T, Gerkmann T, Goetze S: Noise robust distant automatic speech recognition utilizing NMF based source separation and auditory feature extraction. In Proceedings of The 2nd International Workshop on Machine Listening in Multisource Environments (CHiME 2013). Vancouver; June 2013:11.
 48.
Jeon K, Park N, Kim H, Choi M, Hwang K: Mechanical noise suppression based on nonnegative matrix factorization and multiband spectral subtraction for digital cameras. IEEE Trans. Consum. Electron 2013, 59(2):296-302.
 49.
Kuttruff H: Room Acoustics. Taylor & Francis, London; 2000.
 50.
Habets EAP, Gannot S: Generating sensor signals in isotropic noise fields. J. Acoustical Soc. Am 2007, 122: 3464-3470. 10.1121/1.2799929
 51.
Martin R: Kurzfassung Freisprecheinrichtung mit mehrkanaliger Echokompensation und Störgeräuschreduktion. RWTH Aachen University, PhD thesis; 1995.
 52.
Talmon R, Cohen I, Gannot S: Relative transfer function identification using convolutive transfer function approximation. IEEE Trans. Audio Speech Lang. Process 2009, 17: 546-555.
 53.
Reindl K, Barfuss H, Gannot S, Kellermann W, Markovich-Golan S: Geometrically constrained TRINICON-based relative transfer function estimation in underdetermined scenarios. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New York; October 2013.
 54.
Kellermann W, Reindl K, Zheng Y: Method and acoustic signal processing system for interference and noise suppression in binaural microphone configurations. US Patent 2011/0307249 A1 2011.
 55.
Reindl K, Zheng Y, Kellermann W: Speech enhancement for binaural hearing aids based on blind source separation. In Proceedings of 4th International Symposium on Communications, Control, and Signal Processing (ISCCSP). Limassol; March 2010.
 56.
Reindl K, Zheng Y, Schwarz A, Meier S, Maas R, Sehr A, Kellermann W: A stereophonic acoustic signal extraction scheme for noisy and reverberant environments. Comput. Speech Lang. (CSL) 2012, 27: 726-745.
 57.
Vary P, Martin R: Digital Speech Transmission: Enhancement, Coding and Error Concealment. Wiley, Chichester; 2006.
 58.
Kim K, Jeong S, Jeong J, Oh K, Kim J: Dual channel noise reduction method using phase difference-based spectral amplitude estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Dallas; March 2010:217-220.
 59.
Jeong S, Kim K, Jeong J, Oh K, Kim J: Adaptive noise power spectrum estimation for compact dual channel speech enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Dallas; March 2010:1630-1633.
 60.
Vaidyanathan PP: Multirate Systems and Filter Banks. Prentice-Hall, Upper Saddle River; 1993.
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Zheng, Y., Reindl, K. & Kellermann, W. Analysis of dual-channel ICA-based blocking matrix for improved noise estimation. EURASIP J. Adv. Signal Process. 2014, 26 (2014) doi:10.1186/1687-6180-2014-26
Keywords
 Blocking Matrix
 Blind Source Separation
 Noise Estimate
 Target Source
 Minimum Variance Distortionless Response