Analysis of dual-channel ICA-based blocking matrix for improved noise estimation
 Yuanhang Zheng^{1},
 Klaus Reindl^{1} and
 Walter Kellermann^{1}
https://doi.org/10.1186/1687-6180-2014-26
© Zheng et al.; licensee Springer. 2014
Received: 17 June 2013
Accepted: 10 February 2014
Published: 4 March 2014
Abstract
For speech enhancement or blind signal extraction (BSE), estimating interference and noise characteristics is decisive for the achievable performance. For multichannel approaches using multiple microphone signals, a BSE scheme combining a blocking matrix (BM) and spectral enhancement filters has been proposed in numerous publications. For such schemes, the BM provides a noise estimate by suppressing the target signal only. The estimated noise reference is then used to design spectral enhancement filters for the purpose of noise reduction. For designing the BM, ‘Directional Blind Source Separation (BSS)’ was proposed earlier. This method combines a generic BSS algorithm with a geometric constraint derived from prior information on the target source position to obtain an estimate for all interfering point sources and diffuse background noise. In this paper, we provide a theoretical analysis to show that Directional BSS converges to a relative transfer function (RTF)-based BM. The behavior of this informed signal separation scheme is analyzed, and the blocking performance of Directional BSS under various acoustical conditions is evaluated. The robustness of Directional BSS regarding the localization error for the target source position is verified by experiments. Finally, a BSE scheme combining Directional BSS and Wiener-type spectral enhancement filters is described and evaluated.
1 Introduction
Blind signal extraction (BSE), which aims at extracting one source signal from a mixture of an unknown number of acoustic sources in noisy environments, is a generic task in acoustic signal processing. It has a wide range of applications in many fields: as popular examples, hands-free interfaces for acoustic communications and human-machine interaction offer many challenging and relevant application scenarios, such as teleconferencing, interactive television, humanoid robots, and gaming. Moreover, acoustic signal extraction techniques are also highly relevant for assistive devices, such as hearing aids.
If multiple microphones are available, data-dependent multichannel approaches for signal extraction can be classified into unsupervised and supervised approaches. The class of unsupervised methods does not require prior knowledge on the spatial distribution of sources and sensors. The lack of prior knowledge is compensated by exploiting fundamental signal characteristics. Conventional unsupervised signal extraction approaches are, e.g., independent component analysis (ICA)-based [1, 2] or sparseness-based blind source separation (BSS) algorithms [3, 4]. However, conventional ICA-based approaches cannot be used for underdetermined cases, where the number of sensors is less than the number of sources, and sparseness-based methods are highly dependent on the sparsity of the source signals. Recently, model-based multichannel approaches have gained a lot of attention. These are, e.g., approaches based on a spatial covariance model [5] or multichannel non-negative matrix factorization (NMF) methods based on modeling complex Gaussian distributions [6, 7]. As opposed to [1–4], they do not solely rely on the independence or the sparsity of the underlying signals and can be used for underdetermined source separation.
Unlike unsupervised methods, the class of supervised methods needs reference information. Typical supervised signal extraction approaches are, e.g., multichannel Wiener filtering (MWF) [8, 9] or beamforming approaches, such as linearly constrained minimum variance (LCMV) beamformer [10]. MWF approaches are based on minimum mean square error (MMSE) estimators requiring noise and interference statistics as references. The LCMV beamformer requires reference signal(s) represented by linear constraints in order to suppress interfering sources and to preserve target signal(s) from known directions [11]. As an alternative form of the LCMV beamformer, the generalized sidelobe canceler (GSC) was proposed in [12], which converts the constrained optimization problem into an unconstrained problem.
Under realistic acoustic conditions, prior information is often exploited for practical realizations of supervised and unsupervised signal extraction approaches. This leads to the class of informed signal processing algorithms, where relevant information of the underlying conditions is exploited to realize the signal extraction algorithms or to render these algorithms more robust and reliable for practical conditions. Prior knowledge, which can be spatial information in terms of the direction of arrival (DoA) of source signals, coherence or diffuseness of the sound field, etc., may be given or estimated from the acquired sensor data. An overview of the relevant work belonging to this class is given in the following.
To realize the MWF, an estimate of the second-order statistics (SOS) of the noise signals is required. Based on the assumption of a diffuse noise field, several methods have been derived for estimating the SOS of the noise components in terms of the auto-power spectral density (PSD) [13–15] or the cross PSDs between all channels for both the target source(s) and the noise and interference components [16]. Furthermore, it was recently proposed to exploit the direct-to-diffuse ratio (DDR) to realize the MWF for stationary noise and babble noise conditions [17]. It was also suggested to exploit the position information to estimate the cross PSDs of directional speech interferences [18]. For unsupervised algorithms, prior spatial information such as information about the source positions or the sensor constellation is often incorporated to improve the robustness. Model-based multichannel approaches [6, 7] can incorporate the directional information by initialization of a part of the spatial model. Parra and Alvino [19] proposed to combine an ICA algorithm with geometric constraints in order to improve the separation performance, where BSS was regarded as a set of beamformers whose response is constrained to a set of DoAs for recovering all sources from the mixture. Inspired by [19], Directional BSS [20] was proposed to serve as a blocking matrix (BM) by using a different constraint for the opposite purpose: this constraint essentially forces a spatial null towards a certain direction in order to suppress the target source and to preserve the interfering and noise components. The precondition, not only for Directional BSS but also for Parra’s method, is that the DoA information on the target source(s) must be given.
Furthermore, based on the noise estimate produced by Directional BSS, a two-unit source extraction/noise reduction scheme combining a BM and a noise reduction unit was proposed in [21], where the spectral weights in the noise reduction stage are designed based on a diffuse noise field assumption. In this paper, we focus on the discussion of Directional BSS operating as a BM.
Originally in [12], the BM was designed for time-invariant free-field environments and rejected the source signal from one direction only, requiring precise source location information. This BM can be regarded as a minimum variance distortionless response (MVDR) BM, as the MVDR beamformer imposes the distortionless constraint only for the desired direction. For the dual-channel case, the conventional MVDR BM is given by the delay-and-subtract beamformer (DSB) and can only suppress the direct path of the target source. Theoretically, the conventional LCMV BM can suppress the direct path and reflections by formulating the corresponding constraints in the BM if perfect knowledge of the angle of arrival for each reflection is given [22]. However, the conventional LCMV/MVDR BM will likely lead to target signal leakage as it is conceived for time-invariant scenarios, and any movement of the target source will lead to a steering error relative to the true DoA for the target signal and its reflections. To improve the robustness against the steering error, an adaptive BM was proposed in [23, 24], which needs an adaptation control requiring source activity information. In [25], the relative transfer function (RTF)-based BM for LCMV/MVDR beamforming was proposed. The RTF-based BM can perfectly suppress the target signal if the RTFs are given. However, estimating RTFs usually requires estimation of the source activity or a double-talk detector, as noise-only frames or time segments where both the transfer functions (TFs) and the noise signals are assumed to be stationary need to be available for RTF estimation [25–27].
More recently, ICA-based BSS algorithms were proposed to realize a BM [20, 28]. The approach presented in [28] is very efficient in noise estimation but can only be used for overdetermined/determined scenarios (i.e., the number of sensors is larger than or equal to the number of sources), as [28] is a generic ICA-based BSS algorithm. In [20], Directional BSS was proposed as a BM for noise estimation (here, the noise comprises interfering sources and diffuse background noise). This approach can be applied in both determined and underdetermined scenarios. Unlike for beamforming approaches, correlated components arriving from other directions, i.e., reflections and reverberation, will also be suppressed to the greatest extent possible by Directional BSS. This concept can deal with underdetermined scenarios such that a meaningful instantaneous estimate for all undesired signals, comprising interfering speech signals and diffuse background noise, can be obtained using only two microphones and regardless of the noise statistics. Note that for applying the directional constraint, the directional information on the target source must be given or estimated by a source localizer. Even with a source localizer, a predefined angular range of the target source must be given. This range was set to −20° to 20° in front of the microphone array [20]. The algorithm of Directional BSS was introduced and its efficiency with respect to the blocking performance was shown in [20].
In this paper, we provide an in-depth analysis of the heuristically motivated BM in [20] and provide new insights with respect to several decisive aspects: (1) the relation of the ICA-based BM to other BMs, (2) the blocking performance if the target source arrives from directions which are different from the broadside direction, (3) the robustness against localization errors, and (4) the BSE/speech enhancement performance when using the noise estimate produced by Directional BSS for Wiener-type spectral enhancement. Additionally, a BSE scheme combining Directional BSS and spectral enhancement filters will be evaluated under various acoustical conditions. Therefore, the main contributions of this paper compared to our earlier work are the following: First, we show by a theoretical analysis and by experimental results that Directional BSS converges to an RTF-based BM. In addition, the performance of the proposed method is analyzed for the first time regarding some practically highly relevant aspects, e.g., the blocking ability for sources impinging from arbitrary directions and the noise reduction performance of the applied noise reduction scheme.
The paper is organized as follows: In Section 2, the generic BSS algorithm is reviewed. In Section 3, we provide a theoretical analysis to show that Directional BSS converges to an RTF-based BM and describe the algorithm of Directional BSS. Moreover, the relation/difference of Directional BSS to other conventional/state-of-the-art BMs is discussed. Furthermore, in Section 4, experimental results with respect to the blocking performance and the robustness against localization errors in various acoustical scenarios are presented. Finally, a BSE scheme combining Directional BSS and Wiener-type spectral enhancement filters is presented and evaluated in Section 5. Note that in this paper, we restrict our consideration to two-channel cases.
2 Determined blind source separation: generic ICA-based BSS algorithm
where * represents convolution, h_{mp}(k), m∈{1,2}, denotes the finite-length acoustic impulse response from the m-th point source to the p-th microphone, and k is the discrete time index.
where w_{pq}(k) denotes the demixing filter from the p-th microphone to the q-th output channel.
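The mixing and demixing models described around (1) and (2) can be illustrated with a minimal Python sketch. It assumes NumPy and a noise-free 2×2 mixture; the helper names `mix` and `demix` are hypothetical, and the random impulse responses are placeholders rather than measured room impulse responses.

```python
import numpy as np

def mix(sources, h):
    """Microphone signals x_p = sum_m h_mp * s_m.

    h[m][p] is the impulse response from source m to microphone p
    (all responses are assumed to have the same length).
    """
    n_src, n_mic = len(h), len(h[0])
    return [sum(np.convolve(h[m][p], sources[m]) for m in range(n_src))
            for p in range(n_mic)]

def demix(mics, w):
    """Output signals y_q = sum_p w_pq * x_p.

    w[p][q] is the demixing filter from microphone p to output q.
    """
    n_mic, n_out = len(w), len(w[0])
    return [sum(np.convolve(w[p][q], mics[p]) for p in range(n_mic))
            for q in range(n_out)]

rng = np.random.default_rng(0)
s = [rng.standard_normal(1000) for _ in range(2)]                 # two toy sources
h = [[rng.standard_normal(8) for _ in range(2)] for _ in range(2)]  # toy mixing filters
x = mix(s, h)

# Trivial demixing system (identity), used only to show the indexing convention.
one, zero = np.array([1.0]), np.array([0.0])
w_identity = [[one, zero], [zero, one]]
y = demix(x, w_identity)
```

With the identity demixing system, each output simply reproduces the corresponding microphone signal; a BSS algorithm would instead adapt the filters w_{pq} so that the outputs become mutually independent.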
The various criteria used for identifying w_{pq} in (2) (see e.g., [1, 2, 29]) are essentially based on the assumption that sources are statistically independent. In this paper, we use triple-N independent component analysis for convolutive mixtures (TRINICON) [30] for BSS, where the mutual information between the output channels $\mathbf{y}=[\mathbf{y}_{1}^{T}(k),\mathbf{y}_{2}^{T}(k)]^{T}$ should be minimized. As the algorithm is derived for block processing of convolutive mixtures, for each output y_{q}(k), a sequence of D output samples corresponding to D successive time lags is taken into account.
where $\widehat{\mathbf{E}}\{\cdot\}$ is the estimate of the statistical expectation, with ensemble averaging being replaced by temporal averaging over N blocks assuming ergodicity within the individual blocks. $\widehat{p}_{\mathbf{y},PD}$ is an estimate of the joint probability density function (pdf) of dimension PD over all P (here, P = 2) output channels, and $\widehat{p}_{\mathbf{y}_{q},D}$ is the estimated multivariate pdf for channel q of dimension D. Matrix W captures all the impulse response coefficients of the demixing filters, with a detailed description of its structure given in [31, 32]. Minimizing J_{BSS}(W) corresponds to minimizing the Kullback-Leibler divergence (KLD) between $\widehat{p}_{\mathbf{y},PD}(\mathbf{y})$ and $\prod_{q=1}^{P}\widehat{p}_{\mathbf{y}_{q},D}(\mathbf{y}_{q})$, which leads to maximization of the statistical independence of the output vectors y_{q}.
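For intuition, the KLD criterion has a closed form in the special case of zero-mean Gaussian densities, where it reduces to a comparison of the log-determinant of the output correlation matrix with the log-determinants of its per-channel diagonal blocks. The following Python sketch illustrates this special case only; the function name `gaussian_independence_cost` and the toy matrices are assumptions made for illustration, and the code is not the TRINICON implementation.

```python
import numpy as np

def gaussian_independence_cost(R, D):
    """KLD between N(0, R) and the product of its per-channel marginals.

    R is the (P*D) x (P*D) correlation matrix of the stacked outputs,
    D is the number of time lags per output channel.
    """
    P = R.shape[0] // D
    _, logdet_joint = np.linalg.slogdet(R)
    logdet_marginals = 0.0
    for q in range(P):
        blk = slice(q * D, (q + 1) * D)
        logdet_marginals += np.linalg.slogdet(R[blk, blk])[1]
    return 0.5 * (logdet_marginals - logdet_joint)

rho = 0.5
R_dependent = np.array([[1.0, rho], [rho, 1.0]])   # correlated outputs, D = 1
R_independent = np.eye(2)                          # uncorrelated outputs

cost_dependent = gaussian_independence_cost(R_dependent, D=1)
cost_independent = gaussian_independence_cost(R_independent, D=1)
```

For the 2×2 example, the cost equals −0.5·log(1 − ρ²) and vanishes only when the cross-correlation between the outputs is zero, which is the behavior the separation criterion exploits.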
3 Directional blind source separation as a blocking matrix
In this section, we first discuss the relation of Directional BSS to a conventional RTF-based BM in Subsection 3.1. The Directional BSS algorithm is described in Subsection 3.2 before comparing it to alternative approaches in Subsection 3.3.
3.1 From system identification to RTF-based blocking matrix
As a precondition for identifying this solution, H_{11}(z) and H_{12}(z) must not have common zeros, and the filter lengths must equal the lengths of the room impulse responses. Obviously, the optimum filters can only be determined up to a scaling factor α.
where underlined characters denote frequency-domain representations. The normalized frequency Ω is given as $\frac{2\pi f}{f_{\mathrm{s}}}$, where f_{s} denotes the sampling frequency. The ratio of the two frequency responses $\frac{\underline{h}_{12}(\Omega)}{\underline{h}_{11}(\Omega)}$ is known as the RTF or the TF ratio.
which is exactly the form of the RTF-based BM proposed in [25].
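The blocking property discussed in this subsection can be checked numerically. The sketch below assumes a single target source observed through two toy impulse responses `h11` and `h12` (random placeholders, not measured responses) and uses the filter pair w_{11} = α·h_{12}, w_{21} = −α·h_{11} as one solution that is consistent with the scaling ambiguity α mentioned above; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.standard_normal(4000)        # target source signal
h11 = rng.standard_normal(16)        # toy response: source 1 -> microphone 1
h12 = rng.standard_normal(16)        # toy response: source 1 -> microphone 2

x1 = np.convolve(h11, s)
x2 = np.convolve(h12, s)

alpha = 0.5                          # arbitrary scaling factor
w11 = alpha * h12
w21 = -alpha * h11

# Output of the blocking channel: the target component cancels.
y1 = np.convolve(w11, x1) + np.convolve(w21, x2)
leak = np.max(np.abs(y1))

# Frequency-domain view: subtracting the RTF-weighted channel 1 from channel 2.
H11 = np.fft.rfft(h11, 64)
H12 = np.fft.rfft(h12, 64)
rtf = H12 / H11
residual = H12 - rtf * H11
```

Both `leak` and `residual` are zero up to rounding errors, independent of the choice of α, which illustrates why the solution is only determined up to a scaling factor.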
which is the same as in (4) for the system identification in a SIMO system. As BSS has no determined solution in underdetermined scenarios, the problem is how to force BSS to suppress the target source only and preserve the other sources to form a joint noise estimate. For this purpose, we combine the generic BSS with a geometric constraint to force a spatial null towards the direction of the target source. We denote the combined algorithm as ‘Directional BSS’ and analyze it in the following sections.
3.2 Algorithm
Blind source separation can be regarded as ‘blind adaptive beamforming’ (blind ABF) [34], as BSS and ABF have similar goals and a similar structure: both attempt to extract a target signal and reduce the interference by multichannel array processing as described in [35, 36]. In [34], it is shown that BSS is equivalent to a set of adaptive beamformers which form multiple null beams steered towards the directions of interfering sources and their reflections. On the other hand, there are fundamental (characteristic) differences between BSS and ABF: generic BSS usually does not require prior information on source locations and sensor constellations, while ABF requires spatial information on the locations of sources and sensors. In [19], a method was proposed to combine BSS and beamforming for achieving a better separation performance by utilizing the geometric information of sources. This kind of combination is known as geometric source separation, where the response of the BSS demixing filters is additionally constrained to a set of directions.
where c is the sound velocity. Note that both the microphone spacing d_{mic} and the angle θ relative to the array axis must be given. For simplicity, we omit the frequency variable Ω in the sequel.
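As an illustration of how the microphone spacing, the angle θ, and the sound velocity c enter a free-field steering vector, a minimal Python sketch is given below. The phase sign, the choice of the first microphone as reference, and the function name `steering_vector` are assumptions made for this example and may differ from the convention used in the text.

```python
import numpy as np

def steering_vector(freq_hz, theta_deg, d_mic, c=343.0):
    """Free-field steering vector of a two-microphone array.

    theta_deg is measured relative to the array axis, so 90 degrees is broadside.
    The first microphone serves as the phase reference.
    """
    tau = d_mic * np.cos(np.deg2rad(theta_deg)) / c   # inter-microphone delay in s
    return np.array([1.0, np.exp(-2j * np.pi * freq_hz * tau)])

d_mic = 0.06                                   # 6 cm spacing
d_broadside = steering_vector(1000.0, 90.0, d_mic)
f_half_wave = 343.0 / (2.0 * d_mic)            # spacing equals half a wavelength
d_endfire = steering_vector(f_half_wave, 0.0, d_mic)
```

For a broadside source, both elements are equal since the wavefront reaches both microphones simultaneously; for an endfire source at the frequency where the spacing equals half a wavelength, the two elements are in phase opposition.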
where $\underline{\mathbf{W}}=[\underline{\mathbf{w}}_{1},\underline{\mathbf{w}}_{2}]^{\mathrm{T}}$ is the BSS demixing matrix and $\underline{\mathbf{D}}(\Theta)=[\underline{\mathbf{d}}(\theta_{1}),\underline{\mathbf{d}}(\theta_{2})]$ contains steering vectors pointing to Θ = [θ_{1}, θ_{2}]. The 2×2 matrix C refers to the constraints.
which restricts the output channels to have a zero response for the signals arriving from the directions given in Θ, i.e., it forces each output channel to form a null beamformer steered to the source which should be blocked in this output channel.
where the weighting parameter η_{C} can be chosen to control the importance of the geometric constraint relative to the separation criterion represented by J_{BSS} (3).
As Directional BSS serves as a BM for a single desired source, only the target source needs to be suppressed. Therefore, $J_{\mathrm{C}}(\underline{\mathbf{W}})$ is modified by considering the following conditions:

The direct path for the target signal is suppressed by the penalty term analogously to null-steering beamforming, i.e., a spatial null is forced toward the direction θ of the target source.

As only the target signal needs to be suppressed, only one BSS output channel is controlled by the geometric constraint. Without loss of generality, the output channel 1 is chosen to be controlled with the penalty term $J_{\mathrm{C}}(\underline{\mathbf{W}})$ in the sequel.

In order to converge to the RTF-based BM, w_{21} is set to be a pure delay and remains unchanged during adaptation of the demixing system. Note that we could equivalently use the first channel as the reference, and in that case w_{11} is a pure delay.
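The three conditions above can be illustrated for a single frequency bin. The sketch assumes that the penalty for output channel 1 is the squared magnitude of its response towards the target steering vector, which is a plausible reading of the conditions but not necessarily the exact form of the modified penalty term; `null_penalty`, `d_vec`, and all numerical values are hypothetical.

```python
import numpy as np

def null_penalty(w1, d):
    """Squared magnitude of the response of output channel 1 towards steering vector d."""
    return float(np.abs(np.dot(w1, d)) ** 2)

def d_vec(phase):
    """Two-element steering vector for a given inter-microphone phase (in rad)."""
    return np.array([1.0, np.exp(-1j * phase)])

omega = 0.3 * np.pi                     # normalized frequency of the considered bin
delay = 512                             # hypothetical pure delay of w_21 in samples
w21 = np.exp(-1j * omega * delay)       # fixed reference filter, never adapted

d_target = d_vec(0.4)                   # steering vector towards the target
w11 = -w21 * d_target[1] / d_target[0]  # the only free coefficient in this bin
w1 = np.array([w11, w21])

p_target = null_penalty(w1, d_target)   # vanishes: spatial null towards the target
p_other = null_penalty(w1, d_vec(1.5))  # non-zero: other directions are preserved
```

Because w_{21} is kept fixed, the constraint determines only w_{11} for the direct path; the separation criterion remains free to refine w_{11} further so that reflections of the target are suppressed as well.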
where R_{yy} denotes the 2D×2D correlation matrix of the output signal vector y of length 2D, and bdiag denotes the operation of setting all off-diagonal blocks of a block matrix to zero. $\beta(\breve{i},\breve{m})$ is a weighting function normalized to $\sum_{\breve{i}=0}^{\breve{m}}\beta(\breve{i},\breve{m})=1$, allowing for online, offline, or block-online realization of the algorithm [32].
where x_{s,p} and x_{n,p} denote the target and the noise component contained in microphone p, respectively.
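The bdiag operation can be sketched in a few lines of Python. The helper below is a minimal illustration assuming a square matrix partitioned into D×D blocks; the toy matrix is arbitrary and only serves to show which entries are kept.

```python
import numpy as np

def bdiag(R, D):
    """Keep the D x D blocks on the main diagonal of R and set all other blocks to zero."""
    out = np.zeros_like(R)
    for q in range(R.shape[0] // D):
        blk = slice(q * D, (q + 1) * D)
        out[blk, blk] = R[blk, blk]
    return out

D = 3
R = np.arange(36, dtype=float).reshape(6, 6)   # toy 2D x 2D matrix
R_bd = bdiag(R, D)
cross = R - R_bd                               # only the cross-channel blocks remain
```

The difference `cross` contains exactly the cross-correlation blocks between the two output channels, i.e., the quantity that a second-order separation criterion drives towards zero.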
3.3 Comparison to alternative approaches
The original BM proposed by Griffiths and Jim [12] is constructed by subtracting pairs of time-aligned signals with respect to the target signal. For the dual-channel case, this is exactly a DSB, which is attractive for its simple structure. However, a major limitation in real acoustic scenarios is that the performance of a DSB will significantly degrade for imprecise target source position information, i.e., for steering errors. Additionally, due to reflections of the target signal impinging from directions other than the steering direction, significant signal leakage into the noise reference needs to be expected. As a possible countermeasure, an adaptive BM (ABM) with coefficient constraints was proposed in [23]. In this conventional ABM, the output of the fixed beamformer (FB, see Figure 1) is used as a reference signal for the target source and adaptively subtracted from the microphone signal. The least mean squares (LMS) algorithm is usually used for the ABM adaptation. However, the adaptation can only be carried out in time segments where only the target source is active. Therefore, a double-talk detector is necessary, which requires significant sophistication and will still be imperfect in complex acoustic scenarios. The differences between our approach and this BM are that (1) the adaptation criterion is different, which is very important for its practical relevance, and (2) a double-talk detector is not required.
The transfer-function generalized sidelobe canceler (TF-GSC) was proposed by Gannot et al. [25], where the BM is constructed based on RTFs. This approach takes the reverberant nature of the enclosure into account. The RTFs are estimated by a least squares method, and for this, two assumptions are necessary: (1) the RTFs change slowly over time compared to the time variations of the signals, which effectively precludes movements of the source, and (2) time segments are available where both the TFs and the noise signal are assumed to be stationary. In Section 3.1, we have shown that our approach converges to an RTF-based BM. In contrast to [25], our approach does not rely on such time segments; only a coarse DoA estimate is required.
Warsitz et al. presented a BM based on a generalized eigenvalue decomposition. They construct the BM directly by using the beamformer filter coefficients resulting from maximizing the output signal-to-noise ratio, where the filter coefficients are computed iteratively by solving a generalized eigenvalue problem [41, 42]. This approach indirectly estimates the RTFs and requires neither periods of absence of noise nor the DoA of the target source. On the other hand, it works only for stationary noise, while our approach can work in a nonstationary multi-speaker scenario.
Recently, a subspace approach for estimating RTFs in multiple-noise scenarios was proposed in [27], which was used to construct the RTF-based BM efficiently. However, this approach needs an estimate of the source activities.
The conventional noise estimation methods other than BM-based approaches are mostly based on source activity estimation [43] or minimum statistics noise power estimation [44, 45]. For those approaches, it is usually assumed that the sources are statistically independent and the noise is more stationary than the target signal. The recently well-studied model-based NMF approaches [5–7] can be used directly for noise reduction [46, 47] or for noise estimation [48]. Those methods usually rely on prior knowledge of the noise type (point source or diffuse noise) to define the model parameters. Therefore, they are only as effective as the match between the model and the current scenario, and they are prone to fail if the model assumptions do not hold or the parameters could not be properly learned. The latter is especially crucial for online algorithms in time-varying scenarios. On the other hand, Directional BSS will fail if the interfering source arrives from the same direction as the target source, as then Directional BSS is not able to provide an estimate for the interfering source. Compared to these alternative approaches, the main advantage of the proposed approach is that no target source activity estimation or prior knowledge on the source characteristics is necessary and no model needs to be matched; only a coarse DoA estimate is required.
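The conventional ABM described above can be sketched with a normalized LMS (NLMS) filter. This is a minimal illustration under idealized assumptions: the reference `ref` is a white-noise stand-in for the FB output, only the target is active, and the function name `nlms_blocking` as well as all parameter values are illustrative choices that are not taken from [23].

```python
import numpy as np

def nlms_blocking(ref, mic, n_taps=32, mu=0.5, eps=1e-8):
    """Adaptively subtract the filtered reference from the microphone signal (NLMS)."""
    w = np.zeros(n_taps)
    buf = np.zeros(n_taps)              # most recent reference samples, newest first
    err = np.zeros(len(mic))
    for k in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[k]
        err[k] = mic[k] - w @ buf       # residual = noise reference
        w += mu * err[k] * buf / (buf @ buf + eps)
    return err, w

rng = np.random.default_rng(0)
ref = rng.standard_normal(8000)                   # target-only reference signal
h = rng.standard_normal(8)                        # toy path from reference to microphone
mic = np.convolve(h, ref)[:len(ref)]

residual, w = nlms_blocking(ref, mic)
suppression_db = 10.0 * np.log10(
    np.mean(mic[-1000:] ** 2) / (np.mean(residual[-1000:] ** 2) + 1e-300))
```

In this target-only setting, the filter identifies the toy path and the residual vanishes. If an interferer were active during adaptation, the update would also try to cancel it, which is why such an ABM depends on a double-talk detector.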
4 Evaluation of blocking matrices
In order to evaluate the proposed BM, comprehensive experiments were carried out. The system behavior and the target suppression performance of Directional BSS in single-source scenarios (only one directional source is active) and multiple-source scenarios (multiple directional sources are simultaneously active) are evaluated. For showing the system behavior, Directional BSS is compared to (1) an ideal RTF-based BM and (2) a BM based on a DSB. The ideal RTF-based BM is calculated from the measured room impulse responses (RIRs). In order to evaluate the target suppression performance, Directional BSS is compared to (1) a perfect adaptive BM (referred to as ideal ABM) and (2) a BM based on a DSB. The ideal ABM is adapted in a single-source scenario. It should be noted that state-of-the-art BMs are mostly based on an estimation of source activities and require perfectly detected target source-only time segments. Therefore, we always compare Directional BSS with the DSB and not with other BMs, as only these two BMs do not require estimating any source activities. The comparison to the ideal ABM shows how closely Directional BSS approaches the perfectly supervised case. Besides, the robustness of Directional BSS against localization errors is analyzed in Section 4.3.
4.1 Experimental setup
Two real rooms were considered for evaluation: (1) room A: a living-room-like environment with a moderate reverberation time of T_{60}≈250 ms and a critical distance [49] of 1.3 m and (2) room B: a more reverberant living room with T_{60}≈400 ms and a critical distance of 0.9 m. As source-array distances, 1 and 2 m were considered. The experiments are based on RIR measurements carried out with a two-channel array. The measurements were performed for two different microphone spacings, d_{mic}∈{6,11.5} cm, at a sampling frequency of 48 kHz using the maximum length sequences (MLS) method [49]. For the following evaluation, the RIRs were downsampled to a sampling frequency f_{s}=16 kHz. We combine the efficient SOS-based online BSS algorithm presented in [32] with the geometric constraint (22) to perform Directional BSS. The filter length of the finite impulse response (FIR) filters w_{pq} (21) is 1024, and the block length for estimation of the correlation matrix R_{yy} (23) is 2048. The number of iterations per data block of 125 ms is 15 (see [32] for details on the adaptation). Three male and female speech signals of length 10 s were used as source signals. Diffuse background noise components were simulated using the method proposed in [50]. All sources (including speech sources and diffuse source signals) are continuously active and normalized to equal average power. For the experiments, the DoA information of the target source is given. However, as a source localizer is necessary in practice to estimate the target DoA, the robustness of Directional BSS against localization errors is investigated in Section 4.3.
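To relate the stated parameters to time spans, the following short Python calculation converts them using the sampling frequencies given above. Only values from the setup description are used; the variable names are illustrative.

```python
fs = 16000                  # sampling frequency after downsampling, in Hz
fs_measured = 48000         # sampling frequency of the RIR measurements, in Hz
filter_len = 1024           # length of the FIR demixing filters, in taps
corr_block_len = 2048       # block length for estimating the correlation matrix
data_block_ms = 125         # duration of one data block, in ms

downsampling_factor = fs_measured // fs
filter_ms = 1000.0 * filter_len / fs
corr_block_ms = 1000.0 * corr_block_len / fs
data_block_samples = fs * data_block_ms // 1000
```

The demixing filters thus span 64 ms, which is considerably shorter than the reverberation times of 250 and 400 ms, so the late reverberation tail of the target cannot be modeled by the filters.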
4.2 Performance of the blocking matrices
As performance measures, we use (1) the frequency response of the overall system to show the system behavior in different scenarios under various acoustical conditions, (2) the target suppression gain and the root mean square error between the estimated RTF and the true RTF to measure the blocking performance, and (3) the signal-to-interference ratio (SIR) difference between the BM input and the BM output signal to measure the ability of the BM to preserve all interfering signals. Note that here the interference includes interfering sources and diffuse background noise.
4.2.1 Frequency response of the overall system
where $\underline{\widehat{n}}_{s}$ refers to the residual of the target signal component s in the noise estimate $\widehat{n}$ (‘leakage’). This characterization is similar to but not equal to a beam pattern in the usual sense, where it is assumed that the acoustic waves propagate in free field and no scattering is considered. Instead, (26) also considers the acoustic environment by accounting for the transfer functions from the source position to the microphones. Thereby, $\underline{h}_{\text{trans}}$ captures reflections of source signals determined by the given source positions. Thus, if (26) exhibits a minimum for a certain angle relative to a certain distance to the microphone array, it indicates that all signal components originating from this angle at this distance, including possible reflections at surfaces in the acoustic environment, are suppressed to the given extent.
We show the magnitude response of (1) Directional BSS, (2) the ideal RTF-based BM, and (3) a DSB. With the ideal RTF-based BM, the target component is perfectly suppressed. Directional BSS is expected to converge to this ideal solution. For the BM based on a DSB, the filter coefficients are calculated according to the fractional delay between the two microphone signals. For Directional BSS, the BM coefficients are a set of converged BSS demixing filters of length 1024 adapted for the corresponding scenarios. We first show the magnitude response for the BMs adapted/calculated for single-source scenarios.
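A response of this kind can be computed from impulse responses and BM filters as sketched below. The sketch assumes that the overall response from the source position to the noise estimate is the sum of the two microphone paths, each convolved with the corresponding BM filter; the function name `overall_response_db` and the toy impulse responses (a direct path plus one reflection) are hypothetical.

```python
import numpy as np

def overall_response_db(h11, h12, w11, w21, n_fft=4096):
    """Magnitude response (dB) from the source position to the BM output.

    Assumes len(h11) + len(w11) == len(h12) + len(w21).
    """
    h_trans = np.convolve(h11, w11) + np.convolve(h12, w21)
    H = np.fft.rfft(h_trans, n_fft)
    return 20.0 * np.log10(np.abs(H) + 1e-12)   # small floor avoids log(0)

# Toy broadside target: identical direct paths, different reflections.
h11 = np.array([1.0, 0.0, 0.0, 0.5])
h12 = np.array([1.0, 0.0, 0.4, 0.0])

# Delay-and-subtract BM for broadside: plain subtraction of the two channels.
resp_dsb = overall_response_db(h11, h12, [1.0], [-1.0])

# BM built from the true responses (scaled RTF-type solution).
resp_ideal = overall_response_db(h11, h12, h12, -h11)
```

The DSB cancels only the direct path, so the reflections leak into the output, whereas the solution built from the true responses suppresses the complete target component down to the numerical floor.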
4.2.2 Target suppression performance
where $\sigma_{a}^{2}$ denotes the (long-term averaged) signal power of the signal a, x_{s,p} denotes the target component contained in the p-th microphone, and $\widehat{n}_{s}$ denotes the target residual contained in the noise estimate. The target suppression gain of the ideal RTF-based BM is infinity. A higher target suppression gain indicates a higher blocking performance for the signal from the target source direction. This measure is very similar to the ‘signal blocking factor’ used in [52]. We compare the target suppression performance with (1) the DSB and (2) a simulated ideal ABM, where one microphone channel is simply adaptively subtracted from the other with an LMS-type algorithm. For this simulation, an ideal case is assumed, i.e., the microphone signal contains only the target signal. The simulated ABM can be regarded as an ideal version (in a supervised case) of a conventional ABM using an LMS algorithm as proposed in [23, 24].
Figure 11b shows the target suppression gain for the three BMs. As the DSB is only dependent on the target direction and the target source-array distance, the target suppression gain of the DSB for scenarios 2 to 4 is the same as for scenario 1. We simulate the ideal ABM by adapting the BM filter in a single-source scenario, i.e., scenario 1. Therefore, the performance of the ideal ABM is only shown for this scenario. It can be seen that in a single-source scenario, the target suppression gain of an ideal ABM is only slightly higher than that of Directional BSS, which indicates that in a single-source scenario, Directional BSS can reach the upper limit. For scenarios 2 to 4, the target suppression gain of Directional BSS degrades but is always over 10 dB and clearly superior to that of the DSB. In Figure 11c, the NSE_{DirBSS} and NSE_{idealABM} are shown. For scenario 1, where only the target source is active, the NSEs of both BMs are very close and very low, which indicates that in a single-source scenario, the estimated RTFs are very close to the true RTF. With an increased number of sources or with more complicated acoustical conditions (higher reverberation time and larger source-array distance), it is more difficult to estimate the RTF. We can see that NSE_{DirBSS} increases and the target suppression gain degrades. However, even in such complex scenarios, Directional BSS can produce an acceptable estimate of the RTF without any source activity detection. In more recent work [53], further evaluation results comparing the estimated RTFs with the true RTFs are shown. The performance of Directional BSS is somewhat dependent on the signal characteristics (stationary or nonstationary, speech signal or white noise) of the involved sources. The energy of white noise is distributed over the full band, while the energy of speech-like sources is usually concentrated at low frequencies.
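A measure of this kind can be computed as the ratio of long-term powers expressed in dB. The sketch below is a minimal illustration; which microphone signal serves as the input reference (or whether both are averaged) is an assumption here, and the function name `target_suppression_gain_db` is hypothetical.

```python
import numpy as np

def target_suppression_gain_db(x_s, n_hat_s):
    """Ratio (dB) of the target power at a microphone to the target residual power."""
    return 10.0 * np.log10(np.mean(np.square(x_s)) / np.mean(np.square(n_hat_s)))

rng = np.random.default_rng(0)
x_s = rng.standard_normal(16000)     # toy target component at one microphone
n_hat_s = 0.1 * x_s                  # toy residual: amplitude reduced by a factor of 10

gain_db = target_suppression_gain_db(x_s, n_hat_s)
```

An amplitude reduction by a factor of 10 corresponds to a gain of 20 dB; for a perfectly blocking BM, the residual power is zero and the gain grows without bound.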
Besides, nonstationary sources make the adaptation of Directional BSS difficult as it needs to catch the variation of the signals within short frames. Therefore, different signal characteristics will lead to different results.
4.2.3 Preservation of interfering sources
Figure 12b,c shows the SIR_{diff} achieved by Directional BSS and the DSB for the testing scenario. We observe that SIR_{diff} decreases with increasing reverberation. For interfering sources located close to the target source, e.g., the interfering source at −20° or −10°, Directional BSS may treat the interferer and the target as one source, so that both the interfering source and the target source are suppressed to a certain extent. Comparing the performance of Directional BSS and the DSB, it can be seen that Directional BSS is clearly superior to the DSB, especially in reverberant environments.
4.3 Robustness against localization errors
5 Application of a DirBSS-based noise estimate to blind signal extraction
In this section, a two-channel BSE scheme combining Directional BSS as BM and Wiener-type spectral enhancement filters will be presented and evaluated.
5.1 A two-unit scheme: BM plus spectral enhancement filters
where $\hat{S}_{aa}$ represents the auto-PSD of a, v_{p}=w_{p1}∗x_{p} denotes the pth microphone signal filtered by Directional BSS, $\underline{g}_{\min}$ refers to the minimum value of the spectral weights (spectral floor), and μ is a real number which is used to achieve a trade-off between noise reduction and speech distortion. Note that the spectral weights can be designed in many forms, e.g., they can be derived from Bayesian estimation using a maximum likelihood (ML), maximum a posteriori (MAP), or MMSE estimator [57].
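As an illustration of how the spectral floor and the trade-off factor μ act on a Wiener-type weight, the following minimal sketch assumes the common per-bin form g = max(1 − μ·Ŝ_{nn}/Ŝ_{vv}, g_min), where Ŝ_{nn} is a noise PSD estimate and Ŝ_{vv} is the PSD of the filtered microphone signal. This form, the function name `wiener_type_gain`, and the default values are assumptions chosen for illustration and need not coincide with the filters (31) and (35).

```python
def wiener_type_gain(noise_psd, input_psd, mu=1.0, g_min=0.1):
    """Per-bin Wiener-type spectral weights with a spectral floor.

    noise_psd : noise PSD estimate per frequency bin
    input_psd : PSD of the (filtered) microphone signal per bin
    mu        : trade-off factor; larger values remove more noise
                at the cost of more speech distortion
    g_min     : spectral floor (hypothetical default)
    """
    gains = []
    for s_nn, s_vv in zip(noise_psd, input_psd):
        if s_vv > 0.0:
            g = 1.0 - mu * s_nn / s_vv
        else:
            g = g_min  # no input power: fall back to the floor
        gains.append(max(g, g_min))
    return gains

# Hypothetical PSD values for three frequency bins.
noise_psd = [0.1, 0.5, 2.0]
input_psd = [1.0, 1.0, 1.0]
gains = wiener_type_gain(noise_psd, input_psd, mu=1.0, g_min=0.1)
print(gains)
```

In the third bin, the noise estimate exceeds the input PSD, so the unconstrained weight would be negative and the spectral floor takes over. Increasing μ shifts more bins toward the floor, which is the trade-off between noise reduction and speech distortion mentioned above.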
5.2 Improved noise estimate by assuming an ideal diffuse noise field
This correction function was discussed in detail in [21]. Other possible correction functions could be based on coherence measurements during target inactivity [54, 58] or a method combined with minimum statistics [59]. However, a more detailed discussion of the possible correction functions and choices of spectral enhancement filters is outside the scope of this paper.
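For reference, the spatial coherence that an ideal (spherically isotropic) diffuse noise field produces between two omnidirectional microphones in free field is the well-known sinc-type function sin(2πfd/c)/(2πfd/c). The short sketch below evaluates only this coherence model. It is not the correction function itself, and the function name, the speed of sound of 343 m/s, and the microphone spacing are assumptions chosen for illustration.

```python
import math

def diffuse_coherence(freq_hz, mic_distance_m, c=343.0):
    """Coherence of an ideal diffuse field between two omnidirectional mics."""
    arg = 2.0 * math.pi * freq_hz * mic_distance_m / c
    if arg == 0.0:
        return 1.0  # limit of sin(x)/x for x -> 0
    return math.sin(arg) / arg

# Hypothetical microphone spacing of 10 cm.
d = 0.1
for f in (0.0, 500.0, 1000.0, 343.0 / (2.0 * d), 4000.0):
    print(f"{f:8.1f} Hz: {diffuse_coherence(f, d):+.3f}")
```

The coherence is close to one at low frequencies and has its first zero at f = c/(2d), which is why diffuse noise is strongly correlated between closely spaced microphones at low frequencies.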
5.3 Experimental results
In this section, the performance of the proposed scheme with two spectral enhancement filters ((31) and (35)) is evaluated in terms of signal-to-interference ratio improvement (SIR_{gain}) and speech distortion (SD) for various scenarios under various testing conditions. For comparison, we also show the performance of the proposed scheme based on the noise estimates provided by the ideal RTF-based BM and by the DSB.
5.3.1 Experimental setup
Two different rooms and two source-array distances are considered for evaluating the BMs. The scenarios shown in Figure 16b are considered for evaluating the performance of the speech enhancement scheme. All test signals are the same as those used in the experiments for evaluating the performance of the BMs.
The same Directional BSS algorithm and the same parameters are used as in Section 4 (see (22) and (23)). The frequency-domain Wiener filter is implemented with a polyphase filter bank [60] using a prototype FIR filter of length 1024, with 512 complex-valued subbands and a downsampling rate of 128.
5.3.2 Performance measures
where x_{s,p} and z_{s,p} denote the target speech components at the pth input and the pth output of the proposed scheme, respectively; $\sigma_{x_{s,p}}^{2}$ and $\sigma_{z_{s,p}}^{2}$ denote the (long-term) signal power of the target speech components at the pth input and the pth output; whereas $\sigma_{x_{n,p}}^{2}$ and $\sigma_{z_{n,p}}^{2}$ denote the (long-term) signal power of all noise and interference components at the pth input and the pth output, respectively. τ_{g} refers to the overall signal delay caused by the filter bank.
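The quantities defined above suffice to illustrate how an SIR improvement can be computed. The following minimal sketch assumes the common definition of SIR_{gain} as the output SIR minus the input SIR in dB, with long-term powers estimated as mean squares over the whole signal. This definition, the function names, and the toy signals are assumptions for illustration. The SD measure, which additionally requires compensating the delay τ_{g}, is not sketched here.

```python
import math

def mean_power(x):
    """Long-term signal power, estimated as the mean square of the samples."""
    return sum(v * v for v in x) / len(x)

def sir_db(speech, noise):
    """Signal-to-interference ratio in dB."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))

def sir_gain_db(x_s, x_n, z_s, z_n):
    """Assumed definition: output SIR minus input SIR for one channel p."""
    return sir_db(z_s, z_n) - sir_db(x_s, x_n)

# Hypothetical components at the input of channel p (equal power, 0 dB SIR).
x_s = [1.0, -1.0, 1.0, -1.0]
x_n = [1.0, 1.0, -1.0, -1.0]

# Hypothetical output: speech scaled by 0.9, noise scaled by 0.1.
z_s = [0.9 * v for v in x_s]
z_n = [0.1 * v for v in x_n]

gain = sir_gain_db(x_s, x_n, z_s, z_n)
print(f"SIR gain: {gain:.2f} dB")
```

Separating the target and the noise components at the output in this way presupposes that the (time-varying) filters can be applied to each component individually, which is possible in simulations where both components are available.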
5.3.3 Performance of the proposed BSE scheme
Although the ideal RTF-based BM can perfectly suppress the target source, the produced noise estimate is still biased relative to the true noise components contained in the microphone signal. Therefore, we cannot expect perfect noise reduction from the speech enhancement scheme with the biased noise estimate. However, the noise reduction performance achieved with this noise estimate can be regarded as the upper limit for the scheme with the noise estimate produced by Directional BSS without applying a bias correction. It can be seen that for all scenarios, with Directional BSS as BM, the noise reduction performance is almost the same as the performance achieved with the ideal RTF-based BM (1 dB less), but clearly superior to the performance achieved with a DSB. Obviously, for the latter, the large residual of the target component reduces the SIR improvement and leads to a significant distortion of the target source.
6 Conclusions
In our earlier work, Directional BSS was proposed as a BM for source extraction. The concept combines BSS with a geometric constraint to cope with the underdetermined scenario such that a meaningful joint estimate of all interfering speech signals and diffuse background noise can be obtained using only two microphones. In this paper, we show that Directional BSS converges to an RTF-based ideal BM. Experimental results analyzing the system behavior and the blocking performance of Directional BSS under various acoustical conditions were presented. These results verify that Directional BSS can successfully estimate the RTF in underdetermined nonstationary noise scenarios without requiring source activity information. The target suppression performance of Directional BSS is clearly superior to that of a common DSB. Simulation results also confirm that Directional BSS is very robust against localization errors. Additionally, we evaluate a source extraction scheme which combines Directional BSS and Wiener-type spectral enhancement filters. It is shown that the noise reduction performance achieved by this scheme using Directional BSS is very close to the performance achieved by the same scheme using an ideal RTF-based BM. Therefore, for exploiting Directional BSS as a BM, neither source activity information nor information on the number of active sources is necessary. The only information required by this informed algorithm is some coarse DoA information on the target source.
References
1. Hyvärinen A, Oja E: Independent Component Analysis. Wiley, New York; 2001.
2. Makino S, Lee TW, Sawada H: Blind Speech Separation. Springer, Berlin; 2007.
3. Yilmaz Ö, Rickard S: Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process 2004, 52: 1830-1847. 10.1109/TSP.2004.828896
4. Araki S, Sawada H, Mukai R, Makino S: A novel blind source separation method with observation vector clustering. In Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC). Eindhoven; September 2005:117-120.
5. Duong NK, Vincent E, Gribonval R: Underdetermined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process 2010, 18: 1830-1840.
6. Ozerov A, Fevotte C: Multichannel nonnegative matrix factorization in convolutive mixtures with application to blind audio source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Taipei; March 2009:550-563.
7. Sawada H, Kameoka H, Araki S, Ueda N: New formulations and efficient algorithms for multichannel NMF. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New York; October 2011:153-156.
8. Van den Bogaert T, Doclo S, Wouters J, Moonen M: Speech enhancement with multichannel Wiener filter techniques in multimicrophone binaural hearing aids. J. Acoust. Soc. Am 2009, 125: 360-371. 10.1121/1.3023069
9. Cornelis B, Moonen M: A VAD-robust multichannel Wiener filter algorithm for noise reduction in hearing aids. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Prague; May 2011:281-284.
10. Frost OL: An algorithm for linearly constrained adaptive array processing. Proc. IEEE 1972, 60: 926-935.
11. Van Trees HL: Optimum Array Processing (Detection, Estimation, and Modulation Theory, Part IV), 1st edn. Wiley, New York; 2002.
12. Griffiths LJ, Jim CW: An alternative approach to linear constrained adaptive beamforming. IEEE Trans. Speech Audio Process 1982, 30: 27-34.
13. Zelinski R: A microphone array with adaptive postfiltering for noise reduction in reverberant rooms. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Las Vegas; March 2008:2578-2581.
14. McCowan I, Bourlard H: Microphone array postfilter based on noise field coherence. IEEE Trans. Speech Audio Process 2003, 11: 709-716. 10.1109/TSA.2003.818212
15. Ito N, Shimizu H, Ono N, Sagayama S: Diffuse noise suppression using crystal-shaped microphone arrays. IEEE Trans. Audio Speech Lang. Process 2011, 19: 2101-2110.
16. Hendriks R, Gerkmann T: Noise correlation matrix estimation for multimicrophone speech enhancement. IEEE Trans. Audio Speech Lang. Process 2012, 20: 223-233.
17. Taseska M, Habets EAP: MMSE-based blind source extraction in diffuse noise fields using a complex coherence-based a priori SAP estimator. In Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC). Aachen, Germany; September 2012.
18. Taseska M, Habets EAP: MMSE-based source extraction using position-based posterior probabilities. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Vancouver; May 2013.
19. Parra L, Alvino C: Geometric source separation: merging convolutive source separation with geometric beamforming. IEEE Trans. Speech Audio Process 2002, 10: 352-362. 10.1109/TSA.2002.803443
20. Zheng Y, Reindl K, Kellermann W: BSS for improved interference estimation for blind speech signal extraction with two microphones. In Proceedings of 3rd International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). Dutch Antilles, Aruba; December 2009:253-256.
21. Maas R, Schwarz A, Zheng Y, Reindl K, Meier S, Sehr A, Kellermann W: A two-channel acoustic front-end for robust automatic speech recognition in noisy and reverberant environments. In Proceedings of the International Workshop on Machine Listening in Multisource Environments (CHiME). Florence; September 2011:41-46.
22. Van Veen BD, Buckley KM: Beamforming: a versatile approach to spatial filtering. IEEE ASSP Mag 1988, 5: 4-24.
23. Hoshuyama O, Sugiyama A: A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Atlanta; May 1996:925-928.
24. Herbordt W, Kellermann W: Frequency-domain integration of acoustic echo cancellation and a generalized sidelobe canceller with improved robustness. Eur. Trans. Telecommun 2002, 13: 123-132. 10.1002/ett.4460130207
25. Gannot S, Burshtein D, Weinstein E: Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. Signal Process 2001, 49: 1614-1626. 10.1109/78.934132
26. Cohen I: Relative transfer function identification using speech signals. IEEE Trans. Speech Audio Process 2004, 12(5):451-459. 10.1109/TSA.2004.832975
27. Golan S, Gannot S, Cohen I: Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals. IEEE Trans. Audio Speech Lang. Process 2009, 17: 1071-1086.
28. Takahashi Y, Takatani T, Osako K, Saruwatari H, Shikano K: Blind spatial subtraction array for speech enhancement in noisy environment. IEEE Trans. Audio Speech Lang. Process 2009, 17: 650-664.
29. Smaragdis P: Blind separation of convolved mixtures in the frequency domain. Neurocomputing 1998, 22: 21-34. 10.1016/S0925-2312(98)00047-2
30. Buchner H, Aichner R, Kellermann W: The TRINICON framework for adaptive MIMO signal processing with focus on the generic Sylvester constraint. In Proceedings of the ITG Conference on Speech Communication. Aachen; October 2008.
31. Buchner H, Aichner R, Kellermann W: Blind source separation for convolutive mixtures exploiting nongaussianity, nonwhiteness, and nonstationarity. In Proceedings of International Workshop on Acoustic Echo and Noise Control (IWAENC). Kyoto; September 2003:275-278.
32. Aichner R, Buchner H, Yan F, Kellermann W: A real-time blind source separation scheme and its application to reverberant and noisy acoustic environments. Signal Process 2006, 86: 1260-1277. 10.1016/j.sigpro.2005.06.022
33. Buchner H, Aichner R, Kellermann W: Relation between blind system identification and convolutive blind source separation. In Proceedings of Workshop for Hands-Free Speech Communication and Microphone Arrays (HSCMA). Piscataway; March 2005.
34. Araki S, Makino S, Hinamoto Y, Mukai R, Nishikawa T, Saruwatari H: Equivalence between frequency-domain blind source separation and frequency-domain adaptive beamforming for convolutive mixtures. EURASIP J. Adv. Signal Process 2003, 2003: 1157-1166. 10.1155/S1110865703305074
35. Gerven SV, Compernolle DV: Signal separation by symmetric adaptive decorrelation: stability, convergence, and uniqueness. IEEE Trans. Signal Process 1995, 43: 1602-1612. 10.1109/78.398721
36. Weinstein E, Feder M, Oppenheim A: Multichannel signal separation by decorrelation. IEEE Trans. Speech Audio Process 1993, 1: 405-413. 10.1109/89.242486
37. Zheng Y, Lombard A, Kellermann W: An improved combination of directional BSS and a source localizer for robust source separation in rapidly time-varying acoustic scenarios. In Proceedings of the Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA). Edinburgh; May 2011.
38. Hjørungnes A: Complex-valued matrix derivatives: with applications in signal processing and communications. Cambridge University Press, Cambridge; 2011.
39. Knapp C, Carter G: The generalized correlation method for estimation of time delay. IEEE Trans. Acoustics Speech Signal Process. ASSP 1976, 24: 320-327. 10.1109/TASSP.1976.1162830
40. Lombard A, Rosenkranz T, Buchner H, Kellermann W: Multidimensional localization of multiple sound sources using averaged directivity patterns of blind source separation systems. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Taipei; April 2009:233-236.
41. Warsitz E, Krueger A, Haeb-Umbach R: Speech enhancement with a new generalized eigenvector blocking matrix for application in a generalized sidelobe canceller. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Las Vegas; March 2008:73-76.
42. Krueger A, Warsitz E, Haeb-Umbach R: Speech enhancement with a GSC-like structure employing eigenvector-based transfer function ratios estimation. IEEE Trans. Audio Speech Lang. Process 2011, 19: 206-219.
43. Gerkmann T, Hendriks R: Noise power estimation based on the probability of speech presence. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz; October 2011:145-148.
44. Martin R: Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process 2001, 9(5):504-512. 10.1109/89.928915
45. Cohen I: Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. IEEE Trans. Speech Audio Process 2003, 11(5):466-475. 10.1109/TSA.2003.811544
46. Ozerov A, Vincent E: Using the FASST source separation toolbox for noise robust speech recognition. In Proceedings of International Workshop on Machine Listening in Multisource Environments (CHiME 2011). Florence; September 2011.
47. Moritz N, Schädler M, Adiloglu K, Meyer B, Jürgens T, Gerkmann T, Goetze S: Noise robust distant automatic speech recognition utilizing NMF based source separation and auditory feature extraction. In Proceedings of The 2nd International Workshop on Machine Listening in Multisource Environments (CHiME 2013). Vancouver; June 2013:11.
48. Jeon K, Park N, Kim H, Choi M, Hwang K: Mechanical noise suppression based on nonnegative matrix factorization and multiband spectral subtraction for digital cameras. IEEE Trans. Consum. Electron 2013, 59(2):296-302.
49. Kuttruff H: Room Acoustics. Taylor & Francis, London; 2000.
50. Habets EAP, Gannot S: Generating sensor signals in isotropic noise fields. J. Acoustical Soc. Am 2007, 122: 3464-3470. 10.1121/1.2799929
51. Martin R: Kurzfassung Freisprecheinrichtung mit mehrkanaliger Echokompensation und Störgeräuschreduktion. RWTH Aachen University, PhD thesis; 1995.
52. Talmon R, Cohen I, Gannot S: Relative transfer function identification using convolutive transfer function approximation. IEEE Trans. Audio Speech Lang. Process 2009, 17: 546-555.
53. Reindl K, Barfuss H, Gannot S, Kellermann W, Markovich-Golan S: Geometrically constrained TRINICON-based relative transfer function estimation in underdetermined scenarios. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New York; October 2013.
54. Kellermann W, Reindl K, Zheng Y: Method and acoustic signal processing system for interference and noise suppression in binaural microphone configurations. US Patent 2011/0307249 A1 2011.
55. Reindl K, Zheng Y, Kellermann W: Speech enhancement for binaural hearing aids based on blind source separation. In Proceedings of 4th International Symposium on Communications, Control, and Signal Processing (ISCCSP). Limassol; March 2010.
56. Reindl K, Zheng Y, Schwarz A, Meier S, Maas R, Sehr A, Kellermann W: A stereophonic acoustic signal extraction scheme for noisy and reverberant environments. Comput. Speech Lang. (CSL) 2012, 27: 726-745.
57. Vary P, Martin R: Digital Speech Transmission: Enhancement, Coding and Error Concealment. Wiley, Chichester; 2006.
58. Kim K, Jeong S, Jeong J, Oh K, Kim J: Dual channel noise reduction method using phase difference-based spectral amplitude estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Dallas; March 2010:217-220.
59. Jeong S, Kim K, Jeong J, Oh K, Kim J: Adaptive noise power spectrum estimation for compact dual channel speech enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Dallas; March 2010:1630-1633.
60. Vaidyanathan PP: Multirate Systems and Filter Banks. Prentice-Hall, Upper Saddle River; 1993.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.