
Evaluations on underdetermined blind source separation in adverse environments using time-frequency masking

Abstract

The successful implementation of speech processing systems in the real world depends on their ability to handle adverse acoustic conditions with undesirable factors such as room reverberation and background noise. In this study, an extension to the established multiple sensors degenerate unmixing estimation technique (MENUET) algorithm for blind source separation is proposed, based on fuzzy c-means clustering, to improve separation ability in underdetermined situations using a nonlinear microphone array. Rather than testing blind source separation ability solely under reverberant conditions, this paper extends the evaluation to a variety of simulated and real-world noisy environments. The results show encouraging separation ability and improved perceptual quality of the separated sources under such adverse conditions. This not only establishes the proposed methodology as a credible improvement to the system, but also implies further applicability in areas such as noise suppression in adverse acoustic environments.

1 Introduction

The ability of the human cognitive system to distinguish between multiple, simultaneously active sources of sound is a remarkable quality that is often taken for granted. This capability has been studied extensively within the speech processing community, and many an endeavor at imitation has been made. However, automatic speech processing systems are yet to perform at a level akin to human proficiency [1] and are thus frequently faced with the quintessential 'cocktail party problem’: the inadequacy in the processing of the target speaker/s when there are multiple speakers in the scene [2]. The implementation of a suitable source separation algorithm can improve the performance of such systems, where source separation is the recovery of the original sources from a set of mixed observations. If no a priori information of the original sources and/or mixing process is available, it is termed blind source separation (BSS). Rather than rely on the availability of such a priori information, BSS methods often exploit an assumption on the constituent source signals and utilize spatial diversity obtained from the sensor observations. BSS has many important applications in both the audio and biosignal disciplines, including medical imaging and communication systems.

In the last decade, the research field of BSS has evolved significantly to become an important technique in acoustic signal processing [3]. More specifically, the concept of time-frequency (TF) masking in the context of BSS has been of significance due to its applicability to all BSS scenarios, in particular the underdetermined case, where there exist more sources than sensors. In the TF masking approach to BSS, the assumption of sparseness between the speech sources is typically exploited as initiated in [4]. Several definitions of sparseness exist in the literature; for example, [5] simply defines sparseness as containing 'as many zeros as possible', whereas others offer a more quantifiable measure such as kurtosis [6]. Often, a sparse representation of speech mixtures can be acquired through the projection of the signals onto an appropriate basis, such as the Gabor or Fourier basis. In particular, the W-disjoint orthogonality (W-DO) of speech signals was explored for the short-time Fourier transform (STFT) domain, where the sparseness implies that the STFT supports of the signals are disjoint. This significant discovery motivated the degenerate unmixing estimation technique (DUET) [4]. The DUET proposed a demixing approach based on the formation of TF masks, where each mask would essentially correspond to the indicator function for the support of the source signal. The DUET algorithm successfully recovered the original source signals from stereo microphone observations using estimates of the relative attenuation and phase parameters.

The DUET algorithm consequently stimulated a plethora of demixing techniques. Among the first extensions to the DUET was the TF ratio of mixtures (TIFROM) algorithm, which relaxed the sparseness assumption; however, its performance was limited to anechoic conditions with the observations idealized as linear and instantaneous [7]. Subsequent research extended the DUET to echoic conditions with the use of the estimation of signal parameters via rotational invariance technique (ESPRIT) method to form the DUET-ESPRIT algorithm [8, 9]. However, this was restricted to a linear microphone arrangement and was thus subject to front-back confusions, primarily due to the natural constraint in spatial diversity from the microphone observations.

A different avenue of research as in [10] composed a two-stage algorithm which combined the sparseness principle presented in DUET with the established independent component analysis (ICA) algorithm to yield the sparseness and ICA (SPICA) algorithm. This approach exploited the sparseness of the signals to estimate and remove the active speech source at a particular TF point, and ICA was then applied to the remaining mixtures. Naturally, a restraint upon the number of sources present at any TF point relative to the number of sensors was inevitable due to the ICA stage. Furthermore, the algorithm was only investigated for the stereo case.

The authors of the SPICA expanded their research to nonlinear microphone arrays in [11–13] with the introduction of the clustering of normalized observation vectors. Whilst remaining similar in spirit to the DUET, the research was inclusive of non-ideal conditions such as room reverberation, and allowed more than two sensors in an arbitrary arrangement. This eventually culminated in the development of the multiple sensors degenerate unmixing estimation technique, termed MENUET [14, 15]. Additionally, the mask estimation in MENUET was automated through the application of the k-means clustering technique. Another algorithm which proposes the use of a clustering approach for the mask estimation is presented in [16]: this study is based upon the concept of Hermitian angles between the reference vector and observation vectors in the complex vector space. However, evaluations were restricted to a linear microphone array.

Advancements in the TF masking approaches to BSS beyond MENUET involve additional stages and complexities. Of particular mention is the approach in [17], which resulted in superior BSS performance in underdetermined reverberant conditions. The algorithm employed a two-stage approach: firstly, observation vectors are clustered in a frequency bin-wise manner, and secondly, the separated frequency bin components classified as originating from the same source are grouped together. The benefit of this approach is that, due to the bin-wise clustering, it is robust against higher room reverberation in comparison to previous techniques such as MENUET, as well as possessing an inherent immunity to the spatial aliasing problem in the measurement of the time differences of arrival/directions of arrival [17]. However, despite the reported improvements in BSS performance, additional complexity was introduced due to the extra stage for the alignment of the frequency bin-wise permuted clustering results. Therefore, the MENUET has the advantage over the state-of-the-art study in [17] in that the fullband clustering for mask estimation eliminates the requirement for the additional stage of frequency bin-wise alignment.

However, the simplicity encapsulated in the MENUET inevitably presents its own limitations. Most significantly, the k-means clustering utilized for mask estimation is not highly robust in the presence of outliers or interference in the data. This often leads to non-optimal localization and partitioning results, particularly for reverberant mixtures [18, 19]. Furthermore, binary masking schemes have been shown to impede upon the separation quality due to musical noise distortions, and it was suggested that fuzzy masking approaches bear the potential to significantly reduce the musical noise at the output [12]. This may be attributed to the fact that when a hard partitioning approach is implemented, abrupt changes will exist in the recovered source estimate which consequently introduce artifacts in the time domain.

The suitability of fuzzy c-means (FCM) clustering for TF mask estimation in the BSS framework has been explored in [20, 21]. In this approach, the fuzzy partitioning in the c-means was suggested to be preferable to hard clustering due to the inherent ambiguity surrounding the membership of TF cells to a cluster, where examples of contributing factors to ambiguity include the effects of reverberation and environmental (background) noise. However, the investigations to date which employ the FCM, as with many others in the literature, have been restricted to a linear and overdetermined microphone arrangement.

Another soft clustering approach which has received attention in the BSS field lies within Gaussian mixture model (GMM)-based approaches [22–24]. This avenue of research is motivated by the intuitive notion that the individual component densities of the GMM may model some underlying set of hidden parameters in a mixture of sources. Due to the reported success of BSS methods that employ such Gaussian models, this clustering paradigm may be considered as a standard algorithm for comparison of mask estimation ability in the TF BSS framework, and is therefore investigated and regarded as a comparative model in this study.

However, each of the TF mask estimation approaches to BSS discussed above is limited in that its evaluations do not consider diverse sources of interference. Potential contributors to interference in BSS scenarios include not only room reverberation, but also environmental background noise, or noise originating from non-ideal recording sensors. In fact, almost all real-world applications of BSS have the inconvenient aspect of noise at the recording sensors [25], and the influence of such noise has been described as a very difficult and continually open problem in the BSS framework [26].

In general, the focus of BSS algorithms is not directed towards the suppression of environmental noise. However, for a system to achieve optimal performance, the impact of such noise must be addressed. Numerous methods have been proposed in the literature for the problem of additive sensor noise: Li et al. [27] present a two-stage denoising/separation algorithm; Cichocki et al. [25] implement an FIR filter at each channel to reduce the effects of additive noise; and Shi et al. [28] suggest a preprocessing whitening procedure for enhancement. The study in [29] considers a variety of common sources of background noise in the separation algorithm, and modifies numerous pre- and post-processing algorithms in order to account for the characteristics of the background noise. Whilst noise reduction has been achieved with denoising techniques implemented as a pre- or post-processing step, the performance was shown to degrade significantly at lower signal-to-noise ratios [30].

Within the TF BSS framework, the authors of [22] include the possibility of background noise in the observation error for their BSS model; however, the experimental simulations were only conducted for anechoic/reverberant conditions, without any clear distinction between environmental noise and reverberation in the observation error.

Motivated by such various shortcomings, this work presents an extension to the MENUET algorithm through the use of an alternative clustering scheme for mask estimation, and provides comprehensive evaluations in adverse acoustic conditions. Firstly, this study proposes that the substitution of the TF clustering stage with a fuzzy clustering approach as explored in [20, 21] will improve the separation performance in the same conditions as presented in [14, 15]. Secondly, it is hypothesized that this combination is sufficiently robust to withstand the degrading effects of reverberation and environmental noise, and evaluations of all the methods under the challenging conditions of reverberation and environmental background noise are presented. For all investigations in the study, comparisons are provided with both the original MENUET k-means and the standard soft GMM-based clustering algorithm for mask estimation.

The remainder of this paper is organized as follows: section 2 provides an overview of the proposed BSS scheme and explains the primary signal processing stages. Section 3 describes each of the three clustering schemes in greater detail. Section 4 explains the experimental evaluation and presents a discussion on the achieved results. The section also includes the existing limitations with the system and offers some potential avenues for future work. Section 5 concludes the paper with a brief summary.

2 System overview

2.1 Problem statement

Consider a microphone array of $M$ identical sensors in a reverberant enclosure where $N$ sources are present. A convolutive mixing model is assumed, whereby the observation at the $m$th sensor, $x_m(t)$, can be modeled as a summation of the individual contributions of the active sources $s_n(t)$.

When all N sources are active, the observation at the m th sensor can be expressed via the convolutive mixing model as

$$x_m(t) = \sum_{n=1}^{N} \sum_{p=0}^{P-1} h_{mn}(p)\, s_n(t-p) + n_m(t),$$
(1)

where $h_{mn}(p)$, $p = 0, \ldots, P-1$, denote the coefficients of the room impulse response from the $n$th source to the $m$th sensor, $n_m(t)$ denotes any additive noise received at the $m$th sensor and $t$ indicates time.

The goal of any BSS system is therefore to recover the $N$ sources, $\hat{s}_1, \ldots, \hat{s}_N$, each of which corresponds to one of the original source signals $s_1, \ldots, s_N$, respectively. Ideally, the separation is performed without any information about $s_n(t)$ and $h_{mn}(p)$.
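The convolutive mixing model of (1) can be sketched numerically as follows; the number of sources and sensors, the random decaying impulse responses and the noise level are illustrative assumptions, not the paper's simulation settings:

```python
import numpy as np

# Sketch of the convolutive mixing model of Eq. (1): each observation is the
# sum of the sources convolved with their room impulse responses, plus
# additive sensor noise n_m(t).
rng = np.random.default_rng(0)
N, M, P, T = 3, 2, 64, 8000           # sources, sensors, RIR taps, samples
s = rng.standard_normal((N, T))        # source signals s_n(t)
h = rng.standard_normal((M, N, P)) * np.exp(-np.arange(P) / 10)  # decaying RIRs
x = np.stack([
    sum(np.convolve(s[n], h[m, n])[:T] for n in range(N))
    + 0.01 * rng.standard_normal(T)    # additive noise at sensor m
    for m in range(M)
])                                     # x.shape == (M, T)
```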

2.2 STFT analysis

The time-domain sensor observations are converted into their corresponding frequency domain time-series X m (k, l) via the STFT as

$$X_m(k,l) = \sum_{\tau=-L/2}^{L/2-1} \mathrm{win}(\tau)\, x_m(\tau + k\tau_0)\, e^{-jl\omega_0\tau}, \quad m = 1, \ldots, M,$$
(2)

where $k \in \{0, \ldots, K-1\}$ is a time frame index, $l \in \{0, \ldots, L-1\}$ is a frequency bin index, $\mathrm{win}(\tau)$ is an appropriately selected window function and $\tau_0$ and $\omega_0$ are the TF grid resolution parameters. The analysis window is typically chosen such that sufficient information is retained within each frame whilst simultaneously reducing signal discontinuities at the edges. A suitable window is the Hann window:

$$\mathrm{win}(\tau) = 0.5 - 0.5\cos\!\left(\frac{2\pi\tau}{L}\right), \quad \tau = 0, \ldots, L-1,$$
(3)

where L denotes the frame size.
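The analysis of (2) with the Hann window of (3) can be sketched in a few lines; the frame size, hop (playing the role of $\tau_0$) and test signal are illustrative assumptions:

```python
import numpy as np

def stft(x, L=512, hop=256):
    """Hann-windowed STFT of a 1-D signal (Eq. 2 sketch); L is the frame
    size and hop plays the role of tau_0. Both values are illustrative."""
    win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(L) / L)   # Hann window, Eq. (3)
    n_frames = 1 + (len(x) - L) // hop
    frames = np.stack([win * x[k * hop: k * hop + L] for k in range(n_frames)])
    return np.fft.rfft(frames, axis=1)                       # (n_frames, L//2 + 1)

# A 1 kHz tone at 16 kHz sampling concentrates in bin 1000/16000*512 = 32.
fs = 16000
X = stft(np.sin(2 * np.pi * 1000 * np.arange(fs) / fs))
```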

It is assumed that the frame length $L$ is sufficient such that the main portion of the impulse responses $h_{mn}$ is covered. Therefore, the convolutive BSS problem may be approximated as an instantaneous mixture model [31] in the STFT domain

$$X_m(k,l) = \sum_{n=1}^{N} H_{mn}(l)\, S_n(k,l) + N_m(k,l), \quad m = 1, \ldots, M,$$
(4)

where $(k,l)$ represent the time and frequency indices, respectively, and $H_{mn}(l)$ is the frequency response between source $n$ and sensor $m$. $S_n(k,l)$, $X_m(k,l)$ and $N_m(k,l)$ are the STFTs of the $n$th source, the $m$th observation and the additive noise at the $m$th sensor, respectively.

The assumption of sparseness between the source signals implies that at each TF cell, at most one source is dominant [4]. Therefore, (4) can be expressed as

$$X_m(k,l) \approx \sum_{n=1}^{N} H_{mn}(l)\, S_n(k,l)\, \delta_n(k,l) + N_m(k,l), \quad m = 1, \ldots, M,$$
(5)

where $\delta_n(k,l)$ is the indicator function defined as

$$\delta_n(k,l) = \begin{cases} 1 & \text{when } S_n(k,l) \text{ is active at } (k,l), \\ 0 & \text{otherwise.} \end{cases}$$
(6)

Whilst this sparseness assumption holds true for anechoic mixtures, it becomes increasingly unreliable as the reverberation and/or environmental noise in the acoustic scene increases, due to the effects of multipath audio propagation and multiple reflections [4, 21].

2.3 Feature extraction

In this work, the TF mask estimation is realized through the estimation of the TF points where a signal is assumed dominant. To estimate such TF points, a spatial feature vector is calculated from the STFT representations of the $M$ observations. Previous research [14, 15] has identified level ratios and phase differences between the observations as appropriate features, as such features retain information on the magnitude and the argument of the TF points. Further discussion is presented in section 4.3.1.

The feature vector $\boldsymbol{\theta}(k,l) = [\boldsymbol{\theta}^{L}(k,l), \boldsymbol{\theta}^{P}(k,l)]^{\top}$ per TF point is estimated as

$$\boldsymbol{\theta}^{L}(k,l) = \left[ \frac{|X_1(k,l)|}{A(k,l)}, \ldots, \frac{|X_{J-1}(k,l)|}{A(k,l)}, \frac{|X_{J+1}(k,l)|}{A(k,l)}, \ldots, \frac{|X_M(k,l)|}{A(k,l)} \right],$$
(7)
$$\boldsymbol{\theta}^{P}(k,l) = \left[ \frac{1}{\alpha}\arg\frac{X_1(k,l)}{X_J(k,l)}, \ldots, \frac{1}{\alpha}\arg\frac{X_{J-1}(k,l)}{X_J(k,l)}, \frac{1}{\alpha}\arg\frac{X_{J+1}(k,l)}{X_J(k,l)}, \ldots, \frac{1}{\alpha}\arg\frac{X_M(k,l)}{X_J(k,l)} \right],$$
(8)

for $A(k,l) = \sqrt{\sum_{m=1}^{M} |X_m(k,l)|^2}$ and $\alpha = 4\pi f c^{-1} d_{\max}$, where $f$ is the frequency at the $l$th frequency bin index, $c$ is the propagation velocity of sound, $d_{\max}$ is the maximum distance between any two sensors in the array and $J$ is the index of the (arbitrarily selected) reference sensor. The weighting parameters $A(k,l)$ and $\alpha$ ensure appropriate amplitude and phase normalization of the features, respectively. It is widely known that in the presence of reverberation, a greater accuracy in phase ratio measurements can be achieved with higher spatial resolution; however, it should be noted that the value of $d_{\max}$ is upper bounded by the spatial aliasing theorem [14, 17, 21]. If the exact value of the maximum sensor spacing is not known, a positive constant may be used in its place [14]. This eliminates the need for the system to know the precise spacing between sensors.

The frequency normalization in (8) ensures frequency independence of the phase ratios in order to prevent the frequency permutation problem in the later stages of clustering. It is possible to cluster without such frequency independence by implementing a bin-wise clustering as in [17, 32]. However, the utilization of all the frequency bins avoids the frequency permutation problem and also permits data observations of short length [14].
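A per-TF-point implementation of the features in (7) and (8) might look as follows; the sound speed, the 4 cm maximum sensor spacing and the reference index are assumed values for illustration:

```python
import numpy as np

def spatial_features(X, f, c=343.0, d_max=0.04, J=0):
    """Level-ratio / phase-difference features of Eqs. (7)-(8) for one TF
    point. X holds the M complex observations X_m(k,l); f is the bin
    frequency in Hz. The values of c, d_max (4 cm) and J are assumptions."""
    M = len(X)
    others = [m for m in range(M) if m != J]         # reference sensor excluded
    A = np.sqrt(np.sum(np.abs(X) ** 2))              # amplitude normalizer A(k,l)
    alpha = 4 * np.pi * f * d_max / c                # phase normalizer alpha
    theta_L = np.abs(X[others]) / A                  # level ratios, Eq. (7)
    theta_P = np.angle(X[others] / X[J]) / alpha     # scaled phase differences, Eq. (8)
    return np.concatenate([theta_L, theta_P])        # feature vector, length 2(M-1)

# Two-sensor example: one TF point where the second sensor leads by 0.1 rad.
theta = spatial_features(np.array([1 + 0j, 0.5 * np.exp(1j * 0.1)]), f=1000.0)
```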

2.4 Mask estimation and separation

In this work, source separation is effected through the estimation and application of TF masks, which are estimated in the clustering stage. For the k-means algorithm, a binary mask for the n th source is simply estimated as [14]

$$M_n(k,l) = \begin{cases} 1 & \text{for } \boldsymbol{\theta}(k,l) \in C_n, \\ 0 & \text{otherwise,} \end{cases}$$
(9)

where $C_n$ denotes the set of TF points classified as belonging to the $n$th cluster.

The output of the FCM clustering is a fuzzy membership partition matrix [21, 33]. This partition matrix indicates the degree of membership of each TF point in the feature space to each of the $N$ clusters. These membership values, denoted by $u_n(k,l)$, are then interpreted as a collection of $N$ TF masks:

$$M_n(k,l) = u_n(k,l).$$
(10)

For the GMM clustering approach, the mask is set to the posterior probabilities of the dominant Gaussian components (cf. section 3.2) [22, 23]. This equates to

$$M_n(k,l) = p(\boldsymbol{\theta}(k,l) \mid \boldsymbol{\mu}_p, \boldsymbol{\Sigma}_p),$$
(11)

where $\boldsymbol{\mu}_p, \boldsymbol{\Sigma}_p$ denote the mean and covariance matrix of the $p$th (dominant) Gaussian component of the mixture model.

The spatial image estimate of the n th signal received at the m th sensor is then obtained through the application of mask M n to the m th observation as [17]

$$\hat{S}_{mn}(k,l) = M_n(k,l)\, X_m(k,l), \quad n = 1, \ldots, N.$$
(12)
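The mask application of (12) is a pointwise product; a minimal sketch, assuming toy dimensions and random fuzzy masks that sum to one per TF point:

```python
import numpy as np

# Applying estimated TF masks to one observation's STFT (Eq. 12 sketch).
# X_m is the reference observation's STFT; masks holds one (K x L) mask per
# source (binary for HKM, membership-valued for FCM). All values are toys.
K, Lbins, N = 4, 3, 2
rng = np.random.default_rng(0)
X_m = rng.standard_normal((K, Lbins)) + 1j * rng.standard_normal((K, Lbins))

masks = rng.random((N, K, Lbins))
masks /= masks.sum(axis=0, keepdims=True)  # fuzzy masks sum to 1 per TF point

S_hat = masks * X_m                        # spatial image estimates, Eq. (12)
# With masks summing to one, the source estimates add back to the observation.
```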

2.5 Source resynthesis

Finally, the estimated source images are reconstructed in the time domain to obtain the estimates $\hat{s}_{mn}(t)$. This is realized by applying the overlap-and-add method [34] to $\hat{S}_{mn}(k,l)$. The reconstructed estimate is

$$\hat{s}_{mn}(t) = \frac{1}{C_{\text{win}}} \sum_{k'=0}^{L/\tau_0 - 1} \hat{s}_{mn}^{\,k+k'}(t),$$
(13)

where $C_{\text{win}} = 0.5\, L / \tau_0$ is the Hann window overlap-add constant, and the individual frequency components of the recovered signal are acquired through an inverse STFT

$$\hat{s}_{mn}^{\,k}(t) = \sum_{l=0}^{L-1} \hat{S}_{mn}(k,l)\, e^{jl\omega_0 (t - k\tau_0)},$$
(14)

for $k\tau_0 \le t \le k\tau_0 + L - 1$, and zero otherwise.
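The resynthesis of (13) and (14) reduces to inverse FFTs plus overlap-add; a sketch assuming a periodic Hann analysis window and a hop of $\tau_0 = L/2$, for which the window constant is 1 and interior samples are recovered exactly:

```python
import numpy as np

def istft(S, L=512, hop=256):
    """Overlap-add resynthesis (Eqs. 13-14 sketch). With a periodic Hann
    analysis window and hop = L/2, the constant C_win = 0.5*L/hop equals 1,
    so plain overlap-add reconstructs interior samples exactly."""
    n_frames = S.shape[0]
    out = np.zeros(hop * (n_frames - 1) + L)
    for k in range(n_frames):
        out[k * hop: k * hop + L] += np.fft.irfft(S[k], n=L)  # inverse STFT, Eq. (14)
    return out / (0.5 * L / hop)                              # normalization, Eq. (13)

# Round trip: Hann-windowed analysis followed by overlap-add resynthesis.
L, hop = 512, 256
x = np.sin(2 * np.pi * np.arange(4 * L) / 50)
win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(L) / L)        # Hann, Eq. (3)
frames = [win * x[k * hop: k * hop + L] for k in range(1 + (len(x) - L) // hop)]
y = istft(np.fft.rfft(np.asarray(frames), axis=1), L, hop)
```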

3 Clustering approaches

This section presents the details of the three clustering techniques employed in this study. The first two, the hard k-means and the Gaussian mixture model, have previously been used in other TF-based clustering BSS systems [14, 24], whilst the fuzzy c-means is the proposed mask estimation technique. All three techniques belong to the family of center-based clustering, and each has its own objective function. The common goal of all is the classification of the set of feature vectors, $\Theta = \{\boldsymbol{\theta}(k,l) \mid \boldsymbol{\theta}(k,l) \in \mathbb{R}^{2(M-1)}, (k,l) \in \Omega\}$, where $\Omega = \{(k,l) : 0 \le k \le K-1,\ 0 \le l \le L-1\}$ denotes the set of TF points in the STFT plane, into $N$ clusters. In the instance where the clusters are distinct, as with the hard k-means, each data point may only belong to one cluster. However, for the soft clustering techniques, each data element may belong to multiple clusters with a certain probability (membership).

3.1 Hard k-means clustering

Previous mask estimation methods as in [13–16] employ binary clustering techniques such as the hard k-means (HKM). The HKM algorithm was introduced by MacQueen [35]. In this approach, the set of feature vectors $\Theta$ is clustered into $N$ distinct cluster sets $\{C\} = C_1, \ldots, C_N$. Each set from $\{C\}$ contains the feature vectors assigned to the $n$th cluster, and has an associated prototype vector, $\mathbf{v}_n$, which denotes the $n$th cluster center.

Clustering of the data is achieved through the minimization of the objective function

$$J_{\text{HKM}} = \sum_{n=1}^{N} \sum_{\boldsymbol{\theta}(k,l) \in C_n} D_n(k,l),$$
(15)

where $D_n(k,l) = \|\boldsymbol{\theta}(k,l) - \mathbf{v}_n\|^2$ is the squared Euclidean distance between the feature vector $\boldsymbol{\theta}(k,l)$ and the $n$th cluster center.

Conditional on a set of initial centroids, this minimization is iteratively realized by the following alternating equations

$$C_n^{*} = \{\boldsymbol{\theta}(k,l) \mid n = \operatorname*{argmin}_{n'}\, D_{n'}(k,l)\}, \quad \forall n, k, l,$$
(16)
$$\mathbf{v}_n^{*} = E\{\boldsymbol{\theta}(k,l)\}_{\boldsymbol{\theta}(k,l) \in C_n}, \quad \forall n,$$
(17)

until convergence is met, where $E\{\cdot\}_{\boldsymbol{\theta}(k,l) \in C_n}$ denotes the mean operator over the TF points within the cluster set $C_n$, and the $(*)$ operator denotes the optimal value (at convergence). Due to the algorithm's sensitivity to the initialization of the cluster centers, it is recommended either to design initial centroids using an assumption on the sensor and source geometry as in [14, 15], or to utilize the best outcome of a predetermined number of independent runs.
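The alternating updates (16) and (17) can be sketched as follows; the simple data-point initialization and fixed iteration count are illustrative simplifications of the initialization strategies discussed above:

```python
import numpy as np

def hard_kmeans(theta, N, n_iter=50):
    """Alternating HKM updates of Eqs. (16)-(17) on feature vectors (rows
    of theta). The data-point initialization and fixed iteration count are
    simplifications; the text recommends geometry-based initial centroids
    or the best of several independent runs."""
    v = theta[:N].copy()                                  # initial centroids
    for _ in range(n_iter):
        D = ((theta[:, None, :] - v[None]) ** 2).sum(-1)  # squared distances D_n
        labels = D.argmin(axis=1)                         # assignment, Eq. (16)
        v = np.stack([theta[labels == n].mean(axis=0)     # centroid update, Eq. (17)
                      for n in range(N)])
    return labels, v

# Two well-separated clusters of 2-D feature vectors.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
labels, v = hard_kmeans(pts, 2)
```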

Summary: HKM clustering algorithm

3.2 Gaussian mixture model clustering

A number of studies in the literature for TF-based BSS have implemented the GMM clustering approach [22–24] and it is therefore included in this study for comparative purposes. It is also included in order to compare the effects of soft masking on the separation system, by providing the FCM with a fair comparison.

In the GMM-based clustering, each observation $\boldsymbol{\theta}(k,l)$ is modeled as drawn from a weighted sum of $P$ component Gaussian densities (clusters). Unlike the HKM and FCM described above, where the number of clusters is equal to the number of sources, GMM-based clustering carries the additional complexity that the best fit of a mixture model to the data set may not require $P$ to equal the number of sources [14].

The p th component of the mixture model is assumed to follow a Gaussian distribution with a characteristic mean and covariance, μ p and Σ p , respectively. The probability density function of an observation θ(k, l), denoted by θ for simplicity from here onward, is represented mathematically as:

$$p(\boldsymbol{\theta}; (\boldsymbol{\mu}, \boldsymbol{\Sigma})) = \sum_{p=1}^{P} w_p \cdot p(\boldsymbol{\theta}; (\boldsymbol{\mu}_p, \boldsymbol{\Sigma}_p)),$$
(18)

where (μ, Σ) contains the mean and covariance matrices for all P clusters, and w p denotes the mixture weight (probability) of the p th distribution. This p th component density is represented by

$$p(\boldsymbol{\theta}; (\boldsymbol{\mu}_p, \boldsymbol{\Sigma}_p)) = \frac{1}{(2\pi |\boldsymbol{\Sigma}_p|)^{1/2}} \exp\!\left( -\frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\mu}_p)^{\top} \boldsymbol{\Sigma}_p^{-1} (\boldsymbol{\theta} - \boldsymbol{\mu}_p) \right).$$
(19)

The unknown parameter sets (μ p , Σ p ) for the P distributions are estimated in such a manner as to maximize the likelihood of the mixture model; this estimation is most commonly iteratively calculated using the Expectation-Maximization (EM) algorithm [22]. The data is then clustered around the maximum likelihood parameters as determined from the EM algorithm by the final estimates of the a posteriori probabilities at convergence.

Conditional on an initial partitioning, that is, with the initial cluster sets $\{C_1, \ldots, C_P\}$ known, the parameter sets $(\boldsymbol{\mu}_p, \boldsymbol{\Sigma}_p)$ are found via the minimization of the negative log-likelihood of (19)

$$\operatorname*{argmin}_{\boldsymbol{\mu}_p, \boldsymbol{\Sigma}_p,\; p = 1, \ldots, P} \; \frac{1}{2} \sum_{p=1}^{P} w_p \log |\boldsymbol{\Sigma}_p| + \frac{1}{2} \sum_{p=1}^{P} \sum_{\boldsymbol{\theta} \in C_p} (\boldsymbol{\theta} - \boldsymbol{\mu}_p)^{\top} \boldsymbol{\Sigma}_p^{-1} (\boldsymbol{\theta} - \boldsymbol{\mu}_p)$$
(20)

and the weights $w_p$ are found, conditional on $(\boldsymbol{\mu}_p, \boldsymbol{\Sigma}_p)$, $p = 1, \ldots, P$, via

$$\operatorname*{argmax}_{w_p,\; p = 1, \ldots, P} \; \sum_{p=1}^{P} w_p \frac{1}{(2\pi |\boldsymbol{\Sigma}_p|)^{1/2}} \exp\!\left( -\frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\mu}_p)^{\top} \boldsymbol{\Sigma}_p^{-1} (\boldsymbol{\theta} - \boldsymbol{\mu}_p) \right).$$
(21)

The cluster sets are then found by assigning posterior probabilities to the mixture components. The use of GMM clustering within this particular BSS framework results in a number of components not equal to the number of sources (see section 4.1); therefore, the dominant $N$ of the $P$ components, as determined by the mixture weights, are selected to represent the $N$ sources. The posterior probabilities of the dominant Gaussians, denoted $p(\boldsymbol{\theta} \mid \boldsymbol{\mu}_p, \boldsymbol{\Sigma}_p)$, are then utilized as the TF masks to represent the corresponding sources (analogous to the work in [14, 17]).
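The EM fitting and posterior (responsibility) computation can be sketched with diagonal covariances; the random-partition initialization, iteration count and regularization below are all illustrative assumptions rather than the study's exact configuration:

```python
import numpy as np

def gmm_em(theta, P, n_iter=100, seed=0):
    """Diagonal-covariance EM sketch for GMM clustering (section 3.2).
    Returns mixture weights w and the responsibility matrix R of posterior
    probabilities, whose dominant columns would serve as TF masks."""
    rng = np.random.default_rng(seed)
    n, d = theta.shape
    part = rng.integers(P, size=n)                 # random Forgy-style partition
    mu = np.stack([theta[part == p].mean(axis=0) for p in range(P)])
    var = np.ones((P, d))                          # initial diagonal covariances
    w = np.full(P, 1.0 / P)
    for _ in range(n_iter):
        # E-step: log-densities and normalized posteriors (responsibilities)
        logp = -0.5 * (((theta[:, None, :] - mu) ** 2) / var
                       + np.log(2 * np.pi * var)).sum(-1) + np.log(w)
        R = np.exp(logp - logp.max(axis=1, keepdims=True))
        R /= R.sum(axis=1, keepdims=True)
        # M-step: update weights, means and diagonal variances
        Nk = R.sum(axis=0)
        w = Nk / n
        mu = (R.T @ theta) / Nk[:, None]
        var = (R.T @ theta ** 2) / Nk[:, None] - mu ** 2 + 1e-6
    return w, R

# Toy usage: 200 two-dimensional feature vectors, two components.
rng = np.random.default_rng(2)
pts = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(4, 0.3, (100, 2))])
w, R = gmm_em(pts, 2)
```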

3.3 Fuzzy c-means clustering

Whilst the HKM performed satisfactorily in the context of MENUET for BSS, the work presented in [21] and [36] demonstrated that the use of a fuzzy clustering algorithm improves the accuracy of mask estimation. The origins of the FCM are credited to the work presented in [33], and as with the HKM method, the feature set is clustered into $N$ clusters, where each cluster center is represented by a centroid $\mathbf{v}_n$. However, each cluster also has an associated partition matrix $U = \{u_n(k,l) \in \mathbb{R} \mid n \in \{1, \ldots, N\}, (k,l) \in \Omega\}$, which specifies the degree $u_n(k,l)$ to which a feature vector $\boldsymbol{\theta}(k,l)$ belongs to the $n$th cluster at the TF point $(k,l)$.

Clustering is achieved by the minimization of the cost function

$$J_{\text{FCM}} = \sum_{n=1}^{N} \sum_{(k,l)} u_n(k,l)^{q}\, D_n(k,l),$$
(22)

where $u_n(k,l)$ is subject to the constraint $\sum_{n=1}^{N} u_n(k,l) = 1$ and with $D_n(k,l)$ defined as in section 3.1. The fuzzification parameter $q > 1$ controls the membership softness in the cost function and therefore the fuzziness of the generated TF masks. Section 4.1 describes the selection of an appropriate value for the fuzzification parameter in this BSS context.

The minimization problem in (22) can be solved using Lagrange multipliers and is typically implemented as an alternating optimization scheme due to the open nature of its solution [21, 37]. Initialized with a random partitioning, the alternating updates are

$$\mathbf{v}_n^{*} = \frac{\sum_{(k,l)} u_n(k,l)^{q}\, \boldsymbol{\theta}(k,l)}{\sum_{(k,l)} u_n(k,l)^{q}},$$
(23)
$$u_n^{*}(k,l) = \left[ \sum_{j=1}^{N} \left( \frac{D_n(k,l)}{D_j(k,l)} \right)^{\frac{1}{q-1}} \right]^{-1}, \quad \forall n, k, l,$$
(24)

where $(*)$ denotes the optimal value, until a suitable termination criterion is satisfied. Typically, convergence is declared when the difference between successive partition matrices is less than some predetermined threshold $\varepsilon$ [33]. However, as is also the case with the k-means, the alternating optimization scheme presented may converge to a local, as opposed to global, optimum; thus, it is suggested to independently run the algorithm several times prior to selecting the most fitting result [21].
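The alternating updates (23) and (24) translate directly into code; $q = 2$ follows the value adopted in section 4.1, while the random initialization, iteration cap and tolerance are illustrative choices:

```python
import numpy as np

def fcm(theta, N, q=2.0, n_iter=100, tol=1e-6, seed=0):
    """Alternating FCM updates of Eqs. (23)-(24). A single random
    initialization is used for brevity; the text recommends keeping the
    best of several independent runs."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(theta), N))
    U /= U.sum(axis=1, keepdims=True)                  # memberships sum to 1
    for _ in range(n_iter):
        Uq = U ** q
        v = (Uq.T @ theta) / Uq.sum(axis=0)[:, None]   # centroids, Eq. (23)
        D = ((theta[:, None, :] - v) ** 2).sum(-1) + 1e-12
        Dinv = D ** (-1.0 / (q - 1))
        U_new = Dinv / Dinv.sum(axis=1, keepdims=True)  # memberships, Eq. (24)
        if np.abs(U_new - U).max() < tol:               # termination criterion
            U = U_new
            break
        U = U_new
    return U, v

# Toy usage on two well-separated clusters of feature vectors.
rng = np.random.default_rng(4)
pts = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
U, v = fcm(pts, 2)
```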

Summary: FCM clustering algorithm

4 Experimental evaluations

4.1 Experimental setup

The experimental setup was designed to replicate that of the studies in [14, 15] for comparative purposes. Figure 1 depicts the speaker and sensor arrangement, and Table 1 details the experimental conditions. The wall reflections of the enclosure and room impulse responses between each source and sensor were simulated using the image model method for small-room acoustics [38]. The room reverberation was quantified by the reverberation time RT60, defined as the time required for reflections of a direct sound to decay 60 dB below the level of the direct sound.

Figure 1

The simulated room setup for the nonlinear sensor arrangement experimental evaluations.

Table 1 The parameters used in experimental evaluations

Several types of background noise can be described by a diffuse sound field, modeled as an infinite number of statistically independent point sources on a sphere [29]. In this model, the intensities of the incident sound are uniformly distributed over all possible directions, and the field can be modeled as additive noise at the sensors, as in (1) [29]. In this study, 30 individual and independent point sources were situated uniformly at a distance of 1.5 m from the center of the microphone array. To increase the adversity of the evaluations, three types of environmental noise were considered: white noise, babble noise and factory noise. All noise samples are available in the NOISEX-92 database [39]. The simulated background noise was scaled according to the signal-to-noise ratio (SNR) definition in [40], which uses the standardized method given by the International Telecommunication Union to objectively measure the active speech level and calibrate the interfering noise signal appropriately [41]. It should be noted that in real-world environments, noise is never exactly isotropic; therefore, these evaluations must be considered with caution.
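Scaling an interfering noise signal to a prescribed SNR can be sketched as follows; note that the study uses the ITU standardized active speech level, whereas the plain RMS power used here is a simplification for illustration:

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Scale noise so that the speech-to-noise power ratio equals snr_db,
    then add it to the speech. Plain RMS power stands in for the ITU
    active speech level used in the study."""
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))  # noise gain for target SNR
    return speech + gain * noise

# Example: a tone contaminated by white noise at 10 dB SNR.
rng = np.random.default_rng(3)
s = np.sin(2 * np.pi * np.arange(16000) / 40)
n = rng.standard_normal(16000)
x = add_noise(s, n, snr_db=10)
```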

The four target speech sources, the genders of which were randomly selected, were realized with phonetically rich utterances from the TIMIT database [42], and the target-to-masker ratio between all of the sources was set to 0 dB. A representative number of mixtures was constructed for evaluation purposes. To avoid any spatial aliasing, the sensors were placed at a maximum distance of 4 cm apart.

Section 3.3 explains the role of the fuzzification parameter $q$ in the FCM clustering. Past research [21] has identified values of $q$ in the range $q \in (1, 1.5]$ as resulting in performance akin to hard clustering. Furthermore, it was empirically determined that for reverberant speech mixtures, $q = 2$ is optimal in order to achieve a balance between high separation performance and minimal artifacts [21]. This is consistent with other studies which also report an optimal value of 2 for the fuzzy exponent [43, 44]. Therefore, in this work, the fuzzification parameter $q$ is set to 2.

As mentioned in sections 3.1 and 3.3, it is widely recognized that the performance of the clustering algorithms is largely dependent on the initialization of the algorithm [19, 45]. If the initial partitions are not estimated with sufficient precision, there is a high possibility of finding a local, as opposed to global, optimum. It has been recommended [19] to run the algorithms multiple times to reduce the degrading effects of this sensitivity; the effectiveness of this style of initialization was also described in [46]. In an effort to save computational expense, it was desired to determine the smallest number of independent, single-iteration runs for initialization which would result in the best solution. Previous experiments as in [21] had implemented the best of 50 runs; however, it was empirically confirmed that there was little difference in performance between 25 and 50 runs. Therefore, satisfactory clustering initialization can be assumed to result when the best of 25 independent, randomly initialized single-iteration executions is selected for initialization. The 'best' solution was defined as the execution which resulted in the lowest cost function output of the independent runs (i.e. the smallest error).

Similar to the HKM and FCM algorithms, the GMM clustering approach also requires a suitable initialization. As recommended in [47], an initialization based on the Forgy method [48] was implemented, where the data set was randomly partitioned into $P$ non-overlapping sets with uniform mixing proportions. The initial covariance matrices for all components were diagonal. However, the GMM clustering approach is also highly sensitive to the selection of an appropriate number of components in the model. It was observed in the experiments that an increase in the number of mixture components generally resulted in improved separation performance; however, the selection of an optimal number of Gaussians was not simple and required considerable experimentation. For this particular application of GMM clustering in the desired source/sensor configuration, the optimum was empirically determined as $P = 12$. This is in accordance with previous studies using GMMs for BSS such as [14], where the determination of the optimal number of clusters came at a considerable computational expense. As mentioned in section 3.2, since the number of components is not equal to the number of sources, the dominant $N$ components (as indicated by the mixture weights) were used to estimate the TF separation masks. The TF masks were derived from the posterior probabilities of the dominant components.

4.2 Evaluation measures

In order to provide a comprehensive evaluation of the separation algorithms presented in this study, a range of performance metrics have been included. These include the widely used BSS_EVAL toolkit [49], the Perceptual Evaluation of Speech Quality measure (PESQ) [50] and the objective measures in the Perceptual Evaluation methods for Audio Source Separation (PEASS) toolkit [51].

4.2.1 BSS_EVAL performance metrics

The first set of performance metrics was obtained from the publicly available MATLAB toolkit BSS_EVAL [49]. This set of metrics is applicable to all source separation approaches, and no prior information about the separation algorithm is required. However, the original toolkit does not account for environmental noise. To address this, an author of BSS_EVAL was consulted and the toolkit was modified to include two extra metrics: the signal-to-noise ratio (SNR) and the signal-to-interference-plus-noise ratio (SINR).

Using a least-squares projection, the BSS_EVAL toolkit assumes the decomposition of the estimated spatial image $\hat{s}_{mn}(t)$ as

$$\hat{s}_{mn}(t) = s_{mn}^{\mathrm{img}}(t) + e_{mn}^{\mathrm{spat}}(t) + e_{mn}^{\mathrm{interf}}(t) + e_{mn}^{\mathrm{artif}}(t) + e_{mn}^{\mathrm{noise}}(t), \qquad (25)$$

where $m$ is the observation index, $s_{mn}^{\mathrm{img}}(t)$ is the true source image, and $e_{mn}^{\mathrm{spat}}(t)$, $e_{mn}^{\mathrm{interf}}(t)$, $e_{mn}^{\mathrm{artif}}(t)$ and $e_{mn}^{\mathrm{noise}}(t)$ are distinct error components representing spatial distortion, interference, artifacts and noise, respectively.

From this decomposition, the SIR was computed as [52]

$$\mathrm{SIR}_n = 10 \log_{10} \frac{\sum_{m=1}^{M} \sum_{t} \left( s_{mn}^{\mathrm{img}}(t) + e_{mn}^{\mathrm{spat}}(t) \right)^{2}}{\sum_{m=1}^{M} \sum_{t} e_{mn}^{\mathrm{interf}}(t)^{2}} \qquad (26)$$

to provide an estimate of the relative amount of interference in the target source estimate.

The SINR was computed as

$$\mathrm{SINR}_n = 10 \log_{10} \frac{\sum_{m=1}^{M} \sum_{t} \left( s_{mn}^{\mathrm{img}}(t) + e_{mn}^{\mathrm{spat}}(t) \right)^{2}}{\sum_{m=1}^{M} \sum_{t} \left( e_{mn}^{\mathrm{noise}}(t) + e_{mn}^{\mathrm{interf}}(t) \right)^{2}} \qquad (27)$$

to reflect the amount of noise and interference in the recovered signal estimate.

The global SNR for the n th source was calculated as

$$\mathrm{SNR}_n = 10 \log_{10} \frac{\sum_{m=1}^{M} \sum_{t} \left( s_{mn}^{\mathrm{img}}(t) + e_{mn}^{\mathrm{spat}}(t) + e_{mn}^{\mathrm{interf}}(t) \right)^{2}}{\sum_{m=1}^{M} \sum_{t} e_{mn}^{\mathrm{noise}}(t)^{2}} \qquad (28)$$

which provides a measure of the amount of noise at the recovered signal, independent of the interference. For all ratios, a higher value indicates better separation performance.
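Given the decomposition of Eq. (25), the three energy ratios above can be computed as in the following sketch. This is an illustrative implementation of Eqs. (26) to (28), not the modified toolkit code, and the toy signals are assumptions:

```python
import numpy as np

def energy_ratios(s_img, e_spat, e_interf, e_noise):
    """Compute the SIR, SINR and SNR of Eqs. (26)-(28) from the
    BSS_EVAL-style decomposition. Each input has shape (M, T):
    M observation channels, T time samples."""
    def energy(x):  # total energy summed over channels and time
        return np.sum(x ** 2)
    target = s_img + e_spat
    sir = 10 * np.log10(energy(target) / energy(e_interf))
    sinr = 10 * np.log10(energy(target) / energy(e_noise + e_interf))
    snr = 10 * np.log10(energy(target + e_interf) / energy(e_noise))
    return sir, sinr, snr

# Toy example: unit target image, no spatial error, small
# interference and noise components (illustration only).
M, T = 2, 4
s_img = np.ones((M, T))
e_spat = np.zeros((M, T))
e_interf = 0.1 * np.ones((M, T))
e_noise = 0.1 * np.ones((M, T))
sir, sinr, snr = energy_ratios(s_img, e_spat, e_interf, e_noise)
```

With these toy components the target-to-interference energy ratio is 100, giving an SIR of 20 dB, consistent with "higher is better".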

4.2.2 PESQ

The PESQ measure was originally designed to provide an objective prediction of the subjective speech quality of a signal. Despite its initial intention for telecommunication applications, it has since been shown to be an effective predictor of the quality of the speech isolated from the observation mixtures by the separation algorithm [53], as well as of ASR performance on the separated speech signals [54].

The PESQ score is computed by a comparison of the original (unmixed, anechoic) speech source signal to the recovered signal estimate. Both signals are time-aligned and passed through an auditory transform to achieve a psychoacoustically motivated representation [55]. The differences between the signals in this representation are measured and used to provide an estimate of the distortion in the signal estimate. The final measure of PESQ is reported to correlate well with subjective listening scores [53].

The PESQ score can take on a range from 0.5 to 4.5, where 4.5 represents the case when the signal estimate is equivalent to the original (clean) source. A higher score suggests better speech quality.

4.2.3 PEASS

The PEASS toolkit was created to provide a set of objective scores to predict the perceptual quality of estimated sources. This is complementary to the energy-based ratios in the BSS_EVAL (cf. section 4.2.1), and the PEASS has since been implemented as a standard for performance evaluation in international speech challenges such as the signal separation evaluation campaign (SiSEC) [52, 56].

In this toolkit, the estimated signals are decomposed via a complex, auditory-motivated algorithm as [51]

$$\hat{s}_{n}(t) - s_{n}(t) = e^{\mathrm{target}}(t) + e^{\mathrm{interf}}(t) + e^{\mathrm{artif}}(t), \qquad (29)$$

where $s_{n}(t)$ is the original (clean) target signal, and the terms $e^{\mathrm{target}}(t)$, $e^{\mathrm{interf}}(t)$ and $e^{\mathrm{artif}}(t)$ denote the target distortion component, interference component and artifacts component, respectively. The salience of these error components is then measured using the perceptual similarity measure provided in the PEMO-Q auditory model [57]; the reader is referred to [51] for a detailed discussion.

The PEASS toolkit computes four auditory-motivated quality scores; however, the overall perceptual score (OPS) is considered a global measure of separation ability, as it indicates the similarity between the recovered signal estimate and the original signal and is reported to correlate highly with subjective perceptual evaluation. Therefore, in this study, the OPS is included as an additional performance metric for the perceptual quality of the speech. The OPS ranges from 0 to 100, where 100 denotes the best perceptual match.

4.3 Results

4.3.1 Initial evaluations of MENUET with FCM

Prior to evaluating the effectiveness of the FCM clustering for mask estimation in the MENUET framework, the FCM was evaluated in a simple stereo setup for a variety of feature sets in order to test its feasibility in this context. In [14, 15], a comprehensive review of suitable location cues was presented and their effectiveness at separation was evaluated using the HKM clustering for mask estimation.
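The FCM-based mask estimation at the heart of these evaluations can be sketched as follows. This is a minimal illustration of fuzzy c-means producing soft TF masks; the function name, fuzzifier value, fixed iteration count and synthetic features are assumptions, not the paper's implementation:

```python
import numpy as np

def fcm_soft_masks(features, n_sources, m=2.0, n_iter=50, seed=0):
    """Minimal fuzzy c-means sketch for soft TF mask estimation.

    Each row of `features` is the spatial feature vector of one TF
    cell; the returned fuzzy memberships (one column per cluster,
    i.e. per source) act directly as soft separation masks.
    """
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), n_sources, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance of every TF cell to every centroid.
        d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1) + 1e-12
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)).
        u = 1.0 / ((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1))).sum(-1)
        # Centroid update: weighted mean with fuzzified memberships.
        um = u ** m
        centroids = (um.T @ features) / um.sum(axis=0)[:, None]
    return u  # rows sum to 1; usable as soft masks over the TF plane

feats = np.random.default_rng(2).normal(size=(500, 2))
masks = fcm_soft_masks(feats, n_sources=3)
```

Replacing the soft memberships with a hard argmax assignment recovers a binary (HKM-style) mask, which is the comparison drawn throughout this section.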

The experimental setup for this set of evaluations replicated the original work in [14] as closely as possible. In an enclosure of dimensions 4.55 m × 3.55 m × 2.5 m, with the room reverberation parameter RT60 held constant at 128 ms, two omnidirectional microphones were placed 4 cm apart at an elevation of 1.2 m. Three speech sources, with a target-to-masker ratio of 0 dB, were situated at 30°, 70° and 135° at a distance of 50 cm from the array, also at an elevation of 1.2 m. The speech sources were randomly chosen from both genders of the TIMIT database in order to emulate the investigations in [14, 15], which utilized English utterances. The source separation performance was evaluated with respect to the improvement in SIR, and the results are depicted in Table 2.

Table 2 The hard k-means and fuzzy c-means are implemented for mask estimation

The original purpose of the evaluations over the range of features was to determine the effects of appropriate normalization upon the level and phase ratio features [14]. As expected, separation performance generally increases when the features are normalized to the same order of magnitude (see section 2.3). It is additionally observed from the measured SIR gain that the FCM clustering is more robust than the original HKM for all but one feature set, which hints at the possibility of the FCM yielding similar results for related TF BSS approaches. Not only does this confirm the suitability of the FCM in the proposed BSS framework, it also demonstrates the robustness of the FCM against several types of spatial features. The results of this investigation provide further motivation to extend the soft TF masking scheme to other sensor arrangements and adverse acoustic conditions.

However, in the original evaluations in [14], the authors also compared the performance of the HKM for the same stereo, three-speaker setup against the more robust GMM fitting clustering approach. The results demonstrated improvements in SIR gain in comparison to the HKM, although at the burden of significantly greater computational expense. Furthermore, the selection of the number of Gaussian components required considerable trial and error (cf. section 4.1). In order to offer a fair comparison of the FCM against other clustering techniques, the GMM fitting method was also implemented in the further BSS evaluations described in the following sections.

4.3.2 Separation in reverberant conditions

The study was extended to the underdetermined case of three sensors and four sources in a nonlinear configuration as in Figure 1 [14, 15]. The average improvement in SIR measured across all separated sources for all evaluations is depicted in Figure 2, where the average input SIR was measured at -4.20 dB (consistent with the studies in [14, 15]). It is immediately evident that the two soft masking techniques, GMM and FCM, improve the separation quality by a considerable amount. For example, in the anechoic scenario, the GMM and FCM clustering techniques perform equivalently, leading the HKM mask estimation by almost 10 dB. However, as the reverberation is increased to a mild 128 ms, a slight performance gap between the two soft masking techniques surfaces, with the FCM leading by approximately 2 dB. This gap widens considerably as the reverberation is increased further, to almost 7 dB. Interestingly, at this higher reverberation time, the GMM performs even below the HKM.

Figure 2

BSS results in reverberant and noise-free conditions. Source separation results compare three clustering techniques for mask estimation (HKM, GMM and FCM). Performance results are given with respect to SIR improvement (dB), where the average input SIR ≈ -4.20 dB. The error bars denote the standard deviation.

A smaller standard deviation is also observed in Figure 2 when FCM clustering is used. For example, when the reverberation is RT60 = 128 ms, the SIR performance using GMM clustering is comparable to that of FCM clustering. However, the standard deviation is more than twice that of the FCM clustering, and this suggests that the FCM delivers more consistent and reliable separation of the sources.

To evaluate the statistical significance of the evaluations, Student's t test was conducted for the three methods, with two tests per RT60 value: one comparing the FCM against the HKM, and one comparing the FCM against the GMM. A two-tailed distribution was assumed for each test, with unequal variances between the data. For the FCM against the HKM, a p value of p ≪ 0.001 was reported for all reverberation times. For the FCM against the GMM, at a reverberation time of RT60 = 0 ms, a p value of less than 0.1 (p = 0.094) was measured; for the remaining reverberation times, a p value of p ≪ 0.001 was recorded. This demonstrates that the performance of the proposed FCM mask estimation is highly unlikely to be due to chance. Therefore, the performance of the FCM clustering indicates a superior mask estimation technique for source separation in a reverberant enclosure.
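A two-tailed t test with unequal variances (Welch's test) can be run as in the following sketch; the per-mixture SIR values below are synthetic illustration data, not the paper's measurements:

```python
import numpy as np
from scipy import stats

# Synthetic per-mixture SIR improvements (dB) for two methods;
# illustration only, not the paper's measured values.
sir_fcm = np.array([11.2, 10.8, 11.9, 12.3, 10.5, 11.7])
sir_hkm = np.array([7.1, 6.4, 8.0, 7.5, 6.9, 7.7])

# Two-tailed Welch's t test (unequal variances), as assumed in the
# evaluations above.
t_stat, p_value = stats.ttest_ind(sir_fcm, sir_hkm, equal_var=False)
```

With such a clear separation between the two samples, the resulting p value falls well below the usual significance thresholds.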

4.3.3 Separation in reverberant conditions with spatially diffuse environmental noise

The effect of background noise was then evaluated for the BSS system in the presence of white, babble and factory noise, added to the mixtures as described in section 4.1. The numerical results are shown in Tables 3, 4 and 5 for a range of reverberation times, with similar trends reported for all types of corrupting noise. To provide a fair comparison against the reverberation-free case in Figure 2, the SIR gain is reported. However, for the SINR and SNR, the absolute measured ratio at the output is provided.

Table 3 Source separation results in an anechoic enclosure (cf. Figure 1) with background noise
Table 4 Source separation results in a reverberant enclosure (cf. Figure 1) with background noise
Table 5 Source separation results in a reverberant enclosure (cf. Figure 1) with background noise

It is firstly observed that for environmental SNRs of 25 dB and above, the measured SIR gain is approximately equivalent to that in the noise-free environment (Figure 2). However, as the level of noise is increased, a steady decline in SIR gain is recorded, as is to be expected. Interestingly, as previously observed in the separation results of section 4.3.2, the GMM mask estimation ability declines significantly with the introduction of more adverse conditions. For example, in the case of babble noise at a reverberation time of 128 ms, when the SNR is decreased from 25 to 20 dB, a difference in SIR of almost 5 dB is noted for the GMM. The HKM, by contrast, shows a difference of less than 1 dB, and the FCM of just 0.34 dB. Additionally, as was previously observed in the noise-free experiments (Figure 2), the GMM occasionally performs below the HKM clustering at the higher reverberation time of 300 ms.

The SINR behaves similarly to the SIR across all room reverberations and environmental SNRs. To gain an appreciation of any possible noise suppression characteristics of the MENUET and its modifications using the GMM/FCM, the SNR was measured and then averaged over all the recovered source signals. The results are generally as expected, with a decrease in gain as the level of noise and the reverberation time increase. However, as previously observed, there is often a notable decline in the performance of the GMM as the SNR drops below 20 dB and/or the room reverberation is increased.

The effects of noise in isolation from reverberation can be observed in Table 3, where the room reverberation is zero. For the FCM clustering, noise alone appears to have less of an impact upon separation ability than reverberation; for example, when the SNR is varied from 30 to 10 dB, the SIR gain changes by between 3 and 5 dB, and by just 1 dB in the case of babble noise. However, when comparing the SIR gains for the same SNRs across different reverberation times, there are significant differences, especially at RT60 = 300 ms. For example, for corrupting babble noise, the recorded SIR was 16.14 dB at RT60 = 0 ms, whereas at RT60 = 300 ms the SIR drops to 11.28 dB.

The PESQ was then evaluated on the recovered signals to provide a measure of the perceptual quality of the recovered source estimates. A general decrease in PESQ with increasing adversity of the conditions is noted, with the FCM for mask estimation yielding the highest scores. The effect of environmental SNR appears to be more detrimental than that of reverberation; for example, in the case of babble noise, the measured PESQ for the FCM method at a reverberation time of 0 ms and SNR of 30 dB is 2.84. When the room reverberation is increased to 300 ms, the measured PESQ is 2.50. However, when the reverberation is maintained at 0 ms and the SNR is decreased to 0 dB, the measured PESQ is 1.54. This reduction in PESQ is likely due to the decrease in the target signal amplitude and degraded time alignment in such noisy conditions, which leads to a source estimate of poorer quality.

The final performance metric implemented for this experimental setup was the OPS from the PEASS toolkit. Similar trends were observed in the OPS as with the other metrics, with a degradation in the achieved score as the hostility of the environment was increased. In this case also, the FCM demonstrated its superiority over the HKM and GMM clustering techniques.

4.3.4 SiSEC 2010 data

The proposed method was then evaluated with publicly available benchmark data of the SiSEC 2010 [56]. The development data (dev.zip) in “Source separation in the presence of real-world background noise” data sets was used. In this data set, two microphones were spaced at 8.6 cm, and noise signals were recorded in real-world noise environments: 'Cafeteria’ (Ca) and 'Square’ (Sq). The 'Cafeteria’ environment was stated as reverberant (with an unspecified reverberation time), whereas the 'Square’ had little or no reverberation [56]. The noise signals were recorded at two different positions within the environment, center (Ce; where noise is more isotropic), and corner (Co; where noise may not be very isotropic) [56]. For each of the noise environments, two different locations of the same environment were considered (A and B).

The recordings were 10 s long, with mixed English and Japanese utterances of both genders. The original recordings were sampled at 16 kHz; however, it was empirically determined that a downsample to 8 kHz resulted in better separation for all methods tested. This can be attributed to the reduced effects of spatial aliasing at the lower sampling frequency.
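The 16 to 8 kHz downsample mentioned above can be performed, for instance, with a polyphase resampler; this is a sketch under the assumption of a simple test tone, as the paper does not specify its resampling method:

```python
import numpy as np
from scipy.signal import resample_poly

fs_in, fs_out = 16000, 8000
t = np.arange(fs_in) / fs_in              # 1 s of signal at 16 kHz
x = np.sin(2 * np.pi * 440 * t)           # illustrative 440 Hz tone
# Polyphase resampling applies an anti-aliasing low-pass filter
# before decimation, halving the sampling rate to 8 kHz.
x_ds = resample_poly(x, fs_out, fs_in)
```

The anti-aliasing filtering inside `resample_poly` matters here: naive decimation (taking every second sample) would fold energy above 4 kHz back into the band of interest.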

For easy comparison against the published results of the SiSEC as available in [58], the same evaluation criteria for the “Source spatial image estimation” task were used. The estimated source image $\hat{s}_{mn}(t)$ is decomposed as

$$\hat{s}_{mn}(t) = s_{mn}^{\mathrm{img}}(t) + e_{mn}^{\mathrm{spat}}(t) + e_{mn}^{\mathrm{interf}}(t) + e_{mn}^{\mathrm{artif}}(t). \qquad (30)$$

Three energy ratios, the source image to spatial distortion ratio (ISR), signal to interference ratio (SIR) and the signal to artifact ratio (SAR), then measure the amount of spatial distortion, interference and artifacts in the recovered source estimates. These are expressed in dB as [52]

$$\mathrm{ISR}_n = 10 \log_{10} \frac{\sum_{m=1}^{M} \sum_{t} s_{mn}^{\mathrm{img}}(t)^{2}}{\sum_{m=1}^{M} \sum_{t} e_{mn}^{\mathrm{spat}}(t)^{2}} \qquad (31)$$

$$\mathrm{SIR}_n = 10 \log_{10} \frac{\sum_{m=1}^{M} \sum_{t} \left( s_{mn}^{\mathrm{img}}(t) + e_{mn}^{\mathrm{spat}}(t) \right)^{2}}{\sum_{m=1}^{M} \sum_{t} e_{mn}^{\mathrm{interf}}(t)^{2}} \qquad (32)$$

$$\mathrm{SAR}_n = 10 \log_{10} \frac{\sum_{m=1}^{M} \sum_{t} \left( s_{mn}^{\mathrm{img}}(t) + e_{mn}^{\mathrm{spat}}(t) + e_{mn}^{\mathrm{interf}}(t) \right)^{2}}{\sum_{m=1}^{M} \sum_{t} e_{mn}^{\mathrm{artif}}(t)^{2}}. \qquad (33)$$

The total error is captured in the signal-to-distortion ratio (SDR)

$$\mathrm{SDR}_n = 10 \log_{10} \frac{\sum_{m=1}^{M} \sum_{t} s_{mn}^{\mathrm{img}}(t)^{2}}{\sum_{m=1}^{M} \sum_{t} \left( e_{mn}^{\mathrm{spat}}(t) + e_{mn}^{\mathrm{interf}}(t) + e_{mn}^{\mathrm{artif}}(t) \right)^{2}}. \qquad (34)$$

The quality of the source signals was also evaluated with the PEASS toolkit as described in section 4.2.3. However, here all four ratios were included: the target-related perceptual score (TPS), interference-related perceptual score (IPS), artifact-related perceptual score (APS) and the OPS. The reader is referred to [51] for details.

Table 6 shows the average results per environmental condition, averaged across all available mixtures. This table can easily be compared against the results of the SiSEC 2010, in the table entitled “Average Results for 2 channels” in [58]. The individual results for each recording are displayed in Table 7. The reported results are at a similar performance level to those published in the SiSEC 2010 [58], despite the reduced SAR and APS ratios. An overall decline in performance in comparison to the simulated evaluations (Tables 3, 4 and 5) can be observed. A likely reason is the larger sensor spacing (8.6 cm compared to the 4 cm spacing in previous evaluations): for ideal phase measurements, the sensor spacing should be limited to below c/f_s, where c is the speed of sound and f_s is the sampling frequency [21]. Additionally, the fact that two sensors are used to retrieve the information, compared to three as in section 4.3.3, could contribute to the decrease in performance. The reduction of the feature space dimension may have lowered the capability of the clustering algorithm, making any clustering performance differences less apparent.
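The spacing rule of thumb quoted above can be checked with a quick calculation (assuming a speed of sound of 343 m/s, which is not stated in the text):

```python
# Sanity check on the spacing rule: for ideal phase measurements the
# sensor spacing should stay below c / f_s.
c = 343.0        # speed of sound in air (m/s); assumed value
fs = 8000.0      # sampling frequency after downsampling (Hz)
d_max = c / fs   # maximum spacing, about 0.043 m (4.3 cm)
print(d_max)
```

The SiSEC array spacing of 8.6 cm exceeds this roughly 4.3 cm limit, while the 4 cm spacing of the earlier simulated evaluations sits just inside it, which is consistent with the aliasing-related performance drop discussed above.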

Table 6 Average separation results for the SiSEC 2010 data
Table 7 Separation results for the SiSEC 2010 data

In general, the FCM for mask estimation proved the most robust. The GMM also achieved notable IPS values; however, the remaining ratios were not as high as those achieved with the FCM. For example, the OPS was consistently at its highest when the FCM was used for mask estimation. Interestingly, the location of the noise source (center or corner) did not appear to have a substantial effect on the separation ability. This suggests that the proposed algorithm is robust in both isotropic and non-isotropic noise conditions.

4.4 Discussion

The experimental results presented have demonstrated that the implementation of the FCM clustering for mask estimation with a nonlinear microphone array setup as in the MENUET renders superior separation performance in conditions where reverberation and/or environmental noise exist. The feasibility of the FCM clustering was initially tested on a range of spatial feature vectors in an underdetermined simulated setting using a linear stereo microphone array, and compared against the original baseline HKM of the MENUET algorithm. The successful outcome of this prompted further investigation, with a natural extension to a nonlinear microphone array. The GMM clustering algorithm was also implemented as an additional comparative measure to further assess the quality of the FCM in this context and also to compare the performance of alternative soft mask estimation schemes. Evaluations confirmed the superiority of the FCM with positive improvements recorded for the average performance in all acoustic settings, with its significance established by the Student’s t test. In addition to this, the consistent performance of the FCM even in increased reverberation establishes the potential of FCM within the TF mask estimation framework.

However, rather than solely focus upon the reverberant BSS problem, this study extended it to be inclusive of an additional source of observational error: environmental noise, which was modeled as spatially diffuse noise by a number of independent sources. Recordings in real-world conditions were also considered, with the publicly available benchmark data of the international SiSEC 2010 included in evaluations. It was proposed that due to the documented robustness of the FCM in mask estimation for reverberant BSS, the extension to the noisy reverberant case would demonstrate similar abilities. Detailed evaluations confirmed this hypothesis, with noteworthy separation performance using a range of performance metrics in both simulated and real-world conditions reported. A decline in performance was noted when real-world evaluations were considered, and this is attributed to the change in sensor and speaker configuration as well as the undesired effects of spatial aliasing.

In general, the soft mask estimation techniques outperformed the binary masking; however, as the level of reverberation and background noise increased, there was a distinct performance gap between the two leading soft masking approaches, FCM and GMM. Furthermore, in certain scenarios, the GMM was surpassed in performance by the HKM clustering.

The poor performance of the GMM for mask estimation can be attributed to the fact that GMMs are often used for generative modeling for supervised pattern recognition and classification, as opposed to the clustering techniques HKM/FCM which are designed for unsupervised data clustering. Additionally, in these evaluations, there is not a one-to-one correspondence between the number of Gaussian mixture components and the number of sources. Each data point in the feature set is assumed to originate from one of the component densities; therefore, a mismatch between the number of sources and components is a likely additional factor in the reduced performance in corrupted environments. Furthermore, it may be required to re-determine the optimal number of mixture components as the acoustic environment changes; however, this will prove a tedious task with the possibility of little benefit. It can then be concluded that such a statistical modeling paradigm as the GMM is not suitable when the acoustic environment is corrupted at a moderate to marked level as in this study, and perhaps distance metric-based methods such as the HKM/FCM are more appropriate.

Therefore, due to its reliability, consistency and robustness in mask estimation ability over a range of acoustic environments, the FCM algorithm is deduced as the most suitable data classification technique out of the three evaluated in this study for the purposes of mask estimation in this BSS framework.

4.5 Future research

Future research should focus on improving the robustness of the mask estimation (clustering) stage of the algorithm. For example, an alternative distance measure in the FCM could be considered: it has been shown that the Euclidean distance metric as employed in this study may not be robust to outliers, such as those originating from undesired interferences in the acoustic environment [59]. A measure such as the $\ell_1$-norm could be implemented in a bid to reduce error [21]. Additionally, the authors of [20, 21] also considered the implementation of observation weights and contextual information in an effort to emphasize the reliable features whilst simultaneously attenuating the unreliable ones. In such a study, a suitable metric is required to determine this reliability: consideration may be given to the behavior of proximate TF cells through a property such as variance [20].

An approach explored in [60] proposes an enhancement to the traditional FCM through the introduction of a membership (probability) constraint function, along with flexibility in the selection of the fuzzification parameter to better fit the end application. It was shown to possess better clustering power and robustness than the FCM, and thus remains a potential avenue for future research.

Furthermore, in a bid to move the presented BSS algorithm towards a truly blind and autonomous nature, the introduction of a source enumeration technique is suggested. The automatic detection of the number of clusters may prove to be of significance, as all three of the clustering techniques in this study require a priori knowledge of the number of sources. A modification to the FCM may suffice for enumeration; the authors of [61] describe two possible algorithms which employ a validation technique to automatically detect the optimum number of clusters to suit the data. Successful results of this technique have been reported within the BSS framework [16]. The inclusion of source enumeration into the presented study would pave the way towards a truly blind source separation system.

5 Conclusions

This study has presented an extension to the existing MENUET algorithm for underdetermined BSS in adverse environments. A non-exhaustive review of current TF-based BSS schemes was discussed with insight into the shortcomings associated with such techniques. In a bid to overcome these shortcomings, the substitution of the k-means clustering with the fuzzy c-means was proposed for the purposes of mask estimation for blind source separation. For an additional level of comparison, another soft clustering scheme based on Gaussian mixture models was also implemented.

It was suggested that a binary masking scheme for the mask estimation is inadequate at encapsulating the inevitable reverberation present in any acoustic setup, and thus a more suitable means for clustering the observation data, such as the fuzzy c-means, should be considered. The presented algorithm in this study integrated the c-means with the established MENUET technique for a range of acoustic conditions encompassing room reverberation and background noise.

In a number of experiments designed to evaluate the feasibility and performance of the c-means in the BSS context, the MENUET in conjunction with the FCM was found to outperform the original both in configurations ranging from a stereo (linear) microphone array to a nonlinear arrangement, and in anechoic as well as reverberant conditions. Furthermore, both simulated and real-world spatially diffuse background noise were included in the evaluations in order to better reflect realistic acoustic environments, and again, the FCM proved an improved approach for mask estimation. Comprehensive performance assessment was achieved through the inclusion of a wide range of standard evaluation metrics.

Future research should pursue improvements in the accuracy of the mask estimation via modifications to the fuzzy c-means, moving towards a more powerful and robust clustering algorithm. Furthermore, the evaluation of the BSS performance in alternative contexts, such as automatic speech recognition, should also be considered in order to gain greater perspective on its potential for implementation in real-life speech processing systems.

References

  1. 1.

    Lippmann R: Speech recognition by humans and machines. Speech Commun 1997, 22(1):1-15. 10.1016/S0167-6393(97)00021-6

    Article  Google Scholar 

  2. 2.

    Cherry EC: Some experiments on the recognition of speech, with one and with two ears. J. Acoust. Soc. Am 1953, 25(5):975-979. 10.1121/1.1907229

    Article  Google Scholar 

  3. 3.

    Coviello CM, Sibul LH: Blind source separation and beamforming: algebraic technique analysis. IEEE Trans. Aerosp. Electron. Syst 2004, 40(1):221-235. 10.1109/TAES.2004.1292155

    Article  Google Scholar 

  4. 4.

    Yılmaz O, Rickard S: Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process 2004, 52(7):1830-1847. 10.1109/TSP.2004.828896

    MathSciNet  Article  Google Scholar 

  5. 5.

    Georgiev P, Theis F, Cichocki A: Sparse component analysis and blind source separation of underdetermined mixtures. IEEE Trans. Neural Netw 2005, 16(4):992-996. 10.1109/TNN.2005.849840

    Article  Google Scholar 

  6. 6.

    Li G, Lutman M: Sparseness and speech perception in noise. In Proc. of the Int. Conf. on Spoken Lang. Process. Pittsburgh, PA; September 17-21, 2006.

    Google Scholar 

  7. 7.

    Abrard F, Deville Y: A time-frequency blind signal separation method applicable to underdetermined mixtures of dependent sources. Signal Process 2005, 85(7):1389-1403. 10.1016/j.sigpro.2005.02.010

    Article  MATH  Google Scholar 

  8. 8.

    Melia T, Rickard S: Underdetermined blind source separation in echoic environments using DESPRIT. EURASIP J. Adv. Signal. Process 2007, 2007: 1-19.

    Article  MATH  Google Scholar 

  9. 9.

    Roy R, Kailath T: ESPRIT - estimation of signal parameters via rotational invariance techniques. IEEE Trans. Acoust. Speech Signal Process 1989, 37(7):984-995. 10.1109/29.32276

    Article  MATH  Google Scholar 

  10. 10.

    Araki S, Makino S, Blin A, Mukai R, Sawada H: Underdetermined blind separation for speech in real environments with sparseness and ICA. In Proc. of the IEEE Int. Conf. on Acoust., Speech and Signal Process. Montreal, Quebec; May 17–21, 2004.

    Google Scholar 

  11. 11.

    Araki S, Sawada H, Mukai Y, Makino S: A novel blind source separation method with observation vector clustering. In Proc. of the Int. Workshop on Acoust. Echo and Noise Control. Eindhoven: High Tech Campus; September 12–15, 2005.

    Google Scholar 

  12. 12.

    Araki S, Sawada H, Mukai R, Makino S: Blind sparse source separation with spatially smoothed time-frequency masking. In Proc. of the Int. Workshop on Acoust. Echo and Noise Control. Paris, France; September 12-14, 2006.

    Google Scholar 

  13. 13.

    Araki S, Sawada H, Mukai R, Makino S: DOA estimation for multiple sparse sources with normalized observation vector clustering. In Proc. of the IEEE Int. Conf. on Acoust., Speech and Signal Process. Toulouse, France; May 14-19, 2006.

    Google Scholar 

  14. 14.

    Araki S, Sawada H, Mukai R, Makino S: Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors. Signal Process 2007, 87: 1833-1847. 10.1016/j.sigpro.2007.02.003

    Article  MATH  Google Scholar 

  15. 15.

    Araki S, Sawada H, Makino S: K-means based underdetermined blind speech separation. In Blind Speech Separation. Edited by: Makino S, Sawada H, Lee T-W. The Netherlands: Springer; 2007:243-270.

    Chapter  Google Scholar 

  16. 16.

    Reju VG, Koh SN, Soon IY: Underdetermined convolutive blind source separation via time-frequency masking. IEEE Trans. Audio Speech Lang. Process 2010, 18(1):101-116.

    Article  Google Scholar 

  17. 17.

    Sawada H, Araki S, Makino S: Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment. IEEE Trans. Audio Speech Lang. Process 2011, 19(3):516-527.

    Article  Google Scholar 

  18. 18.

    Han J, Kamber M: Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann; 2006.

    MATH  Google Scholar 

  19. Velmurugan T, Santhanam T: Performance evaluation of k-means and fuzzy c-means clustering algorithms for statistical distributions of input data points. Eur. J. Sci. Res 2010, 46(3):320-330.

  20. Kühne M, Togneri R, Nordholm S: Robust source localization in reverberant environments based on weighted fuzzy clustering. IEEE Signal Process. Lett 2009, 16(2):85-88.

  21. Kühne M, Togneri R, Nordholm S: A novel fuzzy clustering algorithm using observation weighting and context information for reverberant blind speech separation. Signal Process 2010, 90: 653-669. 10.1016/j.sigpro.2009.08.005

  22. Izumi Y, Ono N, Sagayama S: Sparseness-based 2ch BSS using the EM algorithm in reverberant environment. In Proc. of the IEEE Workshop on App. of Signal Process. to Audio and Acoust. New Paltz, New York; October 21-24, 2007.

  23. Mandel M, Ellis D, Jebara T: An EM algorithm for localizing multiple sound sources in reverberant environments. In Proc. of Annu. Conf. on Neural Inf. Process. Syst. Vancouver, Canada; December, 2006.

  24. Araki S, Nakatani T, Sawada H, Makino S: Blind sparse source separation for unknown number of sources using Gaussian mixture model fitting with Dirichlet prior. In Proc. of the IEEE Int. Conf. on Acoust., Speech and Signal Process. Taipei; April 19-24, 2009.

  25. Cichocki A, Kasprzak W, Amari S-I: Adaptive approach to blind source separation with cancellation of additive and convolutional noise. In Proc. of Int. Conf. on Signal Process. Beijing; October 14-18, 1996.

  26. Mitianoudis N, Davies M: Audio source separation of convolutive mixtures. IEEE Trans. Speech Audio Process 2003, 11(5):489-497. 10.1109/TSA.2003.815820

  27. Li H, Wang H, Xiao B: Blind separation of noisy mixed speech signals based on wavelet transform and independent component analysis. In Proc. of Int. Conf. on Signal Process. Beijing; November 16-20, 2006.

  28. Shi Z, Tan X, Jiang Z, Zhang H, Guo C: Noisy blind source separation by nonlinear autocorrelation. In Proc. of Int. Congr. on Image and Signal Process. Yantai; October 16-18, 2010.

  29. Aichner R: Acoustic blind source separation in reverberant and noisy environments. Ph.D. thesis, University of Erlangen-Nuremberg, 2007.

  30. Godsill S, Rayner P, Cappé O: Digital audio restoration. In Applications of Digital Signal Processing to Audio and Acoustics. Berlin: Kluwer Academic Publishers; 1997:133-193.

  31. Smaragdis P: Blind separation of convolved mixtures in the frequency domain. Neurocomputing 1998, 22: 21-34. 10.1016/S0925-2312(98)00047-2

  32. Sawada H, Araki S, Makino S: A two-stage frequency-domain blind source separation method for underdetermined convolutive mixtures. In Proc. of the IEEE Workshop on App. of Signal Process. to Audio and Acoust. New Paltz, New York; October 2007.

  33. Bezdek J: Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum Press; 1981.

  34. Rabiner L: Digital Processing of Speech Signals. New Jersey: Prentice-Hall; 1978.

  35. MacQueen JB: Some methods for classification and analysis of multivariate observations. In Proc. of the Berkeley Symp. on Math. Stat. and Probab. Vol. 1. Berkeley: University of California Press; 1967:281-297.

  36. Jafari I, Haque S, Togneri R, Nordholm S: Underdetermined blind source separation with fuzzy clustering for arbitrarily arranged sensors. In Proc. of Interspeech. Florence; August 27-31, 2011.

  37. Theodoridis S, Koutroumbas K: Pattern Recognition, 3rd edition. New York: Academic Press; 2006.

  38. Lehmann EA, Johansson AM: Prediction of energy decay in room impulse responses simulated with an image-source model. J. Acoust. Soc. Am 2008, 124(1):269-277. 10.1121/1.2936367

  39. Varga AP, Steeneken HJM, Tomlinson M, Jones D: The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Tech. Rep., DRA Speech Research Unit, 1992.

  40. Loizou PC: Speech Enhancement: Theory and Practice. Boca Raton: CRC Press; 2007.

  41. ITU-T: Objective measurement of active speech level. Tech. Rep., International Telecommunication Union, 1994.

  42. Fisher W, Doddington G, Goudie-Marshall K: The TIMIT-DARPA speech recognition research database: specification and status. In Proc. of the DARPA Workshop on Speech Recognit. Palo Alto, CA; February 19, 1986.

  43. Wang X-Y, Garibaldi JM: A comparison of fuzzy and non-fuzzy clustering techniques in cancer diagnosis. In Proc. of the Int. Conf. in Comput. Intell. in Med. and Healthcare. UNINOVA, Portugal; June 29 - July 1, 2005.

  44. Jipkate BR, Gohokar VV: A comparative analysis of fuzzy c-means clustering and k-means clustering algorithms. Int. J. Comput. Eng 2012, 2(3):737-739.

  45. Arthur D, Vassilvitskii S: K-means++: the advantages of careful seeding. In Proc. of the Annu. ACM-SIAM Symp. on Discrete Algorithms. New Orleans, Louisiana; January 7-9, 2007.

  46. Jain AK: Data clustering: 50 years beyond k-means. Pattern Recognit. Lett 2010, 31(8):651-666. 10.1016/j.patrec.2009.09.011

  47. Hamerly G, Elkan C: Alternatives to the k-means algorithm that find better clusterings. In Proc. of the Int. Conf. on Inf. and Knowledge Manage. McLean, VA; November 4-9, 2002.

  48. Pena JM, Lozano JA, Larranaga P: An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognit. Lett 1999, 20: 1027-1040. 10.1016/S0167-8655(99)00069-0

  49. Vincent E, Gribonval R, Fevotte C: Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process 2006, 14(4):1462-1469.

  50. Rix AW, Beerends JG, Hollier MP, Hekstra AP: Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. In Proc. of the IEEE Int. Conf. on Acoust., Speech and Signal Process. Salt Lake City, UT; May 7-11, 2001.

  51. Emiya V, Vincent E, Harlander N, Hohmann V: Subjective and objective quality assessment of audio source separation. IEEE Trans. Audio Speech Lang. Process 2011, 19(7):2046-2057.

  52. Vincent E, Araki S, Theis F, Nolte G, Bofill P, Sawada H, Ozerov A, Gowreesunker BV, Lutter D, Duong NQK: The signal separation evaluation campaign (2007–2010): achievements and remaining challenges. Signal Process 2012, 92: 1928-1936. 10.1016/j.sigpro.2011.10.007

  53. Hu Y, Loizou PC: Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process 2008, 16(1):229-238.

  54. Di Persia L, Milone D, Rufiner HL, Yanagida M: Perceptual evaluation of blind source separation for robust speech recognition. Signal Process 2008, 88(10):2578-2583. 10.1016/j.sigpro.2008.04.006

  55. Mandel MI, Bressler S, Shinn-Cunningham B, Ellis DPW: Evaluating source separation algorithms with reverberant speech. IEEE Trans. Audio Speech Lang. Process 2010, 18(7):1872-1883.

  56. Araki S, Ozerov A, Gowreesunker BV, Sawada H, Theis FJ, Nolte G, Lutter D, Duong NQK: The 2010 signal separation evaluation campaign (SiSEC2010): audio source separation. In Proc. of Int. Conf. on Latent Variable Anal. and Signal Sep. St. Malo, France; September 27-30, 2010.

  57. Huber R, Kollmeier B: PEMO-Q - a new method for objective audio quality assessment using a model of auditory perception. IEEE Trans. Audio Speech Lang. Process 2006, 14(6):1902-1911.

  58. Source separation in the presence of real-world background noise: test database for 2 channels case [online]. http://www.irisa.fr/metiss/SiSEC10/noise/SiSEC2010_diffuse_noise_2ch.html, 2010.

  59. Hathaway RJ, Bezdek JC, Yingkang H: Generalized fuzzy c-means clustering strategies using Lp norm distances. IEEE Trans. Fuzzy Syst 2000, 8(5):576-582. 10.1109/91.873580

  60. Zhu L, Chung FL, Wang S: Generalized fuzzy c-means clustering algorithm with improved fuzzy partitions. IEEE Trans. Syst. Man Cybern 2009, 39(3):578-591.

  61. Sun H, Wang W, Zhang X, Li Y: FCM-based model selection algorithms for determining the number of clusters. Pattern Recognit 2004, 37: 2027-2037. 10.1016/j.patcog.2004.03.012


Acknowledgements

The authors would like to thank Dr. Emmanuel Vincent for his helpful discussions on the inclusion of environmental noise for use with his performance evaluation BSS_EVAL toolkit. This research is partly funded by the Australian Research Council Grant No. DP1096348.

Author information


Corresponding author

Correspondence to Ingrid Jafari.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

IJ developed the original concept, constructed the software implementation, performed the reported experimentation, and wrote the manuscript. SH assisted with the implementation of the Gaussian mixture models, and reviewed the manuscript. RT and SN reviewed the manuscript. All authors read and approved the final manuscript.


Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 Generic License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



Cite this article

Jafari, I., Haque, S., Togneri, R. et al. Evaluations on underdetermined blind source separation in adverse environments using time-frequency masking. EURASIP J. Adv. Signal Process. 2013, 162 (2013). https://doi.org/10.1186/1687-6180-2013-162


Keywords

  • Blind source separation
  • Fuzzy c-means clustering
  • Time-frequency masking
  • Reverberation
  • Background noise