Evaluations on underdetermined blind source separation in adverse environments using time-frequency masking

The successful implementation of speech processing systems in the real world depends on their ability to handle adverse acoustic conditions with undesirable factors such as room reverberation and background noise. In this study, an extension to the established multiple sensors degenerate unmixing estimation technique (MENUET) algorithm for blind source separation is proposed, based on fuzzy c-means clustering, to improve separation ability in underdetermined situations using a nonlinear microphone array. Rather than testing the blind source separation ability solely in reverberant conditions, this paper extends the evaluation to a variety of simulated and real-world noisy environments. The results show encouraging separation ability and improved perceptual quality of the separated sources in such adverse conditions. Not only does this establish the proposed methodology as a credible improvement to the system, but it also implies further applicability in areas such as noise suppression in adverse acoustic environments.


Introduction
The ability of the human cognitive system to distinguish between multiple, simultaneously active sources of sound is a remarkable quality that is often taken for granted. This capability has been studied extensively within the speech processing community, and many an endeavor at imitation has been made. However, automatic speech processing systems are yet to perform at a level akin to human proficiency [1] and are thus frequently faced with the quintessential 'cocktail party problem': the inadequacy in the processing of the target speaker/s when there are multiple speakers in the scene [2]. The implementation of a suitable source separation algorithm can improve the performance of such systems, where source separation is the recovery of the original sources from a set of mixed observations. If no a priori information of the original sources and/or mixing process is available, it is termed blind source separation (BSS). Rather than rely on the availability of such a priori information, BSS methods often exploit an assumption on the constituent source signals and utilize spatial diversity obtained from the sensor observations. BSS has many important applications in both the audio and biosignal disciplines, including medical imaging and communication systems.
*Correspondence: jafari01@student.uwa.edu.au. School of Electrical, Electronic and Computer Engineering, The University of Western Australia, Crawley WA 6009, Australia.
In the last decade, the research field of BSS has evolved significantly to be an important technique in acoustic signal processing [3]. More specifically, the concept of time-frequency (TF) masking in the context of BSS has been of significance due to its applicability to all BSS scenarios, in particular the underdetermined case, where there exist more sources than sensors. In the TF masking approach to BSS, the assumption of sparseness between the speech sources is typically exploited as initiated in [4]. There exist several definitions for sparseness in the literature; for example, [5] simply defines sparseness as containing as 'many zeros as possible', whereas others offer a more quantifiable measure such as kurtosis [6]. Often, a sparse representation of speech mixtures can be acquired through the projection of the signals onto an appropriate basis, such as the Gabor or Fourier basis. In particular, the W-disjoint orthogonality (W-DO) of speech signals was explored for the short-time Fourier transform (STFT) domain, where the sparseness implies that the STFT supports of the signals are disjoint. This significant discovery motivated the degenerate unmixing estimation technique (DUET) [4]. The DUET proposed a demixing approach based on the formation of TF masks, where each mask would essentially correspond to the indicator function for the support of the source signal. The DUET algorithm successfully recovered the original source signals from stereo microphone observations using estimates of the relative attenuation and phase parameters.
http://asp.eurasipjournals.com/content/2013/1/162
The DUET algorithm consequently stimulated a plethora of demixing techniques. Among the first extensions to the DUET was the TF ratio of mixtures (TIFROM) algorithm, which relaxed the sparseness assumption; however, its performance was limited to anechoic conditions with the observations idealized to be of the linear and instantaneous case [7]. Subsequent research extended the DUET to echoic conditions with the use of the estimation of signal parameters via rotational invariance technique (ESPRIT) method to form the DUET-ESPRIT algorithm [8,9]. However, this was restricted to a linear microphone arrangement and was thus subject to front-back confusions, primarily due to the natural constraint in spatial diversity from the microphone observations.
A different avenue of research as in [10] composed a two-stage algorithm which combined the sparseness principle presented in DUET with the established independent component analysis (ICA) algorithm to yield the sparseness and ICA (SPICA) algorithm. This approach exploited the sparseness of the signals to estimate and remove the active speech source at a particular TF point, and ICA was then applied to the remaining mixtures. Naturally, a restraint upon the number of sources present at any TF point relative to the number of sensors was inevitable due to the ICA stage. Furthermore, the algorithm was only investigated for the stereo case.
The authors of the SPICA expanded their research to nonlinear microphone arrays in [11-13] with the introduction of the clustering of normalized observation vectors. Whilst remaining similar in spirit to the DUET, the research was inclusive of non-ideal conditions such as room reverberation, and allowed more than two sensors in an arbitrary arrangement. This eventually culminated in the development of the multiple sensors degenerate unmixing estimation technique, termed MENUET [14,15]. Additionally, the mask estimation in MENUET was automated through the application of the k-means clustering technique. Another algorithm which proposes the use of a clustering approach for the mask estimation is presented in [16]: this study is based upon the concept of Hermitian angles between the reference vector and observation vectors, in the complex vector space. However, evaluations were restricted to a linear microphone array.
Advancements in the TF masking approaches to BSS beyond MENUET involve additional stages and complexities. Of particular mention is the approach in [17], which resulted in superior BSS performance in underdetermined reverberant conditions. The algorithm employed a two-stage approach: firstly, observation vectors are clustered in a frequency bin-wise manner, and secondly, the separated frequency bin components classified as originating from the same source are grouped together. The benefit of this approach is that, due to the bin-wise clustering, it is more robust against higher room reverberation than previous techniques such as MENUET, and it possesses an inherent immunity to the spatial aliasing problem in the measurement of the time differences of arrival/directions of arrival [17]. However, despite the reported improvements in BSS performance, additional complexity was introduced due to the extra stage for the alignment of the frequency bin-wise permuted clustering results. Therefore, the MENUET has the advantage over the state-of-the-art study in [17] in that the fullband clustering for mask estimation eliminates the requirement for the additional stage of frequency bin-wise alignment.
However, the simplicity encapsulated in the MENUET inevitably presents its own limitations. Most significantly, the k-means clustering utilized for mask estimation is not highly robust in the presence of outliers or interference in the data. This often leads to non-optimal localization and partitioning results, particularly for reverberant mixtures [18,19]. Furthermore, binary masking schemes have been shown to impair the separation quality due to musical noise distortions, and it was suggested that fuzzy masking approaches bear the potential to significantly reduce the musical noise at the output [12]. This may be attributed to the fact that when a hard partitioning approach is implemented, abrupt changes exist in the recovered source estimate, which consequently introduce artifacts in the time domain.
The suitability of fuzzy c-means (FCM) clustering for TF mask estimation in the BSS framework has been explored in [20,21]. In this approach, the fuzzy partitioning in the c-means was suggested to be preferable to hard clustering due to the inherent ambiguity surrounding the membership of TF cells to a cluster, where examples of contributing factors to ambiguity include the effects of reverberation and environmental (background) noise. However, the investigations to date which employ the FCM, as with many others in the literature, have been restricted to a linear and overdetermined microphone arrangement.
Another soft clustering approach which has received attention in the BSS field lies within Gaussian mixture model (GMM)-based approaches [22-24]. This avenue of research is motivated by the intuitive notion that the individual component densities of the GMM may model some underlying set of hidden parameters in a mixture of sources. Due to the reported success of BSS methods that employ such Gaussian models, this clustering paradigm may be considered as a standard algorithm for comparison of mask estimation ability in the TF BSS framework, and is therefore investigated and regarded as a comparative model in this study.
However, each of the TF mask estimation approaches to BSS discussed above is limited in its evaluations in that diverse sources of interference are not considered. Potential contributors to interference in BSS scenarios include not only room reverberation, but also environmental background noise, or noise originating from non-ideal recording sensors. In fact, almost all real-world applications of BSS have the inconvenient aspect of noise at the recording sensors [25], and the influence of such noise has been described as a very difficult and continually open problem in the BSS framework [26].
In general, the focus of BSS algorithms is not directed towards the suppression of environmental noise. However, for a system to achieve optimal performance, the impact of such noise must be addressed. Numerous approaches have been proposed in the literature for the problem of additive sensor noise: Li et al. [27] present a two-stage denoising/separation algorithm; Cichocki et al. [25] implement a FIR filter at each channel to reduce the effects of additive noise; and Shi et al. [28] suggest a preprocessing whitening procedure for enhancement. The study in [29] considers a variety of common sources of background noise in the separation algorithm, and modifies numerous pre- and post-processing algorithms in order to account for the characteristics of the background noise. Whilst noise reduction has been achieved with denoising techniques implemented as a pre- or post-processing step, the performance was shown to degrade significantly at lower signal-to-noise ratios [30].
Within the TF BSS framework, the authors of [22] include the possibility of background noise in the observation error for their BSS model; however, the experimental simulations were only conducted for anechoic/reverberant conditions, without any clear distinction between environmental noise and reverberation in the observation error.
Motivated by such various shortcomings, this work presents an extension to the MENUET algorithm through the use of an alternative clustering scheme for mask estimation, and provides comprehensive evaluations in adverse acoustic conditions. Firstly, this study proposes that the substitution of the TF clustering stage with a fuzzy clustering approach as explored in [20,21] will improve the separation performance in the same conditions as presented in [14,15]. Secondly, it is hypothesized that this combination is sufficiently robust to withstand the degrading effects of reverberation and environmental noise, and evaluations of all the methods under the challenging conditions of reverberation and environmental background noise are presented. For all investigations in the study, comparisons are provided with both the original MENUET k-means and the standard soft GMM-based clustering algorithm for mask estimation.
The remainder of this paper is organized as follows: section 2 provides an overview of the proposed BSS scheme and explains the primary signal processing stages. Section 3 describes each of the three clustering schemes in greater detail. Section 4 explains the experimental evaluation and presents a discussion on the achieved results. The section also includes the existing limitations with the system and offers some potential avenues for future work. Section 5 concludes the paper with a brief summary.

Problem statement
Consider a microphone array of M identical sensors in a reverberant enclosure where N sources are present. A convolutive mixing model is assumed, whereby the observation at the mth sensor, $x_m(t)$, can be modeled as a summation of the individual contributions by the nth active source, $s_n(t)$.
When all N sources are active, the observation at the mth sensor can be expressed via the convolutive mixing model as

$$x_m(t) = \sum_{n=1}^{N} \sum_{p=0}^{P-1} h_{mn}(p)\, s_n(t-p) + n_m(t), \tag{1}$$

where $h_{mn}(p)$, $p = 0, \ldots, P-1$, denote the coefficients of the room impulse response between the nth source and the mth sensor, $n_m(t)$ denotes any additive noise received at the mth sensor and t indicates time. The goal of any BSS system is therefore to recover the estimates $\hat{s}_1, \ldots, \hat{s}_N$, each of which corresponds to the original source signals $s_1, \ldots, s_N$, respectively. Ideally, the separation is performed without any information about $s_n(t)$ and $h_{mn}(p)$.
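As an illustration of the convolutive mixing model described above, the following minimal NumPy sketch mixes N toy sources through M x N impulse responses and optionally adds sensor noise. The function name `convolutive_mix`, the random impulse responses and the signal lengths are hypothetical choices made only for this example; they are not the simulation settings of the study.

```python
import numpy as np

def convolutive_mix(sources, rirs, noise=None):
    """Mix sources through impulse responses (illustrative helper, not from the paper).

    sources: (N, T) array of source signals s_n(t)
    rirs:    (M, N, P) array of impulse responses h_mn(p)
    noise:   optional (M, T) array of additive sensor noise n_m(t)
    returns: (M, T) array of sensor observations x_m(t)
    """
    N, T = sources.shape
    M = rirs.shape[0]
    x = np.zeros((M, T))
    for m in range(M):
        for n in range(N):
            # Full convolution truncated to the observation length T.
            x[m] += np.convolve(sources[n], rirs[m, n])[:T]
    if noise is not None:
        x += noise
    return x

rng = np.random.default_rng(0)
s = rng.standard_normal((4, 1000))           # N = 4 toy sources (assumed)
h = 0.1 * rng.standard_normal((3, 4, 16))    # M = 3 sensors, P = 16 taps (assumed)
h[:, :, 0] = 1.0                             # unit direct path
x = convolutive_mix(s, h)
```

With N = 4 sources and M = 3 sensors, the toy mixture is underdetermined, which is the situation of interest in this work.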

STFT analysis
The time-domain sensor observations are converted into their corresponding frequency-domain time series $X_m(k,l)$ via the STFT as

$$X_m(k,l) = \sum_{\tau=0}^{L-1} \mathrm{win}(\tau)\, x_m(\tau + k\tau_0)\, e^{-j l \omega_0 \tau}, \quad m = 1, \ldots, M, \tag{2}$$

where $k \in \{0, \ldots, K-1\}$ is a time frame index, $l \in \{0, \ldots, L-1\}$ is a frequency bin index, $\mathrm{win}(\tau)$ is an appropriately selected window function and $\tau_0$ and $\omega_0$ are the TF grid resolution parameters. The analysis window is typically chosen such that sufficient information is retained within each frame whilst simultaneously reducing signal discontinuities at the edges. A suitable window is the Hann window:

$$\mathrm{win}(\tau) = 0.5 - 0.5\cos\left(\frac{2\pi\tau}{L}\right), \quad \tau = 0, \ldots, L-1, \tag{3}$$

where L denotes the frame size. It is assumed that L is sufficiently long that the main portion of the impulse responses $h_{mn}$ is covered. Therefore, the convolutive BSS problem may be approximated as an instantaneous mixture model [31] in the STFT domain:

$$X_m(k,l) \approx \sum_{n=1}^{N} H_{mn}(l)\, S_n(k,l) + N_m(k,l), \tag{4}$$

where (k, l) represent the time and frequency indices, respectively, and $H_{mn}(l)$ is the frequency response corresponding to the room impulse response between source n and sensor m. $S_n(k,l)$, $X_m(k,l)$ and $N_m(k,l)$ are the STFT of the nth source, mth observation and additive noise at the mth sensor, respectively.
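The STFT analysis with a Hann window can be sketched in a few lines of NumPy. The frame size of 512 samples, the frame shift of 128 samples and the 8 kHz test tone below are assumptions chosen only to make the example concrete; the helper names `hann` and `stft` are likewise hypothetical.

```python
import numpy as np

def hann(L):
    """Hann window win(tau) = 0.5 - 0.5*cos(2*pi*tau/L), tau = 0..L-1."""
    tau = np.arange(L)
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * tau / L)

def stft(x, L=512, hop=128):
    """Return X[k, l] with K time frames and L frequency bins.

    L (frame size) and hop (frame shift, tau_0) are assumed example values.
    """
    win = hann(L)
    K = 1 + (len(x) - L) // hop
    frames = np.stack([x[k * hop:k * hop + L] * win for k in range(K)])
    # np.fft.fft evaluates sum_tau frame[tau] * exp(-2j*pi*l*tau/L) per frame.
    return np.fft.fft(frames, axis=1)

fs = 8000                              # assumed sampling rate in Hz
t = np.arange(fs) / fs
x = np.sin(2.0 * np.pi * 1000.0 * t)   # 1 kHz test tone
X = stft(x)
peak_bin = int(np.argmax(np.abs(X[0, :256])))
```

For this tone the energy concentrates around bin 1000 * 512 / 8000 = 64, which gives a quick sanity check of the frequency axis.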
The assumption of sparseness between the source signals implies that at each TF cell, at most one source is dominant [4]. Therefore, (4) can be expressed as

$$X_m(k,l) \approx \sum_{n=1}^{N} H_{mn}(l)\, S_n(k,l)\, \delta_n(k,l) + N_m(k,l), \tag{5}$$

where $\delta_n(k,l)$ is an indicator function defined as $\delta_n(k,l) = 1$ when $S_n(k,l)$ is active at (k, l), and 0 otherwise.
Whilst this sparseness assumption holds true for anechoic mixtures, as the reverberation and/or environmental noise in the acoustic scene increases it becomes increasingly unreliable due to the effects of multipath audio propagation and multiple reflections [4,21].

Feature extraction
In this work, the TF mask estimation is realized through the estimation of the TF points where a signal is assumed dominant. To estimate such TF points, a spatial feature vector is calculated from the STFT representations of the M observations. Previous studies [14,15] have identified level ratios and phase differences between the observations as appropriate features, as such features retain information on the magnitude and the argument of the TF points. Further discussion is presented in section 4.3.1.
The feature vector $\theta(k,l) = [\theta^L(k,l), \theta^P(k,l)]^T$ per TF point is estimated as

$$\theta^L(k,l) = \left[\frac{|X_1(k,l)|}{A(k,l)}, \ldots, \frac{|X_M(k,l)|}{A(k,l)}\right], \tag{6}$$

$$\theta^P(k,l) = \left[\frac{1}{\alpha}\arg\left(\frac{X_1(k,l)}{X_J(k,l)}\right), \ldots, \frac{1}{\alpha}\arg\left(\frac{X_M(k,l)}{X_J(k,l)}\right)\right], \tag{7}$$

$$A(k,l) = \sqrt{\sum_{m=1}^{M} |X_m(k,l)|^2}, \quad \alpha = 4\pi f c^{-1} d_{max}, \tag{8}$$

where f is the frequency at the lth frequency bin index, c is the propagation velocity of sound, $d_{max}$ is the maximum distance between any two sensors in the array and J is the index of the (arbitrarily selected) reference sensor. The weighting parameters A(k, l) and α ensure appropriate amplitude and phase normalization of the features, respectively. It is widely known that in the presence of reverberation, a greater accuracy in phase ratio measurements can be achieved with higher spatial resolution; however, it should be noted that the value of $d_{max}$ is upper bounded by the spatial aliasing theorem [14,17,21]. If the exact value of the maximum sensor spacing is not known, a positive constant may be used in its place [14]. This eliminates the need for the system to know the precise spacing between sensors. The frequency normalization in (8) ensures frequency independence of the phase ratios in order to prevent the frequency permutation problem in the later stages of clustering. It is possible to cluster without such frequency independence by implementing a bin-wise clustering as in [17,32]. However, the utilization of all the frequency bins avoids the frequency permutation problem and also permits data observations of short length [14].
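The level-ratio and frequency-normalized phase-difference features can be computed for all TF points at once. The sketch below assumes the MENUET-style normalization by the observation norm A(k, l) and by α = 4πf d_max / c; the function name `menuet_features`, the speed of sound of 340 m/s, the handling of the DC bin and the toy array sizes are assumptions of this example.

```python
import numpy as np

def menuet_features(X, freqs, d_max, c=340.0, ref=0):
    """Level-ratio and normalized phase-difference features (illustrative sketch).

    X:     (M, K, L) complex STFTs of the M observations
    freqs: (L,) frequency in Hz of each bin
    d_max: maximum sensor spacing in metres
    ref:   index J of the reference sensor
    returns: (K, L, 2M) feature vectors [theta_L, theta_P]
    """
    eps = 1e-12                                           # guards against division by zero
    A = np.sqrt(np.sum(np.abs(X) ** 2, axis=0)) + eps     # (K, L) amplitude normalization
    level = np.abs(X) / A                                 # (M, K, L) level ratios
    alpha = 4.0 * np.pi * freqs * d_max / c               # (L,) frequency normalization
    alpha = np.where(alpha > 0, alpha, np.inf)            # assumed: DC bin maps to 0
    # arg(X_m / X_J) equals the angle of X_m * conj(X_J).
    phase = np.angle(X * np.conj(X[ref])) / alpha         # (M, K, L)
    return np.moveaxis(np.concatenate([level, phase]), 0, -1)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5, 8)) + 1j * rng.standard_normal((3, 5, 8))
freqs = np.arange(8) * 8000.0 / 16.0                      # toy frequency axis (assumed)
theta = menuet_features(X, freqs, d_max=0.04)
```

The phase feature of the reference sensor is identically zero, so in practice it carries no information and may be dropped before clustering.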

Mask estimation and separation
In this work, source separation is effected through the estimation and application of TF masks, which are estimated in the clustering stage. For the k-means algorithm, a binary mask for the nth source is simply estimated as [14]

$$M_n(k,l) = \begin{cases} 1 & \text{for } \theta(k,l) \in C_n, \\ 0 & \text{otherwise,} \end{cases} \tag{9}$$

where $C_n$ denotes the set of TF points classified as belonging to the nth cluster. The output of the FCM clustering is a fuzzy membership partition matrix [21,33]. This partition matrix indicates the degree of membership of each TF point in the feature space to each of the N clusters. These membership values, denoted by $u_n(k,l)$, are then interpreted as a collection of N TF masks:

$$M_n(k,l) = u_n(k,l). \tag{10}$$

For the GMM clustering approach, the mask is set to the posterior probabilities of the dominant Gaussian components (cf. section 3.2) [22,23]. This equates to

$$M_n(k,l) = p(\theta(k,l) \mid \mu_p, \Sigma_p), \tag{11}$$

where $\mu_p$ and $\Sigma_p$ denote the mean and covariance matrix of the pth Gaussian component of the mixture model. The spatial image estimate of the nth signal received at the mth sensor is then obtained through the application of mask $M_n$ to the mth observation as [17]

$$\hat{S}_{mn}(k,l) = M_n(k,l)\, X_m(k,l). \tag{12}$$
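Whichever clustering scheme produces the masks, applying them is an element-wise product with each observation. The following sketch builds binary masks from hard cluster labels and applies either binary or soft masks; the helper names `binary_masks` and `apply_masks` and the random toy data are hypothetical.

```python
import numpy as np

def binary_masks(labels, N):
    """labels: (K, L) integer cluster index per TF point -> (N, K, L) masks of 0/1."""
    return (labels[None] == np.arange(N)[:, None, None]).astype(float)

def apply_masks(masks, X):
    """masks: (N, K, L); X: (M, K, L) observations -> (M, N, K, L) spatial images."""
    return masks[None, :, :, :] * X[:, None, :, :]

rng = np.random.default_rng(0)
M, N, K, L = 3, 4, 6, 8
X = rng.standard_normal((M, K, L)) + 1j * rng.standard_normal((M, K, L))

labels = rng.integers(0, N, size=(K, L))     # stand-in for a hard clustering result
hard = binary_masks(labels, N)

U = rng.random((N, K, L))                    # stand-in for fuzzy memberships
U /= U.sum(axis=0)                           # memberships sum to one per TF point

S_hard = apply_masks(hard, X)
S_soft = apply_masks(U, X)
```

When the masks sum to one at every TF point, as they do for binary masks and for fuzzy memberships, the N spatial images sum back to the observation.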

Source resynthesis
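Finally, the estimated source images are reconstructed in the time domain to obtain the estimates $\hat{s}_{mn}(t)$. This is realized by applying the overlap-and-add method [34] to $\hat{S}_{mn}(k,l)$. The reconstructed estimate is

$$\hat{s}_{mn}(t) = \frac{1}{C_{win}} \sum_{k=0}^{K-1} \hat{s}^{k}_{mn}(t), \tag{13}$$

where $C_{win} = (0.5/\tau_0) L$ is a Hann window function constant, and the individual frequency components of the recovered signal are acquired through an inverse STFT

$$\hat{s}^{k}_{mn}(t) = \sum_{l=0}^{L-1} \hat{S}_{mn}(k,l)\, e^{j l \omega_0 (t - k\tau_0)} \tag{14}$$

if $k\tau_0 \le t \le k\tau_0 + L - 1$, and zero otherwise.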
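A minimal overlap-and-add resynthesis is sketched below, paired with a matching Hann-windowed STFT so that the round trip can be checked. The frame size of 256 and frame shift of 64 are assumed values, and the helper names are hypothetical. Note that `np.fft.ifft` already applies a 1/L normalization, so the only remaining constant is the sum of the overlapping Hann windows, 0.5 L / hop.

```python
import numpy as np

def hann(L):
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(L) / L)

def stft(x, L, hop):
    K = 1 + (len(x) - L) // hop
    frames = np.stack([x[k * hop:k * hop + L] * hann(L) for k in range(K)])
    return np.fft.fft(frames, axis=1)

def istft_ola(S, hop, length):
    """Overlap-and-add resynthesis of a (K, L) STFT analysed with a Hann window."""
    K, L = S.shape
    c_win = 0.5 * L / hop                    # sum of overlapping Hann windows
    out = np.zeros(length)
    for k in range(K):
        # Frame k only contributes to samples k*hop .. k*hop + L - 1.
        out[k * hop:k * hop + L] += np.fft.ifft(S[k]).real
    return out / c_win

rng = np.random.default_rng(0)
x = rng.standard_normal(2048)
y = istft_ola(stft(x, L=256, hop=64), hop=64, length=2048)
```

Samples near the start and end are covered by fewer than L / hop frames, so only the interior of the signal is reconstructed exactly in this sketch.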

Clustering approaches
This section presents the details of the three clustering techniques employed in this study. The first two, the hard k-means and the Gaussian mixture model, have previously been used in other TF-based clustering BSS systems [14,24], whilst the fuzzy c-means is the proposed mask estimation technique. All three techniques belong to the family of center-based clustering, and each has its own objective function. The common goal of all is the classification of the set of feature vectors, $\Theta = \{\theta(k,l) \mid (k,l) \in \Omega\}$, where $\Omega$ denotes the set of TF points in the STFT plane, into N clusters. In the instance where the clusters are distinct, as with the hard k-means, each data point may only belong to one cluster. However, for the soft clustering techniques, each data element may belong to multiple clusters with a certain probability (membership).

Hard k-means clustering
Previous mask estimation methods as in [13-16] employ binary clustering techniques such as the hard k-means (HKM). The HKM algorithm was initially introduced in studies published by MacQueen [35]. In this approach, the set of feature vectors $\Theta$ is clustered into N distinct cluster sets $\{C\} = \{C_1, \ldots, C_N\}$. Each set $C_n$ contains the feature vectors assigned to the nth cluster and has an associated prototype vector $v_n$, which denotes the nth cluster center.
Clustering of the data is achieved through the minimization of the objective function

$$J_{kmeans} = \sum_{n=1}^{N} \sum_{\theta(k,l) \in C_n} D_n(k,l), \tag{15}$$

where $D_n(k,l) = \|\theta(k,l) - v_n\|^2$ is the squared Euclidean distance between the feature vector $\theta(k,l)$ and the nth cluster center.
Conditional on a set of initial centroids, this minimization is iteratively realized by the following alternating equations until convergence is met:

$$C_n^* = \{\theta(k,l) \mid n = \arg\min_{n'} D_{n'}(k,l)\}, \quad \forall n, k, l, \tag{16}$$

$$v_n^* = E\{\theta(k,l)\}_{\theta(k,l) \in C_n}, \quad \forall n, \tag{17}$$

where $E\{\cdot\}_{\theta(k,l) \in C_n}$ denotes the mean operator for the TF points within the cluster set $C_n$, and the (*) operator denotes the optimal value (at convergence). Due to the algorithm's sensitivity to the initialization of the cluster centers, it is recommended either to design initial centroids using an assumption on the sensor and source geometry as in [14,15], or to utilize the best outcome of a predetermined number of independent runs.
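The alternating assignment and centroid updates of the hard k-means can be sketched as follows. The function name `hard_kmeans`, the iteration cap and the two-blob toy data are assumptions of this example; in the BSS setting, each row of `theta` would be one feature vector θ(k, l).

```python
import numpy as np

def hard_kmeans(theta, centroids, n_iter=100):
    """Hard k-means on feature vectors (illustrative sketch).

    theta:     (T, D) feature vectors, one row per TF point
    centroids: (N, D) initial cluster centers
    returns:   labels (T,), centers (N, D)
    """
    v = np.asarray(centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance to every center.
        D = ((theta[:, None, :] - v[None, :, :]) ** 2).sum(axis=2)
        labels = D.argmin(axis=1)
        # Update step: each center becomes the mean of its members
        # (an empty cluster keeps its previous center).
        new_v = np.array([theta[labels == n].mean(axis=0) if np.any(labels == n) else v[n]
                          for n in range(len(v))])
        if np.allclose(new_v, v):
            break
        v = new_v
    return labels, v

rng = np.random.default_rng(0)
blob_a = 0.1 * rng.standard_normal((50, 2))
blob_b = 0.1 * rng.standard_normal((50, 2)) + 5.0
theta = np.vstack([blob_a, blob_b])
labels, centers = hard_kmeans(theta, centroids=[[1.0, 1.0], [4.0, 4.0]])
```

The initial centroids here play the role of a geometry-informed initialization; a poor choice can leave the iteration in a local optimum, which is the sensitivity noted above.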

Gaussian mixture model clustering
A number of studies in the literature for TF-based BSS have implemented the GMM clustering approach [22-24], and it is therefore included in this study for comparative purposes. It is also included in order to compare the effects of soft masking on the separation system, by providing the FCM with a fair comparison.
In the GMM-based clustering, each observation $\theta(k,l)$ can be modeled as a weighted sum of P component Gaussian densities (clusters). Unlike the HKM (section 3.1) and FCM (section 3.3), where the number of clusters is equal to the number of sources, the GMM-based clustering methods have the additional complexity that the best fit of the data set to a mixture model may not require P to equal the number of sources [14].
The pth component of the mixture model is assumed to follow a Gaussian distribution with a characteristic mean and covariance, $\mu_p$ and $\Sigma_p$, respectively. The probability density function of an observation $\theta(k,l)$, denoted by $\theta$ for simplicity from here onward, is represented mathematically as

$$p(\theta \mid \mu, \Sigma) = \sum_{p=1}^{P} w_p\, p(\theta \mid \mu_p, \Sigma_p), \tag{18}$$

where $(\mu, \Sigma)$ contains the mean and covariance matrices for all P clusters, and $w_p$ denotes the mixture weight (probability) of the pth distribution. This pth component density is represented by

$$p(\theta \mid \mu_p, \Sigma_p) = \frac{1}{(2\pi)^{d/2} |\Sigma_p|^{1/2}} \exp\left(-\frac{1}{2}(\theta - \mu_p)^T \Sigma_p^{-1} (\theta - \mu_p)\right), \tag{19}$$

where d denotes the dimension of $\theta$. The unknown parameter sets $(\mu_p, \Sigma_p)$ for the P distributions are estimated in such a manner as to maximize the likelihood of the mixture model; this estimation is most commonly calculated iteratively using the Expectation-Maximization (EM) algorithm [22]. The data is then clustered around the maximum likelihood parameters as determined from the EM algorithm by the final estimates of the a posteriori probabilities at convergence.
Conditional on an initial partitioning, that is, the initial cluster sets $\{C_1, \ldots, C_P\}$ are known, the parameter sets $(\mu_p, \Sigma_p)$ are found via the minimization of the negative log-likelihood of (19),

$$(\mu_p^*, \Sigma_p^*) = \underset{\mu_p, \Sigma_p}{\arg\min}\; -\sum_{\theta \in C_p} \log p(\theta \mid \mu_p, \Sigma_p), \tag{20}$$

and each mixture weight $w_p$ is then estimated conditional on $(\mu_p, \Sigma_p)$, $p = 1, \ldots, P$. The cluster sets are then found by assigning posterior probabilities to the mixture components. The use of GMM clustering within this particular BSS framework results in the number of components not being equal to the number of sources (see section 4.1); therefore, the dominant N components of the P, as determined by the mixture weights, are selected to represent the N sources. The posterior probabilities of the dominant Gaussians, denoted $p(\theta \mid \mu_p, \Sigma_p)$, are then utilized as the TF mask to represent the corresponding source (analogous to the work in [14,17]).
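The selection of the dominant N components by mixture weight, followed by the use of their posterior probabilities as soft masks, might be sketched as below. This is one possible reading of the procedure: the posterior is taken here as the usual GMM responsibility, normalized over all P components, and the parameters are given rather than fitted by EM. The function names and the toy parameters are hypothetical.

```python
import numpy as np

def gaussian_pdf(theta, mu, cov):
    """Multivariate Gaussian density evaluated at each row of theta (T, d)."""
    d = theta.shape[1]
    diff = theta - mu
    sol = np.linalg.solve(cov, diff.T).T                 # Sigma^{-1} (theta - mu)
    expo = -0.5 * np.sum(diff * sol, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(expo) / norm

def dominant_posteriors(theta, w, mus, covs, N):
    """Posterior probabilities of the N components with the largest weights.

    Assumption of this sketch: posteriors are responsibilities over all P components.
    returns: masks (T, N) and the indices of the retained components.
    """
    dens = np.stack([w[p] * gaussian_pdf(theta, mus[p], covs[p])
                     for p in range(len(w))], axis=1)    # (T, P)
    post = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
    keep = np.argsort(w)[::-1][:N]                       # largest weights first
    return post[:, keep], keep

w = np.array([0.5, 0.1, 0.4])                            # toy mixture weights
mus = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 5.0]])
covs = np.stack([np.eye(2)] * 3)
masks, keep = dominant_posteriors(mus.copy(), w, mus, covs, N=2)
```

In this toy case the component with weight 0.1 is discarded, and the points located at the means of the two retained components receive a mask value close to one for their own component.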

Fuzzy c-means clustering
Whilst the HKM performed satisfactorily in the context of MENUET for BSS, the work presented in [21] and [36] demonstrated that the use of a fuzzy clustering algorithm improves the accuracy of mask estimation. The origins of the FCM are credited to the work presented in [33], and as with the HKM method, the feature set is clustered into N clusters, where each cluster center is represented by a centroid $v_n$. However, the clustering also has an associated partition matrix $U = \{u_n(k,l) \in \mathbb{R} \mid n \in (1, \ldots, N), (k,l) \in \Omega\}$, which specifies the degree of membership $u_n(k,l)$ with which a feature vector $\theta(k,l)$ belongs to the nth cluster at the TF point (k, l).
Clustering is achieved by the minimization of the cost function

$$J_{fcm} = \sum_{n=1}^{N} \sum_{(k,l) \in \Omega} u_n(k,l)^q\, D_n(k,l), \tag{22}$$

where $u_n(k,l)$ is subject to the constraint $\sum_{n=1}^{N} u_n(k,l) = 1$, with $D_n(k,l)$ defined as in section 3.1. The fuzzification parameter q > 1 controls the membership softness in the cost function and therefore controls the fuzziness of the generated TF masks. Section 4.1 describes the selection of an appropriate value for the fuzzification parameter in this BSS context. The minimization problem in (22) can be solved using Lagrange multipliers and is typically implemented as an alternating optimization scheme because no closed-form solution exists [21,37]. Initialized with a random partitioning, the alternating updates are

$$v_n^* = \frac{\sum_{(k,l) \in \Omega} u_n(k,l)^q\, \theta(k,l)}{\sum_{(k,l) \in \Omega} u_n(k,l)^q}, \quad \forall n, \tag{23}$$

$$u_n^*(k,l) = \left[\sum_{j=1}^{N} \left(\frac{D_n(k,l)}{D_j(k,l)}\right)^{\frac{1}{q-1}}\right]^{-1}, \quad \forall n, k, l, \tag{24}$$

where (*) denotes the optimal value, and the updates are repeated until a suitable termination criterion is satisfied. Typically, convergence is defined as when the difference between successive partition matrices is less than some predetermined threshold $\epsilon$ [33]. However, as is also the case with the k-means, it is known that the alternating optimization scheme presented may converge to a local, as opposed to global, optimum; thus, it is suggested to independently implement the algorithm several times prior to selecting the most fitting result [21].
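The alternating optimization of the fuzzy c-means can be sketched compactly in NumPy. The function name `fuzzy_c_means`, the tolerance, the iteration cap and the two-blob toy data are assumptions made for illustration; the memberships returned would serve directly as soft TF masks.

```python
import numpy as np

def fuzzy_c_means(theta, N, q=2.0, tol=1e-5, max_iter=300, seed=0):
    """Fuzzy c-means by alternating optimization (illustrative sketch).

    theta: (T, D) feature vectors; N: number of clusters; q > 1: fuzzification.
    returns: memberships U (T, N) with rows summing to one, centroids v (N, D).
    """
    rng = np.random.default_rng(seed)
    U = rng.random((len(theta), N))
    U /= U.sum(axis=1, keepdims=True)                    # random initial partition
    for _ in range(max_iter):
        Uq = U ** q
        # Centroid update: membership-weighted mean of the feature vectors.
        v = (Uq.T @ theta) / Uq.sum(axis=0)[:, None]
        # Squared Euclidean distances, floored to avoid division by zero.
        D = np.maximum(((theta[:, None, :] - v[None, :, :]) ** 2).sum(axis=2), 1e-12)
        # Membership update: u_n = D_n^(-1/(q-1)) / sum_j D_j^(-1/(q-1)).
        inv = D ** (-1.0 / (q - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, v

rng = np.random.default_rng(1)
theta = np.vstack([0.1 * rng.standard_normal((50, 2)),
                   0.1 * rng.standard_normal((50, 2)) + 5.0])
U, v = fuzzy_c_means(theta, N=2)
hard_labels = U.argmax(axis=1)
```

Taking the argmax of the memberships recovers a hard partition, which makes the relationship to the binary k-means mask explicit: the fuzzy mask retains the graded membership that the hard decision discards.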

Summary: FCM clustering algorithm

Experimental evaluations

Experimental setup
The experimental setup was designed to replicate that of the studies in [14,15] for comparative purposes. Figure 1 depicts the speaker and sensor arrangement, and Table 1 details the experimental conditions. The wall reflections of the enclosure and room impulse responses between each source and sensor were simulated using the image model method for small-room acoustics [38]. The room reverberation was quantified by the measure RT60, defined as the time required for reflections of a direct sound to decay by 60 dB below the level of the direct sound. Several types of background noise can be described by a diffuse sound field and modeled by an infinite number of statistically independent point sources on a sphere [29]. In this model, the intensities of the incident sound are uniformly distributed over all possible directions, and can be modeled as additive noise at the sensors, as in (1) [29]. In this study, 30 individual and independent point sources were distributed uniformly at a distance of 1.5 m from the center of the microphone array. To cover a range of adverse conditions in the evaluations, three types of environmental noise were considered: white noise, babble noise and factory noise. All noise samples are available in the NOISEX-92 database [39]. The simulated background noise was scaled according to the signal-to-noise ratio (SNR) definition as in [40], which uses the standardized method given by the International Telecommunication Union to objectively measure the active speech level and calibrate the interfering noise signal appropriately [41]. It should be noted that in real-world environments, noise is never exactly isotropic; therefore, these evaluations must be considered with caution.
Figure 1 The simulated room setup for the nonlinear sensor arrangement experimental evaluations.
The four target speech sources, the genders of which were randomly generated, were realized with phonetically-rich utterances from the TIMIT database [42], and the target-to-masker ratio between all of the sources was set to 0 dB. A representative number of mixtures for evaluative purposes was constructed. To avoid any spatial aliasing, the sensors were placed at a maximum distance of 4 cm apart.
Section 3.3 explains the role of the fuzzification parameter q in the FCM clustering. Past research [21] has identified a value of q in the range q ∈ (1, 1.5] to result in performance akin to hard clustering. Furthermore, it was empirically determined that for reverberant speech mixtures, a value of q = 2 is an optimal value in order to achieve a balance between high separation performance and minimal artifacts [21]. This is consistent with other studies which also report an optimal value of 2 for the fuzzy exponent [43,44]. Therefore, in this work, the fuzzification parameter q is set to 2.
As mentioned in sections 3.1 and 3.3, it is widely recognized that the performance of the clustering algorithms is largely dependent on the initialization of the algorithm [19,45]. If the initial partitions are not estimated with sufficient precision, there is a high possibility of finding a local, as opposed to global, optimum. It has been recommended [19] to run the algorithms multiple times to reduce the degrading effects of this sensitivity; the effectiveness of this style of initialization was also described in [46]. In an effort to save computational expense, it was desired to determine the smallest number of independent, single-iteration runs for initialization which would result in the best solution. Previous experiments as in [21] had implemented the best of 50 runs; however, it was empirically confirmed that there was little difference in performance between 25 and 50 runs. Therefore, it can be assumed that satisfactory clustering initialization can result when the best solution of 25 independent, randomly initialized single-iteration executions is selected for initialization. The 'best' solution was defined as the execution which resulted in the lowest cost function output of the independent runs (i.e. the smallest error).
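The initialization strategy described above, selecting the lowest-cost result among several randomly initialized single-iteration runs, can be sketched for the fuzzy c-means as follows. The helper names `fcm_step` and `best_of_runs`, and the toy data, are hypothetical; only the number of runs (25) and the value q = 2 are taken from the text.

```python
import numpy as np

def fcm_step(theta, U, q):
    """One alternating FCM update; returns new memberships, centroids and cost."""
    Uq = U ** q
    v = (Uq.T @ theta) / Uq.sum(axis=0)[:, None]
    D = np.maximum(((theta[:, None, :] - v[None, :, :]) ** 2).sum(axis=2), 1e-12)
    inv = D ** (-1.0 / (q - 1.0))
    U_new = inv / inv.sum(axis=1, keepdims=True)
    cost = float(np.sum(U_new ** q * D))                 # FCM cost for this partition
    return U_new, v, cost

def best_of_runs(theta, N, runs=25, q=2.0, seed=0):
    """Keep the lowest-cost result of `runs` random single-iteration executions."""
    rng = np.random.default_rng(seed)
    best, costs = None, []
    for _ in range(runs):
        U0 = rng.random((len(theta), N))
        U0 /= U0.sum(axis=1, keepdims=True)              # random initial partition
        U, v, cost = fcm_step(theta, U0, q)
        costs.append(cost)
        if best is None or cost < best[2]:
            best = (U, v, cost)
    return best, costs

rng = np.random.default_rng(1)
theta = np.vstack([0.1 * rng.standard_normal((50, 2)),
                   0.1 * rng.standard_normal((50, 2)) + 5.0])
(U_init, v_init, best_cost), costs = best_of_runs(theta, N=2)
```

The selected partition `U_init` would then seed the full alternating optimization, so the cost of the extra runs is limited to one update each.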
Similar to the HKM and FCM algorithms, the GMM clustering approach also requires a suitable initialization. As recommended in [47], an initialization based on the Forgy method [48] was implemented, where the data set was randomly partitioned into K non-overlapping sets with uniform mixing proportions (K here denotes the number of mixture components, written P in section 3.2). The initial covariance matrices for all components were diagonal. However, the GMM clustering approach is also highly sensitive to the selection of an appropriate number of components in the model. It was observed in the experiments that an increase in the number of mixture components generally resulted in improved separation performance; however, the selection of an optimal number of Gaussians was not simple and required a considerable amount of experimentation. For this particular application of the GMM clustering in the desired source/sensor configuration, it was empirically determined as K = 12. This is in accordance with previous studies using GMM for BSS such as in [14], where the determination of the optimal number of clusters came at a considerable computational expense. As mentioned in section 3.2, since the number of components is not equal to the number of sources, the dominant N components (as indicated by the mixture weights) were used to estimate the TF separation masks. The TF masks were derived from the posterior probabilities of the dominant components.

Evaluation measures
In order to provide a comprehensive evaluation of the separation algorithms presented in this study, a range of performance metrics has been included. These include the widely used BSS_EVAL toolkit [49], the Perceptual Evaluation of Speech Quality measure (PESQ) [50] and the objective measures in the Perceptual Evaluation methods for Audio Source Separation (PEASS) toolkit [51].

BSS_EVAL performance metrics
The first set of performance metrics was obtained from the publicly available MATLAB toolkit BSS_EVAL [49]. This set of metrics is applicable to all source separation approaches, and no prior information of the separation algorithm is required. However, the original toolkit does not account for environmental noise in the metrics. To address this, an author of BSS_EVAL was consulted in order to modify the toolkit to include two extra metrics: the SNR and the signal-to-interference-plus-noise ratio (SINR).
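Using a least-squares projection, the BSS_EVAL toolkit assumes the decomposition of the estimated spatial image $\hat{s}_{mn}(t)$ as

$$\hat{s}_{mn}(t) = s^{\mathrm{img}}_{mn}(t) + e^{\mathrm{spat}}_{mn}(t) + e^{\mathrm{interf}}_{mn}(t) + e^{\mathrm{artif}}_{mn}(t) + e^{\mathrm{noise}}_{mn}(t),$$

where $m$ is the observation index, $s^{\mathrm{img}}_{mn}(t)$ is the true source image and $e^{\mathrm{spat}}_{mn}(t)$, $e^{\mathrm{interf}}_{mn}(t)$, $e^{\mathrm{artif}}_{mn}(t)$ and $e^{\mathrm{noise}}_{mn}(t)$ are distinct error components representing spatial distortion, interference, artifacts and noise, respectively. From this decomposition, the SIR was computed as [52]

$$\mathrm{SIR}_n = 10\log_{10} \frac{\sum_{m}\sum_{t} \left( s^{\mathrm{img}}_{mn}(t) + e^{\mathrm{spat}}_{mn}(t) \right)^2}{\sum_{m}\sum_{t} e^{\mathrm{interf}}_{mn}(t)^2}$$

to provide an estimate of the relative amount of interference in the target source estimate. The SINR was computed as the corresponding energy ratio with respect to the combined interference and noise components, to reflect the amount of noise and interference in the recovered signal estimate. The global SNR for the nth source was calculated as the energy ratio with respect to the noise component alone, which provides a measure of the amount of noise at the recovered signal, independent of the interference. For all ratios, a higher value indicates better separation performance.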

PESQ
The PESQ measure was originally designed to provide an objective prediction of the subjective judgement of the speech quality of the recovered source signal. Although initially intended for telecommunication applications, it has since been shown to be an effective predictor of the quality of the speech isolated from the observation mixtures by the separation algorithm [53], as well as of ASR performance on the separated speech signals [54]. The PESQ score is computed by comparing the original (unmixed, anechoic) speech source signal with the recovered signal estimate. Both signals are time-aligned and passed through an auditory transform to achieve a psychoacoustically motivated representation [55]. The differences between the signals in this representation are measured and used to provide an estimate of the distortion in the signal estimate. The final PESQ measure is reported to correlate well with subjective listening scores [53].
The PESQ score ranges from −0.5 to 4.5, where 4.5 represents the case in which the signal estimate is equivalent to the original (clean) source. A higher score suggests better speech quality.

PEASS
The PEASS toolkit was created to provide a set of objective scores to predict the perceptual quality of estimated sources. This is complementary to the energy-based ratios in the BSS_EVAL (cf. section 4.2.1), and the PEASS has since been implemented as a standard for performance evaluation in international speech challenges such as the signal separation evaluation campaign (SiSEC) [52,56].
In this toolkit, the estimated signals are decomposed via a complex, auditory-motivated algorithm as [51]

$$\hat{s}_n(t) - s_n(t) = e^{\mathrm{target}}(t) + e^{\mathrm{interf}}(t) + e^{\mathrm{artif}}(t),$$

where $\hat{s}_n(t)$ is the estimated signal, $s_n(t)$ is the original (clean) target signal, and the terms $e^{\mathrm{target}}(t)$, $e^{\mathrm{interf}}(t)$ and $e^{\mathrm{artif}}(t)$ denote the target distortion component, interference component and artifacts component, respectively. The salience of these error components is then measured using the perceptual similarity measure provided in the PEMO-Q auditory model [57]; the reader is referred to [51] for a detailed discussion. The PEASS toolkit computes four auditory-motivated quality scores; however, the overall perceptual score (OPS) is considered a global measure of the separation ability, as it indicates the similarity between the recovered signal estimate and the original signal, and it is said to have a high coherence with subjective perceptual evaluation. Therefore, in this study, the OPS is included as an additional performance metric for the perceptual quality of the speech. The OPS is expressed from 0 to 100, where 100 denotes the best perceptual match.

Initial evaluations of MENUET with FCM
Prior to evaluating the effectiveness of the FCM clustering for mask estimation in the MENUET framework, the FCM was evaluated in a simple stereo setup for a variety of feature sets in order to test its feasibility in this context. In [14,15], a comprehensive review of suitable location cues was presented and their effectiveness at separation was evaluated using the HKM clustering for mask estimation.
The experimental setup for this set of evaluations was designed to replicate the original work in [14] as closely as possible. In an enclosure of dimensions 4.55 m × 3.55 m × 2.5 m with a room reverberation parameter RT60 constant at 128 ms, two omnidirectional microphones were placed 4 cm apart at an elevation of 1.2 m. Three speech sources, with a target-to-masker ratio of 0 dB, were situated at 30°, 70° and 135° at a distance of 50 cm from the array, and also at an elevation of 1.2 m. The speech sources were randomly chosen from both genders of the TIMIT database in order to emulate the investigations in [14,15], which utilized English utterances. The source separation performance was evaluated with respect to the improvement in SIR, and the results are depicted in Table 2.
The original purpose of evaluating this range of features was to determine the effect of appropriate normalization of the level and phase ratio features [14]. As expected, separation performance generally improves when the features are of the same order of magnitude (see section 2.3). It is additionally observed from the measured SIR gain that the FCM clustering is more robust than the original HKM for all but one feature set, which hints at the possibility of the FCM yielding similar results for related TF BSS approaches. Not only does this confirm the suitability of the FCM in the proposed BSS framework, it also demonstrates the robustness of the FCM against several types of spatial features. The results of this investigation provide further motivation to extend the soft TF masking scheme to other sensor arrangements and adverse acoustic conditions. However, in the original evaluations in [14], the authors also compared the performance of the HKM for the same stereo, three-speaker setup against the more robust GMM fitting clustering approach. The GMM demonstrated improvements in SIR gain in comparison to the HKM, although at a significantly greater computational expense. Furthermore, the selection of the number of Gaussian components required considerable trial and error (cf. section 4.1). In order to offer a fair comparison of the FCM against other clustering techniques, the GMM fitting method was therefore implemented in the further BSS evaluations described in the following sections.

Table 2 The hard k-means and fuzzy c-means are implemented for mask estimation. Columns: Feature θ(k, l), k-means, c-means. The reverberation was constant at RT60 = 128 ms. The highest achieved ratios are emphasized in italics.

Separation in reverberant conditions
The study was extended to the underdetermined case of three sensors and four sources in a nonlinear configuration as in Figure 1 [14,15]. The average improvement in SIR measured across all separated sources for all evaluations is depicted in Figure 2, where the average input SIR was measured at −4.20 dB (consistent with the studies in [14,15]). It is immediately evident that the two soft masking techniques, GMM and FCM, improve the separation quality by a considerable amount. For example, in the anechoic scenario, the GMM and FCM clustering techniques perform equivalently, leading the HKM mask estimation by almost 10 dB. However, as the reverberation is increased to a mild 128 ms, a slight performance gap between the two soft masking techniques surfaces, with the FCM leading by approximately 2 dB. This gap widens to almost 7 dB as the reverberation is increased further. Interestingly, at this higher reverberation time, the GMM performs even below the HKM.
A smaller standard deviation is also observed in Figure 2 when FCM clustering is used. For example, when the reverberation is RT60 = 128 ms, the SIR performance using GMM clustering is comparable to that of FCM clustering. However, the standard deviation of the GMM is more than twice that of the FCM, which suggests that the FCM delivers more consistent and reliable separation of the sources.
To evaluate the statistical significance of the results, Student's t test was conducted for the three methods, with two tests per RT60 value: one comparing the FCM against the HKM, and one comparing the FCM against the GMM. A two-tailed distribution was assumed for each test, with unequal variances between the data. For the FCM against the HKM, a p value of p << 0.001 was reported for all reverberation times. For the FCM against the GMM, a p value of p = 0.094 was measured at a reverberation time of RT60 = 0 ms, where the two methods perform equivalently; for the remaining reverberation times, a p value of p << 0.001 was recorded. This demonstrates that the performance advantage of the proposed FCM mask estimation over the HKM in all conditions, and over the GMM in reverberant conditions, is highly unlikely to be due to chance. Therefore, the performance of the FCM clustering indicates a superior mask estimation technique for source separation in a reverberant enclosure.
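A t test with unequal variances is commonly known as Welch's t test. The sketch below shows how its test statistic and approximate degrees of freedom could be computed from two sets of per-source SIR gains; the function name and the sample values in the usage note are hypothetical and are not data from this study.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard errors of the two sample means (unequal variances).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

For example, `welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])` gives a statistic of about −1.90 with about 5.88 degrees of freedom. The two-tailed p value then follows from the t distribution with `df` degrees of freedom; with SciPy available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the whole test in one call.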

Separation in reverberant conditions with spatially diffuse environmental noise
The effect of background noise was then evaluated for the BSS system in the presence of white, babble and factory noise, added to the mixtures as described in section 4.1.
The numerical results are shown in Tables 3, 4 and 5 for a range of reverberation times, with similar trends reported for all types of corrupting noise. To provide a fair comparison against the noise-free case in Figure 2, the SIR gain is reported. However, for the SINR and SNR, the absolute measured ratio at the output is provided. It is firstly observed that for environmental SNRs of 25 dB and above, the measured SIR gain is approximately equivalent to that of the noise-free environment (Figure 2). However, as the level of noise is increased, a steady decline in SIR gain is recorded, as expected. Interestingly, as previously observed in the separation results of section 4.3.2, the GMM mask estimation ability declines significantly with the introduction of more adverse conditions. For example, in the case of babble noise at a reverberation time of 128 ms, when the SNR is decreased from 25 to 20 dB, the SIR of the GMM drops by almost 5 dB, whereas the HKM drops by less than 1 dB and the FCM by just 0.34 dB. Additionally, as was previously observed in the noise-free experiments (Figure 2), the GMM occasionally performs below the HKM clustering at the higher reverberation time of 300 ms.
The SINR follows the same trends as the SIR across all room reverberations and environmental SNRs. To gain an appreciation of any possible noise suppression characteristics of the MENUET and its modifications using the GMM/FCM, the SNR was measured and then averaged over all the recovered source signals. The results are generally as expected, with a decrease in gain as the level of noise and the reverberation time increase. However, as previously observed, there is often a notable decline in the performance of the GMM as the SNR drops below 20 dB and/or the room reverberation is increased.
The effects of noise can be isolated from those of reverberation in Table 3, where the room reverberation is set to null. For the FCM clustering, noise alone appears to have less of an impact upon separation ability than reverberation; for example, when the SNR is varied from 30 to 10 dB, there is a change in SIR gain of between 3 and 5 dB, with just a 1 dB change in the case of babble noise. However, when comparing the SIR gains for the same SNRs across different reverberation times, there are significant differences, especially at the reverberation time of RT60 = 300 ms. For example, for the case of corrupting babble noise, the recorded SIR was 16.14 dB for RT60 = 0 ms, whereas it drops to 11.28 dB for RT60 = 300 ms.
The PESQ was then evaluated on the recovered signals to provide a measure of the perceptual quality of the recovered source estimates. A general decrease in PESQ with increasingly adverse conditions is noted, with the FCM for mask estimation yielding the highest scores. The effect of environmental SNR appears to be more detrimental than that of reverberation; for example, in the case of babble noise, the measured PESQ for the FCM method at a reverberation time of 0 ms and an SNR of 30 dB is 2.84. When the room reverberation is increased to 300 ms, the measured PESQ is 2.50. However, when the reverberation is maintained at 0 ms and the SNR is decreased to 0 dB, the measured PESQ is 1.54. This reduction in PESQ is likely due to the decrease in the target signal amplitude and degraded time alignment in such noisy conditions, which leads to a source estimate of poorer quality.
The final performance metric implemented for this experimental setup was the OPS from the PEASS toolkit. Similar trends were observed in the OPS as with the other metrics, with a degradation in the achieved score as the hostility of the environment was increased. In this case also, the FCM demonstrated its superiority over the HKM and GMM clustering techniques.

Table 3 Source separation results in an anechoic enclosure (cf. Figure 1). The room reverberation is set to null. The HKM, GMM and FCM clustering algorithms are compared for TF mask estimation using the performance metrics of SIR gain, SINR and SNR as defined in section 4.2. The highest achieved ratio for each acoustic condition is denoted in italics.

SiSEC 2010 Data
The proposed method was then evaluated with the publicly available benchmark data of the SiSEC 2010 [56]. The development data (dev.zip) from the "Source separation in the presence of real-world background noise" data set was used. In this data set, two microphones were spaced 8.6 cm apart, and noise signals were recorded in real-world noise environments: 'Cafeteria' (Ca) and 'Square' (Sq). The 'Cafeteria' environment was stated as reverberant (with an unspecified reverberation time), whereas the 'Square' had little or no reverberation [56]. The noise signals were recorded at two different positions within the environment: center (Ce; where noise is more isotropic) and corner (Co; where noise may not be very isotropic) [56]. For each of the noise environments, two different locations of the same environment were considered (A and B). The recordings were 10 s long, with mixed English and Japanese utterances of both genders. The original recordings were sampled at 16 kHz; however, it was empirically determined that downsampling to 8 kHz resulted in better separation for all methods tested. This can be attributed to the reduced effects of spatial aliasing at the lower sampling frequency.
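Table 4 Source separation results in a reverberant enclosure (cf. Figure 1). The room reverberation is set to RT60 = 128 ms. The HKM, GMM and FCM clustering algorithms are compared for TF mask estimation using the performance metrics of SIR gain, SINR and SNR as defined in section 4.2. The highest achieved ratio for each acoustic condition is denoted in italics.

Table 5 The room reverberation is set to RT60 = 300 ms. The HKM, GMM and FCM clustering algorithms are compared for TF mask estimation using the performance metrics of SIR gain, SINR and SNR as defined in section 4.2. The highest achieved ratio for each acoustic condition is denoted in italics.

For easy comparison against the published results of the SiSEC as available in [58], the same evaluation criteria as for the "Source spatial image estimation" task were used. The estimated source image $\hat{s}_{mn}(t)$ is decomposed as

$$\hat{s}_{mn}(t) = s^{\mathrm{img}}_{mn}(t) + e^{\mathrm{spat}}_{mn}(t) + e^{\mathrm{interf}}_{mn}(t) + e^{\mathrm{artif}}_{mn}(t).$$

Three energy ratios, the source image to spatial distortion ratio (ISR), signal to interference ratio (SIR) and signal to artifact ratio (SAR), then measure the amount of spatial distortion, interference and artifacts in the recovered source estimates. These are expressed in dB as [52]

$$\mathrm{ISR}_n = 10\log_{10} \frac{\sum_{m}\sum_{t} s^{\mathrm{img}}_{mn}(t)^2}{\sum_{m}\sum_{t} e^{\mathrm{spat}}_{mn}(t)^2},$$

$$\mathrm{SIR}_n = 10\log_{10} \frac{\sum_{m}\sum_{t} \left( s^{\mathrm{img}}_{mn}(t) + e^{\mathrm{spat}}_{mn}(t) \right)^2}{\sum_{m}\sum_{t} e^{\mathrm{interf}}_{mn}(t)^2},$$

$$\mathrm{SAR}_n = 10\log_{10} \frac{\sum_{m}\sum_{t} \left( s^{\mathrm{img}}_{mn}(t) + e^{\mathrm{spat}}_{mn}(t) + e^{\mathrm{interf}}_{mn}(t) \right)^2}{\sum_{m}\sum_{t} e^{\mathrm{artif}}_{mn}(t)^2}.$$

The total error is captured in the signal-to-distortion ratio (SDR)

$$\mathrm{SDR}_n = 10\log_{10} \frac{\sum_{m}\sum_{t} s^{\mathrm{img}}_{mn}(t)^2}{\sum_{m}\sum_{t} \left( e^{\mathrm{spat}}_{mn}(t) + e^{\mathrm{interf}}_{mn}(t) + e^{\mathrm{artif}}_{mn}(t) \right)^2}.$$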
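As an illustration of how the ISR, SIR, SAR and SDR relate to the decomposed components, the following sketch computes them from already-separated error terms. It assumes the standard SiSEC definitions of these ratios, takes each component as a flat list of samples (all channels concatenated), and does not perform the least-squares decomposition itself; the function names are hypothetical and this is not the BSS_EVAL code.

```python
import math


def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


def ratio_db(num, den):
    """Energy ratio of two signals in decibels."""
    return 10.0 * math.log10(energy(num) / energy(den))


def add(*signals):
    """Sample-wise sum of equally long signals."""
    return [sum(vals) for vals in zip(*signals)]


def image_metrics(s_img, e_spat, e_interf, e_artif):
    """ISR, SIR, SAR and SDR (in dB) from the decomposition components."""
    return {
        "ISR": ratio_db(s_img, e_spat),
        "SIR": ratio_db(add(s_img, e_spat), e_interf),
        "SAR": ratio_db(add(s_img, e_spat, e_interf), e_artif),
        "SDR": ratio_db(s_img, add(e_spat, e_interf, e_artif)),
    }
```

Because each ratio places a single error type in the denominator, a method can score a high SIR while still showing a low SAR, which is why the ratios are reported side by side.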
The quality of the source signals was also evaluated with the PEASS toolkit as described in section 4.2.3. However, all four scores were included: the target-related perceptual score (TPS), interference-related perceptual score (IPS), artifact-related perceptual score (APS) and the OPS. The reader is referred to [51] for details. Table 6 shows the average results per environmental condition, averaged across all available mixtures. This table can easily be compared against the results of the SiSEC 2010 in the table entitled "Average Results for 2 channels" in [58]. The individual results for each recording are displayed in Table 7. The reported results are at a similar performance level to those published in the SiSEC 2010 [58], despite the reduced SAR and APS ratios. An overall decline in performance in comparison to the simulated evaluations (Tables 3, 4 and 5) can be observed. A likely reason for this is the larger sensor spacing (8.6 cm compared to the 4 cm spacing in previous evaluations): for ideal phase measurements, the sensor spacing should be limited to below $c/f_s$, where $c$ is the velocity of sound and $f_s$ is the sampling frequency [21]. Additionally, the fact that two sensors are used to retrieve the information rather than three, as in section 4.3.3, could contribute to the decrease in performance. The reduction of the feature space dimension may have lowered the capability of the clustering algorithm, making any clustering performance differences less apparent.
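The spacing condition can be checked numerically. The sketch below assumes a speed of sound of 343 m/s, a value not stated in the text, and the helper names are hypothetical.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly 20 degrees Celsius


def max_spacing_m(fs_hz, c=SPEED_OF_SOUND):
    """Upper bound c / f_s on the sensor spacing for unambiguous phase."""
    return c / fs_hz


def exceeds_limit(spacing_m, fs_hz, c=SPEED_OF_SOUND):
    """True when the spacing is at or above the c / f_s bound."""
    return spacing_m >= max_spacing_m(fs_hz, c)
```

Under this assumption the bound is about 4.3 cm at 8 kHz and about 2.1 cm at 16 kHz. The 8.6 cm spacing of the SiSEC recordings exceeds the bound at both sampling rates, whereas a 4 cm spacing would satisfy it at 8 kHz.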
In general, the FCM for mask estimation proved the most robust. The GMM also achieved notable IPS values; however, the remaining ratios were not as high as those achieved with the FCM. For example, the OPS was consistently at its highest when the FCM was used for mask estimation. Interestingly, the location of the noise source (center or corner) did not appear to have a substantial effect on the separation ability. This suggests that the proposed algorithm is robust in both isotropic and non-isotropic noise conditions.

Table 6 (note) The average measured output ratio across all three sources, and for all mixtures in the condition, is displayed. The highest achieved ratio is denoted in italics.

Discussion
The experimental results presented have demonstrated that the implementation of the FCM clustering for mask estimation with a nonlinear microphone array setup as in the MENUET renders superior separation performance in conditions where reverberation and/or environmental noise exist. The feasibility of the FCM clustering was initially tested on a range of spatial feature vectors in an underdetermined simulated setting using a linear stereo microphone array, and compared against the original baseline HKM of the MENUET algorithm. The successful outcome of this prompted further investigation, with a natural extension to a nonlinear microphone array. The GMM clustering algorithm was also implemented as an additional comparative measure to further assess the quality of the FCM in this context and also to compare the performance of alternative soft mask estimation schemes.

Table 7 (note) The average measured output ratio across all three sources is displayed. The highest achieved ratio is denoted in italics.
Evaluations confirmed the superiority of the FCM, with positive improvements recorded for the average performance in all acoustic settings and their significance established by Student's t test. In addition, the consistent performance of the FCM even in increased reverberation establishes the potential of the FCM within the TF mask estimation framework. However, rather than focus solely upon the reverberant BSS problem, this study extended it to include an additional source of observational error: environmental noise, which was modeled as spatially diffuse noise by a number of independent sources. Recordings in real-world conditions were also considered, with the publicly available benchmark data of the international SiSEC 2010 included in the evaluations. It was proposed that, due to the documented robustness of the FCM in mask estimation for reverberant BSS, the extension to the noisy reverberant case would demonstrate similar abilities. Detailed evaluations confirmed this hypothesis, with noteworthy separation performance reported across a range of performance metrics in both simulated and real-world conditions. A decline in performance was noted when real-world evaluations were considered, and this is attributed to the change in sensor and speaker configuration as well as the undesired effects of spatial aliasing.
In general, the soft mask estimation techniques outperformed the binary masking; however, as the level of reverberation and background noise increased, there was a distinct performance gap between the two leading soft masking approaches, FCM and GMM. Furthermore, in certain scenarios, the GMM was surpassed in performance by the HKM clustering.
The poor performance of the GMM for mask estimation can be attributed to the fact that GMMs are often used for generative modeling for supervised pattern recognition and classification, as opposed to the clustering techniques HKM/FCM which are designed for unsupervised data clustering. Additionally, in these evaluations, there is not a one-to-one correspondence between the number of Gaussian mixture components and the number of sources. Each data point in the feature set is assumed to originate from one of the component densities; therefore, a mismatch between the number of sources and components is a likely additional factor in the reduced performance in corrupted environments. Furthermore, it may be required to re-determine the optimal number of mixture components as the acoustic environment changes; however, this will prove a tedious task with the possibility of little benefit. It can then be concluded that such a statistical modeling paradigm as the GMM is not suitable when the acoustic environment is corrupted at a moderate to marked level as in this study, and perhaps distance metric-based methods such as the HKM/FCM are more appropriate.
Therefore, due to its reliability, consistency and robustness in mask estimation over a range of acoustic environments, the FCM algorithm is deemed the most suitable of the three data classification techniques evaluated in this study for the purposes of mask estimation in this BSS framework.

Future research
Future research should focus upon improving the robustness of the mask estimation (clustering) stage of the algorithm. For example, an alternative distance measure in the FCM can be considered: it has been shown that the Euclidean distance metric as employed in this study may not be robust to outliers, such as those originating from undesired interferences in the acoustic environment [59]. A measure such as the $\ell_1$-norm could be implemented in a bid to reduce error [21]. Additionally, the authors of [20,21] considered the implementation of observation weights and contextual information in an effort to emphasize the reliable features whilst simultaneously attenuating the unreliable features. In such a study, a suitable metric is required to determine such reliability: consideration may be given to the behavior of proximate TF cells through a property such as variance [20].
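The sensitivity of the Euclidean distance to outliers can be seen in a toy one-dimensional example: the cluster center that minimizes the sum of squared distances is the mean, whereas the center that minimizes the sum of absolute (l1) distances is the median. The sketch below is only an illustration of this property with hypothetical feature values; it is not the modified FCM of [21].

```python
from statistics import mean, median


def l2_center(values):
    """Center minimizing the sum of squared distances (the mean)."""
    return mean(values)


def l1_center(values):
    """Center minimizing the sum of absolute distances (the median)."""
    return median(values)


# Hypothetical 1-D feature values: four inliers near 1.0 and one outlier.
features = [1.0, 1.1, 0.9, 1.0, 10.0]
```

Here `l2_center(features)` is pulled to about 2.8 by the single outlier, while `l1_center(features)` stays at 1.0.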
An approach explored in [60] proposes an enhancement to the traditional FCM through the introduction of a membership (probability) constraint function, and also proposes flexibility in the selection of the fuzzification parameter to better fit the end application. It was shown to outperform the traditional FCM with respect to clustering power and robustness, and thus remains a potential avenue for future research.
Furthermore, in a bid to move the presented BSS algorithm towards a truly blind and autonomous nature, the introduction of a source enumeration technique is suggested. The automatic detection of the number of clusters may prove to be of significance, as all three of the clustering techniques in this study require a priori knowledge of the number of sources. A modification to the FCM may suffice for enumeration; the authors of [61] describe two possible algorithms which employ a validation technique to automatically detect the optimum number of clusters to suit the data. Successful results of this technique have been reported within the BSS framework [16]. The inclusion of source enumeration into the presented study would pave the way towards a truly blind source separation system.

Conclusions
This study has presented an extension to the existing MENUET algorithm for underdetermined BSS in adverse environments. A non-exhaustive review of current TF-based BSS schemes was discussed with insight into the shortcomings affiliated with such techniques. In a bid to overcome such shortcomings, the substitution of the k-means clustering with the fuzzy c-means was proposed for the purposes of mask estimation for blind source separation. For an additional level of comparison, another soft clustering scheme based on Gaussian mixture models was also implemented.
It was suggested that a binary masking scheme for the mask estimation is inadequate at encapsulating the inevitable reverberation present in any acoustic setup, and thus a more suitable means for clustering the observation data, such as the fuzzy c-means, should be considered. The presented algorithm in this study integrated the c-means with the established MENUET technique for a range of acoustic conditions encompassing room reverberation and background noise.
In a number of experiments designed to evaluate the feasibility and performance of the c-means in the BSS context, the MENUET in conjunction with the FCM was found to outperform both the original HKM-based MENUET and the GMM-based variant, in setups ranging from a stereo (linear) microphone array to a nonlinear arrangement, and in both anechoic and reverberant conditions. Furthermore, both simulated and real-world spatially diffuse background noise was included in the evaluations in order to better reflect the conditions of realistic acoustic environments, and again, the FCM proved an improved approach for mask estimation. Comprehensive performance assessment was implemented through the inclusion of a wide range of standard evaluation metrics.
Future research should focus on improving the accuracy of the mask estimation via modifications to the fuzzy c-means, to move towards a more powerful and robust clustering algorithm. Furthermore, the evaluation of the BSS performance in alternative contexts such as automatic speech recognition should also be considered in order to gain greater perspective on its potential for implementation in real-life speech processing systems.