# A hybrid algorithm for blind source separation of a convolutive mixture of three speech sources

- Shahab Faiz Minhas^{1} (corresponding author)
- Patrick Gaydecki^{1}

**2014**:92

https://doi.org/10.1186/1687-6180-2014-92

© Minhas and Gaydecki; licensee Springer. 2014

**Received: **23 August 2013

**Accepted: **23 May 2014

**Published: **17 June 2014

## Abstract

In this paper we present a novel hybrid algorithm for blind source separation of three speech signals in a real room environment. The algorithm in addition to using second-order statistics also exploits an information-theoretic approach, based on higher order statistics, to achieve source separation and is well suited for real-time implementation due to its fast adaptive methodology. It does not require any prior information or parameter estimation. The algorithm also uses a novel post-separation speech harmonic alignment that results in an improved performance. Experimental results in simulated and real environments verify the effectiveness of the proposed method, and analysis demonstrates that the algorithm is computationally efficient.

## 1 Introduction

The blind source separation (BSS) of speech signals, also known as convolutive BSS, is a very challenging problem in real room environments. BSS algorithms can be broadly divided into two categories: those that use an information-theoretic approach and those based on de-correlation. Some of the most widely applied information-theoretic approaches include independent component analysis (ICA) [1], maximum likelihood [2], information maximisation [3] and kurtosis maximisation [4]. The neural network-based algorithms presented in [5–8], which build on these information-theoretic approaches, are unsuitable for implementation of BSS in real time. The reasons are their massive complexity, i.e. the calculation of thousands of adaptive filter coefficients, and the temporal whitening problem (pp. 340–345 in [9]).

The frequency domain implementation decomposes this convolutive mixture problem into multiple instantaneous mixing problems; however, this in turn leads to scaling and permutation alignment problems (pp. 352–353 in [9]). To solve this permutation problem, many algorithms have been proposed, such as in [10] and [11], that exploit the direction of arrival (DOA) and also speech harmonics. These DOA-based algorithms are semi-blind rather than truly blind, since they depend on a particular geometrical arrangement. Another way to resolve the permutation alignment issue is to exploit the correlation between separated signals at adjacent frequency bands [12]. The reliability of this and other similar techniques depends on the amount of correlation, which varies from case to case. In [13], a different approach is developed, based on de-correlation in the frequency domain; the algorithm avoids permutation with its very slowly converging diagonalisation procedure, but this slow convergence makes it less suitable for real-time implementation. Apart from the permutation problem, there are other limitations of frequency domain processing, as discussed in detail in [14]. In [15], a secondary algorithm is proposed, based on time-frequency masking, which improves the signal-to-interference ratio (SIR) of the separated streams. Such techniques are completely dependent on the BSS of the primary algorithm, and if the primary fails, then so does the secondary.

The BSS of more than two sources is a more complicated and computationally intensive problem. This can be seen in [16]: a detailed survey reveals that, of 400 publications on convolutive source separation, only 2% dealt with more than two sources. Even for two sources, it was concluded that only 10% of them really worked, with SIRs varying from 5 to 18 dB using real room impulse responses. These results are still questionable since it is very difficult to analyse or compare different algorithms owing to the lack of unified test bench methods for performance measurement [17].

In this paper a novel algorithm for the BSS of three speech sources in real room environments is proposed. It uses both the information-theoretic and de-correlation approaches to achieve superior source separation with fast convergence. The algorithm has low complexity and is optimised for real-time implementation. In addition, it does not require any prior parameter estimation; furthermore, a harmonic alignment methodology, presented in this paper, improves the quality of separated speech in a real room environment.

The paper is organised as follows: In the following section, the motivation behind the new hybrid algorithm will be discussed. In Section 3, the hybrid algorithm will be presented followed by a discussion of its constituent parts. In Section 4, a novel harmonic alignment method will be presented. In Section 5, the performance of the algorithm will be analysed based on a simulated room environment. The results from the real room experiment will be shown in Section 6, followed by discussion (showing computational load) and conclusion in Sections 7 and 8, respectively.

The notation used in this paper is as follows: a lowercase letter *x* for scalar quantities, a lowercase bold letter **x** for vector quantities (first-order tensors), an uppercase bold letter **X** for two-dimensional matrices (second-order tensors) and an uppercase bold letter with a double bar on top, $\overline{\overline{\mathbf{X}}}$, for three-dimensional matrices (third-order tensors).

## 2 Motivation

The motivation behind the hybrid algorithm will become evident as we progress in this section. The DOA-based algorithms presented in [18–21] have considered source separation cases for more than two sources in real room environments. However, for this, a single large microphone that consists of an array of microphones (within it) is used, which has limitations. The limitations are not only in the placement of sources in a geometrical arrangement, but also the performance is dependent on the distance of the microphone from sources. The source separation for a speaker behind a speaker or a speaker whose face is towards the wall (rather than the microphone) also cannot be achieved through DOA.

Most practical scenarios require arbitrarily placed microphones that pick up the stronger source signal and cancel the weaker interfering signals. Examples include musical instruments in concerts, acoustics in theatre performances, meetings in conference rooms and discussions in parliament houses. All of these cases require a BSS algorithm for real-time separation. This research shows the potential of working with only three speech sources and an equal number of sensors (microphones), i.e. a critically determined BSS case in a real room environment.

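The convolutive mixture observed at sensor *p* is

$$x_{p}(n)=\sum_{q=1}^{S}\sum_{k=0}^{K-1}h_{pq}(k)\,s_{q}(n-k),\qquad (1)$$

where s_{q} is the speech source that is convolved with the FIR filter containing the impulse response (channel response) given by **h**_{pq} between the source and the sensor and then added at the sensor to give the final convolutive mixture represented by x_{p}. In the above, *K* represents the length of the filters, *S* represents the total number of sources, i.e. three in our case, and *n* represents the sample number. Equation 1 represents speech signals passing through a (third-order tensor or three-dimensional) mixing matrix ${\overline{\overline{\mathbf{H}}}}_{\mathbf{m}}$ given by

$${\overline{\overline{\mathbf{H}}}}_{\mathbf{m}}=\begin{bmatrix}\mathbf{h}_{11} & \mathbf{h}_{12} & \mathbf{h}_{13}\\ \mathbf{h}_{21} & \mathbf{h}_{22} & \mathbf{h}_{23}\\ \mathbf{h}_{31} & \mathbf{h}_{32} & \mathbf{h}_{33}\end{bmatrix}.\qquad (2)$$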

To obtain original speech signals s_{1}, s_{2} and s_{3}, the de-mixing matrix ${\overline{\overline{\mathbf{W}}}}_{\mathbf{d}}$ needs to be calculated. Most of the algorithms only use simulated room environments for mixing matrix instead of real room as shown in [22, 23]. Apart from this, the temporal whitening caused by the equalisation filters **w**_{11}, **w**_{22} and **w**_{33} will render the output useless. To address this problem, in [24], a linear predictive codec-based solution is proposed, but that is not suitable in all cases. In [25], it is stated that the main difficulty is that audio source separation problems are usually mathematically ill-posed and to succeed it is necessary to incorporate additional knowledge about the mixing process and/or the source signals. However, by definition, blindness implies an absence of prior information.
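The convolutive mixing model described above can be illustrated with a minimal sketch. It assumes NumPy, uses white noise as a stand-in for speech and draws hypothetical random filters `H[p, q]` in place of measured room impulse responses; the function name `convolutive_mix` is illustrative and not taken from the paper.

```python
import numpy as np

def convolutive_mix(sources, H):
    """Mix S sources through an S x S bank of FIR filters.

    sources: array of shape (S, n_samples)
    H:       array of shape (S, S, K); H[p, q] is the impulse response
             from source q to sensor p.
    Returns an array of shape (S, n_samples) with the sensor signals.
    """
    S, n_samples = sources.shape
    x = np.zeros((S, n_samples))
    for p in range(S):
        for q in range(S):
            # Linear convolution, truncated to the input length.
            x[p] += np.convolve(sources[q], H[p, q])[:n_samples]
    return x

rng = np.random.default_rng(0)
S, K, n_samples = 3, 64, 16000
s = rng.standard_normal((S, n_samples))      # stand-in for speech signals
H = 0.1 * rng.standard_normal((S, S, K))     # hypothetical room filters
H[np.arange(S), np.arange(S), 0] = 1.0       # dominant direct paths
x = convolutive_mix(s, H)
```

Each sensor signal `x[p]` is the sum of all three sources after filtering, which is the critically determined three-source, three-sensor case considered here.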

This research has exploited the fusion of two different criteria, i.e. one based on de-correlation and the other based on information theory. The former requires the implementation in the frequency domain, and the latter requires that in the time-frequency domain using neural networks. This fusion used in the hybrid algorithm improves the SIR performance compared to each technique if used individually (independently). It obviates the requirement for semi-blind array processing methodologies to resolve the permutation problem. It also does not have any temporal whitening problem and is suitable for real-time digital signal processing (DSP) board implementation based on its low computational load shown later on.

## 3 Hybrid algorithm

Here *x*_{1}(*n*), *x*_{2}(*n*) and *x*_{3}(*n*) are three convolved mixed streams of data coming from the sensors. The output of the algorithm *u*_{1}(*n*), *u*_{2}(*n*) and *u*_{3}(*n*) are three separated signals. The hybrid algorithm fuses two approaches based on two conditions in a sequential manner. The first approach uses frequency domain diagonalisation based on a de-correlation condition; the second approach is neural network feedback based on a statistical independence condition using information maximisation [3]. The reason for choosing each condition with its relevant approach will be discussed in the following subsections. The implementation mechanism for both of these approaches is novel. Each structure of the hybrid algorithm, i.e. controlled frequency domain diagonalisation (CFDD) and frequency domain adaptive feedback separation (FDAFS), will be discussed in the following two subsections.

### 3.1 Controlled frequency domain diagonalisation

Frequency domain diagonalisation is applied here through a controlled mechanism in order to avoid the permutation problem similar to that shown by Schobben and Sommen in [13] for two sources in a real room environment. Joint diagonalisation of correlation matrices based on the Jacobi method [27] could also be implemented in the frequency domain for convolutive mixture problems, but the adaptive controlled diagonalisation mechanism proposed here is more robust.

The CFDD starts by converting the time domain BSS problem into the frequency domain. This simplifies the time domain (multi-dimensional) matrix inversion problem to bin-by-bin separation in the frequency domain. The time to frequency domain conversion process is performed by using the overlap and save method; a Hanning window is applied. This is also known as the short-time Fourier transform (STFT). The length of the fast Fourier transform is *N*, the length of the filter in the time domain is *K* and the size of the speech signal block taken is *B*.
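As a rough illustration of the time to frequency domain conversion, the sketch below splits a signal into Hanning-windowed blocks and takes the FFT of each block. The 50% overlap between blocks is an assumption, and the sketch omits the bookkeeping of a full overlap and save implementation; `stft_blocks` is a hypothetical helper name.

```python
import numpy as np

def stft_blocks(x, N):
    """Return the windowed FFT of overlapping blocks of x.

    Blocks of length N advance by N // 2 samples (50 % overlap, an
    assumption). Output shape: (n_blocks, N // 2 + 1).
    """
    hop = N // 2
    window = np.hanning(N)
    n_blocks = (len(x) - N) // hop + 1
    frames = np.stack([x[b * hop : b * hop + N] for b in range(n_blocks)])
    # Only N // 2 + 1 bins are kept; the rest are conjugate mirrors.
    return np.fft.rfft(frames * window, axis=1)

fs, N = 16000, 1024
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)       # 1 kHz test tone
X = stft_blocks(x, N)
peak_bin = int(np.argmax(np.abs(X[0])))
```

Each row of `X` holds one block, so that separation can proceed bin by bin across the channels.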

Here **R**_{kb} is a frequency domain correlation matrix, where *k* denotes the bin number, *b* denotes the block number and the asterisk indicates the conjugate value. In order to obtain the de-mixing (inverse) system adaptively, a strong correlation should exist over multiple blocks. However, this is not the case for speech, which is quasi-stationary only over 10 to 30 ms and non-stationary beyond that. So, the first step of the algorithm is the block-based correlation constraint of Equation 5.

In Equation 5, *α* is the weighting factor, which can take any value from 0 to 1. The value recommended for non-stationary signals like speech is above 0.9. This step can also be referred to as the intersection of solution sets, as in [28]. The initial correlation matrix from which Equation 5 starts is an identity matrix. The square root inverse of the constrained correlation matrix is then taken and normalisation is applied.

Given the de-mixing matrix **W**_{kb} from the previous blocks, the CFDD uses a stochastic-based approach for updating. The stochastic-based approach is similar to that shown for the instantaneous case of BSS based on the Frobenius norm in [29], but here it is applied to the convolutive case. The previous block de-mixing matrix is made unitary by minimising a cost function in a least mean square (LMS) manner.

Only *N*/2 bins need to be processed since the other half is their conjugate mirror. Also, **W**_{kb} needs to be adjusted to avoid circular convolution and perform linear convolution, a step that can be seen in the next section too. The original signals s_{1}, s_{2} and s_{3} can be recovered by multiplying the de-mixing matrix **W**_{kb} with the mixed stream bins X_{1}(*k*), X_{2}(*k*) and X_{3}(*k*) and then taking the inverse Fourier transform (IFFT) of the result to convert it back to the time domain. The final step before the inverse STFT is therefore the per-bin product of **W**_{kb} with the vector of mixed stream bins.

The permutation problem in the algorithm is resolved by the linear convolution constraint, which populates the time domain filters with zeros and thereby links the otherwise independent frequencies, similar to Parra and Spence in [30]. However, the trade-off between the filter length *K* and the frequency resolution constrains the filter to be shorter than the typical impulse response of a room, approximately 200 to 300 ms. The CFDD algorithm presented here has a more flexible approach based on LMS and also avoids (three-dimensional) matrix inversion. The process of convergence is deliberately slowed (discussed later) through over-damping to achieve a robust SIR for all cases. This drawback and the short filter length are mitigated with the help of the second structure, FDAFS, in the hybrid algorithm.
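The first two CFDD steps described in this subsection, the weighted block-based correlation estimate and its square root inverse, can be sketched per frequency bin as follows. The recursive form `alpha * R_prev + (1 - alpha) * x x^H` is an assumed reading of the weighting factor *α* and the identity initialisation, not a reproduction of the paper's equations; the function names are hypothetical, and the LMS-based unitary update and the linear convolution constraint are not shown.

```python
import numpy as np

def update_correlation(R_prev, X_kb, alpha=0.9):
    """Recursive per-bin correlation estimate (assumed form).

    R_prev: (n_bins, S, S) estimate from the previous block.
    X_kb:   (n_bins, S) mixed-stream spectra of the current block.
    """
    inst = X_kb[:, :, None] * np.conj(X_kb[:, None, :])   # x x^H per bin
    return alpha * R_prev + (1.0 - alpha) * inst

def inverse_sqrt(R, eps=1e-12):
    """Inverse matrix square root of each Hermitian matrix in R."""
    eigval, eigvec = np.linalg.eigh(R)
    scale = 1.0 / np.sqrt(np.maximum(eigval, eps))
    # V diag(1 / sqrt(lambda)) V^H, evaluated for every bin at once.
    return (eigvec * scale[:, None, :]) @ np.conj(np.swapaxes(eigvec, -1, -2))

rng = np.random.default_rng(1)
n_bins, S = 8, 3
R = np.tile(np.eye(S, dtype=complex), (n_bins, 1, 1))     # identity start
for _ in range(50):
    X_kb = rng.standard_normal((n_bins, S)) + 1j * rng.standard_normal((n_bins, S))
    R = update_correlation(R, X_kb)
W = inverse_sqrt(R)
```

Applying `W` to data with correlation `R` yields de-correlated, unit-power channels in every bin, which is the de-correlation condition that the first stage relies on.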

### 3.2 Frequency domain adaptive feedback separation

The coefficients of the de-mixing filters **w**_{ij} are estimated adaptively. The slope parameter *β* in the update equation is simply assigned the value 1 in this algorithm. The purpose of choosing this feedback neural network approach without the equalisation filters **w**_{ii} is to avoid temporal whitening (equalisation also has nothing to do with separation). It is important to emphasise that the de-mixing filters avoid the inverse of the determinant of the mixing matrix but still require the inverse of the filters **h**_{11}, **h**_{22} and **h**_{33} to estimate the de-mixing filters, which also need to be realisable (for details, see the inverse of non-minimum phase systems, pp. 348–349 in [9]). However, this is far less problematic than the inverse of the determinant of the mixing matrix for two reasons. Firstly, if the sensors are close to the speech sources, the unrealisable inverse filtering problem can always be avoided [8]. Secondly, the first structure (CFDD) is sufficiently robust and FDAFS works sequentially on already de-correlated speech signals.

The frequency domain implementation structure used in FDAFS is based on fast block-by-block calculation of coefficients instead of sample by sample. For this, the frequency domain block LMS methodology is modified for feedback adaptation and is shown in Figure 2.

This time-frequency domain implementation shown in Figure 2 for the filter w_{12} does not have any permutation problem. The reason is that the separation is not completely in the frequency domain: the error is estimated in the time domain, and therefore, it is not a bin-by-bin separation as in the case of CFDD. The working details of the block LMS can be seen in ([31], pp. 350–353); the above structure is just the interpretation of Equations 12 and 13 whilst using the overlap and save method. Only the power constraint block is integrated into the structure. That is needed to normalise the coefficients of each bin with the corresponding power from each bin of the output signal. Here the output signal is selected for normalisation instead of the input signal due to its superior performance.
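For orientation, the sketch below shows the textbook frequency domain block LMS with overlap and save on which the FDAFS structure builds. It is not the authors' feedback variant: it identifies a hypothetical unknown FIR filter in a feedforward arrangement, and it normalises each bin by the input power rather than the output power. The error is formed in the time domain and the gradient is constrained to the first `K` taps, as in the description above; all names and step sizes are illustrative assumptions.

```python
import numpy as np

def fblms_identify(x, d, K, mu=0.2, lam=0.9, eps=1e-8):
    """Estimate a K-tap FIR filter mapping x to d, block by block."""
    N = 2 * K                                    # FFT size, 50 % overlap
    W = np.zeros(N, dtype=complex)               # frequency domain weights
    P = None                                     # per-bin power estimate
    for b in range(len(x) // K - 1):
        seg = x[b * K : b * K + N]               # previous + current block
        X = np.fft.fft(seg)
        y = np.fft.ifft(X * W).real[K:]          # overlap-save: last K valid
        e = d[b * K + K : b * K + N] - y         # time domain error
        E = np.fft.fft(np.concatenate([np.zeros(K), e]))
        power = np.abs(X) ** 2
        P = power if P is None else lam * P + (1 - lam) * power
        g = np.fft.ifft(np.conj(X) * E / (P + eps)).real
        g[K:] = 0.0                              # gradient (linearity) constraint
        W = W + mu * np.fft.fft(g)
    return np.fft.ifft(W).real[:K]

rng = np.random.default_rng(2)
K = 16
h_true = rng.standard_normal(K)                  # hypothetical unknown filter
x = rng.standard_normal(K * 2000)
d = np.convolve(x, h_true)[: len(x)]
h_est = fblms_identify(x, d, K)
```

Because the coefficients are updated once per block with FFTs instead of once per sample, the cost per output sample grows with log *N* rather than with the filter length.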

## 4 Harmonic alignment

The purpose of using harmonic alignment (HA) is to exploit the properties of the speech signal to improve the SIR in a real room environment. The DOA techniques [10, 11] also use speech properties to align harmonics at lower frequencies where the width of the beam becomes broader. However, in our case, it is applied in a different way after the hybrid algorithm on separated speech.

The pitch of the streams u_{1}, u_{2} and u_{3} containing the separated signals is calculated for small segments of speech called syllables. The size of each syllable is based on the quasi-stationary property of speech and is typically between 10 and 30 ms. Any FFT-based pitch detection algorithm can be used, provided that it has a high frequency resolution per bin. For completeness, the pitch detection algorithm used here is outlined as follows.

P_{2}, P_{3}, P_{4} and P_{5} are the holding vectors of the second, third, fourth and fifth harmonics, respectively, and are initially populated with a vector of ones. A loop then populates them with the required harmonics, which are the multiples of the fundamental; here *N* corresponds to the size of the FFT and *a* denotes the harmonic number. The pitch is calculated by first taking the product of these harmonic vectors and then obtaining the maximum bin number by applying a periodogram maximiser, where *m* corresponds to the bin number that has the maximum energy in its fundamental and harmonics. The fundamental frequency is calculated from the bin by using *f*_{f} = *m* × (*f*_{s}/*N*), where *f*_{s} is the sampling frequency. Only a fundamental frequency in the range 50 Hz ≤ *f*_{f} ≤ 450 Hz is considered a pitch of the formant part of human speech, and anything else is either a non-voiced segment or the noise part (fricatives). Figure 3 shows the formant and fricative sections of a segment of the speech signal from a single speaker. The pitch detection only in formant sections can be seen in Figure 4b. If the harmonic and pitch are removed from this segment, as shown in Figure 4c,d, then the resultant signal will contain only the fricatives and the residual formant (i.e. greatly degraded in strength). The removal of the harmonic and the pitch will be discussed shortly.
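The pitch detection procedure outlined above can be sketched as a harmonic product over a zero-padded FFT. The window, the FFT size of 8192 and the restriction of the search to bins where all five harmonic vectors are populated are assumptions made for this illustration, and `detect_pitch` is a hypothetical name.

```python
import numpy as np

def detect_pitch(frame, fs, N=8192, n_harmonics=5, f_min=50.0, f_max=450.0):
    """Harmonic-product pitch estimate for one syllable.

    Returns (pitch_hz, bin_index), or (None, None) when the estimate
    falls outside [f_min, f_max] and is treated as unvoiced.
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=N))
    n_bins = len(spectrum)
    product = spectrum.copy()
    for a in range(2, n_harmonics + 1):
        P_a = np.ones(n_bins)                  # holding vector of harmonic a
        limit = (n_bins - 1) // a + 1          # bins k with a * k in range
        P_a[:limit] = spectrum[: a * limit : a]
        product *= P_a
    # Search only where every harmonic vector has been populated.
    search = (n_bins - 1) // n_harmonics + 1
    m = int(np.argmax(product[:search]))
    pitch = m * fs / N                         # bin number to frequency
    if f_min <= pitch <= f_max:
        return pitch, m
    return None, None

fs = 16000
t = np.arange(320) / fs                        # one 20 ms syllable
frame = sum(np.sin(2 * np.pi * 200 * h * t) / h for h in range(1, 6))
pitch, m = detect_pitch(frame, fs)
```

Zero-padding the 320-sample syllable to 8192 points gives a bin spacing of about 2 Hz, which addresses the requirement for a high frequency resolution per bin.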

In the separated stream *u*_{1}(*n*), the primary pitch is that of the separated signal *s*_{1}(*n*) and the secondary pitches are those of the suppressed speakers *s*_{2}(*n*) and *s*_{3}(*n*). The primary pitch ${\mathit{f}}_{1}^{\mathit{s}}$ is calculated from *u*_{1}(*n*), and the secondary pitches ${\mathit{f}}_{2}^{\mathit{s}}$ and ${\mathit{f}}_{3}^{\mathit{s}}$ are calculated from *u*_{2}(*n*) and *u*_{3}(*n*), respectively. The amplitudes associated with these pitches, obtained from Equation 14, are ${\mathit{a}}_{\mathit{f}1}^{\mathit{s}}$, ${\mathit{a}}_{\mathit{f}2}^{\mathit{s}}$ and ${\mathit{a}}_{\mathit{f}3}^{\mathit{s}}$. The superscript *s* denotes the syllable number.

In the algorithm, *d* is the number of harmonics that need to be removed and *v* is the width of the comb filter that is needed to remove adjacent frequencies; *v* can be variable instead of a fixed value. ${\mathit{Z}}_{1}^{\mathit{s}}$ is the output of HA of the first stream, initialised by ${\mathit{Z}}_{1}^{\mathit{s}}={\mathit{U}}_{1}^{\mathit{s}}$. Similarly, the pitch frequency from *u*_{3}(*n*) can be removed in the same way. A final step is then applied to ${\mathit{Z}}_{1}^{\mathit{s}}$.

The above step ensures that the quality of the primary speech signal is not affected, and the reason is that the whole HA algorithm is based on the FFT that has an inherent problem of spectral leakage. For this reason, a comb filter is used, but it has a drawback of removing additional adjacent frequencies.
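A minimal sketch of the comb filtering idea follows, assuming that the pitches have already been converted to FFT bin numbers. The notch shape (bins set to zero), the interpretation of *v* as a half-width in bins and the final restoration of the primary harmonics are assumptions chosen to match the description above, not the paper's exact steps; `remove_secondary_harmonics` is a hypothetical name.

```python
import numpy as np

def remove_secondary_harmonics(U1, m_primary, m_secondary, d=5, v=2):
    """Notch the harmonics of an interfering pitch out of one syllable.

    U1:          rfft spectrum of the syllable from the primary stream.
    m_primary:   pitch bin of the primary speaker (None if unvoiced).
    m_secondary: pitch bin of a suppressed speaker (None if unvoiced).
    d:           number of harmonics to remove.
    v:           half-width of each notch in bins (assumed meaning).
    """
    Z1 = U1.copy()                              # Z initialised with U
    if m_secondary is None:
        return Z1
    n_bins = len(U1)
    for a in range(1, d + 1):                   # comb over harmonics 1..d
        centre = a * m_secondary
        Z1[max(centre - v, 0) : min(centre + v + 1, n_bins)] = 0.0
    if m_primary is not None:
        for a in range(1, d + 1):               # protect primary harmonics
            lo = max(a * m_primary - v, 0)
            hi = min(a * m_primary + v + 1, n_bins)
            Z1[lo:hi] = U1[lo:hi]
    return Z1

U1 = np.ones(100, dtype=complex)                # flat stand-in spectrum
Z1 = remove_secondary_harmonics(U1, m_primary=10, m_secondary=13, d=3, v=1)
```

The second loop reflects the trade-off noted above: the comb removes adjacent frequencies as well, so bins belonging to the primary speaker are copied back from the unfiltered spectrum.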

## 5 Simulated room environment experiment and analysis

Here *q* denotes the stream number. The SIR was calculated over the entire range of speech signals using a sliding syllable window of 20 ms, equivalent to 320 samples at a sampling frequency of 16 kHz. The performance of the hybrid algorithm is summarised in Table 1. The whole speech length was divided into 4-s segments, and for each segment, the SIR is shown based on a small section of speech (syllable). The best performance given by FDAFS when used independently for separation is no more than 4 to 5 dB, which is very poor and, for this reason, is not shown separately. The same is true for its time domain equivalent TD-FB [8].
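The windowed SIR measurement can be sketched as follows, using the standard definition of SIR as the ratio of target power to interference power in decibels. The sketch assumes that the target and interference contributions at an output are available separately, which is only possible in simulation, and it uses consecutive non-overlapping 20 ms windows; both choices, and the name `sliding_sir_db`, are assumptions.

```python
import numpy as np

def sliding_sir_db(target, interference, fs=16000, window_ms=20.0, eps=1e-12):
    """SIR in dB for consecutive windows of one output stream.

    target:       contribution of the wanted source at the output.
    interference: summed contribution of the other sources at the output.
    """
    win = int(round(fs * window_ms / 1000.0))    # 320 samples at 16 kHz
    n_win = min(len(target), len(interference)) // win
    t = np.reshape(target[: n_win * win], (n_win, win))
    i = np.reshape(interference[: n_win * win], (n_win, win))
    p_t = np.sum(t ** 2, axis=1)                 # target power per window
    p_i = np.sum(i ** 2, axis=1)                 # interference power per window
    return 10.0 * np.log10((p_t + eps) / (p_i + eps))

sir = sliding_sir_db(np.ones(640), 0.1 * np.ones(640))
```

In this toy case the interference amplitude is one tenth of the target amplitude, so each window has a power ratio of 100.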

**Table 1: SIR of the hybrid algorithm for the experiment performed in the simulated room environment shown in Figure 5**

| Measure | Signal | 0 to 4 s | 4 to 8 s | 8 to 12 s | 12 to 16 s | 16 to 20 s | 20 to 24 s | 24 to 28 s | 28 to 32 s | 32 to 36 s | 36 to 40 s | 40 to 44 s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SIR (dB) | Input (stream 1) | 4.7 | 0.8 | −9.8 | 4.6 | −1.4 | 0.8 | 3.2 | −4.4 | −1.3 | −18.3 | −3.1 |
| | CFDD | 5.0 | 1.8 | −4.2 | 15.8 | 0.2 | 4.8 | 10.4 | 5.4 | 12.2 | −12.5 | 7.6 |
| | Hybrid | 7.5 | 4.2 | −0.7 | 23.1 | 10.4 | 7.7 | 14.4 | 9.2 | 12.7 | −9.3 | 11.7 |
| | Input (stream 2) | 4.1 | −4.1 | 0.4 | −5.4 | −5.6 | −3.1 | −10.3 | −8.8 | −6.5 | 0.5 | −6.4 |
| | CFDD | 7.0 | −1.9 | 5.4 | 0.6 | 0.5 | 1.1 | 2.1 | 6.0 | 6.0 | 11.2 | 4.0 |
| | Hybrid | 11.1 | 2.4 | 14.7 | 5.3 | 10.7 | 8.4 | 7.1 | 10.2 | 15.9 | 15.4 | 8.2 |
| | Input (stream 3) | −5.2 | −1.8 | −5.1 | −2.2 | −0.3 | 1.9 | −21.4 | 7.5 | −1.9 | 3.4 | 0.7 |
| | CFDD | −3.2 | 1.4 | 3.3 | 5.7 | 3.6 | 8.5 | −12.8 | 21.3 | 12.0 | 13.5 | 14.2 |
| | Hybrid | 1.0 | 6.3 | 10.4 | 9.8 | 10.0 | 13.3 | −10.2 | 25.6 | 14.5 | 16.3 | 15.5 |
| SAR | Average SAR hybrid | 32.72 | 31.24 | 27.92 | 26.16 | 21.13 | 19.28 | 18.12 | 15.71 | 15.9 | 16.87 | 18.24 |

It is evident from Table 1 that the hybrid algorithm gives superior performance to CFDD and FDAFS used independently; after 20 s, it improves the SIR of the input by 12 dB. It is very important to emphasise that the CFDD coefficients are not updated with each consecutive block but only after every fifth block, with *α* = 0.9. The CFDD is therefore more immune to the non-stationary behaviour of the speech signal, has a reduced computational load and converges to a true minimum. This over-damped criterion is necessary but results in very slow convergence, which is compensated by the FDAFS (second stage). It also has the benefit of reduced complexity since estimation and update are performed only once in five blocks.

In Table 1, it can be seen that for input (stream 1) during the segment from 16 to 20 s, the SIR improvement shown by the CFDD on a small section of speech is only 1.6 dB. This is due to an anomaly (permutation misalignment): certain harmonics that need to be suppressed more are not suppressed at all. These anomalies arise from the slow learning process of the CFDD but reduce in amplitude as time progresses. These limitations are addressed in the FDAFS (second stage), as can be seen in Table 1. It is pertinent to mention that this slow learning process is intended to avoid a local solution (associated with the non-stationary nature of speech signals). If the weight factor is increased in the algorithm (CFDD), separation can be achieved in 2 s, but this separation will be local. The coefficients of this separation, when applied to the next segment of the convolutive mixture of speech signals, will not achieve any separation. So, either another local (permuted) solution is obtained, which is useless, or true separation filters are calculated. For this reason, longer speech results (40 s) have been shown to verify the convergence of the algorithm to its true minimum and thus to the true separation filters.

**Table 2: Average SIR of the hybrid algorithm for the experiment performed in the simulated room environment shown in Figure 5**

| Measure | Signal | 0 to 4 s | 4 to 8 s | 8 to 12 s | 12 to 16 s | 16 to 20 s | 20 to 24 s | 24 to 28 s | 28 to 32 s | 32 to 36 s | 36 to 40 s | 40 to 44 s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Average SIR (dB) | Input (streams) | −0.02 | −2.68 | −2.69 | −3.79 | −3.72 | −0.15 | −2.22 | −3.90 | −2.24 | −2.57 | −4.34 |
| | CFDD | 2.44 | 0.33 | 3.06 | 0.270 | 1.85 | 3.44 | 3.87 | 5.848 | 5.489 | 7.001 | 5.081 |
| | ECoBLISS | 0.04 | −1.80 | −0.96 | −1.01 | 0.618 | 2.038 | 2.752 | 6.42 | 4.477 | 8.32 | 4.158 |
| | Hybrid | 4.13 | 3.52 | 7.66 | 4.30 | 4.93 | 8.209 | 6.505 | 9.213 | 7.903 | 10.2 | 7.889 |

## 6 Real room experimental results

**Table 3: Average SIR of the hybrid algorithm for the experiment performed in the real room environment shown in Figure 9**

| Measure | Signal | 0 to 4 s | 4 to 8 s | 8 to 12 s | 12 to 16 s | 16 to 20 s | 20 to 24 s | 24 to 28 s | 28 to 32 s | 32 to 36 s | 36 to 40 s |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Average SIR (dB) | Input (streams) | −0.38 | 0.55 | −2.40 | −2.44 | −1.15 | −0.98 | 4.70 | −7.01 | 0.59 | −2.09 |
| | CFDD | 0.49 | 1.77 | −0.72 | −0.25 | 4.04 | 3.96 | 9.40 | 1.62 | 8.01 | 3.06 |
| | Hybrid | 1.43 | 3.60 | 1.82 | 2.42 | 6.01 | 5.94 | 12.55 | 3.7 | 9.34 | 5.74 |
| | Improvement CFDD | 0.87 | 1.21 | 1.67 | 2.19 | 5.19 | 4.94 | 4.70 | 8.63 | 7.41 | 5.15 |
| | Improvement hybrid | 1.81 | 3.04 | 4.22 | 4.87 | 7.16 | 6.92 | 7.85 | 10.71 | 8.74 | 7.83 |

## 7 Discussion

If *N* is the size of the FFT and the CFDD's overlap and add method is running at its optimum level, with the filter length equal to half the FFT size, then the average computational cost over five blocks is 19.2 × *N* log_{2} *N* + 1,976.2 *N* operations. Similarly, the computational cost for the FDAFS structure is 19.2 × *N* log_{2} *N* + 411 *N* operations, whereas the cost of its time domain equivalent TD-FB is (18 × *N* log_{2} *N* + 210) *f*_{s} operations. These computational costs are calculated based on the multiplication-accumulation (MAC) operation of a typical DSP board. It is important to mention that for complex multiplication, six operations have been considered. Similarly, for division, exponential or square root, 20 operations have been considered, depending on whether a Taylor series expansion or a polynomial curve fit is used by the designer. For the FFT, the number of MAC operations considered, based on a radix-2 implementation, is 2 × *N* log_{2} *N*. The low computational load of the hybrid algorithm, in terms of million multiplication-accumulations per second (MMACS), is evident from Table 4.

**Table 4: Computational load in MMACS for different algorithms at sampling frequencies of 16 and 48 kHz**

| | Sampling frequency | 16,000 Hz | 16,000 Hz | 16,000 Hz | 48,000 Hz | 48,000 Hz | 48,000 Hz | 48,000 Hz |
|---|---|---|---|---|---|---|---|---|
| Impulse length (ms) | | 64 | 128 | 256 | 21.33 | 42.66 | 85.33 | 170.66 |
| Filter taps (*K*) | | 1,024 | 2,048 | 4,096 | 1,024 | 2,048 | 4,096 | 8,192 |
| MMACS | CFDD | 69.99 | 70.61 | 71.22 | 209.99 | 211.83 | 213.67 | 215.52 |
| | FDAFS | 30.04 | 31.58 | 33.12 | 90.14 | 94.75 | 99.36 | 103.96 |
| | ECoBLISS | 141.23 | 142.76 | 144.30 | 423.69 | 428.30 | 432.91 | 437.52 |
| | HYBRID | 100.03 | 102.19 | 104.34 | 300.13 | 306.58 | 313.03 | 319.48 |
| | TD-FB | 298.27 | 593.18 | 1183 | 894.81 | 1779.6 | 3549 | 7088 |

It can be seen from Table 4 that FDAFS is many times faster than its time domain equivalent TD-FB, just as frequency domain LMS is faster than its time domain equivalent (p. 353 in [31]). The frequency domain implementation has two types of latency: the first is the computational latency, calculated from the MMACS of the algorithm divided by the MMACS capacity of the digital signal processor, and the second is the time it takes to fill up the block for the FFT. Nothing can be done about the latter; processor speed, however, is continually advancing, with even non-FPGA-type devices routinely available with speeds of a few thousand MMACS.
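The CFDD figures can be checked with a short calculation. The sketch below assumes that the FFT size is twice the number of filter taps and that a new block is processed every *N*/2 input samples; under these assumptions the stated per-block cost agrees with the CFDD row of Table 4 to within rounding. The block-fill latency is likewise taken as the time to collect *N*/2 new samples, and both function names are hypothetical.

```python
import math

def cfdd_mmacs(filter_taps, fs):
    """CFDD load in MMACS from the stated per-block operation count."""
    N = 2 * filter_taps                        # FFT size, filter is N / 2
    ops_per_block = 19.2 * N * math.log2(N) + 1976.2 * N
    blocks_per_second = fs / filter_taps       # assumed hop of N / 2 samples
    return ops_per_block * blocks_per_second / 1e6

def block_fill_latency_ms(filter_taps, fs):
    """Time needed to collect one new block of N / 2 samples (assumed)."""
    return 1000.0 * filter_taps / fs

for taps in (1024, 2048, 4096):
    load = cfdd_mmacs(taps, 16000)
    latency = block_fill_latency_ms(taps, 16000)
    print(f"{taps} taps: {load:.2f} MMACS, {latency:.0f} ms to fill a block")
```

The load grows only slowly with the filter length because the per-sample cost scales with log_{2} *N*, whereas the block-fill latency doubles each time the filter length doubles.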

In this paper we have solely discussed the main source separation algorithm without discussing background noise (not the additive sensor noise as taken in most cases). It has been shown (pp. 397–399 in [36]) that supervised adaptive filtering using a reference microphone to detect the noise source based on the least mean squares (LMS) technique gives the optimum performance in removing background noise. So, supervised adaptive filtering should be implemented prior to the use of an unsupervised hybrid algorithm instead of using acoustic echo cancellation (AEC) as shown in [13].

The convolutive mixture problem is complex, and *in extremis*, the whole source separation scenario becomes mathematically ill-posed (discussed earlier) and thus nothing works. The hybrid algorithm uses adaptive filtering methodology that is also unsupervised. So, in the case of *extremis*, multiple spurious minima can occur that result in either the algorithm taking longer to converge to a true minimum based on step size, or it may not converge at all. In such cases, in order to achieve separation (convergence to a true minimum), it is necessary to incorporate additional knowledge about the mixing process and/or the source signals; however, this would make the separation process semi-blind or supervised. Future investigations will focus on improving the robustness and applicability of the method.

## 8 Conclusions

In this paper we have presented a novel hybrid algorithm that uses an integrated, multiple conditions approach to solve the convolutive mixture problem of speech sources, instead of relying on only one condition. The performance of the algorithm based on experiments has been shown for simulated and real room environments. The proposed algorithm with its improved SIR using harmonic alignment and efficient computational complexity is suitable for hardware implementation for the real-time blind source separation of speech signals.

## References

- Comon P: Independent component analysis, a new concept?
*Signal Process.*1994, 36: 287-314. 10.1016/0165-1684(94)90029-9View ArticleMATHGoogle Scholar - Cardoso JF: Infomax and maximum likelihood for blind source separation.
*IEEE Signal Process. Letter*1997, 4: 109-111.View ArticleGoogle Scholar - Bell AJ, Sejnowiski TJ: An information-maximization approach to blind separation and blind deconvolution.
*Neural Comput.*1995, 7: 1129-1159. 10.1162/neco.1995.7.6.1129View ArticleGoogle Scholar - Salberg B, Grbic N, Claesson I: Online maximization of subband kurtosis for blind adaptive beamforming in realtime speech extraction. In
*IEEE, Proc. of the 2007 15 Intl. Conf. on Digital Signal Process*. Bleking Institute Technol, Ronneby; 2007.Google Scholar - Nandi AK:
*Blind Estimation Using Higher Order Statistics*. Kluwer, London; 1999.View ArticleGoogle Scholar - Amari S, Cichocki A: Adaptive blind signal processing - neural network approaches.
*Proc. IEEE*1998, 86(10):2026-2048. 10.1109/5.720251View ArticleGoogle Scholar - Amari S: Natural gradient works efficiently in learning.
*Neural Comput.*1998, 10: 251-276. 10.1162/089976698300017746View ArticleGoogle Scholar - Lambert RH:
*Multichannel blind deconvolution: FIR matrix algebra and separation of multipath mixtures*. PhD Thesis. University of Southern California; 1996.
- Haykin S: *Unsupervised Adaptive Filtering*. *Volume 1*. Wiley, New York; 2000.
- Sawada H, Mukai R, Araki S, Makino S: A robust and precise method for solving the permutation problem of frequency-domain blind source separation. *IEEE Trans. Speech Audio Process.* 2004, 12(5):530-538. 10.1109/TSA.2004.832994
- Sawada H, Mukai R, Araki S, Makino S: Solving the permutation problem of frequency-domain BSS when spatial aliasing occurs with wide sensor spacing. In *IEEE Conference on Acoustic Speech and Signal Processing, ICASSP 2006*. Toulouse; 2006:V77-V80.
- Peng B, Liu W, Mandic DP: Reducing permutation error in subband-based convolutive blind separation. *IET Signal Process.* 2012, 6(1):34-44. 10.1049/iet-spr.2011.0015
- Schobben DWE, Sommen PCW: A frequency domain blind signal separation method based on de-correlation. *IEEE Trans. Signal Process.* 2002, 50: 1855-1865. 10.1109/TSP.2002.800417
- Araki S, Mukai R, Makino S, Nishikawa T, Saruwatari H: The fundamental limitation of frequency domain blind source separation for convolutive mixtures of speech.
*IEEE Trans. Speech Audio Process.* 2003, 11(2):109-116. 10.1109/TSA.2003.809193
- Hoffman E, Vicente D, Orglmeister R: Time frequency masking strategy for blind source separation of acoustic signals based on optimally-modified log-spectral amplitude estimator. *LNCS* 2009, 5441: 581-588.
- Pedersen MS, Larsen J, Kjems U, Parra LC: A survey of convolutive blind source separation methods. In *Springer Handbook on Speech Processing and Speech Communication*. Springer, Berlin; 2007:1-34.
- Jain SN, Rai C: Blind source separation and ICA techniques: a review. *IJEST* 2012, 4(4):1490-1503.
- Ma WK, Hsieh TH, Chi CY: DOA estimation of quasi-stationary signals with less sensors than sources and unknown spatial noise covariance: a Khatri–Rao subspace approach. *IEEE Trans. Signal Process.* 2010, 58(4):2168-2180.
- Pertila P: Online blind speech separation using multiple acoustic speaker tracking and time–frequency masking. *Elsevier Comput. Speech Lang.* 2013, 27: 683-702. 10.1016/j.csl.2012.08.003
- Blandin C, Ozerov A, Vincent E: Multi-source TDOA estimation in reverberant audio using angular spectra and clustering.
*Elsevier Signal Process.* 2012, 92: 1950-1960. 10.1016/j.sigpro.2011.09.032
- Miranda RK, Sahoo SK, Zelenovsky R, Da Costa CL: Improved frequency domain blind source separation for audio signals via direction of arrival knowledge. In *Society for Design and Process Science, SPDS 2013*. Sao Paulo; 2013:35-39.
- Katayama T, Ishibashi T: A real-time blind source separation for speech signals based on the orthogonalization of the joint distribution of the observed signals. In *IEEE/SICE International Symposium on System Integration (SII)*. Kyoto; 2011:920-925.
- Na Y, Yu J, Chai B: Independent vector analysis using subband and subspace nonlinearity. *EURASIP J. Adv. Signal Process.* 2013, 2013: 1. 10.1186/1687-6180-2013-1
- Kokkinakis K, Zarzoso V, Nandi AK: Blind separation of acoustic mixtures based on linear prediction analysis. In *4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003)*. Nara; 2003:343-348.
- Ozerov A, Vincent E, Bimbot F: A general flexible framework for the handling of prior information in audio source separation. *IEEE Trans. Audio Speech Lang. Process.* 2012, 20(4):1118-1133.
- Aichner R, Buchner H, Yan F, Kellermann W: Real-time convolutive blind source separation based on a broadband approach. Fifth International Conference, ICA 2004, Granada, Spain, September 22–24, 2004. Lecture Notes in Computer Science, vol. 3195. In
*Independent Component Analysis and Blind Signal Separation*. Edited by: Puntonet CG, Prieto A. Springer, Berlin; 2004:840-848.
- Belouchrani A, Abed-Meraim K, Cardoso JF, Moulines E: A blind source separation technique using second-order statistics. *IEEE Trans. Signal Process.* 1997, 45(2):434-444. 10.1109/78.554307
- Schobben DWE, Sommen PCW: On the indeterminacies of convolutive blind signal separation based on second order statistics. In *ISSPA 99*. Brisbane; 1999:215-218.
- Joho M, Mathis H: Joint diagonalisation of correlation matrices by using gradient methods with application to blind signal separation. In *SAM 2002*. Rosslyn; 2002:273-277.
- Parra L, Spence C: Convolutive blind separation of non-stationary sources. *IEEE Trans. Speech Audio Process.* 2000, 8(3):320-327.
- Haykin S, Kailath T: *Adaptive Filter Theory*. 4th edition. Pearson Education, Upper Saddle River; 2007.
- Bellamy JC: *Digital Telephony*. 3rd edition. Wiley, New York; 2000.
- McGovern SG: A model for room acoustics. 2004. http://www.sgm-audio.com/research/rir/rir.html. Accessed 21 Jan 2013.
- Vincent E, Gribonval R, Fevotte C: Performance measurement in blind audio source separation. *IEEE Trans. Audio Speech Lang. Process.* 2006, 14(4):1462-1469.
- Holters M, Corbach T, Zoelzer U: Impulse response measurement techniques and their applicability in the real world. In *Proceedings of the 12th International Conference on Digital Audio Effects (DAFx-09)*. Como; 2009.
- Gaydecki P: *A Foundation of Digital Signal Processing: Theory, Algorithms and Hardware Design*. IEE, London; 2004.

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.