A hybrid algorithm for blind source separation of a convolutive mixture of three speech sources
© Minhas and Gaydecki; licensee Springer. 2014
Received: 23 August 2013
Accepted: 23 May 2014
Published: 17 June 2014
In this paper we present a novel hybrid algorithm for blind source separation of three speech signals in a real room environment. In addition to using second-order statistics, the algorithm exploits an information-theoretic approach based on higher-order statistics to achieve source separation, and its fast adaptive methodology makes it well suited to real-time implementation. It requires no prior information or parameter estimation. The algorithm also uses a novel post-separation speech harmonic alignment that improves performance. Experimental results in simulated and real environments verify the effectiveness of the proposed method, and analysis demonstrates that the algorithm is computationally efficient.
The blind source separation (BSS) of speech signals, also known as convolutive BSS, is a very challenging problem in real room environments. Approaches can be broadly divided into two categories: those that use an information-theoretic approach and those based on de-correlation. Some of the most widely applied information-theoretic approaches include independent component analysis (ICA) [1], maximum likelihood [2], information maximisation [3] and kurtosis maximisation [4]. Based on these information-theoretic approaches, the neural network-based algorithms presented in [5–8] are unsuitable for real-time implementation of BSS. The reasons are their massive complexity, i.e. the calculation of thousands of adaptive filter coefficients, and the temporal whitening problem (pp. 340–345 in [9]).
The frequency domain implementation decomposes this convolutive mixture problem into multiple instantaneous mixing problems; however, this in turn leads to scaling and permutation alignment problems (pp. 352–353 in [9]). To solve the permutation problem, many algorithms have been proposed, such as those in [10] and [11], that exploit the direction of arrival (DOA) and speech harmonics. These DOA-based algorithms are semi-blind rather than truly blind, since they depend on a particular geometrical arrangement. Another way to resolve the permutation alignment issue is to exploit the correlation between separated signals at adjacent frequency bands [12]. The reliability of this and similar techniques depends on the amount of correlation, which varies from case to case. In [13], a different approach is developed, based on de-correlation in the frequency domain; the algorithm avoids permutation through a very slowly converging diagonalisation procedure, but this slow convergence makes it less suitable for real-time implementation. Apart from the permutation problem, there are other frequency-domain limitations, discussed in detail in [14]. In [15], a secondary algorithm is proposed, based on time-frequency masking, which improves the signal-to-interference ratio (SIR) of the separated streams. Such techniques are completely dependent on the BSS of the primary algorithm: if the primary fails, so does the secondary.
The BSS of more than two sources is a more complicated and computationally intensive problem. This can be seen in [16]: a detailed survey reveals that, of 400 publications that employed convolutive source separation, only 2% dealt with more than two sources. Even for two sources, it was concluded that only 10% really worked, with SIRs varying from 5 to 18 dB using real room impulse responses. The results remain questionable, since it is very difficult to analyse or compare different algorithms owing to the lack of a unified test bench for performance measurement [17].
In this paper a novel algorithm for the BSS of three speech sources in real room environments is proposed. It uses both the information-theoretic and de-correlation approaches to achieve superior source separation with fast convergence. The algorithm has low complexity and is optimised for real-time implementation. In addition, it does not require any prior parameter estimation; furthermore, a harmonic alignment methodology, presented in this paper, improves the quality of separated speech in a real room environment.
The paper is organised as follows: In the following section, the motivation behind the new hybrid algorithm will be discussed. In Section 3, the hybrid algorithm will be presented followed by a discussion of its constituent parts. In Section 4, a novel harmonic alignment method will be presented. In Section 5, the performance of the algorithm will be analysed based on a simulated room environment. The results from the real room experiment will be shown in Section 6, followed by discussion (showing computational load) and conclusion in Sections 7 and 8, respectively.
The notation used in this paper is as follows: a lowercase letter x denotes a scalar quantity; a lowercase bold letter x denotes a vector (first-order tensor); an uppercase bold letter X denotes a two-dimensional matrix (second-order tensor); and three-dimensional matrices (third-order tensors) are denoted similarly to two-dimensional ones but with a double bar on top.
The motivation behind the hybrid algorithm will become evident as we progress through this section. The DOA-based algorithms presented in [18–21] have considered source separation for more than two sources in real room environments. However, they use a single large microphone unit consisting of an internal array of microphones, which has limitations: not only must the sources be placed in a particular geometrical arrangement, but performance also depends on the distance of the microphone from the sources. DOA-based separation also cannot handle a speaker positioned behind another speaker, or a speaker facing the wall rather than the microphone.
Many practical scenarios require arbitrarily placed microphones to pick up the stronger source signal and cancel the weaker interfering signals: for example, musical instruments in concerts, acoustics in theatre performances, meetings in conference rooms and discussions in parliament houses. All of these cases require a BSS algorithm for real-time separation. This research will show the potential of working with three speech sources and an equal number of sensors (microphones), i.e. a critically determined BSS case in a real room environment.
To recover the original speech signals s1, s2 and s3, the de-mixing matrix needs to be calculated. Most algorithms use only simulated room environments for the mixing matrix rather than real rooms, as shown in [22, 23]. Apart from this, the temporal whitening caused by the equalisation filters w11, w22 and w33 will render the output useless. To address this problem, a linear predictive codec-based solution is proposed in [24], but it is not suitable in all cases. In [25], it is stated that the main difficulty is that audio source separation problems are usually mathematically ill-posed and that, to succeed, it is necessary to incorporate additional knowledge about the mixing process and/or the source signals. However, by definition, blindness implies an absence of prior information.
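As background, the convolutive mixing that produces the observed sensor signals can be sketched as follows. The toy three-source signals and two-tap impulse responses below are purely illustrative stand-ins for real room responses, not the responses used in this paper.

```python
import numpy as np

def convolutive_mix(sources, impulse_responses):
    """Mix M source signals through an M x M grid of room impulse responses:
    x_i(n) = sum_j (h_ij * s_j)(n), truncated to the source length."""
    M, T = sources.shape
    mixtures = np.zeros((M, T))
    for i in range(M):
        for j in range(M):
            mixtures[i] += np.convolve(sources[j], impulse_responses[i][j])[:T]
    return mixtures

# Toy example: three sources, two-tap direct paths and weaker cross paths
rng = np.random.default_rng(0)
s = rng.standard_normal((3, 1000))
h = [[np.array([1.0, 0.3]) if i == j else np.array([0.0, 0.2]) for j in range(3)]
     for i in range(3)]
x = convolutive_mix(s, h)
```

A BSS algorithm observes only x and must estimate de-mixing filters that undo the h paths without knowing them.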
This research exploits the fusion of two different criteria: one based on de-correlation and the other based on information theory. The former is implemented in the frequency domain, the latter in the time-frequency domain using neural networks. This fusion, used in the hybrid algorithm, improves the SIR performance compared with either technique used independently. It obviates the need for semi-blind array processing methodologies to resolve the permutation problem. It also avoids the temporal whitening problem and is suitable for real-time digital signal processing (DSP) board implementation owing to its low computational load, shown later.
3 Hybrid algorithm
Here x1(n), x2(n) and x3(n) are the three convolved mixed streams of data coming from the sensors, and the outputs of the algorithm u1(n), u2(n) and u3(n) are the three separated signals. The hybrid algorithm fuses two approaches, applied sequentially and based on two conditions. The first approach uses frequency domain diagonalisation based on a de-correlation condition; the second is neural network feedback based on a statistical independence condition using information maximisation [3]. The reason for choosing each condition with its relevant approach will be discussed in the following subsections. The implementation mechanism for both approaches is novel. The two structures of the hybrid algorithm, i.e. controlled frequency domain diagonalisation (CFDD) and frequency domain adaptive feedback separation (FDAFS), will be discussed in the following two subsections.
3.1 Controlled frequency domain diagonalisation
Frequency domain diagonalisation is applied here through a controlled mechanism in order to avoid the permutation problem, similar to that shown by Schobben and Sommen in [13] for two sources in a real room environment. Joint diagonalisation of correlation matrices based on the Jacobi method [29] could also be implemented in the frequency domain for convolutive mixture problems, but the adaptive controlled diagonalisation mechanism proposed here is more robust.
The CFDD starts by converting the time domain BSS problem into the frequency domain. This simplifies the time domain (multi-dimensional) matrix inversion problem to bin-by-bin separation in the frequency domain. The time-to-frequency conversion is performed using the overlap-and-save method with a Hanning window applied; this is also known as the short-time Fourier transform (STFT). The length of the fast Fourier transform is N, the length of the filter in the time domain is K, and the size of each speech signal block is B.
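The block-based time-to-frequency conversion can be sketched in a few lines of framing, windowing and FFT; the values N = 256 and hop = 128 below are illustrative, not the parameter choices of the paper.

```python
import numpy as np

def stft_frames(x, N=256, hop=128):
    """Split x into overlapping frames, apply a Hanning window and take
    the FFT of each frame. Returns (num_frames, N) complex spectra."""
    window = np.hanning(N)
    num_frames = 1 + (len(x) - N) // hop
    frames = np.stack([x[m * hop : m * hop + N] * window
                       for m in range(num_frames)])
    return np.fft.fft(frames, axis=1)

# A 2048-sample tone yields 15 half-overlapped 256-point spectra
x = np.sin(2 * np.pi * 0.05 * np.arange(2048))
X = stft_frames(x)
```

Each column (bin) of X can then be treated as its own instantaneous mixing problem, which is exactly what creates the permutation alignment issue discussed earlier.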
The permutation problem in the algorithm is resolved by the linear convolution constraint, which populates zeros in the time domain and thereby links the otherwise independent frequencies, similar to Parra and Spence in [30]. However, the trade-off between filter length K and frequency resolution constrains the filter to be shorter than the typical impulse response of a room, approximately 200 to 300 ms. The CFDD algorithm presented here takes a more flexible approach based on LMS and also avoids (three-dimensional) matrix inversion. The convergence process is deliberately slowed (discussed later) through over-damping to achieve a robust SIR in all cases. This drawback, together with the short filter length, is mitigated by the second structure of the hybrid algorithm, FDAFS.
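The linear convolution constraint described above can be illustrated with a short sketch: each frequency-domain filter is taken back to the time domain, all taps beyond K are zeroed, and the result is transformed back, which couples the bins and discourages arbitrary per-bin permutations. The values N and K here are illustrative only.

```python
import numpy as np

def constrain_filter(W_f, K):
    """Project a length-N frequency-domain filter onto the set of filters
    whose time-domain support is limited to the first K taps.

    Zeroing taps K..N-1 couples the otherwise independent frequency bins."""
    w_t = np.fft.ifft(W_f)
    w_t[K:] = 0.0          # enforce the linear convolution constraint
    return np.fft.fft(w_t)

N, K = 64, 16
rng = np.random.default_rng(1)
W = np.fft.fft(rng.standard_normal(N))
W_c = constrain_filter(W, K)
w_c = np.fft.ifft(W_c)     # constrained filter back in the time domain
```

After projection the filter behaves as a genuine K-tap linear convolution rather than an unconstrained N-point circular one.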
3.2 Frequency domain adaptive feedback separation
This time-frequency domain implementation, shown in Figure 2 for the filter w12, does not have any permutation problem. The reason is that the separation is not performed entirely in the frequency domain: the error is estimated in the time domain, so it is not a bin-by-bin separation as in the case of CFDD. The working details of the block LMS can be found in (pp. 350–353 in [31]); the structure is an interpretation of Equations 12 and 13 using the overlap-and-save method, with only the power constraint block added. This block normalises the coefficients of each bin by the corresponding power of each bin of the output signal. The output signal is chosen for normalisation instead of the input signal because of its superior performance.
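A power-normalised frequency-domain block LMS update of the kind described above can be sketched as follows. This is a deliberately simplified sketch: the single-tap target, step size and smoothing constant are illustrative assumptions, and the error here is formed per bin rather than via the full overlap-and-save path of the FDAFS structure.

```python
import numpy as np

def fd_lms_step(W, X_blk, D_blk, P, mu=0.05, beta=0.9, eps=1.0):
    """One power-normalised frequency-domain LMS step.

    W: per-bin coefficients; X_blk, D_blk: input and desired block spectra;
    P: running per-bin power of the output, used for normalisation
    (the output rather than the input, as in the text above)."""
    U_blk = W * X_blk                                # output spectrum
    E_blk = D_blk - U_blk                            # error spectrum
    P = beta * P + (1 - beta) * np.abs(U_blk) ** 2   # smoothed output-bin power
    W = W + mu * np.conj(X_blk) * E_blk / (P + eps)  # normalised update
    return W, P, E_blk

# Identify a flat 0.5 gain from repeated blocks of a constant spectrum
N = 8
X = 2.0 * np.ones(N, dtype=complex)
D = 0.5 * X
W, P = np.zeros(N, dtype=complex), np.zeros(N)
for _ in range(100):
    W, P, E = fd_lms_step(W, X, D, P)
```

The per-bin normalisation equalises the convergence rate across bins with very different signal power, which is why a power constraint block is needed at all.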
4 Harmonic alignment
The purpose of harmonic alignment (HA) is to exploit the properties of the speech signal to improve the SIR in a real room environment. The DOA techniques [10, 11] also use speech properties to align harmonics at lower frequencies, where the beam becomes broader. In our case, however, it is applied differently, after the hybrid algorithm, on the separated speech.
The above step ensures that the quality of the primary speech signal is not affected; the reason is that the whole HA algorithm is based on the FFT, which has the inherent problem of spectral leakage. For this reason, a comb filter is used, although it has the drawback of also removing adjacent frequencies.
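The comb filtering referred to above can be illustrated as follows. The paper's actual HA procedure is only summarised here, so the fundamental bin, the one-bin bandwidth and the binary-mask form below are purely hypothetical choices for illustration; the mask's width is exactly the "adjacent frequencies" drawback mentioned in the text.

```python
import numpy as np

def harmonic_comb(x, f0_bin, width=1):
    """Keep only FFT bins within `width` bins of each harmonic of f0_bin.

    The non-zero width keeps (or, conversely, removes) bins adjacent to
    each harmonic, affecting neighbouring frequency content."""
    X = np.fft.rfft(x)
    mask = np.zeros(len(X))
    for k in range(f0_bin, len(X), f0_bin):
        lo, hi = max(k - width, 0), min(k + width + 1, len(X))
        mask[lo:hi] = 1.0
    return np.fft.irfft(mask * X, n=len(x))

n = np.arange(128)
tone = np.cos(2 * np.pi * 8 * n / 128)    # energy exactly on harmonic bin 8
other = np.cos(2 * np.pi * 3 * n / 128)   # energy at bin 3, off the comb
kept = harmonic_comb(tone, f0_bin=8)      # passes through unchanged
removed = harmonic_comb(other, f0_bin=8)  # suppressed by the comb
```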
5 Simulated room environment experiment and analysis
Table 1 SIR of hybrid algorithm for the experiment performed in the simulated room environment shown in Figure 5. [Table data not reproduced: rows for input streams 1 to 3 and the average SAR of the hybrid; columns for 4-s time segments from 0 to 44 s.]
It is evident from Table 1 that the hybrid algorithm gives superior performance to CFDD and FDAFS used independently; after 20 s, it improves the SIR of the input by 12 dB. It is important to emphasise that the CFDD coefficients are not updated with each consecutive block but only after every fifth block, with α = 0.9. The CFDD is therefore more immune to the non-stationary behaviour of the speech signal, has a reduced computational load and converges to a true minimum. This over-damped criterion is necessary but results in very slow convergence, which is compensated by the FDAFS (second stage). It has the additional benefit of reduced complexity, since estimation and update are performed only once in five blocks.
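The every-fifth-block update with smoothing factor α can be sketched as follows. Only the update schedule and α = 0.9 come from the text; the per-bin power statistic and the loop structure are illustrative assumptions standing in for the CFDD's actual correlation estimates and coefficient update.

```python
import numpy as np

def cfdd_schedule(blocks, alpha=0.9, update_every=5):
    """Accumulate a smoothed per-bin statistic over incoming blocks,
    triggering a coefficient update only once every `update_every` blocks.

    `blocks` is an iterable of per-block FFT vectors; returns the number
    of updates performed and the final smoothed estimate."""
    estimate = None
    updates = 0
    for m, X in enumerate(blocks, start=1):
        stat = np.abs(X) ** 2                  # illustrative per-bin statistic
        estimate = stat if estimate is None else alpha * estimate + (1 - alpha) * stat
        if m % update_every == 0:              # over-damped, infrequent update
            updates += 1                       # (coefficient update would go here)
    return updates, estimate

rng = np.random.default_rng(2)
blocks = [np.fft.fft(rng.standard_normal(64)) for _ in range(25)]
n_updates, est = cfdd_schedule(blocks)
```

Averaging over five blocks before each update is what buys the immunity to speech non-stationarity at the cost of slower convergence.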
In Table 1, it can be seen that for input (stream 1) during the segment from 16 to 20 s, the SIR improvement shown by the CFDD on a small section of speech is only 1.6 dB. This is due to an anomaly (permutation misalignment): certain harmonics that need strong suppression are not suppressed at all. Such anomalies occur because of the slow learning process of the CFDD, but they reduce in amplitude as time progresses, and the limitation is addressed by the FDAFS (second stage), as can be seen in Table 1. It is pertinent to mention that this slow learning process avoids local solutions (associated with the non-stationary nature of speech signals). If the weight factor of the CFDD is increased, separation can be achieved in 2 s, but this separation will be local: the resulting coefficients, when applied to the next segment of the convolutive mixture, will not achieve any separation. So either another (permuted, and therefore useless) local solution is obtained or the true separation filters are calculated. For this reason, longer speech results (40 s) are shown, verifying the convergence of the algorithm to its true minimum and thus to the true separation filters.
Table 2 Average SIR of hybrid algorithm for the experiment performed in the simulated room environment shown in Figure 5. [Table data not reproduced: average SIR (dB) over 4-s time segments from 0 to 44 s.]
6 Real room experimental results
Table 3 Average SIR of hybrid algorithm for the experiment performed in the real room environment shown in Figure 9. [Table data not reproduced: average SIR (dB) over 4-s time segments from 0 to 40 s.]
Table 4 Computational load in MMACS for different algorithms at sampling frequencies of 16 and 48 kHz. [Table data not reproduced: rows indexed by impulse length (ms) and filter taps (N).]
It can be seen from Table 4 that FDAFS is many times faster than its time domain equivalent TD-FB, just as frequency domain LMS is faster than its time domain equivalent (p. 353 in [31]). The frequency domain implementation incurs two types of latency: the computational latency, calculated as the MMACS of the algorithm divided by the MMACS capacity of the digital signal processor, and the time taken to fill up the block for the FFT. Nothing can be done about the latter; the former, however, benefits from continually advancing processor speeds, with even non-FPGA devices routinely available at a few thousand MMACS.
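The two latency terms described above can be made concrete with a small calculation; the MMACS figures, block size and sampling rate below are hypothetical values chosen only to illustrate the arithmetic.

```python
def total_latency_ms(algo_mmacs, dsp_mmacs, block_len, fs):
    """Latency per second of audio = compute time (algorithm load divided
    by processor capacity) + time to fill one FFT block."""
    compute_ms = 1000.0 * algo_mmacs / dsp_mmacs   # processing time per 1 s of audio
    block_fill_ms = 1000.0 * block_len / fs        # unavoidable buffering delay
    return compute_ms + block_fill_ms

# e.g. a 200 MMACS algorithm on a 2000 MMACS DSP, 1024-sample blocks at 16 kHz:
# 100 ms of computation per second of audio plus 64 ms of block fill
latency = total_latency_ms(200, 2000, 1024, 16000)
```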
In this paper we have discussed only the main source separation algorithm, without addressing background noise (as opposed to the additive sensor noise assumed in most cases). It has been shown (pp. 397–399 in ) that supervised adaptive filtering, using a reference microphone to detect the noise source based on the least mean squares (LMS) technique, gives optimum performance in removing background noise. Supervised adaptive filtering should therefore be implemented prior to the unsupervised hybrid algorithm, instead of using acoustic echo cancellation (AEC) as shown in .
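The supervised LMS noise canceller referred to above can be sketched in a few lines; the filter length, step size, signals and two-tap noise path below are illustrative, not parameters from the cited treatment.

```python
import numpy as np

def lms_noise_canceller(primary, reference, taps=8, mu=0.005):
    """Supervised adaptive noise canceller: a reference microphone observes
    the noise source; LMS shapes it to match the noise component in the
    primary channel, and the error signal is the cleaned speech."""
    w = np.zeros(taps)
    cleaned = np.zeros(len(primary))
    for n in range(taps - 1, len(primary)):
        r = reference[n - taps + 1 : n + 1][::-1]  # reference[n], reference[n-1], ...
        e = primary[n] - w @ r                     # cleaned sample (error signal)
        w += mu * e * r                            # LMS weight update
        cleaned[n] = e
    return cleaned, w

rng = np.random.default_rng(3)
noise = rng.standard_normal(20000)
speech = np.sin(2 * np.pi * 0.01 * np.arange(20000))
# Noise reaches the primary microphone through a short two-tap path
noisy = speech + np.convolve(noise, [0.5, 0.25])[:20000]
cleaned, w = lms_noise_canceller(noisy, noise)
```

Because the speech is uncorrelated with the reference, the filter converges towards the noise path and the speech survives in the error signal; this is why such a stage can safely precede the unsupervised hybrid algorithm.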
The convolutive mixture problem is complex, and in extremis, the whole source separation scenario becomes mathematically ill-posed (discussed earlier) and thus nothing works. The hybrid algorithm uses adaptive filtering methodology that is also unsupervised. So, in the case of extremis, multiple spurious minima can occur that result in either the algorithm taking longer to converge to a true minimum based on step size, or it may not converge at all. In such cases, in order to achieve separation (convergence to a true minimum), it is necessary to incorporate additional knowledge about the mixing process and/or the source signals; however, this would make the separation process semi-blind or supervised. Future investigations will focus on improving the robustness and applicability of the method.
In this paper we have presented a novel hybrid algorithm that uses an integrated, multiple conditions approach to solve the convolutive mixture problem of speech sources, instead of relying on only one condition. The performance of the algorithm based on experiments has been shown for simulated and real room environments. The proposed algorithm with its improved SIR using harmonic alignment and efficient computational complexity is suitable for hardware implementation for the real-time blind source separation of speech signals.
1. Comon P: Independent component analysis, a new concept? Signal Process. 1994, 36:287-314. doi:10.1016/0165-1684(94)90029-9
2. Cardoso JF: Infomax and maximum likelihood for blind source separation. IEEE Signal Process. Lett. 1997, 4:109-111.
3. Bell AJ, Sejnowski TJ: An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 1995, 7:1129-1159. doi:10.1162/neco.1995.7.6.1129
4. Salberg B, Grbic N, Claesson I: Online maximization of subband kurtosis for blind adaptive beamforming in realtime speech extraction. In Proceedings of the 2007 15th International Conference on Digital Signal Processing. IEEE; 2007.
5. Nandi AK: Blind Estimation Using Higher Order Statistics. Kluwer, London; 1999.
6. Amari S, Cichocki A: Adaptive blind signal processing - neural network approaches. Proc. IEEE 1998, 86(10):2026-2048. doi:10.1109/5.720251
7. Amari S: Natural gradient works efficiently in learning. Neural Comput. 1998, 10:251-276. doi:10.1162/089976698300017746
8. Lambert RH: Multichannel blind deconvolution: FIR matrix algebra and separation of multipath mixtures. PhD thesis, University of Southern California; 1996.
9. Haykin S: Unsupervised Adaptive Filtering, Volume 1. Wiley, New York; 2000.
10. Sawada H, Mukai R, Araki S, Makino S: A robust and precise method for solving the permutation problem of frequency-domain blind source separation. IEEE Trans. Speech Audio Process. 2004, 12(5):530-538. doi:10.1109/TSA.2004.832994
11. Sawada H, Mukai R, Araki S, Makino S: Solving the permutation problem of frequency-domain BSS when spatial aliasing occurs with wide sensor spacing. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006). Toulouse; 2006:V77-V80.
12. Peng B, Liu W, Mandic DP: Reducing permutation error in subband-based convolutive blind separation. IET Signal Process. 2012, 6(1):34-44. doi:10.1049/iet-spr.2011.0015
13. Schobben DWE, Sommen PCW: A frequency domain blind signal separation method based on de-correlation. IEEE Trans. Signal Process. 2002, 50:1855-1865. doi:10.1109/TSP.2002.800417
14. Araki S, Mukai R, Makino S, Nishikawa T, Saruwatari H: The fundamental limitation of frequency domain blind source separation for convolutive mixtures of speech. IEEE Trans. Speech Audio Process. 2003, 11(2):109-116. doi:10.1109/TSA.2003.809193
15. Hoffman E, Vicente D, Orglmeister R: Time frequency masking strategy for blind source separation of acoustic signals based on optimally-modified log-spectral amplitude estimator. LNCS 2009, 5441:581-588.
16. Pedersen MS, Larsen J, Kjems U, Parra LC: A survey of convolutive blind source separation methods. In Springer Handbook on Speech Processing and Speech Communication. Springer, Berlin; 2007:1-34.
17. Jain SN, Rai C: Blind source separation and ICA techniques: a review. IJEST 2012, 4(4):1490-1503.
18. Ma WK, Hsieh TH, Chi CY: DOA estimation of quasi-stationary signals with less sensors than sources and unknown spatial noise covariance: a Khatri-Rao subspace approach. IEEE Trans. Signal Process. 2010, 58(4):2168-2180.
19. Pertila P: Online blind speech separation using multiple acoustic speaker tracking and time-frequency masking. Comput. Speech Lang. 2013, 27:683-702. doi:10.1016/j.csl.2012.08.003
20. Blandin C, Ozerov A, Vincent E: Multi-source TDOA estimation in reverberant audio using angular spectra and clustering. Signal Process. 2012, 92:1950-1960. doi:10.1016/j.sigpro.2011.09.032
21. Miranda RK, Sahoo SK, Zelenovsky R, Da Costa CL: Improved frequency domain blind source separation for audio signals via direction of arrival knowledge. In Society for Design and Process Science, SDPS 2013. Sao Paulo; 2013:35-39.
22. Katayama T, Ishibashi T: A real-time blind source separation for speech signals based on the orthogonalization of the joint distribution of the observed signals. In IEEE/SICE International Symposium on System Integration (SII). Kyoto; 2011:920-925.
23. Na Y, Yu J, Chai B: Independent vector analysis using subband and subspace nonlinearity. EURASIP J. Adv. Signal Process. 2013, 2013:1. doi:10.1186/1687-6180-2013-1
24. Kokkinakis K, Zarzoso V, Nandi AK: Blind separation of acoustic mixtures based on linear prediction analysis. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003). Nara; 2003:343-348.
25. Ozerov A, Vincent E, Bimbot F: A general flexible framework for the handling of prior information in audio source separation. IEEE Trans. Audio Speech Lang. Process. 2012, 20(4):1118-1133.
26. Aichner R, Buchner H, Yan F, Kellermann W: Real-time convolutive blind source separation based on a broadband approach. In Puntonet CG, Prieto A (eds): Independent Component Analysis and Blind Signal Separation: Fifth International Conference, ICA 2004, Granada, Spain, September 22-24, 2004. Lecture Notes in Computer Science, vol. 3195. Springer, Berlin; 2004:840-848.
27. Belouchrani A, Abed-Meraim K, Cardoso JF, Moulines E: A blind source separation technique using second-order statistics. IEEE Trans. Signal Process. 1997, 45(2):434-444. doi:10.1109/78.554307
28. Schobben DWE, Sommen PCW: On the indeterminacies of convolutive blind signal separation based on second order statistics. In ISSPA 99. Brisbane; 1999:215-218.
29. Joho M, Mathis H: Joint diagonalisation of correlation matrices by using gradient methods with application to blind signal separation. In SAM 2002. Rosslyn; 2002:273-277.
30. Parra L, Spence C: Convolutive blind separation of non-stationary sources. IEEE Trans. Speech Audio Process. 2000, 8(3):320-327.
31. Haykin S, Kailath T: Adaptive Filter Theory. 4th edition. Pearson Education, Upper Saddle River; 2007.
32. Bellamy JC: Digital Telephony. 3rd edition. Wiley, New York; 2000.
33. McGovern SG: A model for room acoustics. 2004. http://www.sgm-audio.com/research/rir/rir.html. Accessed 21 Jan 2013.
34. Vincent E, Gribonval R, Fevotte C: Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 2006, 14(4):1462-1469.
35. Holters M, Corbach T, Zoelzer U: Impulse response measurement techniques and their applicability in the real world. In Proceedings of the 12th International Conference on Digital Audio Effects (DAFx-09). Como; 2009.
36. Gaydecki P: Foundations of Digital Signal Processing: Theory, Algorithms and Hardware Design. IEE, London; 2004.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.