Blind source separation for robot audition using fixed HRTF beamforming
 Mounira Maazaoui^{1},
 Karim Abed-Meraim^{1} and
 Yves Grenier^{1}
https://doi.org/10.1186/1687-6180-2012-58
© Maazaoui et al; licensee Springer. 2012
Received: 15 June 2011
Accepted: 6 March 2012
Published: 6 March 2012
Abstract
In this article, we present a two-stage blind source separation (BSS) algorithm for robot audition. The first stage is a fixed beamforming preprocessing step that reduces reverberation and environmental noise. In a robot audition context, the manifold of the sensor array is hard to model because of the presence of the robot's head, so we use pre-measured head-related transfer functions (HRTFs) to estimate the beamforming filters. Using HRTFs to estimate the beamformers captures the effect of the head on the manifold of the microphone array. The second stage is a BSS algorithm based on a sparsity criterion, namely the minimization of the l_{1} norm of the sources. We present different configurations of our algorithm and show that it gives promising results and that the fixed beamforming preprocessing improves the separation results.
1 Introduction
Robot audition is the ability of a humanoid to understand its acoustic environment, separate and localize sources, identify speakers and recognize their emotions. This complex task is one of the targets of the Romeo project^{a} that we work on. This project aims to build a humanoid (Romeo) that can act as a comprehensive assistant for persons suffering from loss of autonomy. Our task in this project focuses on blind source separation (BSS) using a microphone array (more than two sensors). Source separation is a very important step for human-robot interaction: it allows later tasks such as speaker identification, speech and motion recognition and environmental sound analysis to be achieved properly. In a BSS task, the separation should be done from the received microphone signals without prior knowledge of the mixing process. The only prior knowledge is the array geometry.
The problem of BSS has been studied by many authors [1], and we present here some of the state-of-the-art methods related to robot audition. Tamai et al. [2] performed sound source localization by delay-and-sum beamforming and source separation in a real environment with frequency band selection, using a microphone array of 32 microphones located on three rings. Yamamoto et al. [3] proposed a source separation technique based on geometric constraints as a preprocessing step for the speech recognition module in their robot audition system. This system was implemented in the humanoids SIG2 and Honda ASIMO with an eight-sensor microphone array, as a part of a more complete system for robot audition named HARK [4]. Saruwatari et al. [5] proposed a two-stage binaural BSS system for a humanoid. They combined a single-input multiple-output model based on independent component analysis (ICA) and a binary mask processing.
One of the main challenges of BSS remains to obtain good separation performance in real reverberant environments. A beamforming preprocessing can be a solution to improve BSS performance in a reverberant room. Beamforming consists in estimating a spatial filter that operates on the outputs of a microphone array in order to form a beam with a desired directivity pattern [6]. It is useful for many purposes, particularly for enhancing a desired signal from its measurement corrupted by noise, competing sources and reverberation [6]. Beamforming filters can be estimated in a fixed or in an adaptive way. A fixed beamformer, contrary to an adaptive one, does not depend on the sensor data: it is built for a set of fixed desired directions. In this article, we propose a two-stage BSS technique where a fixed beamforming is used as a preprocessing step.
Ding et al. propose to use a beamforming preprocessing where the steering directions are the directions of arrival (DOAs) of the sources. In this case, the DOAs of the sources are assumed to be known a priori [7]. The authors evaluate their method in a determined case with 2 and 4 sources and a circular microphone array. Saruwatari et al. present a combined ICA [8] and beamforming method: first, the authors perform a subband ICA and estimate the DOAs of the sources using the directivity patterns in each frequency bin; second, they use the estimated DOAs to build a null beamformer; and third, they integrate the subband ICA and the null beamforming by selecting the most suitable separation matrix in each frequency bin [9]. In this article, we propose to use a fixed beamforming preprocessing with fixed steering directions, independent of the directions of arrival of the sources, and we compare this preprocessing method to the one proposed in [7]. We are interested in studying the effect of beamforming as a preprocessing tool, so we do not include the algorithm of [9] in our evaluation (the authors of [9] use beamforming as a separation method, alternating with ICA).
However, in a beamforming task, we need to know the manifold of the sensor array, which is hard to model in the robot audition case because the head of the robot alters the acoustic near field. To overcome the problem of modeling the array geometry and to take into account the influence of the robot's head on the received signals, we propose to use the head-related transfer functions (HRTFs) of the robot's head as steering vectors to build the fixed beamformer. The main advantages of our method are its reduced computational cost (compared to methods based on adaptive beamforming), its improved separation quality and its relatively fast convergence rate. Its weaknesses are the lack of theoretical analysis or proofs that guarantee convergence to the desired solution and, when source localization is needed, the fact that our method provides only a rough estimate of the direction of arrival.
This article is organized as follows: Section 2 presents the signal model used in the BSS task; Sections 3 and 4 are dedicated, respectively, to the beamforming using HRTFs and to the BSS using a sparsity criterion; Section 5 assesses the performance of the algorithms; and Section 6 provides some concluding remarks.
2 Signal model
where h(l) is the l^{th} matrix of the impulse response and n(t) is a noise vector. We consider a spatially decorrelated diffuse noise whose energy is assumed to be negligible compared to that of the point sources. If the noise comes from a point source, it is considered as a sound source. This scenario corresponds to our experimental and real-life application setups.
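The mixing model described here (impulse-response matrices h(l) applied to the sources, plus a noise vector n(t)) can be illustrated with a short NumPy sketch. It assumes the usual convolutive form x(t) = Σ_l h(l) s(t − l) + n(t); the function name `convolutive_mixture`, the random toy impulse responses and the number of taps are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def convolutive_mixture(s, h, noise_std=0.0, rng=None):
    """Mix N sources into M microphone signals.

    s : (N, T) source signals
    h : (L + 1, M, N) impulse-response matrices h(l)
    Returns x : (M, T) with x(t) = sum_l h(l) s(t - l) + n(t).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_taps, M, N = h.shape
    T = s.shape[1]
    x = np.zeros((M, T))
    for l in range(n_taps):
        # Delay the sources by l samples (zeros before t = 0).
        delayed = np.zeros_like(s)
        delayed[:, l:] = s[:, :T - l]
        x += h[l] @ delayed
    # Spatially decorrelated noise with low energy (assumed Gaussian here).
    x += noise_std * rng.standard_normal((M, T))
    return x

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))             # two toy sparse sources
h = 0.5 * rng.standard_normal((8, 4, 2))    # hypothetical 8-tap responses, 4 microphones
x = convolutive_mixture(s, h, noise_std=0.01, rng=rng)
```

Each microphone signal is the sum, over the sources, of the source convolved with the corresponding impulse response, which is what the loop over the taps computes.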
The inverse STFT of the estimated sources in the frequency domain Y allows the recovery of the estimated sources y(t) = [y_{1} (t),...,y_{ N }(t)]^{ T }in the time domain.
Separating the sources in each frequency bin introduces the permutation problem: the order of the estimated sources is not the same from one frequency to another. To solve the permutation problem, we use the method proposed by Wang and Huang and described in [10]. This method is based on the signal correlation between two adjacent frequencies. In this article, we do not investigate the permutation problem, and we use the cited method for all the proposed algorithms.
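The idea of aligning permutations through the correlation between adjacent frequencies can be sketched as follows. This is a generic envelope-correlation alignment, not the specific method of [10]; the function name `align_permutations` and the toy data (one envelope per source shared by all bins) are assumptions made for illustration.

```python
import itertools
import numpy as np

def align_permutations(Y):
    """Y : (F, N, K) separated STFT coefficients; returns an aligned copy."""
    F, N, _ = Y.shape
    Y = Y.copy()
    for f in range(1, F):
        ref = np.abs(Y[f - 1])   # envelopes of the previous, already aligned bin
        cur = np.abs(Y[f])
        # Correlation between every (reference, current) pair of envelopes.
        C = np.corrcoef(ref, cur)[:N, N:]
        # Exhaustive search over permutations (fine for a small number of sources).
        best = max(itertools.permutations(range(N)),
                   key=lambda p: sum(C[i, p[i]] for i in range(N)))
        Y[f] = Y[f][list(best)]
    return Y

rng = np.random.default_rng(1)
F, N, K = 16, 3, 200
env = rng.gamma(shape=1.0, size=(N, K))              # toy envelope of each source
phase = np.exp(2j * np.pi * rng.random((F, N, K)))
Y_true = env[None] * phase
perms = [rng.permutation(N) for _ in range(F)]       # random order in each bin
Y_mixed = np.stack([Y_true[f][perms[f]] for f in range(F)])
Y_aligned = align_permutations(Y_mixed)
```

After alignment, every bin uses the source order of the first bin; a global permutation of the outputs remains, which is inherent to BSS.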
where W(f) is the separation matrix estimated using a sparsity criterion and B(f) is a fixed beamforming filter. More details are presented in the following subsections (cf. Algorithm 1).
2.1 Beamforming preprocessing
The role of the beamformer is essentially to reduce the reverberation and the interferences coming from directions other than the steering directions. Once the reverberation is reduced, Equation (2) is better satisfied, which leads to an improved BSS quality.
2.2 Blind source separation
3 Fixed beamforming using HRTF
In the case of robot audition, the geometry of the microphone array is fixed once and for all. To build the fixed beamformers, we need to determine the "desired" steering directions and the characteristics of the beam pattern (the beamwidth, the amplitude of the sidelobes and the position of the nulls). The beamformers are estimated only once for all scenarios, using this spatial information and independently of the mixture measured at the sensors.
The least-squares (LS) technique is used [6] to estimate the beamformer filters that achieve the desired beam pattern according to a desired direction response. To estimate the beamformers, we need to calculate the steering vectors, which represent the phase delays of a plane wave evaluated at the microphone array elements.
where d is the distance between two sensors and c is the speed of sound.
In human hearing, the sound source is spectrally filtered by the head and the pinna, so a transfer function between the source and each ear is defined and referred to as the HRTF. The HRTF takes into account the interaural time difference^{b} (ITD), the interaural intensity difference^{c} (IID) and the shape of the head and the pinna. It defines how a sound emitted from a specific location and altered by the head and the pinna is received at an ear. The notion of HRTF remains the same if we replace the human head by a dummy head and the ears by two microphones. We extend the usual concept of binaural HRTF to the context of robot audition, where the humanoid is equipped with a microphone array. In our case, an HRTF h_{m}(f,θ) at frequency f characterizes how a signal emitted from a specific direction θ is received at the m-th sensor fixed on the head.
In the following, we present the different configurations of the combined beamforming-BSS algorithm.
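As an illustration of a least-squares beamformer design with HRTFs used as steering vectors, here is a minimal NumPy sketch for a single frequency bin. The random matrix `A` stands in for measured HRTFs, and the Gaussian-shaped desired pattern, its 15° width and the look direction are hypothetical choices; the exact LS formulation of [6] may differ.

```python
import numpy as np

def ls_beamformer(A, desired):
    """Least-squares beamformer for one frequency bin.

    A       : (M, P) steering vectors, one column per direction
              (here HRTFs h_m(f, theta_p) instead of free-field delays)
    desired : (P,) desired response over the P directions
    Returns w : (M,) such that w^H a(theta_p) approximates desired[p].
    """
    # w^H a_p = conj(a_p^H w), so we fit A^H w to conj(desired).
    w, *_ = np.linalg.lstsq(A.conj().T, np.conj(desired), rcond=None)
    return w

rng = np.random.default_rng(0)
M, P = 16, 72
# Placeholder "HRTFs": random complex gains standing in for measured data.
A = rng.standard_normal((M, P)) + 1j * rng.standard_normal((M, P))
theta = np.arange(0, 360, 5)                 # azimuth grid in degrees
look = 90                                    # hypothetical steering direction
diff = (theta - look + 180) % 360 - 180      # wrapped angular distance
desired = np.exp(-0.5 * (diff / 15.0) ** 2)  # hypothetical desired beam pattern
w = ls_beamformer(A, desired)
response = np.conj(w) @ A                    # w^H a(theta_p) for every direction
```

With more directions than sensors the system is overdetermined, so the obtained pattern only approximates the desired one in the least-squares sense.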
3.1 Beamforming with known DOA
3.2 Beamforming with fixed DOA
3.3 Beamforming with beams selection
4 BSS using sparsity criterion
The optimization technique used to update the separation matrix W(f) is the natural gradient. Section 4.1 summarizes the natural gradient algorithm [11], and Section 4.2 shows how we apply this optimization algorithm to our cost function.
4.1 Natural gradient algorithm
The natural gradient is an optimization method proposed by Amari et al. [11]. In this modified gradient search method, the standard gradient search direction is altered according to the local Riemannian structure of the parameter space. This guarantees the invariance of the natural gradient search direction to the statistical relationship between the parameters of the model and leads to a statistically efficient learning performance [12].
4.2 Sparsity separation criterion
We consider a separation criterion based on the sparsity of the signals in the time-frequency domain. For every frequency bin, we look for a separation matrix W(f) that leads to the sparsest estimated sources Y(f,:) = [Y(f,1),...,Y(f,N_{T})].
In the same manner, we define the mixture matrix in each frequency bin X(f,:) = [X(f,1),...,X(f,N_{T})].
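To see why the l_{1} norm works as a sparsity measure, the following toy sketch compares the l_{1} norm of two sparse sources with that of an energy-preserving mixture of them. Real-valued Laplacian samples stand in for the complex time-frequency coefficients of one bin; the rotation angle and sample count are arbitrary assumptions.

```python
import numpy as np

def l1_norm(Y):
    """Sum of magnitudes over all sources and time frames of one frequency bin."""
    return np.sum(np.abs(Y))

rng = np.random.default_rng(0)
N_T = 5000
S = rng.laplace(size=(2, N_T))                  # sparse toy sources
angle = np.pi / 4
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
X = R @ S                                       # mixture with the same total energy

print(l1_norm(S), l1_norm(X))
```

The rotation keeps the total energy unchanged, but the mixture has a larger l_{1} norm than the sources, so minimizing the l_{1} norm at fixed energy favors the unmixed signals.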
with $\mathbf{G}_{t}(f)=\mathbf{f}\left(\mathbf{Y}_{t}(f,:)\right)\mathbf{Y}_{t}^{H}(f,:)$.
with $c_{t}(f)=\frac{1}{\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\left|g_{t}^{ij}(f)\right|}$ and $g_{t}^{ij}(f)=\left[\mathbf{G}_{t}(f)\right]_{ij}$.
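The quantities G_t(f) and c_t(f) defined here can be computed as in the sketch below. It assumes that the nonlinearity f is the complex sign function of endnote d, which is the natural choice for an l_{1} criterion. The update `W - mu * G @ W` is the plain natural-gradient form and is an assumption as well; the way c_t(f) enters the scaled update of [15] is not reproduced, so the sketch only returns it.

```python
import numpy as np

def complex_sign(Y):
    """sign(z) = z / |z|, with sign(0) = 0."""
    Y = np.asarray(Y, dtype=complex)
    mag = np.abs(Y)
    return np.divide(Y, mag, out=np.zeros_like(Y), where=mag > 0)

def gradient_terms(Y):
    """Return G_t(f) = f(Y_t) Y_t^H and the scaling factor c_t(f)."""
    Y = np.asarray(Y, dtype=complex)
    N = Y.shape[0]
    G = complex_sign(Y) @ Y.conj().T
    c = 1.0 / (np.sum(np.abs(G)) / N)
    return G, c

def natural_gradient_step(W, Z, mu):
    """One plain natural-gradient step (assumed form): W <- W - mu * G W."""
    G, _ = gradient_terms(W @ Z)
    return W - mu * (G @ W)

Y = np.array([[1.0, -1.0], [2.0j, 0.0]])
G, c = gradient_terms(Y)
```

The diagonal of G contains the l_{1} norms of the estimated sources, and c_t(f) is the inverse of the mean absolute entry of G per source.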
4.3 Initialization
where D_{M} is the matrix containing the first M rows and M columns of the matrix D, and E_{:M} is the matrix containing the first M columns of the matrix E. D and E are, respectively, the diagonal matrix and the unitary matrix of the singular value decomposition of the autocorrelation matrix of the received data X(f,:) or of the filtered data after beamforming Z(f,:).
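The initialization described here can be sketched as a whitening matrix built from the truncated SVD of the autocorrelation matrix. The formula W_0 = D_M^{-1/2} E_{:M}^H is an assumption consistent with the definitions of D_M and E_{:M}; the function name and the toy data are illustrative.

```python
import numpy as np

def whitening_init(Z, M):
    """Assumed initialization W_0 = D_M^{-1/2} E_{:M}^H for one frequency bin."""
    N_T = Z.shape[1]
    R = Z @ Z.conj().T / N_T            # autocorrelation matrix of the data
    E, d, _ = np.linalg.svd(R)          # R = E diag(d) E^H, d sorted in decreasing order
    return np.diag(d[:M] ** -0.5) @ E[:, :M].conj().T

rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 500)) + 1j * rng.laplace(size=(2, 500))
A = rng.standard_normal((7, 2)) + 1j * rng.standard_normal((7, 2))
noise = rng.standard_normal((7, 500)) + 1j * rng.standard_normal((7, 500))
Z = A @ S + 0.01 * noise                # toy data: 7 beams, 2 sources
W0 = whitening_init(Z, 2)
Y0 = W0 @ Z
cov = Y0 @ Y0.conj().T / Z.shape[1]     # close to the 2 x 2 identity
```

With this choice, the initial outputs are decorrelated and have unit power, and the data dimension is reduced to M.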
5 Experimental results
5.1 Experimental database
To evaluate the proposed BSS techniques, we built two databases: an HRTF database and a speech database.
5.1.1 HRTF database
We measured 504 HRTFs for each microphone as follows:
- 72 azimuth angles from 0° to 355° with a 5° step
- 7 elevation angles: -40°, -27°, 0°, 20°, 45°, 60° and 90°
To measure the HRTFs, the dummy was fixed on a turntable in the center of the loudspeaker arc in the anechoic room (cf. Figure 2). For each azimuth angle, a sequence of complementary Golay codes is emitted sequentially from each loudspeaker (to vary the elevation) and recorded with the 16-sensor array. This operation was repeated for all the azimuth angles. Complementary Golay sequences have the useful property that their autocorrelation functions have complementary sidelobes: the sum of the autocorrelation sequences is exactly zero everywhere except at the origin. Using this property and the recorded complementary Golay codes, the HRTFs are calculated as in [16].
Details about the experimental process of HRTF calculation, as well as the HRTF databases at the sampling frequencies of 48 and 16 kHz, are available at http://www.tsi.telecomparistech.fr/aao/?p=347.
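The complementary property of Golay codes, and the way it yields an impulse response, can be checked with a short NumPy sketch. The recursive construction is the standard one; the code length and the toy impulse response `h` are arbitrary, and the actual measurement procedure of [16] involves details not shown here.

```python
import numpy as np

def golay_pair(order):
    """Complementary Golay pair of length 2**order (entries +1 / -1)."""
    a = np.array([1.0])
    b = np.array([1.0])
    for _ in range(order):
        a, b = np.concatenate([a, b]), np.concatenate([a, -b])
    return a, b

def estimate_impulse_response(rec_a, rec_b, a, b):
    """Recover h from the recordings of a and b played through the same system."""
    n = len(a)
    # Cross-correlate each recording with its code and add the two results:
    # the sidelobes cancel, leaving 2n * h at the non-negative lags.
    ca = np.correlate(rec_a, a, mode="full")[n - 1:]
    cb = np.correlate(rec_b, b, mode="full")[n - 1:]
    return (ca + cb) / (2 * n)

a, b = golay_pair(6)                                       # length 64
r = np.correlate(a, a, "full") + np.correlate(b, b, "full")  # 128 at lag 0, zero elsewhere

h = np.array([1.0, 0.5, -0.25, 0.1])                       # toy impulse response
h_est = estimate_impulse_response(np.convolve(a, h), np.convolve(b, h), a, b)
```

Because the summed autocorrelation is a scaled unit impulse, correlating the recordings with the emitted codes returns the impulse response without sidelobe leakage.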
5.1.2 Test database
The output signals x(t) are the convolutions of 40 pairs of speech sources (male and female speakers, in French and English) with two of the impulse responses {h(l)}_{0≤l≤L} measured for the directions of arrival presented in Figure 11.
Parameters of the blind source separation algorithms
Parameter | Value
Sampling frequency | 16 kHz
Analysis window | Hanning
Analysis window length | 2048
Shift length | 1024
μ | 0.2
Signal length | 5 s
Number of iterations | 100
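With these analysis parameters, the size of the time-frequency representation follows directly. The sketch below is a minimal STFT, assuming that the window and shift lengths are given in samples and that no padding is applied at the signal edges; the function name `stft` and the random test signal are illustrative.

```python
import numpy as np

FS = 16_000        # sampling frequency (Hz)
WIN_LEN = 2048     # analysis window length (assumed to be in samples)
HOP = 1024         # shift length (assumed to be in samples)

def stft(x, win_len=WIN_LEN, hop=HOP):
    """Return X with shape (win_len // 2 + 1, n_frames)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[k * hop:k * hop + win_len] * window
                       for k in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=0)

x = np.random.default_rng(0).standard_normal(5 * FS)   # 5 s toy signal
X = stft(x)
print(X.shape)   # (1025, 77)
```

Under these assumptions, a 5 s signal gives 1025 frequency bins and 77 frames, so each per-bin separation problem works on N_T = 77 time frames.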
5.2 Evaluation results
(1) The beamforming stage only: beamforming of 37 lobes from -90° to 90° with a step angle of 5° (BF[5°])
(2) The BSS algorithm only:
  (a) with minimization of the l_{1} norm (BSS-l_{1})
  (b) with ICA from [15] (ICA)
(3) The two-stage algorithm, BSS with the beamforming preprocessing:
  (a) beamforming of N lobes in the DOAs of the sources (BF[DOA]+BSS-l_{1})
  (b) beamforming of 7 lobes from -90° to 90° with a step angle of 30° (BF[30°]+BSS-l_{1} when the l_{1} norm minimization is used in the BSS step and BF[30°]+ICA when ICA is used in the BSS step)
  (c) beamforming of 13 lobes from -90° to 90° with a step angle of 15° (BF[15°]+BSS-l_{1})
  (d) beamforming of 19 lobes from -90° to 90° with a step angle of 10° (BF[10°]+BSS-l_{1})
  (e) beamforming of 37 lobes from -90° to 90° with a step angle of 5° (BF[5°]+BSS-l_{1})
  (f) beamforming of 7 lobes from -90° to 90° with a step angle of 30°, with selection of the N beams containing the highest energy before proceeding to the BSS (BF[30°]+BS+BSS-l_{1})
  (g) beamforming of 37 lobes from -90° to 90° with a step angle of 5°, with selection of the N beams containing the highest energy before proceeding to the BSS (BF[5°]+BS+BSS-l_{1})
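The number of lobes in each fixed configuration follows from the step angle over the range of steering directions from -90° to 90°. A small sketch (the function name `steering_angles` is illustrative) makes the correspondence explicit:

```python
import numpy as np

def steering_angles(step_deg):
    """Fixed steering directions from -90 to 90 degrees (inclusive)."""
    return np.arange(-90, 90 + step_deg, step_deg)

for step in (30, 15, 10, 5):
    print(f"BF[{step}°]: {len(steering_angles(step))} lobes")
```

This gives 7, 13, 19 and 37 lobes for step angles of 30°, 15°, 10° and 5°, matching the configurations listed above.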
We evaluate the proposed two-stage algorithm using the signal-to-interference ratio (SIR) and the signal-to-distortion ratio (SDR), estimated with the BSSeval toolbox [17]. All the presented curves are averaged over the 40 pairs of speech sources.
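For intuition about the two metrics, the sketch below computes simplified versions of the SIR and SDR by orthogonal projection onto the true sources. It is not the toolbox of [17], which additionally allows filtered versions of the sources and a noise term; the function name `sir_sdr` and the toy signals are assumptions.

```python
import numpy as np

def sir_sdr(estimate, sources, j):
    """Simplified SIR and SDR (in dB) of `estimate` with respect to source j."""
    s_j = sources[j]
    # Part of the estimate explained by the target source alone.
    s_target = (estimate @ s_j) / (s_j @ s_j) * s_j
    # Projection of the estimate onto the span of all true sources.
    coeffs, *_ = np.linalg.lstsq(sources.T, estimate, rcond=None)
    projection = sources.T @ coeffs
    e_interf = projection - s_target     # contribution of the other sources
    e_artif = estimate - projection      # everything not explained by the sources
    sir = 10 * np.log10(np.sum(s_target**2) / np.sum(e_interf**2))
    sdr = 10 * np.log10(np.sum(s_target**2) / np.sum((e_interf + e_artif)**2))
    return sir, sdr

rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 1000))
estimate = sources[0] + 0.1 * sources[1] + 0.01 * rng.standard_normal(1000)
sir, sdr = sir_sdr(estimate, sources, j=0)
```

The SDR counts interference and artifacts together, so it can never exceed the SIR in this simplified setting.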
5.2.1 Influence of the beamforming preprocessing
Influence of the beams selection
As we can observe from Figures 12, 13, 14, and 15, the beamforming preprocessing with beams selection (BF[30°]+BS+BSS-l_{1} and BF[5°]+BS+BSS-l_{1}) and the beamforming preprocessing with known directions of arrival (BF[DOA]+BSS-l_{1}) give close results in terms of SIR (cf. Figures 12 and 14) and SDR (cf. Figures 13 and 15). However, in a reverberant environment where the directions of arrival cannot be estimated accurately, the beamforming preprocessing with beams selection is a good solution to improve the SIR and the SDR of the estimated sources compared to using the BSS algorithm only (BSS-l_{1}).
5.2.2 Comparison between BSS-l_{1} and ICA
Independent component analysis and the l_{1} norm minimization give quite close results with or without the preprocessing step. However, we believe that replacing BSS-l_{1} by BSS-l_{p} with p < 1, or with a varying value of p, might lead to a significant improvement of the separation quality. This expectation is based on the preliminary results we obtained in [14] and will be the focus of future investigations.
5.2.3 Convergence analysis
6 Conclusion
In this article, we present a two-stage BSS algorithm for robot audition. The first stage is a preprocessing step with fixed beamforming. To deal with the effect of the robot's head in the acoustic near field and to model the manifold of the sensor array, we used HRTFs as steering vectors in the beamformer estimation step. The second stage is a BSS algorithm exploiting the sparsity of the sources in the time-frequency domain.
We tested different configurations of this algorithm, with steering directions of the beams equal to the directions of arrival of the sources and with fixed steering directions. We also varied the step angle between the beams. The beamforming preprocessing improves the separation performance as it reduces the effects of reverberation and noise. The maximum gain is obtained when we select the beams with the highest energies and use the corresponding filters as beamformers, or when the DOAs of the sources are known. The beamforming preprocessing with fixed steering directions also performs well and requires neither an estimation of the DOAs nor beam selection, which represents a gain in processing time. Using the 5° step beamforming preprocessing with beams selection, we can also obtain a rough estimate of the directions of arrival of the sources.
Endnotes
^{a}Romeo project: http://www.projetromeo.com.
^{b}The ITD is the difference in arrival times of a sound wavefront at the left and right ears.
^{c}The IID is the amplitude difference of a sound that reaches the right and left ears.
^{d}For a complex number z, $\text{sign}(z)=\frac{z}{\left|z\right|}$.
^{e}The names of the algorithms that we use in the legends of the figures are between brackets.
1. Input:
   (a) The output of the microphone array x = [x(t_{1}),...,x(t_{T})]
   (b) The precalculated beamforming filters $\{\mathbf{B}(f)\}_{1\le f\le \frac{N_{f}}{2}+1}$
2. $\{\mathbf{X}(f,k)\}_{1\le f\le N_{f},\,1\le k\le N_{T}}=\text{STFT}(\mathbf{x})$
3. for each frequency bin f:
   (a) beamforming preprocessing step: Z(f,:) = B(f)X(f,:)
   (b) initialization step: W(f) = W_{0}(f)
   (c) Y_{0}(f,:) = W_{0}(f)Z(f,:)
   (d) for each iteration t: blind source separation step to estimate W(f)
4. Permutation problem solving
5. Output: the estimated sources $\mathbf{y}=\text{ISTFT}\left(\{\mathbf{Y}(f,k)\}_{1\le f\le N_{f},\,1\le k\le N_{T}}\right)$

1. SelectedBeams = Ø
2. for each frequency bin f:
   (a) Form K beams (beamformer outputs) Z(f,:) = B(f)X(f,:), Z(f,:) = [z_{1}(f,:),...,z_{K}(f,:)]^{T}
   (b) Compute the energy of the beamformer outputs: E(f) = [e_{1}(f),...,e_{K}(f)] with $e_{i}(f)=\frac{1}{N_{T}}\sum_{k=1}^{N_{T}}\left|z_{i}(f,k)\right|^{2}$
   (c) Sort E(f) in decreasing order; Beams are the beams corresponding to the sorted energies: Beams = sort(E(f))
   (d) Select the N highest energies; the indexes are stored in B
   (e) SelectedBeams = SelectedBeams ∪ B
3. Compute the frequency of appearance of each beam and store the occurrences in I
4. Select the N beams with the highest occurrence
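The beam selection steps above can be sketched in a few lines of Python. The function name `select_beams`, the array layout (frequency, beam, frame) and the toy data are assumptions; the per-bin selections are accumulated with a counter so that the occurrences of each beam are kept.

```python
from collections import Counter
import numpy as np

def select_beams(Z, n_sources):
    """Z : (F, K, N_T) beamformer outputs; returns the indices of the N selected beams."""
    counts = Counter()
    for Z_f in Z:                                         # one frequency bin at a time
        energy = np.mean(np.abs(Z_f) ** 2, axis=1)        # e_i(f) for the K beams
        top = np.argsort(energy)[::-1][:n_sources]        # N highest energies
        counts.update(top.tolist())                       # accumulate occurrences
    return sorted(beam for beam, _ in counts.most_common(n_sources))

rng = np.random.default_rng(0)
F, K, N_T = 20, 7, 50
Z = 0.1 * (rng.standard_normal((F, K, N_T)) + 1j * rng.standard_normal((F, K, N_T)))
Z[:, [2, 5], :] *= 10.0          # two beams pointing at active toy sources
selected = select_beams(Z, n_sources=2)
print(selected)   # [2, 5]
```

Voting across frequency bins makes the selection robust to bins in which a source has little energy.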
Declarations
Acknowledgements
This work is funded by the Ile-de-France region, the General Directorate for Competitiveness, Industry and Services (DGCIS) and the City of Paris, as a part of the ROMEO project.
References
 1. Comon P, Jutten C: Handbook of Blind Source Separation: Independent Component Analysis and Applications. Elsevier; 2010.
 2. Tamai Y, Sasaki Y, Kagami S, Mizoguchi H: "Three ring microphone array for 3D sound localization and separation for mobile robot audition". IEEE/RSJ International Conference on Intelligent Robots and Systems 2005, 4172-4177.
 3. Yamamoto S, Nakadai K, Nakano M, Tsujino H, Valin JM, Komatani K, Ogata T, Okuno HG: "Design and implementation of a robot audition system for automatic speech recognition of simultaneous speech". IEEE Workshop on Automatic Speech Recognition Understanding 2007, 111-116.
 4. Nakajima H, Nakadai K, Hasegawa Y, Tsujino H: "High performance sound source separation adaptable to environmental changes for robot audition". IEEE/RSJ International Conference on Intelligent Robots and Systems 2008, 2165-2171.
 5. Saruwatari H, Mori Y, Takatani T, Ukai S, Shikano K, Hiekata T, Morita T: "Two-stage blind source separation based on ICA and binary masking for real-time robot audition system". IEEE/RSJ International Conference on Intelligent Robots and Systems 2005, 2303-2308.
 6. Benesty J, Chen J, Huang Y: Microphone Array Signal Processing, Chapter 3: Conventional beamforming techniques. 1st edition. Springer; 2008.
 7. Ding H, Wang L, Yin F: "Combining superdirective beamforming and frequency-domain blind source separation for highly reverberant signals". EURASIP Journal on Audio Speech and Music Processing 2010, 2010.
 8. Comon P: "Independent component analysis, a new concept?". Signal Processing 1994.
 9. Saruwatari H, Kurita S, Takeda K, Itakura F, Nishikawa T, Shikano K: "Blind source separation combining independent component analysis and beamforming". EURASIP Journal on Applied Signal Processing 2003, 1135-1146.
 10. Wang W, Huang F: "Improved method for solving permutation problem of frequency domain blind source separation". 6th IEEE International Conference on Industrial Informatics 2008, 703-706.
 11. Amari S, Cichocki A, Yang HH: "A new learning algorithm for blind signal separation". Advances in Neural Information Processing Systems 1996, 757-763.
 12. Amari S-I: "Natural gradient works efficiently in learning". Neural Computation 1998, 10: 251-276. doi:10.1162/089976698300017746
 13. Hurley N, Rickard S: "Comparing measures of sparsity". IEEE Workshop on Machine Learning for Signal Processing 2009, 55: 4723-4741.
 14. Maazaoui M, Grenier Y, Abed-Meraim K: "Frequency domain blind source separation for robot audition using a parameterized sparsity criterion". 19th European Signal Processing Conference EUSIPCO 2011.
 15. Douglas SC, Gupta M: "Scaled natural gradient algorithms for instantaneous and convolutive blind source separation". IEEE International Conference on Acoustics, Speech and Signal Processing 2007, 2: 637-640.
 16. Foster S: "Impulse response measurement using Golay codes". IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '86 1986, 11: 929-932.
 17. Vincent E, Gribonval R, Fevotte C: "Performance measurement in blind audio source separation". IEEE Transactions on Audio, Speech, and Language Processing 2006, 14: 1462-1469.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.