EURASIP Journal on Applied Signal Processing 2003:11, 1135–1146
© 2003 Hindawi Publishing Corporation

Blind Source Separation Combining Independent Component Analysis and Beamforming

We describe a new method of blind source separation (BSS) on a microphone array that combines subband independent component analysis (ICA) and beamforming. The proposed array system consists of the following three sections: (1) a subband ICA-based BSS section with estimation of the direction of arrival (DOA) of each sound source, (2) a null beamforming section based on the estimated DOAs, and (3) an integration of (1) and (2) based on algorithm diversity. Using this technique, we can resolve the slow-convergence problem of the optimization in ICA. To evaluate its effectiveness, signal-separation and speech-recognition experiments are performed under various reverberant conditions. The results of the signal-separation experiments reveal that a noise reduction rate (NRR) of about 18 dB is obtained under the nonreverberant condition, and NRRs of 8 dB and 6 dB are obtained when the reverberation times are 150 milliseconds and 300 milliseconds, respectively. These performances are superior to those of both the simple ICA-based BSS and the simple beamforming method. Also, the speech-recognition experiments show that the word recognition rates of the proposed method are superior to those of the conventional ICA-based BSS method under all reverberant conditions.


INTRODUCTION
Source separation for acoustic signals is the task of estimating the original sound source signals from the mixed signals observed in each input channel. This technique is applicable to the realization of noise-robust speech-recognition and high-quality hands-free telecommunication systems. The methods of achieving source separation can be classified into two groups: methods based on a single-channel input and those based on multichannel inputs. As single-channel types of source separation, a method of tracking a formant structure [1], an organization technique for hierarchical perceptual sounds [2], and a method based on auditory scene analysis [3] have been proposed. On the other hand, as multichannel source separation, the method based on array signal processing, for example, a microphone array system, is one of the most effective techniques [4]. In such a system, the directions of arrival (DOAs) of the sound sources are estimated, and then each of the source signals is obtained separately using the directivity of the array. The delay-and-sum (DS) array and the adaptive beamformer (ABF) are the conventional and popular microphone arrays currently used for source separation and noise reduction.
For high-quality acquisition of audible signals, several microphone array systems based on the DS array have been implemented since the 1980s. The most successful example was proposed by Flanagan et al. [5] for speech pickup in auditoriums, in which a two-dimensional array composed of 63 microphones is used with automatic steering to enable detection and location of the desired signal source at any given moment. Recently, many microphone array systems with talker localization have been implemented for hands-free telecommunications or speech recognition [6, 7, 8]. Although the DS array has a simple structure, it requires a large number of microphones to achieve high performance, particularly in low-frequency regions. Thus, the degradation of separated signals at low frequencies cannot be avoided in these array systems.
In order to further improve the performance with methods more efficient than the DS array, the ABF has been introduced for acoustic signals, analogously to the adaptive array antenna in radar systems [9, 10, 11]. The goal of the adaptive algorithm is to search for the optimum directions of the nulls under the specific constraint that the desired signal arriving from the look direction is not significantly distorted. This method can improve the signal-separation performance even with a small array, in comparison to the DS array. The ABF, however, has the following drawbacks. (1) The look direction of each signal to be separated is required in the adaptation process; thus, the DOAs of the sound sources must be known in advance. (2) The adaptation procedure should be performed during breaks in the target signal to avoid distortion of the separated signals; however, in conventional use, we cannot predict signal breaks in advance. These requirements arise from the fact that the conventional ABF is based on supervised adaptive filtering, and they significantly limit the applicability of the ABF to source separation in practical applications.
In recent years, alternative source-separation approaches have been proposed by researchers using not array signal processing but a specialized branch of information theory, that is, information-geometry theory [12, 13]. Blind source separation (BSS) is the approach of estimating the original source signals using only the information of the mixed signals observed in each input channel, where the independence among the source signals is mainly used for the separation. This technique is based on unsupervised adaptive filtering [13] and provides extended flexibility in that the source-separation procedure requires no training sequences and no a priori information on the DOAs of the sound sources. The early contributory works on BSS were performed by Cardoso and Jutten [14, 15], where higher-order statistics of the signals are used for measuring independence. Comon [16] clearly defined the term independent component analysis (ICA) and presented an algorithm that measures the independence among the source signals. ICA was later extended by Bell and Sejnowski [17] to the infomax (or maximum-entropy) algorithm for BSS, which is based on the minimization of the mutual information of the signals. In recent works on ICA-based BSS, several methods in which the complex-valued unmixing matrices are calculated in the frequency domain have been proposed to deal with the arrival lags among the elements of the microphone array system [18, 19, 20, 21]. Since the calculations are carried out at each frequency independently, the following problems arise in these methods: (1) permutation of each sound source, and (2) arbitrariness of each source gain. Various methods to overcome the permutation and scaling problems have been proposed; for example, an a priori assumption of similarity among the envelopes of the source signal waveforms [19] or of interfrequency continuity with respect to the unmixing matrices [18, 20, 21] is necessary to resolve these problems.
In this paper, a new method of BSS on a microphone array using subband ICA and beamforming is proposed. The proposed array system consists of the following three sections: (1) a subband ICA section, (2) a null beamforming section, and (3) an integration of (1) and (2). First, a new subband ICA is introduced to achieve frequency-domain BSS on the microphone array system, where the directivity patterns of the array are explicitly used to estimate the DOA of each sound source [22]. Using this method, we can resolve both the permutation and arbitrariness problems simultaneously, without any assumption on the source signal waveforms or on the interfrequency continuity of the unmixing matrices. Next, based on the DOAs estimated in the above-mentioned ICA section, we construct a null beamformer, in which a directional null is steered to the direction of the undesired sound source, in parallel with the ICA-based BSS. This approach to signal separation has the advantage that there is no difficulty with respect to slow convergence of the optimization, because the null beamformer is determined using only DOA information, without relying on the independence of the sound sources. Finally, both signal-separation procedures are appropriately integrated by algorithm diversity in the frequency domain [23].
In order to evaluate the effectiveness of the proposed method, both signal-separation and speech-recognition experiments are performed under various reverberant conditions. The results reveal that the performance of the proposed method is superior to that of the conventional ICA-based BSS method [19], and we also show that the proposed method does not cause heavy degradation of the separation performance, in contrast to the previous ICA-based BSS method, particularly when the durations of the observed signals are exceedingly short. In addition, the speech-recognition experiment clarifies that the proposed method is more applicable to recognition tasks in multispeaker cases than the conventional BSS. The rest of this paper is organized as follows. In Sections 2 and 3, the formulation of the general BSS problem and the principle of the proposed method are explained. In Section 4, the signal-separation experiments are described. Following a discussion of the results of the experiments, we give the conclusions in Section 5.

SOUND MIXING MODEL OF MICROPHONE ARRAY
In this study, a straight-line array is assumed. The coordinates of the elements are designated as $d_k$ $(k = 1, \ldots, K)$, and the DOAs of the multiple sound sources are designated as $\theta_l$ $(l = 1, \ldots, L)$ (see Figure 1).
In general, the observed signals in which multiple source signals are linearly mixed are given by the following equation in the frequency domain:

$$X(f) = A(f)\, S(f),$$

where $X(f)$ is the observed signal vector, $S(f)$ is the source signal vector, and $A(f)$ is the mixing matrix. These are given as

$$X(f) = \big[ X_1(f), \ldots, X_K(f) \big]^T, \qquad S(f) = \big[ S_1(f), \ldots, S_L(f) \big]^T, \qquad A(f) = \big[ A_{kl}(f) \big].$$

We introduce this model to deal with the arrival lags among the elements of the microphone array; accordingly, $A_{kl}(f)$ is assumed to be complex valued. Hereafter, for convenience, we only consider the relative lags among the elements with respect to the arrival time of the wavefront of each sound source, and neglect the pure delay between the microphone and the sound source. Also, $S(f)$ is regarded as the source signals observed at the origin. For example, by neglecting the effect of room reverberation, we can rewrite the elements of the mixing matrix as the following simple expression:

$$A_{kl}(f) = \exp\big(j 2\pi f \tau_{kl}\big), \qquad \tau_{kl} = \frac{d_k \sin\theta_l}{c},$$

where $\tau_{kl}$ is the arrival lag of the $l$th source signal from the direction $\theta_l$, observed at the $k$th microphone at the coordinate $d_k$, and $c$ is the velocity of sound. If the effect of room reverberation is considered, the elements $A_{kl}(f)$ of the mixing matrix take more complicated values that depend on the room reflections.
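As a concrete illustration of this anechoic far-field model, the following Python sketch builds the mixing matrix with elements $A_{kl}(f) = \exp(j 2\pi f \tau_{kl})$ and applies it to a pair of source spectra at a single frequency bin. The function name and the numeric values are ours, chosen for illustration only.

```python
import numpy as np

def mixing_matrix(f, mic_pos, doas_deg, c=340.0):
    """Anechoic far-field mixing matrix A(f) with elements
    A_kl(f) = exp(j 2 pi f tau_kl), tau_kl = d_k sin(theta_l) / c."""
    d = np.asarray(mic_pos, dtype=float)             # element coordinates d_k [m]
    theta = np.deg2rad(np.asarray(doas_deg, float))  # source DOAs theta_l [rad]
    tau = np.outer(d, np.sin(theta)) / c             # (K, L) arrival lags tau_kl [s]
    return np.exp(1j * 2.0 * np.pi * f * tau)        # (K, L) complex mixing matrix

# Example: two microphones 4 cm apart, sources at -30 and 40 degrees.
A = mixing_matrix(f=1000.0, mic_pos=[0.0, 0.04], doas_deg=[-30.0, 40.0])
S = np.array([1.0 + 0.5j, 0.3 - 0.2j])   # source spectra S(f) at this bin
X = A @ S                                 # observed spectra X(f) = A(f) S(f)
```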

System overview of the proposed method
This section describes the new BSS method, using a microphone array, and its algorithm. The proposed array system consists of the following three sections (see Figure 2 for the system configuration): (1) a subband ICA section for ICA-based BSS and DOA estimation, (2) a null beamforming section for efficient reduction of directional interference signals, and (3) an integration of (1) and (2) based on algorithm diversity [23], which selects the most appropriate algorithm from (1) and (2) in each frequency bin. The following sections describe each of the procedures in detail.

Estimation of the unmixing matrix
In this study, we perform the signal-separation procedure as described below (see Figure 3), where we deal with the case in which the number of sound sources $L$ equals the number of microphones $K$, that is, $K = L$. First, the short-time analysis of the observed signals is conducted frame by frame by using the discrete Fourier transform (DFT). By plotting the spectral values in a frequency bin of one microphone input, frame by frame, we consider them as a time series; the other inputs at the same frequency bin are dealt with in the same manner. Hereafter, we designate the time series as

$$X(f, t) = \big[ X_1(f, t), \ldots, X_K(f, t) \big]^T.$$

Next, we perform signal separation by using the complex-valued unmixing matrix $W(f)$ so that the $L$ time-series outputs $Y(f, t)$ become mutually independent; this procedure can be given as

$$Y(f, t) = W(f)\, X(f, t),$$

where $Y(f, t) = [Y_1(f, t), \ldots, Y_L(f, t)]^T$ and $W(f) = [W_{lk}(f)]$.

Figure 2: Configuration of the proposed microphone array system based on subband ICA and beamforming. Here, $\hat{\theta}_l$, $\theta_l(f)$, and $\sigma_l$ represent the estimated DOA of the $l$th sound source, the DOA of the $l$th sound source at frequency $f$, and the deviation with respect to the estimated DOA of the $l$th sound source, respectively. The bold arrows indicate the subband-signal lines, and "st-DFT" denotes the short-time DFT.
We perform this procedure with respect to all frequency bins. Finally, by applying the inverse DFT and the overlap-add technique to the separated time series $Y(f, t)$, we reconstruct the resultant source signals in the time domain.
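The overall flow of this subband procedure can be sketched as follows, assuming SciPy's `stft`/`istft` for the short-time analysis and the overlap-add synthesis; the per-bin estimator is left as a placeholder callback (`separate_bin`), to be filled in, for example, by the ICA update of the next subsection. All names are ours.

```python
import numpy as np
from scipy.signal import stft, istft

def subband_bss(x, fs, nperseg=512, separate_bin=None):
    """Frequency-domain BSS skeleton. x: (K, N) multichannel signal.
    A short-time DFT is applied, every frequency bin is separated by its
    own unmixing matrix W(f), and the outputs are reconstructed by the
    inverse DFT with overlap-add."""
    K = x.shape[0]
    f, t, X = stft(x, fs=fs, nperseg=nperseg)        # X: (K, bins, frames)
    Y = np.empty_like(X)
    for m in range(X.shape[1]):                      # each bin independently
        Xm = X[:, m, :]                              # (K, frames) time series
        W = separate_bin(Xm) if separate_bin else np.eye(K, dtype=complex)
        Y[:, m, :] = W @ Xm                          # Y(f, t) = W(f) X(f, t)
    _, y = istft(Y, fs=fs, nperseg=nperseg)          # overlap-add synthesis
    return y
```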
Considering the calculation of the unmixing matrix $W(f)$, we use an optimization algorithm based on the minimization of the Kullback-Leibler divergence; this algorithm was introduced by Murata and Ikeda for online learning [19] and has been modified by the authors for offline learning with stable convergence. The optimal $W(f)$ is obtained by using the following iterative equation:

$$W_{i+1}(f) = W_i(f) + \eta \Big[ \operatorname{diag}\big( \big\langle \Phi\big(Y(f,t)\big)\, Y(f,t)^H \big\rangle_t \big) - \big\langle \Phi\big(Y(f,t)\big)\, Y(f,t)^H \big\rangle_t \Big] W_i(f),$$

where $H$ denotes the Hermitian transpose, $\langle \cdot \rangle_t$ denotes the time-averaging operator, $i$ is used to express the value of the $i$th step in the iterations, and $\eta$ is the step-size parameter. Also, we define the nonlinear vector function $\Phi(\cdot)$ elementwise as

$$\Phi\big(Y_l(f,t)\big) = \frac{1}{1 + \exp\big(-Y_l^{(\mathrm{R})}(f,t)\big)} + j\, \frac{1}{1 + \exp\big(-Y_l^{(\mathrm{I})}(f,t)\big)},$$

where $Y_l^{(\mathrm{R})}(f,t)$ and $Y_l^{(\mathrm{I})}(f,t)$ are the real and imaginary parts of $Y_l(f,t)$, respectively.
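A minimal offline sketch of this update at one frequency bin follows. The identity initialization, iteration count, and function names are our assumptions; the default step size matches the value listed for the experiments (Table 1).

```python
import numpy as np

def ica_update(X, eta=1e-4, n_iter=200):
    """Offline KL-divergence-based update of the unmixing matrix at one
    frequency bin. X: (K, frames) observed subband time series."""
    def phi(Y):  # split-complex sigmoid: Phi(Y) = sig(Re Y) + j * sig(Im Y)
        return 1.0 / (1.0 + np.exp(-Y.real)) + 1j / (1.0 + np.exp(-Y.imag))
    K, T = X.shape
    W = np.eye(K, dtype=complex)                 # initial unmixing matrix
    for _ in range(n_iter):
        Y = W @ X                                # current separated outputs
        R = (phi(Y) @ Y.conj().T) / T            # time average <Phi(Y) Y^H>_t
        W = W + eta * (np.diag(np.diag(R)) - R) @ W
    return W
```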

Source permutation and gain arbitrariness problems and their solutions
This section describes the problems which arise after the signal separation described in Section 3.2.1, and solutions for these problems are newly proposed. Hereafter, we assume a two-channel model without loss of generality, that is, $K = L = 2$. We assume that the following separation has been completed at frequency bin $f$:

$$\begin{bmatrix} \hat{S}_1(f,t) \\ \hat{S}_2(f,t) \end{bmatrix} = W(f) \begin{bmatrix} X_1(f,t) \\ X_2(f,t) \end{bmatrix},$$

where $\hat{S}_1(f,t)$ and $\hat{S}_2(f,t)$ are the components of the estimated source signals. Since the above calculations are carried out at each frequency bin independently, the following two problems arise (see Figure 4).

Problem 1. The permutation of the source signals $\hat{S}_1(f,t)$ and $\hat{S}_2(f,t)$ arises. That is, the separated signal components can be permuted at every frequency bin; for example, at a frequency bin $f = f_1$, $\hat{S}_1(f_1,t)$ corresponds to $S_1(f_1,t)$ and $\hat{S}_2(f_1,t)$ to $S_2(f_1,t)$, while at another frequency bin $f = f_2$, $\hat{S}_1(f_2,t)$ corresponds to $S_2(f_2,t)$ and $\hat{S}_2(f_2,t)$ to $S_1(f_2,t)$.

Problem 2. The gains of $\hat{S}_1(f,t)$ and $\hat{S}_2(f,t)$ are arbitrary. That is, different gains are obtained at different frequency bins $f = f_1$ and $f = f_2$.
In order to resolve Problems 1 and 2, we focus on the mechanism of the BSS as array signal processing to obtain the separated signals in the acoustical space. For example, from the separation procedure described above, $\hat{S}_1(f,t)$ is given by

$$\hat{S}_1(f, t) = W_{11}(f)\, X_1(f, t) + W_{12}(f)\, X_2(f, t).$$

This equation shows that the resultant output signals are obtained by multiplying the array input signals $X_1(f,t)$ and $X_2(f,t)$ by the weights $W_{lk}(f)$ and then adding them. Thus, from the standpoint of array signal processing, this operation implies that directivity patterns are produced in the array system. Accordingly, we calculate the directivity patterns with respect to the $W_{lk}(f)$ obtained at every frequency bin. The directivity pattern $F_l(f, \theta)$ is given by [24]

$$F_l(f, \theta) = \sum_{k=1}^{K} W_{lk}(f) \exp\big(j 2\pi f d_k \sin\theta / c\big).$$

This equation shows that the $l$th directivity pattern $F_l(f, \theta)$ is produced to extract the $l$th source signal. Using the directivity pattern $F_l(f, \theta)$, we propose the following procedure to resolve Problems 1 and 2 (a computational sketch of $F_l(f, \theta)$ is given first).
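For a straight-line array, $F_l(f, \theta)$ can be evaluated numerically as below; this is a small sketch with hypothetical names, evaluating one row of $W(f)$ on a grid of candidate directions.

```python
import numpy as np

def directivity_pattern(w_row, f, mic_pos, thetas_deg, c=340.0):
    """Evaluate F_l(f, theta) = sum_k W_lk(f) exp(j 2 pi f d_k sin(theta) / c)
    for one row of the unmixing matrix over a grid of directions (degrees)."""
    d = np.asarray(mic_pos, dtype=float)              # element coordinates d_k [m]
    th = np.deg2rad(np.asarray(thetas_deg, dtype=float))
    steering = np.exp(1j * 2 * np.pi * f * np.outer(d, np.sin(th)) / c)
    return w_row @ steering                           # complex gain per direction
```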
Step 1. We plot the directivity patterns in all frequency bins; for example, in the frequency bins $f_1$ and $f_2$, the directivity patterns are plotted as shown in Figure 4.
Step 2. In the directivity patterns, directional nulls exist in only two particular directions, and these nulls represent the DOAs of the sound sources. Accordingly, by taking statistics of the null directions over all frequency bins, we can estimate the DOAs of the sound sources (a computational sketch follows the step list). The DOA of the $l$th sound source, $\hat{\theta}_l$, can be estimated as

$$\hat{\theta}_l = \frac{2}{N} \sum_{m=1}^{N/2} \theta_l(f_m),$$

where $N$ is the total number of DFT points and $\theta_l(f_m)$ represents the DOA of the $l$th sound source at the $m$th frequency bin. These are given by

$$\theta_1(f_m) = \min\big[\theta_{\mathrm{N}1}(f_m),\, \theta_{\mathrm{N}2}(f_m)\big], \qquad \theta_2(f_m) = \max\big[\theta_{\mathrm{N}1}(f_m),\, \theta_{\mathrm{N}2}(f_m)\big],$$

where $\theta_{\mathrm{N}1}(f_m)$ and $\theta_{\mathrm{N}2}(f_m)$ denote the two null directions observed at the $m$th frequency bin, and $\min[x, y]$ ($\max[x, y]$) is defined as a function that obtains the smaller (larger) value among $x$ and $y$.
Step 3. From the directivity patterns in all frequency bins, we collect the specific ones in which the directional null is steered to the direction of $\hat{S}_1(f,t)$. Also, we collect the other specific directivity patterns in which the directional null is steered to the direction of $\hat{S}_2(f,t)$. Here, we decide to collect the directivity patterns in which the null is steered to the direction of $\hat{S}_1(f,t)$ ($\hat{S}_2(f,t)$) on the right- (left-) hand side of Figure 5. From this constraint, we replace $F_1(f_2, \theta)$ with $F_2(f_2, \theta)$ at the frequency bin $f = f_2$. By performing this procedure, we can resolve Problem 1.
Step 4. Problem 2 is resolved by normalizing the directivity patterns according to the gain in each source direction after the classification (see Figure 5). In Figure 5, $\alpha_1$ and $\alpha_2$ are the constants which normalize the gain in the direction of $\hat{S}_1(f,t)$, and $\beta_1$ and $\beta_2$ are the constants which normalize the gain in the direction of $\hat{S}_2(f,t)$.
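Steps 1 and 2 can be sketched computationally as follows: locate the two nulls of each bin's directivity patterns, sort them (min for source 1, max for source 2), and average over bins. The standard deviation of the null directions also yields the quantity $\sigma_l$ used in Section 3.4. The grid resolution and all names are our assumptions.

```python
import numpy as np

def estimate_doas(W_all, freqs, mic_pos, c=340.0):
    """Estimate the two source DOAs from per-bin 2x2 unmixing matrices by
    locating the null of each row's directivity pattern, sorting the two
    nulls per bin, and averaging over bins. Also returns sigma_l."""
    grid = np.linspace(-90.0, 90.0, 361)             # candidate DOAs [deg]
    d = np.asarray(mic_pos, dtype=float)
    sin_g = np.sin(np.deg2rad(grid))
    nulls = []
    # Note: pass only bins well above DC; directivity is nearly flat at low f.
    for W, f in zip(W_all, freqs):
        steer = np.exp(1j * 2 * np.pi * f * np.outer(d, sin_g) / c)
        gains = np.abs(W @ steer)                    # |F_l(f, theta)| per row
        nulls.append(sorted(grid[np.argmin(g)] for g in gains))
    nulls = np.array(nulls)                          # (bins, 2) null directions
    return nulls.mean(axis=0), nulls.std(axis=0)     # DOA estimates, sigma_l
```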
By applying the above-mentioned modifications, we can finally obtain the unmixing matrix of the ICA section, $W^{(\mathrm{ICA})}(f)$, by reordering the rows of $W(f)$ according to the classification in Step 3 and scaling each row by the normalization constant of Step 4, that is,

$$W^{(\mathrm{ICA})}_{lk}(f) = \frac{W_{\sigma(l)k}(f)}{F_{\sigma(l)}\big(f, \hat{\theta}_l\big)},$$

where $\sigma(\cdot)$ denotes the row permutation determined in Step 3.
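A per-bin sketch of Steps 3 and 4 for the two-source case follows. The pairing heuristic (assign each row to the source toward which it has the larger response) and the complex normalization are our reading of the classification and normalization described above; all names are hypothetical.

```python
import numpy as np

def fix_permutation_and_gain(W, f, mic_pos, doa_hat, c=340.0):
    """Correct one bin's 2x2 unmixing matrix W: reorder rows so that row l
    extracts source l (Step 3), then normalize each row to unit response
    toward its own source direction doa_hat[l] in degrees (Step 4)."""
    d = np.asarray(mic_pos, dtype=float)
    def response(row, theta_deg):                    # F(f, theta) for one row
        a = np.exp(1j * 2 * np.pi * f * d * np.sin(np.deg2rad(theta_deg)) / c)
        return row @ a
    # Step 3: keep the row assignment with the larger total pass-band gain.
    g = np.abs([[response(W[l], doa_hat[s]) for s in range(2)] for l in range(2)])
    order = [0, 1] if g[0, 0] + g[1, 1] >= g[0, 1] + g[1, 0] else [1, 0]
    W = W[order]                                     # row permutation (copy)
    # Step 4: unit (complex) response toward each row's own source direction.
    for l in range(2):
        W[l] = W[l] / response(W[l], doa_hat[l])
    return W
```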

Beamforming section
In the beamforming section, we construct an alternative unmixing matrix in parallel, based on the null beamforming technique, using the DOA information obtained in the ICA section. In the case that the look direction is $\hat{\theta}_1$ and the directional null is steered to $\hat{\theta}_2$, the elements of the unmixing matrix, $W^{(\mathrm{BF})}_{1k}(f_m)$, satisfy the following simultaneous equations:

$$\sum_{k=1}^{2} W^{(\mathrm{BF})}_{1k}(f_m) \exp\big(j 2\pi f_m d_k \sin\hat{\theta}_1 / c\big) = 1, \qquad \sum_{k=1}^{2} W^{(\mathrm{BF})}_{1k}(f_m) \exp\big(j 2\pi f_m d_k \sin\hat{\theta}_2 / c\big) = 0.$$

The solutions of the equations are given by

$$W^{(\mathrm{BF})}_{11}(f_m) = \frac{\exp\big(j 2\pi f_m d_2 \sin\hat{\theta}_2 / c\big)}{\Delta(f_m)}, \qquad W^{(\mathrm{BF})}_{12}(f_m) = \frac{-\exp\big(j 2\pi f_m d_1 \sin\hat{\theta}_2 / c\big)}{\Delta(f_m)},$$

where $\Delta(f_m) = \exp\big(j 2\pi f_m (d_1 \sin\hat{\theta}_1 + d_2 \sin\hat{\theta}_2)/c\big) - \exp\big(j 2\pi f_m (d_2 \sin\hat{\theta}_1 + d_1 \sin\hat{\theta}_2)/c\big)$. Similarly, in the case that the look direction is $\hat{\theta}_2$ and the directional null is steered to $\hat{\theta}_1$, the elements of the unmixing matrix, $W^{(\mathrm{BF})}_{2k}(f_m)$, satisfy

$$\sum_{k=1}^{2} W^{(\mathrm{BF})}_{2k}(f_m) \exp\big(j 2\pi f_m d_k \sin\hat{\theta}_2 / c\big) = 1, \qquad \sum_{k=1}^{2} W^{(\mathrm{BF})}_{2k}(f_m) \exp\big(j 2\pi f_m d_k \sin\hat{\theta}_1 / c\big) = 0,$$

whose solutions are obtained in the same manner with the roles of $\hat{\theta}_1$ and $\hat{\theta}_2$ exchanged. Equivalently, $W^{(\mathrm{BF})}(f_m)$ is the inverse of the estimated steering matrix $\big[\exp\big(j 2\pi f_m d_k \sin\hat{\theta}_l / c\big)\big]_{kl}$. These unmixing matrices are approximately optimal for the signal separation only when ideal far-field propagation is considered and the effect of room reverberation is negligible. Under reverberant conditions, however, such acoustic conditions are oversimplified, and the optimality no longer holds because sufficient signal reduction cannot be achieved by the directional nulls alone. This signal-separation approach nevertheless has the advantage that there is no difficulty with respect to slow convergence of the optimization, because the null beamformer is determined using only DOA information, without relying on the independence of the sound sources. The effectiveness of the null beamforming appears especially when we combine beamforming and ICA, as described in the next section.
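Since the two constraints per row amount to inverting the estimated steering matrix, the null beamformer can be sketched in a few lines; the function name is hypothetical.

```python
import numpy as np

def null_beamformer(f, mic_pos, doa_hat_deg, c=340.0):
    """Null-beamformer unmixing matrix W_BF(f) for two sources: row l has
    unit response toward its look direction and a null toward the other
    estimated DOA, i.e., W_BF(f) is the inverse of the steering matrix."""
    d = np.asarray(mic_pos, dtype=float)
    th = np.deg2rad(np.asarray(doa_hat_deg, dtype=float))
    A_hat = np.exp(1j * 2 * np.pi * f * np.outer(d, np.sin(th)) / c)
    return np.linalg.inv(A_hat)     # W_BF(f) @ A_hat = I: unity gain + nulls
```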

Integration of subband ICA with null beamforming
In order to integrate the subband ICA with null beamforming, we introduce the following strategy for selecting the most suitable unmixing matrix in each frequency bin, that is, algorithm diversity in the frequency domain. If the directional null is steered close to the estimated DOA of the undesired sound source, we use the unmixing matrix obtained by the subband ICA, $W^{(\mathrm{ICA})}_{lk}(f)$. If the directional null deviates from the estimated DOA, we use the unmixing matrix obtained by the null beamforming, $W^{(\mathrm{BF})}_{lk}(f)$, in preference to that of the subband ICA. The above strategy yields the following algorithm:

$$W_{lk}(f_m) = \begin{cases} W^{(\mathrm{ICA})}_{lk}(f_m), & \big|\theta_l(f_m) - \hat{\theta}_l\big| \le h\,\sigma_l, \\[2pt] W^{(\mathrm{BF})}_{lk}(f_m), & \text{otherwise}, \end{cases}$$

where $h$ is a magnification parameter of the threshold and $\sigma_l$ represents the deviation with respect to the estimated DOA of the $l$th sound source; it can be given as

$$\sigma_l = \sqrt{\frac{2}{N} \sum_{m=1}^{N/2} \big(\theta_l(f_m) - \hat{\theta}_l\big)^2}.$$

Using this algorithm with an adequate value of $h$, we can recover an unmixing matrix trapped at a local minimum of the optimization procedure in ICA. Also, by changing the parameter $h$, we can construct various types of array signal processing for BSS, for example, a simple null beamforming with $h = 0$ and a simple ICA-based BSS procedure with $h = \infty$.
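A sketch of this selection rule is given below, applying the threshold test independently per bin and per output. The bookkeeping of which null direction is compared with which estimated DOA follows the equation above literally and is our simplification; all names are hypothetical.

```python
import numpy as np

def select_unmixing(W_ica, W_bf, theta_f, doa_hat, sigma, h=2.0):
    """Frequency-domain algorithm diversity (two-source sketch). For each
    bin m and output l, keep the ICA solution only when the ICA null
    direction theta_f[m, l] lies within h * sigma[l] of doa_hat[l];
    otherwise fall back to the null-beamformer row for that output."""
    W_out = []
    for m, (Wi, Wb) in enumerate(zip(W_ica, W_bf)):
        W = Wb.copy()                        # default: null-beamformer rows
        for l in range(W.shape[0]):
            if abs(theta_f[m, l] - doa_hat[l]) <= h * sigma[l]:
                W[l] = Wi[l]                 # ICA null is trusted at this bin
        W_out.append(W)
    return np.array(W_out)

# h = 0 reduces to pure null beamforming; h -> infinity to pure subband ICA.
```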
By substituting the $W(f)$ obtained after the above-mentioned selection into the separation procedure of Section 3.2.1 and applying the inverse DFT to the outputs $\hat{S}_1(f,t)$ and $\hat{S}_2(f,t)$, we can obtain the separated source signals correctly.

EXPERIMENTS AND RESULTS
Signal-separation experiments are conducted using sound data convolved with impulse responses recorded in two environments with different reverberation times (RTs). In these experiments, we investigated the separation performance under different reverberant conditions from two standpoints: an objective evaluation of the separated speech quality and a word recognition test.

Conditions for experiments
A two-element array with an interelement spacing of 4 cm is assumed. We determined this spacing by considering that it should be smaller than half the minimum wavelength in order to avoid spatial aliasing; this corresponds to 8.5/2 = 4.25 cm at 8 kHz sampling. The speech signals are assumed to arrive from two directions: −30° and 40°. Six sentences spoken by six male and six female speakers, selected from the ASJ continuous speech corpus for research [25], are used as the original speech. Using these sentences, we obtain 36 combinations with respect to speakers and source directions. In these experiments, we used the following signals as the source signals: (1) the original speech not convolved with the room impulse responses (only the arrival lags among the microphones are considered) and (2) the original speech convolved with the room impulse responses recorded in the two environments with different RTs. Hereafter, we designate the experiments using the signals described in (1) as the nonreverberant tests, and those of (2) as the reverberant tests. The impulse responses are recorded in a variable-RT room as shown in Figure 6. The RTs of the impulse responses recorded in the room are 150 milliseconds and 300 milliseconds, respectively. These sound data, which are artificially convolved with the real impulse responses, have the following advantages. (1) We can use a realistic mixture model of two sources while neglecting the effect of background noise.
(2) Since the mixing condition is explicitly measured, we can easily calculate a reliable objective score to evaluate the separation performance as described in Section 4.2. The analysis conditions of these experiments are summarized in Table 1.

Objective evaluation score
The noise reduction rate (NRR), defined as the output signal-to-noise ratio (SNR) in dB minus the input SNR in dB, is used as the objective evaluation score in this experiment. The SNRs are calculated under the assumption that the speech signal of the undesired speaker is regarded as noise. The NRR is defined as

$$\mathrm{NRR}\;[\mathrm{dB}] = \frac{1}{L} \sum_{l=1}^{L} \Big( \mathrm{SNR}^{(\mathrm{O})}_l - \mathrm{SNR}^{(\mathrm{I})}_l \Big),$$

$$\mathrm{SNR}^{(\mathrm{O})}_l = 10 \log_{10} \frac{\sum_f \big| H_{ll}(f)\, S_l(f) \big|^2}{\sum_f \big| H_{ln}(f)\, S_n(f) \big|^2}, \qquad \mathrm{SNR}^{(\mathrm{I})}_l = 10 \log_{10} \frac{\sum_f \big| A_{ll}(f)\, S_l(f) \big|^2}{\sum_f \big| A_{ln}(f)\, S_n(f) \big|^2},$$

where $\mathrm{SNR}^{(\mathrm{O})}_l$ and $\mathrm{SNR}^{(\mathrm{I})}_l$ are the output SNR and the input SNR, respectively, and $l \neq n$. Also, $H_{ij}(f)$ is the element in the $i$th row and the $j$th column of the matrix $H(f) = W(f)\, A(f)$, where the mixing matrix $A(f)$ corresponds to the frequency-domain representation of the room impulse responses described in Section 4.1.
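Under the assumption that the per-bin unmixing matrices $W(f)$, mixing matrices $A(f)$, and source spectra $S(f)$ are available as arrays, the NRR can be computed as sketched below for the two-source case; the function name is ours.

```python
import numpy as np

def nrr_db(W_all, A_all, S_all):
    """Noise reduction rate for the two-source case: output SNR minus input
    SNR averaged over both outputs, with H(f) = W(f) A(f) and the other
    speaker regarded as noise. Arrays are indexed by frequency bin."""
    H_all = np.einsum('mij,mjk->mik', W_all, A_all)  # H(f) = W(f) A(f)
    def snr_db(G):
        vals = []
        for l in range(2):
            n = 1 - l                                # the interfering source
            sig = np.sum(np.abs(G[:, l, l] * S_all[:, l]) ** 2)
            noi = np.sum(np.abs(G[:, l, n] * S_all[:, n]) ** 2)
            vals.append(10.0 * np.log10(sig / noi))
        return np.array(vals)
    return float(np.mean(snr_db(H_all) - snr_db(A_all)))
```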

Alternative method for comparison
In order to perform a comparison with the proposed method, we also performed a BSS experiment using the alternative method proposed by Murata and Ikeda [19] with the modification for offline learning.
Our proposed method is based on the utilization of directivity patterns; in contrast, Murata's method is based on the utilization of $W^{-1}(f)$ for the normalization of the gain, and on the a priori assumption of similarity among the envelopes of the source signal waveforms for the recovery of the source permutation. In this method, the following operation is performed:

$$\tilde{S}_l(f, t) = W^{-1}(f)\, \big[ 0, \ldots, 0, Y_l(f,t), 0, \ldots, 0 \big]^T,$$

where the separated output $Y_l(f,t)$ is placed in the $l$th position and $\tilde{S}_l(f,t)$ denotes the component of the $l$th estimated source signal in the frequency bin $f$, as observed at the microphones. By using both $W(f)$ and $W^{-1}(f)$, the gain arbitrariness vanishes in the separation procedure. Also, the source permutation can be detected and recovered by measuring the similarity among the envelopes of $\tilde{S}_l(f,t)$ between the different frequency bins.
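A per-bin sketch of this projection-back operation, under our reading of the formula above, is given below; it returns the image of each separated output at the first microphone. Names and the choice of reference microphone are our assumptions.

```python
import numpy as np

def project_back(W, Y):
    """Murata-style scaling fix at one frequency bin: map each separated
    output Y[l] (length-T subband series) back through W^{-1} to its image
    at microphone 1, which removes the arbitrary per-bin gain."""
    W_inv = np.linalg.inv(W)
    return np.array([W_inv[0, l] * Y[l] for l in range(W.shape[0])])
```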

Objective evaluation of separated signal
In order to illustrate the behavior of the proposed array for different values of h, the NRRs are shown in Figures 7, 8, and 9. These values are averaged over all of the combinations with respect to speakers and source directions. From Figure 7, for the nonreverberant tests, it can be seen that the NRRs monotonically increase as the parameter h decreases, that is, the performance of the null beamformer is superior to that of the ICA-based BSS. This indicates that the directions of the sound sources are estimated correctly by the proposed method, and thus the null beamforming technique is more suitable for the separation of directional sound sources under the nonreverberant condition.
In contrast, from Figures 8 and 9, for the reverberant tests, the NRR monotonically increases as the parameter h decreases in the case that observed signals of 1-second duration are used to learn the unmixing matrix, and we can obtain the optimum performances by setting an appropriate value of h, for example, h = 2, in the cases that the learning durations are 3 seconds and 5 seconds. We can summarize from these results that the proposed combination of ICA and beamforming is effective, especially under reverberant conditions. In order to perform a comparison with the conventional BSS method, we also perform the same BSS experiments using Murata's method as described in Section 4.3. Figure 10a shows the results obtained using the proposed method and Murata's method when observed signals of 5-second duration are used to learn the unmixing matrix, Figure 10b shows those of 3-second duration, and Figure 10c shows those of 1-second duration. In these experiments, the parameter h in the proposed method is set to 2.
From Figure 10, in both the nonreverberant and reverberant tests, it can be seen that the BSS performances obtained using the proposed method are the same as or superior to those of Murata's conventional method. In particular, from Figure 10c, it is evident that the NRRs of Murata's method degrade markedly in the case that the learning duration is 1 second, whereas there are no significant degradations in the case of the proposed method. By looking at the similarity, for example, the frequency-averaged cosine distance, defined as

$$\cos\big(\tilde{S}_1, \tilde{S}_2\big) = \frac{2}{N} \sum_{m=1}^{N/2} \frac{\sum_t \big|\tilde{S}_1(f_m,t)\big|\,\big|\tilde{S}_2(f_m,t)\big|}{\sqrt{\sum_t \big|\tilde{S}_1(f_m,t)\big|^2}\, \sqrt{\sum_t \big|\tilde{S}_2(f_m,t)\big|^2}},$$

among the source signals of different lengths, we can summarize the main reasons for the degradations in Murata's method as follows (see Figure 11). (1) The envelopes of the original source speech signals become more similar to each other as the duration of the speech shortens. (2) The envelopes of the separated signals at the same frequency are similar to each other, since the inaccurately estimated unmixing matrix leaves many crosstalk components. Therefore, the recovery of the permutation tends to fail in Murata's method. In contrast, our method does not fail to recover the source permutation, because it uses no information about the signal waveforms but only the directivity patterns.
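The envelope-similarity measure used in this analysis can be sketched as follows, assuming spectrograms with shape (bins × frames); the function name is ours, and the exact form of the published definition may differ.

```python
import numpy as np

def envelope_cosine(S1, S2):
    """Frequency-averaged cosine similarity between the amplitude envelopes
    of two spectrograms (bins x frames). Values near 1 indicate envelopes
    too similar for reliable envelope-based permutation recovery."""
    e1, e2 = np.abs(S1), np.abs(S2)
    num = np.sum(e1 * e2, axis=1)                        # per-bin inner product
    den = np.linalg.norm(e1, axis=1) * np.linalg.norm(e2, axis=1)
    return float(np.mean(num / np.maximum(den, 1e-12)))  # average over bins
```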

Word recognition test
The training data are selected from the ASJ continuous speech corpus for research. The remaining conditions are summarized in Table 2. Figure 12 shows the results in terms of word recognition rates under the different reverberant conditions. Compared with the results of Murata's BSS method, it is evident that the improvements achieved by the proposed method are superior to those of the conventional ICA-based BSS method under all conditions with respect to both reverberation and learning duration. These results indicate that the proposed method is applicable to speech-recognition systems, particularly when confronted with interfering speech signals.

CONCLUSION
In this paper, a new BSS method using subband ICA and beamforming was described. In order to evaluate its effectiveness, signal-separation and speech-recognition experiments were performed under various reverberant conditions. The signal-separation experiments with observed signals of sufficient duration revealed that an NRR of about 18 dB is obtained under the nonreverberant condition, and NRRs of 8 dB and 6 dB are obtained in the cases that the RTs are 150 milliseconds and 300 milliseconds, respectively. These performances were superior to those of both the simple ICA-based BSS and the simple beamforming technique. Also, it was evident that the NRRs of Murata's ICA-based BSS method degrade markedly in the case that the learning duration is 1 second, whereas there are no significant degradations in the case of the proposed method. From the speech-recognition experiments, compared with the results of Murata's BSS method, it was evident that the improvements achieved by the proposed method are superior under all conditions with respect to both reverberation and learning duration. These results indicate that the proposed method is applicable to speech-recognition systems, particularly when confronted with interfering speech signals.
In this paper, we mainly showed that the utilization of beamforming in ICA can improve the separation performance. As for other applications of beamforming to ICA, we have already presented a method [27] which is particularly concerned with accelerating the convergence of the ICA learning. These results provide explicit evidence for the effectiveness of beamforming used within the ICA framework; however, further study and development of alternative techniques for combining ICA and beamforming remains an open problem.