# Estimation of Sound Source Number and Directions under a Multisource Reverberant Environment

- Jwu-Sheng Hu^{1}
- Chia-Hsin Yang^{1} (corresponding author)

**2010**:870756

https://doi.org/10.1155/2010/870756

© Jwu-Sheng Hu and Chia-Hsin Yang. 2010

**Received:** 3 December 2009

**Accepted:** 27 May 2010

**Published:** 22 June 2010

## Abstract

Sound source localization is an important feature in robot audition. This work proposes a sound source number and directions estimation method under a multisource reverberant environment. An eigenstructure-based generalized cross-correlation method is proposed to estimate time delay among microphones. A source is considered as a candidate if the corresponding time delay combination among microphones gives reasonable sound speed estimation. Under reverberation, some candidates might be spurious but their direction estimations are not consistent for consecutive data frames. Therefore, an adaptive K-means++ algorithm is proposed to cluster the accumulated results from the sound speed selection mechanism. Experimental results demonstrate the performance of the proposed algorithm in a real room.

## 1. Introduction

Sound source localization is one of the fundamental features of robot audition for human-robot interaction as well as recognition of the environment. The idea of using multiple microphones to localize sound sources has been developed for a long time. Among the various sound localization methods, generalized cross correlation (GCC) [1–3] has been used for robotic applications [4], but it is not robust in multisource environments. Improvements to its performance in multisource and reverberant environments have also been discussed [5, 6]. Another approach, proposed by Balan and Rosca [7], explores the eigenstructure of the correlation matrix of the microphone array by separating speech signals and noise signals into two orthogonal subspaces. The direction-of-arrival (DOA) is then estimated by projecting the manifold vectors onto the noise subspace. MUSIC [8, 9] combined with spatial smoothing [10] is one of the most popular methods for eliminating the coherence problem, and it has also been applied to robot audition [11].

Based on the geometrical relationship among time delay values, Walworth and Mahajan [12] proposed a linear equation formulation for estimating the three-dimensional (3D) position of a wave source. Later, Valin et al. [13] gave a simple solution for the linear equation in [12] based on the far-field assumption and developed a novel weighting function method to estimate the time delay. In a real environment, the sound source may move. Valin et al. [14] proposed a method for localizing and tracking simultaneous moving sound sources using eight microphones; it is based on a frequency-domain implementation of a steered beamformer along with a particle filter-based tracking algorithm. In addition, Badali et al. [15] investigated the accuracy of different time-delay-of-arrival audio localization implementations in the context of artificial audition for robotic systems.

Yao et al. [16] presented an efficient blind beamformer technique to estimate the time delays from the dominant source. This method estimates the relative time delay from the dominant eigenvector computed from the time-averaged sample correlation matrix. They also formulated a source linear equation similar to that in [12] to estimate the source location and velocity via the least-square method. Statistical methods [17–19] have also been proposed to solve the DOA problem in complex environments. These methods outperform conventional DOA methods, especially when the sound source is not within line-of-sight. However, they need a training procedure to obtain the pattern of sound wave arrival, which may not be realistic for robot applications when the environment is unknown.

The methods above assume that the sound source number is known. But this may not be a realistic assumption because the environment usually contains various kinds of sound sources. Several eigenvalue-based methods have been proposed [20, 21] to estimate the sound source number. However, the eigenvalue distribution is sensitive to noise and reverberation. The work in [22] used the support vector machine (SVM) to classify the distribution with respect to the sound source number. However, it still requires a training stage for a robust result and the binary classification is inadequate when the sound source number is larger than two.

The objective of this work is to estimate multiple fixed sound source directions without a priori information about the sound source number and the environment. This work utilizes the time delay information and the microphone array geometry to estimate the sound source directions [23]. A novel eigenstructure-based GCC (ES-GCC) method is proposed to estimate the time delay between two microphones under a multisource environment. The theoretical proof of the ES-GCC method is given, and the experimental results show that it is robust in a noisy environment. As a result, the sound source direction and velocity can be obtained by solving the proposed linear equation model using the time delay information. Fundamentally, the sound source number should be known while estimating the sound source directions. Hence, a method that estimates the sound source number and directions simultaneously using the proposed adaptive K-means++ is introduced, and all the experiments are conducted in a real environment. This paper is organized as follows. In Section 2, we introduce the novel ES-GCC method for time delay estimation. With the time delay estimation, the sound source direction and speed estimation method is presented in Section 3, where the estimation error is also analyzed. In Section 4, we propose the sound speed selection mechanism and the adaptive K-means++ algorithm. Experimental results, presented in Section 5, demonstrate the performance of the proposed algorithm in a real environment. Section 6 concludes the paper.

## 2. Time Delay Estimation

Consider an array of *M* microphones in a noisy environment. The received signal of each microphone contains *D* sources and is described by (1): the sum of the source signals, each convolved with the impulse response from the *d*th sound source to that microphone, plus nondirectional noise. The impulse responses are assumed to be time-invariant over the observation period. It is assumed that the source signals and the noise are mutually uncorrelated and that the sound source signals are mutually independent. Applying the short-time Fourier transform (STFT) to (1) gives the frequency-domain model, from which the correlation matrix of the received signals and its eigenvectors are obtained.

The eigenvectors associated with the *M* - *D* smallest eigenvalues are referred to as noise eigenvectors and span the noise subspace. The MUSIC algorithm [8, 9] uses the orthogonality of the signal and noise subspaces to estimate the signal directions, and it mainly uses the eigenvectors that lie in the noise subspace. Rather than using the noise subspace information, this paper considers the eigenvectors that lie in the signal subspace for time delay estimation (TDE) to minimize the influence of noise. The idea of employing the eigenvectors in the signal subspace is also known as the Blackman-Tukey frequency estimation method [24]. Among the signal eigenvectors, the eigenvector associated with the maximum eigenvalue is selected, and the ES-GCC function between the *i*th and *j*th microphones is formed from its elements.
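As a rough illustration of the idea in this section, the sketch below forms a per-frequency spatial correlation matrix from STFT frames, takes the eigenvector of the largest eigenvalue, and inverse-transforms the cross-spectrum of its *i*th and *j*th elements to obtain a correlation function whose peak indicates the delay. The function name `es_gcc`, the non-overlapping Hann-windowed frames, the frame length, and the PHAT-style magnitude normalization are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np


def es_gcc(signals, i, j, frame_len=256):
    """Eigenstructure-based cross-correlation between microphones i and j.

    signals: array of shape (n_mics, n_samples). frame_len is assumed even.
    Returns (lags, r); a peak at lag d > 0 means microphone j lags i by d samples.
    """
    n_mics, n_samples = signals.shape
    n_frames = n_samples // frame_len
    frames = signals[:, : n_frames * frame_len].reshape(n_mics, n_frames, frame_len)

    # STFT of every frame (non-overlapping frames, an assumption of this sketch).
    spectra = np.fft.rfft(frames * np.hanning(frame_len), axis=-1)

    # Spatial correlation matrix for each frequency bin, averaged over frames:
    # corr[k, m, n] = mean_f X_m(f, k) * conj(X_n(f, k)).
    corr = np.einsum("mfk,nfk->kmn", spectra, spectra.conj()) / n_frames

    # eigh returns eigenvalues in ascending order, so the last eigenvector
    # belongs to the maximum eigenvalue.
    _, vecs = np.linalg.eigh(corr)
    principal = vecs[:, :, -1]  # shape (n_bins, n_mics)

    # Cross-spectrum of the principal-eigenvector elements.
    cross = principal[:, j] * np.conj(principal[:, i])
    # PHAT-style normalization: an assumption, not confirmed by the source.
    cross /= np.maximum(np.abs(cross), 1e-12)

    r = np.fft.fftshift(np.fft.irfft(cross, n=frame_len))
    lags = np.arange(-frame_len // 2, frame_len // 2)
    return lags, r
```

With a white-noise signal and a copy delayed by a few samples, `lags[np.argmax(r)]` is expected to land on that delay; several peaks per microphone pair would be retained in a multisource setting.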

## 3. Sound Source Localization and Speed Estimation

### 3.1. Sound Source Location Estimation Using Least-Square Method

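The sound source location is estimated from the *i*th microphone location and the relative time delays between the *i*th microphone and the first microphone. The relative time delay satisfies (10), which relates it to the difference between the source-to-microphone distances of the *i*th and the first microphones through the speed of sound. Equation (10) can be rewritten as a linear system, which is solved by the least-square method.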

### 3.2. Sound Source Direction Estimation Using Least-Square Method for Far-Field Case

For the least-square solution of (13) to be unique, **A**_{g} must be full rank. However, for matrix **A**_{g}, the condition on rank is more complicated, and the matrix can easily become ill-conditioned. For example, if the microphones are distributed on a spherical surface (i.e., each microphone position is given by the radius *R*_{m} together with its azimuth and elevation angles), it can be verified that the fourth column of **A**_{g} is a linear combination of columns 1, 2, and 3. Secondly, if the aperture of the array is small compared with the source distance (far-field), the distance estimation is also sensitive to noise. In the following, a detailed analysis of (13) is presented, which leads to a formulation for the far-field case.

Consider the distances from the source to the *i*th and the first microphones, and let the distance difference be *d*_{i}. The associated term in the analysis achieves its maximum value when *d*_{i} = 0 (i.e., when the source is located along the line passing through the midpoint of, and perpendicular to, the segment connecting the *i*th and the first microphones). This also means that its order of magnitude is less than or equal to the magnitude of the corresponding vector.

Two quantities from this analysis are referred to as *the field distance ratio* and *the near field influence factor* for their roles in sound source localization using a microphone array. Equation (26) can also be derived from a plane wave assumption. Consider a single incident plane wave and a pair of microphones, as shown in Figure 1; the relative time delay between the two microphones is the projection of the vector between the microphones onto the propagation direction, divided by the speed of sound.

### 3.3. Estimation Error Analysis

A source position may be favorable for sensor pair (1, *i*) but have an adverse effect on sensor pair (1, *j*) if the corresponding vectors are perpendicular to each other. A simple simulation of the estimation error is illustrated for the microphone locations depicted in Figure 7. We assume that there is no time delay estimation error and that the sound velocity is 34300 cm/sec. The sound source is moved along the direction vector (0.3256, 0.9455, 0). The estimated sound source direction and velocity are obtained by using (31) and (32). Figure 2 shows the relation between the direction estimation error and the field distance ratio. The direction estimation error is defined as the difference between the real angle and the estimated angle. As can be seen, the estimation error becomes smaller and converges to a small value as the ratio increases. In particular, the estimation error does not change dramatically once the ratio is larger than 5 (i.e., once the numerator of the ratio is more than five times its denominator). Figure 3 shows the relation between the estimated velocity and the ratio. The estimated velocity converges to 34300 cm/sec as the ratio increases, which is consistent with the analysis at the beginning of this section.
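The kind of simulation described above can be sketched as follows, assuming a plane-wave least-squares model in which each baseline between a microphone and the first microphone, projected onto the source direction and divided by the sound speed, gives the negative relative delay. The array geometry `MICS`, the function names, and the sign convention are hypothetical choices for this example (the layout of Figure 7 and the exact form of (31) and (32) are not reproduced here); only the sound velocity and the direction vector come from the text.

```python
import numpy as np

SPEED_OF_SOUND = 34300.0  # cm/s, the value used in the text

# Hypothetical array geometry in cm (not the layout of Figure 7).
MICS = np.array([
    [0, 0, 0], [20, 0, 0], [0, 20, 0], [0, 0, 20],
    [20, 20, 0], [20, 0, 20], [0, 20, 20], [20, 20, 20],
], dtype=float)


def relative_delays(source, mics, c=SPEED_OF_SOUND):
    """Exact delays (s) of microphones 2..M relative to microphone 1."""
    dist = np.linalg.norm(source - mics, axis=1)
    return (dist[1:] - dist[0]) / c


def far_field_estimate(delays, mics):
    """Least-squares direction and speed under a plane-wave assumption.

    Solves (r_i - r_1) . w = -t_i for w = u / c, where u is the unit vector
    pointing from the array towards the source and c is the sound speed.
    """
    baselines = mics[1:] - mics[0]
    w, *_ = np.linalg.lstsq(baselines, -np.asarray(delays), rcond=None)
    norm = np.linalg.norm(w)
    return w / norm, 1.0 / norm


def angle_error_deg(u_est, u_true):
    """Angle in degrees between the estimated and the true direction."""
    cos_angle = np.dot(u_est, u_true) / np.linalg.norm(u_true)
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))


if __name__ == "__main__":
    direction = np.array([0.3256, 0.9455, 0.0])
    for distance in (40.0, 100.0, 400.0, 2000.0):  # source distance in cm
        delays = relative_delays(distance * direction, MICS)
        u_hat, c_hat = far_field_estimate(delays, MICS)
        print(distance, angle_error_deg(u_hat, direction), c_hat)
```

Because the delays are computed from exact spherical propagation while the solver assumes a plane wave, both the direction error and the speed error should shrink as the source moves away from the array, which is the qualitative behaviour the paragraph attributes to Figures 2 and 3.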

## 4. Sound Source Number and Directions Estimation

This paper assumes that the distance from the source to the array is much larger than the array aperture, and (29) is used to solve the sound source direction estimation problem. If the number of sound sources is known, the sound source directions can be estimated by putting the time delay vector of the corresponding sound source into (32). However, if the sound source number is unknown, the direction estimation becomes more complicated, since there are several combinations that can form the time delay vectors. This section describes how to estimate the sound source number and directions simultaneously using the methods proposed in Sections 2 and 3.2.

A two-step algorithm is proposed to estimate the source number. First, the combinations of delays whose estimated sound velocity does not fall within a reasonable range of the true one are rejected. In a reverberant environment, however, a phantom source can still result in a reasonable sound speed estimation. This paper assumes that the power level of a phantom source is much weaker than that of a true source. Therefore, only a true source can exhibit a consistent direction estimation over consecutive frames of signals, because the weighting function of ES-GCC also has certain robustness to reverberation. The second step is to cluster the accumulated results from the first step, with the reverberation-induced results treated as outliers by the clustering technique.

The well-known clustering method, K-means, is sensitive to initial conditions and is not robust to outliers. In addition, K-means requires the cluster number to be known in advance, which cannot be met in our scenario since we have no information about the sound source number. To address the problems of robustness and cluster number, this paper proposes the adaptive K-means++ method, based on the K-means [26] and K-means++ [27] methods. The K-means++ method initializes K-means by choosing random starting centers with very specific probabilities and then runs the normal K-means algorithm. Because the seeding technique of K-means++ can improve both the speed and accuracy of K-means [27], this paper employs it to seed the initial centers for the proposed adaptive K-means++ method.

### 4.1. Rejecting Incorrect Time Delay Combinations Using Acceptable Velocity Range

For each microphone pair, the candidate time delay samples are taken from the peaks of the ES-GCC function, down to the *n*_{i}th largest peak, subject to the constraint (34). If an ES-GCC function possesses no time delay sample that meets this constraint, *n*_{i} is set to one. Hence, the number of possible time delay vectors is the product of the *n*_{i} over the microphone pairs, and *D* of these combinations should be correct. Figure 4 illustrates the procedure of forming the possible time delay vector combinations. The estimated time delay between the *i*th microphone and the first microphone is the estimated time delay sample divided by the sampling rate. The next issue is how to choose the correct combinations and determine the sound source number.

The combinations are selected using (36), and the direction of each selected combination is expressed through (37) by the azimuth and elevation angles of the sound source.
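The selection idea in this subsection can be sketched as follows: every combination of candidate delays (one per microphone pair) is turned into a direction and a speed estimate by a plane-wave least-squares solve, and only combinations whose speed falls inside an acceptable range are kept. The acceptable range, the angle convention in `unit_vector`, and all function names are assumptions for illustration; the paper's own constraints (34) and (36) are not reproduced here.

```python
import itertools

import numpy as np


def unit_vector(azimuth_deg, elevation_deg):
    """Direction vector for given angles (convention assumed for this sketch)."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])


def solve_direction(delays, mics):
    """Plane-wave least squares: (r_i - r_1) . w = -t_i with w = u / c."""
    baselines = mics[1:] - mics[0]
    w, *_ = np.linalg.lstsq(baselines, -np.asarray(delays), rcond=None)
    norm = np.linalg.norm(w)
    if norm == 0.0:
        return None, np.inf
    return w / norm, 1.0 / norm


def select_candidates(peak_delays, mics, speed_range=(33000.0, 36000.0)):
    """Keep delay combinations whose estimated speed (cm/s) is plausible.

    peak_delays[k] lists the candidate delays (s) between microphone k + 2
    and microphone 1. The speed range is a hypothetical choice.
    Returns a list of (azimuth_deg, elevation_deg, speed) tuples.
    """
    accepted = []
    for combo in itertools.product(*peak_delays):
        u, speed = solve_direction(combo, mics)
        if u is None or not speed_range[0] <= speed <= speed_range[1]:
            continue
        azimuth = np.degrees(np.arctan2(u[1], u[0]))
        elevation = np.degrees(np.arcsin(np.clip(u[2], -1.0, 1.0)))
        accepted.append((azimuth, elevation, speed))
    return accepted
```

Combinations that mix delays from different sources usually give an implausible speed and are dropped, but some may survive; this is why the accumulated results are clustered afterwards.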

### 4.2. Proposed Adaptive K-means++ for Sound Source Number and Directions Estimation

The direction estimates are accumulated over the tests, the *q*th test contributing the results selected in the previous subsection. So far, each accumulated data point has two features (the azimuth and elevation angles). Our goal is to divide these data into K clusters based on the two features. A cluster is defined as a set of sound source direction data points. The data within a cluster should be similar to one another, which means that they should come from the same sound source direction. The number K is defined as the sound source number. Therefore, among the set of sound source direction data points, we wish to choose K cluster centers so as to minimize the potential function, that is, the sum of the squared distances between each data point and its nearest center.

Equation (40) compares the variance of each cluster, computed with the expectation operation, against a specified threshold; it is used to check the variance of each cluster when the K-means algorithm converges. If the variance of any cluster is not less than the threshold, the value of K is increased by one. The additional initial center is then found by using the seeding technique of K-means++ [27] defined in (41), and the K-means algorithm is computed again.

In (41), the distance is measured between each data point and the nearest center already chosen, and the random number is a real number chosen uniformly at random between 0 and the upper limit defined there.

Step 1. Calculate the ES-GCC function for each microphone pair. Pick the peaks satisfying (34) from each function and list all the possible time delay vector combinations.

Step 2. Select the time delay vectors from these combinations using (36) and estimate the corresponding sound source directions using (37).

Step 3. Repeat Steps 1 and 2 a specified number of times and accumulate the results. Before each repetition, shift the start frame of Step 1 by a specified number of frames.

Step 4. Cluster the accumulated results using the adaptive K-means++ algorithm; the final cluster number and centers are the sound source number and directions, respectively.
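The clustering in Step 4 can be sketched as below, assuming that the cluster variance is the mean squared distance of the members to their center and that a new center is drawn with probability proportional to the squared distance to the nearest existing center (the K-means++ seeding rule of [27]). The function names, the variance definition, the `max_k` cap, and the plain Euclidean distance on angles (no wrap-around handling) are assumptions of this example rather than the paper's exact (40) and (41).

```python
import numpy as np


def assign(data, centers):
    """Nearest-center labels and squared distances for every data point."""
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), d2.min(axis=1)


def kmeans(data, centers, n_iter=100):
    """Standard K-means iterations starting from the given centers."""
    for _ in range(n_iter):
        labels, _ = assign(data, centers)
        updated = np.array([
            data[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    labels, _ = assign(data, centers)
    return centers, labels


def adaptive_kmeans_pp(data, var_threshold, max_k=10, seed=0):
    """Grow K until every cluster variance is below var_threshold."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    centers = data[rng.integers(len(data))][None, :]
    while True:
        centers, labels = kmeans(data, centers)
        _, d2 = assign(data, centers)
        variances = np.array([
            d2[labels == k].mean() if np.any(labels == k) else 0.0
            for k in range(len(centers))
        ])
        if np.all(variances < var_threshold) or len(centers) >= max_k:
            return centers, labels
        # K-means++ style seeding: draw a number uniformly in [0, sum of D^2]
        # and take the first point whose cumulative D^2 reaches it.
        draw = rng.uniform(0.0, d2.sum())
        index = min(int(np.searchsorted(np.cumsum(d2), draw)), len(data) - 1)
        centers = np.vstack([centers, data[index]])
```

The number of rows in the returned `centers` plays the role of the estimated source number, and each row is an (azimuth, elevation) estimate. Spurious points far from every true direction would inflate a cluster's variance, which is why outlier rejection matters for this scheme.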

## 5. Experimental Results

### 5.1. ES-GCC Time Delay Estimation Performance Evaluation

The performance is measured by the root-mean-square error (RMSE) between the *i*th time delay estimation and the *i*th correct delay sample, which is an integer. Figure 8 shows the RMSE results as a function of SNR for three different TDE algorithms. The total number of estimations is 294. As seen from Figure 8, GCC-PHAT yields better TDE performance than GCC-ML at higher SNR. This is because the experimental environment is reverberant, and GCC-ML suffers significant performance degradation under reverberation.

Compared with GCC-ML, GCC-PHAT is robust to reverberation. However, the GCC-PHAT method neglects the noise effect, and hence it begins to exhibit dramatic performance degradation as the SNR decreases. Unlike GCC-PHAT, GCC-ML does not exhibit this phenomenon, since it has a priori knowledge of the noise power spectra, which helps the estimator cope with distortion. The ES-GCC achieves the best performance because it does not focus on the weighting function process of GCC-based methods; instead, it directly takes the principal component vector as the microphone received signal for further signal processing. The appendix provides the proof that the principal component vector can be considered an approximation of the speech-only signal, which is why the ES-GCC method is robust to the SNR.
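For reference, the GCC-PHAT baseline and the error measure used in this comparison can be sketched as below. This is a generic textbook form of GCC-PHAT and the usual root-mean-square definition of RMSE, not the authors' implementation; the function names and the `max_lag` search window are assumptions.

```python
import numpy as np


def gcc_phat_delay(x_ref, x, max_lag):
    """Delay (in samples) of x relative to x_ref, estimated with GCC-PHAT.

    max_lag must be at least 1; the search covers lags -max_lag..max_lag.
    """
    n = len(x_ref) + len(x)  # zero-pad to avoid circular wrap-around
    spec_ref = np.fft.rfft(x_ref, n=n)
    spec = np.fft.rfft(x, n=n)
    cross = spec * np.conj(spec_ref)
    cross /= np.maximum(np.abs(cross), 1e-12)  # PHAT weighting
    r = np.fft.irfft(cross, n=n)
    # Reorder so that index 0 corresponds to lag -max_lag.
    window = np.concatenate([r[-max_lag:], r[: max_lag + 1]])
    return int(np.argmax(window)) - max_lag


def rmse(estimated, correct):
    """Root-mean-square error between estimated and correct delay samples."""
    diff = np.asarray(estimated, dtype=float) - np.asarray(correct, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

Evaluating `rmse` over many trials at each SNR would give a curve of the kind the text attributes to Figure 8.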

### 5.2. Evaluation of Sound Source Number and Directions Estimation

## 6. Conclusion

This work presents an algorithm for estimating the sound source number and directions. The multiple-source time delay vector combination problem can be solved by the proposed method based on a reasonable sound velocity range. By accumulating the estimated sound source angles, the sound source number and directions can be obtained with the proposed adaptive K-means++ algorithm. The proposed algorithm is evaluated in a real environment, and the experimental results show that it is robust in that environment and can provide reliable information for further robot audition research.

The accuracy of adaptive K-means++ may be influenced by outliers if there is no outlier rejection. Therefore, an outlier rejection method may be incorporated to improve the performance. Moreover, three parameters of the algorithm are determined by our experience. In our experience, one of these parameters influences the results less strongly than the other two. The sensitivity of the results to these parameters is another issue and is left as a further research topic.

## Appendix

The derivation leads to a set of linear equations with *M* unknowns, shown in (A.17), which includes a variance part. To solve these equations, we assume that the variance part can be neglected, which is possible when the eigenvalue is large enough to dominate it. Therefore, we choose the maximum eigenvalue to solve this linear equation. Dividing the first row of (A.17) by the second row relates the first two elements of the eigenvector, and applying the same method to the other rows gives the eigenvector associated with the maximum eigenvalue up to a scalar. If the observation time is sufficiently long, the microphone received signal can be modeled as in (A.23). As can be seen from (A.23), the received speech signal is only a scaled version of the eigenvector corresponding to the maximum eigenvalue. Therefore, we take this eigenvector as the microphone received signal for time delay estimation.

Equation (A.23) is obtained by using the maximum eigenvalue to solve (A.17). If the variance part can also be neglected for other eigenvalues, their eigenvectors also have the speech signal approximation property. This means that if the sound source number is one, the eigenvector of the maximum eigenvalue is the only eigenvector that can represent the received speech signal, since the maximum eigenvalue is the only dominant eigenvalue and the other eigenvectors contain the noise information. If the sound source number is larger than one, the other eigenvectors may contain some speech signal information. However, conversational speech sources are asynchronous and contain many short pauses. Information about some speech sources may not be represented by the principal eigenvector in this frame but may be represented in the next frame. Based on this concept, this paper uses the principal eigenvector for time delay estimation, since it represents the received speech signal best, accumulates the estimated DOA results, and uses adaptive K-means++ to cluster the accumulated results.

The algorithms that use the vectors that lie in the signal subspace are based on a principal components analysis (PCA) of the autocorrelation matrix and are referred to as signal subspace methods [24]. This paper further justifies the use of the principal eigenvector, since it can represent the speech signal better than the other eigenvectors according to (A.17) and (A.23).
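The claim that the principal eigenvector approximates the speech-only component can be checked numerically with a toy single-source model at one frequency bin. The sketch below assumes a received vector equal to a transfer vector times a source coefficient plus white noise; the names `simulate_snapshots` and `principal_eigenvector`, the noise level, and the random transfer vector are hypothetical and only serve the illustration.

```python
import numpy as np


def simulate_snapshots(transfer, n_snapshots, noise_std, rng):
    """Single-source narrowband snapshots: x = transfer * s + noise."""
    n_mics = len(transfer)
    source = (rng.standard_normal(n_snapshots)
              + 1j * rng.standard_normal(n_snapshots)) / np.sqrt(2)
    noise = noise_std * (rng.standard_normal((n_mics, n_snapshots))
                         + 1j * rng.standard_normal((n_mics, n_snapshots))) / np.sqrt(2)
    return np.outer(transfer, source) + noise


def principal_eigenvector(snapshots):
    """Eigenvalues (descending) and the eigenvector of the largest one."""
    corr = snapshots @ snapshots.conj().T / snapshots.shape[1]
    values, vectors = np.linalg.eigh(corr)  # ascending eigenvalue order
    return values[::-1], vectors[:, -1]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    transfer = np.exp(-2j * np.pi * rng.uniform(size=4))  # hypothetical
    snapshots = simulate_snapshots(transfer, 2000, 0.1, rng)
    values, v1 = principal_eigenvector(snapshots)
    alignment = abs(np.vdot(v1, transfer)) / np.linalg.norm(transfer)
    print("eigenvalues:", values)
    print("alignment with transfer vector:", alignment)
```

An alignment close to one, together with a single dominant eigenvalue, corresponds to the one-source case discussed above, where the remaining eigenvectors carry mostly noise.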

## Authors’ Affiliations

## References

- Carter GC, Nuttall AH, Cable PG: The smoothed coherence transform (SCOT). In *Tech. Memo*. Naval Underwater Systems Center, New London Laboratory, New London, Conn, USA; 1972.
- Knapp CH, Carter GC: The generalized correlation method for estimation of time delay. *IEEE Transactions on Acoustics, Speech, and Signal Processing* 1976, 24: 320-327. 10.1109/TASSP.1976.1162830
- Brandstein MS, Silverman HF: A robust method for speech signal time-delay estimation in reverberant rooms. *Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), April 1997, Munich, Germany* 375-378.
- Wang QH, Ivanov T, Aarabi P: Acoustic robot navigation using distributed microphone arrays. *Information Fusion* 2004, 5(2):131-140. 10.1016/j.inffus.2003.10.002
- Scheuing J, Yang B: Correlation-based TDOA-estimation for multiple sources in reverberant environments. In *Speech and Audio Processing in Adverse Environments*. Springer, Berlin, Germany; 2008:381-416.
- Doclo S, Moonen M: Robust adaptive time delay estimation for speaker localization in noisy and reverberant acoustic environments. *EURASIP Journal on Applied Signal Processing* 2003, 2003(11):1110-1124. 10.1155/S111086570330602X
- Balan RV, Rosca J: Apparatus and method for estimating the direction of Arrival of a source signal using a microphone array. European Patent no. US2004013275, 2004.
- Schmidt RO: Multiple emitter location and signal parameter estimation. *IEEE Transactions on Antennas and Propagation* 1986, 34(3):276-280. 10.1109/TAP.1986.1143830
- Wax M, Shan T, Kailath T: Spatio-Temporal spectral analysis by eigenstructure methods. *IEEE Transactions on Acoustics, Speech, and Signal Processing* 1984, 32(4):817-827. 10.1109/TASSP.1984.1164400
- Wang H, Kaveh M: Coherent signal-subspace processing for detection and estimation of angles of arrival of multiple wide-band sources. *IEEE Transactions on Acoustics, Speech, and Signal Processing* 1985, 33(4):823-831. 10.1109/TASSP.1985.1164667
- Hara I, Asano F, Asoh H, Ogata J, Ichimura N, Kawai Y, Kanehiro F, Hirukawa H, Yamamoto K: Robust speech interface based on audio and video information fusion for humanoid HRP-2. *Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '04), October 2004, Sendai, Japan* 2404-2410.
- Walworth M, Mahajan A: 3D Position sensing using the difference in the time-of-flights from a wave source to various receivers. *Proceedings of the International Conference on Advanced Robotics (ICAR '97), July 1997, Monterey, Calif, USA* 611-616.
- Valin J-M, Michaud F, Rouat J, Létourneau D: Robust sound source localization using a microphone array on a mobile robot. *Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, October 2003, Maui, Hawaii, USA* 1228-1233.
- Valin J-M, Michaud F, Rouat J: Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. *Robotics and Autonomous Systems* 2007, 55(3):216-228. 10.1016/j.robot.2006.08.004
- Badali AP, Valin JM, Aarabi P: Evaluating real-time audio localization algorithms for artificial audition on mobile robots. *Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2009, St. Louis, Mo, USA* 2033-2038.
- Yao K, Hudson RE, Reed CW, Chen D, Lorenzelli F: Blind beamforming on a randomly distributed sensor array system. *IEEE Journal on Selected Areas in Communications* 1998, 16(8):1555-1566. 10.1109/49.730461
- Strobel N, Rabenstein R: Classification of time delay estimates for robust speaker localization. *Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '99), March 1999* 6: 3081-3084.
- Potamitis I, Chen H, Tremoulis G: Tracking of multiple moving speakers with multiple microphone arrays. *IEEE Transactions on Speech and Audio Processing* 2004, 12(5):520-529. 10.1109/TSA.2004.833004
- Hu J-S, Cheng C-C, Liu W-H: Robust speaker's location detection in a vehicle environment using GMM models. *IEEE Transactions on Systems, Man, and Cybernetics, Part B* 2006, 36(2):403-412.
- Cantoni A, Butler P: Properties of the eigenvectors of persymmetric matrices with applications to communication theory. *IEEE Transactions on Communications* 1976, 24(8):804-809. 10.1109/TCOM.1976.1093391
- Wax M, Kailath T: Detection of signals by information theoretic criteria. *IEEE Transactions on Acoustics, Speech, and Signal Processing* 1985, 33(2):387-392. 10.1109/TASSP.1985.1164557
- Yamamoto K, Asano F, Van Rooijen WFG, Ling EYL, Yamada T, Kitawaki N: Estimation of the number of sound sources using support vector machine. *Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, April 2003, Hong Kong* 485-488.
- Hu J-S, Yang C-H, Wang C-K: Estimation of sound source number and directions under a multi-source environment. *Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '09), December 2009, St. Louis, Mo, USA* 181-186.
- Hayes MH: *Statistical Digital Signal Processing and Modeling*. John Wiley & Sons, New York, NY, USA; 1996.
- Chen J, Benesty J, Huang Y: Time delay estimation in room acoustic environments: an overview. *EURASIP Journal on Applied Signal Processing* 2006, 2006:-19.
- Hartigan JA, Wong MA: A k-means clustering algorithm. *Applied Statistics* 1979, 28: 100-108. 10.2307/2346830
- Arthur D, Vassilvitskii S: K-means++: the advantages of careful seeding. *Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), 2007, New Orleans, La, USA*
- Bechler D, Kroschel K: Considering the second peak in the GCC function for multi-Source TDOA estimation with a microphone array. *Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC '03), September, 2003, Kyoto, Japan* 315-318.
- Pham T, Sadler BM: Adaptive wideband aeroacoustic array processing. *Proceedings of the IEEE Signal Processing Workshop on Statistical Signal and Array Processing, June 1996, Corfu, Greece* 295-298.

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.