- Research Article
- Open Access
Estimation of Sound Source Number and Directions under a Multisource Reverberant Environment
© Jwu-Sheng Hu and Chia-Hsin Yang. 2010
- Received: 3 December 2009
- Accepted: 27 May 2010
- Published: 22 June 2010
Sound source localization is an important feature in robot audition. This work proposes a method for estimating the number and directions of sound sources in a multisource reverberant environment. An eigenstructure-based generalized cross-correlation method is proposed to estimate time delays among microphones. A source is considered a candidate if the corresponding time delay combination among microphones gives a reasonable sound speed estimate. Under reverberation, some candidates might be spurious, but their direction estimates are not consistent across consecutive data frames. Therefore, an adaptive K-means++ algorithm is proposed to cluster the accumulated results from the sound speed selection mechanism. Experimental results demonstrate the performance of the proposed algorithm in a real room.
- Sound Source
- Signal Subspace
- Microphone Array
- Reverberation Time
- Noise Subspace
Sound source localization is one of the fundamental features of robot audition for human-robot interaction as well as recognition of the environment. The idea of using multiple microphones to localize sound sources has been developed for a long time. Among various kinds of sound localization methods, generalized cross-correlation (GCC) [1–3] has been used for robotic applications [4], but it is not robust in multiple-source environments. Improvements on the performance in multiple-source and reverberant environments have also been discussed [5, 6]. Another approach, proposed by Balan and Rosca [7], explores the eigenstructure of the correlation matrix of the microphone array by separating speech signals and noise signals into two orthogonal subspaces. The direction-of-arrival (DOA) is then estimated by projecting the manifold vectors onto the noise subspace. MUSIC [8, 9] combined with spatial smoothing [10] is one of the most popular methods for eliminating the coherence problem, and it has also been applied to robot audition [11].
Based on the geometrical relationship among time delay values, Walworth and Mahajan [12] proposed a linear-equation formulation for estimating the three-dimensional (3D) position of a wave source. Later, Valin et al. [13] gave a simple solution to the linear equation in [12] based on the far-field assumption and developed a novel weighting-function method to estimate the time delay. In a real environment, the sound source may move. Valin et al. [14] therefore proposed a method for localizing and tracking simultaneous moving sound sources using eight microphones, based on a frequency-domain implementation of a steered beamformer combined with a particle-filter tracking algorithm. In addition, Badali et al. [15] investigated the accuracy of different time-delay-of-arrival estimation implementations for audio localization in the context of artificial audition for robotic systems.
Yao et al. [16] presented an efficient blind beamformer technique to estimate the time delays from the dominant source. This method estimates the relative time delay from the dominant eigenvector of the time-averaged sample correlation matrix. They also formulated a linear source equation, similar to that in [12], to estimate the source location and velocity via the least-squares method. Statistical methods [17–19] have also been proposed to solve the DOA problem in complex environments. These methods yield better performance than conventional DOA methods, especially when the sound source is not within line of sight. However, they require a training procedure to obtain the pattern of sound-wave arrival, which may not be realistic for robot applications when the environment is unknown.
The methods above assume that the number of sound sources is known, but this may not be a realistic assumption because the environment usually contains various kinds of sound sources. Several eigenvalue-based methods have been proposed [20, 21] to estimate the sound source number. However, the eigenvalue distribution is sensitive to noise and reverberation. The work in [22] used a support vector machine (SVM) to classify the distribution with respect to the sound source number. However, it still requires a training stage for a robust result, and binary classification is inadequate when the sound source number is larger than two.
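To illustrate the idea behind eigenvalue-based source counting, and why it is fragile: with more microphones than sources, the spatial correlation matrix ideally has one dominant eigenvalue per source, with the remaining eigenvalues at the noise floor. The Python sketch below is an illustrative simplification (a plain threshold on the eigenvalue gap, not the information-theoretic criteria of [20, 21]); the `noise_factor` threshold is an assumption, and it is precisely this kind of gap that noise and reverberation blur.

```python
import numpy as np

def count_sources(R, noise_factor=10.0):
    """Count eigenvalues of the spatial correlation matrix R that stand
    well above the smallest (noise-floor) eigenvalue."""
    w = np.sort(np.linalg.eigvalsh(R))[::-1]   # eigenvalues, descending
    return int(np.sum(w > noise_factor * w[-1]))

# Toy example: 4 microphones, 2 orthogonal "source" components plus noise.
a = np.ones(4)
b = np.array([1.0, -1.0, 1.0, -1.0])
R = np.outer(a, a) + np.outer(b, b) + 0.01 * np.eye(4)
```

With the clean toy matrix above the count is exactly 2; with realistic reverberant data the gap shrinks and the threshold choice dominates the result.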
The objective of this work is to estimate multiple fixed sound source directions without a priori information about the sound source number or the environment. This work utilizes the time delay information and the microphone array geometry to estimate the sound source directions [23]. A novel eigenstructure-based GCC (ES-GCC) method is proposed to estimate the time delay between two microphones in a multisource environment. A theoretical proof of the ES-GCC method is given, and the experimental results show that it is robust in a noisy environment. As a result, the sound source direction and velocity can be obtained by solving the proposed linear equation model using the time delay information. Fundamentally, the sound source number should be known when estimating the sound source directions. Hence, a method that estimates the sound source number and directions simultaneously using the proposed adaptive K-means++ is introduced, and all the experiments are conducted in a real environment. This paper is organized as follows. In Section 2, we introduce the novel ES-GCC method for time delay estimation. With the time delay estimate, the sound source direction and speed estimation method is presented in Section 3, where the estimation error is also analyzed. In Section 4, we propose the sound speed selection mechanism and the adaptive K-means++ algorithm. Experimental results, presented in Section 5, demonstrate the performance of the proposed algorithm in a real environment. Section 6 concludes the paper.
3.1. Sound Source Location Estimation Using Least-Square Method
3.2. Sound Source Direction Estimation Using Least-Square Method for Far-Field Case
Also, this quantity achieves its maximum value when the source is located along the line that passes through the midpoint of, and is perpendicular to, the segment connecting the i th and the first microphones. This also means that its order of magnitude is less than or equal to that of the corresponding vector.
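Under the far-field (plane-wave) assumption of this section, the direction estimate reduces to a linear least-squares problem in the time delays relative to the reference microphone. The Python sketch below is a minimal illustration under an assumed sign convention (positive delay means later arrival) and a known sound speed; the names and conventions are the sketch's, not the paper's notation.

```python
import numpy as np

def doa_far_field(mic_pos, tdoa, c=343.0):
    """Far-field DOA by least squares. For a plane wave arriving from unit
    direction u, the delay of microphone i relative to microphone 1 is
    tau_i = -(m_i - m_1) . u / c, so u follows from a linear solve."""
    A = mic_pos[1:] - mic_pos[0]             # rows: m_i - m_1
    b = -c * np.asarray(tdoa)
    u, *_ = np.linalg.lstsq(A, b, rcond=None)
    return u / np.linalg.norm(u)             # unit vector toward the source

# Toy array: reference mic at origin plus three mics 10 cm along each axis.
mics = np.array([[0.0, 0, 0], [0.1, 0, 0], [0, 0.1, 0], [0, 0, 0.1]])
u_true = np.array([1.0, 2.0, 3.0]) / np.linalg.norm([1.0, 2.0, 3.0])
delays = [-(m - mics[0]) @ u_true / 343.0 for m in mics[1:]]
```

Feeding the synthetic delays back through the solver recovers `u_true` exactly, since the toy system is square and noise-free; with noisy delays the least-squares solution averages the error across microphone pairs.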
3.3. Estimation Error Analysis
This paper assumes that the distance from the source to the array is much larger than the array aperture, and (29) is used to solve the sound source direction estimation problem. If the number of sound sources is known, the sound source directions can be estimated by substituting the time delay vector of each source into (32). However, if the sound source number is unknown, direction estimation becomes more complicated, since there are several combinations that can form the time delay vectors. This section describes how to estimate the sound source number and directions simultaneously using the methods proposed in Sections 2 and 3.2. A two-step algorithm is proposed to estimate the source number. First, time delay combinations whose estimated sound velocity does not fall within a reasonable range of the true value are filtered out. In a reverberant environment, however, it is still possible for a phantom source to yield a reasonable sound speed estimate. This paper assumes that the power level of a phantom source is much weaker than that of a true source. Therefore, only a true source exhibits a consistent direction estimate over consecutive frames of signals, because the weighting function of ES-GCC also has a certain robustness to reverberation. The second step of source number estimation clusters the accumulated results from the first step, treating reverberation-induced estimates as outliers for the clustering technique. The well-known clustering method, K-means, is sensitive to initial conditions and is not robust to outliers. In addition, K-means requires the cluster number in advance, which cannot be met in our scenario since we have no information about the sound source number. To address these problems of robustness and unknown cluster number, this paper proposes the adaptive K-means++ method, based on the K-means [26] and K-means++ [27] methods, for clustering.
The K-means++ method is a way of initializing K-means by choosing random starting centers with very specific probabilities; it then runs the normal K-means algorithm afterwards. Because the seeding technique of K-means++ can improve both the speed and accuracy of K-means [27], this paper employs it to seed the initial centers for the proposed adaptive K-means++ method.
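The adaptive scheme can be sketched as follows: start with a single cluster, run K-means, and whenever some cluster's variance exceeds a threshold, add one more initial center by K-means++ seeding and rerun. The Python sketch below is a minimal interpretation of that loop; the variance threshold, the cap on K, and all names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def kmeans(points, centers, iters=100):
    """Plain K-means refinement from the given initial centers."""
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

def adaptive_kmeanspp(points, var_threshold, k_max=10, seed=0):
    """Grow K from 1: when any cluster's variance exceeds the threshold,
    add one center by K-means++ seeding and run K-means again."""
    rng = np.random.default_rng(seed)
    centers = points[[rng.integers(len(points))]]   # first center: uniform draw
    for k in range(1, k_max + 1):
        centers, labels = kmeans(points, centers)
        variances = [((points[labels == j] - centers[j]) ** 2).sum(1).mean()
                     if np.any(labels == j) else 0.0 for j in range(k)]
        if max(variances) < var_threshold:
            return k, centers
        # K-means++ seeding: next center drawn with probability proportional
        # to the squared distance to the nearest existing center.
        d2 = ((points[:, None] - centers[None]) ** 2).sum(-1).min(1)
        new_center = points[rng.choice(len(points), p=d2 / d2.sum())]
        centers = np.vstack([centers, new_center])
    return k_max, centers
```

On three well-separated synthetic blobs this loop stops at K = 3, because only then does every cluster's variance drop below the threshold; the threshold plays the same role as the one checked in (40).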
4.1. Rejecting Incorrect Time Delay Combinations Using Acceptable Velocity Range
where the two estimated angles are the azimuth and elevation of the sound source, respectively.
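For reference, given a direction vector toward the source, the azimuth and elevation follow from elementary trigonometry. The conventions below (azimuth measured in the x-y plane from the x-axis, elevation above that plane, both in degrees) are an assumption of this sketch, not necessarily the paper's.

```python
import numpy as np

def direction_angles(u):
    """Convert a direction vector (x, y, z) to (azimuth, elevation) in
    degrees: azimuth in the x-y plane from the x-axis, elevation above it."""
    x, y, z = u
    azimuth = np.degrees(np.arctan2(y, x))
    elevation = np.degrees(np.arcsin(z / np.linalg.norm(u)))
    return azimuth, elevation
```

For example, a source along the diagonal of the x-y plane gives an azimuth of 45 degrees and zero elevation.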
4.2. Proposed Adaptive K-means++ for Sound Source Number and Directions Estimation
where E[·] denotes the expectation operation and the right-hand side is a specified threshold. Equation (40) is used to check the variance of each cluster when the K-means algorithm converges. If the variance of any cluster is not less than the threshold, the value of K is increased by one. An additional initial center is then found using the seeding technique of K-means++ [27] defined in (41), and the K-means algorithm is computed again.
where D(x) represents the distance between a data point x and the nearest center that has already been chosen, and the seeding draw is a real number chosen uniformly at random between 0 and the sum of the squared distances D(x)².
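The uniform draw described above is the standard way to sample a point with probability proportional to D(x)²: draw r uniformly in [0, ΣD(x)²] and take the first point whose cumulative sum of squared distances exceeds r. A minimal Python sketch (names are illustrative):

```python
import numpy as np

def sample_next_center(points, centers, rng):
    """Draw the next K-means++ center: r uniform in [0, sum of D(x)^2],
    then walk the cumulative sums of D(x)^2 until they reach r."""
    d2 = np.min(((points[:, None] - np.asarray(centers)[None]) ** 2).sum(-1), axis=1)
    r = rng.uniform(0.0, d2.sum())
    idx = int(np.searchsorted(np.cumsum(d2), r))
    return points[idx]
```

Points coinciding with an existing center have D(x)² = 0 and can never be drawn, which is exactly the behavior the seeding rule intends.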
Step 1. Calculate the ES-GCC function. Pick the peaks satisfying (34) for each microphone pair and list all the possible time delay vector combinations.
Step 2. Select the time delay vectors using (36) and estimate the corresponding sound source directions using (37).
Step 3. Repeat Steps 1 and 2 a fixed number of times and accumulate the results. Before each repetition, shift the start frame of Step 1 by a fixed number of frames.
Step 4. Cluster the accumulated results using the adaptive K-means++ algorithm; the final cluster number and cluster centers are the sound source number and directions, respectively.
5.1. ES-GCC Time Delay Estimation Performance Evaluation
Compared with GCC-ML, GCC-PHAT is robust with respect to reverberation. However, the GCC-PHAT method neglects the noise effect and hence exhibits dramatic performance degradation as the SNR decreases. Unlike GCC-PHAT, GCC-ML does not exhibit this phenomenon, since its a priori knowledge of the noise power spectra helps the estimator cope with the distortion. ES-GCC achieves the best performance because it does not rely on the weighting-function design of GCC-based methods; instead, it directly takes the principal component vector as the microphone received signal for further processing. The appendix provides the proof that the principal component vector can be considered an approximation of the speech-only signal, which is why the ES-GCC method is robust to low SNR.
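For context, the GCC-PHAT baseline referenced above whitens the cross-spectrum by its magnitude, keeping only phase information before the inverse transform. The following Python sketch is a textbook GCC-PHAT implementation, shown for comparison only; it is not the paper's ES-GCC method.

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the time delay (seconds) of sig relative to ref with
    GCC-PHAT: cross-spectrum divided by its magnitude, then inverse FFT."""
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    S /= np.maximum(np.abs(S), 1e-12)        # PHAT weighting: keep phase only
    cc = np.fft.irfft(S, n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs
```

On clean white noise delayed by a few samples the peak sits exactly at the true lag; it is precisely under low SNR that this estimator degrades, since the whitening amplifies noise-dominated frequency bins.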
5.2. Evaluation of Sound Source Number and Directions Estimation
This work presents a sound source number and directions estimation algorithm. The multiple-source time delay vector combination problem is solved by the proposed method based on a reasonable range of sound velocities. By accumulating the estimated sound source angles, the sound source number and directions are obtained with the proposed adaptive K-means++ algorithm. The algorithm is evaluated in a real environment, and the experimental results show that it is robust in real environments and can provide reliable information for further robot audition research.
The accuracy of adaptive K-means++ may be degraded by outliers if no outlier rejection is performed; an outlier rejection method could therefore be incorporated to improve performance. Moreover, several parameters of the algorithm are determined empirically. In our experience, one of them influences the results far less than the other two. A systematic analysis of the sensitivity of these parameters is left as a topic for further research.
- Carter GC, Nuttall AH, Cable PG: The smoothed coherence transform (SCOT). Tech. Memo, Naval Underwater Systems Center, New London Laboratory, New London, Conn, USA; 1972.
- Knapp CH, Carter GC: The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing 1976, 24: 320-327. 10.1109/TASSP.1976.1162830
- Brandstein MS, Silverman HF: A robust method for speech signal time-delay estimation in reverberant rooms. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), April 1997, Munich, Germany, 375-378.
- Wang QH, Ivanov T, Aarabi P: Acoustic robot navigation using distributed microphone arrays. Information Fusion 2004, 5(2): 131-140. 10.1016/j.inffus.2003.10.002
- Scheuing J, Yang B: Correlation-based TDOA-estimation for multiple sources in reverberant environments. In Speech and Audio Processing in Adverse Environments. Springer, Berlin, Germany; 2008: 381-416.
- Doclo S, Moonen M: Robust adaptive time delay estimation for speaker localization in noisy and reverberant acoustic environments. EURASIP Journal on Applied Signal Processing 2003, 2003(11): 1110-1124. 10.1155/S111086570330602X
- Balan RV, Rosca J: Apparatus and method for estimating the direction of arrival of a source signal using a microphone array. Patent no. US2004013275, 2004.
- Schmidt RO: Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation 1986, 34(3): 276-280. 10.1109/TAP.1986.1143830
- Wax M, Shan T, Kailath T: Spatio-temporal spectral analysis by eigenstructure methods. IEEE Transactions on Acoustics, Speech, and Signal Processing 1984, 32(4): 817-827. 10.1109/TASSP.1984.1164400
- Wang H, Kaveh M: Coherent signal-subspace processing for detection and estimation of angles of arrival of multiple wide-band sources. IEEE Transactions on Acoustics, Speech, and Signal Processing 1985, 33(4): 823-831. 10.1109/TASSP.1985.1164667
- Hara I, Asano F, Asoh H, Ogata J, Ichimura N, Kawai Y, Kanehiro F, Hirukawa H, Yamamoto K: Robust speech interface based on audio and video information fusion for humanoid HRP-2. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '04), October 2004, Sendai, Japan, 2404-2410.
- Walworth M, Mahajan A: 3D position sensing using the difference in the time-of-flights from a wave source to various receivers. Proceedings of the International Conference on Advanced Robotics (ICAR '97), July 1997, Monterey, Calif, USA, 611-616.
- Valin J-M, Michaud F, Rouat J, Létourneau D: Robust sound source localization using a microphone array on a mobile robot. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, October 2003, Maui, Hawaii, USA, 1228-1233.
- Valin J-M, Michaud F, Rouat J: Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. Robotics and Autonomous Systems 2007, 55(3): 216-228. 10.1016/j.robot.2006.08.004
- Badali AP, Valin JM, Aarabi P: Evaluating real-time audio localization algorithms for artificial audition on mobile robots. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2009, St. Louis, Mo, USA, 2033-2038.
- Yao K, Hudson RE, Reed CW, Chen D, Lorenzelli F: Blind beamforming on a randomly distributed sensor array system. IEEE Journal on Selected Areas in Communications 1998, 16(8): 1555-1566. 10.1109/49.730461
- Strobel N, Rabenstein R: Classification of time delay estimates for robust speaker localization. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '99), March 1999, 6: 3081-3084.
- Potamitis I, Chen H, Tremoulis G: Tracking of multiple moving speakers with multiple microphone arrays. IEEE Transactions on Speech and Audio Processing 2004, 12(5): 520-529. 10.1109/TSA.2004.833004
- Hu J-S, Cheng C-C, Liu W-H: Robust speaker's location detection in a vehicle environment using GMM models. IEEE Transactions on Systems, Man, and Cybernetics, Part B 2006, 36(2): 403-412.
- Cantoni A, Butler P: Properties of the eigenvectors of persymmetric matrices with applications to communication theory. IEEE Transactions on Communications 1976, 24(8): 804-809. 10.1109/TCOM.1976.1093391
- Wax M, Kailath T: Detection of signals by information theoretic criteria. IEEE Transactions on Acoustics, Speech, and Signal Processing 1985, 33(2): 387-392. 10.1109/TASSP.1985.1164557
- Yamamoto K, Asano F, Van Rooijen WFG, Ling EYL, Yamada T, Kitawaki N: Estimation of the number of sound sources using support vector machine. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, April 2003, Hong Kong, 485-488.
- Hu J-S, Yang C-H, Wang C-K: Estimation of sound source number and directions under a multi-source environment. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '09), December 2009, St. Louis, Mo, USA, 181-186.
- Hayes MH: Statistical Digital Signal Processing and Modeling. John Wiley & Sons, New York, NY, USA; 1996.
- Chen J, Benesty J, Huang Y: Time delay estimation in room acoustic environments: an overview. EURASIP Journal on Applied Signal Processing 2006, 2006.
- Hartigan JA, Wong MA: A k-means clustering algorithm. Applied Statistics 1979, 28: 100-108. 10.2307/2346830
- Arthur D, Vassilvitskii S: k-means++: the advantages of careful seeding. Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), 2007, New Orleans, La, USA.
- Bechler D, Kroschel K: Considering the second peak in the GCC function for multi-source TDOA estimation with a microphone array. Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC '03), September 2003, Kyoto, Japan, 315-318.
- Pham T, Sadler BM: Adaptive wideband aeroacoustic array processing. Proceedings of the IEEE Signal Processing Workshop on Statistical Signal and Array Processing, June 1996, Corfu, Greece, 295-298.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.