
Estimation of Sound Source Number and Directions under a Multisource Reverberant Environment

Abstract

Sound source localization is an important feature in robot audition. This work proposes a method for estimating the number and directions of sound sources under a multisource reverberant environment. An eigenstructure-based generalized cross-correlation method is proposed to estimate the time delays among microphones. A source is considered a candidate if the corresponding time delay combination among microphones yields a reasonable sound speed estimate. Under reverberation, some candidates may be spurious, but their direction estimates are not consistent over consecutive data frames. Therefore, an adaptive K-means++ algorithm is proposed to cluster the accumulated results from the sound speed selection mechanism. Experimental results demonstrate the performance of the proposed algorithm in a real room.

1. Introduction

Sound source localization is one of the fundamental features of robot audition, for human-robot interaction as well as recognition of the environment. The idea of using multiple microphones to localize sound sources has been developed for a long time. Among the various kinds of sound localization methods, generalized cross-correlation (GCC) [1–3] has been used for robotic applications [4], but it is not robust in multiple-source environments. Improvements to its performance in multiple-source and reverberant environments have also been discussed [5, 6]. Another approach, proposed by Balan and Rosca [7], explores the eigenstructure of the correlation matrix of the microphone array by separating speech signals and noise signals into two orthogonal subspaces. The direction-of-arrival (DOA) is then estimated by projecting the manifold vectors onto the noise subspace. MUSIC [8, 9] combined with spatial smoothing [10] is one of the most popular methods for eliminating the coherence problem, and it has also been applied to robot audition [11].

Based on the geometrical relationship among time delay values, Walworth and Mahajan [12] proposed a linear equation formulation for the estimation of the three-dimensional (3D) position of a wave source. Later, Valin et al. [13] gave a simple solution for the linear equation in [12] based on the far-field assumption and developed a novel weighting function method to estimate the time delay. In a real environment, the sound source may move. Valin et al. [14] proposed a method for localization and tracking of simultaneous moving sound sources using eight microphones; this method is based on a frequency-domain implementation of a steered beamformer along with a particle-filter-based tracking algorithm. In addition, Badali et al. [15] investigated the accuracy of different time-delay-of-arrival audio localization implementations in the context of artificial audition for robotic systems.

Yao et al. [16] presented an efficient blind beamformer technique to estimate the time delays from the dominant source. This method estimates the relative time delays from the dominant eigenvector computed from the time-averaged sample correlation matrix. They also formulated a source linear equation similar to that in [12] to estimate the source location and velocity via the least-squares method. Statistical methods [17–19] have also been proposed to solve the DOA problem in complex environments. These methods yield performance superior to conventional DOA methods, especially when the sound source is not within line-of-sight. However, these methods need a training procedure to obtain the pattern of sound wave arrival, which may not be realistic for robot applications when the environment is unknown.

The methods above assume that the sound source number is known, but this may not be a realistic assumption because the environment usually contains various kinds of sound sources. Several eigenvalue-based methods have been proposed [20, 21] to estimate the sound source number. However, the eigenvalue distribution is sensitive to noise and reverberation. The work in [22] used the support vector machine (SVM) to classify the distribution with respect to the sound source number. However, it still requires a training stage for a robust result, and binary classification is inadequate when the sound source number is larger than two.

The objective of this work is to estimate multiple fixed sound source directions without a priori information about the sound source number and the environment. This work utilizes the time delay information and the microphone array geometry to estimate the sound source directions [23]. A novel eigenstructure-based GCC (ES-GCC) method is proposed to estimate the time delay between two microphones under a multisource environment. A theoretical proof of the ES-GCC method is given, and the experimental results show that it is robust in a noisy environment. As a result, the sound source direction and velocity can be obtained by solving the proposed linear equation model using the time delay information. In principle, the sound source number must be known when estimating the sound source directions. Hence, a method that estimates the sound source number and directions simultaneously using the proposed adaptive K-means++ is introduced, and all the experiments are conducted in a real environment. This paper is organized as follows. In Section 2, we introduce the novel ES-GCC method for time delay estimation. With the time delay estimate, the sound source direction and speed estimation method is presented in Section 3, where the estimation error is also analyzed. In Section 4, we propose the sound speed selection mechanism and the adaptive K-means++ algorithm. Experimental results, presented in Section 5, demonstrate the performance of the proposed algorithm in a real environment. Section 6 concludes the paper.

2. Time Delay Estimation

Consider an array with M microphones in a noisy environment. The received signal of the m-th microphone, which contains D sources, can be described as:

$$x_m(t) = \sum_{d=1}^{D} h_{md}(t) * s_d(t) + n_m(t) \tag{1}$$

where $h_{md}(t)$ is the transfer function from the $d$th sound source to the $m$th microphone, assumed to be time-invariant over the observation period, and $*$ represents the convolution operation. $s_d(t)$ and $n_m(t)$ are the $d$th sound source and the nondirectional noise, respectively. It is assumed that $s_d(t)$ and $n_m(t)$ are mutually uncorrelated and that the sound source signals are mutually independent. Applying the short-time Fourier transform (STFT) to (1), we have

$$X_m(k,l) = \sum_{d=1}^{D} H_{md}(k)\, S_d(k,l) + N_m(k,l) \tag{2}$$

where $k$ is the frequency band, $l$ is the frame number, and $N$ is the STFT size. $X_m(k,l)$, $H_{md}(k)$, $S_d(k,l)$, and $N_m(k,l)$ are the STFT representations of the respective signals. Rewriting (2) in matrix form:

$$\mathbf{X}(k,l) = \mathbf{H}(k)\,\mathbf{S}(k,l) + \mathbf{N}(k,l) \tag{3}$$

where

$$\mathbf{X}(k,l) = \begin{bmatrix} X_1(k,l) \\ \vdots \\ X_M(k,l) \end{bmatrix}\!, \quad \mathbf{H}(k) = \big[H_{md}(k)\big]_{M \times D}, \quad \mathbf{S}(k,l) = \begin{bmatrix} S_1(k,l) \\ \vdots \\ S_D(k,l) \end{bmatrix}\!, \quad \mathbf{N}(k,l) = \begin{bmatrix} N_1(k,l) \\ \vdots \\ N_M(k,l) \end{bmatrix} \tag{4}$$

Suppose the noises are spatially white, so that the noise correlation matrix is the diagonal matrix $\sigma_N^2 \mathbf{I}$. The received signal correlation matrix, estimated over $L$ frames and written with its eigenvalue decomposition (EVD), can then be described as

$$\mathbf{R}_{XX}(k) = \mathbf{H}(k)\,\mathbf{R}_{SS}(k)\,\mathbf{H}^H(k) + \sigma_N^2 \mathbf{I} = \sum_{i=1}^{M} \lambda_i(k)\, \mathbf{e}_i(k)\, \mathbf{e}_i^H(k) \tag{5}$$

where $H$ denotes the conjugate transpose; $\mathbf{R}_{SS}(k)$ is the source correlation matrix; and $\lambda_i(k)$ and $\mathbf{e}_i(k)$ are the eigenvalues and corresponding eigenvectors, with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_M$. The signal-only correlation matrix can be expressed as (6) using the property $\lambda_i = \sigma_N^2$ for $i > D$ (the proof of this property is given in the appendix):

$$\mathbf{H}(k)\,\mathbf{R}_{SS}(k)\,\mathbf{H}^H(k) = \sum_{i=1}^{D} \big(\lambda_i(k) - \sigma_N^2\big)\, \mathbf{e}_i(k)\, \mathbf{e}_i^H(k) \tag{6}$$

The eigenvalues and eigenvectors are divided into two groups. The first group, consisting of the $D$ eigenvectors $\mathbf{e}_1(k)$ to $\mathbf{e}_D(k)$, is referred to as the signal eigenvectors and spans the signal subspace. The second group, consisting of the $M-D$ eigenvectors $\mathbf{e}_{D+1}(k)$ to $\mathbf{e}_M(k)$, is referred to as the noise eigenvectors and spans the noise subspace. The MUSIC algorithm [8, 9] uses the orthogonality of the signal and noise subspaces to estimate the signal directions, relying mainly on the eigenvectors that lie in the noise subspace. Rather than using the noise subspace information, this paper considers the eigenvectors that lie in the signal subspace for time delay estimation (TDE), to minimize the influence of noise. The idea of employing the eigenvectors in the signal subspace also underlies the Blackman-Tukey frequency estimation method [24]. Among the signal eigenvectors, $\mathbf{e}_1(k)$ is the eigenvector associated with the maximum eigenvalue:

$$\lambda_1(k) = \max_{1 \leq i \leq M} \lambda_i(k), \qquad \mathbf{R}_{XX}(k)\,\mathbf{e}_1(k) = \lambda_1(k)\,\mathbf{e}_1(k) \tag{7}$$

This paper chooses the eigenvector $\mathbf{e}_1(k)$ for TDE because it lies in the signal subspace and contributes the most to the construction of the signal-only correlation matrix. We call $\mathbf{e}_1(k)$ the first principal component vector, since it contains the information of the speech sound sources and is robust to the noise. This differs from conventional GCC methods, where various weighting functions are adjusted for different applications. In essence, this paper replaces the microphone-received signal $\mathbf{X}(k,l)$ with $\mathbf{e}_1(k)$ for TDE, since $\mathbf{e}_1(k)$ can be considered an approximation of the signal component of $\mathbf{X}(k,l)$. A detailed explanation is given in the appendix. Hence, the ES-GCC function between the $i$th and $j$th microphones can be represented as

$$R^{ES}_{ij}(\tau) = \sum_{k} \frac{e_{1,i}(k)\, e_{1,j}^{*}(k)}{\big| e_{1,i}(k)\, e_{1,j}^{*}(k) \big|}\, e^{\,j 2\pi k \tau / N} \tag{8}$$

where $e_{1,i}(k)$ denotes the $i$th component of $\mathbf{e}_1(k)$. The weighting function in (8) follows the idea of GCC-PHAT [2]; studies [3, 25] showed it to be more immune to reverberation than other cross-correlation-based methods, though sensitive to noise. By replacing the original signals with the principal component vectors, the robustness to noise can be enhanced. As a result, the time delay sample can be estimated by finding the maximum peak of the ES-GCC function as

$$\hat{\tau}_{ij} = \arg\max_{\tau}\; R^{ES}_{ij}(\tau) \tag{9}$$
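As a concrete illustration, the sketch below implements the ES-GCC idea for a single microphone pair in Python. It is a minimal sketch, not the paper's reference implementation: the function name, STFT parameters, regularization constant, and the simple peak picking are our own illustrative choices.

```python
import numpy as np
from scipy.signal import stft

def es_gcc_tde(signals, fs, nfft=512):
    """Estimate the delay (in samples) between channels 0 and 1 of a
    multichannel recording using the ES-GCC idea of (8)-(9)."""
    # STFT of each channel: X has shape (M, n_bins, n_frames).
    _, _, X = stft(signals, fs=fs, nperseg=nfft, noverlap=nfft // 2)
    n_bins = X.shape[1]

    cross_spec = np.zeros(n_bins, dtype=complex)
    for k in range(n_bins):
        Xk = X[:, k, :]                          # snapshots at frequency bin k
        R = Xk @ Xk.conj().T / Xk.shape[1]       # sample correlation matrix (5)
        _, V = np.linalg.eigh(R)                 # eigenvalues in ascending order
        e1 = V[:, -1]                            # first principal component vector
        g = e1[0] * np.conj(e1[1])               # pairwise term built from e1
        cross_spec[k] = g / (np.abs(g) + 1e-12)  # PHAT-style normalization (8)
    r = np.fft.irfft(cross_spec)                 # ES-GCC function over lags
    r = np.roll(r, len(r) // 2)                  # center the zero lag
    return int(np.argmax(r)) - len(r) // 2       # peak picking (9)
```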

3. Sound Source Localization and Speed Estimation

3.1. Sound Source Location Estimation Using the Least-Squares Method

The sound source location can be estimated from geometrical calculation of the time delays among the microphone array elements. The work in [16] provides a linear equation model for estimating the source location and propagation speed. The following derivations explain the idea. Consider the sound source location vector $\mathbf{x}_s$, the $i$th microphone location $\mathbf{m}_i$, and the relative time delay $t_{i1}$ between the $i$th microphone and the first microphone. The relative time delay satisfies

$$t_{i1} = t_i - t_1 = \frac{\|\mathbf{x}_s - \mathbf{m}_i\| - \|\mathbf{x}_s - \mathbf{m}_1\|}{v} \tag{10}$$

where $t_i$ is the time delay from the sound source to the $i$th microphone and $v$ is the speed of sound. Equation (10) is equivalent to

$$v\,t_{i1} + \|\mathbf{x}_s - \mathbf{m}_1\| = \|\mathbf{x}_s - \mathbf{m}_i\| \tag{11}$$

Squaring both sides, we have

$$v^2 t_{i1}^2 + 2 v\,t_{i1} \|\mathbf{x}_s - \mathbf{m}_1\| + \|\mathbf{x}_s - \mathbf{m}_1\|^2 = \|\mathbf{x}_s - \mathbf{m}_i\|^2 \tag{12}$$

By some algebraic manipulations, (12) becomes

$$2\,(\mathbf{m}_i - \mathbf{m}_1)^T \mathbf{x}_s + 2 v\,t_{i1} \|\mathbf{x}_s - \mathbf{m}_1\| + v^2 t_{i1}^2 = \|\mathbf{m}_i\|^2 - \|\mathbf{m}_1\|^2 \tag{13}$$

Next, define the normalized sound source position vector as

(14)

And define two other variables as

(15)

The linear equation (13), written for all M microphones, can be expressed as

$$\mathbf{A}_g\,\boldsymbol{\theta} = \mathbf{b} \tag{16}$$

where

(17)

For more than five sensors, the least-squares solution of (16) is given by

$$\hat{\boldsymbol{\theta}} = \big(\mathbf{A}_g^T \mathbf{A}_g\big)^{-1} \mathbf{A}_g^T\, \mathbf{b} \tag{18}$$

The estimated sound source location and the speed of sound can then be recovered from $\hat{\boldsymbol{\theta}}$ as

(19)
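The exact contents of (14)–(17) were lost in extraction, so the sketch below solves one standard linearization of (13), with the source position, the product of the speed and the source range, and the squared speed as the unknowns. It is an assumption-laden illustration of the least-squares step (16)–(19), not necessarily the paper's exact parameterization.

```python
import numpy as np

def localize_near_field(mics, tdoa):
    """Least-squares source location and sound speed from relative delays.

    mics : (M, 3) array of microphone positions (cm), M >= 6
    tdoa : (M-1,) relative delays t_i1 = t_i - t_1 (s), for i = 2..M
    Unknowns: theta = [x_s (3,), v*||x_s - m_1||, v^2]; one row of (13)
    per microphone pair.
    """
    m1 = mics[0]
    A, b = [], []
    for i in range(1, len(mics)):
        t = tdoa[i - 1]
        A.append(np.concatenate([2.0 * (mics[i] - m1), [2.0 * t, t * t]]))
        b.append(mics[i] @ mics[i] - m1 @ m1)
    theta, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    position = theta[:3]                    # estimated source location
    speed = np.sqrt(max(theta[4], 0.0))     # v recovered from the v^2 unknown
    return position, speed
```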

3.2. Sound Source Direction Estimation Using the Least-Squares Method for the Far-Field Case

To solve (16), the matrix $\mathbf{A}_g$ must be of full rank. However, the rank condition for $\mathbf{A}_g$ is more complicated, and the matrix can easily become ill-conditioned. For example, if the microphones are distributed on a spherical surface (i.e., $\|\mathbf{m}_i\| = R_m$, where $R_m$ is the radius and the azimuth and elevation angles vary per microphone), it can be verified that the fourth column of $\mathbf{A}_g$ is a linear combination of columns 1, 2, and 3. Secondly, if the aperture of the array is small compared with the source distance (far field), the distance estimate is also sensitive to noise. In the following, a detailed analysis of (13) is presented, which leads to a formulation for the far-field case. Define $\mathbf{u}$ and $\rho$ as

$$\mathbf{u} = \frac{\mathbf{x}_s - \mathbf{m}_1}{\|\mathbf{x}_s - \mathbf{m}_1\|}, \qquad \rho = \frac{\max_i \|\mathbf{m}_i - \mathbf{m}_1\|}{\|\mathbf{x}_s - \mathbf{m}_1\|} \tag{20}$$

$\mathbf{u}$ represents the unit vector in the source direction, and $\rho$ is the ratio of the array size to the distance between the array and the source; that is, for far-field sources, $\rho \to 0$. Substituting (20) into (13), we have

(21)

The term $\|\mathbf{x}_s - \mathbf{m}_i\| - \|\mathbf{x}_s - \mathbf{m}_1\|$ is the distance difference between the sound source and the $i$th and first microphones. Let the distance difference be $d_i$, that is,

$$d_i = \|\mathbf{x}_s - \mathbf{m}_i\| - \|\mathbf{x}_s - \mathbf{m}_1\| = v\,t_{i1} \tag{22}$$

Equation (21) can be rewritten as

$$-(\mathbf{m}_i - \mathbf{m}_1)^T \mathbf{u} + \varepsilon_i = d_i \tag{23}$$

where, with $R_1 = \|\mathbf{x}_s - \mathbf{m}_1\|$,

$$\varepsilon_i = \sqrt{R_1^2 - 2 R_1 (\mathbf{m}_i - \mathbf{m}_1)^T \mathbf{u} + \|\mathbf{m}_i - \mathbf{m}_1\|^2} \;-\; R_1 \;+\; (\mathbf{m}_i - \mathbf{m}_1)^T \mathbf{u} \tag{24}$$

It is straightforward to see that $\varepsilon_i \geq 0$, since

(25)

Also, $\varepsilon_i$ achieves its maximum value when $(\mathbf{m}_i - \mathbf{m}_1)^T \mathbf{u} = 0$ (i.e., when the source is located along the line passing through the midpoint of, and perpendicular to, the segment connecting the $i$th and the first microphones). This also means that $\varepsilon_i$ has an order of magnitude less than or equal to that of the vector $\mathbf{m}_i - \mathbf{m}_1$.

From (23), it is clear that for far-field sources ($\rho \to 0$), the delay relation approaches

$$v\,t_{i1} \approx -(\mathbf{m}_i - \mathbf{m}_1)^T \mathbf{u} \tag{26}$$

Thus, the left-hand side of (23) consists of the far-field term and the near-field influence on the delay relation. We define $1/\rho$ as the field distance ratio and $\varepsilon_i$ as the near-field influence factor, for their roles in sound source localization using a microphone array. Equation (26) can also be derived from a plane-wave assumption. Consider a single incident plane wave and a pair of microphones, as shown in Figure 1; the relative time delay between the two microphones can be described as:

(27)

The parameters can be represented as:

(28)

Equation (26) can be derived by substituting (28) into (27).

Figure 1: Geometry model of a plane wave and two microphones.

For far-field sources ($\rho \to 0$), the overdetermined linear equation system (16) becomes (from (26))

$$\mathbf{A}_f\, \mathbf{w} = \mathbf{b}_f \tag{29}$$

where

$$\mathbf{A}_f = \begin{bmatrix} (\mathbf{m}_2 - \mathbf{m}_1)^T \\ \vdots \\ (\mathbf{m}_M - \mathbf{m}_1)^T \end{bmatrix}, \qquad \mathbf{w} = \frac{\mathbf{u}}{v}, \qquad \mathbf{b}_f = -\begin{bmatrix} t_{21} \\ \vdots \\ t_{M1} \end{bmatrix} \tag{30}$$

The unit vector of the source direction, $\mathbf{u}$, can be estimated using the least-squares method in the same way as (18), and the speed of sound is obtained by

$$\hat{v} = \frac{1}{\|\hat{\mathbf{w}}\|} \tag{31}$$

Then, the sound source direction for the far-field case is given by:

$$\hat{\mathbf{u}} = \hat{v}\,\hat{\mathbf{w}}, \qquad \hat{\theta} = \tan^{-1}\!\frac{\hat{u}_y}{\hat{u}_x}, \qquad \hat{\phi} = \sin^{-1}\hat{u}_z \tag{32}$$
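A minimal sketch of the far-field estimator (29)–(32), assuming the $\mathbf{w} = \mathbf{u}/v$ parameterization reconstructed above; the function and variable names are illustrative.

```python
import numpy as np

def far_field_direction(mics, tdoa):
    """Far-field DOA and sound-speed estimate from relative delays.

    mics : (M, 3) array of microphone positions (cm), M >= 4
    tdoa : (M-1,) relative delays t_i1 (s), for i = 2..M
    Solves A_f w = b_f with w = u / v, then v = 1 / ||w||.
    """
    A = mics[1:] - mics[0]                     # rows (m_i - m_1)^T, per (30)
    b = -np.asarray(tdoa)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)  # least-squares solve as in (18)
    v = 1.0 / np.linalg.norm(w)                # estimated speed of sound (31)
    u = v * w                                  # unit vector toward the source
    azimuth = np.degrees(np.arctan2(u[1], u[0]))
    elevation = np.degrees(np.arcsin(np.clip(u[2], -1.0, 1.0)))
    return azimuth, elevation, v               # angles per (32)
```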

3.3. Estimation Error Analysis

Equation (29) is an approximation that considers the plane wave only. It introduces errors both in the source direction and in the speed of sound. The error in the speed of sound is the more interesting one, as it can reveal relative distance information of the sources to the microphone array. It can be shown that the closer the sound source, the larger the estimate of the speed. To see this, consider the original closed-form relation of (23) by moving the second term on the left-hand side to the right:

(33)

Without loss of generality, assume that $\varepsilon_i > 0$. Since both terms on the right-hand side are nonnegative, (33) shows that if the far-field assumption is utilized (see (26)), the delay would have to be decreased to match the real situation. However, when solving (26), there is no modification of the measured delay. Therefore, one possibility to match the case of augmented delay is to adjust the speed of sound. Another possibility is to change the direction of the source vector $\mathbf{u}$. However, for an array that spans the 3D space, the possibility of adjusting the source direction consistently for all sensor pairs is small, since the least-squares method is applied. For example, changing the direction may work for sensor pair (1, i) but has an adverse effect on sensor pair (1, j) if the two baselines are perpendicular to each other. A simple simulation of the estimation error is illustrated for the microphone locations depicted in Figure 7. We assume that there is no time delay estimation error and that the sound velocity is 34300 cm/sec. The sound source is moved along the direction vector (0.3256, 0.9455, 0) so that only the source distance varies. The estimated sound source direction and velocity are obtained by using (31) and (32). Figure 2 shows the relation between the direction estimation error and the factor $1/\rho$. The direction estimation error is defined as the difference between the real angle and the estimated angle. As can be seen, the estimation error becomes smaller and converges to a small value as $1/\rho$ is increased. In particular, the estimation error does not change dramatically when $1/\rho$ is larger than 5 (the source distance is larger than five times the array size). Figure 3 shows the relation between the estimated velocity and $1/\rho$. The estimated velocity converges to 34300 cm/sec as $1/\rho$ is increased, which is consistent with the analysis at the beginning of this section.
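To make this concrete, the snippet below repeats the spirit of the simulation with a hypothetical cubic eight-microphone array (the actual geometry of Figure 7 is not reproduced here) and exact, noise-free delays, reusing far_field_direction from the sketch in Section 3.2.

```python
import numpy as np

V_TRUE = 34300.0                                  # speed of sound (cm/s)
# Hypothetical 20 cm cube of 8 microphones (not the Figure 7 geometry).
mics = 20.0 * np.array([[x, y, z] for x in (0, 1)
                        for y in (0, 1) for z in (0, 1)], dtype=float)
u_true = np.array([0.3256, 0.9455, 0.0])          # direction vector from the paper
aperture = 20.0 * np.sqrt(3.0)

for ratio in (1, 2, 5, 10, 50):                   # 1/rho = distance / array size
    src = mics[0] + ratio * aperture * u_true
    dists = np.linalg.norm(src - mics, axis=1)
    tdoa = (dists[1:] - dists[0]) / V_TRUE        # exact relative delays per (10)
    az, el, v = far_field_direction(mics, tdoa)
    u_hat = np.array([np.cos(np.radians(el)) * np.cos(np.radians(az)),
                      np.cos(np.radians(el)) * np.sin(np.radians(az)),
                      np.sin(np.radians(el))])
    err = np.degrees(np.arccos(np.clip(u_true @ u_hat, -1.0, 1.0)))
    print(f"1/rho = {ratio:3d}: angle error = {err:6.3f} deg, v_hat = {v:8.1f} cm/s")
```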

Figure 2: Direction estimation error versus the field distance ratio $1/\rho$.

Figure 3: Estimated velocity versus the field distance ratio $1/\rho$.

4. Sound Source Number and Directions Estimation

This paper assumes that the distance from the source to the array is much larger than the array aperture, so (29) is used to solve the sound source direction estimation problem. If the number of sound sources is known, the sound source directions can be estimated by putting the time delay vector of each corresponding sound source into (32). However, if the sound source number is unknown, the direction estimation becomes more complicated, since there are many combinations that can form the time delay vectors. This section describes how to estimate the sound source number and directions simultaneously using the methods proposed in Sections 2 and 3.2. A two-step algorithm is proposed to estimate the source number. First, the combinations of delays whose estimated sound velocity does not fall within a reasonable range of the true value are filtered out. In a reverberant environment, however, it is still possible to have a phantom source that results in a reasonable sound speed estimate. This paper assumes that the power level of a phantom source is much weaker than that of the true source. Therefore, only a true source can exhibit a consistent direction estimate on consecutive frames of signals, because the weighting function of ES-GCC also has a certain robustness to reverberation. The second step of source number estimation is to cluster the accumulated results from the first step; the reverberation can be treated as outliers by the clustering technique. The well-known clustering method, K-means, is sensitive to initial conditions and is not robust to outliers. In addition, the cluster number must be known in advance for K-means, which cannot be met in our scenario, since we have no information about the sound source number. To address the problems of robustness and cluster number, this paper proposes the adaptive K-means++ method based on the K-means [26] and K-means++ [27] methods. The K-means++ method is a way of initializing K-means by choosing random starting centers with very specific probabilities; it then runs the normal K-means algorithm afterwards. Because the seeding technique of K-means++ can improve both the speed and the accuracy of K-means [27], this paper employs it to seed the initial centers for the proposed adaptive K-means++ method.

4.1. Rejecting Incorrect Time Delay Combinations Using Acceptable Velocity Range

In a multiple-sound-source environment, the GCC function should have multiple peaks [28]. Without a priori knowledge of the sound source number, the time delay samples for each microphone pair that meet the constraint below are selected as time delay sample candidates:

$$R^{ES}_{ij}\big(\tau^{(n_i)}_{ij}\big) \geq \gamma\, R^{ES}_{ij}\big(\tau^{(1)}_{ij}\big) \tag{34}$$

where $\gamma$ is a gain factor, and $\tau^{(1)}_{ij}$ and $\tau^{(n_i)}_{ij}$ are the time delay samples corresponding to the largest and the $n_i$th largest peaks in the ES-GCC function $R^{ES}_{ij}(\tau)$. If a pair possesses no time delay sample other than the largest peak that meets the constraint above, $n_i$ is set to one. Hence, there are $\prod_i n_i$ possible combinations forming the possible time delay vectors, and there should be D correct combinations among them. Figure 4 illustrates the procedure of forming the possible time delay vector combinations, where $f_s$ is the sampling rate. The relation between the estimated time delay and the estimated time delay sample is:

$$\hat{t}_{i1} = \frac{\hat{\tau}_{i1}}{f_s} \tag{35}$$

where $\hat{t}_{i1}$ is the estimated time delay between the $i$th microphone and the first microphone and $\hat{\tau}_{i1}$ is the corresponding estimated time delay sample. The next issue is how to choose the correct combinations and determine the sound source number.

Figure 4: Illustration of the procedure of forming possible time delay vector combinations.

To assess whether a delay combination is likely to be a correct one, this work proposes the novel concept of evaluating whether the corresponding sound velocity estimate from (31) is within an acceptable range. In other words, each possible combination is plugged into (31) to compute the sound velocity. It is considered a correct combination if the following criterion is satisfied:

$$\big| \hat{v} - v_0 \big| < \varepsilon_v \tag{36}$$

where $\hat{v}$ is the estimated sound velocity in cm/sec, $v_0$ is the nominal sound velocity, and $\varepsilon_v$ is a threshold representing the acceptable range. Assume that there are $P$ combinations satisfying (36); the corresponding sound source directions can be obtained by

(37)

where $\theta_p$ and $\phi_p$ are the azimuth and elevation angles for the $p$th candidate sound source, respectively.
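A compact sketch of the combination enumeration and the velocity-based selection (34)–(37); the data layout of peak_delays, the default threshold, and the nominal speed constant are illustrative assumptions.

```python
import itertools
import numpy as np

V_NOMINAL = 34300.0   # nominal speed of sound (cm/s)

def candidate_directions(mics, peak_delays, eps_v=5000.0):
    """Keep the delay combinations whose implied sound speed is physical.

    peak_delays : list of M-1 arrays; entry i-2 holds the candidate delays
                  (s) for microphone pair (1, i), picked from the ES-GCC
                  peaks via (34).
    Returns (azimuth, elevation) for every combination passing (36).
    """
    survivors = []
    for combo in itertools.product(*peak_delays):
        az, el, v = far_field_direction(mics, np.array(combo))
        if abs(v - V_NOMINAL) < eps_v:        # velocity selection criterion (36)
            survivors.append((az, el))        # direction estimate, as in (37)
    return survivors
```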

4.2. Proposed Adaptive K-means++ for Sound Source Number and Directions Estimation

For robustness, the final sound source number and directions are determined from the results of $Q$ repetitions of (37). Define all the accumulated angle estimates over the $Q$ repetitions as

$$\Theta = \big\{ (\theta^q_p, \phi^q_p) : p = 1, \ldots, P_q,\; q = 1, \ldots, Q \big\} \tag{38}$$

where $P_q$ represents the number of combinations meeting constraint (36) at the $q$th testing. So far, we have $\sum_q P_q$ data points, and each data point has two features ($\theta$ and $\phi$). Our goal is to divide these data into $K$ clusters based on the two features. A cluster is defined as a set of sound source direction data points. Within a cluster, the data should be similar to one another, meaning that they should come from the same sound source direction. The number $K$ is defined as the sound source number. Therefore, among the set of sound source direction data points, we wish to choose $K$ cluster centers so as to minimize the potential function:

$$\Phi = \sum_{j=1}^{K} \sum_{\mathbf{z} \in C_j} \|\mathbf{z} - \mathbf{c}_j\|^2 \tag{39}$$

where there are $K$ clusters and $\mathbf{c}_j$ is the center of all the points $\mathbf{z} \in C_j$. A sound source direction data point $\mathbf{z}$ is assigned to $C_j$ if $\mathbf{c}_j$ is the closest cluster center to $\mathbf{z}$. Because the sound source number is unknown, we set the cluster number to one and the initial center to the median of the $\theta$ and $\phi$ values as the initial condition to execute K-means. When the K-means algorithm converges, the constraint below is checked:

$$E\big[\|\mathbf{z} - \mathbf{c}_j\|^2\big] < \varepsilon_c, \qquad j = 1, \ldots, K \tag{40}$$

where $E[\cdot]$ is the expectation operation (taken over the points assigned to each cluster) and $\varepsilon_c$ is a specified threshold. Equation (40) checks the variance of each cluster when the K-means algorithm converges. If the variance of any cluster is not less than $\varepsilon_c$, the value of $K$ is increased by one. Another initial center is then found by using the seeding technique of K-means++ [27] defined in (41), and the K-means algorithm is computed again.

Find the integer $m$ such that

$$\sum_{n=1}^{m-1} D(\mathbf{z}_n)^2 \;\leq\; r \;<\; \sum_{n=1}^{m} D(\mathbf{z}_n)^2 \tag{41}$$

where $D(\mathbf{z}_n)$ represents the distance between $\mathbf{z}_n$ and the nearest center we have already chosen, and $r$ is a real number chosen uniformly at random between 0 and $\sum_n D(\mathbf{z}_n)^2$; the point $\mathbf{z}_m$ becomes the new center, so that points far from the existing centers are more likely to be selected.

Otherwise, the final sound source number is $K$ and the sound source directions are

(42)

For the adaptive K-means++ algorithm, the input is the accumulated data set $\Theta$ and the outputs are the source number $K$ and the cluster centers. The flowchart of the adaptive K-means++ algorithm for estimating the sound source number and directions is shown in Figure 5, and the algorithm is summarized as follows.

Figure 5: The flowchart of the adaptive K-means++ algorithm.

Step 1. Calculate the ES-GCC function $R^{ES}_{ij}(\tau)$. Pick the peaks satisfying (34) from $R^{ES}_{ij}(\tau)$ for each microphone pair and list all the possible time delay vector combinations.

Step 2. Select the time delay vectors from the candidates using (36) and estimate the corresponding sound source directions using (37).

Step 3. Repeat Steps 1 to 2 $Q$ times and accumulate the results. Before each repetition, shift the start frame of Step 1 by $L_s$ frames.

Step 4. Cluster the accumulated results using the adaptive K-means++ algorithm; the final cluster number and centers are the sound source number and directions, respectively.
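A minimal sketch of the adaptive K-means++ loop under the notation above; the iteration cap, the maximum cluster count, and the random generator are illustrative choices rather than the paper's settings.

```python
import numpy as np

def adaptive_kmeans_pp(points, eps_c, max_k=10, seed=0):
    """Grow K until every cluster satisfies the variance check (40),
    seeding each new center with the K-means++ D^2 rule (41).

    points : (N, 2) array of (azimuth, elevation) candidates.
    Returns (K, centers).
    """
    rng = np.random.default_rng(seed)
    centers = [np.median(points, axis=0)]          # K = 1, median initial center
    while len(centers) <= max_k:
        for _ in range(100):                       # plain K-means iterations
            d2 = ((points[:, None, :] - np.asarray(centers)[None]) ** 2).sum(-1)
            labels = d2.argmin(axis=1)
            new = [points[labels == j].mean(axis=0) if np.any(labels == j)
                   else centers[j] for j in range(len(centers))]
            if np.allclose(new, centers):
                break
            centers = new
        if all(d2[labels == j, j].mean() < eps_c   # variance check (40)
               for j in range(len(centers)) if np.any(labels == j)):
            break
        dmin2 = d2.min(axis=1)                     # D(z)^2 to the nearest center
        idx = rng.choice(len(points), p=dmin2 / dmin2.sum())
        centers.append(points[idx])                # D^2 seeding (41); K grows by 1
    return len(centers), np.asarray(centers)
```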

5. Experimental Results

The experiments were performed in a real room approximately 10.5 m × 7.2 m in size with a height of 3.6 m; its reverberation time at 1000 Hz is 0.52 second. The reverberation time was measured by playing a 1000 Hz tone and then estimating the time for the sound to decay to 60 dB below the level of the direct sound. An 8-channel digital microphone array platform is installed on the robot used for the experiment, shown in Figure 6, and the microphone positions are marked with circle symbols. The room temperature is approximately C and the sampling rate is 16 kHz. The experimental setup is shown in Figure 7, and the distance from each sound source to the origin is 270 cm. The sound sources are Chinese and English conversational speech by female and male speakers. Each conversational speech source is distinct and spoken by a different person. In Figure 7, the microphone and sound source locations are set to (cm)

(43)

The dehumidifier, which is 430 cm from the first microphone, is turned on during this experiment (Noise 1 in Figure 7). The parameters $\gamma$, $\varepsilon_v$, and $\varepsilon_c$ are determined by our experience and are empirically set to 0.7, 5000, and 23, respectively. The accumulation parameters $Q$ and $L_s$ are set to 20 and 25.

Figure 6: Digital microphone array mounted on the robot.

Figure 7: Arrangement of microphone array and sound sources.

5.1. ES-GCC Time Delay Estimation Performance Evaluation

Two GCC-based TDE algorithms, GCC-PHAT and GCC-ML [2], are computed for comparison with the proposed ES-GCC algorithm. Seven microphone pairs ((1,2), (1,3), (1,4), (1,5), (1,6), (1,7), and (1,8)) and six sound source positions in Figure 7 are selected for this TDE experiment. For each test, only one speech source is active, and all seven microphone pairs are tested. The STFT size is set to 512 with 50% overlap, and mutually independent white Gaussian noise is appropriately scaled and added to each microphone signal to control the signal-to-noise ratio (SNR). The performance index, root mean square error (RMSE), is defined below to evaluate the performance of the suggested method:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N_e} \sum_{i=1}^{N_e} \big( \hat{\tau}_i - \tau_i \big)^2 } \tag{44}$$

where $N_e$ is the total number of estimations, $\hat{\tau}_i$ is the $i$th time delay estimate, and $\tau_i$ is the $i$th correct delay sample, which is an integer. Figure 8 shows the RMSE results as a function of SNR for the three TDE algorithms. The total number of estimations is 294. As seen from Figure 8, GCC-PHAT yields better TDE performance than GCC-ML at higher SNR. This is because the experimental environment is reverberant, and GCC-ML suffers significant performance degradation under reverberation.
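For completeness, the metric of (44) in code form; the function name and array conventions are ours.

```python
import numpy as np

def delay_rmse(estimated, true):
    """Root mean square error of delay-sample estimates, per (44)."""
    estimated = np.asarray(estimated, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.sqrt(np.mean((estimated - true) ** 2)))
```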

Figure 8: TDE RMSE results versus SNR.

Compared to GCC-ML, GCC-PHAT is robust with respect to reverberation. However, the GCC-PHAT method neglects the noise effect, and hence it begins to exhibit dramatic performance degradation as the SNR decreases. Unlike GCC-PHAT, GCC-ML does not exhibit this phenomenon, since its a priori knowledge of the noise power spectra helps the estimator cope with the distortion. The ES-GCC achieves the best performance because it does not rely on the weighting-function stage of GCC-based methods; it directly takes the principal component vector as the microphone-received signal for further signal processing. The appendix provides the proof that the principal component vector can be considered an approximation of the speech-only signal, which is why the ES-GCC method is robust to low SNR.

5.2. Evaluation of Sound Source Number and Directions Estimation

The wideband incoherent MUSIC algorithm [9] with the arithmetic mean is adopted for comparison with the proposed algorithm. Ten major frequencies, ranging from 0.1 kHz to 3.4 kHz, were adopted for the MUSIC algorithm. Outliers were removed from the estimated angles by utilizing the method provided in [29]. In addition, the sound source number must be known first for the MUSIC algorithm to construct the noise projection matrix. Therefore, the eigenvalue-based information theoretic criteria (ITC) method [21] is employed to estimate the sound source number. The sound source number estimation RMSE results are shown in Figure 9, and the averaged SNR is 17.23 dB. The RMSE is defined similarly to (44) with a different measurement unit. The sound source positions are chosen randomly from the six positions shown in Figure 7, and the number of estimations for each condition is 100. Noise 1 in Figure 7 is active in this experiment. As can be seen, the proposed sound source number estimation method yields better performance than the ITC method. One reason is that the eigenvalue distribution is sensitive to reverberation and background noise. When the sound source number is larger than or equal to three, the ITC method often estimates a higher sound source number (5, 6, or 7).

Figure 9: Sound source number estimation result.

The sound source direction estimation RMSE results are shown in Figure 10. For a fair comparison, the RMSE is calculated only when the sound source number estimation is correct. Figure 10 shows that the MUSIC algorithm becomes worse as the sound source number increases, since the MUSIC algorithm is sensitive to coherent signals, especially when the environment contains multiple sound sources and is reverberant. The proposed method uses the sound velocity as the criterion for time delay candidate selection, and the adaptive K-means++ is employed at the final stage to cluster the sound source number and directions. Another advantage of the proposed method is that no a priori knowledge of the sound source number is required; the adaptive K-means++ estimates the sound source number and directions simultaneously. An incorrect sound source number would cause the MUSIC algorithm to perform even worse than shown in Figure 10. In addition, in the multiple-sound-source case, if we take all time delay combinations to estimate the sound source directions without the sound velocity selection mechanism, the results become very poor. We find that wrong combinations of the time delay vector cause the estimated sound speed to range between 9000 and 15000 cm/sec or above 50000 cm/sec.

Figure 10: Sound source directions estimation result.

6. Conclusion

This work presents a sound source number and directions estimation algorithm. The multiple-source time delay vector combination problem is solved by the proposed selection mechanism based on a reasonable sound velocity. By accumulating the estimated sound source angles, the sound source number and directions are obtained by the proposed adaptive K-means++ algorithm. The proposed algorithm is evaluated in a real environment, and the experimental results show that it is robust in a real environment and can provide reliable information for further robot audition research.

The accuracy of the adaptive K-means++ may be influenced by outliers if there is no outlier rejection. Therefore, an outlier rejection method may be incorporated to improve the performance. Moreover, the parameters $\gamma$, $\varepsilon_v$, and $\varepsilon_c$ are determined by our experience; we find that the results are not as sensitive to $\gamma$ as they are to $\varepsilon_v$ and $\varepsilon_c$. The sensitivity of the results to these parameters is another issue, and it is left as a topic for further research.

Appendix

Equation (2) can also be written in square-matrix form:

(A1)

where

(A2)

Suppose that the noises are spatially white, so that the noise correlation matrix is the diagonal matrix $\sigma_N^2 \mathbf{I}$. Therefore, the received signal correlation matrix with EVD can be described as

(A3)

where $\lambda_i$ and $\mathbf{e}_i$ are the eigenvalues and corresponding eigenvectors, with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_M$. Since the eigenvectors are orthogonal to one another, they form a basis and can be used to express an arbitrary vector as follows:

(A4)

Since $\lambda_i > \sigma_N^2$ for $i \leq D$ and $\lambda_i = \sigma_N^2$ for $i > D$, the dot product of the two vectors is

(A5)

Substituting (A.5) into (A.4), we have

(A6)

Therefore, . Because , we have the signal-only correlation matrix:

(A7)

where

(A8)

Applying factorization to , we have

(A9)

where

(A10)

Hence,

(A11)

The two matrices are similar and therefore have the same eigenvalues. Decomposing via EVD, we have

(A12)

where is the eigenvector matrix of defined as

(A13)

Therefore, substituting (A.12) into (A.11), we have the relationship between and :

(A14)

Next, we need to represent it using the above quantities for further processing. The matrix can also be expressed as

(A15)

where is the expectation operation and

(A16)

where is the variance of , is the maximum value between and . From (A.15) and the eigenvalue equation, we have the linear equation in $M$ unknowns shown in (A.17):

(A17)

where is the variance part which is defined as

(A18)

To solve this, we assume that the variance part can be neglected, which is possible when the eigenvalue dominates the variance part. Therefore, we choose the maximum eigenvalue to solve this linear equation. In (A.17), dividing the first row by the second row, we have

(A19)

where denotes . Therefore,

(A20)

By a similar method, the eigenvector associated with the maximum eigenvalue can be obtained:

(A21)

where is a scalar. Hence, the eigenvector can be represented as

(A22)

If the observation time is sufficiently long, then . Therefore, the microphone received signal can be modeled as

(A23)

As can be seen from (A.23), the received speech signal is simply a scaled version of the eigenvector corresponding to the maximum eigenvalue. Therefore, we take this eigenvector as the microphone-received signal for time delay estimation. Equation (A.23) is obtained by using the maximum eigenvalue to solve (A.17). If the variance part can also be neglected for other eigenvalues, the corresponding eigenvectors also have the speech signal approximation property. This means that if the sound source number is one, $\mathbf{e}_1$ is the only eigenvector that can represent the received speech signal, since $\lambda_1$ is the only dominant eigenvalue and the other eigenvectors contain the noise information. If the sound source number is larger than one, the other eigenvectors may also contain some speech signal information. However, the conversational speech sources are asynchronous and contain many short pauses. Some speech source information may not be represented by $\mathbf{e}_1$ in one frame but may be represented in the next frame. Based on this concept, this paper uses the eigenvector $\mathbf{e}_1$ for time delay estimation, since it best represents the received speech signal, accumulates the estimated DOA results, and uses the adaptive K-means++ to cluster the accumulated results. Algorithms that use vectors lying in the signal subspace are based on a principal component analysis (PCA) of the autocorrelation matrix and are referred to as signal subspace methods [24]. This paper further justifies the use of $\mathbf{e}_1$, since by (A.17) and (A.23) it represents the speech signal better than the other eigenvectors.
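Because the symbols in the derivation above were lost in extraction, the following compact restatement, under the notation of Section 2, summarizes the key step the appendix establishes (the exact intermediate quantities in (A.1)–(A.22) may differ):

$$\mathbf{R}_{XX}(k)\,\mathbf{e}_1(k) = \lambda_1(k)\,\mathbf{e}_1(k), \qquad \lambda_1(k) \gg \sigma_N^2 \;\;\Longrightarrow\;\; \mathbf{X}(k,l) \approx \alpha(k,l)\,\mathbf{e}_1(k),$$

that is, the received signal vector is approximately a scalar multiple of the first principal component vector, which is what justifies substituting $\mathbf{e}_1(k)$ for the microphone signals in the ES-GCC function (8).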

References

  1. Carter GC, Nuttall AH, Cable PG: The smoothed coherence transform (SCOT). In Tech. Memo. Naval Underwater Systems Center, New London Laboratory, New London, Conn, USA; 1972.

  2. Knapp CH, Carter GC: The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing 1976, 24: 320-327. 10.1109/TASSP.1976.1162830

  3. Brandstein MS, Silverman HF: A robust method for speech signal time-delay estimation in reverberant rooms. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), April 1997, Munich, Germany 375-378.

  4. Wang QH, Ivanov T, Aarabi P: Acoustic robot navigation using distributed microphone arrays. Information Fusion 2004, 5(2):131-140. 10.1016/j.inffus.2003.10.002

  5. Scheuing J, Yang B: Correlation-based TDOA-estimation for multiple sources in reverberant environments. In Speech and Audio Processing in Adverse Environments. Springer, Berlin, Germany; 2008:381-416.

  6. Doclo S, Moonen M: Robust adaptive time delay estimation for speaker localization in noisy and reverberant acoustic environments. EURASIP Journal on Applied Signal Processing 2003, 2003(11):1110-1124. 10.1155/S111086570330602X

  7. Balan RV, Rosca J: Apparatus and method for estimating the direction of Arrival of a source signal using a microphone array. European Patent no. US2004013275, 2004

  8. Schmidt RO: Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation 1986, 34(3):276-280. 10.1109/TAP.1986.1143830

  9. Wax M, Shan T, Kailath T: Spatio-Temporal spectral analysis by eigenstructure methods. IEEE Transactions on Acoustics, Speech, and Signal Processing 1984, 32(4):817-827. 10.1109/TASSP.1984.1164400

  10. Wang H, Kaveh M: Coherent signal-subspace processing for detection and estimation of angles of arrival of multiple wide-band sources. IEEE Transactions on Acoustics, Speech, and Signal Processing 1985, 33(4):823-831. 10.1109/TASSP.1985.1164667

  11. Hara I, Asano F, Asoh H, Ogata J, Ichimura N, Kawai Y, Kanehiro F, Hirukawa H, Yamamoto K: Robust speech interface based on audio and video information fusion for humanoid HRP-2. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '04), October 2004, Sendai, Japan 2404-2410.

  12. Walworth M, Mahajan A: 3D Position sensing using the difference in the time-of-flights from a wave source to various receivers. Proceedings of the International Conference on Advanced Robotics (ICAR '97), July 1997, Monterey, Calif, USA 611-616.

  13. Valin J-M, Michaud F, Rouat J, Létourneau D: Robust sound source localization using a microphone array on a mobile robot. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, October 2003, Maui, Hawaii, USA 1228-1233.

  14. Valin J-M, Michaud F, Rouat J: Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering. Robotics and Autonomous Systems 2007, 55(3):216-228. 10.1016/j.robot.2006.08.004

  15. Badali AP, Valin JM, Aarabi P: Evaluating real-time audio localization algorithms for artificial audition on mobile robots. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2009, St. Louis, Mo, USA 2033-2038.

  16. Yao K, Hudson RE, Reed CW, Chen D, Lorenzelli F: Blind beamforming on a randomly distributed sensor array system. IEEE Journal on Selected Areas in Communications 1998, 16(8):1555-1566. 10.1109/49.730461

  17. Strobel N, Rabenstein R: Classification of time delay estimates for robust speaker localization. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '99), March 1999 6: 3081-3084.

  18. Potamitis I, Chen H, Tremoulis G: Tracking of multiple moving speakers with multiple microphone arrays. IEEE Transactions on Speech and Audio Processing 2004, 12(5):520-529. 10.1109/TSA.2004.833004

  19. Hu J-S, Cheng C-C, Liu W-H: Robust speaker's location detection in a vehicle environment using GMM models. IEEE Transactions on Systems, Man, and Cybernetics, Part B 2006, 36(2):403-412.

  20. Cantoni A, Butler P: Properties of the eigenvectors of persymmetric matrices with applications to communication theory. IEEE Transactions on Communications 1976, 24(8):804-809. 10.1109/TCOM.1976.1093391

  21. Wax M, Kailath T: Detection of signals by information theoretic criteria. IEEE Transactions on Acoustics, Speech, and Signal Processing 1985, 33(2):387-392. 10.1109/TASSP.1985.1164557

  22. Yamamoto K, Asano F, Van Rooijen WFG, Ling EYL, Yamada T, Kitawaki N: Estimation of the number of sound sources using support vector machine. Proceedings of the IEEE International Conference on Accoustics, Speech, and Signal Processing, April 2003, Hong Kong 485-488.

  23. Hu J-S, Yang C-H, Wang C-K: Estimation of sound source number and directions under a multi-source environment. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '09), December 2009, St. Louis, Mo, USA 181-186.

  24. Hayes MH: Statistical Digital Signal Processing and Modeling. John Wiley & Sons, New York, NY, USA; 1996.

  25. Chen J, Benesty J, Huang Y: Time delay estimation in room acoustic environments: an overview. EURASIP Journal on Applied Signal Processing 2006, 2006: 1-19.

  26. Hartigan JA, Wong MA: A k-means clustering algorithm. Applied Statistics 1979, 28: 100-108. 10.2307/2346830

  27. Arthur D, Vassilvitskii S: K-means++: the advantages of careful seeding. Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), 2007, New Orleans, La, USA

  28. Bechler D, Kroschel K: Considering the second peak in the GCC function for multi-source TDOA estimation with a microphone array. Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC '03), September 2003, Kyoto, Japan 315-318.

  29. Pham T, Sadler BM: Adaptive wideband aeroacoustic array processing. Proceedings of the IEEE Signal Processing Workshop on Statistical Signal and Array Processing, June 1996, Corfu, Greece 295-298.
