Two-stage source tracking method using a multiple linear regression model in the expanded phase domain
© Yang and Kang; licensee Springer. 2012
Received: 31 May 2011
Accepted: 10 January 2012
Published: 10 January 2012
This article proposes an efficient two-channel time delay estimation method for tracking a moving speaker in noisy and reverberant environments. Unlike conventional linear regression model-based methods, the proposed multiple linear regression model, designed in the expanded phase domain, achieves high estimation accuracy in adverse conditions because its Gaussian assumption on the phase distribution remains valid. The least-square-based time delay estimator using the proposed model therefore becomes an ideal estimator that does not require a complicated phase unwrapping process. In addition, the proposed method is extended to a two-stage recursive estimation approach that can be used in a moving source tracking scenario. The performance of the proposed method is compared with that of conventional cross-correlation and linear regression-based methods in noisy and reverberant environments. Experimental results verify that the proposed algorithm significantly decreases estimation anomalies and improves the accuracy of time delay estimation. Finally, the tracking performance of the proposed method for both slow and fast moving speakers is confirmed in adverse environments.
Time delay estimation (TDE) plays a key role in determining the steering capability of a microphone array system, since it provides the direction of the target sound source required for spatial processing. Typical applications of microphone array systems include teleconferencing, automatic speech recognition, speech enhancement, source separation, and automatic auditory systems for robots [1–6].
The problem of estimating the relative time delay between a signal source and a pair of spatially separated microphones has been extensively studied [7–15]. Among TDE methods, the generalized cross-correlation (GCC) method is one of the most widely used because of its simplicity and acceptable performance [7–9]. In the GCC-based method, the time delay is calculated by finding the lag that maximizes the GCC function between the acquired signals. The method has been enhanced by introducing a pre-filter or a weighting function such as the maximum-likelihood (ML) weighting or the phase transform (PHAT). The GCC-ML method, derived under the assumption of ideal single-path propagation, is statistically optimal when the observed sample space is large enough. The GCC-PHAT is recognized as reasonably robust to reverberation, although it is heuristically designed. Zhang et al. verified that the GCC-PHAT can actually be derived from the ML-based algorithm in reverberant environments if the noise level is low. Another technique relies on identifying the minimum of the average magnitude difference function (AMDF) between the two signals; it was recently modified to consider the AMDF jointly with the average magnitude sum function (AMSF) to improve performance in reverberant environments.
An adaptive filter-based algorithm estimates the relative time delay by minimizing the mean-square error between the first channel signal and the filtered second channel signal. An adaptive eigenvalue decomposition algorithm was also proposed to improve TDE performance in reverberant environments: it first identifies the room impulse response (RIR) of each channel and then determines the delay by finding the direct paths in the two measured RIRs. A systematic overview of state-of-the-art TDE techniques is given in the recent literature.
TDE methods using the inter-channel phase difference (IPD) have attracted considerable attention since the 1980s because they produce the result instantaneously [19–23]. Chan et al. verified that a least-square (LS) estimator of the phase slope of the cross power spectrum is equivalent to the ML estimator. They also proved that the IPD error follows a Gaussian probability density function (pdf) if the signal and noise are zero-mean Gaussian processes that are uncorrelated with each other. Raising the issue of coherence between the noises at the two microphones, Piersol provided the relationship between the spatial coherence function and the phase bias at a specific frequency. Brandstein et al. proposed a generalized cost function for the linear regression model of the IPD by adopting a bi-weight function. The method is particularly advantageous in reverberant environments, but it offers no benefit in noisy environments. The performance of these approaches commonly degrades when phase wrapping occurs or the phase is corrupted by adverse environmental effects, because the phase statistics can then no longer be modeled by a simple pdf. Since it is hard to find an ideal estimator for a non-Gaussian data set such as wrapped discrete phases, a phase unwrapping process needs to be included in the TDE method [22, 24, 25]. Tribolet proposed an iterative phase unwrapping algorithm that adaptively integrates the derivative of the phase. Brandstein et al. implemented a linear regression slope-forced unwrapping method that recursively adjusts the estimated wrapping frequency using lower-band phase observations. Since these methods commonly include heuristic parts, their performance varies depending on how they are implemented. Recently, recursive unwrapping methods have been introduced that maximize the a posteriori probability or adopt expectation-maximization (EM) using a probability model of the observed phase data set [26, 27]. In those methods, reliable phase unwrapping is achieved at the expense of a heavy computational burden.
This article proposes a multiple linear regression model-based instantaneous TDE method that uses the expanded IPD of two channel signals. An estimator designed to operate in the original phase domain, $[-\pi, +\pi)$, can hardly be optimal because the phase can wrap depending on the inter-channel distance and the direction of arrival (DOA) angle. To solve this problem, a reasonable statistical model for the distribution of the IPD error and its Gaussian approximation are presented. First, a phase domain expansion method using frequency interpolation and phase shifting is proposed. The conventional linear regression model of the IPD can then be regarded as a multiple linear regression model in the proposed phase expansion framework. When the proposed method is applied to TDE, the ambiguity due to phase wrapping is removed and the LS method results in an optimal estimator. This article also verifies that the proposed estimation method becomes a minimum variance estimator (MVE) in the expanded phase domain. The proposed TDE method is composed of two stages: an LS-based TDE method estimates an initial delay at the first stage, and the estimated delay is passed to the sequential recursive-LS (RLS) estimator. The proposed method is computationally simple since it needs neither a minimum or maximum search stage nor a phase unwrapping process. The proposed algorithm is compared under identical conditions with the optimal GCC methods, a generalized linear regression estimator, and an AMDF method in noisy and reverberant environments. The performance of the candidate estimators is evaluated with detailed assessment items including the percentage of anomalies, the estimation bias for both low and high DOA angles, and the root-mean-squared error (RMSE). Experimental results show that the proposed method is the most robust to outliers and is closer to an unbiased estimator than any of the other methods. In the RMSE assessment in particular, the proposed RLS-TDE shows the best performance in both noisy and reverberant environments. Finally, the superior tracking performance of the proposed algorithm is verified for a moving source in low SNR conditions.
The contents of the article are divided into four parts. Conventional two-channel TDE is explained in Section 2. Section 3 describes the details of the proposed phase expansion method with a multiple linear regression model. The proposed two-step TDE method for a moving speaker is described in Section 4. Finally, various experimental results are given in Section 5.
2. Conventional TDE method
2.1. Input signal model
where $\alpha_k$ and $\beta_k$ are attenuation factors, normally less than one, $\tau_\theta$ is the time difference of arrival (TDOA) between the two input signals, and $\tau_{\alpha,k}$ and $\tau_{\beta,k}$ are time delays caused by reverberation. The first term in each expression of Equation 2 is the direct component from the source to the microphone, while the second term is the reverberant component related to the RIR. Under a far-field source assumption, the propagation time difference between the two microphones for the direction $\theta$ is defined as $\tau_\theta = d \sin(\theta)/c$, where $d$ is the distance between the two microphones and $c$ is the speed of sound in air. This article initially assumes the single-path signal model of Equation 1, which considers only the direct-path signal and the additive noise term, and later extends it to the multi-path case.
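As a quick numerical illustration of the far-field relation $\tau_\theta = d \sin(\theta)/c$, the following minimal sketch converts a DOA angle into a delay. The function name, the 30° example angle, and the speed of sound of 343 m/s are assumptions made for illustration; the 8 cm spacing and the 8 kHz sampling rate are the values used later in the experiments.

```python
import math


def far_field_tdoa(d_m, theta_deg, c_mps=343.0):
    """TDOA in seconds for microphone spacing d_m and DOA angle theta_deg.

    c_mps is an assumed speed of sound; the article does not state a value.
    """
    return d_m * math.sin(math.radians(theta_deg)) / c_mps


# Hypothetical example: 8 cm spacing, source at 30 degrees off broadside.
tau = far_field_tdoa(0.08, 30.0)
tau_samples = tau * 8000.0  # the same delay expressed in samples at 8 kHz
print(f"{tau * 1e6:.1f} us = {tau_samples:.3f} samples")
```

Under these assumptions the delay stays below one sampling period, which is why an estimator with sub-sample resolution is needed for such a closely spaced microphone pair.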
2.2. Linear regression model-based TDE
where $k = 0, 1, \ldots, K - 1$ are the discrete frequency indices, and $\psi(\omega_k)$ is a weight that normalizes the disturbances. Equation 4 becomes the best linear unbiased estimator (BLUE) when $\psi(\omega_k)$ equals the reciprocal of the IPD error variance. Moreover, it becomes a minimum variance unbiased (MVU) estimator if the pdf of the IPD error, $\nu(\omega)$, is Gaussian. The performance of the above LS-TDE for acoustic signals has been statistically analyzed in previous articles under the Gaussian assumption on the IPD error distribution [19, 20]. If phase wrapping is considered, however, the distribution of $\nu(\omega)$ is no longer Gaussian unless an ideal phase unwrapping is performed as a pre-processing step. In general, it is not easy to find the wrapped frequencies and the unwrapped phase values in a noisy environment. In addition, unwrapping the IPD requires time delay information before the TDE processing itself is performed. In the next section, a novel pdf model of the IPD error distribution under noisy conditions is introduced. A phase expansion method with a multiple linear regression model is also proposed, which is more efficient, is generally applicable to IPD-based methodologies, and does not require any complicated phase unwrapping process.
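The effect of wrapping on a conventional LS slope estimator can be seen in a small numerical sketch. It assumes a regression line through the origin, $\phi(\omega_k) = \omega_k \tau + \nu(\omega_k)$, with the weighted LS solution $\hat{\tau} = \sum_k \psi_k \omega_k \phi_k / \sum_k \psi_k \omega_k^2$; since Equations 3 and 4 are not reproduced here, this form, the sign convention, the unit weights, and all function names are assumptions for illustration rather than the article's exact formulation.

```python
import math


def ls_delay(omega, ipd, weight):
    """Weighted LS slope of ipd versus omega for a line through the origin."""
    num = sum(w * om * ph for w, om, ph in zip(weight, omega, ipd))
    den = sum(w * om * om for w, om in zip(weight, omega))
    return num / den


def wrap(phase):
    """Wrap a phase value into [-pi, pi)."""
    return (phase + math.pi) % (2.0 * math.pi) - math.pi


K = 256
omega = [math.pi * k / K for k in range(1, K)]  # rad/sample, DC bin excluded
weight = [1.0] * len(omega)                     # placeholder for psi(omega_k)

# Noiseless IPDs for a delay of 0.5 samples (never wraps) and 1.5 samples
# (wraps in the upper part of the band).
small = ls_delay(omega, [wrap(om * 0.5) for om in omega], weight)
large = ls_delay(omega, [wrap(om * 1.5) for om in omega], weight)
print(small, large)
```

In this sketch the 0.5-sample delay is recovered, whereas the 1.5-sample delay is estimated far below its true value even without noise, because the wrapped observations no longer lie on a single line.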
3. Multiple linear regression model in the expanded phase domain
3.1. Generalized IPD distribution: sum of shifted Gaussian pdfs
Without loss of generality, the multi-path effect caused by reverberation is ignored at first. Then, $\nu(\omega)$ in Equation 3 can be considered as a random variable related to the phase deviations caused by $N_1(\omega)$ and $N_2(\omega)$. If we assume that $S(\omega) = 0$ and that $N_1(\omega)$ and $N_2(\omega)$ are independent zero-mean Gaussian random variables, $\nu(\omega)$ follows a uniform distribution over $[-\pi, +\pi)$, whose variance is $\pi^2/3$. On the other hand, when the signal power is large relative to the noise power, the pdf of $\nu(\omega)$ can be approximated by a zero-mean Gaussian whose variance is expressed in terms of the signal power and the magnitude-squared coherence (MSC) function [19, 26, 31]. These properties are useful for estimating a time delay from the IPD of two channel signals.
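The two limiting cases can be checked with a small Monte Carlo sketch for a single frequency bin. It assumes a true IPD of zero, circular complex Gaussian noise that is independent across the two channels, and a hypothetical 30 dB SNR for the signal-dominated case; the function names and parameter values are illustrative only.

```python
import cmath
import math
import random


def ipd_errors(signal_amp, noise_std, trials=20000, seed=0):
    """Simulate IPD errors for one bin whose true IPD is zero."""
    rng = random.Random(seed)
    s = noise_std * math.sqrt(0.5)  # std of each real/imaginary component
    errors = []
    for _ in range(trials):
        x1 = signal_amp + complex(rng.gauss(0.0, s), rng.gauss(0.0, s))
        x2 = signal_amp + complex(rng.gauss(0.0, s), rng.gauss(0.0, s))
        errors.append(cmath.phase(x1 * x2.conjugate()))  # phase of cross term
    return errors


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


noise_only = variance(ipd_errors(0.0, 1.0))                    # S = 0
high_snr = variance(ipd_errors(1.0, 10.0 ** (-30.0 / 20.0)))   # 30 dB SNR
print(noise_only, math.pi ** 2 / 3.0, high_snr)
```

With no signal, the sample variance should be close to $\pi^2/3$, the variance of a uniform phase; at high SNR the errors concentrate tightly around zero, where a Gaussian approximation is reasonable.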
3.2. Multiple linear regression model in the expanded phase domain
4. Framework of the proposed two-stage method
The multiple linear regression model-based LS method for IPD estimation is proposed in the expanded phase domain, $\Omega_d$. The proposed method is composed of two stages: the multiple linear regression model-based LS-TDE at the first stage, and the RLS-based source tracking method, which uses the delay information estimated at the first stage. After an LS cost function for the TDE method is constructed from the multiple linear regression model, it is verified that the proposed LS method is an ideal estimator that is not constrained by phase wrapping. In the second stage, the RLS-TDE method is proposed, which tracks both fixed and moving sources. The proposed RLS method can be implemented with a simple equation, and it is also appropriate for conversational speech. Finally, a novel two-channel weighting method for noisy and reverberant environments is described.
4.1. First stage: multiple linear regression model-based TDE
4.2. Second stage: RLS for moving speaker tracking
Equation 15 is the same as Equation 10 except for the term δ q, which exponentially decreases the contribution of the past data set. In addition, all of the RLS vectors are re-initialized when a long silence interval occurs in the observation data. Experimental results described in detail later confirm that the RLS-TDE outperforms conventional methods even for a fast moving speech source.
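The idea of exponentially discounting past frames can be sketched with a scalar recursion. This is not the article's Equation 15, which is not reproduced here: it is a generic exponentially weighted LS slope tracker for a line through the origin, and the class name, the forgetting value of 0.8, the unit weights, and the noiseless unwrapped phases are all assumptions for illustration.

```python
import math


class ForgettingLsDelay:
    """Exponentially weighted LS slope tracker (scalar recursion)."""

    def __init__(self, forgetting=0.9):
        self.forgetting = forgetting
        self.reset()

    def reset(self):
        """Clear accumulated statistics, e.g. after a long silence interval."""
        self.num = 0.0
        self.den = 0.0

    def update(self, omega, ipd, weight):
        """Fold in one frame of IPD observations and return the delay."""
        lam = self.forgetting
        self.num = lam * self.num + sum(
            w * om * ph for w, om, ph in zip(weight, omega, ipd))
        self.den = lam * self.den + sum(
            w * om * om for w, om in zip(weight, omega))
        return self.num / self.den


omega = [math.pi * k / 64 for k in range(1, 64)]  # rad/sample
weight = [1.0] * len(omega)
tracker = ForgettingLsDelay(forgetting=0.8)
estimates = []
for frame in range(40):
    true_delay = 0.2 if frame < 20 else 0.6  # source jumps at frame 20
    ipd = [om * true_delay for om in omega]  # noiseless, unwrapped phases
    estimates.append(tracker.update(omega, ipd, weight))
print(estimates[19], estimates[20], estimates[39])
```

A smaller forgetting value follows a moving source more quickly but averages over fewer frames, so the choice trades tracking speed against noise smoothing.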
4.3. Weighting for LS-TDE in noisy and reverberant condition
The proposed LS-TDE in the expanded phase domain given in Equation 10, combined with the weighting function above, satisfies all the ML estimation conditions, namely the Gaussian assumption on the IPD error and weighting by the reciprocal of its variance. The weighting given in Equation 16 is useful when the coherence between the two sensor noises and the target speech signal is negligible. However, it cannot distinguish speech from other signals in a reverberant environment. Piersol examined the spatial coherence between two sensors and demonstrated its effect on TDE with extensive experimental results that are consistent with the theoretical analysis. To design a practical two-channel system for reverberant environments, an alternative method is introduced that suppresses the reverberation effect through signal-to-reverberation ratio (SRR)-based weighting.
The proposed method suppresses the late reverberation well but has no impact on the early reflected component, which is the principal cause of bias in the IPD distribution. The bias caused by early reflections depends entirely on physical conditions, including the shape of the room and the positions of the sensors and the source. Dealing with early reflections blindly remains a challenging research area.
5. Experimental results
Here, the GCC function is formed from the cross spectrum of the two channel signals. The GCC-ML weighting, $\psi_{ML}(\omega_k)$ given in Equation 16, and the phase transform (PHAT) weighting are well-known choices for noisy and reverberant environments, respectively.
where ε is a fixed positive number that prevents overflow in the division. The TDE of the modified AMDF, Equation 24, is determined by jointly considering the AMDF and the AMSF. The three reference TDE estimators all include a maximum (or minimum) search, which requires a large amount of computation, whereas the proposed method estimates the time delay directly with sub-sample precision.
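The search step shared by the reference estimators can be sketched as follows. The code uses a plain (unweighted) time-domain cross-correlation and the basic AMDF, without the ML or PHAT weighting and without the AMSF modification of Equation 24, and it resolves integer lags only; the function names and the white-noise test signal are assumptions for illustration.

```python
import random


def xcorr_lag(x1, x2, max_lag):
    """Integer lag that maximizes the plain cross-correlation."""
    def score(lag):
        return sum(x1[n] * x2[n + lag]
                   for n in range(max_lag, len(x1) - max_lag))
    return max(range(-max_lag, max_lag + 1), key=score)


def amdf_lag(x1, x2, max_lag):
    """Integer lag that minimizes the average magnitude difference."""
    def score(lag):
        return sum(abs(x1[n] - x2[n + lag])
                   for n in range(max_lag, len(x1) - max_lag))
    return min(range(-max_lag, max_lag + 1), key=score)


rng = random.Random(1)
true_lag = 3
source = [rng.gauss(0.0, 1.0) for _ in range(400)]
x1 = source[true_lag:]   # channel 1 hears the source first
x2 = source[:len(x1)]    # channel 2 is delayed by true_lag samples
print(xcorr_lag(x1, x2, 8), amdf_lag(x1, x2, 8))
```

Each candidate lag costs one pass over the frame, and resolving delays finer than one sample would additionally require interpolation or oversampling, which is the computational overhead that a closed-form LS estimate avoids.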
In the experiment, four conversational speech signals from four different speakers, two male and two female, are used. An energy ratio-based voice activity detector (VAD) is designed, and the same voice-active intervals are applied to all SNR conditions. The noise PSD of the cross spectrum, gathered in silence intervals, is used to calculate the weighting term given in Equation 10. It is also used for the GCC-ML to minimize the effect of the weighting. The relative performance of the TDE methods was evaluated through a number of trials in a simulated rectangular room (12 × 10 × 3 m³). The microphone array is located at (3, 3, 2), and the distance from the source to the array is kept at 3 m for both fixed and moving source scenarios. We tested eight locations of the fixed source at intervals of 10° from 0° to 70°. The room environment is artificially generated by the modified frequency domain image source model (ISM) with negative reflection coefficients [28, 29]. The reverberation time, T60, is measured by Lehmann's energy decay curve (EDC). The level of the additive white Gaussian noise (WGN) varies from 5 to 25 dB as the reverberation time is increased from 0 to 500 ms. The sampling frequency is 8000 Hz, a 64 ms Hamming window is applied with 50% overlap, and the microphone spacing is set to 8 cm.
5.1. Fixed source case in noisy and reverberant environments
The trend of the estimation bias is shown in Figure 9, which presents the results for the low DOA angle and high DOA angle cases separately. The phases in the high DOA angle cases are commonly wrapped because wrapping occurs when the DOA angle is larger than 32° under our simulation conditions. All of the tested algorithms are hardly biased when the source is located in front of the dual sensor, as depicted in Figure 9a, because phase wrapping is less likely to occur at a low DOA angle of incidence. As shown in Figure 9b, however, the estimation bias for a signal from a high DOA angle generally increases, since the bias problem becomes more serious as the IPD gets closer to +π (or -π), as described in Figures 6 and 7. The proposed algorithm working in the expanded phase domain, however, does not suffer from this bias, especially in noisy environments.
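The 32° figure can be checked with a short calculation. The sketch below assumes that wrapping first appears at the Nyquist frequency, where the IPD reaches π once the delay equals one sampling period, so that the onset angle satisfies $\sin(\theta) = c/(d f_s)$. The speed of sound of 343 m/s and the function name are assumptions; the 8 cm spacing and the 8000 Hz sampling rate are the simulation settings.

```python
import math


def wrap_onset_angle_deg(d_m, fs_hz, c_mps=343.0):
    """Smallest DOA angle at which the IPD at Nyquist reaches pi.

    Returns None when the spacing is too small for wrapping to occur.
    """
    # At Nyquist, omega = pi * fs rad/s, so omega * tau = pi when tau = 1/fs.
    ratio = c_mps / (d_m * fs_hz)
    if ratio >= 1.0:
        return None
    return math.degrees(math.asin(ratio))


angle = wrap_onset_angle_deg(0.08, 8000.0)
print(f"{angle:.1f} degrees")
```

Under these assumptions the onset angle falls between 32° and 33°, consistent with the value quoted for the simulation, and a larger spacing or a higher sampling rate would lower it.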
The final estimation performance is presented in Figure 10, which depicts the RMSE averaged over all DOA angles. It is confirmed that the proposed method outperforms the conventional ones across all SNR conditions. The proposed LS and the AMDF methods perform better than the GCC-ML and the bi-weight method, although the performance of the bi-weight method and the AMDF method degrades at low SNR. The GCC-PHAT shows the worst performance in noisy environments.
The estimation bias shown in Figure 12 follows a different trend from the result in the noisy environment, in that the bias can occur regardless of the DOA angle. The GCC-PHAT method is the most robust irrespective of the reverberation level, although its performance also degrades slightly in the high DOA angle case. For the methods other than the GCC-PHAT, the estimation bias is larger than in the noisy environment and is strongly affected by the RIR.
Finally, the estimation error excluding anomalies is depicted in Figure 13. The GCC-ML method has a relatively small error at low reverberation, but the error increases dramatically as the reverberation increases. Among the methods that are immune to reverberation, the AMDF method shows the best performance across conditions. As with the previous RMSE results in the noisy environment, the proposed two-step method with the SRR-based weighting gives the most accurate TDE results in the reverberant environment compared with the other methods.
Overall, it is verified that the proposed method shows the highest performance, especially in noisy environments: it has the minimum error, and its estimation anomalies are less than 5% even at low SNR. It is also verified that, unlike the other methods, the proposed multiple linear regression model-based TDE method is not biased by phase wrapping. It also gives the most accurate TDE results in reverberant environments. In the anomaly percentage and bias measurements, the proposed method performs similarly to the AMDF method, which is the best of the reverberation-immune methods.
5.2. Source tracking scenario for slow and fast moving sources
6. Conclusion
An LS-based TDE method using the multiple linear regression model with interpolated phase expansion has been proposed. With the proposed phase expansion method, the IPD distribution between the two channel signals becomes more favorable in terms of its pdf. It was theoretically verified that the Gaussian approximation approaches the actual IPD distribution at higher SNR, and this was also confirmed by various experimental results. The proposed two-stage TDE method shows superior performance, especially in the anomaly percentage and RMSE results, in both noisy and reverberant environments. It was also demonstrated that the bias-to-zero problem for high DOA angles is mitigated by the proposed method. Finally, the superiority of the proposed algorithm in tracking a moving source at low SNR was verified. The proposed method provides an explicit TDE solution that can be applied to real-time applications. Future work involves improving the method in reverberant environments based on a detailed investigation of the IPD statistics under multi-path effects.
Appendix A: Simplifying the IPD pdf
- Nakamura S, Hiyane K, Asano F, Kaneda Y, Yamada T, Kobayashi TN, Saruwatari H: Design and collection of acoustic sound data for hands-free speech recognition and sound scene understanding. In Proceedings of the ICME '02 2002, 2: 161-164.
- Yermeche Z, Grbic N, Claesson I: Beamforming for moving source speech enhancement. In Applications of Signal Processing to Audio and Acoustics 2005, 25-28.
- Gatica-Perez D, Lathoud G, Odobez JM, McCowan I: Audiovisual probabilistic tracking of multiple speakers in meetings. IEEE Trans Audio Speech Language Process 2007, 15: 601-616.
- Wilson K: Speech source separation by combining localization cues with mixture models of speech spectra. In ICASSP-2007 2007, 1: 33-36.
- Talantzis F, Boukis C: The robustness effect of acoustic source localization on blind source separation and deconvolution systems. In Digital Signal Processing, 15th International Conference 2007, 339-342.
- Trifa VM, Koene A, Moren J, Cheng G, Zurich ETH: Real-time acoustic source localization in noisy environments for human-robot multimodal interaction. In The 16th IEEE International Symposium on Robot and Human Interactive Communication 2007, 393-398.
- Roth PR: Effective measurements using digital signal analysis. IEEE Spectrum 1973, 8: 62-70.
- Carter GC: The smoothed coherence transform. Proc IEEE 1973, 61: 1497-1498.
- Knapp C, Carter G: The generalized correlation method for estimation of time delay. IEEE Trans Acoust Speech Signal Process 1976, 24: 320-327. 10.1109/TASSP.1976.1162830
- Carter GC: Time delay estimation for passive sonar signal processing. IEEE Trans Acoust Speech Signal Process 1981, 29: 463-470. 10.1109/TASSP.1981.1163560
- Brandstein MS: A framework for speech source localization using sensor arrays. PhD thesis. Department of Electrical Engineering, Brown University; 1995.
- Brandstein MS: Time-delay estimation of reverberated speech exploiting harmonic structure. J Acoust Soc Am 1999, 105: 2914-2919. 10.1121/1.426904
- Chen J, Benesty J, (Arden) Huang Y: Performance of GCC- and AMDF-based time-delay estimation in practical reverberant environment. EURASIP J Appl Signal Process 2005, 2005: 25-36.
- Chen J, Benesty J, (Arden) Huang Y: Time delay estimation in room acoustic environments: an overview. EURASIP J Appl Signal Process 2006, 2006: 1-19.
- Dvorkind TG, Gannot S: Time difference of arrival estimation of speech source in a noisy and reverberant environment. Signal Process 2005, 85: 177-204. 10.1016/j.sigpro.2004.09.014
- Zhang C, Florencio D, Zhang Z: Why does PHAT work well in low noise, reverberative environments? In ICASSP-2008 2008, 2565-2568.
- Feintuch PL, Bershad NJ, Reed FA: Time delay estimation using the LMS adaptive filter-dynamic behavior. IEEE Trans Acoust Speech Signal Process 1981, 29: 571-576. 10.1109/TASSP.1981.1163608
- (Arden) Huang Y, Benesty J, Elko GW: Adaptive eigen-value decomposition algorithm for real time acoustic source localization system. In ICASSP-1999 1999, 43: 937-940.
- Chan YT, Hattin RV, Plant JB: The least squares estimation of time delay and its use in signal detection. IEEE Trans Acoust Speech Signal Process 1978, 26: 217-222. 10.1109/TASSP.1978.1163078
- Piersol AG: Time delay estimation using phase data. IEEE Trans Acoust Speech Signal Process 1981, 29: 471-477. 10.1109/TASSP.1981.1163555
- Hamon BV, Hannan EJ: Spectral estimation of time delay for dispersive and non-dispersive systems. J R Stat Soc (Appl Stat) 1974, 2: 134-142.
- Brandstein MS, Adcock JE, Silverman HF: A practical time-delay estimator for localizing speech sources with a microphone array. Comput Speech Language 1995, 9: 153-269. 10.1006/csla.1995.0009
- Brandstein MS, Silverman HF: A robust method for speech signal time-delay estimation in reverberant rooms. In ICASSP-1997 1997, 1: 375-378.
- Tribolet JM: A new phase unwrapping algorithm. IEEE Trans Acoust Speech Signal Process 1977, 25: 170-177. 10.1109/TASSP.1977.1162923
- Li D, Levinson SE: A linear phase unwrapping method for binaural sound source localization on a robot. In IEEE International Conference, Robotics, Automation 2002.
- Smaragdis P, Boufounos P: Position and trajectory learning for microphone arrays. IEEE Trans Acoust Speech Signal Process 2007, 15: 358-368.
- Zhang W, Rao BD: A two microphone-based approach for source localization of multiple speech sources. IEEE Trans Audio Speech Language Process 2010, 18: 1913-1928.
- Lehmann EA, Johansson AM: Prediction of energy decay in room impulse responses simulated with an image-source model. J Acoust Soc Am 2008, 124: 269-277. 10.1121/1.2936367
- Allen JB, Berkley DA: Image method for efficiently simulating small room acoustics. J Acoust Soc Am 1979, 65: 943-950. 10.1121/1.382599
- Kay SM: Fundamentals of Statistical Signal Processing: Estimation Theory. Volume I. Prentice Hall PTR, Upper Saddle River; 1993.
- Said A, Kalker A, Schafer RW: Phase-domain statistical analysis for audio source localization. In IEEE 9th Workshop, Multimedia Signal Processing 2007, 94-97.
- Haykin S: Adaptive Filter Theory. Volume 4. Prentice Hall PTR, Upper Saddle River; 2002.
- Carter GC, Knapp CH, Nuttall AH: Estimation of the magnitude-squared coherence function via overlapped fast Fourier transform processing. IEEE Trans Audio Electroacoustics 1973, 21: 337-344. 10.1109/TAU.1973.1162496
- Habets EAP, Gannot S: Dual-microphone speech dereverberation using a reference signal. In ICASSP-2008 2008, 4: 901-904.
- Boll S: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans Acoust Speech Signal Process 1979, 27: 113-120. 10.1109/TASSP.1979.1163209
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.