EURASIP Journal on Applied Signal Processing 2005:1, 25–36
© 2005 Hindawi Publishing Corporation

Performance of GCC- and AMDF-Based Time-Delay Estimation in Practical Reverberant Environments

Recently, there has been increased interest in the use of the time-delay estimation (TDE) technique to locate and track acoustic sources in a reverberant environment. Typically, the delay estimate is obtained by identifying the extremum of the generalized cross-correlation (GCC) function or the average magnitude difference function (AMDF). These estimators are well studied and their statistical performance is well understood for single-path propagation situations. However, fewer efforts have been reported to show their performance behavior in real reverberation conditions. This paper reexamines the GCC- and AMDF-based TDE techniques in real room reverberant and noisy environments. Our contribution is threefold. First, we propose a weighted cross-correlation (WCC) estimator in which the GCC function is weighted by the reciprocal of AMDF. This new method can sharpen the peak of the GCC function that corresponds to the true time delay, and thus leads to a better estimation performance as compared to the conventional GCC estimator. Second, we propose a modified version of the AMDF (MAMDF) estimator in which the delay is determined by jointly considering the AMDF and the average magnitude sum function (AMSF). Third, we compare the performance of the GCC, AMDF, WCC, and MAMDF estimators in real reverberant and noisy environments. It is shown that the AMDF estimator can yield better performance in favorable noise conditions and is slightly more resilient to reverberation than the GCC method. The GCC approach, however, is found to outperform the AMDF method in strongly noisy environments. Weighting the correlation function by the reciprocal of AMDF can improve the performance of the GCC estimator in reverberation conditions, yet its improvement in noisy environments is limited. The MAMDF algorithm can enhance the AMDF estimator in both reverberant and noisy environments.


INTRODUCTION
A microphone array, which consists of a set of microphones that are spatially distributed at known locations with reference to a common point, has the ability to reinforce a desired signal from the look direction while suppressing undesired signals such as noise from other directions. This feature impels the increasing use of microphone arrays in such situations as hands-free speech communications where a system operates under strong noise and reverberation conditions. In the microphone array system, the most crucial issue is to measure the time difference of arrival (TDOA) between two microphone signals since such a time difference often serves as the basis for beamforming and the estimation of direction of arrival (DOA).
Extensive work has been reported for determining the TDOA between two signals [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]. One typical time-delay estimation (TDE) technique is the generalized cross-correlation (GCC) method [1,2,3,4,5,6,7,9,16,17,18,19,20] in which the delay estimate is obtained as the time lag that maximizes the GCC function between two microphone signals. The delay measured in this way is an integral multiple of the sampling period, so the basic time-delay resolution is set by the sampling period. The resolution is not limited to it, however: a finer resolution can be obtained, when necessary, by interpolating between consecutive samples of the GCC function [17,21].
Another widely used traditional TDE technique relies on the identification of the minimum of the average magnitude difference function (AMDF) between two studied signals [7,17]. Similarly, an interpolation may be employed to refine the delay estimate.
Both the GCC and AMDF methods are formulated based on the ideal propagation model where no multipath effect is taken into account. They perform fairly well in single-path propagation situations, but suffer performance degradation when multipath or reverberation effects are present. Recently, several advanced TDE techniques were proposed [22,23,24,25,26], which can better deal with reverberation. However, the GCC and AMDF techniques are still preferred by many engineers and are widely used in various systems because they are computationally efficient and simple to implement.
In single-path propagation situations, the GCC and AMDF algorithms have been extensively investigated and their statistical performance is well understood [8,17]. However, multipath propagation and reverberation effects are more common in practice. Unfortunately, fewer efforts have been reported to show the performance behavior of these algorithms in practical reverberant environments. An early study [18] examined the effects of the simulated room reverberation on the performance of the GCC approach to TDE. It was shown that the performance of this algorithm severely deteriorated as the reverberation time increased. If the acceptable level of the percentage of anomalous estimates is set to be 10%, the maximum likelihood (ML) GCC method cannot be used reliably when the reverberation time is greater than 0.18 seconds, which is quite common in hands-free communication applications.
In this paper, we reexamine the GCC and AMDF algorithms in real room reverberant and noisy environments. We also show that these estimators can be improved by incorporating some other information. Our contribution is threefold. First of all, inspired by the weighted autocorrelation method, which has been recently proposed for pitch tracking [27], we propose a weighted cross-correlation (WCC) estimator in which the GCC function is weighted by the reciprocal of AMDF. This new method can sharpen the peak of the GCC function, which corresponds to the true time delay and thus leads to a better estimation performance as compared to the conventional GCC estimator. Secondly, we propose a modified version of the AMDF (MAMDF) estimator in which the delay is determined by jointly considering the AMDF and the average magnitude sum function (AMSF). We show that the combination of AMDF and AMSF can enhance the performance of the AMDF estimator. Thirdly, the GCC, AMDF, WCC, and MAMDF estimators are evaluated with data collected in the Varechoic Chamber at Bell Laboratories. On one hand, this evaluation will find which algorithm can produce better TDE in practical situations. On the other hand, such a comparative study can offer insight into the range of TDE techniques that can be employed in practical room reverberation conditions. The experimental results justify that proper manipulation of the GCC function and AMDF can make the TDE techniques more robust with respect to reverberation.

Signal model
A widely used signal model for the TDE problem is given by

$$x_1(n) = s(n - t) + w_1(n), \qquad x_2(n) = \alpha s(n - t - \tau) + w_2(n), \tag{1}$$

where $x_m(n)$, $m = 1, 2$, denotes the output signal of the $m$th microphone, $\alpha$ is an attenuation factor due to the propagation effect, $t$ is the propagation time from the unknown source $s(n)$ to Microphone 1, $w_m(n)$ is an additive noise signal at microphone $m$, and the parameter $\tau$ is the true time delay between the two microphones. We assume that $w_m(n)$ is a (real) zero-mean stationary random process which is uncorrelated with both $s(n)$ and the noise signal from the other microphone. It is also assumed that $s(n)$ is reasonably broadband. This model reflects an ideal situation in which the signal propagation from the source to each microphone occurs along a single direct path in a nondispersive medium. The TDE problem is to find an estimate $\hat{\tau}$ of the true delay $\tau$, using a finite set of observation samples of $x_1(n)$ and $x_2(n)$. The signal $x_1(n)$ will also be called the reference signal.
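As an illustration of this single-path model, the following minimal sketch generates a pair of observation signals from a given source. It is not taken from the paper: the function name `simulate_pair` is hypothetical, the common propagation time t is set to zero (only the relative delay matters for TDE), the delay is restricted to an integer number of samples, and both noise signals are assumed white Gaussian with the same standard deviation.

```python
import numpy as np

def simulate_pair(s, tau, alpha=1.0, noise_std=0.0, seed=0):
    """Return (x1, x2) with x1(n) = s(n) + w1(n), x2(n) = alpha*s(n - tau) + w2(n).

    Assumptions (not from the paper): t = 0, integer tau in samples,
    independent white Gaussian noise of equal standard deviation.
    """
    rng = np.random.default_rng(seed)
    s = np.asarray(s, dtype=float)
    n = len(s)
    delayed = np.zeros(n)
    if tau >= 0:
        delayed[tau:] = s[:n - tau]      # x2 lags x1 by tau samples
    else:
        delayed[:n + tau] = s[-tau:]     # negative tau: x2 leads x1
    x1 = s + noise_std * rng.standard_normal(n)
    x2 = alpha * delayed + noise_std * rng.standard_normal(n)
    return x1, x2
```

Samples shifted in from outside the observation window are set to zero, which is adequate when the delay is much shorter than the frame.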

TDE principles
The TDE techniques investigated in this paper are based on searching for the extremum of the GCC or some other statistical cost functions of the observed signals. Particularly, we consider the following estimators.

The generalized cross-correlation estimator
The GCC method, proposed by Knapp and Carter in 1976 [1], is the most popular technique for TDE, in which the time-delay estimate is obtained as follows:

$$\hat{\tau}_{\mathrm{GCC}} = \arg\max_{n} \Psi_{\mathrm{GCC}}(n), \tag{2}$$

where

$$\Psi_{\mathrm{GCC}}(n) = \sum_{k=0}^{N-1} \Phi(k) S_{x_1x_2}(k) e^{j2\pi nk/N} \tag{3}$$

is the GCC function, $S_{x_1x_2}(k) = E\{X_1(k)X_2^*(k)\}$ is the cross spectrum, $E\{\cdot\}$ and $(\cdot)^*$ stand, respectively, for the expectation and complex conjugate operators, $X_m(k)$ is the discrete Fourier transform of the signal $x_m(n)$, $\Phi(k)$ is a weighting function (sometimes called a prefilter), and $N$ denotes the number of observation samples during the observation interval.
The weighting function $\Phi(k)$ plays an important role in controlling the TDE performance. It is chosen according to some criterion. Commonly used weighting functions include unit weighting (the classical cross-correlation method), the smoothed coherence transform (SCOT) [3], the Roth processor [4], the Eckart filter, the phase transform (PHAT), the ML processor [1], the Hassab-Boucher transform [5], and so forth. Some of these are optimal in the sense that the estimation variance can achieve the Cramér-Rao lower bound (CRLB). Others are suboptimal but possess special properties, as for example the PHAT algorithm [1,12], where the weighting function is chosen as $\Phi_{\mathrm{PHAT}}(k) = 1/|S_{x_1x_2}(k)|$. Substituting $\Phi_{\mathrm{PHAT}}(k)$ into (3) and neglecting the noise effects, one can readily derive that the weighted cross spectrum is free from the source signal and depends only on the channel responses. Consequently, the PHAT algorithm performs more consistently than many other GCC members when the characteristics of the source signal vary in time. Hence, this weighting function is adopted in this research.
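The PHAT-weighted GCC function can be sketched for a single frame as follows. This is an illustrative implementation under stated assumptions, not the code used in the paper: the expectation in the cross spectrum is replaced by a one-frame sample estimate, the frames are zero-padded to twice their length to avoid circular wrap-around, the name `gcc_phat` and the regularizer `eps` are hypothetical, and the conjugate is placed on $X_1$ so that a positive lag means that $x_2$ lags $x_1$ (swapping the conjugate only flips the sign of the lag axis).

```python
import numpy as np

def gcc_phat(x1, x2, max_lag, eps=1e-12):
    """Single-frame GCC-PHAT for integer lags -max_lag..max_lag (max_lag >= 1)."""
    n_fft = 2 * len(x1)                  # zero-padding avoids circular wrap-around
    X1 = np.fft.rfft(x1, n_fft)
    X2 = np.fft.rfft(x2, n_fft)
    cross = np.conj(X1) * X2             # sample cross spectrum; peak at +tau if x2 lags x1
    cross /= np.abs(cross) + eps         # PHAT weighting: keep the phase only
    r = np.fft.irfft(cross, n_fft)
    # Reorder so that index 0 corresponds to lag -max_lag.
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, r
```

The integer delay estimate is then `lags[np.argmax(r)]`.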

The AMDF estimator
The AMDF between two studied signals is described by

$$\Psi_{\mathrm{AMDF}}(n) = \frac{1}{N}\sum_{i=0}^{N-1} \left| x_1(i) - x_2(i+n) \right|. \tag{4}$$

The delay estimate, based on AMDF, is given by

$$\hat{\tau}_{\mathrm{AMDF}} = \arg\min_{n} \Psi_{\mathrm{AMDF}}(n). \tag{5}$$

The AMDF approach has been used for TDE and pitch tracking for decades [7,28]. The preference for AMDF over GCC for TDE is mainly due to the following facts. First, the performance of the AMDF estimator in favorable noise conditions is better than that of the GCC method, as reported in [17]. Second, the AMDF technique has a relatively low computational cost since no multiplications are involved in the estimation of AMDF, although the computational burden may not be a big concern with today's computers.
Assuming that the signal $s(n)$ can be modeled as a zero-mean Gaussian process, from the invariance technique [29,30], we can derive the expectation of the AMDF as follows (see the appendix):

$$E\{\Psi_{\mathrm{AMDF}}(n)\} = \sqrt{\frac{2}{\pi}\left[e_{x_1} + e_{x_2} - 2R_{x_1x_2}(n)\right]}, \tag{6}$$

where $e_{x_m} = E\{x_m^2(n)\}$ represents the energy of the observation signal $x_m(n)$, and $R_{x_1x_2}(n) = E\{x_1(i)x_2(i+n)\}$ is the direct cross-correlation function between $x_1(n)$ and $x_2(n)$. Inspection of (6) shows that the magnitude of the principal minimum of the AMDF is essentially influenced by the intensity variation and the background noise of the observation signal. This indicates that the AMDF method may be sensitive to the background noise. As a matter of fact, many reported experiments have confirmed that the AMDF estimator is less robust with respect to noise than the GCC method [31]. Equation (6) also suggests that the performance of AMDF can be affected by the source signal, as in the conventional cross-correlation approach. This problem, however, can be alleviated by prewhitening the observation signal before the estimation of AMDF.
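The AMDF and its Gaussian expectation can be checked numerically with a short sketch. The code below is only an illustration under assumptions: the names `amdf` and `expected_amdf` are hypothetical, the sample AMDF is averaged over the part of the frame for which every lag in the search range is available (rather than over exactly N samples), and the expectation is the closed form given above for zero-mean Gaussian signals.

```python
import numpy as np

def amdf(x1, x2, max_lag):
    """Sample AMDF, mean |x1(i) - x2(i+n)|, for integer lags -max_lag..max_lag."""
    n = len(x1)
    lags = np.arange(-max_lag, max_lag + 1)
    core = x1[max_lag:n - max_lag]       # same x1 samples are used for every lag
    vals = np.array([
        np.mean(np.abs(core - x2[max_lag + k:n - max_lag + k]))
        for k in lags
    ])
    return lags, vals

def expected_amdf(e_x1, e_x2, r_x1x2):
    """Expected AMDF for zero-mean Gaussian signals: sqrt(2/pi * (e1 + e2 - 2R))."""
    return np.sqrt(2.0 / np.pi * (e_x1 + e_x2 - 2.0 * r_x1x2))
```

For a white, unit-variance source observed with noise variance $\sigma_w^2$ at both microphones and $\alpha = 1$, the expected value at the true lag is `expected_amdf(1 + sw2, 1 + sw2, 1.0)`, which depends only on the noise, in line with the sensitivity to background noise discussed above.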

The weighted cross-correlation estimator
The maximum of the GCC function does not necessarily occur at the true time delay, as also pointed out in [6,17]. This is mainly due to the delayed version of the signal containing new samples for different time lags. Figure 1 shows one estimated GCC function between two microphone signals in a moderately reverberant but noise-free condition. As can be seen, the GCC function has two large peaks. One appears at 0.625 milliseconds, which corresponds to the true time delay, and the other appears at 1.125 milliseconds. Unfortunately, the larger of the two is the peak at 1.125 milliseconds, which leads to an estimation failure. In comparison, AMDF generally produces more accurate estimates. However, as mentioned before, the primary disadvantage of the AMDF approach is the lack of robustness with respect to noise. To achieve a good compromise between the robustness of the GCC method and the accuracy of the AMDF approach, we propose a heuristic method that weights the GCC function with the reciprocal of AMDF. This may not be the optimum way to combine the two, but it does improve TDE performance, as shown in Section 3. The resulting estimator is described by

$$\hat{\tau}_{\mathrm{WCC}} = \arg\max_{n} \Psi_{\mathrm{WCC}}(n), \tag{7}$$

where

$$\Psi_{\mathrm{WCC}}(n) = \frac{\Psi_{\mathrm{GCC}}(n)}{\Psi_{\mathrm{AMDF}}(n) + \varepsilon}, \tag{8}$$

and $\varepsilon$ is a small positive number to prevent division overflow. Figure 1 also shows the weighted GCC function (WCCF). In this case, picking the maximum of the WCCF will lead to a correct estimate.
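A compact sketch of the WCC estimator for one frame is given below. It is an illustration rather than the authors' implementation: the name `wcc_delay`, the default value of `eps`, the zero-padded single-frame PHAT estimate, and the lag convention (positive lag when $x_2$ lags $x_1$) are all assumptions made for this example.

```python
import numpy as np

def wcc_delay(x1, x2, max_lag, eps=1e-3):
    """Integer-lag WCC estimate: GCC-PHAT divided by (AMDF + eps), then arg max."""
    n = len(x1)
    lags = np.arange(-max_lag, max_lag + 1)

    # GCC-PHAT (single-frame estimate, zero-padded to avoid wrap-around).
    n_fft = 2 * n
    cross = np.conj(np.fft.rfft(x1, n_fft)) * np.fft.rfft(x2, n_fft)
    r = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n_fft)
    gcc = np.concatenate((r[-max_lag:], r[:max_lag + 1]))

    # AMDF over the same lag range.
    core = x1[max_lag:n - max_lag]
    amdf = np.array([
        np.mean(np.abs(core - x2[max_lag + k:n - max_lag + k])) for k in lags
    ])

    wcc = gcc / (amdf + eps)             # small AMDF values amplify the matching GCC peak
    return lags[np.argmax(wcc)], wcc
```

Because the AMDF is nonnegative, the weighting never changes the sign of the GCC function; it only rescales each lag, boosting lags where the two signals are nearly synchronized.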

The modified AMDF estimator
The AMDF specifies the synchrony between the reference signal and a delayed version of this signal. In the noise-free condition, the AMDF yields its minimum when the two signals are synchronized. Synchrony can also be described by the AMSF defined as follows:

$$\Psi_{\mathrm{AMSF}}(n) = \frac{1}{N}\sum_{i=0}^{N-1} \left| x_1(i) + x_2(i+n) \right|. \tag{9}$$

If both signal and noise are assumed to be uncorrelated Gaussian processes, when two microphone signals are synchronized, the AMDF will reach its minimum while the AMSF will approach its maximum. We can further show that the correlation coefficients between AMDF and AMSF are approximately zero (see the appendix). This suggests that AMDF and AMSF contain complementary information, though both of them measure the same synchrony between two studied signals. Hence, we can expect to improve the TDE performance by combining AMDF and AMSF. The resulting new estimator is called the MAMDF method, described as follows:

$$\hat{\tau}_{\mathrm{MAMDF}} = \arg\min_{n} \Psi_{\mathrm{MAMDF}}(n), \tag{10}$$

where

$$\Psi_{\mathrm{MAMDF}}(n) = \frac{\Psi_{\mathrm{AMDF}}(n)}{\Psi_{\mathrm{AMSF}}(n) + \eta}, \tag{11}$$

and again $\eta$ is a fixed positive number, similar to $\varepsilon$ in (8), to prevent division overflow.
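The MAMDF estimator can be sketched in the same way. The following example is a minimal illustration under assumptions: the name `mamdf_delay` and the default value of `eta` are hypothetical, and the averages are taken over the part of the frame for which every lag in the search range is available.

```python
import numpy as np

def mamdf_delay(x1, x2, max_lag, eta=1e-3):
    """Integer-lag MAMDF estimate: arg min of AMDF / (AMSF + eta)."""
    n = len(x1)
    lags = np.arange(-max_lag, max_lag + 1)
    core = x1[max_lag:n - max_lag]
    cost = np.empty(len(lags))
    for j, k in enumerate(lags):
        seg = x2[max_lag + k:n - max_lag + k]      # x2(i + k) aligned with x1(i)
        amdf = np.mean(np.abs(core - seg))         # small when synchronized
        amsf = np.mean(np.abs(core + seg))         # large when synchronized
        cost[j] = amdf / (amsf + eta)
    return lags[np.argmin(cost)], cost
```

At the true lag the numerator is at its minimum while the denominator is at its maximum, so the valley of the ratio is deeper than that of the AMDF alone.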

Implementation
From (2), (5), (7), and (10), one can readily see that the estimated time delay is an integral multiple of the sampling period. This resolution is usually not sufficient for many microphone array applications. Much effort has been devoted to solving this problem [17,21,32]. Among these, interpolation around the detected peaks of the cost function is the simplest yet most effective way to refine the TDE. Here we employ the 3-point Lagrange's method to improve the resolution such that the estimated time delay can be a fraction of the sampling period. The implementation procedure of the developed estimators is summarized below.
(i) Partition the observation signal sequences $x_1(n)$ and $x_2(n)$ into nonoverlapping frames with a frame width of 128 milliseconds. For all experiments, the microphone signals are digitized with a sampling frequency of 16 kHz. A Hamming window of length 128 milliseconds is applied for a better spectral estimate.
(ii) To reduce the dependence of the TDE on the structure of the source signal, we prewhiten the signals $x_m(n)$ before starting the TDE. The prewhitening process is performed in the frequency domain and the FFT algorithm is used for efficiency.
(iii) Estimate the cost functions, that is, (2), (4), (8), and (11).
(iv) Search for the extremum of the cost function; the corresponding lag time is denoted as $n_{\mathrm{ext}}$.
(v) Interpolate 4 points between $n_{\mathrm{ext}} - 1$ and $n_{\mathrm{ext}}$ and another 4 points between $n_{\mathrm{ext}}$ and $n_{\mathrm{ext}} + 1$, using the 3-point Lagrange's method (the AMDF-based cost functions are squared before interpolating [17]). Then search for the extremum of the interpolated cost function. The corresponding peak (valley) position relative to $n_{\mathrm{ext}}$ is denoted as $\Delta$ ($\Delta$ is negative when the extremum is located to the left of $n_{\mathrm{ext}}$, and positive when it is located to the right of $n_{\mathrm{ext}}$).
(vi) The time-delay estimate is obtained as $\hat{\tau} = n_{\mathrm{ext}} + \Delta$.
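The refinement in steps (iv) to (vi) can be sketched as follows. This is a minimal illustration, assuming that the cost function is stored in an array indexed by integer lag and that the detected extremum is not at either end of the array; the name `refine_extremum` is hypothetical. Squaring the AMDF-based cost functions beforehand, as done in the paper, is left to the caller.

```python
import numpy as np

def refine_extremum(cost, idx, find_max=True, points_between=4):
    """Refine an integer extremum position with 3-point Lagrange interpolation.

    cost: cost function sampled at integer lags; idx: index of its extremum
    (1 <= idx <= len(cost) - 2). Returns the fractional position idx + delta.
    """
    y_m, y_0, y_p = cost[idx - 1], cost[idx], cost[idx + 1]
    step = 1.0 / (points_between + 1)
    # Offsets -0.8, ..., 0.8 for points_between = 4, including 0.
    d = np.arange(-points_between, points_between + 1) * step
    # Lagrange polynomial through (-1, y_m), (0, y_0), (1, y_p).
    p = y_m * d * (d - 1) / 2 + y_0 * (1 - d ** 2) + y_p * d * (d + 1) / 2
    best = np.argmax(p) if find_max else np.argmin(p)
    return idx + d[best]
```

With 4 interpolated points on each side, the delay resolution becomes one fifth of the sampling period, that is, 12.5 microseconds at 16 kHz.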

PERFORMANCE EVALUATION
In general, the performance of the GCC, AMDF, WCC, and MAMDF techniques is affected by the interpolation and finite width of the estimation window. Apart from these systematic factors, the accuracy of the estimates is substantially impaired by noise and reverberation. In this section, we present the results of the experiments to investigate the statistical performance of TDE in real reverberant and noisy environments.
Following [6,18], we distinguish an estimate as either an anomaly or a nonanomaly according to its absolute error. If the absolute error $|\hat{\tau}_i - \tau| > T_c/2$, the estimate is identified as an anomaly; otherwise, it is declared as a nonanomaly, where $\tau$ and $\hat{\tau}_i$ are the true delay and the $i$th delay estimate, respectively, and $T_c$ is the signal correlation time. To compute $T_c$, we divide the source signal into short frames with a frame size of 128 milliseconds. A short-time autocorrelation function is estimated from each frame of data. The long-term average autocorrelation function is then computed as the arithmetic average of the short-time autocorrelation functions. $T_c$ is computed as the 3 dB width of the main lobe of the long-term average autocorrelation function (in our experiment, the calculated $T_c$ is equal to 4.3 samples). We evaluate the TDE performance in terms of the percentage of anomalous estimates over the total estimates, the bias, and the standard deviation of the nonanomalous estimates.
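The three performance measures can be computed from a set of per-frame estimates as in the sketch below. It is an illustration under assumptions: the name `tde_statistics` is hypothetical, the bias is taken as the signed mean error of the nonanomalous estimates, and the standard deviation is the population form (divisor equal to the number of nonanomalous estimates); the paper does not spell out these details.

```python
import numpy as np

def tde_statistics(estimates, true_delay, t_c):
    """Return (percentage of anomalies, bias, standard deviation).

    An estimate is anomalous when |estimate - true_delay| > t_c / 2.
    Bias and standard deviation use the nonanomalous estimates only.
    """
    err = np.asarray(estimates, dtype=float) - true_delay
    anomalous = np.abs(err) > t_c / 2
    good = err[~anomalous]
    pct = 100.0 * np.mean(anomalous)
    bias = np.mean(good) if good.size else np.nan
    std = np.std(good) if good.size else np.nan
    return pct, bias, std
```

For example, with a true delay of 10 samples and $T_c = 4.3$ samples, any estimate outside the interval from 7.85 to 12.15 samples is counted as an anomaly.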

Experimental setup
Experiments were carried out in the Varechoic Chamber, which is a unique facility at Bell Laboratories. The chamber is a 6.7×6.1×2.9 m room whose surfaces are covered by a total of 369 active panels which can be controlled digitally. Each panel consists of two perforated sheets. When the holes in the sheets are aligned, absorbing material behind the sheets will be exposed to the sound field, whereas a highly reflective surface can be formed if the holes are shifted to misalignment. Combinations of open and closed panels can produce $2^{369}$ different acoustic environments where the 60 dB reverberation time $T_{60}$ can change from 0.2 to almost 1 second (refer to [33,34] for more details). A linear microphone array which consists of 22 omnidirectional Panasonic WM-61A microphones was mounted at a distance of 500 mm from the north wall of the chamber and approximately at the center of the wall. The 22 microphones are uniformly distributed along an aluminum rod whose diameter is 1 cm. The spacing between adjacent microphones is 10 cm. The source signal is played by a Cabasse Baltic Murale loudspeaker in 46 different positions. An illustration of this setup is shown in Figure 2.
In order to reduce the reflections from the north wall, the wall behind the array is covered by a 3-inch-thick fiberglass pillow which has a rectangular shape of 3230 × 750 mm. Its lower edge is 90 mm above the floor and its left edge is 1950 mm from the west wall of the chamber. During the experiment, the chamber was not completely empty; objects such as chairs, loudspeakers, and unused equipment were left in the room. Also, the inner door of the room in the east corner of the south wall was kept open during the course of the experiment.
For the purpose of data reusability, the impulse response from each source location to each microphone was measured [34]. The measurement of the impulse responses was performed using the built-in measurement tool of the Huron Lake system [34]. A 65536-point long logarithmic sweep signal digitized at a sampling rate of 48 kHz was used as the excitation signal. From each source location to each microphone, the excitation is played and recorded. An estimate of the transfer function is obtained by a spectral division between the original source excitation and the recorded microphone signal. We show in Figure 3 an impulse response measured from Microphone 22 when 30% of the panels are closed and the loudspeaker is placed at the position S21 (shown in Figure 2). Also shown in Figure 3 is the backward integrated decay curve of the measured impulse response. One can see from this decay curve that the reverberation time T 60 is approximately 0.37 second.
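The backward integrated decay curve and a reverberation-time estimate derived from it can be sketched as follows. This is only an illustration: the function names are hypothetical, and the choice of fitting a straight line to the decay curve between -5 dB and -25 dB and extrapolating to 60 dB is an assumption of this example, not a procedure stated in the paper.

```python
import numpy as np

def backward_integrated_decay_db(h):
    """Backward integration of the squared impulse response, in dB re. total energy."""
    h = np.asarray(h, dtype=float)
    energy = np.cumsum(h[::-1] ** 2)[::-1]          # energy remaining after each sample
    energy = np.maximum(energy, np.finfo(float).tiny)
    return 10.0 * np.log10(energy / energy[0])

def estimate_t60(h, fs, upper_db=-5.0, lower_db=-25.0):
    """Estimate T60 (seconds) from a line fit to the decay curve (assumed fit range)."""
    decay = backward_integrated_decay_db(h)
    idx = np.where((decay <= upper_db) & (decay >= lower_db))[0]
    slope, _ = np.polyfit(idx / fs, decay[idx], 1)  # decay rate in dB per second
    return -60.0 / slope
```

The line fit avoids reading the 60 dB point directly from the curve, where the measurement noise floor usually dominates.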
The observation signal is obtained by convolution of the recorded speech with the measured impulse response, and then adding noise to the results. Two types of noise have been used in the experiments: computer-generated pseudo Gaussian noise and a noise signal recorded in a New York Stock Exchange (NYSE) room. The NYSE noise consists of sounds from various sources such as speakers, telephone rings, electric fans, and so forth. Figure 4 plots the first two seconds of the NYSE noise and its spectrogram, from which we can see the changing characteristics of such noise.
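The construction of an observation signal at a prescribed SNR can be sketched as follows. This is an illustrative example under assumptions: the name `make_observation` is hypothetical, the convolved signal is truncated to the length of the speech, and the SNR is defined as the ratio of the average power of the reverberant speech to that of the scaled noise over the whole signal.

```python
import numpy as np

def make_observation(speech, h, noise, snr_db):
    """Convolve speech with impulse response h and add noise scaled to snr_db."""
    reverberant = np.convolve(speech, h)[:len(speech)]
    noise = np.asarray(noise, dtype=float)[:len(reverberant)]
    signal_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(signal_power / scaled_noise_power) = snr_db.
    gain = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise
```

The noise record must be at least as long as the speech; for long impulse responses, `scipy.signal.fftconvolve` would be a faster drop-in replacement for `np.convolve`.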

Experimental results
As pointed out before, the microphone output signal is computed by convolving a 4-minute speech signal from a female speaker with the corresponding measured impulse response and then adding zero-mean noise to the results for a given signal-to-noise ratio (SNR). This output signal is then segmented into nonoverlapping frames with a frame width of 128 milliseconds. For each frame, a time-delay estimate is obtained by the estimators described in (2), (5), (7), and (10). The array consists of 22 microphones in total, so we have $C_{22}^{2} = 231$ microphone pairs. In our experiment, however, we choose Microphone 1 as a reference and only measure the time delay of each microphone signal relative

TDE performance versus reverberation time
In the first experiment, we analyze the TDE performance versus reverberation time. To do so, we assume that the background noise is white Gaussian noise, and SNR is relatively high, say SNR = 25 dB. The source position varies from S02 to S46, as shown in Figure 2, whereas the microphone pair is a fixed one (Microphones 1 and 3 are used in this experiment). Figures 5, 6, and 7 show, respectively, the average percentage of anomalies, the bias, and the standard deviation of the nonanomalous estimates, all as a function of the reverberation time $T_{60}$.
As expected, the percentage of anomalous estimates of all estimators increases with the reverberation time. As the reverberation time increases, more reflected signals with different delays reach the microphone and, as a result, the number of erroneous peaks (for GCC and WCC) or valleys (for AMDF and MAMDF) of the cost function increases, which leads to more mistakes in extremum searching and eventually to more anomalous time-delay estimates. Compared with the GCC approach, the AMDF estimator exhibits fewer anomalies when the reverberation time increases. This shows the advantage of AMDF for TDE. It is interesting to note from Figure 5 that weighting the GCC function by the reciprocal of AMDF can reduce the probability of anomalous estimates. However, if the acceptable level of anomalous estimates is set to 10%, according to Figure 5, none of the studied methods can be used reliably when $T_{60} > 0.5$ second.
From Figures 6 and 7, it can be seen that both the bias and the standard deviation of the nonanomalous estimates degrade severely as the reverberation time increases. The four estimators exhibit a similar bias and standard deviation in light reverberation conditions. In highly reverberant environments, the GCC estimator has a slightly worse performance.

TDE performance versus SNR
In the above experiment, we show the impact of reverberation time on TDE performance, where a very high SNR is assumed. In a practical situation, however, the TDE has to deal with both reverberation and noise. The second experiment evaluates the TDE performance in both simulated (white Gaussian) and real (NYSE) noisy environments, where we assume a moderate reverberation, say $T_{60} = 0.31$ second. The results in Gaussian noise are presented in Figures 8, 9, and 10, and the results in NYSE noise are portrayed in Figures 11, 12, and 13. We found that the probability of anomalies decreases as SNR increases, and the four TDE methods have a similar percentage of anomalies in both noise conditions. From Figures 9 and 12, we see that in high SNR conditions, the four estimators have almost identical estimation bias. In lower SNR situations, the AMDF estimator has a higher bias. This is inconsistent with the result reported in [17], in which the AMDF is shown to have much lower bias than the GCC method in high SNR conditions and almost the same bias as GCC in strong noise environments. We attribute the difference to three factors. Firstly, the experiment in [17] was performed only in high noise conditions where reverberation is absent. Secondly, the GCC estimator tested in [17] is based on the direct cross-correlation function rather than the GCC function. Finally, the results reported in [17] did not distinguish between anomalous and nonanomalous estimates.
In Figures 10 and 13, it can be seen that in high SNR conditions, the GCC method has a slightly higher standard deviation than the other three estimators. When SNR becomes lower, however, the GCC estimator shows a smaller standard deviation, indicating the robustness of the GCC method with respect to noise. Weighting the AMDF with the reciprocal of the AMSF can enhance the performance of the AMDF estimator. However, the performance of the WCC approach in noisy conditions is basically a tradeoff between the AMDF and GCC methods.

Other experiments
Additional experiments were performed, including changing the source locations and using different microphone pairs. Figures 14, 15, and 16 plot the statistical performance as a function of the loudspeaker position shown in Figure 2. One can see that the percentage of anomalies does not vary much when the source is moved from one position to another. However, the bias and standard deviation fluctuate considerably as the source location varies. According to our experience, though in general the change of the reverberation time $T_{60}$ is negligible, the echo structure varies appreciably as the source position moves. This eventually leads to fluctuation of the bias and standard deviation of the time-delay estimate. It is interesting to note that the biases of the four investigated estimators are almost identical, whereas the percentage of anomalies and the standard deviation of nonanomalous estimates of the GCC method are slightly higher than those of the AMDF-based methods. This is consistent with the observation from the previous experiments since this experiment is carried out in very high SNR and moderate reverberation conditions.
We also varied the microphone pairs while keeping the source position fixed. Quite similar qualitative behavior as above was observed.

CONCLUSION
This paper addressed the TDE problem in real reverberant and noisy environments. We have proposed two new time-delay estimators. One is the weighted cross-correlation method in which the GCC function is weighted by the reciprocal of AMDF. This weighting process can sharpen the desired peak and suppress the other peaks in the GCC function, hence leading to more accurate time-delay estimates. The other proposed estimator is the modified version of the AMDF method in which the AMDF is weighted by the inverse AMSF, another function that can measure the synchrony between two signals. This approach is seen to exhibit a superior performance to the AMDF method in both high reverberation and high noise conditions.
We have evaluated the GCC, WCC, AMDF, and MAMDF approaches in both room reverberant and noisy environments. In general, it is observed that the cross-correlation-based method exhibits a slightly higher percentage of anomalous estimates than the AMDF-based estimator in favorable noise conditions. However, the GCC-based approaches are more resilient to strong noise.

APPENDIX
The expectation of the AMDF defined in (4) can be written as follows:

$$E\{\Psi_{\mathrm{AMDF}}(n)\} = \frac{1}{N}\sum_{i=0}^{N-1} E\{|x_1(i) - x_2(i+n)|\}. \tag{A.1}$$

Assuming that both signal and noise can be modeled as zero-mean Gaussian processes, we know from [29] that

$$E\{|x_1(i) - x_2(i+n)|\} = \sqrt{\frac{2}{\pi} E\{[x_1(i) - x_2(i+n)]^2\}}. \tag{A.2}$$

Then it is trivial to derive

$$E\{\Psi_{\mathrm{AMDF}}(n)\} = \sqrt{\frac{2}{\pi}\left[e_{x_1} + e_{x_2} - 2R_{x_1x_2}(n)\right]}, \tag{A.3}$$

where $e_{x_1} = E\{x_1^2(n)\}$ and $e_{x_2} = E\{x_2^2(n)\}$ represent the energies of the signals $x_1(n)$ and $x_2(n)$, and $R_{x_1x_2}(n) = E\{x_1(i)x_2(i+n)\}$ is the direct cross-correlation function. Similarly, one can derive the expectation of the AMSF as follows:

$$E\{\Psi_{\mathrm{AMSF}}(n)\} = \sqrt{\frac{2}{\pi}\left[e_{x_1} + e_{x_2} + 2R_{x_1x_2}(n)\right]}. \tag{A.4}$$

The covariance between AMDF and AMSF is written as follows:

$$\mathrm{cov}\{\Psi_{\mathrm{AMDF}}(n), \Psi_{\mathrm{AMSF}}(n)\} = E\{\Psi_{\mathrm{AMDF}}(n)\Psi_{\mathrm{AMSF}}(n)\} - E\{\Psi_{\mathrm{AMDF}}(n)\}E\{\Psi_{\mathrm{AMSF}}(n)\}. \tag{A.5}$$

For two Gaussian random variables $\theta$ and $\vartheta$, it can be derived from [29] that (A.6) Therefore, the covariance between AMDF and AMSF can be expressed as follows:

$$\mathrm{cov}\{\Psi_{\mathrm{AMDF}}(n), \Psi_{\mathrm{AMSF}}(n)\} = \xi_1(n) + \xi_2(n) - \xi_3(n), \tag{A.7}$$

where $\xi_1(n)$, $\xi_2(n)$, and $\xi_3(n)$ are given by (A.8). A similar derivation can be used to obtain the variances of $\Psi_{\mathrm{AMDF}}(n)$ and $\Psi_{\mathrm{AMSF}}(n)$. We can then calculate the correlation coefficient $\rho(n)$ defined as

$$\rho(n) = \frac{\mathrm{cov}\{\Psi_{\mathrm{AMDF}}(n), \Psi_{\mathrm{AMSF}}(n)\}}{\sqrt{\mathrm{var}\{\Psi_{\mathrm{AMDF}}(n)\}\,\mathrm{var}\{\Psi_{\mathrm{AMSF}}(n)\}}}. \tag{A.9}$$

For simplicity of analysis, besides the assumptions made in Section 2, we further assume that the signal is also a Gaussian process with zero mean and variance $\sigma_s^2$, that the noise observed from different microphones has the same variance denoted by $\sigma_w^2$, and that the relative propagation attenuation between two microphones is negligible, that is, $\alpha = 1$. After making the above assumptions, we have We now estimate $\rho(n)$ under two conditions, that is, $n = \tau$ and $n \neq \tau$.