EURASIP Journal on Applied Signal Processing 2003:11, 1056–1063
© 2003 Hindawi Publishing Corporation

A Multidelay Double-Talk Detector Combined with the MDF Adaptive Filter

The multidelay block frequency-domain (MDF) adaptive filter is an excellent candidate for both acoustic and network echo cancellation. There is a need for a very good double-talk detector (DTD) that can be combined efficiently with the MDF algorithm. Recently, a DTD based on a normalized cross-correlation vector was proposed, and it was shown that this DTD performs much better than the Geigel algorithm and other DTDs based on the cross-correlation coefficient. In this paper, we show how to extend the definition of a normalized cross-correlation vector to the frequency domain for the general case where the block size of the Fourier transform is smaller than the length of the adaptive filter. The resulting DTD has an MDF structure, which makes it easy to implement and a good fit with an echo canceler based on the MDF algorithm. We also analyze resource requirements (computational complexity and memory requirement) and compare the MDF algorithm with the normalized least-mean-square (NLMS) algorithm from this point of view.


INTRODUCTION
Network and acoustic echo cancelers work on the same principle. To work well, an echo canceler (EC) [1] should include good solutions to two important problems: a system identification problem and a so-called double-talk detection problem [2]. When the echo path is identified by an adaptive filter, a function should be included to freeze the adaptation whenever a near-end signal is detected, and thereby avoid the divergence of the adaptive algorithm. This control can be exercised either by a so-called step-size control (soft decision) or by a double-talk detector (DTD) making hard decisions. Theoretically, the step-size control method would be preferable because it can be made optimal in the minimum mean-square sense [3,4,5]. In practice, however, there is no conclusive evidence that soft decisions (step-size control) result in better performance than DTD hard decisions; the outcome depends on the situation. Hence, it is of great interest to find a suitable and practical decision variable.
One of the most widely used DTDs is the Geigel algorithm [6] which works fairly well when the echo return loss is well defined. However, this is not, in general, the case in practice. The need for more sophisticated DTDs that do not depend on the path attenuation is obvious. Alternative methods for double-talk detection have been presented, for example, in [7,8]. A family of DTDs exhibiting this feature was proposed in [9].
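The dependence of the Geigel test on the path attenuation can be illustrated with a short sketch. It assumes the usual form of the test, in which double talk is declared when |y(n)| exceeds a fixed fraction of the largest recent far-end magnitude; the function name `geigel_double_talk`, the default `threshold=0.5` (a 6 dB attenuation assumption), and the toy echo paths in the usage note are illustrative choices, not taken from [6].

```python
import numpy as np

def geigel_double_talk(x, y, L, threshold=0.5):
    """Flag sample n when |y(n)| > threshold * max(|x(n)|, ..., |x(n-L+1)|).

    threshold=0.5 is an assumed value: it presumes the echo path attenuates
    the far-end signal by at least 6 dB.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    flags = np.zeros(len(y), dtype=bool)
    for n in range(len(y)):
        # Largest far-end magnitude over the last L samples (fewer at the start)
        recent_max = np.max(np.abs(x[max(0, n - L + 1): n + 1]))
        flags[n] = np.abs(y[n]) > threshold * recent_max
    return flags
```

With a single-tap echo path of gain 0.25, echo alone is never flagged. If the gain is raised to 0.8, so that the attenuation is weaker than the threshold assumes, the same test raises false alarms on echo alone, which is the weakness noted above.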
On the system identification part, the multidelay block frequency-domain (MDF) adaptive filter [10] is an excellent candidate for both acoustic and network echo cancellation. Indeed, since the coefficients of this adaptive filter are updated in the frequency domain, block by block, using the fast Fourier transform (FFT) as an intermediary step, it is very efficient from a complexity point of view. Moreover, the block length N is independent of the filter length L; N can be chosen as small as desired, with a resulting algorithmic delay equal to N. Although, from a complexity point of view, the optimal choice is N = L, using smaller block sizes (N < L) in order to reduce the delay is still more efficient than time-domain algorithms. The block delay is not a problem for some applications, for example, in a frame-based system like a Voice-over-Internet Protocol (VoIP) network. In this network, even a sample-by-sample time-domain algorithm would introduce a delay equal to the delay of a block-based algorithm. Hence, there is no delay penalty using a block-based MDF algorithm in this scenario if its block size is matched to the frame size of the network.

Figure 1: Block diagram of the echo canceler (EC), double-talk detector (DTD), and echo path.
A DTD based on a normalized cross-correlation vector was proposed in [9]. In [2], it was shown that this DTD performs much better than the Geigel algorithm and other DTDs based on the cross-correlation coefficient. In this paper, we show how to extend the ideas of [9] to the MDF algorithm. The resulting DTD has an MDF structure which makes it easy to implement and a good fit with an EC based on the MDF algorithm.
The organization of this paper is as follows. In Section 2, we introduce some definitions and notation that are used in the context of echo cancellation. In Section 3, we give the MDF algorithm. Section 4 presents the new DTD and its combination with an MDF EC. A resource analysis of the MDF algorithm is given in Section 5. Evaluation of the proposed MDF DTD is made in Section 6. Finally, we give our conclusions in Section 7.

DEFINITIONS AND NOTATION
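Referring to Figure 1, the following definitions and notation are used in all the derivations:

$$\mathbf{x}(n) = [x(n)\;\; x(n-1)\;\; \cdots\;\; x(n-L+1)]^T \quad \text{(far-end speech vector)},$$
$$\mathbf{h} = [h_0\;\; h_1\;\; \cdots\;\; h_{L-1}]^T \quad \text{(echo path)},$$
$$\hat{\mathbf{h}}(n) = [\hat{h}_0(n)\;\; \hat{h}_1(n)\;\; \cdots\;\; \hat{h}_{L-1}(n)]^T \quad \text{(adaptive filter)},$$
$$y(n) = \mathbf{h}^T\mathbf{x}(n) + v(n) + w(n) \quad \text{(echo plus near-end speech and ambient noise)}.$$

Here, n is the sample-by-sample time index and L is the length of the adaptive filter, which we suppose to be equal to the length of the echo path.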

THE MDF ADAPTIVE FILTER
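In this section, we give the MDF algorithm [10]. For further details and explanation, see [10,11]. We assume that L is an integer multiple of N, that is, L = KN. We define the block error signal (of length N ≤ L) as

$$\mathbf{e}(m) = \mathbf{y}(m) - \mathbf{X}^T(m)\hat{\mathbf{h}},$$

where m is the block time index, and

$$\mathbf{e}(m) = [e(mN)\;\; \cdots\;\; e(mN+N-1)]^T,$$
$$\mathbf{y}(m) = [y(mN)\;\; \cdots\;\; y(mN+N-1)]^T,$$
$$\mathbf{X}(m) = [\mathbf{x}(mN)\;\; \cdots\;\; \mathbf{x}(mN+N-1)],$$

with $\mathbf{x}(n)$ the far-end signal vector of Section 2. The vector $\hat{\mathbf{h}}$ is defined in the same manner as $\hat{\mathbf{h}}(n)$ in the previous section. It can easily be checked that $\mathbf{X}$ is a Toeplitz matrix of size L × N. We can show that

$$\mathbf{X}^T(m)\hat{\mathbf{h}} = \sum_{k=0}^{K-1} \mathbf{T}(m-k)\,\hat{\mathbf{h}}_k, \tag{3}$$

where

$$\mathbf{T}(m-k) = \begin{bmatrix} x(mN-kN) & \cdots & x(mN-kN-N+1) \\ x(mN-kN+1) & \cdots & x(mN-kN-N+2) \\ \vdots & \ddots & \vdots \\ x(mN-kN+N-1) & \cdots & x(mN-kN) \end{bmatrix}$$

is an N × N Toeplitz matrix and

$$\hat{\mathbf{h}}_k = [\hat{h}_{kN}\;\; \hat{h}_{kN+1}\;\; \cdots\;\; \hat{h}_{kN+N-1}]^T, \quad k = 0, 1, \ldots, K-1,$$

are the subfilters of $\hat{\mathbf{h}}$. In (3), the filter $\hat{\mathbf{h}}$ (of length L) is partitioned into K subfilters $\hat{\mathbf{h}}_k$ of length N, and the rectangular matrix $\mathbf{X}^T$ (of size N × L) is decomposed into K square submatrices of size N × N.

It is well known that a Toeplitz matrix $\mathbf{T}$ can be transformed, by doubling its size, to a circulant matrix

$$\mathbf{C} = \begin{bmatrix} \mathbf{T}' & \mathbf{T} \\ \mathbf{T} & \mathbf{T}' \end{bmatrix},$$

where $\mathbf{T}'$ is also a Toeplitz matrix. (The matrix $\mathbf{T}'$ is expressible in terms of the elements of $\mathbf{T}$, except for an arbitrary diagonal.) It is also well known that a circulant matrix is easily decomposed as follows:

$$\mathbf{C} = \mathbf{F}^{-1}\mathbf{D}\,\mathbf{F},$$

where $\mathbf{F}$ is the Fourier matrix (of size 2N × 2N) and $\mathbf{D}$ is a diagonal matrix whose elements are the discrete Fourier transform of the first column of $\mathbf{C}$.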
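The partition of the block output into K subfilter products, each evaluated through a circulant matrix of doubled size, can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, samples before time zero are assumed to be zero, and `numpy.fft` stands in for the Fourier matrix F.

```python
import numpy as np

def block_output_time(x, h_hat, m, N):
    """Direct evaluation of X^T(m) h_hat: the N filter outputs of block m."""
    out = np.zeros(N)
    for i in range(N):
        n = m * N + i
        for l in range(len(h_hat)):
            if n - l >= 0:                 # samples before time 0 taken as zero
                out[i] += h_hat[l] * x[n - l]
    return out

def block_output_fft(x, h_hat, m, N):
    """Same block computed as a sum of K circulant (2N-point FFT) products."""
    K = len(h_hat) // N
    acc = np.zeros(2 * N, dtype=complex)
    for k in range(K):
        # First column of the circulant: x(mN-kN-N), ..., x(mN-kN+N-1)
        start = (m - k - 1) * N
        seg = np.array([x[j] if j >= 0 else 0.0
                        for j in range(start, start + 2 * N)])
        D = np.fft.fft(seg)                # diagonal of D for delay k
        Hk = np.fft.fft(np.concatenate([h_hat[k * N:(k + 1) * N],
                                        np.zeros(N)]))
        acc += D * Hk
    # Only the last N samples are free of circular wrap-around
    return np.fft.ifft(acc).real[N:]
```

Both functions return the same N samples for any block m with (m + 1)N ≤ len(x), which is the content of the Toeplitz-to-circulant argument: the first N samples of each inverse transform are discarded and only the linear-convolution part is kept.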

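Now, we define the frequency-domain quantities

$$\underline{\mathbf{y}}(m) = \mathbf{F}\begin{bmatrix} \mathbf{0}_{N\times 1} \\ \mathbf{y}(m) \end{bmatrix}, \qquad \underline{\mathbf{e}}(m) = \mathbf{F}\begin{bmatrix} \mathbf{0}_{N\times 1} \\ \mathbf{e}(m) \end{bmatrix}, \qquad \underline{\hat{\mathbf{h}}}_k(m) = \mathbf{F}\begin{bmatrix} \hat{\mathbf{h}}_k(m) \\ \mathbf{0}_{N\times 1} \end{bmatrix},$$

$$\mathbf{G}^{01} = \mathbf{F}\begin{bmatrix} \mathbf{0}_{N\times N} & \mathbf{0}_{N\times N} \\ \mathbf{0}_{N\times N} & \mathbf{I}_{N\times N} \end{bmatrix}\mathbf{F}^{-1}, \qquad \mathbf{G}^{10} = \mathbf{F}\begin{bmatrix} \mathbf{I}_{N\times N} & \mathbf{0}_{N\times N} \\ \mathbf{0}_{N\times N} & \mathbf{0}_{N\times N} \end{bmatrix}\mathbf{F}^{-1}.$$

The MDF adaptive filter is then given by the following equations:

$$\underline{\mathbf{e}}(m) = \underline{\mathbf{y}}(m) - \mathbf{G}^{01}\sum_{k=0}^{K-1}\mathbf{D}(m-k)\,\underline{\hat{\mathbf{h}}}_k(m-1),$$
$$\mathbf{S}_{\mathrm{MDF}}(m) = \lambda\,\mathbf{S}_{\mathrm{MDF}}(m-1) + (1-\lambda)\,\mathbf{D}^*(m)\mathbf{D}(m),$$
$$\underline{\hat{\mathbf{h}}}_k(m) = \underline{\hat{\mathbf{h}}}_k(m-1) + \mu(1-\lambda)\,\mathbf{G}^{10}\mathbf{D}^*(m-k)\left[\mathbf{S}_{\mathrm{MDF}}(m) + \delta\,\mathbf{I}_{2N\times 2N}\right]^{-1}\underline{\mathbf{e}}(m), \quad k = 0, 1, \ldots, K-1,$$

where λ (0 < λ < 1) is an exponential forgetting factor, µ (0 < µ ≤ 2) is a positive number, δ is a regularization parameter, and $\mathbf{D}(m-k) = \mathbf{F}\,\mathbf{C}(m-k)\,\mathbf{F}^{-1}$ is the diagonal matrix associated with $\mathbf{T}(m-k)$.

We now turn the focus of this paper to a DTD that fits well with the MDF adaptive filter. In the next section, we derive this DTD and show how to combine it with the MDF algorithm.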
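As an illustration of the MDF recursion, the following NumPy sketch implements a constrained multidelay filter. It assumes the common form of the algorithm: an overlap-save error, an input power spectrum smoothed with forgetting factor λ, and a normalized gradient with step µ(1 − λ) that is constrained to N taps per subfilter. The function name `mdf_filter`, the default parameter values, and the synthetic echo path in the usage note are assumptions made for this example, not values from the paper.

```python
import numpy as np

def mdf_filter(x, y, N, K, mu=1.0, lam=0.95, delta=1e-6):
    """Constrained MDF filter; returns (error signal, time-domain estimate)."""
    n_blocks = len(x) // N
    H = np.zeros((K, 2 * N), dtype=complex)   # frequency-domain subfilters
    D = np.zeros((K, 2 * N), dtype=complex)   # row k: input spectrum, delay k
    S = np.zeros(2 * N)                       # smoothed input power spectrum
    x_pad = np.concatenate([np.zeros(N), x])  # block 0 needs a previous block
    e_out = np.zeros(n_blocks * N)
    for m in range(n_blocks):
        D = np.roll(D, 1, axis=0)             # age the stored input spectra
        D[0] = np.fft.fft(x_pad[m * N: m * N + 2 * N])
        # Filter output: last N samples of the inverse FFT (overlap-save)
        y_hat = np.fft.ifft(np.sum(D * H, axis=0)).real[N:]
        e = y[m * N:(m + 1) * N] - y_hat
        e_out[m * N:(m + 1) * N] = e
        E = np.fft.fft(np.concatenate([np.zeros(N), e]))
        S = lam * S + (1 - lam) * np.abs(D[0]) ** 2
        # Normalized gradient for every subfilter
        G = np.conj(D) * (E / (S + delta))
        # Constraint: keep only the first N time-domain taps of each gradient
        g = np.fft.ifft(G, axis=1).real
        g[:, N:] = 0.0
        H += mu * (1 - lam) * np.fft.fft(g, axis=1)
    h_hat = np.fft.ifft(H, axis=1).real[:, :N].reshape(-1)
    return e_out, h_hat
```

For example, with white noise `x`, a decaying random echo path `h` of length L = KN = 64 (N = 16, K = 4), and `y = np.convolve(x, h)[:len(x)]`, the estimate returned after 20000 samples should be close to `h`. Choosing N = L (K = 1) reduces the sketch to a single-block frequency-domain filter, at the cost of a longer block delay.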

A MULTIDELAY DOUBLE-TALK DETECTOR
The best way we know to detect the presence of double talk is to form a test statistic ξ and compare it to a threshold T: if ξ ≥ T, then we say that double talk is not present; if ξ < T, then we say that double talk is present. The test statistic is, in general, related to correlation or coherence and the threshold must be a known constant for best performance.
In the derivation of the DTD, we will neglect the effect of noise (i.e., w = 0) for simplicity. It can easily be checked that where Suppose that v = 0. In this case, where H denotes conjugate transpose, E{·} is the mathematical expectation, and Thanks to (10) and (13), we have and (12) can be rewritten as with Now, in general, for v ≠ 0, where Basically, there are two different ways to compute $\sigma_y^2$ when no double talk is present, and we take advantage of this information to detect the presence of a near-end signal. If we divide (15) by (17), we obtain the following decision variable: We easily deduce from (19) that for v = 0, ξ = 1, and for v ≠ 0, ξ < 1. Note also that ξ is not, in principle, sensitive to changes of the echo path when v = 0. In practice, ξ is estimated recursively as follows:

Scheme 1: The MDF adaptive filter combined with a multidelay DTD (first step: spectral and correlation estimation).
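The property that ξ = 1 without near-end speech and ξ < 1 with it can be checked on a small example. The sketch below is not the recursive MDF statistic of this section; it assumes the time-domain normalized cross-correlation form ξ² = rᵀR⁻¹r / σ_y², with R, r, and σ_y² replaced by batch sample averages. The function name `ncc_statistic` and the synthetic signals in the usage note are hypothetical.

```python
import numpy as np

def ncc_statistic(x, y, L):
    """Batch time-domain normalized cross-correlation between x-vectors and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Row n holds [x(n), x(n-1), ..., x(n-L+1)]
    X = np.array([x[n - L + 1:n + 1][::-1] for n in range(L - 1, len(x))])
    yy = y[L - 1:]
    R = X.T @ X / len(yy)            # sample estimate of E{x x^T}
    r = X.T @ yy / len(yy)           # sample estimate of E{x y}
    sigma_y2 = np.mean(yy ** 2)      # sample estimate of the power of y
    return np.sqrt(r @ np.linalg.solve(R, r) / sigma_y2)
```

When y is the echo alone, the numerator and the denominator are two ways of computing the same power, so the statistic equals 1 up to rounding. Adding an uncorrelated near-end signal of comparable power raises only the denominator, and the statistic drops well below 1, whatever the echo path attenuation.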
The echo path of the system is estimated, in the test statistic, by a background MDF adaptive filter $\hat{\mathbf{h}}_{b,k}$, k = 0, 1, . . . , K − 1, with an exponential window $\lambda_b$ (0 < $\lambda_b$ < 1) smaller than λ, the exponential window used for the system identification by a foreground MDF algorithm. What is important in practice is that the statistics of the signal y(n) (containing both the echo and the near-end signal during double talk) are tracked fast enough, faster than the statistics of the update of the foreground filter; hence $\lambda_b$ is chosen smaller than λ. We have to use µ = 1 for the background filter so that the two different ways we compute the statistics of y(n) (numerator and denominator of (19)) are consistent and estimated at the same rate. This way, the DTD alerts the foreground filter before it diverges, and its adaptation is frozen during double talk. Furthermore, for practical reasons, even though this is not mathematically rigorous, we use the same spectral matrix $\mathbf{S}_{\mathrm{MDF}}(m)$ for the foreground and background filters. All the variables used in the test statistic are estimated as where k = 0, 1, . . . , K − 1.
Scheme 1 summarizes the combination of the MDF EC and the MDF DTD, where k = 0, 1, . . . , K − 1; 0 < µ ≤ 2 is an adaptation step; λ and $\lambda_b$ are exponential windows; δ is the regularization factor; and T is the threshold.

Next, we take a look at the numerical complexity and memory requirement of the core MDF algorithm.

RESOURCE ANALYSIS OF THE MDF ADAPTIVE FILTER
An arithmetic operation (op.) is considered to be any real multiplication, real addition, real subtraction, or real division. We assume that complex operations are transformed into real operations according to Table 1. A complex variable is assumed to require two memory locations. For a Fourier-transformed vector, we assume that only half its elements need to be stored, that is, the memory required for a vector of length N is equivalent in both time and frequency domains.

Table 1: Complex operations and their cost in real multiplications and real additions.

If a Fourier transform of length N is computed using the FFT routine devised by [12], it requires As a reference, we will use the real-valued NLMS algorithm [13] (assuming all signals are real-valued), which is the workhorse algorithm of network ECs. Tables 2 and 3 show the resource requirements for the MDF and the basic real-valued NLMS algorithms with respect to their computational complexity and memory. In Figure 2, these requirements are compared, with a filter length of L = 512 and various block sizes N. The trade-off between computational and memory requirements is clearly exemplified. These values, however, do not translate directly to complexity for a specific hardware, but are meant to give a more general insight into the required resources.

SIMULATIONS
In this section, we present some performance results in the context of network echo cancellation. Figure 1 shows the principle of a network EC. The far-end speech signal x(n) goes through the echo path represented by a filter h, then it is added to the near-end talker signal v(n) and the ambient noise w(n). The composite signal is denoted by y(n). Most often, the echo path is modeled by an adaptive FIR filter ĥ(n) which subtracts a replica of the echo and thereby achieves cancellation. Double talk occurs when the two talkers on both sides speak simultaneously, that is, x(n) ≠ 0 and v(n) ≠ 0. In this situation, the near-end speech acts as a high-level uncorrelated noise to the adaptive algorithm. The disturbing near-end speech may therefore cause the adaptive filter to diverge, passing annoying audible echo to the far end. A common way to alleviate this problem is to slow down or completely halt the filter adaptation when near-end speech is detected. This is the very important role of the DTD.

Figure 3 shows a typical network impulse response that we have used in all our simulations. Even though the active coefficients in this case occur in the early part of the impulse response, it is not the case in general. Hence, in this application, we always have to cover a longer time span than the active region. The time span of this network echo path h is 64 milliseconds (L = 512). The same length is used for the adaptive filter.

Table 3: Complexity and memory requirements for the (real-valued) NLMS algorithm (columns: algorithm step, operations, memory).
The far-end speaker is a female (Figure 4a) and the near-end speaker is a male (Figure 4b). The sampling rate is 8 kHz and the echo-to-ambient-noise ratio is equal to 39 dB. The following parameters are used for the algorithms:

Performance is measured by means of the normalized misalignment, defined as

$$\frac{\|\mathbf{h} - \hat{\mathbf{h}}\|}{\|\mathbf{h}\|}. \tag{25}$$

Figure 4c shows the misalignment of the MDF EC when combined with the proposed DTD. Double talk starts around 1.3 seconds. We can see that the proposed MDF DTD quickly detects the near-end signal and freezes the adaptation of the (foreground) adaptive filter for the whole duration of double talk. Of course, without a DTD, the algorithm would have diverged very quickly.

Figure 5 shows the performance of the EC after an abrupt system change where the impulse response is shifted 200 samples in 2 seconds. In this simulation, there is no double talk. Figure 5a (respectively, Figure 5b) corresponds to the case where the MDF DTD is deactivated (respectively, activated). We can see that the performance of the EC with the MDF DTD is slightly worse than without it. This is due to the fact that any DTD will trigger false alarms; consequently, adaptation is frozen during that time and convergence slows down. This nonideal behavior is mainly caused by short-term correlation of the statistics used in the DTD. However, it has been shown that the false alarm rate of the proposed DTD structure is in general considerably lower than that of the Geigel DTD [14].
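The performance measure can be computed as in the following sketch. It assumes the usual definition of the normalized misalignment, ‖h − ĥ‖ / ‖h‖; expressing it in decibels is an assumption made here for plotting convenience, and the function name is hypothetical.

```python
import numpy as np

def normalized_misalignment_db(h, h_hat):
    """Normalized misalignment ||h - h_hat|| / ||h||, expressed in dB."""
    h = np.asarray(h, dtype=float)
    h_hat = np.asarray(h_hat, dtype=float)
    ratio = np.linalg.norm(h - h_hat) / np.linalg.norm(h)
    return 20.0 * np.log10(ratio)
```

A value of 0 dB corresponds to an all-zero estimate, and every 20 dB decrease corresponds to a tenfold reduction of the coefficient error norm. An exact estimate gives a ratio of zero, for which the logarithm is undefined, so callers may want to add a small floor before converting to dB.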

CONCLUSIONS
Double-talk detection is an important part of an EC system. A good DTD should be able to distinguish between double talk and echo path changes, and the threshold T should be a known constant. In this paper, we have proposed a new DTD that has these features by extending the definition of a normalized cross-correlation vector [9] to the frequency domain for the general case N ≤ L. By design, the proposed DTD has an MDF structure, in order to take advantage of the good characteristics of the MDF algorithm and to integrate well with an MDF EC. With the MDF algorithm, we can easily trade off computational load against memory requirement and algorithmic delay, and hence tailor the algorithm to a specific application. For example, in a frame-based VoIP system, no delay penalty is introduced compared to a time-domain (zero-delay) algorithm as long as the block size is matched to the frame size.
We can also use robust statistics [15] to derive a robust MDF adaptive filter, the same way it was done in [11] for the FLMS algorithm (N = L). A robust algorithm permits decreasing the threshold T without losing performance during double-talk; as a result, the probability of false alarm is low and the performance (convergence and tracking) of the adaptive algorithm is not much affected.