Kalman Filters for Time Delay of Arrival-Based Source Localization

In this work, we propose an algorithm for acoustic source localization based on time delay of arrival (TDOA) estimation. In earlier work by other authors, an initial closed-form approximation was ﬁrst used to estimate the true position of the speaker followed by a Kalman ﬁltering stage to smooth the time series of estimates. In the proposed algorithm, this closed-form approximation is eliminated by employing a Kalman ﬁlter to directly update the speaker position estimate based on the observed TDOAs. In particular, the TDOAs comprise the observation associated with an extended Kalman ﬁlter whose state corresponds to the speaker position. We tested our algorithm on a data set consisting of seminars held by actual speakers. Our experiments revealed that the proposed algorithm provides source localization accuracy superior to the standard spherical and linear intersection techniques. Moreover, the proposed al-gorithm, although relying on an iterative optimization scheme, proved efﬁcient enough for real-time operation.


Introduction
Most practical acoustic source localization schemes are based on time delay of arrival estimation (TDOA) for the following reasons: Such systems are conceptually simple.They are reasonably effective in moderately reverberant environments.Moreover, their low computational complexity makes them well-suited to real-time implementation with several sensors.Time delay of arrival-based source localization is based on a two-step procedure: 1.The TDOA between all pairs of microphones is estimated, typically by finding the peak in a cross correlation or generalized cross correlation function [1]. 2. For a given source location, the squared-error is calculated between the estimated TDOAs and those determined from the source location.The estimated source location then corresponds to that position which minimizes this squared error.If the TDOA estimates are assumed to have a Gaussiandistributed error term, it can be shown that the least squares metric used in Step 2 provides the maximum likelihood (ML) estimate of the speaker location [2].Unfortunately this least squares criterion results in a nonlinear optimization problem that can have several local minima.Several authors have proposed solving this optimization problem with standard gradientbased iterative techniques.While such techniques typically yield accurate location estimates, they are typically computationally intensive and thus ill-suited for real-time implementation [3,4].
For any pair of microphones, the surface on which the TDOA is constant is a hyperboloid of two sheets.A second class of algorithms seeks to exploit this fact by grouping all microphones into pairs, estimating the TDOA of each pair, then finding the point where all associated hyperboloids most nearly intersect.Several closed-form position estimates based on this approach have appeared in the literature; see Chan and Ho [5] and the literature review found there.Unfortunately, the point of intersection of two hyperboloids can change significantly based on a slight change in the eccentricity of one of the hyperboloids.Hence, a third class of algorithms was developed wherein the position estimate is obtained from the intersection of several spheres.The first algorithm in this class was proposed by Schau and Robinson [6], and later came to be known as spherical intersection.Perhaps the best known algorithm from this class is the spherical interpolation method of Smith and Abel [7].Both methods provide closed-form estimates suitable for realtime implementation.
Brandstein et al [4] proposed yet another closed-form approximation known as linear intersection.Their algorithm proceeds by first calculating a bearing line to the source for each pair of sensors.Thereafter, the point of nearest approach is calculated for each pair of bearing lines, yielding a potential source location.The final position estimate is obtained from a weighted average of these potential source locations.
In the algorithm proposed here, the closed-form approximations used in prior approaches is eliminated by employing an extended Kalman filter to directly update the speaker position estimate based on the observed TDOAs.In particular, the TDOAs comprise the observation associated with an extended Kalman filter whose state corresponds to the speaker position.Hence, the new position estimate comes directly from the update formulae associated with the Kalman filter.Similar approaches have been proposed in the past by Dvorkin and Gannot [8] for acoustic source localization, and by Duraiswami et al for a combined audio-video localization system based on a particle filter [9].
The balance of this work is organized as follows.In Section 2, we review the process of source localization based on time-delay of arrival estimation.In particular, we formulate source localization as a problem in nonlinear least squares estimation, then develop an appropriate linearized model.Section 3 summarizes a less well-known variant of the Kalman filter, known as the iterated extended Kalman filter.Section 4 presents a simple model for speaker motion, then discusses how the development in the preceding sections can be combined to develop an acoustic localization algorithm capable of tracking a moving speaker.Section 5 presents the results of our initial experiments comparing the proposed algorithm to the standard techniques.

Source Localization
Consider the i-th pair of microphones, and let mi1 and mi2 respectively be the positions of the first and second microphones in the pair.Let x denote the position of the speaker in R 3 .Then the time delay of arrival (TDOA) between the two microphones of the pair can be expressed as where s is the speed of sound.Denoting # allows (1) to be rewritten as where is the distance from the source to microphone mij.Equation ( 2) is clearly nonlinear in x = (x, y, z).In the coming development, we will find it useful to have a linear approximation.Hence, we can take a partial derivative with respect to x on both sides of (2) and write -Taking partial derivatives with respect to y and z similarly, we find We can approximate Ti(x) with a first order Taylor series expansion about the last position estimate x(t − 1) as where we have defined the row vector Equation ( 4) is the desired linearization.Source localization based on a maximum likelihood (ML) criterion [2] proceeds by minimizing the error function where τ i is the observed TDOA for the i-th microphone pair and σ 2 i is the error covariance associated with this observation.The TDOAs can be estimated with a variety of wellknown techniques [1,10].Perhaps the most popular method involves the generalized cross correlation (GCC), which can be expressed as For reasons of computational efficiency, R12(τ ) is typically calculated with an inverse FFT.Thereafter, an interpolation is performed to overcome the granularity in the estimate corresponding to the sampling interval [1].
Substituting the linearization (4) into ( 6) and introducing a time dependence provides where for i = 0, . . ., N − 1.Let us define and so that ( 9) can be expressed in matrix form as Similarly, defining 12) enables ( 8) to be expressed as In past work, the criterion (6) was minimized for each time instant t, typically with a closed-form approximation to the true minimum [6,7,5,4].Thereafter, some authors have proposed using a Kalman filter to smooth the position estimates over time [4,11].In this work, we propose to incorporate the smoothing stage directly into the estimation.This is acomplished as follows: First we note that (13) represents a nonlinear least squares estimation problem that has been appropriately linearized; we can associate τ (t) with the observation vector appearing in a Kalman filter such as we will encounter in Section 3.Moreover, we can define a model for the motion of the speaker, in the form typically seen in the process equation of a Kalman filter.Thereafter, we can apply the standard Kalman filter update formulae directly to the given recursive estimation problem without ever having recourse to a closed-form approximation for the speaker position.It is worth noting that a similar approach was used by Duraiswami et al in developing an algorithm for combined audio-video source localization based on a particle filter [9].
To see more clearly how this approach can be implemented, we briefly review a variation of the Kalman filter in Section 3.

Iterated Extended Kalman Filter
Let x(t) denote the state of a Kalman filter at time t, and let y(t) denote the associated observation.Moreover, define a transition matrix F(t + 1, t) which specifies how the state evolves in time, and functional C(t, x(t)) which specifies how the state is related to the current observation.The Kalman filter is then described by the process and observation equations: where ν1(t) and ν2(t) are the process and observation noise respectively.By assumption, ν1(t) and ν2(t) are zero mean with covariances matrices Q 1 (t) and Q 2 (t).Let x(t|Y t−1 ) denote the predicted estimated state of a Kalman filter using all the observations Y t−1 up to time t − 1.The innovation is defined as the difference between the observation y(t) and the prediction C(t, x(t|Y t−1 )) at time t: Using the innovation and the Kalman gain G f (t, x(t|Y t−1 )), the state estimate can be updated according to [12, §10] x(t|Y t ) = x(t|Y t−1 ) + G f (t, x(t|Y t−1 )) α(t, x(t|Y t−1 )) (17) The details of the recursive calculation of G f (t, x(t|Y t−1 )) can be found in Haykin [12, §10].
Jazwinski [13, §8.3] describes an iterated extended Kalman filter (IEKF), in which (16)(17) are replaced with the local iteration, where C(η i ) is the linearization of C(t, η i ) about η i .The local iteration is initialized by setting Note that η 2 = x(t|Y) as defined in (17).Hence, if the local iteration is run only once, the IEKF reduces to an extended Kalman filter.Normally (18-19) are repeated, however, until there are no substantial changes between η i and η i+1 .Both G f (t, η i ) and C(η i ) are updated for each local iteration.After the last iteration, we set x(t|Y t ) = η f .Jazwinski [13, §8.3] reports that the IEKF provides faster convergence in the presence of significant nonlinearities in the observation equation, especially when the initial state x(1|Y 0 ) is far from the optimal value.

Speaker Tracking with the Kalman Filter
In this section, we discuss the specifics of how the linearized least squares position estimation criterion ( 13) can be recursively minimized with the iterated extended Kalman filter presented in the prior section.We begin by associating the TDOA estimate τ (t) with the observation y(t).Moreover, we recognize that the linearized observation functional C(t) required for the Kalman filter is given by ( 5) and ( 10) for our acoustic localization problem.Furthermore, we can equate the TDOA error covariance matrix Σ in (12) with the observation noise covariance Q 2 (t).Hence, we have all relations needed on the observation side of the Kalman filter.We need only supplment these with an appropriate model of the speaker's dynamics to develop an algorithm capable of tracking a moving speaker, as opposed to merely finding his position at a single time instant.This is our next task.Consider the simplest model of speaker dynamics, wherein the speaker is "stationary" inasmuch as he moves only under the influence of the process noise ν1(t).The transition matrix is then F(t+1|t) = I.Assuming the process noise components in the three directions are statistically independent, we can write where T is the time since the last state update.Although the audio sampling is synchronous for all sensors, it cannot be assumed that the speaker constantly speaks, nor that all microphones receive the direct signal from the speaker's mouth; i.e., the speaker sometimes turns so that he is no longer facing the microphone array.As only the direct signal is useful for localization [14], the TDOA estimates returned by those sensors receiving only the indirect signal reflected from the walls should not be used for position updates.This is most easily done by setting a threshold on the GCC (7), and using for source localization only those microphone pairs returning a peak in the GCC above the threshold [14].This implies that no update at all is made if the speaker is not speaking.Alternatively, a partial update can be made based only on the prediction The nonlinear functional C(t, x(t)) corresponds to the TDOA model . . .
where the individual components Ti(x(t)) are given by (2-3).The linearized functional C(x(t)) is given by ( 5) and (10).As explained in [12], a numerically stable version of the Kalman filter based on the Cholesky decomposition can be readily developed.
Although the IEKF with the local iteration (18-20) was used for the experiments reported in Section 5, the localization system ran in less than real time on a Pentium Xeon processor with a clock speed of 3.0 GHz.This is so because during normal operation very few local iterations are required before the estimate converges.The local iteration compensates for the difference between the original nonlinear least squares estimation criterion (6) and the linearized criterion (8).The difference between the two is only significant during the starting phase, when the estimated position is far from the true speaker location; once the speaker position has been acquired to a reasonable accuracy, the linearized model (8) matches the original (6) quite well.The use of such a linearized model can be equated with the Gauss-Newton method, wherein higher order terms in the series expansion of the criterion function are neglected.The connection between the Kalman filter and the Gauss-Newton method is well-known, as is the fact that the convergence rate of the latter is superlinear if the error τ i − Ti(x) is small near the optimal solution x = x * .Further details can be found in Bertsekas [15, §1.5].

Experiments
The test set used to evaluate the algorithms proposed here contains approximately three hours of audio and video data recorded during seven seminars by students and faculty at the University of Karlsruhe (UKA) in Karlsruhe, Germany.Prior to the start of the seminars, four video cameras in the corners of the room had been calibrated with the technique of Zhang [16].The location of the centroid of the speaker's head in the images from the four calibrated video cameras was manually marked every 0.7 second.Using these hand-marked labels, the true position of the speaker's head in three dimensions was calculated using the technique described in [17].These "ground truth" speaker positions are accurate to within 10 cm.
As the seminars took place in an open lab area used both by seminar participants as well as students and staff engaged in other activities, the recordings are optimally-suited for evaluating acoustic source localization and other technologies in a realistic, natural setting.In addition to speech from the seminar speaker, the farfield recordings contain noise from fans, computers, and doors, in addition to cross-talk from other people  1 presents the results of a set of experiments comparing the new IEKF algorithm proposed in this work, to the the spherical intersection (SX) method of Schau and Robinson [6], as well as the linear intersection (LI) technique of Brandstein et al [4].
The SX method used three microphones of the array (micnumber 0, 2 and 4) with a distance of 8.1 cm between a pair to make an initial estimate of the speaker's position.The LI and IEKF techniques, on the other hand, made use of the same set of 12 microphone pairs.These pairs were formed out of the microphone array by dividing the array into two 8-channel subarrays and taking each possible pair of microphones with an inter-element distance of 8.1 cm .In all cases, the TDOAs were estimated using the generalized cross correlation [1].
The results shown in Table 1 summarize the position estimation error over the 14 segments of the CHIL seminar data.The root mean square (RMS) errors for azimuth and depth were obtained by comparing the true speaker positions obtained from the video labels with the position estimates produced by the several acoustic source localization algorithms.Position estimates from the SX and LI methods lying outside the physical borders of the room were omitted.
Without any smoothing, the source localization estimates returned by both the LI and SX methods are very inaccurate.The LI method provides particularly poor estimates in depth.Kalman filtering improved the position estimates provided by both the SX and LI methods, yet the average RMS distance from the true source location remained large.The new IEKF approach outperformed both the SX and LI methods for both azimuth and depth.We attribute this superior performance largely to the elimination of the initial closed-form estimate associated with the LI and SX methods, and its inherent inaccuracy.
The result of the IEKF could be further improved by implementing an adaptive threshold as proposed by [14].The total gain is about 30 percent in terms of azimuth and about 40 percent in depth as shown in Table 2.

Table 2 :
IEKF with and without adaptive threshold present in the room.For these initial experiments, the seminars were recorded with an sixteen-element, linear array with an inter-element spacing of 4 cm.Table