Joint Audio-Visual Tracking Using Particle Filters

It is often advantageous to track objects in a scene using multimodal information when such information is available. We use audio as a complementary modality to video data, which, in comparison to vision, can provide faster localization over a wider field of view. We present a particle-filter based tracking framework for performing multimodal sensor fusion for tracking people in a videoconferencing environment using multiple cameras and multiple microphone arrays. One advantage of our proposed tracker is its ability to seamlessly handle temporary absence of some measurements (e.g., camera occlusion or silence). Another advantage is the possibility of self-calibration of the joint system to compensate for imprecision in the knowledge of array or camera parameters by treating them as containing an unknown statistical component that can be determined using the particle filter framework during tracking. We implement the algorithm in the context of a videoconferencing and meeting recording system. The system also performs high-level semantic analysis of the scene by keeping participant tracks, recognizing turn-taking events and recording an annotated transcript of the meeting. Experimental results are presented. Our system operates in real time and is shown to be robust and reliable.


INTRODUCTION
The goal of most machine perception systems is to mimic the performance of human and animal systems. A key characteristic of human systems is their multimodality. They rely on information from many modalities, chief among which are vision and audition. It is now apparent that many of the centers in the brain thought to encode space-time are activated by combinations of visual and audio stimuli [1]. However, computer vision and computer audition have essentially developed on parallel tracks, with different research communities and problems. The capabilities of computers have now reached a level at which it is possible to build systems that combine multiple audio and video sensors and perform meaningful joint analysis of a scene, such as joint audiovisual speaker localization, tracking, speaker change detection, and remote speech acquisition using beamforming techniques, which is necessary for the development of natural, robust, and environment-independent applications. Applications of such systems include novel human-computer interfaces, robots that sense and perceive their environment, perceptive spaces for applications in immersive virtual or augmented reality, and so forth. In particular, applications such as video gaming, virtual reality, multimodal user interfaces, and video conferencing require systems that can locate and track persons in a room through a combination of visual and audio cues, enhance the sound that they produce, and perform identification.
In this paper, we describe the development of a system that processes input from multiple video and audio sensors. The gathered information is used to perform lower-level analysis (robust object tracking including occlusion handling), higher-level scene analysis (segmenting the audio recording of a meeting into pieces corresponding to the activity of individual speakers), and speech quality improvement (simple beamforming-based speech signal enhancement for the speech recognition engine). We present a probabilistic framework for combining results from the two modes and develop a particle filter based joint audio-video tracking algorithm. The availability of independent modalities allows the audio and video calibration parameters to be adjusted dynamically to achieve consistent tracks. Furthermore, our multimodal tracker is able to track robustly through missing features in one of the modalities, and is more robust than trackers relying on either mode alone. The developed system is applied to smart videoconferencing and meeting recording, and to animal behavior studies.
The developed algorithm is an application of sequential Monte Carlo methods (also known as particle filters) to 3D tracking using one or more cameras and one or more microphone arrays. Particle filters were originally introduced in the computer vision area in the form of the CONDENSATION algorithm [2]. Improvements of a technical nature to the condensation algorithm were provided by Isard and Blake [3], MacCormick and Blake [4], Li and Chellappa [5], and Philomin et al. [6]. The algorithm has seen applications in multiple aspects of both computer vision and signal processing. For example, a recent paper by Qian and Chellappa [7] describes a particle filter algorithm for the structure-from-motion problem using sparse feature correspondences, which also estimates sensor motion from the epipolar constraint, and a recently published book [8] describes many different applications in signal detection and estimation. Overall, particle filters provide effective solutions for challenging problems in different areas of computer vision and signal processing.
The development of multimodal sensor fusion algorithms is also an active research area. Applications include a multisensor vehicle navigation system in which computer vision, laser radar, sonar, and microwave radar sensors are used together [9], audio-visual person identification using a support vector machine (SVM) classifier [10], multimodal speaker detection using Bayesian networks [11], multimodal tracking using inverse modeling techniques from computer vision, speech recognition, and acoustics [12], and discourse segmentation using gesture, speech, and gaze cues [13]. Our algorithm combines multimodality with a particle filter framework, which enables simple and fast implementation and on-the-fly multisensor self-calibration by tracking the relative positions and orientations of the sensors together with the coordinates of the objects. We present experimental results showing the potential of the developed algorithm.

ALGORITHMS
The multimodal tracking system consists of several relatively independent components that produce sensor measurements and perform tracking and camera control. We describe the formulation of the multimodal particle filter, discuss how it can be modified to allow for a dynamic system self-calibration, and show how the measurement vector for the particle filter is obtained. We will also describe the detection of turn-taking events and the separation of the audio recording of the meeting into pieces corresponding to different talkers.

Particle filter formulation
Several different approaches can be used for multimodal tracking for videoconferencing. Probably the simplest method is direct object detection in every frame by inverting the measurement equations and obtaining object positions from measurements. Significant drawbacks of this method are its slow speed and, more importantly, the fact that a closed-form inversion of the measurement equations may not exist or may not be numerically stable; in addition, the temporal inter-frame relationships between object positions are not exploited. The Kalman filter and the extended Kalman filter provide a statistically optimal tracking solution in the case of a Gaussian probability density function of a process; however, they cannot be used effectively for a process that is not modeled well by the Gaussian distribution. Particle filters address this problem effectively.
The particle filter algorithm provides a simple and effective way of modeling a stochastic process with an arbitrary probability density function P(S) by approximating it numerically with a cloud of points, called particles, in a process state space S. (We use S for the state space to avoid confusion with X, which we use to denote the geometric coordinates only.) The other components of a particle filter framework are the measurement vector Z, the motion model, and the likelihood equation. The measurements depend on the object state, and the object state is statistically derived from them. The motion model S_{t+1} = F(S_t) describes the time evolution of the object state, and the conditional posterior probability estimation function P(Z|S) defines the likelihood of the observed measurement for a given point of the state space. (Note that in the particle filter framework, it is never required to invert the measurement equations; only the forward projection from the state space to the measurement space has to be computed, which is usually quite easy.) The update cycle consists of propagating every particle in the state space according to the motion model, reweighting the particles in accordance with the obtained measurement vector, and resampling the particle set to prevent degeneration and maintain an equiweighted set. The update algorithm is described below and is very similar to the original algorithm.

Update algorithm
Every particle in the set {s_i}, i = 1, . . ., N, in the state space S has a weight π_i associated with it. This set is called properly weighted if it approximates the true PDF P(s), so that for every integrable function H(s) the weighted sum Σ_i π_i H(s_i) converges to the expectation ∫ H(s) P(s) ds as N → ∞. Given a properly weighted set of particles at time t with equal weights 1/N, it is possible to update it to reflect the new measurements obtained at time t + δt. The update algorithm is as follows.
(1) Propagate each particle s_i in time using the object motion model to obtain an updated particle set {s*_i}.
(2) Obtain a new measurement vector Z and evaluate the posterior probability density π*_i on {s*_i}, π*_i = p(s*_i|Z), which measures the likelihood of s*_i given Z. Using Bayes' rule, this can be written as p(s*_i|Z) = p(Z|s*_i) p(s*_i) / p(Z), where p(Z) is the prior probability of the measurement, which is assumed to be a known constant, and p(s*_i) = 1/N. Thus, p(s*_i|Z) = K p(Z|s*_i) for some constant K, and p(Z|s*_i) can be computed without inversion of the measurement equations.
(3) Resample from {s*_i} with probabilities π*_i, and generate a new properly weighted set {s_i} with equal weights 1/N for each particle.
(4) Repeat steps (1)-(3) for subsequent times.

Several improvements to the original particle filter framework proposed by different researchers are implemented, including importance sampling and quasi-random sampling. They significantly improve the performance of the tracker.
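The update cycle above can be sketched in a few lines of Python (an illustrative sketch, not the authors' implementation; the `propagate` and `likelihood` callables stand in for the motion model and the measurement likelihood p(Z|s), and the function name is ours):

```python
import numpy as np

def particle_filter_step(particles, propagate, likelihood, z, rng):
    """One update cycle of a sampling-importance-resampling particle filter."""
    n = len(particles)
    # (1) Propagate every particle through the motion model.
    predicted = np.array([propagate(s, rng) for s in particles])
    # (2) Reweight: pi*_i is proportional to p(Z | s*_i); only the forward
    #     likelihood is needed, never an inversion of the measurement equations.
    weights = np.array([likelihood(z, s) for s in predicted])
    weights = weights / weights.sum()
    # (3) Resample with probabilities pi*_i to restore an equally weighted set.
    idx = rng.choice(n, size=n, p=weights)
    return predicted[idx]
```

With a Gaussian likelihood around a measurement, repeated steps concentrate the particle cloud near the posterior mode, as the theory predicts.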

Self-calibration
The particle filter is usually employed for tracking the motion of an object. However (and this is one of the contributions of this paper), it can be used equally well to estimate intrinsic system parameters or sensor ego-motion. In a videoconferencing framework, there often exists uncertainty in the positions of the sensors. For example, the position of a microphone array with respect to the camera can be measured with a ruler or determined from a calibrated video sequence. However, both methods are subject to measurement errors. These errors can lead to disagreement between the audio and video estimates of the object position and ultimately to tracking loss. In another scenario, a multimodal tracking system with independent motion of a sensor requires estimation of the sensor motion, which can be done simultaneously with tracking in the proposed framework. Such a system can include, for example, several moving platforms, each with a camera and a microphone array, or a rotating microphone array.
To perform tracking with simultaneous parameter estimation, we simply include the sensor parameters in the system state space. We should be careful, though, to avoid introducing too many free parameters, as this increases the dimensionality of the state space (the "curse of dimensionality") and leads to poor tracking performance. We performed several experiments with synthetic data using one and two planar microphone arrays rotating independently and one and two rotating cameras. In all cases where at least one sensor position is fixed, tracking with simultaneous parameter estimation was successful in recovering both the object and the sensor motion. (When all sensors are free to rotate, there exist configurations in which it is impossible to distinguish between sensor and object motion. Multipoint self-calibration should be used in this case.) We also performed an experimental study of a self-calibrating videoconferencing system. In our particular experimental setup, two cameras observe the room and a microphone array lies on the room floor. The self-geometry of the array is known with good precision, but the position of the array relative to the cameras is known only approximately and is recovered correctly during tracking.

Motion model
The motion model describes the temporal update rule for the system state. The tracked object state consists of the three coordinates and three velocities of the object, (x, y, z, ẋ, ẏ, ż), corresponding to a first-order motion model. To allow changes in the object state, a random excitation force F, modeled as Gaussian with zero mean and standard deviation σ, is applied to the velocity components. (The value of σ chosen depends on the expected acceleration of the tracked object. If it is set too small, tracking can be lost because the tracker cannot follow the object quickly enough; if it is set too large, the predictive value of the model disappears. In our setup, σ = 100 m/s² in the experiments with the fast-moving free-flying bat, which can accelerate quickly and make sharp turns, and σ = 5 m/s² in the videoconferencing setup where people are being tracked.) The state update rule is x_{t+δt} = x_t + ẋ_t δt, ẋ_{t+δt} = ẋ_t + F δt, with similar expressions for y, ẏ, z, ż. When additional spatial parameters (position or rotation angle) are added for a sensor that is expected to be in motion, both the parameter and its first time derivative (velocity) are added, and the same motion model is used. When parameters are added for a static sensor, the velocity is not used and the random excitation is applied directly to the parameter. For example, when two rotating arrays are used to track the object, the state vector consists of ten components, (x, y, z, φ_1, φ_2, ẋ, ẏ, ż, φ̇_1, φ̇_2), where φ_1 and φ_2 are the rotation angles of the arrays.
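A minimal sketch of this first-order update rule for the six-component object state (the function name and the use of NumPy are ours, not the authors' code):

```python
import numpy as np

def propagate_state(state, dt, sigma, rng):
    """First-order motion model: the state is (x, y, z, vx, vy, vz); positions
    advance by velocity, and a random excitation force F ~ N(0, sigma^2) is
    applied to the velocity components."""
    pos, vel = state[:3], state[3:]
    force = rng.normal(0.0, sigma, size=3)   # random excitation force F
    new_pos = pos + vel * dt
    new_vel = vel + force * dt
    return np.concatenate([new_pos, new_vel])
```

With sigma = 0 the model reduces to deterministic constant-velocity motion; a larger sigma lets the tracker follow sharper accelerations at the cost of predictive power.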

Video measurements
The video data stream is acquired from two color pan-tilt-zoom cameras. The relationship between the image coordinates (u_i, v_i) and the world coordinates (X, Y, Z) of the object (the camera projection equations) for the ith camera can be described using the simple direct linear transformation (DLT) model (see [14]):

u_i = (p_11 X + p_12 Y + p_13 Z + p_14) / (p_31 X + p_32 Y + p_33 Z + 1),
v_i = (p_21 X + p_22 Y + p_23 Z + p_24) / (p_31 X + p_32 Y + p_33 Z + 1).

The matrix P_i has eleven parameters {p_11, . . ., p_14, p_21, . . ., p_33}, which in this model are assumed to be independent, with p_34 = 1. These parameters are estimated by using a calibration object of known geometry placed in the field of view of both cameras with both camera pan and tilt set to zero. The calibration object consists of 25 white balls on black sticks arranged in a regular spatial pattern; the three-dimensional coordinates of the balls are known to within 0.5 mm. The image coordinates of every ball are determined manually from the image of the calibration object, thus giving 25 relationships of the above form between (X_j, Y_j, Z_j) and (u_ij, v_ij), j = 1, . . ., 25, for the ith camera with the unknown parameters P. This overdetermined linear system of equations is then solved for P using least squares.
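The calibration step can be sketched as a linear least-squares problem by multiplying each projection equation through by its denominator (an illustrative reconstruction under the stated p_34 = 1 normalization; function names are hypothetical):

```python
import numpy as np

def dlt_calibrate(world, image):
    """Estimate the eleven DLT parameters (p34 fixed to 1) from 3D-2D point
    correspondences by linear least squares.  world: (n, 3), image: (n, 2)."""
    rows, rhs = [], []
    for (X, Y, Z), (u, v) in zip(world, image):
        # u * (p31 X + p32 Y + p33 Z + 1) = p11 X + p12 Y + p13 Z + p14
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z]); rhs.append(u)
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z]); rhs.append(v)
    p, *_ = np.linalg.lstsq(np.asarray(rows, float), np.asarray(rhs, float), rcond=None)
    return np.append(p, 1.0).reshape(3, 4)   # full 3x4 projection matrix

def dlt_project(P, point):
    """Forward projection of a world point to image coordinates."""
    h = P @ np.append(point, 1.0)
    return h[:2] / h[2]
```

Each correspondence contributes two equations, so the 25 calibration balls give a well overdetermined system for the 11 unknowns.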
In the course of tracking, the video processing subsystem analyzes the acquired video frames and computes the likelihood of an observed video frame (measurement) given a system state. This can be done in two ways. One way is to first extract the object coordinates from the image by template matching over the whole image and finding the best match, and then see how well the extracted image coordinates match the coordinates obtained by projecting the system state onto the measurement space. Another, more promising, approach is to take the whole image as a measurement, perform template matching at the image point to which the system state projects, and report the matching score as a likelihood measure; this has the advantage of performing matching only at points where a match is likely to be found (that is, around the true object position) and is able to handle multiple objects in the same frame. We use a simple face detection algorithm based on skin color and template matching for the initial detection and then perform head tracking based on shape matching and color histograms [15] after the detection is done.
We denote the measured image coordinates of the object by (ũ_i, ṽ_i) (the tilde denotes measured values). Object localization is described in a later subsection. Given the system state S (with the object coordinates (x, y, z) as part of S), the data likelihood estimation P_v(Z_v|S) is computed as follows. First, we account for the (known) current camera pan and tilt angles. To do that, we simply rotate the world around the camera origin by the same pan and tilt angles, obtaining a source position (x', y', z') in the coordinate system of the rotated camera. These coordinates are then plugged into the DLT equations to obtain the corresponding image object position (u_i, v_i). The error measure ε_v for the video localization is given by a sum over the N cameras,

ε_v = Σ_{i=1}^{N} [(u_i − ũ_i)² + (v_i − ṽ_i)²],

and the data likelihood estimation is

P_v(Z_v|S) = exp(−ε_v / (2σ_v²)),

where σ_v is the width of a corresponding Gaussian reflecting the level of confidence in the video measurements. (We introduce the notion of the error measure exclusively to split the complicated formula into two parts; the data likelihood can be easily expressed directly over the measurements as well.)
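A sketch of the resulting video likelihood (illustrative only; the pan/tilt rotation of the world is omitted, and the per-camera DLT matrices are assumed to be given):

```python
import numpy as np

def video_likelihood(xyz, cameras, measured, sigma_v):
    """P_v(Z_v | S): project the hypothesized 3D position into every camera
    via its DLT matrix and accumulate the squared pixel error eps_v."""
    eps = 0.0
    for P, (u_m, v_m) in zip(cameras, measured):
        h = P @ np.append(xyz, 1.0)              # forward DLT projection
        u, v = h[0] / h[2], h[1] / h[2]
        eps += (u - u_m) ** 2 + (v - v_m) ** 2   # error measure eps_v
    return np.exp(-eps / (2.0 * sigma_v ** 2))   # Gaussian likelihood
```

Note that only the forward projection is ever evaluated; the measurement equations are never inverted.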

Audio measurements
The audio localization is based on computing the time differences of arrival (TDOA) between channels of the microphone array. TDOA values are computed by a generalized cross-correlation algorithm [16]. Denote the signal at the ith microphone by h_i(t) and its Fourier transform by H_i(ω). Then the time difference τ̃_ij that maximizes the generalized cross-correlation between channels i and j can be computed quickly as

τ̃_ij = argmax_τ ∫ W(ω) H_i(ω) H*_j(ω) e^{jωτ} dω,

where W(ω) is a weighting function, taken to be the inverse noise spectrum power |N(ω)|^{−2}, and H*_j(ω) denotes the complex conjugate of H_j(ω). The noise power spectrum is estimated during silence periods.
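An FFT-based sketch of this computation on sampled signals (illustrative; equal-length channels are assumed, and the optional `weight` array stands in for W(ω), e.g. an inverse noise power estimate):

```python
import numpy as np

def gcc_tdoa(h_i, h_j, fs, weight=None):
    """Generalized cross-correlation: the TDOA is the lag maximizing the
    inverse transform of W(w) * H_i(w) * conj(H_j(w))."""
    n = 2 * len(h_i)                      # zero-pad to avoid circular wrap
    Hi = np.fft.rfft(h_i, n)
    Hj = np.fft.rfft(h_j, n)
    cross = Hi * np.conj(Hj)
    if weight is not None:                # length must match rfft bins, n//2 + 1
        cross = cross * weight
    cc = np.fft.irfft(cross, n)
    # Index k holds lag k for k < n/2 and lag k - n for k >= n/2.
    lags = np.concatenate([np.arange(0, n // 2), np.arange(-n // 2, 0)])
    return lags[np.argmax(cc)] / fs
```

A positive return value means channel i lags channel j, matching the sign convention of the TDOA definition in the next subsection.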
To use these measurements in the filtering framework, we have to define the likelihood of observing an audio measurement vector Z_a consisting of particular measurements {τ̃_ij}, i, j = 1, . . ., N, for a given system state S. This is easy to do. Assume that the state S corresponds to the source position (x_s, y_s, z_s) and microphone positions (x_i, y_i, z_i), i = 1, . . ., N. (In the case of moving sensors, the microphone positions may change over time.) Then define the distance χ_i from the source to the ith microphone as

χ_i = sqrt((x_s − x_i)² + (y_s − y_i)² + (z_s − z_i)²).

The TDOA set for this system state is simply τ_ij = (χ_j − χ_i)/c, where c is the speed of sound. Now we define the audio error measure ε_a between the TDOAs for the state S and the observed set of TDOAs as

ε_a = Σ_{i,j} (τ_ij − τ̃_ij)²,

and the data likelihood estimation as

P_a(Z_a|S) = exp(−ε_a / (2σ_a²)),

where σ_a reflects the level of confidence in the audio measurements. (On a side note, a probabilistic audio source localization algorithm such as the one described here is computationally more expensive but superior to algorithms that use pairs of cross-correlation values and intersect multiple cones of equivalent time delays, since one invalid cross-correlation can throw the resulting intersection vastly off position. In contrast, the probabilistic approach does not require unstable inverse calculations and is shown to be more robust; see, for example, [17].)
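These formulas translate directly into a likelihood routine (a sketch; the dictionary of measured microphone pairs is our own illustrative interface, not the original implementation):

```python
import numpy as np

def audio_likelihood(source, mics, measured_tdoa, sigma_a, c=343.0):
    """P_a(Z_a | S): distances chi_i from the source to each microphone give
    predicted TDOAs tau_ij = (chi_j - chi_i) / c, which are scored against the
    measured set with a Gaussian error measure eps_a."""
    chi = np.linalg.norm(mics - source, axis=1)   # distances chi_i
    eps = 0.0
    for (i, j), tau_m in measured_tdoa.items():
        eps += ((chi[j] - chi[i]) / c - tau_m) ** 2
    return np.exp(-eps / (2.0 * sigma_a ** 2))
```

As with the video part, only the forward mapping from state to TDOAs is computed; no inversion (cone intersection) is required.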

Occlusion handling
The combined audio-video data likelihood estimation for the multimodal particle filter is obtained by multiplying the corresponding audio and video parts: P(Z|S) = P_v(Z_v|S) P_a(Z_a|S). Note that the final formula is simply a product of multiple Gaussians, one per component of the measurement vector. This property allows the tracker to handle partial measurements, which can be due to occlusion of the tracked object from one of the cameras, or to missing values for some of the TDOA estimates caused by noisy or weak audio channels. In these cases, the part of the product that corresponds to the missing measurement is simply set to a constant value, meaning that the missing measurement gives no information whatsoever. Tracking can still be performed as long as there is sufficient information to localize the object, no matter which particular sensor it comes from. This allows the tracker to perform well where separate audio and video trackers would fail. We performed experiments with real data and show in a later section the recovered track of a person through an occlusion in one camera.
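The missing-measurement behavior can be sketched as follows (illustrative; each entry of `errors` is one measurement component's error, with `None` marking an absent measurement):

```python
import numpy as np

def combined_likelihood(errors, sigmas):
    """P(Z|S) as a product of Gaussian factors, one per measurement component.
    A missing component (None), such as an occluded camera or a silent audio
    channel, contributes a constant factor, i.e. no information."""
    p = 1.0
    for err, sig in zip(errors, sigmas):
        if err is None:
            continue                     # measurement absent: skip its factor
        p *= float(np.exp(-err ** 2 / (2.0 * sig ** 2)))
    return p
```

Because absent components simply drop out of the product, the remaining sensors continue to constrain the state, which is what lets tracking survive occlusion or silence.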
Occlusion handling and misdetection handling are also simplified by the underlying mechanisms of the particle filter. The PDF of the process is concentrated around the area in the system state space which the system is predicted to occupy at the next time instant, thus vastly decreasing the probability of misdetection since only the space near the predicted system state is densely sampled. If there is insufficient information available to perform tracking due to full or partial occlusion, the PDF of the process begins to widen over time, reflecting an uncertainty in the determination of the system state. The PDF still continues to be clustered around the point in the state space where the object is likely to reappear, greatly improving the chances of successfully reacquiring the object track after the occlusion clears. If the object is not detected for such a long time that the width of the PDF reaches a certain threshold, the tracker is reinitialized using a separate detection algorithm (described below) and tracking is started over.

Face detection and tracking
To initially locate people in the scene, we use a template matching algorithm on a skin color image, which works sufficiently well in the videoconferencing environment. The assumption required for the method to work is that people are facing the camera, which is usually true for videoconferencing. Our face detection algorithm is described in [18]; here, we give only a brief outline of the processing. Skin color is detected using the R/B and G/B color intensity ratios γ_rb and γ_gb for a pixel with intensity I = (R, G, B). These are compared with the "correct" values γ̂_rb(I) and γ̂_gb(I) that correspond to skin color, which are acquired by hand-localization of the face area in several sample images. Due to the nonlinearity of the camera CCDs, these reference values depend on the brightness of the pixel in the scene; the functions γ̂_rb(I) and γ̂_gb(I) are obtained by sampling face images at pixels with different intensities. A pixel is then assumed to have skin color if the following three conditions are satisfied:

Î_l ≤ I ≤ Î_h,   |γ_rb − γ̂_rb(I)| < ζ,   |γ_gb − γ̂_gb(I)| < ζ.

The first condition rejects pixels that are too dark or too bright, since they are often misrecognized as skin color pixels due to the nonlinearity of the camera CCD. The second and third conditions perform the actual testing for skin color. In our implementation, Î_l = 0.1, Î_h = 0.9, and ζ = 0.12.
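A sketch of the three-condition test (illustrative; we assume normalized channel values, interpret the brightness bound as acting on the mean channel intensity, and treat the reference ratio functions as given callables fitted from sample face images):

```python
def is_skin_pixel(rgb, gamma_rb_ref, gamma_gb_ref,
                  i_low=0.1, i_high=0.9, zeta=0.12):
    """Three-condition skin test on a pixel with normalized channels (R, G, B):
    brightness must lie in [i_low, i_high], and the R/B and G/B ratios must be
    within zeta of the brightness-dependent reference skin ratios."""
    r, g, b = rgb
    intensity = (r + g + b) / 3.0
    if not (i_low <= intensity <= i_high):   # condition 1: reject extremes
        return False
    return (abs(r / b - gamma_rb_ref(intensity)) < zeta and   # condition 2
            abs(g / b - gamma_gb_ref(intensity)) < zeta)      # condition 3
```

In the real system the reference callables are lookup tables built from hand-labeled face pixels at different brightness levels.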
Then, the image is divided into blocks of 8 × 8 pixels. These blocks are classified according to the number of skin color pixels inside, and a connected components algorithm is executed on the blocks to find skin color blobs. For every blob found, template matching is performed with a simple oval-shaped template with different template center positions and template sizes. If the best score is less than a certain threshold, the skin color blob is rejected. Otherwise, some heuristic features that are characteristic of the face image are tested (eyes, lips, nose, and forehead areas). If these features are present, the algorithm decides that a face image is found. Experimental results show that the algorithm is sufficiently fast to operate in real time, robust to illumination changes, and capable of detecting multiple faces.
After successful localization, the head tracking algorithm described in [6] is invoked on the image sequence, and the output of this subtracker constitutes the video measurements. The tracking algorithm performs head tracking using shape matching and an object color histogram. (In principle, it could be incorporated directly into the main tracker.) The head is modeled by an ellipse with a fixed vertical orientation and a fixed aspect ratio of 1.2, similar to [15]. The ellipse state is given by s = (x, y, σ), where (x, y) is the center of the ellipse and σ is the length of its minor axis. We use quasi-random points for sampling instead of the standard pseudo-random points, since such points improve the asymptotic complexity of the search (the number of points required to achieve a certain sampling error), can be generated efficiently, and are well spread in multiple dimensions (see [6] for details). For a given tolerance to tracking error, quasi-random sampling needs a significantly lower number of sampling points (about 1/2) compared to pseudo-random sampling, thereby speeding up the execution of the algorithm significantly. Our measurement model is a combination of two complementary modules (see [15] for why this is good), one that makes measurements based on the object's boundary and one that focuses on the object's interior (color histograms [19]). Figure 1 shows a sample screenshot from the face detection algorithm and three frames from the head tracking sequence in a case where two persons are present. The tracker is able to tolerate temporary occlusions and switches back to the correct target after the occlusion is cleared.

Turn-taking detection
For applications in videoconferencing, meeting recording, or surveillance, it is often desirable to know the high-level semantic structure of the scene and to provide an annotated transcript of the meeting. This information can later be used for content-based retrieval purposes. Our system can create such an annotated transcript. Currently, no speech recognition is performed; the only information available for annotation is the location and activity of the participants as provided by the tracker, from which the turn-taking events are derived. We also optionally perform acoustic beamforming using the determined position of the speaker, as provided by the tracking algorithm. Simple delay-and-sum beamforming is used, achieving an SNR gain of about 7 dB. The beamforming algorithm removes noise and interference from the recorded voice, allowing a speech recognition engine to be used on the recorded audio portions [18].

SYSTEM SETUP
To evaluate the suitability and performance of the developed tracking and event detection algorithms, we have built an experimental system that includes two cameras and two microphone arrays. We use two different setups, one targeted at videoconferencing applications and the other at ultrasonic sound localization. In this section, we briefly describe these setups.
The videoconferencing setup includes two cameras and two microphone arrays. A single high-end office PC (a dual PIII-933 MHz Dell PC under WinNT) is used. The video data is acquired using two Sony EVI-D30 color pan-tilt-zoom cameras that are mounted on two tripods to form a wide-baseline stereo pair. The pan, tilt, and zoom of these cameras are controlled by software through a computer serial port during videoconferencing. The video stream is captured using two Matrox Meteor II cards. Two microphone arrays are attached to the room wall above the cameras. Each array consists of 7 small button Panasonic microphones in a circular arrangement. The signal is digitized using a 12-bit PowerDAQ ADC board at 22.05 kHz per channel. Parallel programming is used to utilize both processors effectively, achieving a frame rate for the combined audio-visual tracking system of approximately 8 frames per second. Much higher frame rates can be achieved by performing audio and video analysis on separate networked machines.
The ultrasonic tracking system, used for more precise localization experiments, is set up in a partially anechoic room used for bat behavioral studies. The video data is acquired using two digital Kodak MotionCorder infrared cameras at a frame rate of 240 frames per second. (The room is illuminated only by infrared light during the experiments to ensure that the bat uses exclusively acoustic information for navigation.) The video stream is recorded on digital tape and later digitized using a video capture card. The audio stream is captured using seven Knowles FG3329 ultrasonic microphones arranged in an L-shaped pattern on the room floor. The bat's ultrasonic chirps consist of downward-sweeping frequency-modulated signals ranging from 20 to 50 kHz. The microphone output is digitized at 140 kHz per channel and captured using an IoTech Wavebook ADC board. Joint audio-visual bat tracking is performed using the described algorithms. The results show that the self-calibration indeed allows automatic compensation of inaccuracies in the knowledge of sensor positions.

RESULTS
We perform several experiments with synthetic and real data in both operating environments to test the performance and robustness of the tracking algorithms. First, we evaluate algorithm performance on synthetic data using fixed cameras and fixed microphone array positions. Then, we test the self-calibration ability of the algorithm by introducing an error in the microphone array position. The third experiment deals with the case when both the object and the sensors are in motion; we show that for the case of two independently rotating microphone arrays, the system can recover both the object motion and the array rotations.
We then perform experiments with real data for sound-emitting object tracking in both setups. We show that the algorithm tracks real objects well, that self-calibration is performed along with the tracking to bring the audio and video tracks into agreement, and that the algorithm is capable of tracking through occlusions.

Synthetic data
First, we test algorithm performance in a case where the ground truth is available. Using the ultrasonic tracking system setup, we synthesize the track of an object moving along a spiral trajectory (X(t), Y(t), Z(t)) for one second. All parameters of the system are taken from the real setup. The frame rate is set to 240 frames per second, corresponding to the real data. At every frame, the measurement vector corresponding to the true object position is computed. Then, random Gaussian noise with zero mean and deviation σ_v = 3% for the video measurements and σ_a = 8% for the audio measurements is added to every component of the vector. The tracker is run on the obtained synthetic data trace, and the average tracking error is computed over 128 runs for different numbers of particles. In Figure 2, the average tracking error for video-only tracking (with all acoustic measurements omitted), for audio-only tracking, and for multimodal tracking is plotted versus the number of particles. Note that the horizontal axis is logarithmic and the number of particles ranges from 1024 to 131072. It can be seen that the performance of the combined tracker is better than in both unimodal cases, and that performance increases as the number of particles grows. The smallest tracking error obtained is approximately 16.5 mm; this is an almost threefold improvement over pure object detection in every frame without tracking, which gives an error of approximately 38.3 mm.
Since the plots in Figure 2 represent only one combination of σ_v and σ_a, we also tested the performance of the tracking algorithm for different combinations of σ_v and σ_a to see whether a consistent performance improvement is obtained with a second modality. Figure 3 shows the improvement in the performance of the combined audio-video tracker relative to the performance of the audio-only tracker (i.e., the effect of adding the video modality to the tracker). The performance improvement is defined as the percentage decrease of the tracking error (if the error is halved, the improvement is 50%). Every point in the plot is computed by averaging results from 128 runs; 4096 particles were used in the simulations. Five curves are plotted for different levels of noise contamination of the audio-only (base) tracker. The values of the standard deviation of the audio measurement noise, σ_a, are shown as "Ua" in the legend, and each curve shows the dependence of the improvement on σ_v. For example, the bottom curve reflects the addition of the video modality with different degrees of contamination (σ_v varying from 2% to 10%) to a case where the audio modality is quite accurate (σ_a = 2%). Note that along the abscissa the measurements become cleaner towards the right edge of the plot. It can be seen that adding a noisy video modality to clean audio (σ_a = 2%, σ_v = 10%, left end of the bottom curve) improves the performance only slightly (by about 10%), as can reasonably be expected, whereas adding a clean video modality to a clean audio modality (σ_a = σ_v = 2%) improves the performance by about 50%, which is also reasonable. The top curve represents the opposite case, where the audio modality is contaminated significantly (σ_a = 10%); when clean video is added (right point of the top curve), the tracking error decreases by 75%, and when noisy video is added to noisy audio the improvement is again about 50%. Indeed, it can be seen from the plots that the performance improvement is about 50% when σ_v = σ_a. The performance gain is small when a noisy modality is added to a cleaner one and larger in the opposite case, but the gain is always present. Figure 4 represents the complementary case, where the audio modality is added to the video-only tracker. Five curves for different levels of noise contamination of the video-only (base) tracker are plotted, and the same trends can be observed. The important results shown by this experiment are that the performance improvement is consistent and systematic, that the modalities have the same relative importance, and that the addition of even a seriously contaminated modality to a clean one produces a noticeable performance gain when both are present and provides tracker robustness when one of the modalities is absent.
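The improvement metric used in Figures 3 and 4 can be written as a small helper (a minimal sketch; the function name is ours, not from the original implementation):

```python
def improvement(base_error, combined_error):
    """Percentage decrease in tracking error when a second modality is added.

    Halving the error (e.g., 10.0 -> 5.0) yields a 50% improvement,
    matching the definition used for the plots.
    """
    return 100.0 * (base_error - combined_error) / base_error


# Example: error halved -> 50% improvement; error quartered -> 75%.
print(improvement(10.0, 5.0))  # 50.0
print(improvement(8.0, 2.0))   # 75.0
```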
Next, the sensor motion recovery capability of the algorithm was tested. We used two L-shaped microphone arrays placed on the ground, rotating in opposite directions at different speeds of 0.5 and 0.25 radians per second. The object moves along the same spiral trajectory as before. The rotation was modeled by adding two rotation angles and two rotational velocities to the state of the system. The measurement vector was computed using the true microphone coordinates and the object position; then, random Gaussian noise with the same parameters as before was added to the measurement vector. Due to lack of space, we show only one result here, which corresponds to simultaneous tracking and sensor motion recovery using only one fixed camera. The algorithm succeeds in tracking, despite the fact that no sensor alone is sufficient to recover the full object motion and the sensors' relative geometry is constantly changing. We show the plot of the recovered sensor motion in Figure 5; the solid lines correspond to the true sensor rotation angles, and the dashed lines are the estimates computed by the tracking algorithm. The object tracking error for this set of experiments is only slightly increased (approximately 21.4 mm) compared to the case of two static arrays and two static cameras (16.5 mm). The same results were obtained for the case of two rotating cameras and one fixed microphone array.

Real data
We used the developed algorithms to track an echolocating bat in a quiet room. The bat is allowed to fly freely in the flight area and to hunt for a food item (a mealworm) suspended from the ceiling. In earlier experiments, we noticed a disagreement between the bat trajectories recovered by audio and by video means, although their shapes were similar. This was attributed to the fact that the microphone coordinates were determined from the image of the microphone array in two calibrated cameras, which is not very accurate for points far from the area where the calibration object was located. This led to the idea of adjusting the microphone array position and orientation as tracking progresses. The array is built of a long L-shaped tube with microphones attached to it, so the relative positions of the sensors within the array are known exactly. Therefore, we introduce three additional parameters into the state vector of the system: the position (x_a, y_a) of the array center and the rotation angle θ_a around the center. Since the array lies on the floor, these parameters fully describe any inaccuracy of the array placement. The tracking is performed in the nine-dimensional space (x, y, z, ẋ, ẏ, ż, x_a, y_a, θ_a) to simultaneously estimate the bat trajectory and the array position.
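The state augmentation described above can be sketched as follows. This is an illustrative outline under our own assumptions (variable names, noise scales, and the constant-velocity motion model are ours), not the paper's exact implementation: each particle carries the six object-state components plus the three array-placement parameters, and the propagation step perturbs the calibration parameters far more slowly than the object state.

```python
import numpy as np

# Nine-dimensional particle state: (x, y, z, vx, vy, vz, xa, ya, theta_a).
# The last three entries are the array-center offset and rotation angle
# that the filter estimates jointly with the object trajectory.
N = 4096
rng = np.random.default_rng(0)
particles = np.zeros((N, 9))
particles[:, 0:6] = rng.normal(scale=0.1, size=(N, 6))    # object-state prior
particles[:, 6:8] = rng.normal(scale=0.05, size=(N, 2))   # array-center offset
particles[:, 8] = rng.normal(scale=0.02, size=N)          # array rotation angle

def propagate(p, dt=1.0 / 30.0, q_obj=0.01, q_cal=1e-4):
    """Constant-velocity motion model for the object; a near-static random
    walk for the calibration parameters, which should only drift slowly."""
    p = p.copy()
    p[:, 0:3] += dt * p[:, 3:6]                           # integrate position
    p[:, 3:6] += rng.normal(scale=q_obj, size=(len(p), 3))  # velocity noise
    p[:, 6:9] += rng.normal(scale=q_cal, size=(len(p), 3))  # calibration drift
    return p
```

The small process noise on the calibration components is the key design choice: it lets the estimates (x_a, y_a, θ_a) converge to stable values, as observed in the experiment, instead of wandering with the object.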
The results from one of the cases are shown in Figure 6. The bat flies from right to left, and the plot shows a plan view of the room. The solid line corresponds to the bat position estimated by video means only. The crosses are the audio estimates; they are discrete because the bat emits echolocation calls only intermittently. The bat's behavioral pattern can be seen in the picture: infrequent vocalizations at the beginning of the trajectory (search stage), a series of frequent calls in the middle (target approach stage), and the subsequent silence (target capture stage); after that, the bat is again in search mode. It can be seen from the track that there is a disagreement of about 0.2 meters between the video and audio position estimates. The multimodal tracker with a fixed microphone array position estimated from video is run first. Its output is shown in the plot with a dashed line. The combined trajectory correctly lies in between the audio and video tracks. Still, it is desirable to eliminate the misalignment between the modalities; to do so, we run the self-adjusting tracker. Its output is shown with a dotted line. The new trajectory lies substantially closer to the video estimate, and after a while the parameters describing the array shift (x_a, y_a, θ_a) stabilize around the values (−0.22, −0.17, 0.067), which presumably correspond to the error in the array placement. The experiment shows that the tracker successfully recovers both the bat trajectory and the error in the sensor placement.

Occlusion handling
Another advantage of the proposed multimodal tracking algorithm is its ability to handle the temporary absence of some measurements. As described before, this is done by setting the factors of the cumulative data likelihood that correspond to the missing measurements to constant values. For video, the measurement is marked as missing if the face detector was not able to find a face in an image. For the audio TDOA values, the measurement is not used if it does not pass certain consistency checks (more details in [18]). To demonstrate the possibility of tracking through occlusion, in Figure 7 we show a case of a speaking person tracked in a videoconferencing setup.
The plot shows the coordinates of a speaking person moving from left to right while also moving down and up. The video-only trajectory estimate is shown as a solid line and is obtained using the face detector described previously. The crosses show the successive audio estimates of the speaker position. The audio localization is less accurate in the videoconferencing setup, since the array baseline and the sampling frequency are substantially smaller than in the anechoic-room setup. Still, the audio estimates follow the video track quite well (note that the whole vertical-axis span is only 0.5 meters). We simulate the occlusion of the face in one camera's field of view by omitting the measurements from that camera while the person is within the marked rectangular area on the plot. The output of the tracker is shown as the dotted line. Tracking is performed successfully using the partial measurements; the tracker output deviates from the video trajectory during the occlusion, since the audio information receives a higher relative weight, but it stays close to the correct trajectory. The tracker recovers quickly once the occlusion is cleared.
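The masking of missing measurements can be sketched as a single likelihood routine (a simplified sketch under our own assumptions: independent scalar Gaussian factors per modality, with `None` marking a missing measurement; the names are ours):

```python
import numpy as np

def joint_likelihood(measurements, predictions, sigmas, missing_const=1.0):
    """Cumulative data likelihood over modalities for one particle.

    A factor whose measurement is missing (None) is replaced by a
    constant, so the particle weights are driven only by the sensors
    that are still reporting (e.g., audio during a camera occlusion).
    """
    L = 1.0
    for z, z_hat, s in zip(measurements, predictions, sigmas):
        if z is None:                 # face not detected, or TDOA rejected
            L *= missing_const        # constant factor: no influence on weights
        else:
            L *= np.exp(-0.5 * ((z - z_hat) / s) ** 2)
    return L

# With the video measurement missing, the weight depends on audio alone.
w = joint_likelihood([0.3, None], [0.25, 0.0], [0.1, 0.05])
```

Because the constant multiplies every particle's weight equally, it cancels out after normalization, which is why the tracker degrades gracefully rather than failing when a modality drops out.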

Annotated meeting recording
The developed multimodal tracking system has the ability to detect a change in the active speaker and to rotate the active videoconferencing camera to point at the currently active speaker. In addition, the algorithm segments the audio recording of a meeting into pieces corresponding to the activity of individual speakers. We collected multimodal data during three simulated meetings of different types: a lecture-type meeting, where there is one primary speaker and occasional short interruptions occur; a seminar-type meeting, where speaker roles are equal and the typical length of a speech segment by one person is significant; and an informal talk or chat between participants, where speaker changes and interruptions are quite frequent. Figure 8 shows the sequence of speaker changes for these three sequences. The time axis is horizontal and covers 80 seconds of meeting time. The bold line in the plot indicates the active speaker. Small icons attached to the tracks show the identities of the individual speakers, automatically captured and stored by the system. An audio recording of the meeting, enhanced by beamforming, is subdivided according to the turn-taking sequence and later used to select the parts corresponding to the activity of individual speakers. A separate graphical user interface can be used later to retrieve several such recordings at once and to selectively play back recordings, or parts of recordings, containing only the speaker(s) of interest.

SUMMARY AND CONCLUSIONS
We have developed a multimodal sensor fusion tracking algorithm based on particle filtering. The posterior distribution of the system's intrinsic parameters and the tracked object's position is approximated with a set of points in the combined system-object state space. Experimental results from the developed real-time system are presented, showing that the tracker is able to seamlessly integrate multiple modalities, cope with the temporary absence of some measurements, and perform self-calibration of a multisensor system simultaneously with object tracking.

Figure 1: Sample screenshot from the face detection algorithm and three frames from a sequence of head tracking.

Figure 3: Percentage improvement in the performance of the audio-video tracker versus the performance of the audio-only tracker for different combinations of audio and video measurement uncertainty.

Figure 4: Percentage improvement in the performance of the audio-video tracker versus the performance of the video-only tracker for different combinations of audio and video measurement uncertainty.

Figure 6: An object track recovered with and without self-calibration.

Figure 7: Track of the (X, Y )-coordinates of a person through a simulated occlusion.

Figure 8: Three samples of turn-taking sequences (speaker versus time). Speaker icons show the identity of each speaker as automatically captured by the system.