- Research Article
- Open Access

# A Hierarchical Estimator for Object Tracking

- Chin-Wen Wu¹
- Yi-Nung Chung² (corresponding author)
- Pau-Choo Chung¹

**2010**:592960

https://doi.org/10.1155/2010/592960

© Chin-Wen Wu et al. 2010

**Received: **17 November 2009

**Accepted: **14 May 2010

**Published: **10 June 2010

## Abstract

A closed-loop local-global integrated hierarchical estimator (CLGIHE) approach for object tracking using multiple cameras is proposed. The Kalman filter is used in both the local and the global estimators. In contrast to existing approaches, where the local and global estimations are performed independently, the proposed approach combines the local and global estimates into one for mutual compensation. Consequently, the Kalman-filter-based data fusion optimally adjusts the fusion gain based on the environment conditions derived from each local estimator. The global estimation outputs are included in the local estimation process, so closed-loop mutual compensation between the local and global estimations is achieved, yielding higher tracking accuracy. A set of image sequences from multiple views is used to evaluate performance. Computer simulation and experimental results indicate that the proposed approach successfully tracks objects.


## 1. Introduction

Visual object tracking is an important issue in computer vision, with applications in visual surveillance, human behavior analysis, maneuvering target tracking, and traffic monitoring. Visual tracking algorithms fall into two main classes: target representation and localization algorithms, and filtering and data association algorithms [1]. In target representation and localization algorithms, a moving object is typically tracked by matching it across consecutive frames using features such as edges, regions, shape, texture, position, and color. Comaniciu et al. [1] presented a kernel-based framework for tracking nonrigid objects. The mean shift algorithm [2] repeatedly moves each data point to the mean of the samples around it; it is computationally efficient and tracks well, but it tends to converge to a local maximum.

In filtering and data association algorithms, a state estimation method is used to model the dynamic system of visual tracking. The state space approach recursively estimates the state vector in two consecutive stages: prediction and updating. In the prediction step, the prior estimate of the current state is derived using a dynamic equation. In the updating step, the posterior estimate of the state is computed from the measurements. A state space approach that incorporates measurements into existing object tracks within the Kalman filtering framework was developed in [3]. Cui et al. [4] presented a laser-based method for tracking people in dense crowds. Particle filters, which implement a recursive Bayesian filter by Monte Carlo integration, have also been proposed [5, 6]. The key idea is to represent the required posterior by a set of random samples with associated weights. A particle filter can deal effectively with clutter and ambiguous situations; however, if the state vector is high dimensional, its computational cost becomes very large [7–9]. Cheng and Hwang [10] combined a Kalman filter with particle sampling for multiple-object video tracking.
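The sample-based representation of the posterior described above can be illustrated with a minimal bootstrap particle filter. This is a sketch under stated assumptions, not the method of [5, 6]: it assumes a 1D state, a random-walk motion model, a Gaussian measurement likelihood, and multinomial resampling, and the function name `particle_filter_step` and the noise levels `q_std` and `r_std` are hypothetical.

```python
import numpy as np


def particle_filter_step(particles, weights, z, rng, q_std=1.0, r_std=0.5):
    """One predict-weight-resample cycle of a 1D bootstrap particle filter (hypothetical helper)."""
    # Predict: move every sample through a random-walk motion model.
    particles = particles + rng.normal(0.0, q_std, size=particles.shape)
    # Weight: multiply by the Gaussian likelihood of measurement z, then normalize.
    weights = weights * np.exp(-0.5 * ((z - particles) / r_std) ** 2)
    weights = weights / weights.sum()
    # Resample (multinomial) so that samples concentrate where the posterior is high.
    n = len(particles)
    idx = rng.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)


rng = np.random.default_rng(0)
particles = rng.normal(0.0, 5.0, size=1000)   # broad prior over the state
weights = np.full(1000, 1.0 / 1000)
for z in [1.0, 1.2, 1.1, 0.9]:
    particles, weights = particle_filter_step(particles, weights, z, rng)
estimate = float(np.sum(weights * particles))  # weighted posterior mean
```

Every cycle touches all samples, and the number of samples needed to cover the state space typically grows quickly with its dimension, which is why the cost becomes large for high-dimensional states.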

In the tracking procedure, once measurements are received, data association must be applied to determine which measurement originates from which predicted object. Several data association algorithms have been developed, such as probabilistic data association (PDA) and joint probabilistic data association (JPDA) [11]. The PDA approach for multitarget tracking presented by Kershaw and Evans [12] reduces the complexity of more sophisticated algorithms by considering only the few most likely hypotheses.
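The PDA idea of weighting validated measurements by how likely each one is to be target-originated can be sketched as follows. This assumes the standard single-target, parametric PDA weighting with Poisson clutter of spatial density `clutter_density`, detection probability `p_detect`, and gate probability `p_gate`; the function name and all parameter values are illustrative, and the sketch does not reproduce the specific algorithm of [12].

```python
import numpy as np


def pda_weights(residuals, S, clutter_density, p_detect=0.9, p_gate=0.99):
    """Association probabilities for validated measurements (hypothetical helper).

    residuals : (m, d) array, one innovation nu_j = z_j - H x_pred per row
    S         : (d, d) innovation covariance
    Returns (beta_0, betas): probability that none / each measurement is correct.
    """
    S_inv = np.linalg.inv(S)
    # Squared Mahalanobis distance of each residual, then its unnormalized likelihood.
    maha = np.einsum("ij,jk,ik->i", residuals, S_inv, residuals)
    e = np.exp(-0.5 * maha)
    # Weight of the hypothesis that every validated measurement is clutter.
    b = (clutter_density * np.sqrt(np.linalg.det(2.0 * np.pi * S))
         * (1.0 - p_detect * p_gate) / p_detect)
    total = b + e.sum()
    return b / total, e / total


residuals = np.array([[0.1, -0.2], [1.5, 1.0], [-2.0, 0.5]])
S = np.eye(2)
beta0, betas = pda_weights(residuals, S, clutter_density=0.01)
combined_innovation = betas @ residuals  # replaces the single innovation in the update
```

Instead of committing to one measurement, the state update then uses the probability-weighted combined innovation.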

Occlusion is an essential challenge in tracking moving objects, and a number of recent studies have therefore used multiple views to handle it [13–19]. In [13], a recursive stereo algorithm was developed; it uses an extended Kalman filter to recursively estimate the 3D motion and depth of moving objects. In [14], a discrete relaxation approach for reducing the intrinsic combinatorial complexity was introduced; the algorithm uses prior knowledge from the 2D tracking in each view to achieve real-time 3D tracking. Hu et al. [15] proposed a framework for tracking multiple people based on uncalibrated occlusion reasoning. Khan and Shah [16] presented a tracking system based on the fields of view (FOVs) of multiple cameras. Another 3D object tracking method that uses multiple views was presented in [17]. Ercan et al. [18] proposed a particle-based framework for single-object tracking with occlusions in a camera network. This approach requires prior knowledge of the environment and of each camera's FOV to estimate the likelihood that the object is occluded from a given camera's view. Furthermore, it does not address data fusion. Multiple-view data fusion systems have been investigated in several studies [20, 21].

Several studies on hierarchical data fusion [22–26] have also been conducted. Majji et al. [22] presented an algorithm using centralized hierarchical fusion. However, their system does not feed information back to the local filters to modify their estimates, so it cannot achieve true local-global integration to obtain highly accurate estimates. Ajgl et al. [23] discussed various fusion approaches and showed that hierarchical fusion with Millman's formula has the best performance. Wang et al. [24] developed a two-stage hierarchical framework with partial feedback and applied it to compressed video; its local estimators consist of motion, color, and face detectors. However, the measurements of some local estimators in this scheme are not always available because of intracoded frame prediction. Strobel et al. [25] presented a joint audio-video object tracking method based on decentralized Kalman filters. The front-end local estimation uses two Kalman filters, one that tracks objects from video and one that tracks objects from audio. Their outputs are passed through two inverse Kalman filters to recover measurements, which are then fed to another Kalman filter for global fusion to obtain the final tracking result. Because it uses both Kalman filtering and inverse Kalman filtering, the method is relatively time consuming. Furthermore, it is an open-loop mechanism, so mutual compensation between the global and local estimates cannot be achieved. Medeiros et al. [26] proposed a cluster-based Kalman filter algorithm for object tracking in a wireless camera (sensor) network. In their approach, sensors that detect the same object are grouped into a cluster, and the information sensed by each sensor in the cluster is sent to the cluster head, where it is aggregated by a Kalman filter. The Kalman filter is divided into blocks to improve computational efficiency, and an innovative communication protocol between the individual sensors and the cluster head was developed. However, how to improve the tracking efficiency through a local-global hierarchical fusion mechanism was not discussed.

In contrast to existing approaches, the present study proposes a closed-loop local-global integrated hierarchical estimator (CLGIHE) for object tracking using multiple cameras. The Kalman filter is used to combine the local and global estimates into one estimate for mutual compensation because it can be integrated efficiently into a hierarchical fusion algorithm. The local estimates are input into the global fusion, and the resulting global estimate is fed back to the local estimators, so that both the local and the global estimates are improved iteratively. In the derived global fusion equations, the covariances of all local estimators, which reflect their environment conditions, are used to adjust the fusion gain, so the tracking is dynamically tuned toward the optimal estimate. Mutual compensation between the local and global estimates is thus achieved, yielding more accurate position estimates.

The rest of this paper is organized as follows. Section 2 provides a brief overview of the proposed system. The proposed object tracking with hierarchical estimation is described in Section 3. The simulation and experimental results of the proposed approach are described in Section 4. Finally, the conclusions are given in Section 5.

## 2. System Overview

The Global Estimator fuses the estimates obtained by the Local Estimators to produce more accurate 3D global estimates. Object tracking relies primarily on the measurements received by the local estimators, whose estimates are integrated by a data fusion algorithm to form the global estimate. The fusion algorithm takes into account that different local estimators have different reliabilities, so that the best overall estimate is obtained; this improves both robustness and accuracy. After the global estimate is produced, the estimated 3D position of the tracked object is fed back to the local filters to correct their estimated states.
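As a simple illustration of weighting local estimators by their reliability, the sketch below fuses independent local estimates by inverse-covariance (information) weighting, which for uncorrelated errors gives the same result as the Millman's-formula fusion mentioned in Section 1. It is an assumption-laden stand-in, not the CLGIHE fusion itself, which uses a Kalman filter and a feedback loop; the function name `fuse_estimates` is hypothetical.

```python
import numpy as np


def fuse_estimates(estimates, covariances):
    """Inverse-covariance weighted fusion of independent estimates (hypothetical helper)."""
    infos = [np.linalg.inv(P) for P in covariances]  # information matrices P_i^{-1}
    P_fused = np.linalg.inv(sum(infos))              # fused covariance
    # Each estimate contributes in proportion to its information matrix.
    x_fused = P_fused @ sum(info @ x for info, x in zip(infos, estimates))
    return x_fused, P_fused


x_fused, P_fused = fuse_estimates(
    [np.array([1.0, 0.0, 0.0]), np.array([3.0, 0.0, 0.0])],
    [np.eye(3), 3.0 * np.eye(3)],  # the second estimator is three times less certain
)
```

The less reliable estimate pulls the result only a quarter of the way toward itself, and the fused covariance is smaller than either input covariance.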

Suppose that there are $N$ camera pairs and a total of $M$ objects in the system. The local and global estimates are modeled in world coordinates, whereas 3D measurements are reconstructed by each camera pair. Motion segmentation is performed in each image plane; for example, background subtraction is used to detect a moving object and thereby obtain a measurement for the local estimate. After the measurement has been reconstructed and assigned to a local estimator, that local filter performs state estimation with the measurement.
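Each camera pair turns two image positions into one 3D measurement. The sketch below assumes the simplest case, a rectified stereo pair with a shared focal length `focal_px` (in pixels), principal point (`cx`, `cy`), and horizontal baseline `baseline_m`; the paper's camera model may be more general (for example, triangulation from arbitrary calibrated views as in [27]), and the function name is hypothetical.

```python
def triangulate_rectified(u_left, v_left, u_right, focal_px, baseline_m, cx, cy):
    """3D point (left-camera frame) from matched pixels in a rectified stereo pair."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("a point in front of the rig needs positive disparity")
    Z = focal_px * baseline_m / disparity  # depth is inversely proportional to disparity
    X = (u_left - cx) * Z / focal_px
    Y = (v_left - cy) * Z / focal_px
    return X, Y, Z


point = triangulate_rectified(440.0, 260.0, 400.0, focal_px=800.0,
                              baseline_m=0.2, cx=320.0, cy=240.0)
```

Because depth varies as the inverse of disparity, a one-pixel matching error produces a larger 3D error for distant objects, which is one reason different camera pairs have different reliabilities.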

The following nomenclature is used throughout this study: the subscript $i$ denotes the $i$th local estimate; a hat ($\hat{\cdot}$) denotes an estimate; the superscript $T$ denotes transpose; "−" denotes the a priori estimate and "+" the a posteriori estimate; $\mathbf{P}_i$ denotes the local covariance matrix and $\mathbf{K}_i$ the Kalman gain of the local estimate; $\hat{\mathbf{x}}$, $\mathbf{P}$, and $\mathbf{K}$ denote the global estimate, its covariance matrix, and its Kalman gain, respectively; $\mathbf{I}$ denotes an identity matrix; and $n$ denotes the dimension of the state vector $\mathbf{x}$.

## 3. Proposed Hierarchical Estimator for Object Tracking

The algorithm for CLGIHE is described in this section. The basic idea of the proposed fusion algorithm is to combine local and global estimates hierarchically for object tracking. Each local estimator produces a 3D position estimate from the information perceived by its camera pair, and the local estimates are then sent to the global estimator to generate a global estimate of the object.

### Local Estimate

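The motion of an object in the $i$th local estimator is modeled as

$$\mathbf{x}_i(k+1) = \boldsymbol{\Phi}\,\mathbf{x}_i(k) + \boldsymbol{\Gamma}\,\mathbf{w}(k), \qquad i = 1, \dots, N, \tag{2}$$

where $\mathbf{x}_i(k+1)$ and $\mathbf{x}_i(k)$ are the state vectors at times $k+1$ and $k$, respectively; $N$ is the number of camera pairs, since one Kalman filter is used for each local estimate formed from two camera views; and $\boldsymbol{\Phi}$ and $\boldsymbol{\Gamma}$ are the state transition and noise coupling matrices, respectively. The system noise $\mathbf{w}(k)$ associated with the moving object at frame $k$ is assumed to be white Gaussian noise with zero mean and covariance matrix $\mathbf{Q}(k)$.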

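The measurement model of the $i$th local estimator is

$$\mathbf{z}_i(k) = \mathbf{H}_i\,\mathbf{x}_i(k) + \mathbf{v}_i(k), \tag{3}$$

where the measurement $\mathbf{z}_i(k)$ is formed by a pair of image positions of the $i$th local estimator at time $k$, $\mathbf{H}_i$ is the observation matrix of filter $i$, and $\mathbf{v}_i(k)$ is the measurement error, which is assumed to be white Gaussian noise with zero mean and covariance matrix $\mathbf{R}_i(k)$.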

For the dynamic system defined by (2) and (3), the Kalman filter for each camera pair consists of the standard state prediction and update equations given in [3].
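The local estimator's prediction and update steps can be sketched as follows. The sketch assumes a constant-velocity model with state $[p_x, p_y, p_z, v_x, v_y, v_z]$, white acceleration noise, and a reconstructed 3D position as the measurement; the names `constant_velocity_model` and `kalman_step` and all numerical values are hypothetical, and `Phi`, `Gamma`, `Q`, `H`, and `R` play the roles of the state transition, noise coupling, process noise covariance, observation, and measurement noise covariance matrices described above.

```python
import numpy as np


def constant_velocity_model(dt, q):
    """Matrices for a 3D constant-velocity model, state [px, py, pz, vx, vy, vz]."""
    I3, Z3 = np.eye(3), np.zeros((3, 3))
    Phi = np.block([[I3, dt * I3], [Z3, I3]])       # state transition
    Gamma = np.vstack([0.5 * dt**2 * I3, dt * I3])  # acceleration noise coupling
    Q = q * np.eye(3)                               # acceleration noise covariance
    H = np.hstack([I3, Z3])                         # observe 3D position only
    return Phi, Gamma, Q, H


def kalman_step(x, P, z, Phi, Gamma, Q, H, R):
    """One predict/update cycle; z=None keeps the prediction (no measurement)."""
    # Prediction: a priori state and covariance.
    x = Phi @ x
    P = Phi @ P @ Phi.T + Gamma @ Q @ Gamma.T
    if z is None:
        return x, P
    # Update: a posteriori state and covariance.
    S = H @ P @ H.T + R                # residual covariance
    K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P


Phi, Gamma, Q, H = constant_velocity_model(dt=1.0, q=0.01)
R = 0.05 * np.eye(3)
x, P = np.zeros(6), 10.0 * np.eye(6)
for k in range(1, 21):
    # Object moves by (1.0, 0.5, 0.0) per frame; frame 10 has no measurement.
    z = None if k == 10 else np.array([1.0 * k, 0.5 * k, 2.0])
    x, P = kalman_step(x, P, z, Phi, Gamma, Q, H, R)
```

Passing `None` for a frame without a measurement keeps only the prediction, which corresponds to the occlusion handling described in Section 4.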

This process is repeated at each time instant in every local tracking process: each iteration produces the estimate for the current instant, and the system then updates that estimate recursively at the next instant.

### 3.1. Data Association

… $O_7$. Then the measurements $O_1$, $O_2$, $O_3$, …, and $O_5$, whose association with the objects has to be determined, remain. A suboptimal Bayesian approach, denoted 1-step conditional maximum likelihood, is applied to determine the association between the remaining measurements and the objects. Let $\mathbf{S}_t(k)$ be the residual covariance matrix and $\boldsymbol{\nu}_{tj}(k)$ the measurement residual vector at time $k$. In each local estimator, 1-step conditional maximum likelihood is used to obtain the state estimate from all the valid measurements. The Gaussian likelihood of associating measurement $j$ with object $t$ is

$$\Lambda_{tj}(k) = \frac{1}{(2\pi)^{m/2}\,\lvert\mathbf{S}_t(k)\rvert^{1/2}}\exp\!\left(-\frac{1}{2}\,\boldsymbol{\nu}_{tj}^{T}(k)\,\mathbf{S}_t^{-1}(k)\,\boldsymbol{\nu}_{tj}(k)\right),$$

where $m$ is the dimension of the measurement vector.
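The likelihood-based association above can be sketched as follows. The validation gate, the greedy one-to-one assignment by decreasing likelihood, and the function names `gaussian_likelihood` and `associate` are assumptions made for illustration; the paper's 1-step conditional maximum likelihood procedure may differ in detail. The gate value 9.21 is the 99% chi-square quantile for a 2D residual.

```python
import numpy as np


def gaussian_likelihood(nu, S):
    """N(nu; 0, S): likelihood of residual nu under residual covariance S."""
    d = len(nu)
    maha = nu @ np.linalg.solve(S, nu)
    return np.exp(-0.5 * maha) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(S))


def associate(predicted, S_list, measurements, gate=9.21):
    """Greedy one-to-one assignment by decreasing likelihood (hypothetical helper).

    predicted    : predicted measurements H x^- (one per object)
    S_list       : residual covariances S (one per object)
    measurements : measurements still to be associated
    Returns {object_index: measurement_index}.
    """
    candidates = []
    for t, (z_pred, S) in enumerate(zip(predicted, S_list)):
        for j, z in enumerate(measurements):
            nu = z - z_pred
            if nu @ np.linalg.solve(S, nu) <= gate:  # validation gate
                candidates.append((gaussian_likelihood(nu, S), t, j))
    pairs, used_t, used_j = {}, set(), set()
    for _, t, j in sorted(candidates, reverse=True):  # most likely pairs first
        if t not in used_t and j not in used_j:
            pairs[t] = j
            used_t.add(t)
            used_j.add(j)
    return pairs


predicted = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
S_list = [np.eye(2), np.eye(2)]
measurements = [np.array([5.2, 4.9]), np.array([0.1, -0.3]), np.array([20.0, 20.0])]
pairs = associate(predicted, S_list, measurements)
```

The third measurement falls outside every gate and is left unassigned, which is how clutter is kept out of the state update in this sketch.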

### 3.2. Global Estimate with Data Fusion

where $L$ is the total number of measurements obtained from the local estimators for the tracked object. The system noise $\mathbf{w}(k)$ associated with the moving object at step $k$ is assumed to be white Gaussian noise with zero mean and covariance matrix $\mathbf{Q}(k)$; $\mathbf{H}$ is the global observation matrix; and $\mathbf{v}(k)$ is the measurement error, which is assumed to be white Gaussian noise with zero mean and covariance matrix $\mathbf{R}(k)$.

where $\mathbf{I}_6$ is a 6-by-6 identity matrix and $\mathbf{0}_6$ is a 6-by-6 matrix of zeros.

If object $t$ is seen by camera pair $i$, the output of the $i$th local estimator is fed into the global estimator through the global observation matrix $\mathbf{H}$. Otherwise, camera pair $i$ provides no measurement, and no update from it is needed.

In summary, each local estimate $\hat{\mathbf{x}}_i$ is computed by its local estimator using (4), and all local estimates are then sent to the global estimator. The global estimate $\hat{\mathbf{x}}$ in (22) is obtained by performing data fusion in the global estimator. The global estimate is then sent back to each local estimator to update the estimate of the local state vector, from which the updated local estimate can be determined.
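The closed loop summarized above can be sketched as one global Kalman step followed by feedback. The sketch assumes that the global measurement stacks the a posteriori estimates of the camera pairs that currently see the object, that the global observation matrix is a stack of identity matrices, that each local covariance is used as the corresponding measurement noise block, and that feedback simply overwrites each local state with the global estimate. These choices and the names `global_step` and `block_diag` are hypothetical and are not necessarily the paper's equation (22).

```python
import numpy as np


def block_diag(mats):
    """Block-diagonal matrix built from a list of square blocks."""
    n = sum(m.shape[0] for m in mats)
    out = np.zeros((n, n))
    i = 0
    for m in mats:
        k = m.shape[0]
        out[i:i + k, i:i + k] = m
        i += k
    return out


def global_step(x_g, P_g, local_x, local_P, Phi, GQG):
    """Predict the global state, then update it with the available local estimates.

    local_x / local_P hold only the camera pairs that currently see the object,
    so an occluded pair simply drops out of the stacked measurement.
    GQG is the process noise term Gamma Q Gamma^T.
    """
    n = len(x_g)
    x_g = Phi @ x_g
    P_g = Phi @ P_g @ Phi.T + GQG
    if not local_x:
        return x_g, P_g  # no local estimate available: prediction only
    z = np.concatenate(local_x)                # stacked local estimates
    H = np.vstack([np.eye(n)] * len(local_x))  # [I; I; ...; I]
    R = block_diag(local_P)                    # local covariances as noise blocks
    S = H @ P_g @ H.T + R
    K = P_g @ H.T @ np.linalg.inv(S)
    x_g = x_g + K @ (z - H @ x_g)
    P_g = (np.eye(n) - K @ H) @ P_g
    return x_g, P_g


# Two camera pairs see the object; the second is less reliable.
n = 6
Phi, GQG = np.eye(n), np.zeros((n, n))
x_g, P_g = np.zeros(n), 100.0 * np.eye(n)
local_x = [np.ones(n), 3.0 * np.ones(n)]
local_P = [np.eye(n), 3.0 * np.eye(n)]
x_g, P_g = global_step(x_g, P_g, local_x, local_P, Phi, GQG)
# Feedback (assumed form): each local estimator continues from the global estimate.
local_x = [x_g.copy() for _ in local_x]
```

Treating the fed-back local estimates as independent measurements ignores the correlation that feedback introduces between them, so the fused covariance from this simplified sketch can be optimistic.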

## 4. Experimental Results

To evaluate performance, the proposed CLGIHE algorithm was compared with Austere's method and Kim's method [28] using computer simulation and real image sequences. Because Austere's and Kim's methods specify only the fusion step and not the local filters, the Kalman filter was used as the local filter for both fusion algorithms to provide a fair comparison.

In the simulation, the state noise, measurement noise, and 3D object positions were created using synthetic data generators. The measurement data were obtained by applying a homogeneous transformation of the two-camera model to the object positions and adding measurement errors. Kalman filters were used to estimate the local state vectors. Once the measurement data were received, the corresponding probability was calculated for each hypothesis. The conditional estimate of the object states was evaluated for each hypothesis, and the individual estimates were combined, weighted by the corresponding probabilities. The performance of multiple-view tracking was simulated under epipolar geometry.
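The homogeneous projection used to generate synthetic measurements can be sketched with a pinhole model. The intrinsic matrix, the 0.2 m baseline, the Gaussian pixel noise `pixel_std`, and the function names are assumptions for illustration, not the paper's simulation settings.

```python
import numpy as np


def project(P_cam, X_world):
    """Pinhole projection of a 3D point with a 3x4 camera matrix."""
    x_h = P_cam @ np.append(X_world, 1.0)  # homogeneous image point
    return x_h[:2] / x_h[2]


def synthetic_measurement(P_left, P_right, X_world, pixel_std, rng):
    """Noisy image positions [u_l, v_l, u_r, v_r] of one point seen by a camera pair."""
    uv_left = project(P_left, X_world) + rng.normal(0.0, pixel_std, 2)
    uv_right = project(P_right, X_world) + rng.normal(0.0, pixel_std, 2)
    return np.concatenate([uv_left, uv_right])


K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
# Right camera centre 0.2 m along +x, so world points are shifted by -0.2 m.
P_right = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])
rng = np.random.default_rng(1)
X = np.array([0.6, 0.1, 4.0])
z_clean = synthetic_measurement(P_left, P_right, X, 0.0, rng)
z_noisy = synthetic_measurement(P_left, P_right, X, 1.0, rng)
```

Setting `pixel_std` to zero gives the noise-free projection, which is convenient for checking the camera matrices before noise is added.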

When objects are occluded, observations are unavailable: a camera pair that cannot see an object provides no measurement of it. In this situation, the local predicted state is not updated until new observations are generated, and the global estimate is updated using only the available camera pairs.

In the initial step of the experiment, the local and global estimators were initialized, and background subtraction [30] was used to separate the moving foreground objects. The measurement for each local estimator was obtained from two camera views, that is, a camera pair. Each local estimator ran its Kalman filter to update the estimated state and the Kalman gain, and its output was sent to the global estimator. The global estimate, that is, the estimated 3D position of the tracked object, was then computed using (22).
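A minimal background subtraction step can be sketched with a running-average background and a fixed threshold. This is a simpler stand-in for the method of [30], which additionally handles ghosts and shadows; `alpha`, `threshold`, and the function names are hypothetical.

```python
import numpy as np


def update_background(background, frame, alpha=0.05):
    """Running-average background model: B <- (1 - alpha) * B + alpha * F."""
    return (1.0 - alpha) * background + alpha * frame


def foreground_centroid(background, frame, threshold=25.0):
    """Centroid (row, col) of pixels differing from the background by > threshold."""
    mask = np.abs(frame.astype(float) - background) > threshold
    if not mask.any():
        return None  # nothing detected, e.g. the object is occluded
    rows, cols = np.nonzero(mask)
    return float(rows.mean()), float(cols.mean())


background = np.zeros((48, 64))
frame = np.zeros((48, 64))
frame[10:20, 20:30] = 255.0  # a bright 10x10 "object"
centroid = foreground_centroid(background, frame)
background = update_background(background, frame)
```

The centroid in each view of a camera pair would then serve as the image position from which the 3D measurement is reconstructed.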

Average geometric error for test sequences.

Average MSE for test sequences.

Method | Sequence 1 | Sequence 2 | Sequence 3 |
---|---|---|---|
Proposed | 12.9327 | 13.2828 | 13.5486 |
Kim | 13.0883 | 13.3073 | 13.5314 |
Austere | 13.1239 | 13.5741 | 13.8357 |
Local 1 | 14.4163 | 14.6282 | 15.9262 |
Local 2 | 14.4579 | 15.9337 | 16.7499 |
Local 3 | 13.9209 | 13.4702 | 13.8708 |
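The error measures are not defined in the text available here. Assuming that the geometric error is the per-frame Euclidean distance between the estimated and true 3D positions and that the MSE is the mean of its square, both averages could be computed as in the sketch below; the function name `tracking_errors` is hypothetical.

```python
import numpy as np


def tracking_errors(estimated, ground_truth):
    """Average Euclidean (geometric) error and MSE over T frames of 3D positions."""
    diff = np.asarray(estimated, dtype=float) - np.asarray(ground_truth, dtype=float)
    dist = np.linalg.norm(diff, axis=1)  # per-frame Euclidean error
    return float(dist.mean()), float((dist ** 2).mean())


avg_err, mse = tracking_errors([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
                               [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
```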

## 5. Conclusion

A closed-loop local-global integrated hierarchical estimator (CLGIHE) approach was proposed for object tracking using multiple cameras. CLGIHE adopts the Kalman filter to build an integrated hierarchical fusion estimator because it allows the local and global estimates to be combined into one estimate for mutual compensation. Compared to existing multiple-camera Kalman-filter-based object tracking approaches, CLGIHE has the following advantages. Firstly, it is implemented with a feedback loop to achieve iterative optimization-based improvement from both the local and global mutual compensation. Secondly, local and global estimates are integrated into one estimate to allow the optimal adjustment of the fusion gain based on environment conditions from each local estimator to obtain accurate and smooth tracking results. The simulation and experimental results show that the proposed algorithm is capable of tracking objects in various situations. Moreover, the data fusion algorithm applied to the multiple-view images reduces the probability of misdetection.

## Declarations

### Acknowledgment

This work was supported in part by National Science Council, Taiwan, under Grant NSC 98-2218-E-006-004.


## References

1. Comaniciu D, Ramesh V, Meer P: Kernel-based object tracking. *IEEE Transactions on Pattern Analysis and Machine Intelligence* 2003, 25(5):564-577. 10.1109/TPAMI.2003.1195991
2. Comaniciu D, Ramesh V, Meer P: Real-time tracking of non-rigid objects using mean shift. *Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '00)*, June 2000, 142-149.
3. Grewal MS, Andrews AP: *Kalman Filtering Theory and Practice*. Prentice Hall, Englewood Cliffs, NJ, USA; 1993.
4. Cui J, Zha H, Zhao H, Shibasaki R: Laser-based detection and tracking of multiple people in crowds. *Computer Vision and Image Understanding* 2007, 106(2-3):300-312. 10.1016/j.cviu.2006.07.015
5. Czyz J, Ristic B, Macq B: A particle filter for joint detection and tracking of color objects. *Image and Vision Computing* 2007, 25(8):1271-1281. 10.1016/j.imavis.2006.07.027
6. Hue C, Le Cadre J-P, Pérez P: Sequential Monte Carlo methods for multiple target tracking and data fusion. *IEEE Transactions on Signal Processing* 2002, 50(2):309-325. 10.1109/78.978386
7. Chang C, Ansari R: Kernel particle filter: iterative sampling for efficient visual tracking. *Proceedings of the International Conference on Image Processing (ICIP '03)*, September 2003, 977-980.
8. Bouaynaya N, Qu W, Schonfeld D: An online motion-based particle filter for head tracking applications. *Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05)*, March 2005, 225-228.
9. Shan C, Wei Y, Tan T, Ojardias F: Real time hand tracking by combining particle filtering and mean shift. *Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition (FGR '04)*, May 2004, 669-674.
10. Cheng H-Y, Hwang J-N: Adaptive particle sampling and adaptive appearance for multiple video object tracking. *Signal Processing* 2009, 89(9):1844-1849. 10.1016/j.sigpro.2009.03.034
11. Bar-Shalom Y, Fortmann T: *Tracking and Data Association*. Academic Press, New York, NY, USA; 1988.
12. Kershaw DJ, Evans RJ: Waveform selective probabilistic data association. *IEEE Transactions on Aerospace and Electronic Systems* 1997, 33(4):1180-1188.
13. Yi J-W, Oh J-H: Recursive resolving algorithm for multiple stereo and motion matches. *Image and Vision Computing* 1997, 15(3):181-196. 10.1016/S0262-8856(96)01118-3
14. Li Y, Hilton A, Illingworth J: A relaxation algorithm for real-time multiple view 3D-tracking. *Image and Vision Computing* 2002, 20(12):841-859. 10.1016/S0262-8856(02)00094-X
15. Hu W, Zhou X, Hu M, Maybank S: Occlusion reasoning for tracking multiple people. *IEEE Transactions on Circuits and Systems for Video Technology* 2009, 19(1):114-121.
16. Khan S, Shah M: Consistent labeling of tracked objects in multiple cameras with overlapping fields of view. *IEEE Transactions on Pattern Analysis and Machine Intelligence* 2003, 25(10):1355-1360. 10.1109/TPAMI.2003.1233912
17. Black J, Ellis T: Multi camera image tracking. *Image and Vision Computing* 2006, 24(11):1256-1267. 10.1016/j.imavis.2005.06.002
18. Ercan AO, El Gamal A, Guibas LJ: Object tracking in the presence of occlusions via a camera network. *Proceedings of the 6th International Symposium on Information Processing in Sensor Networks (IPSN '07)*, April 2007, 509-518.
19. Senior A, Hampapur A, Tian Y-L, Brown L, Pankanti S, Bolle R: Appearance models for occlusion handling. *Image and Vision Computing* 2006, 24(11):1233-1243. 10.1016/j.imavis.2005.06.007
20. Dockstader SL, Tekalp AM: Multiple camera fusion for multi-object tracking. *Proceedings of IEEE Workshop on Multi-Object Tracking*, July 2001, 95-102.
21. Zhou Q, Aggarwal JK: Object tracking in an outdoor environment using fusion of features and cameras. *Image and Vision Computing* 2006, 24(11):1244-1255. 10.1016/j.imavis.2005.06.008
22. Majji M, Davis JJ, Junkins JL: Hierarchical multi-rate measurement fusion for estimation of dynamical systems. *AIAA Guidance, Navigation, and Control Conference 2007*, August 2007, USA, 3967-3978.
23. Ajgl J, et al.: Millman's formula in data fusion. *Proceedings of the 10th International PhD Workshop on Systems and Control*, 2009, Prague, Czech Republic, 1-6.
24. Wang J, Achanta R, Kankanhalli M, Mulhem P: A hierarchical framework for face tracking using state vector fusion for compressed video. *Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing*, April 2003, 209-212.
25. Strobel N, Spors S, Rabenstein R: Joint audio-video object localization and tracking: a presentation general methodology. *IEEE Signal Processing Magazine* 2001, 18(1):22-31. 10.1109/79.911196
26. Medeiros H, Park J, Kak AC: Distributed object tracking using a cluster-based Kalman filter in wireless camera networks. *IEEE Journal of Selected Topics in Signal Processing* 2008, 2(4):448-463.
27. Hartley R, Zisserman A: *Multiple View Geometry in Computer Vision*. 2nd edition. Cambridge University Press, Cambridge, UK; 2003.
28. Kim KH: Development of track to track fusion algorithms. *Proceedings of the American Control Conference*, July 1994, 1037-1041.
29. Bardsley DJ, Bai L: 3D surface reconstruction and recognition. *Biometric Technology for Human Identification IV*, April 2007, Orlando, Fla, USA, Proceedings of SPIE 6539.
30. Cucchiara R, Grana C, Piccardi M, Prati A: Detecting moving objects, ghosts, and shadows in video streams. *IEEE Transactions on Pattern Analysis and Machine Intelligence* 2003, 25(10):1337-1342. 10.1109/TPAMI.2003.1233909

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.