A Hierarchical Estimator for Object Tracking
EURASIP Journal on Advances in Signal Processing volume 2010, Article number: 592960 (2010)
Abstract
A closed-loop local-global integrated hierarchical estimator (CLGIHE) approach for object tracking using multiple cameras is proposed. The Kalman filter is used in both the local and global estimates. In contrast to existing approaches where the local and global estimations are performed independently, the proposed approach combines local and global estimates into one for mutual compensation. Consequently, the Kalman-filter-based data fusion optimally adjusts the fusion gain based on environment conditions derived from each local estimator. The global estimation outputs are included in the local estimation process. Closed-loop mutual compensation between the local and global estimations is thus achieved to obtain higher tracking accuracy. A set of image sequences from multiple views is used to evaluate performance. Computer simulation and experimental results indicate that the proposed approach successfully tracks objects.
1. Introduction
Visual object tracking is an important issue in computer vision. It has applications in many fields, including visual surveillance, human behavior analysis, maneuvering target tracking, and traffic monitoring. The two main types of visual tracking algorithms are target representation and localization algorithms, and filtering and data association algorithms.
For target representation and localization algorithms, tracking a moving object typically involves matching objects in consecutive frames using features such as edge, region, shape, texture, position, and color. Comaniciu et al. [1] presented a kernel-based framework for tracking nonrigid objects. The mean shift algorithm [2] repeatedly moves data points to the sample means. It is computationally efficient and tracks well, but it tends to converge to a local maximum.
For filtering and data association algorithms, the state estimation method is used for modeling the dynamic system of visual tracking. The state space approach recursively estimates the state vector in two consecutive stages: prediction and updating. In the prediction step, the prior estimate of the current state is derived using a dynamic equation. In the updating step, the posterior estimate of the state is updated based on measurements. A state space approach which incorporates measurements into existing object tracks within the framework of Kalman filtering was developed in [3]. Cui et al. [4] presented a laser-based dense crowd tracking method. Particle filters, which are based on the Monte Carlo integration method for implementing a recursive Bayesian filter, have also been proposed [5, 6]. The key idea is to represent the required posterior estimate by a set of random samples with associated weights. A particle filter can effectively deal with clutter and ambiguous situations. However, if the dimension of the state vector is high, a particle filter has a very large computational cost [7–9]. Cheng and Hwang [10] combined a Kalman filter with particle sampling for multiple-object video tracking.
In the tracking procedure, once measurements are received, data association must be applied to determine the exact relationship between measurements and predicted objects. Several algorithms have been developed for data association, such as probabilistic data association (PDA) and joint probabilistic data association (JPDA) [11]. The PDA approach for multitarget tracking, presented by Kershaw and Evans [12], reduces the complexity associated with more sophisticated algorithms by focusing on a few most likely hypotheses.
Occlusion is considered an essential challenge in tracking moving objects. Consequently, a number of recent studies have used multiple views to handle occlusion [13–19]. In [13], a recursive algorithm for stereo was developed. The scheme uses an extended Kalman filter to recursively estimate 3D motion and the depth of moving objects. In [14], a discrete relaxation approach for reducing the intrinsic combinatorial complexity was introduced. The algorithm uses prior knowledge from 2D tracking of each view to obtain real-time 3D tracking. Hu et al. [15] proposed a framework for tracking multiple people based on uncalibrated occlusion reasoning. Khan and Shah [16] presented a tracking system based on the fields of view (FOVs) of multiple cameras. Another 3D object tracking method that uses multiple views was presented in [17]. Ercan et al. [18] proposed a particle-based framework for single-object tracking with occlusions in a camera network. This approach requires prior knowledge of the environment and the FOV of each camera for estimating the likelihood of whether the object will be occluded from the view of a camera. Furthermore, they did not address the issue of data fusion. Multiple-view data fusion systems have been investigated in several studies [20, 21].
Several studies on hierarchical data fusion [22–26] have also been conducted. Majji et al. [22] presented an algorithm using centralized hierarchical fusion. However, the system does not provide feedback to the local filters for modifying their estimates. As such, their approach cannot achieve true local-global integration to obtain a highly accurate estimate. Ajgl et al. [23] discussed various fusion approaches and showed that hierarchical fusion with Millman's formula has the best performance. Wang et al. [24] developed a two-stage hierarchical framework with partial feedback and applied it to compressed video. Local estimators consist of motion, color, and face detectors. However, the measurements of some local estimates in this scheme are not always available due to intracoded frame prediction. Strobel et al. [25] presented a joint audio-video object tracking method based on decentralized Kalman filters. The front end local estimation uses two Kalman filters, one to track objects based on video and the other to track objects based on audio. The results are then passed through two inverse Kalman filters to obtain measurements, which are applied to another Kalman filter for global fusion to obtain the final tracking result. Due to the use of both Kalman filtering and inverse Kalman filtering, the method is relatively time consuming. Furthermore, it is designed as an open-loop mechanism and thus mutual compensation between the global and local estimates cannot be achieved. Medeiros et al. [26] proposed a cluster-based Kalman filter algorithm for a wireless camera (sensor) network for object tracking. In their approach, sensors that detect the same object are grouped into a cluster and the information sensed from each individual sensor in the cluster is sent to the cluster head for aggregation by a Kalman filter. The Kalman filter is divided into blocks to improve the computation efficiency. An innovative protocol procedure between individual sensors and the cluster head was developed. However, how to improve the tracking efficiency through a local-global hierarchical fusion mechanism was not discussed.
In contrast to existing approaches, the present study proposes a closed-loop local-global integrated hierarchical estimator (CLGIHE) for object tracking using multiple cameras. The Kalman filter is used to combine the local and global estimates into one estimate for mutual compensation since it can be efficiently integrated into a hierarchical fusion algorithm. The local estimate is input into the global fusion and the obtained global estimate is fed back to the local estimator to achieve iterative optimization-based improvement in both local and global estimates. The local and global estimates are combined into one estimate using the derived equations. In the derived global fusion equations, the covariances of all the local estimators, which reflect the environment conditions, are used to adjust the fusion gain dynamically so that the tracking follows the optimal estimate. Mutual compensation between the local and global estimates is thus achieved to obtain more accurate position estimation.
The rest of this paper is organized as follows. Section 2 provides a brief overview of the proposed system. The proposed object tracking with hierarchical estimation is described in Section 3. The simulation and experimental results of the proposed approach are described in Section 4. Finally, the conclusions are given in Section 5.
2. System Overview
The proposed hierarchical tracking system (CLGIHE) consists of a Local Estimator and a Global Estimator, as shown in Figure 1. The Local Estimator uses data association and the Kalman filter for estimating the 3D position of the object using a camera pair. It should be noted that in the system, every two cameras are considered to be a camera pair. Given 2D images and the camera matrices, the positions of 3D points are computed using the triangulation method presented in [27].
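The reconstruction step for one camera pair can be sketched with the standard linear (DLT) triangulation. This is a minimal illustration, not the authors' implementation: the function names, the intrinsic matrix `K`, and the two camera matrices `P1` and `P2` are hypothetical values chosen only to make the example runnable.

```python
import numpy as np

def triangulate_dlt(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices.
    uv1, uv2: (u, v) pixel coordinates of the same point in each view.
    """
    rows = []
    for P, (u, v) in ((P1, uv1), (P2, uv2)):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.vstack(rows)
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point X with camera matrix P; returns pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical camera pair: same intrinsics, second camera shifted along x.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, -0.2, 5.0])
X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
```

With noise-free image points the recovered `X_est` matches `X_true` up to floating-point error; with noisy detections the same routine returns a least-squares solution in the algebraic sense.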
The Global Estimator performs data fusion of the estimates obtained by the Local Estimator to obtain more accurate 3D global estimates. Object tracking is achieved primarily using measurements received from local estimates, which are integrated by a data fusion algorithm to form the global estimate. When forming the result, the fusion algorithm takes into account that different local estimators have different reliability, which increases robustness and yields more accurate estimates. After the global estimate is produced, the estimated 3D position of the tracked object is fed back to the local filters to modify their estimated states.
Suppose that there are camera pairs and a total of objects in the system. In the system, the local and global estimates are modeled in world coordinates, whereas 3D measurements are reconstructed by each camera pair. Motion segmentation is applied in each image plane; for example, background subtraction is used to detect a moving object and obtain a measurement for the local estimate. After the measurement has been reconstructed and assigned to the local estimator, the state estimate is computed by the local filter using that measurement.
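As a rough illustration of how a 2D measurement might be obtained in one image plane, the sketch below differences a frame against a static background and returns the centroid of the changed pixels. It is a deliberately simplified stand-in for a full background subtraction method (no shadow or ghost handling); the function name, the threshold value, and the synthetic frame are assumptions made for this example.

```python
import numpy as np

def detect_moving_object(frame, background, threshold=30):
    """Return the (u, v) centroid of pixels that differ from the background.

    frame, background: 2D grayscale images of equal shape.
    Returns None when no pixel exceeds the threshold (no detection).
    """
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    mask = diff > threshold
    if not mask.any():
        return None
    rows, cols = np.nonzero(mask)
    return float(cols.mean()), float(rows.mean())

# Synthetic example: a bright 20x20 block on an empty background.
background = np.zeros((120, 160), dtype=np.uint8)
frame = background.copy()
frame[40:60, 100:120] = 200
centroid = detect_moving_object(frame, background)
```

The centroid from each of the two views of a camera pair would then be passed to the reconstruction step to form the 3D measurement.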
The following nomenclature is used throughout this study: denotes local estimate, "" denotes estimate, "super " denote transpose, "−" denotes the a priori estimate, "+" denotes the a posteriori estimate, denotes the local covariance matrix, denotes the Kalman gain of the local estimate, , , and denote the global estimate, the covariance matrix, and the Kalman gain, respectively, denotes an identity matrix, and denotes the dimension of state vector .
3. Proposed Hierarchical Estimator for Object Tracking
The algorithm for CLGIHE is described in this section. The basic idea of the proposed fusion algorithm with a hierarchical estimation approach is to combine local and global estimates for object tracking. The local predictor produces a 3D position estimate based on the local information perceived by a camera pair. The local estimate results are then sent to the global estimator to generate a global estimate of the object.
Local Estimate
The local estimate is computed by the Kalman filters from measurements obtained by a camera pair. Let be the estimated state vector in the th Kalman filter at step given by:
where and represent the position and velocity of the tracked object, respectively. The Kalman-filter-based local estimate is modeled as
where and are the state vectors at time and , respectively, which is the number of camera pairs since one Kalman filter is used for each local estimate from two camera views, and and are the state transition and noise coupling matrices, respectively. The system noise, , associated with the moving object at frame is assumed to be white Gaussian noise distributed with zero mean and covariance matrix .
The measurement equation can be expressed as
where the measurement is formed by a pair of image positions of the i th local estimator at time , is the observation matrix of the filter , and is the measurement error, which is assumed to be white Gaussian noise with zero mean and covariance matrix .
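For concreteness, one common model that fits the description above is the constant-velocity model with a 6-dimensional state. The symbols below (state $\mathbf{x}^{i}_{k}$, transition matrix $F$, noise coupling matrix $G$, observation matrix $H^{i}$, sampling interval $T$, noise covariances $Q$ and $R^{i}$) are assumed notation for illustration and may differ from the paper's own equations:

```latex
\begin{aligned}
\mathbf{x}^{i}_{k} &= \begin{bmatrix}\mathbf{p}^{i}_{k} \\ \mathbf{v}^{i}_{k}\end{bmatrix}, \\
\mathbf{x}^{i}_{k} &= F\,\mathbf{x}^{i}_{k-1} + G\,\mathbf{w}_{k-1},
  \qquad \mathbf{w}_{k-1}\sim\mathcal{N}(\mathbf{0},\,Q), \\
\mathbf{z}^{i}_{k} &= H^{i}\,\mathbf{x}^{i}_{k} + \mathbf{n}^{i}_{k},
  \qquad \mathbf{n}^{i}_{k}\sim\mathcal{N}(\mathbf{0},\,R^{i}), \\
F &= \begin{bmatrix} I_{3} & T\,I_{3} \\ 0_{3} & I_{3} \end{bmatrix},
  \qquad
G = \begin{bmatrix} \tfrac{T^{2}}{2}\,I_{3} \\ T\,I_{3} \end{bmatrix}.
\end{aligned}
```

Here $\mathbf{p}$ and $\mathbf{v}$ are the 3D position and velocity, so the state dimension is 6, which agrees with the 6-by-6 identity and zero matrices used later for the mapping matrix.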
According to the dynamic system defined in (2) and (3), the solution of the Kalman filter for this model for each camera pair is given by the state prediction in .
The updating step is expressed as
with error covariance
This process is repeated at each time instant in all the local tracking processes. Each iteration produces the estimate for the current instant, and the system refines it at the next instant when a new measurement arrives.
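The prediction and updating steps follow the textbook Kalman recursion, which can be sketched as below. The function names, the constant-velocity matrices, and the noise levels are assumptions chosen for this example rather than values taken from the paper, and the measurement is taken to be the reconstructed 3D position.

```python
import numpy as np

def kf_predict(x, P, F, Q):
    """Prediction step: propagate the state and its covariance one step."""
    x_prior = F @ x
    P_prior = F @ P @ F.T + Q
    return x_prior, P_prior

def kf_update(x_prior, P_prior, z, H, R):
    """Updating step: correct the prediction with measurement z."""
    S = H @ P_prior @ H.T + R               # residual covariance
    K = P_prior @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_post = x_prior + K @ (z - H @ x_prior)
    P_post = (np.eye(len(x_prior)) - K @ H) @ P_prior
    return x_post, P_post

# Assumed 6D constant-velocity model: state = [position (3), velocity (3)].
T = 1.0
F = np.block([[np.eye(3), T * np.eye(3)], [np.zeros((3, 3)), np.eye(3)]])
H = np.hstack([np.eye(3), np.zeros((3, 3))])    # only position is measured
Q = 0.01 * np.eye(6)
R = 0.25 * np.eye(3)

x, P = np.zeros(6), 10.0 * np.eye(6)
truth_velocity = np.array([1.0, 0.5, 0.0])
for k in range(1, 21):
    z = k * T * truth_velocity               # noise-free measurement for clarity
    x, P = kf_predict(x, P, F, Q)
    x, P = kf_update(x, P, z, H, R)
```

After a few iterations the velocity components of `x` settle near `truth_velocity` even though only positions are measured, which is the behaviour the local estimator relies on.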
3.1. Data Association
In a state estimation algorithm, one important procedure is data association, which can be used to determine the relationship between measurements and existing objects. Data association usually consists of two procedural steps: gating and correlation computation logic. If more than one measurement exists, the data association technique can be used to reduce the number of measurements. Figure 2 shows a typical gate diagram with three objects, P1, P2, and P3, and seven observations. The gating technique is applied to eliminate the least probable observations, such as O6 and O7. Then, the measurements O1, O2, O3, O4, and O5, whose association with the objects has to be determined, remain. A suboptimal Bayesian approach, denoted as 1-step conditional maximum likelihood, is applied to determine the association between the remaining measurements and the objects. For the above equations, let be the residual covariance matrix, and the measurement residual vector at time . In each local estimator, 1-step conditional maximum likelihood is used to obtain the state estimate from all the valid measurements. The Gaussian likelihood of associated measurement with object is
where is the determinant of . Since one object may be observed by several local filters, generating multiple estimates, the local estimates are sent to the global estimate to obtain a fused final result.
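Gating followed by a likelihood-based choice can be sketched as follows. This is a simplified illustration: it gates on the squared Mahalanobis distance and then keeps the single most likely measurement, which is only one possible reading of the 1-step conditional maximum likelihood rule. The function names and the gate value are assumptions.

```python
import numpy as np

def gaussian_likelihood(residual, S):
    """Gaussian likelihood of a measurement residual with covariance S."""
    m = len(residual)
    d2 = residual @ np.linalg.solve(S, residual)   # squared Mahalanobis distance
    norm = np.sqrt((2.0 * np.pi) ** m * np.linalg.det(S))
    return np.exp(-0.5 * d2) / norm

def associate(z_pred, S, measurements, gate=11.34):
    """Gate the measurements, then pick the most likely one for this object.

    gate is a threshold on the squared Mahalanobis distance; 11.34 is roughly
    the 99% chi-square quantile for 3 degrees of freedom (an assumed setting).
    Returns the index of the chosen measurement, or None if all are gated out.
    """
    best_idx, best_like = None, 0.0
    for idx, z in enumerate(measurements):
        residual = z - z_pred
        if residual @ np.linalg.solve(S, residual) > gate:
            continue                               # outside the gate
        like = gaussian_likelihood(residual, S)
        if like > best_like:
            best_idx, best_like = idx, like
    return best_idx

z_pred = np.zeros(3)
S = np.eye(3)
candidates = [np.array([5.0, 5.0, 5.0]),   # far away: removed by gating
              np.array([0.2, 0.0, 0.0]),   # closest to the prediction
              np.array([1.0, 1.0, 0.0])]   # inside the gate but less likely
chosen = associate(z_pred, S, candidates)
```

In this example the first candidate is rejected by the gate and the second is selected because its residual has the highest likelihood.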
3.2. Global Estimate with Data Fusion
The global estimate is composed of the information integrated from the local estimators for tracking and identification. The estimated state of the object at time is , where contains the object position and velocity . The discrete-time global dynamic and measurement models of the tracking object are, respectively, defined as
where is the total number of measurements obtained from local estimators for the tracked object. The system noise, , associated with the moving object at step is assumed to be white Gaussian noise distributed with zero mean and covariance matrix . is the global observation matrix, and is the measurement error, which is assumed to be white Gaussian noise with zero mean and covariance matrix .
In order to determine the relationship between the local estimate and global estimate, mapping matrix is defined. Let be the mapping matrix of the object seen by camera pair at time . It is defined as follows:
where I 6 is a 6-by-6 identity matrix, 06 is a 6-by-6 matrix of zeros, , and .
The proposed data fusion algorithm in the tracking system is applied to combine the local estimate with mapping matrix . Thus, recording the output of the local estimators and to form global measurement, matrix is expressed as
If object is seen by camera pair in the local estimate, the output of the local estimate is fed into the global estimate with global estimate matrix . Otherwise, no update is performed because no measurement is provided.
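The role of the mapping matrix can be illustrated with a generic inverse-covariance (information-weighted) fusion in which a camera pair that sees the object contributes with an identity weight and an occluded pair with a zero weight. This is an assumed, simplified fusion rule that ignores correlation between local estimates; it is not the global fusion equation derived in this section, and the function name is hypothetical.

```python
import numpy as np

def fuse_local_estimates(estimates, covariances, visible):
    """Fuse local estimates by inverse-covariance weighting.

    visible[i] plays the role of the mapping matrix: a visible camera pair
    contributes with identity weight, an occluded one with zero weight.
    Returns (None, None) when no camera pair sees the object.
    """
    if not any(visible):
        return None, None
    n = len(estimates[0])
    info = np.zeros((n, n))
    info_state = np.zeros(n)
    for x_i, P_i, seen in zip(estimates, covariances, visible):
        M_i = np.eye(n) if seen else np.zeros((n, n))   # I_6 or 0_6 for a 6D state
        P_inv = np.linalg.inv(P_i)
        info += M_i @ P_inv
        info_state += M_i @ P_inv @ x_i
    P_global = np.linalg.inv(info)
    return P_global @ info_state, P_global

# Three local estimators; the third one does not see the object.
estimates = [np.full(6, 1.0), np.full(6, 3.0), np.full(6, 100.0)]
covariances = [np.eye(6), np.eye(6), np.eye(6)]
x_g, P_g = fuse_local_estimates(estimates, covariances, [True, True, False])
```

Because the third estimator is masked out, the fused state is the average of the first two, and the fused covariance is half of each local covariance.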
In order to derive the estimation algorithm, local estimates are combined to form the global estimate. The goal is to compute in each time step. The global estimate, , for a tracking object can be computed with the Kalman filter as
where is a normalizing matrix for a tracking object, defined as
The global error covariance is updated by
The global a priori estimate and its prediction error covariance are computed as
Then, combining (12) and (15), the global estimate for the object becomes
The local estimates are combined to produce the global estimate, . The local estimates are computed by the Kalman filters and rearranged as
By rewriting (18), one can obtain
Let , where can be considered as the adjusting factor between the local and global error covariance. Then, in the global estimate,
Therefore, the final result of the global estimate for the tracking object, , in (16) is
The global estimate is fed back to local filters for improving the local estimates using
In summary, each local estimate, , is computed by each local estimator using (4) and then all local estimates are sent to the global estimator. The global estimate, in (22), is obtained after performing the data fusion process in the global estimator. The global estimate is then sent to each local estimator to update the estimate of the local state vector. When the global estimate is fed back, can be determined.
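The closed-loop cycle summarized above (local update, global fusion, feedback) can be sketched as below. Several simplifications are assumed and labeled in the code: a random-walk position model, inverse-covariance fusion, and a feedback rule that simply restarts each local filter from the global state. The paper's own fusion and feedback equations may differ from these choices, and all names are hypothetical.

```python
import numpy as np

def local_step(x, P, z, Q, R):
    """One local Kalman cycle for an assumed random-walk model (F = H = I)."""
    P_prior = P + Q
    if z is None:                       # no measurement: keep the prediction
        return x, P_prior
    K = P_prior @ np.linalg.inv(P_prior + R)
    return x + K @ (z - x), (np.eye(len(x)) - K) @ P_prior

def global_fusion(states, covs):
    """Inverse-covariance weighted fusion of the local estimates (assumed rule)."""
    infos = [np.linalg.inv(P) for P in covs]
    P_g = np.linalg.inv(sum(infos))
    x_g = P_g @ sum(Y @ x for Y, x in zip(infos, states))
    return x_g, P_g

def closed_loop_cycle(states, covs, measurements, Q, R_list):
    """Local update, then global fusion, then feedback of the global estimate."""
    updated = [local_step(x, P, z, Q, R)
               for x, P, z, R in zip(states, covs, measurements, R_list)]
    states = [x for x, _ in updated]
    covs = [P for _, P in updated]
    x_g, P_g = global_fusion(states, covs)
    # Feedback (assumed form): each local filter restarts from the global state
    # but keeps its own covariance, so its reliability is still tracked locally.
    states = [x_g.copy() for _ in states]
    return states, covs, x_g, P_g

rng = np.random.default_rng(0)
truth = np.array([1.0, 2.0, 3.0])
R_list = [0.1 * np.eye(3), 0.4 * np.eye(3), 0.9 * np.eye(3)]
states = [np.zeros(3) for _ in R_list]
covs = [10.0 * np.eye(3) for _ in R_list]
for _ in range(30):
    zs = [truth + rng.normal(0.0, np.sqrt(R[0, 0]), 3) for R in R_list]
    states, covs, x_g, P_g = closed_loop_cycle(states, covs, zs, 1e-4 * np.eye(3), R_list)
```

The less noisy estimators receive larger weights in the fusion, and after feedback every local filter starts the next cycle from the same global state.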
4. Experimental Results
To evaluate performance, the proposed CLGIHE algorithm was compared with Austere's method and Kim's method [28] using computer simulation and real image sequences. Since Austere's method and Kim's method specify only the fusion method and not the local filters, the Kalman filter was used as the local filter for both fusion algorithms to provide a fair comparison.
In the simulation, the state noise, measurement noise, and 3D object positions were created using synthetic data generators. The measurement data were obtained via a homogeneous transformation of the two-camera model in addition to measurement errors. Kalman filters were used to estimate the local state vectors. Once the measurement data were received, the corresponding probability was calculated based on each hypothesis. The conditional estimate of the object states was evaluated and combined with the individual estimate for each hypothesis, weighted by the corresponding probability function. The performance of multiple-view tracking was simulated under epipolar geometry.
After several Monte Carlo runs, the results of position and velocity errors for the proposed method and Austere's method were obtained. A comparison is shown in Figure 3. The horizontal axis indicates the tracking steps, and the vertical axis indicates the position or velocity errors. The position and velocity errors are defined as the mean squared errors. The results indicate that the proposed method has lower MSE values than those of Austere's method. The average MSE values for the proposed method and Austere's fusion method are 4.4750 and 4.8893, respectively.
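One way to compute such an error curve is to average the squared position error over the Monte Carlo runs at every tracking step. The sketch below assumes this definition and an array layout of (runs, steps, 3); both are assumptions for illustration, and the toy data are not the paper's results.

```python
import numpy as np

def mse_per_step(estimates, truth):
    """Mean squared position error at each tracking step.

    estimates: array of shape (runs, steps, 3), one trajectory per Monte Carlo run.
    truth: array of shape (steps, 3), the ground-truth trajectory.
    """
    sq_err = np.sum((estimates - truth[None, :, :]) ** 2, axis=2)
    return sq_err.mean(axis=0)

# Toy data: two runs over four steps; run 0 is off by 1 on every axis, run 1 is exact.
truth = np.zeros((4, 3))
estimates = np.stack([np.ones((4, 3)), np.zeros((4, 3))])
curve = mse_per_step(estimates, truth)
average_mse = curve.mean()
```

Here each step has a squared error of 3 in the first run and 0 in the second, so the per-step curve is flat at 1.5 and the average is also 1.5.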
In order to determine the effect of global fusion, the proposed system's performance was measured with and without global fusion. Figures 4(a)–4(c) show the 3D estimation error comparisons between the global estimator and three local estimators (local 1, local 2, and local 3). The global estimator has lower MSE values than those of each of the three local estimators. The average MSE values obtained for local 1, local 2, and local 3 are 4.7982, 5.0101, and 4.8596, respectively, whereas that for the global estimator is 4.4750. The performance in terms of measured positions of the object compared with the ground truth is shown in Figure 5. The results show that the estimates of the x, y, and z coordinates are close to those of the object trajectory.
The performance of the proposed algorithm was also evaluated using real image sequences. In order to show the performance in real situations, three fixed calibrated digital cameras were set up to track people who were moving outdoors. Figure 6 shows the configuration of the tracking system in the experiment. The test image sequences have an image size of pixels. All the image sequences were taken with calibrated cameras. At each local estimator, a 3D state vector is determined based on the reconstruction of the camera pair. Every two views form a camera pair and are applied to a local estimator for observations. The direct linear transform (DLT) is adopted as the reconstruction method for each camera pair. To evaluate the accuracy of reconstruction, the geometric error is used for measuring the results. The geometric error is the sum of the projection errors in each camera view for a pair of corresponding points. Before the experiment, a self-made calibration board was used for camera calibration. The calibration uses a set of control points whose coordinates are already known. Then, several reconstructions and re-projections are used to tune the camera matrices by minimizing the geometric error.
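The geometric error for one reconstructed point can be sketched as the sum, over the two views of a camera pair, of the distance between the observed image point and the re-projection of the reconstructed 3D point. Using the squared pixel distance is an assumed choice, and the camera matrices below are hypothetical values used only to make the example runnable.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X with a 3x4 camera matrix P; returns pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def geometric_error(P1, P2, X, uv1, uv2):
    """Sum over both views of the squared re-projection distance (in pixels^2)."""
    err = 0.0
    for P, uv in ((P1, uv1), (P2, uv2)):
        err += float(np.sum((np.asarray(uv) - project(P, X)) ** 2))
    return err

# Hypothetical calibrated camera pair.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X = np.array([0.5, -0.2, 5.0])
uv1_observed = project(P1, X) + np.array([3.0, 4.0])   # detection off by (3, 4) pixels
uv2_observed = project(P2, X)                           # exact detection
error = geometric_error(P1, P2, X, uv1_observed, uv2_observed)
```

In this example only the first view contributes, with a squared distance of 3² + 4² = 25 pixels².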
When the objects are occluded, observations are unavailable. If no measurement can be obtained, the object is seen by neither camera of the pair. In this situation, the local predicted state is not updated until new observations are generated, and the global estimate is updated using only the available camera pairs.
In the initial step of the experiment, the local and global estimators were initialized, and background subtraction [30] was used to separate the moving foreground objects. The measurement of the local estimator was obtained from two camera views, that is, a camera pair. Each local estimator ran its Kalman filter on the estimated state and updated the Kalman gain. Each output of the local estimator was sent to the global estimator. The global estimate and the estimated 3D positions of the tracked object were computed using (22).
For evaluation, three sequences, for which sample images are shown in the three rows of Table 1, were used for the test. The 3D tracking results obtained for the person wearing blue clothes (sequence 1) are shown in Figure 7(a). For comparison, the results obtained with Kim's fusion and Austere's fusion algorithms are shown in Figures 7(b) and 7(c), respectively. The average MSE values for the proposed method, Kim's fusion, and Austere's fusion are 12.9327, 13.0883, and 13.1239, respectively (see Table 2). The results show that the proposed method has lower MSE values than those of the other fusion methods. To show the fusion effect, the results obtained from local estimates are shown in Figures 7(d)–7(f). The average MSE values for the three local estimates are 14.4163, 14.4579, and 13.9209, respectively (see Table 2). The results were also evaluated by mapping the obtained 3D positions onto 2D image planes for comparison. The average errors in the - and -directions are listed in the last column of the first row in Table 1.
Similarly, the obtained 3D tracking results for sequence 2 and sequence 3 are shown in Figures 8 and 9, respectively. The average errors in the - and -directions of the projected 2D images are shown in the last column of the second and the third row in Table 1, respectively. The MSE values of 3D positions obtained using the proposed approach, Austere's method, Kim's method, and the three local estimators for sequence 2 and sequence 3 are listed in Table 2.
5. Conclusions
A closed-loop local-global integrated hierarchical estimator (CLGIHE) approach was proposed for object tracking using multiple cameras. CLGIHE adopts the Kalman filter to build an integrated hierarchical fusion estimator because it allows the local and global estimates to be combined into one estimate for mutual compensation. Compared to existing multiple-camera Kalman-filter-based object tracking approaches, CLGIHE has the following advantages. Firstly, it is implemented with a feedback loop to achieve iterative optimization-based improvement from both the local and global mutual compensation. Secondly, local and global estimates are integrated into one estimate to allow the optimal adjustment of the fusion gain based on environment conditions from each local estimator to obtain accurate and smooth tracking results. The simulation and experimental results show that the proposed algorithm is capable of tracking objects in various situations. Moreover, the data fusion algorithm applied to the multiple-view images reduces the probability of misdetection.
References
[1] Comaniciu D, Ramesh V, Meer P: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 2003, 25(5):564-577. 10.1109/TPAMI.2003.1195991
[2] Comaniciu D, Ramesh V, Meer P: Real-time tracking of non-rigid objects using mean shift. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR '00), June 2000 142-149.
[3] Grewal MS, Andrews AP: Kalman Filtering Theory and Practice. Prentice Hall, Englewood Cliffs, NJ, USA; 1993.
[4] Cui J, Zha H, Zhao H, Shibasaki R: Laser-based detection and tracking of multiple people in crowds. Computer Vision and Image Understanding 2007, 106(2-3):300-312. 10.1016/j.cviu.2006.07.015
[5] Czyz J, Ristic B, Macq B: A particle filter for joint detection and tracking of color objects. Image and Vision Computing 2007, 25(8):1271-1281. 10.1016/j.imavis.2006.07.027
[6] Hue C, Le Cadre J-P, Pérez P: Sequential Monte Carlo methods for multiple target tracking and data fusion. IEEE Transactions on Signal Processing 2002, 50(2):309-325. 10.1109/78.978386
[7] Chang C, Ansari R: Kernel particle filter: iterative sampling for efficient visual tracking. Proceedings of the International Conference on Image Processing (ICIP '03), September 2003 977-980.
[8] Bouaynaya N, Qu W, Schonfeld D: An online motion-based particle filter for head tracking applications. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), March 2005 225-228.
[9] Shan C, Wei Y, Tan T, Ojardias F: Real time hand tracking by combining particle filtering and mean shift. Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition (FGR '04), May 2004 669-674.
[10] Cheng H-Y, Hwang J-N: Adaptive particle sampling and adaptive appearance for multiple video object tracking. Signal Processing 2009, 89(9):1844-1849. 10.1016/j.sigpro.2009.03.034
[11] Bar-Shalom Y, Fortmann T: Tracking and Data Association. Academic Press, New York, NY, USA; 1988.
[12] Kershaw DJ, Evans RJ: Waveform selective probabilistic data association. IEEE Transactions on Aerospace and Electronic Systems 1997, 33(4):1180-1188.
[13] Yi J-W, Oh J-H: Recursive resolving algorithm for multiple stereo and motion matches. Image and Vision Computing 1997, 15(3):181-196. 10.1016/S0262-8856(96)01118-3
[14] Li Y, Hilton A, Illingworth J: A relaxation algorithm for real-time multiple view 3D-tracking. Image and Vision Computing 2002, 20(12):841-859. 10.1016/S0262-8856(02)00094-X
[15] Hu W, Zhou X, Hu M, Maybank S: Occlusion reasoning for tracking multiple people. IEEE Transactions on Circuits and Systems for Video Technology 2009, 19(1):114-121.
[16] Khan S, Shah M: Consistent labeling of tracked objects in multiple cameras with overlapping fields of view. IEEE Transactions on Pattern Analysis and Machine Intelligence 2003, 25(10):1355-1360. 10.1109/TPAMI.2003.1233912
[17] Black J, Ellis T: Multi camera image tracking. Image and Vision Computing 2006, 24(11):1256-1267. 10.1016/j.imavis.2005.06.002
[18] Ercan AO, El Gamal A, Guibas LJ: Object tracking in the presence of occlusions via a camera network. Proceedings of the 6th International Symposium on Information Processing in Sensor Networks (IPSN '07), April 2007 509-518.
[19] Senior A, Hampapur A, Tian Y-L, Brown L, Pankanti S, Bolle R: Appearance models for occlusion handling. Image and Vision Computing 2006, 24(11):1233-1243. 10.1016/j.imavis.2005.06.007
[20] Dockstader SL, Tekalp AM: Multiple camera fusion for multi-object tracking. Proceedings of IEEE Workshop on Multi-Object Tracking, July 2001 95-102.
[21] Zhou Q, Aggarwal JK: Object tracking in an outdoor environment using fusion of features and cameras. Image and Vision Computing 2006, 24(11):1244-1255. 10.1016/j.imavis.2005.06.008
[22] Majji M, Davis JJ, Junkins JL: Hierarchical multi-rate measurement fusion for estimation of dynamical systems. AIAA Guidance, Navigation, and Control Conference 2007, August 2007, USA 3967-3978.
[23] Ajgl J, et al.: Millman's formula in data fusion. Proceedings of the 10th International PhD Workshop on Systems and Control, 2009, Prague, Czech Republic 1-6.
[24] Wang J, Achanta R, Kankanhalli M, Mulhem P: A hierarchical framework for face tracking using state vector fusion for compressed video. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, April 2003 209-212.
[25] Strobel N, Spors S, Rabenstein R: Joint audio-video object localization and tracking: a presentation general methodology. IEEE Signal Processing Magazine 2001, 18(1):22-31. 10.1109/79.911196
[26] Medeiros H, Park J, Kak AC: Distributed object tracking using a cluster-based Kalman filter in wireless camera networks. IEEE Journal on Selected Topics in Signal Processing 2008, 2(4):448-463.
[27] Hartley R, Zisserman A: Multiple View Geometry in Computer Vision. 2nd edition. Cambridge University Press, Cambridge, Mass, USA; 2003.
[28] Kim KH: Development of track to track fusion algorithms. Proceedings of the American Control Conference, July 1994 1037-1041.
[29] Bardsley DJ, Bai L: 3D surface reconstruction and recognition. Biometric Technology for Human Identification IV, April 2007, Orlando, Fla, USA, Proceedings of SPIE 6539:
[30] Cucchiara R, Grana C, Piccardi M, Prati A: Detecting moving objects, ghosts, and shadows in video streams. IEEE Transactions on Pattern Analysis and Machine Intelligence 2003, 25(10):1337-1342. 10.1109/TPAMI.2003.1233909
Acknowledgment
This work was supported in part by National Science Council, Taiwan, under Grant NSC 98-2218-E-006-004.
Cite this article
Wu, CW., Chung, YN. & Chung, PC. A Hierarchical Estimator for Object Tracking. EURASIP J. Adv. Signal Process. 2010, 592960 (2010). https://doi.org/10.1155/2010/592960
Keywords
- Kalman Filter
- Cluster Head
- Local Estimate
- Object Tracking
- Global Estimate