Marker-Based Human Motion Capture in Multiview Sequences
© Cristian Canton-Ferrer et al. 2010
Received: 24 March 2010
Accepted: 6 November 2010
Published: 22 November 2010
This paper presents a low-cost real-time alternative to available commercial human motion capture systems. First, a set of distinguishable markers is placed on several human body landmarks, and the scene is captured by a number of calibrated and synchronized cameras. In order to establish a physical relation among markers, a human body model is defined. Markers are detected in all camera views and delivered as the input of an annealed particle filter scheme where every particle encodes an instance of the pose of the body model to be estimated. The likelihood between particles and input data is evaluated through the robust generalized symmetric epipolar distance, and kinematic constraints are enforced in the propagation step to avoid impossible poses. Tests over the HumanEva annotated data set yield quantitative results showing the effectiveness of the proposed algorithm. Results over sequences involving fast and complex motions are also presented.
Accurate retrieval of the configuration of an articulated structure from the information provided by multiple cameras is a field that has found numerous applications in recent years. The growth of computer graphics technology, together with human motion capture (HMC) systems, has been extensively exploited by the cinematographic and video game industries to generate virtual avatars. Medicine has also benefited from these advances in the fields of orthopedics, assessment of locomotive pathologies, and sports performance improvement. In this field, although markerless HMC systems have attained significant performance ratios in some scenarios, only HMC systems aided by markers placed on some body landmarks can produce high-accuracy results.
Depending on the type of markers employed, HMC systems are classified into two groups: nonoptical (inertial, magnetic, and mechanical) and optical (active and passive). Optical systems based on photogrammetric methods are more widely used than nonoptical ones, which usually require special suits embedding rigid skeletal-like structures, magnetic or accelerometric devices, or multisensor fusion algorithms. In contrast, image-based or optical systems allow a relative freedom of movement and are less intrusive. A common issue of both optical and nonoptical systems is that they are usually expensive and require dedicated hardware. The most common optical systems involve IR retro-reflective markers that reflect back light generated near the camera lens. Other optical systems triangulate positions by using active markers that emit a pulse-modulated signal, which makes it possible to distinguish among markers and to label them automatically.
This paper focuses on HMC systems with passive markers in a multicamera scenario. These systems first require an accurate reconstruction of the markers' 3D positions from their 2D projections, which is not a trivial problem. Matches need to be established between the detected markers in the different views, defining the multiple view correspondences through homographies or algebraic methods. This process is prone to errors due to occlusions, detection noise, and the proximity between markers. A temporal tracking of the markers also needs to be performed to identify the markers in each frame of the sequence, thus yielding a 3D trajectory for each marker. Although professional systems exist for this purpose, errors occur when crucial markers become occluded or when marker trajectories are confused. Finally, most applications require the transformation of the marker locations and trajectories into the motion parameters of a kinematic skeleton model. Commercial tools that perform this transformation are generally semiautomatic, which makes it a labor-intensive task.
Once the 3D marker positions are obtained, a selected human body model (HBM) must be fitted to these data to obtain kinematically meaningful parameters, either for analysis (e.g., gesture recognition) or for synthesis (e.g., avatar animation). However, in most systems, the estimation of the markers' 3D positions and the fitting step are decoupled. In one of the first attempts to use an anatomical human model to increase the robustness of an HMC system, the algorithm computes a skeleton-and-marker model using a standardized set of motions and uses it to resolve the ambiguities during the 3D reconstruction process. Another approach uses an HBM together with data clustering. The detection of 2D markers in separate images and their analysis using calibration information, with an HBM enforced afterwards, has also been presented, as has a similar technique using a Kalman filter that involves the HBM in the data association step.
In this paper, a low-cost real-time multicamera algorithm for marker-based human motion capture is presented. The proposed algorithm can work with any marker type detectable on a set of 2D planes under perspective projection, and it is robust to marker occlusions and noisy detections. Since the variables involved in the employed analysis HBM do not hold a linear relationship and the statistical distributions involved are non-Gaussian, we opted for a Monte Carlo approach to estimate the pose of the HBM at a given time instant. In our case, marker detection and HBM pose estimation are performed in the same analysis loop by means of an annealed particle filter. Epipolar geometry is exploited in the particle likelihood evaluation by means of the symmetric epipolar distance, which is robust to noisy marker detections and occlusions. Moreover, kinematic restrictions are applied in the particle propagation step to avoid impossible poses. Finally, the effectiveness of the proposed algorithm is assessed by means of objective metrics defined in the framework of the HumanEva data set. The presented algorithm is intended to work with any multicamera setup, regardless of the complexity of the selected human body model.
2. Monte Carlo-Based Human Motion Capture
2.1. Problem Formulation
The evolution of a physical articulated structure can be better captured with model-based tracking techniques. In this process, the pose of an articulated HBM is sequentially estimated over time using video data from a number of cameras. Let $\mathbf{x}$ be the state vector to be estimated, formed by the defining parameters of an articulated HBM (the angles at every joint), and let $\mathcal{X}$ be the state space describing all possible valid poses an HBM may adopt, where $\mathbf{x} \in \mathcal{X}$.
From a Bayesian perspective, the articulated motion estimation and tracking problem is to recursively estimate a certain degree of belief in the state vector $\mathbf{x}_t$ at time $t$, given the data $\mathbf{z}_{1:t}$ up to time $t$. Thus, it is required to calculate the posterior pdf $p(\mathbf{x}_t \mid \mathbf{z}_{1:t})$. However, this pdf may be peaky and far from being convex, and hence cannot be computed analytically unless linear-Gaussian models are adopted. Even though Kalman filtering provides the optimal solution under certain assumptions, it tends to fail when the estimated probability density is multimodal or the dimension of the state vector is high. This is usually the type of pdf involved in HMC processes.
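For reference, the recursive estimation described here is usually written as the standard two-step Bayesian recursion (prediction followed by update). The notation is an assumed convention, not taken from the source: $\mathbf{x}_t$ denotes the state at time $t$ and $\mathbf{z}_{1:t}$ the data up to time $t$.

```latex
\begin{align}
  % Prediction: propagate the previous posterior through the motion model
  p(\mathbf{x}_t \mid \mathbf{z}_{1:t-1}) &=
    \int p(\mathbf{x}_t \mid \mathbf{x}_{t-1})\,
         p(\mathbf{x}_{t-1} \mid \mathbf{z}_{1:t-1})\, d\mathbf{x}_{t-1} \\
  % Update: correct the prediction with the likelihood of the new data
  p(\mathbf{x}_t \mid \mathbf{z}_{1:t}) &=
    \frac{p(\mathbf{z}_t \mid \mathbf{x}_t)\,
          p(\mathbf{x}_t \mid \mathbf{z}_{1:t-1})}
         {p(\mathbf{z}_t \mid \mathbf{z}_{1:t-1})}
\end{align}
```

Neither integral has a closed form for the non-linear, non-Gaussian models involved in HMC, which is what motivates the sample-based approximation of the next section.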
2.2. Particle Filtering
Usually, a particle filter (PF) will be able to concentrate particles in the main mode of the likelihood function, thus providing an estimation of the state vector. However, multiple modes of similar size in the likelihood function might bias the estimation. In order to cope with such cases, the estimation is set to be either the state vector associated with the maximum particle weight or the weighted mean of all particles. Finally, a propagation model is adopted to add a drift to the state of the resampled particles in order to progressively sample the state space in the following iterations.
Another issue arising when applying PF techniques to computer vision problems is deriving a valid observation model that relates the input data to the particle state. Even if such a likelihood model can be defined, its evaluation may be computationally very inefficient. Instead, a fitness function can be constructed according to the likelihood function, such that it provides a good approximation of the likelihood but is also relatively easy to calculate.
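The weighting, estimation, resampling, and propagation steps described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian drift used as propagation model, and the one-dimensional example fitness function are all hypothetical.

```python
import math
import random


def example_fitness(particle):
    """Hypothetical 1D fitness function peaked at 2.0 (illustration only)."""
    return math.exp(-((particle[0] - 2.0) ** 2) / 0.5)


def pf_step(particles, fitness, sigma, rng):
    """One weight / estimate / resample / propagate cycle of a basic PF.

    particles: list of state vectors (lists of floats).
    fitness:   callable approximating the likelihood of a state.
    sigma:     standard deviation of the Gaussian drift (assumed model).
    """
    # Weight every particle with the fitness function and normalize.
    weights = [fitness(p) for p in particles]
    total = sum(weights)
    if total == 0.0:
        # Degenerate case: fall back to uniform weights.
        weights = [1.0 / len(particles)] * len(particles)
    else:
        weights = [w / total for w in weights]

    # Point estimate: weighted mean of all particles.
    dim = len(particles[0])
    estimate = [sum(w * p[d] for w, p in zip(weights, particles))
                for d in range(dim)]

    # Resample with replacement, proportionally to the weights.
    resampled = rng.choices(particles, weights=weights, k=len(particles))

    # Propagate: add a random drift to every resampled particle.
    propagated = [[v + rng.gauss(0.0, sigma) for v in p] for p in resampled]
    return estimate, propagated
```

Calling `pf_step` repeatedly with `example_fitness` and a `random.Random` instance moves an initially scattered particle set towards the peak of the fitness function.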
2.3. Annealing Strategy
PF is an appropriate technique to deal with problems where the posterior distribution is multimodal. This usually happens when the dimensionality of the state space is high, as in HMC. To maintain a fair representation of the posterior, a certain number of particles is required in order to find its global maximum instead of a local one. It has been proved that the number of particles required by a standard PF algorithm to achieve successful tracking grows exponentially with the number of dimensions. Articulated motion tracking typically employs state spaces with dimension , thus standard PF turns out to be computationally infeasible.
There exist several possible strategies to reduce the complexity of the problem based on refinements and variations of the seminal PF idea. Partitioned and hierarchical sampling [18, 19] are presented as highly efficient solutions to this problem. When there exists a tractable substructure among some variables of the state model, specific states can be marginalized out of the posterior, leading to the family of Rao-Blackwellized PF algorithms. However, these techniques impose a linear hierarchy of sampling which may not be related to the true body structure, and they assume a certain statistical independence among state variables. Finally, annealed PF is one of the most general and robust approaches to estimation problems involving high-dimensional and multimodal state spaces. In this work, this technique is extended to our marker-based scenario.
enforcing the normalization condition . The estimation of parameter is based on the particle survival technique described in . Once the weighted set is constructed, it will be used to draw the particles of the next layer.
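The weighting of one annealing layer can be illustrated with the sketch below. It assumes, following the annealed particle filter of Deutscher and Reid, that the parameter being estimated is the annealing exponent of the layer (called `beta` here) and that it is chosen so that the particle survival rate reaches a target value; the function names and the bisection search are hypothetical and are not taken from the source.

```python
def annealed_weights(fitness_values, beta):
    """Raise the fitness values to the power beta and normalize to sum 1."""
    peak = max(fitness_values)
    # Dividing by the peak keeps the largest term at 1 and avoids underflow.
    raw = [(f / peak) ** beta for f in fitness_values]
    total = sum(raw)
    return [r / total for r in raw]


def survival_rate(weights):
    """Fraction of particles expected to survive resampling.

    Computed as 1 / (N * sum of squared normalized weights): it equals 1
    for uniform weights and 1/N when a single particle holds all the weight.
    """
    total = sum(weights)
    norm = [w / total for w in weights]
    return 1.0 / (len(norm) * sum(w * w for w in norm))


def find_beta(fitness_values, target_rate, lo=0.0, hi=50.0, iters=60):
    """Bisection on beta so that the survival rate matches target_rate.

    The survival rate does not increase as beta grows, so bisection applies.
    If the target cannot be reached inside [lo, hi], the result saturates
    at the corresponding end of the interval.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if survival_rate(annealed_weights(fitness_values, mid)) > target_rate:
            lo = mid   # weights still too flat: sharpen them
        else:
            hi = mid   # weights too peaked: flatten them
    return 0.5 * (lo + hi)
```

The normalized weights returned by `annealed_weights` for the selected exponent form the weighted set from which the particles of the next layer are drawn.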
3. Filter Implementation
In the current scenario, it is assumed that the subject under study is tracked from the moment he/she enters the scene. A simple person tracking system is employed to obtain a coarse estimation of the person's position and velocity. Assuming that backward motions are unlikely, the velocity vector allows an initial estimation of the torso orientation. Finally, for the remaining limbs, a neutral and natural walking position is defined for the initialization of the HMC system.
When the tracked subject is completely lost, the variance of the state space variables associated with the particles tends to be high in comparison with the variance obtained during correct tracking. Therefore, the analysis of this variance allows detecting when the HMC system has lost track. In such a case, the coarse tracking system is employed to restart the initialization loop described above.
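A minimal sketch of this variance test is given below. The averaging of the variance over all state variables, the reference variance, and the multiplicative `factor` are assumptions made for illustration; the source does not specify how the comparison is carried out.

```python
from statistics import pvariance


def is_out_of_track(particles, reference_variance, factor=4.0):
    """Flag a tracking failure when the particle spread is abnormally high.

    particles:          list of state vectors (lists of floats).
    reference_variance: variance observed during correct tracking.
    factor:             hypothetical tolerance multiplier.
    """
    dim = len(particles[0])
    # Variance of every state variable across particles, averaged over
    # the dimensions of the state vector.
    mean_variance = sum(
        pvariance([p[d] for p in particles]) for d in range(dim)
    ) / dim
    return mean_variance > factor * reference_variance
```

When the function returns `True`, the coarse person tracker would be used to reinitialize the pose estimation.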
Although a preselected HBM is employed to track any person, the size of the limbs must be adapted to the particular subject under study. For the majority of people, there is a strong quasilinear correlation between the height of a person and the length of the limbs, which allows a proper scaling of these magnitudes after automatically measuring the height directly from the input images.
3.2. Measurement Generation
The input data to the proposed tracking system are the detections, on the available images, of the 2D projections of the set of distinguishable markers attached to the body of the performer, in contrast with markerless HMC systems that rely on image features such as edges or silhouettes. For every view, the measurement set contains the 2D locations detected in the image captured by that camera. In order to generate these sets, a generic marker detection algorithm is employed whose performance is characterized by the detection rate, the false positive rate, and the variance of the location estimation error. This formulation allows performance comparisons of the tracking algorithm when using different marker detection algorithms, as well as the assessment of occlusions.
Markers are usually placed at the joints, the end of the limbs, the top of the head and the chest of the subject. The proposed method is general enough to be applied to any type of markers detectable onto a set of 2D planes under perspective projection. An example of the detections obtained by our color-based marker detection is shown in Figure 2(b).
3.3. Likelihood Evaluation
In order to evaluate the likelihood of the body pose represented by a given particle state with respect to the input data, a fitness function must be defined. The 3D positions of the HBM landmarks corresponding to the pose described by the state vector are computed through forward kinematics. The fitness function relating this set of 3D locations to the 2D observations should measure how well the 2D points fit as projections of the 3D set. A similar problem was tackled by the authors in a Bayesian framework, and the underlying idea is applied in this context.
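As an illustration of how landmark positions follow from joint angles, the sketch below applies forward kinematics to a planar chain of limbs. It is a deliberately simplified 2D example with hypothetical names; the HBM used in the paper is three-dimensional and its exact parameterization is not reproduced here.

```python
import math


def forward_kinematics_2d(joint_angles, limb_lengths, origin=(0.0, 0.0)):
    """Positions of the joints of a planar kinematic chain.

    joint_angles: relative angle of each limb with respect to the
                  previous one, in radians.
    limb_lengths: length of each limb.
    Returns the chain root followed by the end point of every limb.
    """
    x, y = origin
    heading = 0.0
    positions = [(x, y)]
    for angle, length in zip(joint_angles, limb_lengths):
        # Relative joint angles accumulate along the chain.
        heading += angle
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        positions.append((x, y))
    return positions
```

For example, two unit-length limbs with angles of +90 and -90 degrees place the intermediate joint at (0, 1) and the end of the chain at (1, 1).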
However, not all the 3D points may have a projection onto every view, due to occlusions or missed detections of the marker detection algorithm. In order to detect such cases, a threshold is applied and the measurements exceeding it are dismissed; the threshold is an empirically determined value in pixels. At this point, it is required to measure how likely the set of 2D measurements is to be the projections of a given 3D HBM landmark. This can be done by means of the generalized symmetric epipolar distance.
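The two-view building block of this measure, the symmetric epipolar distance as defined in the multiple view geometry literature, can be sketched as follows. The function names are hypothetical, and the generalization to more than two views used in the paper is not reproduced; summing this quantity over pairs of views would be one possible extension, stated here only as an assumption.

```python
import math


def mat_vec(m, v):
    """Product of a 3x3 matrix (nested lists) and a 3-vector."""
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))


def transpose(m):
    """Transpose of a 3x3 matrix."""
    return [[m[j][i] for j in range(3)] for i in range(3)]


def point_line_distance(point, line):
    """Distance from a 2D point to the line a*x + b*y + c = 0."""
    a, b, c = line
    return abs(a * point[0] + b * point[1] + c) / math.hypot(a, b)


def symmetric_epipolar_distance(F, x1, x2):
    """Symmetric epipolar distance between x1 (view 1) and x2 (view 2).

    F is the fundamental matrix satisfying x2^T F x1 = 0 for true matches.
    The result is the sum of the squared distances of each point to the
    epipolar line induced by the other one.
    """
    line_in_view2 = mat_vec(F, (x1[0], x1[1], 1.0))
    line_in_view1 = mat_vec(transpose(F), (x2[0], x2[1], 1.0))
    return (point_line_distance(x2, line_in_view2) ** 2
            + point_line_distance(x1, line_in_view1) ** 2)
```

For a pair of cameras related by a pure horizontal translation, epipolar lines are horizontal, so two points on the same image row give a distance of zero and a vertical offset of two pixels gives 2² + 2² = 8.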
3.4. Occlusion Management
Occlusions are a major problem in HMC systems and can be separated into two categories: auto-occlusions and occlusions generated by opaque elements in the scene. In both cases, when analyzed from a multi-view perspective, occlusions are reflected in a subset of markers missing from the detections in some views. Assuming that markers are attached to some HBM landmarks, the measurement set of a given camera view would ideally contain the 2D projections of the markers that are not affected by the occlusions produced by the body itself in that view. Moreover, there might be some missed detections of these projections and a number of false measurements.
Within the current analysis framework, occlusions and missed detections can be regarded as an underperformance of the generic marker detector and are thus accounted for by the miss-detection rate. As previously noted, the amount of false positives is represented by the false positive rate, and the error committed in the marker location estimation is assumed to have a Gaussian distribution with a given variance. This formulation allows simulating an arbitrary degree of corruption of the input data, as will be shown in Section 4.
Markers that are visible in at least three camera views can be correctly handled by the likelihood function. In the case of severe occlusions, where only two camera views contain projections of a given marker, the distance may become inaccurate. In such cases, the position of the occluded marker is estimated using both the correctly estimated neighboring 3D landmarks and temporal coherence.
3.5. Propagation Model
4. Experiments and Results
4.1. Synthetic Data on HumanEva
In order to test the proposed algorithm, the HumanEva data set has been selected, since it provides synchronized and calibrated data from both several cameras and a professional motion capture (MoCap) system that produces ground truth data. This data set contains a set of 5 actions performed by 3 different subjects captured by 4 fully calibrated cameras with a resolution of pixels at 30 fps.
HumanEva suggests two metrics, the mean and the standard deviation of the estimation error, to provide quantitative and comparable results. In this paper, the metrics proposed for 3D human pose tracking evaluation are also employed. Consider the landmark positions of the HBM (typically, the body joints and the ends of the limbs) corresponding to the pose described by the state variable, computed using forward kinematics at a given time. Assuming that the landmark positions associated with a particle are available, a matched marker estimation is defined as one whose distance to the corresponding ground truth position is below a threshold, that is, an estimation that falls close enough to the ground truth position. Then, the Multiple Marker Tracking Accuracy (MMTA) is defined as the percentage of markers fulfilling this condition, and the Multiple Marker Tracking Precision (MMTP) as the average metric error between estimation and ground truth over all matched pairs. Finally, these scores are averaged over all frames in the sequence. The threshold, being an upper bound on the maximum allowed error, is set to in our experiments.
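The two scores can be computed as in the following sketch, which follows the verbal definitions given here. The function name and the data layout (one list of 3D points per frame, with estimations and ground truth in the same marker order) are assumptions made for illustration.

```python
import math


def mmta_mmtp(estimates, ground_truth, threshold):
    """Multiple Marker Tracking Accuracy (%) and Precision.

    estimates, ground_truth: lists of frames; every frame is a list of
    3D points given in the same marker order.
    threshold: maximum distance for an estimation to count as matched.
    """
    accuracies, precisions = [], []
    for est_frame, gt_frame in zip(estimates, ground_truth):
        errors = [math.dist(e, g) for e, g in zip(est_frame, gt_frame)]
        matched = [err for err in errors if err < threshold]
        # Accuracy: percentage of markers matched in this frame.
        accuracies.append(100.0 * len(matched) / len(errors))
        # Precision: mean error of the matched markers only.
        if matched:
            precisions.append(sum(matched) / len(matched))
    mmta = sum(accuracies) / len(accuracies)
    mmtp = sum(precisions) / len(precisions) if precisions else float("nan")
    return mmta, mmtp
```

Frames without any matched marker contribute zero to the accuracy and are left out of the precision average, a design choice of this sketch rather than something stated in the source.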
Every 3D location is projected onto every camera view in order to generate the corresponding sets of 2D measurements. The previously estimated fleshed HBM is used to check the visibility of the markers in a given camera view by modeling the possible auto-occlusions among body parts. At this point, the 2D locations contained in these sets are the positions that an ideal marker detection algorithm would obtain.
The effect of the marker detection algorithm is simulated by generating a number of missed detections and false measurements and, finally, by adding Gaussian noise to all measurements, according to the detection rate, the false positive rate, and the variance of the location error.
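This simulation of an imperfect detector can be sketched as follows. The way false measurements are generated (one Bernoulli trial per true marker, placed uniformly over the image) is an assumption of this example, since the source only names the three statistics; the function and parameter names are hypothetical.

```python
import random


def corrupt_detections(points, detection_rate, false_positive_rate,
                       sigma, image_size, rng):
    """Simulate a non-ideal marker detector on ideal 2D projections.

    points:              ideal 2D marker projections in one view.
    detection_rate:      probability that a marker is detected.
    false_positive_rate: probability of one spurious detection per marker
                         (assumed generation model).
    sigma:               standard deviation of the Gaussian location noise.
    image_size:          (width, height) of the image in pixels.
    """
    width, height = image_size
    # 1) Missed detections: keep each marker with probability detection_rate.
    kept = [p for p in points if rng.random() < detection_rate]
    # 2) False measurements placed uniformly at random over the image.
    false_points = [(rng.uniform(0, width), rng.uniform(0, height))
                    for _ in points if rng.random() < false_positive_rate]
    # 3) Gaussian noise added to all measurements.
    return [(u + rng.gauss(0.0, sigma), v + rng.gauss(0.0, sigma))
            for (u, v) in kept + false_points]
```

With a detection rate of 1, no false positives, and zero noise, the function returns the ideal detections unchanged, which corresponds to the ideal detector mentioned above.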
When analyzing the impact of missing marker projections, that is, occlusions, shown in Figure 6(a), it can be seen that the algorithm remains robust, producing accurate estimations even in the case of a large loss of data. Assuming a fixed and realistic amount of occlusions, we can explore the influence of the other distorting factors. The results shown in Figure 6(b) indicate that the algorithm is robust against the number of false detections, since it is very unlikely that false 2D measurements in different views keep a 3D coherence. In this case, the spatial redundancy is efficiently exploited to discard these measurements. On the other hand, the performance of the algorithm decreases as the 2D marker position estimation error increases. Another fact to be emphasized is the overannealing effect: the performance of the algorithm does not increase monotonically with the number of annealing layers employed. This happens when the particles concentrate too much around the peaks of the weighting function, hence impoverishing the overall representation of the likelihood distribution. For this motion tracking problem, we found that the optimal configuration is and .
4.2. Real Data
The presented body tracking algorithm has been applied to capture motion figures from 4 different types of dances: salsa, belly dancing, and two Turkish folk dances. The analysis sequences were recorded with 6 fully calibrated cameras with a resolution of pixels at 30 fps.
The markers attached to the body of the dance performer were small yellow balls, and a color-based detection algorithm was used to generate the measurement sets for every incoming multi-view frame. The original images are processed in the YCrCb color space, which provides robustness to intensity variations across the frames of a video as well as among the videos captured by the cameras from different views. In order to learn the chrominance information of the marker color, the markers on the dancer are manually labeled in one frame for all camera views. It was assumed that the distributions of the Cr and Cb channel intensity values belonging to marker regions are Gaussian. Thus, the mean can be computed over each marker region (a pixel neighborhood around the labeled point). Then, a threshold in the Mahalanobis sense is applied to all images in order to detect marker locations. An empirical analysis showed that the detector had the following performance triplet: , , and mm.
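The chrominance model and the Mahalanobis test can be sketched as below. The sketch assumes that a full 2x2 covariance of the (Cr, Cb) samples is estimated together with the mean, and the threshold value is hypothetical; neither detail is specified in the source.

```python
def fit_chroma_model(samples):
    """Mean and covariance of (Cr, Cb) samples from labeled marker regions."""
    n = len(samples)
    mean_cr = sum(s[0] for s in samples) / n
    mean_cb = sum(s[1] for s in samples) / n
    var_cr = sum((s[0] - mean_cr) ** 2 for s in samples) / n
    var_cb = sum((s[1] - mean_cb) ** 2 for s in samples) / n
    cov = sum((s[0] - mean_cr) * (s[1] - mean_cb) for s in samples) / n
    return (mean_cr, mean_cb), (var_cr, cov, var_cb)


def mahalanobis_sq(pixel, mean, covariance):
    """Squared Mahalanobis distance of a (Cr, Cb) pixel to the model."""
    var_cr, cov, var_cb = covariance
    det = var_cr * var_cb - cov * cov  # must be non-zero
    d_cr = pixel[0] - mean[0]
    d_cb = pixel[1] - mean[1]
    # Quadratic form with the closed-form inverse of the 2x2 covariance.
    return (var_cb * d_cr * d_cr
            - 2.0 * cov * d_cr * d_cb
            + var_cr * d_cb * d_cb) / det


def is_marker_pixel(pixel, mean, covariance, threshold=3.0):
    """True when the pixel chrominance lies within the threshold."""
    return mahalanobis_sq(pixel, mean, covariance) <= threshold ** 2
```

Connected groups of pixels passing `is_marker_pixel` would then be reduced to one 2D location per marker, for instance through their centroid; that last step is not shown.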
4.3. Results Comparison
Table: methods compared on the HumanEva data set (only the method column was recovered).
- Hierarchical Partitioned PF
- EM + Kinematically constrained GMM
- PF + Dynamic models
- ICP + Naïve classification
- Example-based pose estimation
- Example-based pose estimation + feature selection
- Sparse probabilistic regression
- Voxel reconstruction + APF
The other family of human motion capture algorithms is based on learning and classification instead of tracking. Basically, these techniques examine the ground truth data and extract a number of features from them. Afterwards, when a new test frame is processed, the same features are extracted, and the best match between them and the previously learnt ones is output. Results obtained with these techniques, especially those of Urtasun et al. and Poppe, outperform the tracking-based ones. However, these techniques are constrained to track a preselected action, and their applicability to unknown motion patterns is limited. Also notable is the technique presented by Münderman et al., where a 3D reconstruction is performed before computing the features to be learnt.
To the authors' knowledge, there is no evaluation of a marker-based HMC system using the HumanEva data set. The obtained results are close to those presented by classification-based markerless methods and, although the employed input data are different, this allows a qualitative evaluation of its performance. The advantages of a marker-based method are its robustness to faulty inputs, its low complexity, and the possibility of real-time implementations.
4.4. Real-Time Considerations
Once the image measurements have been obtained, the fitting of an HBM to these data using the proposed algorithm is achieved in real time on a 3 GHz computer. Due to the low dimension of the input data, the operations involved in both the likelihood and propagation steps have a low computational cost. Measurements can be obtained using elementary image filtering techniques, as shown in Section 4.2, usually computed directly on the camera or by the digitizing hardware.
This paper presents a robust, real-time, low-cost approach to marker-based human motion capture using multiple synchronized and calibrated cameras. The progressive fitting of a human body model through the annealed particle filtering algorithm, using a multi-view consistency likelihood function (the symmetric epipolar distance) and a kinematically constrained particle propagation model, allows an accurate estimation of the body pose. Quantitative evaluation on the HumanEva data set showed the robustness of the algorithm when dealing with faulty input data, even in very harsh conditions. Fast dance motion was also analyzed, showing the adequacy of our technique for real-scenario data.
- Baran I, Popović J: Automatic rigging and animation of 3D characters. Proceedings of the ACM International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '07), August 2007.
- Cerveri P, Pedotti A, Ferrigno G: Robust recovery of human motion from video using Kalman filters and virtual humans. Human Movement Science 2003, 22(3):377-404. doi:10.1016/S0167-9457(03)00004-6
- Sigal L, Balan AO, Black MJ: HumanEva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision 2010, 87(1-2):4-27. doi:10.1007/s11263-009-0273-6
- Kirk AG, O'Brien JF, Forsyth DA: Skeletal parameter estimation from optical motion capture data. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005, 782-788.
- Ascension http://www.ascension-tech.com/.
- Moven-inertial motion capture http://www.moven.com/.
- Roetenberg D: Inertial and magnetic sensing of human motion, Ph.D. dissertation. University of Twente, Twente, The Netherlands; 2006.
- Vicon http://www.vicon.com/.
- Raskar R, Nii H, Dedecker B, Hashimoto Y, Summet J, Moore D, Zhao Y, Westhues J, Dietz P, Barnwell J, Nayar S, Inami M, Bekaert P, Noland M, Branzoi V, Bruns E: Prakash: lighting aware motion capture using photosensing markers and multiplexed illuminators. ACM Transactions on Graphics 2007, 26(3).
- Hartley R, Zisserman A: Multiple View Geometry in Computer Vision. Cambridge University Press; 2004.
- Herda L, Fua P, Plänkers R, Boulic R, Thalmann D: Using skeleton-based tracking to increase the reliability of optical motion capture. Human Movement Science 2001, 20(3):313-341. doi:10.1016/S0167-9457(01)00050-1
- Guerra-Filho G: Optical motion capture: theory and implementation. Journal of Theoretical and Applied Informatics 2005, 12(2):61-89.
- Deutscher J, Reid I: Articulated body motion capture by stochastic search. International Journal of Computer Vision 2005, 61(2):185-205.
- Canton-Ferrer C, Casas JR, Pardàs M: Towards a Bayesian approach to robust finding correspondences in multiple view geometry environments. Proceedings of the 4th International Workshop on Computer Graphics and Geometric Modelling, 2005, Lecture Notes on Computer Science 3515: 281-289.
- Moeslund TB, Hilton A, Krüger V: A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 2006, 104(2-3):90-126. doi:10.1016/j.cviu.2006.08.002
- Arulampalam MS, Maskell S, Gordon N, Clapp T: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 2002, 50(2):174-188. doi:10.1109/78.978374
- Gordon NJ, Salmond DJ, Smith AFM: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings, Part F 1993, 140(2):107-113.
- MacCormick J, Isard M: Partitioned sampling, articulated objects, and interface-quality hand tracking. Proceedings of the European Conference on Computer Vision, 2000, 3-19.
- Mitchelson J, Hilton A: Simultaneous pose estimation of multiple people using multiple-view cues with hierarchical sampling. Proceedings of the British Machine Vision Conference, 2003.
- Madapura J, Li B: 3D articulated human body tracking using KLD-Annealed Rao-Blackwellised Particle filter. Proceedings of IEEE International Conference on Multimedia and Expo (ICME '07), July 2007, 1950-1953.
- Canton-Ferrer C, Casas JR, Pardàs M, Sblendido R: Particle filtering and sparse sampling for multi-person 3D tracking. Proceedings of IEEE International Conference on Image Processing (ICIP '08), October 2008, 2644-2647.
- Dockstader SL, Berg MJ, Tekalp AM: Stochastic kinematic modeling and feature extraction for gait analysis. IEEE Transactions on Image Processing 2003, 12(8):962-976. doi:10.1109/TIP.2003.815259
- Lichtenauer J, Reinders M, Hendriks E: Influence of the observation likelihood function on particle filtering performance in tracking applications. Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition (FGR '04), May 2004, 767-772.
- Herda L, Urtasun R, Fua P: Hierarchical implicit surface joint limits for human body tracking. Computer Vision and Image Understanding 2005, 99(2):189-209. doi:10.1016/j.cviu.2005.01.005
- Urtasun R, Fleet DJ, Fua P: 3D people tracking with Gaussian process dynamical models. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), June 2006, 238-245.
- Husz Z, Wallace A: Evaluation of a hierarchical partitioned particle filter with action primitives. Proceedings of the 2nd Workshop on Evaluation of Articulated Human Motion and Pose Estimation, 2007.
- Kotecha JH, Djuric PM: Gibbs sampling approach for generation of truncated multivariate Gaussian random variables. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '99), March 1999, 1757-1760.
- Canton-Ferrer C, Casas J, Pardàs M, Monte E: Towards a fair evaluation of 3D human pose estimation algorithms. Technical University of Catalonia; 2009.
- Cheng S, Trivedi M: Articulated body pose estimation from voxel reconstructions using kinematically constrained Gaussian mixture models: algorithm and evaluation. Proceedings of the 2nd Workshop on Evaluation of Articulated Human Motion and Pose Estimation, 2007.
- Brubaker M, Fleet D, Hertzmann A: Physics-based human pose tracking. Proceedings of the Workshop on Evaluation of Articulated Human Motion and Pose Estimation, 2006.
- Münderman L, Corazza S, Andriacchi T: Markerless human motion capture through visual hull and articulated ICP. Proceedings of the Workshop on Evaluation of Articulated Human Motion and Pose Estimation, 2006.
- Poppe R: Evaluating example-based pose estimation: experiments on the HumanEva sets. Proceedings of the 2nd Workshop on Evaluation of Articulated Human Motion and Pose Estimation, 2007.
- Okada R, Soatto S: Relevant feature selection for human pose estimation and localization in cluttered images. Proceedings of the European Conference on Computer Vision, 2008.
- Canton-Ferrer C, Casas JR, Pardàs M: Voxel based annealed particle filtering for markerless 3D articulated motion capture. Proceedings of the 3rd IEEE Conference on 3DTV (3DTV-CON '09), May 2009.
- Mikič I: Human body model acquisition and tracking using multi-camera voxel data, Ph.D. dissertation. University of California, San Diego, Calif, USA; 2003.
- Caillette F, Howard T: Real-time markerless human body tracking with multi-view 3-D voxel reconstruction. Proceedings of the British Machine Vision Conference, 2004, 2: 597-606.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.