3D hand tracking using Kalman filter in depth space
© Park et al; licensee Springer. 2012
Received: 1 June 2011
Accepted: 17 February 2012
Published: 17 February 2012
Hand gestures are an important type of natural language used in many research areas such as human-computer interaction and computer vision. Hand gesture recognition requires the prior determination of the hand position through detection and tracking. One of the most efficient strategies for hand tracking is to use 2D visual information such as color and shape. However, visual-sensor-based hand tracking methods are very sensitive to variable lighting conditions. Also, because hand movements are made in 3D space, the recognition performance of hand gestures using 2D information is inherently limited. In this article, we propose a novel real-time 3D hand tracking method in depth space using a 3D depth sensor and a Kalman filter. We detect hand candidates using motion clusters and a predefined wave motion, and track hand locations using the Kalman filter. To verify the effectiveness of the proposed method, we compare its performance with that of a visual-based method. Experimental results show that the proposed method outperforms the visual-based method.
Recently, human-computer interaction (HCI) technology has drawn attention as a promising man-machine communication method. Advances in HCI have been driven by associated developments in computing power, sensors, and display techniques [1, 2].
Interest in human-to-human communication modalities for HCI has also increased. These include movements of human hands and arms. Human hand gestures are a form of non-verbal communication that ranges from simple pointing to complex interactions between people. The main advantage of hand gestures is the ability to communicate at a distance. The use of hand gestures for HCI requires that the configuration of the human hand be measurable by the computer. The performance depends highly on the accuracy of detection and tracking of hand locations. Current hand detection and tracking methods use various sensors, including sensors attached directly to the hand, special feature gloves, and color or depth images [4–7].
Hand detection and tracking with an image sensor may be done with 2D or 3D information. However, because obtaining 3D information requires high computing power and costly equipment, 2D methods have been developed more extensively than 3D methods. Among 2D hand detection and tracking methods, the most common is the visual-based method, which uses information such as color, shape, and edges. Visual-based methods can be categorized as color-based and template-based. The color-based method starts by finding a hand region using color information (RGB, HSV, YCbCr). Then, a color histogram is built from the detected hand. Based on this color histogram, regions similar in color to the hand can be tracked [8, 9]. The template-based method creates an edge image from the color or gray image. The edge image is matched to a trained hand template, and the hand is then tracked.
However, hand movements generally occur in 3D space. A 2D method can use only 2D information, which discards the movement information along the z-axis. This is an inherent limitation of 2D methods. Recently, equipment for obtaining 3D information has become faster, more accurate, and more cost-effective. This equipment includes depth sensors such as ToF cameras and PrimeSensor. After the emergence of this equipment, real-time 3D hand tracking methods developed rapidly. For example, Breuer et al. used an infra-red ToF camera to create a near real-time gesture recognition system. Grest et al. proposed a human motion tracking method using a combination of depth and silhouette information.
In this article, we propose a novel real-time 3D hand tracking method in depth space using PrimeSensor with a Kalman filter. We generate the motion image from the depth image. Then, we detect hand candidates using motion clusters and a predefined wave motion, and track hand locations using the Kalman filter.
The organization of this article is as follows. In Section 2, related works are briefly reviewed. In Section 3, the preprocessing of depth information and the proposed hand detection and tracking method are described. In Section 4, several experiments with our hand tracking system are reported. Finally, we conclude the article in Section 5.
2.1. Visual-based hand tracking
There are two well-known visual hand tracking methods: color-based and template-based. In color-based methods, after initial hand detection, color information is extracted from the specified initial region. This color information consists of RGB-space pixel colors, or is transformed into HSI-space pixel colors. In one such method, the color histogram is built from the hue and saturation values of the region, and the resulting histogram is then used for hand tracking. In template-based methods, the initial hand is found by matching the whole image against a prepared, trained hand template. The template is then moved around the neighborhood of the initial hand region to find the matching point of the hand. This process is repeated for every frame.
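As an illustration of the color-based idea, the following minimal sketch builds a hue-saturation histogram from an initial hand region and scores new pixels against it. It is not the implementation of any cited method: the function names `hs_histogram` and `backproject`, the 8 × 8 bin layout, and the sample pixel values are assumptions chosen for the example.

```python
import colorsys


def _hs_bin(pixel, h_bins, s_bins):
    """Map an (r, g, b) pixel in 0..255 to its (hue bin, saturation bin)."""
    r, g, b = pixel
    h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return min(int(h * h_bins), h_bins - 1), min(int(s * s_bins), s_bins - 1)


def hs_histogram(pixels, h_bins=8, s_bins=8):
    """Build a normalized hue-saturation histogram from a list of pixels."""
    hist = [[0.0] * s_bins for _ in range(h_bins)]
    for pixel in pixels:
        hi, si = _hs_bin(pixel, h_bins, s_bins)
        hist[hi][si] += 1.0
    total = float(len(pixels)) or 1.0  # avoid division by zero on empty input
    return [[count / total for count in row] for row in hist]


def backproject(pixel, hist):
    """Return the histogram probability of one pixel (high = hand-like color)."""
    hi, si = _hs_bin(pixel, len(hist), len(hist[0]))
    return hist[hi][si]


# Hypothetical initial hand region: a few identical skin-like pixels.
hand_region = [(200, 150, 120)] * 10
hist = hs_histogram(hand_region)
print(backproject((200, 150, 120), hist))  # hand-like color
print(backproject((0, 0, 255), hist))      # background-like color
```

Because the histogram is built from hue and saturation only, a shift in those values under different lighting moves pixels into other bins, which is the sensitivity discussed in the next paragraph.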
Visual-based methods are a natural approach to tracking. However, they are strongly affected by illumination conditions. When a color histogram or a skin-color probability density function is used, the RGB, hue, and saturation values may change with illumination, which can make it difficult to find and track the hand. Also, when part of the hand is occluded or shaded by an object, hand tracking can fail [16, 17].
2.2. Depth-based hand tracking
Depth-based hand tracking methods can be categorized into model-based and motion-based methods. Model-based hand tracking fits a 3D articulated model to the hand. The motion-based method uses hand motion in depth space.
Breuer et al. proposed model-based hand tracking in depth space. To estimate the location and orientation of the hand, principal component analysis is applied to the 3D points. These 3D points are subsequently fitted to an articulated hand model to refine the first estimate. Oikonomidis et al. proposed a system for model-based, full-degree-of-freedom hand model initialization and tracking in near real-time with Kinect. They optimized the hand model parameters to minimize the discrepancy between the appearance and 3D structure of hypothesized instances of a hand model and the actual hand observations. Bray et al. proposed a tracker based on stochastic meta-descent for optimization in high-dimensional state spaces. This algorithm is based on a gradient descent approach with adaptive, parameter-specific step sizes. The hand tracker is reinforced by integrating a deformable hand model based on linear blend skinning and anthropometrical measurements.
Among motion-based hand tracking methods, Holte et al. proposed a view-invariant gesture recognition system using a ToF camera. This method finds motion primitives from an accumulated image based on 3D data. It detects movement using a 3D version of 2D double differencing (subtracting the depth values pixel-wise in two pairs of depth images), followed by thresholding and accumulating.
2.3. Color information versus depth information
Advantages and disadvantages of color and depth information
Information | Advantages | Disadvantages
Color | Easy to find features | Sensitive to light conditions
Depth | Robust to light variation; provides real depth values | Hard to find features; noise at edges
2.4. Kalman filter
Kalman proposed a recursive method to solve the problem of linear filtering of discrete data. Because it is well suited to digital computing, the Kalman filter is applied in a variety of research fields and real application areas. The main procedure of the Kalman filter is to predict the state and then refine that prediction using the error between the prediction and the measurement.
Summary of Kalman filter
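For reference, the predict and correct steps of the discrete Kalman filter can be written in their standard textbook form as below. The notation is an assumption and is not taken from this article: $\hat{x}_k^-$ and $P_k^-$ denote the predicted state and error covariance, $A$ the state transition matrix, $B$ the control matrix, $H$ the measurement matrix, $Q$ and $R$ the process and measurement noise covariances, $z_k$ the measurement, and $K_k$ the Kalman gain.

```latex
\begin{align}
% Time update (predict)
\hat{x}_k^- &= A \hat{x}_{k-1} + B u_{k-1} \\
P_k^-       &= A P_{k-1} A^{T} + Q \\
% Measurement update (correct)
K_k         &= P_k^- H^{T} \left( H P_k^- H^{T} + R \right)^{-1} \\
\hat{x}_k   &= \hat{x}_k^- + K_k \left( z_k - H \hat{x}_k^- \right) \\
P_k         &= \left( I - K_k H \right) P_k^-
\end{align}
```

The first two equations project the state and its uncertainty forward in time. The last three weight the measurement residual $z_k - H \hat{x}_k^-$ by the gain to refine the prediction.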
3. Proposed method
The depth image from the depth sensor contains noise from various sources, such as reflectance and mismatched patterns. This noise is sometimes detected as real motion. Therefore, noise reduction should be performed before hand detection. Preprocessing also includes a clustering algorithm for initial hand detection.
3.1.1. Motion image (accumulated difference image)
We accumulate difference images to form the motion image. This accumulated image contains all movement, whether it comes from people, objects, or noise. Next, noise reduction, motion clustering, and hand detection are applied to this motion image.
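A minimal sketch of an accumulated difference image is given below, assuming depth frames stored as nested lists of depth values. The thresholding of each frame difference, the threshold value `diff_threshold`, and the function name `accumulate_motion` are assumptions made for illustration; the exact accumulation rule is not specified in the text.

```python
def accumulate_motion(frames, diff_threshold=30):
    """Accumulate thresholded differences of consecutive depth frames.

    frames: list of 2D lists of depth values (all the same size).
    Returns a binary 2D list: 1 where any consecutive pair of frames
    differs by more than diff_threshold, 0 elsewhere.
    """
    height, width = len(frames[0]), len(frames[0][0])
    motion = [[0] * width for _ in range(height)]
    for prev, curr in zip(frames, frames[1:]):
        for y in range(height):
            for x in range(width):
                if abs(curr[y][x] - prev[y][x]) > diff_threshold:
                    motion[y][x] = 1
    return motion


# Three tiny hypothetical depth frames (values in millimetres).
frames = [
    [[1000, 1000, 1000], [1000, 1000, 1000]],
    [[1000, 900, 1000], [1000, 1000, 1000]],
    [[1000, 900, 1000], [1000, 1000, 800]],
]
for row in accumulate_motion(frames):
    print(row)
```

Any pixel that changed in at least one frame pair is marked, so genuine motion and sensor noise both appear in the result, which is why the noise reduction step follows.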
3.1.2. Noise reduction
We use spatial filtering and morphological processing for noise reduction. When noise reduction is applied to the motion image, real motion shows up clearly. A median filter with a 5 × 5 aperture is used for spatial filtering. The median filter replaces each pixel value with the median value of the sub-image within the aperture. It provides excellent salt-and-pepper noise reduction with considerably less blurring than linear smoothing filters of similar size. Because the noise pattern of the motion image is very similar to salt-and-pepper noise, the median filter is very effective. We also use morphological processing for noise reduction, namely the opening operation, which consists of erosion followed by dilation. The basic effect of the opening operation is to shrink the outline of an object by erosion and then expand it again by dilation. Generally, this operation smooths contours, breaks narrow regions, and removes thin protrusions. Thus, the opening operation removes randomly generated noise and smooths the original image. The erosion operation strips a layer of pixels from objects or particles, removing irrelevant pixels and small particles from the image. The dilation operation does the inverse: it attaches a layer of pixels to objects or particles, and it can return eroded objects or particles to their original size. These operations are highly effective for noise reduction in the depth image.
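The two noise reduction steps can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: the 3 × 3 square structuring element, the border handling (clamping for the median filter, truncated windows for the morphology), and the function names are assumptions.

```python
from statistics import median


def median_filter(img, size=5):
    """Replace each pixel by the median of its size x size neighborhood.

    Border pixels are handled by clamping coordinates to the image.
    """
    h, w, r = len(img), len(img[0]), size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                      for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
            out[y][x] = median(window)
    return out


def _morph(img, size, pick):
    """Apply min (erosion) or max (dilation) over a size x size square."""
    h, w, r = len(img), len(img[0]), size // 2
    return [[pick(img[yy][xx]
                  for yy in range(max(y - r, 0), min(y + r + 1, h))
                  for xx in range(max(x - r, 0), min(x + r + 1, w)))
             for x in range(w)] for y in range(h)]


def opening(img, size=3):
    """Morphological opening: erosion followed by dilation."""
    return _morph(_morph(img, size, min), size, max)


def denoise(motion):
    """Median filtering (5 x 5) followed by opening, as described above."""
    return opening(median_filter(motion, 5), 3)
```

On a binary motion image, an isolated noise pixel is removed by either step, while a compact region that is at least as large as the structuring element survives the opening with its original extent.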
3.1.3. Motion clustering
In this section, we describe how to cluster motion regions in the motion image. First, we select connected components from the motion image. Then, the obtained connected components are clustered. These clusters are candidates for the hand. A selected cluster can be either real motion or noise. Noise clusters are usually small or frequently fragmented, so if the size of a cluster is smaller than a threshold, we classify it as a noise cluster and remove it.
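The connected-component selection and the size threshold can be sketched as below. The use of 8-connectivity, the breadth-first labelling, the threshold value `min_size`, and the helper names `motion_clusters` and `centroid` are assumptions for illustration; the article does not state these details.

```python
from collections import deque


def motion_clusters(motion, min_size=4):
    """Return 8-connected components of nonzero pixels.

    Components with fewer than min_size pixels are treated as noise
    clusters and dropped. Each cluster is a list of (row, col) tuples.
    """
    h, w = len(motion), len(motion[0])
    seen = [[False] * w for _ in range(h)]
    clusters = []
    for y0 in range(h):
        for x0 in range(w):
            if not motion[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue, pixels = deque([(y0, x0)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                # Visit the 3 x 3 neighborhood, clipped to the image.
                for ny in range(max(y - 1, 0), min(y + 2, h)):
                    for nx in range(max(x - 1, 0), min(x + 2, w)):
                        if motion[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(pixels) >= min_size:
                clusters.append(pixels)
    return clusters


def centroid(pixels):
    """Mean (row, col) position of a cluster."""
    n = len(pixels)
    return (sum(y for y, _ in pixels) / n, sum(x for _, x in pixels) / n)
```

The centroid of each surviving cluster is one possible reference point to pass on to hand detection and tracking.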
3.2. Initial hand detection
In the preprocessing section, we generated the motion image by accumulating difference images, reduced the noise in the motion image, and found the motion clusters. In this section, we find the hand cluster from the remaining clusters in the image shown in Figure 7b.
3.3. Hand tracking
Many object tracking methods are explained in [15, 27, 28]. Among these, the Kalman filter has the following advantages for hand tracking. The first is computational efficiency: because the recursive process needs only the previous state, and not the whole previous frame, the Kalman filter requires little storage for past data. The second is that the Kalman filter is suitable for handling a time-varying signal. Therefore, we apply the Kalman filter to hand tracking.
These vectors are the initial setting for the Kalman filter.
The nominated motion clusters should be fitted to the hand size which is found from the polynomial regression method. This nominated point is now the current hand cluster, and we store the reference point.
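To make the tracking loop concrete, the sketch below tracks a 3D hand position with one constant-velocity Kalman filter per axis. It should be read as an assumption-laden illustration rather than the filter used in this article: the constant-velocity model, the independence of the three axes, the noise parameters `q` and `r`, the initial covariance, and the class names `AxisKalman` and `HandTracker3D` are all hypothetical.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one coordinate.

    The state is (position, velocity); only position is measured.
    """

    def __init__(self, position, q=1e-2, r=1.0):
        self.p, self.v = float(position), 0.0
        # Covariance [[pp, pv], [pv, vv]] stored as three scalars.
        self.pp, self.pv, self.vv = 1.0, 0.0, 1.0
        self.q, self.r = q, r

    def predict(self, dt=1.0):
        """Time update with F = [[1, dt], [0, 1]] and Q = q * I."""
        self.p += dt * self.v
        self.pp += dt * (2.0 * self.pv + dt * self.vv) + self.q
        self.pv += dt * self.vv
        self.vv += self.q
        return self.p

    def correct(self, z):
        """Measurement update with H = [1, 0] and noise variance r."""
        s = self.pp + self.r                 # innovation variance
        kp, kv = self.pp / s, self.pv / s    # Kalman gain
        residual = z - self.p
        self.p += kp * residual
        self.v += kv * residual
        pp, pv, vv = self.pp, self.pv, self.vv
        self.pp = (1.0 - kp) * pp            # P = (I - K H) P
        self.pv = (1.0 - kp) * pv
        self.vv = vv - kv * pv
        return self.p


class HandTracker3D:
    """Three independent per-axis filters for the X, Y, and Z coordinates."""

    def __init__(self, x, y, z, q=1e-2, r=1.0):
        self.axes = [AxisKalman(c, q, r) for c in (x, y, z)]

    def step(self, measurement=None, dt=1.0):
        """Predict, then correct if a hand cluster point was found."""
        predicted = [axis.predict(dt) for axis in self.axes]
        if measurement is None:
            return tuple(predicted)
        return tuple(axis.correct(z) for axis, z in zip(self.axes, measurement))
```

In use, the tracker would be initialized at the detected hand point and `step` called once per frame with the reference point of the current hand cluster, or with `None` when no suitable cluster is found, in which case the prediction alone is returned. Only the previous state and covariance are stored between frames, which reflects the storage advantage noted above.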
4. Experimental results
The experimental environment is a PC with an Intel® Core™ i5 CPU 750 @ 2.67 GHz 2.66 GHz, and to obtain depth information we used PrimeSense's PrimeSensor development kit. The sensor obtains the depth image as follows: the IR light source of PrimeSensor projects an IR pattern, and the depth camera captures the pattern and creates the depth image. The sensor also provides a color image. The resolution of the depth image is VGA (640 × 480), and the maximum frame rate is 60 fps. The resolution of the color image is UXGA (1600 × 1200). The operating range is 0.6-3.5 m. In the proposed method, we use only the depth image.
4.1. Hand detection experiment
4.2. Depth-based hand tracking experiment
The first hand tracking experiment measures the X- and Y-axis errors of the hand tracking at three distances. We manually marked the central hand position as the ground truth and compared it with the result of the proposed hand tracking.
The result of tracking error in 2D
The second hand tracking experiment measures the error of the hand tracking along the Z-axis. For this experiment, we use two types of motion: a push motion, and a spring motion that draws a circle while pushing. For the push motion, we push twice in one experiment. For the spring motion, we draw three circles. We obtained the data sets for each experiment from 12 persons, with 10 trials per person.
The result of tracking error in 3D
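A per-axis tracking error of the kind measured in these experiments can be computed as in the sketch below. The exact error definition is not stated in the text, so the mean absolute difference per axis is an assumption, as are the function name `mean_axis_errors` and the sample coordinates.

```python
def mean_axis_errors(tracked, ground_truth):
    """Mean absolute error per axis between tracked and ground-truth points.

    Both arguments are equal-length lists of coordinate tuples,
    e.g. (x, y) for the 2D experiment or (x, y, z) for the 3D one.
    """
    n = len(tracked)
    dims = len(tracked[0])
    return tuple(
        sum(abs(t[d] - g[d]) for t, g in zip(tracked, ground_truth)) / n
        for d in range(dims)
    )


# Hypothetical tracked positions and manually marked ground truth (pixels).
tracked = [(10, 20), (12, 18)]
truth = [(11, 20), (10, 21)]
print(mean_axis_errors(tracked, truth))
```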
4.3. Depth-based hand tracking and color-based hand tracking
We compare the performance of depth-based and color-based hand tracking. We used Camshift for the color-based hand tracking. After the initial hand detection, the hand is tracked by the proposed method with depth information and, independently, by Camshift with color information. For the Camshift tracker, a 5 × 5 window centered on the initial hand point is used to extract the color histogram. The ground truth is measured from color information using a marker attached to the hand. Because the depth and color information are calibrated, we used the same ground-truth point for each tracking method.
The mean pixel error of depth-based hand tracking
The mean pixel error of color-based hand tracking
We proposed a novel hand detection and tracking method using depth information. The basic source of the proposed hand tracking system is the motion image, which is the accumulated difference image computed from depth image sequences. In the preprocessing stage, we perform noise reduction, by applying spatial filtering and morphological processing, and motion clustering, which extracts the moving regions from the motion image. We detect the hand from these motion clusters using a waving motion. We also propose a three-axis Kalman filter for tracking. The comparison between the proposed method and the color-based method shows the effectiveness of the proposed method. In particular, the depth-based method is very robust to variations in lighting. As future work, to improve tracking accuracy, more effective noise reduction methods or other tracking methods such as the unscented Kalman filter or the particle filter can be considered.
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2011-0016302). This research was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2010-0011472).
- Pavlovic VI, Sharma R, Huang TS: Visual interpretation of hand gestures for human-computer interaction: a review. IEEE Trans Pattern Anal Mach Intell 1997, 19(7):677-695. 10.1109/34.598226
- Just A, Marcel S: A comparative study of two state-of-the-art sequence processing techniques for hand gesture recognition. Comput Vis Image Understand 2009, 113(4):532-543. 10.1016/j.cviu.2008.12.001
- Ionescu B, Coquin D, Lambert P, Buzuloiu V: Dynamic hand gesture recognition using the skeleton of the hand. EURASIP J Appl Signal Process 2005, 2005: 2101-2109. 10.1155/ASP.2005.2101
- Sturman DJ, Zeltzer D: A survey of glove-based input. IEEE Comput Graph Appl 1994, 14(1):30-39.
- Wang RY, Popovic J: Real-time hand-tracking with a color glove. ACM Trans Graph 2009, 28(3):1-8.
- Shan C, Tan T, Wei Y: Real-time hand tracking using a mean shift embedded particle filter. Pattern Recogn 2007, 40(7):1958-1970. 10.1016/j.patcog.2006.12.012
- Li Z, Jarvis R: Real time hand gesture recognition using a range camera. In Proceedings of Australasian Conference on Robotics and Automation (ACRA). Sydney, Australia; 2009.
- Kjeldsen R, Kender J: Toward the use of gesture in traditional user interfaces. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition. Killington, VT, USA; 1996:151-156.
- Imagawa K, Lu S, Igi S: Color-based hands tracking system for sign language recognition. In IEEE International Conference on Automatic Face and Gesture Recognition. Nara, Japan; 1998:462-467.
- Stenger B, Thayananthan A, Torr PHS, Cipolla R: Model-based hand tracking using a hierarchical bayesian filter. IEEE Trans Pattern Anal Mach Intell 2006, 28(9):1372-1384.
- Breuer P, Eckes C, Muller S: Hand gesture recognition with a novel IR time-of-flight range camera--a pilot study. Lecture Note Comput Sci 2007, 4418: 247. 10.1007/978-3-540-71457-6_23
- Grest D, Kruger V, Koch R: Single view motion tracking by depth and silhouette information. In Proceedings of the Scandinavian Conference on Image Analysis. Aalborg, Denmark; 2007:719-729.
- Manresa C, Varona J, Mas R, Perales F: Hand tracking and gesture recognition for human-computer interaction. Electron Lett Comput Vis Image Anal 2005, 5(3):96-104.
- Yilmaz A, Javed O, Shah M: Object tracking: a survey. ACM Comput Surv 2006, 38(4):1-45.
- Moeslund TB, Hilton A, Kruger V: A survey of advances in vision-based human motion capture and analysis. Comput Vis Image Understand 2006, 104(2-3):90-126. 10.1016/j.cviu.2006.08.002
- Lu H, Plataniotis KN, Venetsanopoulos AN: A full-body layered deformable model for automatic model-based gait recognition. EURASIP J Adv Signal Process 2008, 2008: 13.
- Oikonomidis I, Kyriazis N, Argyros AA: Efficient model-based 3D tracking of hand articulations using Kinect. In British Machine Vision Conference. Dundee, UK; 2011:101.1-101.11.
- Bray M, Koller-Meier E, Muller P, Gool LV, Schraudolph NN: 3D hand tracking by rapid stochastic Gradient Descent using a skinning model. In 1st European Conference on Visual Media Production. London, UK; 2004:59-68.
- Holte MB, Moeslund TB, Fihl P: Fusion of range and intensity information for view invariant gesture recognition. In IEEE Computer Society Conference on Computer Vision & Pattern Recognition Workshops. Anchorage, AK, USA; 2008:1-7.
- Kalman RE: A new approach to linear filtering and prediction problems. Trans ASME J Basic Eng 1960, 82: 34-45.
- Bishop G, Welch G: An introduction to the Kalman filter. In SIGGRAPH 2001, Course 8. Los Angeles, CA, USA; 2001.
- Gonzalez RC, Woods RE: Digital Image Processing. 3rd edition. Prentice Hall, Upper Saddle River, NJ; 2008.
- Toh KA, Tran QL, Srinivasan D: Benchmarking a reduced multivariate polynomial pattern classifier. IEEE Trans Pattern Anal Mach Intell 2004, 26(6):740-755. 10.1109/TPAMI.2004.3
- Bradski GR, Davis JW: Motion segmentation and pose recognition with motion history gradients. Mach Vis Appl 2002, 13(3):174-184. 10.1007/s001380100064
- Munoz-Salinas R, Medina-Carnicer R, Madrid-Cuevas F, Carmona-Poyato A: Depth silhouettes for gesture recognition. Pattern Recogn Lett 2008, 29(3):319-329. 10.1016/j.patrec.2007.10.011
- Lewis FL: Optimal Estimation: With An Introduction to Stochastic Control Theory. Wiley, NY; 1986.
- Brown RG, Hwang PYC: Introduction to Random Signals and Applied Kalman Filtering. Wiley, NY; 1997.
- Bradski GR: Computer vision face tracking for use in a perceptual user interface. In IEEE workshop on Applications of Computer Vision. Princeton, NJ, USA; 1998:214-219.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.