Human Action Recognition Using Ordinal Measure of Accumulated Motion
© Wonjun Kim et al. 2010
Received: 14 December 2009
Accepted: 1 February 2010
Published: 12 April 2010
This paper presents a method for recognizing human actions from a single query action video. We propose an action recognition scheme based on the ordinal measure of accumulated motion, which is robust to variations in appearance. To this end, we first define the accumulated motion image (AMI) using image differences. The AMI of the query action video is then resized to a subimage by intensity averaging, and a rank matrix is generated by ordering the sample values in the subimage. By computing the distances from the rank matrix of the query action video to the rank matrices of all local windows in the target video, local windows close to the query action are detected as candidates. To find the best match among the candidates, their energy histograms, obtained by projecting AMI values in the horizontal and vertical directions, respectively, are compared with those of the query action video. The proposed method does not require any preprocessing task such as learning or segmentation. To justify the efficiency and robustness of our approach, experiments are conducted on various datasets.
There are two types of human action recognition models: learning-based models and template-based models. The former essentially requires a reliable action dataset to build a classifier, whereas the latter uses a single template (i.e., it is training-free) to find the query action in target video sequences. Since it is hard to maintain a large dataset in real applications, recent algorithms for human action recognition tend to be template-based. In this sense, we also propose a template-based action recognition method for static camera applications.
The main contributions of the proposed method are summarized as follows. First, the accumulated motion image (AMI) is defined using image differences to represent the spatiotemporal features of occurring actions. It should be emphasized that only areas containing changes are meaningful for computing the AMI, instead of the whole silhouette of the human body as in previous methods [4, 5]. Thus, a segmentation task such as background subtraction to obtain the silhouette of the human body is not required in our method. Second, we propose to employ the ordinal measure of accumulated motion for detecting query actions in target video sequences. Our method is motivated by earlier work using the ordinal measure for detecting image and video copies [6, 7], in which the authors show that the ordinal measure is robust to various modifications of the original images. Thus, it can be employed to cope with variations in appearance for accurate action recognition. Finally, the energy histograms, obtained by projecting AMI values in the horizontal and vertical directions, are used to determine the best match among the local windows detected as candidates close to the query action.
The rest of this paper is organized as follows: related work is briefly summarized in Section 2. The technical details of the steps outlined above are explained in Section 3. Various real videos are tested to justify the efficiency and robustness of our proposed method in Section 4, followed by the conclusion in Section 5.
2. Review of Related Work
Human action recognition has been widely studied for the last several decades. Bobick and Davis [8] propose temporal templates as models for actions. They construct two vector images, that is, the motion energy image (MEI) and the motion history image (MHI), which are designed to encode a variety of motion properties. In detail, an MEI is a cumulative binary motion image, whereas an MHI indicates how recently pixels have moved. Finally, these view-specific templates are matched against the models of query actions. Schüldt et al. [9] use the space-time interest points proposed in [10] to represent motion patterns and integrate such representations with SVM classification schemes. Ikizler et al. [11] propose to use line and optical flow histograms for human action recognition. In particular, they introduce a new shape descriptor based on the distribution of lines fitted to the silhouette of the human body. In [12], the authors define the integral video to efficiently calculate 3D spatiotemporal volumetric features and train cascaded classifiers to select features and recognize human actions. Hu et al. [13] use the MHI along with the foreground image obtained by background subtraction and the histogram of oriented gradients (HOG) [14] to obtain discriminative features for action recognition. They then build a multiple-instance learning framework to improve the performance. The authors of [15] propose to use mixture particle filters and then cluster the particles using local nonparametric clustering. However, these approaches require supervised learning based on a large, reliable dataset before recognizing human actions.
Yilmaz and Shah [17] encode both shape and motion features to represent 3D action models. More specifically, they treat actions as 3D objects in space-time and compute action descriptors by analyzing the differential geometric properties of the spatiotemporal volume. Gorelick et al. [18] also treat actions as space-time shapes induced by silhouettes in the space-time volume. Unlike [17], they use blobs obtained by background subtraction instead of contours. However, these silhouette-based approaches require accurate background subtraction.
A recent trend in human action recognition has been toward template-based models, as mentioned. Shechtman and Irani [19] introduce a novel similarity measure based on the correlation of behavior. They use intensity values in small space-time patches. In detail, a space-time video template for the query action consists of such small space-time patches. It is correlated against a larger target video sequence by checking its consistency with every video segment to find the best match with the given query action. Furthermore, they propose to measure similarity between actions by matching internal self-similarities [20]. Ning et al. [21] propose a hierarchical space-time framework enabling efficient search for desirable actions. Similar to [19], they also use the correlation between the query action template and candidates in the target video. However, these approaches may be unstable in noisy environments. In [3], the authors propose space-time local steering kernels (LSK) to represent volumetric features. They efficiently compare the 3D LSK features of the query action against those obtained from the target video sequences using a matrix generalization of the cosine similarity measure. Although shape information is well captured by the LSK features, they are hard to apply to real-time applications due to their high dimensionality.
Basically, our approach is template-based. Unlike previous methods, the ordinal measure employed in our method generalizes easily across appearance variations due to different clothing and body shapes. Further technical details are presented in the following section.
3. Proposed Method
3.1. Accumulated Motion Image (AMI)
Since the accumulated motion differs across various actions, it can be regarded as a discriminative feature for recognizing human actions. Based on this observation, we introduce a new feature, the AMI, which enables an efficient representation of the accumulated motion.
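The AMI equation itself does not appear in this excerpt. Below is a minimal NumPy sketch of one plausible realization, assuming the AMI is the per-pixel temporal average of absolute differences between consecutive grayscale frames; the exact accumulation and normalization used in the paper are assumptions here.

```python
import numpy as np

def accumulated_motion_image(frames):
    """Accumulated motion image (AMI) of a clip.

    frames : (T, H, W) uint8 grayscale array.
    Returns an (H, W) float array: the temporal average of absolute
    differences between consecutive frames (an assumed form), so only
    changing regions contribute, not whole silhouettes.
    """
    frames = frames.astype(np.float32)
    diffs = np.abs(np.diff(frames, axis=0))  # (T-1, H, W) frame differences
    return diffs.mean(axis=0)                # average accumulated motion
```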
3.2. Ordinal Measure for Detecting Candidates
To this end, the AMI is first resized to a subimage by intensity averaging, as shown in Figure 4. Let us define the rank matrix of the resized AMI for the query action video as $\pi_q = (r_1, r_2, \ldots, r_K)$, where $K$ equals the number of blocks in the resized subimage; it is set to 9 (a $3 \times 3$ partition) in our implementation. For example, the rank matrix of the query action can be represented as in Figure 4, and each element of the rank matrix, $r_k \in \{1, \ldots, K\}$, is the rank of the $k$th block average. Thus, the accumulated motion of the query video is effectively encoded in a single rank matrix.
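A hedged sketch of this step, assuming a $3 \times 3$ block partition (so $K = 9$, matching the implementation described above); the function name and the tie-breaking rule are our own choices, not the paper's:

```python
import numpy as np

def rank_matrix(ami, grid=(3, 3)):
    """Resize the AMI to a grid of blocks by intensity averaging,
    then rank the block averages (1 = smallest).

    ami  : (H, W) float AMI.
    grid : (M, N) block layout; K = M * N = 9 in the paper's setup.
    Returns a flat (K,) array of ranks in 1..K.
    """
    M, N = grid
    H, W = ami.shape
    h, w = H // M, W // N
    # Average intensity of each block (remainder rows/cols truncated).
    sub = ami[:h * M, :w * N].reshape(M, h, N, w).mean(axis=(1, 3)).ravel()
    # Rank the K block averages; ties broken by scan order (assumption).
    ranks = np.empty(M * N, dtype=np.int32)
    ranks[np.argsort(sub, kind="stable")] = np.arange(1, M * N + 1)
    return ranks
```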
The distance between the rank matrix of the query action and that of the local window at position $(x, y)$ in frame $t$ of the target video is computed with the 1-norm:

$$d_t(x, y) = \sum_{k=1}^{K} \left| \pi_q(k) - \pi_t^{(x,y)}(k) \right|, \quad t \geq L, \qquad (3)$$

where $L$ denotes the length of the query action video, as mentioned, and $k$ denotes the index of the rank matrix. The 1-norm is known to be more robust to outliers than the 2-norm [24] and is also computed efficiently. The rank matrix of the query action is applied consistently to compute the distance, regardless of the frame and local window indexes of the target video, as shown in (3). Finally, if the distance defined in (3) is smaller than a threshold, the corresponding local window is detected as a candidate close to the query action. It is important to note that the comparison between the rank matrices of the query action video and the local windows is conducted only after the initial $L$ frames in (3). This is because at least $L$ frames are required to generate a reliable AMI for each local window and thus an accurate comparison; hence, the latest $L$ frames of the target video need to be stored. However, it should be emphasized that computing (3) over all local windows in each target video frame is very fast, since only rank matrices are used as features for the similarity measure, instead of full 3D feature vectors (i.e., the spatiotemporal cubes used in [3, 19]).
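A sketch of the candidate search under these definitions, reusing `accumulated_motion_image` and `rank_matrix` from the sketches above; the window stride and the threshold value are placeholders, not the paper's settings:

```python
import numpy as np

def detect_candidates(pi_q, target_frames, win_hw, step=8, thresh=6):
    """Scan every local window of the current target clip and keep
    those whose rank matrix is close to the query's in 1-norm, Eq. (3).

    pi_q          : (K,) rank matrix of the query AMI.
    target_frames : (L, H, W) latest L frames of the target video,
                    where L is the query clip length.
    win_hw        : (h, w) local-window size (= query frame size).
    thresh        : distance threshold (placeholder value).
    Returns a list of (x, y) top-left corners of candidate windows.
    """
    h, w = win_hw
    L, H, W = target_frames.shape
    ami = accumulated_motion_image(target_frames)  # one AMI per clip
    candidates = []
    for y in range(0, H - h + 1, step):
        for x in range(0, W - w + 1, step):
            pi_t = rank_matrix(ami[y:y + h, x:x + w])
            if np.abs(pi_q - pi_t).sum() < thresh:  # 1-norm, Eq. (3)
                candidates.append((x, y))
    return candidates
```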
3.3. Determination of the Best Match Using Energy Histograms
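As described above, the energy histograms are obtained by projecting AMI values in the horizontal and vertical directions, and the best match is the candidate whose histograms are closest to those of the query. A minimal sketch, assuming unit-sum normalization and a 1-norm comparison (both assumptions, since the exact metric is not specified in this excerpt):

```python
import numpy as np

def energy_histograms(ami):
    """Project the AMI along its rows and columns to obtain the
    horizontal and vertical energy histograms (unit-sum normalized)."""
    h_hist = ami.sum(axis=0)  # profile along the horizontal axis
    v_hist = ami.sum(axis=1)  # profile along the vertical axis
    return h_hist / (h_hist.sum() + 1e-8), v_hist / (v_hist.sum() + 1e-8)

def best_match(query_ami, candidate_amis):
    """Pick the candidate whose energy histograms are closest to the
    query's; the 1-norm comparison is an assumption, not the paper's
    stated metric."""
    qh, qv = energy_histograms(query_ami)
    dists = [np.abs(qh - ch).sum() + np.abs(qv - cv).sum()
             for ch, cv in (energy_histograms(a) for a in candidate_amis)]
    return int(np.argmin(dists))
```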
For the sake of completeness, the overall procedure of our proposed method is summarized in Algorithm 1.
Algorithm 1: Human action recognition using ordinal measure of accumulated motion
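Algorithm 1 itself is not reproduced in this excerpt; the following is a hypothetical end-to-end sketch assembling the helpers above in that order, with all parameter values as placeholders:

```python
import numpy as np

def recognize(query_frames, target_frames, step=8, d_thresh=6):
    """End-to-end sketch in the spirit of Algorithm 1 (assembled from
    the sketches above; parameters are placeholders, not the paper's).

    query_frames  : (L, h, w) query action clip.
    target_frames : (L, H, W) latest L frames of the target video.
    Returns the best-matching window corner (x, y), or None.
    """
    # 1. AMI and rank matrix of the query clip.
    q_ami = accumulated_motion_image(query_frames)
    pi_q = rank_matrix(q_ami)
    # 2. Ordinal-measure scan for candidate windows, Eq. (3).
    cands = detect_candidates(pi_q, target_frames, q_ami.shape, step, d_thresh)
    if not cands:
        return None
    # 3. Energy-histogram verification among the candidates.
    t_ami = accumulated_motion_image(target_frames)
    h, w = q_ami.shape
    cand_amis = [t_ami[y:y + h, x:x + w] for x, y in cands]
    return cands[best_match(q_ami, cand_amis)]
```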
4. Experimental Results
In this section, we divide the experiments into three phases. First, we test our proposed method on the Weizmann dataset [16] to evaluate its robustness and discriminability. Second, the performance of query action recognition among multiple actions is evaluated. Finally, the performance of our method in real applications such as surveillance scenarios and event retrieval is evaluated.
4.1. Robustness and Discriminability
The robustness determines the reliability of the system, which can be represented by the accuracy of query action detection before false detections begin to occur, whereas the discriminability concerns its ability to reject irrelevant actions so that false detections do not occur. To evaluate the robustness and discriminability of our proposed method, we employ the Weizmann human action dataset [16], one of the most widely used standard datasets. This dataset contains a total of ten actions performed by nine people (i.e., 90 videos), which can be divided into two categories: global actions (run, forward jump, side jump, skip, and walk) and local actions (bend (bd), jack (jk), vertical jump (vjp), one-hand wave (wv1), and two-hand wave (wv2)). Since most events observed in static camera applications are related to local actions, we focus on the five local actions in the Weizmann dataset (see Figure 3).
The two threshold values used for candidate detection and determination of the best match are set empirically. The size of the local windows is set equal to the image size of the query action video. Note that spatial and temporal scale changes of up to 20% can be handled by our method. The evaluation framework has been implemented in Visual Studio 2005 (C++) with the FFmpeg library, which is used for MPEG and Xvid decoding. The experiments are performed on a low-end PC (Core 2 Duo, 1.8 GHz). The test videos in the Weizmann dataset have an image size of 180 × 144 pixels. The query action video for each local action is cropped from one of the nine videos of the corresponding action in our experiment. Since our algorithm achieves a processing speed of about 45 fps on the test videos, it is well suited to real-time applications.
4.2. Recognition Performance in Multiple Actions
In this subsection, we demonstrate the recognition accuracy of the proposed method using our own videos captured in different environments (i.e., indoor and outdoor). In particular, the performance of query action recognition among multiple actions is evaluated.
4.3. Recognition Performance for Real Applications
Most standard action datasets, including the Weizmann dataset, are captured in well-controlled environments, while actions in the real world often occur in much more complex scenes; hence, there exists a considerable gap between these samples and real-world scenarios.
False positive rate of each selected video.
A novel method for human action recognition has been proposed in this paper. Compared to previous methods, our algorithm is very fast, owing to the simple ordinal measure of accumulated motion. To this end, the AMI is first defined using image differences. Then the rank matrix is generated based on the relative ordering of resized AMI values, and the distances from the rank matrix of the query action video to the rank matrices of all local windows in the target video are computed. To determine the best match among the candidates close to the query action, we propose to use the energy histograms obtained by projecting AMI values in the horizontal and vertical directions, respectively. Finally, experiments are performed on diverse videos to justify the efficiency and robustness of the proposed method. The classification results of our algorithm are comparable to those of state-of-the-art methods and, furthermore, the proposed method can be used in real-time applications. Our future work is to extend the algorithm to describe human actions in dynamic scenes.
This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2010-(C1090-1011-0003)).
1. Briassouli A, Kompatsiaris I: Robust temporal activity templates using higher order statistics. IEEE Transactions on Image Processing 2009, 18(12):2756-2768.
2. Boiman O, Irani M: Detecting irregularities in images and in video. Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), October 2005, Beijing, China, 1:462-469.
3. Seo HJ, Milanfar P: Detection of human actions from a single example. Proceedings of the International Conference on Computer Vision (ICCV '09), October 2009.
4. Chandrashekhar VH, Venkatesh KS: Action energy images for reliable human action recognition. Proceedings of the Asian Symposium on Information Display (ASID '06), October 2006, 484-487.
5. Ahmad M, Lee S-W: Recognizing human actions based on silhouette energy image and global motion description. Proceedings of the 8th IEEE International Conference on Automatic Face and Gesture Recognition (FG '08), September 2008, Amsterdam, The Netherlands, 1-6.
6. Kim C: Content-based image copy detection. Signal Processing: Image Communication 2003, 18(3):169-184.
7. Kim C, Vasudev B: Spatiotemporal sequence matching for efficient video copy detection. IEEE Transactions on Circuits and Systems for Video Technology 2005, 15(1):127-132.
8. Bobick AF, Davis JW: The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 2001, 23(3):257-267.
9. Schüldt C, Laptev I, Caputo B: Recognizing human actions: a local SVM approach. Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), August 2004, Cambridge, UK, 3:32-36.
10. Laptev I, Lindeberg T: Space-time interest points. Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV '03), October 2003, Nice, France, 1:432-439.
11. Ikizler N, Cinbis RG, Duygulu P: Human action recognition with line and flow histograms. Proceedings of the 19th International Conference on Pattern Recognition (ICPR '08), December 2008, Tampa, Fla, USA, 1-4.
12. Ke Y, Sukthankar R, Hebert M: Efficient visual event detection using volumetric features. Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), October 2005, Beijing, China, 1:166-173.
13. Hu Y, Cao L, Lv F, Yan S, Gong Y, Huang TS: Action detection in complex scenes with spatial and temporal ambiguities. Proceedings of the International Conference on Computer Vision (ICCV '09), October 2009.
14. Dalal N, Triggs B: Histograms of oriented gradients for human detection. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005, San Diego, Calif, USA, 1:886-893.
15. Dhillon PS, Nowozin S, Lampert CH: Combining appearance and motion for human action classification in videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '09), June 2009, Miami, Fla, USA, 22-29.
16. Blank M, Gorelick L, Shechtman E, Irani M, Basri R: Actions as space-time shapes. Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), October 2005, Beijing, China, 2:1395-1402.
17. Yilmaz A, Shah M: Actions sketch: a novel action representation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005, San Diego, Calif, USA, 1:984-989.
18. Gorelick L, Blank M, Shechtman E, Irani M, Basri R: Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 2007, 29(12):2247-2253.
19. Shechtman E, Irani M: Space-time behavior based correlation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005, San Diego, Calif, USA, 1:405-412.
20. Shechtman E, Irani M: Matching local self-similarities across images and videos. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), June 2007, Minneapolis, Minn, USA, 1-8.
21. Ning H, Han TX, Walther DB, Liu M, Huang TS: Hierarchical space-time model enabling efficient search for human actions. IEEE Transactions on Circuits and Systems for Video Technology 2009, 19(6):808-820.
22. Han J, Bhanu B: Individual recognition using gait energy image. IEEE Transactions on Pattern Analysis and Machine Intelligence 2006, 28(2):316-322.
23. Yu S, Tan T, Huang K, Jia K, Wu X: A study on gait-based gender classification. IEEE Transactions on Image Processing 2009, 18(8):1905-1909.
24. Rousseeuw PJ, Leroy AM: Robust Regression and Outlier Detection. John Wiley & Sons, New York, NY, USA; 1987.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.