Research Article (Open Access)

# Novel Kernel-Based Recognizers of Human Actions

Somayeh Danafar^{1} (email author), Alessandro Giusti^{1}, and Jürgen Schmidhuber^{1}

*EURASIP Journal on Advances in Signal Processing* **2010**:202768

https://doi.org/10.1155/2010/202768

© Somayeh Danafar et al. 2010

**Received:** 16 January 2010 · **Accepted:** 8 April 2010 · **Published:** 16 May 2010

The Erratum to this article has been published in EURASIP Journal on Advances in Signal Processing 2012 2012:124

## Abstract

We study unsupervised and supervised recognition of human actions in video sequences. The videos are represented by probability distributions and then meaningfully compared in a probabilistic framework. We introduce two novel approaches outperforming state-of-the-art algorithms when tested on the KTH and Weizmann public datasets: an *unsupervised* nonparametric kernel-based method exploiting the Maximum Mean Discrepancy test statistic; and a *supervised* method based on Support Vector Machine with a characteristic kernel specifically tailored to histogram-based information.

## Keywords

- Video Sequence
- Optical Flow
- Action Recognition
- Reproducing Kernel Hilbert Space
- Speaker Verification

## 1. Introduction

Huge video archives require advanced video analysis to automatically interpret, understand, and summarize the semantics of video contents. In this paper we focus on localizing and categorizing different human actions in surveillance videos.

The task is challenging, as visual perceptions of such events are very high-dimensional, and huge intraclass variations are common due to viewpoint changes, camera motion, occlusions, clothing, cluttered backgrounds, and geometric and photometric object distortions.

Most recognition approaches consist of three steps:

- (i)
extraction of features from the video data,

- (ii)
if necessary, dimensionality reduction of feature vectors, by means of techniques such as PCA,

- (iii)
classification of the sequence.

Our main contributions are new techniques for the third step, classification.

In our experimental evaluation we consider two different state-of-the-art feature descriptors, which have been described in action recognition systems providing top-tier results on the publicly available KTH [1, 2] and Weizmann [3] datasets. By using our proposed classification algorithm with such features, we manage to further improve classification results on the same datasets.

In order to be sufficiently powerful to descriptively represent video content, such features are high dimensional. This is commonly handled by using kernel-based methods, which allow one to perform classification implicitly in a reduced space.

Our kernel-based framework addresses two problems:

- (i)
unsupervised clustering of sequences from unlabeled data, given the desired number of clusters;

- (ii)
supervised classification of new input sequences, given a set of labeled training sequences.

Both approaches build on *characteristic kernels*, which enable injective embedding of probability distributions into a reproducing kernel Hilbert space [4–7]. The distance between embedded distributions is known as the Maximum Mean Discrepancy (MMD) [8, 9], whose best-established application is homogeneity testing.

- (i)
The first main contribution of this paper is the novel use of MMD as a homogeneity test for unsupervised action recognition (see Figure 1). Its encouraging performance, exceeding the best results in the literature, suggests that our classification technique is well suited to action recognition problems, and manages to capture differences between classes while being robust to the significant appearance variations in the provided datasets. This is in accordance with several works in the literature, where MMD has been successfully applied to unsupervised tasks in several different applications (see Section 2).

- (ii)
The second main contribution is in the supervised case: we use an SVM-based approach (see Figure 2), with a novel characteristic kernel specifically tailored for histogram-based data. Also in this context, we provide experimental evidence that selecting an appropriate kernel leads to significant performance gains.

By representing video sequences by means of probability distributions of feature vectors associated with video frames, we implicitly disregard frame ordering; this property is shared by several other approaches exploiting bag-of-features techniques [10, 11], and allows us to bypass the problem of determining the initial or final times of an action, while at the same time taking advantage of the action periodicity.

We review related literature in Section 2. In Section 3 we illustrate the Maximum Mean Discrepancy, which is the core of our unsupervised method, and review the definition of characteristic kernels. In Section 4 we address characteristic kernels defined on Abelian semigroups, which yield a characteristic kernel suited to the histogram-based feature descriptors used in our supervised method. We present the general framework of our unsupervised and supervised approaches in Section 5. In Section 6, we give a brief overview of the feature extraction approaches used in our experimental validation, which is described in Section 7 on the KTH and Weizmann datasets; there we also discuss computational cost. Lastly, we draw conclusions and discuss future work in Section 8.

## 2. Related Works

Many different approaches have been proposed so far for action recognition (a recent review is given in [12]). We provide a broad classification in the following.

### 2.1. Features for Action Recognition

*Shape-based* approaches attempt to extract silhouettes of actors and then recognize the actions by analyzing such data [3, 13–16]. One inherent disadvantage of this class of techniques is that they cannot capture the internal motion of the object within the silhouette region. More importantly, even state-of-the-art background subtraction techniques are unable to reliably recover precise silhouettes, especially in dynamic environments, which reduces the robustness of techniques in this class.

*Flow-based* techniques estimate the optical flow field between adjacent frames and use such features for action recognition; they provide the important advantage of requiring no background subtraction. A pioneering algorithm in this category was proposed by Efros et al. [17], who reported results on a database of images taken at a distance. Shechtman and Irani [18] use a template matching approach to correlate the flow consistency between the template and the video. Danafar and Gheissari [19] proposed an optical-flow-based algorithm which has the advantages of both holistic (looking at the human body as a whole) and body-part-based approaches. This is one of the two descriptors used in this paper, and is outlined in Section 6. Jhuang et al. [20] extract dense local motion information with a set of flow filters. The responses are pooled locally, and converted to higher-level responses using complex learned templates. These templates are pooled again, and fed into a discriminative classifier.

In order to design features robust to changes in camera view and variability in the speed of actions, some researchers proposed space-time interest point features [1, 2, 10]. Dollár et al. [21] present a spatiotemporal interest point detector based on 1D-Gabor filters, which identifies regions with sudden or periodic intensity changes in time. Thereafter for each 3D interest region, optical flow descriptors are obtained. A fixed set of 3D visual words is compared with a histogram of a new sequence of visual words by a nearest neighbor approach. Ke et al. [22] also presented a new spatiotemporal shape and flow correlation algorithm for action recognition which works on oversegmented videos and does not require background subtraction.

Using both form and flow features simultaneously is also suggested in the seminal work of Giese and Poggio [23], which describes the strategy of biological systems: form and motion are processed simultaneously but independently in two separate pathways. However, their implementation of such a system is designed for simple, schematic stimuli.

The approach is taken further by Schindler and Van Gool [24], who investigate the detection of actions from very short sequences called snippets. The motion pathway extracts optic flow at different scales, directions, and speeds. In the form pathway, they apply Gabor filters at multiple orientations and scales. In both pathways, the filter responses are MAX-pooled and compared to a set of learned templates. The similarities from both pathways are concatenated into a feature vector and classified with a bank of linear SVM classifiers. In our approach, we use this powerful feature descriptor, computed on each pair of frames independently, as the input of our classification algorithm.

### 2.2. Classification for Action Recognition

Many classification techniques are proposed in literature, both supervised and unsupervised.

In [25], the authors propose compound features that are assembled from simple 2D corners in both space and time. Compound features are learned in a weakly supervised manner using a data mining algorithm. Several researchers have explored unsupervised methods for motion analysis. Hoey [26] applies a hierarchical dynamic Bayesian network model to recognize facial expressions in an unsupervised manner. Zhong et al. [27] have proposed an unsupervised approach to detect unusual activity in video sequences: a simple descriptor vector is computed for each frame, and the video is clustered by looking at co-occurrences of motion and appearance patterns; their method identifies spatially isolated clusters as unusual activity. In [28], the authors detect abnormal activities by means of a multi-observation Hidden Markov Model, using spectral clustering for unsupervised training of behavior models. Boiman and Irani [29] explain a video sequence using patches from a database; as dense sampling of the patches is necessary in their approach, the resulting algorithm is very time consuming and impractical for action recognition. Wang et al. [30] propose an unsupervised learning approach to discover the set of action classes present in a large collection of training images; these action classes are then used to label test images. The features are based on the coarse shape of human figures, and the distance between a pair of images is computed using a linear programming relaxation technique; spectral clustering is performed on the resulting distances. Niebles et al. [11] present an unsupervised learning method for human action categories. Their algorithm automatically learns the probability distribution of the spatiotemporal words corresponding to each action category, and builds a model for each class. This is achieved using latent topic models such as the probabilistic Latent Semantic Analysis (pLSA) model and Latent Dirichlet Allocation (LDA).

Many researchers use supervised and discriminative approaches for the classification stage, particularly Support Vector Machines with a kernel appropriate to the feature descriptors [2, 19, 24, 31, 32]. Other approaches represent videos using sparse spatiotemporal words, then summarized in a histogram; in such approaches, the temporal order of frames is disregarded, a property shared by our approach. Nowozin et al. [31] propose a sequential representation which retains the temporal order: they introduce discriminative subsequence mining to find optimal discriminative subsequence patterns, extending the PrefixSpan subsequence mining algorithm [33] in combination with LPBoost [34].

Maximum Mean Discrepancy as a statistical test has applications in a variety of areas. For instance, in bioinformatics we might wish to find whether different procedures applied in different labs to the same tissue yield different DNA microarray data [35]. In databases, attribute matching has been used for merging heterogeneous databases [8]. In speaker verification, such a test can be used to decide whether a speech sample corresponds to a person for whom previously recorded speech is available [36]. In this paper we propose a novel use of MMD as an unsupervised action recognition method.

## 3. The Maximum Mean Discrepancy

In this section we briefly recall the theoretical foundations of MMD: in Section 5, we show how it is employed in our context.

Recent studies [5, 6, 8] have shown that mapping random variables into a suitable reproducing kernel Hilbert space (RKHS) gives a powerful and straightforward method of dealing with higher-order statistics of the variables. The idea is to do linear statistics in the RKHS and derive their meaning in the original space. One basic statistic on Euclidean space is the *mean*; by embedding distributions into the RKHS, the corresponding quantity is the *mean element*, introduced by Gretton et al. [8, 9]. The distance between mapped mean elements is known as the Maximum Mean Discrepancy (MMD). One well-defined application of MMD is homogeneity testing, or the two-sample test: the two-sample problem tests whether two probability measures $p$ and $q$ coincide.

Definition 1.

Let $\mathcal{H}$ be an RKHS on the separable metric space $X$, with a continuous feature mapping $\phi(x) \in \mathcal{H}$ for each $x \in X$. The inner product between feature mappings is given by the positive definite kernel function $k(x, x') = \langle \phi(x), \phi(x') \rangle$. We assume that the kernel is bounded. Let $\mathcal{P}$ be the set of Borel probability measures on $X$.

The definition of MMD is explained in the following theorems [8, 9].

Theorem 2.

Let $\mathcal{F}$ be the unit ball in the RKHS $\mathcal{H}$. Then the Maximum Mean Discrepancy between $p, q \in \mathcal{P}$ is

$$\mathrm{MMD}[\mathcal{F}, p, q] = \sup_{\|f\|_{\mathcal{H}} \leq 1} \left( \mathbf{E}_{x \sim p}[f(x)] - \mathbf{E}_{y \sim q}[f(y)] \right) = \left\| \mu_p - \mu_q \right\|_{\mathcal{H}},$$

where $\mu_p = \mathbf{E}_{x \sim p}[\phi(x)]$ and $\mu_q = \mathbf{E}_{y \sim q}[\phi(y)]$ are the mean elements of $p$ and $q$.

In practice, because we do not have access to the population distributions $p$ and $q$, we compare two sets of data drawn from the populations. The homogeneity test becomes a problem of testing whether two samples of random variables are generated from the same distribution. Given such samples, the unbiased empirical estimate of the MMD is defined as follows.

Definition 3.

Given samples $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_m\}$, drawn independently and identically distributed from $p$ and $q$, respectively, the unbiased estimate of MMD is the one-sample $U$-statistic

$$\mathrm{MMD}_u^2[X, Y] = \frac{1}{m(m-1)} \sum_{i \neq j} h(z_i, z_j),$$

where $z_i = (x_i, y_i)$, $h(z_i, z_j) = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)$, and $m$ is the sample size.

The *biased* estimate $\mathrm{MMD}_b$ is obtained by replacing the $U$-statistic in the above equation with a $V$-statistic (the sum then also includes the terms $i = j$).
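As a concrete reference, the unbiased estimator above can be sketched in a few lines of NumPy; the Gaussian RBF kernel and its bandwidth `sigma` are illustrative choices, not values prescribed by the paper:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    """Unbiased squared-MMD estimate for equally sized samples x, y.

    Implements the U-statistic: diagonal (i == j) terms are excluded,
    which is exactly what removes the bias of the plug-in estimate.
    """
    m = x.shape[0]
    kxx = gaussian_kernel(x, x, sigma)
    kyy = gaussian_kernel(y, y, sigma)
    kxy = gaussian_kernel(x, y, sigma)
    sum_xx = kxx.sum() - np.trace(kxx)  # sum over i != j of k(x_i, x_j)
    sum_yy = kyy.sum() - np.trace(kyy)  # sum over i != j of k(y_i, y_j)
    sum_xy = kxy.sum() - np.trace(kxy)  # sum over i != j of k(x_i, y_j)
    return (sum_xx + sum_yy - 2.0 * sum_xy) / (m * (m - 1))
```

Identical samples give exactly 0, while well-separated distributions give a clearly positive value; keeping the diagonal terms instead yields the biased $V$-statistic variant.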

In the two-sample test, we require both a measure of distance between probabilities and a notion of whether this distance is statistically significant. The former is given in Theorem 2. For the latter, we give an expression for the asymptotic distribution of this distance measure, from which a significance threshold may be obtained. More precisely, we conduct a hypothesis test with null hypothesis $H_0: p = q$ and alternative hypothesis $H_1: p \neq q$. We must therefore specify a threshold that the empirical MMD will exceed with small probability when $p = q$.

Theorem 4.

Assume $0 \leq k(x, y) \leq K$. Then, with probability at least $1 - 2\exp\!\left(-\varepsilon^2 m / (4K)\right)$,

$$\left| \mathrm{MMD}_b[X, Y] - \mathrm{MMD}[\mathcal{F}, p, q] \right| \leq 2\sqrt{K/m} + \varepsilon.$$

In [8] the proof of Theorem 4 is given by means of so-called *Rademacher averages*. We accept the null hypothesis $H_0$ if the value of $\mathrm{MMD}_b$ satisfies the inequality in Corollary 5, and reject it otherwise.

Corollary 5.

Under $H_0$ ($p = q$), with probability at least $1 - \alpha$,

$$\mathrm{MMD}_b[X, Y] < \sqrt{2K/m}\left(1 + \sqrt{2 \ln \alpha^{-1}}\right),$$

where $\alpha$ is the user-defined significance level of the test statistic.

In practice we use the looser significance threshold defined in Corollary 5; empirically, the boundary is estimated with the bootstrap method of Gretton et al. [8, 9] on the aggregated data. From a theoretical point of view, a tighter significance threshold for our two-sample test is obtained from an expression of the asymptotic distribution. The following theorem shows that the unbiased empirical version of MMD asymptotically converges to the population value of MMD, and provides the threshold.

Theorem 6.

Under $H_0$, the statistic $m\,\mathrm{MMD}_u^2[X, Y]$ converges in distribution to

$$\sum_{l=1}^{\infty} \lambda_l \left( z_l^2 - 2 \right),$$

where the $z_l \sim \mathcal{N}(0, 2)$ are i.i.d. and the $\lambda_l$ are the eigenvalues of the integral operator associated with $\tilde{k}$, the centered RKHS kernel.

The goal is to determine whether the empirical test statistic is large enough to fall outside the $(1 - \alpha)$ quantile of the null distribution (consistency of the resulting test is guaranteed by the form of the distribution under $H_1$). One way to estimate this quantile is using the bootstrap on the aggregated data [8, 9].
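The bootstrap estimate of the null quantile can be sketched as follows; the statistic is passed in as a function, and the significance level and resampling count are illustrative defaults, not the paper's settings:

```python
import numpy as np

def bootstrap_threshold(x, y, stat_fn, alpha=0.05, n_boot=500, seed=0):
    """Estimate the (1 - alpha) quantile of the null distribution of a
    two-sample statistic by bootstrapping on the aggregated data: pool
    the two samples, repeatedly re-split the pool at random (as if
    p = q held), and recompute the statistic on each re-split."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    m = x.shape[0]
    null_stats = []
    for _ in range(n_boot):
        idx = rng.permutation(pooled.shape[0])
        null_stats.append(stat_fn(pooled[idx[:m]], pooled[idx[m:]]))
    return float(np.quantile(null_stats, 1.0 - alpha))

# Decision rule: reject H0 (p = q) when the observed statistic
# exceeds the estimated threshold.
```

Any two-sample statistic can be plugged in as `stat_fn`, for instance the empirical squared MMD.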

Clearly, the quality of MMD as a statistic depends on the richness of the RKHS $\mathcal{H}$, which is defined by a measurable kernel $k$. The class of *characteristic kernels*, introduced in [4, 5], gives an RKHS in which probabilities have unique images. The necessary and sufficient condition for a kernel to be characteristic is expressed in the following lemma.

Lemma 7.

Let $(X, \mathcal{B})$ be a measurable space, $k$ a bounded measurable positive definite kernel on $X$, and $\mathcal{H}$ the associated RKHS. Then $k$ is characteristic if and only if $\mathcal{H} + \mathbb{R}$ (the sum of $\mathcal{H}$ and the constant functions) is dense in $L^2(P)$ for every probability measure $P$ on $X$.

The definition of a characteristic kernel generalizes the well-known property of characteristic functions, which uniquely determine a Borel probability measure. The Gaussian RBF kernel is a famous example of a characteristic kernel on the entire space $\mathbb{R}^d$. We use this kernel in the present work for unsupervised action recognition, whereas in the supervised case we introduce a different characteristic kernel in the following section.

## 4. Characteristic Kernels on Abelian Semigroups

Our supervised action recognition approach, outlined in Section 5, is based on SVM. The crucial condition a kernel must satisfy to be suitable for SVM is positive definiteness, which ensures that the SVM problem is convex, and hence that the solution of its objective function is unique. Positive definite kernels are defined as follows.

Definition 8.

A symmetric function $k: X \times X \to \mathbb{R}$ is a *positive definite kernel* if, for every $n \in \mathbb{N}$, every $x_1, \ldots, x_n \in X$, and every $c_1, \ldots, c_n \in \mathbb{R}$,

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j k(x_i, x_j) \geq 0.$$

An example is the *Histogram Intersection (HI)* kernel, first introduced in computer vision by Swain and Ballard [37]:

$$k_{HI}(A, B) = \sum_{i=1}^{N} \min(a_i, b_i),$$

where $A = (a_1, \ldots, a_N)$ and $B = (b_1, \ldots, b_N)$ are two $N$-bin histograms. This kernel was successfully used as a similarity measure for image retrieval and recognition tasks [38, 39]. In [38] it is proved that, for histograms of the same size with integer values, $k_{HI}$ is a positive definite kernel.

The *Generalized Histogram Intersection (GHI) kernel* was later introduced as a positive definite kernel:

$$k_{GHI}(A, B) = \sum_{i=1}^{N} \min\left(|a_i|^{\beta}, |b_i|^{\beta}\right),$$

where $\beta > 0$. If we set $\beta = 1$, then $k_{HI}$ is a special case of $k_{GHI}$, which is a positive definite kernel for arbitrary real-valued histograms.
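For concreteness, both kernels can be written directly from their definitions; this small NumPy sketch treats histograms as 1-D arrays, and the exponent `beta` is the GHI parameter:

```python
import numpy as np

def hi_kernel(a, b):
    """Histogram Intersection kernel: sum of bin-wise minima."""
    return float(np.minimum(a, b).sum())

def ghi_kernel(a, b, beta=1.0):
    """Generalized Histogram Intersection kernel: bin-wise minima of
    |a_i|^beta and |b_i|^beta. With beta = 1 it recovers HI on
    nonnegative histograms; any beta > 0 preserves positive
    definiteness."""
    return float(np.minimum(np.abs(a) ** beta, np.abs(b) ** beta).sum())
```

For example, `hi_kernel([1, 2, 3], [3, 2, 1])` sums `min(1,3) + min(2,2) + min(3,1)`, giving 4.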

Characteristic kernels are positive definite and have been shown to be more discriminative, because they take higher-order statistics into account. For instance, in [40] Fukumizu et al. showed that by optimizing kernel mappings one can find the most predictive subspace in regression. We verify this in practice in Section 7, where we show that the characteristic kernel provides significantly better performance than an HI kernel. Previously, characteristic kernels have been defined on $\mathbb{R}^d$. However, the kernel should be chosen according to the nature of the available data: in our supervised recognition case, just like in many other computer vision tasks, features are histogram-based, and are not naturally represented in $\mathbb{R}^d$.

Therefore, we investigate whether characteristic kernels can be defined on spaces other than $\mathbb{R}^d$. Several such domains constitute topological groups or semigroups; this is relevant in our context, as histograms are examples of Abelian semigroups.

Fukumizu et al. [6] introduced characteristic kernels on groups and semigroups by establishing certain conditions. In this section we first recall the Bochner theorem, which characterizes the set of continuous shift-invariant positive definite kernels on $\mathbb{R}^d$ via the Fourier transform. We then present the related theorems, which define characteristic kernels for Abelian semigroups by replacing the Fourier transform in the Bochner theorem with the Laplace transform. The purpose is to introduce a class of characteristic kernels for histograms, which are examples of Abelian semigroups.

Theorem 9 (Bochner).

A continuous shift-invariant kernel $k(x, y) = \phi(x - y)$ on $\mathbb{R}^d$ is positive definite if and only if $\phi$ is the Fourier transform of a finite nonnegative Borel measure $\Lambda$ on $\mathbb{R}^d$:

$$\phi(x) = \int_{\mathbb{R}^d} e^{-i x^{\top} \omega} \, d\Lambda(\omega).$$

Before explaining the related theorem on semigroups, we briefly review the definition of a semigroup.

Definition 10.

A *semigroup* $(S, \cdot)$ is a nonempty set $S$ equipped with an associative binary operation. A semigroup is *Abelian* if the operation is also commutative; histograms with bin-wise addition, $(\mathbb{R}_{\geq 0}^{n}, +)$, are an example.

Theorems 11 and 12 [6] give necessary and sufficient conditions for tailored kernels on Abelian semigroups.

Theorem 11.

A continuous function $\phi$ on the Abelian semigroup $(\mathbb{R}_{\geq 0}^{n}, +)$ defines a positive definite kernel $k(x, y) = \phi(x + y)$ if and only if $\phi$ is the Laplace transform of a finite nonnegative measure $\Lambda$ on $\mathbb{R}_{\geq 0}^{n}$:

$$\phi(x) = \int_{\mathbb{R}_{\geq 0}^{n}} e^{-\langle x, \omega \rangle} \, d\Lambda(\omega).$$

Based on the above theorem, we have the following sufficient condition of characteristic property.

Theorem 12.

Let $\phi$ be a positive definite function given by the equation in Theorem 11. If $\operatorname{supp}(\Lambda) = \mathbb{R}_{\geq 0}^{n}$, then the positive definite kernel $k(x, y) = \phi(x + y)$ is characteristic.

As histograms are an example of Abelian semigroups, we take advantage of Theorems 11 and 12 and define the following *Histogram Characteristic (HC)* kernel.

Histogram Characteristic Kernel

Our proposed HC kernel provides significantly better performance than both the HI kernel (which is positive definite but not characteristic) and the Gaussian kernel (which is characteristic but not tailored to histogram-based information).

## 5. Unsupervised and Supervised Action Recognition

In this paper, we are applying the theoretical findings reported in the previous sections to two different problems: unsupervised and supervised action recognition.

In the *unsupervised* case, we aim at clustering unlabeled sequences belonging to the same action, assuming that the number of clusters is known. In this problem we use MMD with a Gaussian kernel, with the threshold introduced in Theorem 4; the threshold is determined automatically, in such a way as to return the required number of clusters. We set the significance level $\alpha$ of MMD as a two-sample test to 0.05. The reported results are the percentage acceptance rate over 1000 runs of the MMD test. Clusters are found by pairwise comparisons (two-sample tests) of the distributions corresponding to sequences: two sequences belong to the same cluster if and only if the MMD is close enough to 0 (the threshold is computed as in Theorem 4 and Corollary 5). For each cluster, a single representative distribution is then chosen. Thereafter, a new sequence can be classified by comparing, with the same approach, its related probability distribution to the representative distribution of each of the clusters (see Figure 1).

For *supervised* action recognition, we use as a learning algorithm an SVM with the characteristic kernel introduced in Section 4. The dataset is divided into three parts: training, testing, and validation. The validation data is first used to tune the kernel parameter with a leave-one-out cross-validation procedure; according to the results of cross-validation, the parameter is tuned to 0.001 for the HC kernel and to 1 for the GHI kernel (which then reduces to the HI kernel). We then use the training data to obtain the support vectors which define the discriminative classifier. Lastly, the testing data is processed to evaluate the performance of the classifier (prediction; see Figure 2).
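The unsupervised pipeline just described (pairwise two-sample comparisons against cluster representatives) can be sketched as follows; the biased MMD estimate, the fixed threshold, and the greedy first-fit assignment are simplifying assumptions for illustration, not the paper's tuned procedure:

```python
import numpy as np

def gaussian_mmd2_biased(x, y, sigma=1.0):
    """Biased squared-MMD estimate with a Gaussian RBF kernel."""
    def mean_k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2)).mean()
    return mean_k(x, x) + mean_k(y, y) - 2.0 * mean_k(x, y)

def cluster_by_mmd(sequences, threshold, sigma=1.0):
    """Greedy clustering by pairwise two-sample tests: each sequence
    (an array of per-frame feature vectors) joins the first cluster
    whose representative it matches (MMD^2 below threshold); otherwise
    it founds a new cluster and becomes its representative."""
    reps, labels = [], []
    for seq in sequences:
        for ci, rep in enumerate(reps):
            if gaussian_mmd2_biased(seq, rep, sigma) < threshold:
                labels.append(ci)
                break
        else:
            reps.append(seq)
            labels.append(len(reps) - 1)
    return labels
```

A new sequence is classified the same way: compare its distribution to each cluster representative and pick the one the test accepts.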

## 6. Feature Extraction Approaches

We evaluated our classification approach with two state-of-the-art feature descriptors, which we will refer to as F1 and F2 in the following. They have been described in recent literature and shown to have excellent performance on the action recognition task.

## 7. Experiments and Evaluation

In order to gather experimental evidence that supports our proposed approach, we used two public datasets frequently referenced in the action recognition literature: the KTH human action database [1, 2] and the Weizmann human action dataset [3].

The KTH dataset contains 2391 sequences of 6 types of human actions: walking, jogging, running, boxing, hand waving, and hand clapping. These actions were performed by 25 people in four different scenarios: outdoors (s1), outdoors with scale variations (s2), outdoors with different clothes (s3), and indoors (s4). Some samples from this dataset are shown in Figure 1.

### 7.1. Unsupervised Classification

We tested unsupervised classification on both databases using feature F1. Mirroring the experimental validation in [12], we considered a 27-frame sequence for each of the Weizmann videos, and a 17-frame sequence for KTH videos. We used the same bounding box data as in [12]. In particular, in the Weizmann dataset the fixed-size bounding boxes can be trivially extracted by considering a simple background subtraction algorithm. In the more challenging KTH dataset, bounding boxes for all frames are linearly interpolated from known initial and final positions.

Because of the small number of actors in the Weizmann dataset, we evaluate the results with leave-one-out cross-validation. First, 72 unlabeled sequences from 8 subjects are used for recovering the 9 clusters in an unsupervised way; then, the 9 sequences from the one remaining subject are used for testing generalization capability. The procedure is repeated for all 9 permutations. In the larger KTH dataset, we used a single partition of 16 subjects for clustering and 9 for testing generalization capability.

Comparison of recognition results on the Weizmann dataset with different approaches.

Comparison of recognition results on the KTH dataset with different approaches. Note that the recognition rate reported by Jhuang et al. [20] is obtained on video sequences from scenarios 1 and 4 only; other reported rates are on all scenarios.

| Method | Classification | Recognition rate % |
|---|---|---|
| MMD | Unsupervised | 94.4 |
| SVM with charac. kernel | Supervised | 93.1 |
| Schindler and Van Gool [24] | Supervised | 92.7 |
| Jhuang et al. [20] | Supervised | 91.7 |
| Nowozin et al. [31] | Supervised | 87 |
| Wong and Cipolla [32] | Supervised | 86.6 |
| Danafar and Gheissari [19] | Supervised | 85.3 |
| Niebles et al. [11] | Unsupervised | 83.3 |
| Dollár et al. [21] | Supervised | 81.2 |
| Schüldt et al. [2] | Supervised | 71.7 |

On the larger KTH dataset, training and testing took, respectively, 287 and 100 seconds on a mid-level dual-core laptop. The computational complexity is quadratic with respect to the number of frames in each sequence [8, 9]. The overall acceptance rate of the test, representing the similarity of two sequences, was computed from 100 runs of each homogeneity test.

### 7.2. Supervised Classification

For supervised classification, we worked with feature descriptor F2. On the KTH dataset, we considered subsequences of at most 150 frames, which are all summarized in a single feature vector. This feature has proven to be less powerful than F1, yielding on the KTH dataset a supervised recognition rate of 93.1%; this is lower than the 94.4% rate we obtained in the unsupervised case using the F1 features.

It is interesting to compare the effect of characteristic kernels, which we are using in this paper, to histogram intersection kernels, which are not characteristic and are widely used in the computer vision literature [37–39] for classifying histogram-based data. In fact, as reported in Section 4, characteristic kernels bear important advantages from the theoretical point of view. Our results confirm such advantage in this practical application. Our reported accuracy of 93.1%, obtained with characteristic kernels, is a very significant improvement with respect to the accuracy of 85.3% reported in [19], obtained using histogram intersection kernels in the same setting.

Therefore, we can conclude that our experimental results are due to our kernel being both characteristic and suitable for histogram-based data; removing either of the two properties results in a significant performance loss.

Training and testing on the KTH dataset required 8 and 2 seconds, respectively. During the testing phase, the complexity is linear with the number of support vectors. The complexity of the training phase is dominated by the solution of a quadratic optimization problem.

The standard SVM dual is

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i, j = 1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \; \sum_{i=1}^{n} \alpha_i y_i = 0,$$

where $k$ is the kernel. The difficulty of solving the above problem is the density of the matrix $Q_{ij} = y_i y_j k(x_i, x_j)$, whose elements are in general not zero. To overcome this problem, a decomposition method is implemented, which works on a small subset of variables at a time and keeps memory requirements manageable [46].

In our case we deal with multiclass classification, using the one-vs-one procedure. Thus, if $m$ is the number of classes (actions), $m(m-1)/2$ pairwise comparisons are needed (in our case $m = 6$, giving 15 comparisons).
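The count of binary classifiers follows directly from the one-vs-one scheme; this tiny helper is purely illustrative:

```python
def num_pairwise_classifiers(m: int) -> int:
    """Number of binary SVMs trained by the one-vs-one multiclass
    scheme: one per unordered pair of classes, i.e. m choose 2."""
    return m * (m - 1) // 2
```

For the 6 KTH actions this gives 15 pairwise classifiers.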

## 8. Conclusions

We introduced two novel kernel-based approaches to human action recognition:

- (i)
an unsupervised nonparametric kernel method based on Maximum Mean Discrepancy,

- (ii)
a supervised method using Support Vector Machines with a novel proper characteristic kernel for Abelian semigroups.

On the two major data sets for action recognition, our approaches outperformed those found in the literature, both in the unsupervised and supervised case.

The new characteristic kernel is suitable for histograms, and may be useful for many other computer vision problems involving histogram-based features.


## Declarations

### Acknowledgment

This work was partially funded by SNF Sinergia grant number CRSIKO_122697/1.


## References

1. Laptev I, Lindeberg T: Local descriptors for spatio-temporal recognition. *Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV '03)*, 2003.
2. Schüldt C, Laptev I, Caputo B: Recognizing human actions: a local SVM approach. *Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04)*, August 2004, 32-36.
3. Blank M, Gorelick L, Shechtman E, Irani M, Basri R: Actions as space-time shapes. *Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05)*, October 2005, 1395-1402.
4. Fukumizu K, Bach FR, Jordan MI: Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. *Journal of Machine Learning Research* 2004, 5: 73-99.
5. Fukumizu K, Gretton A, Sun X, Schölkopf B: Kernel measures of conditional dependence. *Advances in Neural Information Processing Systems* 2008, 20: 489-496.
6. Fukumizu K, Sriperumbudur BK, Gretton A, Schölkopf B: Characteristic kernels on groups and semigroups. *Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS '08)*, 2008.
7. Sriperumbudur BK, Gretton A, Fukumizu K, Lanckriet G, Schölkopf B: Injective Hilbert space embeddings of probability measures. In *Proceedings of the 21st Annual Conference on Learning Theory (COLT '08)*, July 2008. Edited by: Servedio R, Zhang T. Springer; 111-122.
8. Gretton A, Borgwardt K, Rasch M, Smola A, Schölkopf B: A kernel method for the two-sample problem. In *Proceedings of the 19th Conference on Advances in Neural Information Processing Systems*, 2006, Vancouver, Canada. Edited by: Schölkopf B, Platt J, Hoffman T. MIT Press; 513-520.
9. Gretton A, Borgwardt K, Rasch M, Smola A, Schölkopf B: *A kernel method for the two-sample problem.* Tech. Rep. 157, Max-Planck-Institut for Biological Cybernetics; 2008.
10. Niebles JC, Wang H, Fei-Fei L: Unsupervised learning of human action categories using spatio-temporal words. *Proceedings of the British Machine Vision Conference (BMVC '06)*, 2006.
11. Niebles JC, Wang H, Fei-Fei L: Unsupervised learning of human action categories using spatial-temporal words. *International Journal of Computer Vision* 2008, 79(3): 299-318.
12. Turaga P, Chellappa R, Subrahmanian VS, Udrea O: Machine recognition of human activities: a survey. *IEEE Transactions on Circuits and Systems for Video Technology* 2008, 18(11): 1473-1488.
13. Bobick AF, Davis JW: The recognition of human movement using temporal templates. *IEEE Transactions on Pattern Analysis and Machine Intelligence* 2001, 23(3): 257-267.
14. Carlson S, Sullivan J: Action recognition by shape matching to key frames. *Proceedings of the Workshop on Models versus Exemplars in Computer Vision*, 2001.
15. Wang L, Suter D: Recognizing human activities from silhouettes: motion subspace and factorial discriminative graphical model. *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07)*, June 2007, Minneapolis, Minn, USA, 1-8.
16. Yilmaz A, Shah M: Actions sketch: a novel action representation. In *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05)*, June 2005. IEEE Computer Society; 984-989.
17. Efros AA, Berg AC, Mori G, Malik J: Recognizing action at a distance. *Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV '03)*, October 2003, 726-733.
18. Shechtman E, Irani M: Space-time behavior based correlation. *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05)*, June 2005, 405-412.
19. Danafar S, Gheissari N: Action recognition for surveillance application using optic flow and SVM. *Proceedings of the 8th Asian Conference on Computer Vision (ACCV '07)*, November 2007, Tokyo, Japan.
20. Jhuang H, Serre T, Wolf L, Poggio T: A biologically inspired system for action recognition. *Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV '07)*, October 2007, Rio de Janeiro, Brazil.
21. Dollár P, Rabaud V, Cottrell G, Belongie S: Behavior recognition via sparse spatio-temporal features. *Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS '05)*, October 2005, 65-72.
22. Ke Y, Sukthankar R, Hebert M: Spatio-temporal shape and flow correlation for action recognition. *Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07)*, June 2007, 1-8.
23. Giese MA, Poggio T: Neural mechanisms for the recognition of biological movements. *Nature Reviews Neuroscience* 2003, 4(3): 179-192.
24. Schindler K, Van Gool L: Action snippets: how many frames does human action recognition require? *Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08)*, June 2008, 1-8.
25. Gilbert A, Illingworth J, Bowden R: Scale invariant action recognition using compound features mined from dense spatio-temporal corners. *Proceedings of the 10th European Conference on Computer Vision (ECCV '08)*, 2008, Marseille, France, 222-233.
26. Hoey J: Hierarchical unsupervised learning of facial expression categories.
*Proceedings of the IEEE Workshop on Detection and Recognition of Action Video, 2001*99-106.View ArticleGoogle Scholar - Zhong H, Shi J, Visontai M: Detecting unusual activity in video.
*Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), July 2004*819-826.Google Scholar - Xiang T, Gong S: Video behaviour profiling and abnormality detection without manual labelling.
*Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), October 2005, Beijing, China*1238-1245.View ArticleGoogle Scholar - Boiman O, Irani M: Detecting irregularities in images and in video. In
*Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), October 2005*.*Volume 1*. IEEE Computer Society; 462-469.View ArticleGoogle Scholar - Wang Y, Jiang H, Drew MS, Li Z-N, Mori G: Unsupervised discovery of action classes.
*Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), June 2006*2: 1654-1661.Google Scholar - Nowozin S, Bakir G, Tsuda K: Discriminative subsequence mining for action classification.
*Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV '07), October 2007, Rio de Janeiro, Brazil*Google Scholar - Wong S-F, Cipolla R: Extracting spatiotemporal interest points using global information.
*Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV '07), October 2007, Rio de Janeiro, Brazil*1-8.Google Scholar - Pei J, Han J, Mortazavi-Asl B,
*et al*.: Mining sequential patterns by pattern-growth: the PrefixSpan approach.*IEEE Transactions on Knowledge and Data Engineering*2004, 16(11):1424-1440. 10.1109/TKDE.2004.77View ArticleGoogle Scholar - Demiriz A, Bennett KP, Shawe-Taylor J: Linear programming boosting via column generation.
*Machine Learning*2002, 46(1–3):225-254.View ArticleMATHGoogle Scholar - Borgwardt KM, Gretton A, Rasch MJ, Kriegel H-P, Schölkopf B, Smola AJ: Integrating structured biological data by kernel maximum mean discrepancy.
*Bioinformatics*2006, 22(14):e49-e57. 10.1093/bioinformatics/btl242View ArticleGoogle Scholar - Harchaoui Z, Bach F, Moulines E: Testing for homogeneity with kernel Fisher discriminant analysis.
*Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS '08), December 2008, Vancouver, Canada*20: 609-616.Google Scholar - Swain MJ, Ballard DH: Color indexing.
*International Journal of Computer Vision*1991, 7(1):11-32. 10.1007/BF00130487View ArticleGoogle Scholar - Odone F, Barla A, Verri A: Building kernels from binary strings for image matching.
*IEEE Transactions on Image Processing*2005, 14(2):169-180.MathSciNetView ArticleGoogle Scholar - Boughorbel S, Tarel J-P, Boujemaa N: Generalized histogram intersection kernel for image recognition.
*Proceedings of the IEEE International Conference on Image Processing (ICIP '05), September 2005, Genoa, Italy*161-164.Google Scholar - Fukushima K: Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.
*Biological Cybernetics*1980, 36(4):193-202. 10.1007/BF00344251View ArticleMATHGoogle Scholar - Field DJ: Relations between the statistics of natural images and the response properties of cortical cells.
*Journal of the Optical Society of America A*1987, 4(12):2379-2394. 10.1364/JOSAA.4.002379View ArticleGoogle Scholar - Casile A, Giese MA: Critical features for the recognition of biological motion.
*Journal of Vision*2005, 5(4):348-360.View ArticleGoogle Scholar - Dalal N, Triggs B: Histograms of oriented gradients for human detection.
*Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), October 2005*886-893.Google Scholar - Black MJ, Anandan P: The robust estimation of multiple motions: parametric and piecewise-smooth flow fields.
*Computer Vision and Image Understanding*1996, 63(1):75-104. 10.1006/cviu.1996.0006View ArticleGoogle Scholar - Harris C, Stephens M: A combined corner and edge detector.
*Proceedings of the 4th Alvey Vision Conference, 1988, Manchester, UK*147-151.Google Scholar - Chang C-C, Lin C-J: LIBSVM: a library for Support Vector Machines. 2009, http://www.csie.ntu.edu.tw/~cjlin/libsvm/

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.