- Open Access
Abnormal event detection in crowded scenes using histogram of oriented contextual gradient descriptor
© The Author(s). 2018
- Received: 16 January 2018
- Accepted: 30 July 2018
- Published: 24 August 2018
Detecting abnormal events in crowded scenes is an important but challenging task in computer vision. Contextual information is useful for discovering salient events in scenes; however, it cannot be characterized well by commonly used pixel-based descriptors, such as the HOG descriptor. In this paper, we propose contextual gradients between two local regions and then construct a histogram of oriented contextual gradient (HOCG) descriptor for abnormal event detection based on the contextual gradients. The HOCG descriptor is a distribution of contextual gradients of sub-regions in different directions, which can effectively characterize the compositional context of events. We conduct extensive experiments on several public datasets and compare the experimental results with those of state-of-the-art approaches. Qualitative and quantitative analyses of the experimental results demonstrate the effectiveness of the proposed HOCG descriptor.
- Abnormal event detection
- Histogram of oriented contextual gradients
As one of the key technologies in intelligent video surveillance, abnormal event detection (AED) has been actively researched in computer vision due to the increasing concern regarding public security and safety. A large number of cameras have been deployed in many public locations, such as campuses, shopping malls, airports, railway stations, subway stations, and plazas. Traditional video surveillance systems rely on a human operator to monitor scenes and find unusual or irregular events by observing monitor screens. However, watching surveillance video is a labor-intensive task. Therefore, significant efforts have been devoted to AED in video surveillance, and great progress has been made in recent years, which can free operators from exhausting and tedious tasks and thereby significantly save on labor costs.
AED in crowded scenes is fairly challenging due to many factors, such as frequent occlusion, heavy noise, cluttered and dynamic scenes, complexity and diversity of events, unpredictability, and contextual dependency. The aim of AED is to find unusual or prohibited events in a scene, which essentially amounts to identifying, via pattern recognition, the patterns that significantly deviate from a predefined model of normal patterns. The definition of an abnormal event is heavily dependent on how a normal event is modeled and which event description is applied. Therefore, the key component of a successful AED is event description, which is the organization of raw input data into various constructs that represent abstract properties of video data.
Traditional pixel-wise descriptors, such as the HOG descriptor, are normally employed to capture the appearance and/or motion of an event. However, these descriptors are unable to capture contextual information that is useful for discovering salient events in scenes. In contrast to pixel statistics, contextual information is macro-structure information, which reflects the compositional relationship among regions. Boiman et al. first exploited the distributions of cuboids inside a larger ensemble. The authors proposed an inference by composition (IBC) algorithm to compute the joint probability between a database and query ensemble. Although the algorithm was accurate, the computational burden was heavy. Roshtkhari et al. modeled the spatio-temporal composition of small cuboids in a large volume using a probabilistic model and detected abnormal events with irregular compositions in real time. Li et al. exploited the compositional context under a dictionary learning and sparse coding framework. Gupta et al. proposed a probabilistic model that exploits contextual information for visual action analysis to improve object recognition as well as activity recognition. However, these approaches use a learning framework with complicated inference processes rather than an efficient handcrafted descriptor to capture the contextual information.
In this paper, we extend the traditional gradient from pixel- to context-wise and thus propose a novel HOCG descriptor to capture contextual information for event description. Compared with the traditional HOG descriptor, the HOCG descriptor is the distribution of contextual gradients in different directions, which can reflect the compositional relationship among sub-regions within an event. The proposed HOCG descriptor is compact, flexible, and discriminative. We employ an online sparse reconstruction framework to identify abnormal events with high reconstruction costs. We conduct extensive experiments on different public datasets and make extensive comparisons with the HOG as well as other state-of-the-art descriptors to demonstrate the advantages of the proposed HOCG descriptor.
The main contributions of our work are as follows:
1) We extend the gradient computation from pixel- to context-wise. The contextual gradient is more descriptive and flexible than the pixel-wise gradient and is useful in finding salient events in scenes;
2) We construct a HOCG descriptor for event description in AED using the contextual gradients of sub-regions within an event. The HOCG descriptor can efficiently capture contextual information;
3) We conduct extensive experiments on different datasets to validate the effectiveness of the HOCG descriptor for AED.
The remainder of this paper is organized as follows. Section 2 gives a brief overview of related works regarding event descriptors in AED. Section 3 presents a detailed description of our proposed approach, including the principle of the context-wise gradient, construction of the HOCG descriptor, and AED using the HOCG descriptor. Experiments and results analysis are presented in Section 4, and Section 5 concludes the paper.
A trajectory-wise descriptor is high-level and robust and can accurately describe the spatial movement of objects. The trajectory of moving objects can be obtained by applying tracking methods, such as the Kalman filter and particle filter, among others. Then, normal trajectories are utilized to build the normal event model. Finally, the abnormal event is identified by measuring the deviation or probability of the testing trajectory with respect to the normal event models. Li et al. learned a dictionary using normal trajectories and then detected abnormal events according to the reconstruction error of their trajectory under the learned dictionary. Aköz et al. proposed a traffic event classification system that learned normal and common traffic flows by clustering vehicle trajectories. Laxhammar et al. proposed a method for online learning and sequential anomaly detection using trajectories. Bera et al. proposed a real-time anomaly detection method in low- to medium-density crowd videos using trajectory-wise behavior learning.
Pixel-wise descriptors can be directly extracted from scenes without requiring object detection and tracking and thus are frequently adopted for analyzing crowd events. HOG and histogram of optical flow (HOF) are two pixel-wise descriptors commonly used in previous works. Wang et al. applied HOF for AED. Zhao et al. utilized both HOF and HOG to describe the motion and appearance inside a volume. Bertini et al. employed a spatio-temporal HOG to describe both the motion and appearance of a volume. Zhang et al. combined the features of both HOF and gradients for AED. Cong et al. proposed a multiscale histogram of optical flow (MHOF) to describe the motion in a volume at different scales.
A context-wise descriptor is another type of descriptor that is used to capture contextual information and plays a key role in the process of discovering salient events. Contextual information can be classified into motion context and appearance context according to whether the descriptor is generated from motion or appearance feature words. Yang et al. proposed a semantic context descriptor both locally and globally to find rare classes in a scene. Yuan et al. exploited contextual evidence using a structural context descriptor (SCD) to describe the relationship of individuals. Hu et al. proposed a compact and efficient local nearest neighbors distance (LNND) descriptor to incorporate the spatial and temporal contextual information around a video event for AED. In fact, contextual information is an important cue for AED since it reflects the co-occurrence relationships or macro-structural information among semantic descriptors. Meanwhile, the context-wise descriptor is more efficient and flexible than the pixel-wise descriptor for AED since it is computed based on different types of regional features. However, the context-wise descriptor has not attracted as much attention as trajectory- and pixel-wise descriptors.
In the last decade, much effort has been devoted to learning an effective descriptor via deep learning. Different types of deep neural networks have been designed to learn rich discriminative features, and strong performance has been achieved in AED. Hasan et al. proposed a convolutional autoencoder framework for reconstructing a scene, and the reconstruction costs were computed for identifying abnormalities in the scene. Sabokrou et al. proposed a deep network cascade for AED. In the first stage, most normal patches were rejected by a small stack of auto-encoders, and a deep convolutional neural network (CNN) was then applied to extract the discriminative features for the final decision. Hu et al. proposed a deep incremental slow feature analysis (D-IncSFA) network to learn the slow features in a scene. Feng et al. proposed a deep Gaussian mixture model (D-GMM) network to model normal events. Zhou et al. proposed a spatio-temporal CNN to learn the joint features of both appearance and motion. Although a deep neural network can automatically learn useful descriptors, handcrafted features can still play a dominant role and remain widely used in both image and video domains because they benefit from human ingenuity and prior knowledge and offer flexibility and computational efficiency without relying on large sets of training samples.
In this section, we first introduce the principle of contextual gradient, then present the construction process of the HOCG descriptor for an event, and finally, present the details of AED using the HOCG descriptor under the online sparse reconstruction framework.
3.1 Contextual gradients
respectively. If the 3D contextual gradient is used, the temporal contextual gradient is also computed:
Other robust distance measurements, such as the earth mover's distance (EMD), can also be adopted to improve the robustness.
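To illustrate why a measure such as EMD may be more robust than a bin-by-bin distance, the following is a minimal sketch for the special case of two 1D histograms with unit bin spacing, where EMD reduces to the summed absolute difference of the cumulative distributions. The function name `emd_1d` and the assumption that the regional features are non-negative histograms are made for this example only; general regional features would need a full transportation solver.

```python
def emd_1d(p, q):
    """Earth mover's distance between two 1D histograms (unit bin spacing).

    Assumption for this sketch: p and q are non-negative histograms with
    the same number of bins; both are normalized to unit mass first.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    sum_p, sum_q = sum(p), sum(q)
    if sum_p <= 0 or sum_q <= 0:
        raise ValueError("histograms must have positive total mass")
    p = [v / sum_p for v in p]
    q = [v / sum_q for v in q]
    carried = 0.0   # mass that still has to be moved to later bins
    distance = 0.0
    for a, b in zip(p, q):
        carried += a - b
        distance += abs(carried)
    return distance


# Moving all mass by two bins costs twice as much as moving it by one bin,
# whereas the Euclidean distance is identical for both pairs.
near = emd_1d([1, 0, 0], [0, 1, 0])
far = emd_1d([1, 0, 0], [0, 0, 1])
```

Unlike the Euclidean distance, this measure grows with how far the mass has to move, so small shifts between neighboring bins are penalized less than large ones.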
3.2 Histogram of oriented contextual gradient descriptor construction
Algorithm 1 summarizes the construction of the HOCG descriptor.
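The orientation binning step of such a descriptor can be sketched as follows. This is only an illustration, not Algorithm 1 itself: it assumes that the horizontal and vertical contextual gradient components of each sub-region (`gx`, `gy`) have already been computed, and the magnitude weighting, the L1 normalization, and the function name `orientation_histogram` are assumptions of this example.

```python
import math


def orientation_histogram(gx, gy, n_bins=8):
    """Magnitude-weighted histogram of gradient orientations over [0, 360).

    gx, gy: sequences holding the two gradient components of each
    sub-region (assumed to be given). With n_bins=8, each bin spans 45
    degrees, so the descriptor has 8 dimensions.
    """
    bin_width = 360.0 / n_bins
    hist = [0.0] * n_bins
    for dx, dy in zip(gx, gy):
        magnitude = math.hypot(dx, dy)
        if magnitude == 0.0:
            continue  # a zero gradient has no defined orientation
        angle = math.degrees(math.atan2(dy, dx)) % 360.0
        # The extra modulo guards against angle == 360.0 from rounding.
        hist[int(angle // bin_width) % n_bins] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


# Four sub-regions whose gradients point right, up, left, and down.
descriptor = orientation_histogram([1, 0, -1, 0], [0, 1, 0, -1])
```

The resulting vector has one entry per direction, so its length is independent of the dimensionality of the underlying regional features.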
3.3 Abnormal event detection
Due to the unpredictability of abnormal events, most previous approaches learn only normal event models in an unsupervised or semi-supervised manner, and abnormal events are considered to be patterns that significantly deviate from the learned normal event models. In this work, we employ an online dictionary learning and sparse reconstruction framework for AED, in which an event is identified as abnormal if its sparse reconstruction cost (SRC) is higher than a specific threshold.
Dictionary learning is a representation learning method that aims at finding a sparse representation of the input data in the form of a linear combination of atoms in the dictionary. The dictionary can be learned in either an offline or online manner. Offline learning must process all training samples at one time, while online learning only draws one input or a small batch of inputs at any time t. Consequently, both the computational complexity and memory requirements of the online method are significantly lower than those of the offline method. Meanwhile, the online learning method has better adaptability than the offline method in practice. Thus, our work adopts the online dictionary learning method for AED, which consists of two steps: sparse coding and dictionary updating.
3.4 Sparse coding
This sparse approximation problem can be efficiently solved using orthogonal matching pursuit (OMP), which is a greedy forward selection algorithm.
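As an illustration of the greedy forward selection mentioned above, the following is a minimal pure-Python sketch of OMP. It is not the authors' MATLAB implementation: the helper names (`dot`, `solve_linear`, `omp`), the stopping tolerance, and the use of the squared residual as a simple stand-in for the reconstruction cost are assumptions of this example, and the atoms are assumed to be unit-norm and linearly independent.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def omp(atoms, y, max_atoms, tol=1e-10):
    """Greedy sparse coding of y over a list of dictionary atoms.

    Returns the coefficient vector and the squared residual norm.
    """
    residual = list(y)
    support, coeffs = [], []
    for _ in range(max_atoms):
        if dot(residual, residual) <= tol:
            break
        # Select the unused atom most correlated with the residual.
        scores = [-1.0 if k in support else abs(dot(a, residual))
                  for k, a in enumerate(atoms)]
        support.append(max(range(len(atoms)), key=lambda k: scores[k]))
        # Refit all selected atoms by least squares (normal equations).
        gram = [[dot(atoms[i], atoms[j]) for j in support] for i in support]
        rhs = [dot(atoms[i], y) for i in support]
        coeffs = solve_linear(gram, rhs)
        approx = [sum(c * atoms[k][d] for c, k in zip(coeffs, support))
                  for d in range(len(y))]
        residual = [yi - ai for yi, ai in zip(y, approx)]
    x = [0.0] * len(atoms)
    for c, k in zip(coeffs, support):
        x[k] = c
    return x, dot(residual, residual)


# Toy dictionary with four unit-norm atoms in 3D (illustrative values).
atoms = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0.6, 0.8, 0]]
x, cost = omp(atoms, [2, 0, 3], max_atoms=2)
is_abnormal = cost > 0.5  # hypothetical threshold on the residual
```

In this toy case the signal is an exact combination of two atoms, so the residual vanishes and the event would be treated as normal; a sample that the dictionary cannot represent with few atoms leaves a large residual and would exceed the threshold.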
3.5 Dictionary update
We conduct experiments on different public datasets to evaluate the performance of AED approaches using the HOCG descriptor. The public datasets are the UCSD, UMN, PETS 2009, and Avenue datasets. All of the experiments are performed on a PC with a dual-core 2.5 GHz Intel CPU and 4 GB of RAM using a MATLAB R2016a implementation. We use the UMN and PETS 2009 datasets for global AED, where abnormal events occur in most parts of the scene. We use the UCSD and Avenue datasets for local AED, where abnormal events occur in a relatively small local region.
4.1 UCSD dataset
The UCSD dataset consists of the Ped1 and Ped2 subsets, which were captured on the UCSD campus by stationary monocular cameras. The density of the crowd varies from sparse to very crowded. The only normal events in the scene are pedestrians walking on the walkway. The abnormalities include bikers, skaters, and vehicles crossing the walkway. We adopt the Ped1 subset for experiments since it provides complete ground truth for evaluating performance. The Ped1 subset contains 34 training clips of normal events and 36 testing clips, each containing 200 frames with a resolution of 158 × 238 pixels, which we resize to 160 × 240. Each sequence is first divided into a set of volumes with a size of 16 × 16 × 5, in which each volume is considered as an event. Then, each volume is further divided into a set of cuboids with a size of 4 × 4 × 5. We extract the slow features proposed in our previous works from each cuboid as the regional feature, which is robust and discriminative. We construct a 2D HOCG descriptor for each event, i.e., the spatial direction range is quantized into 8 directions, each spanning 45°. The dimensionality of the HOCG descriptor is therefore 8, a significant reduction from the 20-dimensional regional features. The number of atoms in the dictionary is set to 20, and λ = 0.5.
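The partition sizes above translate into the following counts per clip. This is a small worked example that assumes the volumes and cuboids tile the clip without overlap; the text does not state the stride, and the function name `grid_counts` is hypothetical.

```python
def grid_counts(frame_h, frame_w, n_frames,
                volume=(16, 16, 5), cuboid=(4, 4, 5)):
    """Count non-overlapping volumes per clip and cuboids per volume.

    Sizes are (height, width, frames). Non-overlapping tiling is an
    assumption of this sketch.
    """
    vh, vw, vt = volume
    ch, cw, ct = cuboid
    volumes = (frame_h // vh) * (frame_w // vw) * (n_frames // vt)
    cuboids_per_volume = (vh // ch) * (vw // cw) * (vt // ct)
    return volumes, cuboids_per_volume


# Ped1 settings: 160 x 240 frames, 200 frames per clip.
volumes, cuboids_per_volume = grid_counts(160, 240, 200)
# 10 * 15 * 40 = 6000 volumes (events) per clip, 4 * 4 * 1 = 16 cuboids each
```

Under this assumption, each event is described by the contextual gradients of 16 sub-regions, which are then summarized in the 8-dimensional HOCG descriptor.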
Summary of the quantitative performance and comparison with state-of-the-art descriptors in the Ped1 dataset
Although our approach performs worse than the MDT and SRC approaches in the frame-level evaluation, it outperforms all of the compared approaches in the pixel-level evaluation, which is stricter than the frame-level evaluation. As various types of regional features can be embedded in the HOCG descriptor, we also demonstrate the performance improvement and the reduction in dimensionality obtained by using the HOCG descriptor. We utilize four types of regional features, i.e., the gray value, 3D gradient, GCM, and slow feature (SF) descriptors. The comparisons demonstrate that constructing the HOCG descriptors both improves the AED performance and reduces the dimensionality.
Comparisons of the computational efficiency of HOCG descriptors with different parameters
Size of frame | Size of volume | Sizes of regions
160 × 240 | 16 × 16 × 5 | 2 × 2 × 5
160 × 240 | 16 × 16 × 5 | 4 × 4 × 5
240 × 320 | | 2 × 2 × 5
240 × 320 | | 4 × 4 × 5
4.2 UMN dataset
4.4 Avenue dataset
In this paper, we extended the gradient from pixel- to context-wise and then constructed a HOCG descriptor using contextual gradients for AED. The HOCG descriptor is simple, compact, flexible, and discriminative, and it can efficiently capture the contextual information of an event. We conducted extensive experiments on different challenging public datasets to demonstrate the effectiveness of context-wise gradients. Quantitative and qualitative analyses of the experimental results showed that the HOCG descriptor outperformed the traditional pixel-wise HOG descriptor in AED and was comparable to the state-of-the-art approaches without using complicated modeling approaches. In future work, we will explore other applications of the HOCG descriptor, such as human action recognition and crowd activity recognition. We will also investigate how well the compositional context of events can be captured under a deep learning framework.
This work was jointly supported in part by the National Natural Science Foundation of China (Grant No. 61374197), Shanghai Natural Science Foundation (17ZR1443500), Shanghai Sailing Program, and Talent Program of Shanghai University of Engineering Science.
Availability of data and materials
All data and material are available.
YH initiated the project. XH designed the algorithms. QD and JD conceived, designed, and performed the experiments. WC and HY analyzed the data. XH wrote the paper. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- M Paul, SME Haque, S Chakraborty, Human detection in surveillance videos and its applications-a review. EURASIP. J. Adv. Signal. Process 2013(1), 176 (2013)
- V Chandola, A Banerjee, V Kumar, Anomaly detection: a survey. ACM Comput. Surveys 41(3), 15 (2009)
- T Wang, H Snoussi, Detection of abnormal visual events via global optical flow orientation histogram. IEEE Trans. on Inf. Foren. Sec 9(6), 988–998 (2014)
- O Boiman, M Irani, Detecting irregularities in images and in video. Int. J. Comput. Vis. 74(1), 17–31 (2007)
- MJ Roshtkhari, MD Levine, An on-line, real-time learning method for detecting anomalies in videos using spatio-temporal compositions. Comput. Vis. Image Understand. 117(10), 1436–1452 (2013)
- N Li, X Wu, D Xu, H Guo, W Feng, Spatio-temporal context analysis within video volumes for anomalous-event detection and localization. Neurocomputing 155, 309–319 (2015)
- A Gupta, LS Davis, Proc. IEEE Conf. on CVPR. Objects in action: an approach for combining action understanding and object perception (2007), pp. 1–8
- V Saligrama, J Konrad, PM Jodoin, Video anomaly identification. IEEE Signal Process. Magaz. 27(5), 18–33 (2010)
- C Li, Z Han, Q Ye, J Jiao, Visual abnormal behavior detection based on trajectory sparse reconstruction analysis. Neurocomputing 119, 94–100 (2013)
- Ö Aköz, ME Karsligil, Traffic event classification at intersections based on the severity of abnormality. Mach. Vision Appl 25(3), 613–632 (2014)
- R Laxhammar, G Falkman, Online learning and sequential anomaly detection in trajectories. IEEE Trans. Pattern Anal. Mach. Intell. 36(6), 1158–1173 (2014)
- A Bera, S Kim, D Manocha, Proc. IEEE-CVPRW. Realtime anomaly detection using trajectory-wise crowd behavior learning (2016), pp. 50–57
- S Coşar, G Donatiello, V Bogorny, C Garate, LO Alvares, F Brémond, Toward abnormal trajectory and event detection in video surveillance. IEEE Trans. Circuits Syst. Video Technol. 27(3), 683–695 (2017)
- B Zhao, L Fei-Fei, EP Xing, Proc. IEEE-CVPR. Online Detection of Unusual Events in Videos via Dynamic Sparse Coding (2011), pp. 3313–3320
- M Bertini, A Del Bimbo, L Seidenari, Multi-scale and real-time non-parametric approach for anomaly detection and localization. Comput. Vis. Image Und. 116(3), 320–329 (2012)
- Y Zhang, H Lu, L Zhang, X Ruan, Combining motion and appearance cues for anomaly detection. Pattern Recogn. 51, 443–452 (2016)
- Y Cong, J Yuan, J Liu, Abnormal event detection in crowded scenes using sparse representation. Pattern Recogn. 46(7), 1851–1864 (2013)
- V Kaltsa, A Briassouli, I Kompatsiaris, LJ Hadjileontiadis, MG Strintzis, Swarm intelligence for detecting interesting events in crowded environments. IEEE Trans. Image Process. 24(7), 2153–2166 (2015)
- RVHM Colque, C Caetano, MTLD Andrade, WR Schwartz, Histograms of optical flow orientation and magnitude and entropy to detect anomalous events in videos. IEEE Trans. Circ. Syst. Vid 27(3), 683–695 (2017)
- W Li, V Mahadevan, N Vasconcelos, Anomaly detection and localization in crowded scenes. IEEE Trans. Pattern Anal. Mach. Intell 36(1), 18–32 (2014)
- J Wang, Z Xu, Spatio-temporal texture modelling for real-time crowd anomaly detection. Comput. Vis. Image Und 144, 177–187 (2016)
- A Zaharescu, R Wildes, Proc. ECCV. Anomalous behaviour detection using spatiotemporal oriented energies, subset inclusion histogram comparison and event-driven processing (2010), pp. 563–576
- E Ali, TH Mehrtash, D Farhad, CL Brian, Novelty detection in human tracking based on spatiotemporal oriented energies. Pattern Recogn 48, 812–826 (2015)
- PC Ribeiro, R Audigier, QC Pham, RIMOC, a feature to discriminate unstructured motions: Application to violence detection for video-surveillance. Comput. Vis. Image Und 144, 121–143 (2016)
- J Yang, B Price, S Cohen, MH Yang, Proc. IEEE-CVPR. Context Driven Scene Parsing with Attention to Rare Classes (2014), pp. 3294–3301
- Y Yuan, J Fang, Q Wang, Online anomaly detection in crowd scenes via structure analysis. IEEE Trans. Cybern 45(3), 548–561 (2015)
- X Hu, S Hu, X Zhang, H Zhang, L Zhang, Anomaly Detection Based on Local Nearest Neighbor Distance Descriptor in Crowded Scenes. Sci. World. J 2014(6), 632575 (2014)
- M Hasan, J Choi, J Neumann, AK Roy-Chowdhury, LS Davis, Proc. IEEE-CVPR, Learning temporal regularity in video sequences (2016), pp. 733–742
- M Sabokrou, M Fayyaz, M Fathy, R Klette, Deep-cascade: Cascading 3d deep neural networks for fast anomaly detection and localization in crowded scenes. IEEE Trans. Image Process 26(4), 1992–2004 (2017)
- X Hu, S Hu, Y Huang, H Zhang, H Wu, Video anomaly detection using deep incremental slow feature analysis network. IET Comput. Vis 10(4), 258–265 (2016)
- Y Feng, Y Yuan, X Lu, Learning deep event models for crowd anomaly detection. Neurocomputing 219(5), 548–556 (2017)
- S Zhou, W Shen, D Zeng, M Fang, Y Wei, Z Zhang, Spatial-temporal convolutional neural networks for anomaly detection and localization in crowded scenes. Signal Process-Image Communication 47, 358–368 (2016)
- S Goferman, L Zelnikmanor, A Tal, Context-aware saliency detection. IEEE Trans. Pattern Anal. Mach. Intell. 34(10), 1915–1926 (2012)
- X Hu, S Hu, J Xie, S Zheng, Robust and efficient anomaly detection using heterogeneous representations. J. Electron. Imaging 24(3), (033021)1–(03302112 (2015)
- Y Rubner, C Tomasi, L Guibas, The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vis. 40(2), 99–121 (2000)
- C Lu, J Shi, J Jia, Proc. IEEE-CVPR, Online Robust Dictionary Learning (2013), pp. 415–422
- http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm, Accessed 18 Jul 2014
- http://mha.cs.umn.edu/Movies/Crowd-Activity-All.avi, Accessed 26 Oct 2012
- http://www.cvg.reading.ac.uk/PETS2009/a.html, Accessed 6 Jun 2011
- http://www.cse.cuhk.edu.hk/leojia/projects/detectabnormal/dataset.html, Accessed 10 May 2016
- V Mahadevan, W Li, V Bhalodia, N Vasconcelos, Proc. IEEE-CVPR, Anomaly Detection in Crowded Scenes (2010), pp. 1975–1981
- A Adam, E Rivlin, I Shimshoni, D Reinitz, Robust realtime unusual event detection using multiple fixed-location monitors. IEEE Trans. Pattern Anal. Mach. Intell. 30(3), 555–560 (2008)
- R Mehran, A Oyama, M Shah, Proc. IEEE-CVPR, Abnormal crowd behavior detection using social force model (2009), pp. 935–942
- S Wu, BE Moore, M Shah, Proc. IEEE-CVPR, Chaotic invariants of lagrangian particle trajectories for anomaly detection in crowded scenes (2010), pp. 2054–2060
- X Cui, Q Liu, M Gao, DN Metaxas, Proc. IEEE-CVPR, Abnormal detection using interaction energy potentials (2011), pp. 3161–3167
- R Raghavendra, A Del Bue, M Cristani, V Murino, Proc. IEEE-CVPRW, Optimizing interaction force for global anomaly detection in crowded scenes (2011), pp. 136–143
- J Xu, S Denman, C Fookes, S Sridharan, Proc. IEEE-AVSS, Unusual Scene Detection Using Distributed Behaviour Model and Sparse Representation (2012), pp. 48–53
- J Xu, S Denman, S Sridharan, C Fookes, R Rana, Proc. 2011 ACM-MM, Dynamic texture reconstruction from sparse codes for unusual event detection in crowded scenes (2011), pp. 25–30
- C Lu, J Shi, J Jia, Proc. IEEE-ICCV, Abnormal event detection at 150 FPS in MATLAB (2013), pp. 2720–2727
- Z Zhang, X Mei, B Xiao, Abnormal event detection via compact low-rank sparse learning. IEEE Intell. Syst 31(1), 29–36 (2016)
- Y Yuan, Y Feng, X Lu, Statistical hypothesis detector for abnormal event detection in crowded scenes. IEEE Trans. Cybern 47(11), 3597–3608 (2016)
- S Li, Y Yang, C Liu, Anomaly detection based on two global grid motion templates, Signal Process-Image, (2017)