In this section, we first introduce the principle of contextual gradients, then describe the construction of the HOCG descriptor for an event, and finally detail AED with the HOCG descriptor under the online sparse reconstruction framework.
3.1 Contextual gradients
Different from traditional gradients, which are computed pixel-wise, contextual gradients are computed region-wise. We define the contextual gradients of a given region Rij in the vertical and horizontal directions as
$$ {G}_i\left(i,j\right)=\operatorname{sign}\left({R}_{i+1,j},{R}_{i-1,j}\right)\cdot \mathrm{dist}\left({R}_{i+1,j},{R}_{i-1,j}\right) $$
(1)
$$ {G}_j\left(i,j\right)=\operatorname{sign}\left({R}_{i,j+1},{R}_{i,j-1}\right)\cdot \mathrm{dist}\left({R}_{i,j+1},{R}_{i,j-1}\right) $$
(2)
respectively. If the 3D contextual gradient is used, the temporal contextual gradient is also computed:
$$ {G}_{\tau}\left(i,j,\tau \right)=\operatorname{sign}\left({R}_{i,j,\tau +1},{R}_{i,j,\tau -1}\right)\cdot \mathrm{dist}\left({R}_{i,j,\tau +1},{R}_{i,j,\tau -1}\right) $$
(3)
where dist(∙, ∙) is the distance measure between a pair of regions and sign(∙, ∙) returns the sign of the contextual gradient. Figure 1 shows visualizations of the horizontal, vertical, and temporal gradient maps, as well as the gradient magnitude maps, of scenes for both pixel-wise and context-wise gradients.
Unlike the sign of a pixel gradient, which can be directly determined by a value comparison, the sign of the gradient between two regions cannot be judged directly. To solve this problem, we utilize the saliency value of the region to determine the sign of the gradient. Specifically, we first compute the saliency value for each region in the scene and then determine the sign of the contextual gradients by comparing their saliency values:
$$ \operatorname{sign}\left({R}_i,{R}_j\right)=\left\{\begin{array}{cc}1& if\ {S}_{R_i}>{S}_{R_j}\\ {}-1& \mathrm{otherwise}\end{array}\kern0.5em \right. $$
(4)
where \( {S}_{R_j} \) refers to the local saliency value of Rj. Saliency is one of the most popular concepts in computational visual attention modeling and can be quantitatively measured by the center-surround difference, information maximization, incremental coding length, and site entropy rate, among others. For each region Ri, we use the context-aware method [33] to compute its saliency value. The saliency value of a given region is defined as the center-surround difference, measured by the distance between the features of the center and those of its K nearest neighbors among the surrounding regions:
$$ {S}_{R_i}=\sum \limits_{k=1}^K{\mathrm{dist}}_{fea}\left({R}_i,{R}_k\right) $$
(5)
where fi refers to the regional features extracted from region Ri and distfea(∙, ∙) refers to the distance measure in the feature space. To reduce computational complexity, we only use the eight immediate neighbors surrounding the center region; to reduce the influence of noise, we then select from these eight neighbors the four nearest in the feature space. The contextual gradient can be computed based on different types of features, such as the gray values of raw pixels, HOG, HOF, and gradient central moments (GCM) [34]. We adopt the commonly used Euclidean distance as the distance measure between two features, defined as
$$ \mathrm{dis}{\mathrm{t}}_{fea}\left({R}_i,{R}_k\right)={\left\Vert {f}_i-{f}_k\right\Vert}_2 $$
(6)
Other, more robust distance measures, such as the earth mover's distance (EMD) [35], can also be adopted to improve robustness.
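As a concrete illustration of Eqs. (1), (2), and (4)–(6), the following is a minimal NumPy sketch; the grid of regional feature vectors (shape H × W × F), the function names, and the zero-valued boundary handling are our own placeholders rather than part of the original method:

```python
import numpy as np

def saliency(features, i, j, K=4):
    """Saliency of region (i, j) as the center-surround difference (Eq. 5):
    sum of feature distances to the K nearest of its eight immediate neighbors."""
    dists = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            ni, nj = i + di, j + dj
            if 0 <= ni < features.shape[0] and 0 <= nj < features.shape[1]:
                # Euclidean distance in feature space (Eq. 6)
                dists.append(np.linalg.norm(features[i, j] - features[ni, nj]))
    return sum(sorted(dists)[:K])

def contextual_gradients(features):
    """Vertical and horizontal contextual gradients (Eqs. 1 and 2) over a
    grid of regional feature vectors; boundary regions are left at zero."""
    H, W = features.shape[:2]
    S = np.array([[saliency(features, i, j) for j in range(W)] for i in range(H)])
    Gi, Gj = np.zeros((H, W)), np.zeros((H, W))
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            # sign from the saliency comparison (Eq. 4), magnitude from Eq. (6)
            sv = 1.0 if S[i + 1, j] > S[i - 1, j] else -1.0
            sh = 1.0 if S[i, j + 1] > S[i, j - 1] else -1.0
            Gi[i, j] = sv * np.linalg.norm(features[i + 1, j] - features[i - 1, j])
            Gj[i, j] = sh * np.linalg.norm(features[i, j + 1] - features[i, j - 1])
    return Gi, Gj
```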
3.2 Histogram of oriented contextual gradient descriptor construction
In our work, a video event is a spatio-temporal volume, and contextual gradients are computed for each small sub-region within the event. Based on the proposed contextual gradient, we construct a histogram for each event by quantizing each regional descriptor into a specific direction bin according to its contextual gradients. A volume Vmnt of size w × h × l consists of a set of non-overlapping sub-regions {Rijτ}, each of size p × q × r, where w, h, and l are divisible by p, q, and r, respectively; three contextual gradients (i.e., horizontal, vertical, and temporal) are computed for each sub-region. Figure 2 illustrates the process of HOCG descriptor construction. The contextual gradient magnitude φijτ is computed as
$$ {\varphi}_{ij\tau}=\sqrt{G_i^2+{G}_j^2} $$
(7)
and the spatial directional angle θijτ is computed as
$$ {\theta}_{ij\tau}={\tan}^{-1}{G}_i/{G}_j,{\theta}_{ij\tau}\in \left[-\pi, \pi \right] $$
(8)
If the 3D HOCG descriptor is used, the magnitude φijτ should be computed using all three spatial and temporal gradients as
$$ {\varphi}_{ij\tau}=\sqrt{G_i^2+{G}_j^2+{G}_{\tau}^2} $$
(9)
and the temporal directional angle ϕijτ should also be computed:
$$ {\phi}_{ij\tau}={\tan}^{-1}\frac{G_{\tau }}{\sqrt{G_i^2+{G}_j^2}},{\phi}_{ij\tau}\in \left[-\frac{\pi }{2},\frac{\pi }{2}\right] $$
(10)
Using the computed magnitudes and directional angles of all sub-regions in the volume, we construct a histogram with Bs bins; that is, the 360° spatial direction range is quantized into Bs directions, each covering an angle range of 360°/Bs. If the 3D HOCG descriptor is used, we further quantize the 180° temporal direction range into Bt directions, each covering 180°/Bt, and construct a histogram with BsBt bins. Given a sub-region Rijτ, we quantize the region into the bth bin. For 2D HOCG,
$$ b=\left\lceil {B}_s{\theta}_{ij\tau}/2\pi \right\rceil . $$
(11)
For 3D HOCG,
$$ b={B}_s\left\lfloor {B}_t{\phi}_{ij\tau}/\pi \right\rfloor +\left\lceil {B}_s{\theta}_{ij\tau}/2\pi \right\rceil $$
(12)
where ⌊∙⌋ and ⌈∙⌉ denote rounding down and rounding up, respectively. Then, we assign the cuboid Rijτ a B-dimensional vector uijτ ∈ ℝB in which all elements are zero except the bth, which is set to the magnitude φijτ. Finally, we obtain the HOCG descriptor of the volume by accumulating the vectors of all cuboids in the volume:
$$ {w}_{mnt}={\sum}_{R_{ij\tau}\in {V}_{mnt}}{u}_{ij\tau} $$
(13)
Algorithm 1 summarizes the HOCG descriptor construction.
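To complement Algorithm 1, the following is a minimal sketch of the 2D HOCG accumulation in Eqs. (7), (8), (11), and (13); the modulo wrap of the bin index is our assumption, since Eq. (11) alone can yield non-positive indices for θijτ ∈ [−π, π], and the 3D case would add the temporal quantization of Eq. (12) analogously:

```python
import numpy as np

def hocg_2d(Gi, Gj, Bs=8):
    """Accumulate a Bs-bin 2D HOCG descriptor from the contextual gradients
    of all sub-regions in one volume (Eqs. 7, 8, 11, and 13)."""
    w = np.zeros(Bs)
    for gi, gj in zip(Gi.ravel(), Gj.ravel()):
        mag = np.hypot(gi, gj)            # magnitude, Eq. (7)
        theta = np.arctan2(gi, gj)        # spatial angle in [-pi, pi], Eq. (8)
        b = int(np.ceil(Bs * theta / (2.0 * np.pi)))  # bin index, Eq. (11)
        b = (b + Bs) % Bs                 # wrap to a valid 0-based index
        w[b] += mag                       # magnitude-weighted vote, Eq. (13)
    return w
```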
3.3 Abnormal event detection
AED can be classified into global AED (GAED) and local AED (LAED) [17]. GAED aims to detect abnormal events caused by a group and occurring across the whole scene, such as a suddenly scattering crowd, whereas LAED aims to detect abnormal events caused by individuals and occurring in a local region of the scene. For GAED, the video sequence is first divided into a set of temporal clips, and each clip is considered a global event. For LAED, each clip is further divided into a set of local volumes, and each volume is considered a local event. Figure 3 illustrates how the video sequence is divided for global and local AED, and a sketch of this division is given below.
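The sketch assumes a grayscale video tensor of shape T × H × W; the clip length l and volume size h × w are placeholder parameters:

```python
import numpy as np

def divide_sequence(video, l, h, w):
    """Divide a (T, H, W) video into temporal clips of length l (global
    events), then each clip into non-overlapping h x w spatial volumes
    (local events)."""
    T, H, W = video.shape
    clips = [video[t:t + l] for t in range(0, T - l + 1, l)]
    volumes = [clip[:, i:i + h, j:j + w]
               for clip in clips
               for i in range(0, H - h + 1, h)
               for j in range(0, W - w + 1, w)]
    return clips, volumes
```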
Due to the unpredictability of abnormal events, most previous approaches only learn normal event models in an unsupervised or semi-supervised manner, and abnormal events are considered patterns that deviate significantly from the learned normal event models. In this work, we employ the online dictionary learning and sparse reconstruction framework for AED, in which an event is identified as abnormal when its sparse reconstruction cost (SRC) is higher than a specific threshold.
Given an event with HOCG descriptor wmnt, its SRC can be computed as
$$ {C}_{mnt}=\frac{1}{2}{\left\Vert {D}_{mn,t-1}{\alpha}_{mnt}-{w}_{mnt}\right\Vert}_2^2+\lambda {\left\Vert {\alpha}_{mnt}\right\Vert}_1 $$
(14)
where Dmn, t − 1 ∈ ℝB × S is the online dictionary updated at time t − 1 (it is continuously updated to Dmnt at each time t using wmnt), λ is the regularization parameter, and αmnt ∈ ℝS is the sparse coefficient vector obtained by sparse coding under Dmn, t − 1.
Dictionary learning is a representation learning method that aims to find a sparse representation of the input data as a linear combination of dictionary atoms. The dictionary can be learned in either an offline or an online manner. Offline learning must process all training samples at once, whereas online learning draws only one input or a small batch of inputs at each time t. Consequently, both the computational complexity and the memory requirements of the online method are significantly lower than those of the offline method, and the online method adapts better in practice. Thus, our work adopts the online dictionary learning method for AED, which consists of two steps: sparse coding and dictionary updating.
3.4 Sparse coding
Given a fixed dictionary Dmn, t − 1 and an HOCG descriptor wmnt, the sparse coefficient vector αmnt ∈ ℝS can be obtained by optimizing
$$ {\alpha}_{mnt}=\underset{{\boldsymbol{\alpha}}_{mnt}}{\arg\ \min}\frac{1}{2}{\left\Vert {D}_{mn,t-1}{\alpha}_{mnt}-{w}_{mnt}\right\Vert}_2^2+\lambda {\left\Vert {\alpha}_{mnt}\right\Vert}_1 $$
(15)
This sparse approximation problem can be efficiently solved using orthogonal matching pursuit (OMP), which is a greedy forward selection algorithm.
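A minimal sketch of such a greedy OMP solver, together with the SRC of Eq. (14), is given below; the sparsity level T0 is a placeholder, and OMP enforces an ℓ0 sparsity constraint as the usual practical stand-in for the ℓ1 penalty written in Eq. (15):

```python
import numpy as np

def omp(D, w, T0=10):
    """Greedy orthogonal matching pursuit: approximate w with at most T0
    atoms (columns, assumed unit-norm) of the dictionary D."""
    residual = w.copy()
    support = []
    alpha = np.zeros(D.shape[1])
    for _ in range(T0):
        k = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        if k not in support:
            support.append(k)
        # least-squares refit of w on the selected atoms
        coef, *_ = np.linalg.lstsq(D[:, support], w, rcond=None)
        residual = w - D[:, support] @ coef
    alpha[support] = coef
    return alpha

def src(D, w, alpha, lam=0.1):
    """Sparse reconstruction cost of Eq. (14)."""
    return 0.5 * np.sum((D @ alpha - w) ** 2) + lam * np.sum(np.abs(alpha))
```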
3.5 Dictionary update
At each time t, the optimal dictionary can be obtained by optimizing
$$ {D}_{mnt}=\underset{D_{mnt}}{\arg\ \min}\frac{1}{t}\sum \limits_{i=1}^t\frac{1}{2}{\left\Vert {w}_{mni}-{D}_{mnt}{\alpha}_{mni}\right\Vert}_2^2 $$
(16)
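The update need not revisit past samples: in the block-coordinate scheme of Mairal et al. [36], sufficient statistics accumulated over past (wmnt, αmnt) pairs drive one descent pass over the atoms. A minimal sketch, with variable names of our own choosing:

```python
import numpy as np

def update_dictionary(D, A, B, w, alpha, eps=1e-12):
    """One online update of Eq. (16): fold the new pair (w, alpha) into the
    sufficient statistics A and B, then take one block-coordinate pass over
    the atoms, following Mairal et al. [36]."""
    A += np.outer(alpha, alpha)           # A_t = A_{t-1} + alpha alpha^T
    B += np.outer(w, alpha)               # B_t = B_{t-1} + w alpha^T
    for k in range(D.shape[1]):
        if A[k, k] > eps:                 # skip atoms no sample has used
            u = (B[:, k] - D @ A[:, k]) / A[k, k] + D[:, k]
            D[:, k] = u / max(np.linalg.norm(u), 1.0)  # project to unit ball
    return D, A, B
```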
For more details regarding online dictionary learning and sparse coding, we refer the reader to [36]. Finally, we label the event as normal or abnormal based on a threshold δ:
$$ \mathrm{Label}\left({V}_{mnt}\right)=\left\{\begin{array}{cc}\mathrm{Normal}& {C}_{mnt}\le \delta \\ {}\mathrm{Abnormal}& \mathrm{otherwise}\end{array}\right. $$
(17)
where the threshold δ is chosen experimentally as the value at which the approach achieves its best performance. The steps of AED with online dictionary learning and sparse coding are shown in Algorithm 2.
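Putting the pieces together, the per-event decision of Eq. (17) can be wired into the online loop roughly as follows, reusing the omp, src, and update_dictionary sketches above (B, S, T0, λ, δ, and the random dictionary initialization are placeholder settings):

```python
import numpy as np

def detect(events, B=8, S=50, T0=10, lam=0.1, delta=5.0):
    """Label a stream of B-dimensional HOCG descriptors (one per event) as
    normal or abnormal via online sparse reconstruction (Eqs. 14-17)."""
    rng = np.random.default_rng(0)
    D = rng.standard_normal((B, S))
    D /= np.linalg.norm(D, axis=0)                  # unit-norm atoms
    A, Bstat = np.zeros((S, S)), np.zeros((B, S))
    labels = []
    for w in events:
        alpha = omp(D, w, T0)                       # sparse coding, Eq. (15)
        C = src(D, w, alpha, lam)                   # SRC, Eq. (14)
        labels.append("Abnormal" if C > delta else "Normal")  # Eq. (17)
        D, A, Bstat = update_dictionary(D, A, Bstat, w, alpha)  # Eq. (16)
    return labels
```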