2.1 Graph convolutional network
Most graph neural network models use graph convolution, whose core idea is sharing convolution-kernel parameters within local regions. The same convolution kernel performs the convolution operation at every graph node, which greatly reduces the number of model parameters. Updating the kernel parameters can be seen as learning a function on a graph G = (υ, ε), where υ is the vertex set and ε is the set of edges connecting the vertices. The input consists of the feature matrix X ∈ R^{N × D}, where N is the number of vertices and D is the feature dimension, together with a matrix representation of the graph structure (usually the adjacency matrix A). The output is Z ∈ R^{N × F}, where F is the output dimension of the graph convolution layer. Each graph convolution layer can be represented as the following nonlinear function:
$$ {H}^{\left(l+1\right)}=f\left({H}^{(l)},A\right) $$
(1)
where H^{(0)} = X, H^{(L)} = Z, l is the layer index, and L is the number of convolution layers. For different tasks and models, an appropriate convolution function f(⋅, ⋅) is selected and parameterized. This paper uses the same convolution function as Kipf et al. [8], whose basic form is:
$$ f\left({H}^{(l)},A\right)=\sigma \left({AH}^{(l)}{W}^{(l)}\right) $$
(2)
where W^{(l)} is the weight matrix of the l-th layer of the neural network, and σ(⋅) is a nonlinear activation function, usually the rectified linear unit (ReLU). With the symmetric normalization improvement, the graph convolution function is calculated as:
$$ f\left({H}^{(l)},A\right)=\sigma \left({\hat{D}}^{-\frac{1}{2}}\hat{A}{\hat{D}}^{-\frac{1}{2}}{H}^{(l)}{W}^{(l)}\right) $$
(3)
where \( \hat{A}=A+I \), I is the identity matrix, and \( \hat{D} \) is the diagonal vertex degree matrix of \( \hat{A} \).
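The layer rule of Eq. (3) can be sketched directly in NumPy. This is a minimal illustration, not the paper's implementation; the toy 3-vertex chain graph, the weight shapes, and the tanh activation are all assumptions for demonstration:

```python
import numpy as np

def gcn_layer(H, A, W, activation=np.tanh):
    """One graph convolution layer following Eq. (3):
    f(H, A) = sigma(D_hat^{-1/2} A_hat D_hat^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops: A_hat = A + I
    d = A_hat.sum(axis=1)                     # diagonal vertex degrees of A_hat
    D_inv_sqrt = np.diag(d ** -0.5)           # D_hat^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return activation(A_norm @ H @ W)

# toy graph: 3 vertices in a chain 0-1-2, D = 2 input features, F = 2 outputs
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.random.randn(3, 2)   # feature matrix, N x D
W = np.random.randn(2, 2)   # layer weights, D x F
Z = gcn_layer(X, A, W)
print(Z.shape)  # (3, 2)
```

Note that the normalization is computed once from the fixed adjacency matrix; in practice it can be precomputed and reused across layers.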
2.2 Extraction of continuous hidden state characteristics based on the ordinary differential equation
The ordinary differential equation (ODE) network is a new branch of neural networks. It makes the neural network continuous and uses an ordinary differential equation solver to fit the network itself. The basic equations of the problem domain are as follows:
$$ {h}_{t+1}={h}_t+f\left({h}_t,{\theta}_t\right) $$
(4)
$$ \frac{dh(t)}{dt}=f\left(h(t),t,\theta \right) $$
(5)
$$ h(T)=h\left({t}_0\right)+{\int}_{t_0}^{T}f\left(h(t),t,\theta \right)\, dt $$
(6)
where h_{t} stands for the hidden state, and f(⋅) represents the nonlinear transformation of a single-layer neural network. Equation (4) represents the forward propagation between residual blocks in a standard residual network, where the neural network of each layer fits the residual term, while in Eq. (5) the output of the neural network is regarded as the gradient of the hidden state with respect to time. The hidden state value at any time t can then be obtained by solving the integral in Eq. (6). The number of solver evaluation points can be considered equivalent to the number of layers of a discrete neural network. In this paper, basic applications of the ODE network to various mainstream model structures are proposed, referring to the convolutional neural network model and the time-span-based hidden state model.
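The correspondence between Eqs. (4) and (6) can be made concrete with a fixed-step Euler solver, where each integration step plays the role of one residual block. This is a toy sketch under stated assumptions: the tanh dynamics function, the weight shapes, and the step count are illustrative choices, not the paper's model:

```python
import numpy as np

def f(h, t, theta):
    """Single-layer nonlinear transform f(h(t), t, theta) of Eq. (5);
    here a toy tanh layer (theta is an illustrative weight matrix)."""
    return np.tanh(h @ theta)

def odeint_euler(h0, t0, T, theta, n_steps=20):
    """Fixed-step Euler approximation of Eq. (6):
    h(T) = h(t0) + integral_{t0}^{T} f(h(t), t, theta) dt.
    n_steps plays the role of the layer count of a discrete network."""
    h, t = h0.copy(), t0
    dt = (T - t0) / n_steps
    for _ in range(n_steps):
        h = h + dt * f(h, t, theta)  # one Euler step = one residual update, Eq. (4)
        t += dt
    return h

rng = np.random.default_rng(0)
h0 = rng.standard_normal((1, 4))       # initial hidden state h(t0)
theta = rng.standard_normal((4, 4))
hT = odeint_euler(h0, 0.0, 1.0, theta)
print(hT.shape)  # (1, 4)
```

Increasing `n_steps` refines the approximation without adding parameters, which is the sense in which evaluation points replace discrete layers.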
Feature extraction for video pedestrians mainly includes two aspects. The first is static feature extraction from video frame images in regular space, including pedestrian edges, color, and other features. In this respect, mainstream neural networks can already achieve a high recognition rate, and experiments show that static pedestrian feature extraction does not require a very deep network. The second aspect, which is also one of the difficulties of video pedestrian detection, is the spatiotemporal dynamic characteristics of pedestrians over a time span. Many scholars have proposed different methods to extract these dynamic characteristics; however, none of the current methods takes into account the continuous information lost between discrete video frames. From the perspective of continuous events, this paper attempts to fit the probability distribution of the hidden dynamic characteristics \( \tilde{Z} \) of a person through the hidden state model of the ODE network, as shown in Fig. 1.
Firstly, the static feature vector X of a single frame is extracted by a common convolutional network, such as a residual network. If features are extracted from image blocks, an additional convolutional layer is added to predict the category of the complete frame image; the feature vector of the complete image can then be obtained from the sequentially arranged block features, and a pooling layer can also be used. Secondly, the static features X are sampled in reverse chronological order, and the predicted initial hidden state is obtained through a sequence network (a recurrent neural network is used in this paper). The hidden state probability distribution P(⋅) is obtained from the ODE network, so the hidden state value can be predicted at any time. Finally, the hidden state value is converted to the target feature vector by the decoder.
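The three stages above can be sketched end to end with NumPy stand-ins. Everything here is an illustrative assumption rather than the paper's architecture: the toy recurrent encoder, the Euler step size, the tanh dynamics, and all weight shapes are placeholders showing how the pieces connect:

```python
import numpy as np

rng = np.random.default_rng(1)

def rnn_encode(X_rev, Wx, Wh):
    """Toy recurrent encoder: consume frame features in reverse
    chronological order to predict the initial hidden state z(t0)."""
    h = np.zeros(Wh.shape[0])
    for x in X_rev:
        h = np.tanh(x @ Wx + h @ Wh)
    return h

def ode_rollout(z0, times, theta, dt=0.05):
    """Euler integration of hidden-state dynamics dz/dt = tanh(z theta),
    evaluated at arbitrary query times (the continuity the text emphasizes)."""
    zs, z, t = [], z0.copy(), times[0]
    for t_next in times:
        while t < t_next:
            z = z + dt * np.tanh(z @ theta)
            t += dt
        zs.append(z.copy())
    return np.stack(zs)

# stage 1: static per-frame features X (T frames, D dims) from some CNN backbone
T_frames, D, H = 8, 16, 8
X = rng.standard_normal((T_frames, D))
Wx, Wh = rng.standard_normal((D, H)), rng.standard_normal((H, H))
theta = rng.standard_normal((H, H)) * 0.1
W_dec = rng.standard_normal((H, D))

z0 = rnn_encode(X[::-1], Wx, Wh)                  # stage 2: reverse-order encoding
Z = ode_rollout(z0, np.linspace(0.0, 1.0, 5), theta)  # hidden states at query times
features = Z @ W_dec                              # stage 3: decode to target features
print(features.shape)  # (5, 16)
```

The key property the sketch preserves is that the query times in `ode_rollout` need not align with frame times, so hidden states between discrete frames are recoverable.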
2.3 Video person re-identification based on ODE and GCN
The video frames are first organized, over the time span, into a graph model G_{k} = (υ_{k}, ε_{k}) with a window of size 2k + 1 centered on the current moment. Each video frame node has k incoming edges, k outgoing edges, and one self-loop edge, 2k + 1 edges in total. An undirected graph is used so that the relevance of events both before and after the current frame is considered simultaneously. Each graph convolution layer can contain n such windows, depending on the size of the network and usually determined by the length of the video block. Thus, the state update equation of each node in a middle layer of the graph convolution network can be expressed as:
$$ {X}_t^{l+1}=\sigma \left(\sum \limits_{t_i=t-k}^{t+k}\frac{1}{\tilde{Z}}{X}_{t_i}^l{W}_{t_i}^l\right),\kern1em k=1,2,\cdots $$
(7)
where \( \tilde{Z} \) is the normalization factor, defined as in Eq. (3), l represents the layer index in the graph convolution network, and σ is the activation function.
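Equation (7) can be sketched as a per-frame aggregation over a temporal window. In this minimal illustration the normalization factor is simplified to a scalar, and the per-offset weight matrices, frame count, and feature sizes are assumed for demonstration; boundary frames simply skip neighbours outside the clip:

```python
import numpy as np

def temporal_window_conv(X, W_list, k, Z_norm=1.0, activation=np.tanh):
    """State update of Eq. (7): each frame-node t aggregates its temporal
    neighbours t-k .. t+k with per-offset weights W_{t_i}; Z_norm stands in
    for the normalization factor (a scalar here for simplicity)."""
    T, D = X.shape
    F = W_list[0].shape[1]
    out = np.zeros((T, F))
    for t in range(T):
        acc = np.zeros(F)
        for j, ti in enumerate(range(t - k, t + k + 1)):
            if 0 <= ti < T:                   # frames outside the clip are skipped
                acc += (X[ti] @ W_list[j]) / Z_norm
        out[t] = activation(acc)
    return out

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 6))              # 10 frames, 6-dim features
k = 2                                          # window size 2k + 1 = 5
W_list = [rng.standard_normal((6, 4)) for _ in range(2 * k + 1)]
H = temporal_window_conv(X, W_list, k)
print(H.shape)  # (10, 4)
```

One per-offset weight matrix is shared across all windows, which is the kernel-parameter sharing that Section 2.1 attributes to graph convolution.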
Video detection ultimately boils down to classification. For classification tasks there are mature fully supervised algorithms, but they require a large amount of high-quality data, while acquiring high-quality video in real scenes is difficult and relevant real data is lacking at this stage. Transfer learning for small datasets and weakly supervised algorithms with low requirements on the label quality of data samples therefore show outstanding advantages.
Input: graph model G(υ, ε), graph function parameters W, video block window size k, the initial adjacency matrix A_{0} of the graph model, the input features X_{0} of the graph nodes, and the margin between positive and negative samples in the triplet loss.
Output: discriminative target features X^{∗}, the feature-space center C_{p} of person samples, and the feature-space center C_{n} of non-person samples:
1) Initialize A = A_{0}, C_{p} = 0, C_{n} = 0
2) Randomly sample a triplet \( {\hat{X}}_i=\mathrm{Triplet}\left({X}_a,{X}_p,{X}_n\right) \) from the input feature samples X_{0}, where X_{a} is the anchor point, X_{p} is a random sample of the same category as the anchor, and X_{n} is a sample of the opposite category
3) Repeat
4) Forward propagation
5) For X in Triplet(X_{a}, X_{p}, X_{n}) do:
6) For all layers of G do:
7) Generate the diagonal node degree matrix \( \tilde{D} \) from A
8) Calculate the normalization coefficient \( \frac{1}{\tilde{Z}}={\tilde{D}}^{-\frac{1}{2}}\hat{A}{\tilde{D}}^{-\frac{1}{2}} \)
9) For X_{t} in X do:
10) Node state update \( {X}_t=\sigma \left(\sum \limits_{X_{t_i}\in {\mathrm{neighbor}}^k\left({X}_t\right)}\frac{1}{\tilde{Z}}{X}_{t_i}{W}_{t_i}\right) \)
11) End for
12) Update the adjacency matrix A from the new graph node states
13) End for
14) Return X, update the triplet set \( \hat{X} \)
15) End for
16) Return \( \hat{X} \) as X^{∗}
17) Back propagation
18) For Triplet(X_{a}, X_{p}, X_{n}) in X^{∗} do:
19) Compute the triplet loss \( L=\max \left({\left\Vert {X}_a-{X}_p\right\Vert}_2^2-{\left\Vert {X}_a-{X}_n\right\Vert}_2^2+\mathrm{margin},0\right) \)
20) End for
21) Calculate the average loss \( \overline{L} \)
22) Back-propagate the \( \overline{L} \) gradient using the Adam algorithm
23) Until convergence or the maximum number of training iterations is reached