Video person reidentification based on neural ordinary differential equations and graph convolution network

Person reidentification has become a challenging research topic in the field of computer vision because person appearance is easily affected by lighting, posture, and viewpoint. To make full use of the continuity of video data along the time line and of the unstructured relationships among features, this paper proposes a video person reidentification algorithm that combines a neural ordinary differential equation (ODE) network with a graph convolution network. First, a continuous-time model is constructed with the ODE network to capture hidden information between video frames: by modeling the latent space of the hidden variables with a latent time-series model, inter-frame information that a discrete model may ignore can be recovered. Then, the features of the generated video frames are passed to the graph convolution network for reconstruction. Finally, weak supervision is used to classify the features. Experiments on the PRID2011 dataset show that the proposed algorithm significantly improves person reidentification performance.


Introduction
In recent years, with increasing attention to public safety and the development of video surveillance technology, more and more cameras have been deployed in crowded places [1,2]. However, large-scale video monitoring systems produce massive amounts of monitoring data, which are difficult to analyze and process quickly by human effort alone. Computer vision technology that automatically performs the tasks of an intelligent monitoring system has therefore emerged [3]. Although face recognition technology is now relatively mature, effective face images often cannot be obtained in real monitoring environments, so it is very important to locate and search for people using whole-body information. This has made person reidentification a research hotspot in the field of computer vision that has attracted extensive attention [4].
The purpose of person reidentification is to accurately identify a person who appears in one camera when he or she reappears in other cameras [4,5]. Owing to changes in camera viewpoint, dramatic variation in the posture of the moving human body, illumination, occlusion, and background clutter [6,7], person reidentification remains a great challenge. Current research methods are mainly divided into single-frame image-based and video-based person reidentification [8].
Early video person detection methods are usually based on image detection, judging whether a person is present in each frame by extracting static image features. With the wide application of deep models in video detection, however, increasing attention has been paid in recent years to the temporal and dynamic characteristics of video information. The graph convolution network (GCN) and the ordinary differential equation (ODE) network are recent achievements in machine learning that apply unstructured and continuous models to various learning tasks. In this paper, a continuous model of the video is established with an ordinary differential equation network and combined with a graph convolution network, yielding a continuous spatiotemporal person detection model based on the video stream.

Graph convolutional network
Most graph neural network models use graph convolution, whose core is convolution-kernel parameter sharing within local regions. The same convolution kernel is used for the convolution operation at every graph node, which greatly reduces the number of model parameters. Updating the kernel parameters can be seen as learning a function on a graph G = (υ, ε), where υ and ε represent the vertices and the connecting edges between vertices of the graph, respectively. The input is the feature matrix X ∈ R^(N×D), where N is the number of vertices and D is the feature dimension, together with a matrix representation of the graph structure (usually the adjacency matrix A). The output is Z ∈ R^(N×F), where F is the output dimension of the graph convolution layer. Each graph convolution layer can be represented as the following nonlinear function:

H^(l+1) = f(H^(l), A),

where H^(0) = X, H^(L) = Z for the last layer L, and l is the index of the convolution layer. For different tasks and models, an appropriate convolution function f(⋅, ⋅) is selected and parameterized. This paper uses the same convolution function as Kipf et al. [8], whose basic form is:

f(H^(l), A) = σ(A H^(l) W^(l)),

where W^(l) is the weight parameter of the l-th layer of the neural network, and σ(⋅) is a nonlinear activation function, usually the rectified linear unit (ReLU). After the above improvement, the graph convolution function can be calculated as:

f(H^(l), A) = σ(D̂^(−1/2) Â D̂^(−1/2) H^(l) W^(l)),

where Â = A + I, I is the identity matrix, and D̂ is the diagonal vertex degree matrix of Â.
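Since the experiments below use PyTorch, the propagation rule can be made concrete with a minimal sketch of one such layer. The class and argument names are illustrative, not the paper's implementation; only the normalization and the ReLU choice follow the text.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One graph convolution layer with the normalization above:
    f(H, A) = sigma(D_hat^(-1/2) A_hat D_hat^(-1/2) H W), A_hat = A + I."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # W^(l)

    def forward(self, H, A):
        # A_hat = A + I: add self-loops to the adjacency matrix
        A_hat = A + torch.eye(A.size(0), device=A.device)
        # D_hat: diagonal vertex degree matrix of A_hat
        deg = A_hat.sum(dim=1)
        d_inv_sqrt = deg.pow(-0.5)
        # Symmetric normalization D_hat^(-1/2) A_hat D_hat^(-1/2)
        A_norm = d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)
        # sigma is ReLU, as in the text
        return torch.relu(A_norm @ self.weight(H))
```

Stacking such layers with input X and adjacency A yields the output features Z described above.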
Neural ordinary differential equation network
The residual update of a hidden state in a deep network, h_(t+1) = h_t + f(h_t), can be viewed in the continuous limit as the ordinary differential equation

dh(t)/dt = f(h(t), t),    (5)

where h_t stands for the hidden state, and f(⋅) represents the nonlinear transformation of a single-layer neural network.

Feature extraction for video pedestrians involves two aspects. The first is static feature extraction from video frame images in regular space, including pedestrian edges, color, and other features. Here the mainstream neural networks already achieve a high recognition rate, and experiments show that static pedestrian feature extraction does not require a very deep network. The second aspect, which is also one of the difficulties of video pedestrian detection, is the spatiotemporal dynamic characteristics of pedestrians over a time span. Many scholars have proposed different methods to extract these dynamic characteristics, but none of the current methods accounts for the continuous information lost between discrete video frames. From the perspective of continuous events, this paper attempts to fit the probability distribution of the hidden dynamic characteristics of a person, Z, through the hidden-state model of the ODE network, as shown in Fig. 1.
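As a concrete illustration of such a continuous hidden-state model, the sketch below evolves a hidden state through a black-box ODE solver. It assumes the torchdiffeq package and illustrative dimensions; neither is specified by the paper.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # assumed solver package, not named in the paper

class ODEFunc(nn.Module):
    """f(h(t), t): a single-layer nonlinear transformation defining dh/dt."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

    def forward(self, t, h):
        return self.net(h)

func = ODEFunc(dim=64)
h0 = torch.randn(1, 64)                 # hidden state at the initial time
t = torch.linspace(0.0, 1.0, steps=10)  # query times, including between frames
h_t = odeint(func, h0, t)               # hidden states h(t), shape (10, 1, 64)
```

Because the solver accepts arbitrary query times, the hidden state can be evaluated between the discrete frame times, which is exactly the inter-frame information a discrete model drops.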
First, the static feature vector X of a single frame is extracted by a common convolutional network, such as a residual network. If features are extracted per image block, an additional convolutional layer is added to predict the category of the complete frame image; the feature vector of the complete image can then be obtained from the sequentially arranged block features, or a pooling layer can be used. Second, the static features X are sampled in reverse chronological order and fed to a temporal network (a recurrent neural network in this paper) to predict the initial hidden state. The hidden-state probability distribution P(z) is obtained from the ODE network, so the hidden-state value can be predicted at any time. Finally, the hidden-state value is converted to the target feature vector by the decoder.
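A minimal sketch of this encoder-ODE-decoder pipeline follows, with the variational sampling of P(z) omitted for brevity; all module sizes and names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # assumed solver package

class LatentODE(nn.Module):
    """Pipeline from the text: an LSTM reads the static frame features in
    reverse chronological order to predict the initial hidden state, the ODE
    network evolves it to any query time, and a decoder maps hidden states
    back to target feature vectors."""

    def __init__(self, feat_dim=128, hid_dim=64):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hid_dim, batch_first=True)
        self.odefunc = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.Tanh())
        self.decoder = nn.Linear(hid_dim, feat_dim)

    def rhs(self, t, z):
        return self.odefunc(z)  # dz/dt

    def forward(self, X, t):
        # X: (batch, frames, feat_dim) static features; flip to reverse time
        out, _ = self.encoder(torch.flip(X, dims=[1]))
        z0 = out[:, -1]                # predicted initial hidden state
        z_t = odeint(self.rhs, z0, t)  # hidden state at each query time t
        return self.decoder(z_t)       # target features per time step
```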

Video person reidentification based on ODE and GCN
The video frames within a time span are first connected as a graph model G_k = (υ_k, ε_k) with window size 2k + 1, centered on the current moment. Each video frame has k incoming and k outgoing edges plus a self-loop edge, 2k + 1 edges in total. An undirected graph is used so that the relevance of events both before and after the current moment is considered simultaneously. Each graph convolution layer can contain n such windows, depending on the size of the network and usually determined by the length of the video block. The state update equation of each node in a middle layer of the graph convolution network can thus be expressed as:

X_t^(l+1) = σ( Σ_(X_ti ∈ neighbor_k(X_t)) (1/Z) X_ti^(l) W_ti ),

where Z is the normalization factor, the same as in Eq. (7), l is the layer index of the graph convolution network, and σ is the activation function.

Video detection ultimately reduces to classification. Mature fully supervised algorithms exist for classification tasks, but they require a large amount of high-quality data, while acquiring high-quality video of real scenes is difficult and relevant real data are lacking at this stage. Transfer learning for small datasets and weakly supervised algorithms, which place low demands on the label quality of data samples, therefore show outstanding advantages. The training procedure is summarized below (a PyTorch sketch of the triplet objective follows the algorithm).

Input: graph model function G(υ, ε), graph function parameters W, video block window size k, initial adjacency matrix A_0 of the graph model, input features X_0 of the graph nodes, and the margin between positive and negative samples in the triplet loss.
Output: discriminative target features X*, the feature-space center C_p of person samples, and the feature-space center C_n of non-person samples.
1) Initialize A = A_0, C_p = 0, C_n = 0
2) Randomly sample triplets X_i = Triplet(X_a, X_p, X_n) from the input features X_0, where X_a is the anchor point, X_p is a random sample of the same category as the anchor, and X_n is a sample of the opposite category
3) Repeat
4) Forward pass
5) For X in Triplet(X_a, X_p, X_n) do:
6) For all layers of G do:
7) Generate the diagonal node degree matrix D from A
8) Calculate the normalization coefficient 1/Z
9) For X_t in X do:
10) Update the node state: X_t = σ( Σ_(X_ti ∈ neighbor_k(X_t)) (1/Z) X_ti W_ti )
11) End for
12) Update the adjacency matrix A from the new graph node states
13) End for
14) Return X and update the triplet set X̃
15) End for
16) Return X̃ as X*
17) Backward pass
18) For Triplet(X_a, X_p, X_n) in X* do:
19) Compute the triplet loss L = max(‖X_a − X_p‖₂² − ‖X_a − X_n‖₂² + margin, 0)
20) End for
21) Calculate the average loss L̄
22) Back-propagate the gradient of L̄ using the Adam algorithm
23) Until convergence or the maximum number of training iterations is reached
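A minimal PyTorch sketch of the triplet objective and update in steps 18-22 above; it assumes `model` maps node features and an adjacency matrix to refined features, and the margin value is illustrative.

```python
import torch

def triplet_loss(x_a, x_p, x_n, margin=0.3):  # margin value is illustrative
    """L = max(||X_a - X_p||^2 - ||X_a - X_n||^2 + margin, 0), averaged."""
    d_pos = (x_a - x_p).pow(2).sum(dim=1)  # squared distance to positive
    d_neg = (x_a - x_n).pow(2).sum(dim=1)  # squared distance to negative
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

def train_step(model, optimizer, x_a, x_p, x_n, A):
    """One forward/backward pass over a batch of triplets (steps 4-22)."""
    optimizer.zero_grad()
    loss = triplet_loss(model(x_a, A), model(x_p, A), model(x_n, A))
    loss.backward()
    optimizer.step()  # Adam, as in step 22
    return loss.item()
```

Note that the loss uses squared L2 distances, as in step 19; PyTorch's built-in `nn.TripletMarginLoss` defaults to non-squared distances, so the formula is written out explicitly here.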

Experimental
The algorithm is tested on two video person datasets, PRID2011 and iLIDS-VID [9]. The PRID2011 dataset contains video captured by two static cameras: camera A recorded 385 people and camera B recorded 749 people, of whom 200 were captured by both cameras. Each person's video subset contains 5 to 675 frames. To ensure the effectiveness of spatiotemporal features, the video frames of 178 people are selected. All videos in the dataset are shot outdoors with little occlusion and no crowding, and every person is shown in a rich set of walking postures. Figure 2 shows some example video frames.
The iLIDS-VID dataset consists of 300 different pedestrians observed through two disjoint cameras in a public open space. It was created from two non-overlapping camera views in the i-LIDS Multiple-Camera Tracking Scenario (MCTS) dataset, captured by a multi-camera CCTV network in an airport arrival hall. It comprises 600 image sequences of 300 different individuals, with a pair of sequences from the two camera views for each person. Each image sequence has a variable length of 23 to 192 frames, with an average of 73 frames. Figure 3 shows some example video frames.
The experiments are based on the PyTorch deep learning framework. The hardware configuration is 32 GB of memory, an Intel(R) Core(TM) i7-4790K processor, and an NVIDIA GTX 1080 8 GB graphics card. For each experiment, the training and test sets are generated randomly, the experiment is repeated 10 times under the same conditions, and the average of the 10 results is taken as the final result. The recognition rate is used to evaluate the performance of the algorithm.
On the video datasets, every fifth frame is used for training and all data are used for testing; at the same time, the detection rate and false-alarm rate of the hidden-state model on the untrained frames are measured. All images are used in training, and all frames are tested. Training uses mini-batches of 128 frames, 2000 training iterations, an initial learning rate of 0.0001, and the gradient descent method. The hidden dimension of the ODE network's convolution model is 64; the specific structure is shown in Table 1. Group normalization is used for all normalization layers, with at most 32 groups. In classification training, cross-entropy loss and triplet loss are used for fully supervised and weakly supervised training, giving the models CGN UXe and CGN uwk, respectively. The encoder of the hidden-state model is a long short-term memory (LSTM) network, the decoder is a fully connected layer, the hidden-layer depth of the model is 128, and the window value k is 2. Hidden states are sampled by the Monte Carlo method, with 100 + 50 sampling points per video clip, covering detection over the current period and prediction over a future period, respectively. Cross-validation is used for the image datasets, with batch size 32, hidden dimension 32, 100 epochs, and an initial learning rate of 0.001. Adam is used for gradient back-propagation. Cross-entropy and triplet losses are again applied to fully supervised and weakly supervised classification training, respectively, and the best performance on each comparison index is reported.
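The stated settings could be assembled as in the following sketch; `VideoReIDNet` is a hypothetical stand-in for the full network, shown only to illustrate the group-normalization and optimizer choices.

```python
import torch
import torch.nn as nn

class VideoReIDNet(nn.Module):
    """Hypothetical stand-in; only the settings stated above are reproduced."""
    def __init__(self, hidden_dim=64):  # ODE-network hidden dimension
        super().__init__()
        self.conv = nn.Conv2d(3, hidden_dim, kernel_size=3, padding=1)
        # All normalization layers use group normalization, at most 32 groups
        self.norm = nn.GroupNorm(num_groups=32, num_channels=hidden_dim)

    def forward(self, x):
        return torch.relu(self.norm(self.conv(x)))

model = VideoReIDNet()
# Video experiments: mini-batches of 128 frames, 2000 iterations, lr = 1e-4,
# plain gradient descent; the image experiments instead use Adam at lr = 1e-3.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
criterion_full = nn.CrossEntropyLoss()  # fully supervised branch
```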

Result and discussion
In the experiment, 300 people were randomly selected to form the training set, and the remaining 300 people formed the test set. The experimental results are compared with other typical algorithms: the dynamic RNN-CNN network [10], the accumulative motion context (AMOC) network [11], the algorithm using matrix shared attention [12], and the rearrangement application method based on matrix shared attention [8]. Table 2 reports the person reidentification rates. The data in Table 2 show that the recognition rate of the proposed algorithm is significantly improved over the existing algorithms: Rank-1 reaches 80.6%, which is 4.4 percentage points higher than the method proposed in [8], and Rank-5 and Rank-20 also improve over the other algorithms [13-17].
The experimental data for the iLIDS-VID dataset in Table 3 show that this method achieves a higher recognition rate than the existing mainstream methods, further verifying the effectiveness of the algorithm.

Conclusion and future work
Person reidentification is an important research topic in the field of computer vision. To improve the performance of video person reidentification, an algorithm based on the ODE network and the graph convolution network is proposed. First, the latent ODE model is used to fit the hidden distribution of the video and to supplement the information lost between frames. Then, the graph convolution network links the continuity of, and intervals between, video frames, establishing the relationships among unstructured features and separating positive and negative samples. Finally, classification results are obtained either with a fully connected layer over the features or by directly computing distances to the centers of the positive and negative samples. Experimental results show that the method significantly improves video person reidentification performance, which is of great significance for research in this area.
Our main future work is to further improve recognition accuracy while reducing the complexity and time consumption of the algorithm.