Personnel emotion recognition model for Internet of Vehicles security monitoring in community public space

In recent years, the Internet of Vehicles (IoV), with intelligent connected vehicles as terminal nodes, has become a development trend in the automotive industry and a research hotspot in related fields, owing to its intelligence, networking, and low-carbon, energy-saving characteristics. Real-time emotion recognition for drivers and pedestrians in a community can be used to prevent fatigue driving and malicious collisions and to support safety verification and pedestrian safety detection. This paper studies a facial emotion recognition model suitable for the IoV. Considering the fluctuation of acquisition angle and image quality in IoV application scenes, natural-scene videos similar to the vehicle environment, together with the corresponding galvanic skin response (GSR), are used to build the testing set for emotion recognition. An expression recognition model combining a convolutional codec with a Support Vector Machine (SVM) classifier is then proposed, and emotion recognition testing is completed on the basis of Algorithm 1. The matching accuracy between the emotion recognition model and the GSR labels is 82.01%: of the 189 effective videos involved in testing, 155 are correctly identified.

realize the intelligent monitoring and decision-making of vehicle information, and thereby the intelligent control of vehicles [3].
Face recognition technology, against the background of artificial intelligence, has developed rapidly and been applied in many fields because of its wide application range, strong operability, and rich information. At present, the main applications of face recognition include face detection, identity recognition, and emotion recognition. The community is an important part of a city, but the traditional community management mode lacks intelligent means and cannot meet residents' needs for safe and efficient community services. This paper focuses on facial emotion recognition technology applicable to the vehicle environment in community public space. Although the accuracy of facial emotion recognition in the vehicle environment is disturbed by many factors, such as angle fluctuation and transmission quality, facial data offer high feature discrimination and strong expressive ability. Emotion recognition therefore has high research value for real-time IoV monitoring applied to fatigue driving, safety verification, malicious collision, and pedestrian safety detection [4].
Emotion recognition differs from automobile manufacturing; the latter is a product of the second industrial revolution with a long development history. Nevertheless, emotion recognition has become a hot research field owing to its excellent performance and application value [5]. The early concept of emotion recognition was put forward in "Affective Computing" by Professor Picard of the Massachusetts Institute of Technology [6]. Human emotion is often expressed through facial expressions, voice, and gestures, and scholars have conducted emotion recognition and analysis on each of these channels [7][8][9]. The American psychologist Mehrabian believed that facial expressions have the strongest ability to transmit information, accounting for about 55% of emotional communication [10]. We believe that voice and posture are affected by subjective psychological factors, which leads to insufficient representation ability. Common facial expression recognition works on static images, but predicting emotion requires dynamic facial expression because of its persistence. In addition, the development of physiology has made the recognition of human emotion from physiological data a hot field. In 2001, Picard et al. utilized multi-dimensional physiological signals to realize five levels of emotion recognition [11]. Subsequently, many scholars began to analyze and research physiological data and video emotion [12,13]. In 2006, Savran et al. utilized the International Affective Picture System (IAPS) as stimulus material to construct the "2005 emotional database", a data set containing facial and physiological data [14]. Koelstra et al. utilized pictures and music as stimulus materials to obtain expression videos and physiological data, and established the currently popular emotion data set "DEAP" [15].
Later, Soleymani et al. utilized the stimulation of network resources to construct the "MAHNOB-HCI" data set, containing facial details, audio, and physiological data [16]. It can be seen that research correlating physiological data with video emotion to complete emotion recognition has become one of the mainstream directions in related fields [17]. However, the large volume of physiological and emotional data brings high load, high power consumption, and resource shortage to the IoV system, and the fifth-generation (5G) network, with its sufficient spectrum resources, is well suited to IoT communication and transmission. Therefore, many scholars have studied the optimization of 5G communication technology and its combination with the IoT and IoV [18,19]. The emotion recognition model proposed in this paper is constructed from the rules shown in Algorithm 1 together with an expression recognition model, so true labels for the testing set are needed to verify its performance. There are three common ways to set video emotion labels. In the first, labels are defined directly by the known experimental conditions of the subjects; in the second, by the subjects' emotional self-description after the experiment; in the third, from the subjects' physiological data recorded during video shooting. We consider the third method more reliable than the other two, because specified experimental conditions may not stimulate the corresponding emotions in everyone, and self-description is easily disturbed by psychological factors. To make the research more valuable, the testing set is obtained from videos of subjects in natural scenes similar to the vehicle environment.
Therefore, video emotion recognition is mainly divided into three processes: the definition of video emotion labels, the training of the video expression recognition model, and the recognition of video emotion.

Preparation of physiological data
At present, there are many video data sets for emotion recognition, including the early Extended Cohn-Kanade data set (CK+) [20] and the recent DEAP data set. These data sets have rich content and strong representational power. However, they contain only standard frontal faces, which differ greatly from the multi-angle, quality-fluctuating facial videos found in IoV applications. Therefore, facial videos of young people performing specific behaviors are collected to study facial emotion recognition in the vehicle environment, and physiological data are obtained to support accurate emotion prediction and inference. Figure 1 shows the physiological data of a subject during video recording.
The first channel is the heart rate based on PhotoPlethysmoGraphy (PPG); the second is the GSR value; the third is the electrical signal of respiration; the fourth is the ElectroCardioGram (ECG); and the fifth is the ElectroEncephaloGram (EEG).

The definition of video emotional label
The GSR of a human is controlled by the nervous system and has strong physiological characteristics [21]. A large number of studies have shown that emotional fluctuations cause significant changes in GSR [22,23]. Therefore, GSR is selected to define video emotion labels and improve work efficiency. Related studies using the feature extraction method of the University of Augsburg in Germany have found that subjects' emotions are reflected in the characteristics of their GSR, as is also shown in Fig. 2 [24]. The images above are derived from the feature results of only some subjects and are not in themselves conclusive. However, by observation and testing on the data set, the following applicable conclusions can be obtained when emotion is categorized into three classes.
• Happy: within the video, there are dense multi-band peaks, mostly distributed at the beginning of the video;
• Quiet: within the video, there are essentially no peaks, or at most one at either end;
• Unhappy: within the video, peaks appear at both the beginning and the end, or dense peaks appear only in the middle with almost no interval.
After the above rules are summarized, the emotion label of each testing video is defined from the GSR values and verified. The specific experimental steps are as follows.
Data preprocessing. First, the GSR, the most representative of the physiological signals, undergoes noise reduction and smoothing. Abnormal values in the data are replaced by nearby values to complete noise reduction, and a Savitzky-Golay filter is used to smooth the data. The Savitzky-Golay filter is a digital filter that fits adjacent data points to a low-order polynomial by linear least squares [25]; the least-squares equations have an analytical solution when the data points are equally spaced. Figure 3 is a diagram of the smoothing process.
The blue point in each window of Fig. 3 is the center point of the window, and the mathematical principle of the filtering is (1):

(1) x̂_k = (1/H) Σ_{i=-w}^{+w} h_i · x_{k+i}

The Savitzky-Golay filter regresses a small window of data onto a polynomial by least squares and then uses that polynomial to estimate the point at the center of the window, where h_i is the smoothing coefficient and H is the normalization factor; h_i/H is fitted by the principle of least squares in (1).
On the same curve, windows of different widths can be selected at any position to meet different filtering needs, which is useful for processing time-series data at different stages.
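The preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: the outlier threshold, window width, and polynomial order are assumptions, and the Savitzky-Golay step is implemented directly as a sliding least-squares polynomial fit.

```python
import numpy as np

def remove_outliers(x, z_thresh=3.0):
    """Noise reduction: replace abnormal values with a nearby value
    (here, the previous sample; z_thresh is an assumed threshold)."""
    x = np.asarray(x, dtype=float).copy()
    mu, sigma = x.mean(), x.std()
    for k in range(len(x)):
        if sigma > 0 and abs(x[k] - mu) > z_thresh * sigma:
            x[k] = x[k - 1] if k > 0 else mu  # fall back to the mean at k = 0
    return x

def savgol_smooth(x, window=7, order=2):
    """Savitzky-Golay smoothing: fit a low-order polynomial to each
    window by least squares and keep the fitted value at the centre."""
    x = np.asarray(x, dtype=float)
    half = window // 2
    t = np.arange(-half, half + 1)
    out = x.copy()
    for k in range(half, len(x) - half):
        coeffs = np.polyfit(t, x[k - half:k + half + 1], order)
        out[k] = np.polyval(coeffs, 0.0)  # polynomial value at the window centre
    return out
```

Because a polynomial of the chosen order reproduces itself exactly under least squares, smoothing a clean quadratic leaves it unchanged, while isolated spikes are flattened.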
The definition of emotions in videos. Emotional swings are short but continuous when not stimulated by a strong external environment. Therefore, the emotion label of each short video is defined by dividing the recording into time segments. The total length of each video is about 3 min, covering the process from the beginning of the experiment, through completion of the specified action, to the end of the experiment; we believe this process can reflect a variety of specific emotions. To improve the accuracy of the definition, every 15 s of video is defined as one emotional video, with a 5 s interval between every two emotional videos.
The emotion label of each short video is then defined from the relationship shown in Fig. 2 and the preprocessed GSR values.
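The 15 s / 5 s segmentation can be sketched as below. The sampling rate `fs` is a hypothetical parameter, since the GSR sampling frequency is not stated in the text.

```python
def segment_indices(n_samples, fs, seg_s=15, gap_s=5):
    """Return (start, end) sample-index pairs for the 15 s emotional
    videos, skipping the 5 s interval between consecutive segments."""
    seg, gap = int(seg_s * fs), int(gap_s * fs)
    spans = []
    start = 0
    while start + seg <= n_samples:
        spans.append((start, start + seg))
        start += seg + gap  # jump over the 5 s interval
    return spans
```

For a 3 min recording sampled at an assumed 4 Hz, this yields nine 15 s segments, each of which is then labelled from its GSR peak pattern.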

Structure and principle of the proposed model
The expression recognition model in this paper combines a convolutional codec with an SVM classifier, and the model cooperates with Algorithm 1 to complete the prediction of video emotion. The features extracted by the convolutional codec are highly abstract and reduce the training noise caused by large differences in facial style. The core of the codec is image convolution and image deconvolution. Image convolution is developed from signal convolution: the one-dimensional signal is expanded to two dimensions and the convolution kernel is rotated by 180°. Image convolution introduces three calculation concepts: the kernel size F, the stride S, and the padding P. Their relationship to the output size W′ × H′ is shown in (2) and (3):

(2) W′ = (W − F + 2P)/S + 1
(3) H′ = (H − F + 2P)/S + 1

In a convolutional network, F, S, and P must be set reasonably so that the image size remains controllable as the number of network layers grows. Figure 4 shows a common image convolution process.
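The size relationships (2) and (3), together with the inverse relationships (11) and (12) given later for deconvolution, can be checked with a small helper. This is a generic sketch, not code from the paper:

```python
def conv_out_size(w, f, s, p):
    """Equations (2)/(3): output width/height after a convolution
    with kernel size f, stride s, and padding p."""
    return (w - f + 2 * p) // s + 1

def deconv_out_size(w_prime, f, s, p):
    """Equations (11)/(12): original width/height recovered by the
    matching deconvolution."""
    return s * (w_prime - 1) - 2 * p + f
```

For example, a 4 × 4 image convolved with a 3 × 3 kernel at S = 1, P = 0 yields a 2 × 2 output, and the deconvolution formula maps 2 back to 4.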
The value of S is 1 in the convolution shown in Fig. 4, and a total of four convolution operations occur. The process can be digitized when the image is flattened as shown in Fig. 5 and the convolution kernel is expanded into a matrix.
The matrix form of the convolution kernel is shown in (4): each row of the sparse matrix C contains the kernel weights placed at the image positions covered by one sliding-window location, with zeros elsewhere; for the image and kernel of Fig. 5, C has size [4 × 16]. The operation of convolution can then be expressed by (5):

(5) Y = C · X

where Y represents the result of the convolution, C represents the matrix of the convolution kernel, and X is the flattened image vector. The first convolution is the multiplication of the first row of (4) with the image vector in (5), and each subsequent convolution follows the same calculation.
A vector of size [4 × 1] is obtained after the calculation of (5), and the output image after convolution can be restored by reversing the flattening of Fig. 5. Convolution can therefore be described as the multiplication of a weight matrix with an image vector. The process of convolution is essentially a combination of forward propagation and backward derivative propagation. The principle of deriving x in back propagation is (6):

(6) ∂Loss/∂x_j = Σ_i (∂Loss/∂y_i) · (∂y_i/∂x_j)

where y_i can be expressed by (7):

(7) y_i = Σ_j C_ij x_j

so that ∂y_i/∂x_j = C_ij (8), and substitution into (6) gives (9):

(9) ∂Loss/∂x_j = Σ_i C_ij (∂Loss/∂y_i)
The multiplication of matrices can be achieved by changing the summation Σ in (9) to matrix form, giving (10):

(10) ∂Loss/∂x_j = (∂Loss/∂y)^T × C_*j = C_*j^T × (∂Loss/∂y)

where C_ij is the matrix of forward propagation and C_*j (its j-th column) is the matrix of backward propagation. In deconvolution, the mathematical roles of these two matrices are exchanged. The size relationships for deconvolution are shown in (11) and (12), which recover the width W and height H before convolution from W′ and H′ after convolution:

(11) W = S(W′ − 1) − 2P + F
(12) H = S(H′ − 1) − 2P + F

Therefore, the structure of the expression recognition model is shown in Fig. 6, where 'Conv' denotes convolution and 'Deconv' denotes deconvolution.
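The matrix view of convolution described above can be made concrete with a small NumPy sketch. The matrix C below corresponds to a 3 × 3 kernel sliding over a 4 × 4 image with S = 1 and P = 0, so C is [4 × 16]; for simplicity the kernel is applied as a sliding correlation (the 180° rotation mentioned above is omitted), and the kernel values are illustrative.

```python
import numpy as np

def conv_matrix(kernel, H, W):
    """Unroll a 2-D kernel into the sparse matrix C of (4)-(5), so that
    convolution (stride 1, no padding) becomes Y = C @ x_flat."""
    kh, kw = kernel.shape
    oh, ow = H - kh + 1, W - kw + 1
    C = np.zeros((oh * ow, H * W))
    for i in range(oh):
        for j in range(ow):
            for a in range(kh):
                for b in range(kw):
                    # kernel weight (a, b) multiplies image pixel (i+a, j+b)
                    C[i * ow + j, (i + a) * W + (j + b)] = kernel[a, b]
    return C

# 4 x 4 image, 3 x 3 kernel -> C is [4 x 16] and Y is [4 x 1]
img = np.arange(16, dtype=float).reshape(4, 4)
ker = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
C = conv_matrix(ker, 4, 4)
Y = C @ img.reshape(-1)   # forward pass, equation (5)
grad_y = np.ones(4)
grad_x = C.T @ grad_y     # backward pass, equation (10)
```

The backward pass multiplies by C^T, and a deconvolution layer exchanges the roles of C and C^T exactly as described above.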

The training and testing of the expression recognition model
The testing set is initially unlabeled to ensure the rationality of the emotion labels, so another data set with self-descriptive labels is used to train the model. Figure 7 shows part of the data set used to train the expression recognition model; the model parameters are obtained by training on this data set. Subsequent testing follows the principle of frame-by-frame analysis: about 30 frames appear in each second of the testing video. The following recognition process is defined so that the expression label can be accurately determined within 1 s.
The CascadeClassifier in OpenCV is used for face detection during model testing. This is a cascaded classifier based on Haar features, whose principle supports accurate face detection [27]. Figure 8 shows partial results of the face detection and the expression recognition model: a face is detected by the CascadeClassifier and recognized as happy by the expression recognition model, so a recognition result can be obtained for each frame image.
The key step in this section is transforming the frame-level expression results into a per-second result. The expression in a given second is considered 'Unhappy' if the number of 'Unhappy' frames in that second is greater than 6 and greater than the number of 'Happy' frames. Otherwise, the expression is considered 'Happy' if the number of 'Happy' frames is greater than 4 and greater than the number of 'Unhappy' frames; it is also considered 'Happy' if the number of 'Quiet' frames is less than 5 and the number of 'Happy' frames is greater than 0 and greater than the number of 'Unhappy' frames. When none of the above conditions is met, the expression for that second is defined as 'Quiet'.
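The per-second voting rule above can be written directly as a small function; frame labels are assumed to arrive as the strings 'Happy', 'Quiet', and 'Unhappy'.

```python
def second_label(frames):
    """Collapse ~30 frame-level expression labels into one per-second
    label, following the thresholds described in the text."""
    n_un = frames.count('Unhappy')
    n_ha = frames.count('Happy')
    n_qu = frames.count('Quiet')
    if n_un > 6 and n_un > n_ha:
        return 'Unhappy'
    if n_ha > 4 and n_ha > n_un:
        return 'Happy'
    if n_qu < 5 and n_ha > 0 and n_ha > n_un:
        return 'Happy'
    return 'Quiet'
```

Note the rules are ordered: the 'Unhappy' test is applied first, and 'Quiet' is the fall-through default.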

Emotion recognition for short videos
An emotion sequence represented by '1', '0', and '2' is obtained after the expression prediction model is trained and tested; these three numbers represent the three emotions defined in Sect. 2.2. The emotion prediction of each short video is completed by Algorithm 1 after the emotion label per second is obtained, where N represents the total number of per-second expressions in the video and N_0, N_1, N_2 indicate the numbers of the corresponding expressions. n_0, n_1, n_2 represent the numbers of the different expressions in each 15 s of video. To reduce the loss of useful information, the numbers of 'Happy' and 'Unhappy' labels in each 5 s interval video are defined as n_i1 and n_i2. In addition, the distance between the first '1' appearing in each interval video and the last '1' in the preceding 15 s video is defined as d_11, and the distance between the last '1' appearing in each interval video and the first '1' in the following 15 s video is defined as d_12. The emotion recognition model thus requires the characteristics of each subject analyzed by Algorithm 1.
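Algorithm 1 itself is not reproduced here, but the statistics it consumes can be computed as in the sketch below. This is a hedged illustration: the per-second labels are assumed to arrive as the string codes '0', '1', '2', and only the global counts N, N_0, N_1, N_2 and the per-segment counts n_0, n_1, n_2 are derived; the interval statistics n_i1, n_i2, d_11, d_12 would be computed analogously from the 5 s gaps.

```python
from collections import Counter

def algorithm1_counts(labels, seg_s=15, gap_s=5):
    """labels: per-second emotion codes ('0' Quiet, '1' Happy, '2' Unhappy).
    Returns the total length N, the global Counter (N_0, N_1, N_2),
    and one Counter per 15 s segment (n_0, n_1, n_2)."""
    total = Counter(labels)
    N = len(labels)
    segments = []
    start = 0
    while start + seg_s <= N:
        segments.append(Counter(labels[start:start + seg_s]))
        start += seg_s + gap_s  # skip the 5 s interval video
    return N, total, segments
```

A driving-style decision rule in Algorithm 1 can then compare these counters per subject.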

The overall structure of the emotion recognition model
The overall flow chart of the model, shown in Fig. 9, clearly illustrates the emotion recognition process.

The result of preparation
The data set used for training needed to be grayscaled and standardized. Preparation of the testing data set was divided into two parts: first, the videos were grayscaled and standardized; then the corresponding GSR underwent noise reduction and smoothing. Partial results of the latter are shown in Figs. 10 and 11.

The setting of parameter
The training parameters of the proposed expression recognition model and the comparison models are shown in Table 1. These parameters were applied to the proposed facial expression recognition model and to its comparison models, ResNet18 and VGG16. Initial weights obtained by transfer learning on ImageNet were applied to the comparison models. The parameters of the CascadeClassifier are shown in Table 2.

Model evaluation method
The model was evaluated by matching the labels defined by GSR with the results of the emotion recognition model; the former served as the true values of video emotion and the latter as the predicted values. Equation (13) for model evaluation is as follows:
(13) Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP is the number of target-class samples predicted as the target class, TN the number of non-target-class samples predicted as non-target classes, FN the number of target-class samples predicted as non-target classes, and FP the number of non-target-class samples predicted as the target class.
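Equation (13) is straightforward to implement. The counts used in the check below are an assumed split: the paper only reports 155 correct out of 189, not the individual confusion-matrix terms.

```python
def accuracy(tp, tn, fp, fn):
    """Equation (13): overall accuracy from confusion-matrix counts."""
    return (tp + tn) / (tp + tn + fp + fn)
```

With 155 of 189 videos correctly identified, the accuracy is 155/189 ≈ 82.01%, matching the value reported in this paper.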

Experimental results
The results of the emotion recognition model are divided into two parts: training and testing. Figure 12 shows the training part of the model.
The loss decay and accuracy increase of the model took a total of 120 epochs. The features extracted by the model have strong expressive ability, although the training process is slow; the original image can be restored accurately from these features. The input images were uniformly grayscaled and standardized to reduce the influence of facial background. The comparison between an original image and the image generated from its features is shown in Fig. 13.
After the abstract features were obtained by the codec, the SVM classifier had to be trained to complete the model training. Training the SVM classifier included selecting its kernel function, the kernel parameter γ, and the penalty coefficient c. The selection of the kernel function is shown in Fig. 14.
As shown in Fig. 14, the highest accuracy was 97.71% with the Gaussian kernel, while the peak accuracies were about 97.45% and 91.59% with the polynomial and sigmoid kernels, respectively. The best values of γ for the three kernels were 8.935, 8.935, and 2, respectively. On this basis, the process of adjusting the penalty coefficient c is shown in Fig. 15.
The highest classification accuracy was 97.83% with the Gaussian kernel, γ = 8.935, and c = 15. ResNet18 and VGG16 were trained and verified on the training set to further illustrate the superiority of the proposed model in expression recognition; Fig. 16 shows their training and verification results.
Figure 16 shows that ResNet18 was slightly better than VGG16, reaching its highest verification accuracy of 97.64% at epoch 11, which is still slightly lower than that of the proposed model.
After the three models were trained, their predictions were matched with the labels defined by GSR. The matching accuracy of the proposed model over all subjects was 82.01%: of the 189 effective short videos in the testing set, 155 were correctly identified. The testing accuracies of the two comparison models were both 69.84%. Figure 17 shows how the testing accuracy of the three models changed as the number of subjects increased.
The three curves quickly reached their peaks, and their accuracy decreased in the second half. A confusion matrix was used to further analyze the results:
• Differences in facial style made facial emotion recognition difficult for subjects with unclear or unusually rich facial expressions; such subjects were distributed in the second half of the testing set, which is also reflected in the misjudgments in Fig. 18;
• The model could not detect facial expressions when subjects lowered their heads, which also reduced accuracy;
• Recognition of the 'Quiet' emotion had high accuracy because of its high frequency, whereas recognition of the 'Unhappy' emotion was weak owing to the small number of samples.

Conclusions
In this paper, reliable emotion labels for the proposed emotion recognition model were first obtained from GSR, and the model then achieved an accuracy of 82.01% through a reasonable recognition process. Because the data set used shares characteristic similarities with video data from the vehicle environment, the proposed model has practical value for predicting human emotion during natural activities in the vehicle environment.