
Personnel emotion recognition model for Internet of vehicles security monitoring in community public space


In recent years, the Internet of vehicles (IOV), with intelligent networked automobiles as terminal nodes, has become a development trend in the automotive industry and a research hot spot in related fields, owing to its intelligence, networking, low-carbon and energy-saving characteristics. Real-time emotion recognition for drivers and pedestrians in the community can be used to prevent fatigue driving and malicious collisions and to support safety verification and pedestrian safety detection. This paper studies a face emotion recognition model that can be used for IOV. Considering the fluctuation of image acquisition perspective and image quality in IOV application scenes, natural-scene videos similar to the vehicle environment and the corresponding galvanic skin response (GSR) are used to build the testing set for emotion recognition. An expression recognition model combining a codec and a Support Vector Machine (SVM) classifier is then proposed. Finally, emotion recognition testing is completed on the basis of Algorithm 1. The matching accuracy between the emotion recognition model and GSR is 82.01%: of the 189 effective videos involved in model testing, 155 are correctly identified.

1 Introduction

With the development and integration of information technology, computer technology and the automobile manufacturing industry, the IOV, proposed to raise the level of intelligent driving, has become widely known. IOV is a branch of industrial Internet of things (IOT) technology, so it inherits the IOT's advantages in sensing technology, mobile communication technology and intelligent analysis [1]. Intelligent networked automobiles have made IOV technology a hot spot in the automotive industry. IOV technology takes the moving automobile as the object of information perception and greatly improves automobile safety by strengthening global optimization and control [2]. As the specific implementation and application of traditional IOT technology in the automotive field, IOV can greatly improve the intelligence and efficiency of traffic management through wireless communication and intelligent information processing. IOV technology can therefore realize intelligent monitoring of and decision-making on vehicle information, and thus the intelligent control of vehicles [3].

Against the background of artificial intelligence, face recognition technology has been developed and applied rapidly in many fields because of its wide application range, strong operability and rich information. At present, its applications mainly include face detection, identity recognition and emotion recognition. The community is an important part of a city, but because the traditional community management mode lacks intelligent means, it cannot meet residents’ needs for safe and efficient community service. This paper focuses on face emotion recognition technology that can be applied to the vehicle environment in community public space. Although the accuracy of facial emotion recognition in the vehicle environment is disturbed by many factors, such as angle fluctuation and transmission quality, facial data have high feature discrimination and strong expressive ability. Emotion recognition technology therefore has high research value for IOV real-time monitoring applied to fatigue driving, safety verification, malicious collision and pedestrian safety detection [4].

Unlike automobile manufacturing, which is a product of the second industrial revolution with a long development history, emotion recognition has a much shorter history. Nevertheless, it has become a hot research field owing to its excellent performance and application value [5]. The early concept of emotion recognition was introduced in “Affective Computing” by Professor Picard of the Massachusetts Institute of Technology [6]. Human emotion is often expressed through facial expressions, voice and gestures, and scholars have conducted emotion recognition and analysis on these aspects [7,8,9]. American psychologist Mehrabian argued that facial expressions have the strongest ability to transmit information, accounting for about 55% of the emotional information conveyed [10]. We believe that voice and posture are affected by subjective psychological factors, which limits their representation ability. Facial expression recognition is commonly performed on static images, but because emotion persists over time, predicting it requires dynamic facial expressions.

In addition, the development of physiology has made the recognition of human emotions from physiological data a hot field. In 2001, Picard et al. used multi-dimensional physiological signals to realize five levels of emotion recognition [11]. Subsequently, a large number of scholars began to analyze physiological data and video emotion [12, 13]. In 2006, Savran et al. used the International Affective Picture System (IAPS) as stimulus material to construct the data set “2005 emotional database”, which contains facial data and physiological data [14]. Koelstra et al. used pictures and music as stimulus materials to obtain expression videos and physiological data, and established the currently popular emotion data set “DEAP” [15]. Later, Soleymani et al. used network resources as stimuli to construct the “MAHNOB HCI” data set, which contains facial details, audio and physiological data [16]. Research that correlates physiological data with video emotion to complete emotion recognition has thus become one of the mainstream directions in related fields [17]. However, a large volume of physiological and emotional data brings high load, high power consumption and resource shortage to the IOV system. The fifth generation (5G) network, with its sufficient spectrum resources, is well suited to IOT communication and transmission. Therefore, a large number of scholars have studied the optimization of 5G communication technology and its combination with IOT and IOV [18, 19].

The emotion recognition model proposed in this paper is built from the expression recognition model and the rules shown in Algorithm 1. The real labels of the testing set are therefore needed to verify the performance of the emotion recognition model. There are three common ways to label video emotion. In the first, the label is defined directly by the known experimental conditions of the subjects. In the second, it is defined by the subjects’ emotional self-description after the experiment. In the third, it is defined from the physiological data of the subjects recorded during video shooting. We consider the third method more reliable than the first two, because specified experimental conditions may not stimulate the corresponding emotion in everyone, and self-description is easily disturbed by psychological factors. To make the research more valuable, the testing set is built from videos of subjects in natural scenes similar to the vehicle environment.

The video emotion recognition process is therefore divided into three stages: the definition of video emotional labels, the training of the video expression recognition model and the recognition of video emotion.

2 Experiment and proposed method

2.1 Preparation of physiological data

At present, there are many video data sets for emotion recognition, including the early Cohn Kanade dataset plus (CK+) [20] and the more recent DEAP data set. These data sets are rich in content and strongly representative. However, they contain only standard faces, which differ greatly from the multi-angle facial video with large quality fluctuation encountered in IOV applications. Therefore, facial videos of young people performing specific behaviors are collected to study facial emotion recognition in the vehicle environment. In addition, physiological data are recorded to support accurate emotion prediction and inference. Figure 1 shows the physiological data of the subjects during video recording.

Fig. 1 Partial physiological data of experimental data set

The first channel is Heart rate based on PhotoPlethysmoGraphy (PPG). The second channel is the value of GSR. The third channel is the value of electrical signal of respiration. The fourth channel is the value of ElectroCardioGram (ECG). The fifth channel is the value of ElectroEncephaloGram (EEG).

2.2 The definition of video emotional label

Human GSR is controlled by the nervous system and has strong physiological characteristics [21]. A large number of studies have shown that emotional fluctuations cause significant changes in GSR [22, 23]. Therefore, GSR is selected to define the video emotional labels to improve work efficiency. Related studies, using the feature extraction method of the University of Augsburg in Germany, have found that a subject's emotion is reflected in the characteristics of GSR, as can also be seen in Fig. 2 [24].

Fig. 2 The relationship between common emotions and GSR

The images above are derived from the characteristic results of only some subjects and are therefore not very persuasive on their own. However, when emotion is categorized into three classes, the following applicable conclusions can be obtained by observation and testing on the data set.

  • Happy: Within the range of the video, there are dense multi-band peaks, which are mostly distributed at the beginning of the video;

  • Quiet: Within the range of the video, there are basically no peaks, or peaks appear only once at both ends;

  • Unhappy: Within the range of the video, there are peaks at the beginning and end of the video, or dense peaks appear only in the middle of the video with almost no intervals.

After the above rules are summarized, the emotional label of each testing video is defined from the value of GSR and then verified. The specific experimental steps are as follows.

Data preprocessing. First, noise reduction and smoothing are applied to GSR, the most representative of the physiological signals. For noise reduction, each abnormal value in the data is replaced with a nearby value. A Savitzky–Golay filter is then used to smooth the data. The Savitzky–Golay filter is a digital filter that fits adjacent data points to a low-order polynomial by linear least squares [25]. When the data points are equally spaced, the least-squares solution can be found analytically. Figure 3 is a diagram of the smoothing process.

Fig. 3 The principle of Savitzky–Golay filtering

The blue point in each window of Fig. 3 is the center point of the window, and the filtering is defined by (1).

$$x_{{k,{\text{smooth}}}} = \overline{x} = \frac{1}{H}\mathop \sum \limits_{i = - w}^{ + w} x_{k + i} h_{i} .$$

where \(h_{i}\) is the smoothing coefficient, \(H\) is the normalization factor and the window spans \(2w + 1\) points. The Savitzky–Golay filter uses least squares to fit a small window of data to a polynomial, and then uses the polynomial to estimate the point at the center of the window. The weights \(\frac{{h_{i} }}{H}\) in (1) are obtained from this least-squares fit.

On the same curve, different widths of window can be selected at any position to meet the needs of different filtering. This is useful for processing time series data at different stages.
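As an illustration of (1), the following minimal Python sketch applies the classic 5-point quadratic Savitzky–Golay weights, \(h = (-3, 12, 17, 12, -3)\) with \(H = 35\). The window width, the polynomial order and the handling of edge samples are assumptions made for the example; the paper does not state the settings that were actually used.

```python
# Classic 5-point quadratic Savitzky-Golay weights (w = 2): h_i and H in (1).
# These specific values are an assumption; the paper does not give its window.
H_COEFFS = (-3, 12, 17, 12, -3)
H_NORM = 35


def savgol_smooth(signal, coeffs=H_COEFFS, norm=H_NORM):
    """Smooth a sequence with x_k,smooth = (1/H) * sum_i x_{k+i} * h_i."""
    w = len(coeffs) // 2
    # Assumption: the first and last w samples are left unchanged.
    smoothed = list(signal)
    for k in range(w, len(signal) - w):
        acc = 0.0
        for i in range(-w, w + 1):
            acc += signal[k + i] * coeffs[i + w]
        smoothed[k] = acc / norm
    return smoothed


# A quadratic signal lies exactly on the fitted polynomial, so the interior
# samples are returned unchanged, while an isolated spike is spread out.
quadratic = [k * k for k in range(10)]
print(savgol_smooth(quadratic))
print(savgol_smooth([0, 0, 0, 0, 35, 0, 0, 0, 0]))
```

Widening the window (more coefficients) gives stronger smoothing, which is how different window widths can be chosen for different stages of the same GSR curve.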

The definition of emotions in videos. Emotional swings are short and continuous when they are not stimulated by a strong external environment. Therefore, each video is divided into time segments, and an emotional label is defined for each short video. The total length of each video is about 3 min, covering the process from the beginning of the experiment, through the completion of the specified action, to the end of the experiment. We therefore believe that this process can reflect a variety of specific emotions. To improve the accuracy of the definition, every 15 s of video is defined as one emotional video, with a 5 s interval between every two emotional videos. The emotional label of each short video is then defined based on the relationship in Fig. 2 and the preprocessed GSR values.
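The segmentation rule above can be sketched in a few lines of Python. The function name `emotion_segments`, the exact 180 s duration and the choice to start the first clip at 0 s are assumptions for illustration; the paper only states that videos last about 3 min, with 15 s clips separated by 5 s intervals.

```python
def emotion_segments(total_seconds, clip_len=15, gap_len=5):
    """Return (start, end) offsets in seconds of each emotional clip.

    `end` is exclusive. A trailing part shorter than clip_len is dropped
    (an assumption; the paper does not say how the remainder is handled).
    """
    segments = []
    start = 0
    while start + clip_len <= total_seconds:
        segments.append((start, start + clip_len))
        start += clip_len + gap_len
    return segments


clips = emotion_segments(180)
print(len(clips))           # number of 15 s clips in a 180 s video
print(clips[0], clips[-1])  # first and last clip boundaries
```

Under these assumptions a 180 s recording yields nine labeled clips, and the 5 s gaps between them are the interval videos that Algorithm 1 later inspects.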

2.3 Structure and principle of the proposed model

The expression recognition model of this paper combines a convolutional codec with an SVM classifier, and it works together with Algorithm 1 to complete the prediction of video emotion. The features extracted by the convolutional codec are highly abstract, which reduces the training noise caused by large differences in facial style. The core of the codec is image convolution and image deconvolution.

Image convolution is developed from signal convolution: it is obtained by extending one-dimensional signal convolution to two dimensions and rotating the convolution kernel by 180°. Image convolution involves three parameters: the convolution kernel size \(F\), the stride \(S\) and the padding \(P\). Their relationship is shown in (2) and (3) [26].

$$W^{\prime} = \left( {W - F + 2P} \right)/S + 1,$$
$$H^{\prime} = \left( {H - F + 2P} \right)/S + 1.$$

The above equations give the size of the output image when the size of the input image is \(\left[ {W \times H \times D} \right]\), where \(W^{\prime}\) and \(H^{\prime}\) are the width and height of the output image; the depth \(D^{\prime}\) of the output image is determined by the number of convolution kernels.

In a convolutional network, \(F\), \(S\) and \(P\) must be set reasonably so that the image size remains controllable as the number of network layers increases. Figure 4 shows a common image convolution process.
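The size rule in (2) and (3) is easy to check numerically. The sketch below is a small helper written for this purpose; the function name and the decision to reject settings that do not divide evenly are assumptions, not part of the paper.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output width or height from (2)/(3): (W - F + 2P) / S + 1."""
    numerator = size - kernel + 2 * padding
    if numerator < 0 or numerator % stride != 0:
        # Assumption: settings that do not tile the input exactly are rejected.
        raise ValueError("F, S and P do not fit the input size")
    return numerator // stride + 1


# A 4 x 4 image with a 3 x 3 kernel, stride 1 and no padding gives a 2 x 2
# output, i.e. four kernel positions.
print(conv_output_size(4, kernel=3, stride=1, padding=0))
# Padding of 1 keeps a 32-pixel side unchanged for a 3 x 3 kernel.
print(conv_output_size(32, kernel=3, stride=1, padding=1))
# A 2 x 2 kernel with stride 2 halves the side length.
print(conv_output_size(32, kernel=2, stride=2, padding=0))
```

Chaining such calls layer by layer is a quick way to confirm that a planned stack of convolutions never shrinks the image below the kernel size.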

Fig. 4 Schematic diagram of image convolution

In the convolution process shown in Fig. 4, the value of \(S\) is 1 and a total of four convolutions occur. The process can be expressed numerically when the image is unrolled as shown in Fig. 5 and the convolution kernel is expanded into a matrix.

Fig. 5 Matrix representation of the image

The matrix of the convolution kernel is shown in (4).

$$\left[ {\begin{array}{*{20}l} {w_{0,0} } \hfill & {w_{0,1} } \hfill & {w_{0,2} } \hfill & 0 \hfill & {w_{1,0} } \hfill & {w_{1,1} } \hfill & {w_{1,2} } \hfill & 0 \hfill & {w_{2,0} } \hfill & {w_{2,1} } \hfill & {w_{2,2} } \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & {w_{0,0} } \hfill & {w_{0,1} } \hfill & {w_{0,2} } \hfill & 0 \hfill & {w_{1,0} } \hfill & {w_{1,1} } \hfill & {w_{1,2} } \hfill & 0 \hfill & {w_{2,0} } \hfill & {w_{2,1} } \hfill & {w_{2,2} } \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & {w_{0,0} } \hfill & {w_{0,1} } \hfill & {w_{0,2} } \hfill & 0 \hfill & {w_{1,0} } \hfill & {w_{1,1} } \hfill & {w_{1,2} } \hfill & 0 \hfill & {w_{2,0} } \hfill & {w_{2,1} } \hfill & {w_{2,2} } \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & {w_{0,0} } \hfill & {w_{0,1} } \hfill & {w_{0,2} } \hfill & 0 \hfill & {w_{1,0} } \hfill & {w_{1,1} } \hfill & {w_{1,2} } \hfill & 0 \hfill & {w_{2,0} } \hfill & {w_{2,1} } \hfill & {w_{2,2} } \hfill \\ \end{array} } \right].$$

Therefore, the operation of convolution can be expressed by (5).

$$Y = CX.$$

where \(Y\) represents the result of convolution, \(C\) represents the matrix of the convolution kernel in (4) and \(X\) is the image unrolled into a vector as in Fig. 5. The first convolution is the multiplication of the first row of (4) with the image vector, and the subsequent convolutions follow the same calculation. Evaluating (5) therefore yields a vector of size \(\left[ {4 \times 1} \right]\). The output image after convolution can be restored by reversing the process of Fig. 5. The convolution process can thus be described as the multiplication of a weight matrix with an image vector.
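To make the matrix form concrete, the following Python sketch builds the \(4 \times 16\) matrix of (4) from a \(3 \times 3\) kernel and checks that \(Y = CX\) matches a direct sliding-window computation on a \(4 \times 4\) image. The helper names and the example image and kernel values are made up for illustration, and the sliding window is applied without flipping the kernel, as is usual in convolutional network layers.

```python
def build_conv_matrix(kernel, in_size):
    """Unroll a square kernel into matrix C (stride 1, no padding)."""
    k = len(kernel)
    out_size = in_size - k + 1
    rows = []
    for oy in range(out_size):
        for ox in range(out_size):
            row = [0] * (in_size * in_size)
            for ky in range(k):
                for kx in range(k):
                    # Row-major position of the pixel under kernel[ky][kx].
                    row[(oy + ky) * in_size + (ox + kx)] = kernel[ky][kx]
            rows.append(row)
    return rows


def matvec(matrix, vector):
    """Plain matrix-vector product Y = C X."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def direct_conv(image, kernel):
    """Sliding-window reference result, flattened row by row."""
    n, k = len(image), len(kernel)
    return [
        sum(image[oy + ky][ox + kx] * kernel[ky][kx]
            for ky in range(k) for kx in range(k))
        for oy in range(n - k + 1)
        for ox in range(n - k + 1)
    ]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]  # illustrative values only
X = [pixel for row in image for pixel in row]  # 16-element image vector
C = build_conv_matrix(kernel, in_size=4)       # 4 x 16, same layout as (4)
Y = matvec(C, X)
print(Y, Y == direct_conv(image, kernel))
```

Each row of `C` holds the nine kernel weights at the positions covered by one kernel placement, with zeros elsewhere, which is the sparsity pattern visible in (4).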

The process of convolution is essentially a combination of forward propagation and backward propagation of derivatives. The derivative of the loss with respect to the input element \(x_{j}\) in back propagation is given by (6).

$$\frac{{\partial {\text{Loss}}}}{{\partial x_{j} }} = \mathop \sum \limits_{i} \frac{{\partial {\text{Loss}}}}{{\partial y_{i} }} \times \frac{{\partial y_{i} }}{{\partial x_{j} }}.$$

where \(y_{i}\) can be expressed by (7).

$$y_{i} = \mathop \sum \limits_{j = 1}^{16} C_{ij} x_{j} .$$

Then (8) can be obtained from (7).

$$\frac{{\partial y_{i} }}{{\partial x_{j} }} = C_{ij} .$$

And (9) can be obtained by substituting (8) into (6).

$$\frac{{\partial {\text{Loss}}}}{{\partial x_{j} }} = \mathop \sum \limits_{i = 1}^{4} \frac{{\partial {\text{Loss}}}}{{\partial y_{i} }} \times C_{ij} .$$

Writing the sum in (9) as a matrix product gives (10).

$$\frac{{\partial {\text{Loss}}}}{{\partial x_{j} }} = \left( {\frac{{\partial {\text{Loss}}}}{\partial y}} \right)^{T} \times C_{*j} = C_{*j}^{T} \times \left( {\frac{{\partial {\text{Loss}}}}{\partial y}} \right).$$

where \(C\), with elements \(C_{ij}\), is the matrix of forward propagation and its transpose \(C^{T}\) is the matrix of backward propagation; \(C_{*j}\) denotes the \(j\)-th column of \(C\). In deconvolution, the roles of these two matrices are exchanged. The corresponding size relationship for deconvolution is shown in (11) and (12), in which \(W\) and \(H\) before convolution are calculated from \(W^{\prime}\) and \(H^{\prime}\) after convolution.

$$W = S\left( {W^{\prime} - 1} \right) - 2P + F,$$
$$H = S\left( {H^{\prime} - 1} \right) - 2P + F.$$
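The sketch below illustrates both points with a deliberately small example: the deconvolution size rule of (11) and (12) undoes the convolution size rule, and multiplying by \(C^{T}\) maps a gradient of output length back to input length. A one-dimensional convolution is used to keep the matrix readable; the weights and the gradient values are hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Forward size rule: W' = (W - F + 2P) / S + 1."""
    return (size - kernel + 2 * padding) // stride + 1


def deconv_output_size(size_out, kernel, stride=1, padding=0):
    """Inverse size rule from (11)/(12): W = S (W' - 1) - 2P + F."""
    return stride * (size_out - 1) - 2 * padding + kernel


def conv_matrix_1d(weights, in_len):
    """Matrix C of a 1-D, stride-1, unpadded convolution (one row per output)."""
    out_len = conv_output_size(in_len, len(weights))
    rows = []
    for o in range(out_len):
        row = [0] * in_len
        for k, w in enumerate(weights):
            row[o + k] = w
        rows.append(row)
    return rows


def transpose_matvec(matrix, vector):
    """Compute C^T v, the backward (deconvolution) direction."""
    n_cols = len(matrix[0])
    return [sum(matrix[i][j] * vector[i] for i in range(len(matrix)))
            for j in range(n_cols)]


C = conv_matrix_1d([1, 2, 3], in_len=4)   # 2 x 4 forward matrix
grad_y = [1, 1]                           # hypothetical dLoss/dy
grad_x = transpose_matvec(C, grad_y)      # dLoss/dx, one entry per input
print(grad_x, len(grad_x) == deconv_output_size(2, kernel=3))
```

The same transposed matrix that carries gradients backward through a convolution is what a deconvolution layer applies in its forward pass, which is why the decoder can restore the spatial size reduced by the encoder.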

Therefore, the structure of the expression recognition model is shown in Fig. 6.

Fig. 6 The structure of the expression recognition model

Here, 'Conv' stands for convolution and 'Deconv' stands for deconvolution.

2.4 The training and testing of the expression recognition model

The testing set is initially unlabeled to ensure that its emotional labels are defined reasonably. Therefore, another data set with self-descriptive labels is used to train the model. Figure 7 shows part of the data set used to train the expression recognition model.

Fig. 7 The training set of the expression recognition model

The values of the model parameters are obtained by training the model on the data set shown in Fig. 7. Subsequent testing follows the principle of frame-by-frame video analysis. The testing video contains nearly 30 frames per second, so the following recognition process is defined to assign an accurate expression label to each 1 s of video.

CascadeClassifier in OpenCV is used for face detection during model testing. It is a cascaded classifier based on the Haar features of images, which are well suited to face detection [27]. Figure 8 shows partial results of face detection and of the expression recognition model.

Fig. 8 Partial results of face detection and the expression recognition model

Therefore, the recognition result of expression in each frame image can be obtained. Figure 8 shows that a face is detected by CascadeClassifier and it is recognized as happy by the expression recognition model.

The key step in this section is to transform the expression recognition results of the individual frames into one expression recognition result per second. The expression in 1 s is considered ‘Unhappy’ if the number of ‘Unhappy’ frames in that second is greater than 6 and greater than the number of ‘Happy’ frames. Otherwise, the remaining expressions are checked in turn. The expression in 1 s is considered ‘Happy’ if the number of ‘Happy’ frames is greater than 4 and greater than the number of ‘Unhappy’ frames. It is also considered ‘Happy’ if the number of ‘Quiet’ frames is less than 5 and the number of ‘Happy’ frames is greater than 0 and greater than the number of ‘Unhappy’ frames. When none of the above conditions is met, the expression of that second is defined as ‘Quiet’.
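The per-second voting rule described above can be written as a short Python sketch. The function names and the use of plain strings as frame labels are assumptions made for the example; the thresholds are the ones stated in the text.

```python
from collections import Counter


def label_second(n_happy, n_unhappy, n_quiet):
    """Map per-frame counts within 1 s to a single expression label."""
    if n_unhappy > 6 and n_unhappy > n_happy:
        return "Unhappy"
    if n_happy > 4 and n_happy > n_unhappy:
        return "Happy"
    if n_quiet < 5 and n_happy > 0 and n_happy > n_unhappy:
        return "Happy"
    return "Quiet"


def label_from_frames(frame_labels):
    """Count the frame-level labels of one second and apply the rule."""
    counts = Counter(frame_labels)
    return label_second(counts["Happy"], counts["Unhappy"], counts["Quiet"])


# One second of about 30 frames, most of them 'Quiet' (illustrative data).
frames = ["Quiet"] * 20 + ["Unhappy"] * 8 + ["Happy"] * 2
print(label_from_frames(frames))
```

Note that the rule is asymmetric: ‘Unhappy’ needs more supporting frames than ‘Happy’, and the third condition only matters when few ‘Quiet’ frames are present, for example when the face is detected in only part of the second.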

2.5 Emotion recognition for short videos

An emotion sequence represented by ‘1’, ‘0’ and ‘2’ is obtained after the expression prediction model is trained and tested. These three numbers represent the three emotions defined in Sect. 2.2. Once the emotional label of each second is obtained, the emotion prediction of each short video is completed by Algorithm 1.

Algorithm 1

Here, \(N\) represents the total number of per-second expressions in the video, and \(N_{0}\), \(N_{1}\), \(N_{2}\) indicate the numbers of the corresponding expressions. \(n_{0}\), \(n_{1}\), \(n_{2}\) represent the numbers of the different expressions in each 15 s video. To reduce the loss of useful information, the numbers of ‘Happy’ and ‘Unhappy’ expressions in each 5 s interval video are defined as \(n_{i1}\) and \(n_{i2}\). In addition, the distance between the first ‘1’ in each interval video and the last ‘1’ in the 15 s video of the previous section is defined as \(d_{11}\), and the distance between the last ‘1’ in each interval video and the first ‘1’ in the 15 s video of the previous section is defined as \(d_{12}\). The emotion recognition model therefore relies on Algorithm 1 to analyze the characteristics of different subjects.

2.6 The overall structure of the emotion recognition model

The overall flow chart of the model is shown in Fig. 9 to clearly show the process of emotion recognition.

Fig. 9 The overall structure of the emotion recognition model

3 Results and discussion

3.1 The result of preparation

The data set used for training needed to be converted to grayscale and standardized. The preparation of the data set used for testing was divided into two parts: first, the video needed to be converted to grayscale and standardized; then, noise reduction and smoothing needed to be applied to the corresponding GSR. Partial results of the latter were shown in Figs. 10 and 11.

Fig. 10 The value of GSR during experiment for a subject

Fig. 11 The value of GSR after preprocessing for a subject

3.2 The setting of parameter

The training parameters of the proposed expression recognition model and comparison models were shown in Table 1.

Table 1 Training parameters

These parameters were mainly applied to the proposed facial expression recognition model and its comparison models, ResNet18 and VGG16. The initial weights obtained by transfer learning on ImageNet were applied to the comparison models.

The parameters of CascadeClassifier were shown in Table 2.

Table 2 Parameters of CascadeClassifier

3.3 Model evaluation method

The model in this paper was evaluated by matching the labels defined by GSR with the results of the emotion recognition model. The former were used as the true values of video emotion and the latter as the predicted values. The accuracy used for model evaluation was defined in (13).

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FN}} + {\text{FP}} + {\text{TN}}}}.$$

where \({\text{TP}}\) referred to the number of target-class samples predicted as the target class, \({\text{TN}}\) to the number of non-target-class samples predicted as non-target, \({\text{FN}}\) to the number of target-class samples predicted as non-target, and \({\text{FP}}\) to the number of non-target-class samples predicted as the target class.
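As a small worked check of (13), the Python sketch below computes accuracy from the four counts and, for the three-class matching used in this paper, as the fraction of predictions that agree with the GSR-defined labels. The function names and the synthetic label lists are assumptions; only the totals of 155 correct out of 189 videos come from the paper.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy from (13): (TP + TN) / (TP + FN + FP + TN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total


def matching_accuracy(predicted, reference):
    """Fraction of predicted labels that match the reference labels."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return correct / len(reference)


# Synthetic labels reproducing the reported totals: 155 of 189 videos match.
reference = ["Quiet"] * 189
predicted = ["Quiet"] * 155 + ["Happy"] * 34
print(round(100 * matching_accuracy(predicted, reference), 2))
```

With more than two classes, the matching fraction is the natural reading of (13): every correctly matched video counts toward the numerator, regardless of its class.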

3.4 Experimental results

The results of the emotion recognition model were mainly divided into two parts: the training part and testing part. Figure 12 showed the training part of the model.

Fig. 12 The training process of the proposed model

The loss decay and accuracy increase of the model took a total of 120 epochs. The features extracted by the model were highly expressive, and the original image could be restored accurately from them, but the training process of the model was slow. The input images were uniformly converted to grayscale and standardized to reduce the influence of the facial background. The comparison between the original image and the image generated from its features was shown in Fig. 13.

Fig. 13 Original image and generated image from its features

After the abstract features were obtained by the codec model, the SVM classifier needed to be trained to complete the model training. Training the SVM classifier included the selection of its kernel function, the kernel parameter γ and the penalty coefficient c. The selection of the kernel function was shown in Fig. 14.

Fig. 14 The classification accuracy of SVM on testing set with different kernel and value of γ

As shown in Fig. 14, the highest accuracy was 97.71% when the kernel function was Gaussian. The peak accuracy was about 97.45% and 91.59% when the kernel function was polynomial and sigmoid, respectively. The best values of γ corresponding to the three kernel functions were 8.935, 8.935 and 2, respectively. On this basis, the process of adjusting the penalty coefficient c was shown in Fig. 15.

Fig. 15 The accuracy of classification changed with the value of c

The highest classification accuracy was 97.83%, obtained when the kernel function was Gaussian, the value of γ was 8.935 and the value of c was 15.

ResNet18 and VGG16 were trained and verified on the training set to further illustrate the superiority of the proposed model in expression recognition. Figure 16 showed the training and verification results of the comparison models on the training set.

Fig. 16 The training process of ResNet18 and VGG16

It can be found from Fig. 16 that the accuracy of ResNet18 was slightly better than that of VGG16, and its highest verification accuracy reached 97.64% at epoch 11. However, this result was still slightly lower than that of the proposed model.

After the three models were trained, their predicted results were matched with the labels defined by GSR. The matching accuracy of the proposed model was 82.01% after testing all subjects: the number of effective short emotional videos in the testing set was 189, and 155 of them were correctly identified. The testing accuracy of both comparison models was 69.84%. The change in the testing accuracy of the three models as the number of subjects increased was shown in Fig. 17.

Fig. 17 The diagram of testing accuracy

The three curves quickly reached their peaks, and their accuracy decreased in the second half. To analyze the limitations of the model, a confusion matrix was used to further show the results of the proposed emotion recognition model. The matrix was shown in Fig. 18.

Fig. 18 Confusion matrix of the proposed emotion recognition model

The following conclusions can be obtained by combining the detailed results of emotion recognition in Figs. 17 and 18.

  • Differences in facial style led to difficulty in recognizing the facial emotion of some subjects with unclear or very rich facial expressions. These subjects were distributed in the second half of the testing set. This conclusion was also reflected in the misjudgments in Fig. 18;

  • The model could not detect facial expressions when subjects lowered their heads, which also led to a decrease in accuracy;

  • The ‘Quiet’ emotion was recognized with high accuracy owing to its high frequency in the scene. However, the ability to recognize the ‘Unhappy’ emotion was weak owing to the small number of samples.

4 Conclusions

First, reliable emotional labels for evaluating the proposed emotion recognition model were obtained from GSR. The model then achieved an accuracy of 82.01% through a reasonable recognition process. The proposed model has practical value for predicting human emotion during natural activities in the vehicle environment, because the data set used has characteristics similar to those of video data in the vehicle environment.

Availability of data and materials

The datasets used and analysed during the current study are available from the corresponding author on reasonable request.



Abbreviations

IOV: Internet of vehicles

GSR: Galvanic skin response

SVM: Support Vector Machine

IAPS: International Affective Picture System

CK+: Cohn Kanade dataset plus








  1. X. Liu, X.P. Zhai, W.D. Lu, C. Wu, QoS-guarantee resource allocation for multibeam satellite industrial Internet of things with NOMA. IEEE Trans. Ind. Inform. 17(3), 2052–2061 (2021).

    Article  Google Scholar 

  2. S.S. Devi, A. Bhuvaneswari, Quantile regressive fish swarm optimized deep convolutional neural learning for reliable data transmission in IoV. Int. J. Comput. Netw. Commun. 13(2), 81–97 (2021).

    Article  Google Scholar 

  3. F. Valocky, M. Orgon, I. Fujdiak, Experimental autonomous car model with safety sensor in wireless network. IFAC PapersOnLine. 52(27), 92–97 (2019).

    Article  Google Scholar 

  4. K. Afzal, R. Tariq, F. Aadil, Z. Iqbal, M. Sajid, An optimized and efficient routing protocol application for IoV. Math. Probl. Eng. (2021).

    Article  Google Scholar 

  5. S. Turabzadeh, H.Y. Meng, R.M. Swash, M. Pleva, J. Juhar, Facial expression emotion detection for real-time embedded system. Technologies 6, 17 (2018).

    Article  Google Scholar 

  6. R.W. Picard, Affective Computing: Challenges (MIT Press, USA, 1997), pp. 2–10

    Google Scholar 

  7. K. Anderson, P.W. Mcowan, A real-time automated system for the recognition of human facial expressions. IEEE Trans. Cybern. 36, 96–105 (2006).

    Article  Google Scholar 

  8. J. Ang, R. Dhillon, A. Krupski, E. Shriberg, A. Stolcke, Prosody-based automatic detection of annoyance and frustration in human–computer dialog, in Seventh International Conference on Spoken Language Processing (2002). p. 2037–2040.

  9. C Feichtenhofer, A Pinz, A Zisserman, Convolutional Two-Stream Network Fusion for Video Action Recognition, in Computer Vision and Pattern Recognition (IEEE, 2016). p. 1933–1941.

  10. A. Mehrabian, Communication without words. Psychol. Today. 2, 53–55 (1968).

    Article  Google Scholar 

  11. R.W. Picard, E. Vyzas, J. Healey, Toward machine emotional intelligence: analysis of affective physiological state. IEEE Trans. Pattern Anal. Mach. Intell. 23, 1175–1191 (2016).

    Article  Google Scholar 

  12. X. Liu, X.Y. Zhang, NOMA-based resource allocation for cluster-based cognitive industrial Internet of things. IEEE Trans. Ind. Inform. 16, 5379–5388 (2020).

    Article  Google Scholar 

  13. N. Samadiani, G. Huang, W. Luo, C.H. Chi, Y.F. Shu, R. Wang, T. Kocaturk, A multiple feature fusion framework for video emotion recognition in the wild. Concurr. Comput. Pract. Exp.. (2020).

    Article  Google Scholar 

  14. A. Savran, K. Ciftci, G. Chanel, J. Mota, L. Viet, B. Sankur, L. Akarun, A. Caplier, M. Rombaut, Emotion detection in the loop from brain signals and facial images. International Summer Workshop on Multimodal Interfaces (2006).

  15. S. Koelstra, C. Muhl, M. Soleymani, J.S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, I. Patras, Deap: a database for emotion analysis; using physiological signals. IEEE Trans. Affect. Comput. 3, 18–31 (2012).

    Article  Google Scholar 

  16. M. Soleymani, J. Lichtenauer, T. Pun, M. Pantic, A multimodal database for affect recognition and implicit tagging. IEEE Trans. Affect. Comput. 3, 42–55 (2012).

    Article  Google Scholar 

  17. W.R. Hu, G. Huang, L.L. Li, L. Zhang, Z.G. Zhang, Z. Liang, Video-triggered EEG-emotion public databases and current methods: a survey. Brain Sci. Adv. 6, 255–287 (2019).

    Article  Google Scholar 

  18. X. Liu, X.Y. Zhang, Rate and energy efficiency improvements for 5G-based IoT with simultaneous transfer. IEEE Internet Things J. 6(4), 5971–5980 (2019).

  19. X. Liu, X.Y. Zhang, M. Jia, L. Fan, W. Lu, X. Zhai, 5G-based green broadband communication system design with simultaneous wireless information and power transfer. Phys. Commun. 28, 130–137 (2018).

  20. P. Lucey, J.F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression, in Proceedings of the Third International Workshop on CVPR for Human Communicative Behavior Analysis (CVPR4HB) (2010). p. 94–101.

  21. C. Tronstad, H. Kalvøy, S. Grimnes, Ø.G. Martinsen, Improved estimation of sweating based on electrical properties of skin. Ann. Biomed. Eng. 41, 1074–1083 (2013).

  22. M.M. Bradley, P.J. Lang, Measuring emotion: behavior, feeling, and physiology, in Cognitive Neuroscience of Emotion, ed. by R.D. Lane, L. Nadel (Oxford University Press, New York, 2000). p. 242–276.

  23. P.J. Lang, Emotion and motivation: attention, perception, and action. J. Sport Exerc. Psychol. 22, 180–199 (2000).

  24. K.H. Kim, S.W. Bang, S.R. Kim, Emotion recognition system using short-term monitoring of physiological signals. Med. Biol. Eng. Comput. 42, 419–427 (2004).

  25. H.H. Madden, Comments on Savitzky–Golay convolution method for least-squares fit smoothing and differentiation of digital data. Anal. Chem. 50, 1383–1386 (1978).

  26. V. Dumoulin, F. Visin, A Guide to Convolution Arithmetic for Deep Learning (2019), pp. 1–28

  27. R. Lienhart, A. Kuranov, V. Pisarevsky, Empirical analysis of detection cascades of boosted classifiers for rapid object detection, in Joint Pattern Recognition Symposium vol. 2781 (2003). p. 297–304.


I would like to thank Wang Yihao, Feng Jinyu, Cen Muosuo, Liu Renwei and Wang Min for their hard work in the data collection stage.


Supported by a project of the Shenzhen Science and Technology Innovation Committee (JCYJ20190809145407809) and a project of the Shenzhen Institute of Information Technology School-level Innovative Scientific Research Team (TD2020E001).

Author information


Erkang Fu and Xi Li designed the research. Zhi Yao and Yuxin Ren conducted the literature review and wrote this manuscript. Yuanhao Wu and Qiqi Fan performed the numerical calculations and derived the formulae in the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xi Li.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Fu, E., Li, X., Yao, Z. et al. Personnel emotion recognition model for Internet of vehicles security monitoring in community public space. EURASIP J. Adv. Signal Process. 2021, 81 (2021).
