2.1 Preparation of physiological data
At present, there are many video datasets for emotion recognition, including the early extended Cohn–Kanade dataset (CK+) [20] and the more recent DEAP dataset. These datasets have the advantages of rich content and strong representativeness. However, they contain only standardized frontal faces, which differ greatly from the facial videos captured at multiple angles and with large quality fluctuations in IoV applications. Therefore, facial videos of young people performing specified behaviors are collected to study facial emotion recognition in the vehicle environment. In addition, physiological data are recorded to support accurate emotion prediction and inference. Figure 1 shows the physiological data of a subject during video recording.
The first channel is the heart rate derived from photoplethysmography (PPG); the second is the galvanic skin response (GSR); the third is the respiration signal; the fourth is the electrocardiogram (ECG); and the fifth is the electroencephalogram (EEG).
2.2 The definition of video emotional label
Human GSR is controlled by the nervous system and has strong physiological characteristics [21]. A large number of studies have shown that emotional fluctuations cause significant changes in GSR [22, 23]. Therefore, GSR is selected to define the video emotional labels and to improve labeling efficiency. Related studies using the feature extraction method of the University of Augsburg, Germany, have found that the subjects' emotions are reflected in their GSR characteristics. This conclusion is also illustrated in Fig. 2 [24].
The images above are derived from the characteristic results of only a few subjects and are therefore not conclusive on their own. However, by observing and testing the dataset, the following applicable rules can be obtained when emotion is divided into three categories:
- Happy: within the video, there are denser multi-band peaks, mostly distributed at the beginning of the video;
- Quiet: within the video, there are essentially no peaks, or only a single peak at either end;
- Unhappy: within the video, there are peaks at both the beginning and the end, or dense peaks appear only in the middle of the video with almost no intervals.
After the above rules are summarized, the emotional label of each test video is defined from the GSR values and then verified. The specific experimental steps are as follows.
Data preprocessing Firstly, noise reduction and smoothing are applied to the GSR, the most representative channel of the physiological data. Abnormal values are replaced by nearby values to complete the noise reduction, and a Savitzky–Golay filter is used to smooth the data. The Savitzky–Golay filter is a digital filter that fits adjacent data points to a low-order polynomial by linear least squares [25]; an analytical least-squares solution exists when the data points are equally spaced. Figure 3 is a diagram of its smoothing process.
The blue point in each window of Fig. 3 is the center point of the window, and the mathematical principle of the filter is given in (1).
$$x_{{k,{\text{smooth}}}} = \overline{x} = \frac{1}{H}\mathop \sum \limits_{i = - w}^{ + w} x_{k + i} h_{i} .$$
(1)
The Savitzky–Golay filter uses least squares to fit a low-order polynomial to a small window of data and then uses this polynomial to estimate the value at the center of the window. In (1), \(h_{i}\) is the smoothing coefficient, and \(\frac{{h_{i} }}{H}\) is fitted according to the principle of least squares.
On the same curve, windows of different widths can be selected at any position to meet different filtering needs, which is useful for processing time-series data at different stages.
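As an illustration, this preprocessing step can be sketched in Python with SciPy's savgol_filter; the window length, polynomial order, and outlier threshold below are illustrative assumptions, not values reported in this paper.

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess_gsr(gsr, window_length=31, polyorder=3, z_thresh=4.0):
    """Replace abnormal GSR samples with nearby values, then smooth
    the series with a Savitzky-Golay filter (Eq. (1))."""
    gsr = np.asarray(gsr, dtype=float).copy()
    # Mark samples far from the median as abnormal and replace each
    # with the previous (already cleaned) value.
    z = np.abs(gsr - np.median(gsr)) / (np.std(gsr) + 1e-8)
    for i in np.where(z > z_thresh)[0]:
        gsr[i] = gsr[i - 1] if i > 0 else np.median(gsr)
    # Least-squares polynomial smoothing over a sliding window.
    return savgol_filter(gsr, window_length, polyorder)
```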
The definition of emotions in videos Emotional swings are short and continuous when they are not stimulated by a strong external environment. Therefore, the emotional label of each short video is defined by dividing the recording into time segments. The total length of each video is about 3 min, covering the process from the beginning of the experiment, through the completion of the specified action, to the end of the experiment; we therefore believe that this process can reflect a variety of specific emotions. To improve the accuracy of the definition, every 15 s of video is defined as one emotional video, with a 5 s interval between every two emotional videos. The emotional label of each short video is then defined from the relationship shown in Fig. 2 and the preprocessed GSR values.
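A simplified sketch of this segmentation and labelling procedure is given below; the sampling rate, the peak-prominence threshold, and the way the peak-position rules of Sect. 2.2 are turned into conditions are illustrative assumptions rather than the exact procedure used in the experiments.

```python
import numpy as np
from scipy.signal import find_peaks

def label_segments(gsr_smooth, fs, seg_s=15, gap_s=5, prominence=0.05):
    """Cut one ~3 min smoothed GSR trace into 15 s windows separated by
    5 s gaps and label each window from the density and position of its
    peaks (a coarse reading of the rules in Sect. 2.2)."""
    seg, gap, labels, start = int(seg_s * fs), int(gap_s * fs), [], 0
    while start + seg <= len(gsr_smooth):
        window = gsr_smooth[start:start + seg]
        peaks, _ = find_peaks(window, prominence=prominence)
        begin = np.sum(peaks < seg // 3)          # peaks in the first third
        end = np.sum(peaks > 2 * seg // 3)        # peaks in the last third
        middle = len(peaks) - begin - end
        if len(peaks) <= 1:                       # hardly any peaks
            labels.append('Quiet')
        elif begin >= end and begin >= middle:    # dense peaks near the start
            labels.append('Happy')
        else:                                     # peaks at both ends / middle
            labels.append('Unhappy')
        start += seg + gap
    return labels
```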
2.3 Structure and principle of the proposed model
The expression recognition model in this paper combines a convolutional codec (encoder–decoder) with an SVM classifier; the model and Algorithm 1 together complete the prediction of video emotion. The features extracted by the convolutional codec are highly abstract, which reduces the training noise caused by large differences in facial style. The core of the codec is image convolution and image deconvolution.
Image convolution is developed from signal convolution: the one-dimensional signal is expanded into two dimensions and the convolution kernel is rotated by 180°. Image convolution introduces three calculation concepts: the convolution kernel size \(F\), the stride \(S\), and the padding \(P\). Their calculation relationship is shown in (2) and (3) [26].
$$W^{\prime} = \left( {W - F + 2P} \right)/S + 1,$$
(2)
$$H^{\prime} = \left( {H - F + 2P} \right)/S + 1.$$
(3)
The above equations give the size of the output image when the input image has size \(\left[ {W \times H \times D} \right]\), where \(W^{\prime}\) and \(H^{\prime}\) are the width and height of the output image; the depth \(D^{\prime}\) of the output image is determined by the number of convolution kernels.
In a convolutional network, \(F\), \(S\), and \(P\) must be set reasonably so that the image size remains controllable as the number of network layers grows. Figure 4 shows a common image convolution process.
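For reference, the relationship in (2) and (3) can be written as a small helper function; the example values below are arbitrary.

```python
def conv_output_size(W, H, F, S=1, P=0):
    """Output width and height of a convolution layer, per Eqs. (2)-(3)."""
    W_out = (W - F + 2 * P) // S + 1
    H_out = (H - F + 2 * P) // S + 1
    return W_out, H_out

# e.g. a 48x48 input with a 3x3 kernel, stride 1 and no padding -> 46x46
print(conv_output_size(48, 48, F=3))
```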
The stride \(S\) is 1 in the convolution shown in Fig. 4, in which a total of four convolutions occur. The convolution process can be expressed numerically when the image is flattened as shown in Fig. 5 and the convolution kernel is expanded into a matrix.
The matrix of the convolution kernel is shown in (4).
$$\left[ {\begin{array}{*{20}l} {w_{0,0} } \hfill & {w_{0,1} } \hfill & {w_{0,2} } \hfill & 0 \hfill & {w_{1,0} } \hfill & {w_{1,1} } \hfill & {w_{1,2} } \hfill & 0 \hfill & {w_{2,0} } \hfill & {w_{2,1} } \hfill & {w_{2,2} } \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & {w_{0,0} } \hfill & {w_{0,1} } \hfill & {w_{0,2} } \hfill & 0 \hfill & {w_{1,0} } \hfill & {w_{1,1} } \hfill & {w_{1,2} } \hfill & 0 \hfill & {w_{2,0} } \hfill & {w_{2,1} } \hfill & {w_{2,2} } \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & {w_{0,0} } \hfill & {w_{0,1} } \hfill & {w_{0,2} } \hfill & 0 \hfill & {w_{1,0} } \hfill & {w_{1,1} } \hfill & {w_{1,2} } \hfill & 0 \hfill & {w_{2,0} } \hfill & {w_{2,1} } \hfill & {w_{2,2} } \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill & {w_{0,0} } \hfill & {w_{0,1} } \hfill & {w_{0,2} } \hfill & 0 \hfill & {w_{1,0} } \hfill & {w_{1,1} } \hfill & {w_{1,2} } \hfill & 0 \hfill & {w_{2,0} } \hfill & {w_{2,1} } \hfill & {w_{2,2} } \hfill \\ \end{array} } \right].$$
(4)
Therefore, the operation of convolution can be expressed by (5).
$$Y = C \times X.$$
(5)
where \(Y\) represents the result of the convolution, \(C\) represents the matrix of the convolution kernel, and \(X\) represents the image vector. The first convolution corresponds to multiplying the first row of (4) with the image vector in (5), and each subsequent convolution follows the same calculation. A vector of size \(\left[ {4 \times 1} \right]\) is obtained after the calculation of (5), and the output image after convolution can be restored by reversing the process of Fig. 5. Therefore, the convolution process can be described as the multiplication of a weight matrix with an image vector.
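The matrix view of (4) and (5) can be checked numerically with the small sketch below, which builds the 4 × 16 matrix \(C\) for a 4 × 4 image and a 3 × 3 kernel; the kernel values are arbitrary, and the kernel is applied exactly as laid out in (4).

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((4, 4))      # 4x4 input image
kernel = rng.standard_normal((3, 3))     # 3x3 convolution kernel

# Build the 4x16 matrix C of Eq. (4): each row places the flattened
# kernel at one of the four valid positions in the flattened image.
C = np.zeros((4, 16))
for row, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    patch = np.zeros((4, 4))
    patch[i:i + 3, j:j + 3] = kernel
    C[row] = patch.ravel()

x = image.ravel()                        # image flattened as in Fig. 5
y = C @ x                                # Eq. (5): the four outputs
print(y.reshape(2, 2))                   # restored 2x2 output image

# Cross-check against a direct sliding-window computation.
direct = np.array([[np.sum(image[i:i + 3, j:j + 3] * kernel)
                    for j in range(2)] for i in range(2)])
assert np.allclose(y.reshape(2, 2), direct)
```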
In a network, the convolution process is essentially a combination of forward propagation and backward derivative propagation. The principle of taking the derivative with respect to \(x\) in back propagation is given in (6).
$$\frac{{\partial {\text{Loss}}}}{{\partial x_{j} }} = \mathop \sum \limits_{i} \frac{{\partial {\text{Loss}}}}{{\partial y_{i} }} \times \frac{{\partial y_{i} }}{{\partial x_{j} }}.$$
(6)
where \(y_{i}\) can be expressed by (7).
$$y_{i} = \mathop \sum \limits_{j = 1}^{16} C_{ij} X_{j} .$$
(7)
Differentiating (7) then gives (8).
$$\frac{{\partial y_{i} }}{{\partial x_{j} }} = C_{ij} .$$
(8)
Substituting (8) into (6) then gives (9).
$$\frac{{\partial {\text{Loss}}}}{{\partial x_{j} }} = \mathop \sum \limits_{i = 1}^{4} \frac{{\partial {\text{Loss}}}}{{\partial y_{i} }} \times C_{ij} .$$
(9)
The summation in (9) can be rewritten in matrix form to obtain the matrix multiplication in (10).
$$\frac{{\partial {\text{Loss}}}}{{\partial x_{j} }} = \left( {\frac{{\partial {\text{Loss}}}}{\partial y}} \right)^{T} \times C_{*j} = C_{*j}^{T} \times \left( {\frac{{\partial {\text{Loss}}}}{\partial y}} \right).$$
(10)
where \(C\) (with entries \(C_{ij}\)) is the matrix of forward propagation, and \(C_{*j}\), its \(j\)-th column, is the quantity used in backward propagation. In the process of deconvolution, the mathematical roles of these two are exchanged. The calculation relationship corresponding to deconvolution is shown in (11) and (12), which means that the pre-convolution sizes \(W\) and \(H\) are obtained from the post-convolution sizes \(W^{\prime}\) and \(H^{\prime}\).
$$W = S\left( {W^{\prime} - 1} \right) - 2P + F,$$
(11)
$$H = S\left( {H^{\prime} - 1} \right) - 2P + F.$$
(12)
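The backward relation (10) and the size relations (11) and (12) can likewise be illustrated with a short sketch; here \(C\) is a placeholder with the same 4 × 16 shape as the matrix in (4), since only the shapes matter for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.standard_normal((4, 16))     # stands in for the forward matrix of Eq. (4)
grad_y = rng.standard_normal(4)      # dLoss/dy flowing back from the output
grad_x = C.T @ grad_y                # Eq. (10): dLoss/dx = C^T (dLoss/dy)
print(grad_x.reshape(4, 4))          # reshaped to the 4x4 input layout

def deconv_size(out_size, F, S=1, P=0):
    """Pre-convolution size recovered from the output size, per Eqs. (11)-(12)."""
    return S * (out_size - 1) - 2 * P + F

assert deconv_size(2, F=3) == 4      # the 2x2 output maps back to a 4x4 input
```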
Therefore, the structure of the expression recognition model is shown in Fig. 6, where 'Conv' denotes convolution and 'Deconv' denotes deconvolution.
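A minimal sketch of such a convolutional codec is given below in PyTorch; the number of layers, the channel widths, and the 48 × 48 grayscale input size are illustrative assumptions, not the exact configuration of Fig. 6.

```python
import torch.nn as nn

class ConvCodec(nn.Module):
    """Encoder-decoder whose bottleneck features feed an SVM classifier."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                    # 'Conv' blocks
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                    # 'Deconv' blocks
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                               output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                               output_padding=1),
        )

    def forward(self, x):                # x: (batch, 1, 48, 48) face images
        code = self.encoder(x)           # abstract features, robust to style
        recon = self.decoder(code)       # reconstruction used for training
        return code, recon

# The flattened `code` features can then be classified with an SVM,
# e.g. sklearn.svm.SVC, into the three emotion classes.
```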
2.4 The training and testing of the expression recognition model
The testing set is initially unlabeled to ensure the rationality of the emotional labels. Therefore, another dataset with self-descriptive labels is used to train the model. Figure 7 shows part of the dataset used to train the expression recognition model.
The model parameters are then obtained by training on the dataset shown in Fig. 7. Subsequent testing follows the principle of frame-by-frame video analysis: nearly 30 frames appear in each second of the test video, and the following recognition process is defined so that the expression label can be accurately determined for each second.
The CascadeClassifier in OpenCV is used for face detection during model testing. This is a cascade classifier based on the Haar features of images, a principle that supports reliable face detection [27]. Figure 8 shows partial results of the face detection and of the expression recognition model.
Thus, the expression recognition result of each frame can be obtained. Figure 8 shows a face detected by the CascadeClassifier and recognized as happy by the expression recognition model.
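The detection step can be sketched with OpenCV's bundled Haar cascade; the detectMultiScale parameters below are common defaults, not values reported in the paper.

```python
import cv2

# Haar-feature cascade shipped with OpenCV for frontal-face detection.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def detect_faces(frame):
    """Return the grayscale face regions found in one video frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=5)
    return [gray[y:y + h, x:x + w] for (x, y, w, h) in boxes]
```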
The key step in this section is converting the per-frame expression recognition results into one result per second of video. The expression in a given second is considered ‘Unhappy’ if the number of ‘Unhappy’ frames in that second is greater than 6 and greater than the number of ‘Happy’ frames. Otherwise, the expression is considered ‘Happy’ if the number of ‘Happy’ frames is greater than 4 and greater than the number of ‘Unhappy’ frames. The expression is also considered ‘Happy’ if the number of ‘Quiet’ frames is less than 5, the number of ‘Happy’ frames is greater than 0, and the number of ‘Happy’ frames is greater than the number of ‘Unhappy’ frames. When none of the above conditions is met, the expression for that second is defined as ‘Quiet’.
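These voting rules translate directly into code; the sketch below takes the per-frame counts within one second and returns the label for that second.

```python
def second_label(n_happy, n_unhappy, n_quiet):
    """Reduce per-frame predictions within one second to a single label,
    following the rules stated above."""
    if n_unhappy > 6 and n_unhappy > n_happy:
        return 'Unhappy'
    if n_happy > 4 and n_happy > n_unhappy:
        return 'Happy'
    if n_quiet < 5 and n_happy > 0 and n_happy > n_unhappy:
        return 'Happy'
    return 'Quiet'
```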
2.5 Emotion recognition for short videos
An emotion sequence represented by ‘1’, ‘0’, and ‘2’ is obtained after the expression prediction model is trained and tested; these three numbers represent the three emotions defined in Sect. 2.2. The emotion prediction of each short video is completed by Algorithm 1 after the emotional label for each second is obtained.
Here \(N\) represents the total number of per-second expressions in the video, and \(N_{0}\), \(N_{1}\), \(N_{2}\) indicate the numbers of the corresponding expressions. \(n_{0}\), \(n_{1}\), \(n_{2}\) represent the numbers of the different expressions in each 15 s of video. To reduce the loss of useful information, the numbers of ‘Happy’ and ‘Unhappy’ expressions in each 5 s interval video are defined as \(n_{i1}\) and \(n_{i2}\). In addition, the distance between the first ‘1’ appearing in each interval video and the last ‘1’ in the preceding 15 s video is defined as \(d_{11}\), and the distance between the last ‘1’ appearing in each interval video and the first ‘1’ in the preceding 15 s video is defined as \(d_{12}\). Therefore, the emotion recognition model requires the characteristics of the different subjects analyzed by Algorithm 1.
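Algorithm 1 itself is not reproduced here, but the bookkeeping it relies on can be sketched as follows; only the counts explicitly defined above are computed, and the 15 s block length is taken from Sect. 2.2.

```python
from collections import Counter

def sequence_counts(labels):
    """Given the per-second label sequence of one video ('0' = Quiet,
    '1' = Happy, '2' = Unhappy), return the total length N, the global
    counts N0, N1, N2, and the counts n0, n1, n2 of each 15 s block."""
    counts = Counter(labels)
    N = len(labels)
    N0, N1, N2 = counts['0'], counts['1'], counts['2']
    blocks = [labels[i:i + 15] for i in range(0, N, 15)]
    block_counts = [(b.count('0'), b.count('1'), b.count('2')) for b in blocks]
    return N, (N0, N1, N2), block_counts
```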
2.6 The overall structure of the emotion recognition model
The overall flow chart of the model is shown in Fig. 9 to clearly illustrate the process of emotion recognition.