EEG emotion recognition based on differential entropy feature matrix through 2D-CNN-LSTM network
EURASIP Journal on Advances in Signal Processing volume 2024, Article number: 49 (2024)
Abstract
Emotion recognition has attracted great interest across research fields, and electroencephalography (EEG) is considered a promising tool for extracting emotion-related information. However, traditional EEG-based emotion recognition methods ignore the spatial correlation between electrodes. To address this problem, this paper proposes an EEG-based emotion recognition method that combines a differential entropy feature matrix (DEFM) with a 2D-CNN-LSTM network. First, the one-dimensional EEG vector sequence is converted into a two-dimensional grid matrix sequence that corresponds to the distribution of the EEG electrode positions over the brain regions and better characterizes the spatial correlation between the EEG signals of adjacent electrodes. The EEG signal is then divided into equal time windows, and the differential entropy (DE) of each electrode in each window is calculated. Combining the DE values with the two-dimensional grid matrix yields a new data representation, the DEFM, which captures the spatiotemporal correlation of the EEG signal. Second, a 2D-CNN-LSTM network is used to identify the emotional categories contained in the EEG signals, and a fully connected layer performs the final classification. Experiments are conducted on the widely used DEAP dataset. The method achieves average classification accuracies of 91.92% and 92.31% for valence and arousal, respectively. By combining the temporal and spatial correlation of EEG signals, the method improves the accuracy and robustness of EEG emotion recognition and has broad application prospects in EEG-based emotion classification and recognition.
1 Introduction
Human emotion recognition plays an important role in human–computer interaction and has become an important research field in cognitive science, computer science, psychology, and other fields [1]. It is also considered a hot topic in neuroscience and artificial intelligence research because emotions are affective states that accompany cognition and awareness and have a crucial role in human social interaction.
A person's emotional state can be inferred from external and internal reactions, because different emotional states elicit different responses. Current emotion recognition methods rely on either non-physiological or physiological signals. Non-physiological signals include facial expressions [2], speech [3], and body movements [4]. Physiological signals include electrocardiogram (ECG) [5], electromyogram (EMG) [5], electrooculogram (EOG) [6], and electroencephalogram (EEG) [7] signals. Compared with non-physiological signals, physiological signals are less easily affected by external factors and subjective intentions [4], which increases the reliability and objectivity of the experiment. In recent years, progress in sensor technology has made it possible to monitor, record, and analyze multichannel neurophysiological signals. EEG is a non-invasive electrophysiological technique that only requires electrodes to be placed on the scalp, so it is relatively safe and widely applicable. EEG signals also track changes in the brain's electrical activity in real time, which has made EEG the focus of many researchers. As a result, the potential application scenarios of EEG-based emotion recognition have attracted growing attention [8,9,10,11].
In the field of emotion recognition, the analysis of EEG signals is widely used to understand an individual's emotional state. Emotions are complex, multidimensional experiences that often manifest as dynamic changes in temporal and spatial features. Time-domain features include frequency analysis, amplitude, and waveform shape, which reflect the activity state of the brain at different time points. Alazrai et al. [12] introduced an EEG-based emotion recognition method built on a time–frequency feature extraction technique. Specifically, the study uses a quadratic time–frequency distribution (QTFD) to build a high-resolution time–frequency representation of EEG signals that captures their spectral changes along the time axis; the reported average classification accuracy lies between 73.8% and 86.2%. Li et al. [13] proposed a multi-domain adaptive graph convolutional network (MD-AGCN) that uses differential entropy (DE) for feature extraction. It integrates knowledge from both the frequency domain and the time domain to exploit complementary information within EEG signals, and extensive experiments show consistently strong results across experimental settings. Time-domain analysis is a key part of EEG research, and spatial information is valuable as well. EEG signals are recorded on the scalp through an electrode array, which forms a spatial topological structure. Each electrode corresponds to a specific region of the brain, so spatial features carry important information about how emotions are distributed in the brain. For example, Li et al. [14] took the spatial information of EEG into account and classified emotions with a hierarchical neural network, obtaining good classification results. Song et al. [15] proposed a dynamic graph convolutional neural network (DGCNN) to mine the spatial relationships in multichannel EEG data. Tao et al. [16] proposed an attention-based convolutional recurrent neural network (ACRNN), which assigns different weights to different channels to make full use of channel information and improve the accuracy of emotion recognition. Time-domain information helps capture how emotion changes over time, while spatial information models how emotion is expressed differently across brain regions, so combining the two lets a model adapt better to the dynamics of emotional expression. Zhang et al. [17] proposed a deep learning framework called the spatiotemporal recursive neural network (STRN). It captures long-range contextual cues by traversing the spatial regions of each time slice in different directions; an RNN layer then learns time-dependent discriminative features from the generated sequences. Experimental results show that the method achieves high classification performance. Rudakov et al. [18] proposed a multi-task convolutional neural network (MT-CNN) that takes brain maps generated from EEG as input and outputs the emotional states related to arousal and valence, also with high classification performance.
Most existing emotion recognition methods are based on machine learning; commonly used classifiers include the support vector machine (SVM) [19] and k-nearest neighbors (KNN) [20]. As deep learning algorithms have spread into various fields, deep learning has become a popular approach to emotion recognition because of its strong performance. In recent years, several algorithms have been applied to emotion recognition, such as deep belief networks (DBN) [21], convolutional neural networks (CNN) [22], graph convolutional neural networks (GCNNs) [23], and capsule networks (CapsNet) [24]. Hwang et al. [25] used information from both past and future biological signals to assign weights for emotion recognition under the current LSTM cell state more effectively than a traditional LSTM network, and integrated ant colony optimization (ACO) to find the best combination of features, thereby enhancing performance. Alhagry et al. [26] applied an LSTM to extract features from EEG signals and performed classification through fully connected layers. Their method achieved average accuracies of 85.65% and 85.45% for arousal and valence classification, respectively, on the DEAP dataset. Tripathi et al. [27] combined techniques such as dropout and linear units with a CNN and classified preprocessed EEG data. Their experiments on the DEAP dataset yielded classification accuracies of 81.41% for valence and 73.35% for arousal. Additionally, Song et al. [28] exploited the strengths of CNNs in image processing to classify multichannel EEG signals for emotion recognition. Training and testing on publicly available datasets, they achieved accuracies of 86.23% for valence, 85.54% for arousal, and 85.02% for dominance classification.
The literature shows that deep learning outperforms traditional machine learning methods for emotion recognition. Although deep learning offers numerous advantages, two challenges remain. First, the common approach to EEG classification is to extract features in the time domain, time–frequency domain, and spatial domain, and then classify them with machine learning or deep learning. Applying a CNN to time-domain data often reveals features related to the frequency domain [29]. However, this approach does not take into account the information carried by different frequency bands or the spatial relationships between electrode channels. Second, applying a CNN along the temporal dimension allows spatial and temporal features to be extracted simultaneously. However, long time-series data contain a wealth of information and may pose challenges for traditional CNN structures, which are prone to vanishing or exploding gradients. Little research effectively addresses both aspects.
To better integrate spatiotemporal features, we propose an EEG signal representation based on the differential entropy feature matrix (DEFM) and classify it with a hybrid deep learning model that combines a 2D-CNN and an LSTM. The 2D-CNN extracts spatial features that capture the relationships between electrodes, while the LSTM helps prevent vanishing or exploding gradients when modeling the time series. By combining these two structures, the model can better understand the overall context and more accurately identify patterns in spatiotemporal sequences, while reducing the number of parameters and the computational burden of the overall model.
The main contributions of this paper are as follows:

We propose a new feature extraction method, the differential entropy feature matrix (DEFM), based on differential entropy and a spatial feature matrix. According to the relative positions of the 32 electrodes over the brain, we construct a 9 × 9 feature matrix, which helps analyze the influence of electrode position on emotion. We also divide the 60-s EEG of each trial into 120 equal time windows of 0.5 s and calculate the DE of the 32 electrodes in each window. This yields one 2D image per time window and captures both the spatial and the spectral information of the EEG signal.

We propose a 2D-CNN-LSTM network model for emotion classification. The 2D-CNN automatically extracts features from these 2D images through successive convolutions and passes them through a connection layer to the LSTM, which exploits its strength in learning time series. Emotion classification is finally carried out by a fully connected layer.

To verify the effect of the proposed method on emotion classification, we conducted extensive experiments on the DEAP dataset. The experimental results show average accuracies of 91.92% for valence and 92.31% for arousal, indicating that the proposed method classifies emotions effectively.
The rest of this paper is organized as follows. In Sect. 2, we introduce the datasets and proposed method in detail. In Sect. 3, we report experiments and results.
2 Materials and methods
2.1 The overall framework of the proposed methodology
The general framework of the proposed method is shown in Fig. 1 and is divided into three steps in total:
Step 1 Preprocessing of EEG signals. Identification and processing of outliers and noise in the data used.
Step 2 Feature extraction. According to the relative positions of the electrodes over the brain, the one-dimensional EEG vector sequence is converted into a two-dimensional grid matrix sequence to better represent the spatial correlation between the electrodes. The whole EEG signal is then divided into equal time windows using a sliding window, the DE of each electrode in each window is calculated, and the DEFM is obtained by combining the DE values with the grid matrix for every time window.
Step 3 Classification with 2D-CNN-LSTM. The 2D-CNN-LSTM combines the automatic feature extraction of the CNN with the LSTM's ability to handle time series to achieve better classification results.
2.2 Dataset and preprocessing
This paper verifies the effectiveness of the proposed method on the DEAP dataset [30]. DEAP is a large-scale open-source dataset of physiological signals, including electroencephalography, developed by a research team at Queen Mary University of London. The details of DEAP are shown in Table 1. The dataset contains 32 EEG channels and eight channels of other physiological signals recorded while subjects watched music videos with different emotional tendencies. In particular, 32 subjects each watched 40 stimulus videos; the EEG signals were recorded at a sampling frequency of 512 Hz and then downsampled to 128 Hz. After each video, subjects rated arousal, valence, liking, dominance, and familiarity on a continuous scale from 1 to 9. Each of the 40 trials consists of 3 s of resting (baseline) time followed by 60 s of video. Only the 32 channels of EEG data are used in this paper, and arousal and valence are selected as the target dimensions. We use a threshold of 5 on the ratings of these two dimensions: a rating greater than or equal to 5 is labeled high arousal (HA) or high valence (HV), and a rating below 5 is labeled low arousal (LA) or low valence (LV).
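The thresholding rule above can be sketched as follows. This is a minimal illustration only; the function name `binarize_rating` and the example ratings are hypothetical and do not come from the DEAP files.

```python
def binarize_rating(rating, threshold=5.0):
    """Return 1 (high) if the rating is >= threshold, otherwise 0 (low)."""
    return 1 if rating >= threshold else 0

# Hypothetical self-assessment ratings on the 1-9 scale: (valence, arousal)
ratings = [(7.2, 3.1), (5.0, 4.99), (2.5, 8.0)]

# Each trial gets one binary valence label and one binary arousal label.
labels = [(binarize_rating(v), binarize_rating(a)) for v, a in ratings]
print(labels)  # [(1, 0), (1, 0), (0, 1)]
```

Note that a rating of exactly 5 falls into the high class, as stated in the text.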
First, the EEG data are downsampled to 128 Hz. To further filter out noise and eliminate artifacts, the EEG data are then band-pass filtered to the 4–45 Hz frequency range.
2.3 Feature extraction
Differential entropy (DE) [31] is a concept from information theory used to measure the uncertainty of a continuous random variable. In EEG research, DE is used to characterize the complexity and randomness of EEG signals. In this work, the one-dimensional EEG vector sequence is transformed into a two-dimensional grid matrix sequence, the whole EEG signal is divided into multiple time windows by a sliding window, and the DE of each electrode in each time window is calculated.
2.3.1 Differential entropy
DE measures the uncertainty of a continuous random variable and describes the randomness encoded in its probability density function. Unlike discrete entropy, differential entropy is not restricted to nonnegative values: it can be negative and, for some distributions, infinite. Differential entropy is widely applied in information theory, statistics, and machine learning, for example in density estimation, source coding, and channel coding. For a continuous random variable \(x\) with probability density function \(p\left(x\right)\), the differential entropy is calculated as shown in Eq. (1):
\[h\left(X\right)=-{\int }_{a}^{b}p\left(x\right)\mathrm{log}\,p\left(x\right)\,dx \tag{1}\]
where \(p\left(x\right)\) is the probability density function of the continuous signal and \(\left[a,b\right]\) is the interval over which the signal takes values. For a signal segment of a specific length whose values approximately follow a Gaussian distribution \(N\left(\mu ,{\sigma }^{2}\right)\), the differential entropy of the EEG is given by Eq. (2):
\[h\left(X\right)=\frac{1}{2}\mathrm{log}\left(2\pi e{\sigma }^{2}\right) \tag{2}\]
Each DEAP trial includes 3 s of baseline data recorded before the stimulus, which carry no stimulus-related information, so these 3 s are removed to avoid affecting the EEG signal. We denoised the 60-s EEG signals of each of the 32 subjects and then divided each signal into 120 equal time windows of 0.5 s. Our experiments focus on this 0.5-s window. The DE of each 0.5-s window is calculated according to Eq. (2), and the DE values of all windows are collected as features to form a feature vector.
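The windowing and DE computation described here can be sketched as follows, assuming that Eq. (2) is the Gaussian closed form 0.5 · ln(2πeσ²) and that the input has already been band-pass filtered. The function name `windowed_de`, the constants, and the synthetic input are illustrative assumptions, not the authors' code.

```python
import numpy as np

FS = 128        # sampling rate after downsampling (Hz)
WIN_SEC = 0.5   # window length (s)

def windowed_de(trial):
    """Differential entropy per channel and window under a Gaussian assumption.

    trial: array of shape (n_channels, n_samples), baseline already removed.
    Returns an array of shape (n_channels, n_windows).
    """
    win = int(FS * WIN_SEC)                    # 64 samples per window
    n_windows = trial.shape[1] // win
    trimmed = trial[:, :n_windows * win]       # drop any incomplete last window
    windows = trimmed.reshape(trial.shape[0], n_windows, win)
    variance = windows.var(axis=2)             # sigma^2 of each window
    return 0.5 * np.log(2.0 * np.pi * np.e * variance)

# Synthetic stand-in for one 60-s, 32-channel trial (not real EEG).
rng = np.random.default_rng(0)
trial = rng.standard_normal((32, 60 * FS))
de = windowed_de(trial)
print(de.shape)  # (32, 120)
```

In the paper the DE is computed separately for each frequency band; the same function would then be applied once per band-filtered copy of the signal.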
2.3.2 Two-dimensional EEG mesh feature conversion method based on DE
To better integrate the time-domain and spatial information of EEG signals, we arrange the DE features extracted from the 32 channels into a feature matrix, as shown in Fig. 2. Specifically, based on the relative distribution of the 32 electrodes over the brain, we obtain a 9 × 9 feature matrix of the electrode layout on a two-dimensional plane. Positions without an electrode are set to 0, and these zeros play no role in our experiment. We then normalize the DE values to obtain the two-dimensional color image on the right of the figure. The color bar beside the image shows the relationship between DE value and color: different DE values map to different colors.
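A minimal sketch of placing per-electrode DE values on a 9 × 9 grid is given below. The coordinates in `GRID_POS` are hypothetical placeholders for a handful of electrodes; the actual layout used in the paper is the one shown in its Fig. 2 and is not reproduced here.

```python
import numpy as np

# Hypothetical (row, col) grid positions for a few electrodes only.
# These coordinates are assumptions for illustration, not the paper's layout.
GRID_POS = {
    "Fp1": (0, 3), "Fp2": (0, 5),
    "Fz": (2, 4), "Cz": (4, 4), "Pz": (6, 4),
    "O1": (8, 3), "O2": (8, 5),
}

def to_grid(de_values, positions=GRID_POS, size=9):
    """Scatter per-electrode DE values into a size x size matrix.

    Cells without an electrode stay at 0, as described in the text.
    """
    grid = np.zeros((size, size))
    for name, value in de_values.items():
        row, col = positions[name]
        grid[row, col] = value
    return grid

# Made-up normalized DE values for one time window and one frequency band.
de_values = {"Fp1": 0.8, "Fp2": 0.6, "Cz": 0.3}
grid = to_grid(de_values)
print(grid.shape)  # (9, 9)
```

Stacking one such grid per frequency band gives a 9 × 9 × 4 array for each time window.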
To better describe the proposed method, we take one subject as an example; the whole process is shown in Fig. 3. Taking the Fp1 channel as an example, the 60-s EEG signal is divided into 120 time windows of 0.5 s each. Four frequency bands, θ (4 \(\le\) θ < 8 Hz), α (8 \(\le\) α < 15 Hz), β (15 \(\le\) β < 32 Hz), and γ (32 \(\le\) γ < 45 Hz), are extracted from each of the 120 time windows, which gives 120 × 4 samples for Fp1. The 9 × 9 color maps of the four frequency bands are then obtained with the method shown in Fig. 2. A sample feature matrix of dimension 32 × 32 × 120 × 4 is therefore generated in this experiment, where the first 32 is the number of subjects, the second 32 is the number of electrodes, 120 is the number of time windows, and 4 is the number of frequency bands.
Because the DE calculation can produce outliers that might affect the performance of the model, the feature matrices of each participant are normalized to the range between 0 and 1 using Eq. (3) [32]:
\[{F}^{\prime}=\frac{F-{F}_{\mathrm{min}}}{{F}_{\mathrm{max}}-{F}_{\mathrm{min}}} \tag{3}\]
where \(F\) is the original feature value, \({F}_{\mathrm{max}}\) and \({F}_{\mathrm{min}}\) are the maximum and minimum feature values, respectively, and \({F}^{\prime}\) is the normalized feature value. After normalization, we gather the features of the 32 channels in the same frequency band for each sample and construct a submatrix following the mapping rule illustrated in Fig. 1. The submatrix contains the average DE value of each corresponding channel, while elements that do not correspond to any electrode are set to 0 by default.
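As a small sketch of this normalization step, assuming Eq. (3) is the usual min-max scaling F' = (F − F_min) / (F_max − F_min) applied over one participant's feature array; the helper name and the toy values are illustrative.

```python
import numpy as np

def min_max_normalize(features):
    """Scale an array to [0, 1] using its own minimum and maximum."""
    f_min = features.min()
    f_max = features.max()
    if f_max == f_min:
        # Constant input: avoid division by zero and return all zeros.
        return np.zeros_like(features, dtype=float)
    return (features - f_min) / (f_max - f_min)

# Toy DE values standing in for one participant's features.
features = np.array([[1.0, 2.0], [3.0, 5.0]])
normalized = min_max_normalize(features)
print(normalized)
```

Because the minimum and maximum are taken per participant, a single extreme DE value compresses the remaining values toward 0, which is worth checking when outliers are present.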
The feature extraction method adopted in this paper combines time–frequency and spatial features to provide richer information about EEG changes for classifying emotional states. Time-domain information describes how emotions change dynamically, while spatial information describes how emotions are distributed in the brain. The method therefore gives a more comprehensive picture of changes in the EEG signal and can better distinguish emotional states.
2.4 Fusion model of 2D-CNN and LSTM for emotion recognition
The overall architecture of the CNN and LSTM model for emotion recognition is shown in Fig. 4. It contains CNN layers, LSTM layers, and a dense layer. We employ a 2D-CNN to capture spatial features from each two-dimensional matrix of EEG data. The extracted spatial feature sequences are then fed into an LSTM to capture the temporal features of the EEG data. Next, a fully connected layer receives the output of the LSTM network at the last time step to form a feature vector. Finally, this feature vector is passed through a dense layer for the final emotion classification.
2.4.1 Two-dimensional CNN
A 2D-CNN is a two-dimensional convolutional neural network, a deep learning structure widely used in computer vision tasks such as image recognition, object detection, and semantic segmentation.
In a two-dimensional convolutional neural network, the input data are usually two-dimensional images or video frames, and each input is represented as a matrix or tensor. The network processes the input through convolutional layers, pooling layers, and fully connected layers to learn and extract image features.
The convolutional layer is the core component of the two-dimensional convolutional neural network. It applies a set of learnable convolution kernels to the input data to extract different features. The output size of a convolution is given by Eq. (4). The pooling layer downsamples the feature maps output by the convolutional layer to reduce their dimension and the amount of computation while preserving important features. The fully connected layer flattens the feature maps output by the pooling layer and maps them to the labels to produce the final prediction.
\[N=\frac{W-F+2P}{S}+1 \tag{4}\]
where N is the output size, W is the input size, F is the convolution kernel size, P is the padding size, and S is the stride.
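As a worked illustration, assuming Eq. (4) is the standard output-size relation N = (W − F + 2P)/S + 1, the helper below evaluates it for a 9 × 9 input. The kernel sizes, paddings, and strides are example values only; the paper's actual settings are listed in its Table 2.

```python
def conv_output_size(w, f, p, s):
    """Output size N = (W - F + 2P) / S + 1, with floor division for
    cases where the kernel does not tile the input exactly."""
    return (w - f + 2 * p) // s + 1

# 3 x 3 kernel, padding 1, stride 1: the 9 x 9 grid keeps its size.
print(conv_output_size(9, 3, 1, 1))  # 9
# 3 x 3 kernel without padding: the grid shrinks to 7 x 7.
print(conv_output_size(9, 3, 0, 1))  # 7
# 2 x 2 window with stride 2 (a typical pooling setting): 9 -> 4.
print(conv_output_size(9, 2, 0, 2))  # 4
```

The same relation applies to pooling layers, which is why a 9 × 9 input can pass through only a few stride-2 pooling stages before its spatial size reaches 1.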
The 2D-CNN structure offers hierarchical, automatic feature extraction and multilevel feature learning, so it is widely used in computer vision tasks and has achieved excellent results in many application fields.
It should be noted that the calculation of these two stages is usually limited by the accuracy requirements, so we need to make certain adjustments and optimizations to improve the accuracy and efficiency of the algorithm.
2.4.2 LSTM
Inspired by the human brain, the LSTM uses selective input and selective forgetting mechanisms, introducing three gate structures (forget gate, input gate, and output gate) that act as filters to control the flow of information. Through this mechanism, an LSTM can selectively retain and update past information while also remembering current information, which lets it better capture long-term dependencies in sequence data. The LSTM structure is shown in Fig. 5.
The LSTM takes three inputs: the input at the current time step \({X}_{t}\), the LSTM output at the previous time step \({h}_{t-1}\), and the previous cell state \({C}_{t-1}\). It produces two outputs: the LSTM output at the current time step \({h}_{t}\) and the cell state \({C}_{t}\). The input gate, calculated as shown in Eq. (5), controls how much input information enters the cell; the forget gate controls how much historical information is retained; and the output gate, calculated as shown in Eq. (6), controls how much of the current cell state is passed to the current hidden state. The gates adapt during training according to the needs of the network. In the standard LSTM formulation, with \(W\) and \(b\) denoting the weight matrix and bias of the corresponding gate, the two gates are:
Input gate:
\[{i}_{t}=\sigma \left({W}_{i}\cdot \left[{h}_{t-1},{X}_{t}\right]+{b}_{i}\right) \tag{5}\]
Output gate:
\[{o}_{t}=\sigma \left({W}_{o}\cdot \left[{h}_{t-1},{X}_{t}\right]+{b}_{o}\right) \tag{6}\]
Here, σ(·) is the sigmoid function, which outputs a value between 0 and 1. If the forget gate value f is close to 0, the corresponding information is forgotten; if it is close to 1, the information is retained. When the LSTM forgets part of the previous state, it absorbs new memory from the current input to fill the gap, and this is realized by the input gate. The input gate filters the input information and admits part of the current information into the current cell state. Together with the forget gate, it selectively updates the current cell state from the current information and the information of the previous time step.
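The gate mechanism can be illustrated with a single LSTM time step written in NumPy. This is a sketch of the standard LSTM formulation, not the authors' implementation; the stacked weight layout, the gate ordering, and the sizes are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell.

    W has shape (4 * hidden, hidden + input) and b has shape (4 * hidden,),
    with rows stacked in the (assumed) order: forget, input, output, candidate.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:hidden])                 # forget gate: keep or drop old state
    i = sigmoid(z[hidden:2 * hidden])        # input gate: admit new information
    o = sigmoid(z[2 * hidden:3 * hidden])    # output gate: expose the cell state
    g = np.tanh(z[3 * hidden:4 * hidden])    # candidate cell content
    c_t = f * c_prev + i * g                 # updated cell state
    h_t = o * np.tanh(c_t)                   # updated hidden state / output
    return h_t, c_t

# Run five random input vectors through the cell (toy sizes).
rng = np.random.default_rng(0)
n_in, n_hidden = 8, 4
W = rng.standard_normal((4 * n_hidden, n_hidden + n_in)) * 0.1
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.standard_normal((5, n_in)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The additive update of the cell state (`f * c_prev + i * g`) is the part that lets gradients flow across many time steps.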
The LSTM network replaces the neurons of an ordinary recurrent neural network with the gating structure described above and thereby preserves historical information to support the current decision. It largely mitigates the vanishing and exploding gradient problems that arise when training recurrent neural networks. Because each unit carries a gate structure, the error signal can pass directly through the gates during backpropagation, so the gradient of the LSTM network remains relatively stable and does not vanish completely.
2.4.3 Two-dimensional CNN-LSTM
The structure of the proposed 2D-CNN-LSTM model is shown in Fig. 4. To better extract EEG signal features, we feed the DEFM features of the four bands θ, α, β, and γ extracted in Sect. 2.3 into the 2D-CNN-LSTM network in the form N × 9 × 9 × 4, where N is the number of samples sent to the model at a time (N = 10 in our experiment), 9 × 9 is the size of the two-dimensional matrix, and 4 is the number of frequency bands under study. These inputs contain both temporal and spatial information of the EEG data. After feature extraction by the 2D-CNN, the LSTM further extracts time-series features. The output of the LSTM at the last time step is then received by a fully connected layer to generate a feature vector, which is fed to the SoftMax layer for the final emotion classification. By combining the temporal and spatial characteristics of EEG signals, this method improves emotion recognition.
Specifically, the 2D-CNN in the hybrid model consists of three convolution layers and three pooling layers; the convolution layers have 32, 64, and 128 convolution kernels, respectively, and use the ReLU activation function. The model is trained with the Adam optimizer at a learning rate of 0.0005. The LSTM part contains two hidden layers with 64 and 128 neurons, respectively, followed by dropout with a rate of 0.1 to prevent overfitting and a fully connected layer with 258 neurons. The LSTM network further computes the time-domain characteristics of the EEG segments, making the features extracted by the model more objective and accurate.
2.5 Evaluation indices
To demonstrate the performance of the proposed method, there are several metrics commonly used to evaluate the quality of algorithms. Below are a few common and important evaluation metrics:
Accuracy is the most commonly used and intuitive metric, calculated as shown in Eq. (7) [33]. Here, TP is the number of samples correctly identified as low arousal/negative valence by the classification model (referred to as positives); TN is the number of samples correctly identified as high arousal/positive valence (referred to as negatives); FP is the number of negative samples incorrectly classified as positive; and FN is the number of positive samples incorrectly classified as negative.
\[\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN} \tag{7}\]
The precision is calculated as shown in Eq. (8) [33]:
\[\mathrm{Precision}=\frac{TP}{TP+FP} \tag{8}\]
The recall is calculated as shown in Eq. (9) [33]:
\[\mathrm{Recall}=\frac{TP}{TP+FN} \tag{9}\]
The F-score, also known as the F1-score, is the harmonic mean of precision and recall and is used to evaluate a classification model comprehensively, especially on unbalanced datasets. The F-score is calculated as shown in Eq. (10) [34]:
\[F=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} \tag{10}\]
Precision is the fraction of predicted positives that are truly positive, while recall is the fraction of actual positives that the model retrieves. As the formula shows, the F1-score ranges between 0 and 1; the closer it is to 1, the better the classification performance. The F1-score is high only when precision and recall are both high, so it balances the two. This is particularly important when the sample distribution is uneven.
The Kappa coefficient is a statistical measure of classifier effectiveness that accounts for the agreement expected from random classification. It can be used to assess the consistency of classification results. Kappa is calculated as shown in Eq. (11) [35]:
\[\kappa =\frac{{P}_{o}-{P}_{e}}{1-{P}_{e}} \tag{11}\]
where \({P}_{o}\) is the observed accuracy, obtained by summing the diagonal elements of the confusion matrix and dividing by the total number of samples, and \({P}_{e}\) is the accuracy expected from completely random classification, calculated by summing, over all categories, the product of the true label frequency and the predicted label frequency. The closer the Kappa coefficient is to 1, the better the model's classification performance.
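The metrics in this section can be computed from the four confusion-matrix counts as sketched below, assuming the standard definitions of accuracy, precision, recall, F1-score, and Cohen's kappa. The function name and the example counts are illustrative, not values from the paper.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1 and Cohen's kappa for a binary task."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Observed agreement P_o equals the accuracy.
    p_o = accuracy
    # Chance agreement P_e: per class, (true count * predicted count) / total^2.
    p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}

# Made-up counts for 100 samples.
m = binary_metrics(tp=45, tn=40, fp=10, fn=5)
print(m)
```

For these counts the accuracy is 0.85 while kappa is 0.70, which shows how kappa discounts the agreement that a random classifier with the same label frequencies would reach.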
3 Results and discussion
3.1 Experiments results
In this study, each subject has 120 × 40 samples, where 120 is the number of time windows and 40 is the number of stimulus videos, for a total of 4800 samples per subject. The network model is evaluated with tenfold cross-validation.
Subject 9 was selected for tuning the network parameters because this subject's labels are more uniformly distributed. We used the 2D-CNN-LSTM network to investigate how the number of 2D-CNN convolutional layers and the number of LSTM hidden layers affect emotion classification. The experiments show that the number of convolutional layers influences the network model more than the number of hidden layers. The accuracy of the network model is highest with two hidden layers and three convolutional layers. The network model parameters of the 2D-CNN-LSTM are shown in Table 2.
To better assess the model's performance, tenfold cross-validation [36] is chosen for evaluation. In tenfold cross-validation, the dataset is divided into ten subsets, and the model is trained and evaluated ten times. In each iteration, nine subsets are used for training and the remaining subset is used for validation, so that each subset serves as the validation set exactly once. The final performance estimate is the average of the ten validation results. This approach gives a better assessment of the model's generalization ability, reduces the risk of overfitting or underfitting, and helps identify suitable hyperparameter settings. With the above parameter settings, each of the 32 subjects was tested separately, and the results are shown in Table 3.
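The tenfold split for one subject's 4800 samples can be sketched as follows. Shuffling the sample indices before splitting and the fixed seed are assumptions made for this example; the text does not state how the folds were formed.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # assumed: shuffle before splitting
    fold_size = n_samples // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    # Any leftover samples (n_samples not divisible by k) join the last fold.
    folds[-1].extend(indices[k * fold_size:])
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

splits = list(k_fold_indices(4800, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 4320 480
```

Each fold therefore validates on 480 samples and trains on the other 4320, and the reported score is the mean over the ten folds.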
As shown in Table 3, the proposed method achieves high classification performance. Averaged over the 32 subjects, the accuracy was 92.31% for arousal and 91.92% for valence, the F-score was 90.75% for arousal and 92.31% for valence, and the Kappa was 91.76% for arousal and 92.36% for valence.
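For reference, the three indices can be computed from the counts of a binary confusion matrix. The sketch below uses the standard definitions of accuracy, the F1 form of the F-score, and Cohen's kappa. Whether the reported F-score is exactly the F1 variant is an assumption, and the function name `binary_metrics` and the example counts are hypothetical rather than taken from Table 3.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, F1 score, Cohen's kappa) for a binary classifier.

    Assumes non-degenerate counts, i.e. at least one predicted positive
    and at least one actual positive.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)

    # Cohen's kappa compares observed agreement with agreement expected by chance
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)  # chance agreement on "positive"
    p_neg = ((fn + tn) / total) * ((fp + tn) / total)  # chance agreement on "negative"
    p_expected = p_pos + p_neg
    kappa = (accuracy - p_expected) / (1 - p_expected)
    return accuracy, f_score, kappa


# hypothetical counts, for illustration only
accuracy, f_score, kappa = binary_metrics(tp=45, fp=5, fn=5, tn=45)
```

Because kappa discounts chance agreement, it is lower than accuracy for the same predictions unless the classifier is perfect.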
Accuracy is the most intuitive index for measuring model performance, and it is also the most important one. As shown in Fig. 6, as the number of iterations increases, the training accuracy of the proposed 2DCNNLSTM model rises to 97.2% for arousal and 96.8% for valence. Meanwhile, the test accuracy also keeps improving and finally reaches 92.31% for arousal and 91.92% for valence. The results show that the proposed method has good flexibility and is effective in the field of emotion recognition.
The confusion matrix is a standard tool for evaluating classification results: it tabulates, for each true class, how many samples were assigned to each predicted class, and it applies to classifiers of many different types. Figure 7 shows the confusion matrices for arousal and valence. According to these matrices, about 7.6% of the valence samples and about 7.2% of the arousal samples were misclassified, so the overall classification performance was good.
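As a small illustration of how such a matrix and the misclassified share are obtained, the following sketch builds a confusion matrix from true and predicted labels. The function names and the ten example labels are hypothetical and do not reproduce the data behind Fig. 7.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """matrix[t][p] counts the samples of true class t that were predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def misclassification_rate(matrix):
    """Share of samples that lie off the main diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


# hypothetical labels: 0 = low, 1 = high
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
error = misclassification_rate(cm)
```

The diagonal entries of `cm` are the correctly classified samples, and `error` is the fraction of all samples that fall elsewhere.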
The ROC curve, also known as the receiver operating characteristic curve, is a popular visual metric used to evaluate the performance of binary classification models. It typically plots the true-positive rate (TPR) on the y-axis and the false-positive rate (FPR) on the x-axis. The TPR represents the proportion of actual positive cases that the model correctly identifies, while the FPR represents the proportion of actual negative cases that are incorrectly classified as positive. The shape of the ROC curve reflects the overall performance of the model, with curves closer to the upper-left corner indicating better performance. As shown in Fig. 8, the classification accuracy of the model proposed in this paper is relatively high.
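The construction of an ROC curve from classifier scores can be sketched as follows: the decision threshold is swept from the highest score downward, and one (FPR, TPR) point is recorded per distinct score. The function names `roc_points` and `auc`, the trapezoidal area computation, and the six example scores are illustrative assumptions and are not the data plotted in Fig. 8.

```python
def roc_points(y_true, scores):
    """Return the (FPR, TPR) points obtained by lowering the decision threshold step by step."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])  # highest score first
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # samples with tied scores cross the threshold together
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# hypothetical labels and scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
area = auc(curve)
```

A curve that hugs the upper-left corner gives an area close to 1, whereas a classifier that guesses at random gives an area close to 0.5.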
3.2 Recognition performance of different time windows
Because the length of an EEG segment affects the emotional information it contains, this section examines the influence of different time windows on emotion recognition performance. We select four time window lengths, N ∈ {0.5, 0.2, 0.8, 1.0} s. Table 4 lists the valence and arousal classification results for the four time windows and shows that the recognition performance of the 2DCNNLSTM model is best when the time window is 0.5 s, with average recognition rates of 91.9% for valence and 92.3% for arousal. With N = 0.5 s, the classification accuracy is 4.19%, 5.99%, and 3.31% higher than with 0.2 s, 0.4 s, and 1.0 s, respectively.
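To make the role of the window length concrete, the sketch below splits a single-channel signal into non-overlapping windows and computes the differential entropy of each window. It relies on the closed form for a Gaussian signal, h = 0.5 ln(2πeσ²). The sampling rate of 128 Hz and the trial length of 60 s are assumptions about the preprocessed DEAP recordings that are not stated in this section, the random signal is only a stand-in for real EEG, and any band-pass filtering into frequency bands is omitted. The function names are hypothetical.

```python
import math
import random


def differential_entropy(segment):
    """DE of a segment under a Gaussian assumption: 0.5 * ln(2 * pi * e * variance)."""
    mean = sum(segment) / len(segment)
    variance = sum((x - mean) ** 2 for x in segment) / len(segment)
    return 0.5 * math.log(2 * math.pi * math.e * variance)


def windowed_de(signal, fs, window_s):
    """Split one channel into non-overlapping windows and return one DE value per window."""
    win = int(round(fs * window_s))   # samples per window
    n_windows = len(signal) // win    # an incomplete trailing window is dropped
    return [differential_entropy(signal[i * win:(i + 1) * win])
            for i in range(n_windows)]


FS = 128      # assumed sampling rate in Hz
TRIAL_S = 60  # assumed trial length in seconds
rng = random.Random(0)
signal = [rng.gauss(0.0, 1.0) for _ in range(FS * TRIAL_S)]  # stand-in for one EEG channel

de_half_second = windowed_de(signal, FS, 0.5)
de_one_second = windowed_de(signal, FS, 1.0)
```

Under these assumptions, a 0.5 s window yields 120 DE values per trial, which is consistent with the 120 time windows per trial reported above, while a 1.0 s window yields 60.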
3.3 Comparison with other methods
To show the performance of the proposed method more clearly, Table 5 compares the 2DCNNLSTM network model proposed in this paper with other network models. The experimental results show that the proposed method achieves higher classification accuracy than the other models. We attribute this to two factors. First, the method takes the spatial positions of the electrodes and their effect on emotion recognition into account. Second, temporal, spatial, and frequency information is extracted from the EEG signals: the CNN learns the spatial characteristics of the two-dimensional grid data at each sampling point, and the LSTM then captures the global temporal dynamics between consecutive sampling points in the EEG samples. Together, these steps exploit the features more fully and yield higher recognition accuracy in binary classification. The proposed feature fusion method therefore has a strong spatiotemporal representation ability. Fusing temporal and spatial characteristics significantly improved the accuracy of EEG emotion recognition, and the accuracy for both arousal and valence exceeded 91%.
4 Conclusion
In this paper, we propose an EEG emotion recognition method based on DEFM and 2DCNNLSTM. DEFM is a DE-based feature representation of the EEG signal that takes its temporal, spatial, and frequency characteristics into account. The method converts the original one-dimensional chain of channel information into two-dimensional grid information that corresponds to the brain region distribution of the EEG electrode positions, and it thereby characterizes the spatial correlation between physically adjacent electrodes. A time window is used to segment the two-dimensional grid sequence into equal-length time segments, yielding a new data representation that integrates the spatiotemporal correlation of EEG. In addition, an end-to-end trainable hybrid deep neural network model for EEG emotion recognition is proposed, which combines 2DCNN and LSTM networks to capture the spatial correlation between physically adjacent electrodes and the temporal dependence of EEG data streams. The EEG spatiotemporal feature representation and the proposed hybrid deep learning model were evaluated for valence and arousal on the 32 subjects of the large-scale DEAP dataset. The experimental results show that the average accuracy for valence and arousal is 91.92% and 92.31%, respectively, which is higher than that of the compared state-of-the-art methods. Although the proposed method effectively combines the spatiotemporal correlation of EEG and improves the accuracy and robustness of EEG emotion recognition, it has certain limitations, such as the significant differences in EEG signals and other spatiotemporal information between individuals. Future research can extend EEG emotion recognition to cross-paradigm, cross-device, and cross-population settings.
Availability of data and materials
Parts of the models, data, and code that support the study are available from the corresponding author upon reasonable request.
References
R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, J.G. Taylor, Emotion recognition in human-computer interaction. IEEE Signal Process. Mag. 18(1), 32–80 (2001)
R. Adolphs et al., Recognition of facial emotion in nine subjects with bilateral amygdala damage. Neuropsychologia 37, 1111–1117 (1999)
M. Chatterjee, D.J. Zion, M.L. Deroche, B.A. Burianek, C.J. Limb, A.P. Goren, A.M. Kulkarni, J.A. Christensen, Voice emotion recognition by cochlear-implanted children and their normally-hearing peers. Hearing Res. 322, 151–162 (2015). https://doi.org/10.1016/j.heares.2014.10.003
P.D. Ross, L. Polson, M.H. Grosbras, Developmental changes in emotion recognition from full-light and point-light displays of body movement. PLoS ONE (2012). https://doi.org/10.1371/journal.pone.0044815
H. Chao, H.Z. Zhi, L. Dong, Y.L. Liu, Recognition of Emotions Using Multichannel EEG Data and DBN-GC-Based Ensemble Deep Learning Framework. Comput. Intel. Neurosc. (2018). https://doi.org/10.1155/2018/9750904
Y. Li, J. Huang, H. Zhou, H.Y. Zhou, N. Zhong, Human emotion recognition with electroencephalographic multidimensional features by hybrid deep neural networks. Appl. Sci. (2017). https://doi.org/10.3390/app7101060
W.L. Zheng, B.N. Dong, B.L. Lu, Multimodal emotion recognition using EEG and eye tracking data, in Proceedings of the 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, (Chicago, 2014). https://doi.org/10.1109/EMBC.2014.6944757.
W.L. Zheng, J.Y. Zhu, Y. Peng, B.L. Lu, EEG-based emotion classification using deep belief networks, in Proc. IEEE Int. Conf. Multimed. Expo (2014). https://doi.org/10.1109/ICME.2014.6890166
M. Bilalpur, S.M. Kia, M. Chawla, T.S. Chua, R. Subramanian, Gender and emotion recognition with implicit user signals, in ICMI 2017: Proc. 19th ACM Int. Conf. Multimodal Interact., 379–387 (2017). https://doi.org/10.1145/3136755.3136790
W. Liu, W.L. Zheng, B.L. Lu, Emotion recognition using multimodal deep learning. Lect. Notes Comput. Sci. 9948, 521–529 (2016). https://doi.org/10.1007/978-3-319-46672-9_58
W. Liu, W.L. Zheng, B.L. Lu, Multimodal emotion recognition using multimodal deep learning. Available online: https://arxiv.org/abs/1602.08225 (Accessed on 30 September 2016)
R. Alazrai, R. Homoud, H. Alwanni, M.I. Daoud, EEG-based emotion recognition using quadratic time-frequency distribution. Sensors 18(8), 2739 (2018)
R. Li, Y. Wang, B.L. Lu, A multi-domain adaptive graph convolutional network for EEG-based emotion recognition, in Proceedings of the 29th ACM International Conference on Multimedia (pp. 5565–5573). (2021)
J. Li, Z. Zhang, H. He, Hierarchical convolutional neural networks for EEG-based emotion recognition. Cogn. Comput. 10, 368–380 (2018)
T. Song, W. Zheng, P. Song, Z. Cui, EEG emotion recognition using dynamical graph convolutional neural networks, IEEE Trans. Affect. Comput. (2018)
W. Tao, C. Li, R. Song, J. Cheng, Y. Liu, F. Wan, X. Chen, EEG-based emotion recognition via channel-wise attention and self attention, IEEE Trans. Affect. Comput. (2020)
T. Zhang, W. Zheng, Z. Cui, Y. Zong, Y. Li, Spatial–temporal recurrent neural network for emotion recognition. IEEE Trans. Cyber. 49(3), 839–847 (2019). https://doi.org/10.1109/TCYB.2017.2788081
E. Rudakov, L. Laurent, V. Cousin, A. Roshdi, R. Fournier, A. Naitali, S. Al Kork, Multi-task CNN model for emotion recognition from EEG Brain maps, in 2021 4th International Conference on BioEngineering for Smart Technologies (BioSMART) (pp. 1–4). IEEE. (2021)
V. Rozgić, S. Ananthakrishnan, S. Saleem, R. Kumar, R. Prasad, Ensemble of SVM trees for multimodal emotion recognition, in Proceedings of the 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (pp. 1–4). IEEE. (2012)
M. Li, H. Xu, X. Liu, S. Lu, Emotion recognition from multichannel EEG signals using K-nearest neighbor classification. Technol. Health Care 26(S1), 509–519 (2018)
K. Murphy, Y. Weiss. The factored frontier algorithm for approximate inference in DBNs. arXiv preprint arXiv:1301.2296, (2013).
Y. Wei et al., CNN: Single-label to Multi-label, 6(1), 1–14. (2014). https://doi.org/10.1109/TPAMI.2015.2491929
S. Verma, Z.L. Zhang, Graph capsule convolutional neural networks, 2018. Available: http://arxiv.org/abs/1805.08090
R. Mukhometzianov, J. Carrillo, CapsNet comparative performance evaluation for image classification. arXiv:1805.11195, arXiv.org, pp. 1–14, 2018, [Online]. Available: https://arxiv.org/ftp/arxiv/papers/1805/1805.11195.pdf
W.H. Hwang, D.H. Kang, D.H. Kim, Brain lateralisation feature extraction and ant colony optimisation-bidirectional LSTM network model for emotion recognition. IET Signal Proc. 16(1), 45–61 (2022)
S. Alhagry, A. Aly, R. A., Emotion recognition based on EEG using LSTM recurrent neural network, Int. J. Adv. Comput. Sci. Appl. (2017). https://doi.org/10.14569/ijacsa.2017.081046.
S. Tripathi, S. Acharya, R.D. Sharma, S. Mittal, S. Bhattacharya, Using deep and convolutional neural networks for accurate emotion classification on deap dataset
T. Song, W. Zheng, P. Song, Z. Cui, EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Trans. Affect. Comput. 11(3), 532–541 (2020). https://doi.org/10.1109/TAFFC.2018.2817622
R.T. Schirrmeister et al., Deep learning with convolutional neural networks for EEG decoding and visualization. Hum. Brain Mapp. 38(11), 5391–5420 (2017)
S. Koelstra et al., DEAP: A database for emotion analysis; Using physiological signals. IEEE Trans. Affect. Comput. 3(1), 18–31 (2012). https://doi.org/10.1109/TAFFC.2011.15
T.M. Cover, J.A. Thomas, Differential entropy. Elements of Inf. Theory, 224–238. (1991)
J. Fdez, N. Guttenberg, O. Witkowski, A. Pasquali, Cross-subject EEG-based emotion recognition through neural networks with stratified normalization. Front. Neurosci. 15, 626277 (2021)
A. Baratloo, M. Hosseini, A. Negida, G. El Ashal, Part 1: simple definition and calculation of accuracy, sensitivity and specificity. (2015)
N. Salankar, P. Mishra, L. Garg, Emotion recognition from EEG signals using empirical mode decomposition and second-order difference plot. Biomed. Signal Process. Control 65, 102389 (2021)
W. Tang, J. Hu, H. Zhang, P. Wu, H. He, Kappa coefficient: a popular measure of rater agreement. Shanghai Arch. Psychiatry 27(1), 62 (2015)
X. Li, Y. Zhang, P. Tiwari, D. Song, B. Hu, M. Yang, P. Marttinen, EEG based emotion recognition: A tutorial and review. ACM Comput. Surveys 55(4), 1–57 (2022)
H.J. Yoon, S.Y. Chung, EEG-based emotion estimation using Bayesian weighted-log-posterior function and perceptron convergence algorithm. Comput. Biol. Med. 43(12), 2230–2237 (2013). https://doi.org/10.1016/j.compbiomed.2013.10.017
P. Arnau-Gonzalez, M. Arevalillo-Herraez, S. Katsigiannis, N. Ramzan, On the influence of affect in EEG-based subject identification. IEEE Trans. Affect. Comput. 12(2), 391–401 (2021). https://doi.org/10.1109/TAFFC.2018.2877986
V. Gupta, M.D. Chopda, R.B. Pachori, Cross-subject emotion recognition using flexible analytic wavelet transform from EEG signals. IEEE Sens. J. 19(6), 2266–2274 (2019). https://doi.org/10.1109/JSEN.2018.2883497
R. Gupta, K. ur Rehman Laghari, T.H. Falk, Relevance vector classifier decision fusion and EEG graph-theoretic features for automatic affective state characterization. Neurocomputing 174, 875–884 (2016). https://doi.org/10.1016/j.neucom.2015.09.085
J. Cheng et al., Emotion recognition from multichannel EEG via deep forest. IEEE J. Biomed. Heal. Inf. 25(2), 453–464 (2021). https://doi.org/10.1109/JBHI.2020.2995767
S. Soleymani, J. Lichtenauer, T. Pun, M. Pantic, A multimodal database for affect recognition and implicit tagging. IEEE Trans. Affect. Comput. 3(1), 42–55 (2012). https://doi.org/10.1109/TAFFC.2011.25
P. Arnau-González, S. Katsigiannis, N. Ramzan, D. Tolson, M. Arevalillo-Herráez, ES1D: A deep network for EEG-based subject identification, in Proc. 2017 IEEE 17th Int. Conf. Bioinforma. Bioeng. (BIBE 2017), 81–85 (2017). https://doi.org/10.1109/BIBE.2017.0074
D. Zhang, L. Yao, K. Chen, J. Monaghan, A convolutional recurrent attention model for subject-independent EEG signal analysis. IEEE Signal Process. Lett. 26(5), 715–719 (2019). https://doi.org/10.1109/LSP.2019.2906824
Y. Yin, X. Zheng, B. Hu, Y. Zhang, X. Cui, EEG emotion recognition using fusion model of graph convolutional neural networks and LSTM. Appl. Soft Comput. 100, 106954 (2021). https://doi.org/10.1016/j.asoc.2020.106954
A. Topic, M. Russo, Emotion recognition based on EEG feature maps through deep learning network. Eng. Sci. Technol. an Int. J. 24(6), 1442–1454 (2021). https://doi.org/10.1016/j.jestch.2021.03.012
W. Liu, J.L. Qiu, W.L. Zheng, B.L. Lu, Multimodal emotion recognition using deep canonical correlation analysis. arXiv preprint arXiv:1908.05349 (2019)
Y. Yang, Q. Wu, M. Qiu, Y. Wang, X. Chen, Emotion recognition from multichannel EEG through parallel convolutional recurrent neural network, in 2018 International Joint Conference on Neural Networks (IJCNN) (pp. 1–7). IEEE. (2018)
Q. Gao, Y. Yang, Q. Kang, Z. Tian, Y. Song, EEG-based emotion recognition with feature fusion networks. Int. J. Mach. Learn. Cybern. 13(2), 421–429 (2022)
Acknowledgements
We appreciate the editors and reviewers who processed and reviewed our manuscript to provide detailed professional comments on the technical contributions, logical structure, and content presentation of this paper.
Funding
This work is supported by the National Natural Science Foundation of China (62266053, 62062070, 62365017, 62062069, and 62005235), the Natural Science Foundation of Yunnan Province (202101AT070100), and Yunnan Expert Workstation (No. 202305AF150012).
Author information
Contributions
TW was involved in conceptualization, data curation, validation, and writing—original draft preparation. ZNX and WDC were responsible for methodology, software, visualization, and writing—original draft preparation. HXQ and YHT contributed to funding acquisition, resources, supervision, and writing—reviewing and editing.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
All authors approved the final manuscript and the submission to this journal.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, T., Huang, X., Xiao, Z. et al. EEG emotion recognition based on differential entropy feature matrix through 2DCNNLSTM network. EURASIP J. Adv. Signal Process. 2024, 49 (2024). https://doi.org/10.1186/s13634-024-01146-y
DOI: https://doi.org/10.1186/s13634-024-01146-y