Robust automatic modulation classification under noise mismatch

Automatic modulation classification plays a critical role in the intelligent reception of unknown wireless signals. In practice, the dynamic wireless environment brings a great challenge, and the actual test model is inconsistent with the training model. Therefore, aiming at the problem of noise mismatch, this paper proposes a new modulation classification method based on KD-GoogLeNet and Squeeze-Excitation (KD-GSENet). Using the k-dimensional tree, the complex wireless signals are converted into color images rather than normal constellations, which can enhance the classification features. Considering the attention block has the inherent advantage of assigning more weights to important features, this paper further uses it to improve the GoogLeNet. Finally, extensive experiments are presented including Gaussian noise, non-Gaussian noise, and the scenarios of noise mismatch. Numerical results verify the superior classification performance of the proposed KD-GSENet under different scenarios.


Introduction
Automatic modulation classification (AMC) is an important technique for detecting signal modulation schemes in intelligent communication receivers. As a crucial technique to identify the modulation formats under noise and interference, AMC has been widely used in military, cognitive radio, and crowded electromagnetic spectrum communications [1]. Traditional classification algorithms often require manual feature extraction by experienced experts. Most traditional methods cannot meet the requirements of high efficiency and high classification accuracy.
In recent years, with the rapid development of artificial intelligence, deep learning (DL) has been widely used [2]. Modulation classification is essentially a typical pattern classification problem [3], so the progress of DL promotes its development. Through artificial neural networks, DL can realize automatic feature extraction of different modulated signals. Using DL in AMC makes it possible to process large amounts of data and extract more comprehensive features without manual feature selection. DL-based AMC methods improve classification accuracy; however, most models only consider ideal scenarios such as common Gaussian noise. As the complexity of the electromagnetic environment increases [4], the background noise often changes dynamically and can be non-Gaussian [5]. In the presence of non-Gaussian noise, the performance of AMC schemes degrades dramatically. These challenges have driven further development of AMC. Through preprocessing, DL-based AMC methods can improve the robustness of classification. Therefore, it is necessary to explore the application value of modulation classification in many complex environments [6].
Machine learning (ML) is an advanced technology that includes different classifiers such as artificial neural networks, K-nearest neighbors, and genetic programming [12]. DL is an important branch of ML, which can simultaneously realize feature extraction and classification. Thus, it is widely used in AMC [13]. An AMC method based on the convolutional neural networks (CNNs) was proposed in [14], which can automatically extract the features and estimate the signal-to-noise ratio (SNR) from sequences. By exploring the interactive features of in-phase/quadrature (I/Q) and amplitude/phase (A/P) [15], Chang et al. proposed a fusion deep neural network. Spatial-temporal characteristics of original complex signals were effectively explored in [16], which helps to obtain more efficient classification features.
Considering that the signal cannot be directly used as the input of the CNN, scholars have proposed different preprocessing methods. A new data preprocessing method was proposed in [17]. Further, many scholars directly converted the signal into a two-dimensional image. O'Shea et al. [13] proposed a modulation classification model based on end-to-end CNN. Inspired by their work, more scholars converted signals into different images. Peng et al. converted complex signals into three-channel constellations and configured models based on GoogLeNet [18]. Zhou et al. [19] proposed a method to classify the received signal without feature extraction, which can automatically learn features from the received signal. The article [20] gave a modulation classification algorithm based on a constellation density matrix to identify different orders of amplitude shift keying (ASK), phase shift keying (PSK), and quadrature amplitude modulation (QAM). Using contrastive fully convolutional networks, a novel AMC approach based on a grid constellation matrix was proposed in [21]. In addition to converting signals into constellations, Yan et al. proposed a new feature extraction method based on the cyclic spectrum [22]. They also presented an AMC method for multi-binary QAM signals based on constellation diagram analysis [23]. According to frequency variation with time under different modulation, a short-time discrete Fourier transform was used to convert one-dimensional radio signals into spectral images [24].
For better performance, the researchers designed new networks to extract different representations from the received signals. By adjusting the number of layers and adding new layers, [25] gave an improved AMC network based on CNN. Bu et al. provided a learning architecture that combined adversarial training and knowledge transfer [26].
The long short-term memory (LSTM) network was used to learn amplitude and phase information in the time domain [27]. Thien et al. [28] proposed a robust AMC network adopting multiple specific convolution blocks for modern communication systems. They also designed a high-performance CNN structure, which mainly involved multiple high-level processing blocks to learn the intrinsic features of combined waveforms [29]. Huang et al. [30] offered a novel gated recurrent residual neural network. In [31], residual networks were used to extract discriminant features. Aiming at the classification accuracy and calculation time, the article [32] introduced an efficient AMC scheme by exploiting the bottleneck structure of the residual network. Using CNN and gated recurrent unit as feature extraction layers, [33] presented an efficient model based on phase parameter estimation and transformation.
In actual communication, signal modulation classification is vulnerable to dynamic environments. Therefore, it is crucial to come up with more robust AMC methods for different cases. To improve the classification performance under impulsive noise, Zhang et al. pointed out a modulation classification method based on the cyclic correlation entropy spectrum [1]. The paper [34] generated the feature vector of missing modulated signals based on semantic feature vector, which greatly improved the classification accuracy of undiscovered classes. Adopting the Cauchy distribution function as a robust feature of acoustic noise, an improved constellation was presented [35]. To overcome the intra-class classification problem caused by the dynamic changes, Luan et al. [36] suggested an AMC method based on the multi-scale network.
Thus, it is urgent to find a method that can adapt to different noise environments.

Contributions
In this paper, a new AMC method is proposed. By combining preprocessing and improving the network, the proposed method improves the classification accuracy. The method has good robustness to non-Gaussian noise and noise mismatch. The contributions of this paper are summarized as follows:
• A GoogLeNet and Squeeze-Excitation (GSENet) network is proposed. By assigning more weight to important features, the network incorporates a self-attention mechanism to enhance discrimination and expression.
• A method of k-dimensional tree (KD-tree) preprocessing is further introduced, which directly converts the signals into three-channel constellations. Unlike traditional constellations, it enlarges the differences between different modulations by combining more characteristics of signals. The KD-GoogLeNet and Squeeze-Excitation (KD-GSENet) is capable of identifying the received signals in the case of noise mismatch.
• Extensive numerical experiments are performed to evaluate the performance of the KD-GSENet under Gaussian noise, non-Gaussian noise, and the case of noise mismatch. In addition, the classification accuracy and computational complexity are compared with other methods. Numerical results verify that the proposed method not only has superior classification accuracy under Gaussian noise but also has little performance loss under non-Gaussian noise. Moreover, compared with other methods, the proposed method has high robustness and generalization in the case of noise mismatch, while the increase in algorithm complexity is not significant.

Organization
The rest of this paper is organized as follows. In Sect. 2, the system model is summarized. Section 3 presents the proposed AMC method. The numerical results are presented and discussed in Sect. 4. Section 5 draws conclusions.

System model
In this section, the signal model and the model of background noise are briefly introduced.

Signal model
This paper aims to identify the correct modulation scheme among binary phase shift keying (BPSK), four amplitude shift keying (4ASK), quadrature phase shift keying (QPSK), offset QPSK (OQPSK), eight phase shift keying (8PSK), 16-ary quadrature amplitude modulation (16QAM), 32-ary quadrature amplitude modulation (32QAM), and 64-ary quadrature amplitude modulation (64QAM). According to the traditional modulation classification model [1], in which the receiver is equipped with a single antenna, the received signal can be represented as

$$y(n) = h\,s(n) + w(n), \quad n = 1, 2, \ldots, N, \tag{1}$$

where y(n) is the received signal, h is the channel gain, which is invariant during the classification process, and s(n) is the transmitted signal, drawn from one of the eight candidate modulations. N is the number of samples, and w(n) is the generalized Gaussian noise (GGN) with zero mean, which is discussed in the next subsection.

Model of background noise
In addition to Gaussian noise, non-Gaussian noise [37] is considered in this paper. Non-Gaussian noise is a random process whose probability density function (PDF) does not follow the Gaussian distribution. GGN includes Gaussian noise and some types of non-Gaussian noise. The PDF of GGN is

$$f(x) = \frac{\beta}{2\upsilon\,\Gamma(1/\beta)} \exp\left(-\left(\frac{|x - \varpi|}{\upsilon}\right)^{\beta}\right), \tag{2}$$

where ϖ = 0 is the mean, υ is the "scale parameter", β is the "shape parameter", and Γ(·) denotes the Gamma function. In particular, equation (2) reduces to the Gaussian distribution when β = 2, which corresponds to Gaussian noise. The remaining cases are non-Gaussian, such as the Laplacian distribution when β = 1. In GGN, the expectation and variance of the noise are given in [38] as

$$E[w(n)] = \varpi, \qquad D[w(n)] = \frac{\upsilon^{2}\,\Gamma(3/\beta)}{\Gamma(1/\beta)},$$

where E[·] and D[·] denote the expectation and variance operators. Thus, the SNR ϕ can be expressed in terms of the noise variance D[w(n)] and the signal power

$$P = \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} |s(n)|^{2}.$$

Some possible shapes and realizations of the generalized Gaussian distribution with the same variance are shown in Fig. 1. It can be noticed that the Laplacian time series (β = 1) exhibits more spikes or outliers than time series with larger β.
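As a rough illustration of the noise model, the sketch below draws zero-mean GGN samples with a prescribed variance using only the Python standard library. It relies on two standard properties of the generalized Gaussian family: the variance equals υ²Γ(3/β)/Γ(1/β), and if T follows a Gamma distribution with shape 1/β and unit scale, then υ·T^(1/β) with a random sign follows the GGN density. The function names are illustrative and not taken from the paper.

```python
import math
import random


def ggn_scale(variance, beta):
    """Scale parameter that yields the requested variance for shape `beta`."""
    # variance = scale**2 * Gamma(3/beta) / Gamma(1/beta)
    return math.sqrt(variance * math.gamma(1.0 / beta) / math.gamma(3.0 / beta))


def sample_ggn(n, variance, beta, rng):
    """Draw n zero-mean generalized Gaussian samples with the given variance."""
    scale = ggn_scale(variance, beta)
    samples = []
    for _ in range(n):
        # T ~ Gamma(1/beta, 1)  =>  scale * T**(1/beta) is the magnitude.
        magnitude = scale * rng.gammavariate(1.0 / beta, 1.0) ** (1.0 / beta)
        # A fair coin decides the sign, which keeps the mean at zero.
        samples.append(magnitude if rng.random() < 0.5 else -magnitude)
    return samples


def empirical_variance(samples):
    """Biased sample variance, sufficient for a sanity check."""
    mean = sum(samples) / len(samples)
    return sum((x - mean) ** 2 for x in samples) / len(samples)


rng = random.Random(0)
laplacian = sample_ggn(20000, 1.0, 1.0, rng)  # beta = 1, Laplacian noise
gaussian = sample_ggn(20000, 1.0, 2.0, rng)   # beta = 2, Gaussian noise
```

A noise-mismatch experiment in the sense of this paper would then train on samples generated with one value of β and test on samples generated with another, at the same noise variance.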
In real communication, the environment is variable. Not only Gaussian noise and non-Gaussian noise, but also the cases of noise mismatch are considered in the following experiments. Noise mismatch in this paper refers to the noise inconsistency in training and testing, in which the test data is determined by the real wireless environment.

Proposed AMC method
In this paper, an AMC method based on KD-tree enhancement and GSENet is proposed. Figure 2 shows the overall structure, which mainly includes a signal preprocessing module and a network identification module. First, in the signal preprocessing module, according to the difference in distance between each signal point, the KD-tree enhancement strategy is used to color different modulation types. The generated signals are directly converted into color constellations. Second, in the network identification module, the enhanced constellation is used to train the network for modulation classification. The key is to build the GSENet model, which introduces the Squeeze-and-Excitation (SE) block in the sub-module and auxiliary classifier. It improves the ability to classify different signals. Further, a batch normalization (BN) layer is added and the activation function is updated to the rectified linear unit (ReLU) [39], which effectively enhances the generalization of the network. Finally, the trained KD-GSENet is used to identify the enhanced constellations under different $E_b/N_0$. The specific algorithm is as follows.

Preprocessing: KD-tree enhancement
The enhanced constellation is drawn by the method of the KD-tree [40] neighborhood point search. To improve the classification characteristics of images, the radio signals are directly converted into color constellations. As a node of the KD-tree, each signal point is divided into a root node or a leaf node.
For a sample set composed of n d-dimensional data points, the feature values of any sample can be used as the root node. To ensure the fastest search for the nearest point, a balanced binary tree is constructed as shown in Fig. 3.
1. Determine the root node. Select the division dimension by sequential traversal, and sort all nodes along that dimension. Initially, the middle node is used as the root node.
2. Determine the left and right subtrees. Compare the value of each node with the split node in the same dimension. When the value of a node is greater than that of the split node, it is placed in the right subtree of the split node. Conversely, if the value of a node is less than that of the split node, it is placed in the left subtree.
According to the constructed KD-tree, the nearest neighbor points are searched. First, assuming that the "current nearest neighbor" is its parent node, the minimum distance is the distance to the parent node. During backtracking, if the distance between a child node and the target node is smaller than the distance between the "current nearest neighbor" and the target node, the "current nearest neighbor" is updated to the selected child node. The iteration terminates when the search returns to the root of the tree. The minimum distance between the target node and its nearest neighbor is then obtained. The nearest distance of each signal point is used as an approximation of the density at that position. Finally, all signal points are colored according to density. The enhanced constellation of QPSK is shown in Fig. 4. It can be seen that each point in the preprocessed constellation carries different information, so the points are no longer independent and no longer carry equal information. This processing method condenses more time-accumulated features of received signals in the constellation, which enhances its separability and achieves feature enhancement.
Thus, the received signals have been converted into color images with dimensions of 3 × 224 × 224.
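The preprocessing idea can be sketched as follows: build a balanced two-dimensional KD-tree over the I/Q points, find the distance from each point to its nearest neighbor, and map that distance to a density value that could drive the coloring. This is a minimal sketch using a textbook nearest-neighbor search, not the authors' implementation. The normalization to [0, 1], the noisy QPSK test points, and all function names are assumptions made for illustration, and the mapping from a density value to three color channels is left out.

```python
import math
import random


def build_kdtree(points, depth=0):
    """Build a balanced 2-D KD-tree; each node is (point, left, right)."""
    if not points:
        return None
    axis = depth % 2                        # alternate between the I and Q axes
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                  # the median point becomes the split node
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def nearest_distance(node, target, depth=0, best=math.inf):
    """Distance from `target` to the nearest other point stored in the tree."""
    if node is None:
        return best
    point, left, right = node
    if point is not target:                 # skip the query point itself
        best = min(best, math.dist(point, target))
    axis = depth % 2
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest_distance(near, target, depth + 1, best)
    if abs(diff) < best:                    # the far side may still hold a closer point
        best = nearest_distance(far, target, depth + 1, best)
    return best


def density_values(points):
    """Map each point to [0, 1]; values near 1 mark the densest neighborhoods."""
    tree = build_kdtree(points)
    dists = [nearest_distance(tree, p) for p in points]
    lo, hi = min(dists), max(dists)
    span = (hi - lo) or 1.0
    # A small nearest-neighbor distance is treated as a high local density.
    return [1.0 - (d - lo) / span for d in dists]


# Hypothetical noisy QPSK samples, used only to exercise the functions.
rng = random.Random(1)
ideal = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
points = [(i + rng.gauss(0, 0.2), q + rng.gauss(0, 0.2))
          for i, q in (rng.choice(ideal) for _ in range(200))]
values = density_values(points)
```

Sorting at every level keeps the tree balanced at the cost of extra construction time, which is acceptable for a sketch of this size.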

Network optimization
The DL model consists of multiple layers, each containing multiple neurons for automatic feature extraction. The initial layer extracts abstract features. Deep layers obtain important features by applying multiple nonlinear transformations on the output of the previous layer. GoogLeNet [41] is a DL structure proposed by the Google team, which won the ImageNet competition with a significant advantage. The parallel structure adopted by the model can integrate feature information of different scales. The model also uses 1 × 1 convolution kernels for dimensionality reduction and mapping. In addition, two auxiliary classifiers are added to help with training. By replacing the traditional dropout fully connected layer with the average pooling layer, the parameters of the model are greatly reduced. Therefore, to improve the classification ability of modulated signals, the GoogLeNet is introduced.
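The benefit of 1 × 1 convolutions for dimensionality reduction can be checked with a short weight count. The sketch below compares a direct 5 × 5 convolution with a 1 × 1 reduction followed by a 5 × 5 convolution. The channel counts are illustrative values chosen for this example, not the configuration reported in Tables 1 and 2, and biases are ignored.

```python
def conv_weights(in_channels, out_channels, kernel):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return in_channels * out_channels * kernel * kernel


# Illustrative channel counts (assumed, not taken from the paper's tables).
c_in, c_reduced, c_out = 192, 16, 32

# A 5 x 5 convolution applied directly to all input channels.
direct = conv_weights(c_in, c_out, 5)

# A 1 x 1 convolution shrinks the channel count before the 5 x 5 convolution.
reduced = conv_weights(c_in, c_reduced, 1) + conv_weights(c_reduced, c_out, 5)

print(direct, reduced)
```

Under these assumed widths, the reduced branch needs roughly one tenth of the weights of the direct convolution.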
To speed up the training of the network, a BN layer is added after each convolutional layer. Convolutional layers extract image features. The BN layer normalizes each feature across the samples of a batch, which accelerates network training and improves the generalization ability. ReLU is used as the activation function.
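A minimal sketch of this step, for a single feature observed over a small batch, is given below. The learnable `scale` and `shift` parameters and the small constant `eps` follow the usual batch normalization formulation and are assumptions here, since the paper does not list them.

```python
import math


def batch_norm(values, scale=1.0, shift=0.0, eps=1e-5):
    """Normalize one feature across a batch, then apply scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [scale * (v - mean) / math.sqrt(var + eps) + shift for v in values]


def relu(z):
    """Rectified linear unit: pass positive inputs, zero out the rest."""
    return max(0.0, z)


feature = [0.5, 2.0, -1.0, 3.5]       # one feature over a batch of 4 samples
normalized = batch_norm(feature)      # zero mean, approximately unit variance
activated = [relu(v) for v in normalized]
```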
Original signals have been converted into color constellations by KD-tree enhancement. Considering that the SE block can assign more weight to important features [42], it is used for feature classification in the sub-modules and auxiliary classifiers of GoogLeNet. Specifically, the SE module performs adaptive average pooling on each channel of the input feature matrix. The output vector is obtained through two fully connected layers. The number of nodes in the first fully connected layer is 1/4 of the number of channels of the feature matrix. The number of nodes in the second fully connected layer equals the number of channels of the input feature matrix. The output vector of the second fully connected layer gives the weight of each channel. Important channels are given larger weights, while unimportant channels receive smaller weights. Each weight is multiplied by the corresponding channel, assigning more weight to important features.
Figure 5 shows the specific sub-module. The first sub-module is elaborated as an illustration. The feature matrix $Y \in \mathbb{R}^{H \times W \times C}$ output from the previous layer is input into four branches $Y_k$, $k \in \{1, 2, 3, 4\}$, for processing. The first branch is a convolutional layer C with a kernel size of 1 × 1. The second branch adopts a 1 × 1 convolutional layer with a dimensionality reduction function and a 3 × 3 convolutional layer. The third branch passes through a 1 × 1 convolutional layer with a dimensionality reduction function and a 5 × 5 convolutional layer. The fourth branch passes through a 3 × 3 maximum pooling layer P and a 1 × 1 convolutional layer for dimensionality reduction. In each branch, the parameter of the cth filter is $y_{k,c} \in \mathbb{R}^{H \times W \times K}$. The output is $Y_k = [y_{k,1}, y_{k,2}, \ldots, y_{k,C}]$.
A statistic $s_{k,c} \in \mathbb{R}^{C \times K}$ is generated by shrinking the spatial dimensions H × W of $Y_k$. The cth element of $s_k$ is calculated by

$$s_{k,c} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} y_{k,c}(i, j).$$

After obtaining $s_k$, two fully connected layers are applied:

$$e_k = \delta\left(W_2\, \sigma(W_1 s_k)\right),$$

where $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, and r is the reduction ratio used to reduce the dimension of the fully connected layer. δ represents the sigmoid activation function, and σ refers to the ReLU, which is used as the activation function to introduce nonlinearity into the network. In addition, ReLU helps to avoid gradient vanishing and explosion. It increases the sparsity of the network and alleviates the over-fitting problem. The expression of ReLU is

$$\sigma(z) = \max(0, z),$$

where z is the input of the neuron. Note that, in the implementation, a convolutional layer is used to replace the pooling layer, since the down-sampling during pooling may confuse the information in the enhanced constellation. In addition, successive convolutional layers can improve the nonlinearity of the network and limit the scale, which helps to enhance learning and prevent overfitting.
Fig. 5 The structure of the improved sub-module
The final output of the block is obtained by rescaling $Y_k$ with the activation $e_k$:

$$\tilde{y}_{k,c} = e_{k,c}\, y_{k,c},$$

where $\tilde{Y}_k = [\tilde{y}_{k,1}, \tilde{y}_{k,2}, \ldots, \tilde{y}_{k,C}]$, and $e_{k,c}\, y_{k,c}$ refers to channel-wise multiplication between the scalar $e_{k,c}$ and the feature map $y_{k,c} \in \mathbb{R}^{H \times W \times K}$. The enhanced feature matrices of different scales are obtained through the four branches $Y_k$, $k \in \{1, 2, 3, 4\}$. Each feature matrix has the same height and width. After processing, the four feature matrices are concatenated along the depth dimension to form the output feature matrix. These four branches increase the width of the network, enabling GSENet to learn multiscale information. Using the concat operation [29], the features of different convolutional layers are merged, which increases the non-linear capacity of the trainable network.
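The squeeze, excitation, and rescaling steps described above can be sketched on a toy input with plain Python lists. This is an illustrative sketch rather than the authors' network: the fully connected layers have no bias, the weights are arbitrary placeholders instead of trained values, and the reduction ratio of 4 follows the 1/4 rule stated in the text.

```python
import math


def squeeze(feature_maps):
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def dense(weights, vector):
    """Matrix-vector product of a fully connected layer without bias."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def excite(s, w1, w2):
    """Two fully connected layers: ReLU after the first, sigmoid after the second."""
    hidden = [max(0.0, h) for h in dense(w1, s)]
    return [1.0 / (1.0 + math.exp(-z)) for z in dense(w2, hidden)]


def se_block(feature_maps, w1, w2):
    """Rescale every channel by its weight, which lies in (0, 1)."""
    e = excite(squeeze(feature_maps), w1, w2)
    return [[[e_c * x for x in row] for row in fmap]
            for e_c, fmap in zip(e, feature_maps)]


# Toy input: C = 4 channels of size 2 x 2; channel c is filled with c + 1.
maps = [[[float(c + 1)] * 2 for _ in range(2)] for c in range(4)]
w1 = [[0.1, 0.2, 0.3, 0.4]]            # shape (C / r) x C = 1 x 4, placeholder values
w2 = [[1.0], [-1.0], [0.5], [0.0]]     # shape C x (C / r) = 4 x 1, placeholder values
out = se_block(maps, w1, w2)
```

In this toy case the last channel receives a weight of exactly 0.5 because its row in `w2` is zero, while the first channel receives a weight close to 1.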
The structure of the improved auxiliary classifier is shown in Fig. 6. First, the average pooling layer has a 5 × 5 pooling kernel and a stride of 3. A convolutional layer with 128 kernels of size 1 × 1 is used to reduce the dimension. The weights of different channels are obtained through the SE block and multiplied by the corresponding channels. The specific calculation process is the same as in the sub-module. The obtained feature matrix is flattened. To reduce over-fitting, dropout with a rate of 50% is used to randomly deactivate neurons during forward propagation. The number of nodes in the output layer is the same as the number of modulation types. Finally, the output is converted to the probability that the input signal belongs to each candidate modulation format through the Softmax [43], which can be expressed as

$$p_i = \frac{e^{y_i}}{\sum_{j=1}^{M} e^{y_j}},$$

where $y_i$ is the output value of the ith neuron, and M is the number of output neurons, equal to the number of modulation types. Softmax converts the output values of the multi-class classifier into a probability distribution whose entries lie in the range [0, 1] and sum to 1, and $p_i$ is the probability of the corresponding neuron.
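The Softmax step can be illustrated with a few lines of Python. The sketch subtracts the maximum output before exponentiation, a common numerical-stability measure that does not change the result. The eight output values are hypothetical and serve only as an example.

```python
import math


def softmax(outputs):
    """Convert raw output values into probabilities that sum to 1."""
    peak = max(outputs)                        # shift by the max for numerical stability
    exps = [math.exp(y - peak) for y in outputs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output values for the 8 candidate modulation types.
logits = [2.0, 0.5, -1.0, 0.0, 1.0, -0.5, 0.3, 0.1]
probs = softmax(logits)
predicted = probs.index(max(probs))            # index of the most likely class
```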
In training, 8 modulated signals need to be identified. After calculating the probability of each type through Softmax, the cross-entropy loss function is used to find the optimal weight parameters. The smaller the cross entropy, the closer the predicted probability model is to the true one.
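As a hedged illustration of this loss, the sketch below computes the cross entropy between a one-hot label and two hypothetical Softmax outputs for 8 classes. The function and variable names, as well as the small constant `eps` that guards against log(0), are assumptions introduced for this example.

```python
import math


def cross_entropy(true_dist, predicted_dist, eps=1e-12):
    """Cross entropy between a label distribution and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(true_dist, predicted_dist))


label = [0, 0, 1, 0, 0, 0, 0, 0]               # one-hot label, true class is index 2
good = [0.02, 0.02, 0.86, 0.02, 0.02, 0.02, 0.02, 0.02]   # confident, correct
poor = [0.30, 0.20, 0.10, 0.10, 0.10, 0.10, 0.05, 0.05]   # spread out, mostly wrong

loss_good = cross_entropy(label, good)
loss_poor = cross_entropy(label, poor)
```

The prediction that places more probability on the true class yields the smaller loss, which is the property the training procedure exploits.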
Using adaptive moment estimation (Adam) [44] as the optimizer, different adaptive learning rates are assigned to different weight parameters. By updating the model parameters, the loss function is minimized. The Adam algorithm combines momentum and adaptation to avoid the cold start problem. Unlike stochastic gradient descent, Adam uses the gradient second moment to accelerate the convergence speed. On this basis, the back-propagation algorithm is introduced to update the weights until the loss converges to a stable value.
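The Adam update can be sketched for a single scalar parameter as follows. The default learning rate matches the value µ = 0.0003 reported in the experiments, while the decay rates `beta1`, `beta2` and the constant `eps` are the commonly used defaults and are assumptions here. The quadratic toy loss and the larger learning rate in the usage line are chosen only so that the example converges in few steps.

```python
import math


def adam_minimize(grad, theta, lr=0.0003, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize a scalar function with Adam, given its gradient function."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # first-moment (momentum) estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy loss (theta - 3)^2 with gradient 2 * (theta - 3).
theta = adam_minimize(lambda x: 2 * (x - 3), theta=0.0, lr=0.05, steps=2000)
```

The bias-correction terms compensate for the zero initialization of the moment estimates, which is what the text refers to as avoiding the cold start problem.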
The KD-GSENet algorithm is summarized in Algorithm 1.

Numerical results and discussion
In this section, numerical results are presented to verify the superiority and robustness of the proposed method. In the experiment, the KD-GSENet classification method is verified on a simulation dataset containing 8 modulated signals. The proposed method is further compared with other classification methods. At the same time, the influence of KD-tree enhancement, the different noise types, and the impact of $E_b/N_0$ changes on the classification performance are also analyzed through experiments. Finally, the implementation complexity and processing speed of these methods are compared. The experiments are evaluated in terms of the ratio of bit energy to noise power spectral density ($E_b/N_0$), which is classically defined as the ratio of energy per bit ($E_b$) to the noise power spectral density ($N_0$). The structure of the network is given in Tables 1 and 2. In the tables, the second column represents the output size, where the last digit is the number of channels. In the third column, w × h × c represents the convolution parameter: c is the number of output channels, and w × h is the size of the convolution kernel. During training, an optimization algorithm with a learning rate µ = 0.0003 is adopted.
Figure 7 shows the classification performance of four methods under different $E_b/N_0$. The classification accuracy is obtained by averaging the classification performance of 8 modulation types. The proposed method is compared with the modulation classification methods based on AlexNet [48], MobileNetV3 [49], and GoogLeNet [18]. As shown in Fig. 7, the modulation classification performance improves with increasing $E_b/N_0$ for all algorithms. The proposed KD-GSENet is superior to the other models, achieving higher accuracy at the same $E_b/N_0$.

To further evaluate the effect of KD-tree enhancement on various modulated signals, the confusion matrices are drawn. The first two subgraphs of Fig. 8 show the confusion matrices of the GSENet model and the proposed KD-GSENet model with $E_b/N_0 = -1$ dB. The classification accuracy of the two models is 88.5% and 89.2%, respectively. It can be seen that the KD-GSENet model achieves 100% classification accuracy for 4ASK, BPSK, QPSK, 8PSK, and OQPSK. KD-tree enhancement improves both the overall classification accuracy of the 8 signals and the intra-class classification accuracy. The confusion matrices with $E_b/N_0 = -4$ dB present similar results in the last two subgraphs of Fig. 8.
Figure 9 shows the average classification accuracy under non-Gaussian noise. Two cases are considered: training and testing with β = 1, and training and testing with β = 5. In both noise environments, the classification accuracy is improved by using KD-GSENet. The classification accuracy generally improves with the increase of $E_b/N_0$.
In practice, noise mismatch occurs easily. Therefore, some experiments were done for this common situation. Figures 10, 11, and 12 show the classification accuracies when the training and test sets do not match. Three methods are considered: GoogLeNet constellation (GC), AlexNet constellation (AC), and KD-GSENet. These results not only show the performance of the proposed method under Gaussian noise and non-Gaussian noise, but also reveal its robustness under noise mismatch.
Figure 13 shows the classification accuracy curve under noise mismatch, with Gaussian noise in training and non-Gaussian noise in testing. It can be seen that the classification accuracies of the three methods are similar when both training and testing use Gaussian noise (Train β = 2, Test β = 2), and the accuracy of the proposed method is slightly higher than that of the others.
When the noise is mismatched, the classification accuracy of the proposed method decreases slightly, while that of the other two methods decreases significantly. This result shows the robustness of the proposed method under noise mismatch, which is consistent with the conclusion of the previous experiment. Figure 14 shows the box plots under noise mismatch, which represent the variability of the results more intuitively. In the figure, the pink, blue, and yellow boxes represent the three modulation classification methods of AC, the proposed method, and GC, respectively. The proposed method (blue box) has the best classification accuracy in the different noise mismatch scenarios, which shows that it has superior robustness under noise mismatch.
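The relation between a confusion matrix and the average classification accuracy used in this section can be sketched as follows. The convention that row i holds the counts for true class i is an assumption, and the 3-class matrix is made up purely for illustration. It is not experimental data from this paper.

```python
def per_class_accuracy(confusion):
    """Row i holds the counts of true class i predicted as each class."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def average_accuracy(confusion):
    """Mean of the per-class accuracies."""
    acc = per_class_accuracy(confusion)
    return sum(acc) / len(acc)


# Made-up 3-class confusion matrix, for illustration only.
confusion = [
    [100, 0, 0],    # class 0 is always classified correctly
    [0, 90, 10],    # class 1 is sometimes mistaken for class 2
    [0, 20, 80],    # class 2 is sometimes mistaken for class 1
]
classes = per_class_accuracy(confusion)
overall = average_accuracy(confusion)
```

Off-diagonal entries concentrated within a group of similar modulations, as in rows 1 and 2 here, correspond to the intra-class confusion discussed for the higher-order schemes.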

Algorithm complexity analysis
As shown in Table 3, the total parameter size and parameter storage size of the proposed method are smaller than those of AC, since the average pooling layer is adopted and the fully connected layer used in AC is abandoned. The proposed method takes up more memory than GC because it introduces SE blocks, which increase the number of parameters. The MobileNetV3 constellation (MC) has the lightest network due to the depth-wise separable convolution and inverted residual structure. In Fig. 15, the proposed classification model is slightly inferior to the other models in training and test speed, because it has the maximum depth and more parameters. Since the difference is small, its speed can be considered comparable to that of the other models.
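The parameter overhead of the SE blocks can be estimated with a short count. Each block adds two fully connected layers whose hidden width is 1/4 of the channel count, as described for the SE module. The channel widths below are hypothetical placeholders, since the actual widths are those listed in Tables 1 and 2, and the inclusion of bias terms is an assumption.

```python
def se_parameters(channels, reduction=4, bias=True):
    """Parameters added by one SE block with two fully connected layers."""
    hidden = channels // reduction
    weights = channels * hidden + hidden * channels   # C x C/r and C/r x C matrices
    biases = (hidden + channels) if bias else 0
    return weights + biases


# Hypothetical branch widths of one sub-module, for illustration only.
branch_channels = (64, 128, 32, 32)
extra = sum(se_parameters(c) for c in branch_channels)
```

Because the count grows with the square of the channel width, the widest branches dominate the extra storage.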