A robust modulation classification method using convolutional neural networks

Automatic modulation classification (AMC) is a core technique in noncooperative communication systems. In particular, feature-based (FB) AMC algorithms have been widely studied. Current FB AMC methods are commonly designed for a limited set of modulations and lack generalization ability; to tackle this challenge, a robust AMC method using convolutional neural networks (CNN) is proposed in this paper. In total, 15 different modulation types are considered. The proposed method can classify the received signal directly without feature extraction, because it automatically learns features from the received signals. The features learned by the CNN are presented and analyzed. The robust features of the received signals in a specific SNR range are studied. The accuracy of classification using CNN is shown to be remarkable, particularly for low SNRs. The generalization ability of robust features is also proven to be excellent using the support vector machine (SVM). Finally, to help us better understand the process of feature learning, some outputs of intermediate layers of the CNN are visualized.


Introduction
Automatic modulation classification (AMC), which identifies the modulation type of the received signal, is an essential part of noncooperative communication systems. AMC plays an important role in many civil and military applications such as cognitive radio, adaptive communication, and electronic reconnaissance.
In these systems, transmitters can freely choose the modulation type of signals; however, knowledge of the modulation type is necessary for the receivers to demodulate the signals so that the transmission can be successful. AMC is an efficient way to solve this problem with no effect on spectrum efficiency. AMC algorithms have been widely studied in the past 20 years. In general, conventional AMC algorithms can be divided into two categories: likelihood-based (LB) [1] and feature-based (FB) [2]. LB methods are based on the likelihood function of the received signal, and FB methods depend on feature extraction and classifier design. (*Correspondence: wuzhilu@hit.edu.cn; School of Electronics and Information Engineering, Harbin Institute of Technology, No. 92 West Dazhi Street, Nangang District, Harbin, China. Full list of author information is available at the end of the article.)
Although LB methods can theoretically achieve the optimal solution, they suffer from high computational complexity and require prior information from transmitters. In contrast, FB methods can obtain suboptimal solutions with much smaller computational complexity and do not depend on prior information.
Since the prior information required by LB methods is often unavailable in practice, researchers have paid more attention to FB methods over the past two decades. The two most important parts of FB methods are feature extraction and the classifier. Various types of features have been studied and used in AMC algorithms. For example, instantaneous features [3,4] were extracted from the instantaneous amplitude, frequency, and phase in the time domain. Transformation-based features were calculated from Fourier and wavelet transforms [5,6]. The high-order cumulant (HOC) features [7,8] are statistical features obtained from different orders of cumulants of the received signals. Additive white Gaussian noise (AWGN) can be mathematically eliminated in HOC features, because all cumulants of a Gaussian process above second order are zero. Cyclostationary features are based on the spectral correlation function (SCF) derived from the Fourier transform of the cyclic autocorrelation function [9,10]. The highest values of the SCF for different cyclic frequencies form the cyclic domain profile and are used to train the classifiers.
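As a concrete illustration of HOC features (our own sketch, not code from the paper), the snippet below computes the standard fourth-order cumulants C40 and C42 of a zero-mean complex symbol sequence; the function name and the test constellations are illustrative choices.

```python
import numpy as np

def hoc_features(x):
    """Fourth-order cumulant features of a zero-mean complex sequence x:
    C40 = E[x^4] - 3 E[x^2]^2 and C42 = E[|x|^4] - |E[x^2]|^2 - 2 E[|x|^2]^2."""
    c20 = np.mean(x * x)               # E[x^2]
    c21 = np.mean(np.abs(x) ** 2)      # E[|x|^2] (signal power)
    m40 = np.mean(x ** 4)
    m42 = np.mean(np.abs(x) ** 4)
    c40 = m40 - 3 * c20 ** 2
    c42 = m42 - np.abs(c20) ** 2 - 2 * c21 ** 2
    return c40, c42

# Unit-power BPSK symbols {+1, -1}: theoretical C40 = -2, C42 = -2
bpsk = np.array([1.0, -1.0] * 500, dtype=complex)
# Unit-power QPSK symbols {1, j, -1, -j}: theoretical C40 = 1, C42 = -1
qpsk = np.array([1, 1j, -1, -1j] * 250, dtype=complex)
```

Because the theoretical cumulants differ per modulation while the Gaussian noise contributes nothing above second order, these values separate modulation classes.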
The classifier is another important part of FB methods. The decision tree [3] was the most widely applied linear classifier in the early years. Linear classifiers are notably easy to implement but are not feasible for linearly inseparable problems. Many nonlinear classifiers have been applied in AMC, e.g., K nearest neighbor [11], neural networks [12], and the support vector machine (SVM) with kernels [13]. The SVM is considered to have advantages when the number of samples is limited while providing better generalization ability. Thus, the SVM has become the most widely used classifier for AMC problems in recent years.
The performance of FB methods primarily depends on the extracted feature set. Features must be manually designed to accommodate the corresponding set of modulations and channel environment and may not be feasible in all conditions. Moreover, finding effective features requires great effort. Considering these factors, deep learning (DL) methods, which can automatically extract features, have been adopted. DL is a branch of machine learning and has achieved remarkable success because of its excellent classification ability. DL has been applied in many fields such as image classification [14] and natural language processing [15]. Several typical DL networks such as the deep belief network (DBN) [16], stacked autoencoder [17], and convolutional neural network (CNN) [18] have been applied in AMC. DL networks are commonly deployed as classifiers in most current DL methods, operating on the aforementioned features. The classification accuracy of DL methods has proven to be higher than that of other classifiers, particularly when the signal-to-noise ratio (SNR) is low.
Currently, most DL-based AMC methods are still implemented in two steps: preprocessing and classification. The preprocessing can be either transforms or feature extraction. DL networks are applied as classifiers to handle the preprocessed signals. An AMC method based on a DBN was proposed in [19]. The modulation set consists of 11 modulation types, and spectra of different orders are calculated for the classification. The classification accuracy is higher than that of conventional neural networks. Zhu and Fujii proposed a high-accuracy classification scenario [20], where 10 different HOC features were extracted from 5 modulation types, and a stacked denoising autoencoder (SDAE) was used to classify these features. Mendis et al. [16] proposed a DBN-based method using the SCF of the received signals. The classification accuracy is 95% when the SNR is − 2 dB. Dai et al. [17] proposed an interclass classification method using the ambiguity function of the signal.
Stacked sparse autoencoders are deployed as its classifier. The modulation set contains 7 modulation types, and the generalization ability is also studied. The classification accuracy reaches 90.4% when the SNR is between − 10 and 0 dB. O'Shea et al. [18] trained a CNN directly with the received baseband signals, and the classification accuracy was higher than that obtained with HOC features. Some features extracted by the CNN were also displayed. A heterogeneous model based on real-measured data is proposed in [21], and the performance is enhanced by combining a CNN with recurrent neural networks.
The existing methods are all based on the assumption that the SNRs of training and testing are equal. However, SNR estimation is often inaccurate in practice, and the actual channel SNR may be unstable or rapidly varying under certain conditions. In this case, current schemes often lack generalization ability. To solve this problem, a CNN-SVM model for AMC is proposed in this paper. Considering the powerful feature-learning capability of deep learning networks, a CNN is deployed to explore new features that are suitable for classification under various SNRs. In this paper, the CNN directly handles the received signals at intermediate frequency (IF) from − 10 to 20 dB and is able to create new features that are robust to SNR variation. The generalization ability of AMC under varying SNR conditions can be significantly improved by these features. The advantages and contributions of our proposed method are stated as follows:
• Most current methods identify a limited set of modulation types, whereas the set of modulations considered in this paper is more complicated and contains 15 different types in total.
• Received signals are directly handled by the DL network at IF, whereas most existing methods still require extra processing or transformation before classifying signals.
• The method provides outstanding classification accuracy over a large SNR range, whereas most existing methods are only feasible at a certain SNR level.
• The CNN built in this paper plays the role of the feature extractor, whereas most DL methods only regard DL networks as powerful classifiers. The features learned by the CNN are displayed and analyzed. The contribution of different convolutional kernels is also visualized to better understand the feature learning process.
The remainder of the paper is organized as follows: the basic model and details of our proposed method are explained in Section 2, followed by the simulation results and discussion in Section 3. The paper is finally concluded in Section 4.

System model and proposed method
AMC is an intermediate process that occurs between signal detection and demodulation at the receiver. The structure of our proposed AMC method in comparison with the conventional ones is illustrated in Fig. 1. Preprocessing in Fig. 1 refers to sampling and quantization of the IF signals. The procedures inside the dashed frame, which include the feature extraction, feature selection, and classifier, are replaced by the CNN proposed here. The CNN is pre-trained offline with a proper number of samples before it is deployed. Furthermore, as long as the SNR range of the communication channel is known, the CNN can learn the features that adapt to the corresponding condition. This property makes our method independent of SNR estimation.

Signal model
In this paper, signals are processed at IF and are corrupted by AWGN. Then, the received signal can be denoted as

$$r(t) = s(t) + n(t)$$

where s(t) is the transmitted signal of different modulation types, n(t) is the AWGN, and the SNR is defined as $P_s / P_n$ ($P_s$ is the power of the signal and $P_n$ is the power of the noise).
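The received-signal model above can be sketched in a few lines of Python; the helper name `add_awgn`, the toy carrier, and the parameter values are our own illustrative choices, not the paper's code.

```python
import numpy as np

def add_awgn(s, snr_db, rng=None):
    """Return r(t) = s(t) + n(t), where the noise power Pn is chosen so
    that SNR = Ps / Pn matches the requested level in dB."""
    rng = np.random.default_rng(0) if rng is None else rng
    ps = np.mean(s ** 2)                        # signal power Ps
    pn = ps / (10 ** (snr_db / 10.0))           # required noise power Pn
    n = rng.normal(0.0, np.sqrt(pn), s.shape)   # AWGN n(t)
    return s + n

t = np.arange(4000) / 4000.0
s = np.cos(2 * np.pi * 50 * t)                  # a toy IF carrier
r = add_awgn(s, snr_db=10.0)                    # received signal at 10 dB SNR
```

Measuring the empirical noise power of `r - s` recovers the requested SNR to within a fraction of a dB for a few thousand samples.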
For M-ASK, M-PSK, and M-FSK (M = 2, 4, 8) signals, the transmitted signal can be written as

$$s_{\text{ASK}}(t) = A_m \sum_n a_n\, g(t - nT_s) \cos\left(2\pi f_c t + \phi_0\right)$$

$$s_{\text{PSK}}(t) = A_m \sum_n g(t - nT_s) \cos\left(2\pi f_c t + \phi_m\right)$$

$$s_{\text{FSK}}(t) = A_m \sum_n g(t - nT_s) \cos\left(2\pi f_m t + \phi_0\right)$$

where $A_m$, $a_n$, $T_s$, $f_c$, $f_m$, $\phi_0$, and $\phi_m$ are the modulation amplitude, symbol sequence, symbol period, carrier frequency at IF, modulation frequency, initial phase, and modulation phase, respectively ($f_m$ and $\phi_m$ are selected by the symbol sequence), and g(t) is the gate function represented as:

$$g(t) = \begin{cases} 1, & 0 \le t \le T_s \\ 0, & \text{otherwise} \end{cases}$$

For M-QAM (M = 4, 16, 64) signals, we have

$$s(t) = \sum_n g(t - nT_s) \left[ a_n \cos\left(2\pi f_c t\right) - b_n \sin\left(2\pi f_c t\right) \right]$$

where $a_n, b_n \in \{2m - 1 - \sqrt{M}\}$, $m = 1, 2, \ldots, \sqrt{M}$, and two carriers are modulated by $a_n$ and $b_n$, respectively.
The OFDM signal, which is the output of a multicarrier system, can be expressed as

$$s(t) = \sum_{n=0}^{N_c - 1} \left[ a_n \cos\left(2\pi f_n t\right) - b_n \sin\left(2\pi f_n t\right) \right]$$

where $a_n$ and $b_n$ are the in-phase and quadrature components of the symbol sequence on the n-th subcarrier, respectively, and $f_n$ is the frequency of the n-th subcarrier.
The LFM signal in a period is denoted as

$$s(t) = A \cos\left( 2\pi f_0 t + \pi k t^2 + \phi_0 \right), \quad 0 \le t < T$$

where k and $f_0$ are defined as the chirp rate and initial frequency, respectively. Finally, for MSK signals, we have

$$s(t) = \cos\left( 2\pi f_c t + \frac{\pi a_n(k)}{2 T_s} t + \phi_k \right)$$

where $a_n(k)$ denotes the k-th symbol in the symbol sequence, and $\phi_k$ is the phase constant of the k-th symbol.
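As an illustration of the LFM model (our own sketch with made-up parameter values), the instantaneous frequency $f_0 + kt$ sweeps linearly, which can be verified by counting zero crossings of the generated waveform:

```python
import numpy as np

def lfm(f0, k, fs, duration, phi0=0.0):
    """LFM chirp cos(2*pi*(f0*t + 0.5*k*t^2) + phi0): the instantaneous
    frequency rises linearly from f0 at the chirp rate k (Hz/s)."""
    t = np.arange(0.0, duration, 1.0 / fs)
    return np.cos(2 * np.pi * (f0 * t + 0.5 * k * t ** 2) + phi0)

# 0.1 s chirp sweeping from 100 Hz to 200 Hz (k = 1000 Hz/s) at fs = 8 kHz;
# the average frequency is 150 Hz, so roughly 30 zero crossings are expected.
x = lfm(f0=100.0, k=1000.0, fs=8000.0, duration=0.1)
```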

Convolutional neural network
CNNs are simply NNs that use convolution in place of general matrix multiplication in at least one of their layers [22]. Typical CNN architectures consist of three different types of layers: convolutional layer, pooling layer, and fully connected layer. There is an extra softmax regression layer deployed as the classifier at the last layer of the CNN in supervised learning. In this paper, we replace the fully connected layer with a global average pooling layer, so that there is no fully connected layer.

Convolutional layer
In convolutional layers, there are several convolution kernels (also known as filters) to process the received signal.
Since the received signal is a 1-dimensional vector in AMC, the kernel is also a 1-dimensional vector. Suppose that the l-th layer of an NN is a convolutional layer; $N_s$, $L_s^l$, $N_k^l$, and $L_k^l$ represent the number of inputs, the length of the input, the number of kernels, and the length of the kernels of the l-th layer, respectively. The convolution operation [23] in the l-th layer is described as follows:

$$h_k^l = f\left( x^l * W_k^l + b_k^l \right), \quad k = 1, 2, \ldots, N_k^l \tag{8}$$

where $x^l \in \mathbb{R}^{N_s \times L_s^l}$ is the set of inputs, $W^l \in \mathbb{R}^{N_k^l \times L_k^l}$ is the set of kernels, and $b^l \in \mathbb{R}^{N_k^l}$ contains the bias for each output. The output of the k-th ($k = 1, 2, \ldots, N_k^l$) kernel is denoted by (8), and $x^l * W_k^l$ is the convolution between $x^l$ and $W_k^l$. Assume that the length of the output is $L_o^l$. The output $h^l \in \mathbb{R}^{N_k^l \times L_o^l}$ is the set of outputs, which is also known as the feature map. f(·) is the activation function that achieves the nonlinear mapping of the outputs, often the sigmoid or tanh function. In this paper, the exponential linear unit (ELU) [24] is selected as the activation function, which is denoted as

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha \left( e^{x} - 1 \right), & x \le 0 \end{cases}$$

The ELU is a simple piecewise function derived from the rectified linear unit (ReLU). It is designed to overcome gradient vanishing [25] while accelerating the convergence speed.
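A minimal numpy sketch of the layer defined by (8), with ELU activation, may help make the shapes concrete. This is a toy illustration with our own helper names; like most CNN libraries, it implements "convolution" as the sliding dot product (cross-correlation).

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha*(exp(x) - 1) for x <= 0."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def conv1d_layer(x, kernels, biases):
    """One 1-D convolutional layer: slide each of the N_k kernels over the
    input vector x ('valid' positions only), add the bias, apply ELU.
    Returns the feature map of shape (N_k, L_o) with L_o = len(x) - L_k + 1."""
    n_k, l_k = kernels.shape
    l_o = x.size - l_k + 1
    h = np.empty((n_k, l_o))
    for k in range(n_k):
        for i in range(l_o):
            h[k, i] = np.dot(x[i:i + l_k], kernels[k]) + biases[k]
    return elu(h)
```

With a length-4 input and length-2 kernels, the feature map has length 3, and negative pre-activations are squashed into (−α, 0) by the ELU.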

Max-pooling and global average pooling
The pooling layer is another important type of layer in the CNN. As mentioned, the convolutional layer performs several convolutions to produce a set of outputs, each of which runs through a nonlinear activation function (ELU). Then, a pooling function is used to further modify the output of the layer. A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs [23]. Max pooling is used in this paper, which is an operation that reports the maximum output within a pooling window [26]. Assume that the output of a convolutional layer $h^l$ is max-pooled. The output $h^{l+1}$ is given by

$$h_{k,i}^{l+1} = \max_{0 \le j < p} h_{k,\, i \cdot m^{l+1} + j}^{l}$$

where p is the length of the pooling window, and $m^{l+1}$ is the margin between two adjacent pooling windows, which is also known as the stride.
Global average pooling [27] is applied after the last convolutional layer. It takes the average of each feature map, and the output vector is directly fed into the softmax layer. Similarly, we assume that the output of the former convolutional layer is $h^l$, which contains the outputs of $N_k^l$ kernels. The output of global average pooling $h_k^L$ is represented as

$$h_k^L = \frac{1}{L_o^l} \sum_{i=1}^{L_o^l} h_{k,i}^l, \quad k = 1, 2, \ldots, N_k^l$$
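Both pooling operations can be sketched in numpy (illustrative helper names, not the paper's code):

```python
import numpy as np

def max_pool1d(h, p, stride):
    """Max pooling: report the maximum within each window of length p,
    sliding the window by `stride` along every feature map (row of h)."""
    n_k, l_o = h.shape
    starts = range(0, l_o - p + 1, stride)
    return np.array([[h[k, i:i + p].max() for i in starts] for k in range(n_k)])

def global_avg_pool(h):
    """Global average pooling: one scalar per feature map (its mean),
    fed directly into the softmax layer."""
    return h.mean(axis=1)
```

For example, a single feature map [1, 3, 2, 5] max-pooled with p = 2 and stride 2 yields [3, 5], while global average pooling collapses it to the scalar 2.75.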

Batch normalization
The batch normalization (BN) layer can accelerate deep network training by reducing the internal covariate shift [28]. The internal covariate shift is defined as the change in the distribution of the output of each layer during training. The changes are commonly caused by unbalanced nonlinear mapping (e.g., ELU activation). In stochastic gradient descent, a single mini-batch is represented as $B = \{x_1, x_2, \ldots, x_m\}$, and the output $y_i$ is normalized by the BN layer. Suppose that the mean and variance of B are denoted as $\mu_B$ and $\sigma_B^2$, respectively. The procedures of BN are shown in Table 1.

Table 1 The procedures of BN
1. Calculate the mean and variance of B: $\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$, $\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_i - \mu_B\right)^2$
2. Normalize: $\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
3. Scale and shift: $y_i = \gamma \hat{x}_i + \beta$

In the BN process, the parameters $\gamma$ and $\beta$ must be learned during the training of the CNN. $\epsilon$ is a small quantity added to the variance to avoid dividing by zero. BN was deployed before the activation function when it was first proposed, but experiments show that BN should occur after the activation function [29]. As a result, BN is applied after each activation function in this paper.
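The three steps of Table 1 map directly to a few lines of numpy. This is a forward-pass sketch only; the running statistics that BN layers track for inference are omitted.

```python
import numpy as np

def batch_norm(batch, gamma, beta, eps=1e-5):
    """BN forward pass over a mini-batch B = {x_1, ..., x_m} (rows of `batch`).
    Steps: mini-batch mean/variance -> normalize -> scale and shift."""
    mu = batch.mean(axis=0)                     # step 1: mean of B
    var = batch.var(axis=0)                     # step 1: variance of B
    x_hat = (batch - mu) / np.sqrt(var + eps)   # step 2: normalize
    return gamma * x_hat + beta                 # step 3: y_i = gamma*x_hat + beta
```

After the transform, each output dimension has mean β and standard deviation (approximately) γ regardless of the input distribution, which is exactly what removes the covariate shift between layers.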

Softmax regression
The last layer of the CNN in supervised learning is the softmax regression layer. Softmax regression is a multiclass classifier generalized from logistic regression, whose output is a set of probability distributions over the different classes. Considering an n-class classification problem, the input of the softmax regression is $h^L$, which is the output of the global average pooling layer, and the output of the softmax regression $y_o$ can be denoted as:

$$y_{o,c} = \frac{\exp\left( W_c^L h^L + b_c^L \right)}{\sum_{j=1}^{n} \exp\left( W_j^L h^L + b_j^L \right)}, \quad c = 1, 2, \ldots, n$$

where $W^L$ and $b^L$ are the weights and biases between the former output and the softmax layer. The neuron with the maximum output is selected as the classification result, which is also the output of the entire CNN. The loss function of the CNN is defined as J(W, b). Then, the training process is described as

$$\left( W^*, b^* \right) = \arg\min_{W, b} J(W, b) \tag{13}$$

The problem in (13) can be solved by gradient descent. Partial derivatives are calculated using the backpropagation method [30] and used to update W and b. The process is as follows:

$$W := W - \alpha \frac{\partial J(W, b)}{\partial W}, \qquad b := b - \alpha \frac{\partial J(W, b)}{\partial b}$$

where $\alpha$ is known as the learning rate, which controls the update step of the parameters.
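The softmax output and the gradient-descent update can be sketched as follows (a single-sample toy version, not the paper's implementation; for the cross-entropy loss, the gradient of J with respect to the logits is simply $y_o$ minus the one-hot label):

```python
import numpy as np

def softmax(z):
    """Probability distribution over classes from the logits z."""
    e = np.exp(z - z.max())         # subtract max for numerical stability
    return e / e.sum()

def sgd_step(W, b, h, label, alpha=0.1):
    """One gradient-descent update of softmax regression on one sample.
    W: (n_classes, dim), b: (n_classes,), h: feature vector, label: class index."""
    y_o = softmax(W @ h + b)
    g = y_o.copy()
    g[label] -= 1.0                 # dJ/d(logits) for cross-entropy loss
    W -= alpha * np.outer(g, h)     # dJ/dW = g h^T
    b -= alpha * g                  # dJ/db = g
    return W, b
```

Repeating the update on a fixed sample monotonically increases the probability assigned to its true class, which is the behavior the minimization in (13) formalizes.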

Simulation parameters
All signals are generated based on the descriptions in Section 2, and the parameters of the modulations are shown in Table 2. The number of subcarriers in the OFDM signal is set to $N_c$, and the subcarriers are modulated by 4PSK. Additionally, we denote the SNR of the training samples and the SNR of the testing samples as $SNR_{tr}$ and $SNR_{te}$, respectively. The CNN that we built for AMC consists of 16 convolutional layers, whose structure is similar to that of VGG-19 [31] (as shown in Table 3). The input size of the CNN can be calculated as $f_s \cdot N_c / f_d = 4000$. The parameters of the convolutional layers and pooling layers in the l-th layer take the form $(N_k^l, L_k^l, L_p^l, m^l)$. Signals are normalized to [−1, 1] with zero mean and are then processed by the CNN. The CNN in this paper is implemented with the DL library Keras [32], with Theano [33] as its backend.

Classification with CNN
In this section, signals are directly classified by the CNN, and the results are taken from the final softmax layer. The process is displayed in Fig. 2. The classification accuracy under a fixed SNR level is first displayed in Table 4 for $SNR_{te} = SNR_{tr}$. We generate 20,000 training samples and 1000 testing samples for each modulation type at every SNR level. As observed from Table 4, the classification accuracy is 90% when $SNR_{tr} = -10$ dB. This finding demonstrates excellent performance for AMC methods. The classification accuracy of all individual classes reaches almost 100% when $SNR_{tr} \ge 5$ dB. Because our channel is AWGN, the signals with amplitude modulation suffer most from decreasing $SNR_{tr}$. The accuracy of 4ASK and 8ASK dramatically deteriorates when $SNR_{tr} \le 0$ dB. Only 48.8% of 4ASK signals are correctly classified under −10 dB. The accuracy of 16QAM and 64QAM also rapidly decreases when $SNR_{tr} \le -4$ dB.
The detailed classification result when SNR tr = − 10 dB is shown in Table 5. Signals with identical classes but different orders (e.g., 2ASK, 4ASK, and 8ASK) may be mixed, but there is very little interclass misclassification. For example, nearly half of 4ASK signals are classified as 2ASK and 8ASK, but all of them are M-ASK signals. The intraclass classification result for M-QAM and M-ASK signals may be unsatisfactory, whereas the interclass accuracy remains nearly 100%.
The classification accuracy of CNNs with different numbers of layers versus SNR is also provided. The result is illustrated in Fig. 3. When the SNR is low, increasing the number of layers can significantly improve the classification performance of the CNN. However, for $SNR_{te} \ge 0$ dB, a CNN with five convolutional layers can correctly classify over 99% of all signals. The results for $SNR_{tr} \ge 2$ dB are not plotted because they are above 99.5% for all three CNNs. Deeper CNNs can significantly improve the classification accuracy under low SNR conditions. The generalization ability of CNNs trained over SNR ranges is illustrated in Fig. 4.
The CNNs trained in a certain SNR range are robust to SNR variations when $SNR_{te}$ is within the range of $SNR_{tr}$. The classification accuracy is also notably close to that under $SNR_{te} = SNR_{tr}$. The generalization ability can even extend to higher SNRs when $SNR_{te}$ is outside the range of $SNR_{tr}$. For the CNN trained under [−10, 0] dB, the classification accuracy can still reach 96% under 20 dB. The CNNs can thus be made robust to SNR variation and deployed over a certain SNR range.

Feature learning with CNN
For most existing DL-based AMC methods, DL networks are treated as classifiers. However, DL networks also have a powerful capability of feature learning. Only the last layer of a CNN (the softmax layer) is a classifier; thus, the input of the softmax layer $h^L$ is equal to the features learned by the CNN. We can therefore analyze these features by observing $h^L$ (the output of the global average pooling layer). The multi-dimensional scaling (MDS) method [34] is applied to map $h^L$, which is a 60-dimensional vector, into a 2-D axis for convenient observation. Features under $SNR_{tr} = -10$ dB and $SNR_{tr} = 5$ dB are normalized and visualized in Fig. 5.
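MDS itself is standard; a classical (Torgerson) MDS can be sketched in numpy. This is our own illustration of the dimension-reduction idea, not necessarily the exact variant used in [34].

```python
import numpy as np

def classical_mds(X, dim=2):
    """Embed the rows of X into `dim` dimensions so that pairwise Euclidean
    distances are preserved as well as possible (classical/Torgerson MDS)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    n = d2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                  # centering matrix
    B = -0.5 * J @ d2 @ J                                # double-centered Gram
    w, v = np.linalg.eigh(B)                             # ascending eigenvalues
    top = np.argsort(w)[::-1][:dim]                      # keep largest `dim`
    return v[:, top] * np.sqrt(np.maximum(w[top], 0.0))
```

When the data are intrinsically low-dimensional, as when well-separated feature clusters lie near a plane, the 2-D embedding reproduces the pairwise distances almost exactly, which is why cluster structure in $h^L$ survives the projection.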
The features of 4PSK and 8PSK signals are completely mixed in Fig. 5a, which also shows why these two categories are poorly classified when $SNR_{tr} = -10$ dB. The situation is similar for 4ASK and 8ASK signals. In contrast to Fig. 5a, the distribution of CNN-learned features in Fig. 5b is much better because of the increase in SNR. Most signals in the same categories are distributed in the same cluster, and the margins among different clusters are evident, which implies that the extracted features are well suited for classification.
We obtain several CNNs that are robust to SNR variations by training them in a certain SNR range, as in the previous subsection. We can learn noise-robust features in a notably similar manner. The dimension reduction of $h^L$ is accomplished within the network itself by adding a hidden layer containing four neurons between the global average pooling layer and the softmax layer (see Fig. 6). Thus, the learned features are 4-dimensional vectors. Each dimension of the feature under different SNR levels is separately plotted ($SNR_{tr} \in [-5, 15]$ dB) in Fig. 7, where feature 1, feature 2, feature 3, and feature 4 correspond to the outputs of the four neurons. We find that for each modulation type, at least one feature rarely changes with the SNR (e.g., feature 1 and feature 4 of 2ASK, and feature 2 of OFDM and 4PSK). These features are robust to SNR variation; thus, they are expected to provide excellent generalization ability under varying $SNR_{te}$.
A linear support vector machine (SVM) is deployed to test the generalization ability of the learned features. Unlike the previous subsection, the SVM is trained with the robust features learned under a single SNR level and tested under the others (the features shown in Fig. 7). The generalization ability of the CNN-learned features is outstanding. The SVM trained for −5 dB can correctly classify 99.1% of signals under 20 dB, and the classification accuracy for signals at −5 dB is 90% for the SVM trained for 15 dB. In this way, classifiers trained with CNN-learned features can reduce their dependency on SNR estimation.
The method in [35], which focuses on selecting proper features from a manually extracted feature set under varying SNR conditions, is chosen for comparison. The signals are re-generated according to the modulation set in [35]. From Fig. 9, we can observe that the performance is significantly improved, especially under low SNRs, indicating the superiority of the CNN-learned features.

Visualization of feature learning process
We have shown that the CNN can learn efficient features for classification. In this section, the process of feature learning by the CNN is analyzed by visualizing the outputs of the intermediate layers. Each convolution kernel retains only a portion of the frequency components. The frequency component of symbol 6 (in the magenta box) is maintained in the 38th kernel of layer 6 but filtered out in the 29th kernel of layer 11; i.e., each kernel concerns only a part of the information from the received signals. By comparing Fig. 10a with Fig. 10b, we also find that feature learning becomes harder with decreasing $SNR_{tr}$. Different frequencies can be easily distinguished in layer 6 when $SNR_{tr} = 15$ dB, but the differences are not obvious when $SNR_{tr} = 2$ dB. Hence, more layers are needed when the SNR is low. Similar to the frequency information in Fig. 10a, the kernels can also learn the phase information in Fig. 10c. For the 16QAM signal in Fig. 10d, the amplitude information and phase information are recorded by the 20th kernel of layer 6 and the 33rd kernel of layer 11, respectively. Thus, symbols 1 and 14, which are modulated with different amplitudes and phases, can be distinguished.

Conclusion
In this paper, an AMC method based on a CNN has been proposed. First, we have used the CNN as a powerful classifier. In total, 15 different modulation types have been studied, and the classification for fixed SNR and generalization ability for certain SNR ranges have been considered. The numerical results show that the classification accuracy can reach 90% under − 10 dB and is notably close to 100% when the training SNR is higher than 5 dB. We have also improved the generalization ability by training the CNN under a certain SNR range. The CNN trained under [ − 10, 0] dB can correctly classify 96% of all signals when the testing SNR is 20 dB. Then, the features that the CNN learns from the received signals have been analyzed. Features are mapped to a 2-D axis using MDS, where we observe that most signals in the same categories are distributed in the same cluster. The margins among different clusters are also evident; thus, they are well suited for classification. Robust features learned under [ − 5, 15] dB are also studied. Robust features are insensitive to the SNR variation, so they have strong generalization ability. The SVM trained by these robust features under − 5 dB can correctly classify 99.1% of signals when the testing SNR is 20 dB. As a result, CNNs trained in this way can be robust to SNR variation.
Additionally, we visualize some typical outputs of the intermediate layers. We find that each kernel in the convolutional layer can learn different information from the received signal. The information includes the phase, frequency, amplitude, and other information that is difficult for us to understand.