Intelligent radar HRRP target recognition based on CNN-BERT model
EURASIP Journal on Advances in Signal Processing volume 2022, Article number: 89 (2022)
Abstract
Stable and reliable feature extraction is crucial for radar high-resolution range profile (HRRP) target recognition. Owing to the complex structure of HRRP data, existing feature extraction methods fail to achieve satisfactory performance. This study proposes a new deep learning model named convolutional neural network–bidirectional encoder representations from transformers (CNN-BERT), which exploits the spatio-temporal structure embedded in HRRP for target recognition. The convolutional token embedding module characterizes the local spatial structure of the target and generates sequence features by token embedding. The BERT module captures the long-term temporal dependence among range cells within HRRP through the multi-head self-attention mechanism. Furthermore, a novel cost function that simultaneously considers recognition and rejection ability is designed. Extensive experiments on measured HRRP data demonstrate the superior performance of the proposed model.
1 Introduction
The wideband radar has a high range resolution, and its echo is called the target's high-resolution range profile (HRRP). HRRP is easy to acquire and process and contains rich target structure information, such as radial size, scatterer distribution, and echo intensity. Therefore, HRRP-based radar automatic target recognition (RATR) has become a research hotspot in intelligent radar signal processing [1,2,3,4].
Several methods have been proposed in recent years to extract the spatial and temporal features of HRRP. These methods can be roughly divided into two classes. (1) Methods that treat HRRPs as random points scattered in a high-dimensional space assume that HRRPs follow specific statistical distributions and extract the corresponding spatial structural features. For example, HRRPs are assumed to follow a Gaussian distribution in [5,6,7], where the adaptive Gaussian classifier, Gaussian mixture model (GMM), and support vector machine (SVM), respectively, are used to classify unknown radar targets; in [11], HRRPs are modeled with a Gamma mixture model to describe their statistical characteristics accurately; in [12], the subspace structure of HRRP is studied, while in [13], the multi-subspace structure of HRRP is exploited. (2) Methods that view HRRP as a one-dimensional temporal sequence along the range dimension model it sequentially to extract the implicit evolution structure as features. For example, hidden Markov models (HMMs) are employed for HRRP target recognition in [8, 9]; Pan et al. characterize the spectrogram feature extracted from HRRP via the TSB-HMM model, in which multi-aspect frames of one target are learned jointly [10]; and in [14], a temporal factor analysis model is used. Although these methods achieve acceptable results in relatively simple tasks, their structural simplicity limits their descriptive ability and performance in complex scenarios.
Owing to their excellent nonlinear feature extraction ability, deep learning models have gradually become the mainstream approach in HRRP recognition, and deep learning-based methods focusing on spatial or temporal structures have recently been applied to the radar HRRP recognition field. Convolutional neural network (CNN) models extract local spatial structure features from the HRRP envelope using the convolution operation [22]. Wan et al. used a CNN to extract multi-resolution spectrogram features of HRRP and then weighted them using an attention mechanism [15]. In [32], Chen et al. integrated the target recognition and rejection tasks in a CNN by adding a deconvolution decoder. However, CNN models find it challenging to describe the global structure of HRRP because of their limited receptive field. Furthermore, failing to capture the time-series correlation across HRRP range cells causes a loss of valuable information reflecting the physical structure of the target.
Recurrent neural network (RNN) models have demonstrated excellent sequence modeling ability in natural language processing, machine translation, and speech recognition. In the RATR field, RNNs are used to extract the long-range temporal structure embedded in the HRRP sequence. In [36], Xu et al. proposed an attention-based RNN model to reveal the structural correlation inside the target. In [35], Li et al. proposed a bidirectional simple recurrent unit network (SAM-BiSRU) that extracts robust features from HRRP with good noise immunity. However, the dimension and length of input sequences in RNN models are coupled and cannot be adjusted independently, which reduces the model's flexibility. Furthermore, the long-range dependence is severely weakened as the sequence length grows, which limits the application of these deep models in HRRP target recognition.
This study overcomes the limitations of CNN- and RNN-based models by simultaneously characterizing the spatial structure of the HRRP envelope and the temporal dependence across range cells. Specifically, we develop a novel deep model named convolutional neural network–bidirectional encoder representations from transformers (CNN-BERT) for HRRP feature extraction. It comprises a convolutional token embedding module and a BERT module, and it adjusts feature importance at the back end using an attention mechanism. The main characteristics of the proposed model are summarized below:

1. The convolutional token embedding module finely describes the local spatial structure of HRRP and generates sequence features. It considerably improves the early expressive capability of the proposed model as well as the efficiency and flexibility of HRRP modeling.

2. The BERT module models the long-range temporal dependence in HRRP. To the best of our knowledge, this is the first attempt to introduce a BERT-based model into the RATR field. The multi-head self-attention mechanism in the BERT module directly describes the dependency between any two range cells, regardless of their positions, and thus captures local and global dependencies in a single step. Additionally, the BERT module is highly parallelizable.

3. In designing the cost function, recognition and rejection abilities are simultaneously considered. Furthermore, their roles can be adjusted according to the application scenario.

4. Experimental results and attention map visualization based on the measured data verify the effectiveness of the proposed method.
The remainder of this article is organized as follows. Section 2 analyzes the principles and long-range feature capture capabilities of related deep models. The proposed model is introduced in Sect. 3. Section 4 details the training and testing process. Section 5 presents the performance analysis based on various experiments. Finally, conclusions are drawn in Sect. 6.
2 Analysis of related deep learning models
Deep neural networks are currently used for HRRP recognition and have achieved good recognition performance. In particular, RNNs and CNNs can describe the interdependence among range cells and are widely used. This section analyzes the principles, structures, and long-range feature capture capabilities of CNN and RNN models to understand how well they use the time-series information across HRRP range cells.
2.1 RNN model
The RNN model is widely used for sequential data representation. However, the original RNN has only short-term memory and suffers from exploding and vanishing gradients when input sequences are relatively long [16]. The long short-term memory network (LSTM), a gated RNN, was proposed to improve long-term memory ability [17,18,19]. Figure 1 shows a schematic of the LSTM structure, comprising the input, hidden, and output layers. The hidden-layer nodes form a linear chain that propagates the extracted information from front to back in chronological order.
LSTM controls the importance weights of the current input and historical information through the input and forget gates. When important information appears in the current input, the input gate value is close to 1, while the forget gate value is close to 0, so the historical information is forgotten. The older the information, the more strongly it is forgotten. Following the principle of the gating-based RNN, we plot the dependency between the input sequences and output features in Fig. 1. The connecting lines indicate a dependency between the output and input, and their thickness indicates the dependency strength. For example, the output feature \(o_{3}\) mainly depends on the current input \(t_{3}\) and the past inputs \(t_{1}\) and \(t_{2}\). The dependency is strongest for the current input \(t_{3}\) and gradually decreases for earlier inputs.
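The forgetting behavior described above follows from the LSTM cell-state update \(c_t = f_t c_{t-1} + i_t g_t\), where \(f_t\) and \(i_t\) are the forget and input gates and \(g_t\) is the candidate input. As a minimal sketch (scalar state, hand-picked gate values rather than learned ones, output gate ignored), the code below computes how much of each input survives in the final cell state: an input from step t is rescaled by every later forget gate, so older inputs contribute less.

```python
def cell_state_contributions(g, i, f):
    """Share of each candidate input g[t] that survives in the final cell state of the
    scalar LSTM recursion c_t = f_t * c_{t-1} + i_t * g_t (with c_0 = 0)."""
    T = len(g)
    shares = []
    for t in range(T):
        share = i[t] * g[t]
        for s in range(t + 1, T):  # each later step rescales it by its forget gate
            share *= f[s]
        shares.append(share)
    return shares


def final_cell_state(g, i, f):
    """Run the recursion directly for comparison."""
    c = 0.0
    for g_t, i_t, f_t in zip(g, i, f):
        c = f_t * c + i_t * g_t
    return c


# Hypothetical values: identical inputs t1..t3, open input gates, forget gate fixed at 0.5
g, i, f = [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [0.5, 0.5, 0.5]
shares = cell_state_contributions(g, i, f)
print(shares)                                  # [0.25, 0.5, 1.0]
print(sum(shares), final_cell_state(g, i, f))  # 1.75 1.75
```

The input at \(t_1\) keeps only a quarter of its weight after two forget gates of 0.5, which is the attenuation drawn as thinner lines in Fig. 1.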
However, for HRRPs, the output at one position is related to both past and subsequent inputs. Thus, the bidirectional LSTM (BiLSTM) model (Fig. 2), which extends the direction of information transfer, is adopted [20]. Compared with LSTM, BiLSTM allows bidirectional information transfer using two hidden layers that process the sequence in opposite orders. As the dependency diagram in Fig. 2 shows, the output feature \(o_{3}\) is jointly influenced by the current input \(t_{3}\), the past inputs \(t_{1}\) and \(t_{2}\), and the future input \(t_{4}\). Therefore, BiLSTM overcomes this shortcoming of RNN and LSTM by obtaining long-range dependencies in both directions through step-by-step recursion. However, this dependency gradually weakens as the sequence length grows, limiting the sequence modeling ability of BiLSTM.
2.2 CNN model
General CNN models contain convolutional and pooling layers in their convolutional modules [21]. Because the pooling layer discards the position information of the sequence, it causes information loss. Therefore, the pooling layer is omitted when dealing with sequence modeling problems, and one-dimensional convolutional layers are stacked directly to process sequence data. A CNN with two convolutional layers is shown in Fig. 3; the dashed box indicates the location of the convolution operation.
The dependency between the input sequences and output features is shown in Fig. 3: each output neuron has the same local receptive field size and is directly associated with three input neurons, with equal dependency strength on each of them. The longest dependency distance captured by the first convolutional layer equals the kernel size (3 in Fig. 3), whereas that captured by the second convolutional layer is 5. The number of convolutional layers required to associate two inputs increases with the distance between them. Therefore, it is difficult for a CNN model to describe the global structure of HRRP, because the convolution operation captures only limited local information; connecting larger regions requires enlarging the receptive field by stacking multiple layers.
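The growth of the receptive field can be made concrete. For stride-1 one-dimensional convolutions without pooling, each layer with kernel size k widens the receptive field by k − 1 cells, so n stacked layers see 1 + n(k − 1) cells. The sketch below, which assumes exactly this setup, reproduces the 3 and 5 of Fig. 3 for k = 3 and counts how many layers are needed before one output neuron depends on two inputs a given distance apart; the 256-cell profile in the example is hypothetical.

```python
import math


def receptive_field(num_layers, kernel_size):
    """Receptive field of `num_layers` stacked stride-1 1D conv layers (no pooling)."""
    return 1 + num_layers * (kernel_size - 1)


def layers_to_link(distance, kernel_size):
    """Fewest stacked layers whose receptive field (>= distance + 1 cells) covers
    two inputs that are `distance` range cells apart."""
    return math.ceil(distance / (kernel_size - 1))


print([receptive_field(n, 3) for n in (1, 2, 3)])  # [3, 5, 7]
print(layers_to_link(255, 3))  # 128 layers to link both ends of a 256-cell profile
```

The required depth grows linearly with the distance, which is the limitation the text attributes to CNNs for global HRRP structure.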
3 HRRP recognition based on the CNN-BERT model
We propose a new deep learning framework for HRRP recognition named CNN-BERT, as shown in Fig. 4. The proposed framework contains four modules: data preprocessing, convolutional token embedding, BERT, and classifier modules. The functionality of each module is discussed in this section.
3.1 Data preprocessing module
This module addresses the intensity sensitivity and translation sensitivity of HRRP. The intensity of HRRP is affected by many factors, such as target distance, radar transmitter power, and antenna gain; thus, the intensity of the same target's HRRP differs depending on observation conditions. The preprocessing module solves this intensity sensitivity problem by \(\ell_{2}\) normalization. The raw HRRP sample can be expressed as \({\mathbf{x}} = \left[ {x_{1} , \ldots ,x_{l} , \ldots ,x_{L} } \right]\), where \(x_{l}\) denotes the magnitude of the lth range cell within HRRP, and L denotes the total number of range cells. The intensity-normalized HRRP sample \({\mathbf{x}}_{\text{norm}}\) can be expressed as follows:
\[{\mathbf{x}}_{\text{norm}} = \frac{{\mathbf{x}}}{{\left\| {\mathbf{x}} \right\|_{2} }} = \frac{{\mathbf{x}}}{{\sqrt {\sum_{l = 1}^{L} {x_{l}^{2} } } }}\]
In addition, HRRP is obtained by intercepting the radar echo with a range window. The translational motion of the target changes the position of the HRRP within the range window, a phenomenon known as the translation sensitivity of HRRP. An absolute alignment method can overcome this issue. Specifically, a cyclic shift of \({\mathbf{x}}_{\text{norm}}\) places its center of gravity \(G\) at the center of the range window, where
\[G = \frac{{\sum_{l = 1}^{L} {l\,\tilde{x}_{l} } }}{{\sum_{l = 1}^{L} {\tilde{x}_{l} } }}\]
and \(\tilde{x}_{l}\) denotes the magnitude of the lth range cell within \({\mathbf{x}}_{\text{norm}}\).
Figure 5 shows raw HRRP samples recorded consecutively for the same target and the corresponding preprocessed HRRP samples.
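The two preprocessing steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: range-cell indices are 0-based, the center of gravity is taken as the magnitude-weighted mean index, the window center is index L // 2, and the shift is rounded to the nearest whole cell.

```python
import math


def l2_normalize(x):
    """Divide the profile by its l2 norm (intensity normalization)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def center_of_gravity(x):
    """Magnitude-weighted mean range-cell index (0-based)."""
    return sum(l * v for l, v in enumerate(x)) / sum(x)


def align_center(x):
    """Cyclically shift x so its center of gravity lands near the window center L // 2."""
    L = len(x)
    shift = round(L // 2 - center_of_gravity(x))
    return [x[(l - shift) % L] for l in range(L)]


raw = [0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 4.0, 0.0]  # toy profile with L = 8
x_norm = l2_normalize(raw)                       # scatterers become 0.6 and 0.8
x_tilde = align_center(x_norm)
print([round(v, 3) for v in x_tilde])  # [0.0, 0.0, 0.0, 0.6, 0.8, 0.0, 0.0, 0.0]
```

Because the shift is cyclic, no range cell is discarded; energy that wraps past one edge of the window reappears at the other.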
3.2 Convolutional token embedding module
The convolutional token embedding module uses the convolution operation to characterize the spatial structural features of the HRRP envelope and embeds the original HRRP into sequence features that serve as the input sequences of the BERT module. This idea was inspired by [25], which shows that early convolutions help transformers see better: we use a convolution operation instead of time-domain segmentation or patchify methods to obtain the input sequences for the BERT module. The time-domain segmentation method used with RNN or LSTM causes information redundancy and couples sequence dimensionality and length [23]. The direct patchify method used by networks such as ViT [24] is implemented with a large convolutional kernel and a large stride, which departs from the typical design of convolutional layers, and the hard locality constraint it imposes in the early layer hinders the network's expressive ability. By contrast, the sequence features extracted by the convolutional token embedding module retain the local structure information in HRRP and have translation and scaling invariance. Moreover, the convolution kernel size and the number of convolution channels independently control the number and dimension of the token features, which decouples dimensionality from length. Furthermore, this way of generating sequence features avoids hard locality constraints and enhances the initial expressive ability of the network.
The convolutional token embedding module (Fig. 6) contains three parts: a convolutional layer, a batch normalization (BN) layer, and an activation layer. The preprocessed HRRP sample \({\tilde{\mathbf{x}}}\) is convolved with K one-dimensional convolutional kernels to obtain the output sequence F, computed as
\[F\left( {l,k} \right) = \left( {{\tilde{\mathbf{x}}} \otimes {\text{kernel}}(k)} \right)(l),\quad k = 1, \ldots ,K\]
where \(\otimes\) denotes the convolution operation and \({\text{kernel}}(k)\) denotes the kth convolutional kernel. \({\mathbf{F}}(l) = \left[ {F(l,1), \ldots ,F(l,K)} \right]\) denotes the token embedding vector of the lth range cell.
The output sequence F passes through the BN and activation layers to generate the sequential embedding representation \({\mathbf{F}}_{\text{embedding}}\) of HRRP. The output sequence feature map is shown in Fig. 7, where the x-axis represents the range cell dimension and the y-axis represents the feature channel dimension. The output feature of one channel is visualized on the right; it focuses mainly on the local characteristics of the target.
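A minimal pure-Python sketch of the token embedding pipeline is given below. It assumes an odd kernel size with zero "same" padding and stride 1, so every range cell yields one K-dimensional token; the convolution is computed as cross-correlation, as deep learning frameworks do; BN uses training-mode batch statistics without learned scale and shift; and ReLU stands in for the activation, which the text does not specify. The toy kernels are hypothetical (the experiments use S = 5 and K = 768).

```python
import math


def conv_tokens(x, kernels):
    """1D convolution (cross-correlation), stride 1, zero 'same' padding, odd kernel size S.

    x: L range-cell magnitudes; kernels: K lists of S weights.
    Returns L tokens, the l-th being [F(l,1), ..., F(l,K)].
    """
    L, S = len(x), len(kernels[0])
    pad = S // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [[sum(w[s] * padded[l + s] for s in range(S)) for w in kernels]
            for l in range(L)]


def batch_norm_relu(batch, eps=1e-5):
    """Per-channel BN over all samples and range cells (no learned scale/shift), then ReLU."""
    K = len(batch[0][0])
    stats = []
    for k in range(K):
        vals = [tok[k] for seq in batch for tok in seq]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats.append((mean, var))
    return [[[max(0.0, (tok[k] - m) / math.sqrt(v + eps)) for k, (m, v) in enumerate(stats)]
             for tok in seq] for seq in batch]


# Toy example: L = 5 cells, K = 2 hypothetical kernels of size S = 3
tokens = conv_tokens([0.0, 1.0, 2.0, 1.0, 0.0], [[1.0, 1.0, 1.0], [-1.0, 0.0, 1.0]])
print(tokens)  # [[1.0, 1.0], [3.0, 2.0], [4.0, 0.0], [3.0, -2.0], [1.0, -1.0]]
embedded = batch_norm_relu([tokens])  # batch of one sample: L tokens of dimension K
```

The number of kernels alone sets the token dimension, while the sequence length follows from the profile length and stride, which is the decoupling of dimension and length described in the text.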
3.3 BERT module
BERT has demonstrated superior performance and is gradually replacing RNNs in long-term dependence modeling problems [26]. The BERT module uses its deep sequence encoding capability to extract the temporal structural information embedded in the input sequences, compensating for the limited temporal modeling capability of the convolutional token embedding module. Here, the input sequence refers to the sequential embedding representation of HRRP. The BERT module comprises a positional encoding layer and \(N_{\text{bert}}\) successive encoder blocks. Each encoder block comprises a multi-head self-attention layer that aggregates the relationships among the token embedding vectors of the range cells, a feed-forward layer that extracts position-wise feature representations, and an add and norm layer. The implementation details of each layer are as follows.
3.3.1 Positional encoding
The features extracted by the convolutional token embedding module do not explicitly include the positional relationships among the range-cell token embeddings. The positional encoding technique makes full use of the sequential relationship among HRRP range cells. Sine and cosine functions encode the even and odd dimensions of the input sequences, respectively, as follows:
\[P\left( {l,2i} \right) = \sin \left( {\frac{l}{{10000^{2i/d_{\text{model}} } }}} \right),\quad P\left( {l,2i + 1} \right) = \cos \left( {\frac{l}{{10000^{2i/d_{\text{model}} } }}} \right)\]
where l denotes the index of the range cell in the input sequences and \(P\left( {l,k} \right)\) denotes the kth element of the positional encoding vector \({\mathbf{P}}(l)\) of the lth range cell, with \(0 \le l < L\) and \(0 \le k < d_{\text{model}}\).
According to the angle-addition properties of the sine and cosine functions, \({\mathbf{P}}\left( {l + i} \right)\) of the (l + i)th range cell can be expressed as a linear transformation of \({\mathbf{P}}\left( l \right)\) whose coefficients are determined by \({\mathbf{P}}\left( i \right)\), which makes relative positions easy to represent.
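The shift property follows from the angle-addition identities. The sketch below builds the sinusoidal table in its standard form (even d_model and base 10000, both assumptions) and then reconstructs \(\mathbf{P}(l+i)\) from \(\mathbf{P}(l)\) and \(\mathbf{P}(i)\) alone: each (sine, cosine) pair of \(\mathbf{P}(l)\) is rotated by the angle stored in the matching pair of \(\mathbf{P}(i)\).

```python
import math


def positional_encoding(L, d_model, base=10000.0):
    """P[l][2i] = sin(l / base^(2i/d_model)), P[l][2i+1] = cos(same angle); d_model even."""
    P = [[0.0] * d_model for _ in range(L)]
    for l in range(L):
        for k in range(0, d_model, 2):
            angle = l / base ** (k / d_model)
            P[l][k] = math.sin(angle)
            P[l][k + 1] = math.cos(angle)
    return P


def shift_by(P_l, P_i):
    """Rebuild P(l + i) from P(l) and P(i) with the angle-addition identities."""
    out = []
    for k in range(0, len(P_l), 2):
        s_l, c_l = P_l[k], P_l[k + 1]
        s_i, c_i = P_i[k], P_i[k + 1]
        out += [s_l * c_i + c_l * s_i,   # sin(a + b)
                c_l * c_i - s_l * s_i]   # cos(a + b)
    return out


P = positional_encoding(L=50, d_model=16)
err = max(abs(a - b) for a, b in zip(shift_by(P[7], P[12]), P[19]))
print(err < 1e-9)  # True
```

For a fixed offset i, the map from \(\mathbf{P}(l)\) to \(\mathbf{P}(l+i)\) is a block-diagonal rotation that does not depend on l.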
The sequence feature map \({\mathbf{F}}_{\text{conv\_emb}} = {\mathbf{F}}_{\text{embedding}} + {\mathbf{P}}\), obtained by adding the output feature of the convolution module to the positional encoding vector (Eq. 6), is shown on the right side of Fig. 8.
The texture in the feature map after positional encoding represents the unique position information, strengthening the temporal structure in the extracted HRRP features.
3.3.2 Multi-head self-attention layer
The multi-head self-attention layer captures the local and global structure of the input feature sequences and extracts long-term dependencies among HRRP range cells.
3.3.2.1 Scaled dot-product attention
The proposed framework adopts a scaled dot-product attention mechanism for fast execution and memory efficiency. A transformation layer maps the input sequences \({\mathbf{F}}_{\text{conv\_emb}} \in {\mathbb{R}}^{L \times d_{\text{model}}}\) to three different sequential vectors, i.e., the query Q, key K, and value V, as follows:
\[{\mathbf{Q}} = {\mathbf{F}}_{\text{conv\_emb}} {\mathbf{W}}_{q} ,\quad {\mathbf{K}} = {\mathbf{F}}_{\text{conv\_emb}} {\mathbf{W}}_{k} ,\quad {\mathbf{V}} = {\mathbf{F}}_{\text{conv\_emb}} {\mathbf{W}}_{v}\]
where \({\mathbf{W}}_{q} \in {\mathbb{R}}^{d_{\text{model}} \times d_{q}}\), \({\mathbf{W}}_{k} \in {\mathbb{R}}^{d_{\text{model}} \times d_{k}}\), and \({\mathbf{W}}_{v} \in {\mathbb{R}}^{d_{\text{model}} \times d_{v}}\) are three weight matrices, and \(d_{q}\), \(d_{k}\), and \(d_{v}\) are the dimensions of the query, key, and value, respectively.
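Next, as shown in Fig. 9, each query is explicitly matched with the keys by computing the product of Q and \({\mathbf{K}}^{\text{T}}\). The result is scaled by \(1/\sqrt {d_{k} }\), and a Softmax operation is applied to obtain the attention weights of the value V, also called the attention map. Combining the resulting attention weights with V yields the output features \({\mathbf{F}}_{\text{selfatt}}\):
\[{\mathbf{F}}_{\text{selfatt}} = {\text{Attention}}\left( {{\mathbf{Q}},{\mathbf{K}},{\mathbf{V}}} \right) = {\text{Softmax}}\left( {\frac{{{\mathbf{QK}}^{\text{T}} }}{{\sqrt {d_{k} } }}} \right){\mathbf{V}}\]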
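Scaled dot-product attention can be sketched with plain Python lists. The helper names below are illustrative; the function returns both the output features and the attention map, and it subtracts each row's maximum before exponentiating, a standard trick for numerical stability that does not change the Softmax result.

```python
import math


def matmul(A, B):
    """Product of matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)  # subtracting the max avoids overflow; the result is unchanged
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def scaled_dot_product_attention(Q, K, V):
    """Return (Softmax(Q K^T / sqrt(d_k)) V, attention map)."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]  # d_k x L
    scores = matmul(Q, K_T)               # L x L
    A = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(A, V), A


# The two keys are identical, so every query attends to them equally
out, A = scaled_dot_product_attention(
    Q=[[1.0, 0.0], [0.0, 1.0]],
    K=[[1.0, 0.0], [1.0, 0.0]],
    V=[[1.0, 2.0], [3.0, 4.0]],
)
print(A)    # [[0.5, 0.5], [0.5, 0.5]]
print(out)  # [[2.0, 3.0], [2.0, 3.0]]
```

Each row of the attention map sums to 1, so every output feature is a convex combination of the value vectors at all positions.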
3.3.2.2 Multi-head self-attention mechanism
HRRP data have a typical multi-subspace structure [13], whereas a single-head self-attention module can obtain only limited information from one of these subspaces. Therefore, the multi-head attention mechanism extracts features from multiple subspaces to enrich the diversity of the feature representations.
As shown in Fig. 10, Q, K, and V are projected into multiple feature subspaces by several independent attention heads simultaneously. The output vectors of the subspaces are concatenated and mapped to the final output \({\mathbf{F}}_{\text{atten}}\) as follows:
\[{\mathbf{F}}_{\text{atten}} = {\text{Concat}}\left( {{\mathbf{head}}_{1} , \ldots ,{\mathbf{head}}_{h} } \right){\mathbf{W}}^{O}\]
where h is the number of heads, \({\mathbf{head}}_{i} = {\text{Attention}}({\mathbf{QW}}_{i}^{Q} ,{\mathbf{KW}}_{i}^{K} ,{\mathbf{VW}}_{i}^{V} )\) denotes the output of the ith head, \({\mathbf{W}}_{i}^{Q} \in {\mathbb{R}}^{d_{\text{model}} \times d_{k}}\), \({\mathbf{W}}_{i}^{K} \in {\mathbb{R}}^{d_{\text{model}} \times d_{k}}\), and \({\mathbf{W}}_{i}^{V} \in {\mathbb{R}}^{d_{\text{model}} \times d_{v}}\) are the ith group of weight matrices, and \({\mathbf{W}}^{O} \in {\mathbb{R}}^{hd_{v} \times d_{\text{model}}}\) is the output projection matrix.
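The head-splitting and concatenation step can be sketched as follows. The sketch is illustrative only: it uses random weights, sets \(d_k = d_v = d_{\text{model}}/h\) (a common convention, assumed here), and passes the same matrix as Q, K, and V, as in self-attention.

```python
import math
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(c) for c in zip(*A)]


def attention(Q, K, V):
    """Softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = len(K[0])
    weights = []
    for row in matmul(Q, transpose(K)):
        scaled = [s / math.sqrt(d_k) for s in row]
        m = max(scaled)
        e = [math.exp(s - m) for s in scaled]
        z = sum(e)
        weights.append([v / z for v in e])
    return matmul(weights, V)


def multi_head(Q, K, V, WQ, WK, WV, WO):
    """Concat(head_1, ..., head_h) W^O with head_i = Attention(Q WQ[i], K WK[i], V WV[i])."""
    heads = [attention(matmul(Q, wq), matmul(K, wk), matmul(V, wv))
             for wq, wk, wv in zip(WQ, WK, WV)]
    concat = [sum((head[l] for head in heads), []) for l in range(len(Q))]  # L x (h*d_v)
    return matmul(concat, WO)  # L x d_model


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
L, d_model, h = 4, 8, 2
d_k = d_v = d_model // h
X = rand_matrix(L, d_model, rng)  # stands in for F_conv_emb
WQ = [rand_matrix(d_model, d_k, rng) for _ in range(h)]
WK = [rand_matrix(d_model, d_k, rng) for _ in range(h)]
WV = [rand_matrix(d_model, d_v, rng) for _ in range(h)]
WO = rand_matrix(h * d_v, d_model, rng)
out = multi_head(X, X, X, WQ, WK, WV, WO)
print(len(out), len(out[0]))  # 4 8
```

The output keeps the input shape L × \(d_{\text{model}}\), which is what allows encoder blocks to be stacked.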
3.3.2.3 Analysis of long-range feature extraction capability
The multi-head attention mechanism is the core operation of the BERT module. All elements of the input sequence can be fed into the multi-head attention layer simultaneously (Fig. 11, left), which makes the model highly parallelizable. Meanwhile, the output and input layers have the same feature dimension, which facilitates stacking BERT encoder blocks.
According to Eq. (8), the self-attention mechanism obtains the attention weights directly from the product of Q and K. Each element of the input sequence is compared with every other element, so the distance between any two elements is the same. Accordingly, the dependency between the input sequences and output features is drawn on the right side of Fig. 11: each output feature depends on the inputs at all positions, and the dependency does not attenuate with distance. Therefore, a single multi-head attention layer suffices for the longest dependency distance captured by the output features to span the whole sequence. Thus, the BERT module can capture both global and local features using the multi-head attention mechanism.
3.3.3 Feed-forward layer
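The feed-forward layer enhances the separability of the extracted features using two successive linear transformations with a ReLU activation in between, mapping the feature representation to a high-dimensional hidden space and back. The output of the feed-forward layer is given as follows:
\[{\text{FFN}}({\mathbf{x}}) = \max \left( {0,{\mathbf{xW}}_{1} + {\mathbf{b}}_{1} } \right){\mathbf{W}}_{2} + {\mathbf{b}}_{2}\]
where \({\mathbf{W}}_{1}\), \({\mathbf{W}}_{2}\), \({\mathbf{b}}_{1}\), and \({\mathbf{b}}_{2}\) represent the weight matrices and biases of the two linear transformations, and \(\max ( \cdot , \cdot )\) represents the element-wise maximum function.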
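Because the feed-forward layer acts on each range-cell token independently, it can be written for a single token. The small matrices below are hypothetical (the experiments use 768 × 3072 and 3072 × 768); the second token shows ReLU clipping negative hidden units to zero.

```python
def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN for one token: max(0, x W1 + b1) W2 + b2.

    W1: d_model x d_ff, W2: d_ff x d_model (lists of rows).
    """
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]


# Hypothetical sizes d_model = 2, d_ff = 3
W1 = [[1.0, -1.0, 0.0],
      [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, -1.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.0, 0.0]

tokens = [[1.0, 3.0], [2.0, 1.0]]  # two range cells, each transformed on its own
print([feed_forward(t, W1, b1, W2, b2) for t in tokens])  # [[3.0, 4.0], [2.0, 0.0]]
```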
3.3.4 Add and Norm layer
The add and norm layer performs a residual connection and a layer normalization (LN) operation. In a deep neural network, gradients gradually vanish during backpropagation, which makes it difficult to adjust the parameters of the earlier layers. Residual connections overcome the vanishing gradient problem caused by stacking multiple BERT encoder blocks and facilitate building deeper models.
Moreover, LN stabilizes the model training process. Unlike the BN used in the convolutional token embedding module, LN can address the internal covariate shift problem [27]. Specifically, BN normalizes the features of the same channel across different samples, whereas LN normalizes the features of the same sample across different channels, so its computation is independent of the batch size. The LN operation can be expressed as follows:
\[{\text{LN}}({\mathbf{x}}) = \alpha \frac{{{\mathbf{x}} - \mu }}{{\sqrt {\sigma^{2} + \varepsilon } }} + \beta\]
where x is the input of the LN layer, \(\mu\) and \(\sigma^{2}\) denote the mean and variance, respectively, \(\varepsilon\) is a very small positive number, and \(\alpha\) and \(\beta\) are the scaling and translation parameters, respectively. Let M denote the number of neurons in the LN layer; then \(\mu\) and \(\sigma^{2}\) are calculated as
\[\mu = \frac{1}{M}\sum_{i = 1}^{M} {x_{i} } ,\quad \sigma^{2} = \frac{1}{M}\sum_{i = 1}^{M} {\left( {x_{i} - \mu } \right)^{2} }\]
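The difference between the two normalizations is easiest to see by normalizing the same sample inside two different batches. In this sketch (scalar α and β for brevity, whereas practical implementations learn one per feature), the LN output of a sample depends only on that sample, while its BN output changes with the other samples in the batch.

```python
import math


def layer_norm(x, alpha=1.0, beta=0.0, eps=1e-5):
    """LN: normalize one sample over its own M features."""
    M = len(x)
    mu = sum(x) / M
    var = sum((v - mu) ** 2 for v in x) / M
    return [alpha * (v - mu) / math.sqrt(var + eps) + beta for v in x]


def batch_norm(batch, eps=1e-5):
    """BN (training mode, no learned scale/shift): normalize each feature across the batch."""
    N, M = len(batch), len(batch[0])
    out = [[0.0] * M for _ in range(N)]
    for k in range(M):
        col = [sample[k] for sample in batch]
        mu = sum(col) / N
        var = sum((v - mu) ** 2 for v in col) / N
        for n in range(N):
            out[n][k] = (batch[n][k] - mu) / math.sqrt(var + eps)
    return out


sample = [1.0, 2.0, 3.0, 4.0]
print(layer_norm(sample))                      # same whatever the batch holds
print(batch_norm([sample, [0.0] * 4])[0][0])   # about +1.0
print(batch_norm([sample, [10.0] * 4])[0][0])  # about -1.0
```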
3.4 Classifier module
The classifier module comprises attention and Softmax layers. The attention mechanism strengthens the deep features useful for recognition by assigning weights to the output features \(O_{\text{bert}}\) of the BERT module along the feature channel dimension. The features \({\mathbf{F}}_{\text{ATT}} = \{{\mathbf{F}}_{\text{ATT}} (l)\}_{l = 1}^{L}\) can thus be obtained as follows:
where \(O_{{{\text{bert}}}} \left( {l,k} \right)\) denotes the kth element in the lth range cell of the output feature vector and \(a\left( {l,k} \right)\) denotes the weight of the corresponding elements of \(O_{{{\text{bert}}}} \left( {l,k} \right)\). The proposed model can automatically learn \(a\left( {l,k} \right)\) according to the importance of the features.
Next, a linear mapping and a Softmax operation are applied to classify the feature \({\mathbf{F}}_{\text{ATT}}\). The posterior probability that x belongs to the cth target is calculated as
\[p_{c} ({\mathbf{x}}) = \frac{{\exp \left( {{\mathbf{F}}_{s} (c)} \right)}}{{\sum_{i = 1}^{C + 1} {\exp \left( {{\mathbf{F}}_{s} (i)} \right)} }}\]
where \({\mathbf{F}}_{s} = {\mathbf{W}}_{s} {\mathbf{F}}_{\text{ATT}}\), \({\mathbf{W}}_{s}\) is a weight matrix, \({\mathbf{F}}_{s} (i)\) is the ith element of the vector \({\mathbf{F}}_{s}\), C denotes the number of in-library target classes, and \(1 \le c \le C + 1\), with the (C + 1)th class representing out-of-library targets. Finally, an HRRP sample x is classified into class \(c_{0}\) as follows:
\[c_{0} = \mathop {\arg \max }\limits_{1 \le c \le C + 1} p_{c} ({\mathbf{x}})\]
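The decision rule can be sketched as follows for C in-library classes plus the out-of-library class C + 1. The scores are hypothetical values of \(\mathbf{F}_s\); classes are numbered from 1 to match the text, and a sample counts as rejected when \(c_0 = C + 1\).

```python
import math


def posterior(F_s):
    """Softmax over the C + 1 scores F_s(1), ..., F_s(C + 1)."""
    m = max(F_s)
    e = [math.exp(v - m) for v in F_s]
    z = sum(e)
    return [v / z for v in e]


def classify(F_s):
    """Return (c0, rejected): the 1-based argmax class and whether it is the out-of-library class."""
    p = posterior(F_s)
    c0 = max(range(len(p)), key=p.__getitem__) + 1
    return c0, c0 == len(p)


# C = 3 in-library targets (e.g., Yak-42, Cessna, An-26) plus class 4 for out-of-library
print(classify([2.0, 0.5, 0.1, -1.0]))  # (1, False)
print(classify([0.2, 0.1, 0.3, 2.5]))   # (4, True)
```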
3.5 Cost function
The cost function determines the behavior and performance of the model. In RATR, besides recognition performance, identifying out-of-library targets is important. Thus, we consider recognition and rejection performance simultaneously when designing the cost function. The rejection function is integrated into our model by treating the out-of-library samples as the (C + 1)th class during training. A nonnegative regularization hyperparameter λ balances the recognition and rejection abilities, and the cost function is defined as follows:
\[L = L_{\text{recognition}} + \lambda L_{\text{rejection}}\]
where \(L_{\text{recognition}} = - \frac{1}{{N_{1} }}\sum_{n = 1}^{N_{1}} {\sum_{c = 1}^{C + 1} {z_{c}^{(n)} \ln p_{c} ({\mathbf{x}}^{(n)} )} }\) and \(L_{\text{rejection}} = - \frac{1}{{N_{2} }}\sum_{n = 1}^{N_{2}} {\sum_{c = 1}^{C + 1} {z_{c}^{(n)} \ln p_{c} ({\mathbf{x}}^{(n)} )} }\). \(N_{1}\) denotes the total number of in-library samples identified as within the library plus out-of-library samples identified as outside the library; \(N_{2}\) denotes the total number of in-library samples identified as outside the library plus out-of-library samples identified as within the library. \(N = N_{1} + N_{2}\) is the total number of samples in each minibatch, and \(z_{c}^{(n)}\) is the cth element of the one-hot true label of the nth sample in the corresponding minibatch. With λ = 1, both terms are weighted equally; λ > 1 makes the model more concerned with rejection performance, and λ < 1 with recognition performance.
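One way to compute this cost on a minibatch is sketched below. It assumes the reading \(L = L_{\text{recognition}} + \lambda L_{\text{rejection}}\) and one-hot labels, so the inner sum over c keeps only the true-class term; a sample falls into the recognition group (\(N_1\)) when its predicted library membership matches the true one and into the rejection group (\(N_2\)) otherwise. The function name, the convention that an empty group contributes zero, and the example probabilities are hypothetical.

```python
import math


def recognition_rejection_loss(probs, labels, lam, C):
    """Hypothetical sketch of L = L_recognition + lam * L_rejection.

    probs  : per-sample posteriors [p_1, ..., p_{C+1}]
    labels : true 1-based classes; C + 1 marks an out-of-library sample
    """
    group1, group2 = [], []  # N1: library membership judged correctly; N2: misjudged
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=p.__getitem__) + 1
        nll = -math.log(p[y - 1])  # one-hot z keeps only the true-class term
        judged_right = (pred == C + 1) == (y == C + 1)
        (group1 if judged_right else group2).append(nll)
    L_rec = sum(group1) / len(group1) if group1 else 0.0  # empty group contributes 0 (assumption)
    L_rej = sum(group2) / len(group2) if group2 else 0.0
    return L_rec + lam * L_rej


probs = [[0.7, 0.1, 0.1, 0.1],   # in-library class 1, accepted as class 1
         [0.1, 0.6, 0.1, 0.2]]   # out-of-library (class 4), wrongly accepted as class 2
labels = [1, 4]
print(recognition_rejection_loss(probs, labels, lam=2.0, C=3))
```

With λ = 2, the misjudged out-of-library sample contributes twice as much to the loss as it would under plain cross-entropy, which pushes training toward better rejection.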
4 Training and testing procedure
The detailed training and testing flow is shown in Algorithm 1. In the training phase, we first preprocess the raw HRRP data and then initialize the model parameters. After the model is trained with the minibatch-based backpropagation (BP) algorithm, the model parameters are saved for testing. In the testing phase, we first preprocess the test HRRP samples, then propagate them forward through the model, and finally obtain the recognition results.
5 Results and discussion
5.1 Experimental dataset
The recognition performance of the proposed model is examined using measured data of three types of aircraft targets: the Yak-42 is a large jet, the Cessna Citation S/II is a small jet, and the An-26 is a medium-sized propeller aircraft [14, 29]. The division into training and test sets is consistent with that in [28]. The parameters of the radar and the aircraft targets are listed in Table 1.
Furthermore, ten classes of simulated HRRPs are generated as out-of-library training samples. Each class has 1600 HRRP samples. In addition, 1600 HRRPs of real aircraft targets are used as out-of-library test samples to evaluate rejection performance in the testing phase.
5.2 Model setup
5.2.1 Proposed model
The parameters of the CNN-BERT model are set as follows to evaluate the recognition and rejection performance of the proposed model. In the convolutional token embedding module, the kernel size S is set to 5, the number of convolutional channels K to 768, and the stride to 1. For the BERT module, the number of attention heads h is set to 8 and \(N_{\text{bert}}\) to 6. The dimensions of \({\mathbf{W}}_{1}\) and \({\mathbf{W}}_{2}\) in the feed-forward layer are set to 768 × 3072 and 3072 × 768, respectively. For the cost function, λ in Eq. (16) is set to 2. The CNN-BERT model is built with these parameters.
5.2.2 Comparative models
The proposed model is compared with several conventional HRRP recognition models: SVM and GMM among traditional models, AE and CNN among deep non-time-series models, and RNN among deep time-series models.
The SVM model is implemented using the LIBSVM toolbox with a radial basis function kernel. The GMM is implemented using the scikit-learn toolkit for Python.
The AE model contains a stack of five AEs with 300, 600, 900, 2000, and 3 neurons per layer, respectively. The CNN model consists of three convolutional layers and two fully connected layers. The convolutional layers have 8, 16, and 32 channels, respectively, each with a kernel size of 1 × 16 and a stride of 2. The two fully connected layers contain 300 and 3 neurons.
The RNN implementation is based on LSTM cells whose input sequences are extracted from HRRP samples by time-domain segmentation with a sliding window. The sliding window length and step size are set to 16 and 8, respectively.
5.3 Recognition performance evaluation
5.3.1 Experimental results using all training data
Table 2 compares the recognition accuracy on the three aircraft targets for the GMM, SVM, CNN, AE, RNN, and proposed models. Bold values in this table indicate the highest average recognition rate (ARR). Among the traditional models, the ARR of the proposed model is 4.23% higher than that of SVM, the better of SVM and GMM. Among the deep non-time-series models, its ARR is 5.90% higher than that of AE, the better of AE and CNN. Compared with the RNN model, its ARR is improved by 5.77%.
Practical engineering applications require the recognition performance to meet minimum standards, which means balancing the recognition performance across target types in addition to the overall performance. Therefore, we compare the recognition balance of each method by analyzing the confusion matrices in Fig. 12. For the proposed method, the difference between the An-26 and Cessna aircraft, which have the highest and lowest recognition accuracy, respectively, is only 1.10%. Thus, the proposed method models the characteristics of the three aircraft in a more balanced manner. Although the overall recognition accuracy of each comparative model exceeds 90%, only the SVM model exceeds 90% accuracy for every aircraft type. Even so, the SVM model frequently confuses Cessna and An-26 with each other, and the difference between Yak-42 and Cessna, which have the highest and lowest recognition accuracy, respectively, is 6.47%. The SVM model thus fails to extract the unique attributes of each class of aircraft target, widening the gap between Cessna and An-26 and causing uneven recognition performance. The problem is considerably more evident in the CNN, GMM, AE, and RNN models. In contrast, our proposed model integrates the local structural features of targets and the long-range features between range cells, fusing multilevel physical structure features for recognition. Owing to its strong nonlinear sequence modeling ability, it extracts more separable features, and the recognition performance of each class is balanced.
5.3.2 Recognition performance evaluation with different training sample sizes
To evaluate the impact of training set size on the recognition results, we uniformly sampled the training dataset at different sampling rates and obtained four small sample sets of sizes 34,560, 8640, 2160, and 1080. The 137,880 HRRP training samples are divided into frames of 4 samples each, and one HRRP sample is randomly selected from each frame to form the first small sample set. Likewise, the four small sample sets are generated from frames of 4, 16, 64, and 128 samples, respectively.
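The frame-based subsampling can be sketched as follows. The sketch assumes that the samples are stored in acquisition order and that an incomplete final frame is dropped; the 1024-element list is a stand-in for the real training set.

```python
import random


def frame_subsample(samples, frame_size, rng):
    """Split consecutive samples into frames of `frame_size` and keep one random sample per frame."""
    return [rng.choice(samples[start:start + frame_size])
            for start in range(0, len(samples) - frame_size + 1, frame_size)]


rng = random.Random(0)
samples = list(range(1024))  # stand-in for HRRP samples stored in acquisition order
for frame_size in (4, 16, 64, 128):
    subset = frame_subsample(samples, frame_size, rng)
    print(frame_size, len(subset))  # 4 256, then 16 64, 64 16, 128 8
```

Drawing one sample per frame keeps the subset spread evenly over the aspect angles covered by the recording, rather than concentrating it in one segment.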
Table 3 compares the recognition accuracy of the proposed model with that of the conventional models for different training set sizes. The proposed model exhibits superior recognition performance under small-sample conditions, and the smaller the number of training samples, the more prominent its advantage. In particular, when the number of training samples is 1080, the proposed model reaches an ARR of 96.20%, whereas the ARRs of the SVM, GMM, AE, CNN, and RNN methods are lower by 24.91%, 9.09%, 13.58%, 14.61%, and 12.10%, respectively. Moreover, the recognition accuracies of the other models decline significantly as the amount of training data decreases. Thus, the proposed model is well suited to small-sample problems.
5.4 Rejection performance evaluation
We integrate the out-of-library rejection task into the recognition model by introducing an importance parameter λ. Introducing out-of-library samples in the training phase is expected to widen the margin between in-library and out-of-library samples without reducing the discriminability of the in-library samples. Therefore, we use λ as a weight that balances the importance of the in-library recognition loss against that of the out-of-library rejection loss.
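The sketch below makes the weighting idea concrete. It combines a cross-entropy recognition loss on in-library samples with a rejection loss on out-of-library samples. It does not reproduce the paper's cost function, which is defined earlier in the paper; in that function, λ = 1 corresponds to pure recognition. The rejection term here is an assumption, an outlier-exposure style stand-in that pushes out-of-library outputs toward a uniform class distribution. The plain weight `w` is likewise hypothetical and is not the paper's λ.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the class axis."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def weighted_loss(logits_in, labels_in, logits_out, w):
    """Hypothetical objective: recognition loss + w * rejection loss."""
    # Recognition term: mean cross-entropy of the in-library samples.
    lp_in = log_softmax(np.asarray(logits_in, dtype=float))
    rec = -lp_in[np.arange(len(labels_in)), labels_in].mean()
    # Rejection term (stand-in): KL(uniform || softmax) for out-of-library samples.
    # It is zero when the network spreads its belief evenly over all classes.
    lp_out = log_softmax(np.asarray(logits_out, dtype=float))
    k = lp_out.shape[1]
    rej = (-np.log(k) - lp_out.mean(axis=1)).mean()
    return rec + w * rej
```

Raising `w` trades in-library fit for flatter outputs on out-of-library inputs. A weight that is too large starves the recognition term, which matches the qualitative trend reported for λ below.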
An inappropriate setting of λ can degrade the model's recognition and rejection performance. We therefore first analyze the impact of λ on the proposed model before comparing the rejection performance of different models. We use the ARR to evaluate recognition performance and the area under the receiver operating characteristic (ROC) curve (AUC) to evaluate rejection performance [30,31,32,33,34]. Fig. 13 plots the AUC and recognition accuracy for different values of λ. The parameter λ influences both recognition and rejection performance, and rejection performance is more sensitive to variations in λ. When λ = 1, the model degenerates into a conventional model that performs recognition only; its recognition accuracy is 0.987, and its AUC is 0.82. As λ increases, the model focuses more on rejection performance. At \(\lambda = 2\), both recognition and rejection performance reach their peak values, with a recognition accuracy of 0.998 and an AUC of 0.98. As λ increases further, both recognition and rejection performance decrease. The model becomes overly concerned with rejection and assigns too small a loss weight to recognition, which degrades both.
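The AUC used above has a convenient rank interpretation. It is the probability that a randomly chosen out-of-library sample receives a higher rejection score than a randomly chosen in-library sample, with ties counted as one half. The sketch below computes it this way. It assumes each model outputs one scalar rejection score per sample, where a higher score means more likely out-of-library. The score definition and the function name are illustrative assumptions.

```python
import numpy as np

def auc_rank(scores_out, scores_in):
    """Mann-Whitney estimate of the ROC AUC.
    Positives = out-of-library samples, negatives = in-library samples."""
    s_out = np.asarray(scores_out, dtype=float)[:, None]  # column vector
    s_in = np.asarray(scores_in, dtype=float)[None, :]    # row vector
    # Compare every (out, in) pair; a tie counts as half a win.
    wins = (s_out > s_in).sum() + 0.5 * (s_out == s_in).sum()
    return wins / (s_out.size * s_in.size)

print(auc_rank([0.9, 0.4], [0.5, 0.1]))  # 0.75
```

An AUC of 1 means every out-of-library sample is scored above every in-library sample. An AUC of 0.5 corresponds to chance-level ranking.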
To illustrate the influence of λ on rejection performance, Fig. 14 compares the ROC curves obtained with the optimal λ and with a non-optimal λ. The AUCs with \(\lambda = 2\) and \(\lambda = 2.5\) are 0.98 and 0.85, respectively. Because λ significantly impacts rejection performance, we set \(\lambda = 2\) for the subsequent evaluation of rejection performance.
Figure 15 shows the ROC curves of all models, quantifying their rejection performance. The AUC values are 0.98, 0.12, 0.93, 0.65, 0.59, and 0.73 for our proposed model \((\lambda = 2)\) and the GMM, SVM, CNN, RNN, and AE models, respectively. Our proposed model rejects out-of-library samples better than the other models. Two measures enhance the rejection performance of the proposed model while preserving its recognition performance: introducing out-of-library samples in the training phase, and adjusting the cost function with the importance parameter λ.
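ROC curves such as those in Fig. 15 are obtained by sweeping a rejection threshold over a per-sample score. The sketch below assumes each model outputs a scalar out-of-library score, where a higher score means more likely to be rejected. How each model produces that score is not specified here, and the function name is illustrative. Out-of-library samples are treated as positives.

```python
import numpy as np

def roc_points(scores_out, scores_in):
    """Return (fpr, tpr) for every distinct threshold, starting from (0, 0).
    A sample is rejected when its score >= threshold (assumed convention)."""
    s_out = np.asarray(scores_out, dtype=float)  # positives: out-of-library
    s_in = np.asarray(scores_in, dtype=float)    # negatives: in-library
    thresholds = np.unique(np.concatenate([s_out, s_in]))[::-1]  # high to low
    tpr = [0.0] + [(s_out >= t).mean() for t in thresholds]  # out-of-library correctly rejected
    fpr = [0.0] + [(s_in >= t).mean() for t in thresholds]   # in-library wrongly rejected
    return np.array(fpr), np.array(tpr)

fpr, tpr = roc_points([0.9, 0.4], [0.5, 0.1])
```

Plotting `tpr` against `fpr` gives the ROC curve, and the trapezoidal area under these points is the AUC. Under this convention, an AUC below 0.5, such as the 0.12 reported for GMM, means the score ranks in-library samples above out-of-library samples more often than not.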
5.5 Visualization
5.5.1 Visualization of long-range dependency
We also provide an intuitive and effective way to inspect how the long-range dependency varies across layers: we visualize the attention maps of the BERT module using Eq. (8). Attention values indicate the strength of the interdependence between range cells and the importance of each range cell. In Fig. 16, both the horizontal and vertical axes indicate HRRP range cells, and brighter colors represent higher attention values.
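Eq. (8) is not reproduced in this section. Assuming it is the standard scaled dot-product attention weight \(\mathrm{softmax}(QK^\top/\sqrt{d_k})\) used in BERT, the attention map of one head can be computed as below. The sequence length, embedding size, and random projection weights are placeholders, not the trained model's values.

```python
import numpy as np

def attention_map(x, w_q, w_k):
    """Scaled dot-product attention weights for a single head.
    x: (n_cells, d_model) token embeddings of the HRRP range-cell sequence.
    Row i of the result shows how strongly query cell i attends to each key cell j."""
    q = x @ w_q                                    # queries, (n_cells, d_k)
    k = x @ w_k                                    # keys,    (n_cells, d_k)
    scores = q @ k.T / np.sqrt(q.shape[1])         # scaled similarities, (n_cells, n_cells)
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability before exp
    a = np.exp(scores)
    return a / a.sum(axis=1, keepdims=True)        # row-wise softmax: each row sums to 1

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 64))             # toy sequence: 32 range cells, 64-dim tokens
w_q, w_k = rng.normal(size=(2, 64, 16))   # hypothetical query/key projection weights
A = attention_map(x, w_q, w_k)
```

Displaying `A` as an image with range cells on both axes gives a map of the kind shown in Fig. 16.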
Attention maps of the self-attention layers in the shallow, middle, and deep BERT encoder blocks are shown in Figs. 17, 18, and 19, respectively. For brevity, we show the attention maps of only 5 of the 8 heads in each BERT encoder block. To show the interdependency among range cells more comprehensively, we also give the average map over the 8 heads, which integrates the interdependencies that different heads capture from different perspectives. In Fig. 17a–i, the interdependence learned by each head varies. The shallow BERT encoder block begins to extract long-distance features and learns relatively strong interdependency among the range cells in the HRRP support area. Compared with Fig. 17, Fig. 18 shows an expanded strong-correlation area, indicating that the middle BERT encoder block extracts long-range features better. The attention maps of the deep BERT encoder block in Fig. 19 show that important information is aggregated to specific range cells and that the attention value is independent of the query Q; accordingly, the attention map in Fig. 19f consists of vertical lines. Given the physical properties of HRRP, range cells in the support area better reflect the radial size of the target and its scattering-point distribution. This explains why important information is aggregated mainly to the range cells in the support area, with the core aggregation point at the peak position of the HRRP.
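The head-averaged map and the vertical-line pattern described above can be quantified as follows. This sketch assumes that the per-head attention maps of one layer are stacked into an array of shape (n_heads, n_cells, n_cells), with rows as queries. The toy maps are constructed by hand. The range cell with the largest mean attention could then be compared with the HRRP peak position, for example `np.argmax(np.abs(hrrp))` for a hypothetical profile array `hrrp`.

```python
import numpy as np

def head_average(maps):
    """Average the attention maps of all heads: (n_heads, n, n) -> (n, n)."""
    return np.asarray(maps, dtype=float).mean(axis=0)

def aggregation_profile(avg_map):
    """Column mean: average attention each key range cell receives over all queries.
    Column std: how much that attention varies from query to query.
    A near-zero std means vertical lines, i.e. attention independent of the query."""
    return avg_map.mean(axis=0), avg_map.std(axis=0)

# Toy layer: 8 heads that all direct every query to range cell 5.
n = 10
maps = np.zeros((8, n, n))
maps[:, :, 5] = 1.0
avg = head_average(maps)
col_mean, col_spread = aggregation_profile(avg)
print(int(col_mean.argmax()))  # 5
```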
We selected representative attention maps of the three types of aircraft to observe their commonalities and differences. In each of Figs. 20, 21, and 22, panels a–c show the average attention maps of the shallow, middle, and deep BERT encoder blocks, respectively. A comparison of Figs. 20, 21, and 22 shows that the size of the strong-correlation region differs significantly between aircraft. The Yark-42 aircraft has the largest strong-correlation region, followed by the An-26 and then the Cessna Citation S/II aircraft, which matches the actual target sizes listed in Table 1. Therefore, the attention maps extracted by the BERT module reflect target size information, indicating that the model has learned the differences in physical structure between targets.
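The comparison of strong-correlation region sizes above is made visually. One possible numerical proxy is the fraction of attention-map entries that exceed a fixed fraction of the map's maximum. This proxy is our assumption and not a measure used in the paper, and the threshold `alpha` is hypothetical.

```python
import numpy as np

def strong_region_fraction(avg_map, alpha=0.5):
    """Fraction of (query, key) entries whose attention is at least
    alpha * max(avg_map); a rough proxy for the size of the bright region.
    alpha = 0.5 is an arbitrary, hypothetical threshold."""
    a = np.asarray(avg_map, dtype=float)
    return float((a >= alpha * a.max()).mean())

# Toy maps: a 4x4 bright block versus a 2x2 bright block on a dim background.
large = np.full((10, 10), 0.01)
large[2:6, 2:6] = 1.0
small = np.full((10, 10), 0.01)
small[4:6, 4:6] = 1.0
```

Under this proxy, a physically longer target would be expected to give a larger fraction. The result depends on `alpha`, so the proxy should only be used to compare maps from the same layer.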
5.5.2 Visualization of separability
For a simple and intuitive analysis of the separability of the features extracted by different models, Fig. 23 shows PCA projections of the deep feature vectors extracted by our proposed model and by the other deep neural network models; “other” indicates out-of-library samples. PCA is performed on each model's deep features, and the 2D projection matrix is constructed from the principal components corresponding to the two largest eigenvalues. The visualizations show that, compared with the AE, CNN, and RNN models, our model yields smaller overlap regions among in-library samples and between in-library and out-of-library samples. This good separability, together with the rejection results, further verifies that the features extracted by the proposed model are suitable for recognition and rejection tasks.
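The projection step described above can be sketched with NumPy as follows. It assumes the deep features of one model are collected in an (n_samples, n_features) array. The function name and the random toy features are placeholders, not the paper's data.

```python
import numpy as np

def pca_2d(features):
    """Project features onto the principal components with the two largest eigenvalues."""
    X = np.asarray(features, dtype=float)
    Xc = X - X.mean(axis=0)                     # center each feature dimension
    cov = Xc.T @ Xc / (X.shape[0] - 1)          # sample covariance, (d, d)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns eigenvalues in ascending order
    top2 = np.argsort(eigvals)[::-1][:2]        # indices of the two largest eigenvalues
    W = eigvecs[:, top2]                        # (d, 2) projection matrix
    return Xc @ W, eigvals[top2]

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16)) * np.linspace(3.0, 0.5, 16)  # toy features, unequal variances
proj, top_eigs = pca_2d(feats)
```

Each in-library class and the “other” samples can then be drawn in the 2D plane with distinct markers to inspect their overlap.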
6 Conclusion
This study proposed an improved BERT-based deep neural network for radar HRRP target recognition. The convolutional token embedding module provides an input sequence feature that reflects the local spatial structure of the target. The BERT module describes the long-range dependency within the input sequence to extract deep temporal features. The experimental results show that the ARR of the proposed model is higher than those of the comparative models when all training samples are used, and that its recognition performance is more balanced across targets. In addition, even when the training set is reduced to 1/128 of its original size, the ARR of the proposed model for each aircraft exceeds 96%. Finally, the proposed model achieves a much higher rejection capability, with an AUC of 0.98, and can effectively handle recognition tasks in complex environments. Thus, the proposed model has excellent engineering utility and extends the application of HRRP target recognition. In future work, we will study lightweight deep learning models to further improve the computational and parameter efficiency of the proposed model.
Availability of data and materials
Please contact the authors for data requests.
Abbreviations
HRRP: High-resolution range profile
CNN: Convolutional neural network
RNN: Recurrent neural network
LSTM: Long short-term memory network
BERT: Bidirectional encoder representations from transformers
References
O. Karabayır, O.M. Yücedağ, M.Z. Kartal et al., Convolutional neural networks-based ship target recognition using high resolution range profiles, in 2017 18th International Radar Symposium (IRS) (IEEE, 2017), pp. 1–9
L. Du, H. Liu, Z. Bao, Radar HRRP statistical recognition: parametric model and model selection. IEEE Trans. Signal Process. 56(5), 1931–1944 (2008)
J. Lundén, V. Koivunen, Deep learning for HRRP-based target recognition in multistatic radar systems, in 2016 IEEE Radar Conference (RadarConf) (IEEE, 2016), pp. 1–6
L. Du, H. He, L. Zhao et al., Noise robust radar HRRP target recognition based on scatterer matching algorithm. IEEE Sens. J. 16(6), 1743–1753 (2015)
F. Chen, Q.Y. Hou, H.W. Liu et al., New adaptive angular-sector segmentation algorithm for radar ATR based on HRRP. J. Xidian Univ. 36(3), 410–417 (2009)
J. Wang, Z. Liu, T. Li et al., Radar HRRP target recognition via statistics-based scattering centre set registration. IET Radar Sonar Navig. 13(8), 1264–1271 (2019)
L.E.I. Lei, X.D. Wang, Y.Q. Xing et al., Multi-polarized HRRP classification by SVM and DS evidence theory. Control Decis. 28(6), 861–866 (2013)
L. Du, P. Wang, H. Liu et al., Radar HRRP target recognition based on dynamic multitask hidden Markov model, in 2011 IEEE RadarCon (RADAR) (IEEE, 2011), pp. 253–255
J. Tu, T. Huang, X. Liu et al., A novel HRRP target recognition method based on LSTM and HMM decisionmaking, in 2019 25th International Conference on Automation and Computing (ICAC) (IEEE, 2019), pp. 1–6
M. Pan, P.H. Wang, H.W. Liu et al., Radar HRRP target recognition based on truncated stick-breaking hidden Markov model. J. Electron. Inf. 35(7), 1547–1554 (2013)
L. Du, H. Liu, Z. Bao et al., A two-distribution compounded statistical model for radar HRRP target recognition. IEEE Trans. Signal Process. 54(6), 2226–2238 (2006)
D. Zhou, X. Shen, G. Wang et al., Orthogonal kernel projecting plane for radar HRRP recognition. Neurocomputing 106, 61–67 (2013)
D. Zhou, Orthogonal maximum margin projection subspace for radar target HRRP recognition. EURASIP J. Wirel. Commun. Netw. 2016(1), 1–11 (2016)
L. Shi, P. Wang, H. Liu et al., Radar HRRP statistical recognition with local factor analysis by automatic Bayesian Ying-Yang harmony learning. IEEE Trans. Signal Process. 59(2), 610–617 (2010)
J. Wan, B. Chen, Y. Yuan et al., Radar HRRP recognition using attentional CNN with multi-resolution spectrograms, in 2019 International Radar Conference (RADAR) (IEEE, 2019), pp. 1–4
R. Pascanu, T. Mikolov, Y. Bengio, On the difficulty of training recurrent neural networks, in International Conference on Machine Learning (PMLR, 2013), pp. 1310–1318
S. Kanai, Y. Fujiwara, S. Iwamura, Preventing gradient explosions in gated recurrent units, in Advances in Neural Information Processing Systems 30 (2017)
S. Hochreiter, The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncert. Fuzziness Knowl. Based Syst. 6(02), 107–116 (1998)
M. Sundermeyer, R. Schlüter, H. Ney, LSTM neural networks for language modeling, in Thirteenth Annual Conference of the International Speech Communication Association (2012)
M. Schuster, K.K. Paliwal, Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)
J. Wan, B. Chen, Y. Liu et al., Recognizing the HRRP by combining CNN and BiRNN with attention mechanism. IEEE Access 8, 20828–20837 (2020)
J. Song, Y. Wang, W. Chen et al., Radar HRRP recognition based on CNN. J. Eng. 2019(21), 7766–7769 (2019)
M. Pan, A. Liu, Y. Yu et al., Radar HRRP target recognition model based on a stacked CNN-BiRNN with attention mechanism. IEEE Trans. Geosci. Remote Sens. 60, 1–14 (2021)
A. Dosovitskiy, L. Beyer, A. Kolesnikov et al., An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
T. Xiao, M. Singh, E. Mintun et al., Early convolutions help transformers see better. Adv. Neural. Inf. Process. Syst. 34, 30392–30400 (2021)
J. Devlin, M.W. Chang, K. Lee et al., BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
J.L. Ba, J.R. Kiros, G.E. Hinton, Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
L. Du, P. Wang, H. Liu et al., Bayesian spatiotemporal multitask learning for radar HRRP target recognition. IEEE Trans. Signal Process. 59(7), 3182–3196 (2011)
B. Feng, B. Chen, H. Liu, Radar HRRP target recognition with deep networks. Pattern Recogn. 61, 379–393 (2017)
Q. Li, B. Li, Z. Yang, Plane HRRP rejection based on SVDD technology, in 2011 3rd International Asia-Pacific Conference on Synthetic Aperture Radar (APSAR) (IEEE, 2011), pp. 1–4
D. Zhou, R. Wang, C. Zheng et al., Gamma model-based target HRRP rejection, in Proceedings of the 2012 International Conference on Information Technology and Software Engineering (Springer, Berlin, Heidelberg, 2013), pp. 349–356
X. Zhang, P. Wang, L. Du et al., New method for radar HRRP recognition and rejection based on weighted majority voting combination of multiple classifiers, in 2011 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC) (IEEE, 2011), pp. 1–4
Y. Wang, W. Chen, J. Song et al., Open set radar HRRP recognition based on random forest and extreme value theory, in 2018 International Conference on Radar (RADAR) (IEEE, 2018), pp. 1–4
J. Wan, B. Chen, B. Xu et al., Convolutional neural networks for radar HRRP target recognition and rejection. EURASIP J. Adv. Signal Process. 2019(1), 1–17 (2019)
X. Li, Z. Guo, A Bi-SRU neural network based on soft attention for HRRP target recognition, in 2019 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (IEEE, 2019), pp. 1–5
B. Xu, B. Chen, J. Wan et al., Target-aware recurrent attentional network for radar HRRP target recognition. Signal Process. 155, 268–280 (2019)
Acknowledgements
The authors would like to thank the handling editor and the anonymous reviewers for their valuable comments and suggestions for this paper. This work was supported in part by the National Natural Science Foundation under Grant No. 61701379 and the stabilization support of National Radar Signal Processing Laboratory under Grant No. KGJ202204.
Funding
This research was funded by the National Natural Science Foundation under Grant No. 61701379 and the stabilization support of National Radar Signal Processing Laboratory under Grant No. KGJ202204.
Author information
Contributions
PW and MP proposed the method and designed the experiments; PW, TC, and ST performed the experiments and wrote the paper; JD revised the paper. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
The picture materials quoted in this article have no copyright requirements, and the source has been indicated.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, P., Chen, T., Ding, J. et al. Intelligent radar HRRP target recognition based on CNN-BERT model. EURASIP J. Adv. Signal Process. 2022, 89 (2022). https://doi.org/10.1186/s13634-022-00909-9
DOI: https://doi.org/10.1186/s13634-022-00909-9
Keywords
 High-resolution range profile (HRRP)
 Convolutional neural network (CNN)
 Bidirectional encoder representations from transformers (BERT)
 Attention mechanism
 Intelligent target recognition