The proposed method is performed in two steps: in the first step, the feature extraction part, which consists of a convolutional autoencoder and a self-attention module, is trained by unsupervised learning; in the second step, the classification part, which consists of the feature extractor with fixed weights and a multilayer perceptron, is trained by supervised learning. In this section, we first introduce the convolutional autoencoder and the self-attention model that we use, and then introduce the two-step framework.
2.1 Convolutional autoencoder
The convolutional autoencoder (CAE) [34, 35] is an artificial neural network for unsupervised learning that uses convolution kernels for feature extraction. Through weight sharing and local connectivity, it reduces the number of network parameters while improving the model's ability to extract local features from the data.
The working principle of the CAE is shown in Fig. 1. The convolutional transformation from the input feature map to the hidden representation is called the convolutional encoder, and the reconstruction of the output by transposed convolution is called the convolutional decoder, where T represents the convolutional encoding operation and T' represents the convolutional decoding operation. The input feature matrix is \(x \in R^{n \times Q \times Q}\); it contains \(n\) feature matrices, each of size \({\text{Q}} \times {\text{Q}}\).
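To make the encoder/decoder structure concrete, the following is a minimal sketch of a CAE, assuming PyTorch; the layer widths, kernel sizes, and input size Q = 32 are illustrative stand-ins, not the paper's exact architecture:

```python
# Minimal convolutional-autoencoder sketch (assumes PyTorch is available).
import torch
import torch.nn as nn

class CAE(nn.Module):
    def __init__(self, n_channels: int = 1):
        super().__init__()
        # Encoder T: convolutions project the n x Q x Q input into a hidden space.
        self.encoder = nn.Sequential(
            nn.Conv2d(n_channels, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Decoder T': transposed convolutions reconstruct the input from the hidden space.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, n_channels, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

x = torch.randn(4, 1, 32, 32)   # batch of 4 single-channel feature matrices, Q = 32
x_hat = CAE()(x)
print(x_hat.shape)              # torch.Size([4, 1, 32, 32]) -- same shape as the input
```

The transposed-convolution decoder mirrors the encoder strides, so the reconstruction has the same shape as the input, which is what the MSE reconstruction loss requires.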
2.2 Self-attention mechanism
We do not want the CAE simply to copy its input to its output; instead, we impose constraints on the CAE so that the model is forced to consider which parts of the input are most critical and should be reproduced first. Examples of such constrained models include the undercomplete autoencoder, the regularized autoencoder, and the denoising autoencoder. To make the network extract better features about the location information of the source, this paper uses the self-attention (SA) mechanism to impose constraints on the CAE. SA was first applied in natural language processing [36, 37]. A traditional convolution extracts features by aggregating a local receptive field with the weights of the convolution filter, and these weights are shared across the whole feature map. In contrast, the SA module performs a weighted average over the input features, dynamically computing the attention weights from a similarity function between features [38, 39].
Because convolution and SA are different and complementary, there is potential to benefit from both paradigms by integrating them. This paper therefore combines the CAE with a residual self-attention module, yielding the CAE with residual self-attention (RACAE); introducing the residual module makes training more efficient through its identity mapping.
The model structure of SA is shown in Fig. 2. The output feature tensor \({\varvec{X}} \in {\varvec{R}}^{{C_{in} \times W \times H}}\) of the convolutional layer in the CAE is used as the input of this layer, where H and W denote the spatial dimensions of the tensor. Let \(x_{ij} \in {\varvec{R}}^{{C_{in} }}\) denote the elements of the input tensor, let \({\varvec{Y}} \in {\varvec{R}}^{{C_{out} \times W \times H}}\) denote the output feature tensor, and let \(y_{ij} \in {\varvec{R}}^{{C_{out} }}\) denote the components of the output tensor. Let:
$$\begin{array}{l} {{\varvec{q}}_{ij} = {\varvec{W}}^{q} x_{ij} } \\ {{\varvec{k}}_{ij} = {\varvec{W}}^{k} x_{ij} } \\ {{\varvec{v}}_{ij} = {\varvec{W}}^{v} x_{ij} } \\ \end{array}$$
(1)
Then, the output of the SA can be expressed as:
$$\begin{aligned} y_{ij} & = \mathop \sum \limits_{{a,b \in {\mathcal{N}}_{k} (i,j)}} {\text{Attention}}\left( {q_{ij} ,k_{ab} } \right)v_{ab} \\ & = \mathop \sum \limits_{{a,b \in {\mathcal{N}}_{k} (i,j)}} {\text{softmax}}\left( {\frac{{\left( {W^{q} x_{ij} } \right)^{T} \left( {W^{k} x_{ab} } \right)}}{\sqrt d }} \right)W^{v} x_{ab} \\ \end{aligned}$$
(2)
where \(W^{q}, W^{k}, W^{v}\) denote the weight matrices, \({\mathcal{N}}_{k} (i,j)\) denotes a local region centered at \((i,j)\) with spatial extent \(k\), the softmax term gives the attention weight of each feature in the region \({\mathcal{N}}_{k} (i,j)\), and \(d\) denotes the feature dimension of \(W^{q} x_{ij}\).
In this paper, the SA module projects the feature matrix output by the autoencoder layer Conv2 into \(Q, K, V\) using a convolution kernel of \(1 \times 2\). The attention weights are then computed and aggregated with the value matrix to extract the local features of the classified objects.
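The computation of Eqs. (1)–(2) can be sketched directly in NumPy. The projection matrices \(W^{q}, W^{k}, W^{v}\), the \(5 \times 5\) input, and the spatial extent \(k = 3\) below are random, illustrative stand-ins:

```python
# Local self-attention of Eqs. (1)-(2), written out explicitly in NumPy.
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, H, W, k = 4, 6, 5, 5, 3
X = rng.standard_normal((C_in, H, W))
W_q = rng.standard_normal((C_out, C_in))   # query projection, Eq. (1)
W_k = rng.standard_normal((C_out, C_in))   # key projection
W_v = rng.standard_normal((C_out, C_in))   # value projection
d = C_out                                  # feature dimension of W^q x_ij

def self_attention(X):
    Y = np.zeros((C_out, H, W))
    r = k // 2
    for i in range(H):
        for j in range(W):
            q = W_q @ X[:, i, j]
            # local neighbourhood N_k(i, j), clipped at the borders
            a0, a1 = max(i - r, 0), min(i + r + 1, H)
            b0, b1 = max(j - r, 0), min(j + r + 1, W)
            keys = np.stack([W_k @ X[:, a, b]
                             for a in range(a0, a1) for b in range(b0, b1)])
            vals = np.stack([W_v @ X[:, a, b]
                             for a in range(a0, a1) for b in range(b0, b1)])
            s = keys @ q / np.sqrt(d)                # scaled similarities
            w = np.exp(s - s.max()); w /= w.sum()    # softmax attention weights
            Y[:, i, j] = w @ vals                    # weighted average, Eq. (2)
    return Y

Y = self_attention(X)
print(Y.shape)   # (6, 5, 5)
```

Unlike a convolution, the weights `w` here are recomputed for every position from the input itself rather than shared across the feature map.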
2.3 Proposed model framework: RACAE-SSL
In underwater acoustics, dataset acquisition is limited, and reliable labeled data are especially scarce, so it is difficult to form "big data" conditions; this also limits the application of deep supervised learning models to underwater acoustic source localization. In this paper, we propose a two-step semi-supervised learning framework under the assumption that labeled data are insufficient while unlabeled data are relatively abundant. The specific steps are as follows:
Step 1: Training the RACAE model
The first step performs unsupervised learning on the RACAE model so as to cover the entire dataset (unlabeled data and labeled data).
The structure of RACAE is shown in Fig. 3a. The encoder consists of three convolutional modules that project the input data into the hidden space. The decoder is symmetric to the encoder and reconstructs the input data from the hidden space. The residual block with self-attention is placed between the encoder and the decoder and directs attention to the more important features. Since this step requires no category information, the whole dataset (both the unlabeled and the labeled parts) is used for training, and the loss function is the mean squared error (MSE).
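Step 1 can be sketched as an ordinary reconstruction-training loop, assuming PyTorch. The small `model` below is a placeholder for RACAE and the random tensors stand in for the pooled labeled-plus-unlabeled data; no labels appear anywhere in the loop:

```python
# Step 1 sketch: unsupervised training of the autoencoder with an MSE
# reconstruction loss (stand-in model and random data, not the paper's exact setup).
import torch
import torch.nn as nn

model = nn.Sequential(                      # placeholder for the RACAE of Fig. 3a
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()                    # reconstruction loss, no labels needed

data = torch.randn(64, 1, 16, 16)           # unlabeled + labeled samples together
for epoch in range(3):
    for batch in data.split(16):
        optimizer.zero_grad()
        loss = criterion(model(batch), batch)   # target is the input itself
        loss.backward()
        optimizer.step()
```

Because the target of the loss is the input itself, every available sample contributes to training regardless of whether it carries a range label.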
Step 2: Training the RACAE-SSL model for source localization
The second step performs supervised learning on the RACAE-SSL model.
After the RACAE model has been trained, part of the structure in Fig. 3a is taken out and its parameters are frozen to serve as the feature extraction network, which is connected to a 4-layer MLP classification network to form the RACAE-SSL model; its construction is shown in Fig. 3b. The labeled dataset is first passed through the trained feature extraction network, and the extracted features are then fed into the MLP for classification learning to accomplish the source localization task, with the cross-entropy loss function (CELF) as the loss.
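Step 2 can be sketched as follows, again assuming PyTorch; the tiny `encoder` stands in for the frozen feature-extraction part of Fig. 3a, and the class count and layer widths are illustrative assumptions:

```python
# Step 2 sketch: freeze the pretrained encoder, then train a 4-layer MLP
# classifier on labeled data with cross-entropy loss.
import torch
import torch.nn as nn

encoder = nn.Sequential(                    # stand-in for the trained feature extractor
    nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(), nn.Flatten(),
)
for p in encoder.parameters():
    p.requires_grad = False                 # fixed weights carried over from Step 1
encoder.eval()

n_classes = 10                              # e.g. discretized source positions
mlp = nn.Sequential(                        # 4-layer MLP classification head
    nn.Linear(8 * 8 * 8, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, n_classes),
)
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()           # cross-entropy loss for classification

x = torch.randn(32, 1, 16, 16)              # labeled samples only
y = torch.randint(0, n_classes, (32,))      # class indices (placeholder labels)
optimizer.zero_grad()
loss = criterion(mlp(encoder(x)), y)        # only the MLP receives gradients
loss.backward()
optimizer.step()
```

Only the MLP's parameters are passed to the optimizer, so the frozen extractor keeps the representation learned on the full dataset while the small labeled set trains just the classification head.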
2.4 Performance metrics
The evaluation metrics commonly used in traditional sound source localization are the mean absolute error (MAE) and the probability of credible localization (\(P_{CL}\)). Let the total number of samples be \(S\); the actual distance corresponding to the \(i\)th sample is \(y_{i}\), and the predicted value is \(f\left( {x_{i} } \right)\).
The MAE is calculated by the following formula:
$${\text{MAE}} = \frac{1}{S}\sum\limits_{i = 1}^{S} {\left| {y_{i} - f\left( {x_{i} } \right)} \right|}$$
(3)
\(P_{CL}\) specifies an error limit and counts all samples falling within that limit as correctly predicted, from which the localization accuracy is calculated. For example, at the 5% error limit, the localization accuracy \(P_{CL-5\%}\) is calculated as follows:
$$P_{CL-5\% } = \frac{{\sum\limits_{i = 1}^{S} {\eta \left( i \right)} }}{S}$$
(4)
where
$$\eta \left( i \right) = \left\{ \begin{gathered} 1,\quad \quad \frac{{\left| {y_{i} - f\left( {x_{i} } \right)} \right|}}{{y_{i} }} \times 100\% \le 5\% \hfill \\ 0,\quad \quad {\text{otherwise}} \hfill \\ \end{gathered} \right.$$
(5)
A smaller MAE value and a larger \(P_{CL}\) value both indicate better localization performance.
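Both metrics of Eqs. (3)–(5) reduce to a few lines of NumPy; the true and predicted ranges below are made-up values used only to illustrate the computation:

```python
# MAE (Eq. 3) and P_CL-5% (Eqs. 4-5) on a tiny made-up example.
import numpy as np

y_true = np.array([100.0, 200.0, 300.0, 400.0])   # actual distances y_i
y_pred = np.array([102.0, 230.0, 299.0, 390.0])   # predictions f(x_i)

mae = np.mean(np.abs(y_true - y_pred))            # Eq. (3)
rel_err = np.abs(y_true - y_pred) / y_true        # relative error per sample
p_cl_5 = np.mean(rel_err <= 0.05)                 # fraction within the 5% limit

print(mae)      # 10.75
print(p_cl_5)   # 0.75  (3 of the 4 samples are within 5% relative error)
```

Note that the second sample (200 vs. 230) has a 15% relative error, so it counts toward the MAE but is excluded from \(P_{CL-5\%}\).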
2.5 Difference from MFP

(1)
The execution strategies and efficiency of the algorithms are different: machine learning methods follow an offline training, online prediction strategy, whereas matched field processing (MFP) must search over candidate source positions online for every prediction.

(2)
The cost functions used for localization are different: machine learning methods mostly train with cost functions such as minimum mean square error or minimum cross-entropy, whereas matched field processing mostly adopts the method of correlation processing.