Skip to main content

Seal call recognition based on general regression neural network using Mel-frequency cepstrum coefficient features


In this paper, general regression neural network (GRNN) with the input feature of Mel-frequency cepstrum coefficient (MFCC) is employed to automatically recognize the calls of leopard, ross, and weddell seals with widely overlapping living areas. As a feedforward network, GRNN has only one network parameter, i.e., spread factor. The recognition performance can be greatly improved by determining the spread factor based on the cross-validation method. This paper selects the audio data of the calls of the above three kinds of seals and compares the recognition performance of three machine learning models for inputting MFCC features and low-frequency analyzer and recorder (LOFAR) spectrum. The results show that at the same signal-to-noise ratio (SNR), the recognition result of the MFCC feature is better than that of the LOFAR spectrum, which is verified by statistical histogram. Compared with other models, GRNN for inputting MFCC features has better recognition performance and can still achieve effective recognition at low SNRs. Specifically, the accuracy is 97.36%, 93.44%, 92.00% and 88.38% for cases with an infinite SNR and SNR of 10, 5 and 0 dB, respectively. In particular, GRNN has the least training and testing time. Therefore, all results show that the proposed method has excellent performance for the seal call recognition.

1 Introduction

Marine mammal acoustics is a science that studies the acoustic behavior of marine mammals. Different from the land–air environment, the energy of light, heat, electromagnetic wave, and other forms will decay rapidly in the marine water environment, but the acoustic signal can spread effectively over a long distance. Therefore, most marine mammals evolved the ability of underwater voice to achieve the purposes of communication, individual recognition, navigation and positioning, foraging for food, and so on. Thus, best method to study marine mammals is to use passive acoustic monitoring (PAM) technology to record and analyze their underwater acoustic signals.

Almost all seals can make sounds underwater, and the vocal behavior is closely related to the habitat and activities of seals during the whole life cycle. Seals spend most of their time underwater to complete breeding, predation, and other behaviors. The traditional visual monitoring of seal behavior depends on weather, sea conditions, light, and other conditions, so it is difficult to track and monitor seals for a long time and in a wide range. However, PAM technology can get rid of most of the limitations of traditional visual monitoring, which can be effectively used for the investigation and research of seals. Watkins et al. analyzed the air and underwater vocalization of the Ross seal and found that its underwater vocalization includes pulses similar to that in the air. The difference is that the center frequency and bandwidth of underwater vocalization are large [1]. Mossbridge et al. [2] found through spectrum analysis that when leopard seals and killer whales in Antarctica appear in similar sea areas, the two animals will use frequency modulation to avoid the crossover of their vocal frequencies. Rogers et al. [3] studied the seals and Ross seals in Antarctica and found that the vocalization of the two seals was related to factors (e.g., day, night, and seasons). Risch et al. [4] studied the sounds of bearded seals in different Arctic waters and found significant differences in the vocal characteristics of different groups of bearded seals. Frouin-Mouy et al. [5] found that bearded seals vocalize more at night than during the day and more in winter than in summer. Halliday et al. [6] studied the seasonal vocal patterns of marine mammals (e.g., bearded seals, spotted seals, and whales underwater) and obtain the importance of sea ice to seal call. Cziko et al. [7] found that seals can emit ultrasonic waves close to 50 kHz through the broadband data collected underwater for a long time.

Underwater sound is collected by PAM technology, and acoustic features are extracted and input into a machine learning model, which has been widely used in the detection and recognition of marine mammals. Potter et al. used an artificial neural network to detect bowhead whales. The results show that the performance of this method is better than the time series matched filter and hidden Markov model [8]. Mouy et al. [9] used image processing methods for spectrum noise reduction and realized the detection and classification of blue whales and baleen whales through time–frequency analysis technology. Halkias et al. [10] established the restricted Boltzmann machine (RBM) model to realize the classification of baleen whales under a low signal-to-noise ratio (SNR). Parisi et al. [11] distinguished the underwater calls of bearded seals following different center frequencies and bandwidths. Frouin-Mouy et al. [12] used a spectrum map to realize the automatic detection of spotted seals and studied the distribution pattern of spotted seals in the Bering Sea, Chukchi Sea, and Beaufort Sea. Luo et al. proposed the local energy normalization method for inputting underwater sound data spectrum of different lengths. The convolutional neural network (CNN) model was applied for the effective detection of the echolocation sound of odontocetes [13]. Zhong et al. [14] established a transfer learning model and input the extracted small sample underwater sound spectrum for the detection of beluga whales. Shiu et al. [15] built various machine learning models to detect the phonation of North Atlantic right whales. The results show that the CNN model can significantly improve accuracy. Mishachandar et al. [16] applied the CNN model to marine noise recognition, effectively distinguishing different human, natural, and marine animal sounds. Escobar-Amado et al. [17] realized the effective classification of the sound of bearded seals by extracting the region of interest of bearded seals in the spectrum and using the CNN model.

Few studies related to seals, especially studies on the automatical recognition of seal calls, are noted in the field of underwater sound recognition of marine mammals. When different kinds of seals live in a similar sea area, the recognition of seal calls is of great significance to the study of abundance estimation and population structure. Therefore, a method of seal call effective recognition at different SNRs is essential. Machine learning is a data-driven method that has emerged in recent years. It can automatically extract the most representative abstract features of the category as the basis for recognition through a series of deep extractions. It is often of high accuracy and good noise resistance, realizing the intellectualization of underwater acoustic recognition. Most of the existing acoustic signal recognition methods based on machine learning models are complex and require training a large number of network parameters. This makes it difficult to find the optimal structure to adapt to the recognition task, and it is easy to overfit. Besides, the training time is often long.

In this study, a general regression neural network (GRNN) method using Mel-frequency cepstrum coefficient (MFCC) features is proposed to accurately recognizing seal calls. The GRNN method has only one network parameter. The optimal network architecture can be obtained by determining the spread factor, and the training time is short. The MFCC features can extract the feature information of the low-frequency part more effectively, and the frequency of the seal calls is mainly concentrated in the low-frequency part. Based on the above considerations, the GRNN model for inputting MFCC features combines the feature advantages of MFCC and the nonlinear fitting ability of the GRNN in order to improve the recognition performance for seal calls. Figure 1 shows the whole structure of our framework.

Fig. 1
figure 1

The whole structure of the proposed method


MFCC is a widely used feature in speech recognition. It is proposed based on the characteristics of the human ear. Due to the particularity of human ear structure, the listener can automatically separate the low- and high-frequency part of speech, in which the low-frequency part recognizes speech characteristics. Based on this, the characteristics of the human ear can be simulated and effective spectrum features extracted (i.e., convert the spectrum into Mel spectrum) by setting denser filters in the low-frequency part and fewer filters in the high-frequency part. Cepstrum is used in log functions to transform multiplicative signals into additive signals to reflect the low-frequency envelope spectrum characteristics and high-frequency detail characteristics of the signal. Through cepstrum analysis of the Mel spectrum, MFCC can be obtained and widely used in underwater target recognition [18].

Figure 2 shows the MFCC feature extraction process, which is mainly composed of preprocessing, fast Fourier transform (FFT), Mel filtering, and discrete cosine transform (DCT).

Fig. 2
figure 2

MFCC feature extraction flow chart

The preprocessing process includes pre-emphasis, framing, and windowing. Pre-emphasis enables the spectrum of the signal to be gentler by raising the spectrum of the high-frequency part of the signal. Framing can divide the signal into several short-term signals in which the signal can be regarded as a stationary process. In the process of framing, the method of overlapping segmentation is generally adopted to make the frame to frame excessively smooth. Windowing reduces the truncation effect of the signal. Thus, the signal and the window function are set as \(s(n)\) and \(w(n)\), respectively. The signal obtained after windowing is as follows:

$$s^{\prime } (n) = s(n)w(n),\;0 \le n \le N - 1$$

where \(N\) is the number of samples and \(w(n)\) is the Hamming window.

After preprocessing, FFT is performed on each frame signal to obtain the spectrum. The discrete spectrum \(S_{a}^{^{\prime}} (k)\) of the signal can be expressed as

$$S_{a}^{^{\prime}} (k) = \sum\limits_{n = 0}^{N - 1} {s^{^{\prime}} (n)} e^{ - j2\pi k/N} ,0 \le k \le N$$

The spectrum is then filtered through a group of triangular band-pass filters to obtain Mel filters. Moreover, M filters exist, and the center frequency is \(f(m)\), of which \(m = 1,2, \cdots ,M\). The formula of triangular filter is [19]:

$$H_{m} (k) = \left\{ {\begin{array}{*{20}l} {0,} \hfill & {k < f\left( {m - 1} \right)} \hfill \\ {\frac{{2\left( {k - f\left( {m - 1} \right)} \right)}}{{\left( {f\left( {m + 1} \right) - f\left( {m - 1} \right)} \right)\left( {f\left( m \right) - f\left( {m - 1} \right)} \right)}},} \hfill & {f\left( {m - 1} \right) \le k \le f\left( m \right)} \hfill \\ {\frac{{2\left( {f\left( {m - 1} \right) - k} \right)}}{{\left( {f\left( {m + 1} \right) - f\left( {m - 1} \right)} \right)\left( {f\left( m \right) - f\left( {m - 1} \right)} \right)}},} \hfill & {f\left( m \right) \le k \le f\left( {m + 1} \right)} \hfill \\ {0,} \hfill & {k > f\left( {m + 1} \right)} \hfill \\ \end{array} } \right.$$

The logarithmic energy output by each filter is calculated, which is expressed as follows:

$$S^{*} (m) = \ln \left( {\sum\limits_{k = 0}^{N - 1} {\left| {S_{a}^{\prime } (k)} \right|H_{m} (k)} } \right),0 \le m \le M$$

Perform DCT on the M logarithm energies calculated by Formula (4) to obtain the MFCC of order L (L = 12–16), where the DCT is [20]:

$$C_{n} = \sum\limits_{m = 0}^{N - 1} {S_{m}^{*} \cos \frac{n*(m - 0.5)*\pi }{M}} ,n = 1,2, \ldots ,L$$

In practical application, the cepstrum difference parameter (delta cepstrum) is calculated following the value of L MFCC cepstrum coefficients, which is expressed as [18]:

$$d_{n} = \left\{ {\begin{array}{*{20}l} {C_{n + 1} - C_{n} ,} \hfill & {n < K} \hfill \\ {\frac{{\sum\nolimits_{k = 1}^{K} {k\left( {C_{n + k} - C_{n - k} } \right)} }}{{\sqrt {2\sum\nolimits_{k = 1}^{K} {k^{2} } } }},} \hfill & {else} \hfill \\ {C_{n} - C_{n - 1} ,} \hfill & {n \ge L - K} \hfill \\ \end{array} } \right.$$

where d represents the nth first-order difference result, \(C_{n}\) represents the nth cepstrum coefficient calculated by Formula (5), L represents the order when calculating MFCC, and K represents the time difference of the first-order derivative, which can be taken as 1 or 2. The second-order difference result can be obtained when the calculation result is brought into Formula (6).

Generally, MFCC and the first- and second-order cepstrum difference parameters are combined as the eigenvector of the signal.


GRNN is a feedforward neural network. This neural network is based on kernel regression analysis [21] and has a good nonlinear mapping ability. The input characteristic is one-dimensional sequence. GRNN obtains the conditional probability density function by calculating the input and output of the training data and the input of the test data to further obtain the output of the test data.

3.1 Theoretical basis

The theoretical basis of the GRNN model is kernel regression analysis. Kernel regression analysis is a kind of nonlinear regression analysis. \(f(x,y)\) is assumed to be the joint probability density function of random variables x and y. The regression of non-independent variable y to independent variable x (conditional mean) calculates y with the maximum probability value. Then, the regression of y to x is as follows:

$$E(\left. y \right|x) = \frac{{\int_{ - \infty }^{\infty } {yf(x,y)dy} }}{{\int_{ - \infty }^{\infty } {f(x,y)dy} }}$$

GRNN estimates \(f(x,y)\) according to the training data. Let \(X_{i}\)(\(i = 1,2, \cdots ,N\)) and \(Y_{i}\)(\(i = 1,2, \cdots ,N\)) be the observed values of the training samples of x and y, respectively, and X and Y are the observed values of the test samples of x and y, respectively. Then, the estimated density function can be provided according to the Parzen nonparametric estimation [22]:

$$\hat{f}(X,Y) = \frac{1}{{N(2\pi )^{(M + 1)/2} \sigma^{M + 1} }}\sum\limits_{i = 1}^{N} {\exp \left[ { - \frac{{(X - X_{i} )^{T} (X - X_{i} )}}{{2\sigma^{2} }}} \right]} \exp \left[ { - \frac{{(Y - Y_{i} )^{2} }}{{2\sigma^{2} }}} \right]$$

where N is the number of training samples, M is the dimension of random variable x, and \(\sigma\) is the spread factor.

By substituting \(\hat{f}(X,Y)\) into \(f(x,y)\) in Formula (7) and exchanging the order of integration and summation, \(\hat{Y}(X)\) can be calculated as follows:

$$\hat{Y}(X) = \frac{{\sum\limits_{i = 1}^{N} {\exp \left[ { - \frac{{(X - X_{i} )^{T} (X - X_{i} )}}{{2\sigma^{2} }}} \right]\int_{ - \infty }^{\infty } {y\exp \left[ { - \frac{{(Y - Y_{i} )^{2} }}{{2\sigma^{2} }}} \right]dy} } }}{{\sum\limits_{i = 1}^{N} {\exp \left[ { - \frac{{(X - X_{i} )^{T} (X - X_{i} )}}{{2\sigma^{2} }}} \right]\int_{ - \infty }^{\infty } {\exp \left[ { - \frac{{(Y - Y_{i} )^{2} }}{{2\sigma^{2} }}} \right]dy} } }}$$

Given that \(\int_{ - \infty }^{\infty } {\lambda \exp ( - \lambda^{2} )d\lambda } = 0\), Formula (9) can be simplified as [23]

$$\hat{Y}(X) = \frac{{\sum\limits_{i = 1}^{N} {Y_{i} \exp \left[ { - \frac{{(X - X_{i} )^{T} (X - X_{i} )}}{{2\sigma^{2} }}} \right]} }}{{\sum\limits_{i = 1}^{N} {\exp \left[ { - \frac{{(X - X_{i} )^{T} (X - X_{i} )}}{{2\sigma^{2} }}} \right]} }}$$

According to Formula (10), \(\hat{Y}(X)\) can be regarded as the weighted average of the observed values (\(Y_{i}\)) of all the training samples, and the weight of each observed value (\(Y_{i}\)) is the exponent of the square of the Euclidean distance between the observed value of the corresponding training sample (\(X_{i}\)) and the observed value (X) of the test sample. When \(\sigma\) approaches zero, \(\hat{Y}(X)\) is approximate to observation value \(Y_{i}\) of the training sample corresponding to the observation value (\(X_{i}\)) nearest observation value X. When \(\sigma\) is very large, \(\hat{Y}(X)\) is approximate to the mean value of all training sample observations \(Y_{i}\). \(\sigma\) controls the width of the RBF and determines the fitting degree of the network. It is the only parameter to be optimized. Only when the spread factor is appropriate can we obtain a satisfactory fitting result.

3.2 Network structure

GRNN includes the input, pattern, summation, and output layers. GRNN only needs one network parameter, while other neural network models generally need to select multiple parameters. Therefore, GRNN has obvious advantages in network construction. Figure 3 shows the structure of the GRNN model. The purpose of each layer of GRNN is as follows:

Fig. 3
figure 3

Structure diagram of GRNN model

The input layer passes the input vector to the pattern layer.

The number of neurons in the pattern layer is consistent with the number of training samples (N), and each training sample has a corresponding neuron. Neuron Gi is the exponent of the square of the Euclidean distance between training data input Xi and test data input X [23]:

$$G_{i} = \exp \left[ { - \frac{{(X - X_{i} )^{T} (X - X_{i} )}}{{2\sigma^{2} }}} \right],\;i = 1,2, \ldots ,N$$

where \(\sigma\) is the spread factor to be selected.

The summation layer includes two types of neurons: the D and S neurons. The D neuron is the arithmetic summation of all the neurons in the pattern layer, and the S neuron is the weighted summation of all the neurons in the pattern layer (the weight between the ith neuron in the pattern layer and the kth neuron in the summation layer is yik). D and S neurons can be expressed as [21]

$$D = \sum\limits_{i = 1}^{N} {G_{i} } .$$
$$S_{k} = \sum\limits_{i = 1}^{N} {y_{ik} } G_{i}$$

The number of neurons in the output layer is dimension K of the output vector, and the output of the corresponding neurons is calculated by dividing the corresponding pattern layer (Sk) by D:

$$\hat{y}_{k} (X) = \frac{{S_{k} }}{D},\;k = 1,2, \ldots ,K$$

3.3 Spread factor determination

The GRNN model has only one network parameter, i.e., spread factor. Only by optimizing the spread factor can the training performance of the network be improved. Generally, the smaller the spread factor is, the better the fitting degree is, but an extremely small spread factor can easily lead to overfitting. In this study, the input feature of the GRNN model is the extracted one-dimensional MFCC feature, and the k-fold cross-validation is used to determine the optimal spread factor. The steps are as follows:

(1) The value range of the spread factor is determined, e.g., from 0.1 to 1.0 with an increment of 0.1.

(2) Recognition accuracy is taken as the measurement index of the recognition result, and the formula is as follows:

$${\text{Accuracy}} = \frac{{\overline{N} }}{N} \times 100{\text{\% }}$$

where N is the number of samples and \(\overline{N}\) is the number of correctly recognized samples in N samples.

(3) For k-fold cross-validation, randomly dividing the training samples into k folds with onefold used as validation set and the other k-1 folds used as training set.

(4) For each spread factor, building the network with training set using this spread factor and applying the resulting network for validation set and recording the accuracy; repeating this process k times so that all the k folds are used as the validation set once and computing the average of all the k accuracies (namely, as the averaged accuracies).

(5) The procedure (4) is repeated for all the spread factors, and the spread factor corresponding to the minimum value of the averaged accuracies is taken as the optimal spread factor.

3.4 Computer configuration

The proposed method was developed on a workstation with 11th Gen Intel(R) Core(TM) i7-1165G7 CPU*8. The code was written using MATLAB R2020b (

4 Results and discussion

4.1 Dataset

In this study, calls of leopard, ross, and weddell seals are conducted. Figure 4 shows the distribution maps of leopard, ross, and weddell seals. It shows that the above three kinds of seals have extensive overlap living areas in Antarctica. Therefore, it is of great significance to realize the recognition of seal calls automatically.

Fig. 4
figure 4

Distribution maps of leopard, ross, and weddell seals [24]

The data used in this study are the audio signals of leopard, ross, and weddell seals in Watkins marine mammal sound database [25]. This database provides a wide range of sounds from the 1940’s to the 2000’s and contains approximately 2000 unique recordings of over 60 species of marine mammals, which can be used as a reference dataset for the detection of marine mammals in PAM data collected worldwide. Figure 5 shows the geographic regions collected by the Watkins database, and the seal audio used in this study is from the elliptical area. Figure 6 shows the waveforms of the calls of different kinds of seals.

Fig. 5
figure 5

Geographic regions collected by the Watkins database [25]

Fig. 6
figure 6

Sound waveforms of different kinds of seal calls

4.2 Extraction of input features

The data used in this study are from ‘Best of Cuts’ in Watkins marine mammal sound database. The sound clips of ‘Best of Cuts’ are considered to be of higher quality and lower noise. The sampling rates of leopard, ross and weddell seal data are 5000 Hz, 20000 Hz and 20480 Hz, respectively. On the one hand, the sampling rate needs to be unified. On the other hand, the sound characteristics of seal calls are concentrated in the low-frequency part. Therefore, all sounds are resampled at 5000 Hz. For the MFCC feature and LOFAR spectrum extracted from the resampled audios, data is divided into frames with a window length of 25 ms for 125 samples with a step size of 10 ms for 50 samples. Table 1 shows the information of raw recordings of seal calls and number of samples after framing.

Table 1 The information of raw recordings of seal calls

In the process of extracting MFCC features of each segment of data, 24 groups of filters are set, and their first- and second-order difference coefficients are calculated, so that each segment of data can obtain a 1 × 36 feature vector. This is used as the feature input of a single sample. Leopard, ross, and weddell seals all take the MFCC feature of the 2.5-s sound data as an example as shown in Fig. 7a–c. In this study, 5218, 7194 and 4296 MFCC features are extracted from the calls of leopard seal, ross seal and weddell seal, respectively, with a total of 16,708 MFCC features. We randomly select 75% of each category for training and the rest for testing. A total of 12,531 training samples and 4177 test samples are obtained.

Fig. 7
figure 7

MFCC features of different kinds of seal calls

In this study, low-frequency analyzer and recorder (LOFAR) spectrum of signal energy variation characteristics in the time–frequency domain can be obtained by continuous sampling of signal and short-time Fourier transform (STFT) of continuous signal samples. For STFT, the frequency interval of STFT is set to 10 Hz. The LOFAR spectrum of the three kinds of seal calls are shown in Fig. 8. It can be seen that the spectrum characteristics change significantly with time. For LOFAR spectrum, the input feature is the amplitude of all frequency points under each time frame, and its dimension is 1 × 251. Similarly, a total of 16,708 LOFAR spectrum are extracted. When allocating training and test samples, ensure that the time frame corresponding to the training and test data of LOFAR spectrum is consistent with the MFCC features, so that the subsequent recognition performance comparison between the MFCC characteristics and the LOFAR spectrum is meaningful.

Fig. 8
figure 8

LOFAR spectrum of different kinds of seal calls

The statistical histograms of MFCC features and LOFAR spectrum are shown in Fig. 9. The different colors represent three types of seals; Fig. 9a shows that the three types of seal calls have obvious differences in terms of MFCC, which are specifically expressed in the location, shape, skewness and kurtosis of the distributions. Figure 9b shows that the calls of leopard and weddell seals are very similar in terms of LOFAR spectrum, which greatly reduces the recognition performance.

Fig. 9
figure 9

Statistical histograms of MFCC features and LOFAR spectrum

4.3 Design of recognition model

In practical application, the ocean often has different degrees of environmental noise. This paper analyzes the recognition performance of this method at different SNRs. Gaussian white noise is added to the original audio signals. Taking the calls of leopard seals as an example, Fig. 10 shows a comparison of the waveform of the original data and the corresponding samples at different SNRs. It can be obtained that the lower the SNR, the more burrs appear in the signal.

Fig. 10
figure 10

Waveform at different SNRs

In this study, three machine learning models, namely GRNN, support vector machine (SVM) and CNN, are designed to study the recognition performance with the input feature of MFCC and LOFAR spectrum at different SNRs.

For the GRNN model, the spread factor is determined using tenfold cross-validation. The spread factor range is set to be \(\left[ {0.01:0.01:0.1} \right.\)\(\left. {0.2:0.1:1.0} \right]\). Figure 11a–b, respectively, shows the relationship between the spread factor and the recognition performance of the GRNN method using MFCC features and LOFAR spectrum at different SNRs. The results show that the recognition performance of the MFCC features and LOFAR spectrum decreases with the increase in the spread factor, and the optimal spread factor is concentrated between 0 and 0.1. Compared with LOFAR spectrum, GRNN method with MFCC feature input feature has higher average accuracy after determining the optimal spread factor.

Fig. 11
figure 11

Determination of the optimal spread factor of GRNN at different SNRs

For the SVM model, it solves the inner product operation in high-dimensional space by introducing kernel function [26]. In this study, the radial basis function (RBF) is used as the kernel function of the SVM model.

RBM is a kind of randomly generated neural network that can obtain the probability distribution by training the data set. RBM contains a visible layer and a hidden layer. In this paper, RBM is combined with the softmax activation function, and the number of hidden units is set to 200 [10].

CNN is a special neural network. The convolution layer extracts the local features of the data through the convolution kernel while reducing the impact of unrelated factors [27]. Most of the CNN models used in the field of underwater acoustic recognition of marine mammals have a relatively few number of layers, including 2–4 convolution layers [13, 15, 17, 28]. The CNN classification model built in this paper has three convolution layers. Convolution kernel sizes are all set to 1 × 2, and number of convolution kernels is set of 128, 128 and 256, respectively. After each convolution layer, the rectified linear unit (ReLU) is used as the activation function. The pooling layer reduces the dimensionality based on the local correlation of the feature data using the average-pooling method while keeping the feature scale constant. An average-pooling layer is set behind each ReLU, and the size of the pooling filter is set of 1 × 2. The features extracted through the convolution and pooling operations are passed through the fully connected layers, which are connected to the output layer. The CNN structure is shown in Fig. 12. The obtained model is trained by Stochastic Gradient Descent with Momentum (SGDM) algorithm with the learning rate of 0.0001. The minibatch size is set of 10. The CNN model is trained until the loss function converges. The cross-entropy is used as the loss function.

Fig. 12
figure 12

CNN structure

In 2015, He et al. [29] studied the CNN model with residual connection in depth, which can be used to build a deeper network architecture. Residual CNN is a widely used and state-of-the-art model. For residual CNN, this paper uses the typical ResNet18 as the recognition model. The network parameter configuration is consistent with the above common CNN.

Tables 2 and 3 show the training and testing time of different models with the input feature of MFCC feature and LOFAR spectrum, respectively. The results show that since GRNN has only one network parameter, and compared with LOFAR spectrum, the MFCC feature has fewer parameters, the GRNN method for inputting MFCC feature has the least training and testing time, while the ResNet18 model has the longest time due to the most network parameters.

Table 2 The training time (s) of different models
Table 3 The testing time (s) of different models

4.4 Recognition results

Figure 13 shows the comparison of the three machine learning methods with the input feature of MFCC and LOFAR spectrum of the original data. The results show that in the scene without added noise, GRNN method has better recognition performance than other methods for the same input features, and the accuracy of GRNN method with the input features of MFCC reach 97.36%. When MFCC features are given as input, the recognition performance of RBM, CNN and SVM are similar, and that of ResNet18 is slightly better. When inputting LOFAR spectrum, the recognition performance of CNN and ResNet18 is obviously better than that of SVM. The recognition accuracy of RBM is only 88.53%, while that of GRNN, CNN, ResNet18 and SVM models are 93.41%, 92.39%, 92.08% and 90.12%, respectively, that is, RBM has the worst recognition performance. The MFCC features are obtained according to the human ear auditory mechanism, which can extract the feature information of the low-frequency part more effectively, and the frequency of the seal calls is mainly concentrated in the low-frequency part, so the recognition performance of MFCC features is obviously better than that of the LOFAR spectrum for the same method.

Fig. 13
figure 13

Comparison of the recognition results using MFCC features and LOFAR spectrum of the original data

Tables 4, 5, 6, 7 and 8 show the recognition accuracy of GRNN, CNN, ResNet18, SVM and RBM models, respectively, inputting MFCC features and LOFAR spectrum at different SNRs. A, B, and C represent the leopard, ross, and weddell seals, respectively.

Table 4 Accuracy (%) of seal calls recognition by the GRNN model
Table 5 Accuracy (%) of seal calls recognition by the CNN model
Table 6 Accuracy (%) of seal calls recognition by the ResNet18 [29] model
Table 7 Accuracy (%) of seal calls recognition by the SVM [26] model
Table 8 Accuracy (%) of seal calls recognition by the RBM [10] model

Figure 14 shows the change in the accuracy curves with SNR using MFCC features as input features. These models can effectively recognize seal calls at high SNRs, but the reduction of SNR has a great impact on the RBM method. The error of the RBM method is large and cannot realize effective recognition when the SNR is 0 dB. Compared with RBM, SVM and ResNet18, the CNN method is less affected by the reduction of SNR, but the results of CNN method also have a large error at low SNRs, especially at 0 dB. Because the texture, gradient and other features of underwater acoustic signal are not obvious, namely there is less detailed information, too deep network is easy to cause gradient disappearance and overfitting during the training. Therefore, compared with traditional CNN, the estimation performance of ResNet18 is poor at low SNRs. The SNR reduction has the least impact on the GRNN method. The GRNN method can realize accurate recognition when the SNR exceeds 5 dB, and it can still realize approximate recognition when the SNR is 0 dB.

Fig. 14
figure 14

Comparison of the five models with the input feature of MFCC

Figure 15 shows the change in the accuracy curves with SNR using LOFAR spectrum as input features. For the GRNN, ResNet18 and CNN models, inputting the LOFAR spectrum can realize effective recognition at high SNRs, but the error is large at low SNRs. For the RBM and SVM method, inputting the LOFAR spectrum can only realize approximate recognition, which is no longer applicable at low SNRs. Especially for RBM, it is difficult to determine the optimal network parameters, which leads to the convergence of the training to the local optimal solution. It is also easy to overfit. In addition, RBM belongs to the traditional shallow neural network, which is difficult to extract deep features. Therefore, compared with other models, its recognition results are poor at low SNRs.

Fig. 15
figure 15

Comparison of the five models with the input feature of LOFAR spectrum

Figure 16 shows the recognition accuracy of different methods with the input feature of MFCC and LOFAR spectrum. Due to the poor recognition performance of RBM and SVM, only GRNN, ResNet18 and CNN are shown here. The results show that the underwater acoustic features have a greater impact on the recognition results than the recognition models. At different SNRs, the recognition results of MFCC features using various models are better than those of LOFAR spectrum.

Fig. 16
figure 16

Comparison of the various models with the input feature of MFCC and LOFAR spectrum

In general, because the GRNN model for inputting MFCC features combines the feature advantages of MFCC and the nonlinear fitting ability of the GRNN, the seal call recognition can be realized stably at various SNRs, and its recognition accuracy of the test samples is better than that of other models at the same SNR. Few studies related to seals, especially studies on the automatical recognition of seal calls, are noted in the field of underwater sound recognition of marine mammals. All results show that the proposed method has excellent performance for the seal call recognition, which fills the blank in this field.

5 Conclusion

This paper presents an accurate seal call recognition method based on the GRNN with the input feature of MFCC. The RBM, SVM, CNN, ResNet18 and GRNN models are compared for inputting MFCC and LOFAR spectrum. The results show that at the same SNR, the recognition result of the MFCC feature is better than that of LOFAR spectrum. The accuracy of GRNN method with the input features of MFCC of the original data reaches 97.36%. Compared with other models, the GRNN model has better recognition performance and can still realize effective recognition at low SNRs. In particular, GRNN has the least training and testing time. In addition, although this method is used for seal call recognition in this paper, it is not limited to this. In the field of passive acoustics, this method is also applicable to the preliminary work of other underwater vocal biological species observation (e.g., distance and seasonal measurement, abundance estimation, and population structure determination). It can also be used in other target sound recognition (e.g., natural sound and ship noise recognition).

Availability of data and materials

The dataset supporting the conclusions of this article is available at



General regression neural network


Mel-frequency cepstrum coefficient


Low-frequency analyzer and recorder


Signal-to-noise ratio


Passive acoustic monitoring


Convolutional neural network


Fast Fourier transform


Short-time Fourier transform


Support vector machine


Radial basis function


Rectified linear unit


Stochastic gradient descent with momentum


Restricted Boltzmann machine


  1. A.W. Watkins, In-air and underwater sounds of the Ross seal Ommatophoca rossi. J. Acoust. Soc. Am. 77(4), 1598–1598 (1985).

    Article  Google Scholar 

  2. J.A. Mossbridge, J.A. Thomas, An “acoustic niche” for Antarctic killer whale and leopard seal calls. J. Acoust. Soc. Am. 109(5), 2390–2390 (2001).

    Article  Google Scholar 

  3. T.L. Rogers, G.A. Rowney, M.B. Ciaglia, D.H. Cato, Seasonal and diurnal calling patterns of ross and leopards. J. Acoust. Soc. Am. 118(3), 1938–1938 (2005).

    Article  Google Scholar 

  4. D. Risch, C.W. Clark, P.J. Corkeron, Vocalizations of male bearded seals, Erignathus barbatus: classification and geographical variation. Anim. Behav. 73(5), 747–762 (2007).

    Article  Google Scholar 

  5. H. Frouin-Mouy, X. Mouy, B. Martin, Underwater acoustic behavior of bearded seals (Erignathus barbatus) in the northeastern Chukchi Sea, 2007–2010. Mar. Mammal Sci. 32(1), 141–160 (2016).

    Article  Google Scholar 

  6. W.D. Halliday, S.J. Insley, T. Jong, Seasonal patterns in acoustic detections of marine mammals near Sachs Harbour, Northwest Territories. Arct. Sci. 4, 259–278 (2017)

    Google Scholar 

  7. P.A. Cziko, L.M. Munger, N.R. Santos, J.M. Terhune, Weddell seals produce ultrasonic vocalizations. J. Acoust. Soc. Am. 148(6), 3784–3796 (2020).

    Article  Google Scholar 

  8. J.R. Potter, D.K. Mellinger, C.W. Clark, Marine mammal call discrimination using artificial neural networks. J. Acoust. Soc. Am. 96(3), 1255–1262 (1994).

    Article  Google Scholar 

  9. X. Mouy, M. Bahoura, Y. Simard, Automatic recognition of fin and blue whale calls for real-time monitoring in the St. Lawrence. J. Acoust. Soc. Am. 126(6), 2918–2928 (2009).

    Article  Google Scholar 

  10. X.C. Halkias, S. Paris, H. Glotin, Classification of mysticete sounds using machine learning techniques. J. Acoust. Soc. Am. 134(5), 3496–3505 (2013).

    Article  Google Scholar 

  11. I. Parisi, G. Vincenzi, M. Torri, E. Papale, Underwater vocal complexity of Arctic seal Erignathus barbatus in Kongsfjorden (Svalbard). J. Acoust. Soc. Am. 142(5), 3104–3115 (2017).

    Article  Google Scholar 

  12. H. Frouin-Mouy, X. Mouy, C.L. Berchok, S.B. Blackwell, K.M. Stafford, Acoustic occurrence and behavior of ribbon seals (Histriophoca fasciata) in the Bering, Chukchi, and Beaufort seas. Polar Biol. 42(4), 657–674 (2019).

    Article  Google Scholar 

  13. W. Luo, W. Yang, Z. Yu, Convolutional neural network for detecting odontocete echolocation clicks. J. Acoust. Soc. Am. 145(1), 7–12 (2019).

    Article  Google Scholar 

  14. M. Zhong, M. Castellote, R. Dodhia, J.L. Ferres, A. Brewer, Beluga whale acoustic signal classification using deep learning neural network models. J. Acoust. Soc. Am. 147(3), 1834 (2020).

    Article  Google Scholar 

  15. Y. Shiu, K.J. Palmer, M.A. Roch, E. Fleishman, H. Klinck, Deep neural networks for automated detection of marine mammal species. Sci. Rep. 10(1), 1–12 (2020).

    Article  Google Scholar 

  16. B. Mishachandar, S. Vairamuthu, Diverse ocean noise classification using deep learning. Appl. Acoust. (2021).

    Article  Google Scholar 

  17. C.D. Escobar-Amado, M. Badiey, S. Pecknold, Automatic detection and classification of bearded seal vocalizations in the northeastern Chukchi Sea using convolutional neural networks. J. Acoust. Soc. Am (2022).

    Article  Google Scholar 

  18. K. Yang, X. Zhou, Deep learning classification for improved bicoherence feature based on cyclic modulation and cross-correlation. J. Acoust. Soc. Am 146(4), 2201–2211 (2019).

    Article  Google Scholar 

  19. L. Muda, M. Begam, I. Elamvazuthi. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) Techniques. Ttps, (2010), 2.

  20. T. Lim, K. Bae , C. Hwang. Classification of underwater transient signals using MFCC feature vector. in 2007 9th international symposium on signal processing and its applications.

  21. D.F. Specht, A general regression neural network. IEEE Tran. Neural. Netw. 2(6), 568–576 (1991).

    Article  Google Scholar 

  22. E. Parzen, On estimation of a probability density function and mode. Ann. Math. Statist. 33(3), 1065–1076 (1962).

    Article  MathSciNet  MATH  Google Scholar 

  23. T. Masters, W. Land, A new training algorithm for the general regression neural network. IEEE 3, 1990–1994 (1997).

    Article  Google Scholar 

  24. K. Kaschner, K.Kesner-Reyes, C. Garilao, J. Segschneider, J. Rius-Barile, T. Rees, R. Froese, AquaMaps: Predicted range maps for aquatic species. World wide web electronic publication (2019).

  25. L. Sayigh, M.A. Daher, J. Allen, H. Gordon, K. Joyce, C. Stuhlmann, Fourth international conference on the effects of noise on aquatic life the watkins marine mammal sound database: an online, freely accessible resource 040013 (2019).

  26. C. Saunders, M.O. Stitson, J. Weston, Support vector machine. Com. Sci. 1(4), 1–28 (2002)

    Google Scholar 

  27. K. Fukushima, S. Miyake, T. Ito, Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Trans. Syst. Man Cybern. (1983).

    Article  Google Scholar 

  28. J. Jiang, L. Bu, F. Duan et al., Whistle detection and classification for whales based on convolutional neural networks. Appl. Acou. 150, 169–178 (2019).

    Article  Google Scholar 

  29. K. He, X. Zhang, S. Ren et al., Deep residual learning for image recognition. IEEE Conf. Com. Vis. Pat. Rec. (2016).

    Article  Google Scholar 

Download references


Not applicable.


This work was supported in part by National Key R&D Program of China (2021YFB3203001) and Shaanxi’s Young Science and Technology Star Program under Grant 2021KJXX-07.

Author information

Authors and Affiliations



All authors contributed to the conception and design of the experiments and the interpretation of simulation results. Y conceived the idea, prepared the manuscript, and conducted numerical and experimental validations. W substantially revised the manuscript, and Y and S contributed additional revisions of the text. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yong Wang.

Ethics declarations

Competing interests

The author declares that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Yao, Q., Wang, Y., Yang, Y. et al. Seal call recognition based on general regression neural network using Mel-frequency cepstrum coefficient features. EURASIP J. Adv. Signal Process. 2023, 48 (2023).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: