Text-independent speaker recognition based on adaptive course learning loss and deep residual network
EURASIP Journal on Advances in Signal Processing volume 2021, Article number: 45 (2021)
Abstract
Text-independent speaker recognition is widely used for identity recognition and has a wide spectrum of applications, such as criminal investigation, payment certification, and interest-based customer services. In order to improve the recognition ability of log filter bank feature vectors, a method of text-independent speaker recognition based on a deep residual network model was proposed in this paper. The deep residual network was composed of a residual network (ResNet) and a convolutional attention statistics pooling (CASP) layer. The CASP layer could aggregate the frame-level features from the ResNet into utterance-level features. Extracting speech features for each speaker using deep residual networks was a promising direction to explore, and a straightforward solution was to train the discriminative feature extraction network by using a margin-based loss function. However, a margin-based loss function often has certain limitations, such as the margins between different categories being set to the same fixed value. Thus, we used an adaptive curriculum learning loss (ACLL) to address the problem and introduced two different margin-based losses for this problem, i.e., AM-Softmax and AAM-Softmax. The proposed method was applied to the large-scale VoxCeleb2 dataset for extensive text-independent speaker recognition experiments, and the average equal error rate (EER) reached 1.76% on the VoxCeleb1 test dataset, 1.91% on the VoxCeleb1-E test dataset, and 3.24% on the VoxCeleb1-H test dataset. Compared with related speaker recognition methods, the EER was reduced by 1.11% on the VoxCeleb1 test dataset, 1.04% on the VoxCeleb1-E test dataset, and 1.69% on the VoxCeleb1-H test dataset.
1 Introduction
Speaker recognition (SR) [1] is the process of automatically recognizing a speaker from original speech samples. It has become an increasingly important technology for recognizing identities in many electronic intelligent applications, law enforcement, and forensics [2, 3]. Speaker recognition includes speaker verification (SV) and speaker identification (SI) [4], and it can be categorized into text-dependent speaker recognition (TDSR) [5] and text-independent speaker recognition (TISR) [6]. SV aims to verify whether an utterance belongs to a specific enrolled speaker, while SI aims to identify the speaker of an unknown utterance among a specific set of enrolled speakers. For the TDSR system, the speech text during training must be identical to the speech text during testing. By contrast, for the TISR system, the speaker recognition process does not depend on the text being spoken. Therefore, the TISR case is more difficult than the TDSR case due to the larger variations introduced by different speech transcriptions and durations. In this paper, our work focuses on the TISR case, since it is more challenging and has greater practical significance.
Generally, speaker recognition tasks based on the TISR system follow a similar three-stage pipeline: (i) frame-level feature vector extraction, (ii) temporal aggregation of the frame-level feature vectors, and (iii) optimization of a classification loss. Frame-level feature vector extraction can be achieved with a backbone CNN structure, which is usually a 2D CNN with convolution in both the time domain and the frequency domain [5, 7, 8]. Utterance-level processing forms the speaker representation from the frame-level output: a pooling layer aggregates the frame-level information into an utterance-level representation. For the TDSR system, all test utterances of a speaker were present in the training dataset, so the TDSR system was equivalent to one-to-one verification, which could be regarded as a classification problem [5]. For the TISR system, the test dataset and the training dataset were disjoint. Therefore, the feature vectors of a speaker needed to be projected into a discriminative embedding space, which could be treated as a metric learning problem [7]. Generally, research methods for the TISR case were mainly realized with the original softmax loss function. However, for text-independent metric learning problems, the learned features were not discriminative enough, as was also the case for the triplet loss [8, 9].
Recently, researchers have used several margin-based loss functions to carry out speaker recognition experiments and have obtained competitive results. For example, A-Softmax [10], AM-Softmax [11, 12], and AAM-Softmax [13] could significantly increase the margins between different categories. In particular, a powerful deep speaker recognition network was proposed that used a GhostVLAD layer to aggregate the frame features of a “thin-ResNet” architecture and was trained with AM-Softmax [12]. However, the margins between different categories were set to the same fixed value, which could not adapt well to various situations. For example, the AM-Softmax and AAM-Softmax loss functions required extensive experiments to tune their two dependent hyperparameters to find the optimal values.
In addition, a clustering distance loss algorithm directly reduced the intra-class variation and expanded the margins between different categories [14]. Recently, researchers have used temporal averaging pooling (TAP) to aggregate frame-level features, where an utterance-level feature representation was formed by averaging all frame-level feature vectors [15]. However, these methods do not distinguish speech samples well. Thus, an attention mechanism was introduced to aggregate the frame-level features in deep learning models [14, 16]. By assigning different weights to different samples, it allows the model to focus on the important features. In addition, higher-order statistics were introduced into the field of speaker recognition to calculate the mean and standard deviation of the frame-level features [17]. Furthermore, the attention mechanism and the statistical method were combined into attention statistical pooling (ASP) [18], which provided an importance-weighted standard deviation and a weighted mean of the speaker features, with the importance of each sample calculated by an attention mechanism.
Therefore, a method based on the ACLL and a deep residual network was proposed for the TISR system in this paper, and the method realized a speaker recognition training strategy. The deep residual network, named ResCASP, had good temporal modeling ability and could extract effective information from speech feature vectors. The CASP layer aggregated the frame-level features of the deep residual network into utterance-level features. The ACLL was a loss function that optimized the speaker features. Speaker features were extracted by the deep residual network and fed into the ACLL for text-independent speaker recognition.
2 Overall framework
As shown in Fig. 1, the overall framework consists of three parts: speech feature vector extraction, the deep residual network ResCASP, and the ACLL. Feature vector extraction converts the original speech into 64-dimensional log filter bank feature vectors. The ResCASP framework includes a ResNet, a CASP layer, and a fully connected (FC) layer. The CASP layer aggregates the frame-level features of the ResNet into utterance-level features, and the FC layer constrains the utterance-level feature vectors to a 512-dimensional vector representation. The ACLL is the loss function that optimizes the feature vectors output by the ResCASP framework. The trained ResCASP model is used for the final speaker recognition.
2.1 Log filter bank features extraction
The original speech signal is a one-dimensional time domain signal, whereas the input of the deep residual network is two-dimensional data. Generally, there are two main ways to extract speech features: MFCC [19] and the log filter bank [20]. MFCC is computed from the log filter bank by applying a DCT for decorrelation, so the log filter bank contains more information than MFCC; it is also more in line with the nature of the speech signal and fits the reception characteristics of the human ear. Therefore, log filter bank feature vectors are first extracted from the original speech signal. The specific steps of log filter bank feature vector extraction [19, 20] are as follows:

Pre-emphasis
Pre-emphasis is a high-pass filter whose purpose is to boost the high-frequency components of the signal. For acoustic feature extraction, the pre-emphasis filter is shown in Eq. 1.
where α is the pre-emphasis coefficient, and z is the input signal of the original speech.
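The pre-emphasis step can be illustrated with a minimal sketch. It assumes the commonly used first-order form y[n] = x[n] − α·x[n−1] (equivalently H(z) = 1 − αz^{−1}); the function name `pre_emphasis` and the default α = 0.97 are illustrative assumptions, not values stated in this paper.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1] (assumed first-order form).

    alpha = 0.97 is a common default and an assumption here.
    The first sample is passed through unchanged.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out
```

A constant (low-frequency) input is attenuated after the first sample, which is the expected high-pass behavior.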

Framing
By dividing the speech signal into short frames, the signal within each frame can be regarded as a steady-state signal and processed accordingly. At the same time, adjacent frames partially overlap so that the parameters change more smoothly from one frame to the next.

Windowing function
The purpose of the windowing function is to reduce spectral leakage in the frequency domain. After pre-emphasis and framing, each frame of the speech signal is multiplied by a Hamming window of the given frame length and frame shift [19], which increases the continuity at the frame boundaries. The calculation process is shown in Eq. 2.
where S(n) is the speech signal after pre-emphasis and framing, and N is the frame length.
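The framing and windowing steps can be sketched as follows. The Hamming window is written in its standard form W(n) = 0.54 − 0.46·cos(2πn/(N−1)), which is assumed to match Eq. 2; the helper names and the choice to drop trailing samples that do not fill a frame are illustrative assumptions.

```python
import math


def frame_signal(signal, frame_len, frame_shift):
    """Split a signal into overlapping frames.

    Trailing samples that do not fill a whole frame are dropped
    (an assumption; padding is another common choice).
    """
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += frame_shift
    return frames


def hamming_window(n_points):
    """Standard Hamming window of length n_points."""
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def apply_window(frame, window):
    """Multiply a frame by the window, sample by sample."""
    return [s * w for s, w in zip(frame, window)]
```

With the 25 ms window and 10 ms step reported in the training details, a 16 kHz sampling rate (an assumption here) would correspond to `frame_len=400` and `frame_shift=160`.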

Fast Fourier transform
Then, a fast Fourier transform is applied to each frame of the speech signal, converting the time domain data into frequency domain data, as shown in Eq. 3.
where T(n) is the speech signal after the windowing function, P is the number of Fourier transform points, and k is the frequency index (k=0,1,2,...,P−1).
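As a minimal illustration of this step, the sketch below computes the P-point discrete Fourier transform X(k) = Σ_n T(n)·e^{−j2πkn/P} directly and then the energy spectrum |X(k)|^2. A naive O(P^2) DFT is used here for clarity; an FFT produces the same values more efficiently. Zero-padding or truncating the frame to P points is an assumption.

```python
import cmath


def dft(frame, n_points):
    """Naive P-point DFT of one frame (zero-padded or truncated to n_points)."""
    padded = (list(frame) + [0.0] * n_points)[:n_points]
    return [
        sum(padded[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]


def energy_spectrum(frame, n_points):
    """Squared magnitude |X(k)|^2 for k = 0, ..., n_points - 1."""
    return [abs(x) ** 2 for x in dft(frame, n_points)]
```

For a constant frame, all of the energy falls into the k = 0 bin, which gives a quick sanity check of the transform.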

Energy calculation of mel filter banks
The energy spectrum is fed to a bank of triangular band-pass filters H_{m}(k) [18]. In the frequency domain, the energy spectrum |X(k)|^{2} is multiplied by the frequency response H_{m}(k) of each filter and summed. The calculation process is shown in Eq. 4.
where X(k) is the signal after the fast Fourier transform, and H_{m}(k) is the triangular band-pass filter. Its frequency response is shown in Eq. 5.
where f(m) is the center frequency of H_{m}(k), 0≤m<L, and L is the number of band-pass filters.
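A triangular filter bank of this kind can be sketched as below. The mel-scale conversion mel = 2595·log10(1 + f/700) and the mapping from center frequencies to FFT bins are common conventions and are assumptions here, since the paper does not spell them out; `mel_filter_bank` is a hypothetical helper name.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Return n_filters triangular filters over n_fft // 2 + 1 frequency bins."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 points equally spaced on the mel scale:
    # the left edge, the centers f(m), and the right edge.
    mel_points = [low + (high - low) * i / (n_filters + 1)
                  for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    bank = []
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):        # rising slope
            filt[k] = (k - left) / (center - left)
        for k in range(center, right):       # falling slope, peak of 1 at center
            filt[k] = (right - k) / (right - center)
        bank.append(filt)
    return bank
```

For example, `mel_filter_bank(64, 512, 16000)` would give 64 filters, matching the 64-dimensional features used in this paper; the 512-point FFT and 16 kHz sampling rate are assumptions.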

Log energy spectrum
For the mth frame, the log energy spectrum of the filters is defined in Eq. 6.
where M(l) is the feature vector obtained after calculating the energy of the mel filters.
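The last two steps, the multiply-and-add of the energy spectrum with each filter followed by the logarithm, can be sketched as follows. The small `floor` value that guards against log(0) is an assumption added for numerical safety and is not part of the paper.

```python
import math


def log_filter_bank_energies(energy_spec, filter_bank, floor=1e-10):
    """Log energy of each filter for one frame.

    energy_spec: values |X(k)|^2; only as many bins as each filter
    covers are used (zip stops at the shorter sequence).
    filter_bank: list of filters H_m(k), each a list of weights.
    """
    feats = []
    for filt in filter_bank:
        energy = sum(h * e for h, e in zip(filt, energy_spec))
        feats.append(math.log(max(energy, floor)))
    return feats
```

Stacking these per-frame vectors over time gives the two-dimensional input (filters × frames) that the residual network consumes.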
2.2 Structure of residual block
ResNet alleviates the difficulty of training deep convolutional neural networks [21, 22]. It lets the later layers of a deep network learn an identity mapping, that is, h(x)=x, so that the model can degenerate into a shallower network. The identity mapping greatly reduces the number of training parameters in the neural network. The ResNet is composed of many stacked residual blocks (ResBlocks), whose structure is shown in Fig. 2. Each ResBlock is composed of two 2D convolutional layers. The identity mapping maps the input feature vectors of each ResBlock to its output feature vectors. The ResBlock is defined in Eq. 7.
where x and Γ are the input and output feature vectors, respectively, W_{i} is the learnable weight, and F(x,W_{i}) is the output of the residual mapping. In addition, the identity mapping connection does not add extra parameters or computational complexity.
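The residual connection described above, presumably of the form Γ = F(x, W_{i}) + x, can be illustrated with a toy sketch. To keep the example short, the two 2D convolutional layers are replaced by two small dense layers with a ReLU in between; this substitution and all names are assumptions made for illustration only.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(v, weight):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weight]


def residual_block(x, w1, w2):
    """Return F(x, W) + x, with a two-layer F standing in for two conv layers."""
    fx = linear(relu(linear(x, w1)), w2)
    # The identity shortcut adds the input back without any extra parameters.
    return [f + a for f, a in zip(fx, x)]
```

When the weights of F are zero, the block reduces to the identity mapping h(x)=x, which is the degenerate case mentioned in the text.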
In order to make full use of feature learning capabilities of ResNet and reduce loss of feature information, we use identity mapping to reduce data dimensions. In addition, in each convolutional layer, the stride is set to 1, the padding is set to SAME, and zero padding is used to prevent information from being lost at the edge of the cube.
2.3 Structure of CASP
The ASP was proposed by combining higher-order statistics with an attention mechanism [18]. It provides importance-weighted standard deviations as well as weighted means of the frame-level features, for which the importance is calculated by an attention mechanism. Such previous work, however, has been evaluated only in limited tasks such as fixed-duration text-independent speaker recognition [18, 23]. Therefore, we propose a new pooling method, called CASP. The CASP aggregates the frame-level features of the deep residual network model into utterance-level features. This enables the speaker embedding to capture speaker factors more accurately and efficiently with respect to long-term variations. The calculation process of the CASP layer is shown in Fig. 3.
Firstly, the frame-level feature vectors {x_{1},x_{2},...,x_{T}} of the deep residual network are passed through one-dimensional convolutional layers to obtain the abstract feature vectors of the hidden units {h_{1},h_{2},...,h_{T}}.
Secondly, the scores are normalized over all frames by a softmax function, which indicates the relative importance of each hidden unit. The weight calculation formula for each sample is shown in Eq. 8.
where h_{t} is the input feature vector, and w_{t} is the weight of each feature vector.
Therefore, the utterance-level features can be expressed as a weighted sum of the frame-level features, and the calculation formula of the weighted sum is shown in Eq. 9.
where x_{t} is the input feature vector, and the normalized score e_{t} is used as the weight in the pooling layer to calculate the weighted mean vector.
Finally, higher-order statistics are combined with the attention mechanism, which gives the CASP. It generates both the mean and the standard deviation through the attention mechanism. The weighted standard deviation is defined in Eq. 10.
where σ is the weighted standard deviation, which benefits from both higher-order statistics and the attention mechanism.
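The pooling computation can be sketched as follows, under stated assumptions: the per-frame attention scores are taken as given scalars (in the paper they come from one-dimensional convolutional layers), the weights are their softmax over frames, and the weighted standard deviation is computed as sqrt(Σ_t w_t·x_t² − μ²) per dimension, as in attentive statistics pooling [18]. Concatenating mean and standard deviation is also an assumption.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(frames, scores):
    """Pool T frame-level vectors into one utterance-level vector.

    frames: list of T vectors of dimension D.
    scores: list of T scalar attention scores (assumed given).
    Returns the weighted mean followed by the weighted standard
    deviation (2 * D values).
    """
    weights = softmax(scores)
    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    second = [sum(w * f[d] ** 2 for w, f in zip(weights, frames))
              for d in range(dim)]
    # Clamp at zero to absorb small negative values from rounding.
    std = [math.sqrt(max(s - m * m, 0.0)) for s, m in zip(second, mean)]
    return mean + std
```

With equal scores, the weights become uniform and the result reduces to the plain mean and standard deviation, that is, to ordinary statistics pooling.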
2.4 Structure of ACLL
Loss function design is pivotal for large-scale speaker recognition. Current state-of-the-art deep speaker recognition methods mostly adopt a softmax-based classification loss [12]. Since the features learned with the original softmax loss are not guaranteed to be discriminative enough for practical speaker recognition problems, margin-based losses [24–26] have been proposed. Although margin-based loss functions have been verified to obtain good performance, they do not take the difficulty of each sample into consideration, whereas the ACLL emphasizes easy samples first and hard samples later, which is more reasonable and effective. The original softmax loss is formulated as follows:
where x_{i}∈R^{d} denotes the deep feature of the ith sample, which belongs to class y_{i}, W_{j}∈R^{d} denotes the jth column of the weight W∈R^{d×n}, and b_{j} is the bias term. The class number and the embedding feature size are n and d, respectively. In practice, the bias is usually set to b_{j}=0, and the individual weight is set to ∥W_{j}∥=1 by l_{2} normalization. The deep feature is also normalized and rescaled to s. Thus, the original softmax loss can be modified as follows:
where \(\varphi (cos\theta _{y_{i}})\) and Y(t,cosθ_{j}) adjust the positive and negative cosine similarities, respectively. cosθ is the cosine similarity between the input feature vector y_{i} and the weight w_{i}, s is a coefficient that can increase the recognition speed of the model, and N is the total number of classified samples. Among the margin-based loss functions, AM-Softmax [24] uses \(\varphi (cos\theta _{y_{i}})=cos\theta _{y_{i}}-m\) and Y(t,cosθ_{j})=cosθ_{j}, while AAM-Softmax [25] uses \(\varphi (cos\theta _{y_{i}})=cos(\theta _{y_{i}}+m)\) and Y(t,cosθ_{j})=cosθ_{j}. However, these losses only modify the positive cosine similarity of each sample to enhance feature discrimination and cannot adapt to various situations.
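The two margin-based losses can be compared with a small per-sample sketch. It uses the standard forms from the margin-based softmax literature, where AM-Softmax subtracts the margin from the target cosine (cosθ − m) and AAM-Softmax adds it to the angle (cos(θ + m)); the function names are hypothetical, and a single sample is used instead of a batch for brevity.

```python
import math


def am_softmax_target(cos_theta, m):
    """Additive cosine margin: cos(theta) - m."""
    return cos_theta - m


def aam_softmax_target(cos_theta, m):
    """Additive angular margin: cos(theta + m)."""
    clipped = max(-1.0, min(1.0, cos_theta))  # keep acos in its domain
    return math.cos(math.acos(clipped) + m)


def margin_softmax_loss(cosines, target, m, s, kind="aam"):
    """Cross-entropy for one sample over scaled cosine logits.

    cosines: cosine similarity to each class weight; target: true class.
    Only the target-class logit receives the margin.
    """
    logits = []
    for j, c in enumerate(cosines):
        if j == target:
            c = am_softmax_target(c, m) if kind == "am" else aam_softmax_target(c, m)
        logits.append(s * c)
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]
```

For instance, `margin_softmax_loss([0.5, 0.3], 0, 0.2, 30.0)` uses the values m=0.2 and s=30 selected later in this paper; the margin makes the same sample harder to classify, so its loss is larger than with m=0.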
Therefore, ACLL is proposed [27]. The ACLL is defined as Eq. 13.
where \(\varphi (t,cos\theta _{y_{i}})\) and Y(t,cosθ_{j}) are defined by Eqs. 14 and 15, respectively, and s is the scaling factor of the deep feature vectors. It should be noted that the positive cosine similarity can adopt any margin-based loss function; here, we adopt AAM-Softmax as an example. In the early training stage, learning from easy samples is beneficial to model convergence. Thus, t should be close to zero, and I(·)=t+cosθ_{j} is smaller than 1. Therefore, the weights of hard samples are reduced, and easy samples are relatively emphasized. As training goes on, the model gradually focuses on the hard samples, i.e., the value of t increases and I(·) becomes larger than 1. Thus, the hard samples are emphasized with larger weights. Moreover, within the same training stage, I(·) is monotonically decreasing with θ_{j}, so that a harder sample is assigned a larger coefficient according to its difficulty. The value of the parameter t is estimated automatically in the ACLL; otherwise, it would require considerable effort for manual tuning. Therefore, the ACLL can adaptively adjust the relative importance of easy and hard samples.
where m is the feature margin between different categories, \(\theta _{y_{i}}\) is the angle between the feature vector y_{i} and the weight w_{i}, and t is the adaptive estimation parameter.
where t is the adaptive estimation parameter, which is updated by an exponential moving average (EMA). The process is shown in Eq. 16.
where r^{(k)} is the mean value of the cosine similarity of the kth batch. With the EMA, we avoid hyperparameter tuning and make the modulation coefficients I(·) of the hard-sample negative cosine similarities adaptive to the current training stage.
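The EMA update of t can be sketched as below. It assumes the usual form t^{(k)} = α·r^{(k)} + (1 − α)·t^{(k−1)}, with r^{(k)} taken as the batch mean of the target-class (positive) cosine similarities; the momentum α is left as an explicit argument because its value is not given in the surrounding text.

```python
def update_t(t_prev, positive_cosines, alpha):
    """One EMA step for the adaptive parameter t.

    t_prev: previous value t^(k-1) (assumed to start from 0).
    positive_cosines: target-class cosine similarities of the current batch.
    alpha: weight of the new batch statistic (an assumed hyperparameter).
    """
    r_k = sum(positive_cosines) / len(positive_cosines)
    return alpha * r_k + (1.0 - alpha) * t_prev
```

As the model improves, the positive cosine similarities grow, so t drifts upward over training and hard samples gradually receive more weight.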
As shown in Fig. 4, the decision condition changes from \(cos\theta _{y_{i}}=cos\theta _{j}\) (blue line) to \(cos (\theta _{y_{i}}+m)=cos\theta _{j}\) (yellow line). The ACLL adaptively adjusts the weights of difficult samples, and the decision condition becomes \(cos(\theta _{y_{i}}+m)=(t+cos\theta _{j})cos\theta _{j}\) (green line). During the training process, the decision boundary of difficult samples changes from one green line (early stage) to another green line (later stage). Easy samples are emphasized first, and difficult samples are emphasized later. In addition, AAM-Softmax is used for the positive cosine similarity, namely \(\varphi (t,cos\theta _{y_{i}})=cos(\theta _{y_{i}}+m)\). It can be seen from Eq. 14 that Y(t,cosθ_{j})=cosθ_{j} at the beginning of training.
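The behavior described above can be made concrete with a per-sample sketch of the modulated logits. It assumes the piecewise rule used by curriculum-style margin losses: a negative class j is left unchanged when cos(θ_{y_i}+m) ≥ cosθ_j (easy), and is replaced by cosθ_j·(t + cosθ_j) otherwise (hard), which is consistent with the green decision boundary above. The function name and the single-sample interface are illustrative assumptions.

```python
import math


def acll_logits(cosines, target, m, t, s):
    """Scaled logits for one sample with curriculum-style modulation.

    cosines: cosine similarity to each class weight; target: true class.
    m: angular margin; t: adaptive parameter; s: scale factor.
    """
    cos_y = max(-1.0, min(1.0, cosines[target]))
    positive = math.cos(math.acos(cos_y) + m)  # AAM-Softmax positive term
    logits = []
    for j, c in enumerate(cosines):
        if j == target:
            logits.append(s * positive)
        elif positive >= c:
            logits.append(s * c)            # easy negative: unchanged
        else:
            logits.append(s * c * (t + c))  # hard negative: weight I = t + cos(theta_j)
    return logits
```

With a small t early in training, the hard negative logit is scaled down; as t grows, the same negative is scaled up, which shifts the emphasis from easy to hard samples.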
2.5 Evaluation indicators
We evaluate the framework with the equal error rate (EER). The EER is the error rate at which the false rejection (FR) rate equals the false acceptance (FA) rate, where an FR is a correct signal that is recognized as a wrong signal, and an FA is a wrong signal that is recognized as a correct signal. The definitions of the FR rate and the FA rate are shown in Eqs. 17 and 18.
where N_{FR} is the number of false rejections, and N_{Target} is the total number of real evaluations.
where N_{FA} is the number of false acceptances, and N_{impostor} is the total number of false evaluations.
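The two rates and the EER can be computed with a short sketch. It assumes the ratios FR rate = N_{FR}/N_{Target} and FA rate = N_{FA}/N_{impostor}, that a trial is accepted when its score is at or above the threshold, and that the EER is approximated by scanning the observed scores as thresholds; these conventions and the function names are assumptions.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (FR rate, FA rate) at a given decision threshold."""
    n_fr = sum(1 for s in target_scores if s < threshold)     # targets rejected
    n_fa = sum(1 for s in impostor_scores if s >= threshold)  # impostors accepted
    return n_fr / len(target_scores), n_fa / len(impostor_scores)


def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by scanning every observed score as a threshold.

    Returns the average of the FR and FA rates at the threshold where
    the two rates are closest.
    """
    best = None
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        fr, fa = error_rates(target_scores, impostor_scores, thr)
        gap = abs(fr - fa)
        if best is None or gap < best[0]:
            best = (gap, (fr + fa) / 2.0)
    return best[1]
```

For example, with target scores `[0.9, 0.8, 0.4]` and impostor scores `[0.5, 0.3, 0.1]`, a threshold of 0.5 rejects one target and accepts one impostor, so both rates are 1/3.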
3 Experiments and results
In this part, our experimental processes and training configuration details are introduced, and our method is compared with other methods. Our model was trained on the VoxCeleb2 [28] dataset, and the effectiveness of our framework was evaluated on the VoxCeleb1 [29] test dataset.
3.1 Experimental environment
The parameters of the experimental environment are shown in Table 1.
3.2 Experimental dataset and training details
3.2.1 Experimental dataset
In order to verify the effectiveness of our proposed framework, extensive experiments were conducted on the VoxCeleb1 and VoxCeleb2 datasets. We trained our proposed model on the development dataset of VoxCeleb2, which contains 1,092,009 utterances from 5994 speakers. The performance of all models in the experiments was verified on the VoxCeleb1 test dataset. The VoxCeleb1 dataset contained 153,357 utterances from 1251 speakers; the VoxCeleb2 development dataset and the VoxCeleb1 test dataset were completely disjoint (there was no common audio signal). In addition, the VoxCeleb1 dataset provided three versions of the test dataset: the VoxCeleb1 test dataset, the VoxCeleb1-E test dataset, and the VoxCeleb1-H test dataset. The VoxCeleb1 and VoxCeleb2 datasets are summarized in Table 2.
3.2.2 Training details
We used the Adam optimizer in our experiments and set the initial learning rate to 10^{−3}. During training, we used a fixed-length 2-s temporal segment, extracted randomly from each utterance. Spectrograms were extracted with a Hamming window of width 25 ms and step 10 ms. For the ResCASP model, the 64-dimensional log filter bank features were used as the input to the network. Mean and variance normalization (MVN) was performed by applying instance normalization to the network input. Since the VoxCeleb dataset consists mostly of continuous speech, voice activity detection (VAD) was not used in training and testing. The training time of the ResCASP model was about 4 days, and a total of 200 epochs were trained for each experiment. In order to minimize the effect of random initialization, all experiments were repeated three times independently. The trained deep residual network model was evaluated on the VoxCeleb1 test dataset. Ten 4-s temporal crops were sampled at fixed intervals from each test segment, and the similarity was calculated between all possible combinations (10 × 10 = 100) of crops from each pair of segments. The average of the 100 similarities was used as the score.
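The test-time scoring described above can be sketched as follows. Cosine similarity is assumed as the similarity measure (the text only says "similarity"), and the helper names, including the way crop start positions are spread over the segment, are illustrative assumptions.

```python
import math


def crop_starts(n_frames, crop_len, n_crops=10):
    """Start indices of n_crops crops spaced at fixed intervals over a segment."""
    span = max(n_frames - crop_len, 0)
    if n_crops == 1:
        return [0]
    return [round(i * span / (n_crops - 1)) for i in range(n_crops)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pair_score(embs_a, embs_b):
    """Average similarity over all combinations of crop embeddings.

    With 10 embeddings per segment this averages 10 x 10 = 100 values.
    """
    sims = [cosine_similarity(a, b) for a in embs_a for b in embs_b]
    return sum(sims) / len(sims)
```

With the 10 ms step used here, a 4-s crop corresponds to 400 frames, so `crop_starts(1000, 400)` would spread ten crops over a 10-s segment.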
3.3 Structure of deep residual network ResCASP
As shown in Table 3, the ResCASP was composed of a ResNet and a CASP layer. The ResNet, which was composed of multiple ResBlocks, was used to extract higher-dimensional abstract features with optimal classification performance; the CASP layer was used to aggregate the frame-level features of the ResNet. Finally, the trained model was used for the final speaker recognition.
As shown in Table 3, Conv1–4 were used as the backbone of the ResCASP architecture for scale conversion and depth conversion, and convolutional layers were used to obtain abstract features of the utterance. After each convolution operation, a ReLU activation function and batch normalization (BN) were added, which gave the model nonlinear feature conversion capabilities. The convolutional layers in the residual blocks Res1–4 used 32, 64, 128, and 256 convolution kernels of size 3 × 3, respectively, with the stride set to 1. Conv1 used 32 convolution kernels of size 7 × 7, with the stride set to 1. Conv2–4 used 64, 128, and 256 convolution kernels of size 1 × 1, with the stride set to 2. The frame-level features of the ResNet were then aggregated into utterance-level features by the CASP layer. Each input signal corresponded to a 64 × 200 residual network input, and 512-dimensional frame-level features were generated by the deep residual model. A fully connected (FC) layer was used to constrain the embedding to a 512-dimensional unit vector. Finally, text-independent speaker recognition was performed with the ACLL.
3.4 ResCASP parameters selection
The ResCASP model was trained with the text-independent speaker recognition framework.
On the one hand, the training process of the ResCASP model was a process of continuously optimizing the parameters of the residual network. In order to prevent the ResCASP model from overfitting during learning, the L2 regularization [30] mechanism was introduced into the FC layer. The Adam [31] optimizer was used in the experiments, and its initial learning rate was 0.001, which was reduced by 5% every 5 epochs.
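The learning rate schedule can be written as a one-line rule. The sketch below assumes a staircase decay with epochs counted from 0, so that the rate is multiplied by 0.95 after every 5 completed epochs; the function name is hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-3, decay=0.95, step=5):
    """Staircase schedule: reduce the rate by 5% every `step` epochs.

    epoch is counted from 0 (an assumption about the indexing).
    """
    return initial_lr * decay ** (epoch // step)
```

Under this rule, the rate stays at 0.001 for epochs 0 to 4, drops to 0.00095 at epoch 5, and has been reduced 39 times by epoch 199, the last of the 200 training epochs.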
On the other hand, the abstract features of the FC layer were fed into the optimized loss function at the end of each iteration of the deep residual network. In the training phase, the parameters were optimized through the loss function. Because the hyperparameters m and s in the ACLL were relatively sensitive and fixed, experiments were set up to find the best configuration for m and s.
With 64-dimensional log filter bank feature vectors, the hyperparameter m was set to 0.1, 0.2, 0.3, 0.4, and 0.5, and s was fixed at 30. As shown in Fig. 5, as the hyperparameter m increased, the EER of speaker recognition first decreased and then increased. Therefore, the most stable performance and the lowest EER for text-independent speaker recognition were obtained when m=0.2, s=30, and the dimension of the log filter bank feature vectors was set to 64.
3.5 Performance analysis of speaker recognition
In order to verify the rationality of our proposed framework, two groups of experiments were designed to perform text-independent speaker recognition on the VoxCeleb2 dataset.
In the first group of experiments, the CASP was used to aggregate the frame-level features, and AM-Softmax, AAM-Softmax, and the ACLL were used as the loss functions. Text-independent speaker recognition was performed with the ResCASP model. As shown in Fig. 6, when the CASP was used as the aggregation layer and the ACLL, AAM, and AM were used as the loss functions, the average EER was 1.76%, 1.91%, and 1.98% on VoxCeleb1; 1.91%, 2.04%, and 2.06% on VoxCeleb1-E; and 3.24%, 3.38%, and 3.44% on VoxCeleb1-H, respectively, where AM and AAM represent AM-Softmax and AAM-Softmax, respectively. When the ACLL was used as the loss function, the ResCASP model obtained the best performance for text-independent speaker recognition on all three test datasets. AM and AAM were margin-based loss functions, whereas the ACLL adaptively adjusted the relative weights of easy and difficult samples. In the proposed framework, the ACLL was more effective for text-independent speaker recognition than AM and AAM, which indicated that the information in the log filter bank signal could be extracted effectively by adaptively weighting easy and difficult samples.
In the second group of experiments, the ACLL was used as the loss function, and the ResTAP, ResASP, and ResCASP models were used for text-independent speaker recognition. As shown in Fig. 7, the ResTAP, the ResASP, and the ResCASP achieved an average EER of 2.09%, 1.92%, and 1.76% on VoxCeleb1; 2.26%, 2.08%, and 1.91% on VoxCeleb1-E; and 3.76%, 3.59%, and 3.24% on VoxCeleb1-H, respectively. With the same model parameters, the ResCASP obtained better speaker recognition performance than the ResTAP and the ResASP on all three test datasets, which indicated that our model could extract feature information effectively. The speaker recognition performance of the ResASP and the ResCASP was better than that of the ResTAP, which indicated that an attention mechanism-based aggregation layer could capture the relevant information of the signal features effectively.
3.6 Comparison of the results of different experimental methods
The proposed method based on the ResCASP model was compared with current recognition methods that were applied to the VoxCeleb1 and VoxCeleb2 datasets. As shown in Table 4, the different speaker recognition methods were evaluated with similar procedures so that their recognition performance could be compared. A method based on the ACLL and a deep residual network was proposed for the TISR system in this paper.
Firstly, a ResNet and a CASP aggregation layer were used to build the ResCASP model framework. As shown in Fig. 8, a ResNet and a GhostVLAD aggregation layer were used to build a speaker recognition framework in [11]. The experimental results showed that the EER of the proposed method was 1.11%, 1.04%, and 1.69% lower than theirs on Vox1, Vox1-E, and Vox1-H, respectively, where Vox1, Vox1-E, and Vox1-H denote the VoxCeleb1, VoxCeleb1-E, and VoxCeleb1-H test datasets, respectively. Therefore, the CASP layer could aggregate more useful speaker feature information.
Secondly, the ACLL was used as the loss function to perform the text-independent speaker recognition experiments. As shown in Fig. 8, AAM-Softmax was used as the loss function in [32]. The experimental results showed that the EER of the proposed method was lower than theirs by 0.56% on Vox1, 0.88% on Vox1-E, and 1.69% on Vox1-H. Therefore, the ACLL could better separate the feature margins of different categories.
Thirdly, on the basis of the ResNet, we added the CASP, which captured abstract local features, and the resulting ResCASP was used for text-independent speaker recognition. As shown in Fig. 8, a ResNet was used to conduct text-independent speaker recognition experiments in [33]. The experimental results showed that the EER of the proposed method on the Vox1 test dataset was 0.32% lower than theirs, which indicated that the combination of the ResNet and the ACLL was more effective for speaker recognition. Therefore, the ResCASP could extract information more effectively for text-independent speaker recognition.
Overall, our method achieved the lowest EER for text-independent speaker recognition on the VoxCeleb1, VoxCeleb1-E, and VoxCeleb1-H test datasets, namely 1.76%, 1.91%, and 3.24%, respectively. The experiments verified the effectiveness of our proposed text-independent speaker recognition method based on the ResCASP. Finally, the comparison of related methods is summarized in Table 4.
4 Conclusion
A method of text-independent speaker recognition based on a deep residual network, ResCASP, was proposed in this paper. The CASP layer could assign different weights to each sample and could extract more useful relevant information. The proposed method was trained on the VoxCeleb2 dataset and achieved the best speaker recognition performance in terms of EER. The innovations of this paper mainly include two aspects. Firstly, the ResCASP model, constructed from a ResNet and the CASP, was proposed and used for text-independent speaker recognition. Secondly, a mining strategy for signal features was applied to text-independent speaker recognition by using the ACLL as the loss function. Compared with existing studies, our model had better text-independent speaker recognition performance and achieved the lowest EER on the VoxCeleb1, VoxCeleb1-E, and VoxCeleb1-H test datasets.
Availability of data and materials
The VoxCeleb database [21, 22] that supports the conclusions of this article is available at https://www.robots.ox.ac.uk/ vgg/data/voxceleb/.
Abbreviations
CASP: Convolutional attention statistics pooling
ACLL: Adaptive curriculum learning loss
EER: Equal error rate
ResNet: Residual network
SR: Speaker recognition
SV: Speaker verification
SI: Speaker identification
TDSR: Text-dependent speaker recognition
TISR: Text-independent speaker recognition
TAP: Temporal averaging pooling
ASP: Attention statistical pooling
ResCASP: Deep residual network and CASP
FC: Fully connected
ResBlocks: Residual blocks
EMA: Exponential moving average
FA: False acceptance
FR: False rejection
BN: Batch normalization
AM: AM-Softmax
AAM: AAM-Softmax
Vox1: VoxCeleb1
Vox1-E: VoxCeleb1-E
Vox1-H: VoxCeleb1-H
References
J. P. Campbell, Speaker recognition: a tutorial. Proc. IEEE. 85(9), 1437–1462 (1997).
J. Hansen, T. Hasan, Speaker recognition by machines and humans: a tutorial review. IEEE Signal Proc. Mag. 32(6), 74–99 (2015).
Z. Chunlei, K. Kazuhito, J. H. L. Hansen, Text-independent speaker verification based on triplet convolutional neural network embeddings. IEEE/ACM Trans. Audio Speech Lang. Process. 26(9), 1633–1644 (2018).
R. Togneri, D. Pullella, An overview of speaker identification: accuracy and robustness issues. IEEE Circ. Syst. Mag. 11(2), 23–61 (2011).
A. Larcher, K. A. Lee, B. Ma, H. Li, Text-dependent speaker verification: classifiers, databases and RSR2015. Speech Commun. 60(3), 56–77 (2014).
J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Matějka, L. Burget, O. Glembek, End-to-end DNN based text-independent speaker recognition for long and short utterances. Comput. Speech Lang. 59, 22–35 (2020).
Z. Bai, X. L. Zhang, J. Chen, Cosine metric learning based speaker verification. Speech Commun. 118, 10–20 (2020).
C. Zhang, K. Koishida, in Interspeech 2017. End-to-end text-independent speaker verification with triplet loss on short utterances (ISCA, 2017), pp. 1487–1491.
H. Bredin, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). TristouNet: triplet loss for speaker turn embedding (IEEE, 2017), pp. 5430–5434. CoRR abs/1609.04301.
S. Wang, Z. Huang, Y. Qian, K. Yu, Discriminative neural embedding learning for short-duration text-independent speaker verification. IEEE/ACM Trans. Audio Speech Lang. Process. 27(11), 1686–1696 (2019).
A. Nagrani, J. S. Chung, W. Xie, A. Zisserman, VoxCeleb: large-scale speaker verification in the wild. Comput. Speech Lang. 60, 101027 (2020).
W. Xie, A. Nagrani, J. S. Chung, A. Zisserman, in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Utterance-level aggregation for speaker recognition in the wild (IEEE, 2019), pp. 5791–5795. CoRR abs/1902.10107.
Z. Zhao, H. Duan, G. Min, Y. Wu, Z. Huang, X. Zhuang, H. Xi, M. Fu, A lighten CNN-LSTM model for speaker verification on embedded devices. Futur. Gener. Comput. Syst. 100, 751–758 (2019).
T. Bian, F. Chen, L. Xu, Self-attention based speaker recognition using cluster-range loss. Neurocomputing. 368, 59–68 (2019).
F. Richardson, D. Reynolds, N. Dehak, Deep neural network approaches to speaker and language recognition. IEEE Sig. Process Lett. 22(10), 1671–1675 (2015).
N. N. An, N. Q. Thanh, Y. Liu, Deep CNNs with self-attention for speaker identification. IEEE Access. 7, 85327–85337 (2019).
H. Taherian, Z. Q. Wang, J. Chang, D. Wang, Robust speaker recognition based on single-channel and multi-channel speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1293–1302 (2020).
K. Okabe, T. Koshinaka, K. Shinoda, Attentive statistics pooling for deep speaker embedding. arXiv preprint arXiv:1803.10963, 2252–2256 (2018).
O. Boujelben, M. Bahoura, Efficient FPGA-based architecture of an automatic wheeze detector using a combination of MFCC and SVM algorithms. J. Syst. Archit. 88, 54–64 (2018).
A. Sithara, A. Thomas, D. Mathew, Study of MFCC and IHC feature extraction methods with probabilistic acoustic models for speaker biometric applications. Procedia Comput. Sci. 143, 267–276 (2018).
K. He, X. Zhang, S. Ren, J. Sun, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Deep residual learning for image recognition (IEEE, 2016), pp. 770–778. CoRR abs/1512.03385.
R. Jahangir, W. T. Ying, N. A. Memon, G. Mujtaba, I. Ali, Text-independent speaker identification through feature fusion and deep neural network. IEEE Access. PP(99), 1 (2020).
J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B. J. Lee, I. Han, In defence of metric learning for speaker recognition. arXiv preprint arXiv:2003.11982, 2977–2981 (2020).
F. Wang, J. Cheng, W. Liu, H. Liu, Additive margin softmax for face verification. IEEE Signal Proc. Lett. 25(7), 926–930 (2018).
J. Deng, J. Guo, N. Xue, S. Zafeiriou, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. ArcFace: additive angular margin loss for deep face recognition (IEEE, 2019), pp. 4690–4699. CoRR abs/1801.07698.
W. Liu, Y. Wen, Z. Yu, M. Yang, in ICML, 2. Large-margin softmax loss for convolutional neural networks (CoRR abs/1612.02295, 2016), p. 7.
Y. Huang, Y. Wang, Y. Tai, X. Liu, P. Shen, S. Li, J. Li, F. Huang, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CurricularFace: adaptive curriculum learning loss for deep face recognition (IEEE, 2020), pp. 5901–5910. CoRR abs/2004.00288.
J. S. Chung, A. Nagrani, A. Zisserman, VoxCeleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622, 1086–1090 (2018).
A. Nagrani, J. S. Chung, A. Zisserman, VoxCeleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612, 2616–2620 (2017).
F. Li, J. M. Zurada, W. Wu, Smooth group L1/2 regularization for input layer of feedforward neural networks. Neurocomputing. 314, 109–119 (2018). https://doi.org/10.1016/j.neucom.2018.06.046.
D. P. Kingma, J. Ba, Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 1–15 (2014).
X. Xiang, S. Wang, H. Huang, Y. Qian, K. Yu, in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Margin matters: towards more discriminative deep neural network embeddings for speaker recognition (IEEE, 2019), pp. 1652–1656. CoRR abs/1906.07317.
S. M. Kye, Y. Jung, H. B. Lee, S. J. Hwang, H. Kim, Meta-learning for short utterance speaker recognition with imbalance length pairs. arXiv preprint arXiv:2004.02863, 1652–1656 (2020).
J. Xu, X. Wang, B. Feng, W. Liu, Deep multi-metric learning for text-independent speaker verification. Neurocomputing. 410, 394–400 (2020).
Y. Q. Yu, L. Fan, W. J. Li, in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Ensemble additive margin softmax for speaker verification (IEEE, 2019), pp. 6046–6050.
Y. Jung, Y. Kim, H. Lim, Y. Choi, H. Kim, Spatial pyramid encoding with convex length normalization for text-independent speaker verification. arXiv preprint arXiv:1906.08333, 2982–2986 (2019).
Acknowledgements
Not applicable.
Funding
This research was funded in part by the Natural Science Foundation of Guangdong Province (no. 2019A1515011940), in part by the Science and Technology Program of Guangzhou (nos. 2019050001 and 202002030353), in part by the Science and Technology Planning Project of Guangdong Province (no. 2017B030308009), and in part by the Special Project for Youth Top-Notch Scholars of Guangdong Province (no. 2016TQ03X100).
Declarations
Authors’ contributions
RD designed the framework, conducted the experiments, and wrote the manuscript. QZ carried out the experiments, analyzed the results, and presented the discussion and conclusion parts. All authors read and approved the final manuscript.
Authors’ information
^{1}School of Physics and Telecommunication Engineering, South China Normal University, Guangzhou 510006, China.
^{2}South China Academy of Advanced Optoelectronics, South China Normal University, Guangzhou 510006, China.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhong, Q., Dai, R., Zhang, H. et al. Text-independent speaker recognition based on adaptive course learning loss and deep residual network. EURASIP J. Adv. Signal Process. 2021, 45 (2021). https://doi.org/10.1186/s13634-021-00762-2
DOI: https://doi.org/10.1186/s13634-021-00762-2