As shown in Fig. 1, the overall framework consists of three parts: speech feature extraction, the deep residual network Res-CASP, and the ACLL. Feature extraction converts the original speech into 64-dimensional log filter bank feature vectors. The Res-CASP framework includes a ResNet, a CASP layer, and a fully connected (FC) layer. The CASP layer aggregates the frame-level features of the ResNet into utterance-level features, and the FC layer maps the utterance-level feature vectors to a 512-dimensional representation. The ACLL is the loss function that optimizes the output of the Res-CASP framework. The trained Res-CASP model is used for the final speaker recognition.
2.1 Log filter bank features extraction
The original speech signal is a one-dimensional time-domain signal, whereas the input of the deep residual network is two-dimensional. Generally, there are two main ways to extract speech features: MFCC [19] and log filter bank [20]. MFCC is derived from the log filter bank by applying DCT decorrelation, so the log filter bank retains more information than MFCC, better reflects the nature of the speech signal, and fits the characteristics of human auditory perception. Therefore, the original speech signal is first converted into log filter bank feature vectors. The specific steps of log filter bank feature extraction [19, 20] are as follows:
Pre-emphasis is a high-pass filter whose purpose is to boost the high-frequency components of the signal. For acoustic feature extraction, the pre-emphasis filter is defined in Eq. 1.
$$ \begin{aligned} & H(z)=1-\alpha z^{-1} \end{aligned} $$
(1)
where α is the pre-emphasis coefficient and z is the complex variable of the z-transform; H(z) is the transfer function applied to the original speech signal.
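As a concrete illustration of Eq. 1, the following is a minimal NumPy sketch of the pre-emphasis filter in the time domain; the function name and the coefficient value α = 0.97 are illustrative assumptions, not values specified in this work.

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Time-domain form of Eq. 1: y[n] = x[n] - alpha * x[n-1].
    alpha = 0.97 is a commonly used value and an assumption here."""
    # The first sample has no predecessor, so it is passed through unchanged.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```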
By dividing the speech signal into short frames, the signal within each frame can be regarded as quasi-stationary and processed accordingly. At the same time, adjacent frames partially overlap so that the parameters change smoothly from frame to frame.
The purpose of the windowing function is to reduce spectral leakage in the frequency domain. After pre-emphasis and framing, each frame of the speech signal is multiplied by a Hamming window of the given frame length to increase the continuity at the frame boundaries [19]. The calculation process is shown in Eq. 2.
$$ \begin{aligned} & T(n)=S(n)\times (0.54-0.46cos[2\pi n/(N-1)]),\ 0\leq n\leq N-1 \end{aligned} $$
(2)
where S(n) is the speech signal after pre-emphasis and framing, and N is the frame length.
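A minimal sketch of framing and Hamming windowing (Eq. 2) is given below; the 25 ms frame length and 10 ms frame shift are common choices assumed for illustration and are not values taken from the text.

```python
import numpy as np

def frame_and_window(signal: np.ndarray, sample_rate: int,
                     frame_len_s: float = 0.025,
                     frame_shift_s: float = 0.010) -> np.ndarray:
    """Split the pre-emphasized signal into overlapping frames and apply
    the Hamming window of Eq. 2 (assumes the signal is at least one frame long)."""
    frame_len = int(round(frame_len_s * sample_rate))
    frame_shift = int(round(frame_shift_s * sample_rate))
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    # Hamming window: 0.54 - 0.46 * cos(2*pi*n / (N - 1)), 0 <= n <= N - 1
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    frames = np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(num_frames)])
    return frames * window
```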
Then, a fast Fourier transform is applied to each frame of the speech signal to convert the time-domain data into frequency-domain data, as shown in Eq. 3.
$$ \begin{aligned} & X(k)=\sum_{n=0}^{N-1}T(n)e^{-j2\pi nk/P},\ 0\leq k\leq P-1 \end{aligned} $$
(3)
where T(n) is the windowed speech signal, P is the number of Fourier transform points, and k is the frequency index (k=0,1,2,...,P−1).
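A short sketch of Eq. 3 and of the energy spectrum |X(k)|² used in the next step is given below; P = 512 Fourier points is an assumed typical value.

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Per-frame FFT (Eq. 3) followed by the energy spectrum |X(k)|^2.
    n_fft plays the role of P; rfft keeps only the non-redundant half
    of the spectrum (k = 0 .. P/2)."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=-1)
    return np.abs(spectrum) ** 2
```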
The energy spectrum is then fed into L triangular band-pass filters Hm(k) [18]. In the frequency domain, the energy spectrum |X(k)|2 is multiplied by the frequency response Hm(k) of each filter and the products are summed. The calculation process is shown in Eq. 4.
$$ \begin{aligned} & M(m)=\sum_{k=0}^{P-1}\left | X(k) \right |^{2}H_{m}(k) \end{aligned} $$
(4)
where X(k) is the signal after the fast Fourier transform, and Hm(k) is the mth triangular band-pass filter, whose frequency response is shown in Eq. 5.
$$ \begin{aligned} & H_{m}(k)=\left \{ \begin{array}{ll} 0, & k< f(m-1)\\ \frac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\leq k< f(m)\\ \frac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)\leq k\leq f(m+1)\\ 0, & k>f(m+1) \end{array} \right. \end{aligned} $$
(5)
where f(m) is the center frequency of Hm(k), 0≤m<L, and L is the number of band-pass filters.
For the mth filter, the log filter bank energy is defined in Eq. 6.
$$ \begin{aligned} & e(m)=log(M(m)) \end{aligned} $$
(6)
where M(m) is the output energy of the mth mel filter; the values e(m) over all L filters form the log filter bank feature vector of the frame.
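To make Eqs. 4-6 concrete, the sketch below builds the triangular filters Hm(k), applies them to the energy spectrum, and takes the logarithm. The HTK-style mel formula used to place the center frequencies f(m) is a standard assumption rather than a detail given in the text. With num_filters = 64, chaining pre_emphasis, frame_and_window, power_spectrum, and these two helpers yields 64-dimensional log filter bank feature vectors as described in the framework overview.

```python
import numpy as np

def mel_filterbank(num_filters: int, n_fft: int, sample_rate: int) -> np.ndarray:
    """Triangular band-pass filters H_m(k) of Eq. 5 with centre frequencies
    f(m) equally spaced on the mel scale (standard mel formula, assumed)."""
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)

    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    fbank = np.zeros((num_filters, n_fft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):                 # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def log_filterbank_features(power_spec: np.ndarray, fbank: np.ndarray) -> np.ndarray:
    """Eq. 4 (filter-bank energies M(m)) followed by Eq. 6 (log compression)."""
    energies = power_spec @ fbank.T                           # M(m) = sum_k |X(k)|^2 H_m(k)
    return np.log(np.maximum(energies, np.finfo(float).eps))  # e(m) = log(M(m))
```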
2.2 Structure of residual block
ResNet alleviates the difficulty of training deep convolutional neural networks [21, 22]. If the later layers of a deep network learn an identity mapping, that is, h(x)=x, the model degenerates into a shallower network, and the identity mapping greatly reduces the number of parameters that must be trained. The ResNet is composed of many stacked residual blocks (Res-Blocks), whose structure is shown in Fig. 2. Each Res-Block consists of two 2D convolutional layers, and an identity mapping connects the input feature vectors of the Res-Block to its output feature vectors. The expression defined by a Res-Block is shown in Eq. 7.
$$ \begin{aligned} & \Gamma =F(x,W_{i})+x \end{aligned} $$
(7)
where x and Γ are the input and output feature vectors, respectively, Wi denotes the learnable weights, and F(x,Wi) is the output of the residual mapping. In addition, the identity mapping connection does not add additional parameters or computational complexity.
In order to make full use of the feature learning capability of ResNet and reduce the loss of feature information, we use the identity mapping to reduce the data dimensions. In addition, in each convolutional layer the stride is set to 1, the padding is set to SAME, and zero padding is used to prevent information from being lost at the edges of the feature maps.
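A minimal PyTorch sketch of a Res-Block implementing Eq. 7 under these settings is given below. The 3×3 kernels, the channel count argument, and the BatchNorm/ReLU placement are assumptions; the text only specifies two 2D convolutions with stride 1, SAME (zero) padding, and the identity connection.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block of Eq. 7: output = F(x, W_i) + x.
    Two 3x3 2D convolutions with stride 1 and zero ('SAME') padding;
    kernel size and BatchNorm/ReLU placement are assumptions."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x                              # identity mapping, no extra parameters
        out = self.relu(self.bn1(self.conv1(x)))  # F(x, W_i)
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)          # Gamma = F(x, W_i) + x
```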
2.3 Structure of CASP
By combining higher-order statistics with an attention mechanism, the ASP was proposed [18]. It provides importance-weighted standard deviations as well as weighted means of the frame-level features, where the importance is computed by the attention mechanism. Previous work, however, has been evaluated only on limited tasks such as fixed-duration text-independent speaker recognition [18, 23]. Therefore, we propose a new pooling method, called CASP, which aggregates the frame-level features of the deep residual network into utterance-level features. This enables the speaker embedding to capture speaker factors with respect to long-term variations more accurately and efficiently. The calculation process of the CASP layer is shown in Fig. 3.
Firstly, the frame-level feature vectors {x1,x2,...,xT} of the deep residual network are projected through one-dimensional convolutional layers to obtain the hidden-unit representations {h1,h2,...,hT}.
Secondly, the score of each frame is normalized over all frames by a softmax function, which indicates the relative importance of that frame. The weight calculation is shown in Eq. 8.
$$ \begin{aligned} & w_{t}=\frac{exp(h_{t})}{\sum_{t=1}^{T}exp(h_{t})} \end{aligned} $$
(8)
where ht is the hidden-unit feature of the tth frame, and wt is the normalized attention weight of that frame.
Therefore, the utterance-level feature can be expressed as the weighted sum of the frame-level features, as shown in Eq. 9.
$$ \begin{aligned} & e=\sum_{t=1}^{T}w_{t}x_{t} \end{aligned} $$
(9)
where xt is the frame-level feature vector, and e is the weighted mean vector produced by the pooling layer.
Finally, higher-order statistics are combined with the attention mechanism, which is the essence of CASP: both the mean and the standard deviation are generated through the attention weights. The weighted standard deviation is defined in Eq. 10.
$$ \begin{aligned} & \sigma =\sqrt{\sum_{t=1}^{T}w_{t}x_{t}\odot x_{t}-e\odot e} \end{aligned} $$
(10)
where σ is the weighted standard deviation; in this way, the advantages of higher-order statistics and of the attention mechanism are both exploited through the weighted standard deviation.
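The following PyTorch sketch illustrates the CASP computation of Eqs. 8-10: a one-dimensional convolutional attention branch scores each frame, the softmax gives the weights wt, and the weighted mean and weighted standard deviation are pooled over time. The class name, the hidden size, the Tanh nonlinearity, and the concatenation of mean and standard deviation into one utterance-level vector are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CASP(nn.Module):
    """Sketch of the CASP layer: attention weights (Eq. 8), weighted mean
    (Eq. 9), and weighted standard deviation (Eq. 10)."""
    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=1),  # project frames to hidden units h_t
            nn.Tanh(),
            nn.Conv1d(hidden_dim, 1, kernel_size=1),         # one attention score per frame
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim, T) frame-level features from the residual network
        scores = self.attention(x)                 # (batch, 1, T)
        w = F.softmax(scores, dim=2)               # Eq. 8: weights normalized over all frames
        mean = torch.sum(w * x, dim=2)             # Eq. 9: weighted mean e
        var = torch.sum(w * x * x, dim=2) - mean * mean
        std = torch.sqrt(var.clamp(min=1e-8))      # Eq. 10: weighted standard deviation
        return torch.cat([mean, std], dim=1)       # utterance-level vector, (batch, 2 * feat_dim)
```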
2.4 Structure of ACLL
Loss function design is pivotal for large-scale speaker recognition. Current state-of-the-art deep speaker recognition methods mostly adopt softmax-based classification losses [12]. Since features learned with the original softmax loss are not guaranteed to be discriminative enough for practical speaker recognition, margin-based losses [24–26] have been proposed. Although margin-based loss functions achieve good performance, they do not take the difficulty of each sample into consideration, whereas the ACLL emphasizes easy samples first and hard samples later, which is more reasonable and effective. The original softmax loss is formulated as follows:
$$ \begin{aligned} & L1=-\frac{1}{N}\sum_{i=1}^{N}log\frac{e^{W_{y_{i}}^{T}x_{i}+b_{y_{i}}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_{i}+b_{j}}} \end{aligned} $$
(11)
where xi∈Rd denotes the deep feature of the ith sample, which belongs to class yi, Wj∈Rd denotes the jth column of the weight matrix W∈Rd×n, and bj is the bias term. The number of classes and the embedding feature size are n and d, respectively. In practice, the bias is usually set to bj=0, and each weight is normalized to ∥Wj∥=1 by l2 normalization. The deep feature is also normalized and re-scaled to s. Thus, the original softmax can be modified as follows:
$$ \begin{aligned} & L=-\frac{1}{N}\sum_{i=1}^{N}log\frac{e^{s\cdot \varphi (cos\theta_{y_{i}})}}{e^{s\cdot \varphi (cos\theta_{y_{i}})}+\sum_{j\neq y_{i}}e^{s\cdot Y(t,cos\theta_{j})}} \end{aligned} $$
(12)
where \(\varphi (cos\theta _{y_{i}})\) and Y(t,cosθj) adjust the positive and negative cosine similarities, respectively. cosθj is the cosine similarity between the deep feature xi and the weight Wj, s is a scaling coefficient that speeds up model convergence, and N is the total number of classified samples. In margin-based loss functions such as AM-Softmax [24], \(\varphi (cos\theta _{y_{i}})=cos\theta _{y_{i}}-m\) and Y(t,cosθj)=cosθj; in AAM-Softmax [25], \(\varphi (cos\theta _{y_{i}})=cos(\theta _{y_{i}}+m)\) and Y(t,cosθj)=cosθj. However, these losses only modify the cosine similarity of each sample by a fixed margin to enhance feature discrimination and cannot adapt to various situations.
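For reference, a small sketch of the two positive-cosine modifications mentioned above is given below; the function names and the margin value m = 0.2 are assumed for illustration.

```python
import math
import torch

def phi_am_softmax(cos_theta: torch.Tensor, m: float = 0.2) -> torch.Tensor:
    """AM-Softmax positive term: cos(theta_yi) - m (additive cosine margin)."""
    return cos_theta - m

def phi_aam_softmax(cos_theta: torch.Tensor, m: float = 0.2) -> torch.Tensor:
    """AAM-Softmax positive term: cos(theta_yi + m), expanded with the
    angle-addition identity so no explicit arccos is needed."""
    sin_theta = torch.sqrt((1.0 - cos_theta * cos_theta).clamp(min=0.0))
    return cos_theta * math.cos(m) - sin_theta * math.sin(m)
```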
Therefore, ACLL is proposed [27]. The ACLL is defined as Eq. 13.
$$ \begin{aligned} & L_{ACLL}=-\frac{1}{N}\sum_{i=1}^{N}log\frac{e^{s\cdot \varphi (t,cos\theta_{y_{i}})}}{e^{s\cdot \varphi (t,cos\theta_{y_{i}})}+\sum_{j\neq y_{i}}e^{s\cdot Y(t^{(k)},cos\theta_{j})}} \end{aligned} $$
(13)
where \(\varphi (t,cos\theta _{y_{i}})\) and Y(t,cosθj) are defined by Eqs. 14 and 15, respectively, and s is a scaling factor of the deep feature vectors. Note that the positive cosine similarity can adopt any margin-based loss function; here, we adopt AAM-Softmax as an example. In the early training stage, learning from easy samples is beneficial to model convergence, so t should be close to zero and the modulation coefficient I(·)=t+cosθj is smaller than 1. The weights of hard samples are therefore reduced, and easy samples are emphasized relatively. As training goes on, the model gradually focuses on hard samples, i.e., the value of t increases and I(·) becomes larger than 1, so the hard samples are emphasized with larger weights. Moreover, within the same training stage, I(·) is monotonically decreasing with θj, so that a harder sample is assigned a larger coefficient according to its difficulty. The value of t is estimated automatically in the ACLL, which would otherwise require considerable manual tuning effort. In this way, the relative importance of easy and hard samples is adjusted adaptively.
$$ \begin{aligned} & \varphi (t,cos\theta_{y_{i}})=cos(\theta_{y_{i}}+m) \end{aligned} $$
(14)
where m is the angular margin between different categories, \(\theta _{y_{i}}\) is the angle between the deep feature xi and the weight \(W_{y_{i}}\), and t is the adaptive estimation parameter.
$$ \begin{aligned} & Y(t,cos\theta_{j})=\left \{ \begin{array}{ll} cos\theta_{j}, & \varphi (t,cos\theta_{y_{i}})\geq cos\theta_{j}\\ cos\theta_{j}(t+cos\theta_{j}), & \varphi (t,cos\theta_{y_{i}})< cos\theta_{j} \end{array} \right. \end{aligned} $$
(15)
where t is the adaptive estimation parameter; an exponential moving average (EMA) is used to update it adaptively. The process is shown in Eq. 16.
$$ \begin{aligned} & t^{(k)}=\alpha r^{(k)}+(1-\alpha)t^{(k-1)} \end{aligned} $$
(16)
where r(k) is the mean of the positive cosine similarities of the kth batch and α is the EMA momentum coefficient. With the EMA, we avoid hyper-parameter tuning and make the modulation coefficient I(·) of the hard-sample negative cosine similarities adaptive to the current training stage.
As shown in Fig. 4, the decision condition changes from \(cos\theta _{y_{i}}=cos\theta _{j}\) (blue line) to \(cos (\theta _{y_{i}}+m)=cos\theta _{j}\) (yellow line). When the ACLL is applied to adaptively adjust the weights of hard samples, the decision condition becomes \(cos(\theta _{y_{i}}+m)=(t+cos\theta _{j})cos\theta _{j}\) (green line). During training, the decision boundary for hard samples moves from one green line (early stage) to another green line (later stage): easy samples are emphasized first, and hard samples are emphasized later. In addition, AAM-Softmax is used for the positive cosine similarity, namely \(\varphi (t,cos\theta _{y_{i}})=cos(\theta _{y_{i}}+m)\). It can be seen from Eq. 15 that Y(t,cosθj)=cosθj at the beginning of training.
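A sketch of the ACLL of Eqs. 13-16 is given below, using the AAM-Softmax positive term as in the text. The scale s, margin m, EMA coefficient α, and the class-weight initialization are assumed values, and the batch statistic r(k) is taken here as the mean positive cosine similarity of the batch; this is a minimal illustration, not the authors' exact implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACLLoss(nn.Module):
    """Sketch of the ACLL (Eqs. 13-16) with the AAM-Softmax positive term.
    The values of s, m, and alpha are assumptions."""
    def __init__(self, feat_dim: int, num_classes: int,
                 s: float = 30.0, m: float = 0.2, alpha: float = 0.01):
        super().__init__()
        self.s, self.m, self.alpha = s, m, alpha
        self.weight = nn.Parameter(torch.empty(num_classes, feat_dim))
        nn.init.xavier_normal_(self.weight)
        self.register_buffer("t", torch.zeros(1))  # adaptive parameter, starts near zero

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarities between l2-normalized features and class weights.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        cos_yi = cos.gather(1, labels.view(-1, 1))                   # positive cosine similarity
        sin_yi = torch.sqrt((1.0 - cos_yi * cos_yi).clamp(min=0.0))
        phi = cos_yi * math.cos(self.m) - sin_yi * math.sin(self.m)  # Eq. 14: cos(theta_yi + m)

        # Eq. 16: EMA update of t with the batch-mean positive cosine r^(k).
        # A small alpha keeps t growing slowly, so easy samples dominate early training.
        with torch.no_grad():
            self.t.mul_(1.0 - self.alpha).add_(self.alpha * cos_yi.mean())

        # Eq. 15: modulate only the hard negatives, i.e. classes with cos(theta_j) > phi.
        hard = cos > phi
        cos_mod = torch.where(hard, cos * (self.t + cos), cos)

        # Eq. 13: place the positive term in the true-class column and scale by s.
        logits = cos_mod.scatter(1, labels.view(-1, 1), phi)
        return F.cross_entropy(self.s * logits, labels)
```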
2.5 Evaluation indicators
We evaluate the framework with the equal error rate (EER). The EER is the operating point at which the false rejection (FR) rate equals the false acceptance (FA) rate, where a false rejection occurs when a correct (target) trial is recognized as wrong, and a false acceptance occurs when a wrong (impostor) trial is recognized as correct. The FR rate and FA rate are defined in Eqs. 17 and 18.
$$ \begin{aligned} & I_{FR}=\frac{N_{FR}}{N_{Target}} \end{aligned} $$
(17)
where NFR is the number of false rejections, and NTarget is the total number of target trials.
$$ \begin{aligned} & I_{FA}=\frac{N_{FA}}{N_{impostor}} \end{aligned} $$
(18)
where NFA is the number of false acceptances, and Nimpostor is the total number of impostor trials.
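A simple sketch of the EER computation from trial scores is given below: the decision threshold is swept over all observed scores, Eqs. 17 and 18 are evaluated at each threshold, and the EER is reported where the two rates are closest. The function and variable names are illustrative.

```python
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal error rate from trial scores.
    `labels` are 1 for target (same-speaker) trials and 0 for impostor trials."""
    thresholds = np.sort(np.unique(scores))
    n_target = np.sum(labels == 1)
    n_impostor = np.sum(labels == 0)
    best_gap, eer = np.inf, 1.0
    for th in thresholds:
        fr = np.sum((scores < th) & (labels == 1)) / n_target     # I_FR = N_FR / N_Target  (Eq. 17)
        fa = np.sum((scores >= th) & (labels == 0)) / n_impostor  # I_FA = N_FA / N_impostor (Eq. 18)
        if abs(fr - fa) < best_gap:
            best_gap, eer = abs(fr - fa), (fr + fa) / 2.0
    return eer
```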