For better understanding, our previous sparse speaker adaptation, MAP-based speaker adaptation is described first. Let Φ = {π, A, Θ} be the whole parameter set of HMMs, where π is the initial state distribution, A is the transition probability matrix, and Θ is the set of GMMs for every state. The GMM distribution of state s is given as follows:
$$ p\left(\mathbf{x}\Big|{\boldsymbol{\Theta}}_s\right)=\sum_{g=1}^M{w}_{g,s}\mathcal{N}\left(\mathbf{x};{\boldsymbol{\mu}}_{g,s},{\boldsymbol{\Sigma}}_{g,s}\right) $$
(17)
where \( \mathcal{N}\left(\cdot \right) \) is a normal distribution, M is the number of Gaussian components, and w
g,s
, μ
g,s
, and Σ
g,s
are the weight, mean vector, and covariance matrix of Gaussian component g, respectively. In this paper, Σ
g,s
is set as diagonal matrix whose diagonal components are represented as [(σ
1,g,s
)2, (σ
2,g,s
)2, …, (σ
D,g,s
)2]T. Since MAP adaptation is typically performed on single state to adjust GMM parameters, we will omit the state index s and describe every procedure in terms of GMM framework. Since, in addition, it is well known that adapting mixture weights and variances is not helpful for recognition performance, we focus on how to adapt mean vectors only.
In order to adapt mean vectors of the SI model, the MAP adaptation process is composed of two major stages. In the first stage, mean vectors based on maximum likelihood (ML) criterion are computed for each mixture component of the SI model. Let X = {x
1, x
2, …, x
N
} be a set of acoustic feature vectors extracted from utterances of a target speaker. The a posteriori probability of Gaussian component g for SI model is given by
$$ p\left(g\Big|{\mathbf{x}}_n\right)=\frac{w_g^{\mathrm{SI}}\mathcal{N}\left({\mathbf{x}}_n;{\boldsymbol{\mu}}_g^{\mathrm{SI}},{\boldsymbol{\Sigma}}_g^{\mathrm{SI}}\right)}{{\displaystyle \sum_{g^{\prime }=1}^M{w}_{g\prime}^{\mathrm{SI}}\mathcal{N}\left({\mathbf{x}}_n;{\boldsymbol{\mu}}_{g\prime}^{\mathrm{SI}},{\boldsymbol{\Sigma}}_{g\prime}^{\mathrm{SI}}\right)}}. $$
(18)
With the probability of Gaussian component g, we then compute the ML mean vector:
$$ {\boldsymbol{\mu}}_g^{\mathrm{ML}}=\frac{1}{n_g}\sum_{n=1}^Np\left(g\Big|{\mathbf{x}}_n\right){\mathbf{x}}_n $$
(19)
where \( {n}_g={\varSigma}_{n=1}^Np\left(g\Big|{\mathbf{x}}_n\right) \) which is called posterior sum.
In the second stage, \( {\boldsymbol{\mu}}_g^{\mathrm{ML}} \) is used to obtain the adapted mean vector from the SI model, which is given by
$$ {\boldsymbol{\mu}}_g^{\mathrm{MAP}}=\frac{n_g{\boldsymbol{\mu}}_g^{\mathrm{ML}}+\tau {\boldsymbol{\mu}}_g^{\mathrm{SI}}}{n_g+\tau } $$
(20)
where τ is called the relevance factor which controls the balance between \( {\boldsymbol{\mu}}_g^{\mathrm{ML}} \) and \( {\boldsymbol{\mu}}_g^{\mathrm{SI}} \). By modifying (20), we can obtain
$$ {\boldsymbol{\mu}}_g^{\mathrm{MAP}}=\left({\boldsymbol{\mu}}_g^{\mathrm{ML}}-{\boldsymbol{\mu}}_g^{\mathrm{SI}}\right)\frac{n_g}{n_g+\tau }+{\boldsymbol{\mu}}_g^{\mathrm{SI}}={\boldsymbol{\varphi}}_g^{\mathrm{MAP}}+{\boldsymbol{\mu}}_g^{\mathrm{SI}}. $$
(21)
From (21), it is noticeable that \( {\boldsymbol{\varphi}}_g^{\mathrm{MAP}} \) is same as the optimal solution of the following constrained optimization problem, which is given by
$$ \begin{array}{l}\underset{{\boldsymbol{\varphi}}_g}{ \min}\frac{1}{2}\parallel {\boldsymbol{\varphi}}_g-\left({\boldsymbol{\mu}}_g^{\mathrm{ML}}-{\boldsymbol{\mu}}_g^{\mathrm{SI}}\right){\parallel}_2^2\\ {}\mathrm{s}.\mathrm{t}.\kern1em \parallel {\boldsymbol{\varphi}}_g{\parallel}_2\le \parallel {\boldsymbol{\mu}}_g^{\mathrm{ML}}-{\boldsymbol{\mu}}_g^{\mathrm{SI}}{\parallel}_2\frac{n_g}{n_g+\tau }.\end{array} $$
(22)
This constrained optimization problem is described in Fig. 2 from a geometric perspective. The shaded region implies the constraint part of (22), and the outer circle indicates the constraint when n
g
goes to infinity. As also shown in Fig. 2, the L
2 norm-based constraint can cause most of the small and redundant adjustments which can be negligible in terms of speech recognition performance. By replacing the constraint part of (22) with an L
1 norm-based constraint, we can efficiently restrict the redundant adjustments. The modified constrained optimization problem is given by
$$ \begin{array}{l}\underset{{\boldsymbol{\varphi}}_g}{ \min}\frac{1}{2}\parallel {\boldsymbol{\varphi}}_g-\left({\boldsymbol{\mu}}_g^{\mathrm{ML}}-{\boldsymbol{\mu}}_g^{\mathrm{SI}}\right){\parallel}_2^2\\ {}\mathrm{s}.\mathrm{t}.\kern1em \parallel {\boldsymbol{\varphi}}_g{\parallel}_1\le \parallel {\boldsymbol{\mu}}_g^{\mathrm{ML}}-{\boldsymbol{\mu}}_g^{\mathrm{SI}}{\parallel}_1\frac{n_g}{n_g+\tau }.\end{array} $$
(23)
The constrained optimization problem in (23) is exactly same as EPL1 except for the constraint part. As you can see in (23), the right-hand side of the constraint part is not the constant c in previous section but variables depending mostly on n
g
and τ. The posterior sum n
g
is naturally determined by the amount of adaptation data. Also, n
g
is used for considering the asymptotic property of adaptation, which means relaxation of regularization effect including sparsity as adaptation data increase. Thus, the parameter τ takes charge of controlling the sparsity and regularization instead of parameter c for speaker adaptation. Figure 3 shows how the optimal solution can have sparse vectors indicated by the red cross. Before finding the optimal solution of (23), we first define a vector which is given by
$$ {\boldsymbol{\psi}}_g=\left|{\boldsymbol{\mu}}_g^{\mathrm{ML}}-{\boldsymbol{\mu}}_g^{\mathrm{SI}}\right| $$
(24)
where |ρ| returns the vector of absolute values in ρ. To find the optimal solution of (23), we use ψ
g
for the following steps. The Lagrangian form of (23) is given by
$$ \begin{array}{c}{L}^{\mathrm{SA}\hbox{-} \mathrm{E}\mathrm{P}\mathrm{L}1}\left({\boldsymbol{\varphi}}_g,\lambda \right)=\frac{1}{2}\parallel {\boldsymbol{\varphi}}_g-{\boldsymbol{\psi}}_g{\parallel}_2^2+\lambda \left(\parallel {\boldsymbol{\varphi}}_g{\parallel}_1-\parallel {\boldsymbol{\psi}}_g{\parallel}_1\frac{n_g}{n_g+\tau}\right)\\ {}+{\boldsymbol{\kappa}}^T{\boldsymbol{\varphi}}_g.\end{array} $$
(25)
As described in Section 2, after being decoupled, the closed form solution of (23) with the optimal value λ*, and the piecewise linear function in terms of λ are given by
$$ {\varphi}_{i,g}^{\mathrm{SA}\hbox{-} \mathrm{E}\mathrm{P}\mathrm{L}1} = \max \left(0,\kern0.5em {\psi}_{i,g}-{\lambda}^{*}\right),\kern0.5em i=1,\dots, D $$
(26)
$$ {f}^{\mathrm{SA}\hbox{-} \mathrm{E}\mathrm{P}\mathrm{L}1}\left(\lambda \right)=\sum_{i\in {R}_{\lambda}^{\mathrm{SA}\hbox{-} \mathrm{E}\mathrm{P}\mathrm{L}1}}{\psi}_{i,g}-\left|{R}_{\lambda}^{\mathrm{SA}\hbox{-} \mathrm{E}\mathrm{P}\mathrm{L}1}\right|\lambda -\parallel {\boldsymbol{\psi}}_g{\parallel}_1\frac{n_g}{n_g+\tau } $$
(27)
where \( {R}_{\lambda}^{\mathrm{SA}\hbox{-} \mathrm{E}\mathrm{P}\mathrm{L}1}=\left\{i\Big|i\in \left\{1,\dots, D\right\},{\psi}_{i,g}>\lambda \right\} \). As also described in Section 2, λ* can be obtained by the sequence {λ
k} from f
SA ‐ EPL1(λ), which is given by
$$ {\lambda}^k=\frac{{\displaystyle \sum_{i\in {R}_{\lambda}^{\mathrm{SA}\hbox{-} \mathrm{E}\mathrm{P}\mathrm{L}1}}{\psi}_{i,g}}-\parallel {\boldsymbol{\psi}}_g{\parallel}_1\frac{n_g}{n_g+\tau }}{\left|{R}_{\lambda}^{\mathrm{SA}\hbox{-} \mathrm{E}\mathrm{P}\mathrm{L}1}\right|} $$
(28)
when f(λ
k) = 0 is satisfied. Thus, the final adapted mean vector from EPL1-based sparse speaker adaptation is given as follows:
$$ {\boldsymbol{\mu}}_g^{\mathrm{SA}\hbox{-} \mathrm{E}\mathrm{P}\mathrm{L}1}=sign\left({\boldsymbol{\mu}}_g^{\mathrm{ML}}-{\boldsymbol{\mu}}_g^{\mathrm{SI}}\right)\odot {\boldsymbol{\varphi}}_g^{\mathrm{SA}\hbox{-} \mathrm{E}\mathrm{P}\mathrm{L}1}+{\boldsymbol{\mu}}_g^{\mathrm{SI}}. $$
(29)