# Scaled norm-based Euclidean projection for sparse speaker adaptation

- Younggwan Kim^{1} (corresponding author)
- Myung Jong Kim^{1}
- Hoirin Kim^{1}

**2015**:102

https://doi.org/10.1186/s13634-015-0290-2

© Kim et al. 2015

**Received:** 24 June 2015 · **Accepted:** 23 November 2015 · **Published:** 1 December 2015

## Abstract

To reduce data storage for speaker adaptive (SA) models, in our previous work, we proposed a sparse speaker adaptation method which can efficiently reduce the number of adapted parameters by using Euclidean projection onto the *L*_{1}-ball (EPL1) while maintaining recognition performance comparable to maximum *a posteriori* (MAP) adaptation. In the EPL1-based sparse speaker adaptation framework, however, the adapted components of the Gaussian mean vectors are mostly concentrated on dimensions having large variances because unit variance is assumed for all dimensions. To make EPL1 more flexible, in this paper, we propose scaled norm-based Euclidean projection (SNEP), which can consider dimension-specific variances. By using SNEP, we also propose a new sparse speaker adaptation method which can consider the variances of a speaker-independent model. Our experiments show that, with the proposed method, the adapted components of the mean vectors are evenly distributed over all dimensions, and sparsely adapted models can be obtained with no loss of phone recognition performance compared with MAP adaptation.

## Keywords

- Euclidean projection onto the *L*_{1}-ball
- MAP adaptation
- Scaled norm-based Euclidean projection
- Sparse speaker adaptation

## 1 Introduction

Today, modern server-based speech recognition systems (SRSs) serve millions of users. For this reason, reducing data storage for speaker adaptive (SA) acoustic models becomes an important issue when speaker adaptation is used to enhance speech recognition performance. There are various adaptation methods for Gaussian mixture model-hidden Markov model (GMM-HMM)-based SRSs [1–5]. Among those methods, maximum *a posteriori* (MAP) speaker adaptation is the most conventional and powerful method when a relatively large amount of adaptation data, about 20 min to 10 h long, is available [6, 7].

SA models obtained by MAP adaptation require as much data storage as a speaker-independent (SI) model, and the SI model typically has billions of parameters. Olsen et al. showed that most of the adapted parameters obtained by MAP adaptation are not closely related to speech recognition performance [6, 7]. To restrict the redundant parameter adjustments, they proposed sparse MAP (SMAP) adaptation, in which the typical MAP objective is maximized subject to certain sparsity constraints. In the SMAP approach, two sets of optimization parameters need to be controlled. The first set is related to the parameter regularization used in typical MAP adaptation. The second set is used to restrict the redundant parameter adjustments. However, the more parameters there are, the harder they become to tune, because they are chosen empirically to give the best recognition performance.

To resolve the aforementioned problem, in our previous work, we first reinterpreted MAP adaptation as a constrained optimization problem with an *L*_{2} norm-based constraint [8, 9]. To obtain sparsely updated SA models, we replaced the *L*_{2} norm-based constraint with an *L*_{1} norm-based constraint. From this modification, we proposed a sparse adaptation method based on Euclidean projection onto the *L*_{1}-ball (EPL1) [10], which requires only a single control parameter. Using this method, we showed that SA models need less data storage than with the SMAP adaptation method, with almost no loss of phone recognition performance. Although the number of control parameters can be dramatically reduced, EPL1-based speaker adaptation still has the limitation that variances cannot be considered. Because of this limitation, only parameters having large variances are adapted during the adaptation step. However, we believe that parameters with small variances can also reflect speaker characteristics. Thus, in this paper, we propose scaled norm-based Euclidean projection (SNEP), a generalized version of EPL1 that utilizes dimension-specific variances. Within the SNEP framework, we also propose a new sparse speaker adaptation method. Our experiments show that the proposed SNEP-based speaker adaptation method can sparsely adapt the SI model (only about 9 % of the total number of parameters) with no loss of phone recognition performance compared with MAP adaptation.

The rest of this paper is organized as follows. In Section 2, we introduce EPL1 and a piecewise root finding (PRF) method which is a well-known solver for EPL1 [11, 12]. In Section 3, from the derivation of EPL1, we describe the modified optimization problem and how to find the optimal solution of SNEP. In Section 4, we briefly review MAP- and EPL1-based speaker adaptation. In Section 5, we describe our SNEP-based sparse speaker adaptation method using the variances of the SI model. In Section 6, we analyze our experimental results on adapted mean vectors and speech recognition performance. We conclude this paper in Section 7.

## 2 Euclidean projection onto the *L*_{1}-ball

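Euclidean projection onto the *L*_{1}-ball (EPL1) is widely used in gradient projection methods [13–18], which find the optimal sparse solution of a constrained optimization problem given by

\[ \min_{\mathbf{x}} \; \mathcal{L}\left(\mathbf{x}\right) \quad \text{subject to} \quad {\left\Vert \mathbf{x}\right\Vert}_1 \le c, \tag{1} \]

where ℒ(⋅): *ℝ*^{D} → *ℝ* is a convex and differentiable loss function, || ⋅ ||_{1} indicates the *L*_{1} norm operator enforcing the sparse solution, and *c* is a constant controlling regularization and sparsity, that is, how many zeros are in the optimal solution vector. Gradient projection with Nesterov’s method [19–22] is an optimal first-order black-box method and can find the optimal solution of (1) by generating a sequence {**x**^{k}} obtained from

\[ {\mathbf{x}}^{k+1} = {\prod}_{L_1}\left( {\mathbf{s}}^k - {\eta}_k \nabla \mathcal{L}\left({\mathbf{s}}^k\right) \right), \tag{2} \]

where **s**^{k} = **x**^{k} + *α*_{k}(**x**^{k} − **x**^{k − 1}), *α*_{k} and *η*_{k} are learning rates selected by certain rules [23], ∇ℒ(**s**^{k}) is the gradient of ℒ(⋅) at **s**^{k}, and \( {\prod}_{L_1}\left(\mathbf{y}\right) \) is the EPL1 problem defined as

\[ {\prod}_{L_1}\left(\mathbf{y}\right) = \arg\min_{\mathbf{x}} \; \frac{1}{2}{\left\Vert \mathbf{x} - \mathbf{y}\right\Vert}_2^2 \quad \text{subject to} \quad {\left\Vert \mathbf{x}\right\Vert}_1 \le c, \tag{3} \]

where || ⋅ ||_{2} is the *L*_{2} norm operator. In practice, (3) is modified into another constrained optimization problem given by

\[ \min_{\mathbf{u}} \; \frac{1}{2}{\left\Vert \mathbf{u} - \mathbf{z}\right\Vert}_2^2 \quad \text{subject to} \quad {\left\Vert \mathbf{u}\right\Vert}_1 \le c, \;\; \mathbf{u} \succcurlyeq \mathbf{0}, \tag{4} \]

where **z** is composed of the absolute values of the components in **y**, ≽ denotes component-wise inequality, and **0** is a vector with all zero components. The optimal solution of (3) can be obtained by

\[ {\mathbf{x}}^{*} = sign\left(\mathbf{y}\right) \odot {\mathbf{u}}^{*}, \tag{5} \]

where *sign*(**ρ**) returns the vector whose components are the signs of the components in **ρ**, ⊙ is component-wise multiplication of two vectors, and **u*** is the optimal solution of (4), which can be found from the Lagrangian function given by

\[ L\left(\lambda, \boldsymbol{\kappa}, \mathbf{u}\right) = \frac{1}{2}{\left\Vert \mathbf{u} - \mathbf{z}\right\Vert}_2^2 + \lambda \left( {\left\Vert \mathbf{u}\right\Vert}_1 - c \right) - {\boldsymbol{\kappa}}^T \mathbf{u}, \tag{6} \]

where *λ* and **κ** are the Lagrangian multipliers. We assume that the optimal value *λ** is known and ||**z**||_{1} > *c*. Since the components in (6) can be decoupled, the closed-form solution is as follows [10]:

\[ {u}_i^{*} = \max \left( {z}_i - {\lambda}^{*}, 0 \right), \tag{7} \]

where \( {u}_i^{*} \) is the *i*th component of **u*** and *i* is the component index; the constraints of (4) can be expressed as

\[ \sum_{i=1}^{D} \max \left( {z}_i - {\lambda}^{*}, 0 \right) = c. \tag{8} \]

To find the optimal value of *λ*, a piecewise linear function [11, 12] is used, which is given by

\[ f\left(\lambda\right) = \sum_{i \in {R}_{\lambda}} {z}_i - \left| {R}_{\lambda} \right| \lambda - c, \tag{9} \]

where *R*_{λ} = {*i* | *i* ∈ {1, …, *D*}, *z*_{i} > *λ*} and |*R*| is the number of elements in the set *R*. Figure 1 shows an illustration of *f*(*λ*) and a first-order gradient-based iterative method called piecewise root finding (PRF) [12] for the optimal value of *λ*. With the PRF method, we can generate a sequence {*λ*^{k}} via

\[ {\lambda}^{k+1} = {\lambda}^k + \frac{f\left({\lambda}^k\right)}{\left| {R}_{\lambda^k} \right|}, \tag{10} \]

until *f*(*λ*^{k}) = 0 is satisfied. As shown in Fig. 1, each *λ*^{k} for *k* ≥ 1 represents the root of a tangent line. To determine the set \( {R}_{\lambda^k} \), every component of **z** needs to be compared with *λ*^{k}. If we set the initial value of *λ* to 0, the sequence {*λ*^{k}} is non-decreasing. Owing to this property, in the *k*th step, we can skip the comparisons for the components already found to be less than *λ*^{k − 1}.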
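The projection and the PRF iteration described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `project_l1_ball`, the tolerance `tol`, and the iteration cap `max_iter` are hypothetical, and the active set is recomputed in full at every step rather than skipping components already below the previous threshold.

```python
import numpy as np

def project_l1_ball(y, c, tol=1e-12, max_iter=100):
    """Euclidean projection of y onto {x : ||x||_1 <= c}, for c > 0.

    Hypothetical helper: the root of the piecewise linear function
    f(lambda) is found by tangent-line (Newton) steps from lambda = 0.
    """
    y = np.asarray(y, dtype=float)
    z = np.abs(y)                      # work on magnitudes, restore signs later
    if z.sum() <= c:                   # y is already inside the ball
        return y.copy()
    lam = 0.0
    for _ in range(max_iter):
        active = z > lam               # the set R_lambda
        n_active = active.sum()
        f = z[active].sum() - n_active * lam - c
        if f <= tol:                   # f(lambda) = 0 up to rounding
            break
        lam += f / n_active            # root of the tangent line at lambda
    u = np.maximum(z - lam, 0.0)       # soft threshold with the final lambda
    return np.sign(y) * u
```

For example, projecting `[3.0, -1.0, 0.5]` with `c = 2.0` shrinks every magnitude by the same threshold, leaving only the largest component non-zero up to floating-point rounding.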

## 3 Scaled norm-based Euclidean projection

The *L*_{2} and *L*_{1} norms in EPL1 can be interpreted as a multivariate Gaussian distribution with unit variance and a multivariate Laplace distribution with unit standard deviation, respectively [24]. Hence, every component in EPL1 is treated equally during optimization, without considering any scaling parameters such as dimension-specific variances and standard deviations. For this reason, we propose a scaled norm-based Euclidean projection (SNEP) method, which is a generalized version of EPL1. The proposed constrained optimization problem for SNEP is given by

where *σ*_{2,i} and *σ*_{1,i} denote the scaling parameters for the *L*_{2} and *L*_{1} norms, respectively. As shown in (11), we can apply any dimension-specific scaling parameters to the SNEP framework. The Lagrangian function of (11) and its derivative with respect to *u*_{i} are given by

By setting *dL*^{SNEP}(*λ*, **u**)/*du*_{i} = 0 and considering the complementary slackness KKT condition, the optimal value \( {u}_i^{*} \) is given by

with the optimal value *λ**. By using (14), the piecewise linear function for SNEP is given by

To satisfy *f*^{SNEP}(*λ*) = 0, the sequence {*λ*^{k}} from *f*^{SNEP}(*λ*) is generated as

where *λ*^{0} = 0.
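As an illustration of the SNEP iteration, the sketch below assumes one concrete form of the scaled problem; this form is an assumption made for the example and is not taken verbatim from the text: minimize \( \frac{1}{2}\sum_i {\left({u}_i - {z}_i\right)}^2/{\sigma}_{2,i}^2 \) subject to \( \sum_i {u}_i/{\sigma}_{1,i} \le c \) and **u** ≽ **0**. Under that assumption, stationarity of the Lagrangian gives \( {u}_i^{*} = \max\left({z}_i - {\lambda}^{*}{\sigma}_{2,i}^2/{\sigma}_{1,i}, 0\right) \), and the root of the resulting piecewise linear function is found by the same tangent-line update, starting from *λ*^{0} = 0. The names `snep_project`, `tol`, and `max_iter` are hypothetical.

```python
import numpy as np

def snep_project(y, c, sigma2, sigma1, tol=1e-12, max_iter=100):
    """Scaled projection under the ASSUMED objective described in the lead-in.

    sigma2 scales the quadratic (L2) term, sigma1 scales the L1 constraint.
    With all scales equal to 1 this reduces to plain EPL1.
    """
    y = np.asarray(y, dtype=float)
    s2 = np.asarray(sigma2, dtype=float)
    s1 = np.asarray(sigma1, dtype=float)
    z = np.abs(y)
    if (z / s1).sum() <= c:            # scaled L1 constraint already holds
        return y.copy()
    shrink = s2 ** 2 / s1              # per-dimension shrinkage per unit lambda
    slope = shrink / s1                # per-dimension contribution to |f'(lambda)|
    lam = 0.0
    for _ in range(max_iter):
        active = z > lam * shrink      # components that stay non-zero
        f = (z[active] / s1[active]).sum() - lam * slope[active].sum() - c
        if f <= tol:                   # root reached up to rounding
            break
        lam += f / slope[active].sum() # root of the tangent line at lambda
    u = np.maximum(z - lam * shrink, 0.0)
    return np.sign(y) * u
```

With `y = [3.0, 1.0]`, both scale vectors set to `[3.0, 1.0]`, and `c = 1.0`, each component is shrunk in proportion to its own scale, so the small-variance dimension is not forced to zero merely because its magnitude is small.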

## 4 Previous work for speaker adaptation

The GMM likelihood for HMM state *s* is given as follows:

where *M* is the number of Gaussian components, and *w*_{g,s}, **μ**_{g,s}, and **Σ**_{g,s} are the weight, mean vector, and covariance matrix of Gaussian component *g*, respectively. In this paper, **Σ**_{g,s} is a diagonal matrix whose diagonal components are [(*σ*_{1,g,s})^{2}, (*σ*_{2,g,s})^{2}, …, (*σ*_{D,g,s})^{2}]^{T}. Since MAP adaptation is typically performed on a single state to adjust the GMM parameters, we omit the state index *s* and describe every procedure in terms of the GMM framework. In addition, since it is well known that adapting mixture weights and variances is not helpful for recognition performance, we focus on adapting the mean vectors only.

Let **X** = {**x**_{1}, **x**_{2}, …, **x**_{N}} be a set of acoustic feature vectors extracted from utterances of a target speaker. The *a posteriori* probability of Gaussian component *g* for the SI model is given by

For Gaussian component *g*, we then compute the ML mean vector:

In the MAP estimate (20), *τ* is called the relevance factor, which controls the balance between \( {\boldsymbol{\mu}}_g^{\mathrm{ML}} \) and \( {\boldsymbol{\mu}}_g^{\mathrm{SI}} \). By modifying (20), we can obtain

… *n*_{g} goes to infinity. As also shown in Fig. 2, the *L*_{2} norm-based constraint causes many small and redundant adjustments that are negligible in terms of speech recognition performance. By replacing the constraint part of (22) with an *L*_{1} norm-based constraint, we can efficiently restrict the redundant adjustments. The modified constrained optimization problem is given by

where the constraint bound is not a constant like *c* in the previous section but a quantity depending mostly on *n*_{g} and *τ*. The posterior sum *n*_{g} is naturally determined by the amount of adaptation data. In addition, *n*_{g} accounts for the asymptotic property of adaptation, meaning that the regularization effect, including sparsity, is relaxed as the adaptation data increase. Thus, the parameter *τ*, rather than *c*, controls the sparsity and regularization for speaker adaptation. Figure 3 shows how the optimal solution, indicated by the red cross, can be a sparse vector. Before finding the optimal solution of (23), we first define a vector which is given by

where |**ρ**| returns the vector of absolute values of the components in **ρ**. To find the optimal solution of (23), we use **ψ**_{g} in the following steps. The Lagrangian form of (23) is given by

The optimal solution in terms of *λ** and the piecewise linear function in terms of *λ* are given by

The optimal value *λ** can be obtained by the sequence {*λ*^{k}} from *f*^{SA‐EPL1}(*λ*), which is given by

until *f*(*λ*^{k}) = 0 is satisfied. Thus, the final adapted mean vector from EPL1-based sparse speaker adaptation is given as follows:
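The MAP mean update reviewed in this section (component posteriors under the SI model, the posterior sum *n*_{g}, the ML mean, and interpolation with the relevance factor *τ*) can be sketched as follows. This is the textbook relevance-MAP form for a single diagonal-covariance GMM, written here as an assumption about the baseline rather than as the authors' code; the function name `map_adapt_means` and its argument layout are hypothetical.

```python
import numpy as np

def map_adapt_means(X, weights, means_si, variances_si, tau):
    """Relevance-MAP update of GMM means (diagonal covariances).

    X: (N, D) features; weights: (M,); means_si, variances_si: (M, D).
    Returns the adapted means (M, D) and the posterior sums n_g (M,).
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(means_si, dtype=float)
    var = np.asarray(variances_si, dtype=float)

    # log N(x_t; mu_g, diag(var_g)) for every frame t and component g -> (N, M)
    diff = X[:, None, :] - mu[None, :, :]
    log_gauss = -0.5 * (np.sum(diff ** 2 / var, axis=2)
                        + np.sum(np.log(2.0 * np.pi * var), axis=1))

    # Posterior of each component given each frame, computed stably
    log_post = np.log(w) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    n = post.sum(axis=0)                               # posterior sums n_g
    mu_ml = (post.T @ X) / np.maximum(n, 1e-10)[:, None]  # ML means
    alpha = (n / (n + tau))[:, None]                   # data-dependent weight
    # Components with no data (n_g = 0) keep their SI mean
    return mu + alpha * (mu_ml - mu), n
```

As *n*_{g} grows relative to *τ*, the weight `alpha` approaches 1 and the adapted mean moves toward the ML mean; for small *n*_{g} it stays close to the SI mean.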

## 5 SNEP-based sparse speaker adaptation

As in the previous section, we use **ψ**_{g} in all steps. The proposed constrained optimization problem for sparse speaker adaptation is given by

The optimal solution, the piecewise linear function, and the sequence {*λ*^{k}} are given as follows:

Note that the right-hand sides of (34) and (35) are composed of *ψ*_{i,g} scaled by \( {\sigma}_{i,g}^{\mathrm{SI}} \). Thus, if we find the optimal solution for \( {\psi}_{i,g}/{\sigma}_{i,g}^{\mathrm{SI}} \) by EPL1, the solution is \( {\varphi}_{i,g}^{\mathrm{SA}\hbox{-} \mathrm{SNEP}}/{\sigma}_{i,g}^{\mathrm{SI}} \). By multiplying this solution by \( {\sigma}_{i,g}^{\mathrm{SI}} \), we obtain exactly the same result as (32).
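The scaling argument above suggests a simple recipe: divide the magnitudes by the SI standard deviations, run an ordinary EPL1 solver, and multiply the result back. The sketch below follows that recipe under stated assumptions. The bound `c` is a plain placeholder argument (in the paper it depends on *n*_{g} and *τ*, and that dependence is not modelled here), the inner solver is a sort-based thresholding routine swapped in for PRF for brevity, and the names `epl1_nonneg` and `sa_snep_offset` are hypothetical.

```python
import numpy as np

def epl1_nonneg(z, c):
    """Project a non-negative vector z onto {u >= 0, sum(u) <= c}, c > 0.

    Sort-based thresholding, used here in place of PRF; both find the
    same threshold.
    """
    z = np.asarray(z, dtype=float)
    if z.sum() <= c:
        return z.copy()
    s = np.sort(z)[::-1]                       # magnitudes in descending order
    css = np.cumsum(s)
    k = np.arange(1, z.size + 1)
    support = k[s - (css - c) / k > 0][-1]     # number of non-zero components
    theta = (css[support - 1] - c) / support   # common threshold
    return np.maximum(z - theta, 0.0)

def sa_snep_offset(rho, sigma_si, c):
    """Sparse mean offset via whitening, EPL1, and rescaling.

    rho: difference vector for one Gaussian (e.g. ML mean minus SI mean);
    sigma_si: SI standard deviations; c: placeholder bound (assumption).
    """
    rho = np.asarray(rho, dtype=float)
    sigma_si = np.asarray(sigma_si, dtype=float)
    psi = np.abs(rho)                          # magnitudes of the offsets
    phi_scaled = epl1_nonneg(psi / sigma_si, c)
    return np.sign(rho) * sigma_si * phi_scaled
```

Because the threshold is applied in the whitened domain, a dimension is kept or dropped according to its offset relative to its own standard deviation rather than its raw magnitude.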

## 6 Experimental results

The experiments were conducted on the ETRI Korean conversation speech database, collected at a 16 kHz sampling rate and 16-bit resolution with two types of smartphone devices in clean conditions. We used about 100 h of speech data spoken by 300 speakers to train the SI triphone-based GMM-HMM acoustic model. For adaptation and evaluation, we used 350 sentences from each of 50 speakers (300 sentences for adaptation and 50 sentences for the phone recognition test); each sentence is roughly 4–5 s long. We used 12-dimensional Mel-frequency cepstral coefficients with log energy, concatenated with their first and second derivatives, to form 39-dimensional feature vectors. We applied a phone-level unigram language model over 39 Korean phonemes in our phone recognition experiments. The SI model had 11,848 tied-state triphone-based HMMs, with three states per HMM and a GMM with 32 Gaussian components per state. All phone recognition tests were performed for various values of the hyperparameter *τ*.

The *x*-axis indicates each dimension of the mean vector, and the normalized histogram of the counts is shown on the *y*-axis. For EPL1, three distinct peaks are observed, and their dimensions correspond to the log energy and its first and second derivatives. In contrast, there is no peak with SNEP, and every dimension is evenly adapted. As mentioned earlier, we believe that speaker characteristics are not mainly concentrated on the three dimensions related to log energy. Therefore, SNEP-based sparse speaker adaptation can reflect speaker variability better than the EPL1-based method.

Phone error rate (%) and sparsity (%) for different values of *τ* (rows with a single entry report a single value)

| Method | *τ* = 0.5 | *τ* = 1.0 | *τ* = 1.5 | *τ* = 2.0 | *τ* = 2.5 | *τ* = 3.0 | *τ* = 3.5 |
|---|---|---|---|---|---|---|---|
| **Phone error rate (%)** | | | | | | | |
| SI | 31.45 | | | | | | |
| MLLR | 21.77 | | | | | | |
| MAP | 17.87 | 17.76 | 17.63 | 17.71 | 17.68 | 17.94 | 18.06 |
| EPL1 | 18.19 | 17.99 | 17.99 | 18.12 | 18.26 | 18.24 | 18.37 |
| SNEP | 18.37 | 18.23 | 17.98 | 17.85 | 17.63 | 17.75 | 17.74 |
| **Sparsity (%)** | | | | | | | |
| MAP | 50.96 | | | | | | |
| EPL1 | 89.45 | 91.37 | 92.62 | 93.52 | 94.19 | 95.13 | 95.48 |
| SNEP | 87.42 | 88.72 | 89.67 | 90.46 | 91.08 | 91.60 | 92.05 |

## 7 Conclusions

In this paper, we proposed the SNEP method, a generalized version of EPL1 in which scaling parameters can be applied to the EPL1 framework. Using the SNEP method, we also proposed a sparse speaker adaptation method. Our experiments showed that EPL1-based speaker adaptation adapts mostly a small number of dimensions, whereas the proposed method adapts every dimension of the mean vectors evenly by using the variances of the SI model. With the proposed method, we also showed that a sparsely adapted model can be obtained with no loss of phone recognition performance compared with MAP adaptation. Our future work is to apply the EPL1 and SNEP frameworks to deep neural network-based acoustic model adaptation [25–28] with the gradient projection method.

