It can be assumed that speech frames carry different amounts of speaker information depending on their acoustic-phonetic classes [13] as well as on the speaker's voice characteristics. Under this assumption, a speech frame $x_t$ can be classified into its corresponding acoustic-phonetic class if a classification scheme is provided in advance. Among the variety of classification methods, we employed a hard classification approach based on vector quantization with a GMM for algorithmic simplicity. The unsupervised clustering capability of the GMM automatically provides a number of acoustic-phonetic classes for the whole acoustic space spanned by the training data. The vector quantization-based hard classification approach can be defined by
(4)
where $\Psi_m$ denotes the set of Gaussian model parameters of the $m$th acoustic-phonetic class among a total of $M$ acoustic-phonetic classes. These classes are given by a GMM estimated from the training data, which are assumed to cover the whole acoustic-phonetic space of speech signals.
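The hard classification step can be illustrated with a minimal sketch. It assumes diagonal-covariance Gaussian components and assigns each frame to the component with the highest log-likelihood. Whether the rule in (4) also includes the mixture weights is not stated here, so they are left out. The names `log_gaussian_diag`, `classify_frame`, and `components` are hypothetical and chosen only for illustration.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v
        for xi, mu, v in zip(x, mean, var)
    )

def classify_frame(x, components):
    """Return the index m of the class whose Gaussian scores x highest.

    `components` is a list of (mean, var) pairs, one per acoustic-phonetic
    class (assumption: mixture weights are ignored).
    """
    scores = [log_gaussian_diag(x, mean, var) for mean, var in components]
    return max(range(len(scores)), key=scores.__getitem__)

# Two toy classes in a two-dimensional feature space.
components = [([0.0, 0.0], [1.0, 1.0]), ([3.0, 3.0], [1.0, 1.0])]
labels = [classify_frame(x, components) for x in ([0.2, -0.1], [2.8, 3.1])]
```

Here `labels` holds one class index per frame, which is the per-frame label that the class-based scoring relies on.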
Then, the speaker identification rule in (3) can be expressed in terms of acoustic-phonetic classes by classifying each frame into its acoustic-phonetic class and computing the probabilistic scores on a per-class basis as
(5)
Under this framework, each frame-level log-likelihood score can be discriminatively weighted according to both its acoustic-phonetic class and the speaker model, so that both the acoustic-phonetic class and the speaker's voice characteristics are taken into account in speaker identification. The speaker identification rule based on this discriminative score weighting (DSW) scheme is given by
(6)
where $w_{jm}$ stands for the discriminative weight for the $m$th acoustic-phonetic class and the $j$th speaker model. The optimal weights for this speaker identification scheme can be obtained by using the MCE-based discriminative training algorithm [5, 12, 14–16], which aims at deriving a set of speaker models that minimizes classification errors, that is, speaker identification errors on the training data. To train these weights discriminatively with the MCE criterion for speaker identification, we define a discriminative function for each speaker, which represents the log-likelihood of the feature vector sequence $X$ given the model $\Lambda_j$ of speaker $j$, as
(7)
where $\Phi_W$ stands for the weight parameters. In (7), the weights represent the amount of score contribution from their corresponding classes. Their total sum needs to be normalized to avoid ill-conditioned training of the weights. Accordingly, the weights need to satisfy the following constraints [5, 16]:
(8)
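As a rough illustration of the weighted score, the sketch below sums frame-level log-likelihoods, each scaled by the weight of its acoustic-phonetic class. It assumes that the constraints in (8) are nonnegativity and a unit sum per speaker, and that the score is not normalized by the number of frames. Neither detail is confirmed by the text above. The names `dsw_score` and `satisfies_constraints` are hypothetical.

```python
def dsw_score(frame_loglikes, frame_classes, weights_j):
    """Weighted sum of frame log-likelihoods for one speaker model j.

    frame_loglikes[t] is the log-likelihood of frame t under speaker j,
    frame_classes[t] is the class index m assigned to frame t, and
    weights_j[m] plays the role of w_jm.
    """
    return sum(weights_j[m] * ll for ll, m in zip(frame_loglikes, frame_classes))

def satisfies_constraints(weights_j, tol=1e-9):
    """Check the assumed constraints: w_jm >= 0 and sum over m equal to 1."""
    return all(w >= 0.0 for w in weights_j) and abs(sum(weights_j) - 1.0) < tol

# Three frames, two classes, one speaker model.
weights_j = [0.75, 0.25]
score = dsw_score([-2.0, -4.0, -1.0], [0, 1, 0], weights_j)
```

With these toy values, the frames of class 0 contribute three times as strongly as the frame of class 1.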
Then, the misclassification measure for the true speaker $k$, which is the label of the input feature vector sequence, is defined to measure the degree to which the input feature sequence spoken by the true speaker is misclassified, as
(9)
with
(10)
where $\eta$ is a positive constant that controls the weighting of the competing speaker classes.
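For illustration, the sketch below implements the form of the misclassification measure commonly used in the MCE literature: the negated score of the true speaker plus a soft maximum over the competing speakers' scores. This is assumed rather than taken from (9) and (10), whose exact form may differ. The function name `misclassification_measure` is hypothetical.

```python
import math

def misclassification_measure(scores, k, eta=1.0):
    """d_k = -g_k + (1/eta) * log( mean over j != k of exp(eta * g_j) ).

    `scores[j]` is the discriminative function value g_j for speaker j and
    `k` is the index of the true speaker. At least two speakers are required.
    """
    competitors = [g for j, g in enumerate(scores) if j != k]
    # Log-sum-exp with the maximum factored out for numerical stability.
    peak = max(eta * g for g in competitors)
    lse = peak + math.log(sum(math.exp(eta * g - peak) for g in competitors))
    return -scores[k] + (lse - math.log(len(competitors))) / eta

scores = [-10.0, -12.0, -15.0]
d_true_best = misclassification_measure(scores, k=0)   # negative: correct decision
d_true_worst = misclassification_measure(scores, k=2)  # positive: misclassified
```

A larger `eta` concentrates the competing term on the strongest competitor, so the measure approaches the difference between the best competing score and the true speaker's score.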
A loss function for approximating the empirical loss related to the soft count of classification errors is defined as
(11)
where γ is a positive constant used to control the slope of the sigmoid function.
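The smoothed error count can be sketched with a plain sigmoid of the misclassification measure. This assumes that (11) has no offset term, which the text does not confirm. The name `sigmoid_loss` is hypothetical.

```python
import math

def sigmoid_loss(d, gamma=1.0):
    """Sigmoid loss 1 / (1 + exp(-gamma * d)), written to avoid overflow."""
    z = gamma * d
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

loss_at_boundary = sigmoid_loss(0.0)      # decision boundary
loss_correct = sigmoid_loss(-5.0, 2.0)    # clearly correct decision
loss_wrong = sigmoid_loss(5.0, 2.0)       # clearly wrong decision
```

A larger `gamma` makes the loss closer to a hard 0/1 error count, at the cost of smaller gradients away from the decision boundary.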
To satisfy the constraints in (8), we take logarithms of the weights as
(12)
This new parameter set is trained by using the generalized probabilistic descent (GPD) algorithm [16] as
(13)
where $\epsilon$ is the step size of the GPD algorithm and the gradient term is derived as
(14)
where
(15)
(16)
(17)
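A single GPD update on the log-domain weights can be sketched as follows. The sketch assumes the usual MCE forms: a sigmoid loss, a soft maximum over competitors, and a discriminative function that is linear in the weights. The gradient is obtained by the chain rule through these three stages. This factorization may not match (14) to (17) term by term, and all names (`gpd_step`, `class_scores`, `log_w`) are hypothetical.

```python
import math

def gpd_step(log_w, class_scores, k, eps=0.01, eta=1.0, gamma=1.0):
    """One GPD update of log-weights for one utterance of true speaker k.

    log_w[j][m] is the log of weight w_jm; class_scores[j][m] is the summed
    frame log-likelihood of class m under speaker j. Returns the updated
    log-weights and the loss evaluated before the update.
    """
    n_spk = len(log_w)
    w = [[math.exp(v) for v in row] for row in log_w]
    g = [sum(wm * s for wm, s in zip(w[j], class_scores[j])) for j in range(n_spk)]
    comp = [j for j in range(n_spk) if j != k]
    peak = max(eta * g[j] for j in comp)
    expo = {j: math.exp(eta * g[j] - peak) for j in comp}
    total = sum(expo.values())
    d = -g[k] + (peak + math.log(total) - math.log(len(comp))) / eta
    z = gamma * d
    loss = 1.0 / (1.0 + math.exp(-z)) if z >= 0.0 else math.exp(z) / (1.0 + math.exp(z))
    dl_dd = gamma * loss * (1.0 - loss)           # derivative of the sigmoid loss
    new_log_w = []
    for j in range(n_spk):
        dd_dg = -1.0 if j == k else expo[j] / total   # derivative of d w.r.t. g_j
        # Derivative of g_j w.r.t. log w_jm is w_jm * class_scores[j][m].
        new_log_w.append([
            lw - eps * dl_dd * dd_dg * wm * s
            for lw, wm, s in zip(log_w[j], w[j], class_scores[j])
        ])
    return new_log_w, loss

half = math.log(0.5)
log_w0 = [[half, half], [half, half]]
class_scores = [[-4.0, -6.0], [-5.0, -3.0]]
log_w1, loss0 = gpd_step(log_w0, class_scores, k=0, eps=0.1)
_, loss1 = gpd_step(log_w1, class_scores, k=0, eps=0.1)
```

In this toy case the true speaker is initially misclassified, and the loss evaluated after one step (`loss1`) is lower than before it (`loss0`). The updated log-weights are not yet normalized. Mapping them back to weights that satisfy the constraints is a separate step.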
After the log-domain parameter is updated, $w_{jm}$ is obtained by using the following transformation, which satisfies the constraints in (8):
(18)
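One transformation that maps log-domain parameters back to nonnegative weights summing to one is exponentiation followed by normalization over the classes of each speaker. The sketch below assumes that (18) has this form, which is common in MCE-based weight training but is not confirmed by the text above. The name `weights_from_log_params` is hypothetical.

```python
import math

def weights_from_log_params(log_w_j):
    """Map the log-domain parameters of one speaker to normalized weights.

    Computes exp(v_m) / sum over m' of exp(v_m'), with the maximum subtracted
    first for numerical stability.
    """
    peak = max(log_w_j)
    expw = [math.exp(v - peak) for v in log_w_j]
    total = sum(expw)
    return [e / total for e in expw]

weights = weights_from_log_params([0.0, math.log(3.0)])
```

For these inputs the weights are approximately 0.25 and 0.75. Because the mapping is invariant to adding a constant to all log-domain parameters of a speaker, unconstrained gradient updates always yield valid weights.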
The pseudocode of this training algorithm for the discriminative score weights is given in Algorithm 1.