Soft context clustering for F0 modeling in HMM-based speech synthesis
 Soheil Khorram^{2},
 Hossein Sameti^{2} and
 Simon King^{3}
https://doi.org/10.1186/1687618020152
© Khorram et al.; licensee Springer. 2015
 Received: 29 August 2014
 Accepted: 5 December 2014
 Published: 9 January 2015
Abstract
This paper proposes the use of a new binary decision tree, which we call a soft decision tree, to improve generalization performance compared to the conventional ‘hard’ decision tree method that is used to cluster context-dependent model parameters in statistical parametric speech synthesis. We apply the method to improve the modeling of fundamental frequency, which is an important factor in synthesizing natural-sounding, high-quality speech. Conventionally, hard decision tree-clustered hidden Markov models (HMMs) are used, in which each model parameter is assigned to a single leaf node. However, this ‘divide-and-conquer’ approach leads to data sparsity, with the consequence that it suffers from poor generalization, meaning that it is unable to accurately predict parameters for models of unseen contexts: the hard decision tree is a weak function approximator. To alleviate this, we propose the soft decision tree, which is a binary decision tree with soft decisions at the internal nodes. In this soft clustering method, internal nodes select both their children with certain membership degrees; therefore, each node can be viewed as a fuzzy set with a context-dependent membership function. The soft decision tree improves model generalization and provides a superior function approximator because it is able to assign each context to several overlapped leaves. In order to use such a soft decision tree to predict the parameters of the HMM output probability distribution, we derive the smoothest (maximum entropy) distribution which captures all partial first-order moments and a global second-order moment of the training samples. Employing such a soft decision tree architecture with maximum entropy distributions, a novel speech synthesis system is trained using maximum likelihood (ML) parameter re-estimation, and synthesis is achieved via maximum output probability parameter generation. In addition, a soft decision tree construction algorithm optimizing a log-likelihood measure is developed. Both subjective and objective evaluations were conducted and indicate a considerable improvement over the conventional method.
Keywords
 Context clustering
 Decision tree-based clustering
 F0 modeling
 Hidden Markov model
 HMM
 HMM-based speech synthesis
 Maximum entropy model
 Soft decision tree
 Soft context clustering
 Statistical parametric speech synthesis
1 Introduction
Demand for natural and high-quality speech-based human-computer interaction is increasing due to applications including speech-based virtual assistants for mobile devices. Speech synthesis plays a significant role, not only in transmitting factual information, but also as the outward ‘face’ of the application: the naturalness of the synthesis affects overall user satisfaction. Speech synthesis from text is usually achieved via an intermediate linguistic specification [1], which can be thought of as a collection of contextual factors (such as phonetic and prosodic properties of the current, preceding, and following segment) which have been derived from the text. Here, we are concerned only with the conversion of this linguistic specification to a speech waveform. In order to perform this conversion, several methods have been proposed [2], of which statistical parametric speech synthesis (SPSS) [3, 4] has been dominant, at least in research terms, for the last decade or more.
In the synthesis phase, contextual factors are obtained for the input text, and the decision tree is used to obtain the corresponding trained model parameters, using which a parameter generation (PG) algorithm [11–14] generates acoustic feature trajectories. These are then converted to a speech waveform using the vocoder.
In contrast to concatenative synthesis [15], which stores speech waveforms, the parametric representation in SPSS has several potential advantages, including flexibility in changing voice characteristics [3], speaker and style adaptation [16–19], easier multilingual support [20–22], superior coverage of acoustic space [3], reduced memory footprint [3], and better robustness to low-quality speech recordings [23].
Though compressing a human voice into a compact statistical model offers the above-mentioned advantages over concatenation of waveforms, there remains one major shortcoming: lower quality synthetic speech. This is caused by one or more of the blocks shown in Figure 1, e.g., inadequate acoustic coverage of training utterances, noisy speech database [24, 25], errors in natural language processing, inadequate contextual factors [26], inaccurate statistical modeling [3], the PG algorithm [11–14], or vocoding distortion [5–9]. Here, we propose improvements to the statistical modeling (the shaded block in Figure 1) and specifically for the F0 stream.
Hidden Markov model (HMM)-based speech synthesis [27–33] models not only the spectrum, but also the excitation and duration in a unified framework of context-dependent [34, 35] multi-space probability distribution [36] hidden semi-Markov models (HSMMs) [37]. More precisely, an independent binary-branching hard decision tree is constructed for each stream of acoustic features (spectrum, aperiodic energy, and fundamental frequency). In the case of F0, a multi-space probability distribution [36] is associated with each leaf of the decision tree. Contextual space (which is very large and very sparse due to the great number of contextual factors typically employed) is divided by the decision trees into multiple hard (i.e., non-overlapping) clusters; each cluster is a group of context-dependent HMM states that share the same output probability distribution.
Hard decision tree-based context clustering, which is the standard approach to F0 modeling, has poor generalization [38]. In other words, this structure cannot accurately predict the parameters of models of unseen contexts, given the very limited subset of contexts observed in the training data. In order to predict acoustic variations with high generalization capability, the model has to be able to express a large number of robust distributions (i.e., a large number of distributions, but such that each one can be trained from a sufficient number of training samples). In hard decision trees, increasing the number of distributions by growing the depth of the tree reduces the number of training samples assigned to each leaf, and thus, the robustness of the distributions is weakened. This problem stems from the fact that the hard decision tree structure assigns each model parameter to exactly one cluster (corresponding to a small part of the large contextual space): each training sample contributes to the estimation of only one set of model parameters (one mean vector and one covariance matrix). Our hypothesis is that by enabling each training sample to influence multiple sets of model parameters (and thus cover a larger portion of contextual space), generalization to unseen contexts would be improved.
1.1 Related work
Several attempts have already been made to alleviate the limitations of F0 modeling in standard decision tree-clustered H(S)MMs. One of these is the use of deep neural networks (DNNs) [38, 39], which are able to approximate complex acoustic feature-to-linguistic context dependencies by employing many hidden layers; contrast this with decision trees that cannot efficiently represent something as simple as XOR or multiplexing [38], i.e., they must use an excessive number of leaves to capture such relationships and thus over-fragment the already sparse training data [38]. DNNs are also able to represent non-binary contextual features, whereas decision trees generally only use binary splits. Other deep learning approaches such as restricted Boltzmann machines (RBMs) and deep belief networks (DBNs) have also been demonstrated to be effective generative models when applied to speech synthesis [40, 41].
Speech synthesis based on Gaussian process regression (GPR) [42] is another new technique that has recently been introduced to alleviate basic limitations of HMM-based speech synthesis. The goal of GPR is to remove the incorrect stationarity assumption of state output distribution in HMM-based speech synthesis. GPR uses frame-level contextual factors (such as position of the frame within the phone, and articulatory features) to estimate frame-level acoustic trajectories [42]. GPR can directly express complex context dependencies without needing decision tree structures and is able to use all contextual factors of all types simultaneously; therefore, it has the potential to provide better generalization.
In [43], a new system is proposed that replaces the usual maximum likelihood (ML) point estimate of the model parameters with a variational Bayesian method. Their system outperforms the usual approach when the amount of training data is limited, thus demonstrating superior generalization.
F0 modeling with additive structures has also been used to express the relationship between contextual factors and the F0 trajectory [44–54]. Contextual additive modeling [45–48] assumes model parameters to be a sum of multiple independent components, each having different context dependencies; therefore, different decision trees have to be trained for them. The contextual additive model is able to exploit contextual factors more efficiently, because mean vectors and covariance matrices of the predicted distributions are the sum of mean vectors and covariance matrices of the additive components [45]: each training sample contributes to more than one model parameter. Takaki et al. [46–48] used an additive structure for spectral modeling and reported that it has a high computational cost. To alleviate this, they proposed covariance parameter tying and a simplified likelihood calculation algorithm using the matrix inversion lemma. Though the contextual additive model [45–48] was originally proposed for spectral modeling, it could be used for F0 trajectories.
Zen et al. [49] also developed an additive F0 modeling structure with multiple components for mean vectors and a single component for variance values. Accordingly, multiple decision trees were trained for the mean vectors, and just one decision tree was built for the variance values. In their system, different sets of contextual factors were used for different additive components and all trees were built concurrently. Similarly, [50] proposed another additive structure with multiple decision trees, but a minimum generation error (MGE) measure was used as the selection criterion instead of the more common maximum likelihood (ML) measure. In [51], an additive model with three different layers, including intonational phrase, word level, and pitch accent, was designed. All three components were trained concurrently using a regularized least-squares error measure. Qian et al. [52] proposed a new gradient-based tree-boosting approach for training multiple additive regression trees. Their decision trees were built in successive stages to minimize the squared error.
Some studies [53, 54] have also highlighted another important problem of the common decision tree-based F0 modeling: its deficiency in capturing the effect of contextual features that are poorly represented in the training database. These features (i.e., questions used in the decision tree splits) have little influence on the likelihood criterion and hence will not be selected by the usual decision tree construction algorithms [53, 54]. One obvious technique to solve this problem is to build the decision tree using a two-stage algorithm [53]. In the first stage, all splits are made only with these underrepresented contextual factors. This stage captures the influence of such factors, even though they are rare. In the second stage, the well-represented factors are employed. This procedure is not efficient, since the first stage reduces the amount of training data available for modeling the dependency between well-represented contextual factors and F0 [54]. Context adaptive training with factorized decision trees [54] is another method designed to exploit rare features more effectively. There, cluster adaptive training [55] is employed such that an average model is built and then this general model is adapted using a set of transforms. In fact, well-represented contextual factors contribute to generating the average model, and rare contextual questions are taken into account for the transforms. Due to the use of cluster adaptive training, this structure is also able to improve context generalization.
1.2 Scope of the paper
Numerous binary and non-binary contextual factors are generally taken into account in modeling F0. Conventional HMM-based speech synthesis converts all non-binary contextual factors to multiple binary questions (i.e., potential decision tree splits). As mentioned earlier, this structure may suffer from inadequate context generalization. To alleviate this deficiency, we propose the direct use of non-binary contextual factors in a soft decision tree framework [56, 57]. The proposed soft decision tree structure is an innovative binary decision tree with soft questions at each non-terminal node. Both children are selected with a specific membership degree. In contrast to a hard decision tree that partitions contextual factor space into hard contextual regions, the proposed soft decision tree is able to provide soft (i.e., overlapping) clusters. In this structure, each context will be assigned to several terminal leaves with certain membership functions, and consequently, each training sample affects multiple model parameters, and generalization should be improved.
The rest of the paper is organized as follows: Section 2 presents the classical hard decision tree approach to F0 modeling in statistical parametric speech synthesis. The proposed soft context-clustered HMM structure and details of the associated speech synthesis system that employs such trees are provided in Section 3. Section 4 reports the experiments and results, and Section 5 concludes the paper.
2 F0 modeling using hard decision trees
This section describes the predominant framework for F0 modeling in HMM-based speech synthesis, which is the same framework used for the spectral envelope, aperiodic energy, and duration. This section also sets out the notation, algorithms, and structures required for subsequent sections.
2.1 F0 modeling in the HMM framework
Typically, F0 along with its delta and delta-delta derivatives form three streams^{a} of a context-dependent [34, 35] multi-space probability distribution (MSD) [36] left-to-right HSMM without skip transitions [58, 37] (which, for obvious reasons, we shorten to simply ‘HMM’ in this paper). This model structure generates acoustic trajectories of a unit (e.g., phoneme) by emitting observations from hidden states. The output distribution of the state is a context-dependent multi-space Gaussian distribution [36], and these are clustered into groups of related contexts using a decision tree in order to reduce the number of free parameters and allow the modeling of unseen contexts. For notational simplicity, we limit our discussion here to an HMM with just one stream. Generalizing this to the multi-stream case is straightforward.
When using MSD output distributions with two spaces (for defined and undefined values), the space index is an observed value equal to the voicing label. The figure also introduces c_{j}, d_{j}, and t_{j}, which are the contextual factors, the duration, and the last frame index of the jth state (clearly, d_{j} = t_{j} - t_{j-1}). Note that state boundaries are latent variables and have to be trained in an unsupervised manner using the expectation maximization (EM) [60] algorithm.
where J and λ denote the total number of states and the model parameters, respectively.
where represents a Gaussian distribution with mean vector μ, and covariance matrix Σ. In this equation, duration and output distributions are parameterized by duration mean , duration variance , voicing probability ( ), output mean vector , and observation covariance matrix .
where m_{ l } and are duration mean and variance values lying in the lth leaf of the duration decision tree. Similarly, w_{ l }, μ_{ l }, and Σ_{ l } represent parameters of the voicing and output probability distributions that are trained for the lth leaf of the output decision tree.
2.2 HMM parameter re-estimation
where , and ŵ_{l} are new values of , and w_{l} during the EM algorithm. Also, χ_{j}(t_{j}, t_{j-1}) is the probability of occupying the jth state from time t_{j-1} to t_{j}, and γ_{j}(t) denotes the posterior probability of being in state j at time t. These probabilities are calculated through the well-known forward-backward algorithms. It should be noted that the publicly available HMM-based speech synthesis system (HTS) [61] has been implemented based on the algorithms expressed in [62]. These algorithms were originally proposed by Ferguson [63] and were refined by Levinson [64]. A more efficient version of the forward-backward algorithm has recently been proposed by Yu et al. [65].
2.3 Decision tree-based state clustering
where superscript n indexes the training utterances. It should be noted that in order to obtain the above likelihood increase expression, the following simplifying assumptions have to be made [34]: (1) the values of the occupation probabilities are assumed to be fixed during the clustering procedure [34]; (2) the overall likelihood measure is approximated by a simple average of the log likelihoods weighted by the posterior probabilities [34]. These assumptions make the calculation of the likelihood increase possible for all pairs of terminal nodes and questions.
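As an illustration of how these two assumptions make split evaluation cheap, the following is a minimal sketch of the standard likelihood-gain computation for a single-Gaussian leaf, restricted to a one-dimensional observation. The function names and the variance floor are assumptions introduced for this example; they are not taken from the paper or from HTS.

```python
import math

def cluster_log_likelihood(samples, weights):
    """Approximate log-likelihood of a cluster modeled by one 1-D Gaussian.

    samples: observed values; weights: state occupation probabilities,
    held fixed during clustering (assumption 1 in the text).
    """
    occ = sum(weights)
    mean = sum(w * x for x, w in zip(samples, weights)) / occ
    var = sum(w * (x - mean) ** 2 for x, w in zip(samples, weights)) / occ
    var = max(var, 1e-8)  # variance floor: an added safeguard, not from the paper
    # With ML mean and variance, the occupancy-weighted log-likelihood
    # collapses to this closed form (assumption 2 in the text).
    return -0.5 * occ * (math.log(2.0 * math.pi * var) + 1.0)

def split_gain(samples, weights, answers):
    """Log-likelihood increase from splitting a leaf with a yes/no question."""
    yes = [(x, w) for x, w, a in zip(samples, weights, answers) if a]
    no = [(x, w) for x, w, a in zip(samples, weights, answers) if not a]
    if not yes or not no:
        return 0.0  # the question does not actually divide the leaf
    parent = cluster_log_likelihood(samples, weights)
    children = sum(cluster_log_likelihood(*zip(*part)) for part in (yes, no))
    return children - parent

# Made-up data: two well-separated groups of values.
values = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
gain = split_gain(values, [1.0] * 6, [True, True, True, False, False, False])
```

Because each leaf is scored from its own statistics only, the gain of a candidate split depends on nothing outside the leaf being split.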
3 Soft context-clustered HMM
Accordingly, to determine the distribution of a given contextual factor, we need to start from the root node, recursively apply the test at each internal node, and select one of the two branches depending on the outcome. This process is repeated until a leaf node is reached, at which point the distribution of the leaf is taken as the output probability distribution. Therefore, for each context, just one path from the root to a terminal node is traversed, and each context is thereby assigned to one leaf and affects the distribution of that single leaf. In order to improve the performance of the canonical decision tree, this paper proposes the soft binary decision tree structure, which is able to establish several fuzzy paths from the root to multiple leaves.
3.1 Soft context-clustered HMM structure
The soft decision tree applies soft decisions in its internal nodes m and redirects all samples to both children, but with certain membership degrees given by the soft question at that node and by its complement. In fact, each node of a soft decision tree represents a fuzzy subset of contextual factor space; therefore, each context belongs to several nodes with a membership degree. More precisely, when we are traversing the node m for the given context c, the value of the soft question is the membership grade of the left child, and one minus that value is the degree of selecting the right child.
In both hard and soft decision tree-based HMMs, initially, a set of contextual factors has to be defined and extracted for all training utterances. Thereafter, as opposed to the hard decision tree that requires hard questions f_{m}(c), here, we have to design a great number of soft questions (soft tests) for each contextual factor. These questions are finally assigned to the internal nodes of the decision tree and make fuzzy decisions to select among their children instead of the common crisp decisions.
where m_{L} and m_{R} are the left and right children of node m. According to the above recursion, all the membership degrees can be calculated by traversing the tree in pre-order. The traversal starts by setting the membership degree of the root to 1. After visiting a node m and determining its membership degree Ĩ_{m}(j), its left child m_{L} and right child m_{R} are visited. If the visited node is a left child, its membership degree is the membership degree of its parent multiplied by the value of the parent's soft question; otherwise, it is the membership degree of the parent multiplied by one minus that value, where m denotes the parent node.
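The pre-order computation of membership degrees can be sketched as follows. This is a minimal illustration under stated assumptions: a left child receives its parent's degree multiplied by the soft question value and a right child receives the parent's degree multiplied by one minus that value; the `Node` class, the sigmoid-shaped soft question, and the factor names `position` and `phoneme` are hypothetical and not taken from the paper.

```python
import math

class Node:
    """Soft decision tree node; leaves have no question and no children."""
    def __init__(self, name, question=None, left=None, right=None):
        self.name = name
        self.question = question  # maps a context to a value in [0, 1]
        self.left = left
        self.right = right

def leaf_memberships(node, context, degree=1.0, out=None):
    """Pre-order traversal returning {leaf name: membership degree}."""
    if out is None:
        out = {}
    if node.question is None:  # terminal node: record its degree
        out[node.name] = degree
        return out
    go_left = node.question(context)
    leaf_memberships(node.left, context, degree * go_left, out)
    leaf_memberships(node.right, context, degree * (1.0 - go_left), out)
    return out

def sigmoid_question(key, center, width):
    """Hypothetical soft question on a numerical factor."""
    return lambda c: 1.0 / (1.0 + math.exp((c[key] - center) / width))

def hard_question(key, value):
    """Conventional crisp question on a categorical factor (returns 0 or 1)."""
    return lambda c: 1.0 if c[key] == value else 0.0

tree = Node("root", sigmoid_question("position", 3.0, 1.0),
            left=Node("A"),
            right=Node("n1", hard_question("phoneme", "a"),
                       left=Node("B"), right=Node("C")))

memberships = leaf_memberships(tree, {"position": 3.0, "phoneme": "a"})
```

Since the root degree is 1 and each node passes its whole degree on to its two children, the leaf degrees of any context sum to 1. A hard decision tree is the special case in which every question returns exactly 0 or 1, so a single leaf receives degree 1.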
The above constraint has to be taken into account during the procedure of defining soft questions. That is, we are not allowed to employ soft questions with a value greater than 1 or less than 0; thus, a normalization step is required for some questions before starting decision treebased clustering.
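One simple way to satisfy this constraint for a raw numerical factor is sketched below. Min-max scaling with clipping is an assumption made for illustration only; the text does not state which normalization the authors used, and the names `normalize_question` and `position_in_phrase` are hypothetical.

```python
def normalize_question(raw_values):
    """Return a function mapping a raw numerical factor to [0, 1].

    Min-max scaling over the values seen in training is an assumption here;
    values outside the training range are clipped.
    """
    lo, hi = min(raw_values), max(raw_values)
    span = hi - lo

    def question(value):
        if span == 0:
            return 0.5  # degenerate factor: no preference for either child
        return min(1.0, max(0.0, (value - lo) / span))

    return question

position_in_phrase = [1, 2, 3, 4, 7, 9]  # made-up training values
q = normalize_question(position_in_phrase)
```

Clipping keeps the question valid for test contexts whose factor value falls outside the range observed during training.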
3.2 Soft contextclustered HMM distribution
The proposed soft context-clustered HMM exploits the same structure and graphical model as the original hard decision tree-based HMM, and thus, the model likelihood expression given by Equation 1 is also valid for the proposed model. The only difference between the conventional and the proposed approaches lies in the method of capturing context dependencies inherent in the F0 trajectory. More specifically, the method of representing the output distribution b_{j}(⋅) in Equation 1 is different. The goal of this section is to find this probability distribution for the soft decision tree structure described in the previous section. With a view to providing efficient context generalization, this section derives the smoothest distribution that is able to accurately express the behavior of the F0 trajectory. To estimate the smoothest distribution, the maximum entropy model (MEM) [67, 68], presented in the next subsection, is employed.
3.2.1 Maximum entropy-based distributions
These constraints make the estimated distribution capture the partial first-order moments E{Ĩ_{l}(c) g o} and the global second-order moment E{g o o^{T}} of the training data in voiced frames (i.e., in frames where observation features o_{t} are defined and voicing label g_{t} is 1); therefore, the training phase of the maximum entropy model estimates the smoothest distribution that preserves the first- and second-order moments, expressed in Equation 9, of the training database. Moreover, the selected constraints lead to a simple expression for the output probability distributions that can be estimated efficiently.
where represents the new optimization function. Also, λ_{b0}, λ_{b1}, and Λ are Lagrange multipliers incorporated in the optimization function to remove the equality constraints.
where indicates the Gaussian distribution; μ_{l} is a D-dimensional vector of mean parameters defined for the lth leaf; also, Σ is a D-by-D covariance matrix that is used for all leaves.
In sum, each leaf of the soft decision tree carries a set of model parameters, represented by μ_{l}, that contributes to the output probability distribution b(o | g, c). The output probability b(o | g, c) is simply approximated by a Gaussian distribution. This Gaussian distribution uses a single context-independent covariance matrix Σ and a context-dependent mean vector. The mean component is obtained by linearly combining the μ_{l} parameters (i.e., Σ_{l} Ĩ_{l}(c) μ_{l}, summing over the leaves), and the weights of the linear combination are determined by the membership functions Ĩ_{l}(c). In fact, the proposed maximum entropy-based output probability distribution is remarkably similar to the distribution expressed by the contextual additive structure that ties all covariance matrices [46–48]. In the contextual additive method, similar to the proposed method, the output distribution has the form of Equation 14, but the contextual additive method exploits multiple hard decision trees [46] or a hard decision tree with overlapped leaves [47] instead of the proposed soft decision tree. In other words, in the contextual additive structure, Ĩ_{l}(c) is a leaf indicator function that may be 1 for multiple overlapped leaves, but in the proposed model, Ĩ_{l}(c) is a real number, ranging from 0 to 1, that represents the membership degree of a soft decision tree terminal node.
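The description above can be made concrete with a small sketch of the output distribution for a voiced frame. It is restricted to a one-dimensional observation, so a scalar variance stands in for the shared covariance matrix; the function name and all numbers are made up for illustration.

```python
import math

def soft_tree_log_density(o, memberships, leaf_means, variance):
    """Log of a 1-D Gaussian whose mean is a membership-weighted sum of leaf means.

    memberships: membership degree of the context in each leaf
    leaf_means:  one mean parameter per leaf (mu_l)
    variance:    single variance shared by all leaves (1-D stand-in for Sigma)
    """
    mean = sum(i_l * mu_l for i_l, mu_l in zip(memberships, leaf_means))
    return -0.5 * (math.log(2.0 * math.pi * variance) + (o - mean) ** 2 / variance)

# Made-up numbers: a context shared between two of three leaves.
memberships = [0.25, 0.75, 0.0]
leaf_means = [5.0, 5.4, 4.8]
context_mean = sum(i * m for i, m in zip(memberships, leaf_means))

at_mean = soft_tree_log_density(context_mean, memberships, leaf_means, 0.04)
off_mean = soft_tree_log_density(5.0, memberships, leaf_means, 0.04)
```

Only the mean depends on the context, so two contexts with similar membership vectors receive similar predicted means even when neither was seen in training.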
3.3 Parameter re-estimation
As can be seen from Equation 20, R represents the cross-correlation matrix of the membership functions. This matrix is symmetric and positive definite; therefore, it is possible to solve the above system of equations efficiently using Cholesky decomposition.
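A minimal sketch of such a Cholesky-based solve is given below for a one-dimensional observation. Equations 18 and 20 are not reproduced in this text, so the accumulators are an assumption consistent with the description: R is taken to be the occupancy-weighted sum of outer products of the membership vectors, and the right-hand side is the occupancy-weighted sum of membership vectors times observations. All function names are hypothetical.

```python
import math

def cholesky(a):
    """Lower-triangular L with a = L L^T, for a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def solve_spd(a, b):
    """Solve a x = b by forward then backward substitution on the Cholesky factor."""
    n = len(a)
    low = cholesky(a)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

def estimate_leaf_means(memberships, observations, occupancies):
    """Accumulate R and the right-hand side, then solve for the leaf means."""
    n = len(memberships[0])
    r = [[0.0] * n for _ in range(n)]
    rhs = [0.0] * n
    for m, o, g in zip(memberships, observations, occupancies):
        for i in range(n):
            rhs[i] += g * m[i] * o
            for j in range(n):
                r[i][j] += g * m[i] * m[j]
    return solve_spd(r, rhs)

# Made-up data generated exactly from leaf means [1.0, 3.0].
memberships = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.25, 0.75]]
observations = [1.0, 2.0, 3.0, 2.5]
means = estimate_leaf_means(memberships, observations, [1.0] * 4)
```

In this sketch the factorization fails with a math domain error or a division by zero if R is singular, for example when some leaf receives zero membership from every frame, so a practical implementation would need to guard against that case.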
The above equations introduce a straightforward procedure to train the parameters of the output probability distribution factorized by a soft decision tree.
where |·| denotes the matrix determinant operator.
3.4 Soft context clustering algorithm
To automatically capture the dependencies between acoustic features and contextual factors, this section proposes a soft decision tree construction algorithm. Similar to the classical hard decision tree building algorithm, the soft decision tree is built iteratively through a greedy, top-down procedure which maximizes the log-likelihood measure.
The major advantage of the classical hard decision tree construction algorithm is that its terminal nodes can be split independently. In a hard decision tree, terminal nodes represent non-overlapping regions of the contextual factor space; therefore, after splitting a leaf, all values obtained for other leaves are still valid, and it is not necessary to calculate them again. This advantage makes the algorithm computationally tractable. However, in the soft decision tree construction procedure, different terminal nodes may cover overlapping regions of the contextual space, and splitting a leaf using a soft question affects the parameters of all other leaves. Consequently, as opposed to the conventional hard decision tree structure, here, after splitting a leaf, all values obtained for all terminal nodes have to be updated, which requires a very large amount of computation.
The procedure of the proposed soft decision tree construction algorithm is stated as follows:
Step 1. Create the root node embracing all samples of the training database.
Step 2. Split all terminal nodes using all questions and compute their optimum log-likelihood value. To compute the optimum log-likelihood value for each possible pair of leaf and question, the maximum likelihood estimate of the mean parameters has to be obtained first by Equation 18. Then, the shared covariance matrix Σ is calculated through Equation 21, and Equation 22 is finally employed to find the optimum log-likelihood value.
Step 3. Select the best pair of terminal node and question that provides the maximum increase in the log-likelihood measure. Thereafter, split the node using the question and compute the maximum likelihood estimate of all model parameters.
Step 4. Stop the splitting procedure if a predefined condition is satisfied (e.g., the increase in log-likelihood falls below a certain threshold).
Algorithm 1 summarizes the overall procedure of the proposed soft context clustering. As can be seen from the algorithm, the proposed soft clustering procedure is very similar to the classical clustering algorithm. Their main difference is in the number of evaluations that has to be performed during each iteration of the clustering procedure. In hard clustering, only the two newly generated leaves have to be evaluated, but in soft clustering, all leaf nodes have to be evaluated. This increases the computational complexity of soft clustering by an order of magnitude. Assume we intend to build a decision tree with L leaves and that we have defined Q questions. In this case, hard clustering requires (2L - 3)Q likelihood calculations, while soft clustering requires [L(L - 1)/2]Q likelihood calculations.
It should be noted that the likelihood calculation in soft decision tree-based clustering is more complicated than in hard clustering, mainly because calculating the inverse of the matrix R to solve the system of equations expressed by Equation 18 is computationally expensive. Takaki et al. [46] proposed a method to reduce the computational complexity of this inversion. Their method exploits the matrix inversion lemma and can also be incorporated in the soft decision tree clustering procedure.
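The two evaluation counts can be checked by simulating the bookkeeping of each procedure. The sketch below only counts likelihood evaluations and does not build a tree; the function names are hypothetical.

```python
def hard_clustering_evaluations(num_leaves, num_questions):
    """Only new leaves are evaluated: 1 leaf first, then 2 per later iteration."""
    total, leaves_to_evaluate, leaves = 0, 1, 1
    while leaves < num_leaves:
        total += leaves_to_evaluate * num_questions
        leaves += 1              # one split adds one leaf
        leaves_to_evaluate = 2   # the two children just created
    return total

def soft_clustering_evaluations(num_leaves, num_questions):
    """Every current leaf is re-evaluated before each split."""
    total, leaves = 0, 1
    while leaves < num_leaves:
        total += leaves * num_questions
        leaves += 1
    return total

hard_count = hard_clustering_evaluations(100, 1000)
soft_count = soft_clustering_evaluations(100, 1000)
```

For L = 100 and Q = 1000, the counts are 197,000 for hard clustering and 4,950,000 for soft clustering, so the ratio between them grows roughly as L/4.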
3.5 Simple sinusoidal regression
Figure 3b,c shows the hard and soft decision tree structures trained based on the maximum likelihood decision tree construction algorithms. As can be seen from the figures, the hard decision tree requires eight leaves to have an acceptable mean square error of 0.015, but the soft decision tree is able to accurately estimate the objective function with a small mean square error of 0.0002 using just six terminal nodes.
Figure 3d,e shows the functions approximated by the hard and soft decision trees, respectively. This simple experiment indicates that hard decision tree structures are not efficient at exploiting continuous attributes (contextual factors), and that incorporating soft decisions in their internal nodes significantly improves their predictive capabilities.
3.6 Defining soft questions
As mentioned earlier, in order to construct the soft decision tree structure, a set of basic contextual factors has to be extracted initially for all training and test datasets. Section 4.1.1 gives the details of the basic contextual factors employed in our experiments. These basic contextual factors have been denoted by c in this paper and can be grouped into two types: categorical and numerical factors. ‘Phoneme identity’ is an example of a categorical factor, and ‘Position of the current phoneme’ is an example of a numerical factor. A numerical factor returns ordered values, whereas a categorical factor provides unordered symbols. For the categorical factors, we cannot define meaningful soft questions, and therefore, we have no choice but to use the conventional hard questions. However, for the numerical factors, it is possible to define a large number of soft questions. This subsection introduces the procedure of defining these soft questions in our experiments.
In conclusion, all contextual factors were divided into two groups, namely, categorical and numerical. According to the above procedure, a set of soft questions were extracted for numerical factors, and a number of hard questions were obtained for categorical contextual factors. Thereafter, all of the extracted hard and soft questions were grouped together and competed against each other during the soft clustering procedure.
4 Experiments
This section aims to compare the performance of fundamental frequency modeling approaches based on the conventional hard decision tree and the proposed soft clustering method.
4.1 Experimental conditions
Before presenting the experimental results, this section describes the experimental conditions, including database characteristics and employed contextual factors, in detail.
An English speech database called Nick [69], consisting of approximately 2,500 utterances from a British male speaker, was used in our experiments. This database was collected at Edinburgh University for the purpose of speech synthesis research. Sentences range in length from 3 to 36 words, with an average length of 7.3 words per sentence. The sentences cover the most frequent English words, biphoneme combinations, and syllables. In total, 2,944 different words are covered in the sentences.
Speech waveforms were sampled at 48 kHz and windowed by a 25-ms Blackman window with a 5-ms shift. The speech analysis and synthesis conditions expressed in CSTR/EMIME HTS 2010 [69] were used in this experiment. In this platform, Bark-cepstrum was extracted from smooth STRAIGHT trajectories [6], since it outperforms the predominant Mel-cepstrum coefficients. Also, the widely used log-F0 and five aperiodicity sub-bands (0 to 1, 1 to 2, 2 to 4, 4 to 6, and 6 to 8 kHz) were replaced with pitch in Mel and auditory-scale motivated frequency bands for the aperiodicity measure [69]. The analysis process generated 40 Bark-cepstrum coefficients, 1 pitch value in Mel, and 25 auditory-scale motivated aperiodicity frequency sub-bands for each frame of the training signals. These parameters along with their delta and delta-delta derivatives formed the five streams of our observation vectors.
For the baseline system, a five-state, multi-stream, left-to-right MSD-HSMM without skip paths was trained. A conventional maximum likelihood-based decision tree construction algorithm was used to tie HMM states. In the conventional HMM-based speech synthesis framework, a single tying structure (decision tree) is normally shared by both voicing probabilities and F0 output probabilities. As opposed to the conventional HMM-based synthesis system, the proposed method uses a soft decision tree structure for the output probability distribution and a hard decision tree for voicing probabilities; therefore, we cannot apply the same tying structure for both voicing and output probabilities in the proposed system. With a view to having a fair comparison, the baseline system was implemented with two different decision trees for F0 trajectories, one for the voicing labels and the other for the output probability distributions.
The same structure with just one different part was also implemented for the proposed synthesis system. In the proposed system, the soft decision tree structure, instead of the hard decision tree, is trained for the output probability distributions of F0 and its derivatives. All other decision trees, including the decision trees trained for state duration, Bark-cepstrum, aperiodicity, and voicing probability, are identical to the ones trained for the baseline system. Therefore, all parameters generated for them are equal to the parameters generated for the baseline system. It should be noted that both baseline and proposed synthesis systems employ the MDL criterion [66] to determine the size of all decision trees.
We considered four training sets of 100, 200, 400, and 800 utterances, and 400 sentences that were not included in the training sets were used as test data.
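The analysis settings above imply the following frame and feature sizes. This is a small arithmetic sketch using only the figures quoted in the text; the variable names are illustrative, and the total assumes that every static feature receives both a delta and a delta-delta counterpart.

```python
SAMPLE_RATE_HZ = 48_000
WINDOW_MS, SHIFT_MS = 25, 5

window_samples = SAMPLE_RATE_HZ * WINDOW_MS // 1000  # samples per analysis window
shift_samples = SAMPLE_RATE_HZ * SHIFT_MS // 1000    # samples between frame starts
frames_per_second = 1000 // SHIFT_MS

static_dims = {"bark_cepstrum": 40, "pitch_mel": 1, "aperiodicity": 25}
static_total = sum(static_dims.values())
with_deltas_total = 3 * static_total  # static + delta + delta-delta
```

Under these assumptions, each window spans 1,200 samples, frames start every 240 samples (200 frames per second), and each frame carries 66 static values, or 198 values once the delta and delta-delta features are appended.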
4.1.1 Employed contextual factors
Specific information about the contextual factors is presented in this subsection. Employed contextual factors can be categorized into five levels, including phonetic, syllable, word, phrase, and sentence levels. In each of these levels, all important features were considered.
➢ Phonetic-level factors

Identity of the current phoneme, the two preceding phonemes, and the two succeeding phonemes

Position of the current phoneme in the current syllable, word, phrase, and sentence
➢ Syllable-level factors

Stress level of previous, current, and next syllable (three different stress levels are defined for this database)

Position of the current syllable in the current word, phrase, and sentence

Number of the phonemes of the previous, current, and next syllable

Whether the previous, current, and next syllable is accented or not

Number of the stressed syllables before and after the current syllable in the current phrase

Number of syllables from the previous stressed syllable to the current syllable

Number of syllables from the previous accented syllable to the current syllable
➢ Word-level factors

Part of speech (POS) tag of the preceding, current, and succeeding word

Position of the current word in the current phrase and sentence (forward and backward)

Number of syllables of the previous, current, and next word

Number of content words before and after current word in the current phrase

Number of words from previous and next content word
➢ Phrase-level factors

Number of syllables and words of the preceding, current, and succeeding phrase

Position of the current phrase in the sentence

ToBI end tone of the current phrase
➢ Sentence-level factors

Number of phonemes, syllables, words, and phrases in the current utterance

Type of the current sentence
4.2 Experimental results
Both objective and subjective tests were conducted to evaluate the proposed F0 modeling method. The results of these tests are given in the following subsections.
4.2.1 Objective evaluation
The objective measure is the normalized log-likelihood of the F0 observations, in which the F0 and its derivatives are represented by o_{tl} and their voicing labels are denoted by g_{tl}; t is the frame index, and l, ranging from 1 to 3, indexes the static and dynamic features. Figure 5 depicts this measure for both test and training data. Red and blue curves correspond to the proposed soft and the conventional hard decision tree structures, respectively. Solid curves show the normalized log-likelihood measure of the training sets, and dashed curves show the normalized log-likelihood measure computed for the test data. The optimum number of terminal leaves calculated by the MDL principle is marked by vertical dotted lines. As can be seen from Figure 5, all red curves lie above their corresponding blue curves; therefore, the soft decision tree is able to provide a superior log-likelihood measure with a smaller number of model parameters. All learning curves confirm that the soft decision tree structure provides better generalization than the canonical hard decision tree structure.
4.2.2 Subjective evaluation
Two subjective tests were selected to assess the effectiveness of the proposed system in comparison with the conventional synthesis system. The comparative mean opinion score (CMOS) test [7] with a 7-point scale, ranging from −3 to 3, and the paired comparison test [70] were used to evaluate the subjective similarity of the synthesized and the natural utterances. Eighteen evaluators participated in our subjective evaluations, and each of them was asked to listen to 20 randomly chosen pairs of synthesized waveforms generated by two different synthesizers (i.e., the soft decision tree-based system and the conventional system).
In paired comparison tests, listeners are presented with a number of pairs of waveforms and are asked to identify which one is more similar to its corresponding natural speech signal. If the two utterances sound equal, listeners are allowed to choose the equality option. The paired comparison test simply reports the percentage of comparisons in which a certain synthesizer outperforms the other.
In CMOS tests, listeners not only select the better utterance but also rate the degree of difference between the two utterances. Four levels are normally defined for this purpose (namely, 0, 1, 2, and 3, which mean about the same, slightly different, different, and much different, respectively). These difference levels are used to compute the CMOS scores, which are calculated in each comparison for each synthesizer separately. More precisely, a positive score equal to the difference level is assigned to the winner of the comparison, and a negative score with the same absolute value is assigned to the loser. Finally, the CMOS score is obtained by averaging over all scores.
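The two scoring procedures described above can be sketched as follows. The function name `score_comparisons`, the system labels, and the example judgements are hypothetical and chosen only for illustration (they are not listener data from this study); the sketch also assumes that comparisons judged equal contribute a score of 0 to the CMOS average.

```python
from collections import Counter

def score_comparisons(judgements, systems=("soft", "hard")):
    """judgements: list of (preferred, level) pairs, where 'preferred' is a
    system name or None for the equality option, and 'level' is the
    difference level (0, 1, 2, or 3).
    Returns (preference percentages, CMOS score per system)."""
    if not judgements:
        raise ValueError("at least one judgement is required")
    prefs = Counter()
    totals = {s: 0 for s in systems}
    for preferred, level in judgements:
        if preferred is None or level == 0:
            prefs["equal"] += 1          # tie: both systems score 0
            continue
        prefs[preferred] += 1
        for s in systems:
            # winner gets +level, loser gets -level
            totals[s] += level if s == preferred else -level
    n = len(judgements)
    pct = {k: 100.0 * prefs[k] / n for k in (*systems, "equal")}
    cmos = {s: totals[s] / n for s in systems}
    return pct, cmos

# Made-up judgements, for illustration only.
example = [("soft", 2), ("soft", 1), (None, 0), ("hard", 1)]
pct, cmos = score_comparisons(example)
print(pct)
print(cmos)
```

Because the winner and the loser of each comparison receive scores of equal magnitude and opposite sign, the CMOS scores of the two systems are always negatives of each other in this two-system setting.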
Another notable conclusion that can be drawn from the results presented in this section is that, as the number of training utterances increases, the improvement achieved by soft clustering is slightly reduced; thus, the proposed structure is most beneficial when training data are limited.
5 Conclusions
This paper addressed one of the most important shortcomings of hard decision tree-based context-dependent F0 modeling, namely, poor context generalization. In the hard decision tree structure, each acoustic feature vector contributes to modeling only one contextual cluster, which is the main reason for poor generalization. In order to alleviate this problem, the capability of exploiting soft questions was added to the conventional decision tree architecture. The resulting structure, called the soft decision tree, splits the contextual factor space into several soft clusters; therefore, each context is assigned to several leaves, which provides superior generalization. In this paper, a maximum entropy model was used to derive the distribution expressed by the soft decision tree architecture. Relying on this maximum entropy-based distribution, a complete speech synthesis system was designed and implemented. Experimental results using both objective and subjective criteria showed that the proposed system outperforms the conventional hard decision tree-based system.
Endnote
^{a}The unfortunate need for three separate streams only arises when using MSD output distributions to model F0: it is possible (at the onset or offset of voicing) for the dimensionality of the delta stream to be 0 in the same frame that the dimensionality of F0 is 1. That is, F0 exists, but its delta is undefined.
References
1. King S: An introduction to statistical parametric speech synthesis. Sadhana 2011, 36(5):837–852. 10.1007/s12046-011-0048-y
2. Dutoit T: An Introduction to Text-to-Speech Synthesis, vol. 3. Springer book (Kluwer Academic Publishers), The Netherlands; 1997.
3. Zen H, Tokuda K, Black AW: Statistical parametric speech synthesis. Speech Comm 2009, 51(11):1039–1064. 10.1016/j.specom.2009.04.004
4. Black AW, Zen H, Tokuda K: Statistical Parametric Speech Synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Honolulu, Hawaii, USA; 2007, vol. 4, pp. IV-1229.
5. Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T: Mixed Excitation for HMM-Based Speech Synthesis. European Conference on Speech Communication and Technology INTERSPEECH, Aalborg, Denmark; 2001:2263–2266.
6. Kawahara H, Masuda-Katsuse I, Cheveigné A: Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Comm 1999, 27(3):187–207.
7. Drugman T, Dutoit T: The deterministic plus stochastic model of the residual signal and its applications. IEEE Transactions on Audio, Speech and Language Processing 2012, 20(3):968–981.
8. Drugman T, Wilfart G, Dutoit T: A Deterministic Plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis. INTERSPEECH, Brighton, United Kingdom; 2009:1779–1782.
9. Stylianou Y: Applying the harmonic plus noise model in concatenative speech synthesis. IEEE Transactions on Speech and Audio Processing 2001, 9(1):21–29. 10.1109/89.890068
10. Liberman MY, Church KW: Text analysis and word pronunciation in text-to-speech synthesis. Advances in Speech Signal Processing 1992, 791–831.
11. Tokuda K, Yoshimura T, Masuko T, Kobayashi T, Kitamura T: Speech Parameter Generation Algorithms for HMM-Based Speech Synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Istanbul; 2000:1315–1318.
12. Toda T, Tokuda K: Speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Transactions on Information and Systems 2007, E90-D(5):816–824. 10.1093/ietisy/e90-d.5.816
13. Takamichi S, Toda T, Shiga Y, Sakti S, Neubig G, Nakamura S: Parameter Generation Methods With Rich Context Models for High-Quality and Flexible Text-to-Speech Synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, British Columbia, Canada; 2013.
14. Shannon M, Byrne W: Fast, Low-Artifact Speech Synthesis Considering Global Variance. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, British Columbia, Canada; 2013:7869–7873.
15. Hunt AJ, Black AW: Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Atlanta, Georgia, USA; 1996:373–376.
16. Yamagishi J, Kobayashi T: Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training. IEICE Transactions on Information and Systems 2007, 90(2):533–543.
17. Yamagishi J, Nose T, Zen H, Ling ZH, Toda T, Tokuda K, King S, Renals S: Robust speaker-adaptive HMM-based text-to-speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing 2009, 17(6):1208–1230.
18. Yamagishi J, Kobayashi T, Nakano Y, Ogata K, Isogai J: Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Transactions on Audio, Speech, and Language Processing 2009, 17(1):66–83.
19. Zen H, Braunschweiler N, Buchholz S, Gales MJ, Knill K, Krstulovic S, Latorre J: Statistical parametric speech synthesis based on speaker and language factorization. IEEE Transactions on Audio, Speech, and Language Processing 2012, 20(6):1713–1724.
20. Wu YJ, Nankaku Y, Tokuda K: State Mapping Based Method for Cross-Lingual Speaker Adaptation in HMM-Based Speech Synthesis. INTERSPEECH, Brighton, United Kingdom; 2009:528–531.
21. Liang H, Dines J, Saheer L: A Comparison of Supervised and Unsupervised Cross-Lingual Speaker Adaptation Approaches for HMM-Based Speech Synthesis. IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Dallas, Texas, USA; 2010:4598–4601.
22. Gibson M, Hirsimaki T, Karhila R, Kurimo M, Byrne W: Unsupervised Cross-Lingual Speaker Adaptation for HMM-Based Speech Synthesis Using Two-Pass Decision Tree Construction. IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Dallas, Texas, USA; 2010:4642–4645.
23. Yamagishi J, Ling Z, King S: Robustness of HMM-Based Speech Synthesis. INTERSPEECH, Brisbane, Australia; 2008:581–584.
24. Karhila R, Remes U, Kurimo M: HMM-Based Speech Synthesis Adaptation Using Noisy Data: Analysis and Evaluation Methods. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, British Columbia, Canada; 2013:6930–6934.
25. Yanagisawa K, Latorre J, Wan V, Gales MJ, King S: Noise Robustness in HMM-TTS Speaker Adaptation. 8th ISCA Speech Synthesis Workshop, Barcelona, Spain; 2013:119–124.
26. Cernak M, Motlicek P, Garner PN: On the (Un)importance of the Contextual Factors in HMM-Based Speech Synthesis and Coding. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, British Columbia, Canada; 2013:8140–8143.
27. Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T: Simultaneous Modeling of Spectrum, Pitch and Duration in HMM-Based Speech Synthesis. Proceedings of Eurospeech, Budapest, Hungary; 1999:2347–2350.
28. Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T: Duration Modeling in HMM-Based Speech Synthesis System. Proceedings of ICSLP, Sydney, Australia; 1998:29–32.
29. Zen H, Nose T, Yamagishi J, Sako S, Masuko T, Black A, Tokuda K: The HMM-Based Speech Synthesis System (HTS) Version 2.0. 6th ISCA Workshop on Speech Synthesis (SSW), Bonn, Germany; 2007:294–299.
30. Tokuda K, Zen H, Black AW: An HMM-Based Speech Synthesis System Applied to English. IEEE Workshop on Speech Synthesis, Scotland; 2002:227–230.
31. Zen H, Toda T, Nakamura M, Tokuda K: Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005. IEICE Trans Inf Syst 2007, 90(1):325–333.
32. Zen H, Toda T, Tokuda K: The Nitech-NAIST HMM-based speech synthesis system for the Blizzard Challenge 2006. IEICE Trans Inf Syst 2008, 91(6):1764–1773.
33. Tokuda K, Nankaku Y, Toda T, Zen H, Yamagishi J, Oura K: Speech synthesis based on hidden Markov models. Proc IEEE 2013, 101(5):1234–1252.
34. Odell JJ: The Use of Context in Large Vocabulary Speech Recognition. PhD dissertation, Cambridge University; 1995.
35. Young SJ, Odell JJ, Woodland PC: Tree-Based State Tying for High Accuracy Acoustic Modeling. Proceedings of the Workshop on Human Language Technology, Association for Computational Linguistics, Stroudsburg, PA, USA; 1994:307–312.
36. Tokuda K, Masuko T, Miyazaki N, Kobayashi T: Multi-space probability distribution HMM. IEICE Trans Inf Syst 2002, 85(3):455–464.
37. Zen H, Tokuda K, Masuko T, Kobayashi T, Kitamura T: A hidden semi-Markov model-based speech synthesis system. IEICE Transactions on Information and Systems 2007, E90-D(5):825–834.
38. Zen H, Senior A, Schuster M: Statistical Parametric Speech Synthesis Using Deep Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013. IEEE; 2013:7962–7966.
39. Lu H, King S, Watts O: Combining a Vector Space Representation of Linguistic Context With a Deep Neural Network for Text-to-Speech Synthesis. 8th ISCA Speech Synthesis Workshop, Barcelona, Spain; 2013:261–265.
40. Ling ZH, Deng L, Yu D: Modeling Spectral Envelopes Using Restricted Boltzmann Machines for Statistical Parametric Speech Synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, British Columbia, Canada; 2013:7825–7829.
41. Kang S, Qian X, Meng H: Multi-Distribution Deep Belief Network for Speech Synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, British Columbia, Canada; 2013:8012–8016.
42. Koriyama T, Nose T, Kobayashi T: Statistical parametric speech synthesis based on Gaussian process regression. IEEE Journal of Selected Topics in Signal Processing 2013, 99:1–11.
43. Hashimoto K, Nankaku Y, Tokuda K: A Bayesian Approach to Hidden Semi-Markov Model Based Speech Synthesis. Proceedings of Interspeech, Brighton, United Kingdom; 2009:1751–1754.
44. Khorram S, Sameti H, Bahmaninezhad F, King S, Drugman T: Context-Dependent Acoustic Modeling Based on Hidden Maximum Entropy Model for Statistical Parametric Speech Synthesis. EURASIP Journal on Audio, Speech, and Music Processing 2014, 12:1–21.
45. Nankaku Y, Nakamura K, Zen H, Tokuda K: Acoustic Modeling With Contextual Additive Structure for HMM-Based Speech Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Las Vegas, Nevada, USA; 2008:4469–4472.
46. Takaki S, Nankaku Y, Tokuda K: Spectral Modeling With Contextual Additive Structure for HMM-Based Speech Synthesis. Proceedings of 7th ISCA Speech Synthesis Workshop, Kyoto, Japan; 2010:100–105.
47. Takaki S, Nankaku Y, Tokuda K: Contextual Partial Additive Structure for HMM-Based Speech Synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, British Columbia, Canada; 2013:7878–7882.
48. Takaki S, Nankaku Y, Tokuda K: Contextual additive structure for HMM-based speech synthesis. IEEE J Selected Topics in Signal Processing 2014, 8(2):229–238.
49. Zen H, Braunschweiler N: Context-Dependent Additive log F0 Model for HMM-Based Speech Synthesis. INTERSPEECH, Brighton, United Kingdom; 2009:2091–2094.
50. Wu YJ, Soong F: Modeling pitch trajectory by hierarchical HMM with minimum generation error training. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan; 2012:4017–4020.
51. Sakai S: Additive modeling of English F0 contour for speech synthesis. Proceedings of ICASSP 2008, 1:277–280.
52. Qian Y, Liang H, Soong FK: Generating Natural F0 Trajectory With Additive Trees. INTERSPEECH, Brisbane, Australia; 2008:2126–2129.
53. Yu K, Mairesse F, Young S: Word-Level Emphasis Modeling in HMM-Based Speech Synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, Texas, USA; 2010:4238–4241.
54. Yu K, Zen H, Mairesse F, Young S: Context adaptive training with factorized decision trees for HMM-based statistical parametric speech synthesis. Speech Comm 2011, 53(6):914–923. 10.1016/j.specom.2011.03.003
55. Gales MJ: Cluster adaptive training of hidden Markov models. IEEE Transactions on Speech and Audio Processing 2000, 8(4):417–428. 10.1109/89.848223
56. Olaru C, Wehenkel L: A complete fuzzy decision tree technique. Fuzzy Set Syst 2003, 138(2):221–254. 10.1016/S0165-0114(03)00089-7
57. Yuan Y, Shaw MJ: Induction of fuzzy decision trees. Fuzzy Set Syst 1995, 69(2):125–139. 10.1016/0165-0114(94)00229-Z
58. Rabiner L, Juang BH: An introduction to hidden Markov models. IEEE ASSP Mag 1986, 3(1):4–16.
59. Yu K, Young S: Continuous F0 modeling for HMM based statistical parametric speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing 2011, 19(5):1071–1079.
60. Moon TK: The expectation-maximization algorithm. IEEE Signal Process Mag 1996, 13(6):47–60. 10.1109/79.543975
61. HMM-based speech synthesis system (HTS). http://hts.sp.nitech.ac.jp/
62. Zen H: Implementing an HSMM-Based Speech Synthesis System Using an Efficient Forward-Backward Algorithm, Nagoya Institute of Technology, Technical Report TR-SP-0001. 2007.
63. Ferguson JD: Variable Duration Models for Speech. Proceedings of the Symposium on the Application of Hidden Markov Models to Text and Speech, USA; 1980:143–179.
64. Levinson SE: Continuously variable duration hidden Markov models for automatic speech recognition. Computer Speech and Language 1986, 1(1):29–45. 10.1016/S0885-2308(86)80009-2
65. Yu SZ, Kobayashi H: An efficient forward-backward algorithm for an explicit-duration hidden Markov model. IEEE Signal Processing Letters 2003, 10(1):11–14.
66. Shinoda K, Watanabe T: Speaker Adaptation With Autonomous Model Complexity Control by MDL Principle. Proceedings of ICASSP, Atlanta, Georgia, USA; 1996:717–720.
67. Berger AL, Pietra VJD, Pietra SAD: A maximum entropy approach to natural language processing. Computational Linguistics 1996, 22(1):39–71.
68. Ratnaparkhi A: A Maximum Entropy Model for Part-of-Speech Tagging. Proceedings of the Conference on Empirical Methods in Natural Language Processing, PA, USA; 1996:133–142.
69. Yamagishi J, Watts O: The CSTR/EMIME HTS System for Blizzard Challenge. Proceedings of Blizzard Challenge 2010, Japan; 2010:1–6.
70. Yamagishi J: Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology; 2006.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.