Soft context clustering for F0 modeling in HMM-based speech synthesis

This paper proposes the use of a new binary decision tree, which we call a soft decision tree, to improve generalization performance compared to the conventional ‘hard’ decision tree method that is used to cluster context-dependent model parameters in statistical parametric speech synthesis. We apply the method to improve the modeling of fundamental frequency, which is an important factor in synthesizing natural-sounding high-quality speech. Conventionally, hard decision tree-clustered hidden Markov models (HMMs) are used, in which each model parameter is assigned to a single leaf node. However, this ‘divide-and-conquer’ approach leads to data sparsity, with the consequence that it suffers from poor generalization, meaning that it is unable to accurately predict parameters for models of unseen contexts: the hard decision tree is a weak function approximator. To alleviate this, we propose the soft decision tree, which is a binary decision tree with soft decisions at the internal nodes. In this soft clustering method, internal nodes select both their children with certain membership degrees; therefore, each node can be viewed as a fuzzy set with a context-dependent membership function. The soft decision tree improves model generalization and provides a superior function approximator because it is able to assign each context to several overlapped leaves. In order to use such a soft decision tree to predict the parameters of the HMM output probability distribution, we derive the smoothest (maximum entropy) distribution which captures all partial first-order moments and a global second-order moment of the training samples. Employing such a soft decision tree architecture with maximum entropy distributions, a novel speech synthesis system is trained using maximum likelihood (ML) parameter re-estimation and synthesis is achieved via maximum output probability parameter generation. 
In addition, a soft decision tree construction algorithm optimizing a log-likelihood measure is developed. Both subjective and objective evaluations were conducted and indicate a considerable improvement over the conventional method.


Introduction
Demand for natural and high-quality speech-based human-computer interaction is increasing due to applications including speech-based virtual assistants for mobile devices. Speech synthesis plays a significant role, not only in transmitting factual information, but also as the outward 'face' of the application: the naturalness of the synthesis affects overall user satisfaction. Speech synthesis from text is usually achieved via an intermediate linguistic specification [1], which can be thought of as a collection of contextual factors (such as phonetic and prosodic properties of the current, preceding, and following segment) which have been derived from the text. Here, we are concerned only with the conversion of this linguistic specification to a speech waveform. In order to perform this conversion, several methods have been proposed [2], of which statistical parametric speech synthesis (SPSS) [3,4] has been dominant, at least in research terms, for the last decade or more.
Figure 1 portrays the overall architecture of a typical SPSS system, which comprises two distinct phases [3,4], namely, training and synthesis. The training phase starts with the extraction of acoustic features and the linguistic specification (i.e., contextual factors) for all training utterances. Waveforms are converted to a compact set of acoustic features using a vocoder (such as MELP [5], STRAIGHT [6], DSM [7,8], or HNM [9]), and simultaneously, all texts are expanded into contextual factors using a natural language pre-processing front end [10]. Thereafter, the training phase proceeds to the context-dependent statistical modeling step in which the dependencies between extracted acoustic features and contextual factors are modeled through context-dependent statistical models [11]. It is important to note that, because of the extreme sparsity in the contextual feature space, the decision tree (DT) used to cluster the model parameters is the critical component in the statistical modeling.
In the synthesis phase, contextual factors are obtained for the input text, and the decision tree is used to obtain the corresponding trained model parameters, from which a parameter generation (PG) algorithm [11][12][13][14] generates acoustic feature trajectories. These are then converted to a speech waveform using the vocoder.
Though compressing a human voice into a compact statistical model offers advantages over concatenation of waveforms, there remains one major shortcoming: lower quality synthetic speech. This is caused by one or more of the blocks shown in Figure 1, e.g., inadequate acoustic coverage of training utterances, noisy speech database [24,25], errors in natural language processing, inadequate contextual factors [26], inaccurate statistical modeling [3], the PG algorithm [11][12][13][14], or vocoding distortion [5][6][7][8][9]. Here, we propose improvements to the statistical modeling (the shaded block in Figure 1) and specifically for the F0 stream.
Hidden Markov model (HMM)-based speech synthesis [27][28][29][30][31][32][33] models not only the spectrum, but also the excitation and duration in a unified framework of context-dependent [34,35] multi-space probability distribution [36] hidden semi-Markov models (HSMMs) [37]. More precisely, an independent binary-branching hard decision tree is constructed for each stream of acoustic features (spectrum, aperiodic energy, and fundamental frequency). In the case of F0, a multi-space probability distribution [36] is associated with each leaf of the decision tree. Contextual space (which is very large and very sparse due to the great number of contextual factors typically employed) is divided by the decision trees into multiple hard (i.e., non-overlapping) clusters; each cluster is a group of context-dependent HMM states that share the same output probability distribution.
Hard decision tree-based context clustering, which is the standard approach to F0 modeling, has poor generalization [38]. In other words, this structure cannot accurately predict the parameters of models of unseen contexts, given the very limited subset of contexts observed in the training data. In order to predict acoustic variations with high generalization capability, the model has to be able to express a large number of robust distributions (i.e., a large number of distributions, but such that each one can be trained from a sufficient number of training samples). In hard decision trees, increasing the number of distributions by growing the depth of the tree reduces the number of training samples assigned to each leaf, and thus, the robustness of the distributions is weakened. This problem stems from the fact that the hard decision tree structure assigns each model parameter to exactly one cluster (corresponding to a small part of the large contextual space): each training sample contributes to the estimation of only one set of model parameters (one mean vector and one covariance matrix). Our hypothesis is that by enabling each training sample to influence multiple sets of model parameters (and thus cover a larger portion of contextual space), generalization to unseen contexts would be improved.

Related work
Several attempts have already been made to alleviate the limitations of F0 modeling in standard decision tree-clustered H(S)MMs. One of these is the use of deep neural networks (DNNs) [38,39], which are able to approximate complex acoustic feature-to-linguistic context dependencies by employing many hidden layers; contrast this with decision trees that cannot efficiently represent something as simple as XOR or multiplexing [38], i.e., they must use an excessive number of leaves to capture such relationships and thus over-fragment the already sparse training data [38]. DNNs are also able to represent non-binary contextual features, whereas decision trees generally only use binary splits. Other deep learning approaches such as restricted Boltzmann machines (RBMs) and deep belief networks (DBNs) have also been demonstrated to be effective generative models when applied to speech synthesis [40,41].
Speech synthesis based on Gaussian process regression (GPR) [42] is another new technique that has recently been introduced to alleviate basic limitations of HMM-based speech synthesis. The goal of GPR is to remove the incorrect stationarity assumption of the state output distribution in HMM-based speech synthesis. GPR uses frame-level contextual factors (such as position of the frame within the phone, and articulatory features) to estimate frame-level acoustic trajectories [42]. GPR can directly express complex context dependencies without needing decision tree structures and is able to use all contextual factors of all types simultaneously; therefore, it has the potential to provide better generalization.
In [43], a new system is proposed that replaces the usual maximum likelihood (ML) point estimate of the model parameters with a variational Bayesian method. Their system outperforms the usual approach when the amount of training data is limited, thus demonstrating superior generalization.
F0 modeling with additive structures has also been used to express the relationship between contextual factors and the F0 trajectory [44][45][46][47][48][49][50][51][52][53][54]. Contextual additive modeling [45][46][47][48] assumes model parameters to be a sum of multiple independent components, each having different context dependencies; therefore, different decision trees have to be trained for them. The contextual additive model is able to exploit contextual factors more efficiently, because mean vectors and covariance matrices of the predicted distributions are the sum of mean vectors and covariance matrices of the additive components [45]: each training sample contributes to more than one model parameter. Takaki et al. [46][47][48] used an additive structure for spectral modeling and reported that it has a high computational cost. To alleviate this, they proposed covariance parameter tying and a simplified likelihood calculation algorithm using the matrix inversion lemma. Though the contextual additive model [45][46][47][48] was originally proposed for spectral modeling, it could be used for F0 trajectories.
Zen et al. [49] also developed an additive F0 modeling structure with multiple components for mean vectors and a single component for variance values. Accordingly, multiple decision trees were trained for the mean vectors, and just one decision tree was built for the variance values. In their system, different sets of contextual factors were used for different additive components and all trees were built concurrently. Similarly, [50] proposed another additive structure with multiple decision trees, but a minimum generation error (MGE) measure was used as the selection criterion instead of the more common maximum likelihood (ML) measure. In [51], an additive model with three different layers, including intonational phrase, word level, and pitch accent, was designed. All three components were trained concurrently using a regularized least square error measure. Qian et al. [52] proposed a new gradient-based tree-boosting approach with a view to training multiple additive regression trees. Their decision trees were built in successive stages to minimize the squared error.
Some studies [53,54] have also highlighted another important problem of the common decision tree-based F0 modeling: its deficiency in capturing the effect of contextual features that are poorly represented in the training database. These features (i.e., questions used in the decision tree splits) have little influence on the likelihood criterion and hence will not be selected by the usual decision tree construction algorithms [53,54]. One obvious technique to solve this problem is to build the decision tree using a two-stage algorithm [53]. In the first stage, all splits are made only with these under-represented contextual factors. This stage captures the influence of such factors, even though they are rare. In the second stage, the well-represented factors are employed. This procedure is not efficient, since the first stage reduces the amount of the training data available for modeling the dependency between well-represented contextual factors and F0 [54]. Context adaptive training with factorized decision trees [54] is another method designed to exploit rare features more effectively. There, cluster adaptive training [55] is employed such that an average model is built and then this general model is adapted using a set of transforms. In fact, well-represented contextual factors contribute to generating the average model, and rare contextual questions are taken into account for the transforms. Due to the use of cluster adaptive training, this structure is also able to improve context generalization.

Scope of the paper
Numerous binary and non-binary contextual factors are generally taken into account in modeling F0. Conventional HMM-based speech synthesis converts all non-binary contextual factors to multiple binary questions (i.e., potential decision tree splits). As mentioned earlier, this structure may suffer from inadequate context generalization. To alleviate this deficiency, we propose the direct use of non-binary contextual factors in a soft decision tree framework [56,57]. The proposed soft decision tree structure is an innovative binary decision tree with soft questions at each non-terminal node. Both children are selected with a specific membership degree. In contrast to a hard decision tree that partitions contextual factor space into hard contextual regions, the proposed soft decision tree is able to provide soft (i.e., overlapping) clusters. In this structure, each context will be assigned to several terminal leaves with certain membership functions, and consequently, each training sample affects multiple model parameters, and generalization should be improved.
The rest of the paper is organized as follows: Section 2 presents the classical hard decision tree approach to F0 modeling in statistical parametric speech synthesis. The proposed soft context-clustered HMM structure and details of the associated speech synthesis system that employs such trees are provided in Section 3. Section 4 reports the experiments and results, and Section 5 concludes the paper.

F0 modeling using hard decision trees
This section describes the predominant framework for F0 modeling in HMM-based speech synthesis, which is the same framework used for the spectral envelope, aperiodic energy, and duration. This section also sets out the notation, algorithms, and structures required for subsequent sections.

F0 modeling in the HMM framework
Typically, F0 along with its delta and delta-delta derivatives form three streams of a context-dependent [34,35] multi-space probability distribution (MSD) [36] left-to-right HSMM without skip transitions [58,37] (which, for obvious reasons, we shorten to simply 'HMM' in this paper). This model structure generates acoustic trajectories of a unit (e.g., phoneme) by emitting observations from hidden states. The output distribution of each state is a context-dependent multi-space Gaussian distribution [36], and these distributions are clustered into groups of related contexts using a decision tree in order to reduce the number of free parameters and allow the modeling of unseen contexts. For notational simplicity, we limit our discussion here to an HMM with just one stream. Generalizing this to the multi-stream case is straightforward.
Figure 2 shows the equivalent dynamic Bayesian network (DBN) for such an HMM [59]. In this figure, $q_t$, $o_t$, and $g_t$ respectively represent the state index, the acoustic feature vector, and the space index [36] at time t.
When using MSD output distributions with two spaces (for defined and undefined values), the space index is an observed value equal to the voicing label. The figure also introduces $c_j$, $d_j$, and $t_j$, which are the contextual factors, the duration, and the last frame index of the jth state (clearly, $d_j = t_j - t_{j-1}$). Note that state boundaries are latent variables and have to be trained in an unsupervised manner using the expectation maximization (EM) [60] algorithm.
According to this figure, the HMM is specified through three sets of fundamental distributions: (i) the state duration probability distribution $p_j(d_j \mid c_j)$, (ii) the voicing (space) probability distribution $w_j(g_t \mid c_j)$, and (iii) the output probability distribution given voicing labels $b_j(o_t \mid g_t, c_j)$. Using these fundamental distributions and considering the graphical model represented by Figure 2, the likelihood of a given utterance with observations $(o, g, c)$ can be factorized as

$$P(o, g \mid c, \lambda) = \sum_{\forall d} \prod_{j=1}^{J} \Big[ p_j(d_j \mid c_j) \prod_{t=t_{j-1}+1}^{t_j} w_j(g_t \mid c_j)\, b_j(o_t \mid g_t, c_j) \Big] \qquad (1)$$

where the sum runs over all possible sequences of state durations $d = (d_1, \dots, d_J)$ (the state boundaries being latent), and J and $\lambda$ denote the total number of states and the model parameters, respectively. Now, assume $g_t$ takes two values: '1' for voiced frames and '0' for unvoiced regions; also, let $b_j$ and $p_j$ be expressed through Gaussian distributions. Therefore, the above utterance likelihood can be rewritten as

$$P(o, g \mid c, \lambda) = \sum_{\forall d} \prod_{j=1}^{J} \Big[ \mathcal{N}(d_j; m_j, \sigma_j^2) \prod_{t=t_{j-1}+1}^{t_j} w_j^{g_t} (1 - w_j)^{1 - g_t}\, \mathcal{N}(o_t; \mu_j, \Sigma_j)^{g_t} \Big] \qquad (2)$$

where $\mathcal{N}(\cdot\,; \mu, \Sigma)$ represents a Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$. In this equation, duration and output distributions are parameterized by the duration mean ($m_j$), duration variance ($\sigma_j^2$), voicing probability ($w_j$), output mean vector ($\mu_j$), and observation covariance matrix ($\Sigma_j$). As previously mentioned, a canonical decision tree structure is used to express the fundamental distributions. Assume $I_l^d(c_j)$ and $I_l^o(c_j)$ are defined as binary indicator functions of the decision trees trained for duration and output distributions, where l and $c_j$ are, respectively, the leaf index and the contextual factors extracted for the jth state. In other words, $I_l^d(c_j)$ and $I_l^o(c_j)$ determine whether or not the jth state is assigned to the lth leaf of the duration and observation decision trees. Using these decision tree indicator functions, the HMM parameters can be expressed by

$$m_j = \sum_{l} I_l^d(c_j)\, m_l, \quad \sigma_j^2 = \sum_{l} I_l^d(c_j)\, \sigma_l^2, \quad w_j = \sum_{l} I_l^o(c_j)\, w_l, \quad \mu_j = \sum_{l} I_l^o(c_j)\, \mu_l, \quad \Sigma_j = \sum_{l} I_l^o(c_j)\, \Sigma_l \qquad (3)$$

where $m_l$ and $\sigma_l^2$ are the duration mean and variance values lying in the lth leaf of the duration decision tree. Similarly, $w_l$, $\mu_l$, and $\Sigma_l$ represent parameters of the voicing and output probability distributions that are trained for the lth leaf of the output decision tree.

HMM parameter re-estimation
The ML criterion is commonly used to estimate the model parameters of an HMM. However, state boundaries are latent, and therefore, the EM algorithm has to be adopted.
Given a set of N training utterances, along with their corresponding contextual factors $\{c^n\}_{n=1}^{N}$, the EM algorithm leads to re-estimation formulas in which $\hat{m}_l$, $\hat{\sigma}_l^2$, $\hat{\mu}_l$, $\hat{\Sigma}_l$, and $\hat{w}_l$ are the new values of $m_l$, $\sigma_l^2$, $\mu_l$, $\Sigma_l$, and $w_l$ during the EM algorithm. In these formulas, $\chi_j(t_j, t_{j-1})$ is the probability of occupying the jth state from time $t_{j-1}$ to $t_j$, and $\gamma_j(t)$ denotes the posterior probability of being in state j at time t. These probabilities are calculated through the well-known forward-backward algorithms. It should be noted that the publicly available HMM-based speech synthesis system (HTS) [61] has been implemented based on the algorithms expressed in [62]. These algorithms were originally proposed by Ferguson [63] and were refined by Levinson [64]. A more efficient version of the forward-backward algorithm has recently been proposed by Yu et al. [65].
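To make the role of the occupancy posteriors concrete, the following is a minimal sketch of an occupancy-weighted update for the voicing and output parameters of a single leaf. It assumes a one-dimensional F0 observation, takes the posteriors $\gamma_j(t)$ as already computed by a forward-backward pass, and omits the duration parameters; the function name `reestimate_leaf` and the tuple layout are illustrative choices, not part of HTS or of the original formulas.

```python
def reestimate_leaf(frames):
    """Occupancy-weighted update for one leaf (1-D F0, illustrative only).

    frames: list of (gamma, voiced, f0) tuples pooled over every state tied
    to this leaf. gamma is the state-occupancy posterior, voiced is 0 or 1,
    and f0 is ignored when voiced == 0.
    """
    total_occ = sum(gamma for gamma, _, _ in frames)
    voiced = [(gamma, f0) for gamma, v, f0 in frames if v == 1]
    voiced_occ = sum(gamma for gamma, _ in voiced)
    # Voicing weight: share of the occupancy that falls on voiced frames.
    w_hat = voiced_occ / total_occ
    # Mean and variance use voiced frames only, weighted by gamma.
    mu_hat = sum(gamma * f0 for gamma, f0 in voiced) / voiced_occ
    var_hat = sum(gamma * (f0 - mu_hat) ** 2 for gamma, f0 in voiced) / voiced_occ
    return w_hat, mu_hat, var_hat


# Hypothetical log-F0 values with made-up posteriors.
frames = [(1.0, 1, 5.0), (1.0, 1, 5.2), (0.5, 1, 5.1), (0.5, 0, None)]
w_hat, mu_hat, var_hat = reestimate_leaf(frames)
```

Because each state is tied to exactly one leaf in the hard tree, every frame contributes to only one such update, which is the limitation the soft tree is meant to relax.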

Decision tree-based state clustering
In order to capture the context dependencies inherent in the acoustic features, canonical decision trees are typically incorporated in the HMM framework. Decision trees are constructed iteratively through a greedy, top-down procedure which maximizes the log-likelihood criterion [34,35]. The procedure starts with a single root node representing all contexts. In each iteration, an optimum pair of terminal node and question is selected so that splitting the terminal node by the selected question results in the largest log-likelihood increase. The splitting procedure is continued until a termination criterion (such as minimum description length (MDL) [66]) is satisfied. The overall log-likelihood increase $\delta L$, achieved by splitting a parent node $l_1$ into two children $l_2$ and $l_3$, can be computed in closed form [34]; in that expression, a superscript n indexes the training utterances. It should be noted that in order to obtain the likelihood increase expression, the following simplifying assumptions have to be made [34]: (1) the values of the occupation probabilities are assumed to be fixed during the clustering procedure [34]; (2) the overall likelihood measure is approximated by a simple average of the log likelihoods weighted by the posterior probabilities [34]. These assumptions make the calculation of $\delta L$ possible for all pairs of terminal nodes and questions.
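The greedy selection step can be sketched as follows. This is an illustration under stated assumptions, not the exact criterion of [34]: it uses a single one-dimensional Gaussian per node, fixed occupancy weights, and ignores the voicing weights of the MSD; the names `node_loglik`, `split_gain`, and `best_question` are hypothetical.

```python
import math


def node_loglik(samples, var_floor=1e-8):
    """Log-likelihood of a node under its own ML Gaussian (1-D).

    samples: list of (gamma, x), where gamma is a fixed occupancy weight.
    With ML mean and variance this reduces to -0.5 * occ * (log(2*pi*var) + 1).
    """
    occ = sum(g for g, _ in samples)
    mean = sum(g * x for g, x in samples) / occ
    var = max(sum(g * (x - mean) ** 2 for g, x in samples) / occ, var_floor)
    return -0.5 * occ * (math.log(2.0 * math.pi * var) + 1.0)


def split_gain(samples, answers):
    """Log-likelihood increase from splitting a node by a yes/no question."""
    yes = [s for s, a in zip(samples, answers) if a]
    no = [s for s, a in zip(samples, answers) if not a]
    if not yes or not no:
        return 0.0  # a question that does not divide the node gains nothing
    return node_loglik(yes) + node_loglik(no) - node_loglik(samples)


def best_question(samples, questions):
    """Pick the question (name -> list of bool answers) with the largest gain."""
    return max(questions, key=lambda q: split_gain(samples, questions[q]))


samples = [(1.0, x) for x in (1.0, 1.1, 0.9, 5.0, 5.1, 4.9)]
questions = {
    "separates_clusters": [True, True, True, False, False, False],
    "mixes_clusters": [True, False, True, False, True, False],
}
chosen = best_question(samples, questions)
```

Since a hard split leaves the statistics of all other leaves untouched, the gain of every candidate can be evaluated from the samples of the node being split alone.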

Soft context-clustered HMM
Generally, decision tree is the term for a hierarchical structure consisting of internal nodes and terminal leaves. In a canonical hard binary decision tree used for acoustic modeling, each terminal node carries a distribution. Accordingly, to determine the distribution of a given contextual factor, we need to start from the root node, recursively apply the test at each internal node, and select one of the two branches depending on the outcome. This process is repeated until a leaf node is hit, at which point the distribution of the leaf is taken as the output probability distribution. Therefore, for each context, just one path from the root to a terminal node is traversed, and each context is thereby assigned to one leaf and affects the distribution of that single leaf. In order to improve the performance of the canonical decision tree, this paper proposes the soft binary decision tree structure, which is able to establish several fuzzy paths from the root to multiple leaves.
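The single-path behaviour of the hard tree can be illustrated with a short sketch. The class `HardNode`, the function `find_leaf`, and the two questions used here are hypothetical and serve only to show that exactly one leaf is reached for any context.

```python
class HardNode:
    """Node of a hard binary decision tree (illustrative structure)."""

    def __init__(self, question=None, yes=None, no=None, leaf_id=None):
        self.question = question  # callable: context -> bool (internal nodes)
        self.yes = yes            # child taken when the answer is True
        self.no = no              # child taken when the answer is False
        self.leaf_id = leaf_id    # set only on terminal nodes


def find_leaf(node, context):
    """Follow the single root-to-leaf path selected by the crisp questions."""
    while node.leaf_id is None:
        node = node.yes if node.question(context) else node.no
    return node.leaf_id


tree = HardNode(
    question=lambda c: c["is_vowel"],
    yes=HardNode(
        question=lambda c: c["stressed"],
        yes=HardNode(leaf_id=0),
        no=HardNode(leaf_id=1),
    ),
    no=HardNode(leaf_id=2),
)
leaf = find_leaf(tree, {"is_vowel": True, "stressed": False})
```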

Soft context-clustered HMM structure
The soft decision tree applies soft decisions $\tilde{f}_m(c)$ at its internal nodes m and redirects all samples to both children, but with certain membership degrees computed by these decisions. In fact, each node of a soft decision tree represents a fuzzy subset of the contextual factor space; therefore, each context belongs to several nodes with a membership degree. More precisely, when we are traversing the node m for the given context c, the soft question $\tilde{f}_m(c)$ represents the membership grade of the left child, and $1 - \tilde{f}_m(c)$ is the degree of selecting the right child.
In both hard and soft decision tree-based HMMs, a set of contextual factors initially has to be defined and extracted for all training utterances. Thereafter, as opposed to the hard decision tree that requires hard questions $f_m(c)$, here, we have to design a great number of soft questions (soft tests) $\tilde{f}_m(c)$ for each contextual factor. These questions are finally assigned to the internal nodes of the decision tree and make fuzzy decisions to select among their children instead of the common crisp decisions.
As can be seen from the above discussion, all terminal leaves may be active for an arbitrary context; as a consequence, it is necessary to generalize the indicator function $I_m(c)$ expressed by Equation 3 to the membership function of assigning context c to the node m. This membership function is denoted by $\tilde{I}_m(c)$ and can be computed through the following recursion:

$$\tilde{I}_{\mathrm{root}}(c) = 1, \qquad \tilde{I}_{m_L}(c) = \tilde{I}_m(c)\, \tilde{f}_m(c), \qquad \tilde{I}_{m_R}(c) = \tilde{I}_m(c)\, \big(1 - \tilde{f}_m(c)\big)$$

where $m_L$ and $m_R$ are the left and right children of node m. According to the above recursion, all the membership degrees can be calculated by traversing the tree in preorder. The traversal starts by setting the membership degree of the root to 1. After visiting a node m and determining its membership degree $\tilde{I}_m(c)$, its left child $m_L$ and right child $m_R$ are visited. If a node is a left child, its membership degree is the membership degree of its parent multiplied by the parent's soft decision $\tilde{f}_m(c)$; if it is a right child, the parent's membership degree is multiplied by $1 - \tilde{f}_m(c)$ instead.

In the training phase, the soft decisions $\tilde{f}_m(c)$ are selected from a set of predefined contextual functions. These functions must satisfy the following constraint for all contextual factors:

$$0 \le \tilde{f}_m(c) \le 1$$

The above constraint has to be taken into account when defining soft questions. That is, we are not allowed to employ soft questions with a value greater than 1 or less than 0; thus, a normalization step is required for some questions before starting decision tree-based clustering.
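The membership recursion described above can be sketched as a preorder traversal. The class `SoftNode`, the function `leaf_memberships`, and the example questions (a normalized syllable position and a crisp stress question, which is the special case of a soft question taking only the values 0 and 1) are assumptions made for illustration.

```python
class SoftNode:
    """Node of a soft binary decision tree (illustrative structure)."""

    def __init__(self, soft_question=None, left=None, right=None, leaf_id=None):
        self.soft_question = soft_question  # callable: context -> value in [0, 1]
        self.left = left
        self.right = right
        self.leaf_id = leaf_id              # set only on terminal nodes


def leaf_memberships(node, context, degree=1.0, out=None):
    """Preorder traversal returning {leaf_id: membership degree}.

    The root starts with degree 1; a left child receives degree * f and a
    right child receives degree * (1 - f), where f is the parent's soft answer.
    """
    if out is None:
        out = {}
    if node.leaf_id is not None:
        out[node.leaf_id] = degree
        return out
    f = node.soft_question(context)
    if not 0.0 <= f <= 1.0:
        raise ValueError("soft question must return a value in [0, 1]")
    leaf_memberships(node.left, context, degree * f, out)
    leaf_memberships(node.right, context, degree * (1.0 - f), out)
    return out


tree = SoftNode(
    # Non-binary factor normalized to [0, 1] before use as a soft question.
    soft_question=lambda c: c["syl_pos"] / c["n_syl"],
    left=SoftNode(leaf_id=0),
    right=SoftNode(
        soft_question=lambda c: 1.0 if c["stressed"] else 0.0,
        left=SoftNode(leaf_id=1),
        right=SoftNode(leaf_id=2),
    ),
)
memberships = leaf_memberships(tree, {"syl_pos": 1, "n_syl": 4, "stressed": True})
```

One consequence of the recursion is that the leaf memberships of any context sum to 1, since each internal node divides its own degree between its two children.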

Soft context-clustered HMM distribution
The proposed soft context-clustered HMM exploits the same structure and graphical model as the original hard decision tree-based HMM, and thus, the model likelihood expression given by Equation 1 is also valid for the proposed model. The only difference between the conventional and the proposed approaches lies in the method of capturing context dependencies inherent in the F0 trajectory. More specifically, the method of representing the output distribution $b_j(\cdot)$ in Equation 1 is different. The goal of this section is to find this probability distribution for the soft decision tree structure described in the previous section. With a view to providing an efficient context generalization, this section derives the smoothest distribution that is able to accurately express the behavior of the F0 trajectory. To estimate the smoothest distribution, the maximum entropy model (MEM) [67,68], presented in the next subsection, is employed.

Maximum entropy-based distributions
Our task is to estimate the distribution of the observation vectors. The maximum entropy principle states that an efficient estimate is the one that maximizes entropy (uncertainty) subject to our knowledge about the observation vectors. This knowledge normally appears in the form of some constraints that make the distribution consistent with sufficient statistics of the observation vectors [67]. Let us now derive a simple maximum entropy model for the output distribution given voicing labels, $b_j(o_t \mid g_t, c_j)$. Suppose the training utterances consist of T i.i.d. voicing labels $\{g_t\}_{t=1}^{T}$ and D-dimensional output feature vectors $\{o_t\}_{t=1}^{T}$ that may be influenced by some contextual information $\{c_t\}_{t=1}^{T}$. Also, the contextual information is clustered through a soft decision tree structure with a total of L leaves partitioning the contextual factor space through the membership functions $\{\tilde{I}_l(c)\}_{l=1}^{L}$. The maximum entropy principle first imposes a set of constraints on the distribution and then chooses a distribution as close as possible to a uniform distribution by optimizing the entropy criterion [67]. Indeed, this modeling scheme finds the least biased distribution among all distributions that satisfy our constraints. In other words,

$$\hat{b} = \arg\max_{b} H(b) \quad \text{subject to the constraints below,}$$

where H is the entropy measure. The constraints play a crucially important role in maximum entropy modeling. They ensure that the model captures the statistical characteristics of the training samples. In this paper, the following constraints are taken into account:

$$\int b(o \mid g, c)\, do = 1, \qquad E\{\tilde{I}_l(c)\, g\, o\} = \bar{E}\{\tilde{I}_l(c)\, g\, o\} \;\; (l = 1, \dots, L), \qquad E\{g\, o\, o^{T}\} = \bar{E}\{g\, o\, o^{T}\} \qquad (9)$$

The first constraint ensures that the distributions sum to 1. Also, E and $\bar{E}$ indicate the real (model-based) and empirical mathematical expectations, respectively. These constraints make the estimated distribution capture the partial first-order moments $E\{\tilde{I}_l(c)\, g\, o\}$ and the global second-order moment $E\{g\, o\, o^{T}\}$ of the training data in voiced frames (i.e., in frames where the observation features $o_t$ are defined and the voicing label $g_t$ is 1); therefore, the training phase of the maximum entropy model estimates the smoothest distribution that preserves the first- and second-order moments, expressed in Equation 9, of the training database. Moreover, the selected constraints lead to a simple expression for the output probability distributions that can be estimated efficiently.
In order to solve optimization problems with equality constraints, the Lagrange multipliers method can be applied. This method defines a new optimization function $J(b)$ that combines the entropy measure with the constraints; $\lambda_{b0}$, $\lambda_{b1}$, and $\Lambda$ are the Lagrange multipliers incorporated in the optimization function to remove the equality constraints.
Taking the derivative of the optimization function $J(b)$ with respect to the output probability distribution $b(o \mid g, c)$ and setting it to zero leads to a stationarity condition whose solution is the exponential of a quadratic function of o. Therefore, $b(o \mid g_t, c_t)$ is a simple Gaussian distribution that can be expressed, for voiced frames, by

$$b(o \mid g_t = 1, c_t) = \mathcal{N}\Big(o;\; \sum_{l=1}^{L} \tilde{I}_l(c_t)\, \mu_l,\; \Sigma\Big) \qquad (14)$$

where $\mathcal{N}$ indicates the Gaussian distribution, $\mu_l$ is a D-dimensional vector of mean parameters defined for the lth leaf, and $\Sigma$ is a D-by-D covariance matrix that is used for all leaves.
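The step from the constraints to a Gaussian can be sketched as follows. This is a reconstruction under assumptions rather than the exact derivation: the entropy is taken to be the frame-averaged conditional entropy, the expectations are taken in the usual conditional maximum entropy form, and the multipliers $\eta_0$, $\eta_l$, and $\Lambda$ are notation introduced here (up to a rescaling of the multipliers by 1/T).

```latex
% Assumed definitions (frame-averaged, conditional maximum entropy form)
H(b) = -\frac{1}{T}\sum_{t=1}^{T}\int b(o \mid g_t, c_t)\,\log b(o \mid g_t, c_t)\,do,
\qquad
E\{f\} = \frac{1}{T}\sum_{t=1}^{T}\int b(o \mid g_t, c_t)\, f(o, g_t, c_t)\,do,
\qquad
\bar{E}\{f\} = \frac{1}{T}\sum_{t=1}^{T} f(o_t, g_t, c_t).

% Stationarity of the Lagrangian for a voiced frame (g = 1)
-\log b(o \mid 1, c) - 1 + \eta_0
  + \sum_{l=1}^{L} \tilde{I}_l(c)\,\eta_l^{T} o + o^{T}\Lambda\, o = 0
\;\;\Longrightarrow\;\;
b(o \mid 1, c) \propto
  \exp\Big(\sum_{l=1}^{L} \tilde{I}_l(c)\,\eta_l^{T} o + o^{T}\Lambda\, o\Big).

% Completing the square, with \Lambda negative definite
\Sigma = -\tfrac{1}{2}\Lambda^{-1}, \qquad \mu_l = \Sigma\,\eta_l
\;\;\Longrightarrow\;\;
b(o \mid 1, c) = \mathcal{N}\Big(o;\ \sum_{l=1}^{L}\tilde{I}_l(c)\,\mu_l,\ \Sigma\Big).
```

The shared covariance follows from using a single global second-order constraint, while the context-dependent mean follows from the L membership-weighted first-order constraints.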
In sum, each leaf of the soft decision tree carries a set of model parameters represented by $\mu_l$ that contributes to expressing the output probability distribution $b(o \mid g, c)$. The output probability $b(o \mid g, c)$ is simply approximated by a Gaussian distribution. This Gaussian distribution uses a unique context-independent covariance matrix $\Sigma$ and a context-dependent mean vector. The mean component is obtained by linearly combining the $\mu_l$ parameters (i.e., $\sum_{l=1}^{L} \tilde{I}_l(c)\, \mu_l$), and the weights of the linear combination are determined by the membership functions $\tilde{I}_l(c)$. In fact, the proposed maximum entropy-based output probability distribution is remarkably similar to the distribution expressed by the contextual additive structure that ties all covariance matrices [46][47][48]. In the contextual additive method, similar to the proposed method, the output distribution has the form of Equation 14, but the contextual additive method exploits multiple hard decision trees [46] or a hard decision tree with overlapped leaves [47] instead of the proposed soft decision tree. In other words, in the contextual additive structure, $\tilde{I}_l(c)$ indicates a leaf indicator function that may be 1 for multiple overlapped leaves, but in the proposed model, $\tilde{I}_l(c)$ is a real number, ranging from 0 to 1, that represents the membership degree of a soft decision tree terminal node.
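As a small illustration of this output distribution, the sketch below evaluates the log density of a voiced frame. It assumes a one-dimensional observation, so the shared covariance matrix reduces to a scalar variance; `soft_mean` and `log_output_prob` are hypothetical helper names.

```python
import math


def soft_mean(memberships, leaf_means):
    """Context-dependent mean: membership-weighted sum of the leaf means."""
    return sum(w * mu for w, mu in zip(memberships, leaf_means))


def log_output_prob(f0, memberships, leaf_means, shared_var):
    """Log density of a voiced frame under the soft-tree Gaussian (1-D)."""
    mean = soft_mean(memberships, leaf_means)
    return -0.5 * (math.log(2.0 * math.pi * shared_var)
                   + (f0 - mean) ** 2 / shared_var)


memberships = [0.25, 0.75, 0.0]   # e.g., produced by a soft tree with 3 leaves
leaf_means = [4.0, 5.0, 6.0]      # made-up leaf parameters
mean = soft_mean(memberships, leaf_means)
logp = log_output_prob(4.75, memberships, leaf_means, 1.0)
```

With a hard tree, `memberships` would contain a single 1 and the mean would collapse to one leaf mean; here it can take any value interpolated between the leaf means.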

Parameter re-estimation
Having described the soft context-clustered HMM structure, we now discuss its parameter re-estimation procedure. In the training phase, we are given a set of N i.i.d. training utterances containing acoustic features $\{o^n\}_{n=1}^{N}$, voicing labels $\{g^n\}_{n=1}^{N}$, and contextual factors $\{c^n\}_{n=1}^{N}$. The goal is to find the optimum set of model parameters $\lambda$ which maximizes the log-likelihood measure:

$$\hat{\lambda} = \arg\max_{\lambda} \sum_{n=1}^{N} \log P(o^n, g^n \mid c^n, \lambda)$$

This section assumes that the soft decision tree structure has been trained earlier and we just try to find the maximum log-likelihood estimate of its parameters $\lambda$, including $\{\mu_l\}_{l=1}^{L}$ and $\Sigma$. Training the optimum soft decision tree structure will be described in the next section. Similar to the classical HMM, the likelihood expression of Equation 1 leads to an extremely complex optimization problem with no apparent direct solution. The main problem is that the distribution depends on the state boundaries, which are latent. The EM technique offers an iterative algorithm which is able to overcome this problem. According to the EM technique, $\lambda$ is obtained by iteratively maximizing an auxiliary function $Q(\lambda; \lambda^{r})$ that is defined in terms of the occupation probabilities $\chi_j$ and $\gamma_j$ introduced in Section 2.2. Here, r is the index of the EM iterations, and n ranges over the utterances. In order to estimate the optimum set of parameters, the partial derivatives of Q with respect to all model parameters $\lambda$ have to be set to zero. These partial derivatives are calculated by considering the distribution introduced in Section 2.2. By setting them to zero, the maximum likelihood estimate of the model parameters is obtained. Accordingly, the optimum mean parameters $\{\hat{\mu}_l\}_{l=1}^{L}$ can be calculated by solving the following system of linear equations:

$$R\, \hat{\mu} = P$$

where $\hat{\mu} = [\hat{\mu}_1, \dots, \hat{\mu}_L]^{T}$ is an L-by-D matrix containing all mean parameters. Also, R and P are L-by-L and L-by-D matrices defined by

$$R = \sum_{n} \sum_{j} \sum_{t} \gamma_j^n(t)\, g_t^n\, \tilde{I}(c_j^n)\, \tilde{I}(c_j^n)^{T}, \qquad P = \sum_{n} \sum_{j} \sum_{t} \gamma_j^n(t)\, g_t^n\, \tilde{I}(c_j^n)\, (o_t^n)^{T} \qquad (20)$$

with $\tilde{I}(c) = [\tilde{I}_1(c), \dots, \tilde{I}_L(c)]^{T}$ denoting the vector of leaf membership degrees. As can be seen from Equation 20, R represents the cross-correlation matrix of the membership functions. This matrix is symmetric and positive definite; therefore, it is possible to solve the above system of equations efficiently using Cholesky decomposition.
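The linear system for the leaf means can be sketched in plain Python as below. The sketch assumes that R accumulates occupancy-weighted outer products of the membership vectors over voiced frames and that P accumulates the corresponding membership-weighted observations; the posteriors are taken as given. The small `ridge` term is an addition made here to keep the factorization stable when some leaves receive almost no membership mass, and all function names are hypothetical.

```python
import math


def accumulate(frames, n_leaves, dim):
    """Accumulate R (L-by-L) and P (L-by-D) over voiced frames.

    frames: iterable of (gamma, voiced, memberships, obs).
    """
    R = [[0.0] * n_leaves for _ in range(n_leaves)]
    P = [[0.0] * dim for _ in range(n_leaves)]
    for gamma, voiced, mem, obs in frames:
        if not voiced:
            continue  # unvoiced frames carry no F0 observation
        for l in range(n_leaves):
            for k in range(n_leaves):
                R[l][k] += gamma * mem[l] * mem[k]
            for d in range(dim):
                P[l][d] += gamma * mem[l] * obs[d]
    return R, P


def cholesky(A):
    """Lower-triangular C with A = C C^T (A symmetric positive definite)."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            C[i][j] = math.sqrt(s) if i == j else s / C[j][j]
    return C


def solve_means(R, P, ridge=1e-9):
    """Solve (R + ridge * I) mu = P column by column via Cholesky."""
    n, dim = len(R), len(P[0])
    A = [[R[i][j] + (ridge if i == j else 0.0) for j in range(n)] for i in range(n)]
    C = cholesky(A)
    mu = [[0.0] * dim for _ in range(n)]
    for d in range(dim):
        y = [0.0] * n
        for i in range(n):  # forward substitution: C y = P[:, d]
            y[i] = (P[i][d] - sum(C[i][k] * y[k] for k in range(i))) / C[i][i]
        for i in reversed(range(n)):  # back substitution: C^T x = y
            rest = sum(C[k][i] * mu[k][d] for k in range(i + 1, n))
            mu[i][d] = (y[i] - rest) / C[i][i]
    return mu


# Noise-free toy data generated from leaf means 4.0 and 6.0.
frames = [
    (1.0, 1, [1.0, 0.0], [4.0]),
    (1.0, 1, [0.0, 1.0], [6.0]),
    (1.0, 1, [0.5, 0.5], [5.0]),
    (1.0, 1, [0.25, 0.75], [5.5]),
    (1.0, 0, [1.0, 0.0], [0.0]),  # unvoiced frame, skipped
]
R, P = accumulate(frames, n_leaves=2, dim=1)
mu = solve_means(R, P)
```

Unlike the hard-tree case, the means cannot be updated leaf by leaf: the off-diagonal entries of R couple every pair of leaves that share membership mass on some frame.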
Furthermore, by setting to zero the partial derivatives of the auxiliary function $Q(\lambda; \lambda^{r})$ with respect to the globally tied covariance matrix Σ, the maximum likelihood estimate of Σ is calculated as follows:
The above equations provide a straightforward procedure for training the parameters of the output probability distribution factorized by a soft decision tree.
The next section discusses the procedure for constructing the proposed soft decision tree. To run a soft decision tree clustering algorithm, the log-likelihood measure must be calculated for the optimum model parameters. This optimum log-likelihood measure is expressed by
where |·| denotes the matrix determinant operator.

Soft context clustering algorithm
To automatically capture the dependencies between acoustic features and contextual factors, this section proposes a soft decision tree construction algorithm. Similar to the classical hard decision tree building algorithm, the soft decision tree is built iteratively through a greedy, top-down procedure which maximizes the log-likelihood measure.
The major advantage of the classical hard decision tree construction algorithm is that its terminal nodes can be split independently. In a hard decision tree, terminal nodes represent non-overlapping regions of the contextual factor space; therefore, after splitting a leaf, all values obtained for the other leaves remain valid and need not be recalculated. This property makes the algorithm computationally tractable. In the soft decision tree construction procedure, however, different terminal nodes may cover overlapping regions of the contextual space, and splitting a leaf using a soft question affects the parameters of all other leaves. Consequently, in contrast to the conventional hard decision tree, all values obtained for all terminal nodes have to be updated after each split, which requires a very large amount of computation.
The procedure of the proposed soft decision tree construction algorithm is as follows:
Step 1. Create the root node embracing all samples of the training database.
Step 2. Split all terminal nodes using all questions and compute their optimum log-likelihood value. To compute the optimum log-likelihood value for each possible pair of leaf and question, the maximum likelihood estimate of the mean parameters μ first has to be obtained by Equation 18. Then, Σ is calculated through Equation 21, and Equation 22 is finally employed to find the optimum log-likelihood value.
Step 3. Select the pair of terminal node and question that provides the maximum increase in the log-likelihood measure. Thereafter, split the node using the question and compute the maximum likelihood estimate of all model parameters.
Step 4. Stop the splitting procedure, if a predefined condition is satisfied (e.g., the increase in log-likelihood falls below a certain threshold).
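Steps 1 to 4 can be sketched as the greedy loop below. This is only an illustrative skeleton: the callback `log_likelihood` is a hypothetical stand-in for the full re-estimation of Equations 18, 21, and 22, the representation of a tree as a list of (leaf, question) splits is an assumption of this example, and the stopping rule is checked before a split is committed.

```python
def build_soft_tree(questions, log_likelihood, threshold, max_leaves):
    """Greedy top-down construction of a tree structure.

    A tree is represented as a list of splits (leaf_index, question).
    `log_likelihood(splits)` is a hypothetical callback that re-estimates
    all parameters for that structure and returns the optimum log-likelihood.
    """
    splits = []                          # Step 1: root node only (one leaf)
    best_ll = log_likelihood(splits)
    while len(splits) + 1 < max_leaves:
        n_leaves = len(splits) + 1
        # Step 2: evaluate every (leaf, question) pair on the whole tree,
        # since a soft split changes the parameters of all leaves.
        candidates = [
            (log_likelihood(splits + [(leaf, q)]), leaf, q)
            for leaf in range(n_leaves)
            for q in questions
        ]
        # Step 3: pick the pair with the largest log-likelihood.
        ll, leaf, q = max(candidates, key=lambda c: c[0])
        # Step 4: stop when the gain falls below the threshold.
        if ll - best_ll < threshold:
            break
        splits.append((leaf, q))
        best_ll = ll
    return splits, best_ll

# Toy scoring function (purely illustrative): gains shrink as the tree
# grows, and question "b" is always slightly better than question "a".
def toy_ll(splits):
    return -10.0 / (1 + len(splits)) + 0.1 * sum(q == "b" for _, q in splits)

splits, final_ll = build_soft_tree(["a", "b"], toy_ll, threshold=1.0, max_leaves=10)
```

With the toy scoring function, the loop accepts two splits and then stops because the third candidate gain is below the threshold.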
Algorithm 1 summarizes the overall procedure of the proposed soft context clustering. As can be seen from the algorithm, the proposed soft clustering procedure is very similar to the classical clustering algorithm. The main difference lies in the number of evaluations that have to be performed during each iteration of the clustering procedure. In hard clustering, only the two newly generated leaves need to be evaluated, whereas in soft clustering, all leaf nodes have to be evaluated. This raises the computational cost of soft clustering from linear to quadratic in the number of leaves. Assume that we intend to build a decision tree with L leaves and that Q questions have been defined. In this case, hard clustering requires (2L − 3)Q likelihood calculations, while soft clustering requires [L(L − 1)/2]Q likelihood calculations.
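The two counts above can be checked with a short calculation. The sketch below simply tallies how many leaves are evaluated before each split; the function names are hypothetical, and L ≥ 2 is assumed.

```python
def hard_evaluations(n_leaves, n_questions):
    """Hard clustering: the root once, then the two new leaves after each
    of the first L - 2 splits, giving 2L - 3 evaluated leaves in total."""
    evaluated_leaves = 1 + 2 * (n_leaves - 2)
    return evaluated_leaves * n_questions

def soft_evaluations(n_leaves, n_questions):
    """Soft clustering: all current leaves (1, 2, ..., L - 1) are evaluated
    before each of the L - 1 splits, giving L(L - 1)/2 evaluated leaves."""
    evaluated_leaves = sum(range(1, n_leaves))
    return evaluated_leaves * n_questions

# Example with illustrative sizes: L = 100 leaves, Q = 1000 questions.
hard = hard_evaluations(100, 1000)   # 197 * 1000
soft = soft_evaluations(100, 1000)   # 4950 * 1000
```

For these illustrative sizes, soft clustering needs roughly 25 times as many likelihood calculations as hard clustering, and the ratio grows linearly with L.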
It should be noted that each likelihood calculation in soft decision tree-based clustering is also more complicated than in hard clustering, mainly because calculating the inverse of the matrix R to solve the system of equations expressed by Equation 18 is computationally expensive. Takaki et al. [46] proposed a method to reduce the computational complexity of this inversion. Their method exploits the matrix inversion lemma and can also be incorporated in the soft decision tree clustering procedure.

Simple sinusoidal regression
To clarify the advantages of soft clustering, this section solves a simple sinusoidal regression problem using both soft and hard decision tree structures. Assume we have just one continuous contextual factor c ranging from 0 to 1, and our goal is to approximate the following sinusoidal function:
where o(c) represents the observation value for a given context c, and r(c) is normally distributed random noise with zero mean and unit variance. The 200 training samples shown in Figure 3a are drawn independently from Equation 23. Nineteen different contextual questions are defined to train the hard decision tree as follows:
Therefore, each internal node of the hard decision tree structure has to select one of these hard questions denoted by
Additionally, the soft decision tree is trained by exploiting four distinct soft questions defined for $i \in \{1, 2, 3, 4\}$ by
Figure 3b,c shows the hard and soft decision tree structures trained with the maximum likelihood decision tree construction algorithms. As can be seen from the figures, the hard decision tree requires eight leaves to reach an acceptable mean square error of 0.015, whereas the soft decision tree accurately estimates the objective function with a small mean square error of 0.0002 using just six terminal nodes.
Figure 3d,e shows the functions approximated by the hard and soft decision trees, respectively. This simple experiment shows that hard decision tree structures do not exploit continuous attributes (contextual factors) efficiently, and that incorporating soft decisions in their internal nodes significantly improves their predictive capabilities.

Defining soft questions
As mentioned earlier, in order to construct the soft decision tree structure, a set of basic contextual factors first has to be extracted for all training and test datasets. Section 4.1.1 gives the details of the basic contextual factors employed in our experiments. These basic contextual factors are denoted by c in this paper and can be grouped into two types: categorical and numerical factors. 'Phoneme identity' is an example of a categorical factor, and 'Position of the current phoneme' is an example of a numerical factor. A numerical factor returns ordered values, whereas a categorical factor provides unordered symbols. For the categorical factors, we cannot define meaningful soft questions, and therefore, we have no choice but to exploit the conventional hard questions. However, for the numerical factors, it is possible to define a large number of soft questions. This subsection introduces the procedure for defining these soft questions in our experiments.
In this study, we first normalize all numerical contextual factors to range between 0 and 1, and then soft questions are obtained by applying a fixed set of candidate functions to the normalized contextual factors. Assume c represents a normalized numerical contextual factor and $f_k(c)$ is the kth soft question extracted for c. In this study, 25 soft questions have been defined for each numerical contextual factor. These soft questions are shown in Figure 4, and their mathematical expressions are given by Equation 26:
In conclusion, all contextual factors were divided into two groups, namely, categorical and numerical. According to the above procedure, a set of soft questions was extracted for the numerical factors, and a number of hard questions were obtained for the categorical contextual factors. Thereafter, all of the extracted hard and soft questions were pooled and competed against each other during the soft clustering procedure.
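To make the distinction between hard and soft questions concrete, the sketch below computes leaf membership degrees for a tiny two-level tree. Several choices here are assumptions made only for illustration: min-max scaling is used for the normalization, a sigmoid stands in for a soft question (it is not claimed to be one of the 25 functions of Equation 26), the two children of a node are given complementary memberships f and 1 − f, and all names are hypothetical.

```python
import math

def normalize(value, lo, hi):
    """Min-max scaling of a numerical factor to [0, 1] (assumed scheme)."""
    return (value - lo) / (hi - lo)

def sigmoid_question(center, slope):
    """Hypothetical soft question: membership degree in (0, 1)."""
    return lambda c: 1.0 / (1.0 + math.exp(-slope * (c - center)))

def hard_question(symbols):
    """Conventional hard question on a categorical factor: membership 0 or 1."""
    return lambda s: 1.0 if s in symbols else 0.0

# Toy tree: the root asks a hard question on phoneme identity; its "yes"
# child asks a soft question on the normalized phoneme position.
is_vowel = hard_question({"a", "e", "i", "o", "u"})
late_position = sigmoid_question(center=0.5, slope=10.0)

def leaf_memberships(phoneme, position, n_positions):
    """Membership of one context in each leaf (product along the path)."""
    c = normalize(position, 0, n_positions - 1)
    root_yes = is_vowel(phoneme)
    soft_yes = late_position(c)
    return {
        "vowel_late": root_yes * soft_yes,
        "vowel_early": root_yes * (1.0 - soft_yes),
        "consonant": 1.0 - root_yes,
    }

m = leaf_memberships("a", position=3, n_positions=5)
```

In this toy example the context belongs to both vowel leaves at once, with degrees that sum to one, whereas the hard question on phoneme identity still assigns it entirely to one side of the root.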

Experiments
This section aims to compare the performance of fundamental frequency modeling approaches based on the conventional hard decision tree and the proposed soft clustering method.

Experimental conditions
Before presenting the experimental results, this section describes the experimental conditions, including database characteristics and employed contextual factors, in detail.
An English speech database called Nick [69], consisting of approximately 2,500 utterances from a British male speaker, was used in our experiments. This database was collected at the University of Edinburgh for the purpose of speech synthesis research. Sentences range in length from 3 to 36 words, with an average length of 7.3 words per sentence. The sentences cover the most frequent English words, bi-phoneme combinations, and syllables. In total, 2,944 different words are covered.
Speech waveforms were sampled at 48 kHz and windowed by a 25-ms Blackman window with a 5-ms shift. The speech analysis and synthesis conditions of CSTR/EMIME HTS 2010 [69] were used in this experiment. In this platform, the Bark-cepstrum was extracted from smooth STRAIGHT trajectories [6], since it outperforms the predominant Mel-cepstrum coefficients. Also, the widely used log-F0 and five aperiodicity sub-bands (0 to 1, 1 to 2, 2 to 4, 4 to 6, and 6 to 8 kHz) were replaced with pitch in Mel and auditory-scale motivated frequency bands for the aperiodicity measure [69]. The analysis process generated 40 Bark-cepstrum coefficients, 1 pitch value in Mel, and 25 auditory-scale motivated aperiodicity frequency sub-bands for each frame of the training signals. These parameters, along with their delta and delta-delta derivatives, formed the five streams of our observation vectors.
For the baseline system, a five-state, multi-stream, left-to-right MSD-HSMM without skip paths was trained. A conventional maximum likelihood-based decision tree construction algorithm was used to tie HMM states. In the conventional HMM-based speech synthesis framework, a single tying structure (decision tree) is normally shared by the voicing probabilities and the F0 output probabilities. In contrast, the proposed method uses a soft decision tree structure for the output probability distribution and a hard decision tree for the voicing probabilities; therefore, the same tying structure cannot be applied to both voicing and output probabilities in the proposed system. For a fair comparison, the baseline system was implemented with two different decision trees for F0 trajectories, one for the voicing labels and the other for the output probability distributions.
The same structure, with just one different part, was also implemented for the proposed synthesis system. In the proposed system, a soft decision tree is trained for the output probability distributions of F0 and its derivatives instead of the hard decision tree. All other decision trees, including those trained for state duration, Bark-cepstrum, aperiodicity, and voicing probability, are identical to the ones trained for the baseline system. Therefore, all parameters generated from them are equal to the parameters generated for the baseline system. It should be noted that both the baseline and the proposed synthesis systems employ the MDL criterion [66] to determine the size of all decision trees.
We considered four training sets of 100, 200, 400, and 800 utterances, and 400 sentences that were not included in the training sets were used as test data.

Employed contextual factors
Specific information about the contextual factors is presented in this subsection. The employed contextual factors can be categorized into five levels: phonetic, syllable, word, phrase, and sentence. At each of these levels, all important features were considered.

Experimental results
Both objective and subjective tests are conducted to evaluate the proposed F0 modeling method. The results of these tests are given in the following subsections.

Objective evaluation
Figure 5 shows the learning curves obtained while building the hard and soft decision trees for 800 training utterances and 400 test sentences. The normalized log-likelihood measure depicted in this figure was computed through the following expression:
where F0 and its derivatives are represented by $o_{tl}$, and their voicing labels are denoted by $g_{tl}$. Here, t is the frame index, and l indexes the static and dynamic features, ranging from 1 to 3. In this figure, the above measure is depicted for both test and training data. Red and blue curves correspond to the proposed soft and the conventional hard decision tree structures, respectively. Solid curves are the normalized log-likelihood of the training sets, and dashed curves represent the normalized log-likelihood computed for the test data. The optimum number of terminal leaves calculated by the MDL principle is indicated by vertical dotted lines. As can be seen from Figure 5, all red curves lie above their corresponding blue curves; therefore, the soft decision tree provides a superior log-likelihood with a smaller number of model parameters. All learning curves confirm that the soft decision tree structure generalizes better than the canonical hard decision tree structure.
Another well-known objective measure, reported in this section, is the root-mean-square error (RMSE) between synthesized and natural log-F0 trajectories. To compute this measure, all test utterances were first synthesized with natural voicing labels and natural durations (durations obtained by applying the Viterbi algorithm to natural acoustic trajectories). Thereafter, the RMSE measure is computed through the following expression:
where $g_t$, $f_t^{T}$, and $f_t^{P}$ are the voicing label, target log-F0 value, and predicted log-F0 value of the tth frame. This measure is computed for the four training datasets of 100, 200, 400, and 800 utterances. Figure 6 shows the calculated RMSE values in cents. As can be seen from this figure, the log-F0 trajectories generated by the proposed approach are more similar to the natural log-F0 trajectories, and therefore, the proposed soft decision tree structure improves the performance of log-F0 modeling. However, as the size of the database increases, this improvement is slightly reduced. Hence, the effect of applying soft clustering is relatively larger for small databases than for large ones.
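A minimal sketch of an RMSE computation in cents over voiced frames is given below. It assumes that log-F0 is a natural logarithm of F0 in Hz, that the voicing label is 1 for voiced frames and 0 otherwise, and that the error is averaged over voiced frames only; the function and variable names are hypothetical, and the paper's exact normalization may differ.

```python
import math

# One octave (a factor of 2 in F0) equals 1200 cents, so a difference of
# natural-log F0 values is converted to cents by 1200 / ln(2).
CENTS_PER_LOG_UNIT = 1200.0 / math.log(2.0)

def rmse_cents(voiced, target_lf0, predicted_lf0):
    """RMSE between target and predicted log-F0, in cents, over voiced frames.

    voiced:        sequence of 0/1 voicing labels (assumed convention)
    target_lf0:    natural-log F0 of the natural speech, per frame
    predicted_lf0: natural-log F0 of the synthesized speech, per frame
    """
    squared = [
        (t - p) ** 2
        for g, t, p in zip(voiced, target_lf0, predicted_lf0)
        if g
    ]
    if not squared:
        raise ValueError("no voiced frames")
    return CENTS_PER_LOG_UNIT * math.sqrt(sum(squared) / len(squared))

# Toy check: a constant one-octave error on the voiced frames.
voiced = [1, 0, 1]
target = [math.log(100.0), 0.0, math.log(150.0)]
predicted = [math.log(200.0), 0.0, math.log(300.0)]
error = rmse_cents(voiced, target, predicted)
```

In the toy check, every voiced frame is predicted exactly one octave too high, so the resulting error is 1200 cents; the unvoiced frame is ignored.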

Subjective evaluation
Two subjective tests were selected to assess the effectiveness of the proposed system in comparison with the conventional synthesis system. The comparative mean opinion score (CMOS) test [7] with a 7-point scale, ranging from −3 to 3, and the paired comparison test [70] were used to evaluate the subjective similarity of the synthesized and the natural utterances. Eighteen evaluators participated in our subjective evaluations, and each of them was asked to listen to 20 randomly chosen pairs of synthesized waveforms generated by the two synthesizers (i.e., the soft decision tree-based system and the conventional system).
In paired comparison tests, listeners are presented with a number of pairs of waveforms and are asked to identify which one is more similar to its corresponding natural speech signal. If the two utterances sound equal, listeners are allowed to choose the equality option. The paired comparison test simply reports the percentage of comparisons in which a given synthesizer outperforms the other.
In CMOS tests, listeners not only select the better utterance but also rate the degree of difference between the two utterances. Four levels are normally defined for this purpose (namely, 0, 1, 2, and 3, meaning about the same, slightly different, different, and much different, respectively). These difference levels are used to compute the CMOS scores, which are calculated in each comparison for each synthesizer separately. More precisely, a positive score equal to the difference level is assigned to the winner of the comparison, and a negative score with the same absolute value is assigned to the loser. Finally, the CMOS score is obtained by averaging over all scores.
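The scoring rules described above can be sketched as follows. The encoding of a judgement as a (winner, level) pair, the system labels "A" and "B", and the function names are assumptions made for this illustration; the judgement data are invented and are not results from the paper.

```python
def cmos_scores(judgements):
    """Average comparative score per system.

    judgements: list of (winner, level) pairs, where winner is "A", "B",
    or None for "about the same", and level is the difference level 0..3.
    """
    totals = {"A": 0, "B": 0}
    for winner, level in judgements:
        if winner is None:
            continue  # a tie contributes 0 to both systems
        loser = "B" if winner == "A" else "A"
        totals[winner] += level   # positive score for the winner
        totals[loser] -= level    # equal negative score for the loser
    n = len(judgements)
    return {system: total / n for system, total in totals.items()}

def preference_percentages(judgements):
    """Paired comparison result: percentage of comparisons won by each system."""
    counts = {"A": 0, "B": 0, "equal": 0}
    for winner, _ in judgements:
        counts[winner if winner is not None else "equal"] += 1
    n = len(judgements)
    return {key: 100.0 * count / n for key, count in counts.items()}

# Invented toy judgements for illustration only.
judgements = [("A", 2), ("A", 1), (None, 0), ("B", 1)]
cmos = cmos_scores(judgements)
prefs = preference_percentages(judgements)
```

Because each comparison adds a score to one system and subtracts the same amount from the other, the two CMOS scores are always equal in magnitude and opposite in sign.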
The results of the CMOS and paired comparison evaluations are shown in Figures 7 and 8, respectively. The proposed soft context-clustered HMM outperforms the conventional hard decision tree structure for all training set sizes. This result is fully in line with the conclusion of the objective assessments. For small datasets (i.e., 100 and 200 training utterances), more than 58% of the comparisons are in favor of the proposed method, and the average CMOS score of the proposed system is more than one unit higher than that of the baseline system. These results show that the proposed soft decision tree structure significantly improves the F0 estimation accuracy of the baseline system for small training datasets, and therefore, an important application of the proposed system is in low-resource languages, where only a limited amount of data is available for training.
Another notable conclusion that can be drawn from the results presented in this section is that, as the number of training utterances increases, the improvement achieved by soft clustering is slightly reduced; thus, the proposed structure is most beneficial for limited training datasets.

Conclusions
This paper addressed one of the most important shortcomings of hard decision tree-based context-dependent F0 modeling, namely, poor context generalization. In the hard decision tree structure, each acoustic feature vector contributes to modeling only one contextual cluster, which is the main reason for poor generalization. To alleviate this problem, the capability of exploiting soft questions was added to the conventional decision tree architecture. The resulting structure, called a soft decision tree, splits the contextual factor space into several soft clusters; therefore, each context is assigned to several leaves, which provides superior generalization. In this paper, a maximum entropy model was used to derive the distribution expressed by the soft decision tree architecture. Relying on this maximum entropy-based distribution, a complete speech synthesis system was designed and implemented. Experimental results using both objective and subjective criteria showed that the proposed system outperforms the conventional hard decision tree-based system.

Endnote a
The unfortunate need for three separate streams only arises when using MSD output distributions to model F0: it is possible (at the onset or offset of voicing) for the dimensionality of the delta stream to be 0 in the same frame that the dimensionality of F0 is 1. That is, F0 exists, but its delta is undefined.

Figure 1
Figure 1 Overall block diagram of a typical SPSS.

Figure 2
Figure 2 Graphical model of the common HMM. Rectangular nodes and solid arrows demonstrate random variables and their dependencies. Dotted arrows show the tying structure through decision trees.
$\max_{b} \; H\{b(o \mid g, c)\}$ s.t. constraints:

Figure 3
Figure 3 Samples and decision tree structures. (a) Data samples used to train the decision trees. The red dots and blue line are training samples and objective function, respectively. (b) Trained hard decision tree structure. (c) Trained soft decision tree structure. (d) Predicted samples using the hard decision tree. (e) Predicted samples using the soft decision tree.

Figure 4
Figure 4 Soft questions defined for each numerical contextual factor in our experiments.

Figure 5
Figure 5 Normalized log-likelihood with respect to the number of leaves computed for each state of HMM. Blue and red curves are the learning curves of the hard and soft decision trees, respectively. In addition, solid curves illustrate the log-likelihood of the training set, and dashed curves are the log-likelihood computed for test data. MDL-based stop points are also shown through vertical dotted lines.

Figure 7
Figure 7 Subjective evaluation of soft context-clustered HMM through CMOS test. Blue and red bars are the CMOS scores of the conventional and the proposed systems.

Figure 6
Figure 6 RMSE as an objective measure to compare log-F0 trajectories generated by hard and soft decision trees. Blue and red lines are the results of hard and soft decision trees, respectively.

Figure 8
Figure 8 Paired comparison test as a subjective comparison between soft and hard decision tree structures. The scores of hard and soft decision trees are shown with blue and red colors.