A two-stage approach using Gaussian mixture models and higher-order statistics for a classification of normal and pathological voices

Lee, Ji Yeoun

doi:10.1186/1687-6180-2012-252

Research
Open access
Published: 30 November 2012

EURASIP Journal on Advances in Signal Processing volume 2012, Article number: 252 (2012)

Abstract

A two-stage classifier is used to improve the classification performance between normal and pathological voices. A primary classification between normal and pathological voices is made using Gaussian mixture model (GMM) log-likelihood scores. For samples that do not meet the GMM thresholds for normal or disordered voice, the final decision is made by a higher-order statistics (HOS)-based parameter. The normalized skewness and kurtosis, and the means of the normalized skewness and kurtosis, were estimated from a sustained vowel /a/ produced by 53 normal and 173 pathological speakers taken from the Disordered Voice Database. A Mel-frequency cepstral coefficient (MFCC)-based GMM, the HOS methods, and a two-stage classifier based on GMM-HOS were applied to each voice signal. A Mann–Whitney rank sum test was used to detect differences in the means of the HOS-based parameters. A fivefold cross-validation scheme was used to test the classification method. With 16 Gaussian mixtures, the MFCC-based GMM algorithm achieved 92.0% accuracy. With the means of the normalized skewness and kurtosis, accuracies of 82.31 and 83.67% were obtained, respectively. The two-stage classifier with 16 Gaussian mixtures and the mean of the normalized kurtosis classified samples with 96.96% accuracy. The proposed two-stage classifier is more accurate than the MFCC-based GMM and HOS methods alone and shows potential for the classification of voices in the clinic.

Introduction

Speech is integral to day-to-day communication. Speech impediments negatively affect social interaction, which has led to interest in the early detection and treatment of voice disorders. Many researchers have worked towards automatic and objective classification of normal and pathological voices using minimally invasive methods. A large amount of research has focused on the automatic detection of voice pathologies by means of acoustic analysis, parametric and non-parametric feature extraction, pattern recognition algorithms, and statistical methods [1–15].

Sáenz-Lechón et al. [5] presented an overview of previous classification schemes applied to the Massachusetts Eye & Ear Infirmary (MEEI) Voice Disorders Database [16]. They described methodological concerns to be considered when designing automatic systems for pathological voice detection. They recommended the use of a well-known commercial database, a cross-validation strategy based on several partitions to obtain averaged classification performances with confidence intervals, the reporting of results by means of detection error trade-off (DET) curves, and an investigation of the area under the receiver operating characteristic (ROC) curve.

Pattern classification algorithms such as the Gaussian mixture model (GMM), the neural network (NN), and the hidden Markov model have received attention as potential means of discriminating between normal and pathological voices [6–13]. The GMM in particular has been reported as a very successful classification method [10–13]. Characteristic parameters, such as Mel-frequency cepstral coefficients (MFCC), have also become more popular for voice pathology detection [6, 8, 10–12]. Recently, Wang et al. [12] proposed a GMM supervector kernel-support vector machine (GMM-SVM) classifier, which was compared with the GMM classifier as a baseline algorithm. The GMM supervectors were highly effective parameters for discriminating normal from pathological voices. A classification accuracy of 96.1% was achieved by SVM classification of the 16-Gaussian GMM supervectors [12].

As an acoustic analysis method, higher-order statistics (HOS) have shown promising results in a number of signal processing applications, and are of particular value when dealing with a mixture of Gaussian and non-Gaussian processes and with system nonlinearity [14, 15, 17, 18]. The application of HOS to speech processing has primarily been motivated by its inherent Gaussian suppression and phase preservation properties [17, 18]. Research in the disordered voice field has been based on the assumption that speech has non-zero HOS that are distinct from those of Gaussian noise [18, 19]. Alonso et al. [19] proposed seven new HOS-based parameters that were obtained directly or indirectly from the bispectrum of a voice frame. A success rate of 98.3% was obtained by using both the conventional and the HOS-based parameters with an NN classifier, demonstrating the possibility of automatically discriminating pathological from healthy voices using HOS parameters [20]. Further study is merited of how well each HOS parameter can detect pathological voices under the methodological designs recommended by Sáenz-Lechón et al. [5].

In this article, we first propose new HOS-based parameters implemented in the time domain. They are the means of the normalized skewness and kurtosis, which are calculated for each frame and averaged over a sentence. HOS-based parameters estimated in the time domain can easily be applied in a real-time environment, in contrast to the Fourier series representation of the HOS parameters in the frequency domain used by Alonso et al. [19]. Second, we propose a two-stage approach to further improve the accuracy of the classification between normal and pathological voices. The classification system consists of an MFCC-based GMM algorithm, which performs the primary classification using the GMM log-likelihood scores, and an HOS-based parameter as a post-processor.

Material and methods

Material

Vocal signals were taken from the MEEI Voice Disorders Database [16]. Fifty-three normal and 173 pathological speakers with a wide range of organic, neurological, traumatic, and psychogenic voice disorders were selected. The extracted subset is the same as the one described by Wang et al. [12], so that the results of the two studies can be compared. Voice samples were collected in a controlled environment and sampled at 50 or 25 kHz with 16 bits of resolution. Speakers phonated a sustained /a/ (1–3 s). All voice data were down-sampled to 25 kHz and grouped into training (70% of the data) and test (30%) sets to implement all methods. Each set for the fivefold cross-validation scheme was randomly selected from the subset [10, 12].

Statistical analysis

Statistical analysis was conducted using SigmaStat 3.0 (Jandel Scientific, San Rafael, CA, USA). The Mann–Whitney rank sum test was performed to test the differences between normal and pathological voices for the normalized skewness, the normalized kurtosis, and the means of the normalized skewness and kurtosis. A significance level of 0.05 was used for all measures.
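The rank sum test can be illustrated with a minimal Python sketch. This is not the SigmaStat procedure used in the study: the function name `mann_whitney_u` is hypothetical, the p-value uses the large-sample normal approximation, and no tie or continuity correction is applied, so results for small or heavily tied samples will differ from those of a statistics package.

```python
import math

def mann_whitney_u(a, b):
    """Return (U statistic of sample a, two-sided p-value).

    Assumptions of this sketch: normal approximation for the p-value,
    average ranks for ties, no tie or continuity correction.
    """
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n = len(pooled)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    n_a, n_b = len(a), len(b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    mean_u = n_a * n_b / 2.0
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (u_a - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return u_a, p
```

Under this sketch, a parameter would be reported as differing between the two groups when the returned p-value falls below 0.05.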

MFCC-based GMM method

The voice samples were analyzed in 40-ms frames, each overlapping the previous frame by 20 ms, and multiplied by a Hamming window as in a previous study [12]. MFCC parameters were extracted and fed into a GMM-based detector that makes the final decision about the absence or presence of pathology. The number of filter banks was 38, and a 36-dimensional feature vector (18 MFCCs plus their 18 first-order delta coefficients) was calculated for every frame, as in Wang et al. [12]. Cepstral mean subtraction was also applied during extraction to reduce the cepstral bias of the recording channel. The Linde–Buzo–Gray algorithm was used for GMM initialization, and GMMs with 8, 16, and 32 mixtures were trained with the expectation-maximization (EM) algorithm to determine the model parameters: mean vectors, covariance matrices, and mixture weights. For an utterance $X = \{x_{1}, x_{2}, \dots, x_{T}\}$, where T is the number of frames, the log-likelihood ratio (LLR), obtained by applying Bayes' rule and disregarding the constant prior probabilities in the log domain, is given in Equation (1).

Λ (X) = \log [p (X_{N} \mid λ_{C})] - \log [p (X_{P} \mid λ_{\bar{C}})]

(1)

where Λ(X) is the LLR, and λ_C and $λ_{\bar{C}}$ are GMM models for normal and pathological voices, respectively. Also N and P indicate normal and pathological voices, respectively.

Figure 1 shows the histogram of the LLR estimated from normal and pathological voices in the training procedure. The decision threshold, Λ_NP, is then set to adjust the trade-off between rejecting pathological voices (false rejection) and accepting normal voices (false acceptance). In the test procedure, the LLR, Λ(X), is compared with the threshold Λ_NP, and the voice is said to be pathological if Λ(X) < Λ_NP and normal if Λ(X) > Λ_NP.
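The LLR decision can be sketched in a few lines of Python. Several details here are assumptions that the text does not specify: the GMMs are taken to have diagonal covariances, the LLR is averaged over frames, and the case Λ(X) = Λ_NP is assigned to the pathological class. The function names and the toy one-component models in the usage note are hypothetical.

```python
import math

def gmm_frame_loglik(x, weights, means, variances):
    """Log-density of one feature vector under a diagonal-covariance GMM."""
    comp = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        comp.append(ll)
    # Log-sum-exp over the mixture components for numerical stability.
    top = max(comp)
    return top + math.log(sum(math.exp(c - top) for c in comp))

def llr(frames, gmm_normal, gmm_patho):
    """Frame-averaged LLR: normal-model score minus pathological-model score."""
    total = 0.0
    for x in frames:
        total += gmm_frame_loglik(x, *gmm_normal) - gmm_frame_loglik(x, *gmm_patho)
    return total / len(frames)

def classify(llr_value, threshold):
    """Normal if the LLR exceeds the threshold, otherwise pathological."""
    return "normal" if llr_value > threshold else "pathological"
```

For example, with toy one-dimensional models `gmm_normal = ([1.0], [[0.0]], [[1.0]])` and `gmm_patho = ([1.0], [[2.0]], [[1.0]])`, each given as (weights, means, variances), frames close to 0 yield a positive LLR and are labeled normal for a threshold of 0.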

HOS method

A speech signal, x(n), which may be normal or pathological, can be expressed as given in Equation (2) [17].

x (n) = s (n) + w (n)

(2)

where s(n) is a non-Gaussian signal generated by the oscillation of the vocal folds and w(n) is Gaussian noise, which can be assumed to be zero in normal voices and nonzero in pathological voices.

Pathological voices are corrupted by noise, w(n), which is directly related to the perceived roughness of the voice [1–3, 20]. If s(n) and w(n) are statistically independent, then the energy of x(n) is the sum of the speech and noise energies: E_x = E_s + E_w. Second-order statistics are thus directly affected in an additive way by the presence of noise [17, 19]. However, when HOS analysis is applied to pathological voices, unstable and discontinuous statistics of x(n) may be estimated because HOS analysis is blind to Gaussian processes. In a normal voice, on the other hand, the HOS of only the non-Gaussian component may be extracted because the Gaussian noise can be assumed to be zero. The variation of the non-Gaussian signal produced by vibration of the vocal folds can therefore be an important clue for the classification of pathological and normal voices.

If x(n), where n = 0, ±1, ±2, …, is a real stationary discrete-time signal and its moments up to order p exist, then its pth-order moment function is given by Equation (3).

m_{p} (τ_{1}, τ_{2}, \dots, τ_{p - 1}) \equiv E \{x (n) x (n + τ_{1}) \dots x (n + τ_{p - 1})\}

(3)

The moment function depends only on the time differences τ₁, τ₂, …, τ_{p−1}, where τ_i = 0, ±1, ±2, … for all i. Here, E{·} denotes statistical expectation; for a deterministic signal, it is replaced by a time summation over all time samples or by time averaging. In addition, if the signal has zero mean, then its cumulant functions (up to order four) are given by Equation (4) [17].

Second-order cumulant: C_{2} (τ_{1}) = m_{2} (τ_{1}),

Third-order cumulant: C_{3} (τ_{1}, τ_{2}) = m_{3} (τ_{1}, τ_{2}),

Fourth-order cumulant: C_{4} (τ_{1}, τ_{2}, τ_{3}) = m_{4} (τ_{1}, τ_{2}, τ_{3}) - m_{2} (τ_{1}) \cdot m_{2} (τ_{3} - τ_{2}) - m_{2} (τ_{2}) \cdot m_{2} (τ_{3} - τ_{1}) - m_{2} (τ_{3}) \cdot m_{2} (τ_{2} - τ_{1})

(4)

By setting all the lags to zero in the above cumulant expressions, we can obtain the variance, skewness, and kurtosis.

Variance: γ_{2} \equiv C_{2} (0) = E \{x^{2} (n)\},

Skewness: C_{3} (0, 0) = E \{x^{3} (n)\},

Kurtosis: C_{4} (0, 0, 0) = E \{x^{4} (n)\} - 3 {[E \{x^{2} (n)\}]}^{2}

(5)

When estimating HOS from finite data records, the variance of the estimators is reduced by normalizing the input data to have unit variance prior to computing the estimators. Equivalently, the third- and fourth-order statistics are normalized by the appropriate powers of the data variance. We therefore define the normalized skewness and kurtosis as shown in Equations (6) and (7) [17].

Normalized skewness: γ_{3} \equiv \frac{C_{3} (0, 0)}{{[C_{2} (0)]}^{1.5}} = \frac{E \{x^{3} (n)\}}{{[E \{x^{2} (n)\}]}^{1.5}}

(6)

Normalized kurtosis: γ_{4} \equiv \frac{C_{4} (0, 0, 0)}{{[C_{2} (0)]}^{2}} = \frac{E \{x^{4} (n)\}}{{[E \{x^{2} (n)\}]}^{2}} - 3

(7)
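As a minimal sketch of Equations (6) and (7), the expectations can be replaced by sample averages. The function name is hypothetical, and the explicit mean removal is an assumption added here so that the zero-mean condition behind the cumulant expressions holds for arbitrary input.

```python
def normalized_skewness_kurtosis(x):
    """Sample estimates of the normalized skewness and kurtosis, Equations (6)-(7)."""
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]        # assumption: enforce zero mean first
    m2 = sum(v ** 2 for v in xc) / n  # estimate of E{x^2(n)}
    m3 = sum(v ** 3 for v in xc) / n  # estimate of E{x^3(n)}
    m4 = sum(v ** 4 for v in xc) / n  # estimate of E{x^4(n)}
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, kurtosis
```

For a Gaussian signal both values tend towards zero as the record length grows, which is the Gaussian suppression property referred to in the Introduction. A symmetric two-level signal such as [-1, 1, -1, 1] gives a skewness of 0 and a kurtosis of -2.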

In this article, the normalized skewness and kurtosis are extracted for each frame as shown in Equation (8).

γ_{3 t} = \frac{\sum_{n = 1}^{N} x_{t}^{3} (n)}{{[\sum_{n = 1}^{N} x_{t}^{2} (n)]}^{1.5}}, γ_{4 t} = \frac{\sum_{n = 1}^{N} x_{t}^{4} (n)}{{[\sum_{n = 1}^{N} x_{t}^{2} (n)]}^{2}}

(8)

where x_t(n) is the nth speech sample of the tth frame and N is the number of samples in the frame.

The proposed HOS-based parameters are the means of the normalized skewness and kurtosis, ${\bar{γ}}_{3}$ and ${\bar{γ}}_{4}$. They are estimated over a sentence from the frame-level values of γ₃ and γ₄, as described in Equation (9). As in the MFCC procedure, voice samples were cut into 40-ms frames shifted by 20 ms and multiplied by a Hamming window before the HOS-based parameters were extracted.

{\bar{γ}}_{3} = \frac{1}{T} \sum_{t = 1}^{T} γ_{3 t}, {\bar{γ}}_{4} = \frac{1}{T} \sum_{t = 1}^{T} γ_{4 t}

(9)

where γ_{3t} and γ_{4t} are γ₃ and γ₄ extracted in the tth frame, respectively, and T is the number of frames.
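The frame-level computation of Equation (8) and the averaging of Equation (9) can be sketched as follows, using the 40-ms frames, 20-ms shift, and Hamming window described above. The function names are hypothetical, the default sampling rate of 25 kHz follows the Material section, and trailing samples that do not fill a whole frame are simply dropped, which is an assumption. The frame-level values follow Equation (8) as written, using raw power sums.

```python
import math

def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def frame_hos(frame):
    """Equation (8): frame-level ratios of raw power sums."""
    s2 = sum(v * v for v in frame)
    s3 = sum(v ** 3 for v in frame)
    s4 = sum(v ** 4 for v in frame)
    # Note: an all-zero (silent) frame would raise ZeroDivisionError here.
    return s3 / s2 ** 1.5, s4 / s2 ** 2

def mean_hos(signal, fs=25000, frame_ms=40, shift_ms=20):
    """Equation (9): average the frame-level values over all T frames."""
    flen = int(fs * frame_ms / 1000)
    hop = int(fs * shift_ms / 1000)
    if len(signal) < flen:
        raise ValueError("signal is shorter than one analysis frame")
    win = hamming(flen)
    g3, g4 = [], []
    for start in range(0, len(signal) - flen + 1, hop):
        frame = [s * w for s, w in zip(signal[start:start + flen], win)]
        a, b = frame_hos(frame)
        g3.append(a)
        g4.append(b)
    t = len(g3)  # number of frames, T
    return sum(g3) / t, sum(g4) / t
```

Because both ratios are invariant to the amplitude of the input, the recording level does not need to be normalized before calling `mean_hos`.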

Two-stage classifier based on GMM-HOS

The block diagram of the proposed algorithm is shown in Figure 2. In the training phase, the MFCCs are extracted from the voice samples for each analysis frame. ${\bar{Λ}}_{N}$ and ${\bar{Λ}}_{P}$ indicate the LLR thresholds estimated by the GMMs for normal and pathological voices, respectively. An example is shown in Figure 3, which plots false acceptance and false rejection against the LLR threshold. The two curves cross at the equal error rate (EER). ${\bar{Λ}}_{N}$ and ${\bar{Λ}}_{P}$ are the LLR values determined when the EER is 25.0%. The thresholds ${\bar{γ_{3}}}_{thre}$ and ${\bar{γ_{4}}}_{thre}$ are determined in advance as the values that produce the best results when ${\bar{γ}}_{3}$ or ${\bar{γ}}_{4}$ is used alone for pathological voice detection. That is, they are computed from the training data set.

In the test phase, the LLR, $\bar{Λ}$ , is estimated from the feature vector and the pre-trained GMMs. The primary decision is made by the MFCC-based GMM algorithm. If ${\bar{Λ}}_{P} \leq \bar{Λ} \leq {\bar{Λ}}_{N}$ , the voice sample is passed to the HOS operator, and the final decision is made after calculating ${\bar{γ}}_{3}$ and ${\bar{γ}}_{4}$ . The values of ${\bar{γ}}_{3}$ and ${\bar{γ}}_{4}$ are used independently to classify normal and pathological voices. When ${\bar{γ}}_{3}$ is used, the voice is said to be pathological if $\bar{γ_{3}} < {\bar{γ_{3}}}_{thre}$ and normal if $\bar{γ_{3}} \geq {\bar{γ_{3}}}_{thre}$ . When ${\bar{γ}}_{4}$ is used, the voice is said to be normal if $\bar{γ_{4}} < {\bar{γ_{4}}}_{thre}$ and pathological if $\bar{γ_{4}} \geq {\bar{γ_{4}}}_{thre}$ .
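The test-phase decision logic can be summarized in a short Python sketch. The function and argument names are hypothetical. The handling of scores outside the band is an assumption inferred from the single-threshold GMM rule, under which a higher LLR indicates a normal voice: scores above the normal threshold are labeled normal and scores below the pathological threshold are labeled pathological.

```python
def two_stage_decision(llr, llr_p, llr_n, hos_value, hos_threshold, hos="kurtosis"):
    """Two-stage GMM-HOS decision for one voice sample.

    llr           -- LLR of the sample from the MFCC-based GMMs
    llr_p, llr_n  -- lower (pathological) and upper (normal) LLR thresholds
    hos_value     -- mean normalized skewness or kurtosis of the sample
    hos_threshold -- threshold tuned on the training data
    hos           -- "skewness" or "kurtosis", selecting the second-stage rule
    """
    # Stage 1: confident GMM decisions outside the band (assumed mapping).
    if llr > llr_n:
        return "normal"
    if llr < llr_p:
        return "pathological"
    # Stage 2: llr_p <= llr <= llr_n, so the HOS parameter decides.
    if hos == "skewness":
        return "pathological" if hos_value < hos_threshold else "normal"
    return "normal" if hos_value < hos_threshold else "pathological"
```

Note that the two HOS rules point in opposite directions: a low mean skewness indicates a pathological voice, whereas a low mean kurtosis indicates a normal voice.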