doi:10.1155/2007/85286

Research Article

Study of Harmonics-to-Noise Ratio and Critical-Band Energy Spectrum of Speech as Acoustic Indicators of Laryngeal and Voice Pathology

Acoustic analysis of speech signals is a noninvasive technique that has proven to be an effective tool for the objective support of vocal and voice disease screening. In the present study, acoustic analysis of sustained vowels is considered. A simple k-means nearest neighbor classifier is designed to test the efficacy of a harmonics-to-noise ratio (HNR) measure and the critical-band energy spectrum of the voiced speech signal as tools for the detection of laryngeal pathologies. It assigns a given voice sample to either the pathologic or the normal class. The voiced speech signal is decomposed into harmonic and noise components using an iterative signal extrapolation algorithm. The HNRs at four different frequency bands are estimated and used as features. Voiced speech is also filtered with 21 critical-bandpass filters that mimic the tuning of human auditory neurons. Normalized energies of these filter outputs are used as another set of features. The results obtained show that the HNR and the critical-band energy spectrum can be used to correlate laryngeal pathology and voice alteration, using previously classified voice samples. This method could serve as an additional acoustic indicator that supplements the clinical diagnostic features for voice evaluation.


INTRODUCTION
Diseases that affect the larynx cause changes in the patient's vocal quality. Early signs of deterioration of the voice due to vocal malfunctioning are normally associated with breathiness and hoarseness of the produced voice. The first tool used to detect laryngeal pathology is subjective analysis of the speech: trained physicians perform a subjective evaluation of the patient's voice, which is followed by laryngoscopy, which may cause discomfort to the patient. A complementary technique is acoustic analysis of the speech signal, which has been shown to be a potentially useful tool for detecting voice disease. This noninvasive technique is a fast, low-cost indicator of possible voice problems.
Any pathological change in the anatomical structure of the larynx alters its physiological function and, in turn, the vocal output [1][2][3][4][5][6][7]. The analysis methods found in the literature are mainly based on the periodicity of vocal fold vibration and on the turbulence in the glottal flow resulting from malfunctioning of the vocal folds [8][9][10][11][12][13][14][15][16][17]. The periodicity perturbations are associated with the measurement of jitter and shimmer: jitter is the cycle-to-cycle variation between successive fundamental periods, and shimmer is the cycle-to-cycle variation between successive magnitudes of the signal. The turbulence in the glottal flow is usually quantified by the noise components in the voiced speech spectrum. In this study we focus on the vocal noise for the analysis of vocal fold pathology.
Researchers have extensively used the vocal noise for the evaluation of pathologic voice. Many noise features have been proposed to quantify the relative noise components in a speech signal. The prominent ones are the harmonics-to-noise ratio (HNR), the normalized noise energy (NNE), and the glottal-to-noise-excitation ratio (GNE). Yumoto et al. [11] proposed the HNR as a measure of hoarseness. However, the estimation of HNR is based on the assumption that a long stationary data segment is available for analysis, which may not be realistic since speech is highly nonstationary. Kasuya et al. [12] proposed NNE as a novel and effective acoustic measure to evaluate noise components in pathologic voices. They devised an adaptive comb filtering method operating in the frequency domain to estimate the noise components and NNE from a sustained vowel phonation. A fixed-length voiced segment (seven times the fundamental pitch period) is used for the analysis. Manfredi [13] used an adaptive window, whose length is adapted according to the fundamental pitch period. The adaptive NNE proposed there is particularly useful for complete word utterances. Michaelis et al. [16] proposed a new acoustic measure called GNE for the objective description of voice quality. This parameter is related to the breathiness in the voiced speech, and it indicates whether a given voice signal originates from the vibration of the vocal folds or from turbulent noise generated in the vocal tract.
In this paper, we extract two different sets of features from the acoustic analysis of voiced speech and further use them to correlate laryngeal pathology and voice alteration on a previously classified database of voice samples. The first feature set is the energy ratio of harmonics to noise components (HNR) in the voiced speech signal at four different frequency bands and the second set of features is based on the energy spectrum at critical-band spacing [18]. A k-means nearest neighbor classifier [19] is used separately on these sets of features to test their efficacy as tools for the detection of laryngeal pathology. As the same classifier is used on the two feature sets independently, we get two different sets of classification results. As we have used a preclassified database of voices, this allows us to make a comparison between the efficacies of the two sets of features apart from their individual efficiencies.

Database
In the present study, we wanted to understand whether HNR and the critical-band energy spectrum could be used as effective tools for the classification of normal and pathologic voices. A prior-labeled database is helpful in such a study to correlate the results obtained. We have taken the speech signals from such a database distributed by Kay Elemetrics Corporation. This CD-ROM database of acoustic records, originally developed by the Massachusetts Eye and Ear Infirmary (MEEI) Voice and Speech Lab [20], contains over 1400 voice signals of approximately 700 subjects. Included are sustained phonation and running speech samples from patients with a wide variety of organic, neurological, traumatic, and psychogenic disorders, as well as from 53 normal subjects. We have used voice samples of sustained phonation of the vowel /a/. The recordings were made in a controlled environment, and data were available at sampling frequencies of 25 kHz or 50 kHz. We have downsampled all the voice signals to a sampling frequency of 16 kHz. The normal voice records are about 5 seconds long, whereas the pathologic voice records are about 3 seconds long. 53 normal and 163 pathologic voice signals have been used in our study, as shown in Table 1. Approximately 50 percent of the signals of each group were used for training (to estimate the prototype) and the remainder for testing.

Estimation of HNR
One of the important characteristics of voiced speech is its well-defined harmonic structure. The source for voiced speech is often modeled as quasiperiodic glottal pulses. In reality, however, even sustained vowel phonation contains a random part, mainly due to the turbulence of airflow through the glottis (anterior and/or posterior glottis) and due to pitch perturbations. A windowed segment s(n) of the voiced speech signal is therefore assumed to have a periodic component p(n) and a random component w(n), represented as

s(n) = p(n) + w(n),  0 ≤ n ≤ M − 1,  (1)

where M is the length of the analysis window. The two components cannot be directly separated because the random component may have energy over the entire speech spectrum. But one can get an estimate of the random component by decomposing the speech into periodic and random components. We have used a method similar to the one proposed by Yegnanarayana et al. [21] for the decomposition of speech into periodic and aperiodic components. The method involves an initial approximation of the periodic and random components using a harmonicity criterion. This is followed by an iterative reconstruction of the random component in the region labelled as "periodic", based on discrete Fourier transform (DFT) and inverse discrete Fourier transform (IDFT) pairs.

Identification of harmonic and noise regions
The first step in the signal decomposition algorithm is to derive a first approximation of the periodic and aperiodic components in the frequency domain. The spectrum of a windowed voiced speech segment is shown schematically in Figure 1. An N-point DFT of a Hamming-windowed segment of length M of the voiced speech is assumed. The harmonic peak region P_i has a width of 2N/M on either side of the peak frequency k_i corresponding to the ith harmonic of the fundamental frequency; 2N/M is the approximate bandwidth of the Hamming window. This region contains both periodic and aperiodic energy. In the harmonic dip region D_i, it is assumed that the periodic components have no energy and that the entire energy is due to the random components. In order to obtain a nonempty dip region with d points, the window length should satisfy [13]

M ≥ 4N / (N f_0 T − d),  (2)

where f_0 is the fundamental frequency of phonation and T is the sampling interval. Thus, with a nonempty dip region, one can identify the harmonic region as the set of frequency numbers k that lie within 2N/M of some harmonic peak k_i, and the noise region as the remaining frequency numbers. A peak-searching algorithm is used to initially locate the harmonic peak frequencies k_i. This algorithm determines the spectral peaks by searching in intervals centered at each multiple of the fundamental frequency f_0. The fundamental frequency is estimated using the method described in Section 2.2.2 below.
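As a concrete sketch of this partition, the harmonic and noise regions can be represented as Boolean masks over the DFT bins. The following Python fragment is our own illustration, not the authors' code: for simplicity it places the peaks at the ideal harmonic bins rather than running the peak search described above, and the function name and arguments are hypothetical.

```python
import numpy as np

def harmonic_noise_masks(N, M, f0, fs, n_harmonics):
    """Boolean masks over the one-sided DFT bins marking the harmonic peak
    regions P_i (within 2N/M bins of each harmonic of f0) and the remaining
    dip/noise region. Illustrative sketch: peaks are placed at the ideal
    harmonic bins instead of being located by a peak search."""
    half_width = int(round(2 * N / M))        # approx. Hamming main-lobe half-width in bins
    harmonic = np.zeros(N // 2 + 1, dtype=bool)
    bins_per_harmonic = f0 * N / fs           # spacing of harmonics in DFT bins
    for i in range(1, n_harmonics + 1):
        k_i = int(round(i * bins_per_harmonic))   # ideal bin of the i-th harmonic
        lo = max(k_i - half_width, 0)
        hi = min(k_i + half_width, N // 2)
        harmonic[lo:hi + 1] = True
    noise = ~harmonic
    return harmonic, noise
```

With N = 2048, M = 1023, the peak region extends about 4 bins on either side of each harmonic, leaving a nonempty dip region for the pitch range used in this study.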

Estimation of f 0
Sufficient subglottal air pressure and vocal fold adduction produce oscillation of the vocal folds and therefore voiced sounds when the vocal fold tissues are pliable. The rate of vibration is the fundamental frequency. The glottis opens and closes, resulting in quasiperiodic flow of air. The instant of closure of the glottis is referred to as the glottal closure instant (GCI). During each period of voiced speech, a GCI occurs. To detect this, Wendt and Petropulu [22] used a wavelet function having a derivative property. When the speech signal is filtered by this function, maxima will occur at every GCI. For many phonation cases, normal and abnormal, the vocal folds do not come all the way together, and there is no glottal closure. However, there can be a more prominent flow reduction within a cycle, and therefore a greater acoustic excitation at that time in the cycle. Many of the pathological voices will not have closure, but will have stronger excitation moments somewhere in the cycle. Such voiced speech signals also exhibit prominent peaks when filtered through the wavelet filtering function at these stronger excitation moments. Thus the time elapsed between two adjacent maxima of the filtered signal represents the pitch period of the signal at that moment. We propose an extension to this method to estimate the pitch.
To construct the filtering function, the wavelet with the derivative property described by Mallat and Zhong [23] is combined with the bandwidth property of the wavelet transform at different scales. Let ψ(t) be the mother wavelet with the derivative property. The functions

ψ_k(t) = (1/2^k) ψ(t/2^k),   φ_k(t) = (1/2^k) φ(t/2^k)

represent the Haar wavelet and scaling functions, respectively, at scale k. Here φ(t) is a lowpass function and is the conjugate mirror filter of ψ(t), which is a highpass function. As the approximate range of the fundamental frequency of voiced speech is 60 to 500 Hz [24], the final filtering function should have the same bandwidth. Thus we construct a filtering function λ(t) as

λ(t) = φ_{k_a}(t) * ψ_{k_b}(t),

where * denotes convolution. The scales k_a and k_b are given by k_a = log_2(F_s/500) and k_b = log_2(F_s/60), where F_s is the bandwidth of the input speech signal. The speech signal is passed through this filter. The filtered signal shows dominant peaks at the GCIs. The peaks of the filtered signal are detected using a peak detection algorithm, which identifies the peaks by detecting the points where the slope polarity changes. For real speech, the filtered signal exhibits some spurious peaks, which are eliminated by a suitable peak correction method. Thresholding the strength and the proximity of adjacent peaks [25] is used in the peak correction algorithm. In the first stage of correction, a peak is validated only if its amplitude is above a threshold, fixed at 25 percent of the average peak amplitude. In the second stage, the average distance D_a between adjacent peaks is first estimated; every peak whose distance to its adjacent peak is shorter than 0.5D_a or longer than 2D_a is then eliminated. This two-stage peak correction algorithm eliminates the spurious peaks and retains only the correct peaks. The average distance between consecutive peaks is then used to compute the pitch period and hence the fundamental frequency f_0.
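The two-stage peak correction can be sketched as follows. This is an illustrative Python version with hypothetical function names, not the authors' implementation; it applies the 25-percent amplitude threshold and then tests the 0.5·D_a to 2·D_a proximity rule greedily against the last accepted peak, which is one reasonable reading of the rule stated above.

```python
import numpy as np

def correct_peaks(peak_times, peak_amps):
    """Two-stage spurious-peak elimination:
    (1) keep peaks whose amplitude is at least 25% of the average peak amplitude;
    (2) keep peaks whose spacing to the previously kept peak lies between
        0.5*Da and 2*Da, where Da is the average adjacent-peak distance."""
    peak_times = np.asarray(peak_times, dtype=float)
    peak_amps = np.asarray(peak_amps, dtype=float)

    # Stage 1: amplitude thresholding at 25% of the mean peak amplitude
    keep = peak_amps >= 0.25 * peak_amps.mean()
    t = peak_times[keep]

    # Stage 2: proximity thresholding around the average spacing Da
    if len(t) < 3:
        return t
    Da = np.mean(np.diff(t))
    valid = [t[0]]
    for tk in t[1:]:
        gap = tk - valid[-1]
        if 0.5 * Da <= gap <= 2.0 * Da:
            valid.append(tk)
    return np.array(valid)

def pitch_from_peaks(valid_times):
    """Fundamental frequency from the mean distance between validated peaks."""
    return 1.0 / np.mean(np.diff(valid_times))
```

For example, a spurious low-amplitude peak inserted between two genuine glottal-closure peaks is removed in stage 1, and the remaining evenly spaced peaks yield the pitch directly.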

Estimation of harmonic and noise energies
By estimating the signal energies in the identified harmonic and noise regions (Section 2.2.1), one can get only an approximate harmonics-to-noise ratio. The energy in a noise region is assumed to be due to noise components only, but in a harmonic region the energy is a superposition of harmonic and noise components. The noise energy can be estimated by signal extrapolation methods. In this paper, we have used an iterative algorithm developed by Yegnanarayana et al. [21] to reconstruct the noise components. The algorithm is based on the bandlimited signal extrapolation proposed by Papoulis [26]. The noise component is reconstructed by iteratively moving between the frequency domain and the time domain. For an M-length signal, an N-point (N > M) DFT is first obtained. The iterations begin with zero values in the frequency region identified as the harmonic region and the actual DFT values in the noise region. An inverse DFT is then obtained, and the first M points of the resulting signal are retained. An N-point DFT is again computed, and the harmonic region is again forced to zero. The IDFT is computed, and this procedure is repeated for a few iterations. It is shown in [21] that for a finite-duration signal with known noise samples, the reconstructed noise component converges to the actual noise component in the mean-square sense as the iterations proceed. In practice, after a small number of iterations (about 8 to 10), the noise components are reconstructed with negligible error. After reconstructing the noise components, the harmonic components are obtained by time-domain subtraction. From these components, the harmonics-to-noise energy ratio in the required frequency bands is estimated.
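The DFT/IDFT iteration can be sketched in a few lines of Python. This is our own illustration of the loop, not the authors' implementation; `harmonic_mask` is assumed to be a Boolean array over all n_fft bins (covering both positive and negative frequencies), and the function names are hypothetical.

```python
import numpy as np

def reconstruct_noise(s, harmonic_mask, n_fft=2048, iterations=10):
    """Iterative band-limited extrapolation of the noise component:
    zero the harmonic-region DFT bins, inverse transform, retain the
    first M time samples, and repeat."""
    M = len(s)
    w = np.array(s, dtype=float)          # current noise estimate (time domain)
    for _ in range(iterations):
        W = np.fft.fft(w, n_fft)
        W[harmonic_mask] = 0.0            # force the harmonic region to zero
        w = np.real(np.fft.ifft(W))[:M]   # back to time domain, keep M samples
    return w

def hnr_db(s, noise, band_mask, n_fft=2048):
    """Harmonics-to-noise ratio (dB) over the bins selected by band_mask,
    with the harmonic component obtained by time-domain subtraction."""
    harmonic = s - noise
    H = np.abs(np.fft.fft(harmonic, n_fft))[band_mask]
    Wn = np.abs(np.fft.fft(noise, n_fft))[band_mask]
    return 10.0 * np.log10(np.sum(H**2) / np.sum(Wn**2))
```

On a synthetic segment consisting of a strong sinusoid plus a weak out-of-band sinusoid, the loop recovers the weak component as "noise" and the resulting HNR matches the known amplitude ratio.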

Critical-band energy spectrum
The effect of noise on speech has been found to change its spectral characteristics. Marked differences are found in the distribution of energy at critical bands between clean and noisy speech signals [27]. This difference was effectively used to distinguish clean speech from speech with added noise. We extend this idea to differentiate pathologic voices from normal ones, as the voiced speech of subjects with vocal fold pathology has additional noise components caused mainly by incomplete closure of the glottis and an improper vibration pattern of the vocal folds. We have used energy spectra at critical-band spacing because the center frequencies and bandwidths of the critical bands roughly correspond to the tuning curves of human auditory neurons. The human auditory system is assumed to perform a filtering operation that partitions the audible spectrum into critical bands [28]. The 21 critical bands described in Table 2 [27] have been used in this work; thus the proposed automated analysis mimics the human perceptual analysis of voice pathology. These 21 bands cover the frequency range from 1 Hz to 7.7 kHz. The bandwidths of the lower critical bands are narrower and progressively increase with center frequency. We have adopted a filter bank approach for the estimation of energy: sixth-order Butterworth bandpass filters are used to obtain the 21-band filter bank. The filter bank approach is preferred for its simple and inexpensive implementation, and it is particularly suitable when a small set of parameters describing the spectral distribution of energy has to be derived.
The outputs from a bank of 21 bandpass filters typically provide a very efficient spectral representation.
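The filter-bank energy computation can be sketched with SciPy's Butterworth design routines. The band edges below are an illustrative subset on the standard critical-band (Bark) spacing, not the paper's full 21-band Table 2, and the function and variable names are our own.

```python
import numpy as np
from scipy.signal import butter, sosfilt

# Illustrative subset of critical-band edges in Hz (bands 2-9 on the
# standard Bark spacing); the paper's Table 2 lists all 21 bands.
BAND_EDGES = [(100, 200), (200, 300), (300, 400), (400, 510),
              (510, 630), (630, 770), (770, 920), (920, 1080)]

def critical_band_energies(frame, fs=16000, edges=BAND_EDGES):
    """Per-band energies of one 20 ms frame, normalized to the total energy,
    using 6th-order Butterworth bandpass filters as in the paper's
    filter-bank approach."""
    energies = []
    for lo, hi in edges:
        sos = butter(6, [lo, hi], btype='bandpass', fs=fs, output='sos')
        y = sosfilt(sos, frame)
        energies.append(np.sum(y**2))
    energies = np.array(energies)
    return energies / energies.sum()    # normalize to total energy
```

A pure tone placed inside one band should concentrate almost all of the normalized energy in that band, which makes the representation easy to sanity-check.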
In the next section, we describe the extraction of the features and the design of the classifier.

Features based on HNR
One of the important characteristics of normal voiced speech is that it exhibits a good harmonic structure even up to about 4 kHz. In contrast, pathologic voices exhibit higher noise levels, and the noise is distributed across the entire speech spectrum. Pathologic voices may have good harmonic structure at low frequencies, but at higher frequencies the harmonic energy decreases as the noise energy increases. This is evident from Figure 2, where the log magnitude spectra of the estimated harmonic and noise components are shown for a segment of speech corresponding to the sustained vowel /a/ uttered by a normal and by a pathologic subject. The harmonic and noise components are obtained by decomposing the segment of the speech signal using the method discussed in Section 2. The normal voice shows a regular harmonic structure up to about 4 kHz with relatively low noise energy. In the case of the pathologic voice, the spectrum shows higher noise levels with a deteriorated harmonic structure even at lower frequencies. The harmonics-to-noise energy ratio (HNR) at different frequency bands can therefore be used for discriminating pathologic voices from normal ones. In this study, we have used HNRs at four different frequency bands as the features for classification, as shown in Table 3. These frequency bands are the standard bands used in many speech-processing applications [27] and have a logarithmic spacing that approximates the frequency response of the human ear. We experimented with more than 4 frequency bands, and no significant improvement in the results was found. Using frequencies above 5.5 kHz also had no significant effect on the results, because both normal and pathologic voices show low HNR above this frequency. The speech recordings corresponding to the sustained vowel /a/ are sampled at 16 kHz and digitized with 16-bit resolution. The data are then segmented into overlapping segments of length 1023 samples.
This particular choice of the segment length is based on the following considerations. The accuracy of the extrapolation algorithm for the decomposition of the voice signal into harmonic and noise components is poor for low-pitched voices, as the number of sample points available in the harmonic dip region for the extrapolation is small. At lower pitch, to have nonempty dip regions, the frame length needs to be longer (see (2)). At the same time, the data window at higher pitch frequencies spans a large number of pitch cycles. The pitch of the voice samples used in the current study was in the range 90 Hz to 220 Hz. Thus we found a segment length of 1023 points adequate. This also suits the requirements of the iterative DFT/IDFT procedure used for the decomposition of speech, where we have used 2048-point DFTs.
For each segment, the HNRs at the four frequency bands are estimated by the method described in Section 2. These 4 HNRs are then averaged over all segments, and the averaged HNR values form the feature vector for the classifier.
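This feature extraction step can be sketched as follows. The hop size is an assumption (the paper specifies overlapping 1023-sample segments but not the overlap amount), and `hnr_of_segment` is a hypothetical helper that returns the four per-band HNRs for one segment, as produced by the decomposition of Section 2.

```python
import numpy as np

def segment_signal(x, frame_len=1023, hop=512):
    """Split the signal into overlapping 1023-sample analysis segments
    (hop size assumed, roughly 50% overlap)."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def hnr_feature_vector(x, hnr_of_segment):
    """Average the 4 per-band HNR values over all segments.
    `hnr_of_segment` is assumed to return a length-4 array."""
    frames = segment_signal(x)
    return np.mean([hnr_of_segment(f) for f in frames], axis=0)
```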

Features based on energy spectrum
The voiced speech data (sustained phonation of the vowel /a/) are uniformly divided into 20 ms frames. Each frame is filtered through the 21-channel filter bank, whose center frequencies and bandwidths are taken according to critical-band spacing. These 21 bands cover a frequency range of 1 Hz to 7.7 kHz. The energies of the 21 filter outputs are computed and normalized to the total energy. This normalized energy spectrum is used as a feature vector in this study. Figure 3 shows an example of normalized energy spectra for normal and pathologic voice signals. Here we have plotted the normalized energy (which is the sum of both harmonic and noise energies) versus the frequency bands. It is observed that for the healthy voices considered in the study, most of the energy is concentrated in critical bands 5 through 10, which correspond to the frequency range of 400 Hz to 1270 Hz, whereas the pathologic voices do not show such a pattern. Pathologic voices exhibited energy distributions in which considerable energy is seen in the lower bands as well (critical bands 1 through 4). This is also evident from Figure 2, where the pathologic voice shows large harmonic and noise energy at lower frequencies, though the harmonic energy falls rapidly at higher frequencies as the noise energy increases. However, some pathologic voices also show a significant amount of energy at higher frequency bands.

Classifier
This section describes the design of a classifier to assign a given voice signal to the normal or pathologic class, based on the estimated acoustic features. The distribution functions for these features are unknown, and hence nonparametric methods of classification are necessary. Several techniques are available, including fitting an arbitrary density function to a set of samples, histogram techniques, and kernel or window techniques [29]. Apart from these, there are several nearest neighbor techniques, which do not explicitly use any density functions.

Nearest neighbor classification
This method assigns an unknown sample signal to the class having the most similar, or nearest, sample signal in the reference (training) set of signals. The nearest sample signal is found by using the concept of distance, or metric. We have used the Euclidean distance as the metric. The Euclidean distance in n-dimensional feature space between two points a = (a_1, a_2, . . . , a_n) and b = (b_1, b_2, . . . , b_n) is

d(a, b) = [Σ_{i=1}^{n} (a_i − b_i)^2]^{1/2}.

In the present work, a simple k-means nearest neighbor classifier has been used. This is a variant of the nearest neighbor technique. Here a prototype is computed from the reference set of sample signals, and a given test sample signal is classified as belonging to the class of the closest prototype. The prototype is computed as the mean of the feature vectors of the reference-set signals belonging to a particular class. The prototype, referred to as a centroid vector, is computed separately for the normal and the pathologic voice signals. This averaging process constitutes the training phase of the classifier.
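The training and classification steps can be sketched in a few lines of Python. The function names are our own; the prototype is simply the class mean, and ties in distance are resolved toward the normal class, matching the decision rule used in this paper.

```python
import numpy as np

def train_centroids(X_normal, X_pathologic):
    """Training phase: each class prototype (centroid vector) is the mean
    of the feature vectors in the reference set for that class."""
    return np.mean(X_normal, axis=0), np.mean(X_pathologic, axis=0)

def classify(x, centroid_nc, centroid_pc):
    """Assign the test vector to the class of the closest centroid
    (Euclidean distance); ties go to the normal class."""
    d_nc = np.linalg.norm(x - centroid_nc)
    d_pc = np.linalg.norm(x - centroid_pc)
    return 'pathologic' if d_pc < d_nc else 'normal'
```

The same two functions serve both feature sets: the feature vectors are length 4 for the HNR features and length 21 for the critical-band energy spectrum.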

Classification based on HNR
Let HNR_i^j denote the harmonics-to-noise ratio at the ith frequency band for the jth sample signal, with i = 1, 2, 3, 4. Then the centroid vector is

HNR_i^c = (1/k) Σ_{j=1}^{k} HNR_i^j,  i = 1, 2, 3, 4,

where c = nc (normal class) or pc (pathologic class), and k is the number of sample signals in the reference set belonging to class c. Two such centroid vectors are computed, one for normal voices and the other for pathologic voices. For the test sample signal, we calculate the Euclidean distance D between the HNR feature vector of the test sample signal and each centroid vector. Thus we have two distance measures:

D_nc = [Σ_{i=1}^{4} (HNR_i^t − HNR_i^nc)^2]^{1/2},
D_pc = [Σ_{i=1}^{4} (HNR_i^t − HNR_i^pc)^2]^{1/2},

where HNR_i^t is the ith component of the HNR vector for the test sample signal, and HNR_i^nc and HNR_i^pc are the ith components of the centroid vectors corresponding to the normal and pathologic classes, respectively. D_nc and D_pc are the distances between the test vector and the corresponding centroid vectors.
The nearest neighbor rule is then applied to assign the test sample signal to normal or pathologic class. The rule is if D pc < D nc , then the test sample is considered as pathologic, otherwise as normal.

Classification based on energy spectrum
We define the spectral distance SD as the Euclidean distance between the feature vector (normalized energy values at the 21 critical bands) of the test sample signal and the centroid vector:

SD = [Σ_{i=1}^{21} (EB_i^t − EB_i^c)^2]^{1/2},

where EB_i^t denotes the ith normalized filter-bank energy output of the test sample and EB_i^c denotes the corresponding energy of the centroid vector. For any given test sample, the two spectral distances, one corresponding to the normal centroid and the other to the pathologic centroid, are estimated as

SD_n = [Σ_{i=1}^{21} (EB_i^t − EB_i^nc)^2]^{1/2},
SD_p = [Σ_{i=1}^{21} (EB_i^t − EB_i^pc)^2]^{1/2},

respectively, where EB_i^nc and EB_i^pc denote the ith components of the centroid vectors corresponding to the normal and pathologic cases. Based on the above spectral distance measures, the given test sample is classified into the normal class if SD_n ≤ SD_p, or into the pathologic class otherwise.

PERFORMANCE EVALUATION AND RESULTS
The following parameters were used to evaluate the performance of the classifier.
(1) True positive (TP): the classifier detected pathology when pathology was present.
(2) True negative (TN): the classifier detected normal when normal voice was present.
(3) False positive (FP): the classifier detected pathology when normal voice was present (false acceptance).
(4) False negative (FN): the classifier detected normal voice when pathology was present (false rejection).
(5) Sensitivity (SE): likelihood that pathology will be detected given that it is present.
(6) Specificity (SP): likelihood that the absence of pathology will be detected given that it is absent.
(7) Accuracy: the accuracy with which the classifier is able to classify the given sample to the correct group.
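These quantities follow directly from the four confusion counts; the helper below is our own sketch, not from the paper.

```python
def performance(tp, tn, fp, fn):
    """Sensitivity, specificity, and overall accuracy from confusion counts."""
    sensitivity = tp / (tp + fn)                  # pathology detected when present
    specificity = tn / (tn + fp)                  # normal detected when absent
    accuracy = (tp + tn) / (tp + tn + fp + fn)    # correct decisions overall
    return sensitivity, specificity, accuracy
```

For instance, with the HNR-based test figures mentioned in the discussion (75 of 79 pathologic and 24 of 26 normal samples correctly classified), this gives a sensitivity of about 94.9% and a specificity of about 92.3%.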
The results are depicted in Table 4. These results were calculated based on the number of samples used for testing.

DISCUSSION
The HNR-based features provided lower false rejection, and thus higher sensitivity, than the critical-band energy-spectrum-based feature set. In fact, 4 pathologic cases out of 79 test cases were falsely rejected by the first classifier, whereas 7 were falsely rejected by the second. Though a noticeable difference in specificity was seen, both sets of features provided low false acceptance. The relatively large difference (about 4%) in specificity arose because the number of normal subjects used in the study was small: 26 normal subjects were used for testing the classifiers; the classifier based on HNR features misclassified two of them, while the other misclassified one. It was observed that for all the misclassified samples, there was a large overlap between the features (HNR and energy spectrum) and the two corresponding estimated prototypes (centroids).
The frequency bands used for the estimation of HNR cover frequencies only up to 5.5 kHz, whereas the critical-band energy spectrum extends to 7.7 kHz. This does not alter the results significantly, as seen in Table 4. It is also evident from Figures 2 and 3 that there is no significant spectral energy in the voiced speech above about 5 kHz. The low harmonic energy above 5 kHz results in low HNR for both normal and pathologic cases; hence using HNR above 5 kHz will not improve the classifier efficiency.
We have considered mainly vocal fold pathologies and normal voices in this study. The method works well for all these cases. The prototypes for individual pathologic cases were not considered because of small sample sizes and hence a comparison of the performance of the classifier in separating individual pathologic cases from normal is not reported in this paper. We have tried interpathology classification using these features, but the results were poor.
The results shown in Table 4 appear to be promising in separating the normal from the pathologic voice samples. These results are comparable to those reported by several other research studies [30][31][32][33]. In [30], a voice analysis system was developed for the screening of laryngeal diseases using four different types of classifiers based on time- and cepstral-domain parameters derived from the speech signal of sustained phonation of the vowel /a/. An overall classification accuracy of 93.5% was reported with a test data set consisting of 50 normal and 150 pathologic subjects. In [31], automatic detection of pathologies in voice was based on "classic" parameters, that is, shimmer, jitter, energy balance, and spectral distance, together with newly proposed higher-order statistics (HOS)-based parameters. Classification scores of 94.4% and 98.3%, respectively, were obtained using speech data from 100 healthy and 68 pathologic speakers. Though these results are superior to ours, the method is computationally more complex, as 5 vowels are analyzed for each speaker and neural network classifiers are used. In more recent studies found in the literature [32, 33], data from the Kay Elemetrics disordered voice database have been used for the separation of pathological voices from normal ones. This is the same database that we used in the present study. In [32], a multilayer perceptron network was used on mel-frequency cepstral coefficients (MFCC) to achieve a classification rate of 96%. As in our study, the sustained vowel phonation /a/ was used, but the classification was done on a different set of voice samples (53 normal and 82 pathologic cases). In another recent study [33], a joint time-frequency approach was proposed for the discrimination of pathologic voices.
Continuous speech data from 51 normal and 161 pathologic speakers were analyzed, and an overall classification accuracy of 93.4% was reported using linear discriminant analysis (LDA). The method proposed in this paper has the advantage that the k-means nearest neighbor classifiers are easy to implement with minimal computational cost. Though the critical-band energy-spectrum-based classifier gives comparatively less accurate results, its parameterization is simpler and does not require estimation of the pitch or the noise components.
It is well known that laryngeal pathology can lead to a voice disorder. However, all voice disorders are not due to laryngeal pathology. Acoustical variations with normal laryngeal structure and functions, as well as normal acoustical parameters with variation in the laryngeal organs, have been reported in the literature [34,35]. The results presented here are from an explorative study to look at the efficacy of HNR and energy spectrum at critical-band spacing as diagnostic tools. Both methods described in this paper may give false results in the case of normal voice produced by altered laryngeal function and "pathological" sounding voices because of some muscular imbalance due to behavioral causes or style settings for artistic purposes. However, such cases can be eliminated while recording, by a suitable screening procedure.

CONCLUSIONS
A simple k-means nearest neighbor classifier is designed for the classification of pathologic voices. The harmonics-to-noise ratio and the energy spectrum at critical-band spacing of speech signals are demonstrated as tools for the differential classification of laryngeal pathology versus normal voice. This can be used to supplement the perceptual evaluation of speech for the detection of suspected laryngeal pathologies. The method has the advantage that a comparatively short length of speech data is sufficient for the analysis. The HNR-based classifier makes use of 4 frequency bands, while the energy-spectrum-based classifier makes use of 21. The 4 bands used in the first classifier, as well as the 21 bands used in the second, correspond to the frequency response of auditory neurons of the human ear. The choice of only 4 frequency bands in the first classifier reduces the feature dimensionality from 21 to 4 compared with the second classifier. Though the first method has the advantage of working on reduced-dimensional features, the computational gain is used up by the need to extract the fundamental frequency and estimate the noise components, which are computationally expensive. For pathologic voices, estimation of the fundamental frequency (f_0) is difficult, and for very breathy, almost aphonic voices, the filtered speech may not have dominant peaks, or the peaks may be comparable to noise peaks, leading to erroneous pitch estimation. In such cases the energy-spectrum-based classifier is preferred, though it is comparatively less accurate.