Significance of parametric spectral ratio methods in detection and recognition of whispered speech
EURASIP Journal on Advances in Signal Processing volume 2012, Article number: 157 (2012)
This article describes the significance of a new parametric spectral ratio method for detecting whispered speech segments within normally phonated speech. Adaptation methods based on maximum likelihood linear regression (MLLR) are then used to realize a speech recognition system that operates under mismatched train-test conditions. The proposed parametric spectral ratio method computes the ratio of the linear prediction (LP) spectrum to the minimum variance distortion-less response (MVDR) spectrum. The smoothed ratio spectrum is then used to detect whispered segments within neutral speech. The proposed LP-MVDR ratio method is robust across different SNRs, as indicated by whisper diarization experiments conducted on the CHAINS corpus and the cell phone whispered speech corpus. It also performs reasonably better than conventional methods for whisper detection. To integrate the proposed whisper detection method into a conventional speech recognition engine with minimal changes, MLLR-based adaptation is used. The hidden Markov models corresponding to neutral mode speech are adapted to whispered mode speech data in the whispered regions detected by the proposed ratio method. The performance of this method is first evaluated on whispered speech data from the CHAINS corpus. A second set of experiments is conducted on the cell phone corpus of whispered speech, which was collected using a setup that is used commercially for handling public transactions. The proposed whispered speech recognition system performs reasonably better than several conventional methods. The results indicate the feasibility of a whispered speech recognition system for cell phone based transactions.
Speech is one of the most primitive modes of communication among higher forms of life. It is interesting to note that, although the basic organs that regulate speech are the same, speech varies with the speaker. This variation transcends grammar and vocabulary. It is accounted for by prosody, which covers the pitch, loudness, tempo, rhythm and intonation of speech. These differing prosodic features are the reason robust automatic speech recognition (ASR) systems remain a challenge. By and large, models based on large collections of regional databases have proven very effective in countering regional variations. This has even led to attempts to match prosodic features to a subject's mother tongue. However, the vulnerability of prosodic features to emotional changes is still a challenge. The same speaker can have different prosodic features under different emotional states, which leads to different modes of speech. The ineffectiveness of conventional speech engines under changes in speech mode is evident from earlier work. Whisper is one such natural mode of speech. It is a regular response to situations that require secrecy, and it can selectively exclude potential listeners of the message. Patients with a collapsed larynx, or who have undergone laryngectomy due to cancer of the larynx, may have to resort to a form of speech that is very close to whispered speech. This form is known as esophageal speech.
Speech can be classified into five categories depending on the mode of speech production, i.e. the difference in vocal effort: whispered speech, soft speech, neutral speech, loud speech and shouted speech. Whispered speech is produced in the absence of vocal cord vibrations; however, larynx movements are the same as in neutral speech. This mode of speech has strongly noise-like characteristics. Soft speech is produced when the listener and the speaker are very close to each other and there is an element of secrecy or quietness to be maintained, for example when talking to a friend in a library. Both the vocal cords and the larynx vibrate in this mode. Neutral speech is the regular baseline mode of speech used in relaxed conversation. Loud speech is the mode employed when addressing a large gathering or when speaking in a noisy environment. The increased effort is often accompanied by longer speech segments to increase intelligibility. Shouted speech is produced in a state of extreme anger or when addressing a person over a large distance. It is usually accompanied by extreme articulation by the glottis; however, extreme articulation usually leads to lower intelligibility. Of all these modes, whispered speech is the most challenging to model because of its high noise-like content. It also lacks a harmonic structure due to the absence of vocal cord excitation. Hence speaker identification (Spkr-ID) systems have been reported to perform worst on whispered speech. Therefore, detecting whispered speech segments in a normally phonated speech signal and recognizing the detected segments separately can improve the performance of speech recognition systems. Several methods have been developed for detecting whisper-islands embedded in a normally phonated speech signal[5, 6]. In this work we use features based on the spectral ratio of the linear prediction (LP) spectrum to the minimum variance distortion-less response (MVDR) spectrum, referred to as the LP-MVDR spectral ratio, for whispered speech detection.
Whisper-island detection can be done using features extracted from the linear predictive residual (LPR) together with the Bayesian information criterion (BIC). Chi and Hanse proposed a method that detects whisper-islands in an audio clip via BIC/T2-BIC using a 4-D feature set. In some works MVDR is used for speech recognition and spectral coding of speech. Yapanel and Dharanipragada used perceptual MVDR (PMVDR) based features in speech recognition.
Acoustic aspects of neutral and whispered speech
It is important to understand both the articulatory and the acoustic aspects of neutral and whispered speech in order to differentiate one from the other. Since this article relates only to the acoustic aspects, a brief discussion of the acoustic aspects that differentiate neutral from whispered speech follows. Neutral speech is primarily characterized by its formants. The formant structure can be changed by changing the vocal tract length, which is done by moving the lips, tongue or teeth, or by closing or opening the nasal cavity. Whispered speech, on the other hand, is produced without vibrating vocal cords. Hence there are very few or no glottal pulses. The oral articulators still shape the sound and produce the required characteristic sound, but this sound does not have a definite formant structure. The formants that are present are shifted to higher frequencies compared to their neutral speech counterparts. Because of the turbulence created at the vocal folds, the spectral power in whispered speech shifts to higher frequencies. Figure 1 illustrates spectrograms of a neutral speech utterance and a whispered speech utterance. The shift in spectral power to higher frequencies can be seen in the spectrogram of whispered speech when compared to that of neutral speech.
Brief review of techniques for detection of whispered speech
The Section “Acoustic aspects of neutral and whispered speech” showed that whispered speech differs significantly from neutral speech on many grounds. Whispered speech is produced without vocal cord motion, which leads to a weak formant structure (though not its complete absence). Whatever formant structure is present is also shifted to higher frequencies. The air is throttled at the glottis, leading to turbulence and the corresponding shift. In terms of spectral power, all of this is reflected in a shift of the concentration of power to higher frequency bands. Based on these characteristics, a few detection techniques have been proposed. These techniques were explored for their performance on whispered speech and are briefly described in this section.
Spectral energy ratio
The energy ratio is a fast and easy to implement method that can detect large segments of whisper. The method uses the shift of spectral energy to higher frequencies to detect whisper. Quite simply, the spectral energy ratio is defined as the ratio of the energy in a higher frequency band to the energy in a lower frequency band. Hence
$$\mathrm{ER} = \frac{E_{\mathrm{high}}}{E_{\mathrm{low}}}$$
where $E_{\mathrm{high}}$ and $E_{\mathrm{low}}$ denote the spectral energies in the higher and lower bands, respectively.
The definition of the two bands is subjective. Usually the lower band is taken to be from 0 to 1 kHz and the higher band from 2.5 kHz to the upper end of the spectrum. The method is crude and mainly conceptual. The detection rate is very low, with only large segments being detected. The method is also expected to perform badly in noisy conditions, as noise has a direct effect on the spectral powers (Figure 2).
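The energy ratio described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the naive DFT, the 8 kHz sampling rate and the 20 ms frame are assumptions chosen for clarity, while the band edges (0 to 1 kHz and 2.5 kHz upward) follow the text.

```python
import cmath
import math

def band_energy(frame, fs, f_lo, f_hi):
    """Sum |DFT|^2 over the bins whose frequency lies in [f_lo, f_hi)."""
    n = len(frame)
    energy = 0.0
    for k in range(n // 2 + 1):
        freq = k * fs / n
        if f_lo <= freq < f_hi:
            x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                      for t in range(n))
            energy += abs(x_k) ** 2
    return energy

def spectral_energy_ratio(frame, fs, low=(0.0, 1000.0), high_start=2500.0):
    """High-band energy divided by low-band energy (band edges are assumptions)."""
    e_low = band_energy(frame, fs, low[0], low[1])
    e_high = band_energy(frame, fs, high_start, fs / 2 + 1)  # include Nyquist bin
    return e_high / (e_low + 1e-12)  # small constant avoids division by zero

# Hypothetical test frames: a 500 Hz tone and a 3 kHz tone, 20 ms at 8 kHz.
fs, n = 8000, 160
low_tone = [math.sin(2 * math.pi * 500 * t / fs) for t in range(n)]
high_tone = [math.sin(2 * math.pi * 3000 * t / fs) for t in range(n)]
print(spectral_energy_ratio(low_tone, fs), spectral_energy_ratio(high_tone, fs))
```

A frame dominated by low-frequency energy yields a ratio far below one, and a frame dominated by high-frequency energy yields a ratio far above one, which is the behaviour the method relies on.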
Spectral flatness is a measure of the ‘flatness’ of the speech spectrum. To explore the presence of white noise in a given signal, a spectral flatness measure (SFM) must be devised. At the same time, the measure should take into account the particular structure of speech, which means that penalties are to be assigned accordingly[16, 17] (Figure 3).
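Let $s(n)$ be a finite, real, discrete-time speech signal with Fourier transform $S(e^{j\theta})$. Applying Parseval's theorem, the energy of the signal is
$$E = \sum_{n} s^{2}(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left|S(e^{j\theta})\right|^{2} d\theta$$
The SFM that we need must satisfy normalized boundary conditions, namely one value for a perfectly flat spectrum and another for a spectrum with peaks and troughs. Let us consider the integrand
$$M(\theta) = \ln\frac{\left|S(e^{j\theta})\right|^{2}}{E}$$
We can observe that $M(\theta)$ is zero for a perfectly flat spectrum and takes non-zero finite values otherwise. Hence $M(\theta)$ is a candidate for measuring spectral flatness. However, a simple criterion such as the mean square criterion weighs positive and negative deviations from the mean value equally. This is not acceptable in the context of speech processing, as positive deviations are indicative of harmonic content. These should be weighed more heavily, making the spectrum look less flat. Thus we consider $e^{M(\theta)} - 1 - M(\theta)$, which is non-negative and penalizes positive deviations more heavily than negative ones, as required for speech signals. Consider its integral over one period
$$\gamma = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left[e^{M(\theta)} - 1 - M(\theta)\right] d\theta$$
Since $\frac{1}{2\pi}\int_{-\pi}^{\pi} e^{M(\theta)}\, d\theta = 1$ by the definition of $E$, this integral comes down to
$$\gamma = -\frac{1}{2\pi}\int_{-\pi}^{\pi} M(\theta)\, d\theta$$
Thus the seemingly complex integrand takes a very simple form upon integration. Hence $e^{-\gamma}$ is one for a perfectly flat spectrum and lies between 0 and 1 otherwise. This measure has therefore also been used to detect unvoiced segments of speech. It can also be written as
$$e^{-\gamma} = \frac{\exp\left[\frac{1}{2\pi}\int_{-\pi}^{\pi}\ln\left|S(e^{j\theta})\right|^{2} d\theta\right]}{\frac{1}{2\pi}\int_{-\pi}^{\pi}\left|S(e^{j\theta})\right|^{2} d\theta}$$
In discrete form, the above equation can be interpreted as the ratio of the geometric mean to the arithmetic mean of the spectral powers. With $P_k$, $k = 1, \ldots, N$, denoting the spectral power at the $k$th frequency bin, we have our SFM as
$$\mathrm{SFM} = \frac{\left(\prod_{k=1}^{N} P_k\right)^{1/N}}{\frac{1}{N}\sum_{k=1}^{N} P_k}$$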
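The discrete interpretation above, the ratio of the geometric mean to the arithmetic mean of the spectral powers, can be illustrated with a short sketch. The function name and the two toy spectra are assumptions used only for illustration; the geometric mean is computed in the log domain to avoid numerical underflow.

```python
import math

def spectral_flatness(power):
    """Geometric mean / arithmetic mean of strictly positive spectral powers."""
    n = len(power)
    log_mean = sum(math.log(p) for p in power) / n  # log of the geometric mean
    arith_mean = sum(power) / n
    return math.exp(log_mean) / arith_mean

# Hypothetical spectra: one perfectly flat, one with a single strong peak.
flat = [2.0] * 8
peaky = [100.0] + [0.01] * 7
print(spectral_flatness(flat), spectral_flatness(peaky))
```

The flat spectrum gives a value of one (up to rounding), and the peaky spectrum gives a value close to zero, matching the boundary conditions stated in the text.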
LP-MVDR spectral ratio method for detection of whispered speech
Linear prediction is the most widely used and accepted method for whisper detection and whisper-island detection. Other methods, such as vocal effort change point detection, have also been used for improved whisper detection within normally phonated audio streams. These methods in general parameterize an AR spectrum using a least squares error criterion. However, the LP model works well only for speech formants, as in voiced speech, and more specifically at low frequencies. The most undesirable effect of these techniques is that the LP model tends to overestimate the power at formant frequencies. Moreover, increasing the model order increases the overestimation rather than correcting it. Hence this method is able to resolve harmonics but is poor at estimating the power at the formant frequencies in the spectrum, which leads to poor characterization of the vocal tract transfer function. The MVDR spectrum, on the other hand, is capable of modeling the spectral power efficiently at all harmonic frequencies due to the nature of the estimation method. The MVDR spectrum also responds to an increase in model order by improving the model at higher harmonic frequencies. As whisper is characterized by formant shifts and an overall increased concentration of power in high frequency bands, the models must provide good modeling results in those bands. Thus exploiting the MVDR spectrum is desirable for robust whisper detection. Moreover, MVDR coefficients can be computed from the LP coefficients themselves, which makes the process computationally less expensive. In the following sections, the significance of the LP and MVDR methods in spectrum estimation of whispered speech is briefly discussed, followed by the proposed parametric spectral ratio method.
LP method for detection of whispered speech
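In LP, we approximate speech as an auto-regressive (AR) process. Hence the model of a discrete-time signal $s(n)$ is given by
$$\hat{s}(n) = -\sum_{k=1}^{m} a_k\, s(n-k)$$
where $m$ denotes the order of the AR model and the $a_k$ are the model constants that are to be estimated. When the signal is predicted using the AR model, the error $e(n)$ with respect to the actual signal can be written as
$$e(n) = s(n) - \hat{s}(n) = s(n) + \sum_{k=1}^{m} a_k\, s(n-k) \tag{9}$$
To obtain the parameters $a_k$, the least squares error technique is used. The idea is to minimize the total error with respect to each of the parameters. Hence
$$E = \sum_{n} e^{2}(n) \tag{10}$$
To find the parameters $a_k$, we set
$$\frac{\partial E}{\partial a_i} = 0, \qquad 1 \le i \le m$$
which gives
$$\sum_{k=1}^{m} a_k\, R(i-k) = -R(i), \qquad 1 \le i \le m \tag{13}$$
where $R(i)$ denotes the autocorrelation of the signal $s(n)$
$$R(i) = \sum_{n} s(n)\, s(n-i)$$
The autocorrelation matrix formed by $R(i-k)$ is a Toeplitz matrix. In matrix form, Equation (13) can be written as
$$\begin{bmatrix} R(0) & R(1) & \cdots & R(m-1) \\ R(1) & R(0) & \cdots & R(m-2) \\ \vdots & \vdots & \ddots & \vdots \\ R(m-1) & R(m-2) & \cdots & R(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix} = -\begin{bmatrix} R(1) \\ R(2) \\ \vdots \\ R(m) \end{bmatrix}$$
The solution to Equation (13) is found using the Levinson-Durbin recursion. The method requires only $2m$ storage locations and its time complexity is of the order of $m^{2} + O(m)$.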
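The Levinson-Durbin recursion mentioned above can be sketched as follows. This is an illustrative sketch rather than the authors' code; the function names are assumptions, and the sign convention is also an assumption: the returned coefficients are those of the prediction-error filter $A(z) = 1 + \sum_k a_k z^{-k}$.

```python
def autocorr(x, max_lag):
    """Autocorrelation R(0..max_lag) of a finite frame (autocorrelation method)."""
    n = len(x)
    return [sum(x[t] * x[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations recursively.

    Returns (a, err): a[0] = 1 and a[1..order] are the coefficients of the
    prediction-error filter; err is the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[k] * r[i - k] for k in range(1, i))
        refl = -acc / err                    # reflection coefficient of stage i
        prev = a[:]
        for k in range(1, i):
            a[k] = prev[k] + refl * prev[i - k]
        a[i] = refl
        err *= 1.0 - refl * refl             # error shrinks at every stage
    return a, err

# Hypothetical check: autocorrelation of an AR(1) process with pole at 0.9.
a, err = levinson_durbin([1.0, 0.9, 0.81], 2)
print(a, err)
```

For this autocorrelation sequence the recursion recovers a first-order model: the first coefficient is close to -0.9 and the second is close to zero.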
Inadequacy of LP method for detection of whispered speech
To understand the inadequacy of LP in the context of whispered speech, we need to analyze the error in the spectral domain. The inadequacy can be traced back to the least squares error method that was employed to determine the parameters $a_k$. A property of the least squares method in any error analysis is that it amplifies large errors and diminishes small ones. To begin with, in the spectral domain Equation (9) becomes
$$E(z) = A(z)\, S(z), \qquad A(z) = 1 + \sum_{k=1}^{m} a_k z^{-k}$$
The total error $E$ of Equation (10) can be written as
$$E = \sum_{n} e^{2}(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left|E(e^{j\omega})\right|^{2} d\omega$$
Also, the actual power spectrum is
$$P(\omega) = \left|S(e^{j\omega})\right|^{2}$$
In terms of $P(\omega)$, the total error $E$ can be written as
$$E = \frac{1}{2\pi}\int_{-\pi}^{\pi} P(\omega)\left|A(e^{j\omega})\right|^{2} d\omega$$
Following the procedure of error minimization using the least squares method, we arrive at
$$\sum_{k=1}^{m} a_k\, R(i-k) = -R(i), \qquad R(i) = \frac{1}{2\pi}\int_{-\pi}^{\pi} P(\omega)\cos(i\omega)\, d\omega$$
It is well known that an infinite order AR model can always model the signal arbitrarily closely. Limiting the order of the AR model to $m$, however, leads to an approximation. Also, as the signal is finite, the error can at best be minimized with a finite order AR model, as seen before. Let the minimum possible error for this model be $E_{\min}$. Hence the model becomes
$$\hat{S}(z) = \frac{G}{A(z)}, \qquad G^{2} = E_{\min}$$
with non-zero $E_{\min}$ and $A(z)$ as defined above. The estimated power of the speech signal can now be calculated by taking the square of the modulus as
$$\hat{P}(\omega) = \left|\hat{S}(e^{j\omega})\right|^{2} = \frac{G^{2}}{\left|A(e^{j\omega})\right|^{2}}$$
The total prediction error of Equation (10) can hence be written as
$$E = \frac{G^{2}}{2\pi}\int_{-\pi}^{\pi}\frac{P(\omega)}{\hat{P}(\omega)}\, d\omega$$
Thus minimizing the total error $E$ is equivalent to minimizing the ratio of the actual power to the estimated power, integrated over the entire interval. Hence the LP problem can be viewed in terms of the ratio of power spectra and its minimization over the interval. The coefficients subsequently calculated via the least squares method are found using the $R(i)$, whose relation with the power spectrum has already been established in Equation (21). By analogy, the expected correlation coefficients can be found from the predicted power spectrum using
$$\hat{R}(i) = \frac{1}{2\pi}\int_{-\pi}^{\pi}\hat{P}(\omega)\cos(i\omega)\, d\omega$$
It follows that, since the spectrum depends on the autocorrelation, the window over which the autocorrelation is computed becomes very important. The process generating the signal must be such that it can be considered stationary over the duration for which the autocorrelation is calculated.
Another, more significant conclusion in the light of whispered speech can be drawn from the error equations. The integral of the ratio of powers is driven to a fixed value. This means that the ratio will be less than one at some frequencies and greater than one at others, so that the two compensate each other. Moreover, the minimization is applied uniformly, with no regard to the fact that some parts of the spectrum have low energy and others have high energy. To see this, let us consider two cases with respect to the spectral ratio.
Case 1: the estimated power $\hat{P}(\omega)$ is overestimated by $\varepsilon$ times the actual power $P(\omega)$, i.e. $\hat{P}(\omega) = (1+\varepsilon)P(\omega)$
The ratio within the integral in this case turns out to be
$$\frac{P(\omega)}{\hat{P}(\omega)} = \frac{1}{1+\varepsilon}$$
Case 2: the estimated power $\hat{P}(\omega)$ is underestimated by $\varepsilon$ times the actual power $P(\omega)$, i.e. $\hat{P}(\omega) = (1-\varepsilon)P(\omega)$ with $0 < \varepsilon < 1$
The ratio within the integral in this case turns out to be
$$\frac{P(\omega)}{\hat{P}(\omega)} = \frac{1}{1-\varepsilon}$$
Comparing the two cases, the effect on the error is greater when $P(\omega) > \hat{P}(\omega)$ than when the inequality sign is reversed: for the same $\varepsilon$, the ratio departs from unity by $\varepsilon/(1-\varepsilon)$ in Case 2 but only by $\varepsilon/(1+\varepsilon)$ in Case 1. Hence the error in Case 2 is much larger than in Case 1. However, error minimization strategies do not take these cases into account, which results in overestimation at certain frequencies. Since LP gives a relatively smooth polynomial, the compounded 'less than one' parts of the spectrum are liable to be compensated at the harmonics. This overestimation at crucial harmonic frequencies is liable to give poor modeling results. Since in whisper detection as well as recognition the focus is mainly on correctly determining 'how good' the harmonics are, overestimation is an undesirable result. The inaccurate modeling at high frequencies by LP also compounds the problem for whisper detection.
The MVDR method for detection of whispered speech
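The minimum variance distortion-less response spectrum was introduced by Capon, and is also known as the Capon spectrum or the maximum likelihood method (MLM) spectrum. The MVDR spectrum is a well-known method in array processing applications, and appears to be promising in other applications such as speech. MVDR can be looked upon as a filter design problem in which the aim is to design a filter bank that satisfies a certain constraint, known as the 'distortion-less' constraint, centered at one of the analysis frequencies[23, 24]. This means that at the frequency of interest $\omega_l$, the gain of the transfer function should be unity, i.e.
$$H_l(e^{j\omega_l}) = \sum_{k=0}^{M} h_l(k)\, e^{-j\omega_l k} = \mathbf{v}^{H}(\omega_l)\,\mathbf{h}_l = 1$$
In addition to this, the variance of the output must be minimized subject to the above constraint. Hence the resultant filter for a frequency $\omega_l$ is obtained by solving the optimization problem
$$P_{MV}(\omega_l) = \min_{\mathbf{h}_l}\ \mathbf{h}_l^{H} R_x\, \mathbf{h}_l \quad \text{subject to} \quad \mathbf{v}^{H}(\omega_l)\,\mathbf{h}_l = 1$$
where $P_{MV}(\omega_l)$ is the spectral power at the frequency $\omega_l$. This ensures that MVDR faithfully preserves the input signal at the frequency $\omega_l$. This is a major difference between MVDR and LP, because MVDR can give a faithful response even at high frequencies. However, the filter design problem in the case of MVDR is only conceptual. It can be shown that the MVDR spectrum for all frequencies can be calculated as
$$P_{MV}(\omega) = \frac{1}{\mathbf{v}^{H}(\omega)\, R_x^{-1}\, \mathbf{v}(\omega)}$$
where $R_x$ is the $(M+1)\times(M+1)$ data autocorrelation matrix and
$$\mathbf{v}(\omega) = \left[1,\ e^{j\omega},\ e^{j2\omega},\ \ldots,\ e^{jM\omega}\right]^{T}$$
This estimate has some interesting properties, which we briefly mention below. It can be efficiently computed by exploiting its relationship with LP methods as
$$P_{MV}(\omega) = \frac{1}{\sum_{k=-M}^{M}\mu(k)\, e^{-j\omega k}}$$
where the parameters $\mu(k)$ are obtained by a simple non-iterative computation involving the coefficients $a_i$ of the order-$M$ LP prediction-error filter (with $a_0 = 1$) and the prediction error variance $R_e$, as
$$\mu(k) = \begin{cases} \dfrac{1}{R_e}\displaystyle\sum_{i=0}^{M-k}(M+1-k-2i)\, a_i\, a_{i+k}^{*}, & k = 0, \ldots, M \\[2ex] \mu^{*}(-k), & k = -M, \ldots, -1 \end{cases}$$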
The MVDR method is analogous to the periodogram, whose estimate at any frequency can be interpreted as the output of a band pass filter centered at that frequency. The periodogram can hence be interpreted as a bank of band pass filters that are data and frequency independent, like most non-parametric methods. MVDR can also be considered as a bank of filters constrained by a set of conditions. However, each of these filters is both data and frequency dependent. Its robustness as a recognition feature has been verified in this context.
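The non-iterative computation of the MVDR spectrum from LP coefficients can be sketched as below. This is a minimal illustration under stated assumptions: the coefficients are real, `a[0] = 1` with `a[1..M]` the prediction-error filter coefficients, `err` is the prediction error variance, and the function names are hypothetical.

```python
import math

def mvdr_mu(a, err):
    """mu(k) for k = 0..M from prediction-error filter a (a[0] = 1)."""
    m = len(a) - 1
    return [
        sum((m + 1 - k - 2 * i) * a[i] * a[i + k] for i in range(m - k + 1)) / err
        for k in range(m + 1)
    ]

def mvdr_spectrum(a, err, n_freq):
    """MVDR power at n_freq frequencies uniformly spaced over [0, pi)."""
    mu = mvdr_mu(a, err)
    spec = []
    for j in range(n_freq):
        w = math.pi * j / n_freq
        # For real coefficients mu(-k) = mu(k), so the two-sided sum over
        # k = -M..M collapses to a cosine series.
        denom = mu[0] + 2.0 * sum(mu[k] * math.cos(w * k)
                                  for k in range(1, len(mu)))
        spec.append(1.0 / denom)
    return spec

# Hypothetical first-order example: A(z) = 1 - 0.9 z^-1, error variance 0.19.
spec = mvdr_spectrum([1.0, -0.9], 0.19, 4)
print(spec)
```

For this first-order example the result can be checked by hand against the matrix form: with autocorrelation values 1 and 0.9, the quadratic form at zero frequency equals 0.2/0.19, so the MVDR power there is 0.95.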
Dependence of the MVDR method on model order in the context of whispered speech
Let us assume that the voiced speech signal is perfectly harmonic over a short duration. This oversimplifies speech so that the analysis can focus on the harmonics alone. The harmonics can be modeled as
$$s(n) = \sum_{k=1}^{L} c_k \cos(\omega_0 k n + \phi_k)$$
where $L$ is the number of harmonics. The pitch of such a system can be clearly seen to be $\omega_0$. The correlation will hence be
$$r(\tau) = \sum_{k=1}^{L}\frac{|c_k|^{2}}{4}\left[e^{j\omega_0 k \tau} + e^{-j\omega_0 k \tau}\right]$$
The MVDR filter $h_l(n)$ designed for the $l$th harmonic will hence try to preserve the input power at that frequency faithfully while trying to have minimum power at the other harmonics.
For the MVDR filter at the frequency $\omega_l = l\omega_0$, the output power is
$$P_{MV}(\omega_l) = \frac{|c_l|^{2}}{4} + \sum_{\substack{k=1 \\ k \ne l}}^{L}\frac{|c_k|^{2}}{4}\left|H_l(e^{j\omega_0 k})\right|^{2} + \sum_{k=1}^{L}\frac{|c_k|^{2}}{4}\left|H_l(e^{-j\omega_0 k})\right|^{2}$$
If the MVDR filter has M filter zeros and M>2L−1, then it has enough zeros to cancel all the other input exponentials and can estimate the exact power at the harmonic. Otherwise there will be a positive bias in the spectral estimate. Hence spectral estimates are bound to improve at every frequency as the model order increases. This is in direct contrast to LP spectra, where an increase in order leads to more overestimation at the harmonics.
Proposed LP-MVDR spectral ratio method for detection of whispered speech
It is important to note that the MVDR spectrum is smoother than the LP spectrum. This is because the MVDR spectrum of order $M$ at any frequency can be represented as a harmonic average of the LP spectra of orders 0 through $M$:
$$\frac{1}{P_{MV}^{(M)}(\omega)} = \sum_{k=0}^{M}\frac{1}{P_{LP}^{(k)}(\omega)}$$
where $P_{LP}^{(k)}(\omega)$ denotes the LP spectrum of order $k$.
This averaging effect smooths out the spectrum in the regions of sharp rise, i.e. at the harmonics. Thus the MVDR spectrum tends to have a lower amplitude than the corresponding LP spectrum at the harmonics. Figure 4 shows the various spectra for a short segment of neutral speech. From the illustration in Figure 5, it is clear that an LP to MVDR ratio spectrum can be used to identify the whispered segments in speech, since this ratio is expected to be high where the speech signal has more significant harmonics in the higher frequency region than the neutral speech spectrum. In the context of whispered speech, where the harmonic shifts are prominent, this ratio is expected to be high. The ratio is also expected to be robust to wide band noise, because an LP spectrum of high order can still model the spectrum and the averaging effect of MVDR will eliminate the effect of wide band noise when the spectral ratio is taken. Formally, the LP-MVDR spectral ratio is defined as
$$\rho(\omega) = \frac{P_{LP}(\omega)}{P_{MV}(\omega)}$$
The LP-MVDR ratio spectrum is computed on a short time basis for each speech data window. The ratio spectrum is further smoothed, and a threshold is chosen depending on the penalties fixed for the false alarm rate and the detection failure rate. The whispered segments can then be separated from the normally phonated segments of speech. The salient steps used for whisper detection using the LP-MVDR ratio spectrum are listed below.
Apply a Hamming window to the test speech waveform using a frame size of 20 ms and a frame overlap of 50%.
Compute the MVDR coefficients using Equation (35), for each frame.
Compute the smooth MVDR Power spectrum for sufficient number of frequency points (1024 in our work).
Compute the linear predictor coefficients for each frame.
Compute the LP power spectrum for the same number of frequency points as used for computing the MVDR power spectrum.
Compute the LP to MVDR ratio spectrum in each frame.
Calculate the maximum of ratio values in each frame and using these values form a vector called maxratio.
Scale maxratio by its maximum value.
Select the threshold according to the penalties required for false alarm rate and detection failure rate.
Segment whispered speech at boundaries of change.
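The steps listed above can be sketched end to end as follows. This is a hedged illustration, not the authors' implementation: the LP order (8), the number of frequency points (64 rather than the 1024 used in the article), the threshold value and all function names are assumptions, the smoothing of the ratio spectrum is omitted, and the LP coefficients are taken to be those of the prediction-error filter with `a[0] = 1`.

```python
import cmath
import math
import random

def hamming(n):
    return [0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)) for t in range(n)]

def autocorr(x, max_lag):
    return [sum(x[t] * x[t - lag] for t in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    a, err = [1.0] + [0.0] * order, r[0]
    for i in range(1, order + 1):
        refl = -(r[i] + sum(a[k] * r[i - k] for k in range(1, i))) / err
        prev = a[:]
        for k in range(1, i):
            a[k] = prev[k] + refl * prev[i - k]
        a[i] = refl
        err *= 1.0 - refl * refl
    return a, err

def lp_power(a, err, w):
    resp = sum(a_k * cmath.exp(-1j * w * k) for k, a_k in enumerate(a))
    return err / abs(resp) ** 2

def mvdr_power(a, err, w):
    m = len(a) - 1
    denom = 0.0
    for k in range(m + 1):
        mu_k = sum((m + 1 - k - 2 * i) * a[i] * a[i + k]
                   for i in range(m - k + 1)) / err
        denom += mu_k if k == 0 else 2.0 * mu_k * math.cos(w * k)
    return 1.0 / denom

def max_ratio_per_frame(signal, frame_len, order=8, n_freq=64):
    """Maximum LP-to-MVDR ratio in each Hamming-windowed, 50%-overlapped frame."""
    hop = frame_len // 2
    window = hamming(frame_len)
    freqs = [math.pi * j / n_freq for j in range(n_freq)]
    out = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(signal[start:start + frame_len], window)]
        a, err = levinson_durbin(autocorr(frame, order), order)
        out.append(max(lp_power(a, err, w) / mvdr_power(a, err, w) for w in freqs))
    return out

def detect_whisper(signal, frame_len, threshold=0.5):
    """Scale maxratio by its maximum and flag frames above a chosen threshold."""
    raw = max_ratio_per_frame(signal, frame_len)
    peak = max(raw)
    scaled = [v / peak for v in raw]
    return scaled, [v > threshold for v in scaled]

# Hypothetical input: a noisy 200 Hz tone, 0.1 s at 8 kHz, 20 ms frames.
rng = random.Random(0)
sig = [math.sin(2 * math.pi * 200 * t / 8000) + 0.1 * rng.gauss(0.0, 1.0)
       for t in range(800)]
scaled, flags = detect_whisper(sig, 160)
print(len(scaled), max(scaled))
```

Because the MVDR spectrum is a harmonic average of LP spectra, the raw ratio is never below one; the scaled vector therefore lies in (0, 1], and frames are flagged as whispered when it exceeds the threshold, in line with the higher ratio reported for whispered segments. In practice the threshold would be tuned from the false alarm and detection failure penalties.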
The flow diagram illustrating the proposed sequence of steps is given in Figure 6. A long segment of speech containing both neutral and whispered segments is considered, and the LP-MVDR ratio spectrum is computed for it. Figure 7 illustrates a smoothed LP-MVDR ratio spectrum for a speech signal containing both neutral and whispered segments. From the LP-MVDR ratio histograms of neutral speech and whispered speech shown in Figure 8, it is clear that the whispered segment exhibits a higher value of spectral ratio than the neutral segment.
However the proposed method is not able to detect the exact points where the speaker shifts from normal speech to whispered speech and vice versa. This can lead to segmentation issues in the subsequent recognition process. Hence there is a need to explore the possibility of fine tuning the position where the speaker has shifted from one mode of speech to another. Hence we propose to use the BIC. This method is discussed in the subsequent section.
Refining the whisper segment boundaries using BIC
To overcome the segmentation issues in detecting the neutral-to-whisper or whisper-to-neutral change point, we apply the BIC to the LP-MVDR ratio obtained for the speech signal. The BIC is a model selection criterion that helps to maximize model performance. Model selection involves choosing the variables that yield the best statistical model for a given variable set, while avoiding issues such as over-fitting that eventually lead to less efficient models. Another criterion, the Akaike information criterion, is also used, but it imposes a lower penalty on an increase in the number of model parameters and hence is not preferred. The BIC, as listed in Equation (42), has been used extensively for speaker segmentation in the literature[29, 30].
$$\mathrm{BIC} = -2\ln L + \lambda\, k \ln n$$
where $L$ is the maximized value of the likelihood function for the estimated model, $n$ is the number of data points, $k$ is the number of free parameters to be estimated and $\lambda$ is the penalty factor, assumed to be 1 in our case. Given any two estimated models, the model with the lower value of BIC is preferred. Under the assumption that the model errors are i.i.d. and that the derivative of the log likelihood with respect to the true variance is equal to zero, the BIC can be rewritten as
$$\mathrm{BIC} = n\ln\hat{\sigma}_e^{2} + \lambda\, k \ln n$$
where $\hat{\sigma}_e^{2}$ is the estimated error variance.
Hence $\Delta\mathrm{BIC}$ can be effectively used to find the change in vocal effort. Given two audio segments, $X = \{x_1, \ldots, x_{N_x}\}$ and $Y = \{y_1, \ldots, y_{N_y}\}$, the binary hypothesis testing problem in this context of whisper detection can be formulated as
$$H_0:\ x_1, \ldots, x_{N_x}, y_1, \ldots, y_{N_y} \sim \mathcal{N}(\mu, \Sigma)$$
$$H_1:\ x_1, \ldots, x_{N_x} \sim \mathcal{N}(\mu_x, \Sigma_x), \qquad y_1, \ldots, y_{N_y} \sim \mathcal{N}(\mu_y, \Sigma_y)$$
$H_0$ is the hypothesis that $X$ and $Y$ belong to the same multivariate Gaussian distribution, while $H_1$ claims that they are derived from two distinct multivariate Gaussian distributions. Hence $\Delta\mathrm{BIC}$ can be computed as the difference between the BIC values under $H_0$ and $H_1$. With $N = N_x + N_y$,
$$\Delta\mathrm{BIC} = N\ln|\Sigma| - N_x\ln|\Sigma_x| - N_y\ln|\Sigma_y| - \lambda\left(d + \frac{d(d+1)}{2}\right)\ln N$$
where $d$ is the dimension of the feature vector and $\lambda$ is the penalty factor, assumed to be 1 in our case. The larger the value of $\Delta\mathrm{BIC}$, the more dissimilar the two segments. For the detection of a single change point among $n$ data points, the candidate boundary $i$, with $i_{\min} < i < n - i_{\min}$, splits the data into the segments
$$X_i = \{x_1, \ldots, x_i\}, \qquad Y_i = \{x_{i+1}, \ldots, x_n\}$$
and the change point is estimated as $\hat{i} = \arg\max_i \Delta\mathrm{BIC}(i)$.
If $\Delta\mathrm{BIC}(i) > 0$, a change has occurred, and the point of change is the time index with the maximum value of $\Delta\mathrm{BIC}$; otherwise there is no change point in the data. Note that a minimum value $i_{\min}$ needs to be ascertained, below which the number of data points is insufficient to give a reliable estimate of the BIC. Empirically, it is sufficient to keep $i_{\min}$ at 30 to 50 data points[31, 32] in the context of whisper detection, so we used a fixed search window with an initial length of 50 data points. From Figure 9, it is clear that if there is a change point then $\Delta\mathrm{BIC}(i) > 0$.
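The change point search described above can be illustrated for a one-dimensional feature sequence such as the scaled maximum ratio values. This sketch assumes $d = 1$, so each Gaussian has two free parameters (mean and variance) and the covariance determinants reduce to scalar variances; the function names and the synthetic sequences are hypothetical.

```python
import math

def variance(x):
    """Maximum likelihood variance, floored to keep the logarithm finite."""
    mean = sum(x) / len(x)
    return max(sum((v - mean) ** 2 for v in x) / len(x), 1e-12)

def delta_bic(x, i, lam=1.0):
    """BIC(H0) - BIC(H1) for a split of x into x[:i] and x[i:], with d = 1."""
    n = len(x)
    penalty = lam * 2.0 * math.log(n)  # d + d(d+1)/2 = 2 when d = 1
    return (n * math.log(variance(x))
            - i * math.log(variance(x[:i]))
            - (n - i) * math.log(variance(x[i:]))
            - penalty)

def find_change_point(x, i_min=30, lam=1.0):
    """Index with the largest positive delta BIC, or None if no change is found."""
    candidates = range(i_min + 1, len(x) - i_min)
    best = max(candidates, key=lambda i: delta_bic(x, i, lam))
    return best if delta_bic(x, best, lam) > 0 else None

# Hypothetical sequences: a jump from about 0.2 to about 0.8 after 60 points,
# and a sequence with no change.
jump = [0.2 + 0.01 * (-1) ** t for t in range(60)] + \
       [0.8 + 0.01 * (-1) ** t for t in range(60)]
steady = [0.5 + 0.01 * (-1) ** t for t in range(120)]
print(find_change_point(jump), find_change_point(steady))
```

For the first sequence the search returns index 60, where the level changes; for the second it returns `None` because the penalty term outweighs any gain from splitting.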
Whispered speech recognition using LP-MVDR spectral ratio method and MLLR
A general methodology followed in whispered speech recognition systems is to segment speech at the whisper boundaries and subsequently use statistical models such as the hidden Markov model (HMM) to perform recognition. A block diagram of the proposed ASR system, which incorporates the speech mode changes between neutral and whispered speech, is illustrated in Figure 10. Developing large databases of whispered speech is neither practical nor cost-effective. Hence a system that can use a small number of files to model the transformation from one mode to another is important. An attempt in this direction is made here by applying the existing maximum likelihood linear regression (MLLR) adaptation, which has primarily been used to model differences due to changes in environment or the non-nativeness of speakers. The following section gives a brief description of the MLLR adaptation technique.
Model adaptation using MLLR
Speaker adaptation methods aim to adapt data independent statistical models to more specific models (for a particular speaker or environment) using only a small amount of new adaptation data. MLLR is one such technique. It estimates linear transformations for groups of model parameters so as to maximize the likelihood of the adaptation data. The linear transformations shift the component means and alter the variances in the independent system so that each state in the HMM system is more likely to generate the adaptation data[33, 34].
$$\hat{\mu} = A\mu + b$$
where $b$ is a bias vector and $A$ can be either a diagonal or a full transformation matrix. Other techniques modify both means and variances using a bias and a transformation matrix. The method proposed in this work uses a transformation matrix that computes an adapted mean, given by
$$\hat{\mu} = W\xi$$
where $W$ is the $n\times(n+1)$ transformation matrix and $\xi$ is the extended mean vector given by
$$\xi = \left[w,\ \mu_1,\ \mu_2,\ \ldots,\ \mu_n\right]^{T}$$
in which $w$ is the bias offset. For linear normalization the transformed variance is of the form
$$\hat{\Sigma} = H\,\Sigma\,H^{T}$$
where $H$ is the $n\times n$ transformation matrix. To model the speech data according to the procedure described above, HMM based normalization of means and variances was used. In the succeeding section, the results of the performance evaluation conducted on both detection and recognition of whispered speech are discussed.
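Applying an already estimated MLLR mean transform amounts to a single matrix-vector product on the extended mean vector. The sketch below shows only this application step; estimating the transform from adaptation data is not shown. The function name, the toy transform and the bias offset of 1 are assumptions for illustration.

```python
def mllr_adapt_mean(W, mean, offset=1.0):
    """Return W @ xi, where xi = [offset, mean_1, ..., mean_n].

    W is an n x (n + 1) transform given as a list of rows; its first column
    acts as the bias and the remaining columns as the linear transformation.
    """
    xi = [offset] + list(mean)
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, xi)) for row in W]

# Hypothetical 2-dimensional example with a made-up transform.
W = [[0.5, 1.0, 0.0],
     [-1.0, 0.0, 2.0]]
adapted = mllr_adapt_mean(W, [2.0, 3.0])
print(adapted)
```

In a recognizer, the same transform would typically be shared by a group of Gaussian components, which is what allows adaptation from a small amount of whispered data.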
Performance evaluation: detection of whispered speech
Two databases were used to analyze the performance of the proposed spectral ratio method for whispered speech detection, namely the CHAINS corpus and a whispered speech corpus collected over the cell phone. The subsequent sections describe the databases used and present the experimental results on whisper detection using ROC curves. The performance of the proposed whisper detection algorithm is also presented in terms of the whisper diarization error rate.
CHAINS is a research project funded by Science Foundation Ireland from April 2005 to March 2009. Its goal is to advance the science of speaker identification by investigating those characteristics of a person's speech that make them unique. The corpus was recorded in the early phase of the project. The corpus is freely available to researchers, both through its website and through the Linguistic Data Consortium. The corpus features approximately thirty-six speakers recorded under a variety of speaking conditions, allowing comparison of the same speaker across different well-defined speech styles. Speakers read a variety of texts alone, in synchrony with a dialect-matched co-speaker, in imitation of a dialect-matched co-speaker, in a whisper, and at a fast rate. There is also an unscripted spontaneous retelling of a read fable. The bulk of the speakers were speakers of Eastern Hiberno-English.
The solo recording session was carried out in a professional recording studio in December 2005 and speakers were recorded in a sound-attenuated booth. The recordings in the released corpus were done using a Neumann U87 condenser microphone. The whisper recording session from March 2006 to May 2006 was carried out in a quiet office environment, using an AKG C420 headset condenser microphone. In solo reading the speakers were asked to read a prepared text at a comfortable rate and volume. In whisper recording, the speakers read all text in a whisper. Any involuntary switch to modal voicing was interpreted as a disfluency and led to a restart of the phrase.
Text categories in the CHAINS corpus
The corpus texts can be divided into two categories. The first category contains famous fables recorded as continuous speech. The second category contains short sentences. To provide good phonetic coverage, there are thirty-three individual sentences, of which nine are selected from the CSLU Speaker Identification corpus and twenty-four from the TIMIT corpus.
Cellphone whispered speech data corpus
This corpus was developed at the Indian Institute of Technology Kanpur. The speakers lived on campus and were fluent in English with an Indian accent. They came from different parts of India and hence had varied Indian accents. The data was recorded using cell phone calls placed over an open source Asterisk server.
The IVRS system was hosted on an Intel Orgi. G31 machine. The telephony card used was Sangoma’s A200/Remora FXO/FXS Analog AFT card. This card supports up to 2.048 Mbps of full duplex data throughput and up to thirty voice calls over an E1 line. The files were recorded in uncompressed wave file format. The sampling rate was 8 kHz with 16 bits per sample. A single channel was used for the recording. The data was encoded as 16-bit signed integer pulse-code modulation (PCM).
Text categories in cellphone whispered speech data corpus
The corpus texts are divided into two categories. The first category contains recordings of five TIMIT sentences spoken in neutral and whispered modes. The sentences spoken are listed in Table 1. The second category contains the digits 0–9 spoken one at a time in both modes. A total of seventy speakers took part in the recording.
Experiments on detection of whispered speech
Whisper detection performance was evaluated using the fables part of the CHAINS corpus and the sentences section of the cellphone whispered speech data corpus. One neutral part and one whispered part (10 s each) of the same speaker were concatenated at a fixed interval. Thirty such concatenated sentences, fifteen from the cellphone whispered speech data corpus and fifteen from the CHAINS database, were formed and detection was performed on each. Scaled additive white Gaussian noise (AWGN) is used to simulate the various signal to noise ratios (SNRs), and the SNRs are calculated from the concatenated sentence. The noisy signal is generated as $S_n(n) = S_c(n) + \alpha\, w(n)$, where $S_n(n)$ is the noisy signal, $S_c(n)$ is the clean signal, $w(n)$ is the AWGN and $\alpha$ is a scaling variable. By varying $\alpha$ we obtain different SNRs. Two performance measures were used in the evaluation: the receiver operating characteristic (ROC) curve, which is a plot of true positive rate (TPR) versus false positive rate (FPR), and the whisper diarization error rate (WDER). The TPR is defined as
$\mathrm{TPR} = \frac{TP}{TP + FN}$
while the FPR is defined as
$\mathrm{FPR} = \frac{FP}{FP + TN}$
where $TP$, $FN$, $FP$ and $TN$ denote the numbers of true positive, false negative, false positive and true negative decisions, respectively (the standard definitions).
On the other hand WDER computes the diarization error similar to that used in speaker diarization. A performance index similar to the one in, is used in this work. The possible errors in whisper detection are generally the false alarm (FA) and the detection failure (DF). Hence the WDER using the above sources of error can be defined as
where $N_f$ denotes the total number of speech frames, and $C_1$ and $C_2$ are the weights assigned to FA and DF respectively, where FA = FPR and DF = 1 − TPR. Note that $C_1$ and $C_2$ are selected based on the penalties fixed for the false alarm rate and the DF rate, respectively.
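The noise scaling and the frame-level rates described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names (`alpha_for_snr`, `add_awgn`, `tpr_fpr`) are hypothetical, signals are plain Python lists, and the SNR is taken as the ratio of average clean-signal power to average scaled-noise power over the whole concatenated sentence. The WDER weighting is not reproduced.

```python
import math
import random


def alpha_for_snr(clean, noise, snr_db):
    """Scale factor alpha so that clean + alpha * noise has the requested SNR in dB."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(w * w for w in noise) / len(noise)
    # SNR_dB = 10 * log10(p_clean / (alpha^2 * p_noise)), solved for alpha
    return math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))


def add_awgn(clean, snr_db, seed=0):
    """Return (noisy, alpha, noise) with noisy[n] = clean[n] + alpha * noise[n]."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    alpha = alpha_for_snr(clean, noise, snr_db)
    noisy = [s + alpha * w for s, w in zip(clean, noise)]
    return noisy, alpha, noise


def tpr_fpr(truth, predicted):
    """Frame-level TPR and FPR; truthy entries mark whispered frames."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    return tp / (tp + fn), fp / (fp + tn)
```

In such a sketch the false alarm and detection failure terms entering the WDER would follow directly as FA = FPR and DF = 1 − TPR.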
Experimental results on CHAINS database
The receiver operating characteristic (ROC) curves are computed for the proposed technique and for conventional techniques such as the spectral flatness measure (SFM) and Linear predictive residual spatial audio coding (LPRSAC) under different SNRs. From Figure 11, we can see that all the techniques perform reasonably well in the absence of noise. The LP to MVDR (LP-MVDR) technique performs better than both conventional techniques, SFM and LPRSAC. The initially poor response of spectral flatness can be attributed to the presence of short pauses in the data, which show a high spectral flatness value and are therefore often classified as whispered speech segments. To calculate the WDER, individual thresholds are set for each method such that all the methods give equal TPR, and the FPR at this TPR is read from the ROC curve. The WDER is then determined from the FPR and TPR. The WDER after choosing appropriate thresholds is given in Table 2.
The results indicate better performance by the LP-MVDR spectrum than by the other conventional techniques. Without added noise (Figure 11), all the methods perform well. At an SNR of 50 dB, shown in Figure 12, the performance of all the techniques worsens. The LPRSAC seems to fail at this noise level, with a very poor response. However, the LP-MVDR and spectral flatness methods still perform reasonably well. At a lower SNR of 45 dB, no change is observed in the spectral flatness response, as shown in Figure 13. The LPRSAC curve almost coincides with the reference line of the ROC plot and hence gives essentially random results. LP-MVDR still performs reasonably well, with some degradation, as can be observed from Figure 12. At an SNR of 35 dB, LPRSAC fails completely and in fact shows an inverted characteristic, as shown in Figure 14. The spectral flatness response remains consistent, following a curve similar to that for an SNR of 45 dB. However, the LP-MVDR performance degrades severely at high TPRs and falls below that of spectral flatness in this region. This is probably due to the presence of short pauses, which were not removed in the experiments.
It was observed that while spectral flatness is very consistent in its performance when white Gaussian noise is added, LP-MVDR almost always performs better. LPRSAC performs poorly in comparison to either of these techniques under noisy conditions. An initial dip in the spectral flatness curve is observed; it can be explained by the presence of short pauses in the speech waveform, which consist mostly of background noise. Since spectral flatness gives a high value for white noise because of its wide-band flatness, these segments are categorized as whispered although they may have been parts of neutral speech. LP-MVDR shows a sharp rise in TPR with little rise in FPR in all cases. This is a desirable result and supports the efficacy of the proposed method.
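The threshold selection described above, where each detector is compared at a common TPR, can be sketched as follows. This is a minimal illustration, not the procedure used in the experiments: it assumes each method produces one score per frame, that larger scores indicate whisper, and the names `roc_points` and `operating_point` are hypothetical.

```python
def roc_points(scores, truth):
    """Sweep a threshold over the scores; frames with score >= threshold are labelled whisper.

    Returns (threshold, TPR, FPR) tuples ordered from the strictest threshold down,
    so TPR is non-decreasing along the list.
    """
    pos = sum(1 for t in truth if t)
    neg = len(truth) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, t in zip(scores, truth) if s >= thr and t)
        fp = sum(1 for s, t in zip(scores, truth) if s >= thr and not t)
        points.append((thr, tp / pos, fp / neg))
    return points


def operating_point(points, target_tpr):
    """First ROC point whose TPR reaches the target, i.e. the strictest such threshold."""
    for thr, tpr, fpr in points:
        if tpr >= target_tpr:
            return thr, tpr, fpr
    raise ValueError("target TPR is never reached")


# Toy scores for six frames; 1 marks a whispered frame.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
truth = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, truth)
```

Running `operating_point` with the same `target_tpr` on the ROC points of each method gives the FPR values that can then be compared at equal TPR.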
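The behaviour of spectral flatness on noise-like segments can be checked with a small sketch. It uses the usual definition of spectral flatness as the ratio of the geometric mean to the arithmetic mean of the power spectrum; the frame length, the naive DFT and the small floor added to each bin are illustrative assumptions rather than the settings used in the experiments.

```python
import cmath
import math
import random


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2, with a tiny floor to avoid log(0)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 + 1e-12)
    return spectrum


def spectral_flatness(frame):
    """Geometric mean divided by arithmetic mean of the power spectrum (0 to 1)."""
    p = power_spectrum(frame)
    log_mean = sum(math.log(v) for v in p) / len(p)
    return math.exp(log_mean) / (sum(p) / len(p))


rng = random.Random(1)
n = 256
noise_frame = [rng.gauss(0.0, 1.0) for _ in range(n)]              # pause-like background noise
tone_frame = [math.sin(2 * math.pi * 16 * t / n) for t in range(n)]  # strongly harmonic frame

flat_noise = spectral_flatness(noise_frame)
flat_tone = spectral_flatness(tone_frame)
```

The white-noise frame yields a flatness far above that of the single-tone frame, which is consistent with noisy pauses being pushed towards the whisper class by a flatness threshold.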
Experimental results on cellphone whispered speech data corpus
Similar whisper detection experiments were conducted on the cellphone whispered speech data corpus. The results obtained are presented in this section. The LPRSAC performs poorly on the cell phone corpus. This is probably due to the frequency cut-off in telephone channels, which is usually a band-pass filter passing 300–3300 Hz with a center frequency of around 1 kHz. This frequency range not only affects the first harmonic adversely, as it is around 400 Hz, but also removes a large portion of the high frequency noise component that is characteristic of whispered speech. From Figure 15, it can be noted that the LPRSAC method fails even in clean conditions. The proposed spectral ratio method performs poorly at high FPR values. At an SNR of 50 dB, as shown in Figure 16, the performance of LP-MVDR improves. The spectral flatness method also shows a steep rise, just as in the case of the clean signal. At a lower SNR of 45 dB, the LP-MVDR performance worsens slightly, with the TPR changing less for a given change in FPR. Spectral flatness is still almost the same, with the same steep rise in TPR. At an SNR of 35 dB, the LP-MVDR performance improves. This is probably due to the pronounced effect of noise on the whisper segments, as discussed before. The spectral flatness method also shows a small improvement in performance (Figures 17 and 18).
The results in Table 3 indicate an overall improvement in WDER as the SNR is increased. This trend is seen for both spectral flatness and the LP-MVDR ratio. These results are corroborated by the ROC curves, where the TPR rises rapidly with little increase in FPR, leading to a very low false alarm rate at a high TPR. The results can also be attributed to the fact that the detection methods were designed to detect the non-harmonic content in the signal. However, it must be noted that these results are for stationary noise. The behaviour of these methods in non-stationary and non-Gaussian noise has not been studied in this work. Given the nature of the proposed technique when compared to other conventional techniques, the results are interesting since the performance of whisper detection improves in noise. On the other hand, the phonated parts of speech would be affected in exactly the opposite manner.
Performance evaluation: whispered speech recognition
As discussed earlier in Section “Whispered speech recognition using LP-MVDR spectral ratio method and MLLR”, a whispered speech recognition system involves both segmenting the speech signal at the whisper boundaries and subsequently using statistical models, such as adapted HMMs, to perform automatic whispered speech recognition. A further aim is to implement the whispered speech recognition system with minimal changes to standard speech recognition engines. A block diagram illustrating the process of whispered speech recognition is shown in Figure 19. The important blocks in Figure 19 are the extraction of features from the speech signal and the MLLR method of adapting neutral speech models to whispered speech models. The adaptation is carried out using MLLR and has been described in Section “Whispered speech recognition using LP-MVDR spectral ratio method and MLLR”. The features used in this work are the Mel frequency cepstral coefficients (MFCC). MFCCs are the most commonly used features in speech recognition systems, which makes them a natural choice for investigating the effect of MLLR on whisper recognition. The MFCC features are calculated using the procedure given in.
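How an MLLR mean transform acts on a neutral-speech model can be sketched in a few lines. In mean-only MLLR, each Gaussian mean $\mu$ in a regression class is mapped to $\hat{\mu} = A\mu + b$. The sketch below only applies a given transform; estimating $A$ and $b$ from the whispered adaptation data by maximum likelihood is not shown, and the matrix, bias, state names and two-dimensional means are made-up illustrative values.

```python
def adapt_mean(mean, A, b):
    """Apply the affine MLLR mean transform: mu_hat = A @ mu + b."""
    return [
        sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
        for row, b_i in zip(A, b)
    ]


def adapt_model(means, A, b):
    """Apply one shared transform to every Gaussian mean in a regression class."""
    return {name: adapt_mean(mu, A, b) for name, mu in means.items()}


# Hypothetical 2-dimensional means for two HMM states of a neutral-speech model.
neutral_means = {"state1": [2.0, 3.0], "state2": [0.0, 1.0]}

# Hypothetical transform; in practice it is estimated from the adaptation data.
A = [[1.0, 0.0],
     [0.0, 2.0]]
b = [0.5, -1.0]

whisper_means = adapt_model(neutral_means, A, b)
```

Because a single transform is shared by many Gaussians, only a small amount of whispered data is needed to move the whole neutral model, which is what makes this style of adaptation attractive when whispered data is scarce.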
Experiments on whispered speech recognition on the cellphone whispered speech data corpus
Recognition was performed on the digits of the cellphone whispered speech data corpus. In the first stage, HMMs with five states and three mixtures, with no tied states, were trained for digits spoken in neutral speech. Fifty training files were used for each digit in the training process. These HMMs were then adapted to whispered speech using MLLR adaptation as discussed earlier. Twenty-five whisper files per digit were used in the adaptation process. Twenty-one whisper files were used for each digit in the digit recognition process. The experimental results for the recognition scheme are shown in Figure 20 and listed in Table 4. The recognition performance is computed as
The following test cases have been evaluated in the experiments conducted on automatic whispered speech recognition:
Case 1: Neutral speech recognition with same train and test data
Case 2: Neutral speech recognition with different train and test data
Case 3: Whispered speech recognition using neutral speech models
Case 4: Neutral speech recognition using neutral speech models adapted to whispered speech data (on similar train and test data)
Case 5: Neutral speech recognition using neutral speech models adapted to whispered speech data (on dissimilar train and test data)
Case 6: Whispered speech recognition using neutral speech models adapted to whispered speech data (on whisper adaptation data)
Case 7: Whispered speech recognition using neutral speech models adapted to whispered speech data (on whisper test data)
As expected, the baseline results in Table 4 show poor performance when neutral speech HMMs are used to recognize whispered speech. Models adapted on whispered speech show an improvement of 12.73% over the baseline. It is also seen that the performance of the adapted models on neutral speech recognition is very poor. Hence, as the experimental results indicate, treating the adapted models as a unified model encompassing both whispered and neutral speech may not be reasonable.
Conclusions and future scope
The work presented herein proposes a parametric spectral ratio method for whisper detection. The method performs reasonably better than conventional methods used in whisper detection, both in terms of ROC performance and WDER. The usefulness of this method in automatic whispered speech recognition is also discussed in an MLLR adaptation framework. However, the availability of whispered speech data is an issue. It is also difficult to collect whispered databases on a large scale without proper supervision because of the human tendency to switch to neutral speech in a natural environment. Since whispered speech is characterized by noise and a shift of formants to higher frequencies, newer model adaptation techniques can play a major role in this context. The work presented herein has also led to the development of a new database, the cellphone whispered speech data corpus, which comprises whisper data collected through an IVRS system over standard mobile phones in an uncontrolled environment. Implementing whisper detection systems in environments like hospitals and prisons can help address possible emergency situations. Such systems can also prove useful for people with a collapsed larynx or congenital diseases of the vocal cords. Recognition of other natural modes of speech, such as shouted speech, is currently being explored, as is the possibility of using the spectra derived for whisper detection as features in the recognition process.
This work was supported by the BSNL-IIT Kanpur Telecom Center, MCIT, and QUALCOMM.
The authors declare that they have no competing interests.
About this article
Cite this article
Mathur, A., Reddy, S.M. & Hegde, R.M. Significance of parametric spectral ratio methods in detection and recognition of whispered speech. EURASIP J. Adv. Signal Process. 2012, 157 (2012). https://doi.org/10.1186/1687-6180-2012-157
- Parametric spectrum estimation
- Linear prediction
- Minimum variance distortion less response
- Bayesian information criterion
- Whisper detection
- Automatic speech recognition (ASR)