Multimodal Speaker Verification Based on Electroglottograph Signal and Glottal Activity Detection
EURASIP Journal on Advances in Signal Processing, volume 2010, Article number: 930376 (2010)
To achieve robust speaker verification, we propose a multimodal method which includes additional nonaudio features and a glottal activity detector. An electroglottograph (EGG) is applied as the nonaudio sensor. Parameters of the EGG signal are used to augment the conventional audio feature vector. The algorithm for EGG parameterization is based on the shape of the idealized waveform and on the glottal activity detector. We compare our algorithm with the conventional one in terms of verification accuracy in a high noise environment. All experiments are performed using a Gaussian mixture model recognition system. The obtained results show a significant improvement in text-independent speaker verification in a high noise environment and an opportunity for further improvements in this area.
Speaker Verification (SV) is the process of verifying the claimed identity of a speaker using features extracted from her/his voice. Conventional SV uses the recorded audio signal as the sole source of information. This is based on features such as linear predictive cepstral coefficients (LPCC), mel-frequency cepstral coefficients (MFCC), or log area ratio (LAR) [1–3]. Over the past several years, one of the dominant approaches for modeling in text-independent SV applications has been based on Gaussian mixture models (GMMs) [1, 4–7].
When speech is corrupted by environmental noise, the distribution of the audio feature vectors is also distorted. This leads to misclassification and poor recognition. For an SV system to be of practical use in a high noise environment, it is necessary to address the issue of robustness. To combat this problem, researchers have put forward several algorithms which assume prior knowledge of the noise, such as noise filtering techniques [8, 9], parallel model combination [10–12], Jacobian environmental adaptation [13, 14], microphone arrays [15, 16], or speech enhancement techniques which target the modeling of speech and noise pdfs [17, 18]. When there is insufficient knowledge of the noise, one may attempt to ignore the contribution of highly corrupted speech data [19, 20] or to combine multicondition model training and the missing-feature theory to model noise with unknown temporal-spectral characteristics [21].
Robustness can also be achieved by using other sensing modalities to complement the audio signal of speech. In fact, in almost every context, carefully designed multimodal interfaces have turned out to be more beneficial than any single-modality interface [1, 22]. Some multimodal approaches are based on sensors where the speaker is not connected to a recording device, such as GEMS, ultrasonic, or video signals [23–25]. Other studies use sensors physically attached to the speaker's head, face, or throat, such as the electroglottograph (EGG), P-microphone, or bone-conducting microphone [22, 24, 26]. Physically attached sensors are practical in specific environments (military applications, battlefield environments, etc.) as well as in situations where the user is willing to cooperate, that is, willing to attach the sensor to herself/himself.
This study demonstrates that the EGG waveform is speaker specific (see Section 3). We use EGG features representing the time characteristics of an idealized EGG waveform. Then, we concatenate the EGG features and the audio features, using a glottal activity detector to select the frames. The main contribution of this paper is to investigate the performance of this fusion for the SV problem in a high noise environment. In this research, we also discuss the selection of an activity detector.
There are two stages in the SV process (see Figure 1). The first is enrollment (training), where model parameters are computed for each registered speaker $i$, giving a model $\lambda_i$ represented by the feature vectors $X_i$. In the proposed SV, $X_i$ consists of the new feature vectors, given by

$$X_i = \{x_i(n)\}_{n=1}^{N_i}, \qquad x_i(n) = \left[a_i(n)^T,\ e_i(n)^T\right]^T, \quad (1)$$

where $N_i$ is the number of feature vectors for the $i$th speaker; $A_i = \{a_i(n)\}$ and $E_i = \{e_i(n)\}$ are the sequences of the audio and EGG feature vectors $a_i(n)$, $e_i(n)$, respectively, where $n = 1, \ldots, N_i$.
During this stage, the background model (alternative hypothesis model) $\lambda_{\bar{i}}$ is created for each speaker using vectors $X_j$, $j \neq i$, which do not belong to the person $i$. The background model can become invariant and common to all speakers. In the second (testing) stage, the classifier decides whether the new input utterance, denoted by $Y$, belongs or not to the claimed registered speaker, represented by model $\lambda_i$, by comparing the conditional probabilities $p(Y \mid \lambda_i)$ versus $p(Y \mid \lambda_{\bar{i}})$, where $\lambda_{\bar{i}}$ corresponds to the background model.
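The testing-stage comparison of the two conditional probabilities can be sketched as an average log-likelihood ratio. This is a minimal illustration, not the authors' implementation: the diagonal covariances, the function names `gmm_logpdf` and `verify`, and the decision threshold are all assumptions made for the example.

```python
import math


def gmm_logpdf(x, weights, means, variances):
    """Log-density of one frame x under a diagonal-covariance GMM (assumed form)."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = sum(
            -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_gauss)
    top = max(log_terms)  # log-sum-exp keeps the sum numerically stable
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


def verify(frames, speaker, background, threshold=0.0):
    """Return the average log-likelihood ratio and the accept/reject decision.

    speaker and background are (weights, means, variances) tuples;
    threshold is a hypothetical operating point, not a value from the paper.
    """
    llr = sum(
        gmm_logpdf(x, *speaker) - gmm_logpdf(x, *background) for x in frames
    ) / len(frames)
    return llr, llr > threshold
```

A claimed identity is accepted when the frames are, on average, better explained by the speaker model than by the background model.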
Computation of feature vectors (parameterization) is common to both stages. The different nature of audio and EGG signals requires specific methods for optimal parameterization.
Parameterization is the transformation of an input signal into a set of feature vectors which are less redundant and more suitable for statistical modeling than the input signal. The input signal is processed into frames creating a sequence of vectors. Each frame corresponds to a time window with overlapping between the consecutive frames.
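The framing step described above can be sketched as follows. The helper name `split_into_frames` is hypothetical; the default hop of half a frame follows the overlap stated for the conventional system in Section 4.2, and the frame length is left to the caller.

```python
def split_into_frames(signal, frame_len, hop=None):
    """Cut a signal into overlapping frames of frame_len samples.

    hop is the shift between consecutive frames; by default the frames
    overlap by one half of the frame length.
    """
    if hop is None:
        hop = max(1, frame_len // 2)
    return [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]
```

Trailing samples that do not fill a whole frame are dropped in this sketch.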
The multimodal speaker verification, proposed in this work, includes audio and EGG parameterization.
2.1. Audio Parameterization
Audio parameterization is usually based on the cepstral representation of an audio signal. Prior to computing the short-term power spectra, the audio signal is filtered with a first-order FIR filter to spectrally flatten it. Pure cepstral coefficients of a speaker $i$, denoted by $c_i(n)$, are obtained by applying mel-scaled filter banks up to 4 kHz. Time derivatives of cepstral coefficients are resistant to linear channel mismatches between training and testing and have yielded significant improvement in the recognition processes. The first and second derivatives of the time function of the cepstral coefficients are called the delta- and delta-delta-cepstral coefficients, respectively. With these, the audio feature vector $a_i(n)$ is

$$a_i(n) = \left[c_i(n)^T,\ \Delta c_i(n)^T,\ \Delta\Delta c_i(n)^T\right]^T. \quad (2)$$
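The pre-emphasis filter and the stacking of static, delta, and delta-delta coefficients can be sketched as below. The filter coefficient `alpha=0.97` and the central-difference derivative are common choices assumed here for illustration; the paper does not state which values or derivative window were used, and the mel-cepstral analysis itself is not shown.

```python
def pre_emphasis(signal, alpha=0.97):
    """First-order FIR filter y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]


def delta(frames):
    """Central-difference time derivative of a sequence of feature vectors."""
    last = len(frames) - 1
    out = []
    for n in range(len(frames)):
        prev, nxt = frames[max(n - 1, 0)], frames[min(n + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def audio_vectors(cepstra):
    """Stack static, delta, and delta-delta coefficients for every frame."""
    d1 = delta(cepstra)
    d2 = delta(d1)
    return [c + a + b for c, a, b in zip(cepstra, d1, d2)]
```

With 14 cepstral coefficients per frame, `audio_vectors` returns 42-dimensional vectors.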
2.2. EGG Parameterization
The electroglottograph is a device for measuring the time variation of the degree of contact between the vibrating vocal folds during voice production. The degree of contact is proportional to the impedance between two electrodes on the subject's neck when the current is in the MHz region. Typical waveforms of the EGG and the related audio signal are shown in Figure 2.
For unvoiced segments, the EGG waveform contains slow changes and very low-level high-frequency noise that is easily distinguished. To remove disturbing low-frequency (uninformative) fluctuations, the EGG signal is usually filtered using digital linear-phase highpass or bandpass filters.
The EGG signal can be considered as "almost periodic" in voiced segments. One period of EGG signal with characteristic segments is shown in Figure 3, synchronously with audio signal.
For voiced segments, the EGG usually has only two zero crossings per fundamental (pitch) period of voicing. To obtain a quantitative description of the EGG signal, a model based on the shape of the idealized waveform, as proposed in [29, 30], is used. The idealized waveform has flat characteristic intervals, although the original signal typically has a parabolic shape. When the vocal folds are open and there is no lateral contact between them, the impedance is maximal and peak glottal flow occurs (open phase). The EGG waveform in this segment is flat, with small fluctuations. Next, the margins of the vocal folds come into contact and the vocal folds continue to close (closing phase). During the closed phase the vocal folds remain in contact and the airflow is blocked. As in the open phase, limited fluctuations of the impedance are observed; however, the waveform is not flat but rather forms a smooth hill. The pitch period $T_0$ and the specific durations in the EGG waveform, $t_1$ to $t_4$, are marked in Figure 3. The time from the maximum contact to the zero crossing (about half of the opening phase) is marked as $t_1$. Time $t_2$ is the next interval, up to the maximum of the open phase. Considering that the open phase is rather flat, $t_2$ could be calculated to the mean of the open phase of the idealized waveform. Next, $t_3$ and $t_4$ are the intervals to the second zero crossing and to the maximum in the contact phase, respectively.
Assuming that the EGG signal contains specific information about the speaker (see Section 3) and that the EGG sensor is robust in noisy environments, adding related parameters to the features in the SV process is expected to be beneficial. The EGG features used are the period of the fundamental frequency, $T_0$, and a set of timing parameters:

$$e_i(n) = \left[T_0(n),\ \tau_1(n),\ \tau_2(n),\ \tau_3(n),\ \tau_4(n)\right]^T, \quad (3)$$

where $\tau_1$ to $\tau_4$ are the time parameters $t_1$ to $t_4$ (the durations marked in Figure 3) normalized with respect to $T_0$, that is, $\tau_k = t_k / T_0$, measured at the time instant $n$. These features are correlated with the most salient glottal phenomena, that is, glottal pulse width, skewness, and abruptness of closure.
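A rough sketch of how such timing features might be measured on one EGG cycle is given below. Several things are assumptions for the example: the signal polarity (positive at maximum contact, negative in the open phase), the use of the open-phase minimum instead of the mean of a flat idealized segment, and the function name `egg_timing_features`. The cycle must cross zero twice, as a voiced EGG period usually does.

```python
import math


def egg_timing_features(period, fs):
    """Timing features of one EGG cycle.

    period: samples running from one contact maximum to the next
            (assumed positive at maximum contact, negative when open).
    fs:     sampling rate in Hz.
    Returns [T0, tau1, tau2, tau3, tau4] with each tau normalized by T0.
    """
    last = len(period) - 1
    # First zero crossing after the contact maximum.
    zc1 = next(k for k in range(1, last + 1) if period[k] < 0)
    # Extremum of the open phase (stand-in for the flat idealized segment).
    i_open = min(range(zc1, last + 1), key=lambda k: period[k])
    # Second zero crossing, on the way back to contact.
    zc2 = next(k for k in range(i_open, last + 1) if period[k] >= 0)
    bounds = [0, zc1, i_open, zc2, last]
    taus = [(b - a) / last for a, b in zip(bounds, bounds[1:])]
    return [last / fs] + taus


# Synthetic check: an ideal cosine cycle has four nearly equal segments.
cycle = [math.cos(2 * math.pi * k / 100) for k in range(101)]
features = egg_timing_features(cycle, fs=10000)
```

By construction the four normalized durations sum to one, so the vector carries the pitch period plus the relative shape of the cycle.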
Natural speech consists of speech segments, silence, and background noise. To only extract features from speech segments, the input signals are first fed to the activity detector subsystem to separate speech from nonspeech. Based on the activity detector output, features are extracted and then normalized.
2.3. Activity Detection
A voice activity detector (VAD) is a preprocessing subsystem designed to distinguish speech from nonspeech segments in an audio signal. A conventional VAD algorithm is based on energy and zero-crossing rate or on the cepstrum [31]. In a multimodal system, speech could instead be distinguished using the additional signal produced by nonaudio sensors.
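An energy and zero-crossing-rate detector of the kind mentioned above can be sketched as follows. The decision rule, the fixed thresholds, and the `weak_factor` parameter are hypothetical simplifications; a practical VAD would adapt its thresholds to the estimated noise level.

```python
def frame_energy(frame):
    """Mean squared amplitude of a frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def simple_vad(frames, energy_thr, zcr_thr, weak_factor=0.1):
    """Mark a frame as speech if it is energetic, or if it is weaker but
    noise-like (many zero crossings), which hints at an unvoiced sound."""
    decisions = []
    for frame in frames:
        energy = frame_energy(frame)
        zcr = zero_crossing_rate(frame)
        is_speech = energy > energy_thr or (
            energy > weak_factor * energy_thr and zcr > zcr_thr
        )
        decisions.append(is_speech)
    return decisions
```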
The EGG signal retains the relevant information on the excitation source only for the voiced segments of the speech signal, whereas a classical VAD passes both voiced and unvoiced segments. Therefore, the glottal activity detector (GAD) is used for the extraction of EGG features and for their fusion with cepstral coefficients in the multimodal feature vectors.
3. Expected Discrimination Information of EGG Features
The discrimination property of EGG features in the proposed SV system can be analyzed and estimated in two ways: (i) a priori, without designing the classification system and estimating its performance; (ii) a posteriori, by comparing the accuracy of the verification systems with and without the augmenting EGG features. The a posteriori approach is considered in Section 4, as part of the accuracy analysis of the proposed system.
Because the proposed SV system compares the probabilities of GMM models, for approach (i) the discrimination property of EGG features can be measured by the Kullback-Leibler divergence (KLD) between the corresponding probability distributions of EGG features [32].
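For each speaker $i$, the GMM model $\lambda_i$ is created as a mixture of $M$ Gaussian densities

$$p(e(n) \mid \lambda_i) = \sum_{m=1}^{M} w_{i,m}\, \mathcal{N}\!\left(e(n);\ \mu_{i,m}, \Sigma_{i,m}\right), \quad (4)$$

where $e(n)$ is the $n$th EGG feature vector, as in (3); $w_{i,m}$ represents the weights, where $\sum_{m=1}^{M} w_{i,m} = 1$; and $\mathcal{N}(\cdot\,;\ \mu_{i,m}, \Sigma_{i,m})$ is a single Gaussian density with mean vector $\mu_{i,m}$ and covariance matrix $\Sigma_{i,m}$.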
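The KLD is a fundamental measure between statistical distributions, which quantifies how close a probability distribution $f$ is to another distribution $g$:

$$D(f \,\|\, g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx. \quad (5)$$

$D(f \,\|\, g)$ can be interpreted as the expected discrimination information between the null and alternative statistical hypotheses, for discriminating in favor of a hypothesis $H_1$ against a hypothesis $H_2$ when hypothesis $H_1$ is true. If $f$ represents a model denoted by $\lambda_i$, which characterizes the hypothesized speaker in the feature space of $e$, and $g$ represents another model $\lambda_j$, the expected discrimination information becomes

$$D(\lambda_i \,\|\, \lambda_j) = \int p(e \mid \lambda_i) \log \frac{p(e \mid \lambda_i)}{p(e \mid \lambda_j)}\, de, \quad (6)$$

where $\lambda_i$ and $\lambda_j$ are based only on the corresponding EGG features.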
$D(\lambda_i \,\|\, \lambda_j)$, defined in (6), is measured when models $\lambda_i$ and $\lambda_j$ (a) belong to the same speaker, denoted by $D_{\mathrm{intra}}$ (intraspeaker variability), and (b) belong to different speakers, denoted by $D_{\mathrm{inter}}$ (interspeaker variability). Figure 4 presents the results for six speakers in the form of a histogram.
From Figure 4, it is clear that there is almost no overlap between the two groups of divergences $D_{\mathrm{intra}}$ and $D_{\mathrm{inter}}$. For all speakers in the database (see Section 4.1), $D_{\mathrm{intra}} < D_{\mathrm{inter}}$ holds in 94.2% of cases. Considering this result, one can conclude that EGG features have a speaker-discriminative property; the contribution of these features to speaker verification is examined in the following experiments.
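The KLD between two Gaussian mixtures has no closed form, so it has to be approximated. One simple option is a Monte Carlo estimate, sketched below under the assumption of diagonal covariances. This is only an illustration of the quantity being measured; the paper does not state which approximation was used, and the function names and sample count are choices made for the example.

```python
import math
import random


def gmm_logpdf(x, weights, means, variances):
    """Log-density of x under a diagonal-covariance GMM (assumed form)."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = sum(
            -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_gauss)
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


def gmm_sample(rng, weights, means, variances):
    """Draw one vector: pick a component by weight, then sample from it."""
    m = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(mu, math.sqrt(v)) for mu, v in zip(means[m], variances[m])]


def kld_monte_carlo(f, g, n_samples=5000, seed=0):
    """Estimate D(f || g) = E_f[log f(x) - log g(x)] by sampling from f."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = gmm_sample(rng, *f)
        total += gmm_logpdf(x, *f) - gmm_logpdf(x, *g)
    return total / n_samples
```

For two unit-variance Gaussians whose means differ by one, the exact divergence is 0.5, which gives a convenient sanity check for the estimator.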
This section analyzes the contribution of EGG features to the proposed system when the audio signal has been corrupted by additive white Gaussian noise. The proposed SV system is compared to the audio-based (conventional) SV system. To clearly show the contribution of EGG features, four experiments were conducted. The first experiment used the conventional system and a conventional VAD. The second experiment was identical to the first, except that VAD was replaced by GAD. The third experiment involved multimodal parameters (audio and EGG features) together with GAD. The fourth experiment involved only EGG features and GAD. SV error rates (ERRs) for different values of SNR (from 0 to 30 dB) were analyzed for all experiments. The four systems used in these experiments are summarized in Table 1.
The corpus consists of 50 sessions with 16 speakers, with up to 4 sessions per speaker. The utterances for each session were carefully chosen to provide a good representation of typical Serbian language [33]. Audio and EGG signals were recorded synchronously by a microphone and an EGG device (model EG-PC3 produced by Tiger DRS, Inc., USA). Both signals were originally sampled at 44 kHz. We used one session for enrollment, and the remaining 49 sessions were used for speaker verification. This defined the set of speaker verification tests.
4.2. Conventional Verification System (System 1)
The conventional verification system consists of front-end audio signal processing that produces the feature vector, as in (2). The audio feature vector is formed as a collection of 14 mel-frequency cepstral coefficients plus the corresponding delta and delta-delta coefficients, altogether 42 coefficients per frame. Each frame corresponds to a fixed number of samples, that is, a fixed time window. The frames are overlapped to avoid the risk of losing valuable transients. In our system, frames are overlapped by one half of the frame length. After computing the MFCCs, cepstral mean subtraction was applied. To separate speech frames from silence and noise, a classical VAD based on energy and zero-crossing rate was used.
The model training was done in an office environment, while in the SV testing phase the audio signal was corrupted by additive Gaussian noise. The obtained ERR for different SNR values in the range of 0 to 30 dB is shown in Figure 5.
According to the obtained results, one can conclude that the conventional SV system is quite sensitive to high Gaussian noise. In noisy environments, especially at low SNR, the influence of noise becomes very significant.
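Corrupting a test signal with white Gaussian noise at a prescribed SNR can be sketched as follows. The function name `add_white_noise` and the fixed random seed are illustrative assumptions; the SNR is defined here over the whole signal, which may differ from the exact procedure used in the experiments.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return the signal plus white Gaussian noise scaled to the given SNR."""
    rng = random.Random(seed)
    signal_power = sum(s * s for s in signal) / len(signal)
    # SNR in dB is 10 * log10(signal_power / noise_power).
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


# Example: a sine wave corrupted at 10 dB SNR.
clean = [math.sin(2 * math.pi * 0.01 * n) for n in range(20000)]
noisy = add_white_noise(clean, snr_db=10)
```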
4.3. GAD versus VAD (System 2)
The quality of an activity detector is measured by the accuracy of speech/nonspeech segment detection. GAD is based on the EGG signal and is therefore robust to audio noise. Since the EGG signal is only informative during glottal oscillations, GAD detects voiced speech segments. VAD, on the other hand, detects both voiced and unvoiced segments and uses a threshold that adapts to the noise level, so the detected segments become narrower as the noise level increases.
Figure 6(a) shows a part of natural speech at a higher SNR. The segments detected by the classical VAD and by GAD are denoted by "VAD" and "GAD", respectively. The same part of the speech at a lower SNR is shown in Figure 6(b). The segments detected by the classical VAD are clearly shorter in Figure 6(b) than in Figure 6(a). At the same time, the effective signal-to-noise ratio is higher for VAD than for GAD. Figure 6(c) shows the corresponding EGG signal, which is unchanged regardless of the SNR value.
The verification system (System 2) used in this experiment was identical to the previously described System 1, except that VAD was replaced by GAD. The obtained results are plotted as a solid curve in Figure 7.
Although the noise does not have an influence on GAD, the speaker verification error is increased, even in a very noisy environment. Obviously, the results are affected by the choice of detector. VAD separates speech by using adaptive thresholds depending on the level of background noise. On the other hand, GAD-detected speech segments are independent of this level.
4.4. Fusing the EGG Features with Cepstral Coefficients (System 3)
In this experiment, the conventional audio feature vectors were augmented by the EGG features (altogether 47 coefficients), as in (1). GAD was used as the activity detector.
Evaluation and testing were done as in the conventional SV system. The results are shown in Figure 8 as a solid curve.
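The fusion used in System 3 can be sketched as a per-frame concatenation restricted to frames that the GAD marks as glottally active. The function name `fuse_features` and the boolean-mask representation of the GAD output are assumptions for the example.

```python
def fuse_features(audio_vectors, egg_vectors, glottal_active):
    """Concatenate audio and EGG vectors for frames where GAD reports voicing.

    All three inputs hold one entry per frame; glottal_active is a
    sequence of booleans standing in for the GAD output.
    """
    if not (len(audio_vectors) == len(egg_vectors) == len(glottal_active)):
        raise ValueError("all inputs must have one entry per frame")
    return [
        list(a) + list(e)
        for a, e, active in zip(audio_vectors, egg_vectors, glottal_active)
        if active
    ]
```

With 42 audio coefficients and 5 EGG features per frame, each fused vector has 47 coefficients.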
4.5. Only EGG Features (System 4)
Verification System 4 is based only on the feature vectors defined as in (3). GAD was the natural choice as the activity detector. After the enrollment phase, the created GMM models were tested. Because EGG feature vectors are not sensitive to audio noise, the obtained result is shown in Figure 8 as a horizontal line, that is, a constant value with respect to SNR.
Verification error rates for the different SNR values are shown in Table 2. The results presented in the table show the benefits, expressed as the difference between the conventional SV System 1 and the improved SV Systems 3 and 4.
Throughout the analysis of the results presented here, one can clearly note that the EGG features have a strong influence on the performance of SV in a noisy environment. As indicated in Figure 8 and Table 2, substantial gains in speaker verification in a high noise environment were obtained. In terms of SNR, there are three different ranges, I, II, and III, where System 1, System 3, and System 4 have the best performance, respectively. Therefore, a composite SV system can adaptively select one of the three systems based on the noise level, achieving a total error rate that is lower than that of any single system.
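The composite selection can be sketched as a simple switch on the estimated SNR. The boundaries between the three ranges are not given in the text, so `low_db` and `high_db` are hypothetical parameters that would have to be read from the measured error curves; the sketch also assumes that range I is the high-SNR end, where the audio-only system is least affected by noise.

```python
def select_system(snr_db, low_db, high_db):
    """Pick the verification system for the estimated SNR.

    low_db and high_db are hypothetical range boundaries (low_db < high_db).
    """
    if snr_db >= high_db:
        return "System 1"  # range I: audio features with VAD
    if snr_db >= low_db:
        return "System 3"  # range II: fused audio and EGG features with GAD
    return "System 4"      # range III: EGG features only
```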
One can suggest the use of other speech sensors to create stronger modality combinations that can further be fused using the proposed method to boost the overall performance of an SV system.
These results illustrate the potential of this method for noise robust speaker verification.
Considering the sensitivity of a conventional speaker verification system to noise, we examined the informativeness of EGG features. In contrast to the conventional approach, which only extracts cepstral features from the audio signal, the proposed method employs information contained within the EGG signal.
The features of the EGG signal, which are robust in a noisy environment, are used to augment the conventional audio feature vector.
Since the EGG signal is only informative during voiced speech segments, the voice activity detector is replaced by a glottal activity detector.
The presented experimental results show a significant reduction of verification error in a noisy environment, especially at low SNR. As mentioned, a further improvement is possible by combining all the systems depending on the noise level. Another interesting aspect of the proposed framework is that it could be applied to other speech modalities by appropriate selection of the activity detector.
As a part of further work, the feature set could be augmented by some other modality which may be more robust against noise, although such a claim would have to be validated. Future work should also assess statistical significance on wider speaker populations to further validate the results.
References
[1] Reynolds DA, Rose RC: Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing 1995, 3(1):72-83. 10.1109/89.365379
[2] Chow D, Abdulla WH: Robust speaker identification based on perceptual log area ratio and Gaussian mixture models. Proceedings of 8th International Conference on Spoken Language Processing (INTERSPEECH-ICSLP '04), October 2004, Jeju Island, Korea 1761-1764.
[3] Premakanthan P, Mikhael WB: Speaker verification/recognition and the importance of selective feature extraction: review. Proceedings of the 44th IEEE Midwest Symposium on Circuits and Systems (MWSCAS '01), August 2001, Ohio, USA 1: 57-61.
[4] Burget L, Matějka P, Schwarz P, Glembek O, Černocký JH: Analysis of feature extraction and channel compensation in a GMM speaker recognition system. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(7):1979-1986.
[5] Reynolds DA, Quatieri TF, Dunn RB: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 2000, 10(1):19-41. 10.1006/dspr.1999.0361
[6] Campbell WM, Sturim DE, Reynolds DA: Support vector machines using GMM supervectors for speaker verification. IEEE Signal Processing Letters 2006, 13(5):308-311.
[7] Kenny P, Ouellet P, Dehak N, Gupta V, Dumouchel P: A study of interspeaker variability in speaker verification. IEEE Transactions on Audio, Speech and Language Processing 2008, 16(5):980-988.
[8] Ortega-Garcia J, Gonzalez-Rodriguez J: Overview of speech enhancement techniques for automatic speaker recognition. Proceedings of the 1996 International Conference on Spoken Language Processing (ICSLP '96), October 1996, Philadelphia, Pa, USA 929-932.
[9] Suhadi S, Stan S, Fingscheidt T, Beaugeant C: An evaluation of VTS and IMM for speaker verification in noise. Proceedings of the 4th European Conference on Speech Communication and Technology (EuroSpeech '03), 2003, Geneva, Switzerland 1669-1672.
[10] Gales MJF, Young S: HMM recognition in noise using parallel model combination. Proceedings of the European Conference on Speech Communication and Technology (EuroSpeech '93), 1993, Berlin, Germany 837-840.
[11] Matsui T, Kanno T, Furui S: Speaker recognition using HMM composition in noisy environments. Computer Speech and Language 1996, 10(2):107-116. 10.1006/csla.1996.0007
[12] Wong LP, Russell M: Text-dependent speaker verification under noisy conditions using parallel model combination. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '01), 2001, Salt Lake City, Utah, USA 1: 457-460.
[13] Sagayama S, Yamaguchi Y, Takahashi S, Takahashi J: Jacobian approach to fast acoustic model adaptation. Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), April 1997, Munich, Germany 835-838.
[14] Cerisara C, Rigazio L, Junqua J-C: α-Jacobian environmental adaptation. Speech Communication 2004, 42(1):25-41. 10.1016/j.specom.2003.08.003
[15] Gonzalez-Rodriguez L, Ortega-Garcia J: Robust speaker recognition through acoustic array processing and spectral normalization. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '97), 1997, Munich, Germany 2: 1103-1106.
[16] McCowan I, Pelecanos J, Sridharan S: Robust speaker recognition using microphone arrays. A Speaker Odyssey: The Speaker Recognition Workshop, 2001, Crete, Greece 101-106.
[17] Hu Y, Loizou PC: A generalized subspace approach for enhancing speech corrupted by colored noise. IEEE Transactions on Speech and Audio Processing 2003, 11(4):334-341. 10.1109/TSA.2003.814458
[18] Kundu A, Chatterjee S, Murthy AS, Sreenivas TV: GMM based Bayesian approach to speech enhancement in signal/transform domain. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), April 2008 4893-4896.
[19] Drygajlo A, El-Maliki M: Speaker verification in noisy environments with combined spectral subtraction and missing feature theory. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), May 1998, Seattle, Wash, USA 121-124.
[20] Besacier L, Bonastre JF, Fredouille C: Localization and selection of speaker-specific information with statistical modeling. Speech Communication 2000, 31(2):89-106. 10.1016/S0167-6393(99)00070-9
[21] Ming J, Hazen TJ, Glass JR, Reynolds DA: Robust speaker recognition in noisy conditions. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(5):1711-1723.
[22] Campbell WM, Quatieri TF, Campbell JP, Weinstein CJ: Multimodal speaker authentication using nonacoustic sensors. Proceedings of the International Workshop on Multimodal User Authentication, 2003, Santa Barbara, Calif, USA 215-222.
[23] Çetingül HE, Erzin E, Yemez Y, Tekalp AM: Multimodal speaker/speech recognition using lip motion, lip texture and audio. Signal Processing 2006, 86(12):3549-3558. 10.1016/j.sigpro.2006.02.045
[24] Quatieri TF, Messing DP, Brady K, et al.: Exploiting nonacoustic sensors for speech enhancement. Proceedings of the International Workshop on Multimodal User Authentication, 2003, Santa Barbara, Calif, USA 66-73.
[25] Zhu B, Hazen TJ, Glass JR: Multimodal speech recognition with ultrasonic sensors. Proceedings of the 8th Annual Conference of the International Speech Communication Association (INTERSPEECH '07), 2007, Antwerp, Belgium 4: 2328-2331.
[26] Subramanya A, Zhang Z, Liu Z, Droppo J, Acero A: A graphical model for multi-sensory speech processing in air-and-bone conductive microphones. Proceedings of the 9th European Conference on Speech Communication and Technology, 2005, Lisbon, Portugal 2361-2364.
[27] Furui S: Survey of the State of the Art in Human Language Technology. 1996, http://cslu.cse.ogi.edu/HLTsurvey/ch1node9.html#SECTION17
[28] Childers DG: Speech Processing and Synthesis Toolboxes. John Wiley & Sons, New York, NY, USA; 2000.
[29] Rothenberg M, Mahshie JJ: Monitoring vocal fold abduction through vocal fold contact area. Journal of Speech and Hearing Research 1988, 31(3):338-351.
[30] Baken RJ: Electroglottography. Journal of Voice 1992, 6(2):98-110. 10.1016/S0892-1997(05)80123-7
[31] Hahn M, Park CK: An improved speech detection algorithm for isolated Korean utterances. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '92), March 1992, San Francisco, Calif, USA 525-528.
[32] Hershey JR, Olsen PA: Approximating the Kullback Leibler divergence between Gaussian mixture models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), April 2007, Calif, USA IV317-IV320.
[33] Jovicic ST, Kasic Z, Dordevic M, Rajkovic M: Serbian emotional speech database: design, processing and evaluation. Proceedings of the 11th International Conference Speech and Computer (SPECOM 04), 2004, St. Petersburg, Russia.