- Research Article
- Open Access
Multimodal Speaker Verification Based on Electroglottograph Signal and Glottal Activity Detection
© Zoran Ćirović et al. 2010
- Received: 26 March 2010
- Accepted: 28 August 2010
- Published: 1 September 2010
To achieve robust speaker verification, we propose a multimodal method that includes additional nonaudio features and a glottal activity detector. An electroglottograph (EGG) is used as the nonaudio sensor. Parameters of the EGG signal are used to augment the conventional audio feature vector. The algorithm for EGG parameterization is based on the shape of the idealized waveform and on the glottal activity detector. We compare our algorithm with the conventional one in terms of verification accuracy in a high noise environment. All experiments are performed using a Gaussian Mixture Model recognition system. The results show a significant improvement in text-independent speaker verification in a high noise environment and an opportunity for further improvements in this area.
- Gaussian Mixture Model
- Audio Signal
- Verification System
- Speaker Verification
- Voice Activity Detector
Speaker Verification (SV) is the process of verifying the claimed identity of a speaker using features extracted from her/his voice. Conventional SV uses the recorded audio signal as the sole source of information. This is based on features such as linear predictive cepstral coefficients (LPCC), mel-frequency cepstral coefficients (MFCC), or log area ratio (LAR) [1–3]. Over the past several years, one of the dominant approaches for modeling in text-independent SV applications has been based on Gaussian mixture models (GMMs) [1, 4–7].
In the case of speech being corrupted by environmental noise, the distribution of the audio feature vectors is also damaged. This leads to misclassification and poor recognition. For an SV system to be of practical use in a high noise environment, it is necessary to address the issue of robustness. To combat this problem, researchers have put forward several new algorithms that assume prior knowledge of the noise, such as noise filtering techniques [8, 9], parallel model combination [10–12], Jacobian environmental adaptation [13, 14], microphone arrays [15, 16], or speech enhancement techniques that target the modeling of speech and noise pdfs [17, 18]. When there is insufficient knowledge of the noise, one may attempt to ignore the contribution of highly corrupted speech data [19, 20] or to combine multicondition model training and the missing-feature theory to model noise with unknown temporal-spectral characteristics.
Robustness can also be achieved by using other sensing modalities to complement the audio signal of speech. As a matter of fact, in almost every context, carefully designed multimodal interfaces have turned out to be more beneficial than any single-modality interface [1, 22]. Some multimodal approaches are based on sensors that are not attached to the speaker, such as GEMS, ultrasonic, or video signals [23–25]. Other studies use sensors physically attached to the speaker's head, face, or throat, such as the electroglottograph (EGG), P-microphone, or bone-conducting microphone [22, 24, 26]. Physically attached sensors are practical in specific environments (military applications, battlefield environments, etc.) as well as in situations where the user is willing to cooperate, that is, to attach the sensor to herself/himself.
This study demonstrates that the EGG waveform differs between speakers (see Section 3). We use EGG features representing the time characteristics of an idealized EGG waveform. Then, we concatenate the EGG features and audio features by applying a glottal activity detector. The main contribution of this paper is to investigate the performance of this fusion for the SV problem in a high noise environment. We also discuss the selection of an activity detector.
During this stage, the background model (alternative hypothesis model) is created for each speaker using feature vectors that do not belong to that speaker. The background model becomes invariant and common to all speakers if . In the second (testing) stage, the classifier decides whether the new input utterance belongs to the claimed registered speaker by comparing the conditional probability of the utterance given the claimed speaker's model with its conditional probability given the corresponding background model.
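The testing-stage decision described above can be sketched as a log-likelihood ratio test between the claimed speaker's GMM and the background GMM. This is a minimal illustration and not the authors' implementation: the diagonal covariances, the averaging over frames, the threshold value, and all function names are assumptions.

```python
import math

def diag_gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM
    (diagonal covariances are an assumption made for brevity)."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = 0.0
        for xi, mi, vi in zip(x, mu, var):
            log_gauss += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        component_logs.append(math.log(w) + log_gauss)
    # Log-sum-exp over components for numerical stability.
    top = max(component_logs)
    return top + math.log(sum(math.exp(c - top) for c in component_logs))

def average_llr(frames, speaker_gmm, background_gmm):
    """Mean per-frame log-likelihood ratio: speaker model versus background model."""
    total = 0.0
    for x in frames:
        total += diag_gmm_loglik(x, *speaker_gmm) - diag_gmm_loglik(x, *background_gmm)
    return total / len(frames)

def verify(frames, speaker_gmm, background_gmm, threshold=0.0):
    """Accept the claimed identity when the averaged ratio exceeds a (hypothetical) threshold."""
    return average_llr(frames, speaker_gmm, background_gmm) > threshold

# Toy one-dimensional models given as (weights, means, variances).
speaker = ([1.0], [[0.0]], [[1.0]])
background = ([1.0], [[3.0]], [[1.0]])
accepted = verify([[0.1], [-0.2]], speaker, background)
```

In practice the threshold would be tuned on held-out data to trade off false acceptances against false rejections.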
Computing of feature vectors (parameterization) is common to both stages. The different nature of audio and EGG signals requires specific methods for optimal parameterization.
Parameterization is the transformation of an input signal into a set of feature vectors which are less redundant and more suitable for statistical modeling than the input signal. The input signal is processed into frames creating a sequence of vectors. Each frame corresponds to a time window with overlapping between the consecutive frames.
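The framing step described above can be sketched as follows. The frame length and hop size are free parameters here (the values in the usage lines are illustrative only), and the function name is hypothetical.

```python
def frame_signal(samples, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples.

    Consecutive frames start hop samples apart, so hop < frame_len gives overlap.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames

# Half-overlapping frames of 4 samples over a 10-sample toy signal.
toy_frames = frame_signal(list(range(10)), frame_len=4, hop=2)
```

Each resulting frame would then be passed to the audio or EGG parameterization to produce one feature vector.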
The multimodal speaker verification, proposed in this work, includes audio and EGG parameterization.
2.1. Audio Parameterization
2.2. EGG Parameterization
For unvoiced segments, the EGG waveform contains slow changes and very low-level high-frequency noise that is easily distinguished. To remove disturbing low-frequency (uninformative) fluctuations, the EGG signal is usually filtered using digital linear-phase high-pass or bandpass filters.
For voiced segments, the EGG usually has only two zero crossings per fundamental (pitch) period of voicing. To obtain a quantitative description of the EGG signal, a model based on the shape of the idealized waveform, as proposed in [29, 30], is used. The idealized waveform has flat characteristic intervals, although the original signal typically has a parabolic shape. When the vocal folds are open and there is no lateral contact between them, the impedance is maximal and peak glottal flow occurs (open phase). The EGG waveform in this segment is flat, with small fluctuations. Next, the margins of the vocal folds come into contact and the vocal folds continue to close (closing phase). During the closing phase the vocal folds remain in contact and the airflow is blocked. As in the open phase, limited fluctuations of the impedance are observed. However, the waveform is not flat, but rather forms a smooth hill. The pitch period and the specific durations in the EGG waveform are marked in Figure 3. The first duration is the time from the maximum contact to the zero crossing (about half of the opening phase). The second is the next interval, up to the maximum of the open phase. Because the open phase is rather flat, this interval could be calculated up to the mean of the open phase of the idealized waveform. The remaining two durations are the intervals to the second zero crossing and to the maximum in the contact phase, respectively.
The EGG feature vector in (3) consists of these time parameters, normalized with respect to the pitch period and measured at a given time instant. These features are correlated with the most salient glottal phenomena, that is, glottal pulse width, skewness, and abruptness of closure.
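Since a voiced EGG cycle usually has two zero crossings, one simple way to obtain the pitch period is to measure the spacing of the positive-going crossings, and the measured durations can then be divided by that period. The sketch below illustrates this idea only; it is not the authors' parameterization, and the function names and the toy sinusoid standing in for a filtered EGG signal are assumptions.

```python
import math

def positive_zero_crossings(signal):
    """Indices i where the signal rises from non-positive at i-1 to positive at i."""
    return [i for i in range(1, len(signal)) if signal[i - 1] <= 0.0 < signal[i]]

def estimate_pitch_periods(signal, sample_rate):
    """Pitch periods in seconds, taken between consecutive positive-going crossings."""
    crossings = positive_zero_crossings(signal)
    return [(b - a) / sample_rate for a, b in zip(crossings, crossings[1:])]

def normalize_durations(durations, pitch_period):
    """Express each duration as a fraction of the pitch period."""
    return [d / pitch_period for d in durations]

# Toy stand-in for a filtered EGG signal: a 100 Hz sinusoid sampled at 8 kHz.
sample_rate = 8000
toy_egg = [math.sin(2.0 * math.pi * 100.0 * n / sample_rate + 0.1) for n in range(400)]
periods = estimate_pitch_periods(toy_egg, sample_rate)
features = normalize_durations([0.002, 0.003, 0.002, 0.003], periods[0])
```

Normalizing by the pitch period makes the timing features describe the shape of the cycle rather than the speaker's momentary pitch.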
Natural speech consists of speech segments, silence, and background noise. To extract features only from speech segments, the input signals are first fed to the activity detector subsystem, which separates speech from nonspeech. Based on the activity detector output, features are extracted and then normalized.
2.3. Activity Detection
A voice activity detector (VAD) is a preprocessing subsystem designed for distinguishing speech from nonspeech segments in an audio signal. The conventional VAD algorithm is based on energy and zero-crossing rate or on the cepstrum. In a multimodal system, speech detection could instead be based on the additional signal produced by nonaudio sensors.
The EGG signal retains the relevant information on the excitation source only for the voiced segments of the speech signal, whereas a classical VAD detects both voiced and unvoiced segments. Therefore, a glottal activity detector (GAD) is used for EGG feature extraction and for fusion with the cepstral coefficients in the multimodal feature vectors.
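A conventional detector of the energy and zero-crossing-rate kind can be sketched as below. The two-level decision rule, the 0.1 scaling of the energy threshold, and the threshold values are illustrative assumptions, not the rule used in the experiments.

```python
def short_time_energy(frame):
    """Mean squared amplitude of a frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0))
    return crossings / (len(frame) - 1)

def simple_vad(frames, energy_threshold, zcr_threshold):
    """Flag a frame as speech if it is energetic (voiced-like), or if it is
    weaker but has a high zero-crossing rate (unvoiced-like).
    The 0.1 factor for the lower energy bound is an arbitrary choice."""
    decisions = []
    for frame in frames:
        energy = short_time_energy(frame)
        zcr = zero_crossing_rate(frame)
        is_speech = energy > energy_threshold or (
            energy > 0.1 * energy_threshold and zcr > zcr_threshold
        )
        decisions.append(is_speech)
    return decisions

toy = [[0.0] * 8, [1.0, -1.0] * 4, [0.2, -0.2] * 4, [0.2] * 8]
flags = simple_vad(toy, energy_threshold=0.1, zcr_threshold=0.3)
```

A noise-adaptive variant would raise `energy_threshold` with the estimated background level, which is one way the detected segments can shrink as noise increases.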
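The fusion step can be pictured as concatenating the audio and EGG feature vectors of each frame and keeping only the frames flagged by the glottal activity detector. The sketch below assumes the per-frame GAD decisions are already available as booleans; how they are computed is not specified here, and the function name is hypothetical.

```python
def fuse_features(audio_frames, egg_frames, glottal_active):
    """Concatenate audio and EGG feature vectors for frames with glottal activity.

    audio_frames, egg_frames: per-frame feature vectors (lists of floats),
    assumed to be time-aligned. glottal_active: per-frame booleans from a GAD.
    """
    fused = []
    for audio_vec, egg_vec, active in zip(audio_frames, egg_frames, glottal_active):
        if active:
            fused.append(list(audio_vec) + list(egg_vec))
    return fused

audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
egg = [[0.7], [0.8], [0.9]]
fused = fuse_features(audio, egg, [True, False, True])
```

Frames without glottal activity are dropped entirely, since the EGG features carry no information there.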
The discriminative property of EGG features in the proposed SV system can be analyzed and estimated in two ways: (i) a priori, without designing the classification system and estimating its performance; (ii) a posteriori, by comparing the accuracy of the verification systems with and without the augmented EGG features. The a posteriori approach is considered in Section 4, in the accuracy analysis of the proposed system.
Because the proposed SV system compares the probabilities of GMM models, for approach (i) the discriminative property of EGG features can be measured by the Kullback-Leibler divergence (KLD) between the corresponding probability distributions of EGG features.
where is an n th EGG feature vector, as in (3); represents weights, where and is single Gaussian density with mean vector and covariance matrix .
where , are only based on corresponding EGG features.
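For orientation, the standard textbook forms of a Gaussian mixture density and of the Kullback-Leibler divergence are given below. The symbols are assumptions chosen for this sketch and may differ from the authors' notation. The divergence has no closed form for Gaussian mixtures in general, so it is usually approximated.

```latex
% Assumed notation: x is a feature vector, M the number of mixture components,
% w_i the mixture weights, \mu_i and \Sigma_i the component means and covariances.
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{M} w_i = 1

% Kullback-Leibler divergence between two densities f and g.
D_{\mathrm{KL}}(f \,\|\, g) = \int f(x) \, \log \frac{f(x)}{g(x)} \, dx
```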
From Figure 4, it is clear that there is almost no overlap between the two groups of divergences. For all speakers in the database (see Section 4.1), the corresponding relation between the two divergences holds in 94.2% of cases. Considering this result, one can conclude that EGG features have a speaker-discriminative property; the contribution of these features to the process of speaker verification is examined in the following experiments.
Experiments for analyzing EGG signal contribution in high noise environment.
The corpus consists of 50 sessions with 16 speakers, with up to 4 sessions per speaker. The utterances for each session were carefully chosen to provide a good representation of typical Serbian language. Audio and EGG signals were recorded synchronously by a microphone and an EGG device (model EG-PC3 produced by Tiger DRS, Inc., USA). Both signals were originally sampled at 44 kHz. We used one session as enrollment and the remaining 49 sessions were used for speaker verification. This resulted in speaker verification tests.
4.2. Conventional Verification System (System 1)
The conventional verification system consists of front-end audio signal processing that produces the feature vector, as in (2). The audio feature vector is formed as a collection of 14 mel-frequency cepstral coefficients plus the corresponding deltas. Each frame corresponds to a fixed number of samples, that is, a fixed time window. The frames are overlapped to avoid the risk of losing valuable transients. In our system, frames are overlapped by one half of the frame length. After computing the MFCCs, cepstral mean subtraction was applied. To separate speech frames from silence and noise, a classical VAD based on energy and zero-crossing rate was used.
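The post-processing of the cepstral coefficients can be sketched as follows. The MFCC computation itself is taken as given. The two-point delta formula with edge replication and the order of operations (mean subtraction first, deltas second) are assumptions, since neither is spelled out here.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean, computed over all frames of an utterance."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[k] for f in frames) / n for k in range(dim)]
    return [[f[k] - means[k] for k in range(dim)] for f in frames]

def delta_coefficients(frames):
    """Simple deltas: half the difference between the next and previous frame,
    replicating the first and last frames at the edges."""
    last = len(frames) - 1
    deltas = []
    for t in range(len(frames)):
        prev_f = frames[max(t - 1, 0)]
        next_f = frames[min(t + 1, last)]
        deltas.append([(b - a) / 2.0 for a, b in zip(prev_f, next_f)])
    return deltas

def audio_feature_vectors(cepstra):
    """Mean-normalized cepstra with their deltas appended to each frame."""
    normalized = cepstral_mean_subtraction(cepstra)
    deltas = delta_coefficients(normalized)
    return [c + d for c, d in zip(normalized, deltas)]

vectors = audio_feature_vectors([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Mean subtraction removes constant channel effects, while the deltas add information about how the spectrum changes between frames.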
According to the obtained results, one can conclude that the conventional SV system is quite sensitive to high Gaussian noise. In noisy environments especially for , noise influence becomes very significant.
4.3. GAD versus VAD (System 2)
The quality of the activity detector is measured by the accuracy of speech/nonspeech segment detection. The GAD is based on the EGG signal and is therefore robust to audio noise. Since the EGG signal is only informative during glottal oscillations, the GAD detects voiced speech segments. The VAD, on the other hand, detects both voiced and unvoiced segments and uses a noise-level-adaptive threshold, which narrows the detected segments as the noise level increases.
Although the noise does not have an influence on GAD, the speaker verification error is increased, even in a very noisy environment. Obviously, the results are affected by the choice of detector. VAD separates speech by using adaptive thresholds depending on the level of background noise. On the other hand, GAD-detected speech segments are independent of this level.
4.4. Fusing the EGG Features with Cepstral Coefficients (System 3)
In this experiment, the conventional feature vectors were augmented by EGG features (altogether 47 coefficients), as in (1). The GAD was used as the activity detector.
4.5. Only EGG Features (System 4)
Verification system 4 is based only on the feature vectors defined as in (3). GAD was the natural choice as the activity detector. After the enrollment phase, the created GMM models were tested. Because EGG feature vectors are not sensitive to audio noise, the obtained result is shown in Figure 8 as a horizontal line, that is, a constant value with respect to SNR.
SV error rate for (I) the conventional system (System 1), (II) augmented vectors with GAD (System 3), and (III) only EGG features with GAD (System 4), and the benefits attained.
From the analysis of the results presented here, one can clearly see that the EGG features have a strong influence on the performance of SV in a noisy environment. As indicated in Figure 8 and Table 2, substantial gains in speaker verification in a high noise environment were obtained. Across the SNR axis there are three different ranges, I, II, and III, in which System 1, System 3, and System 4 have the best performance, respectively. Therefore, a composite SV system can adaptively select one of the three systems based on the level of noise, achieving a total error rate that is lower than that of any single system.
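The composite selection can be sketched as a simple switch on an SNR estimate. The sketch assumes that range I is the highest-SNR range and range III the lowest, and that an SNR estimate in dB is available. The boundary values in the usage lines are placeholders, not values from the experiments.

```python
def select_system(snr_db, low_db, high_db):
    """Pick the verification system for an estimated SNR.

    low_db and high_db are hypothetical boundaries between the three ranges;
    they would have to be read from measured error curves.
    """
    if snr_db >= high_db:
        return "System 1"  # conventional audio features with VAD
    if snr_db >= low_db:
        return "System 3"  # audio features augmented with EGG features, GAD
    return "System 4"      # EGG features only, GAD

choices = [select_system(snr, low_db=5.0, high_db=20.0) for snr in (30.0, 10.0, 0.0)]
```

The benefit of such a scheme depends on how reliably the SNR can be estimated at test time.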
One can suggest the use of other speech sensors to create stronger modality combinations that can further be fused using the proposed method to boost the overall performance of an SV system.
These results illustrate the potential of this method for noise robust speaker verification.
Considering the sensitivity of a conventional speaker verification system to noise, we examined the informativeness of EGG features. In contrast to the conventional approach, which only extracts cepstral features from the audio signal, the proposed method employs information contained within the EGG signal.
The features of the EGG signal, which are robust in a noisy environment, are used to augment the conventional audio feature vector.
Since the EGG signal is only informative during voiced speech segments, the voice activity detector is replaced by a glottal activity detector.
The presented experimental results show a significant reduction of verification error in a noisy environment, especially for . As mentioned, a further improvement is possible by combining all the systems depending on the noise level. Another interesting aspect of the proposed framework is that it could be applied to other speech modalities by appropriate selection of the activity detector.
As part of further work, the feature set could be augmented by another modality that may be more robust against noise, although such a claim would have to be validated. Future work should also test the statistical significance of the results on wider speaker populations.
- Reynolds DA, Rose RC: Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing 1995, 3(1):72-83. 10.1109/89.365379
- Chow D, Abdulla WH: Robust speaker identification based on perceptual log area ratio and gaussian mixture models. Proceedings of 8th International Conference on Spoken Language Processing (INTERSPEECH-ICSLP '04), October 2004, Jeju Island, Korea 1761-1764.
- Premakanthan P, Mikhael WB: Speaker verification/recognition and the importance of selective feature extraction: review. Proceedings of the 44th IEEE Midwest Symposium on Circuits and Systems (MWSCAS '01), August 2001, Ohio, USA 1: 57-61.
- Burget L, Matějka P, Schwarz P, Glembek O, Černocký JH: Analysis of feature extraction and channel compensation in a GMM speaker recognition system. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(7):1979-1986.
- Reynolds DA, Quatieri TF, Dunn RB: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 2000, 10(1):19-41. 10.1006/dspr.1999.0361
- Campbell WM, Sturim DE, Reynolds DA: Support vector machines using GMM supervectors for speaker verification. IEEE Signal Processing Letters 2006, 13(5):308-311.
- Kenny P, Ouellet P, Dehak N, Gupta V, Dumouchel P: A study of interspeaker variability in speaker verification. IEEE Transactions on Audio, Speech and Language Processing 2008, 16(5):980-988.
- Ortega-Garcia J, Gonzalez-Rodriguez J: Overview of speech enhancement techniques for automatic speaker recognition. Proceedings of the 1996 International Conference on Spoken Language Processing (ICSLP '96), October 1996, Philadelphia, Pa, USA 929-932.
- Suhadi S, Stan S, Fingscheidt T, Beaugeant C: An evaluation of VTS and IMM for speaker verification in noise. Proceedings of the 4th European Conference on Speech Communication and Technology (EuroSpeech '03), 2003, Geneva, Switzerland 1669-1672.
- Gales MJF, Young S: HMM recognition in noise using parallel model combination. Proceedings of the European Conference on Speech Communication and Technology (EuroSpeech '93), 1993, Berlin, Germany 837-840.
- Matsui T, Kanno T, Furui S: Speaker recognition using HMM composition in noisy environments. Computer Speech and Language 1996, 10(2):107-116. 10.1006/csla.1996.0007
- Wong LP, Russell M: Text-dependent speaker verification under noisy conditions using parallel model combination. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '01), 2001, Salt Lake City, Utah, USA 1: 457-460.
- Sagayama S, Yamaguchi Y, Takahashi S, Takahashi J: Jacobian approach to fast acoustic model adaptation. Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), April 1997, Munich, Germany 835-838.
- Cerisara C, Rigazio L, Junqua J-C: α-Jacobian environmental adaptation. Speech Communication 2004, 42(1):25-41. 10.1016/j.specom.2003.08.003
- Gonzalez-Rodriguez L, Ortega-Garcia J: Robust speaker recognition through acoustic array processing and spectral normalization. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '97), 1997, Munich, Germany 2: 1103-1106.
- McCowan I, Pelecanos J, Sridharan S: Robust speaker recognition using microphone arrays. A Speaker Odyssey: The Speaker Recognition Workshop, 2001, Crete, Greece 101-106.
- Hu Y, Loizou PC: A generalized subspace approach for enhancing speech corrupted by colored noise. IEEE Transactions on Speech and Audio Processing 2003, 11(4):334-341. 10.1109/TSA.2003.814458
- Kundu A, Chatterjee S, Murthy AS, Sreenivas TV: GMM based bayesian approach to speech enhancement in signal/transform domain. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), April 2008 4893-4896.
- Drygajlo A, El-Maliki M: Speaker verification in noisy environments with combined spectral subtraction and missing feature theory. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), May 1998, Seattle, Wash, USA 121-124.
- Besacier L, Bonastre JF, Fredouille C: Localization and selection of speaker-specific information with statistical modeling. Speech Communication 2000, 31(2):89-106. 10.1016/S0167-6393(99)00070-9
- Ming J, Hazen TJ, Glass JR, Reynolds DA: Robust speaker recognition in noisy conditions. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(5):1711-1723.
- Campbell WM, Quatieri TF, Campbell JP, Weinstein CJ: Multimodal speaker authentication using nonacoustic sensors. Proceedings of the International Workshop on Multimodal User Authentication, 2003, Santa Barbara, Calif, USA 215-222.
- Çetingül HE, Erzin E, Yemez Y, Tekalp AM: Multimodal speaker/speech recognition using lip motion, lip texture and audio. Signal Processing 2006, 86(12):3549-3558. 10.1016/j.sigpro.2006.02.045
- Quatieri TF, Messing DP, Brady K, et al.: Exploiting nonacoustic sensors for speech enhancement. Proceedings of the International Workshop on Multimodal User Authentication, 2003, Santa Barbara, Calif, USA 66-73.
- Zhu B, Hazen TJ, Glass JR: Multimodal speech recognition with ultrasonic sensors. Proceedings of the 8th Annual Conference of the International Speech Communication Association (INTERSPEECH '07), 2007, Antwerp, Belgium 4: 2328-2331.
- Subramanya A, Zhang Z, Liu Z, Droppo J, Acero A: A graphical model for multi-sensory speech processing in air-and-bone conductive microphones. Proceedings of the 9th European Conference on Speech Communication and Technology, 2005, Lisbon, Portugal 2361-2364.
- Furui S: Survey of the State of the Art in Human Language Technology. 1996, http://cslu.cse.ogi.edu/HLTsurvey/ch1node9.html#SECTION17
- Childers DG: Speech Processing and Synthesis Toolboxes. John Wiley & Sons, New York, NY, USA; 2000.
- Rothenberg M, Mahshie JJ: Monitoring vocal fold abduction through vocal fold contact area. Journal of Speech and Hearing Research 1988, 31(3):338-351.
- Baken RJ: Electroglottography. Journal of Voice 1992, 6(2):98-110. 10.1016/S0892-1997(05)80123-7
- Hahn M, Park CK: An improved speech detection algorithm for isolated Korean utterances. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '92), March 1992, San Francisco, Calif, USA 525-528.
- Hershey JR, Olsen PA: Approximating the Kullback Leibler divergence between Gaussian mixture models. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), April 2007, Calif, USA IV317-IV320.
- Jovicic ST, Kasic Z, Dordevic M, Rajkovic M: Serbian emotional speech database: design, processing and evaluation. Proceedings of the 11th International Conference Speech and Computer (SPECOM 04), 2004, St. Petersburg, Russia.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.