- Research Article
- Open access
Speech/Non-Speech Segmentation Based on Phoneme Recognition Features
EURASIP Journal on Advances in Signal Processing volume 2006, Article number: 090495 (2006)
Abstract
This work assesses different approaches for speech and non-speech segmentation of audio data and proposes a new, high-level representation of audio signals based on phoneme recognition features suitable for speech/non-speech discrimination tasks. Unlike previous model-based approaches, where speech and non-speech classes were usually modeled by several models, we develop a representation where just one model per class is used in the segmentation process. For this purpose, four measures based on consonant-vowel pairs obtained from different phoneme speech recognizers are introduced and applied in two different segmentation-classification frameworks. The segmentation systems were evaluated on different broadcast news databases. The evaluation results indicate that the proposed phoneme recognition features are better than the standard mel-frequency cepstral coefficients and posterior probability-based features (entropy and dynamism). The proposed features proved to be more robust and less sensitive to different training and unforeseen conditions. Additional experiments with fusion models based on cepstral and the proposed phoneme recognition features produced the highest scores overall, which indicates that the most suitable method for speech/non-speech segmentation is a combination of low-level acoustic features and high-level recognition features.
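The intuition behind the phoneme recognition features can be illustrated with a toy sketch. The snippet below is a minimal, hypothetical example (not the paper's actual measures): it assumes a phoneme recognizer has already produced one label per frame, and computes a single consonant-vowel change rate, exploiting the observation that speech alternates consonants and vowels regularly, while non-speech decoded by a phoneme recognizer tends to yield long, irregular runs of one class. The vowel set, threshold, and label sequences are all invented for illustration.

```python
# Hypothetical sketch of a consonant-vowel (CV) based speech/non-speech
# feature. Assumption: a phoneme recognizer has already labeled each frame;
# the label sequences below are toy data, not recognizer output.

VOWELS = {"a", "e", "i", "o", "u"}  # illustrative vowel inventory

def cv_change_rate(labels):
    """Fraction of frame-to-frame transitions that switch between the
    consonant and vowel classes (tends to be higher for speech)."""
    classes = ["V" if p in VOWELS else "C" for p in labels]
    if len(classes) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(classes, classes[1:]) if a != b)
    return changes / (len(classes) - 1)

def is_speech(labels, threshold=0.2):
    """A single threshold on one measure stands in for the paper's
    one-model-per-class classifier; 0.2 is an arbitrary choice."""
    return cv_change_rate(labels) >= threshold

# Toy segments: speech-like CV alternation vs. a monotone consonant run.
speech_like = list("tatokasemilo")
noise_like = list("ssssssssssss")
```

In the paper's actual systems such measures are computed over sliding windows and fed to a classifier with one model per class; this sketch only conveys why CV regularity separates the two classes at all.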
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cite this article
Žibert, J., Pavešić, N. & Mihelič, F. Speech/Non-Speech Segmentation Based on Phoneme Recognition Features. EURASIP J. Adv. Signal Process. 2006, 090495 (2006). https://doi.org/10.1155/ASP/2006/90495