- Research Article
- Open access
Speech/Non-Speech Segmentation Based on Phoneme Recognition Features
EURASIP Journal on Advances in Signal Processing volume 2006, Article number: 090495 (2006)
Abstract
This work assesses different approaches for speech and non-speech segmentation of audio data and proposes a new, high-level representation of audio signals based on phoneme recognition features suitable for speech/non-speech discrimination tasks. Unlike previous model-based approaches, where speech and non-speech classes were usually modeled by several models, we develop a representation where just one model per class is used in the segmentation process. For this purpose, four measures based on consonant-vowel pairs obtained from different phoneme speech recognizers are introduced and applied in two different segmentation-classification frameworks. The segmentation systems were evaluated on different broadcast news databases. The evaluation results indicate that the proposed phoneme recognition features are better than the standard mel-frequency cepstral coefficients and posterior probability-based features (entropy and dynamism). The proposed features proved to be more robust and less sensitive to different training and unforeseen conditions. Additional experiments with fusion models based on cepstral and the proposed phoneme recognition features produced the highest scores overall, which indicates that the most suitable method for speech/non-speech segmentation is a combination of low-level acoustic features and high-level recognition features.
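The intuition behind the phoneme recognition features can be illustrated with a toy sketch. The snippet below is a minimal, hypothetical example (not the paper's actual measures): it assumes a phoneme recognizer has already produced one label per frame, and computes a single consonant-vowel change rate, exploiting the observation that speech alternates consonants and vowels regularly, while non-speech decoded by a phoneme recognizer tends to yield long, irregular runs of one class. The vowel set, threshold, and label sequences are all invented for illustration.

```python
# Hypothetical sketch of a consonant-vowel (CV) based speech/non-speech
# feature. Assumption: a phoneme recognizer has already labeled each frame;
# the label sequences below are toy data, not recognizer output.

VOWELS = {"a", "e", "i", "o", "u"}  # illustrative vowel inventory

def cv_change_rate(labels):
    """Fraction of frame-to-frame transitions that switch between the
    consonant and vowel classes (tends to be higher for speech)."""
    classes = ["V" if p in VOWELS else "C" for p in labels]
    if len(classes) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(classes, classes[1:]) if a != b)
    return changes / (len(classes) - 1)

def is_speech(labels, threshold=0.2):
    """A single threshold on one measure stands in for the paper's
    one-model-per-class classifier; 0.2 is an arbitrary choice."""
    return cv_change_rate(labels) >= threshold

# Toy segments: speech-like CV alternation vs. a monotone consonant run.
speech_like = list("tatokasemilo")
noise_like = list("ssssssssssss")
```

In the paper's actual systems such measures are computed over sliding windows and fed to a classifier with one model per class; this sketch only conveys why CV regularity separates the two classes at all.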
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cite this article
Žibert, J., Pavešić, N. & Mihelič, F. Speech/Non-Speech Segmentation Based on Phoneme Recognition Features. EURASIP J. Adv. Signal Process. 2006, 090495 (2006). https://doi.org/10.1155/ASP/2006/90495