  • Research Article
  • Open Access

Speech/Non-Speech Segmentation Based on Phoneme Recognition Features

EURASIP Journal on Advances in Signal Processing20062006:090495

https://doi.org/10.1155/ASP/2006/90495

  • Received: 16 September 2005
  • Accepted: 18 February 2006

Abstract

This work assesses different approaches to speech/non-speech segmentation of audio data and proposes a new high-level representation of audio signals based on phoneme recognition features, suitable for speech/non-speech discrimination tasks. Unlike previous model-based approaches, in which the speech and non-speech classes were usually represented by several models each, we develop a representation in which just one model per class is used in the segmentation process. For this purpose, four measures based on consonant-vowel pairs obtained from different phoneme speech recognizers are introduced and applied in two different segmentation-classification frameworks. The segmentation systems were evaluated on different broadcast news databases. The evaluation results indicate that the proposed phoneme recognition features outperform both standard mel-frequency cepstral coefficients and posterior-probability-based features (entropy and dynamism). The proposed features also proved to be more robust and less sensitive to differing training and unforeseen conditions. Additional experiments with fusion models based on cepstral and the proposed phoneme recognition features produced the highest overall scores, which indicates that the most suitable method for speech/non-speech segmentation is a combination of low-level acoustic features and high-level recognition features.
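The abstract does not specify the four consonant-vowel measures, so the following is only an illustrative sketch of the general idea, not the paper's actual features: given the output of a phoneme recognizer as labeled time segments, a simple consonant-to-vowel transition rate tends to be high for speech (which alternates consonants and vowels rapidly) and low for music or noise (where the recognizer emits long, unvarying label runs). The vowel set and segment format here are assumptions for the example.

```python
# Illustrative sketch (not the paper's exact measures): given phoneme
# recognizer output as (label, start_s, end_s) segments, compute a simple
# consonant->vowel (CV) transition rate, which tends to be high for
# speech and low for non-speech audio.

# A small, assumed vowel inventory for the example (not a full phone set).
VOWELS = {"aa", "ae", "ah", "ao", "ay", "eh", "ey", "ih", "iy", "ow", "uh", "uw"}

def cv_transition_rate(segments):
    """Rate of consonant->vowel transitions per second of audio.

    segments: list of (phoneme_label, start_s, end_s), sorted by time.
    """
    if not segments:
        return 0.0
    duration = segments[-1][2] - segments[0][1]
    if duration <= 0:
        return 0.0
    transitions = 0
    for (prev, _, _), (cur, _, _) in zip(segments, segments[1:]):
        # Count a transition when a non-vowel label is followed by a vowel.
        if prev not in VOWELS and cur in VOWELS:
            transitions += 1
    return transitions / duration

# Speech-like alternation yields a high rate; one long label run does not.
speech = [("s", 0.0, 0.1), ("iy", 0.1, 0.3), ("t", 0.3, 0.4), ("ah", 0.4, 0.6)]
music = [("sil", 0.0, 0.6)]
print(cv_transition_rate(speech))  # about 3.33 transitions/s
print(cv_transition_rate(music))   # 0.0
```

In a segmentation framework like the ones the abstract describes, such a statistic would be computed over a sliding window and fed, alongside cepstral features, to a per-class model (one model for speech, one for non-speech).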

Keywords

  • Discrimination Task
  • Audio Signal
  • Acoustic Feature
  • Segmentation Process
  • Recognition Feature


Authors’ Affiliations

(1) Faculty of Electrical Engineering, University of Ljubljana, Tržaška 25, 1000 Ljubljana, Slovenia

