Research Article · Open Access

Speech/Non-Speech Segmentation Based on Phoneme Recognition Features

Abstract

This work assesses different approaches for speech and non-speech segmentation of audio data and proposes a new, high-level representation of audio signals based on phoneme recognition features suitable for speech/non-speech discrimination tasks. Unlike previous model-based approaches, where speech and non-speech classes were usually modeled by several models, we develop a representation where just one model per class is used in the segmentation process. For this purpose, four measures based on consonant-vowel pairs obtained from different phoneme speech recognizers are introduced and applied in two different segmentation-classification frameworks. The segmentation systems were evaluated on different broadcast news databases. The evaluation results indicate that the proposed phoneme recognition features are better than the standard mel-frequency cepstral coefficients and posterior probability-based features (entropy and dynamism). The proposed features proved to be more robust and less sensitive to different training and unforeseen conditions. Additional experiments with fusion models based on cepstral and the proposed phoneme recognition features produced the highest scores overall, which indicates that the most suitable method for speech/non-speech segmentation is a combination of low-level acoustic features and high-level recognition features.
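
The abstract does not spell out the four consonant-vowel (CV) measures, so the sketch below is only an illustration of the kinds of features involved: the entropy and dynamism of phoneme posterior probabilities follow their standard definitions from the speech/music discrimination literature, while `cv_duration_ratio` is a hypothetical stand-in for a CV-based statistic, not the paper's actual measure.

```python
# Illustrative sketch only (not the paper's implementation). Entropy and
# dynamism use their standard posterior-probability definitions;
# cv_duration_ratio is a hypothetical consonant-vowel statistic.
import numpy as np

def entropy_dynamism(posteriors, eps=1e-10):
    """Per-frame entropy and dynamism of phoneme posteriors.

    posteriors: (T, K) array, one row of K class probabilities per frame.
    """
    p = np.clip(posteriors, eps, 1.0)
    entropy = -np.sum(p * np.log(p), axis=1)            # high for music/noise
    dynamism = np.sum(np.diff(p, axis=0) ** 2, axis=1)  # high for speech
    dynamism = np.concatenate(([0.0], dynamism))        # pad the first frame
    return entropy, dynamism

def cv_duration_ratio(phone_labels, vowels):
    """Hypothetical CV measure: fraction of decoded frames labeled as vowels.

    Real speech alternates consonants and vowels, so the ratio stays well
    inside (0, 1); music or noise decoded by a phoneme recognizer tends to
    collapse onto a few phones, pushing the ratio toward a degenerate value.
    """
    return np.isin(np.asarray(phone_labels), list(vowels)).mean()

# Example: uniform posteriors over 40 phones, a non-speech-like input.
T, K = 100, 40
post = np.full((T, K), 1.0 / K)
H, D = entropy_dynamism(post)
print(H[0], D[1])  # entropy = log(40) ~ 3.69 per frame, dynamism = 0
```

In a segmentation-classification framework of the kind the paper describes, per-frame features like these would typically be smoothed over a window and decoded with a single model per class (speech, non-speech), for example two competing HMMs.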


Author information

Corresponding author

Correspondence to Janez Žibert.

Rights and permissions

Open Access. This article is distributed under the terms of the Creative Commons Attribution 2.0 Generic License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Žibert, J., Pavešić, N. & Mihelič, F. Speech/Non-Speech Segmentation Based on Phoneme Recognition Features. EURASIP J. Adv. Signal Process. 2006, 090495 (2006). https://doi.org/10.1155/ASP/2006/90495
