Open Access

Audiovisual Speech Synchrony Measure: Application to Biometrics

EURASIP Journal on Advances in Signal Processing 2007, 2007:070186

https://doi.org/10.1155/2007/70186

Received: 18 August 2006

Accepted: 18 March 2007

Published: 7 May 2007

Abstract

Speech is an intrinsically bimodal means of communication: the audio signal originates from the dynamics of the articulators, which are partially visible on the speaker's face. This paper reviews recent work in the field of audiovisual speech, and more specifically techniques developed to measure the level of correspondence between audio and visual speech. It surveys the most common audio and visual speech front-end processing steps, transformations performed on audio, visual, or joint audiovisual feature spaces, and the actual measures of correspondence between audio and visual speech. Finally, the use of a synchrony measure for biometric identity verification based on talking faces is evaluated on the BANCA database.
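To make the idea of a correspondence measure concrete, here is a toy sketch (not the authors' actual method): correlate a per-frame audio energy sequence with a per-frame mouth-opening measurement, searching over a few temporal lags to absorb small audio/video timing offsets. The function names and the lag-search strategy are illustrative assumptions, not taken from the paper.

```python
import math

def pearson_corr(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def synchrony_score(audio_energy, mouth_opening, max_lag=3):
    """Maximum absolute correlation over small temporal lags.

    audio_energy and mouth_opening are per-frame scalar features,
    sampled at the same rate (illustrative assumption).
    """
    scores = []
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a = audio_energy[lag:]
            v = mouth_opening[:len(mouth_opening) - lag]
        else:
            a = audio_energy[:lag]
            v = mouth_opening[-lag:]
        scores.append(abs(pearson_corr(a, v)))
    return max(scores)
```

A genuine talking face should yield a high score, while a replay attack pairing a still face with recorded speech should not; the techniques reviewed in the paper replace this scalar correlation with richer statistical measures over full feature vectors.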


Authors’ Affiliations

(1)
Département Traitement du Signal et de l'Image, École Nationale Supérieure des Télécommunications, CNRS/LTCI


Copyright

© H. Bredin and G. Chollet 2007

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.