
Audiovisual Speech Synchrony Measure: Application to Biometrics

Abstract

Speech is an intrinsically bimodal means of communication: the audio signal originates from the dynamics of the articulators, which are partly visible on the speaker's face. This paper reviews recent work in the field of audiovisual speech, and more specifically techniques developed to measure the degree of correspondence between audio and visual speech. It surveys the most common audio and visual speech front-end processing steps, the transformations performed on audio, visual, or joint audiovisual feature spaces, and the actual measures of correspondence between audio and visual speech. Finally, the use of a synchrony measure for biometric identity verification based on talking faces is evaluated on the BANCA database.
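To make the notion of an audiovisual correspondence measure concrete, here is a minimal sketch of one common approach discussed in the literature the paper reviews: canonical correlation analysis (CCA) between synchronized audio and visual feature streams. This is an illustrative implementation, not the authors' exact method; the feature dimensions and the regularization constant `eps` are assumptions for the example.

```python
import numpy as np

def cca_synchrony(audio_feats, visual_feats, eps=1e-8):
    """First canonical correlation between two time-aligned feature
    streams (shape: frames x dims).  A value near 1 suggests the audio
    and visual streams share a strong linear relationship; desynchronized
    or mismatched streams score lower."""
    A = audio_feats - audio_feats.mean(axis=0)
    V = visual_feats - visual_feats.mean(axis=0)
    # Covariance blocks of the joint audiovisual feature space
    Caa = A.T @ A / len(A)
    Cvv = V.T @ V / len(V)
    Cav = A.T @ V / len(A)

    def inv_sqrt(C):
        # Regularized inverse square root of a symmetric PSD matrix
        w, U = np.linalg.eigh(C)
        return U @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ U.T

    # Whiten each stream; the singular values of the whitened
    # cross-covariance are the canonical correlations.
    M = inv_sqrt(Caa) @ Cav @ inv_sqrt(Cvv)
    return np.linalg.svd(M, compute_uv=False)[0]
```

In a verification setting, such a score can be thresholded to decide whether a face and a voice belong to the same recording: a genuine talking face yields a high synchrony score, while a replayed or dubbed sequence tends to score lower.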

References

  1. Potamianos G, Neti C, Luettin J, Matthews I: Audio-visual automatic speech recognition: an overview. In Issues in Visual and Audio-Visual Speech Processing. Edited by: Bailly G, Vatikiotis-Bateson E, Perrier P. MIT Press, Cambridge, Mass, USA; 2004. chapter 10.

  2. Chen T: Audiovisual speech processing. IEEE Signal Processing Magazine 2001, 18(1):9-21. 10.1109/79.911195

  3. Chibelushi CC, Deravi F, Mason JS: A review of speech-based bimodal recognition. IEEE Transactions on Multimedia 2002, 4(1):23-37. 10.1109/6046.985551

  4. Barker JP, Berthommier F: Evidence of correlation between acoustic and visual features of speech. Proceedings of the 14th International Congress of Phonetic Sciences (ICPhS '99), August 1999, San Francisco, Calif, USA, 199-202.

  5. Yehia H, Rubin P, Vatikiotis-Bateson E: Quantitative association of vocal-tract and facial behavior. Speech Communication 1998, 26(1-2):23-43. 10.1016/S0167-6393(98)00048-X

  6. Bailly-Baillière E, Bengio S, Bimbot F, et al.: The BANCA database and evaluation protocol. In Proceedings of the 4th International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA '03), June 2003, Guildford, UK, Lecture Notes in Computer Science. Volume 2688. Springer; 625-638.

  7. Hershey J, Movellan J: Audio-vision: using audio-visual synchrony to locate sounds. In Advances in Neural Information Processing Systems 11. Edited by: Kearns MS, Solla SA, Cohn DA. MIT Press, Cambridge, Mass, USA; 1999:813-819.

  8. Bredin H, Miguel A, Witten IH, Chollet G: Detecting replay attacks in audiovisual identity verification. Proceedings of the 31st IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '06), May 2006, Toulouse, France 1:621-624.

  9. Fisher JW III, Darrell T: Speaker association with signal-level audiovisual fusion. IEEE Transactions on Multimedia 2004, 6(3):406-413. 10.1109/TMM.2004.827503

  10. Chetty G, Wagner M: "Liveness" verification in audio-video authentication. Proceedings of the 10th Australian International Conference on Speech Science and Technology (SST '04), December 2004, Sydney, Australia, 358-363.

  11. Cutler R, Davis L: Look who's talking: speaker detection using video and audio correlation. Proceedings of IEEE International Conference on Multimedia and Expo (ICME '00), July-August 2000, New York, NY, USA 3:1589-1592.

  12. Iyengar G, Nock HJ, Neti C: Audio-visual synchrony for detection of monologues in video archives. Proceedings of IEEE International Conference on Multimedia and Expo (ICME '03), July 2003, Baltimore, Md, USA 1:329-332.

  13. Nock HJ, Iyengar G, Neti C: Assessing face and speech consistency for monologue detection in video. Proceedings of the 10th ACM International Conference on Multimedia (MULTIMEDIA '02), December 2002, Juan-les-Pins, France, 303-306.

  14. Slaney M, Covell M: FaceSync: a linear operator for measuring synchronization of video facial images and audio tracks. In Advances in Neural Information Processing Systems 13. MIT Press, Cambridge, Mass, USA; 2000:814-820.

  15. Reynolds DA, Quatieri TF, Dunn RB: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 2000, 10(1-3):19-41.

  16. Sugamura N, Itakura F: Speech analysis and synthesis methods developed at ECL in NTT - from LPC to LSP. Speech Communication 1986, 5(2):199-215. 10.1016/0167-6393(86)90008-7

  17. Bregler C, Konig Y: "Eigenlips" for robust speech recognition. Proceedings of the 19th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '94), April 1994, Adelaide, Australia 2:669-672.

  18. Turk M, Pentland A: Eigenfaces for recognition. Journal of Cognitive Neuroscience 1991, 3(1):71-86. 10.1162/jocn.1991.3.1.71

  19. Goecke R, Millar B: Statistical analysis of the relationship between audio and video speech parameters for Australian English. Proceedings of the ISCA Tutorial and Research Workshop on Audio Visual Speech Processing (AVSP '03), September 2003, Saint-Jorioz, France, 133-138.

  20. Eveno N, Besacier L: A speaker independent "liveness" test for audio-visual biometrics. Proceedings of the 9th European Conference on Speech Communication and Technology (EuroSpeech '05), September 2005, Lisbon, Portugal, 3081-3084.

  21. Eveno N, Besacier L: Co-inertia analysis for "liveness" test in audio-visual biometrics. Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis (ISPA '05), September 2005, Zagreb, Croatia, 257-261.

  22. Fox N, Reilly RB: Audio-visual speaker identification based on the use of dynamic audio and visual features. In Proceedings of the 4th International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA '03), June 2003, Guildford, UK, Lecture Notes in Computer Science. Volume 2688. Springer; 743-751.

  23. Chibelushi CC, Mason JS, Deravi F: Integrated person identification using voice and facial features. IEE Colloquium on Image Processing for Security Applications, March 1997, London, UK 4:1-5.

  24. Hyvärinen A: Survey on independent component analysis. Neural Computing Surveys 1999, 2:94-128.

  25. Sodoyer D, Girin L, Jutten C, Schwartz J-L: Speech extraction based on ICA and audio-visual coherence. Proceedings of the 7th International Symposium on Signal Processing and Its Applications (ISSPA '03), July 2003, Paris, France 2:65-68.

  26. Smaragdis P, Casey M: Audio/visual independent components. Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA '03), April 2003, Nara, Japan, 709-714.

  27. ICA. www.cis.hut.fi/projects/ica/fastica/

  28. Canonical Correlation Analysis. people.imt.liu.se/~magnus/cca/

  29. Dolédec S, Chessel D: Co-inertia analysis: an alternative method for studying species-environment relationships. Freshwater Biology 1994, 31:277-294. 10.1111/j.1365-2427.1994.tb01741.x

  30. Fisher JW, Darrell T, Freeman WT, Viola P: Learning joint statistical models for audio-visual fusion and segregation. In Advances in Neural Information Processing Systems 13. Edited by: Leen TK, Dietterich TG, Tresp V. MIT Press, Cambridge, Mass, USA; 2001:772-778.

  31. Sodoyer D, Schwartz J-L, Girin L, Klinkisch J, Jutten C: Separation of audio-visual speech sources: a new approach exploiting the audio-visual coherence of speech stimuli. EURASIP Journal on Applied Signal Processing 2002, 2002(11):1165-1173. 10.1155/S1110865702207015

  32. Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 1989, 77(2):257-286. 10.1109/5.18626

  33. Bengio S: An asynchronous hidden Markov model for audio-visual speech recognition. In Advances in Neural Information Processing Systems 15. Edited by: Becker S, Thrun S, Obermayer K. MIT Press, Cambridge, Mass, USA; 2003:1213-1220.

  34. Bredin H, Aversano G, Mokbel C, Chollet G: The BioSecure talking-face reference system. Proceedings of the 2nd Workshop on Multimodal User Authentication (MMUA '06), May 2006, Toulouse, France.

  35. Messer K, Matas J, Kittler J, Luettin J, Maitre G: XM2VTSDB: the extended M2VTS database. Proceedings of International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA '99), March 1999, Washington, DC, USA, 72-77.

  36. BT-DAVID. eegalilee.swan.ac.uk/

  37. Garcia-Salicetti S, Beumier C, Chollet G, et al.: BIOMET: a multimodal person authentication database including face, voice, fingerprint, hand and signature modalities. Proceedings of the 4th International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA '03), June 2003, Guildford, UK, 845-853.

  38. Martin AF, Doddington GR, Kamm T, Ordowski M, Przybocki M: The DET curve in assessment of detection task performance. Proceedings of the 5th European Conference on Speech Communication and Technology (EuroSpeech '97), September 1997, Rhodes, Greece 4:1895-1898.

  39. Sargin ME, Erzin E, Yemez Y, Tekalp AM: Multimodal speaker identification using canonical correlation analysis. Proceedings of the 31st IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '06), May 2006, Toulouse, France 1:613-616.

  40. Text Retrieval Conference Video Track. trec.nist.gov/

  41. Bredin H, Chollet G: Audio-visual speech synchrony measure for talking-face identity verification. Proceedings of the 32nd IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '07), April 2007, Honolulu, Hawaii, USA.

Author information

Corresponding author

Correspondence to Hervé Bredin.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Bredin, H., Chollet, G. Audiovisual Speech Synchrony Measure: Application to Biometrics. EURASIP J. Adv. Signal Process. 2007, 070186 (2007). https://doi.org/10.1155/2007/70186
