- Research Article
- Open access
Audiovisual Speech Synchrony Measure: Application to Biometrics
EURASIP Journal on Advances in Signal Processing volume 2007, Article number: 070186 (2007)
Abstract
Speech is a means of communication which is intrinsically bimodal: the audio signal originates from the dynamics of the articulators. This paper reviews recent work in the field of audiovisual speech, and more specifically techniques developed to measure the level of correspondence between audio and visual speech. It surveys the most common audio and visual speech front-end processing techniques, the transformations performed on audio, visual, or joint audiovisual feature spaces, and the actual measures of correspondence between audio and visual speech. Finally, the use of synchrony measures for biometric identity verification based on talking faces is evaluated on the BANCA database.
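To make the idea of a correspondence measure concrete, the sketch below scores how well a one-dimensional audio feature track follows a one-dimensional visual feature track. It is only an illustration under stated assumptions, not the method evaluated in the paper: the choice of Pearson correlation, the per-frame features (for example acoustic energy and mouth opening), the function names `pearson` and `synchrony_score`, and the `max_lag` parameter are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        # A constant track carries no dynamics to correlate with.
        return 0.0
    return cov / math.sqrt(vx * vy)


def synchrony_score(audio, visual, max_lag=2):
    """Largest absolute correlation over small frame shifts.

    Assumption: both tracks are sampled at the same frame rate and
    have the same length; max_lag is a hypothetical tolerance in frames.
    """
    if len(audio) != len(visual):
        raise ValueError("tracks must have the same number of frames")
    n = len(audio)
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, v = audio[lag:], visual[:n - lag]
        else:
            a, v = audio[:n + lag], visual[-lag:]
        if len(a) >= 2:
            best = max(best, abs(pearson(a, v)))
    return best


# Toy per-frame features (made-up values, for illustration only).
audio_energy = [0.1, 0.5, 0.9, 0.4, 0.2, 0.7, 1.0, 0.3]
mouth_opening = [2 * e + 1 for e in audio_energy]  # perfectly coupled track
still_mouth = [1.0] * len(audio_energy)            # no visual dynamics

score_live = synchrony_score(audio_energy, mouth_opening)
score_still = synchrony_score(audio_energy, still_mouth)
```

In this toy setting a track that moves with the audio scores close to 1 and a motionless track scores 0. A verification system would compare such a score against a threshold, and the techniques reviewed in the paper generally operate on multidimensional features rather than single scalar tracks.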
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Bredin, H., Chollet, G. Audiovisual Speech Synchrony Measure: Application to Biometrics. EURASIP J. Adv. Signal Process. 2007, 070186 (2007). https://doi.org/10.1155/2007/70186
DOI: https://doi.org/10.1155/2007/70186