Open Access

Audiovisual Speech Synchrony Measure: Application to Biometrics

EURASIP Journal on Advances in Signal Processing 2007, 2007:070186

https://doi.org/10.1155/2007/70186

Received: 18 August 2006

Accepted: 18 March 2007

Published: 7 May 2007

Abstract

Speech is an intrinsically bimodal means of communication: the audio signal originates from the dynamics of the articulators, which are partially visible on the speaker's face. This paper reviews recent work in the field of audiovisual speech, and more specifically techniques developed to measure the level of correspondence between audio and visual speech. It surveys the most common audio and visual speech front-end processing steps, transformations performed on audio, visual, or joint audiovisual feature spaces, and the actual measures of correspondence between audio and visual speech. Finally, the use of a synchrony measure for biometric identity verification based on talking faces is evaluated on the BANCA database.
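To make the idea of a correspondence measure concrete, here is a toy sketch (not the authors' actual method): correlate a per-frame audio energy sequence with a per-frame mouth-opening measurement, searching over a few temporal lags to absorb small audio/video timing offsets. The function names and the lag-search strategy are illustrative assumptions, not taken from the paper.

```python
import math

def pearson_corr(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def synchrony_score(audio_energy, mouth_opening, max_lag=3):
    """Maximum absolute correlation over small temporal lags.

    audio_energy and mouth_opening are per-frame scalar features,
    sampled at the same rate (illustrative assumption).
    """
    scores = []
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a = audio_energy[lag:]
            v = mouth_opening[:len(mouth_opening) - lag]
        else:
            a = audio_energy[:lag]
            v = mouth_opening[-lag:]
        scores.append(abs(pearson_corr(a, v)))
    return max(scores)
```

A genuine talking face should yield a high score, while a replay attack pairing a still face with recorded speech should not; the techniques reviewed in the paper replace this scalar correlation with richer statistical measures over full feature vectors.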


Authors’ Affiliations

(1)
Département Traitement du Signal et de l'Image, École Nationale Supérieure des Télécommunications, CNRS/LTCI


Copyright

© H. Bredin and G. Chollet 2007

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.