  • Research Article
  • Open access

Speaker Separation and Tracking System

Abstract

Replicating human hearing in electronics is nontrivial under the constraints that only two microphones are available (even when more than two speakers are present) and that the user carries the device at all times (i.e., a mobile device weighing less than 100 g). Our novel contribution is a two-microphone system that combines blind source separation with speaker tracking. The system handles more than two speakers and overlapping speech in a mobile environment, and it supports a feedback loop from the speaker-tracking stage back to the blind source separation, which can improve separation performance. To develop and optimize the system, we established a novel benchmark, which we present here. Using the introduced complexity metrics, we analyze the tradeoffs between system performance and computational load. Our results show that, in our setting, source separation depends significantly more on frame duration than on sampling frequency.
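For readers unfamiliar with two-microphone blind source separation, the sketch below illustrates one common approach of this kind: DUET-style time-frequency masking, in which inter-channel level and phase differences are clustered so that each time-frequency bin is assigned to one speaker. This is only an illustrative sketch, not the system described in the article; the function name, parameters, and the k-means clustering step are assumptions made for demonstration.

```python
# Illustrative sketch only: DUET-style two-microphone separation via
# time-frequency masking. Not the authors' implementation; all names and
# parameter choices here are assumptions for demonstration purposes.
import numpy as np
from scipy.signal import stft, istft
from scipy.cluster.vq import kmeans2

def separate_two_mic(x1, x2, fs, n_speakers=3, frame_len=1024):
    """Separate n_speakers sources from a two-microphone mixture."""
    # Short-time Fourier transforms of both channels.
    _, _, X1 = stft(x1, fs=fs, nperseg=frame_len)
    _, _, X2 = stft(x2, fs=fs, nperseg=frame_len)

    eps = 1e-10
    ratio = X2 / (X1 + eps)
    # Per-bin mixing cues: inter-channel level ratio and phase difference
    # (the latter relates to the inter-microphone delay of each source).
    alpha = np.clip(np.abs(ratio), 0.0, 10.0)   # clip outliers near-silent bins
    delta = np.angle(ratio)

    # Cluster the (alpha, delta) features into one cluster per speaker,
    # assuming W-disjoint orthogonality (each bin dominated by one speaker).
    feats = np.stack([alpha.ravel(), delta.ravel()], axis=1)
    _, labels = kmeans2(feats, n_speakers, minit='++')
    labels = labels.reshape(X1.shape)

    # Binary mask per speaker, applied to one reference channel, then ISTFT.
    sources = []
    for k in range(n_speakers):
        mask = (labels == k).astype(float)
        _, s = istft(mask * X1, fs=fs, nperseg=frame_len)
        sources.append(s)
    return sources
```

A practical system would additionally smooth the masks over time and, as the article discusses for its own pipeline, feed the speaker-tracking decisions back to the separation stage; the frame length `frame_len` is the kind of parameter whose choice the benchmark results in the article address.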


Author information

Corresponding author

Correspondence to U Anliker.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 Generic License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Anliker, U., Randall, J. & Tröster, G. Speaker Separation and Tracking System. EURASIP J. Adv. Signal Process. 2006, 029104 (2006). https://doi.org/10.1155/ASP/2006/29104



  • DOI: https://doi.org/10.1155/ASP/2006/29104
