Robust Distant Speech Recognition by Combining Multiple Microphone-Array Processing with Position-Dependent CMN

Wang, Longbiao; Kitaoka, Norihide; Nakagawa, Seiichi

doi:10.1155/ASP/2006/95491

Research Article
Open access
Published: 01 December 2006

EURASIP Journal on Advances in Signal Processing volume 2006, Article number: 095491 (2006)

Abstract

We propose a robust distant speech recognition method that combines multiple microphone-array processing with position-dependent cepstral mean normalization (CMN). In the recognition stage, the system estimates the speaker position and selects the compensation parameters, estimated a priori, that correspond to the estimated position. The system then applies CMN to the speech (i.e., position-dependent CMN) and performs speech recognition for each channel. The features obtained from the multiple channels are integrated in one of two ways. The first, called multiple-decoder processing, obtains the final result by taking the maximum vote or the maximum summed likelihood over the recognition results from the multiple channels. The second, called single-decoder processing, calculates the output probability of each input at the frame level and uses a single decoder on these output probabilities to perform speech recognition, which lowers the computational cost. We combine delay-and-sum beamforming with multiple-decoder or single-decoder processing, and term the combination multiple microphone-array processing. We evaluated the proposed method on limited-vocabulary (100 words) distant isolated-word recognition in a real environment. The proposed multiple microphone-array processing using multiple decoders with position-dependent CMN achieved a 3.2% improvement (50% relative error reduction rate) over delay-and-sum beamforming with conventional CMN (i.e., the conventional method). Multiple microphone-array processing using a single decoder needs about one-third of the computational time of that using multiple decoders, without degrading speech recognition performance.
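
As an illustration only, the position-dependent CMN step and the two multiple-decoder combination rules described in the abstract might be sketched as follows. This is not the authors' implementation. It assumes that the a priori compensation parameter is a cepstral mean vector stored per known position, that the nearest stored position is used, and that per-channel scores are log-likelihoods that can be summed. All function names and the data layout are hypothetical.

```python
from collections import Counter

import numpy as np


def position_dependent_cmn(cepstra, est_position, position_means):
    """Subtract the a priori cepstral mean of the nearest stored position.

    cepstra:        (frames, dims) cepstral features of one channel.
    est_position:   estimated speaker coordinates, e.g. (x, y).
    position_means: dict mapping coordinates to a (dims,) mean vector
                    (hypothetical layout, assumed for this sketch).
    """
    keys = list(position_means)
    dists = [np.linalg.norm(np.subtract(k, est_position)) for k in keys]
    nearest = keys[int(np.argmin(dists))]
    return np.asarray(cepstra) - position_means[nearest]


def max_vote(channel_hypotheses):
    """Return the word recognized by the largest number of channels."""
    return Counter(channel_hypotheses).most_common(1)[0][0]


def max_summed_likelihood(channel_scores):
    """Return the word whose summed score over all channels is largest.

    channel_scores: one dict per channel mapping word -> log-likelihood
                    (the log domain is an assumption of this sketch).
    """
    totals = {}
    for scores in channel_scores:
        for word, score in scores.items():
            totals[word] = totals.get(word, 0.0) + score
    return max(totals, key=totals.get)
```

The two combination rules can disagree: a word that wins on most channels by a small margin may still lose to a word that one channel scores much more highly, which is why the abstract lists them as separate options.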

Author information

Authors and Affiliations

Department of Information and Computer Sciences, Toyohashi University of Technology, Toyohashi-shi, 441-8580, Japan
Longbiao Wang, Norihide Kitaoka & Seiichi Nakagawa

Corresponding author

Correspondence to Longbiao Wang.

Rights and permissions

Open Access. This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Wang, L., Kitaoka, N. & Nakagawa, S. Robust Distant Speech Recognition by Combining Multiple Microphone-Array Processing with Position-Dependent CMN. EURASIP J. Adv. Signal Process. 2006, 095491 (2006). https://doi.org/10.1155/ASP/2006/95491

Received: 29 December 2005
Revised: 20 May 2006
Accepted: 11 June 2006
Published: 01 December 2006
DOI: https://doi.org/10.1155/ASP/2006/95491
