A Comprehensive Noise Robust Speech Parameterization Algorithm Using Wavelet Packet Decomposition-Based Denoising and Speech Feature Representation Techniques

Abstract

This paper addresses automatic speech recognition in noisy and otherwise adverse environments. Its main goal is the definition, implementation, and evaluation of a novel noise robust speech signal parameterization algorithm. The proposed procedure is based on a time-frequency representation of the speech signal obtained by wavelet packet decomposition. A new modified soft thresholding algorithm, based on time-frequency adaptive threshold determination, was developed to efficiently reduce the level of additive noise in the noisy input speech signal. A two-stage Gaussian mixture model (GMM)-based classifier was developed to perform speech/nonspeech as well as voiced/unvoiced classification. An adaptive topology of the wavelet packet decomposition tree, driven by the voiced/unvoiced detection, was introduced so that voiced and unvoiced segments of the speech signal are analyzed separately. The main feature vector combines log-root compressed wavelet packet parameters and autoregressive parameters. The final output feature vector is produced by a two-stage feature vector postprocessing procedure. In the experimental framework, the noisy speech databases Aurora 2 and Aurora 3 were used together with the corresponding standardized acoustic model training/testing procedures. The automatic speech recognition performance achieved with the proposed noise robust speech parameterization procedure was compared to that of the standardized mel-frequency cepstral coefficient (MFCC) feature extraction procedures ETSI ES 201 108 and ETSI ES 202 050.
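
To make the denoising idea in the abstract more concrete, the following is a minimal sketch of wavelet packet decomposition followed by soft thresholding. It is not the authors' algorithm: the abstract does not specify the modified, time-frequency adaptive threshold, so this sketch assumes a Haar filter pair, a full (non-adaptive) decomposition tree, and the classical universal threshold with a median-based noise estimate. All function names (`haar_split`, `wp_decompose`, `denoise`, and so on) are hypothetical and chosen only for illustration.

```python
import math
import random
from statistics import median

SQRT2 = math.sqrt(2.0)

def haar_split(x):
    """One orthonormal Haar analysis step: (low-pass, high-pass) halves."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail

def haar_merge(approx, detail):
    """Inverse of haar_split: interleave the reconstructed sample pairs."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out

def wp_decompose(x, levels):
    """Full wavelet packet tree: every node is split at every level.
    Returns the 2**levels leaf nodes in split order."""
    nodes = [list(x)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            a, d = haar_split(node)
            next_nodes.extend([a, d])
        nodes = next_nodes
    return nodes

def wp_reconstruct(leaves):
    """Merge sibling leaves pairwise until the full-length signal is recovered."""
    nodes = list(leaves)
    while len(nodes) > 1:
        nodes = [haar_merge(nodes[i], nodes[i + 1]) for i in range(0, len(nodes), 2)]
    return nodes[0]

def soft_threshold(coeffs, t):
    """Classical soft thresholding: shrink magnitudes by t, zero the rest."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

def denoise(x, levels=3):
    """Assumed scheme: universal threshold, one value for all leaves.
    The lowest-frequency leaf is left untouched."""
    if len(x) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    # Noise level estimated from the first-level detail coefficients (MAD rule).
    _, d1 = haar_split(x)
    sigma = median([abs(v) for v in d1]) / 0.6745
    threshold = sigma * math.sqrt(2.0 * math.log(len(x)))
    leaves = wp_decompose(x, levels)
    kept = [leaves[0]] + [soft_threshold(c, threshold) for c in leaves[1:]]
    return wp_reconstruct(kept)

if __name__ == "__main__":
    random.seed(0)
    n = 256
    clean = [math.sin(2 * math.pi * 2 * i / n) for i in range(n)]
    noisy = [c + random.gauss(0.0, 0.3) for c in clean]
    cleaned = denoise(noisy, levels=3)
    mse = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)
    print("MSE noisy:   ", mse(noisy, clean))
    print("MSE denoised:", mse(cleaned, clean))
```

A single global threshold is the simplest possible choice. The procedure described in the abstract instead adapts the threshold over time and frequency and changes the tree topology for voiced and unvoiced segments, neither of which this sketch attempts.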

Author information

Corresponding author

Correspondence to Bojan Kotnik.

Rights and permissions

Open Access: This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Kotnik, B., Kačič, Z. A Comprehensive Noise Robust Speech Parameterization Algorithm Using Wavelet Packet Decomposition-Based Denoising and Speech Feature Representation Techniques. EURASIP J. Adv. Signal Process. 2007, 064102 (2007). https://doi.org/10.1155/2007/64102

Keywords

  • Speech Signal
  • Gaussian Mixture Model
  • Wavelet Packet
  • Automatic Speech Recognition
  • Noisy Speech