- Research Article
- Open Access
Histogram Equalization to Model Adaptation for Robust Speech Recognition
© Y. Suh and H. Kim. 2010
- Received: 2 November 2009
- Accepted: 13 February 2010
- Published: 17 May 2010
We propose a new model adaptation method based on the histogram equalization technique for providing robustness in noisy environments. The trained acoustic mean models of a speech recognizer are adapted to environmentally matched conditions by applying the histogram equalization algorithm on a single-utterance basis. For more robust speech recognition in heavily noisy conditions, the trained acoustic covariance models are efficiently adapted by a signal-to-noise ratio-dependent linear interpolation between the trained covariance models and utterance-level sample covariance models. Speech recognition experiments on both the digit-based Aurora2 task and a large vocabulary task show that the proposed model adaptation approach provides significant performance improvements over a baseline speech recognizer trained on clean speech data.
- Automatic Speech Recognition
- Acoustic Model
- Clean Speech
- Speech Recognizer
- Feature Compensation
Speech recognizers employed in noisy environments usually show dramatic performance degradation. This degradation has been a major obstacle to introducing automatic speech recognition (ASR) technology into real-world applications. For this reason, one of the central issues in current ASR research is providing robustness against the performance degradation of speech recognizers in noisy environments [1, 2]. The noisy environments encountered at test time usually differ from the training acoustic environments, so the performance degradation of speech recognizers in noisy environments can be largely attributed to the acoustic mismatch between training and test environments. This mismatch is mainly due to the corruption of speech by additive noise and channel distortion in the test environments. Many robust speech recognition approaches have been proposed over the past few decades to reduce the acoustic mismatch, and most of them can be categorized into feature compensation, model adaptation, and uncertainty-based approaches. Of the three, the easiest way to provide robustness against the acoustic mismatch is feature compensation, in which noisy test features are compensated or enhanced to remove noise effects and then decoded by a speech recognizer trained on clean speech data. Cepstral mean normalization (CMN) and cepstral mean and variance normalization (CMVN) are popular feature compensation techniques. However, it is generally known that model adaptation has the potential for greater robustness in noisy environments than feature compensation, although feature compensation is simpler and more efficient to implement. One reason for the possible superiority of model adaptation is that it can exploit the very detailed knowledge of the underlying speech signal encoded in the acoustic models of the speech recognizer.
Because the acoustic models in a speech recognizer are designed to represent their own acoustic-phonetic units, they provide a much more detailed representation of speech. By contrast, feature compensation methods usually rely on a much simpler model of speech, such as a single Gaussian mixture model (GMM). For this reason, better performance can be expected by transforming the acoustic models to match the current noise conditions. Another reason for the advantage of model adaptation lies in a fundamental limitation of feature compensation: because acoustic-phonetic information is lost both in noise corruption and in feature extraction, it is difficult to perfectly recover clean features from noisy features with feature compensation algorithms. This information loss causes discrepancies between the clean speech models and the compensated features in the decoding process of ASR. On the other hand, clean speech models can be fully adapted into acoustically matched models as long as a sufficient amount of adaptation data is available. Therefore, although the same information loss can occur in model adaptation, it does not cause undesirable discrepancies between the acoustic models and the speech features but is simply disregarded in the decoding process. Owing to these advantages, numerous environmental model adaptation techniques have been proposed for robust speech recognition. A well-known environmental model adaptation method is the parallel model combination technique, which combines clean speech and noise models in the spectral domain to obtain noisy speech models. Another representative model adaptation technique is the vector Taylor series (VTS) approach, which linearly approximates noisy speech models from clean speech and noise models by using a Taylor series expansion. Both methods are reported to be quite effective in providing robustness against noise.
The standard adaptation methods used for speaker adaptation, such as maximum a posteriori (MAP) [1, 2] and maximum likelihood linear regression (MLLR), can also be used for environmental model adaptation. Because MAP has an asymptotic property, it can offer performance similar to that of matched conditions given sufficient adaptation data. MLLR uses a set of linear transforms to map the initially trained models into adapted models such that the likelihood of the adaptation data is maximized. This method is known to be quite robust and to achieve reasonable performance with about a minute of speech for minor mismatches.
Basically, the model adaptation approach needs to adapt all of the model parameters employed in the speech recognizer, so the amount of computation can be a serious problem even in a small vocabulary speech recognition task. Moreover, due to the temporal and spatial variation of acoustic environments, environmental model adaptation needs to be performed at the utterance or temporal-segment level. Therefore, a model adaptation technique should offer computational efficiency as well as noise robustness for its application to real-time speech recognition.
In this paper, we propose a new, efficient model adaptation approach based on the histogram equalization (HEQ) technique. HEQ is basically a nonlinear transformation-based approach; in this sense, it can fundamentally cope with the nonlinearity of noisy features in the case of logarithmic-space features such as cepstral features, and it is reported to provide considerable performance improvements in both speech recognition and speaker recognition in noisy environments [11–17]. In addition, HEQ is computationally efficient because most of its algorithm consists of sort and search routines with relatively narrow depth and scope. Since its first application to speech recognition, HEQ has mainly been used for feature compensation (HEQ-FC). However, given the potential superiority of the model adaptation approach, the use of HEQ in model adaptation (HEQ-MA) can be expected to provide more robustness in noisy environments. In the proposed approach, HEQ transforms the trained mean models of a speech recognizer into environmentally matched models. The transformation function of HEQ is obtained from the reference and test cumulative distribution functions (CDFs) of the training data and the test utterance, respectively. A signal-to-noise ratio (SNR)-dependent linear interpolation-based method is used to efficiently adapt the covariance models of the speech recognizer to achieve further performance improvements in heavily noisy conditions.
2.1. Basic Algorithm
where C_ref^{-1}(·) is the inverse of the reference CDF C_ref(·), C_test(·) is the test CDF, and the transformation function of HEQ-FC, F(y) = C_ref^{-1}(C_test(y)), is single-valued and monotonically nondecreasing.
2.2. Order Statistic-Based CDF Estimation
In (2), it is noted that the effectiveness of HEQ-FC is directly related to the reliable estimation of both the reference and test CDFs. A better CDF estimate can be achieved with a larger amount of sample data. Thanks to the relatively large amount of sample data available in the training phase, the reference CDF can be well estimated by the classical cumulative histogram approach. However, current speech recognizers frequently employ a short utterance or word as their input unit, in which case the amount of sample data can be insufficient for reliable estimation of the test CDF. In such test environments, reliable estimation of the test CDF is therefore an important issue for effective HEQ-FC. When the amount of sample data is small, the order statistic-based CDF estimation method can be more reliable than the classical histogram-based approach owing to its enhanced probabilistic resolution. A brief description of the order statistic-based CDF estimation is given as follows.
where y(n) is a test feature component at the nth frame.
Because the reference CDF is approximated by its cumulative histogram, the inverse reference CDF transformation in (6) is performed with linear interpolation, considering the relative position of the test CDF estimate within the reference histogram bin, to reduce the quantization error.
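As a concrete illustration of the order statistic-based estimate and the interpolated inverse-CDF lookup described above, the following Python sketch transforms the components of a test utterance (function and variable names are our own, not the paper's):

```python
import numpy as np

def order_statistic_cdf(test_features):
    """Order statistic-based CDF estimate: the r-th smallest of N
    samples is assigned the CDF value (r - 0.5) / N, which is more
    reliable than a cumulative histogram when N is small."""
    n = len(test_features)
    ranks = np.argsort(np.argsort(test_features)) + 1  # 1-based ranks
    return (ranks - 0.5) / n

def heq_transform(test_features, ref_bin_edges, ref_cdf):
    """Map test features through the inverse reference CDF.

    ref_bin_edges / ref_cdf describe the cumulative histogram of the
    training data; np.interp performs the piecewise-linear inverse-CDF
    lookup, which reduces the quantization error of the histogram bins.
    """
    test_cdf = order_statistic_cdf(test_features)
    return np.interp(test_cdf, ref_cdf, ref_bin_edges)
```

Because both CDFs are monotonically nondecreasing, the resulting transformation preserves the ordering of the input feature values.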
3.1. Basic Algorithm
3.2. Mean Model Adaptation
Our model adaptation approach aims to provide the speech recognizer with robustness against acoustic noise, so the actual adaptation is applied to the acoustic models of the speech recognizer. In most cases, acoustic model adaptation focuses on the mean vectors and covariance matrices of the acoustic models because of their dominant effect compared with the other model parameters [1, 2]. Hence, we confine the adaptation scope to the mean and covariance models in this paper.
When the acoustic models are estimated from logarithmically scaled features such as cepstral coefficients, the transformation function induced by the acoustic mismatch defined in (8) is known to take the form of a nonlinear function. In this nonlinear case, the mean of the transformed features is generally not the same as the transformed mean value. However, it can be assumed that the transformed features belonging to each acoustic model are distributed within a relatively small acoustic space, owing to the detailed definition of the acoustic models. In this case, the transformation within each acoustic model space can be approximated linearly, even though the overall transformation over the entire acoustic model space has nonlinear characteristics. Under this assumption, the HEQ algorithm is applied for mean model adaptation as in (8).
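Under the local-linearity assumption above, mean adaptation reverses the direction of feature compensation: each trained mean component is pushed through the reference CDF and then through the inverse test CDF. A minimal sketch, assuming piecewise-linear CDFs represented by bin edges and cumulative values (all names are illustrative):

```python
import numpy as np

def adapt_mean_component(means, ref_edges, ref_cdf, test_edges, test_cdf):
    """Adapt trained mean components into the test environment.

    Feature compensation maps test features back to the training space;
    mean adaptation goes the other way: mu_adapted = C_test^{-1}(C_ref(mu)),
    evaluated here with piecewise-linear CDFs via np.interp.
    """
    cdf_vals = np.interp(means, ref_edges, ref_cdf)   # C_ref(mu)
    return np.interp(cdf_vals, test_cdf, test_edges)  # C_test^{-1}(.)
```

In a full recognizer, this function would be applied independently to each of the 39 feature dimensions of every Gaussian mean vector, as the paper does on a component-by-component basis.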
3.3. Covariance Model Adaptation
where σ̂² and σ² are the adapted and trained covariance coefficients, respectively, and ρ (0 ≤ ρ ≤ 1) is an SNR-dependent smoothing factor that deals with the stronger covariance shrinkage effect at lower SNR conditions; it is approximated by a linearly decreasing function ρ = a·SNR(S) + b, clipped to range between 0 and 1, where SNR(S) is the averaged SNR value of the sequence S, and a and b are empirically chosen slope and bias constants, respectively. The parameters s²_S and s²_G are the sequence-level sample covariance coefficient obtained from the test sequence S and the global sample covariance coefficient computed from the whole training features, respectively. Equation (12) indicates that the proposed covariance model adaptation rule leaves the trained covariance models nearly unchanged at higher SNRs while shrinking them by the ratio of the sequence-level sample covariance to the global sample covariance at lower SNRs.
4.1. Experimental Setup
In the experiments, we used two speech databases, the Aurora2 speech database converted from the TI-DIGITS database and the Korean phonetically optimized words (KPOW) database consisting of 37,993 utterances of 3,848 Korean words, to examine the effectiveness of the proposed approach in both small and large vocabulary speech recognition tasks. The trained acoustic models of the two baseline speech recognizers were obtained separately from the clean speech training sets of the two databases. For performance evaluation, we used the three test sets of the Aurora2 noisy speech database: test set A is corrupted by four kinds of noise (subway, babble, car, and exhibition), test set B by another four types of noise (restaurant, street, airport, and train station), and test set C by two kinds of noise (subway and street) together with channel distortion (MIRS). Additionally, we used two test sets of the KPOW noisy speech database, generated by artificially adding the same kinds of Aurora noise used in the Aurora2 test sets A and B to the KPOW clean speech test set, which is composed of 7,609 utterances. Each of the three Aurora2 test sets and the two KPOW test sets is further composed of six noisy subsets at SNR levels of 20, 15, 10, 5, 0, and −5 dB.
We employed the ETSI Aurora2 experimental framework in our experiments as follows. In feature extraction, speech signals are first blocked into a sequence of frames, each 25 ms in length with a 10 ms shift. Next, the speech frames are pre-emphasized by a first-order FIR filter with a factor of 0.97, and a Hamming window is applied to each frame. From a sequence of 23 mel-scaled log filter-bank energies, 12-dimensional mel-frequency cepstral coefficients (MFCCs) are extracted. The final 39-dimensional feature vector for each frame consists of the 12 MFCCs, log energy, and their delta and acceleration coefficients. The baseline speech recognizer for the Aurora2 task employs 13 whole-word hidden Markov models (HMMs): 11 digit models with 16 states each, a silence model with three states, and a short-pause model with a single state. Each state of the digit models has 3 Gaussian mixture components, while each state of the silence and short-pause models has 6 Gaussian mixture components. The baseline recognizer for the KPOW task has 6,776 tied-state triphone HMMs, where each HMM has 3 states and each state is modeled with 8 Gaussian mixture components. Diagonal covariance matrices are used in all of the HMMs.
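The framing and pre-emphasis steps of this front-end can be sketched as follows (a simplified illustration of the quoted settings, not the ETSI reference implementation; it assumes the signal is at least one frame long):

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=25, hop_ms=10, preemph=0.97):
    """Pre-emphasize a signal and block it into overlapping Hamming-
    windowed frames, following the front-end settings quoted in the
    text (25 ms frames, 10 ms shift, 0.97 pre-emphasis factor).
    """
    x = np.asarray(x, dtype=float)
    x = np.append(x[0], x[1:] - preemph * x[:-1])  # first-order FIR pre-emphasis
    flen = int(fs * frame_ms / 1000)               # samples per frame
    hop = int(fs * hop_ms / 1000)                  # samples per shift
    n_frames = 1 + (len(x) - flen) // hop
    frames = np.stack([x[i * hop : i * hop + flen] for i in range(n_frames)])
    return frames * np.hamming(flen)
```

At the 8 kHz sampling rate used for Aurora2, each frame spans 200 samples with an 80-sample shift; the filter-bank and DCT stages that follow are omitted here.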
In the performance evaluation, we examined the baseline speech recognizer trained on clean speech data, CMN, CMVN, HEQ-FC, and HEQ-MA. Additionally, the standard model adaptation technique based on the MLLR method was evaluated by using the HTK toolkit and compared with the above-mentioned techniques. The number of regression classes in MLLR was set to 8 for the Aurora2 task and 16 for the KPOW task. In the MLLR-based model adaptation, we adopted an unsupervised adaptation method in which the acoustic mean models were incrementally adapted with each test utterance. In feature compensation, HEQ-FC is applied to all 39 feature dimensions independently for both training and test data after estimating the reference CDFs from all training data. In the HEQ-based model adaptation, the HEQ and proposed variance adaptation techniques are applied to the 39-dimensional mean vectors and diagonal covariance matrices, respectively, of all trained HMMs in the baseline speech recognizers on a component-by-component basis. The number of histogram bins in the reference CDFs was empirically chosen as 64; owing to the linear interpolation used in the inverse CDF transformation of both HEQ-FC in (6) and HEQ-MA in (8), a further increase in the number of histogram bins did not yield any meaningful performance improvements. The SNR-dependent smoothing parameters a and b in the covariance adaptation were chosen empirically (with b = 0.9) so that the smoothing factor becomes 0 at an SNR of 30 dB and 1 at the lowest SNR condition. The averaged SNR value was estimated as the ratio of the averaged frame energy to the averaged noise energy of the initial silence region in each test utterance. To cope with the time-varying nature of environmental noise, histogram equalization was conducted on a single-utterance basis in both feature compensation and model adaptation.
4.2. Test with SNR Conditions
4.3. Variance Shrinkage Effect
4.4. Test with Various Test Sets
Word error rates (%) on the Aurora2 task (results averaged over 0–20 dB SNRs); the compared systems include HEQ-MA (mean & var.).
Word error rates (%) on the Korean POW task (results averaged over 0–20 dB SNRs); the compared systems include HEQ-MA (mean & var.).
In the large vocabulary-based KPOW task, HEQ-MA with mean and variance adaptation reduces recognition errors by 43.04%, 38.77%, 15.05%, 41.39%, and 8.56% relative to the baseline recognizer, CMN, CMVN, MLLR, and HEQ-FC, respectively. From these results, we see that HEQ-MA produces substantially better performance than the other approaches. Compared with the results on the Aurora2 task, the reduced performance gains on the KPOW task imply that the proposed adaptation technique is better suited to the small vocabulary task, where fewer acoustic models go unobserved in the HEQ-based adaptation process. The performance gap between mean-only adaptation and mean-and-variance adaptation confirms both the importance of variance adaptation at SNR conditions below 10 dB and the effectiveness of the proposed variance adaptation approach.
5.1. Recognition Performance
In Figures 1 and 2, the slight performance gain of CMN in the high SNR conditions indicates that CMN can improve recognition performance in moderately noisy conditions, while its performance degradation in the low SNR conditions implies that CMN is not very effective in heavily noisy conditions. This result can be interpreted to mean that, in heavily noisy conditions, the performance degradation caused by the variance mismatch can outweigh the performance gain obtained by the mean adaptation, resulting in an overall degradation of recognition performance. Similar results are obtained in the MLLR-based mean-only model adaptation experiments. We believe that the superior performance of MLLR at higher SNRs also largely results from the mean model adaptation. As in the case of CMN, the variance mismatch can be regarded as the main cause of performance degradation for both MLLR and the mean-only HEQ-MA approach under heavily noisy conditions. Therefore, variance adaptation plays the more crucial role at lower SNR conditions. The importance of variance compensation is well confirmed by CMVN as well as HEQ-FC, both of which noticeably improve recognition performance compared with CMN. Because HEQ-MA with mean and variance adaptation reduces the variance mismatch through the proposed variance adaptation technique, it provides a further performance gain over mean-only HEQ-MA, as observed in Figures 1 and 2. From these results, it can be said that the proposed techniques are also effective in the large vocabulary task, although the performance gains of HEQ-MA over HEQ-FC are not as remarkable as those in the Aurora2 task shown in Figure 1.
We think that the reduced performance improvements in the KPOW task mainly result from the fact that, because the adaptation is performed on a single-utterance basis, the amount of adaptation data in each test utterance is not enough to fully adapt the much larger number of acoustic models in this large vocabulary task. In Figures 1 and 2, it is also observed that the performance gains of HEQ-MA with mean and variance adaptation over HEQ-FC are more notable at lower SNR conditions. These results support our earlier argument that model adaptation is more effective than feature compensation in serious noise conditions, where it becomes more difficult to compensate noisy speech features into clean speech features because of the increased loss of acoustic-phonetic information.
5.2. Computational Complexity
The computational load of HEQ-MA is directly related to the number of acoustic mean models, whereas that of HEQ-FC depends on the utterance length, that is, the number of frames in the given utterance. Usual speech recognition tasks require the whole set of phonetic units in acoustic modeling, in which case the number of acoustic mean models tends to be much larger than the number of frames. Therefore, HEQ-MA usually requires a much larger amount of computation than HEQ-FC. However, the computational load of HEQ-MA can be comparable to that of HEQ-FC in domain-constrained speech recognition tasks, such as digit recognition, which employ a small number of acoustic models. Although HEQ-MA has much greater computational complexity than feature compensation techniques, it can still be regarded as an efficient model adaptation technique compared with more complex model adaptation techniques such as MLLR, owing to its simple algorithm.
5.3. Implementation Issues
The feature compensation and model adaptation techniques employed in these experiments are conducted on an utterance-by-utterance basis to estimate the required statistics, such as the mean and variance. Therefore, all of these approaches introduce some delay in real applications. However, a segmental estimation approach utilizing a sliding window can be used to achieve real-time processing for feature compensation and model adaptation without significant performance degradation. With this approach, the appropriate sliding window size producing results comparable to the utterance-by-utterance approach is reported to be about 600 ms for these feature compensation techniques.
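The segmental estimation idea can be sketched as a trailing sliding window over the frame sequence; with a 10 ms frame shift, a 60-frame window corresponds to roughly 600 ms (the window length here is an illustrative choice):

```python
import numpy as np

def sliding_window_stats(features, win=60):
    """Per-frame mean and variance over a trailing window of `win`
    frames; with a 10 ms frame shift, 60 frames cover about 600 ms,
    the segmental window size mentioned in the text.
    """
    means, variances = [], []
    for t in range(len(features)):
        seg = features[max(0, t - win + 1) : t + 1]  # trailing window
        means.append(seg.mean(axis=0))
        variances.append(seg.var(axis=0))
    return np.array(means), np.array(variances)
```

Because each frame's statistics depend only on past frames, such an estimator can replace the utterance-level statistics without waiting for the end of the utterance.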
In the HEQ-MA approach with mean and variance adaptation, we used an SNR-dependent covariance model adaptation technique, in which more accurate SNR estimation leads to better covariance model adaptation. In our experiments, we employed a simple SNR estimation method in which the noise power estimated from the initial silence region is used throughout the entire utterance without any update procedure. The estimated noise power therefore has some degree of estimation error, which causes the resulting covariance models to be adapted less accurately. A more reliable noise power estimation algorithm, such as those employed in voice activity detection techniques, could be used for better SNR estimation, and further research utilizing this kind of more reliable SNR estimation would be worthwhile.
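The simple SNR estimator described here can be sketched as follows, assuming the first few frames of each utterance are silence (the silence frame count is an illustrative choice, not the paper's setting):

```python
import numpy as np

def estimate_snr_db(frames, n_silence=10, eps=1e-10):
    """Utterance-level SNR estimate: averaged frame energy divided by
    the noise energy of the initial silence region, with no noise
    update during the utterance (n_silence is an illustrative choice).
    """
    energy = np.mean(frames ** 2, axis=1)        # per-frame energy
    noise = np.mean(energy[:n_silence]) + eps    # initial-silence noise power
    signal = np.mean(energy) + eps               # whole-utterance energy
    return 10.0 * np.log10(signal / noise)
```

Because the noise estimate is frozen after the initial silence, any noise-level change later in the utterance goes unnoticed, which is exactly the estimation error discussed above.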
We proposed a new environmental model adaptation method for robust speech recognition. The proposed approach utilizes the histogram equalization technique for matching the acoustic mean models and an SNR-dependent linear interpolation-based method for adapting the covariance models to the test environments. The experimental results show that the proposed model adaptation approach is substantially effective in reducing the mismatch between the trained acoustic models and the test environments. They also indicate that mean model adaptation plays the major role in improving the performance of the speech recognizer in noisy environments, while variance model adaptation is especially important for improving recognition performance in heavily noisy conditions. Owing to its computational efficiency as well as its noise robustness, the proposed technique can serve as an alternative model adaptation approach for robust speech recognition in noisy environments. Further study of more sophisticated variance adaptation techniques is needed to further enhance the performance of the proposed approach.
- Huang X, Acero A, Hon H-W: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, Upper Saddle River, NJ, USA; 2001.
- Gales M, Young S: The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing 2007, 1(3):195-304. doi:10.1561/2000000004
- Sankar A, Lee C-H: A maximum-likelihood approach to stochastic matching for robust speech recognition. IEEE Transactions on Speech and Audio Processing 1996, 4(3):190-202. doi:10.1109/89.496215
- Kim NS, Kim YJ, Kim HW: Feature compensation based on soft decision. IEEE Signal Processing Letters 2004, 11(3):378-381. doi:10.1109/LSP.2003.821720
- Rosenberg AE, Lee C-H, Soong FK: Cepstral channel normalization techniques for HMM-based speaker verification. Proceedings of the 2nd International Conference on Spoken Language Processing, 1992, 1835-1838.
- Viikki O, Laurila K: Cepstral domain segmental feature vector normalization for noise robust speech recognition. Speech Communication 1998, 25(1–3):133-147.
- Gales MJF, Young SJ: Cepstral parameter compensation for HMM recognition in noise. Speech Communication 1993, 12(3):231-239. doi:10.1016/0167-6393(93)90093-Z
- Moreno PJ, Raj B, Stern RM: A vector Taylor series approach for environment-independent speech recognition. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '96), May 1996, Atlanta, Ga, USA, 2:733-736.
- Leggetter CJ, Woodland PC: Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language 1995, 9(2):171-185. doi:10.1006/csla.1995.0010
- Gonzalez RC, Woods RE: Digital Image Processing. Prentice-Hall, Upper Saddle River, NJ, USA; 2002.
- Hilger F, Ney H: Quantile based histogram equalization for noise robust large vocabulary speech recognition. Proceedings of the 7th European Conference on Speech Communication and Technology, 2006, 1135-1138.
- Pelecanos J, Sridharan S: Feature warping for robust speaker verification. Proceedings of the Speaker Odyssey, 2001, 213-218.
- Segura JC, Benítez C, de la Torre Á, Rubio AJ, Ramírez J: Cepstral domain segmental nonlinear feature transformations for robust speech recognition. IEEE Signal Processing Letters 2004, 11(5):517-520. doi:10.1109/LSP.2004.826648
- de la Torre Á, Peinado AM, Segura JC, Pérez-Córdoba JL, Benítez MC, Rubio AJ: Histogram equalization of speech representation for robust speech recognition. IEEE Transactions on Speech and Audio Processing 2005, 13(3):355-366.
- Suh Y, Kim H: Class-based histogram equalization for robust speech recognition. ETRI Journal 2006, 28(4):502-505. doi:10.4218/etrij.06.0206.0005
- Suh Y, Kim S, Kim H: Compensating acoustic mismatch using class-based histogram equalization for robust speech recognition. EURASIP Journal on Advances in Signal Processing 2007, 2007, 9 pages.
- Suh Y, Ji M, Kim H: Probabilistic class histogram equalization for robust speech recognition. IEEE Signal Processing Letters 2007, 14(4):287-290.
- Dharanipragada S, Padmanabhan M: A nonlinear unsupervised adaptation technique for speech recognition. Proceedings of the 6th International Conference on Spoken Language Processing, 2000, 556-559.
- Gales MJF, Woodland PC: Mean and variance adaptation within the MLLR framework. Computer Speech and Language 1996, 10(4):249-264. doi:10.1006/csla.1996.0013
- Pearce D, Hirsch H-G: The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. Proceedings of the 6th International Conference on Spoken Language Processing, 2000, 29-32.
- Lim Y, Lee Y: Implementation of the POW (phonetically optimized words) algorithm for speech database. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '95), May 1995, Detroit, Mich, USA, 1:89-92.
- Young S, et al.: The HTK Book for Version 3.2.1. Cambridge University Engineering Department, Cambridge, UK; 2002.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.