EURASIP Journal on Applied Signal Processing 2003:8, 814–823. © 2003 Hindawi Publishing Corporation

On the Use of Evolutionary Algorithms to Improve the Robustness of Continuous Speech Recognition Systems in Adverse Conditions

Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axes issued from the KLT. The enhanced parameters increase the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperforms the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to −4 dB. We also show the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.


INTRODUCTION
Continuous speech recognition (CSR) systems remain faced with the serious problem of acoustic condition changes. Their performance often degrades due to unknown adverse conditions (e.g., room acoustics, ambient noise, speaker variability, sensor characteristics, and other transmission channel artifacts). These speech variations create mismatches between the training data and the test data. Numerous techniques have been developed to counter this in three major areas [1].
The first area includes noise masking [1], spectral and cepstral subtraction [2], and the use of robust features [3]. Robust feature analysis consists of using noise-resistant parameters such as auditory-based features and mel-frequency cepstral coefficients (MFCC) [4], or techniques such as the relative spectral (RASTA) methodology [5]. The second type of method refers to the establishment of compensation models for noisy environments without modification of the speech signal. The third field of research is concerned with distance and similarity measurements. The major methods of this field are founded on the principle of finding a robust distortion measure that emphasizes the regions of the spectrum that are less influenced by noise [6].
Despite these efforts to address robustness, adapting to changing environments remains the major obstacle to speech recognition in practical applications. Investigating innovative strategies has become essential to overcome the drawbacks of classical approaches. In this context, evolutionary algorithms (EAs) are robust search methods: they are useful for finding good solutions to complex problems (artificial neural network topologies or weights, for instance) and for avoiding local minima [7]. Applying artificial neural networks, Spalanzani [8] showed that recognition of digits and vowels can be improved by using genetically optimized initialization of weights and biases. In this paper, we propose an approach which can be viewed as a signal transformation via a mapping operator, using a mel-frequency space decomposition based on the Karhunen-Loève transform (KLT) and a genetic algorithm (GA) with real-coded encoding (a branch of EAs). This transformation attempts to adapt hidden Markov model-based CSR systems to adverse conditions. The principle consists of finding, in the learning phase, the principal axes generated by the KLT and then optimizing them by genetic operators for the projection of noisy data. The aim is to provide projected noisy data that are as close as possible to clean data.

This paper is organized as follows. Section 2 describes the basis of our proposed hybrid KLT-GA enhancement method. Section 3 describes the model linking the KLT to the evolution mechanism, which leads to a robust representation of noisy data. Section 4 describes the database and the platform used in our experiments, and evaluates the proposed KLT-GA-based recognizer in a noisy car environment and in a telephone channel environment; this section includes a comparison of KLT-GA-processed recognizers with a baseline CSR system. Finally, Section 5 concludes with perspectives on this work.

General framework
CSR systems based on statistical models such as hidden Markov models (HMM) automatically recognize speech sounds by comparing their acoustic features with those determined during training [9]. A Bayesian statistical framework underlies the HMM speech recognizer. The development of such a recognizer can be summarized as follows. Let w be a sequence of phones (or words), which produces a sequence of observable acoustic data o, sent through a noisy transmission channel. In our study, telephone speech is corrupted by additive noise. The recognition process aims to provide the most likely phone sequence ŵ given the acoustic data o. This estimation is performed by maximizing a posteriori (MAP) the probability p(w | o):

    ŵ = arg max_{w ∈ Ψ} p(w | o) = arg max_{w ∈ Ψ} p(o | w) p(w),    (1)

where Ψ is the set of all possible phone sequences, p(w) is the prior probability, determined by the language model, that the speaker utters w, and p(o | w) is the conditional probability that the acoustic channel produces the sequence o. Let Λ be the set of models used by the recognizer to decode acoustic parameters through the use of the MAP. Then (1) becomes

    ŵ = arg max_{w ∈ Ψ} p(o | w, Λ) p(w | Λ).    (2)

The mismatch between the training and the testing environments leads to a worse estimate for the likelihood of o given Λ and thus degrades CSR performance. Reducing this mismatch should increase the correct recognition rate. The mismatch can be viewed by considering the signal space, the feature space, or the model space. We are concerned with the feature space, and consider a transformation T that maps the features into a transformed feature space. Our approach is to find T and the phone sequence w that maximize the joint likelihood of o and w given Λ:

    (ŵ, T̂) = arg max_{w, T} p(o, w | Λ, T).    (3)

We propose a pseudojoint maximization over w and T, where the typical conventional HMM-based technique is used to estimate w, while an EA-based technique enhances noisy data iteratively by keeping the noisy features as close as possible to the clean data. This EA-based transformation aims to reduce the mismatch between training and operating conditions by giving the HMM the ability to "recall" the training conditions.
As shown in Figure 1, the idea is to manipulate the axes generating the feature representation space to achieve better robustness on noisy data. MFCCs serve as acoustic features. A Karhunen-Loève decomposition in the MFCC domain provides the principal axes that constitute the basis of the space in which noisy data are represented. Then, a population of these axes is created (corresponding to the individuals in the initialization of the evolution process). The evolution of the individuals is performed by EAs. The individuals are evaluated via a fitness function quantifying, through generations, their distance to individuals in a noise-free environment. The fittest individual (the best principal axes) is used to project the noisy data in its corresponding dimension. Genetically modified MFCCs and their derivatives are finally used as enhanced features for the recognition process.

Cepstral acoustic features
The cepstrum is defined as the inverse Fourier transform of the logarithm of the short-term power spectrum of the signal. The use of a logarithmic function allows deconvolution of the vocal tract transfer function and the voice source. Consequently, the pulse sequence corresponding to the periodic voice source reappears in the cepstrum as a strong peak in the "frequency" domain. The derived cepstral coefficients are commonly used to describe the short-term spectral envelope of a speech signal. The computation of MFCCs requires the selection of M critical bandpass filters that roughly approximate the frequency response of the basilar membrane in the cochlea of the inner ear [4]. A discrete cosine transform is applied to the outputs X_k of the M filters to yield the coefficients C_n. These filters are triangular, cover the 156–6844 Hz frequency range, and are spaced on the mel-frequency scale. They are applied to the log of the magnitude spectrum of the signal, which is estimated on a short-time basis. Thus

    C_n = Σ_{k=1}^{M} X_k cos[n (k − 1/2) π / M],   n = 1, 2, ..., N,    (4)

where N is the number of cepstral coefficients, M is the analysis order, and X_k, k = 1, 2, ..., M = 20, represents the log-energy output of the kth filter.
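As a concrete sketch (not the authors' implementation), the cosine transform above can be written in NumPy as follows; the function name and arguments are illustrative:

```python
import numpy as np

def mfcc_from_filterbank(log_energies, num_ceps=12):
    """Apply the DCT C_n = sum_k X_k cos[n(k - 1/2)pi/M] to the M
    log-energy filterbank outputs X_k, yielding N cepstral coefficients."""
    M = len(log_energies)
    n = np.arange(1, num_ceps + 1)[:, None]      # n = 1..N (rows)
    k = np.arange(1, M + 1)[None, :]             # k = 1..M (columns)
    basis = np.cos(n * (k - 0.5) * np.pi / M)    # cosine basis matrix (N x M)
    return basis @ log_energies
```

A quick sanity check of the basis: a flat filterbank output (pure spectral tilt removed) carries no envelope shape, so all cepstral coefficients n ≥ 1 vanish.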

KLT in the mel-frequency domain
In order to reduce the effects of noise on ASR, many methods propose to decompose the vector space of the noisy signal into a signal-plus-noise subspace and a noise subspace [10].
The noise subspace is removed, and the clean signal is estimated from the remaining signal space. Such a decomposition applies the KLT to the noisy zero-mean normalized data. If we apply such a decomposition to the noisy zero-mean normalized MFCC vector Ĉ = [Ĉ_1, Ĉ_2, ..., Ĉ_N]^T, with the assumption that Ĉ has a symmetric nonnegative autocorrelation matrix R = E[Ĉ Ĉ^T] of rank r ≤ N, then Ĉ can be represented as a linear combination of eigenvectors β_1, β_2, ..., β_r, which correspond to the eigenvalues λ_1 ≥ λ_2 ≥ ··· ≥ λ_r ≥ 0, respectively. That is, Ĉ can be calculated using the following orthogonal transformation:

    Ĉ = Σ_{k=1}^{r} α_k β_k,    (5)

where the coefficients α_k = β_k^T Ĉ (the principal components) are given by the projection of Ĉ onto the space generated by the basis of r eigenvectors. Given that the magnitudes of the low-order eigenvalues are higher than those of the high-order ones, the effect of the noise on the low-order eigenvalues is proportionately less than on the high-order ones. Thus, a linear estimate of the clean vector C is obtained by projecting the noisy vectors onto the space generated by the principal components, weighted by a function W_k which applies strong attenuation to the higher-order eigenvectors depending on the noise variance [10]. The enhanced MFCCs are then given by

    C̄ = Σ_{k=1}^{r} W_k α_k β_k.    (6)

Various methods can find an adequate weighting function, particularly in the case of signal subspace decomposition [10]. The optimal order r fixing the beginning of the strong attenuation must also be determined. In our new approach, GAs determine the optimal principal axes, and no such assumptions need to be made. Optimization is achieved when the vectors β̄_1, β̄_2, ..., β̄_N, which do not necessarily correspond to the eigenvectors, minimize the Euclidean distance between Ĉ and C. The genetically enhanced MFCCs, C_Gen, are

    C_Gen = Σ_{k=1}^{N} α_k β̄_k.    (7)

Determining an optimal r is not needed, since the GA considers the vectors β̄_1, β̄_2, ..., β̄_N as the fittest individuals for the complete space dimension N. This process can be regarded as the mapping transform T of (3).
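A minimal NumPy sketch of the KLT machinery described above: eigenanalysis of the autocorrelation matrix of zero-mean MFCC frames, then reconstruction of a frame from its (optionally weighted) principal components. Function and variable names are illustrative:

```python
import numpy as np

def klt_axes(frames):
    """Eigenvectors of R = E[C C^T] for zero-mean MFCC frames (rows),
    sorted by decreasing eigenvalue; columns of the result are beta_k."""
    frames = frames - frames.mean(axis=0)
    R = frames.T @ frames / len(frames)
    eigvals, eigvecs = np.linalg.eigh(R)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def project(frame, axes, weights=None):
    """Reconstruct a frame from its principal components alpha_k = beta_k^T C,
    optionally attenuating components with a weighting function W_k."""
    alphas = axes.T @ frame
    if weights is not None:
        alphas = alphas * weights
    return axes @ alphas
```

With unit weights and a full-rank basis, the projection is the identity; enhancement comes from attenuating the high-order components (or, in the KLT-GA scheme, from replacing the eigenvectors by genetically optimized axes).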

MODEL DESCRIPTION AND EVOLUTION
The use of GAs requires the resolution of six fundamental issues: the chromosome (or solution) representation, the selection function, the genetic operators making up the reproduction function, the creation of the initial population, the termination criteria, and the evaluation function [11, 12]. The GA maintains and manipulates a family, or population, of solutions (the β_1, β_2, ..., β_N vectors in our case) and implements a "survival of the fittest" strategy in its search for better solutions.

Solution representation
A chromosome representation describes each individual in the population. It is important since the representation scheme determines how the problem is structured in the GA and also determines the genetic operators to use [13]. For our application, the useful representation of an individual, or chromosome, for function optimization involves genes or variables from an alphabet of floating-point numbers, with values within the variables' upper and lower bounds (+1 and −1, respectively). Michalewicz [14] has done extensive experimentation comparing real-valued and binary GAs, and has shown that real-valued representation offers higher precision with more consistent results across replications.

Selection function
The selection of individuals to produce successive generations plays an extremely important role in GAs. Stochastic selection is used to keep the search strategy simple while allowing adaptivity. A common selection approach assigns a probability of selection, P_j, to each individual j based on its fitness value. Various methods exist to assign probabilities to individuals; we use the normalized geometric ranking [15]. This method defines P_j for each individual by

    P_j = q' (1 − q)^{s−1},   with   q' = q / (1 − (1 − q)^P),    (8)

where q is the probability of selecting the best individual, s is the rank of the individual (1 being the best), and P is the population size.
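The normalized geometric ranking can be sketched as follows (assuming NumPy; the helper name and the default value of q are illustrative):

```python
import numpy as np

def geometric_ranking_probs(fitnesses, q=0.08):
    """Normalized geometric ranking: the individual of rank s (1 = best)
    is selected with probability q'(1-q)^(s-1), q' = q / (1-(1-q)^P)."""
    P = len(fitnesses)
    ranks = np.empty(P, dtype=int)
    ranks[np.argsort(fitnesses)[::-1]] = np.arange(1, P + 1)  # rank 1 = fittest
    q_prime = q / (1.0 - (1.0 - q) ** P)
    return q_prime * (1.0 - q) ** (ranks - 1)
```

The normalization by q' guarantees the probabilities sum to one for any population size, so the rule can be fed directly to a roulette-wheel draw.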

Genetic operators
The basic search mechanism of the GA is provided by two types of operators: crossover and mutation. Crossover transforms two individuals into two new individuals, while mutation alters one individual to produce a single new solution. A float representation of the parents is denoted by X and Y. At the end of the search, the fittest individual survives and is retained as an optimal KLT axis in its corresponding rank among the β_1, β_2, ..., β_N vectors.

Crossover
Crossover operators combine information from two parents and transmit it to each offspring. In order to keep the exploration domain anchored to the best solution, we preferred a crossover that utilizes fitness information, that is, the heuristic crossover [15]. Let a_i and b_i be the lower and upper bounds, respectively, of each component x_i of a member of the population (X or Y). This operator produces a linear interpolation of X and Y. New individuals X̄ and Ȳ (children) are created according to Algorithm 1.

Mutation
Mutation operators tend to make small random changes in an attempt to explore all regions of the solution space [16].
The principle of the nonuniform mutation used in our application consists of randomly selecting one component x_k of an individual and setting it equal to a nonuniform random number x'_k:

    x'_k = x_k + (b_k − x_k) f(Gen)   if u_1 < 0.5,
    x'_k = x_k − (x_k − a_k) f(Gen)   if u_1 ≥ 0.5,    (9)

while the other components of the individual keep their original values.

Algorithm 1: The heuristic crossover used in the robust CSR system.
1. Pick two parents X and Y among the selected individuals.
2. Generate a uniform random number g ∈ (0, 1).
3. If fit[X] > fit[Y], then X̄ = X + g(X − Y) and Ȳ = X (otherwise exchange the roles of X and Y).
4. Estimate the feasibility of X̄ (each component must remain within its bounds [a_i, b_i]); if X̄ is not feasible, then generate a new g and go to step 2.
5. If all individuals have reproduced, then stop; else go to step 1.
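A sketch of the heuristic crossover, assuming NumPy and the bounded real-coded representation of Section 3 (the function name, default bounds, and retry limit are assumptions, not from the paper):

```python
import numpy as np

def heuristic_crossover(X, Y, fit, lower=-1.0, upper=1.0, retries=3, rng=None):
    """Heuristic crossover: extrapolate past the better parent B in the
    direction away from the worse parent W, X' = B + g(B - W), Y' = B.
    Retries with a new g if X' leaves the bounds; falls back to the parents."""
    rng = rng if rng is not None else np.random.default_rng()
    B, W = (X, Y) if fit(X) > fit(Y) else (Y, X)
    for _ in range(retries):
        g = rng.uniform()
        child = B + g * (B - W)
        if np.all((child >= lower) & (child <= upper)):
            return child, B.copy()
    return B.copy(), W.copy()
```

Because the step is taken away from the worse parent, the operator exploits local fitness gradients rather than blindly blending the parents.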
where the function f(Gen) is given by

    f(Gen) = (u_2 (1 − Gen/Gen_max))^t,    (10)

and u_1, u_2 are uniform random numbers in (0, 1), t is a shape parameter, Gen is the current generation, and Gen_max is the maximum number of generations. The multi-nonuniform mutation generalizes the nonuniform mutation operator to all the components of the parent X. The main advantage of this operator is that the alteration is distributed over all the components of the individual, which extends the search space and thus permits dealing with any kind of noise.
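The nonuniform mutation described above can be sketched as follows (NumPy; names and default bounds are illustrative). Note how the perturbation size shrinks to zero as Gen approaches Gen_max, turning exploration into fine-tuning:

```python
import numpy as np

def nonuniform_mutation(x, gen, gen_max, t=3.0, lower=-1.0, upper=1.0, rng=None):
    """Mutate one randomly chosen component x_k toward its upper or lower
    bound by a fraction f(Gen) = (u2 (1 - Gen/Gen_max))^t of the remaining range."""
    rng = rng if rng is not None else np.random.default_rng()
    x = x.copy()
    k = rng.integers(len(x))                      # component to perturb
    u1, u2 = rng.uniform(), rng.uniform()
    f = (u2 * (1.0 - gen / gen_max)) ** t
    if u1 < 0.5:
        x[k] = x[k] + (upper - x[k]) * f          # move toward the upper bound
    else:
        x[k] = x[k] - (x[k] - lower) * f          # move toward the lower bound
    return x
```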

Evaluation function
The GA must search all the axes generated by the KLT of the mel-frequency space (onto which the noisy MFCCs are projected) to find those closest to the clean MFCCs. Thus, evolution is driven by a fitness function defined in terms of a distance measure between the noisy MFCC projected on a given individual (axis) and the clean MFCC. The fittest individual is the axis corresponding to the minimum of that distance. A distance function applied to cepstra (or other voice representations) refers to spectral distortion measures and represents the cost of classifying speech frames. For two vectors C and Ĉ representing two frames [6], each with N components, the geometric distance is defined as

    d_l(C, Ĉ) = ( Σ_{n=1}^{N} |C_n − Ĉ_n|^l )^{1/l}.    (11)

For simplicity, the Euclidean distance is considered (l = 2), which has been a valuable measure for both clean and noisy speech [6, 17]. Figure 2 gives, for the first four best axes, the evolution of their fitness (distortion measure) through 300 generations. Note that −d(C, Ĉ) is used as the fitness because the evaluation function must be maximized.
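Concretely, the evaluation of one candidate set of axes might look like the following sketch (negative Euclidean distortion between clean frames and noisy frames projected onto the candidate axes; the function name and array layout are assumptions):

```python
import numpy as np

def fitness(noisy_frames, clean_frames, axes):
    """Negative summed Euclidean distance between clean MFCC frames and
    noisy frames projected onto the candidate axes (columns of `axes`).
    Negated so that the GA, which maximizes, minimizes the distortion."""
    # alpha_k = beta_k^T C for each frame, then reconstruct sum_k alpha_k beta_k
    projected = noisy_frames @ axes @ axes.T
    return -np.linalg.norm(projected - clean_frames, axis=1).sum()
```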

Initialization and termination
The ideal, zero-knowledge assumption starts with a population of completely random axes. Another typical heuristic, used in our system, initializes the population with a uniform distribution over a default set of known starting points described by the boundaries (a_i, b_i) of each axis component. The GA-based search ends when the population reaches homogeneity in performance (when children no longer surpass their parents), converges according to the Euclidean distortion measure, or reaches the maximum number of generations. The evolution process is summarized in Algorithm 2.

Speech material
The following experiments used the TIMIT database [18], which contains broadband recordings of 6300 sentences: 10 phonetically rich sentences read by each of 630 speakers from 8 major dialect regions of the United States. To simulate a noisy environment, car noise was added artificially to the clean speech.
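Adding noise at a controlled SNR is usually done by scaling the noise to the desired power ratio before mixing; the following is a common recipe, not necessarily the authors' exact procedure:

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Loop/trim the noise to the speech length, scale it so the mixture
    has the requested global SNR in dB, and add it to the clean signal."""
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```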
To study the effect of such noise on the recognition accuracy of the evaluated CSR system, the reference templates for all tests were taken from clean speech. The training set is composed of 1140 sentences (114 speakers) from the dr1 and dr2 TIMIT subdirectories. The dr1 subset of the TIMIT database, composed of 110 sentences, was chosen to evaluate the recognition system.
In a second set of experiments, in order to study the impact of telephone channel degradation on the recognition accuracy of both the baseline and the enhanced CSR systems, the NTIMIT database was used [19]. It was created by transmitting speech from the TIMIT database over long-distance telephone lines. Previous work has demonstrated that telephone line use increases the rate of recognition errors; for example, Moreno and Stern [20] report a 68% error rate using a version of SPHINX-II [21] as the CSR system, TIMIT as the training database, and NTIMIT for testing.

CSR platform
In order to test the recognition of continuous speech data enhanced as described above, the HTK-based speech recognizer [22] was used. HTK is an HMM-based toolkit for isolated or continuous whole-word-based recognition systems. The toolkit supports continuous-density HMMs with any number of states and mixture components. It also implements a general parameter-tying mechanism which allows the creation of complex model topologies. Twelve MFCCs were calculated using a 30-millisecond Hamming window advanced by 10 milliseconds for each frame. To do this, an FFT calculates a magnitude spectrum for each frame, which is then averaged into 20 triangular bins arranged at equal mel-frequency intervals. Finally, a cosine transform is applied to these data to calculate the 12 MFCCs, which form a 12-dimensional (static) vector. This static vector is then expanded after enhancement to produce a 36-dimensional vector (static + first and second derivatives: MFCC_D_A) upon which the HMMs that model the speech subword units were trained. With this frame rate, the 1140 sentences of the dr1 and dr2 TIMIT subsets provided 342993 frames for training. The baseline system used a triphone Gaussian-mixture HMM system. Triphones were trained through a tree-based clustering method to deal with unseen contexts: a set of binary questions about phonetic contexts is built, and the decision tree is constructed by selecting the best question from the rule set at each node [23].
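The expansion of the 12 static MFCCs into the 36-dimensional MFCC_D_A vector uses regression-based derivatives; a sketch in the spirit of HTK's delta formula follows (window size and edge handling are assumptions):

```python
import numpy as np

def deltas(ceps, window=2):
    """Regression deltas d_t = sum_th th*(c_{t+th} - c_{t-th}) / (2*sum th^2),
    with edge frames replicated, applied along the time axis (rows)."""
    T = len(ceps)
    padded = np.pad(ceps, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(th * th for th in range(1, window + 1))
    d = np.zeros_like(ceps)
    for th in range(1, window + 1):
        d += th * (padded[window + th: window + th + T]
                   - padded[window - th: window - th + T])
    return d / denom

def mfcc_d_a(static):
    """Stack static coefficients with their first and second derivatives."""
    d = deltas(static)
    return np.hstack([static, d, deltas(d)])
```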

GA parameters
A population of 150 individuals is generated for each β_k and evolves during 300 generations. The values of the GA parameters given in Table 1 were selected after extensive cross-validation experiments and were shown to perform well with all data. The maximum number of generations and the population size are well adapted to our problem, since no improvement was observed when these parameters were increased. At each generation, the best individuals are retained to reproduce. At the end of the evolution process, the best individuals of the best population are taken as the optimized KLT axes. This method is used by Houck et al. in [15]. For this purpose, the data sets are composed of 114331 frames extracted from the TIMIT training subset and the corresponding noisy frames extracted from the noisy TIMIT and NTIMIT databases.

CSR under additive car noise environment
Experiments were done using the noisy version of TIMIT at different values of SNR, from 16 dB to −4 dB. Figure 3 shows that using the KLT-GA-based optimization to enhance the MFCCs used for recognition with N-mixture Gaussian HMMs (N = 1, 2, 4, 8) and triphone models leads to a higher word recognition rate. The CSR system including the KLT-GA-processed MFCCs performs significantly better than the MFCC_D_A- and KLT-MFCC_D_A-based CSR systems, in both low and high noise conditions. The system containing enhanced MFCCs achieves 81.67% as the best word recognition rate (%C_W) for 16-dB SNR and four Gaussian mixtures. In the same conditions, the baseline system dealing with noisy MFCCs and the system containing KLT-processed MFCCs achieve 73.89% and 77.25%, respectively. The increase in accuracy is more significant in low SNR conditions, which attests to the robustness of the approach when acoustic conditions become severely degraded. For instance, in the −4-dB SNR case, the KLT-GA-MFCC-based CSR system has an accuracy higher than the KLT-MFCC- and MFCC-based CSR systems by 12% and 20%, respectively. The comparison between KLT- and KLT-GA-processed MFCCs shows that the proposed evolutionary approach is more powerful whatever the level of noise degradation: relative to the KLT-based CSR, inclusion of the GA technique raised accuracy by about 11%. Figure 4 plots the variations of the first four MFCCs for a signal chosen from the test set. The comparison illustrated in this figure makes clear that the MFCCs processed with the proposed KLT-GA-based approach are less variant than the noisy MFCCs and closer to the original ones.

Speech under telephone channel degradation
Extensive experimental studies have characterized the impairments induced by telephone networks [24]. When speech is recorded through telephone lines, the reduction in analysis bandwidth yields higher recognition error, particularly when the system is trained with high-quality speech and tested using simulated telephone speech [20]. In our experiments, the training set (the dr1 and dr2 subdirectories of TIMIT; 1140 sentences and 342993 frames) was used to train a set of clean speech models. The dr1 subdirectory of NTIMIT, composed of 110 sentences and 34964 frames, was used as the test set. The speakers and sentences used in the test were different from those used in the training phase. For the KLT- and KLT-GA-based CSR systems, we found that using the KLT-GA as a preprocessing approach to enhance the MFCCs used for recognition with N-mixture Gaussian HMMs (N = 1, 2, 4, and 8) and triphone models led to an important improvement in the word recognition rate. Table 2 shows that this difference can reach 27% between the MFCC_D_A- and KLT-GA-MFCC_D_A-based CSR systems. Table 2 also shows that substitution and insertion errors are considerably reduced when the evolutionary approach is included, which makes the CSR system more effective.

CONCLUSION
We have illustrated the suitability of EAs, particularly GAs, for an important real-world application by presenting a new robust CSR system. This system is based on a hybrid KLT-GA noise reduction approach in the cepstral domain that yields less-variant parameters. Experiments show that the use of parameters enhanced by this hybrid approach increases the recognition rate of the CSR process in highly interfering car noise environments, for a wide range of SNRs varying from 16 dB to −4 dB, and when speech is subjected to telephone channel degradation. The approach can be applied whatever the distortion of the vectors, provided the fitness function can be identified. The front-end of the proposed KLT-GA-based CSR system does not require any a priori knowledge about the nature of the corrupting noise signal, which allows it to deal with any kind of noise. Moreover, using this enhancement technique avoids the noise estimation process that requires a speech/nonspeech preclassification, which may not be accurate at low SNRs. It is also interesting to note that such a technique is less complex than many other enhancement techniques, which need to either model or compensate for the noise. However, this enhancement technique requires

Figure 1: General overview of the KLT-EA-based CSR robust system.

Figure 2: Evolution of the performance of the best individual during 300 generations. Only the first four axes are considered among the twelve.
Algorithm 2: The evolutionary search technique for the best KLT axes.
  Fix the number of generations Gen_max and the boundaries of the axes
  Generate, for each principal KLT component, a population of axes
  For Gen_max generations Do
    For each set of components Do
      Project the noisy data using the KLT axes
      Evaluate the global Euclidean distance to the clean data
    End For
    Select and Reproduce
  End For
  Project the noisy data onto the space generated by the best individuals
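The per-axis search loop of Algorithm 2 can be sketched in a deliberately simplified form (rank-based survival, a blend crossover, and a shrinking Gaussian mutation stand in for the paper's exact operators; all names and parameters are illustrative):

```python
import numpy as np

def evolve_axis(fitness, dim, pop_size=30, gens=100, lower=-1.0, upper=1.0, seed=0):
    """Evolve one bounded real-coded axis: keep the best half of the
    population each generation and refill it with blended, slightly
    mutated children whose mutation size shrinks as generations pass."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(lower, upper, size=(pop_size, dim))
    for gen in range(gens):
        scores = np.array([fitness(ind) for ind in pop])
        pop = pop[np.argsort(scores)[::-1]]          # best first (elitism)
        elite = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(elite):
            i, j = rng.integers(len(elite), size=2)
            g = rng.uniform()
            child = elite[i] + g * (elite[i] - elite[j])              # blend/extrapolate
            child = child + rng.normal(scale=0.05 * (1.0 - gen / gens), size=dim)
            children.append(np.clip(child, lower, upper))
        pop = np.vstack([elite, children])
    return pop[0]                                    # best evaluated individual
```

As a toy check, driving the loop with a projection-variance fitness recovers the dominant variance direction of synthetic data, which is exactly the kind of axis the KLT-GA search is after.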

Table 1: Values of the parameters used in the GA.