Likelihood-Maximizing-Based Multiband Spectral Subtraction for Robust Speech Recognition

Automatic speech recognition performance degrades significantly when speech is affected by environmental noise. Nowadays, the major challenge is to achieve good robustness in adverse noisy conditions so that automatic speech recognizers can be used in real situations. Spectral subtraction (SS) is a well-known and effective approach; it was originally designed for improving the quality of speech signals as judged by human listeners. SS techniques usually improve the quality and intelligibility of the speech signal, whereas speech recognition systems need compensation techniques to reduce the mismatch between noisy speech features and the clean trained acoustic model. Nevertheless, some correlation can be expected between speech quality improvement and increased recognition accuracy. This paper proposes a novel approach that treats SS and the speech recognizer not as two independent entities cascaded together, but rather as two interconnected components of a single system, sharing the common goal of improved speech recognition accuracy. The architecture incorporates information from the statistical models of the recognition engine as feedback for tuning the SS parameters. Using this architecture, we overcome the drawbacks of previously proposed methods and achieve better recognition accuracy. Experimental evaluations show that the proposed method achieves significant improvements in recognition rates across a wide range of signal-to-noise ratios.


Introduction
With the increasing role of computers and electronic devices in everyday life, traditional interfaces such as the mouse, keyboard, buttons, and knobs are no longer satisfying, and the desire for more convenient and more natural interfaces has grown. Current speech recognition technology offers an ideal complement to traditional visual and tactile man-machine interfaces. Although state-of-the-art speech recognition systems perform well in laboratory environments, their accuracy degrades drastically in real noisy conditions. Therefore, improving speech recognizer robustness remains a major challenge. Statistical speech recognition first learns the distributions of the acoustic units from training data and then assigns each part of the speech signal to the class in the lexicon that most likely generated the observed feature vector. When noise affects the speech signal, the distributions characterizing the features extracted from noisy speech differ from the corresponding distributions estimated from clean speech in the training phase. This mismatch results in misclassification and decreases speech recognition accuracy [1, 2]. The degradation can only be ameliorated by reducing the difference between the distributions of the test data and those used by the recognizer. The problem of noisy speech recognition thus still poses a challenge to the field of signal processing.
In recent decades, to reduce this mismatch and to compensate for the noise effect, different methods have been proposed. These methods can be classified into three categories.
Signal Compensation. Methods in this category operate on the speech signal prior to feature extraction and recognition. They remove or reduce noise effects in the preprocessing stage. Since the goal of this approach is both transforming the noisy signal to resemble clean speech and improving the quality of the speech signal, these methods could also be called speech enhancement methods, and they are used as a front end for the speech recognizer. Spectral subtraction (SS) [3-9], Wiener filtering [10, 11], and model-based speech enhancement [12-14] are widely used instances of this approach. Among signal compensation methods, SS is simple and easy to implement. Despite its low computational cost, it is very effective when the noise corrupting the signal is additive and varies slowly with time.
Feature Compensation. This approach attempts either to extract feature vectors invariant to noise or to increase robustness of the current feature vectors against noise. Representative methods include codeword-dependent cepstral normalization (CDCN) [15], vector Taylor series (VTS) [16], multivariate Gaussian-based cepstral compensation (RATZ) [17], cepstral mean normalization (CMN) [18], and RASTA/PLP [19,20]. Among all methods developed in this category, CMN is probably the most ubiquitous. It improves recognition performance under all kinds of conditions, even when other compensation methods are applied simultaneously. So, most speech recognition systems use CMN by default.
Classifier Compensation. Another approach to compensating for noise effects is to change the parameters of the classifier so that the statistical distributions become similar to those of the test data. Some methods, such as parallel model combination (PMC) [21] and model composition [22], change the distribution of the acoustic unit to compensate for the additive noise effect. Other methods, like maximum likelihood linear regression (MLLR) [23], compute a transformation matrix for the mixture component means using linear regression. However, these methods require access to the parameters of the HMM, which is not always possible; for example, commercial recognizers often do not permit users to modify the recognizer components or even access them. Classifier compensation methods usually require more computation than other compensation techniques and introduce latencies due to the time taken to adapt the models.
In recent years, some new approaches such as multistream [24] and missing features [25] have been proposed for dealing with the mismatch problem. These techniques try to improve speech recognition performance by giving less weight to noisy parts of the speech signal in the recognition process considering the fact that the signal-to-noise ratio (SNR) differs in various frequency bands [26]. More recently, a new method was proposed for distant-talking speech recognition using a microphone array in [27]. In this approach, called likelihood-maximizing beamforming, information from the speech recognition system itself is used to optimize a filter-and-sum beamformer.
Not all methods described above are equally applicable or effective in all situations. For instance, in commercial speech recognition engines, users have no access to the features extracted from the speech signal, so only signal compensation methods can be used. Even in systems with accessible features, computational efficiency may restrict the use of compensation methods. In such cases, SS-based methods are suitable. Different variations of the SS method originally proposed by Boll [3] were developed over the years to improve the intelligibility and quality of noisy speech, such as generalized SS [28], nonlinear SS [7], multiband SS [29], SS with an MMSE STSA estimator [30], extended SS [31], and SS based on perceptual properties [32, 33]. The most common variation involves an oversubtraction factor that controls, to some degree, the amount of speech spectral distortion caused by the subtraction process. Different methods were proposed for computing the oversubtraction factor based on criteria that include linear [28] and nonlinear [7] functions of the spectral SNR of individual frequency bins or bands [29] and psychoacoustic masking thresholds [34].
In conventional methods [35-39] that incorporate SS as a signal compensation method in the front end of speech recognition systems, there is no feedback from the recognition stage to the enhancement stage; they implicitly assume that generating a higher-quality output waveform will necessarily improve recognition performance. However, speech recognition is a classification problem, whereas speech enhancement is a signal processing problem. It is therefore possible that a speech enhancement algorithm improves the perceived quality of the processed speech signal without improving recognition performance, because the enhancement may introduce distortions to which the human ear is insensitive but the speech recognition system is not [40]. For instance, in telephony speech recognition, where a clean speech model is not available, any signal compensation judged by a waveform-level criterion will increase the mismatch between the enhanced speech features and the telephony model. Thus, speech enhancement methods improve speech recognition accuracy only when they generate the sequence of feature vectors that maximizes the likelihood of the correct transcription with respect to the other hypotheses. Hence, it seems logical that each improvement in the preprocessing stage be driven by a recognition criterion instead of a waveform-level criterion such as signal-to-noise ratio or signal quality. We believe this is the underlying reason why many SS methods proposed in the literature produce high-quality output waveforms but no significant improvements in speech recognition accuracy.
Following this idea, this paper introduces a novel approach for applying multiband SS in the speech recognition front end. SS is effective when the noise is additive and uncorrelated with the speech signal; it is simple to implement and has low computational cost. Its main disadvantage is that it introduces distortions, such as musical noise, into the speech signal. We show experimentally that incorporating the speech recognition system into the filter design process improves recognition performance significantly. We assume that by maximizing, or at least increasing, the likelihood of the correct hypothesis, speech recognition performance will improve. The goal of our proposed method is therefore not to generate an enhanced output waveform but to generate a sequence of features that maximizes the likelihood of the correct hypothesis. To implement this idea, assuming mel-frequency cepstral coefficient (MFCC) feature extraction and an HMM-based speech recognizer, we use an utterance for which the transcription is given and formulate the relation between the SS filter parameters and the likelihood of the correct model. The proposed method has two phases: adaptation and decoding. In the adaptation phase, the spectral oversubtraction factor is adjusted to maximize the acoustic likelihood of the correct transcription. In the decoding phase, the optimized filter is applied to all incoming speech. Figure 1 shows the block diagram of the proposed approach.
The remainder of this paper is organized as follows. In Section 2, we review SS and multiband SS. Formulae for maximum likelihood-based SS (MLBSS) are derived in Section 3. Our proposed algorithm and its combination with CMN technique are described in Sections 4 and 5, respectively. Extensive experiments to verify the effectiveness of our algorithm are presented in Section 6, and finally in Section 7, we present the summary of our work.

Spectral Subtraction (SS)
SS is one of the most established and well-known enhancement methods for removing additive, uncorrelated noise from noisy speech. SS divides the speech utterance into speech and nonspeech regions. It first estimates the noise spectrum from the nonspeech regions and then subtracts the estimated noise from the noisy speech, producing an improved speech signal. Assume that clean speech s(t) is converted to noisy speech y(t) by adding uncorrelated noise n(t), where t is the time index:

y(t) = s(t) + n(t). (1)

Because the speech signal is nonstationary and time variant, it is split into frames; by applying the Fourier transform and making some approximations, we obtain the generalized formula

|Y_n(k)|^T ≈ |S_n(k)|^T + |N_n(k)|^T, (2)

where n is the frame number; Y_n(k), S_n(k), and N_n(k) are the kth Fourier transform coefficients of the nth noisy speech, clean speech, and noise frames, respectively; and T is the power exponent. SS has two stages, which we describe briefly in the following subsections.
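As an illustrative sketch of the framing and spectral analysis step above, the following computes |Y_n(k)|^T for each frame; the frame length, hop size, window choice, and sampling rate are our own assumptions, not values from the paper.

```python
import numpy as np

def frame_signal(y, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping windowed frames
    (hypothetical sizes: 25 ms frames, 10 ms hop, at 16 kHz)."""
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return y[idx] * np.hamming(frame_len)

def magnitude_spectra(frames, n_fft=512, T=2):
    """|Y_n(k)|^T for every frame; T=1 gives magnitude SS, T=2 power SS."""
    return np.abs(np.fft.rfft(frames, n_fft)) ** T

rng = np.random.default_rng(0)
y = rng.standard_normal(16000)      # 1 s of noise as a stand-in signal
frames = frame_signal(y)
Y = magnitude_spectra(frames)
```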

Noise Spectrum Update.
Because estimating the noise spectrum is an essential part of the SS algorithm, many methods have been proposed [41, 42]. One of the most common methods, and the one used in this paper, is given by [28]

|N̂_n(k)| = λ|Y_n(k)| + (1 − λ)|N̂_{n−1}(k)|  if |Y_n(k)| < β|N̂_{n−1}(k)|,
|N̂_n(k)| = |N̂_{n−1}(k)|  otherwise,  (3)

where |N̂_n(k)| is the estimated noise magnitude at the kth Fourier transform coefficient of the nth frame, |Y_n(k)| is the corresponding noisy speech magnitude, and 0 ≤ λ ≤ 1 is the noise updating factor. If a large λ is chosen, the estimated noise spectrum changes rapidly and may result in poor estimation. On the other hand, if a small λ is chosen, the estimation is more robust when the noise spectrum is stationary or changes slowly in time, but the system cannot follow rapid noise changes. In turn, β is the threshold parameter for distinguishing between noise and speech signal frames.
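A minimal sketch of this recursive update follows. The noise/speech decision is applied per frequency bin here as a simplifying assumption; the rule in [28] may apply the β threshold per frame, and the default λ and β values are illustrative.

```python
import numpy as np

def update_noise(noise_est, Y_frame, lam=0.1, beta=2.0):
    """Recursive noise-spectrum update (a sketch of the rule in [28]).
    A bin is treated as noise when its magnitude does not exceed beta
    times the current estimate; lam controls the adaptation speed."""
    is_noise = Y_frame <= beta * noise_est
    out = noise_est.copy()
    out[is_noise] = lam * Y_frame[is_noise] + (1 - lam) * noise_est[is_noise]
    return out
```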

Noise Spectrum Subtraction.
After noise spectrum estimation, we estimate the clean speech spectrum, Ŝ_n(k), using

|Ŝ_n(k)|^T = |Y_n(k)|^T − α|N̂_n(k)|^T  if |Y_n(k)|^T − α|N̂_n(k)|^T > γ|Y_n(k)|^T,
|Ŝ_n(k)|^T = γ|Y_n(k)|^T  otherwise,  (4)

where α is the oversubtraction factor, chosen between 0 and 3, which compensates for errors in noise spectrum estimation. Therefore, to obtain better results, this parameter should be set accurately and adaptively. The parameter γ is the spectral floor factor, a small positive number ensuring that the estimated spectrum never becomes negative. We estimate the initial noise spectrum by averaging the first few frames of the speech utterance (assuming these frames are pure noise). The parameter T is usually set to 1 or 2: T = 1 yields the original magnitude SS, and T = 2 yields the power SS algorithm. Errors in determining nonspeech regions cause incorrect noise spectrum estimation and may therefore distort the processed speech spectrum. Spectral noise estimation is sensitive to spectral noise variation even when the noise is stationary, because the absolute value of the noise spectrum may differ from the noise mean, causing negative spectral estimates. Although the spectral floor factor γ prevents this, it may distort the processed signal and generate musical noise artifacts. Since Boll's original work [3], several variations of the method have been proposed in the literature to reduce musical noise. These methods perform noise suppression in the autocorrelation, cepstral, logarithmic, and subspace domains. A variety of preprocessing and postprocessing methods attempt to reduce musical noise while minimizing speech distortion [43-46].
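The subtraction stage with oversubtraction factor α and spectral floor γ can be sketched as follows; this is a Berouti-style implementation, and the parameter defaults are illustrative.

```python
import numpy as np

def spectral_subtract(Y, N, alpha=2.0, gamma=0.01, T=2):
    """Over-subtraction with a spectral floor.
    Y and N are the |.|^T spectra of the noisy frame and the noise
    estimate; bins that would go below gamma*Y are clamped to the floor."""
    S = Y - alpha * N
    floor = gamma * Y
    return np.maximum(S, floor)
```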

Multiband Spectral Subtraction (MBSS).
Basic SS assumes that noise affects the whole speech spectrum equally and consequently uses a single oversubtraction factor for the entire spectrum. Real-world noise, however, is mostly colored and does not affect the speech signal uniformly across the spectrum, which suggests using a frequency-dependent subtraction factor to account for different types of noise. Nonlinear spectral subtraction (NSS), proposed in [7], extends this capability by making the oversubtraction factor frequency dependent and the subtraction process nonlinear: larger values are subtracted at frequencies with low SNR levels, and smaller values at frequencies with high SNR levels. This gives greater flexibility in compensating for errors in estimating the noise energy in different frequency bins. The motivation behind the MBSS approach is similar to that of NSS. The main difference is that MBSS estimates one oversubtraction factor per frequency band, whereas NSS estimates one per individual Fast Fourier Transform (FFT) bin. Different approaches based on MBSS have been proposed. In [47], the speech spectrum is divided into a relatively large number of bands, and a fixed oversubtraction factor is used for all bands. In Kamath and Loizou's method [29], an optimum oversubtraction factor is computed for each band based on its SNR. Another method, similar to [29], proposed in [48] applies the Berouti et al. SS method [28] to each critical band of the speech spectrum.
We select the MBSS approach because it is computationally more efficient in our proposed framework. Also, as reported in [49], speech distortion is expected to be markedly reduced with the MBSS approach. In this work, we divide the speech spectrum using mel-scale frequency bands (inspired by the structure of the human cochlea [29]) and use a separate oversubtraction factor for each band. The oversubtraction vector is therefore defined as

α = [α_1, α_2, . . . , α_B]^T,  (5)

where B is the number of frequency bands. From this section we conclude that the oversubtraction factor is the most influential parameter in the SS algorithm. By adjusting this parameter for each frequency band, we can expect a remarkable improvement in the performance of speech recognition systems. In the next section, we present a novel framework for optimizing the vector α based on feedback information from the speech recognizer back end.
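Per-band subtraction with mel-spaced bands can be sketched as follows; the band count, sampling rate, FFT size, and non-overlapping band edges are illustrative assumptions (the paper's bands follow the overlapping mel filters).

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(B=24, sr=16000, n_fft=512):
    """FFT-bin edges of B mel-spaced bands (hypothetical B, sr, n_fft)."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), B + 1)
    return np.floor(mel_to_hz(mels) * (n_fft / 2) / (sr / 2)).astype(int)

def multiband_subtract(Y, N, alpha_vec, gamma=0.01, edges=None):
    """Apply the per-band over-subtraction factor alpha_vec[j] to band j."""
    if edges is None:
        edges = mel_band_edges(len(alpha_vec), n_fft=2 * (len(Y) - 1))
    edges = np.array(edges, copy=True)
    edges[-1] = len(Y)                 # make the last band reach the top bin
    S = np.empty_like(Y)
    for j, a in enumerate(alpha_vec):
        lo, hi = edges[j], edges[j + 1]
        S[lo:hi] = np.maximum(Y[lo:hi] - a * N[lo:hi], gamma * Y[lo:hi])
    return S
```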

Maximum Likelihood-Based Spectral Subtraction (MLBSS)
Conventional SS uses waveform-level criteria, such as maximizing the signal-to-noise ratio or minimizing the mean square error, and tries to decrease the distance between noisy speech and the desired speech. As mentioned in the introduction, these criteria do not necessarily decrease the word error rate. Therefore, in this paper, instead of a waveform-level criterion, we use a word-error-rate criterion for adjusting the spectral oversubtraction vector. One logical way to achieve this goal is to select the oversubtraction vector such that the acoustic likelihood of the correct hypothesis in the recognition procedure is maximized. This increases the distance between the acoustic likelihood of the correct hypothesis and those of competing hypotheses, so that the probability of the utterance being correctly recognized is increased. To implement this idea, the relation between the oversubtraction factor in the preprocessing stage and the acoustic likelihood of the correct hypothesis in the decoding stage is formulated. The derived formulae depend on the feature extraction algorithm and the acoustic unit model. In this paper, MFCCs serve as the extracted features, and hidden Markov models with Gaussian mixtures in each state serve as acoustic unit models. Speech recognition systems based on statistical models find the word sequence most likely to have generated the observation feature vectors Z = {z_1, z_2, . . . , z_t} extracted from the improved speech signal. These observation features are a function of both the incoming speech signal and the oversubtraction vector. Statistical speech recognizers obtain the most likely hypothesis based on Bayes' classification rule:

ŵ = argmax_w P(Z(α) | w) P(w),  (6)

where the observation feature vector is a function of the oversubtraction vector α. In (6), P(Z(α) | w) and P(w) are the acoustic and language scores, respectively. Our goal is to find the oversubtraction vector α that achieves the best recognition performance.
As in speaker and environmental adaptation methods, adjusting the oversubtraction vector α requires access to adaptation data with known phoneme transcriptions. We assume that the correct transcription w_C of the utterance is known. Hence, the value of P(w_C) can be ignored, since it is constant regardless of the value of α. We can then maximize (6) with respect to α as

α̂ = argmax_α P(Z(α) | w_C).  (7)

In an HMM-based speech recognition system, the acoustic likelihood P(Z(α) | w_C) is the sum over all possible state sequences for the given transcription. Since most state sequences are unlikely, we assume that the acoustic likelihood of the given transcription can be approximated by that of the single most likely state sequence; this assumption also reduces computational complexity. If S_C represents the set of all state sequences in the combined HMM and s represents the most likely state sequence, then the maximum likelihood estimate of α is given by

α̂ = argmax_α max_{s∈S_C} P(Z(α), s | w_C).  (8)

According to (8), in order to find α, the acoustic likelihood of the correct transcription should be jointly maximized with respect to the state sequence and the α parameters. This joint optimization has to be performed iteratively.
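The alternating structure of the joint optimization in (8) can be sketched generically. Here `loglik` and `align` are hypothetical stand-ins for the recognizer's likelihood and Viterbi alignment, and a coordinate-wise trial step replaces the gradient machinery derived later; the toy objective at the bottom exists only to make the sketch runnable.

```python
import numpy as np

def joint_maximize(loglik, align, alpha0, n_iter=5, step=0.1):
    """Coordinate-ascent sketch of Eq. (8): alternately fix alpha and
    pick the best state sequence, then fix the sequence and improve
    alpha. loglik(alpha, s) and align(alpha) are caller-supplied."""
    alpha = np.asarray(alpha0, dtype=float)
    for _ in range(n_iter):
        s = align(alpha)                  # Viterbi step: fix alpha, pick s
        for j in range(len(alpha)):       # alpha step: fix s, try small moves
            for delta in (step, -step):
                trial = alpha.copy()
                trial[j] += delta
                if loglik(trial, s) > loglik(alpha, s):
                    alpha = trial
    return alpha

# toy problem: likelihood peaks at alpha = [1, 2]; alignment is ignored
target = np.array([1.0, 2.0])
ll = lambda a, s: -np.sum((a - target) ** 2)
al = lambda a: [0]
alpha_opt = joint_maximize(ll, al, np.zeros(2), n_iter=30)
```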
In (8), the maximum likelihood estimate of α may become negative. This usually happens when the test speech is cleaner than the training speech, for example, when the acoustic model is trained on noisy speech and used in a clean environment. In such cases, the negative oversubtraction factor adds noise to the speech spectrum. This is not an undesired effect; in fact, it is one of the most important advantages of our algorithm, because adding the noise power spectral density to the noisy speech spectrum decreases the mismatch and consequently results in better recognition performance.

State Sequence Optimization.
Noisy speech is passed through the SS filter, and feature vectors Z(α) are obtained for a given value of α. The optimal state sequence s = {s_1, s_2, . . . , s_t} is then computed, given the correct phonetic transcription w_C, as

ŝ = argmax_{s∈S_C} P(Z(α), s | w_C).  (9)

The state sequence s can be computed simply with the Viterbi algorithm [50].
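Forced alignment with the Viterbi algorithm can be sketched as follows for a left-to-right HMM built from the known transcription; this is a generic textbook implementation with caller-supplied log scores, not the recognizer's own code.

```python
import numpy as np

def viterbi_align(log_emis, log_trans):
    """Most likely state path through a left-to-right HMM.
    log_emis[t, i] is the log-likelihood of frame t in state i;
    log_trans[i, j] is the transition log-probability (use -inf to
    forbid a transition). The path must start in state 0 and end in
    the last state, as in forced alignment."""
    T, S = log_emis.shape
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)
    delta[0, 0] = log_emis[0, 0]
    for t in range(1, T):
        for j in range(S):
            scores = delta[t - 1] + log_trans[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + log_emis[t, j]
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(psi[t, path[-1]])
    return path[::-1]
```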

Spectral Oversubtraction Vector Optimization.
Given the state sequence s, we want to find α so that

α̂ = argmax_α P(Z(α), s | w_C).  (10)

This acoustic likelihood cannot be directly optimized with respect to the SS parameters for two reasons. First, the statistical distributions in each HMM state are complex density functions, such as mixtures of Gaussians. Second, several linear and nonlinear mathematical operations are performed on the speech signal to extract the feature vectors, so the acoustic likelihood is a complicated function of the α vector. Therefore, a closed-form solution for the optimal α given a state sequence is not available; hence, nonlinear optimization is used.

Computing Gradient Vector.
We use a gradient-based approach to find the optimal value of the α vector. Given an optimal state sequence in the combined HMM, we define L(α) to be the total log likelihood of the observation vectors. Thus,

L(α) = log P(Z(α), s | w_C) = Σ_{i=1}^{t} log P(z_i(α) | s_i).  (11)

The gradient vector ∇_α L(α) is computed as

∇_α L(α) = Σ_{i=1}^{t} ∇_α log P(z_i(α) | s_i).  (12)

Clearly, computing the gradient vector depends on both the statistical distributions in each state and the feature extraction algorithm. We derive ∇_α L(α) assuming that each state is modeled by a mixture of K multidimensional Gaussians with diagonal covariance matrices. Let μ_ik and Σ_ik be the mean vector and covariance matrix of the kth Gaussian density function in state s_i, respectively. We can then write the acoustic likelihood of the ith frame given the optimal state sequence s = {s_1, s_2, . . . , s_t} as

P(z_i(α) | s_i) = Σ_{k=1}^{K} τ_ik κ_ik exp(G_ik(α)),  (13)

where G_ik(α) is defined as

G_ik(α) = −(1/2) (z_i(α) − μ_ik)^T Σ_ik^{−1} (z_i(α) − μ_ik).  (14)

In (13) and (14), τ_ik is the weight of the kth mixture in the ith state, and κ_ik is a normalizing constant. Using the chain rule, we have

∇_α log P(z_i(α) | s_i) = (∂z_i(α)/∂α) ∇_{z_i} log P(z_i(α) | s_i),  (15)

∇_{z_i} log P(z_i(α) | s_i) = Σ_{k=1}^{K} γ_ik Σ_ik^{−1} (μ_ik − z_i(α)),  (16)

where γ_ik is defined as

γ_ik = τ_ik κ_ik exp(G_ik(α)) / Σ_{m=1}^{K} τ_im κ_im exp(G_im(α)).  (17)

By substituting (17) into (15), we get

∇_α L(α) = Σ_{i=1}^{t} (∂z_i(α)/∂α) Σ_{k=1}^{K} γ_ik Σ_ik^{−1} (μ_ik − z_i(α)).  (18)

In (18), ∂z_i(α)/∂α is the Jacobian matrix comprised of the partial derivatives of each element of the ith frame feature vector with respect to each component of the oversubtraction vector α:

∂z_i(α)/∂α = [∂z_{i,c}(α)/∂α_b],  b = 1, . . . , B,  c = 1, . . . , C.  (19)

The dimensionality of the Jacobian matrix is B × C, where B is the number of elements in the vector α and C is the dimension of the feature vector. The full derivation of the Jacobian matrix for MFCC features is given in the following subsection.
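The gradient computation of (15)-(18) can be sketched for diagonal-covariance Gaussian mixtures. The function and variable names are ours, and the per-frame Jacobians are assumed to be supplied by the feature-extraction derivatives of the next subsection.

```python
import numpy as np

def frame_loglik_grad(z, means, variances, weights):
    """d log p(z)/dz for a diagonal-covariance Gaussian mixture:
    sum_k gamma_k * (mu_k - z) / var_k, with posteriors gamma_k
    computed in the log domain for numerical stability."""
    # per-component log score (up to a constant shared by all components)
    comp = (np.log(weights)
            - 0.5 * np.sum(np.log(variances), axis=1)
            - 0.5 * np.sum((z - means) ** 2 / variances, axis=1))
    loglik = np.logaddexp.reduce(comp)
    gamma = np.exp(comp - loglik)                 # Eq. (17) posteriors
    grad_z = gamma @ ((means - z) / variances)    # Eq. (16)
    return loglik, grad_z

def total_grad_alpha(Z, jacobians, state_params):
    """Eq. (18): grad_alpha L = sum_i J_i . dlogp(z_i)/dz_i, where
    J_i = dz_i/dalpha is the B x C Jacobian of frame i."""
    g = None
    for z, J, (mu, var, w) in zip(Z, jacobians, state_params):
        _, gz = frame_loglik_grad(z, mu, var, w)
        g = J @ gz if g is None else g + J @ gz
    return g
```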

Computing Jacobian Matrices.
Every element of the feature vector is a function of all elements of the α vector. Therefore, to compute each element of the Jacobian matrix, we derive formulas for the derivative of the feature vector with respect to the SS output. Assume that x[n] is the input signal and X[k] its Fourier transform. We set the number of frequency bands in multiband SS equal to the number of mel filters; that is, for each mel filter we have one SS filter coefficient. Since the mel filters are a series of overlapping triangular weighting functions, we define α_j[k] as

α_j[k] = α_j  for ω_j ≤ k ≤ ω_{j+1},  and 0 otherwise,  (20)

where ω_j and ω_{j+1} are the lower and upper bounds of the jth mel filter. The output of the SS filter, Y[k], is computed as

|Y[k]|² = (|X[k]|² − (1/β[k]) Σ_{j=1}^{B} α_j[k] |N̄[k]|²) · U(|X[k]|² − (1/β[k]) Σ_{j=1}^{B} α_j[k] |N̄[k]|²),  (21)

where U is the step function, |N̄[k]|² is the average noise spectrum of the frames labeled as silence, and β[k] is the kth element of the β vector, which has the value 2 in the overlapping parts of the mel filters and the value 1 otherwise (Figure 2). The gradient of |Y[k]|² with respect to the elements of the α vector is

∂|Y[k]|²/∂α_j = −(α_j[k]/α_j)(1/β[k]) |N̄[k]|² · U(|X[k]|² − (1/β[k]) Σ_{j=1}^{B} α_j[k] |N̄[k]|²).  (22)

In our experiments, ten frames from the beginning of the speech signal are assumed to be silence. We update the noise spectrum using (3), and the lth component of the mel spectral vector is computed as

m_l = Σ_{k=0}^{N/2} v_l[k] |Y[k]|²,  (23)

where v_l[k] is the kth coefficient of the lth triangular mel filter bank and N is the number of Fourier transform coefficients. The gradient of (23) with respect to α is

∂m_l/∂α_j = Σ_{k=0}^{N/2} v_l[k] ∂|Y[k]|²/∂α_j.  (24)

We obtain the cepstral vector by first computing the logarithm of each element of the mel spectral vector and then performing a DCT operation:

z = Φ log(m),  (25)

where Φ is a DCT matrix of dimension C × L.
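The mel filtering and DCT steps of (23)-(25) can be sketched as follows; the DCT normalization, the log floor, and the filter shapes used in the test are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

def dct_matrix(C, L):
    """Type-II DCT matrix Phi (C x L) mapping log mel energies to cepstra."""
    n = np.arange(L)
    return np.cos(np.pi * np.arange(C)[:, None] * (2 * n + 1) / (2 * L))

def mfcc_from_power(Y, mel_filters, C=13):
    """Map one enhanced power spectrum |Y[k]|^2 to C cepstral coefficients:
    mel filtering (Eq. 23), log compression, then DCT (Eq. 25).
    mel_filters has one row per triangular filter v_l[k]."""
    m = mel_filters @ Y                          # mel spectral vector
    return dct_matrix(C, len(m)) @ np.log(np.maximum(m, 1e-10))
```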
Using the gradient vector defined in (18), the α vector can be optimized using the conventional gradient-based approach. In this work, we perform optimization using the method of conjugate gradients.
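The α update can use a simple nonlinear conjugate-gradient routine. This numpy-only sketch implements Fletcher-Reeves with a fixed step size standing in for a proper line search, and a toy quadratic replaces the recognizer's L(α); it illustrates the optimizer shape, not the paper's exact implementation.

```python
import numpy as np

def conjugate_gradient_ascent(grad, x0, n_iter=50, lr=0.1):
    """Minimal nonlinear conjugate-gradient (Fletcher-Reeves) ascent
    using only the gradient of the objective."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = g.copy()
    for _ in range(n_iter):
        x = x + lr * d                               # step along direction
        g_new = grad(x)
        beta = (g_new @ g_new) / max(g @ g, 1e-12)   # Fletcher-Reeves beta
        d = g_new + beta * d
        g = g_new
    return x

# toy concave objective L(a) = -||a - t||^2, whose maximum is at t
t = np.array([1.0, 2.0])
alpha_star = conjugate_gradient_ascent(lambda a: 2 * (t - a), np.zeros(2))
```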
In this section, we introduced MLBSS, a new approach to SS designed specifically to improve speech recognition performance. This method differs from previous SS algorithms in that no waveform-level criteria are used to optimize the SS parameters. Instead, the SS parameters are chosen to maximize the likelihood of the correct transcription of the utterance, as measured by the statistical models used by the recognizer itself. We showed that finding a solution to this problem involves the joint optimization of the α vector, as the SS parameters, and the most likely state sequence for the given transcription. This is performed by iteratively estimating the optimal state sequence for a given α vector using the Viterbi algorithm and then optimizing the likelihood of the correct transcription with respect to the α vector for that state sequence. For the reasons discussed in Section 3.2, the likelihood of the correct transcription cannot be directly maximized with respect to the α vector, and we therefore use conjugate gradient descent as our optimization method, based on the gradient of the likelihood of the correct transcription derived in Section 3.2.

MLBSS Algorithm in Practice
In Section 3, a new approach to MBSS was presented in which the SS parameters are optimized specifically for speech recognition performance using feedback information from the speech recognition system. Specifically, we showed how the SS parameters (the vector α) can be optimized to maximize the likelihood of an utterance with known transcription. An obvious question arises: if the correct transcription is known a priori, why should there be any need for recognition? The answer is that the correct transcription is only needed in the adaptation phase; in the decoding phase, the filter parameters are fixed. Figure 3 shows the flowchart of our proposed algorithm. First, the user is asked to speak an utterance with a known transcription. The utterance is then passed through the SS filter with fixed initial parameters. Next, the most likely state sequence is generated using the Viterbi algorithm [50], and the optimal SS filter is produced given that state sequence. Recognition is performed on a validation set using the optimized filter. If the desired word error rate is reached, the algorithm terminates; otherwise, a new state sequence is estimated. Figure 3 also shows the details of the SS optimization block. This block iteratively finds the oversubtraction vector that maximizes the total log likelihood of the utterance with the given transcription. First, the feature vector is extracted from the improved speech signal, and the log likelihood is computed given the state sequence. If the likelihood has not converged, the gradient with respect to the oversubtraction vector is computed, and the oversubtraction vector is updated. SS is performed with the updated parameters, and new feature vectors are extracted. This process is repeated until the convergence criterion is satisfied.
In the proposed algorithm, similar to speaker and environment adaptation techniques, the oversubtraction vector adaptation can be implemented either in a separate off-line session or by embedding an incremental on-line step into the normal recognition mode of the system. In off-line adaptation, as explained above, the user is aware of the adaptation process, typically by performing a special adaptation session, while in on-line adaptation the user may not even know that adaptation is being carried out; it is usually embedded in the normal functioning of a speech recognition system. From a usability point of view, incremental on-line adaptation provides several advantages over the off-line approach, making it very attractive for practical applications. First, the adaptation process is hidden from the user. Second, on-line adaptation improves robustness against changing noise conditions, channels, and microphones. Off-line adaptation is usually done as an additional training session in a specific environment, so new environment characteristics cannot be incorporated into the parameter adaptation.
The adaptation data can be aligned with the HMMs in two different ways. In supervised adaptation, the identity of the adaptation data is always known, whereas in the unsupervised case it is not; hence, adaptation utterances are not necessarily correctly aligned. Unsupervised adaptation is usually slow, particularly with speakers whose utterances result in poor recognition performance, because only the correctly classified utterances are useful for adaptation.

Combination of MLBSS and CMN
In the MLBSS algorithm described in Sections 3 and 4, the relations were derived under the assumption of additive noise. However, in some applications, such as distant-talking speech recognition, it is necessary to cope not only with additive noise but also with the acoustic transfer function (channel noise). CMN [18] is a simple (computationally cheap and easy to implement) yet very effective method for removing convolutional noise, such as distortions caused by different recording devices and communication channels. Due to the logarithm in the feature extraction process, linear filtering appears as a constant offset in the filter bank or cepstral domains and hence can be subtracted from the signal. Basic CMN estimates the sample mean of the cepstral vectors of an utterance and then subtracts this mean from every cepstral vector of the utterance. We can combine CMN with the proposed MLBSS method by mean normalization of the Jacobian matrix. Let z̄_i(α) be the mean normalized feature vector:

z̄_i(α) = z_i(α) − μ_z(α),  μ_z(α) = (1/t) Σ_{j=1}^{t} z_j(α).  (26)

The partial derivative of z̄_i(α) with respect to α can be computed as

∂z̄_i(α)/∂α = ∂z_i(α)/∂α − (1/t) Σ_{j=1}^{t} ∂z_j(α)/∂α,  (27)

which is equal to mean normalization of the Jacobian matrix. Hence, feature mean normalization can easily be incorporated into the MLBSS algorithm presented in Section 4. To do so, the feature vector z_i(α) in (11) is replaced by (z_i(α) − μ_z(α)), where μ_z(α) is the mean feature vector computed over all frames in the utterance. Because μ_z(α) is also a function of α, the gradient expressions have to be modified accordingly. Our experimental results show that in real environments better results are obtained when MLBSS and CMN are used together properly.
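CMN itself reduces to subtracting the utterance-level cepstral mean, which can be sketched in a few lines (frames in rows, cepstral coefficients in columns):

```python
import numpy as np

def cmn(cepstra):
    """Cepstral mean normalization: subtract the per-utterance mean
    vector from every cepstral frame of the utterance."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```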

Experimental Results
In this section, the proposed MLBSS algorithm is evaluated and compared with traditional SS methods for speech recognition in a variety of experiments. In order to assess the effectiveness of the proposed algorithm, speech recognition experiments were conducted on three speech databases: FARSDAT [51], TIMIT [52], and a database recorded in a real office environment. The first and second test sets were obtained by artificially adding seven types of noise (alarm, brown, multitalker, pink, restaurant, volvo, and white noise) from the NOISEX-92 database [53] to the FARSDAT and TIMIT speech databases, respectively. The SNR was determined by the energy ratio of the clean speech signal, including silence periods, to the added noise within each sentence. Ideally, the SNR would be measured over speech periods only; however, in our datasets the duration of silence periods in each sentence was less than 10% of the whole sentence length, so the inclusion of silence periods is acceptable for relative performance measurement. Sentences were corrupted by adding noise scaled on a sentence-by-sentence basis to produce the required SNR.
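The per-sentence noise scaling described above can be sketched as follows (an illustrative sketch; the actual corpus-preparation code is not given in the paper):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the sentence-level SNR (computed over
    the whole sentence, silence periods included) equals `snr_db`,
    then add it to `speech`. Both inputs are 1-D sample arrays of
    equal length."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # required noise power: P_s / P_n' = 10^(snr_db / 10)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```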
Speech recognition experiments were conducted with Nevisa [54], a large-vocabulary, speaker-independent, continuous HMM-based speech recognition system developed in the speech processing lab of the Computer Engineering Department of Sharif University of Technology. It was the first system to demonstrate the feasibility of accurate, speaker-independent, large-vocabulary continuous speech recognition in the Persian language. Experiments were carried out in two different operational modes of the Nevisa system: phoneme recognition on the FARSDAT and TIMIT databases, and isolated command recognition on a distant-talking database recorded in a real noisy environment. The reason for reporting phoneme recognition accuracy instead of word recognition accuracy is that in the former case the recognition performance depends primarily on the acoustic model, whereas word recognition performance is sensitive to various additional factors such as the language model type. The phoneme recognition accuracy is calculated as

Accuracy = (N − S − D − I)/N × 100%,

with S, D, and I being the numbers of substitution, deletion, and insertion errors, and N the number of test phonemes.
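The accuracy formula translates directly into a one-line helper (shown for clarity only):

```python
def phoneme_accuracy(S, D, I, N):
    """Phoneme recognition accuracy (%) given S substitutions,
    D deletions, I insertions, and N reference phonemes."""
    return 100.0 * (N - S - D - I) / N
```

Note that, unlike percent-correct, this accuracy measure also penalizes insertions and can therefore be negative for very poor hypotheses.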

Evaluation on Added-Noise Conditions.
In this section, we describe several experiments designed to evaluate the performance of the MLBSS algorithm. We explore several dimensions of the algorithm, including the impact of the SNR and type of the added noise on recognition accuracy, the performance of the single-band version of the algorithm, and the recognition accuracy on a clean test set and on test sets with various SNR levels when the models are trained in noisy conditions. The experiments described here were performed using the hand-segmented FARSDAT database. This database consists of 6080 Persian utterances spoken by 304 speakers. Speakers were chosen from 10 different geographical regions in Iran; hence, the database incorporates the 10 most common dialects of the Persian language. The male-to-female population ratio is two to one. There are a total of 405 sentences in the database and 20 utterances per speaker: each speaker has uttered 18 randomly chosen sentences plus two sentences common to all speakers. The sentences are formed from over 1000 Persian words. The database was recorded in a low-noise environment with an average SNR of 31 dB. One can consider FARSDAT the counterpart of TIMIT for the Persian language. Our clean test set, selected from this database, comprises 140 sentences from 7 speakers; all of the other sentences are used as the training set. To simulate a noisy environment, the test data was contaminated by seven types of additive noise at SNRs ranging from 0 dB to 20 dB in 5 dB steps to produce the various noisy test sets. Consequently, the test set does not capture the effect of stress or the Lombard effect on speech production in noisy environments.
The Nevisa speech recognition engine was used for our experiments. The feature set used in all the experiments was generated as follows. The speech signal, sampled at 22050 Hz, is passed through a pre-emphasis filter and blocked into frames of 20 milliseconds with 12 milliseconds of overlap. A Hamming window is applied to each frame to reduce the effect of frame-edge discontinuities, and a 1024-point FFT is calculated. The magnitude spectrum is warped according to the mel scale and integrated within 25 triangular filters arranged on the mel frequency scale. Each filter output is the logarithm of the sum of the weighted spectral magnitudes. A decorrelation step is performed by applying a discrete cosine transform, and twelve MFCCs are computed from the 25 filter outputs [53]. First- and second-order derivatives of the cepstral coefficients are calculated over a window covering five neighbouring cepstral vectors, making up vectors of 36 coefficients per speech frame.
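A minimal sketch of this front-end, under simplifying assumptions (filter-bank edges spanning 0 Hz to fs/2, no liftering, derivative coefficients omitted), might look like the following; it illustrates the described pipeline and is not the Nevisa code:

```python
import numpy as np

def mfcc_frontend(signal, fs=22050, n_fft=1024, n_filters=25, n_ceps=12):
    """Pre-emphasis, 20 ms frames with 12 ms overlap, Hamming window,
    magnitude FFT, mel triangular filter bank, log energies, DCT."""
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    frame_len = int(0.020 * fs)             # 20 ms frame
    hop = frame_len - int(0.012 * fs)       # 12 ms overlap -> 8 ms hop
    n_frames = 1 + (len(sig) - frame_len) // hop
    win = np.hamming(frame_len)
    frames = np.stack([sig[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n_fft))        # magnitude spectrum
    # triangular filters equally spaced on the mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv_mel(np.linspace(0.0, mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logenergy = np.log(mag @ fbank.T + 1e-10)       # log filter-bank energies
    # DCT-II decorrelation, keeping coefficients 1..n_ceps
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1),
                                  (2 * k + 1) / (2.0 * n_filters)))
    return logenergy @ dct.T                        # (n_frames, n_ceps)
```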
Nevisa uses continuous-density hidden Markov modeling, with each HMM representing a phoneme. The Persian language has 29 phonemes; one additional model represents silence. All HMMs are left-to-right, composed of 5 states with 8 Gaussian mixtures per state. Self-loop transitions as well as forward and skip transitions between the states are allowed. The covariance of each Gaussian is modeled by a single diagonal matrix. Parameter initialization is done using linear segmentation, and the segmental k-means algorithm is used to estimate the parameters over 10 iterations. The Nevisa decoding process consists of a time-synchronous Viterbi beam search.
One of the 140 sentences of the test set is used in the optimization phase of the MLBSS algorithm. After the filter parameters are extracted, speech recognition is performed on the remaining test files using the obtained optimized filter. Table 1 shows phoneme recognition accuracy for the test speech files. To evaluate our algorithm, our results are compared with Kamath and Loizou's [29] multiband spectral subtraction (KLMBSS) method, which uses an SNR-based optimization criterion. In the KLMBSS implementation, the speech signal is first Hamming windowed using a 20-millisecond window and a 10-millisecond overlap between frames. The windowed speech frame is then analyzed using the FFT. The resulting spectrum and the estimated noise spectrum are divided into 25 frequency bands using the same mel spacing as the MLBSS method. The estimate of the clean speech spectrum in the ith band is obtained by

|Ŝ_i(ω_k)|² = max(|Y_i(ω_k)|² − α_i δ_i |D̂_i(ω_k)|², β |Y_i(ω_k)|²),

where |Y_i(ω_k)|² and |D̂_i(ω_k)|² are the noisy-speech and estimated-noise power spectra in the ith band, α_i is the oversubtraction factor of the ith band, δ_i is a band-subtraction factor, and β is a spectral floor parameter set to 0.002. From the experimental results in Table 1, we observe the following. Across the various noise types and SNRs, the proposed method improves recognition performance relative to the classical method. In some cases, Kamath and Loizou's method achieves lower performance than the baseline. This is due to spectral distortions caused by oversubtraction factors that are not adjusted for recognition, which destroy the discriminability exploited in pattern recognition. This mismatch reduces the effectiveness of the clean-trained acoustic models and causes recognition accuracy to decline. A higher SNR difference between training and testing speech causes a higher degree of mismatch and greater degradation in recognition performance.
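The per-band subtraction rule can be sketched as follows; `band_slices` is a hypothetical list of index ranges standing in for the mel-spaced band boundaries:

```python
import numpy as np

def multiband_ss(noisy_psd, noise_psd, band_slices, alphas, deltas, beta=0.002):
    """Multiband spectral subtraction in the KLMBSS style: in band i,
    |S_i|^2 = |Y_i|^2 - alpha_i * delta_i * |D_i|^2, floored at
    beta * |Y_i|^2 to avoid negative power estimates."""
    clean = np.empty_like(noisy_psd)
    for i, sl in enumerate(band_slices):
        est = noisy_psd[sl] - alphas[i] * deltas[i] * noise_psd[sl]
        clean[sl] = np.maximum(est, beta * noisy_psd[sl])
    return clean
```

The spectral floor β keeps low-energy bins from being driven to zero (or negative values), at the cost of residual musical noise.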

Evaluation on Single Band Conditions.
In order to show the efficiency of the MLBSS algorithm for optimizing single-band SS, we compare the results of the proposed method operating in single-band mode with Berouti et al.'s SS [28], a single-band SNR-based method. Results are shown in Figure 4. An inspection of this figure reveals that the single-band MLBSS scheme consistently performs better than the SNR-based approach of Berouti et al. in noisy speech environments across a wide range of SNR values. Results on the clean test set are given in Table 2, where we can see that the recognition accuracy of the MLBSS approach is even slightly higher than that of the baseline, while the KLMBSS method shows a noticeable decline. This can be interpreted as follows: the MLBSS approach compensates only for the actual mismatch, so clean speech is left essentially undistorted.

Experimental Results in Noisy Training Conditions.
In this section, we evaluate the performance of the MLBSS algorithm in noisy training conditions, that is, using noisy speech data in the training phase. Recognition results obtained in noisy training conditions are shown in Figure 5, from which the following deductions can be made: (i) a higher SNR difference between the training and testing speech causes a higher degree of mismatch and therefore results in greater degradation in recognition performance; (ii) the best recognition accuracies are obtained in matched conditions, where the recognition system is trained with speech having the same level of noise as the test speech; (iii) the MLBSS is more effective than the KLMBSS method in overcoming environmental mismatch, where models are trained with noisy speech but the noise type and the SNR level of the test speech are not known a priori; (iv) with the KLMBSS method, a lower SNR of the training data results in greater degradation in recognition performance.

On-Line MLBSS Framework Evaluation.
In this experiment, the performance of incremental on-line adaptation under added-noise conditions is compared to that of off-line adaptation. In the case of supervised off-line adaptation, the parameter update was based on one adaptation utterance spoken in a noisy environment. As mentioned in Section 5, after adaptation, an updated oversubtraction vector is computed from the processed utterance, and this new vector is subsequently used to recognize the remainder of the test data. In the case of incremental on-line adaptation, only correctly recognized test utterances are utilized for adaptation (supervised approach). A new oversubtraction vector is always computed after one correctly recognized utterance has been processed. In order to further evaluate the performance of the on-line version of the proposed algorithm in noise-varying conditions, we carried out a number of experiments in which the SNR of the added noise was made artificially time varying. For this, we varied the SNR linearly from an initial value to a final value within each utterance. Recognition results are shown in Table 3, where 10 → 20 indicates that the SNR was changed linearly within a sentence such that it was 10 dB at the beginning and 20 dB at the end. For this time-varying SNR condition, the on-line MLBSS algorithm yielded the best recognition performance among the evaluated approaches when white noise was used. What should be noted here is that the KLMBSS algorithm resulted in only a modest improvement over the baseline for time-varying SNR conditions; in fact, in the 10 → 20 case, it even decreased recognition performance.

Figure 5: Phoneme recognition accuracy rate (%) as a function of the signal-to-noise ratio of the speech being recognized, where the recognition system has been trained on noisy speech. In (a) and (b), the system has been trained with additive white noise at SNR 10 dB and 20 dB, respectively.
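The incremental supervised loop described above can be sketched as follows; `recognize` and `update_alpha` are hypothetical callables standing in for the recognizer and the oversubtraction re-estimation step:

```python
def online_mlbss(utterances, references, recognize, update_alpha, alpha0):
    """Incremental supervised on-line adaptation sketch: the current
    oversubtraction vector `alpha` is used to recognize each utterance;
    only after a correctly recognized utterance is `alpha` re-estimated
    from that utterance and used for the remainder of the data."""
    alpha = alpha0
    hyps = []
    for utt, ref in zip(utterances, references):
        hyp = recognize(utt, alpha)
        hyps.append(hyp)
        if hyp == ref:                        # supervised correctness check
            alpha = update_alpha(utt, alpha)  # adapt on this utterance only
    return alpha, hyps
```

Misrecognized utterances leave `alpha` unchanged, which is what makes adaptation slow for speakers with poor recognition performance.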
Evaluation on TIMIT Database.
All the above experiments were performed on the FARSDAT database, the counterpart of TIMIT for the Persian language. In order to verify the performance of the MLBSS algorithm, the same experiments as those described in Section 6.1 were conducted on the TIMIT database using the Nevisa system. The results are reported in Table 4. As can be seen, they are in agreement with the results obtained on the FARSDAT database.
It can be concluded from the aforementioned experiments that the MLBSS algorithm can significantly increase the robustness of the recognition system on artificially noise-added data. However, a direct assessment in real environments is still missing, and that is where the desired performance is ultimately needed. Therefore, a third set of experiments was performed, described below.

Evaluation on Data Recorded in Real Environment.
To formally quantify the performance of the proposed algorithm in comparison with commonly used SS techniques, speech recognition experiments were carried out on speech data recorded in a real noisy office environment. The experiments were specifically set up to generate a worst-case scenario of combined interfering point-source and background noise, to illustrate the potential of the robustness scheme in a complex, real-life situation.
In this experiment, we used an isolated command recognition task trained with clean isolated commands and tested with noisy data captured by a microphone placed 2 m away from the speaker. We collected the training dataset using a close-talking microphone in a quiet office with 16 female and 32 male talkers; each uttered 30 commands, such as turning on/off or opening/closing different devices in an office. We gathered the test data in the office environment depicted in Figure 6. For the test set, 22 male and 11 female talkers, different from those in the training dataset, uttered commands at a 2 m distance from the microphone. Room dimensions were 4.5 m × 3.5 m × 3.5 m, which resulted in a reverberation time of approximately 300 milliseconds (T60 ≈ 0.3 s). There were several sources of noise, namely 3 computers and a loudspeaker propagating office noise from the NOISEX database at a 40-degree angle to the wall. The average SNR of the test set was 15 dB. We partitioned this test set into two sets, and MFCCs were calculated. Speech recognition was performed using the Nevisa system in isolated command recognition mode. From these experiments, the following deductions can be made: (i) each approach is able to improve the robustness of the system; (ii) MLBSS combined with CMN yields the highest robustness to noise among the approaches investigated; (iii) while the robustness of the MLBSS approach alone is slightly inferior to that of KLMBSS, it yields better performance when combined with CMN.

Summary
In this paper, we have proposed a likelihood-maximizing multiband spectral subtraction (MLBSS) algorithm, a new approach for noise-robust speech recognition that integrates MBSS and likelihood-maximizing schemes. In this algorithm, the SS parameters are jointly optimized based on feedback information from a speech recognizer. Therefore, speech signals processed using the proposed algorithm are recognized more accurately than those processed with conventional SS methods. Overall, the main advantage of the proposed algorithm is that the SS parameters are adapted based on a criterion much more correlated with the speech recognition objective than the SNR criterion commonly used in practice. The proposed algorithm has been tested and compared to classical SS algorithms using various noise types and SNR levels. Experimental results show that the proposed algorithm leads to considerable recognition rate improvements. Hence, we can conclude that using feedback information from a speech recognizer in the front-end enhancement process can yield significant improvements over classical enhancement methods.
In future work, we plan to evaluate discriminative criteria in place of likelihood-maximizing schemes. Another possible extension of this work is the utilization of the uncertainty associated with the enhanced features through an uncertainty decoding approach.