Reverberant speech recognition exploiting clarity index estimation

We present single-channel approaches to robust automatic speech recognition (ASR) in reverberant environments based on non-intrusive estimation of the clarity index (C50). Our best performing method includes the estimated value of C50 in the ASR feature vector and also uses C50 to select the most suitable ASR acoustic model according to the reverberation level. We evaluate our method on the REVERB Challenge database employing two different C50 estimators and show that our method outperforms the best baseline of the challenge achieved without unsupervised acoustic model adaptation, i.e. using multi-condition hidden Markov models (HMMs). Our approach achieves a 22.4 % relative word error rate reduction in comparison to the best baseline of the challenge.


Introduction
Automatic speech recognition (ASR) is increasingly being used as a tool for a wide range of applications in diverse acoustic conditions (e.g. health care transcriptions, automatic translation, voicemail-to-text, and voice interface for command and control). Of particular importance is distant speech recognition, where the user can interact with a device placed at some distance away. Distant speech recognition is essential for natural and comfortable human-machine voice interfaces such as those used in, for example, the automotive sector and smartphone mobile applications.

Signal model
In a distant-talking scenario, reverberation causes a significant degradation in ASR performance. A reverberant sound is created in enclosed spaces by reflections from surfaces which create a multipath sound propagation from the source to the receiver. This effect varies with the acoustic properties of the room and the source-receiver distance, and it is characterized by the room impulse response (RIR). The reverberant signal y(n) can be modelled as the convolution between the RIR h(m) and the source signal s(n) as follows:

$$y(n) = \sum_{m=0}^{\infty} h(m)\, s(n-m) \tag{1}$$

Typical RIRs can be divided into three different parts: the direct path, the early reflections corresponding to the first 50 ms after the direct path, and the late reverberation corresponding to reflections that are delayed more than 50 ms after the direct path. Early reflections cause spectral colouration of the signal, whereas late reverberation causes temporal smearing and characteristic ringing echoes of the signal [1].
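The convolution model described above can be illustrated with a minimal sketch. The toy RIR (a unit direct path followed by exponentially decaying noise), the 16 kHz sampling rate and the function name `reverberate` are assumptions made for illustration only; they are not taken from the REVERB Challenge data.

```python
import numpy as np

def reverberate(s, h):
    """Convolve source s(n) with RIR h(m); keep the first len(s) samples."""
    return np.convolve(s, h)[: len(s)]

fs = 16000                                  # sampling rate in Hz (assumed)
rng = np.random.default_rng(0)
n = np.arange(int(0.3 * fs))                # 300 ms toy RIR (assumed shape)
h = 0.1 * rng.standard_normal(n.size) * np.exp(-n / (0.05 * fs))
h[0] = 1.0                                  # direct path
s = rng.standard_normal(fs)                 # 1 s of noise standing in for speech
y = reverberate(s, h)                       # reverberant signal y(n)
```

Truncating the output to the source length is a convenience choice here; the full convolution is `len(s) + len(h) - 1` samples long.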

Room acoustic measures
Several acoustic measures have been proposed that estimate the reverberation level present in a signal [2] by using the RIR h(m) or the source s(n) and received signal y(n), but in many applications, the only information available is the received signal y(n). Recently, methods have been proposed to estimate room acoustic measures from reverberant signals such as the reverberation time (T60) [3][4][5] which characterizes the room acoustic properties. However, alternative measures have been shown to be more correlated with ASR performance such as clarity index (C50) which is the ratio of the energy in the early reflections over the energy in late reflections [6], defined as

$$C_{50} = 10 \log_{10} \left( \frac{\sum_{m=0}^{N_\tau} h^2(m)}{\sum_{m=N_\tau+1}^{\infty} h^2(m)} \right) \ \text{dB} \tag{2}$$

where N_τ is an integer number of samples corresponding to 50 ms after the time of arrival of the direct path. Such measures have been shown to predict ASR performance with significant reliability [7,8] compared to other measures of reverberation. Moreover, different values of N_τ have been investigated in [7] showing that the number of samples N_τ corresponding to the range from 50 to 100 ms after the direct path provides the highest correlation values with the ASR performance.
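As a rough illustration of the clarity index, the sketch below computes the early-to-late energy ratio directly from a known RIR. Locating the direct path at the largest-magnitude tap is an assumption of this sketch, and the function name `clarity_index` is hypothetical; the estimators used in this work are non-intrusive and do not have access to h(m).

```python
import numpy as np

def clarity_index(h, fs, early_ms=50.0):
    """Early-to-late energy ratio of an RIR in dB (C50 when early_ms = 50)."""
    h = np.asarray(h, dtype=float)
    n0 = int(np.argmax(np.abs(h)))                  # direct path: strongest tap (assumption)
    n_tau = n0 + int(round(early_ms * 1e-3 * fs))   # last sample counted as early
    early = np.sum(h[: n_tau + 1] ** 2)
    late = np.sum(h[n_tau + 1 :] ** 2)
    return 10.0 * np.log10(early / late)

# Toy RIR: unit direct path plus one late reflection of amplitude 0.1 at 62.5 ms.
h = np.zeros(2000)
h[0] = 1.0
h[1000] = 0.1
c50 = clarity_index(h, fs=16000)                    # about 20 dB
```

Setting `early_ms` to values between 50 and 100 gives the alternative boundaries investigated in [7].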

Distant-talking ASR
ASR techniques robust to reverberation can be divided into two main groups [9][10][11]: front-end-based and back-end-based. The former approach suppresses the reverberation in the feature domain; therefore, the processing is performed after feature extraction. Li et al. [12] propose training a joint sparse transformation to estimate the clean feature vector from the reverberant feature vector. In [13], a model of the noise is estimated from observed data by considering the late reverberation as additive noise, and then the feature vector is enhanced by applying vector Taylor series. A feature transformation based on a discriminative training criterion inspired by maximum mutual information is suggested in [14]. Additional features related to the amount of diffuse noise in each frequency bin and frame are employed in [15] to improve deep neural network-based ASR accuracy in noisy and reverberant environments. Yoshioka and Gales [16] present several front-end approaches, such as feature transformation or feature set expansion, that are tailored to deep neural network acoustic models employed for distant-talking recognition. The latter approach, back-end-based, modifies the acoustic models or the observation probability estimate to suppress the reverberation effect. Sehr et al. [17] suggest adapting the output probability density function of the clean speech acoustic model to the reverberant condition in the decoding stage. Selection of different acoustic models trained for specific reverberant conditions using an estimation of T60 is proposed in [18]. In [19], late reverberation is attenuated, as in [20], to build several reverberant acoustic models which are selected using the ground truth T60. The RIR attenuation parameters are tuned to provide the highest recognition rate on a reverberant test set created with measured RIRs.
The early-to-late reverberation ratio, considering the first 110 ms of the RIR as part of the early reverberation, is used in [21] instead of T60 to select between different reverberant acoustic models. In [22], the likelihood scores of the ASR acoustic models based on Gaussian mixture models are maximized to select the optimum acoustic model. An adaptation of multiple reverberant acoustic models trained with different T60 values is proposed in [23]. The mean vector of the optimal adapted model is estimated in a maximum-likelihood sense from the reverberant models. The idea in [24] is to add to the current state the contribution of previous acoustic model states using a piece-wise energy decay curve which considers the early reflections and late reverberation as different contributions.
In addition to front-end-based and back-end-based approaches, signal-based methods are intended to dereverberate the acoustic signal in the time domain, before it is processed by the ASR feature extraction module [2]. In [25], a complementary Wiener filter is proposed to compute suitable spectral gains which are applied to the reverberant signal to suppress late reverberation. In [26], a denoising autoencoder is used to clean a window of spectral frames and then overlapping frames are averaged and transformed to the feature space. These three approaches may be combined to create complex robust systems [27,28].
Additionally, ASR techniques robust to reverberation can also be classified according to the number of microphones used to capture the signal, that is, single-channel methods [13,20,26,29] or multi-channel techniques [12,27,30,31].
The method proposed here is a hybrid approach based on front-end-based and back-end-based single-channel techniques. The C50 estimate is employed to select different acoustic models (back-end approach) which are trained on feature vectors augmented with the C50 value (front-end approach). The resulting augmented feature vector is then reduced in dimension to match the original dimensionality by applying heteroscedastic linear discriminant analysis (HLDA) [32]. The technique was tested within the ASR task of the REVERB Challenge [33] which was launched by the IEEE Audio and Acoustic Signal Processing Technical Committee in order to compare ASR performance on a common data set of reverberant speech. This paper extends an earlier version of the work presented in [34] by including an improved C50 estimator, which provides estimates per frame, and a performance comparison of the new system with the previous method.
The remainder of this paper is organized as follows: Section 2 introduces the C50 estimators employed in this work. In Section 3, the training and test data from the REVERB Challenge are analysed. Section 4 describes the methods proposed, and Section 5 discusses the comparative performance of these techniques. Finally, in Section 6, the conclusions are drawn.

C50 estimator
Two different single-channel C50 estimators are employed in this work: non-intrusive room acoustic estimation using classification and regression trees (NIRA-CART) and non-intrusive room acoustic estimation using bidirectional long short-term memory (NIRA-BLSTM). In this work, we use C50 to characterize reverberation in the signal instead of T60 as in [18] because T60 is independent of the source-receiver distance, which is a key factor in the speech degradation. Moreover, C50 was shown to be highly correlated with the ASR performance compared to other measures of reverberation [7,8] which makes it suitable for this purpose.

NIRA-CART
This method in [7] computes a set of features from the signal which can be divided into long-term features and frame-based features. The former features are taken from long-term average speech spectrum (LTASS) deviation by mapping it into 16 bins with equal bandwidth and additionally from the slope of the unwrapped Hilbert transformation. The latter features are created with pitch period, importance weighted signal-to-noise ratio (iSNR), zero-crossing rate, variance and dynamic range of Hilbert envelope and speech variance. In addition, spectral centroid, spectral dynamics and spectral flatness of the power spectrum of long-term deviation (PLD) are included in the feature vector as well as 12th-order mel-frequency cepstral coefficients (MFCCs) with delta and delta-delta and line spectrum frequency (LSF) features computed by mapping the first 10 linear predictive coding (LPC) coefficients to LSF representation.
The first-order numerical difference is used to compute the rate of change for all frame-based features, excluding the MFCCs. The complete feature vector is created by adding to the long-term features the mean, variance, skewness and kurtosis of all frame-based features, thereby creating a 313-element vector. Finally, a CART regression tree [35] is built to estimate C50. The CART uses the complete feature vector, and it is trained on the training set from the REVERB Challenge.
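The step from per-frame features to an utterance-level vector can be sketched as follows. This is an illustrative simplification under stated assumptions: deltas are appended to every feature (the method above excludes the MFCCs), the kurtosis is the non-excess form, and the function names `with_deltas` and `utterance_statistics` are hypothetical.

```python
import numpy as np

def with_deltas(frames):
    """Append first-order differences along time to a (T, D) feature matrix."""
    delta = np.diff(frames, axis=0, prepend=frames[:1])   # first delta row is zero
    return np.hstack([frames, delta])

def utterance_statistics(frames):
    """Mean, variance, skewness and kurtosis of each column of a (T, D) matrix."""
    mu = frames.mean(axis=0)
    centred = frames - mu
    var = (centred ** 2).mean(axis=0)
    std = np.sqrt(var)
    skew = (centred ** 3).mean(axis=0) / std ** 3
    kurt = (centred ** 4).mean(axis=0) / std ** 4          # non-excess kurtosis (assumption)
    return np.concatenate([mu, var, skew, kurt])

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 5))                     # toy frame-based features
vector = utterance_statistics(with_deltas(frames))         # 4 statistics x 10 columns
```

In the full method, the long-term features would be concatenated with this statistics vector before training the regression tree.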

NIRA-BLSTM
The feature configuration of this method (P Peso Parada, D Sharma, J Lainez, D Barreda, P A Naylor, T van Waterschoot, A single-channel non-intrusive C50 estimator with application to reverberant speech recognition, submitted) is based on computing the frame-based features of NIRA-CART and including in addition 12 features extracted from the modulation domain. Consequently, the per-frame feature vector comprises a total of 94 features. Moreover, rather than building a CART model to estimate C50, a particular recurrent neural network architecture called BLSTM [36] is trained with these features to provide an estimate every 10 ms. Since REVERB Challenge data assumes that the room acoustic properties remain unchanged within each utterance, only the temporal average of all per-frame estimates for each utterance is considered.

Wide-band feature set extension
In [7] and (P Peso Parada, D Sharma, J Lainez, D Barreda, P A Naylor, T van Waterschoot, A single-channel non-intrusive C50 estimator with application to reverberant speech recognition, submitted), these estimators were originally proposed to operate on speech signals sampled at 8 kHz. Therefore, an adaptation of the features has been developed here in order to process wider bandwidth signals. For speech signals sampled at 16 kHz, 10 LPC coefficients and their corresponding LSFs are not sufficient to characterize the speech [37]. For wide-band speech, therefore, the order of the LPC is increased to 20. Hence, the NIRA-CART feature vector comprises 393 elements, and NIRA-BLSTM uses 106 features per frame.

Analysis of the challenge data
The database provided in the REVERB Challenge comprises three different sets of eight-channel recordings: training set, development set and evaluation set. Real data recorded in a reverberant room and simulated data created by convolving non-reverberant utterances with measured RIRs are included in the development set and evaluation set, whereas the training set only comprises simulated data. This section analyses the RIRs of the different data sets in terms of C50 inasmuch as this is a key aspect in the design of the algorithms proposed in this work. Figure 1 shows the histogram of C50 values for the 24 training RIRs including all channels of each response. As seen in Fig. 1, the RIR training set covers a wide range of C50 spanning approximately 25 dB. These RIRs are used to create the data set employed to train our C50 estimator [7] by convolving them with speech signals from the training set which, for the REVERB Challenge, was formed from the WSJCAM0 training set [38]. Table 1 presents the measured C50 of the RIRs included in the development and evaluation sets of simulated data. It shows a significant difference between the small room recordings (Room1), which are less reverberant (T60 = 0.25 s), and the medium and large room recordings (Room2 and Room3, respectively), which have higher reverberation times (T60 = 0.5 s and T60 = 0.7 s, respectively). Furthermore, the two distances of the speaker from the microphone, that is, near (50 cm) and far (200 cm), show a consistent C50 difference of 8 to 10 dB.
Real recordings are captured in a reverberant meeting room from two different distances: near (≈100 cm) and far (≈250 cm). The development and evaluation sets of these recordings are not analysed in terms of measured C50 since the RIRs of these sets are unavailable.

C50 estimator performance
The evaluation metric used to compare the C50 estimator performance is the root-mean-square deviation (RMSD) given as

$$\text{RMSD} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left( \hat{C}_{50,n} - C_{50,n} \right)^2} \ \text{dB} \tag{3}$$

where N is the total number of measured ground truth values C_{50,n} and estimated scores Ĉ_{50,n} considered to compute the RMSD. The training set is randomly split into a training subset (80 % of the data used to train the models) and evaluation subset (20 % of the recordings employed to evaluate the models) in order to provide insights into the performance of both C50 estimators. Additionally, the performance of the C50 estimators is also evaluated using the development set and evaluation set of the simulated data whose C50 measures are presented in Table 1. Table 2 summarizes the RMSD performance of each estimator evaluated in these data sets. NIRA-BLSTM achieves the lowest deviation in each data set, providing on average an RMSD 1.6 dB lower than that of NIRA-CART. Both estimators exhibit lower deviations on the evaluation subset of the training set (i.e. training set - eval. subset) because this reverberant subset is similar to the data used for training the C50 estimators.
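The RMSD between measured and estimated C50 values can be computed in a few lines. The sketch below is only illustrative; the function name `rmsd` and the toy values are assumptions and do not correspond to entries in Table 2.

```python
import numpy as np

def rmsd(c50_true_db, c50_est_db):
    """Root-mean-square deviation in dB between ground truth and estimated C50."""
    true = np.asarray(c50_true_db, dtype=float)
    est = np.asarray(c50_est_db, dtype=float)
    return float(np.sqrt(np.mean((est - true) ** 2)))

# Hypothetical ground-truth and estimated values for four utterances.
truth = [5.0, 12.0, 18.0, 25.0]
estimates = [7.0, 10.0, 20.0, 23.0]
deviation = rmsd(truth, estimates)   # every error is 2 dB in magnitude
```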

Methods
In this section, we describe different configurations for reverberant speech recognition. The idea underpinning these methods is to exploit the estimated C50 to improve robustness of ASR to reverberation. Section 4.1 introduces the front-end techniques, Section 4.2 describes the back-end methods, and finally, Section 4.3 presents the combination as outlined in Fig. 2.

C50 as a supplementary feature in ASR
In this approach, the estimated C50 of the utterance is included as an additional feature in the ASR feature vector. The baseline recognition system uses a feature vector with 13 mel-frequency cepstral coefficients, with the first and second derivatives of these coefficients followed by cepstral mean subtraction.
We now propose two alternative improved configurations. The first configuration proposed (C50FV) is to add the C50 estimate directly to this feature vector. Therefore, the modified feature vector comprises 40 elements.
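The C50FV idea of extending each 39-dimensional frame with the utterance-level C50 estimate can be sketched as below. The function name `append_c50` and the toy values are assumptions for illustration; the actual system operates on HTK feature files.

```python
import numpy as np

def append_c50(features, c50_db):
    """Append one utterance-level C50 value (dB) as an extra column to every frame."""
    column = np.full((features.shape[0], 1), c50_db, dtype=features.dtype)
    return np.hstack([features, column])

mfcc = np.zeros((200, 39))               # toy utterance: 200 frames of 13 MFCCs + deltas
augmented = append_c50(mfcc, 12.5)       # hypothetical C50 estimate of 12.5 dB
```

Because the same value is repeated in every frame of an utterance, the new dimension carries information about the acoustic condition rather than about the phonetic content.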
In the second configuration (C50HLDA), the feature vector dimension is reduced using linear discriminant analysis (LDA) [39]. This method projects the input feature vector x_k onto a new space y_k by applying a linear transformation W such that

$$y_k = W^{T} x_k \tag{4}$$

where W is a p × q matrix. This transformation in general retains the class discrimination in the transformed feature space. The transformation W is obtained by maximizing the ratio of the between-class scatter matrix S_B to the within-class scatter matrix S_W, that is,

$$\hat{W} = \arg\max_{W} \frac{\left| W^{T} S_B W \right|}{\left| W^{T} S_W W \right|} \tag{5}$$

The projection that maximizes (5) corresponds to Ŵ whose columns are the eigenvectors of S_W^{-1} S_B with the q highest eigenvalues, so that q is the dimension of the reduced feature space.
In this work, a model-based generalization of LDA [32] is used. In this case, the linear transformation is estimated from Gaussian models using the expectation-maximization algorithm. For these models, it is assumed that dimensions in which the class distributions have equal mean and variance across all classes do not contain discriminant classification information.
In all configurations, the acoustic models are trained using the modified feature space.
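To make the projection concrete, the following sketch implements classical LDA through the eigenvectors of S_W^{-1} S_B. It is not the model-based HLDA used in this work, which is estimated with expectation-maximization; the function name `lda_transform` and the toy two-class data are assumptions for illustration.

```python
import numpy as np

def lda_transform(X, labels, q):
    """Return the p x q matrix whose columns are the top-q eigenvectors of S_W^{-1} S_B."""
    p = X.shape[1]
    mean_all = X.mean(axis=0)
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu = Xc.mean(axis=0)
        S_W += (Xc - mu).T @ (Xc - mu)              # within-class scatter
        d = (mu - mean_all)[:, None]
        S_B += len(Xc) * (d @ d.T)                  # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1][:q]      # q largest eigenvalues
    return eigvecs[:, order].real

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 3))
labels = np.repeat([0, 1], 200)
X[labels == 1, 0] += 5.0                            # classes differ along dimension 0 only
W = lda_transform(X, labels, q=1)
Y = X @ W                                           # row-wise form of y_k = W^T x_k
```

In the C50HLDA configuration, the input would be the 40-dimensional augmented vector and q would be 39, so that the output matches the original dimensionality.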

Model selection
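The proposed back-end approach aims to select the optimal acoustic model Â such that

$$\hat{A} = \begin{cases} A_1, & C_{50} \le \theta_1 \\ A_j, & \theta_{j-1} < C_{50} \le \theta_j, \quad j = 2, \dots, J-1 \\ A_J, & C_{50} > \theta_{J-1} \end{cases} \tag{6}$$

where J represents the number of available acoustic models A = {A_1, A_2, ..., A_J} and θ = {θ_1, θ_2, ..., θ_{J−1}} is the vector with the C50 threshold values sorted in ascending order.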
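The threshold-based selection can be sketched with a binary search over the sorted thresholds. The function name `select_model` and the string model labels are hypothetical, and assigning an estimate that falls exactly on a threshold to the lower model is an assumption of this sketch.

```python
from bisect import bisect_left

def select_model(c50_db, models, thresholds):
    """Pick models[j] such that thresholds[j-1] < c50_db <= thresholds[j]."""
    if len(models) != len(thresholds) + 1:
        raise ValueError("need exactly one more model than thresholds")
    return models[bisect_left(thresholds, c50_db)]

# Three models with thresholds at 10 and 20 dB, ordered from most to least reverberant.
models = ["A1", "A2", "A3"]
thresholds = [10.0, 20.0]
chosen = [select_model(c, models, thresholds) for c in (5.0, 10.0, 15.0, 25.0)]
```

The same function covers a two-model setup by passing two models and a single threshold.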

Model switching between REVERB Challenge acoustic models
The first configuration (Clean&Multi cond.) is based on selecting between the two acoustic models provided in the challenge (clean-condition hidden Markov models (HMMs) and multi-condition HMMs) according to the level of C50 estimated from the input signal. In this case, A_1 represents the multi-condition HMMs and A_2 the clean-condition HMMs. By empirical optimization over the development data set and considering the analysis carried out in Section 3, we choose the model switching threshold θ_1 = 23 dB. Therefore, input speech signals with estimated C50 higher than 23 dB are recognized using clean-condition HMMs, whereas signals with C50 lower than this threshold are recognized using multi-condition HMMs.

Model switching using newly trained acoustic models
The second and subsequent configurations are based on training new reverberant acoustic models. The data set used to train the models is always the clean training set convolved with the training RIRs (Fig. 1). In order to include in the trained models A all representative data of the acoustic units (i.e. triphones), all L clean training utterances are convolved with a subset of M training RIRs to create a reverberant acoustic model A_i such that

$$y_l(n) = \sum_{m} H_i(l \bmod M,\, m)\, s_l(n-m), \quad l = 1, \dots, L \tag{7}$$

where y_l is the reverberant speech obtained with the clean utterance s_l and the RIR in the row (l mod M) of the matrix H_i. This matrix contains the M RIRs with a C50 value that satisfies θ_{i−1} < C50 ≤ θ_i.

The first approach is to create three reverberant acoustic models (MS3) according to the C50 values of the RIRs as shown in Fig. 3a. The threshold vector is set to θ = {10, 20} dB, which was derived from the C50 estimates of the development set. The aim is to cluster the development set into three groups with similar ASR performance and train a model for each group. The most reverberant model A_1 is trained with the RIRs that have C50 lower than 10 dB. The second acoustic model A_2 is trained with RIRs that have C50 between 10 and 20 dB. Finally, the third model A_3, which represents the least reverberant conditions, is trained with those RIRs with a C50 higher than 20 dB.

The next configuration (MS5) uses classes with overlapping ranges of C50 to build the acoustic models. For each class, the overlapping range of C50 used was approximately 50 % of the size of the neighbouring class. This configuration retains the MS3 models and adds two models spanning the transitional ranges of C50, which provide a smoother transition between acoustic models. The acoustic model most representative of the reverberation level estimated from the utterance is selected in the recognition phase.
Figure 3b shows the construction of MS5 during training (red bars) and the thresholds used to select models in the recognition stage (green bars).
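The construction of the training data for each reverberant model can be sketched as a two-step plan: pick the RIRs whose C50 falls in the range of a model, then cycle through them with the (l mod M) rule. The C50 values, utterance count and function names below are hypothetical, and zero-based utterance indices are used for convenience.

```python
import math

def rirs_in_range(rir_c50_db, lower, upper):
    """Indices of RIRs whose C50 satisfies lower < C50 <= upper."""
    return [i for i, c in enumerate(rir_c50_db) if lower < c <= upper]

def pair_utterances(num_utterances, rir_indices):
    """Assign utterance l to the RIR at position (l mod M) of the subset."""
    m = len(rir_indices)
    return [(l, rir_indices[l % m]) for l in range(num_utterances)]

rir_c50_db = [4.0, 8.5, 12.0, 17.5, 22.0, 26.0]    # hypothetical measured C50 per RIR
thresholds = [10.0, 20.0]                          # MS3 thresholds in dB
edges = [-math.inf] + thresholds + [math.inf]

training_plan = {}
for i in range(len(edges) - 1):
    subset = rirs_in_range(rir_c50_db, edges[i], edges[i + 1])
    training_plan[f"A{i + 1}"] = pair_utterances(5, subset)   # 5 toy utterances
```

Each (utterance, RIR) pair would then be convolved to produce the reverberant training data of the corresponding model; overlapping configurations such as MS5 would use overlapping `lower` and `upper` bounds.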
Additional configurations were tested by increasing the number of models trained: 8 overlapped acoustic models (MS8), 11 overlapped acoustic models (MS11), 14 overlapped acoustic models (MS14) and 18 overlapped acoustic models (MS18). These models are obtained by further dividing the original MS3 configuration. Increasing the number of models narrows the C50 range of the training data of each model, which creates acoustic models more specific to each reverberant condition. Figure 4 shows the ranges of C50 used for MS11.

Model selection including C50 in the feature vector
This method combines the two approaches described above: C50HLDA and model selection. Figure 2 shows the block diagram of this method, where green modules represent the modifications included to design this method. Firstly, C50 is estimated from the speech signal. The C50 estimate is then included in the feature vector before applying the HLDA transformation and is also used to select the most suitable acoustic model.
All the tested configurations employ the C50 thresholds as described in Section 4.2 to create the data to train the acoustic models and select the appropriate acoustic model in the recognition stage. These configurations are referred to as MSN+C50HLDA, where N represents the number of acoustic models created.

Results and discussion
The methods described in Section 4 were tested using NIRA-CART and NIRA-BLSTM to estimate C50, and we compare the performance of each method in terms of the word error rate (WER) obtained using the REVERB Challenge ASR task [33]. The ASR evaluation tool is based on the hidden Markov model tool kit (HTK) provided by the REVERB Challenge. It uses mel-frequency cepstral coefficient (MFCC) features including delta and delta-delta coefficients and tied-state HMM acoustic models with 10 Gaussian components per state for clean-condition models and 12 Gaussian components per state for multi-condition models. Table 3 shows the average WER achieved with the non-reverberant recordings (Clean), simulated reverberant recordings (Sim.) and real reverberant recordings (Real) of the REVERB Challenge evaluation test set including the average of all subsets in the last column, while Tables 4 and 5 show these results in more detail for each scenario. Moreover, Fig. 5 summarizes these results, displaying the average WER for the development test set and evaluation test set. Baseline methods are also tested in order to compare the performance. The baseline methods consist of decoding the data using the two acoustic models provided in the REVERB Challenge: the acoustic model trained with non-reverberant data (Clean-cond.) and the acoustic model trained with reverberant data (Multi-cond.). The performance of these baselines is shown in the first two rows of Tables 3, 4 and 5. Clean-cond. models provide a better performance in non-reverberant environments, whereas Multi-cond. models provide a significant reduction in WER for reverberant environments.

C50 as a new feature
The C50FV method provides a similar performance compared with the baselines. This outcome is due to the fact that we are using a diagonal covariance matrix to build the acoustic model. Therefore, this feature only provides information regarding the probability of observing the acoustic unit in this reverberant environment, without taking into account possible dependencies on the MFCCs.
On the other hand, the last method described in Section 4.1 (C50HLDA) achieves, on average, a lower WER than the baselines. The main reason for this result is the use of the discriminative transformation matrix to combine the feature space. Regarding the C50 estimator employed, NIRA-BLSTM provides similar WER to that obtained with NIRA-CART for this configuration. This small performance difference suggests that C50HLDA does not strongly depend on the accuracy of the estimates. Furthermore, the averaged WER obtained by applying HLDA to the feature space without the C50 feature is 32.20 %. This result supports the previous suggestion that C50HLDA performance depends only weakly on C50 estimation accuracy and moreover indicates that the improvement achieved with C50HLDA is mainly due to the HLDA transformation.

Model selection
Tables 3, 4 and 5 also display the performance obtained with the methods described in Section 4.2 based on model selection. First, they show that a considerable WER reduction with respect to the baseline is achieved by employing the two acoustic models provided by the REVERB Challenge and exploiting our estimate of C50 to select the most appropriate of them for each utterance (i.e. Clean&Multi cond.). Further improvement is achieved by training more reverberant models. The MS3 configuration employs three reverberant models (Fig. 3a) and the performance in reverberant conditions is improved in most situations, but on average, the error rate increases with respect to Clean&Multi cond., mainly due to the poor performance in clean environments. The performance of this configuration is slightly improved, from WER = 30.82 % to WER = 30.35 % in the evaluation set, by overlapping the training data to build the acoustic models (MS5). Increasing the number of models trained using overlapping ranges of C50 (i.e. MS8, MS11, MS14 and MS18) results in further WER reductions. For these experiments, the best performance is obtained with MS8 using the NIRA-CART C50 estimator (WER = 29.8 %), whereas NIRA-BLSTM provides the lowest WER with MS14 (WER = 28.7 %). This is due to the fact that NIRA-BLSTM achieves more accurate C50 estimates than NIRA-CART; hence, it is able to select acoustic models trained with a narrower, and therefore better matched, C50 range.

[Table caption: The first two rows correspond to the baseline methods, and the remainder are the methods proposed in this work. R1, R2 and R3 represent room numbers 1, 2 and 3, respectively. Best performance results in each column are shown in italics.]

Model selection including C50 in the feature vector
The performance of the full system presented in Fig. 2 is now discussed. A significant improvement is observed by combining both methods: with NIRA-CART and NIRA-BLSTM, the full system outperforms the best baseline method (Multi-cond.) by 6.9 % and 7.8 % absolute, respectively, in the evaluation set, the latter corresponding to WER = 26.9 %. Tables 3, 4 and 5 highlight in italics the lowest WER obtained in each data set. The best performance in reverberant conditions is achieved with this full system (i.e. MSN+C50HLDA); however, Clean&Multi cond. shows the best performance in the non-reverberant condition. This is mainly because all the data used to train MSN+C50HLDA is reverberant data, while Clean&Multi cond. uses reverberant and clean data to train the acoustic models. Therefore, MSN+C50HLDA could be further improved by including a clean acoustic model to recognize the non-reverberant data.
All these reverberant speech recognition approaches were investigated in the previous work [34] using NIRA-CART. Figure 5 shows that using a more accurate C50 estimator, i.e. NIRA-BLSTM, the WER is further reduced.
The method proposed in Fig. 2 may potentially be complementary to some other reverberation-robust speech recognition methods, such as applying speaker adaptation, acoustic model adaptation or preprocessing schemes (e.g. beamforming) [40]. For example, performing an unsupervised acoustic model adaptation using constrained maximum likelihood linear regression (CMLLR) with the best method proposed in this work (MS14+C50HLDA using NIRA-BLSTM), the average WER is further reduced to 24.34 %, that is, a relative word error rate reduction (WERR) of 9.88 % with respect to the best baseline of the REVERB Challenge using CMLLR.

Conclusions
Various approaches for single-channel reverberant speech recognition using clarity index (C50) estimation have been presented. One approach investigated was to include C50 estimated from two different estimators (NIRA-CART and NIRA-BLSTM) as an additional feature in the ASR system and apply a dimensionality reduction technique (i.e. HLDA) to match the original feature vector dimension. This approach improved on the ASR performance of the best baseline by a relative word error rate reduction (WERR) of 7.35 % for both NIRA-CART and NIRA-BLSTM. This improvement was shown to be in significant part due to the HLDA transformation. Another approach was to use the C50 information to perform acoustic model selection, which in turn gave a relative WERR of 14.04 % with NIRA-CART and 17.07 % with NIRA-BLSTM. The best performance was achieved by combining both approaches and using NIRA-BLSTM, leading to a relative WERR of 22.41 % (7.77 % absolute WERR). It is worth noting that only data from the REVERB Challenge data sets was used to train all the models employed in the system (including the C50 estimator); furthermore, the method presented is complementary to other techniques such as CMLLR, and an example combination was shown to further improve the best performance, increasing the relative WERR to 29.8 %.
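The relation between the relative and absolute figures quoted above can be checked with simple arithmetic. In the sketch below, the baseline WER is not quoted directly but inferred from the stated relative (22.41 %) and absolute (7.77 %) reductions, so it should be read as an approximate, derived value; the function name `relative_werr` is hypothetical.

```python
def relative_werr(wer_baseline, wer_system):
    """Relative word error rate reduction in percent."""
    return 100.0 * (wer_baseline - wer_system) / wer_baseline

absolute_werr = 7.77                                  # stated absolute reduction (%)
stated_relative = 22.41                               # stated relative reduction (%)

baseline_wer = 100.0 * absolute_werr / stated_relative   # inferred, roughly 34.7 %
best_wer = baseline_wer - absolute_werr                   # WER of the best system
cmllr_relative = relative_werr(baseline_wer, 24.34)       # with CMLLR at 24.34 % WER
```

Under these assumptions, the inferred values agree with the 26.9 % WER and the 29.8 % relative reduction reported in the text.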
As expected, more accurate C50 estimates lead to a further reduction in the final WER. In the two algorithms exploited in this study, NIRA-BLSTM is more accurate than NIRA-CART by 1.6 dB RMSD, which results in a