A Bayesian view on acoustic model-based techniques for robust speech recognition
 Roland Maas^{1},
 Christian Huemmer^{1},
 Armin Sehr^{2} and
 Walter Kellermann^{1}
https://doi.org/10.1186/s13634-015-0287-x
© Maas et al. 2015
Received: 15 April 2015
Accepted: 12 November 2015
Published: 2 December 2015
Abstract
This article provides a unifying Bayesian view on various approaches for acoustic model adaptation, missing feature, and uncertainty decoding that are well-known in the literature of robust automatic speech recognition. The representatives of these classes can often be deduced from a Bayesian network that extends the conventional hidden Markov models used in speech recognition. These extensions, in turn, can in many cases be motivated from an underlying observation model that relates clean and distorted feature vectors. By identifying and converting the observation models into a Bayesian network representation, we formulate the corresponding compensation rules. We thus summarize the various approaches as approximations or modifications of the same Bayesian decoding rule, leading to a unified view on known derivations as well as to new formulations for certain approaches.
1 Introduction
Robust automatic speech recognition (ASR) still represents a challenging research topic. The main obstacle, namely the mismatch of test and training data, can be tackled by enhancing the observed speech signals or features in order to meet the training conditions or by compensating for the distorted test conditions in the acoustic model of the ASR system.
Methods that modify the acoustic model are in general termed (acoustic) model-based or model compensation approaches and comprise inter alia the following subcategories: so-called model adaptation techniques mostly update the parameters of the acoustic model, i.e., of the hidden Markov models (HMMs), prior to the decoding of a set of observed feature vectors. In contrast, decoder-based approaches readapt the HMM parameters for each observed feature vector. The most common decoder-based approaches are missing feature and uncertainty decoding that incorporate additional time-varying uncertainty information into the evaluation of the HMMs’ probability density functions (pdfs).
Various model compensation techniques exhibit two (more or less) distinct steps: First, the compensation parameters need to be estimated and, second, the actual compensation rule is applied to the acoustic model. The compensation rules can often be motivated based on an observation model that relates the clean and distorted feature vectors, e.g., in the logarithmic mel-spectral (logmelspec) or the mel frequency cepstral coefficient (MFCC) domain.
if the statistics of z_n depend only on time through w_n, i.e., if \(\boldsymbol {\mu }_{\mathbf {z}|\mathbf {w}_{n}} = \boldsymbol {\mu }_{\mathbf {z}|\mathbf {w}_{m}}\) and \(\mathbf {C}_{\mathbf {z}|\mathbf {w}_{n}} = \mathbf {C}_{\mathbf {z}|\mathbf {w}_{m}}\) for w_n = w_m and n,m∈{1,…,N}.
The remainder of the article is organized as follows: After summarizing the employed Bayesian view in Section 2 and its difference to other overview articles in Section 3, this perspective is applied to uncertainty decoding, missing feature techniques, and other model-based approaches in Sections 4, 5, and 6, respectively. In Section 7, we point out the relation of the presented techniques to deep learning-based architectures. Finally, conclusions are drawn in Section 8.
2 The Bayesian view
We start by reviewing the Bayesian perspective on acoustic model-based techniques that we use in Sections 4, 5, 6, and 7 to review different algorithms.
where p(q_1|q_0)=p(q_1). The summation goes over all possible state sequences q_{1:N} through W superseding the explicit dependency on w at the right-hand side of (4) and (5). Note that the pdf p(y_n|q_n) can be scaled by p(y_n) without influencing the discrimination capability of the acoustic score w.r.t. changing word sequences w. We thus define \(\mathring {p}(\mathbf {y}_{n}|q_{n}) = {p(\mathbf {y}_{n}|q_{n})}/{p(\mathbf {y}_{n})}\) for later use.
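The effect of this normalization can be checked with a small numerical sketch (the 1-D state pdfs and the stand-in for p(y_n) below are purely illustrative): dividing all state-conditional scores of a frame by the same factor leaves the ranking of the states, and hence the discrimination between word sequences, unchanged.

```python
import math

def gauss(y, mu, var):
    """1-D Gaussian pdf N(y; mu, var)."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Hypothetical 1-D emission pdfs p(y_n | q_n) for three HMM states.
states = {"q1": (0.0, 1.0), "q2": (2.0, 1.5), "q3": (4.0, 0.5)}
y_n = 1.2

likelihoods = {q: gauss(y_n, mu, var) for q, (mu, var) in states.items()}
p_y = sum(likelihoods.values()) / len(likelihoods)  # crude stand-in for p(y_n)

# Scaled scores p̊(y_n|q_n) = p(y_n|q_n) / p(y_n): the common factor cancels,
# so the ranking of the states is unchanged.
scaled = {q: l / p_y for q, l in likelihoods.items()}

assert max(likelihoods, key=likelihoods.get) == max(scaled, key=scaled.get)
```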
where the actual functional form of p(y_n|x_n) depends on the assumptions on g(·) and the statistics p(b_n) of b_n.
where δ(·) denotes the Dirac distribution. In contrast, missing feature and uncertainty decoding approaches typically assume p(b_n) to be a time-varying pdf [4, 27]. As exemplified in Sections 4 to 6, this Bayesian view also allows for a convenient illustration of the underlying statistical dependencies of model-based approaches by means of Bayesian networks. If two approaches share the same Bayesian network, their underlying joint pdfs over all involved random variables share the same decomposition properties. However, some crucial aspects are not reflected by a Bayesian network: the particular functional form of the joint pdf, potential approximations to arrive at a tractable algorithm, as well as the estimation procedure for the compensation parameters. While some approaches estimate these parameters through an acoustic front-end, others derive them from clean or distorted data. For clarity, we entirely focus in this article on the compensation rules while ignoring the parameter estimation step. We also disregard approaches that apply a modified training method to conventional HMMs without exhibiting a distinct compensation step, as is characteristic for, e.g., discriminative [28], multi-condition [29], or reverberant training [30].
3 Merit of the Bayesian view

First of all, we aim at classifying all considered techniques along the same dimension by motivating and describing them with the same Bayesian formalism. Consequently, we do not conceptually distinguish whether a given method employs a time-varying pdf p(b_n), as in uncertainty decoding, or whether a distorted vector y_n is a preprocessed or a genuinely noisy or reverberant observation. Also, the distinction between implicit and explicit observation models dissolves in our formalism.

As a second goal, we aim at closing some gaps by presenting new derivations and formulations for some of the considered techniques. For instance, the Bayesian networks in Figs. 2 b, 2 c, 4, 5 b, 5 c, 6 b, 8, 9 b representing the concepts in Subsections 4.3, 4.4, 4.6, 6.3/6.4, 6.5, 6.6, 6.8, and 6.9, respectively, constitute novel representations. Moreover, the links to the Bayesian framework via the mathematical reformulations in (28), (29), (37), (38), (45), (55), (61), (65), (71) are explicitly stated for the first time in this paper.

The third goal of the Bayesian description is to provide an intuitive graphical illustration that allows one to easily survey a broad class of algorithms and to immediately identify their similarities and differences in terms of the underlying statistical assumptions.
By establishing new links between existing concepts, such an abstract overview should therefore also serve as a basis for revealing and exploring new directions. Note, however, that the review presented in this paper does not claim to cover all relevant acoustic modelbased techniques and is rather meant as an inspiration to other researchers.
4 Uncertainty decoding
In the following, we consider the compensation rules of several uncertainty decoding techniques from a Bayesian view.
4.1 General example of uncertainty decoding
Without loss of generality, a single Gaussian pdf p(x _{ n }q _{ n }) is assumed since, in the case of a Gaussian mixture model (GMM), the linear mismatch function (10) can be applied to each Gaussian component separately.
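As a minimal numerical sketch of such a compensation rule, assume the simple additive observation model y_n = x_n + b_n with Gaussian noise b_n (an assumption for illustration; the mismatch function (10) in the text may differ). The compensated likelihood is then again Gaussian, with shifted mean and inflated variance, which the following 1-D code verifies against the underlying convolution integral:

```python
import math

def gauss(v, mu, var):
    """1-D Gaussian pdf N(v; mu, var)."""
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def compensated_likelihood(y, mu_x, var_x, mu_b, var_b):
    """p(y|q) = ∫ N(y; x + mu_b, var_b) N(x; mu_x, var_x) dx for y = x + b."""
    return gauss(y, mu_x + mu_b, var_x + var_b)

# Numerical check of the convolution integral (illustrative parameter values)
y, mu_x, var_x, mu_b, var_b = 1.0, 0.5, 2.0, 0.3, 0.7
dx = 0.001
num = sum(gauss(y - x, mu_b, var_b) * gauss(x, mu_x, var_x) * dx
          for x in [i * dx - 20 for i in range(40000)])
assert abs(num - compensated_likelihood(y, mu_x, var_x, mu_b, var_b)) < 1e-4
```

In the GMM case, the same shift-and-inflate rule is applied to each Gaussian component separately, as stated above.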
4.2 Dynamic variance compensation
where the approximation (13) can be justified if p(x_n) is assumed to be significantly “flatter,” i.e., of larger variance, than \(p(\mathbf {x}_{n}|\mathbf {y}_{n})\). The estimation of the moments \(\boldsymbol \mu _{\mathbf {x}|\mathbf {y}_{n}}\), \(\mathbf {C}_{\mathbf {x}|\mathbf {y}_{n}}\) of \(p(\mathbf {x}_{n}|\mathbf {y}_{n})\) represents the core of [2].
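In the scalar case, the resulting score is the Gaussian cross-correlation ∫ N(x; μ_{x|y_n}, C_{x|y_n}) N(x; μ_{x|q_n}, C_{x|q_n}) dx = N(μ_{x|y_n}; μ_{x|q_n}, C_{x|q_n} + C_{x|y_n}). A minimal sketch (all numbers illustrative), including a numerical check of the identity and of the flattening effect of a growing enhancement uncertainty:

```python
import math

def gauss(v, mu, var):
    """1-D Gaussian pdf N(v; mu, var)."""
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def dvc_score(mu_xy, var_xy, mu_xq, var_xq):
    """Dynamic variance compensation score N(mu_x|y; mu_x|q, C_x|q + C_x|y)."""
    return gauss(mu_xy, mu_xq, var_xq + var_xy)

# numerical check of the Gaussian cross-correlation identity
dx = 0.001
num = sum(gauss(x, 0.6, 0.4) * gauss(x, 0.0, 1.2) * dx
          for x in [i * dx - 10 for i in range(20000)])
assert abs(num - dvc_score(0.6, 0.4, 0.0, 1.2)) < 1e-4

# a larger enhancement uncertainty C_x|y flattens the state score
assert dvc_score(0.0, 4.0, 0.0, 1.0) < dvc_score(0.0, 0.1, 0.0, 1.0)
```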
4.3 Uncertainty decoding with SPLICE
Although analytically tractable, both the numerator and the denominator in (20) are typically approximated for the sake of runtime efficiency [3].
4.4 Joint uncertainty decoding
which can be analytically derived analogously to (11). In practice, the compensation parameters \(\mathbf {A}_{k_{n}}\), \(\boldsymbol {\mu }_{\mathbf {b}|k_{n}}\), and \(\mathbf {C}_{\mathbf {b}|k_{n}}\) are not estimated for each Gaussian component k_n but for each regression class comprising a set of Gaussian components [4].
4.5 REMOS
The determination of a global solution to (25) represents the core of the REMOS concept. The estimates \(\widehat {\mathbf {x}}_{n-L:n-1}(q_{n-1})\) in turn are the solutions to (25) at previous time steps. We refer to [5] for a detailed derivation of the corresponding decoding routine.
The obvious disadvantage of (29) is that the resulting score (25) does not represent a normalized likelihood w.r.t. y_n. On the other hand, the modified MAP approximation (29) leads to a scaled version of the exact likelihood p(y_n|q_n), cf. (27), with the scaling factor \(p(\mathbf {x}_{n}^{\text {MAP}}|\mathbf {y}_{n}, q_{n})\) increasing with the accuracy of the approximation (29).
4.6 Ion and Haeb-Umbach
that is given in [24]. Due to the approximations in Fig. 4 a, b, the compensation rule defined by (34) exhibits the same decoupling as (5) and can thus be carried out without modifying the underlying decoder. In practice, p(x_n) may, e.g., be modeled as a separate Gaussian density and p(x_n|y_{1:N}) as a separate Markov process [24].
4.7 Significance decoding
where the Bayesian network properties of Fig. 2 a have been exploited in the numerator and the denominator and a single Gaussian pdf \(p(\mathbf {x}_{n}|q_{n})\) is assumed without loss of generality. In a second step, the clean likelihood \(p(\mathbf {x}_{n}|q_{n})\) is evaluated at \(\boldsymbol {\mu }_{\mathbf {x}|\mathbf {y}_{n},q_{n}}\) after adding the variance \(\mathbf {C}_{\mathbf {x}|\mathbf {y}_{n},q_{n}}\) to \(\mathbf {C}_{\mathbf {x}|q_{n}}\), cf. (36).
In terms of probabilistic notation, this compensation rule corresponds to replacing the score calculation in (5) by an expected likelihood, similarly to [47, 48]:
A closer inspection of α reveals that the expected likelihood computation scales up p(y_n|q_n) for large values of \(\mathbf {C}_{\mathbf {b}_{n}}\), which acts as an alleviation of the (potentially overly) flattening effect of \(\mathbf {C}_{\mathbf {b}_{n}}\) on p(y_n|q_n), cf. (11).
5 Missing feature techniques
We next turn to missing feature techniques, which can be used to model feature distortion due to a front-end enhancement process [7], noise [49], or reverberation [50].
5.1 Feature vector imputation
with the general Bayesian network in Fig. 5 a. The approximation in (39) follows the same reasoning as (13).
5.2 Marginalization
5.3 Modified imputation
where (45) corresponds to evaluating the clean statedependent likelihood at \(\widehat {\mathbf {x}}_{n}\).
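The contrast between marginalization (Subsection 5.2) and evaluating the clean likelihood at a point estimate x̂_n (Subsections 5.1 and 5.3) can be sketched for a diagonal-Gaussian state pdf (reliability mask, estimate, and model parameters below are illustrative):

```python
import math

def gauss(v, mu, var):
    """1-D Gaussian pdf N(v; mu, var)."""
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def marginalized_score(y, mask, mu, var):
    """Evaluate a diagonal-Gaussian state pdf on the reliable components only."""
    score = 1.0
    for y_i, rel, m_i, v_i in zip(y, mask, mu, var):
        if rel:
            score *= gauss(y_i, m_i, v_i)
    return score

def imputed_score(y, mask, x_hat, mu, var):
    """Evaluate the full clean pdf with unreliable components replaced by x̂."""
    score = 1.0
    for y_i, rel, xh_i, m_i, v_i in zip(y, mask, x_hat, mu, var):
        score *= gauss(y_i if rel else xh_i, m_i, v_i)
    return score

y = [1.0, 3.5]          # observed feature vector
mask = [True, False]    # second component judged unreliable
x_hat = [1.0, 0.2]      # clean-speech estimate for the unreliable component
mu, var = [0.8, 0.0], [1.0, 1.0]

# marginalization simply drops the unreliable dimension
assert math.isclose(marginalized_score(y, mask, mu, var), gauss(1.0, 0.8, 1.0))
# imputation keeps the full dimensionality but substitutes the estimate
assert imputed_score(y, mask, x_hat, mu, var) > imputed_score(y, mask, [1.0, 3.5], mu, var)
```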
6 Acoustic model adaptation and other modelbased techniques
In the following, we consider the compensation rules of several acoustic model adaptation and other model-based approaches from a Bayesian view.
6.1 Parallel model combination
The overall acoustic score can be approximated by a 3D Viterbi decoder, which can in turn be mapped onto a conventional 2D Viterbi decoder [10].
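A widely used building block of PMC is the “log-normal approximation” [10] for combining a clean-speech and a noise Gaussian in the log spectral domain. The following 1-D sketch (independence assumed; all numbers illustrative) matches the moments of exp(x) + exp(n) with a log-normal distribution:

```python
import math

def lognormal_moments(mu, var):
    """Mean and variance of exp(X) for X ~ N(mu, var)."""
    m = math.exp(mu + var / 2)
    v = (math.exp(var) - 1) * math.exp(2 * mu + var)
    return m, v

def pmc_lognormal(mu_x, var_x, mu_n, var_n):
    """Log-normal PMC: Gaussian moments of y = log(exp(x) + exp(n))."""
    mx, vx = lognormal_moments(mu_x, var_x)
    mn, vn = lognormal_moments(mu_n, var_n)
    m, v = mx + mn, vx + vn           # moments of exp(x) + exp(n) (independence)
    var_y = math.log(1 + v / m ** 2)  # match a log-normal to (m, v)
    mu_y = math.log(m) - var_y / 2
    return mu_y, var_y

# sanity check: with negligible noise, the clean statistics are recovered
mu_y, var_y = pmc_lognormal(1.0, 0.2, -20.0, 0.2)
assert abs(mu_y - 1.0) < 1e-3 and abs(var_y - 0.2) < 1e-3
```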
6.2 Vector Taylor series model compensation
individually and, secondly, be approximated by a Taylor series around \([\boldsymbol {\mu }_{\mathbf {x}|k_{n}}, \boldsymbol {\mu }_{\mathbf {h}}, \boldsymbol {\mu }_{\mathbf {c}}]\), where \(\boldsymbol {\mu }_{\mathbf {x}|k_{n}}\) denotes the mean of the component \(p(\mathbf {x}_{n} | k_{n})\). There are various extensions to the VTS concept that are omitted here. For a more comprehensive review of VTS, we refer to [25].
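For a feel of the expansion, consider the simplified additive-noise mismatch y = x + log(1 + exp(n − x)) in the log spectral domain, omitting the channel terms μ_h and μ_c of the text. A first-order Taylor expansion around the component means gives (illustrative sketch, not the full VTS of [11]):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def g(x, n):
    """Additive-noise mismatch in the log spectral domain: y = log(e^x + e^n)."""
    return x + math.log1p(math.exp(n - x))

def vts_compensate(mu_x, var_x, mu_n, var_n):
    """First-order VTS expansion of g around (mu_x, mu_n)."""
    G = sigmoid(mu_x - mu_n)            # ∂g/∂x at the expansion point; ∂g/∂n = 1 - G
    mu_y = g(mu_x, mu_n)
    var_y = G ** 2 * var_x + (1 - G) ** 2 * var_n
    return mu_y, var_y

# the partial derivatives sum to one, so the compensated variance interpolates
# between the speech and noise variances
eps = 1e-6
dgx = (g(1.0 + eps, -0.5) - g(1.0 - eps, -0.5)) / (2 * eps)
assert abs(dgx - sigmoid(1.5)) < 1e-6

mu_y, var_y = vts_compensate(1.0, 0.5, -0.5, 0.3)
assert mu_y > 1.0  # additive noise can only raise the log spectral mean
```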
6.3 CMLLR
CMLLR represents a very popular adaptation technique due to its promising results and versatile fields of application, such as speaker adaptation [53], adaptive training [54] as well as noise [55] and reverberation-robust [56] ASR.
6.4 MLLR
This principle is also known from other approaches that are applicable to both means and variances but are often only carried out on the former (e.g., for the sake of robustness) [10, 61].
If applied to the mean vectors only, MLLR can in turn be considered as a simplified version of CMLLR, where the observation model (54) and the Bayesian network in Fig. 5 b are assumed while the compensation of the variances is omitted.
The Bayesian network representation in Fig. 5 b also underlies the general MLLR adaptation rule (58) and (59). In this case, however, it seems impossible to identify a corresponding observation model representation without analytically tying \(\mathbf {A}_{k_{n}}\) and \(\mathbf {B}_{k_{n}}\).
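The MLLR mean adaptation itself is a plain affine transform per regression class, μ′ = A μ + b (the matrix, bias, and means below are illustrative):

```python
def mllr_adapt_mean(A, b, mu):
    """Apply the MLLR mean transform mu' = A mu + b to one Gaussian mean."""
    return [sum(a_ij * mu_j for a_ij, mu_j in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)]

# one regression class (A, b) shared by two Gaussian means
A = [[1.0, 0.1],
     [0.0, 0.9]]
b = [0.5, -0.2]
adapted = [mllr_adapt_mean(A, b, mu) for mu in ([2.0, 1.0], [0.0, -1.0])]
```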
6.5 MAP adaptation
An iterative (local) solution to (63) is obtained by the expectation maximization (EM) algorithm. Note that due to the MAP approximation of the posterior p(θ|y_{M:0}), the conditional independence assumption is again fulfilled such that a conventional decoder can be employed.
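For a Gaussian mean, the resulting MAP re-estimate reduces to an interpolation between the prior mean and the sample mean, controlled by a relevance factor τ (a standard conjugate-prior form in the spirit of [21]; the numbers are illustrative):

```python
def map_adapt_mean(mu_prior, tau, data):
    """MAP re-estimate of a Gaussian mean with conjugate-prior weight tau:
    mu_map = (tau * mu_prior + sum(x)) / (tau + N)."""
    n = len(data)
    return (tau * mu_prior + sum(data)) / (tau + n)

# with no adaptation data the prior wins; with much data the sample mean wins
assert map_adapt_mean(0.0, 10.0, []) == 0.0
big = [1.0] * 10000
assert abs(map_adapt_mean(0.0, 10.0, big) - 1.0) < 1e-2
```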
6.6 Bayesian MLLR
As mentioned before, uncertainty decoding techniques allow for a timevarying pdf p(b _{ n }), while model adaptation approaches, such as in Subsections 6.1, 6.2, and 6.6.1, mostly set p(b _{ n }) to be constant over time. In both cases, however, the “randomized” model parameter b _{ n } is assumed to be redrawn in each time step n as in Fig. 6 a. In contrast, Bayesian estimation—as mentioned before—usually refers to inference problems, where the random model parameters are drawn once for all times [26] as in Fig. 6 b.
where the original assumption of b being identical for all time steps n was relaxed to the case of b being identically distributed for all time steps n. The approximation in (65) can be interpreted as the conversion of the Bayesian network in Fig. 6 b to the one in Fig. 6 a with constant pdf p(b_n)=p(b) for all n.
6.6.1 Reverberant VTS
with b_n being an additive noise component modeled as a normally distributed random variable and μ_{0:L} being a deterministic description of the reverberant distortion. For the sake of tractability, the observation model is approximated in a similar manner as in the VTS approach. This concept can be seen as an alternative to REMOS (Subsection 4.5): While REMOS tailors the Viterbi decoder to the modified Bayesian network, reverberant VTS avoids the computationally expensive marginalization over all previous clean-speech vectors by averaging—and thus smoothing—the clean-speech statistics over all possible previous states and Gaussian components. Thus, y_n is assumed to depend on the extended clean-speech vector \(\overline {\mathbf {x}}_{n}\) = [x_{n−L},…,x_n], cf. Fig. 7 a vs. 7 b.
6.7 Convolutive model adaptation
where \(\boldsymbol {\mu }_{\mathbf {y}|k_{n}}\) and \(\boldsymbol {\mu }_{\mathbf {x}|k_{n}}\) denote the means of the k_nth Gaussian component of p(y_n|q_n) and p(x_n|q_n), respectively. The previous means \(\overline {\boldsymbol {\mu }}_{\mathbf {x}|q_{n-l}}\), l>0 are averaged over all means of the corresponding GMM p(x_{n−l}|q_{n−l}). On the other hand, [17] employs the “log-normal approximation” [10] to adapt p(y_n|q_n) according to (67). While [16] and [17] perform the adaptation once prior to recognition and then use a standard decoder, the concept proposed in [18] performs an online adaptation based on the best partial path [15].
It should be pointed out here that there is a variety of other approximations to the statistics of the log-sum of (mixtures of) Gaussian random variables (as seen in Subsections 4.2, 4.5, 6.1, 6.2, 6.6.1), ranging from different PMC methods [10] to maximum [64], piecewise linear [65], and other analytical approximations [66–70].
6.8 Takiguchi et al.
where x_n=g^{−1}(y_n) and \(J_{\mathbf {y}_{n}}\) denotes the Jacobian w.r.t. y_n.
6.9 Conditional HMMs [22] and combined-order HMMs [23]
We close this section by broadening the view and pointing to two model-based approaches that cannot be classified as “model adaptation” as they postulate different HMM topologies rather than adapting a conventional HMM. Both approaches aim at relaxing the conditional independence assumption of conventional HMMs in order to improve the modeling of the interframe correlation.
according to Fig. 9 a. Such HMMs are also known as autoregressive HMMs [26].
according to Fig. 9 b, which can be thought of as a conventional firstorder HMM with a second output pdf per state.
While conditional HMMs represent the statistically more accurate model for correlated speech feature vectors, combined-order HMMs circumvent the mathematically more complex inference step by a larger number of HMM parameters [23].
7 Relevance for DNNbased ASR
Before concluding this article, we build a bridge from the discussed model-based techniques for GMM-HMM-based ASR systems to the recent deep learning-based architectures.
The most immediate approach of exploiting conventional model-based techniques is within the framework of bottleneck or tandem systems [25]. There, deep neural networks (DNNs) are used for feature extraction while the ASR system’s acoustic model is based on GMM-HMMs. For such systems, the presented approaches could, in principle, be applied in the same way as for conventional GMM-HMMs. However, the definition of meaningful observation models seems less intuitive as the features undergo various nonlinear transforms before being presented to the GMM-HMM system.
with p(q_n|x_n) being the q_nth output node of the DNN, p(q_n) being the prior probability of each HMM state (senone), estimated from the training set, and p(x_n) being independent of the word sequence and thus to be ignored [71].
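A toy computation of this scaled, or “pseudo,” likelihood (the posterior and prior values below are invented for illustration):

```python
import math

def scaled_scores(posteriors, priors):
    """Hybrid DNN-HMM score: log p(x_n|q_n) + const = log p(q_n|x_n) - log p(q_n)."""
    return [math.log(p) - math.log(pr) for p, pr in zip(posteriors, priors)]

# Illustrative senone posteriors (DNN softmax output) and training-set priors.
posteriors = [0.70, 0.20, 0.10]
priors = [0.80, 0.10, 0.10]

scores = scaled_scores(posteriors, priors)
# The frequent senone 0 has the highest posterior but is down-weighted by its
# prior, so the rarer senone 1 obtains the best acoustic score.
best = max(range(3), key=lambda i: scores[i])
assert best == 1
```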
where p(x_n|y_n) is defined through (75). In theory, any of the previously discussed observation models could thus be directly applied to a DNN-HMM as long as resolving them for x_n is feasible. In practice, however, both the parameter estimation step and the compensation step (76) can become complex.
given the observation y _{ n }, which can, e.g., be achieved by numerical or deterministic integral approximations [78].
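One such numerical approximation is plain Monte Carlo sampling from p(x_n|y_n): draw feature samples, push each through the network, and average the resulting posteriors. The stand-in dnn_posterior below is hypothetical; a real system would call the trained DNN instead.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def dnn_posterior(x):
    """Hypothetical stand-in for a trained DNN: senone posteriors from a 1-D feature."""
    return softmax([-(x - c) ** 2 for c in (0.0, 1.0, 2.0)])

def mc_uncertain_posterior(mu_xy, var_xy, n_samples=1000, seed=0):
    """Monte Carlo approximation of ∫ p(q|x) p(x|y) dx with Gaussian p(x|y)."""
    rng = random.Random(seed)
    acc = [0.0, 0.0, 0.0]
    for _ in range(n_samples):
        x = rng.gauss(mu_xy, math.sqrt(var_xy))
        for i, p in enumerate(dnn_posterior(x)):
            acc[i] += p / n_samples
    return acc

post = mc_uncertain_posterior(0.8, 0.5)
assert abs(sum(post) - 1.0) < 1e-9  # averaging preserves the probability simplex
```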
If the transform parameters (here: b _{ n }) are estimated irrespectively of the ASR system’s acoustic model, (80) can be seen as a “conventional” feature enhancement step. If the transform parameters are discriminatively estimated using error backpropagation through the DNN, (80) could also be considered as adaptation of the DNN’s input layer weights.
8 Conclusions
In this article, we described the compensation rules of several acoustic model-based techniques employing the Bayesian formalism. Some of the presented Bayesian descriptions are already given in the original papers and others can be easily derived based on the original papers (cf. Subsections 4.3, 4.4, 4.6, and 6.9). Beyond this, however, the links of the decoding rules of the concepts of REMOS (Subsection 4.5), significance decoding (Subsection 4.7), modified imputation (Subsection 5.3), CMLLR/MLLR (Subsections 6.3 and 6.4), MAP (Subsection 6.5), Bayesian MLLR (Subsection 6.6), and Takiguchi et al. [19] (Subsection 6.8) to the Bayesian framework via the mathematical reformulations in (28), (37), (45), (55), (61), (65), and (71), respectively, are explicitly stated for the first time in this paper.
As a byproduct of the Bayesian formalism, the considered concepts are represented here as Bayesian networks, which both highlights and hides certain crucial aspects. Most importantly, neither the particular functional form of the joint pdf nor potential approximations to arrive at a tractable algorithm nor the provenance of (i.e., the estimation procedure for) the compensation parameters are reflected.

The cross-connections depicted in Figs. 3, 4, 7a, 8, and 9 show that the underlying concept aims at improving the modeling of the interframe correlation, e.g., to increase the robustness of the acoustic model against reverberation. If applied in a straightforward way, such cross-connections would entail a costly modification of the Viterbi decoder. In this paper, we summarized some important approximations that allow for a more efficient decoding of the extended Bayesian network, cf. Subsections 4.5, 4.6, 6.6.1, and 6.7. Some of these typically empirically motivated or just intuitive approximations, especially neglected statistical dependencies, become obvious from a Bayesian network, as shown in Figs. 4 and 7.

The approaches introducing instantaneous (here: purely vertical) extensions to the Bayesian network, as in Figs. 2 a–c and 5 c, usually aim at compensating for nondispersive distortions, such as additive or short-ranging convolutive noise.

The arcs in Figs. 2 c and 5 b illustrate that the observed vector y _{ n } does not only depend on the state q _{ n } (or mixture component k _{ n }) through x _{ n }. As a consequence, one can deduce that the compensation parameters do depend on the phonetic content, as in Subsections 4.4, 6.3, and 6.4.

The graphical model representation also succinctly highlights whether a Bayesian modeling paradigm is applied, as in Figs. 5 c and 6 b, or not, as in Figs. 5 a, b.

The existence of the additional latent variable x _{ n } in most of the presented Bayesian network representations expresses that an explicit observation model or an implicit statistical model between the clean and the corrupted features is employed. In contrast, the graphical representations in Figs. 5 c and 9 show that—instead of a distinct compensation step—a modified HMM topology is used.
In summary, the condensed description of the various concepts from the same Bayesian perspective shall allow other researchers to more easily exploit or combine existing techniques and to relate their own algorithms to the presented ones. This seems all the more important as the recent acoustic modeling approaches based on DNNs raise new challenges for the conventional robustness techniques [25].
Declarations
Acknowledgements
The authors would like to thank the Deutsche Forschungsgemeinschaft (DFG) for supporting this work (contract number KE 890/4-2).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 JA Arrowood, MA Clements, in Proc. ICSLP. Using Observation Uncertainty in HMM Decoding (ISCABaixas, France, 2002), pp. 1561–1564.Google Scholar
 L Deng, J Droppo, A Acero, Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion. IEEE Trans. Speech Audio Process.13(3), 412–421 (2005).View ArticleGoogle Scholar
 J Droppo, A Acero, L Deng, in Proc. ICASSP, 1. Uncertainty Decoding with SPLICE for Noise Robust Speech Recognition (IEEENew Jersey, USA, 2002), pp. 57–60.Google Scholar
 H Liao, Uncertainty decoding for noise robust speech recognition, PhD thesis (Univ. of Cambridge, 2007). http://mi.eng.cam.ac.uk/~mjfg/thesis_hl251.pdf.
 R Maas, W Kellermann, A Sehr, T Yoshioka, M Delcroix, K Kinoshita, T Nakatani, in Proc. Int. Conf. on Digital Signal Process. Formulation of the REMOS Concept from an Uncertainty Decoding Perspective (IEEENew Jersey, USA, 2013).Google Scholar
 M Cooke, P Green, L Josifovski, A Vizinho, Robust Automatic Speech Recognition with Missing and Unreliable Acoustic Data. Speech Commun.34(3), 267–285 (2001).View ArticleMATHGoogle Scholar
 B Raj, RM Stern, Missingfeature approaches in speech recognition. IEEE Signal Process. Mag.22(5), 101–116 (2005).View ArticleGoogle Scholar
 D Kolossa, A Klimas, R Orglmeister, in Proc. WASPAA. Separation and Robust Recognition of Noisy, Convolutive Speech Mixtures Using TimeFrequency Masking and Missing Data Techniques (IEEENew Jersey, USA, 2005), pp. 82–85.Google Scholar
 AH Abdelaziz, D Kolossa, in Proc. Interspeech. Decoding of Uncertain Features Using the Posterior Distribution of the Clean Data for Robust Speech Recognition (ISCABaixas, France, 2012).Google Scholar
 MJF Gales, Modelbased techniques for noise robust speech recognition, PhD thesis (1995). http://mi.eng.cam.ac.uk/~mjfg/thesis.pdf.
 A Acero, L Deng, T Kristjansson, J Zhang, in Proc. ICSLP, 3. HMM Adaptation Using Vector Taylor Series for Noisy Speech Recognition (ISCABaixas, France, 2000), pp. 869–872.Google Scholar
 VV Digalakis, D Rtischev, LG Neumeyer, Speaker adaptation using constrained estimation of Gaussian mixtures. IEEE Trans. Speech Audio Process.3(5), 357–366 (1995).View ArticleGoogle Scholar
 CJ Leggetter, PC Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech Lang.9(2), 171–185 (1995).View ArticleGoogle Scholar
 JT Chien, Linear regression based Bayesian predictive classification for speech recognition. IEEE Trans. Speech Audio Process.11(1), 70–79 (2003).View ArticleGoogle Scholar
 YQ Wang, MJF Gales, in Proc. ASRU. Improving Reverberant VTS for HandsFree Robust Speech Recognition (IEEENew Jersey, USA, 2011), pp. 113–118.Google Scholar
 HG Hirsch, H Finster, A new approach for the adaptation of HMMs to reverberation and background noise. Speech Commun.50(3), 244–263 (2008).View ArticleGoogle Scholar
 CK Raut, T Nishimoto, S Sagayama, in Proc. ICASSP, 1. Model Adaptation for Long Convolutional Distortion by Maximum Likelihood Based State Filtering Approach (IEEENew Jersey, USA, 2006), pp. 1133–1136.Google Scholar
 A Sehr, R Maas, W Kellermann, in Proc. ICASSP. Framewise HMM Adaptation Using StateDependent Reverberation Estimates (IEEENew Jersey, USA, 2011), pp. 5484–5487.Google Scholar
 T Takiguchi, M Nishimura, Y Ariki, Acoustic model adaptation using firstorder linear prediction for reverberant speech. IEICE Trans. Inform. Syst. E89D(3), 908–914 (2006).View ArticleGoogle Scholar
 V Ion, R HaebUmbach, A novel uncertainty decoding rule with applications to transmission error robust speech recognition. IEEE Trans. Audio, Speech, Lang. Process.16(5), 1047–1060 (2008).View ArticleGoogle Scholar
 JL Gauvain, CH Lee, in Proc. Workshop on Speech and Natural Lang. MAP Estimation of Continuous Density HMM: Theory and Applications (Morgan KaufmannBurlington, USA, 1992), pp. 185–190.View ArticleGoogle Scholar
 J Ming, FJ Smith, Modelling of the interframe dependence in an HMM using conditional Gaussian mixtures. Comput. Speech Lang.10:, 229–247 (1996).View ArticleGoogle Scholar
 R Maas, SR Kotha, A Sehr, W Kellermann, in Proc. Int. Workshop on Cognitive Inform. Process. CombinedOrder Hidden Markov Models for ReverberationRobust Speech Recognition (IEEENew Jersey, USA, 2012), pp. 167–171.Google Scholar
 R HaebUmbach, in Robust Speech Recognition of Uncertain or Missing Data, ed. by D Kolossa, R HaebUmbach. Uncertainty Decoding and Conditional Bayesian Estimation (SpringerBerlin Heidelberg, 2011), pp. 9–33.View ArticleGoogle Scholar
 J Li, L Deng, Y Gong, R HaebUmbach, An overview of noiserobust automatic speech recognition. IEEE/ACM Trans. Audio, Speech, Lang. Process.22(4), 745–777 (2014).View ArticleGoogle Scholar
 CM Bishop, Pattern Recognition and Machine Learning (Springer, New York, 2006).MATHGoogle Scholar
 D Kolossa, R HaebUmbach, Robust Speech Recognition of Uncertain or Missing Data (Springer, Berlin Heidelberg, 2011).View ArticleMATHGoogle Scholar
 G Heigold, H Ney, R Schlüter, S Wiesler, Discriminative training for automatic speech recognition: modeling, criteria, optimization, implementation, and performance. IEEE Signal Process. Mag.29(6), 58–69 (2012).View ArticleGoogle Scholar
 M Matassoni, M Omologo, D Giuliani, P Svaizer, Hidden Markov model training with contaminated speech material for distanttalking speech recognition. Comput. Speech Lang.16(2), 205–223 (2002).View ArticleGoogle Scholar
 A Sehr, C Hofmann, R Maas, W Kellermann, in Proc. Interspeech. A Novel Approach for Matched Reverberant Training of HMMs Using Data Pairs (ISCABaixas, France, 2010), pp. 566–569.Google Scholar
 Y Gong, Speech recognition in noisy environments: A survey. Speech Commun.16(3), 261–291 (1995). doi:10.1016/01676393(94)00059J.View ArticleGoogle Scholar
 CH Lee, On stochastic feature and model compensation approaches to robust speech recognition. Speech Commun.25(1–3), 29–47 (1998). doi:10.1016/S01676393(98)000284.View ArticleGoogle Scholar
 Q Huo, CH Lee, Robust speech recognition based on adaptive classification and decision strategies. Speech Commun.34(1–2), 175–194 (2001). doi:10.1016/S01676393(00)000534.View ArticleMATHGoogle Scholar
 J Droppo, in Springer Handbook of Speech Processing. Environmental Robustness (SpringerBerlin Heidelberg, 2008), pp. 653–680.View ArticleGoogle Scholar
 T Yoshioka, A Sehr, M Delcroix, K Kinoshita, R Maas, T Nakatani, W Kellermann, Making machines understand us in reverberant rooms: robustness against reverberation for automatic speech recognition. IEEE Signal Process. Mag.29(6), 114–126 (2012).View ArticleGoogle Scholar
 T Virtanen, R Singh, B Raj, Techniques for Noise Robustness in Automatic Speech Recognition (Wiley, UK, 2013).
 JN Holmes, WJ Holmes, PN Garner, in Proc. Eurospeech, 97. Using Formant Frequencies in Speech Recognition (ISCA, Baixas, France, 1997), pp. 2083–2087.
 TT Kristjansson, BJ Frey, in Proc. ICASSP, 1. Accounting for Uncertainty in Observations: A New Paradigm for Robust Speech Recognition (IEEE, New Jersey, USA, 2002), pp. 61–64.
 L Deng, J Droppo, A Acero, in Proc. ICSLP, 4. Exploiting Variances in Robust Feature Extraction Based on a Parametric Model of Speech Distortion, (2002), pp. 2449–2452.
 MC Benitez, JC Segura, A Torre, J Ramirez, A Rubio, in Proc. ICSLP. Including Uncertainty of Speech Observations in Robust Speech Recognition (ISCA, Baixas, France, 2004), pp. 137–140.
 V Stouten, H Van Hamme, P Wambacq, Model-based feature enhancement with uncertainty decoding for noise robust ASR. Speech Commun. 48(11), 1502–1514 (2006).
 M Delcroix, T Nakatani, S Watanabe, Static and dynamic variance compensation for recognition of reverberant speech with dereverberation preprocessing. IEEE Trans. Audio, Speech, Lang. Process. 17(2), 324–334 (2009).
 L Deng, A Acero, M Plumpe, XD Huang, in Proc. ICSLP, 3. Large Vocabulary Continuous Speech Recognition Under Adverse Conditions (ISCA, Baixas, France, 2000), pp. 806–809.
 L Deng, A Acero, L Jiang, J Droppo, X Huang, in Proc. ICASSP, 1. High-Performance Robust Speech Recognition Using Stereo Training Data (IEEE, New Jersey, USA, 2001), pp. 301–304.
 L Deng, J Droppo, A Acero, Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition. IEEE Trans. Speech Audio Process. 11(6), 568–580 (2003).
 A Sehr, R Maas, W Kellermann, Reverberation model-based decoding in the log-melspec domain for robust distant-talking speech recognition. IEEE Trans. Audio, Speech, Lang. Process. 18(7), 1676–1691 (2010).
 JA Arrowood, Using observation uncertainty for robust speech recognition, PhD thesis (Georgia Institute of Technology, 2003). https://smartech.gatech.edu/bitstream/handle/1853/5383/arrowood_jon_a_200312_phd.pdf.
 RF Astudillo, R Orglmeister, Computing MMSE estimates and residual uncertainty directly in the feature domain of ASR using STFT domain speech distortion models. IEEE Trans. Audio, Speech, Lang. Process. 21(5), 1023–1034 (2013).
 M Cooke, A Morris, P Green, in Proc. ICASSP, 2. Missing Data Techniques for Robust Speech Recognition, (1997), pp. 863–866.
 KJ Palomäki, GJ Brown, JP Barker, Techniques for handling convolutional distortion with ‘missing data’ automatic speech recognition. Speech Commun. 43(1–2), 123–142 (2004).
 PJ Moreno, Speech recognition in noisy environments, PhD thesis (Carnegie Mellon Univ., Pittsburgh, 1996). http://www.cs.cmu.edu/~robust/Thesis/pjm_thesis.pdf.
 J Li, L Deng, D Yu, Y Gong, A Acero, A unified framework of HMM adaptation with joint compensation of additive and convolutive distortions. Comput. Speech Lang. 23(3), 389–405 (2009).
 MJF Gales, Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang. 12(2), 75–98 (1998).
 K Yu, MJF Gales, Discriminative cluster adaptive training. IEEE Trans. Audio, Speech, Lang. Process. 14(5), 1694–1703 (2006).
 E Vincent, J Barker, S Watanabe, J Le Roux, F Nesta, M Matassoni, in Proc. ASRU. The Second ‘CHiME’ Speech Separation and Recognition Challenge: An Overview of Challenge Systems and Outcomes (IEEE, New Jersey, USA, 2013), pp. 162–167.
 K Kinoshita, M Delcroix, T Yoshioka, T Nakatani, A Sehr, W Kellermann, R Maas, in Proc. WASPAA. The REVERB Challenge: A Common Evaluation Framework for Dereverberation and Recognition of Reverberant Speech, (2013).
 T Anastasakos, J McDonough, R Schwartz, J Makhoul, in Proc. ICSLP, 2. A Compact Model for Speaker-Adaptive Training (ISCA, Baixas, France, 1996), pp. 1137–1140.
 S Young, G Evermann, D Kershaw, G Moore, J Odell, D Ollason, D Povey, V Valtchev, P Woodland, The HTK Book (Cambridge Univ. Eng. Dept., UK, 2002).
 M Delcroix, K Kinoshita, T Nakatani, S Araki, A Ogawa, T Hori, S Watanabe, M Fujimoto, T Yoshioka, T Oba, et al., in Proc. Int. Workshop on Mach. Listening in Multisource Environments (CHiME). Speech Recognition in the Presence of Highly Nonstationary Noise Based on Spatial, Spectral and Temporal Speech/Noise Modeling Combined with Dynamic Variance Adaptation, (2011), pp. 12–17. http://spandh.dcs.shef.ac.uk/projects/chime/workshop/papers/pS21_delcroix.pdf.
 F Xiong, N Moritz, R Rehr, J Anemüller, BT Meyer, T Gerkmann, S Doclo, S Goetze, in Proc. REVERB Workshop. Robust ASR in Reverberant Environments Using Temporal Cepstrum Smoothing for Speech Enhancement and an Amplitude Modulation Filterbank for Feature Extraction, (2014). http://reverb2014.dereverberation.com/workshop/reverb2014papers/1569899061.pdf.
 HG Hirsch, HMM adaptation for applications in telecommunication. Speech Commun. 34(1–2), 127–139 (2001).
 H Jiang, K Hirose, Q Huo, Robust speech recognition based on a Bayesian prediction approach. IEEE Trans. Speech Audio Process. 7(4), 426–440 (1999).
 JT Chien, Linear regression based Bayesian predictive classification for speech recognition. IEEE Trans. Speech Audio Process. 11(1), 70–79 (2003).
 A Nadas, D Nahamoo, MA Picheny, Speech recognition using noise-adaptive prototypes. IEEE Trans. Acoust., Speech, Signal Process. 37(10), 1495–1503 (1989).
 R Maas, A Sehr, M Gugat, W Kellermann, in Proc. European Signal Processing Conf. (EUSIPCO). A Highly Efficient Optimization Scheme for REMOS-Based Distant-Talking Speech Recognition (IEEE, New Jersey, USA, 2010), pp. 1983–1987.
 SC Schwartz, YS Yeh, On the distribution function and moments of power sums with log-normal components. Bell Syst. Tech. J. 61(7), 1441–1462 (1982).
 NC Beaulieu, AA Abu-Dayya, PJ McLane, Estimating the distribution of a sum of independent lognormal random variables. IEEE Trans. Commun. 43(12), 2869–2873 (1995).
 CK Raut, T Nishimoto, S Sagayama, in Proc. Interspeech. Model Composition by Lagrange Polynomial Approximation for Robust Speech Recognition in Noisy Environment (ISCA, Baixas, France, 2004).
 NC Beaulieu, Q Xie, An optimal lognormal approximation to lognormal sum distributions. IEEE Trans. Veh. Technol. 53(2), 479–489 (2004).
 JR Hershey, SJ Rennie, JL Roux, in Techniques for Noise Robustness in Automatic Speech Recognition, ed. by T Virtanen, R Singh, and B Raj. Factorial Models for Noise Robust Speech Recognition (Wiley, UK, 2013), pp. 311–345.
 D Yu, L Deng, Automatic Speech Recognition—A Deep Learning Approach (Springer, London, 2015).
 ML Seltzer, D Yu, Y Wang, in Proc. ICASSP. An Investigation of Deep Neural Networks for Noise Robust Speech Recognition (IEEE, New Jersey, USA, 2013), pp. 7398–7402.
 L Deng, J Li, JT Huang, K Yao, D Yu, F Seide, M Seltzer, G Zweig, X He, J Williams, Y Gong, A Acero, in Proc. ICASSP. Recent Advances in Deep Learning for Speech Research at Microsoft (IEEE, New Jersey, USA, 2013), pp. 8604–8608.
 R Gemello, F Mana, S Scanzio, P Laface, R De Mori, in Proc. ICASSP. Adaptation of Hybrid ANN/HMM Models Using Linear Hidden Transformations and Conservative Training (IEEE, New Jersey, USA, 2006), pp. 1189–1192.
 F Seide, G Li, X Chen, D Yu, in Proc. ASRU. Feature Engineering in Context-Dependent Deep Neural Networks for Conversational Speech Transcription (IEEE, New Jersey, USA, 2011), pp. 24–29.
 K Yao, D Yu, F Seide, H Su, L Deng, Y Gong, in Proc. SLT. Adaptation of Context-Dependent Deep Neural Networks for Automatic Speech Recognition (IEEE, New Jersey, USA, 2012), pp. 366–369.
 G Saon, H Soltau, D Nahamoo, M Picheny, in Proc. ASRU. Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors (IEEE, New Jersey, USA, 2013), pp. 55–59.
 RF Astudillo, JP da Silva Neto, in Proc. Interspeech. Propagation of Uncertainty Through Multilayer Perceptrons for Robust Automatic Speech Recognition (ISCA, Baixas, France, 2011), pp. 461–464.