A Bayesian view on acoustic model-based techniques for robust speech recognition
 Roland Maas^{1},
 Christian Huemmer^{1},
 Armin Sehr^{2} and
 Walter Kellermann^{1}
https://doi.org/10.1186/s13634-015-0287-x
© Maas et al. 2015
Received: 15 April 2015
Accepted: 12 November 2015
Published: 2 December 2015
Abstract
This article provides a unifying Bayesian view on various approaches for acoustic model adaptation, missing feature, and uncertainty decoding that are well-known in the literature of robust automatic speech recognition. The representatives of these classes can often be deduced from a Bayesian network that extends the conventional hidden Markov models used in speech recognition. These extensions, in turn, can in many cases be motivated from an underlying observation model that relates clean and distorted feature vectors. By identifying and converting the observation models into a Bayesian network representation, we formulate the corresponding compensation rules. We thus summarize the various approaches as approximations or modifications of the same Bayesian decoding rule, leading to a unified view on known derivations as well as to new formulations for certain approaches.
1 Introduction
Robust automatic speech recognition (ASR) still represents a challenging research topic. The main obstacle, namely the mismatch of test and training data, can be tackled by enhancing the observed speech signals or features in order to meet the training conditions or by compensating for the distorted test conditions in the acoustic model of the ASR system.
Methods that modify the acoustic model are in general termed (acoustic) model-based or model compensation approaches and comprise inter alia the following subcategories: so-called model adaptation techniques mostly update the parameters of the acoustic model, i.e., of the hidden Markov models (HMMs), prior to the decoding of a set of observed feature vectors. In contrast, decoder-based approaches readapt the HMM parameters for each observed feature vector. The most common decoder-based approaches are missing feature and uncertainty decoding that incorporate additional time-varying uncertainty information into the evaluation of the HMMs’ probability density functions (pdfs).
Various model compensation techniques exhibit two (more or less) distinct steps: First, the compensation parameters need to be estimated and, second, the actual compensation rule is applied to the acoustic model. The compensation rules can often be motivated based on an observation model that relates the clean and distorted feature vectors, e.g., in the logarithmic mel-spectral (logmelspec) or the mel frequency cepstral coefficient (MFCC) domain.
if the statistics of z_n depend only on time through w_n, i.e., if \(\boldsymbol {\mu }_{\mathbf {z}|\mathbf {w}_{n}} = \boldsymbol {\mu }_{\mathbf {z}|\mathbf {w}_{m}}\) and \(\mathbf {C}_{\mathbf {z}|\mathbf {w}_{n}} = \mathbf {C}_{\mathbf {z}|\mathbf {w}_{m}}\) for w_n = w_m and n,m∈{1,…,N}.
The remainder of the article is organized as follows: After summarizing the employed Bayesian view in Section 2 and its difference to other overview articles in Section 3, this perspective is applied to uncertainty decoding, missing feature techniques, and other model-based approaches in Sections 4, 5, and 6, respectively. In Section 7, we point out the relation of the presented techniques to deep learning-based architectures. Finally, conclusions are drawn in Section 8.
2 The Bayesian view
We start by reviewing the Bayesian perspective on acoustic model-based techniques that we use in Sections 4, 5, 6, and 7 to review different algorithms.
where p(q_1|q_0)=p(q_1). The summation goes over all possible state sequences q_{1:N} through W superseding the explicit dependency on w at the right-hand side of (4) and (5). Note that the pdf p(y_n|q_n) can be scaled by p(y_n) without influencing the discrimination capability of the acoustic score w.r.t. changing word sequences w. We thus define \(\mathring {p}(\mathbf {y}_{n}|q_{n}) = {p(\mathbf {y}_{n}|q_{n})}/{p(\mathbf {y}_{n})}\) for later use.
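The effect of this normalization can be checked with a small numerical sketch (the 1-D state pdfs and the stand-in for p(y_n) below are purely illustrative): dividing all state-conditional scores of a frame by the same factor leaves the ranking of the states, and hence the discrimination between word sequences, unchanged.

```python
import math

def gauss(y, mu, var):
    """1-D Gaussian pdf N(y; mu, var)."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Hypothetical 1-D emission pdfs p(y_n | q_n) for three HMM states.
states = {"q1": (0.0, 1.0), "q2": (2.0, 1.5), "q3": (4.0, 0.5)}
y_n = 1.2

likelihoods = {q: gauss(y_n, mu, var) for q, (mu, var) in states.items()}
p_y = sum(likelihoods.values()) / len(likelihoods)  # crude stand-in for p(y_n)

# Scaled scores p̊(y_n|q_n) = p(y_n|q_n) / p(y_n): the common factor cancels,
# so the ranking of the states is unchanged.
scaled = {q: l / p_y for q, l in likelihoods.items()}

assert max(likelihoods, key=likelihoods.get) == max(scaled, key=scaled.get)
```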
where the actual functional form of p(y_n|x_n) depends on the assumptions on g(·) and the statistics p(b_n) of b_n.
where δ(·) denotes the Dirac distribution. In contrast, missing feature and uncertainty decoding approaches typically assume p(b_n) to be a time-varying pdf [4, 27]. As exemplified in Sections 4 to 6, this Bayesian view also allows for a convenient illustration of the underlying statistical dependencies of model-based approaches by means of Bayesian networks. If two approaches share the same Bayesian network, their underlying joint pdfs over all involved random variables share the same decomposition properties. However, some crucial aspects are not reflected by a Bayesian network: the particular functional form of the joint pdf, potential approximations to arrive at a tractable algorithm, as well as the estimation procedure for the compensation parameters. While some approaches estimate these parameters through an acoustic front-end, others derive them from clean or distorted data. For clarity, we entirely focus in this article on the compensation rules while ignoring the parameter estimation step. We also disregard approaches that apply a modified training method to conventional HMMs without exhibiting a distinct compensation step, as is characteristic for, e.g., discriminative [28], multi-condition [29], or reverberant training [30].
3 Merit of the Bayesian view

First of all, we aim at classifying all considered techniques along the same dimension by motivating and describing them with the same Bayesian formalism. Consequently, we do not conceptually distinguish whether a given method employs a time-varying pdf p(b_n), as in uncertainty decoding, or whether a distorted vector y_n is a preprocessed or a genuinely noisy or reverberant observation. Also, the distinction between implicit and explicit observation models dissolves in our formalism.

As a second goal, we aim at closing some gaps by presenting new derivations and formulations for some of the considered techniques. For instance, the Bayesian networks in Figs. 2 b, 2 c, 4, 5 b, 5 c, 6 b, 8, 9 b representing the concepts in Subsections 4.3, 4.4, 4.6, 6.3/6.4, 6.5, 6.6, 6.8, and 6.9, respectively, constitute novel representations. Moreover, the links to the Bayesian framework via the mathematical reformulations in (28), (29), (37), (38), (45), (55), (61), (65), (71) are explicitly stated for the first time in this paper.

The third goal of the Bayesian description is to provide an intuitive graphical illustration that allows one to easily survey a broad class of algorithms and to immediately identify their similarities and differences in terms of the underlying statistical assumptions.
By establishing new links between existing concepts, such an abstract overview should therefore also serve as a basis for revealing and exploring new directions. Note, however, that the review presented in this paper does not claim to cover all relevant acoustic modelbased techniques and is rather meant as an inspiration to other researchers.
4 Uncertainty decoding
In the following, we consider the compensation rules of several uncertainty decoding techniques from a Bayesian view.
4.1 General example of uncertainty decoding
Without loss of generality, a single Gaussian pdf p(x _{ n }q _{ n }) is assumed since, in the case of a Gaussian mixture model (GMM), the linear mismatch function (10) can be applied to each Gaussian component separately.
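As a minimal numerical sketch of such a compensation rule, assume the simple additive observation model y_n = x_n + b_n with Gaussian noise b_n (an assumption for illustration; the mismatch function (10) in the text may differ). The compensated likelihood is then again Gaussian, with shifted mean and inflated variance, which the following 1-D code verifies against the underlying convolution integral:

```python
import math

def gauss(v, mu, var):
    """1-D Gaussian pdf N(v; mu, var)."""
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def compensated_likelihood(y, mu_x, var_x, mu_b, var_b):
    """p(y|q) = ∫ N(y; x + mu_b, var_b) N(x; mu_x, var_x) dx for y = x + b."""
    return gauss(y, mu_x + mu_b, var_x + var_b)

# Numerical check of the convolution integral (illustrative parameter values)
y, mu_x, var_x, mu_b, var_b = 1.0, 0.5, 2.0, 0.3, 0.7
dx = 0.001
num = sum(gauss(y - x, mu_b, var_b) * gauss(x, mu_x, var_x) * dx
          for x in [i * dx - 20 for i in range(40000)])
assert abs(num - compensated_likelihood(y, mu_x, var_x, mu_b, var_b)) < 1e-4
```

In the GMM case, the same shift-and-inflate rule is applied to each Gaussian component separately, as stated above.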
4.2 Dynamic variance compensation
where the approximation (13) can be justified if p(x_n) is assumed to be significantly “flatter,” i.e., of larger variance, than \(p(\mathbf {x}_{n}|\mathbf {y}_{n})\). The estimation of the moments \(\boldsymbol \mu _{\mathbf {x}|\mathbf {y}_{n}}\), \(\mathbf {C}_{\mathbf {x}|\mathbf {y}_{n}}\) of \(p(\mathbf {x}_{n}|\mathbf {y}_{n})\) represents the core of [2].
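In the scalar case, the resulting score is the Gaussian cross-correlation ∫ N(x; μ_{x|y_n}, C_{x|y_n}) N(x; μ_{x|q_n}, C_{x|q_n}) dx = N(μ_{x|y_n}; μ_{x|q_n}, C_{x|q_n} + C_{x|y_n}). A minimal sketch (all numbers illustrative), including a numerical check of the identity and of the flattening effect of a growing enhancement uncertainty:

```python
import math

def gauss(v, mu, var):
    """1-D Gaussian pdf N(v; mu, var)."""
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def dvc_score(mu_xy, var_xy, mu_xq, var_xq):
    """Dynamic variance compensation score N(mu_x|y; mu_x|q, C_x|q + C_x|y)."""
    return gauss(mu_xy, mu_xq, var_xq + var_xy)

# numerical check of the Gaussian cross-correlation identity
dx = 0.001
num = sum(gauss(x, 0.6, 0.4) * gauss(x, 0.0, 1.2) * dx
          for x in [i * dx - 10 for i in range(20000)])
assert abs(num - dvc_score(0.6, 0.4, 0.0, 1.2)) < 1e-4

# a larger enhancement uncertainty C_x|y flattens the state score
assert dvc_score(0.0, 4.0, 0.0, 1.0) < dvc_score(0.0, 0.1, 0.0, 1.0)
```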
4.3 Uncertainty decoding with SPLICE
Although analytically tractable, both the numerator and the denominator in (20) are typically approximated for the sake of runtime efficiency [3].
4.4 Joint uncertainty decoding
which can be analytically derived analogously to (11). In practice, the compensation parameters \(\mathbf {A}_{k_{n}}\), \(\boldsymbol {\mu }_{\mathbf {b}|k_{n}}\), and \(\mathbf {C}_{\mathbf {b}|k_{n}}\) are not estimated for each Gaussian component k_n but for each regression class comprising a set of Gaussian components [4].
4.5 REMOS
The determination of a global solution to (25) represents the core of the REMOS concept. The estimates \(\widehat {\mathbf {x}}_{n-L:n-1}(q_{n-1})\) in turn are the solutions to (25) at previous time steps. We refer to [5] for a detailed derivation of the corresponding decoding routine.
The obvious disadvantage of (29) is that the resulting score (25) does not represent a normalized likelihood w.r.t. y_n. On the other hand, the modified MAP approximation (29) leads to a scaled version of the exact likelihood p(y_n|q_n), cf. (27), with the scaling factor \(p(\mathbf {x}_{n}^{\text {MAP}}|\mathbf {y}_{n}, q_{n})\) increasing with the accuracy of the approximation (29).
4.6 Ion and Haeb-Umbach
that is given in [24]. Due to the approximations in Fig. 4 a, b, the compensation rule defined by (34) exhibits the same decoupling as (5) and can thus be carried out without modifying the underlying decoder. In practice, p(x_n) may, e.g., be modeled as a separate Gaussian density and p(x_n|y_{1:N}) as a separate Markov process [24].
4.7 Significance decoding
where the Bayesian network properties of Fig. 2 a have been exploited in the numerator and the denominator and a single Gaussian pdf \(p(\mathbf {x}_{n}|q_{n})\) is assumed without loss of generality. In a second step, the clean likelihood \(p(\mathbf {x}_{n}|q_{n})\) is evaluated at \(\boldsymbol {\mu }_{\mathbf {x}|\mathbf {y}_{n},q_{n}}\) after adding the variance \(\mathbf {C}_{\mathbf {x}|\mathbf {y}_{n},q_{n}}\) to \(\mathbf {C}_{\mathbf {x}|q_{n}}\), cf. (36).
In terms of probabilistic notation, this compensation rule corresponds to replacing the score calculation in (5) by an expected likelihood, similarly to [47, 48]:
A closer inspection of α reveals that the expected likelihood computation scales up p(y_n|q_n) for large values of \(\mathbf {C}_{\mathbf {b}_{n}}\), which acts as an alleviation of the (potentially overly) flattening effect of \(\mathbf {C}_{\mathbf {b}_{n}}\) on p(y_n|q_n), cf. (11).
5 Missing feature techniques
We next turn to missing feature techniques, which can be used to model feature distortion due to a front-end enhancement process [7], noise [49], or reverberation [50].
5.1 Feature vector imputation
with the general Bayesian network in Fig. 5 a. The approximation in (39) follows the same reasoning as (13).
5.2 Marginalization
5.3 Modified imputation
where (45) corresponds to evaluating the clean statedependent likelihood at \(\widehat {\mathbf {x}}_{n}\).
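The contrast between marginalization (Subsection 5.2) and evaluating the clean likelihood at a point estimate x̂_n (Subsections 5.1 and 5.3) can be sketched for a diagonal-Gaussian state pdf (reliability mask, estimate, and model parameters below are illustrative):

```python
import math

def gauss(v, mu, var):
    """1-D Gaussian pdf N(v; mu, var)."""
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def marginalized_score(y, mask, mu, var):
    """Evaluate a diagonal-Gaussian state pdf on the reliable components only."""
    score = 1.0
    for y_i, rel, m_i, v_i in zip(y, mask, mu, var):
        if rel:
            score *= gauss(y_i, m_i, v_i)
    return score

def imputed_score(y, mask, x_hat, mu, var):
    """Evaluate the full clean pdf with unreliable components replaced by x̂."""
    score = 1.0
    for y_i, rel, xh_i, m_i, v_i in zip(y, mask, x_hat, mu, var):
        score *= gauss(y_i if rel else xh_i, m_i, v_i)
    return score

y = [1.0, 3.5]          # observed feature vector
mask = [True, False]    # second component judged unreliable
x_hat = [1.0, 0.2]      # clean-speech estimate for the unreliable component
mu, var = [0.8, 0.0], [1.0, 1.0]

# marginalization simply drops the unreliable dimension
assert math.isclose(marginalized_score(y, mask, mu, var), gauss(1.0, 0.8, 1.0))
# imputation keeps the full dimensionality but substitutes the estimate
assert imputed_score(y, mask, x_hat, mu, var) > imputed_score(y, mask, [1.0, 3.5], mu, var)
```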
6 Acoustic model adaptation and other modelbased techniques
In the following, we consider the compensation rules of several acoustic model adaptation and other model-based approaches from a Bayesian view.
6.1 Parallel model combination
The overall acoustic score can be approximated by a 3D Viterbi decoder, which can in turn be mapped onto a conventional 2D Viterbi decoder [10].
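A widely used building block of PMC is the “log-normal approximation” [10] for combining a clean-speech and a noise Gaussian in the log spectral domain. The following 1-D sketch (independence assumed; all numbers illustrative) matches the moments of exp(x) + exp(n) with a log-normal distribution:

```python
import math

def lognormal_moments(mu, var):
    """Mean and variance of exp(X) for X ~ N(mu, var)."""
    m = math.exp(mu + var / 2)
    v = (math.exp(var) - 1) * math.exp(2 * mu + var)
    return m, v

def pmc_lognormal(mu_x, var_x, mu_n, var_n):
    """Log-normal PMC: Gaussian moments of y = log(exp(x) + exp(n))."""
    mx, vx = lognormal_moments(mu_x, var_x)
    mn, vn = lognormal_moments(mu_n, var_n)
    m, v = mx + mn, vx + vn           # moments of exp(x) + exp(n) (independence)
    var_y = math.log(1 + v / m ** 2)  # match a log-normal to (m, v)
    mu_y = math.log(m) - var_y / 2
    return mu_y, var_y

# sanity check: with negligible noise, the clean statistics are recovered
mu_y, var_y = pmc_lognormal(1.0, 0.2, -20.0, 0.2)
assert abs(mu_y - 1.0) < 1e-3 and abs(var_y - 0.2) < 1e-3
```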
6.2 Vector Taylor series model compensation
individually and, secondly, be approximated by a Taylor series around \([\boldsymbol {\mu }_{\mathbf {x}|k_{n}}, \boldsymbol {\mu }_{\mathbf {h}}, \boldsymbol {\mu }_{\mathbf {c}}]\), where \(\boldsymbol {\mu }_{\mathbf {x}|k_{n}}\) denotes the mean of the component \(p(\mathbf {x}_{n} | k_{n})\). There are various extensions to the VTS concept that are omitted here. For a more comprehensive review of VTS, we refer to [25].
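For a feel of the expansion, consider the simplified additive-noise mismatch y = x + log(1 + exp(n − x)) in the log spectral domain, omitting the channel terms μ_h and μ_c of the text. A first-order Taylor expansion around the component means gives (illustrative sketch, not the full VTS of [11]):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def g(x, n):
    """Additive-noise mismatch in the log spectral domain: y = log(e^x + e^n)."""
    return x + math.log1p(math.exp(n - x))

def vts_compensate(mu_x, var_x, mu_n, var_n):
    """First-order VTS expansion of g around (mu_x, mu_n)."""
    G = sigmoid(mu_x - mu_n)            # ∂g/∂x at the expansion point; ∂g/∂n = 1 - G
    mu_y = g(mu_x, mu_n)
    var_y = G ** 2 * var_x + (1 - G) ** 2 * var_n
    return mu_y, var_y

# the partial derivatives sum to one, so the compensated variance interpolates
# between the speech and noise variances
eps = 1e-6
dgx = (g(1.0 + eps, -0.5) - g(1.0 - eps, -0.5)) / (2 * eps)
assert abs(dgx - sigmoid(1.5)) < 1e-6

mu_y, var_y = vts_compensate(1.0, 0.5, -0.5, 0.3)
assert mu_y > 1.0  # additive noise can only raise the log spectral mean
```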
6.3 CMLLR
CMLLR represents a very popular adaptation technique due to its promising results and versatile fields of application, such as speaker adaptation [53], adaptive training [54] as well as noise [55] and reverberation-robust [56] ASR.
6.4 MLLR
This principle is also known from other approaches that are applicable to both means and variances but are often only carried out on the former (e.g., for the sake of robustness) [10, 61].
If applied to the mean vectors only, MLLR can in turn be considered as a simplified version of CMLLR, where the observation model (54) and the Bayesian network in Fig. 5 b are assumed while the compensation of the variances is omitted.
The Bayesian network representation in Fig. 5 b also underlies the general MLLR adaptation rule (58) and (59). In this case, however, it seems impossible to identify a corresponding observation model representation without analytically tying \(\mathbf {A}_{k_{n}}\) and \(\mathbf {B}_{k_{n}}\).
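The MLLR mean adaptation itself is a plain affine transform per regression class, μ′ = A μ + b (the matrix, bias, and means below are illustrative):

```python
def mllr_adapt_mean(A, b, mu):
    """Apply the MLLR mean transform mu' = A mu + b to one Gaussian mean."""
    return [sum(a_ij * mu_j for a_ij, mu_j in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)]

# one regression class (A, b) shared by two Gaussian means
A = [[1.0, 0.1],
     [0.0, 0.9]]
b = [0.5, -0.2]
adapted = [mllr_adapt_mean(A, b, mu) for mu in ([2.0, 1.0], [0.0, -1.0])]
```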
6.5 MAP adaptation
An iterative (local) solution to (63) is obtained by the expectation maximization (EM) algorithm. Note that due to the MAP approximation of the posterior p(θ|y_{M:0}), the conditional independence assumption is again fulfilled such that a conventional decoder can be employed.
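For a Gaussian mean, the resulting MAP re-estimate reduces to an interpolation between the prior mean and the sample mean, controlled by a relevance factor τ (a standard conjugate-prior form in the spirit of [21]; the numbers are illustrative):

```python
def map_adapt_mean(mu_prior, tau, data):
    """MAP re-estimate of a Gaussian mean with conjugate-prior weight tau:
    mu_map = (tau * mu_prior + sum(x)) / (tau + N)."""
    n = len(data)
    return (tau * mu_prior + sum(data)) / (tau + n)

# with no adaptation data the prior wins; with much data the sample mean wins
assert map_adapt_mean(0.0, 10.0, []) == 0.0
big = [1.0] * 10000
assert abs(map_adapt_mean(0.0, 10.0, big) - 1.0) < 1e-2
```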
6.6 Bayesian MLLR
As mentioned before, uncertainty decoding techniques allow for a timevarying pdf p(b _{ n }), while model adaptation approaches, such as in Subsections 6.1, 6.2, and 6.6.1, mostly set p(b _{ n }) to be constant over time. In both cases, however, the “randomized” model parameter b _{ n } is assumed to be redrawn in each time step n as in Fig. 6 a. In contrast, Bayesian estimation—as mentioned before—usually refers to inference problems, where the random model parameters are drawn once for all times [26] as in Fig. 6 b.
where the original assumption of b being identical for all time steps n was relaxed to the case of b being identically distributed for all time steps n. The approximation in (65) can be interpreted as the conversion of the Bayesian network in Fig. 6 b to the one in Fig. 6 a with constant pdf p(b_n)=p(b) for all n.
6.6.1 Reverberant VTS
with b_n being an additive noise component modeled as a normally distributed random variable and μ_{0:L} being a deterministic description of the reverberant distortion. For the sake of tractability, the observation model is approximated in a similar manner as in the VTS approach. This concept can be seen as an alternative to REMOS (Subsection 4.5): While REMOS tailors the Viterbi decoder to the modified Bayesian network, reverberant VTS avoids the computationally expensive marginalization over all previous clean-speech vectors by averaging—and thus smoothing—the clean-speech statistics over all possible previous states and Gaussian components. Thus, y_n is assumed to depend on the extended clean-speech vector \(\overline {\mathbf {x}}_{n}\) = [x_{n−L},…,x_n], cf. Fig. 7 a vs. 7 b.
6.7 Convolutive model adaptation
where \(\boldsymbol {\mu }_{\mathbf {y}|k_{n}}\) and \(\boldsymbol {\mu }_{\mathbf {x}|k_{n}}\) denote the means of the k_nth Gaussian component of p(y_n|q_n) and p(x_n|q_n), respectively. The previous means \(\overline {\boldsymbol {\mu }}_{\mathbf {x}|q_{n-l}}\), l>0 are averaged over all means of the corresponding GMM p(x_{n−l}|q_{n−l}). On the other hand, [17] employs the “log-normal approximation” [10] to adapt p(y_n|q_n) according to (67). While [16] and [17] perform the adaptation once prior to recognition and then use a standard decoder, the concept proposed in [18] performs an online adaptation based on the best partial path [15].
It should be pointed out here that there is a variety of other approximations to the statistics of the log-sum of (mixtures of) Gaussian random variables (as seen in Subsections 4.2, 4.5, 6.1, 6.2, 6.6.1), ranging from different PMC methods [10] to maximum [64], piecewise linear [65], and other analytical approximations [66–70].
6.8 Takiguchi et al.
where x_n=g^{−1}(y_n) and \(J_{\mathbf {y}_{n}}\) denotes the Jacobian w.r.t. y_n.
6.9 Conditional HMMs [22] and combined-order HMMs [23]
We close this section by broadening the view and pointing to two model-based approaches that cannot be classified as “model adaptation” as they postulate different HMM topologies rather than adapting a conventional HMM. Both approaches aim at relaxing the conditional independence assumption of conventional HMMs in order to improve the modeling of the interframe correlation.
according to Fig. 9 a. Such HMMs are also known as autoregressive HMMs [26].
according to Fig. 9 b, which can be thought of as a conventional firstorder HMM with a second output pdf per state.
While conditional HMMs represent the statistically more accurate model for correlated speech feature vectors, combined-order HMMs circumvent the mathematically more complex inference step by a larger number of HMM parameters [23].
7 Relevance for DNNbased ASR
Before concluding this article, we build a bridge from the discussed model-based techniques for GMM-HMM-based ASR systems to the recent deep learning-based architectures.
The most immediate approach of exploiting conventional model-based techniques is within the framework of bottleneck or tandem systems [25]. There, deep neural networks (DNNs) are used for feature extraction while the ASR system’s acoustic model is based on GMM-HMMs. For such systems, the presented approaches could, in principle, be applied in the same way as for conventional GMM-HMMs. However, the definition of meaningful observation models seems less intuitive as the features undergo various nonlinear transforms before being presented to the GMM-HMM system.
with p(q_n|x_n) being the q_nth output node of the DNN, p(q_n) being the prior probability of each HMM state (senone), estimated from the training set, and p(x_n) being independent of the word sequence and thus to be ignored [71].
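A toy computation of this scaled, or “pseudo,” likelihood (the posterior and prior values below are invented for illustration):

```python
import math

def scaled_scores(posteriors, priors):
    """Hybrid DNN-HMM score: log p(x_n|q_n) + const = log p(q_n|x_n) - log p(q_n)."""
    return [math.log(p) - math.log(pr) for p, pr in zip(posteriors, priors)]

# Illustrative senone posteriors (DNN softmax output) and training-set priors.
posteriors = [0.70, 0.20, 0.10]
priors = [0.80, 0.10, 0.10]

scores = scaled_scores(posteriors, priors)
# The frequent senone 0 has the highest posterior but is down-weighted by its
# prior, so the rarer senone 1 obtains the best acoustic score.
best = max(range(3), key=lambda i: scores[i])
assert best == 1
```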
where p(x_n|y_n) is defined through (75). In theory, any of the previously discussed observation models could thus be directly applied to a DNN-HMM as long as resolving them for x_n is feasible. In practice, however, both the parameter estimation step and the compensation step (76) can become complex.
given the observation y _{ n }, which can, e.g., be achieved by numerical or deterministic integral approximations [78].
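One such numerical approximation is plain Monte Carlo sampling from p(x_n|y_n): draw feature samples, push each through the network, and average the resulting posteriors. The stand-in dnn_posterior below is hypothetical; a real system would call the trained DNN instead.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def dnn_posterior(x):
    """Hypothetical stand-in for a trained DNN: senone posteriors from a 1-D feature."""
    return softmax([-(x - c) ** 2 for c in (0.0, 1.0, 2.0)])

def mc_uncertain_posterior(mu_xy, var_xy, n_samples=1000, seed=0):
    """Monte Carlo approximation of ∫ p(q|x) p(x|y) dx with Gaussian p(x|y)."""
    rng = random.Random(seed)
    acc = [0.0, 0.0, 0.0]
    for _ in range(n_samples):
        x = rng.gauss(mu_xy, math.sqrt(var_xy))
        for i, p in enumerate(dnn_posterior(x)):
            acc[i] += p / n_samples
    return acc

post = mc_uncertain_posterior(0.8, 0.5)
assert abs(sum(post) - 1.0) < 1e-9  # averaging preserves the probability simplex
```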
If the transform parameters (here: b _{ n }) are estimated irrespectively of the ASR system’s acoustic model, (80) can be seen as a “conventional” feature enhancement step. If the transform parameters are discriminatively estimated using error backpropagation through the DNN, (80) could also be considered as adaptation of the DNN’s input layer weights.
8 Conclusions
In this article, we described the compensation rules of several acoustic model-based techniques employing the Bayesian formalism. Some of the presented Bayesian descriptions are already given in the original papers and others can be easily derived based on the original papers (cf. Subsections 4.3, 4.4, 4.6, and 6.9). Beyond this, however, the links of the decoding rules of the concepts of REMOS (Subsection 4.5), significance decoding (Subsection 4.7), modified imputation (Subsection 5.3), CMLLR/MLLR (Subsections 6.3 and 6.4), MAP (Subsection 6.5), Bayesian MLLR (Subsection 6.6), and Takiguchi et al. [19] (Subsection 6.8) to the Bayesian framework via the mathematical reformulations in (28), (37), (45), (55), (61), (65), and (71), respectively, are explicitly stated for the first time in this paper.
As a byproduct of the Bayesian formalism, the considered concepts are represented here as Bayesian networks, which both highlights and hides certain crucial aspects. Most importantly, neither the particular functional form of the joint pdf nor potential approximations to arrive at a tractable algorithm nor the provenance of (i.e., the estimation procedure for) the compensation parameters are reflected.

The cross-connections depicted in Figs. 3, 4, 7a, 8, and 9 show that the underlying concept aims at improving the modeling of the interframe correlation, e.g., to increase the robustness of the acoustic model against reverberation. If applied in a straightforward way, such cross-connections would entail a costly modification of the Viterbi decoder. In this paper, we summarized some important approximations that allow for a more efficient decoding of the extended Bayesian network, cf. Subsections 4.5, 4.6, 6.6.1, and 6.7. Some of these typically empirically motivated or just intuitive approximations, especially neglected statistical dependencies, become obvious from a Bayesian network, as shown in Figs. 4 and 7.

The approaches introducing instantaneous (here: purely vertical) extensions to the Bayesian network, as in Figs. 2 a–c and 5 c, usually aim at compensating for nondispersive distortions, such as additive or short-ranging convolutive noise.

The arcs in Figs. 2 c and 5 b illustrate that the observed vector y _{ n } does not only depend on the state q _{ n } (or mixture component k _{ n }) through x _{ n }. As a consequence, one can deduce that the compensation parameters do depend on the phonetic content, as in Subsections 4.4, 6.3, and 6.4.

The graphical model representation also succinctly highlights whether a Bayesian modeling paradigm is applied, as in Figs. 5 c and 6 b, or not, as in Figs. 5 a, b.

The existence of the additional latent variable x _{ n } in most of the presented Bayesian network representations expresses that an explicit observation model or an implicit statistical model between the clean and the corrupted features is employed. In contrast, the graphical representations in Figs. 5 c and 9 show that—instead of a distinct compensation step—a modified HMM topology is used.
In summary, the condensed description of the various concepts from the same Bayesian perspective shall allow other researchers to more easily exploit or combine existing techniques and to relate their own algorithms to the presented ones. This seems all the more important as the recent acoustic modeling approaches based on DNNs raise new challenges for the conventional robustness techniques [25].
Declarations
Acknowledgements
The authors would like to thank the Deutsche Forschungsgemeinschaft (DFG) for supporting this work (contract number KE 890/4-2).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 JA Arrowood, MA Clements, in Proc. ICSLP. Using Observation Uncertainty in HMM Decoding (ISCABaixas, France, 2002), pp. 1561–1564.Google Scholar
 L Deng, J Droppo, A Acero, Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion. IEEE Trans. Speech Audio Process.13(3), 412–421 (2005).View ArticleGoogle Scholar
 J Droppo, A Acero, L Deng, in Proc. ICASSP, 1. Uncertainty Decoding with SPLICE for Noise Robust Speech Recognition (IEEENew Jersey, USA, 2002), pp. 57–60.Google Scholar
 H Liao, Uncertainty decoding for noise robust speech recognition, PhD thesis (Univ. of Cambridge, 2007). http://mi.eng.cam.ac.uk/~mjfg/thesis_hl251.pdf.
 R Maas, W Kellermann, A Sehr, T Yoshioka, M Delcroix, K Kinoshita, T Nakatani, in Proc. Int. Conf. on Digital Signal Process. Formulation of the REMOS Concept from an Uncertainty Decoding Perspective (IEEENew Jersey, USA, 2013).Google Scholar
 M Cooke, P Green, L Josifovski, A Vizinho, Robust Automatic Speech Recognition with Missing and Unreliable Acoustic Data. Speech Commun.34(3), 267–285 (2001).View ArticleMATHGoogle Scholar
 B Raj, RM Stern, Missingfeature approaches in speech recognition. IEEE Signal Process. Mag.22(5), 101–116 (2005).View ArticleGoogle Scholar
 D Kolossa, A Klimas, R Orglmeister, in Proc. WASPAA. Separation and Robust Recognition of Noisy, Convolutive Speech Mixtures Using TimeFrequency Masking and Missing Data Techniques (IEEENew Jersey, USA, 2005), pp. 82–85.Google Scholar
 AH Abdelaziz, D Kolossa, in Proc. Interspeech. Decoding of Uncertain Features Using the Posterior Distribution of the Clean Data for Robust Speech Recognition (ISCABaixas, France, 2012).Google Scholar
 MJF Gales, Modelbased techniques for noise robust speech recognition, PhD thesis (1995). http://mi.eng.cam.ac.uk/~mjfg/thesis.pdf.
 A Acero, L Deng, T Kristjansson, J Zhang, in Proc. ICSLP, 3. HMM Adaptation Using Vector Taylor Series for Noisy Speech Recognition (ISCABaixas, France, 2000), pp. 869–872.Google Scholar
 VV Digalakis, D Rtischev, LG Neumeyer, Speaker adaptation using constrained estimation of Gaussian mixtures. IEEE Trans. Speech Audio Process.3(5), 357–366 (1995).View ArticleGoogle Scholar
 CJ Leggetter, PC Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech Lang.9(2), 171–185 (1995).View ArticleGoogle Scholar
 JT Chien, Linear regression based Bayesian predictive classification for speech recognition. IEEE Trans. Speech Audio Process.11(1), 70–79 (2003).View ArticleGoogle Scholar
 YQ Wang, MJF Gales, in Proc. ASRU. Improving Reverberant VTS for HandsFree Robust Speech Recognition (IEEENew Jersey, USA, 2011), pp. 113–118.Google Scholar
 HG Hirsch, H Finster, A new approach for the adaptation of HMMs to reverberation and background noise. Speech Commun.50(3), 244–263 (2008).View ArticleGoogle Scholar
 CK Raut, T Nishimoto, S Sagayama, in Proc. ICASSP, 1. Model Adaptation for Long Convolutional Distortion by Maximum Likelihood Based State Filtering Approach (IEEENew Jersey, USA, 2006), pp. 1133–1136.Google Scholar
 A Sehr, R Maas, W Kellermann, in Proc. ICASSP. Framewise HMM Adaptation Using StateDependent Reverberation Estimates (IEEENew Jersey, USA, 2011), pp. 5484–5487.Google Scholar
 T Takiguchi, M Nishimura, Y Ariki, Acoustic model adaptation using firstorder linear prediction for reverberant speech. IEICE Trans. Inform. Syst. E89D(3), 908–914 (2006).View ArticleGoogle Scholar
 V Ion, R HaebUmbach, A novel uncertainty decoding rule with applications to transmission error robust speech recognition. IEEE Trans. Audio, Speech, Lang. Process.16(5), 1047–1060 (2008).View ArticleGoogle Scholar
 JL Gauvain, CH Lee, in Proc. Workshop on Speech and Natural Lang. MAP Estimation of Continuous Density HMM: Theory and Applications (Morgan KaufmannBurlington, USA, 1992), pp. 185–190.View ArticleGoogle Scholar
 J Ming, FJ Smith, Modelling of the interframe dependence in an HMM using conditional Gaussian mixtures. Comput. Speech Lang.10:, 229–247 (1996).View ArticleGoogle Scholar
 R Maas, SR Kotha, A Sehr, W Kellermann, in Proc. Int. Workshop on Cognitive Inform. Process. CombinedOrder Hidden Markov Models for ReverberationRobust Speech Recognition (IEEENew Jersey, USA, 2012), pp. 167–171.Google Scholar
 R HaebUmbach, in Robust Speech Recognition of Uncertain or Missing Data, ed. by D Kolossa, R HaebUmbach. Uncertainty Decoding and Conditional Bayesian Estimation (SpringerBerlin Heidelberg, 2011), pp. 9–33.View ArticleGoogle Scholar
 J Li, L Deng, Y Gong, R HaebUmbach, An overview of noiserobust automatic speech recognition. IEEE/ACM Trans. Audio, Speech, Lang. Process.22(4), 745–777 (2014).View ArticleGoogle Scholar
 CM Bishop, Pattern Recognition and Machine Learning (Springer, New York, 2006).MATHGoogle Scholar
 D Kolossa, R HaebUmbach, Robust Speech Recognition of Uncertain or Missing Data (Springer, Berlin Heidelberg, 2011).View ArticleMATHGoogle Scholar
 G Heigold, H Ney, R Schlüter, S Wiesler, Discriminative training for automatic speech recognition: modeling, criteria, optimization, implementation, and performance. IEEE Signal Process. Mag.29(6), 58–69 (2012).View ArticleGoogle Scholar
 M Matassoni, M Omologo, D Giuliani, P Svaizer, Hidden Markov model training with contaminated speech material for distanttalking speech recognition. Comput. Speech Lang.16(2), 205–223 (2002).View ArticleGoogle Scholar
 A Sehr, C Hofmann, R Maas, W Kellermann, in Proc. Interspeech. A Novel Approach for Matched Reverberant Training of HMMs Using Data Pairs (ISCABaixas, France, 2010), pp. 566–569.Google Scholar
 Y Gong, Speech recognition in noisy environments: A survey. Speech Commun.16(3), 261–291 (1995). doi:10.1016/01676393(94)00059J.View ArticleGoogle Scholar
 CH Lee, On stochastic feature and model compensation approaches to robust speech recognition. Speech Commun.25(1–3), 29–47 (1998). doi:10.1016/S01676393(98)000284.View ArticleGoogle Scholar
 Q Huo, CH Lee, Robust speech recognition based on adaptive classification and decision strategies. Speech Commun.34(1–2), 175–194 (2001). doi:10.1016/S01676393(00)000534.View ArticleMATHGoogle Scholar
 J Droppo, in Springer Handbook of Speech Processing. Environmental Robustness (SpringerBerlin Heidelberg, 2008), pp. 653–680.View ArticleGoogle Scholar
 T Yoshioka, A Sehr, M Delcroix, K Kinoshita, R Maas, T Nakatani, W Kellermann, Making machines understand us in reverberant rooms: robustness against reverberation for automatic speech recognition. IEEE Signal Process. Mag.29(6), 114–126 (2012).View ArticleGoogle Scholar
 T Virtanen, R Singh, B Raj, Techniques for Noise Robustness in Automatic Speech Recognition (Wiley, UK, 2013).
 JN Holmes, WJ Holmes, PN Garner, in Proc. Eurospeech, 97. Using Formant Frequencies in Speech Recognition (ISCA, Baixas, France, 1997), pp. 2083–2087.
 TT Kristjansson, BJ Frey, in Proc. ICASSP, 1. Accounting for Uncertainty in Observations: A New Paradigm for Robust Speech Recognition (IEEE, New Jersey, USA, 2002), pp. 61–64.
 L Deng, J Droppo, A Acero, in Proc. ICSLP, 4. Exploiting Variances in Robust Feature Extraction Based on a Parametric Model of Speech Distortion, (2002), pp. 2449–2452.
 MC Benitez, JC Segura, A Torre, J Ramirez, A Rubio, in Proc. ICSLP. Including Uncertainty of Speech Observations in Robust Speech Recognition (ISCA, Baixas, France, 2004), pp. 137–140.
 V Stouten, H Van Hamme, P Wambacq, Model-based feature enhancement with uncertainty decoding for noise robust ASR. Speech Commun. 48(11), 1502–1514 (2006).
 M Delcroix, T Nakatani, S Watanabe, Static and dynamic variance compensation for recognition of reverberant speech with dereverberation preprocessing. IEEE Trans. Audio, Speech, Lang. Process. 17(2), 324–334 (2009).
 L Deng, A Acero, M Plumpe, XD Huang, in Proc. ICSLP, 3. Large Vocabulary Continuous Speech Recognition Under Adverse Conditions (ISCA, Baixas, France, 2000), pp. 806–809.
 L Deng, A Acero, L Jiang, J Droppo, X Huang, in Proc. ICASSP, 1. High-Performance Robust Speech Recognition Using Stereo Training Data (IEEE, New Jersey, USA, 2001), pp. 301–304.
 L Deng, J Droppo, A Acero, Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition. IEEE Trans. Speech Audio Process. 11(6), 568–580 (2003).
 A Sehr, R Maas, W Kellermann, Reverberation model-based decoding in the log-melspec domain for robust distant-talking speech recognition. IEEE Trans. Audio, Speech, Lang. Process. 18(7), 1676–1691 (2010).
 JA Arrowood, Using observation uncertainty for robust speech recognition, PhD thesis (Georgia Institute of Technology, 2003). https://smartech.gatech.edu/bitstream/handle/1853/5383/arrowood_jon_a_200312_phd.pdf.
 RF Astudillo, R Orglmeister, Computing MMSE estimates and residual uncertainty directly in the feature domain of ASR using STFT domain speech distortion models. IEEE Trans. Audio, Speech, Lang. Process. 21(5), 1023–1034 (2013).
 M Cooke, A Morris, P Green, in Proc. ICASSP, 2. Missing Data Techniques for Robust Speech Recognition, (1997), pp. 863–866.
 KJ Palomäki, GJ Brown, JP Barker, Techniques for handling convolutional distortion with ‘missing data’ automatic speech recognition. Speech Commun. 43(1–2), 123–142 (2004).
 PJ Moreno, Speech recognition in noisy environments, PhD thesis (Carnegie Mellon Univ., Pittsburgh, 1996). http://www.cs.cmu.edu/~robust/Thesis/pjm_thesis.pdf.
 J Li, L Deng, D Yu, Y Gong, A Acero, A unified framework of HMM adaptation with joint compensation of additive and convolutive distortions. Comput. Speech Lang. 23(3), 389–405 (2009).
 MJF Gales, Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang. 12(2), 75–98 (1998).
 K Yu, MJF Gales, Discriminative cluster adaptive training. IEEE Trans. Audio, Speech, Lang. Process. 14(5), 1694–1703 (2006).
 E Vincent, J Barker, S Watanabe, J Le Roux, F Nesta, M Matassoni, in Proc. ASRU. The Second ‘CHiME’ Speech Separation and Recognition Challenge: An Overview of Challenge Systems and Outcomes (IEEE, New Jersey, USA, 2013), pp. 162–167.
 K Kinoshita, M Delcroix, T Yoshioka, T Nakatani, A Sehr, W Kellermann, R Maas, in Proc. WASPAA. The REVERB Challenge: A Common Evaluation Framework for Dereverberation and Recognition of Reverberant Speech, (2013).
 T Anastasakos, J McDonough, R Schwartz, J Makhoul, in Proc. ICSLP, 2. A Compact Model for Speaker-Adaptive Training (ISCA, Baixas, France, 1996), pp. 1137–1140.
 S Young, G Evermann, D Kershaw, G Moore, J Odell, D Ollason, D Povey, V Valtchev, P Woodland, The HTK Book (Cambridge Univ. Eng. Dept., UK, 2002).
 M Delcroix, K Kinoshita, T Nakatani, S Araki, A Ogawa, T Hori, S Watanabe, M Fujimoto, T Yoshioka, T Oba, et al., in Proc. Int. Workshop on Mach. Listening in Multisource Environments (CHiME). Speech Recognition in the Presence of Highly Nonstationary Noise Based on Spatial, Spectral and Temporal Speech/Noise Modeling Combined with Dynamic Variance Adaptation, (2011), pp. 12–17. http://spandh.dcs.shef.ac.uk/projects/chime/workshop/papers/pS21_delcroix.pdf.
 F Xiong, N Moritz, R Rehr, J Anemüller, BT Meyer, T Gerkmann, S Doclo, S Goetze, in Proc. REVERB Workshop. Robust ASR in Reverberant Environments Using Temporal Cepstrum Smoothing for Speech Enhancement and an Amplitude Modulation Filterbank for Feature Extraction, (2014). http://reverb2014.dereverberation.com/workshop/reverb2014papers/1569899061.pdf.
 HG Hirsch, HMM adaptation for applications in telecommunication. Speech Commun. 34(1–2), 127–139 (2001).
 H Jiang, K Hirose, Q Huo, Robust speech recognition based on a Bayesian prediction approach. IEEE Trans. Speech Audio Process. 7(4), 426–440 (1999).
 JT Chien, Linear regression based Bayesian predictive classification for speech recognition. IEEE Trans. Speech Audio Process. 11(1), 70–79 (2003).
 A Nadas, D Nahamoo, MA Picheny, Speech recognition using noise-adaptive prototypes. IEEE Trans. Acoust., Speech, Signal Process. 37(10), 1495–1503 (1989).
 R Maas, A Sehr, M Gugat, W Kellermann, in Proc. European Signal Processing Conf. (EUSIPCO). A Highly Efficient Optimization Scheme for REMOS-Based Distant-Talking Speech Recognition (IEEE, New Jersey, USA, 2010), pp. 1983–1987.
 SC Schwartz, YS Yeh, On the distribution function and moments of power sums with log-normal components. Bell Syst. Tech. J. 61(7), 1441–1462 (1982).
 NC Beaulieu, AA Abu-Dayya, PJ McLane, Estimating the distribution of a sum of independent lognormal random variables. IEEE Trans. Commun. 43(12), 2869–2873 (1995).
 CK Raut, T Nishimoto, S Sagayama, in Proc. Interspeech. Model Composition by Lagrange Polynomial Approximation for Robust Speech Recognition in Noisy Environment (ISCA, Baixas, France, 2004).
 NC Beaulieu, Q Xie, An optimal lognormal approximation to lognormal sum distributions. IEEE Trans. Veh. Technol. 53(2), 479–489 (2004).
 JR Hershey, SJ Rennie, JL Roux, in Techniques for Noise Robustness in Automatic Speech Recognition, ed. by T Virtanen, R Singh, and B Raj. Factorial Models for Noise Robust Speech Recognition (Wiley, UK, 2013), pp. 311–345.
 D Yu, L Deng, Automatic Speech Recognition—A Deep Learning Approach (Springer, London, 2015).
 ML Seltzer, D Yu, Y Wang, in Proc. ICASSP. An Investigation of Deep Neural Networks for Noise Robust Speech Recognition (IEEE, New Jersey, USA, 2013), pp. 7398–7402.
 L Deng, J Li, JT Huang, K Yao, D Yu, F Seide, M Seltzer, G Zweig, X He, J Williams, Y Gong, A Acero, in Proc. ICASSP. Recent Advances in Deep Learning for Speech Research at Microsoft (IEEE, New Jersey, USA, 2013), pp. 8604–8608.
 R Gemello, F Mana, S Scanzio, P Laface, R De Mori, in Proc. ICASSP. Adaptation of Hybrid ANN/HMM Models Using Linear Hidden Transformations and Conservative Training (IEEE, New Jersey, USA, 2006), pp. 1189–1192.
 F Seide, G Li, X Chen, D Yu, in Proc. ASRU. Feature Engineering in Context-Dependent Deep Neural Networks for Conversational Speech Transcription (IEEE, New Jersey, USA, 2011), pp. 24–29.
 K Yao, D Yu, F Seide, H Su, L Deng, Y Gong, in Proc. SLT. Adaptation of Context-Dependent Deep Neural Networks for Automatic Speech Recognition (IEEE, New Jersey, USA, 2012), pp. 366–369.
 G Saon, H Soltau, D Nahamoo, M Picheny, in Proc. ASRU. Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors (IEEE, New Jersey, USA, 2013), pp. 55–59.
 RF Astudillo, JP da Silva Neto, in Proc. Interspeech. Propagation of Uncertainty Through Multilayer Perceptrons for Robust Automatic Speech Recognition (ISCA, Baixas, France, 2011), pp. 461–464.