Extraction of useful information from speech has been a subject of active research for many decades. The first step in an automatic speech recognition system is a feature extractor, which transforms a raw signal into a compact representation. The most popular features in a GMM-HMM recognizer are MFCCs, while mel-filterbank energies are the most widely used features in a DNN-HMM recognizer. In this section, we describe the features (cepstral as well as filterbank) and single-channel blind dereverberation algorithms used in the REVERB challenge 2014.
2.1 Multi-taper filterbank features
The most commonly used power spectrum estimation method in speech processing applications is the windowed direct spectrum estimator, which can be expressed mathematically as
$$ {\widehat{S}}_d(f)={\left|{\displaystyle \sum_{j=0}^{N-1}w(j)s(j){e}^{-i\frac{2\pi jf}{N}}}\right|}^2, $$
(1)
where f ∈ {0, 1,⋯, K − 1} denotes the discrete frequency index, N is the frame length, s(j) is the time-domain speech signal, and w(j) denotes the time-domain window function, also known as the taper. The taper, such as the Hamming window, is usually symmetric and decreases towards the frame boundaries.
Windowing reduces the bias (the bias of a spectrum estimator \( \widehat{\theta} \) is the expected difference between the estimated value and the true value θ of the spectrum being estimated, \( \mathrm{bias}\left(\widehat{\theta}\right)=E\left[\widehat{\theta}-\theta \right] \)), but it does not reduce the variance of the spectral estimate [14, 15]; therefore, the variance of the cepstral/filterbank features computed from this estimated spectrum remains large.
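As a concrete illustration, the windowed direct estimator of Eq. (1) amounts to squaring the DFT of a tapered frame. The sketch below uses an illustrative synthetic tone (the signal, sampling rate, and tone frequency are not from the paper):

```python
import numpy as np

def windowed_periodogram(s, w):
    """Direct spectrum estimate of Eq. (1): squared magnitude of the
    DFT of one frame s multiplied by the time-domain taper w."""
    return np.abs(np.fft.fft(w * s)) ** 2

# Illustrative frame: a 1 kHz tone at fs = 8 kHz, N = 256 samples,
# so the tone falls exactly on DFT bin 1000 / (8000 / 256) = 32.
N, fs = 256, 8000.0
t = np.arange(N) / fs
frame = np.sin(2 * np.pi * 1000.0 * t)
S_d = windowed_periodogram(frame, np.hamming(N))  # Hamming taper, as in MFB
```

A single Hamming taper concentrates the tone's energy around bin 32 but, as noted above, leaves the variance of the estimate unreduced.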
One way to reduce the variance is to replace the windowed periodogram estimate by a so-called multi-taper spectrum estimate [14, 15]. It is given by
$$ {\widehat{S}}_{MT}(f)={\displaystyle \sum_{p=1}^M\lambda (p){\left|{\displaystyle \sum_{j=0}^{N-1}{w}_p(j)s(j){e}^{-\frac{i2\pi jf}{N}}}\right|}^2}, $$
(2)
where \( {w}_p \) is the p-th data taper (p = 1, 2,…, M) used for the spectral estimate \( {\widehat{S}}_{MT}(\cdot) \); each individual tapered spectrum is also known as the p-th eigenspectrum. Here, M denotes the number of tapers, and λ(p) is the weight of the p-th taper. The tapers \( {w}_p(j) \) are typically chosen to be orthonormal so that, for all p and q,
$$ {\displaystyle \sum_j{w}_p(j){w}_q(j)}={\delta}_{pq}=\left\{\begin{array}{cc}1, & p=q\\ {}0, & \mathrm{otherwise}.\end{array}\right. $$
The multi-taper spectrum estimate is therefore obtained as the weighted average of M individual spectra. A multi-taper spectrum estimator is somewhat similar to averaging the spectra obtained from a variety of conventional tapers, such as the Hamming and Hann tapers, but in that case there would be strong redundancy, as conventional tapers are highly correlated (they share a common time-domain shape).
Figure 1 depicts the multi-taper spectrum estimation process for a frame of speech signal with M = 6 orthogonal tapers. Unlike conventional tapers, the M orthonormal tapers used in a multi-taper spectrum estimator provide M statistically independent (hence uncorrelated) estimates of the underlying spectrum. The weighted average \( {\widehat{S}}_{MT}(f) \) of the M individual spectral estimates therefore has smaller variance than the single-taper spectrum estimate \( {\widehat{S}}_d(f) \), by a factor that approaches 1/M, i.e., \( \operatorname{var}\left({\widehat{S}}_{MT}(f)\right)\approx \frac{1}{M}\operatorname{var}\left({\widehat{S}}_d(f)\right) \) [14]. According to [16], the variance of the feature vectors has a direct bearing on the variance of the Gaussians modeling the speech classes; in general, a reduction in feature vector variance increases class separability and thereby recognition accuracy [16]. Multi-taper mel-filterbank (MMFB) features are then computed from a multi-taper (e.g., Thomson) spectrum estimate instead of the Hamming-windowed periodogram estimate used in conventional mel-filterbank (MFB)/cepstral features. The motivation behind using the multi-taper method in the REVERB challenge 2014 tasks is its improved speaker recognition performance on the microphone speech portions of the NIST SRE corpora [14, 15]. In this work, two variants of MMFB features are used:
MMFBl: MMFB features with logarithmic nonlinearity
MMFBp: MMFB features with power function nonlinearity
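The variance-reduction idea behind Eq. (2) can be sketched with SciPy's Slepian (DPSS) tapers. Normalizing the eigenvalue weights to sum to one is a common convention and an assumption here, as is the white-noise test signal:

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectrum(s, M=6, NW=3.0):
    """Thomson multi-taper estimate of Eq. (2): a weighted average of M
    eigenspectra from orthonormal DPSS tapers. The weights are the
    eigenvalue concentration ratios, normalized to sum to one (a common
    convention; the paper sets lambda(p) to the raw eigenvalues)."""
    N = len(s)
    tapers, ratios = dpss(N, NW, Kmax=M, return_ratios=True)  # (M, N), (M,)
    lam = ratios / ratios.sum()
    eigenspectra = np.abs(np.fft.fft(tapers * s, axis=1)) ** 2  # (M, N)
    return lam @ eigenspectra

# White noise has a flat true spectrum, so per-bin scatter reflects
# estimator variance; averaging M eigenspectra reduces it by ~1/M.
rng = np.random.default_rng(0)
frame = rng.standard_normal(512)
S_mt = multitaper_spectrum(frame)
```

Because the tapers are unit-energy and orthonormal, the M eigenspectra are (asymptotically) uncorrelated, which is what yields the ~1/M variance reduction discussed above.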
In this work, we use the Thomson multi-taper method, in which a set of M orthonormal data tapers with good leakage properties is obtained from the Slepian sequences [17]. Slepian sequences are defined as the real, unit-energy sequences on [0, N − 1] having the greatest energy concentration in a given bandwidth. The Slepian tapers can be shown to be solutions to the following eigenvalue problem:
$$ {\displaystyle \sum_{j=0}^{N-1}{\mathbf{A}}_{nj}{w}_j^p}={\nu}^p{w}_n^p, $$
where 0 ≤ n ≤ N − 1, 0 ≤ j ≤ N − 1, \( {\mathbf{A}}_{nj}=\frac{ \sin 2\pi W\left(n-j\right)}{\pi \left(n-j\right)} \) is a real symmetric Toeplitz matrix, \( 0 < {\nu}^p \le 1 \) is the p-th eigenvalue corresponding to the p-th eigenvector \( {w}_n^p \), known as the Slepian taper, W is the half frequency bandwidth, and \( \lambda (p)={\nu}^p \), p = 1, 2,…, M. Slepian sequences (or discrete prolate spheroidal sequences, DPSS), proposed by D. Slepian in [17], were chosen as tapers in [18] because they are mutually orthonormal and possess desirable spectral concentration properties (i.e., the highest concentration of energy in the user-defined frequency interval (−W, W)). The first taper in the set of Slepian sequences is designed to produce a direct spectral estimator with minimum broadband bias (bias caused by leakage via the sidelobes). The higher-order tapers ensure minimum broadband bias while remaining orthogonal to all lower-order tapers. The tapers and taper weights in this method can be obtained using the following MATLAB function:
$$ \left[w\ \lambda \right]=\mathrm{dpss}\left(N,\ \beta,\ M\right), $$
where β is the time-half-bandwidth product. For a continuous speech recognition task, the optimal values were found to be M = 6 tapers with β = 3.0 [4, 14, 15].
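For readers without MATLAB, a roughly equivalent call (an assumption based on SciPy's documented `dpss` window function, not something the paper specifies) is:

```python
import numpy as np
from scipy.signal.windows import dpss

# SciPy equivalent of MATLAB's [w, lambda] = dpss(N, beta, M):
# N frame length, beta time-half-bandwidth product, M number of tapers.
N, beta, M = 256, 3.0, 6
w, lam = dpss(N, beta, Kmax=M, return_ratios=True)  # w: (M, N), lam: (M,)
```

With `Kmax` given, SciPy normalizes the tapers to unit energy, so they satisfy the orthonormality condition \( \sum_j w_p(j) w_q(j) = \delta_{pq} \), and `lam` holds the concentration ratios \( \nu^p \) used as taper weights.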
Both MMFB feature variants were normalized using the short-time mean and scale normalization (STMSN) method [19] with a sliding window of 1.5 s duration. Our baseline system uses conventional MFB features extracted using the Kaldi toolkit [20]. The steps for extracting MMFB features are shown in Fig. 2. MFB features are obtained from Fig. 2 as a special case of MMFB features by setting the number of tapers M = 1, the taper weight λ(1) = 1, and \( {w}_1(j) \) to the symmetric Hamming taper.
2.2 Robust compressive gammachirp filterbank and mel-filterbank features
Robust compressive gammachirp filterbank (RCGFB) and robust mel-filterbank (RMFB) features were computed following a framework similar to the robust compressive gammachirp filterbank cepstral coefficient (RCGCC) features proposed in [21]. The main motivation for using the RCGFB and RMFB feature extractors in the REVERB challenge 2014 tasks is the better recognition accuracy obtained by the RCGCC features on the AURORA-5 and AURORA-4 corpora, which represent reverberant acoustic conditions and additive noise plus varying microphone channel conditions, respectively [21, 22].
Figure 3 presents the block diagram for the RCGFB and RMFB feature extractors that incorporate a sigmoid shape suppression rule based on subband a posteriori signal-to-noise ratios (SNRs) in order to enhance the auditory spectrum.
The sigmoid-shaped weighting rule H(k, m) for enhancing the auditory spectrum \( {S}_{\mathrm{as}}\left(k,m\right) \) based on the subband a posteriori SNR (in dB) \( {\gamma}_{\mathrm{sb}}\left(k,m\right) \) can be formulated as [2]
$$ H\left(k,m\right)=\frac{1}{1+{e}^{-\frac{\gamma_{\mathrm{sb}}\left(k,m\right)-5}{\tau }}}, $$
(3)
where k is the subband index, m is the frame index, and τ is a parameter that controls the lower limit of the weighting function. The subband a posteriori SNR \( {\gamma}_{\mathrm{sb}}\left(k,m\right) \) is defined as
$$ {\gamma}_{\mathrm{sb}}\left(k,m\right)= \max \left(10{ \log}_{10}\left(\frac{S_{\mathrm{as}}\left(k,m\right)}{N_{\mathrm{as}}\left(k,m\right)}\right),-4.0\right), $$
(4)
and \( {N}_{\mathrm{as}}\left(k,m\right) \) is the noise power spectrum mapped onto the auditory frequency axis. To remove outliers in the weighting factor H(k, m) of Eq. (3) caused by noise variability, we apply a two-dimensional median filter; to smooth the decision regions, a two-dimensional moving average filter is also applied [21, 22].
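A minimal sketch of the suppression rule of Eqs. (3) and (4) follows; the value of τ and the test spectra are illustrative assumptions, not values from the paper:

```python
import numpy as np

def snr_sigmoid_gain(S_as, N_as, tau=2.0):
    """Sigmoid suppression rule of Eqs. (3)-(4): the subband a posteriori
    SNR in dB, floored at -4 dB, is mapped through a sigmoid centred at
    5 dB. tau (slope / lower-limit control) is an illustrative choice."""
    gamma_sb = np.maximum(10.0 * np.log10(S_as / N_as), -4.0)   # Eq. (4)
    return 1.0 / (1.0 + np.exp(-(gamma_sb - 5.0) / tau))        # Eq. (3)

# Two frames x two subbands: high-SNR bins pass almost unchanged,
# low-SNR bins are strongly suppressed.
S = np.array([[100.0, 1.0], [50.0, 0.1]])   # auditory spectrum (illustrative)
Nspec = np.array([[1.0, 1.0], [1.0, 1.0]])  # noise spectrum (illustrative)
H = snr_sigmoid_gain(S, Nspec)
```

The -4 dB floor in Eq. (4) bounds the sigmoid input from below, which is what gives the weighting function the lower limit that τ controls.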
Noise power spectrum estimation from the noisy speech signal plays a very important role in noise reduction/speech enhancement algorithms. Here, we use the minimum mean square error (MMSE)-based soft speech presence probability (MMSE-SPP) noise estimation approach proposed in [23]. In this method, the initial estimate of the noise power spectrum is computed by averaging the first ten frames of the speech spectrum. The advantage of this method is that it does not require the bias correction term needed by an MMSE-based noise spectrum estimator; it also overestimates the noise power less and is computationally less expensive [24]. We chose the MMSE-SPP-based noise spectrum estimation method because it is computationally simple and requires only one parameter to be tuned.
RCGFB utilizes a power function nonlinearity with a coefficient of 0.07 to approximate the loudness nonlinearity of human perception, whereas RMFB uses a logarithmic nonlinearity. For feature normalization, short-time mean and scale normalization (STMSN) is used with a sliding window of 1.5 s. Under mismatched conditions, STMSN helps to remove the difference in log spectrum between the training and test environments by adjusting the short-term mean and scale [19].
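STMSN itself can be sketched as a per-dimension sliding-window normalization. The 150-frame window below assumes a 10 ms frame shift for the stated 1.5 s duration (an assumption; the paper specifies only the duration):

```python
import numpy as np

def stmsn(features, win_frames=150):
    """Short-time mean and scale normalization: for each frame, subtract
    the mean and divide by the scale (std) computed over a sliding window
    centred on that frame, independently per feature dimension.
    win_frames = 150 assumes 1.5 s at a 10 ms frame shift."""
    T, _ = features.shape
    half = win_frames // 2
    out = np.empty_like(features, dtype=float)
    for t in range(T):
        seg = features[max(0, t - half): t + half + 1]
        mu = seg.mean(axis=0)
        sigma = seg.std(axis=0) + 1e-8   # guard against a zero scale
        out[t] = (features[t] - mu) / sigma
    return out
```

Because the window is short-time rather than utterance-wide, slowly varying channel and environment effects are removed locally, which is the mismatch-reduction property cited from [19].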
2.3 Iterative deconvolution-based features
A general nonnegative matrix factorization (NMF) framework decomposes the spectra of reverberated speech into those of the clean speech and the room impulse response filter. An iterative least squares deconvolution technique can be employed for this spectral factorization [25]. The iterative deconvolution (ITD)-based dereverberated mel-filterbank feature extraction method, introduced in [25], is presented in Fig. 4. A gammatone filterbank-integrated auditory spectrum is computed for each windowed frame, and ITD is then applied to each subband in the gammatone frequency domain. ITD, an iterative least squares approach that minimizes the errors [25]
$$ {e}_k={\displaystyle \sum_i{\left(S\left(i,k\right)-{\displaystyle \sum_mX\left(m,k\right){H}_r\left(i-m,k\right)}\right)}^2}, $$
(5)
initialized by NMF, is used to estimate the clean signal X(m, k) and the room impulse response \( {H}_r\left(m,k\right) \) from the reverberated signal S(m, k), where k is the subband index and m is the frame index. After reconstructing the dereverberated signal, 23-dimensional mel-filterbank features are computed using the Kaldi toolkit [20].
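The following is a toy sketch of least-squares deconvolution in a single subband in the spirit of Eq. (5): it alternates least-squares updates of the clean envelope and an L-tap RIR envelope, clipping both to stay nonnegative. It is an illustrative alternating scheme under those assumptions, not the exact ITD update rule of [25]:

```python
import numpy as np

def iterative_deconvolution(S, L=3, n_iter=50):
    """Alternating least-squares sketch for one subband: find nonnegative
    X (clean envelope) and H (L-tap RIR envelope) such that
    S(i) ~ sum_m X(m) H(i - m), cf. the error of Eq. (5)."""
    T = len(S)
    X = S.copy()                 # initialize the clean part with the observation
    H = np.zeros(L); H[0] = 1.0  # initialize the RIR with a unit impulse
    for _ in range(n_iter):
        # Fix X, solve for H: columns of A are delayed copies of X.
        A = np.column_stack(
            [np.concatenate([np.zeros(m), X[:T - m]]) for m in range(L)])
        H = np.maximum(np.linalg.lstsq(A, S, rcond=None)[0], 0.0)
        # Fix H, solve for X: B is the (lower-triangular) convolution matrix of H.
        B = np.zeros((T, T))
        for m in range(L):
            B += H[m] * np.eye(T, k=-m)
        X = np.maximum(np.linalg.lstsq(B, S, rcond=None)[0], 0.0)
    return X, H
```

Blind deconvolution is inherently ambiguous (e.g., X = S with an impulsive H is always an exact fit), which is why the cited method relies on an NMF initialization to steer the factorization toward a meaningful clean/RIR split.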
2.4 Maximum likelihood inverse filtering-based dereverberated features
Maximum likelihood inverse filtering-based dereverberation of a reverberated signal in the cepstral domain, proposed in [2], is shown in Fig. 5. The purpose of cepstral post-filtering is to partially decorrelate the features. If P, with \( P(z)=1/\left(1+{\displaystyle \sum_{k=1}^{M-1}p\left[k\right]{z}^{-k}}\right) \), is an M-tap infinite impulse response (IIR) dereverberation filter, then the dereverberated cepstral features \( {c}^d\left[m\right] \) can be given as
$$ {c}^d\left[m\right]=c\left[m\right]-{\displaystyle \sum_{k=1}^{M-1}p\left[k\right]}{c}^d\left[m-k\right], $$
(6)
where m is the frame index, and c[m] denotes the cepstral feature vector of the m-th frame of the reverberated speech signal.
The parameters that best describe P are obtained by maximizing the log likelihood with respect to Gaussian mixture models (GMMs) over all frames of the speech [2]. A common approximation in GMMs is to replace the overall GMM likelihood score by that of the top-N scoring Gaussian densities in the mixture [2]. Here, similar to [2], we use the top-1 approximation for the filter updates. We apply cepstral mean normalization (CMN) to remove any constant additive shift in the cepstral features, and cepstral post-filtering (CPF) [26] is used to partially decorrelate the features.
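Given estimated filter taps p[k], applying the recursion of Eq. (6) is straightforward; the taps themselves would come from the GMM likelihood maximization described above, and the example tap below is purely illustrative:

```python
import numpy as np

def inverse_filter_cepstra(c, p):
    """Apply the dereverberation recursion of Eq. (6) along the frame axis:
    c_d[m] = c[m] - sum_{k=1}^{M-1} p[k] * c_d[m - k].
    c: (num_frames, dim) cepstral features; p: taps p[1], ..., p[M-1]."""
    num_frames = c.shape[0]
    c_d = np.zeros_like(c, dtype=float)
    for m in range(num_frames):
        acc = c[m].astype(float).copy()
        for k, pk in enumerate(p, start=1):   # k runs over 1 .. M-1
            if m - k >= 0:
                acc -= pk * c_d[m - k]
        c_d[m] = acc
    return c_d
```

Because the output feeds back into later frames, this is the IIR (all-pole) structure described above: a single tap already mixes information from all preceding dereverberated frames.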