Mean square error optimal weighting for multitaper cepstrum estimation
- Maria Hansson-Sandsten
https://doi.org/10.1186/1687-6180-2013-158
© Hansson-Sandsten; licensee Springer. 2013
- Received: 22 April 2013
- Accepted: 30 September 2013
- Published: 17 October 2013
Abstract
The aim of this paper is to find a multitaper-based spectrum estimator that is mean square error optimal for cepstrum coefficient estimation. The multitaper spectrum estimator consists of windowed periodograms which are weighted together, where the weights are optimized using the Taylor expansion of the log-spectrum variance and a novel approximation for the log-spectrum bias. A thorough discussion and evaluation are also made for different bias approximations for the log-spectrum of multitaper estimators. The optimized weights are applied together with the sinusoidal tapers as the multitaper estimator. Comparisons of the cepstrum mean square error are made of some known multitaper methods as well as with the parametric autoregressive estimator for simulated speech signals.
Keywords
- Cepstrum
- Log-spectrum
- Multitaper
- Mean square error
- Optimal
- Statistics
- Bias
- Variance
1 Introduction
Cepstrum-based methods are important in many applications, especially speech analysis [1], and also in other areas such as seismic deconvolution [2], vibratory diagnosis using mechanical signals [3], and estimation of periods of surface waves traveling around the circumference of tree trunks [4]. Usually, an autoregressive (AR)-based spectrum or a windowed periodogram is used for estimation of the cepstrum coefficients. The errors caused by bias and variance might be large, and algorithms based on robust spectrum analysis techniques could be useful for better performance. Such methods, usually derived from the periodogram, have been proposed lately, e.g., cepstrum coefficient thresholding in [5] and a novel technique for power compensation of bias in [6]. In [7], a method for smoothing of the covariance function is presented.
The concept of multiple windows or multitapers was invented by David Thomson [8, 9], but multitapers were actually used much earlier in the form of one window shifted in time, the Welch method or Weighted Overlap Segmented Averaging (WOSA) by Welch [10]. The main idea of multitapers is to reduce the variance of the periodogram by averaging several uncorrelated periodograms. The time-shifted window by Welch gives uncorrelated periodograms as the time-shifted window overlaps different data sequences, although the same window was used. The idea by Thomson was to use the same data sequence for all periodograms, i.e., the whole data sequence, but to change the shape of the window for the different periodograms in a way that gave uncorrelated periodograms and thereby reduced variance. For smooth spectra, the Thomson multitaper method is used [8], but for spectra with larger dynamics and peaks, the peak matched multiple windows [11], the sinusoidal multitapers [12], and also more advanced multitaper methods, such as the adaptive Thomson method [8], have been shown to be more suitable.
A preliminary mean square error optimal multitaper cepstrum estimator has been suggested in, e.g., [13] where the optimal multitapers and weights for a comb-spectrum model were used. This estimator has been evaluated and compared with the Thomson multitapers, the sinusoidal multitapers, the Welch method, and usual windowed periodogram-based cepstrum analysis methods for speaker recognition. The results of these studies show that a multitaper estimator optimal for a speech-like spectrum model has advantages compared to traditional techniques [14–16].
The aim of this paper is to find a mean square error optimal weighting of the multitaper cepstrum estimator, based on the approximative mean square error for the log-spectrum. The expression for the bias of the log-periodogram of a Gaussian process has been proposed and thoroughly evaluated in [6, 17]. For the sinusoidal multitapers, the properties of the log-spectrum of locally white noise were derived in [18]. In [19], a more accurate expression for the bias was proposed. The attempt in this paper is to further simplify the expression of the bias of the log-spectrum using different Mercator series and to use such an approximation together with the Taylor expansion of the variance of the multitaper log-spectrum [18, 19] to find mean square error optimal weights of the multitaper cepstrum.
The outline of the paper is as follows: In Section 2, suggestions of the approximative statistics for the cepstrum and log-spectrum are presented. Section 3 presents and evaluates mean square error optimal weighting factors for the log-spectrum. In Section 4, evaluation and comparison of the mean square error of the cepstrum for speech-like processes are given. The paper is concluded in Section 5.
2 Approximative statistics of the multitaper log-spectrum estimate
where $r_c(n)$ and $S_x(f)$ are the true cepstrum and spectral density, respectively. The mean square error at the frequency value f can be divided into
where V[∗] denotes variance.
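As a reminder of the decomposition referred to above, the following is a sketch in assumed notation, where $\hat{S}_x(f)$ denotes the multitaper spectrum estimate; the source's own equation may be arranged differently. It is the standard split of a mean square error into squared bias and variance, applied to the log-spectrum estimate:

```latex
E\left[\left(\log \hat{S}_x(f) - \log S_x(f)\right)^{2}\right]
  = \left(E\left[\log \hat{S}_x(f)\right] - \log S_x(f)\right)^{2}
  + V\left[\log \hat{S}_x(f)\right]
```

The first term is the squared bias treated in Section 2.1, and the second is the variance treated in Section 2.2.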
2.1 Expected value and bias of the log-spectrum
with equality for locally white noise. This equality is also expressed in [6] for the log-periodogram and also includes super-Gaussian and sub-Gaussian distributions of spectral coefficients. The number of multitapers is K, and ψ(K) is the digamma function, which can be recursively computed as $\psi(K+1)=\psi(K)+\frac{1}{K}$ with ψ(1)=−γ. For the case of K=1, Equations 6 and 7 coincide, but for larger values of K, the difference ψ(K)−log(K) approaches zero, e.g., for K=2, ψ(2)−log(2)≈−0.270, and for K=6, ψ(6)−log(6)≈−0.0856.
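The recursion for the digamma function and the quoted values of ψ(K)−log(K) can be checked numerically. The following is a minimal sketch for integer K only; the function names are illustrative and not taken from the paper.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant, psi(1) = -gamma


def digamma_int(K):
    """Digamma psi(K) for a positive integer K via psi(K+1) = psi(K) + 1/K."""
    if K < 1:
        raise ValueError("K must be a positive integer")
    psi = -EULER_GAMMA
    for k in range(1, K):
        psi += 1.0 / k
    return psi


def log_bias_term(K):
    """The difference psi(K) - log(K) discussed in the text."""
    return digamma_int(K) - math.log(K)


for K in (1, 2, 6):
    print(K, round(log_bias_term(K), 4))
```

The magnitude of the difference shrinks as K grows, which is why the term is small for the multitaper case.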
which was suggested in [19]. The second term $\frac{V[\hat{S}_x(f)]}{2E^{2}[\hat{S}_x(f)]}$ (green lines) is shown to be very similar to the true difference for higher values of K (e.g., K=6, 12).
The approximation term (ψ(K)− log(K)) from Equation 10 is also neglected, as this term, for the multitaper case, is small compared to the error in the omitted higher-order terms.
Using an Euler expansion on the above Mercator series gives another Mercator series, $\log\left(\frac{x}{x-1}\right)=\frac{1}{x}+\frac{1}{2x^{2}}+\frac{1}{3x^{3}}+\dots$, which is valid for all x>1. Replacing $\frac{x}{x-1}$ with $\frac{E[\hat{S}_x]}{S_x}$ gives $x=\frac{E[\hat{S}_x]}{E[\hat{S}_x]-S_x}>1$, which holds if $E[\hat{S}_x]>S_x$, and the error between the expected value and the true spectrum may then be large. Expanding the bias using only the first two terms of this series will give
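The series and its two-term truncation can be explored numerically. The sketch below assumes that the quantity being approximated is $\log(E[\hat{S}_x]/S_x)$ with $E[\hat{S}_x]>S_x$, as described in the text; the function names and the example values are illustrative only.

```python
import math


def log_ratio_series(x, terms):
    """Partial sum of log(x/(x-1)) = sum_{n>=1} 1/(n x^n), used here for x > 1."""
    if x <= 1:
        raise ValueError("the series is used here only for x > 1")
    return sum(1.0 / (n * x**n) for n in range(1, terms + 1))


def bias_two_terms(expected, true):
    """Two-term approximation of log(expected/true), assuming expected > true."""
    x = expected / (expected - true)
    return log_ratio_series(x, 2)


# Hypothetical example: the expected value overshoots the true spectrum by 10%.
print(bias_two_terms(1.1, 1.0), math.log(1.1))
```

With x = E/(E−S), the two retained terms equal (E−S)/E + (E−S)²/(2E²), so the truncation error grows as the expected value moves further above the true spectrum.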
2.2 Variance of the log-spectrum
where ψ′(K) is the trigamma function, which is recursively computed by $\psi'(K+1)=\psi'(K)-\frac{1}{K^{2}}$ with $\psi'(1)=\frac{\pi^{2}}{6}$.
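The trigamma recursion is easy to evaluate for integer K. The following minimal sketch (illustrative function name) also compares the result with 1/K, the large-K behaviour of the trigamma function, to show how the log-spectrum variance term decreases with the number of tapers.

```python
import math


def trigamma_int(K):
    """Trigamma psi'(K) for a positive integer K via psi'(K+1) = psi'(K) - 1/K^2."""
    if K < 1:
        raise ValueError("K must be a positive integer")
    value = math.pi**2 / 6  # psi'(1)
    for k in range(1, K):
        value -= 1.0 / k**2
    return value


for K in (1, 2, 6, 12):
    print(K, round(trigamma_int(K), 4), round(1.0 / K, 4))
```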
was shown to be a sufficiently accurate approximation for speech-like processes. This approximation is referred to as expected value normalized variance approximation (ENVA).
3 Mean square error optimal weighting of the multitaper cepstrum
where ENBA(1) and ENVA are applied as approximations of the bias and variance of the log-spectrum, respectively. This approximation shows that normalizing the sum of all MSE_{f} of the spectral estimator $\hat{S}_x(f)$ with the squared expected value of $\hat{S}_x(f)$ gives a reasonable approximation of the mean square error for the estimator $\log \hat{S}_x(f)$ and is thereby also related to the MSE of Equation 4. It is therefore reasonable to assume that minimizing Equation 20 for all f also minimizes Equation 4 and thereby gives an optimal estimator for the cepstrum coefficients $\hat{r}_c(n)$.
The optimization criterion of Equation 20 includes the expressions of Equations 21 and 23 with unknown h_{k} and α_{k}, k=0…K−1. In the further optimization, the multitapers h_{k} are assumed to be known and to be the sinusoidal tapers of [12] with N=256. The only unknowns are the weighting factors α_{k}, k=0…K−1, which, however, appear in both the numerator and the denominator.
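For orientation, the sinusoidal tapers and a weighted sum of windowed periodograms can be sketched as follows. The taper formula is the usual orthonormal sinusoidal form; the estimator is the generic weighted multitaper form with weights α_k, not the optimized weights of this paper, and all function names are illustrative.

```python
import cmath
import math


def sinusoidal_tapers(N, K):
    """K orthonormal sinusoidal tapers of length N, indexed k = 0..K-1."""
    c = math.sqrt(2.0 / (N + 1))
    return [[c * math.sin(math.pi * (k + 1) * (n + 1) / (N + 1))
             for n in range(N)] for k in range(K)]


def multitaper_spectrum(x, tapers, weights, f):
    """Weighted sum of windowed periodograms at normalized frequency f."""
    total = 0.0
    for h, alpha in zip(tapers, weights):
        # Windowed Fourier transform of the whole data sequence with taper h
        y = sum(xn * hn * cmath.exp(-2j * math.pi * f * n)
                for n, (xn, hn) in enumerate(zip(x, h)))
        total += alpha * abs(y) ** 2
    return total


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))
```

Equal weights α_k = 1/K give the ordinary sinusoidal multitaper estimate; the optimization described in the text replaces them with unequal weights.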
The choice of multitapers is crucial, and for an application where the data can be expected to originate from a highly dynamical spectrum, the Slepian multitapers [8] could be a better choice. The concern in this paper is the application to speech signals, where the spectrum can be expected to have peaks, usually not too sharp, and overall a moderate dynamic range.
As in all periodogram-based spectrum analysis methods, the multitaper estimation method can be considered to be a filtering procedure in an FIR filter bank, where all filter functions can be modulated to an identical baseband filter with center frequency 0. For each frequency, the input signal is consequently demodulated and filtered through the baseband filter [20]. As baseband filter, a simple AR(1) spectrum is used, with a peak located at zero frequency, i.e., one pole in ρ. The resulting optimal weights for two different cases of ρ are presented, where the corresponding covariance matrix R_{x} is used in Equation 20. The AR(1) spectrum is a simple model but reasonable for speech data, as speech data are often modeled as AR processes (order 10 to 20). The average damping of the different poles (ρ) of such an estimated AR spectrum from real data will give an idea of what damping factor should be chosen for the AR(1) model for the optimization of the weights. How this averaging and choice should be made is left for further studies.
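The covariance matrix R_{x} of an AR(1) model has a simple closed form. The sketch below assumes the common parameterization x(n) = ρ·x(n−1) + e(n) with innovation variance `sigma2`, for which the covariance at lag τ is sigma2·ρ^{|τ|}/(1−ρ²); the scaling used in the source is not stated here, and the function names are illustrative. The quadratic form h^T R_{x} h is the expected value of a windowed periodogram at zero frequency for a real window h.

```python
def ar1_covariance(N, rho, sigma2=1.0):
    """N x N Toeplitz covariance matrix of a stationary AR(1) process, |rho| < 1."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must satisfy |rho| < 1")
    scale = sigma2 / (1.0 - rho**2)
    return [[scale * rho ** abs(i - j) for j in range(N)] for i in range(N)]


def quadratic_form(h, R):
    """h^T R h: expected windowed periodogram at f = 0 for a real window h."""
    n = len(h)
    return sum(h[i] * R[i][j] * h[j] for i in range(n) for j in range(n))


def expected_multitaper_at_zero(tapers, weights, R):
    """Expected value of the weighted multitaper estimate at f = 0."""
    return sum(a * quadratic_form(h, R) for h, a in zip(tapers, weights))


# Arbitrary illustration: two-sample rectangular window with unit energy.
h = [2 ** -0.5, 2 ** -0.5]
print(quadratic_form(h, ar1_covariance(2, 0.5)))
```

Because the weights α_k enter such expected values linearly, while the criterion normalizes by the squared expected value, they appear in both the numerator and the denominator of the criterion, as noted in the text.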
and the frequency values are chosen as $f_n=\frac{n}{2N}$. The optimization bandwidth W can be varied, and for a frequency-localized estimator, only the tapers that have their center frequency inside the band should be included. The center frequencies of the sinusoidal tapers are $f_i=\frac{i}{2(N+1)}$, i=0…N−1, and the highest-frequency taper to be included in the band |f|<W/2 has index i<(W/2)·2(N+1)=W·(N+1), giving K=i+1<W·(N+1)+1. The chosen optimization bandwidth is crucial for the resolution of the final estimate, and it should be chosen at least somewhat smaller than the preferred resolution of the final estimate, as is done in spectrum analysis. The local in-band multitaper cepstrum bias of the sinusoidal tapers is shown in [18] to be bounded by $\frac{S_x''(f)}{S_x(f)}\frac{K^{2}}{24N^{2}}$ for equal weights and can be expected to be smaller than for the Slepian multitapers. The Slepian multitapers, however, have better leakage properties, i.e., smaller out-of-band bias [8]. The sampling frequency of the actual process will affect an estimated ρ as well as the choice of the bandwidth parameter W. For example, reducing the sampling frequency by a factor of 2 will give half the number of data values N, which will increase the in-band bias by a factor of 4, but the reduced number of samples will be fully compensated by the decrease of ρ. For the AR(1) model, the damping factor will change from ρ to ρ^{2}, making the spectrum shape significantly smoother. The bandwidth parameter W can then be twice as large, as the actual spectrum peaks of the data are now a factor of 2 further from each other compared to the non-reduced sampling frequency. The number of tapers will then be approximately the same, as K≈W·N, where N is halved but W is doubled. Thereby, the variance will not change significantly.
However, a reduction of the sampling frequency is always beneficial, if possible, up to the point where actual information is lost; a more thorough analysis of the sampling effects is left for future research.
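The rule for the number of tapers inside the band can be written as a one-line computation. This is a minimal sketch with an illustrative function name: the taper indices must satisfy i < W·(N+1), so the count of admissible indices 0, 1, … is the ceiling of W·(N+1).

```python
import math


def num_sinusoidal_tapers(W, N):
    """Number of sinusoidal tapers with center frequency i/(2(N+1)) below W/2."""
    limit = W * (N + 1)      # taper index i must satisfy i < limit
    return math.ceil(limit)  # admissible indices are 0 .. ceil(limit) - 1


for W in (0.02, 0.04, 0.08):
    print(W, num_sinusoidal_tapers(W, 256))  # prints 6, 11, and 21 tapers
```

For N=256, these are the values of K listed in parentheses for OPT098 and OPT093 in the tables of this section.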
Evaluation of ξ_{ev} of the optimal weighting OPT098 for different estimation and evaluation bandwidths W
ξ_{ev} (K) | W=0.02 | W=0.04 | W=0.08 |
---|---|---|---|
OPT098 | 0.563 (6) | 0.424 (11) | 0.301 (21) |
SIN_{opt} | 0.618 (3) | 0.532 (3) | 0.423 (4) |
THOM_{opt} | 0.674 (3) | 0.572 (3) | 0.453 (4) |
WELCH_{opt} | 0.613 (4) | 0.505 (4) | 0.408 (4) |
HAMM | 1.91 (1) | 1.92 (1) | 1.93 (1) |
Evaluation of ξ_{ev} of the optimal weighting OPT093 for different estimation and evaluation bandwidths W
ξ_{ev} (K) | W=0.02 | W=0.04 | W=0.08 |
---|---|---|---|
OPT093 | 0.221 (6) | 0.201 (11) | 0.178 (21) |
SIN_{opt} | 0.242 (7) | 0.225 (8) | 0.206 (8) |
THOM_{opt} | 0.252 (7) | 0.235 (8) | 0.217 (7) |
WELCH_{opt} | 0.247 (8) | 0.213 (9) | 0.192 (9) |
HAMM | 1.95 (1) | 1.95 (1) | 1.96 (1) |
4 Cepstrum analysis of speech processes
Note that the cepstrum coefficient at n=0 is excluded in this analysis. The reason is that the zeroth coefficient corresponds to a constant energy level of the spectrum and is usually omitted in most cepstrum applications.
The estimators OPT098 and OPT093 from the previous section are applied and compared with THOM_{opt}, WELCH_{opt}, and SIN_{opt} as above, where the result for the number of multitapers giving the smallest error is presented. A comparison with an AR estimator is also made, where the result for the model order (using the AIC criterion) giving the smallest error is presented. The result of the single Hamming window periodogram (HAMM) is also added, as this method is often applied in speech analysis. The result of this method is, however, much worse than that of any of the multitaper methods.
Cepstrum ξ_{c} for simulated AR processes, where the AR model is estimated from ‘A’ of hallo
ξ_{c} (K,M) | M_{1} (49) | M_{2} (12) | M_{3} (14) | F_{1} (39) | F_{2} (12) | F_{3} (43) |
---|---|---|---|---|---|---|
OPT098_{002} | 0.546 (6) | 0.323 (6) | 0.323 (6) | 0.583 (6) | 0.322 (6) | 0.554 (6) |
OPT098_{004} | 0.532 (11) | 0.294 (11) | 0.290 (11) | 0.582 (11) | 0.290 (11) | 0.531 (11) |
OPT098_{008} | 0.529 (21) | 0.259 (21) | 0.257 (21) | 0.590 (21) | 0.245 (21) | 0.522 (21) |
OPT093_{002} | 0.703 (6) | 0.208 (6) | 0.223 (6) | 0.734 (6) | 0.202 (6) | 0.746 (6) |
OPT093_{004} | 0.693 (11) | 0.176 (11) | 0.194 (11) | 0.724 (11) | 0.158 (11) | 0.689 (11) |
OPT093_{008} | 0.673 (21) | 0.182 (21) | 0.191 (21) | 0.716 (21) | 0.156 (21) | 0.663 (21) |
SIN_{opt} | 0.630 (3) | 0.193 (8) | 0.216 (7) | 0.643 (4) | 0.179 (8) | 0.629 (3) |
THOM_{opt} | 0.661 (3) | 0.198 (8) | 0.224 (7) | 0.671 (3) | 0.186 (8) | 0.661 (3) |
WELCH_{opt} | 0.590 (4) | 0.186 (8) | 0.205 (8) | 0.633 (4) | 0.167 (9) | 0.608 (4) |
HAMM | 1.69 (1) | 1.64 (1) | 1.65 (1) | 1.71 (1) | 1.63 (1) | 1.70 (1) |
AR_{opt} | 0.964 (49) | 0.165 (12) | 0.140 (14) | 0.362 (39) | 0.281 (12) | 0.611 (43) |
Studying the errors of the multitaper methods, it can be seen that one of the proposed estimators, either OPT098 or OPT093, gives the smallest error in almost all cases, followed by WELCH_{opt}, SIN_{opt}, and THOM_{opt}. In most cases, the number of tapers needed is just two or three more than for the equally weighted multitaper methods, e.g., for M_{1}, the error given by OPT098_{002} (K=6) is much smaller than the error from WELCH_{opt} (K=4). Similarly, for F_{2}, the error given by OPT093_{004} (K=11) is substantially smaller than the error from WELCH_{opt} (K=9). In several cases, as expected from AR model simulations, AR_{opt} gives a much better result, e.g., M_{3} and F_{1}. However, in other cases, the error of AR_{opt} is much larger than that of the multitaper methods, e.g., M_{1} and F_{2}. It is also interesting to note that the error of the single Hamming window, HAMM, is almost the same for all speakers. This is in concordance with the expressions given in [6, 17], where the bias is approximately zero and the total variance as well as the total mean square error is π^{2}/6≈1.64, for all cepstrum coefficients, excluding the zeroth coefficient.
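The value π²/6 for the single-window case can be illustrated with a small simulation. The sketch below rests on a stated assumption, not on the paper's simulation setup: for Gaussian white noise with unit spectrum, a single-taper periodogram value away from zero and Nyquist frequency is modeled as an exponentially distributed variable with mean 1. Its logarithm then has mean ψ(1)=−γ and variance ψ′(1)=π²/6. The function name is illustrative.

```python
import math
import random


def log_exponential_moments(num_samples, seed=0):
    """Sample mean and variance of log(E) for E ~ Exp(1), a periodogram model."""
    rng = random.Random(seed)
    logs = []
    for _ in range(num_samples):
        e = rng.expovariate(1.0)
        if e > 0.0:  # guard against log(0) in the (practically impossible) zero draw
            logs.append(math.log(e))
    mean = sum(logs) / len(logs)
    var = sum((v - mean) ** 2 for v in logs) / (len(logs) - 1)
    return mean, var


mean, var = log_exponential_moments(200000)
print(mean, var, math.pi**2 / 6)
```

The constant mean offset −γ of the log-periodogram is the same at all frequencies and therefore affects only the zeroth cepstrum coefficient, which is excluded from the analysis.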
Cepstrum ξ_{c} for simulated AR processes, where the AR model is estimated from different sequences of hallo
ξ_{c} (K,M) | M_{1} (4−42) | M_{2} (2−17) | M_{3} (9−20) | F_{1} (11−49) | F_{2} (9−50) | F_{3} (7−49) |
---|---|---|---|---|---|---|
OPT098_{002} | 0.363 (6) | 0.325 (6) | 0.328 (6) | 0.546 (6) | 0.395 (6) | 0.648 (6) |
OPT098_{004} | 0.338 (11) | 0.294 (11) | 0.299 (11) | 0.520 (11) | 0.372 (11) | 0.649 (11) |
OPT098_{008} | 0.314 (21) | 0.253 (21) | 0.261 (21) | 0.510 (21) | 0.351 (21) | 0.663 (21) |
OPT093_{002} | 0.313 (6) | 0.214 (6) | 0.220 (6) | 0.705 (6) | 0.399 (6) | 0.995 (6) |
OPT093_{004} | 0.304 (11) | 0.177 (11) | 0.190 (11) | 0.622 (11) | 0.397 (11) | 1.04 (11) |
OPT093_{008} | 0.301 (21) | 0.175 (21) | 0.189 (21) | 0.615 (21) | 0.385 (21) | 0.974 (21) |
SIN_{opt} | 0.316 (6) | 0.200 (8) | 0.210 (8) | 0.627 (4) | 0.3917 (5) | 0.727 (3) |
THOM_{opt} | 0.328 (6) | 0.212 (8) | 0.217 (7) | 0.671 (3) | 0.428 (5) | 0.771 (3) |
WELCH_{opt} | 0.302 (6) | 0.203 (9) | 0.201 (8) | 0.624 (4) | 0.422 (5) | 0.716 (3) |
HAMM | 1.65 (1) | 1.65 (1) | 1.64 (1) | 1.70 (1) | 1.67 (1) | 1.71 (1) |
AR_{opt} | 0.428 (19) | 0.361 (12) | 0.171 (13) | 0.722 (48) | 0.663 (27) | 0.635 (45) |
5 Conclusions
A cepstrum estimator is proposed based on a weighted multitaper spectrum. An evaluation of different approximations for bias and variance of the multitaper log-spectrum is made, and a mean square error criterion is proposed that includes novel approximations of the bias and variance. The weights of the multitaper spectrum are optimized, and the new estimator, the optimal weights combined with the sinusoidal tapers, is evaluated for cepstrum estimation of speech-like processes. The results show that a 10% to 20% reduction of the mean square error of the cepstrum can be achieved, at the cost of two or three additional periodogram computations.
References
- Quatieri TF: Discrete-Time Speech Signal Processing. Upper Saddle River: Prentice Hall; 2002.
- Tribolet JM: Seismic Applications of Homomorphic Signal Processing. Englewood Cliffs: Prentice Hall; 1979.
- Badaoui ME, Guillet F, Danière J: New applications of the real cepstrum to gear signals, including definition of a robust fault indicator. Mech. Syst. Signal Proc 2004, 18: 1031-1046. 10.1016/j.ymssp.2004.01.005
- Hansson M, Axmon J: A multiple window cepstrum analysis for estimation of periodicity. IEEE Trans. Signal Process 2007, 55(2):474-481.
- Stoica P, Sandgren N: Total-variance reduction via thresholding: application to cepstral analysis. IEEE Trans. Signal Process 2007, 55(1):66-72.
- Gerkmann T, Martin R: On the statistics of spectral amplitudes after variance reduction by temporal cepstrum smoothing and cepstral nulling. IEEE Trans. Signal Process 2009, 57(11):4165-4174.
- Sandberg J, Hansson-Sandsten M: Optimal cepstrum smoothing. Signal Process 2012, 92: 1290-1301. 10.1016/j.sigpro.2011.11.026
- Thomson DJ: Spectrum estimation and harmonic analysis. Proc. IEEE 1982, 70(9):1055-1096.
- Walden AT: A unified view of multitaper multivariate spectral estimation. Biometrika 2000, 87(4):767-788. 10.1093/biomet/87.4.767
- Welch PD: The use of fast Fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Trans. Audio Electroacoustics 1967, AU-15(2):70-73.
- Hansson M, Salomonsson G: A multiple window method for estimation of peaked spectra. IEEE Trans. Signal Process 1997, 45(3):778-781. 10.1109/78.558503
- Riedel KS: Minimum bias multiple taper spectral estimation. IEEE Trans. Signal Process 1995, 43(1):188-195. 10.1109/78.365298
- Hansson-Sandsten M, Sandberg J: Optimal cepstrum estimation using multiple windows. In Proc. of the ICASSP. Taipei, Taiwan: IEEE; 19–24 April 2009.
- Kinnunen T, Saeidi R, Sandberg J, Hansson-Sandsten M: What else is new than the hamming window? Robust mfccs for speaker recognition via multitapering. In Interspeech 2010. Makuhari, Japan: ISCA; 26–30 Sept 2010.
- Kinnunen T, Saeidi R, Sedlak F, Lee KA, Sandberg J, Hansson-Sandsten M, Li R: Low-variance multitaper mfcc features: a case study in robust speaker verification. IEEE Trans. Speech, Audio Language Process 2012, 20(7):1990-2001.
- Hanilci C, Kinnunen T, Saeidi R, Pohjalainen J, Alku P, Ertas F, Sandberg J, Hansson-Sandsten M: Comparing spectrum estimators in speaker verification under additive noise degradation. In Proc. of the ICASSP. Kyoto, Japan: IEEE; 25–30 March 2012.
- Ephraim Y, Rahim M: On second-order statistics linear estimation of cepstral coefficients. IEEE Trans. Speech Audio Process 1999, 7: 162-176. 10.1109/89.748121
- Riedel KS, Sidorenko A: Adaptive smoothing of the log-spectrum with multiple tapering. IEEE Trans. Signal Process 1996, 44(7):1794-1800. 10.1109/78.510625
- Sandberg J, Hansson-Sandsten M, Kinnunen T, Saeidi R, Flandrin P, Borgnat P: Multitaper estimation of frequency-warped cepstra, with application to speaker verification. IEEE Signal Process. Lett 2010, 17(4):343-346.
- Stoica P, Moses R: Spectral Analysis of Signals. Upper Saddle River: Prentice Hall; 2004.
- Fletcher R: Practical Methods of Optimization. Chichester: Wiley; 1987.
- Hansson M: Optimized weighted averaging of peak matched multiple window spectrum estimates. IEEE Trans. Signal Process 1999, 47(4):1141-1146. 10.1109/78.752613
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.