Mean square error optimal weighting for multitaper cepstrum estimation

The aim of this paper is to find a multitaper-based spectrum estimator that is mean square error optimal for cepstrum coefficient estimation. The multitaper spectrum estimator consists of windowed periodograms which are weighted together, where the weights are optimized using the Taylor expansion of the log-spectrum variance and a novel approximation for the log-spectrum bias. A thorough discussion and evaluation are also made for different bias approximations for the log-spectrum of multitaper estimators. The optimized weights are applied together with the sinusoidal tapers as the multitaper estimator. Comparisons of the cepstrum mean square error are made of some known multitaper methods as well as with the parametric autoregressive estimator for simulated speech signals.


Introduction
Cepstrum-based methods are important in many applications, especially speech analysis [1], and also in other areas such as, e.g., seismic deconvolution [2], vibratory diagnosis using mechanical signals [3], and estimation of periods of surface waves traveling around the circumference of tree trunks [4]. Usually, an autoregressive (AR)-based spectrum or a windowed periodogram is used for estimation of the cepstrum coefficients. The errors caused by bias and variance might be large, and algorithms based on robust spectrum analysis techniques could be useful for better performance. Such methods, usually derived from the periodogram, have been proposed lately, e.g., cepstrum coefficient thresholding in [5] and a novel technique for power compensation of bias in [6]. In [7], a method for smoothing of the covariance function is presented.
The concept of multiple windows or multitapers was invented by David Thomson [8,9], but multitapers were actually used much earlier in the form of one window shifted in time, the Welch method or Weighted Overlap Segmented Averaging (WOSA) by Welch [10]. The main idea of multitapers is to reduce the variance of the periodogram by averaging several uncorrelated periodograms. The time-shifted window by Welch gives uncorrelated periodograms as the time-shifted window overlaps different data sequences, although the same window was used. The idea by Thomson was to use the same data sequence for all periodograms, i.e., the whole data sequence, but to change the shape of the window for the different periodograms in a way that gave uncorrelated periodograms and thereby reduced variance. For smooth spectra, the Thomson multitaper method is used [8], but for spectra with larger dynamics and peaks, the peak matched multiple windows [11], the sinusoidal multitapers [12], and also more advanced multitaper methods, such as the adaptive Thomson method [8], have been shown to be more suitable.
A preliminary mean square error optimal multitaper cepstrum estimator has been suggested in, e.g., [13] where the optimal multitapers and weights for a comb-spectrum model were used. This estimator has been evaluated and compared with the Thomson multitapers, the sinusoidal multitapers, the Welch method, and usual windowed periodogram-based cepstrum analysis methods for speaker recognition. The results of these studies show that a multitaper estimator optimal for a speech-like spectrum model has advantages compared to traditional techniques [14][15][16].
The aim of this paper is to find a mean square error optimal weighting of the multitaper cepstrum estimator, based on the approximative mean square error for http://asp.eurasipjournals.com/content/2013/1/158 the log-spectrum. The expression for the bias of the logperiodogram of a Gaussian process has been proposed and thoroughly evaluated in [6,17]. For the sinusoidal multitapers, the properties of the log-spectrum of locally white noise were derived in [18]. In [19], a more accurate expression for the bias was proposed. The attempt in this paper is to further simplify the expression of the bias of the log-spectrum using different Mercator series and to use such an approximation together with the Taylor expansion of the variance of the multitaper log-spectrum [18,19] to find mean square error optimal weights of the multitaper cepstrum.
The outline of the paper is as follows: In Section 2, suggestions of the approximative statistics for the cepstrum and log-spectrum are presented. Section 3 presents and evaluates mean square error optimal weighting factors for the log-spectrum. In Section 4, evaluation and comparison of the mean square error of the cepstrum for speech-like processes are given. The paper is concluded in Section 5.

Approximative statistics of the multitaper log-spectrum estimate
From the discrete-time stationary stochastic process x(n), with spectral density S x (f ), the windowed periodogram is estimated aŝ where the superscript T denotes the transposed vector. The multitaper spectrum is computed aŝ using different window functions h k (n) in Equation 1 and The estimate of the realvalued symmetrical multitaper cepstrum is then defined asr for all integer values of n, with log as the natural logarithm. The total mean square error (MSE) of the cepstrum estimatorr c (n) and corresponding log-spectrum estimator is defined as where r c (n) and S x ( f ) are the true cepstrum and spectral density, respectively. The mean square error at the frequency value f can be divided into where V [ * ] denotes variance.

Expected value and bias of the log-spectrum
A well-known expression for the expected value of the logperiodogram of a Gaussian process (see, e.g., [17]) is where γ ≈ 0.577 is the Euler constant. For the logarithm of a multitaper periodogram using the sinusoidal tapers, it was shown in [18] that the expected value is with equality for locally white noise. This equality is also expressed in [6] for the log-periodogram and also includes super-Gaussian and sub-Gaussian distributions of spectral coefficients. The number of multitapers is K, and ψ(K) is the digamma function, which can be recursively computed as ψ(K + 1) = ψ(K) + 1 K with ψ(1) = −γ . For the case of K = 1, Equations 6 and 7 coincide, but for larger values of K, the difference ψ(K) − log(K) approaches zero, e.g., for K = 2, ψ(2) − log(2) ≈ −0.270, and for K = 6, ψ(6) − log(6) ≈ −0.0856.
To verify if Equation 7 also holds for a varying spectrum, a simulated example is shown in Figure 1 showing the difference (log E Ŝ x ( f ) − E logŜ x ( f ) ) for K = 1, 2, 6, and 12 (blue lines). The simulated process is an AR(12) process (poles in 0.95e ±i2π0.05 , 0.92e ±i2π0.10 , 0.95e ±i2π0.20 , 0.96e ±i2π0.30 , 0.92e ±i2π0.35 , 0.95e ±i2π0.40 ) where the expected values (log E Ŝ x ( f ) and E logŜ x ( f ) ) are estimated from 10,000 realizations. The multitaper spectrum is computed according to Equation 2 using the equally weighted sinusoidal tapers of length N = 256 [12]. The difference coincides very well with − ψ(K) − log(K) (red lines) for lower values of K, but for higher values of K, the variation of the blue line http://asp.eurasipjournals.com/content/2013/1/158 becomes larger over the different frequency values. However, for varying spectrum and especially speech-like processes, a more accurate Taylor expansion approximation is defined by which was suggested in [19]. The second term (green lines) is shown to be very similar to the true difference for higher value of K (e.g., K = 6, 12). The true log-spectrum bias (TLSB) is and using the definition from Equation 7 and extending from locally white noise, the approximate log-spectrum bias (ALSB) is defined as An expansion of the term log . . is sometimes applied although often referred to as inaccurate. Replacing and the best approximation is given when E Ŝ x S x is close to 1, i.e., the expected value is close to the true spectrum. The two first terms in a more thorough approximation are used, referred to as two-term true spectrum normalized bias approximation, A simpler approximation is proposed for comparison, referred to as one-term true spectrum normalized bias approximation, TNBA(1), The approximation term ψ(K) − log(K) from Equation 10 is also neglected, as this term, for the multitaper case, is small compared to the error in the omitted higher-order terms.
Using a Euler expansion on the above Mercator series gives another Mercator series as log x and the error between the expected value and the true spectrum could be large. Expanding the bias using only the two first terms of this series will give http://asp.eurasipjournals.com/content/2013/1/158 Similarly, as above, the bias approximation bias ≈ is referred to as the two-term expected value normalized bias approximation, ENBA(2). A simpler approximation, one-term expected value normalized bias approximation, is also suggested. The ALSB, TNBA(2), and ENBA(2) will give about the same values for the single window case (K = 1), but for the multitaper log-spectrum, the differences might be substantial. This is illustrated with an example in Figure 2 where 10,000 realizations of the same AR(12) process as above are used. The number of windows is K = 6 sinusoidal multitapers, and in Figure 2a, the relative expected value of the spectrum, is depicted to show that the relative value is quite close to 1, or at least between 0 and 2 for the whole spectrum, which indicates that the approximation referred to as TNBA would be appropriate. In Figure 2b,c, the error between the different bias approximations compared to the true bias of the log-spectrum in Equation 9 is shown. Note that the error for the ALSB (blue line) is the same in both Figure 2b,c. For this highly varying spectrum, we see that the ALSB is a fair approximation and that TNBA(2) as well as the TNBA(1) gives very large errors for the cases where E Ŝ x S x > 2, e.g., slightly below f = 0.30 and above f = 0.40. The ENBA (2) gives in these cases a smaller error (see Figure 2b) but might also give a much larger error than the TNBA(2). The more simple approximation of TNBA(1) and ENBA(1) in Figure 2c gives larger errors.
However, at the peaks of the spectrum, i.e., f = 0.05, 0.10, 0.20, 0.30, 0.35, and 0.40, the difference of the errors compared to Figure 2b is not that large, but at the smooth parts, between the peaks, the negative effect of omitting the term ψ(K) − log(K) is notable. For larger values of K, this error will be smaller as this term also becomes smaller for larger K.

Variance of the log-spectrum
Expressions for the variance of the log-spectrum have been derived, e.g., the variance of the log-periodogram of a Gaussian process was derived in [17] and shown to be This result was generalized in [6] to hold for complex super-Gaussian as well as sub-Gaussian spectral coefficients. For the logarithm of multitaper spectra using K sinusoidal tapers, it was shown in [18] that the variance, with a locally white noise assumption, is where ψ (K) is the trigamma function and is recursively computed by ψ (K + 1) = ψ (K) − 1 K 2 and ψ (1) = π 2 6 (trigamma). The approximation based on the Taylor expansion suggested in [7], i.e., was shown to be a sufficiently accurate approximation for speech-like processes. This approximation is referred to as expected value normalized variance approximation (ENVA).
To compare these approximations, 10,000 realizations from the AR(12) process above are used. The results are presented in Figure 3 for different values of K. The true variance of the log-spectrum is presented as the blue line, and for K = 1, this coincides very well with ψ (1) = π 2 http://asp.eurasipjournals.com/content/2013/1/158 the mean square error for each frequency is chosen as where ENBA(1) and ENVA are applied as approximations of the bias and variance of the log-spectrum, respectively. This approximation shows that normalizing the sum of all MSE f of the spectral estimatorŜ x (f ) with the squared expected value ofŜ x (f ) gives a reasonable approximation of the mean square error for the estimator logŜ x (f ) and is thereby also related to the MSE of Equation 4. It is therefore reasonable to assume that minimization of Equation 20 for all f, also minimizing Equation 4, would give an optimal estimator for the cepstrum coefficientŝ r c (n).
The bias in Equation 20 using the multitaper spectrum estimator of Equation 2 is bias = where . . e −i2π(N−1)f ] and the superscript H denotes conjugate transpose. The variance is The second term is large only for frequencies close to f = 0 or to the Nyquist frequency, where the function h T k ( f )R x ( f )h l overlaps its conjugate. Most of the spectrum power is however located at the frequencies in between. The covariance for the frequency f is therefore approximated as In the further optimization, the multitapers h k are assumed to be known and to be the sinusoidal tapers of [12] with N = 256. The only unknowns are the weighting factors α k , k = 0 . . . K − 1, which however appear both in the numerator and the denominator.
The choice of multitapers is crucial, and for an application where the data can be expected to originate from a highly dynamical spectrum, the Slepian multitapers [8] could be a better choice. The concern in this paper is based on the application to speech signals, where the spectrum can be expected to have peaks, usually not too sharp, and in total a reasonable dynamics.
In all periodogram-based spectrum analysis methods, the multitaper estimation method can be considered to be a filtering procedure in a FIR-filter bank where the filter functions all can be modulated to be an identical baseband filter with center frequency 0. For each frequency, the input signal is consequently demodulated and filtered through the baseband filter [20]. As baseband filter, a simple AR(1) spectrum is used, with a peak located at zero frequency, i.e., one pole in ρ. The resulting optimal weights for two different cases of ρ are presented where the corresponding covariance matrix R x is used in Equation 20. The AR(1) spectrum is a simple model but reasonable for speech data as speech data often are estimated as AR models (order [10][11][12][13][14][15][16][17][18][19][20]. The average damping of the different poles (ρ) of such an estimated AR spectrum from real data will give an idea of what damping factor should be chosen for the AR(1) model for the optimization of the weights. How this averaging and choice should be made is left for further studies.
The criterion is non-linear with respect to the unknown α k , k = 0 . . . K − 1, and is therefore minimized iteratively with a quasi-Newton algorithm [21]. This algorithm was presented in [22] and is also applied in this paper. The initial weighting factors are in all cases equal weights, α k = 1/K, k = 0 . . . K − 1. In this paper, no further study of the convergence is made. The criterion MSE f can be optimized for different frequencies f. Naturally, the peak frequency f = 0 of the model spectrum is interesting, as well as the resolution, i.e., the bandwidth of the estimator [8,22]. The function to be minimized is i = 0 . . . N − 1, and the highest frequency taper to be included in the bandwidth | f | < W /2 is number i =< W /2 · 2(N + 1) giving K = i + 1 < (W · (N + 1)) + 1. The chosen optimization bandwidth is crucial for the resolution of the final estimate, and it should be chosen at least somewhat smaller than the preferred resolution of the final estimate as done in spectrum analysis. The local in-band multitaper cepstrum bias of the sinusoidal tapers is shown in [18] to be bounded by 24N 2 for equal weights and can be expected to be smaller than for  the Slepian multitapers. The Slepian multitapers, however, have better leakage properties or out-of-band bias [8]. The sampling frequency of the actual process will effect an estimated ρ as well as the decision of the bandwidth parameter W. For example, reducing the sample frequency by a factor of 2 will give half the number of data values N, which will increase the in-band bias by a factor of 4, but the reduced number of samples will be fully compensated by the decrease of ρ. For the AR(1) model, the damping factor will change from ρ to ρ 2 , significantly affecting the spectrum shape to be more smooth.
The bandwidth parameter W can be twice as large as the actual spectrum peaks of the data now which is a factor 2 further from each other compared to the nonreduced sampling frequency. The number of tapers will then be approximately the same as K ≈ W · N, and N is reduced but W is doubled. Thereby, the variance will not change significantly. However, a reduction of sampling frequency is always beneficial, if possible, to the point where actual information is lost, but the further and more thorough analysis of the sampling effects is left for future research. Three different bandwidths is used in the optimization, W = 0.02, 0.04, and 0.08, according to Figure 4a where the different vertical colored lines mark W /2. The related number of tapers is K = 6, 11, and 21 for the respective bandwidth. The model spectrum is the AR(1) spectrum with ρ = 0.98. The resulting weighting factors are depicted in Figure 4b where the blue line represents the K = 6, k = 0 . . . 5 values for W = 0.02, the green line the K = 11 values for W = 0.04, and the red line the resulting weighting factors for W = 0.08. The three curves are quite similar and are approaching zero for higher k, indicating that using the fewer weighting factors from the narrow frequency band W = 0.02 might work as well as the larger number from the optimization bandwidth W = 0.08. A more wideband process, the AR(1) process with ρ = 0.93 (see Figure 5a), gives another result. Using the narrow band for the optimization gives the K = 6 weighting factors depicted as the blue line in Figure 5b. These values are quite close to each other, indicating that equally weighted mutitaper spectra might work as well for this type of spectrum. For a wider bandwidth, including more of the spectrum in the optimization, the resulting weighting factors are given by the green and red lines for W = 0.04 and 0.08, respectively.
To compare the actual performances of these approximative estimators, an evaluation of the mean square errors of the log-spectrum, i.e., Equation 19, is made for 10,000 realizations of an AR(2) process with the poles located at ρe ±i2π0.25 . The evaluation bandwidth is limited to 0.25 − W /2 ≤ f ≤ 0.25 + W /2. The resulting mean square errors, ξ ev , using the proposed weighting factors of Figures 4 and 5 and the sinusoidal tapers (N = 256), are calculated (OPT098 and OPT093). In Tables 1  and 2, the results are compared to the results of other well-known methods, such as (equally weighted) sinusoidal tapers, the Thomson multitapers, and the Welch method (Hanning window and 50% of overlap), using the number of tapers that give the smallest error (SIN opt , THOM opt , and WELCH opt ). For comparison, the results using a single Hamming window are also computed (HAMM). For the more peaked spectrum (ρ = 0.98), the results for OPT098 are much better than for the equally weighted multitaper methods as well as the single Hamming window. The cost is the increased number of tapers of the estimate. However, for OPT098 and W = 0.02, the number of tapers is K = 6, to be compared with K = 4 or K = 3 for the other multitaper methods. For the broadband spectrum with ρ = 0.93, the results of OPT093 are much better than the other multitaper methods even though, in the case of W = 0.02, the number of multitapers is actually fewer. These simulations are just a verification that the optimization has performed well, and the more interesting evaluation is for the total log-spectrum and thereby also for the cepstrum.

Cepstrum analysis of speech processes
To evaluate the performance for speech-like processes, AR models are estimated from sounds of the phoneme http://asp.eurasipjournals.com/content/2013/1/158 ' A' of recorded data of the Swedish word Hallå (Hallo) as well as of the whole word Hallå from the same speakers (three males and three females). The reason for analyzing ' A' is the more stationary character of vowels during the whole sequence length. However, the methods should also be robust against normal changes of the speech, and therefore the whole word Hallå is also investigated, where the sequences for the spectrum analysis are chosen subsequentially and without overlap. The total lengths of the different Hallå are between 248 and 567 ms, and the sampling frequency is 11 kHz, giving the number of sequences between 9 and 23 where the sequence length is N = 256 (23 ms). For the syllable ' A', N = 256 in all cases. The choice of the AR model order for each sequence is made from the Akaike information criterion (AIC). A number of 1,000 simulated speech-like processes are then produced from the different models, and the evaluation criterion is the total mean square error for the cepstrum, Note that the cepstrum coefficient at n = 0 is excluded in this analysis. The reason is that the zeroth coefficient corresponds to a constant energy level of the spectrum and is usually omitted in most cepstrum applications.
The estimators OPT098 and OPT093 from the former section are applied and compared with THOM opt , WOSA opt , and SIN opt as above where the result from the number of multitapers giving the smallest error is presented. A comparison with an AR estimator is also made. The model order (using the AIC criterion) giving the smallest error is presented. The result of the single Hamming window periodogram (HAMM) is also added, as this method is often applied in speech analysis. The result of this method is however much worse than any of the multitaper methods.
In Table 3, the minimum total mean square errors ξ c for the six different ' A' from three male and three female speakers are presented. For all subjects, the used simulation AR model orders are shown in the first line, e.g., order 49 was given for the first male speaker, M 1 (49). As expected, the order of the underlying model is found to be the optimal one in all cases for the AR estimator, AR opt . The estimated model orders are presented after the error of the AR opt in parenthesis. Similarly, the optimal number of tapers for all the multitaper methods is expressed in parenthesis after the error.
Studying the errors of the multitaper methods, it can be seen that one of the proposed estimators, either OPT098 or OPT093 gives the smallest error in almost all cases followed by WELCH opt , SIN opt , and THOM opt . In most cases, the number of tapers needed are just two or three more than for the equally weighted multitaper methods, e.g., for M 1 ; the error given from OPT098 002 (K = 6) is much smaller than the error from WELCH opt (K = 4). Similarly, for F 2 , the error given from OPT093 004 (K = 11) is substantially smaller than the error from WELCH opt (K = 9). In almost all cases, as expected from AR model simulations, the AR opt gives a much better result. However, in several cases, the error of AR opt is much larger than the multitaper methods, e.g., M 1 and F 2 . It is also interesting to note that the error of the single Hamming window, HAMM, is almost the same for all speakers. This is in concordance with the expressions given in [6,17], where the bias is approximately zero and the total variance as well as the total mean square error is π 2 /6 ≈ 1.64, for all cepstrum coefficients, excluding the zeroth coefficient.
In Table 4, the average ξ c of subsequent intervals of the total word Hallå is presented (number of sequences differ between 9 and 23). For all cases, the same methods and the same parameter settings as in previous studies are evaluated. The AR opt is investigated for different model orders, and the model order giving the smallest total error for all subsequences of Hallå is used and the corresponding error is presented. The results of the multitaper methods show about the same difference as for the evaluation the syllable ' A'. At least one of OPT098 or OPT093 gives a smaller error than WELCH opt , SIN opt , and THOM opt . Sometimes both OPT098 and OPT093 give a smaller error, e.g., for F 1 , which also is smaller than the error given by the AR opt , which is an indication of the robustness of the estimator against the choice of model. In most cases, a considerable reduction of the error is given from the OPT098 and OPT093 only using K = 6 or 11 tapers, which is a reasonable additional number compared to the equally weighted multitaper methods (K = 3 − 9). The AR opt now shows a considerable larger error in several cases than the multitaper methods. Similarly as for the syllable ' A', the single Hamming window gives results around 1.64.

Conclusions
A cepstrum estimator is proposed based on a weighted multitaper spectrum. An evaluation of different approximations for bias and variance of the multitaper logspectrum is made, and a mean square error criterion is proposed that includes novel approximations of the bias and variance. The weights of the multitaper spectrum are optimized, and the new estimator, the optimal weights combined with the sinusoidal tapers, is evaluated for cepstrum estimation of speech-like processes. The results show that a 10% to 20% reduction of the mean square error of the cepstrum can be achieved, to the cost of two or three additional periodogram computations.