Bayesian STSA estimation using masking properties and generalized Gamma prior for speech enhancement
- Mahdi Parchami,
- Wei-Ping Zhu,
- Benoit Champagne and
- Eric Plourde
https://doi.org/10.1186/s13634-015-0270-6
© Parchami et al. 2015
- Received: 21 October 2014
- Accepted: 9 September 2015
- Published: 6 October 2015
Abstract
We consider the estimation of the speech short-time spectral amplitude (STSA) using a parametric Bayesian cost function and speech prior distribution. First, new schemes are proposed for the estimation of the cost function parameters, using an initial estimate of the speech STSA along with the noise masking feature of the human auditory system. This information is further employed to derive a new technique for the gain flooring of the STSA estimator. Next, to achieve better compliance with the noisy speech in the estimator’s gain function, we take advantage of the generalized Gamma distribution in order to model the STSA prior and propose an SNR-based scheme for the estimation of its corresponding parameters. It is shown that in Bayesian STSA estimators, the exploitation of a rough STSA estimate in the parameter selection for the cost function and the speech prior leads to more efficient control on the gain function values. Performance evaluation in different noisy scenarios demonstrates the superiority of the proposed methods over the existing parametric STSA estimators in terms of the achieved noise reduction and introduced speech distortion.
Keywords
- Generalized Gamma distribution (GGD)
- Masking
- Noise reduction
- Short-time spectral amplitude (STSA)
- Speech enhancement
1 Introduction
Speech enhancement aims at the reduction of corrupting noise in speech signals while keeping the introduced speech distortion at the minimum possible level. In this respect, considerable interest has been directed toward the estimation of the speech spectral amplitude, due to its perceptual importance in the frequency domain approaches [1, 2].
Within this framework, the general goal is to provide an estimate of the short-time spectral amplitude (STSA) of the clean speech using statistical models for the noise and speech spectral components. In [3], Ephraim and Malah proposed to estimate the speech signal amplitude through the minimization of a Bayesian cost function which measures the mean square error between the clean and estimated STSA; accordingly, the resulting estimator was called the minimum mean square error (MMSE) spectral amplitude estimator. Later in [4], a logarithmic version of this estimator, i.e., the Log-MMSE, was introduced on the grounds that the logarithm of the STSA is perceptually more relevant to the human auditory system. Even though alternatives to the Bayesian STSA estimators have been proposed, e.g., in [5], the Bayesian estimators remain attractive in the literature owing to their satisfactory performance. More recently, further modifications to the STSA Bayesian cost functions were suggested by Loizou in [6] by taking advantage of the psycho-acoustical models initially employed for speech enhancement purposes in [7, 8]. Therein, it was shown that the estimator emphasizing the spectral valleys (minima) of the speech STSA, namely the weighted Euclidean (WE) estimator, achieves the best overall performance. Along the same line of thought, You et al. [9] proposed to use the β power of the STSA term in the Bayesian cost function, in order to obtain further flexibility in the corresponding STSA gain function. These authors investigated the performance of the so-called β-order MMSE estimator for different values of β and found that it is moderately better than the MMSE and Log-MMSE estimators proposed earlier. In that work, an adaptive scheme based on the frame SNR was also suggested to determine β.
Plourde and Champagne in [10] suggested taking advantage of STSA power weightings (as used in the WE estimator) in the β-order MMSE cost function and introduced the parameter α as the power of their new weighting. They further proposed to select the two estimator parameters as functions of frequency, according to the psycho-acoustical properties of the human auditory system, and reported better enhanced speech quality over most of the input SNR range. Yet, at high input SNRs, the performance of the developed estimator may not be appealing due to the undesired distortion in the enhanced speech. Further in [11], the same authors introduced a generalized version of the W β-SA estimator by including a new weighting term in the Bayesian cost function which provides additional flexibility in the estimator’s gain. However, apart from the mathematically tedious solution for the gain function, the corresponding estimator does not provide a further noticeable improvement in the enhanced speech quality.
Overall, parametric Bayesian cost functions such as those in [6, 9, 10] can provide further noise reduction over the previous estimators, thanks to the additional gain control obtained by an appropriate choice of the cost function parameters. In [6], fixed values were used for the STSA weighting parameter, whereas in [9], an experimental scheme was proposed in order to adapt β to the estimated frame SNR. In [9], the adaptive selection of the cost function parameters proved advantageous over fixed parameter settings in most of the tested scenarios. To make use of the noise masking properties as in [8], it was suggested in [12] to select the power β as a linear combination of both the frame SNR and the noise masking threshold; subsequently, improvements with respect to the previous schemes were reported. In [10], rather than an adaptive scheme, the values of the estimator parameters are chosen based only on the perceptual properties of the human auditory system. Whereas this scheme is in accordance with the spectral psycho-acoustical models of the hearing system in neural science [13], it does not take into account the noisy speech features in updating the parameters.
In the aforementioned works, since the complex Gaussian probability distribution function (PDF) is considered for the speech short-time Fourier transform (STFT) coefficients, the speech STSA actually follows the Rayleigh PDF. However, as indicated in [14], parametric non-Gaussian (super-Gaussian) PDFs are able to better model the speech STSA prior. In [15], the Chi PDF with fixed parameter settings was used as the speech STSA prior for a group of perceptually motivated STSA estimators. The use of Chi and Gamma speech priors was further studied in [16], and training-based procedures using the histograms of clean speech data were proposed for the estimation of the speech STSA prior parameters. Yet, apart from being computationally tedious, training-based methods depend largely on the test data, and unless a very lengthy set of training data is used, their performance may not be reliable. Within the same line of work, the generalized Gamma distribution (GGD), which includes several other non-Gaussian PDFs as special cases, has also been considered. In [17, 18], it was confirmed that the most suitable PDF for the modeling of speech STSA priors is the GGD, given that the corresponding parameters are estimated properly. Two mathematical approaches, i.e., the maximum likelihood and the method-of-moments, have been used in [18] for the estimation of the GGD parameters. However, as the evaluations in [19] showed and our experiments confirmed, these two approaches do not lead to acceptable results, due to the coarse approximations involved in their derivation. Other major studies within this field, such as those in [20, 21], use either fixed or experimentally set values for the GGD model parameters and thus lack adaptation to the noisy speech data. Hence, an adaptive scheme to estimate the STSA prior parameters with moderate computational burden and fast adaptability to the noisy speech samples is still needed.
In this work, by taking into account the parametric W β-SA estimator, we first propose novel schemes for the parameter selection of the cost function as well as the gain flooring. The new schemes make use of the prior information available through a preliminary estimate of the speech STSA, noise masking threshold, and the compression property of the human auditory system. Next, a generalization of this estimator by employing the GGD prior model is derived and an efficient yet low-complexity scheme is introduced for the estimation of its parameters. We assess the performance of the proposed methods in terms of speech quality and the amount of noise reduction and demonstrate their advantage with respect to the previous STSA estimators. In particular, through a series of controlled experiments, we demonstrate the incremental advantages brought about by each one of the newly proposed modifications to the original W β-SA estimator.
The remainder of this paper is organized as follows. In Section 2, a brief overview of the auditory-based W β-SA estimator is presented. Section 3 proposes new schemes for the parameter selection of the Bayesian cost function as well as a new gain flooring scheme for STSA estimators. Section 4 exploits the application of the GGD prior to the proposed STSA estimator and discusses an efficient method for the estimation of its parameters. Performance of the proposed STSA estimation schemes is evaluated in Section 5 in terms of objective performance measures. Conclusions are drawn in Section 6.
2 Background: parametric STSA estimation
The noisy speech is modeled in the STFT domain as
\[ Y(k,l) = X(k,l) + V(k,l), \]
where Y(k,l), X(k,l), and V(k,l) are the STFTs of the noisy observation, clean speech, and noise, respectively, at frequency bin k and time frame l. Expressing the complex-valued speech coefficients, X(k,l), as \(\chi(k,l)e^{j\Omega(k,l)}\) with χ and Ω as the amplitude and phase, respectively, the purpose of speech STSA estimation is to estimate the speech amplitude, χ(k,l), given the noisy observations, Y(k,l). The estimated amplitude is then combined with the noisy phase of Y(k,l) to provide an estimate of the speech Fourier coefficients. For the sake of brevity, we may discard the indices k and l in the following.
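The final synthesis step described above, combining an estimated amplitude with the noisy phase, can be illustrated with a minimal sketch for a single time-frequency bin. The function name `reconstruct_coefficient` and the fixed gain value are hypothetical and chosen only for illustration; any STSA estimator could supply the amplitude.

```python
import cmath


def reconstruct_coefficient(amplitude_estimate: float, noisy_coeff: complex) -> complex:
    """Combine an estimated speech amplitude with the phase of the noisy coefficient."""
    noisy_phase = cmath.phase(noisy_coeff)             # phase of Y(k,l)
    return cmath.rect(amplitude_estimate, noisy_phase)  # amplitude * exp(j * phase)


# One time-frequency bin; the gain of 0.6 is a hypothetical placeholder,
# not a value produced by any estimator in the text.
Y = complex(3.0, 4.0)            # noisy STFT coefficient, |Y| = 5
hypothetical_gain = 0.6
X_hat = reconstruct_coefficient(hypothetical_gain * abs(Y), Y)
```

The reconstructed coefficient keeps the direction of Y in the complex plane and only rescales its magnitude, which is why STSA estimators are usually described through a real-valued gain function.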
STSA gain function curves in (8) versus β for different values of α (ζ = 0 dB and γ = 0 dB)
3 Proposed noise masking-based STSA estimator
Block diagram of the proposed algorithm for speech STSA estimation
3.1 Selection of parameter α
Variation of the proposed choice of α versus frequency compared to that of initial speech STSA estimate for one sample time frame of the noisy speech
3.2 Selection of parameter β
where the purpose of the constant \(C_{\beta} = 1/0.6\) is to scale the median value of the frequency weighting parameter \(\beta^{(2)}(k)\) in (12) up to one.
3.3 Proposed gain flooring scheme
where p(k,l) is the speech presence probability which can be estimated through a soft-decision noise PSD estimation method. Using the popular improved minima controlled recursive averaging (IMCRA) in [26] provides enough precision for the estimation of this parameter in the proposed gain flooring scheme. According to (17), for higher speech presence probabilities or equivalently in frames/frequencies with stronger speech components, the contribution of the current frame in the recursive smoothing through the term \(\hat {\chi }_{0}(k,l)\) will be larger than that of the previous frame \(\hat {\chi }(k,l-1)\). Conversely, in case of a weak speech component in the current frame, the smoothing gives more weight to the previous frame. Hence, this choice of the flooring value favors the speech component over the noise component in adverse noisy conditions where the gain function is mainly determined by the second branch in (17).
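The behaviour described for the second branch of (17) can be sketched as follows. This is only an illustration under stated assumptions: the two-branch structure and the probability-weighted smoothing follow the description above, but the exact weights, the branch condition, and the threshold value `gain_threshold` are hypothetical, since (17) itself is not reproduced here.

```python
def floored_amplitude(gain, noisy_amp, chi0, chi_prev, p, gain_threshold=0.1):
    """Return an amplitude estimate with a speech-presence-weighted floor.

    gain           : estimator gain for the current bin
    noisy_amp      : noisy amplitude |Y(k,l)|
    chi0           : preliminary STSA estimate of the current frame
    chi_prev       : final STSA estimate of the previous frame
    p              : speech presence probability in [0, 1]
    gain_threshold : hypothetical flooring threshold (assumption)
    """
    if gain >= gain_threshold:
        # First branch (assumed): the gain is large enough, use it directly.
        return gain * noisy_amp
    # Second branch: recursive smoothing that leans on the current frame
    # when speech is likely present and on the previous frame otherwise.
    return p * chi0 + (1.0 - p) * chi_prev


strong_speech = floored_amplitude(0.01, 2.0, chi0=1.0, chi_prev=0.4, p=0.75)
weak_speech = floored_amplitude(0.01, 2.0, chi0=1.0, chi_prev=0.4, p=0.25)
```

With a high presence probability the floor stays close to the current preliminary estimate, whereas with a low probability it stays close to the previous frame's estimate, matching the qualitative behaviour stated in the text.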
4 Incorporation of GGD as speech prior
As mentioned in Section 1, use of the parametric GGD model as the STSA prior, due to providing further flexibility in the resulting gain function, is advantageous compared to the conventional Rayleigh prior. In this section, we first derive an extended W β-SA estimator under the GGD speech prior and then propose an efficient method to estimate its corresponding parameters.
4.1 Extended W β-SA estimator with GGD prior
where the notation MW β-SA is used to denote the modified W β-SA estimator. It is obvious that, for c= 1 where the Rayleigh prior is obtained as a special case, (21) degenerates to the original W β-SA. In the following, we present a simple approach for the selection of the GGD parameter c for the proposed STSA estimator.
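The reduction to the Rayleigh prior at c = 1 can be checked directly. The worked equation below assumes the common GGD parameterization with the amplitude exponent fixed to 2; the scale relation \(b=c/\sigma_{\chi}^{2}\) is the one quoted in the Appendix, while the explicit density form is an assumption because it is not reproduced in this section.

\[ p(\chi) = \frac{2\,b^{c}}{\Gamma(c)}\,\chi^{2c-1}\exp\left(-b\,\chi^{2}\right), \qquad b = \frac{c}{\sigma_{\chi}^{2}}, \quad \chi \ge 0 \]

\[ c = 1 \;\Rightarrow\; p(\chi) = \frac{2\chi}{\sigma_{\chi}^{2}}\exp\left(-\frac{\chi^{2}}{\sigma_{\chi}^{2}}\right) \]

The second line is the Rayleigh PDF with second moment \(\sigma_{\chi}^{2}\), so under this parameterization the shape parameter c alone controls the departure from the conventional prior.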
4.2 Selection of the GGD shape parameter
Gain function of the modified W β-SA estimator in (21) versus the GGD shape parameter c for different values of γ (ζ = −5 dB)
with \(\zeta_{\mathrm{av}}(l)\) as the a priori SNR being averaged over the frequency bins of the lth frame, and \(\zeta_{\min}(l)\) and \(\zeta_{\max}(l)\) as the minimum and maximum of the a priori SNR at the same time frame, respectively. According to (22), the shape parameter c takes on its values as a linearly increasing function of the SNR in its possible range between \(c_{\min}\) and \(c_{\max}\), leading to the appropriate adjustment of the estimator gain function based on the average power of the speech STSA components at each frame.
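A linear mapping of this kind could be sketched as below. The normalization by the frame minimum and maximum is one plausible reading of the description, not a reproduction of (22); the default bounds `c_min=1.0` and `c_max=3.0`, the function name, and the handling of a flat SNR profile are all hypothetical choices made for illustration.

```python
def select_shape_parameter(apriori_snr, c_min=1.0, c_max=3.0):
    """Map the frame-averaged a priori SNR linearly onto [c_min, c_max].

    apriori_snr : a priori SNR values over the frequency bins of one frame
    c_min, c_max: hypothetical bounds of the shape parameter (assumptions)
    """
    zeta_min = min(apriori_snr)
    zeta_max = max(apriori_snr)
    zeta_av = sum(apriori_snr) / len(apriori_snr)
    if zeta_max == zeta_min:
        # Flat SNR profile: the normalization is undefined, so fall back to
        # the midpoint (an arbitrary choice for this sketch).
        return 0.5 * (c_min + c_max)
    normalized = (zeta_av - zeta_min) / (zeta_max - zeta_min)  # lies in [0, 1]
    return c_min + (c_max - c_min) * normalized


c_mid = select_shape_parameter([0.0, 5.0, 10.0])
c_low = select_shape_parameter([0.0, 0.0, 0.0, 4.0])
```

Because the frame average always lies between the frame minimum and maximum, the result stays inside the chosen bounds without any explicit clipping.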
5 Performance evaluation
In this section, we evaluate the performance of the proposed STSA estimation methods using objective speech quality measures. First, the performance of the proposed STSA parameter selection and gain flooring schemes is compared to that of the previous methods. Next, the proposed GGD-based estimator is compared to the estimators using the conventional Rayleigh prior. Due to the performance advantage of the generic W β-SA estimator over the previous versions of STSA estimators, it is used throughout the following simulations.
Various types of noise from NOISEX-92 database [29] were considered for the evaluations, out of which, the results are presented for three noise types, i.e., white, babble, and car noises. Speech utterances including 10 male and 10 female speakers are used from the TIMIT speech database [30]. The sampling rate is set to 16 kHz and a Hamming window with length 20 ms and overlap of 75 % between consecutive frames is used for STFT analysis and overlap-add synthesis. In all simulations, the noise variance is estimated by the soft-decision IMCRA method [26] eliminating the need to use a hard-decision voice activity detector (VAD). Also, the decision-directed (DD) approach [3] is used to estimate the a priori SNR. Even though more accurate methods of noise and SNR estimation exist, use of the aforementioned approaches provided enough accuracy for our purpose.
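As a rough illustration of this setup, the sketch below derives the frame and hop lengths implied by the stated sampling rate, window length, and overlap, and applies the standard decision-directed rule of [3] to a single bin. The smoothing factor of 0.98 and the lower SNR limit are common choices and are assumptions here, since this section does not state the values that were used.

```python
# Framing parameters stated in the text.
SAMPLE_RATE_HZ = 16000
FRAME_DURATION_S = 0.020
OVERLAP_FRACTION = 0.75

frame_len = round(SAMPLE_RATE_HZ * FRAME_DURATION_S)      # 320 samples
hop_len = frame_len - int(OVERLAP_FRACTION * frame_len)   # 80 samples


def decision_directed_snr(prev_amp_est, noisy_amp, noise_var,
                          smoothing=0.98, snr_floor=1e-3):
    """Decision-directed a priori SNR estimate for one time-frequency bin.

    prev_amp_est : enhanced amplitude of the same bin in the previous frame
    noisy_amp    : noisy amplitude |Y(k,l)| in the current frame
    noise_var    : estimated noise spectral variance for this bin
    smoothing    : weighting factor (assumed value, not from the text)
    snr_floor    : lower limit on the estimate (assumed value)
    """
    gamma = noisy_amp ** 2 / noise_var                 # a posteriori SNR
    zeta = (smoothing * prev_amp_est ** 2 / noise_var
            + (1.0 - smoothing) * max(gamma - 1.0, 0.0))
    return max(zeta, snr_floor), gamma


zeta, gamma = decision_directed_snr(prev_amp_est=1.0, noisy_amp=2.0, noise_var=1.0)
```

In a full system the noise variance would come from a tracker such as IMCRA and the previous amplitude from the STSA estimator itself; both are passed in as plain numbers here to keep the example self-contained.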
As for the assessment of the enhanced speech quality, various objective measures have been employed in the literature. In order to obtain a measure of the overall quality of the enhanced speech, we use the Perceptual Evaluation of Speech Quality (PESQ) scores. PESQ is a widely accepted industrial standard for objective voice quality evaluation and is standardized as ITU-T recommendation P.862 [31]. Since PESQ measurements principally model the Mean Opinion Scores (MOS), they are closely connected to subjective listening tests performed by humans. On the other hand, the log-likelihood ratio (LLR) score, which measures a logarithmic distance between the linear prediction coefficients (LPC) of the enhanced and clean speech utterances, is more related to the distortion introduced in the clean speech signal [32]. Whereas PESQ takes values between 1 (worst) and 4.5 (best), the lower the LLR, the less distorted the speech signal. To have a more complete evaluation of the noise reduction performance, we also consider the segmental SNR, which correlates well with the level of noise reduction regardless of the existing distortion in the speech [32].
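For concreteness, a segmental SNR computation could look like the following minimal sketch. It uses non-overlapping frames and clamps each frame value to the range from −10 dB to 35 dB, which is a common convention and an assumption here; the framing and limits used for the reported results are not stated in this section.

```python
import math


def segmental_snr(clean, enhanced, frame_len=320, lower_db=-10.0, upper_db=35.0):
    """Average of per-frame SNRs (in dB) between clean and enhanced signals.

    frame_len, lower_db and upper_db are assumed defaults for this sketch.
    """
    eps = 1e-10  # guards against division by zero and log of zero
    frame_values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        clean_frame = clean[start:start + frame_len]
        enhanced_frame = enhanced[start:start + frame_len]
        signal_energy = sum(x * x for x in clean_frame)
        error_energy = sum((x - y) ** 2 for x, y in zip(clean_frame, enhanced_frame))
        snr_db = 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))
        frame_values.append(min(max(snr_db, lower_db), upper_db))
    if not frame_values:
        raise ValueError("signals are shorter than one frame")
    return sum(frame_values) / len(frame_values)


# Toy signals: a constant clean signal and a slightly attenuated copy.
seg_snr = segmental_snr([1.0] * 8, [0.9] * 8, frame_len=4)
```

Averaging in the dB domain, rather than averaging energies first, is what makes this measure sensitive to low-energy frames, where residual noise is most audible.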
Spectrograms of (a) input noisy speech, (b) clean speech, (c) enhanced speech by the original W β-SA estimator, and (d) enhanced speech by the proposed W β-SA estimator, in case of babble noise (Input SNR = 5 dB)
LLR versus global SNR for different W β-SA estimators, (a) white noise, (b) babble noise, and (c) car noise
PESQ versus global SNR for different W β-SA estimators, (a) white noise, (b) babble noise, and (c) car noise
Segmental SNR versus global SNR for different W β-SA estimators, (a) white noise, (b) babble noise, and (c) car noise
PESQ values for the W β-SA estimator with different schemes of parameter α, case of white noise
Input SNR (dB) | −10 | −5 | 0 | 5 | 10 |
---|---|---|---|---|---|
Input noisy speech | 1.13 | 1.26 | 1.47 | 1.75 | 2.06 |
Choice of α = 0 | 1.49 | 1.70 | 2.03 | 2.39 | 2.72 |
Choice of α = 0.22 | 1.49 | 1.73 | 2.06 | 2.41 | 2.76 |
Original choice of α | 1.50 | 1.73 | 2.08 | 2.44 | 2.78 |
Proposed choice of α | 1.54 | 1.77 | 2.14 | 2.49 | 2.81 |
PESQ values for the W β-SA estimator with different schemes of parameter α, case of babble noise
Input SNR (dB) | −10 | −5 | 0 | 5 | 10 |
---|---|---|---|---|---|
Input noisy speech | 1.31 | 1.56 | 1.83 | 2.14 | 2.43 |
Choice of α = 0 | 1.48 | 1.71 | 2.03 | 2.40 | 2.73 |
Choice of α = 0.22 | 1.51 | 1.82 | 2.14 | 2.42 | 2.77 |
Original choice of α | 1.54 | 1.86 | 2.16 | 2.45 | 2.79 |
Proposed choice of α | 1.58 | 1.91 | 2.23 | 2.51 | 2.82 |
PESQ values for the W β-SA estimator with different schemes of parameter α, case of car noise
Input SNR (dB) | −10 | −5 | 0 | 5 | 10 |
---|---|---|---|---|---|
Input noisy speech | 1.41 | 1.54 | 1.71 | 2.01 | 2.32 |
Choice of α = 0 | 1.57 | 1.76 | 2.06 | 2.40 | 2.75 |
Choice of α = 0.22 | 1.58 | 1.78 | 2.11 | 2.46 | 2.77 |
Original choice of α | 1.60 | 1.81 | 2.15 | 2.50 | 2.79 |
Proposed choice of α | 1.66 | 1.88 | 2.20 | 2.54 | 2.84 |
PESQ values for the W β-SA estimator with different schemes of parameter β, case of white noise
PESQ values for the W β-SA estimator with different schemes of parameter β, case of babble noise
PESQ values for the W β-SA estimator with different schemes of parameter β, case of car noise
PESQ versus global SNR for W β-SA estimator with the proposed parameters in Section 3 using different gain flooring schemes, (a) white noise, (b) babble noise, and (c) car noise
Spectrograms of (a) input noisy speech, (b) clean speech, (c) enhanced speech by WE estimator with Chi prior in [34], (d) enhanced speech by WCOSH estimator with Chi prior in [34], (e) enhanced speech by Log-MMSE estimator with GGD prior in [33], and (f) enhanced speech by the proposed W β-SA estimator with GGD prior in Section 4, in case of babble noise (Input SNR = 5 dB)
6 Conclusions
In this work, we presented new schemes for the selection of Bayesian cost function parameters in parametric STSA estimators, based on an initial estimate of the speech and the properties of human audition. We further used these quantities to design an efficient flooring scheme for the estimator’s gain function, which employs recursive smoothing of the initial speech estimate. Next, we applied the GGD model as the speech STSA prior to the W β-SA estimator and proposed to choose its parameters using the noise spectral variance and the a priori SNR. Owing to the more efficient adjustment of the estimator’s gain function by the suggested parameter choice, and to the gain flooring scheme that protects strong speech components from distortion, our STSA estimation schemes provide better noise reduction as well as less speech distortion than the previous methods. Also, the more precise modeling of the speech STSA prior through the GGD with the suggested adaptive parameter selection yielded improvements with respect to recent speech STSA estimators. Quality and noise reduction performance evaluations indicated the superiority of the proposed speech STSA estimation over the previous estimators.
7 Appendix: Derivation of Eq. (19)
where, according to Section 4.1, we have \(b=c/\sigma _{\chi }^{2}\). Now, by considering m=0 in the above, a similar expression is derived for DEN in (24). Division of the obtained expression of NUM by that of DEN results in Eq. (19).
Declarations
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. PC Loizou, Speech Enhancement: Theory and Practice (CRC Press, Boca Raton, FL, USA, 2007).
2. J Benesty, Y Huang, Springer Handbook of Speech Processing (Springer, Secaucus, NJ, USA, 2008).
3. Y Ephraim, D Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acous. Speech and Sig. Process. 32(6), 1109–1121 (1984).
4. Y Ephraim, D Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acous. Speech and Sig. Process. 33(2), 443–445 (1985).
5. PJ Wolfe, SJ Godsill, Efficient alternatives to the Ephraim and Malah suppression rule for audio signal enhancement. EURASIP J. Adv. Sig. Process. 2003, 1043–1051 (2003).
6. PC Loizou, Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum. IEEE Trans. Speech and Audio Process. 13(5), 857–869 (2005).
7. DE Tsoukalas, JN Mourjopoulos, G Kokkinakis, Speech enhancement based on audible noise suppression. IEEE Trans. Speech and Audio Process. 5(6), 497–514 (1997).
8. N Virag, Single channel speech enhancement based on masking properties of the human auditory system. IEEE Trans. Speech and Audio Process. 7(2), 126–137 (1999).
9. CH You, SN Koh, S Rahardja, β-order MMSE spectral amplitude estimation for speech enhancement. IEEE Trans. Speech and Audio Process. 13(4), 475–486 (2005).
10. E Plourde, B Champagne, Auditory-based spectral amplitude estimators for speech enhancement. IEEE Trans. Audio, Speech, Lang. Process. 16(8), 1614–1623 (2008).
11. E Plourde, B Champagne, Generalized Bayesian estimators of the spectral amplitude for speech enhancement. IEEE Signal Process. Letters. 16(6), 485–488 (2009).
12. CH You, SN Koh, S Rahardja, Masking-based β-order MMSE speech enhancement. Speech Comm. 48(1), 57–70 (2006).
13. E Kandel, J Schwartz, Principles of Neural Science, Fifth Edition (McGraw-Hill Education, Secaucus, NJ, USA, 2013).
14. R Martin, Speech enhancement based on minimum mean-square error estimation and superGaussian priors. IEEE Trans. Speech and Audio Process. 13(5), 845–856 (2005).
15. MB Trawicki, MT Johnson, Speech enhancement using Bayesian estimators of the perceptually-motivated short-time spectral amplitude (STSA) with Chi speech priors. Speech Comm. 57(0), 101–113 (2014).
16. I Andrianakis, PR White, Speech spectral amplitude estimators using optimally shaped Gamma and Chi priors. Speech Comm. 51(1), 1–14 (2009).
17. T Lotter, P Vary, Speech enhancement by MAP spectral amplitude estimation using a super-Gaussian speech model. EURASIP J. Appl. Sig. Process. 2005, 1110–1126 (2005).
18. R Prasad, H Saruwatari, K Shikano, Probability distribution of time-series of speech spectral components. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E87-A(3), 584–597 (2004).
19. I Andrianakis, Bayesian Algorithms for Speech Enhancement. PhD thesis, University of Southampton (2007). http://eprints.soton.ac.uk/66244/1.hasCoversheetVersion/P2515.pdf.
20. JS Erkelens, RC Hendriks, R Heusdens, J Jensen, Minimum mean-square error estimation of discrete fourier coefficients with generalized Gamma priors. IEEE Trans. Audio, Speech, Lang. Process. 15(6), 1741–1752 (2007).
21. BJ Borgstrom, A Alwan, A unified framework for designing optimal STSA estimators assuming maximum likelihood phase equivalence of speech and noise. IEEE Trans. Audio, Speech, Lang. Process. 19(8), 2579–2590 (2011).
22. A Jeffrey, D Zwillinger, Table of Integrals, Series, and Products (Elsevier Science, Boston, 2007).
23. DD Greenwood, A cochlear frequency-position function for several species–29 years later. The J. Acoust. Soc. Am. 87(6), 2592–2605 (1990).
24. I Cohen, B Berdugo, Speech enhancement for non-stationary noise environments. Sig. Process. 81(11), 2403–2418 (2001).
25. BL Sim, YC Tong, JS Chang, CT Tan, A parametric formulation of the generalized spectral subtraction method. IEEE Trans. Speech and Audio Process. 6(4), 328–337 (1998).
26. I Cohen, Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. IEEE Trans. Speech and Audio Process. 11(5), 466–475 (2003).
27. NL Johnson, S Kotz, N Balakrishnan, Continuous Univariate Distributions (Wiley & Sons, New York, 1995).
28. O Gomes, C Combes, A Dussauchoy, Parameter estimation of the generalized Gamma distribution. Math. Comput. Simul. 79(4), 955–963 (2008).
29. Noisex-92 database. Speech at CMU, Carnegie Mellon University, available at: http://www.speech.cs.cmu.edu/comp.speech/Section1/Data/noisex.html. Accessed date Sept 2014.
30. JS Garofolo, DARPA TIMIT acoustic-phonetic speech database. National Institute of Standards and Technology (NIST) (1988). https://catalog.ldc.upenn.edu/LDC93S1.
31. Recommendation P.862: Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. ITU-T (2001). http://www.itu.int/rec/T-REC-P.862.
32. Y Hu, PC Loizou, Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio, Speech, Lang. Process. 16(1), 229–238 (2008).
33. BJ Borgstrom, A Alwan, in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. Log-spectral amplitude estimation with Generalized Gamma distributions for speech enhancement, (2011), pp. 4756–4759. doi:10.1109/ICASSP.2011.5947418.
34. MB Trawicki, MT Johnson, Speech enhancement using Bayesian estimators of the perceptually-motivated short-time spectral amplitude (STSA) with Chi speech priors. Speech Comm. 57(0), 101–113 (2014).