Single-channel noise reduction using unified joint diagonalization and optimal filtering
© Nørholm et al.; licensee Springer. 2014
Received: 19 December 2013
Accepted: 17 March 2014
Published: 26 March 2014
In this paper, the important problem of single-channel noise reduction is treated from a new perspective. The problem is posed as a filtering problem based on joint diagonalization of the covariance matrices of the desired and noise signals. More specifically, the eigenvectors from the joint diagonalization corresponding to the least significant eigenvalues are used to form a filter, which effectively estimates the noise when applied to the observed signal. This estimate is then subtracted from the observed signal to form an estimate of the desired signal, i.e., the speech signal. In doing this, we consider two cases, in which, respectively, no distortion and distortion are incurred on the desired signal. The former can be achieved when the covariance matrix of the desired signal is rank deficient, as is the case, for example, for voiced speech. In the latter case, the covariance matrix of the desired signal is full rank, as is the case, for example, for unvoiced speech. Here, the amount of distortion incurred is controlled via a simple integer parameter, and the more distortion allowed, the higher the output signal-to-noise ratio (SNR). Simulations demonstrate the properties of the two solutions. In the distortionless case, the proposed filter achieves an output SNR only slightly worse than that of the Wiener filter, while incurring no signal distortion. Moreover, when distortion is allowed, it is possible to achieve higher output SNRs than with the Wiener filter. Alternatively, when a lower output SNR is accepted, a filter with less signal distortion than the Wiener filter can be constructed.
1 Introduction
Speech signals corrupted by additive noise suffer from lower perceived quality and lower intelligibility than their clean counterparts and cause listeners to suffer from fatigue after extended exposure. Moreover, speech processing systems are frequently designed under the assumption that only a single, clean speech signal is present at a time. For these reasons, noise reduction plays an important role in many communication and speech processing systems and continues to be an active research topic today. Over the years, many different methods for noise reduction have been introduced, including optimal filtering methods [1], spectral subtractive methods [2], statistical methods [3–5], and subspace methods [6, 7]. For an overview of methods for noise reduction, we refer the interested reader to [1, 8, 9] and to [10] for a recent and complete overview of applications of subspace methods to noise reduction.
In the past decade or so, most efforts in relation to noise reduction seem to have been devoted to tracking of noise power spectral densities [11–14] to allow for better noise reduction during speech activity, extensions of noise reduction methods to multiple channels [15–18], and improved optimal filtering techniques for noise reduction [1, 8, 19–21]. However, little progress has been made on subspace methods.
In this paper, we explore the noise reduction problem from a different perspective in the context of single-channel noise reduction in the time domain. This perspective differs from traditional approaches in several respects. Firstly, it combines the ideas behind subspace methods and optimal filtering via joint diagonalization of the desired and noise signal covariance matrices. Since joint diagonalization is used, the method works for all kinds of noise, as opposed to, e.g., approaches based on an eigenvalue decomposition, where preprocessing has to be performed when the noise is not white. Secondly, the perspective is based on obtaining an estimate of the noise signal by filtering the observed signal and, thereafter, subtracting this estimate from the observed signal. This is the opposite of the usual filtering approach, in which the observed signal is filtered to obtain the desired signal estimate directly. The idea of first estimating the noise is known from the generalized sidelobe canceller technique in a multichannel scenario [22]. Thirdly, when the covariance matrix of the desired signal has a rank that is lower than that of the observed signal, the perspective leads to filters that can be formed such that no distortion is incurred on the desired signal, and distortion can be introduced so that more noise reduction is achieved. The amount of distortion introduced can be controlled via a simple, integer parameter.
The rest of the paper is organized as follows. In Section 2, the basic signal model and the joint diagonalization perspective are introduced, and the problem of interest is stated. We then proceed, in Section 3, to introduce the noise reduction approach for the case where no distortion is incurred on the desired signal. This applies in cases where the rank of the observed signal covariance matrix exceeds that of the desired signal covariance matrix. In Section 4, we then relax the requirement of no distortion on the desired signal to obtain filters that can be applied more generally, i.e., when the ranks of the observed and desired signal covariance matrices are the same. Simulation results demonstrating the properties of the obtained noise reduction filters are presented in Section 5, after which the work is concluded in Section 6.
2 Signal model and problem formulation
The observed signal is modeled as y(k) = x(k) + v(k), where v(k) is the unwanted additive noise, which is assumed to be uncorrelated with x(k). All signals are considered to be real, zero mean, broadband, and stationary.
The covariance matrix of the observation vector is then R_y = E[y(k)y^T(k)] = R_x + R_v, where E[·] denotes mathematical expectation, and R_x = E[x(k)x^T(k)] and R_v = E[v(k)v^T(k)] are the covariance matrices of x(k) and v(k), respectively. The noise covariance matrix, R_v, is assumed to be full rank, i.e., its rank equals L. In the rest of the paper, we assume that the rank of the speech covariance matrix, R_x, equals P ≤ L. The objective of speech enhancement (or noise reduction) is then to estimate the desired signal sample, x(k), from the observation vector, y(k), in such a way that the noise is reduced as much as possible with little or no distortion of the desired signal.
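The joint diagonalization underlying the method can be sketched numerically. In the following Python sketch (my own illustration; the matrix sizes and the use of scipy.linalg.eigh are assumptions, not the paper's implementation), a rank-deficient speech covariance matrix and a full-rank noise covariance matrix are diagonalized by one and the same transform:

```python
# Joint diagonalization of a rank-P "speech" covariance R_x and a
# full-rank noise covariance R_v via the generalized symmetric
# eigenproblem R_x b = lambda R_v b.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
L, P = 8, 4

A = rng.standard_normal((L, P))
Rx = A @ A.T                       # rank P <= L by construction
C = rng.standard_normal((L, L))
Rv = C @ C.T + L * np.eye(L)       # symmetric positive definite (full rank)

# eigh returns ascending eigenvalues and eigenvectors normalized so that
# B.T @ Rv @ B = I; consequently B.T @ Rx @ B = diag(lam), i.e. the same
# matrix B diagonalizes both covariance matrices.
lam, B = eigh(Rx, Rv)

assert np.allclose(B.T @ Rv @ B, np.eye(L), atol=1e-8)
assert np.allclose(B.T @ Rx @ B, np.diag(lam), atol=1e-8)
# Since rank(Rx) = P, the L - P smallest eigenvalues are numerically zero.
assert np.max(np.abs(lam[:L - P])) < 1e-8
```

Because the eigenvectors are normalized with respect to R_v, each joint eigenvalue λ_i equals the ratio b_i^T R_x b_i / b_i^T R_v b_i, i.e., a per-direction SNR, which is what the filters in the following sections exploit.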
where h is a filter applied to the observation signal (see Section 3), and the two variance terms are the variances of x(k) and v(k), respectively, after noise reduction.
3 Noise reduction filtering without distortion
In this section, we assume that P<L; as a result, the speech covariance matrix is rank deficient.
where β_i, i = P+1, …, L, are arbitrary real numbers, at least one of which is different from 0. With the filter having the form of (14), oSNR(h_P) = 0, and z(k) can be seen as an estimate of the noise.
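As a quick numerical illustration of this construction (a sketch under assumed dimensions, not the paper's code), any combination of the eigenvectors associated with the zero eigenvalues blocks the desired signal completely, so the filter output contains noise only:

```python
# A filter built from the eigenvectors belonging to the (numerically) zero
# joint eigenvalues satisfies h.T @ Rx @ h = 0, so z(k) = h.T y(k) carries
# no desired signal and serves as a noise estimate.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
L, P = 8, 4
A = rng.standard_normal((L, P))
Rx = A @ A.T                        # rank-deficient speech covariance
C = rng.standard_normal((L, L))
Rv = C @ C.T + L * np.eye(L)        # full-rank noise covariance

# eigh sorts eigenvalues in ascending order, so the eigenvectors the paper
# indexes i = P+1, ..., L (least significant) are the first L - P columns.
lam, B = eigh(Rx, Rv)
beta = rng.standard_normal(L - P)   # arbitrary weights, not all zero
h = B[:, :L - P] @ beta             # filter spanning the noise-only directions

speech_power = h @ Rx @ h           # ~ 0: the desired signal is blocked
noise_power = h @ Rv @ h            # > 0: the output is purely noise
assert abs(speech_power) < 1e-8 * noise_power
```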
4 Noise reduction filtering with distortion
In this section, we assume that the speech covariance matrix is full rank, i.e., its rank equals L. We can still use the method presented in the previous section, but this time we should expect distortion of the desired signal.
This time, however, the output SNR cannot be made equal to 0, but it can be made as small as we desire. The more eigenvectors included in the filter, the more the speech signal is distorted. If a small amount of distortion can be tolerated, z′(k) can still be considered an estimate of the noise.
The smaller P′ is compared to L, the larger the distortion. Further, the speech distortion index is independent of the input SNR, as is the gain in SNR. This can be observed by multiplying either R_x in (5) or R_v in (6) by a constant c, which leads to a corresponding change in the input SNR. Inserting the resulting λ_i's and b_i's in (37) and (38) shows that the output SNR is changed by the factor c and that the speech distortion index is independent of c.
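The scaling argument can be verified numerically. The sketch below (my own check, with assumed matrix sizes) confirms that multiplying R_x by a constant c scales every joint eigenvalue λ_i by c while leaving the eigenvectors b_i unchanged, so any SNR-type ratio scales with c while any ratio formed from R_x alone, such as the speech distortion index, does not change:

```python
# Scaling Rx by c scales the joint eigenvalues by c and leaves the joint
# eigenvectors (normalized against Rv) untouched.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
L, c = 6, 10.0
A = rng.standard_normal((L, L))
Rx = A @ A.T + np.eye(L)            # full rank, as assumed in this section
C = rng.standard_normal((L, L))
Rv = C @ C.T + np.eye(L)            # full-rank noise covariance

lam, B = eigh(Rx, Rv)
lam_c, B_c = eigh(c * Rx, Rv)

assert np.allclose(lam_c, c * lam)              # eigenvalues scale with c
assert np.allclose(np.abs(B_c), np.abs(B),      # eigenvectors unchanged
                   atol=1e-8)                   # (up to sign)
```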
The quantity defined above is the noise reduction factor.
Again, minimizing either of the two criteria leads to the same estimator.
In the special case where P′=0, the estimator is the well-known Wiener filter.
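The connection to the Wiener filter can be made explicit through the joint diagonalization. The identity below is my own derivation check, using assumed notation: the classical time-domain Wiener matrix R_x(R_x + R_v)^{-1} factors as B^{-T} Λ(Λ + I)^{-1} B^T, which exposes the joint eigenvalues λ_i as per-direction Wiener gains λ_i/(1 + λ_i):

```python
# With B.T @ Rv @ B = I and B.T @ Rx @ B = diag(lam), we have
# Rv = B^{-T} B^{-1} and Rx = B^{-T} diag(lam) B^{-1}, hence
# Rx (Rx + Rv)^{-1} = B^{-T} diag(lam / (1 + lam)) B^T.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(4)
L = 6
A = rng.standard_normal((L, L))
Rx = A @ A.T + np.eye(L)
C = rng.standard_normal((L, L))
Rv = C @ C.T + np.eye(L)

lam, B = eigh(Rx, Rv)
H_wiener = Rx @ np.linalg.inv(Rx + Rv)     # classical Wiener matrix
# np.linalg.solve(B.T, D) computes B^{-T} @ D without forming the inverse.
H_jd = np.linalg.solve(B.T, np.diag(lam / (1.0 + lam))) @ B.T

assert np.allclose(H_wiener, H_jd)
```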
5 Simulations
In this section, the filter design with and without distortion is evaluated through simulations. Firstly, the distortionless case is considered in order to verify that the basics of the filter design hold and that the filter works as expected. Secondly, we turn to the filter design with distortion to investigate the influence of the input SNR and the choice of P′ on the output SNR and the speech distortion index.
where M is the model order, A_m > 0 and φ_m ∈ [0, 2π] are the amplitude and phase of the m-th harmonic, f_0 ∈ [0, f_s/(2M)] is the fundamental frequency, and f_s is the sampling frequency. The rank of the signal covariance matrix, R_x, is then P = 2M. In the simulations, M = 5, the amplitudes decrease with the frequency, f, as 1/f, normalized to give A_1 = 1, the fundamental frequency is chosen randomly such that f_0 ∈ [150, 250] Hz, the sampling frequency is 8 kHz, and the phases are random. The covariance matrices R_x and R_v are estimated from segments of 230 samples and are updated along with the filter for each sample. The number of samples is 1,000.
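The synthetic test signal can be generated as follows. Variable names and the snapshot-based rank check are my own; only the stated parameters (M = 5, 1/f amplitudes, f_0 ∈ [150, 250] Hz, f_s = 8 kHz, 1,000 samples) come from the text:

```python
# Synthetic voiced-speech signal: a sum of M harmonics of a random
# fundamental, with 1/f amplitude decay, as described in the simulations.
import numpy as np

rng = np.random.default_rng(3)
fs = 8000                          # sampling frequency [Hz]
M = 5                              # number of harmonics, so P = 2M = 10
N = 1000                           # number of samples
f0 = rng.uniform(150.0, 250.0)     # fundamental frequency [Hz]

k = np.arange(N)
m = np.arange(1, M + 1)
A = 1.0 / m                        # amplitudes fall off as 1/f; A_1 = 1
phi = rng.uniform(0.0, 2 * np.pi, M)

# x(k) = sum_m A_m cos(2 pi m f0 k / fs + phi_m)
x = (A[:, None] * np.cos(2 * np.pi * np.outer(m, k) * f0 / fs
                         + phi[:, None])).sum(axis=0)

# The covariance matrix of length-20 snapshots of x has rank 2M = 10,
# matching the stated rank of R_x.
Y = np.lib.stride_tricks.sliding_window_view(x, 20)
R_hat = Y.T @ Y / Y.shape[0]
s = np.linalg.svd(R_hat, compute_uv=False)
assert (s > 1e-8 * s[0]).sum() == 2 * M
```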
PESQ scores at different filter lengths and SNRs for P′ = 31
PESQ scores for different values of P′ (1, 11, 21, 31, 41, 51, 61) and SNRs for a filter length of 110
A feature of the proposed filter, which is not explored here, is the possibility of choosing different values of P′ over time. The optimal value of P′ depends on whether the speech is voiced or unvoiced, and how many harmonics there are in the voiced parts. By adapting the value of P′ at each time step based on this information, it should be possible to simultaneously achieve a higher SNR and a lower distortion.
6 Conclusion
In this paper, we have presented a new perspective on time-domain single-channel noise reduction based on forming filters from the eigenvectors that diagonalize both the desired and noise signal covariance matrices. These filters are chosen so that they provide an estimate of the noise signal when applied to the observed signal. Then, by subtraction of the noise estimate from the observed signal, an estimate of the desired signal can be obtained. Two cases have been considered, namely one where no distortion is allowed on the desired signal and one where distortion is allowed. The former case applies to desired signals whose covariance matrix has a rank that can be assumed to be less than that of the observed signal covariance matrix, which is, for example, the case for voiced speech. The latter case applies to desired signals that have a full-rank covariance matrix. In this case, the only way to achieve noise reduction is by also allowing for distortion on the desired signal. The amount of distortion introduced depends on a parameter corresponding to the rank of an implicit approximation of the desired signal covariance matrix. As such, it is relatively easy to control the trade-off between noise reduction and speech distortion. Experiments on real and synthetic signals have confirmed these principles and demonstrated how it is, in fact, possible to achieve a higher output signal-to-noise ratio or a lower signal distortion index with the proposed method than with the classical Wiener filter. Moreover, the results show that only a small loss in output signal-to-noise ratio is incurred when no distortion can be accepted, as long as the filter is not too short. The results also show that when distortion is allowed on the desired signal, the amount of distortion is independent of the input signal-to-noise ratio.
The presented perspective is promising in that it unifies the ideas behind subspace methods and optimal filtering, two methodologies that have traditionally been seen as quite different.
Acknowledgements
This research was supported by the Villum Foundation and the Danish Council for Independent Research, grant ID: DFF-1337-00084.
References
1. Benesty J, Chen J: Optimal Time-Domain Noise Reduction Filters – A Theoretical Study, vol. VII. Heidelberg: Springer; 2011.
2. Boll S: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust., Speech, Signal Process. 1979, 27(2):113-120. doi:10.1109/TASSP.1979.1163209
3. McAulay RJ, Malpass ML: Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. Acoust., Speech, Signal Process. 1980, 28(2):137-145. doi:10.1109/TASSP.1980.1163394
4. Ephraim Y, Malah D: Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Process. 1985, 33(2):443-445. doi:10.1109/TASSP.1985.1164550
5. Srinivasan S, Samuelsson J, Kleijn WB: Codebook-based Bayesian speech enhancement for nonstationary environments. IEEE Trans. Audio, Speech, Lang. Process. 2007, 15(2):441-452.
6. Ephraim Y, Van Trees HL: A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process. 1995, 3(4):251-266. doi:10.1109/89.397090
7. Jensen SH, Hansen PC, Hansen SD, Sørensen JA: Reduction of broad-band noise in speech by truncated QSVD. IEEE Trans. Speech Audio Process. 1995, 3(6):439-448. doi:10.1109/89.482211
8. Benesty J, Chen J, Huang Y, Cohen I: Noise Reduction in Speech Processing. Heidelberg: Springer; 2009.
9. Loizou P: Speech Enhancement: Theory and Practice. Boca Raton: CRC Press; 2007.
10. Hansen PC, Jensen SH: Subspace-based noise reduction for speech signals via diagonal and triangular matrix decompositions: survey and analysis. EURASIP J. Adv. Signal Process. 2007, 2007(1):24.
11. Rangachari S, Loizou P: A noise estimation algorithm for highly nonstationary environments. Speech Commun. 2006, 28:220-231.
12. Cohen I: Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. IEEE Trans. Speech Audio Process. 2003, 11(5):466-475. doi:10.1109/TSA.2003.811544
13. Gerkmann T, Hendriks RC: Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Trans. Audio, Speech, Lang. Process. 2012, 20(4):1383-1393.
14. Hendriks RC, Heusdens R, Jensen J, Kjems U: Low complexity DFT-domain noise PSD tracking using high-resolution periodograms. EURASIP J. Appl. Signal Process. 2009, 2009(1):15.
15. Doclo S, Moonen M: GSVD-based optimal filtering for single and multimicrophone speech enhancement. IEEE Trans. Signal Process. 2002, 50(9):2230-2244. doi:10.1109/TSP.2002.801937
16. Souden M, Benesty J, Affes S: On optimal frequency-domain multichannel linear filtering for noise reduction. IEEE Trans. Audio, Speech, Lang. Process. 2010, 18(2):260-276.
17. Benesty J, Souden M, Chen J: A perspective on multichannel noise reduction in the time domain. Appl. Acoust. 2013, 74(3):343-355. doi:10.1016/j.apacoust.2012.08.002
18. Hendriks RC, Gerkmann T: Noise correlation matrix estimation for multi-microphone speech enhancement. IEEE Trans. Audio, Speech, Lang. Process. 2012, 20(1):223-233.
19. Christensen MG, Jakobsson A: Optimal filter designs for separating and enhancing periodic signals. IEEE Trans. Signal Process. 2010, 58(12):5969-5983.
20. Jensen JR, Benesty J, Christensen MG, Jensen SH: Enhancement of single-channel periodic signals in the time-domain. IEEE Trans. Audio, Speech, Lang. Process. 2012, 20(7):1948-1963.
21. Jensen JR, Benesty J, Christensen MG, Jensen SH: Non-causal time-domain filters for single-channel noise reduction. IEEE Trans. Audio, Speech, Lang. Process. 2012, 20(5):1526-1541.
22. Griffiths LJ, Jim CW: An alternative approach to linearly constrained adaptive beamforming. IEEE Trans. Antennas Propag. 1982, 30(1):27-34. doi:10.1109/TAP.1982.1142739
23. Franklin JN: Matrix Theory. New York: Prentice-Hall; 1968.
24. Jensen J, Hansen JHL: Speech enhancement using a constrained iterative sinusoidal model. IEEE Trans. Speech Audio Process. 2001, 9(7):731-740. doi:10.1109/89.952491
25. Deller JR, Hansen JHL, Proakis JG: Discrete-Time Processing of Speech Signals. New York: Wiley; 2000.
26. Plante F, Meyer GF, Ainsworth WA: A pitch extraction reference database. In Proc. Eurospeech, Madrid, Spain, 18–21 September 1995:837-840.
27. Pearce D, Hirsch HG: The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In Proc. Int. Conf. Spoken Language Process., Beijing, China, 16–20 October 2000:29-32.
28. Hu Y, Loizou P: Evaluation of objective quality measures for speech enhancement. IEEE Trans. Speech Audio Process. 2008, 16(1):229-238.
29. Cui X, Alwan A: Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR. IEEE Trans. Speech Audio Process. 2005, 13(6):1161-1172.
30. Lippmann RP: Speech recognition by machines and humans. Speech Commun. 1997, 22(1):1-15. doi:10.1016/S0167-6393(97)00021-6
31. Huerta JM, Stern RM: Distortion-class weighted acoustic modeling for robust speech recognition under GSM RPE-LTP coding. In Proc. Robust Methods for Speech Recognition in Adverse Conditions, Tampere, Finland, 25–26 May 1999:11-14.
32. Takiguchi T, Ariki Y: PCA-based speech enhancement for distorted speech recognition. J. Multimedia 2007, 2(5):13-18.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.