Single-channel noise reduction using unified joint diagonalization and optimal filtering

Nørholm, Sidsel Marie; Benesty, Jacob; Jensen, Jesper Rindom; Christensen, Mads Græsbøll

doi:10.1186/1687-6180-2014-37

Research
Open access
Published: 26 March 2014

Sidsel Marie Nørholm¹,
Jacob Benesty¹,²,
Jesper Rindom Jensen¹ &
Mads Græsbøll Christensen¹

EURASIP Journal on Advances in Signal Processing volume 2014, Article number: 37 (2014)

Abstract

In this paper, the important problem of single-channel noise reduction is treated from a new perspective. The problem is posed as a filtering problem based on joint diagonalization of the covariance matrices of the desired and noise signals. More specifically, the eigenvectors from the joint diagonalization corresponding to the least significant eigenvalues are used to form a filter, which effectively estimates the noise when applied to the observed signal. This estimate is then subtracted from the observed signal to form an estimate of the desired signal, i.e., the speech signal. In doing this, we consider two cases, where, respectively, no distortion and distortion are incurred on the desired signal. The former can be achieved when the covariance matrix of the desired signal is rank deficient, which is the case, for example, for voiced speech. In the latter case, the covariance matrix of the desired signal is full rank, as is the case, for example, in unvoiced speech. Here, the amount of distortion incurred is controlled via a simple, integer parameter, and the more distortion allowed, the higher the output signal-to-noise ratio (SNR). Simulations demonstrate the properties of the two solutions. In the distortionless case, the proposed filter achieves only a slightly worse output SNR, compared to the Wiener filter, along with no signal distortion. Moreover, when distortion is allowed, it is possible to achieve higher output SNRs compared to the Wiener filter. Alternatively, when a lower output SNR is accepted, a filter with less signal distortion than the Wiener filter can be constructed.

1 Introduction

Speech signals corrupted by additive noise suffer from a lower perceived quality and lower intelligibility than their clean counterparts and cause listeners to suffer from fatigue after extended exposure. Moreover, speech processing systems are frequently designed under the assumption that only a single, clean speech signal is present at a time. For these reasons, noise reduction plays an important role in many communication and speech processing systems and continues to be an active research topic today. Over the years, many different methods for noise reduction have been introduced, including optimal filtering methods [1], spectral subtractive methods [2], statistical methods [3–5], and subspace methods [6, 7]. For an overview of methods for noise reduction, we refer the interested reader to [1, 8, 9] and to [10] for a recent and complete overview of applications of subspace methods to noise reduction.

In the past decade or so, most efforts in relation to noise reduction seem to have been devoted to tracking of noise power spectral densities [11–14] to allow for better noise reduction during speech activity, extensions of noise reduction methods to multiple channels [15–18], and improved optimal filtering techniques for noise reduction [1, 8, 19–21]. However, little progress has been made on subspace methods.

In this paper, we explore the noise reduction problem from a different perspective in the context of single-channel noise reduction in the time domain. This perspective is different from traditional approaches in several respects. Firstly, it combines the ideas behind subspace methods and optimal filtering via joint diagonalization of the desired and noise signal covariance matrices. Since joint diagonalization is used, the method will work for all kinds of noise, as opposed to, e.g., when an eigenvalue decomposition is used where preprocessing has to be performed when the noise is not white. Secondly, the perspective is based on obtaining estimates of the noise signal by filtering of the observed signal and, thereafter, subtracting the estimate of the noise from the observed signal. This is the opposite of the normal filtering approach, where the observed signal is filtered to obtain the estimate of the desired signal directly. The idea of first estimating the noise is known from the generalized sidelobe canceller technique in a multichannel scenario [22]. Thirdly, when the covariance matrix of the desired signal has a rank that is lower than that of the observed signal, the perspective leads to filters that can be formed such that no distortion is incurred on the desired signal; alternatively, distortion can be introduced so that more noise reduction is achieved. The amount of distortion introduced can be controlled via a simple, integer parameter.

The rest of the paper is organized as follows. In Section 2, the basic signal model and the joint diagonalization perspective are introduced, and the problem of interest is stated. We then proceed, in Section 3, to introduce the noise reduction approach for the case where no distortion is incurred on the desired signal. This applies in cases where the rank of the observed signal covariance matrix exceeds that of the desired signal covariance matrix. In Section 4, we then relax the requirement of no distortion on the desired signal to obtain filters that can be applied more generally, i.e., when the ranks of the observed and desired signals are the same. Simulation results demonstrating the properties of the obtained noise reduction filters are presented in Section 5, whereafter we conclude on the work in Section 6.

2 Signal model and problem formulation

The speech enhancement (or noise reduction) problem considered in this work is the one of recovering the desired (speech) signal x(k), k being the discrete-time index, from the noisy observation (sensor signal) [1, 8, 9]:

y (k) = x (k) + v (k),

(1)

where v(k) is the unwanted additive noise which is assumed to be uncorrelated with x(k). All signals are considered to be real, zero mean, broadband, and stationary.

The signal model given in (1) can be put into a vector form by considering the L most recent successive time samples of the noisy signal, i.e.,

y (k) = x (k) + v (k),

(2)

where

y (k) = {[\begin{matrix} y (k) & y (k - 1) & \dots & y (k - L + 1) \end{matrix}]}^{T}

(3)

is a vector of length L, the superscript ^T denotes transpose of a vector or a matrix, and x(k) and v(k) are defined in a similar way to y(k) from (3). Since x(k) and v(k) are uncorrelated by assumption, the covariance matrix (of size L×L) of the noisy signal can be written as

R_{y} = E [y (k) y^{T} (k)] = R_{x} + R_{v},

(4)

where E[·] denotes mathematical expectation, and R_x=E[x(k)x^T(k)] and R_v=E[v(k)v^T(k)] are the covariance matrices of x(k) and v(k), respectively. The noise covariance matrix, R_v, is assumed to be full rank, i.e., its rank is equal to L. In the rest of the paper, we assume that the rank of the speech covariance matrix, R_x, is equal to P≤L. Then, the objective of speech enhancement (or noise reduction) is to estimate the desired signal sample, x(k), from the observation vector, y(k). This should be done in such a way that the noise is reduced as much as possible with little or no distortion of the desired signal.
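As a rough illustration of (2)-(4), the sketch below stacks the L most recent samples into vectors and averages their outer products. The helper names `frame_vectors` and `sample_covariance`, the toy signals, and the plain sample-average estimator are assumptions made for illustration only; no particular estimator is prescribed at this point in the paper.

```python
import numpy as np

def frame_vectors(s, L):
    """Rows are [s(k), s(k-1), ..., s(k-L+1)] for k = L-1, ..., len(s)-1."""
    s = np.asarray(s, dtype=float)
    return np.stack([s[k - L + 1:k + 1][::-1] for k in range(L - 1, len(s))])

def sample_covariance(s, L):
    """Sample estimate of E[s(k) s^T(k)] for a zero-mean signal."""
    F = frame_vectors(s, L)
    return F.T @ F / F.shape[0]

rng = np.random.default_rng(0)
k = np.arange(20000)
x = np.cos(0.3 * k + 1.0)          # toy "desired" signal (assumption)
v = rng.standard_normal(k.size)    # toy white noise, independent of x
L = 4
Rx, Rv, Ry = (sample_covariance(s, L) for s in (x, v, x + v))

# For uncorrelated x and v, Ry is close to Rx + Rv; the remaining gap
# comes from finite-sample cross terms.
print(np.max(np.abs(Ry - (Rx + Rv))))
```

With exact expectations the cross terms vanish and (4) holds with equality; with sample averages it holds only approximately.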

Using the joint diagonalization technique [23], the two symmetric matrices R_x and R_v can be jointly diagonalized as follows:

\begin{align} B^{T} R_{x} B = Λ, \end{align}

(5)

\begin{align} B^{T} R_{v} B = I_{L}, \end{align}

(6)

where B is a full-rank square matrix (of size L×L), Λ is a diagonal matrix whose main elements are real and nonnegative, and I_L is the L×L identity matrix. Furthermore, Λ and B are the eigenvalue and eigenvector matrices, respectively, of $R_{v}^{- 1} R_{x}$ , i.e.,

R_{v}^{- 1} R_{x} B = B Λ .

(7)

Since R_x is positive semidefinite and its rank is equal to P, the eigenvalues of $R_{v}^{- 1} R_{x}$ can be ordered as λ₁≥λ₂≥⋯≥λ_P>λ_{P+1}=⋯=λ_L=0. In other words, the last L−P eigenvalues of the matrix product $R_{v}^{- 1} R_{x}$ are exactly zero, while its first P eigenvalues are positive, with λ₁ being the maximum eigenvalue. We denote by b₁,b₂,…,b_L, the corresponding eigenvectors. The noisy signal covariance matrix can also be diagonalized as

B^{T} R_{y} B = Λ + I_{L} .

(8)
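One possible way to compute the joint diagonalization in (5)-(8) numerically is to whiten with a Cholesky factor of R_v and then eigendecompose the whitened R_x. The sketch below illustrates that route; it is not necessarily the procedure used for the simulations in this paper. The function names `joint_diagonalize` and `toeplitz_cov` and the toy covariance matrices are hypothetical.

```python
import numpy as np

def joint_diagonalize(Rx, Rv):
    """Return (lam, B) with B.T @ Rx @ B = diag(lam), B.T @ Rv @ B = I,
    and lam in descending order. Rv must be symmetric positive definite."""
    Ci = np.linalg.inv(np.linalg.cholesky(Rv))   # Rv = C C^T, Ci = C^{-1}
    lam, U = np.linalg.eigh(Ci @ Rx @ Ci.T)      # eigh returns ascending order
    return lam[::-1], Ci.T @ U[:, ::-1]

def toeplitz_cov(r):
    """Symmetric Toeplitz covariance matrix from autocorrelation lags r(0..L-1)."""
    n = np.arange(len(r))
    return np.asarray(r, dtype=float)[np.abs(n[:, None] - n[None, :])]

L, P = 8, 4                                      # toy sizes (assumptions)
tau = np.arange(L)
# Two real sinusoids with random phases give a rank-4 covariance matrix
Rx = toeplitz_cov(0.5 * np.cos(0.6 * tau) + 0.125 * np.cos(1.2 * tau))
Rv = toeplitz_cov(0.5 ** tau)                    # colored, full-rank noise
lam, B = joint_diagonalize(Rx, Rv)
print(np.round(lam, 6))                          # the last L - P values are ~0
```

Because the whitening absorbs R_v, no separate prewhitening step is needed for colored noise, which matches the motivation given in Section 1.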

We end this section by defining the input and output signal-to-noise ratios (SNRs):

iSNR = \frac{tr (R_{x})}{tr (R_{v})} = \frac{σ_{x}^{2}}{σ_{v}^{2}},

(9)

where tr(·) denotes the trace of a square matrix, and $σ_{x}^{2} = E [x^{2} (k)]$ and $σ_{v}^{2} = E [v^{2} (k)]$ are the variances of x(k) and v(k), respectively, and

\begin{align} {oSNR}_{nr} (h) = \frac{σ_{x, nr}^{2}}{σ_{v, nr}^{2}}, \end{align}

(10)

where h is a filter applied to the observation signal (see Section 3), and $σ_{x, nr}^{2}$ and $σ_{v, nr}^{2}$ are the variances of x(k) and v(k) after noise reduction.

3 Noise reduction filtering without distortion

In this section, we assume that P<L; as a result, the speech covariance matrix is rank deficient.

The approach proposed here is based on two successive stages. Firstly, we apply the filter of length L:

h = {[\begin{matrix} h_{0} & h_{1} & \dots & h_{L - 1} \end{matrix}]}^{T}

(11)

to the observation signal vector, y(k), to get the filter output:

z (k) = h^{T} y (k) = h^{T} x (k) + h^{T} v (k) .

(12)

From (4) and (12), we deduce that the output SNR from the filter is

{oSNR}_{f} (h) = \frac{σ_{x, f}^{2}}{σ_{v, f}^{2}} = \frac{h^{T} R_{x} h}{h^{T} R_{v} h},

(13)

which, in this case, is not the same as the output SNR after noise reduction stated in (10). Since the objective is to estimate the noise, we find h that minimizes oSNR_f(h). Due to the relation $b_{i}^{T} R_{x} b_{i} = λ_{i}$ , it is easy to see that the solution is

h_{P} = \sum_{i = P + 1}^{L} β_{i} b_{i},

(14)

where β_i, i=P+1,…,L, are arbitrary real numbers with at least one of them different from 0. With the filter having the form of (14), oSNR_f(h_P)=0 and z(k) can be seen as an estimate of the noise, $\hat{v} (k) = z (k) = h_{P}^{T} y (k)$ .
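The claim that any filter of the form (14) gives oSNR_f(h_P)=0 can be checked numerically. The following is a minimal sketch under assumed toy covariance matrices; the helper names and the arbitrary β values are hypothetical.

```python
import numpy as np

def joint_diagonalize(Rx, Rv):
    """B.T @ Rx @ B = diag(lam) (descending), B.T @ Rv @ B = I."""
    Ci = np.linalg.inv(np.linalg.cholesky(Rv))
    lam, U = np.linalg.eigh(Ci @ Rx @ Ci.T)
    return lam[::-1], Ci.T @ U[:, ::-1]

def toeplitz_cov(r):
    n = np.arange(len(r))
    return np.asarray(r, dtype=float)[np.abs(n[:, None] - n[None, :])]

L, P = 8, 4
tau = np.arange(L)
Rx = toeplitz_cov(0.5 * np.cos(0.6 * tau) + 0.125 * np.cos(1.2 * tau))  # rank P
Rv = toeplitz_cov(0.5 ** tau)
lam, B = joint_diagonalize(Rx, Rv)

beta = np.array([1.0, -2.0, 0.5, 3.0])   # arbitrary, not all zero
h = B[:, P:] @ beta                      # h_P = sum_{i > P} beta_i b_i, as in (14)
osnr_f = (h @ Rx @ h) / (h @ Rv @ h)     # (13)
print(osnr_f)                            # zero up to rounding error
```

The denominator equals the sum of the squared β values because B^T R_v B = I_L, so the filter output contains noise only.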

Secondly, we estimate the desired signal, x(k), as

\begin{align} \hat{x} (k) = y (k) - \hat{v} (k) = x (k) + v (k) - \sum_{i = P + 1}^{L} β_{i} b_{i}^{T} v (k) . \end{align}

(15)

An overview of the estimation process is shown in the block diagram in Figure 1.

Now, we find the β_i’s that minimize the power of the residual noise, i.e.,

\begin{align} J_{rn} & = E \{{[v (k) - \sum_{i = P + 1}^{L} β_{i} b_{i}^{T} v (k)]}^{2}\} \\ & = σ_{v}^{2} - 2 \sum_{i = P + 1}^{L} β_{i} i_{L}^{T} R_{v} b_{i} + \sum_{i = P + 1}^{L} β_{i}^{2}, \end{align}

(16)

where i_L is the first column of the L×L identity matrix. We get

β_{i} = i_{L}^{T} R_{v} b_{i} .

(17)

Substituting (17) into (15), the estimator becomes

\begin{align} \hat{x} (k) & = x (k) + v (k) - \sum_{i = P + 1}^{L} i_{L}^{T} R_{v} b_{i} b_{i}^{T} v (k) \\ & = x (k) + v (k) - i_{L}^{T} R_{v} (R_{v}^{- 1} - \sum_{p = 1}^{P} b_{p} b_{p}^{T}) v (k) \\ & = x (k) + \sum_{p = 1}^{P} i_{L}^{T} R_{v} b_{p} b_{p}^{T} v (k) . \end{align}

(18)

The variance of $\hat{x} (k)$ is

\begin{align} σ_{\hat{x}}^{2} = σ_{x}^{2} + σ_{v}^{2} - \sum_{i = P + 1}^{L} {(i_{L}^{T} R_{v} b_{i})}^{2} = σ_{x}^{2} + \sum_{p = 1}^{P} {(i_{L}^{T} R_{v} b_{p})}^{2} . \end{align}

(19)

We deduce that the output SNR after noise reduction is

\begin{align} {oSNR}_{nr} (h_{P}) & = \frac{σ_{x}^{2}}{σ_{v}^{2} - \sum_{i = P + 1}^{L} {(i_{L}^{T} R_{v} b_{i})}^{2}} \\ & = \frac{σ_{x}^{2}}{\sum_{p = 1}^{P} {(i_{L}^{T} R_{v} b_{p})}^{2}} \geq iSNR . \end{align}

(20)

It is clear that the larger L−P is, the larger is the value of the output SNR. Also, from (18), we observe that the desired signal is not distorted so that the speech distortion index [1] is

υ_{sd} (h_{P}) = \frac{E \{{[x_{nr} (k) - x (k)]}^{2}\}}{E [x^{2} (k)]} = \frac{E \{{[h_{P}^{T} x (k)]}^{2}\}}{E [x^{2} (k)]} = 0 .

(21)

The noise reduction factor [1] is

\begin{align} ξ_{nr} (h_{P}) = \frac{σ_{v}^{2}}{σ_{v, nr}^{2}} = \frac{σ_{v}^{2}}{σ_{v}^{2} - \sum_{i = P + 1}^{L} {(i_{L}^{T} R_{v} b_{i})}^{2}}, \end{align}

(22)

and since there is no signal distortion, we also have the relation:

\frac{{oSNR}_{nr} (h_{P})}{iSNR} = ξ_{nr} (h_{P}) .

(23)
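The quantities in (17)-(23) can be evaluated directly from the covariance matrices. The sketch below does so for assumed toy Toeplitz covariances (a rank-4 harmonic R_x and colored noise); all names are hypothetical, and the numbers are not the simulation results of Section 5.

```python
import numpy as np

def joint_diagonalize(Rx, Rv):
    """B.T @ Rx @ B = diag(lam) (descending), B.T @ Rv @ B = I."""
    Ci = np.linalg.inv(np.linalg.cholesky(Rv))
    lam, U = np.linalg.eigh(Ci @ Rx @ Ci.T)
    return lam[::-1], Ci.T @ U[:, ::-1]

def toeplitz_cov(r):
    n = np.arange(len(r))
    return np.asarray(r, dtype=float)[np.abs(n[:, None] - n[None, :])]

L, P = 8, 4
tau = np.arange(L)
Rx = toeplitz_cov(0.5 * np.cos(0.6 * tau) + 0.125 * np.cos(1.2 * tau))  # rank P
Rv = toeplitz_cov(0.5 ** tau)
sigma_x2, sigma_v2 = Rx[0, 0], Rv[0, 0]
lam, B = joint_diagonalize(Rx, Rv)

g = Rv[0] @ B                       # g[i] = i_L^T R_v b_i for every i
beta = g[P:]                        # optimal coefficients, (17)
h = B[:, P:] @ beta                 # noise-estimating filter h_P
e = np.zeros(L); e[0] = 1.0         # i_L, first column of the identity

res_noise = sigma_v2 - np.sum(beta ** 2)      # denominator of (20)
res_direct = (e - h) @ Rv @ (e - h)           # same quantity, computed directly
isnr = sigma_x2 / sigma_v2                    # (9)
osnr = sigma_x2 / res_noise                   # (20)
v_sd = (h @ Rx @ h) / sigma_x2                # (21)
xi_nr = sigma_v2 / res_noise                  # (22)
print(10 * np.log10(osnr / isnr), v_sd)
```

The two forms of the denominator in (20) agree because the squared coefficients over all L eigenvectors sum to σ_v².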

From (18), we find a class of distortionless estimators:

{\hat{x}}_{Q} (k) = x (k) + \sum_{q = 1}^{Q} i_{L}^{T} R_{v} b_{q} b_{q}^{T} v (k),

(24)

where P≤Q≤L. We have ${\hat{x}}_{P} (k) = \hat{x} (k)$ and ${\hat{x}}_{L} (k) = y (k)$ . The latter is the observation signal itself. It is obvious that the output SNR corresponding to ${\hat{x}}_{Q} (k)$ is

{oSNR}_{nr} (h_{Q}) = \frac{σ_{x}^{2}}{\sum_{q = 1}^{Q} {(i_{L}^{T} R_{v} b_{q})}^{2}} \geq iSNR

(25)

and

\begin{align} {oSNR}_{nr} (h_{P}) \geq {oSNR}_{nr} (h_{P + 1}) \geq {oSNR}_{nr} (h_{L}) = iSNR . \end{align}

(26)
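The ordering in (26) follows from the denominator of (25) being a cumulative sum of nonnegative terms. A small numerical sketch, again with assumed toy covariance matrices and hypothetical helper names, is:

```python
import numpy as np

def joint_diagonalize(Rx, Rv):
    """B.T @ Rx @ B = diag(lam) (descending), B.T @ Rv @ B = I."""
    Ci = np.linalg.inv(np.linalg.cholesky(Rv))
    lam, U = np.linalg.eigh(Ci @ Rx @ Ci.T)
    return lam[::-1], Ci.T @ U[:, ::-1]

def toeplitz_cov(r):
    n = np.arange(len(r))
    return np.asarray(r, dtype=float)[np.abs(n[:, None] - n[None, :])]

L, P = 8, 4
tau = np.arange(L)
Rx = toeplitz_cov(0.5 * np.cos(0.6 * tau) + 0.125 * np.cos(1.2 * tau))  # rank P
Rv = toeplitz_cov(0.5 ** tau)
sigma_x2, sigma_v2 = Rx[0, 0], Rv[0, 0]
lam, B = joint_diagonalize(Rx, Rv)

g2 = (Rv[0] @ B) ** 2                          # (i_L^T R_v b_q)^2 for q = 1..L
osnr_Q = sigma_x2 / np.cumsum(g2)[P - 1:]      # (25) for Q = P, P+1, ..., L
for Q, val in zip(range(P, L + 1), osnr_Q):
    print(Q, round(10 * np.log10(val), 3))     # output SNR in dB
```

Only Q≥P is evaluated, since the estimator in (24) is distortionless only in that range.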

4 Noise reduction filtering with distortion

In this section, we assume that the speech covariance matrix is full rank, i.e., equal to L. We can still use the method presented in the previous section, but this time we should expect distortion of the desired signal.

Again, we apply the filter:

h^{'} = {[\begin{matrix} h_{0}^{'} & h_{1}^{'} & \dots & h_{L - 1}^{'} \end{matrix}]}^{T}

(27)

of length L to the observation signal vector. Then, the filter output and output SNR are, respectively,

z^{'} (k) = h^{' T} x (k) + h^{' T} v (k)

(28)

and

{oSNR}_{f} (h^{'}) = \frac{h^{' T} R_{x} h^{'}}{h^{' T} R_{v} h^{'}} .

(29)

Now, we choose

h_{P^{'}}^{'} = \sum_{i = P^{'} + 1}^{L} β_{i}^{'} b_{i},

(30)

where $β_{i}^{'}$ , i=P^′+1,…,L, are arbitrary real numbers. With this choice of h^′, the output SNR becomes

{oSNR}_{f} (h_{P^{'}}^{'}) = \frac{\sum_{i = P^{'} + 1}^{L} β_{i}^{' 2} λ_{i}}{\sum_{i = P^{'} + 1}^{L} β_{i}^{' 2}} .

(31)

This time, however, the output SNR cannot be equal to 0, but we can make it as small as we desire. The larger the value of ${oSNR}_{f} (h_{P^{'}}^{'})$ , the more the speech signal is distorted. If we can tolerate a small amount of distortion, then we can still consider z^′(k) as an estimate of the noise, ${\hat{v}}^{'} (k) = z^{'} (k) = h_{P^{'}}^{' T} y (k)$ .

In the second stage, we estimate the desired signal as

\begin{align} {\hat{x}}^{'} (k) & = y (k) - {\hat{v}}^{'} (k) \\ & = x (k) - \sum_{i = P^{'} + 1}^{L} β_{i}^{'} b_{i}^{T} x (k) + v (k) - \sum_{i = P^{'} + 1}^{L} β_{i}^{'} b_{i}^{T} v (k) . \end{align}

(32)

By minimizing the power of the residual noise:

\begin{align} J_{rn}^{'} & = E \{{[v (k) - \sum_{i = P^{'} + 1}^{L} β_{i}^{'} b_{i}^{T} v (k)]}^{2}\} \\ & = σ_{v}^{2} - 2 \sum_{i = P^{'} + 1}^{L} β_{i}^{'} i_{L}^{T} R_{v} b_{i} + \sum_{i = P^{'} + 1}^{L} β_{i}^{' 2}, \end{align}

(33)

we find that

β_{i}^{'} = i_{L}^{T} R_{v} b_{i} = \frac{1}{λ_{i}} i_{L}^{T} R_{x} b_{i} .

(34)

Substituting (34) into (32), we obtain

\begin{align} {\hat{x}}^{'} (k) = x (k) & - \sum_{i = P^{'} + 1}^{L} \frac{1}{λ_{i}} i_{L}^{T} R_{x} b_{i} b_{i}^{T} x (k) \\ & + v (k) - \sum_{i = P^{'} + 1}^{L} i_{L}^{T} R_{v} b_{i} b_{i}^{T} v (k) . \end{align}

(35)

The variance of ${\hat{x}}^{'} (k)$ is

\begin{align} σ_{{\hat{x}}^{'}}^{2} = σ_{x}^{2} - \sum_{i = P^{'} + 1}^{L} \frac{1}{λ_{i}} {(i_{L}^{T} R_{x} b_{i})}^{2} + σ_{v}^{2} - \sum_{i = P^{'} + 1}^{L} {(i_{L}^{T} R_{v} b_{i})}^{2} . \end{align}

(36)

We deduce that the output SNR and speech distortion index are, respectively,

{oSNR}_{nr} (h_{P^{'}}^{'}) = \frac{σ_{x}^{2} - \sum_{i = P^{'} + 1}^{L} \frac{1}{λ_{i}} {(i_{L}^{T} R_{x} b_{i})}^{2}}{σ_{v}^{2} - \sum_{i = P^{'} + 1}^{L} {(i_{L}^{T} R_{v} b_{i})}^{2}}

(37)

and

υ_{sd} (h_{P^{'}}^{'}) = \frac{1}{σ_{x}^{2}} \sum_{i = P^{'} + 1}^{L} \frac{1}{λ_{i}} {(i_{L}^{T} R_{x} b_{i})}^{2} .

(38)

The smaller P^′ is compared to L, the larger is the distortion. Further, the speech distortion index is independent of the input SNR, as is the gain in SNR. This can be observed by multiplying either R_x in (5) or R_v in (6) by a constant c, which leads to a corresponding change in the input SNR. Insertion of the resulting λ_i’s and b_i’s in (37) and (38) will show that the output SNR is changed by the factor c and that the speech distortion index is independent of c.
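The scaling argument above can be reproduced numerically. In the sketch below, R_x is multiplied by a constant c and the filter is redesigned each time; the toy full-rank covariance matrices, the chosen P^′, and the function names are assumptions for illustration.

```python
import numpy as np

def joint_diagonalize(Rx, Rv):
    """B.T @ Rx @ B = diag(lam) (descending), B.T @ Rv @ B = I."""
    Ci = np.linalg.inv(np.linalg.cholesky(Rv))
    lam, U = np.linalg.eigh(Ci @ Rx @ Ci.T)
    return lam[::-1], Ci.T @ U[:, ::-1]

def toeplitz_cov(r):
    n = np.arange(len(r))
    return np.asarray(r, dtype=float)[np.abs(n[:, None] - n[None, :])]

def gain_and_distortion(Rx, Rv, Pp):
    """SNR gain oSNR_nr / iSNR and speech distortion index for a given P'."""
    lam, B = joint_diagonalize(Rx, Rv)
    h = B[:, Pp:] @ (Rv[0] @ B)[Pp:]          # coefficients from (34)
    e = np.zeros(len(Rx)); e[0] = 1.0
    osnr = ((e - h) @ Rx @ (e - h)) / ((e - h) @ Rv @ (e - h))
    isnr = Rx[0, 0] / Rv[0, 0]
    return osnr / isnr, (h @ Rx @ h) / Rx[0, 0]

L, Pp = 12, 4
tau = np.arange(L)
Rx = toeplitz_cov(0.9 ** tau)                 # full-rank, AR(1)-like (assumption)
Rv = toeplitz_cov(0.3 ** tau)
results = [gain_and_distortion(c * Rx, Rv, Pp) for c in (0.1, 1.0, 10.0)]
for gain, v_sd in results:
    print(10 * np.log10(gain), 10 * np.log10(v_sd))
```

Scaling R_x scales the λ_i's by c but leaves B, and hence the filter, unchanged, which is why the gain and distortion printed for each c coincide.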

The output SNR and the speech distortion index are related as follows:

\frac{{oSNR}_{nr} (h_{P^{'}}^{'})}{iSNR} = [1 - υ_{sd} (h_{P^{'}}^{'})] ξ_{nr} (h_{P^{'}}^{'}),

(39)

where

ξ_{nr} (h_{P^{'}}^{'}) = \frac{σ_{v}^{2}}{σ_{v}^{2} - \sum_{i = P^{'} + 1}^{L} {(i_{L}^{T} R_{v} b_{i})}^{2}}

(40)

is the noise reduction factor.
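The closed-form expressions (37), (38), and (40) and the relation (39) can be evaluated for every P^′ once the joint diagonalization is available. The following sketch does this for assumed toy covariance matrices; the names and values are hypothetical and serve only to illustrate the trade-off controlled by P^′.

```python
import numpy as np

def joint_diagonalize(Rx, Rv):
    """B.T @ Rx @ B = diag(lam) (descending), B.T @ Rv @ B = I."""
    Ci = np.linalg.inv(np.linalg.cholesky(Rv))
    lam, U = np.linalg.eigh(Ci @ Rx @ Ci.T)
    return lam[::-1], Ci.T @ U[:, ::-1]

def toeplitz_cov(r):
    n = np.arange(len(r))
    return np.asarray(r, dtype=float)[np.abs(n[:, None] - n[None, :])]

L = 12
tau = np.arange(L)
Rx = toeplitz_cov(0.9 ** tau)                  # full-rank desired-signal covariance
Rv = toeplitz_cov(0.3 ** tau)
sx2, sv2 = Rx[0, 0], Rv[0, 0]
isnr = sx2 / sv2
lam, B = joint_diagonalize(Rx, Rv)
gv = Rv[0] @ B                                 # i_L^T R_v b_i
gx = Rx[0] @ B                                 # i_L^T R_x b_i

osnr, v_sd, xi_nr = [], [], []
for Pp in range(1, L + 1):                     # P' = L means no filtering at all
    sig = sx2 - np.sum(gx[Pp:] ** 2 / lam[Pp:])    # numerator of (37)
    noi = sv2 - np.sum(gv[Pp:] ** 2)               # denominator of (37)
    osnr.append(sig / noi)
    v_sd.append(np.sum(gx[Pp:] ** 2 / lam[Pp:]) / sx2)   # (38)
    xi_nr.append(sv2 / noi)                              # (40)
osnr, v_sd, xi_nr = map(np.array, (osnr, v_sd, xi_nr))
print(np.round(10 * np.log10(osnr / isnr), 2))
print(np.round(v_sd, 4))
```

P^′=0 is left out of the loop because, with the coefficients in (34), both the residual signal and the residual noise then vanish and the ratio in (37) is undefined.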

Interestingly, the exact same estimator is obtained by minimizing the power of the residual desired signal:

\begin{align} J_{rd}^{'} & = E \{{[x (k) - \sum_{i = P^{'} + 1}^{L} β_{i}^{'} b_{i}^{T} x (k)]}^{2}\} \\ & = σ_{x}^{2} - 2 \sum_{i = P^{'} + 1}^{L} β_{i}^{'} i_{L}^{T} R_{x} b_{i} + \sum_{i = P^{'} + 1}^{L} λ_{i} β_{i}^{' 2} . \end{align}

(41)

Again, minimizing $J_{rn}^{'}$ or $J_{rd}^{'}$ leads to the estimator ${\hat{x}}^{'} (k)$ .
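The equivalence can be made explicit by setting the derivatives of (33) and (41) with respect to each coefficient to zero. The short derivation below is a sketch that uses only (7), (33), and (41):

```latex
\begin{align}
\frac{\partial J_{rn}^{'}}{\partial β_{i}^{'}} & = - 2 i_{L}^{T} R_{v} b_{i} + 2 β_{i}^{'} = 0
  \quad \Rightarrow \quad β_{i}^{'} = i_{L}^{T} R_{v} b_{i}, \\
\frac{\partial J_{rd}^{'}}{\partial β_{i}^{'}} & = - 2 i_{L}^{T} R_{x} b_{i} + 2 λ_{i} β_{i}^{'} = 0
  \quad \Rightarrow \quad β_{i}^{'} = \frac{1}{λ_{i}} i_{L}^{T} R_{x} b_{i} .
\end{align}
```

The two solutions coincide because (7) gives R_x b_i = λ_i R_v b_i, which is the identity already used in (34).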

Alternatively, another set of estimators can be obtained by minimizing the mean squared error between x(k) and ${\hat{x}}^{'} (k)$ :

\begin{align} J_{mse}^{'} & = E \{{[v (k) - \sum_{i = P^{'} + 1}^{L} β_{i}^{'} b_{i}^{T} v (k) - \sum_{i = P^{'} + 1}^{L} β_{i}^{'} b_{i}^{T} x (k)]}^{2}\} \\ & = σ_{v}^{2} - 2 \sum_{i = P^{'} + 1}^{L} β_{i}^{'} i_{L}^{T} R_{v} b_{i} + \sum_{i = P^{'} + 1}^{L} (1 + λ_{i}) β_{i}^{' 2}, \end{align}

(42)

which leads to

\begin{align} β_{i}^{'} = \frac{i_{L}^{T} R_{v} b_{i}}{1 + λ_{i}} . \end{align}

(43)

In the special case where P^′=0, the estimator is the well-known Wiener filter.
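The Wiener special case can be checked numerically: with P^′=0 and the coefficients in (43), subtracting the noise estimate from y(k) is equivalent to applying R_y^{-1} R_x i_L to y(k). The sketch below assumes toy covariance matrices and hypothetical helper names.

```python
import numpy as np

def joint_diagonalize(Rx, Rv):
    """B.T @ Rx @ B = diag(lam) (descending), B.T @ Rv @ B = I."""
    Ci = np.linalg.inv(np.linalg.cholesky(Rv))
    lam, U = np.linalg.eigh(Ci @ Rx @ Ci.T)
    return lam[::-1], Ci.T @ U[:, ::-1]

def toeplitz_cov(r):
    n = np.arange(len(r))
    return np.asarray(r, dtype=float)[np.abs(n[:, None] - n[None, :])]

L = 12
tau = np.arange(L)
Rx = toeplitz_cov(0.9 ** tau)
Rv = toeplitz_cov(0.3 ** tau)
lam, B = joint_diagonalize(Rx, Rv)

beta = (Rv[0] @ B) / (1.0 + lam)             # (43), all i = 1..L since P' = 0
h = B @ beta                                 # noise-estimating filter
e = np.zeros(L); e[0] = 1.0
h_equiv = e - h                              # x_hat'(k) = (i_L - h)^T y(k)
h_w = np.linalg.solve(Rx + Rv, Rx[:, 0])     # Wiener filter R_y^{-1} R_x i_L
print(np.max(np.abs(h_equiv - h_w)))         # zero up to rounding error
```

The agreement follows from (8), which implies R_y^{-1} = B (Λ + I_L)^{-1} B^T.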

5 Simulations

In this section, the filter design with and without distortion is evaluated through simulations. Firstly, the distortionless case is considered in order to verify that the basics of the filter design hold and the filter works as expected. Secondly, we turn to the filter design with distortion to investigate the influence of the input SNR and the choice of P^′ on the output SNR and the speech distortion index.

The distortionless filter design was tested by the use of a synthetic harmonic signal. The use of such a signal makes it possible to control the rank of the signal covariance matrix, which is a very important feature in the present study. Further, the harmonic signal model is used to model voiced speech, e.g., in [24]. The harmonic signal model has the form:

x (k) = \sum_{m = 1}^{M} A_{m} cos (m 2 π f_{0} / f_{s} k + ϕ_{m})

(44)

where M is the model order, A_m>0 and ϕ_m∈[0,2π] are the amplitude and phase of the mth harmonic, f₀∈[0,π/m] is the fundamental frequency, and f_s is the sampling frequency. The rank of the signal covariance matrix, R_x, is then P=2M. In the simulations, M=5, the amplitudes decrease with the frequency, f, as 1/f, normalized to give A₁=1, the fundamental frequency is chosen randomly such that f₀∈[150,250] Hz, the sampling frequency is 8 kHz, and the phases are random. The covariance matrices R_x and R_v are estimated from segments of 230 samples and are updated along with the filter for each sample. The number of samples is 1,000.
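A synthetic signal following (44) can be generated as in the sketch below, which also checks the numerical rank of a sample covariance matrix. The fixed f₀=200 Hz, the random seed, the use of all 1,000 samples for a single covariance estimate, and the rank threshold are assumptions for illustration; the simulations instead use sliding segments of 230 samples.

```python
import numpy as np

rng = np.random.default_rng(1)
fs, f0, M, N, L = 8000.0, 200.0, 5, 1000, 110
k = np.arange(N)
m = np.arange(1, M + 1)[:, None]              # harmonic indices as a column
A = 1.0 / m                                   # amplitudes ~ 1/f with A_1 = 1
phi = rng.uniform(0.0, 2.0 * np.pi, (M, 1))   # random phases

# Harmonic model (44): sum of M sinusoids at multiples of f0
x = np.sum(A * np.cos(m * 2.0 * np.pi * f0 / fs * k + phi), axis=0)

# Sample covariance of the vectors [x(k), x(k-1), ..., x(k-L+1)]
F = np.stack([x[i - L + 1:i + 1][::-1] for i in range(L - 1, N)])
Rx = F.T @ F / F.shape[0]

ev = np.linalg.eigvalsh(Rx)
rank = int(np.sum(ev > 1e-8 * ev.max()))      # numerical rank
print(rank, 2 * M)                            # the model predicts P = 2M
```

Each real sinusoid contributes a two-dimensional subspace, which is where the rank P=2M comes from.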

As an example, the spectrum of a synthetic signal is shown in Figure 2 along with the frequency response of the corresponding filter. The fundamental frequency is in this case f₀=200 Hz, and the filter has a length of L=110. After subtraction of the filter output from the noisy observation, the estimate of the desired signal, shown in Figure 3, results. The desired signal and the noisy observation are shown as well. Comparing the signals, it is easily seen that the filtering has improved the output SNR in the estimated signal relative to the noisy observation.

In order to support this, 100 Monte Carlo simulations have been performed for different lengths of the filter, and the performance is evaluated by the output SNR and speech distortion index. The output SNR is calculated according to (10) as the ratio of the variances of the desired signal after noise reduction, [ $x (k) - h_{P}^{T} x (k)$ ], and the noise after noise reduction, [ $v (k) - h_{P}^{T} v (k)$ ], whereas the speech distortion index is calculated according to (21) as the ratio of the variance of the filtered desired signal to the variance of the original desired signal. As seen in Figure 4a, it is definitely possible to increase the SNR, but the extent is highly dependent on the length of the filter. For short filter lengths, the filter has almost no effect and oSNR ≈ iSNR, but as the filter length is increased, the output SNR is increased as well. Even though the estimates of the covariance matrices worsen when the filter length is increased, the longest filter gives rise to the best output SNR. By increasing the filter length from 20 to 110, a gain in SNR of more than 15 dB can be obtained. The corresponding speech distortion index, shown in Figure 4b, is zero for all filter lengths, as was the basis for the filter design. As a reference, results for the Wiener filter (h_w) are shown as well. The Wiener filter is constructed based on [15] where it is derived based on joint diagonalization. The proposed method has a slightly lower output SNR, especially at short filter lengths. On the other hand, the Wiener filter introduces distortion of the desired signal at all filter lengths, whereas the proposed filter is distortionless.
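The evaluation described above, in which the filter is applied separately to the desired signal and to the noise, might be sketched as follows. The function names and the toy signals are hypothetical, and sample means stand in for the expectations in (10) and (21).

```python
import numpy as np

def frame_vectors(s, L):
    """Rows are [s(k), s(k-1), ..., s(k-L+1)] for k = L-1, ..., len(s)-1."""
    s = np.asarray(s, dtype=float)
    return np.stack([s[k - L + 1:k + 1][::-1] for k in range(L - 1, len(s))])

def evaluate_filter(x, v, h):
    """Empirical output SNR as in (10) and speech distortion index as in (21)
    for a fixed noise-estimating filter h."""
    L = len(h)
    hx = frame_vectors(x, L) @ h          # filtered desired signal h^T x(k)
    hv = frame_vectors(v, L) @ h          # filtered noise h^T v(k)
    x_nr = x[L - 1:] - hx                 # desired signal after noise reduction
    v_nr = v[L - 1:] - hv                 # residual noise
    osnr = np.mean(x_nr ** 2) / np.mean(v_nr ** 2)
    v_sd = np.mean(hx ** 2) / np.mean(x[L - 1:] ** 2)
    return osnr, v_sd

rng = np.random.default_rng(2)
x = np.sin(0.2 * np.arange(500))          # toy desired signal (assumption)
v = 0.5 * rng.standard_normal(500)        # toy noise (assumption)
L = 10
isnr = np.mean(x[L - 1:] ** 2) / np.mean(v[L - 1:] ** 2)
print(evaluate_filter(x, v, np.zeros(L)), isnr)   # h = 0 leaves the input unchanged
```

In the simulations the filter is updated for each sample; a fixed h is used here only to keep the sketch short.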

When the covariance matrix of the desired signal is full rank, speech distortion is introduced in the reconstructed speech signal. This situation was evaluated by the use of autoregressive (AR) models, since these can be used to describe unvoiced speech [25]. The models used were of second order, and the coefficients were found based on ten segments of unvoiced speech from the Keele database [26], resampled to give a sampling frequency of 8 kHz and a length of 400 samples after resampling. Again, P^′ was set to 10, white Gaussian noise was added to the signal to give an average input SNR of 10 dB, and 100 Monte Carlo simulations were run on each of the ten generated signals in order to see the influence of the filter length when the signal covariance matrix is full rank. The results are shown in Figure 5. As was the case for voiced speech, it is possible to gain approximately 15 dB in SNR by increasing the filter length from 20 to 110. However, this time the speech distortion is also dependent on the filter length, and the longer the filter, the more signal distortion. In this case, comparison to the Wiener filter shows the opposite situation from that with the harmonic model. Now, the gain in SNR is higher for the proposed method for all filter lengths, but the signal is also more distorted.

After having investigated the filter performance for different filter lengths using synthetic signals, the influence of input SNR and the choice of P^′ are investigated directly in speech signals. Again, we used signals from the Keele database with f_s= 8 kHz. Excerpts with a length of 20,000 samples were extracted from different places in the speech signals from two male and two female speakers. Noise was added to give the desired average input SNR, and filters with a length L=110 and varying P^′ were applied. Three different kinds of noise were used - white Gaussian, babble, and car noise - the last two from the AURORA database [27]. The output SNR and speech distortion index are depicted as a function of P^′ in Figure 6. Both the output SNR and the speech distortion index are decreasing with P^′, as was predicted in Section 4. Thereby, the choice of P^′ will be a compromise between a high output SNR and a low speech distortion index. In Figure 7, the proposed filter is compared, at an input SNR of 10 dB, to the Wiener filter, and three filters from [10] (h_ls, h_mv, h_mls), which are subspace-based filters as well. These filters are based on a Hankel representation of the observed signal, which we, from the segment length of 230 samples, construct with a size of 151 × 80. Due to restrictions on the chosen rank (according to P^′), this is only varied from 1 to 71. The performance of the Wiener filter is of course independent of P^′, and it is, therefore, possible to construct a filter that either gives a higher output SNR or a lower speech distortion than the Wiener filter, depending on the choice of P^′. The filters from [10] are dependent on P^′ as well, but the proposed filter has a broader range of possible combinations of output SNR and speech distortion. At P^′=1, a gain in output SNR of approximately 5 dB can be obtained while the speech distortion is comparable. At the other extreme, it is possible to obtain the same output SNR as h_ls while the speech distortion index is lowered by approximately 5 dB.

The choice of the value of P^′ is, however, not dependent on the input SNR, as seen in Figure 8, since both the gain in SNR and the speech distortion index are constant functions of the input SNR, as was also found theoretically in Section 4. This means that it is possible to construct a filter according to the desired combination of gain in SNR and speech distortion, and then this will apply no matter the input SNR. This is not the case for either the Wiener filter or the filters from [10] as seen in Figure 9. For these filters, the gain in SNR is decreasing with input SNR (except for h_ls which is also constant) as is the speech distortion index.

As a measure of the subjective evaluation, Perceptual Evaluation of Speech Quality (PESQ) scores [28] have been calculated for different filter lengths, different values of P^′, and different SNRs. The speech signal used contains 40,000 samples from the beginning of the speech signal from the first female speaker in the Keele database. The results are shown in Tables 1 and 2. It is seen that the PESQ scores are increasing with increasing filter length and SNR, even though the effect of going from a filter length of 90 to 110 seems smaller than increasing the length from 70 to 90. The PESQ score is rather low for low values of P^′, peaks for P^′=31 or P^′=41, depending on the SNR, and then decreases again for higher values of P^′. This is also heard in informal listening tests of the resulting speech signal. At low values of P^′, the speech signal sounds rather distorted, whereas at high levels of P^′, the signal is noisy, but not very distorted, which also confirms the findings in Figure 6. As reflected in the PESQ score, a signal with a compromise between the two is preferred if the purpose is listening directly to the output. In such a context, the performance of the Wiener filter is slightly better than the proposed filter with PESQ scores approximately 0.3 units larger. However, the purpose of noise reduction is sometimes as a pre-processor to, e.g., a speech recognition algorithm. Here, the word error rate increases when the SNR decreases [29, 30], but on the other hand, the algorithms are also sensitive to distortion of the speech signal [31, 32]. In such cases, it might, therefore, be optimal with another relationship between SNR and speech distortion than the one having the best perceptual performance. This optimization is possible with the proposed filter due to its flexibility.

Table 1 PESQ scores at different filter lengths and SNRs for P^′=31

Table 2 PESQ scores for different values of P^′ and SNR for a filter length of 110

The effect of choosing different values of P^′ is visualized in Figure 10. Figure 10a shows the spectrogram of a piece of a clean speech signal from the Keele database, and in Figure 10b, babble noise was added to give an average input SNR of 10 dB. Figure 10c,d shows the spectrograms of the reconstructed speech signal with two different choices of P^′. The former is a reconstruction based on P^′=10. The noise content is clearly reduced compared to the noisy speech signal in Figure 10b. However, a high degree of signal distortion has been introduced as well, which can be seen especially in the voiced speech parts, where the distinction between the harmonics is blurred compared to both the clean speech signal and the noisy speech signal. In the latter figure, P^′=70, and therefore, both noise reduction and signal distortion are not as prominent as when P^′=10. Here, the harmonics are much better preserved, but, as is seen in the background, this comes at the price of less noise reduction.

A feature of the proposed filter, which is not explored here, is the possibility of choosing different values of P^′ over time. The optimal value of P^′ depends on whether the speech is voiced or unvoiced, and how many harmonics there are in the voiced parts. By adapting the value of P^′ at each time step based on this information, it should be possible to simultaneously achieve a higher SNR and a lower distortion.

6 Conclusions

In this paper, we have presented a new perspective on time-domain single-channel noise reduction based on forming filters from the eigenvectors that diagonalize both the desired and noise signal covariance matrices. These filters are chosen so that they provide an estimate of the noise signal when applied to the observed signal. Then, by subtraction of the noise estimate from the observed signal, an estimate of the desired signal can be obtained. Two cases have been considered, namely one where no distortion is allowed on the desired signal and one where distortion is allowed. The former case applies to signals that have a rank that can be assumed to be less than the rank of the observed signal covariance matrix, which is, for example, the case for voiced speech. The latter case applies to desired signals that have a full-rank covariance matrix. In this case, the only way to achieve noise reduction is by also allowing for distortion on the desired signal. The amount of distortion introduced depends on a parameter corresponding to the rank of an implicit approximation of the desired signal covariance matrix. As such, it is relatively easy to control the trade-off between noise reduction and speech distortion. Experiments on real and synthetic signals have confirmed these principles and demonstrated how it is, in fact, possible to achieve higher output signal-to-noise ratio or a lower signal distortion index with the proposed method than with the classical Wiener filter. Moreover, the results show that only a small loss in output signal-to-noise ratio is incurred when no distortion can be accepted, as long as the filter is not too short. The results also show that when distortion is allowed on the desired signal, the amount of distortion is independent of the input signal-to-noise ratio. The presented perspective is promising in that it unifies the ideas behind subspace methods and optimal filtering, two methodologies that have traditionally been seen as quite different.

Acknowledgements

This research was supported by the Villum Foundation and the Danish Council for Independent Research, grant ID: DFF - 1337-00084.

Author information

Authors and Affiliations

Audio Analysis Lab, Department of Architecture, Design and Media Technology, Aalborg University, Aalborg, 9220, Denmark
Sidsel Marie Nørholm, Jacob Benesty, Jesper Rindom Jensen & Mads Græsbøll Christensen
INRS-EMT, University of Quebec, Montreal, QC, H2X 1L7, Canada
Jacob Benesty

Corresponding author

Correspondence to Sidsel Marie Nørholm.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Nørholm, S.M., Benesty, J., Jensen, J.R. et al. Single-channel noise reduction using unified joint diagonalization and optimal filtering. EURASIP J. Adv. Signal Process. 2014, 37 (2014). https://doi.org/10.1186/1687-6180-2014-37

Received: 19 December 2013
Accepted: 17 March 2014
Published: 26 March 2014
DOI: https://doi.org/10.1186/1687-6180-2014-37
