In acoustic multi-channel equalization techniques, such as complete multi-channel equalization based on the multiple-input/output inverse theorem (MINT), relaxed multi-channel least-squares (RMCLS), and partial multi-channel equalization based on MINT (PMINT), the length of the reshaping filters is generally chosen such that perfect dereverberation can be achieved for perfectly estimated room impulse responses (RIRs). However, since in practice the available RIRs typically differ from the true RIRs, this reshaping filter length may not be optimal. This paper provides a mathematical analysis of the robustness increase of equalization techniques against RIR perturbations when using a shorter reshaping filter length than conventionally used. Based on the condition number of the (weighted) convolution matrix of the RIRs, a mathematical relationship between the reshaping filter length and the robustness against RIR perturbations is established. It is shown that shorter reshaping filters than conventionally used yield a smaller condition number, i.e., a higher robustness against RIR perturbations. In addition, we propose an automatic non-intrusive procedure for determining the reshaping filter length based on the L-curve. Simulation results confirm that using a shorter reshaping filter length than conventionally used yields a significant increase in robustness against RIR perturbations for MINT, RMCLS, and PMINT. Furthermore, it is shown that PMINT using an optimal intrusively determined reshaping filter length outperforms all other considered techniques. Finally, it is shown that the automatic non-intrusively determined reshaping filter length in PMINT yields a similar performance as the optimal intrusively determined reshaping filter length.

1 Introduction

The microphone signals recorded in many hands-free speech communication applications, such as teleconferencing, voice-controlled systems, or hearing aids, do not only contain the desired speech signal but also attenuated and delayed copies due to reverberation. While early reverberation may be desirable [1–3], late reverberation may degrade the perceived speech quality and intelligibility [4–6] as well as the performance of automatic speech recognition systems [7, 8]. In order to mitigate these detrimental effects of reverberation, several single-channel and multi-channel dereverberation techniques have been proposed [9], with multi-channel techniques being generally preferred since they are able to exploit both the spectro-temporal and the spatial characteristics of the received microphone signals. Existing multi-channel dereverberation techniques can be broadly classified into spectro-temporal enhancement techniques [10–14], probabilistic modeling-based techniques [15–18], and acoustic multi-channel equalization techniques [19–26]. Acoustic multi-channel equalization techniques aim to reshape the available room impulse responses (RIRs) between the speaker and the microphone array. Since in theory they can achieve perfect dereverberation [19], they represent an attractive approach to speech dereverberation.

A well-known complete multi-channel equalization technique aiming at acoustic system inversion is the multiple-input/output inverse theorem (MINT)-based technique [19], which however suffers from drawbacks in practice. Since the available RIRs typically differ from the true RIRs due to fluctuations (e.g., temperature or position variations [27]) or due to the sensitivity of blind system identification (BSI) and supervised system identification (SSI) methods to near-common zeros or interfering noise [28–30], MINT generally fails to invert the true RIRs, possibly leading to severe distortions in the output signal [22–24, 26]. In order to increase the robustness against RIR perturbations, partial multi-channel equalization techniques, such as relaxed multi-channel least-squares (RMCLS) [23] and partial multi-channel equalization based on MINT (PMINT) [24], have been proposed. Since early reflections tend to improve speech intelligibility [1–3] and late reflections are the major cause of speech intelligibility degradation [4–6], the objective of partial equalization techniques is to shorten the overall impulse response by suppressing only the late reflections. While RMCLS imposes no constraints on the remaining early reflections, PMINT has been shown to be more perceptually advantageous since it also aims to control the remaining early reflections. Although partial equalization techniques can be significantly more robust than MINT, their performance still remains rather susceptible to RIR perturbations [23, 24, 26]. As a result, several methods have been proposed to further increase the robustness against RIR perturbations. In [22, 24], it has been proposed to incorporate regularization, such that the distortion energy due to RIR perturbations is decreased. In [26], it has been proposed to use a signal-dependent penalty function to promote sparsity in the output signal and reduce artifacts generated by non-robust techniques. 
In [31, 32], it has been proposed to relax the constraints on the filter design by constructing approximate reshaping filters in the subband domain. In [33], it has been proposed to relax the constraints on the filter design by using a shorter reshaping filter length than conventionally used. The objective of this paper is to provide a mathematical analysis of the robustness increase when using a shorter reshaping filter length as well as to propose an automatic non-intrusive procedure for selecting an optimal shorter reshaping filter length.

The length of the reshaping filters in MINT, RMCLS, and PMINT is conventionally chosen such that perfect dereverberation can be achieved for perfectly estimated RIRs. As already mentioned, since in practice the available RIRs typically differ from the true RIRs, this choice of the reshaping filter length yields a high sensitivity to RIR perturbations. In [33], it has been analytically shown that decreasing the reshaping filter length increases the robustness for MINT and PMINT only if the multi-channel convolution matrix of the RIRs is a square matrix. In this paper, it is analytically shown that decreasing the reshaping filter length increases the robustness of MINT, RMCLS, and PMINT independently of the dimension of the (weighted) multi-channel convolution matrix of the RIRs. A mathematical relationship between the reshaping filter length and the condition number of the (weighted) multi-channel convolution matrix of the available RIRs, hence, the sensitivity of equalization techniques to RIR perturbations, is derived. We show that shorter reshaping filters than conventionally used yield a smaller condition number, i.e., a higher robustness against RIR perturbations.

In general, the reshaping filter length yielding optimal performance can only be determined intrusively (i.e., using a clean reference signal), obviously limiting the practical applicability. Hence, we also propose and investigate an automatic non-intrusive selection procedure for the reshaping filter length based on the L-curve [34–36].

Simulation results for several acoustic systems and RIR perturbations show by means of instrumental performance measures that using shorter reshaping filters in MINT, RMCLS, and PMINT significantly increases the robustness against RIR perturbations. In addition, it is demonstrated that PMINT using the optimal intrusively determined reshaping filter length outperforms the other considered equalization techniques, yielding a larger reverberant energy suppression and perceptual speech quality improvement. Furthermore, it is shown that the non-intrusively determined reshaping filter length yields a nearly optimal performance for PMINT.

The paper is organized as follows. In Section 2, the considered acoustic configuration and the notation are introduced. In Section 3, state-of-the-art acoustic multi-channel equalization techniques, i.e., MINT, RMCLS, and PMINT, are briefly reviewed. In Section 4, the sensitivity of these equalization techniques to RIR perturbations is evaluated by means of the condition number of the (weighted) convolution matrix and analytical insights on increasing the robustness by decreasing the reshaping filter length are provided. In Section 5, the automatic non-intrusive procedure for determining the reshaping filter length is discussed. Using instrumental performance measures, the dereverberation performance of all considered techniques is compared in Section 6.

2 Configuration and notation

We consider an acoustic system with a single speech source and M microphones, as depicted in Fig. 1. The m-th microphone signal y_{m}(n), m=1, 2, …, M, at discrete time index n, is given by

$$ y_{m}(n) = h_{m}(n) \ast s(n) + v_{m}(n) = x_{m}(n) + v_{m}(n), $$

(1)

where ∗ denotes convolution, s(n) is the clean speech signal, h_{m}(n) is the RIR between the speech source and the m-th microphone, x_{m}(n) is the reverberant speech component, and v_{m}(n) is the noise component. Since acoustic multi-channel equalization techniques generally design reshaping filters without taking the additive noise into account, in the following it is assumed that v_{m}(n)=0, and hence, y_{m}(n)=x_{m}(n).

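Using the filter-and-sum structure in Fig. 1, the output signal z(n) is equal to the sum of the filtered microphone signals, i.e.,

$$ z(n) = \sum_{m=1}^{M} x_{m}(n) \ast g_{m}(n) = s(n) \ast \underbrace{\sum_{m=1}^{M} h_{m}(n) \ast g_{m}(n)}_{c(n)} = s(n) \ast c(n), $$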

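where g_{m}(n) is the filter applied to the m-th microphone signal and c(n) denotes the equalized impulse response (EIR) between the speech source and the output of the system. In vector notation, the RIR h_{m} and the filter g_{m} are given by

$$ \mathbf{h}_{m} = \left[h_{m}(0) \; h_{m}(1) \; \ldots \; h_{m}(L_{h}-1)\right]^{T}, $$

$$ \mathbf{g}_{m} = \left[g_{m}(0) \; g_{m}(1) \; \ldots \; g_{m}(L_{g}-1)\right]^{T}, $$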

where L_{h} and L_{g} denote the RIR length and the reshaping filter length, respectively. Using the ML_{g}–dimensional stacked filter vector \(\mathbf {g} = \left [\mathbf {g}^{T}_{1} \mathbf {g}^{T}_{2} \ldots \mathbf {g}^{T}_{M}\right ]^{T}\), the EIR vector c of length L_{c}=L_{h}+L_{g}−1, i.e., c=[c(0) c(1)…c(L_{c}−1)]^{T}, can be expressed as

$$ \mathbf{c} = \mathbf{H}\mathbf{g}, $$

(6)

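where H denotes the L_{c}×ML_{g}–dimensional multi-channel convolution matrix of the RIRs, i.e., H=[H_{1} H_{2} … H_{M}], with

$$ \mathbf{H}_{m} = \begin{bmatrix} h_{m}(0) & 0 & \ldots & 0 \\ h_{m}(1) & h_{m}(0) & \ddots & \vdots \\ \vdots & h_{m}(1) & \ddots & 0 \\ h_{m}(L_{h}-1) & \vdots & \ddots & h_{m}(0) \\ 0 & h_{m}(L_{h}-1) & \ddots & h_{m}(1) \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \ldots & 0 & h_{m}(L_{h}-1) \end{bmatrix}. $$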

The reshaping filter g can then be constructed based on different design objectives for the EIR c.
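As a minimal numerical sketch of the notation in this section, the following NumPy code builds the multi-channel convolution matrix and checks that c = Hg reproduces the sum of the per-channel convolutions. The random RIRs, filters, and signal, the toy dimensions, and the helper names `convolution_matrix` and `multichannel_convolution_matrix` are illustrative assumptions and are not part of the considered acoustic systems.

```python
import numpy as np

def convolution_matrix(h, L_g):
    """Return the (L_h + L_g - 1) x L_g convolution matrix of the vector h."""
    h = np.asarray(h, dtype=float)
    H = np.zeros((len(h) + L_g - 1, L_g))
    for j in range(L_g):
        H[j:j + len(h), j] = h  # column j holds h delayed by j samples
    return H

def multichannel_convolution_matrix(rirs, L_g):
    """Place the per-channel matrices side by side: H = [H_1 H_2 ... H_M]."""
    return np.hstack([convolution_matrix(h, L_g) for h in rirs])

rng = np.random.default_rng(0)
M, L_h, L_g = 3, 7, 3                      # toy sizes (assumption)
rirs = rng.standard_normal((M, L_h))       # stand-ins for the RIRs h_m
filters = rng.standard_normal((M, L_g))    # arbitrary reshaping filters g_m
s = rng.standard_normal(50)                # stand-in for the clean speech signal

H = multichannel_convolution_matrix(rirs, L_g)
g = filters.reshape(-1)                    # stacked filter vector [g_1^T ... g_M^T]^T
c = H @ g                                  # EIR computed as c = H g

# EIR computed directly as the sum of the per-channel convolutions h_m * g_m
c_direct = sum(np.convolve(h, g_m) for h, g_m in zip(rirs, filters))

# Filter-and-sum output compared with the clean signal convolved with the EIR
x = [np.convolve(h, s) for h in rirs]      # noise-free microphone signals
z = sum(np.convolve(x_m, g_m) for x_m, g_m in zip(x, filters))
z_from_eir = np.convolve(s, c)
```

Both `c` and `c_direct`, as well as `z` and `z_from_eir`, agree up to numerical precision, which is the matrix form of the filter-and-sum structure.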

3 Acoustic multi-channel equalization

Acoustic multi-channel equalization techniques aim at speech dereverberation by designing the reshaping filter g such that the (weighted) EIR c in (6) is equal to a (weighted) target EIR c_{d}. For the equalization techniques considered in this paper, i.e., MINT [19], RMCLS [23], and PMINT [24], the definition of the target EIR c_{d} is presented in Table 1, where τ denotes a delay, L_{d} denotes the length of the direct path and early reflections, and p∈{1, …, M}. The delay τ is incorporated in order to relax the causality constraints on the filter design [22]. The length of the direct path and early reflections L_{d} in RMCLS and PMINT is typically chosen between 10 and 50 ms [23, 24]. It should be realized that in practice, only the perturbed RIRs \(\hat {h}_{m}\) are available, i.e., \(\hat {h}_{m} = h_{m} + e_{m}\), where e_{m} represents the RIR perturbations due to fluctuations (e.g., temperature or position fluctuations [27]) or due to the sensitivity of BSI and SSI methods to near-common zeros or interfering noise [28–30]. Hence, for the filter design, the perturbed convolution matrix \(\hat {\mathbf {H}} = \mathbf {H} + \mathbf {E}\) is used, where E represents the convolution matrix of the RIR perturbations. The considered equalization techniques compute the filter g as the solution of the system of equations

$$ \mathbf{W}\hat{\mathbf{H}}\mathbf{g} = \mathbf{W}\mathbf{c}_{\mathrm{d}}, $$

(9)

with W an L_{c}×L_{c}–dimensional diagonal weighting matrix. The definition of the weighting matrix W for MINT, RMCLS, and PMINT is presented in Table 1, where I denotes the L_{c}×L_{c}–dimensional identity matrix. Based on these definitions of W and c_{d}, it can be observed that on the one hand, MINT and PMINT do not use a weighting matrix and constrain all taps of the EIR (i.e., W=I), while on the other hand, RMCLS uses a weighting matrix and does not constrain all taps of the EIR (i.e., \(\mathbf {W} = {\text {diag}}\big \{\big [\underbrace {1 \; \ldots \; 1}_{\tau } \; \underbrace {1 \; 0 \; \ldots \; 0}_{L_{d}} \; 1 \; \ldots 1\big ]^{T}\big \}\)). It has been experimentally validated in [23, 24, 26] that by constraining all taps of the EIR, MINT and PMINT may result in a good perceptual speech quality but a high sensitivity to RIR perturbations, whereas by not constraining all taps of the EIR, RMCLS may result in a lower sensitivity to RIR perturbations but a decreased perceptual speech quality.

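For all considered equalization techniques, the reshaping filter solving (9) is computed by minimizing the (weighted) least-squares cost function

$$ J_{\text{LS}} = \left\|\mathbf{W}\left(\hat{\mathbf{H}}\mathbf{g} - \mathbf{c}_{\mathrm{d}}\right)\right\|^{2}_{2}. $$

(10)

Assuming that the RIRs do not share any common zeros and that the reshaping filter length satisfies

$$ L_{g} \geq \left\lceil \frac{L_{h}-1}{M-1} \right\rceil, $$

(11)

the reshaping filter minimizing (10) is equal to

$$ \mathbf{g}_{\text{LS}} = \left(\mathbf{W}\hat{\mathbf{H}}\right)^{+}\mathbf{W}\mathbf{c}_{\mathrm{d}}, $$

(12)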

where {·}^{+} denotes the matrix pseudo-inverse. When the true RIRs are available, i.e., \(\hat {\mathbf {H}} = \mathbf {H}\), the reshaping filter of length L_{g} according to (11) yields perfect dereverberation, i.e., WHg_{LS}=Wc_{d}. However, in the presence of RIR perturbations, i.e., \(\hat {\mathbf {H}} \neq \mathbf {H}\), this filter typically fails to achieve perfect dereverberation, i.e., WHg_{LS}≠Wc_{d}, possibly even causing severe distortions in the output signal [24, 26]. The sensitivity of the reshaping filter to RIR perturbations can be evaluated by analyzing the condition number of the matrix \(\mathbf {W}\hat {\mathbf {H}}\).
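The least-squares design can be illustrated with a small NumPy sketch. Since Table 1 is not reproduced here, the target EIRs and the RMCLS weighting vector below are assumptions based on the description in the text (a delayed impulse for MINT and RMCLS, the delayed first L_{d} taps of the first RIR for PMINT). The random RIRs, the toy dimensions, the perturbation level, and the helper names are likewise hypothetical.

```python
import numpy as np

def stacked_matrix(rirs, L_g):
    """Multi-channel convolution matrix [H_1 ... H_M] for filter length L_g."""
    L_h = rirs.shape[1]
    blocks = []
    for h in rirs:
        H_m = np.zeros((L_h + L_g - 1, L_g))
        for j in range(L_g):
            H_m[j:j + L_h, j] = h
        blocks.append(H_m)
    return np.hstack(blocks)

def design(H_hat, W, c_d):
    """Pseudo-inverse solution g = (W H_hat)^+ (W c_d)."""
    return np.linalg.pinv(W @ H_hat) @ (W @ c_d)

rng = np.random.default_rng(1)
M, L_h = 3, 7                                  # toy sizes (assumption)
L_g = int(np.ceil((L_h - 1) / (M - 1)))        # conventional filter length
L_c = L_h + L_g - 1
tau, L_d = 1, 3                                # toy delay and early-part length

rirs = rng.standard_normal((M, L_h))           # stand-ins for the true RIRs
H = stacked_matrix(rirs, L_g)

# Assumed target EIRs and weights for the three techniques
c_mint = np.zeros(L_c)
c_mint[tau] = 1.0                              # delayed impulse
c_pmint = np.zeros(L_c)
c_pmint[tau:tau + L_d] = rirs[0, :L_d]         # delayed early part of h_1 (p = 1)
w_rmcls = np.ones(L_c)
w_rmcls[tau + 1:tau + L_d] = 0.0               # unconstrained early taps
settings = {
    "MINT": (np.eye(L_c), c_mint),
    "RMCLS": (np.diag(w_rmcls), c_mint),
    "PMINT": (np.eye(L_c), c_pmint),
}

# Perturbed RIRs used for the design, evaluated on the true system
H_hat = stacked_matrix(rirs + 1e-2 * rng.standard_normal(rirs.shape), L_g)

errors = {}
for name, (W, c_d) in settings.items():
    g_true = design(H, W, c_d)                 # designed with the true RIRs
    g_pert = design(H_hat, W, c_d)             # designed with perturbed RIRs
    errors[name] = (np.linalg.norm(W @ (H @ g_true - c_d)),
                    np.linalg.norm(W @ (H @ g_pert - c_d)))
```

For each technique, the first entry of `errors[name]` is numerically zero (perfect equalization with the true RIRs), whereas the second entry is not, since the filter designed from the perturbed RIRs is applied to the true system.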

4 Robust acoustic multi-channel equalization

In this section, the Wedin theorem [37] relating the condition number of the matrix being inverted to the sensitivity of the solution to perturbations is briefly reviewed. In addition, it is analytically shown that using shorter reshaping filters than conventionally used decreases the condition number of the matrix \(\mathbf {W}\hat {\mathbf {H}}\), hence increasing the robustness against RIR perturbations.

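Wedin theorem [37]: Consider the system of equations Aq = b, where the matrix A has dimensions u×v and rank r ≤ min{u, v}. Let A be perturbed to A+ΔA. The pseudo-inverse solution q=A^{+}b is then perturbed to q+Δq=(A+ΔA)^{+}b, where Δq is the deviation between the true and the perturbed solution. The condition number χ_{A} of the matrix A is defined as

$$ \chi_{\mathbf{A}} = \frac{\sigma_{1}(\mathbf{A})}{\sigma_{r}(\mathbf{A})}, $$

where \(\sigma_{1}(\mathbf{A})\) and \(\sigma_{r}(\mathbf{A})\) denote the largest and the smallest non-zero singular value of A, respectively.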

The relation in (15) shows that a large condition number χ_{A} can (and typically does) result in a large deviation between the true and the perturbed solution [37, 38].
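The effect of the condition number on the sensitivity of the pseudo-inverse solution can be illustrated with a two-dimensional toy example. The matrices, the perturbation, and the helper names below are hypothetical and only serve to show the qualitative behavior; the sketch does not evaluate the bound of the Wedin theorem itself.

```python
import numpy as np

def condition_number(A, tol=1e-12):
    """Ratio of the largest to the smallest non-zero singular value of A."""
    s = np.linalg.svd(A, compute_uv=False)
    s = s[s > tol * s[0]]                  # keep only the non-zero singular values
    return s[0] / s[-1]

def relative_deviation(A, dA, b):
    """||dq|| / ||q|| for the pseudo-inverse solutions of A and A + dA."""
    q = np.linalg.pinv(A) @ b
    q_pert = np.linalg.pinv(A + dA) @ b
    return np.linalg.norm(q_pert - q) / np.linalg.norm(q)

b = np.array([1.0, 1.0])
dA = np.array([[0.0, 0.0], [0.0, 1e-7]])   # identical small perturbation for both systems

A_well = np.diag([1.0, 1.0])               # well-conditioned matrix
A_ill = np.diag([1.0, 1e-6])               # ill-conditioned matrix

chi_well = condition_number(A_well)
chi_ill = condition_number(A_ill)
dev_well = relative_deviation(A_well, dA, b)
dev_ill = relative_deviation(A_ill, dA, b)
```

Although the same perturbation is applied to both matrices, the relative deviation of the solution is several orders of magnitude larger for the ill-conditioned matrix.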

For clarity of presentation, the notation summarized in Table 2 is used in the following. In order to satisfy (11), reshaping filters in acoustic multi-channel equalization techniques are conventionally designed using the filter length \(L^{\mathrm {t}}_{g} = \left \lceil {\frac {L_{h}-1}{M-1}}\right \rceil \), i.e., based on the p_{t}×q_{t}-dimensional matrix \(\mathbf {W}_{\mathrm {t}}\hat {\mathbf {H}}_{\mathrm {t}}\) with p_{t}≤q_{t} and rank r_{t}≤p_{t} (cf. Table 2). However, reshaping filters can also be designed using a shorter filter length \(L^{\mathrm {s}}_{g} < L^{\mathrm {t}}_{g}\), i.e., based on the p_{s}×q_{s}-dimensional matrix \(\mathbf {W}_{\mathrm {s}}\hat {\mathbf {H}}_{\mathrm {s}}\) (cf. Table 2). Considering that \(L^{\mathrm {s}}_{g} < \frac {L_{h}-1}{M-1}\), the matrix \(\mathbf {W}_{\mathrm {s}}\hat {\mathbf {H}}_{\mathrm {s}}\) is a full column-rank matrix with fewer columns than rows, i.e., q_{s}<p_{s}, since

$$ q_{s} = M L^{\mathrm{s}}_{g} < L_{h} + L^{\mathrm{s}}_{g} - 1 = p_{s}. $$

As schematically illustrated in Fig. 2, the matrix \(\mathbf {W}_{\mathrm {s}}\hat {\mathbf {H}}_{\mathrm {s}}\) is a sub-matrix of \(\mathbf {W}_{\mathrm {t}}\hat {\mathbf {H}}_{\mathrm {t}}\), constructed by deleting \(L^{\mathrm {t}}_{g} - L^{\mathrm {s}}_{g}\) rows and \(M(L^{\mathrm {t}}_{g} - L^{\mathrm {s}}_{g})\) columns from \(\mathbf {W}_{\mathrm {t}}\hat {\mathbf {H}}_{\mathrm {t}}\). Aiming at establishing a relation between the condition numbers of the matrices \(\mathbf {W}_{\mathrm {t}}\hat {\mathbf {H}}_{\mathrm {t}}\) and \(\mathbf {W}_{\mathrm {s}}\hat {\mathbf {H}}_{\mathrm {s}}\), i.e., between \(\chi _{\mathbf {W}_{\mathrm {t}}\hat {\mathbf {H}}_{\mathrm {t}}}\) and \(\chi _{\mathbf {W}_{\mathrm {s}}\hat {\mathbf {H}}_{\mathrm {s}}}\), we consider the following interlacing inequalities between the singular values of a matrix and its sub-matrices [39].

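Interlacing inequalities [39]: Given a matrix A of dimensions u×v and a sub-matrix B obtained by deleting a total of l rows and/or columns from A, the singular values of A and B interlace as

$$ \sigma_{i}(\mathbf{A}) \geq \sigma_{i}(\mathbf{B}) \geq \sigma_{i+l}(\mathbf{A}), \quad i = 1, \ldots, \min\{u, v\}, $$

(19)

where \(\sigma_{i}(\cdot)\) denotes the i-th largest singular value, which is set to 0 when i exceeds the smaller dimension of the matrix.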

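Using (19), in Appendix A we derive the following inequalities relating the largest and the smallest non-zero singular values of \(\mathbf {W}_{\mathrm {t}}\hat {\mathbf {H}}_{\mathrm {t}}\) and \(\mathbf {W}_{\mathrm {s}}\hat {\mathbf {H}}_{\mathrm {s}}\):

$$ \sigma_{\max}(\mathbf{W}_{\mathrm{s}}\hat{\mathbf{H}}_{\mathrm{s}}) \leq \sigma_{\max}(\mathbf{W}_{\mathrm{t}}\hat{\mathbf{H}}_{\mathrm{t}}), $$

(20)

$$ \sigma_{\min}(\mathbf{W}_{\mathrm{s}}\hat{\mathbf{H}}_{\mathrm{s}}) \geq \sigma_{\min}(\mathbf{W}_{\mathrm{t}}\hat{\mathbf{H}}_{\mathrm{t}}), $$

(21)

where \(\sigma_{\max}(\cdot)\) and \(\sigma_{\min}(\cdot)\) denote the largest and the smallest non-zero singular value, respectively.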

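It readily follows from (20) and (21) that the condition number of \(\mathbf {W}_{\mathrm {s}}\hat {\mathbf {H}}_{\mathrm {s}}\) is smaller than or equal to the condition number of \(\mathbf {W}_{\mathrm {t}}\hat {\mathbf {H}}_{\mathrm {t}}\), i.e.,

$$ \chi_{\mathbf{W}_{\mathrm{s}}\hat{\mathbf{H}}_{\mathrm{s}}} \leq \chi_{\mathbf{W}_{\mathrm{t}}\hat{\mathbf{H}}_{\mathrm{t}}}. $$

(22)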

Hence, using a shorter reshaping filter than conventionally used in equalization techniques can result (and based on simulations, it always does) in a lower condition number of the matrix being inverted.
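The sub-matrix relation and the resulting decrease of the condition number can be checked numerically on a toy example. The sketch below assumes W = I (as in MINT and PMINT), random stand-in RIRs, and dimensions for which the conventional matrix is square; all names and sizes are hypothetical.

```python
import numpy as np

def stacked_matrix(rirs, L_g):
    """Multi-channel convolution matrix [H_1 ... H_M] for filter length L_g."""
    L_h = rirs.shape[1]
    blocks = []
    for h in rirs:
        H_m = np.zeros((L_h + L_g - 1, L_g))
        for j in range(L_g):
            H_m[j:j + L_h, j] = h
        blocks.append(H_m)
    return np.hstack(blocks)

rng = np.random.default_rng(2)
M, L_h = 3, 7                                  # toy sizes (assumption)
L_t = int(np.ceil((L_h - 1) / (M - 1)))        # conventional length
L_s = 2                                        # shorter length

rirs_hat = rng.standard_normal((M, L_h))       # stand-ins for the available RIRs
H_t = stacked_matrix(rirs_hat, L_t)            # 9 x 9
H_s = stacked_matrix(rirs_hat, L_s)            # 8 x 6

# H_s is obtained from H_t by deleting L_t - L_s rows and M (L_t - L_s) columns
rows = np.arange(L_h + L_s - 1)
cols = np.concatenate([m * L_t + np.arange(L_s) for m in range(M)])
is_submatrix = np.array_equal(H_t[np.ix_(rows, cols)], H_s)

sv_t = np.linalg.svd(H_t, compute_uv=False)    # singular values, largest first
sv_s = np.linalg.svd(H_s, compute_uv=False)
chi_t = sv_t[0] / sv_t[-1]
chi_s = sv_s[0] / sv_s[-1]
```

In this example, the largest singular value of `H_s` does not exceed that of `H_t`, the smallest singular value of `H_s` is not smaller than that of `H_t`, and hence `chi_s` does not exceed `chi_t`.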

Figure 3 depicts the singular values of the matrix \(\mathbf {W}_{\mathrm {t}}\hat {\mathbf {H}}_{\mathrm {t}}\) for PMINT, constructed using the conventional reshaping filter length \(L^{\mathrm {t}}_{g} = 1947\). The used acoustic system is system S_{1} described in Section 6.1, with M=4 microphones and L_{h}=5840. The singular values of two sub-matrices \(\mathbf {W}_{\mathrm {s}}\hat {\mathbf {H}}_{\mathrm {s}}\), constructed using \(L^{\mathrm {s}}_{g} = 1000\) and \(L^{\mathrm {s}}_{g} = 300\), are also depicted. The largest and the smallest non-zero singular values of each matrix are marked in order to illustrate the inequalities presented in (20) and (21). Using these singular values, the condition numbers of the different matrices are presented in Table 3, where it is illustrated that using a shorter reshaping filter length than conventionally used decreases the condition number of the matrix \(\mathbf {W}\hat {\mathbf {H}}\).
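The filter lengths and matrix dimensions involved in this example can be worked out directly from M and L_{h}. The short sketch below uses only the values stated above for system S_{1}; the helper names are hypothetical.

```python
import math

def conventional_length(L_h, M):
    """Conventional reshaping filter length ceil((L_h - 1) / (M - 1))."""
    return math.ceil((L_h - 1) / (M - 1))

def matrix_shape(L_h, L_g, M):
    """Dimensions (rows, columns) of the multi-channel convolution matrix."""
    return (L_h + L_g - 1, M * L_g)

M, L_h = 4, 5840                               # system S_1
L_t = conventional_length(L_h, M)              # 1947
shapes = {L_g: matrix_shape(L_h, L_g, M) for L_g in (L_t, 1000, 300)}
# shapes[1947] == (7786, 7788): more columns than rows
# shapes[1000] == (6839, 4000) and shapes[300] == (6139, 1200): fewer columns than rows
```

The conventional length yields a matrix with slightly more columns than rows, whereas both shorter lengths yield matrices with fewer columns than rows, in line with the discussion of p_{t}≤q_{t} and q_{s}<p_{s} above.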

In summary, decreasing the reshaping filter length in acoustic multi-channel equalization techniques decreases the condition number of the matrix being inverted, increasing the robustness against RIR perturbations. However, decreasing the reshaping filter length also reduces the equalization performance with respect to the true RIRs, resulting in a trade-off between equalization performance for perfectly estimated RIRs and robustness in the presence of RIR perturbations. Using a shorter reshaping filter is not only desirable to increase the robustness against RIR perturbations, but also because of the lower computational complexity of the reshaping filter design.

5 Automatic non-intrusive reshaping filter length

The optimal reshaping filter length \(L^{\text {opt}}_{g}\) yielding the highest dereverberation performance obviously depends on the acoustic system and the RIR perturbation level. In simulations, \(L^{\text {opt}}_{g}\) can be intrusively determined by exploiting a clean reference signal. Reshaping filters for several reshaping filter lengths can be computed and applied to the received microphone signals such that different output signals are generated. The optimal reshaping filter length \(L^{\text {opt}}_{g}\) can then be selected by comparing the different output signals to the clean reference signal. Since one typically does not have access to the clean reference signal, an automatic non-intrusive procedure is required in practice.

Motivated by the simplicity and the applicability of the L-curve to automatically determine a regularization parameter in regularized (weighted) least-squares solutions [24,34,35], in this section, we propose to use the L-curve to automatically determine the reshaping filter length \(L^{\text {auto}}_{g}\) in acoustic multi-channel equalization techniques.

Using a shorter reshaping filter introduces a trade-off between the condition number \(\chi _{\mathbf {W}\hat {\mathbf {H}}}\) and the (weighted) least-squares error \(\|\mathbf {W}(\hat {\mathbf {H}}\mathbf {g} - \mathbf {c}_{\mathrm {d}})\|^{2}_{2}\). An appropriate filter length should incorporate knowledge about \(\chi _{\mathbf {W}\hat {\mathbf {H}}}\) and \(\|\mathbf {W}(\hat {\mathbf {H}}\mathbf {g} - \mathbf {c}_{\mathrm {d}})\|^{2}_{2}\), such that preferably both quantities are kept as small as possible. Due to the arising trade-off between these quantities, the parametric plot of the condition number versus the (weighted) least-squares error for several reshaping filter lengths has an L-shape. The corner of the L-curve, i.e., the point of maximum curvature, is located where the filter changes from being dominated by a large condition number to being dominated by a large (weighted) least-squares error. Hence, we propose to automatically select the reshaping filter length \(L^{\text {auto}}_{g}\) as the filter length corresponding to the corner of the parametric plot of the condition number \(\chi _{\mathbf {W}\hat {\mathbf {H}}}\) versus the (weighted) least-squares error \(\|\mathbf {W}(\hat {\mathbf {H}}\mathbf {g} - \mathbf {c}_{\mathrm {d}})\|^{2}_{2}\).
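The two quantities forming the L-curve can be computed for a range of filter lengths as sketched below. The sketch assumes W = I with a PMINT-style target (delayed early part of the first available RIR), random stand-in RIRs, and toy dimensions; all names are hypothetical.

```python
import numpy as np

def stacked_matrix(rirs, L_g):
    """Multi-channel convolution matrix [H_1 ... H_M] for filter length L_g."""
    L_h = rirs.shape[1]
    blocks = []
    for h in rirs:
        H_m = np.zeros((L_h + L_g - 1, L_g))
        for j in range(L_g):
            H_m[j:j + L_h, j] = h
        blocks.append(H_m)
    return np.hstack(blocks)

rng = np.random.default_rng(3)
M, L_h, tau, L_d = 3, 13, 1, 3                 # toy sizes (assumption)
L_t = int(np.ceil((L_h - 1) / (M - 1)))        # conventional length
rirs_hat = rng.standard_normal((M, L_h))       # stand-ins for the available RIRs

lengths = list(range(1, L_t + 1))              # candidate filter lengths
cond_numbers, ls_errors = [], []
for L_g in lengths:
    H_hat = stacked_matrix(rirs_hat, L_g)
    c_d = np.zeros(L_h + L_g - 1)
    c_d[tau:tau + L_d] = rirs_hat[0, :L_d]     # assumed PMINT-style target EIR
    g = np.linalg.pinv(H_hat) @ c_d            # least-squares filter for this length
    sv = np.linalg.svd(H_hat, compute_uv=False)
    cond_numbers.append(sv[0] / sv[-1])        # condition number of H_hat
    ls_errors.append(np.linalg.norm(H_hat @ g - c_d) ** 2)
```

As the filter length increases towards the conventional length, `cond_numbers` does not decrease while `ls_errors` does not increase, which is the trade-off that gives the parametric plot its L-shape.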

Figure 4 depicts a typical L-curve obtained using PMINT for acoustic system S_{1} described in Section 6.1. As illustrated in this figure, decreasing the reshaping filter length decreases \(\chi _{\mathbf {W}\hat {\mathbf {H}}}\) but at the same time increases \(\|\mathbf {W}(\hat {\mathbf {H}}\mathbf {g} - \mathbf {c}_{\mathrm {d}})\|^{2}_{2}\). Although from such a curve it seems straightforward to determine the reshaping filter length corresponding to the corner of the L-curve (i.e., \(L^{\mathrm {s}}_{g} = 1000\) in this example), numerical problems may occur, and hence, a numerically stable algorithm is required. Similarly to [24], in this paper we use the triangle method [36] for locating the corner of the L-curve.
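To illustrate the idea of locating the corner, the following sketch uses a simple stand-in heuristic, namely the point with the largest distance to the chord joining the two end points of the curve in log-log scale. This is not the triangle method of [36]; it is swapped in only because it is short and easy to verify. The function name `lcurve_corner` and the synthetic curve are hypothetical.

```python
import numpy as np

def lcurve_corner(x, y):
    """Index of the point farthest from the chord between the end points (log-log scale)."""
    px = np.log10(np.asarray(x, dtype=float))
    py = np.log10(np.asarray(y, dtype=float))
    dx, dy = px[-1] - px[0], py[-1] - py[0]
    # Perpendicular distance of every point to the line through the two end points
    dist = np.abs(dx * (py - py[0]) - dy * (px - px[0])) / np.hypot(dx, dy)
    return int(np.argmax(dist))

# Synthetic L-shaped curve: condition number versus least-squares error
lengths = [100, 200, 300, 400, 500]            # hypothetical candidate filter lengths
cond = np.array([1.0, 10 ** 0.1, 10 ** 0.2, 10 ** 2.0, 10 ** 3.0])
ls_error = np.array([10 ** 3.0, 10 ** 2.0, 10 ** 0.2, 10 ** 0.1, 1.0])

corner = lcurve_corner(cond, ls_error)
L_auto = lengths[corner]                       # selects the third candidate (300)
```

Both input sequences need to be strictly positive because of the logarithmic scale, so a least-squares error that is numerically zero would have to be floored or excluded before calling the function.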

6 Simulation results and discussion

In this section, we investigate the influence of the reshaping filter length on the dereverberation performance of all considered acoustic multi-channel equalization techniques. In Section 6.1, the considered acoustic systems, instrumental performance measures, and algorithmic settings are introduced. In Section 6.2, the increase in robustness when using shorter reshaping filter lengths is investigated. In Section 6.3, the performance of all considered equalization techniques using the intrusively determined reshaping filter length \(L^{\text {opt}}_{g}\) is compared for several acoustic systems and RIR perturbation levels. In Section 6.4, the performance of PMINT using the automatic non-intrusively determined reshaping filter length \(L^{\text {auto}}_{g}\) is investigated.

6.1 Acoustic systems, instrumental performance measures, and algorithmic settings

Acoustic systems. We consider three different reverberant acoustic systems with a single speech source placed at a distance of 2 m from M=4 omni-directional microphones. The RIRs between the speech source and the microphones are measured using the swept-sine technique [40] and the reverberant signals are generated by convolving 10 sentences (approximately 17 s long) from the HINT database [41] with the measured RIRs. For each acoustic system, Table 4 presents the reverberation time T_{60} of the room, the length of the RIRs L_{h} at a sampling frequency f_{s}=8 kHz, and the input direct-to-reverberant ratio (iDRR). The iDRR is computed using the RIR of the first microphone h_{1}(n) and is defined as

$$ \text{iDRR} = 10 \log_{10} \frac{\sum_{n=0}^{n_{d}-1} h_{1}^{2}(n)}{\sum_{n=n_{d}}^{L_{h}-1} h_{1}^{2}(n)}, $$

(23)

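where the first n_{d} samples (corresponding to 3 ms) of h_{1}(n) represent the direct path propagation and the remaining samples represent reflections [9]. In order to simulate RIR perturbations, the measured RIRs are perturbed by proportional Gaussian distributed errors as proposed in [42], such that a desired normalized projection misalignment (NPM) between the true RIR vector \(\mathbf{h}\) and the perturbed RIR vector \(\hat{\mathbf{h}}\), i.e.,

$$ \text{NPM} = 20 \log_{10} \frac{\left\| \mathbf{h} - \frac{\mathbf{h}^{T}\hat{\mathbf{h}}}{\hat{\mathbf{h}}^{T}\hat{\mathbf{h}}} \hat{\mathbf{h}} \right\|_{2}}{\left\| \mathbf{h} \right\|_{2}}, $$

(24)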

is generated. Introducing proportional Gaussian distributed errors is a widely used technique to systematically simulate RIR perturbations arising from system identification methods. The considered NPMs for each acoustic system are

$$ \text{NPM} \in \{-33~\text{dB}, \; \ldots, \; -15~\text{dB}\}, $$

(25)

with −33 dB a moderate perturbation level and −15 dB a larger perturbation level. It should be noted that the NPMs in (25) represent realistic NPMs achieved by state-of-the-art BSI methods (for relatively short RIRs in the order of 300−500 taps) [30].
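The NPM and a perturbation with a prescribed NPM can be sketched as follows. The NPM is computed according to its standard definition. The perturbation routine is an assumption and not necessarily the procedure of [42]: tap-wise proportional Gaussian errors are drawn, their component along the true RIRs is removed, and the remainder is scaled such that the target NPM is met exactly. The names `npm_db` and `perturb_to_npm` and the random stand-in RIRs are hypothetical.

```python
import numpy as np

def npm_db(h, h_hat):
    """Normalized projection misalignment between h and h_hat in dB."""
    h, h_hat = np.ravel(h), np.ravel(h_hat)
    alpha = (h @ h_hat) / (h_hat @ h_hat)      # optimal scaling of h_hat
    return 20 * np.log10(np.linalg.norm(h - alpha * h_hat) / np.linalg.norm(h))

def perturb_to_npm(h, target_db, rng):
    """Perturb h such that npm_db(h, result) equals target_db (assumed scheme)."""
    h = np.asarray(h, dtype=float)
    d = h * rng.standard_normal(h.shape)       # errors proportional to each tap
    d_perp = d - (np.vdot(d, h) / np.vdot(h, h)) * h   # remove the component along h
    sin_t = 10 ** (target_db / 20)             # sine of the angle between h and h_hat
    tan_t = sin_t / np.sqrt(1 - sin_t ** 2)
    scale = tan_t * np.linalg.norm(h) / np.linalg.norm(d_perp)
    return h + scale * d_perp

rng = np.random.default_rng(4)
rirs = rng.standard_normal((4, 64))            # stand-ins for M = 4 measured RIRs
rirs_moderate = perturb_to_npm(rirs, -33.0, rng)
rirs_large = perturb_to_npm(rirs, -15.0, rng)
npm_moderate = npm_db(rirs, rirs_moderate)
npm_large = npm_db(rirs, rirs_large)
```

Since the NPM only depends on the angle between the true and the perturbed RIR vectors, scaling the orthogonal error component is sufficient to reach any target value below 0 dB.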

Instrumental performance measures. The performance of the equalization techniques is evaluated in terms of the reverberant energy suppression and perceptual speech quality improvement. The reverberant energy suppression is evaluated using the improvement in direct-to-reverberant ratio (ΔDRR), i.e., ΔDRR=oDRR−iDRR, with

and iDRR defined in (23). Although ΔDRR exactly describes the reverberant energy suppression, it cannot be solely used to evaluate the dereverberation performance of equalization techniques, since it does not provide any insight on the reverberant energy decay rate. To evaluate the reverberant energy decay rate, the energy decay curve (EDC) [9] of the EIR c(n) is compared to the energy decay curve of the true first RIR h_{1}(n). The EDC of the EIR c is computed as

$$ \text{EDC}_{\mathbf{c}}(n) = 10 \log_{10} \frac{1}{\|\mathbf{c}\|_{2}^{2}} \sum_{j=n}^{L_{c}-1} c^{2}(j), \quad n = 0, \ldots, L_{c}-1, $$

(27)

and the EDC of the RIR h_{1}(n) is computed similarly. The perceptual speech quality is evaluated using the frequency-weighted segmental signal-to-noise-ratio (fSNR) and the cepstral distance (CD) [43]. In [44], it has been shown that measures such as fSNR and CD can exhibit a high correlation with subjective listening tests when evaluating the overall quality and the perceived amount of reverberation for a wide range of state-of-the-art dereverberation (and noise reduction) techniques. These signal-based measures are intrusive measures, generating a similarity score between a test signal and a reference signal. The reference signal employed here is obtained by convolving the clean speech signal with the direct path and early reflections (considered to be 10 ms long) of h_{1}(n). The improvement in fSNR, i.e., ΔfSNR, is computed as the difference between the fSNR of the output signal z(n) and the fSNR of the first microphone signal x_{1}(n). Similarly, the improvement in CD, i.e., ΔCD, is computed as the difference between the CD of the output signal z(n) and the CD of the first microphone signal x_{1}(n). Note that a positive ΔfSNR and a negative ΔCD indicate a performance improvement.
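The impulse-response-based measures can be sketched in a few lines. The sketch assumes that the direct-to-reverberant ratio is the energy ratio between the first n_{d} taps and the remaining taps, and that the EDC is the normalized backward-integrated energy; the function names and the toy impulse responses are hypothetical.

```python
import numpy as np

def drr_db(h, n_d):
    """Energy ratio in dB between the first n_d taps and the remaining taps."""
    h = np.asarray(h, dtype=float)
    return 10 * np.log10(np.sum(h[:n_d] ** 2) / np.sum(h[n_d:] ** 2))

def edc_db(c):
    """Energy decay curve in dB: energy remaining from tap n to the end, normalized."""
    c = np.asarray(c, dtype=float)
    remaining = np.cumsum(c[::-1] ** 2)[::-1]  # sum of c(j)^2 for j >= n
    return 10 * np.log10(remaining / remaining[0])

f_s = 8000                                     # sampling frequency in Hz
n_d = round(0.003 * f_s)                       # 3 ms direct-path window, i.e., 24 samples

h_toy = np.array([1.0, 0.0, 0.5, 0.5])         # toy impulse response
drr_toy = drr_db(h_toy, 1)                     # 10 log10(1 / 0.5)
edc_toy = edc_db(np.array([1.0, 1.0]))         # [0, 10 log10(0.5)] dB
```

Note that the EDC in this form requires a non-zero final tap; trailing zeros would need to be trimmed before taking the logarithm.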

Algorithmic settings. For the acoustic systems described in Table 4 and for all considered equalization techniques, the conventionally used filter length is \(L^{\mathrm {t}}_{g} =\left \lceil {\frac {L_{h}-1}{M-1}}\right \rceil \), i.e., \(L^{\mathrm {t}}_{g} = 1947\) for system S_{1}, \(L^{\mathrm {t}}_{g} = 1627\) for system S_{2}, and \(L^{\mathrm {t}}_{g} = 960\) for system S_{3}. The delay is set to τ=90 and the length of the direct path and early reflections is set to L_{d}=0.01×f_{s}, corresponding to 10 ms (cf. Table 1). The target EIR c_{d} for PMINT is constructed using the first RIR, i.e., p=1 (cf. Table 1).

We consider several shorter reshaping filter lengths for all equalization techniques, i.e.,

For each acoustic system, each NPM, and each equalization technique, the optimal filter length \(L^{\text {opt}}_{g}\) is selected from (28) as the filter length yielding the lowest CD. It should be noted that using the CD for determining the optimal reshaping filter length is an intrusive procedure which cannot be applied in practice, since knowledge of the clean reference signal is required. In Section 6.4, the performance when using the automatic non-intrusive procedure for selecting the reshaping filter length is investigated.

6.2 Increasing robustness using shorter reshaping filters

In this section, the robustness of MINT, RMCLS, and PMINT against RIR perturbations is investigated when using the conventional reshaping filter length \(L^{\mathrm {t}}_{g}\) and the optimal (shorter) reshaping filter length \(L_{g}^{\text {opt}}\). Although similar results are obtained for all considered acoustic systems, in this section, only results for acoustic system S_{1} are presented. For completeness, the intrusively determined optimal reshaping filter length \(L_{g}^{\text {opt}}\) for each considered technique and NPM is presented in Table 5.

MINT using \(L^{\mathrm {t}}_{g}\) and \(L_{g}^{\text {opt}}\). Figure 5 depicts the performance of MINT when using the filter lengths \(L^{\mathrm {t}}_{g}\) and \(L_{g}^{\text {opt}}\) in terms of ΔDRR, EDC, ΔfSNR, and ΔCD. As expected, the ΔDRR values presented in Fig. 5a show that using the conventional filter length \(L^{\mathrm {t}}_{g}\) fails to suppress the reverberant energy, even significantly worsening the DRR by about 20 dB on average in comparison to h_{1}. Furthermore, it can be observed that using the shorter filter length \(L^{\text {opt}}_{g}\) (cf. Table 5) significantly increases the robustness of MINT for all NPMs, improving the DRR by about 4 dB on average in comparison to h_{1}. These results are confirmed in Fig. 5b, which depicts the EDC of h_{1} and the EDCs of the EIRs obtained using MINT with \(L^{\mathrm {t}}_{g}\) and \(L^{\text {opt}}_{g}\) for an NPM of −33 dB. It can be observed that while using \(L^{\mathrm {t}}_{g}\) completely fails to achieve dereverberation and results in a slower reverberant energy decay rate than h_{1}, using \(L^{\text {opt}}_{g}\) yields a significantly faster reverberant energy decay rate. However, using \(L^{\text {opt}}_{g}\) yields only a slight improvement of the reverberant energy decay rate when compared to h_{1}, even for the moderate NPM of −33 dB. The ΔfSNR and ΔCD values depicted in Fig. 5c, d show that using the conventional filter length \(L^{\mathrm {t}}_{g}\) in MINT yields a significantly worse quality than the unprocessed microphone signal x_{1}(n) for all NPMs. While an increase in robustness is obtained for all NPMs using \(L^{\text {opt}}_{g}\), for most considered NPMs, the performance in terms of ΔfSNR is still worse than for the unprocessed microphone signal x_{1}(n).

In summary, as expected from the theoretical analysis in Section 4, these simulation results demonstrate that using an optimal intrusively determined shorter reshaping filter length than conventionally used in MINT is advantageous to increase the robustness against RIR perturbations. However, since acoustic system inversion using MINT is very sensitive to RIR perturbations, these results indicate that even a shorter reshaping filter is not sufficient to make MINT robust enough against RIR perturbations.

RMCLS using \(L^{\mathrm {t}}_{g}\) and \(L_{g}^{\text {opt}}\). Figure 6 depicts the performance of RMCLS using the filter lengths \(L^{\mathrm {t}}_{g}\) and \(L_{g}^{\text {opt}}\) in terms of ΔDRR, EDC, ΔfSNR, and ΔCD. The ΔDRR values presented in Fig. 6a show that using the conventional filter length \(L^{\mathrm {t}}_{g}\) improves the DRR for moderate NPMs, whereas for NPMs larger than −21 dB, the DRR is worsened in comparison to h_{1}. In addition, it can be observed that using the shorter filter length \(L^{\text {opt}}_{g}\) (cf. Table 5) significantly increases the reverberant energy suppression for all NPMs, on average yielding a 6 dB larger ΔDRR in comparison to the ΔDRR obtained using \(L^{\mathrm {t}}_{g}\). To evaluate the reverberant energy decay rate, Fig. 6b depicts the EDC of h_{1} and the EDCs of the EIRs obtained using RMCLS with \(L^{\mathrm {t}}_{g}\) and \(L^{\text {opt}}_{g}\) for an NPM of −33 dB. It can be observed that for this moderate NPM, a very similar reverberant energy decay rate is obtained for RMCLS when using \(L^{\mathrm {t}}_{g}\) and \(L^{\text {opt}}_{g}\). Since RMCLS using the conventional filter length \(L^{\mathrm {t}}_{g}\) is relatively robust for moderate NPMs and yields a fast reverberant energy decay rate, a shorter reshaping filter does not yield any improvement in the reverberant energy decay rate, but instead leads to a significant improvement in the perceptual speech quality. This is illustrated in Fig. 6c, d, which show that using \(L^{\text {opt}}_{g}\) in RMCLS significantly improves the ΔfSNR and ΔCD values for all NPMs.

In summary, as expected from the theoretical analysis in Section 4, these simulation results demonstrate that using an optimal intrusively determined shorter reshaping filter length than conventionally used in RMCLS is advantageous and increases the robustness against RIR perturbations.

PMINT using \(L^{\mathrm {t}}_{g}\) and \(L_{g}^{\text {opt}}\). Figure 7 depicts the performance of PMINT using the filter lengths \(L^{\mathrm {t}}_{g}\) and \(L_{g}^{\text {opt}}\) in terms of ΔDRR, EDC, ΔfSNR, and ΔCD. As expected, the ΔDRR values presented in Fig. 7a show that using the conventional filter length \(L^{\mathrm {t}}_{g}\) in PMINT fails to suppress the reverberant energy, even worsening the DRR in comparison to h_{1}. Furthermore, it can be observed that using \(L^{\text {opt}}_{g}\) (cf. Table 5) significantly increases the robustness for all NPMs, on average improving the DRR by about 7 dB in comparison to h_{1}. These results are further confirmed in Fig. 7b, which depicts the EDC of h_{1} and the EDCs of the EIRs obtained using PMINT with \(L^{\mathrm {t}}_{g}\) and \(L^{\text {opt}}_{g}\) for an NPM of −33 dB. It can be observed that PMINT using \(L^{\mathrm {t}}_{g}\) completely fails to achieve dereverberation and results in a slower reverberant energy decay rate than h_{1}. Using the optimal reshaping filter length \(L^{\text {opt}}_{g}\) yields a significant increase in robustness, resulting in a much faster reverberant energy decay rate than h_{1}. Furthermore, the ΔfSNR and ΔCD values depicted in Fig. 7c, d show that while PMINT using \(L^{\mathrm {t}}_{g}\) worsens the perceptual speech quality in comparison to the unprocessed microphone signal x_{1}(n), using \(L^{\text {opt}}_{g}\) results in a significantly better performance.

In summary, as expected from the theoretical analysis in Section 4, these simulation results demonstrate that using an optimal intrusively determined shorter reshaping filter length than conventionally used in PMINT results in a significant increase in robustness against RIR perturbations, both in terms of reverberant energy suppression and perceptual speech quality improvement.
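As a rough illustration of how the reverberant energy measures discussed in this section can be obtained from an impulse response, the following sketch computes the EDC by backward (Schroeder) integration and the DRR as the energy ratio between the first `L_d` taps and the remaining taps. The split index `L_d`, the function names, and the toy impulse responses are assumptions made purely for illustration; the definitions actually used in the simulations are the ones given earlier in the paper.

```python
import numpy as np

def edc_db(c):
    """Energy decay curve in dB via backward (Schroeder) integration."""
    c = np.asarray(c, dtype=float)
    remaining = np.cumsum(c[::-1] ** 2)[::-1]  # energy from tap n to the end
    return 10.0 * np.log10(remaining / remaining[0])

def drr_db(c, L_d):
    """Ratio (dB) of the energy in the first L_d taps to the energy in the rest.

    L_d is an assumed split between the desired part and the reverberant tail.
    """
    c = np.asarray(c, dtype=float)
    direct = np.sum(c[:L_d] ** 2)
    reverberant = np.sum(c[L_d:] ** 2)
    return 10.0 * np.log10(direct / reverberant)

def delta_drr_db(eir, h1, L_d):
    """DRR improvement of an equalized impulse response (EIR) over the RIR h1."""
    return drr_db(eir, L_d) - drr_db(h1, L_d)

# Toy data (hypothetical): the "EIR" simply decays faster than the "RIR".
n = np.arange(200)
h1 = 0.98 ** n
eir = 0.90 ** n
improvement = delta_drr_db(eir, h1, L_d=10)
```

A positive `improvement` indicates that the EIR concentrates more of its energy in the first `L_d` taps than the unprocessed RIR, which is the behaviour reported above as a positive ΔDRR.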

6.3 Performance of equalization techniques when using the optimal intrusive reshaping filter length

In the previous section, it was shown that using a shorter reshaping filter than conventionally used increases the robustness of all considered equalization techniques against RIR perturbations. In this section, the performance of MINT, RMCLS, and PMINT using the optimal intrusively determined reshaping filter length \(L^{\text {opt}}_{g}\) is extensively compared for all acoustic systems in Table 4 and all NPMs in (25). The performance of the different techniques is evaluated in terms of ΔDRR, ΔfSNR, and ΔCD, where the presented performance measures are averaged over all considered NPMs.
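Since all results in this section are reported as a function of the NPM between the true and the estimated RIRs, a small sketch of this measure may be helpful. It uses the standard normalized projection misalignment, \(\text{NPM} = 20 \log_{10} \big(\|\mathbf{h} - \tfrac{\mathbf{h}^{T}\hat{\mathbf{h}}}{\hat{\mathbf{h}}^{T}\hat{\mathbf{h}}}\hat{\mathbf{h}}\|_{2} / \|\mathbf{h}\|_{2}\big)\). The helper `perturb_to_npm`, which adds a scaled perturbation orthogonal to the true RIR, is only one hypothetical way of reaching a prescribed NPM and is not necessarily how the perturbed RIRs in these simulations were generated.

```python
import numpy as np

def npm_db(h, h_est):
    """Normalized projection misalignment (dB) between h and its estimate."""
    h = np.asarray(h, dtype=float)
    h_est = np.asarray(h_est, dtype=float)
    projection = (h @ h_est) / (h_est @ h_est) * h_est
    return 20.0 * np.log10(np.linalg.norm(h - projection) / np.linalg.norm(h))

def perturb_to_npm(h, target_npm_db, rng):
    """Return a perturbed copy of h whose NPM equals target_npm_db (< 0 dB).

    Hypothetical construction: the perturbation is made orthogonal to h, so
    the resulting NPM can be set in closed form.
    """
    h = np.asarray(h, dtype=float)
    e = rng.standard_normal(h.shape)
    e -= (e @ h) / (h @ h) * h            # keep only the part orthogonal to h
    rho = 10.0 ** (target_npm_db / 10.0)  # squared misalignment ratio
    alpha = np.linalg.norm(h) / np.linalg.norm(e) * np.sqrt(rho / (1.0 - rho))
    return h + alpha * e

# Toy data (hypothetical): exponentially decaying random RIR.
rng = np.random.default_rng(0)
h_true = rng.standard_normal(64) * np.exp(-0.05 * np.arange(64))
h_hat = perturb_to_npm(h_true, -33.0, rng)
```

Because the perturbation is orthogonal to the true RIR, `npm_db(h_true, h_hat)` returns the requested value up to rounding errors.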

Table 6 presents the obtained ΔDRR, ΔfSNR, and ΔCD values for all considered techniques. First, it can be observed that MINT using \(L^{\text {opt}}_{g}\) results in the lowest performance in terms of all performance measures, often worsening the perceptual speech quality in comparison to the unprocessed microphone signal x_{1}(n). Since MINT is very sensitive to RIR perturbations (cf. Fig. 5), the robustness increase that can be obtained by using a shorter reshaping filter length is also limited. Second, it can be observed that RMCLS and PMINT using \(L^{\text {opt}}_{g}\) result in a high reverberant energy suppression in terms of ΔDRR, with PMINT outperforming RMCLS for systems S_{2} and S_{3}, whereas a similar performance is obtained for system S_{1}. Finally, it can be observed that for all considered acoustic systems, PMINT using the reshaping filter length \(L^{\text {opt}}_{g}\) yields the highest perceptual speech quality improvement, outperforming RMCLS in terms of ΔfSNR and ΔCD. While PMINT always improves the perceptual speech quality in comparison to the unprocessed microphone signal x_{1}(n), RMCLS sometimes fails to yield an improvement, as indicated by the negative ΔfSNR for systems S_{2} and S_{3}. The advantage of PMINT lies in its control of the early reflections in the EIR, hence better preserving the perceptual speech quality of the output signal.

In summary, based on instrumental measures, it can be said that PMINT using the optimal intrusively determined reshaping filter length \(L^{\text {opt}}_{g}\) is a robust and perceptually advantageous equalization technique, yielding a high reverberant energy suppression and outperforming all other considered equalization techniques in terms of perceptual speech quality. Informal listening tests further support this conclusion.

6.4 Performance of PMINT when using the automatic non-intrusive reshaping filter length

In this section, we investigate the performance of PMINT when using the automatic non-intrusively determined reshaping filter length \(L^{\text {auto}}_{g}\) (cf. Section 5) instead of the optimal intrusively determined reshaping filter length \(L^{\text {opt}}_{g}\). For completeness, the obtained values of \(L^{\text {auto}}_{g}\) are also compared to the values of \(L^{\text {opt}}_{g}\). As in Section 6.3, we consider all acoustic systems in Table 4 and all NPMs in (25). In order to generate the parametric L-curve, the matrix \(\mathbf {W}\hat {\mathbf {H}}\) is constructed for all reshaping filter lengths in (28), the PMINT reshaping filter is computed, and the quantities \(\chi _{\mathbf {W}\hat {\mathbf {H}}}\) and \(\|\mathbf {W}(\hat {\mathbf {H}}\mathbf {w}-\mathbf {c}_{\mathrm {d}}) \|^{2}_{2}\) are calculated. Using the triangle method [36], the automatic reshaping filter length \(L^{\text {auto}}_{g}\) corresponding to the point of maximum curvature of the L-curve is determined. The performance of PMINT using \(L^{\text {auto}}_{g}\) is evaluated in terms of ΔDRR, ΔfSNR, and ΔCD, where the presented performance measures are averaged over all considered NPMs.
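The procedure described above can be sketched in a few lines of code. The example below is a minimal illustration under several assumptions that are not taken from the paper: the weighting matrix is the identity, the target response consists of the first `L_d` taps of the first estimated RIR, the candidate lengths are a coarse grid up to the conventional length \(\lceil (L_h-1)/(M-1) \rceil\), and the corner of the L-curve is located with a simple maximum-distance-to-chord heuristic rather than with the triangle method of [36]. All function names and the random toy RIRs are hypothetical.

```python
import numpy as np

def multichannel_convolution_matrix(rirs, L_g):
    """Stack the (L_h + L_g - 1) x L_g convolution matrices of all channels."""
    L_h = len(rirs[0])
    blocks = []
    for h in rirs:
        C = np.zeros((L_h + L_g - 1, L_g))
        for j in range(L_g):
            C[j:j + L_h, j] = h  # column j holds h delayed by j samples
        blocks.append(C)
    return np.hstack(blocks)

def l_curve_points(rirs_est, L_d, lengths):
    """Least-squares error and condition number for each candidate length."""
    errors, conds = [], []
    for L_g in lengths:
        H = multichannel_convolution_matrix(rirs_est, L_g)
        # Assumed target: first L_d taps of the first estimated RIR, then zeros.
        target = np.zeros(H.shape[0])
        target[:L_d] = rirs_est[0][:L_d]
        w, *_ = np.linalg.lstsq(H, target, rcond=None)
        errors.append(float(np.sum((H @ w - target) ** 2)))
        s = np.linalg.svd(H, compute_uv=False)
        conds.append(float(s[0] / s[-1]))  # full-rank matrices assumed
    return np.array(errors), np.array(conds)

def corner_index(x, y):
    """Index of the point farthest from the chord joining the curve endpoints.

    Simple stand-in for a corner detector; NOT the triangle method of [36].
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xn = (x - x.min()) / (x.max() - x.min())  # normalize both axes to [0, 1]
    yn = (y - y.min()) / (y.max() - y.min())
    dx, dy = xn[-1] - xn[0], yn[-1] - yn[0]
    dist = np.abs(dx * (yn - yn[0]) - dy * (xn - xn[0])) / np.hypot(dx, dy)
    return int(np.argmax(dist))

def select_filter_length(rirs_est, L_d, lengths):
    """Pick the candidate length at the corner of the (error, condition) curve."""
    errors, conds = l_curve_points(rirs_est, L_d, lengths)
    return lengths[corner_index(errors, conds)]

# Toy data (hypothetical): two exponentially decaying random "estimated" RIRs.
rng = np.random.default_rng(0)
M, L_h, L_d = 2, 40, 10
decay = np.exp(-0.1 * np.arange(L_h))
rirs_est = [rng.standard_normal(L_h) * decay for _ in range(M)]
L_t = int(np.ceil((L_h - 1) / (M - 1)))  # conventional length for this toy case
lengths = list(range(4, L_t + 1, 5))
L_auto = select_filter_length(rirs_est, L_d, lengths)
```

Note that only the estimated RIRs enter this computation, which is what makes the selection non-intrusive. The choice of axis scaling and of the corner detector influences the selected length, so this sketch should not be expected to reproduce the values reported in Table 7.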

Table 7 presents the values of \(L^{\text {opt}}_{g}\) and \(L^{\text {auto}}_{g}\) for the acoustic system S_{1} and all considered NPMs. It can be observed that for low NPMs, the non-intrusively determined reshaping filter length is very similar to the optimal intrusively determined one. As the NPM increases beyond −21 dB, the reshaping filter length obtained using the proposed non-intrusive procedure is larger than the optimal intrusively determined one.

Table 8 presents the ΔDRR, ΔfSNR, and ΔCD obtained using PMINT with \(L^{\text {auto}}_{g}\) for all considered acoustic systems. It can be observed that using the automatic non-intrusively determined reshaping filter length in PMINT yields a high dereverberation performance, both in terms of reverberant energy suppression and perceptual speech quality improvement. In addition, when comparing the performance measures presented in Table 8 to the ones presented in Table 6, it can be observed that in general, the performance of PMINT when using \(L^{\text {auto}}_{g}\) is similar to the performance when using \(L^{\text {opt}}_{g}\). More precisely, the average performance degradation over all considered acoustic systems when using the automatic non-intrusively determined reshaping filter length \(L^{\text {auto}}_{g}\) instead of the optimal intrusively determined reshaping filter length \(L^{\text {opt}}_{g}\) is only 0.58 dB in terms of ΔDRR, 1.15 dB in terms of ΔfSNR, and 0.24 dB in terms of ΔCD.

In summary, the presented results show that the automatic non-intrusively determined reshaping filter length in PMINT yields a high performance in the presence of RIR perturbations, making PMINT with this shorter reshaping filter length a robust and perceptually advantageous acoustic multi-channel equalization technique.

7 Conclusions

In this paper, we have analyzed the use of a shorter reshaping filter length than conventionally used in order to increase the robustness of acoustic multi-channel equalization techniques. We have analytically shown that using a shorter reshaping filter decreases the condition number of the (weighted) convolution matrix, thereby increasing the robustness against RIR perturbations. In addition, we have proposed to automatically determine the reshaping filter length as the point of maximum curvature of the parametric plot of the condition number versus the (weighted) least-squares error, such that both quantities are simultaneously kept small. Using instrumental performance measures, it has been shown that using shorter reshaping filters indeed increases the robustness of MINT, RMCLS, and PMINT against RIR perturbations. In addition, it has been shown that PMINT using the optimal intrusively determined reshaping filter length outperforms MINT and RMCLS. Finally, it has been shown that the automatic non-intrusive procedure for selecting the reshaping filter length in PMINT yields a nearly optimal performance, confirming the practical applicability of using shorter reshaping filters in acoustic multi-channel equalization.
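The relationship between reshaping filter length and conditioning summarized above can be checked numerically on a toy example. The sketch below builds the unweighted multi-channel convolution matrix for a hypothetical two-channel system with random RIRs and evaluates its condition number for several reshaping filter lengths up to the conventional length \(\lceil (L_h-1)/(M-1) \rceil\). The random RIRs, the identity weighting, and all function names are assumptions made for illustration only.

```python
import numpy as np

def multichannel_convolution_matrix(rirs, L_g):
    """Stack the (L_h + L_g - 1) x L_g convolution matrices of all channels."""
    L_h = len(rirs[0])
    blocks = []
    for h in rirs:
        C = np.zeros((L_h + L_g - 1, L_g))
        for j in range(L_g):
            C[j:j + L_h, j] = h  # column j holds h delayed by j samples
        blocks.append(C)
    return np.hstack(blocks)

def condition_number(A):
    """Ratio of the largest to the smallest singular value (full rank assumed)."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(s[0] / s[-1])

# Toy data (hypothetical): two random RIRs of length 16.
rng = np.random.default_rng(1)
M, L_h = 2, 16
rirs = [rng.standard_normal(L_h) for _ in range(M)]
L_t = int(np.ceil((L_h - 1) / (M - 1)))  # conventional length for this toy case
lengths = list(range(3, L_t + 1, 3))
conds = [condition_number(multichannel_convolution_matrix(rirs, L_g))
         for L_g in lengths]

for L_g, chi in zip(lengths, conds):
    print(f"L_g = {L_g:2d}  condition number = {chi:.3e}")
```

In this two-channel setting, the matrix for a shorter filter length is, up to zero rows, a column subset of the matrix for a longer length, so the printed condition numbers do not decrease as the filter length grows.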

8 Appendix A

In order to construct the matrix \(\mathbf {W}_{\mathrm {s}}\hat {\mathbf {H}}_{\mathrm {s}}\) from the matrix \(\mathbf {W}_{\mathrm {t}}\hat {\mathbf {H}}_{\mathrm {t}}\) (cf. Fig. 2), we first create an intermediate \(\big [p_{\mathrm {t}}-\big (L^{\mathrm {t}}_{g}-L^{\mathrm {s}}_{g}\big)\big ] \times \big [q_{\mathrm {t}}-\big (L^{\mathrm {t}}_{g}-L^{\mathrm {s}}_{g}\big)\big ]\)-dimensional sub-matrix T by deleting \(L^{\mathrm {t}}_{g}-L^{\mathrm {s}}_{g}\) rows and \(L^{\mathrm {t}}_{g}-L^{\mathrm {s}}_{g}\) columns from \(\mathbf {W}_{\mathrm {t}}\hat {\mathbf {H}}_{\mathrm {t}}\). The interlacing inequalities in (19) for the matrices \(\mathbf {W}_{\mathrm {t}}\hat {\mathbf {H}}_{\mathrm {t}}\) and T can then be written as

for \(i = 1, \; \ldots, \; r_{\mathrm {t}}-\big (L^{\mathrm {t}}_{g}-L^{\mathrm {s}}_{g}\big), \; \ldots, \; p_{\mathrm {t}}-\big (L^{\mathrm {t}}_{g}-L^{\mathrm {s}}_{g}\big)\). Using i=1 and \(i = r_{\mathrm {t}}-\big (L^{\mathrm {t}}_{g}-L^{\mathrm {s}}_{g}\big)\) in (29), the following inequalities between the singular values of the matrices \(\mathbf {W}_{\mathrm {t}}\hat {\mathbf {H}}_{\mathrm {t}}\) and T can be established:

In order to construct the matrix \(\mathbf {W}_{\mathrm {s}}\hat {\mathbf {H}}_{\mathrm {s}}\), \((M-1)\big (L^{\mathrm {t}}_{g}-L^{\mathrm {s}}_{g}\big)\) columns are now deleted from the matrix T. The interlacing inequalities in (19) for the matrices T and \(\mathbf {W}_{\mathrm {s}}\hat {\mathbf {H}}_{\mathrm {s}}\) can then be written as

for i=1, …, r_{s}. Using i=1 and i=r_{s} in (32), the following inequalities between the singular values of the matrices T and \(\mathbf {W}_{\mathrm {s}}\hat {\mathbf {H}}_{\mathrm {s}}\) can be written:

Finally, combining (30), (31), (33), and (38), the following inequalities relating the largest and the smallest non-zero singular values of \(\mathbf {W}_{\mathrm {t}}\hat {\mathbf {H}}_{\mathrm {t}}\) and \(\mathbf {W}_{\mathrm {s}}\hat {\mathbf {H}}_{\mathrm {s}}\) can be established:

I Arweiler, JM Buchholz, The influence of spectral characteristics of early reflections on speech intelligibility. J. Acoust. Soc. Am. 130(2), 996–1005 (2011).

A Warzybok, J Rennies, T Brand, S Doclo, B Kollmeier, Effects of spatial and temporal integration of a single early reflection on speech intelligibility. J. Acoust. Soc. Am. 133(1), 269–282 (2013).

R Beutelmann, T Brand, Prediction of speech intelligibility in spatial noise and reverberation for normal-hearing and hearing-impaired listeners. J. Acoust. Soc. Am. 120(1), 331–342 (2006).

S Goetze, E Albertin, J Rennies, EAP Habets, K-D Kammeyer, in Proc. AES International Conference on Sound Quality Evaluation. Speech quality assessment for listening-room compensation (Pitea, 2010), pp. 11–20.

A Warzybok, I Kodrasi, JO Jungmann, EAP Habets, T Gerkmann, A Mertins, S Doclo, B Kollmeier, S Goetze, in Proc. International Workshop on Acoustic Echo and Noise Control. Subjective speech quality and speech intelligibility evaluation of single-channel dereverberation algorithms (Antibes, 2014), pp. 333–337.

T Yoshioka, A Sehr, M Delcroix, K Kinoshita, R Maas, T Nakatani, W Kellermann, Making machines understand us in reverberant rooms: robustness against reverberation for automatic speech recognition. IEEE Signal Proc. Mag. 29(6), 114–126 (2012).

F Xiong, BT Meyer, N Moritz, R Rehr, J Anemüller, T Gerkmann, S Doclo, S Goetze, Front-end technologies for robust ASR in reverberant environments–spectral enhancement-based dereverberation and auditory modulation filterbank features. EURASIP J. Adv. Signal Process. 2015(1), 1–18 (2015).

EAP Habets, S Gannot, I Cohen, Late reverberant spectral variance estimation based on a statistical model. IEEE Signal Process. Lett. 16(9), 770–774 (2009).

A Kuklasiński, S Doclo, SH Jensen, J Jensen, Maximum likelihood PSD estimation for speech enhancement in reverberation and noise. IEEE/ACM Trans. Audio Speech Lang. Process. 24(9), 1595–1608 (2016).

I Kodrasi, S Doclo, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing. Late reverberant power spectral density estimation based on an eigenvalue decomposition (New Orleans, 2017), pp. 611–615.

I Kodrasi, S Doclo, in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. Multi-channel late reverberation power spectral density estimation based on nuclear norm minimization (New York, 2017). (accepted for publication).

T Nakatani, T Yoshioka, K Kinoshita, M Miyoshi, B-H Juang, Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Trans. Audio Speech Lang. Process. 18(7), 1717–1731 (2010).

D Schmid, G Enzner, S Malik, D Kolossa, R Martin, Variational Bayesian inference for multichannel dereverberation and noise reduction. IEEE/ACM Trans. Audio Speech Lang. Process. 22(8), 1320–1335 (2014).

B Schwartz, S Gannot, EAP Habets, Online speech dereverberation using Kalman filter and EM algorithm. IEEE/ACM Trans. Audio Speech Lang. Process. 23(2), 394–406 (2015).

A Jukić, T Van Waterschoot, T Gerkmann, S Doclo, Multi-channel linear prediction-based speech dereverberation with sparse priors. IEEE/ACM Trans. Audio Speech Lang. Process. 23(9), 1509–1520 (2015).

M Kallinger, A Mertins, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing. Multi-channel room impulse response shaping - a study (Toulouse, 2006), pp. 101–104.

JO Jungmann, R Mazur, M Kallinger, M Tiemin, A Mertins, Combined acoustic MIMO channel crosstalk cancellation and room impulse response reshaping. IEEE Trans. Audio Speech Lang. Process. 20(6), 1829–1842 (2012).

T Hikichi, M Delcroix, M Miyoshi, Inverse filtering for speech dereverberation less sensitive to noise and room transfer function fluctuations. EURASIP J. Adv. Signal Process. 2007 (2007).

F Lim, W Zhang, EAP Habets, PA Naylor, Robust multichannel dereverberation using relaxed multichannel least squares. IEEE/ACM Trans. Audio Speech Lang. Process. 22(9), 1379–1390 (2014).

I Kodrasi, S Goetze, S Doclo, Regularization for partial multichannel equalization for speech dereverberation. IEEE Trans. Audio Speech Lang. Process. 21(9), 1879–1890 (2013).

RS Rashobh, AWH Khong, D Liu, Multichannel equalization in the KLT and frequency domains with application to speech dereverberation. IEEE/ACM Trans. Audio Speech Lang. Process. 22(3), 634–646 (2014).

AWH Khong, L Xiang, PA Naylor, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing. Algorithms for identifying clusters of near-common zeros in multichannel blind system identification and equalization (Las Vegas, 2008), pp. 229–232.

MA Haque, T Hasan, Noise robust multichannel frequency-domain LMS algorithms for blind channel identification. IEEE Signal Process. Lett. 15, 305–308 (2008).

M Hu, ND Gaubitch, PA Naylor, DB Ward, in Proc. European Signal Processing Conference. Noise robust blind system identification algorithms based on a Rayleigh quotient cost function (Nice, 2015).

ND Gaubitch, PA Naylor, Equalization of multichannel acoustic systems in oversampled subbands. IEEE Trans. Audio Speech Lang. Process. 17(6), 1061–1070 (2009).

F Lim, PA Naylor, in Proc. European Signal Processing Conference. Robust speech dereverberation using subband multichannel least squares with variable relaxation (Marrakech, 2013).

I Kodrasi, S Doclo, in Proc. European Signal Processing Conference. The effect of inverse filter length on the robustness of acoustic multichannel equalization (Bucharest, 2012).

PC Hansen, Analysis of discrete ill-posed problems by means of the L-curve. SIAM Rev. 34(4), 561–580 (1992).

A Farina, in Proc. AES Convention. Simultaneous measurement of impulse response and distortion with a swept-sine technique (Pitea, 2000), pp. 18–22.

M Nilsson, SD Soli, A Sullivan, Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise. J. Acoust. Soc. Am. 95(2), 1085–1099 (1994).

K Kinoshita, M Delcroix, S Gannot, EAP Habets, R Haeb-Umbach, W Kellermann, V Leutnant, R Maas, T Nakatani, B Raj, A Sehr, T Yoshioka, A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research. EURASIP J. Adv. Signal Process. 2016(1), 1–19 (2016).

This work was supported in part by the Cluster of Excellence 1077 “Hearing4All,” funded by the German Research Foundation (DFG) and the joint Lower Saxony-Israeli Project ATHENA, funded by the State of Lower Saxony.

Author information

Authors and Affiliations

Department of Medical Physics and Acoustics, University of Oldenburg, Oldenburg, 26111, Germany

The contribution of the first author consists in developing the main algorithmic idea, deriving the mathematical analysis, performing simulations, analyzing the simulation results, and drafting the article. The contribution of the second author consists in critically discussing the mathematical analysis and the simulation results and in proofreading and revising the article. All authors read and approved the final manuscript.

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Kodrasi, I., Doclo, S. Improving the conditioning of the optimization criterion in acoustic multi-channel equalization using shorter reshaping filters. EURASIP J. Adv. Signal Process. 2018, 11 (2018). https://doi.org/10.1186/s13634-018-0532-1