A dynamic multi-channel speech enhancement system for distributed microphones in a car environment

Matheja, Timo; Buck, Markus; Fingscheidt, Tim

doi:10.1186/1687-6180-2013-191

Research
Open access
Published: 27 December 2013


Timo Matheja¹,
Markus Buck¹ &
Tim Fingscheidt²

EURASIP Journal on Advances in Signal Processing volume 2013, Article number: 191 (2013)


Abstract

Supporting multiple active speakers in automotive hands-free or speech dialog applications is an interesting issue not least due to comfort reasons. Therefore, a multi-channel system for enhancement of speech signals captured by distributed distant microphones in a car environment is presented. Each of the potential speakers in the car has a dedicated directional microphone close to his position that captures the corresponding speech signal. The aim of the resulting overall system is twofold: On the one hand, a combination of an arbitrary pre-defined subset of speakers’ signals can be performed, e.g., to create an output signal in a hands-free telephone conference call for a far-end communication partner. On the other hand, annoying cross-talk components from interfering sound sources occurring in multiple different mixed output signals are to be eliminated, motivated by the possibility of other hands-free applications being active in parallel. The system includes several signal processing stages. A dedicated signal processing block for interfering speaker cancellation attenuates the cross-talk components of undesired speech. Further signal enhancement comprises the reduction of residual cross-talk and background noise. Subsequently, a dynamic signal combination stage merges the processed single-microphone signals to obtain appropriate mixed signals at the system output that may be passed to applications such as telephony or a speech dialog system. Based on signal power ratios between the particular microphone signals, an appropriate speaker activity detection and therewith a robust control mechanism of the whole system is presented. The proposed system may be dynamically configured and has been evaluated for a car setup with four speakers sitting in the car cabin disturbed in various noise conditions.

1 Introduction

Speech technologies in the car are becoming increasingly important for reasons of safety and comfort. In automotive environments, many different applications are possible, such as hands-free telephony, teleconferencing, or speech dialog and recognition. Especially in a car, strong background noise caused by the engine, wind, and tires, as well as interfering sound sources, may disturb the speech signal and impair the proper functioning of these applications. Thus, for speech signal enhancement, multi-microphone arrangements are often used, which enable multi-channel signal processing algorithms. Beamforming approaches [1, 2] require a small spacing between the microphones and a predefined geometry to achieve sufficient performance.

In contrast, this contribution focuses on distributed microphones, where the arrangement is not limited to fixed geometries but each speaker in the car cabin has a dedicated microphone close to his position. In the case at hand, with multiple microphones and multiple speakers to be supported, the sensor signals have to be combined in a beneficial way. The literature often focuses on setups where multiple microphones capture the speech signal of one single speaker. In this case, multiple spatially distributed microphones may be mounted in the direct vicinity of just one speaker in order to search for the optimal microphone position.

Hence, the best combination of all microphone signals can be chosen for each bin in the frequency domain, e.g., by applying diversity methods as in [3, 4]. The aim is to obtain exactly one enhanced and combined output signal from several input signals. This combined signal can be fed to a hands-free device or a speech recognition system.

For noisy speech recognition in in-car situations, a fundamentally different approach to exploiting spatially distributed distant microphones is introduced in [5, 6]. It proposes estimating the log speech spectrum at a hypothetical close-talking microphone, which should have good quality, by multiple regression of the log spectra of several distant microphone signals. In other environments, where the speaker’s location is not known beforehand, the microphones are mounted arbitrarily in a living room or in an office to take advantage of spatial diversity for distant-talking speech recognition [7] or to perform real-time speaker localization as in [8].

Furthermore, regarding speech enhancement, cross-talk components in the desired signal that originate from interfering speakers are a major problem for hands-free as well as speech recognition systems. Within the scope of this contribution, these components should be suppressed to enhance the combined output signals. Various signal compensation approaches based on Widrow’s original work [9] are well known from a range of publications. Because such structures risk cancelling the desired signal, an additional filter helps prevent the cancellation of desired components [10]. This enhanced structure is also picked up by [11] for frequency-domain cross-talk cancellation within a call center scenario. The idea is to create a multiple-input multiple-output system, where each output only includes the speech of the dedicated speaker. Similar results can be achieved by blind source separation algorithms [12, 13], but these methods are often computationally more expensive. Furthermore, in the case of speaker-dedicated microphones, signal compensation approaches can exploit quite good reference signals for compensating interfering speech components. For further cross-talk suppression, appropriate post-processing schemes exist [14, 15].

In this contribution, a generic overall system for speech enhancement of distributed microphone signals is proposed, where each speaker has only one dedicated microphone. An overview is depicted in Figure 1. M microphone signals are transformed to the discrete Fourier transform (DFT) domain and processed, yielding Q mixed output signals that serve a number of applications at the same time. The number of output instances designed for different applications can be configured generically during processing. The core processing part consists of an interfering speaker cancellation (ISC) that compensates the cross-talk components that should not be present in the respective output instance, a signal enhancement (SE) block performing an extended noise reduction, and a dynamic signal combination (DSC) module. The latter combines a subset of the speaker-related microphone signals into a particular output signal. The whole signal processing is controlled by a control unit based on the comparison of signal powers (see also [11, 16]).

In a full-duplex speech communication system, acoustic echoes resulting from the coupling between loudspeakers and microphones have to be avoided. For the proposed system, where Q speech applications may be active in parallel, M multi-channel echo cancellation structures are needed, each having as many adaptive filters as loudspeaker channels are used by the application. To solve the echo cancellation problem, the related reference signals of the Q different and uncorrelated far-end partners or systems are directly accessible. Stereo and multi-channel acoustic echo cancellation and the presentation of efficient solutions are not within the scope of this contribution; further details can be found in [17, 18].

The paper is organized as follows: In Section 2, a more detailed overview of the generic system is given. An interfering speaker cancellation is presented in Section 3. The following Section 4 discusses the signal enhancement stage and its submodules. Afterwards, in Section 5, the dynamic signal combination is considered. The robust control of the whole signal processing is introduced in Section 6, and the contribution concludes with an evaluation of the overall system.

2 Generic speech communication system

We propose a highly generic system that allows several speech applications in the car to be handled in parallel. As mentioned, the acoustic echo cancellation problem is not considered in the following. Consider, e.g., two telephone conference calls from the car at the same time, where the two front passengers communicate with one far-end partner and the backseat passengers with another one within a second application.

The multi-channel system has M microphones and yields a set of Q mixed output signals. Assuming that all the speech sources are uncorrelated, the mth microphone signal y_m(n) can be formulated in the time domain as the superposition of the clean speech s_m(n), the cross-talk b_m(n), and the background noise component n_m(n), with n being the sample index:

\begin{align} y_{m} (n) = s_{m} (n) + b_{m} (n) + n_{m} (n) . \end{align}

(1)

With the time frame index ℓ and the frequency subband index k, the related signal representation in the DFT domain is

\begin{align} Y (ℓ, k) = S (ℓ, k) + B (ℓ, k) + N (ℓ, k) . \end{align}

(2)

The column vector Y(ℓ,k) contains the microphone input signals Y_m(ℓ,k) for all microphone channels m=1,…,M. Accordingly, the vector S(ℓ,k) contains the input speech components S_m(ℓ,k), the vector B(ℓ,k) the interfering speech components B_m(ℓ,k), and the vector N(ℓ,k) the noise components N_m(ℓ,k). In general, bold uppercase letters with time frame and/or frequency indices indicate vectors containing M single components, one for each microphone channel m, or Q components, one for each output instance q. The processing is performed at a sampling rate of f_s=16 kHz. For analysis, a discrete Fourier transform of length K=512 with a frame shift of R=128 and a Hann window function is applied. Thus, the subband index k is in the range k=0,1,…,K-1. Due to the conjugate symmetry of the DFT of real-valued signals, only the first K/2+1 subbands are effectively processed.
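As a quick check of what these analysis parameters imply, the following sketch derives the number of processed subbands, the subband spacing, the frame shift in milliseconds, and the window overlap. Only f_s, K, and R come from the text; the variable names are chosen here for illustration.

```python
# Analysis parameters stated in the text.
FS_HZ = 16_000  # sampling rate f_s in Hz
K = 512         # DFT length
R = 128         # frame shift in samples

num_bins = K // 2 + 1                # non-redundant subbands of a real-valued signal
bin_spacing_hz = FS_HZ / K           # frequency spacing between adjacent subbands
frame_shift_ms = 1000.0 * R / FS_HZ  # time advance per frame index
window_overlap = 1.0 - R / K         # fraction of each window shared with the next

print(num_bins, bin_spacing_hz, frame_shift_ms, window_overlap)
```

With these values, 257 subbands spaced 31.25 Hz apart are processed, and the frame index ℓ advances every 8 ms with 75% window overlap.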

An overview of the four main parts of the whole distributed microphone processing system is depicted in Figure 2. Bold arrows and characters indicate the availability of multiple channels stacked in vectors. Within the ISC block, interfering speakers can be suppressed in a distant target channel by using their dedicated microphone signals as references for a noise compensation. An adaptive filter structure uses these references to cancel exactly those cross-talk components in the target signals that should not be present later in one of the Q output signals. The cross-talk components within those target channels that will later be combined into the same mixed output signal are not cancelled, so that diversity effects can be exploited afterwards. $\overset{̆}{Y} (ℓ, k)$ is the resulting signal vector after filtering the interfering cross-talk speech components ${\overset{̆}{Y}}^{c} (ℓ, k)$ by ${\hat{H}}_{m, m^{'}} (ℓ, k)$ and subtracting the results from the input signal spectra Y(ℓ,k). The filter ${\hat{F}}_{m^{'}, m} (ℓ, k)$ realizes a blocking structure to avoid signal cancellation effects within the actual ISC.

The adaptation of the filters is controlled by a speaker activity detection (SAD) measure $\hat{SAD} (ℓ, k)$, determined in the SAD block of the control unit. To obtain similar signal characteristics in all output channels, an automatic gain control within the signal enhancement (SE) stage adjusts all signal peak levels to a constant target peak level, yielding $\tilde{Y} (ℓ, k)$. During speech activity of a single speaker, coupling factors $\hat{K} (ℓ, k)$ between the particular signals can then be computed. From these, residual cross-talk can be estimated, yielding appropriate filter coefficients G^RCS(ℓ,k) and maximum attenuations β^RCS(ℓ,k) for residual cross-talk suppression (RCS) within an extended noise reduction (ENR). This noise reduction block also prepares the signals for the DSC. Due to the different microphone positions and types, the noise signal characteristics (especially noise level and coloration) may differ strongly across the microphone channels. Since this may cause annoying switching artifacts in a combined signal, we propose to adjust all noise power spectral densities (PSDs) ${\hat{Φ}}_{\tilde{N} \tilde{N}} (ℓ, k)$ in each channel to a target reference noise level ${\hat{Φ}}_{\tilde{N} \tilde{N}}^{ref} (ℓ, k)$ for the transitions at speaker changes by applying a spectral floor β^DSC(ℓ,k) within a Wiener noise reduction filter. The determination of the target values is controlled by the fullband speaker activity detection measure $\hat{SAD} (ℓ)$ that also controls the subsequent signal combination. The noise-reduced signals $\tilde{X} (ℓ, k)$ are merged to obtain Q mixed signals X(ℓ,k), each being a combination of some processed input channel signals. The quantities $\tilde{X} (ℓ, k)$ still include cross-talk components between those channels that are to be combined into one output signal. Hence, spatial diversity can be exploited.

For controlling the ISC, the SE, and the DSC, some matrices are introduced to determine the behavior of the overall system. W^ISC is a symmetric M×M matrix containing zeros and ones, where each row represents a destination channel m and each column a source channel m′. Setting a one at position 〈m,m′〉 means that the m′th source will be eliminated from the mth channel. If, in a system with M=4, channels 3 and 4 are to be cancelled from channels 1 and 2 and vice versa, the matrix is defined as

\begin{align} W^{ISC} = [\begin{matrix} 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{matrix}] . \end{align}

(3)

In a further Q×M matrix W^DSC, each row represents an output signal q and each column an input channel m. A one at position 〈q,m〉 indicates that channel m has to be present in the qth mixed output signal. For W^ISC in (3), the related mixing control matrix for Q=2 output signals is

\begin{align} W^{DSC} = [\begin{matrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{matrix}] . \end{align}

(4)

In order to implement a generic configuration of the whole system, these control matrices are used for selecting the particular channels to process.
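The relation between the two example matrices can be illustrated with a small sketch. It assumes a rule that is consistent with (3) and (4) but is not stated as such in the text: a source m′ is cancelled from channel m exactly when no output signal mixes both channels. The function name `isc_from_dsc` is hypothetical.

```python
def isc_from_dsc(w_dsc):
    """Build an ISC control matrix from a DSC control matrix.

    Assumed rule (for illustration only): cancel source mp from channel m
    unless m == mp or some output row q contains both channels.
    """
    num_ch = len(w_dsc[0])
    w_isc = [[0] * num_ch for _ in range(num_ch)]
    for m in range(num_ch):
        for mp in range(num_ch):
            shared = any(row[m] and row[mp] for row in w_dsc)
            w_isc[m][mp] = 0 if (m == mp or shared) else 1
    return w_isc

# Mixing control matrix of Eq. (4): front seats -> output 1, back seats -> output 2.
W_DSC = [[1, 1, 0, 0],
         [0, 0, 1, 1]]
W_ISC = isc_from_dsc(W_DSC)
for row in W_ISC:
    print(row)
```

For the configuration in (4), this reproduces the matrix in (3). In the system itself, both matrices are configuration inputs and may be set independently.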

3 Interfering speaker cancellation

In this section, a method to suppress the undesired cross-talk components in each target channel is presented. Interfering speech from a speaker whose dedicated channel signal will later be combined with the considered target channel signal within the qth mixed output signal is not defined as ‘undesired’ and is not eliminated in the target channel. This behavior can be configured by the ISC control matrix W^ISC introduced in (3). Hence, computational costs are saved within the ISC, and the possibility of exploiting spatial diversity between the microphone channel signals during the later signal combination is preserved. The ISC structure is shown in Figure 2 and consists of two parts. The actual cross-talk cancellation stage uses the output of a preceding blocking structure instead of the microphone signals directly. The blocking stage reduces the cancellation of the desired signal by providing an improved reference signal for the compensation of the undesired components. ISC structures with a blocking stage have been proposed in [10] and are used in [11, 19]. Other solutions for enhancing the reference signal in noise cancellation structures are considered, e.g., in [20]. In this contribution, the blocking structure approach is applied similarly to [11], but for the multi-channel case with more than two microphones in a car environment.

3.1 Blocking stage

Within the first stage, adaptive filtering is performed by the blocking structure, where the M-1 other microphone signals are filtered by ${\hat{F}}_{m^{'}, m} (ℓ, k)$ and subtracted from the signal spectrum in channel m′. This yields the signal component ${\overset{̆}{Y}}_{m^{'}}^{c} (ℓ, k)$ to be used as a reference signal for cross-talk cancellation in the mth channel. With the Hermitian operator (·)^H, the output results in

\begin{align} {\overset{̆}{Y}}_{m^{'}}^{c} (ℓ, k) = Y_{m^{'}} (ℓ, k) - \sum_{\substack{m \in \{1, \dots, M\} \\ m \neq m^{'}}} {({\hat{F}}_{m^{'}, m} (ℓ, k))}^{H} Y_{m} (ℓ, k), \end{align}

(5)

with the related column vectors for filtering

\begin{array}{l} {\hat{F}}_{m^{'}, m} (ℓ, k) = {[{\hat{F}}_{m^{'}, m, 0} (ℓ, k), \dots, {\hat{F}}_{m^{'}, m, L_{FIR} - 1} (ℓ, k)]}^{T}, \\ Y_{m} (ℓ, k) = {[Y_{m} (ℓ, k), \dots, Y_{m} (ℓ - L_{FIR} + 1, k)]}^{T} . \end{array}

(6)

Here, L_FIR indicates the length of the adaptive filters, and (·)^T denotes the transpose. To exclude the desired speech components from the ISC reference, and thereby to avoid signal cancellation within the ISC, the filter coefficients ${\hat{F}}_{m^{'}, m} (ℓ, k)$ are only updated if solely the particular mth ISC target channel shows speaker activity ( ${\hat{SAD}}_{m} (ℓ, k)$ =1, as introduced in Section 6.1 and determined by (82)). Thus, the resulting speech component, and therewith the effective cross-talk in channel m′ calculated in (5), will ideally equal 0 during these situations. The filter coefficients are adapted by the NLMS algorithm (e.g., [21]):

\begin{align} {\hat{F}}_{m^{'}, m} (ℓ + 1, k) = {\hat{F}}_{m^{'}, m} (ℓ, k) + α_{m}^{bs} (ℓ, k) \frac{{\overset{̆}{Y}}_{m^{'}}^{c \, *} (ℓ, k) \, Y_{m} (ℓ, k)}{∥ Y_{m} (ℓ, k) ∥^{2}}, \end{align}

(7)

where (·)^∗ denotes complex conjugation. The related step size can be expressed as

\begin{align} α_{m}^{bs} (ℓ, k) = \begin{cases} α, & \text{if } {\hat{SAD}}_{m} (ℓ, k) = 1, \\ 0, & \text{else.} \end{cases} \end{align}

(8)

Alternatively, for a two-channel scenario, an additional control mechanism based on an optimal step size has been proposed by the authors in [22]. The preferred values for the implementation are L_FIR=3 and α=0.3.
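The blocking-stage recursion in (5), (7), and (8) can be sketched for a single subband and a single pair of channels as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the leakage of speaker m into channel m′ is modeled as a fixed factor of 0.5, the activity flag is set by hand, and the regularization constant `eps` and all names are chosen here.

```python
import random

L_FIR = 3    # filter length from the text
ALPHA = 0.3  # step size from the text

def nlms_step(filt, ref_hist, target, step, eps=1e-12):
    """One NLMS iteration for one subband.

    filt, ref_hist: lists of L_FIR complex values (ref_hist newest first).
    Returns the error (the blocked reference of Eq. (5)) and the updated filter (Eq. (7)).
    eps is a small regularizer added here to avoid division by zero.
    """
    estimate = sum(f.conjugate() * y for f, y in zip(filt, ref_hist))  # F^H Y
    error = target - estimate
    norm = sum(abs(y) ** 2 for y in ref_hist) + eps
    new_filt = [f + step * error.conjugate() * y / norm
                for f, y in zip(filt, ref_hist)]
    return error, new_filt

rng = random.Random(0)
filt = [0j] * L_FIR
hist = [0j] * L_FIR
for frame in range(400):
    y_m = complex(rng.gauss(0, 1), rng.gauss(0, 1))  # target speaker in channel m
    hist = [y_m] + hist[:-1]                         # last L_FIR frames, newest first
    y_mp = 0.5 * y_m                                 # toy leakage into channel m'
    sad_m = 1                                        # only speaker m is active
    step = ALPHA if sad_m == 1 else 0.0              # Eq. (8)
    err, filt = nlms_step(filt, hist, y_mp, step)

final_error = abs(err)
print(final_error, filt[0])
```

In this noise-free toy case, the first filter tap converges towards the leakage factor and the blocked reference tends to zero, which is the behavior the blocking stage relies on while only the target speaker is active.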

3.2 Cross-talk cancellation stage

As depicted in Figure 2, the cross-talk cancellation stage follows as the second part of the ISC. With the filter vector ${\hat{H}}_{m, m^{'}} (ℓ, k)$ and the cross-talk component vector ${\overset{̆}{Y}}_{m^{'}}^{c} (ℓ, k)$ introduced by

\begin{array}{l} {\hat{H}}_{m, m^{'}} (ℓ, k) = {[Ĥ_{m, m^{'}, 0} (ℓ, k), \dots, Ĥ_{m, m^{'}, L_{FIR} - 1} (ℓ, k)]}^{T}, \\ {\overset{̆}{Y}}_{m^{'}}^{c} (ℓ, k) = {[{\overset{̆}{Y}}_{m^{'}}^{c} (ℓ, k), \dots, {\overset{̆}{Y}}_{m^{'}}^{c} (ℓ - L_{FIR} + 1, k)]}^{T}, \end{array}

(9)

the cross-talk cancelled signal is obtained:

\begin{align} {\overset{̆}{Y}}_{m} (ℓ, k) = Y_{m} (ℓ, k) - \sum_{m^{'} = 1}^{M} W_{m, m^{'}}^{ISC} \cdot {\hat{H}}_{m, m^{'}}^{H} (ℓ, k) {\overset{̆}{Y}}_{m^{'}}^{c} (ℓ, k) . \end{align}

(10)

Here, the product of the last two factors constitutes the filtered cross-talk components originating from the channels m′, which are subtracted from the target signals Y_m(ℓ,k) to cancel the interfering speakers’ signals (see the structure in Figure 2). The single elements $W_{m, m^{'}}^{ISC}$ of the ISC control matrix W^ISC ensure that only those cross-talk components are eliminated that are to be cancelled. Due to forced zeros on the main diagonal of W^ISC, the contribution of the desired signal itself is excluded. For the filter update with the NLMS algorithm, we have

\begin{align} {\hat{H}}_{m, m^{'}} (ℓ + 1, k) = {\hat{H}}_{m, m^{'}} (ℓ, k) + α_{m^{'}} (ℓ, k) \frac{{\overset{̆}{Y}}_{m}^{*} (ℓ, k) {\overset{̆}{Y}}_{m^{'}}^{c} (ℓ, k)}{∥ {\overset{̆}{Y}}_{m^{'}}^{c} (ℓ, k) ∥^{2}} . \end{align}

(11)

The step size

\begin{align} α_{m^{'}} (ℓ, k) = \begin{cases} α, & \text{if } {\hat{SAD}}_{m^{'}} (ℓ, k) = 1, \\ 0, & \text{else,} \end{cases} \end{align}

(12)

controls the ISC adaptation: the cross-talk cancellation filters ${\hat{H}}_{m, m^{'}} (ℓ, k)$ are only updated if interfering speech activity is indicated for the m′th channel. The signals are filtered continuously, and the cross-talk components are attenuated without causing much distortion of the desired speech.
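The subtraction in (10), including the gating by the control matrix, may be sketched for one subband as follows. The data layout (nested lists indexed by channel) and the function name are assumptions made for this illustration; the filter adaptation of (11) and (12) is omitted.

```python
def cancel_crosstalk(y, yc_hist, h, w_isc):
    """Apply Eq. (10) in one subband.

    y[m]:        complex microphone spectrum of channel m
    yc_hist[mp]: list of L_FIR blocked reference values of channel mp (newest first)
    h[m][mp]:    list of L_FIR complex filter taps for source mp in channel m
    w_isc:       ISC control matrix of zeros and ones
    """
    num_ch = len(y)
    out = []
    for m in range(num_ch):
        estimate = 0j
        for mp in range(num_ch):
            if w_isc[m][mp]:  # only sources marked for cancellation contribute
                estimate += sum(c.conjugate() * r
                                for c, r in zip(h[m][mp], yc_hist[mp]))
        out.append(y[m] - estimate)
    return out

# Two-channel toy example with L_FIR = 1.
w_isc = [[0, 1], [1, 0]]
y = [3 + 0j, 2 + 0j]
yc_hist = [[3 + 0j], [2 + 0j]]
h = [[[0j], [0.5 + 0j]],
     [[0j], [0j]]]
y_clean = cancel_crosstalk(y, yc_hist, h, w_isc)
print(y_clean)
```

Setting an entry of `w_isc` to zero skips the corresponding filter entirely, which reflects the computational saving mentioned for channels that are mixed into the same output.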

4 Signal enhancement (SE)

To further enhance the speech signals, an automatic gain control (AGC) and an ENR follow within the SE block. In addition to a stationary noise reduction, remaining residual cross-talk components are suppressed, and the AGC and the ENR adjust the signal characteristics prior to the subsequent signal combination. These parts are discussed in the following. Suitable parameters for the implementation of the SE part are listed in Table 1.

Table 1 Preferred parameter values for the implementation of the signal enhancement part


4.1 Automatic gain control

Due to varying distances between the speakers and the microphones, the microphone speech signal levels differ among the channels. To compensate for these differences, an AGC is applied. Based on the input signal ${\overset{̆}{Y}}_{m} (ℓ, k)$, the related peak level ${\overset{̆}{Y}}_{m}^{P} (ℓ)$ is estimated, and a fullband amplification factor a_m(ℓ) is determined to adapt the current peak level to a target peak level ${\overset{̆}{Y}}^{ref}$ that can be defined beforehand. A method for peak level estimation based on a simple speech activity detector is proposed in [23]. Here, however, the speaker activity detector presented in Appendix 2 (enhanced fullband speaker activity detection based on multipath-induced fading patterns) is used, and instead of processing a time-domain signal for peak tracking, a root-mean-square measure over all subbands is applied. The peak level is estimated whenever single-talk is detected for the related channel. Single-channel speech activity ${\hat{STD}}_{m} (ℓ) \in \{0, 1\}$ is indicated by

\begin{align} {\hat{STD}}_{m} (ℓ) = \begin{cases} 1, & \text{if } {\hat{SAD}}_{m} (ℓ) = 1 \land \hat{DTD} (ℓ) = 0, \\ 0, & \text{else.} \end{cases} \end{align}

(13)

For an introduction to the fullband speaker activity detector ${\hat{SAD}}_{m} (ℓ) \in \{0, 1\}$ and the double-talk detector $\hat{DTD} (ℓ) \in \{0, 1\}$, please refer to Section 6.1 and Appendix 1 (basic fullband speaker activity detection). The equalized output for each channel results in

\begin{align} {\tilde{Y}}_{m} (ℓ, k) = a_{m} (ℓ) {\overset{̆}{Y}}_{m} (ℓ, k), \end{align}

(14)

with the recursively averaged frequency-independent gain factors [24]

\begin{align} a_{m} (ℓ) = γ_{a} \cdot a_{m} (ℓ - 1) + (1 - γ_{a}) \cdot \frac{{\overset{̆}{Y}}^{ref}}{{\overset{̆}{Y}}_{m}^{P} (ℓ)} . \end{align}

(15)
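The single-talk decision in (13) and the recursive gain smoothing in (15) can be sketched as follows. The smoothing constant and the target level are placeholders chosen for this example (the preferred values are those of Table 1), and the peak level is held fixed to show how the gain approaches the ratio of target to peak level.

```python
def single_talk(sad_m, dtd):
    """Eq. (13): channel m is in single-talk if it is active and no double-talk is detected."""
    return 1 if (sad_m == 1 and dtd == 0) else 0

def agc_gain(prev_gain, peak_level, ref_level, gamma_a):
    """Eq. (15): first-order recursive smoothing of the fullband gain factor."""
    return gamma_a * prev_gain + (1.0 - gamma_a) * ref_level / peak_level

GAMMA_A = 0.5    # hypothetical smoothing constant, not taken from the paper
REF_LEVEL = 0.5  # hypothetical target peak level
peak = 0.25      # peak level estimate, assumed to come from single-talk frames
gain = 1.0
for _ in range(3):
    gain = agc_gain(gain, peak, REF_LEVEL, GAMMA_A)
print(gain)  # moves towards REF_LEVEL / peak = 2.0
```

A smoothing constant closer to one makes the gain react more slowly, which avoids audible level jumps when the peak estimate changes.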

4.2 Extended noise reduction

To obtain an overall extended noise reduction that includes a postfilter for residual cross-talk suppression (RCS) and a dynamic maximum attenuation enabling the later dynamic combination of the microphone signals, two approaches are combined into one noise reduction characteristic. The noisy signal ${\tilde{Y}}_{m} (ℓ, k)$ is filtered according to

\begin{align} {\tilde{X}}_{m} (ℓ, k) = G_{m}^{ENR} (ℓ, k) \cdot {\tilde{Y}}_{m} (ℓ, k) . \end{align}

(16)

The filter coefficients $G_{m}^{ENR} (ℓ, k)$ are determined by restriction of the cross-talk suppression filter coefficients $G_{m}^{RCS} (ℓ, k)$ to a time- and frequency-dependent maximum attenuation $β_{m}^{ENR} (ℓ, k)$ to keep a certain level of residual background noise and mask artifacts like musical tones:

\begin{align} G_{m}^{ENR} (ℓ, k) = max \{G_{m}^{RCS} (ℓ, k), β_{m}^{ENR} (ℓ, k)\} . \end{align}

(17)

The maximum attenuation includes two factors:

\begin{align} β_{m}^{ENR} (ℓ, k) = β_{m}^{RCS} (ℓ, k) \cdot β_{m}^{DSC} (ℓ, k), \end{align}

(18)

where the first factor is the spectral floor conditioned by the cross-talk suppression postfilter in Section 4.2.1, and the second one is the additional maximum attenuation for DSC determined in Section 4.2.4.

4.2.1 Postfilter for residual cross-talk suppression

To suppress the residual cross-talk components ${\tilde{B}}_{m} (ℓ, k)$ still present in the cross-talk compensated and equalized signal ${\tilde{Y}}_{m} (ℓ, k)$, a postprocessing step can be applied that complements the reduction of stationary background noise, similar to the approach in [14]. In general, different spectral weighting filter characteristics can be chosen for noise reduction. Instead of the basic Wiener filter [23], this contribution proposes a recursive Wiener filter [25] to reduce musical tones in the noise-reduced output signal ${\tilde{X}}_{m} (ℓ, k)$. With the maximum noise overestimation factor γ_WF1 and the fixed overestimation γ_WF2, the filter coefficients of the residual cross-talk suppression postfilter result in

\begin{align} G_{m}^{RCS} (ℓ, k) = 1 - min \{γ_{WF 1}, \frac{γ_{WF 2}}{G_{m}^{ENR} (ℓ - 1, k)}\} \cdot \frac{{\hat{Φ}}_{\tilde{N} \tilde{N}, m}^{'} (ℓ, k)}{{\hat{Φ}}_{\tilde{Y} \tilde{Y}, m} (ℓ, k)}, \end{align}

(19)

where $G_{m}^{ENR} (ℓ - 1, k)$ is the limited filter coefficient of the previous frame, i.e., $G_{m}^{RCS} (ℓ - 1, k)$ after applying the maximum attenuation in (17). Furthermore, ${\hat{Φ}}_{\tilde{N} \tilde{N}, m}^{'} (ℓ, k)$ is a modified noise PSD that combines an AGC-weighted stationary noise part and the residual cross-talk component:

\begin{align} {\hat{Φ}}_{\tilde{N} \tilde{N}, m}^{'} (ℓ, k) = {\hat{Φ}}_{\tilde{N} \tilde{N}, m} (ℓ, k) + {\hat{Φ}}_{\tilde{B} \tilde{B}, m} (ℓ, k), \end{align}

(20)

where the stationary term is determined by weighting a continuously estimated noise PSD ${\hat{Φ}}_{\overset{̆}{N} \overset{̆}{N}, m} (ℓ, k)$ by the squared AGC gain factors (15) as

\begin{align} {\hat{Φ}}_{\tilde{N} \tilde{N}, m} (ℓ, k) = a_{m}^{2} (ℓ) \cdot {\hat{Φ}}_{\overset{̆}{N} \overset{̆}{N}, m} (ℓ, k) . \end{align}

(21)

To obtain ${\hat{Φ}}_{\overset{̆}{N} \overset{̆}{N}, m} (ℓ, k)$, e.g., the improved minimum recursive averaging approach [26] can be chosen. Regarding (19), it has to be ensured that the cross-talk components are effectively suppressed. Thus, the residual cross-talk suppression component $β_{m}^{RCS} (ℓ, k)$ of the overall spectral floor in (18) has to be adjusted. In addition to a constant spectral floor β, a dynamic time- and frequency-dependent component attenuates the residual cross-talk down to the same level as the stationary background noise. Including β, we have

\begin{align} β_{m}^{RCS} (ℓ, k) = β \cdot \sqrt{\frac{{\hat{Φ}}_{\tilde{N} \tilde{N}, m} (ℓ, k)}{{\hat{Φ}}_{\tilde{N} \tilde{N}, m} (ℓ, k) + {\hat{Φ}}_{\tilde{B} \tilde{B}, m} (ℓ, k)}} . \end{align}

(22)
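The interplay of (17) to (20) and (22) for one channel and one subband can be sketched as follows. All numeric values, including the overestimation factors, are placeholders for illustration and are not the preferred values of Table 1; the function names are chosen here.

```python
import math

def rcs_gain(psd_y, psd_n, psd_b, g_enr_prev, gamma_wf1, gamma_wf2):
    """Eq. (19) with the modified noise PSD of Eq. (20).

    g_enr_prev is the limited gain of the previous frame; it is positive
    whenever the spectral floor of Eq. (17) is positive.
    """
    psd_n_mod = psd_n + psd_b                           # Eq. (20)
    overestimation = min(gamma_wf1, gamma_wf2 / g_enr_prev)
    return 1.0 - overestimation * psd_n_mod / psd_y

def rcs_floor(beta, psd_n, psd_b):
    """Eq. (22): lower the floor so residual cross-talk ends up at the noise level."""
    return beta * math.sqrt(psd_n / (psd_n + psd_b))

def enr_gain(g_rcs, beta_rcs, beta_dsc):
    """Eqs. (17) and (18): limit the postfilter gain to the combined maximum attenuation."""
    return max(g_rcs, beta_rcs * beta_dsc)

# Placeholder PSD values for a frame with residual cross-talk.
g_rcs = rcs_gain(psd_y=4.0, psd_n=1.0, psd_b=1.0,
                 g_enr_prev=1.0, gamma_wf1=2.0, gamma_wf2=1.0)
beta_rcs = rcs_floor(beta=0.5, psd_n=1.0, psd_b=3.0)
g_enr = enr_gain(g_rcs, beta_rcs, beta_dsc=1.0)
print(g_rcs, beta_rcs, g_enr)
```

Because the previous limited gain appears in the denominator of the overestimation term, a strongly attenuated bin is overestimated more in the next frame, up to the cap γ_WF1, which is the mechanism that suppresses isolated musical tones.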

4.2.2 Residual cross-talk

For realization of the residual cross-talk suppression, an estimate for the residual cross-talk component ${\hat{Φ}}_{\tilde{B} \tilde{B}, m} (ℓ, k)$ used in (20) and (22) has to be determined. Due to the signal model described in (2), it follows for the processed signal PSD estimates after the ISC and AGC:

\begin{align} {\hat{Φ}}_{\tilde{Y} \tilde{Y}, m} (ℓ, k) = {\hat{Φ}}_{\tilde{S} \tilde{S}, m} (ℓ, k) + {\hat{Φ}}_{\tilde{B} \tilde{B}, m} (ℓ, k) + {\hat{Φ}}_{\tilde{N} \tilde{N}, m} (ℓ, k), \end{align}

(23)

with ${\hat{Φ}}_{\tilde{S} \tilde{S}, m} (ℓ, k)$ including all desired speech components - direct and cross-talk components - that are not to be cancelled. The overall residual cross-talk in channel m can be expressed as the sum of all relevant components resulting from each channel m ^′ to the desired one m:

\begin{align} {\hat{Φ}}_{\tilde{B} \tilde{B}, m} (ℓ, k) = \sum_{m^{'} = 1}^{M} W_{m, m^{'}}^{ISC} \cdot {\hat{Φ}}_{\tilde{B} \tilde{B}, m, m^{'}} (ℓ, k) . \end{align}

(24)

Due to forced zeros on the main diagonal of W ^ISC, the contribution of the desired signal itself is always excluded. The residual cross-talk quantity ${\hat{Φ}}_{\tilde{B} \tilde{B}, m, m^{'}} (ℓ, k)$ in channel m resulting from the m ^′th channel cannot be observed. It may be estimated by weighting a remote speaker’s signal PSD in channel m ^′ by an estimated instantaneous acoustic coupling factor ${\tilde{K}}_{m, m^{'}} (ℓ, k)$ between each channel m ^′ and the channel m:

\begin{align} {\hat{Φ}}_{\tilde{B} \tilde{B}, m, m^{'}} (ℓ, k) = {\tilde{K}}_{m, m^{'}} (ℓ, k) \cdot {\hat{Φ}}_{\tilde{S} \tilde{S}, m^{'}} (ℓ, k) . \end{align}

(25)

Alternatively, during single-talk activity in the m′th channel only ( ${\hat{Φ}}_{\tilde{S} \tilde{S}, m} (ℓ, k) = 0$ and ${\hat{Φ}}_{\tilde{B} \tilde{B}, m, m^{'}} (ℓ, k) = {\hat{Φ}}_{\tilde{B} \tilde{B}, m} (ℓ, k)$ ), the residual cross-talk PSD can be expressed by directly observable quantities. Rearranging (23) under these conditions and inserting (24) results in

\begin{align} {\hat{Φ}}_{\tilde{B} \tilde{B}, m, m^{'}} (ℓ, k) = {\hat{Φ}}_{\tilde{Y} \tilde{Y}, m} (ℓ, k) - {\hat{Φ}}_{\tilde{N} \tilde{N}, m} (ℓ, k) . \end{align}

(26)

Accordingly, the single-talk speech component PSD in channel m′ is

\begin{align} {\hat{Φ}}_{\tilde{S} \tilde{S}, m^{'}} (ℓ, k) = {\hat{Φ}}_{\tilde{Y} \tilde{Y}, m^{'}} (ℓ, k) - {\hat{Φ}}_{\tilde{N} \tilde{N}, m^{'}} (ℓ, k) . \end{align}

(27)

With this expression and a long-term estimate ${\hat{K}}_{m, m^{'}} (ℓ, k)$ for the coupling factor, the residual cross-talk PSD can be estimated as the weighted sum of the remote speech components in all considered channels m′. Inserting (25) into (24) and using the long-term estimate, we have

\begin{align} {\hat{Φ}}_{\tilde{B} \tilde{B}, m} (ℓ, k) = \sum_{m^{'} = 1}^{M} W_{m, m^{'}}^{ISC} \cdot {\hat{K}}_{m, m^{'}} (ℓ, k) \cdot {\hat{Φ}}_{\tilde{S} \tilde{S}, m^{'}} (ℓ, k) . \end{align}

(28)

Note that if no single-talk speech activity occurs in channel m′, then ${\hat{Φ}}_{\tilde{S} \tilde{S}, m^{'}} (ℓ, k) = 0$. In the computation of the overall cross-talk quantity in channel m, the coefficients of the ISC control matrix W^ISC again exclude cross-talk components originating from a channel that is to be merged with the currently considered one afterwards.
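The estimate in (28), together with the speech PSD of (27), can be sketched for one subband as follows. The coupling factors and PSD values are placeholders, and the clamp at zero in `speech_psd` is a safeguard added here that is not part of the text.

```python
def speech_psd(psd_y, psd_n):
    """Eq. (27); the clamp at zero is an added safeguard against negative estimates."""
    return max(psd_y - psd_n, 0.0)

def residual_crosstalk_psd(m, w_isc, k_hat, psd_s):
    """Eq. (28): weighted sum of remote speech PSDs for target channel m."""
    return sum(w_isc[m][mp] * k_hat[m][mp] * psd_s[mp]
               for mp in range(len(psd_s)))

# Control matrix of Eq. (3) and hypothetical long-term coupling factors.
W_ISC = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [1, 1, 0, 0],
         [1, 1, 0, 0]]
K_HAT = [[0.1] * 4 for _ in range(4)]

# Only the speaker of channel 3 (index 2) is active in this frame.
psd_s = [0.0, 0.0, speech_psd(2.5, 0.5), 0.0]
psd_b_ch1 = residual_crosstalk_psd(0, W_ISC, K_HAT, psd_s)
psd_b_ch4 = residual_crosstalk_psd(3, W_ISC, K_HAT, psd_s)
print(psd_b_ch1, psd_b_ch4)
```

Channel 1 receives a residual cross-talk estimate from the active back-seat speaker, whereas channel 4 receives none, because channels 3 and 4 are mixed into the same output and the corresponding entry of W^ISC is zero.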

4.2.3 Coupling factor

The principle of an acoustic coupling factor was introduced for the acoustic echo cancellation problem in [23]. Using (26) and (27), the instantaneous coupling factor within (25) can be expressed during single-talk as

\begin{align} {\tilde{K}}_{m, m^{'}} (ℓ, k) = \frac{{\hat{Φ}}_{\tilde{Y} \tilde{Y}, m} (ℓ, k) - {\hat{Φ}}_{\tilde{N} \tilde{N}, m} (ℓ, k)}{{\hat{Φ}}_{\tilde{Y} \tilde{Y}, m^{'}} (ℓ, k) - {\hat{Φ}}_{\tilde{N} \tilde{N}, m^{'}} (ℓ, k)} . \end{align}

(29)

The long-term estimate of the coupling factor applied in (28) is updated during periods of single-talk whenever frequency-selective speech activity is detected in the m ^′th channel:

\begin{align} {\hat{K}}_{m, m^{'}} (ℓ, k) = \begin{cases} {\hat{K}}_{m, m^{'}} (ℓ - 1, k), & \text{if } {\hat{SAD}}_{m^{'}} (ℓ, k) = 0 \lor \hat{DTD} (ℓ) = 1, \\ γ_{m, m^{'}}^{K} (ℓ, k) \cdot {\hat{K}}_{m, m^{'}} (ℓ - 1, k), & \text{else,} \end{cases} \end{align}

(30)

with the time- and frequency-dependent factor $γ_{m, m^{'}}^{K} (ℓ, k)$ determined by comparing the instantaneous coupling factor with the long-term estimate:

\begin{align} γ_{m, m^{'}}^{K} (ℓ, k) = \begin{cases} γ_{inc}^{K}, & \text{if } {\tilde{K}}_{m, m^{'}} (ℓ, k) > {\hat{K}}_{m, m^{'}} (ℓ - 1, k), \\ γ_{dec}^{K}, & \text{if } {\tilde{K}}_{m, m^{'}} (ℓ, k) < {\hat{K}}_{m, m^{'}} (ℓ - 1, k), \\ 1, & \text{else.} \end{cases} \end{align}

(31)

For increasing and decreasing the estimate, appropriate constants $γ_{inc}^{K}$ and $γ_{dec}^{K}$ are chosen. The fullband speaker activity detection ${\hat{SAD}}_{m} (ℓ)$, the frequency-selective one ${\hat{SAD}}_{m^{'}} (ℓ, k)$, and the double-talk detector $\hat{DTD} (ℓ)$ are explained in Section 6.1, Appendix 1 (basic fullband speaker activity detection), Appendix 2 (enhanced fullband speaker activity detection based on multipath-induced fading patterns), and Appendix 3 (frequency-selective speaker activity detection).
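The tracking rule in (29) to (31) can be sketched for one channel pair and one subband as follows. The increase and decrease constants used here are deliberately coarse placeholders so that the effect is visible; the preferred values are those of Table 1.

```python
def instantaneous_coupling(psd_y_m, psd_n_m, psd_y_mp, psd_n_mp):
    """Eq. (29): ratio of noise-compensated PSDs, valid during single-talk in channel mp."""
    return (psd_y_m - psd_n_m) / (psd_y_mp - psd_n_mp)

def update_coupling(k_prev, k_inst, sad_mp, dtd, gamma_inc, gamma_dec):
    """Eqs. (30) and (31): multiplicative tracking of the long-term coupling factor."""
    if sad_mp == 0 or dtd == 1:
        return k_prev                # hold the estimate outside single-talk
    if k_inst > k_prev:
        return gamma_inc * k_prev    # move upwards
    if k_inst < k_prev:
        return gamma_dec * k_prev    # move downwards
    return k_prev

GAMMA_INC = 2.0  # hypothetical constant (> 1)
GAMMA_DEC = 0.5  # hypothetical constant (< 1)

k_inst = instantaneous_coupling(psd_y_m=1.2, psd_n_m=0.2, psd_y_mp=5.5, psd_n_mp=0.5)
k_up = update_coupling(0.1, k_inst, sad_mp=1, dtd=0,
                       gamma_inc=GAMMA_INC, gamma_dec=GAMMA_DEC)
k_hold = update_coupling(0.1, k_inst, sad_mp=0, dtd=0,
                         gamma_inc=GAMMA_INC, gamma_dec=GAMMA_DEC)
print(k_inst, k_up, k_hold)
```

Since the update is multiplicative, the long-term estimate has to be initialized with a positive value; an estimate of zero would never change.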

4.2.4 Dynamic maximum attenuation

The noise signal characteristics may differ strongly across the microphone channels, depending on the position or type of the microphone and the kind of background noise. However, as a preprocessing step for the realization of a dynamic combination of the microphone signals (Section 5), equal power and spectral shape of the background noise have to be provided for all related channels during transitions between different active speakers if a switching between them is performed. Thus, annoying switching artifacts are to be avoided by a dynamic maximum attenuation that can be applied within the noise reduction regarding (17) and (18). The dynamic spectral floor factor[24]

\begin{align} β_{m}^{DSC} (ℓ, k) = \sqrt{\frac{{\hat{Φ}}_{\tilde{N} \tilde{N}, m}^{ref} (ℓ, k)}{{\hat{Φ}}_{\tilde{N} \tilde{N}, m} (ℓ, k)}} \end{align}

(32)

used in (18) adjusts the estimated noise PSD for each microphone channel signal to a reference PSD ${\hat{Φ}}_{\tilde{N} \tilde{N}, m}^{ref} (ℓ, k)$ in such a way that no discontinuities within the signal characteristics are noticeable across the microphones. The reference PSD is determined by (50) in Section 6.2, and ${\hat{Φ}}_{\tilde{N} \tilde{N}, m} (ℓ, k)$ is obtained by (21). Regarding the maximum attenuation, it might be advantageous to introduce a limit $β_{m}^{DSC} (ℓ, k) \in [β^{min}, β^{max}]$ with $β^{min} \leq β \leq β^{max}$ [24] for an adequate performance of the DSC.
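As a small worked sketch of (32) with the optional limitation, the floor factor for one subband could be computed as below. The function name, the guard `eps`, and the example limits are assumptions chosen for illustration only.

```python
import math

def dynamic_spectral_floor(phi_ref, phi_noise, beta_min, beta_max, eps=1e-12):
    """Dynamic spectral floor for one channel and subband, following (32).

    phi_ref   : reference noise PSD of the output the channel belongs to
    phi_noise : estimated noise PSD of the channel itself
    beta_min, beta_max : limits of the floor (illustrative values, assumed)
    eps       : hypothetical guard against division by zero
    """
    beta = math.sqrt(phi_ref / max(phi_noise, eps))
    # Restrict the floor to the interval [beta_min, beta_max]
    return min(max(beta, beta_min), beta_max)

# A channel whose noise PSD is four times the reference gets a floor of 0.5
print(dynamic_spectral_floor(1.0, 4.0, beta_min=0.1, beta_max=1.0))
```

A channel that is noisier than the reference thus receives a smaller floor, i.e., a stronger maximum attenuation, so that the residual noise after noise reduction approaches the reference level.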

5 Dynamic signal combination

Finally, the signals of the single-microphone channels have to be combined to mixed output signals. Applying the AGC in (14) and the extended noise reduction in (16) with the dynamic maximum attenuation in (32), this can be performed without noticeable switching artifacts within the signal characteristics. In[24], the authors presented a solution for this challenge but without considering diversity and with only one desired output signal Q=1. Within the presented generic system here, diversity effects are exploited, similar to[27]. Frequency-selective switching shall be applied, and it shall be possible to serve several speech applications in parallel. Hence, selected microphone channels are to be combined to Q separate mixed output signals. The microphone channels to be combined to one output signal instance can be selected by the DSC control matrix W ^DSC introduced in (4). As depicted in Figure2, after the signal combination, the vector X(ℓ,k) includes Q output signals, where each is a combination of some appropriately processed microphone signals $\tilde{X} (ℓ, k)$ . If speech activity is detected only in channels that are combined to the q th output signal later and no speech activity is detected in other channels at this time instance, we call it output-related single-talk. Then, the mixed signal can be calculated by a combination of all M available signals ${\tilde{X}}_{m} (ℓ, k)$ to exploit the spatial diversity by considering the cross-talk components occurring in each of all channels. For the appropriate output-related single-talk detection measure ${\hat{STM}}_{q} (ℓ)$ , we have

\begin{align} {\hat{STM}}_{q} (ℓ) = \begin{cases} 0, & \text{if } {\hat{DTD}}_{\bar{q}} (ℓ) = 1, \\ \sum_{m = 1}^{M} W_{q, m}^{DSC} \cdot {\hat{SAD}}_{m} (ℓ), & \text{else}, \end{cases} \end{align}

(33)

where $\bar{q} \in \{1, \dots, Q\}$ with $\bar{q} \neq q$. Here, ${\hat{DTD}}_{\bar{q}} (ℓ) \in \{0, 1\}$ is a double-talk detector related to the specific $\bar{q}$ th output signal. It takes effect if speech is detected not only for the currently observed q th output signal but also in microphone channels related to other output signals $\bar{q}$. For details concerning the robust fullband SAD detector ${\hat{SAD}}_{m} (ℓ)$, please refer to Section 6.1 and Appendix 2 (enhanced fullband speaker activity detection based on multipath-induced fading patterns).
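The measure in (33) amounts to counting the active channels assigned to output q unless activity related to another output is flagged. A minimal sketch, assuming 0-based indices, a Q x M control matrix given as nested lists, and the double-talk flag supplied from outside (its computation is not reproduced here):

```python
def output_single_talk_measure(q, w_dsc, sad, dtd_other):
    """Output-related single-talk measure, following (33).

    q         : index of the observed output signal (0-based here)
    w_dsc     : Q x M control matrix with entries in {0, 1}
    sad       : fullband speaker activity flags, one per channel
    dtd_other : 1 if speech related to another output is detected, else 0
    """
    if dtd_other == 1:
        return 0
    return sum(w_dsc[q][m] * sad[m] for m in range(len(sad)))

# Example configuration (assumed): front channels -> output 0, rear channels -> output 1
W_DSC = [[1, 1, 0, 0],
         [0, 0, 1, 1]]
print(output_single_talk_measure(0, W_DSC, [1, 0, 0, 0], dtd_other=0))
```

A value of 1 indicates exactly one active speaker within the output, and a value larger than 1 indicates several active speakers that are all merged into the same output.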

During the described output-related single-talk, the magnitude and phase are treated differently and independently within the signal combination process. The spectral magnitude of the channel signal showing the best signal-to-noise ratio (SNR) is selected by the real-valued weights w_q,m(ℓ,k) ∈ {0,1}, and the phase $ϕ_{q}^{mix} (ℓ, k)$ of the last active channel within the q th mixed output signal is appended (e.g., similar to [3]). Hence, for the q th output signal, we obtain

\begin{align} X_{q} (ℓ, k) = \begin{cases} \sum_{m = 1}^{M} w_{q, m} (ℓ, k) \cdot |{\tilde{X}}_{m} (ℓ, k)| \cdot e^{j ϕ_{q}^{mix} (ℓ, k)}, & \text{if } {\hat{STM}}_{q} (ℓ) > 0, \\ \sum_{m = 1}^{M} W_{q, m}^{DSC} \cdot w_{q, m} (ℓ) \cdot {\tilde{X}}_{m} (ℓ, k), & \text{else}. \end{cases} \end{align}

(34)

The last line of (34) applies if speech activity is detected in channels other than those related to the q th output signal, or if an overall noise period occurs. In this case, no frequency-selective channel switching is adopted but rather a fullband decision controlled by the weights w_q,m(ℓ) ∈ {0,1}. With the Kronecker delta δ_m,u(ℓ,k) selecting the channel with the maximum SNR, the temporary frequency-selective weights result in

\begin{align} w_{q, m}^{'} (ℓ, k) = δ_{m, u (ℓ, k)}, \end{align}

(35)

where u(ℓ,k) ∈ {1,…,M} denotes the channel showing the maximum SNR:

\begin{align} u (ℓ, k) = \underset{m \in \{1, \dots, M\}}{\arg\max} \{{\hat{ξ}}_{m} (ℓ, k)\} . \end{align}

(36)

In Section 6.1, the estimation of the SNR ${\hat{ξ}}_{m} (ℓ, k)$ is given by (43). The final resulting frequency-selective weight is determined by

\begin{align} w_{q, m} (ℓ, k) = \begin{cases} w_{q, m}^{'} (ℓ, k), & \text{if } |{\tilde{X}}_{u (ℓ, k)} (ℓ, k)| > |{\tilde{X}}_{m} (ℓ, k)|, \\ w_{q, m} (ℓ), & \text{else}. \end{cases} \end{align}

(37)

This implies that the maximum SNR channel is only selected if the absolute value of the noise-reduced signal in this channel is larger than that of the signal in the currently observed m th channel. Otherwise, the fullband weight w_q,m(ℓ) is used. To determine it, fullband activity of the m th speaker corresponding to the q th output signal is evaluated. If no speaker or more than one speaker is active per output instance, the previous decision is kept:

\begin{align} w_{q, m} (ℓ) = \begin{cases} {\hat{SAD}}_{m} (ℓ), & \text{if } {\hat{STM}}_{q} (ℓ) = 1, \\ w_{q, m} (ℓ - 1), & \text{else}. \end{cases} \end{align}

(38)

Regarding the phase in (34), always the phase value of the last active channel in the present output instance indicated by v _q(ℓ) is used:

\begin{align} ϕ_{q}^{mix} (ℓ, k) = ϕ_{v_{q} (ℓ)} (ℓ, k) . \end{align}

(39)

With (38), it follows for the last active channel index:

\begin{align} v_{q} (ℓ) = \underset{m \in \{1, \dots, M\}}{\arg\max} \{w_{q, m} (ℓ)\} . \end{align}

(40)
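The building blocks of the combination stage can be sketched for one output q and one subband as follows. This is a simplified illustration under several assumptions: channel indices are 0-based, all function names are hypothetical, ties in the maximum search are resolved in favor of the lowest index, and the final frequency-selective weights of (37) are expected as an input rather than derived inside the sketch.

```python
import cmath

def max_snr_channel(snr):
    """(36): index u of the channel with the maximum SNR in this subband."""
    return max(range(len(snr)), key=lambda m: snr[m])

def temporary_weights(snr):
    """(35): Kronecker delta that selects the maximum-SNR channel."""
    u = max_snr_channel(snr)
    return [1 if m == u else 0 for m in range(len(snr))]

def fullband_weights(w_prev, sad, stm_q):
    """(38): follow the SAD if exactly one speaker is active, else hold."""
    return list(sad) if stm_q == 1 else list(w_prev)

def last_active_channel(w_full):
    """(40): index v_q of the last active channel of this output."""
    return max(range(len(w_full)), key=lambda m: w_full[m])

def combine_bin(x_tilde, w_bin, w_full, w_dsc_q, stm_q):
    """(34) for one subband; x_tilde holds the complex channel spectra."""
    if stm_q > 0:
        # Magnitude selected by the frequency-selective weights,
        # phase taken from the last active channel, see (39)
        v_q = last_active_channel(w_full)
        magnitude = sum(w * abs(x) for w, x in zip(w_bin, x_tilde))
        return cmath.rect(magnitude, cmath.phase(x_tilde[v_q]))
    # Fullband switching restricted to the channels of this output
    return sum(d * w * x for d, w, x in zip(w_dsc_q, w_full, x_tilde))
```

Separating the magnitude selection from the phase selection mirrors the statement above that both are treated independently during output-related single-talk.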

6 Control Unit

The energy-based control mechanism for the proposed speech communication system with distributed speaker-dedicated microphones is introduced below as well as the reference noise PSD estimation that is important for the signal combination process.

6.1 Robust speaker activity detection

A robust differentiation between several speakers has to be achieved. For this SAD, an energy-based approach relying on the evaluation of signal power ratios between the microphone signals is applied. A similar overall SAD system was already introduced by the authors in[28]. An overview of the whole SAD block is given in Figure3. The enhanced fullband detector $\hat{SAD} (ℓ)$ as an improvement of a basic fullband detector $\tilde{SAD} (ℓ)$ as well as a frequency-selective detector $\hat{SAD} (ℓ, k)$ is obtained. As depicted in Figure2, the fullband SAD measure is used for general control, whereas the frequency-selective value is of interest for the ISC and especially for controlling the adaptive filters. Besides relying on the signal power ratio (SPR), these detectors are based on the SNR as a further energy-based measure.

As a first step, the SPR has to be defined. Regarding the signal model in (2), we obtain for the signal PSD estimate ${\hat{Φ}}_{ΣΣ, m} (ℓ, k)$ including the direct speech component as well as the cross-talk components

\begin{align} {\hat{Φ}}_{ΣΣ, m} (ℓ, k) = max \{{\hat{Φ}}_{YY, m} (ℓ, k) - {\hat{Φ}}_{NN, m} (ℓ, k), 0\} . \end{align}

(41)

The estimate ${\hat{Φ}}_{YY, m} (ℓ, k)$ is determined by smoothing the squared magnitudes of the microphone signal spectra Y _m(ℓ,k). The noise PSD ${\hat{Φ}}_{NN, m} (ℓ, k)$ can be estimated, e.g., by the improved minimum controlled recursive averaging approach[26]. In a system with M ≥ 2 microphones, the SPR is expressed similar to[29] for each channel m as

\begin{align} {\hat{SPR}}_{m} (ℓ, k) = \frac{\max \{{\hat{Φ}}_{ΣΣ, m} (ℓ, k), ε\}}{\max \{\max_{m^{'} \in \{1, \dots, M\}, \, m^{'} \neq m} \{{\hat{Φ}}_{ΣΣ, m^{'}} (ℓ, k)\}, ε\}}, \end{align}

(42)

with the very small value ε ensuring the validity of the expression. Since each speaker has a dedicated microphone, and assuming that one microphone always captures the speech best, the active speaker can be identified by evaluating the SPR among the available microphones. Basically, speech activity of speaker m is detected if the related logarithmic SPR is larger than 0 dB. For computational details of such a basic fullband detector $\tilde{SAD} (ℓ)$, please refer to Appendix 1 (basic fullband speaker activity detection). In order to consider the SPR only in significant regions during the determination of the SAD, the channel-related SNR ${\hat{ξ}}_{m} (ℓ, k)$ is included. It is estimated similar to [30] by

\begin{align} {\hat{ξ}}_{m} (ℓ, k) = \frac{\max \{\min \{{\hat{Φ}}_{YY, m} (ℓ, k), {|Y_{m} (ℓ, k)|}^{2}\} - {\hat{Φ}}_{NN, m}^{'} (ℓ, k), 0\}}{{\hat{Φ}}_{NN, m}^{'} (ℓ, k)}, \end{align}

(43)

with the PSD estimate ${\hat{Φ}}_{YY, m} (ℓ, k)$ and the related modified noise estimate ${\hat{Φ}}_{NN, m}^{'} (ℓ, k)$ for the determination of a reliable SNR value. Using the preferred factor γ _SNR=4, it follows

\begin{align} {\hat{Φ}}_{NN, m}^{'} (ℓ, k) = γ_{SNR} \cdot {\hat{Φ}}_{NN, m} (ℓ, k) . \end{align}

(44)
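The quantities in (41) to (44) can be sketched per subband as below. The function names and the value of `eps` are assumptions; the factor of 4 for the modified noise estimate is the preferred value stated above, and the noise PSD is assumed to be strictly positive.

```python
def signal_psd(phi_yy, phi_nn):
    """(41): noise-compensated signal PSD, floored at zero."""
    return max(phi_yy - phi_nn, 0.0)

def spr(phi_sigma, m, eps=1e-10):
    """(42): power ratio of channel m to the strongest other channel.

    phi_sigma : list with the signal PSD of every channel in this subband
    eps       : small value keeping the ratio well defined (value assumed)
    """
    strongest_other = max(p for i, p in enumerate(phi_sigma) if i != m)
    return max(phi_sigma[m], eps) / max(strongest_other, eps)

def snr(phi_yy, y_mag_sq, phi_nn, gamma_snr=4.0):
    """(43) with the modified noise estimate of (44)."""
    phi_nn_mod = gamma_snr * phi_nn
    return max(min(phi_yy, y_mag_sq) - phi_nn_mod, 0.0) / phi_nn_mod

# Channel 0 dominates: its SPR is above one (positive in dB), all others below one
phi_sigma = [4.0, 1.0, 0.5, 0.0]
print([spr(phi_sigma, m) for m in range(4)])
```

Overestimating the noise by the factor in (44) pushes weakly excited subbands to an SNR of zero, so that only clearly speech-dominated subbands contribute to the later SAD decision.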

By simply evaluating the power ratios, the presented basic fullband detection of the active speaker can be performed. However, transient interferers such as indicator noise, cars passing outside, and speech from interfering speakers may be wrongly assigned to one speaker’s activity, e.g., when backseat passengers interfere in a system with only two microphones in the front. The robustness in these and other situations can be increased by applying an enhanced fullband detector $\hat{SAD} (ℓ)$ based on the exploitation of SPR patterns, as first introduced by the authors in [29]. In this way, the characteristics of the room acoustics are taken into account and evaluated. Due to the distinctive room acoustics, a sharp decline of the energy may occur in some special subbands of the speaker’s dedicated m th microphone signal. This causes a lower amount of energy in the speaker’s closest microphone compared to the distant ones. Hence, for the active m th speaker, the related observable signal power ratio ${\hat{SPR}}_{m} (ℓ, k)$ is smaller than one or at least very low only in some special subbands. These subbands may be called multipath-induced fading subbands, as they are related to multipath propagation effects. The number and location of these subbands are assumed to be characteristic for each sound source at a different location in the car. Thus, appropriate patterns representing this effect may indicate the position of a speaker if they match a reference pattern set. For further details, please refer to Appendix 2 (enhanced fullband speaker activity detection based on multipath-induced fading patterns).

After the determination of the robust fullband SAD, a frequency-selective detection $\hat{SAD} (ℓ, k)$ of the active speaker has to be carried out. Due to the occurring multipath-induced fading subbands, it is not reliable to distinguish between the active speakers depending on whether a positive or negative logarithmic SPR occurs in a frequency subband, as presented in [11]. In case of speech activity of one speaker, the related SPR may show negative values for a small number of the multipath-induced fading subbands due to the room acoustics. The detection of speech activity might be missed in these subbands for the corresponding speaker. Thus, we want to avoid a decision based on hard thresholding and propose an approach that exploits a modeling of the power ratios, as was similarly proposed by the authors in [31]. The details of the specific version used in this contribution are presented in Appendix 3 (frequency-selective speaker activity detection). Finally, it should be noted that due to the sparseness of speech activity, double-talk does not have to be detected in a frequency-selective manner but rather on a frame basis as a fullband measure.

6.2 Reference noise power spectral density estimation

For signal combination, the spectra of the residual background noise after noise reduction are aligned among the Q output signals. This allows for selecting channels within the dynamic signal combination unit without getting switching artifacts. The spectral alignment to a reference noise spectrum is done by dynamic modification of a frequency-dependent spectral floor parameter within the noise reduction (16). The computation of this dynamic spectral floor is proposed in (32), where we need to know an appropriate reference noise PSD. In order to determine a reference background noise out of all those different microphone signals that are to be mixed to one output signal, it has to be decided which speaker is the dominant one at a time instance. Corresponding dominance weights can be determined by evaluating the duration for which a speaker has been detected. While a speaker is active alone, his dominance increases until it reaches a maximum value and therewith full dominance. Then, the target noise level has to be controlled by this channel alone. If a different relevant speaker within the subset of microphone signals to be combined to one output instance becomes active, the dominances of all the other related channels decrease. In order to determine dominance weights, firstly, we define the channel-dependent dominance counters[24]

\begin{align} c_{m} (ℓ) = max \{min \{c_{m} (ℓ - 1) + Δ c_{m} (ℓ), c_{max}\}, c_{min}\}, \end{align}

(45)

where the limitation of the counters to a minimum c _min and a maximum value c _max, respectively, defines the range between the minimum and full dominance of a speaker. The parameter Δ c _m(ℓ) controls the increase or decrease of the counters and is dependent on the single-talk speaker activity detection ${\hat{STD}}_{m} (ℓ) \in \{0, 1\}$ introduced in (13). With the increasing and decreasing step sizes c _inc and c _dec, respectively, it follows

\begin{align} Δ c_{m} (ℓ) = \begin{cases} c_{inc}, & \text{if } {\hat{STD}}_{m} (ℓ) = 1, \\ - c_{dec, m}, & \text{else}. \end{cases} \end{align}

(46)

After speaking for a period t _inc, a speaker m should get full dominance. This determines the step size for increasing[24]

\begin{align} c_{inc} = \frac{c_{max} - c_{min}}{t_{inc}} \cdot T_{frame}, \end{align}

(47)

with the period T_frame between two consecutive time frames. The dominance counter of the previously active speaker has to reach c_min by the time the currently active speaker achieves full dominance, i.e., when the counter of the latter reaches the value c_max. Therefore, the decreasing constant has to be recomputed for each channel m every time a speaker in any other channel m′ (m ≠ m′) corresponding to the same output signal subset becomes active:

\begin{align} c_{dec, m} = \begin{cases} \frac{c_{m} (ℓ) - c_{min}}{c_{max} - c_{m^{'}} (ℓ) + ε} \cdot c_{inc}, & \text{if } {\hat{STD}}_{m^{'}} (ℓ) = 1 \land W_{m, m^{'}}^{ISC} = 0, \\ 0, & \text{else}, \end{cases} \end{align}

(48)

with the very small value ε. The matrix W ^ISC avoids a decrease of the dominance of the m th speaker if a speaker related to a different output signal other than the currently considered output signal becomes active. To characterize the dominance of a speaker, finally, the counters have to be mapped to the speaker dominance weights by normalization of each counter to the sum of all counters similar to[24]

\begin{align} g_{m}^{DW} (ℓ) = \frac{c_{m} (ℓ)}{\sum_{m^{'} = 1}^{M} |1 - W_{m, m^{'}}^{ISC}| \cdot c_{m^{'}} (ℓ)} . \end{align}

(49)
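The counter logic of (45) to (49) can be sketched as follows. Several points are assumptions made for illustration: indices are 0-based, the ISC control matrix is taken to contain 0 for channel pairs that belong to the same output (including the diagonal) and 1 otherwise, the decreasing step is recomputed in every frame from the counters of the previous frame rather than only at speaker onsets, and all parameter values are passed in because the preferred values are not reproduced here.

```python
def step_increase(c_min, c_max, t_inc, t_frame):
    """(47): step size so that full dominance is reached after t_inc."""
    return (c_max - c_min) / t_inc * t_frame

def update_counters(c, std, w_isc, c_inc, c_min, c_max, eps=1e-9):
    """(45), (46), (48): one frame update of all dominance counters.

    c     : counters of the previous frame, one per channel
    std   : single-talk activity flags in {0, 1}, one per channel
    w_isc : M x M control matrix (assumed structure, see lead-in)
    """
    n_ch = len(c)
    updated = []
    for m in range(n_ch):
        if std[m] == 1:
            delta = c_inc
        else:
            c_dec = 0.0
            for mp in range(n_ch):
                # Decrease only if a speaker of the same output is active
                if mp != m and std[mp] == 1 and w_isc[m][mp] == 0:
                    c_dec = (c[m] - c_min) / (c_max - c[mp] + eps) * c_inc
            delta = -c_dec
        updated.append(max(min(c[m] + delta, c_max), c_min))
    return updated

def dominance_weights(c, w_isc):
    """(49): counters normalized over the channels of the same output."""
    n_ch = len(c)
    weights = []
    for m in range(n_ch):
        denom = sum(abs(1 - w_isc[m][mp]) * c[mp] for mp in range(n_ch))
        weights.append(c[m] / denom)  # assumes c_min > 0, hence denom > 0
    return weights
```

With the step in (48), the counter of the previously dominant speaker and the counter of the newly active speaker reach c_min and c_max at approximately the same frame, so the dominance weights of one output always sum to one while changing smoothly.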

With the help of the dominance weights, an output signal-dependent reference noise PSD ${\hat{Φ}}_{\tilde{N} \tilde{N}, m}^{ref} (ℓ, k)$ used for the dynamic spectral floor computation in (32) can be determined. Note that for input channels corresponding to the same output instance, this reference noise PSD has to be identical. Applying the dominance weights and the control matrix W ^ISC for involving only the noise PSDs of the relevant channels, we then have

\begin{align} {\hat{Φ}}_{\tilde{N} \tilde{N}, m}^{ref} (ℓ, k) = \sum_{m^{'} = 1}^{M} |1 - W_{m, m^{'}}^{ISC}| \cdot g_{m^{'}}^{DW} (ℓ) \cdot {\hat{Φ}}_{\tilde{N} \tilde{N}, m^{'}} (ℓ, k), \end{align}

(50)

where ${\hat{Φ}}_{\tilde{N} \tilde{N}, m^{'}} (ℓ, k)$ is the noise PSD estimate as introduced in (21). Figure4 shows the dominance weights and the adjustment of the signal characteristics for a scenario, where four passengers in a car speak one after another and all signals are combined to one output signal instance Q=1. Due to a slightly opened window at the front right passenger, the background noise is higher there compared to the other channels. The noise and speech signal show smooth transitions at speaker changes compared to hard switching between the channels.
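For a single subband, the weighted sum in (50) can be illustrated as below. The structure assumed for the ISC control matrix (0 for channels merged into the same output, including the diagonal, and 1 otherwise) and the numerical values are illustrative assumptions.

```python
def reference_noise_psd(m, g_dw, phi_nn, w_isc):
    """(50): reference noise PSD for channel m in one subband.

    g_dw   : dominance weights, one per channel
    phi_nn : noise PSD estimates after preprocessing, one per channel
    w_isc  : M x M control matrix (assumed structure, see lead-in)
    """
    return sum(abs(1 - w_isc[m][mp]) * g_dw[mp] * phi_nn[mp]
               for mp in range(len(g_dw)))

# Assumed configuration: channels 0 and 1 form one output, 2 and 3 another
W_ISC = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [1, 1, 0, 0],
         [1, 1, 0, 0]]
g_dw = [0.75, 0.25, 0.5, 0.5]
phi_nn = [1.0, 3.0, 2.0, 4.0]
print([reference_noise_psd(m, g_dw, phi_nn, W_ISC) for m in range(4)])
```

In this example, channels 0 and 1 obtain the same reference value, as do channels 2 and 3, which reflects the requirement that the reference noise PSD be identical for all input channels of one output instance.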

Preferred parameters of the implementation of the reference value computation can be found in Table2.

Table 2 Preferred parameter values for the implementation of the reference noise power spectral density estimation


7 Evaluation

For evaluation purposes, a measurement database has been recorded in an Audi A6 with four distributed speaker-dedicated microphones. The driver and the front passenger each have a dedicated microphone located in the A-pillar. The microphones for the two backseat passengers are located in the ceiling in front of each seat. Speech and noise signals have been recorded separately and are combined to noisy signals afterwards. Based on this scenario, instrumental quality measures can be determined by evaluating the components before and after the processing. Clean speech signal components of eight speakers (four females and four males) speaking four different test utterances have been recorded for all four available seating positions in the car. To cause the Lombard effect, car noise with an average sound pressure level of around 65 dB(A) has been played back via headphones during the recording. Thus, the database includes 128 test sentences (eight speakers × four positions × four utterances). The noise signal components have been recorded for six different speeds (50, 80, 100, 130, 160, and 180 km/h) with all windows closed. Additionally, noise scenarios with a slightly opened front right window were recorded for the first five speeds. In order to obtain realistic noisy microphone signals, the signal components are combined according to ITU-T Recommendation P.56 [32] as presented by the authors in [27], with SNR ∈ {-5, 0, 5, 10, 15, 20} dB. The aim is to generate test signals in which four different speakers assigned to the four seats speak different utterances one after another at different noise scenarios and SNRs. To this end, one speaker is first chosen randomly for each seating position from the whole measurement database. In this way, an arrangement in the car with four speakers is simulated. For the current evaluation, four such arrangements are chosen randomly, with each speaker speaking four different utterances in the mentioned 11 noise conditions. Hence, we have 176 test signals for each position for six different SNRs.

A special analysis scenario has been picked from the whole dataset, where M=4 speakers are active one after another at 0 dB with background noise of the car driving at 80 km/h. Between the second and third speakers, a short overlapping speech period is present. The two front passengers’ signals (speakers 1 and 2) are mixed to one output instance, and the backseat passengers’ signals (speakers 3 and 4) to a second one determined by the control matrices defined in (3) and (4). Thus, we have Q=2 output signals. In Figure5, different spectrograms are visualized. Besides the spectra of the raw microphone signals, several versions of the processed spectra are shown. The AGC has not been considered during these processings. For the spectrograms related to the complete speech enhancement system (excluding AGC), it is obvious that the cross-talk components are robustly suppressed, while chosen particular signals are combined to the appropriate two output signals.

To evaluate the whole system more generally, instrumental quality measures can be computed. Since the realistic noisy time-domain signals are combined from separate signal components, the noisy signal can be processed by the proposed speech enhancement system while the influence on each single component is observed and evaluated afterwards. The system with M=4 has been configured with Q=2, and the noise reduction applies a maximum attenuation of β=-12 dB. Again, the AGC is not included in the processing for the evaluation. Besides the speech-to-speech distortion ratio (SSDR) [33], a second measure called direct-to-cross-talk ratio (DCR) is introduced for evaluation. It is common to evaluate such quality measures in segments. Following [34], where a typical segment length between 15 and 20 ms is recommended, we choose a length of N=320 samples, which corresponds to 20 ms at the underlying sampling frequency of f_s=16 kHz. In order to measure the speech distortion, the SSDR can be computed based on the clean reference time-domain speech signal component s_m(n) and the processed speech signal component ${\tilde{s}}_{m} (n)$. Note that the reference speech component in each m th channel is a combination of the direct component and the cross-talk components occurring in the other channels, dependent on the channel selection in (34), in order to avoid a negative influence of the exploitation of diversity effects. The SSDR in each frame λ can be written as [33]

\begin{align} {SSDR}_{m} (λ) = 10 \log_{10} \left[\frac{\sum_{ν = 0}^{N - 1} s_{m}^{2} (ν + λN)}{\sum_{ν = 0}^{N - 1} e_{m}^{2} (ν + λN)}\right], \end{align}

(51)

where the speech distortion e_m(n) is defined as the difference between the processed speech signal component ${\tilde{s}}_{m} (n)$ and the reference component s_m(n):

\begin{align} e_{m} (n) = {\tilde{s}}_{m} (n) - s_{m} (n) . \end{align}

(52)

It has to be ensured that the delay between ${\tilde{s}}_{m} (n)$ and s _m(n) is compensated. After limitation of SSDR_m(λ) to a maximum of SSDR_max=30 dB and a minimum of SSDR_min=-10 dB, the segmental SSDR is proposed to be computed by

\begin{align} {SSDR}_{seg, m} = \frac{1}{C (Λ_{m})} \sum_{λ \in Λ_{m}} {SSDR}_{m} (λ) . \end{align}

(53)

The term Λ _m represents a subset of all those frames showing fullband voice activity for speaker m and where SSDR_m(λ)>-10 dB. C(Λ _m) is the number of elements within this subset.
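The computation in (51) to (53) can be sketched as follows. The function name and the handling of degenerate frames (a silent reference frame is skipped, a distortion-free frame is set to the upper limit) are assumptions, and the list of voice-active frames is expected from outside, e.g., from a reference SAD mask. Delay compensation is assumed to have been applied beforehand.

```python
import math

def segmental_ssdr(s, s_proc, active_frames, n=320,
                   ssdr_min=-10.0, ssdr_max=30.0):
    """Segmental SSDR in dB, following (51)-(53).

    s, s_proc     : clean reference and processed speech component (aligned)
    active_frames : indices of frames with fullband voice activity
    n             : frame length in samples (320 = 20 ms at 16 kHz)
    """
    values = []
    for lam in active_frames:
        frame = range(lam * n, (lam + 1) * n)
        num = sum(s[i] ** 2 for i in frame)
        den = sum((s_proc[i] - s[i]) ** 2 for i in frame)  # (52)
        if num == 0.0:
            continue  # assumption: skip frames without reference energy
        if den == 0.0:
            ssdr = ssdr_max  # assumption: distortion-free frame hits the limit
        else:
            ssdr = 10.0 * math.log10(num / den)  # (51)
        ssdr = min(max(ssdr, ssdr_min), ssdr_max)
        if ssdr > ssdr_min:  # only frames above -10 dB belong to the subset
            values.append(ssdr)
    # (53): mean over the remaining frames, None if the subset is empty
    return sum(values) / len(values) if values else None
```

The segmental DCR in (54) and (55) follows the same pattern, with the processed direct component in the numerator, the processed cross-talk component in the denominator, and an upper limit of 60 dB.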

Regarding these subsets, similarly, a measure for the remaining cross-talk is computed. The segmental DCR is defined considering the processed direct signal ${\tilde{s}}_{m} (n)$ originating from the exclusively active source belonging to the m th channel and the related processed cross-talk components ${\tilde{b}}_{m^{'}, m} (n)$ occurring in the other distant channels m ^′ and originating from the same source:

\begin{align} {DCR}_{seg, m, m^{'}} = \frac{1}{C (Λ_{m}^{'})} \sum_{λ \in Λ_{m}^{'}} {DCR}_{m, m^{'}} (λ), \end{align}

(54)

with $Λ_{m}^{'}$ representing voice active frames where additionally ${DCR}_{m, m^{'}} (λ) > - 10$ dB. The DCR in each frame results in

\begin{align} {DCR}_{m, m^{'}} (λ) = 10 \log_{10} \left[\frac{\sum_{ν = 0}^{N - 1} {\tilde{s}}_{m}^{2} (ν + λN)}{\sum_{ν = 0}^{N - 1} {\tilde{b}}_{m^{'}, m}^{2} (ν + λN)}\right] . \end{align}

(55)

The value is limited to a maximum DCR_max=60 dB and a minimum DCR_min=-10 dB before applying (54). Due to the presence of cross-talk components in multiple distant channels, we consider the mean segmental DCR for the m th channel across all available cross-talk components:

\begin{align} {DCR}_{seg, m} = \frac{1}{C (Ξ_{m})} \sum_{m^{'} \in Ξ_{m}} {DCR}_{seg, m, m^{'}} . \end{align}

(56)

Here, the number of channels where the cross-talk components are evaluated is specified by C(Ξ _m), and Ξ _m is the subset of channel indices that are not related to the output signal the current m th channel is dedicated to. The mean values for these measures are determined across the whole test set for each SNR.

The results are depicted in Figure6 showing the mean across all positions. The SNR is represented by the markers increasing from the bottom to the top (-5, 0, 5, 10, 15, and 20 dB). The basic processing without any cross-talk suppression already shows a relatively high DCR due to the attenuation of the active speaker’s speech by the acoustic path. Based on this, the ISC performs a further cross-talk cancellation. The overall system with ISC and RCS attenuates the cross-talk components very well, indicated by higher DCR values, whereas the speech distortion remains nearly the same compared across the different processings. The higher the SNR, the lower is the speech distortion (higher SSDR). With exception of the Basic+ISC+RCS processing method, the variation for the DCR results across different SNRs is not as large due to masking effects and room acoustics. Including the RCS, it is obvious that a larger amount of cross-talk components can be suppressed at higher SNRs because it depends on the SAD in the active channel and is able to detect more speech active bins that are not masked by noise. Figure7 shows similar results for the different positions exemplarily evaluated for one speaker in the front (m=1) and one in the back (m=3). The results for the front position differ slightly from the ones for the backseat position especially regarding the segmental DCR. This is expected due to the room acoustics and the higher amount of cross-talk speech components in the front microphones caused by the backseat speakers.

Now, we evaluate the fullband SAD introduced in Section 6.1 and outlined further in Appendix 2 (enhanced fullband speaker activity detection based on multipath-induced fading patterns). Error rates are computed by comparing the binary SAD results after the processing with a reference fullband SAD mask. The reference assumes speech activity if the clean speech signal component level is larger than a certain threshold. This threshold is chosen 40 dB below the maximum level of the whole clean speech signal. The fullband reference mask SAD_ref,m(ℓ) for each channel m is set to 1 if at least 5% of all frequency subbands exceed this threshold; otherwise, it is zero. Error rates are computed for the basic SAD ${\tilde{SAD}}_{m} (ℓ)$ and the enhanced one ${\hat{SAD}}_{m} (ℓ)$, respectively. In case of the enhanced SAD, the fullband overall error in channel m for L signal time frames is computed by

\begin{align} E_{m} = \frac{1}{L} \sum_{ℓ = 1}^{L} |{SAD}_{ref, m} (ℓ) - {\hat{SAD}}_{m} (ℓ)|, \end{align}

(57)

and accordingly for ${\tilde{SAD}}_{m} (ℓ)$ . In Figure8, this overall error is depicted for the basic SAD (65) and the enhanced SAD (73) again for six different SNRs. The results are based on the mean SAD across all positions and conditions of the whole dataset. It is evident that the enhanced SAD yields a detection with a lower overall error. With higher SNRs, the overall error is decreasing, and the SAD seems to be more reliable.

Exemplarily formulated for the enhanced SAD, the false detections are covered by the false-positive rate, and the missed detections are measured by the false-negative rate:

\begin{align} r_{FP, m} & = \frac{\sum_{ℓ = 1}^{L} ({\hat{SAD}}_{m} (ℓ) \cdot (1 - {SAD}_{ref, m} (ℓ)))}{\sum_{ℓ = 1}^{L} (1 - {SAD}_{ref, m} (ℓ))} \quad \text{and} \\ r_{FN, m} & = \frac{\sum_{ℓ = 1}^{L} ((1 - {\hat{SAD}}_{m} (ℓ)) \cdot {SAD}_{ref, m} (ℓ))}{\sum_{ℓ = 1}^{L} {SAD}_{ref, m} (ℓ)} . \end{align}

(58)
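The error measures in (57) and (58) can be computed from two binary frame masks as sketched below. The function name and the convention of returning `None` for an undefined rate (no reference frames of the respective class) are assumptions.

```python
def sad_error_rates(sad_ref, sad_est):
    """Overall error (57) and false-positive/false-negative rates (58).

    sad_ref : reference fullband SAD mask, one value in {0, 1} per frame
    sad_est : SAD result under test, same length
    """
    pairs = list(zip(sad_ref, sad_est))
    overall = sum(abs(r - e) for r, e in pairs) / len(pairs)
    n_inactive = sum(1 - r for r in sad_ref)
    n_active = sum(sad_ref)
    false_pos = sum(e * (1 - r) for r, e in pairs)
    false_neg = sum((1 - e) * r for r, e in pairs)
    r_fp = false_pos / n_inactive if n_inactive else None
    r_fn = false_neg / n_active if n_active else None
    return overall, r_fp, r_fn

# Eight frames with one false detection and one missed detection
ref = [0, 0, 1, 1, 1, 0, 0, 0]
est = [0, 1, 1, 1, 0, 0, 0, 0]
print(sad_error_rates(ref, est))
```

Note that the two rates are normalized differently: the false-positive rate refers to the number of inactive reference frames, whereas the false-negative rate refers to the number of active ones.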

Figure9 shows the advantage of the enhanced SAD by lower false-positive rates. In contrast, the false-negative rates are slightly higher. However, to avoid, e.g., the adaptation of adaptive filters to wrong events, it seems to be more important to obtain a lower false-positive rate.

8 Conclusions

In this contribution, a dynamic multi-channel system for speech signal enhancement in an automotive environment with distributed speaker-dedicated microphones has been presented. The proposed system supports multiple speakers in a car. It can be freely configured to obtain different mixed output signals that can be passed to various speech signal applications. Selected signals can be combined to different output signals by dynamic signal combining, whereas cross-talk components of signals not of interest are cancelled in these output signals within an interfering speaker cancellation approach and proper postprocessing. Furthermore, stationary noise is reduced.

The ability of the system to combine various input signals to several output signals has been shown. Furthermore, the suppression of interfering speech in each output signal has been evaluated by the computation of instrumental quality measures indicating speech distortion as well as cross-talk cancellation capability. Different configurations of the system have been investigated, showing the advantage of the complete speech enhancement system comprising interfering speaker cancellation, cross-talk suppression, and dynamic signal combination.

To control the whole system, a robust speaker activity detection based on signal power ratios has been proposed. Within the evaluation, it can be shown that an enhancement of the introduced basic fullband approach yields further improvements regarding detection rates.

Instead of using only one microphone for each speaker, the proposed methods can also be applied to the output signals of multiple processed microphone subgroups. It may be advantageous to use a beamformer for each of the positions in the car to further improve the characteristics of the whole processing and to exploit the room acoustics furthermore by spatial filtering.

Appendices

Appendix 1: basic fullband speaker activity detection

The basic fullband SAD is based on the logarithmic quantity of the SPR estimate from (42), thus we write

\begin{align} {\hat{SPR}}_{m}^{'} (ℓ, k) = 10 {log}_{10} ({\hat{SPR}}_{m} (ℓ, k)) . \end{align}

(59)

In order to consider only SPR values during periods showing a certain SNR (43) with ${\hat{ξ}}_{m} (ℓ, k) \geq Θ_{SNR 1}$, a modified quantity is defined by

\begin{align} {\tilde{SPR}}_{m} (ℓ, k) = \begin{cases} {\hat{SPR}}_{m}^{'} (ℓ, k), & \text{if } {\hat{ξ}}_{m} (ℓ, k) \geq Θ_{SNR 1}, \\ 0, & \text{else}. \end{cases} \end{align}

(60)

To evaluate the SPR for each channel, the number of positive (+) and negative (-) values of ${\tilde{SPR}}_{m} (ℓ, k)$ is counted in each frame. The resulting positive counter is

\begin{align} c_{m}^{+} (ℓ) & = \sum_{k = 0}^{K / 2} c_{m}^{+} (ℓ, k), \quad \text{with} \\ c_{m}^{+} (ℓ, k) & = \begin{cases} 1, & \text{if } {\tilde{SPR}}_{m} (ℓ, k) \geq 0, \\ 0, & \text{else}. \end{cases} \end{align}

(61)

Equivalently, it can be written for the negative counter:

\begin{align} c_{m}^{-} (ℓ) & = \sum_{k = 0}^{K / 2} c_{m}^{-} (ℓ, k), \quad \text{with} \\ c_{m}^{-} (ℓ, k) & = \begin{cases} 1, & \text{if } {\tilde{SPR}}_{m} (ℓ, k) < 0, \\ 0, & \text{else}. \end{cases} \end{align}

(62)

Based on these quantities and with an SNR-dependent soft weighting function $G_{m}^{c} (ℓ)$ , a soft frame-based speaker activity detection measure can be formulated by

\begin{align} χ_{m}^{SAD} (ℓ) = G_{m}^{c} (ℓ) \cdot \frac{c_{m}^{+} (ℓ) - c_{m}^{-} (ℓ)}{c_{m}^{+} (ℓ) + c_{m}^{-} (ℓ)} . \end{align}

(63)

We compute the soft weighting function in (63) using subgroup SNRs as

\begin{align} G_{m}^{c} (ℓ) = \min \{{\hat{ξ}}_{max, m}^{G} (ℓ) / 10, 1\} . \end{align}

(64)

For the calculation of the subgroup SNRs and the maximum SNR, see (83) and (84) in Appendix 4: signal-to-noise ratio subgroups. Finally, the basic fullband SAD is obtained by thresholding

\begin{align} {\tilde{SAD}}_{m} (ℓ) = \begin{cases} 1, & \text{if } χ_{m}^{SAD} (ℓ) > Θ_{SAD 1}, \\ 0, & \text{else}. \end{cases} \end{align}

(65)
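The chain from (60) to (65) can be sketched for a single channel and a single frame as follows. This is only an illustrative sketch in plain Python: the function name, the argument names, and the numeric thresholds in the usage note are assumptions and are not taken from Table 3. The code follows the equations literally, so subbands masked to zero in (60) are counted by the positive counter in (61).

```python
import math

def basic_fullband_sad(spr, snr, snr_max_group, theta_snr1, theta_sad1):
    """Sketch of Eqs. (60)-(65) for one channel m and one frame.

    spr, snr       : linear SPR and SNR estimates per subband k = 0..K/2 (spr > 0)
    snr_max_group  : maximum subgroup SNR from Eq. (84)
    theta_snr1/sad1: thresholds (hypothetical values in the usage example)
    Returns the soft measure chi_SAD and the binary decision.
    """
    c_pos = 0
    c_neg = 0
    for spr_k, snr_k in zip(spr, snr):
        # Eq. (59)/(60): logarithmic SPR where the SNR is sufficient, 0 otherwise.
        spr_mod = 10.0 * math.log10(spr_k) if snr_k >= theta_snr1 else 0.0
        # Eqs. (61), (62): count non-negative and negative values.
        if spr_mod >= 0.0:
            c_pos += 1
        else:
            c_neg += 1
    weight = min(snr_max_group / 10.0, 1.0)            # Eq. (64)
    chi = weight * (c_pos - c_neg) / (c_pos + c_neg)   # Eq. (63)
    return chi, int(chi > theta_sad1)                  # Eq. (65)
```

For example, with three subbands favoring the channel and one against it, a maximum subgroup SNR of 5, and hypothetical thresholds `theta_snr1 = 1.0` and `theta_sad1 = 0.2`, the soft measure is 0.5 · (3 − 1) / 4 = 0.25 and the frame is marked as active.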

Double-talk is detected based on a measure that evaluates whether the positive counter $c_{m}^{+} (ℓ)$ exceeds a certain limit $Θ_{DTM}$ during fullband detected speech activity in several channels. This result is held in each channel for some frames in order to detect continuous regions of double-talk. If the measure is true for more than one channel, general double-talk $\hat{DTD} (ℓ) = 1$ is assumed. Preferred parameter settings for this section can be found in Table 3.

Table 3 Preferred parameter settings for the implementation of the basic fullband speaker activity detection method

Appendix 2: enhanced fullband speaker activity detection based on multipath-induced fading patterns

An overview of the enhanced fullband SAD (dark-gray block in Figure 3) is depicted in Figure 10, where the dark-shaded area includes the SAD decision as well as the power ratio pattern determination. The bright-shaded area comprises the update of the reference pattern set. The parameter settings used in the following are listed in Table 4.

Table 4 Preferred parameter settings for the implementation of the proposed enhanced fullband speaker activity detection method

Power ratio patterns

As pointed out in Section 6.1, we want to exploit the characteristics of the SPR over frequency. First, we define a measure that highlights the multipath-induced fading subbands. To this end, we aim at obtaining high values for the characteristic small power ratios and small values for inconspicuous and irrelevant high power ratios. We propose a mapping yielding the following quantity [29]:

\begin{align} χ_{m}^{PAT} (ℓ, k) = \max \{1 - γ_{PAT} \cdot {\hat{SPR}}_{m} (ℓ, k), Θ_{PAT 1}\} . \end{align}

(66)

Large power ratios are mapped to the lower bound $Θ_{PAT 1}$. The factor $γ_{PAT}$ scales the behavior of the mapping function. Using $γ_{PAT} < 1$ forces an underestimation of the power ratio ${\hat{SPR}}_{m} (ℓ, k)$. Hence, the limit for highlighting subbands as multipath-induced fading ones can be controlled. A strong underestimation is appropriate to highlight the subbands that are anomalously highly attenuated by the room acoustics in the considered channel $m$. Even positive but small power ratios are evaluated in this case. In order to obtain a smoothed spectrum indicating the position of the multipath-induced fading subbands, a linear prediction analysis is performed. The autocorrelation coefficients $φ_{p, m} (ℓ)$ are computed by the inverse discrete Fourier transform of the magnitude squares of the quantity $χ_{m}^{PAT} (ℓ, k)$. Thus, the Yule-Walker auto-regressive equations for solving the prediction problem with order $N_{p}$ and the filter coefficients $a_{i, m} (ℓ)$ are

\begin{align} φ_{p, m} (ℓ) = \sum_{i = 1}^{N_{p}} a_{i, m} (ℓ) \cdot φ_{p - i, m} (ℓ), p = 1, 2, \dots, N_{p} . \end{align}

(67)

After applying the Levinson-Durbin algorithm and using the frequency response of the filter coefficients $a_{i, m} (ℓ)$ represented by $A_{m} (ℓ, k)$, the logarithmic estimate of $χ_{m}^{PAT} (ℓ, k)$ is recovered by

\begin{align} {\hat{χ}}_{m}^{PAT} (ℓ, k) = 10 \log_{10} \left( \left| \frac{E_{m} (ℓ, k)}{1 - A_{m} (ℓ, k)} \right| \right), \end{align}

(68)
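The mapping in (66) and the solution of the Yule-Walker equations in (67) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: all function names are assumptions, the half spectrum is mirrored to a symmetric spectrum of length $K$ before the inverse transform, and the normalization in (68) is left out because the text does not specify $E_{m} (ℓ, k)$ in detail.

```python
import math

def pattern_mapping(spr, gamma_pat, theta_pat1):
    """Eq. (66): small power ratios give large values, bounded below by theta_pat1."""
    return [max(1.0 - gamma_pat * s, theta_pat1) for s in spr]

def autocorrelation(chi_half, order):
    """Autocorrelation via the inverse DFT of the squared pattern.

    chi_half holds the values for k = 0..K/2 (a list); it is mirrored to length K
    so that the inverse transform is real-valued (assumed symmetric extension).
    """
    full = chi_half + chi_half[-2:0:-1]
    K = len(full)
    return [
        sum(v * v * math.cos(2.0 * math.pi * k * p / K) for k, v in enumerate(full)) / K
        for p in range(order + 1)
    ]

def levinson_durbin(phi, order):
    """Solve phi[p] = sum_i a[i] * phi[p - i], p = 1..order (Eq. (67)).

    Returns the coefficients a_1..a_order and the final prediction error power.
    Assumes phi[0] > 0 and a non-degenerate autocorrelation sequence.
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    err = phi[0]
    for p in range(1, order + 1):
        acc = phi[p] - sum(a[i] * phi[p - i] for i in range(1, p))
        k = acc / err                # reflection coefficient
        new_a = a[:]
        new_a[p] = k
        for i in range(1, p):
            new_a[i] = a[i] - k * a[p - i]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err
```

As a sanity check, an autocorrelation sequence of the form 1, 0.5, 0.25 corresponds to a first-order process, so a second-order analysis returns the coefficients 0.5 and 0 with a remaining error power of 0.75.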

with the prediction error signal $E_{m} (ℓ, k)$ used for normalization. Based on these patterns, an enhanced speaker activity detection can be performed by comparing the currently observed pattern ${\hat{χ}}_{m}^{PAT} (ℓ, k)$ with a reference pattern set that is characteristic for the active speaker’s location. The reference pattern set consists of $N_{PAT}$ different patterns which shall represent the characteristics of the specific speaker positions including some variations. A Euclidean distance measure ${\tilde{J}}_{i, m} (ℓ, k)$ between each reference pattern ${\hat{χ}}_{i, m}^{ref} (k)$ with $i = 1, \dots, N_{PAT}$ and the currently estimated pattern is determined:

\begin{align} {\tilde{J}}_{i, m} (ℓ, k) = {({\hat{χ}}_{i, m}^{ref} (k) - {\hat{χ}}_{m}^{PAT} (ℓ, k))}^{2} . \end{align}

(69)

The mean value of this distance measure ${\tilde{J}}_{i, m} (ℓ, k)$ over the relevant subbands is a quantity for the detection of the activity of the $m$th speaker:

\begin{align} {\bar{J}}_{i, m} (ℓ) = \frac{1}{N_{i, m}} \sum_{k = 0}^{K / 2} {\tilde{J}}_{i, m} (ℓ, k) \cdot I_{i, m} (ℓ, k), \end{align}

(70)

with $N_{i, m} \in \{1, \dots, K / 2 + 1\}$ subbands to evaluate for each pattern. The function $I_{i, m} (ℓ, k)$ indicates the subbands where an evaluation of the patterns seems to be reasonable. For small values of $N_{i, m}$, the previous distance measure is used. To draw reliable conclusions from the distance measure during fullband detected single-talk speech periods, an SNR of $Θ_{SNR 2}$ has to be exceeded. Furthermore, only those subbands should be evaluated where multipath-induced fading, indicated by peaks, occurs either in ${\hat{χ}}_{i, m}^{ref} (k)$ or in ${\hat{χ}}_{m}^{PAT} (ℓ, k)$. For SAD, the best matching pattern ${\hat{χ}}_{j_{m}, m}^{ref} (k)$ is further analyzed with

\begin{align} j_{m} = \underset{i \in \{1, \dots, N_{PAT}\}}{\arg\min} \{{\bar{J}}_{i, m} (ℓ)\} . \end{align}

(71)

In order to detect speech regions rather than single speech-active frames, a minimum for ${\bar{J}}_{j_{m}, m} (ℓ)$ over $L_{PAT}$ past frames is determined during basic fullband SAD. This minimum is denoted by $J_{m} (ℓ)$. The resulting pattern-based SAD indicator function ${\hat{SAD}}_{m}^{PAT} (ℓ)$ is obtained by comparing $J_{m} (ℓ)$ with a threshold based on its tracked global minimum $Θ_{m, min} (ℓ)$ including an additional offset $Θ_{SAD 2}$:

\begin{align} {\hat{SAD}}_{m}^{PAT} (ℓ) = \begin{cases} 1, & \text{if } J_{m} (ℓ) < (Θ_{m, min} (ℓ) + Θ_{SAD 2}), \\ 0, & \text{else}. \end{cases} \end{align}

(72)

If $J_{m} (ℓ)$ is close to zero, ${\hat{SAD}}_{m}^{PAT} (ℓ) = 0$ should be forced because the reliability is questionable in this case. In combination with the basic fullband SAD in (65), the final enhanced fullband SAD is obtained:

\begin{align} {\hat{SAD}}_{m} (ℓ) = {\tilde{SAD}}_{m} (ℓ) \cdot {\hat{SAD}}_{m}^{PAT} (ℓ) . \end{align}

(73)
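The pattern comparison in (69) to (73) can be sketched as below. The sketch is written in plain Python with hypothetical function names; returning `None` for a pattern without any evaluated subband is an assumption that leaves it to the caller to reuse the previous distance, and the tracking of the minimum over past frames and of the global minimum is assumed to happen outside these functions.

```python
def mean_distance(ref, cur, indicator):
    """Eqs. (69), (70): mean squared distance over the indicated subbands.

    Returns None if no subband is marked for evaluation, so that the caller
    can fall back to the previous distance value (assumed handling).
    """
    n = sum(indicator)
    if n == 0:
        return None
    return sum(i * (r - c) ** 2 for r, c, i in zip(ref, cur, indicator)) / n

def best_match(refs, cur, indicators):
    """Eq. (71): index and distance of the best matching reference pattern."""
    dists = [mean_distance(r, cur, ind) for r, ind in zip(refs, indicators)]
    valid = [(d, i) for i, d in enumerate(dists) if d is not None]
    if not valid:
        return None, None
    d, j = min(valid)
    return j, d

def enhanced_sad(j_m, theta_min, theta_sad2, basic_sad):
    """Eqs. (72), (73): pattern-based decision combined with the basic SAD.

    j_m is the minimum of the best-match distance over the last L_PAT frames,
    theta_min its tracked global minimum; both are computed by the caller.
    """
    sad_pat = 1 if j_m < theta_min + theta_sad2 else 0
    return basic_sad * sad_pat
```

With two reference patterns `[0, 0, 0]` and `[1, 2, 3]` and the current pattern `[1, 2, 4]`, the mean distances are 7 and 1/3, so the second pattern is selected.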

Reference pattern set

Due to the room acoustics in a car, the occurring patterns may change over time if the speaker moves slightly. We propose to update the reference pattern set ${\hat{χ}}_{m}^{ref} (k)$ during the processing within a first-in, first-out system of length $N_{PAT}$ by including new patterns ${\hat{χ}}_{i, m}^{ref} (k)$, as was proposed similarly by the authors in [29]. A pattern is included in the reference pattern set only if speaker activity is very likely. Besides the basic SAD, a fullband coherence measure is used for accepting new reference patterns in order to further reduce misdetections. The magnitude squared coherence (MSC) can be computed between two channels $m$ and $m^{'}$ with the cross PSD ${\hat{Φ}}_{YY, m, m^{'}} (ℓ, k)$ and the two auto PSDs ${\hat{Φ}}_{YY, m} (ℓ, k)$ and ${\hat{Φ}}_{YY, m^{'}} (ℓ, k)$ [35]. With the appropriate SNR threshold $Θ_{SNR 3} = 2$, it follows for a modified coherence

\begin{align} {\tilde{Γ}}_{m, m^{'}} (ℓ, k) = \begin{cases} \frac{{|{\hat{Φ}}_{YY, m, m^{'}} (ℓ, k)|}^{2}}{{\hat{Φ}}_{YY, m} (ℓ, k) \cdot {\hat{Φ}}_{YY, m^{'}} (ℓ, k)}, & \text{if } {\hat{ξ}}_{m} (ℓ, k) > Θ_{SNR 3}, \\ 0, & \text{else}. \end{cases} \end{align}

(74)

To obtain a channel-independent fullband coherence quantity $\tilde{Γ} (ℓ)$ , we determine the mean MSC measure over all subbands and search for the maximum of these quantities over all channel combinations:

\begin{align} \tilde{Γ} (ℓ) = \max_{\substack{m, m^{'} \in \{1, \dots, M\} \\ m \neq m^{'}}} \left\{ \frac{1}{K / 2 + 1} \sum_{k = 0}^{K / 2} {\tilde{Γ}}_{m, m^{'}} (ℓ, k) \right\} . \end{align}

(75)
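A possible way to evaluate (74) and (75) for one frame is sketched below in plain Python. The data layout (a dictionary of cross PSDs keyed by the channel pair, lists of auto PSDs and SNRs per channel) and the function name are assumptions chosen for illustration; the sketch also assumes strictly positive auto PSDs.

```python
def fullband_coherence(cross_psd, auto_psd, snr, theta_snr3):
    """Sketch of Eqs. (74), (75) for one frame.

    cross_psd : dict mapping (m, n) with m < n to a list of complex cross PSDs
    auto_psd  : auto_psd[m][k], real and positive
    snr       : snr[m][k], frequency-selective SNR estimate of channel m
    Returns the maximum over all ordered channel pairs of the mean modified MSC.
    """
    num_ch = len(auto_psd)
    best = 0.0
    for m in range(num_ch):
        for n in range(num_ch):
            if m == n:
                continue
            # |Phi_{m,n}| equals |Phi_{n,m}|, so one stored direction is enough.
            cross = cross_psd[(min(m, n), max(m, n))]
            msc = [
                abs(c) ** 2 / (pa * pb) if s > theta_snr3 else 0.0
                for c, pa, pb, s in zip(cross, auto_psd[m], auto_psd[n], snr[m])
            ]
            best = max(best, sum(msc) / len(msc))
    return best
```

Because the SNR condition in (74) refers to channel $m$ only, the pairs $(m, m^{'})$ and $(m^{'}, m)$ can yield different means, which is why both orderings are visited.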

Furthermore, because only the characteristic subbands should occur as peaks in the reference pattern set, we propose a modified measure for highlighting the multipath-induced fading subbands. New patterns are included if three fullband conditions are fulfilled: the basic fullband SAD in (65) with a stricter threshold $Θ_{SAD 1} = 0.5$ has to indicate speech, double-talk must not occur, and the coherence measure $\tilde{Γ} (ℓ)$ has to exceed a certain threshold $Θ_{COH}$. Instead of simply including the currently appearing spectrum from (68) into the reference pattern set, the calculation from (66) is modified to

\begin{align} χ_{i, m}^{ref} (ℓ, k) = \begin{cases} χ_{m}^{PAT} (ℓ, k), & \text{if } I_{i, m}^{ref} (ℓ, k) > 0, \\ Θ_{PAT 2}, & \text{else}, \end{cases} \end{align}

(76)

for obtaining the reference patterns. The reference indicator function $I_{i, m}^{ref} (ℓ, k)$ includes a new characteristic frequency subband into the current reference pattern if the frequency-selective SNR quantity ${\hat{ξ}}_{m} (ℓ, k)$ is larger than a threshold $Θ_{SNR 3}$. Furthermore, the currently occurring pattern has to show a peak at these frequencies and thereby exceed a threshold of $Θ_{PAT 3}$ dB there. Otherwise, the constant $Θ_{PAT 2}$ is set. Based on this modified quantity, the linear prediction is performed, and the reference pattern set can be updated by the new entry ${\hat{χ}}_{i, m}^{ref} (k)$.
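The update of the reference pattern set can be sketched with a bounded queue. The following plain-Python sketch uses `collections.deque` with `maxlen` as one possible realization of the first-in, first-out behavior; the function names are hypothetical, the indicator is passed in precomputed, and the linear prediction smoothing of the candidate is omitted for brevity.

```python
from collections import deque

def reference_candidate(chi_pat, indicator, theta_pat2):
    """Eq. (76): keep the mapped pattern in indicated subbands, a constant elsewhere."""
    return [c if i > 0 else theta_pat2 for c, i in zip(chi_pat, indicator)]

def maybe_update_reference_set(ref_set, candidate, sad_strict, double_talk,
                               coherence, theta_coh):
    """Append the candidate only if all three fullband conditions hold.

    ref_set is a deque created with maxlen = N_PAT, so the oldest pattern is
    dropped automatically once the set is full.
    """
    if sad_strict and not double_talk and coherence > theta_coh:
        ref_set.append(candidate)
        return True
    return False
```

A set created as `deque(maxlen=2)` that receives three accepted patterns ends up holding only the two most recent ones.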

Appendix 3: frequency-selective speaker activity detection

An overview of the frequency-selective SAD (light-gray block in Figure 3) is shown in Figure 11, where the first block describes the SPR model and its adaptation, and the second block shows the model-based SAD, as already presented similarly by the authors in [31]. It is supposed that the SPR in the $m$th channel can be represented by the random variable $Y$, where one realization can take the value ${\hat{SPR}}_{m}^{'} (ℓ, k) = 10 \log_{10} ({\hat{SPR}}_{m} (ℓ, k))$. We assume that this SPR in the $m$th channel is normally distributed in each subband with $(Y | H_{1, m}) \sim N (μ_{m}, σ_{m}^{2})$ during voice activity of the $m$th speaker indicated by the hypothesis $H_{1, m}$. Hence, the conditional probability density function of $Y$ may be modeled by a single Gaussian distribution [36] with mean $μ_{m} (ℓ, k)$ and variance $σ_{m}^{2} (ℓ, k)$:

\begin{align} p_{Y | H_{1, m}} ({\hat{SPR}}_{m}^{'} (ℓ, k)) & = \frac{1}{\sqrt{2 π} \, σ_{m} (ℓ, k)} \\ & \quad \cdot \exp \left\{ - \frac{{({\hat{SPR}}_{m}^{'} (ℓ, k) - μ_{m} (ℓ, k))}^{2}}{2 σ_{m}^{2} (ℓ, k)} \right\} . \end{align}

(77)

For modeling this distribution in one channel, the mean value and the variance have to be estimated during single-talk periods of the related speaker, where an SNR of at least $Θ_{SNR 4}$ has to be exceeded. Otherwise, the previous result from the last frame is used. The mean value $μ_{m} (ℓ, k)$ can be estimated by smoothing the SPR over time with the constant $γ_{μ}$ [31]:

\begin{align} {\hat{μ}}_{m} (ℓ, k) = γ_{μ} \cdot {\hat{μ}}_{m} (ℓ - 1, k) + (1 - γ_{μ}) \cdot {\hat{SPR}}_{m}^{'} (ℓ, k) . \end{align}

(78)

Simultaneously, an estimate for the variance $σ_{m}^{2} (ℓ, k)$ can be calculated with the smoothing constant $γ_{σ}$:

\begin{align} {\hat{σ}}_{m}^{2} (ℓ, k) & = γ_{σ} \cdot {\hat{σ}}_{m}^{2} (ℓ - 1, k) \\ & \quad + (1 - γ_{σ}) \cdot {({\hat{SPR}}_{m}^{'} (ℓ, k) - {\hat{μ}}_{m} (ℓ, k))}^{2} . \end{align}

(79)
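The recursive estimators in (78) and (79) amount to first-order smoothing of the logarithmic SPR and of its squared deviation. A minimal sketch for one subband is given below; the function name and the `update` flag, which stands for the single-talk and SNR conditions described above, are assumptions for illustration.

```python
def update_spr_model(mu_prev, var_prev, spr_db, gamma_mu, gamma_sigma, update=True):
    """Eqs. (78), (79) for one channel and one subband.

    spr_db is the logarithmic SPR of the current frame. If update is False
    (no single-talk of this speaker or insufficient SNR), the previous
    estimates are kept unchanged.
    """
    if not update:
        return mu_prev, var_prev
    mu = gamma_mu * mu_prev + (1.0 - gamma_mu) * spr_db
    # The variance update uses the already updated mean, as in Eq. (79).
    var = gamma_sigma * var_prev + (1.0 - gamma_sigma) * (spr_db - mu) ** 2
    return mu, var
```

For instance, starting from a mean of 0 and a variance of 1, an observed value of 10 dB with both smoothing constants set to 0.5 (a deliberately small, hypothetical value) gives a new mean of 5 and a new variance of 0.5 · 1 + 0.5 · 25 = 13.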

Hence, the SAD may be determined based on the model parameters without considering the sign of the SPR value itself. The decision whether speech is detected for an observed ${\hat{SPR}}_{m}^{'} (ℓ, k)$ is made based on the model in (77) in combination with the estimated parameters. For a positive decision, the probability density function has to exceed a certain threshold $Θ_{p}$, and fullband speaker activity and no double-talk have to be detected. Thus, for the frequency-selective SAD, it follows

\begin{align} {\hat{SAD}}_{m}^{'} (ℓ, k) = \begin{cases} 1, & \text{if } p_{Y | H_{1, m}} ({\hat{SPR}}_{m}^{'} (ℓ, k)) > Θ_{p} \land {\hat{SAD}}_{m} (ℓ) = 1 \land \hat{DTD} (ℓ) = 0, \\ δ_{m, m_{pmax}}, & \text{if } \hat{DTD} (ℓ) = 1, \\ 0, & \text{else}. \end{cases} \end{align}

(80)

During double-talk, the channel related to the maximum resulting modified SPR is determined by the Kronecker delta. For the second index, we have

\begin{align} m_{pmax} = \underset{m \in \{1, \dots, M\}}{\arg\max} \{{\tilde{SPR}}_{m} (ℓ, k)\} . \end{align}

(81)

The final frequency-selective SAD results after comparing the SNR estimate with the limit $Θ_{SNR 4}$:

\begin{align} {\hat{SAD}}_{m} (ℓ, k) = \begin{cases} {\hat{SAD}}_{m}^{'} (ℓ, k), & \text{if } {\hat{ξ}}_{m} (ℓ, k) \geq Θ_{SNR 4}, \\ 0, & \text{else}. \end{cases} \end{align}

(82)
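The decision logic of (77) and (80) to (82) for a single subband can be sketched as follows. The sketch is in plain Python; the function names and the boolean `is_max_channel`, which stands for the Kronecker delta $δ_{m, m_{pmax}}$ evaluated by the caller via (81), are assumptions, and the thresholds in the usage note are hypothetical.

```python
import math

def gaussian_pdf(x, mu, var):
    """Eq. (77): Gaussian density with mean mu and variance var (var > 0)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def frequency_selective_sad(spr_db, mu, var, snr, sad_fullband, double_talk,
                            is_max_channel, theta_p, theta_snr4):
    """Eqs. (80), (82) for one channel m and one subband k.

    sad_fullband   : enhanced fullband SAD of channel m (0 or 1)
    double_talk    : fullband double-talk flag (0 or 1)
    is_max_channel : True if m has the largest modified SPR in this subband
    """
    if double_talk:
        pre = 1 if is_max_channel else 0           # Kronecker delta branch
    elif sad_fullband and gaussian_pdf(spr_db, mu, var) > theta_p:
        pre = 1
    else:
        pre = 0
    return pre if snr >= theta_snr4 else 0         # Eq. (82)
```

With a unit-variance model centered at 0 dB and a hypothetical threshold `theta_p = 0.1`, an observation at the mean (density of about 0.399) is accepted, whereas an observation 10 dB away is rejected.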

During activity of more than one speaker, the different speakers can still be distinguished in a frequency-selective manner due to the assumption of the sparseness of speech across the subbands. Preferred parameter settings can be found in Table 5.

Table 5 Preferred parameter settings for the implementation of the frequency-selective model-based speaker activity detection method

Appendix 4: signal-to-noise ratio subgroups

For further processing (assuming a DFT length of $K = 512$ and a sampling frequency of $f_{s} = 16$ kHz), grouped SNR values can be computed for $K^{'} = 10$ different frequency subgroups, each covering the DFT bins $k_{æ}, \dots, k_{æ + 1} - 1$, with $æ = 1, 2, \dots, K^{'}$ and $\{k_{æ}\} = \{4, 28, 53, 78, 103, 128, 153, 178, 203, 228, 253\}$. The mean SNR of the $æ$th subgroup follows as

\begin{align} {\hat{ξ}}_{m}^{G} (ℓ, æ) = \frac{1}{k_{æ + 1} - k_{æ}} \sum_{k = k_{æ}}^{k_{æ + 1} - 1} {\hat{ξ}}_{m} (ℓ, k + 1) . \end{align}

(83)

Then, the maximum SNR across the SNRs of the frequency subgroups is given by

\begin{align} {\hat{ξ}}_{max, m}^{G} (ℓ) = \max_{æ \in \{1, \dots, K^{'}\}} \{{\hat{ξ}}_{m}^{G} (ℓ, æ)\} . \end{align}

(84)
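The subgroup averaging in (83), the maximum in (84), and the resulting soft weight from (64) can be sketched as follows. The sketch uses plain Python with hypothetical function names; the bin edges are the values listed above, and the index offset `k + 1` is taken literally from (83).

```python
GROUP_EDGES = [4, 28, 53, 78, 103, 128, 153, 178, 203, 228, 253]

def subgroup_snrs(snr, edges=GROUP_EDGES):
    """Eq. (83): mean SNR per subgroup; snr[k] holds the SNR of DFT bin k."""
    groups = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Bins lo..hi-1 are averaged, read with the offset k + 1 as in Eq. (83).
        vals = [snr[k + 1] for k in range(lo, hi)]
        groups.append(sum(vals) / (hi - lo))
    return groups

def soft_weight(snr, edges=GROUP_EDGES):
    """Eqs. (84), (64): maximum subgroup SNR mapped to a weight in [0, 1]."""
    return min(max(subgroup_snrs(snr, edges)) / 10.0, 1.0)
```

For $K = 512$, the input list has 257 entries (bins 0 to $K/2$). A constant linear SNR of 2 in all bins yields a weight of 0.2, whereas any subgroup with a mean SNR of 10 or more saturates the weight at 1.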

Abbreviations

AGC: Automatic gain control
DCR: Direct-to-cross-talk ratio
DFT: Discrete Fourier transform
DSC: Dynamic signal combination
ENR: Extended noise reduction
ISC: Interfering speaker cancellation
MSC: Magnitude squared coherence
PSD: Power spectral density
RCS: Residual cross-talk suppression
SAD: Speaker activity detection
SE: Signal enhancement
SNR: Signal-to-noise ratio
SPR: Signal power ratio
SSDR: Speech-to-speech distortion ratio

References

[1] Brandstein M, Ward D (eds.): Microphone Arrays: Signal Processing Techniques and Applications. Berlin: Springer; 2001.
[2] Van Veen BD, Buckley KM: Beamforming: A versatile approach to spatial filtering. IEEE ASSP Mag 1988, 5(2):4-24.
[3] Freudenberger J, Stenzel S, Venditti B: Microphone diversity combining for in-car applications. EURASIP J. Adv. Signal Process 2010, 2010: 1-13.
[4] Gerkmann T, Martin R: Soft decision combining for dual channel noise reduction. In Proceedings of the International Conference on Spoken Language Processing (INTERSPEECH). Pittsburgh, Pennsylvania, USA; 17–21 Sept 2006:2134-2137.
[5] Banno H, Shinde T, Takeda K, Itakura F: In-car speech recognition using distributed microphones: adapting to automatically detected driving conditions. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Hong Kong, China; 6–10 April 2003:I-324–I-327.
[6] Li W, Takeda K, Itakura F: Optimizing regression for in-car speech recognition using multiple distributed microphones. In Proceedings of the International Conference on Spoken Language Processing (ICSLP). Jeju, Korea; 4–8 Oct 2004:2689-2692.
[7] Shimizu Y, Kajita S, Takeda K, Itakura F: Speech recognition based on space diversity using distributed multi-microphone. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Istanbul, Turkey; 5–9 June 2000:III-1747–III-1750.
[8] Hummes F, Qi J, Fingscheidt T: Robust Acoustic Speaker Localization with Distributed Microphones. In Proceedings of the European Signal Processing Conference (EUSIPCO). Barcelona, Spain; 29 Aug–2 Sept 2011:240-244.
[9] Widrow B, Glover JR, McCool JM, Kaunitz J, Williams CS, Hearn RH, Zeidler JR, Dong E, Goodlin RC: Adaptive noise cancelling: principles and applications. Proc. IEEE 1975, 63(12):1692-1716.
[10] Hirano A, Nakayama K, Arai S, Deguchi M: A low-distortion noise canceller and its learning algorithm in presence of crosstalk. IEICE Trans. Fundamentals Electron. Commun. Comput. Sci 2001, E84-A(2):414-421.
[11] Lombard A, Kellermann W: Multichannel cross-talk cancellation in a call-center scenario using frequency-domain adaptive filtering. In Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC). Seattle, Washington, USA; 14–17 Sept 2008.
[12] Robledo-Arnuncio E, Juang BH: Blind source separation of acoustic mixtures with distributed microphones. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Honolulu, Hawaii, USA; 15–20 April 2007:III-949–III-952.
[13] Dmochowski JP, Liu Z, Chou PA: Blind source separation in a distributed microphone meeting environment for improved teleconferencing. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Las Vegas, Nevada, USA; 30 March–4 April 2008:88-92.
[14] Aichner R, Zourub M, Buchner H, Kellermann W: Residual cross-talk and noise suppression for convolutive blind source separation. In Proceedings of the Deutsche Jahrestagung für Akustik (DAGA). Braunschweig, Germany; 20–23 March 2006:41-42.
[15] Han S, Cui J, Li P: Post-processing for frequency-domain blind source separation in hearing aids. In Proceedings of the International Conference on Information, Communications and Signal Processing (ICICS). Macau, China; 8–10 Dec 2009:356-360.
[16] Jeub M, Herglotz C, Nelke CM, Beaugeant C, Vary P: Noise reduction for dual-microphone mobile phones exploiting power level differences. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Kyoto, Japan; 25–30 March 2012:1693-1696.
[17] Sondhi MM, Morgan DR, Hall JL: Stereophonic acoustic echo cancellation–an overview of the fundamental problem. IEEE Signal Process. Lett 1995, 2(8):148-151.
[18] Buchner H: Acoustic echo cancellation for multiple reproduction channels: from first principles to real-time solutions. In Proceedings of the ITG-Fachtagung Sprachkommunikation. Aachen, Germany; 8–10 Oct 2008:1-4.
[19] Bourgeois J, Minker W: Time-Domain Beamforming and Blind Source Separation. Heidelberg: Springer; 2009.
[20] Sugiyama A: Low-distortion noise cancellers: Revival of a classical technique. In Speech and Audio Processing in Adverse Environments. Edited by: Hänsler E, Schmidt G. Berlin: Springer; 2008:229-264.
[21] Haykin S: Adaptive Filter Theory. Upper Saddle River: Prentice Hall; 2002.
[22] Matheja T, Buck M, Wolff T: Robust adaptive cancellation of interfering speakers for distributed microphone systems in cars. In Proceedings of the Deutsche Jahrestagung für Akustik (DAGA). Berlin, Germany; 15–18 March 2010:255-256.
[23] Hänsler E, Schmidt G: Acoustic Echo and Noise Control: A Practical Approach. Hoboken: Wiley; 2004.
[24] Matheja T, Buck M, Eichentopf A: Dynamic signal combining for distributed microphone systems in car environments. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Prague, Czech Republic; 22–27 May 2011:5092-5095.
[25] Linhard K, Haulick T: Noise subtraction with parametric recursive gain curves. In Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH). Budapest, Hungary; 5–9 Sept 1999:2611-2614.
[26] Cohen I: Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. IEEE Trans. Speech Audio Process 2003, 11(5):466-475. 10.1109/TSA.2003.811544
[27] Matheja T, Buck M, Fingscheidt T: A multi-channel quality assessment setup applied to a distributed microphone speech enhancement system with spectral boosting. In Proceedings of the ITG-Fachtagung Sprachkommunikation. Braunschweig, Germany; 26–28 Sept 2012:119-122.
[28] Matheja T, Buck M, Fingscheidt T: Speaker activity detection for distributed microphone systems in cars. In Proceedings of the 6th Biennial Workshop on Digital Signal Processing for In-Vehicle Systems. Seoul, Korea; 29 Sept–2 Oct 2013.
[29] Matheja T, Buck M, Wolff T: Enhanced speaker activity detection for distributed microphones by exploitation of signal power ratio patterns. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Kyoto, Japan; 25–30 March 2012:2501-2504.
[30] Martin R: An efficient algorithm to estimate the instantaneous SNR of speech signals. In Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH). Berlin, Germany; 22–25 Sept 1993:1093-1096.
[31] Matheja T, Buck M: Robust voice activity detection for distributed microphones by modeling of power ratios. In Proceedings of the ITG-Fachtagung Sprachkommunikation. Bochum, Germany; 6–8 Oct 2010.
[32] International Telecommunication Union: ITU-T Recommendation P56, Objective Measurement of Active Speech Level. Geneva: International Telecommunication Union; 1993.
[33] Fingscheidt T, Suhadi S: Quality assessment of speech enhancement systems by separation of enhanced speech, noise, and echo. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH). Antwerp, Belgium; 27–31 Aug 2007:818-821.
[34] Loizou PC: Speech Enhancement: Theory and Practice. Boca Raton: CRC; 2013.
[35] Carter GC: Coherence and time delay estimation. Proc. IEEE 1987, 75(2):236-255.
[36] Hänsler E: Statistische Signale - Grundlagen und Anwendungen. Berlin: Springer; 2001.

Author information

Authors and Affiliations

Nuance Communications Deutschland GmbH, Acoustic Speech Enhancement Research, Ulm, D-89077, Germany
Timo Matheja & Markus Buck
Technische Universität Braunschweig, Institute for Communications Technology, Braunschweig, D-38106, Germany
Tim Fingscheidt

Corresponding author

Correspondence to Timo Matheja.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Matheja, T., Buck, M. & Fingscheidt, T. A dynamic multi-channel speech enhancement system for distributed microphones in a car environment. EURASIP J. Adv. Signal Process. 2013, 191 (2013). https://doi.org/10.1186/1687-6180-2013-191

Received: 11 July 2013
Accepted: 11 December 2013
Published: 27 December 2013
DOI: https://doi.org/10.1186/1687-6180-2013-191
