Spatial location priors for Gaussian model based reverberant audio source separation

Duong, Ngoc Q K; Vincent, Emmanuel; Gribonval, Rémi

doi:10.1186/1687-6180-2013-149

Research
Open access
Published: 23 September 2013

EURASIP Journal on Advances in Signal Processing volume 2013, Article number: 149 (2013)

Abstract

We consider the Gaussian framework for reverberant audio source separation, where the sources are modeled in the time-frequency domain by their short-term power spectra and their spatial covariance matrices. We propose two alternative probabilistic priors over the spatial covariance matrices which are consistent with the theory of statistical room acoustics and we derive expectation-maximization algorithms for maximum a posteriori (MAP) estimation. We argue that these algorithms provide a statistically principled solution to the permutation problem and to the risk of overfitting resulting from conventional maximum likelihood (ML) estimation. We show experimentally that in a semi-informed scenario where the source positions and certain room characteristics are known, the MAP algorithms outperform their ML counterparts. This opens the way to rigorous statistical treatment of this family of models in other scenarios in the future.

1 Introduction

We consider the task of reverberant audio source separation, that is, to extract individual sound sources from a multichannel microphone array recording. Many approaches have been proposed in the literature, which typically operate in the time-frequency domain via the short-time Fourier transform (STFT) [1–3]. One category of approaches models the mixture STFT coefficients as the product of the source STFT coefficients and complex-valued mixing vectors, which are estimated by frequency-domain independent component analysis (FDICA) [4, 5] or by clustering [6, 7]. In under-determined conditions when the number of sources is greater than the number of channels, the source STFT coefficients are then obtained via binary masking [6], soft masking [7], or ℓ ₁-norm minimization [8]. Lately, a Gaussian framework has emerged where the mixture STFT coefficients are modeled as a function of the power spectra and the spatial covariance matrices of the sources, and separation is achieved by multichannel Wiener filtering [9–11]. These covariance matrices may equivalently be expressed as the outer product of subsource mixing matrices, which reduce to mixing vectors when the spatial covariance matrices have rank 1 [12]. Full-rank matrices have been shown to improve separation performance in reverberant conditions by modeling not only the spatial position of the sources but also their spatial width [11].

While a number of deterministic [12–14] and probabilistic [15–17] priors have been proposed over the source spectra, the mixing vectors and the source spatial covariance matrices are usually estimated in an unconstrained manner. The lack of a constraint relating these quantities across frequency causes a permutation problem, which has been addressed by reordering the estimates in each frequency bin while keeping their value [7, 18]. More crucially, the estimated values of the mixing vectors and the source spatial covariance matrices in a given frequency bin are likely to suffer from overfitting when the corresponding sources are barely active in that bin.

Building upon the studies for instantaneous mixtures in [19, 20] and the deterministic subspace constraints in [21, 22], a few algorithms have been designed that exploit soft penalties or probabilistic priors over the mixing vectors for increased estimation accuracy. These algorithms typically target semi-informed scenarios such as formal meetings or in-car speech where the spatial locations of the sources are known and they rely on the assumption that the mixing vectors are close to the steering vectors representing the direct path from the sources to the microphones. Squared Euclidean penalties over the blocking vectors are a common choice for FDICA [21, 23]. An inverse-Wishart prior over the outer product of the mixing vectors was also employed in [24]. These penalties and priors were not designed according to the actual statistics of reverberation. Moreover, to the best of our knowledge, no such priors have been designed for full-rank matrices.

In this article, we propose two probabilistic priors over the source spatial covariance matrices or the subsource mixing matrices which are consistent with the theory of statistical room acoustics. One of them was briefly introduced in our preliminary paper [25]. We extend the two Gaussian expectation-maximization (EM) algorithms in [12, 26] so as to perform maximum a posteriori (MAP) estimation. We then compare the resulting separation performance with conventional maximum likelihood (ML) estimation and with two baseline approaches in an under-determined full-rank semi-informed scenario where the source positions and certain room characteristics are known. For clarity, we do not assume any other constraint on the model parameters, which allows us to assess the improvement resulting from these priors alone.

The structure of the article is as follows. In Section 2, we recall the Gaussian framework for audio source separation and we present a result of the theory of statistical room acoustics. We introduce an EM algorithm using an inverse-Wishart prior in Section 3 and an EM algorithm using a Gaussian prior in Section 4. We evaluate their separation performance in Section 5 and we conclude in Section 6.

2 Gaussian modeling and statistical room acoustics

2.1 Gaussian modeling for source separation

Let us consider a mixture signal x(t)=[x ₁(t),…,x _I(t)]^T recorded by an array of I microphones. Denoting by J the number of sources, the mixing process is expressed as [27]

x (t) = \sum_{j = 1}^{J} c_{j} (t)

(1)

where c _j(t)=[c _1j(t),…,c _Ij(t)]^T is the spatial image of the j th source, which is its contribution to the signals recorded at the microphones. The STFT coefficients c _j(n,f) of the source spatial images in each time frame n and each frequency bin f are modeled as zero-mean Gaussian random vectors

c_{j} (n, f) \sim N (0, v_{j} (n, f) R_{j} (f))

(2)

where v _j(n,f) are scalar nonnegative variances encoding the short-term power spectra of the sources and R _j(f) are I×I spatial covariance matrices encoding their spatial position and their spatial width [9, 11].

Under the assumption that the sources are uncorrelated, the mixture covariance matrix Σ _x(n,f) is equal to

Σ_{x} (n, f) = \sum_{j = 1}^{J} v_{j} (n, f) R_{j} (f) .

(3)

The log-likelihood is then given by [26]

log ℒ = \sum_{n, f} - tr (Σ_{x}^{- 1} (n, f) {\hat{R}}_{x} (n, f)) - log | π Σ_{x} (n, f) |

(4)

where tr(.) and |.| denote the trace and the determinant of a square matrix, and ${\hat{R}}_{x} (n, f)$ is the empirical mixture covariance matrix obtained by local averaging of x(n,f)x ^H(n,f) over the neighborhood of each time-frequency bin

{\hat{R}}_{x} (n, f) = \sum_{n^{'}, f^{'}} w_{nf}^{2} (n^{'}, f^{'}) x (n^{'}, f^{'}) x^{H} (n^{'}, f^{'})

(5)

where w _nf is a bi-dimensional window specifying the shape of the neighborhood [26].
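The local averaging in (5) can be illustrated with a minimal sketch in plain Python. The data layout (nested lists indexed as `X[n][f]`), the dictionary of window offsets, and the function name `empirical_covariance` are assumptions made for this illustration only. Neighbors falling outside the STFT grid are simply skipped, which is one possible boundary convention among others.

```python
def empirical_covariance(X, n, f, window):
    """Local average of x x^H around bin (n, f), following Eq. (5).

    X:      X[n][f] is the length-I list of mixture STFT coefficients in bin (n, f)
    window: dict mapping offsets (dn, df) to weights w_nf (hypothetical layout)
    """
    I = len(X[0][0])
    R = [[0j] * I for _ in range(I)]
    for (dn, df), w in window.items():
        nn, ff = n + dn, f + df
        # Skip neighbors outside the STFT grid (assumed boundary handling).
        if 0 <= nn < len(X) and 0 <= ff < len(X[0]):
            x = X[nn][ff]
            for i in range(I):
                for k in range(I):
                    R[i][k] += w ** 2 * x[i] * x[k].conjugate()
    return R


# Toy STFT with 2 frames, 2 frequency bins, and I = 2 channels.
X = [
    [[1 + 1j, 0.5 - 0.2j], [0.3 + 0j, -0.4 + 0.6j]],
    [[-0.2 + 0.1j, 0.9 + 0j], [0.7 - 0.7j, 0.1 + 0.2j]],
]
# 3x3 neighborhood with equal weights (the weights themselves are an assumption).
window_3x3 = {(dn, df): 1.0 / 3.0 for dn in (-1, 0, 1) for df in (-1, 0, 1)}
R_hat = empirical_covariance(X, 0, 0, window_3x3)
```

By construction the result is Hermitian with nonnegative real diagonal entries, which is what the log-likelihood (4) requires of the empirical covariance.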

Source separation can then be achieved by estimating the model parameters θ={v _j(n,f),R _j(f)} in the ML sense and by deriving the spatial images of all sources in the minimum mean square error sense via multichannel Wiener filtering of the mixture STFT coefficients x(n,f)

{\hat{c}}_{j} (n, f) = v_{j} (n, f) R_{j} (f) Σ_{x}^{- 1} (n, f) x (n, f) .

(6)
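As an illustration of (3) and (6), the following minimal sketch applies the multichannel Wiener filter in a single time-frequency bin for I=2 channels, using only the Python standard library. The helper names and the toy values of v_j, R_j and x are hypothetical; this is not the Matlab implementation referred to in Section 5.

```python
# Sketch of Eqs. (3) and (6) for I = 2 channels.
# Matrices are 2x2 nested lists of complex numbers; all names are hypothetical.

def mat_add(a, b):
    return [[a[i][k] + b[i][k] for k in range(2)] for i in range(2)]

def mat_scale(s, a):
    return [[s * a[i][k] for k in range(2)] for i in range(2)]

def mat_mul(a, b):
    return [[sum(a[i][m] * b[m][k] for m in range(2)) for k in range(2)]
            for i in range(2)]

def mat_inv(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

def mat_vec(a, x):
    return [a[i][0] * x[0] + a[i][1] * x[1] for i in range(2)]

def wiener_separate(v, R, x):
    """Estimate the spatial images c_j(n, f) of all sources in one bin.

    v: list of J nonnegative variances v_j(n, f)
    R: list of J 2x2 spatial covariance matrices R_j(f)
    x: mixture STFT coefficients x(n, f) as a length-2 list
    """
    sigma_x = [[0j, 0j], [0j, 0j]]
    for vj, Rj in zip(v, R):                      # Eq. (3)
        sigma_x = mat_add(sigma_x, mat_scale(vj, Rj))
    sigma_x_inv = mat_inv(sigma_x)
    images = []
    for vj, Rj in zip(v, R):                      # Eq. (6)
        W = mat_mul(mat_scale(vj, Rj), sigma_x_inv)
        images.append(mat_vec(W, x))
    return images


# Toy under-determined example: three sources, two channels.
v = [1.0, 0.5, 2.0]
R = [
    [[1.0 + 0j, 0.3 + 0.2j], [0.3 - 0.2j, 1.0 + 0j]],
    [[1.0 + 0j, -0.4 + 0.1j], [-0.4 - 0.1j, 0.8 + 0j]],
    [[0.9 + 0j, 0.1 - 0.5j], [0.1 + 0.5j, 1.2 + 0j]],
]
x = [0.7 - 0.2j, -0.1 + 0.4j]
c_hat = wiener_separate(v, R, x)
```

Since the filters in (6) sum to the identity matrix by (3), the estimated spatial images add up to the mixture x(n,f), which provides a simple sanity check.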

2.2 A result from the theory of statistical room acoustics

In a scenario such as in [21–23], the distance and the orientation of the sources and the microphones with respect to each other (aka, the scene geometry) are assumed to be known but their absolute position in the room is unknown. According to the theory of statistical room acoustics [28, 29], the mean spatial covariance matrix of a source over all possible source and microphone positions and orientations, under the constraint that the scene geometry remains fixed, can be expressed as

μ_{R_{j}} (f) = d_{j} (f) d_{j}^{H} (f) + σ_{rev}^{2} Ω (f)

(7)

where $\cdot^{H}$ denotes conjugate transposition. The first term of this expression models the contribution of direct sound, where

d_{j} (f) = (\begin{array}{lc} \frac{1}{\sqrt{4 π} r_{1 j}} e^{- 2 iπf \frac{r_{1 j}}{c}} \\ ⋮ \\ \frac{1}{\sqrt{4 π} r_{Ij}} e^{- 2 iπf \frac{r_{Ij}}{c}} \end{array})

(8)

is the steering vector representing the direct paths from the source to the microphones, with c the sound velocity and r _ij the distance from the j th source to the i th microphone. The second term of this expression models the contribution of echoes and reverberation, which are assumed to come from all possible directions on average over all absolute positions: $σ_{rev}^{2}$ is the power of echoes and reverberation and Ω(f) is the covariance matrix of a diffuse sound field.

The entries $Ω_{i i^{'}} (f)$ of Ω(f) depend on the microphone directivity patterns and on the distance $d_{i i^{'}}$ between the i th and the i ^′th microphone. For omnidirectional microphones, this quantity can be shown to be real-valued and equal to [28],

Ω_{i i^{'}} (f) = \frac{sin (2 π {fd}_{i i^{'}} / c)}{2 π {fd}_{i i^{'}} / c} .

(9)

Moreover, the power of the reverberant part within a parallelepipedic room with dimensions L _x, L _y, L _z is given by

σ_{rev}^{2} = \frac{4 β^{2}}{A (1 - β^{2})}

(10)

where A is the total wall area and β the wall reflection coefficient computed from the room reverberation time T ₆₀ via Eyring’s formula [29],

β = exp \{- \frac{13.82}{(\frac{1}{L_{x}} + \frac{1}{L_{y}} + \frac{1}{L_{z}}) {cT}_{60}}\} .

(11)
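The quantities in (7) to (11) are straightforward to evaluate numerically. The sketch below does so for a microphone pair with the Python standard library. The sound velocity of 343 m/s, the function names, and the example distances are assumptions of this illustration. The total wall area is taken as the surface area of the parallelepiped, 2(L_x L_y + L_x L_z + L_y L_z).

```python
import cmath
import math

C_SOUND = 343.0  # assumed sound velocity in m/s (not specified in the text)

def reflection_coefficient(Lx, Ly, Lz, t60, c=C_SOUND):
    """Wall reflection coefficient beta from Eq. (11)."""
    return math.exp(-13.82 / ((1 / Lx + 1 / Ly + 1 / Lz) * c * t60))

def reverberant_power(Lx, Ly, Lz, t60, c=C_SOUND):
    """Power of echoes and reverberation sigma_rev^2 from Eq. (10)."""
    beta = reflection_coefficient(Lx, Ly, Lz, t60, c)
    area = 2 * (Lx * Ly + Lx * Lz + Ly * Lz)  # total wall area of the box
    return 4 * beta ** 2 / (area * (1 - beta ** 2))

def diffuse_coherence(f, dist, c=C_SOUND):
    """Entry of Omega(f) for omnidirectional microphones, Eq. (9)."""
    arg = 2 * math.pi * f * dist / c
    return 1.0 if arg == 0 else math.sin(arg) / arg

def steering_vector(f, r, c=C_SOUND):
    """Steering vector d_j(f), Eq. (8); r lists the source-to-microphone distances."""
    return [cmath.exp(-2j * math.pi * f * ri / c) / (math.sqrt(4 * math.pi) * ri)
            for ri in r]

def mean_spatial_covariance(f, r, mic_dist, room, t60, c=C_SOUND):
    """Mean spatial covariance mu_R(f), Eq. (7), for a microphone pair."""
    d = steering_vector(f, r, c)
    s2 = reverberant_power(room[0], room[1], room[2], t60, c)
    omega = [[diffuse_coherence(f, 0.0 if i == k else mic_dist, c)
              for k in range(2)] for i in range(2)]
    return [[d[i] * d[k].conjugate() + s2 * omega[i][k] for k in range(2)]
            for i in range(2)]


# Example with hypothetical distances: 5 cm spacing, source about 50 cm away.
room = (4.45, 3.55, 2.5)
mu_R = mean_spatial_covariance(1000.0, [0.49, 0.51], 0.05, room, 0.25)
```

The resulting matrix is Hermitian, and the reverberant power grows with T ₆₀ because β approaches 1 as the reverberation time increases.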

In order to match the physics of reverberation, a prior over the source spatial covariance matrices or over the subsource mixing matrices should lead to a mean spatial covariance matrix $μ_{R_{j}} (f)$ satisfying the constraint (7). This is not the case of the prior in [24], whose mean is equal to $d_{j} (f) d_{j}^{H} (f) + ∊ I_{I}$ with I _I the identity matrix of size I and ε a small constant. Isotropic Gaussian priors over the subsource mixing matrices would not satisfy this constraint either due to the interchannel correlation introduced by Ω(f). Fixed spatial covariance matrices set to the value in (7) were employed for single source localization in [29] and for source separation in [30]. Later work confirmed that the model (7) is valid on average over all absolute positions in the room but that R _j(f) varies with the absolute position so that it must be estimated from the observed mixture signal [11].

3 Source image-based EM algorithms

3.1 General EM algorithm

Assuming that the spatial covariance matrices R _j(f) are full-rank, ML estimation can be achieved using the source image-based EM (SIEM) algorithm in [26] where the spatial images {c _j(n,f)}_n,f of all sources in all time-frequency bins are considered as hidden data. Strictly speaking, this algorithm is a generalized form of EM [31] because the M step increases but does not maximize the expectation of the log-likelihood of the hidden data. Since the priors proposed hereafter pertain to the spatial covariance matrices only, MAP estimation can be achieved via the same algorithm except for the corresponding update in the M step.

The resulting EM updates are listed in Algorithm 1. In the E step, the Wiener filter W _j(n,f) and the second-order raw moment ${\hat{R}}_{c_{j}} (n, f)$ of the spatial images of all sources are computed^a. In the M step v _j(n,f) and R _j(f) are updated. In the ML case, the update for R _j(f) in (17) is given by [26]

R_{j} (f) = \frac{1}{N} \sum_{n = 1}^{N} \frac{{\hat{R}}_{c_{j}} (n, f)}{v_{j} (n, f)}

(12)

where N is the total number of time frames.

Algorithm 1 SIEM algorithm [26]

Given this algorithm, we now consider the design of suitable priors over R _j(f). In addition to the physical constraint (7), the priors must satisfy practical engineering constraints: they must be defined over the space of Hermitian positive definite matrices, have a small number of parameters, have a closed-form mean and result in closed-form EM updates. The inverse-Wishart and the Wishart distributions satisfy these constraints. In this paper we present only the inverse-Wishart prior since we observed experimentally that the Wishart prior results in poorer separation performance compared to both the ML algorithm and the MAP algorithm using the inverse-Wishart prior.

3.2 Inverse-Wishart prior

The inverse-Wishart distribution is the conjugate prior for the likelihood (4) of our model. This prior is defined as

R_{j} (f) \sim I W (Ψ_{j} (f), m)

(13)

where

I W (R | Ψ, m) = \frac{| Ψ |^{m} | R |^{- (m + I)} e^{- tr (Ψ R^{- 1})}}{π^{I (I - 1) / 2} \prod_{i = 1}^{I} Γ (m - i + 1)}

(14)

is the inverse-Wishart density over Hermitian positive definite matrices R with positive definite inverse scale matrix Ψ, m degrees of freedom, and mean Ψ/(m−I) [32], with Γ the gamma function. This density, its mean, and its variance are finite for m>I−1, m>I, and m>I+1, respectively. We fix the inverse scale matrix Ψ _j(f) as

Ψ_{j} (f) = (m - I) μ_{R_{j}} (f)

(15)

so that the mean of R _j(f) is consistent with (7). The deviation allowed from the mean is controlled by the so-called number of degrees of freedom m, which is not necessarily an integer.

3.3 Learning the hyper-parameter

In order to obtain the best fit between this prior and the actual prior distribution of spatial covariance matrices, we learn the number of degrees of freedom m from training data. We assume that m depends on the distance and the orientation of the microphones with respect to each other (aka, the array geometry) and on the distance from the source to the center of the array, but not on the source direction of arrival. Given the microphone array geometry and the source distance, we generate training signals c _p(t) indexed by p for a number of microphone array positions and orientations and for a number of source directions of arrival by convolving the corresponding room impulse responses with a single-channel signal. We derive the spatial covariance matrix R _p(f) associated with each training signal in an oracle fashion [30] by alternately applying (??) and (12) to the empirical covariance matrices ${\hat{R}}_{c_{p}} (n, f)$ computed as in (5). Such training data can be generated in any practical scenario where the source separation system is to be deployed in fixed known environments, where the impulse responses can be pre-recorded or simulated via the image method[33].

Since R _p(f) is measured only up to an arbitrary nonnegative scaling factor α _p(f), we jointly estimate the number of degrees of freedom m and the scaling factors in the ML sense by maximizing

\begin{align} ℒ_{I W} & = \prod_{p, f} p (R_{p} (f) | α_{p} (f), Ψ_{p} (f), m) \\ & = \prod_{p, f} J_{α_{p} (f)} I W (α_{p} (f) R_{p} (f) | Ψ_{p} (f), m) \end{align}

(16)

where $J_{α_{p} (f)} = α_{p}^{I^{2}} (f)$ is the Jacobian of the scaling transform and Ψ _p(f) is the inverse scale matrix in (15) which depends on p. Maximization with respect to m can be achieved using a nonlinear optimization technique [34], where the optimal scaling factors for a given m are given by

α_{p} (f) = \frac{tr (Ψ_{p} (f) R_{p}^{- 1} (f))}{Im} .

(17)
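The closed form (17) can be checked from the density (14) and the objective (16). The short derivation below is a sketch for a single pair (p,f), dropping the indices and all terms that do not depend on α; it uses |αR| = α^I |R| for an I×I matrix.

```latex
\begin{align}
\log\left[ J_{\alpha}\, \mathcal{IW}(\alpha R \mid \Psi, m) \right]
  &= I^{2} \log \alpha - (m + I) \log |\alpha R|
     - \operatorname{tr}\left( \Psi (\alpha R)^{-1} \right) + \text{const} \\
  &= - I m \log \alpha - \frac{1}{\alpha} \operatorname{tr}\left( \Psi R^{-1} \right)
     + \text{const}', \\
\frac{\partial}{\partial \alpha} = 0 :\quad
  & - \frac{I m}{\alpha} + \frac{1}{\alpha^{2}} \operatorname{tr}\left( \Psi R^{-1} \right) = 0
  \;\Longrightarrow\;
  \alpha = \frac{\operatorname{tr}\left( \Psi R^{-1} \right)}{I m} .
\end{align}
```

The objective tends to minus infinity both as α tends to 0 and as α tends to infinity, so this unique stationary point is the maximum.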

The values of m learned for the geometrical setting and the reverberation times tested in Section 5 are shown in Table 1.

Table 1 Learned values of the prior hyper-parameters

3.4 MAP EM update

Given the hyper-parameters Ψ _j(f) and m, the spatial covariance matrices R _j(f) can be estimated in the MAP sense in step (17) of Algorithm 1 by maximizing the expectation of the log-posterior of the hidden data

\begin{array}{l} Q_{I W} = γ \sum_{j, f} log I W (R_{j} (f) | Ψ_{j} (f), m) + \sum_{j, n, f} \\ - tr (Σ_{c_{j}}^{- 1} (n, f) {\hat{R}}_{c_{j}} (n, f)) - log | π Σ_{c_{j}} (n, f) | \end{array}

(18)

where γ is a trade-off hyper-parameter determining the strength of the prior. Strictly speaking, MAP estimation corresponds to γ=1. However, as in other fields of signal processing [35], a larger strength parameter is needed in practice in order to balance the absolute values of the prior and the likelihood, and this generalized rule is loosely referred to as MAP. By computing the partial derivatives of $Q_{I W}$ with respect to each entry of R _j(f) and equating them to zero, we obtain the MAP update

R_{j} (f) = \frac{1}{γ (m + I) + N} (γ Ψ_{j} (f) + \sum_{n = 1}^{N} \frac{{\hat{R}}_{c_{j}} (n, f)}{v_{j} (n, f)}) .

(19)

When γ=0, the contribution of the prior is excluded and (19) becomes equal to the ML update in (12). The setting of γ will be discussed in Section 5.3.
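The relation between the ML update (12) and the MAP update (19) can be illustrated with a small sketch in plain Python. The function names `ml_update` and `map_update`, the nested-list data layout, and the toy values are assumptions of this example. The inverse scale matrix is set from the prior mean as in (15).

```python
def ml_update(R_hat, v):
    """ML update of R_j(f), Eq. (12).

    R_hat: list over frames n of I x I matrices R_hat_cj(n, f)
    v:     list over frames n of variances v_j(n, f)
    """
    N = len(v)
    I = len(R_hat[0])
    return [[sum(R_hat[n][i][k] / v[n] for n in range(N)) / N
             for k in range(I)] for i in range(I)]

def map_update(R_hat, v, mu_R, m, gamma):
    """MAP update of R_j(f), Eq. (19), with Psi_j(f) = (m - I) mu_R as in Eq. (15)."""
    N = len(v)
    I = len(mu_R)
    denom = gamma * (m + I) + N
    return [[(gamma * (m - I) * mu_R[i][k]
              + sum(R_hat[n][i][k] / v[n] for n in range(N))) / denom
             for k in range(I)] for i in range(I)]


# Toy data: N = 2 frames, I = 2 channels.
R_hat = [
    [[2.0 + 0j, 0.5 + 0.5j], [0.5 - 0.5j, 1.0 + 0j]],
    [[4.0 + 0j, -1.0 + 0j], [-1.0 + 0j, 2.0 + 0j]],
]
v = [1.0, 2.0]
mu_R = [[1.0 + 0j, 0.2 + 0j], [0.2 + 0j, 1.0 + 0j]]

R_ml = ml_update(R_hat, v)
R_map = map_update(R_hat, v, mu_R, m=5.0, gamma=10.0)
```

Setting `gamma=0.0` reproduces the ML estimate, whereas a very large `gamma` pulls the estimate towards Ψ _j(f)/(m+I), so that γ interpolates between the data-driven estimate and the prior.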

4 Subsource-based EM algorithm

4.1 General EM algorithm

Besides the SIEM algorithm, an alternative subsource-based EM (SSEM) algorithm was proposed for ML estimation in [12] that applies to spatial covariance matrices of any rank R _j. This algorithm relies on the non-unique representation of the source spatial images as c _j(n,f)=H _j(f)s _j(n,f), where the entries s _jr(n,f), r=1,…,R _j, of s _j(n,f) are uncorrelated complex-valued subsource coefficients distributed as $s_{jr} (n, f) \sim N (0, v_{j} (n, f))$ and H _j(f) is an I×R _j complex-valued subsource mixing matrix satisfying the constraint [12]

R_{j} (f) = H_{j} (f) H_{j}^{H} (f) .

(20)

This subsource mixing matrix reduces to a mixing vector in the particular case when R _j(f) has rank 1. Overall, the mixture STFT coefficients are written as

x (n, f) = H (f) s (n, f) + b (n, f)

(21)

where $s (n, f) = {[s_{11} (n, f), \dots, s_{1 R_{1}} (n, f), \dots, s_{J R_{J}} (n, f)]}^{T}$ is an R×1 vector of subsource coefficients with $R = \sum_{j = 1}^{J} R_{j}$ , H(f)=[H ₁(f),…,H _J(f)] is an I×R mixing matrix and b(n,f) is a small Gaussian noise term with covariance matrix $Σ_{b} (n, f) = σ_{b}^{2} (f) I_{I}$ required by the EM algorithm. The log-likelihood (4) can then be maximized by considering the set {x(n,f),s _j(n,f)}_j,n of observed mixture STFT coefficients and hidden subsource STFT coefficients in all time-frequency bins as complete data. Once again, it turns out that MAP estimation can be achieved via the same algorithm except for the mixing matrix update in the M step.

The details of one iteration are summarized in Algorithm 2, where $\mathcal{R}_{j}$ denotes the set of subsource indices associated with the j th source and ${\tilde{v}}_{r} (n, f) = v_{j} (n, f)$ if and only if $r \in \mathcal{R}_{j}$ . In the E step, the Wiener filter W _j(n,f) and the second-order cross-moments ${\hat{R}}_{s} (n, f)$ and ${\hat{R}}_{xs} (n, f)$ are computed. In the M step, v _j(n,f) and H(f) are updated. In the ML case, the update for H(f) in (34) is given by [12]

H (f) = (\sum_{n = 1}^{N} {\hat{R}}_{xs} (n, f)) {(\sum_{n = 1}^{N} {\hat{R}}_{s} (n, f))}^{- 1} .

(22)

Algorithm 2 SSEM algorithm [12]

4.2 Gaussian prior

The design of a suitable prior over H(f) is subject to the same practical engineering constraints as above, which leads us to propose a Gaussian prior. We model each column h _jr(f), r=1,…,R _j, of H _j(f) as a complex-valued Gaussian random vector

h_{jr} (f) \sim N (μ_{h_{jr}} (f), Σ_{h_{jr}} (f))

(23)

with mean $μ_{h_{jr}} (f)$ and covariance $Σ_{h_{jr}} (f)$ . Following the assumption in Section 2.2, echoes and reverberation cancel out on average over all orientations in the room so that they appear only in the covariance, while only the part corresponding to direct sound appears in the mean. Without loss of generality, let us select H _j(f) such that direct sound is concentrated in the first subsource of each source, i.e., the first subsource includes direct sound, echoes, and reverberation, while the other subsources include echoes and reverberation only^b. The mean and the covariance of the prior can then be expressed as

μ_{h_{jr}} (f) = \begin{cases} d_{j} (f) & \text{if } r = 1 \\ 0 & \text{otherwise} \end{cases}

(24)

Σ_{h_{jr}} (f) = σ_{r}^{2} Ω (f)

(25)

where the echo and reverberation power of all subsources sums up to the total power in (10):

\sum_{r = 1}^{R_{j}} σ_{r}^{2} = σ_{rev}^{2} .

(26)

Contrary to the inverse-Wishart prior whose variance is governed by a single hyper-parameter m, this prior involves R _j−1 free hyper-parameters $σ_{r}^{2}$ , r=2,…,R _j, which makes it potentially more flexible as soon as I≥R _j>2. The priors are distinct, however, in the sense that the Gaussian prior does not generalize the inverse-Wishart prior whatever the choice of the hyper-parameters.

4.3 Learning the hyper-parameters

In order to fit the actual distribution of subsource mixing matrices, we learn these free hyper-parameters from training data. The training data consist of the spatial covariance matrices R _p(f) computed in Section 3.3 for different positions p, from which we derive the corresponding subsource mixing matrices H _p(f) by singular value decomposition $R_{p} (f) = H_{p} (f) H_{p}^{H} (f)$ such that the columns of H _p(f) are orthogonal and sorted by decreasing norm.

These columns h _pr(f) are observed only up to an arbitrary scale common to all r and an arbitrary phase rotation specific to each r. Phase rotations do not affect the learned variances $σ_{r}^{2}$ for r>1, since the corresponding means $μ_{h_{jr}} (f)$ are zero. Multiplying H _p(f) by a global complex-valued factor α _p(f) is hence sufficient to address this indeterminacy. Denoting by

{\underset{̲}{h}}_{p} (f) = (\begin{array}{l} h_{p 1} (f) \\ ⋮ \\ h_{{pR}_{j}} (f) \end{array})

(27)

the IR _j×1 vectorization of H _p(f) with mean

μ_{{\underset{̲}{h}}_{p}} (f) = (\begin{array}{l} μ_{h_{p 1}} (f) \\ ⋮ \\ μ_{h_{{pR}_{j}}} (f) \end{array})

(28)

and covariance

Σ_{{\underset{̲}{h}}_{p}} (f) = (\begin{array}{ccc} Σ_{h_{p 1}} (f) & & 0 \\ & ⋱ & \\ 0 & & Σ_{h_{{pR}_{j}}} (f) \end{array}),

(29)

the hyper-parameters and the multiplication factors are jointly estimated in the ML sense by maximizing

\begin{align} ℒ_{G} & = \prod_{p, f} p ({\underset{̲}{h}}_{p} (f) | α_{p} (f), μ_{{\underset{̲}{h}}_{p}} (f), Σ_{{\underset{̲}{h}}_{p}} (f)) \\ & = \prod_{p, f} J_{α_{p} (f)} N (α_{p} (f) {\underset{̲}{h}}_{p} (f) | μ_{{\underset{̲}{h}}_{p}} (f), Σ_{{\underset{̲}{h}}_{p}} (f)) \end{align}

(30)

where $J_{α_{p} (f)} = | α_{p} (f) |^{2 I^{2}}$ is the Jacobian of the multiplication. Maximization is achieved using a nonlinear optimization technique, where the optimal multiplication factors as a function of the hyper-parameters are found as

α_{p} (f) = \frac{- | b | - {(| b |^{2} - 4 ac)}^{1 / 2}}{2 a} \frac{b}{| b |}

(31)

where

\begin{align} a & = - {\underset{̲}{h}}_{p}^{H} (f) Σ_{{\underset{̲}{h}}_{p}}^{- 1} (f) {\underset{̲}{h}}_{p} (f) \\ b & = {\underset{̲}{h}}_{p}^{H} (f) Σ_{{\underset{̲}{h}}_{p}}^{- 1} (f) μ_{{\underset{̲}{h}}_{p}} (f) \\ c & = I^{2} . \end{align}

(32)
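The closed form (31) with the coefficients (32) amounts to aligning the phase of α _p(f) with that of b and solving a quadratic equation for its modulus. The sketch below evaluates it with the Python standard library. The function names, the use of the inverse prior covariance `prec` as an input, and the toy vectors are assumptions of this illustration.

```python
import math

def quad_form(u, P, w):
    """Return u^H P w for vectors u, w and a square matrix P (nested lists)."""
    n = len(u)
    return sum(u[i].conjugate() * P[i][k] * w[k]
               for i in range(n) for k in range(n))

def optimal_factor(h, mu, prec, I):
    """Closed-form alpha_p(f) of Eq. (31), with a, b, c as in Eq. (32).

    h, mu: vectorized mixing matrix and its prior mean (lists of complex)
    prec:  inverse of the prior covariance (Hermitian positive definite)
    """
    a = -quad_form(h, prec, h).real            # a < 0
    b = quad_form(h, prec, mu)
    c = I ** 2
    mod_b = abs(b)
    # Positive root of a*rho^2 + |b|*rho + c = 0 (a < 0 and c > 0).
    rho = (-mod_b - math.sqrt(mod_b ** 2 - 4 * a * c)) / (2 * a)
    phase = b / mod_b if mod_b > 0 else 1.0    # phase is arbitrary when b = 0
    return rho * phase

def log_objective(alpha, h, mu, prec, I):
    """Log of one factor of Eq. (30), up to a constant independent of alpha."""
    e = [alpha * hi - mi for hi, mi in zip(h, mu)]
    return 2 * I ** 2 * math.log(abs(alpha)) - quad_form(e, prec, e).real


# Toy example for I = R_j = 2, i.e., vectors of length 4.
h = [1 + 0.5j, 0.2 - 0.3j, -0.1 + 0.1j, 0.05 + 0.2j]
mu = [0.5 + 0.1j, 0.4 - 0.2j, 0j, 0j]
prec = [[2.0 if i == k else 0.0 for k in range(4)] for i in range(4)]
prec[2][2] = prec[3][3] = 5.0
alpha = optimal_factor(h, mu, prec, I=2)
```

Evaluating `log_objective` at `alpha` and at nearby values gives a quick numerical check that the closed form is indeed a maximizer.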

The values of $σ_{1}^{2}$ and $σ_{2}^{2}$ learned in the setting of Section 5 (R _j=I=2) are displayed in Table 1.

4.4 MAP EM update

Similarly to (27), let us denote by $\underset{̲}{h} (f)$ the vectorization of H(f) as an I R×1 column vector. The prior distribution (23) translates into

\underset{̲}{h} (f) \sim N (μ_{\underset{̲}{h}} (f), Σ_{\underset{̲}{h}} (f)),

(33)

where $μ_{\underset{̲}{h}} (f)$ is the I R×1 vector obtained by concatenating $μ_{h_{jr}} (f)$ for all j, r; and $Σ_{\underset{̲}{h}} (f)$ is the I R×I R block-diagonal matrix whose diagonal blocks are equal to $Σ_{h_{jr}} (f)$ for all j, r.

The MAP update for H(f) is derived by maximizing the expectation of the log-posterior of the complete data, which is equal, up to a constant, to (see Equation 13 in [12] for the expression of the expectation of the log-likelihood)

\begin{array}{l} Q_{G} = γ log N (\underset{̲}{h} (f) | μ_{\underset{̲}{h}} (f), Σ_{\underset{̲}{h}} (f)) \\ + \sum_{n, f} - \frac{1}{σ_{b}^{2} (f)} tr [{\hat{R}}_{x} (n, f) - H (f) {\hat{R}}_{xs}^{H} (n, f) \\ - {\hat{R}}_{xs} (n, f) H^{H} (f) + H (f) {\hat{R}}_{s} (n, f) H^{H} (f)] \end{array}

(34)

where γ is a trade-off hyper-parameter determining the strength of the prior. By rewriting the matrix quadratic form in the log-likelihood term of (34) as a vector quadratic form in terms of $\underset{̲}{h} (f)$ and by computing the gradient of $Q_{G}$ and equating it to zero, we obtain

\begin{array}{l} \underset{̲}{h} (f) = {(γ Σ_{\underset{̲}{h}}^{- 1} (f) + \frac{1}{σ_{b}^{2} (f)} \sum_{n = 1}^{N} {({\hat{R}}_{s} (n, f) \otimes I_{I})}^{T})}^{- 1} \\ \times (γ Σ_{\underset{̲}{h}}^{- 1} (f) μ_{\underset{̲}{h}} (f) + \frac{1}{σ_{b}^{2} (f)} \sum_{n = 1}^{N} vec ({\hat{R}}_{xs} (n, f))) \end{array}

(35)

where $\cdot^{T}$ denotes transposition, ⊗ is the Kronecker product and vec(.) concatenates the columns of a matrix into a single column vector. The mixing matrix H(f) is then obtained by devectorizing $\underset{̲}{h} (f)$ . This update boils down to the ML update (22) when γ=0.
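The reduction of (35) to (22) for γ=0 can be verified with the standard identity vec(AXB) = (B^T ⊗ A) vec(X). The following sketch writes h for the vectorization of H(f) and omits the frequency index; these notational shortcuts are specific to this illustration.

```latex
\begin{align}
\gamma = 0 :\quad
  \Big( \sum_{n = 1}^{N} \hat{R}_{s}^{T} (n) \otimes I_{I} \Big)\, h
  &= \operatorname{vec}\Big( \sum_{n = 1}^{N} \hat{R}_{xs} (n) \Big)
  && \text{since } (\hat{R}_{s} \otimes I_{I})^{T} = \hat{R}_{s}^{T} \otimes I_{I}
     \text{ and } \sigma_{b}^{2} \text{ cancels} \\
\Longleftrightarrow\quad
  \operatorname{vec}\Big( H \sum_{n = 1}^{N} \hat{R}_{s} (n) \Big)
  &= \operatorname{vec}\Big( \sum_{n = 1}^{N} \hat{R}_{xs} (n) \Big)
  && \text{by } \operatorname{vec}(A X B) = (B^{T} \otimes A) \operatorname{vec}(X) \\
\Longleftrightarrow\quad
  H &= \Big( \sum_{n = 1}^{N} \hat{R}_{xs} (n) \Big)
       \Big( \sum_{n = 1}^{N} \hat{R}_{s} (n) \Big)^{-1} .
\end{align}
```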

5 Experimental evaluation

We evaluate the performance of the proposed MAP estimation algorithms compared to the conventional ML estimation algorithms and to two baseline approaches for the separation of two-channel convolutive mixtures of three sources. We target a semi-informed scenario where the relative positions of the sources and the microphones are known, but nothing is known about their absolute position in the room nor about the source signals. The reverberant character of the data calls for the use of full-rank spatial covariance matrices and subsource mixing matrices, i.e., R _j=2 for all j. We do not constrain the source variances v _j(n,f), so as to measure the improvement due to the priors alone. The full Matlab code for our experiments can be downloaded from [36].

5.1 Data

The proposed priors can be applied in any scenario where the source separation system is to be deployed in fixed, known environments, where the impulse responses can be pre-recorded or simulated. In the following, we use simulated mixtures so as to test a wide range of room reverberation times. The use of simulated data is widespread in audio source separation and it has been shown to yield comparable separation performance to real-world data in general [37]. As a matter of fact, the results of the ML algorithms reported below are comparable to those previously reported on real-world recordings in Figure 6 in [11].

The positions of the sources and the microphones in the test data are illustrated in Figure 1. The room dimensions are 4.45×3.55×2.5 m as in [37], and the microphone spacing and the source-to-microphone distances are fixed to d=5 cm and r=50 cm, respectively. We generated room impulse responses via the image method [33] using the Roomsimove toolbox^c for four reverberation times: T ₆₀=50, 130, 250, or 500 ms, which we convolved with 10 s speech signals sampled at 16 kHz. For each T ₆₀, 6 mixture signals were generated using speech signals from the Signal Separation and Evaluation Campaign (SiSEC) [37]: 2 mixtures of English and Japanese male speech, 2 mixtures of English and Japanese female speech, and 2 mixtures of male and female speech, resulting in 24 mixture signals in total.

Training data were generated in a similar fashion by simulating room impulse responses for 20 random source directions of arrival for each of 20 random microphone pair positions and orientations for the same d and r as above. This resulted in a total of 400 source image signals indexed by p for each T ₆₀.

5.2 Learned hyper-parameter values

Regarding training, preliminary experiments showed that the functions (16) and (30) are concave in practice. Hence, we maximized them using Matlab’s fmincon optimizer (Mathworks Inc., Natick, MA, USA). The resulting hyper-parameter values are shown in Table 1.

As expected, the total power of echoes and reverberation $σ_{rev}^{2} = σ_{1}^{2} + σ_{2}^{2}$ strongly increases with T ₆₀, such that the direct-to-reverberant ratio is 14 dB lower when T ₆₀=500 ms than when T ₆₀=50 ms. The variance of the inverse-Wishart prior, which is inversely related to m[32], decreases with T ₆₀. The ratio $σ_{1}^{2} / σ_{rev}^{2}$ decreases with T ₆₀, which indicates that the echoic and reverberant part of the impulse responses becomes more and more diffuse.

5.3 Tested algorithms and evaluation criteria

In addition to the proposed MAP versions of SIEM and SSEM (MAP inverse-Wishart and MAP Gaussian), we consider the conventional ML versions of these algorithms where the initial values of R _j(f) and H(f) are either set to $μ_{R_{j}} (f)$ and μ _H(f) given the scene geometry (ML geom. init) or blindly estimated via hierarchical clustering followed by permutation alignment as detailed in [11] (ML blind init). Subsequent permutation alignment of the sources after convergence of the ML algorithms was found not to improve performance and therefore it is not used in the following. For comparison, we evaluate two baseline approaches, namely, binary masking and ℓ ₀ -norm minimization, using the reference software in [38] where the mixing matrix in each frequency bin is estimated by hierarchical clustering followed by permutation alignment [11].

In order to assess the respective impact of the priors on solving the permutation problem and on reducing overfitting, we also report an upper bound on the performance of the MAP and the ML geom. init algorithms with oracle permutation alignment. In each frequency bin, the best possible permutation is found by considering all possible permutations of the estimated sources and by selecting the one that leads to the smallest mean square error compared to the true source signals in that bin.
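The oracle permutation search described above can be sketched as an exhaustive search over all J! orderings in a given frequency bin. The function name `oracle_permutation` and the flat data layout (one sequence of complex coefficients per source) are simplifying assumptions of this example; multichannel source images could be handled by concatenating their channels.

```python
from itertools import permutations

def oracle_permutation(estimates, references):
    """Return the ordering of the estimates with the smallest squared error.

    estimates, references: lists of J sequences of complex STFT coefficients
    in one frequency bin (one sequence per source, one entry per time frame).
    The returned tuple perm is such that estimates[perm[j]] is matched to
    references[j].
    """
    J = len(references)

    def cost(perm):
        return sum(abs(e - r) ** 2
                   for j in range(J)
                   for e, r in zip(estimates[perm[j]], references[j]))

    return min(permutations(range(J)), key=cost)


# Toy bin with J = 3 sources and 2 time frames.
references = [
    [1 + 0j, 2 + 0j],
    [0 + 1j, 0 - 1j],
    [-1 + 0j, 3 + 1j],
]
# Estimates are the references in a shuffled order with a small error added.
estimates = [
    [-1.1 + 0j, 3 + 0.9j],   # close to references[2]
    [1 + 0.1j, 2.1 + 0j],    # close to references[0]
    [0 + 0.9j, 0.1 - 1j],    # close to references[1]
]
perm = oracle_permutation(estimates, references)
```

For the J=3 sources considered here, only six permutations need to be tested per frequency bin, so the exhaustive search is inexpensive.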

We computed the STFT with half-overlapping sine windows of length 1,024 and the empirical mixture covariance using a window w _nf of size 3×3 as in [26]. The trade-off parameter γ does not significantly affect the results but we observed that γ=100 and γ=10 are good choices for SIEM and SSEM respectively on average. The number of iterations was fixed to 10 for SIEM and 30 for SSEM, since the convergence of SSEM is typically slower.

The priors did not significantly increase running time. Indeed, the MAP inverse-Wishart update has the same computational complexity as the ML SIEM update. The MAP Gaussian update has greater complexity than the ML SSEM update, but it occurs only once per iteration in each frequency bin, in contrast with the updates in the E step which occur in each time frame. For a typical number of time frames N, the computational complexity is therefore dominated by the E step, regardless of the priors.

We evaluated the separation quality via the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), signal-to-artifact ratio (SAR), and source image-to-spatial distortion ratio (ISR) criteria in decibels (dB) [37], which account respectively for overall distortion, residual crosstalk, musical noise, and target distortion. These criteria were computed using version 3.0 of the BSS Eval toolbox^d and averaged over all sources and all mixtures for each T ₆₀.

5.4 Results for source image-based EM algorithms

The results of the SIEM algorithms and the baselines are compared in Figure 2. Binary masking and ℓ ₀ -norm minimization provide lower SDR than all other algorithms for all reverberation conditions. ML geom. init results in better performance than ML blind init in terms of SDR and SAR for all T ₆₀. Overall, MAP inverse-Wishart outperforms all other algorithms for all considered T ₆₀ in terms of SDR, SIR, and ISR. For instance, at T ₆₀=250 ms, it improves the SDR by 1.7, 1.6, 2.8, and 4.2 dB compared to ML blind init, ML geom. init, binary masking, and ℓ ₀ -norm minimization, respectively. This confirms the benefit of the proposed inverse-Wishart spatial location prior and the associated MAP algorithm.

These results are compared with the corresponding upper bounds obtained with oracle permutation alignment in Table 2. By comparing the first two lines with the last two lines, it appears that ML geom. init and MAP inverse-Wishart both solve the permutation problem at low reverberation times up to T ₆₀=130 ms and that only a small SDR improvement of 0.2 to 0.4 dB is to be expected from better permutation alignment at higher reverberation times. By contrast, comparison of the third and the fourth lines of the table indicates that even if the permutation problem were solved, MAP inverse-Wishart would still outperform ML geom. init by 1.8 dB at T ₆₀=250 ms, which can be attributed to better robustness to overfitting.

Table 2 SDR (dB) of the SIEM algorithms with estimated vs. oracle permutation

5.5 Results for subsource-based EM algorithms

The results of the SSEM algorithms are depicted in Figure 3. Again, ML geom. init results in significantly better performance than ML blind init in terms of all criteria for all T ₆₀, and it also offers higher SDR than binary masking and ℓ ₀-norm minimization for all T ₆₀. The best performance, however, is achieved by MAP Gaussian in terms of all criteria and for all T ₆₀, except in terms of SAR at T ₆₀=500 ms. For instance, at T ₆₀=250 ms, MAP Gaussian improves the SDR by 3.9, 1.0, 1.4, and 2.8 dB compared to ML blind init, ML geom. init, binary masking, and ℓ ₀-norm minimization, respectively. This confirms the benefit of the proposed Gaussian spatial location prior and the associated MAP algorithm.

These results are shown against the corresponding upper bounds obtained with oracle permutation alignment in Table 3. Again, MAP Gaussian significantly outperforms ML geom. init in the oracle case, meaning that the overfitting issue in ML estimates is better addressed in MAP estimates with a proper prior. On the other hand, it can be seen that MAP Gaussian does not fully solve the permutation problem at medium and high reverberation conditions, but that the gap with the oracle permutation is small and slightly smaller than for ML geom. init.

Table 3 SDR (dB) of the SSEM algorithms with estimated vs. oracle permutation

6 Conclusions

We considered two classes of source separation algorithms grounded on the emerging Gaussian EM framework. In contrast with classical ML estimation of the spatial parameters, we proposed two priors exploiting a result from the theory of statistical room acoustics and we derived closed-form MAP updates. The SIEM algorithm with an inverse-Wishart prior and the SSEM algorithm with a Gaussian prior were shown to outperform their ML counterparts for all room reverberation times in a semi-informed scenario. We showed that this performance improvement can be mostly attributed to the greater robustness to overfitting of MAP compared to ML. The proposed MAP algorithms also provide a solution to the problem of permutation of the source estimates that is consistent with the statistics of sound fields. The resulting permutations and those obtained by ML estimation initialized with the known geometric setting are, however, comparably good.

The results in this paper can readily be used in certain real-world scenarios where the source positions are known from, e.g., physical constraints or visual input, and the reverberation characteristics can be learned from the environment [21–23]. Perhaps more importantly, they constitute a first step towards full Bayesian treatment of this family of models in other blind or semi-blind scenarios in the future. Beyond blind estimation of the source positions, and possibly of the microphone distance and directivity [39], two further issues pose significant challenges that go beyond the scope of this paper: robustness to erroneous estimation of these hyper-parameters, and blind estimation of the hyper-parameters $σ_{rev}^{2}$, m, and $σ_{r}^{2}$. Future work will concentrate on these challenges by extending blind techniques for room reverberation time estimation [40]. Usage of the proposed Gaussian prior, which is also valid for rank-1 mixing vectors, may also be explored in the context of FDICA, with the difficulty of translating this prior into a prior over the blocking vectors which are usually considered as parameters in this context instead.

Endnotes

^a Note that in order to yield nonzero likelihood, v _j(n,f) must be nonzero for at least one source j. Σ _x(n,f) in (14) is therefore the sum of Hermitian positive semi-definite matrices, at least one of which is definite, so it is Hermitian positive definite and invertible.

^b If several $μ_{h_{jr}} (f)$ are nonzero multiples of d _j(f), a unitary transform can be applied to H _j(f) in (20) such that only the first one remains nonzero.

^c http://www.irisa.fr/metiss/members/evincent/Roomsimove.zip This toolbox provides a command-line interface which, in contrast with the original GUI by D. R. Campbell, allows generation of a large amount of data.

^d http://bass-db.gforge.inria.fr/bss_eval/

References

O’Grady P, Pearlmutter B, Rickard ST: Survey of sparse and non-sparse methods in source separation. Int. J. Imaging Syst. Technol 2005, 15: 18-33. 10.1002/ima.20035
Makino S, Lee TW, Sawada H: Blind Speech Separation. Berlin: Springer; 2007.
Vincent E, Jafari MG, Abdallah SA, Plumbley MD, Davies ME: Probabilistic modeling paradigms for audio source separation. In Machine Audition: Principles, Algorithms and Systems. Hershey: IGI Global; 2010:162-185.
Smaragdis P: Blind separation of convolved mixtures in the frequency domain. Neurocomputing 1998, 22: 21-34. 10.1016/S0925-2312(98)00047-2
Sawada H, Araki S, Makino S: Frequency-domain blind source separation. In Blind Speech Separation. Berlin: Springer; 2007:47-78.
Yilmaz O, Rickard ST: Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process 2004, 52(7):1830-1847. 10.1109/TSP.2004.828896
Sawada H, Araki S, Makino S: Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment. IEEE Trans. Audio Speech Lang. Process 2011, 19(3):516-527.
Winter S, Kellermann W, Sawada H, Makino S: MAP-based underdetermined blind source separation of convolutive mixtures by hierarchical clustering and ℓ1-norm minimization. EURASIP J. Adv. Signal Process 2007: 024717. doi:10.1155/2007/24717
Févotte C, Cardoso JF: Maximum likelihood approach for blind audio source separation using time-frequency Gaussian models. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Mohonk, NY; 16–19 October 2005:78-81.
Ozerov A, Févotte C: Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Trans. Audio Speech Lang. Process 2010, 18(3):550-563.
Duong NQK, Vincent E, Gribonval R: Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio Speech Lang. Process 2010, 18(7):1830-1840.
Ozerov A, Vincent E, Bimbot F: A general flexible framework for the handling of prior information in audio source separation. IEEE Trans. Audio Speech Lang. Process 2012, 20(4):1118-1133.
Benaroya L, Bimbot F, Gribonval R: Audio source separation with a single sensor. IEEE Trans. Audio Speech Lang. Process 2006, 14: 191-199.
Févotte C, Bertin N, Durrieu JL: Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis. Neural Comput 2009, 21(3):793-830. 10.1162/neco.2008.04-08-771
Virtanen T, Cemgil AT, Godsill SJ: Bayesian extensions to non-negative matrix factorisation for audio signal modelling. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Las Vegas; 30 March to 4 April 2008:1825-1828.
Dikmen O, Cemgil AT: Gamma Markov random fields for audio source modeling. IEEE Trans. Audio Speech Lang. Process 2010, 18(3):589-601.
Itoyama K, Goto M, Komatani K, Ogata T, Okuno HG: Simultaneous processing of sound source separation and musical instrument identification using Bayesian spectral modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Prague; 22–27 May 2011:3816-3819.
Sawada H, Mukai R, Araki S, Makino S: A robust and precise method for solving the permutation problem of frequency-domain blind source separation. IEEE Trans. Speech Audio Process 2004, 12(5):530-538. 10.1109/TSA.2004.832994
Knuth KH: A Bayesian approach to source separation. In Proceedings of the International Workshop on Independent Component Analysis and Source Separation (ICA). Aussois; January 1999:283-288.
Cemgil AT, Févotte C, Godsill SJ: Variational and stochastic inference for Bayesian source separation. Digit. Signal Process 2007, 17: 891-913. 10.1016/j.dsp.2007.03.008
Parra L, Alvino C: Geometric source separation: merging convolutive source separation with geometric beamforming. IEEE Trans. Speech Audio Process 2002, 10(6):352-362. 10.1109/TSA.2002.803443
Knaak M, Araki S, Makino S: Geometrically constrained independent component analysis. IEEE Trans. Audio Speech Lang. Process 2007, 15(2):715-726.
Reindl K, Zheng Y, Schwarz A, Meier S, Maas R, Sehr A, Kellermann W: A stereophonic acoustic signal extraction scheme for noisy and reverberant environments. Comput. Speech Lang 2013, 27(3):726-745. 10.1016/j.csl.2012.07.011
Otsuka T, Ishiguro K, Sawada H, Okuno HG: Bayesian unification of sound source localization and separation with permutation resolution. In Proceedings of the 26th AAAI Conference on Artificial Intelligence. Toronto; 22–26 July 2012:2038-2045.
Duong NQK, Vincent E, Gribonval R: An acoustically-motivated spatial prior for under-determined reverberant source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Prague; 22–27 May 2011:9-12.
Duong NQK, Vincent E, Gribonval R: Under-determined reverberant audio source separation using local observed covariance and auditory-motivated time-frequency representation. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA). St. Malo; 27–30 September 2010:73-80.
Cardoso JF: Multidimensional independent component analysis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Seattle; May 1998:1941-1944.
Kuttruff H: Room Acoustics. New York: Spon Press; 2000.
Gustafsson T, Rao BD, Trivedi M: Source localization in reverberant environments: modeling and statistical analysis. IEEE Trans. Speech Audio Process 2003, 11: 791-803. 10.1109/TSA.2003.818027
Duong NQK, Vincent E, Gribonval R: Spatial covariance models for under-determined reverberant audio source separation. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Mohonk; 18–21 October 2009:129-132.
McLachlan G, Krishnan T: The EM Algorithm and Extensions. New York: Wiley; 1997.
Maiwald D, Kraus D: Calculation of moments of complex Wishart and complex inverse-Wishart distributed matrices. IEEE Proc. Radar Sonar Navigation 2000, 147: 162-168. 10.1049/ip-rsn:20000493
Allen JB, Berkley DA: Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am 1979, 65(4):943-950. 10.1121/1.382599
Nocedal J, Wright SJ: Numerical Optimization. New York, NY: Springer; 1999.
Ogawa A, Takeda K, Itakura F: Balancing acoustic and linguistic probabilities. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1. Seattle; 1998:I-181–184.
Duong NQK, Vincent E, Gribonval R: Matlab code for Gaussian model based audio source separation using spatial location priors. http://www.loria.fr/~evincent/spatial_priors.zip
Vincent E, Araki S, Theis F, Nolte G, Bofill P, Sawada H, Ozerov A, Gowreesunker V, Lutter D, Duong NQK: The Signal Separation Campaign (2007-2010): achievements and remaining challenges. Signal Process 2012, 92: 1928-1936. 10.1016/j.sigpro.2011.10.007
Vincent E, Araki S, Bofill P: Signal Separation Evaluation Campaign: a community-based approach to large-scale evaluation. In Proceedings of the International Conference on Independent Component Analysis and Signal Separation (ICA). Paraty; 15–18 March 2009:734-741.
Hasegawa K, Ono N, Miyabe S, Sagayama S: Blind estimation of locations and time offsets for distributed recording devices. 27–30 September 2010.
Gaubitch ND, Löllmann H, Jeub M, Falk T, Naylor PA, Vary P, Brookes M: Performance comparison of algorithms for blind reverberation time estimation from speech. In Proceedings of the International Workshop on Acoustic Signal Enhancement (IWAENC). Aachen; 4–6 September 2012:1-4.

Acknowledgements

This work was supported by the EUREKA Eurostars i3Dmusic project funded by Oseo. Most of it was done while the first two authors were with Inria Rennes.

Author information

Authors and Affiliations

Technicolor Rennes Research & Innovation Center, 35510, Cesson-Sévigné, France
Ngoc Q K Duong
Inria, 54600, Villers-lès-Nancy, France
Emmanuel Vincent
Inria, 35042, Rennes Cedex, France
Rémi Gribonval

Corresponding author

Correspondence to Emmanuel Vincent.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Duong, N.Q.K., Vincent, E. & Gribonval, R. Spatial location priors for Gaussian model based reverberant audio source separation. EURASIP J. Adv. Signal Process. 2013, 149 (2013). https://doi.org/10.1186/1687-6180-2013-149

Received: 03 April 2013
Accepted: 30 August 2013
Published: 23 September 2013
DOI: https://doi.org/10.1186/1687-6180-2013-149
