By incorporating spectrogram consistency into ILRMA, we propose a novel BSS method named Consistent ILRMA. In this section, after stating our motivation and contributions, we first review the standard ILRMA introduced in [11, 12]. We then propose the consistent version of ILRMA together with an algorithm that realizes it, an implementation of which is openly available on the web.
3.1 Motivations and contributions
The previous paper [32] only reported that the performances of traditional BSS algorithms, FDICA and IVA, were improved by enforcing consistency during the estimation of the demixing matrix Wi. In addition, no detailed experimental analysis related to STFT parameters was provided, even though the parameters of window functions in the STFT and inverse STFT directly affect the smoothing effect of spectrogram consistency.
The spectrogram consistency is a general property of STFT, and therefore, it can be combined with any source model for determined BSS. Its combination with state-of-the-art models, including ILRMA, is of great interest because the current mainstream algorithm for determined audio source separation is centered on ILRMA, which is based on an NMF-based richer time-frequency source model. Indeed, many recent papers are based on the framework of ILRMA [17–29]. Even though combining ILRMA with the spectrogram consistency should be able to exceed the limit of existing BSS algorithms, no such method has been investigated in the literature.
In this paper, we propose a new BSS algorithm that combines ILRMA and spectrogram consistency. Our first contribution is an algorithm that achieves Consistent ILRMA by inserting \(\text {STFT}_{\boldsymbol {\omega }}(\text {ISTFT}_{\widetilde {\boldsymbol {\omega }}}(\cdot))\) into the iterative optimization algorithm of ILRMA. The second contribution is to apply a scale-aligning process called iterative back projection within the iterative algorithm. This process enhances the separation performance when it is combined with spectrogram consistency. The third contribution is the experimental finding that spectrogram consistency works properly when combined with iterative back projection: both Consistent IVA and Consistent ILRMA require iterative back projection to achieve good performance. Our fourth contribution is to provide extensive experimental results for several window functions, window lengths, shift lengths, reverberation times, and source types. We also discuss these results to clarify the behavior of ILRMA with spectrogram consistency.
3.2 Standard ILRMA [12]
The original ILRMA [12] was derived from the following generative model of the spectrograms of the separated signals:
$$ {} \boldsymbol{Y}_{n} \sim p(\boldsymbol{Y}_{n}) = \prod\limits_{i,j} \mathcal{N}_{\mathrm{c}}\left(0,r_{ijn}\right) = \prod\limits_{i,j} \frac{ 1 }{ \pi r_{ijn}} \exp{\left(-\frac{ |y_{ijn}|^{2} }{ r_{ijn}} \right)}, $$
(15)
where \(\mathcal {N}_{\mathrm {c}}\left (\mu, r\right)\) is the circularly symmetric complex Gaussian distribution with mean \(\mu\) and variance \(r\). In this model, the source component \(y_{ijn}\) is assumed to obey a zero-mean and isotropic distribution, i.e., the phase of \(y_{ijn}\) is generated from the uniform distribution in the range \([0,2\pi)\) and the real and imaginary parts of \(y_{ijn}\) are mutually independent. The validity of this assumption is shown in the Appendix. The variance \(r_{ijn}\) can be viewed as the expectation value of \(|y_{ijn}|^{2}\). This variance, arranged as a two-dimensional array indexed by \((i,j)\), is denoted as \(\boldsymbol {R}_{n}\in \mathbb {R}_{> 0}^{I\times J}\) and is called the variance spectrogram corresponding to the \(n\)th source. In ILRMA, the variance spectrogram \(\boldsymbol {R}_{n}\) is modeled using the rank-\(K\) NMF as:
$$\begin{array}{*{20}l} \boldsymbol{R}_{n} = \boldsymbol{T}_{n}\boldsymbol{V}_{n}, \end{array} $$
(16)
where \(\boldsymbol {T}_{n}\in \mathbb {R}_{> 0}^{I\times K}\) and \(\boldsymbol {V}_{n}\in \mathbb {R}_{> 0}^{K\times J}\) are the basis and activation matrices in NMF. The basis vectors in Tn, which represent spectral patterns of the nth source signal, are indexed by k=1,⋯,K. As in FDICA, statistical independence between the source signals is also assumed in ILRMA:
$$ p(\boldsymbol{Y}_{1}, \boldsymbol{Y}_{2}, \cdots, \boldsymbol{Y}_{N}) = \prod\limits_{n} p(\boldsymbol{Y}_{n}). $$
(17)
ILRMA estimates the demixing matrix \(\boldsymbol {W}_{i}\) so that the power spectrograms of the separated signals \(|\boldsymbol {Y}_{n}|^{2}\) have a low-rank structure that can be well approximated by \(\boldsymbol {T}_{n}\boldsymbol {V}_{n}\) with small \(K\). This BSS principle of ILRMA is illustrated in Fig. 3. When the low-rank source model appropriately fits the power spectrograms of the original source signals \(|\boldsymbol {S}_{n}|^{2}\), ILRMA provides an excellent separation performance without explicitly solving the permutation problem afterward.
The demixing matrix Wi and the nonnegative matrices Tn and Vn can be obtained through maximum likelihood estimation. The negative log-likelihood to be minimized, denoted by \(\mathcal {L}\), is given as [12]:
$$ {}\begin{aligned} \mathcal{L} &= - \log p(\boldsymbol{X}_{1}, \boldsymbol{X}_{2}, \cdots, \boldsymbol{X}_{M}), \\ &= -\sum\limits_{i,j} \log \left|\det \boldsymbol{W}_{i}\right|^{2} - \log p(\boldsymbol{Y}_{1}, \boldsymbol{Y}_{2}, \cdots, \boldsymbol{Y}_{N}), \\ &\stackrel{\mathrm{c}}{=} -2J\sum\limits_{i} \log |\det \boldsymbol{W}_{i}| \,+\, \sum\limits_{i,j,n} \!\left(\!\frac{ \left|\boldsymbol{w}_{in}^{\mathrm{H}}\boldsymbol{x}_{ij}\right|^{2} }{ {\sum\nolimits}_{k} t_{ikn}v_{kjn}} \!+ \!\log \sum\limits_{k} t_{ikn}v_{kjn} \!\right), \end{aligned} $$
(18)
where \(\stackrel{\mathrm{c}}{=}\) denotes equality up to an additive constant, and \(t_{ikn}>0\) and \(v_{kjn}>0\) are the elements of \(\boldsymbol {T}_{n}\) and \(\boldsymbol {V}_{n}\), respectively. The minimization of (18) can be performed by iterating the following update rules for the spatial model parameters,
$$\begin{array}{*{20}l} \boldsymbol{U}_{in} &\leftarrow \frac{1}{J} \sum\limits_{j} \frac{1}{{\sum\nolimits}_{k} t_{ikn}v_{kjn}}\boldsymbol{x}_{ij}\boldsymbol{x}_{ij}^{\mathrm{H}}, \end{array} $$
(19)
$$\begin{array}{*{20}l} \boldsymbol{w}_{in} &\leftarrow \left(\boldsymbol{W}_{i}\boldsymbol{U}_{in} \right)^{-1}\boldsymbol{e}_{n}, \end{array} $$
(20)
$$\begin{array}{*{20}l} \boldsymbol{w}_{in} &\leftarrow \boldsymbol{w}_{in} \left(\boldsymbol{w}_{in}^{\mathrm{H}}\boldsymbol{U}_{in}\boldsymbol{w}_{in} \right)^{-\frac{1}{2}}, \end{array} $$
(21)
$$\begin{array}{*{20}l} y_{ijn} &\leftarrow \boldsymbol{w}_{in}^{\mathrm{H}}\boldsymbol{x}_{ij}, \end{array} $$
(22)
and for the source model parameters,
$$\begin{array}{*{20}l} t_{ikn} &\leftarrow t_{ikn} \sqrt{ \frac{ {\sum\nolimits}_{j} \left|y_{ijn}\right|^{2} \left(\sum_{k'} t_{ik'n}v_{k'jn} \right)^{-2} v_{kjn} }{ {\sum\nolimits}_{j} \left(\sum_{k'} t_{ik'n}v_{k'jn} \right)^{-1} v_{kjn}} }, \end{array} $$
(23)
$$\begin{array}{*{20}l} v_{kjn} &\leftarrow v_{kjn} \sqrt{ \frac{ {\sum\nolimits}_{i} \left|y_{ijn}\right|^{2} \left({\sum\nolimits}_{k'} t_{ik'n}v_{k'jn} \right)^{-2} t_{ikn} }{ {\sum\nolimits}_{i} \left({\sum\nolimits}_{k'} t_{ik'n}v_{k'jn} \right)^{-1} t_{ikn}} }, \end{array} $$
(24)
where \(\boldsymbol {e}_{n}\in \{0,1\}^{N}\) is the unit vector with the \(n\)th element equal to unity. Update rules (19)–(24) ensure the monotonic non-increase of the negative log-likelihood function \(\mathcal {L}\). After iterative calculations of updates (19)–(24), the separated signal can be obtained by (12).
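To make the update rules concrete, the following is a minimal NumPy sketch of one pass over (19)–(24), together with the cost (18) so that the non-increase can be checked numerically. It is an illustration under stated assumptions, not the authors' implementation: the array layouts (`X` as I×J×M, `W` as I×N×M with row n of `W[i]` holding \(\boldsymbol{w}_{in}^{\mathrm{H}}\), `T` as I×K×N, `V` as K×J×N), the function names, the random test data, and the order in which the spatial and source updates are applied are all choices made here.

```python
import numpy as np

def ilrma_cost(X, W, T, V):
    """Negative log-likelihood (18), up to an additive constant."""
    J = X.shape[1]
    Y = np.einsum("inm,ijm->ijn", W, X)      # y_ijn = w_in^H x_ij, as in (22)
    R = np.einsum("ikn,kjn->ijn", T, V)      # r_ijn = sum_k t_ikn v_kjn, as in (16)
    _, logabsdet = np.linalg.slogdet(W)      # log|det W_i| for every frequency i
    return -2.0 * J * logabsdet.sum() + np.sum(np.abs(Y) ** 2 / R + np.log(R))

def ilrma_iteration(X, W, T, V):
    """One pass over updates (19)-(24); returns new (W, T, V, Y)."""
    I, J, M = X.shape
    N = W.shape[1]                           # determined case: N == M
    W = W.copy()
    R = np.einsum("ikn,kjn->ijn", T, V)
    for n in range(N):
        # (19): weighted spatial covariance U_in, shape (I, M, M)
        U = np.einsum("ij,ijm,ijl->iml", 1.0 / R[:, :, n], X, X.conj()) / J
        # (20): w_in = (W_i U_in)^{-1} e_n
        e_n = np.zeros((I, N, 1), dtype=complex)
        e_n[:, n, 0] = 1.0
        w = np.linalg.solve(W @ U, e_n)[:, :, 0]
        # (21): normalize so that w_in^H U_in w_in = 1
        scale = np.einsum("im,iml,il->i", w.conj(), U, w).real
        w = w / np.sqrt(scale)[:, None]
        W[:, n, :] = w.conj()                # store w_in^H as row n of W_i
    Y = np.einsum("inm,ijm->ijn", W, X)      # (22)
    P = np.abs(Y) ** 2
    # (23): multiplicative update of the basis matrices
    T = T * np.sqrt(np.einsum("ijn,kjn->ikn", P / R**2, V)
                    / np.einsum("ijn,kjn->ikn", 1.0 / R, V))
    R = np.einsum("ikn,kjn->ijn", T, V)
    # (24): multiplicative update of the activation matrices
    V = V * np.sqrt(np.einsum("ijn,ikn->kjn", P / R**2, T)
                    / np.einsum("ijn,ikn->kjn", 1.0 / R, T))
    return W, T, V, Y

# Illustrative run on random data (not a real mixture): track the cost.
rng = np.random.default_rng(0)
I, J, M, K = 5, 40, 2, 3
X = rng.standard_normal((I, J, M)) + 1j * rng.standard_normal((I, J, M))
W = np.tile(np.eye(M, dtype=complex), (I, 1, 1))
T = rng.uniform(0.1, 1.0, (I, K, M))
V = rng.uniform(0.1, 1.0, (K, J, M))
costs = [ilrma_cost(X, W, T, V)]
for _ in range(10):
    W, T, V, Y = ilrma_iteration(X, W, T, V)
    costs.append(ilrma_cost(X, W, T, V))
```

The list `costs` should be non-increasing up to rounding error, which is the monotonicity property stated above. Practical implementations usually add a small floor to the modeled variances to avoid division by zero; that safeguard is omitted here for brevity.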
Equation (22) is equivalent to applying beamforming [53] to \(\boldsymbol {x}_{ij}\) with the beamformer coefficients \(\boldsymbol {w}_{in}\). Thus, FDICA, IVA, and ILRMA can be interpreted as adaptive estimation of beamforming coefficients without having to know the geometry of microphones and sources [54]. For this reason, the estimated signal \(\boldsymbol {Y}_{n}\) obtained by (22) is a complex-valued spectrogram, and we do not need to recover its phase components using, for example, techniques based on the Griffin–Lim algorithm [37–40, 43, 55–59]. Both the amplitude and phase components of each source are recovered by the complex-valued linear separation filter \(\boldsymbol {w}_{in}\).
3.3 Proposed Consistent ILRMA
To further improve the separation performance of the standard ILRMA, we introduce the spectrogram consistency into the parameter update procedure. In the proposed Consistent ILRMA, the following combination of forward and inverse STFT is performed at the beginning of each iteration of parameter updates:
$$\begin{array}{*{20}l} \boldsymbol{Y}_{n} \leftarrow \text{STFT}_{\boldsymbol{\omega}}(\text{ISTFT}_{\widetilde{\boldsymbol{\omega}}}(\boldsymbol{Y}_{n})). \end{array} $$
(25)
This procedure is the projection of the spectrogram of a separated signal \(\boldsymbol {Y}_{n}\) onto the set of consistent spectrograms [32]. That is, \(\text {STFT}_{\boldsymbol {\omega }}(\text {ISTFT}_{\widetilde {\boldsymbol {\omega }}}(\boldsymbol {Y}_{n}))\) does nothing if \(\boldsymbol {Y}_{n}\) is consistent; otherwise, it smooths the complex spectrogram \(\boldsymbol {Y}_{n}\) by going through the time domain so that the uncertainty principle is satisfied.
In Consistent ILRMA, the calculation of (25) is performed in each iteration of parameter updates based on (19)–(24). Enforcing the spectrogram consistency for the temporary separated signal Yn in each iteration guides the parameters Wi,Tn, and Vn to better solutions, which results in higher separation performance compared to that of conventional ILRMA.
Note that this simple update (25) may increase the value of the negative log-likelihood function (18), and therefore, the monotonicity of the algorithm is no longer guaranteed. However, we will see later in the experiments that the value of the negative log-likelihood function stably decreases as in the standard ILRMA. The amount of the inconsistent component (14) also settles down to some specific value after several iterations.
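The projection (25) can be sketched in a few lines of NumPy. The code below is only an illustration: the framing without boundary padding, the periodic Hann window, and the least-squares inverse STFT (windowed overlap-add divided by the accumulated squared window, which stands in for synthesis with the dual window \(\widetilde{\boldsymbol{\omega}}\)) are assumptions made here, and the function names are hypothetical. It handles a single spectrogram of size I×J, so it would be applied to each \(\boldsymbol{Y}_{n}\) separately.

```python
import numpy as np

def stft(x, win, hop):
    """STFT of a real signal; returns an (I, J) array with I = len(win)//2 + 1."""
    n = len(win)
    n_frames = 1 + (len(x) - n) // hop
    idx = hop * np.arange(n_frames)[:, None] + np.arange(n)[None, :]
    return np.fft.rfft(x[idx] * win, axis=1).T

def istft(Y, win, hop):
    """Least-squares inverse STFT: windowed overlap-add divided by sum of win**2."""
    n = len(win)
    frames = np.fft.irfft(Y.T, n=n, axis=1) * win
    length = n + hop * (frames.shape[0] - 1)
    x = np.zeros(length)
    wsum = np.zeros(length)
    for j, frame in enumerate(frames):
        x[j * hop:j * hop + n] += frame
        wsum[j * hop:j * hop + n] += win ** 2
    # Guard against samples where every covering window value is zero.
    return x / np.maximum(wsum, 1e-12)

def project_consistent(Y, win, hop):
    """Projection (25): Y <- STFT(ISTFT(Y))."""
    return stft(istft(Y, win, hop), win, hop)

# Illustrative check with a periodic Hann window of length 64 and hop 16.
rng = np.random.default_rng(1)
n, hop = 64, 16
win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)
x = rng.standard_normal(n + hop * 19)
Y_consistent = stft(x, win, hop)                       # consistent by construction
Z = rng.standard_normal(Y_consistent.shape) + 1j * rng.standard_normal(Y_consistent.shape)
PZ = project_consistent(Z, win, hop)                   # smoothed version of Z
```

In this sketch, `project_consistent(Y_consistent, win, hop)` returns `Y_consistent` up to rounding error, whereas the random array `Z` is changed by the projection and `PZ` is left unchanged by a second application.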
3.4 Iterative back projection
Since frequency-domain BSS cannot determine the scales of estimated signals (represented by Di in (13)), the spectrogram of a separated signal Yn after an iteration is inconsistent due to the scale irregularity. To take full advantage of the projection enforcing spectrogram consistency in (25), we also propose applying the following back projection at the end of each iteration so that the frequency-wise scales are aligned.
In determined BSS, the back projection is a standard procedure for recovering the frequency-wise scales. It can be written as [49]:
$$ \tilde{\boldsymbol{y}}_{ijn} = \boldsymbol{W}_{i}^{-1} \left(\boldsymbol{e}_{n} \circ \boldsymbol{y}_{ij} \right) = y_{ijn}\boldsymbol{\lambda}_{in}, $$
(26)
where \(\tilde {\boldsymbol {y}}_{ijn} = \left [\, \tilde {y}_{ijn1}, \tilde {y}_{ijn2}, \cdots, \tilde {y}_{ijnM} \right ]^{\mathrm {T}}\in \mathbb {C}^{M}\) is the (i,j)th bin of the scale-fitted spectrogram of the nth separated signal, \(\boldsymbol {\lambda }_{in} = \left [\,\lambda _{in1}, \lambda _{in2}, \cdots, \lambda _{inM}\right ]^{\mathrm {T}}\in \mathbb {C}^{M}\) is a coefficient vector of back projection for the nth signal at the ith frequency, and ∘ denotes the element-wise multiplication. In the proposed method, this update (26) is performed at the end of each iteration so that the projection (25) at the beginning of the next iteration properly smooths the spectrograms without the effect of scale indeterminacy.
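As a small illustration of (26), the coefficient vector \(\boldsymbol{\lambda}_{in}\) is the nth column of \(\boldsymbol{W}_{i}^{-1}\), so the back projection onto one reference channel reduces to a frequency-wise scaling. The sketch below assumes the array layout `W` as I×N×M (with N = M) and `Y` as I×J×N; the function name and the default reference channel are hypothetical.

```python
import numpy as np

def back_projection(W, Y, m_ref=0):
    """Scale-fit Y (I, J, N) to reference channel m_ref following (26)."""
    A = np.linalg.inv(W)        # A[i, m, n] = lambda_inm (column n of W_i^{-1})
    lam = A[:, m_ref, :]        # (I, N): lambda_{in m_ref}
    return Y * lam[:, None, :], lam

# Illustrative check on random data.
rng = np.random.default_rng(2)
I, J, N = 4, 6, 3
W = rng.standard_normal((I, N, N)) + 1j * rng.standard_normal((I, N, N))
X = rng.standard_normal((I, J, N)) + 1j * rng.standard_normal((I, J, N))
Y = np.einsum("inm,ijm->ijn", W, X)          # y_ij = W_i x_ij
Y_fit, lam = back_projection(W, Y, m_ref=0)
```

Two properties follow directly from (26) and can be checked with this sketch: summing the scale-fitted signals over n recovers the observation at the reference channel, and rescaling the rows of \(\boldsymbol{W}_{i}\) by arbitrary nonzero factors leaves `Y_fit` unchanged, which is why the back projection removes the scale indeterminacy.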
One side effect of this back projection is that the value of the negative log-likelihood function (18) is also changed due to the scale modification. In IVA, this problem cannot be avoided because the only parameter in IVA is the demixing matrix Wi. However, in ILRMA, since both the demixing matrix Wi and the source model parameter TnVn can determine the scale of estimated signal Yn, the likelihood variation can be avoided by appropriately adjusting win and Tn after the back projection. To prevent the likelihood variation, the following updates are required after performing (26):
$$\begin{array}{*{20}l} \boldsymbol{w}_{in} &\leftarrow \boldsymbol{w}_{in} \lambda_{inm_{\text{ref}}}, \end{array} $$
(27)
$$\begin{array}{*{20}l} y_{ijn} &\leftarrow \boldsymbol{w}_{in}^{\mathrm{H}}\boldsymbol{x}_{ij}, \end{array} $$
(28)
$$\begin{array}{*{20}l} t_{ikn} &\leftarrow t_{ikn} \left|\lambda_{inm_{\text{ref}}}\right|^{2}, \end{array} $$
(29)
where mref is the index of the reference channel utilized in the back projection.
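A short check of why (27)–(29) leave the negative log-likelihood unchanged may be helpful. The sketch below writes \(\lambda_{in}\) as shorthand for \(\lambda_{inm_{\text{ref}}}\) and \(r_{ijn} = \sum_{k} t_{ikn}v_{kjn}\) as in (16); it only uses the fact that scaling \(\boldsymbol{w}_{in}\) by \(\lambda_{in}\) scales one row of \(\boldsymbol{W}_{i}\), so that \(|y_{ijn}|^{2}\) and \(r_{ijn}\) are both multiplied by \(|\lambda_{in}|^{2}\) and \(|\det \boldsymbol{W}_{i}|\) is multiplied by \(\prod_{n}|\lambda_{in}|\):

$$ \begin{aligned} \frac{|\lambda_{in}|^{2}|y_{ijn}|^{2}}{|\lambda_{in}|^{2}r_{ijn}} &= \frac{|y_{ijn}|^{2}}{r_{ijn}}, \\ \sum\limits_{i,j,n} \log \left(|\lambda_{in}|^{2} r_{ijn}\right) &= \sum\limits_{i,j,n} \log r_{ijn} + 2J\sum\limits_{i,n} \log |\lambda_{in}|, \\ -2J\sum\limits_{i} \log \Big(|\det \boldsymbol{W}_{i}| \prod\limits_{n} |\lambda_{in}|\Big) &= -2J\sum\limits_{i} \log |\det \boldsymbol{W}_{i}| - 2J\sum\limits_{i,n} \log |\lambda_{in}|. \end{aligned} $$

The two additional terms \(\pm 2J\sum_{i,n}\log|\lambda_{in}|\) cancel, so the value of the negative log-likelihood is the same before and after the rescaling. In IVA there is no counterpart of (29) to absorb the change in the determinant term, which is consistent with the remark above that the likelihood variation cannot be avoided there.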
The overall algorithm of the proposed Consistent ILRMA is summarized in Algorithm 1. The iterative loop for the parameter optimization appears in the second to eighth lines. The spectrogram consistency of the temporary separated signal \(\boldsymbol {Y}_{n}\) is ensured in the third line, and the iterative back projection is applied in the sixth and seventh lines. Note that the conventional ILRMA is obtained by performing only the fourth and fifth lines (i.e., ignoring the third, sixth, and seventh lines). Python code for the conventional ILRMA is openly available online (https://pyroomacoustics.readthedocs.io/en/pypi-release/pyroomacoustics.bss.ilrma.html), and therefore, the proposed Consistent ILRMA can be easily implemented in Python by slightly modifying that code. MATLAB code for Consistent ILRMA is also available online (https://github.com/d-kitamura/ILRMA/blob/master/consistentILRMA.m).