Consistent Independent Low-Rank Matrix Analysis for Determined Blind Source Separation

Independent low-rank matrix analysis (ILRMA) is the state-of-the-art algorithm for blind source separation (BSS) in the determined situation (the number of microphones is greater than or equal to that of source signals). ILRMA achieves high separation performance by modeling the power spectrograms of the source signals via nonnegative matrix factorization (NMF). Such a highly developed source model can effectively solve the permutation problem of frequency-domain BSS, which is likely the reason for the excellence of ILRMA. In this paper, we further improve the separation performance of ILRMA by additionally considering a general structure of spectrograms called consistency, and hence we call the proposed method Consistent ILRMA. Since a spectrogram is calculated with an overlapping window (and a window function induces spectral smearing in the form of main- and side-lobes), the time-frequency bins depend on each other. In other words, the time-frequency components are related to each other via the uncertainty principle. Such co-occurrence among the spectral components can assist in solving the permutation problem, as demonstrated by a recent study. Based on these facts, we propose an algorithm for realizing Consistent ILRMA by slightly modifying the original algorithm. Its performance was extensively evaluated through experiments performed with various window lengths and shift lengths. The results indicated several tendencies of the original and proposed ILRMA, including some topics that have not been discussed well in the literature. For example, the proposed Consistent ILRMA tends to outperform the original ILRMA when the window length is sufficiently long compared to the reverberation time of the mixing system.


Introduction
Blind source separation (BSS) is a technique for separating individual sources from an observed mixture without knowing how they were mixed. In particular, BSS for multichannel audio signals observed by multiple microphones has been well studied [1][2][3][4][5][6][7][8][9][10][11][12][13]. The BSS problem can be divided into two situations: the underdetermined case (the number of microphones is less than the number of sources) and the (over-)determined case (the number of microphones is greater than or equal to the number of sources). This paper focuses on the determined BSS problem because it allows higher-quality separation than the underdetermined BSS methods.
Independent component analysis (ICA) is the most popular and successful algorithm for solving the determined BSS problem [1]. It estimates a demixing matrix (the inverse system of the mixing process) by assuming statistical independence between the sources. For a mixture of audio signals, ICA is usually applied in the time-frequency domain via the short-time Fourier transform (STFT) because the sources are mixed by convolution. This strategy is called frequency-domain ICA (FDICA) [2], which independently applies ICA to the complex-valued signals in each frequency bin. Then, the estimated frequency-wise demixing matrices must be aligned over all frequencies so that the frequency components of the same source are grouped together. Such alignment of the frequency components is the so-called permutation problem [3][4][5][6], whose complete solution has not been established. Therefore, a great deal of research has tackled this problem.
To solve the permutation problem, some sophisticated source models have been proposed. Independent vector analysis (IVA) [7][8][9][10] is one of the most successful methods from the early stage of this development. It assumes higher-order dependencies (co-occurrence among the frequency components) of each source by employing a spherical generative model of the source frequency vector. This assumption enables IVA to simultaneously estimate the frequency-wise demixing matrices and solve the permutation problem using only one objective function. It has been further developed by improving its source model. One natural and powerful extension of IVA is independent low-rank matrix analysis (ILRMA) [11,12], which integrates the source model of nonnegative matrix factorization [14,15] based on the Itakura-Saito divergence (IS-NMF) [16] into IVA. This extension has greatly improved the separation performance by taking the low-rank time-frequency structure (co-occurrence among the time-frequency bins) of the source signals into account. ILRMA has achieved state-of-the-art performance and has been further developed by several researchers [17][18][19][20][21][22][23][24]. In this respect, ILRMA can be considered the new standard among determined BSS algorithms.
The consistency of a spectrogram offers another promising approach for solving the permutation problem. A recent study has shown that STFT can provide effective information related to the co-occurrence among the time-frequency bins [25]. Since an overlapping window is utilized in STFT, the time-frequency bins are related to each other through the overlapping segments. The frequency components within a segment are also related to each other because of the spectral smearing caused by the main- and side-lobes of the window. In other words, the time-frequency components are not independent but related to each other via the uncertainty principle of time-frequency representation. Such a relation has been well studied in phase-aware signal processing [26][27][28][29][30][31][32][33][34][35][36] under the name of spectrogram consistency [37][38][39][40]. In the previous study [25], the spectrogram consistency was imposed on BSS to help the algorithm solve the permutation problem. This approach is very different from the conventional studies of determined BSS because it utilizes a general property of STFT that is independent of the source model (in contrast to the above-mentioned methods, which focus on modeling the source signals without considering the property of STFT). As the spectrogram consistency can be incorporated into any source model, its combination with the state-of-the-art algorithm should achieve high separation performance.
However, the paper that proposed combining consistency with determined BSS [25] only showed the potential of consistency through experiments using FDICA and IVA. The paper claimed to be a first step toward incorporating the spectrogram consistency into determined BSS, and no advanced method was tested. In particular, ILRMA was not considered because its algorithm is far more complicated than that derived in [25], and thus it is not clear whether (and how much) the spectrogram consistency can improve the state-of-the-art BSS algorithm.
In this paper, we propose a new variant of ILRMA called Consistent ILRMA by considering the spectrogram consistency within the algorithm of ILRMA. The combination of IS-NMF and the spectral smoothing of the inverse STFT (see the figures in [25]) realizes source modeling in terms of the complex spectrogram. In particular, the spectral smearing in the frequency direction ties adjacent frequency bins together, and this effect of spectrogram consistency helps ILRMA solve the permutation problem. Since consistency is a concept that depends on the parameters of the window function, the separation performance of Consistent ILRMA was extensively tested through experiments with various window lengths and shift lengths. The results indicated several tendencies of the conventional and proposed methods, including that the proposed method outperforms the original ILRMA when the window length is sufficiently long compared to the reverberation time of the mixing system.
Let $N$ source signals be observed by $M$ microphones. Then, the $l$th samples of the multichannel source, observed, and separated signals are respectively denoted as follows:
$$s[l] = [\, s_1[l], \cdots, s_n[l], \cdots, s_N[l] \,]^{\mathrm{T}} \in \mathbb{R}^{N}, \tag{1}$$
$$x[l] = [\, x_1[l], \cdots, x_m[l], \cdots, x_M[l] \,]^{\mathrm{T}} \in \mathbb{R}^{M}, \tag{2}$$
$$y[l] = [\, y_1[l], \cdots, y_n[l], \cdots, y_N[l] \,]^{\mathrm{T}} \in \mathbb{R}^{N}, \tag{3}$$
where $n = 1, \cdots, N$, $m = 1, \cdots, M$, and $l = 1, \cdots, L$ are the indexes of sources, microphones (channels), and discrete time, respectively, and $\cdot^{\mathrm{T}}$ denotes the transpose. BSS aims at recovering the source signal $s$ from the observed signal $x$, i.e., making $y$ as close to $s$ as possible.
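In frequency-domain BSS, these signals are handled in the time-frequency domain via STFT. Let the window length and shifting step of STFT be denoted as $Q$ and $\tau$, respectively. Then, the $j$th segment of a signal $z[l]$ is defined as
$$z^{[j]} = [\, z[(j-1)\tau+1],\ z[(j-1)\tau+2],\ \cdots,\ z[(j-1)\tau+Q] \,]^{\mathrm{T}} = [\, z^{[j]}[1],\ \cdots,\ z^{[j]}[q],\ \cdots,\ z^{[j]}[Q] \,]^{\mathrm{T}} \in \mathbb{R}^{Q}, \tag{4}$$
where $j = 1, \cdots, J$ and $q = 1, \cdots, Q$ are the indexes of the segments and in-segment samples, respectively, and the number of segments is given by $J = L/\tau$ with some zero-padding for adjusting the signal length $L$ if necessary. STFT of a signal $z = [\, z[1], \cdots, z[L] \,]^{\mathrm{T}} \in \mathbb{R}^{L}$ is denoted by
$$Z = \mathrm{STFT}_{\omega}(z) \in \mathbb{C}^{I \times J}, \tag{5}$$
where the $(i, j)$th bin of the spectrogram $Z$ is given as
$$z_{ij} = \sum_{q=1}^{Q} \omega[q]\, z^{[j]}[q]\, \exp\!\left( -\imath 2\pi (q-1)(i-1)/F \right), \tag{6}$$
$i = 1, \cdots, I$ is the index of frequency bins, $F$ is an integer satisfying $\lfloor F/2 \rfloor + 1 = I$, $\lfloor \cdot \rfloor$ is the floor function, $\imath$ denotes the imaginary unit, and $\omega$ is an analysis window. The inverse STFT with a synthesis window $\tilde{\omega}$ is also defined in the usual way and denoted as $\mathrm{ISTFT}_{\tilde{\omega}}(\cdot)$. In this paper, we assume that the window pair satisfies the following perfect reconstruction condition:
$$z = \mathrm{ISTFT}_{\tilde{\omega}}(\mathrm{STFT}_{\omega}(z)) \quad \forall z \in \mathbb{R}^{L}. \tag{7}$$
By applying STFT, the $(i, j)$th bin of the spectrograms of source, observed, and separated signals can be written as
$$s_{ij} = [\, s_{ij1}, \cdots, s_{ijn}, \cdots, s_{ijN} \,]^{\mathrm{T}} \in \mathbb{C}^{N}, \tag{8}$$
$$x_{ij} = [\, x_{ij1}, \cdots, x_{ijm}, \cdots, x_{ijM} \,]^{\mathrm{T}} \in \mathbb{C}^{M}, \tag{9}$$
$$y_{ij} = [\, y_{ij1}, \cdots, y_{ijn}, \cdots, y_{ijN} \,]^{\mathrm{T}} \in \mathbb{C}^{N}. \tag{10}$$
We also denote the spectrograms corresponding to the $n$th or $m$th signals in (8)-(10) as $S_n \in \mathbb{C}^{I \times J}$, $X_m \in \mathbb{C}^{I \times J}$, and $Y_n \in \mathbb{C}^{I \times J}$, whose elements are $s_{ijn}$, $x_{ijm}$, and $y_{ijn}$, respectively. In ordinary frequency-domain BSS, an instantaneous mixing process for each frequency bin is assumed:
$$x_{ij} = A_i s_{ij}, \tag{11}$$
where $A_i \in \mathbb{C}^{M \times N}$ is a frequency-wise mixing matrix. The mixture model (11) is approximately valid when the reverberation time is sufficiently shorter than the length of the analysis window used in STFT.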
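The perfect reconstruction condition (7) can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes circular (wrap-around) segments instead of zero-padding, a DFT length $F = Q$, a periodic Hann window, and a window length that is a multiple of the shift. The helper names `dual_window`, `frame_index`, `stft`, and `istft` are hypothetical.

```python
import numpy as np

def dual_window(w, tau):
    """Canonical dual of the analysis window w for shift tau (needs len(w) % tau == 0)."""
    denom = (w ** 2).reshape(-1, tau).sum(axis=0)   # sum of w^2 over all shifted copies
    return w / np.tile(denom, len(w) // tau)

def frame_index(L, Q, tau):
    """Sample indices of every segment, wrapping around circularly; shape (J, Q)."""
    return (np.arange(L // tau)[:, None] * tau + np.arange(Q)[None, :]) % L

def stft(z, w, tau):
    """Spectrogram of shape (I, J) with I = Q // 2 + 1 and J = L // tau."""
    frames = z[frame_index(len(z), len(w), tau)] * w
    return np.fft.rfft(frames, axis=1).T

def istft(Z, w_dual, tau, L):
    """Overlap-add synthesis using the synthesis window w_dual."""
    Q = len(w_dual)
    frames = np.fft.irfft(Z.T, n=Q, axis=1) * w_dual
    z = np.zeros(L)
    np.add.at(z, frame_index(L, Q, tau), frames)    # accumulate overlapping segments
    return z

rng = np.random.default_rng(0)
Q, tau, L = 64, 16, 1024            # window length, shift, signal length (L % tau == 0)
w = np.hanning(Q + 1)[:-1]          # periodic Hann analysis window
z = rng.standard_normal(L)

Z = stft(z, w, tau)
z_rec = istft(Z, dual_window(w, tau), tau, L)
recon_error = np.max(np.abs(z - z_rec))   # should be at the level of rounding error
```

With the canonical dual window, the products of analysis and synthesis windows sum to one at every sample, which is why the reconstruction is exact regardless of the chosen shift.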
Hereafter, we consider the determined case, i.e., $M = N$. In this case, BSS can be achieved by estimating the inverse of $A_i$ for all frequencies. By denoting an approximate inverse as $W_i \approx A_i^{-1}$, the separation process can be written as
$$y_{ij} = W_i x_{ij}, \tag{12}$$
where $W_i = [\, w_{i1}, w_{i2}, \cdots, w_{iN} \,]^{\mathrm{H}} \in \mathbb{C}^{N \times M}$ is the so-called frequency-wise demixing matrix, and $\cdot^{\mathrm{H}}$ denotes the Hermitian transpose. The aim of a determined BSS algorithm is to find the demixing matrices for all frequencies so that the separated signals approximate the source signals.

Permutation problem in determined BSS
In practice, the scale and permutation of the separated signals are unknown because the information of the mixing process is missing. That is, when the separation is correctly performed by some demixing matrix $W_i$ as in (12), the following signal is also a solution to the BSS problem:
$$\hat{y}_{ij} = \hat{W}_i x_{ij}, \qquad \hat{W}_i = D_i P_i W_i, \tag{13}$$
where $D_i \in \mathbb{C}^{N \times N}$ and $P_i \in \{0, 1\}^{N \times N}$ are arbitrary diagonal and permutation matrices, respectively. While the signal scale can easily be recovered by applying the back projection [41], the permutation of the estimated signals $\hat{y}_{ij}$ must be aligned for all frequencies, i.e., $P_i$ must be the same for all $i$. This alignment of the permutation of estimated signals is the so-called permutation problem, which is the main obstacle in frequency-domain determined BSS.
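The ambiguity in (13) can be made concrete for a single frequency bin. The sketch below uses synthetic random data with $N = M = 2$; the particular values of `D` and `P` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, J = 2, 100                                   # number of sources (= microphones), time frames

# One frequency bin: sources s_ij, mixing matrix A_i, observation x_ij = A_i s_ij.
S = rng.standard_normal((N, J)) + 1j * rng.standard_normal((N, J))
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
X = A @ S

W = np.linalg.inv(A)                            # ideal demixing matrix W_i
D = np.diag([2.0 * np.exp(0.3j), 0.5])          # arbitrary diagonal (scale) matrix D_i
P = np.array([[0, 1], [1, 0]])                  # permutation matrix P_i swapping the two sources
W_hat = D @ P @ W

Y = W @ X                                       # recovers the sources in the original order
Y_hat = W_hat @ X                               # equals D P S: swapped and rescaled sources
```

Both `Y` and `Y_hat` contain perfectly separated signals, so a criterion that looks at one frequency bin at a time cannot tell them apart. The ordering only matters once the bins are combined into full spectrograms.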
In FDICA, a permutation solver (realignment process of $P_i$) is utilized as a post-processing applied to the frequency-wise separated signals $\hat{y}_{ij}$ [4][5][6]. In recent frequency-domain BSS methods, an additional assumption on the sources (or source model) is introduced to circumvent the permutation problem. For example, IVA assumes simultaneous co-occurrence of all frequency components in the same source, and ILRMA assumes a low-rank structure of the power spectrogram $|Y_n|^2$. Some other source models have also been proposed for improving the separation performance [42][43][44]. These source models can avoid the permutation problem to some extent during the estimation of $\hat{W}_i$. Recent developments in determined BSS have been achieved through the quest for a better source model that represents the source signals more precisely.

Solving permutation problem by spectrogram consistency
A recent paper reported another approach for solving the permutation problem based on the general property of STFT called spectrogram consistency [25].
Consistency is a fundamental property of a spectrogram. Since any time-frequency representation has the theoretical limitation called the uncertainty principle, the time-frequency bins of a spectrogram are not independent but related to each other. The inverse STFT always modifies a spectrogram $Z_n$ that violates this kind of inter-time-frequency relation so that the relation is recovered. That is, a spectrogram $Z_n$ properly retains the inter-time-frequency relation if and only if the inconsistent component
$$E(Z_n) = Z_n - \mathrm{STFT}_{\omega}(\mathrm{ISTFT}_{\tilde{\omega}}(Z_n)) \tag{14}$$
is zero, i.e., $\| E(Z_n) \| = 0$ for a norm $\| \cdot \|$. Such a spectrogram $Z_n$ satisfying $\| E(Z_n) \| = 0$ is said to be consistent.
As the inverse STFT is a process of recovering the consistency (the inter-time-frequency relation), it is capable of aligning the frequency components. Roughly speaking, the inverse STFT is a smoothing process of a spectrogram in the time-frequency domain (see the figures in [25]). This is because the main- and side-lobes of the window function (and the overlap-add process) spread the energy of a time-frequency bin. In other words, the inverse STFT mixes up the separated signals if the frequency-wise permutation is not aligned correctly. Therefore, enforcing consistency within a BSS algorithm by applying $\mathrm{STFT}_{\omega}(\mathrm{ISTFT}_{\tilde{\omega}}(\cdot))$ can improve the separation performance to some extent [25].
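To illustrate how a misaligned permutation shows up as inconsistency, the sketch below measures the norm of the inconsistent component (14) for a spectrogram before and after swapping the upper half of its frequency bins with those of another signal. It is only an illustration under assumptions: circular segments, $F = Q$, a periodic Hann window with its canonical dual, the Frobenius norm, and two white-noise test signals. All helper names are hypothetical.

```python
import numpy as np

def dual_window(w, tau):
    denom = (w ** 2).reshape(-1, tau).sum(axis=0)
    return w / np.tile(denom, len(w) // tau)

def frame_index(L, Q, tau):
    return (np.arange(L // tau)[:, None] * tau + np.arange(Q)[None, :]) % L

def stft(z, w, tau):
    return np.fft.rfft(z[frame_index(len(z), len(w), tau)] * w, axis=1).T

def istft(Z, w_dual, tau, L):
    Q = len(w_dual)
    z = np.zeros(L)
    np.add.at(z, frame_index(L, Q, tau), np.fft.irfft(Z.T, n=Q, axis=1) * w_dual)
    return z

def inconsistency(Z, w, tau, L):
    """Frobenius norm of E(Z) = Z - STFT(ISTFT(Z)), cf. (14)."""
    projected = stft(istft(Z, dual_window(w, tau), tau, L), w, tau)
    return np.linalg.norm(Z - projected)

rng = np.random.default_rng(0)
Q, tau, L = 64, 16, 2048
w = np.hanning(Q + 1)[:-1]                       # periodic Hann window
Y1 = stft(rng.standard_normal(L), w, tau)        # spectrogram of signal 1 (consistent)
Y2 = stft(rng.standard_normal(L), w, tau)        # spectrogram of signal 2 (consistent)

# Emulate a permutation error: swap the two signals in the upper frequency bins.
cut = Q // 4
Y1_perm, Y2_perm = Y1.copy(), Y2.copy()
Y1_perm[cut:], Y2_perm[cut:] = Y2[cut:], Y1[cut:]

e_aligned = inconsistency(Y1, w, tau, L)         # essentially zero
e_permuted = inconsistency(Y1_perm, w, tau, L)   # clearly nonzero
```

The permuted spectrogram is no longer the STFT of any time-domain signal, because the window's main-lobe couples the bins on both sides of the swap boundary.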
The spectrogram consistency is a general property of STFT, and therefore it can be combined with any source model for BSS. Its combination with a state-of-the-art model such as ILRMA should be of great interest because such a combination might exceed the limit of existing BSS algorithms. Yet, no such method has been investigated in the literature.

Proposed method
By incorporating spectrogram consistency into ILRMA, we propose a novel BSS method named Consistent ILRMA. In this section, we first review the standard ILRMA introduced in [11] and then propose the consistent version.

Standard ILRMA [12]
The original ILRMA [11] was derived from the following generative model of the spectrograms of the separated signals:
$$p(Y_n) = \prod_{i,j} \mathcal{N}_c(0, r_{ijn}) = \prod_{i,j} \frac{1}{\pi r_{ijn}} \exp\!\left( -\frac{|y_{ijn}|^2}{r_{ijn}} \right), \tag{15}$$
where $\mathcal{N}_c(\mu, r)$ is the circularly symmetric complex Gaussian distribution with mean $\mu$ and variance $r$. In this model, the variance $r_{ijn}$ can be viewed as the expectation value of $|y_{ijn}|^2$. This variance $r_{ijn}$, as a two-dimensional array indexed by $(i, j)$, is denoted as $R_n \in \mathbb{R}_{>0}^{I \times J}$ and is called the variance spectrogram corresponding to the $n$th source. In ILRMA, the variance matrix $R_n$ is modeled using the rank-$K$ nonnegative matrix factorization (NMF) as follows:
$$R_n = T_n V_n, \tag{16}$$
where $T_n \in \mathbb{R}_{>0}^{I \times K}$ and $V_n \in \mathbb{R}_{>0}^{K \times J}$ are the so-called basis and activation matrices in NMF. The basis vectors in $T_n$, which represent spectral patterns of the $n$th source signal, are indexed by $k = 1, \cdots, K$. Statistical independence between the source signals, as in FDICA, is also assumed in ILRMA:
$$p(Y_1, \cdots, Y_N) = \prod_{n} p(Y_n). \tag{17}$$
ILRMA estimates the demixing matrix $W_i$ so that the power spectrograms of the separated signals $|Y_n|^2$ have a low-rank structure that can be well approximated by $T_n V_n$ with small $K$, where $| \cdot |^2$ for a matrix input represents the element-wise squared absolute value. When the low-rank source model appropriately fits the power spectrograms of the original source signals $|S_n|^2$, ILRMA provides excellent separation performance without explicitly solving the permutation problem afterward.
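The demixing matrix $W_i$ and the nonnegative matrices $T_n$ and $V_n$ can be obtained through maximum likelihood estimation. The negative log-likelihood to be minimized, denoted by $\mathcal{L}$, is given as follows [12]:
$$\mathcal{L} \overset{c}{=} \sum_{i,j,n} \left[ \frac{|y_{ijn}|^2}{\sum_k t_{ikn} v_{kjn}} + \log \sum_k t_{ikn} v_{kjn} \right] - 2J \sum_i \log |\det W_i|, \tag{18}$$
where $\overset{c}{=}$ denotes equality up to a constant, $y_{ijn} = w_{in}^{\mathrm{H}} x_{ij}$, and $t_{ikn} > 0$ and $v_{kjn} > 0$ are the elements of $T_n$ and $V_n$, respectively. The minimization of (18) can be performed by iterating the following update rules for the spatial model parameters,
$$U_{in} = \frac{1}{J} \sum_j \frac{1}{r_{ijn}} x_{ij} x_{ij}^{\mathrm{H}}, \tag{19}$$
$$w_{in} \leftarrow (W_i U_{in})^{-1} e_n, \tag{20}$$
$$w_{in} \leftarrow w_{in} \left( w_{in}^{\mathrm{H}} U_{in} w_{in} \right)^{-\frac{1}{2}}, \tag{21}$$
and for the source model parameters,
$$t_{ikn} \leftarrow t_{ikn} \sqrt{ \frac{\sum_j |y_{ijn}|^2\, r_{ijn}^{-2}\, v_{kjn}}{\sum_j r_{ijn}^{-1}\, v_{kjn}} }, \tag{22}$$
$$v_{kjn} \leftarrow v_{kjn} \sqrt{ \frac{\sum_i |y_{ijn}|^2\, r_{ijn}^{-2}\, t_{ikn}}{\sum_i r_{ijn}^{-1}\, t_{ikn}} }, \tag{23}$$
$$r_{ijn} = \sum_k t_{ikn} v_{kjn}, \tag{24}$$
where $e_n \in \{0, 1\}^N$ is the unit vector with the $n$th element equal to unity. The update rules (19)-(24) ensure the monotonic non-increase of the negative log-likelihood function $\mathcal{L}$. After iterating these updates (19)-(24), the separated signal can be obtained by (12).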
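As a rough illustration of how the spatial and source model updates interact, the following NumPy sketch performs ILRMA iterations on synthetic data and records the cost (18). It is a sketch under assumptions rather than the reference implementation: the array layouts (`X` as `(I, J, M)`, `W` as `(I, N, M)` with rows $w_{in}^{\mathrm{H}}$, `T` as `(I, K, N)`, `V` as `(K, J, N)`), the function names, the random test data, and the initialization range are all choices made here for illustration.

```python
import numpy as np

def cost(X, W, T, V):
    """Negative log-likelihood (18), up to a constant."""
    Y = np.einsum('inm,ijm->ijn', W, X)                  # y_ijn = w_in^H x_ij
    R = np.einsum('ikn,kjn->ijn', T, V)                  # r_ijn = sum_k t_ikn v_kjn
    logdet = np.log(np.abs(np.linalg.det(W)))
    return np.sum(np.abs(Y) ** 2 / R + np.log(R)) - 2 * X.shape[1] * np.sum(logdet)

def ilrma_iteration(X, W, T, V):
    """One pass over the spatial updates and then the source model updates."""
    J, N = X.shape[1], W.shape[1]
    W = W.copy()
    R = np.einsum('ikn,kjn->ijn', T, V)
    for n in range(N):                                   # spatial model, one source at a time
        U = np.einsum('ij,ijm,ijl->iml', 1.0 / R[:, :, n], X, X.conj()) / J
        w = np.linalg.inv(W @ U)[:, :, n]                # n-th column of (W_i U_in)^{-1}
        w = w / np.sqrt(np.einsum('im,iml,il->i', w.conj(), U, w).real)[:, None]
        W[:, n, :] = w.conj()                            # store w_in^H as the n-th row of W_i
    P = np.abs(np.einsum('inm,ijm->ijn', W, X)) ** 2     # power spectrograms |y_ijn|^2
    T = T * np.sqrt(np.einsum('ijn,kjn->ikn', P / R ** 2, V)
                    / np.einsum('ijn,kjn->ikn', 1.0 / R, V))
    R = np.einsum('ikn,kjn->ijn', T, V)
    V = V * np.sqrt(np.einsum('ijn,ikn->kjn', P / R ** 2, T)
                    / np.einsum('ijn,ikn->kjn', 1.0 / R, T))
    return W, T, V

rng = np.random.default_rng(0)
I, J, N, K = 8, 60, 2, 2
S = (rng.standard_normal((I, J, N)) + 1j * rng.standard_normal((I, J, N))) \
    * rng.uniform(0.1, 2.0, (1, J, N))                   # synthetic sources with time-varying power
A = rng.standard_normal((I, N, N)) + 1j * rng.standard_normal((I, N, N))
X = np.einsum('imn,ijn->ijm', A, S)                      # frequency-wise instantaneous mixtures

W = np.tile(np.eye(N, dtype=complex), (I, 1, 1))         # identity initialization
T = rng.uniform(0.1, 1.0, (I, K, N))
V = rng.uniform(0.1, 1.0, (K, J, N))
costs = [cost(X, W, T, V)]
for _ in range(20):
    W, T, V = ilrma_iteration(X, W, T, V)
    costs.append(cost(X, W, T, V))
```

The list `costs` should be non-increasing, in line with the monotonicity property stated above. The demixing filters are updated one source at a time so that each update sees the most recent $W_i$.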

Proposed Consistent ILRMA
To further improve the separation performance of the standard ILRMA, we introduce the spectrogram consistency into the parameter update procedure. In the proposed Consistent ILRMA, the following combination of the forward and inverse STFT is performed at the beginning of each iteration:
$$Y_n \leftarrow \mathrm{STFT}_{\omega}(\mathrm{ISTFT}_{\tilde{\omega}}(Y_n)). \tag{25}$$
This procedure is the projection of the spectrogram of a separated signal $Y_n$ onto the set of consistent spectrograms [25]. That is, $\mathrm{STFT}_{\omega}(\mathrm{ISTFT}_{\tilde{\omega}}(Y_n))$ does nothing if $Y_n$ is consistent, but otherwise it smooths the complex spectrogram $Y_n$, by going through the time domain, so that the uncertainty principle is satisfied. Note that this simple update (25) may increase the value of the negative log-likelihood function (18), and therefore the monotonicity of the algorithm is no longer guaranteed. However, we will see later in the experiments that the value of the negative log-likelihood function stably decreases as in the usual ILRMA. The amount of the inconsistent component (14) also settles down to specific values after several iterations.
Since frequency-domain BSS cannot determine the scales of the estimated signals (represented by $D_i$ in (13)), the spectrogram of a separated signal $Y_n$ after an iteration should be inconsistent due to the scale irregularity. To fully benefit from the projection enforcing spectrogram consistency in (25), we also apply the following back projection at the end of each iteration so that the frequency-wise scales are aligned.
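The ordering of the steps within one iteration, as described in this section, can be sketched as a small driver loop. This is only a structural outline: the function name `consistent_ilrma_loop` and the three callables are hypothetical placeholders, and the actual computations of (25), (19)-(24), and (26) have to be supplied by the caller.

```python
def consistent_ilrma_loop(Y, n_iter, project, update_parameters, back_project):
    """Apply the three stages of each iteration in the order described in the text.

    project           : enforces spectrogram consistency, cf. (25)
    update_parameters : standard ILRMA updates followed by re-separation, cf. (19)-(24) and (12)
    back_project      : aligns the frequency-wise scales, cf. (26)
    All three are caller-supplied callables that map spectrograms to spectrograms.
    """
    for _ in range(n_iter):
        Y = project(Y)              # beginning of the iteration
        Y = update_parameters(Y)    # spatial and source model updates
        Y = back_project(Y)         # end of the iteration
    return Y
```

Placing the back projection last means that the projection at the start of the next iteration operates on spectrograms whose frequency-wise scales are already aligned.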

Iterative back projection
In determined BSS, the back projection is a standard procedure for recovering the frequency-wise scales. It can be written as follows [41]:
$$\tilde{y}_{ijn} = W_i^{-1} (e_n \circ y_{ij}) = y_{ijn} \lambda_{in}, \tag{26}$$
where $\tilde{y}_{ijn} = [\, \tilde{y}_{ijn1}, \tilde{y}_{ijn2}, \cdots, \tilde{y}_{ijnM} \,]^{\mathrm{T}} \in \mathbb{C}^{M}$ is the $(i, j)$th bin of the scale-fitted spectrogram of the $n$th separated signal, $\lambda_{in} = [\, \lambda_{in1}, \lambda_{in2}, \cdots, \lambda_{inM} \,]^{\mathrm{T}} \in \mathbb{C}^{M}$ is a coefficient vector of back projection for the $n$th signal at the $i$th frequency, and $\circ$ denotes the element-wise multiplication. In the proposed method, this update (26) is performed at the end of each iteration so that the projection (25) at the beginning of the next iteration properly smooths the spectrograms without the effect of scale indeterminacy.
One side-effect of this back projection is that the value of the negative log-likelihood function (18) is also changed due to the scale modification. To prevent such likelihood variation, the following updates are required after performing (26):
$$y_{ijn} \leftarrow \tilde{y}_{ijnm_{\mathrm{ref}}}, \tag{27}$$
$$w_{in} \leftarrow w_{in} \lambda_{inm_{\mathrm{ref}}}^{*}, \tag{28}$$
$$t_{ikn} \leftarrow t_{ikn} |\lambda_{inm_{\mathrm{ref}}}|^2, \tag{29}$$
where $m_{\mathrm{ref}}$ is the index of the reference channel utilized in the back projection, and $\cdot^{*}$ denotes the complex conjugate. Note that these updates (27)-(29) are not mandatory if the value of the negative log-likelihood is not important. Basically, they are recommended merely for monitoring purposes. The overall algorithm of the proposed Consistent ILRMA is summarized in Algorithm 1. Note that the MATLAB and Python codes of the standard ILRMA are openly available on the web (https://github.com/d-kitamura/ILRMA, and https://pyroomacoustics.readthedocs.io/en/pypi-release/pyroomacoustics.bss.ilrma.html, respectively), and therefore the proposed Consistent ILRMA can be easily implemented by slightly modifying the codes.
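The effect of the back projection on the likelihood can be verified numerically. The sketch below draws random parameters, takes the back-projection coefficients from the inverse demixing matrices as in (26), rescales the separated signals, demixing matrices, and basis matrices accordingly, and compares the value of (18) before and after. The array layouts (`W` as `(I, N, M)` with rows $w_{in}^{\mathrm{H}}$, `T` as `(I, K, N)`, `V` as `(K, J, N)`), the zero-based reference index, and the random data are assumptions of this example.

```python
import numpy as np

def neg_log_likelihood(Y, R, W):
    """Cost (18) up to a constant, from separated signals, variances, and demixing matrices."""
    J = Y.shape[1]
    return np.sum(np.abs(Y) ** 2 / R + np.log(R)) \
        - 2 * J * np.sum(np.log(np.abs(np.linalg.det(W))))

rng = np.random.default_rng(0)
I, J, N, K, m_ref = 4, 30, 2, 3, 0                       # m_ref is zero-based here
X = rng.standard_normal((I, J, N)) + 1j * rng.standard_normal((I, J, N))
W = rng.standard_normal((I, N, N)) + 1j * rng.standard_normal((I, N, N))
T = rng.uniform(0.1, 1.0, (I, K, N))
V = rng.uniform(0.1, 1.0, (K, J, N))

Y = np.einsum('inm,ijm->ijn', W, X)                      # y_ijn = w_in^H x_ij
R = np.einsum('ikn,kjn->ijn', T, V)
before = neg_log_likelihood(Y, R, W)

# lambda_in is the n-th column of W_i^{-1}; keep only the reference channel m_ref.
lam = np.linalg.inv(W)[:, m_ref, :]                      # lam[i, n] = lambda_{in m_ref}

Y_bp = Y * lam[:, None, :]                               # scale-fitted separated signals
W_bp = W * lam[:, :, None]                               # row w_in^H scaled by lambda
T_bp = T * np.abs(lam[:, None, :]) ** 2                  # variances scaled by |lambda|^2
R_bp = np.einsum('ikn,kjn->ijn', T_bp, V)
after = neg_log_likelihood(Y_bp, R_bp, W_bp)
```

Scaling the row $w_{in}^{\mathrm{H}}$ by $\lambda_{inm_{\mathrm{ref}}}$ changes the log-determinant term by exactly the amount that the rescaled variances add to the log-variance term, which is why `before` and `after` agree.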

Conditions
We conducted determined BSS experiments using music and speech mixtures with two sources and two microphones ($N = M = 2$). The dry sources of music and speech signals, listed in Table 1, were respectively obtained from the professionally produced music and underdetermined separation tasks that are provided as a part of SiSEC2011 [45]. They were convolved with the impulse response E2A ($T_{60} = 300$ ms) or JR2 ($T_{60} = 470$ ms), obtained from the RWCP database [46], to simulate the multichannel observation signals. The recording conditions of these impulse responses are illustrated in Fig. 1.
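The simulation of observation signals by convolution can be sketched as follows. The impulse responses here are synthetic, exponentially decaying noise used as a placeholder for measured responses such as E2A or JR2, and the sampling rate, signal length, and the function name `simulate_mixture` are likewise assumptions of this example.

```python
import numpy as np

def simulate_mixture(sources, impulse_responses):
    """Convolve each dry source with the response to each microphone and sum over sources.

    sources           : array of shape (N, L)
    impulse_responses : array of shape (M, N, H), response from source n to microphone m
    returns           : array of shape (M, L + H - 1)
    """
    N, L = sources.shape
    M, _, H = impulse_responses.shape
    x = np.zeros((M, L + H - 1))
    for m in range(M):
        for n in range(N):
            x[m] += np.convolve(sources[n], impulse_responses[m, n])
    return x

rng = np.random.default_rng(0)
fs = 16000                                        # assumed sampling rate in Hz
N, M, L, H = 2, 2, fs // 2, int(0.3 * fs)         # 0.5 s sources, 300 ms placeholder responses
sources = rng.standard_normal((N, L))             # stand-ins for dry music or speech signals
decay = np.exp(-6.9 * np.arange(H) / H)           # roughly 60 dB amplitude decay over H samples
impulse_responses = rng.standard_normal((M, N, H)) * decay
mixture = simulate_mixture(sources, impulse_responses)
```

Because the responses span many samples, the instantaneous model (11) holds only approximately in the STFT domain unless the analysis window is long relative to the response.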
We compared the performance of the proposed Consistent ILRMA with the conventional ILRMA [11]. The nonnegative matrices $T_n$ and $V_n$ were initialized using uniformly distributed random values in the range $(0, 1)$, and $W_i$ was initialized to the identity matrix. Five trials were performed for each condition using different pseudorandom seeds. The number of bases for each source, $K$, was set to 10 for music mixtures and 2 for speech mixtures, where it was experimentally confirmed that these conditions provide the best performance for the conventional ILRMA [11]. To satisfy the perfect reconstruction condition (7), the inverse STFT was implemented with the canonical dual of the analysis window. For both the conventional and proposed ILRMA, the iterative back projection (26)-(29) was applied, where the reference channel was set to $m_{\mathrm{ref}} = 1$. Since the property of spectrogram consistency depends on the window length and shift length, various combinations of them were tested. The experimental conditions are summarized in Table 2. As an evaluation score, we used the improvement of the source-to-distortion ratio (SDR) [47], which shows the overall separation accuracy including both the degree of separation and the absence of artificial noise.
Fig. 2 shows some examples of the value of the negative log-likelihood function (18) of the proposed Consistent ILRMA. Although the algorithmic convergence of the proposed method has not been theoretically justified because of the additional projection (25), we experimentally confirmed a smooth decrease of the cost function. We also confirmed that such behavior was common for the other experimental conditions and mixtures. This result indicates that the additional procedure in the proposed method does not have a harmful effect on the behavior of the overall algorithm.
Fig. 3 shows some examples of the energy of the inconsistent components (14) of the proposed Consistent ILRMA, where they were normalized by the energy of the observed spectrograms in order to align the magnitude. These values are exactly zero when the separated spectrograms are consistent, and hence those at the 0th iteration (the leftmost values) are zero because no processing was performed at that point. By iterating the Consistent ILRMA, this energy rapidly increased because the demixing matrix for each frequency independently tries to process and separate the signals. However, the normalized energy tended to converge to specific values after several iterations. This result indicates that the proposed Consistent ILRMA reduces the ...
Since ILRMA assumes the instantaneous mixing model (11) for each frequency, the window length should be longer than the reverberation time to achieve accurate separation, as discussed in [48]. This can be easily confirmed from the results for the music mixtures (Figs. 4 and 5). The spectrograms of music signals often consist of vertical (percussive) and horizontal (harmonic) patterns that can be well approximated by NMF. Such patterns are preserved when the window length is made longer, and therefore longer windows resulted in better SDR because of the better approximation of the mixing model (11). In contrast, this tendency did not apply to the speech mixtures (Figs. 6 and 7), for which an optimal window length seems to exist around 256 ms. As speech signals vary rapidly with time, a long window may not be able to effectively capture their spectral structure. That is, the NMF-based source modeling fails for speech signals with a long window. These results indicate that the maximal achievable performance becomes higher as the window length becomes longer, but the source modeling becomes more difficult for ILRMA. This trade-off is important for discussing the results further.

Results
By comparing the performances of the conventional ILRMA (Conv. on the left) and the proposed Consistent ILRMA (Prop. on the right) in each subfigure of Figs. 4-7, it can be seen that the proposed method outperformed the conventional ILRMA in many situations. To summarize the experimental results, we list some notable tendencies as follows:
• When the window length is short (e.g., 64 and 128 ms), the proposed method has little advantage over the conventional ILRMA. This should be because the achievable performance is already limited by a window length shorter than the reverberation time. This result contradicted our expectation before performing the experiment. Since enforcing the consistency spreads the frequency components based on the main-lobe of the window function, we expected that the ability to solve the permutation problem would be higher when the window length is shorter because of the wider main-lobe. In reality, the spectrogram consistency can assist ILRMA only when the window length is sufficiently long compared to the reverberation time.
• When the window length is sufficiently long compared to the reverberation time, the proposed method can outperform the conventional ILRMA. Some situations achieved a notably good performance (e.g., consistency improved SDR by more than 3 dB in terms of the median). This should be because the permutation problem is alleviated by enforcing the consistency, which ties the adjacent frequency components together.
• When the shift length is large (e.g., 1/2 overlap), the performance of ILRMA drops remarkably, especially when the window length is long. This should be because the number of time segments utilized for modeling was reduced, i.e., NMF failed to model the source signals from the given amount of data. In contrast, the proposed method was able to prevent such performance degradation in many situations. This might be because the smoothing process of the inverse STFT provides some additional information for the source modeling from the adjacent bins.
• The proposed method tends to achieve a good performance when the conventional ILRMA also works well. This tendency indicates that the spectrogram consistency effectively promotes the separation when the source model (NMF in the case of ILRMA) properly fits the source signals (e.g., music signals in Figs. 4 and 5). The opposite situation can also be seen in the speech mixture cases (Figs. 6 and 7). This is the reason why we say that the consistency can assist frequency-domain BSS. An important aspect is that the source model actually performs the separation, and the spectrogram consistency enhances the separation performance when the source modeling functions correctly. Note that such assistance may produce a large difference in the separation performance, as can be seen in the bottom right subfigure of Fig. 4, where the spectrogram consistency improved SDR by nearly 8 dB in terms of the median.
• When the window length is very long (e.g., 768 and 1024 ms), the performance of the conventional ILRMA tends to degrade compared to the middle-length window case (e.g., 512 ms). This should be because the main-lobe of a window function gets steeper as the window length becomes longer, which makes the NMF-based approximation difficult because the spectral patterns become more sensitive to a slight variation of frequency. In contrast, the proposed method was able to avoid such performance degradation for music signals in Figs. 4 and 5. This should be because the smoothing process of the inverse STFT alleviated such difficulty by relating the time-frequency bins to each other.
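The dependence of the main-lobe on the window length, which underlies several of the tendencies above, can be checked numerically. The sketch below estimates the 3 dB half-width of the main-lobe of a Hann window from a zero-padded DFT. The choice of the Hann window, the window lengths in samples, and the function name `mainlobe_halfwidth` are assumptions of this example and are not taken from the experimental conditions.

```python
import numpy as np

def mainlobe_halfwidth(Q, n_fft=1 << 16):
    """Normalized frequency (cycles/sample) where a length-Q Hann window's
    magnitude response first falls 3 dB below its peak."""
    w = np.hanning(Q + 1)[:-1]                  # periodic Hann window
    mag = np.abs(np.fft.rfft(w, n_fft))         # zero-padded DFT for a fine frequency grid
    mag /= mag[0]                               # the peak of the main-lobe is at frequency zero
    k = np.argmax(mag < 1 / np.sqrt(2))         # first grid point below -3 dB
    return k / n_fft

window_lengths = [256, 512, 1024]               # hypothetical window lengths in samples
halfwidths = [mainlobe_halfwidth(Q) for Q in window_lengths]
```

The half-width shrinks roughly in inverse proportion to the window length, so a longer window smears each component over a narrower band of frequencies.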

Conclusion
In this paper, we proposed a new variant of the state-of-the-art determined BSS algorithm called the Consistent ILRMA. It utilizes the smoothing effect of the inverse STFT in order to assist the separation and enhance the performance. The experimental results showed that the proposed method can improve the separation performance when the window length is sufficiently long (and when the source model can properly fit to the actual source signals). This paper has demonstrated the potential of considering the spectrogram consistency within the state-of-the-art determined BSS algorithm. It should be possible to construct a new source model in consideration of the spectrogram consistency, which can be the next direction of research on determined BSS.