Blind Source Separation with Optimal Transport Non-negative Matrix Factorization

Optimal transport as a loss for machine learning optimization problems has recently gained a lot of attention. Building upon recent advances in computational optimal transport, we develop an optimal transport non-negative matrix factorization (NMF) algorithm for supervised speech blind source separation (BSS). Optimal transport allows us to design and leverage a cost between short-time Fourier transform (STFT) spectrogram frequencies, which takes into account how humans perceive sound. We give empirical evidence that using our proposed optimal transport NMF leads to perceptually better results than Euclidean NMF, for both isolated voice reconstruction and BSS tasks. Finally, we demonstrate how to use optimal transport for cross domain sound processing tasks, where frequencies represented in the input spectrograms may be different from one spectrogram to another.


Introduction
Blind source separation (BSS) is the task of separating a mixed signal into different components, usually referred to as sources. In the context of sound processing, it can be used to separate speakers whose voices have been recorded simultaneously. A common way to address this task is to decompose the signal spectrogram by non-negative matrix factorization (NMF, Lee and Seung 2001), as proposed for example by Schmidt and Olsson (2006) as well as Sun and Mysore (2013). Denotingx j,i the (complex) short-time Fourier transform (STFT) coefficient of the input signal at frequency bin j and time frame i, and X its magnitude spectrogram defined as x j,i = |x j,i |, the BSS problem can be tackled by solving the NMF problem min D (1) ...D (N) where N is the number of sources, t is the number of time windows, x i is the i th column of X and ℓ is a loss function. Each dictionary matrix D (k) and weight matrix W (k) are related to a single source. In a supervised setting, each source has training data and all the D (k) s are learned in advance during a training phase. At test time, given a new signal, separated spectrograms are recovered from the D (k) s and W (k) s and corresponding signals can be reconstructed with suitable post-processing. Several loss functions ℓ have been considered in the literature, such as the squared Euclidean distance (Lee and Seung 2001;Schmidt and Olsson 2006), the Kullback-Leibler divergence (Lee and Seung 2001;Sun and Mysore 2013) or the Itakura-Saito divergence (Févotte et al. 2009;Sawada et al. 2013).
In the present article, we propose to use optimal transport as a loss between spectrograms to perform supervised speech BSS with NMF. Optimal transport is defined as the minimum cost of moving the mass from one histogram to another. By taking into account a transportation cost between frequencies, this provides a powerful metric to compare STFT spectrograms. One of the main advantage of using optimal transport as a loss is that it can quantify the amplitude of a frequency shift noise, coming for example from quantization or the tuning of a musical instrument. Other metrics such as the Euclidean distance or Kullback-Leibler divergence, which compare spectrograms element-wise, are almost blind to this type of noise (see Figure  1). Another advantage over element-wise metrics is that optimal transport enables the use of different quantizations, i.e. frequency supports, at training and test times. Indeed, the frequencies represented on a spectrogram depend on the sampling rate of the signal and the time-windows used for its computation, both of which can change between training and test times. With optimal transport, we do not need to re-quantize the training and testing data so as they share the same frequency support: optimal transport is well-defined between spectrograms with distinct supports as long as we can define a transportation cost between frequencies. Finally, the optimal transport framework enables us to generalize the Wiener filter, a common post-processing for source separation, by using optimal transport plans, so that it can be applied to data quantized on different frequencies.  Figure 1: Comparison of Euclidean distance and (regularized) optimal transport losses. Synthetic musical notes are generated by putting weight on a fundamental, and exponentially decreasing weights on its harmonics and subharmonics, and finally convoluting with a Gaussian. Left: examples of the spectrograms of two such notes. Right: (regularized) optimal transport loss and Euclidean distance from the note of fundamental 0.95kHz (red line on the left plot) to the note of fundamental 0.95kHz+σ, as functions of σ. The Euclidean distance varies sharply whereas the optimal transport loss captures more smoothly the change in the fundamental. The variations of the optimal transport loss and its regularized version are similar, although the regularized one can become negative.
NMF with an optimal transport loss was first proposed by Sandler and Lindenbaum (2009). They solved this problem by using a bi-convex formulation and relied on an approximation of optimal transport based on wavelets (Shirdhonkar and Jacobs 2008). Recently, Rolet et al. (2016) proposed fast algorithms to compute NMF with an entropy-regularized optimal transport loss, which are more flexible in the sense that they do not require any assumption on the frequency quantization or on the cost function used.
Using optimal transport as a loss between spectrograms was also proposed by Flamary et al. (2016) under the name "optimal spectral transportation". They developed a novel method for unsupervised music transcription which achieves state-of-the-art performance. Their method relies on a cost matrix designed specifically for musical instruments, allowing them to use Diracs as dictionary columns. That is, they fix each dictionary column to a vector with a single non-zero entry and learn only the corresponding coefficients. This trivial structure of the dictionary results in efficient coefficient computation. However, this approach cannot be applied as is to speech separation since it relies on the assumption that a musical note can be rep-resented as its fundamental. It also requires designing the cost of moving the fundamental to its harmonics and neighboring frequencies. Because human voices are intrinsically more complex, it is therefore necessary to learn both the dictionary and the coefficients, i.e., solve full NMF problems.

Our contributions
In this paper, we extend the optimal transport NMF of Rolet et al. (2016) to the case where the columns of the input matrix X are not normalized in order to propose an algorithm which is suitable for spectrogram data. Normalizing all time frames so that they have the same total weight is not desirable in sound processing tasks because it would amplify noise. We define a cost between frequencies so that the optimal transport objective between spectrograms provides a relevant metric between them. We apply our NMF framework to single voice reconstruction and blind source separation and show that an optimal transport loss provides better results over the usual squared Euclidean loss. Finally, we show how to use our framework for cross domain BSS, where frequencies represented in the test spectrograms may be different from the ones in the dictionary. This may happen for example when train and test data are recorded with different equipment, or when the STFT is computed with different parameters.

Notations
We denote matrices in upper-case, vectors in bold lower-case and scalars in lower-case. If M is a matrix, M ⊤ is its transpose, m i is its i th column and m j: its j th row. 1 n denotes the all-ones vector in R n ; when the dimension can be deduced from context we simply write 1. For two matrices A and B of the same size, we denote their inner product A, B := tr A ⊤ B . We denote Σ n the (n − 1)-dimensional simplex: Σ n := x ∈ R n + : x 1 = 1 .

Background
We start by introducing optimal transport, its entropy regularization, which we will use as the loss ℓ, and previous works on optimal transport NMF. For a more comprehensive overview of optimal transport from a computational perspective, see Peyré and Cuturi (2017).

Optimal Transport
Exact Optimal Transport. Let a ∈ Σ m , b ∈ Σ n . The polytope of transportation matrices between a and b is defined as Given a cost matrix C ∈ R m×n , the minimum transportation cost between a and b is defined as When n = m and the cost matrix is the p-th power (p 1) of a distance matrix, i.e. c i,j = d(y i , y j ) p for some (y i ) in a metric space (Ω, d), then OT(·, ·) 1/p is a distance on the set of vectors in R n + with the same ℓ-1 norm (Villani 2003 Theorem 7.3). We can see the vectors y i as features, and a and b as the quantization weights of the data onto these features. In sound processing applications, the vectors y i are real numbers corresponding to the frequencies of the spectrogram and a and b are their corresponding magnitude. By computing the minimal transportation cost between frequencies of two spectrograms, optimal transport exhibits variations in accordance with the frequency noise involved in the signal generative process, which results for instance from the tuning of musical instruments or the subject's condition in speech processing. Unnormalized Optimal Transport. In this work, we wish to define optimal transport when a and b are non-negative but not necessarily normalized. Note that the transportation polytope is not empty as long as a and b sum to the same value: U (a, b) = ∅ iif a 1 = b 1 . Hence, we define optimal transport between possibly unnormalized vectors a and b as, Computing the optimal transport cost (1) amounts to solve a linear program (LP) which can be done with specialized versions of the simplex algorithm with worst-case complexity in O(n 3 log n) when n = m (Orlin 1997). When considering OT as a loss between histograms supported on more than a few hundred bins, such computation becomes quickly intractable. Moreover, using OT as a loss involves differentiating OT, which is not differentiable everywhere. Hence, one would have to resort to subgradient methods. This would be prohibitively slow since each iteration would require to obtain a subgradient at the current iterate, which requires to solve the LP (1). Entropy Regularized Optimal Transport. To remedy these limitations, Cuturi (2013) proposed to add an entropy-regularization term to the optimal transport objective, thus making the OT loss differentiable everywhere and strictly convex. This entropy-regularized optimal transport has since been used in numerous works as a loss for diverse tasks (see for example Gramfort et al. 2015;Frogner et al. 2015;Rolet et al. 2016).
Let γ > 0, we define the (unnormalized) entropy-regularized OT between a ∈ R m where E(T ) := ij T ij log T ij is the entropy of the transport plan T . Let us denote OT ⋆ γ the convex conjugate of OT γ with respect to its second variable (2016) showed that its value and gradient can be computed in closed-form:

Optimal Transport NMF
NMF can be cast as an optimization problem of the form where both D and W are optimized at train time, and D is fixed at test time.
When ℓ is OT, problem (2) is convex in W and D separately, but not jointly. It can be solved by alternating full optimization with respect to W and D.
Each resulting sub-problem is a very high dimensional linear program with many constraints (Sandler and Lindenbaum 2009), which is intractable with standard LP solvers even for short sound signals. In addition, convergence proofs of alternate minimization methods for NMF typically assume strictly convex sub-problems (see e.g. Tropp 2003; Bertsekas 1999 Prop. 2.7.1), which is not the case when using non-regularized OT as a loss.
To address this issue, Rolet et al. (2016) proposed to use OT γ instead, and showed how to solve each sub-problem in the dual using fast gradient computations. Formally, they tackle problems of the form: where R 1 and R 2 are convex regularizers that enforce non-negativity constraints, and Σ n is the (n − 1)-dimensional simplex. It was shown that each sub-problem of (3) with either D or W fixed has a smooth Fenchel-Rockafellar dual, which can be solved efficiently, leading to a fast overall algorithm. However, their definition of optimal transport requires inputs and reconstructions to have a ℓ-1 norm equal to 1. This is achieved by normalizing the input beforehand, restricting the columns of D and W to the simplex, and using as regularizers negative entropies defined on the simplex: They showed that the coefficients and dictionary can be updated according to the following duality results. Coefficients Update. For D fixed, the optimizer of We can solve Problem (5) with accelerated gradient descent (Nesterov 1983), and recover the optimal weight matrix with the primal-dual relationship (4). The value and gradient of the convex conjugate of R with respect to its second variable are: Dictionary Update. For W fixed, the optimizer of Likewise, we can solve Problem (7) with accelerated gradient descent, and recover the optimal dictionary matrix with the primal-dual relationship (6). These duality results allow us to go from a constrained primal problem for which each evaluation of the objective and its gradient requires solving an optimal transport problem, to a non-constrained dual problem whose objective and gradient can be evaluated in closed form. The primal constraints x i 1 = Dw i 1 and Dw i ≥ 0 ∀i are enforced by the primal-dual relationship. Moreover, the use of an entropy regularization, with γ > 0, makes OT γ smooth with respect to its second variable.

Method
We now present our approach for optimal transport BSS. First we introduce the changes to Rolet et al. (2016) that are necessary for computing optimal transport NMF on STFT spectrograms of sound data. We then define a transportation cost between frequencies. Finally we show how to reconstruct sound signals from the separated spectrograms.

Signal Separation With NMF
We use a supervised BSS setting similar to the one described in Schmidt and Olsson (2006). For each source k we have access to training data X (k) , on which we learn a dictionary D (k) with NMF Then, given the STFT spectrum of a mixture of voices X, we reconstruct separated spectrograms The separated signals are then reconstructed from each X (k) with the process described in Section 3.4.
In practice at test time, the dictionaries are concatenated in a single matrix D = (D (k) ) N k=1 , and a single matrix of coefficients W is learned, which we decompose as W = (W (k) ) N k=1 . This allows us to focus on problems of the form

Non-normalized Optimal Transport NMF
Normalizing the columns of the input X, as in Rolet et al. (2016), is not a good option in the context of signal processing, since frames with low amplitudes are typically noise and it would amplify them.
However, our definition of optimal transport does not require inputs to be in the simplex, only to have the same ℓ-1 norm. With this definition, the convex conjugate OT ⋆ of OT and its gradient still have the same value as in , and we can simply relax the condition on W to be W ≥ 0 in Problem (3). We keep a simplex constraint on the columns of the dictionary D so that each update is guaranteed to stay in a compact set. We use R 1 = −ρ 1 E, a negative entropy defined on the non-negative orthant as the coefficient matrix regularizer and for R 2 we keep the non-negative entropy defined on the simplex. The problem then becomes The dictionary update is the same as in Rolet et al. (2016). However, the coefficient updates need to be modified as follows. Coefficients Update. For D fixed, the optimizer of The concave conjugate of E and its gradient can be evaluated with:

Cost Matrix Design
In order to compute optimal transport on spectrogams and perform NMF, we need a cost matrix C, which represents the cost of moving weight from frequencies in the original spectrogram to frequencies in the reconstructed spectrogram. Schmidt and Olsson (2006) use the mel scale to quantize spectrograms, relying on the fact that the perceptual difference between frequencies is smaller for the high frequency than for the low frequency domain. Following the same intuition, we propose to map frequencies to a log-domain and apply a cost function in that domain. Let f j be the frequency of the j-th bin in an input data spectrogram, where 1 ≤ j ≤ m. Letfĵ be the frequency of theĵ-th bin in a reconstruction spectrogram, where 1 ≤ĵ ≤ n. We define the cost matrix C ∈ R m×n as c jĵ = log(λ + f j ) − log(λ +fĵ) p with parameters λ ≥ 0 and p > 0. Since the mel scale is a log scale, it is included in this definition for some parameter λ. Some illustrations of our cost matrix for different values of λ are shown in Figure 2, with p = 0.5. It shows that with our definition, moving weights locally is less costly for high frequencies than low ones, and that this effect can be tuned by selecting λ.  Figure 3 shows the effect of p on the learned dictionaries. Using p = 0.5 yields a cost that is more spiked, leading to dictionary elements that can have several spikes in the same frequency bands, whereas p ≥ 1 tends to produce smoother dictionary elements. Note that with this definition and p ≥ 1 , C is a distance matrix to the power p when the source and target frequencies are the same. If p = 0.5, C is the point-wise square-root of a distance matrix and as such is a distance matrix itself. OT(., .) 1/p . Parameters p = 0.5 and λ = 100 yielded better results for Blind Source Separation on the validation set and were accordingly used in all our experiments.

Post-processing
Wiener Filter. In the case where the reconstruction is in the same frequency domain as the original signal, the classical way to recover each voice in the time domain is to apply a Wiener filter. Let X be the original Fourier spectrum, X (1) and X (2) the separated spectra such that X ≈ X (1) + X (2) . The Wiener filter buildsX (1) = X ⊙ X (1) X (1) +X (2) andX (2) = X ⊙ X (2) X (1) +X (2) , before applying the original spectra's phase and performing the inverse STFT. Generalized Filter. We propose to extend this filtering to the case where X (1) and X (2) are not in the same domain as X. This may happen for example if the test data is recorded using a different sample frequency, or if the STFT is performed with a different time-window than the train data. In such a case, D (1) and D (2) are in the domain of the train data, and to are X (1) and X (2) , but X is in a different domain, and its coefficients correspond to different sound frequencies. As such, we cannot use Wiener filtering.
Instead we propose to use the optimal transportation matrices to produce separated signalsX (1) andX (2) in the same domain as X. Let T (i) ∈ argmin 1 and x (2) 2 . We use the same idea and separate the transport matrix T (i) into: Similarly to the classical Wiener filter, we havê Heuristic Mapping. As an alternative to this generalized filter, we propose to simply map the reconstructed signal to the same domain as X by assigning the weight of af j in a spectrogram to its closest neighbor in (f i ) n i=1 , according to the distance we defined for the cost matrix (see Section 3.3). Separated Signal Reconstruction. Separated sounds are reconstructed by inverse STFT after applying a Wiener filter or generalized filter to X (1) and X (2) .

Results
In this section we present the main empirical findings of this paper. We start by describing the dataset that we used and the pre-processing we applied to it. We then show that the optimal transport loss allows us to have perceptually good reconstructions of single voices, even with few dictionary elements. Finally we show that the optimal transport loss improves upon a Euclidean loss for BSS with an NMF model, both in single-domain and cross-domain settings.

Dataset and Pre-processing
We evaluate our method on the English part of the Multi-Lingual Speech Database for Telephonometry 1994 dataset 1 . The data consists of recordings of the voice of four males and four females pronouncing each 24 different English sentences. We split each person's audio file time-wise into 25%-75% train-test data. The files are re-sampled to 16kHz and treated as mono signal.
One of the male voices and one of the female voices are only used for hyper-parameter selection, and are not included in the results.
The signals are analysed by STFT with a Hann window, and a windowsize of 1024, leading to 513 frequency bins ranging from 0 to 8kHz. The constant coefficient is removed from the NMF analysis and added for reconstruction in post-processing.
Hyper-parameters are selected on validation data consisting if the first male and female voice, which are excluded from the evaluation set.
Initialization is performed by setting each dictionary column to the optimal transport barycenter of all the time frames of the training data, to which we added Gaussian noise (separately for each column). The barycenters are computed using the algorithm of Benamou et al. (2015).

NMF Audio Quality
We first show that using an optimal transport loss for NMF leads to better perceptual reconstruction of voice data. To that end, we evaluated the PEMO-Q score (Huber and Kollmeier 2006) of isolated test voices. The dictionaries are learned on the isolated voices in the train dataset, and are the same as in the following separation experiment. Figure 4 shows the mean and standard deviation of the scores for k ∈ {5, 10, 15, 20} with optimal transport and Euclidean NMF. The PEMO-Q score of optimal transport NMF is significantly higher for any value of k. We found empirically that other scores such as SDR or SNR tend to be better for the Euclidean NMF, even though the reconstructed voices are clearly worse when listening to them (see additional files 1 and 2). Optimal transport can reconstruct clear and intelligible voices with as few as 5 dictionary elements.

Blind Source Separation
We evaluate our Blind Source Separation using the PEASS score proposed in Emiya et al. (2011), which they claim is closer to how humans would score BSS than SDR. We only consider mixtures of two voices, where the mixture is simply an addition of the sound signals. Single-Domain Blind Source Separation. We first show that using an optimal transport NMF improves on Euclidean NMF for BSS using the same frequencies in the spectrogram of the train and test data. In this experiment, both the training and test data are processed in exactly the same way, so that at train and test time (f i ) i = (f i ) i . For Euclidean-based BSS, we reconstruct the signal using a Wiener filter before applying inverse STFT. For optimal transport-based source separation, we evaluate separation using either the Wiener filter or our generalized filter. Figure 5 shows mean and standard deviation of the PEASS scores for k ∈ {5, 10, 15, 20}. The scores are higher with k = 5 or k = 10 and in both cases optimal transport yields better results. Figure 6 shows a comparison for each pair of mixed voices, with k selected on the validation set (k = 5 for Euclidean and k = 10 for optimal transport NMF). It shows that the PEASS score is better with an optimal transport loss for almost all files. We can further see that in the case of single domain BSS, the Wiener filter and our generalized Wiener filter yields very similar results. Cross-Domain Blind Source Separation. In this experiment, we keep the dictionaries trained for the single domain experiment, but we re-process the test data with a different time-window of 600 for the STFT. Although (f i ) i = (f i ) i , we can still compute optimal transport between the spectrograms thanks to our cost matrix. Figure 7 shows the resuts on the train set. The score for Euclidean NMF is computed by first mapping the test data to the same domain as the train data, using heuristic mapping, and then performing same-domain separation. Both the heuristice mapping and generalized filter improve upon using Euclidean NMF, and they both achieve similar results. Still, the use of our generalized filter allows to have the exact same processing whether performing single domain or cross domain separation, the only difference being the cost matrix C, while the heuristic mapping requires additional post-processing and also requires to choose rules for the mapping.

Discussion
Regularization of the Transport Plan. In this work we considered entropy-regularized optimal transport as introduced by Cuturi (2013). This allows us to get an easy-to-solve dual problem since its convex conjugate is smooth and can be computed in closed form. However, any convex regularizer would yield the same duality results, and could be considered as long as its conjugate is computable. For instance, the squared L 2 norm regularization was considered in several recent works (Blondel et al. 2018;Seguy et al. 2017) and was shown to have desirable properties such as better numerical stability or sparsity of the optimal transport plan. Moreover, similarly to entropic regularization, it was shown that the convex conjugate and its gradient can be computed in closed form (Blondel et al. 2018). Comparison between optimal transport NMF and Euclidean NMF(left) or optimal transport NMF with generalized Wiener filter (right). Each data-point represents the PEASS scores of one file when mixed with another, where the x coordinate is the optimal transport with Wiener filter's score and the y coordinate is the score of the compared method. Comparison between optimal transport NMF with generalized Wiener filter and Euclidean NMF (left) or optimal transport NMF with heuristic mapping (right) on the cross domain speech separation task. Each data-point represents the PEASS scores of one file when mixed with another, where the x coordinate is the optimal transport with generalized Wiener filter's score and the y coordinate is the score of the compared method.
Learning Procedure. Following the work of Rolet et al. (2016), we solved the NMF problem with an alternating minimization approach, in which at each iteration a complete optimization is performed on either the dictionary or the coefficients. While this seems to work well in our experiments, it would be interesting to compare with smaller steps approach like in Lee and Seung (2001). Unfortunately such updates do not exist to our knowledge: gradient methods in the primal would be prohibitively slow, since they involve solving t large optimal transport problems at each iteration.