Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings
 Julio J Carabias-Orti^{1},
 Máximo Cobos^{2},
 Pedro Vera-Candeas^{3} and
 Francisco J Rodríguez-Serrano^{3}
https://doi.org/10.1186/168761802013184
© Carabias-Orti et al.; licensee Springer. 2013
Received: 29 June 2013
Accepted: 21 November 2013
Published: 13 December 2013
Abstract
Close-microphone techniques are extensively employed in many live music recordings, allowing for interference rejection and reducing the amount of reverberation in the resulting instrument tracks. However, despite the use of directional microphones, the recorded tracks are not completely free from source interference, a problem commonly known as microphone leakage. While source separation methods are potentially a solution to this problem, few approaches take into account the large amount of prior information available in this scenario. In fact, besides the special properties of close-microphone tracks, knowledge of the number and type of instruments making up the mixture can also be exploited for improved separation performance. In this paper, a nonnegative matrix factorization (NMF) method making use of all the above information is proposed. To this end, a set of instrument models is learnt from a training database and incorporated into a multichannel extension of the NMF algorithm. Several options to initialize the algorithm are suggested, and their performance is explored on multiple music tracks and compared with other state-of-the-art approaches.
1 Introduction
Multitrack audio recording techniques are based on capturing and recording individual sound sources into multiple discrete audio channels. Once all the sound sources have been recorded, the individual tracks are processed and mixed down to a number of mixture channels that depends on the specific audio reproduction format. Multitrack recording techniques can be broadly classified into live recording and track-by-track recording techniques. In the latter type, the performers are individually recorded one after another, resulting in almost perfectly isolated instrument tracks. On the other hand, in live audio recordings, the source signals, which share the acoustic space, are all acquired simultaneously during the performance [1]. This leads to the well-known microphone leakage problem: the sounds coming from the concurrent sources are picked up by microphones other than the ones intended for the specific sources [2]. To address this issue in close-miking techniques, a directional microphone is placed relatively close to an instrument, reducing the interference from other sources and the effect of room reverberation. Other mechanical and signal processing devices, such as absorbing barriers or noise gates, are also employed by sound engineers to mitigate this problem, but they only solve it partially, being most effective when used with transient signals [3].
Sound source separation (SSS) techniques have been suggested as a potential solution for the microphone leakage problem in multitrack live recordings [3, 4]. In general, the aim of SSS is to recover each source signal from a set of audio mixtures. SSS techniques can be broadly divided into blind source separation (BSS) and informed source separation (ISS) algorithms. BSS methods are especially popular in the statistical signal processing and machine learning areas, where the term blind emphasizes that very little information about the sources or the mixing process is known a priori [5]. Techniques such as principal component analysis (PCA), independent component analysis (ICA) or nonnegative matrix factorization (NMF) [6] have been introduced both to reduce the dimensionality and to explain the whole data by a few meaningful elementary objects. In fact, many BSS approaches are closely related to ICA, where the sources are assumed to be statistically independent and non-Gaussian. Most of these approaches are oriented to the determined separation problem, i.e., the number of sources equals the number of mixture signals. When the number of sources is greater than the number of mixtures, the problem is said to be underdetermined, and the underlying assumptions usually involve the sparsity of the sources under some suitable representation such as the time-frequency domain [7, 8]. Moreover, the assumptions may also differ depending on the acoustic environment, leading to instantaneous or convolutive separation methods. Instantaneous mixing models (IMM) assume a mixing matrix made up of scalar coefficients, while convolutive models are often based on the estimation of unmixing filters [9]. When working in the frequency domain, the mixture can be assumed to be instantaneous at each frequency bin and standard approaches such as ICA can be applied by following a subband approach [10]. 
However, due to the ICA permutation ambiguity, an alignment procedure requiring some additional information is necessary to group the resulting components into estimated source signals. When the signal model is assumed to be nonnegative, NMF provides a meaningful structure of the audio data, which in this case is obtained from the magnitude or energy spectrograms. NMF methods have been shown to be especially useful for musical analysis tasks [11], including not only SSS but also others, such as automatic music transcription [12] or acoustic space characterization [13]. NMF is based on decomposing the spectrogram audio data into a sum of elementary spectral patterns with time-varying gains. While NMF was originally proposed in the context of monaural SSS, other extensions have been developed for dealing with multichannel audio mixtures [14]. As a result, NMF approaches are progressively becoming a promising solution to multichannel SSS. However, spectral patterns learnt by NMF-based approaches are often hard to interpret and lack explicit semantics. To overcome this issue, many algorithms constrain the original NMF to obtain musically meaningful patterns, for example, by considering a parametric model. In this context, the spectral patterns can be described by harmonic combs [15–17], spectrally and/or temporally localized Gaussians [18, 19], or by using a source/filter model [20–22].
In contrast to BSS, ISS methods start from available prior information, which can take the form of specific information about the sources, the mixing process, or additional modalities [23]. For example, an ISS method oriented to close-miking live music recordings could exploit the properties of this specific setup: each microphone signal contains one of the sources significantly enhanced over the others due to both the directional properties of the sensors and their placement. In this context, Kokkinis and Mourjopoulos [3] showed that under a close-miking assumption, a relatively simple Wiener filter outperforms some convolutive BSS algorithms. However, more sophisticated methods making use of additional prior information can be developed by considering a supervised separation framework. For example, musical score information can be used if the score and audio are well aligned [24–27]. Spectral information can be considered by using instrument models when the instruments are known in advance [28, 29]. Other kinds of information, such as high-level musicological knowledge, have been recently introduced by Fuentes et al. [30], using recent advances in shift-invariant analysis of musical data. Regarding factorization methods, an important issue to take into account is the initialization and constraint of the parameters. In this context, Hurmalainen et al. [31] proposed a method for automatic adaptation of learnt clean speech source models to deal with noise in a speech separation and recognition task. Furthermore, Fitzgerald [32] presented a framework that allows the user to interact with the tensor factorization method to improve the performance in an adaptive way. Finally, the prior information can be the sources themselves. This knowledge enables the computation of side information, which is small enough to be inaudibly embedded into the mixtures. At a decoding step, this small side information is used along with the mixtures to recover the sources. 
Following this scheme, Liutkus et al. [33] proposed a system coding approach that permits very reliable transmission of the sources with a small amount of side information.
In this paper, an informed NMF-based SSS method is presented to tackle the microphone leakage problem in multichannel close-microphone recordings. To this end, several assumptions are made about the mixing environment, affecting problem dimensionality, direct-to-reverberant sound ratio and available instrument priors. In this context, it is assumed that the number of source signals is equal to or less than the number of microphone signals, and that each mixture signal has a predominant direct-sound source resulting from a close-miking recording setup. Therefore, since the predominant source is captured with a high direct-to-reverberant ratio, an instantaneous model can be reasonably assumed, significantly simplifying the separation task. Since the method is constrained to be nonnegative, a panning matrix is used to describe the mixing process. Moreover, instrument model priors are obtained by means of a learning stage using a training database. The usefulness of these models is twofold. On the one hand, they enable an accurate estimation of the panning matrix. On the other hand, they simplify the separation stage by reducing the factorization to the estimation of instrument time-varying gains.
The paper is structured as follows. Section 2 provides an overview of the proposed SSS system and describes the fundamentals of NMF-based separation and instrument modeling. Section 3 describes the proposed multichannel extension for informed NMF-based separation using learnt instrument models. Panning matrix estimation and NMF-based separation are described in detail, explaining how the output of an automatic music transcription stage is used to discriminate single-source time-frequency zones. Section 4 describes the experiments conducted by using several music pieces in a simulated close-microphone setup and evaluates the separation performance by using objective measures. Finally, Section 5 summarizes the conclusions of this work.
2 Model description and background
2.1 System overview
2.2 NMF background
where g _{ n }(t) is the gain of the basis function n at frame t, and b _{ n }(f), n = 1,…,N are the bases. Note that this approach holds under two different configurations:
Therefore, whenever the model in Equation 1 is chosen, either assumption (a) or assumption (b) is supposed to hold, and the time-frequency (TF) representation considered is either a magnitude or power spectrogram for (a) or a power spectrogram only for (b).
where the time gains g _{ j,n }(t) and the harmonic amplitudes a _{ j,n }(h) are the parameters to be estimated.
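The two ingredients named above, harmonic amplitudes and time gains, can be illustrated with a minimal sketch. The equations themselves are not reproduced in this text, so the harmonic-comb form of the bases (a weighted sum of narrow lobes at multiples of the fundamental) and the product model x̂(f,t) = Σ_n g_n(t) b_n(f) are assumptions inferred from the surrounding definitions. The Gaussian lobe shape, the function name `harmonic_basis` and all numerical values are hypothetical.

```python
import numpy as np

def harmonic_basis(amplitudes, f0_hz, freqs_hz, lobe_width_hz=20.0):
    """Weighted sum of narrow lobes centred at integer multiples of f0.

    The Gaussian lobe is only a stand-in for the spectrum of the analysis
    window (an assumption made for this sketch).
    """
    basis = np.zeros_like(freqs_hz, dtype=float)
    for h, a_h in enumerate(amplitudes, start=1):
        basis += a_h * np.exp(-0.5 * ((freqs_hz - h * f0_hz) / lobe_width_hz) ** 2)
    return basis

freqs = np.linspace(0.0, 4000.0, 801)   # 5 Hz frequency grid (hypothetical)
a = np.array([1.0, 0.5, 0.25])          # harmonic amplitudes a(h), h = 1..3
f0s = [440.0, 660.0]                    # two hypothetical notes

# One column per note: b_n(f)
B = np.stack([harmonic_basis(a, f0, freqs) for f0 in f0s], axis=1)

# Time-varying gains g_n(t): first note in frames 0-1, second in frames 2-3
G = np.array([[1.0, 0.8, 0.0, 0.0],
              [0.0, 0.0, 0.6, 0.3]])

# Modelled spectrogram: x_hat(f, t) = sum_n g_n(t) * b_n(f)
X_hat = B @ G
```

In this form, only the amplitudes `a` and the gains `G` would be free parameters; the comb structure itself is fixed by the note's fundamental frequency.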
2.3 Augmented NMF for parameter estimation
The main advantage of the MU in Equation 5 is that it ensures nonnegativity of all parameters provided that they are nonnegative at initialization.
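The nonnegativity property can be checked numerically. The sketch below uses the standard multiplicative update of the gains for fixed bases under the beta-divergence; it is assumed, not confirmed, that this matches Equation 5 up to notation. The function name `update_gains` and the toy data are hypothetical.

```python
import numpy as np

def update_gains(X, B, G, beta=1.0, eps=1e-12):
    """One multiplicative update of the gains G for fixed bases B.

    Standard beta-divergence rule: G <- G * (B^T (X * Y^(beta-2))) / (B^T Y^(beta-1)),
    with Y = B G. Every factor is nonnegative, so G stays nonnegative.
    """
    Y = B @ G + eps
    numerator = B.T @ (X * Y ** (beta - 2.0))
    denominator = B.T @ (Y ** (beta - 1.0)) + eps
    return G * numerator / denominator

rng = np.random.default_rng(1)
X = rng.random((8, 10)) + 0.1    # toy nonnegative spectrogram
B = rng.random((8, 4)) + 0.1     # toy fixed bases
G = rng.random((4, 10))          # nonnegative initialization
G[0, :] = 0.0                    # a gain track initialized to zero

for _ in range(50):
    G = update_gains(X, B, G)
```

Because the update only multiplies the current value by a nonnegative factor, an entry that starts at zero stays at zero, which is the property exploited later when gains are initialized from a transcription.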
2.4 Instrument modeling
As demonstrated in [22], when appropriate training data are available, it is advantageous to learn the instrument-dependent bases in advance and fix them during the analysis of the signals. In fact, this approach has been shown to perform well when the conditions of the music scene do not differ too much between the training and the test data. Here, we have used an approach similar to [43]. Specifically, the amplitudes of each note of a musical instrument a _{ j,n }(h) are learnt in advance by using the Real World Computing (RWC) music database [48, 49] as a training database of solo instruments playing isolated notes (more details in Section 4.1). Let the ground-truth transcription of the training data be represented by R _{ j,n }(t) as a binary time/frequency matrix for each instrument j. The frequency dimension represents the musical instrument digital interface (MIDI) scale and the time dimension t represents the frames. At the training stage, gains are initialized with R _{ j,n }(t), which is known in advance for the training database. Thus, gains are set to unity for each pitch at those time frames where the instrument is active, while the rest of the gains are set to zero. Note that gains initialized to zero remain at zero because of the multiplicative update rules, and therefore each frame is represented only with the correct pitch.
The training procedure is summarized in Algorithm 1.
Algorithm 1 Instrument modeling algorithm
The training algorithm computes the basis functions b _{ j,n }(f) required at the factorization stage for each instrument. Since these instrument-dependent basis functions b _{ j,n }(f) are known and held fixed, the factorization of new signals of the same instrument reduces to the estimation of the gains g _{ j,n }(t).
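The training idea described above can be sketched as follows. This is a simplification and not Algorithm 1 itself: it updates the bases directly with standard multiplicative rules instead of updating harmonic amplitudes, and the function name `train_bases` and the toy data are hypothetical. What it does show is the role of the binary transcription R as the gain initialization.

```python
import numpy as np

def train_bases(X, R, n_iter=50, beta=1.0, eps=1e-12, seed=0):
    """Learn bases B (F x N) from a spectrogram X (F x T) of isolated notes.

    R (N x T) is the binary ground-truth transcription used to initialize
    the gains; entries that start at zero stay at zero under the updates.
    """
    rng = np.random.default_rng(seed)
    B = rng.random((X.shape[0], R.shape[0])) + eps
    G = R.astype(float).copy()
    for _ in range(n_iter):
        Y = B @ G + eps
        B *= ((X * Y ** (beta - 2)) @ G.T) / ((Y ** (beta - 1)) @ G.T + eps)
        Y = B @ G + eps
        G *= (B.T @ (X * Y ** (beta - 2))) / (B.T @ (Y ** (beta - 1)) + eps)
    return B, G

# Toy training set: three notes played one after another, four frames each
rng = np.random.default_rng(2)
B_true = rng.random((16, 3)) + 0.1
R = np.zeros((3, 12), dtype=int)
for n in range(3):
    R[n, 4 * n:4 * n + 4] = 1
X = B_true @ R

B, G = train_bases(X, R)
```

Once `B` has been learnt this way, it would be held fixed and only gains would be estimated for new signals of the same instrument.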
3 Proposed extension to multichannel
The previously described NMF-based model is suitable for single-channel data. However, most music recordings are available in a multichannel format, stereo being the most common. To deal with multichannel audio data, an extension of the standard NMF model is required. In the literature, multichannel extensions of NMF have already been considered, either by stacking up the spectrograms of each channel into a single matrix [50] or by equivalently considering nonnegative tensor factorization (NTF) under a parallel factor analysis (PARAFAC) structure, where the channel spectrograms form the slices of a 3-valence tensor [42, 51, 52].
In this paper, we propose an extended multichannel NMF model that is specifically designed for close-microphone music recordings. While recordings of this kind are not usually commercially distributed, many of the raw recordings used in the studio during the mixing process share many similarities with them. The particularities of this scenario define a set of assumptions that are considered in the proposed NMF algorithm:

Problem dimensionality: The proposed method is designed for an overdetermined scheme, that is, I≥J where I is the number of channels and J the number of sources.

Single predominant source: For each channel i, there is a single predominant source j ^{′} that corresponds to a music instrument which is known in advance.

Mixing model: In this work, instantaneous mixing of point sources is considered. Note that the actual mixing process in a close-microphone recording is convolutive. However, since the predominant source is captured with a high direct-to-reverberant ratio, an instantaneous model can be reasonably assumed to simplify the processing. Still, the proposed method can readily be extended to the case of a convolutive mixture, simply by assuming a mixing matrix that varies over frequency [14].

Input representation: More details about the TF representation are given in the experimental section.
where ${\widehat{\underline{x}}}_{i}(f,t)$ is the estimation of the complex-valued STFT for each channel i; s̲_{ j }(f,t) is the estimation of the complex-valued STFT generated by the source j = 1,…,J; J is the number of sources; and the scalar coefficients m _{ i,j } define an I×J panning matrix M that measures the multichannel contribution of source j to the data. Note that the mixing coefficients depend on the kind of spectrogram used (i.e., magnitude or power spectrogram). On the one hand, if magnitude spectrograms are considered, the mixing coefficients are m _{ i,j }, and the resulting matrix is usually called the mixing matrix. On the other hand, for power spectrograms, the mixing coefficients are m _{ i,j }^{2}.
3.1 Panning matrix estimation
The estimation of the panning matrix is performed in three steps. First, an NMF-based automatic transcription method is applied in order to estimate the active notes of the predominant source at each channel. Then, the estimated transcription of the predominant source for each channel is used to discriminate those TF zones in which the sources appear in isolation. Finally, the panning matrix is computed using this information.
3.1.1 Predominant source transcription method
In this step, we describe two NMF-based methods, one for monophonic and the other for polyphonic signals, to estimate the transcription of the predominant source for each channel individually. These methods were previously developed by the authors in [43] in the context of monaural mixtures. The methods are supervised, requiring fixed basis functions trained using the instrument modeling procedure in Section 2.4. The aim here is to estimate the transcription of the predominant source j ^{′} at each channel i. This information must be known in advance in order to define the proper basis functions ${b}_{{j}^{\prime},n}\left(f\right)$.
 1. Monophonic sources. In the case of monophonic sources, we propose to use the real-time single-pitch constrained method proposed in [43]. In this transcription method, the optimum combination of notes n _{opt}(i,t) is chosen to minimize the beta-divergence function at channel i and frame t under the assumption that only one gain ${g}_{{j}^{\prime},n}\left(t\right)$ is nonzero at each frame for channel i, j ^{′} being the predominant source. Assuming a single predominant source j ^{′} for each channel i, the signal model with the single-combination constraint can be defined as follows:
$x_{i,t}(f) \approx s_{j^{\prime},n_{\text{opt}},t}(f) = g_{j^{\prime},n_{\text{opt}},t}\, b_{j^{\prime},n_{\text{opt}}}(f),$ (12)
where n _{opt}(i,t) is defined for each channel i as
$n_{\text{opt}}(i,t) = \arg\min_{n=1,\dots,N} D_{\beta}\left(x_{i,t}(f) \,\middle|\, g_{j^{\prime},n,t}\, b_{j^{\prime},n}(f)\right),$ (13)
that is, the spectrum for each channel i at each frame t is approximated by the projection of the predominant source j ^{′} for the optimum note n _{opt} at frame t. As an advantage, the model of Equation 12 allows the gains to be computed directly from the data and the trained amplitudes without the need for an iterative algorithm.
The beta-divergence at note n and frame t for the predominant source j ^{′} at channel i is obtained as
$D_{\beta}\left(x_{i,t}(f) \,\middle|\, g_{j^{\prime},n,t}\, b_{j^{\prime},n}(f)\right) = \sum_{f} \frac{1}{\beta(\beta-1)} \left( x_{i,t}(f)^{\beta} + (\beta-1)\left(g_{j^{\prime},n,t}\, b_{j^{\prime},n}(f)\right)^{\beta} - \beta\, x_{i,t}(f) \left(g_{j^{\prime},n,t}\, b_{j^{\prime},n}(f)\right)^{\beta-1} \right).$ (14)
The value of the gain for channel i, source j ^{′}, note n and frame t is obtained by minimizing Equation 14. This minimization has a direct solution, since the value of the gain for note n and frame t is a scalar:
$g_{j^{\prime},n,t} = \frac{\sum_{f} x_{i,t}(f)\, b_{j^{\prime},n}(f)^{(\beta-1)}}{\sum_{f} b_{j^{\prime},n}(f)^{\beta}}.$ (15)
Finally, the optimum note at each frame for each channel is selected as the note that minimizes the beta-divergence at each frame and channel:
$n_{\text{opt}}(i,t) = \arg\min_{n=1,\dots,N} D_{\beta}\left(x_{i,t}(f) \,\middle|\, \frac{\sum_{f} x_{i,t}(f)\, b_{j^{\prime},n}(f)^{(\beta-1)}}{\sum_{f} b_{j^{\prime},n}(f)^{\beta}}\, b_{j^{\prime},n}(f)\right),$ (16)
where the proposed solution is valid for β ∈ [0,2].
To summarize, the monophonic predominant source transcription (MPST) method is detailed in Algorithm 2.
Algorithm 2 Monophonic predominant source transcription (MPST) method
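The closed-form gain of Equation 15 and the note selection of Equation 16 can be sketched for a single frame as below. The function names `beta_divergence` and `mpst_frame` and the toy data are hypothetical, and the divergence is implemented only for β outside {0, 1}; those two values would need the usual limit forms, which are not shown here.

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Beta-divergence between nonnegative vectors x and y (beta not in {0, 1})."""
    return np.sum((x ** beta + (beta - 1) * y ** beta - beta * x * y ** (beta - 1))
                  / (beta * (beta - 1)))

def mpst_frame(x, B, beta=1.5):
    """Select the best single note for one frame.

    x: (F,) spectrum of the frame; B: (F, N) fixed bases of the predominant
    source. Returns (n_opt, gain of n_opt).
    """
    # Closed-form gain for every candidate note (Equation 15)
    gains = (x @ B ** (beta - 1)) / np.sum(B ** beta, axis=0)
    # Divergence obtained with each candidate note alone (Equation 16)
    divs = [beta_divergence(x, g * B[:, n], beta) for n, g in enumerate(gains)]
    n_opt = int(np.argmin(divs))
    return n_opt, gains[n_opt]

rng = np.random.default_rng(0)
B = rng.random((32, 6)) + 0.05     # six hypothetical note bases
x = 2.5 * B[:, 3]                  # a frame containing only note 3
n_opt, gain = mpst_frame(x, B)
```

No iteration is involved: each frame costs one gain evaluation and one divergence evaluation per candidate note.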
 2. Polyphonic sources. In the case of polyphonic sources, the method presented in [43] is used to obtain the transcription of the predominant source j ^{′} at each channel i. However, we highlight the fact that any polyphonic estimation procedure may be used at this stage, e.g., the one presented in [30]. In this paper, the applied method is analogous to the classical Euclidean NMF using the gradient descent algorithm, but no iterative process is required, allowing its use for real-time problems. In the case of β=2 (Euclidean distance), Equation 4 for each channel i can be expressed in matrix notation as
$D_{\beta,i} = \left\| \mathbf{X}_{i} - \mathbf{B}_{j^{\prime}} \cdot \mathbf{G}_{j^{\prime}} \right\|_{2},$ (17)
where X _{ i } is the signal input matrix at channel i in time and frequency, j ^{′} is the predominant source at channel i, $\mathbf{B}_{j^{\prime}}$ is the basis matrix of source j ^{′}, and $\mathbf{G}_{j^{\prime}}$ is the gain matrix of source j ^{′}. Then we can examine this factorization as a reduced-rank basis decomposition so that $\mathbf{X}_{i} = \mathbf{B}_{j^{\prime}} \cdot \mathbf{G}_{j^{\prime}}$ and, subsequently, the gains can be estimated in just one step:
$\mathbf{G}_{j^{\prime}} = \mathbf{A}_{j^{\prime}} \cdot \mathbf{X}_{i},$ (18)
where $\mathbf{A}_{j^{\prime}} = \mathbf{B}_{j^{\prime}}^{\dagger} \in \mathbb{R}_{\ge 0}^{N\times F}$ and the $\dagger$ operator denotes the Moore-Penrose pseudoinverse.
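The one-step estimate of Equation 18 can be sketched with NumPy's pseudoinverse. The data are synthetic and noiseless, and all names are hypothetical; the candidate selection stage that the method adds on top of this step is not shown.

```python
import numpy as np

rng = np.random.default_rng(3)
F, N, T = 40, 5, 8                 # hypothetical sizes
B = rng.random((F, N))             # fixed bases of the predominant source
G_true = rng.random((N, T))        # gains used to synthesize the data
X = B @ G_true                     # noiseless channel spectrogram

A = np.linalg.pinv(B)              # A = B^dagger, shape (N, F)
G = A @ X                          # gains in one step (Equation 18)
```

In this toy case the gains are recovered exactly because `X` lies in the span of `B`; with real spectrograms the result is only the least-squares fit under the Euclidean distance.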
As commented in [43], the transcription results with β=2 are poor in comparison with other values of β. Therefore, to improve the performance of the method, a candidate selection stage using the previously explained MPST method will be applied. As a result, the method for polyphonic sources is restricted to have only a few gains that are not zero for each instrument at each frame. Detailed information about the method can be found in [43]. Algorithm 3 describes the computational procedure in the proposed transcription method for polyphonic sources.
Algorithm 3 Polyphonic predominant source transcription (PPST) method
where ${\psi}_{{j}^{\prime}}(n,t)$ is the resulting transcription composed of binary values and T is a fixed detection threshold in decibels (dB) that can be either set manually or learnt from training data. Note that in the case of MPST, where only one note is active at a time, a threshold is used to discard notes that are detected as active during silence intervals.
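A thresholding step of this kind might look like the sketch below. The exact expression is not reproduced in this text, so measuring the gains in dB relative to the maximum gain, the 20·log10 convention, the function name `binarize_gains` and the threshold value are all assumptions made for illustration.

```python
import numpy as np

def binarize_gains(G, threshold_db=-30.0, eps=1e-12):
    """Return psi(n, t) = 1 where the gain level exceeds the threshold.

    Levels are expressed in dB relative to the largest gain (assumption).
    """
    level_db = 20.0 * np.log10((G + eps) / (G.max() + eps))
    return (level_db > threshold_db).astype(int)

# Hypothetical gains for two notes over three frames
G = np.array([[1.0, 0.5, 0.001],
              [0.0, 0.2, 0.0005]])
psi = binarize_gains(G)
```

With the values above, the two very small gains in the last frame fall below the threshold and are treated as silence.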
3.1.2 Timefrequency zones discrimination
Once the transcription procedure of the predominant source at each channel i has been performed, time-frequency zones in the mixture corresponding only to predominant sources must be discriminated in order to estimate the panning matrix. These zones are assumed to be free from source overlap; thus, partials from the predominant source are unlikely to be corrupted by notes from the rest of the instruments. In fact, the information resulting from overlapped partials at each frame is considered corrupted and is not used to estimate the panning matrix.
3.1.3 Energybased panning matrix estimation
Once the overlapping mask ϑ _{ j }(f,t) is computed for all the channels, the panning matrix is estimated (see Figure 1). The proposed method computes each panning coefficient as the ratio between the norm of each instrument at each channel in its time-frequency region and the norm of the same instrument at its predominant channel in its time-frequency region (the panning coefficient for the instrument at its predominant channel is assumed to be 1). Note that the panning estimation method does not include the phase of the coefficient; in other words, the proposed method estimates the matrix coefficients for magnitude spectrograms. The energy-based panning matrix estimation is detailed in Algorithm 4, where j ^{′} is the predominant source for channel i, ∘ is the Hadamard product and ∥·∥_{2} is the 2-norm (Euclidean norm).
Algorithm 4 Panning matrix estimation method
Therefore, Algorithm 4 computes the panning matrix as the quotient between the contribution of each source to the channel spectrogram against the contribution of the predominant source.
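The quotient described above can be sketched as follows. This is an interpretation of the prose and not Algorithm 4 itself: each coefficient is taken as the norm of the masked channel spectrogram divided by the norm of the masked spectrogram at the source's predominant channel. The function name `estimate_panning`, the argument layout and the toy data are hypothetical.

```python
import numpy as np

def estimate_panning(X, masks, predominant_channel):
    """Estimate an I x J panning matrix from magnitude spectrograms.

    X: (I, F, T) channel spectrograms; masks: (J, F, T) binary masks marking
    the TF zones where source j is isolated; predominant_channel[j] is the
    channel in which source j is predominant (coefficient fixed to 1 there).
    """
    I, J = X.shape[0], masks.shape[0]
    M = np.zeros((I, J))
    for j in range(J):
        ref = np.linalg.norm(masks[j] * X[predominant_channel[j]])
        for i in range(I):
            M[i, j] = np.linalg.norm(masks[j] * X[i]) / ref
    return M

# Toy case: two sources occupying disjoint frequency regions
rng = np.random.default_rng(5)
S = np.zeros((2, 20, 10))
S[0, :10] = rng.random((10, 10)) + 0.1
S[1, 10:] = rng.random((10, 10)) + 0.1
M_true = np.array([[1.0, 0.3],
                   [0.2, 1.0]])
X = np.einsum('ij,jft->ift', M_true, S)   # instantaneous mixing of magnitudes
masks = (S > 0).astype(float)

M = estimate_panning(X, masks, predominant_channel=[0, 1])
```

The sketch assumes that every source has at least one isolated TF zone; otherwise the reference norm would be zero and the ratio undefined.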
3.2 Multichannel SSS
Then, classical augmented NMF factorization with MU rules is applied to estimate the gains corresponding to each source j in the multichannel mixture. The process is detailed in Algorithm 5.
Algorithm 5 Multichannel signal gain estimation method
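A minimal sketch of the gain estimation with fixed bases and a fixed panning matrix is given below. It is not Algorithm 5 itself: it assumes the model x̂_i(f,t) = Σ_j m_{i,j} Σ_n g_{j,n}(t) b_{j,n}(f), stacks the channels into one matrix and applies the standard multiplicative update to the gains only. The function names and the toy data are hypothetical.

```python
import numpy as np

def reconstruct(B, M, G):
    """x_hat_i(f, t) = sum_j m_ij * sum_n g_jn(t) * b_jn(f)."""
    return np.einsum('ij,jfn,jnt->ift', M, B, G)

def multichannel_gains(X, B, M, n_iter=50, beta=1.0, eps=1e-12, seed=0):
    """Estimate gains G (J, N, T) from X (I, F, T), bases B (J, F, N), panning M (I, J)."""
    I, F, T = X.shape
    J, _, N = B.shape
    # Row block i of W holds [m_i1 * B_1, ..., m_iJ * B_J]; shape (I*F, J*N)
    W = np.concatenate(
        [np.concatenate([M[i, j] * B[j] for j in range(J)], axis=1) for i in range(I)],
        axis=0)
    V = X.reshape(I * F, T)
    G = np.random.default_rng(seed).random((J * N, T)) + eps
    for _ in range(n_iter):
        Y = W @ G + eps
        G *= (W.T @ (V * Y ** (beta - 2))) / (W.T @ (Y ** (beta - 1)) + eps)
    return G.reshape(J, N, T)

# Toy mixture: two channels, two sources, four notes per source
rng = np.random.default_rng(4)
I, J, F, N, T = 2, 2, 30, 4, 20
B = rng.random((J, F, N)) + 0.05
M = np.array([[1.0, 0.3],
              [0.2, 1.0]])
G_true = rng.random((J, N, T))
X = reconstruct(B, M, G_true)

G = multichannel_gains(X, B, M, n_iter=50)
```

Stacking the channels is one of the multichannel strategies mentioned at the start of Section 3; it is used here only because it keeps the update identical to the single-channel case.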
3.2.1 Ideal Wiener masks
The source separation consists of estimating the complex amplitude at each time-frequency cell for each source. Some systems use binary separation, which means that the entire energy of a bin is assigned to a single source. However, it has been demonstrated that better results can be obtained with a non-binary decision, i.e., distributing the energy proportionately over all the sources. The use of separation Wiener masks is common in the source separation literature [37–39]. The Wiener filter method for instantaneous mixing models is described below.
where |s̲_{ j }(f,t)|^{2} is called the power spectral density of source j at TF bin (f,t).
Then, the estimated source ${\widehat{s}}_{{j}^{\prime}}\left(t\right)$ is computed by the inverse overlapadd STFT of the estimated spectrogram ${\widehat{\underline{s}}}_{{j}^{\prime}}(f,t)$.
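The proportional distribution of energy can be sketched as a soft mask applied to the mixture STFT. The mask equations are not reproduced in this text, so the ratio form used below (each source's power spectral density divided by the sum over sources) is the usual Wiener-style mask and is assumed here; the function name `wiener_separate` and the toy data are hypothetical.

```python
import numpy as np

def wiener_separate(x_stft, source_psd, eps=1e-12):
    """Apply Wiener-style soft masks to a mixture.

    x_stft: (F, T) complex mixture STFT; source_psd: (J, F, T) nonnegative
    estimates of |s_j(f, t)|^2. Returns (J, F, T) complex source estimates
    that reuse the phase of the mixture.
    """
    masks = source_psd / (source_psd.sum(axis=0, keepdims=True) + eps)
    return masks * x_stft[None, :, :]

rng = np.random.default_rng(6)
x = rng.standard_normal((12, 7)) + 1j * rng.standard_normal((12, 7))
psd = rng.random((3, 12, 7)) + 0.1     # three hypothetical sources
estimates = wiener_separate(x, psd)
```

For a given channel i, the power spectral densities passed in would presumably be weighted by the squared panning coefficients of that channel, in line with the use of m _{ i,j }^{2} for power spectrograms.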
3.2.2 Separated signal decomposition
In the present work, the panning matrix is estimated and used together with the learnt instrument models to perform the separation in an NMF-based framework.
Then, the estimated Wiener mask for each source is applied to the multichannel signal spectrogram at channel i following Equation 24, using the phase information from the original mixture signal of the close microphone near the target instrument. Therefore, the estimated predominant source spectrogram $\hat{s}_{j^{\prime}}(f,t)$ is obtained and the estimated predominant source ${\widehat{s}}_{{j}^{\prime}}\left(t\right)$ is computed by applying the inverse overlap-add STFT of ${\widehat{\underline{s}}}_{{j}^{\prime}}(f,t)$ using the phase information from x̲_{ i }(f,t), where j ^{′} is the predominant source at channel i.
4 Experiments
4.1 Training and test data
At the training stage (see Section 2.4), the basis functions are estimated using the RWC musical instrument sound database [48, 49] and the full pitch range for each instrument. Four instruments are studied in the experiments (violin, clarinet, tenor saxophone, and bassoon). Individual sounds are available with a semitone frequency resolution over the entire range of notes for each instrument. Files from the RWC database have different playing styles. Files with a normal playing style and mezzo dynamic level are selected, as in the literature. Training with different playing styles leads to different models. However, as demonstrated in [22], the selected configuration (normal playing style and mezzo dynamic level) is representative of the different models.
The database proposed in [27] is used for the testing stage. This database consists of ten J.S. Bach four-part chorales [27] with the corresponding aligned MIDI data. The audio files are approximately 30 s long and are sampled at 44.1 kHz from the real performances. Each music excerpt consists of an instrumental quartet (violin, clarinet, tenor saxophone, and bassoon), and each instrument is given in an isolated track.
4.2 Experimental setup
4.2.1 Timefrequency representation
Many NMF-based signal processing applications adopt a logarithmic frequency discretization. For example, uniformly spaced subbands on the equivalent rectangular bandwidth (ERB) scale are assumed in [16, 17]. In this work, two time-frequency resolutions are used. First, to estimate the instrument models and the panning matrix, a single semitone resolution is used as in [22]. In fact, the training database and the ground-truth score information are composed of notes that are separated by one semitone in frequency. This representation has been shown to give accurate results for music transcription, which is the key point when estimating the panning matrix. Second, for the separation task, a higher resolution of 1/4 of a semitone is used as in [29], which has been shown to achieve better separation results. These time-frequency representations are obtained by integrating the STFT bins corresponding to the same semitone, or 1/4 semitone, interval. Note that in the separation stage, the learnt basis functions b _{ j,n }(f) are adapted to the 1/4 semitone resolution by replicating the basis value at each semitone over the four samples of the 1/4 semitone resolution that belong to this semitone. The frame size and the hop size for the STFT are set to 128 and 32 ms, respectively.
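The two operations described above, integrating STFT bins per semitone and replicating each semitone value over four quarter-semitone samples, can be sketched as follows. Rounding each bin frequency to the nearest MIDI note is an assumption about how bins are assigned to bands, and the function name `semitone_bands`, the FFT length and the MIDI range in the example are hypothetical.

```python
import numpy as np

def semitone_bands(spec, freqs_hz, midi_min, midi_max):
    """Sum the STFT bins whose frequency rounds to the same MIDI semitone.

    spec: (F, T) magnitude or power spectrogram; freqs_hz: (F,) bin
    frequencies. Returns an array of shape (midi_max - midi_min + 1, T).
    """
    midi = np.full(freqs_hz.shape, -1, dtype=int)
    positive = freqs_hz > 0
    midi[positive] = np.round(69 + 12 * np.log2(freqs_hz[positive] / 440.0)).astype(int)
    out = np.zeros((midi_max - midi_min + 1, spec.shape[1]))
    for k, note in enumerate(range(midi_min, midi_max + 1)):
        out[k] = spec[midi == note].sum(axis=0)
    return out

# Toy spectrogram with energy in a single bin near 441 Hz (MIDI note 69)
freqs = np.fft.rfftfreq(4096, d=1.0 / 44100.0)
spec = np.zeros((freqs.size, 2))
spec[41, :] = 1.0
bands = semitone_bands(spec, freqs, midi_min=57, midi_max=81)

# Adapting semitone-resolution bases to 1/4 semitone resolution by replication
B_semitone = np.random.default_rng(7).random((25, 3))   # hypothetical bases
B_quarter = np.repeat(B_semitone, 4, axis=0)
```

Replication keeps the learnt semitone value unchanged in each of its four quarter-semitone samples; whether any rescaling is applied afterwards is not stated in the text and is not assumed here.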
4.2.2 Initialization of model parameters
 1. Basis functions (N=88), ranging from MIDI note 20 to 108
 2. Partials per basis function for the harmonic constraint models (M=20)
 3. Iterations for the NMF-based algorithms (50)
4.2.3 Audio separation: method and metrics
For an objective evaluation of the performance of the separation method, the Perceptual Evaluation methods for Audio Source Separation (PEASS) toolkit [55] has been used. The use of objective measures based on energy ratios between the signal components, i.e., the source to distortion ratio (SDR), the source to interference ratio (SIR), the source to artifacts ratio (SAR) and the source image to spatial distortion ratio (ISR), has been the standard approach in the specialized scientific community to test the quality of extracted signals.
Moreover, the overall perceptual score (OPS), the target-related perceptual score (TPS), the interference-related perceptual score (IPS) and the artifacts-related perceptual score (APS) objective measures have been used with the aim of predicting a set of subjective scores. The approach to compute the objective measures [55] makes use of auditory-motivated metrics provided by the PEMO-Q auditory model to assess the perceptual salience of the target distortion (qTarget), interference (qInterf) and artifacts (qArtif), computing also a global metric (qGlobal). Then, a nonlinear mapping using neural networks trained with a large set of different audio signals is performed in order to get the set of objective measures. Further information about these metrics can be found in [55].
4.3 Evaluation
The proposed separation approach, shown in Figure 1, is compared with some state-of-the-art methods and some unrealistic situations in order to evaluate its separation capabilities. The different approaches compared here are the following:

Default: It refers to the actual separation performed by the simulation setup presented in Figure 5. Since the sensors have cardioid directivity characteristics and they are placed following a close-miking setup, the instrument close to each microphone is predominant in the corresponding mixture channel.

Ideal separation: This method serves as an upper bound for the best separation that can be achieved with the time-frequency representation used. The optimal value of the Wiener mask at each frequency and time component is computed assuming that the signals to be separated are known in advance.

Oracle IMM: This approach evaluates the limitations of using an instantaneous mixing model. The separation scheme is similar to the one presented in Figure 5 but the panning matrix is estimated by knowing in advance the signals to be separated. The separation approach is identical to the proposed one but the panning matrix is optimal. We have included this method in order to evaluate the influence of the proposed panning matrix estimation stage on the final performance.

Kokkinis method: We have included in the evaluation the results of the method proposed by Kokkinis et al. in [3] as the state-of-the-art method for addressing the microphone leakage problem.
As can be seen, the best results are obtained with the ideal separation method. Its average SDR is about 21 dB, which indicates the best separation that can be achieved with the time-frequency representation used (1/4 semitone resolution in frequency). The default method is limited by microphone leakage and obtains an average SDR of 12 dB. The oracle IMM approach indicates the best separation results that can be obtained using our approach with an optimal panning matrix. In fact, the proposed method performs similarly to the oracle solution for the studied signals; therefore, the transcription procedure in the mixture matrix stage provides an accurate estimate of the ground-truth data.
Finally, Kokkinis’ method is on average about 1 dB below the proposed method in terms of SDR. These results suggest that, while Wiener filtering is a simple and powerful approach to solving the microphone leakage problem, there is room for further improvement by following an informed approach such as the one presented in this paper.
Table 1 Multichannel source separation results measured using the PEASS toolkit

| Algorithm | Inst | SDR (dB) | SAR (dB) | SIR (dB) | ISR (dB) | qTarget | qInterf | qArtif | qGlobal | OPS | TPS | IPS | APS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Default | Ba | 12.00 | 23.36 | 12.83 | 24.73 | 0.98 | 0.88 | 0.95 | 0.79 | 36.63 | 78.07 | 56.85 | 29.50 |
| | Cl | 10.75 | 22.42 | 11.56 | 24.42 | 0.98 | 0.83 | 0.94 | 0.74 | 32.60 | 78.13 | 49.17 | 25.20 |
| | Sx | 11.96 | 23.27 | 12.76 | 25.36 | 0.99 | 0.94 | 0.98 | 0.91 | 41.46 | 66.81 | 60.16 | 63.88 |
| | Vi | 12.72 | 25.87 | 13.23 | 27.92 | 1.00 | 0.92 | 0.99 | 0.91 | 13.48 | 54.76 | 46.82 | 86.13 |
| Ideal separation | Ba | 17.03 | 36.49 | 20.13 | 20.49 | 0.98 | 0.96 | 1.00 | 0.96 | 26.78 | 59.50 | 73.25 | 81.87 |
| | Cl | 21.89 | 42.07 | 24.86 | 25.59 | 0.99 | 0.96 | 1.00 | 0.95 | 22.20 | 54.19 | 70.18 | 83.68 |
| | Sx | 20.11 | 38.82 | 23.18 | 23.75 | 1.00 | 0.98 | 1.00 | 0.98 | 74.20 | 74.40 | 90.10 | 86.74 |
| | Vi | 21.77 | 39.90 | 24.40 | 25.63 | 0.99 | 0.99 | 1.00 | 0.99 | 86.17 | 90.06 | 91.53 | 85.92 |
| Oracle IMM | Ba | 14.98 | 35.26 | 15.59 | 25.96 | 0.97 | 0.90 | 1.00 | 0.86 | 9.51 | 39.83 | 28.58 | 69.08 |
| | Cl | 16.58 | 34.45 | 16.87 | 29.88 | 0.98 | 0.84 | 1.00 | 0.81 | 8.53 | 38.63 | 16.12 | 77.81 |
| | Sx | 17.61 | 34.57 | 18.60 | 26.08 | 0.99 | 0.95 | 1.00 | 0.93 | 15.83 | 51.82 | 58.97 | 86.26 |
| | Vi | 19.28 | 36.23 | 19.81 | 30.79 | 1.00 | 0.95 | 1.00 | 0.95 | 25.76 | 53.67 | 67.97 | 86.36 |
| Direct β 1.3 instant mask | Ba | 14.96 | 33.95 | 16.15 | 22.56 | 0.96 | 0.90 | 1.00 | 0.86 | 9.45 | 40.33 | 28.86 | 62.84 |
| | Cl | 16.61 | 34.58 | 16.96 | 29.09 | 0.98 | 0.85 | 1.00 | 0.82 | 8.59 | 37.16 | 17.46 | 75.61 |
| | Sx | 16.44 | 33.54 | 19.73 | 20.71 | 0.99 | 0.96 | 1.00 | 0.93 | 15.42 | 56.24 | 57.96 | 82.72 |
| | Vi | 19.25 | 35.19 | 20.71 | 26.44 | 0.99 | 0.96 | 1.00 | 0.95 | 30.30 | 56.48 | 70.83 | 85.91 |
| Kokkinis [3] | Ba | 14.02 | 32.53 | 18.65 | 15.77 | 0.99 | 0.88 | 1.00 | 0.87 | 10.41 | 46.35 | 30.90 | 82.24 |
| | Cl | 16.77 | 34.77 | 19.92 | 19.33 | 0.99 | 0.87 | 0.99 | 0.85 | 9.14 | 49.61 | 24.14 | 83.39 |
| | Sx | 15.46 | 31.93 | 18.90 | 17.86 | 1.00 | 0.96 | 1.00 | 0.96 | 27.28 | 53.77 | 73.45 | 86.74 |
| | Vi | 16.58 | 33.40 | 20.77 | 18.52 | 0.99 | 0.96 | 1.00 | 0.95 | 30.75 | 55.54 | 72.55 | 85.92 |
More detailed information about the separation metrics is presented in Table 1. In this table, the metrics are given per instrument, averaged over the ten excerpts in the test database. Regarding the classical separation metrics (i.e., SDR, SAR, SIR and ISR, in decibels), the default method is limited by the interference between the different instruments (SIR), while the other methods achieve better SIR results. Moreover, the bassoon separation generally performs worse than that of the other instruments, as has been observed in other studies by the authors [43, 53]. For the bassoon, the amplitude variations over the duration of a note cause a mismatch with the window transform in certain frequency locations (i.e., blurred regions in the spectrogram).
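As a quick cross-check of the SDR comparison, the following sketch recomputes per-method averages from the SDR column of Table 1. It assumes an unweighted mean over the four instrument rows, so the values need not match the rounded averages quoted in the text exactly. The dictionary keys are labels chosen here for illustration.

```python
from statistics import mean

# SDR values in dB copied from Table 1, in row order (Ba, Cl, Sx, Vi).
sdr_db = {
    "Default": [12.00, 10.75, 11.96, 12.72],
    "Ideal separation": [17.03, 21.89, 20.11, 21.77],
    "Oracle IMM": [14.98, 16.58, 17.61, 19.28],
    "Direct beta 1.3": [14.96, 16.61, 16.44, 19.25],
    "Kokkinis [3]": [14.02, 16.77, 15.46, 16.58],
}

# Unweighted mean over the four instruments (an assumption of this sketch).
mean_sdr = {method: mean(values) for method, values in sdr_db.items()}

# Difference between the proposed method and the Kokkinis method.
gap_vs_kokkinis = mean_sdr["Direct beta 1.3"] - mean_sdr["Kokkinis [3]"]

for method, value in mean_sdr.items():
    print(f"{method:18s} {value:5.2f} dB")
print(f"Proposed minus Kokkinis: {gap_vs_kokkinis:.2f} dB")
```

Under this averaging, the proposed method stays roughly 1.1 dB above the Kokkinis method and within about 0.3 dB of the oracle IMM.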
Similar conclusions can be drawn by analyzing the results obtained with the perceptual similarity measures (PSM) provided by the PEMO-Q auditory model. However, with these metrics the differences between the instruments vary as a function of the analyzed signals. Kokkinis’ approach obtains better results than the proposed model (Direct β 1.3) and the oracle IMM in terms of qGlobal (PSM values lie between 0 and 1, with 1 being the best result). This discrepancy between perceptual metrics (qGlobal) and classical separation metrics (SDR) can be explained after listening to the separated excerpts. In our opinion, Kokkinis’ approach offers better separation capabilities at high frequencies, while at low frequencies the amount of distortion seems higher. In contrast, the proposed approach seems to have higher distortion at high frequencies. The use of a softer mask in Kokkinis’ approach is a probable reason for these differences. It must be stressed that the optimum β value has been selected in this work in terms of SDR. Further work could explore the performance of this method when the β value is selected to be optimal in terms of the perceptual metrics.
Finally, regarding the measures proposed in PEASS (i.e., OPS, TPS, IPS, and APS), there is a strong correlation between these measures and PSM. Generally, the different approaches obtain the same ranking from best to worst in terms of OPS as in terms of qGlobal (PEASS metrics lie within a 0 to 100 interval, with 100 being the best result). The only exception is the default approach for the cases of bassoon and clarinet. In these cases, the OPS shows higher values, in contrast with the lower values of qGlobal. However, the separated excerpts for the default approach when playing bassoon and clarinet clearly have lower perceptual quality than those of the other approaches. The nonlinear mapping provided by the PEASS neural network does not offer satisfactory results in these cases.
5 Conclusions
In this paper, an informed NMF-based SSS method has been proposed to tackle the microphone leakage problem in multichannel close-microphone recordings. The proposed method is specifically designed for a scenario in which the number of source signals is equal to or less than the number of microphone signals and a single predominant source is considered for each mixture signal. As demonstrated in the evaluation stage, despite assuming instantaneous mixing and using fixed instrument models, the proposed method provides similar performance to other state-of-the-art approaches, showing the potential of NMF-based approaches in real-world applications. Moreover, the use of trained instrument models allows for a fast computation of the panning matrix and simplifies the separation stage by reducing the factorization to the estimation of the time-varying gains of the instruments. However, these models are fixed and, therefore, differences with respect to the spectra of the analyzed instruments in the mixture may lead to worse separation results, as seen in the case of the bassoon. Further work will be aimed at adapting the parameters of the model to the observed music scene. To address this issue, a proper initialization of the gains and the use of additional optimization constraints will be considered. This way, the parameters will only be adapted when there is high confidence that a note is active and free of interference.
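To make the remark about reducing the factorization to gain estimation concrete, the sketch below keeps a basis matrix fixed and updates only the gains with the standard multiplicative rule for the β-divergence described in the NMF literature [46]. It is an illustration under assumptions, not the implementation evaluated in this paper: the random matrix `W` merely stands in for learnt instrument models, the function names are hypothetical, and β = 1.3 is taken from the label of the proposed method in Table 1.

```python
import numpy as np


def beta_divergence(V, V_hat, beta):
    """Summed elementwise beta-divergence d_beta(V | V_hat), beta not in {0, 1}."""
    return np.sum(
        (V**beta + (beta - 1) * V_hat**beta - beta * V * V_hat**(beta - 1))
        / (beta * (beta - 1))
    )


def estimate_gains(V, W, beta=1.3, n_iter=100, eps=1e-12, seed=0):
    """Estimate nonnegative gains H such that V is approximated by W @ H.

    W (n_freq, n_bases) is kept fixed; only H (n_bases, n_frames) is updated.
    Returns the gains and the divergence after every iteration.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    history = []
    for _ in range(n_iter):
        V_hat = W @ H + eps
        # Multiplicative update: ratio of the negative and positive
        # parts of the gradient of the beta-divergence with respect to H.
        numerator = W.T @ (V * V_hat**(beta - 2))
        denominator = W.T @ V_hat**(beta - 1) + eps
        H *= numerator / denominator
        history.append(beta_divergence(V, W @ H + eps, beta))
    return H, history


# Toy example: a spectrogram built exactly from the fixed bases.
rng = np.random.default_rng(1)
W = rng.random((20, 3))        # hypothetical stand-in for learnt instrument models
H_true = rng.random((3, 15))
V = W @ H_true

H, history = estimate_gains(V, W)
```

Because the bases are not updated, each iteration costs only two matrix products per frame block, which is consistent with the simplification of the separation stage mentioned above.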
Declarations
Acknowledgements
This work was supported by the Andalusian Business, Science and Innovation Council under project P2010 TIC6762 (FEDER) and by the Spanish Ministry of Economy and Competitiveness under the projects TEC201238142C0403 and TEC201237945C0202. The authors would like to thank the anonymous reviewers, whose comments greatly helped to improve the original manuscript, as well as Z. Duan for kindly sharing his annotated real-world music database.
References
1. Huber DM, Runstein RE: Modern Recording Techniques. Focal Press, UK; 2009.
2. Clifford A, Reiss JD: Microphone interference reduction in live sound. In Proceedings of the 14th International Conference on Digital Audio Effects (DAFx-11). Paris; 19–23 September 2011.
3. Kokkinis EK, Mourjopoulos J: Unmixing acoustic sources in real reverberant environments for close-microphone applications. J. Audio Eng. Soc 2010, 58(11):907–922.
4. Kokkinis EK, Reiss JD, Mourjopoulos J: A Wiener filter approach to microphone leakage reduction in close-microphone applications. IEEE Trans. Audio, Speech, Language Process 2012, 20(3):767–779.
5. Comon P, Jutten C: Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, Oxford; 2010.
6. Lee DD, Seung HS: Learning the parts of objects by nonnegative matrix factorization. Nature 1999, 401:788–791.
7. Yilmaz O, Rickard S: Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process 2004, 52(7):1830–1847. doi:10.1109/TSP.2004.828896
8. Cobos M, Lopez JJ: Maximum a posteriori binary mask estimation for underdetermined source separation using smoothed posteriors. IEEE Trans. Audio, Speech, Language Process 2012, 20(7):20592012.
9. Pedersen MS, Larsen J, Kjems U, Parra LC: Convolutive blind source separation methods. In Springer Handbook of Speech Processing. Edited by: Benesty J, Sondhi MM, Huang Y. Springer, Berlin; 2008:1065–1084.
10. Smaragdis P: Blind separation of convolved mixtures in the frequency domain. Neurocomputing 1998, 22(1):21–34.
11. Virtanen T: Sound source separation in monaural music signals. Thesis, Tampere University of Technology; 2006.
12. Smaragdis P, Brown J: Nonnegative matrix factorization for polyphonic music transcription. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz; 19–22 October 2003.
13. Cobos M, Vera-Candeas P, Carabias-Orti JJ, Ruiz-Reyes N, Lopez JJ: Blind estimation of reverberation time from monophonic instrument recordings based on nonnegative matrix factorization. In Proceedings of the AES 42nd International Conference: Semantic Audio. Ilmenau; 22–24 July 2011.
14. Ozerov A, Févotte C: Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Trans. Audio, Speech, Language Process 2010, 18(3):550–563.
15. Hennequin R, Badeau R, David B: Time-dependent parametric and harmonic templates in nonnegative matrix factorization. In Proceedings of the International Conference on Digital Audio Effects (DAFx). Graz; 6–10 September 2010:246–253.
16. Vincent E, Bertin N, Badeau R: Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Trans. Audio, Speech, Language Process 2010, 18(3):528–537.
17. Bertin N, Badeau R, Vincent E: Enforcing harmonicity and smoothness in Bayesian nonnegative matrix factorization applied to polyphonic music transcription. IEEE Trans. Audio, Speech, Language Process 2010, 18(3):538–549.
18. Itoyama K, Goto M, Komatani K, Ogata T, Okuno HG: Instrument equalizer for query-by-example retrieval: improving sound source separation based on integrated harmonic and inharmonic models. In Proceedings of the International Conference for Music Information Retrieval (ISMIR). Philadelphia; 14–18 September 2008:133–138.
19. Wu J, Vincent E, Raczynski SA, Nishimoto T, Ono N, Sagayama S: Multipitch estimation by joint modeling of harmonic and transient sounds. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Prague; 22–27 May 2011:25–28.
20. Heittola T, Klapuri A, Virtanen T: Musical instrument recognition in polyphonic audio using source-filter model for sound separation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Kobe; 26–30 October 2009:327–332.
21. Durrieu JL, David B, Richard G: A musically motivated mid-level representation for pitch estimation and musical audio source separation. IEEE J. Selected Topics Signal Process 2011, 5(6):1180–1191.
22. Carabias-Orti JJ, Virtanen T, Vera-Candeas P, Ruiz-Reyes N, Cañadas-Quesada FJ: Musical instrument sound multi-excitation model for nonnegative spectrogram factorization. IEEE J. Selected Topics Signal Process 2011, 5(6):1144–1158.
23. Ozerov A, Liutkus A, Badeau R, Richard G: Informed source separation: source coding meets source separation. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA’11). New Paltz; 16–19 October 2011.
24. Bosch JJ, Kondo K, Marxer R, Janer J: Score-informed and timbre independent lead instrument separation in real-world scenarios. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO). Bucharest; 27–31 August 2012:2417–2421.
25. Hennequin R, David B, Badeau R: Score informed audio source separation using a parametric model of nonnegative spectrogram. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Prague; 22–27 May 2011:45–48.
26. Ganseman J, Scheunders P, Mysore G, Abel J: Evaluation of a score-informed source separation system. In 11th International Society for Music Information Retrieval Conference (ISMIR 2010). Utrecht; 9–13 August 2010.
27. Duan Z, Pardo B: Soundprism: an online system for score-informed source separation of music audio. IEEE J. Selected Topics Signal Process 2011, 5(6):1205–1215.
28. Simsekli U, Cemgil AT: Score guided musical source separation using generalized coupled tensor factorization. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO). Bucharest; 27–31 August 2012:2639–2643.
29. Rodriguez-Serrano FJ, Carabias-Orti JJ, Vera-Candeas P, Canadas-Quesada FJ, Ruiz-Reyes N: Monophonic constrained nonnegative sparse coding using instrument models for audio separation and transcription of monophonic source-based polyphonic mixtures. Multimedia Tools Appl 2013. Available at http://link.springer.com/article/10.1007%2Fs1104201313988
30. Fuentes B, Badeau R, Richard G: Blind harmonic adaptive decomposition applied to supervised source separation. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO). Bucharest; 27–31 August 2012:2654–2658.
31. Hurmalainen A, Gemmeke J, Virtanen T: Detection, separation and recognition of speech from continuous signals using spectral factorisation. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO). Bucharest; 27–31 August 2012:2649–2653.
32. Fitzgerald D: User assisted separation using tensor factorisations. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO). Bucharest; 27–31 August 2012:2412–2416.
33. Liutkus A, Pinel J, Badeau R, Girin L, Richard G: Informed source separation through spectrogram coding and data embedding. Signal Process 2012, 92(8):1937–1949. doi:10.1016/j.sigpro.2011.09.016
34. Casey M, Westner A: Separation of mixed audio sources by independent subspace analysis. In Proceedings of the International Computer Music Conference (ICMC ’00). Berlin; September 2000:154–161.
35. Virtanen T: Sound source separation using sparse coding with temporal continuity objective. In Proceedings of the International Computer Music Conference (ICMC ’03). Singapore; September 2003.
36. Virtanen T: Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio, Speech, Language Process 2007, 15(3):1066–1074.
37. Benaroya L, Bimbot F, Gribonval R: Audio source separation with a single sensor. IEEE Trans. Audio, Speech, Language Process 2006, 14(1):191–199.
38. Cemgil AT, Peeling P, Dikmen O, Godsill S: Prior structures for time-frequency energy distributions. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. New Paltz; 21–24 October 2007:151–154.
39. Liutkus A, Badeau R, Richard G: Gaussian processes for underdetermined source separation. IEEE Trans. Signal Process 2011, 59(7):3155–3167.
40. Virtanen T, Klapuri A: Analysis of polyphonic audio using source-filter model and nonnegative matrix factorization. In Advances in Models for Acoustic Processing, Neural Information Processing Systems Workshop. Whistler; 9 December 2006.
41. Raczyński SA, Ono N, Sagayama S: Multipitch analysis with harmonic nonnegative matrix approximation. In Proceedings of the International Conference on Music Information Retrieval (ISMIR). Vienna; 23–27 September 2007:381–386.
42. FitzGerald D, Cranitch M, Coyle E: Extended nonnegative tensor factorisation models for musical source separation. Comput. Intell. Neurosci 2008, 2008: 15. Article ID 872425.
43. Carabias-Orti JJ, Rodriguez-Serrano FJ, Vera-Candeas P, Canadas-Quesada FJ, Ruiz-Reyes N: Constrained nonnegative sparse coding using learnt instrument templates for real-time music transcription. Eng. Appl. Artif. Intell 2013, 26(7):1671–1680. doi:10.1016/j.engappai.2013.03.010
44. Ozerov A, Vincent E, Bimbot F: A general flexible framework for the handling of prior information in audio source separation. IEEE Trans. Audio, Speech, Language Process 2012, 20(4):1118–1133.
45. Févotte C, Bertin N, Durrieu JL: Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis. Neural Comput 2009, 21(3):793–830.
46. Févotte C, Idier J: Algorithms for nonnegative matrix factorization with the beta-divergence. Neural Comput 2011, 23(9):2421–2456.
47. Lee DD, Seung HS: Algorithms for nonnegative matrix factorization. In Advances in Neural Information Processing Systems 2001, 13:556–562.
48. Goto M, Hashiguchi H, Nishimura T, Oka R: RWC music database: popular, classical, and jazz music databases. In Proceedings of the 3rd International Society for Music Information Retrieval Conference (ISMIR). Paris; 13–17 October 2002.
49. Goto M: Development of the RWC music database. In Proceedings of the 18th International Congress on Acoustics (ICA 2004). Kyoto; 4–9 April 2004:I-553–556. (invited paper)
50. Parry RM, Essa IA: Estimating the spatial position of spectral components in audio. In Proceedings of the 6th International Conference of Independent Component Analysis and Blind Signal Separation (ICA’06). Charleston; 5–8 March 2006:666–673.
51. FitzGerald D, Cranitch M, Coyle E: Nonnegative tensor factorisation for sound source separation. In Proceedings of the Irish Signals and Systems Conference. Dublin; September 2005:8–12.
52. Fevotte C, Ozerov A: Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues. In Proceedings of the 7th International Symposium on Computer Music Modeling and Retrieval (CMMR). Malaga; 21–24 June 2010:102–115.
53. Carabias-Orti JJ, Virtanen T, Vera-Candeas P, Ruiz-Reyes N, Canadas-Quesada FJ: Musical instrument sound multi-excitation model for nonnegative spectrogram factorization. IEEE J. Selected Topics Signal Process 2011, 5(6):1144–1158.
54. Campbell DR, Palomaki KJ, Brown GJ: A MATLAB simulation of “shoebox” room acoustics for use in research and teaching. Comput. Inf. Syst. J 2005, 9(3):48–51.
55. Emiya V, Vincent E, Harlander N, Hohmann V: Subjective and objective quality assessment of audio source separation. IEEE Trans. Audio, Speech, Language Process 2011, 19(7):2046–2057.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.